Week 41 Neural networks and constructing a neural network code#
Morten Hjorth-Jensen, Department of Physics, University of Oslo, Norway
Date: Week 41
Plan for week 41, October 6-10#
Material for the lecture on Monday October 6, 2025#
Neural Networks, setting up the basic steps, from the simple perceptron model to the multi-layer perceptron model.
Building our own Feed-forward Neural Network, getting started
Readings and Videos:#
These lecture notes
For neural networks we recommend Goodfellow et al chapters 6 and 7.
Raschka et al., chapter 11, jupyter-notebook sent separately, from GitHub
Neural Networks demystified at https://www.youtube.com/watch?v=bxe2T-V8XRs&list=PLiaHhY2iBX9hdHaRr6b7XevZtgZRa1PoU&ab_channel=WelchLabs
Building Neural Networks from scratch at https://www.youtube.com/watch?v=Wo5dMEP_BbI&list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3&ab_channel=sentdex
Video on Neural Networks at https://www.youtube.com/watch?v=CqOfi41LfDw
Video on the back propagation algorithm at https://www.youtube.com/watch?v=Ilg3gGewQ5U
We also recommend Michael Nielsen’s intuitive approach to the neural networks and the universal approximation theorem, see the slides at http://neuralnetworksanddeeplearning.com/chap4.html.
Mathematics of deep learning#
Two recent books online.
Reminder on books with hands-on material and codes#
Sebastian Raschka et al., Machine Learning with PyTorch and Scikit-Learn
Lab sessions on Tuesday and Wednesday#
Aim: Getting started with coding a neural network. The exercises this week aim at setting up the feed-forward part of a neural network.
Lecture Monday October 6#
Introduction to Neural networks#
Artificial neural networks are computational systems that can learn to perform tasks by considering examples, generally without being programmed with any task-specific rules. They are supposed to mimic a biological system, wherein neurons interact by sending signals in the form of mathematical functions between layers. All layers can contain an arbitrary number of neurons, and each connection is represented by a weight variable.
Artificial neurons#
The field of artificial neural networks has a long history of development, and is closely connected with the advancement of computer science and computers in general. A model of artificial neurons was first developed by McCulloch and Pitts in 1943 to study signal processing in the brain and has later been refined by others. The general idea is to mimic neural networks in the human brain, which is composed of billions of neurons that communicate with each other by sending electrical signals. Each neuron accumulates its incoming signals, which must exceed an activation threshold to yield an output. If the threshold is not overcome, the neuron remains inactive, i.e. has zero output.
This behaviour has inspired a simple mathematical model for an artificial neuron,
\[
y = f\left(\sum_{i=1}^{n} w_i x_i\right) = f(u).
\]
Here, the output \(y\) of the neuron is the value of its activation function \(f\), which has as input a weighted sum \(u\) of the signals \(x_1, \dots ,x_n\) received from \(n\) other neurons, with weights \(w_i\).
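A minimal sketch of such a neuron may make the threshold picture concrete. The step activation, the function names and the numerical values of the signals and weights are assumptions chosen for illustration only.

```python
import numpy as np

def step(u, threshold=0.0):
    """Threshold activation: 1 if the accumulated signal exceeds the threshold, else 0."""
    return np.where(u > threshold, 1.0, 0.0)

def neuron_output(x, w, activation=step):
    """Output of one artificial neuron: activation of the weighted sum of its inputs."""
    u = np.dot(w, x)  # weighted sum of the incoming signals
    return activation(u)

# Hypothetical signals and weights, chosen only for illustration
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, 0.4, 0.1])

y_active = neuron_output(x, w)     # weighted sum is positive, the neuron fires
y_inactive = neuron_output(x, -w)  # weighted sum is negative, zero output
```

Replacing `step` with a smooth function such as the sigmoid gives the differentiable neurons used in the rest of these notes.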
Conceptually, it is helpful to divide neural networks into four categories:
general purpose neural networks for supervised learning,
neural networks designed specifically for image processing, the most prominent example of this class being Convolutional Neural Networks (CNNs),
neural networks for sequential data such as Recurrent Neural Networks (RNNs), and
neural networks for unsupervised learning such as Deep Boltzmann Machines.
In natural science, DNNs and CNNs have already found numerous applications. In statistical physics, they have been applied to detect phase transitions in 2D Ising and Potts models, lattice gauge theories, and different phases of polymers, and to solve the Navier-Stokes equation in weather forecasting. Deep learning has also found interesting applications in quantum physics. Various quantum phase transitions can be detected and studied using DNNs and CNNs, including topological phases and even non-equilibrium many-body localization. Representing quantum states as DNNs and quantum state tomography are among the impressive achievements that reveal the potential of DNNs to facilitate the study of quantum systems.
In quantum information theory, it has been shown that one can perform gate decompositions with the help of neural networks.
The applications are not limited to the natural sciences. There is a plethora of applications in essentially all disciplines, from the humanities to life science and medicine.
Neural network types#
An artificial neural network (ANN) is a computational model that consists of layers of connected neurons, or nodes or units. We will refer to these interchangeably as units or nodes, and sometimes as neurons.
It is supposed to mimic a biological nervous system by letting each neuron interact with other neurons by sending signals in the form of mathematical functions between layers. A wide variety of different ANNs have been developed, but most of them consist of an input layer, an output layer and possibly layers in between, called hidden layers. All layers can contain an arbitrary number of nodes, and each connection between two nodes is associated with a weight variable.
Neural networks (also called neural nets) are neural-inspired nonlinear models for supervised learning. As we will see, neural nets can be viewed as natural, more powerful extensions of supervised learning methods such as linear and logistic regression and soft-max methods we discussed earlier.
Feed-forward neural networks#
The feed-forward neural network (FFNN) was the first and simplest type of ANN to be devised. In this network, the information moves in only one direction: forward through the layers.
Nodes are represented by circles, while the arrows display the connections between the nodes, including the direction of information flow. Additionally, each arrow corresponds to a weight variable (figure to come). We observe that each node in a layer is connected to all nodes in the subsequent layer, making this a so-called fully-connected FFNN.
Convolutional Neural Network#
A different variant of FFNNs are convolutional neural networks (CNNs), which have a connectivity pattern inspired by the animal visual cortex. Individual neurons in the visual cortex only respond to stimuli from small sub-regions of the visual field, called a receptive field. This makes the neurons well-suited to exploit the strong spatially local correlation present in natural images. The response of each neuron can be approximated mathematically as a convolution operation. (figure to come)
Convolutional neural networks emulate the behaviour of neurons in the visual cortex by enforcing a local connectivity pattern between nodes of adjacent layers: Each node in a convolutional layer is connected only to a subset of the nodes in the previous layer, in contrast to the fully-connected FFNN. Often, CNNs consist of several convolutional layers that learn local features of the input, with a fully-connected layer at the end, which gathers all the local data and produces the outputs. They have wide applications in image and video recognition.
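The local connectivity pattern can be sketched in one dimension. In the example below each output node sees only a small window of the input (its receptive field) and all nodes share the same weights. The signal and the two-element filter are hypothetical. As in most deep learning libraries, the filter is applied without flipping, which is strictly speaking a cross-correlation.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal; each output uses only len(kernel) neighbouring inputs."""
    k = len(kernel)
    n_out = len(signal) - k + 1
    return np.array([np.dot(signal[i:i + k], kernel) for i in range(n_out)])

signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0])
kernel = np.array([-1.0, 1.0])  # hypothetical filter that responds to edges

response = conv1d_valid(signal, kernel)
```

The response is non-zero only where the signal changes, which illustrates how a small shared filter picks up a local feature wherever it occurs in the input.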
Recurrent neural networks#
So far we have only mentioned ANNs where information flows in one direction: forward. Recurrent neural networks, on the other hand, have connections between nodes that form directed cycles. This creates a form of internal memory which is able to capture information on what has been calculated before; the output is dependent on the previous computations. Recurrent NNs make use of sequential information by performing the same task for every element in a sequence, where each element depends on previous elements. An example of such information is sentences, making recurrent NNs especially well-suited for handwriting and speech recognition.
Other types of networks#
There are many other kinds of ANNs that have been developed. One type that is specifically designed for interpolation in multidimensional space is the radial basis function (RBF) network. RBF networks are typically made up of three layers: an input layer, a hidden layer with non-linear radially symmetric activation functions and a linear output layer ("linear" here means that each node in the output layer has a linear activation function). The layers are normally fully-connected and there are no cycles, thus RBF networks can be viewed as a type of fully-connected FFNN. They are however usually treated as a separate type of NN due to the unusual activation functions.
Multilayer perceptrons#
One often uses so-called fully-connected feed-forward neural networks with three or more layers (an input layer, one or more hidden layers and an output layer) consisting of neurons that have non-linear activation functions.
Such networks are often called multilayer perceptrons (MLPs).
Why multilayer perceptrons?#
According to the Universal approximation theorem, a feed-forward neural network with just a single hidden layer containing a finite number of neurons can approximate a continuous multidimensional function to arbitrary accuracy, assuming the activation function for the hidden layer is a non-constant, bounded and monotonically-increasing continuous function.
Note that the requirements on the activation function apply only to the hidden layer. The output nodes are always assumed to be linear, so as not to restrict the range of output values.
Illustration of a single perceptron model and a multi-perceptron model#
Figure 1: In a) we show a single perceptron model while in b) we display a network with two hidden layers, an input layer and an output layer.
Mathematics of deep learning and neural networks#
Neural networks, in their so-called feed-forward form, where each iteration contains a feed-forward stage and a back-propagation stage, consist of a series of affine matrix-matrix and matrix-vector multiplications. The unknown parameters (the so-called biases and weights which determine the architecture of a neural network) are updated iteratively using the so-called back-propagation algorithm. This algorithm corresponds to the so-called reverse mode of automatic differentiation.
Basics of an NN#
A neural network consists of a series of hidden layers, in addition to the input and output layers. Each layer \(l\) has a set of parameters \(\boldsymbol{\Theta}^{(l)}=(\boldsymbol{W}^{(l)},\boldsymbol{b}^{(l)})\) which are related to the parameters in other layers through a series of affine transformations, for a standard NN these are matrix-matrix and matrix-vector multiplications. For all layers we will simply use a collective variable \(\boldsymbol{\Theta}\).
Training a neural network consists of two basic steps:
a feed-forward stage which takes a given input and produces a final output which is compared with the target values through our cost/loss function.
a back-propagation stage where the unknown parameters \(\boldsymbol{\Theta}\) are updated using the gradients of the cost/loss function. The expressions for the gradients are obtained via the chain rule, starting from the derivative of the cost/loss function.
These two steps make up one iteration. This iterative process is continued until we reach a stopping criterion.
Overarching view of a neural network#
The architecture of a neural network defines our model. This model aims at describing some function \(f(\boldsymbol{x})\) which represents some final result (outputs or target values \(\boldsymbol{y}\)) given a specific input \(\boldsymbol{x}\). Note that here \(\boldsymbol{y}\) and \(\boldsymbol{x}\) are not limited to be vectors.
The architecture consists of
An input and an output layer, where the input layer is defined by the inputs \(\boldsymbol{x}\). The output layer produces the model output \(\boldsymbol{\tilde{y}}\) which is compared with the target value \(\boldsymbol{y}\)
A given number of hidden layers and neurons/nodes/units for each layer (this may vary)
A given activation function \(\sigma(\boldsymbol{z})\) with arguments \(\boldsymbol{z}\) to be defined below. The activation functions may differ from layer to layer.
The last layer, normally called the output layer, usually has an activation function tailored to the specific problem
Finally we define a so-called cost or loss function which is used to gauge the quality of our model.
The optimization problem#
The cost function is a function of the unknown parameters \(\boldsymbol{\Theta}\) where the latter is a container for all possible parameters needed to define a neural network
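If we are dealing with a regression task, a typical cost/loss function is the mean squared error
\[
C(\boldsymbol{\Theta})=\frac{1}{n}\left\{(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\Theta})^T(\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\Theta})\right\},
\]
where \(\boldsymbol{X}\) is the design matrix and \(n\) is the number of data points.
This function represents one of many possible ways to define the so-called cost function. Note that here we have assumed a linear dependence in terms of the parameters \(\boldsymbol{\Theta}\). This is in general not the case.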
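As a small illustration, the mean squared error for a model that is linear in the parameters can be evaluated as follows. The design matrix, the targets and the function name `mse_cost` are assumptions made for this sketch.

```python
import numpy as np

def mse_cost(theta, X, y):
    """Mean squared error for a model that is linear in the parameters theta."""
    residual = y - X @ theta
    return residual @ residual / len(y)

# Hypothetical data generated from y = 1 + 2x, with an intercept column in X
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

cost_exact = mse_cost(np.array([1.0, 2.0]), X, y)  # parameters that reproduce the data
cost_off = mse_cost(np.array([0.0, 2.0]), X, y)    # intercept shifted by one
```

For a neural network the model output replaces the product of the design matrix and the parameters, and the cost is no longer a quadratic function of the parameters.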
Parameters of neural networks#
For neural networks the parameters \(\boldsymbol{\Theta}\) are given by the so-called weights and biases (to be defined below).
The weights are given by matrix elements \(w_{ij}^{(l)}\) where the superscript indicates the layer number. The biases are typically given by vector elements representing each single node of a given layer, that is \(b_j^{(l)}\).
Other ingredients of a neural network#
Having defined the architecture of a neural network, the optimization of the cost function with respect to the parameters \(\boldsymbol{\Theta}\), involves the calculations of gradients and their optimization. The gradients represent the derivatives of a multidimensional object and are often approximated by various gradient methods, including
various quasi-Newton methods,
plain gradient descent (GD) with a constant learning rate \(\eta\),
GD with momentum and other approximations to the learning rates such as
Adaptive gradient (ADAgrad)
Root mean-square propagation (RMSprop)
Adaptive moment estimation (ADAM) and many others
Stochastic gradient descent and various families of learning rate approximations
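The first two entries in this list can be sketched on a simple quadratic cost function whose minimum is known exactly. The matrix, the learning rate and the momentum parameter below are hypothetical values picked for illustration, and the momentum update is written in its common heavy-ball form.

```python
import numpy as np

# Quadratic cost C(theta) = 0.5 theta^T A theta - b^T theta with gradient A theta - b
A = np.array([[3.0, 0.5], [0.5, 1.0]])  # symmetric and positive definite
b = np.array([1.0, -2.0])

def gradient(theta):
    return A @ theta - b

def gradient_descent(eta=0.1, momentum=0.0, n_iter=500):
    """Plain GD for momentum=0, GD with momentum otherwise."""
    theta = np.zeros(2)
    velocity = np.zeros(2)
    for _ in range(n_iter):
        velocity = momentum * velocity - eta * gradient(theta)
        theta = theta + velocity
    return theta

theta_gd = gradient_descent(eta=0.1)
theta_momentum = gradient_descent(eta=0.1, momentum=0.9)
theta_exact = np.linalg.solve(A, b)  # minimum of the quadratic cost
```

Both variants reach the same minimum here. The adaptive methods in the list differ mainly in how they rescale the learning rate for each parameter.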
Other parameters#
In addition to the above, there are often additional hyperparameters which are included in the setup of a neural network. These will be discussed below.
Universal approximation theorem#
The universal approximation theorem plays a central role in deep learning. Cybenko (1989) showed the following:
Let \(\sigma\) be any continuous sigmoidal function such that
\[
\sigma(z) \rightarrow \begin{cases} 1 & z \rightarrow \infty, \\ 0 & z \rightarrow -\infty. \end{cases}
\]
Given a continuous and deterministic function \(F(\boldsymbol{x})\) on the unit cube in \(d\) dimensions, \(\boldsymbol{x}\in [0,1]^d\), and a parameter \(\epsilon >0\), there is a one-layer (hidden) neural network \(f(\boldsymbol{x};\boldsymbol{\Theta})\) with \(\boldsymbol{\Theta}=(\boldsymbol{W},\boldsymbol{b})\), \(\boldsymbol{W}\in \mathbb{R}^{m\times n}\) and \(\boldsymbol{b}\in \mathbb{R}^{n}\), for which
\[
\vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert < \epsilon \quad \forall\, \boldsymbol{x}\in [0,1]^d.
\]
Some parallels from real analysis#
For those of you familiar with for example the Stone-Weierstrass theorem for polynomial approximations or the convergence criterion for Fourier series, there are similarities in the derivation of the proof for neural networks.
The approximation theorem in words#
Any continuous function \(y=F(\boldsymbol{x})\) supported on the unit cube in \(d\)-dimensions can be approximated by a one-layer sigmoidal network to arbitrary accuracy.
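Hornik (1991) extended the theorem by allowing any non-constant, bounded activation function, using that the expectation value
\[
\mathbb{E}[\vert F(\boldsymbol{x})\vert^2] =\int_{\boldsymbol{x}\in D} \vert F(\boldsymbol{x})\vert^2p(\boldsymbol{x})d\boldsymbol{x} < \infty,
\]
where \(p(\boldsymbol{x})\) is the probability density of the inputs on the domain \(D\).
Then we have
\[
\mathbb{E}[\vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert^2] =\int_{\boldsymbol{x}\in D} \vert F(\boldsymbol{x})-f(\boldsymbol{x};\boldsymbol{\Theta})\vert^2p(\boldsymbol{x})d\boldsymbol{x} < \epsilon.
\]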
More on the general approximation theorem#
None of the proofs give any insight into the relation between the number of hidden layers and nodes and the approximation error \(\epsilon\), nor the magnitudes of \(\boldsymbol{W}\) and \(\boldsymbol{b}\).
Neural networks (NNs) have what we may call a kind of universality no matter what function we want to compute.
It does not mean that an NN can be used to exactly compute any function. Rather, we get an approximation that is as good as we want.
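The statement that the approximation can be made as good as we want can be illustrated with a network built by hand, in the spirit of Michael Nielsen's visual argument. The sketch below is not a trained network. It places \(m\) steep sigmoids along the interval \([0,1]\) so that a single hidden layer produces a staircase following a target function. The target \(\sin(2\pi x)\), the steepness and the names are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable form of 1/(1+exp(-z))
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def F(x):
    """Hypothetical continuous target function on [0, 1]."""
    return np.sin(2.0 * np.pi * x)

def one_hidden_layer_approx(x, m, steepness=50.0):
    """Hand-built network with m sigmoidal hidden nodes and a linear output node."""
    grid = np.linspace(0.0, 1.0, m + 1)
    centers = 0.5 * (grid[:-1] + grid[1:])  # hidden node k switches on at centers[k]
    jumps = np.diff(F(grid))                # output weights: height of each step
    hidden = sigmoid(steepness * m * (x[:, None] - centers[None, :]))
    return F(grid[0]) + hidden @ jumps      # F(grid[0]) plays the role of the output bias

x_test = np.linspace(0.0, 1.0, 1001)
err_20 = np.max(np.abs(F(x_test) - one_hidden_layer_approx(x_test, 20)))
err_200 = np.max(np.abs(F(x_test) - one_hidden_layer_approx(x_test, 200)))
```

Increasing the number of hidden nodes from 20 to 200 reduces the maximum error, in line with the theorem. The construction says nothing about how easily such parameters are found by training.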
Class of functions we can approximate#
The class of functions that can be approximated are the continuous ones. If the function \(F(\boldsymbol{x})\) is discontinuous, it won’t in general be possible to approximate it. However, an NN may still give an approximation even if we fail in some points.
Setting up the equations for a neural network#
The questions we want to ask are how do changes in the biases and the weights in our network change the cost function and how can we use the final output to modify the weights and biases?
To derive these equations let us start with a plain regression problem and define our cost function as
\[
C(\boldsymbol{\Theta}) = \frac{1}{2}\sum_{i=1}^n\left(y_i - \tilde{y}_i\right)^2,
\]
where the \(y_i\)s are our \(n\) targets (the values we want to reproduce), while the outputs of the network after having propagated all inputs \(\boldsymbol{x}\) are given by \(\tilde{y}_i\).
Definitions#
With our definition of the targets \(\boldsymbol{y}\), the outputs of the network \(\boldsymbol{\tilde{y}}\) and the inputs \(\boldsymbol{x}\) we define now the activation \(z_j^l\) of node/neuron/unit \(j\) of the \(l\)-th layer as a function of the bias, the weights which add up from the previous layer \(l-1\) and the forward passes/outputs \(\hat{a}^{l-1}\) from the previous layer as
\[
z_j^l = \sum_{i=1}^{M_{l-1}}w_{ij}^l a_i^{l-1}+b_j^l,
\]
where \(b_j^l\) are the biases from layer \(l\). Here \(M_{l-1}\) represents the total number of nodes/neurons/units of layer \(l-1\). The figure in the whiteboard notes illustrates this equation. We can rewrite this in a more compact form as the matrix-vector products we discussed earlier,
\[
\hat{z}^l = \left(\hat{W}^l\right)^T\hat{a}^{l-1}+\hat{b}^l.
\]
Inputs to the activation function#
With the activation values \(\boldsymbol{z}^l\) we can in turn define the output of layer \(l\) as \(\boldsymbol{a}^l = f(\boldsymbol{z}^l)\) where \(f\) is our activation function. In the examples here we will use the sigmoid function discussed in our logistic regression lectures. We will also use the same activation function \(f\) for all layers and their nodes. It means we have
\[
a_j^l = f(z_j^l) = \frac{1}{1+\exp{(-z_j^l)}}.
\]
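The step from one layer to the next can be sketched in a few lines. The sketch assumes the convention that the weight matrix element `W[i, j]` connects node \(i\) of layer \(l-1\) to node \(j\) of layer \(l\). The layer sizes and the random values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    """Return the activation z and the output a = f(z) of one layer."""
    z = W.T @ a_prev + b
    return z, sigmoid(z)

rng = np.random.default_rng(2025)
M_prev, M = 3, 2                     # number of nodes in layers l-1 and l
W = rng.normal(size=(M_prev, M))     # W[i, j]: weight from node i to node j
b = np.full(M, 0.01)
a_prev = rng.normal(size=M_prev)

z, a = layer_forward(a_prev, W, b)

# The same activations written as explicit sums over the nodes of the previous layer
z_loop = np.array([sum(W[i, j] * a_prev[i] for i in range(M_prev)) + b[j]
                   for j in range(M)])
```

The matrix form and the explicit sums agree, and the matrix form is the one worth using in practice since it is both shorter and faster.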
Derivatives and the chain rule#
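From the definition of the activation \(z_j^l\) we have
\[
\frac{\partial z_j^l}{\partial w_{ij}^l} = a_i^{l-1},
\]
and
\[
\frac{\partial z_j^l}{\partial a_i^{l-1}} = w_{ij}^l.
\]
With our definition of the activation function we have that (note that this function depends only on \(z_j^l\))
\[
\frac{\partial a_j^l}{\partial z_j^{l}} = a_j^l(1-a_j^l)=f(z_j^l)(1-f(z_j^l)).
\]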
Derivative of the cost function#
With these definitions we can now compute the derivative of the cost function in terms of the weights.
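Let us specialize to the output layer \(l=L\). Our cost function is
\[
C(\boldsymbol{\Theta}^L) = \frac{1}{2}\sum_{i=1}^n\left(y_i - \tilde{y}_i\right)^2=\frac{1}{2}\sum_{i=1}^n\left(a_i^L - y_i\right)^2.
\]
The derivative of this function with respect to the weights is
\[
\frac{\partial C(\boldsymbol{\Theta}^L)}{\partial w_{ij}^L} = \left(a_j^L - y_j\right)\frac{\partial a_j^L}{\partial w_{ij}^{L}}.
\]
The last partial derivative can easily be computed and reads (by applying the chain rule)
\[
\frac{\partial a_j^L}{\partial w_{ij}^{L}} = \frac{\partial a_j^L}{\partial z_{j}^{L}}\frac{\partial z_j^L}{\partial w_{ij}^{L}}=a_j^L(1-a_j^L)a_i^{L-1}.
\]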
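A quick numerical check of the output-layer derivative is often useful before trusting a hand-derived gradient. The sketch below assumes a sigmoid output layer, the squared-error cost and the convention that `W[i, j]` connects node \(i\) of the previous layer to output node \(j\). All sizes and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    a = sigmoid(W.T @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(1)
a_prev = rng.normal(size=3)   # outputs of layer L-1
W = rng.normal(size=(3, 2))   # weights of the output layer
b = rng.normal(size=2)
y = np.array([0.0, 1.0])      # targets

# Analytical gradient: dC/dw_ij = (a_j - y_j) a_j (1 - a_j) a_i^{L-1}
a = sigmoid(W.T @ a_prev + b)
grad_analytic = np.outer(a_prev, (a - y) * a * (1.0 - a))

# Central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        dW = np.zeros_like(W)
        dW[i, j] = eps
        grad_numeric[i, j] = (cost(W + dW, b, a_prev, y)
                              - cost(W - dW, b, a_prev, y)) / (2.0 * eps)
```

Agreement between the two gradients to several digits is a strong indication that the chain-rule expression has been implemented correctly.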
Simpler examples first, and automatic differentiation#
In order to understand the back propagation algorithm and its derivation (an implementation of the chain rule), let us first digress with some simple examples. These examples are also meant to motivate the link with back propagation and automatic differentiation. We will discuss these topics next week (week 42).
Reminder on the chain rule and gradients#
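If we have a multivariate function \(f(x,y)\) where \(x=x(t)\) and \(y=y(t)\) are functions of a variable \(t\), we have that the gradient of \(f\) with respect to \(t\) (without the explicit unit vector components) is
\[
\frac{df}{dt} = \begin{bmatrix}\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} \begin{bmatrix}\frac{\partial x}{\partial t} \\ \frac{\partial y}{\partial t} \end{bmatrix}=\frac{\partial f}{\partial x} \frac{\partial x}{\partial t} +\frac{\partial f}{\partial y} \frac{\partial y}{\partial t}.
\]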
Multivariable functions#
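If we have a multivariate function \(f(x,y)\) where \(x=x(t,s)\) and \(y=y(t,s)\) are functions of the variables \(t\) and \(s\), we have that the partial derivatives
\[
\frac{\partial f}{\partial s}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial s}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial s},
\]
and
\[
\frac{\partial f}{\partial t}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t},
\]
give the gradient of \(f\) with respect to \(t\) and \(s\) (without the explicit unit vector components)
\[
\frac{df}{d(s,t)} = \begin{bmatrix}\frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix} \begin{bmatrix}\frac{\partial x}{\partial s} &\frac{\partial x}{\partial t} \\ \frac{\partial y}{\partial s} & \frac{\partial y}{\partial t} \end{bmatrix}.
\]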
Automatic differentiation through examples#
A great introduction to automatic differentiation is given by Baydin et al., see https://arxiv.org/abs/1502.05767. See also the video at https://www.youtube.com/watch?v=wG_nF1awSSY.
Automatic differentiation is represented by a repeated application of the chain rule on well-known functions and allows for the calculation of derivatives to numerical precision. It is not the same as the calculation of symbolic derivatives via for example SymPy, nor does it use approximate formulae based on Taylor expansions of a function around a given value. The latter are error-prone due to truncation errors and sensitive to the value of the step size \(\Delta\).
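The sensitivity of Taylor-based formulae to the step size can be seen in a short experiment. The sketch below compares a forward-difference approximation with the exact derivative of \(\exp(x^2)\). The evaluation point and the list of step sizes are hypothetical choices.

```python
import math

def f(x):
    return math.exp(x * x)

def df_exact(x):
    return 2.0 * x * math.exp(x * x)

def df_forward(x, delta):
    """Forward difference, with a truncation error proportional to delta."""
    return (f(x + delta) - f(x)) / delta

x0 = 1.0
# Absolute error of the forward difference for several step sizes
errors = {delta: abs(df_forward(x0, delta) - df_exact(x0))
          for delta in (1e-1, 1e-4, 1e-8, 1e-15)}
```

Reducing the step size from `1e-1` to `1e-4` lowers the truncation error, while very small steps such as `1e-15` are dominated by round-off in the subtraction. Automatic differentiation avoids this trade-off altogether.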
Simple example#
Our first example is rather simple,
\[
f(x) =\exp{(x^2)},
\]
with derivative
\[
f'(x) =2x\exp{(x^2)}.
\]
We can use SymPy to extract the pertinent lines of Python code through the following simple example
from sympy import symbols, exp, diff, simplify, python
x = symbols('x')
expr = simplify(exp(x*x))
derivative = diff(expr, x)
# python() returns Python source code that rebuilds each expression
print(python(expr))
print(python(derivative))
Smarter way of evaluating the above function#
If we study this function, we note that we can reduce the number of operations by introducing an intermediate variable
\[
a = x^2,
\]
leading to
\[
f(x) = f(a(x)) = b= \exp{a}.
\]
We now assume that all operations can be counted in terms of equal floating point operations. This means that in order to calculate \(f(x)\) we need first to square \(x\) and then compute the exponential. We thus have only two floating point operations.
Reducing the number of operations#
With the introduction of a precalculated quantity \(a\) and thereby \(b=f(x)\) we have that the derivative can be written as
\[
f'(x) = 2xb,
\]
which reduces the number of operations from four in the original expression to two. This means that if we need to compute \(f(x)\) and its derivative (a common task in optimizations), we have reduced the number of operations from six to four in total.
Note that symbolic software like SymPy does not include such simplifications, and the calculations of the function and the derivatives in general require more floating point operations.
Chain rule, forward and reverse modes#
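In the above example we have introduced the variables \(a\) and \(b\), and our function is
\[
f(x) = f(a(x)) = b= \exp{a},
\]
with \(a=x^2\). We can decompose the derivative of \(f\) with respect to \(x\) as
\[
\frac{df}{dx}=\frac{df}{db}\frac{db}{da}\frac{da}{dx}.
\]
We note that since \(b=f(x)\),
\[
\frac{df}{db}=1,
\]
leading to
\[
\frac{df}{dx}=2x\exp{(x^2)}=2xb,
\]
as before.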
Forward and reverse modes#
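We have that
\[
\frac{df}{dx}=\frac{df}{db}\frac{db}{da}\frac{da}{dx},
\]
which we can rewrite either as
\[
\frac{df}{dx}=\left[\frac{df}{db}\frac{db}{da}\right]\frac{da}{dx},
\]
or
\[
\frac{df}{dx}=\frac{df}{db}\left[\frac{db}{da}\frac{da}{dx}\right].
\]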
The first expression is called reverse mode (or back propagation) since we start by evaluating the derivatives at the end point and then propagate backwards. This is the standard way of evaluating derivatives (gradients) when optimizing the parameters of a neural network. In the context of deep learning this is computationally more efficient since the output of a neural network consists of either one or some few other output variables.
The second equation defines the so-called forward mode.
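The two orderings can be sketched for \(f(x)=\exp(x^2)\). Forward mode is implemented here with a small dual-number class that carries a value together with its derivative with respect to \(x\). Reverse mode first evaluates the intermediate values and then multiplies derivatives starting from the output. The class `Dual` and the helper `dual_exp` are hypothetical names written for this sketch, not part of any library.

```python
import math

class Dual:
    """A value together with its derivative with respect to the input x."""
    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_exp(u):
    e = math.exp(u.value)
    return Dual(e, e * u.deriv)

# Forward mode: derivatives are propagated together with the values
x = Dual(1.5, 1.0)   # seed dx/dx = 1
a = x * x            # carries da/dx
b = dual_exp(a)      # carries db/dx = df/dx

# Reverse mode: forward sweep for the values, then a backward sweep from df/db = 1
x0 = 1.5
a0 = x0 * x0
b0 = math.exp(a0)
df_db = 1.0
df_da = df_db * b0        # db/da = exp(a) = b
df_dx = df_da * 2.0 * x0  # da/dx = 2x
```

For one input and one output the two modes cost the same. Reverse mode pays off when there are many parameters and few outputs, which is the typical situation when training a neural network.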
More complicated function#
We increase our ambitions and introduce a slightly more complicated function
\[
f(x) =\sqrt{x^2+\exp{(x^2)}},
\]
with derivative
\[
f'(x) =\frac{x(1+\exp{(x^2)})}{\sqrt{x^2+\exp{(x^2)}}}.
\]
The corresponding SymPy code reads
from sympy import symbols, exp, sqrt, diff, simplify, python
x = symbols('x')
expr = simplify(sqrt(x*x+exp(x*x)))
derivative = diff(expr, x)
# python() returns Python source code that rebuilds each expression
print(python(expr))
print(python(derivative))
Counting the number of floating point operations#
A simple count of operations shows that we need five operations for the function itself and ten for the derivative, that is fifteen operations in total if we wish to proceed with the above codes.
Can we reduce this to, say, half the number of operations?
Defining intermediate operations#
We can indeed reduce the number of operations to half of those listed in the brute force approach above. We define the following quantities
\[
a = x^2,
\]
and
\[
b = \exp{(x^2)} = \exp{a},
\]
and
\[
c= a+b,
\]
and
\[
d=f(x)=\sqrt{c}.
\]
New expression for the derivative#
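With these definitions (\(a=x^2\), \(b=\exp{a}\), \(c=a+b\) and \(d=f(x)=\sqrt{c}\)) we obtain the following partial derivatives
\[
\frac{\partial a}{\partial x} = 2x,
\]
and
\[
\frac{\partial b}{\partial a} = \exp{a},
\]
and
\[
\frac{\partial c}{\partial a} = 1,
\]
and
\[
\frac{\partial c}{\partial b} = 1,
\]
and
\[
\frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}},
\]
and finally
\[
\frac{\partial f}{\partial d} = 1.
\]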
Final derivatives#
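Our final derivatives are thus
\[
\frac{\partial f}{\partial c} = \frac{\partial f}{\partial d} \frac{\partial d}{\partial c} = \frac{1}{2\sqrt{c}},
\]
\[
\frac{\partial f}{\partial b} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} = \frac{1}{2\sqrt{c}},
\]
\[
\frac{\partial f}{\partial a} = \frac{\partial f}{\partial c} \frac{\partial c}{\partial a}+ \frac{\partial f}{\partial b} \frac{\partial b}{\partial a} = \frac{1+\exp{a}}{2\sqrt{c}},
\]
and finally
\[
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial a} \frac{\partial a}{\partial x} = \frac{x(1+\exp{a})}{\sqrt{c}},
\]
which is just
\[
\frac{\partial f}{\partial x} = \frac{x(1+b)}{d},
\]
and requires only three operations if we can reuse all intermediate variables.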
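The bookkeeping with intermediate variables translates directly into code. The sketch below evaluates \(f(x)=\sqrt{x^2+\exp(x^2)}\) in a forward sweep and then accumulates the derivative backwards, reusing every stored quantity. The function name and the evaluation point are assumptions for illustration.

```python
import math

def f_and_derivative(x):
    # Forward sweep: store all intermediate variables
    a = x * x
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)           # d = f(x)
    # Backward sweep: start from df/dd = 1 and move towards x
    df_dd = 1.0
    df_dc = df_dd / (2.0 * d)  # dd/dc = 1/(2 sqrt(c)) = 1/(2d)
    df_db = df_dc              # dc/db = 1
    df_da = df_dc + df_db * b  # direct path (dc/da = 1) plus the path through b
    df_dx = df_da * 2.0 * x    # da/dx = 2x
    return d, df_dx

x0 = 0.7
value, derivative = f_and_derivative(x0)
closed_form = x0 * (1.0 + math.exp(x0 * x0)) / math.sqrt(x0 * x0 + math.exp(x0 * x0))
```

The backward sweep is written one elementary step at a time for clarity. Collapsing it by hand gives the compact expression with \(x(1+b)/d\).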
In general not this simple#
In general, see the generalization below, unless we can obtain simple analytical expressions which we can simplify further, the final implementation of automatic differentiation involves repeated calculations (and thereby operations) of derivatives of elementary functions.
Automatic differentiation#
We can make this example more formal. Automatic differentiation is a formalization of the previous example (see graph).
We define \(\boldsymbol{x}=(x_1,\dots, x_l)\) as the input variables to a given function \(f(\boldsymbol{x})\) and \(x_{l+1},\dots, x_L\) as the intermediate variables.
In the above example we have only one input variable, \(l=1\), and four intermediate variables, that is
\[
\begin{bmatrix} x_1=x & x_2 = x^2=a & x_3 =\exp{a}= b & x_4=c=a+b & x_5 = \sqrt{c}=d \end{bmatrix}.
\]
Furthermore, for \(i=l+1, \dots, L\) (here \(i=2,3,4,5\) and \(f=x_L=d\)), we define the elementary functions \(g_i(x_{Pa(x_i)})\) where \(x_{Pa(x_i)}\) are the parent nodes of the variable \(x_i\).
In our case, we have for example for \(x_3=g_3(x_{Pa(x_3)})=\exp{a}\), that \(g_3=\exp{()}\) and \(x_{Pa(x_3)}=a\).
Chain rule#
We can now compute the gradients by back-propagating the derivatives using the chain rule. We have defined
\[
\frac{\partial f}{\partial x_L} = 1,
\]
which allows us to find the derivatives of the various variables \(x_i\) as
\[
\frac{\partial f}{\partial x_i} = \sum_{x_j:x_i\in Pa(x_j)}\frac{\partial f}{\partial x_j} \frac{\partial x_j}{\partial x_i}=\sum_{x_j:x_i\in Pa(x_j)}\frac{\partial f}{\partial x_j} \frac{\partial g_j}{\partial x_i}.
\]
Whenever we have a function which can be expressed as a computation graph and the various functions can be expressed in terms of elementary functions that are differentiable, then automatic differentiation works. The functions need not be elementary functions; they could also be computer programs, although not all programs can be automatically differentiated.
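A toy implementation may help to see how the sum over child nodes arises in practice. In the sketch below every variable stores its parents together with the local derivatives, and a backward sweep in reverse topological order accumulates the gradients. The class `Var` and the helpers `var_exp`, `var_sqrt` and `backward` are hypothetical and written only for this example. Libraries such as autograd or JAX are far more general.

```python
import math

class Var:
    """Node in a computation graph with its value, parents and accumulated gradient."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples (parent node, local derivative d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def var_exp(u):
    e = math.exp(u.value)
    return Var(e, ((u, e),))

def var_sqrt(u):
    s = math.sqrt(u.value)
    return Var(s, ((u, 0.5 / s),))

def backward(output):
    """Back-propagate df/dx_i through the graph, starting from df/df = 1."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their children
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # sum over all children of the parent

x = Var(0.7)
a = x * x
b = var_exp(a)
c = a + b
d = var_sqrt(c)
backward(d)
```

After the backward sweep `x.grad` holds the derivative of \(\sqrt{x^2+\exp(x^2)}\) at \(x=0.7\), and `a.grad` holds the derivative with respect to the intermediate variable \(a\).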
First network example, simple perceptron with one input#
As yet another example we define now a simple perceptron model with all quantities given by scalars. We consider only one input variable \(x\) and one target value \(y\). We define an activation function \(\sigma_1\) which takes as input
\[
z_1 = w_1x+b_1,
\]
where \(w_1\) is the weight and \(b_1\) is the bias. These are the parameters we want to optimize. The output is \(a_1=\sigma_1(z_1)\) (see graph from whiteboard notes). This output is then fed into the cost/loss function, which we here for the sake of simplicity just define as the squared error
\[
C(x;w_1,b_1)=\frac{1}{2}(a_1-y)^2.
\]
Optimizing the parameters#
In setting up the feed forward and back propagation parts of the algorithm, we need now the derivative of the various variables we want to train.
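We need
\[
\frac{\partial C}{\partial w_1} \quad \text{and} \quad \frac{\partial C}{\partial b_1}.
\]
Using the chain rule we find
\[
\frac{\partial C}{\partial w_1}=\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial w_1}=(a_1-y)\sigma_1'x,
\]
and
\[
\frac{\partial C}{\partial b_1}=\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial b_1}=(a_1-y)\sigma_1',
\]
which we later will just define as
\[
\frac{\partial C}{\partial a_1}\frac{\partial a_1}{\partial z_1}=\delta_1.
\]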
The derivatives#
The derivatives are now, using the chain rule again
Can you generalize this to more than one hidden layer?
Important observations#
From the above equations we see that the derivatives of the activation functions play a central role. If they vanish, the training may stop. This is called the vanishing gradient problem, see discussions below. If they become large, the parameters \(w_i\) and \(b_i\) may simply go to infinity. This is referred to as the exploding gradient problem.
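For the sigmoid function the size of the effect is easy to estimate. Its derivative never exceeds \(1/4\), so a product of such derivatives over many layers shrinks quickly. The sketch below computes this upper bound for a few depths, under the simplifying assumption that all weights have magnitude one. The list of depths is a hypothetical choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_derivative = sigmoid_prime(0.0)  # the derivative peaks at z = 0

# Upper bound on a product of n sigmoid derivatives (weights of magnitude one assumed)
depths = [1, 5, 10, 20]
upper_bounds = [max_derivative ** n for n in depths]
```

With twenty layers the bound is already below \(10^{-12}\), which is why the gradients reaching the first layers of a deep sigmoid network can become too small to drive the training.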
The training#
The training of the parameters is done through various gradient descent approximations with
\[
w_{i}\leftarrow w_{i}- \eta \delta_i a_{i-1},
\]
and
\[
b_i \leftarrow b_i-\eta \delta_i,
\]
where \(\eta\) is the learning rate.
One iteration consists of one feed forward step and one back-propagation step. Each back-propagation step does one update of the parameters \(\boldsymbol{\Theta}\).
For the first hidden layer \(a_{i-1}=a_0=x\) for this simple model.
Code example#
The code here implements the above model with one hidden layer and scalar variables for the same function we studied in the previous example. The code is however set up so that we can add multiple inputs \(x\) and target values \(y\). Note also that we have the possibility of defining a feature matrix \(\boldsymbol{X}\) with more than just one column for the input values. This will turn useful in our next example. We have also defined matrices and vectors for all of our operations although it is not necessary here.
import numpy as np

# We use the Sigmoid function as activation function
def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

def forwardpropagation(x):
    # weighted sum of inputs to the hidden layer
    z_1 = np.matmul(x, w_1) + b_1
    # activation in the hidden layer
    a_1 = sigmoid(z_1)
    # weighted sum of inputs to the output layer
    z_2 = np.matmul(a_1, w_2) + b_2
    # linear output layer
    a_2 = z_2
    return a_1, a_2

def backpropagation(x, y):
    a_1, a_2 = forwardpropagation(x)
    # parameter delta for the output layer, note that a_2=z_2 and its derivative wrt z_2 is just 1
    delta_2 = a_2 - y
    # print the current value of the cost function
    print(0.5*((a_2-y)**2))
    # delta for the hidden layer
    delta_1 = np.matmul(delta_2, w_2.T) * a_1 * (1 - a_1)
    # gradients for the output layer
    output_weights_gradient = np.matmul(a_1.T, delta_2)
    output_bias_gradient = np.sum(delta_2, axis=0)
    # gradient for the hidden layer
    hidden_weights_gradient = np.matmul(x.T, delta_1)
    hidden_bias_gradient = np.sum(delta_1, axis=0)
    return output_weights_gradient, output_bias_gradient, hidden_weights_gradient, hidden_bias_gradient

# ensure the same random numbers appear every time
np.random.seed(0)
# Input variable, stored as a column (one row per data point, one column per feature)
# so that more inputs can be added without changing the matrix products
x = np.array([4.0], dtype=np.float64).reshape(-1, 1)
# Target values
y = 2*x+1.0

# Defining the neural network, only scalars here
n_inputs = x.shape[0]
n_features = 1
n_hidden_neurons = 1
n_outputs = 1

# Initialize the network
# weights and bias in the hidden layer
w_1 = np.random.randn(n_features, n_hidden_neurons)
b_1 = np.zeros(n_hidden_neurons) + 0.01
# weights and bias in the output layer
w_2 = np.random.randn(n_hidden_neurons, n_outputs)
b_2 = np.zeros(n_outputs) + 0.01

eta = 0.1
for i in range(50):
    # calculate gradients
    derivW2, derivB2, derivW1, derivB1 = backpropagation(x, y)
    # update weights and biases
    w_2 -= eta * derivW2
    b_2 -= eta * derivB2
    w_1 -= eta * derivW1
    b_1 -= eta * derivB1
We see that after a few iterations (the results do depend on the learning rate, however), we get an error which is rather small.
Exercise 1: Including more data#
Try to increase the amount of input and target/output data. Try also to perform calculations for more values of the learning rates. Feel free to add regularization hyperparameters with either an \(l_1\) norm or an \(l_2\) norm. Discuss your results as functions of the amount of training data and the various learning rates.
Challenge: Try to change the activation functions and replace the hard-coded analytical expressions with automatic differentiation via either autograd or JAX.
Simple neural network and the back propagation equations#
Let us now increase our level of ambition and attempt to set up the equations for a neural network with two input nodes, one hidden layer with two hidden nodes and one output layer with one output node/neuron only (see graph).
We need to define the following parameters and variables with the input layer (layer \((0)\)) where we label the nodes \(x_0\) and \(x_1\)
The hidden layer (layer \((1)\)) has nodes which yield the outputs \(a_0^{(1)}\) and \(a_1^{(1)}\) with weight \(\boldsymbol{w}\) and bias \(\boldsymbol{b}\) parameters
The output layer#
Finally, we have the output layer given by layer label \((2)\) with output \(a^{(2)}\) and weights and biases to be determined given by the variables
Our output is \(\tilde{y}=a^{(2)}\) and we define a generic cost function \(C(a^{(2)},y;\boldsymbol{\Theta})\) where \(y\) is the target value (a scalar here). The parameters we need to optimize are given by
Compact expressions#
We can define the inputs to the activation functions for the various layers in terms of various matrix-vector multiplications and vector additions. The inputs to the first hidden layer are
with outputs
Output layer#
For the final output layer we have the inputs to the final activation function
resulting in the output
Explicit derivatives#
In total we have nine parameters which we need to train. Using the chain rule (or just the back-propagation algorithm) we can find all derivatives. Since we will use automatic differentiation in reverse mode, we start with the derivatives of the cost function with respect to the parameters of the output layer, namely
with
and finally
Final expression#
Defining
we have
Similarly, we obtain
Completing the list#
Similarly, we find
and
where we have defined
Gradient expressions#
For this specific model, with just one output node and two hidden nodes, the gradient descent equations take the following form for output layer
and
and
and
where \(\eta\) is the learning rate.
Exercise 2: Extended program#
We extend our simple code to a function which depends on two variables \(x_0\) and \(x_1\), that is
We feed our network with \(n=100\) entries \(x_0\) and \(x_1\). We have thus two features represented by these variables and an input matrix/design matrix \(\boldsymbol{X}\in \mathbf{R}^{n\times 2}\)
Write a code, based on the previous code examples, which takes these data as input and fits the above function. You can extend your code to include automatic differentiation.
With these examples, we are now ready to embark upon the writing of a more general code for neural networks.
Getting serious, the back propagation equations for a neural network#
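Now it is time to move away from one node in each layer only. Our inputs are now also represented by several input values.
We have thus
\[
\frac{\partial C}{\partial w_{ij}^L} = \left(a_j^L - y_j\right)a_j^L(1-a_j^L)a_i^{L-1}.
\]
Defining
\[
\delta_j^L = a_j^L(1-a_j^L)\left(a_j^L - y_j\right) = f'(z_j^L)\frac{\partial C}{\partial a_j^L},
\]
and using the Hadamard product of two vectors we can write this as
\[
\hat{\delta}^L = f'(\hat{z}^L)\circ\frac{\partial C}{\partial \hat{a}^L}.
\]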
Analyzing the last results#
This is an important expression. The second term on the right-hand side measures how fast the cost function is changing as a function of the \(j\)th output activation. If, for example, the cost function doesn’t depend much on a particular output node \(j\), then \(\delta_j^L\) will be small, which is what we would expect. The first term on the right measures how fast the activation function \(f\) is changing at a given activation value \(z_j^L\).
More considerations#
Notice that everything in the above equations is easily computed. In particular, we compute \(z_j^L\) while computing the behaviour of the network, and it is only a small additional overhead to compute \(f'(z^L_j)\). The exact form of the derivative with respect to the output depends on the form of the cost function. However, provided the cost function is known there should be little trouble in calculating
\[
\frac{\partial C}{\partial a_j^L}.
\]
With the definition of \(\delta_j^L\) we have a more compact definition of the derivative of the cost function in terms of the weights, namely
\[
\frac{\partial C}{\partial w_{ij}^L} = \delta_j^L a_i^{L-1}.
\]
Derivatives in terms of \(z_j^L\)#
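It is also easy to see that our previous equation can be written as
\[
\delta_j^L =\frac{\partial C}{\partial z_j^L}= \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial z_j^L},
\]
which can also be interpreted as the partial derivative of the cost function with respect to the biases \(b_j^L\), namely
\[
\delta_j^L = \frac{\partial C}{\partial b_j^L}\frac{\partial b_j^L}{\partial z_j^L}=\frac{\partial C}{\partial b_j^L}.
\]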
That is, the error \(\delta_j^L\) is exactly equal to the rate of change of the cost function as a function of the bias.
Bringing it together#
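We have now three equations that are essential for the computations of the derivatives of the cost function at the output layer. These equations are needed to start the algorithm and they are
\[
\frac{\partial C}{\partial w_{ij}^L} = \delta_j^L a_i^{L-1},
\]
and
\[
\delta_j^L = f'(z_j^L)\frac{\partial C}{\partial a_j^L},
\]
and
\[
\delta_j^L = \frac{\partial C}{\partial b_j^L}.
\]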
Final back propagating equation#
We have that (replacing \(L\) with a general layer \(l\))
\[
\delta_j^l =\frac{\partial C}{\partial z_j^l}.
\]
We want to express this in terms of the equations for layer \(l+1\).
Using the chain rule and summing over all \(k\) entries#
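We obtain
\[
\delta_j^l =\sum_k \frac{\partial C}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^{l}}=\sum_k \delta_k^{l+1}\frac{\partial z_k^{l+1}}{\partial z_j^{l}},
\]
and recalling that
\[
z_k^{l+1} = \sum_{j=1}^{M_{l}}w_{jk}^{l+1}a_j^{l}+b_k^{l+1},
\]
with \(M_l\) being the number of nodes in layer \(l\), we obtain
\[
\delta_j^l =\sum_k \delta_k^{l+1}w_{jk}^{l+1}f'(z_j^l).
\]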
This is our final equation.
We are now ready to set up the algorithm for back propagation and learning the weights and biases.
Setting up the back propagation algorithm#
The four equations provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.
First, we set up the input data \(\hat{x}\) and the activations \(\hat{z}^1\) of the first hidden layer and compute the activation function and the pertinent outputs \(\hat{a}^1\).
Secondly, we then perform the feed forward until we reach the output layer, computing all \(\hat{z}^l\), the activation function and the pertinent outputs \(\hat{a}^l\) for \(l=1,2,3,\dots,L\).
Notation: The first hidden layer has \(l=1\) as label and the final output layer has \(l=L\).
Setting up the back propagation algorithm, part 2#
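Thereafter we compute the output error \(\hat{\delta}^L\) by computing all
\[
\delta_j^L = f'(z_j^L)\frac{\partial C}{\partial a_j^L}.
\]
Then we compute the back-propagated error for each \(l=L-1,L-2,\dots,1\) as
\[
\delta_j^l = \sum_k \delta_k^{l+1}w_{jk}^{l+1}f'(z_j^l).
\]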
Setting up the Back propagation algorithm, part 3#
Finally, we update the weights and the biases using gradient descent for each \(l=L,L-1,\dots,1\) according to the rules
\[
w_{ij}^l\leftarrow w_{ij}^l- \eta \delta_j^l a_i^{l-1},
\]
\[
b_j^l \leftarrow b_j^l-\eta \delta_j^l,
\]
with \(\eta\) being the learning rate.
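The full algorithm can be sketched for a network with an arbitrary number of layers and a single data point. The sketch assumes sigmoid activations in all layers, the squared-error cost and weight matrices where `W[i, j]` connects node \(i\) of one layer to node \(j\) of the next. The layer sizes, the data and the function names are hypothetical, and the finite-difference helper is included only as a check on the gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    """Return the list of outputs a^0 = x, a^1, ..., a^L."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W.T @ activations[-1] + b))
    return activations

def cost(x, y, weights, biases):
    return 0.5 * np.sum((feed_forward(x, weights, biases)[-1] - y) ** 2)

def back_propagation(x, y, weights, biases):
    activations = feed_forward(x, weights, biases)
    a_L = activations[-1]
    delta = (a_L - y) * a_L * (1.0 - a_L)            # output error
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        a_prev = activations[l]                      # input to the layer using weights[l]
        grads_W.insert(0, np.outer(a_prev, delta))   # dC/dw_ij = a_i delta_j
        grads_b.insert(0, delta)                     # dC/db_j = delta_j
        if l > 0:
            # propagate the error one layer back: sum_k w_jk delta_k times f'(z_j)
            delta = (weights[l] @ delta) * a_prev * (1.0 - a_prev)
    return grads_W, grads_b

def numerical_gradient(x, y, weights, biases, l, i, j, eps=1e-6):
    """Central finite difference for one weight, used only to check the result."""
    w_plus = [W.copy() for W in weights]
    w_minus = [W.copy() for W in weights]
    w_plus[l][i, j] += eps
    w_minus[l][i, j] -= eps
    return (cost(x, y, w_plus, biases) - cost(x, y, w_minus, biases)) / (2.0 * eps)

rng = np.random.default_rng(41)
sizes = [2, 3, 1]  # two inputs, three hidden nodes, one output
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.full(n, 0.01) for n in sizes[1:]]
x = np.array([0.5, -1.0])
y = np.array([0.3])

grads_W, grads_b = back_propagation(x, y, weights, biases)

# One gradient descent step for all layers, stored in new lists
eta = 0.1
new_weights = [W - eta * gW for W, gW in zip(weights, grads_W)]
new_biases = [b - eta * gb for b, gb in zip(biases, grads_b)]
```

Extending the sketch to many data points amounts to replacing the vectors by matrices with one row per data point and summing the gradients over the rows, as done in the scalar code example earlier.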
Updating the gradients#
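With the back-propagated error for each \(l=L-1,L-2,\dots,1\) as
\[
\delta_j^l = \sum_k \delta_k^{l+1}w_{jk}^{l+1}f'(z_j^l),
\]
we update the weights and the biases using gradient descent for each \(l=L,L-1,\dots,1\) according to the rules
\[
w_{ij}^l\leftarrow w_{ij}^l- \eta \delta_j^l a_i^{l-1},
\]
\[
b_j^l \leftarrow b_j^l-\eta \delta_j^l.
\]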