Setting up the Back propagation algorithm

The four equations provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.

First, we set up the input data \hat{x} and the activations \hat{z}^1 of the input layer and compute the activation function and the pertinent outputs \hat{a}^1.
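A minimal sketch of this first step, assuming NumPy and a sigmoid activation function f (both illustrative choices, not fixed by the text), could look as follows; here we simply take \hat{z}^1 equal to the input itself.

```python
import numpy as np

def sigmoid(z):
    # The activation function f; the sigmoid is one common choice
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input sample with four features
x_hat = np.array([0.5, -1.2, 0.3, 0.8])

# Input layer: take z^1 equal to the input and a^1 = f(z^1)
# (in many treatments one instead sets a^1 = x directly)
z1 = x_hat
a1 = sigmoid(z1)
```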

Secondly, we perform the feed forward until we reach the output layer, computing all \hat{z}^l and the pertinent activation-function outputs \hat{a}^l for l=2,3,\dots,L.
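A sketch of the feed-forward pass under the same assumptions (the layer sizes and the random weights and biases below are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical architecture: 4 inputs, one hidden layer with 3 nodes, 2 outputs
sizes = [4, 3, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

a = rng.standard_normal(sizes[0])   # a^1, the output of the input layer
activations = [a]                   # stores a^l for l = 1, ..., L
zs = []                             # stores z^l for l = 2, ..., L

# Feed forward: z^l = W^l a^{l-1} + b^l and a^l = f(z^l) for l = 2, ..., L
for W, b in zip(weights, biases):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)
```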

Thereafter we compute the output error \hat{\delta}^L by computing all \delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial a_j^L}.
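Evaluating \hat{\delta}^L requires the derivative of the cost with respect to the outputs. Assuming the quadratic cost {\cal C} = \frac{1}{2}\sum_j (a_j^L - y_j)^2, so that \frac{\partial {\cal C}}{\partial a_j^L} = a_j^L - y_j, a sketch of this step reads:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative f'(z) of the sigmoid activation
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical quantities from the feed-forward pass
zL = np.array([0.2, -0.7])   # z^L at the output layer
aL = sigmoid(zL)             # a^L, the network output
y = np.array([0.0, 1.0])     # target for this sample

# delta_j^L = f'(z_j^L) * dC/da_j^L, with dC/da_j^L = a_j^L - y_j
# for the assumed quadratic cost
delta_L = sigmoid_prime(zL) * (aL - y)
```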

Then we back propagate the error for each l=L-1,L-2,\dots,2 as \delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l).
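The sum over k is a matrix-vector product with the transposed weight matrix, which suggests a small helper that takes the weights and the stored \hat{z}^l values. This is only a sketch, using the list conventions of the feed-forward snippet above (list index i corresponds to layer l = i + 2):

```python
def backpropagate_errors(weights, zs, delta_L, fprime):
    """Compute delta^l for l = L-1, ..., 2 from the output error delta^L.

    weights[i] and zs[i] refer to layer l = i + 2, as in the
    feed-forward sketch above (an illustrative convention only).
    """
    deltas = [None] * len(weights)
    deltas[-1] = delta_L
    # delta_j^l = sum_k delta_k^{l+1} w_{kj}^{l+1} f'(z_j^l)
    for i in range(len(weights) - 2, -1, -1):
        deltas[i] = (weights[i + 1].T @ deltas[i + 1]) * fprime(zs[i])
    return deltas
```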

Finally, we update the weights and the biases using gradient descent: for each l=L,L-1,\dots,2 we update the weights and biases according to the rules w_{jk}^l \leftarrow w_{jk}^l - \eta \delta_j^l a_k^{l-1} and b_j^l \leftarrow b_j^l - \eta \frac{\partial {\cal C}}{\partial b_j^l} = b_j^l - \eta \delta_j^l.
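The update step then amounts to a loop over the layers that applies these two rules; a sketch, again with the illustrative list conventions used above (activations[i] holds a^{l-1} for the layer updated by weights[i]):

```python
import numpy as np

def gradient_descent_step(weights, biases, deltas, activations, eta):
    """In-place update of all weights and biases for one sample:
    w_{jk}^l <- w_{jk}^l - eta * delta_j^l a_k^{l-1} and
    b_j^l <- b_j^l - eta * delta_j^l."""
    for i in range(len(weights)):
        # The outer product delta_j^l a_k^{l-1} is the weight gradient
        weights[i] -= eta * np.outer(deltas[i], activations[i])
        biases[i] -= eta * deltas[i]
    return weights, biases
```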

The parameter \eta is the learning rate discussed in connection with the gradient descent methods. Here it is convenient to use stochastic gradient descent (see the examples below) with mini-batches, with an outer loop that steps through multiple epochs of training.
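A sketch of such a training loop, assuming a hypothetical function backprop(x, y, weights, biases) that performs the forward and backward passes for a single sample and returns its weight and bias gradients:

```python
import numpy as np

def sgd(X, Y, weights, biases, backprop, eta=0.1, n_epochs=10, batch_size=32):
    """Mini-batch stochastic gradient descent over several epochs.

    backprop(x, y, weights, biases) is assumed to return the gradient
    lists (nabla_w, nabla_b) for one sample; this is a sketch of the
    outer training loop only.
    """
    rng = np.random.default_rng(0)
    n_samples = len(X)
    for epoch in range(n_epochs):
        # Shuffle the data and split it into mini-batches
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = perm[start:start + batch_size]
            grad_w = [np.zeros_like(w) for w in weights]
            grad_b = [np.zeros_like(b) for b in biases]
            # Accumulate the gradients over the mini-batch
            for idx in batch:
                nabla_w, nabla_b = backprop(X[idx], Y[idx], weights, biases)
                grad_w = [gw + nw for gw, nw in zip(grad_w, nabla_w)]
                grad_b = [gb + nb for gb, nb in zip(grad_b, nabla_b)]
            # Average the gradients and take one gradient-descent step
            for i in range(len(weights)):
                weights[i] -= eta * grad_w[i] / len(batch)
                biases[i] -= eta * grad_b[i] / len(batch)
    return weights, biases
```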