Setting up the Back propagation algorithm

The four equations provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.

First, we set up the input data \hat{x} and the activations \hat{z}^1 of the input layer and compute the activation function and the pertinent outputs \hat{a}^1.
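A minimal sketch of this first step, assuming NumPy and a sigmoid activation function f (both illustrative choices, not fixed by the text), could look as follows; here we simply take \hat{z}^1 equal to the input itself.

```python
import numpy as np

def sigmoid(z):
    # The activation function f; the sigmoid is one common choice
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input sample with four features
x_hat = np.array([0.5, -1.2, 0.3, 0.8])

# Input layer: take z^1 equal to the input and a^1 = f(z^1)
# (in many treatments one instead sets a^1 = x directly)
z1 = x_hat
a1 = sigmoid(z1)
```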

Secondly, we perform the feed forward until we reach the output layer, computing all \hat{z}^l and the pertinent activation-function outputs \hat{a}^l for l=2,3,\dots,L.
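A sketch of the feed-forward pass under the same assumptions (the layer sizes and the random weights and biases below are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical architecture: 4 inputs, one hidden layer with 3 nodes, 2 outputs
sizes = [4, 3, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

a = rng.standard_normal(sizes[0])   # a^1, the output of the input layer
activations = [a]                   # stores a^l for l = 1, ..., L
zs = []                             # stores z^l for l = 2, ..., L

# Feed forward: z^l = W^l a^{l-1} + b^l and a^l = f(z^l) for l = 2, ..., L
for W, b in zip(weights, biases):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)
```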

Thereafter we compute the output error \hat{\delta}^L by computing all \delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial a_j^L}.
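Evaluating \hat{\delta}^L requires the derivative of the cost with respect to the outputs. Assuming the quadratic cost {\cal C} = \frac{1}{2}\sum_j (a_j^L - y_j)^2, so that \frac{\partial {\cal C}}{\partial a_j^L} = a_j^L - y_j, a sketch of this step reads:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative f'(z) of the sigmoid activation
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical quantities from the feed-forward pass
zL = np.array([0.2, -0.7])   # z^L at the output layer
aL = sigmoid(zL)             # a^L, the network output
y = np.array([0.0, 1.0])     # target for this sample

# delta_j^L = f'(z_j^L) * dC/da_j^L, with dC/da_j^L = a_j^L - y_j
# for the assumed quadratic cost
delta_L = sigmoid_prime(zL) * (aL - y)
```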

Then we back propagate the error for each l=L-1,L-2,\dots,2 as \delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l).
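The sum over k is a matrix-vector product with the transposed weight matrix, which suggests a small helper that takes the weights and the stored \hat{z}^l values. This is only a sketch, using the list conventions of the feed-forward snippet above (list index i corresponds to layer l = i + 2):

```python
def backpropagate_errors(weights, zs, delta_L, fprime):
    """Compute delta^l for l = L-1, ..., 2 from the output error delta^L.

    weights[i] and zs[i] refer to layer l = i + 2, as in the
    feed-forward sketch above (an illustrative convention only).
    """
    deltas = [None] * len(weights)
    deltas[-1] = delta_L
    # delta_j^l = sum_k delta_k^{l+1} w_{kj}^{l+1} f'(z_j^l)
    for i in range(len(weights) - 2, -1, -1):
        deltas[i] = (weights[i + 1].T @ deltas[i + 1]) * fprime(zs[i])
    return deltas
```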

Finally, we update the weights and the biases using gradient descent: for each l=L,L-1,\dots,2 we update the weights and biases according to the rules w_{jk}^l \leftarrow w_{jk}^l - \eta \delta_j^l a_k^{l-1} and b_j^l \leftarrow b_j^l - \eta \frac{\partial {\cal C}}{\partial b_j^l} = b_j^l - \eta \delta_j^l.
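The update step then amounts to a loop over the layers that applies these two rules; a sketch, again with the illustrative list conventions used above (activations[i] holds a^{l-1} for the layer updated by weights[i]):

```python
import numpy as np

def gradient_descent_step(weights, biases, deltas, activations, eta):
    """In-place update of all weights and biases for one sample:
    w_{jk}^l <- w_{jk}^l - eta * delta_j^l a_k^{l-1} and
    b_j^l <- b_j^l - eta * delta_j^l."""
    for i in range(len(weights)):
        # The outer product delta_j^l a_k^{l-1} is the weight gradient
        weights[i] -= eta * np.outer(deltas[i], activations[i])
        biases[i] -= eta * deltas[i]
    return weights, biases
```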

The parameter \eta is the learning rate discussed in connection with the gradient descent methods. Here it is convenient to use stochastic gradient descent (see the examples below) with mini-batches, with an outer loop that steps through multiple epochs of training.
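A sketch of such a training loop, assuming a hypothetical function backprop(x, y, weights, biases) that performs the forward and backward passes for a single sample and returns its weight and bias gradients:

```python
import numpy as np

def sgd(X, Y, weights, biases, backprop, eta=0.1, n_epochs=10, batch_size=32):
    """Mini-batch stochastic gradient descent over several epochs.

    backprop(x, y, weights, biases) is assumed to return the gradient
    lists (nabla_w, nabla_b) for one sample; this is a sketch of the
    outer training loop only.
    """
    rng = np.random.default_rng(0)
    n_samples = len(X)
    for epoch in range(n_epochs):
        # Shuffle the data and split it into mini-batches
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            batch = perm[start:start + batch_size]
            grad_w = [np.zeros_like(w) for w in weights]
            grad_b = [np.zeros_like(b) for b in biases]
            # Accumulate the gradients over the mini-batch
            for idx in batch:
                nabla_w, nabla_b = backprop(X[idx], Y[idx], weights, biases)
                grad_w = [gw + nw for gw, nw in zip(grad_w, nabla_w)]
                grad_b = [gb + nb for gb, nb in zip(grad_b, nabla_b)]
            # Average the gradients and take one gradient-descent step
            for i in range(len(weights)):
                weights[i] -= eta * grad_w[i] / len(batch)
                biases[i] -= eta * grad_b[i] / len(batch)
    return weights, biases
```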