Setting up the back propagation algorithm

The four equations derived last week provide us with a way of computing the gradient of the cost function. Let us write this out in the form of an algorithm.

First, we set up the input data \( \boldsymbol{x} \) and the activations \( \boldsymbol{z}^1 \) of the input layer and compute the activation function and the pertinent outputs \( \boldsymbol{a}^1 \).

Secondly, we perform the feed-forward pass until we reach the output layer, computing all \( \boldsymbol{z}^l \) of the subsequent layers and the pertinent activation-function outputs \( \boldsymbol{a}^l \) for \( l=2,3,\dots,L \).
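As a minimal sketch of these first two steps, assuming a sigmoid activation function \( f \) and that the weights and biases are stored as lists of NumPy arrays (the function names, shapes and list layout below are our own choices, not fixed by the algorithm), the feed-forward pass could read

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights, biases):
    """Compute z^l and a^l for all layers and store them for back propagation.

    weights[l-2] has shape (n_l, n_{l-1}) and biases[l-2] has shape (n_l,),
    so weights[0], biases[0] connect the input layer (l=1) to layer l=2.
    """
    a = x                     # a^1: the activations of the input layer
    activations = [a]         # a^l for l = 1, ..., L
    zs = []                   # z^l for l = 2, ..., L
    for W, b in zip(weights, biases):
        z = W @ a + b         # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)        # a^l = f(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```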

Thereafter we compute the output error \( \boldsymbol{\delta}^L \) by computing all

$$ \delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial a_j^L}. $$
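As a concrete example, assume the quadratic cost \( {\cal C} = \frac{1}{2}\sum_j (a_j^L-y_j)^2 \), for which \( \partial {\cal C}/\partial a_j^L = a_j^L-y_j \); with the sigmoid activation of the sketch above, the output error could then be computed as

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def output_error(z_L, a_L, y):
    """delta^L for a quadratic cost (an assumption; another cost function
    changes only the factor a_L - y below)."""
    return sigmoid_prime(z_L) * (a_L - y)
```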

Then we compute the back-propagated error for each \( l=L-1,L-2,\dots,2 \) as

$$ \delta_j^l = \sum_k \delta_k^{l+1}w_{kj}^{l+1}f'(z_j^l). $$
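A sketch of this backward pass, reusing sigmoid_prime from above and the list layout of the feed-forward sketch (so that weights[i] and zs[i] belong to layer \( l=i+2 \)), could look like

```python
def backpropagate_errors(delta_L, zs, weights):
    """Return the list [delta^2, ..., delta^L], starting from delta^L and
    stepping backwards through the layers l = L-1, ..., 2."""
    deltas = [delta_L]
    for i in range(len(weights) - 1, 0, -1):   # i corresponds to l = i + 1
        # delta^l = (W^{l+1})^T delta^{l+1} * f'(z^l)
        delta = (weights[i].T @ deltas[0]) * sigmoid_prime(zs[i - 1])
        deltas.insert(0, delta)
    return deltas
```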

Finally, we update the weights and the biases using gradient descent: for each \( l=L,L-1,\dots,2 \) we update the weights and biases according to the rules

$$ w_{jk}^l \leftarrow w_{jk}^l - \eta \delta_j^l a_k^{l-1}, $$

$$ b_j^l \leftarrow b_j^l - \eta \frac{\partial {\cal C}}{\partial b_j^l} = b_j^l - \eta \delta_j^l. $$
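With the same list layout as above (deltas[i] and weights[i] belong to layer \( l=i+2 \), activations[i] to layer \( l=i+1 \)), one gradient descent step for a single data point could be sketched as

```python
import numpy as np

def gradient_descent_step(weights, biases, deltas, activations, eta):
    """Update every W^l and b^l in place using the errors delta^l."""
    for i in range(len(weights)):
        # the outer product delta^l (a^{l-1})^T collects all derivatives dC/dw_{jk}^l
        weights[i] -= eta * np.outer(deltas[i], activations[i])
        biases[i]  -= eta * deltas[i]
```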

The parameter \( \eta \) is the learning rate discussed in connection with the gradient descent methods. Here it is convenient to use stochastic gradient descent (see the examples below) with mini-batches and an outer loop that steps through multiple epochs of training.
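As a sketch of how the pieces above could be tied together into stochastic gradient descent with mini-batches and epochs (the function and parameter names are our own, and the helpers are the sketches introduced earlier):

```python
import numpy as np

rng = np.random.default_rng(2023)

def sgd(X, Y, weights, biases, eta=0.1, n_epochs=10, batch_size=32):
    """Shuffle the data each epoch and take one gradient descent step per
    mini-batch, averaging the gradients over the batch.  Assumes the helpers
    feed_forward, output_error and backpropagate_errors sketched above."""
    n = X.shape[0]
    for epoch in range(n_epochs):
        permutation = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = permutation[start:start + batch_size]
            grad_W = [np.zeros_like(W) for W in weights]
            grad_b = [np.zeros_like(b) for b in biases]
            # accumulate the gradients over the mini-batch
            for x, y in zip(X[batch], Y[batch]):
                zs, activations = feed_forward(x, weights, biases)
                delta_L = output_error(zs[-1], activations[-1], y)
                deltas = backpropagate_errors(delta_L, zs, weights)
                for i in range(len(weights)):
                    grad_W[i] += np.outer(deltas[i], activations[i])
                    grad_b[i] += deltas[i]
            # take one gradient descent step per mini-batch
            for i in range(len(weights)):
                weights[i] -= eta * grad_W[i] / len(batch)
                biases[i]  -= eta * grad_b[i] / len(batch)
```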