Bringing it together

We now have three equations that are essential for computing the derivatives of the cost function at the output layer. These equations are needed to start the algorithm, and they are

$$ \begin{equation} \frac{\partial{\cal C}(\hat{W^L})}{\partial w_{jk}^L} = \delta_j^La_k^{L-1}, \tag{13} \end{equation} $$

and

$$ \begin{equation} \delta_j^L = f'(z_j^L)\frac{\partial {\cal C}}{\partial (a_j^L)}, \tag{14} \end{equation} $$

and

$$ \begin{equation} \delta_j^L = \frac{\partial {\cal C}}{\partial b_j^L}. \tag{15} \end{equation} $$

An interesting consequence of the above equations is that when the activation \( a_k^{L-1} \) is small, the gradient term, that is the derivative of the cost function with respect to the weight, will also tend to be small. We then say that the weight learns slowly, meaning that it changes only little when we minimize the cost function via, say, gradient descent. In this case the network learns slowly.
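To make equations (13)-(15) concrete, here is a minimal numerical sketch in Python. The layer sizes, weights, targets, the quadratic cost \( {\cal C}=\frac{1}{2}\sum_j(a_j^L-y_j)^2 \) and the sigmoid activation are assumptions made only for illustration. Note how the small activation \( a_k^{L-1} \) in the last column produces a correspondingly small weight gradient.

```python
import numpy as np

# Sketch of equations (13)-(15) for the output layer, assuming a sigmoid
# activation and a quadratic cost C = 0.5*sum_j (a_j^L - y_j)^2,
# so that dC/da_j^L = a_j^L - y_j.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical small example: 3 nodes in layer L-1, 2 nodes in layer L
a_prev = np.array([0.9, 0.5, 0.01])        # activations a_k^{L-1}
W_L = np.array([[0.2, -0.4, 0.7],
                [0.5,  0.1, -0.3]])        # weights w_{jk}^L
b_L = np.array([0.1, -0.2])                # biases b_j^L
y = np.array([1.0, 0.0])                   # targets

z_L = W_L @ a_prev + b_L                   # z_j^L = sum_k w_{jk}^L a_k^{L-1} + b_j^L
a_L = sigmoid(z_L)

delta_L = sigmoid_prime(z_L) * (a_L - y)   # eq. (14)
grad_W = np.outer(delta_L, a_prev)         # eq. (13): dC/dw_{jk}^L = delta_j^L a_k^{L-1}
grad_b = delta_L                           # eq. (15): dC/db_j^L = delta_j^L

print(grad_W)   # last column is small because a_3^{L-1} = 0.01 is small
print(grad_b)
```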

Another interesting feature is that when the activation function, here represented by the sigmoid function, is rather flat as we move towards its end values \( 0 \) and \( 1 \) (see the above Python codes), the derivative of the activation function will be close to zero. This means again that the gradients will be small and the network learns slowly.
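As a small illustration of this saturation effect, the snippet below is a sketch assuming the standard sigmoid \( f(z)=1/(1+e^{-z}) \), whose derivative is \( f'(z)=f(z)(1-f(z)) \). For large \( |z| \) the activation sits near its end values and the derivative is essentially zero, so the deltas in equation (14), and hence the gradients, are also tiny.

```python
import numpy as np

# Sigmoid saturation: for large |z| the derivative f'(z) = f(z)(1 - f(z))
# becomes very small, which suppresses the gradients via eq. (14).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
f = sigmoid(z)
fprime = f * (1.0 - f)

for zi, fi, fpi in zip(z, f, fprime):
    print(f"z = {zi:6.1f}   f(z) = {fi:.5f}   f'(z) = {fpi:.2e}")
```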

We need a fourth equation and we are set. We are going to propagate backwards in order to determine the weights and biases. To do so we need to express the error in the layer before the final one, \( L-1 \), in terms of the error in the final output layer \( L \).