Deriving the back propagation code for a multilayer perceptron model

As we have seen, the final output of a feed forward network can be expressed in terms of basic matrix-vector multiplications. The unknown quantities are our weights \( w_{ij} \), and we need an algorithm for changing them so that our errors are as small as possible. This leads us to the famous back propagation algorithm.

The questions we want to answer are: how do changes in the biases and the weights of our network change the cost function, and how can we use the final output to modify the weights?

To derive these equations let us start with a plain regression problem and define our cost function as

$$ {\cal C}(\hat{W}) = \frac{1}{2}\sum_{i=1}^n\left(y_i - t_i\right)^2, $$

where the \( t_i \) are our \( n \) targets (the values we want to reproduce), while the \( y_i \) are the outputs of the network after all inputs \( \hat{x} \) have been propagated through it. Below we will demonstrate how the basic equations arising from the back propagation algorithm can be modified in order to study classification problems with \( K \) classes.
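
To make the cost function above concrete, here is a minimal sketch in Python, assuming NumPy is available. The function names `cost` and `cost_gradient` and the example values are hypothetical, chosen only for illustration. The second function returns the derivative of the cost with respect to each output, \( \partial {\cal C}/\partial y_i = y_i - t_i \), which follows directly from differentiating the sum above and is the natural starting point for propagating errors backwards.

```python
import numpy as np


def cost(y, t):
    """Half the sum of squared errors, C = 1/2 * sum_i (y_i - t_i)^2."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return 0.5 * np.sum((y - t) ** 2)


def cost_gradient(y, t):
    """Derivative of C with respect to each output y_i, that is y_i - t_i."""
    return np.asarray(y, dtype=float) - np.asarray(t, dtype=float)


# Hypothetical targets and network outputs, for illustration only
t = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 0.5, 2.0])

print(cost(y, t))           # 0.5 * (0.25 + 0.25 + 0.0) = 0.25
print(cost_gradient(y, t))  # entries -0.5, 0.5 and 0.0
```

The factor \( 1/2 \) in the cost function is what makes the derivative come out as \( y_i - t_i \) without an extra factor of two.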