Layers

Since each input image has 8x8 = 64 pixels or features, we have an input layer of 64 neurons.

We will use 50 neurons in the hidden layer, each receiving input from all neurons in the input layer. Since each neuron in the hidden layer is connected to all 64 inputs, we have 64x50 = 3200 weights to the hidden layer.
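The weight count above can be checked with a small NumPy sketch. The variable names, the random initialization and the random stand-in image are assumptions made for illustration only; they are not taken from the text.

```python
import numpy as np

n_inputs = 64   # 8x8 pixels per image
n_hidden = 50   # neurons in the hidden layer

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# One row per input pixel, one column per hidden neuron.
hidden_weights = rng.normal(size=(n_inputs, n_hidden))

# A stand-in for one flattened 8x8 image (random values, not real data).
x = rng.random(n_inputs)

# Weighted sum of the 64 inputs for each of the 50 hidden neurons.
z_hidden = x @ hidden_weights

print(hidden_weights.size)  # 3200
print(z_hidden.shape)       # (50,)
```

Storing the weights as a 64x50 matrix means that a single matrix product gives the weighted sums for all hidden neurons at once.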

If we were building a binary classifier, a single neuron in the output layer would suffice; it could output 0 or 1 according to the Heaviside function. This would be an example of a hard classifier, meaning it outputs the class of the input directly. However, if we are dealing with noisy data, it is often beneficial to use a soft classifier, which outputs the probability of the input being in class 0 or 1.
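The difference between a hard and a soft classifier can be illustrated with a minimal sketch. The text does not say which function a soft classifier would use; the logistic sigmoid below is an assumption, chosen because it is a common way to map a weighted sum to a value between 0 and 1. The function names and the sample values of `z` are hypothetical.

```python
import numpy as np

def hard_classifier(z):
    """Heaviside step: returns the class label 0 or 1 directly."""
    # The second argument is the value returned at z == 0 (here class 1).
    return np.heaviside(z, 1.0).astype(int)

def soft_classifier(z):
    """Logistic sigmoid (an assumption): estimated probability of class 1."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.1, 0.1, 2.0])  # hypothetical weighted sums

print(hard_classifier(z))  # [0 0 1 1]
print(soft_classifier(z))  # values strictly between 0 and 1
```

For weighted sums close to zero the hard classifier still commits to a class, while the soft classifier returns a value near 0.5, signalling that the input is ambiguous.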

For a soft binary classifier, we could use a single neuron and interpret its output as the probability of being in one chosen class, say class 1; the probability of the other class is then one minus this output. Alternatively, we could use two neurons and interpret the output of each as the probability of the corresponding class.

Since we are doing multiclass classification with 10 categories, it is natural to use 10 neurons in the output layer. We number the neurons \( j = 0,1,\ldots,9 \). The activation of each output neuron \( j \) is given by the softmax function:

$$ P(\text{class \( j \)} \mid \text{input \( \boldsymbol{a} \)}) = \frac{\exp{(\boldsymbol{a}^T \boldsymbol{w}_j)}} {\sum_{c=0}^{9} \exp{(\boldsymbol{a}^T \boldsymbol{w}_c)}} ,$$

i.e. each neuron \( j \) outputs the probability of being in class \( j \) given the input \( \boldsymbol{a} \) from the hidden layer, with \( \boldsymbol{w}_j \) the weights connecting neuron \( j \) to those inputs. The denominator is a normalization factor that ensures the outputs (probabilities) sum to 1. The exponent is the weighted sum of inputs as before, written here with the bias \( b_j \) included explicitly and with \( n = 50 \) inputs from the hidden layer:

$$ z_j = \sum_{i=1}^n w_{ij} a_i + b_j. $$
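A minimal NumPy sketch of the softmax output layer follows. The activations, weights and biases are random or zero placeholders rather than trained values, and the names `softmax`, `W` and `b` are chosen for this illustration. Subtracting the largest \( z_j \) before exponentiating is a numerical safeguard added here; it is not part of the formula above, but it leaves the result unchanged because the common factor cancels between numerator and denominator.

```python
import numpy as np

def softmax(z):
    """Map a vector of weighted sums z_j to probabilities that sum to 1."""
    # Subtracting max(z) does not change the result but avoids overflow in exp.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
a = rng.random(50)               # stand-in hidden-layer activations
W = rng.normal(size=(50, 10))    # column j holds the weights w_j
b = np.zeros(10)                 # biases b_j, set to zero for this sketch

z = a @ W + b                    # z_j = sum_i w_ij a_i + b_j
probabilities = softmax(z)

print(probabilities.sum())        # sums to 1 up to rounding error
print(np.argmax(probabilities))   # index of the most probable class
```

Taking the `argmax` of the probabilities turns the soft output back into a hard class prediction.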

Since each neuron in the output layer is connected to all 50 neurons in the hidden layer, we have 50x10 = 500 weights to the output layer.
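The weight counts for both layers can be tallied with a short sketch. The helper `count_parameters` is hypothetical, and the bias count assumes one bias per hidden and per output neuron, in line with the \( b_j \) term in the weighted sum.

```python
layer_sizes = [64, 50, 10]  # input, hidden, output

def count_parameters(sizes):
    """Return (weights, biases) for a fully connected feed-forward network."""
    # Each pair of consecutive layers contributes n_in * n_out weights.
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    # Assumption: one bias per neuron in every layer except the input layer.
    biases = sum(sizes[1:])
    return weights, biases

n_weights, n_biases = count_parameters(layer_sizes)
print(n_weights)  # 3700 = 3200 + 500
print(n_biases)   # 60 = 50 + 10
```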