Choose cost function and optimizer

To measure how well our neural network is doing we need to introduce a cost function. We will call the function that gives the error of a single sample output the loss function, and the function that gives the total error of our network across all samples the cost function. A typical choice for multiclass classification is the cross-entropy loss, also known as the negative log likelihood.

In multiclass classification it is common to treat each integer label as a so called one-hot vector:

$$ y = 5 \quad \rightarrow \quad \boldsymbol{y} = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) ,$$

$$ y = 1 \quad \rightarrow \quad \boldsymbol{y} = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) ,$$

i.e. a binary bit string of length \( C \), where \( C = 10 \) is the number of classes in the MNIST dataset.

Let \( y_{ic} \) denote the \( c \)-th component of the \( i \)-th one-hot vector. We define the cost function \( \mathcal{C} \) as a sum over the cross-entropy loss for each point \( \boldsymbol{x}_i \) in the dataset.

In the one-hot representation only one of the terms in the loss function is non-zero, namely the probability of the correct category \( c' \) (i.e. the category \( c' \) such that \( y_{ic'} = 1 \)). This means that the cross entropy loss only punishes you for how wrong you got the correct label. The probability of category \( c \) is given by the softmax function. The vector \( \boldsymbol{\theta} \) represents the parameters of our network, i.e. all the weights and biases.