To measure how well our neural network is doing, we need to introduce a cost function. We will call the function that gives the error of a single sample output the loss function, and the function that gives the total error of our network across all samples the cost function. A typical choice for multiclass classification is the cross-entropy loss, also known as the negative log-likelihood.
In multiclass classification it is common to represent each integer label as a so-called one-hot vector:
y = 5 \quad \rightarrow \quad \boldsymbol{y} = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) ,
y = 1 \quad \rightarrow \quad \boldsymbol{y} = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) ,
i.e. a binary vector of length C, where C = 10 is the number of classes in the MNIST dataset.
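As a concrete illustration, a minimal NumPy sketch of this encoding could look as follows (the helper name to_one_hot and the use of NumPy are our choices, not part of the text above):

```python
import numpy as np

def to_one_hot(labels, n_classes=10):
    """Convert integer labels to one-hot row vectors of length n_classes."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, n_classes))
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot

# y = 5 and y = 1 become the two vectors shown above
print(to_one_hot([5, 1]))
```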
Let y_{ic} denote the c-th component of the i-th one-hot vector. We define the cost function \mathcal{C} as a sum over the cross-entropy loss for each point \boldsymbol{x}_i in the dataset.
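Written out explicitly (using p_{ic} as our shorthand for the probability that the network, with parameters \boldsymbol{\theta}, assigns to class c for input \boldsymbol{x}_i; this is one standard way of stating the definition), the cost function reads

\mathcal{C}(\boldsymbol{\theta}) = - \sum_{i} \sum_{c=0}^{C-1} y_{ic} \log p_{ic} .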
In the one-hot representation only one of the terms in the loss function is non-zero, namely the one involving the probability of the correct category c' (i.e. the category c' for which y_{ic'} = 1). The cross-entropy loss therefore penalizes the network only according to the probability it assigns to the correct label. The probability of category c is given by the softmax function. The vector \boldsymbol{\theta} represents the parameters of our network, i.e. all the weights and biases.
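In a standard formulation, with z_{ic} denoting the raw network output (logit) for class c given input \boldsymbol{x}_i (our notation), the softmax probabilities read

p_{ic} = \frac{e^{z_{ic}}}{\sum_{c'=0}^{C-1} e^{z_{ic'}}} .

As a sanity check of the two formulas, here is a minimal NumPy sketch (the function names and the shift by the row maximum for numerical stability are our choices, not prescribed by the text):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax of logits z with shape (n_samples, n_classes).

    Subtracting the row maximum before exponentiating avoids overflow
    without changing the result."""
    z_shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def cross_entropy_cost(z, y_one_hot):
    """Cross-entropy cost -sum_i sum_c y_ic log p_ic, with p from the softmax."""
    p = softmax(z)
    # Only the term of the correct class survives, since y_ic is zero elsewhere.
    return -np.sum(y_one_hot * np.log(p + 1e-12))  # small constant guards against log(0)

# Two samples with labels 5 and 1, encoded as one-hot vectors
y = np.zeros((2, 10))
y[0, 5] = 1.0
y[1, 1] = 1.0

rng = np.random.default_rng(0)
z = rng.normal(size=(2, 10))  # stand-in for the network's raw outputs
print(cross_entropy_cost(z, y))
```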