Defining the cost function

Our cost function is given as (see the Logistic regression lectures) $$ \mathcal{C}(\hat{\theta}) = - \ln P(\mathcal{D} \mid \hat{\theta}) = - \sum_{i=1}^n \left\{ y_i \ln[P(y_i = 1)] + (1 - y_i) \ln [1 - P(y_i = 1)] \right\} = \sum_{i=1}^n \mathcal{L}_i(\hat{\theta}) . $$

This last equality means that we can interpret our cost function as a sum over the loss functions \( \mathcal{L}_i(\hat{\theta}) \) for each point in the dataset. The negative sign is just so that we can think about our algorithm as minimizing a positive number, rather than maximizing a negative number.
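A minimal NumPy sketch of this sum-of-losses form (not from the lectures; the array `p` is assumed to hold the predicted probabilities \( P(y_i = 1) \)):

```python
import numpy as np

def binary_cross_entropy(y, p):
    """Cost C(theta) as a sum of the per-point losses L_i(theta).

    y : targets (0 or 1), shape (n,)
    p : predicted probabilities P(y_i = 1), shape (n,)
    """
    eps = 1e-12                          # guard against ln(0)
    p = np.clip(p, eps, 1.0 - eps)
    losses = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return np.sum(losses)
```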

In multiclass classification it is common to treat each integer label as a so-called one-hot vector:

\( y = 5 \quad \rightarrow \quad \hat{y} = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) , \) and

\( y = 1 \quad \rightarrow \quad \hat{y} = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) , \)

i.e., a binary bit string of length \( C \), where \( C = 10 \) is the number of classes in the MNIST dataset (numbers from \( 0 \) to \( 9 \)).
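A small sketch of this encoding in NumPy (the function name `to_one_hot` is just illustrative):

```python
import numpy as np

def to_one_hot(labels, n_classes=10):
    """Map integer labels to one-hot vectors of length C = n_classes."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, n_classes))
    one_hot[np.arange(labels.size), labels] = 1
    return one_hot

# y = 5 -> (0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
# y = 1 -> (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
print(to_one_hot([5, 1]))
```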

If \( \hat{x}_i \) is the \( i \)-th input (image), then \( y_{ic} \) refers to the \( c \)-th component of the \( i \)-th output vector \( \hat{y}_i \). The probability of \( \hat{x}_i \) being in class \( c \) is given by the softmax function: $$ P(y_{ic} = 1 \mid \hat{x}_i, \hat{\theta}) = \frac{\exp{((\hat{a}_i^{hidden})^T \hat{w}_c)}} {\sum_{c'=0}^{C-1} \exp{((\hat{a}_i^{hidden})^T \hat{w}_{c'})}} , $$

which reduces to the logistic function in the binary case. The likelihood of this \( C \)-class classifier is now given as: $$ P(\mathcal{D} \mid \hat{\theta}) = \prod_{i=1}^n \prod_{c=0}^{C-1} [P(y_{ic} = 1)]^{y_{ic}} . $$ Again we take the negative log-likelihood to define our cost function: $$ \mathcal{C}(\hat{\theta}) = - \ln P(\mathcal{D} \mid \hat{\theta}). $$ See the logistic regression lectures for a full definition of the cost function.
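As a small NumPy sketch (not part of the lectures), the softmax probabilities and this cost could be evaluated as follows; the matrix `z` is assumed to hold the scores \( (\hat{a}_i^{hidden})^T \hat{w}_c \) for each input \( i \) and class \( c \), and `Y_one_hot` the one-hot encoded targets:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; z has shape (n_inputs, n_classes)."""
    # Subtracting the row-wise maximum leaves the result unchanged,
    # but prevents overflow in the exponentials.
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy_cost(Y_one_hot, probabilities):
    """C(theta) = -ln P(D | theta) = -sum_i sum_c y_ic * ln P(y_ic = 1)."""
    eps = 1e-12                                  # guard against ln(0)
    log_p = np.log(np.clip(probabilities, eps, 1.0))
    return -np.sum(Y_one_hot * log_p)
```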

The back propagation equations now need only one small change, namely the definition of a new cost function. We are thus ready to use the same equations as before!
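For reference, combining the softmax output with this cost gives the standard output-layer error \( \hat{a}^{output} - \hat{y} \), which is the one place the new cost function enters the back propagation equations. A one-line sketch using the illustrative names from above:

```python
def output_error(probabilities, Y_one_hot):
    """Error at the output layer for softmax combined with cross-entropy."""
    return probabilities - Y_one_hot
```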