Our cost function is given as (see the logistic regression lectures)
$$ \mathcal{C}(\boldsymbol{\theta}) = - \ln P(\mathcal{D} \mid \boldsymbol{\theta}) = - \sum_{i=1}^n \left( y_i \ln[P(y_i = 1)] + (1 - y_i) \ln [1 - P(y_i = 1)] \right) = \sum_{i=1}^n \mathcal{L}_i(\boldsymbol{\theta}) . $$
The last equality means that we can interpret the cost function as a sum of per-sample losses \( \mathcal{L}_i(\boldsymbol{\theta}) \), one for each point in the dataset. The negative sign lets us think of the algorithm as minimizing a positive number rather than maximizing a negative one.
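The decomposition into per-sample losses can be sketched in NumPy. This is a minimal illustration, not the lecture's own implementation: the function name `binary_cross_entropy`, the clipping constant `eps` and the toy arrays are assumptions made for the example, and `p` denotes the model's probability that the label equals 1.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Return (total cost, per-sample losses) for labels y in {0, 1}.

    p[i] is the predicted probability that y[i] == 1 (assumed convention).
    """
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    losses = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return losses.sum(), losses

# Toy data, invented for illustration only
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.5])

cost, losses = binary_cross_entropy(y, p)
print(cost, losses)
```

Each entry of `losses` is non-negative, and the cost is simply their sum.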
In multiclass classification it is common to represent each integer label as a so-called one-hot vector:
\( y = 5 \quad \rightarrow \quad \boldsymbol{y} = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) , \) and
\( y = 1 \quad \rightarrow \quad \boldsymbol{y} = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) , \)
i.e. a binary vector of length \( C \) with a single 1 at the index of the label, where \( C = 10 \) is the number of classes in the MNIST dataset (the digits \( 0 \) to \( 9 \)).
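A short sketch of this encoding in NumPy, reproducing the two examples above. The helper name `to_one_hot` is hypothetical and chosen only for this illustration.

```python
import numpy as np

def to_one_hot(labels, n_classes):
    """Map integer labels to one-hot rows of length n_classes."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, n_classes), dtype=int)
    # Put a single 1 in each row, at the column given by the label
    one_hot[np.arange(labels.size), labels] = 1
    return one_hot

Y = to_one_hot([5, 1], n_classes=10)
print(Y)
```

Every row of `Y` sums to one, and `np.argmax(Y, axis=1)` recovers the integer labels.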
If \( \boldsymbol{x}_i \) is the \( i \)-th input (image), \( y_{ic} \) refers to the \( c \)-th component of the \( i \)-th output vector \( \boldsymbol{y}_i \). The probability of \( \boldsymbol{x}_i \) being in class \( c \) will be given by the softmax function:
$$ P(y_{ic} = 1 \mid \boldsymbol{x}_i, \boldsymbol{\theta}) = \frac{\exp\left((\boldsymbol{a}_i^{hidden})^T \boldsymbol{w}_c\right)} {\sum_{c'=0}^{C-1} \exp\left((\boldsymbol{a}_i^{hidden})^T \boldsymbol{w}_{c'}\right)} , $$
which reduces to the logistic function in the binary case. The likelihood of this \( C \)-class classifier is now given as:
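The softmax can be sketched as follows. This is an illustrative example under stated assumptions: the array shapes, the random `a_hidden` and `W` (whose column `c` plays the role of \( \boldsymbol{w}_c \)) and the subtraction of the row maximum for numerical stability are choices made here, not taken from the lecture. The second part checks the binary case numerically: with \( C = 2 \), the softmax probability of class 1 equals the logistic function of the difference between the two arguments.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max leaves the result unchanged."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
a_hidden = rng.normal(size=(4, 3))  # hypothetical activations: 4 samples, 3 hidden nodes
W = rng.normal(size=(3, 10))        # hypothetical output weights: column c is w_c
probs = softmax(a_hidden @ W)       # probs[i, c] = P(y_ic = 1 | x_i, theta)

# Binary case: softmax over two classes versus the logistic function
z2 = rng.normal(size=(4, 2))
p_class1 = softmax(z2)[:, 1]
sigmoid = 1.0 / (1.0 + np.exp(-(z2[:, 1] - z2[:, 0])))
print(np.allclose(p_class1, sigmoid))
```

Each row of `probs` is a probability distribution over the ten classes.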
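$$ P(\mathcal{D} \mid \boldsymbol{\theta}) = \prod_{i=1}^n \prod_{c=0}^{C-1} [P(y_{ic} = 1)]^{y_{ic}} . $$
Again we take the negative log-likelihood to define our cost function:
$$ \mathcal{C}(\boldsymbol{\theta}) = - \ln P(\mathcal{D} \mid \boldsymbol{\theta}) = - \sum_{i=1}^n \sum_{c=0}^{C-1} y_{ic} \ln P(y_{ic} = 1) . $$
Since each \( \boldsymbol{y}_i \) is one-hot, only the term belonging to the true class of sample \( i \) contributes to the inner sum. See the logistic regression lectures for a full definition of the cost function.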
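As a small numerical check, the negative log of the product likelihood can be compared with a double sum over samples and classes. The function name `cross_entropy`, the clipping constant `eps` and the toy matrices `Y` and `P` are assumptions made for this sketch.

```python
import numpy as np

def cross_entropy(Y, P, eps=1e-12):
    """Negative log-likelihood for one-hot targets Y and class probabilities P."""
    return -np.sum(Y * np.log(np.clip(P, eps, 1.0)))

# Toy example: 2 samples, 3 classes (values invented for illustration)
Y = np.array([[0, 1, 0],
              [1, 0, 0]])
P = np.array([[0.2, 0.7, 0.1],
              [0.5, 0.3, 0.2]])

cost = cross_entropy(Y, P)
likelihood = np.prod(P ** Y)  # product over i and c of P_ic ** y_ic
print(cost, -np.log(likelihood))
```

Because the targets are one-hot, the cost here reduces to \( -\ln 0.7 - \ln 0.5 \): only the probability assigned to the true class of each sample enters.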
The back propagation equations now need only a small change, namely the definition of a new cost function. We are thus ready to use the same equations as before!
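To illustrate how small the change is, the sketch below checks a standard result that is not derived in this section: for a softmax output combined with the cross-entropy cost, the derivative of the cost with respect to the softmax arguments \( z_{ic} \) is \( P(y_{ic} = 1) - y_{ic} \). The names `softmax`, `cost`, `z` and `Y` are assumptions for the example, and the check uses central finite differences rather than any particular back propagation code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cost(z, Y):
    """Cross-entropy cost as a function of the softmax arguments z."""
    return -np.sum(Y * np.log(softmax(z)))

rng = np.random.default_rng(1)
z = rng.normal(size=(3, 4))   # hypothetical output-layer inputs: 3 samples, 4 classes
Y = np.eye(4)[[0, 2, 3]]      # one-hot targets for labels 0, 2, 3

analytic = softmax(z) - Y     # claimed output error

# Central finite differences, one entry of z at a time
numeric = np.zeros_like(z)
h = 1e-6
for idx in np.ndindex(*z.shape):
    dz = np.zeros_like(z)
    dz[idx] = h
    numeric[idx] = (cost(z + dz, Y) - cost(z - dz, Y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

The two gradients agree up to finite-difference error, so in this setting only the output error changes while the remaining back propagation steps keep their form.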