Our cost function is given as (see the Logistic regression lectures)
$$
C(\theta) = -\ln P(\mathcal{D}\mid\theta) = -\sum_{i=1}^n \Big( y_i\ln\big[P(y_i=0)\big] + (1-y_i)\ln\big[1-P(y_i=0)\big] \Big) = \sum_{i=1}^n L_i(\theta).
$$

This last equality means that we can interpret our cost function as a sum over the loss function for each point in the dataset, $L_i(\theta)$. The negative sign is just so that we can think about our algorithm as minimizing a positive number, rather than maximizing a negative number.
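As a small illustration, we can sketch this cost in NumPy, assuming that `p` holds the predicted probabilities $P(y_i=0)$ for each data point and `y` the binary targets (the function and variable names here are ours, purely for illustration):

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """Cost C(theta) as a sum of per-point losses L_i(theta).

    p : array of predicted probabilities P(y_i = 0)
    y : array of binary targets y_i in {0, 1}
    eps guards against taking log(0).
    """
    p = np.clip(p, eps, 1 - eps)
    # per-point loss L_i(theta); the minus sign turns maximization into minimization
    L = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return np.sum(L)
```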
In multiclass classification it is common to treat each integer label as a so-called one-hot vector:
$$
y = 5 \quad \rightarrow \quad y = (0,0,0,0,0,1,0,0,0,0), \quad \text{and}
$$

$$
y = 1 \quad \rightarrow \quad y = (0,1,0,0,0,0,0,0,0,0),
$$

i.e. a binary bit string of length $C$, where $C = 10$ is the number of classes in the MNIST dataset (numbers from 0 to 9).
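Such an encoding is straightforward to produce with NumPy; the sketch below assumes integer labels between $0$ and $C-1$ (the function name is ours):

```python
import numpy as np

def to_one_hot(labels, num_classes=10):
    """Map integer labels, e.g. y = 5, to one-hot vectors of length C."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, num_classes))
    one_hot[np.arange(labels.size), labels] = 1
    return one_hot

# y = 5 -> (0,0,0,0,0,1,0,0,0,0) and y = 1 -> (0,1,0,0,0,0,0,0,0,0)
print(to_one_hot([5, 1]))
```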
If $x_i$ is the $i$-th input (image), $y_{ic}$ refers to the $c$-th component of the $i$-th output vector $y_i$. The probability of $x_i$ being in class $c$ will be given by the softmax function:
$$
P(y_{ic} = 1 \mid x_i, \theta) = \frac{\exp\left( (a_i^{\mathrm{hidden}})^T w_c \right)}{\sum_{c'=0}^{C-1} \exp\left( (a_i^{\mathrm{hidden}})^T w_{c'} \right)},
$$

which reduces to the logistic function in the binary case. The likelihood of this $C$-class classifier is now given as:
$$
P(\mathcal{D}\mid\theta) = \prod_{i=1}^n \prod_{c=0}^{C-1} \left[ P(y_{ic}=1) \right]^{y_{ic}}.
$$

Again we take the negative log-likelihood to define our cost function:
$$
C(\theta) = -\log P(\mathcal{D}\mid\theta).
$$

See the logistic regression lectures for a full definition of the cost function.
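To make the multiclass expressions concrete, here is a minimal NumPy sketch of the softmax probabilities and the resulting negative log-likelihood cost. The names `a_hidden`, `weights` and `y_one_hot` stand for $a_i^{\mathrm{hidden}}$, the output weights $w_c$ and the one-hot targets $y_{ic}$; they, and the omission of output biases, are assumptions made to mirror the formulas above:

```python
import numpy as np

def softmax_probabilities(a_hidden, weights):
    """P(y_ic = 1 | x_i, theta) for all points i and classes c.

    a_hidden : (n_inputs, n_hidden) hidden-layer activations a_i^hidden
    weights  : (n_hidden, C) output weights, one column w_c per class
    """
    z = a_hidden @ weights                     # (n_inputs, C) class scores
    z -= z.max(axis=1, keepdims=True)          # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def cross_entropy_cost(probabilities, y_one_hot, eps=1e-12):
    """C(theta) = -log P(D | theta) = -sum_i sum_c y_ic log P(y_ic = 1)."""
    return -np.sum(y_one_hot * np.log(probabilities + eps))
```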
The backpropagation equations now need only a small change, namely the definition of a new cost function. We are thus ready to use the same equations as before!
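In particular, for the softmax output combined with the cross-entropy cost the output-layer error takes the familiar form of predicted probabilities minus one-hot targets, just as in the binary case. A minimal sketch of the resulting gradient with respect to the output weights, reusing the illustrative names from above:

```python
import numpy as np

def output_layer_gradients(a_hidden, probabilities, y_one_hot):
    """Gradient of C(theta) with respect to the output weights.

    With softmax outputs and the cross-entropy cost, the output error is
    simply (probabilities - y_one_hot), the same form as in the binary case.
    """
    error_output = probabilities - y_one_hot   # (n_inputs, C)
    return a_hidden.T @ error_output           # (n_hidden, C)
```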