Defining the cost function

Our cost function is given as (see the Logistic regression lectures)

$$
\mathcal{C}(\theta) = -\ln P(\mathcal{D} \mid \theta) = -\sum_{i=1}^n \Big( y_i \ln[P(y_i = 0)] + (1 - y_i) \ln[1 - P(y_i = 0)] \Big) = \sum_{i=1}^n \mathcal{L}_i(\theta).
$$

This last equality means that we can interpret our cost function as a sum over the loss function $\mathcal{L}_i(\theta)$ for each point in the dataset. The negative sign is just so that we can think about our algorithm as minimizing a positive number, rather than maximizing a negative number.
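As a minimal NumPy sketch of this sum-of-losses form (the helper name is ours, and `p` is assumed to hold the predicted probabilities matching the probability appearing in the cost above):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Cost C(theta) = -sum_i ( y_i ln[p_i] + (1 - y_i) ln[1 - p_i] ),
    i.e. the sum of the per-point losses L_i(theta)."""
    p = np.clip(p, eps, 1 - eps)                          # avoid log(0)
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # one loss per data point
    return np.sum(losses)
```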

In multiclass classification it is common to treat each integer label as a so-called one-hot vector:

$$y = 5 \;\rightarrow\; \boldsymbol{y} = (0,0,0,0,0,1,0,0,0,0),$$

and

$$y = 1 \;\rightarrow\; \boldsymbol{y} = (0,1,0,0,0,0,0,0,0,0),$$

i.e. a binary bit string of length $C$, where $C = 10$ is the number of classes in the MNIST dataset (numbers from 0 to 9).
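A small NumPy sketch of such an encoding (the helper name `to_one_hot` is our own choice) could look like:

```python
import numpy as np

def to_one_hot(labels, n_classes=10):
    """Map integer labels to one-hot vectors of length n_classes."""
    labels = np.asarray(labels)
    one_hot = np.zeros((labels.size, n_classes))
    one_hot[np.arange(labels.size), labels] = 1   # set a single 1 per row
    return one_hot

# Example: to_one_hot([5, 1])[0] gives (0,0,0,0,0,1,0,0,0,0)
```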

If $\boldsymbol{x}_i$ is the $i$-th input (image), then $y_{ic}$ refers to the $c$-th component of the $i$-th output vector $\boldsymbol{y}_i$. The probability of $\boldsymbol{x}_i$ being in class $c$ is given by the softmax function:

$$
P(y_{ic} = 1 \mid \boldsymbol{x}_i, \theta) = \frac{\exp\big((\boldsymbol{a}_i^{\mathrm{hidden}})^T \boldsymbol{w}_c\big)}{\sum_{c'=0}^{C-1} \exp\big((\boldsymbol{a}_i^{\mathrm{hidden}})^T \boldsymbol{w}_{c'}\big)},
$$

which reduces to the logistic function in the binary case. The likelihood of this $C$-class classifier is now given as:

$$
P(\mathcal{D} \mid \theta) = \prod_{i=1}^n \prod_{c=0}^{C-1} \big[P(y_{ic} = 1)\big]^{y_{ic}}.
$$

Again we take the negative log-likelihood to define our cost function:

$$
\mathcal{C}(\theta) = -\log P(\mathcal{D} \mid \theta).
$$
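Put together, a minimal NumPy sketch of the softmax probabilities and this negative log-likelihood cost might read as follows (variable names such as `z` and `Y_onehot` are our own assumptions, not fixed by the lectures):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax of the output-layer inputs z (shape: n_inputs x C)."""
    z = z - np.max(z, axis=1, keepdims=True)              # shift for numerical stability
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)

def cross_entropy_cost(probabilities, Y_onehot, eps=1e-12):
    """C(theta) = -log P(D|theta) = -sum over i, c of y_ic * log P(y_ic = 1)."""
    return -np.sum(Y_onehot * np.log(probabilities + eps))
```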

See the logistic regression lectures for a full definition of the cost function.

The backpropagation equations now need only a small change, namely the definition of a new cost function. We are thus ready to use the same equations as before!
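To make this explicit, here is a sketch of the output-layer part of backpropagation with the new cost, assuming the softmax probabilities and one-hot targets from the sketches above (the function and variable names are hypothetical):

```python
import numpy as np

def output_layer_gradients(a_hidden, probabilities, Y_onehot):
    """Output-layer gradients for softmax outputs with the cross-entropy cost.

    a_hidden:      (n, n_hidden) activations of the last hidden layer
    probabilities: (n, C) softmax outputs
    Y_onehot:      (n, C) one-hot targets
    """
    error_output = probabilities - Y_onehot        # dC/dz at the output layer
    weights_gradient = a_hidden.T @ error_output   # gradient w.r.t. output weights
    bias_gradient = np.sum(error_output, axis=0)   # gradient w.r.t. output biases
    return weights_gradient, bias_gradient
```

The output error keeps the same simple form, output minus target, as in the binary case, which is why the remaining backpropagation equations can be reused unchanged.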