As an example of the above, relevant for project 2 as well, let us consider a binary classification problem. As discussed in our logistic regression lectures, we defined a cost function in terms of the parameters $\boldsymbol{\beta}$ as
$$
\mathcal{C}(\boldsymbol{\beta}) = -\sum_{i=1}^{n}\left( y_i\log{p(y_i\vert x_i,\boldsymbol{\beta})} + (1-y_i)\log\left[1-p(y_i\vert x_i,\boldsymbol{\beta})\right]\right),
$$

where we had defined the logistic (sigmoid) function

$$
p(y_i=1\vert x_i,\boldsymbol{\beta}) = \frac{\exp{(\beta_0+\beta_1 x_i)}}{1+\exp{(\beta_0+\beta_1 x_i)}},
$$

and
$$
p(y_i=0\vert x_i,\boldsymbol{\beta}) = 1-p(y_i=1\vert x_i,\boldsymbol{\beta}).
$$

The parameters $\boldsymbol{\beta}$ were then determined with a minimization method such as gradient descent or Newton-Raphson's method.
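To make this concrete, here is a minimal NumPy sketch of the logistic regression cost and its gradient, minimized with plain gradient descent. The toy data, learning rate, and function names are illustrative choices, not part of the lecture material.

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy_cost(beta, x, y):
    """C(beta) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    p = sigmoid(beta[0] + beta[1] * x)   # p(y_i = 1 | x_i, beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(beta, x, y):
    """Analytic gradient of the cost with respect to (beta_0, beta_1)."""
    p = sigmoid(beta[0] + beta[1] * x)
    return np.array([np.sum(p - y), np.sum((p - y) * x)])

# Plain gradient descent on a hypothetical toy data set
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = (x + 0.5 * rng.normal(size=100) > 0).astype(float)

beta = np.zeros(2)
eta = 0.01                        # learning rate
for _ in range(1000):
    beta -= eta * gradient(beta, x, y)
print(beta, cross_entropy_cost(beta, x, y))
```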
Now we replace $x_i$ with the activation $z_i^l$ for a given layer $l$ and the outputs as $y_i = a_i^l = f(z_i^l)$, with $z_i^l$ now being a function of the weights $w_{ij}^l$ and biases $b_i^l$. We have then
$$
a_i^l = y_i = \frac{\exp{(z_i^l)}}{1+\exp{(z_i^l)}},
$$

with

$$
z_i^l = \sum_{j}w_{ij}^l a_j^{l-1}+b_i^l,
$$

where the superscript $l-1$ indicates that these are the outputs from layer $l-1$. Our cost function at the final layer $l=L$ is now
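As a small illustration of this feed-forward step, the sketch below computes $z_i^l$ and $a_i^l$ for a single layer with the sigmoid activation. The layer sizes and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(a_prev, W, b):
    """One layer: z^l_i = sum_j w^l_ij a^{l-1}_j + b^l_i, then a^l_i = f(z^l_i)."""
    z = W @ a_prev + b
    return sigmoid(z), z

# Example: layer l-1 has 4 nodes, layer l has 3 nodes
rng = np.random.default_rng(0)
a_prev = rng.normal(size=4)      # outputs a^{l-1}_j from the previous layer
W = rng.normal(size=(3, 4))      # weights w^l_ij
b = np.zeros(3)                  # biases b^l_i
a, z = feed_forward(a_prev, W, b)
```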
$$
\mathcal{C}(\boldsymbol{W}) = -\sum_{i=1}^{n}\left( t_i\log{a_i^L} + (1-t_i)\log\left(1-a_i^L\right)\right),
$$

where we have defined the targets $t_i$. The derivatives of the cost function with respect to the output $a_i^L$ are then easily calculated and we get
$$
\frac{\partial \mathcal{C}(\boldsymbol{W})}{\partial a_i^L} = \frac{a_i^L-t_i}{a_i^L(1-a_i^L)}.
$$

In case we use another activation function than the logistic one, we need to evaluate other derivatives.
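A short sketch of this derivative follows. The arrays `a_L` and `t` are hypothetical values, and the last lines simply illustrate how the derivative combines with the sigmoid derivative $f'(z_i^L)=a_i^L(1-a_i^L)$.

```python
import numpy as np

def cost_derivative(a_L, t):
    """dC/da^L_i = (a^L_i - t_i) / (a^L_i (1 - a^L_i))."""
    return (a_L - t) / (a_L * (1.0 - a_L))

# With the sigmoid, f'(z^L_i) = a^L_i (1 - a^L_i), so the output error
# delta^L_i = dC/da^L_i * f'(z^L_i) reduces to a^L_i - t_i.
a_L = np.array([0.8, 0.3, 0.6])   # hypothetical outputs of the final layer
t   = np.array([1.0, 0.0, 1.0])   # targets
delta_L = cost_derivative(a_L, t) * a_L * (1.0 - a_L)
print(delta_L)                    # equals a_L - t
```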