Example: binary classification problem

As an example of the above, relevant for project 2 as well, let us consider a binary class. As discussed in our logistic regression lectures, we defined a cost function in terms of the parameters \( \beta \) as

$$ \mathcal{C}(\boldsymbol{\beta}) = - \sum_{i=1}^n \left(y_i\log{p(y_i \vert x_i,\boldsymbol{\beta})}+(1-y_i)\log{1-p(y_i \vert x_i,\boldsymbol{\beta})}\right), $$

where we had defined the logistic (sigmoid) function

$$ p(y_i =1\vert x_i,\boldsymbol{\beta})=\frac{\exp{(\beta_0+\beta_1 x_i)}}{1+\exp{(\beta_0+\beta_1 x_i)}}, $$

and

$$ p(y_i =0\vert x_i,\boldsymbol{\beta})=1-p(y_i =1\vert x_i,\boldsymbol{\beta}). $$

The parameters \( \boldsymbol{\beta} \) were defined using a minimization method like gradient descent or Newton-Raphson's method.

Now we replace \( x_i \) with the activation \( z_i^l \) for a given layer \( l \) and the outputs as \( y_i=a_i^l=f(z_i^l) \), with \( z_i^l \) now being a function of the weights \( w_{ij}^l \) and biases \( b_i^l \). We have then

$$ a_i^l = y_i = \frac{\exp{(z_i^l)}}{1+\exp{(z_i^l)}}, $$

with

$$ z_i^l = \sum_{j}w_{ij}^l a_j^{l-1}+b_i^l, $$

where the superscript \( l-1 \) indicates that these are the outputs from layer \( l-1 \). Our cost function at the final layer \( l=L \) is now

$$ \mathcal{C}(\boldsymbol{W}) = - \sum_{i=1}^n \left(t_i\log{a_i^L}+(1-t_i)\log{(1-a_i^L)}\right), $$

where we have defined the targets \( t_i \). The derivatives of the cost function with respect to the output \( a_i^L \) are then easily calculated and we get

$$ \frac{\partial \mathcal{C}(\boldsymbol{W})}{\partial a_i^L} = \frac{a_i^L-t_i}{a_i^L(1-a_i^L)}. $$

In case we use another activation function than the logistic one, we need to evaluate other derivatives.