Example: binary classification problem

As an example of the above, relevant for project 2 as well, let us consider a binary classification problem. As discussed in our logistic regression lectures, we defined a cost function in terms of the parameters \( \hat{\beta} \) as $$ \mathcal{C}(\hat{\beta}) = - \sum_{i=1}^n \left(y_i\log{p(y_i \vert x_i,\hat{\beta})}+(1-y_i)\log{\left(1-p(y_i \vert x_i,\hat{\beta})\right)}\right), $$ where we had defined the logistic (sigmoid) function $$ p(y_i =1\vert x_i,\hat{\beta})=\frac{\exp{(\beta_0+\beta_1 x_i)}}{1+\exp{(\beta_0+\beta_1 x_i)}}, $$ and $$ p(y_i =0\vert x_i,\hat{\beta})=1-p(y_i =1\vert x_i,\hat{\beta}). $$ The parameters \( \hat{\beta} \) were determined using a minimization method such as gradient descent or the Newton-Raphson method.
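A minimal sketch in Python may help fix the notation for these two expressions. The function names sigmoid and cross_entropy_cost, as well as the toy data, are our own illustrative choices and not part of the notes; the one-feature model \( \beta_0+\beta_1 x_i \) follows the formula above.

import numpy as np

def sigmoid(t):
    # Logistic function p = exp(t)/(1 + exp(t)), written in the equivalent
    # and numerically more stable form 1/(1 + exp(-t)).
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy_cost(beta, x, y):
    # C(beta) = -sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]
    # with p_i = sigmoid(beta0 + beta1 * x_i), as in the cost function above.
    p = sigmoid(beta[0] + beta[1] * x)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Small illustrative call with made-up data
x = np.array([0.1, 0.4, 0.8, 1.2])
y = np.array([0, 0, 1, 1])
beta = np.array([-1.0, 2.0])
print(cross_entropy_cost(beta, x, y))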

Now we replace \( x_i \) with the activation \( z_i^l \) for a given layer \( l \) and the outputs as \( y_i=a_i^l=f(z_i^l) \), with \( z_i^l \) now being a function of the weights \( w_{ij}^l \) and biases \( b_i^l \). We then have $$ a_i^l = y_i = \frac{\exp{(z_i^l)}}{1+\exp{(z_i^l)}}, $$ with $$ z_i^l = \sum_{j}w_{ij}^l a_j^{l-1}+b_i^l, $$ where the superscript \( l-1 \) indicates that these are the outputs from layer \( l-1 \). Our cost function at the final layer \( l=L \) is now $$ \mathcal{C}(\hat{W}) = - \sum_{i=1}^n \left(t_i\log{a_i^L}+(1-t_i)\log{(1-a_i^L)}\right), $$ where we have defined the targets \( t_i \). The derivatives of the cost function with respect to the output \( a_i^L \) are then easily calculated and we get $$ \frac{\partial \mathcal{C}(\hat{W})}{\partial a_i^L} = \frac{a_i^L-t_i}{a_i^L(1-a_i^L)}. $$ If we use an activation function other than the logistic one, we need to evaluate other derivatives.
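To make the last formulas concrete, the following Python sketch evaluates the output-layer activation \( a_i^L \) and the derivative \( \partial \mathcal{C}/\partial a_i^L \). The helper names output_layer and dcost_da and the random toy numbers are illustrative assumptions, not part of the notes.

import numpy as np

def sigmoid(z):
    # a_i^l = exp(z_i^l)/(1 + exp(z_i^l)), the logistic activation.
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(a_prev, W, b):
    # z_i^L = sum_j w_{ij}^L a_j^{L-1} + b_i^L, followed by the logistic activation.
    z = W @ a_prev + b
    return sigmoid(z)

def dcost_da(a_L, t):
    # dC/da_i^L = (a_i^L - t_i) / (a_i^L (1 - a_i^L)) for the cross-entropy cost.
    return (a_L - t) / (a_L * (1.0 - a_L))

# Tiny illustration: three outputs from layer L-1 feeding two output nodes
rng = np.random.default_rng(0)
a_prev = rng.uniform(size=3)     # a_j^{L-1}
W = rng.normal(size=(2, 3))      # w_{ij}^L
b = np.zeros(2)                  # b_i^L
t = np.array([1.0, 0.0])         # targets t_i

a_L = output_layer(a_prev, W, b)
print(dcost_da(a_L, t))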