Gradient descent

The idea of the gradient descent algorithm is to update the parameters in a direction in which the cost function decreases, so that it moves toward a minimum.

In general, the parameters \( \boldsymbol{\omega} \) of a cost function \( C(\boldsymbol{x}, \boldsymbol{\omega}) \) are updated as follows:

$$ \boldsymbol{\omega}_{\text{new} } = \boldsymbol{\omega} - \lambda \nabla_{\boldsymbol{\omega}} C(\boldsymbol{x}, \boldsymbol{\omega}) $$

for a number of iterations or until \( \big\| \boldsymbol{\omega}_{\text{new} } - \boldsymbol{\omega} \big\| \) becomes smaller than some given tolerance.

The value of \( \lambda \), often called the learning rate, decides how large a step the algorithm takes in the direction of \( -\nabla_{\boldsymbol{\omega}} C(\boldsymbol{x}, \boldsymbol{\omega}) \), that is, opposite to the gradient. The notation \( \nabla_{\boldsymbol{\omega}} \) denotes the gradient with respect to the elements of \( \boldsymbol{\omega} \).
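
The update rule and the stopping criterion can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function name `gradient_descent`, its arguments, and the quadratic test gradient `quadratic_grad` are assumptions made for this example, and the gradient is assumed to be available as a function of \( \boldsymbol{\omega} \) alone.

```python
import numpy as np


def gradient_descent(grad, omega0, lmb=0.1, max_iter=1000, tol=1e-8):
    """Iterate omega_new = omega - lmb * grad(omega).

    Stops after max_iter iterations or when ||omega_new - omega|| < tol.
    All names and default values here are illustrative assumptions.
    """
    omega = np.asarray(omega0, dtype=float)
    for _ in range(max_iter):
        omega_new = omega - lmb * grad(omega)
        # Stopping criterion: the update has become smaller than the tolerance
        if np.linalg.norm(omega_new - omega) < tol:
            return omega_new
        omega = omega_new
    return omega


def quadratic_grad(omega):
    # Gradient of the toy cost C(omega) = ||omega - (1, -2)||^2 (hypothetical example)
    return 2.0 * (omega - np.array([1.0, -2.0]))


omega_min = gradient_descent(quadratic_grad, omega0=[0.0, 0.0])
print(omega_min)
```

For this toy cost the iterates approach the minimum at \( (1, -2) \). A larger \( \lambda \) takes bigger steps but may overshoot, while a smaller one needs more iterations.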

In our case, we have to minimize the cost function \( C(\boldsymbol{x}, P) \) with respect to the two sets of weights and biases, that is, those of the hidden layer \( P_{\text{hidden} } \) and those of the output layer \( P_{\text{output} } \).

This means that \( P_{\text{hidden} } \) and \( P_{\text{output} } \) are updated by

$$ \begin{aligned} P_{\text{hidden},\text{new}} &= P_{\text{hidden}} - \lambda \nabla_{P_{\text{hidden}}} C(\boldsymbol{x}, P) \\ P_{\text{output},\text{new}} &= P_{\text{output}} - \lambda \nabla_{P_{\text{output}}} C(\boldsymbol{x}, P) \end{aligned} $$
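
The two updates above might be sketched as follows. Everything in this block is an assumption made for illustration: the layout of `P` as a list `[P_hidden, P_output]` with biases in the first column, the sigmoid hidden layer, the placeholder cost (a mean squared deviation from \( \sin(x) \), standing in for the actual cost function), and the central finite-difference gradient, which stands in for whatever gradient routine is used in practice.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def network(x, P):
    # Assumed layout: P[0] has shape (n_hidden, 2), P[1] has shape (1, n_hidden + 1);
    # column 0 of each array holds the biases.
    P_hidden, P_output = P
    x_in = np.vstack([np.ones_like(x), x])                  # (2, n_points)
    a_hidden = sigmoid(P_hidden @ x_in)                     # (n_hidden, n_points)
    a_hidden = np.vstack([np.ones((1, x.size)), a_hidden])  # add bias row
    return (P_output @ a_hidden).ravel()


def cost(x, P):
    # Placeholder cost for illustration only
    return np.mean((network(x, P) - np.sin(x)) ** 2)


def numerical_gradient(cost, x, P, eps=1e-6):
    # Central finite differences with respect to every element of every array in P
    grads = []
    for k in range(len(P)):
        g = np.zeros_like(P[k])
        for idx in np.ndindex(P[k].shape):
            old = P[k][idx]
            P[k][idx] = old + eps
            c_plus = cost(x, P)
            P[k][idx] = old - eps
            c_minus = cost(x, P)
            P[k][idx] = old  # restore the original value
            g[idx] = (c_plus - c_minus) / (2.0 * eps)
        grads.append(g)
    return grads


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
P = [rng.standard_normal((5, 2)), rng.standard_normal((1, 6))]
lmb = 0.01  # learning rate lambda

cost_before = cost(x, P)
for _ in range(200):
    grad_hidden, grad_output = numerical_gradient(cost, x, P)
    # Both parameter sets are updated with the same learning rate
    P[0] = P[0] - lmb * grad_hidden
    P[1] = P[1] - lmb * grad_output
cost_after = cost(x, P)
print(cost_before, cost_after)
```

Both gradients are evaluated at the current `P` before either array is changed, so the hidden and output layers are updated simultaneously, as in the equations above.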