Momentum-based GD

Stochastic gradient descent (SGD) is almost always used with a momentum or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically implemented as follows:

$$ \begin{align} \mathbf{v}_{t}&=\gamma \mathbf{v}_{t-1}+\eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t) \nonumber \\ \boldsymbol{\theta}_{t+1}&= \boldsymbol{\theta}_t -\mathbf{v}_{t}, \tag{1} \end{align} $$

where we have introduced a momentum parameter \( \gamma \), with \( 0\le\gamma\le 1 \), and for brevity we dropped the explicit notation indicating that the gradient is to be taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations, it is clear that \( \mathbf{v}_t \) is a running average of recently encountered gradients, and that \( (1-\gamma)^{-1} \) sets the characteristic time scale for the memory used in the averaging procedure. Consistent with this, when \( \gamma=0 \), GDM reduces to ordinary SGD as discussed earlier. An equivalent way of writing the updates is

$$ \Delta \boldsymbol{\theta}_{t+1} = \gamma \Delta \boldsymbol{\theta}_t - \eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t), $$

where we have defined \( \Delta \boldsymbol{\theta}_{t}= \boldsymbol{\theta}_t-\boldsymbol{\theta}_{t-1} \).
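
As a concrete illustration, the following is a minimal NumPy sketch of the GDM update in Eq. (1). The quadratic loss, the fixed learning rate \( \eta \), the value of \( \gamma \), and the use of full-batch gradients (rather than mini-batches) are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def grad_E(theta, A):
    """Gradient of a toy quadratic loss E(theta) = 0.5 * theta^T A theta."""
    return A @ theta

def gdm(theta0, A, eta=0.1, gamma=0.9, n_steps=100):
    """Gradient descent with momentum, Eq. (1), with a fixed learning rate."""
    theta = theta0.copy()
    v = np.zeros_like(theta)  # momentum "velocity", v_0 = 0
    for _ in range(n_steps):
        # v_t = gamma * v_{t-1} + eta * grad E(theta_t)
        v = gamma * v + eta * grad_E(theta, A)
        # theta_{t+1} = theta_t - v_t
        theta = theta - v
    return theta

# Example: an anisotropic quadratic bowl with minimum at the origin.
A = np.diag([1.0, 10.0])
theta_final = gdm(np.array([5.0, 5.0]), A)
print(theta_final)  # approaches [0, 0]
```

Setting `gamma=0.0` in this sketch recovers plain gradient descent, which makes the role of the memory term easy to compare directly.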