
Momentum-based GD

Stochastic gradient descent (SGD) is almost always used with a momentum or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically implemented as follows:

\begin{align} \mathbf{v}_{t}&=\gamma \mathbf{v}_{t-1}+\eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t) \nonumber \\ \boldsymbol{\theta}_{t+1}&= \boldsymbol{\theta}_t -\mathbf{v}_{t}, \tag{1} \end{align}

where we have introduced a momentum parameter \gamma , with 0\le\gamma\le 1 , and for brevity dropped the explicit notation indicating that the gradient is taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations it is clear that \mathbf{v}_t is a running average of recently encountered gradients, with (1-\gamma)^{-1} setting the characteristic time scale of the memory used in the averaging procedure. Consistent with this, when \gamma=0 the algorithm reduces to the ordinary SGD discussed earlier. An equivalent way of writing the updates is

\Delta \boldsymbol{\theta}_{t+1} = \gamma \Delta \boldsymbol{\theta}_t - \eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t),

where we have defined \Delta \boldsymbol{\theta}_{t}= \boldsymbol{\theta}_t-\boldsymbol{\theta}_{t-1} .
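
As a concrete illustration, the following is a minimal NumPy sketch of the GDM update rule in Eq. (1). The quadratic cost, the matrix A, and the values of \gamma and \eta_t are placeholder choices introduced only for this example; they are not prescribed in the text.

```python
import numpy as np

# Illustrative cost E(theta) = 0.5 * theta^T A theta, with anisotropic
# curvature so that the momentum term has a visible effect.
A = np.diag([10.0, 1.0])

def grad_E(theta):
    """Gradient of the illustrative quadratic cost."""
    return A @ theta

gamma = 0.9                    # momentum parameter, 0 <= gamma <= 1
eta = 0.05                     # learning rate eta_t (held constant here)

theta = np.array([1.0, 1.0])   # initial parameters theta_0
v = np.zeros_like(theta)       # initial velocity v_0

for t in range(100):
    v = gamma * v + eta * grad_E(theta)   # v_t = gamma v_{t-1} + eta_t grad E(theta_t)
    theta = theta - v                     # theta_{t+1} = theta_t - v_t

print(theta)  # approaches the minimum at the origin
```

Setting gamma = 0 in this sketch recovers the plain SGD update, consistent with the remark above.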