Momentum parameter

Notice that this equation is identical to previous one if we identify the position of the particle, \( \mathbf{w} \), with the parameters \( \boldsymbol{\theta} \). This allows us to identify the momentum parameter and learning rate with the mass of the particle and the viscous drag as:

$$ \gamma= {m \over m +\mu \Delta t }, \qquad \eta = {(\Delta t)^2 \over m +\mu \Delta t}. $$

Thus, as the name suggests, the momentum parameter is proportional to the mass of the particle and effectively provides inertia. Furthermore, in the large viscosity/small learning rate limit, our memory time scales as \( (1-\gamma)^{-1} \approx m/(\mu \Delta t) \).

Why is momentum useful? SGD momentum helps the gradient descent algorithm gain speed in directions with persistent but small gradients even in the presence of stochasticity, while suppressing oscillations in high-curvature directions. This becomes especially important in situations where the landscape is shallow and flat in some directions and narrow and steep in others. It has been argued that first-order methods (with appropriate initial conditions) can perform comparable to more expensive second order methods, especially in the context of complex deep learning models.

These beneficial properties of momentum can sometimes become even more pronounced by using a slight modification of the classical momentum algorithm called Nesterov Accelerated Gradient (NAG).

In the NAG algorithm, rather than calculating the gradient at the current parameters, \( \nabla_\theta E(\boldsymbol{\theta}_t) \), one calculates the gradient at the expected value of the parameters given our current momentum, \( \nabla_\theta E(\boldsymbol{\theta}_t +\gamma \mathbf{v}_{t-1}) \). This yields the NAG update rule

$$ \begin{align} \mathbf{v}_{t}&=\gamma \mathbf{v}_{t-1}+\eta_{t}\nabla_\theta E(\boldsymbol{\theta}_t +\gamma \mathbf{v}_{t-1}) \nonumber \\ \boldsymbol{\theta}_{t+1}&= \boldsymbol{\theta}_t -\mathbf{v}_{t}. \tag{2} \end{align} $$

One of the major advantages of NAG is that it allows for the use of a larger learning rate than GDM for the same choice of \( \gamma \).