ADAM optimizer

A related algorithm is the ADAM optimizer. In ADAM, we keep a running average of both the first and second moments of the gradient and use this information to adaptively change the learning rate for different parameters. The method is efficient when working with large problems involving lots of data and/or parameters. It is a combination of the gradient descent with momentum algorithm and the RMSprop algorithm discussed above.

In addition to keeping a running average of the first and second moments of the gradient (i.e. \( \mathbf{m}_t=\mathbb{E}[\mathbf{g}_t] \) and \( \mathbf{s}_t=\mathbb{E}[\mathbf{g}^2_t] \), respectively), ADAM performs an additional bias correction to account for the fact that we are estimating the first two moments of the gradient using a running average (denoted by the hats in the update rule below). The update rule for ADAM is given by (where multiplication and division are once again understood to be element-wise operations below)

$$ \begin{align} \mathbf{g}_t &= \nabla_\theta E(\boldsymbol{\theta}) \tag{5}\\ \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \mathbf{g}_t \nonumber \\ \mathbf{s}_t &= \beta_2 \mathbf{s}_{t-1} + (1-\beta_2)\mathbf{g}_t^2 \nonumber \\ \hat{\mathbf{m}}_t &= {\mathbf{m}_t \over 1-\beta_1^t} \nonumber \\ \hat{\mathbf{s}}_t &= {\mathbf{s}_t \over 1-\beta_2^t} \nonumber \\ \boldsymbol{\theta}_{t+1} &= \boldsymbol{\theta}_t - \eta_t { \hat{\mathbf{m}}_t \over \sqrt{\hat{\mathbf{s}}_t} + \epsilon}, \tag{6} \end{align} $$

where \( \beta_1 \) and \( \beta_2 \) set the memory lifetime of the first and second moments and are typically taken to be \( 0.9 \) and \( 0.99 \), respectively, and \( \eta \) and \( \epsilon \) play the same role as in RMSprop.
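To make the update concrete, here is a minimal NumPy sketch of Eqs. (5)-(6). The function name `adam_step`, the quadratic test cost, and the specific hyperparameter values are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def adam_step(theta, grad, m, s, t, eta=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    """One ADAM update following Eqs. (5)-(6); all operations are element-wise."""
    m = beta1 * m + (1 - beta1) * grad        # running average of the first moment
    s = beta2 * s + (1 - beta2) * grad**2     # running average of the second moment
    m_hat = m / (1 - beta1**t)                # bias corrections (t counts from 1)
    s_hat = s / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

# Illustrative run on the quadratic cost E(theta) = 0.5 ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m, s = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = theta
    theta, m, s = adam_step(theta, grad, m, s, t, eta=0.01)
print(theta)   # close to the minimum at the origin
```

Note that, because of the bias correction, even the very first step has magnitude of order \( \eta \) rather than being suppressed by the zero-initialized running averages.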

As in RMSprop, the effective step size of a parameter depends on the magnitude of its gradient squared. To understand this better, let us rewrite this expression in terms of the variance \( \boldsymbol{\sigma}_t^2 = \hat{\mathbf{s}}_t - (\hat{\mathbf{m}}_t)^2 \). Consider a single parameter \( \theta_t \). Since \( \hat{s}_t = \sigma_t^2 + \hat{m}_t^2 \), the update rule for this parameter is given by

$$ \Delta \theta_{t+1}= -\eta_t { \hat{m}_t \over \sqrt{\sigma_t^2 + \hat{m}_t^2 }+\epsilon}. $$
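As a quick sanity check of this rewrite (using arbitrary illustrative values for the bias-corrected moments of a single parameter), the rewritten denominator agrees with the \( \sqrt{\hat{s}_t} + \epsilon \) of Eq. (6):

```python
import numpy as np

# Arbitrary illustrative values for the bias-corrected moments of one parameter.
m_hat, s_hat, eta, eps = 0.3, 0.5, 1e-3, 1e-8
sigma2 = s_hat - m_hat**2                        # variance as defined in the text
step_eq6 = -eta * m_hat / (np.sqrt(s_hat) + eps)
step_rewritten = -eta * m_hat / (np.sqrt(sigma2 + m_hat**2) + eps)
print(np.isclose(step_eq6, step_rewritten))      # True: the two forms are identical
```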