Adam: Update Rule Derivation

Finally, Adam updates parameters using the bias-corrected moments:

$$ \theta_{t+1} =\theta_t -\frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t, $$

where \( \epsilon \) is a small constant (e.g. \( 10^{-8} \)) that prevents division by zero, and the square root and division are applied element-wise. One iteration breaks down as follows:

  1. Compute gradient \( \nabla C(\theta_t) \).
  2. Update the first moment \( m_t \) and second moment \( v_t \), which are exponential moving averages of the gradient and of its element-wise square.
  3. Bias-correct: \( \hat{m}_t = m_t/(1-\beta_1^t) \), \( \; \hat{v}_t = v_t/(1-\beta_2^t) \).
  4. Compute step: \( \Delta \theta_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \).
  5. Update parameters: \( \theta_{t+1} = \theta_t - \alpha\, \Delta \theta_t \).
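
The five steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `adam_step`, the list-of-floats representation, and the toy cost in `grad_C` are assumptions made for the example, and the defaults \( \alpha = 10^{-3} \), \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \) are the values suggested in the original paper. The moving-average updates in step 2 are written in their standard form.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; t is the 1-based step counter."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        # Step 2: exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Step 3: bias correction (requires t >= 1).
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Step 4: element-wise step direction.
        delta = m_hat / (math.sqrt(v_hat) + eps)
        # Step 5: parameter update.
        new_theta.append(th - alpha * delta)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v


def grad_C(theta):
    # Hypothetical cost C(theta) = sum_i theta_i^2, so the gradient is 2 * theta_i.
    return [2.0 * th for th in theta]


# Usage: step 1 (the gradient) is computed by the caller on every iteration.
theta = [1.0, -2.0]
m = [0.0, 0.0]  # moments start at zero
v = [0.0, 0.0]
for t in range(1, 501):
    theta, m, v = adam_step(theta, grad_C(theta), m, v, t, alpha=0.01)
```

The function returns new moment lists rather than mutating its arguments, which keeps the step counter `t` and the optimizer state explicit in the calling loop.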

This matches the update rule given in the original Adam paper (Kingma and Ba, 2015), up to the indexing of \( \theta \).
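
As a quick check on the rule, consider the very first step. The following assumes the usual zero initialization \( m_0 = v_0 = 0 \) and the standard moving-average updates, and writes \( g_1 = \nabla C(\theta_1) \) for a single gradient component (the symbol \( g_1 \) is introduced here only for illustration):

$$ m_1 = (1-\beta_1)\, g_1, \qquad v_1 = (1-\beta_2)\, g_1^2 \quad\Longrightarrow\quad \hat{m}_1 = g_1, \qquad \hat{v}_1 = g_1^2, $$

$$ \Delta \theta_1 = \frac{g_1}{|g_1| + \epsilon} \approx \operatorname{sign}(g_1). $$

Under these assumptions, and provided \( |g_1| \gg \epsilon \), the first update moves each parameter by roughly \( \alpha \), independent of the gradient's scale.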