Adam: Update Rule Derivation
Finally, Adam updates the parameters using the bias-corrected moments:
$$
\theta_{t+1} =\theta_t -\frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t,
$$
where \( \epsilon \) is a small constant (e.g. \( 10^{-8} \)) that prevents division by zero when \( \hat{v}_t \) is close to zero.
Breaking it down:
- Compute gradient \( \nabla C(\theta_t) \).
- Update first moment \( m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla C(\theta_t) \) and second moment \( v_t = \beta_2 v_{t-1} + (1-\beta_2)\big(\nabla C(\theta_t)\big)^2 \) (exponential moving averages of the gradient and squared gradient).
- Bias-correct: \( \hat{m}_t = m_t/(1-\beta_1^t) \), \( \; \hat{v}_t = v_t/(1-\beta_2^t) \).
- Compute step: \( \Delta \theta_t = \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \).
- Update parameters: \( \theta_{t+1} = \theta_t - \alpha\, \Delta \theta_t \).
This is the Adam update rule as given in the original paper by Kingma and Ba.
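To make the sequence of steps concrete, here is a minimal NumPy sketch of a single Adam step. The function name `adam_step`, the hyperparameter defaults, and the toy quadratic cost in the usage snippet are illustrative assumptions, not taken from the original paper or any particular library.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `grad` is the gradient of C at theta; `t` starts at 1."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for m
    v_hat = v / (1 - beta2 ** t)              # bias correction for v
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v

# Usage on a toy quadratic cost C(theta) = ||theta - target||^2 (illustrative only)
theta = np.zeros(2)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    g = 2 * (theta - np.array([3.0, -1.0]))   # gradient of the toy cost
    theta, m, v = adam_step(theta, g, m, v, t)
```

Note that the moment estimates `m` and `v` are state carried across iterations, which is why the sketch returns them alongside the updated parameters.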