Adam: Bias Correction
To counteract initialization bias in \( m_t, v_t \), Adam computes bias-corrected estimates
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.
$$
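A short derivation shows where the factor \( 1-\beta_1^t \) comes from. It assumes the usual Adam recursion \( m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \) with \( m_0 = 0 \), which is not restated in this section, and a roughly constant gradient mean \( \mathbb{E}[g_i] \approx \mu \):
$$
m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i} g_i
\quad\Longrightarrow\quad
\mathbb{E}[m_t] \approx \mu \,(1-\beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i} = \mu \,(1-\beta_1^t).
$$
Dividing by \( 1-\beta_1^t \) therefore removes the shrinkage toward zero. The same argument applies to \( v_t \) with \( g_i^2 \) and \( \beta_2 \).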
- When \( t \) is small, \( 1-\beta_i^t \) is well below 1 (at \( t=1 \) it equals \( 1-\beta_i \)), so \( \hat{m}_t, \hat{v}_t \) are significantly larger than the raw \( m_t, v_t \), compensating for the bias toward the zero initialization.
- As \( t \) increases, \( 1-\beta_i^t \to 1 \), and \( \hat{m}_t, \hat{v}_t \) converge to \( m_t, v_t \).
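- Bias correction matters most for Adam’s stability in early iterations: for typical settings with \( \beta_2 \) closer to 1 than \( \beta_1 \), the raw \( v_t \) is shrunk toward zero more strongly than \( m_t \), so uncorrected early steps would be too large.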
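
The behavior in the bullets above can be checked numerically. The following is a minimal sketch, not a full optimizer: it tracks only the two moment estimates for a scalar gradient. The function name `adam_moments`, the constant gradient of 2.0, and the values \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \) are assumptions chosen for illustration.

```python
def adam_moments(grads, beta1=0.9, beta2=0.999):
    """Return (m_t, v_t, m_hat_t, v_hat_t) for each step of a scalar gradient stream.

    Hypothetical helper for illustration; beta1 and beta2 are assumed values.
    """
    m, v = 0.0, 0.0  # zero initialization is the source of the bias
    history = []
    for t, g in enumerate(grads, start=1):
        # Raw exponential moving averages of g and g^2
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias-corrected estimates: divide by 1 - beta^t
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        history.append((m, v, m_hat, v_hat))
    return history


# Constant gradient, so the "true" moments are g = 2.0 and g^2 = 4.0
history = adam_moments([2.0] * 1000)
for t in (1, 10, 100, 1000):
    m, v, m_hat, v_hat = history[t - 1]
    print(f"t={t:4d}  m={m:.4f}  m_hat={m_hat:.4f}  v={v:.4f}  v_hat={v_hat:.4f}")
```

With a constant gradient, the corrected estimates should match \( g \) and \( g^2 \) at every step up to floating-point error, while the raw \( m_t \) and \( v_t \) start near zero and only approach those values as \( t \) grows. The raw \( v_t \) lags far longer than \( m_t \) because \( \beta_2 \) is closer to 1.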