Adam maintains two exponential moving averages at each time step \( t \), one pair for each parameter (each component of \( \theta \)):
The Momentum term
$$ m_t = \beta_1 m_{t-1} + (1-\beta_1)\, \nabla C(\theta_t), $$
The RMS term
$$ v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla C(\theta_t))^2, $$
with typical \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \). Initialize \( m_0 = 0 \), \( v_0 = 0 \).
These are biased estimators of the true first and second moments of the gradients. The bias is toward zero and is largest at the start, since \( m_0 \) and \( v_0 \) are zero.
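To make the bias concrete, the two moving averages can be sketched in a few lines of NumPy. This is a minimal illustration, not a full Adam implementation: the function name `update_moments` and the constant test gradient `grad` are assumptions chosen for the example, and the square is taken elementwise.

```python
import numpy as np

def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """Advance both moving averages by one step (elementwise per parameter)."""
    m = beta1 * m + (1.0 - beta1) * grad      # momentum (first-moment) term
    v = beta2 * v + (1.0 - beta2) * grad**2   # RMS (second-moment) term
    return m, v

# Hypothetical constant gradient, used only to expose the start-up bias.
grad = np.array([1.0, -2.0])

# Initialize m_0 = 0 and v_0 = 0.
m = np.zeros_like(grad)
v = np.zeros_like(grad)

history = []
for t in range(1, 11):
    m, v = update_moments(m, v, grad)
    history.append((t, m.copy(), v.copy()))

m1, v1 = history[0][1], history[0][2]      # values after the first step
m10, v10 = history[-1][1], history[-1][2]  # values after ten steps

print("t=1 :", m1, v1)
print("t=10:", m10, v10)
```

For a constant gradient \( g \), unrolling the recursions gives \( m_t = (1-\beta_1^t)\, g \) and \( v_t = (1-\beta_2^t)\, g^2 \). After the first step the momentum term is therefore only 10% of the gradient and the RMS term only 0.1% of the squared gradient, which is the bias toward zero described above.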