Adam vs. AdaGrad and RMSProp
- AdaGrad: Uses per-coordinate scaling like Adam, but no momentum. Because it accumulates the entire history of squared gradients (no forgetting), the effective learning rate keeps shrinking and progress tends to slow down too much.
- RMSProp: Uses a moving average of squared gradients (like Adam's \( v_t \)) to maintain adaptive per-coordinate learning rates, but includes neither momentum nor bias correction.
- Adam: Effectively RMSProp + momentum + bias correction (sketched in code after this list)
  - Momentum (\( m_t \)) provides acceleration and smoother convergence.
  - Adaptive \( v_t \) scaling moderates the step size per dimension.
  - Bias correction (absent in AdaGrad/RMSProp) compensates for \( m_t \) and \( v_t \) being initialized at zero, so the estimates are reliable in the first few steps.
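To make the comparison concrete, here is a minimal NumPy sketch of one update step for each optimizer. The function names, the mutable `state` dict, and the default hyperparameters (`lr`, `beta1`, `beta2`, `eps`) are illustrative choices, not part of the original discussion.

```python
import numpy as np

def adagrad_step(param, grad, state, lr=0.01, eps=1e-8):
    # Accumulate the full history of squared gradients (no forgetting),
    # so the effective step size only shrinks over time.
    state["G"] = state.get("G", np.zeros_like(param)) + grad**2
    return param - lr * grad / (np.sqrt(state["G"]) + eps)

def rmsprop_step(param, grad, state, lr=0.001, beta2=0.9, eps=1e-8):
    # Exponential moving average of squared gradients: old history decays,
    # so the per-coordinate scale tracks recent gradient magnitudes.
    v = state.get("v", np.zeros_like(param))
    state["v"] = beta2 * v + (1 - beta2) * grad**2
    return param - lr * grad / (np.sqrt(state["v"]) + eps)

def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMSProp-style v_t plus a momentum term m_t, with bias correction
    # to compensate for both averages starting at zero.
    t = state.get("t", 0) + 1
    m = state.get("m", np.zeros_like(param))
    v = state.get("v", np.zeros_like(param))
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1**t)   # bias-corrected first moment
    v_hat = v / (1 - beta2**t)   # bias-corrected second moment
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)
```

Passing the same `state` dict across iterations lets all three be dropped into one training loop on a toy problem to compare their behavior directly.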
In practice, Adam often converges faster and is more robust to hyperparameter choices than RMSProp or AdaGrad alone.