Adam Optimizer
Why combine Momentum and RMSProp? Adaptive Moment Estimation (Adam) was introduced by Kingma and Ba (2014) to combine the benefits of momentum and RMSProp.
- Momentum: fast convergence by smoothing gradients (accelerates along the persistent gradient direction).
- Adaptive rates (RMSProp): per-dimension learning-rate scaling for stability (handles different feature scales and sparse gradients).
- Adam uses both: it maintains exponential moving averages of the first moment (the gradients) and the second moment (the squared gradients).
- It also corrects the bias in these moving averages, which matters most in early iterations when the averages are still close to their zero initialization (see the sketch after this list).
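A minimal NumPy sketch of a single Adam step, tying the two moment estimates and the bias correction together. The function name `adam_update`, the variable names, and the toy quadratic in the usage loop are illustrative assumptions; the default hyperparameters (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are the ones suggested by Kingma and Ba (2014).

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch, not a library implementation).

    theta : parameter vector
    grad  : gradient of the loss at theta
    m, v  : first/second moment estimates (initialized to zeros)
    t     : 1-based iteration counter
    """
    # First moment: exponential moving average of gradients (momentum part).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (RMSProp part).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early estimates are
    # biased toward zero; dividing by (1 - beta^t) compensates for this.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-dimension adaptive step: coordinates with large accumulated squared
    # gradients take smaller steps.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage (assumed example): minimize a quadratic with minimum at (1, -2, 0.5).
theta = np.zeros(3)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 201):
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))
    theta, m, v = adam_update(theta, grad, m, v, t)
```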
Result: Adam is robust, converges faster than plain SGD with less hyperparameter tuning, and often outperforms SGD with momentum in practice.