Adam Optimizer

Why combine momentum and RMSProp? Adaptive Moment Estimation (Adam), introduced by Kingma and Ba (2014), combines the benefits of momentum and RMSProp:

  1. Momentum: fast convergence by smoothing gradients (accelerates along directions of consistent gradient).
  2. Adaptive rates (RMSProp): per-dimension learning-rate scaling for stability (handles different feature scales and sparse gradients).
  3. Adam uses both: it maintains exponential moving averages of both the first moment (the mean of the gradients) and the second moment (the mean of the squared gradients).
  4. Additionally, it corrects the bias in these moving averages, which is crucial in early iterations when the zero-initialized averages underestimate the true moments (see the sketch below).

Result: Adam is robust, achieves faster convergence with less tuning, and often outperforms SGD (with momentum) in practice.
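
To make the update concrete, below is a minimal NumPy sketch of a single Adam step. The hyperparameter defaults (lr = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8) follow Kingma and Ba (2014); the function name adam_update and the toy quadratic objective in the usage example are illustrative assumptions, not a reference implementation.

    import numpy as np

    def adam_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # First moment: exponential moving average of gradients (the momentum part).
        m = beta1 * m + (1 - beta1) * grad
        # Second moment: exponential moving average of squared gradients (the RMSProp part).
        v = beta2 * v + (1 - beta2) * grad**2
        # Bias correction: compensates for m and v being initialized at zero,
        # which matters most in the first few iterations (t is 1-based).
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        # Per-dimension adaptive step: a large second moment shrinks the effective learning rate.
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    # Usage: minimize the toy quadratic f(theta) = ||theta||^2, whose gradient is 2 * theta.
    theta = np.array([5.0, -3.0])
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, 501):
        theta, m, v = adam_update(theta, 2 * theta, m, v, t)
    print(theta)  # both coordinates approach 0

Without the bias-correction terms, m_hat and v_hat would be strongly biased toward zero in early steps, which is exactly the issue point 4 above refers to.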