Why Combine Momentum and RMSProp?

  1. Momentum: faster convergence by smoothing gradients with an exponential moving average, which accelerates updates along directions where the gradient is consistent over time.
  2. Adaptive rates (RMSProp): per-dimension learning-rate scaling based on a running average of squared gradients, which improves stability and handles different feature scales and sparse gradients.
  3. Adam combines both: it maintains exponential moving averages of the first moment (the gradient) and the second moment (the squared gradient).
  4. Additionally, it corrects the bias in these moving averages; because both averages are initialized at zero, they underestimate the true moments in early iterations, and the correction compensates for this (see the sketch below).

Result: Adam is robust to hyperparameter choices, typically converges faster with less tuning, and often outperforms SGD with momentum in practice.
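
A minimal sketch of a single Adam update in NumPy, to make the two moving averages and the bias correction concrete. The function name `adam_step` and the toy usage are illustrative assumptions; the hyperparameter defaults (lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited defaults from the Adam paper.

```python
import numpy as np

def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. params, grads, m, v are same-shape arrays;
    t is the 1-based iteration count. (Illustrative sketch.)"""
    # First moment: exponential moving average of gradients (the momentum part).
    m = beta1 * m + (1 - beta1) * grads
    # Second moment: exponential moving average of squared gradients (the RMSProp part).
    v = beta2 * v + (1 - beta2) * grads**2
    # Bias correction: m and v start at zero, so early averages are biased
    # toward zero; dividing by (1 - beta^t) compensates for this.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-dimension update: step scaled by the adaptive denominator.
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy usage (assumed example): minimize f(w) = ||w||^2.
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 201):
    grad = 2 * w          # gradient of ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
```

Note how the first moment plays the role of momentum's smoothed gradient, while dividing by the square root of the second moment reproduces RMSProp's per-dimension scaling; the bias-corrected estimates matter most in the first few iterations, when t is small and the correction factors are far from 1.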