Adam vs. AdaGrad and RMSProp

  1. AdaGrad: Uses per-coordinate scaling like Adam, but no momentum. Because it accumulates the entire squared-gradient history (no forgetting), the effective learning rate shrinks monotonically and training tends to slow down too much.
  2. RMSProp: Uses moving average of squared gradients (like Adam’s \( v_t \)) to maintain adaptive learning rates, but does not include momentum or bias-correction.
  3. Adam: Effectively RMSProp + momentum + bias-correction (see the sketch after this list).
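
To make the comparison concrete, here is a minimal NumPy sketch of one update step for each optimizer. The function names, state dictionary, and default hyperparameters are illustrative choices, not tied to any particular library's API.

```python
import numpy as np

def adagrad_update(w, g, state, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate *all* squared gradients; the denominator only grows."""
    state["G"] = state.get("G", np.zeros_like(w)) + g**2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_update(w, g, state, lr=0.001, beta2=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients (old history decays)."""
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * g**2
    return w - lr * g / (np.sqrt(state["v"]) + eps)

def adam_update(w, g, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: RMSProp-style v_t plus momentum m_t, both bias-corrected."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * g
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * g**2
    m_hat = state["m"] / (1 - beta1**t)  # bias correction for the first moment
    v_hat = state["v"] / (1 - beta2**t)  # bias correction for the second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```

The bias-correction terms matter early in training: with \( m_0 = v_0 = 0 \), the raw moving averages are biased toward zero for small \( t \), and dividing by \( 1 - \beta^t \) compensates for that.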

In practice, Adam often converges faster and is less sensitive to learning-rate tuning than RMSProp or AdaGrad alone.