AdaGrad Properties
- AdaGrad automatically tunes the step size for each parameter: parameters with large or volatile gradients get smaller steps, while those with small or infrequent gradients get relatively larger steps (see the update sketch after this list).
- No manual schedule needed: The accumulation \( r_t \) keeps increasing (or stays the same when the gradient is zero), so the step sizes \( \eta/\sqrt{r_t} \) are non-increasing. This has a similar effect to a learning rate schedule, but individualized per coordinate.
- Sparse data benefit: For very sparse features, \( r_{t,j} \) grows slowly, so that feature’s parameter retains a higher learning rate for longer, allowing it to make significant updates when it does receive a gradient signal.
- Convergence: In convex optimization, AdaGrad can be shown to achieve a sublinear regret bound, competitive with the best fixed learning rate chosen in hindsight for the problem. This effectively reduces the need to tune \( \eta \) by hand.
- Limitations: Because \( r_t \) accumulates without bound, AdaGrad’s learning rates can become extremely small over long training runs, potentially stalling progress. Later variants such as RMSProp, AdaDelta, and Adam address this by modifying the accumulation rule, e.g., replacing the running sum with an exponential moving average (contrasted in the second sketch below).
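
As a concrete illustration of the per-coordinate adaptation described above, here is a minimal NumPy sketch of the AdaGrad update. The learning rate `eta`, the stability constant `eps`, and the toy objective are illustrative assumptions, not part of the original text.

```python
import numpy as np

def adagrad_update(theta, g, r, eta=0.1, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients per coordinate,
    then scale each coordinate's step by eta / sqrt(r)."""
    r = r + g ** 2                              # r_t = r_{t-1} + g_t^2 (per coordinate)
    theta = theta - eta / (np.sqrt(r) + eps) * g  # step size eta/sqrt(r_t) is non-increasing
    return theta, r

# Toy example on f(theta) = theta_0^2 + theta_1^2, with one dense and one sparse coordinate.
theta = np.array([1.0, 1.0])
r = np.zeros(2)
for t in range(100):
    # Coordinate 0 gets a gradient every step; coordinate 1 only every 10th step.
    g = np.array([2.0 * theta[0], 2.0 * theta[1] if t % 10 == 0 else 0.0])
    theta, r = adagrad_update(theta, g, r)

print(theta, np.sqrt(r))  # the sparse coordinate keeps a larger effective step size
```

The sparse coordinate accumulates far less squared gradient, so when it does receive a signal its update is comparatively large, matching the sparse-data point above.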
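To illustrate how later variants change the accumulation rule mentioned in the limitations point, the sketch below contrasts AdaGrad's unbounded running sum with an RMSProp-style exponential moving average; the decay rate `rho` is an assumed illustrative value.

```python
import numpy as np

def adagrad_accumulate(r, g):
    # AdaGrad: unbounded running sum, so eta/sqrt(r) shrinks monotonically.
    return r + g ** 2

def rmsprop_accumulate(r, g, rho=0.9):
    # RMSProp-style: exponential moving average of squared gradients, so the
    # effective step size can recover when recent gradients are small.
    return rho * r + (1.0 - rho) * g ** 2
```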