Challenge: Choosing a Fixed Learning Rate
A fixed \( \eta \) is hard to get right:
- If \( \eta \) is too large, the updates can overshoot the minimum, causing oscillations or divergence.
- If \( \eta \) is too small, convergence is very slow (many iterations to make progress).
In practice, one often uses trial-and-error or schedules (decaying \( \eta \) over time) to find a workable balance.
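As a toy illustration of all three behaviors, consider gradient descent on \( f(x) = x^2 \), whose gradient is \( 2x \). The sketch below is illustrative only: the function, the starting point, the step counts, and the \( \eta_0 / (1 + t) \) decay schedule are assumptions chosen for clarity, not taken from any particular source.

```python
# Toy sketch (illustrative values): gradient descent on f(x) = x**2,
# comparing fixed step sizes against a simple 1/(1+t) decay schedule.

def gd_fixed(eta, x0=5.0, steps=20):
    """Gradient descent on f(x) = x**2 with a fixed step size eta."""
    x = x0
    for _ in range(steps):
        x -= eta * 2 * x  # update: x <- x - eta * f'(x), with f'(x) = 2x
    return x

def gd_decay(eta0, x0=5.0, steps=20):
    """Same descent, but with the decaying schedule eta_t = eta0 / (1 + t)."""
    x = x0
    for t in range(steps):
        x -= (eta0 / (1 + t)) * 2 * x
    return x

print(gd_fixed(1.1))   # too large: |1 - 2*eta| > 1, so the iterates diverge
print(gd_fixed(0.01))  # too small: still around 3.3, far from the minimum at 0
print(gd_fixed(0.4))   # a workable fixed eta: converges to ~0 quickly
print(gd_decay(1.1))   # the eta0 that diverged above converges once decayed
```

Each fixed-\( \eta \) run multiplies \( x \) by \( (1 - 2\eta) \) per step, which makes the overshoot condition \( |1 - 2\eta| > 1 \) easy to see; the decay schedule starts with the same overly large step but shrinks it before the oscillation can compound.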
For a function with steep directions and flat directions, a single global \( \eta \) may be inappropriate:
- Steep coordinates require a smaller step size to avoid oscillation.
- Flat/shallow coordinates could use a larger step to speed up progress.
- This issue is pronounced in high-dimensional problems with **sparse or varying-scale features**: we need a method that adjusts step sizes per feature, as the sketch below illustrates.
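To make this concrete, here is a minimal AdaGrad-style sketch, assuming the hypothetical ill-conditioned quadratic \( f(x, y) = 100x^2 + y^2 \) (steep along \( x \), flat along \( y \)); the function, constants, and step counts are illustrative choices, not from the text.

```python
import numpy as np

def grad(w):
    """Gradient of the illustrative quadratic f(x, y) = 100*x**2 + y**2."""
    return np.array([200 * w[0], 2 * w[1]])

def gd_global(eta, w0, steps=100):
    """Plain gradient descent with a single global step size."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= eta * grad(w)
    return w

def gd_adagrad(eta, w0, steps=100, eps=1e-8):
    """AdaGrad-style descent: scale each coordinate's step by the
    inverse root of its accumulated squared gradients."""
    w = np.array(w0, dtype=float)
    g2 = np.zeros_like(w)          # per-coordinate sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        g2 += g ** 2
        w -= eta * g / (np.sqrt(g2) + eps)
    return w

w0 = [1.0, 1.0]
# The global eta must stay below 2/200 = 0.01, or the steep x-coordinate
# oscillates and diverges; at that size, the flat y-coordinate crawls.
print(gd_global(0.009, w0))  # x converges fast, y is still far from 0
print(gd_adagrad(0.5, w0))   # per-coordinate scaling shrinks both together
```

The global run is forced to pick \( \eta \) for the steepest coordinate, leaving the flat one nearly untouched after 100 steps, whereas dividing by the accumulated gradient magnitudes effectively gives each coordinate its own step size.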