Challenge: Choosing a Fixed Learning Rate

A fixed \( \eta \) is hard to get right:

  1. If \( \eta \) is too large, the updates can overshoot the minimum, causing oscillations or divergence.
  2. If \( \eta \) is too small, convergence is very slow, requiring many iterations to make progress (both failure modes are illustrated in the sketch after this list).
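
To make the trade-off concrete, here is a minimal sketch in plain Python (the quadratic \( f(x) = \tfrac{1}{2} x^2 \) and the specific \( \eta \) values are illustrative choices, not from the original text):

```python
def gradient_descent(eta, x0=5.0, steps=20):
    """Gradient descent on f(x) = 0.5 * x^2, whose gradient is f'(x) = x."""
    x = x0
    for _ in range(steps):
        x = x - eta * x  # update rule: x <- x - eta * f'(x)
    return x

# The update is x <- (1 - eta) * x, so |1 - eta| > 1 (here, eta > 2)
# diverges, while a tiny eta shrinks x only slightly per step.
for eta in (2.5, 0.01, 0.5):
    print(f"eta = {eta:>4}: x after 20 steps = {gradient_descent(eta):.3g}")
```

With \( \eta = 2.5 \) the iterate blows up to roughly \( 1.7 \times 10^4 \), with \( \eta = 0.01 \) it barely moves from \( x_0 = 5 \), and \( \eta = 0.5 \) converges to near zero within the 20 steps.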

In practice, one often resorts to trial and error or to schedules (decaying \( \eta \) over time) to find a workable balance. For a function with both steep and flat directions, a single global \( \eta \) can be inappropriate:

  1. Steep coordinates require a smaller step size to avoid oscillation.
  2. Flat/shallow coordinates could use a larger step to speed up progress.
  3. This issue is pronounced in high-dimensional problems with **sparse or varying-scale features** – we need a method to adjust step sizes per feature (see the sketch below).
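
The sketch below illustrates both remedies on an ill-conditioned quadratic; the function \( f(x) = \tfrac{1}{2}(100\,x_1^2 + x_2^2) \), the decay schedule \( \eta_t = \eta_0 / (1 + k t) \), and all constants are illustrative assumptions, not from the original text:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * (100*x1^2 + x2^2):
# curvature 100 along x1 (steep), curvature 1 along x2 (flat).
curv = np.array([100.0, 1.0])

def grad(x):
    return curv * x  # gradient of f

def run(step_fn, steps=100):
    """Gradient descent where step_fn(t) returns a scalar or per-coordinate eta."""
    x = np.array([1.0, 1.0])
    for t in range(steps):
        x = x - step_fn(t) * grad(x)
    return x

# 1. Global eta: stability along x1 forces eta < 2/100, so x2 crawls.
print("global eta = 0.019:", run(lambda t: 0.019))

# 2. Decaying schedule eta_t = eta0 / (1 + k*t): safer over time,
#    but the flat coordinate x2 moves even more slowly.
print("decaying eta:      ", run(lambda t: 0.019 / (1 + 0.01 * t)))

# 3. Per-coordinate steps, scaled inversely to each curvature:
#    both coordinates now contract by the same factor every step.
print("per-coordinate eta:", run(lambda t: 0.9 / curv))
```

Only the per-coordinate variant converges quickly in both directions; any single global \( \eta \), decayed or not, is capped by the steep coordinate, leaving the flat one to stall.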