We scale the gradient by the inverse square root of the accumulated matrix \( H_t \). The AdaGrad update at step \( t \) is:
$$ \theta_{t+1} = \theta_t - \eta\, H_t^{-1/2} g_t, $$
where \( H_t^{-1/2} \) is the diagonal matrix with entries \( r_{t,1}^{-1/2}, \dots, r_{t,d}^{-1/2} \). In coordinates, this means each parameter \( j \) has an individual step size:
$$ \theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{r_{t,j}}}\, g_{t,j}. $$
In practice we add a small constant \( \epsilon \) to the denominator for numerical stability, avoiding division by zero when \( r_{t,j} = 0 \):
$$ \theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{\epsilon + r_{t,j}}}\, g_{t,j}. $$
Equivalently, the effective learning rate for parameter \( j \) at time \( t \) is \( \displaystyle \alpha_{t,j} = \frac{\eta}{\sqrt{\epsilon + r_{t,j}}} \). This decreases over time as \( r_{t,j} \) grows.
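As a concrete illustration, here is a minimal NumPy sketch of the coordinate-wise update above. It assumes the standard AdaGrad accumulation \( r_{t,j} = r_{t-1,j} + g_{t,j}^2 \); the function name `adagrad_step`, the toy objective, and the hyperparameter values are illustrative choices, not part of the text.

```python
import numpy as np

def adagrad_step(theta, r, grad, eta=0.01, eps=1e-8):
    """One AdaGrad update; theta, r, and grad are arrays of the same shape.

    r accumulates the sum of squared gradients per coordinate, so each
    parameter j is updated with step size eta / sqrt(eps + r[j]).
    """
    r = r + grad ** 2                                # accumulate squared gradients
    theta = theta - eta / np.sqrt(eps + r) * grad    # per-coordinate update
    return theta, r

# Usage on a toy quadratic f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)
for t in range(100):
    grad = theta                                     # gradient of the toy objective
    theta, r = adagrad_step(theta, r, grad, eta=0.5)
print(theta)                                         # coordinates shrink toward 0
```

Note how the effective step size for each coordinate shrinks as its accumulator `r[j]` grows, matching \( \alpha_{t,j} \) above: coordinates with large past gradients take smaller steps, while rarely updated coordinates keep a comparatively large learning rate.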