We scale the gradient by the inverse square root of the accumulated matrix \( H_t \). The AdaGrad update at step \( t \) is:
$$ \theta_{t+1} = \theta_t - \eta\, H_t^{-1/2} g_t, $$
where \( H_t^{-1/2} \) is the diagonal matrix with entries \( r_{t,1}^{-1/2}, \dots, r_{t,d}^{-1/2} \). In coordinates, this means each parameter \( j \) has an individual step size:
$$ \theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{r_{t,j}}}\, g_{t,j}. $$
In practice we add a small constant \( \epsilon \) to the denominator for numerical stability, avoiding division by zero when \( r_{t,j} = 0 \):
$$ \theta_{t+1,j} = \theta_{t,j} - \frac{\eta}{\sqrt{\epsilon + r_{t,j}}}\, g_{t,j}. $$
Equivalently, the effective learning rate for parameter \( j \) at time \( t \) is \( \displaystyle \alpha_{t,j} = \frac{\eta}{\sqrt{\epsilon + r_{t,j}}} \). This decreases over time as \( r_{t,j} \) grows.
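As a concrete illustration, here is a minimal NumPy sketch of the coordinate-wise update above. It assumes the standard AdaGrad accumulation \( r_{t,j} = r_{t-1,j} + g_{t,j}^2 \); the function name `adagrad_step`, the toy objective, and the hyperparameter values are illustrative choices, not part of the text.

```python
import numpy as np

def adagrad_step(theta, r, grad, eta=0.01, eps=1e-8):
    """One AdaGrad update; theta, r, and grad are arrays of the same shape.

    r accumulates the sum of squared gradients per coordinate, so each
    parameter j is updated with step size eta / sqrt(eps + r[j]).
    """
    r = r + grad ** 2                                # accumulate squared gradients
    theta = theta - eta / np.sqrt(eps + r) * grad    # per-coordinate update
    return theta, r

# Usage on a toy quadratic f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
r = np.zeros_like(theta)
for t in range(100):
    grad = theta                                     # gradient of the toy objective
    theta, r = adagrad_step(theta, r, grad, eta=0.5)
print(theta)                                         # coordinates shrink toward 0
```

Note how the effective step size for each coordinate shrinks as its accumulator `r[j]` grows, matching \( \alpha_{t,j} \) above: coordinates with large past gradients take smaller steps, while rarely updated coordinates keep a comparatively large learning rate.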