Derivation of the AdaGrad Algorithm

  1. AdaGrad maintains a running sum of squared gradients for each parameter (coordinate).
  2. Let \( g_t = \nabla C_{i_t}(x_t) \) be the gradient at step \( t \) (or a subgradient for nondifferentiable cases).
  3. Initialize \( r_0 = 0 \) (an all-zero vector in \( \mathbb{R}^d \)).
  4. At each iteration \( t \), update the accumulation:
$$ r_t = r_{t-1} + g_t \circ g_t, $$
  5. Here \( g_t \circ g_t \) denotes the element-wise square of the gradient vector, i.e. \( r_t^{(j)} = r_{t-1}^{(j)} + \big(g_t^{(j)}\big)^2 \) for each parameter \( j \) (a per-coordinate version of this update is sketched after this list).
  6. We can view \( H_t = \mathrm{diag}(r_t) \) as a diagonal matrix of past squared gradients. Initially \( H_0 = 0 \).
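
The accumulation step can be written in a few lines of NumPy. This is a minimal sketch of only the running sum \( r_t \) (equivalently the diagonal of \( H_t \)), not the full AdaGrad parameter update; `grad_C` and `x_iterates` are hypothetical stand-ins for \( \nabla C_{i_t} \) and the sequence of iterates, which are assumed to be given.

```python
import numpy as np

def accumulate_squared_gradients(grad_C, x_iterates):
    """Running per-coordinate sum of squared gradients: r_t = r_{t-1} + g_t ∘ g_t.

    grad_C     : hypothetical callable returning the (sub)gradient g_t at x_t
    x_iterates : sequence of iterates x_1, ..., x_T (assumed given here)
    """
    d = x_iterates[0].shape[0]
    r = np.zeros(d)                  # r_0 = 0, an all-zero vector in R^d
    for x_t in x_iterates:
        g_t = grad_C(x_t)            # g_t = grad C_{i_t}(x_t)
        r += g_t * g_t               # element-wise square, g_t ∘ g_t
    H = np.diag(r)                   # H_t = diag(r_t), diagonal matrix of past squared gradients
    return r, H
```

Keeping only the vector \( r_t \) (rather than the full matrix \( H_t \)) is what makes the per-coordinate bookkeeping cheap: the storage and update cost are both \( O(d) \).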