- AdaGrad maintains a running sum of squared gradients for each parameter (coordinate).
- Let \( g_t = \nabla C_{i_t}(x_t) \) be the gradient at step \( t \) (or a subgradient for nondifferentiable cases).
- Initialize \( r_0 = 0 \) (an all-zero vector in \( \mathbb{R}^d \)).
- At each iteration \( t \), update the accumulation:
$$
r_t = r_{t-1} + g_t \circ g_t .
$$
- Here \( g_t \circ g_t \) denotes the element-wise square of the gradient vector, so coordinate-wise the update reads \( r_t^{(j)} = r_{t-1}^{(j)} + \big(g_t^{(j)}\big)^2 \) for each parameter \( j \) (a minimal sketch follows this list).
- We can view \( H_t = \mathrm{diag}(r_t) \) as a diagonal matrix of past squared gradients. Initially \( H_0 = 0 \).
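A minimal NumPy sketch of this accumulation step, assuming a flat parameter vector of dimension \( d \); the function and variable names are illustrative, and the random vectors merely stand in for whatever stochastic gradients \( \nabla C_{i_t}(x_t) \) the model actually produces:

```python
import numpy as np

def adagrad_accumulate(r_prev, g):
    """One AdaGrad accumulation step: r_t = r_{t-1} + g_t ∘ g_t.

    r_prev: running sum of squared gradients so far, shape (d,)
    g:      current (sub)gradient g_t, shape (d,)
    """
    return r_prev + g * g  # element-wise square, added coordinate-wise

# Illustrative usage on a toy 3-parameter problem (names are hypothetical)
d = 3
r = np.zeros(d)             # r_0 = 0, an all-zero vector in R^d
for t in range(5):
    g = np.random.randn(d)  # stand-in for the stochastic gradient at step t
    r = adagrad_accumulate(r, g)

H = np.diag(r)              # H_t = diag(r_t): diagonal matrix of past squared gradients
```

Keeping only the vector \( r_t \) (rather than materializing \( H_t \)) is the natural implementation choice, since the diagonal matrix never needs to be stored explicitly.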