Derivation of the AdaGrad Algorithm

  1. AdaGrad maintains a running sum of squared gradients for each parameter (coordinate).
  2. Let \( g_t = \nabla C_{i_t}(x_t) \) be the gradient at step \( t \) (or a subgradient for nondifferentiable cases).
  3. Initialize \( r_0 = 0 \) (an all-zero vector in \( \mathbb{R}^d \)).
  4. At each iteration \( t \), update the accumulation:
$$ r_t = r_{t-1} + g_t \circ g_t, $$
  5. Here \( g_t \circ g_t \) denotes the element-wise square of the gradient vector, i.e. \( r_t^{(j)} = r_{t-1}^{(j)} + \big(g_t^{(j)}\big)^2 \) for each parameter \( j \) (a per-coordinate version of this update is sketched after this list).
  6. We can view \( H_t = \mathrm{diag}(r_t) \) as a diagonal matrix of past squared gradients. Initially \( H_0 = 0 \).
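
The accumulation step can be written in a few lines of NumPy. This is a minimal sketch of only the running sum \( r_t \) (equivalently the diagonal of \( H_t \)), not the full AdaGrad parameter update; `grad_C` and `x_iterates` are hypothetical stand-ins for \( \nabla C_{i_t} \) and the sequence of iterates, which are assumed to be given.

```python
import numpy as np

def accumulate_squared_gradients(grad_C, x_iterates):
    """Running per-coordinate sum of squared gradients: r_t = r_{t-1} + g_t ∘ g_t.

    grad_C     : hypothetical callable returning the (sub)gradient g_t at x_t
    x_iterates : sequence of iterates x_1, ..., x_T (assumed given here)
    """
    d = x_iterates[0].shape[0]
    r = np.zeros(d)                  # r_0 = 0, an all-zero vector in R^d
    for x_t in x_iterates:
        g_t = grad_C(x_t)            # g_t = grad C_{i_t}(x_t)
        r += g_t * g_t               # element-wise square, g_t ∘ g_t
    H = np.diag(r)                   # H_t = diag(r_t), diagonal matrix of past squared gradients
    return r, H
```

Keeping only the vector \( r_t \) (rather than the full matrix \( H_t \)) is what makes the per-coordinate bookkeeping cheap: the storage and update cost are both \( O(d) \).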