Addresses AdaGrad’s diminishing learning rate issue. Instead of a cumulative sum, it uses an exponentially decaying average of squared gradients:
$$ v_t = \rho\, v_{t-1} + (1-\rho)\,(\nabla C(\theta_t))^2, $$

with \( \rho \) typically \( 0.9 \) (or \( 0.99 \)).
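As a concrete illustration, here is a minimal sketch of a single RMSProp update in NumPy. The names `rmsprop_step`, `grad`, `lr`, and `eps` are illustrative assumptions, and the parameter update (dividing the step by \( \sqrt{v_t} + \epsilon \)) follows the standard formulation rather than anything quoted from these notes.

```python
import numpy as np

def rmsprop_step(theta, v, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: a decaying average of squared gradients scales the step."""
    g = grad(theta)                                # gradient of the cost at theta
    v = rho * v + (1 - rho) * g**2                 # v_t = rho * v_{t-1} + (1 - rho) * g^2 (element-wise)
    theta = theta - lr * g / (np.sqrt(v) + eps)    # per-parameter scaled step
    return theta, v

# Usage sketch: minimize C(theta) = ||theta||^2 from a random start.
theta = np.random.randn(3)
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = rmsprop_step(theta, v, grad=lambda t: 2 * t)
```

Because \( v_t \) is a moving average rather than a running sum, the effective learning rate no longer shrinks monotonically as it does in AdaGrad.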
RMSProp was first proposed by Geoff Hinton in unpublished lecture notes (2012).