In RMSprop, instead of a running average of the first moment of the gradient, we keep track of a running average of the second moment, denoted by \( \mathbf{s}_t=\mathbb{E}[\mathbf{g}_t^2] \). The update rule for RMSprop is
$$ \begin{align} \mathbf{g}_t &= \nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta}_t) \tag{3}\\ \mathbf{s}_t &=\beta \mathbf{s}_{t-1} +(1-\beta)\mathbf{g}_t^2 \nonumber \\ \boldsymbol{\theta}_{t+1}&=\boldsymbol{\theta}_t - \eta_t { \mathbf{g}_t \over \sqrt{\mathbf{s}_t +\epsilon}}, \nonumber \end{align} $$where \( \beta \) controls the averaging time of the second moment and is typically taken to be about \( \beta=0.9 \), \( \eta_t \) is a learning rate typically chosen to be \( 10^{-3} \), and \( \epsilon\sim 10^{-8} \) is a small regularization constant to prevent divergences. Multiplication and division by vectors are understood as element-wise operations. It is clear from this formula that the learning rate is reduced in directions where the norm of the gradient is consistently large. This greatly speeds up convergence by allowing us to use a larger learning rate for the flat directions.
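The update rule above translates directly into a few lines of NumPy. The sketch below is illustrative, not a reference implementation: the quadratic cost \( E(\boldsymbol{\theta})=\tfrac{1}{2}\boldsymbol{\theta}^T A \boldsymbol{\theta} \), the matrix `A`, and the learning rate used in the loop are assumptions chosen to make the toy example converge quickly, not values from the text.

```python
import numpy as np

def rmsprop_update(theta, grad, s, eta=1e-3, beta=0.9, eps=1e-8):
    """One RMSprop step following Eq. (3): rescale the gradient
    element-wise by the running RMS of recent gradients."""
    s = beta * s + (1 - beta) * grad**2            # running average of the second moment
    theta = theta - eta * grad / np.sqrt(s + eps)  # element-wise rescaled gradient step
    return theta, s

# Illustrative use on an ill-conditioned quadratic cost (one steep, one flat direction).
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
s = np.zeros_like(theta)
for t in range(2000):
    grad = A @ theta                               # gradient of E(theta) = 0.5 * theta^T A theta
    theta, s = rmsprop_update(theta, grad, s, eta=1e-2)  # larger eta for this toy problem
print(theta)  # both the steep and the flat coordinate end up close to the minimum at 0
```

Because the division by \( \sqrt{\mathbf{s}_t+\epsilon} \) normalizes each coordinate by its own gradient history, the steep and flat directions shrink at comparable rates, which is the behavior described above.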