Momentum-based GD

Stochastic gradient descent (SGD) is almost always used with a momentum or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically implemented as follows:

$$
\mathbf{v}_t = \gamma \mathbf{v}_{t-1} + \eta_t \nabla_\theta E(\boldsymbol{\theta}_t),
$$
$$
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \mathbf{v}_t,
$$

where we have introduced a momentum parameter $\gamma$, with $0 \le \gamma \le 1$, and for brevity we have dropped the explicit notation indicating that the gradient is taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations, it is clear that $\mathbf{v}_t$ is a running average of recently encountered gradients, and that $(1-\gamma)^{-1}$ sets the characteristic time scale for the memory used in the averaging procedure. Consistent with this, when $\gamma = 0$ the method reduces to ordinary SGD as discussed earlier. An equivalent way of writing the updates is

$$
\Delta \boldsymbol{\theta}_{t+1} = \gamma \Delta \boldsymbol{\theta}_t - \eta_t \nabla_\theta E(\boldsymbol{\theta}_t),
$$

where we have defined $\Delta \boldsymbol{\theta}_t = \boldsymbol{\theta}_t - \boldsymbol{\theta}_{t-1}$.
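To make the update rule concrete, here is a minimal sketch of GDM in NumPy. The quadratic cost function `grad_E`, the matrix `A`, and the hyperparameter values are illustrative assumptions chosen for this example, not quantities taken from the text above.

```python
import numpy as np

# Assumed toy cost E(theta) = 0.5 * theta^T A theta, with gradient A @ theta.
# Its minimum is at the origin, so the iterates should approach zero.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def grad_E(theta):
    return A @ theta

def gdm(theta0, eta=0.1, gamma=0.9, n_steps=100):
    """Gradient descent with momentum:
    v_t = gamma * v_{t-1} + eta * grad_E(theta_t);  theta_{t+1} = theta_t - v_t."""
    theta = np.array(theta0, dtype=float)
    v = np.zeros_like(theta)                # momentum "memory" of past gradients
    for _ in range(n_steps):
        v = gamma * v + eta * grad_E(theta)  # running average of recent gradients
        theta = theta - v                    # parameter update
    return theta

print(gdm([2.0, -1.5]))  # converges toward the minimum at the origin
```

Setting `gamma=0` in this sketch recovers plain gradient descent, matching the statement above that GDM reduces to ordinary SGD when $\gamma = 0$.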