Stochastic gradient descent (SGD) is almost always used with a momentum or inertia term that serves as a memory of the direction we are moving in parameter space. This is typically implemented as
\begin{align}
v_t &= \gamma v_{t-1} + \eta_t \nabla_\theta E(\theta_t), \\
\theta_{t+1} &= \theta_t - v_t,
\end{align}
where we have introduced a momentum parameter $\gamma$, with $0 \le \gamma \le 1$, and for brevity we have dropped the explicit notation indicating that the gradient is taken over a different mini-batch at each step. We call this algorithm gradient descent with momentum (GDM). From these equations, it is clear that $v_t$ is a running average of recently encountered gradients, and that $(1-\gamma)^{-1}$ sets the characteristic time scale for the memory used in the averaging procedure. Consistent with this, when $\gamma = 0$ the update reduces to ordinary SGD as discussed earlier. An equivalent way of writing the updates is
\begin{equation}
\Delta\theta_{t+1} = \gamma \Delta\theta_t - \eta_t \nabla_\theta E(\theta_t),
\end{equation}
where we have defined $\Delta\theta_t = \theta_t - \theta_{t-1}$.
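As a minimal sketch of these update equations, the following NumPy snippet applies GDM to a simple quadratic cost. The function name `gdm_update`, the cost $E(\theta) = \tfrac{1}{2}\theta^T A \theta$, and the particular values of `eta` and `gamma` are illustrative assumptions, not taken from the text; in practice `grad_fn` would return the gradient evaluated on a fresh mini-batch at each step.

```python
import numpy as np

def gdm_update(theta, v, grad_fn, eta, gamma):
    """One step of gradient descent with momentum (GDM):
        v_t         = gamma * v_{t-1} + eta * grad E(theta_t)
        theta_{t+1} = theta_t - v_t
    """
    v = gamma * v + eta * grad_fn(theta)
    theta = theta - v
    return theta, v

# Illustrative quadratic cost E(theta) = 0.5 * theta^T A theta (an assumption,
# standing in for the mini-batch cost in the text); its gradient is A theta.
A = np.diag([1.0, 10.0])
grad_fn = lambda theta: A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for t in range(100):
    theta, v = gdm_update(theta, v, grad_fn, eta=0.05, gamma=0.9)

print(theta)  # approaches the minimum at the origin
```

Setting `gamma=0.0` in this sketch recovers plain SGD, while values close to one give the velocity a long memory, averaging gradients over roughly $(1-\gamma)^{-1}$ past steps.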