Computation of gradients

This in turn means that the gradient of the cost function can be computed as a sum over the gradients of the individual terms c_i,

\nabla_\beta C(\mathbf{\beta}) = \sum_{i=1}^{n} \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}).
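As a quick numerical check of this additivity, the sketch below uses a hypothetical squared-error cost c_i = (y_i - \mathbf{x}_i^T\beta)^2 (an illustrative assumption, not the notes' own code; the decomposition holds for any cost written as a sum over data points):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative): n points with p features each
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def grad_c_i(x_i, y_i, beta):
    """Gradient of a single term c_i = (y_i - x_i @ beta)^2 w.r.t. beta."""
    return -2.0 * (y_i - x_i @ beta) * x_i

# Gradient of the full cost C = sum_i c_i, computed two ways:
# as a sum of per-data-point gradients, and directly in matrix form
grad_sum = sum(grad_c_i(X[i], y[i], beta) for i in range(n))
grad_direct = -2.0 * X.T @ (y - X @ beta)
```

The two results agree to machine precision, which is exactly the linearity of the gradient that the equation above expresses.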

Stochasticity/randomness is introduced by computing the gradient only on a subset of the data, called a minibatch. If there are n data points and each minibatch has size M, there will be n/M minibatches (assuming for simplicity that M divides n). We denote these minibatches by B_k, where k=1,\cdots,n/M.
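The splitting into minibatches and a single stochastic gradient step can be sketched as follows (a minimal sketch assuming an OLS-style cost and made-up data; the learning rate eta and the specific cost are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)

n, M = 100, 10        # n data points, minibatch size M
n_batches = n // M    # number of minibatches n/M (assumes M divides n)

# Shuffle the data indices and split them into minibatches B_k, k = 1, ..., n/M
indices = rng.permutation(n)
minibatches = np.split(indices, n_batches)

# Toy data and parameters (illustrative assumption)
p = 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = np.zeros(p)

def minibatch_gradient(batch, beta):
    """Gradient of the cost restricted to the data points in `batch`."""
    Xb, yb = X[batch], y[batch]
    return -2.0 * Xb.T @ (yb - Xb @ beta)

# One stochastic step: pick a minibatch at random and step along its gradient
eta = 0.01
B_k = minibatches[rng.integers(n_batches)]
beta = beta - eta * minibatch_gradient(B_k, beta)
```

Since the minibatches partition the data, summing the minibatch gradients over all k recovers the full gradient; the stochastic step replaces that full sum with the gradient over a single randomly chosen B_k.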