Computation of gradients

This in turn means that the gradient can be computed as a sum over $i$-gradients,

$$
\nabla_\beta C(\beta) = \sum_{i=1}^{n} \nabla_\beta c_i(\mathbf{x}_i, \beta).
$$
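The sum-over-samples form of the gradient can be checked numerically. The sketch below assumes a squared-error term $c_i(\mathbf{x}_i, \beta) = (\mathbf{x}_i^T \beta - y_i)^2$, which is an illustrative choice and not something the text fixes; the function names and the random data are likewise hypothetical.

```python
import numpy as np

def per_sample_gradient(x_i, y_i, beta):
    # Gradient of the assumed term c_i = (x_i @ beta - y_i)**2 w.r.t. beta.
    return 2.0 * (x_i @ beta - y_i) * x_i

def full_gradient(X, y, beta):
    # Gradient of C(beta) = sum_i c_i, written in matrix form.
    return 2.0 * X.T @ (X @ beta - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # n = 20 data points, 3 parameters
y = rng.normal(size=20)
beta = rng.normal(size=3)

# Accumulate the gradient one data point at a time.
grad_sum = sum(per_sample_gradient(X[i], y[i], beta) for i in range(len(y)))
grad_full = full_gradient(X, y, beta)

print(np.allclose(grad_sum, grad_full))
```

The two results agree up to floating-point rounding, since the matrix expression is just the per-sample sum written compactly.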

Stochasticity/randomness is introduced by only taking the gradient on a subset of the data, called a minibatch. If there are $n$ data points and the size of each minibatch is $M$, there will be $n/M$ minibatches. We denote these minibatches by $B_k$, where $k = 1, \dots, n/M$.
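A minimal sketch of the minibatch construction follows. It assumes that $M$ divides $n$ exactly, that the data are shuffled before splitting, and that the cost terms are squared errors $(\mathbf{x}_i^T \beta - y_i)^2$; none of these choices is prescribed by the text, and the helper names `make_minibatches` and `minibatch_gradient` are hypothetical.

```python
import numpy as np

def make_minibatches(n, M, rng):
    # Shuffle the indices 0..n-1 and split them into n/M minibatches of size M.
    # Assumption: M divides n, so every minibatch has the same size.
    if n % M != 0:
        raise ValueError("this sketch assumes that M divides n")
    indices = rng.permutation(n)
    return np.split(indices, n // M)

def minibatch_gradient(X, y, beta, batch):
    # Gradient summed only over the data points in one minibatch B_k,
    # for the assumed squared-error cost.
    residual = X[batch] @ beta - y[batch]
    return 2.0 * X[batch].T @ residual

rng = np.random.default_rng(1)
n, M = 12, 4
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
beta = np.zeros(2)

batches = make_minibatches(n, M, rng)   # n/M = 3 minibatches
k = rng.integers(len(batches))          # pick one minibatch at random
grad_k = minibatch_gradient(X, y, beta, batches[k])
```

Because the minibatches partition the data, summing `minibatch_gradient` over all of them recovers the full gradient; using a single randomly chosen minibatch gives a cheaper, noisy estimate of it.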