SGD example

As an example, suppose we have 10 data points (\mathbf{x}_1,\cdots, \mathbf{x}_{10}) and we choose to have M=5 minibatches; then each minibatch contains two data points. In particular we have B_1 = (\mathbf{x}_1,\mathbf{x}_2), \cdots, B_5 = (\mathbf{x}_9,\mathbf{x}_{10}). Note that if you choose M=1 you have only a single batch with all data points, and on the other extreme, you may choose M=n, resulting in a minibatch for each data point, i.e., B_k = \mathbf{x}_k.
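
As a concrete illustration, here is a minimal Python sketch of this splitting, assuming 10 hypothetical data points stored as the rows of a matrix X; the names X, M and minibatches are illustrative only.

import numpy as np

n, M = 10, 5                           # 10 data points, M = 5 minibatches
rng = np.random.default_rng(seed=2021)
X = rng.normal(size=(n, 2))            # hypothetical data, one row per data point

# Split the rows into M consecutive minibatches:
# B_1 = (x_1, x_2), ..., B_5 = (x_9, x_10)
minibatches = np.array_split(X, M)
for k, B_k in enumerate(minibatches, start=1):
    print(f"B_{k} contains {len(B_k)} data points")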

The idea is now to approximate the gradient by replacing the sum over all data points with a sum over the data points in one of the minibatches, picked at random at each gradient descent step:

\nabla_{\boldsymbol{\beta}} C(\boldsymbol{\beta}) = \sum_{i=1}^n \nabla_{\boldsymbol{\beta}} c_i(\mathbf{x}_i, \boldsymbol{\beta}) \rightarrow \sum_{i \in B_k} \nabla_{\boldsymbol{\beta}} c_i(\mathbf{x}_i, \boldsymbol{\beta}).
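
Putting the pieces together, the sketch below runs gradient descent with this minibatch approximation, assuming for concreteness a least-squares cost c_i(\mathbf{x}_i, \boldsymbol{\beta}) = (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2, whose gradient is \nabla_{\boldsymbol{\beta}} c_i = -2\mathbf{x}_i (y_i - \mathbf{x}_i^T \boldsymbol{\beta}); the learning rate eta, the number of epochs and the synthetic data are illustrative choices, not prescribed by the text.

import numpy as np

rng = np.random.default_rng(seed=2021)
n, p, M = 10, 2, 5                          # n data points, p features, M minibatches
X = rng.normal(size=(n, p))                 # hypothetical design matrix
beta_true = np.array([1.0, -0.5])           # hypothetical "true" parameters
y = X @ beta_true + 0.1 * rng.normal(size=n)

batches = np.array_split(np.arange(n), M)   # index sets B_1, ..., B_M
beta = np.zeros(p)
eta = 0.05                                  # learning rate (illustrative)
n_epochs = 200

for epoch in range(n_epochs):
    for _ in range(M):                      # M gradient steps per epoch
        B_k = batches[rng.integers(M)]      # pick one minibatch at random
        # Gradient of the least-squares cost over the minibatch only:
        # sum_{i in B_k} -2 x_i (y_i - x_i^T beta)
        grad = -2.0 * X[B_k].T @ (y[B_k] - X[B_k] @ beta)
        beta -= eta * grad

print(beta)                                 # should be close to beta_true

Note that each update uses only the two data points in the chosen minibatch, so one step costs a fraction of a full gradient evaluation; the random choice of B_k is what makes the method stochastic.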