Data Analysis and Machine Learning Lectures: Optimization and Gradient Methods

SGD example

As an example, suppose we have $n=10$ data points $(\mathbf{x}_1,\cdots, \mathbf{x}_{10})$ and choose to have $M=5$ minibatches. Then each minibatch contains two data points. In particular, we have $B_1 = (\mathbf{x}_1,\mathbf{x}_2), \cdots, B_5 = (\mathbf{x}_9,\mathbf{x}_{10})$. Note that if you choose $M=1$ you have only a single batch with all data points and, at the other extreme, you may choose $M=n$, resulting in a minibatch for each data point, i.e., $B_k = \mathbf{x}_k$.
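The partition described above can be sketched in a few lines of Python. This is only an illustration: the helper name `make_minibatches` and the string labels standing in for the data points are assumptions made for the example, and the sketch assumes that $M$ divides the number of data points evenly.

```python
def make_minibatches(data, M):
    """Split `data` into M consecutive minibatches of equal size.

    Hypothetical helper for illustration; it assumes M divides len(data).
    """
    n = len(data)
    if n % M != 0:
        raise ValueError("this sketch assumes M divides n evenly")
    size = n // M
    return [tuple(data[k * size:(k + 1) * size]) for k in range(M)]


# String labels standing in for the data points x_1, ..., x_10.
data = [f"x{i}" for i in range(1, 11)]

batches = make_minibatches(data, M=5)
print(batches[0])   # ('x1', 'x2')
print(batches[-1])  # ('x9', 'x10')

# The two extremes: one batch with all points, or one batch per point.
single_batch = make_minibatches(data, M=1)
one_per_point = make_minibatches(data, M=len(data))
```

With `M=1` the list contains a single tuple holding all ten labels, and with `M=len(data)` it contains ten one-element tuples, matching the two extreme cases mentioned in the text.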

The idea is now to approximate the gradient by replacing the sum over all data points with a sum over the data points in one of the minibatches, picked at random in each gradient descent step:

$\nabla_{\beta} C(\mathbf{\beta}) = \sum_{i=1}^n \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}) \rightarrow \sum_{i \in B_k} \nabla_\beta c_i(\mathbf{x}_i, \mathbf{\beta}),$

where the second sum runs only over the data points belonging to the chosen minibatch $B_k$.
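A minimal sketch of this replacement, under assumptions that are not part of the notes: a scalar parameter $\beta$, a squared-error cost $c_i = (\beta x_i - y_i)^2$ with targets $y_i$ introduced only for the example, noise-free data on the line $y = 2x$, and a fixed learning rate `eta`. Each step picks one minibatch at random and uses only its gradient sum.

```python
import random


def grad_c(x, y, beta):
    """Gradient with respect to beta of the assumed cost c_i = (beta*x_i - y_i)**2."""
    return 2.0 * x * (beta * x - y)


def batch_gradient(batch, beta):
    """Sum of the per-point gradients over the points in one batch."""
    return sum(grad_c(x, y, beta) for x, y in batch)


# Ten noise-free points on y = 2x (illustrative data, not from the notes).
points = [(i / 10, 2 * (i / 10)) for i in range(1, 11)]

# M = 5 minibatches of two consecutive points each.
M = 5
size = len(points) // M
batches = [points[k * size:(k + 1) * size] for k in range(M)]

# The full gradient is the sum of the minibatch gradients.
full_gradient = batch_gradient(points, 0.0)
sum_of_parts = sum(batch_gradient(B, 0.0) for B in batches)

rng = random.Random(0)   # fixed seed so the run is reproducible
beta, eta = 0.0, 0.1     # starting guess and assumed learning rate
for step in range(500):
    B_k = rng.choice(batches)                # minibatch picked at random
    beta -= eta * batch_gradient(B_k, beta)  # step using only that batch

print(beta)  # approaches the true slope 2.0 for this noise-free data
```

Because the data here are noise-free, every minibatch gradient vanishes at $\beta = 2$, so the iterates settle at the minimum. With noisy data the iterates would instead fluctuate around it unless the learning rate is reduced over time.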