SGD vs Full-Batch GD: Convergence Speed and Memory Comparison

Theoretical Convergence Speed and Convex Optimization

Consider minimizing an empirical cost function

$$ C(\theta) =\frac{1}{N}\sum_{i=1}^N l_i(\theta), $$

where each \( l_i(\theta) \) is a differentiable loss term. Gradient Descent (GD) updates parameters using the full gradient \( \nabla C(\theta) \), while Stochastic Gradient Descent (SGD) uses a single sample (or mini-batch) gradient \( \nabla l_i(\theta) \) selected at random. In equation form, one GD step is:

$$ \theta_{t+1} = \theta_t-\eta \nabla C(\theta_t) =\theta_t -\eta \frac{1}{N}\sum_{i=1}^N \nabla l_i(\theta_t), $$

whereas one SGD step is:

$$ \theta_{t+1} = \theta_t -\eta \nabla l_{i_t}(\theta_t), $$

with \( i_t \) randomly chosen. On smooth convex problems, GD and SGD both converge to the global minimum, but their rates differ. GD can take larger, more stable steps since it uses the exact gradient, achieving an error that decreases on the order of \( O(1/t) \) per iteration for convex objectives (and even exponentially fast for strongly convex cases). In contrast, plain SGD has more variance in each step, leading to sublinear convergence in expectation, typically \( O(1/\sqrt{t}) \) for general convex objectives (with appropriately diminishing step sizes). Intuitively, GD’s trajectory is smoother and more predictable, while SGD’s path oscillates due to noise but costs far less per iteration, enabling many more updates in the same amount of time.
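To make the cost asymmetry concrete, here is a minimal NumPy sketch contrasting the two updates on a synthetic least-squares problem. Everything in it (the data, the step size `eta`, the variable names) is illustrative, not taken from any particular source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: l_i(theta) = 0.5 * (x_i @ theta - y_i)**2,
# so C(theta) = (1/N) * sum_i l_i(theta). All values here are made up.
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def full_gradient(theta):
    # Exact gradient of C: averages all N per-sample gradients (cost O(N)).
    return X.T @ (X @ theta - y) / N

def sample_gradient(theta, i):
    # Gradient of a single loss term l_i: an unbiased estimate of the full
    # gradient, at 1/N of the cost.
    return X[i] * (X[i] @ theta - y[i])

eta = 0.05  # illustrative constant step size
theta_gd = np.zeros(d)
theta_sgd = np.zeros(d)

for t in range(1000):
    # One GD step touches all N samples...
    theta_gd -= eta * full_gradient(theta_gd)
    # ...while one SGD step touches exactly one, so SGD can make
    # N updates in the time GD makes a single update.
    i = rng.integers(N)
    theta_sgd -= eta * sample_gradient(theta_sgd, i)
```

With a constant step size, the SGD iterate does not settle exactly at the minimizer but oscillates in a noise ball around it, which is the variance effect described above.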

Strongly Convex Case

If \( C(\theta) \) is strongly convex and \( L \)-smooth (so GD enjoys linear convergence), the gap \( C(\theta_t)-C(\theta^*) \) for GD shrinks as

$$ C(\theta_t) - C(\theta^* ) \le \Big(1 - \frac{\mu}{L}\Big)^t [C(\theta_0)-C(\theta^*)], $$

a geometric (linear) convergence per iteration. Achieving an \( \epsilon \)-accurate solution thus takes on the order of \( \log(1/\epsilon) \) iterations for GD. However, each GD iteration costs \( O(N) \) gradient evaluations. SGD cannot exploit strong convexity to obtain a linear rate; instead, with a properly decaying step size (e.g. \( \eta_t = \frac{1}{\mu t} \)) or iterate averaging, SGD attains an \( O(1/t) \) convergence rate in expectation. For example, a result of Moulines and Bach (2011) (https://papers.nips.cc/paper_files/paper/2011/hash/40008b9a5380fcacce3976bf7c08af5b-Abstract.html) shows that with \( \eta_t = \Theta(1/t) \),

$$ \mathbb{E}[C(\theta_t) - C(\theta^*)] = O(1/t), $$

for a strongly convex, smooth \( C \). This \( 1/t \) rate is slower per iteration than GD’s exponential decay, but each SGD iteration is \( N \) times cheaper. In fact, to reach error \( \epsilon \), plain SGD needs on the order of \( T = O(1/\epsilon) \) iterations (sublinear convergence), while GD needs \( O(\log(1/\epsilon)) \) iterations. Accounting for cost per iteration, GD requires \( O(N \log(1/\epsilon)) \) total gradient computations versus SGD’s \( O(1/\epsilon) \) single-sample computations. In large-scale regimes (huge \( N \)), SGD can be faster in wall-clock time because \( N \log(1/\epsilon) \) may far exceed \( 1/\epsilon \) for reasonable accuracy levels. In other words, with millions of data points, one epoch of GD (one full gradient) is extremely costly, whereas SGD can make \( N \) cheap updates in the time GD makes one, often yielding a good solution faster in practice, even though SGD’s asymptotic error decays more slowly. As one lecture succinctly puts it: “SGD can be super effective in terms of iteration cost and memory, but SGD is slow to converge and can’t adapt to strong convexity”. Thus, the break-even point depends on \( N \) and the desired accuracy: for moderate accuracy on very large \( N \), SGD’s cheaper updates win; for extremely high precision (very small \( \epsilon \)) on a modest \( N \), GD’s fast per-step convergence can be advantageous.
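The \( \eta_t = \frac{1}{\mu t} \) schedule is easy to simulate. The sketch below runs plain SGD on a toy one-dimensional strongly convex problem with \( \mu = 1 \) and prints the suboptimality gap, which should shrink roughly like \( O(1/t) \); the problem instance and all constants are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy strongly convex objective: l_i(theta) = 0.5 * (theta - a_i)**2, so
# C(theta) = 0.5 * (theta - mean(a))**2 + const, and C''(theta) = mu = 1.
N = 10_000
a = rng.normal(size=N)
mu = 1.0
theta_star = a.mean()  # exact minimizer of C

theta = 0.0
for t in range(1, 100_001):
    i = rng.integers(N)
    grad_i = theta - a[i]     # single-sample gradient
    eta_t = 1.0 / (mu * t)    # the decaying 1/(mu*t) schedule from the text
    theta -= eta_t * grad_i

# Here C(theta) - C(theta*) = 0.5 * (theta - theta_star)**2.
gap = 0.5 * (theta - theta_star) ** 2
print(f"after {t} single-sample gradients, gap ~ {gap:.2e}")  # roughly O(1/t)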

Non-Convex Problems

In non-convex optimization (e.g. deep neural networks), neither GD nor SGD guarantees reaching a global minimum, but SGD often makes faster progress toward useful minima. Theoretical results here are weaker, usually showing convergence to a stationary point \( \theta \) (where \( \|\nabla C(\theta)\| \) is small) in expectation. For example, GD might require \( O(1/\epsilon^2) \) iterations to ensure \( \|\nabla C(\theta)\| < \epsilon \), and SGD typically has similar polynomial complexity (often worse due to gradient noise). However, a noteworthy difference is that SGD’s stochasticity can help escape saddle points or poor local minima. Random gradient fluctuations act like implicit noise, helping the iterate “jump” out of flat saddle regions where full-batch GD could stagnate. In fact, research has shown that adding noise to GD can guarantee escaping saddle points in polynomial time, and the inherent noise in SGD often serves this role. Empirically, this means SGD can sometimes find a lower-loss basin faster, whereas full-batch GD might get “stuck” near saddle points or need a very small learning rate to navigate complex error surfaces. Overall, in modern high-dimensional machine learning, SGD (or mini-batch SGD) is the workhorse for large non-convex problems because it converges to good solutions much faster in practice, despite the lack of a linear convergence guarantee. Full-batch GD is rarely used on large neural networks, as it would require tiny steps to avoid divergence and is extremely slow per iteration.
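As a toy illustration of the saddle-escape intuition (not a reproduction of any cited result), the sketch below compares deterministic GD with a noise-perturbed variant on \( f(x, y) = x^2 - y^2 \), which has a saddle point at the origin. Started exactly on the \( x \)-axis, GD never leaves it and converges to the saddle; even tiny gradient noise kicks the iterate off the axis, after which it escapes along the descent direction (here \( f \) is unbounded below, so the escaping iterate simply keeps descending):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy surface with a saddle at the origin: f(x, y) = x**2 - y**2.
# The gradient is (2x, -2y); the line y = 0 is invariant under exact GD.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

eta, steps = 0.05, 100
p_gd = np.array([1.0, 0.0])     # start exactly on the x-axis
p_noisy = np.array([1.0, 0.0])

for _ in range(steps):
    # Deterministic GD: stays on y = 0 and converges to the saddle (0, 0).
    p_gd -= eta * grad(p_gd)
    # Perturbed GD: small noise stands in for SGD's gradient fluctuations.
    noise = 0.01 * rng.normal(size=2)
    p_noisy -= eta * (grad(p_noisy) + noise)

print("GD ends at        ", p_gd)             # ~ (0, 0): stuck at the saddle
print("noisy GD |y| ends ", abs(p_noisy[1]))  # grows once kicked off the axis
```

The noise scale and step size are arbitrary; the point is only that the unperturbed iterate sits at the saddle while the perturbed one leaves it.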