Memory Usage and Scalability

A major advantage of SGD is its memory efficiency on large datasets. Full-batch GD requires access to the entire training set at each iteration, which in practice often means the whole dataset (or a large subset) must reside in memory to compute \( \nabla C(\theta) \). Memory usage therefore scales linearly with the dataset size \( N \). For instance, if each training sample is large (e.g. high-dimensional features), computing a full gradient may require storing a substantial portion of the data, or all intermediate per-sample gradients, until they are aggregated. In contrast, SGD needs only a single training example (or a small mini-batch) in memory at any time. The algorithm processes one sample (or mini-batch), immediately updates the model, and discards that sample before moving to the next. With this streaming approach, the memory footprint is essentially independent of \( N \) (apart from storing the model parameters themselves). As one source notes, gradient descent “requires more memory than SGD” because it “must store the entire dataset for each iteration,” whereas SGD “only needs to store the current training example.”

In practical terms, with a dataset of, say, 1 million examples, full-batch GD needs memory for all million examples at every step, while SGD can be implemented to load just one example at a time, a crucial benefit when the data are too large to fit in RAM or GPU memory. This scalability makes SGD well suited to large-scale learning: as long as you can stream data from disk, SGD can handle arbitrarily large datasets with fixed memory. In fact, SGD “does not need to remember which examples were visited” in the past, allowing it to run in an online fashion on infinite data streams. Full-batch GD, on the other hand, requires a full pass over the entire dataset for every single update (or a complex distributed-memory system), which is often infeasible.
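To make the streaming idea concrete, here is a minimal sketch of single-example SGD for a least-squares linear model, reading training examples one line at a time from disk. The file path, CSV layout, and model choice are illustrative assumptions, not anything prescribed above; the point is that only the parameter vector and the current example are ever held in memory.

```python
import numpy as np

def sgd_stream(csv_path, n_features, lr=0.01, epochs=1):
    """SGD over a dataset streamed from disk: memory use is O(1) in the
    number of examples (only theta and one sample are in memory)."""
    theta = np.zeros(n_features)                 # model parameters
    for _ in range(epochs):
        with open(csv_path) as f:                # stream examples line by line
            for line in f:
                *x, y = map(float, line.split(","))
                x = np.asarray(x)
                # gradient of 0.5 * (x . theta - y)^2 for this single sample
                grad = (x @ theta - y) * x
                theta -= lr * grad               # update, then discard the sample
    return theta

# Usage (hypothetical file with rows "x1,...,xd,y"):
# theta = sgd_stream("train.csv", n_features=10)
```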

There is also a secondary memory effect: computing a full-batch gradient in deep learning requires storing the intermediate activations for backpropagation across the entire batch. A very large batch (approaching the full dataset) can exhaust GPU memory because the activations for thousands or millions of examples must be held simultaneously. SGD/mini-batches mitigate this by splitting the workload: with a mini-batch of size 32 or 256, memory use stays bounded, whereas a full-batch (size \( N \)) forward/backward pass could not even be executed when \( N \) is huge. Techniques like gradient accumulation can simulate large-batch GD by summing many small-batch gradients, but they still process the data in manageable chunks to avoid memory overflow. In summary, memory complexity for GD grows with \( N \), while for SGD it remains \( O(1) \) with respect to dataset size (only the model and perhaps a mini-batch reside in memory). This is a key reason why batch GD “does not scale” to very large data and why virtually all large-scale machine learning algorithms rely on stochastic or mini-batch methods.
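The sketch below illustrates gradient accumulation in PyTorch, under assumed toy choices (random data, a linear model, mini-batches of 32, 8 accumulation steps). Gradients from several small backward passes add up in the parameters' `.grad` buffers, and one optimizer step is taken per group, approximating a larger effective batch while only one mini-batch of activations is resident at a time.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data, model, and optimizer purely for illustration.
X = torch.randn(1024, 10)
y = torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
accum_steps = 8          # 8 mini-batches of 32 ~ one effective batch of 256

optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader):
    # Scale the loss so the accumulated gradient matches the large-batch mean.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()                              # gradients accumulate in .grad buffers
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one update per accum_steps mini-batches
        optimizer.zero_grad()
```

Only one mini-batch's activations exist during each backward pass, so peak memory is set by the mini-batch size rather than the effective batch size.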