Also known as: SGD
Stochastic Gradient Descent (SGD) is the workhorse optimisation algorithm of deep learning. Instead of computing the gradient over the entire training set (which is prohibitively expensive), SGD estimates it from a single randomly sampled example or, more commonly, a small mini-batch of 32 to 512 examples. The gradient estimate is noisy but unbiased—its expectation equals the true gradient—and its variance decreases as batch size grows.
The update rule is $\mathbf{w} \leftarrow \mathbf{w} - \eta \hat{\nabla} L$, where $\hat{\nabla} L$ is the mini-batch gradient estimate and $\eta$ is the learning rate. The stochasticity is not merely a concession to computational limits; it is actively beneficial. Gradient noise helps the optimiser escape shallow local minima and saddle points, and empirical work shows that SGD preferentially finds flat minima in the loss landscape, which tend to generalise better than sharp minima.
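The mini-batch estimate and the update rule above can be sketched in a few lines of numpy. The toy linear-regression problem, the batch size of 32, and the learning rate below are illustrative choices, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares loss L(w) = mean((X @ w - y)**2)
n, d = 1000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.01 * rng.normal(size=n)

def minibatch_grad(w, batch_size=32):
    """Noisy but unbiased gradient estimate from a random mini-batch:
    its expectation over the sampled indices is the full-batch gradient."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)

def sgd(eta=0.05, steps=500, batch_size=32):
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * minibatch_grad(w, batch_size)  # w <- w - eta * grad_hat
    return w

w_hat = sgd()
```

Each step touches only 32 of the 1000 examples, yet the iterates still drift toward the full-batch minimiser because the per-step errors average out.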
SGD with momentum adds a velocity term that accumulates past gradients, damping oscillations and accelerating progress along consistent directions—like a ball rolling downhill with inertia. Nesterov accelerated gradient performs a lookahead step before computing the gradient, which yields a provably faster convergence rate on smooth convex problems and often helps in practice. Despite the rise of adaptive optimisers like Adam, well-tuned SGD with momentum remains competitive and is still the default choice in many computer vision pipelines, where it often yields slightly better generalisation than Adam.
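A minimal sketch of both variants, using a deliberately ill-conditioned quadratic so the effect of the velocity term is visible; the loss surface, step size, and momentum coefficient here are illustrative assumptions:

```python
import numpy as np

def quad_grad(w):
    # Gradient of L(w) = 0.5 * w^T A w, with an ill-conditioned A
    A = np.diag([1.0, 25.0])
    return A @ w

def momentum_sgd(grad, w0, eta=0.02, mu=0.9, steps=200, nesterov=False):
    """Classical (heavy-ball) momentum; with nesterov=True the gradient
    is evaluated at the lookahead point w + mu * v before updating."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w + mu * v) if nesterov else grad(w)
        v = mu * v - eta * g   # velocity accumulates past gradients
        w = w + v
    return w

w_mom = momentum_sgd(quad_grad, [1.0, 1.0])
w_nag = momentum_sgd(quad_grad, [1.0, 1.0], nesterov=True)
```

Both runs converge to the minimum at the origin; along the steep axis the velocity term damps the back-and-forth oscillation that plain gradient descent would exhibit at the same step size.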
Related terms: Gradient Descent, Adam
Discussed in:
- Chapter 10: Training & Optimisation — Stochastic Gradient Descent
Also defined in: Textbook of AI