Also known as: SGD, stochastic approximation
Stochastic Gradient Descent (SGD) is the workhorse optimisation algorithm of deep learning. Ordinary (batch) gradient descent computes the gradient of the loss over the entire training set at every step, which is prohibitively expensive when the dataset has millions or billions of examples; SGD instead estimates the gradient from a single randomly sampled example or, more commonly, a small mini-batch of 32 to 512 examples. The estimate is noisy but unbiased: its expectation equals the true gradient, and its variance shrinks as $1/B$ with batch size $B$.
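A quick numerical check of that claim, sketched in NumPy on a toy least-squares problem (the data, loss, and batch sizes here are illustrative assumptions, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = rng.normal(size=d)

def minibatch_grad(idx):
    """Gradient of the mean squared error over the examples indexed by idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = minibatch_grad(np.arange(N))   # the exact "batch" gradient
for B in (32, 128, 512):
    estimates = np.stack([minibatch_grad(rng.choice(N, size=B, replace=False))
                          for _ in range(2000)])
    bias = np.linalg.norm(estimates.mean(axis=0) - full_grad)
    var = estimates.var(axis=0).mean()
    print(f"B={B:4d}  bias~{bias:.4f}  variance~{var:.4f}")  # bias near 0, variance shrinks roughly as 1/B
```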
The update rule
Given parameters $\mathbf{w}$, learning rate $\eta$, and a mini-batch loss $L_{\mathcal{B}}$,
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \,\hat{\nabla} L_{\mathcal{B}}(\mathbf{w}_t),$$
where $\hat{\nabla} L_{\mathcal{B}}$ is computed by automatic differentiation (backpropagation) over the mini-batch. Robbins and Monro introduced the underlying stochastic-approximation idea in 1951; convergence requires a learning-rate schedule satisfying $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, e.g. $\eta_t \propto 1/t$. In practice, deep-learning schedules use warmup, then cosine or step decay.
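A minimal sketch of this update loop with a warmup-then-cosine schedule, again on a toy least-squares problem; the schedule constants (base learning rate, warmup length, total steps) are illustrative assumptions:

```python
import math
import numpy as np

def lr_schedule(t, base_lr=0.1, warmup=100, total=2000):
    """Linear warmup to base_lr, then cosine decay to zero (illustrative constants)."""
    if t < warmup:
        return base_lr * (t + 1) / warmup
    progress = (t - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

rng = np.random.default_rng(0)
N, d, B, steps = 10_000, 5, 64, 2000
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true
w = np.zeros(d)

for t in range(steps):
    idx = rng.choice(N, size=B, replace=False)        # draw a random mini-batch
    g = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / B    # mini-batch gradient estimate
    w = w - lr_schedule(t) * g                        # w_{t+1} = w_t - eta_t * grad

print(np.linalg.norm(w - w_true))   # small: SGD has recovered the true weights
```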
Why noise helps
The stochasticity of SGD is not merely a concession to compute. It is actively beneficial in three ways:
- Escape from saddle points and shallow minima. In high dimensions, critical points of the loss are overwhelmingly saddles rather than local minima; gradient noise pushes the iterates off saddle plateaus.
- Implicit regularisation toward flat minima. Empirical and theoretical work (Keskar et al. 2017; Jastrzebski et al. 2017) shows that SGD preferentially finds flat minima in the loss landscape (basins where the loss is approximately invariant to small perturbations of the weights), and flat minima tend to generalise better than sharp ones.
- Stochastic differential equation (SDE) limit. In a continuous-time approximation, SGD behaves like Langevin dynamics with an effective temperature proportional to $\eta / B$. The ratio $\eta/B$ is therefore a more meaningful hyperparameter than either alone (a small scaling sketch follows this list).
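One practical consequence of the $\eta/B$ "temperature" view is the linear-scaling heuristic: when the batch size is multiplied by $k$, multiply the learning rate by $k$ so the ratio stays roughly fixed. A sketch, with the base values chosen purely for illustration:

```python
def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear-scaling heuristic: keep the ratio eta / B roughly constant when B changes."""
    return base_lr * new_batch / base_batch

# e.g. a recipe tuned at lr=0.1 with batch size 256, moved to batch size 1024
print(scale_lr(0.1, 256, 1024))   # -> 0.4
```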
Variants
- SGD with momentum ($\mathbf{v}_{t+1} = \beta \mathbf{v}_t + \hat{\nabla} L_{\mathcal{B}}(\mathbf{w}_t),\; \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \mathbf{v}_{t+1}$) accumulates a velocity that damps oscillations and accelerates progress along consistent directions, like a ball rolling downhill with inertia ($\beta \approx 0.9$); both this and the Nesterov update are sketched after this list.
- Nesterov accelerated gradient evaluates the gradient at the look-ahead point $\mathbf{w}_t - \eta \beta \mathbf{v}_t$, achieving the $O(1/t^2)$ convergence rate on smooth convex problems in the deterministic setting.
- AdaGrad, RMSProp, Adam and AdamW maintain per-parameter adaptive step sizes; Adam (and its decoupled-weight-decay variant AdamW) is the default for Transformer training. Plain SGD with momentum nonetheless remains the standard for ResNet/ConvNet training in computer vision, where it tends to give slightly better generalisation than Adam.
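A sketch of the classical-momentum and Nesterov updates from the list above, in NumPy; the gradient function, hyperparameters, and toy loss are illustrative assumptions:

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.1, beta=0.9, nesterov=False):
    """One SGD-with-momentum step; nesterov=True evaluates the gradient at the
    look-ahead point w - lr * beta * v, matching the formulas above."""
    lookahead = w - lr * beta * v if nesterov else w
    g = grad_fn(lookahead)
    v = beta * v + g          # v_{t+1} = beta * v_t + grad
    w = w - lr * v            # w_{t+1} = w_t - lr * v_{t+1}
    return w, v

# usage on a toy quadratic loss 0.5 * ||w||^2, whose gradient is simply w
w, v = np.ones(3), np.zeros(3)
for _ in range(100):
    w, v = momentum_step(w, v, grad_fn=lambda p: p, nesterov=True)
print(w)   # close to the minimiser at the origin
```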
Modern relevance
Every large language model in the GPT, Claude, Gemini and Llama families is trained with a variant of SGD, typically AdamW, over trillions of tokens. Distributed SGD (data parallelism, ZeRO sharding, pipeline parallelism) is the algorithmic backbone of frontier-model training. Understanding the geometry of SGD trajectories remains an active research area: linear-mode connectivity, the lottery-ticket hypothesis, and grokking all emerged from careful study of SGD dynamics.
Related terms: Gradient Descent, Adam, Backpropagation, Learning Rate, Léon Bottou
Discussed in:
- Chapter 6: ML Fundamentals, Training Neural Networks
- Chapter 10: Training & Optimisation, Training Optimisation