3.9 Gradient descent and its variants
This section gives you the working knowledge you need. We start with vanilla gradient descent and ask why we step in the direction of the negative gradient at all. We then pass to stochastic gradient descent, the form actually used in practice. We add momentum, which damps the oscillations that plague plain SGD on long shallow valleys. We meet Adam and its modern descendant AdamW, which adapt the step size for each parameter individually; AdamW is the default optimiser for most deep-learning tasks today. We discuss learning-rate schedules, which control how the step size changes over the course of training. Finally we name a few further optimisers (RMSProp, AdaGrad, LARS, Lion, Sophia, Shampoo) that you will meet in the literature.
§3.7 was the engine room: how to compute gradients efficiently. This section is what we do with them. Chapters 9 and 10 return to optimisation in detail, including convergence rates, conditioning, second-order methods, and the geometry of loss landscapes.
Vanilla gradient descent
The update rule could not be simpler: $$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \nabla \mathcal{L}(\boldsymbol{\theta}_t). $$ At each step we compute the gradient at the current parameter vector, scale it by the learning rate, and subtract. Because $-\nabla \mathcal{L}$ is the direction of steepest descent (recall §3.4), every sufficiently small step strictly decreases the loss, provided the gradient is non-zero. The procedure halts when the gradient is zero or, in practice, when its norm falls below some threshold, at which point we have reached a stationary point.
How well does this work? The classical convergence theory gives clean answers under standard assumptions. Suppose $\mathcal{L}$ is convex and its gradient is $L$-Lipschitz, meaning that $\|\nabla \mathcal{L}(\boldsymbol{\theta}) - \nabla \mathcal{L}(\boldsymbol{\theta}')\| \le L \|\boldsymbol{\theta} - \boldsymbol{\theta}'\|$ for some constant $L$. Then for any learning rate $\eta < 2/L$, gradient descent converges to the global minimum, and the loss decreases at rate $O(1/t)$, meaning that after $t$ steps we are within $C/t$ of the optimum, for some constant $C$ depending on the initialisation. If we strengthen "convex" to "strongly convex", meaning the loss is at least as curved as some quadratic with positive curvature in every direction, the rate improves to $O(\rho^t)$ for some $\rho < 1$. The error shrinks geometrically; we approach the optimum exponentially fast.
A worked example sharpens the intuition. Let $\mathcal{L}(\theta) = \theta^2$, the simplest convex quadratic. Then $\nabla \mathcal{L}(\theta) = 2\theta$ and the update rule reads $\theta_{t+1} = \theta_t - 2\eta \, \theta_t = (1 - 2\eta)\theta_t$. With $\eta = 0.5$ and $\theta_0 = 1$ we have $\theta_1 = 1 - 1 = 0$: convergence in a single step, because the chosen learning rate exactly cancels the curvature. With $\eta = 0.1$ we get $\theta_1 = 0.8$, $\theta_2 = 0.64$, $\theta_3 = 0.512$, and so on, geometric convergence with ratio $0.8$. With $\eta = 1.5$, however, the multiplier $1 - 2\eta = -2$ has magnitude greater than one. We get $\theta_1 = -2$, $\theta_2 = 4$, $\theta_3 = -8$: the iterates explode, oscillating with growing amplitude. The boundary between convergence and divergence is exactly $\eta = 1$, which corresponds to $\eta = 2/L$ since the Lipschitz constant of $\nabla \mathcal{L} = 2\theta$ is $L = 2$.
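The three regimes are easy to verify numerically. Below is a minimal sketch in plain Python (the function name and setup are ours, purely for illustration) that iterates $\theta_{t+1} = (1 - 2\eta)\theta_t$ for the three learning rates above:

```python
# Gradient descent on L(theta) = theta^2, whose gradient is 2*theta.
def gradient_descent(eta, theta0=1.0, steps=4):
    theta = theta0
    trajectory = [theta]
    for _ in range(steps):
        grad = 2 * theta            # exact gradient of theta^2
        theta = theta - eta * grad  # the vanilla update
        trajectory.append(theta)
    return trajectory

print(gradient_descent(0.5))  # [1.0, 0.0, 0.0, 0.0, 0.0]       one-step convergence
print(gradient_descent(0.1))  # [1.0, 0.8, 0.64, 0.512, 0.4096] geometric, ratio 0.8
print(gradient_descent(1.5))  # [1.0, -2.0, 4.0, -8.0, 16.0]    divergence
```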
The lesson generalises: the learning rate is the most consequential hyperparameter in any training run. Set it too small and training crawls; set it too large and the parameters diverge. There is no universally good value; it depends on the architecture, the loss, the batch size and the data. Most production training begins with a learning-rate sweep over half a dozen orders of magnitude on a small subset of data before the full run is committed.
Stochastic gradient descent (SGD)
In machine learning the loss is almost always an average over training examples: $$ \mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \mathcal{L}_i(\boldsymbol{\theta}), $$ where $\mathcal{L}_i$ is the per-example loss on the $i$-th training point. Computing $\nabla \mathcal{L}$ exactly therefore requires a forward and backward pass over all $N$ examples. When $N$ is in the millions or billions, and modern training corpora regularly run to trillions of tokens, a single full-batch step is unaffordable.
The fix is conceptually simple. Rather than computing the gradient over the entire dataset, we draw a small random subset, a minibatch of size $B$ (say, 32, 256, or in large-scale work tens of thousands), and compute the gradient on it. By linearity of expectation, the average gradient over a uniformly sampled minibatch is an unbiased estimate of the full-dataset gradient: its expected value is exactly $\nabla \mathcal{L}$. The variance of the estimate scales as $\Sigma/B$, where $\Sigma$ is the per-example covariance of gradients. So a larger batch gives a less noisy estimate, but each step costs proportionally more.
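Both properties are easy to check numerically. The sketch below (NumPy; the toy least-squares problem and all names are ours, for illustration only) compares minibatch gradients with the full gradient at several batch sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
theta = rng.normal(size=d)

def full_gradient(theta):
    # Gradient of the mean squared error (1/N) * sum_i (x_i . theta - y_i)^2
    return 2 * X.T @ (X @ theta - y) / N

def minibatch_gradient(theta, B):
    idx = rng.choice(N, size=B, replace=False)
    return 2 * X[idx].T @ (X[idx] @ theta - y[idx]) / B

g = full_gradient(theta)
for B in (32, 256, 2048):
    samples = np.stack([minibatch_gradient(theta, B) for _ in range(1_000)])
    bias = np.linalg.norm(samples.mean(axis=0) - g)  # ~0: the estimate is unbiased
    var = samples.var(axis=0).mean()                 # shrinks roughly like 1/B
    print(f"B={B}: bias {bias:.4f}, variance {var:.5f}")
```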
The resulting algorithm is stochastic gradient descent (SGD): $$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \widehat{\nabla} \mathcal{L}(\boldsymbol{\theta}_t), $$ where $\widehat{\nabla} \mathcal{L}$ is the minibatch gradient. In practice we shuffle the dataset once per epoch and march through it minibatch by minibatch, so that every example contributes exactly once to the gradient per epoch.
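A complete minimal training loop, on the same kind of toy least-squares problem (a hedged sketch; the data, constants and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B, eta = 10_000, 5, 256, 0.05
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=N)  # noisy linear targets

theta = np.zeros(d)
for epoch in range(10):
    perm = rng.permutation(N)                  # reshuffle once per epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)  # minibatch gradient
        theta -= eta * grad                             # SGD update
    print(f"epoch {epoch}: loss {np.mean((X @ theta - y) ** 2):.4f}")
```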
The noise in SGD is not pure liability. It has a mild regularising effect: the random kicks help the optimiser escape sharp local minima and saddle points, and there is empirical evidence that SGD's noise biases solutions towards flatter regions of the loss surface, which generalise better. Strict theoretical convergence requires either a decreasing learning-rate schedule (the Robbins–Monro conditions: $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$) or, in practice, a sufficiently small terminal learning rate. With reasonable batch sizes and a sensible schedule, SGD is the unglamorous workhorse on top of which everything else is built.
Momentum
Plain SGD has a famous failure mode: in long shallow valleys it oscillates back and forth across the steep walls while creeping slowly along the floor. Geometrically, the gradient is dominated by the cross-valley component, and the along-valley component, which is what we actually want to follow, is too small to make progress before the next zig-zag kicks the iterate sideways again.
Momentum (Polyak, 1964) addresses this by giving the optimiser inertia. Instead of stepping in the direction of the current gradient, we step in the direction of an exponential moving average of recent gradients: $$ \mathbf{m}_t = \beta \mathbf{m}_{t-1} + \nabla \mathcal{L}(\boldsymbol{\theta}_t), $$ $$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \mathbf{m}_t. $$ The decay parameter $\beta$ is typically $0.9$, sometimes $0.99$ in long training runs. The vector $\mathbf{m}_t$ is sometimes called the velocity, by analogy with a physical particle: the update is a heavy ball rolling downhill, with friction $1 - \beta$, accumulating speed in directions of consistent gradient and being damped in directions where the gradient flips back and forth.
In our shallow-valley scenario, the cross-valley contributions to $\mathbf{m}_t$ alternate in sign and largely cancel, while the along-valley contributions all point the same way and accumulate. The net effect is that the optimiser builds up speed along the valley floor and barely moves across it. On a quadratic with condition number $\kappa$, one can show that, with optimally tuned $\eta$ and $\beta$, momentum's asymptotic convergence rate is $(\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$, a square-root improvement over plain gradient descent's $(\kappa-1)/(\kappa+1)$.
A close cousin is Nesterov's accelerated gradient (NAG), which evaluates the gradient not at the current $\boldsymbol{\theta}_t$ but at the look-ahead point $\boldsymbol{\theta}_t - \eta \beta \mathbf{m}_{t-1}$. The asymptotic rate is the same as plain momentum, but with better constants and, anecdotally, slightly more stable behaviour near the optimum.
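In code, momentum is a two-line change to the SGD step, and Nesterov's variant differs only in where the gradient is evaluated. A sketch (illustrative names; `grad_fn` stands in for whatever computes the minibatch gradient):

```python
import numpy as np

def momentum_step(theta, m, grad_fn, eta=0.01, beta=0.9):
    """One heavy-ball update (Polyak form, as in the text)."""
    m = beta * m + grad_fn(theta)  # exponential moving average of gradients
    return theta - eta * m, m

def nesterov_step(theta, m, grad_fn, eta=0.01, beta=0.9):
    """Nesterov variant: gradient evaluated at the look-ahead point."""
    m = beta * m + grad_fn(theta - eta * beta * m)
    return theta - eta * m, m

# Toy usage on L(theta) = theta^2, whose gradient is 2*theta:
theta, m = np.array([1.0]), np.zeros(1)
for _ in range(200):
    theta, m = momentum_step(theta, m, lambda t: 2 * t, eta=0.1, beta=0.9)
print(theta)  # near 0
```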
Momentum is the bread-and-butter optimiser in computer vision. Most ResNet, EfficientNet and ConvNeXt training recipes on ImageNet still use SGD with Nesterov momentum and a learning-rate schedule, despite the rise of adaptive methods elsewhere. When the data is rich and the architecture is well-conditioned, SGD with momentum often produces the cleanest generalisation.
Adam and AdamW
The optimiser that dominates almost every other corner of deep learning is Adam, introduced by Kingma and Ba in 2014. Adam combines momentum with a per-parameter learning-rate adjustment, scaling each coordinate's step inversely by the recent root-mean-square of its gradient. The update rule is $$ \mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \, \nabla \mathcal{L}(\boldsymbol{\theta}_t), $$ $$ \mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \, (\nabla \mathcal{L}(\boldsymbol{\theta}_t))^2, $$ $$ \hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \qquad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, $$ $$ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}, $$ where the squaring and the square root are element-wise. The default hyperparameters from the original paper, $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, work well on a wide variety of problems and are still the standard starting point a decade later.
The first moment $\mathbf{m}_t$ is an exponential moving average of the gradient, momentum, written in convex-combination form. The second moment $\mathbf{v}_t$ is an exponential moving average of the squared gradient; element by element it tracks how big each parameter's gradient has been recently. The bias-correction step that produces $\hat{\mathbf{m}}_t$ and $\hat{\mathbf{v}}_t$ undoes the systematic underestimate that would otherwise occur in the first few steps because $\mathbf{m}_0 = \mathbf{v}_0 = 0$. The final update divides the bias-corrected first moment by the square root of the bias-corrected second moment, plus a tiny $\epsilon$ for numerical safety. The effect is that each parameter receives an effective learning rate of $\eta / \sqrt{\hat{\mathbf{v}}_t}$: parameters whose gradients have been consistently large get small steps, parameters whose gradients have been consistently small get large steps. Coordinates self-normalise.
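Translated directly into code, one Adam step reads as follows (a NumPy sketch of the equations above with the paper's defaults; not any particular library's API):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment, element-wise square
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```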
Adam's appeal is practical. It is dramatically less sensitive to learning-rate choice than vanilla SGD, it copes well with sparse gradients (common in NLP), it handles ill-conditioned problems where some coordinates have vastly different gradient scales, and it rarely diverges. For language-model training, recommendation systems, generative modelling and most reinforcement-learning settings, Adam is the default.
A subtle but important refinement is AdamW, due to Loshchilov and Hutter (2017). The original Adam combines weight decay (an L2 regularisation term added to the loss) with adaptive scaling, which means the effective amount of regularisation depends on the gradient magnitude, usually not what one wants. AdamW decouples weight decay from the gradient-based update: $$ \boldsymbol{\theta}_{t+1} = (1 - \eta \lambda) \, \boldsymbol{\theta}_t - \eta \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}. $$ The weight-decay term $-\eta \lambda \boldsymbol{\theta}_t$ is applied independently of the adaptive scaling. Empirically, AdamW generalises noticeably better than plain Adam, especially on transformer architectures, and is the optimiser used in most modern large-language-model training. Whenever this book or the literature says "Adam" in the context of post-2018 systems, AdamW is almost always what is actually meant.
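In code, the change from the Adam sketch above is a single factor applied to the parameters, outside the adaptive scaling (again a hedged illustration; $\lambda$ appears as `weight_decay`):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, eta=1e-3, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW step: weight decay acts on the parameters directly,
    not through the gradient (decoupled, as in Loshchilov and Hutter)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    theta = (1 - eta * weight_decay) * theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```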
A practical caveat worth noting: SGD with momentum sometimes still beats Adam on well-tuned image-classification benchmarks. The conventional wisdom is that Adam converges fast and reliably, but its final solutions can sit in slightly sharper minima than SGD's, with a small generalisation penalty. Where you can afford the tuning effort and the data is amenable, SGD with momentum may give the best test-set numbers; where you cannot, Adam (or AdamW) is the pragmatic default.
Learning-rate schedules
The learning rate $\eta$ need not be constant. Modern recipes nearly always vary it across training, often by an order of magnitude or more. The four schedules you will meet most often are these.
Constant. The simplest possible schedule. Useful for debugging and short experiments; rarely optimal for full training runs.
Step decay. Train at a fixed $\eta$ for a number of epochs, then drop it by a factor of ten (or five, or two), and repeat. A classic ImageNet recipe: train at $\eta = 0.1$ for 30 epochs, drop to $0.01$ for another 30, drop to $0.001$ for the last 30.
Cosine decay. Smoothly anneal the learning rate from $\eta_{\max}$ to (near) zero over the course of training, following one half of a cosine curve: $$ \eta_t = \eta_{\max} \cdot \tfrac{1}{2} \left(1 + \cos\left(\frac{\pi t}{T}\right)\right), $$ where $T$ is the total number of training steps. The smooth descent avoids the discontinuities of step decay and tends to find slightly better optima.
Linear warmup followed by cosine decay. The de facto standard for transformer training. For the first 1–10% of training steps, ramp the learning rate linearly from zero up to $\eta_{\max}$; for the remainder, follow a cosine decay down to zero. The warmup phase prevents the early instability that otherwise tends to plague large transformers, where a fresh random initialisation produces gradients that, scaled by Adam's adaptive step, can cause enormous updates in the first handful of iterations.
A worked example. Suppose we are training for $T = 100\,000$ steps, with $\eta_{\max} = 10^{-3}$ and a $5000$-step warmup. At step $1000$, we are in the linear-warmup phase and $\eta = 10^{-3} \times 1000/5000 = 2 \times 10^{-4}$. At step $5000$, warmup ends and $\eta = \eta_{\max} = 10^{-3}$. At step $50\,000$ we are halfway through the cosine phase, with progress fraction $(50\,000 - 5000)/(100\,000 - 5000) = 45\,000/95\,000 \approx 0.474$, so $\eta = 10^{-3} \cdot \tfrac{1}{2}(1 + \cos(\pi \cdot 0.474)) \approx 10^{-3} \cdot 0.541 \approx 5.4 \times 10^{-4}$. At step $100\,000$, $\eta$ has cosine-decayed to (approximately) zero.
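The whole schedule fits in a few lines; the sketch below (plain Python, our own helper name) reproduces the worked numbers:

```python
import math

def warmup_cosine(step, total_steps=100_000, warmup_steps=5_000, eta_max=1e-3):
    """Linear warmup to eta_max, then cosine decay to zero."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return eta_max * 0.5 * (1 + math.cos(math.pi * progress))

for s in (1_000, 5_000, 50_000, 100_000):
    print(s, warmup_cosine(s))
# 1000   -> 2.0e-4  (warmup)
# 5000   -> 1.0e-3  (peak)
# 50000  -> ~5.4e-4 (cosine phase, progress ~0.474)
# 100000 -> 0.0     (fully decayed)
```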
Schedules matter. A well-chosen schedule with a mediocre optimiser often beats a poorly tuned schedule with a state-of-the-art one.
Other optimisers worth knowing
A few further names that you will meet and should be able to place.
RMSProp. Tieleman and Hinton (2012, unpublished lecture notes). The second-moment normalisation of Adam without the first-moment momentum. It predates Adam and is still occasionally used in reinforcement learning, where the non-stationary objective makes the second-moment adaptation particularly useful.
AdaGrad. Duchi, Hazan and Singer (2011). Accumulates the sum of squared gradients without exponential decay. The denominator therefore grows monotonically, which makes the effective learning rate decrease over time. Effective for very sparse features, especially in early-2010s NLP, but in deep nets the denominator typically grows too aggressively, throttling progress. Generally superseded by Adam.
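The mechanism in code, for contrast with Adam's decayed average (an illustrative sketch, not a library API):

```python
import numpy as np

def adagrad_step(theta, accum, grad, eta=0.01, eps=1e-8):
    """One AdaGrad step: the squared-gradient accumulator only grows,
    so each coordinate's effective learning rate only shrinks."""
    accum = accum + grad**2
    theta = theta - eta * grad / (np.sqrt(accum) + eps)
    return theta, accum
```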
LARS and LAMB. You and colleagues (2017; 2019). Layer-wise adaptive learning rates designed for very large-batch training. When batch sizes reach tens of thousands, as needed to keep vast GPU clusters busy, vanilla optimisers become unstable; LARS and LAMB rescale updates layer by layer to maintain stable training.
Lion. Chen and colleagues (2023). A simpler optimiser using only sign information from the gradient. Sometimes faster than Adam, particularly for image generation; uses less memory because it does not maintain a second-moment estimate.
Sophia. Liu and colleagues (2023). A pseudo-second-order optimiser that uses a cheap diagonal Hessian estimate. Reports significant speed-ups on large language-model training, though adoption has been measured.
Shampoo, K-FAC, and natural gradient methods. True second-order or near-second-order methods that approximate the Hessian or the Fisher information matrix. Computationally expensive per step, but each step is much better informed; the trade-offs depend on problem structure.
The list keeps growing. AdamW is good enough for nearly every problem most readers will encounter; the others matter mainly at the frontier or in narrow domains.
What you should take away
- The vanilla gradient-descent update $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla \mathcal{L}$ is the conceptual core of every modern training algorithm. Everything else is a refinement that makes this idea tractable on real data and real hardware.
- The learning rate is the most consequential hyperparameter. Too small wastes compute; too large diverges. Always sweep it before committing to a long run.
- Stochastic gradient descent, using minibatch estimates of the gradient, is what every practical system uses, because computing the full-dataset gradient is unaffordable at modern scale.
- Momentum smooths out oscillations and accelerates progress along consistent directions; Adam adds per-parameter adaptive step sizes; AdamW further decouples weight decay from the gradient and is the default optimiser for nearly every transformer trained today.
- Learning-rate schedules, particularly linear warmup followed by cosine decay, matter as much as the choice of optimiser. A warmup phase prevents early instability; a decay phase lets the optimiser settle into a good minimum.