10.5 Momentum

Vanilla stochastic gradient descent has a curious failure mode that anyone who has plotted a training loss has felt: the loss zig-zags. It dives, then climbs, then dives again, with the parameter trajectory ricocheting between the walls of a narrow valley while making only sluggish progress towards the actual minimum. On a long, gentle plateau the opposite pathology shows up: the gradient is small, the steps are tiny, and the optimiser appears to lose interest entirely. Both behaviours have the same root cause: each SGD update only knows about the gradient at a single point, so it has no memory of where it has been and no inertia to carry it through awkward terrain.

Momentum is the classical fix, and it is essentially the same idea your physical intuition already supplies. Imagine rolling a heavy ball down the loss surface rather than dropping a feather at every iterate. The ball does not stop dead at every shallow basin; it accumulates speed in the direction it has been travelling. When successive gradients agree, the ball gathers pace and shoots through the plateau. When successive gradients disagree, as in the zig-zag walls of the valley, the components cancel out, leaving only the consistent component that points along the valley floor. Both the slow plateau and the oscillating valley get repaired by the same trick: average the recent gradients with an exponentially decaying weight, then take a step in the direction of the average rather than the instantaneous gradient.

This is not a marginal trick reserved for textbook examples. Heavy-ball momentum has been part of the standard optimisation toolkit since Polyak introduced it in 1964; Nesterov's accelerated variant has been part of convex optimisation theory since 1983; every adaptive optimiser used in modern deep learning (Adam, AdamW, NAdam, Lion) has momentum sitting somewhere inside it. If you train a residual network on ImageNet, you almost certainly use SGD with Nesterov momentum. If you train a transformer language model, you use AdamW, which itself keeps two exponentially weighted buffers, one of which is exactly momentum.

Symbols Used Here
$\theta$: parameters
$g_t$: gradient at step $t$
$m_t$: momentum buffer
$\beta$: momentum coefficient
$\eta$: learning rate

Heavy-ball (Polyak) momentum

Polyak's heavy-ball method introduces a single new state variable, the momentum buffer $m_t$, which carries an exponentially weighted average of past gradients. The update rule is

$$m_t = \beta\, m_{t-1} + g_t,$$ $$\theta_{t+1} = \theta_t - \eta\, m_t,$$

with $m_0 = 0$ and $\beta \in [0, 1)$ the momentum coefficient. The default in deep learning is $\beta = 0.9$; for very long training runs $\beta = 0.99$ is sometimes used. Setting $\beta = 0$ recovers vanilla SGD exactly. The buffer $m_t$ has the same shape as the parameter vector and lives alongside it in memory, doubling the optimiser's state.
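A minimal sketch of the loop in NumPy may help fix the mechanics. Here grad_fn is a placeholder for whatever returns the (possibly stochastic) gradient at the current parameters, and the quadratic at the end is just a stand-in loss, not part of the method itself:

```python
import numpy as np

def heavy_ball_sgd(grad_fn, theta0, lr=0.1, beta=0.9, steps=100):
    """Polyak heavy ball: m <- beta * m + g, then theta <- theta - lr * m."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)              # momentum buffer, same shape as theta
    for _ in range(steps):
        g = grad_fn(theta)                # gradient at the current iterate
        m = beta * m + g                  # exponentially weighted sum of past gradients
        theta = theta - lr * m            # step along the averaged direction
    return theta

# Stand-in loss L(theta) = 0.5 * theta @ A @ theta, so the gradient is A @ theta.
A = np.diag([1.0, 100.0])
print(heavy_ball_sgd(lambda th: A @ th, theta0=[1.0, 1.0], lr=0.01, beta=0.9, steps=500))
```

Setting beta=0 in this sketch recovers plain SGD, and the printed iterate ends up very close to the origin for the toy quadratic.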

To see why this helps, unroll the recursion. After $t$ steps,

$$m_t = g_t + \beta g_{t-1} + \beta^2 g_{t-2} + \dots + \beta^{t-1} g_1.$$

So $m_t$ is a weighted sum of every gradient seen so far, with older gradients down-weighted geometrically. The effective averaging horizon is roughly $1/(1-\beta)$ steps: at $\beta = 0.9$ that is about ten gradients; at $\beta = 0.99$ about a hundred. Components of the gradient that point in a consistent direction across many steps add up; components that flip sign, like the back-and-forth of a narrow valley, partially cancel. The optimiser's effective velocity in directions of agreement is amplified by roughly $1/(1-\beta)$, which is one reason the same nominal learning rate often needs reducing when momentum is switched on.

There is also a useful equivalent form. Subtracting $\theta_t$ from both sides of the parameter update and substituting the buffer recursion gives

$$\theta_{t+1} - \theta_t = \beta(\theta_t - \theta_{t-1}) - \eta\, g_t,$$

which makes the inertial reading explicit: the new step is the previous step scaled by $\beta$, plus a fresh gradient kick. This is also why the method is called "heavy ball": a particle of unit mass in a potential $L(\theta)$ obeys $\ddot\theta = -\nabla L(\theta) - \gamma\, \dot\theta$, and a simple finite-difference discretisation of that ODE produces exactly the recurrence above, with $\beta$ playing the role of $1 - \gamma\Delta t$ and $\eta$ the role of $\Delta t^2$. Larger $\beta$ corresponds to less friction.

The convergence theory is sharp on quadratics. If $L(\theta) = \tfrac12 \theta^\top A \theta$ with eigenvalues bounded between $\lambda_{\min}$ and $\lambda_{\max}$, vanilla gradient descent with its optimal step size converges geometrically at rate $(\kappa-1)/(\kappa+1)$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. Heavy-ball momentum, with optimal step size $\eta = 4/(\sqrt{\lambda_{\max}} + \sqrt{\lambda_{\min}})^2$ and optimal coefficient $\beta = (\sqrt\kappa - 1)^2 / (\sqrt\kappa + 1)^2$, contracts at $(\sqrt\kappa - 1)/(\sqrt\kappa + 1)$. The square root of the condition number is the whole story. For $\kappa = 100$, plain GD shrinks the error by a factor of about $0.98$ per step; momentum shrinks it by about $0.82$. Asymptotically the speedup in iteration count is $O(\sqrt\kappa)$, which on ill-conditioned problems is enormous.

In practice, you do not tune $\beta$ on real networks; you set it to $0.9$ and tune the learning rate. The buffer is initialised at zero, which means the first few steps are biased small; this is a minor issue that Adam later corrects explicitly with its bias-correction term but that heavy-ball momentum simply ignores.

Nesterov accelerated gradient

Nesterov's 1983 modification looks deceptively similar but comes with meaningfully stronger theoretical guarantees. Instead of computing the gradient at the current point $\theta_t$ and then taking a momentum-augmented step, Nesterov first looks ahead in the momentum direction and computes the gradient at the lookahead point:

$$\tilde\theta_t = \theta_t - \eta\beta\, m_{t-1},$$ $$m_t = \beta\, m_{t-1} + g(\tilde\theta_t),$$ $$\theta_{t+1} = \theta_t - \eta\, m_t.$$

The order matters. Heavy-ball momentum applies the inertial step somewhat blindly and corrects with the gradient at the point where it started; Nesterov peeks at where it is about to be and asks the gradient there for advice. If the loss surface is curving away from the momentum direction, the lookahead gradient will spot this earlier and apply a correction before the optimiser overshoots. The change is small in code, perhaps a single line if your optimiser stores the parameter offset, but the consequences for convex problems are substantial.
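A sketch of the lookahead form, under the same assumptions as the heavy-ball sketch above (grad_fn is a placeholder for the gradient of the loss); the only change is where the gradient is evaluated:

```python
import numpy as np

def nesterov_sgd(grad_fn, theta0, lr=0.1, beta=0.9, steps=100):
    """Nesterov momentum, lookahead form: the gradient is evaluated at
    theta - lr * beta * m rather than at theta itself."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta - lr * beta * m   # peek where momentum is about to carry us
        g = grad_fn(lookahead)              # ... and ask the gradient there
        m = beta * m + g
        theta = theta - lr * m
    return theta
```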

For smooth convex functions with Lipschitz-continuous gradient, plain gradient descent achieves $L(\theta_T) - L(\theta^*) = O(1/T)$ after $T$ steps. Nesterov's method achieves $O(1/T^2)$, and a matching lower bound shows this rate is optimal among all first-order methods that only use gradient queries. The improvement is usually called Nesterov acceleration, and Nesterov himself described it in terms of "estimating sequences", a sequence of upper-bounding quadratic models that the iterate is encouraged to track. For strongly convex problems the analogous statement is that the rate moves from $(1 - 1/\kappa)^T$ to $(1 - 1/\sqrt\kappa)^T$, the same square-root improvement that heavy-ball momentum gives on quadratics, but proved in greater generality.

Deep learning is not convex, so we should not expect the convex bound to apply directly. Empirically, the practical benefit of Nesterov over heavy-ball on real networks is real but modest: a small improvement in final accuracy, faster initial loss reduction, and slightly more tolerance to large learning rates. It is not a difference between "works" and "does not work"; it is a difference between "works well" and "works marginally better". Most deep learning frameworks expose Nesterov as a flag (nesterov=True in PyTorch's SGD) and most ImageNet recipes turn it on as a matter of course.

A subtlety worth flagging: the implementation form most often shipped in libraries is not the lookahead form above but Sutskever et al.'s reformulation, which keeps the parameters at $\theta_t$ rather than $\tilde\theta_t$ between iterations. Algebraically the two are equivalent, but the Sutskever form fits the standard "compute gradient at current parameters, update buffer, update parameters" loop better and avoids extra parameter copies.
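The reparameterised loop looks roughly like the sketch below, close to (but not verbatim) what common frameworks implement; the only difference from heavy ball is the extra term applied when the Nesterov flag is set:

```python
def momentum_step(theta, m, g, lr=0.1, beta=0.9, nesterov=True):
    """Sutskever-style step: the gradient g is taken at the current parameters,
    so the surrounding training loop is unchanged from plain SGD."""
    m = beta * m + g                        # update the buffer first
    d = g + beta * m if nesterov else m     # Nesterov folds the lookahead into the step
    return theta - lr * d, m
```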

Worked example

Consider the simplest possible loss, $L(\theta) = \tfrac12 \theta^2$, with gradient $g(\theta) = \theta$ and unique minimum at $\theta^* = 0$. We start at $\theta_0 = 1$ and use $\eta = 0.1$.

Vanilla SGD gives $\theta_{t+1} = \theta_t - 0.1\,\theta_t = 0.9\,\theta_t$, so the iterates are $1, 0.9, 0.81, 0.729, \dots$, a clean exponential decay with rate $0.9$. Reaching $|\theta| < 0.01$ takes about $44$ steps.

Heavy-ball momentum with $\beta = 0.9$ gives the recurrence $m_t = 0.9\, m_{t-1} + \theta_{t-1}$ and $\theta_t = \theta_{t-1} - 0.1\, m_t$, starting from $m_0 = 0$. The first few iterates are roughly $\theta_0 = 1$, $\theta_1 = 0.90$, $\theta_2 = 0.72$, $\theta_3 = 0.49$, $\theta_4 = 0.23$, $\theta_5 = -0.03$. Two things are visible. First, the descent is markedly faster than vanilla SGD: by step $5$ the iterate is already nearly at zero. Second, the iterate has overshot: $\theta_5$ is slightly negative. The momentum buffer was carrying the optimiser downward, and when the gradient flipped sign at the minimum the buffer's inertia kept it going for an extra step.

What follows is a damped oscillation. The iterates swing through the minimum and back, with the envelope shrinking by a factor of roughly $\sqrt{\beta} \approx 0.95$ per step. Here the overshoot is not free: on this perfectly conditioned one-dimensional problem, $\beta = 0.9$ is too much momentum, and the ringing means the iterate only settles inside $|\theta| < 0.01$ after roughly $80$ steps, nearly twice as long as vanilla SGD takes. A lighter coefficient such as $\beta = 0.5$ damps the ringing and settles in about $15$ steps (momentum here is compensating for the conservative step size), while $\beta = 0.99$ rings for hundreds of steps. The decisive payoff promised by the theory above appears only when the problem is ill-conditioned, which is exactly the situation of the next paragraph.
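These settling counts are easy to reproduce. A short script, assuming the same $\eta = 0.1$, $\theta_0 = 1$ and tolerance as above, that reports the first step after which the iterate stays inside $|\theta| < 0.01$ (beta=0.0 recovers vanilla SGD):

```python
def settle_step(beta, lr=0.1, theta0=1.0, tol=1e-2, horizon=500):
    """First step after which |theta| stays below tol for the rest of the run."""
    theta, m, history = theta0, 0.0, []
    for _ in range(horizon):
        g = theta                      # gradient of 0.5 * theta**2
        m = beta * m + g
        theta = theta - lr * m
        history.append(theta)
    last_bad = max((i for i, th in enumerate(history) if abs(th) >= tol), default=-1)
    return last_bad + 2                # convert index to step count, then move past it

print(settle_step(beta=0.0))    # vanilla SGD: 44
print(settle_step(beta=0.9))    # beta = 0.9 rings: roughly 80
print(settle_step(beta=0.5))    # beta = 0.5: about 15
```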

Nesterov on the same problem behaves similarly but with smaller overshoot, because the lookahead gradient at $\tilde\theta = \theta - 0.1 \cdot 0.9\, m$ already reflects that we are about to cross zero. The damping is faster and the trajectory smoother. On this particular loss the difference is barely visible to the eye, but on a two-dimensional ill-conditioned quadratic, say $L(\theta_1, \theta_2) = \tfrac12(\theta_1^2 + 100\,\theta_2^2)$, the contrast becomes dramatic. Vanilla SGD zig-zags violently across the high-curvature $\theta_2$ direction while inching along the low-curvature $\theta_1$ direction. Heavy-ball momentum damps the zig-zag and accelerates progress along $\theta_1$. Nesterov damps it slightly more aggressively, because the lookahead gradient reacts to the curvature a step earlier, and in practice it is often reported to tolerate somewhat larger learning rates before training destabilises.
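The two-dimensional contrast can be checked with a few lines. A sketch, with the learning rate chosen small enough that all three updates remain stable on this quadratic; the printed norms are distances from the optimum at the origin after a fixed budget of steps:

```python
import numpy as np

A = np.diag([1.0, 100.0])          # the ill-conditioned quadratic from the text, kappa = 100
theta0 = np.array([1.0, 1.0])
lr, beta, T = 0.01, 0.9, 200       # lr kept well inside the stability region

def grad(theta):
    return A @ theta               # gradient of 0.5 * (theta1**2 + 100 * theta2**2)

def run(momentum=True, nesterov=False):
    theta, m = theta0.copy(), np.zeros_like(theta0)
    for _ in range(T):
        point = theta - lr * beta * m if nesterov else theta
        g = grad(point)
        m = beta * m + g if momentum else g
        theta = theta - lr * m
    return np.linalg.norm(theta)

print("vanilla SGD :", run(momentum=False))   # still far from the optimum
print("heavy ball  :", run())                 # orders of magnitude closer
print("Nesterov    :", run(nesterov=True))    # closer still
```

The bottleneck for vanilla SGD is the flat $\theta_1$ direction, where its step size is capped by the steep $\theta_2$ direction; momentum accumulates speed along $\theta_1$ without destabilising $\theta_2$.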

Practical use

Two main recipes dominate. First, SGD with Nesterov momentum at $\beta = 0.9$ remains the default for image classification on ImageNet-scale convolutional networks: ResNets, EfficientNets, ConvNeXts. The recipe is mature: large initial learning rate (often $0.1$ for batch size $256$, scaled linearly with batch size as in §10.9), cosine or step-decay schedule, weight decay around $10^{-4}$, momentum $0.9$, Nesterov on. It generalises slightly better than Adam on these tasks for reasons that are still debated, possibly related to the implicit regularisation effects discussed in §10.14.
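In PyTorch the recipe above amounts to a handful of lines. The model and epoch count here are placeholders; only the optimiser and schedule lines reflect the recipe described:

```python
import torch

model = torch.nn.Linear(2048, 1000)   # stand-in for a ResNet-style network
epochs = 90

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,                 # for batch size 256; scale linearly with batch size
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```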

Second, Adam and its descendants absorb momentum as a built-in component. Adam maintains both a first-moment estimate $m_t$ (the same exponentially weighted average of gradients we have just discussed, with $\beta_1 = 0.9$) and a second-moment estimate $v_t$ (an exponentially weighted average of squared gradients, with $\beta_2 = 0.999$). The first moment is momentum; the second provides per-parameter adaptive scaling. AdamW additionally decouples weight decay from the gradient. For training transformers, diffusion models, and most modern non-image architectures, AdamW is the default, and it inherits all the benefits of momentum while adding the adaptive scaling that the next section will introduce in detail.
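A sketch of one Adam step makes the correspondence concrete. Note that Adam's first moment includes a $(1-\beta_1)$ factor, so it is a weighted average of gradients rather than the weighted sum used by heavy ball; names and defaults below follow the standard description rather than any particular library:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; the step counter t starts at 1."""
    m = beta1 * m + (1 - beta1) * g          # first moment: the momentum buffer
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment: squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias corrections for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```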

A few practical notes worth remembering. Increasing $\beta$ generally requires decreasing $\eta$ to maintain stability; the rough heuristic is that the "effective" learning rate scales as $\eta/(1-\beta)$. The momentum buffer doubles the optimiser's memory footprint, and Adam-style optimisers triple it; for very large models this matters and is part of why ZeRO and FSDP shard optimiser state across GPUs (§10.12). Finally, when restarting training from a checkpoint, remember to save and reload $m_t$ along with $\theta_t$: a fresh momentum buffer means a noisy first few hundred steps, which can occasionally destabilise fine-tuning runs.
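On the checkpointing point: in PyTorch-style code the momentum (or Adam moment) buffers live in the optimiser's state dict, so both dictionaries need to be saved and restored. A minimal sketch, with a toy model and file name standing in for the real ones:

```python
import torch

model = torch.nn.Linear(8, 2)                                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Saving: the optimiser state_dict carries the momentum buffer alongside lr etc.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Restoring: reload both; loading only the model would leave a fresh, zeroed buffer.
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
```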

What you should take away

  1. Momentum maintains an exponentially weighted average of past gradients, $m_t = \beta\, m_{t-1} + g_t$, and updates parameters via $\theta_{t+1} = \theta_t - \eta\, m_t$. Default $\beta = 0.9$.
  2. On ill-conditioned quadratics, momentum improves the convergence rate from $O(\kappa)$ steps to $O(\sqrt\kappa)$ steps; on real networks the speedup is smaller but reliably present.
  3. Nesterov's variant evaluates the gradient at a lookahead point and achieves the optimal $O(1/T^2)$ rate for smooth convex problems, against $O(1/T)$ for plain gradient descent.
  4. Practical recipes split cleanly: SGD with Nesterov momentum for ImageNet-style image classification; Adam or AdamW (with momentum built-in) for almost everything else, including transformers and diffusion models.
  5. Momentum doubles optimiser state, interacts with the learning rate (raising $\beta$ usually means lowering $\eta$), and must be saved with the parameters when checkpointing.
