10.14 Double descent and implicit regularisation

The classical statistical learning view of generalisation says that as model capacity increases, test error first decreases (the model fits better) and then increases (the model overfits). The optimal capacity sits at a "sweet spot" where bias and variance balance: the bottom of the familiar U-shaped bias-variance curve.

Modern deep learning violates this picture. Networks with billions of parameters trained on millions of examples have orders-of-magnitude more parameters than data points, yet generalise well. Why?

The double descent curve

Belkin et al. (2019) and Nakkiran et al. (2019) showed that the test-error curve has a second descent beyond the classical one:

  • Under-parameterised: test error follows the classical U-shape, falling as capacity grows and then rising as capacity approaches the size of the training set.
  • Interpolation threshold: capacity roughly matches the number of training points, the model exactly interpolates the training set, and test error spikes.
  • Over-parameterised: capacity exceeds the data, and test error decreases again, sometimes below the classical minimum.

The interpolation threshold is the spike in the middle. Past it, more capacity helps. This is the regime modern deep learning operates in.
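
The spike is easy to reproduce in a toy setting. The sketch below fits minimum-norm least squares on random ReLU features and sweeps the feature count past the number of training examples; the target function, noise level, and feature counts are arbitrary illustrative choices. Test error typically rises sharply near the interpolation threshold and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy double-descent experiment: minimum-norm least squares on random ReLU features.
# Sweep the number of features past the number of training points and watch test error.
n_train, n_test, d = 40, 500, 5

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ np.ones(d) + 0.5 * rng.normal(size=n)    # noisy linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for n_features in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
    W = rng.normal(size=(d, n_features))             # fixed random first layer
    Phi_tr = np.maximum(X_tr @ W, 0.0)               # random ReLU features
    Phi_te = np.maximum(X_te @ W, 0.0)
    # lstsq returns the minimum-norm solution when the system is under-determined,
    # i.e. the interpolator that gradient descent from zero would also find.
    beta = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]
    test_mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"{n_features:5d} features   test MSE {test_mse:10.3f}")
```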

Why does over-parameterisation help?

In the over-parameterised regime, infinitely many parameter settings interpolate the training set. The set of interpolating $\theta$ forms a high-dimensional manifold. SGD does not pick one of these at random; it converges to a specific point on the manifold, determined by the optimiser dynamics and the initialisation.

This is implicit regularisation: the optimiser silently picks "simple" interpolators. Specifically, gradient descent on over-parameterised linear regression with squared loss, started from zero, converges to the minimum-norm interpolator (more generally, to the interpolator closest to the initialisation). For deep networks the picture is more complex but qualitatively similar: gradient descent prefers low-rank, low-frequency, or low-Lipschitz solutions, depending on architecture and data.
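
A minimal numerical check of that claim (the problem sizes, step size, and iteration count are arbitrary choices): run plain gradient descent from a zero initialisation on an over-parameterised least-squares problem and compare the result with the pseudoinverse solution, which is the minimum-norm interpolator in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterised linear regression: 20 examples, 100 parameters, so infinitely
# many weight vectors fit the data exactly. Gradient descent started from zero
# converges to the minimum-L2-norm interpolator, which the pseudoinverse computes directly.
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)
lr = 0.01
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n          # gradient of 0.5 * mean squared error

w_min_norm = np.linalg.pinv(X) @ y           # minimum-norm interpolator, closed form

print("training residual:", np.linalg.norm(X @ w - y))                   # ~0: we interpolate
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0: same interpolator
```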

NTK and infinite-width limits

The Neural Tangent Kernel (Jacot et al. 2018) gives a rigorous handle on this. In the infinite-width limit, gradient descent on a neural network behaves as kernel regression with a fixed kernel (the NTK). For wide enough networks, training is effectively convex and the implicit-regularisation argument becomes a theorem.
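
To make the kernel concrete, here is a sketch for a small one-hidden-layer ReLU network: the empirical NTK is just the inner product of parameter gradients at two inputs, and it can be plugged straight into kernel regression. The width, the toy sine target, and the small jitter added for numerical stability are illustrative choices, and the sketch ignores the network's initial output, which the full theorem accounts for.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network f(x) = v . relu(w*x + b) / sqrt(m), scalar input and output.
m = 2048                                   # hidden width; large, so the kernel is near its limit
w, b, v = (rng.normal(size=m) for _ in range(3))

def param_grad(x):
    """Gradient of f(x) with respect to all parameters (w, b, v), flattened."""
    pre = w * x + b
    act = np.maximum(pre, 0.0)             # relu activations
    ind = (pre > 0).astype(float)          # relu derivative
    dw = v * ind * x / np.sqrt(m)
    db = v * ind / np.sqrt(m)
    dv = act / np.sqrt(m)
    return np.concatenate([dw, db, dv])

def ntk(x1, x2):
    """Empirical NTK: inner product of parameter gradients at two inputs."""
    return param_grad(x1) @ param_grad(x2)

# Kernel regression with the empirical NTK on a toy 1-d problem.
x_tr = np.linspace(-1.0, 1.0, 10)
y_tr = np.sin(3.0 * x_tr)
K = np.array([[ntk(a, c) for c in x_tr] for a in x_tr])
alpha = np.linalg.solve(K + 1e-6 * np.eye(len(x_tr)), y_tr)   # small jitter for stability

def predict(x):
    return sum(a * ntk(x, c) for a, c in zip(alpha, x_tr))

print(predict(0.3), np.sin(0.9))           # kernel prediction vs. the target at a new point
```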

Lottery tickets

The lottery ticket hypothesis (Frankle and Carbin, 2019): a randomly initialised, over-parameterised network contains a sub-network (roughly $10$–$20\%$ of the weights) that, trained in isolation from its original initialisation, matches the full network's accuracy. The "winning ticket" sub-network is identified by training, pruning the smallest-magnitude weights, and rewinding the remaining weights to their initial values.
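
The prune-and-rewind loop is short enough to sketch directly. The version below uses PyTorch and assumes a `train` helper that runs the usual optimisation loop while keeping masked weights fixed at zero; the round count and per-round pruning fraction are illustrative defaults, not a faithful reproduction of the original experiments.

```python
import copy
import torch

def lottery_ticket(make_model, train, rounds=5, prune_frac=0.2):
    """Iterative magnitude pruning with rewinding to the original initialisation.

    `make_model` builds a fresh network; `train` is an assumed helper that runs the
    usual training loop while keeping masked (zeroed) weights fixed at zero.
    """
    model = make_model()
    init_state = copy.deepcopy(model.state_dict())             # theta_0, kept for rewinding
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}

    for _ in range(rounds):
        train(model, masks)                                     # 1. train the current sub-network
        with torch.no_grad():
            for n, p in model.named_parameters():               # 2. prune the smallest surviving weights
                surviving = p[masks[n].bool()].abs()
                threshold = torch.quantile(surviving, prune_frac)
                masks[n] = masks[n] * (p.abs() > threshold).float()
            model.load_state_dict(init_state)                   # 3. rewind to the initial weights
            for n, p in model.named_parameters():
                p.mul_(masks[n])                                # keep only the winning ticket
    return model, masks
```

With the masks in hand, the sub-network can be retrained from `init_state` on its own and compared against the dense baseline.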

This reframes over-parameterisation as a search problem: we train large not because we need the capacity, but because more parameters mean more candidate tickets, and a better chance that one of them is a winner.

Practical takeaways

  • Don't be afraid of over-parameterisation. A network with many more parameters than data points can still generalise, sometimes better than a smaller one.
  • Watch the interpolation threshold. If you happen to be near it, results can be highly unstable. Push past it (or stay well below it).
  • Initialisation and optimiser matter. Implicit regularisation depends on both: switching from SGD to Adam silently changes which interpolator you converge to, as the sketch below illustrates.
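
As a small illustration of that last point (the problem sizes, learning rates, and step counts below are arbitrary): on an over-parameterised linear problem, full-batch gradient descent from zero ends up at the minimum-norm interpolator, while Adam, driven to a small training residual, typically stops at a measurably different point on the interpolating set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Over-parameterised linear regression again: both optimisers drive the training
# residual down, but they do not land on the same interpolator.
n, d = 20, 100
X, y = rng.normal(size=(n, d)), rng.normal(size=n)

def grad(w):
    return X.T @ (X @ w - y) / n                     # gradient of 0.5 * mean squared error

w_gd = np.zeros(d)
for _ in range(50_000):
    w_gd -= 0.01 * grad(w_gd)                        # plain gradient descent from zero

w_adam, m1, m2 = np.zeros(d), np.zeros(d), np.zeros(d)
beta1, beta2, eps = 0.9, 0.999, 1e-8
for t in range(1, 100_001):
    g = grad(w_adam)
    m1 = beta1 * m1 + (1 - beta1) * g
    m2 = beta2 * m2 + (1 - beta2) * g**2
    m1_hat, m2_hat = m1 / (1 - beta1**t), m2 / (1 - beta2**t)
    w_adam -= (0.01 / np.sqrt(t)) * m1_hat / (np.sqrt(m2_hat) + eps)   # Adam, decaying step size

w_star = np.linalg.pinv(X) @ y                       # minimum-norm interpolator
for name, w in [("GD", w_gd), ("Adam", w_adam)]:
    print(f"{name:5s} residual {np.linalg.norm(X @ w - y):.2e}   "
          f"distance to min-norm {np.linalg.norm(w - w_star):.3f}")
```

If the printed distances differ noticeably, the two optimisers have converged to different interpolators of the same training data.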
