6.12 The double-descent phenomenon
The classical bias–variance picture predicts that test error rises catastrophically once a model is rich enough to interpolate the training data. For decades, this was ML orthodoxy.
In 2019, Belkin, Hsu, Ma, and Mandal showed that the picture is incomplete. As model complexity increases, test error follows the classical U-shape up to the interpolation threshold (the point where the model has just enough parameters to fit the training data exactly), peaks there, and then falls again, sometimes below the classical optimum. The shape of the curve is a "double descent": a second descent occurs in the overparameterised regime.
The empirical picture
Plot test error against the parameter-to-sample ratio $p/n$ for, say, a fully connected network on MNIST or CIFAR-10. You typically see the following (a simulation sketch follows the list):
- $p/n \ll 1$: classical regime; test error follows the U-shape of bias–variance.
- $p/n \approx 1$: a sharp peak; the model has just enough capacity to fit the data with no slack, so noise in the labels is amplified. The fit is also numerically ill-conditioned in this regime.
- $p/n \gg 1$: the modern regime; there are infinitely many zero-training-error solutions, and the inductive bias of the optimiser (typically SGD) selects one with small norm. Test error typically decreases again as capacity grows.
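The curve is easy to reproduce in a toy setting. The sketch below (our illustration, not an experiment from this chapter) regresses noisy labels on random ReLU features and fits with the pseudoinverse, which gives the ordinary least-squares fit for $p < n$ and the min-norm interpolator for $p > n$; sweeping $p$ across $n$ typically shows the peak near $p/n \approx 1$ and the second descent beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_test, d = 40, 2000, 20                 # train size, test size, input dim
X = rng.standard_normal((n, d))
X_test = rng.standard_normal((n_test, d))
w_true = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_true + 0.5 * rng.standard_normal(n)   # noisy training labels
y_test = X_test @ w_true                        # clean test targets

def features(X, W):
    """Random ReLU feature map; the width of W is the parameter knob p."""
    return np.maximum(X @ W, 0.0)

for p in [5, 10, 20, 40, 80, 160, 640]:
    W = rng.standard_normal((d, p)) / np.sqrt(d)
    Phi, Phi_test = features(X, W), features(X_test, W)
    # pinv returns the least-squares fit for p < n and the
    # minimum-norm interpolator for p > n.
    w_hat = np.linalg.pinv(Phi) @ y
    test_mse = np.mean((Phi_test @ w_hat - y_test) ** 2)
    print(f"p/n = {p / n:5.2f}   test MSE = {test_mse:9.4f}")
```

The exact heights depend on the seed and the noise level; averaging over several random feature draws makes the shape clearer.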
This is consistent with the empirical observation that real deep networks have far more parameters than training examples and yet generalise: GPT-3 has 175 B parameters trained on 300 B tokens; in 2025-26 terms, Llama 3 70B was trained on 15 trillion tokens, and DeepSeek-V3 (671 billion total / 37 billion active parameters) on 14.8 trillion tokens.
Why does it happen?
The intuition: among the infinitely many zero-error solutions in the overparameterised regime, the one found by SGD (especially with weight decay) tends to have small norm, and small-norm interpolators are smooth and generalise well. The min-norm interpolating solution to least squares, $\hat w = X^\top (XX^\top)^{-1} y$, often has lower test error when $p > n$ than the classical regularised estimate. Bartlett, Long, Lugosi, and Tsigler (2020) made this precise as benign overfitting: under suitable spectral conditions on $X$, the min-norm interpolator generalises despite fitting the noise.
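As a sanity check on this formula, the following snippet (ours) verifies numerically that $\hat w = X^\top (XX^\top)^{-1} y$ interpolates, coincides with the pseudoinverse solution, and is shorter than any other interpolator obtained by adding a null-space direction:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                                  # overparameterised: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# The min-norm interpolator, exactly as written in the text.
w_min = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ w_min, y)                    # fits the data exactly
assert np.allclose(w_min, np.linalg.pinv(X) @ y)    # equals the pseudoinverse fit

# Any other interpolator differs from w_min by a null-space direction,
# which is orthogonal to w_min, so its norm is strictly larger.
v = rng.standard_normal(p)
v -= X.T @ np.linalg.solve(X @ X.T, X @ v)          # project v onto null(X)
w_other = w_min + v
assert np.allclose(X @ w_other, y)                  # still interpolates
assert np.linalg.norm(w_other) > np.linalg.norm(w_min)
print("min-norm property verified")
```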
We do not have a complete theory. The neural-tangent-kernel (NTK) picture (Jacot, Gabriel, and Hongler, 2018) explains the very-wide-network limit, but real networks are not in that limit. The lottery ticket hypothesis (Frankle and Carbin, 2018) gives a complementary picture: dense overparameterised networks contain sparse subnetworks that, when trained in isolation, would match the dense network's accuracy.
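The NTK statement is easy to see empirically: the tangent kernel computed at random initialisation fluctuates at small width but concentrates around a deterministic limit as width grows. The sketch below (our toy construction, not code from the paper) evaluates the kernel entry in closed form for a one-hidden-layer ReLU net $f(x) = a^\top \mathrm{relu}(Wx)/\sqrt{m}$:

```python
import numpy as np

rng = np.random.default_rng(2)

def ntk_entry(x1, x2, W, a):
    """Tangent-kernel entry grad_theta f(x1) . grad_theta f(x2)
    for f(x) = a @ relu(W @ x) / sqrt(m)."""
    m = a.size
    h1, h2 = W @ x1, W @ x2
    k_a = np.maximum(h1, 0) @ np.maximum(h2, 0) / m               # df/da terms
    k_W = ((a ** 2) * (h1 > 0) * (h2 > 0)).sum() * (x1 @ x2) / m  # df/dW terms
    return k_a + k_W

x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
for m in [10, 100, 1000, 100_000]:
    vals = [ntk_entry(x1, x2, rng.standard_normal((m, 5)), rng.standard_normal(m))
            for _ in range(3)]
    print(f"width {m:>7}: kernel over 3 random inits = {np.round(vals, 3)}")
```

At small widths the three initialisations give visibly different kernel values; at large widths they agree to several decimal places, which is the concentration the NTK theory describes.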
The practical lesson is clear: do not be afraid to use models with more parameters than data points, provided you have an optimiser and regulariser whose implicit bias selects well-behaved solutions.
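A related caveat: explicit regularisation also tames the interpolation peak itself. In a toy linear setting (ours, for illustration), a small ridge penalty at the threshold $p \approx n$ cuts test error dramatically relative to the unregularised fit:

```python
import numpy as np

rng = np.random.default_rng(3)
n = p = 50                                   # at the interpolation threshold
X = rng.standard_normal((n, p))
w_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ w_true + 0.5 * rng.standard_normal(n)
X_test = rng.standard_normal((2000, p))
y_test = X_test @ w_true

for lam in [0.0, 1e-3, 1e-1, 1.0]:
    # ridge estimate; lam = 0 recovers the (ill-conditioned) interpolator
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mse = np.mean((X_test @ w_hat - y_test) ** 2)
    print(f"lambda = {lam:6.3f}   test MSE = {mse:12.4f}")
```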