Double descent describes a non-monotonic relationship between model capacity and test error first systematically documented by Belkin, Hsu, Ma and Mandal (2019). As the number of parameters $p$ increases relative to the number of training samples $N$, the test error follows two regimes:
- Classical regime ($p < N$). Test error follows the textbook U-shape: bias dominates for small $p$, variance dominates as $p$ approaches $N$, and the optimum lies in the interior.
- Modern regime ($p > N$). As $p$ increases past the interpolation threshold $p \approx N$, test error spikes sharply, then descends a second time, sometimes to well below the classical optimum.
The combined curve has two descents, hence "double descent".
Where the spike comes from. At $p \approx N$, the linear system $\Phi \theta = y$ (or its nonlinear analogue) has approximately as many parameters as constraints, so the unique interpolating solution is highly sensitive to label noise: the smallest singular value of $\Phi$ approaches zero, and the squared norm of the minimum-norm solution, $\|\theta\|^2$, diverges near the threshold. Past the threshold, infinitely many interpolating solutions exist, and the implicit regularisation of gradient descent (initialised at or near zero) selects the one with smallest norm, which generalises better; the sketch below illustrates both effects.
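A minimal sketch, assuming NumPy and a misspecified linear model in which only the first $p$ of $d$ signal-bearing features are used for the fit (the sizes, noise level, and seed are illustrative, not from the source): both the average test error and $\|\theta\|^2$ of the minimum-norm fit peak near $p = N$ and descend again beyond it.

```python
# Minimum-norm least squares as the feature count p sweeps through N:
# test error and the norm of the fitted coefficients both spike at p ~ N.
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, trials = 40, 100, 0.5, 20   # train size, total features, noise, repeats

for p in [5, 10, 20, 35, 40, 45, 60, 80, 100]:
    mses, norms = [], []
    for _ in range(trials):
        w_true = rng.normal(size=d) / np.sqrt(d)          # signal on all d features
        X_tr, X_te = rng.normal(size=(N, d)), rng.normal(size=(500, d))
        y_tr = X_tr @ w_true + sigma * rng.normal(size=N)
        y_te = X_te @ w_true
        # pinv gives the ordinary least-squares fit for p < N and the
        # minimum-norm interpolant for p > N.
        theta = np.linalg.pinv(X_tr[:, :p]) @ y_tr
        mses.append(np.mean((X_te[:, :p] @ theta - y_te) ** 2))
        norms.append(float(theta @ theta))
    print(f"p={p:3d}  mean test MSE={np.mean(mses):10.3f}  mean ||theta||^2={np.mean(norms):12.3f}")
```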
Analytical form for linear models. In an isotropic Gaussian linear model with signal strength $r^2 = \|\beta\|^2$ and noise variance $\sigma^2$, the asymptotic test risk of the minimum-norm least-squares interpolant under proportional scaling $p/N \to \gamma$ is
$$R(\gamma) = \begin{cases} \sigma^2 \dfrac{\gamma}{1 - \gamma}, & \gamma < 1, \\[4pt] r^2 \left(1 - \dfrac{1}{\gamma}\right) + \sigma^2 \dfrac{1}{\gamma - 1}, & \gamma > 1, \end{cases}$$
diverging as $\gamma \to 1$ from either side and descending a second time past the threshold towards the null risk $r^2$ as $\gamma \to \infty$. Belkin et al. and Hastie, Montanari, Rosset and Tibshirani (2022) derived closed-form expressions of this kind, and analogous results hold for random-feature models.
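To make the shape concrete, a minimal sketch (not from any cited paper; the values $r^2 = 1$ and $\sigma^2 = 0.25$ are illustrative) tabulating $R(\gamma)$:

```python
# Tabulate the asymptotic risk R(gamma) above for illustrative r^2, sigma^2.
def asymptotic_risk(gamma: float, r2: float = 1.0, sigma2: float = 0.25) -> float:
    if gamma < 1:                                        # underparameterised: variance only
        return sigma2 * gamma / (1 - gamma)
    return r2 * (1 - 1 / gamma) + sigma2 / (gamma - 1)   # bias + variance

for gamma in [0.25, 0.5, 0.9, 0.99, 1.01, 1.1, 2.0, 5.0, 50.0]:
    print(f"gamma = {gamma:6.2f}   R = {asymptotic_risk(gamma):9.3f}")
```

The printed values show the divergence on both sides of $\gamma = 1$ and the second descent; for these parameters the overparameterised minimum sits at $\gamma = 1/(1 - \sigma/r) = 2$.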
Sample-wise double descent. Holding $p$ fixed, test error can be non-monotonic in $N$: adding training examples sometimes hurts. Nakkiran et al. (2020) documented this in deep networks on CIFAR-10 and CIFAR-100, and showed that explicit regularisation can mitigate it.
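A minimal sketch of the same effect in minimum-norm linear regression, assuming NumPy (the sizes and noise level are illustrative; this does not reproduce the deep-network results above): with $p = 50$ features fixed, average test error first rises as $N$ grows towards $p$, spikes near $N = p$, and only then falls.

```python
# Sample-wise double descent: at fixed p, more data can hurt until N > p.
import numpy as np

rng = np.random.default_rng(1)
p, sigma, trials = 50, 0.5, 20
X_te = rng.normal(size=(2000, p))

for N in [10, 25, 40, 48, 50, 52, 60, 100, 400]:
    mses = []
    for _ in range(trials):
        w_true = rng.normal(size=p) / np.sqrt(p)
        X_tr = rng.normal(size=(N, p))
        y_tr = X_tr @ w_true + sigma * rng.normal(size=N)
        theta = np.linalg.pinv(X_tr) @ y_tr       # min-norm / least-squares fit
        mses.append(np.mean((X_te @ theta - X_te @ w_true) ** 2))
    print(f"N={N:4d}  mean test MSE={np.mean(mses):9.3f}")
```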
Epoch-wise double descent. During training, test error of overparameterised networks can rise then fall as a function of training epochs, mirroring the model-size curve.
Implications.
- Interpolation can generalise. The classical wisdom that fitting training data perfectly causes overfitting fails in the modern regime: neural networks trained to zero training loss can still generalise well.
- Bias-variance tradeoff is incomplete. The classical decomposition still holds at any fixed $p$, but the curve as a function of $p$ no longer has a single optimum.
- Optimisation matters. In the overparameterised regime, the algorithm, not the model class, selects which interpolating solution is found (see the sketch after this list).
- Capacity is the wrong axis. Effective complexity in the modern regime is governed by norm, margin, or sharpness of the solution rather than parameter count.
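On the optimisation point, a minimal sketch assuming NumPy (sizes are illustrative): for an underdetermined least-squares problem, gradient descent from zero initialisation converges to the minimum-norm interpolant, while other zero-training-error solutions with much larger norm exist in the same model class.

```python
# Gradient descent from zero init selects the minimum-norm interpolant.
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 100                           # p > N: infinitely many interpolants
Phi = rng.normal(size=(N, p))
y = rng.normal(size=N)

theta = np.zeros(p)                      # zero initialisation is what matters
lr = 0.9 / np.linalg.norm(Phi, 2) ** 2   # step size below 1/L (L = spectral norm^2)
for _ in range(5000):
    theta -= lr * Phi.T @ (Phi @ theta - y)   # gradient of 0.5 ||Phi theta - y||^2

theta_min = np.linalg.pinv(Phi) @ y      # closed-form minimum-norm interpolant
# Another exact interpolant: add a null-space direction of Phi.
e0 = np.eye(p)[0]
null_dir = e0 - np.linalg.pinv(Phi) @ (Phi @ e0)   # part of e0 outside the row space
theta_other = theta_min + 5.0 * null_dir

print("||GD - min-norm||    :", np.linalg.norm(theta - theta_min))      # ~ 0
print("||theta_min||        :", np.linalg.norm(theta_min))
print("||theta_other||      :", np.linalg.norm(theta_other))            # much larger
print("theta_other residual :", np.linalg.norm(Phi @ theta_other - y))  # still ~ 0
```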
Double descent has been observed in linear regression, kernel methods, random-feature models, decision trees, and deep networks. It motivated a substantial reorientation of statistical learning theory away from uniform-convergence bounds based on parameter count and towards norm-based, algorithmic-stability, and PAC-Bayes bounds that adapt to the actual solution found.
Related terms: Bias-Variance Tradeoff, Implicit Regularisation, Neural Tangent Kernel, Statistical Learning Theory, Regularisation
Discussed in:
- Chapter 6: ML Fundamentals, Generalisation in Deep Learning