The bias-variance tradeoff decomposes the expected squared prediction error of an estimator into interpretable components. Let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat f$ be an estimator trained on a random sample $S$. The expected squared error at a test point $x$ is
$$\mathbb{E}_{S, \varepsilon}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}_S[\hat f(x)] - f(x)\big)^2}_{\mathrm{Bias}[\hat f(x)]^2} + \underbrace{\mathbb{E}_S\big[\big(\hat f(x) - \mathbb{E}_S[\hat f(x)]\big)^2\big]}_{\mathrm{Var}[\hat f(x)]} + \sigma^2.$$
- Bias measures systematic error: how far the average prediction (over training samples) is from the truth. High bias indicates underfitting (the model class is too restrictive).
- Variance measures sensitivity to the particular training sample: how much $\hat f(x)$ fluctuates as $S$ is resampled. High variance indicates overfitting (the model is too flexible).
- Irreducible noise $\sigma^2$ is independent of the model.
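The decomposition follows from two add-and-subtract steps. A sketch of the derivation, writing $\bar f(x) = \mathbb{E}_S[\hat f(x)]$ as shorthand (notation introduced here, not used elsewhere in this entry) and assuming the test noise $\varepsilon$ is independent of the training sample $S$:
$$\begin{aligned}
\mathbb{E}_{S,\varepsilon}\big[(y - \hat f(x))^2\big]
&= \mathbb{E}_S\big[(f(x) - \hat f(x))^2\big] + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}_S\big[f(x) - \hat f(x)\big] + \mathbb{E}[\varepsilon^2] \\
&= \mathbb{E}_S\big[(f(x) - \hat f(x))^2\big] + \sigma^2, \\
\mathbb{E}_S\big[(f(x) - \hat f(x))^2\big]
&= \big(f(x) - \bar f(x)\big)^2 + 2\big(f(x) - \bar f(x)\big)\,\mathbb{E}_S\big[\bar f(x) - \hat f(x)\big] + \mathbb{E}_S\big[(\bar f(x) - \hat f(x))^2\big] \\
&= \big(\bar f(x) - f(x)\big)^2 + \mathbb{E}_S\big[(\hat f(x) - \bar f(x))^2\big].
\end{aligned}$$
Both cross terms vanish: the first because $\mathbb{E}[\varepsilon] = 0$, the second because $\mathbb{E}_S[\bar f(x) - \hat f(x)] = 0$ by the definition of $\bar f$.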
Classical implication. As model capacity increases:
- bias decreases (more flexibility to capture $f$),
- variance increases (more sensitivity to noise in $S$),
- their sum is U-shaped, with an optimal interior capacity.
This U-curve motivated the entire framework of model selection: cross-validation, AIC, BIC, structural risk minimisation, regularisation paths. The tradeoff implies that picking too rich a model class is harmful, and that fitting the training data perfectly (zero training error, $\hat R = 0$) is dangerous.
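The classical picture can be checked numerically. The following is a minimal Monte Carlo sketch, not a definitive experiment: it assumes a hypothetical ground truth $f(x) = \sin(2\pi x)$, a fixed equispaced design in which only the noise is resampled, and polynomial degree as the capacity knob. The function names and default parameters are illustrative choices.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def bias_variance_by_degree(degrees, n_train=30, n_trials=300, sigma=0.3, seed=0):
    """Estimate squared bias and variance of polynomial least-squares fits.

    Assumption: inputs are fixed and equispaced; each "training sample" S
    differs only in its noise draw.
    Returns {degree: (bias_squared, variance)}, averaged over test points.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.05, 0.95, 50)
    results = {}
    for d in degrees:
        preds = np.empty((n_trials, x_test.size))
        for t in range(n_trials):
            y_train = true_f(x_train) + rng.normal(0.0, sigma, n_train)
            coefs = P.polyfit(x_train, y_train, d)   # least-squares fit of degree d
            preds[t] = P.polyval(x_test, coefs)      # predictions at the test points
        # Squared bias: (average prediction - truth)^2, averaged over test points.
        bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
        # Variance: spread of predictions across training samples.
        var = np.mean(preds.var(axis=0))
        results[d] = (float(bias2), float(var))
    return results


if __name__ == "__main__":
    for degree, (b2, v) in bias_variance_by_degree([1, 4, 9]).items():
        print(f"degree={degree}  bias^2={b2:.4f}  variance={v:.4f}")
```

Under these assumptions the estimated squared bias should shrink and the estimated variance should grow as the degree increases, which is the pattern the bullet list describes.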
Where it breaks down. Belkin et al. (2019) documented double descent: as capacity is pushed past the interpolation threshold $p \approx N$ (the number of parameters $p$ roughly equal to the number of training points $N$), test error first spikes and then descends a second time, often falling below the classical optimum. The classical decomposition still holds at any fixed capacity (bias and variance remain well-defined), but the behaviour of variance in the overparameterised regime contradicts the classical narrative:
- At interpolation ($p \approx N$), variance spikes and can diverge because the fit is barely determined: the system is close to singular, so small changes in the sample produce large changes in the solution.
- In the heavily overparameterised regime ($p \gg N$), variance decreases because the implicit regularisation of gradient descent favours minimum-norm interpolating solutions; the effective complexity is governed by norm, not parameter count.
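The variance behaviour in the two bullets above can be reproduced in a small linear model. This is a sketch under stated assumptions: synthetic isotropic Gaussian features, a hypothetical true coefficient vector `beta`, and a model that uses only the first `p` features. The minimum-norm solution is computed directly with `np.linalg.pinv`, standing in for gradient descent from zero initialisation, which reaches the same solution for linear least squares. All names and defaults are illustrative.

```python
import numpy as np


def min_norm_variance(p_values, n_train=30, d_total=300, sigma=0.5,
                      n_trials=200, n_test=50, seed=0):
    """Estimate prediction variance of min-norm least squares using p features.

    Assumptions: features are i.i.d. standard Gaussian, the true model is
    linear in all d_total features, and the fitted model sees only the
    first p of them. Returns {p: variance averaged over test points}.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d_total) / np.sqrt(d_total)  # hypothetical truth, norm near 1
    x_test = rng.normal(size=(n_test, d_total))         # fixed test points
    out = {}
    for p in p_values:
        preds = np.empty((n_trials, n_test))
        for t in range(n_trials):
            X = rng.normal(size=(n_train, d_total))
            y = X @ beta + rng.normal(0.0, sigma, n_train)
            # pinv gives the least-squares solution for p < N and the
            # minimum-norm interpolating solution for p >= N.
            w = np.linalg.pinv(X[:, :p]) @ y
            preds[t] = x_test[:, :p] @ w
        out[p] = float(preds.var(axis=0).mean())
    return out


if __name__ == "__main__":
    for p, v in min_norm_variance([10, 30, 300]).items():
        print(f"p={p:4d}  estimated variance={v:.3f}")
```

With `n_train=30`, the setting `p=30` sits at the interpolation threshold, where the estimated variance should be far larger than at either `p=10` or `p=300`. Because the variance at the threshold is heavy-tailed, the exact value there depends strongly on the seed.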
Modern reformulation. A more accurate picture for deep learning is:
$$\text{test error} = \text{approximation error} + \text{estimation error},$$
where estimation error in overparameterised regimes is controlled by algorithmic properties (which interpolating solution is chosen) rather than by capacity alone. Tools that capture this include PAC-Bayes bounds, Rademacher complexity of norm-bounded classes, and stability analyses.
When the classical tradeoff still applies.
- Underparameterised models (linear regression with $p < N$, kernel methods with regularisation, decision trees of bounded depth).
- Explicit-regularisation regimes where the regulariser controls effective capacity.
- Settings dominated by approximation error (too-simple models), where adding capacity helps.
Pedagogical role. The bias-variance decomposition remains the cleanest way to introduce the concept of generalisation error, and its components are still meaningful diagnostically. Cross-validation curves, learning curves, and ensemble methods such as bagging (which reduces variance without substantially increasing bias) are all best understood through this lens. The tradeoff is incomplete, not wrong: it describes one regime of a richer landscape revealed by double descent.
Modern treatments therefore present bias-variance as a foundational decomposition whose practical implications depend on the regime (under-, near-, or over-parameterised) in which the model operates.
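As a diagnostic illustration of the bagging remark, the sketch below compares the prediction variance of a single 1-nearest-neighbour regressor with that of a bagged ensemble of the same learner. It assumes a hypothetical one-dimensional problem with $f(x) = \sin(2\pi x)$; the choice of base learner, the function names, and the defaults are all illustrative rather than taken from any particular source.

```python
import numpy as np


def one_nn_predict(x_train, y_train, x_test):
    # 1-nearest-neighbour regression in one dimension: a high-variance base learner.
    nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[nearest]


def bagged_predict(x_train, y_train, x_test, n_bags, rng):
    # Average the base learner over bootstrap resamples of the training set.
    n = x_train.size
    total = np.zeros(x_test.size)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # sample n indices with replacement
        total += one_nn_predict(x_train[idx], y_train[idx], x_test)
    return total / n_bags


def compare_variance(n_train=50, n_trials=200, n_bags=25, sigma=0.3, seed=0):
    """Return (variance of single 1-NN, variance of bagged 1-NN).

    Variance is measured across independently drawn training samples and
    averaged over a grid of test points.
    """
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.1, 0.9, 20)
    single = np.empty((n_trials, x_test.size))
    bagged = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n_train)
        single[t] = one_nn_predict(x, y, x_test)
        bagged[t] = bagged_predict(x, y, x_test, n_bags, rng)
    return float(single.var(axis=0).mean()), float(bagged.var(axis=0).mean())


if __name__ == "__main__":
    var_single, var_bagged = compare_variance()
    print(f"single 1-NN variance: {var_single:.4f}")
    print(f"bagged 1-NN variance: {var_bagged:.4f}")
```

In this setup the bagged estimate should show clearly lower variance than the single learner. Bagging also smooths the predictor slightly, so a small change in bias is possible; the sketch measures variance only.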
Related terms: Double Descent, Implicit Regularisation, Statistical Learning Theory, Regularisation, Rademacher Complexity, PAC-Bayes
Discussed in:
- Chapter 6: ML Fundamentals, Generalisation in Deep Learning