Bias-Variance Tradeoff, Glossary, Textbook of AI

The bias-variance tradeoff decomposes the expected squared prediction error of an estimator into interpretable components. Let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat f$ be an estimator trained on a random sample $S$. The expected squared error at a test point $x$ is

$$\mathbb{E}_{S, \varepsilon}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}_S[\hat f(x)] - f(x)\big)^2}_{\mathrm{Bias}[\hat f(x)]^2} + \underbrace{\mathbb{E}_S\big[\big(\hat f(x) - \mathbb{E}_S[\hat f(x)]\big)^2\big]}_{\mathrm{Var}[\hat f(x)]} + \sigma^2.$$
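
The identity follows from a short expansion, sketched here under the additional assumption (not stated above) that the test noise $\varepsilon$ is independent of the training sample $S$. The shorthand $\bar f(x) = \mathbb{E}_S[\hat f(x)]$ is introduced only for this sketch. Split the error into three pieces:

$$y - \hat f(x) = \varepsilon + \big(f(x) - \bar f(x)\big) + \big(\bar f(x) - \hat f(x)\big).$$

Squaring and taking expectations, the three cross terms vanish: $\mathbb{E}[\varepsilon] = 0$ removes both terms involving $\varepsilon$ (using independence from $S$ for the second), and $\mathbb{E}_S[\bar f(x) - \hat f(x)] = 0$ by the definition of $\bar f$. What remains is

$$\mathbb{E}_{S, \varepsilon}\big[(y - \hat f(x))^2\big] = \sigma^2 + \big(\bar f(x) - f(x)\big)^2 + \mathbb{E}_S\big[\big(\hat f(x) - \bar f(x)\big)^2\big],$$

which matches the noise, squared-bias, and variance terms above.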

Bias measures systematic error: how far the average prediction (over training samples) is from the truth. High bias indicates underfitting: the model class is too restrictive.
Variance measures sensitivity to the particular training sample: how much $\hat f(x)$ fluctuates as $S$ is resampled. High variance indicates overfitting: the model is too flexible.
Irreducible noise $\sigma^2$ is independent of the model.
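
The three components can be estimated numerically by simulating many training samples. The following is a minimal sketch, not a definitive implementation: the ground truth $f(x) = \sin(2\pi x)$, the polynomial model class, the fixed input grid (so that only the noise in $S$ is resampled), and the function name `bias_variance_at_point` are all assumptions chosen for illustration.

```python
import numpy as np

def bias_variance_at_point(degree, x0=0.25, n_train=30, n_trials=2000,
                           sigma=0.3, seed=0):
    """Monte Carlo estimate of bias^2, variance and expected squared error
    at x0 for a least-squares polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)        # assumed ground truth f
    x = np.linspace(0.0, 1.0, n_train)         # assumed fixed design
    X = np.vander(2 * x - 1, degree + 1)       # polynomial features on [-1, 1]

    # One column of Y per simulated training sample S (fresh noise each time).
    Y = f(x)[:, None] + rng.normal(0.0, sigma, size=(n_train, n_trials))
    coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)

    # Prediction of every fitted model at the test point x0.
    phi0 = np.vander(np.array([2 * x0 - 1]), degree + 1)
    preds = (phi0 @ coefs).ravel()

    # Fresh noisy test labels, independent of the training noise.
    y0 = f(x0) + rng.normal(0.0, sigma, size=n_trials)

    bias_sq = (preds.mean() - f(x0)) ** 2
    variance = preds.var()
    mse = np.mean((y0 - preds) ** 2)
    return bias_sq, variance, mse

for degree in (1, 3, 9):
    b, v, m = bias_variance_at_point(degree)
    print(f"degree={degree}: bias^2={b:.4f} variance={v:.4f} "
          f"sum+sigma^2={b + v + 0.3 ** 2:.4f} mse={m:.4f}")
```

Up to Monte Carlo error, the last two printed columns should agree, which is the decomposition itself.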

Classical implication. As model capacity increases:

bias decreases (more flexibility to capture $f$),
variance increases (more sensitivity to noise in $S$),
their sum is U-shaped, with an optimal interior capacity.

This U-curve motivated the entire framework of model selection: cross-validation, AIC, BIC, structural risk minimisation, regularisation paths. The tradeoff implies that choosing too rich a model class is harmful, and that fitting the training data perfectly (zero empirical risk, $\hat R = 0$) is dangerous.
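
As one illustration of capacity selection along the U-curve, the sketch below picks a polynomial degree by k-fold cross-validation. The synthetic data, the degree range, and the helper name `kfold_cv_mse` are assumptions made for this example, not part of the original text.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Mean validation MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)           # same folds for every degree
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = np.polynomial.Polynomial.fit(x[train], y[train], degree)
        errors.append(np.mean((y[val] - model(x[val])) ** 2))
    return float(np.mean(errors))

# Assumed synthetic data: noisy samples of sin(2*pi*x).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 60)

cv_mse = {d: kfold_cv_mse(x, y, d) for d in range(13)}
best_degree = min(cv_mse, key=cv_mse.get)
print("selected degree:", best_degree)
```

Very low degrees are penalised for bias and very high degrees for variance, so the selected degree typically lies in the interior of the range.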

Where it breaks down. Belkin et al. (2019) documented double descent: as capacity is pushed past the interpolation threshold $p \approx N$ (the number of parameters $p$ roughly equal to the number of training samples $N$), test error first spikes, then descends a second time, often falling below the classical optimum. The classical decomposition still holds at any fixed capacity (bias and variance remain well-defined), but the behaviour of variance in the overparameterised regime contradicts the classical narrative:

At interpolation ($p \approx N$), variance diverges because the system is barely identifiable.
In the heavily overparameterised regime ($p \gg N$), variance decreases because implicit regularisation of gradient descent selects minimum-norm solutions; the effective complexity is governed by norm, not parameter count.
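
The spike at $p \approx N$ can be reproduced in a toy linear model. The sketch below is illustrative only and rests on several assumptions: isotropic Gaussian inputs in $d$ dimensions, a model that uses only the first $p$ coordinates, and `np.linalg.pinv` as a stand-in for gradient descent (for least squares started from zero, gradient descent converges to the same minimum-norm solution). The function name `min_norm_test_risk` is hypothetical.

```python
import numpy as np

def min_norm_test_risk(p, n=40, d=400, sigma=0.5, n_trials=20, seed=0):
    """Median test risk of the minimum-norm least-squares fit that uses
    only the first p of d features, over n_trials training samples."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d)
    beta /= np.linalg.norm(beta)                # unit-norm true coefficients
    risks = []
    for _ in range(n_trials):
        X = rng.normal(size=(n, d))             # isotropic Gaussian inputs
        y = X @ beta + rng.normal(0.0, sigma, n)
        beta_hat = np.zeros(d)
        # pinv gives ordinary least squares for p < n and the
        # minimum-norm interpolating solution for p > n.
        beta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y
        # For isotropic inputs the expected test error has a closed form.
        risks.append(np.sum((beta_hat - beta) ** 2) + sigma ** 2)
    return float(np.median(risks))

for p in (10, 20, 40, 80, 400):
    print(f"p={p:4d}  median test risk={min_norm_test_risk(p):.3f}")
```

With $N = 40$ training points, the risk should peak near $p = 40$ and fall again as $p$ grows, although whether the second descent drops below the first minimum depends on the data-generating assumptions.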

Modern reformulation. A more accurate picture for deep learning is:

$$\text{test error} = \text{approximation error} + \text{estimation error},$$

where estimation error in overparameterised regimes is controlled by algorithmic properties (which interpolating solution is chosen) rather than by capacity alone. Tools that capture this include PAC-Bayes bounds, Rademacher complexity of norm-bounded classes, and stability analyses.

When the classical tradeoff still applies.

Underparameterised models (linear regression with $p < N$, kernel methods with regularisation, decision trees of bounded depth).
Explicit-regularisation regimes where the regulariser controls effective capacity.
Settings dominated by approximation error (too-simple models), where adding capacity helps.

Pedagogical role. The bias-variance decomposition remains the cleanest way to introduce the concept of generalisation error, and its components are still meaningful diagnostically. Cross-validation curves, learning curves, and ensemble methods such as bagging (which reduce variance without substantially increasing bias) are all best understood through this lens. The tradeoff is incomplete, not wrong: it describes one regime of a richer landscape revealed by double descent.
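
The variance-reduction effect of bagging can be checked with a small simulation. This is a sketch under stated assumptions: the base learner (1-nearest-neighbour regression in one dimension), the ground truth $\sin(2\pi x)$, and the function names are all chosen for illustration and do not come from the original text.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x0):
    """Predict at x0 with the label of the nearest training input."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_predict(x_train, y_train, x0, n_bags, rng):
    """Average 1-NN predictions over bootstrap resamples of the data."""
    n = len(x_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        preds.append(one_nn_predict(x_train[idx], y_train[idx], x0))
    return np.mean(preds)

def compare_variance(x0=0.25, n_train=50, n_bags=50, n_trials=500,
                     sigma=0.3, seed=0):
    """Variance at x0 of a single 1-NN fit versus its bagged version,
    estimated over n_trials independent training samples."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)         # assumed ground truth
    single = np.empty(n_trials)
    bagged = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = f(x) + rng.normal(0.0, sigma, n_train)
        single[t] = one_nn_predict(x, y, x0)
        bagged[t] = bagged_predict(x, y, x0, n_bags, rng)
    return single.var(), bagged.var()

var_single, var_bagged = compare_variance()
print(f"single 1-NN variance: {var_single:.4f}")
print(f"bagged 1-NN variance: {var_bagged:.4f}")
```

A high-variance base learner is used deliberately: averaging over bootstrap resamples spreads the prediction across several nearby labels, which is where the variance reduction comes from.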

Modern treatments therefore present bias-variance as a foundational decomposition whose practical implications depend on the regime (under-, near-, or over-parameterised) in which the model operates.

Interactive figure: the bias-variance tradeoff. Underfitting is high bias, overfitting is high variance; the best model balances the two.

Discussed in:

Chapter 6: ML Fundamentals, Generalisation in Deep Learning
