5.16 Bias–Variance Tradeoff
If you have ever tried to fit a curve through a cloud of points and wondered whether you were drawing too straight a line or chasing every wiggle, you have already met the bias–variance tradeoff. It is the single most useful idea in supervised learning, and it explains why "more powerful model" is not always the right answer. A model with very few parameters, say, a straight line, is not flexible enough to capture a complicated pattern, so on average it misses the truth. A model with very many parameters, a wildly curvy polynomial, a deep network with millions of weights, can in principle fit anything, but it is so eager to fit that it ends up fitting the random noise in our particular sample. The first failure is called bias: a systematic offset between what the model predicts on average and the truth. The second is called variance: how much the model's predictions wobble when we re-train it on a different random sample of data. The tradeoff says these two failures pull in opposite directions, and our job is to find the balance.
The classical picture is a U-shape. As we increase model flexibility (more features, higher polynomial degree, deeper trees, less regularisation), bias falls and variance rises. Test error, the sum of the two (plus an irreducible noise term we cannot remove), first goes down, hits a minimum, and then goes up again. The minimum is where we want to be. Recently, deep learning has scrambled this tidy picture in surprising ways. Very large neural networks, the kind that have many more parameters than training points, often generalise better than smaller, classically tuned models. The phenomenon is called double descent, and we will discuss it once the classical story is in place.
This section ties together threads we have already seen. In §5.4 we introduced bias and variance separately, as properties of an estimator. Here we put them together as a decomposition of test error. The same logic underpins regularisation in §6 (machine learning fundamentals), the design of training and validation splits in §10 (training procedure), the ensembling tricks of random forests and gradient boosting, and, at scale, the modern empirical findings about overparameterised networks.
The classical decomposition
Let us write down the algebra so the qualitative story has a precise backing. Suppose the world generates each observation as $y = f(\mathbf{x}) + \epsilon$, where $f$ is the true underlying function we want to learn and $\epsilon$ is independent noise with mean zero and variance $\sigma^2$, for instance, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. We do not have $f$. We have a finite training sample, and from it we fit a model $\hat f$. The randomness in the training sample makes $\hat f$ itself random: a different draw of $n$ data points would have given a slightly different fit.
Now imagine evaluating the model at a new test point $\mathbf{x}$. Two things are random: the noise $\epsilon$ on the new observation $y$, and the training sample we used to build $\hat f$. The expected squared prediction error decomposes neatly into three pieces:
$$\mathbb{E}\big[(\hat f(\mathbf{x}) - y)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat f(\mathbf{x})] - f(\mathbf{x})\big)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat f(\mathbf{x}) - \mathbb{E}[\hat f(\mathbf{x})])^2\big]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}}.$$
Each term has a vivid interpretation.
- Bias$^2$: take the average prediction across many hypothetical training datasets, then ask how far that average is from the truth. If our model class cannot represent $f$, for instance, fitting a straight line to a sine wave, the average will systematically miss, and bias will be large no matter how much data we collect.
- Variance: how much does the prediction at $\mathbf{x}$ jiggle around when we re-train the model on different samples? A degree-30 polynomial fit to 50 noisy points will swing dramatically from one sample to the next; a single global mean will not move at all.
- Noise, $\sigma^2$: the part of $y$ that is fundamentally unpredictable from $\mathbf{x}$. No model, perfect or otherwise, can drive this below $\sigma^2$. It sets the floor.
The decomposition is exact for squared error. For other losses (cross-entropy, hinge), analogous decompositions exist but the algebra is messier. The important takeaway is that test error is bounded below by noise, and the only knobs we control are bias and variance. Good modelling is the art of trading them off.
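For readers who want the algebra, the identity follows from expanding one square and using the fact that $\epsilon$ has mean zero and is independent of the training sample:
$$\mathbb{E}\big[(\hat f(\mathbf{x}) - y)^2\big] \;=\; \mathbb{E}\big[\big(\hat f(\mathbf{x}) - f(\mathbf{x}) - \epsilon\big)^2\big] \;=\; \mathbb{E}\big[\big(\hat f(\mathbf{x}) - f(\mathbf{x})\big)^2\big] \;-\; 2\,\mathbb{E}\big[\hat f(\mathbf{x}) - f(\mathbf{x})\big]\,\mathbb{E}[\epsilon] \;+\; \mathbb{E}[\epsilon^2],$$
and the middle term vanishes because $\mathbb{E}[\epsilon] = 0$, leaving $\mathbb{E}\big[(\hat f(\mathbf{x}) - f(\mathbf{x}))^2\big] + \sigma^2$. Writing $\bar f(\mathbf{x}) = \mathbb{E}[\hat f(\mathbf{x})]$ for the average prediction and splitting the remaining square around it,
$$\mathbb{E}\big[\big(\hat f(\mathbf{x}) - f(\mathbf{x})\big)^2\big] \;=\; \big(\bar f(\mathbf{x}) - f(\mathbf{x})\big)^2 \;+\; \mathbb{E}\big[\big(\hat f(\mathbf{x}) - \bar f(\mathbf{x})\big)^2\big],$$
where the cross term again vanishes because $\mathbb{E}\big[\hat f(\mathbf{x}) - \bar f(\mathbf{x})\big] = 0$. The first piece is the squared bias, the second is the variance, and the $\sigma^2$ from the first step is the noise floor.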
How model capacity moves the tradeoff
Capacity is an informal word for "how many distinct functions a model can express". A linear regression with one feature has tiny capacity; a deep neural network with a hundred million parameters has enormous capacity. Capacity governs bias and variance in opposite directions.
Low capacity means the model class is small. The chance that the truth lives inside it is small, so the average fit is far from the truth, high bias. But because there are few free parameters, different training samples produce similar fits, low variance. Visually, every line through the cloud looks much the same; they are all wrong in roughly the same way. This regime is called underfitting.
High capacity is the opposite. The model class is so rich that it can represent almost any function, so on average it can hit the truth, low bias. But with finite data, the fit is dictated as much by the noise in the particular sample as by the underlying signal. Different samples therefore produce wildly different fits, high variance. Each individual fit looks confident and detailed, but two such fits, on two different samples, disagree all over the place. This is overfitting.
The sweet spot is somewhere in between: enough capacity for the model class to contain something close to the truth, but not so much that we end up modelling the random fluctuations.
If we plotted the three quantities against capacity, we would see bias falling monotonically (more flexibility means less systematic error), variance rising monotonically, and total expected test error tracing a U-shape. The minimum of the U is the optimum. Sample size shifts this picture: with more data, variance shrinks at every capacity level, and the optimum slides rightwards towards more complex models. This is the formal reason why "more data lets us train bigger models": the variance penalty for capacity gets cheaper. We will check this claim numerically at the end of the worked example below.
Worked example: polynomial fits
The crispest illustration is to fit polynomials of different degree to the same noisy data and watch the U-shape emerge. Generate 50 data points from $y = \sin(x) + 0.3\,\epsilon$ on $x \in [-3, 3]$, with $\epsilon \sim \mathcal{N}(0, 1)$. Fit polynomials of degree 1, 3, 9 and 30. To estimate variance we repeat the whole exercise on, say, a hundred fresh datasets and look at how the fitted curves spread out.
- Degree 1. A straight line cannot follow a sine wave. The fit is roughly $y \approx 0$ everywhere. Bias is enormous, the average line misses the true sine almost everywhere. Variance is tiny, every dataset gives roughly the same flat-ish line. Total error: dominated by bias.
- Degree 3. A cubic can bend twice. It approximates one full hump of the sine reasonably well. Bias is much lower; variance is still modest. Total error drops sharply.
- Degree 9. A nine-term polynomial can match the sine wave very closely on $[-3, 3]$. Bias is small; variance is moderate. This is near the bottom of the U.
- Degree 30. With thirty coefficients fitted to fifty points, the polynomial has near-zero error on the training points (it passes through almost every one) but oscillates wildly between them. Training error is essentially zero, but variance is enormous: tiny perturbations to the data produce drastically different curves. Test error is terrible.
import numpy as np

def make_data(n=50, noise=0.3, seed=0):
    # Draw n noisy points from y = sin(x) + noise on x in [-3, 3]
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(scale=noise, size=n)
    return x, y

x_test = np.linspace(-3, 3, 200)
truth = np.sin(x_test)

for d in [1, 3, 9, 30]:
    # Re-fit the degree-d polynomial on 100 fresh datasets
    preds = np.empty((100, len(x_test)))
    for s in range(100):
        x, y = make_data(seed=s)
        coef = np.polyfit(x, y, d)
        preds[s] = np.polyval(coef, x_test)
    bias_sq = np.mean((preds.mean(0) - truth) ** 2)  # squared gap between the average fit and the truth
    variance = np.mean(preds.var(0))                 # spread of the fits around their own average
    print(f"deg {d:2d}  bias^2={bias_sq:.3f}  var={variance:.3f}  sum={bias_sq + variance:.3f}")
Run this and you see the U: the sum of bias-squared and variance is largest at degree 1, smallest somewhere around degree 3 to 9, and balloons again at degree 30. This is the classical bias–variance tradeoff in one screenful of code.
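The same harness also lets us check the claim from the previous subsection that more data shrinks variance at fixed capacity. Here is a minimal extension (a sketch reusing make_data, x_test and truth from the snippet above) that holds the degree at 9 and varies the sample size:

for n in [25, 50, 200, 800]:
    preds = np.empty((100, len(x_test)))
    for s in range(100):
        x, y = make_data(n=n, seed=s)
        coef = np.polyfit(x, y, 9)            # capacity held fixed at degree 9
        preds[s] = np.polyval(coef, x_test)
    bias_sq = np.mean((preds.mean(0) - truth) ** 2)
    variance = np.mean(preds.var(0))
    print(f"n={n:4d}  bias^2={bias_sq:.3f}  var={variance:.3f}")

The bias column should stay roughly flat, since whether degree 9 can represent the sine does not depend on $n$, while the variance column should shrink steadily as $n$ grows. That shrinking variance is exactly what lets larger datasets support higher-degree fits.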
Regularisation moves capacity smoothly
Choosing among polynomial degrees feels coarse: degrees are integers, and you cannot really build a "degree 4.7" model. Regularisation gives a continuous knob. The most common form is L2 regularisation, also called ridge regression. Instead of minimising the training loss alone, we minimise
$$\mathcal{L}_{\text{reg}}(\mathbf{w}) \;=\; \mathcal{L}_{\text{train}}(\mathbf{w}) \;+\; \lambda \, \|\mathbf{w}\|^2,$$
where $\mathbf{w}$ are the model's weights and $\lambda \geq 0$ is a tuning parameter. The penalty $\lambda \|\mathbf{w}\|^2$ pulls all coefficients towards zero, which shrinks the effective capacity of the model: large coefficients are now "expensive" and the optimiser prefers gentler fits.
Two limits make the effect clear:
- As $\lambda \to 0$, the penalty disappears and we recover the unregularised maximum-likelihood fit. Bias is at its minimum for the chosen architecture; variance is at its maximum.
- As $\lambda \to \infty$, the penalty dominates and pushes every coefficient to zero. The prediction collapses to a constant (typically the mean of $y$). Bias is huge; variance is exactly zero.
In between, $\lambda$ slides smoothly along the bias–variance curve. The sweet spot is found by cross-validation: try a grid of $\lambda$ values, score each one on held-out folds, and pick the one with the lowest validation error. From a Bayesian perspective, L2 regularisation is exactly equivalent to placing a zero-mean Gaussian prior on the weights and reading off the MAP estimate (§5.6); this is one of the small bridges between frequentist and Bayesian statistics that recur throughout AI. L1 regularisation (the lasso) penalises $\|\mathbf{w}\|_1$ instead of $\|\mathbf{w}\|^2$, which has the additional effect of driving some coefficients exactly to zero, useful for feature selection. Dropout, weight decay, early stopping, and data augmentation are all flavours of the same idea: trade a small amount of bias for a large reduction in variance.
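To make the cross-validation recipe concrete, here is a minimal NumPy sketch (a toy illustration, not a production recipe): it fits ridge regression in closed form on polynomial features of the sine data from the worked example and scores a grid of $\lambda$ values on held-out folds. The helper names poly_features and ridge_fit are ours, introduced only for this sketch.

def poly_features(x, degree):
    # Powers x^0 ... x^degree; rescale x to [-1, 1] first so the columns
    # stay comparable in size and the ridge penalty treats them fairly
    return np.vander(x / 3.0, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

x, y = make_data(seed=0)          # reuse the data generator from the worked example
X = poly_features(x, 12)

rng = np.random.default_rng(1)
folds = np.array_split(rng.permutation(len(y)), 5)   # simple 5-fold split
for lam in [1e-6, 1e-4, 1e-2, 1e0, 1e2]:
    errs = []
    for k in range(5):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(5) if j != k])
        w = ridge_fit(X[trn], y[trn], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    print(f"lambda={lam:8.1e}  cv_mse={np.mean(errs):.3f}")

Very small $\lambda$ behaves like the unregularised fit, very large $\lambda$ shrinks everything towards a constant and the validation error rises again; the minimum of the printed column is the cross-validated sweet spot.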
Modern double descent
For decades the U-shape looked complete. Then deep learning broke it. As you push neural networks past the interpolation threshold (the point at which they have just enough parameters to fit the training set perfectly, achieving zero training error) and keep going, something strange happens. Test error spikes at the threshold (variance is at its worst right where the model is barely able to interpolate) and then falls again as you add even more parameters. The full curve is a U, then a peak, then a second descent.
This double descent picture was sharpened by Belkin, Hsu, Ma and Mandal (2019) and confirmed empirically across many architectures and datasets by Nakkiran and colleagues (2020). The empirical finding is unexpected under classical theory: enormous overparameterised models, modern transformers with billions of parameters trained on trillions of tokens, frequently generalise better than their smaller, classically-tuned cousins, even though classical theory predicts they should be calamitous overfitters.
Why does the second descent happen? Several explanations, none yet complete, are converging on a picture. The optimiser matters: stochastic gradient descent, even on a model with many more parameters than data points, is biased towards finding flat, low-norm solutions among the infinitely many that fit the training data. This implicit regularisation does much of the work that explicit regularisation used to do. The geometry of high-dimensional loss landscapes also helps: in high dimensions, "bad" minima are rare and saddle points are flat. Theoretical tools such as the neural tangent kernel approximate very wide networks as kernel methods and recover sensible generalisation bounds.
For our purposes the message is twofold. First, the bias–variance decomposition is still correct: it is an algebraic identity. What it does not tell you is how parameter count maps to effective capacity, and that map is far more subtle in deep models than it is in linear regression. Second, double descent is now a routine empirical observation in deep learning research, and the modern recipe (bigger models, more data, less explicit regularisation) has eclipsed careful capacity tuning for the largest models.
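Double descent is easiest to see in a toy model where least squares can be solved exactly. The sketch below is our own illustration, not a reproduction of the cited experiments: it fits fixed random ReLU features of growing width p to a small regression problem, and np.linalg.pinv returns the minimum-norm solution once p exceeds the number of training points, playing a role loosely analogous to SGD's implicit regularisation. The exact numbers depend on the seed; averaging over several draws of W makes the pattern cleaner.

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 1000, 10

beta = rng.normal(size=d)
def sample(n):
    # Toy regression task: linear signal plus noise
    X = rng.normal(size=(n, d))
    return X, X @ beta + 0.5 * rng.normal(size=n)

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

for p in [5, 10, 20, 40, 80, 320, 1280]:
    # Fixed random ReLU features of width p; only the output weights are fitted
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_tr = np.maximum(X_tr @ W, 0.0)
    Phi_te = np.maximum(X_te @ W, 0.0)
    w = np.linalg.pinv(Phi_tr) @ y_tr        # minimum-norm least-squares solution
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p={p:5d}  test_mse={test_mse:.3f}")

Test error should fall at first, peak around p close to n_train where the model barely interpolates, and then fall again as p keeps growing: the U, the spike, and the second descent in a few lines.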
What this means for AI practice
Putting the classical and modern pictures together, a few practical rules emerge.
- Tabular data, modest sample sizes. The classical picture rules. Use cross-validation to choose model capacity and regularisation strength. Gradient-boosted trees (XGBoost, LightGBM, CatBoost) navigate the bias–variance tradeoff particularly well and remain the default for tabular benchmarks.
- Deep learning with abundant data. The modern recipe is: scale the model, scale the data, lean on implicit regularisation from SGD and architectural inductive biases, and use modest explicit regularisation (dropout, weight decay, data augmentation) more as a stabiliser than as a primary control. The bias–variance framing still applies, it just lives in a regime where naive parameter counts mislead.
- Always measure. Whatever your regime, the cheapest way to find your spot on the curve is a validation set. Plot training and validation error against capacity (or against $\lambda$, or against epochs, or against model size); a minimal numeric sketch follows this list. A widening gap is variance; a high but flat training error is bias.
- Beware the interpolation peak. If you find yourself sitting at a model size where training error is near zero but validation error is unusually high, you may be near the classical interpolation threshold. Either shrink the model or, in deep learning, push past it.
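As promised in the "Always measure" item, here is a minimal numeric version of that training-versus-validation plot, reusing make_data from the worked example; a second independent draw stands in for a proper validation split.

x, y = make_data(seed=0)               # training sample
x_val, y_val = make_data(seed=123)     # independent draw used as a validation set
for d in range(1, 16):
    coef = np.polyfit(x, y, d)
    train_mse = np.mean((np.polyval(coef, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    print(f"deg {d:2d}  train={train_mse:.3f}  val={val_mse:.3f}")

Training error should fall steadily with degree while validation error traces the U; the growing gap at high degrees is variance, and the high, flat errors at degree 1 are bias.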
The classical bias–variance decomposition is a useful framing but does not, on its own, determine modern deep-learning practice. Treat it as the conceptual scaffolding around which the empirical findings are still being assembled.
What you should take away
- Test error decomposes exactly into bias-squared plus variance plus irreducible noise. Only bias and variance are under our control; the noise term is a floor.
- Bias is what you have when the model class is too restrictive to express the truth. Variance is what you have when the model is so flexible that different samples produce very different fits. They pull in opposite directions.
- As model capacity rises, bias falls and variance rises. Total expected test error traces a U-shape, and the minimum is the optimum. More data shifts the optimum towards more capacity.
- Regularisation (L2, L1, dropout, early stopping, augmentation) is a smooth knob for trading a little bias for a large reduction in variance. Tune it with cross-validation.
- In modern overparameterised deep learning, the U becomes a U-then-peak-then-second-descent. Very large networks often generalise better than smaller ones thanks to the implicit regularisation of SGD and the geometry of high-dimensional loss surfaces. The classical decomposition still holds; the mapping from parameter count to effective capacity is what changes.