6.4 Generalisation: the central problem
Why should training a model on a finite collection of examples tell us anything reliable about how it will behave tomorrow, on examples it has never seen? This is the question every working machine-learning practitioner pretends to have answered the moment they report a test-set accuracy. The pretence is harmless when the data are clean, the sample is large, and the world is stationary. It is dangerous everywhere else. Generalisation theory is the branch of machine learning that turns this question from a hopeful gesture into a quantitative statement: given some assumptions about how the data are sampled, we can write down explicit bounds linking what we measure on the training set to what we should expect on fresh draws from the same distribution. The bounds are usually loose for modern deep networks, sometimes spectacularly so, and yet the conceptual framework is essential, because it tells us what we are buying with each extra example, each extra parameter, and each extra layer of architectural cleverness.
This section develops the formal scaffolding behind the bias–variance discussion of §5.4. There we treated generalisation as a decomposition of expected error into reducible bias, irreducible noise, and variance from the training sample. Here we go further: we ask under what conditions the gap between training error and true error can be controlled, and how that control depends on the size of the sample, the richness of the model class, and the confidence we want in our claim. The machinery we develop will reappear in Chapter 10, where we treat practical training, and in Chapter 16, where we discuss alignment and robustness, both of which are, at heart, generalisation problems wearing different clothes.
Empirical and true risk
The first step is to give names to two quantities that are frequently confused. The true risk of a hypothesis $h$ is its expected loss on a fresh example drawn from the underlying data distribution $p$:
$$R(h) = \mathbb{E}_{(\mathbf{x}, y) \sim p}[\mathcal{L}(h(\mathbf{x}), y)].$$
This is the quantity we actually care about. It tells us, on average, how much loss the model will incur once we deploy it. The trouble is that we cannot compute it: $p$ is unknown, and the expectation runs over every possible example we have not yet seen.
What we can compute is the empirical risk, which is just the average loss on the training set:
$$\hat R(h) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(h(\mathbf{x}_i), y_i).$$
Empirical risk is concrete, finite, and trivially measurable. It is what gradient descent actually minimises, and what we report when we say "training loss is 0.18". The whole point of generalisation theory is to relate this knowable quantity to the unknowable one we care about.
The generalisation gap is the absolute difference
$$|R(h) - \hat R(h)|.$$
If the gap is small, training-set performance is a faithful proxy for deployment performance. If the gap is large, the training number is a mirage. Two questions follow immediately: how should the gap shrink as the sample grows, and how does the choice of hypothesis class influence it? These two questions structure everything that follows. A useful intuition: empirical risk is a sample mean, true risk is a population mean, and statistical theory tells us sample means concentrate around population means. The art lies in turning that vague reassurance into a number.
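To make the reassurance concrete, the short simulation below is our own illustration: it assumes a hypothetical fixed classifier whose per-example zero–one loss is a Bernoulli draw with mean $0.2$ (the true risk), and shows the gap between the sample mean and the population mean shrinking as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed classifier whose per-example 0-1 loss is a
# Bernoulli draw with true risk R = 0.2 (the population mean of the loss).
true_risk = 0.2

for n in [100, 1_000, 10_000]:
    # Each trial draws a fresh training set of size n and computes the
    # empirical risk (the sample mean of the losses).
    gaps = [abs(rng.binomial(1, true_risk, n).mean() - true_risk)
            for _ in range(1_000)]
    print(f"n={n:>6}  mean |R - R_hat| = {np.mean(gaps):.4f}")

# The average gap shrinks roughly like 1/sqrt(n); the bounds that follow
# make this concentration precise.
```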
Hoeffding's bound for a fixed hypothesis
The cleanest result in the area is Hoeffding's inequality. It says that for any fixed hypothesis $h$, chosen before we look at the data, and any loss bounded in $[0, 1]$, the gap between empirical and true risk concentrates very rapidly as the sample grows. Specifically, with probability at least $1 - \delta$ over the random draw of the training set,
$$|R(h) - \hat R(h)| \le \sqrt{\frac{\log(2/\delta)}{2n}}.$$
The bound is striking for what it does not require: it assumes nothing about the shape of the loss surface, the structure of $h$, or any property of the data distribution beyond the boundedness of the loss. It holds universally across these dimensions, and the gap shrinks as $1/\sqrt{n}$, the rate at which sample means generally concentrate around population means.
A worked example clarifies the magnitudes. Suppose $n = 10\,000$ and we want a 95 per cent confidence guarantee, so $\delta = 0.05$. Then the bound gives
$$\sqrt{\frac{\log(2/0.05)}{2 \cdot 10\,000}} = \sqrt{\frac{\log 40}{20\,000}} \approx 0.014.$$
So for any single, pre-chosen hypothesis on a sample of ten thousand, the empirical and true risks differ by no more than about 1.4 percentage points, ninety-five times out of a hundred. That is a useful, non-trivial guarantee.
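The calculation is easy to reproduce. The sketch below is our own illustration (the function name is not standard); it evaluates the Hoeffding bound for a few settings.

```python
import math

def hoeffding_gap(n, delta):
    """Hoeffding bound on |R(h) - R_hat(h)| for a single pre-chosen
    hypothesis and a loss bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

print(hoeffding_gap(10_000, 0.05))   # ~0.0136: the 1.4-point guarantee above
print(hoeffding_gap(40_000, 0.05))   # quadrupling the sample halves the bound
```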
The catch is in the word "fixed". Hoeffding's bound treats the hypothesis as a constant chosen before the data are seen. In real machine learning, we search for $h$ using the training data: we run gradient descent, we tune hyperparameters, we cross-validate. The hypothesis we end up with, $\hat h$, is a function of the very sample we want to evaluate it on. This breaks the assumption underlying Hoeffding. Picture searching through a million candidate models: even if each individual model has a tiny gap with high probability, the worst gap among all million is much larger, because we have many chances for an unlucky draw. Hoeffding's bound applies to the candidate before the lottery, not to the winner after.
Uniform convergence
The fix is to bound the gap uniformly over every hypothesis the search might produce. If our class $\mathcal{H}$ contains finitely many hypotheses, a union bound over Hoeffding gives, with probability at least $1 - \delta$,
$$\sup_{h \in \mathcal{H}} |R(h) - \hat R(h)| \le \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2n}}.$$
This is the uniform convergence bound. The price for handling any model the search might pick is the $\log|\mathcal{H}|$ term: the larger the class, the more chances for an unlucky empirical estimate, and the larger the gap we must tolerate.
Note how this aligns with intuition. Doubling the class size adds only $\log 2$ to the numerator: a richer class costs little when measured logarithmically. The real cost comes from the square root of $1/n$: every quadrupling of the sample halves the worst-case gap. So we can afford very large finite classes if we have moderately large samples. With a million hypotheses and $n = 10\,000$, the bound at $\delta = 0.05$ becomes
$$\sqrt{\frac{\log 10^6 + \log 40}{20\,000}} = \sqrt{\frac{13.8 + 3.7}{20\,000}} \approx 0.030,$$
still under three per cent. Search did not destroy the guarantee; it just halved its sharpness.
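The same arithmetic, with the $\log|\mathcal{H}|$ term included, takes only a few lines; again this is an illustrative sketch with names of our choosing.

```python
import math

def uniform_gap(n, delta, num_hypotheses):
    """Union-bound (finite-class) version of the Hoeffding guarantee:
    holds simultaneously for every hypothesis in the class."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2 / delta)) / (2 * n))

print(uniform_gap(10_000, 0.05, 1))        # ~0.014: a single hypothesis
print(uniform_gap(10_000, 0.05, 10**6))    # ~0.030: a million candidates
print(uniform_gap(10_000, 0.05, 10**12))   # ~0.040: a trillion costs little more
```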
The deeper problem is that interesting hypothesis classes are never finite. The set of all linear classifiers in $\mathbb{R}^d$, or all neural networks of a given architecture, or all decision trees of bounded depth, is uncountably infinite. The naive union bound becomes useless: $\log|\mathcal{H}|$ is no longer defined. We need a way to measure the effective size of an infinite class, a notion of capacity that captures how much real flexibility the class offers, not how many parameter settings exist on paper. Two such measures dominate the literature: the VC dimension, which counts how many points the class can label arbitrarily, and Rademacher complexity, which measures how well the class can fit pure noise. Both feed into bounds of essentially the same shape as the finite case, with the cardinality replaced by an appropriate complexity term.
VC dimension
The Vapnik–Chervonenkis dimension is the oldest and most famous capacity measure for binary classifiers. We say a class $\mathcal{H}$ shatters a set of $n$ points if, for every one of the $2^n$ possible binary labellings, some hypothesis in $\mathcal{H}$ realises that labelling. The VC dimension of $\mathcal{H}$ is the largest $n$ for which there exists a set of $n$ points that $\mathcal{H}$ shatters.
The classical example is linear classifiers in $\mathbb{R}^d$. Any three points in general position in the plane can be shattered by half-planes: pick any of the $2^3 = 8$ labellings and some line separates the positive points from the negative ones. No set of four points can be shattered, however; the XOR pattern defeats every line. So linear classifiers in $\mathbb{R}^2$ have VC dimension three, and more generally linear classifiers in $\mathbb{R}^d$ have VC dimension $d + 1$.
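A brute-force check of these claims is straightforward: enumerate every labelling and test linear separability as a feasibility problem. The sketch below does this with a small linear program; the specific points and helper names are illustrative choices of ours, not part of any standard library.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Check, via an LP feasibility problem, whether some half-plane
    w.x + b realises the given +/-1 labelling with margin >= 1."""
    n, d = points.shape
    # Constraints: -y_i * (w.x_i + b) <= -1 for all i; variables are (w, b).
    A = -labels[:, None] * np.hstack([points, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.status == 0  # 0 = feasible, 2 = infeasible

def count_realised(points):
    """How many of the 2^n labellings can half-planes realise?"""
    n = len(points)
    signs = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=n)]
    return sum(separable(points, s) for s in signs)

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square   = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(count_realised(triangle))  # 8 of 8: three points are shattered
print(count_realised(square))    # 14 of 16: the two XOR labellings fail
```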
The payoff of the definition is a uniform bound that scales with VC dimension rather than cardinality. Up to logarithmic factors,
$$|R(h) - \hat R(h)| \le O\!\left(\sqrt{\text{VC}(\mathcal{H})/n}\right).$$
This is the workhorse generalisation bound of classical learning theory. It says that to control the gap to within $\varepsilon$ we need roughly $n \sim \text{VC}/\varepsilon^2$ examples. Sample complexity scales linearly with capacity, a clean and intuitive result.
For deep networks the story collapses. The VC dimension of a network is roughly proportional to its number of parameters, and modern networks routinely have ten or a hundred times as many parameters as training examples. Plug those numbers into the VC bound and the predicted gap exceeds one: the bound permits the classifier to be wrong literally everywhere, which is no information at all. The bound is vacuous. And yet the same networks, trained with stochastic gradient descent, generalise extremely well in practice: the empirical gap is small, often under five per cent, exactly when the theory predicts it could be enormous. The classical theory does not lie; it just has nothing useful to say about over-parameterised models.
Rademacher complexity
Rademacher complexity is a tighter, data-dependent capacity measure. The empirical Rademacher complexity of $\mathcal{H}$ on a sample $\mathbf{x}_1, \dots, \mathbf{x}_n$ is
$$\mathcal{R}_n(\mathcal{H}) = \mathbb{E}_\sigma\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(\mathbf{x}_i)\right],$$
where each $\sigma_i$ is independently $\pm 1$ with equal probability. The intuition is direct: Rademacher complexity measures how well the class can correlate with random labels. A class that can fit any noise pattern is highly flexible; a class that cannot has limited capacity. Crucially, this quantity is computed on the actual sample, so it captures structure in the data the VC dimension ignores.
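For most classes the supremum inside the expectation is intractable, but for norm-bounded linear functions it has a closed form, which makes a Monte Carlo estimate easy. The sketch below is our own illustration on synthetic data, using the class $\{\mathbf{x} \mapsto \mathbf{w}^\top \mathbf{x} : \|\mathbf{w}\|_2 \le B\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_linear(X, B, num_draws=2_000):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    class {x -> w.x : ||w||_2 <= B} on the sample X (n x d). For this class
    the supremum has the closed form (B/n) * ||sum_i sigma_i x_i||_2."""
    n = len(X)
    sigmas = rng.choice([-1.0, 1.0], size=(num_draws, n))
    sups = B / n * np.linalg.norm(sigmas @ X, axis=1)
    return sups.mean()

X1 = rng.normal(size=(1_000, 20))     # a hypothetical sample of 1,000 points in R^20
print(rademacher_linear(X1, B=1.0))   # shrinks roughly like 1/sqrt(n)
X4 = rng.normal(size=(4_000, 20))
print(rademacher_linear(X4, B=1.0))   # about half the previous value
```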
Rademacher complexity yields generalisation bounds of the same square-root shape but typically tighter than VC. It is the natural tool for analysing margin-based methods such as support vector machines and kernel methods, where capacity depends on the geometry of the data, not just the dimensionality of the parameter space. It also underpins many modern data-dependent bounds, including those for boosting, deep nets at infinite width, and contemporary attempts to explain the puzzle of generalisation in over-parameterised models.
PAC-Bayes
PAC-Bayes generalises everything we have seen by working with distributions over hypotheses rather than single hypotheses. Choose a prior distribution $P$ over $\mathcal{H}$ before seeing the data, learn a posterior $Q$ from the sample, and evaluate the posterior's expected risk. The PAC-Bayes bound then says, with probability at least $1 - \delta$,
$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat R(h)] + \sqrt{\frac{\text{KL}(Q\|P) + \log(1/\delta)}{2n}} + \cdots,$$
where $\text{KL}(Q\|P)$ is the Kullback–Leibler divergence between posterior and prior, and the omitted terms are lower order (logarithmic in $n$). The bound is naturally suited to stochastic networks (networks whose weights are drawn from a learned distribution at inference time) and to Bayesian neural networks. It has produced some of the only non-vacuous generalisation bounds for deep networks trained on real data, by carefully choosing the prior and tightening the posterior around the SGD trajectory.
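To see what drives the bound, it helps to compute the complexity term for a concrete choice of prior and posterior. The sketch below assumes diagonal Gaussian distributions over the weights; the dimension, variances, and sample size are hypothetical, chosen only to show that the KL term must stay small relative to $n$ for the bound to be non-vacuous.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(Q || P) for diagonal Gaussian posterior Q and prior P over weights."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def pac_bayes_term(kl, n, delta):
    """The square-root complexity term in the PAC-Bayes bound
    (lower-order terms omitted, as in the text)."""
    return np.sqrt((kl + np.log(1.0 / delta)) / (2 * n))

# Hypothetical numbers: 10^5 weights, prior centred at zero, posterior
# centred near the prior with smaller per-weight variance.
rng = np.random.default_rng(0)
d = 100_000
mu_p, var_p = np.zeros(d), np.full(d, 0.1)
mu_q, var_q = rng.normal(0, 0.05, d), np.full(d, 0.05)

kl = kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
print(kl, pac_bayes_term(kl, n=50_000, delta=0.05))
# Unless the posterior stays close to the prior (small KL relative to n),
# the complexity term alone exceeds any useful error level.
```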
Why classical theory fails for deep networks
Deep neural networks have far more parameters than training examples. By the worst-case VC theory, the generalisation gap should be enormous. Empirically, it is not: trained networks routinely achieve test accuracy within a couple of points of training accuracy. This is one of the celebrated unsolved problems of machine-learning theory, and three threads of active research try to explain it.
Implicit bias of SGD. Stochastic gradient descent does not pick an arbitrary point in the loss landscape; it preferentially settles in regions of low parameter norm, low curvature, and wide minima. Wide minima, where the loss is flat in many directions, generalise better than sharp minima, because small perturbations in the data shift the loss only slightly. SGD is not a neutral optimiser: it has its own inductive bias, and that bias is part of why deep nets generalise.
Neural tangent kernel. As a network's width tends to infinity, training dynamics linearise around their initialisation, and the network becomes equivalent to a kernel method with a specific, computable kernel, the neural tangent kernel. In this regime classical generalisation theory for kernel methods applies, and gives sensible bounds. The puzzle becomes how the finite-width regime departs from the infinite-width one.
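The empirical (finite-width) tangent kernel is easy to compute for a small network. The sketch below writes it out analytically for a one-hidden-layer ReLU network; the architecture and initialisation scales are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer ReLU network f(x) = v . relu(W x), with the
# empirical (finite-width) neural tangent kernel
#   k(x, x') = grad_theta f(x) . grad_theta f(x'),
# written out analytically for this architecture.
d, width = 5, 4_096
W = rng.normal(0, 1 / np.sqrt(d), size=(width, d))
v = rng.normal(0, 1 / np.sqrt(width), size=width)

def empirical_ntk(x1, x2):
    h1, h2 = W @ x1, W @ x2
    a1, a2 = np.maximum(h1, 0), np.maximum(h2, 0)   # hidden activations
    # Gradient w.r.t. v is the activation vector; gradient w.r.t. W_ij is
    # v_i * 1[h_i > 0] * x_j, so the W-part of the inner product factorises.
    grad_v_term = a1 @ a2
    grad_W_term = (x1 @ x2) * np.sum(v**2 * (h1 > 0) * (h2 > 0))
    return grad_v_term + grad_W_term

x, xp = rng.normal(size=d), rng.normal(size=d)
print(empirical_ntk(x, xp))
# As the width grows, this kernel concentrates around its infinite-width
# limit and barely moves during training, which is what makes the NTK
# analysis tractable.
```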
Double descent. Classical theory predicts a U-shaped error curve as model capacity increases: too little capacity and the model underfits; too much and it overfits. Empirically, deep networks show a second descent: test error rises to a peak at the interpolation threshold (where the model just barely fits the training set) and then falls again as parameters increase further. Beyond interpolation, more parameters help. Classical bounds simply miss this regime entirely.
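The phenomenon can be reproduced without a deep network at all. The sketch below uses random ReLU features with minimum-norm least squares, a common stand-in for an over-parameterised model; the data model, noise level, and feature counts are illustrative choices of ours, and the peak near the interpolation threshold is typical rather than guaranteed for any single random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Double-descent sketch: random ReLU features plus minimum-norm least squares.
d, n_train, n_test = 10, 100, 2_000
w_true = rng.normal(size=d)
X_train, X_test = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true + 0.5 * rng.normal(size=n_test)

for n_features in [10, 50, 90, 100, 110, 200, 1_000, 5_000]:
    V = rng.normal(size=(d, n_features)) / np.sqrt(d)   # random projection
    phi_train = np.maximum(X_train @ V, 0)              # random ReLU features
    phi_test = np.maximum(X_test @ V, 0)
    # lstsq returns the minimum-norm solution once n_features > n_train.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"features={n_features:>5}  test MSE = {test_mse:8.3f}")
# Test error typically rises towards the interpolation threshold
# (n_features ~ n_train) and falls again beyond it: the second descent.
```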
The gap between theory and practice in deep learning generalisation remains one of the field's open problems. The theory is not wrong; its assumptions are merely too pessimistic for the regime we now operate in. Repairing it is a live research area and the locus of much of the most interesting work in learning theory today.
What you should take away
- True risk is what we care about; empirical risk is what we measure. The generalisation gap, $|R(h) - \hat R(h)|$, is the bridge between them, and bounding it is the central technical project of learning theory.
- Hoeffding's inequality bounds the gap for a single, pre-chosen hypothesis at rate $\sqrt{\log(2/\delta)/(2n)}$. The bound does not apply to a hypothesis chosen by training, because the choice depends on the very sample being evaluated.
- Uniform convergence handles the search problem by bounding the gap across all hypotheses simultaneously. For finite classes the price is logarithmic in cardinality; for infinite classes we need capacity measures.
- VC dimension counts the largest set a class can shatter; for linear classifiers in $\mathbb{R}^d$ it is $d + 1$. Rademacher complexity measures correlation with random labels and is data-dependent; PAC-Bayes works with distributions over hypotheses and gives the tightest non-vacuous deep-net bounds to date.
- Classical bounds are vacuous for over-parameterised deep networks, yet those networks generalise. Implicit bias of SGD, neural tangent kernel theory, and the double-descent phenomenon are the three main threads currently trying to close the gap between theory and practice.