Generalisation is the central problem of machine learning: how well a model trained on data $\mathcal{D}_\mathrm{train}$ performs on unseen data drawn from the same distribution. The generalisation gap is the difference between the true (expected) loss $L$ and the training loss $\hat L$:
$$\text{generalisation gap} = L(\theta) - \hat L(\theta)$$
A model with low training loss but high test loss is overfitting: it has memorised training-specific details rather than learning the underlying pattern.
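A minimal numpy sketch makes the gap concrete: fit a small and a large polynomial to the same noisy sample and compare held-out loss. The sinusoidal toy data, noise level, and degrees are illustrative assumptions, not a canonical experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: y = sin(3x) + noise, with train and test drawn i.i.d.
# from the same distribution (the setup is an illustrative assumption).
def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = sample(20)
x_test, y_test = sample(2000)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares fit
    train_loss = mse(y_train, np.polyval(coeffs, x_train))  # \hat L(theta)
    test_loss = mse(y_test, np.polyval(coeffs, x_test))     # estimate of L(theta)
    print(f"degree {degree:2d}: train {train_loss:.3f}  "
          f"test {test_loss:.3f}  gap {test_loss - train_loss:.3f}")
```

The degree-12 fit drives training loss toward zero while held-out loss grows; the large positive gap is the signature of overfitting.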
Classical theory bounds generalisation by the complexity of the hypothesis class $\mathcal{H}$ (VC dimension, Rademacher complexity, PAC-Bayes). These bounds take the form
$$L(\theta) \leq \hat L(\theta) + O\!\left(\sqrt{\frac{\text{complexity}(\mathcal{H})}{N}}\right)$$
where $N$ is the training-set size. Tighter generalisation requires either a smaller hypothesis class or more data.
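For concreteness, the sketch below evaluates one textbook form of the VC bound; the exact constants differ across derivations, and this form and the numbers are illustrative.

```python
import math

def vc_gap_bound(d: int, n: int, delta: float = 0.05) -> float:
    """One textbook VC bound on the generalisation gap, holding with
    probability >= 1 - delta (constants vary across derivations)."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# The bound tightens with more data N and loosens with a richer class H:
for n in (1_000, 100_000):
    print(f"N = {n:>6}: gap <= {vc_gap_bound(d=100, n=n):.3f}")

# Once d is comparable to or exceeds N, the bound exceeds 1 and is
# vacuous -- exactly the overparameterised regime discussed next.
print(f"d = N = 1000: gap <= {vc_gap_bound(d=1000, n=1000):.3f}")
```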
The deep-learning puzzle: modern overparameterised neural networks have parameter counts, and hence classical complexity measures, that vastly exceed their training-set sizes, rendering bounds like the one above vacuous; yet they generalise well. Classical theory predicts catastrophic overfitting; empirically it does not happen.
Proposed explanations:
- Implicit regularisation of SGD: stochastic gradient noise biases solutions toward flat minima, which empirically generalise better (a crude sharpness probe appears after this list).
- Lottery ticket hypothesis: dense networks contain sparse subnetworks that are trainable in isolation; the dense initialisation aids the search for such subnetworks, so much of the parameter count is never "really" used at the solution.
- Neural tangent kernel: in the infinite-width limit, networks behave as kernel regression; effective capacity is determined by the kernel, not parameter count.
- Manifold hypothesis: real-world data lies on a low-dimensional manifold, so high-dimensional capacity is largely unused.
- Double descent: past the interpolation threshold, test error can fall again as capacity grows, so the classical U-shaped complexity picture, and bounds like the one above, stop being predictive in this regime.
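To make the flat-minima intuition from the first bullet concrete, one crude proxy for sharpness is the average loss increase under random weight perturbation. A minimal sketch with two toy quadratic minima of equal training loss (the losses and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two minima with identical loss at theta = 0 but different curvature
# (toy quadratics, purely illustrative).
sharp_loss = lambda theta: 50.0 * np.sum(theta ** 2)  # narrow valley
flat_loss = lambda theta: 0.5 * np.sum(theta ** 2)    # wide valley

def sharpness(loss, theta_star, sigma=0.1, n_samples=1000):
    """Mean loss increase under Gaussian weight perturbations --
    a crude proxy for the sharpness of a minimum."""
    noise = sigma * rng.standard_normal((n_samples, theta_star.size))
    return np.mean([loss(theta_star + d) for d in noise]) - loss(theta_star)

theta_star = np.zeros(10)
print(f"sharp minimum: {sharpness(sharp_loss, theta_star):.3f}")
print(f"flat minimum:  {sharpness(flat_loss, theta_star):.3f}")
```

Both minima achieve the same training loss; the flat one is far less sensitive to perturbation, which is the property argued to correlate with generalisation.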
Out-of-distribution (OOD) generalisation is harder still: performing well on data drawn from a distribution that differs from the training distribution. Distribution shifts can be (covariate shift is simulated in the sketch after this list):
- Covariate shift: $p(x)$ changes, $p(y | x)$ stable.
- Label shift: $p(y)$ changes, $p(x | y)$ stable.
- Concept drift: $p(y | x)$ changes.
- Domain shift: a different domain entirely.
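Covariate shift is the easiest of these to simulate: keep $p(y \mid x)$ fixed and move $p(x)$. In the sketch below, a linear logistic model fits a nonlinear boundary well enough where it was trained, but the same rule evaluated under a shifted input distribution exposes the misspecification (the data-generating setup is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed labelling rule p(y|x): y = 1 iff x2 > x1^2, a nonlinear boundary.
def label(X):
    return (X[:, 1] > X[:, 0] ** 2).astype(float)

def sample(x1_mean, n=4000):
    X = np.column_stack([rng.normal(x1_mean, 0.5, n), rng.normal(0, 1, n)])
    return X, label(X)

X_train, y_train = sample(x1_mean=0.0)  # training distribution
X_test, y_test = sample(x1_mean=2.0)    # covariate shift: p(x) moves

# Fit a linear logistic model by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))
    w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)
    b -= 0.5 * np.mean(p - y_train)

def accuracy(X, y):
    return np.mean(((X @ w + b) > 0) == (y > 0.5))

print("in-distribution accuracy:", accuracy(X_train, y_train))
print("covariate-shifted accuracy:", accuracy(X_test, y_test))
```

Accuracy drops even though $p(y \mid x)$ never changed: the linear fit was only locally adequate. Importance weighting and domain adaptation target exactly this failure mode.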
Approaches to OOD generalisation include domain adaptation, invariant risk minimisation (IRM), group distributionally robust optimisation (group DRO), foundation-model pretraining, and sheer scale.
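Of these, group DRO is the most compact to sketch: maintain a weight per group, up-weight whichever group currently has the highest loss, and descend the re-weighted objective, in the spirit of the exponentiated-gradient update of Sagawa et al. (2020). The two-group toy data, scalar model, and step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two groups whose input-output relation differs; the minority group is
# rare (the generating process is an illustrative assumption).
def make_group(slope, n):
    x = rng.normal(0, 1, n)
    return x, slope * x + 0.1 * rng.standard_normal(n)

groups = [make_group(1.0, 900), make_group(-1.0, 100)]  # majority, minority

def group_losses(w):
    return np.array([np.mean((w * x - y) ** 2) for x, y in groups])

# ERM baseline: pooled least squares, dominated by the majority group.
x_all = np.concatenate([x for x, _ in groups])
y_all = np.concatenate([y for _, y in groups])
w_erm = (x_all @ y_all) / (x_all @ x_all)

# Group DRO: exponentiated-gradient weights q over groups, gradient
# descent on the q-weighted loss, starting from the ERM solution.
w, q = w_erm, np.ones(2) / 2
for _ in range(2000):
    losses = group_losses(w)
    q *= np.exp(0.01 * losses)  # up-weight the currently worst-off group
    q /= q.sum()
    grads = np.array([np.mean(2 * (w * x - y) * x) for x, y in groups])
    w -= 0.1 * q @ grads        # descend the re-weighted objective

print("worst-group loss, ERM:", round(group_losses(w_erm).max(), 3))
print("worst-group loss, DRO:", round(group_losses(w).max(), 3))
```

The ERM fit tracks the majority group and leaves the minority with a large loss; group DRO trades some average performance for a far better worst group.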
Generalisation remains a partially understood phenomenon in deep learning. Reconciling the empirical reality (large overparameterised networks generalise well) with classical theory (which predicts they should not) is an ongoing research effort; the explanations above are partial and not mutually exclusive.
Related terms: VC Dimension, Statistical Learning Theory, Implicit Regularisation, Double Descent, Out-of-Distribution Generalisation
Discussed in:
- Chapter 6: Machine Learning Fundamentals