Generalisation is the central problem of machine learning: how well a model trained on data $\mathcal{D}_\mathrm{train}$ performs on unseen data drawn from the same distribution. The generalisation gap is the difference between the true (expected) loss $L$ and the training loss $\hat L$:
$$\text{generalisation gap} = L(\theta) - \hat L(\theta)$$
A model with low training loss but high test loss is overfitting: it has memorised training-specific details rather than learning the underlying pattern.
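A minimal numpy sketch makes the gap concrete: fit a small and a large polynomial to the same noisy sample and compare held-out loss. The sinusoidal toy data, noise level, and degrees are illustrative assumptions, not a canonical experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: y = sin(3x) + noise, with train and test drawn i.i.d.
# from the same distribution (the setup is an illustrative assumption).
def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = sample(20)
x_test, y_test = sample(2000)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

for degree in (3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares fit
    train_loss = mse(y_train, np.polyval(coeffs, x_train))  # \hat L(theta)
    test_loss = mse(y_test, np.polyval(coeffs, x_test))     # estimate of L(theta)
    print(f"degree {degree:2d}: train {train_loss:.3f}  "
          f"test {test_loss:.3f}  gap {test_loss - train_loss:.3f}")
```

The degree-12 fit drives training loss toward zero while held-out loss grows; the large positive gap is the signature of overfitting.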
Classical theory bounds generalisation by the complexity of the hypothesis class $\mathcal{H}$ (VC dimension, Rademacher complexity, PAC-Bayes). These bounds take the form
$$L(\theta) \leq \hat L(\theta) + O\!\left(\sqrt{\frac{\text{complexity}(\mathcal{H})}{N}}\right)$$
where $N$ is the training-set size. Tighter generalisation requires either a smaller hypothesis class or more data.
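For concreteness, the sketch below evaluates one textbook form of the VC bound; the exact constants differ across derivations, and this form and the numbers are illustrative.

```python
import math

def vc_gap_bound(d: int, n: int, delta: float = 0.05) -> float:
    """One textbook VC bound on the generalisation gap, holding with
    probability >= 1 - delta (constants vary across derivations)."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# The bound tightens with more data N and loosens with a richer class H:
for n in (1_000, 100_000):
    print(f"N = {n:>6}: gap <= {vc_gap_bound(d=100, n=n):.3f}")

# Once d is comparable to or exceeds N, the bound exceeds 1 and is
# vacuous -- exactly the overparameterised regime discussed next.
print(f"d = N = 1000: gap <= {vc_gap_bound(d=1000, n=1000):.3f}")
```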
The deep-learning puzzle: modern overparameterised neural networks have parameter counts, and hence classical complexity measures, that vastly exceed their training-set sizes, rendering bounds like the one above vacuous; yet they generalise well. Classical theory predicts catastrophic overfitting; empirically it does not happen.
Proposed explanations:
- Implicit regularisation of SGD: stochastic gradient noise biases solutions toward flat minima, which empirically generalise better (a crude sharpness probe appears after this list).
- Lottery ticket hypothesis: dense networks contain sparse subnetworks that are trainable in isolation; the dense initialisation aids the search for such subnetworks, so much of the parameter count is never "really" used at the solution.
- Neural tangent kernel: in the infinite-width limit, networks behave as kernel regression; effective capacity is determined by the kernel, not parameter count.
- Manifold hypothesis: real-world data lies on a low-dimensional manifold, so high-dimensional capacity is largely unused.
- Double descent: past the interpolation threshold, test error can fall again as capacity grows, so the classical U-shaped complexity picture, and bounds like the one above, stop being predictive in this regime.
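To make the flat-minima intuition from the first bullet concrete, one crude proxy for sharpness is the average loss increase under random weight perturbation. A minimal sketch with two toy quadratic minima of equal training loss (the losses and noise scale are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two minima with identical loss at theta = 0 but different curvature
# (toy quadratics, purely illustrative).
sharp_loss = lambda theta: 50.0 * np.sum(theta ** 2)  # narrow valley
flat_loss = lambda theta: 0.5 * np.sum(theta ** 2)    # wide valley

def sharpness(loss, theta_star, sigma=0.1, n_samples=1000):
    """Mean loss increase under Gaussian weight perturbations --
    a crude proxy for the sharpness of a minimum."""
    noise = sigma * rng.standard_normal((n_samples, theta_star.size))
    return np.mean([loss(theta_star + d) for d in noise]) - loss(theta_star)

theta_star = np.zeros(10)
print(f"sharp minimum: {sharpness(sharp_loss, theta_star):.3f}")
print(f"flat minimum:  {sharpness(flat_loss, theta_star):.3f}")
```

Both minima achieve the same training loss; the flat one is far less sensitive to perturbation, which is the property argued to correlate with generalisation.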
Out-of-distribution (OOD) generalisation is harder still: performing well on data drawn from a distribution that differs from the training distribution. Distribution shifts can be (covariate shift is simulated in the sketch after this list):
- Covariate shift: $p(x)$ changes, $p(y | x)$ stable.
- Label shift: $p(y)$ changes, $p(x | y)$ stable.
- Concept drift: $p(y | x)$ changes.
- Domain shift: a different domain entirely.
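Covariate shift is the easiest of these to simulate: keep $p(y \mid x)$ fixed and move $p(x)$. In the sketch below, a linear logistic model fits a nonlinear boundary well enough where it was trained, but the same rule evaluated under a shifted input distribution exposes the misspecification (the data-generating setup is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed labelling rule p(y|x): y = 1 iff x2 > x1^2, a nonlinear boundary.
def label(X):
    return (X[:, 1] > X[:, 0] ** 2).astype(float)

def sample(x1_mean, n=4000):
    X = np.column_stack([rng.normal(x1_mean, 0.5, n), rng.normal(0, 1, n)])
    return X, label(X)

X_train, y_train = sample(x1_mean=0.0)  # training distribution
X_test, y_test = sample(x1_mean=2.0)    # covariate shift: p(x) moves

# Fit a linear logistic model by plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X_train @ w + b)))
    w -= 0.5 * X_train.T @ (p - y_train) / len(y_train)
    b -= 0.5 * np.mean(p - y_train)

def accuracy(X, y):
    return np.mean(((X @ w + b) > 0) == (y > 0.5))

print("in-distribution accuracy:", accuracy(X_train, y_train))
print("covariate-shifted accuracy:", accuracy(X_test, y_test))
```

Accuracy drops even though $p(y \mid x)$ never changed: the linear fit was only locally adequate. Importance weighting and domain adaptation target exactly this failure mode.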
Approaches to OOD generalisation include domain adaptation, invariant risk minimisation (IRM), group distributionally robust optimisation (group DRO), foundation-model pretraining, and sheer scale.
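Of these, group DRO is the most compact to sketch: maintain a weight per group, up-weight whichever group currently has the highest loss, and descend the re-weighted objective, in the spirit of the exponentiated-gradient update of Sagawa et al. (2020). The two-group toy data, scalar model, and step sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two groups whose input-output relation differs; the minority group is
# rare (the generating process is an illustrative assumption).
def make_group(slope, n):
    x = rng.normal(0, 1, n)
    return x, slope * x + 0.1 * rng.standard_normal(n)

groups = [make_group(1.0, 900), make_group(-1.0, 100)]  # majority, minority

def group_losses(w):
    return np.array([np.mean((w * x - y) ** 2) for x, y in groups])

# ERM baseline: pooled least squares, dominated by the majority group.
x_all = np.concatenate([x for x, _ in groups])
y_all = np.concatenate([y for _, y in groups])
w_erm = (x_all @ y_all) / (x_all @ x_all)

# Group DRO: exponentiated-gradient weights q over groups, gradient
# descent on the q-weighted loss, starting from the ERM solution.
w, q = w_erm, np.ones(2) / 2
for _ in range(2000):
    losses = group_losses(w)
    q *= np.exp(0.01 * losses)  # up-weight the currently worst-off group
    q /= q.sum()
    grads = np.array([np.mean(2 * (w * x - y) * x) for x, y in groups])
    w -= 0.1 * q @ grads        # descend the re-weighted objective

print("worst-group loss, ERM:", round(group_losses(w_erm).max(), 3))
print("worst-group loss, DRO:", round(group_losses(w).max(), 3))
```

The ERM fit tracks the majority group and leaves the minority with a large loss; group DRO trades some average performance for a far better worst group.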
Generalisation remains a partially understood phenomenon in deep learning. Reconciling the empirical reality (large overparameterised networks generalise well) with classical theory (which predicts they should not) is an ongoing research effort; the explanations above are partial and not mutually exclusive.
Related terms: VC Dimension, Statistical Learning Theory, Implicit Regularisation, Double Descent, Out-of-Distribution Generalisation
Discussed in:
- Chapter 6: Machine Learning Fundamentals