Glossary

Generalisation

Generalisation is the central problem of machine learning: how well a model trained on data $\mathcal{D}_\mathrm{train}$ performs on unseen data drawn from the same distribution. The generalisation gap is the difference between training loss $\hat L$ and true loss $L$:

$$\text{generalisation gap} = L(\theta) - \hat L(\theta)$$

A model with low training loss but high test loss is overfitting: it has memorised training-specific details rather than learning the underlying pattern.
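
Overfitting can be seen in a toy regression (the target function, seed, and polynomial degrees below are illustrative choices, not from the text): a degree-9 polynomial interpolates 10 training points, driving training loss toward zero while test loss, an estimate of the true loss $L(\theta)$, stays large, so the generalisation gap is big.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression: y = sin(2*pi*x) + noise (an illustrative choice).
def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A degree-9 polynomial can interpolate the 10 training points, so
# training loss collapses while the test loss stays high.
overfit = np.polyfit(x_train, y_train, deg=9)
gap = mse(overfit, x_test, y_test) - mse(overfit, x_train, y_train)
```

Here `gap` is exactly the quantity $L(\theta) - \hat L(\theta)$, estimated with a held-out test set.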

Classical theory: the generalisation gap is bounded by the complexity of the hypothesis class $\mathcal{H}$ (VC dimension, Rademacher complexity, PAC-Bayes). Bounds typically take the form

$$L(\theta) \leq \hat L(\theta) + O\!\left(\sqrt{\frac{\text{complexity}(\mathcal{H})}{N}}\right)$$

where $N$ is the training-set size. Tighter generalisation requires either a smaller hypothesis class or more data.
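
The $\sqrt{\text{complexity}/N}$ scaling has a practical reading: halving the bound on the gap requires four times the data. A schematic sketch (the complexity value is made up, and the suppressed constants are ignored):

```python
import math

# Schematic version of the bound above; "complexity" stands in for a
# VC-dimension-style quantity and constants are dropped.
def gap_bound(complexity, n):
    return math.sqrt(complexity / n)

b1 = gap_bound(1000, 10_000)
b2 = gap_bound(1000, 40_000)
# Quadrupling N halves the bound.
```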

The deep-learning puzzle: modern overparameterised neural networks have classical complexity measures (e.g. parameter counts) that vastly exceed their training-set size, yet they generalise well. Classical theory predicts catastrophic overfitting; empirically it doesn't happen.

Proposed explanations:

  • Implicit regularisation of SGD: stochastic gradient noise biases solutions toward flat minima that generalise better.
  • Lottery ticket hypothesis: dense networks contain sparse trainable subnetworks; the dense initialisation provides exploration without the parameter count "really" being used.
  • Neural tangent kernel: in the infinite-width limit, networks behave as kernel regression; effective capacity is determined by the kernel, not parameter count.
  • Manifold hypothesis: real-world data lies on a low-dimensional manifold, so high-dimensional capacity is largely unused.
  • Double descent: test error can fall again as capacity grows past the interpolation threshold, so classical complexity bounds are uninformative in that regime.
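
The interpolation regime can be illustrated with overparameterised linear regression (the dimensions, sparsity, and seed below are hypothetical, chosen only for the sketch): with more features than samples, the minimum-norm interpolator achieves zero training error, exactly the setting where classical bounds are vacuous, yet its test error is set by how well the data directions capture the true signal, not by the raw parameter count.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: d = 200 features, only n = 50 noiseless samples,
# and a sparse true signal.
n, d = 50, 200
w_true = np.zeros(d)
w_true[:5] = 1.0
X = rng.normal(size=(n, d))
y = X @ w_true

# lstsq returns the minimum-norm interpolating solution: training error
# is (numerically) zero even though d >> n.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
train_err = float(np.mean((X @ w_hat - y) ** 2))

# The interpolator's test error is finite and governed by the geometry
# of the problem, not the parameter count.
X_test = rng.normal(size=(500, d))
y_test = X_test @ w_true
test_err = float(np.mean((X_test @ w_hat - y_test) ** 2))
```

The minimum-norm bias of `lstsq` here plays the role that implicit regularisation is conjectured to play for SGD in the list above.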

Out-of-distribution (OOD) generalisation is harder still: performing well on data drawn from a distribution that differs from the training distribution. Distribution shifts can be:

  • Covariate shift: $p(x)$ changes, $p(y | x)$ stable.
  • Label shift: $p(y)$ changes, $p(x | y)$ stable.
  • Concept drift: $p(y | x)$ changes.
  • Domain shift: a different domain entirely.
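
Covariate shift, the first case above, can be simulated directly (the quadratic target, input ranges, and noise level are illustrative): $p(y \mid x)$ is held fixed while the test inputs move outside the training range, and a model that is adequate in-distribution degrades sharply.

```python
import numpy as np

rng = np.random.default_rng(2)

# p(y | x) is fixed: y = x**2 + noise (an illustrative choice).
def labels(x):
    return x ** 2 + 0.05 * rng.normal(size=x.shape)

# Covariate shift: training inputs cover [0, 1], test inputs cover [1, 2].
x_train = rng.uniform(0, 1, 200)
x_test = rng.uniform(1, 2, 200)

# A linear fit is adequate on the training range but extrapolates poorly.
coeffs = np.polyfit(x_train, labels(x_train), deg=1)

def mse(x):
    return float(np.mean((np.polyval(coeffs, x) - labels(x)) ** 2))

in_dist, shifted = mse(x_train), mse(x_test)
```

Note that nothing about the labelling function changed; only $p(x)$ moved, which is what distinguishes covariate shift from concept drift.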

Approaches to OOD generalisation: domain adaptation, invariant risk minimisation, group distributionally robust optimisation, foundation-model pretraining, scale.

Generalisation remains a partially understood phenomenon in deep learning. The empirical reality (large overparameterised networks generalise well) and the classical theory (which predicts they shouldn't) have yet to be fully reconciled; doing so is an active area of research.

Related terms: VC Dimension, Statistical Learning Theory, Implicit Regularisation, Double Descent, Out-of-Distribution Generalisation

Discussed in: Textbook of Usability · Textbook of Digital Health