Overfitting occurs when a model fits the training data so closely that it captures not only the genuine patterns but also the noise and idiosyncrasies. An overfit model achieves low training error but high test error—it fails to generalise to new data drawn from the same distribution. Overfitting is the central concern of statistical learning theory and the reason the field has developed such an elaborate toolkit of regularisation techniques.
Overfitting arises when a model's capacity exceeds what the data can support. A linear regression with many more features than examples will fit the training set exactly but make essentially arbitrary predictions on new points. A deep neural network with millions of parameters can, if trained without regularisation, achieve zero training error even on randomly labelled data—yet such a model is useless in practice. Statistical learning theory quantifies this via concepts like VC dimension and PAC bounds, which relate generalisation error to training error, model capacity, and sample size.
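The capacity argument can be made concrete with a minimal numpy sketch. Here the true signal is a quadratic observed with noise, and we compare a degree-2 fit against a degree-9 fit on ten training points; the dataset sizes, noise level, and random seed are illustrative assumptions, not taken from the text. A degree-9 polynomial through ten points can interpolate the noise exactly, so its training error collapses while its test error does not:

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")  # high-degree polyfit may warn about ill-conditioning
rng = np.random.default_rng(0)

# Hypothetical setup: the true signal is quadratic, observed with Gaussian noise.
def true_fn(x):
    return x ** 2

x_tr = rng.uniform(-1, 1, 10)
y_tr = true_fn(x_tr) + rng.normal(0.0, 0.1, 10)
x_te = rng.uniform(-1, 1, 200)
y_te = true_fn(x_te) + rng.normal(0.0, 0.1, 200)

def poly_mse(degree):
    # Least-squares polynomial fit; degree 9 on 10 points has enough
    # capacity to interpolate the noise, driving training error to ~0.
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return {"train": float(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)),
            "test": float(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))}

results = {deg: poly_mse(deg) for deg in (2, 9)}
print(results)
```

The degree-2 model's training and test errors both sit near the noise floor, while the degree-9 model shows the signature train/test gap described above.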
Diagnosing overfitting requires a held-out validation set: if training error is low but validation error is high, the model is overfitting. Remedies include collecting more data, simplifying the model, adding regularisation (L1/L2, dropout, weight decay), data augmentation, early stopping, and ensembling. The opposite failure, underfitting, occurs when the model is too simple to capture the underlying patterns; both training and validation errors are high. Navigating between underfitting and overfitting is the central craft of applied machine learning.
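The diagnose-then-regularise workflow can be sketched with L2 regularisation (ridge regression), chosen here as one representative remedy from the list above. The sizes, seed, and λ grid are illustrative assumptions: 25 training examples against 20 features gives an unregularised linear model nearly enough capacity to memorise noise, and a held-out validation set is used both to expose the train/validation gap and to pick the regularisation strength:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 20 features but only 25 training examples, so an
# unregularised least-squares fit largely memorises the training noise.
n_train, n_val, d = 25, 500, 20
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + rng.normal(0.0, 1.0, n)

X_tr, y_tr = make_data(n_train)
X_va, y_va = make_data(n_val)

def ridge(X, y, lam):
    # Closed-form L2-regularised least squares:
    #   w = (X^T X + lam * I)^{-1} X^T y   (lam = 0 is ordinary least squares)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Sweep the regularisation strength; diagnose overfitting by the gap
# between training and validation error, and pick lam on the validation set.
errs = {lam: {"train": mse(ridge(X_tr, y_tr, lam), X_tr, y_tr),
              "val": mse(ridge(X_tr, y_tr, lam), X_va, y_va)}
        for lam in (0.0, 0.1, 1.0, 10.0)}
best_lam = min(errs, key=lambda lam: errs[lam]["val"])
print(errs, best_lam)
```

At λ = 0 the validation error far exceeds the training error, the diagnostic gap described above; a nonzero λ trades a little training error for substantially lower validation error.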
Related terms: Underfitting, Regularisation, Bias-Variance Tradeoff, Cross-Validation
Discussed in:
- Chapter 6: ML Fundamentals — The ML Framework
Also defined in: Textbook of AI, Textbook of Medical AI