Underfitting is the mirror image of overfitting: the model is too simple, too constrained, or trained for too short a time to capture the patterns in the data, resulting in poor performance on both the training set and the test set. An underfit model exhibits high bias: it systematically misses the truth regardless of how much training data it sees, because the function class it can represent is too restrictive to contain the true input–output relationship.
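The role of bias can be made explicit with the standard bias–variance decomposition of expected squared error. The notation is an assumption introduced here for illustration and is not part of the entry itself: $f$ is the true function, $\hat{f}$ is the model fitted on a random training set, and $\sigma^2$ is the variance of the noise $\varepsilon$ in $y = f(x) + \varepsilon$.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2
$$

The first term is the squared bias. In an underfit model it dominates, and collecting more data mainly shrinks the variance term, which is why the error stays high.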
Diagnostic signature
The clearest diagnostic for underfitting versus overfitting is the gap between training error and test error:
- Underfitting: high training error, high test error, small gap. The model cannot even fit what it has seen.
- Overfitting: low training error, high test error, large gap. The model has memorised idiosyncrasies that do not generalise.
- Good fit: low training error, low test error, small gap.
Plotting learning curves (error versus training-set size) makes this visible: an underfit model's training and test errors converge to a high plateau as data increases, while an overfit model's curves stay separated by a wide gap that narrows only as more data is added.
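A learning curve for an underfit model can be sketched in a few lines. This is a minimal illustration rather than a general diagnostic tool: it assumes a straight-line model fitted to $y = \sin(x) + \varepsilon$ on $[0, 2\pi]$, a noise standard deviation of 0.1, and arbitrary training-set sizes. The helper names (`fit_line`, `mse`, `sample`) are hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def mse(a, b, xs, ys):
    """Mean squared error of the line a*x + b on (xs, ys)."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng, noise=0.1):
    """Draw n points from y = sin(x) + Gaussian noise (assumed setup)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)
x_test, y_test = sample(2000, rng)

# Record (training-set size, training MSE, test MSE) for growing n.
learning_curve = []
for n in (20, 80, 320, 1280):
    x_tr, y_tr = sample(n, rng)
    a, b = fit_line(x_tr, y_tr)
    learning_curve.append((n, mse(a, b, x_tr, y_tr), mse(a, b, x_test, y_test)))

for n, tr, te in learning_curve:
    print(f"n={n:5d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

With enough data, both errors should settle close to each other and well above the noise variance of 0.01. That combination of a small gap and a high plateau is the underfitting signature.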
Examples
- A linear regression fitted to data generated by $y = \sin(x) + \varepsilon$ underfits: no straight line can capture the sinusoid.
- A shallow CNN with a handful of filters applied to ImageNet underfits: the model lacks the capacity to learn the diversity of object categories.
- A decision stump (tree of depth 1) on a problem with high-order feature interactions underfits, because it splits on only a single feature.
- Excessive regularisation (for example, weight decay set too high or an overly aggressive dropout probability) can cause an otherwise well-sized model to underfit.
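The decision-stump example can be reproduced on a toy problem. The sketch below assumes XOR-labelled binary data, where the label depends only on the interaction of two features. It uses a small hypothetical tree learner written for this illustration, not any particular library's implementation; it tries every feature at each node and keeps the split with the fewest training errors.

```python
from collections import Counter

# XOR: the label depends on the interaction of both features.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
y = [a ^ b for a, b in X]

def majority(labels):
    """Most common label; defaults to 0 for an empty node."""
    return Counter(labels).most_common(1)[0][0] if labels else 0

def predict(tree, row):
    """Walk internal nodes (feature, left, right) down to a leaf label."""
    while isinstance(tree, tuple):
        j, left, right = tree
        tree = right if row[j] else left
    return tree

def fit_tree(X, y, depth):
    """Tree of at most `depth` splits per path on binary features."""
    if depth == 0 or len(set(y)) <= 1:
        return majority(y)
    best_tree, best_errors = None, None
    for j in range(len(X[0])):
        sides = []
        for v in (0, 1):
            idx = [i for i, row in enumerate(X) if row[j] == v]
            sides.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], depth - 1))
        tree = (j, sides[0], sides[1])
        errors = sum(predict(tree, row) != label for row, label in zip(X, y))
        if best_errors is None or errors < best_errors:
            best_tree, best_errors = tree, errors
    return best_tree

def accuracy(tree, X, y):
    return sum(predict(tree, row) == label for row, label in zip(X, y)) / len(y)

stump_acc = accuracy(fit_tree(X, y, depth=1), X, y)
deep_acc = accuracy(fit_tree(X, y, depth=2), X, y)
print(f"depth 1 training accuracy: {stump_acc:.2f}")
print(f"depth 2 training accuracy: {deep_acc:.2f}")
```

The stump does no better than chance even on its own training data, whereas one extra level of depth is enough to represent the interaction.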
Remedies
The remedies for underfitting are essentially the opposite of those for overfitting:
- Increase model capacity: more layers, wider layers, more parameters, more complex non-linearities.
- Reduce regularisation: lower weight decay, reduce dropout probability, or increase early-stopping patience so that training is not cut short.
- Engineer richer features: for tabular models without learned representations, add polynomial features, interaction terms, or domain-specific transformations.
- Train longer: if the training loss is still decreasing at the end of the schedule, the model has not yet converged.
- Improve optimisation: a poorly chosen optimiser or learning rate can leave a model stuck far from a good fit.
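The effect of reducing regularisation can be seen in a one-parameter example. This sketch assumes synthetic data from a hypothetical ground truth $y = 3x + \varepsilon$ and uses an L2 penalty in closed form as a stand-in for weight decay. The penalty strengths are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(0)

# Synthetic data from an assumed linear ground truth y = 3x + noise.
xs = [rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

def fit_ridge(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2 over the single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def train_mse(w, xs, ys):
    """Mean squared training error of the model y = w*x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Map each penalty strength to (fitted weight, training MSE).
results = {}
for lam in (10_000.0, 100.0, 0.01):
    w = fit_ridge(xs, ys, lam)
    results[lam] = (w, train_mse(w, xs, ys))
    print(f"lambda={lam:>8}: w={w:.3f}  train MSE={results[lam][1]:.3f}")
```

The model family is identical in all three runs. Only the penalty changes, and the heavily penalised fit is pulled towards $w = 0$ and cannot match even the training data.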
Modern deep learning
Classical statistics framed model selection as a U-shaped curve: error first decreases as capacity grows (escaping underfitting), then rises (entering overfitting). The minimum sits at a "sweet spot" of intermediate complexity. Modern deep learning often sidesteps underfitting entirely by starting with very large, highly expressive models, far past the classical interpolation threshold, and relying on regularisation, large datasets, and the implicit bias of stochastic gradient descent to control overfitting.
The double descent phenomenon (Belkin et al., 2019) refines the classical picture: test error rises as model capacity approaches the interpolation threshold (the point at which the model can fit the training data exactly) and peaks near it, but then descends again as capacity grows further, often reaching a lower value than the classical sweet spot. The discovery of double descent has not overturned the bias–variance trade-off, but it has shifted modern practice toward "go big and regularise" rather than the classical "find the sweet spot."
Related terms: Overfitting, Bias-Variance Tradeoff, Regularisation, Weight Decay, Generalisation, Double Descent
Discussed in:
- Chapter 6: ML Fundamentals, Generalisation, overfitting, and underfitting