Early Stopping

Early stopping is a simple yet effective regularisation technique: monitor the model's performance on a held-out validation set during training, and halt when the validation loss stops improving, even if the training loss is still decreasing. The model checkpoint from the epoch with the best validation loss is kept and used as the final model.

The bias-variance phases of training

The rationale is that training typically progresses through three distinct phases:

  1. Underfitting phase. Both training and validation loss decrease together as the model learns genuine, generalisable patterns.
  2. Generalising phase. The model approaches its capacity ceiling for the true signal; both losses plateau.
  3. Overfitting phase. The model begins to memorise idiosyncrasies of the training set; training loss continues to fall while validation loss starts to rise.

Early stopping approximates the turning point between phases 2 and 3, where validation loss reaches its minimum, and commits to the best-generalising model seen up to that point.
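The three phases can be seen retrospectively in a pair of loss curves. The sketch below uses invented per-epoch losses (not measurements from any real run) and two hypothetical helpers, `best_epoch` and `generalisation_gap`, to locate the validation minimum and show the gap widening afterwards.

```python
# Hypothetical per-epoch losses, invented to illustrate the three phases.
train_loss = [1.00, 0.70, 0.50, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]
val_loss = [1.05, 0.78, 0.60, 0.52, 0.50, 0.50, 0.53, 0.58, 0.65, 0.73]


def best_epoch(val_losses):
    """Return the index of the lowest validation loss (the first one on ties)."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def generalisation_gap(train_losses, val_losses):
    """Per-epoch difference between validation loss and training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


best = best_epoch(val_loss)
gaps = generalisation_gap(train_loss, val_loss)

print("best epoch:", best)
print("gap at best epoch vs final epoch:", gaps[best], gaps[-1])
```

In a live run the full curve is not available in advance, which is why the patience-based rule is used instead of a simple minimum over all epochs.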

Practical implementation

The standard implementation tracks validation loss at the end of each epoch (or every $k$ steps for large datasets) and uses a patience parameter:

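# Run once per validation check (end of epoch, or every k steps).
# Before training: best_val_loss = infinity, wait = 0.
if val_loss < best_val_loss - min_delta:
    # Improved by more than min_delta: record it and reset the counter.
    best_val_loss = val_loss
    best_weights = copy(model.weights)  # must be a deep copy, not a reference
    wait = 0
else:
    # No meaningful improvement on this check.
    wait += 1
    if wait >= patience:
        stop_training()  # then restore best_weights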

Typical settings are a patience of 5-20 epochs for classical tasks, $\mathrm{min\_delta} = 0$ or a small positive threshold, and restoration of the best-seen weights after stopping.
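A minimal runnable sketch of the same logic in Python follows. The class name `EarlyStopping`, its `step` method and the toy loss values are illustrative assumptions, not the API of any particular framework; a plain dict stands in for the model weights.

```python
import copy


class EarlyStopping:
    """Patience-based early stopping on a loss that should decrease."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_state = None
        self.best_epoch = None
        self.wait = 0

    def step(self, epoch, val_loss, state):
        """Record one validation check; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # snapshot, not a reference
            self.best_epoch = epoch
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


# Toy run: hypothetical validation losses stand in for a real training loop.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]
stopper = EarlyStopping(patience=3, min_delta=0.0)
weights = {"w": 0.0}
stopped_at = None
for epoch, loss in enumerate(val_losses):
    weights["w"] = float(epoch)  # pretend the weights changed this epoch
    if stopper.step(epoch, loss, weights):
        stopped_at = epoch
        break
weights = stopper.best_state  # restore the best-seen weights

print("stopped at epoch", stopped_at, "best epoch", stopper.best_epoch)
```

The deep copy matters: storing a reference to the live weights would leave the "best" snapshot silently tracking later, worse epochs.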

Theoretical justification

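Early stopping in gradient descent on a quadratic loss is approximately equivalent to L2 regularisation (ridge regression), with an effective strength that depends on the learning rate and the iteration count. The argument, due to Bishop (1995) and elaborated by Friedman & Popescu (2003), rests on the closed form of the iterates:

$$\theta_t = (I - \eta H)^t \theta_0 + \left[I - (I - \eta H)^t\right]\theta^*,$$

where $H$ is the Hessian, $\eta$ the learning rate, $\theta_0$ the initialisation and $\theta^*$ the minimiser of the quadratic loss. Along an eigendirection of $H$ with eigenvalue $h_i$, the factor $(1 - \eta h_i)^t$ decays quickly when curvature is high and slowly when it is low. Stopping at finite $t$ therefore leaves the components in low-curvature directions close to their initial values, that is, shrunk towards $\theta_0$ (typically near zero). This is qualitatively the effect L2 has by adding $\lambda I$ to $H$, and it gives a principled justification for what practitioners had long used heuristically.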
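The correspondence can be checked numerically one eigendirection at a time. The sketch below assumes $\theta_0 = 0$, so that gradient descent retains a fraction $1 - (1 - \eta h)^t$ of the unregularised solution along a direction with curvature $h$, while the L2-regularised minimiser retains $h / (h + \lambda)$. The matching rule $\lambda \approx 1/(\eta t)$ is the commonly quoted small-step approximation and is an assumption here, as are all the numerical values.

```python
def gd_factor(h, eta, t):
    """Fraction of the way from theta_0 = 0 to theta* after t gradient steps
    along an eigendirection of H with eigenvalue h."""
    return 1.0 - (1.0 - eta * h) ** t


def ridge_factor(h, lam):
    """Fraction of theta* kept by the L2-regularised minimiser."""
    return h / (h + lam)


eta, t = 0.01, 100                    # illustrative learning rate and stopping step
lam = 1.0 / (eta * t)                 # assumed matching rule: lam ~ 1 / (eta * t)
eigenvalues = [0.01, 0.1, 1.0, 10.0]  # illustrative curvatures, all with eta * h < 2

gd = [gd_factor(h, eta, t) for h in eigenvalues]
ridge = [ridge_factor(h, lam) for h in eigenvalues]

for h, g, r in zip(eigenvalues, gd, ridge):
    print(f"h={h:<5} early stopping keeps {g:.3f}, ridge keeps {r:.3f}")
```

Both factors are close to 0 in flat directions and close to 1 in steep ones. They agree most closely when $\eta h$ is small and drift apart at intermediate curvature, which is why the equivalence is approximate rather than exact.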

Advantages

Early stopping is essentially free:

  • No model changes are required.
  • No optimiser changes are required.
  • No extra hyperparameters beyond patience and, optionally, $\mathrm{min\_delta}$.
  • It saves compute: a run that would have taken 100 epochs may be stopped after 30.
  • It is robust to misspecified training schedules: even with too many epochs scheduled, the best checkpoint is retained.

Interaction with other techniques

Early stopping interacts subtly with learning rate schedules. With cosine annealing or one-cycle policies, the validation loss may temporarily worsen as the learning rate transitions, only to improve again; the patience parameter exists to ride out such transients. With warm-restart schedules (SGDR), one typically does not early-stop within a restart cycle.
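The effect of a transient bump on the stopping decision can be illustrated with a small sketch. The helper `stop_index` and the loss values are hypothetical; the curve is invented to mimic a brief worsening followed by further improvement.

```python
def stop_index(val_losses, patience, min_delta=0.0):
    """Return the index at which patience-based stopping fires, or None."""
    best = float("inf")
    wait = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return i
    return None


# Hypothetical curve: a transient bump at indices 3-4, then further improvement.
val_losses = [0.80, 0.60, 0.50, 0.54, 0.52, 0.45, 0.40, 0.38]

short = stop_index(val_losses, patience=2)  # fires during the bump
long = stop_index(val_losses, patience=3)   # survives the bump

print("patience=2 stops at:", short)
print("patience=3 stops at:", long)
```

With the shorter patience the run halts during the bump and never reaches the lower losses that follow, so patience should be at least as long as the transients the schedule is expected to produce.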

For very large models trained for a fixed compute budget (e.g. Chinchilla-scale LLMs), early stopping is replaced by fixed-step training: the scaling laws dictate the optimal token count for a given parameter count, and one trains exactly to that budget.

Modern usage

Early stopping remains almost universally employed in classical deep learning (CNNs for vision, fine-tuning of pretrained models, tabular models) and is particularly valuable when combined with learning-rate schedules that anneal smoothly. In frontier LLM pretraining it is less central, because those runs are scaled to a fixed token budget, but it returns for fine-tuning, where overfitting on small datasets is the norm.

Figure (interactive on the original page): Overfitting and early stopping. Training loss keeps falling. Validation loss bottoms out, then rises. The gap is overfitting.

Related terms: Regularisation, Overfitting, Machine Learning, Gradient Descent, Scaling Laws
