Glossary

Training, Validation, and Test Sets

A machine learning model's data is typically divided into three disjoint sets. The training set is used to fit the model's parameters. The validation set (sometimes called the development or dev set) is used to tune hyperparameters, select between competing models, and decide when to stop training. The test set is used exactly once, at the very end, to provide an unbiased estimate of generalisation performance.
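The three-way split can be sketched as follows. This is an illustrative helper using NumPy on synthetic data; the function name and fractions are assumptions, and in practice a library routine such as scikit-learn's `train_test_split` (applied twice) does the same job.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve off disjoint validation and test sets.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # one random shuffle of all indices
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]                # held out, used exactly once
    val_idx = idx[n_test:n_test + n_val]   # for tuning and model selection
    train_idx = idx[n_test + n_val:]       # for fitting parameters
    return (X[train_idx], y[train_idx],
            X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])

# Toy dataset: 100 examples, so a 70/15/15 split.
X = np.arange(100).reshape(100, 1)
y = np.arange(100)
Xtr, ytr, Xv, yv, Xte, yte = train_val_test_split(X, y)
```

Because the indices come from a single permutation, the three sets are disjoint by construction and together cover every example exactly once.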

The strict separation matters because any data that influences model choice—feature selection, architecture search, hyperparameter tuning, early stopping decisions—is, in a subtle sense, being "used" to train the model. If the same data is used for both tuning and final evaluation, the resulting performance estimate is optimistically biased because the model has been indirectly optimised for that particular sample. The test set must be locked away and untouched during development.

Common splits are 60/20/20 or 70/15/15 for small datasets, or 98/1/1 for very large datasets where even 1% is millions of examples. Cross-validation can replace a dedicated validation set in settings where data is scarce. Data leakage—the accidental inclusion of test-set information in training—is a common and pernicious error. Classic examples include normalising features using statistics computed from the full dataset (including test), oversampling or augmenting before splitting (so near-duplicates land on both sides of the split), or including duplicate examples across splits. Leakage produces deceptively good numbers during development and catastrophic failures in production.
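The normalisation example of leakage can be demonstrated numerically. The sketch below uses synthetic data with a deliberately shifted test distribution (an assumption made for illustration): normalising with statistics from the full dataset partially absorbs the test set's shift, making the test data look more like the training data than it really is.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=0.0, scale=1.0, size=800)
test = rng.normal(loc=0.5, scale=1.0, size=200)   # shifted, as deployment data often is

# Leaky: mean/std computed over train AND test together.
full = np.concatenate([train, test])
leaky_test = (test - full.mean()) / full.std()

# Correct: statistics fit on the training set only,
# then applied unchanged to the test set.
mu, sigma = train.mean(), train.std()
clean_test = (test - mu) / sigma

# The leaky version understates how far the test data
# sits from the training distribution.
print(abs(leaky_test.mean()), abs(clean_test.mean()))
```

The leaky test mean sits closer to zero than the correctly normalised one, which is exactly the optimistic bias the paragraph above describes. The same fit-on-train-only discipline applies to any preprocessing step with learned parameters: scalers, imputers, vocabulary construction, and feature selection.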

Related terms: Cross-Validation, Overfitting

Also defined in: Textbook of AI