Glossary

Data Drift

Data drift is a change in the statistical distribution of input features over time, so that the data a model encounters in deployment no longer matches the data it was trained on. A fraud detection model trained on last year's transaction patterns may see different patterns this year as fraudsters evolve their strategies. A demand forecasting model trained on pre-pandemic data may fail catastrophically when consumer behaviour changes overnight. A vision model trained in sunny California may struggle with rain-soaked British motorways.

Data drift is distinct from concept drift, which refers to a change in the relationship between inputs and outputs (the underlying function $P(y \mid x)$ changes) even if input distributions stay the same. In practice both occur, and both degrade deployed model performance. Detecting drift requires monitoring: track statistical properties of incoming data and compare them against baselines established during training. Standard tests include the Kolmogorov-Smirnov test for continuous features, the chi-squared test for categorical features, and the population stability index (PSI).
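As a minimal sketch of the tests mentioned above, the snippet below compares a baseline (training-time) sample of one continuous feature against a shifted production sample, using SciPy's two-sample Kolmogorov-Smirnov test and a hand-rolled PSI with quantile bins. The synthetic data, bin count, and the small flooring constant are illustrative choices, not part of any standard.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Baseline: feature values observed at training time.
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)
# Incoming: the same feature in production, with a shifted mean (simulated drift).
incoming = rng.normal(loc=0.5, scale=1.0, size=5000)

# Kolmogorov-Smirnov two-sample test for a continuous feature.
ks_stat, p_value = stats.ks_2samp(baseline, incoming)

def psi(expected, actual, bins=10):
    """Population stability index between two samples of one feature."""
    # Bin edges come from quantiles of the baseline distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Widen the outer edges so out-of-range production values are still counted.
    edges[0] = min(edges[0], actual.min()) - 1e-9
    edges[-1] = max(edges[-1], actual.max()) + 1e-9
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

print(f"KS={ks_stat:.3f}, p={p_value:.3g}, PSI={psi(baseline, incoming):.3f}")
```

With a genuine shift and large samples, the KS p-value collapses toward zero and the PSI rises well above the near-zero value it takes when the two samples come from the same distribution.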

Remediation strategies include automated retraining on recent data, online learning with continuous updates, ensembling with models trained at different time horizons, and falling back to human review or simpler baselines when drift is severe. The assumption that training data is representative of deployment data is foundational to supervised learning but is rarely exactly true in practice. Monitoring for drift is one of the most important MLOps practices, and the ability to detect, diagnose, and respond to drift distinguishes robust production ML systems from fragile ones that silently degrade over time.

Related terms: MLOps

Discussed in:

Also defined in: Textbook of AI