MLOps (Machine Learning Operations) is the engineering discipline that addresses the gap between a trained model in a research notebook and a reliable, scalable, maintainable production system. It adapts DevOps principles—continuous integration, continuous deployment, automated testing, monitoring—to the unique challenges of machine learning, which include data versioning, model training reproducibility, and the non-stationary nature of real-world data.
A typical MLOps pipeline includes:
- model serving infrastructure (TensorFlow Serving, TorchServe, Triton) that handles prediction requests with appropriate latency and throughput
- CI/CD pipelines that automatically retrain, evaluate, and deploy models when data or code changes
- feature stores (Feast, Tecton) that compute features consistently between training and serving, avoiding training-serving skew
- experiment tracking (MLflow, Weights & Biases) for reproducibility
- model registries for versioning and governance
- monitoring systems that detect data drift and concept drift
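The drift monitoring mentioned above often reduces to comparing the distribution of live inputs against a training-time baseline. Below is a minimal, dependency-free Python sketch of the Population Stability Index (PSI), one common heuristic for flagging data drift on a single numeric feature; the function name, bin count, smoothing, and thresholds are illustrative choices, not a standard API.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline (training) sample and a live (serving) sample.

    Conventional reading (a rule of thumb, not a standard):
    PSI < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Floor empty bins at one count to avoid log(0) below.
        return [max(c, 1) / len(sample) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

In practice a monitoring system would compute a statistic like this per feature on a schedule and alert when it crosses a threshold; libraries used in production typically offer richer tests (e.g. Kolmogorov-Smirnov) and handle categorical features as well.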
Sculley et al.'s 2015 paper "Hidden Technical Debt in Machine Learning Systems" highlighted that the model itself is often a small fraction of the total system, surrounded by vast infrastructure for data collection, feature extraction, configuration, monitoring, and testing—each of which can accrue debt that makes the system increasingly brittle. Successful MLOps requires close collaboration between data scientists, data engineers, ML engineers, software engineers, and site reliability engineers. A mature MLOps practice treats ML systems not as one-off models but as living products that require ongoing attention, iteration, and governance throughout their lifecycles.
Related terms: Data Drift
Discussed in:
- Chapter 17: Applications — Deployment & MLOps
Also defined in: Textbook of AI