Anomaly Detection—also called outlier or novelty detection—is the task of identifying data points that deviate significantly from the expected pattern. Anomalies may represent genuine rare events (fraud, intrusions, equipment failures), data quality issues, or scientifically interesting phenomena (new particle signatures, astronomical objects). The challenge is that anomalies are, by definition, infrequent and diverse: they make up a tiny fraction of data and may manifest in unanticipated ways, so supervised approaches are often impractical.
Statistical approaches model the distribution of normal data and flag low-probability points. Mahalanobis distance generalises the z-score to multivariate Gaussian data; Gaussian mixture models handle multimodal data; kernel density estimation provides a non-parametric alternative. The Isolation Forest algorithm takes a different approach: random binary trees isolate anomalies, which tend to have shorter average path lengths to their leaves because they are easier to separate from the rest.
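As a minimal sketch of the statistical approach, the Mahalanobis distance can be computed directly with numpy, assuming a single Gaussian fit to the normal data (the function name and toy data below are illustrative, not from the text):

```python
import numpy as np

def mahalanobis_scores(X, X_train):
    """Score each row of X by its Mahalanobis distance to the
    training data, under a single multivariate-Gaussian fit."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)       # sample covariance
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # d^2 = (x - mu)^T Sigma^{-1} (x - mu), computed row-wise
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(d2)

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))  # inliers near the origin
outlier = np.array([[8.0, 8.0]])              # a clear anomaly
scores = mahalanobis_scores(np.vstack([normal[:3], outlier]), normal)
# the outlier's score dwarfs the inliers' scores
```

Thresholding these scores (e.g. at a high quantile of the training scores) turns them into anomaly labels; the same fit-then-score pattern underlies the mixture-model and kernel-density variants.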
Proximity-based methods like Local Outlier Factor (LOF) compare each point's local density to that of its neighbours, flagging points that sit in sparser regions than the points around them. One-class SVM learns a boundary enclosing the normal data. Autoencoders provide a deep learning approach: a model trained only on normal data reconstructs normal inputs well but fails on anomalies, so reconstruction error serves as the anomaly score. Evaluation is difficult because anomalies are rare: when they make up less than 1% of the data, a trivial model that labels everything normal already exceeds 99% accuracy, so AUROC and AUPRC are far more informative metrics.
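The reconstruction-error idea can be sketched without a neural network by using PCA as a linear stand-in for an autoencoder's encode/decode bottleneck (this substitution, and all names and toy data below, are illustrative assumptions, not the chapter's implementation):

```python
import numpy as np

def reconstruction_scores(X, X_train, k=1):
    """Anomaly score = error after projecting onto the top-k principal
    components of the normal data and reconstructing -- a linear
    analogue of an autoencoder's bottleneck."""
    mu = X_train.mean(axis=0)
    # principal directions of the normal (training) data
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:k].T                  # d x k projection ("encoder")
    Z = (X - mu) @ V              # encode into k dimensions
    X_hat = Z @ V.T + mu          # decode back ("decoder")
    return np.linalg.norm(X - X_hat, axis=1)

rng = np.random.default_rng(1)
t = rng.normal(size=(500, 1))
normal = np.hstack([t, 2 * t]) + 0.05 * rng.normal(size=(500, 2))
anomaly = np.array([[2.0, -4.0]])  # far from the normal subspace
scores = reconstruction_scores(np.vstack([normal[:3], anomaly]), normal)
# the off-subspace point reconstructs poorly, so it scores highest
```

A real autoencoder replaces the linear projection with learned nonlinear encoder and decoder networks, but the scoring logic, reconstruction error on data the model has never needed to represent, is the same.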
Related terms: Autoencoder
Discussed in:
- Chapter 8: Unsupervised Learning — Anomaly Detection
Also defined in: Textbook of AI