8.15 Anomaly detection
Anomaly detection finds points that differ significantly from the rest. Outliers might be fraud, intrusions, defects, measurement errors, or scientific discoveries.
8.15.1 Statistical methods
Univariate Z-scores. Flag points with $\lvert(x-\hat\mu)/\hat\sigma\rvert > 3$. Useful only for a single, roughly Gaussian feature.
Mahalanobis distance. For multivariate Gaussian, $D(\mathbf{x})=\sqrt{(\mathbf{x}-\hat{\boldsymbol{\mu}})^{\top}\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{x}-\hat{\boldsymbol{\mu}})}$ has $D^2\sim\chi^2_d$ under the null; threshold accordingly. Robust covariance estimators, such as the minimum covariance determinant (MCD), help when contamination is present.
GMMs and KDE. For non-Gaussian densities, model with a GMM or KDE; flag low-likelihood points.
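A minimal sketch of all three flags with NumPy and scikit-learn follows; the thresholds (three sigma, the 97.5th $\chi^2$ percentile, a 1% likelihood quantile) and the three-component mixture are illustrative choices, not canonical values.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # mostly "normal" data
X[:10] += 6.0                                  # a few planted anomalies

# Univariate z-score on a single feature.
x = X[:, 0]
z_flag = np.abs((x - x.mean()) / x.std()) > 3

# Mahalanobis distance with a robust (MCD) covariance estimate;
# threshold the squared distance at a chi-squared quantile.
mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                        # squared distances
maha_flag = d2 > chi2.ppf(0.975, df=X.shape[1])

# GMM density: flag the lowest-likelihood points.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
log_lik = gmm.score_samples(X)                 # per-point log-likelihood
gmm_flag = log_lik < np.quantile(log_lik, 0.01)
```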
8.15.2 Isolation forest
The isolation forest (Liu, Ting and Zhou, 2008) takes a different approach: rather than modelling normal data, it isolates anomalies. The algorithm builds an ensemble of random binary trees, each grown by repeatedly choosing a random feature and a random split value until every point sits in its own leaf. Anomalies, being few and different, tend to end up on shorter paths from the root.
The anomaly score for $\mathbf{x}$ is
$$ s(\mathbf{x}, n) = 2^{-E[h(\mathbf{x})]/c(n)}, $$
where $E[h(\mathbf{x})]$ is the average path length from root to leaf across trees and $c(n) = 2H(n-1) - 2(n-1)/n$, with $H(i)$ the $i$-th harmonic number, is the expected path length in a binary search tree of $n$ points, used as a normalising constant. Scores near 1 indicate anomalies; scores near 0.5 or below indicate normal points.
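A quick numerical check of the score, assuming the paper's default sub-sample size $n = 256$ and the harmonic-number approximation $H(i) \approx \ln i + \gamma$ (Euler's constant):

```python
import numpy as np

def c(n):
    # Expected BST path length: 2*H(n-1) - 2*(n-1)/n,
    # with H(i) approximated by ln(i) + Euler's constant.
    h = np.log(n - 1) + np.euler_gamma
    return 2 * h - 2 * (n - 1) / n

n = 256
print(c(n))                  # ~10.2: expected path length for n points
print(2 ** (-4.0 / c(n)))    # short path, E[h] = 4  -> score ~0.76, anomalous
print(2 ** (-10.0 / c(n)))   # typical path, E[h] = 10 -> score ~0.51, normal
```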
Isolation forest runs in $O(n\log n)$ time, handles high-dimensional data, and makes no distributional assumptions.
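In practice the trees are rarely implemented by hand; a usage sketch with scikit-learn's IsolationForest (note that its score_samples returns the negated score, so lower means more anomalous):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(1000, 5)),   # normal cluster
               rng.normal(6, 1, size=(10, 5))])    # planted anomalies

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = iso.predict(X)            # -1 for anomalies, +1 for inliers
scores = -iso.score_samples(X)     # undo the sign flip: near 1 = anomalous
top10 = np.argsort(scores)[-10:]   # indices of the ten most anomalous points
```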
8.15.3 Local Outlier Factor (LOF)
LOF (Breunig et al., 2000) compares the local density around a point to the density around its $k$-nearest neighbours. Define the local reachability density
$$ \mathrm{lrd}_k(\mathbf{x}) = \left(\frac{\sum_{\mathbf{y}\in N_k(\mathbf{x})}\mathrm{reach\text{-}dist}_k(\mathbf{x},\mathbf{y})}{\lvert N_k(\mathbf{x})\rvert}\right)^{-1}, $$
where $\mathrm{reach\text{-}dist}_k(\mathbf{x},\mathbf{y}) = \max\{d_k(\mathbf{y}), d(\mathbf{x},\mathbf{y})\}$ and $d_k(\mathbf{y})$ is $\mathbf{y}$'s distance to its $k$-th neighbour. Then
$$ \mathrm{LOF}_k(\mathbf{x}) = \frac{\sum_{\mathbf{y}\in N_k(\mathbf{x})}\mathrm{lrd}_k(\mathbf{y})/\mathrm{lrd}_k(\mathbf{x})}{\lvert N_k(\mathbf{x})\rvert}. $$
Values $\gg 1$ indicate that $\mathbf{x}$ sits in a sparser region than its neighbours: a local outlier. LOF can detect points that are unusual relative to their local neighbourhood even when they would not stand out under a global density model.
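A usage sketch with scikit-learn's LocalOutlierFactor; $k = 20$ is the library default, not a universally good choice:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# One tight cluster and one diffuse cluster: a point on the edge of the
# tight cluster can be locally anomalous yet globally unremarkable.
X = np.vstack([rng.normal(0, 0.1, size=(500, 2)),
               rng.normal(5, 2.0, size=(100, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 outliers, +1 inliers
lof_scores = -lof.negative_outlier_factor_   # values >> 1: local outliers
```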
8.15.4 One-class SVM and autoencoder reconstruction error
One-class SVM (Schölkopf et al., 1999) learns a function $f$ that is positive on the support of the data and negative elsewhere, by separating the data from the origin in a kernel feature space with maximal margin. The hyperparameter $\nu \in (0,1]$ is an upper bound on the fraction of training points treated as outliers.
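A usage sketch with scikit-learn's OneClassSVM, assuming roughly 5% contamination (hence $\nu = 0.05$) and an RBF kernel:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))       # training sample, assumed mostly normal

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X)
labels = ocsvm.predict(X)            # -1 outside the learned support, +1 inside
scores = ocsvm.decision_function(X)  # signed margin; negative = anomalous
```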
Autoencoder reconstruction error. Train an autoencoder, with encoder $f$ and decoder $g$, on a sample assumed to be mostly normal. At test time, the reconstruction error $\lVert\mathbf{x}-g(f(\mathbf{x}))\rVert$ is the anomaly score. Effective on image, time-series and tabular data: the network learns the manifold of normal data and fails to reconstruct points off it.
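A minimal sketch in PyTorch (an assumed dependency); the layer widths, bottleneck size and epoch count are illustrative, not tuned:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d, bottleneck=4):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                                 nn.Linear(32, bottleneck))
        self.dec = nn.Sequential(nn.Linear(bottleneck, 32), nn.ReLU(),
                                 nn.Linear(32, d))

    def forward(self, x):
        return self.dec(self.enc(x))

X = torch.randn(1000, 20)            # stand-in for mostly-normal training data
model = AutoEncoder(d=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):                 # train to reconstruct the input
    opt.zero_grad()
    loss = ((model(X) - X) ** 2).mean()
    loss.backward()
    opt.step()

with torch.no_grad():                # per-point reconstruction error as score
    err = ((model(X) - X) ** 2).mean(dim=1)   # large err = likely anomaly
```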
8.15.5 Evaluation
Anomaly detection is fundamentally imbalanced: anomalies might be 0.1% of the data. Accuracy is meaningless ("predict normal" achieves 99.9%). Use AUROC and AUPRC for threshold-independent evaluation; precision at $k$ for ranking quality; cost curves when miss and false-alarm costs are known.
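A sketch of these metrics with scikit-learn; the labels and scores below are placeholders for any detector's output on a 0.1%-anomaly problem:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=bool)
y_true[:10] = True                             # 0.1% anomalies
scores = rng.random(10_000) + 0.5 * y_true     # placeholder detector scores

print(roc_auc_score(y_true, scores))             # AUROC
print(average_precision_score(y_true, scores))   # AUPRC

k = 100
top_k = np.argsort(scores)[-k:]                # k highest-scoring points
print(y_true[top_k].mean())                    # precision at k
```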
In production, the false-positive rate is usually the binding constraint: too many alarms erode trust and overwhelm investigators. Choose your method based on the real cost of misses versus false alarms, the availability of labels and the need for explainable flags.