AUC, Glossary, Textbook of AI

AUC stands for Area Under the Curve and almost always refers to the area under the Receiver Operating Characteristic (ROC) curve, which plots the true-positive rate (sensitivity, recall) against the false-positive rate ($1 - \text{specificity}$) as the classification threshold varies from $-\infty$ to $+\infty$.

Definition

For a classifier producing scores $s(\mathbf{x})$ and a threshold $\tau$:

$$\mathrm{TPR}(\tau) = P(s(\mathbf{x}) > \tau \mid y = 1), \qquad \mathrm{FPR}(\tau) = P(s(\mathbf{x}) > \tau \mid y = 0)$$

The ROC curve is the parametric plot $(\mathrm{FPR}(\tau), \mathrm{TPR}(\tau))$. AUC-ROC is the area under this curve:

$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(u)) \, du$$

AUC ranges from $0$ to $1$. A value of $0.5$ corresponds to random guessing (the diagonal line) and $1.0$ to perfect ranking (the curve passes through the top-left corner). An AUC below $0.5$ means the classifier ranks worse than random; inverting its scores turns an AUC of $a$ into $1 - a > 0.5$.
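The definition above can be sketched in a few lines of plain Python: sweep the threshold from high to low, record each $(\mathrm{FPR}, \mathrm{TPR})$ point, and integrate with the trapezoid rule. The function names `roc_points` and `auc_trapezoid` and the toy scores are illustrative assumptions, not part of the original text; in practice a library routine would normally be used.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort by descending score so each step admits the next-highest example(s).
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with tied scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): 4 positives and 4 negatives.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
auc = auc_trapezoid(roc_points(scores, labels))
print(auc)
```

On this toy data, 13 of the 16 positive–negative pairs are ordered correctly, so the area comes out at $13/16 = 0.8125$.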

Probabilistic interpretation

AUC has a clean and famous interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.

$$\mathrm{AUC} = P\!\left( s(\mathbf{x}^+) > s(\mathbf{x}^-) \right)$$

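This makes AUC equal to the Mann–Whitney U statistic divided by the number of positive–negative pairs, $U / (n_+ n_-)$, where $n_+$ and $n_-$ are the numbers of positive and negative examples; the same statistic underlies the Wilcoxon rank-sum test. When scores can tie, each tied pair is counted as one half. A classifier with AUC $0.8$ correctly ranks $80$% of positive–negative pairs.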
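As a rough check of this equivalence, the sketch below estimates AUC two ways: by counting correctly ordered positive–negative pairs directly, and from the rank sum of the positives (the Mann–Whitney route). The function names and the toy data are assumptions made for illustration, and tied pairs are counted as one half, which is the usual convention.

```python
def auc_pairwise(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_rank_sum(scores, labels):
    """AUC from the Mann-Whitney U statistic of the positive class."""
    # Assign 1-based ranks in ascending score order; ties share the mean rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        mean_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = mean_rank
        i = j
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Toy data (hypothetical) with two tied positive-negative pairs.
scores = [0.9, 0.8, 0.8, 0.6, 0.5, 0.5, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(auc_pairwise(scores, labels), auc_rank_sum(scores, labels))
```

The pairwise count is quadratic in the number of examples, while the rank-sum route only needs a sort, which is why the latter is the more practical of the two on large datasets.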

Why it is popular

AUC has several attractive properties:

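Threshold-independent: summarises performance across all possible operating points, not just one.
Scale-invariant: only the rank order of scores matters, so any strictly increasing transformation of the scores leaves AUC unchanged; calibration does not affect it.
Class-proportion robust: TPR and FPR are each computed within a single class, so the ROC curve does not change with the class ratio as long as the score distribution within each class stays the same; useful when deployment class balance differs from training.
Interpretable: the ranking probability is intuitive even to non-specialists.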
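The scale-invariance property can be illustrated with a small sketch: applying strictly increasing transformations to the scores changes their values but not their order, so the pairwise AUC estimate is unchanged. The helper `auc` and the toy data below are assumptions for illustration only.

```python
import math


def auc(scores, labels):
    """Fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

base = auc(scores, labels)
# Affine rescaling with a positive slope preserves the ordering.
rescaled = auc([10.0 * s - 3.0 for s in scores], labels)
# A logistic squashing is also strictly increasing.
squashed = auc([1.0 / (1.0 + math.exp(-s)) for s in scores], labels)

print(base, rescaled, squashed)  # all three values coincide
```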

Caveats and alternatives

AUC can mislead under severe class imbalance. When positives are rare (say $1$ in $1000$), the ROC curve is dominated by the large negative class. A classifier may achieve AUC $0.95$ while still producing mostly false positives among its top predictions, because moving from FPR $0$ to FPR $0.01$ on a $1{,}000{,}000$-strong negative class adds $10{,}000$ false positives, which outnumber the roughly $1{,}000$ positives in the data by ten to one.
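The arithmetic behind this caveat can be worked through directly. The sketch below uses the counts from the paragraph above ($1{,}000{,}000$ negatives, about $1{,}000$ positives, FPR $0.01$); the true-positive rate of $0.9$ at that operating point is a hypothetical value chosen only for illustration, as is the helper name `precision_at`.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by one ROC operating point and the class counts."""
    true_pos = tpr * n_pos
    false_pos = fpr * n_neg
    return true_pos / (true_pos + false_pos)


n_neg = 1_000_000
n_pos = 1_000        # roughly 1 positive per 1000 examples
assumed_tpr = 0.9    # hypothetical recall at the chosen threshold
fpr = 0.01

false_pos = fpr * n_neg
imbalanced = precision_at(assumed_tpr, fpr, n_pos, n_neg)
# Same ROC operating point, but with as many negatives as positives.
balanced = precision_at(assumed_tpr, fpr, n_pos, n_pos)

print(round(false_pos), round(imbalanced, 3), round(balanced, 3))
```

Under these assumptions, fewer than one in ten flagged examples is a true positive in the imbalanced case, whereas the same ROC point on balanced classes would give a precision near $0.99$.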

In imbalanced settings, the area under the precision–recall curve (AUC-PR or average precision) is more informative, since it focuses on the positive class and is sensitive to the absolute number of false positives. Reporting both, alongside precision, recall, and F1 at a chosen operating threshold, gives the most complete picture. Calibration metrics such as Brier score and reliability diagrams complement AUC by checking that predicted probabilities match observed frequencies, which AUC does not test.
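A minimal sketch of two of the complementary metrics mentioned above, average precision and the Brier score, written in plain Python. The function names and toy data are assumptions for illustration, and the average-precision helper assumes there are no tied scores; library implementations handle ties and edge cases more carefully.

```python
def average_precision(scores, labels):
    """Mean of precision@k over the ranks k at which a positive appears.

    Assumes no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision among the top k predictions
    return total / n_pos


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Toy data (hypothetical); scores are treated as predicted probabilities.
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Squaring keeps the ranking but distorts the probabilities.
distorted = [p * p for p in probs]

ap_original = average_precision(probs, labels)
ap_distorted = average_precision(distorted, labels)
brier_original = brier_score(probs, labels)
brier_distorted = brier_score(distorted, labels)

print(ap_original, ap_distorted)        # identical: ranking is unchanged
print(brier_original, brier_distorted)  # different: calibration changed
```

The comparison shows why calibration metrics are reported separately: a monotone distortion of the probabilities is invisible to ranking-based metrics but changes the Brier score.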

AUC is a widely used headline metric in clinical machine-learning publications, Kaggle binary-classification competitions, information-retrieval ranking, and credit scoring.

Related terms: AUC-ROC, F1 Score

Discussed in:

Chapter 7: Supervised Learning, Evaluation Metrics
