Glossary

AUC-ROC

Also known as: AUROC, ROC AUC, C-statistic

The ROC curve plots true positive rate (TPR, recall) against false positive rate (FPR) for a binary classifier across all decision thresholds:

  • TPR = TP / (TP + FN), fraction of positives correctly identified.
  • FPR = FP / (FP + TN), fraction of negatives incorrectly identified as positive.

AUC-ROC is the area under this curve. It ranges from 0 to 1: 0.5 corresponds to a random classifier, 1.0 to a perfect one, and values below 0.5 indicate a systematically inverted ranking.
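A minimal sketch of both quantities with scikit-learn; the labels and scores below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Invented ground-truth labels and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# roc_curve sweeps the distinct scores as decision thresholds and
# returns the resulting (FPR, TPR) pairs that trace out the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print("AUC-ROC =", roc_auc_score(y_true, y_score))
```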

Probabilistic interpretation: AUC-ROC equals the probability that a randomly chosen positive example receives a higher score from the classifier than a randomly chosen negative example:

$$\mathrm{AUC} = P(\hat y_+ > \hat y_- )$$

where $\hat y_+$ is the model score on a randomly drawn positive and $\hat y_-$ on a negative.
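This equivalence is easy to verify by brute force: count, over every positive-negative pair, how often the positive outscores the negative (ties as half). The data here are made up for the check:

```python
import itertools
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.6, 0.4, 0.9, 0.3, 0.6])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Estimate P(score_pos > score_neg) exhaustively, ties counting 1/2.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in itertools.product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc, roc_auc_score(y_true, y_score))  # same value
```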

Computation: rank all examples by predicted score; for each positive example count the negatives it outranks (ties count half); the average over all positive-negative pairs is the AUC. Equivalently, $\mathrm{AUC} = \frac{\bar R_+ - (n_+ + 1)/2}{n_-}$, where $\bar R_+$ is the average rank of the positives and $n_+$, $n_-$ are the numbers of positives and negatives.
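The rank formula in code, using scipy's average-rank handling of ties and checked against scikit-learn (scores invented):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.5, 0.5, 0.9, 0.3, 0.7, 0.6, 0.2])

ranks = rankdata(y_score)          # 1-based ranks; ties get the average rank
n_pos = int((y_true == 1).sum())
n_neg = int((y_true == 0).sum())

# AUC = (mean rank of positives - (n_pos + 1) / 2) / n_neg
auc = (ranks[y_true == 1].mean() - (n_pos + 1) / 2) / n_neg

print(auc, roc_auc_score(y_true, y_score))  # both 0.96875 here
```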

Properties:

  • Threshold-independent: assesses the model's ranking quality across all operating points.
  • Insensitive to class balance: TPR and FPR are each computed within a single class, so an AUC of 0.8 means the same regardless of the positive class fraction (demonstrated in the sketch after this list).
  • Rank-based: equals the Mann-Whitney U statistic divided by the number of positive-negative pairs $n_+ n_-$, and is thus equivalent to the Wilcoxon rank-sum test; this is the probabilistic interpretation above in test-statistic form.
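The class-balance property can be checked empirically: subsampling the negatives changes prevalence tenfold but leaves the expected AUC untouched. The score distributions below are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives drawn from a higher-mean distribution.
pos = rng.normal(1.0, 1.0, size=2_000)
neg = rng.normal(0.0, 1.0, size=20_000)

def auc(pos_scores, neg_scores):
    y = np.r_[np.ones(len(pos_scores)), np.zeros(len(neg_scores))]
    s = np.r_[pos_scores, neg_scores]
    return roc_auc_score(y, s)

print(auc(pos, neg))                     # ~0.76 at 10:1 negatives
print(auc(pos, rng.choice(neg, 2_000)))  # ~0.76 at 1:1 as well
```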

AUC-ROC vs AUC-PR:

For highly imbalanced data (e.g. 1% positives), AUC-ROC can be misleadingly optimistic: the FPR denominator is dominated by the abundant negatives, so even a flood of false positives barely moves the ROC curve. AUC-PR (area under the precision-recall curve) is then preferred because precision is computed over the predicted positives and so exposes weak positive-class performance.
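A synthetic illustration of the divergence (counts and distributions invented; exact numbers will vary):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)

# 1% positives; score overlap means false positives swamp the rare
# true positives at any usable threshold.
n_pos, n_neg = 100, 9_900
y = np.r_[np.ones(n_pos), np.zeros(n_neg)]
s = np.r_[rng.normal(2.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)]

print(roc_auc_score(y, s))            # ~0.92: looks excellent
print(average_precision_score(y, s))  # roughly 0.3: reveals the problem
```

(average_precision_score is scikit-learn's standard estimator of AUC-PR.)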

Multi-class AUC:

One-vs-rest (OvR): compute binary AUC for each class against all others; average.

One-vs-one (OvO): compute AUC for each pair of classes; average.
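scikit-learn implements both schemes directly; a sketch on synthetic data (fitting and scoring on the same data here purely to keep the example short):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)

# Macro-averaged one-vs-rest and one-vs-one AUC from per-class probabilities.
print(roc_auc_score(y, proba, multi_class="ovr", average="macro"))
print(roc_auc_score(y, proba, multi_class="ovo", average="macro"))
```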

Practical considerations:

  • Choose AUC-ROC for ranking tasks, balanced data, or when both classes matter equally.
  • Choose AUC-PR for highly imbalanced data, anomaly detection, or when only positive performance matters.
  • Choose F1 when evaluating a single operating point (a fixed decision threshold).
  • Calibration is a separate question: AUC is invariant under any monotone transform of the scores, so it says nothing about whether predicted probabilities are well calibrated (see the sketch after this list).
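The calibration point in concrete form: squashing scores toward 0.5 is monotone, so the AUC is unchanged while a proper calibration measure (the Brier score here) worsens. The data are synthetic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=1_000)
p = np.clip(y * 0.7 + rng.normal(0.15, 0.2, size=1_000), 0.01, 0.99)

# Monotone distortion: ranking preserved, probabilities miscalibrated.
p_distorted = 0.5 + 0.2 * (p - 0.5)

print(roc_auc_score(y, p), roc_auc_score(y, p_distorted))       # identical
print(brier_score_loss(y, p), brier_score_loss(y, p_distorted)) # worse
```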

In medical AI specifically, AUC-ROC is the dominant metric for diagnostic/screening models, often paired with sensitivity/specificity at clinically chosen thresholds.

Hanley-McNeil method: a closed-form standard error for the AUC, used to build confidence intervals and significance tests.
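As usually stated, with $A$ the observed AUC and $n_+$, $n_-$ the class counts:

$$\mathrm{SE}(A) = \sqrt{\frac{A(1-A) + (n_+ - 1)(Q_1 - A^2) + (n_- - 1)(Q_2 - A^2)}{n_+ n_-}}, \qquad Q_1 = \frac{A}{2 - A}, \quad Q_2 = \frac{2A^2}{1 + A}$$

An approximate 95% confidence interval is then $A \pm 1.96\,\mathrm{SE}(A)$.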

DeLong test: compares the AUCs of two models' ROC curves on the same test set, accounting for the correlation induced by scoring the same examples.
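scikit-learn has no built-in DeLong test (R users have pROC::roc.test); a paired bootstrap over the shared test set is a simple substitute. A minimal sketch, assuming two hypothetical score arrays s1 and s2 for the same examples:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def paired_bootstrap_auc_diff(y, s1, s2, n_boot=2_000, seed=0):
    """95% CI for AUC(model 1) - AUC(model 2) on a shared test set."""
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample examples, keeping pairs
        if len(np.unique(y[idx])) < 2:    # skip resamples missing a class
            continue
        diffs.append(roc_auc_score(y[idx], s1[idx]) -
                     roc_auc_score(y[idx], s2[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# An interval excluding 0 indicates a significant AUC difference at ~5%.
```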

Related terms: F1 Score, Precision (classification), Recall (classification)
