Glossary

Accuracy, Precision, Recall and F1

Also known as: F1-score, precision-recall, classification metrics

Classification performance is summarised by several complementary metrics, each emphasising a different aspect. Accuracy is the fraction of predictions that are correct: $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$. It is intuitive but misleading when classes are imbalanced: a model that always predicts "not spam" on a dataset where 99% of emails are legitimate achieves 99% accuracy yet is useless.
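The imbalance pitfall can be made concrete with a tiny sketch; the dataset below is hypothetical, chosen to mirror the 99%-legitimate example above.

```python
# Sketch: accuracy can be misleading on imbalanced data.
# Hypothetical dataset: 99 legitimate emails (label 0), 1 spam email (label 1).
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # a trivial model that always predicts "not spam"

# Accuracy = fraction of predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, despite the model never catching a single spam email
```

The model scores 0.99 while having zero recall on the spam class, which is exactly why the metrics below are needed.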

Precision is the fraction of positive predictions that are correct: $\text{precision} = \frac{TP}{TP + FP}$. Recall (also called sensitivity or true-positive rate) is the fraction of actual positives that the model catches: $\text{recall} = \frac{TP}{TP + FN}$. Precision answers "of the ones I flagged, how many were right?"; recall answers "of the ones that matter, how many did I find?". There is almost always a tradeoff: a stricter threshold increases precision but decreases recall.
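The two definitions translate directly into code; this is a minimal sketch with a hypothetical toy example, counting TP, FP, and FN for the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn

# Hypothetical toy labels: 3 actual positives, 3 actual negatives.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]  # misses one positive, wrongly flags one negative

tp, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # TP / (TP + FP) = 2/3: of 3 flagged, 2 were right
recall = tp / (tp + fn)     # TP / (TP + FN) = 2/3: of 3 positives, 2 were found
```

Moving the decision threshold trades one against the other: flagging fewer items tends to raise precision and lower recall.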

The F1 score is the harmonic mean of precision and recall: $F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The harmonic mean is dominated by the smaller of the two, so a high F1 requires both to be high; this makes it a better single-number summary than accuracy for imbalanced problems. Macro-averaged F1 computes F1 per class and takes the unweighted mean; weighted F1 weights each class's F1 by its frequency (support). The confusion matrix tabulates all four outcome types (TP, TN, FP, FN) and provides the complete picture. The ROC curve and AUC (area under the ROC curve) provide threshold-independent summaries, and the precision-recall curve is more informative than ROC when classes are severely imbalanced.
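The averaging variants can be sketched in a few lines; the per-class scores and class frequencies below are hypothetical, chosen so the two averages visibly diverge.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-class scores on an imbalanced two-class problem.
p0, r0 = 0.95, 0.90   # majority class: strong precision and recall
p1, r1 = 0.50, 0.25   # minority class: much weaker
f1_per_class = [f1(p0, r0), f1(p1, r1)]

# Macro F1: unweighted mean over classes, so the weak minority class drags it down.
macro_f1 = sum(f1_per_class) / len(f1_per_class)

# Weighted F1: weights each class's F1 by its support (class frequency),
# so the dominant class masks the minority class's poor score.
support = [990, 10]
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)
```

With these numbers the weighted F1 sits near the majority class's score while the macro F1 is pulled much lower, which is why macro averaging is often preferred when minority-class performance matters.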

Related terms: AUC

Discussed in:

Also defined in: Textbook of AI