Also known as: F-score, F-measure
For binary classification, define:
Precision $= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$, fraction of predicted positives that are actually positive.
Recall (sensitivity, true positive rate) $= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$, fraction of actual positives correctly predicted.
The F1 score is their harmonic mean:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
F1 ranges from 0 (no positives correctly predicted) to 1 (perfect precision and recall). The harmonic mean penalises imbalance: if either precision or recall is low, F1 is low.
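A minimal sketch of these definitions in Python; the confusion-matrix counts are made up for illustration:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw confusion-matrix counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean penalises imbalance: perfect precision cannot
# compensate for poor recall.
print(f1_from_counts(tp=90, fp=10, fn=10))  # precision 0.9, recall 0.9 -> F1 = 0.90
print(f1_from_counts(tp=10, fp=0, fn=90))   # precision 1.0, recall 0.1 -> F1 ~ 0.18
```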
The general $F_\beta$ score weights recall $\beta$ times as much as precision:
$$F_\beta = (1 + \beta^2) \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \mathrm{precision} + \mathrm{recall}}$$
$\beta = 1$ recovers $F_1$. $\beta = 2$ ($F_2$) emphasises recall (a missed positive is treated as twice as costly as a false alarm) and is common in medical screening. $\beta = 0.5$ ($F_{0.5}$) emphasises precision.
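A sketch of the $F_\beta$ formula; `fbeta_from_pr` and the precision/recall values are illustrative:

```python
def fbeta_from_pr(precision: float, recall: float, beta: float) -> float:
    """General F_beta: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.9  # low precision, high recall
print(fbeta_from_pr(p, r, beta=1.0))  # ~0.64 (harmonic mean, F1)
print(fbeta_from_pr(p, r, beta=2.0))  # ~0.78 -> F2 rewards the high recall
print(fbeta_from_pr(p, r, beta=0.5))  # ~0.55 -> F0.5 penalises the low precision
```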
Multiclass extensions:
- Macro-F1: compute $F_1$ separately for each class and average. Equal weight per class, regardless of class frequency.
- Micro-F1: aggregate TP, FP, FN across all classes, then compute a single $F_1$. For single-label multiclass classification, micro-F1 equals overall accuracy (demonstrated in the sketch after this list).
- Weighted F1: macro-F1 with each class weighted by its support (number of true instances).
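A minimal sketch of the three averaging modes, assuming scikit-learn and toy labels:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

print(f1_score(y_true, y_pred, average="macro"))     # equal weight per class
print(f1_score(y_true, y_pred, average="weighted"))  # macro weighted by class support
print(f1_score(y_true, y_pred, average="micro"))     # pooled TP/FP/FN -> 0.7
print(accuracy_score(y_true, y_pred))                # 0.7, same as micro-F1
```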
When to use F1:
- Imbalanced classes, where accuracy is misleading (e.g. fraud detection, rare-disease diagnosis).
- Settings where both false positives and false negatives matter.
- Information retrieval and many other binary-prediction tasks.
When NOT to use F1:
- Severe class imbalance where one type of error is far more costly: use AUC-ROC or precision-recall curves directly.
- Multi-label classification: use macro-/micro-/weighted variants explicitly.
- Need for calibrated probabilities: F1 (like any threshold metric) requires choosing a decision threshold, usually 0.5, and tells you nothing about calibration.
F1 vs AUC-ROC:
- F1 evaluates a single threshold.
- AUC-ROC evaluates all thresholds simultaneously.
- For imbalanced data, AUC-PR (precision-recall AUC) is often more informative than AUC-ROC.
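A sketch contrasting threshold and ranking metrics, assuming scikit-learn; the synthetic scores and thresholds are arbitrary:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                              # toy binary labels
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 200), 0, 1)  # noisy scores

print(roc_auc_score(y_true, scores))            # threshold-free: ranking quality
print(average_precision_score(y_true, scores))  # AUC-PR, often better under imbalance
for t in (0.3, 0.5, 0.7):                       # F1 changes with the chosen cutoff
    print(t, f1_score(y_true, (scores >= t).astype(int)))
```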
F1 trick: the F1 score is symmetric in precision and recall; the two can be swapped without changing F1. This is not true of $F_\beta$ for $\beta \neq 1$ (see the check below).
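A quick numeric check, reusing the `fbeta_from_pr` sketch from above:

```python
# F1 is unchanged when precision and recall swap; F2 is not.
print(fbeta_from_pr(0.9, 0.3, beta=1.0), fbeta_from_pr(0.3, 0.9, beta=1.0))  # 0.45, 0.45
print(fbeta_from_pr(0.9, 0.3, beta=2.0), fbeta_from_pr(0.3, 0.9, beta=2.0))  # ~0.35, ~0.64
```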
Related terms: Precision (classification), Recall (classification), AUC-ROC
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics