Also known as: F-score, F-measure
For binary classification, define:
Precision $= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$, fraction of predicted positives that are actually positive.
Recall (sensitivity, true positive rate) $= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$, fraction of actual positives correctly predicted.
The F1 score is their harmonic mean:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
F1 ranges from 0 (no positives correctly predicted) to 1 (perfect precision and recall). The harmonic mean penalises imbalance: if either precision or recall is low, F1 is low.
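A minimal sketch of these definitions in Python; the confusion-matrix counts are made up for illustration:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from raw confusion-matrix counts; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean penalises imbalance: perfect precision cannot
# compensate for poor recall.
print(f1_from_counts(tp=90, fp=10, fn=10))  # precision 0.9, recall 0.9 -> F1 = 0.90
print(f1_from_counts(tp=10, fp=0, fn=90))   # precision 1.0, recall 0.1 -> F1 ~ 0.18
```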
The general $F_\beta$ score weights recall $\beta$ times as much as precision:
$$F_\beta = (1 + \beta^2) \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \mathrm{precision} + \mathrm{recall}}$$
$\beta = 1$ recovers $F_1$. $\beta = 2$ ($F_2$) emphasises recall (a missed positive is treated as twice as costly as a false alarm) and is common in medical screening. $\beta = 0.5$ ($F_{0.5}$) emphasises precision.
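A sketch of the $F_\beta$ formula; `fbeta_from_pr` and the precision/recall values are illustrative:

```python
def fbeta_from_pr(precision: float, recall: float, beta: float) -> float:
    """General F_beta: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.9  # low precision, high recall
print(fbeta_from_pr(p, r, beta=1.0))  # ~0.64 (harmonic mean, F1)
print(fbeta_from_pr(p, r, beta=2.0))  # ~0.78 -> F2 rewards the high recall
print(fbeta_from_pr(p, r, beta=0.5))  # ~0.55 -> F0.5 penalises the low precision
```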
Multiclass extensions:
- Macro-F1: compute $F_1$ separately for each class and average. Equal weight per class, regardless of class frequency.
- Micro-F1: aggregate TP, FP, FN across all classes, then compute a single $F_1$. For single-label multiclass classification, micro-F1 equals overall accuracy (demonstrated in the sketch after this list).
- Weighted F1: macro-F1 with each class weighted by its support (number of true instances).
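A minimal sketch of the three averaging modes, assuming scikit-learn and toy labels:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 2, 2, 2, 0]

print(f1_score(y_true, y_pred, average="macro"))     # equal weight per class
print(f1_score(y_true, y_pred, average="weighted"))  # macro weighted by class support
print(f1_score(y_true, y_pred, average="micro"))     # pooled TP/FP/FN -> 0.7
print(accuracy_score(y_true, y_pred))                # 0.7, same as micro-F1
```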
When to use F1:
- Imbalanced classes, where accuracy is misleading (e.g. fraud detection, rare-disease diagnosis).
- Settings where both false positives and false negatives matter.
- Information retrieval and many other binary-prediction tasks.
When NOT to use F1:
- Severe class imbalance where one type of error is far more costly: use AUC-ROC or precision-recall curves directly.
- Multi-label classification: use macro-/micro-/weighted variants explicitly.
- Need for calibrated probabilities: F1 (like any threshold metric) requires choosing a decision threshold, usually 0.5, and tells you nothing about calibration.
F1 vs AUC-ROC:
- F1 evaluates a single threshold.
- AUC-ROC evaluates all thresholds simultaneously.
- For imbalanced data, AUC-PR (precision-recall AUC) is often more informative than AUC-ROC.
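A sketch contrasting threshold and ranking metrics, assuming scikit-learn; the synthetic scores and thresholds are arbitrary:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                              # toy binary labels
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, 200), 0, 1)  # noisy scores

print(roc_auc_score(y_true, scores))            # threshold-free: ranking quality
print(average_precision_score(y_true, scores))  # AUC-PR, often better under imbalance
for t in (0.3, 0.5, 0.7):                       # F1 changes with the chosen cutoff
    print(t, f1_score(y_true, (scores >= t).astype(int)))
```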
F1 trick: the F1 score is symmetric in precision and recall; the two can be swapped without changing F1. This is not true of $F_\beta$ for $\beta \neq 1$ (see the check below).
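A quick numeric check, reusing the `fbeta_from_pr` sketch from above:

```python
# F1 is unchanged when precision and recall swap; F2 is not.
print(fbeta_from_pr(0.9, 0.3, beta=1.0), fbeta_from_pr(0.3, 0.9, beta=1.0))  # 0.45, 0.45
print(fbeta_from_pr(0.9, 0.3, beta=2.0), fbeta_from_pr(0.3, 0.9, beta=2.0))  # ~0.35, ~0.64
```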
Related terms: Precision (classification), Recall (classification), AUC-ROC
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics