Also known as: ECE, calibration error
A model is calibrated if its predicted probabilities match observed frequencies. For example, among predictions with $P(y = 1 | x) = 0.7$, exactly 70% should actually be positive.
Expected Calibration Error (ECE) quantifies the deviation. Partition predictions into $M$ equal-width bins $B_1, \ldots, B_M$ over $[0, 1]$ by their confidence; for each bin compute:
- Confidence $\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat p_i$, the average predicted probability, where $\hat p_i = \max_k \hat p_{i,k}$ is the probability assigned to the predicted class.
- Accuracy $\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}[y_i = \arg\max_k \hat p_{i,k}]$, the fraction of correct predictions.
$$\mathrm{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} \cdot |\mathrm{acc}(B_m) - \mathrm{conf}(B_m)|$$
A perfectly calibrated model has ECE = 0. Typical values: well-calibrated models have ECE < 0.05; modern uncalibrated deep networks often have ECE > 0.10.
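A minimal NumPy sketch of this computation, assuming `probs` is an $(N, K)$ array of predicted class probabilities and `labels` an $(N,)$ array of integer class ids (the function name, equal-width binning, and 15 bins are illustrative choices, not prescribed by the definition):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width bins over [0, 1]."""
    confidences = probs.max(axis=1)                 # confidence = probability of predicted class
    accuracies = (probs.argmax(axis=1) == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc_bin = accuracies[in_bin].mean()     # acc(B_m)
            conf_bin = confidences[in_bin].mean()   # conf(B_m)
            ece += in_bin.mean() * abs(acc_bin - conf_bin)   # weight |B_m| / N
    return ece
```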
Reliability diagram: plot $\mathrm{acc}(B_m)$ against $\mathrm{conf}(B_m)$ for each bin. Perfect calibration is the diagonal. Curves above the diagonal are under-confident; below, over-confident.
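A possible matplotlib sketch of such a plot, reusing the same equal-width binning as above (the bar-style presentation is one common convention; empty bins are drawn with zero height):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(probs, labels, n_bins=15):
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    bin_acc = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        bin_acc.append(accuracies[in_bin].mean() if in_bin.any() else 0.0)

    plt.plot([0, 1], [0, 1], "--", color="grey", label="perfect calibration")
    plt.bar(centers, bin_acc, width=1.0 / n_bins, edgecolor="black", alpha=0.7,
            label="acc(B_m)")
    plt.xlabel("conf(B_m)")
    plt.ylabel("acc(B_m)")
    plt.legend()
    plt.show()
```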
Modern deep networks are typically miscalibrated:
Guo et al. (2017) showed that modern CNNs are systematically overconfident: confidence exceeds accuracy. The causes are partly the softmax + cross-entropy loss (which keeps pushing the logits apart even after predictions are correct, beyond what calibration would warrant), partly batch normalisation, and partly the trend towards high-capacity architectures trained with little weight decay.
Calibration techniques:
Temperature scaling (Guo 2017): after training, learn a single scalar $T$ to rescale the logits: $\hat p = \mathrm{softmax}(z/T)$. $T > 1$ smooths overconfident predictions. Tune $T$ on a validation set by minimising NLL. Surprisingly effective, often reducing ECE by an order of magnitude; accuracy is unchanged because dividing the logits by $T$ does not change the argmax.
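A possible PyTorch sketch, assuming `logits` and `labels` are tensors collected on a held-out validation set (`fit_temperature` is an illustrative name; optimising $\log T$ keeps $T$ positive):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Learn a scalar T on validation logits by minimising NLL."""
    log_t = torch.zeros(1, requires_grad=True)     # optimise log T so that T > 0
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# At test time: calibrated_probs = F.softmax(test_logits / T, dim=1)
```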
Platt scaling: fit a logistic regression to map predicted scores to calibrated probabilities. Mostly used for binary models (especially SVMs).
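A minimal scikit-learn sketch, assuming binary labels in $\{0, 1\}$ and raw real-valued scores (classical Platt scaling also uses smoothed targets; plain logistic regression is the common approximation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(train_scores, train_labels, test_scores):
    """Fit p = sigmoid(a * score + b) on held-out scores, then apply it."""
    lr = LogisticRegression()
    lr.fit(np.asarray(train_scores).reshape(-1, 1), train_labels)
    return lr.predict_proba(np.asarray(test_scores).reshape(-1, 1))[:, 1]
```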
Isotonic regression: non-parametric monotonic mapping. More flexible than Platt scaling; can overfit on small datasets.
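A short scikit-learn sketch, again assuming binary labels and a held-out calibration set:

```python
from sklearn.isotonic import IsotonicRegression

def isotonic_calibrate(train_probs, train_labels, test_probs):
    """Learn a monotonic, piecewise-constant map from score to calibrated probability."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(train_probs, train_labels)
    return iso.predict(test_probs)
```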
Histogram binning: replace predictions in each bin with the bin's empirical accuracy.
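A NumPy sketch for the binary case, assuming equal-width bins fitted on held-out arrays (falling back to the bin centre for empty bins is an illustrative choice):

```python
import numpy as np

def histogram_binning(train_probs, train_labels, test_probs, n_bins=15):
    """Replace each prediction with the empirical accuracy of its training bin."""
    train_probs, train_labels = np.asarray(train_probs), np.asarray(train_labels)
    test_probs = np.asarray(test_probs)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    train_bin = np.clip(np.digitize(train_probs, edges) - 1, 0, n_bins - 1)
    # Empirical positive rate per bin; empty bins fall back to the bin centre.
    bin_value = np.array([
        train_labels[train_bin == b].mean() if (train_bin == b).any()
        else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])
    test_bin = np.clip(np.digitize(test_probs, edges) - 1, 0, n_bins - 1)
    return bin_value[test_bin]
```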
Focal loss (Lin et al. 2017): down-weights well-classified examples, sometimes improves calibration as a side effect.
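A compact PyTorch sketch of the multi-class focal loss ($\gamma = 2$ is the value commonly used, following the paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: (1 - p_t)^gamma * cross-entropy."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # -log p_t per example
    p_t = torch.exp(-ce)                                      # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()
```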
Variants of ECE:
Adaptive ECE (ACE): uses bins with equal numbers of samples rather than equal width, more reliable for skewed prediction distributions.
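A NumPy sketch using equal-mass bins; sorting by confidence and splitting into nearly equal chunks is one simple way to implement this:

```python
import numpy as np

def adaptive_ece(probs, labels, n_bins=15):
    """ECE variant with equal-mass bins of roughly N / n_bins samples each."""
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    order = np.argsort(confidences)
    return sum(
        len(b) / len(labels) * abs(accuracies[b].mean() - confidences[b].mean())
        for b in np.array_split(order, n_bins) if len(b) > 0
    )
```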
Class-wise ECE: compute ECE separately per class and average.
Maximum Calibration Error (MCE): $\max_m |\mathrm{acc}(B_m) - \mathrm{conf}(B_m)|$, worst bin instead of weighted average.
Brier score: another metric that incorporates calibration, $\frac{1}{N} \sum_n (\hat p_n - y_n)^2$ for binary labels $y_n \in \{0, 1\}$. The Murphy decomposition splits it into reliability $-$ resolution $+$ uncertainty.
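A one-function NumPy sketch for the binary case, where `probs` holds $P(y = 1)$ and `labels` the 0/1 outcomes:

```python
import numpy as np

def brier_score(probs, labels):
    """Binary Brier score: mean squared gap between P(y = 1) and the 0/1 outcome."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels, dtype=float)
    return np.mean((probs - labels) ** 2)
```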
Why calibration matters:
- Decision-making: a probability you can actually use (e.g. medical diagnostic systems where a 0.5 cutoff matters).
- Selective prediction: abstain when uncertain, requires calibrated uncertainty.
- Combination with other systems: averaging probabilities from miscalibrated models gives misleading results, since the probabilities are not on a common scale.
- Bayesian model selection and uncertainty quantification rely on calibrated probability estimates.
Related terms: Cross-Entropy Loss, Softmax, Temperature (sampling)
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics