Confidence Interval, Glossary, Textbook of AI

A confidence interval (CI) is a range of values computed from sample data by a procedure with a guaranteed long-run property: over hypothetical repetitions of the sampling and analysis, a specified proportion of the resulting intervals (the confidence level, conventionally 95%) would contain the true population parameter. Confidence intervals communicate both a point estimate and its uncertainty, and are generally more informative than a point estimate or a $p$-value alone.

For the population mean of a Gaussian sample with known variance $\sigma^2$, the $100(1-\alpha)\%$ confidence interval is

$$\bar{X} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},$$

where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution (approximately 1.96 for 95%). With unknown variance, one substitutes the sample standard deviation $s$ and uses the quantile $t_{1-\alpha/2, n-1}$ of Student's $t$-distribution with $n-1$ degrees of freedom, giving

$$\bar{X} \pm t_{1-\alpha/2, n-1} \frac{s}{\sqrt{n}}.$$
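As a minimal sketch of the known-variance formula above, the following uses only the Python standard library. The function name `mean_ci_known_sigma`, the sample values and the value of `sigma` are illustrative assumptions, not part of the original text.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mean_ci_known_sigma(xs, sigma, level=0.95):
    """Two-sided z-interval for the mean when sigma is known."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for level=0.95
    half_width = z * sigma / sqrt(len(xs))
    centre = fmean(xs)
    return centre - half_width, centre + half_width


# Illustrative data; sigma = 0.2 is assumed known for the sake of the example.
sample = [4.8, 5.1, 5.0, 4.9, 5.2]
lo, hi = mean_ci_known_sigma(sample, sigma=0.2)
print(f"95% CI for the mean: ({lo:.3f}, {hi:.3f})")

# With unknown variance, replace sigma by statistics.stdev(xs) and z by the
# t quantile, e.g. scipy.stats.t.ppf(1 - alpha / 2, df=len(xs) - 1).
```

The standard library has no Student's $t$ quantile, so the $t$-interval is only indicated in the closing comment; it assumes SciPy is available.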

For proportions, the Wilson score interval is preferred to the simpler Wald interval, especially with small samples or extreme proportions. For more general parameters, Wald, likelihood-ratio, score and profile-likelihood intervals each have advantages.
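The difference between the two proportion intervals is easiest to see at an extreme. The sketch below implements the textbook Wald and Wilson score formulas with the standard library; the function names and the example counts are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(k, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    p = k / n
    half = z * sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(k, n, level=0.95):
    """Wilson score interval for k successes in n trials."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = (z / denom) * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)


# Zero successes in ten trials: an extreme proportion with a small sample.
print("Wald:  ", wald_interval(0, 10))
print("Wilson:", wilson_interval(0, 10))
```

With zero observed successes the Wald interval collapses to a single point at 0, which claims certainty that ten trials cannot justify, whereas the Wilson interval keeps a non-trivial upper bound.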

The precise interpretation of a confidence interval is famously subtle and frequently mis-stated. A 95% CI does not mean there is a 95% probability that the true parameter lies in this particular interval -- that is the interpretation of a Bayesian credible interval under a chosen prior. Instead, the 95% refers to the long-run success rate of the procedure: if we were to repeat the sampling and CI construction many times, 95% of the resulting intervals would contain the true parameter. The interval itself either contains the parameter or it does not; once computed, no further probability statement attaches to the unknown fixed parameter.
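The long-run reading can be checked by simulation. The sketch below repeatedly draws Gaussian samples with a known mean, builds the known-variance 95% interval each time, and reports the fraction of intervals that cover the true mean. The parameter values and the function name `coverage_rate` are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage_rate(mu=10.0, sigma=2.0, n=30, reps=2000, level=0.95, seed=0):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        centre = fmean(sample)
        # Each individual interval either covers mu or it does not.
        if centre - half_width <= mu <= centre + half_width:
            hits += 1
    return hits / reps


print(f"Empirical coverage: {coverage_rate():.3f}")
```

The printed fraction should be close to 0.95, and it describes the procedure across repetitions rather than any single interval.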

A 95% CI corresponds, by duality, to a 5%-level two-sided hypothesis test: the interval contains exactly those null values that the test would not reject. Intervals widen with smaller samples, larger variance, or higher confidence levels. Because the width of the intervals above scales as $1/\sqrt{n}$, halving the width of a CI requires quadrupling the sample size, a hard constraint that drives sample-size planning and power analysis in study design.
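The sample-size rule follows directly from the known-variance interval for the mean. Writing $W(n)$ for the full width of that interval (a symbol introduced here for illustration),

$$W(n) = 2 z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \qquad W(4n) = 2 z_{1-\alpha/2} \frac{\sigma}{\sqrt{4n}} = \tfrac{1}{2} W(n).$$

Solving for the sample size needed to reach a target width $W$ gives

$$n = \left( \frac{2 z_{1-\alpha/2} \, \sigma}{W} \right)^2.$$

As an illustrative calculation with assumed values $\sigma = 10$ and a 95% level, a target width of $W = 4$ needs $n = (2 \times 1.96 \times 10 / 4)^2 \approx 96.0$, so 97 observations, while $W = 2$ needs $n = (2 \times 1.96 \times 10 / 2)^2 \approx 384.2$, so 385 observations. For the $t$-interval and for other estimators the scaling holds only approximately.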

When analytic formulas are unavailable (for medians, ratios, AUCs, model-performance metrics, or complex pipeline outputs), bootstrap confidence intervals offer a flexible non-parametric alternative. The simplest recipe (Efron, 1979) resamples the data with replacement $B$ times (typically $B \geq 1000$), computes the statistic on each resample, and takes the empirical $\alpha/2$ and $1-\alpha/2$ quantiles of the resampled statistics as the interval endpoints; this is the percentile method. Bias-corrected and accelerated (BCa) intervals refine it to handle skewness and bias. The percentile, basic and studentised variants trade simplicity against accuracy.
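The resampling recipe can be sketched in a few lines of standard-library Python. This is the plain percentile variant only, not BCa; the function name `percentile_bootstrap_ci`, the index-based quantile rule and the example data are assumptions for illustration.

```python
import random
from statistics import median


def percentile_bootstrap_ci(data, stat=median, B=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement B times and compute the statistic each time.
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(B))
    alpha = 1.0 - level
    # Simple index-based empirical quantiles of the bootstrap distribution.
    lo_idx = int((alpha / 2.0) * B)
    hi_idx = min(B - 1, int((1.0 - alpha / 2.0) * B))
    return boot[lo_idx], boot[hi_idx]


# Illustrative right-skewed data (e.g. response times in milliseconds).
times = [12, 15, 15, 17, 18, 20, 21, 23, 24, 26, 29, 33, 41, 58, 97]
lo, hi = percentile_bootstrap_ci(times)
print(f"95% bootstrap CI for the median: ({lo}, {hi})")
```

Passing a different callable as `stat` (a trimmed mean, a ratio, a metric computed from predictions) reuses the same procedure unchanged, which is the main appeal of the method.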

In machine learning, confidence intervals are used to report performance metrics with appropriate uncertainty: "accuracy 82.3% (95% CI: 80.1% to 84.5%)". They are essential when comparing models: non-overlapping CIs are sufficient (but not necessary) for a significant difference, so overlap alone does not show that two models perform equally, and formal tests on paired predictions (McNemar's test, paired bootstrap) provide more power. Conformal prediction extends the CI idea to predictions: given any model and a calibration set, conformal methods produce prediction intervals (or sets) with finite-sample, distribution-free coverage guarantees, provided the calibration and test data are exchangeable. The approach is rapidly entering AI deployment for uncertainty quantification.
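One common form of conformal prediction is the split (inductive) variant for regression, sketched below with absolute residuals as the conformity score. The toy model, the synthetic data and the function name `conformal_halfwidth` are assumptions for illustration; the coverage guarantee is marginal and relies on calibration and test points being exchangeable.

```python
import math
import random


def conformal_halfwidth(cal_preds, cal_targets, alpha=0.1):
    """Half-width q such that pred +/- q has marginal coverage >= 1 - alpha."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample correction: use the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def toy_model(x):
    """Hypothetical fitted model; any predictor could be used here."""
    return 2.0 * x


# Synthetic calibration set, held out from whatever data trained the model.
rng = random.Random(0)
cal_x = [rng.uniform(0.0, 10.0) for _ in range(500)]
cal_y = [2.0 * x + rng.gauss(0.0, 1.0) for x in cal_x]
q = conformal_halfwidth([toy_model(x) for x in cal_x], cal_y, alpha=0.1)

x_new = 3.0
print(f"90% prediction interval at x={x_new}: "
      f"({toy_model(x_new) - q:.2f}, {toy_model(x_new) + q:.2f})")
```

This simple version produces intervals of constant width around every prediction; adaptive scores (for example, residuals scaled by an estimated spread) are a common refinement.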

Related terms: Hypothesis Testing, P-value, Bagging, Central Limit Theorem

Discussed in:

Chapter 4: Probability, Probability and Statistics
Chapter 6: ML Fundamentals, Machine Learning
