5.7 Confidence and Credible Intervals
When you ask a model how accurate it is, or you read in a paper that "the new method scored 87.3%", a single number is rarely the whole story. The number you see is a point estimate, one guess at a quantity you cannot directly observe. Run the experiment again on a fresh sample of data and the answer would shift. It might shift by a tiny amount, or it might shift enough to change which model you call "best". A point estimate that arrives without any sense of its wobble is hiding important information from you.
Intervals are the standard way to put that wobble back on the page. Instead of writing "accuracy is 87.3%", you write "accuracy is 87.3%, with a 95% interval of 85.1% to 89.3%". The interval communicates how seriously to take the central number. A wide interval is a polite way of saying "we are not very sure"; a narrow one says "we have good evidence here".
There are two main schools of thought about how to build such intervals, and they correspond to the frequentist and Bayesian frameworks introduced in §5.2. Confidence intervals are the frequentist construction. Credible intervals are the Bayesian one. The names are confusingly similar, the formulas often produce nearly identical numbers, and many people use the words interchangeably. They are not the same. They answer subtly different questions, and the difference matters whenever uncertainty is consequential: clinical decisions, safety claims, or close-call comparisons between models.
This section turns the variance of estimators (§5.4) into intervals you can quote.
Confidence intervals
A confidence interval is a recipe for turning a point estimate and its standard error into a range of plausible values. The frequentist definition is precise and worth stating carefully. A $100(1-\alpha)\%$ confidence interval for a parameter $\theta$ is a random interval, computed from the data, with the property that under repeated sampling, that is, if you re-ran the whole experiment many times, each time computing a new interval from a new dataset, the interval would contain the true $\theta$ in a fraction $1-\alpha$ of those repetitions. The classic recipe in the Gaussian case is
$$\left[\hat\theta - z_{\alpha/2}\,\text{SE}(\hat\theta),\ \hat\theta + z_{\alpha/2}\,\text{SE}(\hat\theta)\right].$$
For a 95% interval, $\alpha = 0.05$ and $z_{\alpha/2} \approx 1.96$. For a 99% interval, the multiplier becomes $2.576$ and the interval gets wider: more confidence costs more width.
Worked example. Suppose you estimate a population mean as $\hat\mu = 5.0$ with standard error $\text{SE} = 0.5$. Then the 95% confidence interval is
$$5.0 \pm 1.96 \cdot 0.5 = [4.02,\ 5.98].$$
That is the kind of computation you will do dozens of times in a career: point estimate, plus or minus the multiplier times the standard error.
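The recipe is short enough to sketch directly; the numbers below are the ones from the worked example.

```python
# 95% confidence interval from a point estimate and its standard error.
theta_hat = 5.0   # point estimate from the worked example
se = 0.5          # standard error
z = 1.96          # two-sided 95% multiplier for a Gaussian

lo, hi = theta_hat - z * se, theta_hat + z * se
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")  # [4.02, 5.98]
```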
Now the awkward part: what does the interval $[4.02, 5.98]$ actually mean? It is tempting to read it as "there is a 95% probability that the true mean lies between 4.02 and 5.98". Almost everyone reads it that way the first time. The frequentist orthodoxy says you cannot. The reason is that the parameter $\theta$ is treated as fixed but unknown; it does not have a probability distribution. The interval is the random object: its endpoints depend on the sample you drew. Different samples produce different intervals. The 95% probability is a property of the procedure, not of any particular interval that the procedure spat out.
A useful way to picture it: imagine running the experiment 1,000 times, computing a fresh CI each time. About 950 of those intervals will straddle the true $\theta$, and about 50 will miss it. Once you have one specific interval in front of you, however, that interval either contains $\theta$ or it does not. The 95% does not transfer onto the single interval you happen to be looking at. The phrase "we are 95% confident" is a polite shorthand: it means "we used a recipe that is right 95% of the time across all the experiments in the world that ever use it". This is a real claim, but it is not the claim most people hear.
If that sounds finicky, it is, and it is one of the main reasons many practitioners reach for credible intervals when interpretation matters.
CIs from the central limit theorem
Most confidence intervals you will compute in practice rest on a single workhorse result: the central limit theorem. For a sample mean of $n$ independent and identically distributed observations with finite variance, the sampling distribution of $\bar X$ is approximately Gaussian for large $n$:
$$\bar X \approx \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right).$$
That gives the standard error $\sigma/\sqrt n$ and the familiar interval $\bar X \pm z_{\alpha/2}\,\sigma/\sqrt n$. In real datasets you almost never know $\sigma$, so you replace it with the sample standard deviation $\hat\sigma$ and use the t-distribution with $n-1$ degrees of freedom in place of the Gaussian. The t-distribution has heavier tails to compensate for the extra uncertainty in estimating $\sigma$. Once $n$ is around 30 or larger the t-multiplier and z-multiplier are practically the same.
Worked example. A test set has $n = 100$ observations, with $\bar X = 75$ and $\hat\sigma = 10$. The standard error is $\text{SE} = 10/\sqrt{100} = 1$. The 95% confidence interval is
$$75 \pm 1.96 \cdot 1 \approx [73.04,\ 76.96].$$
Three things are worth noting. First, the interval shrinks as $1/\sqrt n$: to halve its width you need four times the data. Second, the recipe assumes the observations are independent; this fails for time series, clustered data, or repeated measures, all of which need different treatment. Third, the CLT approximation can be poor for small $n$ or very skewed data; in those cases prefer the bootstrap (below) or an exact method.
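The worked example, computed with both the z-multiplier and the t-multiplier, shows how little they differ at $n = 100$ (a sketch assuming SciPy is available):

```python
from scipy import stats

n, xbar, s = 100, 75.0, 10.0   # numbers from the worked example
se = s / n**0.5                # standard error = 1.0

z = stats.norm.ppf(0.975)          # two-sided 95% z-multiplier, ~1.960
t = stats.t.ppf(0.975, df=n - 1)   # t-multiplier, slightly larger, ~1.984

z_ci = (xbar - z * se, xbar + z * se)
t_ci = (xbar - t * se, xbar + t * se)
print(f"z-interval: [{z_ci[0]:.2f}, {z_ci[1]:.2f}]")  # ~ [73.04, 76.96]
print(f"t-interval: [{t_ci[0]:.2f}, {t_ci[1]:.2f}]")  # a touch wider
```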
Credible intervals
The Bayesian story starts somewhere different. In Bayesian statistics, the parameter $\theta$ has a probability distribution, the posterior, that captures everything you believe about $\theta$ given the data and your prior. A $100(1-\alpha)\%$ credible interval is simply any interval that contains $1-\alpha$ of the posterior probability. This is the interval that supports the natural English reading: given the data and prior, $\theta$ lies inside this interval with probability $1-\alpha$.
There are two common ways to choose which interval. The equal-tailed interval cuts $\alpha/2$ of the posterior probability from each tail; it is easy to compute from posterior quantiles. The highest posterior density (HPD) interval is the shortest interval containing $1-\alpha$ probability, equivalently the set of $\theta$ values whose posterior density is above some threshold. For symmetric posteriors the two coincide; for skewed posteriors the HPD is more faithful to where the bulk of the belief lies.
Worked example. Suppose your posterior for some parameter is $\theta \mid \mathcal{D} \sim \mathcal{N}(2, 1)$, a Gaussian with mean 2 and variance 1. The 95% credible interval is the central 95% of that Gaussian:
$$2 \pm 1.96 \cdot 1 \approx [0.04,\ 3.96].$$
Notice that arithmetically this is the same calculation as a frequentist CI under a Gaussian model with a flat prior. You will often see Bayesian and frequentist intervals that match to within rounding. What changes is the interpretation. The Bayesian sentence "there is a 95% probability that $\theta$ is between 0.04 and 3.96" is mathematically licensed by the posterior. The frequentist sentence with the same numbers cannot say that.
For asymmetric posteriors (Beta distributions for proportions, Gamma distributions for rates, anything skewed), the equal-tailed and HPD intervals differ. For a $\operatorname{Beta}(8, 4)$ posterior, the 95% equal-tailed interval is approximately $[0.39, 0.89]$ and the HPD is approximately $[0.40, 0.89]$; they are similar because the posterior is only mildly skewed.
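Both flavours of credible interval are a few lines of code for the $\operatorname{Beta}(8, 4)$ example; here is one way to sketch them with SciPy, computing the HPD as the shortest window of 95% posterior mass.

```python
import numpy as np
from scipy import stats

post = stats.beta(8, 4)  # posterior from the worked example

# Equal-tailed 95% interval: cut 2.5% of probability from each tail.
eq = post.ppf([0.025, 0.975])

# HPD: slide a 95%-mass window over the quantile function and
# keep the narrowest interval found.
p = np.linspace(0.0, 0.05, 2001)
widths = post.ppf(p + 0.95) - post.ppf(p)
i = np.argmin(widths)
hpd = (post.ppf(p[i]), post.ppf(p[i] + 0.95))

print("equal-tailed:", np.round(eq, 3))
print("HPD:         ", np.round(hpd, 3))
```

Because the HPD is the shortest 95% interval by construction, its width can never exceed the equal-tailed width; for a mildly skewed posterior like this one the two nearly coincide.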
Where credible intervals truly shine is when the prior carries useful information. In a clinical trial with a small sample, an informative prior built from earlier studies can make the credible interval substantially narrower than the corresponding CI, because the prior contributes evidence too. With a flat prior, the two converge. The choice between flavours is partly philosophical and partly practical: pick credible intervals when you want to make probability statements about parameters and you can defend a prior; pick confidence intervals when the procedure-level guarantee is what you care about.
Bootstrap confidence intervals
Both formulas above assumed you knew the sampling distribution of your estimator (Gaussian for the mean by the CLT, t for unknown variance). What if you do not? What is the sampling distribution of the median, or the 90th percentile of latency, or BLEU score on a test set, or the AUC of a classifier? For most quantities of interest in machine learning there is no closed form.
The bootstrap is the workhorse answer. The idea, due to Efron (1979), is to treat your sample as if it were the population and resample from it with replacement. You draw $B$ bootstrap samples, each the same size as your original dataset, by sampling with replacement. For each bootstrap sample you compute your estimator. The resulting collection of $B$ values is the bootstrap distribution, and it approximates the true sampling distribution.
From there you can build an interval in several ways. The simplest, the percentile method, takes the 2.5th and 97.5th percentiles of the bootstrap values directly as the 95% interval. The basic (or pivotal) bootstrap reflects those quantiles around the original estimate. The BCa (bias-corrected and accelerated) interval applies two adjustments, for median bias and for skewness, and is generally more accurate, at modest extra computational cost. Typical choices for $B$ are 1,000 for quick exploration and 10,000 or more for publication.
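The percentile method is a dozen lines. A minimal sketch with NumPy, using the median of some skewed synthetic data as the estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # skewed, illustrative

def percentile_ci(data, estimator, B=10_000, alpha=0.05, rng=rng):
    """Percentile bootstrap CI: resample with replacement,
    re-estimate, take the empirical quantiles."""
    n = len(data)
    boot = np.array([estimator(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

lo, hi = percentile_ci(data, np.median)
print(f"95% bootstrap CI for the median: [{lo:.3f}, {hi:.3f}]")
```

Swapping in a different `estimator` (a percentile, an AUC, a corpus-level metric) is the whole point: the recipe does not change.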
The bootstrap is now standard in machine learning evaluation. To put a CI on BLEU or ROUGE for a translation model, you bootstrap-resample the test sentences. To compare two classifiers on a benchmark, you bootstrap their score difference. The method handles weird estimators, weird data, and weird metrics with the same uniform recipe. Its main weakness is that it inherits any bias the estimator already has, and it can struggle with statistics like the maximum or with strongly dependent data.
Confidence intervals in ML
Confidence intervals appear in machine learning in several characteristic places.
For test-set accuracy (a binomial proportion), the naive interval $\hat p \pm z\sqrt{\hat p(1-\hat p)/n}$ is the Wald interval and behaves badly when $\hat p$ is near 0 or 1, or when the test set is small. The Wilson score interval is a far better default and should be your reflex for proportions. With 873 correct out of 1,000 test examples, the Wilson 95% interval is approximately $[85.1\%, 89.2\%]$ and is reliable even with much smaller $n$.
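The Wilson formula is easy to implement directly; here it is applied to the 873-out-of-1,000 example.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(873, 1000)
print(f"Wilson 95% CI: [{lo:.3f}, {hi:.3f}]")  # ~ [0.851, 0.892]
```

Unlike the Wald interval, the endpoints are pulled toward 1/2 and can never escape $[0, 1]$.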
For generative metrics like BLEU, ROUGE, METEOR, or chrF, there is no clean parametric form, so bootstrap over the test examples. Each bootstrap sample re-draws sentences with replacement and recomputes the corpus-level metric.
For pairwise model comparisons, for example, win rates in Chatbot Arena or A/B tests, put a confidence interval on the win rate or score difference rather than reporting only a point. Two models posting 87.3% and 88.1% on the same 1,000-example test set are statistically indistinguishable on independent intervals; a paired comparison on per-example outcomes is much sharper.
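To see why pairing sharpens the comparison, consider two models scored on the same examples. The per-example correctness vectors below are synthetic and purely illustrative: model B is constructed to agree with model A on most examples, which is the correlated situation pairing exploits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical 0/1 correctness on a shared test set; B mostly agrees with A.
a = (rng.random(n) < 0.873).astype(float)
flip = rng.random(n) < 0.05
b = np.where(flip, 1 - a, a)

d = a - b  # per-example difference
se_paired = d.std(ddof=1) / np.sqrt(n)
se_indep = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)

print(f"paired SE of the difference: {se_paired:.4f}")
print(f"independent-intervals SE:    {se_indep:.4f}")  # larger
ci = (d.mean() - 1.96 * se_paired, d.mean() + 1.96 * se_paired)
```

The paired standard error subtracts the (positive) covariance between the two models' outcomes, so the interval on the difference is tighter than anything you can get from two independent intervals.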
For training-time variance, train each configuration with multiple random seeds, typically three to ten, and report the mean with either a standard deviation or a CI across seeds. Failing to do this is one of the most reliable ways to publish a result that does not replicate. Hyperparameter sweeps without seed-level CIs routinely mistake stochastic noise for genuine improvement.
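With only a handful of seeds, the t-multiplier matters. A sketch with hypothetical accuracies from five seeds of one configuration, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

# Hypothetical validation accuracies for one configuration, five seeds.
accs = np.array([0.871, 0.868, 0.875, 0.866, 0.872])

mean = accs.mean()
se = accs.std(ddof=1) / np.sqrt(len(accs))
t = stats.t.ppf(0.975, df=len(accs) - 1)  # small n: use t, not z
print(f"{mean:.3f} ± {t * se:.3f} (95% CI over seeds)")
```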
The pattern across all of these is the same: pair every reported number with an explicit indication of its wobble.
Common pitfalls
Treating one CI as a probability statement about $\theta$. This is the single most common error. Only credible intervals support that reading. If you want it, build a posterior.
Computing CIs after p-hacking. If you sliced the data, picked the most flattering subgroup, or stopped collecting data when the result looked good, the procedure that produced your interval is not the one whose 95% guarantee you are quoting. The advertised coverage no longer applies.
Ignoring multiple comparisons. If you compute 20 independent 95% CIs, on average one will fail to cover. Quoting individual intervals as if each were the only test in town overstates your evidence. Use simultaneous methods (Bonferroni, false discovery rate, or hierarchical models) when many intervals share the page.
Wald intervals for proportions near 0 or 1. The normal-approximation interval can extend past $[0, 1]$, undercover badly, or even collapse to a single point when $\hat p = 0$ or $1$. Use Wilson, Jeffreys, or Clopper-Pearson; or, if you are Bayesian, the Beta posterior credible interval.
A quieter pitfall is reporting CIs on a held-out set and then continuing to tune on it. Once the test set has been peeked at, the CI is no longer a clean estimate of generalisation; it is a description of how well you fitted the held-out data. Lock the test set; report the interval once.
What you should take away
- Always pair a point estimate with an interval; a single number without uncertainty is incomplete.
- The 95% in a confidence interval is a property of the procedure across hypothetical repetitions, not of the specific interval you computed.
- Credible intervals support the natural reading "$\theta$ is here with probability $1-\alpha$", but require a posterior and therefore a prior.
- When closed-form sampling distributions are unavailable, which is most of the time in ML, bootstrap.
- For proportions, prefer the Wilson interval; for generative metrics, bootstrap; for seed variance, train multiple seeds and report mean with spread.