5.9 Bootstrap and Resampling
In §5.7 we built confidence intervals by leaning on a piece of theory: for the sample mean, the central limit theorem guarantees an approximately normal sampling distribution, so $\bar X \pm 1.96\,\hat\sigma/\sqrt{n}$ is a sensible 95% interval. That recipe is wonderful when it applies, but it only applies to a handful of estimators. What is the standard error of the median? Of the correlation coefficient? Of the area under a ROC curve? Of the BLEU score on a held-out test set? For most of the quantities we actually care about in machine learning, no clean closed-form sampling distribution exists, and the pencil-and-paper approach simply runs out of road.
There is a trick that gets us most of the way there without any further mathematics. It is called the bootstrap, introduced by Bradley Efron in 1979, and it is arguably the single most useful idea in computational statistics. The bootstrap takes a problem we cannot solve analytically ("what does the sampling distribution of my estimator look like?") and converts it into a problem we can solve by brute force on a laptop: resample the data, recompute the estimator, look at what comes out.
The intuition is short. The reason we cannot pin down the sampling distribution of $\hat\theta$ is that we cannot generate fresh datasets from the population $F$ that produced our data; we only have one dataset. But that one dataset is itself a fairly good picture of $F$: each observation is a draw from $F$, and the empirical distribution $\hat F_n$ that puts mass $1/n$ on each data point is the best non-parametric estimate of $F$ we have. So instead of drawing new samples from $F$ (impossible), we draw new samples from $\hat F_n$ (trivial: sample from the data with replacement). For each pretend-new dataset, recompute the estimator. Stack the results. The empirical distribution of those bootstrapped estimates approximates the sampling distribution of the original estimator. Confidence intervals, standard errors, bias estimates: all of them fall out of that one cloud of numbers.
This section walks through the algorithm, a worked example, the three standard ways to extract a confidence interval, the situations where the bootstrap shines and the situations where it quietly fails, and the closely related idea of permutation tests. It closes by showing why the bootstrap is the modern AI evaluator's best friend.
The bootstrap algorithm
The procedure has three steps and fits on a postcard.
- Start with the original data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ and the estimator $\hat\theta = \hat\theta(\mathcal{D})$ you already computed once.
- For each replication $b = 1, \ldots, B$:
  - Draw $n$ items from $\mathcal{D}$ with replacement to form $\mathcal{D}^*_b$. The same original observation may appear several times in a resample, and roughly a third of the original points will be missing from any given resample; that is fine, it is exactly the variability we want to capture.
  - Compute $\hat\theta^*_b = \hat\theta(\mathcal{D}^*_b)$, applying the same procedure you applied to the original data.
- Use the empirical distribution of $\{\hat\theta^*_1, \hat\theta^*_2, \ldots, \hat\theta^*_B\}$ as a stand-in for the true sampling distribution of $\hat\theta$. The standard deviation of those $B$ numbers estimates the standard error; their quantiles give you a confidence interval; their mean minus $\hat\theta$ estimates the bias.
That is it. There is no separate theory you have to verify, no normality assumption, no requirement that your estimator have a tractable form. As long as you can compute $\hat\theta$ on a dataset, you can bootstrap it. The price is computation: you fit the estimator $B$ times instead of once. With $B = 1000$ that is usually trivial; with $B = 10000$ and an expensive estimator it can take a while, but you can run the replications in parallel because each one is independent.
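To make the postcard concrete, here is a minimal sketch of the loop in NumPy. The function name bootstrap_distribution and the default $B = 1000$ are illustrative choices, not from the text; any estimator that maps an array to a single number can be plugged in.

```python
import numpy as np

def bootstrap_distribution(data, estimator, B=1000, rng=None):
    """Return B bootstrap replicates of estimator(data).

    data      : 1-D array of observations (assumed i.i.d.)
    estimator : function mapping an array to a single number
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    reps = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=n, replace=True)  # sample rows with replacement
        reps[b] = estimator(resample)                       # same procedure as on the original data
    return reps

# Standard error, percentile CI, and bias estimate all come from the same replicates:
# reps = bootstrap_distribution(data, np.mean, B=1000)
# se   = reps.std(ddof=1)
# ci   = np.percentile(reps, [2.5, 97.5])
# bias = reps.mean() - np.mean(data)
```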
A subtle but important point: the resampling has to mimic the way the original data were collected. If your data are i.i.d., resample individual rows. If they come in clusters (patients within hospitals, words within documents, frames within videos), resample whole clusters; this is the block bootstrap, and ignoring it leads to confidence intervals that are far too narrow.
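For clustered data, the resampling unit changes from rows to whole clusters. A minimal sketch, assuming the data sit in a pandas DataFrame with a cluster-id column (the column name "hospital" and the statistic are made up for illustration):

```python
import numpy as np
import pandas as pd

def cluster_bootstrap_resample(df, cluster_col, rng):
    """Draw one bootstrap resample by sampling whole clusters with replacement."""
    ids = df[cluster_col].unique()
    picked = rng.choice(ids, size=len(ids), replace=True)
    # Concatenate the rows of each picked cluster; a cluster drawn twice appears twice.
    return pd.concat([df[df[cluster_col] == i] for i in picked], ignore_index=True)

# rng = np.random.default_rng(0)
# reps = [my_statistic(cluster_bootstrap_resample(df, "hospital", rng)) for _ in range(1000)]
```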
Worked example: confidence interval for the median
Suppose we have collected $n = 20$ exam scores out of 100:
$$58,\ 62,\ 65,\ 68,\ 70,\ 72,\ 73,\ 74,\ 75,\ 75,\ 76,\ 77,\ 78,\ 79,\ 81,\ 83,\ 85,\ 88,\ 91,\ 95.$$
The sample median is 75 (the average of the tenth and eleventh order statistics). We would like a 95% confidence interval for the population median. There is no neat formula here: the sampling distribution of the median depends on the density of the underlying distribution at the median, which we do not know. The bootstrap sweeps the difficulty aside.
We draw $B = 1000$ resamples of size 20 with replacement from this set of scores. For each one we compute the median. A typical first resample might contain 65 twice, 78 three times, and so on; its median might come out to 76. The next resample's median might be 73.5. After $B$ replications we have 1000 numbers, the bootstrap distribution of the median. We sort them. The 25th and 975th values bracket the central 95% of the distribution.
Running this on the data above yields a percentile interval close to $[72.5,\ 78.0]$. We report: the sample median is 75 with a 95% bootstrap percentile CI of $[72.5,\ 78.0]$.
A few points are worth pausing on. First, the resampled medians take only a discrete set of values, because they are always averages of order statistics from the original 20 scores; a histogram of the bootstrap distribution shows visible spikes rather than a smooth curve. That is a quirk of the median, not a bug in the method. Second, doubling $B$ from 1000 to 2000 barely shifts the interval; the bootstrap is mainly limited by the size of the original dataset, not by $B$, once $B$ is in the low thousands. Third, the same code structure works for any estimator: replace np.median with np.std, with a regression coefficient, with a classifier accuracy on a held-out set, and you immediately get a CI for that quantity. That generality is the whole point.
In Python the loop is six lines (np.random.choice(data, size=n, replace=True) inside a for loop, then np.percentile(boot, [2.5, 97.5])). On a laptop, $B = 10000$ resamples of a small dataset finish in under a second.
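A sketch of that loop for the exam-score example (the exact endpoints will wobble slightly with the random seed):

```python
import numpy as np

scores = np.array([58, 62, 65, 68, 70, 72, 73, 74, 75, 75,
                   76, 77, 78, 79, 81, 83, 85, 88, 91, 95])

rng = np.random.default_rng(0)
B = 10_000
boot = np.array([np.median(rng.choice(scores, size=len(scores), replace=True))
                 for _ in range(B)])                  # bootstrap distribution of the median

lo, hi = np.percentile(boot, [2.5, 97.5])             # percentile CI
print(f"median = {np.median(scores):.1f}, 95% percentile CI = [{lo:.1f}, {hi:.1f}]")
```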
Three flavours of bootstrap CI
Once you have $B$ bootstrap replicates $\hat\theta^*_1, \ldots, \hat\theta^*_B$ there are several ways to turn them into an interval, and the choice matters when the sampling distribution is skewed.
- Percentile CI. Take the $\alpha/2$ and $1-\alpha/2$ empirical quantiles of the bootstrap distribution: $[\hat\theta^*_{(\alpha/2)},\ \hat\theta^*_{(1-\alpha/2)}]$. Trivial to compute and intuitive: the interval is literally the middle 95% of resampled estimates. It works well when the sampling distribution of $\hat\theta$ is roughly symmetric and unbiased. It can be misleading when the distribution is skewed, because the percentile method copies the skewness of the bootstrap distribution directly into the CI, whereas the pivotal argument below suggests the interval should be skewed the other way.
- Basic (pivotal) CI. Pivot around the original estimate: $[2\hat\theta - \hat\theta^*_{(1-\alpha/2)},\ 2\hat\theta - \hat\theta^*_{(\alpha/2)}]$. The reasoning is that $\hat\theta^* - \hat\theta$ is approximately distributed like $\hat\theta - \theta$, so we can flip the bootstrap quantiles around $\hat\theta$ to bracket the true parameter. Often performs better than the percentile CI when the bootstrap distribution is asymmetric.
- BCa (bias-corrected and accelerated) CI. A refinement that adjusts both for any bias the bootstrap distribution shows relative to $\hat\theta$ and for acceleration, a measure of how the standard error changes with $\theta$. BCa is more accurate (its coverage error shrinks faster as $n$ grows) and is the default in scipy.stats.bootstrap. It is slightly fiddlier (a jackknife pass is needed to estimate the acceleration), but you should not have to implement it yourself.
A practical rule: start with BCa if your library offers it, fall back to the basic interval if you are coding by hand and your distribution looks asymmetric, and reach for the percentile interval only when you want simplicity and the distribution is clearly symmetric. In all three cases, the underlying bootstrap loop is identical; the flavours differ only in how the final interval is computed from the same $B$ replicates.
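With scipy available, all three flavours come from the same call; only the method argument changes. A sketch using the exam-score data from above (scipy.stats.bootstrap expects the data wrapped in a sequence, here a one-element tuple):

```python
import numpy as np
from scipy import stats

scores = np.array([58, 62, 65, 68, 70, 72, 73, 74, 75, 75,
                   76, 77, 78, 79, 81, 83, 85, 88, 91, 95])
rng = np.random.default_rng(0)

for method in ["percentile", "basic", "BCa"]:
    res = stats.bootstrap((scores,), np.median, n_resamples=10_000,
                          confidence_level=0.95, method=method, random_state=rng)
    ci = res.confidence_interval
    print(f"{method:10s}  [{ci.low:.1f}, {ci.high:.1f}]")
```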
When the bootstrap works (and when it doesn't)
The bootstrap is reliable for smooth functionals of the data, quantities that vary continuously and not too sharply as you wiggle the dataset. That covers almost everything you care about in ML evaluation: means, variances, regression coefficients, $R^2$, AUC, BLEU, perplexity differences, calibration measures. For these the bootstrap distribution converges to the true sampling distribution as $n \to \infty$, often at the same rate as the central limit theorem.
It works approximately for moderately rough functionals such as the median, other quantiles, and classification accuracy. The bootstrap distribution is a touch lumpy in finite samples (recall the spikes in the median example), but the resulting intervals are usually trustworthy down to small datasets, and BCa often closes the remaining gap.
It fails in three situations you should learn to recognise.
- Extreme order statistics. The bootstrap is famously inconsistent for the maximum of a uniform distribution. The reason is that a resample's maximum equals the observed maximum whenever the resample happens to include that point, which occurs roughly 63% of the time, so the bootstrap distribution of the max has a large point mass at the observed maximum (see the sketch after this list). No amount of resampling fixes this; it is a structural failure, not a sample-size issue. The same problem haunts other extremes such as the minimum and very high quantiles.
- Heavy-tailed distributions. When the data have infinite variance (Cauchy-like, certain financial returns), the bootstrap distribution of the mean does not converge to the right thing. Subsampling (drawing resamples smaller than $n$) can rescue you, but the standard bootstrap cannot.
- Dependent or hierarchical data. If observations are correlated across time, space, or cluster, plain row-resampling pretends the data are independent and produces intervals that are far too narrow. The fix is the block bootstrap: resample contiguous blocks of time series, or whole clusters of related rows, so the resamples preserve the dependence structure of the original.
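The point-mass failure for the maximum is easy to see numerically. A small sketch with made-up uniform data; the fraction printed should land near $1 - (1 - 1/n)^n \approx 0.63$:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=100)
obs_max = data.max()

# Bootstrap distribution of the maximum
boot_max = np.array([rng.choice(data, size=len(data), replace=True).max()
                     for _ in range(10_000)])

frac_at_max = (boot_max == obs_max).mean()
print(f"fraction of bootstrap maxima equal to the observed max: {frac_at_max:.2f}")
# ~0.63: a point mass the true sampling distribution of the maximum does not have.
```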
When in doubt, run a small simulation. Generate data from a known distribution, compute many bootstrap CIs, and check whether they cover the true parameter at the advertised rate. Coverage simulations take ten minutes to write and will save you from publishing wrong error bars.
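A minimal coverage check, assuming for illustration that the "population" is Exponential(1), whose median is $\ln 2$; if the percentile interval is honest, close to 95% of the intervals should contain that value:

```python
import numpy as np

rng = np.random.default_rng(0)
true_median = np.log(2)           # median of Exponential(1)
n, B, trials = 30, 2000, 500
covered = 0

for _ in range(trials):
    data = rng.exponential(scale=1.0, size=n)
    boot = np.array([np.median(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    covered += (lo <= true_median <= hi)

print(f"empirical coverage: {covered / trials:.3f}  (nominal 0.95)")
```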
Permutation tests
Resampling has a sibling for hypothesis testing called the permutation test. The setup is the one we met in §5.8: under the null hypothesis of no group difference, the labels attached to the observations are exchangeable; it makes no difference which observations you call "treatment" and which "control". So we can construct an empirical null distribution by repeatedly shuffling the labels and recomputing the test statistic.
A typical workflow for an A/B experiment: we observe a difference in mean conversion of 0.05 between the new variant and the control. Is that real, or noise? Pool the two groups, shuffle the group labels, recompute the difference in means, repeat 10000 times. This gives 10000 differences in means under the null. The $p$-value is the fraction of those whose magnitude is at least as extreme as 0.05. If only 50 of the 10000 shuffled differences are at least that extreme, then $p \approx 50 / 10000 = 0.005$, and we reject the null at the 1% level.
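A sketch of that shuffle, assuming the per-user conversions are stored as two 0/1 NumPy arrays (the function and argument names are illustrative):

```python
import numpy as np

def permutation_p_value(treatment, control, n_perm=10_000, rng=None):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(rng)
    observed = treatment.mean() - control.mean()
    pooled = np.concatenate([treatment, control])
    n_t = len(treatment)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                          # shuffle the group labels
        diff = pooled[:n_t].mean() - pooled[n_t:].mean()
        count += abs(diff) >= abs(observed)          # at least as extreme as observed
    return count / n_perm
```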
Permutation tests share the bootstrap's appeal (no theory, no normality, just arithmetic) and they have one extra virtue: when the exchangeability assumption holds, they give exact type-I error rates rather than asymptotic ones. They generalise effortlessly to any test statistic you like (mean difference, median difference, AUC difference, F1 difference), so you do not have to hunt for the textbook test that matches your situation. The catch is that they test a sharp null (the two distributions are identical, not merely equal in mean), which is occasionally the wrong null. For most practical comparisons in ML evaluation that distinction does not matter.
Bootstrap in ML evaluation
The bootstrap is everywhere in modern AI papers, often without being named. Whenever you see a number reported as $0.842 \pm 0.011$ on a held-out test set, there is usually a bootstrap behind the $\pm$. The standard moves:
- CI on a test-set metric. Resample the test examples (with replacement), recompute accuracy/F1/AUC/BLEU on each resample, take the 2.5% and 97.5% quantiles. This converts a single test number into an honest interval, and it works for any metric without bespoke theory.
- CI for a model's improvement. Bootstrap the paired difference in metrics on the same test set: for each resample, compute metric$_A$ minus metric$_B$, then take quantiles of those differences. If the resulting CI excludes zero, the improvement is real at the chosen confidence level. Pairing matters: it cancels the variance the two models share on the same test examples and gives much tighter intervals than bootstrapping the two models separately (see the sketch after this list).
- Significance via permutation. For a clean $p$-value on "is system A better than system B?", permute the model labels within each test example and recompute the metric difference. This paired permutation test is a standard recipe in NLP evaluation, closely related to paired bootstrap resampling.
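A minimal sketch of the paired bootstrap, assuming two arrays of per-example correctness (0/1) for models A and B on the same test set; the names and the accuracy metric are illustrative:

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, B=10_000, rng=None):
    """95% percentile CI for mean(scores_a) - mean(scores_b), resampling test examples."""
    rng = np.random.default_rng(rng)
    n = len(scores_a)
    diffs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)             # resample example indices with replacement
        diffs[b] = scores_a[idx].mean() - scores_b[idx].mean()
    return np.percentile(diffs, [2.5, 97.5])

# lo, hi = paired_bootstrap_ci(correct_a, correct_b)
# If the interval excludes zero, the improvement holds at the 95% level.
```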
Two cautions before you reach for the bootstrap on a leaderboard. First, if your test set is small and the metric is bounded (accuracy between 0 and 1), the resampled metric distribution can be lumpy enough that BCa is worth the extra effort. Second, the bootstrap captures variability due to the test set; it does not capture variability due to training randomness, hyperparameter choices, or initialisation seeds. To quantify those you need to retrain the model multiple times with different seeds, which is expensive but increasingly expected in serious empirical work.
What you should take away
- The bootstrap converts a hard analytical question into an easy computational one. Whenever you cannot derive a sampling distribution, resample with replacement, recompute the estimator, and treat the resulting cloud of numbers as the sampling distribution.
- The algorithm is a few lines of code and works for almost any estimator that takes a dataset and returns a number. Use $B \approx 1000$ for a CI, $B \approx 10000$ when you need more decimal places.
- Three flavours of CI exist (percentile, basic, and BCa), differing only in how the same bootstrap replicates are summarised. Prefer BCa when your library provides it; it adjusts for bias and skewness automatically.
- The bootstrap fails predictably on extreme order statistics, heavy-tailed distributions, and dependent data; for the last of these, use the block bootstrap. When in doubt, simulate and check coverage.
- In ML evaluation, the bootstrap is the default. Use it for confidence intervals on test-set metrics, for paired comparisons between models, and alongside permutation tests for significance. A point estimate $\pm$ bootstrap CI is the modern minimum standard for reporting an empirical result.