4.9 Limit theorems
Probability theory has two crown jewels, and they are both statements about what happens when you collect a lot of data. The Law of Large Numbers says that if you average many independent draws from the same distribution, the average settles down to the true mean. The Central Limit Theorem says something stronger and much more surprising: the way that average jiggles around the true mean is itself almost perfectly Gaussian, no matter what the underlying distribution looked like. Lopsided coins, dice, exam marks, photon counts, click-through rates: average enough of them and the bell curve appears.
These two results are why "more data gives better estimates" is not just a vague slogan but a theorem. They are the silent foundation of almost every statistical and machine-learning method you will meet. When a textbook says that an estimator is consistent, the proof is the Law of Large Numbers in disguise. When it says you can build a confidence interval by adding 1.96 standard errors to each side of a sample mean, the justification is the Central Limit Theorem. When SGD on millions of mini-batches converges, both theorems are doing quiet structural work in the background.
In §4.8 we met the inequalities of Markov, Chebyshev, and Hoeffding, which give finite-sample bounds for any $n$. This section takes the next step. Instead of asking "how bad can things be at sample size $n$?" we ask "what happens as $n$ grows without limit?". Limit theorems are asymptotic statements: they describe a destination, not a journey. But once you understand the destination, the inequalities of §4.8 tell you how quickly you arrive.
The law of large numbers
Suppose $X_1, X_2, \ldots$ is an infinite sequence of i.i.d. random variables, each with finite mean $\mu = \mathbb{E}[X_1]$. Form the running sample mean $\bar X_n = \tfrac{1}{n}(X_1 + X_2 + \cdots + X_n)$. The Law of Large Numbers (LLN) comes in two strengths.
The weak LLN says that for any positive tolerance $\epsilon > 0$, $$ P(|\bar X_n - \mu| > \epsilon) \to 0 \quad \text{as } n \to \infty. $$ That is, the chance that the sample average lies further than $\epsilon$ from the truth shrinks to zero. We write this as $\bar X_n \xrightarrow{p} \mu$ and call it convergence in probability.
The strong LLN says something stronger: $$ P\!\left(\lim_{n\to\infty} \bar X_n = \mu\right) = 1. $$ With probability one, the entire infinite sequence of running averages settles on $\mu$. We write this as $\bar X_n \xrightarrow{a.s.} \mu$, almost surely. The weak version permits occasional swings far from $\mu$ as long as they grow rare; the strong version says that, on a typical infinite sample path, the swings eventually stop happening at all.
In practice the two say the same useful thing: collect more data, and the sample mean closes in on the true mean. The strong version is technically harder to prove (it requires something like Kolmogorov's three-series theorem), but the weak version has a two-line proof when the variance $\sigma^2$ is finite. Apply Chebyshev's inequality (§4.8) to $\bar X_n$, noting that $\mathrm{Var}(\bar X_n) = \sigma^2/n$: $$ P(|\bar X_n - \mu| > \epsilon) \;\leq\; \frac{\mathrm{Var}(\bar X_n)}{\epsilon^2} \;=\; \frac{\sigma^2}{n\epsilon^2} \;\to\; 0. $$ Two lines of algebra dispatch one of the most-used theorems in science.
A worked feel for the speed. Imagine rolling a fair six-sided die. The true mean is $\mu = 3.5$ and the variance is $\sigma^2 = 35/12 \approx 2.92$. Pick out one realisation and watch the sample mean evolve. With $n = 1$ the "mean" is just the single roll, which could be anywhere from 1 to 6. With $n = 10$ the sample mean is typically within roughly $\pm 0.54$ of 3.5. With $n = 100$ the typical departure is around $\pm 0.17$. With $n = 1{,}000$ it is roughly $\pm 0.054$. These typical sizes are just $\sigma/\sqrt{n}$, the standard error, which we will meet again in a moment. The shrinking is real but slow.
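A minimal simulation sketch of this speed, using NumPy (the repetition count and seed are arbitrary choices for illustration): repeat the die-rolling experiment many times at each $n$ and measure how spread out the sample means are.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, purely for reproducibility
sigma = np.sqrt(35 / 12)                  # true standard deviation of one die roll

for n in [1, 10, 100, 1_000]:
    # 10,000 independent experiments, each averaging n fair die rolls
    rolls = rng.integers(1, 7, size=(10_000, n))
    sample_means = rolls.mean(axis=1)
    # empirical spread of the sample mean vs. the theoretical standard error
    print(f"n={n:5d}  empirical {sample_means.std():.3f}   sigma/sqrt(n) {sigma / np.sqrt(n):.3f}")
```

The two columns agree to a couple of decimal places: the $\sigma/\sqrt{n}$ scaling, watched in one short loop.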
The LLN is the silent foundation underneath almost every machine-learning training algorithm. Empirical risk minimisation is justified because the empirical mean of the loss converges to the true expected loss. Monte Carlo integration converges because the empirical average of $f(X_i)$ converges to $\mathbb{E}[f(X)]$. Stochastic gradient descent converges (under mild extra regularity) because each stochastic gradient is an unbiased estimator of the true gradient and averaging unbiased estimators is what the LLN is about. Whenever you write loss.mean() in PyTorch and trust the result, you are leaning on the Law of Large Numbers.
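As a concrete instance of the Monte Carlo point, a minimal sketch (the integrand and sample sizes are arbitrary choices for illustration): the average of $f(X_i)$ closes in on $\mathbb{E}[f(X)]$ exactly as the LLN promises.

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate E[f(X)] for X ~ Uniform(0, 1) with f(x) = x**2; the true value is 1/3.
for n in [100, 10_000, 1_000_000]:
    x = rng.uniform(0.0, 1.0, size=n)
    print(f"n={n:8d}  Monte Carlo estimate {np.mean(x ** 2):.5f}   (truth 0.33333)")
```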
The central limit theorem
The LLN tells you that $\bar X_n$ converges to $\mu$ but it does not say how the residual $\bar X_n - \mu$ is distributed for large but finite $n$. The Central Limit Theorem (CLT) answers this and gives a clean reply: after rescaling by $\sqrt{n}$, the residual is approximately Gaussian.
Formally, if $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$ and finite variance $\sigma^2 > 0$, then $$ \frac{\sqrt{n}\,(\bar X_n - \mu)}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1). $$ The arrow $\xrightarrow{d}$ denotes convergence in distribution: the cumulative distribution function of the left-hand side converges, at every continuity point, to that of a standard Gaussian. Equivalently, $\bar X_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$ for large $n$.
Read that again. The individual $X_i$ might be Bernoulli flips, Poisson counts, exponential waiting times, the heights of trees, the number of synaptic vesicles released at a junction, the daily returns of a stock, anything with a finite variance. Their sample mean still ends up Gaussian. This universality is why the bell curve appears throughout the natural and social sciences: any quantity that arises as a sum of many small independent contributions inherits, almost as a free gift, an approximately Gaussian distribution.
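A sketch of this universality (the parent distributions, sample size, and repetition count are arbitrary choices): standardise the sample means of three very different distributions and check that each behaves like a standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 1_000, 10_000

# Three decidedly non-Gaussian parents: (sampler, true mean, true standard deviation)
parents = {
    "Bernoulli(0.1)": (lambda s: rng.binomial(1, 0.1, s), 0.1, np.sqrt(0.1 * 0.9)),
    "Exponential(1)": (lambda s: rng.exponential(1.0, s), 1.0, 1.0),
    "fair die":       (lambda s: rng.integers(1, 7, s),   3.5, np.sqrt(35 / 12)),
}

for name, (draw, mu, sigma) in parents.items():
    means = draw((reps, n)).mean(axis=1)
    z = np.sqrt(n) * (means - mu) / sigma          # standardised sample means
    # If the CLT holds, z should look like N(0, 1): mean ~0, std ~1, ~95% inside ±1.96
    print(f"{name:15s} mean {z.mean():+.3f}  std {z.std():.3f}  "
          f"P(|z|<1.96) {np.mean(np.abs(z) < 1.96):.3f}")
```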
A worked example. Take 1{,}000 independent rolls of a fair die. The sample mean has true mean 3.5 and variance $2.92/1000 = 0.00292$, so its standard error is $\sigma/\sqrt{n} = \sqrt{0.00292} \approx 0.054$. By the CLT the sample mean is approximately $\mathcal{N}(3.5, 0.054^2)$. So the probability that the sample mean differs from 3.5 by more than 0.1 is approximately $$ P(|\bar X_{1000} - 3.5| > 0.1) \;\approx\; P\!\left(|Z| > \tfrac{0.1}{0.054}\right) \;=\; P(|Z| > 1.85) \;\approx\; 0.064. $$ About a 6% chance, comfortably small. Compare that with what Chebyshev's inequality would give you: $P(|\bar X_{1000} - 3.5| > 0.1) \leq \sigma^2/(n \cdot 0.1^2) = 0.292$. The CLT replaces the loose Chebyshev bound of 29% with the tight Gaussian estimate of 6%. That is the practical payoff of having a distribution rather than only a bound.
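The same arithmetic, checked numerically (a small sketch using SciPy's Gaussian tail function; the numbers are those from the example above):

```python
import numpy as np
from scipy.stats import norm

n, var = 1_000, 35 / 12
se = np.sqrt(var / n)                        # standard error, ~0.054

clt_prob = 2 * norm.sf(0.1 / se)             # CLT: P(|Z| > 0.1/se), ~0.064
chebyshev_bound = var / (n * 0.1 ** 2)       # Chebyshev: sigma^2 / (n * eps^2), ~0.292

print(f"CLT estimate {clt_prob:.3f}   Chebyshev bound {chebyshev_bound:.3f}")
```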
Why does it work? The cleanest proof uses characteristic functions, $\varphi_X(t) = \mathbb{E}[e^{itX}]$. Standardise to $Y_i = (X_i - \mu)/\sigma$ so that $\mathbb{E}[Y_i] = 0$ and $\mathrm{Var}(Y_i) = 1$. Taylor-expand near zero: $\varphi_Y(t) = 1 - t^2/2 + o(t^2)$. The standardised sample mean $S_n = (1/\sqrt{n})\sum_i Y_i$ has characteristic function $$ \varphi_{S_n}(t) \;=\; \big[\varphi_Y(t/\sqrt{n})\big]^n \;=\; \left(1 - \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right)^{\!n} \;\to\; e^{-t^2/2}, $$ which is precisely the characteristic function of $\mathcal{N}(0, 1)$. Lévy's continuity theorem upgrades pointwise convergence of $\varphi$ to convergence in distribution, and the CLT falls out. The structural reason for universality is the squared term: every distribution with finite variance has the same quadratic dent at the origin of its characteristic function, and that quadratic dent is what the limit cares about.
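The limit in that display can also be watched numerically. A quick check for the standardised die (purely illustrative; the value $t = 1$ is an arbitrary choice): compute $[\varphi_Y(t/\sqrt{n})]^n$ directly and compare it with $e^{-t^2/2}$.

```python
import numpy as np

sigma = np.sqrt(35 / 12)
y = (np.arange(1, 7) - 3.5) / sigma          # standardised die outcomes Y = (X - mu)/sigma

def phi_Sn(t, n):
    # characteristic function of the standardised sample mean: [phi_Y(t/sqrt(n))]^n
    phi_Y = np.mean(np.exp(1j * (t / np.sqrt(n)) * y))
    return (phi_Y ** n).real                 # the die is symmetric, so the value is real

t = 1.0
for n in [1, 10, 100, 1_000]:
    print(f"n={n:5d}  phi_Sn({t}) = {phi_Sn(t, n):.4f}   target exp(-t^2/2) = {np.exp(-t**2 / 2):.4f}")
```

At $n = 1$ the value is noticeably off; by $n = 1{,}000$ it has essentially reached $e^{-1/2} \approx 0.607$.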
Why the CLT matters in AI
Once you start looking, the Central Limit Theorem is everywhere in machine learning.
- SGD convergence analysis. A mini-batch gradient is the average of per-example gradients. The CLT says this average is approximately Gaussian around the true gradient with covariance shrinking as $1/B$ in batch size $B$. The whole literature of "SGD is approximately Langevin dynamics with Gaussian noise" rests on this.
- Confidence intervals. The "$\bar X_n \pm 1.96\, s/\sqrt{n}$" recipe is a CLT consequence; a code sketch follows this list. It works for almost any large-sample mean: accuracy on a held-out test set, click-through rate in an A/B test, perplexity on a validation corpus.
- Hypothesis tests. Z-tests and t-tests use Gaussian (or near-Gaussian) reference distributions for sample means. The justification for treating that reference as Gaussian is the CLT.
- The bootstrap. When you resample your dataset with replacement and recompute a statistic, the empirical distribution of bootstrap replicates is approximately Gaussian around the original estimate. That is why bootstrap percentile intervals work.
- Why noise looks Gaussian. Many real-world noise sources, sensor readings, gene-expression levels, exam marks, are sums of many small effects. The CLT explains why our standard "$y = f(x) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$" assumption is so often a good approximation.
- Berry-Esseen bounds. A finite-sample sharpening of the CLT, $|F_n(x) - \Phi(x)| \leq C\,\mathbb{E}[|X-\mu|^3]/(\sigma^3\sqrt{n})$ with $C \approx 0.4748$, tells you how close to Gaussian $\bar X_n$ already is at the $n$ you actually have. Useful when you need to defend, not assume, the Gaussian approximation.
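A sketch of the confidence-interval recipe from the second bullet (the dataset is simulated; the true accuracy of 0.87 and the sample size of 2,000 are made-up values purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pretend held-out evaluation: 2,000 examples, each scored 1 (correct) or 0 (wrong).
correct = rng.binomial(1, 0.87, size=2_000)

acc = correct.mean()                                   # observed accuracy (a sample mean)
se = correct.std(ddof=1) / np.sqrt(len(correct))       # s / sqrt(n), the standard error

# CLT-based 95% confidence interval: sample mean ± 1.96 standard errors
print(f"accuracy {acc:.3f}   95% CI [{acc - 1.96 * se:.3f}, {acc + 1.96 * se:.3f}]")
```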
Almost every confidence-bearing statement in modern ML, from a 95% interval on test accuracy to a $p$-value on whether a new model beats a baseline, silently invokes the CLT.
Caveats
The limit theorems are powerful but they have conditions. Ignoring them produces silently wrong answers.
- Finite mean and variance. The LLN needs a finite mean; the CLT additionally needs finite variance. The Cauchy distribution has neither, and its sample mean has the same Cauchy distribution as a single observation: averaging never helps (see the sketch after this list). Stock returns, file sizes, and word frequencies often have heavier tails than people expect; check before assuming Gaussianity.
- The $1/\sqrt{n}$ rate. To halve your standard error you must quadruple your sample size. Diminishing returns set in quickly: going from 100 to 10{,}000 samples cuts the error by 10×, and the next factor-of-10 reduction requires another hundredfold jump, to 1{,}000{,}000 samples. This is the asymptotic shape of every "more data" argument in ML.
- Heavy tails converge slowly. When the underlying distribution has high skew or heavy tails, the third-moment factor in the Berry-Esseen bound is large and the Gaussian approximation can be poor at moderate $n$. Heavy-tailed gradient noise has been observed empirically in transformer training, and in those regimes it is better modelled by $\alpha$-stable distributions than by a Gaussian.
- Independence and identical distribution. The classical CLT assumes both. Real ML data are often correlated (consecutive frames of video, neighbouring tokens in text) or non-stationary (drifting test distributions). Generalised forms exist, the Lindeberg-Feller, martingale, and dependent-sequence CLTs, but each comes with its own conditions that must be checked.
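To see the first caveat in action, a small sketch (sample sizes and repetition count are arbitrary): the sample mean of Cauchy draws is just as spread out at $n = 10{,}000$ as at $n = 10$.

```python
import numpy as np

rng = np.random.default_rng(4)

# The sample mean of n standard Cauchy draws is itself standard Cauchy,
# so its spread does not shrink with n: averaging buys nothing.
for n in [10, 1_000, 10_000]:
    means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    print(f"n={n:6d}  interquartile range of sample means {q3 - q1:.2f}")   # stays near 2
```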
What you should take away
- The Law of Large Numbers says that as the sample size grows, the sample mean converges to the population mean. This is why empirical risk, Monte Carlo estimates, and stochastic gradients work at all.
- The Central Limit Theorem says that, after rescaling by $\sqrt{n}$, the residual $\bar X_n - \mu$ is approximately Gaussian, regardless of the underlying distribution, provided the variance is finite.
- The standard error $\sigma/\sqrt{n}$ is the universal yardstick for how much sample means jiggle. Halving it costs a fourfold increase in data.
- Confidence intervals, hypothesis tests, the bootstrap, and SGD noise analysis all lean on the CLT. When you write $\pm 1.96\,s/\sqrt{n}$, you are using it.
- Conditions matter. Heavy tails, infinite variance, and dependent samples all break the classical theorems. When you suspect your data might violate the assumptions, reach for the inequalities of §4.8 or a non-asymptotic concentration bound instead.