Central Limit Theorem, Glossary, Textbook of AI

The Central Limit Theorem (CLT) is one of the most important results in probability theory and the cornerstone of classical statistical inference. In its classical (Lindeberg--Lévy) form, it states that for independent and identically distributed random variables $X_1, X_2, \ldots, X_n$ drawn from any distribution with finite mean $\mu$ and finite, non-zero variance $\sigma^2$, the standardised sample mean

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$$

converges in distribution to a standard normal as $n \to \infty$:

$$Z_n \xrightarrow{d} \mathcal{N}(0, 1).$$

Informally, $\bar{X}_n$ is approximately distributed as $\mathcal{N}(\mu, \sigma^2/n)$ for large $n$. Crucially, the shape of the underlying distribution does not matter, provided the variance is finite. Suitably normalised sums of heavy-tailed variables with infinite variance can instead converge to non-Gaussian stable distributions under the generalised central limit theorem; the Gaussian is the one stable distribution with finite variance.
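The convergence can be checked empirically. The following is a minimal simulation sketch, not a definitive experiment: it assumes an Exponential(1) population (so $\mu = \sigma = 1$), and the function names `standardised_means` and `tail_fractions`, the seed and the sample sizes are all illustrative choices. It computes $Z_n$ for many repeated samples and reports how much probability falls beyond $\pm 1.645$, where a standard normal would put 0.05 in each tail.

```python
import math
import random

def standardised_means(n, reps, rng):
    """Draw `reps` samples of size n from Exponential(1) and return each Z_n."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)
    zs = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

def tail_fractions(zs, z_crit=1.645):
    """Fractions of the Z_n values below -z_crit and above +z_crit."""
    lower = sum(z < -z_crit for z in zs) / len(zs)
    upper = sum(z > z_crit for z in zs) / len(zs)
    return lower, upper

if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed, only for repeatability
    for n in (2, 10, 100):
        lower, upper = tail_fractions(standardised_means(n, 4000, rng))
        print(f"n={n:4d}  lower tail={lower:.3f}  upper tail={upper:.3f}")
```

For $n = 2$ the lower tail is exactly empty, because $\bar{X}_n \ge 0$ forces $Z_n \ge -\sqrt{2}$; as $n$ grows, both tail fractions should move towards 0.05, with the skew of the exponential fading only gradually.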

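More general versions relax the i.i.d. assumption. The Lindeberg--Feller theorem allows independent but non-identically distributed variables, provided the Lindeberg condition holds; the Lyapunov version replaces that condition with a stronger but easier-to-check moment condition of order $2 + \delta$ (a third-moment condition when $\delta = 1$); martingale CLTs allow certain forms of dependence. Berry--Esseen inequalities quantify the rate of convergence: when the third absolute moment $\rho = \mathbb{E}|X - \mu|^3$ is finite, the largest gap between the distribution function of $Z_n$ and that of the standard normal is at most $C \rho / (\sigma^3 \sqrt{n})$ for a universal constant $C$, so the error is $O(1/\sqrt{n})$.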
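The $O(1/\sqrt{n})$ rate can be seen exactly in the Bernoulli case, where the distribution of the sum is binomial and no simulation is needed. The sketch below computes the exact largest gap between the distribution function of the standardised sum and the standard normal, and compares it with a Berry--Esseen-style bound. The function names are illustrative, and the constant `c = 0.56` is an assumed, deliberately conservative value; sharper constants are reported in the literature.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def binomial_pmf(n, k, p):
    """P(S = k) for S ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)

def kolmogorov_distance(n, p):
    """Exact sup_x |P(Z_n <= x) - Phi(x)| for the sum of n Bernoulli(p) variables."""
    sd = math.sqrt(n * p * (1.0 - p))
    cdf_before, worst = 0.0, 0.0
    for k in range(n + 1):
        z = (k - n * p) / sd
        cdf_after = cdf_before + binomial_pmf(n, k, p)
        phi = normal_cdf(z)
        # The supremum is reached at a jump, just before or just after it.
        worst = max(worst, abs(cdf_before - phi), abs(cdf_after - phi))
        cdf_before = cdf_after
    return worst

def berry_esseen_bound(n, p, c=0.56):
    """c * rho / (sigma^3 * sqrt(n)); c is an assumed conservative constant."""
    q = 1.0 - p
    rho = p * q * (p * p + q * q)  # E|X - p|^3 for Bernoulli(p)
    sigma_cubed = (p * q) ** 1.5
    return c * rho / (sigma_cubed * math.sqrt(n))

if __name__ == "__main__":
    for n in (25, 100, 400):
        d = kolmogorov_distance(n, 0.5)
        print(n, d, d * math.sqrt(n), berry_esseen_bound(n, 0.5))
```

If the rate is genuinely $1/\sqrt{n}$, the third printed column (the distance multiplied by $\sqrt{n}$) should stay roughly constant as $n$ increases.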

The CLT explains why the Gaussian distribution is so ubiquitous in nature and in statistics. Many real-world quantities -- measurement errors, biological traits like adult height, financial returns over short horizons, additive noise in physical processes -- arise as sums or averages of many small, roughly independent influences, and the CLT predicts that such quantities will be approximately Gaussian largely regardless of the details. In practice, the CLT "kicks in" surprisingly quickly: as a rule of thumb, sample sizes of $n = 30$ or more often suffice for a reasonable Gaussian approximation when the underlying distribution is not too skewed, though strongly skewed or heavy-tailed distributions can require much larger samples.

The CLT is the engine that drives classical statistical inference. It justifies the construction of approximate confidence intervals ($\bar{X} \pm 1.96\, s/\sqrt{n}$ for a 95% interval, where $s$ is the sample standard deviation) and hypothesis tests based on the normal distribution, even when the underlying data are far from Gaussian, provided the sample is large enough. Pearson's chi-squared test, the $z$-test, the Wald test for maximum-likelihood estimators, and many more rest on the CLT applied to sums or averages of the data.
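As a concrete illustration of the interval formula, here is a minimal sketch that computes $\bar{X} \pm z \, s/\sqrt{n}$ from a list of observations. The helper name `normal_ci` and the simulated exponential data in the usage lines are assumptions made for the example only.

```python
import math
import random
import statistics

def normal_ci(sample, z=1.96):
    """Approximate CLT-based confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, only for repeatability
    data = [rng.expovariate(1.0) for _ in range(200)]  # skewed data, true mean 1
    print(normal_ci(data))
```

The interval is only approximate: its nominal 95% coverage relies on $n$ being large enough for the normal approximation to hold, and for small samples a $t$-based interval is the more usual choice.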

In machine learning, the CLT underlies the statistical properties of mini-batch gradient estimates in stochastic gradient descent: for a sufficiently large batch size $B$, each batch gradient is approximately Gaussian around the true gradient, with variance scaling as $1/B$. It also supports bootstrap confidence intervals for performance metrics, the analysis of stochastic approximation algorithms, the asymptotic normality of maximum-likelihood parameter estimates, and the construction of approximate confidence intervals around predictions. The delta method, which propagates Gaussian uncertainty through smooth nonlinear transformations, follows from the CLT combined with a first-order Taylor expansion.
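The $1/B$ scaling of mini-batch variance can be checked numerically. The sketch below is a simplified illustration under stated assumptions: per-example gradients are represented by a fixed list of scalars (here synthetic, skewed values), and batches are drawn with replacement so that the batch-mean variance equals the population variance divided by $B$. Real training loops usually sample without replacement and work with vector gradients; `batch_mean_variance` is a hypothetical helper name.

```python
import random
import statistics

def batch_mean_variance(per_example_grads, batch_size, trials, rng):
    """Empirical variance of the mini-batch mean over `trials` random batches."""
    means = []
    for _ in range(trials):
        # Sample a batch with replacement (a simplifying assumption).
        batch = rng.choices(per_example_grads, k=batch_size)
        means.append(statistics.fmean(batch))
    return statistics.variance(means)

if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, only for repeatability
    # Synthetic, skewed scalar "gradients" standing in for a real dataset.
    grads = [rng.expovariate(1.0) - 1.0 for _ in range(1000)]
    population_variance = statistics.pvariance(grads)
    for b in (4, 16, 64):
        observed = batch_mean_variance(grads, b, 4000, rng)
        print(b, observed, population_variance / b)
```

Each fourfold increase in batch size should cut the observed variance by roughly a factor of four, so the noise standard deviation only halves.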

Historically, the theorem was first stated for Bernoulli trials by Abraham de Moivre in 1733, generalised by Pierre-Simon Laplace in 1810, and placed on a rigorous general footing by Lyapunov (1901) and Lindeberg (1922).

Interactive

The sampling distribution emerges. Repeated samples from a non-normal population produce Gaussian sample means.

Sums from any finite-variance distribution become Gaussian. Roll one die, then two, then ten. The distribution of the average converges to a bell curve.
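The dice experiment can also be reproduced exactly, without simulation. The following sketch builds the exact distribution of the total of $n$ fair six-sided dice by repeated convolution and measures how far it sits from the normal density with the same mean ($3.5n$) and variance ($35n/12$). The average is just the total divided by $n$, so it has the same shape. The function names and the choice of gap measure are illustrative assumptions.

```python
import math

def dice_sum_pmf(n_dice):
    """Exact pmf of the total of n fair six-sided dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n_dice):
        nxt = {}
        for total, prob in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + prob / 6.0
        pmf = nxt
    return pmf

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def relative_gap(n_dice):
    """Largest |pmf - normal density| over all totals, relative to the normal peak."""
    mean = 3.5 * n_dice
    sd = math.sqrt(n_dice * 35.0 / 12.0)  # one die has variance 35/12
    pmf = dice_sum_pmf(n_dice)
    worst = max(abs(p - normal_pdf(t, mean, sd)) for t, p in pmf.items())
    return worst / normal_pdf(mean, mean, sd)

if __name__ == "__main__":
    for n in (1, 2, 10):
        print(n, relative_gap(n))
```

One die gives a flat distribution, two dice a triangle, and by ten dice the exact probabilities should track the bell curve closely, which is what the shrinking relative gap reflects.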

Discussed in:

Chapter 4: Probability, Probability and Statistics
Chapter 6: ML Fundamentals, Machine Learning