5.1 Why Statistics for AI

Statistics is the discipline of learning from data. Probability theory, which we developed in the previous chapter, runs the inference in one direction: it begins with a known mechanism (a coin with bias $p$, a Gaussian with mean $\mu$ and variance $\sigma^2$, a Markov chain with a fixed transition matrix) and asks what data such a mechanism is likely to produce. Statistics runs the inference in the opposite direction. We are handed a finite pile of observations $\mathcal{D}$ and asked to reason backwards to the mechanism that generated them. We do not know the parameters; we only see their footprints in the data. Our task is to reconstruct the footprints' owner well enough to make decisions, predictions, or scientific claims.

Artificial intelligence sits squarely on this side of the divide. Every model in this textbook, from a simple logistic regression to a hundred-billion-parameter language model, is fit to data and then judged on data it has not seen. The fitting step is a statistical estimation problem: given the training set, choose the parameters that best explain it. The judging step is a statistical evaluation problem: given a held-out sample, decide whether the model has captured the underlying regularity or merely memorised the noise. There is no escape from statistics in modern AI. Even systems that are deployed end-to-end without explicit probabilistic vocabulary are, under the hood, performing inference about an unobserved data-generating process.

This section pins down what statistics is for, why Chapters 6 to 15 lean on it, and three ideas that recur throughout: the five core tasks, the two philosophical schools, and the bias–variance–noise decomposition.

Symbols Used Here
$\theta$: parameters of a statistical model
$\mathcal{D}$: data (the observed sample)
$\hat\theta$: estimator (a function of the data that guesses $\theta$)
$\mathbb{E}[\hat\theta]$: expected value of the estimator over draws of data
$\text{Var}(\hat\theta)$: variance of the estimator over draws of data
$\text{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta$: bias of the estimator

What statistics actually does

Strip away the jargon and statistics performs five core tasks. Almost everything we do later in the book reduces to one of these, sometimes several at once.

1. Point estimation. Given the data, return a single best guess $\hat\theta$ for the parameter $\theta$. The guess might be a probability, a mean, a regression coefficient, or, in the case of a neural network, a vector with billions of entries. The guess is a function of the data: change the data, change the guess. Common procedures include the sample mean, the sample variance, ordinary least squares, and maximum likelihood, all of which we develop in §5.4 and §5.5.

2. Interval estimation. A point estimate alone is dishonest, because it pretends the data fixed the answer when in fact a different sample would have given a different answer. Interval estimation attaches an honest measure of uncertainty: the confidence interval (frequentist) or credible interval (Bayesian). We say something like "the treatment effect is 4.2 mmHg, 95% CI [1.1, 7.3]". The width of the interval tells the reader how much trust the number deserves.

3. Hypothesis testing. Given a specific claim about $\theta$ (for example, $\theta = 0$, meaning a drug has no effect), decide whether the observed data are compatible with that claim or whether the claim should be rejected. This is the territory of $p$-values, significance levels, and the two error types (Type I and Type II), all developed in §5.8.

4. Prediction. Given the estimated model, forecast the value of a new observation. This is the bread-and-butter task of supervised machine learning: we fit a model on training data, then predict the label of a previously unseen input. Prediction is statistics-flavoured because we are using a model fitted to a finite sample to make claims about a population we have not fully seen.

5. Model checking. Even a perfectly fitted model is worthless if the model class itself is wrong. Are the residuals patternless or do they show systematic curvature? Does the assumed Gaussian noise have the right tails? Does our independence assumption hold once we look at consecutive observations? Model checking is the statistical immune system, and skipping it is how published results turn into retractions.

To make these tasks concrete, picture a small clinical trial. We compare drug A to drug B for blood pressure reduction in 200 hypertensive patients, randomised one-to-one. The five tasks unfold as follows. We first compute a point estimate of the average treatment effect, the difference in mean blood-pressure change between groups, perhaps $-4.2$ mmHg in favour of A. We then construct a 95% confidence interval, perhaps $[-7.3, -1.1]$, which tells us that the data are compatible with treatment effects ranging from one to seven mmHg of benefit. We test the null hypothesis that the true effect is zero, obtaining a $p$-value (perhaps $p = 0.01$) that we interpret carefully. We then use the fitted model to predict the response of a new 65-year-old patient with starting pressure 150/95. Finally, we check the residuals for skewness, look for outliers, and verify that the assumption of constant variance across treatment arms holds. Only then do we trust the result.
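
The sketch below runs the first three tasks, plus a crude model check, on simulated data of roughly this shape. All numbers (effect size, noise level, sample sizes) are invented for illustration, and the prediction step is omitted for brevity; this is a minimal sketch, not a recipe for analysing a real trial.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated trial: 100 patients per arm, blood-pressure change in mmHg.
# The true effect of -4 mmHg and noise SD of 10 are invented for illustration.
drug_a = rng.normal(loc=-12.0, scale=10.0, size=100)
drug_b = rng.normal(loc=-8.0, scale=10.0, size=100)

# 1. Point estimate of the average treatment effect (A minus B).
effect = drug_a.mean() - drug_b.mean()

# 2. 95% confidence interval for the difference in means.
se = np.sqrt(drug_a.var(ddof=1) / len(drug_a) + drug_b.var(ddof=1) / len(drug_b))
df = len(drug_a) + len(drug_b) - 2        # crude df; Welch-Satterthwaite is the finer choice
t_crit = stats.t.ppf(0.975, df)
ci = (effect - t_crit * se, effect + t_crit * se)

# 3. Hypothesis test of "no effect" (two-sided Welch t-test).
t_stat, p_value = stats.ttest_ind(drug_a, drug_b, equal_var=False)

# 5. A first model check: are the within-arm residuals roughly symmetric?
skew_a, skew_b = stats.skew(drug_a), stats.skew(drug_b)

print(f"effect = {effect:.1f} mmHg, 95% CI [{ci[0]:.1f}, {ci[1]:.1f}], p = {p_value:.3f}")
print(f"skewness: A {skew_a:.2f}, B {skew_b:.2f}")
```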

This pattern (estimate, quantify uncertainty, test, predict, check) repeats for every AI model you will ever build, with the same five steps wearing different costumes. Training a classifier? The fitted weights are point estimates; the held-out accuracy comes with a binomial confidence interval; the comparison against a baseline is a hypothesis test; the deployed model is a predictor; the calibration plot is a model check. Tuning a recommender system? Same five tasks. Evaluating a large language model on a benchmark? Same five tasks. The cosmetics differ: sometimes we call the steps "training", "evaluation", "ablation", "shipping", or "monitoring", but the underlying operations are the ones a Victorian agronomist would recognise. Statistics is the connective tissue that lets us read any of these problems with the same eye.

A practical consequence is worth flagging now: many of the most expensive mistakes in AI come from skipping one of the five. A team that reports a single point estimate of accuracy without an interval has hidden the variance. A team that runs hundreds of unflagged comparisons has destroyed the meaning of any individual $p$-value. A team that never inspects residuals or per-subgroup performance has skipped model checking. Each of these failures has a name, a remedy, and a section number later in this chapter.

Frequentist vs Bayesian, again

Statistics has two grand philosophical traditions, and the divide matters because modern AI uses both, often in the same paper. We met the contrast briefly in the probability chapter; we revisit it here because it now becomes operational.

The frequentist view treats $\theta$ as a fixed unknown constant out there in the world. Probability lives only in the data: had we drawn a different sample, we would have computed a different estimate. An estimator $\hat\theta$ is therefore a random variable (because the data is random), and we judge it by its sampling distribution, the distribution of values it would take across imagined re-runs of the same experiment. A 95% confidence interval is a procedure that, applied across hypothetical re-runs, would cover the true $\theta$ in 95% of them. The concept is subtle and frequently mis-stated: any given interval either contains $\theta$ or it does not; the 95% refers to the procedure, not to a particular interval.
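
A small simulation makes the "procedure, not interval" reading concrete. Assuming Gaussian data with a known true mean (all values invented), the standard t-interval covers the truth in roughly 95% of re-runs, even though each individual interval simply does or does not contain it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mu, sigma, n, runs = 5.0, 2.0, 30, 10_000

covered = 0
for _ in range(runs):
    sample = rng.normal(true_mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n - 1)
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += (lo <= true_mu <= hi)   # each interval either covers mu or it does not

print(f"coverage over {runs} re-runs: {covered / runs:.3f}")   # close to 0.95
```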

The Bayesian view treats $\theta$ itself as uncertain, not because $\theta$ wobbles in the world, but because we do not know which value it is. Probability is a degree of belief, codified by a prior $p(\theta)$ before we see data and a posterior $p(\theta \mid \mathcal{D})$ afterwards, linked by Bayes' theorem. The data, once observed, is fixed. A 95% credible interval is the interval that contains 95% of the posterior probability mass; one can directly say "there is a 95% chance the parameter is in this range", which is the natural reading most working scientists default to in any case.
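
For contrast, here is the Bayesian version of the same kind of statement in the simplest conjugate setting: a Beta(1, 1) prior on a coin's bias with invented count data. The credible interval is read directly off the posterior, so the "there is a 95% chance" sentence is legitimate.

```python
from scipy import stats

# Observed data: 7 heads out of 10 flips (invented numbers).
heads, flips = 7, 10

# Beta(1, 1) prior on the bias; the posterior is Beta(1 + heads, 1 + tails).
posterior = stats.beta(1 + heads, 1 + (flips - heads))

# 95% credible interval: the central 95% of posterior probability mass.
lo, hi = posterior.ppf([0.025, 0.975])
print(f"P(bias in [{lo:.2f}, {hi:.2f}] | data) = 0.95")
```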

Modern machine learning is unapologetically mixed. Empirical risk minimisation, the workflow of every standard neural network (minimise an average loss over a labelled training set, then report a test-set metric), is essentially frequentist: parameters are point estimates and uncertainty is quantified via held-out splits and bootstrapping. Bayesian ideas, meanwhile, underpin variational inference, Gaussian processes, generative modelling with priors over latent variables, Monte Carlo dropout, and almost every uncertainty-aware decision system. L2 weight decay is exactly MAP estimation under a Gaussian prior; the lasso is MAP under a Laplace prior; ensembling resembles posterior averaging. You will not be a complete practitioner if you only speak one of the two languages.
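
The weight-decay claim is a one-line calculation. Assuming a prior $\theta \sim \mathcal{N}(\mathbf{0}, \tau^2 I)$ and writing the negative log-likelihood as the training loss, maximising the log-posterior is the same as minimising

$$-\log p(\mathcal{D} \mid \theta) \;+\; \frac{1}{2\tau^2}\,\lVert \theta \rVert_2^2 \;+\; \text{const},$$

which is the usual loss plus an L2 penalty with coefficient $\lambda = 1/(2\tau^2)$: a tighter prior means stronger weight decay. Swapping in a Laplace prior with scale $b$ replaces the squared norm with $\lVert \theta \rVert_1 / b$, giving the lasso.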

We develop both, side by side, throughout the chapter: frequentist sampling distributions in §5.4 and §5.7, Bayesian inference in §5.6, hierarchical models in §5.12. The right pragmatic stance is not to pick a side but to recognise which question each tradition answers cleanly. If you want to know what would happen across many runs of the same procedure, a useful question for hardware reliability, A/B tests, and quality control, frequentist sampling distributions are the natural language. If you want to combine prior knowledge with limited data and end up with an explicit probability statement about the unknown, useful for clinical reasoning, scientific discovery and rare-event problems, Bayesian inference is the natural language. The mature practitioner switches fluently between the two, picking whichever framing makes the next decision clearer.

Bias–variance–noise decomposition

The single most useful equation in applied statistics, and arguably in all of supervised AI, is the bias–variance–noise decomposition. We take it now in its squared-error form. Suppose we want to predict $y$ from $\mathbf{x}$ with an estimator $\hat f$ fitted on a random training sample. At a fixed test point, the expected squared error of $\hat f$ decomposes into three pieces:

$$\mathbb{E}\big[(\hat f(\mathbf{x}) - y)^2\big] = \text{Var}\!\big(\hat f(\mathbf{x})\big) + \text{Bias}\!\big(\hat f(\mathbf{x})\big)^2 + \sigma^2_{\text{noise}}.$$

The expectation is over both the random training data (which makes $\hat f$ random) and the random test outcome $y$.

Each term has a clear meaning. Variance captures how wildly $\hat f$ swings from one training set to another: a high-variance method memorises its sample. Bias captures the systematic error, how far the typical prediction $\mathbb{E}[\hat f]$ sits from the true function: a high-bias method is too rigid to capture the signal. Noise, often denoted $\sigma^2_{\text{noise}}$, is the variability in $y$ that no model can ever predict from $\mathbf{x}$ alone, because the world is genuinely stochastic.

Two cartoons fix the pattern. A 1-nearest-neighbour classifier latches onto whichever single training point is closest, so its predictions change drastically with each new training set (high variance). In the limit of infinite data it converges to the true class boundary, so its asymptotic bias is low. A simple linear regression, by contrast, can only ever fit a hyperplane: it varies very little across resamples (low variance) but is condemned to be systematically wrong if the truth is curved (high bias). Most modern methods sit somewhere in between, and the engineer's job is to find the sweet spot.
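
The cartoon can be checked numerically. The sketch below (pure NumPy, with an invented sine-wave signal and noise level, recast as regression so the squared-error decomposition applies directly) refits a 1-nearest-neighbour predictor and a straight-line regression on thousands of fresh training sets and estimates each method's squared bias and variance at a single test point; expect the 1-NN column to be dominated by variance and the linear column by bias.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(x)                     # the (unknown) signal; invented for illustration

sigma_noise = 0.3                        # irreducible noise SD
x_test = 1.0                             # fixed test point
n_train, n_runs = 30, 5_000

preds_1nn, preds_lin = [], []
for _ in range(n_runs):
    x = rng.uniform(0, 2 * np.pi, size=n_train)
    y = true_f(x) + rng.normal(0, sigma_noise, size=n_train)

    preds_1nn.append(y[np.argmin(np.abs(x - x_test))])       # 1-nearest-neighbour
    slope, intercept = np.polyfit(x, y, deg=1)                # straight-line fit
    preds_lin.append(slope * x_test + intercept)

for name, preds in [("1-NN", np.array(preds_1nn)), ("linear", np.array(preds_lin))]:
    bias2 = (preds.mean() - true_f(x_test)) ** 2
    var = preds.var()
    print(f"{name:7s} bias^2 = {bias2:.3f}, variance = {var:.3f}, "
          f"noise = {sigma_noise**2:.3f}")
```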

In practice, the bias–variance balance is governed by three knobs: model capacity (number of parameters, depth of a tree, width of a network), regularisation strength (weight decay, dropout, early stopping), and training-set size. Bigger models lower bias but raise variance; stronger regularisation raises bias but lowers variance; more data lowers variance for free. The classical learning curve (training error climbing and validation error falling as data grows) is the bias–variance decomposition in action, and almost every diagnostic in §5.13 and §5.16 returns to this equation.
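
The data-size knob can be seen in a few lines: fit a fixed-capacity model (a degree-5 polynomial here, with all settings invented) on growing training sets and watch training error climb towards the noise floor while validation error falls towards it. A single draw per size keeps the sketch short; a real learning curve would average over several.

```python
import numpy as np

rng = np.random.default_rng(3)

def true_f(x):
    return np.sin(x)

sigma_noise, degree = 0.3, 5             # fixed model capacity
x_val = rng.uniform(0, 2 * np.pi, 500)   # large held-out set
y_val = true_f(x_val) + rng.normal(0, sigma_noise, 500)

for n in [10, 30, 100, 300, 1000]:
    x = rng.uniform(0, 2 * np.pi, n)
    y = true_f(x) + rng.normal(0, sigma_noise, n)
    coeffs = np.polyfit(x, y, degree)                     # fit a degree-5 polynomial
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"n = {n:4d}  train MSE = {train_mse:.3f}  validation MSE = {val_mse:.3f}")
```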

What this chapter covers

The remainder of Chapter 5 takes these ideas and turns them into working machinery. §5.2 contrasts frequentist and Bayesian thinking in detail, with worked examples. §5.3 collects the descriptive tools (means, medians, variances, quantiles, robust summaries) that every analysis begins with. §5.4 formalises estimators and their properties, including consistency, unbiasedness, and the mean-squared-error decomposition just sketched. §5.5 develops maximum likelihood estimation, the workhorse behind almost every supervised learning loss. §5.6 introduces MAP estimation and Bayesian inference in earnest. §5.7 builds confidence and credible intervals. §5.8 covers hypothesis testing, $p$-values, and the design of tests. §5.9 introduces the bootstrap as a non-parametric tool for everything we cannot do analytically. §5.10 and §5.11 develop linear and generalised linear models, the statistical chassis on which logistic regression, Poisson regression, and many neural-network output layers are built. §5.12 covers hierarchical and empirical Bayes models. §5.13 tackles model selection and cross-validation. §5.14 previews causal inference. §5.15 turns the lens onto AI itself: how we evaluate machine-learning systems with held-out splits, A/B tests, calibration plots, and fairness metrics. §5.16 returns to the bias–variance tradeoff with the full apparatus in place. §5.17 closes the chapter.

You should not treat this as a one-shot read. The ideas here recur in Chapters 6 through 15: every loss function is a likelihood, every regulariser is a prior, every learning-curve plot is the bias–variance decomposition, every benchmark number on a leaderboard is a confidence interval in disguise. We will refer back to this chapter constantly, and the time you invest in it now will repay itself in faster reading and surer judgement later.

What you should take away

  1. Statistics inverts probability: it reasons from observed data back to an unknown data-generating process, and AI is fundamentally a statistical activity.
  2. Five tasks recur in every analysis: point estimation, interval estimation, hypothesis testing, prediction, and model checking.
  3. Frequentist and Bayesian traditions both run through modern ML; you will need to read fluently in both, and most algorithms can be cast in either light.
  4. Expected squared error decomposes into variance, squared bias, and irreducible noise; the three knobs of model capacity, regularisation, and data size move these terms predictably.
  5. The vocabulary of weights, loss, and training is statistical vocabulary in disguise; understanding the underlying inference is what makes the difference between using a model and diagnosing it.
