- Summarise data using measures of central tendency and spread, and choose appropriate summaries for the data type
- Distinguish population from sample and construct confidence intervals to quantify estimation uncertainty
- Formulate and test statistical hypotheses, interpreting p-values and the risks of Type I and Type II errors
- Derive parameter estimates via maximum likelihood estimation and connect MLE to loss minimisation
- Decompose prediction error into bias, variance, and irreducible noise and use the tradeoff to diagnose models
Probability asks: "Given a known process, what data will we see?" Statistics asks the reverse: "Given observed data, what process produced them?" This inverse question is the heart of machine learning. You always have a finite sample and must draw conclusions about the broader world — making predictions, estimating parameters, and judging whether patterns are real or just noise.
This chapter builds the statistical foundations you need. You will start with descriptive statistics, move through sampling and estimation, hypothesis testing, and maximum likelihood, and finish with the bias–variance tradeoff — the conceptual cornerstone that governs every learning algorithm. For deeper treatment, see Hastie, Tibshirani, and Friedman (Hastie, 2009) and Bishop (Bishop, 2006).
5.1 Descriptive Statistics
Before fitting any model, look at your data. Descriptive statistics give you the tools for that first look.
Central Tendency
- Mean: x̄ = (1/n) Σ xi. Minimises the sum of squared deviations. Sensitive to outliers — one extreme value can shift it a lot.
- Median: the middle value when sorted. Far more robust to outliers. Minimises the sum of absolute deviations.
- Mode: the most frequent value. The only measure that works for categorical data.
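As a minimal sketch on a made-up sample with one extreme value, the standard library shows how the mean is dragged by an outlier while the median and mode are not:

```python
import statistics

# Small hypothetical sample with one outlier (1000)
data = [2, 3, 3, 5, 7, 1000]

mean = statistics.mean(data)      # pulled far upward by the outlier
median = statistics.median(data)  # robust: middle of the sorted values
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # mean is 170, yet the median is only 4
```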
Spread
- Variance: s^2^ = (1/(n − 1)) Σ (xi − x̄)^2^. The n − 1 (Bessel's correction) makes this an unbiased estimate of the population variance.
- Standard deviation: s = √s^2^. Shares the units of the original data, making it easier to interpret.
- Interquartile range (IQR): the gap between the 75th and 25th percentiles. A robust alternative to the standard deviation.
- Range: maximum minus minimum. Simple but heavily affected by outliers.
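A sketch of these spread measures on a small hypothetical sample; note that `statistics.variance` already applies Bessel's n − 1 correction, and the quartiles come from `statistics.quantiles` with its default interpolation:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

var = statistics.variance(data)   # sample variance, n - 1 denominator
sd = statistics.stdev(data)       # same units as the data

# quantiles(..., n=4) returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                     # robust spread measure
rng = max(data) - min(data)       # simple but outlier-sensitive
```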
Shape
- Skewness measures asymmetry. Income distributions are typically right-skewed (long upper tail).
- Kurtosis measures tail heaviness relative to a Gaussian. Financial returns are often heavy-tailed, which is why Gaussian risk models can underestimate extreme events.
In AI, skewness and kurtosis guide the choice of data transforms (e.g., log or Box–Cox to reduce skew) and the choice of distributional assumptions in generative models.
Multivariate Summaries
For data with many features, the covariance matrix and correlation matrix capture pairwise linear links. Scatter plot matrices and correlation heat maps reveal clusters of redundant features. Spotting this early is key for feature engineering and for avoiding issues like multicollinearity in regression. For very high-dimensional data, automated tools (variance inflation factors, PCA) take over from visual inspection.
Why This Matters
It is tempting to skip straight to modelling. That is risky. Anscombe's quartet — four datasets with identical means, variances, and regression lines but wildly different scatter plots — shows that summary statistics can hide important structure. Always plot your data. Histograms, box plots, and scatter plots reveal outliers, class imbalance, missing-value patterns, and data-collection errors that would silently corrupt your models.
5.2 Sampling & Estimation
You observe a sample — a finite subset of a larger population — and want to draw conclusions about the population. The quality of your conclusions depends on how the sample was obtained.
Sampling
A simple random sample gives every possible subset equal probability. Other schemes — stratified, cluster, systematic — exploit known population structure to improve efficiency. In machine learning, the i.i.d. assumption (training examples drawn independently from a fixed distribution) is a form of random-sampling assumption. When this breaks — due to selection bias or distribution shift — models can fail badly in deployment.
Estimators
An estimator is a function of the sample that targets a population quantity. The sample mean X̄ estimates the population mean μ. Because the sample is random, the estimator is itself random — it has a sampling distribution.
Key properties of estimators:
- Bias: E[θ̂] − θ. An unbiased estimator has zero bias.
- Consistency: the estimator converges to the true value as n → ∞.
- Efficiency: among unbiased estimators, the one with the smallest variance wins.
The Central Limit Theorem
The CLT says that, for a sample of size n from a distribution with mean μ and finite variance σ^2^, the standardised sample mean:
(X̄ − μ) / (σ / √n)
converges to a standard normal as n → ∞. This holds regardless of the original distribution. It justifies normal-based confidence intervals and tests, even for non-Gaussian data. In practice, n ≥ 30 often suffices, though heavy-tailed data may need more.
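A quick simulation sketch of the CLT: draws from a heavily skewed exponential distribution (mean 1, sd 1), yet the distribution of sample means is approximately normal around μ = 1 with spread near σ/√n. The sample sizes and replication counts here are illustrative choices:

```python
import random
import statistics

random.seed(0)

# Mean of n = 30 draws from a skewed distribution (exponential, mean 1, sd 1)
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(30) for _ in range(2000)]

# CLT: means cluster near mu = 1 with spread near sigma / sqrt(n) ≈ 0.18
m_avg = statistics.fmean(means)
m_sd = statistics.stdev(means)
print(round(m_avg, 3), round(m_sd, 3))
```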
Confidence Intervals
A confidence interval is a range that, over repeated sampling, contains the true parameter with a stated probability (typically 95%). For the mean with known variance: X̄ ± 1.96 σ/√n. With unknown variance, replace σ with the sample standard deviation s and use the t-distribution with n − 1 degrees of freedom.
The width shrinks as √n: more data gives more precision, but with diminishing returns. In AI, confidence intervals quantify uncertainty in performance metrics, predictions, and A/B tests.
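A minimal sketch of a 95% t-interval on a made-up sample of ten measurements; the t critical value for 9 degrees of freedom (≈ 2.262) is hard-coded here rather than looked up from a table:

```python
import math
import statistics

data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2, 4.7, 5.4, 5.0]  # hypothetical measurements
n = len(data)
xbar = statistics.fmean(data)
s = statistics.stdev(data)          # sample sd, since sigma is unknown

t_crit = 2.262                      # 97.5th percentile of t with n - 1 = 9 df
half_width = t_crit * s / math.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```

Quadrupling n would only halve `half_width` — the diminishing returns of the √n rate.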
Sufficient Statistics and Efficiency Bounds
A sufficient statistic captures all the information in the data relevant to a parameter. For a Gaussian mean with known variance, the sample mean is sufficient. The Rao–Blackwell theorem says conditioning any estimator on a sufficient statistic yields an equal or better estimator. The Cramér–Rao lower bound gives an absolute floor on the variance of any unbiased estimator, defined by the Fisher information. These results tell you when you have found the best possible estimator.
5.3 Hypothesis Testing
Hypothesis testing gives you a formal framework for making yes/no decisions under uncertainty.
The Setup
- State a null hypothesis H0 (the status quo) and an alternative H1 (the claim you want to evaluate).
- Compute a test statistic from the data.
- Compare it to the distribution under H0.
- If it falls in the rejection region (determined by the significance level α, usually 0.05), reject H0.
Two Types of Error
- Type I (false positive): rejecting H0 when it is true. Probability = α.
- Type II (false negative): failing to reject H0 when it is false. Probability = β.
- Power = 1 − β: the probability of correctly rejecting a false H0.
In AI, Type I errors are false positives (a spam filter flagging good email) and Type II errors are false negatives (spam reaching the inbox). The parallel is direct.
P-Values
The p-value is the probability, under H0, of getting a result at least as extreme as what you observed. A small p-value is evidence against H0.
Critical caveats:
- The p-value is not the probability that H0 is true.
- Statistical significance does not imply practical significance. With enough data, trivially small effects can yield tiny p-values.
The ML community has increasingly moved toward reporting effect sizes, confidence intervals, and (in Bayesian settings) posterior probabilities. But hypothesis testing remains required in clinical AI, where regulators mandate controlled trials.
Common Tests
- z-test: large samples, known variance.
- t-test: small samples, unknown variance.
- Chi-squared test: categorical data.
- F-test: comparing variances or nested models.
- Non-parametric alternatives (Wilcoxon, Kruskal–Wallis, permutation tests) make fewer assumptions and are increasingly used to compare model performance.
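A sketch of a two-sample permutation test on hypothetical model accuracies: shuffle the group labels many times and ask how often a label-shuffled mean difference is at least as extreme as the observed one. The accuracy values and the number of permutations are illustrative:

```python
import random
import statistics

random.seed(1)

# Hypothetical accuracies of two models on repeated evaluation splits
a = [0.81, 0.83, 0.80, 0.84, 0.82, 0.85]
b = [0.78, 0.80, 0.77, 0.79, 0.81, 0.78]

observed = statistics.fmean(a) - statistics.fmean(b)
pooled = a + b

# Shuffle labels, recompute the mean difference, count extreme outcomes
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.fmean(pooled[:6]) - statistics.fmean(pooled[6:])
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
```

No normality assumption is needed — the null distribution is built directly from the data.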
Multiple Testing
Testing 100 hypotheses at α = 0.05 produces ~5 false positives by chance. The Bonferroni correction divides α by the number of tests (conservative). The Benjamini–Hochberg procedure controls the false discovery rate (FDR) — the expected proportion of rejections that are wrong — and is widely used in genomics, neuroscience, and feature selection. In ML, evaluating hundreds of hyperparameter configs on a validation set raises the same concern, motivating held-out test sets and nested cross-validation.
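The Benjamini–Hochberg step-up procedure can be sketched in a few lines: sort the p-values, find the largest rank k with p(k) ≤ (k/m)·α, and reject the k smallest. The p-values below are made up for illustration:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return indices of hypotheses rejected at FDR level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k with p_(k) <= (k / m) * alpha
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    # Reject the k_max smallest p-values
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.5]
rejected = benjamini_hochberg(pvals)
```

Here Bonferroni (threshold 0.05/10 = 0.005) would reject only the first hypothesis; BH also rejects the second, illustrating its extra power at the same nominal error control.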
Bayesian Alternative
Bayes factors compare the marginal likelihoods of two models: BF10 = P(D | H1) / P(D | H0). Unlike p-values, Bayes factors can provide evidence for the null, not just fail to reject it. They naturally penalise complexity through the marginal likelihood — a built-in Occam's razor.
5.4 Maximum Likelihood Estimation
MLE is the most widely used method for fitting parametric models in machine learning.
The Idea
Given data D = {x1, …, xn} drawn i.i.d. from p(x | θ), the likelihood is:
L(θ) = ∏i=1^n^ p(xi | θ)
The MLE is the value of θ that maximises L(θ). In practice, you maximise the log-likelihood (sums are easier than products):
ℓ(θ) = Σi=1^n^ log p(xi | θ)
Examples
- Gaussian mean: MLE = sample mean.
- Gaussian variance: MLE = sample average of squared deviations (biased, but the bias vanishes as n grows).
- Bernoulli p: MLE = sample proportion of successes.
- Logistic regression, neural networks: no closed form. Minimise the negative log-likelihood with gradient descent.
The key insight: training a classifier with cross-entropy loss is MLE.
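As a numeric sketch of the Bernoulli example above, using a small made-up coin-flip sample: the log-likelihood, evaluated on a grid, peaks exactly at the sample proportion, matching the closed-form MLE.

```python
import math

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = sum(flips) / len(flips)   # closed-form MLE: sample proportion = 0.7

def loglik(p):
    # log-likelihood of an i.i.d. Bernoulli sample at parameter p
    return sum(math.log(p if x == 1 else 1 - p) for x in flips)

# A coarse grid search lands on the same value (log-likelihood is concave)
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=loglik)
```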
Large-Sample Properties
Under mild conditions, the MLE is (Hastie, 2009):
- Consistent: converges to the true parameter as n → ∞.
- Asymptotically normal: its distribution approaches a Gaussian.
- Asymptotically efficient: its variance hits the Cramér–Rao lower bound.
This makes MLE the default when sample sizes are large relative to parameters. But in high dimensions (deep learning), MLE can overfit — motivating regularisation.
MAP Estimation
The maximum a posteriori (MAP) estimator adds a prior:
θ̂MAP = argmaxθ [log p(D | θ) + log p(θ)]
- A Gaussian prior gives L2 regularisation (weight decay).
- A Laplace prior gives L1 regularisation (sparsity).
MAP unifies frequentist and Bayesian views: the same algorithm is either penalised MLE or the mode of the posterior. This duality recurs throughout machine learning.
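A sketch of MAP shrinkage for the simplest case: a Gaussian mean μ with known noise variance σ² and a Gaussian prior N(0, τ²) on μ. The data and variance values are made up; the closed-form posterior mode is a precision-weighted pull of the sample mean toward the prior mean 0:

```python
import statistics

# Hypothetical: data from N(mu, sigma^2), known sigma, prior mu ~ N(0, tau^2)
data = [2.3, 1.9, 2.7, 2.1, 2.5]
sigma2 = 1.0   # known noise variance (assumption)
tau2 = 0.5     # prior variance; smaller tau2 pulls harder toward 0

n = len(data)
xbar = statistics.fmean(data)

mu_mle = xbar  # MLE ignores the prior
# MAP: precision-weighted combination of data and prior (the L2 shrinkage effect)
mu_map = (n / sigma2 * xbar) / (n / sigma2 + 1 / tau2)
```

As τ² → ∞ the prior term vanishes and `mu_map` recovers `mu_mle` — penalised MLE and posterior mode are the same calculation.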
In Practice
For generalised linear models, Newton–Raphson converges fast and is standard. For neural networks, SGD and its adaptive variants (Adam, RMSProp) handle large data and non-convex loss surfaces. The negative log-likelihood landscape of a neural network is highly non-convex — yet SGD reliably finds solutions that generalise well, a phenomenon that remains an active area of research.
5.5 Bias–Variance Tradeoff
The bias–variance tradeoff is one of the most important ideas in machine learning. It explains why making a model more complex does not always help, and it provides the basis for regularisation, model selection, and ensembles.
The Decomposition
Suppose the true relationship is Y = f(X) + ε, where ε is noise with variance σ^2^. You train a model f̂ on a random sample. The expected squared error at a point x decomposes into three terms:
E[(Y − f̂(x))^2^] = σ^2^ + (Bias)^2^ + Variance
- σ^2^ (irreducible noise): inherent randomness. No model can remove it.
- Bias²: how far the average prediction is from the truth. High bias means the model systematically misses patterns.
- Variance: how much predictions change across different training sets. High variance means the model is unstable.
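The decomposition can be estimated by simulation. A minimal sketch with two made-up estimators of a mean: the plain sample mean (unbiased, higher variance) and a shrunken version (biased, lower variance). The shrunken estimator can achieve lower total error, which is the whole point of the tradeoff:

```python
import random
import statistics

random.seed(3)

mu_true, sigma, n = 2.0, 3.0, 10   # hypothetical true mean, noise sd, sample size

def draw_sample():
    return [random.gauss(mu_true, sigma) for _ in range(n)]

def plain(s):    # unbiased sample mean
    return statistics.fmean(s)

def shrunk(s):   # biased toward zero, but lower variance
    return 0.7 * statistics.fmean(s)

mse = {}
for est in (plain, shrunk):
    # Estimate bias and variance over many independent training samples
    values = [est(draw_sample()) for _ in range(5000)]
    bias = statistics.fmean(values) - mu_true
    var = statistics.variance(values)
    mse[est.__name__] = bias**2 + var   # decomposition, minus irreducible noise
```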
The Tradeoff
Simple models (e.g., linear regression with few features) have high bias and low variance. They miss patterns but are stable. Complex models (e.g., deep networks) have low bias and high variance. They capture more but are sensitive to the specific training data.
The optimal point minimises the sum. Reducing bias by adding complexity increases variance, and vice versa.
Regularisation Navigates the Tradeoff
All regularisation techniques introduce a small amount of bias to achieve a large reduction in variance:
- L2 (ridge): shrinks weights toward zero.
- L1 (lasso; Tibshirani, 1996): sets some weights exactly to zero (feature selection).
- Dropout (Srivastava, 2014): randomly deactivates neurons, preventing co-adaptation.
- Early stopping: halts training before the model memorises noise.
Ensembles
- Bagging (Breiman, 1996), e.g., random forests (Breiman, 2001): trains multiple models on bootstrap samples and averages. Reduces variance without much bias increase.
- Boosting (Friedman, 2001), e.g., XGBoost (Chen, 2016) and LightGBM: trains models sequentially, each focusing on previous errors. Primarily reduces bias.
Gradient-boosted trees dominate tabular data precisely because they navigate this tradeoff so effectively.
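The variance-reduction effect of bagging can be sketched with a toy stand-in for an unstable model — the median of a bootstrap resample — comparing one such "model" against an average of 25 of them. The dataset size and ensemble size are illustrative:

```python
import random
import statistics

random.seed(4)

data = [random.gauss(0, 1) for _ in range(20)]  # one small hypothetical dataset

def bootstrap_median(sample):
    # A single unstable "model": the median of one bootstrap resample
    resample = [random.choice(sample) for _ in sample]
    return statistics.median(resample)

# One model vs. an average of 25 bagged models, replicated 1000 times each
single = [bootstrap_median(data) for _ in range(1000)]
bagged = [statistics.fmean(bootstrap_median(data) for _ in range(25))
          for _ in range(1000)]

v_single = statistics.variance(single)
v_bagged = statistics.variance(bagged)   # far smaller: averaging cancels noise
```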
Double Descent
Classical theory says test error follows a U-shape as complexity grows. But heavily overparameterised deep networks can memorise the training data perfectly and still generalise well. This is double descent: test error follows the U, then drops again beyond the interpolation threshold.
Explanations invoke implicit regularisation by SGD, high-dimensional loss landscape geometry, and the benign nature of the minima gradient methods find. These findings refine our understanding but do not invalidate the bias–variance framework. Effective complexity depends on the optimiser, the data, and the architecture — not just the parameter count. The decomposition remains the essential starting point for reasoning about generalisation.