5.4 Estimators, Bias, Variance, MSE
This sub-chapter sets up a vocabulary for talking about how good a recipe is. Three words do most of the work: bias, variance, and mean squared error. Bias is the systematic offset between the estimator and the truth, the recipe's tendency to land consistently above or below the right answer. Variance is the wobble, how much the estimate jumps about when you redraw the data. Mean squared error glues the two together: the expected squared distance between estimate and truth. Once these three words are clear, you can compare estimators on a single scale and see why "obvious" choices are sometimes beaten by stranger-looking ones.
Where §5.3 was descriptive, §5.4 turns inferential: we treat the data as one realisation drawn from a wider population and ask what the data tell us about that population. §5.5 then commits to maximum likelihood as the default recipe.
Bias
Bias is the question: if I could repeat my experiment many times, fresh sample, same recipe, would the average estimate land on the truth, or on some other number? Formally, $$\text{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta.$$ A bias of zero means the recipe lands on the truth on average; positive bias means it tends to overshoot; negative bias means it tends to undershoot. Notice that bias is a property of the recipe, not of any particular sample. Your one estimate from your one sample will rarely equal $\mathbb{E}[\hat\theta]$ exactly; that sample-to-sample scatter is what variance measures, and we treat it next.
A clean worked example. Take $X_1, \ldots, X_n$ drawn independently from a normal distribution with unknown mean $\mu$ and variance $\sigma^2$. The sample mean $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} X_i$ has expected value $\mathbb{E}[\hat\mu] = \mu$, because expectation distributes over sums. The sample mean is therefore unbiased for $\mu$. Reassuringly, the obvious recipe for "average" lands on the truth in expectation.
The obvious recipe for "spread", however, is not. Define the plug-in sample variance $$S_n^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2.$$ A short calculation gives $\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2$, which is biased downward: on average, $S_n^2$ underestimates the true variance. The intuitive reason is that we used $\bar X$, the sample mean, in place of the unknown true mean $\mu$. The squared deviations from $\bar X$ are mechanically smaller than the squared deviations from $\mu$ would have been, because $\bar X$ is the value that minimises the sum of squared deviations in this sample. We have used the data twice, once to centre, once to spread, and the centring pulls the spread inward.
The classical fix is Bessel's correction: divide by $n-1$ instead of $n$. $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2.$$ This estimator satisfies $\mathbb{E}[s^2] = \sigma^2$, unbiased. Almost every introductory statistics text presents Bessel's correction as the right answer, full stop. We shall see in a moment that this is too quick. Unbiasedness is a virtue, but it is not the only virtue, and chasing it can cost you elsewhere.
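Before moving on, it is worth seeing the bias with your own eyes. The sketch below is a minimal NumPy simulation (the seed, the sample size $n = 10$, and the repetition count are arbitrary choices): it redraws the experiment many times and averages each estimator across redraws, which is exactly the expectation the formulas describe.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

# Each row is one fresh run of the experiment: n draws from N(mu, sigma2).
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

plug_in = samples.var(axis=1, ddof=0)  # S_n^2: divide by n
bessel = samples.var(axis=1, ddof=1)   # s^2: divide by n-1 (Bessel's correction)

print(f"true variance        : {sigma2}")
print(f"mean of plug-in S_n^2: {plug_in.mean():.4f}  (theory: {(n - 1) / n * sigma2:.4f})")
print(f"mean of Bessel s^2   : {bessel.mean():.4f}  (theory: {sigma2:.4f})")
```

With these settings the plug-in average comes out near $3.6$ and the Bessel-corrected average near $4.0$: the downward pull of centring at $\bar X$ is plainly visible.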
Variance
Bias tells you where the shots land on average relative to the bullseye. Variance is how tightly they cluster, regardless of where they cluster. Formally, $$\text{Var}(\hat\theta) = \mathbb{E}\!\big[(\hat\theta - \mathbb{E}\hat\theta)^2\big].$$ A low-variance estimator gives nearly the same answer every time you redraw the data; a high-variance estimator hops around. Crucially, variance has nothing to do with the truth $\theta$, only with the spread of $\hat\theta$ across hypothetical repetitions of the experiment.
Worked example. For the sample mean of $n$ independent draws with variance $\sigma^2$, $$\text{Var}(\bar X) = \frac{\sigma^2}{n}.$$ The standard deviation of $\bar X$, called the standard error, is therefore $\sigma/\sqrt{n}$. Two consequences. First, larger samples shrink the standard error, but only like $1/\sqrt{n}$: to halve your standard error you need not twice but four times the data. To shrink it tenfold, a hundredfold more data. This $\sqrt{n}$ law is one of the most quietly important facts in all of statistics. It is why pollsters commission samples in the low thousands rather than the millions: the marginal precision from extra interviews tails off quickly. It is also why scaling laws in deep learning are so unusual: in regimes where extra data really does keep paying off, you are seeing something more interesting than mere statistical averaging.
Second, variance shrinks with information per sample. If your individual measurements are noisy, $\sigma^2$ is large and you need more of them. If they are precise, fewer suffice. This trade between sample size and per-sample noise is the substrate of every experimental-design decision.
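The $1/\sqrt{n}$ rate is equally easy to check. A sketch, assuming standard normal draws and a few quadrupling sample sizes (so the standard error should halve at each step):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, reps = 1.0, 20_000

for n in (10, 40, 160, 640):
    # Standard deviation of the sample mean across many redraws of the experiment.
    means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    print(f"n={n:4d}  empirical SE={means.std(ddof=1):.4f}  theory={sigma / np.sqrt(n):.4f}")
```

Each quadrupling of $n$ halves the empirical standard error, in line with $\sigma/\sqrt{n}$.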
MSE = bias² + variance
Bias and variance both contribute to error, and there is a tidy identity that says exactly how. The mean squared error is the expected squared distance from the truth, $$\text{MSE}(\hat\theta) = \mathbb{E}\!\big[(\hat\theta - \theta)^2\big].$$ Add and subtract $\mathbb{E}[\hat\theta]$ inside the bracket, expand the square, and the cross-term vanishes (because $\mathbb{E}[\hat\theta - \mathbb{E}\hat\theta] = 0$). What remains is the headline identity of this sub-chapter: $$\boxed{\;\text{MSE}(\hat\theta) = \text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta).\;}$$ Total error decomposes cleanly into systematic offset squared, plus wobble. You cannot judge an estimator by looking at one term alone.
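Written out, the derivation is three lines. Add and subtract $\mathbb{E}\hat\theta$, expand, and note that $\mathbb{E}\hat\theta - \theta$ is a constant, so it factors out of the cross-term, which then multiplies $\mathbb{E}[\hat\theta - \mathbb{E}\hat\theta] = 0$: $$\begin{aligned} \mathbb{E}\big[(\hat\theta - \theta)^2\big] &= \mathbb{E}\big[(\hat\theta - \mathbb{E}\hat\theta + \mathbb{E}\hat\theta - \theta)^2\big] \\ &= \mathbb{E}\big[(\hat\theta - \mathbb{E}\hat\theta)^2\big] + 2\,(\mathbb{E}\hat\theta - \theta)\,\mathbb{E}\big[\hat\theta - \mathbb{E}\hat\theta\big] + (\mathbb{E}\hat\theta - \theta)^2 \\ &= \text{Var}(\hat\theta) + 0 + \text{Bias}(\hat\theta)^2. \end{aligned}$$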
Two extreme caricatures make the point. Consider the lazy estimator that ignores the data and always reports zero, $\hat\theta_{\text{lazy}} = 0$. Its variance is exactly zero, it never wobbles, but its bias is whatever the truth happens to be, $-\theta$. The MSE is $\theta^2$, possibly huge. Zero variance is not enough.
Now the opposite caricature: an estimator that uses only the first observation, $\hat\theta_{\text{one}} = X_1$. Its expected value is $\mu$, so it is unbiased; its variance is $\sigma^2$, the full per-sample noise. The MSE is $\sigma^2$, no better than a single measurement no matter how many samples you actually have. Unbiasedness is not enough either.
Now revisit the two variance estimators from earlier. The biased plug-in $S_n^2$ has bias $-\sigma^2/n$ but smaller variance than the unbiased $s^2$: multiplying any random variable by $(n-1)/n$ shrinks its variance by a factor of $((n-1)/n)^2$. Working through the algebra (for the normal model above), $$\text{MSE}(S_n^2) = \frac{2n-1}{n^2}\sigma^4, \qquad \text{MSE}(s^2) = \frac{2}{n-1}\sigma^4.$$ For $n = 10$ the biased estimator scores $0.19\,\sigma^4$ against the unbiased estimator's $\approx 0.222\,\sigma^4$. The biased recipe wins on MSE. Statisticians use $s^2$ in classical inference because its unbiasedness makes confidence-interval arithmetic exact, and use $S_n^2$ inside likelihood calculations because that is what falls out of the maximum-likelihood machinery. In machine learning we rarely demand unbiasedness; we want low total error, full stop. The bias–variance identity is the reason that is even a coherent thing to want.
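The comparison is easy to verify by brute force. A simulation sketch (normal data, $n = 10$, $\sigma^2 = 1$ so that MSE is reported in units of $\sigma^4$; the seed and repetition count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 10, 1.0, 500_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

# Mean squared distance from the true variance, across many redraws.
mse_plug_in = ((samples.var(axis=1, ddof=0) - sigma2) ** 2).mean()
mse_bessel = ((samples.var(axis=1, ddof=1) - sigma2) ** 2).mean()

print(f"MSE of plug-in S_n^2: {mse_plug_in:.4f}  (theory: {(2 * n - 1) / n**2:.4f})")
print(f"MSE of Bessel s^2   : {mse_bessel:.4f}  (theory: {2 / (n - 1):.4f})")
```

The theory values print as $0.19$ and $\approx 0.222$; the biased estimator wins, as the algebra promised.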
Consistency
Bias and variance are properties at a single sample size. Consistency is the long-run promise: as the sample grows, does the estimator close in on the truth?
Formally, $\hat\theta_n$ is consistent for $\theta$ if it converges in probability to $\theta$ as $n \to \infty$, that is, for any tolerance $\varepsilon > 0$, the chance of the estimate landing further than $\varepsilon$ from the truth shrinks to zero. A useful sufficient condition: if the MSE goes to zero, the estimator is consistent (this follows from Markov's inequality applied to the squared error). So both bias and variance must shrink in the limit; they need not be exactly zero at any finite $n$, only fade away as data accumulate.
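The proof of the sufficient condition is one line: apply Markov's inequality to the non-negative random variable $(\hat\theta_n - \theta)^2$, $$\Pr\big(|\hat\theta_n - \theta| > \varepsilon\big) = \Pr\big((\hat\theta_n - \theta)^2 > \varepsilon^2\big) \le \frac{\mathbb{E}\big[(\hat\theta_n - \theta)^2\big]}{\varepsilon^2} = \frac{\text{MSE}(\hat\theta_n)}{\varepsilon^2} \;\longrightarrow\; 0.$$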
The sample mean is consistent by the law of large numbers: $\bar X_n \to \mu$ in probability. Bessel-corrected $s^2$ is consistent for $\sigma^2$, and so is the plug-in $S_n^2$, since their ratio tends to one. Maximum likelihood estimators are consistent under mild regularity conditions, assumptions that essentially say the model is well behaved, the parameter identifiable, and the likelihood smooth enough to differentiate. Consistency is the minimum standard you should expect from any estimator worth using; an inconsistent recipe does not improve no matter how much data you give it.
Asymptotic normality
Beyond consistency, many estimators come with a stronger guarantee about how they approach the truth. For a wide class of well-behaved estimators, $$\sqrt{n}\,(\hat\theta_n - \theta) \to \mathcal{N}(0, V)$$ in distribution, where $V$ is the asymptotic variance. Rearranged, this says that for large $n$ the sampling distribution of $\hat\theta_n$ is approximately $\mathcal{N}(\theta, V/n)$. For maximum-likelihood estimators the asymptotic variance is the inverse Fisher information, $V = 1/I(\theta)$, which we meet again in the next subsection.
Why does this matter? Because it converts a vague guarantee ("the estimator gets close to the truth") into a quantitative one ("the estimator wobbles around the truth like a Gaussian with this specific spread"). Once you know the sampling distribution is approximately normal, you can build confidence intervals, $\hat\theta \pm 1.96\sqrt{V/n}$ for a 95% interval, and you can run hypothesis tests by comparing $\hat\theta$ against null-hypothesis values on a $z$ scale. Almost every confidence interval in scientific publication ultimately rests on an asymptotic-normality result.
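As a sketch of the recipe in code, assume Bernoulli$(p)$ data, for which the MLE is $\hat p = \bar X$ with asymptotic variance $V = p(1-p)$; plugging $\hat p$ into $V$ gives the usual Wald interval (the sample size, seed, and true $p$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_true = 1_000, 0.3

x = rng.binomial(1, p_true, size=n)
p_hat = x.mean()                                # MLE of p
se = np.sqrt(p_hat * (1 - p_hat) / n)           # plug-in estimate of sqrt(V / n)

lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se   # 95% Wald interval
print(f"p_hat = {p_hat:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]  (truth: {p_true})")
```

Across repeated experiments, an interval built this way covers the true $p$ about 95% of the time, for large $n$.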
The intuition behind asymptotic normality is the central limit theorem: most estimators are, when you expand them carefully, sums or smooth functions of sums of independent terms. Sums of independent random variables, suitably scaled, look Gaussian. So the Gaussian shape is not a coincidence: it is a consequence of estimators being averaging-like creatures and of large-$n$ behaviour washing out per-sample idiosyncrasies. The scaling factor $\sqrt{n}$ is the Goldilocks rate: divide by $n$ and the limit is a constant (the law of large numbers); divide by $\sqrt{n}$ and the limit is a non-degenerate distribution.
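The Goldilocks claim is itself checkable by simulation. In the sketch below (standard normal draws; the sizes are arbitrary), the spread of $\bar X_n - \mu$ collapses, the spread of $n(\bar X_n - \mu)$ blows up, and only the $\sqrt{n}$ scaling stabilises at $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
reps, sigma = 10_000, 1.0

for n in (10, 100, 1_000):
    dev = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)  # xbar - mu, with mu = 0
    print(f"n={n:5d}  sd(xbar-mu)={dev.std():.4f}"
          f"  sd(sqrt(n)*(xbar-mu))={(np.sqrt(n) * dev).std():.4f}"
          f"  sd(n*(xbar-mu))={(n * dev).std():.4f}")
```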
In modern AI, asymptotic normality is the silent assumption behind error bars on benchmark scores, the bootstrap intervals reported alongside model accuracies, and the Wald-style confidence statements in A/B tests of model variants. It is also the formal justification for treating fitted parameters of large models as Gaussian-distributed for purposes of uncertainty quantification; the Laplace approximation, for example, is essentially "use the asymptotic normal distribution of the MLE as an approximate posterior".
The Cramér–Rao lower bound
Among all unbiased estimators of $\theta$, can you make the variance as small as you like by being clever enough about your recipe? No. There is a hard floor.
Define the Fisher information for one observation as $$I(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X\mid\theta)}{\partial\theta^2}\right].$$ This measures how sharply the log-likelihood curves around the true parameter: the sharper the peak, the more the data tell you about $\theta$. The Cramér–Rao lower bound then says that for any unbiased estimator built from $n$ i.i.d. observations, $$\text{Var}(\hat\theta) \ge \frac{1}{n\,I(\theta)}.$$ You cannot do better. An unbiased estimator that attains this bound is called efficient.
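A concrete instance: for Bernoulli$(p)$ data, differentiating $\log p(x \mid p) = x \log p + (1-x)\log(1-p)$ twice gives $I(p) = 1/\big(p(1-p)\big)$, so the bound reads $\text{Var}(\hat p) \ge p(1-p)/n$, and the sample mean attains it exactly. A simulation sketch (the particular $p$, $n$, and repetition count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.3, 50, 100_000

# The MLE of p on each redraw is the sample mean of the 0/1 outcomes.
p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)

cr_bound = p * (1 - p) / n   # 1 / (n I(p)) with I(p) = 1 / (p (1 - p))
print(f"Var(p_hat) empirical: {p_hat.var():.6f}")
print(f"Cramér–Rao bound    : {cr_bound:.6f}")
```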
The deep reason maximum likelihood is the default choice of estimator in statistics and machine learning is that it is asymptotically efficient: as $n$ grows, the MLE's variance approaches $1/(nI(\theta))$, i.e., it saturates the Cramér–Rao bound. Among suitably regular estimators, you cannot beat the MLE for large-sample precision. Fisher information also reappears throughout modern AI as the natural-gradient preconditioner $F^{-1}\nabla\ell$, which gives a step direction invariant to how you parameterise your model; second-order optimisers such as K-FAC are tractable approximations to natural-gradient descent, and Shampoo applies a closely related structured preconditioner.
Where these concepts appear in AI
Once you have the bias–variance vocabulary, you start seeing it everywhere in machine learning.
Generalisation error. The expected test-set MSE of a supervised learner decomposes into the squared bias of its predictions, their variance across training-set redraws, and an irreducible noise term, the same identity, just upgraded from a single parameter to a whole prediction function. Section 5.16 returns to this in detail.
Regularisation as a bias–variance trade. Adding an $L^2$ penalty to a regression's loss shrinks coefficients toward zero. This introduces bias (you no longer minimise the unregularised loss) but reduces variance (less wiggle room means more stable fits across data redraws). The penalty strength is tuned to minimise total MSE, not to make either ingredient zero.
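A minimal sketch of the trade, in the simplest setting that exhibits it: estimating a mean $\theta$ with the shrinkage estimator $\hat\theta_\lambda = \bar X/(1+\lambda)$, which is what ridge regression collapses to for an intercept-only design (with the penalty rescaled by $n$). The numbers $\theta = 1$, $\sigma^2 = 4$, $n = 10$ are arbitrary; for them, the theoretical optimum is $\lambda^\ast = \sigma^2/(n\theta^2) = 0.4$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma2, n, reps = 1.0, 4.0, 10, 200_000

x_bar = rng.normal(theta, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)

for lam in (0.0, 0.2, 0.4, 0.8, 1.6):
    est = x_bar / (1 + lam)             # shrink the estimate toward zero
    bias2 = (est.mean() - theta) ** 2   # squared systematic offset
    var = est.var()                     # wobble across data redraws
    print(f"lambda={lam:3.1f}  bias^2={bias2:.4f}  var={var:.4f}  MSE={bias2 + var:.4f}")
```

MSE dips as $\lambda$ leaves zero, bottoms out near $0.4$, then climbs as bias takes over: exactly the tuning curve the paragraph describes.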
Variance reduction in reinforcement learning. Policy-gradient algorithms estimate the gradient of expected reward by sampling trajectories, and that estimator can have huge variance. Subtracting a state-value baseline $V(s)$ from the sampled return reduces variance without introducing bias; formally the baseline is a control variate, and the related Rao–Blackwell theorem guarantees that conditioning an unbiased estimator on a sufficient statistic never increases variance.
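A toy sketch of the effect, with everything stripped to one step: a Gaussian "policy" $\pi_\theta = \mathcal{N}(\theta, 1)$ over actions $a$, reward $r(a) = -(a-2)^2$, and the score-function estimator $\hat g = (r(a) - b)\,\nabla_\theta \log \pi_\theta(a)$. Any constant baseline $b$ leaves the mean untouched, because $\mathbb{E}[\nabla_\theta \log \pi_\theta(a)] = 0$; a well-chosen one slashes the variance. All specifics below (the reward, the baseline value, the seed) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, reps = 0.0, 500_000

a = rng.normal(theta, 1.0, size=reps)   # actions sampled from the policy N(theta, 1)
r = -(a - 2.0) ** 2                     # reward; true gradient of E[r] is -2(theta - 2) = 4
score = a - theta                       # score function: d/dtheta of log pi_theta(a)

b = -5.0                                # constant baseline, here chosen equal to E[r]
g_plain = r * score                     # estimator without a baseline
g_base = (r - b) * score                # same mean, smaller variance

print(f"no baseline: mean={g_plain.mean():.3f}  var={g_plain.var():.1f}")
print(f"baseline   : mean={g_base.mean():.3f}  var={g_base.var():.1f}")
```

Both estimators average to the true gradient, $4$; the baselined one does so with roughly half the variance in this setup.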
Asymptotic normality for uncertainty. Confidence intervals on benchmark accuracies, calibration intervals on probabilistic forecasts, and Laplace-approximation posteriors over neural-network weights all lean on the asymptotic-normal sampling distribution we sketched above. When you see a $\pm$ in a deep-learning paper, the central limit theorem is somewhere underneath it.
What you should take away
- An estimator is a function from data to a parameter guess; because the data are random, the estimator is itself a random variable with a sampling distribution.
- Bias is the systematic offset $\mathbb{E}[\hat\theta] - \theta$, and variance is the sample-to-sample wobble. Mean squared error decomposes cleanly as $\text{MSE} = \text{Bias}^2 + \text{Var}$.
- Unbiasedness is not sacred: a biased estimator with smaller variance can beat an unbiased one on MSE. In machine learning we usually optimise total error, not unbiasedness alone.
- Consistency ($\hat\theta_n \to \theta$ as $n \to \infty$) is the minimum standard; asymptotic normality ($\sqrt{n}(\hat\theta - \theta) \to \mathcal{N}(0, V)$) gives quantitative error bars; the Cramér–Rao lower bound ($\text{Var}(\hat\theta) \ge 1/(nI(\theta))$) tells you how good unbiased estimators can possibly get.
- The same bias–variance vocabulary structures generalisation error, regularisation, variance reduction in policy gradients, and uncertainty quantification in deep learning; it is one of the most reused ideas in the whole subject.