5.2 Frequentist vs Bayesian: The Two Schools
Statistics has two great philosophical traditions, and they have been arguing politely with one another for the better part of three centuries. The frequentist tradition treats the unknown quantity you want to learn (the true accuracy of a model, the true rate of side effects from a drug, the true average height of adult women in Auckland) as a fixed number, and treats the data you collect as random. The Bayesian tradition does the opposite: it treats the unknown quantity as itself uncertain (so we describe our state of knowledge about it with a probability distribution), and treats the data, once you have actually observed it, as fixed.
That single switch (what is random: the parameter or the data?) is the philosophical fork in the road. Both traditions are mathematically sound. Both produce numbers. Both are widely used in modern machine learning. But the split affects how you interpret a confidence interval, what a p-value can and cannot tell you, what a posterior probability is, and which everyday questions you are actually allowed to answer in each framework. A scientist who never thinks about the difference will sooner or later say something that is either nonsense in their own framework or quietly correct in the other one.
This section introduces the philosophical and methodological landscape. §§5.5 and 5.6 contrast the two schools on specific tasks (maximum likelihood for the frequentist; MAP and full posteriors for the Bayesian).
The frequentist view
For a frequentist, the unknown parameter, call it $\theta$, is a fixed number out there in the world. There is some true rate at which a coin lands heads; some true accuracy of your classifier on the deployment distribution; some true average blood pressure of New Zealand adults. We do not know the value, but it is not random. It does not have a probability distribution. It is just unknown.
What is random, in the frequentist picture, is the data. We imagine that the experiment we ran is one of an infinite ensemble of possible repetitions, each producing slightly different data because of sampling variation. If we measured the blood pressure of a different hundred adults each time, we would get a slightly different sample mean. The estimator, the recipe by which we turn data into a guess about $\theta$, therefore has a sampling distribution: a distribution over the values it would take across these hypothetical re-runs of the experiment. Frequentist statistics is, at heart, the study of these sampling distributions.
A 95% confidence interval is the classic example. It is not a statement about $\theta$. It is a statement about a procedure. We construct the interval using a recipe that, if we re-ran the whole experiment many times, would capture the true $\theta$ inside the interval roughly 95% of the time. For any particular interval that has actually been computed, $\theta$ is either inside it or it is not; there is no probability involved, because $\theta$ is fixed. Likewise a p-value is a probability about data assuming a particular hypothesis is true, not a probability about the hypothesis itself.
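A minimal simulation makes the procedural reading concrete. The numbers below are invented, and a known population standard deviation with a normal-approximation interval is assumed for simplicity; the point is that the coverage guarantee belongs to the recipe, not to any one computed interval.

```python
import numpy as np

rng = np.random.default_rng(0)

true_mu = 120.0        # the fixed, unknown "true" mean (known only to the simulation)
sigma = 15.0           # population standard deviation, assumed known for simplicity
n = 100                # sample size per experiment
n_experiments = 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mu, sigma, size=n)
    x_bar = sample.mean()
    se = sigma / np.sqrt(n)
    lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se   # 95% normal-approximation interval
    covered += (lo <= true_mu <= hi)

print(f"Fraction of intervals containing the true mean: {covered / n_experiments:.3f}")
# Roughly 0.95: the guarantee attaches to the procedure, not to any single interval.
```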
Key features of the frequentist view:
- $\theta$ is treated as a fixed unknown constant.
- Probabilities attach only to data and to procedures, never to $\theta$.
- There is no prior distribution and no posterior distribution.
- Guarantees take the form "if I repeated this analysis many times, the procedure would behave well on average."
The Bayesian view
For a Bayesian, the picture is rotated by ninety degrees. The unknown parameter $\theta$ is itself uncertain, and we describe that uncertainty using a probability distribution. Crucially, this is not a claim that the world is intrinsically random (the true rate of heads for a particular coin is what it is) but a claim about our knowledge. Probability, in the Bayesian view, is the formal language of degrees of belief.
Before we collect any data, we attach a prior distribution $p(\theta)$ to $\theta$. The prior summarises what we believe about $\theta$ before looking at the data: perhaps we think any value between zero and one is equally plausible (a so-called uninformative prior), or perhaps a careful reading of the literature already pins $\theta$ down to a narrow range. We then observe the data $\mathcal{D}$, and update our beliefs using Bayes' theorem:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$
The result is the posterior distribution $p(\theta \mid \mathcal{D})$, which describes our updated state of knowledge about $\theta$ now that we have seen the data. Once you have a posterior, you can compute anything else you want from it: a point estimate (the posterior mean, median, or mode), an interval (a credible interval that contains 95% of the posterior probability), or the predictive probability of a future observation.
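As a concrete sketch, here is the coin-flip model with a uniform prior and invented data (14 heads in 20 flips); the Beta posterior and its summaries come out in a few lines:

```python
from scipy import stats

# Hypothetical data: 14 heads in 20 flips (illustrative numbers only).
n, k = 20, 14

# Uniform Beta(1, 1) prior; the Binomial likelihood gives a Beta(k+1, n-k+1) posterior.
posterior = stats.beta(k + 1, n - k + 1)

print("posterior mean:       ", posterior.mean())           # (k+1)/(n+2), about 0.682
print("posterior mode:       ", k / n)                       # 0.700
print("95% credible interval:", posterior.interval(0.95))    # central 95% of posterior mass
```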
A 95% Bayesian credible interval has the interpretation that most people incorrectly attach to a frequentist confidence interval: there is, given my prior and the data, a 95% probability that $\theta$ lies inside this particular interval. That feels intuitive precisely because it is the kind of statement most ordinary scientific reasoning is trying to make. The cost of getting that natural interpretation is that you must specify a prior, and your conclusions can in principle depend on what you chose. Bayesians typically respond that the prior is simply an honest record of the modelling assumptions you were going to make anyway; frequentists worry that priors smuggle subjectivity in by the back door.
Key features of the Bayesian view:
- $\theta$ is treated as a random quantity (in the belief sense, not the physical sense).
- Probabilities can attach to anything you are uncertain about, including hypotheses and parameters.
- A prior distribution is required (it may be deliberately vague, sometimes called uninformative or weakly informative).
- Inference produces a full posterior distribution, not just a point estimate.
- Updating is incremental: today's posterior becomes tomorrow's prior when new data arrive.
Same maths, different interpretation
Here is something that surprises many beginners. For a wide range of routine problems, the two schools produce almost identical numerical answers, even though they tell completely different philosophical stories about what those numbers mean.
Take the simplest case: estimating the probability $\theta$ that a coin lands heads. Suppose you flip the coin $n$ times and observe $k$ heads. The frequentist maximum likelihood estimate is $\hat\theta = k/n$. The Bayesian, starting from a uniform prior, formally a $\operatorname{Beta}(1,1)$, obtains a $\operatorname{Beta}(k+1, n-k+1)$ posterior whose mean is $(k+1)/(n+2)$ and whose mode is $k/n$. With even a modest sample size the Bayesian and frequentist point estimates are barely distinguishable. Likewise the frequentist confidence interval $\hat\theta \pm 1.96 \cdot \mathrm{SE}$ and the Bayesian central credible interval, for moderate $n$ and a vague prior, occupy almost exactly the same range on the number line.
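A quick check with illustrative numbers (81 heads in 100 flips, with the standard Wald interval $\hat\theta \pm 1.96\,\sqrt{\hat\theta(1-\hat\theta)/n}$ assumed on the frequentist side) shows how close the two ranges typically are:

```python
import numpy as np
from scipy import stats

n, k = 100, 81                      # illustrative data: 81 heads in 100 flips
theta_hat = k / n                   # frequentist maximum likelihood estimate

# Frequentist: Wald 95% confidence interval around the MLE.
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
wald = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)

# Bayesian: uniform prior, Beta(k+1, n-k+1) posterior, central 95% credible interval.
posterior = stats.beta(k + 1, n - k + 1)
credible = posterior.interval(0.95)

print("MLE:                ", theta_hat)                  # 0.810
print("posterior mean:     ", round(posterior.mean(), 3)) # about 0.804
print("95% confidence int.:", tuple(round(x, 3) for x in wald))
print("95% credible int.:  ", tuple(round(x, 3) for x in credible))
# The two intervals occupy nearly the same stretch of the number line.
```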
What changes is the story you tell about that range. The frequentist story is: "this procedure produces intervals that contain the true $\theta$ in 95% of repeated experiments; this happens to be one such interval." The Bayesian story is: "given my prior and the data I observed, there is a 95% probability that $\theta$ lies in this interval." Same numbers, completely different claims. A scientist who slips between the two interpretations without noticing, saying, for example, that "there is a 95% chance the true accuracy is between 0.81 and 0.86" while quoting a frequentist confidence interval, has technically made a category error, even though the practical consequences are usually small.
This near-equivalence in routine cases is the reason most working statisticians and machine learning engineers can be philosophically agnostic most of the time. The cases where it actually matters are the cases where the two diverge.
When the two diverge
The schools genuinely come apart in four important situations.
Small samples or strong prior information. When data are scarce, Bayesian methods naturally regularise the answer towards the prior, which is exactly what you want when you have real prior knowledge, say, that drug response rates in adults rarely exceed 80%. Frequentist estimators, lacking any way to incorporate this knowledge, can be unstable or even pathological at small $n$ (a 0/0 maximum likelihood estimate, an impossibly wide confidence interval).
Direct probability statements about $\theta$. Only the Bayesian framework can answer the question "what is the probability that the true treatment effect is greater than zero?" because that question requires a probability distribution over $\theta$. Frequentists must reformulate the question as one about data, typically as a hypothesis test, and the answer (a p-value) is about the data, not the parameter.
Sequential and online inference. Bayesian updating is naturally incremental: today's posterior becomes tomorrow's prior, and you can stop or continue collecting data whenever you wish without invalidating the analysis. Classical frequentist inference, by contrast, depends on the sampling plan: peeking at the data and stopping early generally inflates the false-positive rate, requiring specialised sequential-testing corrections. (A minimal sketch of this incremental updating appears after the fourth case below.)
Multiple comparisons and hierarchical structure. Bayesian hierarchical models naturally pool information across related groups and shrink noisy estimates towards a shared mean, providing automatic regularisation when you are testing many hypotheses at once. Frequentist multiple-testing corrections (Bonferroni, false-discovery-rate procedures) achieve a similar protection, but as explicit add-ons rather than a natural consequence of the model.
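As promised above, here is a minimal sketch of incremental Beta-Binomial updating with invented batch sizes. It also touches the second case, since each step reports a direct posterior probability that $\theta$ exceeds one half, a statement only the Bayesian framework can make:

```python
from scipy import stats

# Hypothetical data arriving in three batches: (heads, tails) per batch.
batches = [(3, 1), (6, 4), (18, 12)]

a, b = 1, 1                      # start from a uniform Beta(1, 1) prior
for heads, tails in batches:
    a, b = a + heads, b + tails  # today's posterior becomes tomorrow's prior
    posterior = stats.beta(a, b)
    print(f"after {a + b - 2:>2d} flips: "
          f"P(theta > 0.5 | data so far) = {posterior.sf(0.5):.3f}")
```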
Modern machine learning uses both
Despite the rhetoric, modern ML cheerfully borrows from both traditions, often within a single paper.
Empirical risk minimisation, training a neural network by minimising the average loss on labelled data, is implicitly frequentist: you produce a single point estimate of the weights and report a single number for accuracy on a held-out test set. L2 weight decay, the most common regulariser in deep learning, is implicitly Bayesian: it is exactly maximum-a-posteriori inference under a zero-mean Gaussian prior on the weights. L1 regularisation corresponds to a Laplace prior, dropout to an implicit weight prior, and early stopping to an implicit prior on training trajectories.
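To see the weight-decay correspondence explicitly, treat the training loss as a negative log-likelihood (true up to constants for squared error and cross-entropy) and place a zero-mean Gaussian prior $p(w) = \mathcal{N}(0, \sigma^2 I)$ on the weights $w$. Maximising the log-posterior is then the same as minimising the penalised loss:
$$\hat{w}_{\text{MAP}} = \arg\max_{w}\; \bigl[\log p(\mathcal{D} \mid w) + \log p(w)\bigr] = \arg\min_{w}\; \sum_{i=1}^{n} \ell\bigl(y_i, f_w(x_i)\bigr) + \frac{1}{2\sigma^2}\,\lVert w \rVert_2^2,$$
so the familiar penalty $\lambda \lVert w \rVert_2^2$ is just the prior's negative log-density with $\lambda = 1/(2\sigma^2)$: stronger weight decay corresponds to a tighter prior on the weights.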
Variational autoencoders are explicitly Bayesian: they construct an approximate posterior over latent codes and train it by maximising a variational lower bound on the marginal likelihood. Bayesian deep learning more broadly aims to produce posterior distributions over network weights, using techniques such as Laplace approximations, deep ensembles, Monte Carlo dropout, and stochastic weight averaging Gaussian, in order to give calibrated uncertainty estimates for safety-critical applications.
Conformal prediction, in vogue since the late 2010s, sits in the frequentist camp: it produces prediction sets with guaranteed marginal coverage on top of any underlying model, Bayesian or not. Empirical Bayes, used in Gaussian process kernel learning and automatic relevance determination, treats hyperparameters as parameters of a higher-level model and learns them from the marginal likelihood, blurring the line cheerfully.
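A minimal split-conformal sketch for regression is below; the `model` object (with a scikit-learn-style `predict` method) is a placeholder, and the absolute residual is used as the conformity score:

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_new, alpha=0.10):
    """Prediction intervals with roughly (1 - alpha) marginal coverage, on top of any fitted model."""
    # Conformity scores: absolute residuals on a held-out calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # Finite-sample-corrected quantile of the calibration scores (requires NumPy >= 1.22).
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")
    preds = model.predict(X_new)
    return preds - q, preds + q
```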
Most papers do not bother to declare a school. They use whichever framework is convenient for the task at hand, and they expect the reader to follow.
What you should take away
- The frequentist and Bayesian schools differ on what is random: the parameter (Bayesian) or the data (frequentist). That single difference cascades into every downstream interpretation.
- A frequentist confidence interval is a statement about a procedure; a Bayesian credible interval is a statement about $\theta$ given the data and the prior. They look the same on the page; they mean different things.
- For routine problems with moderate data and weak priors, the two schools usually produce nearly identical numbers. The differences bite when data are scarce, when you want direct probability statements about $\theta$, when inference is sequential, or when many comparisons are involved.
- Modern machine learning is technically frequentist on the surface (point estimates, held-out accuracy) but implicitly Bayesian underneath (regularisation, ensembling, calibrated uncertainty). The two traditions are tools, not tribes.
- When you read or write a statistical claim, ask explicitly which school's interpretation you are invoking. Most real-world misuse of statistics (over-claimed p-values, mis-stated confidence intervals) comes from sliding between the two without noticing.