7.7 Naive Bayes

Naive Bayes is the simplest Bayesian classifier worth taking seriously. It begins with Bayes' theorem and then makes one bold simplifying move: it assumes that, once you know the class, every feature is independent of every other feature. The assumption is almost always wrong about the world. The classifier built on top of it is, despite that, often surprisingly accurate, occasionally embarrassingly so. It is fast to train and fast to predict, requires only modest amounts of data, has a closed-form fit with no iterative optimisation, and gives a baseline against which any more sophisticated method should be measured. For text classification (spam filtering, sentiment tagging, simple topic labelling) it remains the textbook starting point.

In §7.6 we built decision trees, where the model carves the input space into rectangles and assigns a class to each leaf. The geometry was paramount: a tree learns which axis to split on, where to split it, and what to predict at the bottom. Naive Bayes tells a different kind of story. Instead of partitioning space, it estimates a probability distribution for each class and uses Bayes' rule to invert the direction of conditioning. The output is a posterior probability over classes, not a region label. This is our first properly probabilistic classifier: it commits to a generative model of the data and then uses that model to classify by asking which class makes the observed features most plausible.

The shift in viewpoint matters. A decision tree asks "where in feature space does this point sit?" and reads off a label. Naive Bayes asks "which class would most readily generate a point that looks like this?" and ranks the candidates. The two questions usually agree on the answer, but they expose different machinery. The Bayesian framing also makes it natural to combine the classifier with prior knowledge (a doctor's base rate for a disease, a moderator's prior on spam frequency) by adjusting $P(y)$ rather than retraining.

Symbols Used Here
$y$ : class label
$\mathbf{x}$ : feature vector
$P(\mathbf{x} \mid y)$ : class-conditional likelihood
$P(y)$ : class prior

The model

Bayes' theorem rewrites the posterior in terms of the likelihood and the prior: $$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})} \;\propto\; P(y)\,P(\mathbf{x} \mid y).$$ The denominator $P(\mathbf{x})$ does not depend on the class, so it falls away when we ask which class maximises the posterior. To classify, we pick $\hat{y} = \arg\max_k P(y=k)\,P(\mathbf{x}\mid y=k)$. The prior $P(y)$ is easy: it is just the class frequency in the training set. The likelihood $P(\mathbf{x}\mid y)$ is hard: with $d$ features it is a joint distribution over $d$-dimensional space, and there are typically not enough data to estimate it directly.

The naive assumption sweeps that difficulty away. It says that given the class label, the features carry no information about each other: $$P(\mathbf{x}\mid y) = \prod_{j=1}^{d} P(x_j \mid y).$$ A $d$-dimensional density estimation problem becomes $d$ separate one-dimensional problems, one per feature per class. With $K$ classes and $d$ features, we estimate at most $K \cdot d$ small models rather than one enormous joint distribution. The cost is bias: real features are correlated, and treating them as independent is a lie. The benefit is variance: each one-dimensional estimate is robust because every training row contributes to every feature's marginal. Because we only need the argmax of the posterior rather than the calibrated posterior itself, Naive Bayes can pick the right class even when its independence assumption distorts the probabilities. Numerical work is done in log-space, $\log P(y) + \sum_j \log P(x_j\mid y)$, to avoid the underflow that comes from multiplying many small probabilities together.
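To make the log-space rule concrete, here is a minimal sketch of the decision $\hat{y} = \arg\max_k \log P(y=k) + \sum_j \log P(x_j \mid y=k)$. The probability tables are hypothetical placeholders, not fitted values:

```python
import math

# Hypothetical log-probability tables for a two-word vocabulary.
log_prior = {"spam": math.log(0.3), "ham": math.log(0.7)}
log_likelihood = {
    "spam": {"free": math.log(0.09), "meeting": math.log(0.01)},
    "ham":  {"free": math.log(0.005), "meeting": math.log(0.05)},
}

def predict(words):
    # Sum log-probabilities instead of multiplying probabilities,
    # so long documents do not underflow to zero.
    scores = {
        k: log_prior[k] + sum(log_likelihood[k][w] for w in words)
        for k in log_prior
    }
    return max(scores, key=scores.get)

print(predict(["free", "meeting"]))  # whichever class scores higher
```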

Variants by feature type

Naive Bayes is a family of models, not a single one. They differ in the assumed shape of $P(x_j \mid y)$ and are picked to match the data type at hand.

Multinomial Naive Bayes is the workhorse of text classification. Features are non-negative counts: how often each vocabulary word appears in a document. The conditional likelihood is multinomial, and the parameter $\theta_{jk}$ is the probability that, conditional on a document being in class $k$, a randomly drawn token is word $j$. Maximum-likelihood estimation just counts: $\hat\theta_{jk}$ is the number of times word $j$ appeared across all class-$k$ documents divided by the total token count in class $k$. The danger with raw counts is words unseen in a particular class; their MLE is zero, which sends the entire log-likelihood to minus infinity and rules the class out on a single rare term. Laplace (add-one) smoothing, $\hat\theta_{jk} = (N_{jk}+\alpha)/(N_k + \alpha d)$ with $\alpha=1$, prevents the catastrophe by pretending every word appeared once in every class.
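A sketch of the counting estimator with Laplace smoothing, assuming a NumPy document-term count matrix `X` and label vector `y`; the function name and interface are illustrative, not a standard API:

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Smoothed multinomial parameters from token counts."""
    X, y = np.asarray(X), np.asarray(y)
    d = X.shape[1]
    log_prior, log_theta = {}, {}
    for k in np.unique(y):
        Xk = X[y == k]
        log_prior[k] = np.log(len(Xk) / len(X))  # class frequency
        Njk = Xk.sum(axis=0)                     # count of word j in class k
        # Laplace smoothing: theta_jk = (N_jk + alpha) / (N_k + alpha * d)
        log_theta[k] = np.log((Njk + alpha) / (Njk.sum() + alpha * d))
    return log_prior, log_theta
```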

Bernoulli Naive Bayes treats each feature as a binary indicator: word present or absent, ignoring counts. The likelihood per feature is $P(x_j\mid y=k) = \theta_{jk}^{x_j}(1-\theta_{jk})^{1-x_j}$. It scores absence explicitly (a missing word counts as evidence), which makes it useful for short texts where presence carries more signal than frequency.
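A minimal sketch of the Bernoulli log-likelihood; it assumes the parameters in `theta_k` have already been smoothed away from exactly 0 and 1:

```python
import numpy as np

def bernoulli_log_likelihood(x, theta_k):
    """Log P(x | y=k) for binary features x_j in {0, 1}."""
    x, theta_k = np.asarray(x), np.asarray(theta_k)
    # Present features contribute log(theta_jk); absent ones contribute
    # log(1 - theta_jk), so a missing word is scored as explicit evidence.
    return np.sum(x * np.log(theta_k) + (1 - x) * np.log(1 - theta_k))
```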

Gaussian Naive Bayes handles continuous features. For each class $k$ and feature $j$, fit a one-dimensional Gaussian by computing the per-class sample mean $\mu_{jk}$ and variance $\sigma^2_{jk}$. The conditional density is $\mathcal{N}(\mu_{jk}, \sigma^2_{jk})$, and the log-likelihood is the familiar quadratic. With shared variances across classes the decision boundary is linear; with class-specific variances it becomes quadratic.
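The Gaussian fit is two sufficient statistics per feature per class. A sketch, with a small variance floor added as a practical guard (an implementation detail, not part of the model):

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Per-class mean and variance for each feature."""
    X, y = np.asarray(X), np.asarray(y)
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        # eps keeps the log-density finite for near-constant features.
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0) + eps)
    return params

def gaussian_log_likelihood(x, mu, var):
    # Sum of one-dimensional Gaussian log-densities: the familiar quadratic.
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))
```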

Other variants exist: Complement Naive Bayes (a small but useful tweak that helps with imbalanced text classes) and Categorical Naive Bayes for nominal features. Still, the three above cover almost every practical case. The first decision in any Naive Bayes project is which variant matches the feature type: counts, binary indicators, or continuous values.

Worked example: a spam classifier

Take a tiny vocabulary (free, meeting, prince, agenda) and a training set of emails labelled spam or ham. Multinomial Naive Bayes fits in three steps. First, estimate the priors by counting documents: if 30% of the training emails are spam, then $P(\text{spam})=0.3$ and $P(\text{ham})=0.7$. Second, for each class, count token occurrences across all documents in that class, then compute smoothed conditional probabilities. Suppose free appears 90 times in spam emails out of a spam-class total of 1000 tokens; with Laplace smoothing and a vocabulary of four words, $\hat{P}(\text{free}\mid \text{spam}) = (90+1)/(1000+4) \approx 0.0906$. The word free appears far less often in ham, perhaps $\hat{P}(\text{free}\mid \text{ham}) \approx 0.005$. Third, store the prior log-probabilities and the per-class log-probabilities for every word.
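The smoothing arithmetic is short enough to check in a couple of lines:

```python
# Laplace-smoothed estimate of P(free | spam):
# N_jk = 90, N_k = 1000, alpha = 1, d = 4.
p_free_spam = (90 + 1) / (1000 + 4)
print(f"{p_free_spam:.4f}")  # 0.0906
```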

To classify a new email like "free meeting agenda", sum the log-probabilities for the words present in each class, add the log-prior, and pick the larger. The arithmetic on this small example would give a higher score for ham, because meeting and agenda are far more characteristic of ham than free is of spam. Had the email read "free prince free", the spam log-score would dominate. Section 4.15 walked through the same calculation with explicit numbers; the message there was that the model is essentially a pair of weighted log-counts, one per class, with the decision turning on whichever sum is larger.
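A sketch of that comparison in code. Only $\hat{P}(\text{free}\mid\text{spam})$, $\hat{P}(\text{free}\mid\text{ham})$, and the priors come from the text above; the remaining word probabilities are invented for illustration, chosen so that meeting and agenda favour ham:

```python
import math

# Illustrative smoothed probabilities (in a strict four-word multinomial
# each class's row would sum to 1; these are placeholders).
p = {
    "spam": {"free": 0.0906, "meeting": 0.01, "prince": 0.05,  "agenda": 0.005},
    "ham":  {"free": 0.005,  "meeting": 0.08, "prince": 0.001, "agenda": 0.06},
}
log_prior = {"spam": math.log(0.3), "ham": math.log(0.7)}

def score(words, k):
    return log_prior[k] + sum(math.log(p[k][w]) for w in words)

print(max(p, key=lambda k: score(["free", "meeting", "agenda"], k)))  # ham
print(max(p, key=lambda k: score(["free", "prince", "free"], k)))     # spam
```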

The classifier scales effortlessly to vocabularies of tens or hundreds of thousands of words because each parameter is just a count, and prediction touches only the words present in the email. Training on a million emails takes seconds. Updating the model when new spam appears is a matter of incrementing a few counters. None of this is glamorous, and that is precisely the point.
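The incremental update really is counter arithmetic; a sketch, with hypothetical structures:

```python
from collections import Counter

# Online store of sufficient statistics: absorbing a newly labelled
# email is a handful of increments, no refit required.
doc_counts = Counter()                                # documents per class
token_counts = {"spam": Counter(), "ham": Counter()}  # word counts per class

def absorb(words, label):
    doc_counts[label] += 1             # updates the prior
    token_counts[label].update(words)  # updates the likelihoods

absorb(["free", "prince", "free"], "spam")
```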

The same recipe transfers directly to non-spam tasks. Classifying a clinical note as "fall risk" versus "no fall risk" reuses the spam pipeline with a different vocabulary; the per-class log-weights then tell you which terms (gait, dizziness, syncope) are pushing towards risk and which (independent, mobile, steady) are pushing away. That kind of audit trail is hard to extract from a transformer and easy to read off a Naive Bayes table.
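Reading the table is a one-liner: rank words by the gap between their per-class log-weights. A sketch, assuming the two log-probability tables have already been fitted:

```python
def top_terms(log_theta_risk, log_theta_norisk, n=5):
    """Rank terms by the gap in per-class log-weights.

    Both arguments map each word to its smoothed log P(word | class).
    Large positive gaps mark words pushing towards "fall risk",
    large negative ones towards "no fall risk".
    """
    gaps = {w: log_theta_risk[w] - log_theta_norisk[w] for w in log_theta_risk}
    return sorted(gaps, key=gaps.get, reverse=True)[:n]
```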

Strengths and weaknesses

Naive Bayes is fast in both senses. Training is one pass over the data to compute counts or sufficient statistics, linear in the number of training examples and features. Prediction is a single weighted sum per class. There is no hyperparameter search of any consequence (the smoothing constant $\alpha$ is the only knob, and the default $\alpha=1$ usually works). The model is interpretable: every feature contributes a fixed log-weight per class, and you can read off which features push hardest towards which label. Because each feature's contribution is a one-dimensional summary, the method is robust to many irrelevant features: they tend to contribute roughly equal log-likelihoods across classes and cancel out. It also works in the small-data regime, where more flexible models would overfit, because the per-feature marginals are stable.

The weaknesses come from the same source as the strengths. The independence assumption is wrong: in real text, "New" and "York" co-occur far more than independence would predict, and treating their evidence as independent double-counts the signal. As a consequence, Naive Bayes tends to be badly calibrated: its posterior probabilities are systematically too confident, often pinned near 0 or 1 even when the truth is closer to 50-50. If you want a probability that genuinely reflects uncertainty (for triage thresholds, expected-value calculations, or stacking), you must apply post-hoc calibration such as Platt scaling or isotonic regression. The model also cannot capture feature interactions: it cannot learn that two features matter only when both are present, because the contribution of each is a fixed per-class log-weight regardless of context. And Gaussian Naive Bayes specifically can perform poorly when the per-class feature distributions are far from Gaussian.
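If calibrated probabilities matter, post-hoc calibration wraps the classifier without changing its decision machinery. A sketch using scikit-learn, assuming count features `X` and labels `y` are already prepared:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV

# Isotonic regression refits the mapping from NB scores to probabilities;
# method="sigmoid" would give Platt scaling instead.
clf = CalibratedClassifierCV(MultinomialNB(), method="isotonic", cv=5)
# clf.fit(X, y); clf.predict_proba(X_new) then returns calibrated posteriors.
```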

Where Naive Bayes lives in 2026

Naive Bayes remains a genuine baseline rather than a museum piece. Spam filters in modern email systems still incorporate Naive Bayes scores as one feature among many, alongside neural classifiers and behavioural signals. Smaller-scale document classification (sorting customer-support tickets, tagging news feeds, triaging research abstracts) frequently runs on Multinomial Naive Bayes because it is good enough, costs almost nothing to train, and is easy to inspect. In educational and rapid-prototyping settings, it is often the first classifier reached for: it gets a result on a new dataset within minutes and tells you whether the problem is hard before you invest in a transformer.

For serious text tasks at scale, neural classifiers (fine-tuned transformer encoders, LLM-based zero-shot or few-shot pipelines) now dominate. They capture word order, long-range context, and feature interactions that Naive Bayes throws away. Even so, Naive Bayes survives because of the engineering economics: it runs on a CPU, fits in a few megabytes, and updates cheaply. It also survives as a sanity check: if your million-parameter model cannot beat Multinomial Naive Bayes on your dataset, your model is broken or your features are wrong.

It is also a useful teaching device. The arithmetic is transparent enough to do by hand on a small example, the assumption is explicit enough to argue with, and the connection to Bayes' theorem makes the link to the broader probabilistic framework concrete. Students who first meet Naive Bayes tend to internalise the prior–likelihood–posterior structure in a way that abstract treatments rarely achieve, and the lesson generalises to logistic regression, Bayesian networks, and the probabilistic deep models we meet in later chapters.

What you should take away

  1. Naive Bayes applies Bayes' theorem with one simplifying assumption: features are conditionally independent given the class. The assumption is usually wrong, yet the classifier often works.
  2. The trick reduces $d$-dimensional density estimation to $d$ one-dimensional problems, accepting extra bias in exchange for lower variance, which is useful when data are scarce.
  3. The three standard variants (Multinomial, Bernoulli, Gaussian) match the three common feature types: counts, binary indicators, continuous values.
  4. Strengths are speed, interpretability, robustness to irrelevant features, and small-data competence; weaknesses are poorly calibrated probabilities and an inability to model feature interactions.
  5. In 2026 Naive Bayes is mostly a baseline and an educational anchor rather than a production champion, but it remains the right first move when you want a working classifier in five minutes.
