4.15 Worked mini-project: calibrating a spam filter
Spam classification is a wonderful first project because every step is small, every number can be checked by hand, and the result is genuinely useful. We will train a Naive Bayes classifier from a tiny made-up corpus, use it to predict whether a single new email is spam, ask whether the predicted probabilities can be trusted, repair an obvious failure mode using smoothing, and finally judge the classifier's quality with the standard metrics from information retrieval.
The setup
Imagine you are setting up a spam filter for a small email service. You have collected a labelled training set of 1000 emails: 200 of them have been marked as spam by users, the remaining 800 are honest mail (commonly called ham). You will not look at the raw text. Instead you reduce each email to a vector of five binary features, one for each word in a deliberately small vocabulary: {free, click, money, dear, regards}. Feature $x_j$ is 1 if the corresponding word appears anywhere in the email and 0 if it does not. Whether a word occurs once or seventeen times is ignored; this is the simplest possible bag-of-words representation.
Why such a tiny vocabulary? Because every quantity in the worked example will be a small number you can verify on a piece of paper. Real production spam filters use vocabularies of tens of thousands of words and correspondingly long feature vectors, but the mathematics is identical; only the bookkeeping changes.
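To make the representation concrete, here is a minimal sketch of the featurisation step in Python; the vocabulary list and the helper name `featurize` are illustrative rather than part of any particular library.

```python
# Binary bag-of-words featurisation over the five-word vocabulary.
# A feature is 1 if the word appears at least once, 0 otherwise;
# repeated occurrences are deliberately ignored.

VOCAB = ["free", "click", "money", "dear", "regards"]

def featurize(email_text: str) -> list[int]:
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in VOCAB]

# An email containing "click" and "money" but none of the other three:
print(featurize("Please click here to claim your money today"))
# -> [0, 1, 1, 0, 0]
```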
The label $Y$ takes the value 1 for spam and 0 for ham. The feature vector $\mathbf{x} = (x_1, x_2, x_3, x_4, x_5)$ records which of the five vocabulary words appeared. Our three goals for the rest of the section are: (i) train a Naive Bayes classifier from the 1000 labelled emails using maximum likelihood; (ii) check whether the resulting probability outputs are well calibrated; and (iii) score a brand-new email, one that was not in the training set, and decide whether to send it to the inbox or the junk folder.
That programme is short to state but it touches every important idea in the chapter. Each of the next subsections takes one of these goals and walks through it in detail.
Estimating priors and likelihoods (MLE)
The first quantity we need is the prior, the base rate of spam in the world the classifier will operate in. Without any information about the email's content, what fraction of incoming mail is spam? With 200 spam in our 1000 training emails, the maximum likelihood estimate is simply the proportion:
$$ \hat P(Y=1) = \frac{200}{1000} = 0.2, \qquad \hat P(Y=0) = \frac{800}{1000} = 0.8. $$
Section 4.12 derived this result formally. The likelihood of 200 spam in 1000 trials under a Bernoulli model with parameter $\pi$ is $\pi^{200}(1-\pi)^{800}$; differentiating its logarithm and setting the derivative to zero gives $\hat\pi = 200/1000$. The MLE is just the empirical fraction. That is reassuring rather than surprising, but it is worth pausing to notice that the formula we derived from a long calculation is the same number a sensible person would have written down without any calculation at all.
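For completeness, the one-line version of that calculation:

$$ \ell(\pi) = 200\log\pi + 800\log(1-\pi), \qquad \ell'(\pi) = \frac{200}{\pi} - \frac{800}{1-\pi} = 0 \;\Longrightarrow\; \hat\pi = \frac{200}{1000} = 0.2. $$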
Next we need the likelihoods: for each word and each class, the probability that the word appears given the class. Again, the MLE is the empirical fraction. For word $j$ and class $y$,
$$ \hat P(x_j = 1 \mid Y = y) = \frac{n_{j, y}}{n_y}, $$
where $n_{j, y}$ counts how many emails of class $y$ contain word $j$, and $n_y$ counts how many emails are in class $y$. Suppose, after going through the training set, you have tabulated the following:
| Word | $\hat P(x_j = 1 \mid \text{spam})$ | $\hat P(x_j = 1 \mid \text{ham})$ |
|---|---|---|
| free | 0.7 | 0.05 |
| click | 0.6 | 0.1 |
| money | 0.5 | 0.02 |
| dear | 0.3 | 0.4 |
| regards | 0.05 | 0.5 |
Read the first row as: 70 per cent of spam emails contain the word "free", whereas only 5 per cent of ham emails do. The numbers are made up, but they are entirely typical: marketing words like "free", "click", and "money" are far more frequent in spam, while polite signatures like "regards" and salutations like "dear" lean towards ham.
Notice we are storing only $\hat P(x_j = 1 \mid y)$, the probability that word $j$ is present. The complementary probability that word $j$ is absent is $1 - \hat P(x_j = 1 \mid y)$, by the rules of probability. We will need both forms in the next subsection.
This is the moment to confront the word naive. A real email's words are correlated: "click" and "here" appear together; so do "best" and "regards". A faithful model would store the joint distribution over all five binary indicators, which has $2^5 = 32$ entries per class and $2^{10000}$ entries for a realistic vocabulary. That is hopeless. Naive Bayes makes the deliberately wrong assumption that the words are conditionally independent given the label. This collapses the joint into a product of single-word probabilities and replaces $2^{10000}$ parameters with $2 \times 10000$. The assumption is a lie, but it is a useful lie: it lets us fit the model from realistic amounts of data, and in practice it usually works well enough to be a strong baseline.
The five-row table above, together with the two-number prior, is the entire trained model. There is nothing else.
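As a sketch of the training step in code, here is the same counting written out, assuming the training set is a list of (feature vector, label) pairs produced by a featuriser like the one above; the function name `estimate_likelihoods` is illustrative.

```python
def estimate_likelihoods(dataset, n_features=5):
    """Maximum likelihood estimates of P(x_j = 1 | y) for y in {0, 1}.

    `dataset` is a list of (features, label) pairs, where `features` is a
    list of 0/1 indicators and `label` is 0 for ham, 1 for spam.
    """
    counts = {0: [0] * n_features, 1: [0] * n_features}  # n_{j,y}
    totals = {0: 0, 1: 0}                                # n_y
    for features, label in dataset:
        totals[label] += 1
        for j, x in enumerate(features):
            counts[label][j] += x
    return {y: [counts[y][j] / totals[y] for j in range(n_features)]
            for y in (0, 1)}
```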
Naive Bayes prediction
A new email arrives. After scanning the vocabulary you observe that it contains the words "click" and "money" but none of the other three. The feature vector is therefore
$$ \mathbf{x} = (x_{\text{free}}, x_{\text{click}}, x_{\text{money}}, x_{\text{dear}}, x_{\text{regards}}) = (0, 1, 1, 0, 0). $$
Should this email go to the inbox or the spam folder? We need $P(\text{spam} \mid \mathbf{x})$, the posterior probability that the email is spam given everything we observed. Bayes' theorem (§4.3) says
$$ P(Y = y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid Y = y)\, P(Y = y)}{P(\mathbf{x})}. $$
Under the Naive Bayes assumption, the class-conditional likelihood factorises into a product over the five words:
$$ P(\mathbf{x} \mid Y = y) = \prod_{j=1}^{5} P(x_j \mid Y = y). $$
Plug in the spam column of the table, remembering that for words that did not appear we use $1 - P(x_j = 1 \mid \text{spam})$:
$$ \begin{aligned} P(\mathbf{x} \mid \text{spam}) &= (1 - 0.7)(0.6)(0.5)(1 - 0.3)(1 - 0.05) \\ &= 0.3 \cdot 0.6 \cdot 0.5 \cdot 0.7 \cdot 0.95 \\ &= 0.0599. \end{aligned} $$
Repeat for the ham column:
$$ \begin{aligned} P(\mathbf{x} \mid \text{ham}) &= (1 - 0.05)(0.1)(0.02)(1 - 0.4)(1 - 0.5) \\ &= 0.95 \cdot 0.1 \cdot 0.02 \cdot 0.6 \cdot 0.5 \\ &= 0.00057. \end{aligned} $$
The class-conditional likelihood for spam is more than a hundred times larger than for ham. That is a strong signal, but Bayes' theorem demands that we also weight by the prior. Multiply each likelihood by its prior:
$$ P(\mathbf{x} \mid \text{spam})\, P(\text{spam}) = 0.0599 \times 0.2 = 0.01198, $$ $$ P(\mathbf{x} \mid \text{ham})\, P(\text{ham}) = 0.00057 \times 0.8 = 0.000456. $$
The denominator $P(\mathbf{x})$ in Bayes' theorem is the same for both classes, so we can compute it by adding these two unnormalised numbers; that is the law of total probability in action:
$$ P(\mathbf{x}) = 0.01198 + 0.000456 = 0.012436. $$
Dividing through gives the posterior:
$$ P(\text{spam} \mid \mathbf{x}) = \frac{0.01198}{0.012436} \approx 0.963. $$
The model is 96.3 per cent confident the email is spam. Under a 0.5 threshold we send it to the junk folder. Notice how the prior pulled the answer towards ham (since most mail is ham), but the evidence, the presence of two strongly spam-associated words and the absence of the tell-tale ham words "dear" and "regards", overwhelmed the prior and pushed the posterior past 0.96.
In production you would typically work in log space rather than with raw probabilities. A real email with thousands of word features would produce class-conditional likelihoods so small that they would underflow standard floating point. Computing $\log P(\mathbf{x} \mid y) + \log P(y)$ as a sum of logarithms avoids that problem and changes nothing else: the class with the larger log-posterior is the class with the larger posterior.
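Here is a minimal sketch of that log-space computation, using the trained numbers from this section; the function name `posterior_spam` is illustrative.

```python
import math

PRIOR = {1: 0.2, 0: 0.8}  # spam, ham
# P(word present | class), in vocabulary order: free, click, money, dear, regards
LIKELIHOOD = {
    1: [0.7, 0.6, 0.5, 0.3, 0.05],   # spam
    0: [0.05, 0.1, 0.02, 0.4, 0.5],  # ham
}

def posterior_spam(x):
    """P(spam | x) for a binary feature vector x, computed via sums of logs."""
    log_joint = {}
    for y in (0, 1):
        s = math.log(PRIOR[y])
        for p, xj in zip(LIKELIHOOD[y], x):
            s += math.log(p if xj else 1.0 - p)
        log_joint[y] = s
    # Normalise with the log-sum-exp trick to avoid underflow.
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return math.exp(log_joint[1] - m) / z

print(round(posterior_spam([0, 1, 1, 0, 0]), 3))  # -> 0.963
```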
Calibration: are predicted probabilities accurate?
The classifier just told us its confidence is 96 per cent. Should we believe it? A model that outputs $\hat p$ is well calibrated when, among all the emails for which it predicts probability $\hat p$, the empirical fraction that really are spam is close to $\hat p$. If you take every email the model rates at 0.7 and find that about 70 per cent of them turn out to be spam, the model is calibrated. If only 50 per cent of those 0.7-rated emails are spam, the model is over-confident: its probabilities are too extreme. Naive Bayes is famously over-confident, because the independence assumption causes the same evidence to be counted multiple times across correlated features, pushing outputs implausibly close to 0 or 1.
Three standard tools diagnose calibration. A reliability diagram bins the model's predictions, say into ten bins of width 0.1, and plots, for each bin, the empirical fraction of positives against the average predicted probability. A perfectly calibrated model lies on the diagonal; a curve below the diagonal reveals over-confidence. The expected calibration error (ECE) summarises the diagram in a single number,
$$ \mathrm{ECE} = \sum_{b=1}^B \frac{|B_b|}{N} \,\big|\bar y_b - \bar p_b\big|, $$
a weighted average of the gaps between empirical accuracy $\bar y_b$ and average predicted confidence $\bar p_b$ in each bin $B_b$. A value of 0 is perfect; values above 0.05 are typically considered poor. The Brier score is the mean squared difference between predicted probability and binary outcome and is strictly proper: it is minimised in expectation only by the true probabilities, so it rewards both calibration and sharpness.
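Both diagnostics are a few lines of code. The sketch below assumes you have arrays of predicted probabilities and true 0/1 labels from a held-out set; `brier` and `ece` are illustrative helper names.

```python
import numpy as np

def brier(p, y):
    """Mean squared difference between predicted probability and binary outcome."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def ece(p, y, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (p >= lo) & (p < hi) if hi < 1.0 else (p >= lo) & (p <= hi)
        if in_bin.any():
            gap = abs(y[in_bin].mean() - p[in_bin].mean())
            total += in_bin.mean() * gap  # weight |B_b| / N times the gap
    return total
```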
Three repair methods follow. Platt scaling fits a one-dimensional logistic regression mapping the model's logit to a recalibrated probability on a held-out set. Temperature scaling is the simpler variant for softmax classifiers: divide the logits by a single learned scalar $T > 1$, which softens over-confident outputs without changing the rank order of predictions, so accuracy is unchanged. Isotonic regression is more flexible, fitting an arbitrary monotonic function, but it needs more validation data and can over-fit. For this chapter the take-home is the diagnosis, not the cure: always check calibration before trusting predicted probabilities.
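In the binary case, temperature scaling amounts to dividing the log-odds by $T$ before applying the sigmoid. A sketch of applying a given temperature is below; fitting $T$ (by minimising log loss on a validation set) is omitted, and the name `recalibrate` is illustrative.

```python
import math

def recalibrate(p, temperature):
    """Soften a predicted probability by scaling its log-odds by 1/T (T > 1 softens)."""
    logit = math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-logit / temperature))

print(round(recalibrate(0.963, 2.0), 3))  # the 0.963 prediction softens to about 0.836
```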
Smoothing the estimates
There is a quiet bug lurking in the MLE recipe. Suppose the training set happens to contain zero ham emails with the word "free". The MLE then says $\hat P(\text{free} \mid \text{ham}) = 0/n_0 = 0$. As soon as a new email contains "free", even alongside a hundred ham-flavoured cues, the class-conditional likelihood for ham becomes $0 \times \cdots = 0$, the posterior collapses to $P(\text{ham} \mid \mathbf{x}) = 0$, and the classifier confidently announces spam regardless of every other word. A single missing observation has erased a class of evidence forever. This is sometimes called the zero-frequency problem.
The standard fix is Laplace smoothing, also called add-one smoothing. Replace the maximum likelihood estimate with
$$ \hat P(x_j = 1 \mid y) = \frac{n_{j, y} + 1}{n_y + 2}, $$
adding one pseudo-count to each cell of the count table. The "+2" in the denominator preserves probabilistic consistency: the estimated probabilities of "word present" and "word absent" still sum to one. With Laplace smoothing the estimate for an unseen feature is $1/(n_y + 2)$ rather than 0, small but non-zero, so it can be overridden by stronger evidence rather than vetoing it absolutely.
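In code this is a one-line change to the earlier counting sketch; as before, the function name is illustrative.

```python
def estimate_likelihoods_smoothed(dataset, n_features=5, pseudo=1):
    """Laplace-smoothed estimates (n_{j,y} + pseudo) / (n_y + 2 * pseudo)."""
    counts = {0: [0] * n_features, 1: [0] * n_features}
    totals = {0: 0, 1: 0}
    for features, label in dataset:
        totals[label] += 1
        for j, x in enumerate(features):
            counts[label][j] += x
    return {y: [(counts[y][j] + pseudo) / (totals[y] + 2 * pseudo)
                for j in range(n_features)]
            for y in (0, 1)}
```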
Laplace smoothing has a clean Bayesian interpretation. Adding one pseudo-count to each cell is exactly the maximum a posteriori (MAP) estimate under a Beta(2, 2) prior on the Bernoulli parameter, a mild prior belief that the probability is somewhere near a half. Larger pseudo-counts correspond to stronger priors. The chapter on statistics (Chapter 5) returns to this idea under the heading of regularisation; for now, the practical lesson is that MLE alone is brittle on small samples and a one-pseudo-count cushion is almost always worth its small bias.
Evaluation: precision, recall, F1
Calibration is about the honesty of the probabilities. Final evaluation is about the decisions, the binary spam/ham judgements you actually make after thresholding the probability. Run the classifier over a held-out test set and tabulate the four possibilities into a confusion matrix. True positives (TP) are spam emails correctly sent to junk; false negatives (FN) are spam wrongly sent to the inbox; false positives (FP) are ham wrongly sent to junk; true negatives (TN) are ham correctly sent to the inbox.
From those counts come three standard scalars:
$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \qquad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. $$
Precision answers: of the emails I sent to junk, what fraction really were spam? Recall answers: of the spam emails out there, what fraction did I catch? The two are usually in tension, and the F1 score (their harmonic mean) is a single summary that punishes imbalance: it is small if either precision or recall is small.
The threshold is your knob. By default we predict spam when $\hat P(\text{spam} \mid \mathbf{x}) > 0.5$, but nothing about probability theory pins us to that value. For email, false positives are catastrophic: a birthday card from your aunt lost to the junk folder is far worse than a piece of spam that slipped through, so production filters often raise the threshold to 0.9 or higher, sacrificing recall for precision. A medical screening test inverts the calculus: missing a real disease (a false negative) is much worse than a false alarm, so the threshold drops, accepting many false positives in order to catch every true case. The same trained model can be tuned to either operating point simply by sliding the threshold along the predicted probability axis, as the sketch below illustrates.
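Here is a sketch of thresholding held-out posteriors and scoring the result; the data and helper names are made up for illustration.

```python
def confusion_counts(probs, labels, threshold=0.5):
    """Count TP, FP, FN, TN when predicting spam for P(spam | x) > threshold."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up held-out predictions and labels; raising the threshold trades recall for precision.
test_probs  = [0.96, 0.40, 0.85, 0.10, 0.70, 0.05]
test_labels = [1,    0,    1,    0,    0,    0]
for threshold in (0.5, 0.9):
    counts = confusion_counts(test_probs, test_labels, threshold)
    print(threshold, precision_recall_f1(*counts[:3]))
```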
What you should take away
- Naive Bayes is Bayes' rule plus an independence assumption. You estimate one prior and a small table of per-feature likelihoods using maximum likelihood, then combine them with Bayes' theorem to get the posterior over labels. Worked through carefully, the email "click + money" lands at $P(\text{spam} \mid \mathbf{x}) \approx 0.963$.
- Maximum likelihood means counting. Both the priors and the likelihoods are empirical fractions. The complicated calculus of Section 4.12 reduces, for Bernoulli features, to dividing a count by another count.
- Predicted probabilities need a separate calibration check. A classifier can be accurate yet over-confident. Reliability diagrams, ECE, and the Brier score diagnose calibration; temperature and Platt scaling repair it.
- Zero counts are a bug, not a feature. Laplace smoothing, adding one pseudo-count per cell, is the cheapest and most common fix, and it has a clean Bayesian reading as a Beta(2, 2) MAP estimate.
- Decisions are not probabilities. Once you threshold the posterior you face the precision–recall trade-off. Move the threshold up to favour precision, down to favour recall, and choose the operating point that matches the relative cost of the two error types in your application.