Naive Bayes is a probabilistic classifier based on Bayes' theorem with the naive (often-false but useful) assumption that features are conditionally independent given the class. For features $x = (x_1, \ldots, x_d)$ and class $y \in \{1, \ldots, K\}$,
$$P(y | x) = \frac{P(y) \prod_{i=1}^d P(x_i | y)}{P(x)} \propto P(y) \prod_i P(x_i | y)$$
The classifier assigns $\hat y = \arg\max_y P(y) \prod_i P(x_i | y)$.
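In practice the maximization is carried out in log space to avoid numerical underflow, so the decision rule is equivalently
$$\hat y = \arg\max_y \left[ \log P(y) + \sum_{i=1}^d \log P(x_i | y) \right]$$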
The conditional independence assumption is rarely true in practice, but the resulting models often perform surprisingly well, particularly when (a) the dependence structure is mild, (b) training data is limited, so a low-parameter model has lower variance, and (c) the class boundary remains well separated even under the misspecified model.
Maximum likelihood estimation of parameters from training data $\{(x_n, y_n)\}_{n=1}^N$:
Class priors: $\hat P(y) = \frac{1}{N} \sum_n \mathbb{1}[y_n = y]$
Feature likelihoods depend on the data type:
Multinomial Naive Bayes (text classification, count data):
$$\hat P(x_i | y) = \frac{\sum_n x_{ni} \mathbb{1}[y_n = y] + \alpha}{\sum_n \sum_j x_{nj} \mathbb{1}[y_n = y] + \alpha d}$$
where Laplace (add-$\alpha$) smoothing with $\alpha > 0$ prevents zero probabilities for features unseen in a class. This is the standard choice for text classification (spam filtering, topic classification), with each $x_i$ a word count or TF-IDF weight.
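A minimal NumPy sketch of these estimates (the class priors above and the smoothed multinomial likelihoods); the toy count matrix, labels, and variable names are purely illustrative:

```python
import numpy as np

# Toy corpus: 4 documents over a 3-word vocabulary, 2 classes.
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 0, 3]])            # word counts x_{ni}
y = np.array([0, 0, 1, 1])           # class labels y_n
alpha = 1.0                          # add-alpha smoothing
K, d = 2, X.shape[1]

# Class priors: hat P(y) = (1/N) sum_n 1[y_n = y]
log_prior = np.log(np.bincount(y, minlength=K) / len(y))

# Smoothed multinomial likelihoods hat P(x_i | y), one row per class.
counts = np.array([X[y == k].sum(axis=0) for k in range(K)])          # (K, d)
log_lik = np.log((counts + alpha)
                 / (counts.sum(axis=1, keepdims=True) + alpha * d))   # (K, d)
```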
Gaussian Naive Bayes (continuous features):
$$P(x_i | y) = \mathcal{N}(x_i | \mu_{iy}, \sigma_{iy}^2)$$
with $\mu_{iy}, \sigma_{iy}^2$ the empirical mean and variance of feature $i$ within class $y$.
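A small NumPy sketch of the Gaussian variant under the same recipe, with toy continuous data chosen purely for illustration:

```python
import numpy as np

# Toy continuous data: 2 features, 2 classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]])
y = np.array([0, 0, 1, 1])
K = 2

mu  = np.array([X[y == k].mean(axis=0) for k in range(K)])        # mu_{iy}
var = np.array([X[y == k].var(axis=0) for k in range(K)]) + 1e-9  # sigma^2_{iy}, small floor
log_prior = np.log(np.bincount(y, minlength=K) / len(y))

def predict(x):
    # Sum of log N(x_i | mu_{iy}, sigma^2_{iy}) over features, plus log prior.
    log_like = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(log_prior + log_like))

print(predict(np.array([1.1, 1.9])))   # -> 0
```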
Bernoulli Naive Bayes (binary features): $P(x_i = 1 | y) = p_{iy}$, with $p_{iy}$ estimated as the (smoothed) fraction of class-$y$ training examples having $x_i = 1$; unlike the multinomial model, absent features also contribute evidence through the factor $1 - p_{iy}$.
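A corresponding sketch for the Bernoulli variant, assuming binary feature vectors and the same add-$\alpha$ smoothing idea (toy data again illustrative):

```python
import numpy as np

# Toy binary data: 3 features, 2 classes.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([0, 0, 1, 1])
alpha, K = 1.0, 2

# Smoothed estimate of p_{iy} = P(x_i = 1 | y).
p = np.array([(X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
              for k in range(K)])
log_prior = np.log(np.bincount(y, minlength=K) / len(y))

def predict(x):
    # Present features contribute log p, absent features contribute log (1 - p).
    log_like = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    return int(np.argmax(log_prior + log_like))

print(predict(np.array([1, 1, 0])))   # -> 0
```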
Inference is fast and embarrassingly parallel across examples: each prediction reduces to a single matrix-vector product in log space.
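For example, given any fitted `log_prior` of shape $(K,)$ and `log_lik` of shape $(K, d)$ (the values below are illustrative), batch prediction is one matrix product plus an argmax:

```python
import numpy as np

# Illustrative multinomial parameters: K = 2 classes, d = 3 features.
log_prior = np.log(np.array([0.5, 0.5]))
log_lik = np.log(np.array([[0.6, 0.3, 0.1],
                           [0.1, 0.3, 0.6]]))

X_new = np.array([[4, 1, 0],        # batch of count vectors to classify
                  [0, 2, 5]])

scores = X_new @ log_lik.T + log_prior   # (n, K) unnormalized log posteriors
y_hat = scores.argmax(axis=1)
print(y_hat)                             # [0 1]
```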
Practical strengths:
- Fast training and inference, closed-form parameter estimates, simple prediction.
- Few hyperparameters: essentially only the smoothing parameter $\alpha$.
- Works well with little data because the model's low capacity combats overfitting.
- Strong baseline for text classification, often within a few percent of more sophisticated methods on simple tasks.
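As a concrete illustration of that baseline, a minimal scikit-learn pipeline sketch; the texts and labels here are placeholders, not a real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus; a real task would use a labeled dataset.
texts = ["free prize, claim now", "meeting at noon tomorrow",
         "win cash instantly", "agenda for the project review"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = ham

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)
print(clf.predict(["claim your free cash prize"]))   # likely [1]
```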
Practical weaknesses:
- Poor calibration: predicted probabilities tend to be over-confident because the violated independence assumption lets the joint likelihood multiply many redundant evidence terms.
- Feature engineering matters: the model relies on the user choosing informative, roughly independent features.
- Cannot capture feature interactions: XOR-like problems are unlearnable (see the small demonstration below).
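A tiny demonstration of the XOR failure: under the factorized model every estimated conditional probability comes out to 0.5, so the posterior carries no information and accuracy is no better than chance.

```python
import numpy as np

# XOR: the label is 1 exactly when the two binary features differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Per-class estimates of P(x_i = 1 | y): every entry is 0.5, so both classes
# assign identical likelihood to every input and the posterior is uninformative.
p = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
print(p)   # [[0.5 0.5]
           #  [0.5 0.5]]
```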
Naive Bayes was a dominant text classification method in the 1990s and remained a widely used baseline through the 2000s and early 2010s. Modern transformer-based classifiers have surpassed it in raw accuracy, but Naive Bayes remains a useful, fast, interpretable baseline.
Related terms: Bayes' Theorem, Logistic Regression
Discussed in:
- Chapter 7: Supervised Learning