The Bernoulli distribution $\mathrm{Bernoulli}(p)$ models a single binary outcome, such as a coin flip with success probability $p$. Named after the Swiss mathematician Jacob Bernoulli (1654–1705), it is the atom from which most discrete distributions in statistics and machine learning are built.
Definition
For $X \sim \mathrm{Bernoulli}(p)$:
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
Equivalently, the probability mass function is $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$.
Moments:
- Mean: $\mathbb{E}[X] = p$
- Variance: $\mathrm{Var}(X) = p(1-p)$, maximised at $p = 0.5$ where it equals $1/4$
- Entropy: $H(X) = -p \log p - (1-p) \log(1-p)$, maximised at $p = 0.5$ where it equals $\log 2$
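A minimal numerical sketch of the PMF and these moments (NumPy, with an arbitrary $p = 0.3$; the Monte Carlo check is purely illustrative):

```python
import numpy as np

def bernoulli_pmf(x, p):
    """P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.3
mean = p                                             # E[X] = p
var = p * (1 - p)                                    # Var(X) = p(1 - p)
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)   # in nats; log 2 at p = 0.5

# Monte Carlo check of the closed forms above
samples = np.random.default_rng(0).random(100_000) < p
print(bernoulli_pmf(1, p), mean, var, entropy)
print(samples.mean(), samples.var())                 # ≈ 0.3 and ≈ 0.21
```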
In machine learning
The Bernoulli is the foundation of binary classification: each label $y_n \in \{0, 1\}$ is modelled as Bernoulli with parameter $p_n = P(y_n = 1 \mid \mathbf{x}_n)$ predicted by the classifier. Maximising the likelihood of the labels is equivalent to minimising the binary cross-entropy loss:
$$\mathcal{L}(\boldsymbol{\theta}) = -\sum_n \left[ y_n \log p_n + (1 - y_n) \log(1 - p_n) \right]$$
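A short sketch of this loss computed directly from labels and predicted probabilities (the clipping constant is an illustrative safeguard against $\log 0$, not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative Bernoulli log-likelihood, summed over examples."""
    p = np.clip(p, eps, 1 - eps)            # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])                  # illustrative labels
p = np.array([0.9, 0.2, 0.6, 0.55])         # illustrative predicted probabilities
print(binary_cross_entropy(y, p))
```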
Logistic regression parameterises $p_n$ as the sigmoid of a linear function of the inputs:
$$p_n = \sigma(\mathbf{w}^\top \mathbf{x}_n + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_n + b)}}$$
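A self-contained sketch of logistic regression fitted by plain gradient descent on the loss above (the toy data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data, purely for illustration
X = np.array([[0.5, 1.2], [-0.3, 0.8], [1.5, -0.4]])
y = np.array([1.0, 0.0, 1.0])
w, b = np.zeros(2), 0.0

for _ in range(1000):                   # gradient descent on the summed BCE loss
    p = sigmoid(X @ w + b)
    w -= 0.1 * (X.T @ (p - y))          # d(loss)/dw = X^T (p - y)
    b -= 0.1 * np.sum(p - y)            # d(loss)/db = sum(p - y)

print(sigmoid(X @ w + b))               # fitted probabilities p_n
```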
The same form appears in the output layer of any neural binary classifier and as the per-pixel likelihood in a Bernoulli variational autoencoder for binarised images.
Conjugate prior
The Beta distribution is conjugate to the Bernoulli likelihood: with prior $\theta \sim \mathrm{Beta}(a, b)$ and observed data with $s$ successes and $f$ failures, the posterior is
$$\theta \mid \text{data} \sim \mathrm{Beta}(a + s, \, b + f)$$
The posterior mean $(a + s)/(a + b + s + f)$ is a smoothed estimate of the success rate, with prior parameters $a$ and $b$ acting as pseudo-counts. Laplace's rule of succession is the special case $a = b = 1$.
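A sketch of the conjugate update with the uniform $\mathrm{Beta}(1, 1)$ prior and a made-up sequence of outcomes:

```python
import numpy as np

a, b = 1.0, 1.0                             # Beta(1, 1): Laplace's rule of succession
data = np.array([1, 1, 0, 1, 0, 1, 1])      # illustrative observations
s, f = data.sum(), len(data) - data.sum()

a_post, b_post = a + s, b + f               # conjugate update: Beta(a + s, b + f)
print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean: smoothed success rate
```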
Generalisations
The Bernoulli sits at the bottom of a hierarchy of discrete distributions:
- Binomial $\mathrm{Bin}(n, p)$: sum of $n$ independent $\mathrm{Bernoulli}(p)$ variables; counts successes in $n$ trials.
- Categorical (multinoulli): a one-hot outcome over $K$ classes; generalises the Bernoulli to multiple outcomes.
- Multinomial: sum of $n$ independent categoricals; per-class counts over $n$ trials.
- Geometric: number of Bernoulli trials until the first success.
- Negative binomial: number of trials until the $r$-th success.
The categorical with softmax parameterisation is the multi-class generalisation of logistic regression.
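The relationships in this hierarchy can be checked by sampling; a sketch (parameters and sample sizes are arbitrary, and the geometric draw is truncated at 100 trials for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10

bern = rng.random((100_000, n)) < p                 # n Bernoulli(p) draws per row
binom = bern.sum(axis=1)                            # Binomial(n, p) as a sum of Bernoullis
print(binom.mean(), n * p)                          # both ≈ 3.0

geom = np.argmax(rng.random((100_000, 100)) < p, axis=1) + 1  # trials to first success
print(geom.mean(), 1 / p)                           # both ≈ 3.33

probs = np.array([0.2, 0.5, 0.3])                   # categorical over K = 3 classes
cat = rng.choice(3, size=100_000, p=probs)
print(np.bincount(cat) / 100_000)                   # empirical frequencies ≈ probs
```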
Throughout AI
Bernoulli random variables appear throughout AI: dropout regularisation (each unit independently kept with probability $p$), Bernoulli noise injection, discrete latent variables in VAEs, straight-through estimators for differentiable Bernoulli sampling, and Bayesian model averaging where each model is included with a Bernoulli inclusion variable.
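For example, dropout amounts to sampling a Bernoulli mask over units; a minimal sketch of the inverted-dropout convention (the function name and rescaling are illustrative choices):

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob."""
    mask = rng.random(activations.shape) < keep_prob   # Bernoulli(keep_prob) mask
    return activations * mask / keep_prob              # rescale to preserve the expectation

rng = np.random.default_rng(0)
h = rng.standard_normal(8)              # hypothetical layer activations
print(dropout(h, keep_prob=0.8, rng=rng))
```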
Related terms: Categorical Distribution, Logistic Regression, Cross-Entropy Loss, Dropout
Discussed in:
- Chapter 3: Calculus, Probability Foundations