The Bernoulli distribution $\mathrm{Bernoulli}(p)$ models a single binary outcome, such as a coin flip with success probability $p$. Named after the Swiss mathematician Jacob Bernoulli (1654–1705), it is the atom from which most discrete distributions in statistics and machine learning are built.
Definition
For $X \sim \mathrm{Bernoulli}(p)$:
$$P(X = 1) = p, \qquad P(X = 0) = 1 - p$$
Equivalently, the probability mass function is $P(X = x) = p^x (1-p)^{1-x}$ for $x \in \{0, 1\}$.
Moments:
- Mean: $\mathbb{E}[X] = p$
- Variance: $\mathrm{Var}(X) = p(1-p)$, maximised at $p = 0.5$ where it equals $1/4$
- Entropy: $H(X) = -p \log p - (1-p) \log(1-p)$, maximised at $p = 0.5$ where it equals $\log 2$
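A minimal numerical sketch of the PMF and these moments (NumPy, with an arbitrary $p = 0.3$; the Monte Carlo check is purely illustrative):

```python
import numpy as np

def bernoulli_pmf(x, p):
    """P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p**x * (1 - p)**(1 - x)

p = 0.3
mean = p                                             # E[X] = p
var = p * (1 - p)                                    # Var(X) = p(1 - p)
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)   # in nats; log 2 at p = 0.5

# Monte Carlo check of the closed forms above
samples = np.random.default_rng(0).random(100_000) < p
print(bernoulli_pmf(1, p), mean, var, entropy)
print(samples.mean(), samples.var())                 # ≈ 0.3 and ≈ 0.21
```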
In machine learning
The Bernoulli is the foundation of binary classification: each label $y_n \in \{0, 1\}$ is modelled as Bernoulli with parameter $p_n = P(y_n = 1 \mid \mathbf{x}_n)$ predicted by the classifier. Maximising the likelihood of the labels is equivalent to minimising the binary cross-entropy loss:
$$\mathcal{L}(\boldsymbol{\theta}) = -\sum_n \left[ y_n \log p_n + (1 - y_n) \log(1 - p_n) \right]$$
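A short sketch of this loss computed directly from labels and predicted probabilities (the clipping constant is an illustrative safeguard against $\log 0$, not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative Bernoulli log-likelihood, summed over examples."""
    p = np.clip(p, eps, 1 - eps)            # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])                  # illustrative labels
p = np.array([0.9, 0.2, 0.6, 0.55])         # illustrative predicted probabilities
print(binary_cross_entropy(y, p))
```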
Logistic regression parameterises $p_n$ as the sigmoid of a linear function of the inputs:
$$p_n = \sigma(\mathbf{w}^\top \mathbf{x}_n + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x}_n + b)}}$$
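A self-contained sketch of logistic regression fitted by plain gradient descent on the loss above (the toy data, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data, purely for illustration
X = np.array([[0.5, 1.2], [-0.3, 0.8], [1.5, -0.4]])
y = np.array([1.0, 0.0, 1.0])
w, b = np.zeros(2), 0.0

for _ in range(1000):                   # gradient descent on the summed BCE loss
    p = sigmoid(X @ w + b)
    w -= 0.1 * (X.T @ (p - y))          # d(loss)/dw = X^T (p - y)
    b -= 0.1 * np.sum(p - y)            # d(loss)/db = sum(p - y)

print(sigmoid(X @ w + b))               # fitted probabilities p_n
```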
The same form appears in the output layer of any neural binary classifier and as the per-pixel likelihood in a Bernoulli variational autoencoder for binarised images.
Conjugate prior
The Beta distribution is conjugate to the Bernoulli likelihood: with prior $\theta \sim \mathrm{Beta}(a, b)$ and observed data with $s$ successes and $f$ failures, the posterior is
$$\theta \mid \text{data} \sim \mathrm{Beta}(a + s, \, b + f)$$
The posterior mean $(a + s)/(a + b + s + f)$ is a smoothed estimate of the success rate, with prior parameters $a$ and $b$ acting as pseudo-counts. Laplace's rule of succession is the special case $a = b = 1$.
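A sketch of the conjugate update with the uniform $\mathrm{Beta}(1, 1)$ prior and a made-up sequence of outcomes:

```python
import numpy as np

a, b = 1.0, 1.0                             # Beta(1, 1): Laplace's rule of succession
data = np.array([1, 1, 0, 1, 0, 1, 1])      # illustrative observations
s, f = data.sum(), len(data) - data.sum()

a_post, b_post = a + s, b + f               # conjugate update: Beta(a + s, b + f)
print(a_post, b_post, a_post / (a_post + b_post))   # posterior mean: smoothed success rate
```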
Generalisations
The Bernoulli sits at the bottom of a hierarchy of discrete distributions:
- Binomial $\mathrm{Bin}(n, p)$: sum of $n$ independent $\mathrm{Bernoulli}(p)$ variables; counts successes in $n$ trials.
- Categorical (multinoulli): a one-hot outcome over $K$ classes; generalises the Bernoulli to multiple outcomes.
- Multinomial: sum of $n$ independent categoricals; per-class counts over $n$ trials.
- Geometric: number of Bernoulli trials until the first success.
- Negative binomial: number of trials until the $r$-th success.
The categorical with softmax parameterisation is the multi-class generalisation of logistic regression.
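The relationships in this hierarchy can be checked by sampling; a sketch (parameters and sample sizes are arbitrary, and the geometric draw is truncated at 100 trials for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10

bern = rng.random((100_000, n)) < p                 # n Bernoulli(p) draws per row
binom = bern.sum(axis=1)                            # Binomial(n, p) as a sum of Bernoullis
print(binom.mean(), n * p)                          # both ≈ 3.0

geom = np.argmax(rng.random((100_000, 100)) < p, axis=1) + 1  # trials to first success
print(geom.mean(), 1 / p)                           # both ≈ 3.33

probs = np.array([0.2, 0.5, 0.3])                   # categorical over K = 3 classes
cat = rng.choice(3, size=100_000, p=probs)
print(np.bincount(cat) / 100_000)                   # empirical frequencies ≈ probs
```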
Throughout AI
Bernoulli random variables appear throughout AI: dropout regularisation (each unit independently kept with probability $p$), Bernoulli noise injection, discrete latent variables in VAEs, straight-through estimators for differentiable Bernoulli sampling, and Bayesian model averaging where each model is included with a Bernoulli inclusion variable.
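For example, dropout amounts to sampling a Bernoulli mask over units; a minimal sketch of the inverted-dropout convention (the function name and rescaling are illustrative choices):

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob."""
    mask = rng.random(activations.shape) < keep_prob   # Bernoulli(keep_prob) mask
    return activations * mask / keep_prob              # rescale to preserve the expectation

rng = np.random.default_rng(0)
h = rng.standard_normal(8)              # hypothetical layer activations
print(dropout(h, keep_prob=0.8, rng=rng))
```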
Related terms: Categorical Distribution, Logistic Regression, Cross-Entropy Loss, Dropout
Discussed in:
- Chapter 3: Calculus, Probability Foundations