Also known as: multinoulli, generalised Bernoulli
The categorical distribution $\mathrm{Cat}(p_1, \ldots, p_K)$ models a single draw from $K$ classes with probabilities $p_k \geq 0$, $\sum_k p_k = 1$:
$$P(X = k) = p_k$$
Often encoded as a one-hot vector $y \in \{0, 1\}^K$ with $\sum_k y_k = 1$.
The categorical is the foundation of multi-class classification (each label is categorical with parameters predicted by the model) and language modelling (each token is a categorical over the vocabulary).
Maximum-likelihood training on data $\{(x_n, y_n)\}$ (labels one-hot encoded) with model $p_\theta(k \mid x)$ minimises the categorical cross-entropy:
$$\mathcal{L} = -\sum_n \sum_k y_{nk} \log p_\theta(k | x_n) = -\sum_n \log p_\theta(y_n | x_n)$$
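A minimal NumPy sketch of this loss (names are illustrative; in practice frameworks compute it jointly with softmax from logits for numerical stability):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, probs):
    """Mean negative log-likelihood of one-hot labels under predicted probabilities.

    y_onehot: (N, K) one-hot labels; probs: (N, K) rows summing to 1.
    """
    # The one-hot mask selects log p_theta(y_n | x_n) for each example.
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Example: 2 examples, 3 classes.
y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y, p))  # -(log 0.7 + log 0.6) / 2 ~= 0.434
```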
The softmax function maps real-valued logits to a categorical distribution: $p_k = e^{z_k} / \sum_j e^{z_j}$.
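A small sketch of a numerically stable softmax; subtracting the maximum logit before exponentiating is a standard trick and leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Map real-valued logits z to categorical probabilities."""
    z = z - np.max(z)      # shift invariance: softmax(z) == softmax(z - c)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```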
Dirichlet is the conjugate prior: with prior $\mathrm{Dir}(\alpha)$ and observed counts $n_k$ for each class, the posterior is $\mathrm{Dir}(\alpha + n)$.
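A sketch of the conjugate update (the prior and counts below are made up for illustration):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dir(1) prior (illustrative)
counts = np.array([8, 3, 1])        # observed class counts n_k (made up)
posterior = alpha + counts          # posterior is Dir(alpha + n)
# Posterior mean estimate of p_k: (alpha_k + n_k) / sum_j (alpha_j + n_j)
print(posterior / posterior.sum())  # ~[0.600, 0.267, 0.133]
```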
Multinomial is the categorical's $n$-trial counterpart: the joint distribution of class counts in $n$ independent categorical draws.
Sampling (cumulative-sum trick): partition $[0, 1)$ into $K$ consecutive intervals of width $p_k$, draw $u \sim \mathrm{Uniform}(0, 1)$, and return the index of the interval containing $u$. Time complexity $O(K)$.
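A sketch of this sampler, assuming NumPy; `np.searchsorted` locates the interval after the $O(K)$ cumulative sum:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_categorical(p):
    """Draw one index k with probability p[k] via the cumulative-sum trick."""
    u = rng.uniform()                    # u ~ Uniform(0, 1)
    cdf = np.cumsum(p)                   # right endpoints of the K intervals
    return int(np.searchsorted(cdf, u))  # index of the interval containing u

p = np.array([0.2, 0.5, 0.3])
draws = [sample_categorical(p) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # empirical frequencies, close to p
```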
Gumbel-max trick: $\arg\max_k (z_k + g_k)$ with $g_k \sim \mathrm{Gumbel}(0, 1)$ i.i.d. is an exact sample from the categorical with logits $z$. The argmax itself is not differentiable, so the Gumbel-softmax relaxation $\mathrm{softmax}((z + g)/\tau)$ is used instead: it is differentiable and approaches a one-hot sample as the temperature $\tau \to 0$, which allows backpropagation through discrete sampling in VAEs and other discrete latent-variable models.
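A NumPy sketch of both tricks (forward computation only; an autodiff framework would be needed to actually backpropagate through the relaxation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(z):
    """Exact categorical sample from logits z via the Gumbel-max trick."""
    g = rng.gumbel(size=z.shape)      # g_k ~ Gumbel(0, 1), i.i.d.
    return int(np.argmax(z + g))

def gumbel_softmax(z, tau=0.5):
    """Relaxation softmax((z + g) / tau); approaches one-hot as tau -> 0."""
    g = rng.gumbel(size=z.shape)
    y = (z + g) / tau
    y = y - y.max()                   # stabilise before exponentiating
    e = np.exp(y)
    return e / e.sum()

z = np.log(np.array([0.2, 0.5, 0.3]))  # logits whose softmax is (0.2, 0.5, 0.3)
print(gumbel_max_sample(z))            # a hard sample, e.g. 1
print(gumbel_softmax(z, tau=0.1))      # near one-hot at low temperature
```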
Related terms: Bernoulli Distribution, Softmax, Cross-Entropy Loss, Dirichlet Distribution
Discussed in:
- Chapter 4: Probability