Also known as: multinoulli, generalised Bernoulli
The categorical distribution $\mathrm{Cat}(p_1, \ldots, p_K)$ models a single draw from $K$ classes with probabilities $p_k \geq 0$, $\sum_k p_k = 1$:
$$P(X = k) = p_k$$
Often encoded as a one-hot vector $y \in \{0, 1\}^K$ with $\sum_k y_k = 1$.
The categorical is the foundation of multi-class classification (each label is categorical with parameters predicted by the model) and language modelling (each token is a categorical over the vocabulary).
Maximum-likelihood training on data $\{(x_n, y_n)\}$ (labels one-hot encoded) with model $p_\theta(k \mid x)$ minimises the categorical cross-entropy:
$$\mathcal{L} = -\sum_n \sum_k y_{nk} \log p_\theta(k | x_n) = -\sum_n \log p_\theta(y_n | x_n)$$
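A minimal NumPy sketch of this loss (names are illustrative; in practice frameworks compute it jointly with softmax from logits for numerical stability):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, probs):
    """Mean negative log-likelihood of one-hot labels under predicted probabilities.

    y_onehot: (N, K) one-hot labels; probs: (N, K) rows summing to 1.
    """
    # The one-hot mask selects log p_theta(y_n | x_n) for each example.
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

# Example: 2 examples, 3 classes.
y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y, p))  # -(log 0.7 + log 0.6) / 2 ~= 0.434
```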
The softmax function maps real-valued logits to a categorical distribution: $p_k = e^{z_k} / \sum_j e^{z_j}$.
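A small sketch of a numerically stable softmax; subtracting the maximum logit before exponentiating is a standard trick and leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Map real-valued logits z to categorical probabilities."""
    z = z - np.max(z)      # shift invariance: softmax(z) == softmax(z - c)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.659, 0.242, 0.099]
```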
Dirichlet is the conjugate prior: with prior $\mathrm{Dir}(\alpha)$ and observed counts $n_k$ for each class, the posterior is $\mathrm{Dir}(\alpha + n)$.
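A sketch of the conjugate update (the prior and counts below are made up for illustration):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dir(1) prior (illustrative)
counts = np.array([8, 3, 1])        # observed class counts n_k (made up)
posterior = alpha + counts          # posterior is Dir(alpha + n)
# Posterior mean estimate of p_k: (alpha_k + n_k) / sum_j (alpha_j + n_j)
print(posterior / posterior.sum())  # ~[0.600, 0.267, 0.133]
```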
Multinomial is the categorical's $n$-trial counterpart: the joint distribution of class counts in $n$ independent categorical draws.
Sampling (cumulative-sum trick): partition $[0, 1)$ into $K$ consecutive intervals of width $p_k$, draw $u \sim \mathrm{Uniform}(0, 1)$, and return the index of the interval containing $u$. Time complexity $O(K)$.
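A sketch of this sampler, assuming NumPy; `np.searchsorted` locates the interval after the $O(K)$ cumulative sum:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_categorical(p):
    """Draw one index k with probability p[k] via the cumulative-sum trick."""
    u = rng.uniform()                    # u ~ Uniform(0, 1)
    cdf = np.cumsum(p)                   # right endpoints of the K intervals
    return int(np.searchsorted(cdf, u))  # index of the interval containing u

p = np.array([0.2, 0.5, 0.3])
draws = [sample_categorical(p) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # empirical frequencies, close to p
```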
Gumbel-max trick: $\arg\max_k (z_k + g_k)$ with $g_k \sim \mathrm{Gumbel}(0, 1)$ i.i.d. is an exact sample from the categorical with logits $z$. The argmax itself is not differentiable, so the Gumbel-softmax relaxation $\mathrm{softmax}((z + g)/\tau)$ is used instead: it is differentiable and approaches a one-hot sample as the temperature $\tau \to 0$, which allows backpropagation through discrete sampling in VAEs and other discrete latent-variable models.
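A NumPy sketch of both tricks (forward computation only; an autodiff framework would be needed to actually backpropagate through the relaxation):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(z):
    """Exact categorical sample from logits z via the Gumbel-max trick."""
    g = rng.gumbel(size=z.shape)      # g_k ~ Gumbel(0, 1), i.i.d.
    return int(np.argmax(z + g))

def gumbel_softmax(z, tau=0.5):
    """Relaxation softmax((z + g) / tau); approaches one-hot as tau -> 0."""
    g = rng.gumbel(size=z.shape)
    y = (z + g) / tau
    y = y - y.max()                   # stabilise before exponentiating
    e = np.exp(y)
    return e / e.sum()

z = np.log(np.array([0.2, 0.5, 0.3]))  # logits whose softmax is (0.2, 0.5, 0.3)
print(gumbel_max_sample(z))            # a hard sample, e.g. 1
print(gumbel_softmax(z, tau=0.1))      # near one-hot at low temperature
```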
Related terms: Bernoulli Distribution, Softmax, Cross-Entropy Loss, Dirichlet Distribution
Discussed in:
- Chapter 4: Probability