Latent Dirichlet Allocation (LDA) (Blei, Ng & Jordan 2003) is a probabilistic generative model of document collections. The generative process for $D$ documents over a vocabulary of $V$ terms and $K$ topics (simulated in the code sketch after the list):
1. Draw topic-word distributions $\beta_k \sim \mathrm{Dirichlet}(\eta)$ for $k = 1, \ldots, K$; each topic is a distribution over the $V$ vocabulary terms.
2. For each document $d = 1, \ldots, D$:
a. Draw the document's topic mixture $\theta_d \sim \mathrm{Dirichlet}(\alpha)$, a distribution over the $K$ topics.
b. For each word $n$ in document $d$:
i. Draw topic $z_{d,n} \sim \mathrm{Categorical}(\theta_d)$.
ii. Draw word $w_{d,n} \sim \mathrm{Categorical}(\beta_{z_{d,n}})$.
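A minimal simulation of this generative process, assuming a fixed document length $N_d$ for simplicity; the function and variable names are illustrative, not part of any standard API:

```python
import numpy as np

def generate_corpus(D, V, K, N_d, alpha, eta, rng=None):
    """Simulate documents from the LDA generative process above.

    D, V, K : number of documents, vocabulary size, number of topics
    N_d     : words per document (held fixed here for simplicity)
    alpha   : length-K Dirichlet parameter for document-topic mixtures
    eta     : length-V Dirichlet parameter for topic-word distributions
    """
    rng = rng or np.random.default_rng(0)
    # Step 1: topic-word distributions beta_k ~ Dirichlet(eta)
    beta = rng.dirichlet(eta, size=K)                    # shape (K, V)
    docs, assignments = [], []
    for d in range(D):
        # Step 2a: document's topic mixture theta_d ~ Dirichlet(alpha)
        theta = rng.dirichlet(alpha)                     # shape (K,)
        # Step 2b: for each token, draw a topic, then a word from that topic
        z = rng.choice(K, size=N_d, p=theta)             # z_{d,n} ~ Categorical(theta_d)
        w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_{d,n} ~ Categorical(beta_{z_{d,n}})
        docs.append(w)
        assignments.append(z)
    return docs, assignments, beta

# Example: 100 documents, 1000-term vocabulary, 5 topics, 50 tokens each
docs, z, beta = generate_corpus(D=100, V=1000, K=5, N_d=50,
                                alpha=np.full(5, 0.1), eta=np.full(1000, 0.01))
```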
The joint distribution is
$$P(\theta, \beta, z, w | \alpha, \eta) = \prod_{k=1}^{K} P(\beta_k | \eta) \prod_{d=1}^{D} \left( P(\theta_d | \alpha) \prod_{n=1}^{N_d} P(z_{d,n} | \theta_d) \, P(w_{d,n} | \beta_{z_{d,n}}) \right)$$
Inference computes the posterior $P(\theta, \beta, z | w, \alpha, \eta)$. Direct computation is intractable: the normalising constant $P(w | \alpha, \eta)$ marginalises over every topic assignment, and the coupling between $\theta$ and $\beta$ inside that sum admits no closed form (written out below for a single document).
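For a single document $d$ with $N_d$ tokens, the evidence that normalises this posterior is

$$P(w_d | \alpha, \eta) = \int P(\beta | \eta) \int P(\theta_d | \alpha) \prod_{n=1}^{N_d} \sum_{k=1}^{K} P(z_{d,n} = k | \theta_d) \, P(w_{d,n} | \beta_k) \; d\theta_d \, d\beta$$

and the sum over topics inside the product ties $\theta_d$ and $\beta$ together, so neither integral can be evaluated analytically.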
Collapsed Gibbs sampling (Griffiths & Steyvers 2004): integrate out $\theta$ and $\beta$ analytically (the Dirichlet conjugacy makes this clean), leaving a sampler over only the topic assignments $z$. Conditional probability for a single token's topic:
$$P(z_{d,n} = k | z_{-(d,n)}, w, \alpha, \eta) \propto \frac{n_{d,k}^{-(d,n)} + \alpha_k}{\sum_j (n_{d,j}^{-(d,n)} + \alpha_j)} \cdot \frac{n_{k,w_{d,n}}^{-(d,n)} + \eta_{w_{d,n}}}{\sum_v (n_{k,v}^{-(d,n)} + \eta_v)}$$
where $n_{d,k}$ counts how many tokens in document $d$ are currently assigned to topic $k$, and $n_{k,v}$ counts how many times word $v$ has been assigned to topic $k$, both excluding the current token $(d,n)$. Iterate over all tokens, resampling each conditional on all the others; burn-in typically takes hundreds to thousands of full passes over the corpus.
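A sketch of this sampler, assuming symmetric scalar hyperparameters $\alpha$ and $\eta$ (so the document-side denominator of the conditional, which is constant in $k$, can be dropped from the sampling weights); all names are illustrative:

```python
import numpy as np

def collapsed_gibbs(docs, V, K, alpha, eta, n_iters=500, rng=None):
    """Collapsed Gibbs sampling for LDA (sketch; symmetric scalar alpha, eta).

    docs : list of integer arrays of word ids, one array per document.
    Returns the final topic assignments and the count matrices.
    """
    rng = rng or np.random.default_rng(0)
    n_dk = np.zeros((len(docs), K))   # tokens in doc d assigned to topic k
    n_kv = np.zeros((K, V))           # times word v assigned to topic k
    n_k = np.zeros(K)                 # total tokens assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial assignments
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    for _ in range(n_iters):          # full passes over the corpus
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                # remove the current token from the counts (the "-(d,n)" statistics)
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # conditional P(z = k | rest): doc-topic term * topic-word term
                p = (n_dk[d] + alpha) * (n_kv[:, v] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                # add the token back under its newly sampled topic
                z[d][n] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return z, n_dk, n_kv
```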
Variational EM (the original Blei-Ng-Jordan method): approximate the posterior by a factorised variational distribution
$$q(\theta, z | \gamma, \phi) = \prod_{d=1}^{D} \left( q(\theta_d | \gamma_d) \prod_{n=1}^{N_d} q(z_{d,n} | \phi_{d,n}) \right)$$
Coordinate-ascent updates of the variational parameters $\gamma$ and $\phi$ converge to a local maximum of the ELBO (evidence lower bound). Variational EM is typically faster than Gibbs sampling but more biased, since the true posterior does not factorise this way.
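The per-document E-step can be sketched as follows, assuming $\beta$ is held at a point estimate from the M-step (as in the original formulation); the updates used are $\phi_{d,n,k} \propto \beta_{k, w_{d,n}} \exp(\Psi(\gamma_{d,k}))$ and $\gamma_{d,k} = \alpha_k + \sum_n \phi_{d,n,k}$, and the function name is illustrative:

```python
import numpy as np
from scipy.special import digamma

def e_step(doc, beta, alpha, n_iters=50, tol=1e-4):
    """Coordinate ascent on the variational parameters for one document (sketch).

    doc   : integer array of word ids for one document
    beta  : (K, V) topic-word distributions (point estimates from the M-step)
    alpha : length-K array, the Dirichlet parameter of the topic mixture prior
    Returns gamma (K,) and phi (N_d, K).
    """
    K = beta.shape[0]
    gamma = np.asarray(alpha, dtype=float) + len(doc) / K   # standard initialisation
    phi = np.full((len(doc), K), 1.0 / K)
    for _ in range(n_iters):
        gamma_old = gamma
        # phi_{d,n,k} proportional to beta_{k, w_dn} * exp(digamma(gamma_k))
        phi = beta[:, doc].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{d,n,k}
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - gamma_old).mean() < tol:
            break
    return gamma, phi
```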
Stochastic (online) variational inference (Hoffman, Blei & Bach 2010) processes the corpus in mini-batches of documents, updating the global topic parameters after each batch; this scales LDA to corpora of billions of tokens.
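As one concrete example, scikit-learn's `LatentDirichletAllocation` exposes this mini-batch regime via `learning_method="online"`; the toy corpus and parameter values below are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]
X = CountVectorizer(stop_words="english").fit_transform(texts)  # document-term counts

lda = LatentDirichletAllocation(
    n_components=5,             # K
    doc_topic_prior=0.1,        # alpha
    topic_word_prior=0.01,      # eta
    learning_method="online",   # stochastic / mini-batch updates
    batch_size=64,
    random_state=0,
)
lda.fit(X)                      # or lda.partial_fit(X_batch) in a streaming loop
doc_topics = lda.transform(X)   # per-document topic proportions
```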
Practical considerations:
- Hyperparameter $\alpha$: smaller values give sparser per-document topic mixtures (each document concentrates on a few topics). Symmetric $\alpha = 50/K$ is a common default.
- Hyperparameter $\eta$: smaller values give sparser topics. Symmetric $\eta = 0.01$ is standard.
- Number of topics $K$: chosen by held-out perplexity, topic coherence (Mimno et al. 2011), or domain knowledge; typically 50–500 for general corpora (a selection sketch follows this list).
- Stop-word removal and stemming/lemmatisation improve topic quality.
- TF-IDF scores are useful for pruning the vocabulary (dropping very rare and very frequent terms), but LDA itself should be fit on raw term counts, since the multinomial likelihood assumes integer counts rather than real-valued weights.
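A sketch of selecting $K$ by held-out perplexity, as mentioned in the list above; `load_corpus` is a hypothetical stand-in for reading a real collection of documents, and the candidate values of $K$ are illustrative:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

corpus = load_corpus()  # hypothetical loader returning a list of raw document strings
X = CountVectorizer(stop_words="english", min_df=5).fit_transform(corpus)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

for K in (25, 50, 100, 200):
    lda = LatentDirichletAllocation(n_components=K,
                                    doc_topic_prior=50.0 / K,   # the 50/K heuristic above
                                    topic_word_prior=0.01,
                                    random_state=0)
    lda.fit(X_train)
    # approximate held-out perplexity from the variational bound; lower is better
    print(K, lda.perplexity(X_test))
```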
Modern neural topic models (contextual-embedding methods such as BERTopic, and prompted LLMs) have largely displaced LDA in research, but LDA's interpretability, computational efficiency and theoretical clarity keep it widely deployed in industry.
Related terms: Latent Dirichlet Allocation, david-blei, Gibbs Sampling, Variational Inference
Discussed in:
- Chapter 8: Unsupervised Learning