8.14 Topic modelling: latent Dirichlet allocation
Latent Dirichlet allocation (LDA; Blei, Ng, and Jordan, 2003) is a probabilistic generative model of document collections. Each document is a mixture over topics, each topic is a distribution over words, and the latent topic assignment at each word position is what we infer.
8.14.1 The generative model
Hyperparameters: $\alpha$ (Dirichlet prior on document-topic distributions), $\eta$ (Dirichlet prior on topic-word distributions), $K$ (number of topics), $V$ (vocabulary size).
For each topic $k=1,\dots,K$:
- Draw $\boldsymbol{\beta}_k\sim\mathrm{Dirichlet}(\eta\mathbf{1}_V)$, the topic's word distribution.
For each document $d=1,\dots,D$:
- Draw $\boldsymbol{\theta}_d\sim\mathrm{Dirichlet}(\alpha\mathbf{1}_K)$, the document's topic distribution.
- For each word position $n=1,\dots,N_d$:
  - Draw topic $z_{dn}\sim\mathrm{Categorical}(\boldsymbol{\theta}_d)$.
  - Draw word $w_{dn}\sim\mathrm{Categorical}(\boldsymbol{\beta}_{z_{dn}})$.
$$ p(\mathbf{w},\mathbf{z},\boldsymbol{\theta},\boldsymbol{\beta}\mid\alpha,\eta) = \prod_{k=1}^{K} p(\boldsymbol{\beta}_k\mid\eta)\;\prod_{d=1}^{D}\Big[\, p(\boldsymbol{\theta}_d\mid\alpha)\prod_{n=1}^{N_d} p(z_{dn}\mid\boldsymbol{\theta}_d)\, p(w_{dn}\mid\boldsymbol{\beta}_{z_{dn}})\Big]. $$
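The generative story maps directly onto code. The NumPy sketch below samples a small synthetic corpus from the model; the corpus sizes and hyperparameter values ($K=3$, $V=50$, $D=5$) are illustrative choices only, not recommendations.
# Sampling a toy corpus from the LDA generative model (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
K, V, D, alpha, eta = 3, 50, 5, 0.1, 0.01                # illustrative hyperparameters
N_d = rng.integers(20, 40, size=D)                        # words per document

beta = rng.dirichlet(eta * np.ones(V), size=K)            # one word distribution per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(alpha * np.ones(K))             # this document's topic distribution
    z = rng.choice(K, size=N_d[d], p=theta)               # topic for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # word drawn from its topic
    docs.append(w)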
Inference targets the posterior $p(\mathbf{z},\boldsymbol{\theta},\boldsymbol{\beta}\mid\mathbf{w},\alpha,\eta)$, which is intractable. Two standard approximations are variational inference (Blei, Ng, and Jordan, 2003) and collapsed Gibbs sampling (Griffiths & Steyvers, 2004); we derive the latter.
8.14.2 Collapsed Gibbs sampler
By Dirichlet-multinomial conjugacy, we integrate out $\boldsymbol{\theta}$ and $\boldsymbol{\beta}$ analytically and sample only the topic assignments $\mathbf{z}$. Write $d_i$ and $w_i$ for the document and word type at position $i$, and define the counts:
- $n_{dk}^{(-i)}$: number of words in document $d$ assigned to topic $k$, excluding position $i$.
- $m_{kw}^{(-i)}$: number of times word $w$ is assigned to topic $k$ across all documents, excluding position $i$.
- $m_k^{(-i)}=\sum_w m_{kw}^{(-i)}$: total assignments to topic $k$, excluding $i$.
The collapsed conditional is
$$ p(z_i = k\mid \mathbf{z}^{(-i)},\mathbf{w},\alpha,\eta) \;\propto\; \frac{n_{d_i k}^{(-i)} + \alpha}{N_{d_i}-1+K\alpha}\cdot\frac{m_{k w_i}^{(-i)} + \eta}{m_k^{(-i)} + V\eta}. $$
This factors into "how much does document $d_i$ already use topic $k$" times "how strongly does topic $k$ favour word $w_i$".
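Each factor is the posterior-predictive mean of one of the integrated-out Dirichlets. For the document factor, the posterior of $\boldsymbol{\theta}_{d_i}$ given the other assignments in document $d_i$ is $\mathrm{Dirichlet}\big(n^{(-i)}_{d_i 1}+\alpha,\dots,n^{(-i)}_{d_i K}+\alpha\big)$, so
$$ p(z_i=k\mid\mathbf{z}^{(-i)},\alpha) = \mathbb{E}\big[\theta_{d_i k}\mid\mathbf{z}^{(-i)},\alpha\big] = \frac{n^{(-i)}_{d_i k}+\alpha}{N_{d_i}-1+K\alpha}, $$
and the word factor follows by the same argument applied to $\boldsymbol{\beta}_k$ with the counts $m^{(-i)}_{kw}$.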
After many sweeps (typically several hundred), we estimate
$$ \hat\theta_{dk} = \frac{n_{dk} + \alpha}{N_d + K\alpha},\qquad \hat\beta_{kw} = \frac{m_{kw} + \eta}{m_k + V\eta}. $$
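The sampler is short enough to write out in full. The following is a minimal NumPy sketch, assuming each document is a 1-D integer array of word ids as in the toy generative sketch above; it is for exposition, not an optimised implementation (realistic corpora need a vectorised or compiled sampler).
# Collapsed Gibbs sampling for LDA (minimal sketch).
import numpy as np

def collapsed_gibbs(docs, K, V, alpha, eta, n_sweeps=500, seed=1):
    """docs: list of 1-D integer arrays of word ids. Returns (theta_hat, beta_hat, z)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(w)) for w in docs]   # random initial assignments
    n_dk = np.zeros((len(docs), K))                    # document-topic counts
    m_kw = np.zeros((K, V))                            # topic-word counts
    for d, w in enumerate(docs):
        for i, k in enumerate(z[d]):
            n_dk[d, k] += 1
            m_kw[k, w[i]] += 1
    m_k = m_kw.sum(axis=1)                             # total assignments per topic

    for _ in range(n_sweeps):
        for d, w in enumerate(docs):
            for i in range(len(w)):
                k_old = z[d][i]
                # Remove position i from the counts.
                n_dk[d, k_old] -= 1
                m_kw[k_old, w[i]] -= 1
                m_k[k_old] -= 1
                # Collapsed conditional; the document-side denominator is
                # constant in k, so it cancels when we normalise.
                p = (n_dk[d] + alpha) * (m_kw[:, w[i]] + eta) / (m_k + V * eta)
                k_new = rng.choice(K, p=p / p.sum())
                # Add position i back under its new topic.
                z[d][i] = k_new
                n_dk[d, k_new] += 1
                m_kw[k_new, w[i]] += 1
                m_k[k_new] += 1

    theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    beta_hat = (m_kw + eta) / (m_k[:, None] + V * eta)
    return theta_hat, beta_hat, z
On the toy corpus from the generative sketch, it can be run as collapsed_gibbs(docs, K, V, alpha, eta).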
8.14.3 Practical use
Choose $K$ by held-out perplexity or topic coherence (e.g. NPMI). Starting values of $\alpha\approx 0.1$ and $\eta\approx 0.01$ work well for short documents; longer documents can use weaker priors. Inspect each topic via its top 10 words by $\hat\beta_{kw}$, and validate with human coherence judgements.
# gensim
from gensim import corpora, models

texts = [["natural", "language", "processing", ...], ...]   # tokenised documents (placeholder)
dictionary = corpora.Dictionary(texts)                       # word <-> integer id mapping
corpus = [dictionary.doc2bow(t) for t in texts]              # bag-of-words representation
lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary,
                      passes=10, alpha="auto", eta="auto")
for k in range(20):
    print("Topic", k, lda.show_topic(k, topn=10))            # top-10 words per topic
LDA was the dominant topic model from 2003 to roughly 2018. It has since been superseded for many tasks by neural topic models (BERTopic, ETM) and contextual-embedding-based clustering, but LDA remains valuable for its interpretable closed-form estimates and minimal compute requirements.