Also known as: LDA
Latent Dirichlet Allocation (LDA), introduced by David Blei, Andrew Ng and Michael Jordan in 2003, is a probabilistic generative model for collections of documents. The model assumes: (1) each document is a mixture over a fixed number of latent topics, where the mixture proportions are drawn from a Dirichlet distribution; (2) each topic is a distribution over the vocabulary; (3) each word in a document is generated by first sampling a topic from the document's mixture, then sampling a word from that topic.
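The generative story is short enough to state directly in code. Below is a minimal sketch using NumPy, assuming symmetric Dirichlet priors and a fixed document length for simplicity; the function and parameter names (`generate_corpus`, `alpha`, `beta`) are illustrative, not from any particular library.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative model."""
    rng = np.random.default_rng(seed)
    # Each topic is a distribution over the vocabulary, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs = []
    for _ in range(n_docs):
        # Each document's topic mixture is drawn from Dirichlet(alpha).
        theta = rng.dirichlet(np.full(n_topics, alpha))
        words = []
        # Fixed document length for simplicity (the original model draws it from a Poisson).
        for _ in range(doc_len):
            z = rng.choice(n_topics, p=theta)      # sample a topic from the mixture
            w = rng.choice(vocab_size, p=phi[z])   # sample a word from that topic
            words.append(w)
        docs.append(words)
    return docs, phi
```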
Given an observed corpus, inference inverts this generative process to recover the topic-word distributions and the document-topic mixtures. Standard inference methods are variational inference (Blei, Ng and Jordan's original method) and collapsed Gibbs sampling (Griffiths and Steyvers, 2004), the latter often preferred for its simplicity.
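To make the simplicity of collapsed Gibbs sampling concrete, here is a minimal sampler in the spirit of Griffiths and Steyvers: a sketch rather than an optimised implementation, assuming documents arrive as lists of integer word ids and symmetric priors. Each token's topic is resampled from its full conditional, p(z = k) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ), using the counts with that token removed.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    # Count tables: doc-topic counts, topic-word counts, per-topic totals.
    n_dk = np.zeros((n_docs, n_topics))
    n_kw = np.zeros((n_topics, vocab_size))
    n_k = np.zeros(n_topics)
    # Random initial topic assignment for every token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts...
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # ...then resample its topic from the full conditional.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Posterior mean estimates of topic-word (phi) and doc-topic (theta) distributions.
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + vocab_size * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta
```

Paired with the generator above, one can sample a toy corpus and check that the recovered `phi` roughly matches the true topics (up to a permutation of topic labels, since the model is invariant to topic reordering).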
LDA shaped a decade of NLP and computational social science research, providing principled topic discovery for tasks ranging from understanding the historical evolution of scientific journals to characterising legislator speech to organising image collections. Extensions abound: dynamic LDA for evolving topics over time, correlated topic models that allow topic correlations, hierarchical LDA with topic hierarchies, author-topic models, and many others.
Modern neural topic models (using variational autoencoders, BERT-derived embeddings, and most recently large language models with prompting) have largely displaced LDA in research practice, but LDA remains widely used in industry for its interpretability, computational efficiency and theoretical clarity.
Related terms: David Blei, Topic Model, Variational Inference
Discussed in:
- Chapter 8: Unsupervised Learning