A topic model is a statistical model that discovers thematic structure in a collection of documents. Each document is represented as a mixture over a small set of latent topics, and each topic as a distribution over the vocabulary. The most influential topic model is Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan, 2003).
Topic models are an unsupervised alternative to manual document tagging or clustering. Given a corpus, the model returns two things: for each topic, the words most strongly associated with it (often interpretable as a theme); and for each document, its topic mixture (interpretable as the document's thematic profile).
Topic-model variants extend LDA in many directions: temporal evolution of topics (dynamic topic models), correlations between topics (correlated topic models), topic hierarchies (hierarchical LDA), authorship effects (author-topic models), supervised topic models that incorporate document labels, and more. The framework has been applied to document collections, image collections (where "words" are quantised image patches), genetics (where "documents" are individuals and "words" are genetic variants), and many others.
Modern neural topic models (using variational autoencoders, contextual embeddings, and most recently large language models with prompting) have largely displaced LDA in research practice. The LLM-prompting approach simply asks a model to identify the themes in a document, often producing more interpretable results than LDA, at the cost of giving up the principled probabilistic framework.
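The prompting approach can be sketched as a prompt builder plus a reply parser; the prompt wording, the three-theme cap, and the reply format are illustrative assumptions, and the LLM call itself is left out:

```python
def theme_prompt(document: str, k: int = 3) -> str:
    """Build a prompt asking an LLM to name a document's main themes.

    The wording is an illustrative assumption, not a fixed recipe.
    """
    return (
        f"List the {k} main themes of the following document, "
        "one short phrase per line, with no other text:\n\n" + document
    )

def parse_themes(response: str) -> list[str]:
    """Split the model's reply into one theme per non-empty line."""
    return [line.strip("- ").strip()
            for line in response.splitlines() if line.strip()]

# Parsing a hypothetical reply:
parse_themes("- monetary policy\n- inflation\n- labour markets")
# → ['monetary policy', 'inflation', 'labour markets']
```

Unlike LDA, this yields free-text theme labels per document rather than corpus-wide word distributions, which is where both the added interpretability and the lost probabilistic grounding come from.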
Related terms: Latent Dirichlet Allocation, david-blei, Unsupervised Learning
Discussed in:
- Chapter 8: Unsupervised Learning