An Autoregressive Model generates sequences one element at a time, with each new element conditioned on all previous ones. Formally, it factorises the joint distribution of a sequence via the chain rule of probability: $p(x_1, x_2, \ldots, x_T) = \prod_t p(x_t \mid x_1, \ldots, x_{t-1})$. Each conditional probability is parameterised by a neural network—typically a transformer decoder with causal self-attention that prevents positions from seeing the future.
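The chain-rule factorisation can be made concrete with a toy model. The sketch below uses a hypothetical hand-made conditional table (for brevity, each conditional depends only on the previous token, i.e. a first-order special case); the vocabulary and probabilities are invented for illustration:

```python
import numpy as np

# Toy vocabulary; index 0 is a start symbol (illustrative, not a real model).
vocab = ["<s>", "a", "b"]

# cond[i, j] = p(next token = vocab[j] | previous token = vocab[i]).
# Numbers are made up for the example.
cond = np.array([
    [0.0, 0.7, 0.3],   # after <s>
    [0.0, 0.4, 0.6],   # after "a"
    [0.0, 0.5, 0.5],   # after "b"
])

def joint_prob(tokens):
    """Chain rule: p(x_1, ..., x_T) = prod_t p(x_t | x_1, ..., x_{t-1})."""
    prev = 0  # condition the first factor on the start symbol
    p = 1.0
    for tok in tokens:
        j = vocab.index(tok)
        p *= cond[prev, j]  # multiply in the next conditional factor
        prev = j
    return p

print(round(joint_prob(["a", "b"]), 3))  # 0.7 * 0.6 → 0.42
```

A transformer decoder plays the role of the `cond` table, except that each conditional depends on the entire prefix rather than just the previous token.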
Autoregressive modelling is the dominant paradigm for language modelling. GPT, LLaMA, Claude, Gemini, and essentially all modern large language models are autoregressive transformers trained to predict the next token. At inference time, tokens are generated one at a time: the model predicts a probability distribution over the vocabulary, a token is sampled (or the most probable is selected), the token is appended to the context, and the process repeats until an end-of-sequence marker is produced or a length limit is reached.
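The inference loop described above (predict a distribution, sample, append, repeat until an end marker or length limit) can be sketched as follows. The `next_token_logits` function here is a stand-in for a trained decoder, and the token ids and end-of-sequence marker are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 5
EOS = 4  # hypothetical end-of-sequence token id

def next_token_logits(context):
    # Stand-in for a trained decoder: returns logits over the vocabulary.
    # A real model would run causal self-attention over the whole context.
    h = sum(context) % VOCAB_SIZE
    return np.eye(VOCAB_SIZE)[h] * 2.0

def generate(max_len=10):
    context = [0]  # begin with a start token (assumption)
    for _ in range(max_len):
        logits = next_token_logits(context)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        tok = int(rng.choice(VOCAB_SIZE, p=probs))     # sample a token
        context.append(tok)                            # extend the context
        if tok == EOS:
            break                                      # stop at end marker
    return context
```

Swapping the sampling line for `int(np.argmax(probs))` turns this into greedy decoding.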
Sampling strategies dramatically affect output quality. Greedy decoding always picks the most probable token—fast but often repetitive. Beam search keeps multiple candidate sequences and picks the best complete one—better than greedy but prone to bland outputs. Temperature sampling controls randomness; top-k and nucleus (top-p) sampling restrict sampling to the most probable tokens, balancing coherence and diversity. Autoregressive models have also been applied beyond text: PixelCNN and PixelRNN for images, WaveNet for audio, autoregressive protein language models such as ProGen for protein sequences, and many others. The sequential nature of generation makes autoregressive models slow at inference time, motivating techniques like speculative decoding that partly parallelise the process.
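The top-k and nucleus (top-p) filters mentioned above can be sketched as operations on a probability vector. This is a minimal illustration, not a production implementation; the example distribution is invented:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalise."""
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]  # indices of the k largest probabilities
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise."""
    order = np.argsort(probs)[::-1]          # sort descending
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # include the token crossing p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
filtered_k = top_k_filter(probs, 2)   # mass restricted to the top 2 tokens
filtered_p = top_p_filter(probs, 0.85)  # keeps the top 3 tokens here
```

Unlike top-k, the nucleus set adapts to the shape of the distribution: a peaked distribution yields a small set, a flat one a larger set.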
Related terms: Language Model, Transformer, GPT
Discussed in:
- Chapter 14: Generative Models — Language Generation
- Chapter 12: Sequence Models — Language Models
Also defined in: Textbook of AI