An Autoregressive Model generates sequences one element at a time, with each new element conditioned on all previous ones. Formally, it factorises the joint distribution of a sequence via the chain rule of probability: $p(x_1, x_2, \ldots, x_T) = \prod_t p(x_t \mid x_1, \ldots, x_{t-1})$. Each conditional probability is parameterised by a neural network—typically a transformer decoder with causal self-attention that prevents positions from seeing the future.
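The chain-rule factorisation can be made concrete with a toy model. The sketch below uses a hypothetical hand-made conditional table (for brevity, each conditional depends only on the previous token, i.e. a first-order special case); the vocabulary and probabilities are invented for illustration:

```python
import numpy as np

# Toy vocabulary; index 0 is a start symbol (illustrative, not a real model).
vocab = ["<s>", "a", "b"]

# cond[i, j] = p(next token = vocab[j] | previous token = vocab[i]).
# Numbers are made up for the example.
cond = np.array([
    [0.0, 0.7, 0.3],   # after <s>
    [0.0, 0.4, 0.6],   # after "a"
    [0.0, 0.5, 0.5],   # after "b"
])

def joint_prob(tokens):
    """Chain rule: p(x_1, ..., x_T) = prod_t p(x_t | x_1, ..., x_{t-1})."""
    prev = 0  # condition the first factor on the start symbol
    p = 1.0
    for tok in tokens:
        j = vocab.index(tok)
        p *= cond[prev, j]  # multiply in the next conditional factor
        prev = j
    return p

print(round(joint_prob(["a", "b"]), 3))  # 0.7 * 0.6 → 0.42
```

A transformer decoder plays the role of the `cond` table, except that each conditional depends on the entire prefix rather than just the previous token.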
Autoregressive modelling is the dominant paradigm for language modelling. GPT, LLaMA, Claude, Gemini, and essentially all modern large language models are autoregressive transformers trained to predict the next token. At inference time, tokens are generated one at a time: the model predicts a probability distribution over the vocabulary, a token is sampled (or the most probable is selected), the token is appended to the context, and the process repeats until an end-of-sequence marker is produced or a length limit is reached.
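The inference loop described above (predict a distribution, sample, append, repeat until an end marker or length limit) can be sketched as follows. The `next_token_logits` function here is a stand-in for a trained decoder, and the token ids and end-of-sequence marker are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 5
EOS = 4  # hypothetical end-of-sequence token id

def next_token_logits(context):
    # Stand-in for a trained decoder: returns logits over the vocabulary.
    # A real model would run causal self-attention over the whole context.
    h = sum(context) % VOCAB_SIZE
    return np.eye(VOCAB_SIZE)[h] * 2.0

def generate(max_len=10):
    context = [0]  # begin with a start token (assumption)
    for _ in range(max_len):
        logits = next_token_logits(context)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        tok = int(rng.choice(VOCAB_SIZE, p=probs))     # sample a token
        context.append(tok)                            # extend the context
        if tok == EOS:
            break                                      # stop at end marker
    return context
```

Swapping the sampling line for `int(np.argmax(probs))` turns this into greedy decoding.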
Sampling strategies dramatically affect output quality. Greedy decoding always picks the most probable token—fast but often repetitive. Beam search keeps multiple candidate sequences and picks the best complete one—better than greedy but prone to bland outputs. Temperature sampling controls randomness; top-k and nucleus (top-p) sampling restrict sampling to the most probable tokens, balancing coherence and diversity. Autoregressive models have also been applied beyond text: PixelCNN and PixelRNN for images, WaveNet for audio, autoregressive protein language models such as ProGen for protein sequences, and many others. The sequential nature of generation makes autoregressive models slow at inference time, motivating techniques like speculative decoding that partly parallelise the process.
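The top-k and nucleus (top-p) filters mentioned above can be sketched as operations on a probability vector. This is a minimal illustration, not a production implementation; the example distribution is invented:

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalise."""
    out = np.zeros_like(probs)
    idx = np.argsort(probs)[-k:]  # indices of the k largest probabilities
    out[idx] = probs[idx]
    return out / out.sum()

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise."""
    order = np.argsort(probs)[::-1]          # sort descending
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # include the token crossing p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.25, 0.15, 0.07, 0.03])
filtered_k = top_k_filter(probs, 2)   # mass restricted to the top 2 tokens
filtered_p = top_p_filter(probs, 0.85)  # keeps the top 3 tokens here
```

Unlike top-k, the nucleus set adapts to the shape of the distribution: a peaked distribution yields a small set, a flat one a larger set.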
Related terms: Language Model, Transformer, GPT
Discussed in:
- Chapter 14: Generative Models — Language Generation
- Chapter 12: Sequence Models — Language Models
Also defined in: Textbook of AI