A Transformer Decoder is a stack of identical layers designed for autoregressive sequence generation. In the original encoder-decoder Transformer, each decoder layer has three sub-layers: masked multi-head self-attention (causal masking prevents each position from attending to subsequent positions, preserving the autoregressive property), cross-attention (queries come from the decoder, keys and values from the encoder output, letting the decoder draw on information in the input sequence), and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalisation.
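The causal mask can be illustrated with a minimal sketch in pure Python (a toy score matrix stands in for the real scaled dot-product scores; `causal_attention_weights` is a hypothetical helper name):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention_weights(scores):
    """Apply a causal mask to a square matrix of raw attention
    scores: query position i may only attend to positions j <= i.
    Masked entries are set to -inf, so softmax sends them to 0."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else float("-inf")
                  for j in range(n)]
        weights.append(softmax(masked))
    return weights

# Toy 3x3 score matrix: row i holds query position i's raw scores.
w = causal_attention_weights([[1.0, 2.0, 3.0],
                              [1.0, 2.0, 3.0],
                              [1.0, 2.0, 3.0]])
# Row 0 attends only to position 0; the upper triangle is zero.
```

Each row of the result is a probability distribution over the positions the query is allowed to see, which is exactly what keeps generation autoregressive.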
Decoder-only models like GPT dispense with the encoder entirely and consist of stacked decoder layers with only causal self-attention—no cross-attention. They are pretrained as autoregressive language models and have proven remarkably versatile: virtually any task can be cast as text generation with appropriate prompting. This simplicity and flexibility have made the decoder-only design the dominant architecture for modern LLMs (GPT, LLaMA, Claude, Mistral, Gemini).
At inference time, the decoder generates tokens one at a time: it predicts a distribution over the vocabulary, samples or selects the next token, appends it to the context, and repeats. This sequential nature makes inference slow for long outputs. Techniques like KV caching (storing past key and value tensors to avoid recomputation), speculative decoding (a small draft model proposes tokens that the large model verifies in parallel), and multi-query or grouped-query attention (reducing KV cache size) all accelerate inference. The decoder-only transformer, despite being a simple architectural choice, has become the core computational primitive of modern AI.
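The predict-append-repeat loop can be sketched with a toy stand-in for the model (everything here—`VOCAB`, `toy_logits`, `greedy_generate`—is a hypothetical illustration, not a real decoder forward pass):

```python
import math

VOCAB = 8  # hypothetical tiny vocabulary

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def toy_logits(context):
    """Stand-in for a decoder forward pass: one logit per
    vocabulary item, strongly favouring (last token + 1).
    A real model would run the stacked decoder layers here,
    reusing cached keys/values for all but the newest token."""
    last = context[-1]
    return [3.0 if t == (last + 1) % VOCAB else 0.0
            for t in range(VOCAB)]

def greedy_generate(prompt, max_new_tokens):
    """The loop from the text: predict a distribution over the
    vocabulary, select the next token (greedy decoding takes the
    argmax), append it to the context, and repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(context))
        next_token = max(range(VOCAB), key=lambda t: probs[t])
        context.append(next_token)
    return context

print(greedy_generate([0], 4))  # → [0, 1, 2, 3, 4]
```

The loop makes the cost of naive generation visible: each new token requires a full forward pass over the growing context, which is precisely the recomputation that KV caching avoids.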
Related terms: Transformer, GPT, Self-Attention, Autoregressive Model
Discussed in:
- Chapter 13: Attention & Transformers — The Transformer
Also defined in: Textbook of AI