13.7 Encoder, decoder, encoder–decoder
The original Transformer of 2017 was a single architecture with two stacks: an encoder that read the source sentence and a decoder that generated the target. Within three years that single design had splintered into three distinct families, each optimised for a different relationship between input and output. Encoder-only models such as BERT and the encoder half of T5 read text bidirectionally and produce contextual representations for understanding tasks. Decoder-only models such as GPT, Llama, Mistral and Claude generate text autoregressively and now dominate the frontier. Encoder–decoder models such as T5, BART and the original Transformer keep both halves and excel when the input and output are clearly distinct sequences, as in translation or summarisation. The differences between the three families are not large in absolute terms (they share the same block, the same attention mechanism, the same feed-forward sublayer), but they differ in two specific places: the attention mask used during training and the presence or absence of cross-attention. Those two small choices determine the entire downstream story.
This section asks what we can build by stacking the transformer blocks of §13.6 in different patterns: why the original encoder–decoder design was the natural starting point, why decoder-only architectures came to dominate generation, and why encoder-only models retain a niche in retrieval and embedding work.
Encoder-only (BERT)
An encoder-only Transformer is a stack of blocks in which every position attends to every other position in both directions. There is no causal mask, no autoregressive constraint, no separate output stack. The model simply takes a sequence of tokens, embeds them, adds positional information and passes them through $L$ identical blocks. The output is a sequence of contextual embeddings, one per input token, each enriched by information from every other position. BERT [Devlin, 2019] is the canonical example. Its pretraining objective is masked language modelling: replace fifteen per cent of the input tokens with a special [MASK] symbol and ask the model to predict the originals from the surrounding context. Because attention is bidirectional, the model uses both left and right neighbours (which is the whole point). At inference time on understanding tasks, the entire input is available, so artificially restricting the attention pattern would discard information.
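As a concrete sketch (PyTorch-style, and a simplification of BERT's actual recipe, which also leaves some selected tokens unchanged or swaps in random ones), the masking step and a loss restricted to masked positions might look like this, assuming `model` is any bidirectional encoder returning per-position vocabulary logits:

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    """Masked language modelling: corrupt ~15% of positions, predict the originals.

    `model` is assumed to map a batch of token ids (batch, seq_len) to logits
    of shape (batch, seq_len, vocab_size), with fully bidirectional attention.
    """
    labels = token_ids.clone()
    # Choose which positions to mask.
    masked = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    inputs = token_ids.clone()
    inputs[masked] = mask_token_id          # replace selected tokens with [MASK]
    labels[~masked] = -100                  # loss is computed only at masked positions

    logits = model(inputs)                  # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```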
The applications follow directly from this design. Sentence classification, sentiment analysis, natural language inference, extractive question answering and named-entity recognition all reduce to taking the contextual embeddings and feeding them into a small task-specific head. Retrieval-augmented generation, semantic search and clustering use a single pooled embedding per document, typically the [CLS] token's final hidden state or a mean across positions. Sentence-BERT [Reimers, 2019] adapted the BERT recipe to produce well-behaved sentence embeddings, and that line of work, through E5, BGE and the modern retriever zoo, is still very much alive in 2026. The encoder is what survives in modern retrieval pipelines; the decoder is what does the generating downstream.
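A pooled document embedding of the kind retrievers use can be as simple as a mean over the final hidden states of the non-padding tokens. A minimal sketch (the training objective that makes such embeddings useful, e.g. Sentence-BERT's similarity loss, is a separate matter):

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average the contextual embeddings over real (non-padding) tokens.

    hidden_states:  (batch, seq_len, d_model) final-layer outputs of an encoder
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts                                       # (batch, d_model)
```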
The price of bidirectionality is that BERT cannot generate text fluently. Predicting a masked token in the middle of a known sentence is not the same task as continuing a half-written sentence. For three years between 2018 and 2021, BERT and its descendants (RoBERTa, DeBERTa, ELECTRA, ALBERT) dominated the NLP leaderboards for understanding tasks, but they never produced an open-ended chatbot. The encoder-only paradigm pretrains for representation, not for generation, and the two objectives turn out to be different enough that a model trained for one cannot be cheaply repurposed for the other.
Two architectural details are worth noting. BERT prepends a special [CLS] token whose final hidden state acts as a sentence-level summary, useful as a classification feature. It also uses learned absolute positional embeddings rather than the sinusoidal scheme of the original Transformer, which simplifies the implementation but caps the maximum sequence length at training time. Most modern encoder-only retrievers replace these with rotary or ALiBi positions to extend context.
Decoder-only (GPT, Llama, Claude)
A decoder-only Transformer is a stack of blocks in which every position attends only to itself and to earlier positions. The attention mask is lower-triangular: position $t$ can read positions $1, 2, \dots, t$ but not $t+1$ onwards. The pretraining objective is plain next-token prediction. Given a sequence $w_1, \dots, w_n$, the model maximises
$$ \log p(w_1, \dots, w_n) = \sum_{t=1}^n \log p(w_t \mid w_1, \dots, w_{t-1}), $$
which factorises the joint distribution over the sequence into a product of conditionals, one per position. At inference time the model samples one token, appends it to the context, and repeats. This is the generative loop that powers every large chat model in 2026.
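The factorised objective above maps almost line for line onto code. The sketch below assumes a generic decoder stack whose forward pass accepts an additive attention mask; the inputs are shifted by one position so that each position is trained to predict its successor:

```python
import torch
import torch.nn.functional as F

def causal_mask(seq_len, device=None):
    """Additive mask: 0 where attention is allowed, -inf strictly above the diagonal."""
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)

def next_token_loss(model, token_ids):
    """Mean negative log p(w_t | w_1..w_{t-1}) over all positions of the batch."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]      # shift by one
    mask = causal_mask(inputs.size(1), device=token_ids.device)
    logits = model(inputs, attn_mask=mask)                     # (batch, seq_len-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```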
Decoder-only is the family that contains GPT-2, GPT-3, GPT-4 and GPT-5; Llama 1, 2 and 3; Mistral, Mixtral and Codestral; the Claude family; Gemini's text decoders; DeepSeek-V3 and R1; and almost every other frontier system. The architecture is conceptually clean. There is one stack, one objective, one inference mode. Pretraining is just next-token prediction over a very large corpus of internet text, books, code and curated documents. Fine-tuning, instruction tuning, reinforcement learning from human feedback and constitutional methods all build on the same model without changing the underlying architecture.
The decisive moment in the rise of decoder-only architectures came with GPT-3 [Brown, 2020]. At one hundred and seventy-five billion parameters and several hundred billion training tokens, the model demonstrated in-context learning: at inference time a few examples of any task could be prepended to the prompt and the model would generalise to a new instance, with no gradient updates at all. Translation, summarisation, classification, arithmetic and even rudimentary code generation all fell out of a single trained model, purely by prompting. This was the phase change. After GPT-3 it was clear that a decoder-only model at sufficient scale was not just a generator but a general-purpose function approximator that could be steered through natural language. Every frontier model since has followed the same template.
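In-context learning needs no machinery beyond prompt construction. A hypothetical few-shot prompt in the style popularised by GPT-3 (the example pairs here are illustrative, not taken from any particular evaluation set):

```python
# Few-shot prompting: the "training examples" live in the prompt itself and the
# model is simply asked to continue the pattern; no gradient updates are involved.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "house =>"
)
# A sufficiently large decoder-only model completes this with " maison".
```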
Encoder–decoder (T5, BART, original)
The original Transformer was an encoder–decoder. The encoder reads the source sequence with full bidirectional attention, producing a sequence of contextual representations. The decoder generates the target sequence one token at a time, with two attention sublayers per block instead of one: a causal self-attention over its own previously generated tokens, and a cross-attention that queries the encoder's output. Cross-attention is what binds the two halves: the decoder's queries come from its own state, but the keys and values come from the encoder, so each decoder position can attend selectively to the relevant parts of the input.
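In code, the only difference between cross-attention and ordinary self-attention is where the keys and values come from. A single-head, unbatched sketch (real implementations are multi-head, batched and run through fused kernels):

```python
import torch
import torch.nn.functional as F

def cross_attention(decoder_state, encoder_output, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder.

    decoder_state:  (tgt_len, d_model)  decoder hidden states
    encoder_output: (src_len, d_model)  encoder hidden states, fixed during decoding
    """
    Q = decoder_state @ W_q                    # (tgt_len, d_k)
    K = encoder_output @ W_k                   # (src_len, d_k)
    V = encoder_output @ W_v                   # (src_len, d_v)
    scores = Q @ K.T / K.size(-1) ** 0.5       # (tgt_len, src_len); no causal mask needed
    return F.softmax(scores, dim=-1) @ V       # each target position mixes source positions
```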
This design is the natural choice when input and output are clearly different sequences with a well-defined directional relationship. Translation is the canonical case: read a French sentence, write its English equivalent. The encoder need not generate, so its attention can be unconstrained; the decoder must emit one token at a time, so its self-attention must be causal; and the decoder must consult the source at every step, so cross-attention is essential. Summarisation, abstractive question answering and grammatical correction fit the same pattern. T5 [Raffel, 2019] systematised the idea by recasting every NLP task as text-to-text, with task prefixes embedded in the input string. BART [Lewis, 2020] used a denoising autoencoder objective in which the encoder sees a corrupted version of a document and the decoder reconstructs the clean version. Whisper [Radford, 2023] is encoder–decoder for speech-to-text: an encoder with a small convolutional stem followed by Transformer blocks reads log-mel spectrograms, and a Transformer decoder emits text. Most practical multimodal systems that bind a vision encoder to a language model (Flamingo, Qwen-VL, the encoder–decoder branch of GPT-4V) follow the same template, because the input modality (image, audio, video) and the output modality (text) are genuinely different and benefit from a dedicated reader on the input side.
Encoder–decoder models retained leadership in classical translation and summarisation benchmarks until around 2022. After that the gap closed quickly, because decoder-only models at scale began matching them on text-to-text tasks while remaining simpler to serve. The encoder–decoder split has not disappeared; it has migrated to multimodal systems, where the encoder is replaced by a vision tower or audio frontend and the decoder remains a language model.
Why decoder-only won
It is worth spending a few paragraphs on why decoder-only architectures came to dominate, because the reasons are instructive about how the field moves. Three forces converged.
The first is unification. With BERT-style models, pretraining and downstream use look different: you pretrain with a masked language modelling head, then attach a task-specific head and fine-tune. Each new task spawns a new artefact. With GPT-style models, pretraining is next-token prediction and downstream use is also next-token prediction: you simply prefix the input with an instruction or a few examples and let the model continue. One model, one objective, one interface. That operational simplicity scales much better than a zoo of task-specific fine-tunes, both organisationally and economically.
The second is signal density. A causal language modelling objective produces a loss term at every position in the sequence. A masked language modelling objective produces a loss only at the fifteen per cent of positions that are masked. Per training token consumed, a decoder-only model receives roughly six times as much supervised gradient signal as a BERT-style encoder. For a fixed compute budget, that translates into faster learning, and given the scaling laws that underpin modern training [Kaplan, 2020; Hoffmann, 2022], faster learning per token compounds into a substantially better model at the same flop count.
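The factor of six is just the ratio of supervised positions per training token, a back-of-the-envelope figure rather than a measured one:

$$ \frac{\text{loss terms per position (causal LM)}}{\text{loss terms per position (masked LM)}} \approx \frac{1.0}{0.15} \approx 6.7. $$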
The third is emergent capability. Once decoder-only models crossed a certain scale threshold, somewhere between GPT-2 and GPT-3, they began doing things that masked-LM models had not demonstrated at comparable parameter counts: multi-step reasoning by chain of thought, in-context learning of new tasks from a few examples, code generation, instruction following and tool use. Whether these capabilities are intrinsic to the autoregressive objective or simply a consequence of decoder-only models being pushed to larger scale first is hard to disentangle, but the practical effect is that the capability frontier moved with them.
A fourth, more pedestrian reason is engineering. Decoder-only architectures pair cleanly with the KV cache, the trick that makes autoregressive inference tractable: at each step, the keys and values for previously generated tokens can be cached and reused, reducing per-token cost from quadratic to linear in the generated length. Encoder–decoder models have to maintain both an encoder cache and a decoder cache, with cross-attention over the encoder output at every step, which complicates serving stacks like vLLM, PagedAttention and speculative decoding without delivering a corresponding capability advantage.
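A greedy decoding loop with a KV cache might look like the sketch below, assuming a model whose forward pass accepts and returns per-layer key/value tensors (the `past_kv` interface is a common convention, not any particular library's API):

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens):
    """Greedy decoding with a KV cache: after prefill, each step feeds only one token."""
    # Prefill: run the whole prompt once and keep the per-layer keys/values.
    logits, cache = model(prompt_ids, past_kv=None)
    tokens = prompt_ids
    for _ in range(max_new_tokens):
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # (batch, 1)
        tokens = torch.cat([tokens, next_id], dim=1)
        # Decode step: only the new token is processed; its attention reads the cache,
        # so per-step cost is linear in the current length rather than quadratic.
        logits, cache = model(next_id, past_kv=cache)
    return tokens
```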
Encoder-only models retain a clear niche in retrieval and embedding work, where you want a single fixed-dimensional vector per document and bidirectional context genuinely helps. Encoder–decoder models retain a niche in speech-to-text and tightly coupled modality bridges. But for open-ended text-to-text generation at frontier scale in 2026, decoder-only is the answer almost without exception, and the field has converged hard on that consensus.
What you should take away
- Encoder-only models use bidirectional self-attention, no causal mask, and produce contextual embeddings; they live on in retrieval, classification and search.
- Decoder-only models use causal self-attention only, and pretrain by next-token prediction; this family contains every frontier chat and reasoning model in 2026.
- Encoder–decoder models use both stacks bound by cross-attention; they remain the right choice for translation, summarisation and modality bridges such as Whisper.
- Decoder-only architectures dominate generation because of unification of pretraining and downstream use, denser gradient signal per token, emergent capabilities at scale, and friendlier inference economics through the KV cache.
- The choice of architecture is mostly a choice of attention mask plus the presence or absence of cross-attention; everything else (the block, the optimiser, the tokeniser, the data pipeline) is shared across all three families.