Chapter Thirteen

Attention & Transformers

Learning Objectives
  1. Motivate attention as a content-based lookup that removes the RNN bottleneck
  2. Derive the scaled dot-product self-attention formula and its query/key/value matrices
  3. Describe the full transformer block, including feed-forward layers, residual connections, and layer norm
  4. Explain why positional encodings are needed and compare sinusoidal and learned variants
  5. Distinguish encoder-only (BERT), decoder-only (GPT), and encoder–decoder (T5) transformer variants

For years, recurrent networks were the default for sequential data. LSTMs and GRUs processed tokens one at a time, passing information forward through hidden states. This created two painful problems. Training could not be parallelised across time steps. And long-range dependencies had to survive passage through dozens of gates, losing information along the way.

The attention mechanism fixed both problems at once. Instead of threading information through a chain of states, attention lets the model look at every part of the input directly — in parallel, with no recurrence. The Transformer architecture (Vaswani et al., 2017), built entirely on attention, has since conquered nearly every area of AI.

13.1   Attention Mechanism

The Problem It Solved

In early neural machine translation (Bahdanau et al., 2014), an encoder RNN read a source sentence and compressed it into a single fixed-length vector. The decoder then generated the translation from that one vector. The problem: one vector had to encode the meaning of the entire sentence. As sentences grew longer, quality collapsed.

How Attention Works

Attention lets the decoder look back at every encoder hidden state, not just the final one. At each generation step, the decoder computes a weighted combination of all encoder states, focusing on the most relevant ones.

Given encoder hidden states h~1~, …, h~T~ and decoder state s~t~:

  1. Compute alignment scores e~t,i~ between s~t~ and each h~i~.
  2. Normalise with a softmax over i: α~t,i~ = exp(e~t,i~) / Σ~j~ exp(e~t,j~).
  3. Compute the context vector: c~t~ = Σ~i~ α~t,i~ h~i~.

The alignment function can be additive (Bahdanau): e~t,i~ = v^T^ tanh(W~1~s~t~ + W~2~h~i~), or dot-product (Luong et al., 2015): e~t,i~ = s~t~^T^ h~i~. Dot-product attention is cheaper and maps well to matrix operations.
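The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the dot-product (Luong) variant for a single decoder step; the function names are ours, not from any library:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_context(s_t, H):
    """Luong-style attention: decoder state s_t (d,) over encoder states H (T, d)."""
    e = H @ s_t            # step 1: alignment scores e_{t,i} = s_t^T h_i, shape (T,)
    alpha = softmax(e)     # step 2: attention weights, non-negative, summing to 1
    c_t = alpha @ H        # step 3: context vector, a weighted mix of encoder states
    return c_t, alpha
```

The context vector c_t is then fed to the decoder alongside its own state when generating the next token.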

Impact

On long sentences, attention-equipped models maintained high quality because the decoder could look back at relevant source words regardless of distance. Attention also provided interpretability — you could visualise which source words the model focused on for each output word. This led to rapid adoption in speech, image captioning, and question answering.

The Conceptual Shift

Before attention, the metaphor was a conveyor belt: information entered at one end and was transformed as it passed through. Attention replaced this with something like a library lookup: at each step, the model asks a query ("what do I need?"), compares it against keys ("what is available?"), and retrieves the matching values. This query–key–value framing would become the foundation of the Transformer.

13.2   Self-Attention

In the original mechanism, queries came from the decoder and keys/values came from the encoder. Self-attention lets a single sequence attend to itself. Every position acts as query, key, and value, so the model captures dependencies between any two positions regardless of distance — in a single step, not through a chain.

The Formula

Given n input vectors of dimension d~model~, project them into three spaces:

  • Q = XW^Q^ (queries)
  • K = XW^K^ (keys)
  • V = XW^V^ (values)

The output is:

Attention(Q, K, V) = softmax(QK^T^ / √d~k~) V

The division by √d~k~ prevents the dot products from growing too large, which would push the softmax into regions with tiny gradients.
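The whole formula is a handful of matrix operations. A minimal NumPy sketch (single head, no masking, illustrative names):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n): query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # each output row mixes value rows
```

Note that the output for each position is always a convex combination of the value vectors, with the softmax row supplying the mixing weights.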

Multiple Views

Self-attention can be seen through several lenses:

  • Graph view: the input is a fully connected graph. Attention weights are edge strengths.
  • Retrieval view: each position broadcasts a query and retrieves a weighted mix of values.
  • Linear algebra view: softmax(QK^T^ / √d~k~) is a data-dependent mixing matrix that blends the value vectors for each position.

Causal vs Bidirectional

In bidirectional self-attention, every position attends to every other. This suits tasks like classification where the full input is available. In causal self-attention, position i can only attend to positions j ≤ i. A mask sets future positions to −∞ before the softmax, driving their weights to zero. Causal masking is essential for language modelling and any sequential generation task.
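The mask itself is just an upper-triangular matrix of −∞ added to the scores before the softmax. A small sketch (illustrative, not a library API):

```python
import numpy as np

def causal_scores(scores):
    """Mask future positions: position i may only attend to j <= i."""
    n = scores.shape[-1]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    return np.where(future, -np.inf, scores)            # -inf -> weight 0 after softmax
```

After the softmax, row i of the attention matrix places all its weight on positions 0 through i.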

The Quadratic Cost

Every position attends to every other, so the attention matrix is n × n. Time and memory are O(n^2^). For a few hundred tokens, this is fine. For thousands or tens of thousands, it becomes a bottleneck — motivating the efficient attention methods covered in Section 13.6.
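The quadratic growth is easy to quantify with a back-of-envelope helper (our own illustrative function, assuming one attention matrix materialised in fp32):

```python
def attn_matrix_bytes(n, n_heads=1, bytes_per_elem=4):
    """Memory for one layer's n x n attention weights (fp32 by default)."""
    return n_heads * n * n * bytes_per_elem

# Doubling n quadruples the matrix:
# n = 512    -> 1 MiB per head
# n = 32768  -> 4 GiB per head
```

At a few hundred tokens the matrix is negligible; at tens of thousands it dominates memory, which is exactly why the efficient variants in Section 13.6 exist.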

Why It Works So Well

Self-attention captures global dependencies in a single layer — syntactic patterns (subject–verb agreement across clauses), semantic relationships (coreference resolution), and discourse-level coherence. A CNN needs many layers of local operations to achieve the same reach. An RNN needs many time steps. This directness — any position can interact with any other in one step — is a key reason for the Transformer's dominance.

13.3   The Transformer

Vaswani et al. (2017) showed that self-attention alone — without recurrence or convolution — can build state-of-the-art sequence models. The paper's title said it all: "Attention Is All You Need."

Encoder

Each encoder layer has two sub-layers:

  1. Multi-head self-attention (Section 13.5): every position attends to every other.
  2. Feed-forward network: a two-layer MLP applied independently to each position. Hidden dimension is typically 4× the model dimension. Activation: ReLU or GELU.

Both sub-layers are wrapped in residual connections (output + input) followed by layer normalisation. Residual connections provide gradient highways for training deep stacks.
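The wrapping pattern can be sketched directly. This is a simplified illustration (learned layer-norm gains and biases omitted, names our own):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise two-layer MLP with ReLU; W1 maps d_model -> 4 * d_model."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def sublayer(x, f):
    """Post-norm residual wrapping, as in the original Transformer: LN(x + f(x))."""
    return layer_norm(x + f(x))
```

An encoder layer is then `sublayer(sublayer(x, self_attention), feed_forward)`: each sub-layer adds its output back onto its input before normalising.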

Decoder

Each decoder layer has three sub-layers:

  1. Masked self-attention: causal mask prevents looking ahead.
  2. Cross-attention: queries from the decoder, keys and values from the encoder. This is how the decoder accesses the input.
  3. Feed-forward network: same as the encoder.

Each sub-layer has residual connections and layer normalisation.

The Original Configuration

The base model: d~model~ = 512, 8 attention heads, 6 layers each for encoder and decoder, 65 million parameters. The "big" model: d~model~ = 1024, 16 heads, ~213 million parameters. Training used Adam with linear warmup then inverse-square-root decay. The big model set new records on WMT 2014 translation while training in 3.5 days on 8 GPUs — compared to weeks for RNN-based models.

Impact

Within two years, Transformers achieved state-of-the-art results in language modelling (GPT), language understanding (BERT; Devlin et al., 2019), image classification (ViT; Dosovitskiy et al., 2020), speech (wav2vec 2.0), and protein structure prediction (AlphaFold 2; Jumper et al., 2021). The reasons: no information bottleneck, easy parallelisation on GPUs/TPUs, and a simple design that adapts to new domains by changing only the tokenisation and input embedding.

Practical Details

The original Transformer applied layer normalisation after the residual addition (post-norm). Later work found that applying it before each sub-layer (pre-norm) gives more stable training, especially for deep models. Pre-norm is now the standard.
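The two orderings differ only in where the normalisation sits relative to the residual addition. A schematic comparison (toy functions, not a real implementation):

```python
def post_norm_block(x, f, norm):
    """Original (post-norm): normalise after the residual addition."""
    return norm(x + f(x))

def pre_norm_block(x, f, norm):
    """Modern (pre-norm): normalise the sub-layer input; residual path stays untouched."""
    return x + f(norm(x))
```

In pre-norm, the identity path from input to output never passes through a normalisation, which is the usual explanation for its more stable gradients in deep stacks.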

13.4   Positional Encoding

Self-attention is permutation-equivariant: rearrange the inputs and the outputs rearrange the same way. That is fine for sets, but fatal for sequences. "The cat sat on the mat" and "mat the on sat cat the" contain the same tokens but mean different things. Positional encoding injects order information.

Sinusoidal Encodings

The original Transformer used fixed sinusoidal functions:

  • PE(pos, 2i) = sin(pos / 10000^2i/d~model~^)
  • PE(pos, 2i+1) = cos(pos / 10000^2i/d~model~^)

These are added to the input embeddings. Each position gets a unique encoding. The sinusoidal form means that, for any fixed offset k, the encoding at pos + k is a linear function of the encoding at pos — which may help the model learn relative positions. Sinusoidal encodings also extend to lengths not seen in training.
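The two formulas above translate directly into a small table-building function (a NumPy sketch, names our own):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]           # even dimension indices
    angles = pos / np.power(10000.0, two_i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dims: sine
    pe[:, 1::2] = np.cos(angles)                        # odd dims: cosine
    return pe
```

Each dimension pair oscillates at its own geometric frequency, from wavelength 2π up to 10000·2π, so every position within that range gets a distinct pattern.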

Learned Positional Embeddings

BERT and GPT-2 used a lookup table: one learned vector per position, up to a maximum length. Performance is comparable to sinusoidal within the training range, but there is no way to handle positions beyond the maximum.

Relative Position Encodings

Shaw et al. (2018) and Transformer-XL modified the attention scores directly to depend on the distance i − j between positions, rather than their absolute locations. This handles variable lengths and captures the intuition that relationships depend on relative position.

ALiBi (Press et al., 2021) takes this further: subtract a linear penalty proportional to |i − j| from the attention scores. No learned parameters at all.
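The penalty is just a distance matrix scaled by a slope. A simplified sketch of the bias term (the paper additionally uses a head-specific geometric sequence of slopes and a causal form; this shows only the basic idea):

```python
import numpy as np

def alibi_bias(n, slope):
    """Linear distance penalty: slope * |i - j| is subtracted from attention scores."""
    idx = np.arange(n)
    return -slope * np.abs(idx[:, None] - idx[None, :])  # (n, n), zero on the diagonal
```

Adding this bias before the softmax makes distant positions progressively less attractive, with no parameters to learn or extrapolate.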

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating query and key vectors. The inner product of the rotated vectors depends only on the relative position (because the product of two rotations depends on the difference of their angles). RoPE is used in LLaMA (Touvron et al., 2023), PaLM (Chowdhery et al., 2022), and many other large models. It works well with scaling techniques (NTK-aware scaling, YaRN) for extending context length beyond training.
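The relative-position property can be demonstrated in a few lines. This is an illustrative sketch using the half-split pairing convention (real implementations differ in pairing and batching details):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate coordinate pairs of x (d,) by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per coordinate pair
    theta = pos * freqs                        # rotation angle grows with position
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])
```

Because each pair is rotated by pos·freq, the dot product of a query rotated to position m and a key rotated to position n depends only on n − m: shifting both by the same amount leaves the score unchanged.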

Why It Matters

The encoding scheme determines how well a model handles sequences longer than those seen in training (length extrapolation). Sinusoidal and learned encodings extrapolate poorly. RoPE extrapolates moderately well. ALiBi extrapolates well by design. As context windows grow to hundreds of thousands of tokens, this remains an active research area.

13.5   Multi-Head Attention

A single attention head computes one set of weights and one combination of values. This limits each layer to one pattern of attention. Multi-head attention runs multiple heads in parallel, each with its own projections, then concatenates the results.

The Mechanics

For each head j in {1, …, h}:

  • Project: Q~j~ = XW~j~^Q^, K~j~ = XW~j~^K^, V~j~ = XW~j~^V^, with d~k~ = d~model~ / h.
  • Compute attention: head~j~ = Attention(Q~j~, K~j~, V~j~).

Then concatenate and project: MultiHead(X) = Concat(head~1~, …, head~h~) W^O^.

Total cost is similar to a single full-dimension head, since each head works in a reduced space.
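In practice the per-head projections are implemented as one full-width matrix multiply followed by a reshape into heads. A NumPy sketch (weight names WQ, WK, WV, WO are illustrative):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, WQ, WK, WV, WO, h):
    """Split d_model into h heads of size d_k, attend per head, concatenate, project."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(W):
        # (n, d_model) -> (h, n, d_k): each head gets its own slice of the projection
        return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(WQ), split(WK), split(WV)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n), one map per head
    out = softmax(scores) @ V                         # (h, n, d_k)
    out = out.transpose(1, 0, 2).reshape(n, d_model)  # concatenate the heads
    return out @ WO                                   # output projection mixes heads
```

The reshape makes the "h independent attentions" explicit while keeping the total multiply count close to a single full-dimension head.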

Head Specialisation

Heads learn to focus on different things. In language models, some heads attend to the previous token (local context). Others attend to the first token (anchor). Others track syntactic relationships (verb → subject). Many heads appear redundant — pruning them barely affects performance, motivating efficiency work.

Efficient Variants

  • Multi-query attention (MQA; Shazeer, 2019): all heads share one set of key and value projections; only the query projections remain separate. This dramatically reduces the KV cache memory during generation — often the serving bottleneck.
  • Grouped-query attention (GQA, used in LLaMA 2): heads are grouped into clusters sharing key–value projections. A tuneable trade-off between expressiveness and efficiency.

Both achieve performance close to full multi-head attention at much lower inference cost.
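The KV-cache saving is easy to quantify. A back-of-envelope helper for a hypothetical 32-layer model (our own illustrative numbers, fp16 cache):

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_elem=2):
    """Per-sequence KV cache: keys and values for every layer and cached position."""
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem

# Hypothetical 32-layer model, d_head = 128, 4096-token context:
full_mha = kv_cache_bytes(32, 4096, n_kv_heads=32, d_head=128)  # 32 KV heads: 2 GiB
gqa      = kv_cache_bytes(32, 4096, n_kv_heads=8,  d_head=128)  # 8 KV groups: 512 MiB
mqa      = kv_cache_bytes(32, 4096, n_kv_heads=1,  d_head=128)  # shared KV:   64 MiB
```

The cache shrinks linearly with the number of key–value heads, which is why MQA and GQA matter so much for serving long contexts and large batches.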

The Output Projection

The final projection W^O^ lets the model learn how to combine information from different heads. Without it, the heads would be independent stacks with no interaction. Dropout is typically applied here for regularisation.

Robustness

Multi-head attention is one of the most stable design choices in the Transformer. While nearly every other component has been modified since 2017, the basic multi-head structure has barely changed. Typical head counts: 8, 12, 16, or more, guided by the constraint that d~k~ = d~model~ / h should not drop below ~32.

13.6   Transformer Variants

The original Transformer was a powerful template. The demands of diverse tasks and ever-growing scale have produced a rich family of variants.

Efficient Attention

The O(n^2^) cost of standard attention is too high for very long sequences — long documents, high-resolution images, genomic data. Solutions:

  • Sparse patterns: Longformer (Beltagy et al., 2020) combines local sliding-window attention with a few global tokens. BigBird adds random connections. Both preserve the theoretical power of full attention.
  • Hashing: Reformer uses locality-sensitive hashing to group similar tokens, achieving O(n log n).
  • Low-rank projections: Linformer projects keys and values to a lower dimension, giving O(n) complexity. Performer uses random feature maps to approximate the softmax kernel, also achieving O(n).
  • Hardware-aware: FlashAttention (Dao et al., 2022) does not change the maths — it restructures the computation to minimise GPU memory movement. The speedup is dramatic, and FlashAttention is now standard in most Transformer implementations.

Architectural Variants

Three main families:

  • Encoder-only (e.g., BERT; Devlin et al., 2019): bidirectional self-attention. Pre-trained with masked language modelling (predict masked tokens from context). Fine-tuned for classification, NER, question answering.
  • Decoder-only (e.g., GPT): causal self-attention. Pre-trained as an autoregressive language model. Versatile — nearly any task can be cast as text generation.
  • Encoder–decoder (e.g., T5 (Raffel et al., 2019) and BART): the original architecture. Excels at tasks with a clear input–output structure: translation, summarisation, structured output.

Beyond Text

  • Vision Transformer (ViT; Dosovitskiy et al., 2020): split an image into patches, embed them, and process with a standard Transformer. Matches or beats CNNs with enough pre-training data.
  • Audio Transformers: apply the same patch approach to spectrograms.
  • AlphaFold 2 (Jumper et al., 2021): a custom Transformer-like architecture (the Evoformer) for protein structure prediction.
  • Graph Transformers: apply attention to molecular and relational data.

Mixture-of-Experts (MoE)

In an MoE layer (Fedus et al., 2021), the standard feed-forward network is replaced by a collection of expert networks. A gating mechanism routes each token to a small subset (typically 1–2 out of many). This decouples parameter count from compute cost: the model can have trillions of parameters, but each token only activates a fraction. The Switch Transformer showed this works at huge scale. Mixtral demonstrated strong performance at favourable cost-efficiency ratios. MoE points toward a future where model size and inference cost are increasingly separate.
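The routing logic for one token can be sketched in a few lines. This is a toy top-k router (production systems add load-balancing losses and capacity limits; names are ours):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def moe_forward(x, gate_W, experts, k=2):
    """Route token x to its top-k experts; output is the gate-weighted mix of their outputs."""
    logits = x @ gate_W                    # one routing score per expert
    top = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    gates = softmax(logits[top])           # renormalise weights over the chosen experts
    y = sum(g * experts[i](x) for g, i in zip(gates, top))
    return y, top
```

Only k expert networks run per token, so compute stays roughly constant as the expert count (and total parameter count) grows.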