Chapter Thirteen

Attention & Transformers

Learning Objectives
  1. Motivate attention as a content-based lookup that removes the recurrent bottleneck
  2. Derive scaled dot-product attention and explain the $\sqrt{d_k}$ factor from a variance argument
  3. Build multi-head attention from first principles and count its parameters
  4. Distinguish self-attention, cross-attention and encoder–decoder attention
  5. Derive sinusoidal, learned, RoPE and ALiBi positional encodings
  6. Compare pre-norm and post-norm Transformer blocks; explain SwiGLU and RMSNorm
  7. Distinguish encoder-only (BERT), decoder-only (GPT) and encoder–decoder (T5) families
  8. Implement a complete decoder-only Transformer in PyTorch and train it on a toy task
  9. Estimate parameter counts and FLOPs and derive the $\sim 12 L d^2 + Vd$ rule
  10. Explain FlashAttention, sparse attention, Mixture-of-Experts and linear-attention variants
  11. Reason about KV-cache, prefill–decode asymmetry and PagedAttention in inference economics

In June 2017 a paper from Google Brain and Google Research carried a title that, in retrospect, reads like a manifesto: "Attention Is All You Need" (Vaswani et al., 2017). The eight authors proposed an architecture they called the Transformer, a sequence model with no recurrence and no convolution, built entirely from a single primitive: attention. Within five years almost every state-of-the-art result in natural language processing, computer vision, speech, biology, and reinforcement learning was being produced by Transformers or close descendants. The Transformer is now the architectural substrate of GPT-4, Claude, Gemini, LLaMA, BERT, ViT, AlphaFold, Whisper, Sora, and the broader generation of foundation models that defines contemporary AI.

This chapter is the centrepiece of the textbook. We start where the field stood in 2014: an encoder–decoder recurrent network with a fixed-size bottleneck. We derive attention as the response to that bottleneck, develop scaled dot-product attention from variance arguments, generalise to multi-head attention, and assemble the full Transformer block including positional encodings, residuals, normalisation, and feed-forward layers. We work through a numerical example by hand. We then implement an entire decoder-only Transformer in roughly two hundred lines of PyTorch and train it on a toy task. From that foundation we tour the families and extensions that have grown out of the original paper (BERT, GPT, T5, ViT, multimodal models, Mixture-of-Experts). We close with the engineering questions that dominate today: the quadratic-complexity wall, FlashAttention, sub-quadratic alternatives such as linear attention and state-space models like Mamba, and the inference economics of prefill, decode, and the KV cache.
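The variance argument behind the $\sqrt{d_k}$ factor can be previewed numerically before the formal derivation. The sketch below is an illustration only. It assumes queries and keys whose components are independent with zero mean and unit variance; the function name `score_variances`, the trial count and the seed are hypothetical choices made for this example.

```python
import math
import random


def score_variances(d_k, trials=2000, seed=0):
    """Estimate Var(q.k) and Var(q.k / sqrt(d_k)) by simulation.

    Assumption: every component of q and k is an independent N(0, 1) draw.
    """
    rng = random.Random(seed)
    raw, scaled = [], []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(qi * ki for qi, ki in zip(q, k))  # unscaled attention score
        raw.append(s)
        scaled.append(s / math.sqrt(d_k))         # scaled attention score

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    return variance(raw), variance(scaled)


if __name__ == "__main__":
    for d_k in (16, 64, 256):
        var_raw, var_scaled = score_variances(d_k)
        print(f"d_k={d_k:4d}  Var(raw)={var_raw:8.2f}  Var(scaled)={var_scaled:5.2f}")
```

Under these assumptions the unscaled variance should come out close to $d_k$ and the scaled variance close to 1. This is the reason for dividing by $\sqrt{d_k}$: without it, the scores fed to the softmax spread out as the head dimension grows.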

By the end of this chapter you should be able to draw the Transformer block from memory, explain every line, write the code, and reason about why it scales the way it does.
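As a first taste of that kind of reasoning, the $\sim 12 L d^2 + Vd$ rule from the learning objectives can be checked with a few lines of arithmetic. The sketch below makes simplifying assumptions that the chapter may refine: four $d \times d$ attention projections per layer, a feed-forward layer with hidden width $4d$, no biases or normalisation parameters, and a single token-embedding matrix shared with the output layer. The function name `approx_param_count` is hypothetical.

```python
def approx_param_count(num_layers, d_model, vocab_size):
    """Rough Transformer parameter count: about 12 * L * d^2 + V * d.

    Assumptions (labelled, not taken from the chapter text):
      - attention uses four d x d matrices (query, key, value, output)
      - the feed-forward layer has hidden width 4d (two d x 4d matrices)
      - biases, norm parameters and position embeddings are ignored
      - input and output embeddings are tied (one V x d matrix)
    """
    attention = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    feed_forward = 2 * d_model * (4 * d_model)   # up- and down-projection
    per_layer = attention + feed_forward         # = 12 * d^2
    embeddings = vocab_size * d_model            # V * d
    return num_layers * per_layer + embeddings


if __name__ == "__main__":
    # Illustrative configuration matching the published GPT-2 small sizes.
    total = approx_param_count(num_layers=12, d_model=768, vocab_size=50257)
    print(f"{total:,} parameters")
```

For this configuration the estimate is roughly 123.5 million parameters, close to the 124 million usually quoted for GPT-2 small. The small gap comes from the terms the rule ignores.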

A note on style. This chapter is denser than the others. The mathematics of attention is genuinely subtle, and the engineering details of modern Transformers (RoPE, GQA, FlashAttention, KV caching, MoE routing) each deserve their own paragraph. We will not skimp on any of them. If a section feels heavy, read the pseudocode and worked examples first; the prose will then settle into place.

