Summary

The Transformer arose as a response to the bottleneck of recurrent encoder–decoder models. By using attention as the sole sequence-mixing primitive, it replaced sequential information flow with parallel content-based lookup. The architecture combines multi-head scaled dot-product attention, position-wise feed-forward networks, residual connections, and layer normalisation; positional information is injected through sinusoidal, learned, or rotary embeddings.

Three families followed: encoder-only (BERT, masked LM), decoder-only (GPT, autoregressive LM with in-context learning), and encoder–decoder (T5). The decoder-only family came to dominate, scaled to hundreds of billions of parameters, and produced the foundation models that define modern AI. Engineering refinements (pre-norm, RMSNorm, SwiGLU, RoPE, GQA) are now standard.

The quadratic-complexity wall is addressed by FlashAttention (exact, IO-aware) and a family of approximate or non-attention sequence models (Linformer, Performer, RetNet, RWKV, Mamba). Vision Transformers extended the same architecture to images; multimodal Transformers fuse modalities through shared token sequences. Mixture-of-Experts decouples parameter count from per-token compute. Inference economics (KV cache, prefill versus decode, continuous batching, PagedAttention, speculative decoding) now dominate the cost of running models in production.

The Transformer's combination of generality, scalability, and trainability is the architectural foundation of the foundation-model era.
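The scaled dot-product attention at the heart of the architecture can be sketched in a few lines. This is a minimal, unbatched, single-head NumPy illustration of Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, not any particular library's implementation; the shapes and variable names are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, single head, no batch."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # content-based similarity
    scores -= scores.max(axis=-1, keepdims=True)           # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax over keys
    return weights @ V                                     # weighted mixture of values

# toy example: 3 query positions attend over 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

Because every query attends to every key, the `scores` matrix is quadratic in sequence length, which is exactly the cost that FlashAttention and the linear-time alternatives above target.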

This site is currently in Beta. Contact: Chris Paton


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).