Key principles

Key Principle

Attention is content-based lookup. A query produces a softmax-weighted average of values, with weights given by query–key dot products. Every Transformer is a stack of these.
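A minimal sketch of this lookup in PyTorch (shapes and names are illustrative, not a production implementation):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: each query returns a softmax-weighted
    average of the values, weighted by query-key similarity."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5  # (batch, n_q, n_k) similarities
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ v                           # weighted average of values

q = torch.randn(1, 5, 64)   # (batch, queries, d_k)
k = torch.randn(1, 7, 64)   # (batch, keys,    d_k)
v = torch.randn(1, 7, 64)   # (batch, keys,    d_v)
out = attention(q, k, v)    # (1, 5, 64)
```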

Key Principle

The $\sqrt{d_k}$ scaling is not optional. Without it, dot-product variance scales with dimension, the softmax saturates, and gradients vanish. A one-line variance argument forces the constant.
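The argument: if the components of $q$ and $k$ are i.i.d. with zero mean and unit variance, then $\mathrm{Var}(q \cdot k) = d_k$, so logits grow like $\sqrt{d_k}$ unless rescaled. A quick empirical check (dimensions illustrative):

```python
import torch

d_k = 512
q = torch.randn(10_000, d_k)  # unit-variance components
k = torch.randn(10_000, d_k)

logits = (q * k).sum(-1)
print(logits.std())                # ~ sqrt(512) ~ 22.6: softmax saturates
print((logits / d_k**0.5).std())   # ~ 1: logits stay in softmax's sensitive range
```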

Key Principle

Multi-head attention splits each $d \times d$ projection into $h$ slices of width $d/h$. The parameter count stays at $4 d^2$ (Q, K, V, and output projections) regardless of $h$; head count is a representational, not a budgetary, choice.
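A quick check with PyTorch's built-in module shows the head count never moves the parameter budget:

```python
import torch.nn as nn

d = 512
for h in (1, 4, 8, 16):
    mha = nn.MultiheadAttention(embed_dim=d, num_heads=h)
    n = sum(p.numel() for p in mha.parameters())
    print(h, n)  # identical for every h: 4*d*d weights + 4*d biases
```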

Key Principle

Self-attention is permutation-equivariant. Order information must be added separately. RoPE is the modern default; ALiBi extrapolates by design.
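A toy demonstration of the equivariance, with projections and positions stripped out so the effect is bare (a sketch, not a full layer):

```python
import torch

def attn(x):
    """Self-attention on raw inputs: pure content-based lookup."""
    w = torch.softmax(x @ x.transpose(-2, -1) / x.size(-1)**0.5, dim=-1)
    return w @ x

x = torch.randn(6, 32)
perm = torch.randperm(6)
# Permuting the inputs merely permutes the outputs: order carries no signal.
assert torch.allclose(attn(x)[perm], attn(x[perm]), atol=1e-5)
```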

Key Principle

Pre-norm trains, post-norm fights you. Putting LayerNorm inside the residual sub-layers gives a clean gradient highway and stable training at hundreds of layers.
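A sketch of the pre-norm arrangement (sizes and module choices illustrative):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm: LayerNorm sits inside each residual branch, so the
    residual stream itself is never normalised away."""
    def __init__(self, d, h):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, h, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        a = self.ln1(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]  # x + Attn(LN(x))
        return x + self.mlp(self.ln2(x))                   # x + MLP(LN(x))
```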

Key Principle

Parameter count is $\sim 12 L d^2 + 2 V d$. Training cost is $\sim 6 N D$ FLOPs for $N$ parameters and $D$ tokens. Chinchilla says $D \approx 20 N$ tokens at compute-optimal.
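The arithmetic in one place, for a hypothetical model (all numbers illustrative):

```python
L, d, V = 32, 4096, 50_000       # layers, width, vocab size (hypothetical)

N = 12 * L * d**2 + 2 * V * d    # ~ 6.9e9 parameters
D = 20 * N                       # Chinchilla-optimal token count
C = 6 * N * D                    # training FLOPs

print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens, C ~ {C:.2e} FLOPs")
```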

Key Principle

FlashAttention computes exact softmax attention in $O(n)$ memory. It does not change the maths; it changes the IO pattern. Modern PyTorch calls it transparently.
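In PyTorch 2.x the fused kernel sits behind a single call; on supported hardware and dtypes it dispatches to a FlashAttention-style kernel automatically:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) — the layout SDPA expects
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Exact attention, computed tile-by-tile in on-chip SRAM:
# the n x n attention matrix is never materialised in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```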

Key Principle

Inference is dominated by the KV cache. Decode is memory-bandwidth-bound, not compute-bound. Smaller KV caches (GQA, MQA, quantisation) and continuous batching dominate serving economics.
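Back-of-envelope cache sizing for a hypothetical 7B-class model (all numbers illustrative):

```python
layers, kv_heads, head_dim = 32, 8, 128  # GQA: 8 KV heads vs 32 query heads
seq_len, bytes_per = 8192, 2             # fp16

# one K and one V tensor per layer, per token
cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per
print(f"{cache_bytes / 2**30:.2f} GiB per sequence")  # exactly 1 GiB here

# Full MHA (32 KV heads) would quadruple this —
# that gap is the serving-economics case for GQA/MQA.
```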

Key Principle

Mixture of Experts decouples capacity from compute. Trillion-parameter models with tens of billions of active parameters per token are now standard at the frontier.
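A minimal dense-loop sketch of top-$k$ routing (hypothetical sizes; production systems use fused, load-balanced kernels):

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Top-k routed MoE layer: capacity scales with n_experts,
    per-token compute scales only with k."""
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)        # normalise over the k chosen
        out = torch.zeros_like(x)
        for j in range(self.k):                  # only k experts run per token
            for e in idx[:, j].unique():
                mask = idx[:, j] == e
                out[mask] += weights[mask, j, None] * self.experts[e](x[mask])
        return out
```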
