The Transformer is the neural-network architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need". It uses only multi-head self-attention and feed-forward layers, with no recurrence and no convolution. Because all sequence positions can be processed in parallel during training (unlike an RNN, which must step through the sequence in order), it makes far better use of modern GPUs and TPUs.
The standard Transformer block has two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network (typically an MLP with 4× expansion). Each sub-layer is wrapped in a residual connection and layer normalisation.
The original architecture was an encoder-decoder for machine translation, but two restricted versions have proved more important:
- Decoder-only (causal masked self-attention, autoregressive prediction): the GPT family and nearly every modern large language model. These models predict the next token given all previous tokens, so they handle generation tasks naturally.
- Encoder-only (full bidirectional attention): BERT and its descendants. These models produce contextual representations of all tokens at once and are well suited to classification, embedding and retrieval.
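The difference between the two variants comes down to which positions each token may attend to. A minimal sketch in plain Python follows; the helper name `attention_mask` is hypothetical and chosen only for illustration.

```python
def attention_mask(n, causal):
    """Return an n x n boolean mask; mask[i][j] is True if position i may attend to j."""
    if causal:
        # Decoder-only: position i sees positions 0..i (itself and the past).
        return [[j <= i for j in range(n)] for i in range(n)]
    # Encoder-only: every position sees every position.
    return [[True] * n for _ in range(n)]


causal = attention_mask(4, causal=True)
full = attention_mask(4, causal=False)

for row in causal:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

In a real model the disallowed entries are typically set to a large negative value before the softmax, so they receive zero attention weight.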
The Transformer has spread far beyond language. Vision Transformers (Dosovitskiy et al., 2020) split images into patches and process them with a Transformer, matching or exceeding CNN performance. AlphaFold 2 uses Transformer-style attention for protein structure prediction. Multimodal models (CLIP, DALL-E, Gemini) are built around Transformers. The Transformer is now the dominant architecture in modern AI.
The architecture's main weakness, the quadratic O(n²) cost of attention in sequence length n, has motivated extensive research into sub-quadratic alternatives (linear attention, Mamba, RWKV, RetNet) and into more efficient exact implementations such as FlashAttention, which reduces memory traffic without changing the O(n²) compute. As of 2025, however, the standard quadratic Transformer remains dominant.
Mathematics
A Transformer block (post-norm formulation, as in the original Vaswani et al., 2017):
$$Y = \mathrm{LayerNorm}\bigl(X + \mathrm{MultiHeadAttention}(X)\bigr)$$
$$Z = \mathrm{LayerNorm}\bigl(Y + \mathrm{FFN}(Y)\bigr)$$
Modern models use pre-norm (norm before each sub-layer):
$$Y = X + \mathrm{MultiHeadAttention}(\mathrm{LayerNorm}(X))$$
$$Z = Y + \mathrm{FFN}(\mathrm{LayerNorm}(Y))$$
Pre-norm trains more stably and has been the standard from GPT-2 onwards.
The feed-forward network (FFN) is an MLP applied independently at every position, with the same parameters each time:
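The two wirings can be compared side by side in a short NumPy sketch. It rests on several simplifying assumptions that are not part of the text above: attention is single-head, LayerNorm has no learned scale or shift, weights are random, and the helper names (`make_self_attention`, `make_ffn`) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalise each position (row) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def make_self_attention(d, rng):
    # Single-head stand-in for MultiHeadAttention (assumption of this sketch).
    w_q, w_k, w_v, w_o = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

    def attention(x):
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d))
        return (weights @ v) @ w_o

    return attention


def make_ffn(d, rng):
    w_1 = rng.normal(scale=d ** -0.5, size=(d, 4 * d))
    w_2 = rng.normal(scale=(4 * d) ** -0.5, size=(4 * d, d))

    def ffn(x):
        return np.maximum(0.0, x @ w_1) @ w_2  # ReLU MLP, biases omitted

    return ffn


def post_norm_block(x, attention, ffn):
    y = layer_norm(x + attention(x))
    return layer_norm(y + ffn(y))


def pre_norm_block(x, attention, ffn):
    y = x + attention(layer_norm(x))
    return y + ffn(layer_norm(y))


rng = np.random.default_rng(0)
n, d = 6, 16
attention, ffn = make_self_attention(d, rng), make_ffn(d, rng)
x = rng.normal(size=(n, d))
z_post = post_norm_block(x, attention, ffn)
z_pre = pre_norm_block(x, attention, ffn)
print(z_post.shape, z_pre.shape)  # (6, 16) (6, 16)
```

Note that in the pre-norm version the input `x` reaches the output through additions only, so the residual path is never normalised; in the post-norm version every block output passes through a LayerNorm.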
$$\mathrm{FFN}(x) = W_2 \, \phi(W_1 x + b_1) + b_2$$
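with $W_1 \in \mathbb{R}^{d_{ff} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{ff}}$ and typically $d_{ff} = 4 d$. Here $\phi$ is a nonlinearity (ReLU in the original paper, GELU in many later models). Modern variants use gated linear units, written here with $x$ as a row vector:
$$\mathrm{SwiGLU}(x) = \bigl(x W_g \odot \mathrm{Swish}(x W_1)\bigr) W_2$$
where $\mathrm{Swish}(u) = u \, \sigma(u)$ and $\odot$ is element-wise multiplication. SwiGLU is used in LLaMA, PaLM and Mistral.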
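Both feed-forward variants fit in a few lines of NumPy. This is a sketch under stated assumptions: $\phi$ is taken to be ReLU, weights are random, and a batch of positions is stored as the rows of `x`, so the FFN weights are transposed when applied while the SwiGLU weights are shaped for `x @ W`.

```python
import numpy as np


def ffn(x, w_1, b_1, w_2, b_2):
    # x: (n, d), w_1: (d_ff, d), w_2: (d, d_ff); each row of x is one position.
    hidden = np.maximum(0.0, x @ w_1.T + b_1)  # phi = ReLU (assumption)
    return hidden @ w_2.T + b_2


def swish(u):
    return u / (1.0 + np.exp(-u))  # u * sigmoid(u)


def swiglu(x, w_g, w_1, w_2):
    # w_g, w_1: (d, d_ff); w_2: (d_ff, d). The linear branch x @ w_g is
    # multiplied element-wise by the Swish-activated branch.
    return ((x @ w_g) * swish(x @ w_1)) @ w_2


rng = np.random.default_rng(0)
n, d = 3, 4
d_ff = 4 * d
x = rng.normal(size=(n, d))

w_1, b_1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
w_2, b_2 = rng.normal(size=(d, d_ff)), np.zeros(d)
out_ffn = ffn(x, w_1, b_1, w_2, b_2)

g_gate, g_1, g_2 = (rng.normal(size=s) for s in [(d, d_ff), (d, d_ff), (d_ff, d)])
out_glu = swiglu(x, g_gate, g_1, g_2)
print(out_ffn.shape, out_glu.shape)  # (3, 4) (3, 4)
```

Because the same weights are applied to every row, running either function on a single position gives the same result as the corresponding row of the batched output.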
Parameter count for one Transformer block with hidden dimension $d$ and $h$ heads ($d_k = d/h$):
- Attention: $4 d^2$ (Q, K, V, O projections)
- FFN: $2 d \cdot d_{ff} = 8 d^2$ (with $d_{ff} = 4d$)
- LayerNorm: $\sim 2 d$ (negligible)
- Total per block: $\approx 12 d^2$
For a model with $L$ layers, total parameters are $\approx 12 L d^2$ plus the embedding table ($V d$ for vocabulary size $V$). For example, a model in the 70B-parameter class with $L = 80$ and $d = 8192$ has $12 \times 80 \times 8192^2 \approx 64 \times 10^9$ parameters in its blocks under this estimate; embeddings and architecture-specific details (FFN width, attention variant) account for the remainder.
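The estimate is easy to reproduce. The sketch below ignores biases and LayerNorm parameters, as the approximation above does; the function names and the vocabulary size of 32,000 are hypothetical values chosen for illustration.

```python
def approx_block_params(d, d_ff=None):
    d_ff = 4 * d if d_ff is None else d_ff
    attention = 4 * d * d  # Q, K, V, O projections
    ffn = 2 * d * d_ff     # W_1 and W_2
    return attention + ffn  # biases and LayerNorm omitted (negligible)


def approx_model_params(num_layers, d, vocab_size=0):
    return num_layers * approx_block_params(d) + vocab_size * d


blocks = approx_model_params(80, 8192)
print(f"{blocks:,}")  # 64,424,509,440

# Hypothetical vocabulary of 32,000 tokens: the embedding table adds V * d.
with_embeddings = approx_model_params(80, 8192, vocab_size=32_000)
print(f"{with_embeddings - blocks:,}")  # 262,144,000
```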
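Computational cost per forward pass for a sequence of length $n$, in FLOPs per block (counting each multiply-add as two):
- Attention scores and weighted sum: $4 n^2 d$ (the $QK^\top$ products and the attention-weighted sum of values), plus $O(n^2 h)$ for the softmax
- Attention projections: $8 n d^2$ (Q, K, V, O)
- FFN: $16 n d^2$ (with $d_{ff} = 4d$)
- Total: $O(n^2 d + n d^2)$ per block.
For modest $n$, the terms linear in $n$ ($24 n d^2$ for the projections and FFN together) dominate; for long sequences, attention's quadratic term dominates. The two are equal when $4 n^2 d = 24 n d^2$, i.e. at $n = 6d$, so the crossover is at $n$ on the order of $d$. This motivates the substantial research on sub-quadratic alternatives at long context.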
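A rough sketch of this accounting makes the crossover concrete. Its conventions are assumptions of the sketch: a multiply-add counts as two FLOPs (so the FFN costs $16 n d^2$ FLOPs, or $8 n d^2$ multiply-adds), the Q, K, V, O projections are grouped with the terms linear in $n$, and the softmax is ignored as lower-order. Under those conventions the quadratic term matches the linear terms at $n = 6d$, i.e. at $n$ on the order of $d$. The function name `block_flops` is hypothetical.

```python
def block_flops(n, d, d_ff=None):
    """Return (quadratic, linear) FLOPs for one block, 2 FLOPs per multiply-add."""
    d_ff = 4 * d if d_ff is None else d_ff
    quadratic = 4 * n * n * d    # Q K^T scores and the weighted sum of values
    projections = 8 * n * d * d  # Q, K, V, O projections
    ffn = 4 * n * d * d_ff       # two matrix multiplies of size d x d_ff
    return quadratic, projections + ffn


d = 8192
for n in (2048, 8192, 6 * d, 131072):
    quadratic, linear = block_flops(n, d)
    print(f"n={n:>6}  quadratic/linear = {quadratic / linear:.2f}")
# n=  2048  quadratic/linear = 0.04
# n=  8192  quadratic/linear = 0.17
# n= 49152  quadratic/linear = 1.00
# n=131072  quadratic/linear = 2.67
```

The ratio simplifies to $n / (6d)$, so doubling the context length doubles the share of compute spent on the quadratic part.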
Related terms: Attention Mechanism, Self-Attention, BERT, GPT, Ashish Vaswani
Discussed in:
- Chapter 13: Attention & Transformers, Attention and Transformers