13.6 The Transformer block
So far we have been collecting parts. We have scaled dot-product attention, which lets a model decide which positions in a sequence should pay attention to which other positions. We have multi-head attention, which runs several attention computations in parallel so the model can track different kinds of relations at once. We have positional encoding, which gives an otherwise position-blind architecture a sense of order. What we do not yet have is a way to put these parts together into a single unit that can be repeated.
That unit is the transformer block. A block is the smallest piece of a transformer that you can sensibly point at and call complete. It takes a stack of token vectors in, and it produces a stack of token vectors out, with the same shape. Because the input and output shapes match, you can take the output and feed it straight into another block, and another, and another. Modern language models do exactly this: GPT-3 stacks ninety-six blocks on top of one another, the LLaMA family stacks between thirty-two and eighty depending on the model size, and the principle scales almost arbitrarily.
A block contains, at minimum, four ingredients: a multi-head attention sublayer, a feed-forward sublayer, a residual connection wrapped around each, and a normalisation step that keeps the activations well-behaved. There are minor variations (pre-norm versus post-norm, RMSNorm versus LayerNorm, SwiGLU versus GELU), but the skeleton is universal. If you understand one transformer block, you understand them all. The variants are tweaks to the same recipe.
This section assembles the components covered in §13.2 to §13.5 into that recipe. Once we have the recipe, §13.7 will show how to wire blocks together into encoders, decoders, and encoder-decoder pairs. For now, we are building the brick, not the wall.
The classic block (post-norm)
The original transformer paper (Vaswani et al., 2017) writes the block as two short equations:
$$\mathbf{y} = \text{LN}(\mathbf{x} + \text{Attn}(\mathbf{x}))$$ $$\mathbf{z} = \text{LN}(\mathbf{y} + \text{FFN}(\mathbf{y}))$$
Let us read those equations slowly. The input $\mathbf{x}$ is a matrix; each row is the vector for one token in the sequence. The first line says: pass the input through multi-head attention, add the result back to the original input (this is the residual connection), then run the sum through layer normalisation. The output of that operation is $\mathbf{y}$, which has the same shape as $\mathbf{x}$. The second line is structurally identical, except that the sublayer is now a feed-forward network instead of attention.
This pattern, a sublayer followed by a residual addition and a normalisation, is sometimes called the "Add & Norm" sandwich, because diagrams of the block typically draw a small box labelled "Add & Norm" sitting above each sublayer. The residual addition is what gives the block its skip connection: the original signal can pass through unchanged, while the sublayer contributes a learned correction.
Why bother with the residual? Two reasons. First, gradients during training can flow back along the addition path without being attenuated by the sublayer's weights, so very deep stacks remain trainable. Second, a fresh transformer block initialised with small weights computes something close to the identity function: the sublayer outputs are tiny, so the addition leaves $\mathbf{x}$ almost untouched. That means stacking more blocks does not immediately destroy the signal; the network can learn what each block should do gradually.
Why bother with the normalisation? Without it, the activations inside a deep stack tend to drift in magnitude. After a few additions, some dimensions can blow up while others collapse, and the network becomes numerically unstable. Layer normalisation rescales each token's vector so that its components have mean zero and variance one (with a learned scale and shift), which keeps everything in a sensible range.
This was the post-norm arrangement that powered the original 2017 transformer. It works, but as the field tried to train deeper and deeper models, it began to misbehave.
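The two post-norm equations translate almost line for line into code. Below is a minimal sketch in PyTorch; the framework, the hyperparameter values and the use of `nn.MultiheadAttention` as a stand-in for the multi-head attention of §13.3 are assumptions of this example, not details of the original implementation.

```python
import torch
import torch.nn as nn

class PostNormBlock(nn.Module):
    """One block in the original post-norm arrangement:
    y = LN(x + Attn(x)),  z = LN(y + FFN(y))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(              # position-wise feed-forward network
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                         # the 2017 paper used ReLU
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the output has the same shape
        attn_out, _ = self.attn(x, x, x)       # self-attention: queries, keys, values all come from x
        y = self.ln1(x + attn_out)             # Add & Norm around attention
        return self.ln2(y + self.ffn(y))       # Add & Norm around the FFN

block = PostNormBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 10, 512)                    # two sequences of ten tokens
print(block(x).shape)                          # torch.Size([2, 10, 512]): same shape in, same shape out
```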
The pre-norm variant
Researchers who tried to scale the original recipe past about a dozen layers ran into a frustrating problem: training would diverge unless the learning rate was warmed up very slowly, over thousands of steps, from almost zero up to its target value. Skip the warm-up and the loss would explode. Investigation traced the problem to where the layer normalisation sat. Putting LN after the residual addition meant that the gradient flowing back through the residual path was repeatedly squashed by the normalisation, and at large depths the signal-to-noise ratio in those gradients fell apart (Xiong et al., 2020).
The fix is small but consequential. Move the normalisation inside the sublayer, before attention or before the FFN, instead of after the addition:
$$\mathbf{y} = \mathbf{x} + \text{Attn}(\text{LN}(\mathbf{x}))$$ $$\mathbf{z} = \mathbf{y} + \text{FFN}(\text{LN}(\mathbf{y}))$$
Compare this with the post-norm equations: the normalisation has migrated inside the parentheses, and the residual sum is no longer being normalised. The skip path $\mathbf{x} \to \mathbf{y} \to \mathbf{z}$ is now a clean identity highway. Whatever the sublayers compute is added on top, while the original signal passes through the additions untouched; gradients flowing backwards along the same path are equally undisturbed. The result is a stack that trains stably at hundreds of layers without the painful warm-up schedule.
The flip side is that, by the time you reach the top of a tall pre-norm stack, the residual stream may have grown in magnitude, because nothing along the way has rescaled it. The standard remedy is to apply one final layer normalisation at the very end of the stack, after the last block, before the output projection. This single extra LN keeps the final logits well-conditioned without disturbing the identity highway between blocks.
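Continuing the same hedged sketch, the switch from post-norm to pre-norm is a change to the forward pass only, plus the single extra normalisation applied after the last block of the stack:

```python
class PreNormBlock(PostNormBlock):
    """Same parameters as the post-norm block; only the forward pass changes:
    y = x + Attn(LN(x)),  z = y + FFN(LN(y))."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                        # normalise before attention
        attn_out, _ = self.attn(h, h, h)
        y = x + attn_out                       # the residual path itself is never rescaled
        return y + self.ffn(self.ln2(y))       # normalise before the FFN

blocks = nn.ModuleList([PreNormBlock(512, 8, 2048) for _ in range(12)])
final_ln = nn.LayerNorm(512)                   # the one extra LN after the last block

def stack(x: torch.Tensor) -> torch.Tensor:
    for blk in blocks:
        x = blk(x)
    return final_ln(x)                         # keeps the output well-conditioned
```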
Pre-norm is now the default. GPT-2 and GPT-3 use it, T5 uses a similar arrangement, and the entire LLaMA family along with Mistral, Qwen, DeepSeek and most other open-weight models published since about 2020 use pre-norm. If you read a paper from the past few years and the equations look like the ones above, with normalisation sitting inside the sublayer and a clean residual addition outside, you are looking at pre-norm. If the normalisation sits outside the residual addition, you are reading either an older paper or a deliberate replication of the original 2017 recipe.
A small but useful piece of intuition: pre-norm reframes the residual stream as the primary object that flows through the network, with attention and FFN sublayers acting as small read-write modules that occasionally add information to it. We will return to this view at the end of the section, because it explains a great deal about how trained transformers behave.
The feed-forward sublayer
The feed-forward sublayer, often abbreviated FFN, is the simpler of the two sublayers. It is just a small two-layer neural network applied independently to each position in the sequence. There is no mixing across positions; that is attention's job. The FFN's job is to transform each token's vector representation in place.
$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \cdot \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$
The first matrix $\mathbf{W}_1$ projects from the model dimension up to a hidden dimension. A non-linearity $\sigma$ is applied. The second matrix $\mathbf{W}_2$ projects back down to the model dimension. By convention, the hidden dimension is four times the model dimension. So if $d_{\text{model}} = 4096$, the hidden dimension is $16{,}384$. The biases $\mathbf{b}_1$ and $\mathbf{b}_2$ are sometimes omitted in modern implementations to save a tiny amount of memory.
Why a 4× expansion? The choice is largely empirical, but it has held up across model sizes from one hundred million to one hundred billion parameters. One useful interpretation, due to Geva et al. (2021), is that the FFN behaves like a key-value memory: $\mathbf{W}_1$ contains a set of "memory keys", the non-linearity gates which keys are active for the current input, and $\mathbf{W}_2$ contains the corresponding "memory values" that get retrieved and added back. A wider hidden dimension means more memory slots, hence more facts the model can store.
The choice of non-linearity has shifted over time. The original transformer used ReLU. BERT and GPT-2 switched to GELU (Hendrycks and Gimpel, 2016), a smooth approximation to ReLU built from the Gaussian cumulative distribution function. The current state of the art is SwiGLU (Shazeer, 2020), a gated feed-forward design used in PaLM, LLaMA, Mistral and DeepSeek:
$$\text{FFN}_{\text{SwiGLU}}(\mathbf{x}) = \mathbf{W}_3 \big( \text{Swish}(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{W}_2 \mathbf{x} \big)$$
Three matrices instead of two. The output of the first is passed through a Swish activation; the second gives a plain linear projection; the two are multiplied element-wise; the third matrix projects back down. The element-wise product gives the network a multiplicative gating mechanism: one branch decides what to let through; the other branch decides what content flows. To keep the parameter count comparable to the standard FFN, the inner dimension is reduced from $4 d_{\text{model}}$ to roughly $\tfrac{8}{3} d_{\text{model}}$, then rounded to a multiple of 64 or 128 to suit the hardware.
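A hedged sketch of a SwiGLU feed-forward layer in the same PyTorch style, using `F.silu` as the Swish activation and the rough $\tfrac{8}{3}$ rule for the hidden width (the rounding multiple of 128 is an illustrative choice, not a fixed standard):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward sublayer: W3(Swish(W1 x) * W2 x), biases omitted."""

    def __init__(self, d_model: int, multiple_of: int = 128):
        super().__init__()
        d_hidden = int(8 * d_model / 3)                       # ~8/3 expansion, matching a 4x two-matrix FFN in size
        d_hidden = multiple_of * ((d_hidden + multiple_of - 1) // multiple_of)  # round up to a hardware-friendly width
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)    # gate branch
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)    # content branch
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)    # projection back down to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w3(F.silu(self.w1(x)) * self.w2(x))       # F.silu is Swish with beta = 1

ffn = SwiGLUFFN(d_model=4096)
print(ffn(torch.randn(1, 3, 4096)).shape)                     # torch.Size([1, 3, 4096])
```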
The FFN is by some margin the largest part of a transformer block by parameter count, which is the next thing worth being explicit about.
Why these design choices
It is worth pausing to summarise what each ingredient is for, because the four pieces are not interchangeable and each plays a distinct structural role.
Residual connections are the gradient highway. They let signal and gradient pass through tens or hundreds of layers without being degraded by the intervening sublayers. They also mean that adding more blocks does not catastrophically change the function the network computes: at initialisation each new block is close to the identity, so the network can learn what to do with the extra capacity gradually. Without residual connections, transformers beyond about six layers become very difficult to train.
Layer normalisation (or RMSNorm) stabilises the magnitude of activations. Without it, the variance of the residual stream tends to grow or shrink as it passes through the stack, and at some depth the network becomes numerically pathological, either saturating its non-linearities or vanishing into rounding error. Normalisation rescales each token's vector to a controlled range while leaving its direction (which is what carries the meaning) intact.
The feed-forward network with 4× expansion is where most of a transformer's parameters live, and it is doing the heavy lifting of nonlinear computation per position. Attention can only mix existing information across positions; it cannot produce new features. The FFN is what transforms the mixed information into something more useful, and it is large enough to act as a sizeable lookup memory. In a typical block, the FFN holds about two thirds of the parameters; attention holds the rest.
Multi-head attention is the only mechanism by which information moves between positions. In a transformer, every cross-token computation passes through attention. The "multi-head" part lets the block run several different patterns of attention in parallel, so one head can track syntactic agreement, another can track topical reference, another can track position-relative copying, and so on, all in the same layer.
Take any one piece away and the block stops working. Remove residuals and depth becomes untrainable. Remove normalisation and stability collapses. Remove the FFN and you have only linear projections plus softmax averaging, which is severely under-expressive. Remove attention and you have an MLP applied to each token in isolation, no context, no language modelling. The four ingredients earn their places.
Worked example: parameter count for one block
Let us count parameters for a single block of BERT-base, which has $d_{\text{model}} = 768$, twelve attention heads, and the standard 4× FFN expansion, so $d_{\text{ff}} = 3072$.
For multi-head attention, there are four projection matrices: query, key, value, and the output projection that combines the per-head results back into the residual stream. Each is a $768 \times 768$ matrix (we ignore the small bias vectors throughout this count). Total attention parameters: $4 \times 768^2 = 4 \times 589{,}824 = 2{,}359{,}296$, or about $2.36$ million.
For the FFN, there are two matrices: an "up" projection from $768$ to $3072$ and a "down" projection from $3072$ back to $768$. Each contains $768 \times 3072 = 2{,}359{,}296$ parameters. Total: $2 \times 2{,}359{,}296 = 4{,}718{,}592$, about $4.72$ million.
The two layer normalisations (one per sublayer) each have a scale and a shift vector of length $768$, totalling $4 \times 768 = 3{,}072$ parameters, under three thousand. Negligible compared to the matrices.
Adding it up, one BERT-base block contains roughly $2.36 + 4.72 \approx 7.08$ million parameters. BERT-base stacks twelve such blocks, so the blocks alone account for $12 \times 7.08 \approx 85$ million parameters. The token and positional embeddings add a further $\sim 24$ million on top of that, taking the total advertised count to around $110$ million. Most of the model's capacity, about three quarters of it, lives in the stack of blocks; within each block, two thirds is the FFN and one third is attention. That ratio holds, with minor variation, all the way up to today's hundred-billion-parameter models.
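If you prefer to let code do the arithmetic, a few lines of Python reproduce the count (like the prose, it ignores the small bias vectors):

```python
d_model, d_ff, n_layers = 768, 3072, 12                    # BERT-base dimensions

attn = 4 * d_model * d_model                               # Q, K, V and output projections
ffn = 2 * d_model * d_ff                                   # up- and down-projections
norms = 2 * 2 * d_model                                    # two LayerNorms, each with scale and shift

per_block = attn + ffn + norms
print(f"attention per block:  {attn:,}")                   # 2,359,296
print(f"FFN per block:        {ffn:,}")                    # 4,718,592
print(f"one block:            {per_block:,}")              # 7,080,960
print(f"twelve blocks:        {n_layers * per_block:,}")   # 84,971,520, roughly 85 million
print(f"FFN share of a block: {ffn / per_block:.0%}")      # 67%
```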
Modern improvements
Once the basic recipe was nailed down, the field began iterating on small refinements. Most are not earth-shattering individually, but together they account for a meaningful fraction of the gains between GPT-2 (2019) and the current generation of open models.
RMSNorm (Zhang and Sennrich, 2019) drops the mean-centring step from layer normalisation. It rescales by the root-mean-square of the activations, applies a learned scale, and skips both the mean computation and the shift parameter. Empirically, it matches LayerNorm in quality while being slightly faster, and it has become the default in LLaMA, Mistral, Qwen and the broader open-model lineage. The mean-centring of LayerNorm turned out to be largely unnecessary.
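A minimal sketch of RMSNorm as described above, with a learned scale, no mean subtraction and no shift (`eps` is a small constant that guards against division by zero):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale each token's vector by its root-mean-square, then apply a learned gain."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))       # learned scale; no shift parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)                        # note: no mean is subtracted
```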
Parallel attention and FFN is a layout used by PaLM and GPT-NeoX. Instead of computing attention then FFN sequentially, both sublayers read from the same input and their outputs are summed before being added back to the residual. The change costs nothing in parameter count and can be slightly faster on parallel hardware, because the two sublayers no longer have a sequential dependency.
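In code, the parallel layout is a small change to the pre-norm forward pass. A sketch, reusing the modules defined in the earlier block classes and a single shared normalisation of the input:

```python
class ParallelBlock(PreNormBlock):
    """Pre-norm block in which attention and the FFN both read the same
    normalised input and their outputs are summed into the residual."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                        # one shared normalisation of the input
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out + self.ffn(h)      # no sequential dependency between the two sublayers
```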
Mixture of experts (MoE) replaces the dense FFN with a routed sparse mixture: a small router decides which of, say, eight or sixty-four expert FFNs each token should be sent to, and only the chosen experts run. Total parameter count goes up; per-token computation stays roughly constant. DeepSeek-V3 and Mixtral are large recent examples.
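A toy sketch of top-k expert routing; real implementations add load-balancing losses and batched expert dispatch, both omitted here, and the dimensions and expert count are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Routed feed-forward: each token is processed by its top-k experts only."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                    # flatten (batch, seq) into one list of tokens
        scores = F.softmax(self.router(tokens), dim=-1)        # routing probabilities per token
        weights, chosen = scores.topk(self.top_k, dim=-1)      # top-k experts for each token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise the kept weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):              # only the chosen experts do any work
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape(x.shape)
```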
Grouped-query attention (GQA) has multiple query heads share a single key-value head, reducing the size of the key-value cache that has to be kept in memory during decoding. The quality cost is small; the inference memory savings are substantial.
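The heart of GQA is that the key and value projections produce fewer heads than the query projection, and each key-value head is shared by a group of query heads. A hedged sketch of that sharing, ignoring masking, caching and positional encoding:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """n_heads query heads share n_kv_heads key/value heads (n_heads must be divisible by n_kv_heads)."""

    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)   # smaller K projection
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)   # smaller V projection => smaller KV cache
        self.wo = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)                  # each KV head serves `group` query heads
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)          # (b, n_heads, t, d_head)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

gqa = GroupedQueryAttention(d_model=4096, n_heads=32, n_kv_heads=8)   # four query heads per KV head
print(gqa(torch.randn(1, 5, 4096)).shape)                             # torch.Size([1, 5, 4096])
```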
FlashAttention (Dao et al., 2022) is not a change to the block's mathematics but to its implementation: it computes exact attention in tiles that fit in fast on-chip memory rather than streaming the giant attention matrix to and from main memory. The practical effect is two to four times faster training with less memory used.
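You rarely write this yourself. In recent PyTorch versions, for example, `torch.nn.functional.scaled_dot_product_attention` can dispatch to a FlashAttention-style fused kernel when the dtype, shapes and hardware allow it; whether the fused path is actually taken depends on the installed version and the device:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32  # half precision on GPU is where fused kernels apply
q = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)   # (batch, heads, seq_len, d_head)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact attention; when a fused kernel is used it is computed tile by tile in on-chip memory,
# so the full seq x seq attention matrix is never materialised in main GPU memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 8, 1024, 64])
```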
What you should take away
- A transformer block is the unit of repetition: input shape equals output shape, so blocks can be stacked arbitrarily to make the network deeper.
- Each block contains a multi-head attention sublayer and a feed-forward sublayer, with a residual connection and a normalisation around each. These four ingredients are not optional; remove one and the block fails for a different specific reason.
- The pre-norm arrangement, with normalisation inside each sublayer and a clean residual addition outside, is what allows modern transformers to train stably at very large depth without learning-rate warm-up.
- Most of a block's parameters live in the feed-forward network, which expands to roughly four times the model dimension and behaves like a key-value memory. Attention is the only mechanism that moves information between positions; the FFN does the per-position transformation.
- The current open-model recipe, pre-norm + RMSNorm + SwiGLU + grouped-query attention with rotary positions and FlashAttention kernels, is a small set of tweaks on top of the original 2017 design rather than a rewrite. Understanding the classic block is most of what you need to read any modern transformer paper.