13.9 Parameter count and FLOPs
How big is a transformer? How much compute does it take to train one? And how much does it cost to keep one running once it is trained? These three questions sound mundane, but they govern everything: which model you can fit on which GPU, how long the training run will block your cluster for, what your monthly inference bill looks like, and whether your scaling roadmap is realistic. All three can be answered with arithmetic that fits on the back of a napkin, provided you know where the cost lives. This section sets out the rules of thumb. §13.6 described the transformer block as an architecture; §13.9 quantifies what that architecture costs.
Parameter count of a block
A transformer block has two halves: the attention sub-layer and the feed-forward sub-layer. Each contributes weight matrices, and at modern scales those matrices dominate everything else (LayerNorm, biases, embeddings inside the block) by orders of magnitude.
Inside attention, the four projections $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V, \mathbf{W}^O$ are each $d \times d$, contributing $4d^2$ parameters per block. Multi-head attention does not change this count: splitting into $h$ heads of width $d/h$ rearranges the same parameters into a different tensor shape; it does not add or remove any. So whatever the head count, attention costs $4d^2$ per block.
The FFN takes $\mathbf{x} \in \mathbb{R}^d$ to a hidden dimension of $4d$ and back. That is two matrices, of shapes $d \times 4d$ and $4d \times d$, each contributing $4d^2$ parameters, for $8d^2$ total. Add attention and FFN and you get $12d^2$ parameters per block, with the FFN carrying two-thirds of the weight.
Multiply by $L$ blocks and add the input and output embeddings, $V d$ each (often tied to halve the count, sometimes left untied):
$$ \boxed{\;N \approx 12 L d^2 + 2 V d.\;} $$
This single formula reproduces the headline parameter count of nearly every published dense transformer to within a few per cent. For BERT-base ($d = 768$, $L = 12$), it gives $12 \cdot 12 \cdot 768^2 \approx 85$ M parameters in the blocks, plus another $\sim 25$ M in embeddings, matching the 110 M total reported in the original paper. For GPT-3 ($d = 12288$, $L = 96$, $V \approx 50257$), the block term is $174$ B and the embedding term is $1.2$ B, summing to the advertised 175 B. For LLaMA-3 70B, the rule needs a small SwiGLU correction (the FFN has three weight matrices and a hidden width of $3.5d$ rather than two matrices and a hidden width of $4d$), but the order of magnitude is right.
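The formula is quick to check in code. A minimal sketch, using the published vocabulary sizes (roughly 30 k WordPiece entries for BERT, 50 257 BPE entries for GPT-3) and treating BERT's embeddings as tied:

```python
def param_count(L, d, V, tied_embeddings=False):
    """Approximate transformer parameter count: 12*L*d^2 block weights plus embeddings."""
    blocks = 12 * L * d**2                              # 4d^2 attention + 8d^2 FFN per block
    embeddings = (1 if tied_embeddings else 2) * V * d  # input/output embedding matrices
    return blocks + embeddings

# BERT-base: ~85 M in the blocks, ~110 M reported (positions and segments add a little more).
print(f"BERT-base: {param_count(L=12, d=768, V=30522, tied_embeddings=True) / 1e6:.0f} M")
# GPT-3: ~174 B in the blocks, ~175 B advertised.
print(f"GPT-3    : {param_count(L=96, d=12288, V=50257) / 1e9:.0f} B")
```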
The practical lesson: at any reasonable scale, parameters live in the linear layers, the FFN holds two-thirds of them, embeddings are a rounding error, and width matters quadratically while depth matters linearly. Doubling $d$ quadruples the model; doubling $L$ only doubles it. This asymmetry is why frontier models reach for ever wider hidden dimensions before they reach for ever more layers.
FLOPs per token
A matrix multiplication of shape $a \times b$ by $b \times c$ costs $2 a b c$ FLOPs (one multiply and one add per output entry). Apply that to a forward pass through every weight matrix in the model: each parameter is touched roughly twice per token (once to multiply by an activation, once to accumulate). So a forward pass over one token costs about $2N$ FLOPs.
Backpropagation costs roughly twice as much again, because for every linear layer you need both an input gradient and a weight gradient, each its own matrix multiply. Adding it up, a full training step (forward plus backward) costs about $6N$ FLOPs per token. Different references quote the constant as anywhere between $5$ and $7$ depending on whether they count fused multiply-adds, attention scores, or activation arithmetic, but the $6N$ figure is the canonical Chinchilla-paper convention (Hoffmann et al., 2022) and is good to within ten per cent.
For one training step on a batch of $B$ tokens: roughly $6NB$ FLOPs. For inference, only the forward pass runs, so the cost falls to about $2N$ FLOPs per generated token, a third of the training cost. This is why a model that took six months to train can be served for years on far less hardware: training pays the back-pass tax that inference does not.
A useful sanity check: for a 7 B model generating one token, the work is $1.4 \times 10^{10}$ FLOPs. An H100 at peak fp16 throughput delivers about $10^{15}$ FLOPs per second, so a token should take fourteen microseconds. In practice it takes ten or twenty milliseconds, three orders of magnitude longer, because inference is memory-bandwidth-bound rather than compute-bound, a story unpacked in §13.19. The arithmetic still bounds you, but it bounds you from below; the achieved throughput is set by how fast you can stream weights across the memory hierarchy, not by how fast the multipliers can multiply.
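To make that gap concrete, here is a sketch of both bounds, assuming round numbers for an H100 (about $10^{15}$ fp16 FLOP/s and roughly 3 TB/s of HBM bandwidth; exact figures vary by SKU and clocks) and single-sequence decoding, where every weight must be streamed from memory for every token. The achieved latency sits above the bandwidth bound once KV-cache reads and kernel overheads are added.

```python
N = 7e9                       # parameters
flops_per_token = 2 * N       # forward pass only
peak_flops = 1e15             # assumed H100 fp16 throughput, FLOP/s
hbm_bandwidth = 3e12          # assumed HBM bandwidth, bytes/s
weight_bytes = 2 * N          # fp16 weights

compute_bound = flops_per_token / peak_flops     # ~14 microseconds
bandwidth_bound = weight_bytes / hbm_bandwidth   # ~5 ms: every weight read once per token
print(f"compute-bound  : {compute_bound * 1e6:.0f} us/token")
print(f"bandwidth-bound: {bandwidth_bound * 1e3:.1f} ms/token (batch size 1)")
```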
A second sanity check: at training time, the $6N$ figure assumes activation arithmetic and attention scoring are negligible compared with the weight matrix multiplies. That assumption holds when the context length $T$ is much smaller than $d$, breaks gently for $T \sim d$, and breaks severely for $T \gg d$. If you are training a long-context model, the $6N$ rule of thumb under-counts compute, sometimes by a factor of two, and you should add the explicit attention-score term separately.
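A rough sketch of that correction, reusing the score-and-weighted-sum accounting from the subsection below (about $4 L T d$ forward FLOPs per token, and roughly three times that for a full training step); the GPT-3-like shapes are purely illustrative:

```python
def train_flops_per_token(N, L, d, T):
    """6N weight-matrix FLOPs plus the context-dependent attention-score term."""
    dense = 6 * N                      # forward + backward through the weight matrices
    attn_scores = 3 * 4 * L * T * d    # QK^T and softmax-times-V: ~4*L*T*d forward, x3 for training
    return dense, attn_scores

# A GPT-3-shaped model (N ~ 175 B, L = 96, d = 12288) at a short and a long context.
for T in (2_048, 131_072):
    dense, scores = train_flops_per_token(N=175e9, L=96, d=12288, T=T)
    print(f"T={T:>7}: 6N = {dense:.1e}, attention term = {scores:.1e} ({scores / dense:.0%} extra)")
```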
Total training compute
Stitching the rules together yields the single most useful equation in modern AI economics:
$$ \boxed{\;C \approx 6 N T\;} $$
where $T$ is the total number of training tokens (a different quantity from the per-sequence context length $T$ of the previous subsection). Plug in any model you like.
GPT-3 trained $N = 1.75 \times 10^{11}$ parameters on $T = 3 \times 10^{11}$ tokens, giving $C \approx 6 \cdot 1.75 \times 10^{11} \cdot 3 \times 10^{11} = 3.15 \times 10^{23}$ FLOPs. The GPT-3 paper quotes $3.14 \times 10^{23}$, agreement to better than one per cent. Chinchilla itself trained $N = 7 \times 10^{10}$ parameters on $T = 1.4 \times 10^{12}$ tokens, giving $C \approx 5.9 \times 10^{23}$ FLOPs, only twice as much compute, on a model less than half the size, but at the compute-optimal ratio of about twenty tokens per parameter. The result was a smaller model that beat GPT-3 across the board, and the paper rewrote the field's view of what scaling means.
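Both headline numbers drop out of two lines of arithmetic:

```python
def training_flops(N, T):
    """Chinchilla convention: ~6 FLOPs per parameter per training token."""
    return 6 * N * T

print(f"GPT-3     : {training_flops(N=175e9, T=300e9):.2e} FLOPs")   # paper quotes 3.14e23
print(f"Chinchilla: {training_flops(N=70e9, T=1.4e12):.2e} FLOPs")   # ~20 tokens per parameter
```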
The Chinchilla scaling laws went further: they predicted that for a fixed compute budget $C$, the loss-minimising allocation is $N \propto C^{0.5}$ and $T \propto C^{0.5}$. Quadruple the budget and you should double both the parameter count and the data, not pour the whole increase into one or the other. Pre-Chinchilla models violated this badly (GPT-3 used about 1.7 tokens per parameter, ten times below the optimum) and were therefore undertrained. Post-Chinchilla, frontier labs have swung the other way: LLaMA-3 trained 70 B parameters on 15 T tokens, more than two hundred tokens per parameter, ten times above the Chinchilla optimum. Why? Because Chinchilla optimises training-time loss, but a deployed model is paid for at inference time. A smaller model that has been overtrained is cheaper to serve, and serves billions of tokens before its training cost is amortised.
So $C \approx 6NT$ is not a normative equation: it does not tell you how to allocate $N$ versus $T$, but it does tell you, given any allocation, what your bill will be. And the bill is large: at $4 \times 10^{14}$ effective FLOPs per second per H100 (about a third of peak utilisation), a $6 \times 10^{23}$-FLOP run takes $1.5 \times 10^9$ GPU-seconds, about 47 GPU-years, or roughly two days of wall-clock time on a 10 000-GPU cluster, and frontier runs are one to two orders of magnitude larger still. This is the kind of arithmetic that motivates the buying decisions of frontier labs.
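The same conversion in code, under the effective-throughput assumption above:

```python
C = 6e23                    # training FLOPs for a Chinchilla-scale run
effective_flops = 4e14      # assumed per-GPU throughput (~1/3 of H100 peak), FLOP/s
gpus = 10_000

gpu_seconds = C / effective_flops
print(f"GPU-years    : {gpu_seconds / 3.15e7:.0f}")         # ~47
print(f"cluster days : {gpu_seconds / gpus / 86_400:.1f}")  # ~1.7 days on 10,000 GPUs
```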
Where the FLOPs live
Within a single block, FLOPs roughly mirror the parameter split. The FFN has $8d^2$ parameters and is touched once per token, so it consumes about $16 d^2$ FLOPs per token, or $16 d^2 T$ per layer over a sequence of length $T$. Attention has $4d^2$ matrix-multiplication parameters (the four projections), contributing $8 d^2 T$ FLOPs per layer, plus the attention scores themselves: computing $\mathbf{Q}\mathbf{K}^\top$ costs $2 T d$ per query, summed over $T$ queries for $2 T^2 d$ per layer, and applying the softmax weights to $\mathbf{V}$ adds another $2 T^2 d$.
Tally this up: in the regime $T \ll d$, the $T^2 d$ score and weighted-sum terms are small, the FFN dominates at about 70 per cent of FLOPs, and attention's projections take the remaining 30 per cent. For BERT-base ($d = 768$, $T = 512$), $T < d$ and the 70/30 split holds tidily. For GPT-3 ($d = 12288$, $T = 2048$), $T \ll d$ and the FFN dominates even more strongly.
The picture inverts at long context. Once $T > d$, the $T^2 d$ score and softmax-times-value terms outweigh the linear $T d^2$ projection terms, and attention takes over. For a 1 M-token context on a $d = 4096$ model, attention is the entire bill and the FFN is the sideshow. This crossover is why §13.13 calls attention the quadratic wall, and why every long-context architecture (FlashAttention's IO-aware kernels, Linformer's low-rank attention, Mamba's state-space alternative) is fundamentally trying to flatten that quadratic.
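A sketch of the per-layer split with the constants kept in (forward pass only), which reproduces the roughly-70/30 regime and its inversion; the last row is the 1 M-token example above:

```python
def layer_flops(d, T):
    """Forward FLOPs for one transformer block over a sequence of length T."""
    ffn         = 16 * d**2 * T   # two matrices of 4d^2 parameters, 2 FLOPs per parameter per token
    attn_proj   = 8 * d**2 * T    # Q, K, V, O projections
    attn_scores = 4 * T**2 * d    # QK^T plus the softmax-weighted sum over V
    return ffn, attn_proj, attn_scores

for name, d, T in [("BERT-base", 768, 512), ("GPT-3", 12288, 2048), ("long-context", 4096, 1_000_000)]:
    ffn, proj, scores = layer_flops(d, T)
    total = ffn + proj + scores
    print(f"{name:>12}: FFN {ffn/total:.0%}, projections {proj/total:.0%}, scores {scores/total:.0%}")
```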
Memory cost
Parameters at fp16 take two bytes each, so a 70 B model needs 140 GB just to hold its weights. Optimiser state for Adam adds two more per-parameter buffers (first and second moments), conventionally kept at fp32, which is another 560 GB, and a mixed-precision recipe typically keeps an fp32 master copy of the weights on top of that. But the dominant memory cost during training is activations, not parameters.
To run backpropagation, every layer must store the inputs to its linear maps so the backward pass can compute weight gradients. A naive forward-backward pass therefore scales activation memory as $O(L \cdot B \cdot T \cdot d)$ per micro-batch, with the largest single tensor being the FFN intermediate (width $4d$, or $3.5d = 28672$ in LLaMA-3 70B). For LLaMA-3 70B with $L = 80$, $T = 8192$, $B = 4$, counting just that intermediate at fp16: roughly $80 \cdot 4 \cdot 8192 \cdot 28672 \cdot 2$ bytes $\approx 150$ GB, more than the parameters themselves. The standard fix is activation checkpointing: store only about $\sqrt{L}$ checkpoints along the forward pass and recompute the rest during backward, cutting activation memory by a factor of roughly $\sqrt{L}$ at the cost of about one extra forward pass per backward pass.
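Putting the training-memory pieces side by side, under the shapes and precisions assumed above (decimal GB, counting only the FFN-intermediate activations):

```python
GB = 1e9
params = 70e9
L, B, T, d_ff = 80, 4, 8192, 28672   # LLaMA-3-70B-like shapes; d_ff is the FFN hidden width

weights_fp16 = 2 * params             # ~140 GB
adam_moments = 2 * 4 * params         # two fp32 buffers, ~560 GB
activations  = L * B * T * d_ff * 2   # FFN intermediates at fp16, ~150 GB
checkpointed = activations / L**0.5   # sqrt(L) checkpointing, ~17 GB

for name, nbytes in [("weights (fp16)", weights_fp16), ("Adam moments (fp32)", adam_moments),
                     ("activations (naive)", activations), ("activations (checkpointed)", checkpointed)]:
    print(f"{name:<26}: {nbytes / GB:.0f} GB")
```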
At inference, the activation problem disappears (no backward pass), but a new one takes its place: the KV cache. Each generated token must attend to every previous token, which means caching the keys and values from every layer. KV-cache memory scales as $2 \cdot L \cdot T \cdot d$ scalars per sequence (a factor of 2 for K and V). For $L = 100$, $T = 4096$, $d = 12288$ in float16: $2 \cdot 100 \cdot 4096 \cdot 12288 \cdot 2$ bytes $\approx 20$ GB per concurrent sequence. Serving thirty users at once means well over half a terabyte of KV cache before the model itself is loaded, which is why §13.19 (PagedAttention, vLLM) treats KV-cache management as the central engineering problem of LLM serving.
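The same KV-cache arithmetic as a sketch; note that grouped-query attention, which this estimate ignores, shrinks the cache by the ratio of query heads to key/value heads:

```python
def kv_cache_bytes(L, T, d, bytes_per_scalar=2, n_seqs=1):
    """Keys and values cached for every layer and every previous token (full multi-head attention)."""
    return 2 * L * T * d * bytes_per_scalar * n_seqs

print(f"one sequence: {kv_cache_bytes(L=100, T=4096, d=12288) / 1e9:.0f} GB")             # ~20 GB
print(f"30 sequences: {kv_cache_bytes(L=100, T=4096, d=12288, n_seqs=30) / 1e9:.0f} GB")  # ~600 GB
```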
What you should take away
- Parameter count is $12 L d^2 + 2 V d$. Width matters quadratically, depth linearly; embeddings are a rounding error; the FFN holds two-thirds of the weights.
- Training compute is $C \approx 6 N T$. Plug in any model and any token count and you have the FLOPs, accurate to ten per cent; inference is a third of that, $\approx 2 N$ per token.
- Chinchilla's optimum is twenty tokens per parameter. Modern frontier models intentionally overtrain because a smaller model is cheaper to serve, even at the cost of training-suboptimal loss.
- At normal context lengths, FFN dominates roughly 70/30 over attention. That ratio inverts once $T > d$, which is why long-context training is fundamentally a different engineering regime.
- Memory, not FLOPs, is the binding constraint. Activations during training and the KV cache during inference dominate the memory bill; checkpointing, mixed precision, ZeRO sharding and paged attention exist because the arithmetic above tells you exactly how big a problem you are solving.