Multi-head attention, a feed-forward network, residual connections and layer norm: the building block of every modern LLM.
From Chapter 15: Modern AI
Glossary: transformer, multi-head attention, layer norm, residual connection, feed-forward network
Transcript
Every modern large language model is a stack of identical transformer blocks. Inside each block, the same short sequence of operations.
A token enters at the bottom. It is a vector with hundreds or thousands of dimensions.
First, layer normalisation. The vector is rescaled so its mean is zero and its variance one, then a learned scale and shift are applied. This stabilises training.
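The normalisation step can be sketched in a few lines of numpy. This is a minimal version that only does the zero-mean, unit-variance rescaling; the learned scale and shift parameters of a full layer-norm implementation are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    # eps guards against division by zero for near-constant vectors.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one token, four dimensions
y = layer_norm(x)
```

After this call, `y` has mean approximately zero and variance approximately one along its last axis.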
Then multi-head attention. Each token computes a query, a key and a value. Many attention heads run in parallel, each looking at a different aspect of the sequence. Their outputs are concatenated and passed through a final linear projection.
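As a sketch, here is multi-head attention in numpy. It is unmasked, so every token attends to every position; a decoder-only LLM would additionally apply a causal mask so tokens cannot attend to later positions. The weight shapes and the head-splitting reshape are the standard arrangement.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (seq_len, d_model); each weight matrix: (d_model, d_model).
    seq, d = x.shape
    dh = d // n_heads  # dimension per head
    # Project to queries, keys, values, then split into heads.
    q = (x @ Wq).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention: one (seq, seq) map per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ v              # (n_heads, seq, dh)
    # Concatenate the heads, then apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d)
    return out @ Wo

rng = np.random.default_rng(0)
d, seq, heads = 8, 3, 2
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
x = rng.normal(size=(seq, d))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=heads)
```

The output has the same shape as the input, which is what lets the residual addition in the next step work.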
A residual connection adds the attention output back to the original vector. The original is preserved; the attention has contributed an update.
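The residual connection is literally one addition. The vectors below are made-up values just to show the mechanics: the original is always preserved, and a sub-layer that contributes nothing leaves the token unchanged.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # original token vector
update = np.array([0.1, 0.3, -0.2])  # hypothetical attention output
out = x + update                     # residual: original plus update

# If the sub-layer outputs zeros, the token passes through untouched.
identity = x + np.zeros_like(x)
```

This identity path is why deep stacks of blocks remain trainable: information and gradients can flow straight through every layer.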
A second layer normalisation, then a feed-forward network. Two linear layers with a non-linearity between them. This processes each token independently.
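The feed-forward network is small enough to write out in full. This sketch uses ReLU as the non-linearity and a 4x hidden expansion; real models commonly use GELU or gated variants, and the exact expansion factor varies by architecture.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expand to the hidden size, apply the non-linearity, project back.
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU, for simplicity
    return h @ W2 + b2

rng = np.random.default_rng(0)
d = 8
W1, b1 = rng.normal(0, 0.1, (d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(0, 0.1, (4 * d, d)), np.zeros(d)
x = rng.normal(size=(3, d))  # three tokens processed independently
y = feed_forward(x, W1, b1, W2, b2)
```

Note there is no mixing between rows: each token goes through the same two matrices on its own, which is what "processes each token independently" means.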
Another residual adds the feed-forward output back in.
The result is a refined token vector that is ready for the next block.
Stack ninety-six of these and you have GPT-3; eighty, and you have Llama 3's seventy-billion-parameter model. The block is the same every time; only the weights differ.
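The whole walkthrough fits in one function. This sketch wires the steps together in the pre-norm arrangement (normalise, attend, add; normalise, transform, add) and stacks three tiny blocks. It uses a single attention head and omits biases and the learned norm parameters to stay short, so it is an illustration of the structure, not a faithful miniature of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    m = x.mean(axis=-1, keepdims=True)
    v = x.var(axis=-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # Pre-norm transformer block: norm -> attention -> residual,
    # then norm -> feed-forward -> residual.
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(h.shape[-1])) @ v @ p["Wo"]
    x = x + attn                               # first residual
    h = layer_norm(x)
    ffn = np.maximum(0.0, h @ p["W1"]) @ p["W2"]
    return x + ffn                             # second residual

d, seq, n_layers = 8, 4, 3
params = [
    {k: rng.normal(0, 0.1, (d, d)) for k in ("Wq", "Wk", "Wv", "Wo")}
    | {"W1": rng.normal(0, 0.1, (d, 4 * d)),
       "W2": rng.normal(0, 0.1, (4 * d, d))}
    for _ in range(n_layers)
]
x = rng.normal(size=(seq, d))
for p in params:  # identical block, different weights
    x = block(x, p)
```

The loop at the end is the point: the same `block` function runs every time, and only the entries of `params` change from layer to layer.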