Multi-head attention, a feed-forward network, residual connections and layer norm: the building block of every modern LLM.
From Chapter 15: Modern AI
Glossary: transformer, multi-head attention, layer norm, residual connection, feed-forward network
Transcript
Every modern large language model is a stack of identical transformer blocks. Inside each block, the same short sequence of operations.
A token enters at the bottom. It is a vector with hundreds or thousands of dimensions.
First, layer normalisation. The vector is rescaled so its mean is zero and its variance one, then a learned scale and shift are applied. This stabilises training.
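The normalisation step can be sketched in a few lines of numpy. This is a minimal version that only does the zero-mean, unit-variance rescaling; the learned scale and shift parameters of a full layer-norm implementation are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    # eps guards against division by zero for near-constant vectors.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])  # one token, four dimensions
y = layer_norm(x)
```

After this call, `y` has mean approximately zero and variance approximately one along its last axis.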
Then multi-head attention. Each token computes a query, a key and a value. Many attention heads run in parallel, each looking at a different aspect of the sequence. Their outputs are concatenated and passed through a final linear projection.
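As a sketch, here is multi-head attention in numpy. It is unmasked, so every token attends to every position; a decoder-only LLM would additionally apply a causal mask so tokens cannot attend to later positions. The weight shapes and the head-splitting reshape are the standard arrangement.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (seq_len, d_model); each weight matrix: (d_model, d_model).
    seq, d = x.shape
    dh = d // n_heads  # dimension per head
    # Project to queries, keys, values, then split into heads.
    q = (x @ Wq).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, dh).transpose(1, 0, 2)
    # Scaled dot-product attention: one (seq, seq) map per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    out = softmax(scores) @ v              # (n_heads, seq, dh)
    # Concatenate the heads, then apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d)
    return out @ Wo

rng = np.random.default_rng(0)
d, seq, heads = 8, 3, 2
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
x = rng.normal(size=(seq, d))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=heads)
```

The output has the same shape as the input, which is what lets the residual addition in the next step work.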
A residual connection adds the attention output back to the original vector. The original is preserved; the attention has contributed an update.
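The residual connection is literally one addition. The vectors below are made-up values just to show the mechanics: the original is always preserved, and a sub-layer that contributes nothing leaves the token unchanged.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # original token vector
update = np.array([0.1, 0.3, -0.2])  # hypothetical attention output
out = x + update                     # residual: original plus update

# If the sub-layer outputs zeros, the token passes through untouched.
identity = x + np.zeros_like(x)
```

This identity path is why deep stacks of blocks remain trainable: information and gradients can flow straight through every layer.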
A second layer normalisation, then a feed-forward network. Two linear layers with a non-linearity between them. This processes each token independently.
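The feed-forward network is small enough to write out in full. This sketch uses ReLU as the non-linearity and a 4x hidden expansion; real models commonly use GELU or gated variants, and the exact expansion factor varies by architecture.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Expand to the hidden size, apply the non-linearity, project back.
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU, for simplicity
    return h @ W2 + b2

rng = np.random.default_rng(0)
d = 8
W1, b1 = rng.normal(0, 0.1, (d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(0, 0.1, (4 * d, d)), np.zeros(d)
x = rng.normal(size=(3, d))  # three tokens processed independently
y = feed_forward(x, W1, b1, W2, b2)
```

Note there is no mixing between rows: each token goes through the same two matrices on its own, which is what "processes each token independently" means.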
Another residual adds the feed-forward output back in.
The result is a refined token vector that is ready for the next block.
Stack ninety-six of these and you have GPT-3; eighty, and you have Llama 3's seventy-billion-parameter model. The block is the same every time; only the weights differ.
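The whole walkthrough fits in one function. This sketch wires the steps together in the pre-norm arrangement (normalise, attend, add; normalise, transform, add) and stacks three tiny blocks. It uses a single attention head and omits biases and the learned norm parameters to stay short, so it is an illustration of the structure, not a faithful miniature of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    m = x.mean(axis=-1, keepdims=True)
    v = x.var(axis=-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # Pre-norm transformer block: norm -> attention -> residual,
    # then norm -> feed-forward -> residual.
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(h.shape[-1])) @ v @ p["Wo"]
    x = x + attn                               # first residual
    h = layer_norm(x)
    ffn = np.maximum(0.0, h @ p["W1"]) @ p["W2"]
    return x + ffn                             # second residual

d, seq, n_layers = 8, 4, 3
params = [
    {k: rng.normal(0, 0.1, (d, d)) for k in ("Wq", "Wk", "Wv", "Wo")}
    | {"W1": rng.normal(0, 0.1, (d, 4 * d)),
       "W2": rng.normal(0, 0.1, (4 * d, d))}
    for _ in range(n_layers)
]
x = rng.normal(size=(seq, d))
for p in params:  # identical block, different weights
    x = block(x, p)
```

The loop at the end is the point: the same `block` function runs every time, and only the entries of `params` change from layer to layer.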