Each head learns a different similarity pattern; their outputs are concatenated and projected back into a single tensor.
From Chapter 13: Attention & Transformers
Glossary: multi-head attention, self-attention
Transcript
Self-attention computes, for each position, a single weighted average of the value vectors, with weights derived from query-key dot products.
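A minimal sketch of that single computation, in PyTorch. The scaling by the square root of the key dimension is standard; the function and variable names here are just for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: one weighted average of values per position.
    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q                                      # queries
    k = x @ w_k                                      # keys
    v = x @ w_v                                      # values
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # query-key dot products, scaled
    weights = F.softmax(scores, dim=-1)              # one attention distribution per query
    return weights @ v                               # weighted average of the value vectors

x = torch.randn(5, 16)                               # 5 tokens, model dimension 16
w = [torch.randn(16, 16) for _ in range(3)]
out = self_attention(x, *w)                          # (5, 16): one attended vector per token
```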
Multi-head attention runs that machinery several times in parallel. Each instance is one head.
Head one has its own query, key, and value projection matrices. It looks at the input through its own narrow lens. Maybe it focuses on the previous noun.
Head two has different projections. Maybe it tracks long-range dependencies, like which subject a verb belongs to.
Head three. Head four. In GPT-2, twelve heads. In Llama, thirty-two. In some models, ninety-six.
All heads run on the same input, in parallel.
Each head produces its own attended output, a sequence of vectors narrower than the model dimension. Each head typically gets the model dimension divided by the number of heads: in GPT-2, 768 split across 12 heads gives 64 dimensions per head.
We concatenate them along the feature axis, getting back the full model dimension. A final linear projection mixes them.
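Here is a compact sketch of the whole pipeline so far: per-head projections, parallel attention, concatenation along the feature axis, and the final mixing projection. The class name, sizes, and layout are my own choices for illustration; masking and dropout are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Sketch of multi-head self-attention: per-head projections, parallel
    attention, concatenation along the feature axis, then a mixing projection."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads          # e.g. 768 // 12 = 64
        # One full-width projection per role; it is sliced into per-head pieces below.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)        # final projection that mixes the heads

    def forward(self, x):                             # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        # Project, then split the feature axis into (num_heads, head_dim).
        q = self.w_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        # All heads attend in parallel: (batch, heads, seq, seq) score matrices.
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                             # (batch, heads, seq, head_dim)
        # Concatenate heads back to d_model, then mix with the output projection.
        out = out.transpose(1, 2).contiguous().view(b, t, d)
        return self.w_o(out)

mha = MultiHeadSelfAttention(d_model=768, num_heads=12)
y = mha(torch.randn(1, 10, 768))                      # -> torch.Size([1, 10, 768])
```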
Different heads specialise during training. Some pay attention to syntax. Some to coreference. Some to position. Some to specific tokens.
Multi-head attention is what gives transformers their flexibility. One layer can simultaneously route information along many different paths through the sequence.
And because each head works in that reduced dimension, the total parameter count and FLOP count come out roughly the same as a single big attention head at the full model dimension. The flexibility is close to free.
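A quick back-of-the-envelope check of that claim, using GPT-2-small-like sizes (model dimension 768, 12 heads). Splitting the projections across heads leaves the parameter count unchanged, and the attention-score FLOPs come out the same too.

```python
d_model, num_heads = 768, 12
head_dim = d_model // num_heads                        # 64

# Parameters in the Q/K/V projections (biases ignored for simplicity):
single_head = 3 * d_model * d_model                    # one head at full width
multi_head = 3 * num_heads * d_model * head_dim        # 12 narrow heads
print(single_head, multi_head)                         # 1769472 1769472 -- identical

# Attention-score multiply-adds for a sequence of length n, roughly:
n = 1024
single_scores = n * n * d_model
multi_scores = num_heads * n * n * head_dim
print(single_scores, multi_scores)                     # also identical
```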