Multi-Head Attention runs multiple self-attention operations in parallel, each with its own learned projection matrices, and concatenates their outputs. This allows the model to simultaneously attend to information from different representational subspaces at different positions. One head might capture syntactic dependencies, another semantic similarity, and another positional proximity—all within the same layer.
Formally, with $h$ heads and model dimension $d_{\text{model}}$, each head $j$ has its own learned projection matrices $W_j^Q$, $W_j^K$, $W_j^V$, producing queries, keys, and values of dimension $d_k = d_{\text{model}}/h$. Each head computes scaled dot-product attention independently: $\text{head}_j = \text{Attention}(QW_j^Q, KW_j^K, VW_j^V)$. The head outputs are concatenated and passed through a final linear projection: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$. The total computational cost is similar to that of a single head with full dimensionality, since each head operates on a reduced dimension.
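The split-attend-concatenate-project pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrix names (`Wq`, `Wk`, `Wv`, `Wo`) and shapes are assumptions, self-attention is computed (queries, keys, and values all derive from the same input `X`), and masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h: number of heads."""
    seq, d_model = X.shape
    d_k = d_model // h

    # Project, then split the model dimension into h heads of size d_k.
    def project_and_split(W):  # (seq, d_model) -> (h, seq, d_k)
        return (X @ W).reshape(seq, h, d_k).transpose(1, 0, 2)

    Q, K, V = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq, seq)
    heads = softmax(scores) @ V                       # (h, seq, d_k)

    # Concatenate the heads and apply the output projection W^O.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, h, seq = 64, 8, 10
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (10, 64): same shape as the input, as required for stacking layers
```

Note that the per-head projections are implemented here as one full $d_{\text{model}} \times d_{\text{model}}$ matrix that is reshaped afterwards; this is mathematically equivalent to $h$ separate $d_{\text{model}} \times d_k$ matrices and is how most libraries implement it.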
Empirical analyses have found striking patterns of specialisation: some heads consistently attend to the previous token, others to the first token in the sequence, and others implement syntactic relations such as subject-verb agreement. Many heads turn out to be redundant (pruning them barely affects performance), which has motivated efficiency variants. Multi-Query Attention (MQA) shares a single key-value projection across all heads, dramatically reducing the memory needed to cache K/V during autoregressive generation. Grouped-Query Attention (GQA) partitions the heads into groups that share K/V projections, a middle ground between full multi-head attention (one K/V pair per head) and MQA (one K/V pair for all heads). Both are used in modern LLMs such as LLaMA 2 to reduce inference cost.
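The K/V sharing in GQA can be sketched as follows. This is an illustrative NumPy sketch under assumed shapes, not any particular model's implementation: query heads come pre-projected, each group of query heads reuses one shared K/V pair, and setting `n_kv_heads=1` recovers MQA while `n_kv_heads=n_heads` recovers standard multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V, n_heads, n_kv_heads):
    """Q: (n_heads, seq, d_k); K, V: (n_kv_heads, seq, d_k)."""
    group = n_heads // n_kv_heads
    # Replicate each shared K/V so every query head in a group uses the same pair.
    K = np.repeat(K, group, axis=0)  # (n_heads, seq, d_k)
    V = np.repeat(V, group, axis=0)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_heads, n_kv_heads, seq, d_k = 8, 2, 10, 16
Q = rng.standard_normal((n_heads, seq, d_k))
K = rng.standard_normal((n_kv_heads, seq, d_k))  # only 2 K/V pairs cached, not 8
V = rng.standard_normal((n_kv_heads, seq, d_k))
out = grouped_query_attention(Q, K, V, n_heads, n_kv_heads)
print(out.shape)  # (8, 10, 16)
```

The inference saving comes from the cache, not the arithmetic: the K/V cache stores $2 \cdot n_{\text{kv\_heads}} \cdot d_k$ values per token instead of $2 \cdot n_{\text{heads}} \cdot d_k$, a $4\times$ reduction in this sketch.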
Related terms: Self-Attention, Attention Mechanism, Transformer
Discussed in:
- Chapter 13: Attention & Transformers — Multi-Head Attention
Also defined in: Textbook of AI