Multi-Head Attention runs multiple self-attention operations in parallel, each with its own learned projection matrices, and concatenates their outputs. This allows the model to simultaneously attend to information from different representational subspaces at different positions. One head might capture syntactic dependencies, another semantic similarity, and another positional proximity—all within the same layer.
Formally, with $h$ heads and model dimension $d_{\text{model}}$, each head $j$ has its own learned projection matrices $W_j^Q$, $W_j^K$, $W_j^V$, producing queries, keys, and values of dimension $d_k = d_{\text{model}}/h$. Each head computes scaled dot-product attention independently: $\text{head}_j = \text{Attention}(QW_j^Q, KW_j^K, VW_j^V)$. The head outputs are concatenated and passed through a final linear projection: $\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$. The total computational cost is similar to that of a single head with full dimensionality, since each head operates on a reduced dimension.
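The split-attend-concatenate-project pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the matrix names (`Wq`, `Wk`, `Wv`, `Wo`) and shapes are assumptions, self-attention is computed (queries, keys, and values all derive from the same input `X`), and masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq, d_model); Wq/Wk/Wv/Wo: (d_model, d_model); h: number of heads."""
    seq, d_model = X.shape
    d_k = d_model // h

    # Project, then split the model dimension into h heads of size d_k.
    def project_and_split(W):  # (seq, d_model) -> (h, seq, d_k)
        return (X @ W).reshape(seq, h, d_k).transpose(1, 0, 2)

    Q, K, V = project_and_split(Wq), project_and_split(Wk), project_and_split(Wv)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq, seq)
    heads = softmax(scores) @ V                       # (h, seq, d_k)

    # Concatenate the heads and apply the output projection W^O.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, h, seq = 64, 8, 10
X = rng.standard_normal((seq, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (10, 64): same shape as the input, as required for stacking layers
```

Note that the per-head projections are implemented here as one full $d_{\text{model}} \times d_{\text{model}}$ matrix that is reshaped afterwards; this is mathematically equivalent to $h$ separate $d_{\text{model}} \times d_k$ matrices and is how most libraries implement it.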
Empirical analyses have found striking patterns of specialisation: some heads consistently attend to the previous token, others to the first token in the sequence, and others implement syntactic relations such as subject-verb agreement. Many heads turn out to be redundant (pruning them barely affects performance), which has motivated efficiency variants. Multi-Query Attention (MQA) shares a single key-value projection across all heads, dramatically reducing the memory needed to cache K/V during autoregressive generation. Grouped-Query Attention (GQA) partitions the heads into groups that share K/V projections, a middle ground between full multi-head attention (one K/V pair per head) and MQA (one K/V pair for all heads). Both are used in modern LLMs such as LLaMA 2 to reduce inference cost.
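The K/V sharing in GQA can be sketched as follows. This is an illustrative NumPy sketch under assumed shapes, not any particular model's implementation: query heads come pre-projected, each group of query heads reuses one shared K/V pair, and setting `n_kv_heads=1` recovers MQA while `n_kv_heads=n_heads` recovers standard multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V, n_heads, n_kv_heads):
    """Q: (n_heads, seq, d_k); K, V: (n_kv_heads, seq, d_k)."""
    group = n_heads // n_kv_heads
    # Replicate each shared K/V so every query head in a group uses the same pair.
    K = np.repeat(K, group, axis=0)  # (n_heads, seq, d_k)
    V = np.repeat(V, group, axis=0)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_heads, n_kv_heads, seq, d_k = 8, 2, 10, 16
Q = rng.standard_normal((n_heads, seq, d_k))
K = rng.standard_normal((n_kv_heads, seq, d_k))  # only 2 K/V pairs cached, not 8
V = rng.standard_normal((n_kv_heads, seq, d_k))
out = grouped_query_attention(Q, K, V, n_heads, n_kv_heads)
print(out.shape)  # (8, 10, 16)
```

The inference saving comes from the cache, not the arithmetic: the K/V cache stores $2 \cdot n_{\text{kv\_heads}} \cdot d_k$ values per token instead of $2 \cdot n_{\text{heads}} \cdot d_k$, a $4\times$ reduction in this sketch.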
Related terms: Self-Attention, Attention Mechanism, Transformer
Discussed in:
- Chapter 13: Attention & Transformers — Multi-Head Attention
Also defined in: Textbook of AI