Self-Attention

Self-attention is the special case of attention in which queries, keys and values all come from the same input: each position attends to every position in the same sequence, including itself. The mechanism lets a model combine information across arbitrary distances in a single layer, in contrast to recurrent models (which process the sequence step by step) or convolutional models (which need deep stacks to integrate distant information).

For an input sequence X whose rows are the embeddings at each position, self-attention computes Q = XW_Q, K = XW_K, V = XW_V via learned projections, then Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Each output row is a weighted sum of the value rows (the projected input rows), with the weights determined by query-key similarity.

Causal masking restricts each position to attend only to itself and earlier positions (used in autoregressive models like GPT). Bidirectional self-attention allows attention in both directions (used in encoder-only models like BERT).

Multi-head self-attention runs h parallel self-attention computations with different learned projections (typically h = 8 to 96), concatenates their outputs and applies a final projection. The multiple heads allow the model to attend to different aspects of the input simultaneously: syntactic structure, lexical similarity, positional patterns and so on.

Self-attention is the central computational primitive of every modern Transformer. Its quadratic O(n²) cost in sequence length is the main scalability bottleneck and the focus of substantial research (FlashAttention, sparse attention, linear-attention variants).
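To make the quadratic growth concrete, the following sketch counts the bytes needed for one n × n score matrix, assuming it is fully materialized in 32-bit floats for a single head. The function name `score_matrix_bytes` and the chosen sequence lengths are illustrative assumptions, not part of any library.

```python
def score_matrix_bytes(n, bytes_per_entry=4):
    """Bytes for one n x n attention score matrix (one head, float32 assumed)."""
    return n * n * bytes_per_entry

# Doubling the sequence length quadruples the size of the score matrix.
sizes_mib = {n: score_matrix_bytes(n) / 2**20 for n in (1024, 2048, 4096)}
for n, mib in sizes_mib.items():
    print(f"n = {n}: {mib} MiB")
# n = 1024: 4.0 MiB
# n = 2048: 16.0 MiB
# n = 4096: 64.0 MiB
```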

Mathematics

Given an input sequence as rows of a matrix $X \in \mathbb{R}^{n \times d}$, self-attention projects each row to query, key and value vectors:

$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$

with $W^Q, W^K \in \mathbb{R}^{d \times d_k}$, $W^V \in \mathbb{R}^{d \times d_v}$. The output of self-attention is

$$\mathrm{SelfAttn}(X) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \in \mathbb{R}^{n \times d_v}.$$

Each row of the output is a weighted sum of all value rows, with weights determined by similarities between the corresponding query and all keys.
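The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random weights and the helper names `softmax` and `self_attention` are assumptions chosen for the example.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # (n, d_k), (n, d_k), (n, d_v)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n, n), each row sums to 1
    return weights @ V, weights                # output is (n, d_v)

# Illustrative sizes only: 4 positions, model width 8.
rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 5, 6
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Row i of `weights` holds the attention that position i pays to every position, and row i of `out` is the corresponding weighted sum of the rows of V.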

Multi-head self-attention splits the dimension across $h$ heads, runs $h$ self-attentions in parallel with different projections, and concatenates the results:

$$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{SelfAttn}_i(X)$$
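A common way to implement the heads, sketched below under the assumption $d_k = d_v = d/h$, is to stack the per-head projections into single $d \times d$ matrices, project once, and then split the columns into $h$ groups. The function name `multi_head_self_attention` and the sizes are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """MHSA with d_k = d_v = d / h; W_q, W_k, W_v, W_o are all (d, d)."""
    n, d = X.shape
    assert d % h == 0, "model width must be divisible by the number of heads"
    d_head = d // h

    def split(M):
        # (n, d) -> (h, n, d_head): head i takes columns i*d_head .. (i+1)*d_head - 1.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ V                          # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)      # Concat(head_1, ..., head_h)
    return concat @ W_o                                  # (n, d)

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)
```

Batching the heads along a leading axis lets one matrix multiplication serve all heads, which is mathematically the same as running $h$ separate self-attentions and concatenating.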

With $d_k = d_v = d/h$, an MHSA layer has $4 d^2$ parameters, ignoring biases: stacking the $h$ per-head projections gives $W^Q$, $W^K$ and $W^V$ each of size $d \times d$, and $W^O$ is also $d \times d$.
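The count can be checked by summing the per-head matrices explicitly. The sketch below assumes no bias terms and uses $d = 512$, $h = 8$ purely as an illustrative configuration; `mhsa_parameter_count` is a hypothetical helper name.

```python
def mhsa_parameter_count(d, h):
    """Weights in one MHSA layer with d_k = d_v = d / h, biases ignored."""
    assert d % h == 0
    d_head = d // h
    per_head = 3 * d * d_head      # W^Q, W^K, W^V for a single head
    output = (h * d_head) * d      # W^O maps the concatenated heads back to width d
    return h * per_head + output

d, h = 512, 8
count = mhsa_parameter_count(d, h)
print(count)  # 1048576, which equals 4 * 512**2
```

Note that the total does not depend on $h$: more heads means narrower heads, not more parameters.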

Causal self-attention adds a triangular mask before the softmax (scores for positions $j > i$ are set to $-\infty$) so that position $i$ attends only to positions $\leq i$; this is the autoregressive constraint of language models. Bidirectional self-attention uses no mask, so every position attends to every other.
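A minimal NumPy sketch of the causal mask follows. It assumes the mask is applied by setting future scores to $-\infty$ before the softmax, which gives those positions zero weight; the function name and the random inputs are illustrative assumptions.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights in which position i sees only positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks future positions (j > i).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Each row keeps its diagonal entry, so the row maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)           # exp(-inf) = 0 removes the masked entries
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
weights = causal_attention_weights(Q, K)
print(np.round(weights, 3))        # lower-triangular; first row is [1, 0, 0, 0]
```

Dropping the `np.where` line recovers bidirectional attention, where every entry of the weight matrix can be non-zero.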

Interactive

Self-attention as Q–K–V dot products. Query, key and value vectors produce an attention matrix over four tokens.

Multiple attention heads in parallel. Each head learns a different similarity pattern. Their outputs concatenate and project to one tensor.

Related terms: Attention Mechanism, Multi-Head Attention, Transformer

Discussed in:

Chapter 13: Attention & Transformers
