Self-attention is the special case of attention in which queries, keys and values all come from the same input: each position attends to every position in the same sequence, itself included. The mechanism allows a model to combine information across arbitrary distances in a single layer, in contrast to recurrent models (which require sequential processing) or convolutional models (which require deep stacks to integrate distant information).
For an input sequence X whose rows are the position embeddings, self-attention computes Q = XW_Q, K = XW_K, V = XW_V via learned projections, then Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. Each output row is a weighted sum of the value rows (one per input position), with the weights determined by query-key similarity.
Causal masking restricts each position to attend only to itself and earlier positions (used in autoregressive models like GPT). Bidirectional self-attention allows attention in both directions (used in encoder-only models like BERT).
Multi-head self-attention runs h parallel self-attention computations with different learned projections (typically h = 8 to 96), concatenates their outputs and applies a final projection. The multiple heads allow the model to attend to different aspects of the input simultaneously: syntactic structure, lexical similarity, positional patterns and so on.
Self-attention is the central computational primitive of every modern Transformer. Its quadratic O(n²) cost in sequence length is the main scalability bottleneck and the focus of substantial research (FlashAttention, sparse attention, linear-attention variants).
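To make the quadratic growth concrete, the sketch below counts the bytes needed to store one full n × n score matrix for a single head and a single sequence. It assumes the naive case in which the matrix is materialised in full and stored in 32-bit floats; the function name `attention_score_bytes` is made up for this illustration.

```python
def attention_score_bytes(n, bytes_per_entry=4):
    """Bytes for one n-by-n attention score matrix (one head, one sequence).

    Assumes the full matrix is stored; 4 bytes per entry corresponds to float32.
    """
    return n * n * bytes_per_entry

for n in (1024, 2048, 4096):
    mib = attention_score_bytes(n) / 2**20
    print(n, mib, "MiB")
# 1024 4.0 MiB
# 2048 16.0 MiB
# 4096 64.0 MiB
```

Doubling the sequence length quadruples the storage, and the figure is multiplied again by the number of heads, layers and sequences in a batch.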
Mathematics
Given an input sequence as rows of a matrix $X \in \mathbb{R}^{n \times d}$, self-attention projects each row to query, key and value vectors:
$$Q = X W^Q, \quad K = X W^K, \quad V = X W^V$$
with $W^Q, W^K \in \mathbb{R}^{d \times d_k}$, $W^V \in \mathbb{R}^{d \times d_v}$. The output of self-attention is
$$\mathrm{SelfAttn}(X) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \in \mathbb{R}^{n \times d_v}.$$
Each row of the output is a weighted sum of all value rows, with weights determined by similarities between the corresponding query and all keys.
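The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the random weights and the sizes n = 5, d = 16, d_k = d_v = 8 are all assumptions chosen for the example.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtracting the row maximum leaves the result unchanged but avoids overflow.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention.

    X: (n, d); W_q, W_k: (d, d_k); W_v: (d, d_v).
    Returns the output (n, d_v) and the attention weights (n, n).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # each row sums to 1
    return weights @ V, weights

# Illustrative sizes and random (untrained) projections.
rng = np.random.default_rng(0)
n, d, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Row i of `weights` holds the softmax-normalised similarities between query i and every key, and row i of `out` is the corresponding weighted sum of the rows of V.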
Multi-head self-attention splits the dimension across $h$ heads, runs $h$ self-attentions in parallel with different projections, and concatenates the results:
$$\mathrm{MHSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$
$$\mathrm{head}_i = \mathrm{SelfAttn}_i(X)$$
Total parameter count for an MHSA layer with $d_k = d_v = d/h$, ignoring bias terms: $4 d^2$ (the four matrices $W^Q$, $W^K$, $W^V$, $W^O$, each $d \times d$ once the per-head projections are stacked side by side).
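A hedged NumPy sketch of multi-head self-attention under the $d_k = d_v = d/h$ convention follows. It assumes the per-head projections are stored as column blocks of single $d \times d$ matrices, which is one common layout but not the only one; the function names and the sizes n = 6, d = 32, h = 4 are illustrative.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d); W_q, W_k, W_v, W_o: (d, d); h must divide d."""
    n, d = X.shape
    d_head = d // h

    def split(M):
        # (n, d) -> (h, n, d_head): head i gets columns i*d_head:(i+1)*d_head.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ V                           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)       # Concat(head_1..head_h)
    return concat @ W_o

# Illustrative sizes and random (untrained) weights.
rng = np.random.default_rng(0)
n, d, h = 6, 32, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
num_params = sum(W.size for W in (W_q, W_k, W_v, W_o))
print(out.shape, num_params)  # (6, 32) 4096
```

With d = 32 the four matrices hold 4 · 32² = 4096 parameters, matching the $4 d^2$ count, and the total does not depend on the number of heads h.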
Causal self-attention applies a triangular mask to the scores before the softmax, setting entries for positions $j > i$ to $-\infty$, so that position $i$ attends only to positions $\leq i$. This is the autoregressive constraint of language models. Bidirectional self-attention uses no mask, so every position attends to every position.
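The causal mask can be sketched as follows. This is a minimal NumPy illustration that assumes Q and K have already been computed; the function name and the sizes are made up for the example.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights with a causal mask. Q, K: (n, d_k). Returns (n, n)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks future positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Every row keeps its diagonal entry, so the row maximum is always finite.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)  # exp(-inf) = 0 removes the masked entries
    return exp / exp.sum(axis=-1, keepdims=True)

# Illustrative random queries and keys.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))
```

The resulting matrix is lower triangular: the first row puts all its weight on position 1, and each later row spreads its weight over itself and the positions before it. Omitting the `np.where` line gives the bidirectional case.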
Related terms: Attention Mechanism, Multi-Head Attention, Transformer
Discussed in:
- Chapter 13: Attention & Transformers