Also known as: attention
An attention mechanism is a weighted aggregation operation that lets a neural network model focus on different parts of its input depending on the current task. Originally introduced by Bahdanau, Cho and Bengio (2014) for neural machine translation, attention computes a query-dependent weighted sum of value vectors, with the weights given by a similarity function between the query and corresponding key vectors.
In its modern form (Vaswani et al., 2017), attention is computed as Attention(Q, K, V) = softmax(QKᵀ/√d_k) V, where Q is a matrix of queries, K of keys, V of values, and d_k the key dimension. The softmax over key-query similarities produces a probability distribution; the values are then averaged according to those probabilities.
Self-attention is the special case where Q, K and V all come from the same input, each position attends to every other. Cross-attention uses queries from one stream and keys/values from another (the standard encoder-decoder attention). Multi-head attention runs several attention computations in parallel with different learned projections, allowing the model to attend to different subspaces simultaneously.
Attention is computationally O(n²) in sequence length n, quadratic in time and memory. FlashAttention (Dao et al., 2022) reduces the memory cost via tile-based computation; linear-attention variants (Linformer, Performer, Mamba) reduce the time complexity to O(n) at the cost of expressiveness.
Attention is now the dominant primitive of AI: every modern language model, every Vision Transformer, AlphaFold's structure module and many others are built around it. The 2017 Transformer paper Attention Is All You Need gave the architecture its name and its modern formulation.
Mathematics
Modern scaled dot-product attention is
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
where $Q \in \mathbb{R}^{n_q \times d_k}$ is a matrix of $n_q$ queries, $K \in \mathbb{R}^{n_k \times d_k}$ keys and $V \in \mathbb{R}^{n_k \times d_v}$ values. The factor $1/\sqrt{d_k}$ keeps logits at unit scale regardless of dimension; without it, large $d_k$ pushes the softmax into saturated regions and gradients vanish.
Multi-head attention projects inputs through $h$ different learned matrices and runs attention in parallel:
$$\mathrm{head}_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$
where $W_i^Q, W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d}$, and typically $d_k = d_v = d/h$.
Causal masking for autoregressive models adds a mask $M$ before the softmax:
$$M_{ij} = \begin{cases} 0 & j \leq i \\ -\infty & j > i \end{cases}$$
$$\mathrm{Attention}^{\mathrm{causal}}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V$$
The $-\infty$ entries become 0 after softmax, blocking each position from attending to future positions.
Computational cost: $O(n^2 d)$ time and $O(n^2)$ memory in sequence length $n$. FlashAttention (Dao 2022) reduces memory to $O(n)$ via tile-based computation while computing exactly the same output. Linear-attention variants (Linformer, Performer, Mamba) reduce time to $O(n)$ at the cost of expressiveness.
Interactive
Video
Related terms: Transformer, Self-Attention, Multi-Head Attention, dzmitry-bahdanau, ashish-vaswani
Discussed in:
- Chapter 13: Attention & Transformers, Attention and Transformers