Glossary

Self-Attention

Self-Attention generalises the attention mechanism by allowing a single sequence to attend to itself. Every position simultaneously acts as query, key, and value, enabling the model to capture dependencies between any two positions regardless of their distance. This is a profound departure from recurrence, where information between distant positions must be relayed through a chain of intermediate hidden states.

The computation is elegant. Given an input sequence of $n$ vectors of dimension $d_{\text{model}}$, stacked into a matrix $X$, project it into queries, keys, and values: $Q = XW^Q$, $K = XW^K$, $V = XW^V$. The attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The scaling by $\sqrt{d_k}$ prevents dot products from growing too large, which would push the softmax into regions of tiny gradients.
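The formula above can be sketched directly in NumPy. This is a minimal, unbatched illustration, not a production implementation; the function and variable names are chosen for this example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) similarity logits
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v) output

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that every position contributes to $Q$, $K$, and $V$ simultaneously, so the $(n, n)$ weight matrix relates each position to every other in a single matrix multiplication.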

A key distinction: bidirectional self-attention lets every position attend to every other, suitable for encoding tasks (BERT). Causal self-attention masks future positions so that position $i$ only attends to positions $j \leq i$, required for autoregressive generation (GPT).

Self-attention's time and memory complexity are $O(n^2)$ in sequence length—negligible for short sequences but prohibitive for very long ones, motivating research into efficient attention variants. Despite the quadratic cost, self-attention has proven extraordinarily effective across domains and is the defining computational primitive of modern deep learning.
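The causal variant differs from the bidirectional one only by a mask applied to the score matrix before the softmax. A minimal NumPy sketch (names are illustrative, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked -inf entries become weight 0.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Mask strictly-upper-triangular entries: position i may only see j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
# Position 0 can only attend to itself, so out[0] equals (X @ Wv)[0].
```

Setting masked scores to $-\infty$ makes their softmax weights exactly zero, which is why the first position's output reduces to its own value vector.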

Related terms: Attention Mechanism, Transformer, Multi-Head Attention

Also defined in: Textbook of AI