Glossary

Attention Mechanism

The Attention Mechanism emerged from a simple observation: when a neural network must produce an output that depends on an input sequence, not all positions in the sequence are equally relevant. Originally introduced by Bahdanau, Cho, and Bengio (2014) for neural machine translation, attention allowed the decoder, at each output step, to compute a weighted combination of all encoder hidden states, focusing on those most relevant to the current output token.

The mechanism works as follows. At each decoder step $t$, compute an alignment score $e_{t,i}$ between the decoder's state $s_t$ and each encoder state $h_i$. Pass these scores through a softmax to obtain attention weights $\alpha_{t,i}$ that sum to 1. The context vector is then the weighted sum $c_t = \sum_i \alpha_{t,i} h_i$. The scoring function can be additive (Bahdanau's concat: $v^T \tanh(W_1 s_t + W_2 h_i)$) or multiplicative (Luong's dot product: $s_t^T h_i$), the latter being cheaper and amenable to efficient batched matrix multiplication.
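The score-softmax-sum pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration of the multiplicative (dot-product) variant, not any particular library's implementation; the function and variable names are chosen here to mirror the notation in the text ($s_t$, $h_i$, $\alpha_{t,i}$, $c_t$).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(s_t, H):
    """Luong-style multiplicative attention for a single decoder step.

    s_t : (d,)   decoder state at step t
    H   : (n, d) encoder hidden states h_1 .. h_n
    Returns the context vector c_t and the attention weights alpha.
    """
    scores = H @ s_t         # e_{t,i} = s_t^T h_i, one score per encoder state
    alpha = softmax(scores)  # attention weights alpha_{t,i}, summing to 1
    c_t = alpha @ H          # c_t = sum_i alpha_{t,i} h_i
    return c_t, alpha
```

Encoder states most aligned with the decoder state receive the largest weights, so $c_t$ is dominated by the most relevant positions; the additive variant differs only in how `scores` is computed.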

Attention transformed sequence modelling. It removed the information bottleneck of fixed-length context vectors, dramatically improving performance on long sequences. It also provided interpretable alignment matrices, making it possible to visualise which source words the model attended to when generating each target word. The query-key-value framing that emerged from this work—later made explicit in the Transformer—would become the organising principle of modern deep learning, powering everything from language models to vision transformers to protein structure prediction.

Related terms: Self-Attention, Transformer, Multi-Head Attention

Discussed in:

Also defined in: Textbook of AI