Attention Weights are the softmax-normalised scores produced by the attention mechanism that determine how much each input position contributes to each output position. For a sequence of length $n$, self-attention produces an $n \times n$ matrix of attention weights in which row $i$ gives the weight position $i$ places on every position in the sequence. The weights in each row sum to 1, forming a probability distribution over input positions.
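The row-wise probability property can be seen directly by applying softmax to an arbitrary $n \times n$ score matrix (a minimal NumPy sketch; the random scores stand in for whatever the attention mechanism computes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
scores = rng.normal(size=(n, n))  # raw scores for a length-5 sequence

# Softmax over each row turns scores into attention weights
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(weights.shape)         # (5, 5): one row of weights per output position
print(weights.sum(axis=-1))  # each row sums to 1
```

Because each row is normalised independently, every output position gets its own probability distribution over the inputs.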
Attention weights are computed by taking the dot product of queries and keys, dividing by $\sqrt{d_k}$ to prevent softmax saturation, and applying softmax. In practice, many positions receive near-zero attention while a few dominate, giving the mechanism its characteristic "attending to the important parts" behaviour. Causal (autoregressive) attention masks future positions before the softmax, so each position can attend only to itself and earlier positions; information therefore flows only from earlier to later positions.
Attention weights provide some degree of interpretability: by visualising the attention matrix, researchers can see which inputs each output depends on most heavily. Early analyses of machine translation models revealed that attention roughly learns word alignments between source and target languages. Later analyses of BERT and GPT found heads specialised for syntactic dependencies, coreference, and position-based patterns. However, there is ongoing debate about whether attention weights truly reflect causal contribution or merely correlate with it: experiments show that models can produce similar outputs with very different attention patterns, suggesting attention visualisation is a useful but imperfect window into model reasoning.
Related terms: Self-Attention, Transformer, Explainable AI
Discussed in:
- Chapter 13: Attention & Transformers — Self-Attention
Also defined in: Textbook of AI