Query, key and value vectors produce an attention matrix over four tokens.
From Chapter 13: Attention & Transformers
Glossary: attention mechanism, self-attention, softmax, transformer
People: Ashish Vaswani, Dzmitry Bahdanau
References: Vaswani et al., 2017
Transcript
Self-attention is the core operation in a Transformer. It lets every token in a sequence look at every other token.
For each token, the model computes three vectors: a query, a key, and a value. Here they are visualised in two dimensions for the four-word sentence "cat sat on mat".
The attention weight from one token to another is the dot product of the first token's query with the second token's key, scaled by the square root of the key dimension, then normalised with a softmax over all tokens.
When a query points the same way as a key, the dot product is large and the attention weight is high. The matrix on the right shows all sixteen weights at once.
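The scaled dot-product scores and softmax described above can be sketched in a few lines of numpy. The query and key vectors here are made up (random) stand-ins for the four tokens in the example sentence, not values from any real model:

```python
import numpy as np

np.random.seed(0)
d = 2  # vector dimension, matching the 2-D visualisation
tokens = ["cat", "sat", "on", "mat"]

# Hypothetical query and key vectors, one row per token.
Q = np.random.randn(4, d)
K = np.random.randn(4, d)

# Score for (i, j) is query i dotted with key j, scaled by sqrt(d).
scores = Q @ K.T / np.sqrt(d)

# Softmax over each row turns scores into attention weights.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

print(weights.shape)  # (4, 4): the sixteen weights in the matrix
```

Each row of `weights` sums to one, so row i is a probability distribution saying how much token i attends to each of the four tokens.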
When queries align with their own keys, the matrix is mostly diagonal. Each token attends mostly to itself.
If cat needs to attend to mat, perhaps to resolve a pronoun later in the sentence, its query rotates toward mat's key, and the attention pattern shifts.
Finally, the output for each token is a weighted sum of every value vector. Each token gathers information from wherever it chooses to look.
That is one attention head. A Transformer stacks dozens of these in parallel, each learning what to attend to and how to mix it.
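A rough sketch of how several heads run in parallel, each with its own (here randomly initialised, purely illustrative) projection matrices, with the head outputs concatenated back to the model dimension:

```python
import numpy as np

def attention_head(X, d_head, rng):
    # One head: project X to queries/keys/values, attend, return outputs.
    d_model = X.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d_head)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # four tokens, model dimension 8

# Four heads of dimension 2, run independently and concatenated.
heads = [attention_head(X, 2, rng) for _ in range(4)]
out = np.concatenate(heads, axis=1)

print(out.shape)  # (4, 8): back to one vector per token
```

Because each head has its own projections, each can learn a different attention pattern, such as the diagonal and pronoun-resolving patterns the transcript describes.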