Positional Encoding injects awareness of token order into the Transformer architecture. Self-attention is permutation-equivariant: if you rearrange the input tokens, the outputs rearrange the same way. For sequence data where order carries essential meaning, this is catastrophic. Positional encoding adds position-dependent information to the input embeddings so the model can distinguish "dog bites man" from "man bites dog."
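The permutation-equivariance claim can be checked directly. Below is a minimal NumPy sketch of single-head self-attention (with identity query/key/value projections, an assumption for brevity): permuting the input rows permutes the output rows identically, so without positional information the model cannot tell the orderings apart.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (toy sketch)."""
    scores = X @ X.T / np.sqrt(X.shape[1])          # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ X

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))                         # 5 tokens, dim 4
P = np.eye(5)[[2, 0, 4, 1, 3]]                      # a permutation matrix

# Permuting the inputs permutes the outputs the same way:
assert np.allclose(self_attention(P @ X), P @ self_attention(X))
```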
The original Transformer used sinusoidal positional encodings: $\text{PE}(\text{pos}, 2i) = \sin(\text{pos}/10000^{2i/d_{\text{model}}})$ and $\text{PE}(\text{pos}, 2i+1) = \cos(\text{pos}/10000^{2i/d_{\text{model}}})$. These give each position a unique encoding, and for any fixed offset $k$, $\text{PE}(\text{pos}+k)$ is a linear function of $\text{PE}(\text{pos})$, potentially allowing the model to learn relative-position attention. Alternatively, learned positional embeddings (used in BERT and GPT-2) are simply a lookup table of vectors, one per position. They perform comparably on sequences within the training length but do not extrapolate beyond it.
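The sinusoidal scheme above can be sketched in a few lines of NumPy. Even dimensions get the sine, odd dimensions the cosine, and each pair $(2i, 2i+1)$ shares the frequency $1/10000^{2i/d_{\text{model}}}$ (function name and arguments here are illustrative, not from any particular library):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """Sinusoidal positional encodings, as in the original Transformer.
    Returns an array of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / base ** (2 * i / d_model)        # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims
    pe[:, 1::2] = np.cos(angle)                    # odd dims
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)
# In a Transformer, each row would be added to the token embedding at that position.
```

Note that row 0 is $(\sin 0, \cos 0, \ldots) = (0, 1, 0, 1, \ldots)$, and all entries are bounded in $[-1, 1]$, matching the scale of typical embeddings.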
Relative positional encodings (Shaw et al., 2018; Transformer-XL) modify attention logits directly with a function of relative distance $i - j$. ALiBi (Press et al., 2022) subtracts a linear penalty proportional to distance from attention logits—simple and extrapolates well. Rotary Position Embedding (RoPE) (Su et al., 2021) rotates query and key vectors by a position-dependent angle, so their inner product depends only on relative position. RoPE is elegant, extrapolates moderately well with interpolation techniques, and has become the dominant choice in modern LLMs including LLaMA, PaLM, and GPT-NeoX.
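RoPE's key property, that the rotated query-key inner product depends only on relative position, can be verified numerically. The sketch below (a simplified single-vector version, not a production implementation) rotates each pair of dimensions by a position-dependent angle; shifting both positions by the same amount leaves the inner product unchanged:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to vector x at integer position pos.
    Pairs of dimensions (2i, 2i+1) are rotated by pos / base**(2i/d)."""
    d = x.shape[0]
    theta = pos / base ** (2 * np.arange(d // 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at different absolute positions gives the same score:
a = rope_rotate(q, 5) @ rope_rotate(k, 2)
b = rope_rotate(q, 10) @ rope_rotate(k, 7)
assert np.allclose(a, b)
```

This is why RoPE behaves like a relative scheme despite being applied to queries and keys individually: each pair's contribution to the dot product is a rotation by the angle difference, which depends only on $m - n$.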
Related terms: Transformer, Self-Attention
Discussed in:
- Chapter 13: Attention & Transformers — Positional Encoding
Also defined in: Textbook of AI