Positional Encoding injects awareness of token order into the Transformer architecture. Self-attention is permutation-equivariant: if you rearrange the input tokens, the outputs rearrange the same way. For sequence data where order carries essential meaning, this is catastrophic. Positional encoding adds position-dependent information to the input embeddings so the model can distinguish "dog bites man" from "man bites dog."
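The permutation-equivariance claim can be checked directly. Below is a minimal NumPy sketch of single-head self-attention (with identity query/key/value projections, an assumption for brevity): permuting the input rows permutes the output rows identically, so without positional information the model cannot tell the orderings apart.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (toy sketch)."""
    scores = X @ X.T / np.sqrt(X.shape[1])          # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # row-wise softmax
    return w @ X

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))                         # 5 tokens, dim 4
P = np.eye(5)[[2, 0, 4, 1, 3]]                      # a permutation matrix

# Permuting the inputs permutes the outputs the same way:
assert np.allclose(self_attention(P @ X), P @ self_attention(X))
```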
The original Transformer used sinusoidal positional encodings: $\text{PE}(\text{pos}, 2i) = \sin(\text{pos}/10000^{2i/d_{\text{model}}})$ and $\text{PE}(\text{pos}, 2i+1) = \cos(\text{pos}/10000^{2i/d_{\text{model}}})$. These give each position a unique encoding, and for any fixed offset $k$, $\text{PE}(\text{pos}+k)$ is a linear function of $\text{PE}(\text{pos})$, potentially allowing the model to learn relative-position attention. Alternatively, learned positional embeddings (used in BERT and GPT-2) are simply a lookup table of vectors, one per position. They perform comparably on sequences within the training length but do not extrapolate beyond it.
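The sinusoidal scheme above can be sketched in a few lines of NumPy. Even dimensions get the sine, odd dimensions the cosine, and each pair $(2i, 2i+1)$ shares the frequency $1/10000^{2i/d_{\text{model}}}$ (function name and arguments here are illustrative, not from any particular library):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    """Sinusoidal positional encodings, as in the original Transformer.
    Returns an array of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angle = pos / base ** (2 * i / d_model)        # (seq_len, d_model/2)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dims
    pe[:, 1::2] = np.cos(angle)                    # odd dims
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=64)
# In a Transformer, each row would be added to the token embedding at that position.
```

Note that row 0 is $(\sin 0, \cos 0, \ldots) = (0, 1, 0, 1, \ldots)$, and all entries are bounded in $[-1, 1]$, matching the scale of typical embeddings.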
Relative positional encodings (Shaw et al., 2018; Transformer-XL) modify attention logits directly with a function of relative distance $i - j$. ALiBi (Press et al., 2022) subtracts a linear penalty proportional to distance from attention logits—simple and extrapolates well. Rotary Position Embedding (RoPE) (Su et al., 2021) rotates query and key vectors by a position-dependent angle, so their inner product depends only on relative position. RoPE is elegant, extrapolates moderately well with interpolation techniques, and has become the dominant choice in modern LLMs including LLaMA, PaLM, and GPT-NeoX.
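RoPE's key property, that the rotated query-key inner product depends only on relative position, can be verified numerically. The sketch below (a simplified single-vector version, not a production implementation) rotates each pair of dimensions by a position-dependent angle; shifting both positions by the same amount leaves the inner product unchanged:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a RoPE-style rotation to vector x at integer position pos.
    Pairs of dimensions (2i, 2i+1) are rotated by pos / base**(2i/d)."""
    d = x.shape[0]
    theta = pos / base ** (2 * np.arange(d // 2) / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at different absolute positions gives the same score:
a = rope_rotate(q, 5) @ rope_rotate(k, 2)
b = rope_rotate(q, 10) @ rope_rotate(k, 7)
assert np.allclose(a, b)
```

This is why RoPE behaves like a relative scheme despite being applied to queries and keys individually: each pair's contribution to the dot product is a rotation by the angle difference, which depends only on $m - n$.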
Related terms: Transformer, Self-Attention
Discussed in:
- Chapter 13: Attention & Transformers — Positional Encoding
Also defined in: Textbook of AI