Rotary Position Embedding (RoPE), introduced by Jianlin Su et al. in the 2021 paper *RoFormer: Enhanced Transformer with Rotary Position Embedding*, is a positional encoding scheme for Transformers that injects positional information by rotating pairs of embedding dimensions, with the rotation angle proportional to the token's position.
The mechanism
For a query or key vector $\mathbf{x} \in \mathbb{R}^d$ at position $m$, RoPE pairs adjacent dimensions $(x_{2i}, x_{2i+1})$ and rotates each pair by angle $m \theta_i$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},$$
with frequencies $\theta_i = 10000^{-2i/d}$ for $i = 0, \dots, d/2 - 1$, a geometric progression using the same base as the original sinusoidal positional encoding.
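As a concrete illustration, here is a minimal NumPy sketch of this per-pair rotation. The function name and signature are illustrative, not from the paper or any particular library:

```python
import numpy as np

def rope_rotate(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate adjacent pairs (x_{2i}, x_{2i+1}) of x (shape (d,), d even) by m * theta_i."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # theta_i = 10000^{-2i/d}
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = cos * x_even - sin * x_odd          # first row of the rotation matrix
    out[1::2] = sin * x_even + cos * x_odd          # second row
    return out
```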
Equivalently, identifying each pair with a complex number $z_i = x_{2i} + i\, x_{2i+1}$, the rotation is multiplication by $e^{im\theta_i}$. Applied separately to query $\mathbf{q}_m$ and key $\mathbf{k}_n$, the dot product becomes
$$\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle = \sum_i |z_i^q||z_i^k|\cos\big((m-n)\theta_i + \phi_i\big),$$
where $\phi_i = \arg z_i^q - \arg z_i^k$ is the phase difference between the query and key pairs. The attention score therefore depends only on the relative position $m - n$, not on the absolute positions.
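This relative-position property is easy to verify numerically using the complex-number view. The following is a self-contained sketch, not production code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)

def rope_complex(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Multiply each pair z_i = x_{2i} + i*x_{2i+1} by e^{i m theta_i}."""
    z = x[0::2] + 1j * x[1::2]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    return z * np.exp(1j * m * theta)

def score(m: int, n: int) -> float:
    """<R_m q, R_n k>, recovered as Re(sum_i z_i^q * conj(z_i^k)) after rotation."""
    return float(np.real(np.sum(rope_complex(q, m) * np.conj(rope_complex(k, n)))))

# Same offset m - n = 3 at two different absolute positions: identical scores.
print(np.isclose(score(5, 2), score(105, 102)))  # True
```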
Advantages over earlier schemes
- Relative-position aware without explicit relative-position bias matrices (as in T5 or Shaw et al.).
- More amenable to context-length extension than learned absolute embeddings, although extrapolation well beyond the training length still requires the frequency-scaling techniques described below.
- Norm-preserving: rotations are isometries, so vector magnitudes are unchanged.
- Parameter-free: no additional learnable parameters; frequencies are fixed.
- Cheap and parallel: implemented as element-wise multiplications with precomputed sines and cosines, easily fused into attention kernels (see the sketch after this list).
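As an example of the element-wise formulation, many open-source codebases (e.g. LLaMA-style implementations) use the equivalent "rotate-half" trick, which pairs dimension $j$ with $j + d/2$ instead of adjacent dimensions. This NumPy sketch mirrors that pattern; the names are illustrative:

```python
import numpy as np

def rotate_half(x: np.ndarray) -> np.ndarray:
    """(x1 | x2) -> (-x2 | x1), splitting the last axis in half."""
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """x: (seq, d) queries or keys; pos: (seq,) integer positions."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)       # (d/2,)
    angles = pos[:, None] * theta[None, :]               # (seq, d/2)
    cos = np.concatenate([np.cos(angles)] * 2, axis=-1)  # (seq, d)
    sin = np.concatenate([np.sin(angles)] * 2, axis=-1)
    # Rotates the pair (x_j, x_{j+d/2}) by angles[:, j] -- a fixed
    # permutation of the adjacent-pair convention shown earlier.
    return x * cos + rotate_half(x) * sin
```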
Adoption
RoPE has become the de facto standard positional encoding in modern large language models. LLaMA, LLaMA 2, LLaMA 3, PaLM, Mistral, Mixtral, GPT-NeoX, Falcon, Qwen, DeepSeek, Gemma and most other major open and closed models use RoPE. Sinusoidal absolute embeddings (the original Transformer) and learned absolute embeddings (BERT, GPT-2) are now mostly relegated to historical or specialised settings.
Long-context variants
Extending RoPE to context lengths beyond those seen in training requires careful frequency scaling: at extrapolated positions, the low-frequency dimensions produce rotation angles never seen during training, while naively compressing positions blurs the high-frequency dimensions that encode fine-grained local order. The main variants are:
- Position interpolation (PI) (Chen et al. 2023): linearly compress positions so the extended context fits within the trained range.
- NTK-aware RoPE: a frequency-dependent rescaling of the base that better preserves high-frequency information (sketched, with PI, after this list).
- YaRN (Peng et al. 2023): combines NTK-style scaling with an attention temperature adjustment, reporting strong context-length extrapolation.
- Dynamic NTK and LongRoPE: adapt scaling on the fly or via search.
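PI and NTK-aware scaling both amount to small changes in how the frequencies (or, equivalently, the positions) are computed. Below is a hedged sketch: `rope_frequencies`, the argument names, and the default `factor` are illustrative, and exact scaling conventions differ between implementations (YaRN, for example, interpolates per-frequency rather than uniformly):

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0,
                     scaling=None, factor: float = 4.0) -> np.ndarray:
    """Return theta_i, optionally rescaled to cover a context `factor` times longer.

    scaling=None  : plain RoPE frequencies.
    scaling="pi"  : position interpolation -- equivalent to mapping m -> m / factor.
    scaling="ntk" : NTK-aware -- raise the base so low frequencies stretch a lot
                    while high frequencies barely move.
    """
    if scaling == "ntk":
        base = base * factor ** (d / (d - 2))  # one common NTK-aware base adjustment
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    if scaling == "pi":
        theta = theta / factor                 # scaling theta == compressing positions
    return theta
```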
Modern long-context models routinely handle 128K- to 1M-token contexts using RoPE-based positional encoding with appropriate scaling, sliding-window attention, and continued pre-training on long sequences.
Related terms: Positional Encoding, Transformer, Attention Mechanism, GPT
Discussed in:
- Chapter 11: CNNs, The Transformer Architecture