Glossary

RoPE

Rotary Position Embedding (RoPE), introduced by Jianlin Su et al. in the 2021 paper RoFormer: Enhanced Transformer with Rotary Position Embedding, is a positional encoding scheme for Transformers in which positional information is injected by applying rotations to pairs of embedding dimensions, with rotation angle proportional to position.

The mechanism

For a query or key vector $\mathbf{x} \in \mathbb{R}^d$ at position $m$, RoPE pairs adjacent dimensions $(x_{2i}, x_{2i+1})$ and rotates each pair by angle $m \theta_i$:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},$$

with frequencies $\theta_i = 10000^{-2i/d}$ chosen as a geometric progression, the same base used in the original sinusoidal positional encoding.
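The pairing-and-rotation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel; the function name `rope` and its signature are ours:

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate adjacent dimension pairs (x_{2i}, x_{2i+1}) by angles m * theta_i."""
    d = x.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    # theta_i = base^(-2i/d): one frequency per pair of dimensions
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # 2x2 rotation applied to each pair
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each pair undergoes a pure rotation, the vector's norm is unchanged, and position 0 leaves the vector untouched.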

Equivalently, identifying each pair with a complex number $z_i = x_{2i} + i\, x_{2i+1}$, the rotation is multiplication by $e^{im\theta_i}$. Applied separately to query $\mathbf{q}_m$ and key $\mathbf{k}_n$, the dot product becomes

$$\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle = \sum_i |z_i^q||z_i^k|\cos\big((m-n)\theta_i + \phi_i^q - \phi_i^k\big),$$

where $\phi_i^q$ and $\phi_i^k$ are the phases of the query and key pairs $z_i^q$ and $z_i^k$. The attention score therefore depends only on the relative position $m - n$, not on the absolute positions.
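The relative-position property can be verified numerically using the complex-number view. The sketch below (function name `rope_complex` is ours) checks that shifting both positions by the same offset leaves the score unchanged:

```python
import numpy as np

def rope_complex(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """RoPE via complex multiplication: pair dims into z_i, multiply by e^{i m theta_i}."""
    d = x.shape[-1]
    z = x[0::2] + 1j * x[1::2]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    z_rot = z * np.exp(1j * m * theta)
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Same relative offset (m - n = 2) at two different absolute positions:
s1 = rope_complex(q, 3) @ rope_complex(k, 1)
s2 = rope_complex(q, 10) @ rope_complex(k, 8)
assert np.isclose(s1, s2)  # score depends only on m - n
```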

Advantages over earlier schemes

  • Relative-position aware without explicit relative-position bias matrices (as in T5 or Shaw et al.).
  • Extrapolation-friendly in principle: positions are encoded by fixed-frequency rotations rather than learned per-position vectors, though extrapolation well beyond the training length still requires the scaling techniques discussed under long-context variants.
  • Norm-preserving: rotations are isometries, so vector magnitudes are unchanged.
  • Parameter-free: no additional learnable parameters; frequencies are fixed.
  • Cheap and parallel: applied with element-wise multiplications and additions, and easily fused into attention kernels.

Adoption

RoPE has become the de facto standard positional encoding in modern large language models. LLaMA, LLaMA 2, LLaMA 3, PaLM, Mistral, Mixtral, GPT-NeoX, Falcon, Qwen, DeepSeek, Gemma and most other major open and closed models use RoPE. Sinusoidal absolute embeddings (the original Transformer) and learned absolute embeddings (BERT, GPT-2) are now mostly relegated to historical or specialised settings.

Long-context variants

Extending RoPE to context lengths beyond those seen in training requires careful frequency scaling, because unseen positions produce rotation angles outside the range the model was trained on, degrading attention scores:

  • Position interpolation (PI) (Chen et al. 2023): linearly compress positions so they fit within the trained range.
  • NTK-aware RoPE: a frequency-dependent scaling that better preserves high-frequency information.
  • YaRN (Peng et al. 2023): combines NTK scaling with attention temperature adjustment for state-of-the-art context-length extrapolation.
  • Dynamic NTK and LongRoPE: adapt scaling on the fly or via search.

Modern long-context models routinely handle 128 K to 1 M token contexts using RoPE-based positional encoding with appropriate scaling, sliding-window attention, and continued pre-training on long sequences.

Related terms: Positional Encoding, Transformer, Attention Mechanism, GPT