13.5 Positional encoding

Self-attention has a strange property that takes most newcomers by surprise. If you shuffle the input tokens, the output tokens shuffle in exactly the same way, but nothing else changes. The mathematics treats the input as a bag of tokens, an unordered set, not a sequence. Every token looks at every other token through the same query-key-value mechanism, and that mechanism never asks who came first. From the point of view of attention, the sentences "the cat sat on the mat" and "mat the on sat cat the" produce the same attention pattern, just relabelled. For a model that is supposed to understand language, that is fatal. Word order is half of what makes language meaningful. Dog bites man is a quiet day; man bites dog is a news story.

Positional encoding is the fix. It is a way of stamping each token with a tag that says where it sits in the sequence, so that attention can tell first from fifth from fiftieth. The tag is a vector, the same shape as the token embedding, and it is added to the embedding before attention sees it. Once that addition is done, two copies of the same word at different positions look slightly different to attention, and attention has the information it needs to use word order.

This section is the bridge between the attention mechanics of §§13.2-13.4 and the full transformer block of §13.6. Sections 13.2-13.4 built the engine; this section tightens the bolt that stops the engine spinning the wheels in any direction it likes. Once we have positional encodings, we have everything needed to assemble a transformer.

Symbols Used Here

| Symbol | Meaning |
|---|---|
| $\mathbf{x}_t$ | token embedding at position $t$ |
| $\mathbf{p}_t$ | positional embedding at position $t$ |
| $d$ | embedding dimension |
| $T$ | sequence length |

The order problem

Imagine you are at a dinner party and someone gives you a stack of cards, each with one word on it. They tell you the cards belong to a sentence but they do not tell you the order. You can read each card, you can see which words are present, but you cannot reconstruct the sentence. Bites, dog, man: that could be a peaceful day or an emergency.

Self-attention is in the same position. It receives a set of token embeddings, one per word. Each embedding is a vector of, say, 768 numbers, encoding the meaning of that word. Attention computes a weighted sum over all the other tokens for each token, where the weights come from query-key dot products. The dot products do not know which token is at the start of the sentence and which is at the end. They only know what the token vectors look like.

Mathematically we say self-attention is permutation-equivariant. If you take the input matrix $\mathbf{X}$ with rows $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$, and you swap rows 2 and 5 to make $\mathbf{X}'$, then the output is the same as the original output with rows 2 and 5 swapped. Nothing else changes. The mechanism is blind to the absolute order of its inputs.
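To make the claim concrete, here is a minimal numpy sketch (a toy single-head attention of my own, not any library's implementation) that runs self-attention on a random sequence, permutes the input, and checks that the output rows permute in exactly the same way.

```python
# Toy single-head self-attention: permuting the input permutes the output.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                                  # sequence length, embedding dimension
X = rng.normal(size=(T, d))                  # token embeddings, one row per token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                      # (T, T) dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V

perm = rng.permutation(T)                              # shuffle the token order
same = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
print(same)   # True: attention is blind to the original order
```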

The fix is to add a position-dependent vector $\mathbf{p}_t$ to each token embedding, so that the input to attention becomes $\mathbf{x}_t + \mathbf{p}_t$ rather than $\mathbf{x}_t$ alone. If $\mathbf{p}_1 \ne \mathbf{p}_2$, the token at position 1 looks different to the token at position 2, even if it is the same word. Attention can then use the difference to discriminate between word orders. The whole question of positional encoding boils down to: what should $\mathbf{p}_t$ be?

Sinusoidal positional encoding

The original transformer paper, Vaswani et al. (2017), proposed a fixed answer based on sinusoids. For position $t$ and even dimension index $2i$ they set

$$\mathrm{PE}(t, 2i) = \sin\!\left(\frac{t}{10000^{2i/d}}\right), \qquad \mathrm{PE}(t, 2i+1) = \cos\!\left(\frac{t}{10000^{2i/d}}\right).$$

That looks fearsome until you unpack it. Each pair of dimensions $(2i, 2i+1)$ is a sinusoid with its own frequency. Low-index dimensions oscillate fast: they change visibly as you step from one position to the next. High-index dimensions oscillate slowly: they change only over hundreds or thousands of positions. Together, the bank of sinusoids gives every position $t$ a unique fingerprint of $d$ numbers, with both fast-twitch and slow-twitch components.

Three properties make this attractive. First, every position gets a unique vector, so attention can distinguish them. Second, the encoding at one position can be mapped to the encoding at any other by a fixed linear transformation that depends only on the offset between the two positions, not on either absolute position. This is just the angle-addition formula in disguise: $\sin(\omega(t+k)) = \sin(\omega t)\cos(\omega k) + \cos(\omega t)\sin(\omega k)$. The model can therefore learn to attend by relative offset by learning the appropriate small linear transformation. Third, the formula extends to any position, including positions beyond anything the model saw in training, because sine and cosine are defined everywhere.

Let us work out a tiny case. Take $d = 4$ and $t = 5$. The two frequencies are $\omega_0 = 1/10000^0 = 1$ and $\omega_1 = 1/10000^{0.5} = 1/100 = 0.01$. So $\mathrm{PE}(5, 0) = \sin(5) \approx -0.959$ and $\mathrm{PE}(5, 1) = \cos(5) \approx 0.284$. The first pair has been spun around fast: five radians is roughly four-fifths of a full revolution. The second pair, in contrast, gives $\mathrm{PE}(5, 2) = \sin(0.05) \approx 0.0500$ and $\mathrm{PE}(5, 3) = \cos(0.05) \approx 0.9988$. Five times a tiny frequency is still tiny; the second pair has barely budged from its starting point. So the encoding for position 5 is approximately $(-0.959, 0.284, 0.0500, 0.9988)$. Different frequency bands carry position information at different scales, so the model has both fine-grained "you are at position 5, not 4" signals and coarse-grained "you are early in the sequence, not deep inside it" signals from a single vector.
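A short sketch makes the formula tangible. The function below (name and layout are mine, not from any library) builds the encoding for a single position and reproduces the $d = 4$, $t = 5$ numbers above.

```python
# Sinusoidal positional encoding for one position t in a d-dimensional model.
import numpy as np

def sinusoidal_pe(t, d, base=10000.0):
    pe = np.zeros(d)
    for i in range(d // 2):
        freq = 1.0 / base ** (2 * i / d)      # geometric ladder of frequencies
        pe[2 * i] = np.sin(t * freq)          # even index: sine
        pe[2 * i + 1] = np.cos(t * freq)      # odd index: cosine
    return pe

print(np.round(sinusoidal_pe(5, 4), 4))
# [-0.9589  0.2837  0.05    0.9988]  -- matches the worked example above
```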

In practice the original transformer adds $\mathrm{PE}(t)$ directly to the token embedding before the first attention layer. After that the position information rides along through every layer, getting transformed and rearranged but never erased.

Learned positional embeddings

A simpler alternative is just to learn the position vectors. Instead of computing $\mathbf{p}_t$ from a formula, set up a lookup table $\mathbf{P} \in \mathbb{R}^{T_{\max} \times d}$, where $T_{\max}$ is the longest sequence the model will ever see in training, and let the optimiser fill in each row by gradient descent like any other parameter. This is the approach used by BERT, GPT-1, GPT-2 and GPT-3.

Learned embeddings are attractively simple. There is no hand-crafted formula to derive, no choice of base frequency, no theoretical argument about angle-addition. You just declare a table, initialise it randomly, and let backpropagation do its job. Within the training range, learned embeddings perform on par with sinusoidal encodings, and sometimes a little better because the model has tailored each position vector to the specific patterns in its training data.

The fatal flaw appears when you try to run the model on a longer sequence than $T_{\max}$. Suppose you trained at 1024 tokens of context and now want to run at 2048. There is no row of $\mathbf{P}$ for position 1100, position 1500 or position 2000. The lookup is undefined. Pragmatic options are to truncate the input back to the training length, to extend the table with random or interpolated rows, or to retrain from scratch at the new length, but none of these is satisfying. Learned absolute embeddings simply have no opinion about positions they did not see.
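A toy sketch of the failure mode, with a plain numpy array standing in for a real learned embedding layer and all names illustrative: the table has rows only up to $T_{\max}$, so a later position has nothing to look up.

```python
# Learned absolute position table: defined only for positions 0..T_max-1.
import numpy as np

rng = np.random.default_rng(0)
T_max, d = 1024, 768
P = rng.normal(scale=0.02, size=(T_max, d))    # learned table, one row per position

pe_in_range = P[np.arange(1024)]               # fine: every position has a row
try:
    pe_too_long = P[np.arange(2048)]           # positions 1024..2047 have no rows
except IndexError as err:
    print("lookup undefined past T_max:", err)
```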

That limitation matters a lot in 2026, when the most important commercial pressure on language models is long context, running at 32K, 128K, even 1M tokens of input. A scheme that cannot extrapolate beyond its training-time maximum is a dead end for those applications, which is why the field has largely moved on to relative or rotary schemes.

Rotary positional embeddings (RoPE)

RoPE, introduced by Su et al. (2021), is the dominant choice in modern open-weight large language models. LLaMA, Mistral, Qwen, DeepSeek, GLM, PaLM, Yi and many others use RoPE. The idea is elegant: instead of adding a position vector to each token embedding, rotate the query and key vectors by an angle that depends on their position. If both queries and keys are rotated by their respective angles, then the dot product between them depends only on the difference of those angles, which is the relative position. Absolute position cancels out by construction.

Concretely, take a query vector at position $i$ and split it into pairs of components: $(q_0, q_1), (q_2, q_3), \dots$. Each pair is treated as a 2-D vector and is rotated by an angle $i \omega_k$, where $\omega_k$ is a frequency picked from the same geometric ladder as the sinusoidal scheme: fast frequencies for low pairs, slow frequencies for high pairs. The rotation is the standard 2-D rotation matrix. Apply the same idea to keys at position $j$, with rotations $j \omega_k$. Now compute the dot product. Because rotations compose by adding their angles, and because the dot product of two rotated vectors depends only on the angle between them, $\mathbf{q}_i^\top \mathbf{k}_j$ depends on the token contents and the offset $j - i$, not on the absolute positions. Translate the entire sequence five tokens to the right and every dot product is unchanged.
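The relative-position property is easy to check numerically. The sketch below (a bare-bones RoPE, not any particular model's optimised implementation) rotates a query and a key at two different pairs of positions with the same offset and confirms that the dot product is identical.

```python
# Bare-bones RoPE: rotate 2-D pairs of q and k by position-dependent angles.
import numpy as np

def rope(v, pos, base=10000.0):
    d = v.shape[-1]
    out = v.astype(float).copy()
    for k in range(d // 2):
        theta = pos / base ** (2 * k / d)        # angle for this pair of dims
        c, s = np.cos(theta), np.sin(theta)
        x, y = v[2 * k], v[2 * k + 1]
        out[2 * k] = c * x - s * y               # standard 2-D rotation
        out[2 * k + 1] = s * x + c * y
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

score_a = rope(q, 3) @ rope(k, 7)      # positions (3, 7): offset 4
score_b = rope(q, 53) @ rope(k, 57)    # positions (53, 57): same offset 4
print(np.isclose(score_a, score_b))    # True: absolute position cancels out
```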

Three things make RoPE the workhorse of modern LLMs. First, it is relative by construction, which matches how language actually works: the relationship between a verb and its object depends on how far apart they are, not on whether they are at character 100 or character 100,000. Second, it does not change the magnitude of queries or keys, only their direction, which keeps the attention scores well-conditioned. Third, it extrapolates much better than learned absolute encodings. Rotations are well-defined for any angle, including angles larger than anything seen in training, so a model trained at 8K context can be pushed to 32K, 128K or beyond with simple frequency-rescaling tricks like NTK-aware interpolation and YaRN. The active research frontier in long context, getting a model trained at one length to behave well at twenty times that length, almost universally assumes RoPE.

ALiBi (attention with linear biases)

ALiBi, by Press et al. (2022), goes one step simpler. It does not encode position into the embeddings or the queries or the keys at all. It adds a linear penalty directly to the attention scores. Specifically,

$$\operatorname{score}(i, j) = \mathbf{q}_i^\top \mathbf{k}_j - m_h \,(i - j), \qquad j \le i,$$

where $m_h > 0$ is a fixed (not learned) slope that depends on which attention head we are in. Different heads get different slopes, geometrically spaced, so the model has heads that focus tightly on nearby tokens (steep slope) alongside heads that look further afield (gentle slope). Faraway tokens receive large negative offsets to their scores, in effect a soft mask whose strength grows with distance. There are no position embeddings to learn, no rotations to apply, no parameters at all beyond the slope schedule. Despite that, ALiBi extrapolates startlingly well to lengths it never saw in training, often outperforming sinusoidal and learned schemes on long-context evaluations, and is competitive with RoPE in many settings.
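A sketch of the bias matrix, assuming the geometric slope schedule described above (exact constants vary between implementations):

```python
# ALiBi-style linear distance penalty, one slope per attention head.
import numpy as np

def alibi_bias(T, n_heads):
    # geometrically spaced slopes: head 0 is the steepest
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    i = np.arange(T)[:, None]                  # query positions
    j = np.arange(T)[None, :]                  # key positions
    distance = np.maximum(i - j, 0)            # how far back the key sits;
                                               # future keys get 0 and are left
                                               # to the causal mask
    return -slopes[:, None, None] * distance   # (n_heads, T, T), added to scores

bias = alibi_bias(T=6, n_heads=4)
print(bias[0])   # steepest head: nearby keys barely penalised, distant keys heavily
```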

Other variants

Relative position encoding, introduced by Shaw et al. (2018), modulates the keys (and sometimes the values) of attention by a learned vector that depends on the offset $i - j$ between query and key, with one vector per (clipped) offset. T5 and several multilingual models use a simplified variant that adds a learned scalar bias per bucketed offset directly to the attention scores. Conceptually these schemes are a halfway house between absolute encodings and RoPE: the model sees relative offsets explicitly, but as additive terms rather than as rotations.
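As a rough sketch of the Shaw-style variant (the clipping window and names here are illustrative), each clipped offset indexes a learned vector that is added to the key before the dot product:

```python
# Shaw-style relative position encoding: one learned vector per clipped offset.
import numpy as np

rng = np.random.default_rng(0)
d, max_offset = 8, 4
A = rng.normal(size=(2 * max_offset + 1, d))   # learned vectors, one per offset

def rel_score(q_i, k_j, i, j):
    idx = np.clip(j - i, -max_offset, max_offset) + max_offset   # index into A
    return q_i @ (k_j + A[idx])

q, k = rng.normal(size=d), rng.normal(size=d)
print(rel_score(q, k, i=10, j=12))   # uses the learned vector for offset +2
```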

Position interpolation, introduced by Chen et al. (2023) and refined as NTK-aware scaling and YaRN, is a family of tricks for stretching RoPE to longer contexts after training. The idea is to shrink the rotation frequencies so that what was a full revolution at position 4096 in training becomes only a partial revolution at position 16384 in deployment. This lets models trained at modest context length serve much longer prompts without retraining from scratch, and is the technique behind most "context extension" releases on Hugging Face.
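A toy illustration of the simplest version, linear position interpolation, assuming a RoPE-style angle of the form position times frequency: positions in the longer deployment window are squeezed back into the trained range before the angles are computed.

```python
# Linear position interpolation: shrink positions so deployment-length angles
# stay inside the range the model was trained on.
L_train, L_deploy = 4096, 16384
scale = L_train / L_deploy                # 0.25: every position is shrunk 4x

def interpolated_angle(pos, freq):
    return (pos * scale) * freq           # position 16384 now looks like 4096

freq = 1.0                                # fastest RoPE frequency, as an example
print(interpolated_angle(16384, freq))    # 4096.0, the largest angle seen in training
```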

Where each is used

A rough map of which scheme appears where is useful. The original transformer (2017) and BERT (2018) used absolute encodings: sinusoidal in the original paper, learned in BERT and the GPT-1 through GPT-3 series. T5 (2020) brought relative position encoding into the mainstream, where it remains common in multilingual and sequence-to-sequence models. From around 2021, RoPE took over: LLaMA, LLaMA 2, LLaMA 3, Mistral, Qwen, DeepSeek, Yi, GLM and most other open-weight LLMs released in the last few years all use RoPE. ALiBi has a smaller but devoted following, including BLOOM and several MosaicML releases. Closed-weight commercial models, GPT-3.5, GPT-4, Claude, Gemini, do not document their positional schemes publicly, but the consensus from leaks and from the behaviour of their long-context modes is that they use RoPE or close relatives.

For a practitioner starting a new training run in 2026, the realistic choice is RoPE or ALiBi. Sinusoidal encodings still appear in tutorials and small didactic models, but rarely in production. Learned absolute embeddings remain in the older Hugging Face checkpoints you may inherit, and you should be aware that fine-tuning such a model on longer contexts than it was trained on will fail unless you extend the position table or swap the positional scheme. If your downstream application demands long context, retrieval-augmented generation over a whole document, code understanding across an entire repository, or transcript analysis of a multi-hour meeting, pick RoPE with a proven long-context scaling recipe, or ALiBi, and validate that the model actually performs at the lengths you care about rather than trusting a number on a model card.

What you should take away

  1. Self-attention by itself is order-blind: it treats its input as a set, and "dog bites man" looks the same as "man bites dog" without help.
  2. Positional encoding fixes this by adding (or rotating in) a position-dependent signal, so attention can tell which token came first.
  3. Sinusoidal encodings (Vaswani 2017) are fixed sinusoids of geometric frequencies, added to embeddings; learned absolute encodings (BERT, GPT-2) are a lookup table. Neither extrapolates well beyond the training length: the lookup table has no entries for unseen positions, and the sinusoids, though defined everywhere, degrade in practice.
  4. RoPE rotates queries and keys so that attention scores depend only on relative position. It is the default for almost all modern open LLMs and extrapolates well with NTK or YaRN scaling.
  5. ALiBi adds a linear distance penalty to attention scores. It is parameter-free, simple to implement and extrapolates well; it is the main alternative to RoPE for long-context models.
