Sines and cosines of many frequencies tag each position with a unique fingerprint.
From Chapter 13: Attention & Transformers
Glossary: positional encoding
Transcript
A transformer's self-attention is permutation invariant. Without help, the sentence "dog bites man" looks identical to "man bites dog": the same tokens, just reordered.
Positional encodings break the symmetry. We add a position-dependent vector to every token embedding.
Pick many sinusoidal frequencies, geometrically spaced. For each token position, evaluate sine and cosine at every frequency.
Position one is a vector of sine and cosine values. Position two is a different vector. Position one hundred yet another.
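As a concrete illustration, here is a minimal NumPy sketch of the scheme just described, using the 10000 base frequency from the original Transformer paper. The function name sinusoidal_pe and the sizes are illustrative choices, not something from the transcript.

```python
# A minimal sketch of sinusoidal positional encodings (illustrative names).
import numpy as np

def sinusoidal_pe(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of positional encodings."""
    positions = np.arange(num_positions)[:, None]        # (P, 1)
    dims = np.arange(d_model // 2)[None, :]               # (1, D/2) one frequency per pair
    inv_freq = 1.0 / (10000 ** (2 * dims / d_model))      # geometrically spaced frequencies
    angles = positions * inv_freq                          # (P, D/2)

    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine at every frequency
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine at every frequency
    return pe

pe = sinusoidal_pe(num_positions=100, d_model=64)
print(pe[1][:4])    # position one: its fingerprint
print(pe[2][:4])    # position two: a different vector
```

Feeding the resulting matrix to an image plot (for example matplotlib's imshow) reproduces the heatmap described next.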
Plot them as a heatmap. Rows are positions, columns are dimensions. Low frequencies produce wide stripes that change slowly down the rows; high frequencies produce narrow, rapidly alternating ones.
The crucial property: nearby positions get similar vectors, distant positions get dissimilar ones. The dot product between two position vectors falls off as the distance between them grows.
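A quick numerical check of that claim, assuming the sinusoidal_pe sketch above; the anchor position and the offsets are arbitrary choices.

```python
# Rough check of the similarity claim (reuses sinusoidal_pe from the sketch above).
pe = sinusoidal_pe(num_positions=200, d_model=64)
anchor = pe[50]
for offset in (1, 2, 5, 20, 100):
    sim = float(anchor @ pe[50 + offset])
    print(f"distance {offset:3d}: dot product {sim:.2f}")
# Nearby positions yield large dot products; the value falls off
# (with small oscillations) as the distance grows.
```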
The positional encoding is added to every token embedding before it enters the first transformer block. From then on, every attention head can use it to tell where each token sits in the sequence.
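In code that step is a single addition, sketched here with assumed shapes and the sinusoidal_pe helper from above; the embeddings are random stand-ins.

```python
# Illustrative only: make the input to the first block position-aware.
seq_len, d_model = 10, 64
token_emb = np.random.randn(seq_len, d_model)       # stand-in token embeddings
x = token_emb + sinusoidal_pe(seq_len, d_model)     # added once, before block 1
```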
Modern variants, like rotary position embeddings, encode position not as an addition but as a rotation applied to the query and key vectors. Same idea, slightly different mechanics, better extrapolation to longer sequences.
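A simplified sketch of the rotary idea, not a faithful RoPE implementation: rotate consecutive feature pairs of the query and key by angles proportional to position, so that their dot product depends only on the relative offset. The function name and base constant are assumptions.

```python
# Simplified rotary-style rotation of feature pairs (illustrative, not full RoPE).
import numpy as np

def rotate_pairs(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    d = x.shape[-1]
    dims = np.arange(d // 2)
    theta = position / (base ** (2 * dims / d))   # one angle per feature pair
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]                     # split into 2-D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin               # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

# Dot products of rotated queries and keys depend only on relative position:
q, k = np.random.randn(8), np.random.randn(8)
print(rotate_pairs(q, 3) @ rotate_pairs(k, 7))     # offset 4
print(rotate_pairs(q, 10) @ rotate_pairs(k, 14))   # offset 4 again: same value
```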
But the original sinusoidal scheme makes the principle clear. To make attention position-aware, add a deterministic, smoothly varying signal to each token.