Vector quantisation (VQ) replaces a continuous vector $\mathbf{z} \in \mathbb{R}^d$ with the index of the nearest codeword in a finite codebook $\mathcal{C} = \{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$:
$$q(\mathbf{z}) = \arg\min_{k} \|\mathbf{z} - \mathbf{e}_k\|^2.$$
The continuous vector is thereby replaced by a single integer in $\{1, \ldots, K\}$, dramatically compressing the representation while losing only the within-cluster variation. VQ has a long history in classical signal processing (Linde, Buzo & Gray, 1980) for speech and image compression, but its modern resurgence is driven by deep learning, where it is used to bridge continuous neural representations and the discrete-token interface that Transformer language models expect.
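As a concrete illustration, here is a minimal PyTorch sketch of the nearest-codeword lookup defined above; the codebook size $K = 512$ and dimension $d = 64$ are arbitrary choices for the example, not values from the text.

```python
import torch

def quantise(z, codebook):
    """Map each row of z (N, d) to its nearest codeword in the codebook (K, d)."""
    dists = torch.cdist(z, codebook) ** 2   # squared Euclidean distances, shape (N, K)
    indices = dists.argmin(dim=1)           # q(z): one integer per latent vector
    return indices, codebook[indices]       # the codes and the quantised vectors

codebook = torch.randn(512, 64)             # illustrative: K = 512 codewords of dimension d = 64
z = torch.randn(8, 64)                      # a small batch of continuous latents
indices, z_q = quantise(z, codebook)
```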
VQ-VAE
The Vector-Quantised Variational Autoencoder (van den Oord, Vinyals & Kavukcuoglu, 2017) trains:
- An encoder that maps inputs $\mathbf{x}$ to continuous latents $\mathbf{z} = E(\mathbf{x})$.
- A quantiser that snaps $\mathbf{z}$ to the nearest codeword $\mathbf{e}_k$.
- A decoder that reconstructs $\hat{\mathbf{x}} = D(\mathbf{e}_k)$.
The training loss has three components:
$$\mathcal{L} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[\mathbf{z}] - \mathbf{e}_k\|^2}_{\text{codebook}} + \beta\, \underbrace{\|\mathbf{z} - \text{sg}[\mathbf{e}_k]\|^2}_{\text{commitment}},$$
where $\text{sg}[\cdot]$ denotes the stop-gradient operator. Because the $\arg\min$ is non-differentiable, gradients pass through the quantiser via the straight-through estimator: the encoder receives the gradient that the decoder sends back, as if the quantisation were an identity.
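A minimal sketch of how the three loss terms and the straight-through estimator are commonly implemented in PyTorch follows; the function and argument names are illustrative, and $\beta = 0.25$ is a frequently used default rather than a value fixed by the loss above.

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, z, codebook, decoder, beta=0.25):
    """Three-term VQ-VAE objective with a straight-through quantiser (illustrative sketch)."""
    # Non-differentiable arg-min: nearest codeword for each latent
    e_k = codebook[(torch.cdist(z, codebook) ** 2).argmin(dim=1)]

    # Straight-through estimator: the forward pass uses e_k, but the backward
    # pass copies the decoder's gradient onto z as if quantisation were the identity.
    z_q = z + (e_k - z).detach()
    x_hat = decoder(z_q)

    recon = F.mse_loss(x_hat, x)                  # reconstruction term
    codebook_loss = F.mse_loss(e_k, z.detach())   # ||sg[z] - e_k||^2, trains the codebook
    commitment = F.mse_loss(z, e_k.detach())      # ||z - sg[e_k]||^2, trains the encoder
    return recon + codebook_loss + beta * commitment
```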
Residual VQ
Residual vector quantisation (RVQ) quantises iteratively: the first codebook quantises $\mathbf{z}$, the second codebook quantises the residual $\mathbf{z} - \mathbf{e}_{k_1}$, the third quantises $\mathbf{z} - \mathbf{e}_{k_1} - \mathbf{e}_{k_2}$, and so on. With $L$ codebooks of size $K$, RVQ achieves $K^L$ effective codewords with only $LK$ stored vectors, exponentially expanding capacity at linear cost. RVQ underpins modern neural audio codecs such as EnCodec (Meta) and SoundStream (Google), which compress speech and music to a few kilobits per second with high perceptual fidelity.
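The iterative refinement is straightforward to express in code. The sketch below, with hypothetical names, quantises a batch of latents against a list of $L$ codebooks and returns the $L$ index streams plus the summed reconstruction $\mathbf{e}_{k_1} + \cdots + \mathbf{e}_{k_L}$; the $L$ integer streams jointly address one of the $K^L$ effective codewords.

```python
import torch

def rvq_encode(z, codebooks):
    """Residual VQ: each codebook quantises what the previous stages left over."""
    residual = z
    quantised = torch.zeros_like(z)
    indices = []
    for codebook in codebooks:                                    # L codebooks, each of size K
        idx = (torch.cdist(residual, codebook) ** 2).argmin(dim=1)
        e = codebook[idx]
        quantised = quantised + e                                 # running sum e_{k_1} + e_{k_2} + ...
        residual = residual - e                                   # what the next stage must explain
        indices.append(idx)
    return indices, quantised
```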
Modern uses
VQ is the dominant strategy for discretising non-text modalities so they can be modelled by Transformer language models:
- DALL-E (original): a discrete VAE (a Gumbel-softmax relaxation of vector quantisation) tokenised images into a 32×32 grid of discrete tokens, which a 12-billion-parameter Transformer then modelled autoregressively.
- VALL-E, AudioLM, MusicGen: RVQ tokenises audio into hierarchical streams of codes that language models predict.
- MaskGIT, Muse, Parti: VQ image tokens modelled by Transformers, non-autoregressively with masked prediction (MaskGIT, Muse) or autoregressively (Parti).
- VQ-BeT for robotics: discretises continuous actions into codebook tokens.
Pathologies and remedies
The most common training failure is codebook collapse: only a handful of codewords are ever used, and the rest go untrained, wasting representational capacity. Standard mitigations include:
- EMA codebook updates: update each codeword toward the running mean of the encoder outputs assigned to it, rather than via gradient descent (a sketch follows this list).
- Codebook reset: periodically reinitialise unused codewords to the location of a recent encoder output.
- Learnable temperature / Gumbel-softmax: relax the hard $\arg\min$ to a differentiable distribution that anneals toward discreteness.
- Rotation tricks and finite scalar quantisation (FSQ): recent alternatives. FSQ sidesteps codebook learning entirely by quantising each dimension independently to a fixed grid, while the rotation trick instead changes how gradients are propagated through the quantiser.
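As a rough sketch of the first mitigation, the update below maintains exponential moving averages of assignment counts and of the latents assigned to each codeword; the state tensors `ema_count` and `ema_sum`, the decay of 0.99, and the Laplace-smoothing constant are illustrative choices, not values fixed by the text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_codebook_update(codebook, ema_count, ema_sum, z, indices, decay=0.99, eps=1e-5):
    """Pull each codeword toward the running mean of the encoder outputs assigned to it."""
    K = codebook.shape[0]
    one_hot = F.one_hot(indices, K).type_as(z)            # (N, K) assignment matrix

    # Running statistics: how often each codeword is used, and the sum of its assigned latents
    ema_count.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    ema_sum.mul_(decay).add_(one_hot.t() @ z, alpha=1 - decay)

    # Laplace smoothing keeps rarely used codewords from dividing by ~zero
    n = ema_count.sum()
    smoothed = (ema_count + eps) / (n + K * eps) * n
    codebook.copy_(ema_sum / smoothed.unsqueeze(1))
```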
Related terms: Variational Autoencoder, Autoencoder, Tokenisation
Discussed in:
- Chapter 11: CNNs, Discrete latent representations