Glossary

Vector Quantisation

Vector quantisation (VQ) replaces a continuous vector $\mathbf{z} \in \mathbb{R}^d$ with the index of the nearest codeword in a finite codebook $\mathcal{C} = \{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$:

$$q(\mathbf{z}) = \arg\min_{k} \|\mathbf{z} - \mathbf{e}_k\|^2.$$

The continuous vector is thereby replaced by a single integer in $\{1, \ldots, K\}$, dramatically compressing the representation while losing only the within-cluster variation. VQ has a long history in classical signal processing (Linde, Buzo & Gray, 1980) for speech and image compression, but its modern resurgence is driven by deep learning, where it is used to bridge continuous neural representations and the discrete-token interface that Transformer language models expect.
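The nearest-codeword rule above can be sketched in a few lines of NumPy (a minimal illustration; the sizes $K = 8$ and $d = 4$ are arbitrary assumptions, not values from any particular system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a codebook of K = 8 codewords in d = 4 dimensions.
K, d = 8, 4
codebook = rng.normal(size=(K, d))   # C = {e_1, ..., e_K}
z = rng.normal(size=d)               # continuous vector to quantise

# q(z) = argmin_k ||z - e_k||^2
distances = np.sum((codebook - z) ** 2, axis=1)
k = int(np.argmin(distances))

# The vector is now represented by the single integer k; decoding
# recovers the codeword, discarding the within-cluster variation.
z_hat = codebook[k]
```

The compression is from $d$ floats down to $\lceil \log_2 K \rceil$ bits per vector.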

VQ-VAE

The Vector-Quantised Variational Autoencoder (VQ-VAE; van den Oord, Vinyals & Kavukcuoglu, 2017) trains three components:

  1. An encoder that maps inputs $\mathbf{x}$ to continuous latents $\mathbf{z} = E(\mathbf{x})$.
  2. A quantiser that snaps $\mathbf{z}$ to the nearest codeword $\mathbf{e}_k$.
  3. A decoder that reconstructs $\hat{\mathbf{x}} = D(\mathbf{e}_k)$.

The training loss has three components:

$$\mathcal{L} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[\mathbf{z}] - \mathbf{e}_k\|^2}_{\text{codebook}} + \beta\, \underbrace{\|\mathbf{z} - \text{sg}[\mathbf{e}_k]\|^2}_{\text{commitment}},$$

where $\text{sg}[\cdot]$ denotes the stop-gradient operator. Because the $\arg\min$ is non-differentiable, gradients pass through the quantiser via the straight-through estimator: the encoder receives the gradient that the decoder sends back, as if the quantisation were an identity.
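The three loss terms can be computed numerically as below (a minimal NumPy sketch with toy stand-ins for the encoder and decoder; the weight $\beta = 0.25$ is an illustrative choice). The stop-gradient only matters under automatic differentiation, so here it appears as comments; in an autodiff framework the straight-through pass is typically written as `z + detach(e_k - z)`:

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 8, 4                                  # illustrative sizes
codebook = rng.normal(size=(K, d))
beta = 0.25                                  # commitment weight (assumption)

x = rng.normal(size=d)                       # toy input
z = x + 0.1 * rng.normal(size=d)             # stand-in for encoder output E(x)
k = int(np.argmin(np.sum((codebook - z) ** 2, axis=1)))
e_k = codebook[k]
x_hat = e_k                                  # stand-in for decoder output D(e_k)

recon = np.sum((x - x_hat) ** 2)             # ||x - x_hat||^2
codebook_loss = np.sum((z - e_k) ** 2)       # ||sg[z] - e_k||^2: pulls e_k toward z
commitment = np.sum((z - e_k) ** 2)          # ||z - sg[e_k]||^2: pulls z toward e_k
loss = recon + codebook_loss + beta * commitment
```

Note that the codebook and commitment terms have identical numerical values; they differ only in which argument receives gradient, which is exactly what the stop-gradient controls.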

Residual VQ

Residual vector quantisation (RVQ) quantises iteratively: the first codebook quantises $\mathbf{z}$, the second codebook quantises the residual $\mathbf{z} - \mathbf{e}_{k_1}$, the third quantises $\mathbf{z} - \mathbf{e}_{k_1} - \mathbf{e}_{k_2}$, and so on. With $L$ codebooks of size $K$, RVQ achieves $K^L$ effective codewords with only $LK$ stored vectors, exponentially expanding capacity at linear cost. RVQ underpins modern neural audio codecs such as EnCodec (Meta) and SoundStream (Google), which compress speech and music to a few hundred bits per second with high perceptual fidelity.
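The iterative scheme can be sketched as a loop over codebook levels (a minimal NumPy sketch; $K$, $d$ and $L$ are illustrative assumptions, and real codebooks would be learned rather than random):

```python
import numpy as np

rng = np.random.default_rng(2)
K, d, L = 8, 4, 3                      # L codebooks of K codewords each
codebooks = rng.normal(size=(L, K, d))

z = rng.normal(size=d)
residual = z.copy()
indices, recon = [], np.zeros(d)

for level in range(L):
    # Each level quantises whatever the previous levels left unexplained.
    dists = np.sum((codebooks[level] - residual) ** 2, axis=1)
    k = int(np.argmin(dists))
    indices.append(k)
    recon += codebooks[level][k]       # accumulate e_{k_1} + e_{k_2} + ...
    residual = z - recon               # next level quantises what's left

# `indices` holds L integers, jointly selecting one of K**L effective codewords.
```

With trained codebooks, each level's residual shrinks, so later codebooks refine progressively finer detail.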

Modern uses

VQ is the dominant strategy for discretising non-text modalities so they can be modelled by autoregressive Transformer language models:

  • DALL-E (original): a VQ-VAE tokenised images into a 32×32 grid of discrete tokens, which a 12-billion-parameter Transformer then modelled autoregressively.
  • VALL-E, AudioLM, MusicGen: RVQ tokenises audio into hierarchical token streams that language models predict.
  • MaskGIT, Muse, Parti: VQ image tokens are modelled by non-autoregressive masked-prediction Transformers.
  • VQ-BET (robotics): continuous actions are discretised into codebook tokens.

Pathologies and remedies

The most common training failure is codebook collapse: only a handful of codewords are ever used, and the rest go untrained, wasting representational capacity. Standard mitigations include:

  • EMA codebook updates: update each codeword toward the running mean of the encoder outputs assigned to it, rather than via gradient descent.
  • Codebook reset: periodically reinitialise unused codewords to the location of a recent encoder output.
  • Learnable temperature / Gumbel softmax: relax the hard $\arg\min$ to a differentiable distribution that anneals toward discreteness.
  • Rotation trick: replace the straight-through estimator with a rotation-based gradient that better aligns encoder outputs with their codewords.
  • Finite scalar quantisation (FSQ): sidestep codebook learning entirely by quantising each dimension independently to a fixed grid.
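The EMA update in the first bullet can be sketched as follows (a minimal NumPy sketch; the sizes and the decay `gamma = 0.99` are illustrative assumptions, and a real implementation would also apply Laplace smoothing to the counts):

```python
import numpy as np

rng = np.random.default_rng(3)
K, d, gamma = 8, 4, 0.99              # gamma: EMA decay (assumption)
codebook = rng.normal(size=(K, d))
ema_count = np.ones(K)                # running assignment counts
ema_sum = codebook.copy()             # running sums of assigned encoder outputs

def ema_update(batch_z):
    """One EMA codebook update from a batch of encoder outputs."""
    dists = ((batch_z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = np.argmin(dists, axis=1)             # nearest codeword per vector
    counts = np.bincount(assign, minlength=K)
    sums = np.zeros((K, d))
    np.add.at(sums, assign, batch_z)              # sum batch vectors per codeword
    # Exponential moving averages of assignment counts and sums.
    ema_count[:] = gamma * ema_count + (1 - gamma) * counts
    ema_sum[:] = gamma * ema_sum + (1 - gamma) * sums
    # Each codeword moves toward the running mean of its assigned outputs,
    # with no gradient flowing into the codebook at all.
    codebook[:] = ema_sum / ema_count[:, None]
    return assign

assign = ema_update(rng.normal(size=(64, d)))
```

A codebook reset would then reinitialise any row whose `ema_count` falls below a threshold to a recent encoder output.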

Related terms: Variational Autoencoder, Autoencoder, Tokenisation
