Vector quantisation (VQ) replaces a continuous vector $\mathbf{z} \in \mathbb{R}^d$ with the index of the nearest codeword in a finite codebook $\mathcal{C} = \{\mathbf{e}_1, \ldots, \mathbf{e}_K\}$:
$$q(\mathbf{z}) = \arg\min_{k} \|\mathbf{z} - \mathbf{e}_k\|^2.$$
The continuous vector is thereby replaced by a single integer in $\{1, \ldots, K\}$, dramatically compressing the representation while losing only the within-cluster variation. VQ has a long history in classical signal processing (Linde, Buzo & Gray, 1980) for speech and image compression, but its modern resurgence is driven by deep learning, where it is used to bridge continuous neural representations and the discrete-token interface that Transformer language models expect.
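As a concrete illustration, here is a minimal PyTorch sketch of the nearest-codeword lookup defined above; the codebook size $K = 512$ and dimension $d = 64$ are arbitrary choices for the example, not values from the text.

```python
import torch

def quantise(z, codebook):
    """Map each row of z (N, d) to its nearest codeword in the codebook (K, d)."""
    dists = torch.cdist(z, codebook) ** 2   # squared Euclidean distances, shape (N, K)
    indices = dists.argmin(dim=1)           # q(z): one integer per latent vector
    return indices, codebook[indices]       # the codes and the quantised vectors

codebook = torch.randn(512, 64)             # illustrative: K = 512 codewords of dimension d = 64
z = torch.randn(8, 64)                      # a small batch of continuous latents
indices, z_q = quantise(z, codebook)
```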
VQ-VAE
The Vector-Quantised Variational Autoencoder (van den Oord, Vinyals & Kavukcuoglu, 2017) trains:
- An encoder that maps inputs $\mathbf{x}$ to continuous latents $\mathbf{z} = E(\mathbf{x})$.
- A quantiser that snaps $\mathbf{z}$ to the nearest codeword $\mathbf{e}_k$.
- A decoder that reconstructs $\hat{\mathbf{x}} = D(\mathbf{e}_k)$.
The training loss has three components:
$$\mathcal{L} = \underbrace{\|\mathbf{x} - \hat{\mathbf{x}}\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[\mathbf{z}] - \mathbf{e}_k\|^2}_{\text{codebook}} + \beta\, \underbrace{\|\mathbf{z} - \text{sg}[\mathbf{e}_k]\|^2}_{\text{commitment}},$$
where $\text{sg}[\cdot]$ denotes the stop-gradient operator. Because the $\arg\min$ is non-differentiable, gradients pass through the quantiser via the straight-through estimator: the encoder receives the gradient that the decoder sends back, as if the quantisation were an identity.
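A minimal sketch of how the three loss terms and the straight-through estimator are commonly implemented in PyTorch follows; the function and argument names are illustrative, and $\beta = 0.25$ is a frequently used default rather than a value fixed by the loss above.

```python
import torch
import torch.nn.functional as F

def vq_vae_loss(x, z, codebook, decoder, beta=0.25):
    """Three-term VQ-VAE objective with a straight-through quantiser (illustrative sketch)."""
    # Non-differentiable arg-min: nearest codeword for each latent
    e_k = codebook[(torch.cdist(z, codebook) ** 2).argmin(dim=1)]

    # Straight-through estimator: the forward pass uses e_k, but the backward
    # pass copies the decoder's gradient onto z as if quantisation were the identity.
    z_q = z + (e_k - z).detach()
    x_hat = decoder(z_q)

    recon = F.mse_loss(x_hat, x)                  # reconstruction term
    codebook_loss = F.mse_loss(e_k, z.detach())   # ||sg[z] - e_k||^2, trains the codebook
    commitment = F.mse_loss(z, e_k.detach())      # ||z - sg[e_k]||^2, trains the encoder
    return recon + codebook_loss + beta * commitment
```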
Residual VQ
Residual vector quantisation (RVQ) quantises iteratively: the first codebook quantises $\mathbf{z}$, the second codebook quantises the residual $\mathbf{z} - \mathbf{e}_{k_1}$, the third quantises $\mathbf{z} - \mathbf{e}_{k_1} - \mathbf{e}_{k_2}$, and so on. With $L$ codebooks of size $K$, RVQ achieves $K^L$ effective codewords with only $LK$ stored vectors, exponentially expanding capacity at linear cost. RVQ underpins modern neural audio codecs such as EnCodec (Meta) and SoundStream (Google), which compress speech and music to a few kilobits per second with high perceptual fidelity.
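The iterative refinement is straightforward to express in code. The sketch below, with hypothetical names, quantises a batch of latents against a list of $L$ codebooks and returns the $L$ index streams plus the summed reconstruction $\mathbf{e}_{k_1} + \cdots + \mathbf{e}_{k_L}$; the $L$ integer streams jointly address one of the $K^L$ effective codewords.

```python
import torch

def rvq_encode(z, codebooks):
    """Residual VQ: each codebook quantises what the previous stages left over."""
    residual = z
    quantised = torch.zeros_like(z)
    indices = []
    for codebook in codebooks:                                    # L codebooks, each of size K
        idx = (torch.cdist(residual, codebook) ** 2).argmin(dim=1)
        e = codebook[idx]
        quantised = quantised + e                                 # running sum e_{k_1} + e_{k_2} + ...
        residual = residual - e                                   # what the next stage must explain
        indices.append(idx)
    return indices, quantised
```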
Modern uses
VQ is the dominant strategy for discretising non-text modalities so they can be modelled by Transformer language models:
- DALL-E (original): a discrete VAE (a Gumbel-softmax relaxation of vector quantisation) tokenised images into a 32×32 grid of discrete tokens, which a 12-billion-parameter Transformer then modelled autoregressively.
- VALL-E, AudioLM, MusicGen: RVQ tokenises audio into hierarchical streams of codes that language models predict.
- MaskGIT, Muse, Parti: VQ image tokens modelled by Transformers, non-autoregressively with masked prediction (MaskGIT, Muse) or autoregressively (Parti).
- VQ-BeT for robotics: discretises continuous actions into codebook tokens.
Pathologies and remedies
The most common training failure is codebook collapse: only a handful of codewords are ever used, and the rest go untrained, wasting representational capacity. Standard mitigations include:
- EMA codebook updates: update each codeword toward the running mean of the encoder outputs assigned to it, rather than via gradient descent (a sketch follows this list).
- Codebook reset: periodically reinitialise unused codewords to the location of a recent encoder output.
- Learnable temperature / Gumbel-softmax: relax the hard $\arg\min$ to a differentiable distribution that anneals toward discreteness.
- Rotation tricks and finite scalar quantisation (FSQ): recent alternatives. FSQ sidesteps codebook learning entirely by quantising each dimension independently to a fixed grid, while the rotation trick instead changes how gradients are propagated through the quantiser.
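As a rough sketch of the first mitigation, the update below maintains exponential moving averages of assignment counts and of the latents assigned to each codeword; the state tensors `ema_count` and `ema_sum`, the decay of 0.99, and the Laplace-smoothing constant are illustrative choices, not values fixed by the text.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_codebook_update(codebook, ema_count, ema_sum, z, indices, decay=0.99, eps=1e-5):
    """Pull each codeword toward the running mean of the encoder outputs assigned to it."""
    K = codebook.shape[0]
    one_hot = F.one_hot(indices, K).type_as(z)            # (N, K) assignment matrix

    # Running statistics: how often each codeword is used, and the sum of its assigned latents
    ema_count.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    ema_sum.mul_(decay).add_(one_hot.t() @ z, alpha=1 - decay)

    # Laplace smoothing keeps rarely used codewords from dividing by ~zero
    n = ema_count.sum()
    smoothed = (ema_count + eps) / (n + K * eps) * n
    codebook.copy_(ema_sum / smoothed.unsqueeze(1))
```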
Related terms: Variational Autoencoder, Autoencoder, Tokenisation
Discussed in:
- Chapter 11: CNNs, Discrete latent representations