EnCodec (Défossez, Copet, Synnaeve, and Adi, Meta AI, High Fidelity Neural Audio Compression, October 2022) is a learned audio codec that compresses 24 kHz mono or 48 kHz stereo audio to 1.5–24 kbps; at 6 kbps the authors report subjective quality matching MP3 at 64 kbps (roughly a 10× saving) and surpassing Opus at 12 kbps. Beyond compression, its discrete tokens form the substrate for VALL-E, MusicGen, AudioGen, Bark, and most other modern audio language models.
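A minimal round-trip sketch, assuming the reference `encodec` pip package from facebookresearch/encodec (constructors, `convert_audio`, and the encode/decode calls follow its README; the audio path is a placeholder):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()  # 24 kHz mono model
model.set_target_bandwidth(6.0)             # kbps: one of 1.5, 3, 6, 12, 24

wav, sr = torchaudio.load("speech.wav")     # placeholder path
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)              # list of (codes, scale) tuples
    recon = model.decode(frames)            # waveform reconstruction

codes = torch.cat([c for c, _ in frames], dim=-1)
print(codes.shape)                          # [1, 8, 75 * seconds]: 8 codebooks at 6 kbps
```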
Architecture. A streaming convolutional autoencoder:
- Encoder. An initial conv layer, then four convolutional blocks, each a residual unit (kernel-3 and kernel-1 convs with a skip) followed by a strided downsampling conv (strides 2, 4, 5, 8). Channels double from 32 to 512 as the time resolution falls; a two-layer LSTM and a final conv project to 128-dimensional latents. The total stride of 320 maps 24 kHz audio to 75 Hz latent frames (see the sketch after this list).
- Quantiser. Residual Vector Quantisation (RVQ) with up to $N_q = 32$ codebooks of 1024 entries each. Different bandwidth budgets correspond to different $N_q$.
- Decoder. Mirror of the encoder using transposed convolutions; final $\tanh$ output produces waveform.
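A minimal, non-streaming PyTorch sketch of the encoder's downsampling stack under the dimensions above (module names, ELU placement, and the padding rule are illustrative; the real model adds weight norm, the LSTM, and causal padding):

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Kernel-3 then kernel-1 conv with a skip connection (weight norm omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ELU(), nn.Conv1d(dim, dim // 2, kernel_size=3, padding=1),
            nn.ELU(), nn.Conv1d(dim // 2, dim, kernel_size=1),
        )

    def forward(self, x):
        return x + self.block(x)

class EncoderSketch(nn.Module):
    """Strides (2, 4, 5, 8) -> 320x downsampling; channels 32 -> 512 -> 128-d latents."""
    def __init__(self, latent_dim=128):
        super().__init__()
        layers, dim = [nn.Conv1d(1, 32, kernel_size=7, padding=3)], 32
        for stride in (2, 4, 5, 8):
            layers += [
                ResidualUnit(dim),
                nn.ELU(),
                # kernel = 2*stride with ~stride/2 padding keeps length exactly T/stride
                nn.Conv1d(dim, dim * 2, kernel_size=2 * stride,
                          stride=stride, padding=(stride + 1) // 2),
            ]
            dim *= 2  # 32 -> 64 -> 128 -> 256 -> 512
        layers += [nn.ELU(), nn.Conv1d(dim, latent_dim, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, wav):       # wav: [B, 1, T] at 24 kHz
        return self.net(wav)      # latents: [B, 128, T // 320]

z = EncoderSketch()(torch.randn(1, 1, 24000))   # 1 s of audio
print(z.shape)                                  # torch.Size([1, 128, 75]) -> 75 Hz
```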
Residual Vector Quantisation. Standard VQ approximates a vector $z$ by its nearest entry in a single codebook. RVQ quantises recursively: set $r_0 = z$; at stage $i$, pick the nearest entry $e_{q_i}^{(i)}$ from codebook $i$ and form the residual $r_i = r_{i-1} - e_{q_i}^{(i)}$. After $N_q$ stages the approximation is
$$\hat{z} = \sum_{i=1}^{N_q} e_{q_i}^{(i)}, \quad q_i = \arg\min_v \big\|r_{i-1} - e_v^{(i)}\big\|_2.$$
Each codebook adds $\log_2 1024 = 10$ bits per frame; at 75 Hz that is 750 bps per codebook. So $N_q = 2$ gives 1.5 kbps, $N_q = 8$ gives 6 kbps, $N_q = 32$ gives 24 kbps. Quantiser dropout during training (sampling $N_q$ uniformly per batch) yields a single model that supports all bitrates.
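A greedy RVQ encoder in a few lines of PyTorch, directly transcribing the equations above (random codebooks stand in for trained ones; shapes follow the 24 kHz model):

```python
import torch

def rvq_encode(z, codebooks):
    """Greedy residual VQ: quantise z against each codebook in turn.

    z:         [T, D] latent frames
    codebooks: list of N_q tensors, each [K, D] (K = 1024 for EnCodec)
    returns    codes [N_q, T] and the reconstruction z_hat [T, D]
    """
    residual, z_hat, codes = z, torch.zeros_like(z), []
    for cb in codebooks:                  # stage i
        d = torch.cdist(residual, cb)     # [T, K] pairwise L2 distances
        idx = d.argmin(dim=1)             # q_i = argmin_v ||r_{i-1} - e_v^{(i)}||
        quantised = cb[idx]               # e^{(i)}_{q_i}
        z_hat = z_hat + quantised         # accumulate the sum over stages
        residual = residual - quantised   # r_i = r_{i-1} - e^{(i)}_{q_i}
        codes.append(idx)
    return torch.stack(codes), z_hat

# 1 s of audio -> 75 frames; 8 codebooks -> 8 * 10 bits * 75 Hz = 6 kbps
codebooks = [torch.randn(1024, 128) for _ in range(8)]
codes, z_hat = rvq_encode(torch.randn(75, 128), codebooks)
print(codes.shape)  # torch.Size([8, 75])
```

Because each stage refines the previous stage's error, any prefix of the code sequence is itself a valid (coarser) encoding, which is what makes quantiser dropout and variable-bitrate decoding work.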
Codebook updates. Codebook entries are updated via exponential moving averages of cluster centroids (van den Oord et al. 2017), with dead-code reinitialisation: any code whose usage drops below a threshold is replaced by a random encoder activation from the current batch, preventing codebook collapse.
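A sketch of one EMA update step with dead-code reinitialisation (buffer names, the 0.99 decay, and the usage threshold are illustrative choices, not EnCodec's exact hyperparameters):

```python
import torch
import torch.nn.functional as F

def ema_codebook_update(codebook, ema_count, ema_sum, z, codes,
                        decay=0.99, eps=1e-5, threshold=1.0):
    """codebook: [K, D] entries; ema_count: [K] usage EMA; ema_sum: [K, D] sum EMA.
    z: [N, D] encoder outputs in the batch; codes: [N] their assigned indices."""
    K = codebook.shape[0]
    onehot = F.one_hot(codes, K).to(z.dtype)                  # [N, K] assignments
    ema_count.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)
    ema_sum.mul_(decay).add_(onehot.t() @ z, alpha=1 - decay)
    codebook.copy_(ema_sum / (ema_count.unsqueeze(1) + eps))  # moving centroids

    dead = ema_count < threshold              # codes that have fallen out of use
    if dead.any():
        # re-seed dead codes with random encoder activations from this batch
        fresh = z[torch.randint(0, z.shape[0], (int(dead.sum()),))]
        codebook[dead] = fresh
        ema_count[dead] = threshold
        ema_sum[dead] = fresh * threshold
```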
Multi-objective loss. EnCodec is trained with five terms:
$$\mathcal{L} = \lambda_t \mathcal{L}_t + \lambda_f \mathcal{L}_f + \lambda_g \mathcal{L}_g + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}} + \lambda_w \mathcal{L}_w.$$
Here $\mathcal{L}_t$ is an L1 reconstruction loss on the waveform, $\mathcal{L}_f$ an L1+L2 loss on mel spectrograms at multiple STFT scales, $\mathcal{L}_g$ a hinge GAN loss from a multi-scale STFT discriminator (several discriminators operating on complex spectrograms at different resolutions), $\mathcal{L}_{\text{feat}}$ the matching feature-matching loss on discriminator activations, and $\mathcal{L}_w$ the commitment loss pulling encoder outputs toward their quantised values (the codebooks themselves are trained by the EMA updates above, not by gradient). The discriminators are critical: spectrogram losses alone produce muffled audio; adversarial training restores high-frequency detail.
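A sketch of the spectral term $\mathcal{L}_f$, assuming torchaudio's `MelSpectrogram`; the window range $2^5$ to $2^{11}$ and 64 mel bins follow the paper, while the equal weighting of the L1 and L2 parts is a simplification:

```python
import torch
import torchaudio

def spectral_loss(x, x_hat, sample_rate=24000, n_mels=64):
    """L_f sketch: L1 + L2 distance between mel spectrograms at windows 2^5..2^11."""
    loss = 0.0
    for i in range(5, 12):
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2 ** i,
            hop_length=2 ** i // 4, n_mels=n_mels,
        )
        s, s_hat = mel(x), mel(x_hat)
        loss = loss + (s - s_hat).abs().mean() + ((s - s_hat) ** 2).mean()
    return loss / 7  # average over the 7 scales

x, x_hat = torch.randn(1, 24000), torch.randn(1, 24000)
print(spectral_loss(x, x_hat))
```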
Streaming and stereo. Causal padding throughout the encoder makes EnCodec streamable, with an initial latency of $\sim$13 ms (one 320-sample frame at 24 kHz). For stereo, two channels are processed jointly with shared convolutions but per-channel quantisation.
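The streaming change is essentially a padding convention: a hypothetical causal variant of the conv layers in the sketch above, with the latency arithmetic following from the 320-sample frame:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Put all padding on the left so each output depends only on past inputs."""
    def __init__(self, c_in, c_out, kernel_size, stride=1):
        super().__init__()
        self.left_pad = kernel_size - stride
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, stride=stride)

    def forward(self, x):                    # x: [B, C, T]
        return self.conv(F.pad(x, (self.left_pad, 0)))

# The first latent frame can be emitted once one 320-sample hop has arrived:
print(f"{320 / 24000 * 1000:.1f} ms")        # 13.3 ms
```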
Why audio LMs need EnCodec. Pre-EnCodec audio LMs (notably AudioLM) used Google's SoundStream, which EnCodec outperforms at matched bitrates in its own evaluations; OpenAI's Jukebox used a hand-tuned VQ-VAE hierarchy. EnCodec's open weights, high quality, and clean RVQ structure made it the de facto tokeniser. DAC (Descript Audio Codec, 2023) reconstructs 44.1 kHz audio at 8 kbps with higher fidelity and is the modern successor.
Related terms: Vector Quantisation, Convolution, Convolutional Neural Network, VALL-E, MusicGen, AudioLM
Discussed in:
- Chapter 12: Sequence Models, Neural Audio Codecs