VALL-E (Wang et al., Microsoft, Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, January 2023) reframed text-to-speech as a conditional language-modelling problem over discrete neural codec tokens, achieving high-quality zero-shot voice cloning from as little as 3 seconds of enrolment audio. It marked the shift from regression-based TTS (Tacotron, FastSpeech) to discrete-token autoregressive TTS, mirroring the earlier shift from RNN language models to GPT.
Codec front-end. VALL-E uses EnCodec at 6 kbps, producing 8 codebook indices per audio frame at 75 Hz. For each timestep $t$ the codec emits a vector $c_t = (c_t^1, \ldots, c_t^8)$, where $c_t^j \in \{1, \ldots, 1024\}$ is the index from the $j$-th residual vector quantisation level. Higher-index codebooks capture progressively finer acoustic detail.
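These numbers can be reproduced with the released EnCodec package; a minimal sketch following the usage documented in the facebookresearch/encodec repository (the input file name is illustrative):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model and pin it to 6 kbps (=> 8 residual codebooks).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("speech.wav")                  # illustrative input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))              # list of (codes, scale) tuples
codes = torch.cat([codebook for codebook, _ in frames], dim=-1)
print(codes.shape)  # [1, 8, T]; T is about 75 frames per second of audio
```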
Two-stage modelling. Generating all eight codebooks autoregressively in raster order would require $8\times$ as many decoding steps. VALL-E instead factorises:
AR (autoregressive) decoder models the first codebook stream conditioned on the phoneme transcript $\tilde{x}$ and the first-codebook tokens $\tilde{C}^1$ of the acoustic prompt:
$$p(c^1 \mid \tilde{x}, \tilde{C}^1; \theta_{\text{AR}}) = \prod_{t=1}^{T} p(c_t^1 \mid c_{<t}^1, \tilde{x}, \tilde{C}^1; \theta_{\text{AR}}).$$
NAR (non-autoregressive) decoder generates codebooks 2–8 one level at a time; within each level every timestep is predicted in parallel, conditioned on the full sequences of all earlier codebook levels and the complete eight-codebook prompt:
$$p(c^{2:8} \mid \tilde{x}, \tilde{C}^{1:8}, c^1; \theta_{\text{NAR}}) = \prod_{j=2}^{8} p(c^j \mid c^{1:j-1}, \tilde{x}, \tilde{C}^{1:8}; \theta_{\text{NAR}}).$$
Both decoders are decoder-only Transformers (12 layers, 16 heads, $d_{\text{model}} = 1024$). The AR model uses causal attention; the NAR model uses bidirectional attention over the full sequence at each codebook level.
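A minimal sketch of the two-stage decoding loop. Microsoft released no official implementation, so `ar_model`, `nar_model`, and the `eos_id` attribute are illustrative assumptions; the paper samples the AR stream (greedy decoding tends to loop) and decodes the NAR levels greedily:

```python
import torch

def synthesise(phonemes, prompt_codes, ar_model, nar_model, max_len=1500):
    """Two-stage VALL-E decoding sketch. `prompt_codes` is the [8, T'] token
    matrix of the 3-second enrolment clip; `phonemes` covers reference + target."""
    # Stage 1: AR decoder extends the first-codebook prompt token by token.
    c1 = prompt_codes[0].tolist()
    prefix_len = len(c1)
    for _ in range(max_len):
        logits = ar_model(phonemes, torch.tensor(c1))          # [t, vocab]
        probs = torch.softmax(logits[-1], dim=-1)
        tok = torch.multinomial(probs, 1).item()               # sampling, not greedy
        if tok == ar_model.eos_id:
            break
        c1.append(tok)
    codes = [torch.tensor(c1[prefix_len:])]                    # keep the continuation only

    # Stage 2: NAR decoder fills levels 2..8, all timesteps of a level at once,
    # conditioned on every previously generated level and the full prompt.
    for level in range(2, 9):
        logits = nar_model(phonemes, prompt_codes, torch.stack(codes), level)  # [T, vocab]
        codes.append(logits.argmax(dim=-1))                    # greedy per timestep

    return torch.stack(codes)  # [8, T] indices for EnCodec's decoder
```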
Zero-shot prompting. At inference, the user supplies a 3-second reference clip from the target speaker. Its EnCodec tokens $\tilde{C}$ are prepended to the AR decoder's input alongside the phoneme transcript of the reference. The model then continues generating tokens for the new transcript; speaker identity, prosody, emotion, and recording acoustics are all carried over implicitly through the token prefix, with no speaker embedding or fine-tuning required. This is in-context learning transferred from text LMs to speech.
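Concretely, the input layout looks like this (dummy tensors; all names and shapes are illustrative, not the paper's code):

```python
import torch

ref_phonemes = torch.tensor([12, 7, 33])        # phonemes of the reference clip
target_phonemes = torch.tensor([5, 19, 2, 41])  # phonemes of the text to speak
ref_codes = torch.randint(0, 1024, (8, 225))    # 3 s x 75 Hz = 225 frames, 8 codebooks

phonemes = torch.cat([ref_phonemes, target_phonemes])  # joint transcript condition
acoustic_prefix = ref_codes[0]                         # first-codebook prompt for the AR stage
# AR generation continues after `acoustic_prefix`; the continuation inherits the
# reference speaker's identity, prosody, and recording conditions.
```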
Training data. LibriLight, 60,000 hours of unlabelled English audiobooks, was force-aligned to phoneme transcripts with a hybrid ASR system. Training is the standard cross-entropy next-token loss with teacher forcing.
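As a sketch, the AR objective under teacher forcing is one-step-shifted cross-entropy over the first codebook stream (a batched variant of the assumed `ar_model` interface above):

```python
import torch.nn.functional as F

def ar_loss(ar_model, phonemes, c1):
    """Teacher-forced next-token loss on the first codebook stream.
    c1: [B, T] ground-truth EnCodec indices for codebook 1."""
    logits = ar_model(phonemes, c1[:, :-1])            # predict token t from tokens < t
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # [B*(T-1), vocab]
        c1[:, 1:].reshape(-1),                         # targets shifted by one
    )
```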
Variants. VALL-E X (March 2023) extends to cross-lingual zero-shot cloning. VALL-E 2 (June 2024) achieves human parity on LibriSpeech and VCTK by adding grouped code modelling (predict $G$ codes per step) and repetition-aware sampling (penalise immediate token reuse to remove glitches). NaturalSpeech 3 factorises the codec into prosody, content, timbre, and acoustic-detail streams, each with its own LM.
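Of these, repetition-aware sampling is the easiest to illustrate. A simplified sketch (window size and repeat threshold here are illustrative, not the paper's values): nucleus-sample by default, and fall back to sampling the full distribution when the chosen token has been looping:

```python
import torch

def nucleus_sample(probs, top_p=0.9):
    """Sample from the smallest token set whose cumulative mass reaches top_p."""
    sorted_p, idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_p, dim=-1) - sorted_p < top_p   # always keeps the top token
    masked = sorted_p * keep
    return idx[torch.multinomial(masked / masked.sum(), 1)].item()

def repetition_aware_sample(logits, history, window=10, max_repeats=4):
    """Nucleus-sample; if the token already repeats heavily in the recent
    window of generated tokens, resample from the full distribution."""
    probs = torch.softmax(logits, dim=-1)
    tok = nucleus_sample(probs)
    if history[-window:].count(tok) >= max_repeats:
        tok = torch.multinomial(probs, 1).item()       # escape the repetition loop
    return tok
```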
Risks. Zero-shot voice cloning trivialises deepfake audio: a 3-second clip from a public video suffices to impersonate any speaker. Microsoft did not release VALL-E weights for this reason, and the AI Safety Institute lists voice cloning among the highest-priority misuse vectors. Watermarking schemes (AudioSeal, WavMark) and detection classifiers are active research responses.
Related terms: EnCodec, Transformer, Vector Quantisation, AudioLM, WaveNet
Discussed in:
- Chapter 12: Sequence Models, Speech Synthesis