Glossary

wav2vec 2.0

wav2vec 2.0 (Baevski, Zhou, Mohamed, Auli, NeurIPS 2020) is the dominant self-supervised pre-training framework for speech. It demonstrated that ten minutes of labelled audio plus 53,000 hours of unlabelled audio yield a usable recogniser, and that one hour of labels is enough to outperform the previous state of the art trained on 100 hours. It established the now-standard pre-train-then-fine-tune recipe for ASR.

Architecture. Three components process raw 16 kHz waveform:

  1. CNN feature encoder $f: \mathcal{X} \to \mathcal{Z}$, seven temporal convolutional blocks (kernels 10/3/3/3/3/2/2, strides 5/2/2/2/2/2/2) producing latent representations $z_t$ at a 20 ms stride (≈50 Hz frame rate); a stride calculation follows this list. Each block: Conv → normalisation (GroupNorm or LayerNorm, depending on the variant) → GELU.

  2. Quantisation module $q: z_t \to q_t$, a product quantiser with $G = 2$ codebooks of $V = 320$ entries each, selected via Gumbel-softmax:

    $$p_{g,v} = \frac{\exp\big((\ell_{g,v} + n_v)/\tau\big)}{\sum_{v'} \exp\big((\ell_{g,v'} + n_{v'})/\tau\big)}, \qquad n_v = -\log(-\log u_v), \quad u_v \sim U(0,1),$$

    with temperature $\tau$ annealed from 2.0 to 0.5 over training. Codebooks are trained jointly via straight-through gradients (see the quantiser sketch after this list).

  3. Transformer context network $g: \mathcal{Z} \to \mathcal{C}$, 12 (Base) or 24 (Large) Transformer blocks, with a convolutional layer serving as relative positional embedding (post-LayerNorm in Base, pre-LayerNorm in Large).
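To make the geometry of item 1 concrete, here is a minimal sketch (not the fairseq implementation) that applies the kernel/stride schedule to compute the encoder's total downsampling and output frame count:

```python
# Sketch: the seven conv blocks of the feature encoder turn 16 kHz samples into ~20 ms frames.
kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]

def num_frames(n_samples: int) -> int:
    """Output length after the conv stack (no padding, kernels/strides as above)."""
    n = n_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n

total_stride = 1
for s in strides:
    total_stride *= s

print(total_stride)       # 320 samples -> 320 / 16000 s = 20 ms per frame
print(num_frames(16000))  # 49 frames for one second of 16 kHz audio (~50 Hz)
```

And a sketch of the product quantiser's Gumbel-softmax selection from item 2, assuming logits of shape (T, G, V) and illustrative tensor names (the real model additionally applies a linear projection to the concatenated entries before using them as targets):

```python
import torch
import torch.nn.functional as F

T, G, V, d = 49, 2, 320, 128           # frames, codebooks, entries per codebook, entry dim
logits = torch.randn(T, G, V)          # in the model: a linear layer applied to z_t
codebooks = torch.randn(G, V, d)       # learned codebook entries

tau = 2.0                              # annealed towards 0.5 during pre-training
# hard=True: one-hot selection in the forward pass, straight-through gradients in the backward pass
onehot = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)          # (T, G, V)
q = torch.einsum("tgv,gvd->tgd", onehot, codebooks).reshape(T, G * d)  # q_t, dimension G*d
```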

Masking. A proportion $p = 0.065$ of time steps in $z_t$ is sampled as span starts; each start is expanded to a span of 10 consecutive frames (spans may overlap), and the spanned latents are replaced by a learned mask embedding before entering the Transformer. The Transformer must recover the quantised target $q_t$ at masked positions from the surrounding unmasked context.
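A minimal sketch of this span masking, with illustrative names and the values quoted above ($p = 0.065$, spans of 10 frames):

```python
import torch

def sample_mask(T: int, p: float = 0.065, span: int = 10) -> torch.Tensor:
    """Boolean mask over T time steps: each step starts a span with probability p."""
    starts = torch.rand(T) < p
    mask = torch.zeros(T, dtype=torch.bool)
    for s in torch.nonzero(starts).flatten().tolist():
        mask[s : s + span] = True      # spans may overlap, as in the paper
    return mask

T, d = 49, 768
z = torch.randn(T, d)                  # latents from the feature encoder
mask_emb = torch.randn(d)              # a learned vector in the real model
mask = sample_mask(T)
z[mask] = mask_emb                     # the masked latents are what the Transformer sees
```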

Contrastive loss. Let $c_t$ be the Transformer output at masked step $t$ and $q_t$ its quantised target. Distractors $\tilde{q}$ are sampled uniformly from the quantised representations of other masked steps in the same utterance (100 per masked step). The InfoNCE loss is

$$\mathcal{L}_m = -\log \frac{\exp(\text{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in \mathcal{Q}_t} \exp(\text{sim}(c_t, \tilde{q})/\kappa)},$$

with cosine similarity $\text{sim}(a, b) = a^\top b / \|a\| \|b\|$ and temperature $\kappa = 0.1$. A diversity loss $\mathcal{L}_d$ encourages uniform codebook usage via the entropy of the average softmax probabilities. The total objective is $\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$ with $\alpha = 0.1$.
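The following sketch puts the two terms together, assuming Transformer outputs c and quantised targets q for the M masked steps of one utterance (shape (M, d)); the 100-distractor count follows the paper, while the helper functions and names are illustrative, not fairseq's:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(c: torch.Tensor, q: torch.Tensor, K: int = 100, kappa: float = 0.1):
    """InfoNCE over masked steps: target q[t] against K distractors from other masked steps."""
    M = c.size(0)
    losses = []
    for t in range(M):
        others = torch.tensor([i for i in range(M) if i != t], dtype=torch.long)
        distractors = others[torch.randperm(len(others))[:K]]
        candidates = torch.cat([q[t : t + 1], q[distractors]], dim=0)         # true target first
        sims = F.cosine_similarity(c[t : t + 1], candidates, dim=-1) / kappa  # (K + 1,)
        losses.append(F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()

def diversity_loss(avg_probs: torch.Tensor):
    """avg_probs: softmax probabilities averaged over a batch, shape (G, V).
    Minimising the negative entropy pushes codebook usage towards uniform."""
    entropy = -(avg_probs * torch.log(avg_probs + 1e-7)).sum(dim=-1)   # per-codebook entropy
    return -entropy.mean()

# total objective with alpha = 0.1:
# loss = contrastive_loss(c, q) + 0.1 * diversity_loss(avg_probs)
```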

Fine-tuning. A randomly initialised linear projection over the character vocabulary is added on top, and the model is trained with CTC loss on labelled audio. SpecAugment-style masking of the latents acts as regularisation. The CNN feature encoder stays frozen throughout fine-tuning; for the first 10 k updates only the output layer is trained, after which the Transformer is also updated.
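A sketch of this fine-tuning head, assuming the pre-trained network is exposed as something that returns (T, d) context vectors for an utterance (a hypothetical interface, not a specific library API):

```python
import torch
import torch.nn as nn

vocab_size = 32                       # e.g. blank + letters + apostrophe + word boundary
d = 768                               # Base hidden size; 1024 for Large
head = nn.Linear(d, vocab_size)       # the randomly initialised projection over the alphabet
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(features: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """features: (T, d) context vectors for one utterance; targets: (S,) character ids."""
    log_probs = head(features).log_softmax(dim=-1)          # (T, vocab_size)
    T = log_probs.size(0)
    return ctc(log_probs.unsqueeze(1),                      # CTCLoss expects (T, N, C)
               targets.unsqueeze(0),                        # (N, S)
               input_lengths=torch.tensor([T]),
               target_lengths=torch.tensor([targets.numel()]))
```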

Results and impact. Fine-tuned on LibriSpeech 960 h, wav2vec 2.0 Large reaches 1.8/3.3 WER on test-clean/test-other. With only 10 minutes of labels it achieves 4.8/8.2, opening low-resource ASR to dozens of languages. The framework was extended by HuBERT (k-means targets), WavLM (denoising plus speaker mixing), XLS-R (128-language multilingual pre-training), and MMS (1,107 languages, Meta 2023). Its representations also transfer to speaker verification, emotion recognition, and content encoding for TTS and voice-conversion systems.

Related terms: Transformer, Convolutional Neural Network, Convolution, Vector Quantisation, CTC Loss, Whisper
