Whisper is an automatic speech recognition (ASR) system released by OpenAI in September 2022 (Radford et al., Robust Speech Recognition via Large-Scale Weak Supervision). It is a sequence-to-sequence Transformer that maps log-mel spectrograms to token sequences, jointly performing transcription, translation, language identification, and voice-activity detection in a single model.
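For orientation, the open-source `openai-whisper` package exposes this single-model interface directly. The snippet below is a minimal usage sketch; the model size and audio file name are placeholders, and decoding an audio path requires ffmpeg to be installed.

```python
# Minimal usage sketch with the openai-whisper package (pip install openai-whisper).
# "base" and "speech.mp3" are placeholders, not recommendations.
import whisper

model = whisper.load_model("base")        # sizes: tiny, base, small, medium, large
result = model.transcribe("speech.mp3")   # language is auto-detected by default
print(result["language"], result["text"])

# The same model handles X-to-English translation via the task tag.
result_en = model.transcribe("speech.mp3", task="translate")
print(result_en["text"])
```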
Audio front-end. Audio is resampled to 16 kHz, split into 30-second chunks (zero-padded if shorter), and converted to an 80-channel log-magnitude mel spectrogram with a 25 ms window and 10 ms hop. This yields a feature tensor of shape $80 \times 3000$ per chunk. Two 1-D convolutions (stride 1 and 2, GELU activation, kernel size 3) downsample the time axis by a factor of two and project to the model dimension $d_{\text{model}}$, giving 1500 encoder positions.
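The shapes above can be verified with a short PyTorch/torchaudio sketch. The window, hop, and mel-channel settings mirror the description; the filterbank details, padding choices, and `d_model = 512` are illustrative assumptions rather than the released preprocessing code.

```python
# Sketch of Whisper's audio front-end shapes (assumptions noted above).
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16_000
N_SAMPLES = 30 * SAMPLE_RATE           # one 30-second chunk = 480,000 samples
N_FFT, HOP, N_MELS = 400, 160, 80      # 25 ms window, 10 ms hop at 16 kHz
d_model = 512                          # e.g. a mid-sized model dimension (assumption)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)

waveform = torch.zeros(1, N_SAMPLES)            # stand-in for a zero-padded chunk
spec = mel(waveform).clamp(min=1e-10).log10()   # log-magnitude mel spectrogram
spec = spec[..., :3000]                         # 80 x 3000 feature tensor per chunk

# Two 1-D convolutions: kernel size 3, stride 1 then stride 2, GELU activations.
conv1 = nn.Conv1d(N_MELS, d_model, kernel_size=3, stride=1, padding=1)
conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
x = nn.functional.gelu(conv1(spec))
x = nn.functional.gelu(conv2(x))
print(x.shape)   # torch.Size([1, 512, 1500]) -> 1500 encoder positions
```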
Architecture. The encoder is a stack of pre-norm Transformer blocks with sinusoidal positional embeddings; the decoder is a causal Transformer with learned positional embeddings, cross-attending to the encoder output. Five sizes were released (tiny, 39M parameters, up to large, 1.55B), all sharing the same encoder-decoder topology. Multi-head self-attention follows the standard formulation:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V.$$
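A compact implementation of this formula, with the optional causal mask used by the decoder's self-attention, is a useful reference point; it is a minimal sketch rather than the model's fused attention kernels.

```python
# Scaled dot-product attention as written above (minimal sketch).
import math
import torch

def attention(q, k, v, causal=False):
    # q, k, v: (batch, heads, seq_len, d_k); cross-attention takes k and v
    # from the encoder output instead of the decoder states.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, L_q, L_k)
    if causal:
        # Mask future positions for the decoder's self-attention.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```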
Multitask interface. Rather than training separate models, Whisper conditions on special tokens at the start of the decoder sequence: a language tag (e.g. <|en|>), a task tag (<|transcribe|> or <|translate|>), and timestamp tokens. This lets one network handle 99 languages, X-to-English translation, and segment-level timestamping. The training objective is the standard cross-entropy next-token loss:
$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t}, x\right).$$
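A sketch of how the conditioning and the loss fit together: the special-token strings follow the paper's multitask format, while the loss helper and its tensor shapes are illustrative assumptions rather than the actual tokenizer or training loop.

```python
# Multitask conditioning tokens and the next-token training loss (sketch).
import torch
import torch.nn.functional as F

# Decoder context for "transcribe English audio, no timestamps":
prompt_tokens = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]

def next_token_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab), predicted from y_<t and the encoder output x.
    # targets: (batch, seq_len) token ids y_t. Cross-entropy accumulates the
    # per-step negative log-probabilities, matching the objective above.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```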
Training data. 680,000 hours of audio-transcript pairs were scraped from the web; 117,000 hours cover 96 non-English languages and 125,000 hours are X-to-English translation pairs. Heuristic filters remove machine-generated transcripts. Crucially, no human-annotated benchmark data was used for training, so the reported zero-shot WERs on LibriSpeech, Common Voice, and TED-LIUM reflect genuine generalisation.
Decoding tricks. Whisper uses temperature fallback (raising the sampling temperature on low-confidence chunks), compression-ratio and average-log-probability heuristics to detect hallucinated repetitions, and previous-text conditioning for long-form coherence. Timestamp tokens are interleaved with text tokens, enabling time-aligned subtitles without a separate forced-alignment step.
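The fallback logic can be sketched as follows. Here `decode_fn`, the temperature schedule, and the thresholds stand in for the model call and the reference implementation's defaults; they are assumptions for illustration.

```python
# Schematic of temperature fallback with repetition heuristics (sketch).
import zlib

def compression_ratio(text: str) -> float:
    # Highly repetitive (hallucinated) text compresses unusually well.
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))

def transcribe_chunk(decode_fn, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                     max_compression_ratio=2.4, min_avg_logprob=-1.0):
    # decode_fn(temperature) -> (text, avg_logprob); assumed to wrap the model.
    for t in temperatures:
        text, avg_logprob = decode_fn(t)
        if compression_ratio(text) <= max_compression_ratio and avg_logprob >= min_avg_logprob:
            return text    # confident, non-repetitive output: accept it
    return text            # otherwise keep the last (highest-temperature) attempt
```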
Robustness. Compared with supervised models such as wav2vec 2.0 fine-tuned on LibriSpeech, Whisper-large halves the average WER on out-of-distribution datasets while remaining competitive in-distribution. The weakly supervised, large-scale recipe, analogous to CLIP in vision-language learning, is the paper's central thesis: scale and diversity of supervision matter more than annotation quality.
Limitations. Whisper hallucinates plausible but fictional text on silent or noisy segments, especially in low-resource languages, and its 30-second receptive field forces ad-hoc chunking for long audio. It is also offline-only: there is no streaming variant, so latency-sensitive applications still rely on architectures such as the RNN-Transducer and Conformer-based streaming ASR.
Related terms: Transformer, Attention Mechanism, Cross-Entropy Loss, CTC Loss, Conformer, wav2vec 2.0
Discussed in:
- Chapter 12: Sequence Models, Speech Recognition