LibriSpeech and LibriLight, Glossary, Textbook of AI

LibriSpeech (Panayotov, Chen, Povey & Khudanpur, ICASSP 2015) and LibriLight (Kahn, Rivière, Zheng et al., ICASSP 2020) are open English-speech corpora derived from LibriVox, the volunteer-recorded public-domain audiobook project. Together they constitute the canonical benchmarks for automatic speech recognition (ASR) and self-supervised speech representation learning.

LibriSpeech

LibriSpeech contains roughly 1,000 hours of read English speech, sampled at 16 kHz and segmented into utterances typically 10-20 seconds long. The three training subsets sum to 960 hours. The corpus is split into:

train-clean-100, 100 hours, lower-WER speakers.
train-clean-360, 360 hours, lower-WER speakers.
train-other-500, 500 hours, higher-WER speakers (regional accents, lower SNR).
dev-clean / dev-other / test-clean / test-other, roughly 5 hours each.

Speakers are disjoint across the training, development and test sets, so evaluation is on genuinely held-out voices. Reference transcripts come from the public-domain source texts of the audiobooks, aligned to the audio with the Kaldi toolkit.
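The speaker-disjoint property can be checked directly from utterance IDs. The sketch below assumes the usual LibriSpeech naming scheme, in which each utterance ID has the form speaker-chapter-utterance (for example 103-1240-0000); the helper names speaker_of and shared_speakers are hypothetical, and the IDs used are illustrative rather than a real listing of the corpus.

```python
from typing import Iterable, Set


def speaker_of(utterance_id: str) -> str:
    """Return the speaker field of an ID shaped like 'speaker-chapter-utterance'."""
    parts = utterance_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected utterance ID format: {utterance_id!r}")
    return parts[0]


def shared_speakers(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Speakers that appear in both splits; an empty set means the splits are disjoint."""
    train_speakers = {speaker_of(u) for u in train_ids}
    test_speakers = {speaker_of(u) for u in test_ids}
    return train_speakers & test_speakers


# Illustrative IDs only (assumed format, not a real corpus listing).
train_ids = ["103-1240-0000", "103-1240-0001", "1034-121119-0005"]
test_ids = ["1089-134686-0000", "1089-134686-0001"]
overlap = shared_speakers(train_ids, test_ids)
```

An empty overlap is what a speaker-disjoint split should produce; any non-empty result would mean the test set contains voices the model has already heard.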

LibriSpeech, released under CC-BY-4.0, has been the standard ASR benchmark since 2015. The progression of word error rate (WER) on test-clean is a canonical deep-learning timeline:

2015 (DNN-HMM, Kaldi nnet3), 5.5%.
2019 (Transformer, ESPnet), 2.6%.
2020 (wav2vec 2.0 large), 1.9%.
2022 (Whisper large), 2.7% zero-shot, 1.4% fine-tuned.
2024 (Conformer-LM ensembles), < 1.4%.
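For reference, the figures above are word error rates: the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal sketch using a word-level Levenshtein distance; the function names edit_distance and corpus_wer are hypothetical, and real evaluations usually apply text normalisation (casing, punctuation) first, which is omitted here.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] is the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra word in hypothesis
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total errors over total reference words, pooled across all utterances."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        errors += edit_distance(ref_words, hypothesis.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("at least one reference word is required")
    return errors / words
```

Note that the errors are pooled over the whole test set before dividing, rather than averaging per-utterance rates, so long utterances weigh more than short ones. For example, one deleted word in a six-word reference plus one substitution in a two-word reference gives 2 errors over 8 words.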

LibriLight

LibriLight (released 2020) extends the LibriVox extraction to 60,000 hours of unlabelled speech, plus small labelled subsets of 10 hours, 1 hour and 10 minutes for limited-supervision fine-tuning and evaluation. It was designed specifically to enable self-supervised speech learning. wav2vec 2.0 (Baevski et al., NeurIPS 2020) used LibriLight as its 60k-hour pretraining corpus; HuBERT, WavLM, data2vec and the speech components of AudioLM and VALL-E followed.

Licensing

LibriVox audio is public domain in the United States. LibriSpeech and LibriLight annotations are CC-BY-4.0. This makes the LibriVox-derived stack one of the few large open speech corpora that can be redistributed and used commercially with little licensing ambiguity.

Limitations

LibriSpeech is read speech, not conversational speech, so its prosody and vocabulary differ substantially from real-world ASR conditions, and it contains no turn-taking at all. Speaker demographics are skewed toward male, North American volunteer narrators, with limited age, accent and L2-speaker diversity. Domain coverage is literary, mostly late-19th- and early-20th-century novels, which biases learnt acoustic and language models. Newer benchmarks such as Common Voice, VoxPopuli, GigaSpeech and People's Speech address these limitations, but no single successor has displaced LibriSpeech as the headline ASR benchmark.

Related terms: AudioSet, Common Crawl

Discussed in:

Chapter 8: Unsupervised Learning, Speech and Audio
Chapter 13: Attention & Transformers, Training Data and Web Corpora

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).