MADLAD-400 (A Multilingual And Document-Level Large Audited Dataset; Kudugunta, Caswell, Zhang et al., NeurIPS 2023 Datasets and Benchmarks, arXiv:2309.04662) is Google Research's open multilingual web corpus, one of the largest publicly distributed multilingual training sets as of release.
Composition
The 400 in its name refers to the roughly 400 languages it covers, not a token count: the full release contains roughly 3 trillion tokens spanning 419 languages, of which 279 have at least 1 million tokens of cleaned text. It was extracted from Common Crawl with a custom pipeline:
- CLD3 + a custom transformer for fine-grained language identification at the document level.
- MoE-based filtering trained to reject machine-translated content, low-quality boilerplate, and code-switched material.
- A manual per-language audit in which reviewers inspected random samples from each language and recorded both quantitative quality judgments and qualitative failure-mode notes.
The audit step is MADLAD-400's distinctive contribution: every language entry has a documented quality classification (high, medium, low) that downstream users can filter against.
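Filtering against the audit labels can be sketched as follows. This is a minimal illustration, not MADLAD-400's actual file format: the record schema (`{"lang": ..., "text": ...}`) and the `AUDIT_QUALITY` table are hypothetical stand-ins for the per-language classifications described above.

```python
# Hypothetical sketch: keep only documents whose language meets an
# audit-quality floor. Schema and label table are illustrative assumptions.
from typing import Iterable, Iterator

# Hypothetical per-language labels, as a downstream user might derive
# them from the audit documentation.
AUDIT_QUALITY = {
    "en": "high",
    "fi": "high",
    "dv": "medium",
    "mni": "low",
}

QUALITY_ORDER = ("low", "medium", "high")

def filter_by_audit(docs: Iterable[dict], min_quality: str = "medium") -> Iterator[dict]:
    """Yield only documents whose language meets the audit-quality floor."""
    floor = QUALITY_ORDER.index(min_quality)
    for doc in docs:
        # Treat languages absent from the audit table as low quality.
        label = AUDIT_QUALITY.get(doc["lang"], "low")
        if QUALITY_ORDER.index(label) >= floor:
            yield doc

docs = [
    {"lang": "en", "text": "hello"},
    {"lang": "mni", "text": "..."},
    {"lang": "dv", "text": "..."},
]
kept = list(filter_by_audit(docs, min_quality="medium"))
# keeps the "en" and "dv" documents, drops "mni"
```

The point of the design is that the quality floor is a single tunable parameter: a coverage-focused project can lower it, while a quality-focused pretraining run can raise it.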
Two release tiers
- MADLAD-400 Noisy: the full extraction with minimal filtering, useful as a coverage baseline.
- MADLAD-400 Clean: the subset that passes the per-language quality audit, suitable for training.
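One practical way to reason about the two tiers is to measure, per language, how much of the Noisy tier survives into the Clean tier. The sketch below assumes a simple record shape (`{"lang": ...}`) purely for illustration; it is not tied to the released file format.

```python
# Hedged sketch: per-language retention rate between a noisy tier and its
# cleaned counterpart. Record structure is an illustrative assumption.
from collections import Counter

def tier_retention(noisy, clean):
    """Fraction of each language's noisy-tier documents that survive cleaning."""
    noisy_counts = Counter(doc["lang"] for doc in noisy)
    clean_counts = Counter(doc["lang"] for doc in clean)
    return {lang: clean_counts[lang] / noisy_counts[lang] for lang in noisy_counts}

noisy = [{"lang": "en"}] * 4 + [{"lang": "xx"}] * 2
clean = [{"lang": "en"}] * 3
print(tier_retention(noisy, clean))  # {'en': 0.75, 'xx': 0.0}
```

A language with near-zero retention is a signal that its web presence is dominated by the failure modes the audit targets, which is exactly the case where the Noisy tier should be used with caution.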
Licensing
Released under CC-BY-4.0 by Google, with downstream use governed by the underlying Common Crawl provenance. The audit reports themselves are released as documentation appendices.
Models trained on MADLAD-400
MADLAD-400 was used to train the MADLAD-400 translation models (3 B, 7.2 B, and 10.7 B parameters), Google's open multilingual NMT systems, as well as an 8 B decoder-only language model released alongside them. The corpus has since been reused in massively multilingual training mixtures, particularly as low-resource-language augmentation.
Significance
MADLAD-400's audited multilingual design directly addresses the long-standing problem of silent low-resource-language quality decay: many multilingual web corpora contain large fractions of machine-translated, hallucinated, or boilerplate text in languages with limited web presence, and downstream models trained on this material show pathological behaviour that goes undetected in standard evaluation. The MADLAD-400 audit makes those failure modes visible and filterable, raising the floor of responsible low-resource multilingual training.
Related terms: Common Crawl, FineWeb and FineWeb-Edu, Dolma and OLMo, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora