MADLAD-400 (A Multilingual And Document-Level Large Audited Dataset; Kudugunta, Caswell, Zhang et al., NeurIPS 2023 Datasets and Benchmarks, arXiv:2309.04662) is Google Research's open multilingual web corpus, one of the largest publicly distributed multilingual training sets as of release.
Composition
The 400 in its name refers to the roughly 400 languages it covers, not a token count: the full release contains roughly 3 trillion tokens spanning 419 languages, of which 279 have at least 1 million tokens of cleaned text. It was extracted from Common Crawl with a custom pipeline:
- CLD3 + a custom transformer for fine-grained language identification at the document level.
- MoE-based filtering trained to reject machine-translated content, low-quality boilerplate, and code-switched material.
- A manual per-language audit in which reviewers inspected random samples from each language and recorded both quantitative quality judgments and qualitative failure-mode notes.
The audit step is MADLAD-400's distinctive contribution: every language entry has a documented quality classification (high, medium, low) that downstream users can filter against.
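Filtering against the audit labels can be sketched as follows. This is a minimal illustration, not MADLAD-400's actual file format: the record schema (`{"lang": ..., "text": ...}`) and the `AUDIT_QUALITY` table are hypothetical stand-ins for the per-language classifications described above.

```python
# Hypothetical sketch: keep only documents whose language meets an
# audit-quality floor. Schema and label table are illustrative assumptions.
from typing import Iterable, Iterator

# Hypothetical per-language labels, as a downstream user might derive
# them from the audit documentation.
AUDIT_QUALITY = {
    "en": "high",
    "fi": "high",
    "dv": "medium",
    "mni": "low",
}

QUALITY_ORDER = ("low", "medium", "high")

def filter_by_audit(docs: Iterable[dict], min_quality: str = "medium") -> Iterator[dict]:
    """Yield only documents whose language meets the audit-quality floor."""
    floor = QUALITY_ORDER.index(min_quality)
    for doc in docs:
        # Treat languages absent from the audit table as low quality.
        label = AUDIT_QUALITY.get(doc["lang"], "low")
        if QUALITY_ORDER.index(label) >= floor:
            yield doc

docs = [
    {"lang": "en", "text": "hello"},
    {"lang": "mni", "text": "..."},
    {"lang": "dv", "text": "..."},
]
kept = list(filter_by_audit(docs, min_quality="medium"))
# keeps the "en" and "dv" documents, drops "mni"
```

The point of the design is that the quality floor is a single tunable parameter: a coverage-focused project can lower it, while a quality-focused pretraining run can raise it.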
Two release tiers
- MADLAD-400 Noisy: the full extraction with minimal filtering, useful as a coverage baseline.
- MADLAD-400 Clean: the subset that passes the per-language quality audit, suitable for training.
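One practical way to reason about the two tiers is to measure, per language, how much of the Noisy tier survives into the Clean tier. The sketch below assumes a simple record shape (`{"lang": ...}`) purely for illustration; it is not tied to the released file format.

```python
# Hedged sketch: per-language retention rate between a noisy tier and its
# cleaned counterpart. Record structure is an illustrative assumption.
from collections import Counter

def tier_retention(noisy, clean):
    """Fraction of each language's noisy-tier documents that survive cleaning."""
    noisy_counts = Counter(doc["lang"] for doc in noisy)
    clean_counts = Counter(doc["lang"] for doc in clean)
    return {lang: clean_counts[lang] / noisy_counts[lang] for lang in noisy_counts}

noisy = [{"lang": "en"}] * 4 + [{"lang": "xx"}] * 2
clean = [{"lang": "en"}] * 3
print(tier_retention(noisy, clean))  # {'en': 0.75, 'xx': 0.0}
```

A language with near-zero retention is a signal that its web presence is dominated by the failure modes the audit targets, which is exactly the case where the Noisy tier should be used with caution.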
Licensing
Released under CC-BY-4.0 by Google, with downstream use governed by the underlying Common Crawl provenance. The audit reports themselves are released as documentation appendices.
Models trained on MADLAD-400
MADLAD-400 was used to train the MADLAD-400 translation models (3 B, 7.2 B, and 10.7 B parameters), Google's open multilingual NMT systems, as well as an 8 B decoder-only language model released alongside them. The corpus has since been reused in massively multilingual training mixtures, particularly as low-resource-language augmentation.
Significance
MADLAD-400's audited multilingual design directly addresses the long-standing problem of silent low-resource-language quality decay: many multilingual web corpora contain large fractions of machine-translated, hallucinated, or boilerplate text in languages with limited web presence, and downstream models trained on this material show pathological behaviour that goes undetected in standard evaluation. The MADLAD-400 audit makes those failure modes visible and filterable, raising the floor of responsible low-resource multilingual training.
Related terms: Common Crawl, FineWeb and FineWeb-Edu, Dolma and OLMo, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora