DCLM, the DataComp for Language Models, is a community benchmark and reference corpus released in June 2024 by a consortium led by Apple, the University of Washington, Tel Aviv University and Toyota Research Institute (Li, Fang, Smyrnis et al., arXiv:2406.11794). It treats data curation as a first-class research problem in modern LLM development, holding model architecture, optimiser and total compute fixed so that entries compete solely on data choices.
DCLM-Pool
The benchmark begins with DCLM-Pool, a corpus of 240 trillion tokens extracted with the resiliparse text extractor from all Common Crawl snapshots from 2008 through 2022. This is the largest publicly released pre-training pool to date, intended as the substrate from which participants design filtering pipelines.
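For concreteness, here is a minimal sketch of that extraction step using resiliparse (the extractor DCLM used) together with FastWARC, its companion WARC reader from the same toolkit. The input filename is illustrative; real pipelines stream many thousands of WARC files per snapshot.

```python
from fastwarc.warc import ArchiveIterator, WarcRecordType
from resiliparse.parse.encoding import detect_encoding
from resiliparse.extract.html2text import extract_plain_text

def iter_pages(warc_path):
    """Yield (url, plain_text) for every HTML response record in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            raw = record.reader.read()
            html = raw.decode(detect_encoding(raw), errors="replace")
            # main_content=True drops navigation, footers and other boilerplate
            text = extract_plain_text(html, main_content=True)
            if text.strip():
                yield record.headers.get("WARC-Target-URI"), text

# Illustrative filename, not a real Common Crawl path
for url, text in iter_pages("CC-MAIN-2022-40.warc.gz"):
    print(url, len(text))
```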
Filtering tracks
Participants choose a compute scale (400 M, 1 B, 3 B, or 7 B parameters, each trained for a roughly Chinchilla-optimal token count) and submit a filtered subset of DCLM-Pool. Models are trained with a fixed OpenLM recipe and evaluated on 53 downstream tasks spanning MMLU, ARC, HellaSwag, GSM8K, HumanEval and a battery of natural-language-understanding suites. Because architecture, hyperparameters and compute are all held fixed, differences on the leaderboard are unambiguously attributable to data choices.
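As a back-of-envelope check on those budgets, the Chinchilla rule of thumb of roughly 20 training tokens per parameter gives the token counts below. The benchmark's exact per-track budgets differ slightly from this estimate (the 7B-1x track, for instance, uses 138 B tokens).

```python
# Chinchilla rule of thumb: ~20 training tokens per model parameter.
TOKENS_PER_PARAM = 20

for params in (400e6, 1e9, 3e9, 7e9):
    tokens = TOKENS_PER_PARAM * params
    print(f"{params / 1e9:.1f}B params -> ~{tokens / 1e9:.0f}B tokens")

# Output:
# 0.4B params -> ~8B tokens
# 1.0B params -> ~20B tokens
# 3.0B params -> ~60B tokens
# 7.0B params -> ~140B tokens
```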
DCLM-Baseline
The reference filter, DCLM-Baseline, applies (i) RefinedWeb-style heuristic cleaning, including the Gopher quality rules, (ii) Bloom-filter near-deduplication, and (iii) a fastText classifier trained to separate OpenHermes-2.5 and r/ExplainLikeImFive answers from random RefinedWeb pages, keeping roughly the top 10% of pages by score. The result is a 3.8 T-token corpus that, at the 7 B-parameter / 2.5 T-token compute budget, achieves 64% on MMLU, comparable to Llama-3-8B trained on a much larger but less aggressively curated corpus.
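A minimal sketch of the classifier step, assuming a fastText quality model has already been trained and saved locally: the model path, label name and page list are all illustrative assumptions, while the top-10% cutoff matches the paper's setting.

```python
import fasttext
import numpy as np

# Hypothetical path to a trained quality classifier (positives: OpenHermes-2.5
# and ELI5 answers; negatives: random RefinedWeb pages).
model = fasttext.load_model("quality_classifier.bin")

def quality_score(text: str) -> float:
    """Probability that the page resembles the high-quality class."""
    # fastText scores one line at a time, so flatten newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)  # label name assumed

pages = ["Example page one ...", "Example page two ..."]  # stand-in corpus
scores = np.array([quality_score(p) for p in pages])
cutoff = np.quantile(scores, 0.90)  # keep roughly the top 10% by score
kept = [p for p, s in zip(pages, scores) if s >= cutoff]
```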
Licensing and significance
DCLM-Baseline is distributed under CC-BY-4.0 with provenance metadata pointing back to the source Common Crawl records. The DCLM-7B reference model and the full filtering code are released openly.
DCLM's contribution is methodological as much as substantive: it institutionalises the comparison of data pipelines under controlled compute, much as GLUE and SuperGLUE institutionalised model comparison a decade earlier. It also provides empirical backing for the now-standard claim that a smaller, aggressively quality-filtered corpus can match or exceed a much larger, loosely filtered one, the philosophical core of FineWeb-Edu, OLMo-2 and Apple's own production training stacks.
Related terms: FineWeb and FineWeb-Edu, Common Crawl, RedPajama, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 15: Modern AI