Glossary

Common Crawl

Common Crawl is a non-profit foundation, established in 2007 by Gil Elbaz, that operates a continuously updated, openly redistributable archive of the public web. Each monthly crawl harvests roughly 3-4 billion HTML pages, several hundred terabytes of content per snapshot, and publishes the results as WARC (Web ARChive), WAT (metadata) and WET (extracted plain text) files on Amazon S3. By 2025 the cumulative archive exceeds 250 billion pages and is several petabytes in size.
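
The WARC/WAT/WET layout is easiest to see by reading a file directly. Below is a minimal sketch using the open-source warcio library; the filename is a placeholder for any WET file listed in a crawl's wet.paths manifest.

```python
# Minimal sketch: iterate over the plain-text records of one WET file.
# Requires the third-party warcio package; the filename is a placeholder.
from warcio.archiveiterator import ArchiveIterator

WET_FILE = "CC-MAIN-example.warc.wet.gz"  # hypothetical local filename

with open(WET_FILE, "rb") as stream:
    # ArchiveIterator detects gzip automatically.
    for record in ArchiveIterator(stream):
        # WET files store extracted plain text as 'conversion' records.
        if record.rec_type != "conversion":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url, len(text))
```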

Scope and source

The crawler is a customised Apache Nutch instance that respects robots.txt, follows links breadth-first from a seed list weighted by link popularity, and deliberately samples the long tail rather than concentrating on a few mega-domains. The corpus therefore spans news, blogs, forums, e-commerce, government, academic and personal pages in dozens of languages, in rough proportion to their presence on the open web, although English remains over-represented at roughly 40-50% of tokens.
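
One way to inspect what a given snapshot actually holds for a domain is the public URL index (the CDX API at index.commoncrawl.org). A hedged sketch, assuming the requests library; the crawl ID is an example and should be replaced with a current snapshot:

```python
# Query the public Common Crawl URL index for captures of one domain.
# The crawl ID is an example; current IDs are listed at index.commoncrawl.org.
import json

import requests

CRAWL_ID = "CC-MAIN-2024-33"  # example snapshot
API = f"https://index.commoncrawl.org/{CRAWL_ID}-index"

resp = requests.get(
    API,
    params={"url": "example.com/*", "output": "json"},
    timeout=30,
)
resp.raise_for_status()

# The API returns one JSON object per line, one per capture.
for line in resp.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture.get("status", "-"), capture["url"])
```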

Licensing

Common Crawl publishes under the legal theory that crawling public pages is a fair-use activity analogous to a search-engine index, and that downstream users assume responsibility for any copyright in the underlying material. The archive itself is offered under permissive Common Crawl Terms of Use that allow research and commercial reuse. Individual pages, however, retain whatever copyright their original publishers held, a fact at the centre of several major AI training-data lawsuits filed since 2023, including The New York Times v. OpenAI and Authors Guild v. OpenAI.

Role in language-model training

Common Crawl is the single largest input to modern LLM pre-training. GPT-3 drew roughly 60% of its 300-billion-token training mixture from a filtered Common Crawl subset; LLaMA, Llama 2, Llama 3, Mistral, Falcon, Qwen and DeepSeek-V3 all rely on derivatives, and closed models such as Claude and Gemini are widely assumed to do the same. Almost every public derivative (C4, parts of The Pile, RedPajama, FineWeb, DCLM-Baseline, MADLAD-400, OSCAR, CCNet) begins with raw Common Crawl data, either the WET plain-text extracts or text re-extracted from the WARC files, and applies progressively more aggressive deduplication, language identification, quality filtering and toxicity filtering.
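
The cleaning steps these derivatives apply can be illustrated schematically. The sketch below is not any project's actual pipeline; it combines exact hash deduplication with a few C4-style heuristics, and the thresholds are invented for illustration. Real pipelines add fuzzy (MinHash) deduplication, fastText language identification and model-based quality and toxicity scoring.

```python
# Illustrative sketch of Common Crawl cleaning: exact deduplication by
# content hash plus crude C4-style quality heuristics. Thresholds invented.
import hashlib

seen_hashes: set[str] = set()

def looks_clean(text: str, min_words: int = 50) -> bool:
    """Quality heuristics loosely modeled on C4's published rules."""
    words = text.split()
    if len(words) < min_words:
        return False
    # C4 drops pages containing placeholder text such as 'lorem ipsum'.
    if "lorem ipsum" in text.lower():
        return False
    # Require that most lines end like sentences (terminal punctuation),
    # which removes much navigation and cookie-banner boilerplate.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    punctuated = sum(ln.rstrip().endswith((".", "!", "?", '"')) for ln in lines)
    return bool(lines) and punctuated / len(lines) > 0.5

def keep_document(text: str) -> bool:
    """Exact dedup plus heuristics; real pipelines go much further."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return looks_clean(text)
```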

Known issues

The raw archive contains substantial boilerplate (navigation, cookie banners, footers), duplicate content (the same page often appears across crawls), machine-generated spam, adult content, personally identifiable information, and pages from shadow libraries such as Library Genesis and Sci-Hub. Filtering pipelines remove most of this, but residual contamination of evaluation benchmarks (pages on the open web quote MMLU, HumanEval or HellaSwag items verbatim) is a persistent concern. The corpus also encodes the demographic biases of the open web: over-representation of English, Western, male and educated voices, and under-representation of Indigenous and low-resource-language communities.
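
Contamination screening is typically done by n-gram overlap. The sketch below follows the spirit of the 13-gram checks described in the GPT-3 paper; the n-gram length, function names and inputs are illustrative.

```python
# Hedged sketch of n-gram contamination screening: flag a training document
# if it shares any long word n-gram with a benchmark item.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All word n-grams of a text, lowercased for robust matching."""
    words = text.lower().split()
    return {tuple(words[i : i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(document: str, benchmark_items: list[str], n: int = 13) -> bool:
    """True if the document overlaps any benchmark item verbatim."""
    bench_grams: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        bench_grams |= ngrams(item, n)
    return bool(ngrams(document, n) & bench_grams)
```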

Modern relevance

Despite the rise of curated alternatives, no organisation has yet assembled a public corpus of comparable scale and diversity, and Common Crawl remains the de facto bedrock of open language-model training. Its continued existence as an independent non-profit, rather than the proprietary asset of any single AI lab, is one of the few structural counterweights to the closed-data trend in frontier-model development.

Related terms: C4 (Colossal Clean Crawled Corpus), The Pile, FineWeb and FineWeb-Edu, RedPajama, Language Model, GPT-3
