Glossary

C4 (Colossal Clean Crawled Corpus)

C4, the Colossal Clean Crawled Corpus, is a heavily filtered single-snapshot derivative of Common Crawl released by Colin Raffel and colleagues at Google with the T5 paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., JMLR 2020). It became one of the most widely re-used pre-training corpora of the early transformer era.

Construction

C4 begins with the April 2019 Common Crawl snapshot (the WET plain-text extracts) and applies a sequence of deterministic heuristic filters, ending with a language filter (a Python sketch follows the list):

  1. Keep only lines ending in a terminal punctuation mark (full stop, exclamation mark, question mark, or closing quotation mark).
  2. Discard pages with fewer than five sentences, and drop any line with fewer than three words.
  3. Remove pages containing any token from a published bad-words list (the so-called List of Dirty, Naughty, Obscene and Otherwise Bad Words).
  4. Drop pages with curly braces (a heuristic for code or template noise).
  5. Drop boilerplate strings ("lorem ipsum", "javascript", privacy-policy phrases).
  6. Deduplicate at the three-sentence span level.
  7. Keep only pages that langdetect classifies as English with probability of at least 0.99.
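
The per-page logic is simple enough to sketch directly. The following is a minimal, illustrative reimplementation in Python, not the actual pipeline: BAD_WORDS is a placeholder for the blocklist, the deduplication is a simplified non-overlapping version of the three-sentence-span rule, and at real scale rules 6 and 7 run as distributed jobs over the whole crawl (the released TensorFlow Datasets builder does this work as an Apache Beam job).

    import re

    from langdetect import detect_langs  # pip install langdetect
    from langdetect.lang_detect_exception import LangDetectException

    TERMINAL_PUNCT = ('.', '!', '?', '"', "'")
    BOILERPLATE = ("lorem ipsum", "javascript")
    BAD_WORDS = {"placeholder"}  # stands in for the LDNOOBW blocklist

    seen_spans: set[str] = set()  # rule 6 state; a distributed job in practice

    def clean_page(text: str) -> str | None:
        """Apply rules 1-7 to one page; return cleaned text, or None to discard it."""
        if "{" in text or "}" in text:  # rule 4: curly braces suggest code or templates
            return None

        kept = []
        for line in (ln.strip() for ln in text.splitlines()):
            if not line.endswith(TERMINAL_PUNCT):  # rule 1: terminal punctuation
                continue
            if len(line.split()) < 3:  # rule 2, line half: at least three words
                continue
            if any(b in line.lower() for b in BOILERPLATE):  # rule 5: boilerplate strings
                continue
            kept.append(line)

        body = " ".join(kept)
        sentences = re.split(r"(?<=[.!?])\s+", body)
        if len(sentences) < 5:  # rule 2, page half: at least five sentences
            return None
        if set(re.findall(r"[\w']+", body.lower())) & BAD_WORDS:  # rule 3: blocklist
            return None

        # Rule 6, simplified to non-overlapping windows: keep a three-sentence
        # span only the first time it is seen anywhere in the corpus.
        deduped = []
        for i in range(0, len(sentences), 3):
            span = " ".join(sentences[i:i + 3])
            if span not in seen_spans:
                seen_spans.add(span)
                deduped.append(span)
        page = " ".join(deduped)

        try:
            langs = detect_langs(page)  # rule 7: English with probability >= 0.99
        except LangDetectException:
            return None
        if not langs or langs[0].lang != "en" or langs[0].prob < 0.99:
            return None
        return page

A real run streams WET records through clean_page and writes the survivors to output shards.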

The resulting corpus is roughly 750 GB of plain text, around 156 billion tokens in the T5 SentencePiece vocabulary.
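
As a sanity check on what "tokens in the T5 SentencePiece vocabulary" means, a snippet can be tokenized with the T5 tokenizer, which uses the same 32k SentencePiece vocabulary across all T5 sizes (assumes the transformers and sentencepiece packages are installed):

    from transformers import T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    ids = tokenizer("C4 is a cleaned Common Crawl corpus.").input_ids
    print(len(ids))  # SentencePiece token count, including the </s> terminator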

Variants

The original English C4 spawned several derivatives: mC4 (the multilingual version covering 101 languages, used to train mT5), C4 RealNews-like (the same pipeline restricted to news domains), C4 WebText-like (restricted to URLs shared on Reddit with a score of at least 3, approximating OpenAI's WebText), and the en.noblocklist and en.noclean configurations, which skip the bad-words filter and the cleaning heuristics respectively.
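
All of these configurations can be streamed from the Allen Institute's prepared copy on the Hugging Face Hub without downloading the full corpus; a minimal example, assuming the datasets library is installed:

    from datasets import load_dataset

    # Configs include "en", "en.noblocklist", "en.noclean", "realnewslike",
    # and "multilingual" (mC4); each record carries text, url, and timestamp.
    c4 = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)
    for example in c4.take(3):
        print(example["url"], example["timestamp"])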

Licensing and audit

C4 is distributed via TensorFlow Datasets, which requires users to regenerate the corpus from the raw Common Crawl files, and via a prepared copy released by the Allen Institute for AI on Hugging Face (allenai/c4) under ODC-BY and the Common Crawl terms of use. A 2021 audit by the Allen Institute for AI (Dodge et al.) found significant amounts of machine-generated and machine-translated text, patent filings, and US-government documents, plus a long tail of adult and conspiracy-site content that the bad-words filter had failed to catch; the same filter also disproportionately removed benign text by and about minority groups. The audit is a clear demonstration that simple keyword filters are insufficient quality control at web scale.

Models trained on C4

C4 trained the original T5 family (Small through 11B parameters) and was the pre-training corpus for T5.1.1 and UL2; it also supplied roughly 15% of LLaMA's training mixture. The Pile built its own Common Crawl subset (Pile-CC) rather than reusing C4, and later web-corpus efforts such as RedPajama and FineWeb have largely superseded it. C4 nevertheless remains a standard reference corpus for research that needs a clean, reproducible English Common Crawl extract of a known fixed size.

Related terms: Common Crawl, The Pile, FineWeb and FineWeb-Edu, Language Model
