Glossary

RedPajama

RedPajama is an open-data project initiated by Together AI, Ontocord.ai, ETH Zürich, Stanford CRFM and MILA in April 2023, with the explicit goal of reproducing the seven-source training mixture described in Meta's LLaMA paper (Touvron et al., February 2023) and releasing it under a permissive licence.

RedPajama-Data-1T

The first release, RedPajama-Data-1T, contains 1.2 trillion tokens assembled to match the proportions reported in the LLaMA paper (a short sketch of the implied sampling weights follows the list):

  • Common Crawl filtered with the CCNet pipeline, then quality-classified, 878 B tokens.
  • C4, 175 B tokens.
  • GitHub code (permissively licensed only), 59 B tokens.
  • Books from the Gutenberg-derived PG-19 plus a subset of Books3 (later removed), 26 B tokens.
  • arXiv, 28 B tokens.
  • Wikipedia (20 languages), 24 B tokens.
  • Stack Exchange, 20 B tokens.
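
As a minimal sketch, the mixture above can be expressed as token counts per source, from which per-source sampling weights follow directly. The counts come from the list above; the variable names are illustrative, not part of the release:

    # Token counts (billions) for RedPajama-Data-1T, taken from the list above.
    TOKENS_B = {
        "common_crawl":  878,
        "c4":            175,
        "github":         59,
        "books":          26,
        "arxiv":          28,
        "wikipedia":      24,
        "stackexchange":  20,
    }

    total = sum(TOKENS_B.values())  # ~1,210 B tokens, i.e. the ~1.2 T headline figure
    for source, billions in sorted(TOKENS_B.items(), key=lambda kv: -kv[1]):
        print(f"{source:>14}: {billions:>4} B tokens  ({billions / total:6.1%} of mixture)")

Summing the counts recovers the roughly 1.2-trillion-token headline figure; Common Crawl alone accounts for about 73% of the mixture.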

The project also released RedPajama-INCITE, a family of 3B- and 7B-parameter base, instruction-tuned and chat models trained on the corpus following the LLaMA recipe.
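
As an illustrative sketch, an INCITE checkpoint can be loaded with the Hugging Face transformers library. The Hub ID below is the 3B base model published under Together's togethercomputer organisation; treat the exact ID as an assumption and check the Hub for current naming:

    # Minimal sketch: load a RedPajama-INCITE checkpoint and generate a few tokens.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "togethercomputer/RedPajama-INCITE-Base-3B-v1"  # assumed Hub ID
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer("RedPajama is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))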

RedPajama-V2

The October 2023 follow-up, RedPajama-V2, expanded the Common Crawl portion alone to over 30 trillion tokens spanning 84 Common Crawl snapshots and five languages (English, French, German, Italian, Spanish). Critically, V2 ships with 40+ pre-computed quality signals per document (perplexity under a KenLM language model, line-length statistics, repetition ratios, fraction of bad words, stop-word ratios, language-detection confidence), allowing downstream users to define their own quality filters rather than committing to a fixed cleaning recipe.
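
A minimal sketch of this late-binding style, assuming documents arrive as JSON-lines records carrying a flat quality_signals mapping. The shipped V2 signals use prefixed names and span-level encodings, so the field names, thresholds and file name here are illustrative, chosen by the downstream user rather than fixed upstream:

    # Minimal sketch: user-defined filtering over pre-computed quality signals.
    import json

    def keep(record: dict) -> bool:
        signals = record["quality_signals"]
        return (
            signals["ccnet_perplexity"] < 300       # fluent under the KenLM model
            and signals["stop_word_ratio"] > 0.05   # enough function words
            and signals["repetition_ratio"] < 0.2   # not boilerplate-repetitive
        )

    with open("documents.jsonl") as f:              # hypothetical input file
        filtered = [rec for rec in map(json.loads, f) if keep(rec)]

The point of the design is that thresholds live in user code rather than in the corpus, so two teams can derive very different training sets from the same raw release.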

Licensing

Code and metadata are MIT-licensed; the underlying text inherits Common Crawl, GitHub, arXiv and Wikipedia licences. The Books3 subset was withdrawn following the same 2023 controversy that affected The Pile.

Models trained on RedPajama

Beyond Together AI's own RedPajama-INCITE and OpenChatKit models, the corpus has been used to train MosaicML's MPT-7B (as one component of its data mix), the first OpenLLaMA releases, the academic TinyLlama project (via the deduplicated SlimPajama derivative), and many smaller research models. RedPajama-V2 and its quality signals have also served as comparison baselines in the DCLM and FineWeb ablation studies.

Significance

RedPajama was the first credible attempt to fully open-source a frontier-scale LLM training corpus with documented provenance, and its quality-signal architecture pioneered the late-binding filtering approach that DCLM and FineWeb-Edu would later refine into a science.

Related terms: Common Crawl, Llama 3 / 3.1 / 3.3, The Pile, FineWeb and FineWeb-Edu, DCLM (DataComp-LM)
