SlimPajama (Soboleva, Al-Khateeb, Myers et al., Cerebras Systems blog, June 2023, https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) is a 627-billion-token cleaned and deduplicated version of RedPajama-1T, released by Cerebras Systems as a more compute-efficient pre-training substrate.
Construction
SlimPajama applies two transformations to RedPajama (a sketch of both steps follows this list):
- A low-length filter: documents are NFC-normalized, stripped of surrounding whitespace, and discarded if shorter than 200 characters, removing roughly 1.9% of RedPajama's documents (mostly metadata-only fragments).
- Global document-level deduplication with MinHash + locality-sensitive hashing at a Jaccard-similarity threshold of 0.8, removing 49.6% of the remaining bytes. The largest reductions are in Common Crawl (about 63%) and GitHub (about 46%), sub-corpora where RedPajama had retained substantial cross-snapshot and fork duplicates.
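A minimal sketch of the two steps using the open-source `datasketch` library. The 0.8 threshold, 13-gram shingling and 200-character cutoff follow the published recipe; the 128 permutations and single-process flow are illustrative simplifications, not Cerebras's actual distributed pipeline:

```python
import re
import unicodedata

from datasketch import MinHash, MinHashLSH

JACCARD_THRESHOLD = 0.8  # from the SlimPajama recipe
MIN_CHARS = 200          # low-length filter cutoff
NGRAM = 13               # SlimPajama shingles on word 13-grams
NUM_PERM = 128           # datasketch default; illustrative choice


def normalize(text: str) -> str:
    """NFC-normalize and strip whitespace, as in the low-length filter."""
    return unicodedata.normalize("NFC", text).strip()


def shingles(text: str, n: int = NGRAM) -> set[str]:
    """Lowercased, punctuation-stripped word n-grams."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash(text: str) -> MinHash:
    """Build a MinHash signature from a document's shingle set."""
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m


def slimify(documents):
    """Yield documents that pass the length filter and are not
    near-duplicates (estimated Jaccard >= 0.8) of an earlier one."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    for i, doc in enumerate(documents):
        doc = normalize(doc)
        if len(doc) < MIN_CHARS:  # low-length filter
            continue
        m = minhash(doc)
        if lsh.query(m):          # near-duplicate of a kept document
            continue
        lsh.insert(f"doc-{i}", m)
        yield doc
```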
The result is a 627 B-token corpus drawn from the same seven sources as RedPajama at roughly half the size, with the mixture rebalanced away from Common Crawl (which falls from 72.6% of RedPajama to 52.2% of SlimPajama) because the web crawl carried most of the duplicates.
Sub-corpus distribution
- Common Crawl, ≈327 B tokens (52.2%).
- C4, ≈167 B tokens (26.7%).
- GitHub, ≈33 B tokens (5.2%).
- Books, ≈26 B tokens (4.2%).
- arXiv, ≈29 B tokens (4.6%).
- Wikipedia, ≈24 B tokens (3.8%).
- Stack Exchange, ≈21 B tokens (3.3%).
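These proportions can be spot-checked against the Hugging Face release. A minimal sketch using the `datasets` library, assuming each record carries its origin in a `meta.redpajama_set_name` field as described on the dataset card; the 10,000-document sample size is an arbitrary illustration:

```python
from collections import Counter
from itertools import islice

from datasets import load_dataset

# Stream the corpus so the 627 B-token dataset is never fully downloaded.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

SAMPLE = 10_000  # arbitrary sample size for illustration
counts = Counter(
    record["meta"]["redpajama_set_name"]  # e.g. "RedPajamaCommonCrawl"
    for record in islice(ds, SAMPLE)
)

for name, n in counts.most_common():
    print(f"{name:30s} {100 * n / SAMPLE:5.1f}% of sampled documents")
```

Note that this tallies document counts rather than tokens; long-document sources such as Books occupy a larger share of tokens than of documents, so a document-level sample will not exactly reproduce the token percentages above.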
Licensing
Released on Hugging Face under the same licence structure as RedPajama: Apache 2.0 for the processing code, with the underlying texts governed by their original source licences (Common Crawl, GitHub, arXiv, Wikipedia and the Books3 / PG-19 book mixture; the Books3 source was subsequently withdrawn from distribution following copyright complaints).
Models trained on SlimPajama
BTLM-3B-8K (Cerebras's 3 B-parameter, 8 K-context base model), TinyLlama-1.1B (pre-trained on SlimPajama mixed with StarCoder data) and its chat fine-tune, several MosaicML MPT ablation runs, and a long tail of academic models in the 1-7 B-parameter range. SlimPajama has become a standard corpus for reproducible academic-scale pre-training because it fits single-node budgets (on the order of 8x A100 or 4x H100) without the data-engineering burden of full RedPajama or FineWeb.
Significance
SlimPajama provided an early, widely cited demonstration that aggressive global deduplication is one of the highest-leverage steps in pre-training data curation, a lesson reflected in the deduplication-heavy pipelines of DCLM, FineWeb and OLMo-2's Dolmino. Its compact size and clear provenance have made it a canonical mid-scale open pre-training corpus for academic work.
Related terms: RedPajama, The Pile, FineWeb and FineWeb-Edu, Common Crawl
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora