SlimPajama (Soboleva, Al-Khateeb, Myers et al., Cerebras Systems blog, June 2023, https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama) is a 627-billion-token cleaned and deduplicated version of RedPajama-1T, released by Cerebras Systems as a more compute-efficient pre-training substrate.
Construction
SlimPajama applies two transformations to RedPajama (a sketch of both steps follows this list):
- A low-length filter: documents are NFC-normalized, stripped of surrounding whitespace, and discarded if shorter than 200 characters, removing roughly 1.9% of RedPajama's documents (mostly metadata-only fragments).
- Global document-level deduplication with MinHash + locality-sensitive hashing at a Jaccard-similarity threshold of 0.8, removing 49.6% of the remaining bytes. The largest reductions are in Common Crawl (about 63%) and GitHub (about 46%), sub-corpora where RedPajama had retained substantial cross-snapshot and fork duplicates.
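A minimal sketch of the two steps using the open-source `datasketch` library. The 0.8 threshold, 13-gram shingling and 200-character cutoff follow the published recipe; the 128 permutations and single-process flow are illustrative simplifications, not Cerebras's actual distributed pipeline:

```python
import re
import unicodedata

from datasketch import MinHash, MinHashLSH

JACCARD_THRESHOLD = 0.8  # from the SlimPajama recipe
MIN_CHARS = 200          # low-length filter cutoff
NGRAM = 13               # SlimPajama shingles on word 13-grams
NUM_PERM = 128           # datasketch default; illustrative choice


def normalize(text: str) -> str:
    """NFC-normalize and strip whitespace, as in the low-length filter."""
    return unicodedata.normalize("NFC", text).strip()


def shingles(text: str, n: int = NGRAM) -> set[str]:
    """Lowercased, punctuation-stripped word n-grams."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash(text: str) -> MinHash:
    """Build a MinHash signature from a document's shingle set."""
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m


def slimify(documents):
    """Yield documents that pass the length filter and are not
    near-duplicates (estimated Jaccard >= 0.8) of an earlier one."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    for i, doc in enumerate(documents):
        doc = normalize(doc)
        if len(doc) < MIN_CHARS:  # low-length filter
            continue
        m = minhash(doc)
        if lsh.query(m):          # near-duplicate of a kept document
            continue
        lsh.insert(f"doc-{i}", m)
        yield doc
```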
The result is a 627 B-token corpus drawn from the same seven sources as RedPajama at roughly half the size, with the mixture rebalanced away from Common Crawl (which falls from 72.6% of RedPajama to 52.2% of SlimPajama) because the web crawl carried most of the duplicates.
Sub-corpus distribution
- Common Crawl, ≈327 B tokens (52.2%).
- C4, ≈167 B tokens (26.7%).
- GitHub, ≈33 B tokens (5.2%).
- Books, ≈26 B tokens (4.2%).
- arXiv, ≈29 B tokens (4.6%).
- Wikipedia, ≈24 B tokens (3.8%).
- Stack Exchange, ≈21 B tokens (3.3%).
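These proportions can be spot-checked against the Hugging Face release. A minimal sketch using the `datasets` library, assuming each record carries its origin in a `meta.redpajama_set_name` field as described on the dataset card; the 10,000-document sample size is an arbitrary illustration:

```python
from collections import Counter
from itertools import islice

from datasets import load_dataset

# Stream the corpus so the 627 B-token dataset is never fully downloaded.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

SAMPLE = 10_000  # arbitrary sample size for illustration
counts = Counter(
    record["meta"]["redpajama_set_name"]  # e.g. "RedPajamaCommonCrawl"
    for record in islice(ds, SAMPLE)
)

for name, n in counts.most_common():
    print(f"{name:30s} {100 * n / SAMPLE:5.1f}% of sampled documents")
```

Note that this tallies document counts rather than tokens; long-document sources such as Books occupy a larger share of tokens than of documents, so a document-level sample will not exactly reproduce the token percentages above.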
Licensing
Released on Hugging Face under the same licence structure as RedPajama: Apache 2.0 for the processing code, with the underlying texts governed by their original source licences (Common Crawl, GitHub, arXiv, Wikipedia and the Books3 / PG-19 book mixture; the Books3 source was subsequently withdrawn from distribution following copyright complaints).
Models trained on SlimPajama
BTLM-3B-8K (Cerebras's 3 B-parameter, 8 K-context base model), TinyLlama-1.1B (pre-trained on SlimPajama mixed with StarCoder data) and its chat fine-tune, several MosaicML MPT ablation runs, and a long tail of academic models in the 1-7 B-parameter range. SlimPajama has become a standard corpus for reproducible academic-scale pre-training because it fits single-node budgets (on the order of 8x A100 or 4x H100) without the data-engineering burden of full RedPajama or FineWeb.
Significance
SlimPajama provided an early, widely cited demonstration that aggressive global deduplication is one of the highest-leverage steps in pre-training data curation, a lesson reflected in the deduplication-heavy pipelines of DCLM, FineWeb and OLMo-2's Dolmino. Its compact size and clear provenance have made it a canonical mid-scale open pre-training corpus for academic work.
Related terms: RedPajama, The Pile, FineWeb and FineWeb-Edu, Common Crawl
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora