The Pile is an 825 GiB open-source language-modelling dataset released by EleutherAI in December 2020 (Gao, Biderman, Black et al., arXiv:2101.00027). It was the first large public corpus deliberately designed as a diverse mixture rather than a single web scrape, motivated by the observation that GPT-3-style models benefit from exposure to specialised text genres beyond raw Common Crawl.
Composition
The Pile aggregates 22 sub-corpora, weighted to balance scale against quality:
- Pile-CC, a custom Common Crawl extract produced with EleutherAI's cc_net pipeline (227 GiB).
- PubMed Central, biomedical full-text articles (90 GiB).
- Books3, a Bibliotik-derived corpus of 196,640 books (101 GiB), later removed for copyright reasons.
- OpenWebText2, a community replication of OpenAI's WebText corpus (63 GiB).
- ArXiv, STEM preprints (56 GiB).
- GitHub, open-source code (95 GiB).
- FreeLaw, US court opinions (51 GiB).
- Fifteen smaller sources: Stack Exchange, USPTO, Wikipedia, PubMed Abstracts, Project Gutenberg (PG-19), OpenSubtitles, DM Mathematics, Ubuntu IRC, HackerNews, YouTube Subtitles, PhilPapers, NIH ExPorter, Enron Emails, EuroParl, and BookCorpus2.
Each sub-corpus is assigned an epoch count, the number of times it is repeated in the training mixture, so that the highest-quality sources contribute disproportionately; for example, Wikipedia is repeated three times, while Pile-CC appears only once. A sketch of how this weighting translates into a sampling distribution follows.
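The Python sketch below normalises size multiplied by epochs into per-source sampling probabilities. Only the Wikipedia (3 epochs) and Pile-CC (1 epoch) figures come from the text above; the other sizes and epoch counts are placeholder assumptions, not the official Pile weights.

```python
import random

# Illustrative subset of The Pile's mixture: raw size in GiB and the number
# of epochs (repetitions) each source receives in the training mix.
# Only the Wikipedia and Pile-CC epoch counts are from the text above;
# the remaining numbers are placeholder assumptions.
sources = {
    "Pile-CC":        {"size_gib": 227.0, "epochs": 1.0},
    "PubMed Central": {"size_gib": 90.0,  "epochs": 2.0},  # assumed
    "GitHub":         {"size_gib": 95.0,  "epochs": 1.0},  # assumed
    "Wikipedia":      {"size_gib": 6.0,   "epochs": 3.0},  # size assumed
}

# Effective contribution of a source is size * epochs; normalising gives the
# probability that the next training document is drawn from that source.
effective = {name: s["size_gib"] * s["epochs"] for name, s in sources.items()}
total = sum(effective.values())
weights = {name: w / total for name, w in effective.items()}

def sample_source(rng: random.Random) -> str:
    """Pick a source according to its epoch-weighted share of the mixture."""
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

rng = random.Random(0)
print(weights)
print([sample_source(rng) for _ in range(5)])
```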
Models trained on The Pile
The Pile was the training set for GPT-Neo (2.7B), GPT-J (6B), GPT-NeoX-20B, the Pythia scaling suite (70M to 12B), Cerebras-GPT, Stability AI's StableLM, parts of MPT's training mix, and many academic models. The Pythia suite, released with 154 intermediate checkpoints per model, remains the most heavily used resource for interpretability research and training-dynamics studies.
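Pythia's intermediate checkpoints are published as named revisions on the Hugging Face Hub. A minimal sketch of loading one with the transformers library, assuming it is installed; the revision name step3000 follows the branch convention described in the Pythia model cards and may differ per model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an intermediate training checkpoint by pointing `revision` at one of
# the step-numbered branches (e.g. "step3000") on the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
)
tokenizer = AutoTokenizer.from_pretrained(
    "EleutherAI/pythia-70m-deduped",
    revision="step3000",
)

# Quick sanity check: generate a short continuation with the checkpoint.
inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(tokens[0]))
```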
Controversies
The most consequential issue is Books3: a roughly 37 GB plain-text corpus of pirated books originally compiled by Shawn Presser from the Bibliotik shadow library. After The Atlantic published a searchable database of Books3 titles in 2023 and authors discovered their works inside, EleutherAI removed Books3 from The Pile distribution. Books3 had nevertheless already been used to train LLaMA, GPT-J, BloombergGPT and Stability AI's StableLM, and it features in active litigation against Meta (Kadrey v. Meta) and Microsoft. The Pile also contains Enron emails with personally identifiable information, as well as substantial GitHub code under copyleft licences whose redistribution under permissive ML terms is contested.
Legacy
Despite the withdrawal of its original distribution, The Pile's design philosophy of an explicit, documented, weighted mixture of heterogeneous sources set the template that RedPajama, Dolma, SlimPajama and even Meta's internal LLaMA mixture follow. Its Pythia offspring remains the canonical platform for academic mechanistic-interpretability work.
Related terms: Common Crawl, OpenWebText and OpenWebText2, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora