Books1, Books2 and Books3 are three separate book-text corpora that have featured prominently in frontier LLM training mixtures. Their provenance, legal status and resulting controversy diverge sharply.
Books1 (BookCorpus)
Books1 is widely identified with BookCorpus, a corpus assembled by Zhu, Kiros, Zemel et al. for Aligning Books and Movies (ICCV 2015) by scraping 11,038 free e-books from Smashwords. It contains roughly 985 million words of long-form prose, predominantly self-published romance and fantasy novels. BookCorpus was used to train the original BERT and GPT-1, and GPT-3's mixture includes Books1 at 12 B tokens with an 8% sampling weight. A 2021 audit by Bandy and Vincent found that thousands of the books are de facto duplicates and that some titles' Smashwords licences prohibited free redistribution; even so, because the books were offered for free by their authors, Books1 has drawn far less legal controversy than Books2 or Books3.
Books2
Books2 is the undisclosed 55 B-token book corpus used in GPT-3, likewise accounting for 8% of the training mixture by sampling weight. OpenAI has never disclosed its provenance, which has fed extensive speculation that it includes content from shadow libraries such as Library Genesis or Z-Library. The 2023 Authors Guild v. OpenAI complaint specifically alleges that Books2 was sourced from pirate repositories.
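A quick back-of-the-envelope check shows how differently the two corpora were weighted relative to their size. The sketch below uses only figures reported in the GPT-3 paper: the corpus token counts, the sampling weights, and the roughly 300 billion tokens of total training.

```python
# Effective epochs for GPT-3's book corpora, computed from the paper's
# reported dataset sizes, sampling weights, and ~300B total training tokens.
TOTAL_TRAINING_TOKENS = 300e9

corpora = {
    # name: (tokens in corpus, sampling weight in the mixture)
    "Books1": (12e9, 0.08),
    "Books2": (55e9, 0.08),
}

for name, (size, weight) in corpora.items():
    tokens_drawn = TOTAL_TRAINING_TOKENS * weight  # tokens sampled from this corpus
    epochs = tokens_drawn / size                   # passes over the corpus
    print(f"{name}: {tokens_drawn / 1e9:.0f}B tokens drawn, about {epochs:.2f} epochs")
```

This matches the paper's own accounting: Books1 is traversed roughly twice during training (the paper reports 1.9 epochs), while less than half of Books2 is ever seen (0.43 epochs).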
Books3
Books3 is the most controversial of the three: a 37 GB, 196,640-book plain-text corpus compiled in 2020 by independent researcher Shawn Presser, who scraped the Bibliotik shadow library and converted its EPUB files to plain text. Presser explicitly stated that he intended Books3 as an open replication of OpenAI's Books2. The corpus was incorporated into The Pile by EleutherAI and used to train GPT-J, GPT-Neo, GPT-NeoX-20B, Pythia, LLaMA, BloombergGPT and Stability AI's StableLM; plaintiffs allege it was also used for LLaMA-2, whose training mix Meta never disclosed.
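Presser reportedly produced the plain text with a modified version of the epub2txt tool. Purely as an illustration, the sketch below shows a minimal equivalent of that EPUB-to-text step in Python; the ebooklib and beautifulsoup4 packages and the directory layout are assumptions, not part of the original pipeline.

```python
# Minimal EPUB -> plain-text conversion sketch (hypothetical; Books3 itself
# was produced with a modified epub2txt, not this code).
# Requires: pip install ebooklib beautifulsoup4
from pathlib import Path

import ebooklib
from bs4 import BeautifulSoup
from ebooklib import epub


def epub_to_text(path: str) -> str:
    """Concatenate the visible text of every XHTML document in an EPUB."""
    book = epub.read_epub(path)
    chunks = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        chunks.append(soup.get_text(separator="\n", strip=True))
    return "\n\n".join(chunks)


if __name__ == "__main__":
    # "books/" is a hypothetical input directory of scraped EPUB files.
    for epub_path in Path("books").glob("*.epub"):
        out = epub_path.with_suffix(".txt")
        out.write_text(epub_to_text(str(epub_path)), encoding="utf-8")
```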
In August 2023 The Atlantic reported on the contents of Books3 and subsequently published a searchable index that allowed authors to check whether their books had been included. The list contained almost every major contemporary novelist. Kadrey, Silverman & Golden v. Meta, filed in July 2023, had already named Books3 directly, and further lawsuits followed within weeks of the article. The Danish Rights Alliance issued a take-down notice to The Eye, which hosted Books3, in August 2023, after which Books3 was removed from public distribution. EleutherAI subsequently withdrew Books3 from new releases of The Pile.
Status of LLMs trained on Books3
Models trained before the take-down, including LLaMA-1 and the Pythia suite (and, if the plaintiffs' allegations hold, LLaMA-2), cannot be retroactively un-trained on Books3. Whether their continued distribution constitutes copyright infringement is the central legal question in Kadrey v. Meta (where a November 2023 ruling dismissed most ancillary claims but let the core unauthorized-copying claim proceed) and in Authors Guild v. OpenAI, which also names Microsoft as a defendant. The eventual resolution will materially shape what counts as legitimate pre-training data for the next generation of frontier models.
Related terms: The Pile, GPT-3, Llama 3 / 3.1 / 3.3, Common Crawl
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 16: Ethics & Safety, Ethics, Safety and Alignment