WebText is the closed pre-training corpus assembled by OpenAI for GPT-2 (Radford, Wu, Child et al. 2019), and WebText2 is its enlarged successor used as one of the five mixture components for GPT-3 (Brown, Mann, Ryder et al. 2020). Neither dataset has been released; both are known only from the brief descriptions in the GPT-2 and GPT-3 papers.
Construction
OpenAI scraped every outbound link posted to Reddit that received at least 3 karma ("a heuristic indicator for whether other users found the link interesting, educational, or just funny"). Pages were de-duplicated, cleaned of HTML markup using the Dragnet and Newspaper content extractors, and stripped of all Wikipedia documents (Wikipedia is a common source for evaluation test sets, so removing it simplified analysis of train-test overlap). The cut-off was December 2017 for WebText; WebText2 extended the scrape over a longer period, reportedly to roughly mid-2019.
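A minimal sketch of the link-selection step, assuming a Pushshift-style Reddit submissions dump as input (the JSON field names url, score, is_self and created_utc come from that dump format; OpenAI's actual pipeline has never been published):

```python
import json
from datetime import datetime, timezone

KARMA_THRESHOLD = 3   # GPT-2 paper: keep links with at least 3 karma
CUTOFF = datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp()  # WebText: no links after Dec 2017

def webtext_style_links(submission_lines):
    """Yield de-duplicated outbound URLs passing a WebText-style filter.

    `submission_lines` is an iterable of JSON lines from a Pushshift Reddit
    submissions dump; the field names used here are assumptions about that
    dump format, not details of OpenAI's unreleased pipeline.
    """
    seen = set()
    for line in submission_lines:
        post = json.loads(line)
        if post.get("is_self"):                      # self-posts carry no outbound link
            continue
        if post.get("score", 0) < KARMA_THRESHOLD:   # karma filter
            continue
        if post.get("created_utc", 0) > CUTOFF:      # respect the December 2017 cut-off
            continue
        url = post.get("url", "")
        if not url or "wikipedia.org" in url:        # Wikipedia documents were removed
            continue
        if url not in seen:                          # de-duplicate links
            seen.add(url)
            yield url
```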
The 45 million candidate links were filtered down to slightly over 8 million documents, totalling about 40 GB of plain text and roughly 9 billion tokens for WebText. WebText2 grew to approximately 19 billion tokens.
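A quick back-of-the-envelope check on those figures (the bytes-per-token ratio below is an illustrative calculation, not a number reported by OpenAI):

```python
# Rough consistency check on the reported WebText statistics.
corpus_bytes = 40 * 10**9          # ~40 GB of plain text
corpus_tokens = 9 * 10**9          # ~9 billion BPE tokens
bytes_per_token = corpus_bytes / corpus_tokens
print(f"{bytes_per_token:.1f} bytes per token")   # ~4.4, a typical ratio for English BPE
```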
Role in GPT pre-training
GPT-2 was trained exclusively on WebText. GPT-3's training mixture allocated 22% of its 300-billion-token budget to WebText2, a share far larger than its roughly 19 billion tokens would suggest, reflecting OpenAI's judgment that Reddit-curated content was higher quality than raw Common Crawl. As a result, the GPT-3 paper reports that WebText2 was repeated nearly three times over the course of training, while the much larger filtered Common Crawl was seen less than half of once.
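To make the oversampling concrete, here is a small sketch using the dataset sizes and sampling weights reported in Table 2.2 of the GPT-3 paper; the "boost" figure (weight divided by the size-proportional share) is a derived illustration, not a column from the paper:

```python
# GPT-3 mixture components: (tokens in dataset, sampling weight in training mix),
# as reported in Table 2.2 of Brown et al. 2020.
mixture = {
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2":                (19e9,  0.22),
    "Books1":                  (12e9,  0.08),
    "Books2":                  (55e9,  0.08),
    "Wikipedia":               (3e9,   0.03),
}

total_tokens = sum(size for size, _ in mixture.values())

for name, (size, weight) in mixture.items():
    proportional_share = size / total_tokens   # share if sampled in proportion to size
    boost = weight / proportional_share        # how heavily the source is favoured
    print(f"{name:24s} weight {weight:>4.0%}  size share {proportional_share:>5.1%}  boost {boost:4.1f}x")

# WebText2 is favoured roughly 5-6x over size-proportional sampling;
# filtered Common Crawl is actually downweighted (boost < 1).
```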
Open replications
Because OpenAI never released WebText, the open community reconstructed it as OpenWebText (Gokaslan & Cohen 2019) and OpenWebText2 (EleutherAI 2020), following the same Reddit-karma recipe against Pushshift Reddit dumps. OpenWebText2 was later included as a sub-corpus of The Pile.
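For hands-on use, the replication (not the original WebText) can be inspected via the Hugging Face datasets library; the dataset identifier and the "text" field name below refer to the community mirror and are assumptions about that hosting, not anything published by OpenAI:

```python
# Peek at a few OpenWebText documents without downloading the full ~40 GB corpus.
from datasets import load_dataset

openwebtext = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for i, doc in enumerate(openwebtext):
    print(doc["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```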
Issues
The Reddit-karma filter encodes the demographic and topical biases of Reddit's user base (heavily English-speaking, US-leaning, male, and technologically inclined) and all but excludes non-English web content. The corpus consequently over-represents gaming, technology, and politics discussion. The exclusion of Wikipedia means that GPT-2's surprising encyclopedic competence comes from the secondary references on Reddit-popular pages rather than from the encyclopedia itself.
WebText is historically important as the first demonstration that a 1.5 B-parameter language model trained on web text could exhibit competent zero-shot task transfer (the central claim of the GPT-2 paper, Language Models are Unsupervised Multitask Learners), and as the proof of concept that motivated the subsequent LLM scaling programme.
Related terms: OpenWebText and OpenWebText2, GPT-3, Common Crawl, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 15: Modern AI, Modern AI