WebText is the closed pre-training corpus assembled by OpenAI for GPT-2 (Radford, Wu, Child et al. 2019), and WebText2 is its enlarged successor used as one of the five mixture components for GPT-3 (Brown, Mann, Ryder et al. 2020). Neither dataset has been released; both are known only from the brief descriptions in the GPT-2 and GPT-3 papers.
Construction
OpenAI scraped every outbound link posted to Reddit that received at least 3 karma ("a heuristic indicator for whether other users found the link interesting, educational, or just funny"). Pages were de-duplicated, cleaned of HTML markup using the Dragnet and Newspaper content extractors, and stripped of all Wikipedia documents (Wikipedia is a common source for evaluation test sets, so removing it simplified analysis of train-test overlap). The cut-off was December 2017 for WebText; WebText2 extended the scrape over a longer period, reportedly to roughly mid-2019.
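A minimal sketch of the link-selection step, assuming a Pushshift-style Reddit submissions dump as input (the JSON field names url, score, is_self and created_utc come from that dump format; OpenAI's actual pipeline has never been published):

```python
import json
from datetime import datetime, timezone

KARMA_THRESHOLD = 3   # GPT-2 paper: keep links with at least 3 karma
CUTOFF = datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp()  # WebText: no links after Dec 2017

def webtext_style_links(submission_lines):
    """Yield de-duplicated outbound URLs passing a WebText-style filter.

    `submission_lines` is an iterable of JSON lines from a Pushshift Reddit
    submissions dump; the field names used here are assumptions about that
    dump format, not details of OpenAI's unreleased pipeline.
    """
    seen = set()
    for line in submission_lines:
        post = json.loads(line)
        if post.get("is_self"):                      # self-posts carry no outbound link
            continue
        if post.get("score", 0) < KARMA_THRESHOLD:   # karma filter
            continue
        if post.get("created_utc", 0) > CUTOFF:      # respect the December 2017 cut-off
            continue
        url = post.get("url", "")
        if not url or "wikipedia.org" in url:        # Wikipedia documents were removed
            continue
        if url not in seen:                          # de-duplicate links
            seen.add(url)
            yield url
```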
The 45 million candidate links were filtered down to slightly over 8 million documents, totalling about 40 GB of plain text and roughly 9 billion tokens for WebText. WebText2 grew to approximately 19 billion tokens.
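A quick back-of-the-envelope check on those figures (the bytes-per-token ratio below is an illustrative calculation, not a number reported by OpenAI):

```python
# Rough consistency check on the reported WebText statistics.
corpus_bytes = 40 * 10**9          # ~40 GB of plain text
corpus_tokens = 9 * 10**9          # ~9 billion BPE tokens
bytes_per_token = corpus_bytes / corpus_tokens
print(f"{bytes_per_token:.1f} bytes per token")   # ~4.4, a typical ratio for English BPE
```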
Role in GPT pre-training
GPT-2 was trained exclusively on WebText. GPT-3's training mixture allocated 22% of its 300-billion-token budget to WebText2, a share far larger than its roughly 19 billion tokens would suggest, reflecting OpenAI's judgment that Reddit-curated content was higher quality than raw Common Crawl. As a result, the GPT-3 paper reports that WebText2 was repeated nearly three times over the course of training, while the much larger filtered Common Crawl was seen less than half of once.
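To make the oversampling concrete, here is a small sketch using the dataset sizes and sampling weights reported in Table 2.2 of the GPT-3 paper; the "boost" figure (weight divided by the size-proportional share) is a derived illustration, not a column from the paper:

```python
# GPT-3 mixture components: (tokens in dataset, sampling weight in training mix),
# as reported in Table 2.2 of Brown et al. 2020.
mixture = {
    "Common Crawl (filtered)": (410e9, 0.60),
    "WebText2":                (19e9,  0.22),
    "Books1":                  (12e9,  0.08),
    "Books2":                  (55e9,  0.08),
    "Wikipedia":               (3e9,   0.03),
}

total_tokens = sum(size for size, _ in mixture.values())

for name, (size, weight) in mixture.items():
    proportional_share = size / total_tokens   # share if sampled in proportion to size
    boost = weight / proportional_share        # how heavily the source is favoured
    print(f"{name:24s} weight {weight:>4.0%}  size share {proportional_share:>5.1%}  boost {boost:4.1f}x")

# WebText2 is favoured roughly 5-6x over size-proportional sampling;
# filtered Common Crawl is actually downweighted (boost < 1).
```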
Open replications
Because OpenAI never released WebText, the open community reconstructed it as OpenWebText (Gokaslan & Cohen 2019) and OpenWebText2 (EleutherAI 2020), following the same Reddit-karma recipe against Pushshift Reddit dumps. OpenWebText2 was later included as a sub-corpus of The Pile.
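For hands-on use, the replication (not the original WebText) can be inspected via the Hugging Face datasets library; the dataset identifier and the "text" field name below refer to the community mirror and are assumptions about that hosting, not anything published by OpenAI:

```python
# Peek at a few OpenWebText documents without downloading the full ~40 GB corpus.
from datasets import load_dataset

openwebtext = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for i, doc in enumerate(openwebtext):
    print(doc["text"][:200].replace("\n", " "))
    if i >= 2:
        break
```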
Issues
The Reddit-karma filter encodes the demographic and topical biases of Reddit's user base (heavily English-speaking, US-leaning, male, and technologically inclined) and all but excludes non-English web content. The corpus consequently over-represents gaming, technology, and politics discussion. The exclusion of Wikipedia means that GPT-2's surprising encyclopedic competence comes from the secondary references on Reddit-popular pages rather than from the encyclopedia itself.
WebText is historically important as the first demonstration that a 1.5 B-parameter language model trained on web text could exhibit competent zero-shot task transfer (the central claim of the GPT-2 paper, Language Models are Unsupervised Multitask Learners), and as the proof of concept that motivated the subsequent LLM scaling programme.
Related terms: OpenWebText and OpenWebText2, GPT-3, Common Crawl, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 15: Modern AI, Modern AI