OpenWebText and OpenWebText2, Glossary, Textbook of AI

OpenWebText is the open-source replica of OpenAI's WebText corpus, built by Aaron Gokaslan and Vanya Cohen in 2019 (Gokaslan & Cohen 2019, https://skylion007.github.io/OpenWebTextCorpus). OpenWebText2 is its enlarged successor, produced by EleutherAI and incorporated as a sub-corpus of The Pile.

Construction

Both versions follow OpenAI's published recipe: take the Reddit submissions in the PushShift dumps, keep the outbound URLs of submissions that received at least three karma, deduplicate the URLs, fetch each linked page, and extract its main text with newspaper3k, which strips boilerplate. OpenWebText retains English-language pages only, whereas OpenWebText2 also keeps content in other languages. OpenWebText v1 covered Reddit submissions through 2018 and contained roughly 8 million documents (38 GB, ~9 B tokens), closely matching the original WebText footprint. OpenWebText2 extended the time range through April 2020, improved deduplication, and expanded to roughly 65 GB / 17 B tokens.
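The URL-selection stage of this recipe can be sketched in a few lines of Python. This is a minimal illustration, not the code used by either project: it assumes JSON-lines submission records with `url`, `score` and `is_self` fields, and the helper names `normalise_url` and `candidate_urls` are hypothetical. Page fetching, text extraction and language filtering are left out.

```python
import json
from urllib.parse import urlsplit, urlunsplit

MIN_SCORE = 3  # karma threshold described in the text


def normalise_url(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so trivial variants collapse."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def candidate_urls(lines, min_score=MIN_SCORE):
    """Yield unique outbound URLs from JSON-lines submission records."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # Skip self posts (no outbound link) and low-scoring submissions.
        if post.get("is_self") or post.get("score", 0) < min_score:
            continue
        url = post.get("url")
        if not url:
            continue
        key = normalise_url(url)
        if key in seen:
            continue  # URL-level deduplication
        seen.add(key)
        yield key


# Hypothetical sample records, assumed to mirror the dump schema.
sample = [
    '{"url": "https://Example.com/a#top", "score": 5, "is_self": false}',
    '{"url": "https://example.com/a", "score": 12, "is_self": false}',
    '{"url": "https://example.org/b", "score": 2, "is_self": false}',
    '{"url": "https://www.reddit.com/r/x/comments/1", "score": 40, "is_self": true}',
]
urls = list(candidate_urls(sample))
print(urls)
```

Only the first record survives here: the second is a duplicate after normalisation, the third falls below the karma threshold, and the fourth is a self post with no outbound link. A real pipeline would pass the surviving URLs to a fetch-and-extract step and then deduplicate again at the document level.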

Licensing

OpenWebText is distributed without an explicit licence for its contents; the scraped pages remain under their original, heterogeneous copyrights, and downstream users take on responsibility for complying with them. EleutherAI publishes OpenWebText2 under the same Common Crawl-style downstream-responsibility model that governs the rest of The Pile.

Models trained on OpenWebText

Gokaslan and Cohen's own GPT-2 reproduction, OpenGPT-2, was trained on OpenWebText to verify the original GPT-2 results. Hugging Face's DistilGPT-2, several academic GPT-2 variants, and a long tail of small-scale research models followed. As part of The Pile, OpenWebText2 contributed to the training of GPT-Neo, GPT-J, GPT-NeoX-20B and Pythia.

Significance

OpenWebText is historically important as the first credible open replication of a frontier closed-data pre-training corpus. It demonstrated that the GPT-2 results were not the product of any secret-sauce data choices, and provided the open-source community with the substrate for several years of replication and ablation work that fed directly into the design of The Pile, RedPajama, and the modern open-data pipeline.

Related terms: WebText and WebText2, The Pile, Common Crawl, Language Model

Discussed in:

Chapter 13: Attention & Transformers, Training Data and Web Corpora

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).