OpenWebText is an open-source replication of OpenAI's WebText corpus, which OpenAI described but never released; it was built by Aaron Gokaslan and Vanya Cohen in 2019 (Gokaslan & Cohen 2019, https://skylion007.github.io/OpenWebTextCorpus). OpenWebText2 is its enlarged successor, produced by EleutherAI and incorporated as a sub-corpus of The Pile.
Construction
Both versions follow the recipe OpenAI described for WebText: extract every outbound URL from PushShift's Reddit submission dumps that received at least three karma, deduplicate the URLs, fetch each linked page and strip boilerplate with newspaper3k, and retain only English-language content. OpenWebText v1 covered Reddit submissions through 2018 and contained roughly 8 million documents, totaling about 38 GB (~9 B tokens), closely matching the original WebText footprint. OpenWebText2 extended the time range through April 2020, improved deduplication, and grew to roughly 65 GB (~17 B tokens).
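The recipe above can be sketched roughly as follows. This is an illustrative outline, not the actual OpenWebText code: the function names and the PushShift record layout shown here are assumptions, and real runs add further cleaning stages. The newspaper3k calls (`Article.download`/`Article.parse`) are the library's real API and require network access.

```python
MIN_KARMA = 3  # the WebText threshold: keep links whose submission got >= 3 karma

def filter_submissions(submissions):
    """Keep outbound URLs from Reddit submissions with at least MIN_KARMA score.
    Assumes PushShift-style dicts with 'url' and 'score' fields (illustrative).
    Self-links back to reddit.com are dropped, as in the WebText recipe."""
    return [s["url"] for s in submissions
            if s.get("score", 0) >= MIN_KARMA
            and not s["url"].startswith("https://www.reddit.com")]

def dedupe_urls(urls):
    """Exact-match URL deduplication, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

def fetch_text(url):
    """Fetch one page and strip boilerplate with newspaper3k (network required).
    Returns the extracted main text; language filtering would follow this step."""
    from newspaper import Article
    article = Article(url)
    article.download()
    article.parse()
    return article.text
```

In the real corpus build, the fetch step runs over millions of URLs in parallel and is followed by content-level deduplication and an English-language filter, neither of which is shown here.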
Licensing
OpenWebText is distributed without an explicit license; downstream users assume the copyrights of the underlying pages, which are heterogeneous. EleutherAI publishes OpenWebText2 under the same Common Crawl-style downstream-responsibility model that governs the rest of The Pile.
Models trained on OpenWebText
Hugging Face's GPT-2 reproduction was trained on OpenWebText to verify the original GPT-2 results. DistilGPT-2, several academic GPT-2 variants, and a long tail of small-scale research models followed. OpenWebText2 contributed to GPT-Neo, GPT-J, GPT-NeoX-20B, Pythia, and MPT-7B as part of The Pile.
Significance
OpenWebText is historically important as the first credible open replication of a frontier closed-data pre-training corpus. It demonstrated that the GPT-2 results did not depend on any secret-sauce data choices, and it gave the open-source community the substrate for several years of replication and ablation work that fed directly into the design of The Pile, RedPajama, and the modern open-data pipeline.
Related terms: WebText and WebText2, The Pile, Common Crawl, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora