Glossary

Wikipedia (training corpus)

Wikipedia dumps are a near-universal pre-training ingredient of large language models: high-quality, factually anchored, encyclopedic prose, freely licensed under CC-BY-SA 3.0 (CC-BY-SA 4.0 for newer revisions), and available in more than 300 language editions.

Source and structure

The Wikimedia Foundation publishes complete database dumps roughly twice per month at https://dumps.wikimedia.org. The dumps include page text in MediaWiki markup, revision history, page metadata and article-link graphs. For LLM training, a snapshot such as 20220301.en is typically processed with WikiExtractor or mwparserfromhell to strip MediaWiki templates, infoboxes, references and citation markup, leaving roughly 6.5 million English articles and approximately 20 GB, or about 4 billion tokens, of clean prose. Multilingual dumps add a further 30 B+ tokens across the major languages.
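
As a rough illustration of that cleaning step, the sketch below uses mwparserfromhell to strip templates, references and wiki markup from one article's raw wikitext. The sample_wikitext string and the clean_article helper are illustrative placeholders, not part of any standard pipeline; a real job would iterate over the dump file itself.

```python
import mwparserfromhell

# Stand-in for the raw wikitext of one article pulled from a dump file.
sample_wikitext = """
{{Infobox settlement|name=Exampleville|population=1234}}
'''Exampleville''' is a town in [[Example County]].<ref>Example source, 2020.</ref>
== History ==
The town was founded in 1850.{{citation needed}}
"""

def clean_article(wikitext: str) -> str:
    """Strip templates (infoboxes, citation tags), <ref> tags and wiki
    markup, keeping only readable prose."""
    parsed = mwparserfromhell.parse(wikitext)
    # strip_code() drops templates and ref tags and resolves [[links]]
    # and ''' bold ''' markup to their display text.
    text = parsed.strip_code(normalize=True, collapse=True)
    # Remove empty lines left behind where templates used to be.
    return "\n".join(line for line in text.splitlines() if line.strip())

print(clean_article(sample_wikitext))
```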

Role in pre-training

Wikipedia features in nearly every documented LLM training mixture: GPT-3 (3% of the mixture by weight, deliberately upsampled relative to its size), BERT (for which English Wikipedia plus BookCorpus was the entire training set), T5, the LLaMA family (4.5%) and PaLM, and it is almost certainly present in the less fully documented corpora behind Mistral, Qwen, DeepSeek-V3, Claude and Gemini. Within The Pile and RedPajama, Wikipedia is repeated two to three times because of its high quality per token.
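
To make mixture weights and upsampling concrete, here is a minimal sketch of weight-proportional sampling over named corpora. The weights are illustrative values loosely based on GPT-3's published mixture, and the corpora dict and sample_batch helper are hypothetical stand-ins for real shard iterators.

```python
import random

# Illustrative sampling weights, loosely based on GPT-3's published mixture.
# Weights set the fraction of each batch drawn from a corpus and are NOT
# proportional to raw corpus size: Wikipedia's 3% share is far above its
# share of the raw token count, i.e. it is deliberately upsampled.
mixture_weights = {
    "common_crawl": 0.60,
    "webtext2":     0.22,
    "books":        0.15,
    "wikipedia":    0.03,
}

# Hypothetical document pools; in practice these are iterators over shards.
corpora = {name: [f"{name}_doc_{i}" for i in range(5)] for name in mixture_weights}

def sample_batch(batch_size, seed=0):
    """Draw documents so each corpus appears roughly in proportion to its
    weight, regardless of how many documents it actually contains."""
    rng = random.Random(seed)
    names = list(mixture_weights)
    weights = [mixture_weights[n] for n in names]
    return [rng.choice(corpora[rng.choices(names, weights=weights, k=1)[0]])
            for _ in range(batch_size)]

print(sample_batch(8))
```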

Licensing

CC-BY-SA is a copyleft licence: derivative works must be released under the same licence and must attribute Wikipedia. Whether training a neural network on CC-BY-SA text constitutes a derivative work is legally unsettled. The Wikimedia Foundation has so far declined to enforce a strong interpretation, but several authors of long-form Wikipedia articles have raised the question publicly.

Limitations and biases

Wikipedia's content reflects its editor base (predominantly male, Western, English-speaking and technologically literate), with documented gender gaps in biographical coverage (only ~19% of biographies are of women), systemic bias toward popular-culture topics, and ongoing drift as the editor population changes. Wikipedia also encodes a neutral-point-of-view style that, when learned at high sampling weight, can produce LLM outputs that read as superficially balanced while obscuring genuine controversy. The Wikipedia-reference quality classifier used alongside CCNet in LLaMA's CommonCrawl filtering further propagates these stylistic preferences across the rest of the training corpus.
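
As a simplified, hypothetical stand-in for that kind of quality filter (not Meta's actual pipeline), the sketch below trains a linear classifier over hashed word n-grams to separate Wikipedia-reference-style text from generic web text, then scores a new document the way a CommonCrawl filter would. The toy positive and negative examples are invented for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: positives imitate pages Wikipedia might cite as
# references, negatives imitate generic low-quality web text. A real
# filter would be trained on millions of documents.
positives = [
    "The study, published in 2019, measured atmospheric methane over a decade.",
    "According to the national census, the population grew by 4.2 percent.",
]
negatives = [
    "BUY NOW!!! best deals click here limited offer free shipping",
    "lol idk what happened last night but it was crazy haha",
]

texts = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

# A linear model over hashed word n-grams is cheap enough to score
# billions of CommonCrawl documents.
quality_filter = make_pipeline(
    HashingVectorizer(ngram_range=(1, 2), alternate_sign=False),
    LogisticRegression(),
)
quality_filter.fit(texts, labels)

candidate = "The committee's annual report details expenditure by category."
score = quality_filter.predict_proba([candidate])[0, 1]
print(f"keep document if score exceeds threshold: {score:.2f}")
```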

Modern relevance

Wikipedia's training-data importance is now structural rather than scale-driven: at roughly 4 B English tokens it is dwarfed by FineWeb's 15 T, but its factual reliability per token remains the highest of any large open corpus, and almost every published evaluation suite (MMLU, TriviaQA, NaturalQuestions, HotpotQA) tests knowledge ultimately derived from Wikipedia. A frontier model trained without Wikipedia would very likely underperform on these standard benchmarks.

Related terms: Common Crawl, The Pile, Language Model, GPT-3
