LLM pre-training mixtures routinely allocate several percent of total tokens to academic full-text to acquire mathematical, technical and biomedical fluency. The three dominant open corpora are arXiv, S2ORC and PubMed Central.
arXiv
arXiv.org, founded by Paul Ginsparg in 1991 at Los Alamos and now hosted by Cornell, is the canonical preprint server for physics, mathematics, computer science, statistics, quantitative biology and economics. As of 2025 it contains over 2.5 million preprints. For LLM training the relevant inputs are the LaTeX source bundles (more than 50 GB compressed), processed with arxiv-latex-cleaner and pandoc to recover plain text with mathematics intact. The cleaned arXiv corpus contributes roughly 30-100 B tokens depending on the cut-off, with The Pile weighting it at 9% and RedPajama at 2.5%.
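A minimal sketch of that cleaning step, assuming arxiv-latex-cleaner and pandoc are installed and each source bundle has already been unpacked into its own directory; the paths, the choice of Markdown output and the file layout are illustrative rather than a description of any particular corpus's pipeline:

```python
# Sketch: arXiv LaTeX source -> plain text with mathematics left as TeX.
# Assumes the arxiv_latex_cleaner console script and pandoc are on PATH.
import subprocess
from pathlib import Path

def clean_and_convert(source_dir: Path, out_dir: Path) -> None:
    """Strip comments and auxiliary files, then convert each .tex file
    to Markdown with inline and display maths preserved as TeX."""
    # arxiv_latex_cleaner writes its cleaned copy to "<source_dir>_arXiv"
    subprocess.run(["arxiv_latex_cleaner", str(source_dir)], check=True)
    cleaned_dir = source_dir.with_name(source_dir.name + "_arXiv")

    out_dir.mkdir(parents=True, exist_ok=True)
    for tex in cleaned_dir.glob("*.tex"):
        # --wrap=none avoids hard line breaks that distort document statistics
        subprocess.run(
            ["pandoc", str(tex), "-f", "latex", "-t", "markdown",
             "--wrap=none", "-o", str(out_dir / f"{tex.stem}.md")],
            check=True,
        )

if __name__ == "__main__":
    clean_and_convert(Path("paper_src"), Path("cleaned_text"))
```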
Licensing: most arXiv preprints are deposited under arXiv's non-exclusive licence, with a substantial subset under CC-BY or CC-BY-NC. Bulk redistribution for ML training is governed by arXiv's bulk-access policy, which permits research use.
S2ORC
S2ORC, the Semantic Scholar Open Research Corpus (Lo, Wang, Neumann et al., ACL 2020), is the Allen Institute for AI's machine-readable scientific corpus. Current releases contain 136 million English-language papers, with structured full text for 12 million open-access papers parsed with GROBID into sections, references, equations and figures (up from 81.1 million papers and 8.1 million full texts in the original release). Total scale is approximately 80 GB of cleaned text, or roughly 15-20 B tokens.
S2ORC's distinctive contribution is its citation graph: every paper-to-paper reference is normalised across PubMed, DOI and Semantic Scholar IDs, enabling retrieval-augmented training and graph-aware models. OLMo (via the peS2o derivative) and Galactica both drew on S2ORC.
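A sketch of how the full text and citation graph might be consumed together, assuming the gzipped-JSONL layout and field names (body_text, cite_spans, bib_entries, link) described for the S2ORC full-text release; these are assumptions to be checked against the specific version downloaded:

```python
# Sketch: read one S2ORC full-text shard and extract body text plus
# citation edges resolved to Semantic Scholar paper IDs.
import gzip
import json

def iter_s2orc(path: str):
    """Yield (paper_id, full_text, cited_paper_ids) from one JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            paragraphs = rec.get("body_text") or []
            text = "\n\n".join(p["text"] for p in paragraphs)

            # Resolve in-text citation markers to linked paper IDs
            bib = rec.get("bib_entries") or {}
            cited = set()
            for p in paragraphs:
                for span in p.get("cite_spans", []):
                    entry = bib.get(span.get("ref_id"), {})
                    if entry.get("link"):
                        cited.add(entry["link"])
            yield rec["paper_id"], text, sorted(cited)

if __name__ == "__main__":
    for pid, text, refs in iter_s2orc("pdf_parses_0.jsonl.gz"):
        print(pid, len(text), len(refs))
        break
```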
PubMed Central
PubMed Central (PMC), run by the NCBI since 2000, is the free full-text archive of biomedical and life-sciences literature. It contains roughly 9 million full-text articles as of 2025, of which approximately 3 million, the Open Access Subset, are licensed for redistribution and bulk download. Cleaned PMC contributes 30-40 B tokens of biomedical prose.
PMC text is the foundation of biomedical LLMs: BioBERT, PubMedBERT, BioGPT, Med-PaLM, Galactica, MedAlpaca, and the open Meditron-70B. The Pile includes both PubMed Abstracts (roughly 30 million abstracts, ~10 B tokens) and PubMed Central (full text, ~30 B tokens) as separate sub-corpora.
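A sketch of converting one article from the PMC Open Access bulk download into plain text, assuming the standard JATS XML layout shipped in those packages; the file name is purely illustrative:

```python
# Sketch: PMC JATS XML -> plain text (title, abstract, body paragraphs).
import xml.etree.ElementTree as ET

def jats_to_text(path: str) -> str:
    root = ET.parse(path).getroot()
    parts = []
    # Title lives under <front>; abstract and full text under <abstract>/<body>
    title = root.find(".//article-title")
    if title is not None:
        parts.append("".join(title.itertext()).strip())
    for tag in ("abstract", "body"):
        node = root.find(f".//{tag}")
        if node is None:
            continue
        for p in node.iter("p"):
            parts.append("".join(p.itertext()).strip())
    return "\n\n".join(filter(None, parts))

if __name__ == "__main__":
    print(jats_to_text("PMC_example_article.xml")[:500])
```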
Issues common to all three
Mathematical typesetting in HTML or PDF rarely round-trips cleanly to plain text, leaving residual MathML or LaTeX artefacts. Citation strings appear in dense bursts that distort document statistics. The author population skews heavily Western and English-language. And in the case of arXiv and PMC, embargo violations (papers posted before journal-permitted dates) propagate into LLM training without authorial consent.
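A heuristic sketch of how the first two artefacts might be flagged at the document level; the regular expressions and thresholds are illustrative assumptions, not values used by any published corpus:

```python
# Sketch: flag documents with residual markup or citation-dense text.
import re

MATHML_TAG = re.compile(r"</?(?:math|mrow|msub|msup|mfrac)\b[^>]*>")
LATEX_CMD = re.compile(r"\\[A-Za-z]+|\$[^$]+\$")
CITATION = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]|\(\w+ et al\.,? \d{4}\)")

def markup_ratio(text: str) -> float:
    """Fraction of characters inside residual MathML tags or LaTeX commands."""
    hits = MATHML_TAG.findall(text) + LATEX_CMD.findall(text)
    return sum(len(h) for h in hits) / max(len(text), 1)

def citation_density(text: str) -> float:
    """Citation strings per 100 whitespace-delimited tokens."""
    n_tokens = max(len(text.split()), 1)
    return 100 * len(CITATION.findall(text)) / n_tokens

def keep(text: str, max_markup: float = 0.1, max_cites: float = 5.0) -> bool:
    return markup_ratio(text) < max_markup and citation_density(text) < max_cites
```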
Related terms: The Pile, RedPajama, Language Model, Common Crawl
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora