The Stack Exchange network, including Stack Overflow, Mathematics Stack Exchange, Cross Validated, TeX Stack Exchange, Server Fault, Super User and roughly 170 other Q&A sites, is a heavily used component of LLM training mixtures. Stack Exchange Inc. releases quarterly Stack Exchange Data Dumps at https://archive.org/details/stackexchange under CC-BY-SA 4.0; they contain questions, answers, comments, edit history and vote scores.
Scale
The mid-2024 dump contains roughly 24 million questions, 35 million answers and 80 million comments across all sites. Stack Overflow alone supplies about 18 million questions and 27 million answers. Cleaned to plain Markdown with code fences preserved, the full corpus is approximately 30 GB, or roughly 8-12 billion tokens. The Pile weights Stack Exchange at about 5%, RedPajama at about 1.7%, and most production LLM mixtures allocate 1-3%.
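The size and weight figures above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the 30 GB and 8-12 billion token figures come from the text, while the 1-trillion-token training budget, the 2% weight and the 10-billion-token midpoint are hypothetical values chosen for the example.

```python
def bytes_per_token(corpus_bytes: float, corpus_tokens: float) -> float:
    """Average number of bytes of cleaned text covered by one token."""
    return corpus_bytes / corpus_tokens


def epochs_over_source(budget_tokens: float, weight: float, source_tokens: float) -> float:
    """Number of passes over a source implied by its mixture weight."""
    return budget_tokens * weight / source_tokens


# Figures from the text: about 30 GB of cleaned Markdown, 8-12 billion tokens.
corpus_bytes = 30e9
for tokens in (8e9, 12e9):
    ratio = bytes_per_token(corpus_bytes, tokens)
    print(f"{tokens / 1e9:.0f}B tokens -> {ratio:.2f} bytes per token")

# Hypothetical mixture: 1T-token budget, 2% Stack Exchange, 10B-token source.
passes = epochs_over_source(1e12, 0.02, 10e9)
print(f"implied passes over the source: {passes:.1f}")
```

Under these assumptions the token range corresponds to between 2.5 and 3.75 bytes per token, and a 2% weight in a 1-trillion-token run would mean seeing the source about twice.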
Role in coding ability
Stack Overflow is widely credited as one of the most influential sources of emergent programming competence in pre-trained LLMs. The Q&A format pairs natural-language problem descriptions with code and explanatory prose, which is close to ideal supervision for the coding question-answer task that developers later use LLMs for. Codex, Code Llama, DeepSeek-Coder, Qwen-Coder, StarCoder and CodeGen are all reported to have included Stack Exchange in pre-training, and ablation studies (BigCode 2023) showed measurable HumanEval and MBPP regressions when it was removed.
Licensing controversy
All Stack Exchange content is licensed under CC-BY-SA 4.0, which requires attribution and requires derivative works to be shared under the same terms (copyleft). In 2023 Stack Overflow Inc. revised its position, declaring that training large language models on Stack Overflow data without a commercial agreement violates the spirit of the licence even if the formal licence text does not explicitly cover ML training. A public dispute followed when Stack Overflow signed a paid licensing deal with OpenAI in May 2024: users protested by deleting or vandalising their own answers, and Stack Overflow reverted these edits and suspended several accounts, generating further controversy. The episode crystallised the broader question of whether the host platform can legitimately monetise user-generated CC-BY-SA content without seeking fresh consent from contributors.
Quality issues
Stack Exchange contains many outdated answers (an answer that was correct for an old library version can remain top-voted), duplicated content (the same question asked and answered on multiple Stack Exchange sites), low-quality edits in the long tail of small Stack Exchange sites, and gamed votes, particularly under Stack Overflow's reputation system. Cleaning pipelines typically retain only answers with score > 0 and keep the question + accepted answer pair, dropping comments and meta-discussion.
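The filtering rule described above can be sketched as follows. This is a minimal illustration, not any particular pipeline's implementation. It assumes the dump's Posts.xml layout, in which each post is a `row` element with `PostTypeId` 1 for questions and 2 for answers, plus `AcceptedAnswerId` and `Score` attributes; the sample data and the function name `extract_accepted_pairs` are made up for the example. Comments are dropped simply by never reading them, and HTML-to-Markdown conversion is left out.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample imitating the assumed Posts.xml row layout.
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="2" Score="5" Title="How do I reverse a list?" Body="&lt;p&gt;I have a list.&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="7" Body="&lt;p&gt;Use reversed().&lt;/p&gt;" />
  <row Id="3" PostTypeId="2" ParentId="1" Score="-1" Body="&lt;p&gt;Sort it twice.&lt;/p&gt;" />
  <row Id="4" PostTypeId="1" AcceptedAnswerId="5" Score="1" Title="Why is my loop slow?" Body="&lt;p&gt;It takes minutes.&lt;/p&gt;" />
  <row Id="5" PostTypeId="2" ParentId="4" Score="0" Body="&lt;p&gt;Buy a faster machine.&lt;/p&gt;" />
  <row Id="6" PostTypeId="1" Score="2" Title="Unanswered question" Body="&lt;p&gt;Anyone?&lt;/p&gt;" />
</posts>"""


def extract_accepted_pairs(xml_text, min_score=0):
    """Return question + accepted-answer pairs whose answer score exceeds min_score."""
    root = ET.fromstring(xml_text)
    questions, answers = {}, {}
    for row in root.iter("row"):
        post_type = row.get("PostTypeId")
        if post_type == "1":      # question
            questions[row.get("Id")] = row
        elif post_type == "2":    # answer
            answers[row.get("Id")] = row

    pairs = []
    for question in questions.values():
        accepted = answers.get(question.get("AcceptedAnswerId"))
        if accepted is None:
            continue              # no accepted answer, or it is missing from the dump
        if int(accepted.get("Score", "0")) <= min_score:
            continue              # keep only answers with score > min_score
        pairs.append({
            "title": question.get("Title"),
            "question": question.get("Body"),
            "answer": accepted.get("Body"),
        })
    return pairs


for pair in extract_accepted_pairs(SAMPLE_POSTS_XML):
    print(pair["title"], "->", pair["answer"])
```

In the sample, only the first question survives: the second has an accepted answer with score 0, and the third has no accepted answer. A real Posts.xml is far too large to load in one call, so a streaming parser such as `xml.etree.ElementTree.iterparse` would be the more practical choice there.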
Related terms: GitHub Code Corpus, The Stack and The Stack v2, The Pile, Language Model
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora