Glossary

GitHub Code Corpus

GitHub hosts hundreds of millions of public repositories (more than 400 million repositories in total as of 2025) and constitutes the single largest publicly accessible source of computer source code on the internet. Pre-training on GitHub code underpins essentially the entire family of code-capable LLMs.

Construction

Code corpora are assembled by cloning public GitHub repositories (typically discovered via the GH Archive event stream, the GitHub REST API or BigQuery's bigquery-public-data.github_repos dataset), then filtering the result through several stages:

  • License detection: go-license-detector or scancode-toolkit identifies the SPDX licence attached to each file or repository.
  • Language detection: usually go-enry (the engine behind GitHub's own Linguist) classifies files into the 200+ languages GitHub recognises.
  • Quality heuristics: line length, alphanumeric ratio, and the fractions of auto-generated content, vendored code and binary blobs (a sketch follows this list).
  • Deduplication: exact and near-duplicate detection (MinHash + LSH), essential because forking duplicates code across GitHub on a massive scale (a second sketch follows).
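A minimal sketch of the quality-heuristic stage, assuming files arrive as plain text; the thresholds are illustrative assumptions, not the settings of any published pipeline:

    # Illustrative quality filter for a code corpus; thresholds are assumptions.
    MAX_AVG_LINE_LEN = 100    # minified/auto-generated code tends to have very long lines
    MAX_LINE_LEN = 1000
    MIN_ALNUM_RATIO = 0.25    # binary blobs and data dumps score low here

    def passes_quality_heuristics(text: str) -> bool:
        lines = text.splitlines()
        if not lines:
            return False
        if sum(len(l) for l in lines) / len(lines) > MAX_AVG_LINE_LEN:
            return False
        if max(len(l) for l in lines) > MAX_LINE_LEN:
            return False
        if sum(c.isalnum() for c in text) / max(len(text), 1) < MIN_ALNUM_RATIO:
            return False
        # Cheap stand-ins for the auto-generated and vendored checks; real
        # pipelines use generator markers and path rules (vendor/, node_modules/).
        header = text[:500]
        return "DO NOT EDIT" not in header and "@generated" not in header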
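Near-duplicate detection with MinHash + LSH can be sketched with the datasketch library; the shingle size and similarity threshold below are illustrative assumptions:

    from datasketch import MinHash, MinHashLSH

    NUM_PERM = 128      # number of hash permutations (accuracy/speed trade-off)
    THRESHOLD = 0.8     # Jaccard similarity above which files count as near-duplicates

    def minhash_of(text: str, shingle_size: int = 5) -> MinHash:
        # MinHash over word 5-grams; token-level shingles are one common choice.
        m = MinHash(num_perm=NUM_PERM)
        tokens = text.split()
        for i in range(max(len(tokens) - shingle_size + 1, 1)):
            m.update(" ".join(tokens[i:i + shingle_size]).encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for path, text in corpus:       # corpus: any iterable of (path, text) pairs
        mh = minhash_of(text)
        if lsh.query(mh):           # near-duplicate of a file already kept
            continue
        lsh.insert(path, mh)
        kept.append(path)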

After these filters, the processed corpus typically comes to roughly 1 TB of source code, or 200-400 billion tokens, depending on language coverage and how aggressively the filters are applied.
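The byte-to-token conversion is back-of-envelope arithmetic; the 3-5 bytes-per-token range below is a rough assumption for BPE tokenizers on source code, and it reproduces the quoted range:

    # Back-of-envelope: corpus size in bytes -> token count.
    # 3-5 bytes/token is an assumed range for BPE tokenizers on code.
    corpus_bytes = 1e12                        # ~1 TB of filtered source code
    for bytes_per_token in (3, 4, 5):
        tokens_b = corpus_bytes / bytes_per_token / 1e9
        print(f"{bytes_per_token} bytes/token -> ~{tokens_b:.0f}B tokens")
    # 3 -> ~333B, 4 -> ~250B, 5 -> ~200B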

Models trained on GitHub code

OpenAI Codex (2021, the engine behind the original GitHub Copilot) was trained on a 159 GB filtered GitHub subset. Code Llama continued pre-training Llama 2 on roughly 500 B tokens of mostly GitHub code. DeepSeek-Coder, Qwen-Coder, StarCoder, StarCoder2, CodeGen, CodeGen2, Replit-Code and Phi-1 were all trained either exclusively on GitHub code or on GitHub-dominant mixtures. GPT-4, Claude, Gemini and DeepSeek-V3 are all understood to include large GitHub-derived portions in their multilingual training mixtures.

Licensing controversy

The legal status of training on GitHub code is among the most actively litigated questions in AI training data. Doe v. GitHub (filed 2022) is a class action alleging that GitHub Copilot, by occasionally regurgitating code verbatim, violates the attribution and copyleft requirements of the GPL, MIT, Apache and BSD licences covering the code it was trained on. The case has narrowed substantially through motion practice, with most counts dismissed by 2023, but the residual DMCA Section 1202 claims (removal of copyright-management information) remain active as of 2025.

The BigCode project responded explicitly to this controversy by building The Stack only from permissively licensed repositories (e.g. MIT, Apache, BSD) and by offering an opt-out at https://www.bigcode-project.org/docs/about/the-stack. DeepSeek-Coder and Qwen-Coder publish lists of excluded licences but do not honour user opt-outs.

Quality issues

GitHub suffers several well-documented quality problems:

  • Duplication: forking means a single repository's code may appear hundreds of times.
  • Machine-generated boilerplate: auto-generated bindings and vendored dependency directories.
  • Secret leakage: API keys and AWS credentials remain visible in commit history even after deletion (a scrubbing sketch follows this list).
  • Personal data: personally identifiable email addresses in commit metadata.
  • Adversarial poisoning: the 2024 xz backdoor episode demonstrated that production-quality malicious code can persist in public repositories for years.

Effective deduplication and secret scrubbing are now standard pipeline stages.
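A minimal sketch of the secret-scrubbing stage; the two patterns below (AWS access key IDs and GitHub personal access tokens) are well-known credential formats, but a production pipeline would use a dedicated scanner such as gitleaks or detect-secrets, which cover far more patterns plus entropy checks:

    import re

    # Two well-known credential formats; real scanners cover hundreds more.
    SECRET_PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID
        re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access token
    ]

    def scrub_secrets(text: str, placeholder: str = "<REDACTED>") -> tuple[str, int]:
        # Replace recognised secrets with a placeholder; return text and hit count.
        hits = 0
        for pattern in SECRET_PATTERNS:
            text, n = pattern.subn(placeholder, text)
            hits += n
        return text, hits

    clean, n = scrub_secrets('aws_key = "AKIAIOSFODNN7EXAMPLE"')
    print(n, clean)    # 1 aws_key = "<REDACTED>"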

Related terms: The Stack and The Stack v2, Stack Exchange and Stack Overflow Corpus, Language Model, DeepSeek-V3
