Glossary

The Stack and The Stack v2

The Stack is a permissively licensed source-code corpus released by the BigCode project (a joint Hugging Face / ServiceNow open collaboration) in October 2022, with The Stack v2 following in February 2024 (Lozhkov, Li, Allal et al., arXiv:2402.19173). The Stack is the open-data foundation of the StarCoder and StarCoder2 model families.

The Stack v1

Released October 2022, The Stack v1 contains 6.4 TB of source code in 358 programming languages, all under permissive SPDX licences (MIT, Apache-2.0, BSD-3-Clause, ISC and similar). The corpus was extracted from the GitHub Archive event stream from January 2014 to March 2022, deduplicated with MinHashLSH (Jaccard similarity > 0.85), and filtered for licence compliance using go-license-detector. The Stack v1 was used to train the original StarCoder and StarCoderBase models (15.5 B parameters), released in May 2023.
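The near-deduplication step can be sketched as follows. This is an illustrative MinHash implementation only, assuming word-level shingling; BigCode's actual pipeline uses its own tokenisation, signature size and LSH banding.

```python
import hashlib
import re

def shingles(code, k=5):
    """Tokenise source code into words and return the set of k-token shingles."""
    toks = re.findall(r"\w+", code)
    return {" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)}

def minhash(shingle_set, num_perm=128):
    """For each of num_perm seeded hash functions, keep the minimum hash value."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two near-duplicate files (hypothetical contents)
a = "def add(x, y):\n    return x + y\n" * 4
b = "def add(x, y):\n    return x + y\n" * 3 + "def sub(x, y):\n    return x - y\n"
sim = est_jaccard(minhash(shingles(a)), minhash(shingles(b)))
keep_b = sim <= 0.85  # files above the 0.85 threshold are treated as duplicates
```

In practice the pairwise comparison is replaced by locality-sensitive hashing over the signatures, so candidate duplicate pairs are found without comparing every file against every other.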

The Stack v2

The Stack v2 uses the Software Heritage archive as its source rather than scraping GitHub directly, raising the scale to 67.5 TB of source code across 619 programming languages. The deduplication pipeline is more aggressive (exact, near-duplicate and repository-level deduplication), and the corpus integrates commit messages, issues and pull-request discussions. After filtering, the training subset is approximately 3 TB / 900 B tokens.
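The exact-deduplication stage amounts to hashing file contents and keeping one copy per digest. A minimal sketch, with hypothetical file paths and a whitespace-stripping normalisation chosen for illustration:

```python
import hashlib

def exact_dedup(files):
    """Keep the first file seen for each distinct content hash."""
    seen, kept = set(), []
    for path, text in files:
        digest = hashlib.sha256(text.strip().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, text))
    return kept

corpus = [
    ("repo1/util.py", "x = 1\n"),
    ("repo2/util.py", "x = 1\n"),   # byte-identical duplicate, dropped
    ("repo2/main.py", "print(1)\n"),
]
deduped = exact_dedup(corpus)  # two files survive
```

Near-duplicate and repository-level passes then remove the copies that differ only in comments, formatting or forked history, which exact hashing cannot catch.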

The Stack v2 was used to train the StarCoder2 family (3 B, 7 B and 15 B parameters), released in February 2024, which set the open-model state of the art on HumanEval and MBPP at their respective parameter counts.

Opt-out mechanism

The Stack is the only major code corpus with an active developer opt-out: GitHub usernames listed at https://huggingface.co/spaces/bigcode/in-the-stack are excluded from new releases of The Stack. As of 2025 over 17,000 developers have opted out. The opt-out is honoured by retraining; older StarCoder weights cannot be retroactively cleansed.
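Applying the opt-out at corpus-build time reduces to filtering files by repository owner. A minimal sketch, with hypothetical usernames and repository paths:

```python
def apply_opt_out(files, opted_out):
    """Drop every file whose repository owner appears in the opt-out list."""
    opted_out = {u.lower() for u in opted_out}
    return [
        (repo, path) for repo, path in files
        if repo.split("/")[0].lower() not in opted_out
    ]

files = [
    ("alice/tools", "cli.py"),
    ("bob/webapp", "app.py"),
]
kept = apply_opt_out(files, opted_out={"bob"})  # bob's file is excluded
```

Because the filter runs during dataset construction, it removes opted-out code from future releases but cannot alter models already trained on earlier versions, which is why retraining is the enforcement mechanism.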

Licensing

The Stack metadata and processing code are released under Apache-2.0. The underlying source code retains its original SPDX licence (MIT, Apache-2.0, BSD-3-Clause, ISC). BigCode publishes the per-file licence map alongside the corpus so downstream users can apply licence-aware filters.
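A downstream licence-aware filter over such a per-file map is straightforward. A minimal sketch, assuming an in-memory dict of path to SPDX identifier and illustrative file names:

```python
# Allow-list of permissive SPDX identifiers, as named in the corpus documentation
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause", "ISC"}

def licence_filter(licence_map, allowed=PERMISSIVE):
    """Return the paths whose detected SPDX identifier is on the allow-list."""
    return [path for path, spdx in licence_map.items() if spdx in allowed]

licence_map = {
    "a/lib.c": "MIT",
    "b/core.c": "GPL-3.0-only",
    "c/util.c": "Apache-2.0",
}
kept = licence_filter(licence_map)  # the GPL-licensed file is excluded
```

Downstream users can tighten or loosen the allow-list to match their own compliance requirements, since the original SPDX identifier travels with each file.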

Significance

The Stack v2 is the canonical open code training set for the post-LLaMA era and the only large code corpus that combines genuine scale, permissive licensing and a working opt-out. It serves as a proof-of-concept that responsibly licensed code training data can match the downstream quality of opaque, indiscriminate GitHub scrapes, at least for open-weight models in the 15 B-parameter range.

Related terms: GitHub Code Corpus, Stack Exchange and Stack Overflow Corpus, Language Model, DeepSeek-V3
