Glossary

Dolma and OLMo

Dolma (Soldaini, Kinney, Bhagia et al., ACL 2024, arXiv:2402.00159) is a 3-trillion-token English pre-training corpus first released by the Allen Institute for AI (AI2) in 2023, with the accompanying paper published in early 2024. It was assembled as the training substrate for OLMo, AI2's fully open language model and one of the first models at this scale to release training data, training code, intermediate checkpoints, and final weights together under a common open licence.

Composition (Dolma v1.6)

Dolma is a deliberate mixture of seven sub-corpora (each source's share of the total is computed in the sketch after this list):

  • Common Crawl (filtered with CCNet + Gopher quality + per-domain blocklists), 2,415 B tokens.
  • The Stack (permissively licensed code), 411 B tokens.
  • C4, 198 B tokens.
  • Reddit (filtered submissions and comments via PushShift), 89 B tokens.
  • PeS2o (Semantic Scholar full-text papers), 70 B tokens.
  • Project Gutenberg (public-domain books), 6 B tokens.
  • Wikipedia + Wikibooks (multilingual subset), 4 B tokens.
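
To make the mixture concrete, here is a minimal Python sketch that derives each source's share of the roughly 3.2-trillion-token total. The dictionary and names are illustrative, not part of the Dolma toolkit; the token counts are simply those listed above.

```python
# Illustrative only: the listed Dolma sources and their sizes in billions of tokens.
DOLMA_SOURCES_B_TOKENS = {
    "common_crawl": 2415,
    "the_stack": 411,
    "c4": 198,
    "reddit": 89,
    "pes2o": 70,
    "gutenberg": 6,
    "wikipedia_wikibooks": 4,
}

total = sum(DOLMA_SOURCES_B_TOKENS.values())  # ~3,193 B tokens
for source, tokens in sorted(DOLMA_SOURCES_B_TOKENS.items(), key=lambda kv: -kv[1]):
    print(f"{source:22s} {tokens:6,d} B  ({tokens / total:6.2%})")
```

As the output makes plain, filtered Common Crawl alone accounts for roughly three-quarters of the corpus, with code (The Stack) a distant second.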

Construction philosophy

Unlike The Pile or RedPajama, every Dolma processing decision is documented, and the Dolma toolkit released alongside the data implements each stage: language identification, deduplication (Bloom-filter-based), quality classification, PII redaction and toxicity filtering. Any user with access to Common Crawl WARC files can re-run the pipeline end to end.
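
The deduplication stage is worth a closer look. The sketch below illustrates the generic Bloom-filter paragraph-deduplication technique as a self-contained approximation; it is not the dolma package's actual API, and the class and function names are invented for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Simple Bloom filter over byte strings (illustrative, not the dolma package)."""

    def __init__(self, expected_items: int, fp_rate: float = 1e-4):
        # Standard sizing: m bits and k hash functions for a target false-positive rate.
        self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from two independent digests (double hashing).
        h1 = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(item).digest()[:8], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: bytes) -> bool:
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] >> bit & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def dedupe_paragraphs(doc: str, bf: BloomFilter) -> str:
    """Drop newline-delimited paragraphs whose exact text has been seen before."""
    kept = [p for p in doc.split("\n") if p.strip() and not bf.add(p.strip().encode())]
    return "\n".join(kept)
```

The design trade-off is that a Bloom filter accepts a small, tunable false-positive rate (occasionally dropping an unseen paragraph) in exchange for constant memory, which is what makes exact-match deduplication tractable at trillion-token scale.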

Licensing

Dolma was initially released under the AI2 ImpACT Medium Risk licence, a use-based licence that requires downstream users to refrain from a list of high-risk applications (military, surveillance, manipulation); later releases moved to the more permissive ODC-BY licence. The underlying source materials retain their original licences.

OLMo

OLMo-7B (Groeneveld, Beltagy, Walsh et al., arXiv:2402.00838) was the first model trained on Dolma, released alongside OLMo-1B; OLMo 1.7-7B followed. All shipped with extensive intermediate-checkpoint releases (roughly every 1,000 steps across the full pretraining run), enabling fine-grained mechanistic-interpretability and training-dynamics research; a sketch of loading such a checkpoint follows below. OLMo-2 7B / 13B (late 2024) added Dolmino, a curated ~100B-token mid-training mix built on Dolma, and are competitive with Llama-3-class models at comparable compute.
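
As an illustration of how the intermediate checkpoints enable training-dynamics work, here is a hedged sketch of loading one from the Hugging Face Hub. The repository ID and the `step{N}-tokens{B}B` revision naming are assumptions based on the allenai/OLMo-7B Hub repository and should be verified against its actual branch list before running.

```python
# Hedged sketch: load one intermediate OLMo checkpoint for analysis.
# Assumptions (verify on the Hub): repo "allenai/OLMo-7B" and revision
# branches named like "step1000-tokens4B", one per ~1,000 training steps.
from transformers import AutoModelForCausalLM, AutoTokenizer

REVISION = "step1000-tokens4B"  # assumed branch name; check the repo's branches

model = AutoModelForCausalLM.from_pretrained(
    "allenai/OLMo-7B", revision=REVISION, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/OLMo-7B", revision=REVISION, trust_remote_code=True
)
# From here one can, e.g., track weight norms or behavioural probes across steps.
```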

Significance

Dolma + OLMo set a new standard for reproducible language-model science: data, code, intermediate checkpoints, evaluation suites and final weights are all publicly available under documented licences. OLMo remains one of the very few models at this scale whose entire pre-training pipeline can be replicated end-to-end by external researchers.

Related terms: Common Crawl, The Stack and The Stack v2, The Pile, FineWeb and FineWeb-Edu, Language Model
