FineWeb is a 15-trillion-token English-language pre-training corpus released by Hugging Face in April 2024 (Penedo, Kydlíček, et al., arXiv:2406.17557). It was the first fully open dataset at Llama 3 scale and the first to publish, alongside the data, a controlled set of ablation studies justifying every filtering decision.
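The corpus is hosted on the Hugging Face Hub and can be streamed with the datasets library. The sketch below assumes the repository id and the 10B-token sample config as listed on the public dataset card, and that documents expose text, url and language_score fields; adjust the names if the Hub layout has changed.

```python
# Minimal sketch: streaming a FineWeb sample from the Hugging Face Hub.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # 10B-token sample; full dumps use snapshot-style config names
    split="train",
    streaming=True,       # avoid downloading the full multi-terabyte corpus
)

# Peek at a few documents and the metadata kept alongside the text.
for doc in fw.take(3):
    print(doc["url"], doc["language_score"], doc["text"][:80])
```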
FineWeb construction
FineWeb processes 96 Common Crawl snapshots from summer 2013 through April 2024 with the open-source datatrove pipeline (a simplified code sketch of these stages follows the list):
- URL filtering against a 4.6 M-domain blocklist of adult and unsafe sites.
- Trafilatura text extraction in place of the default WET extraction, recovering substantially cleaner main-content text.
- fastText language identification retaining only English with confidence > 0.65.
- Gopher-style quality filters (Rae et al. 2021) on document length, symbol ratios and repetition.
- MinHash-LSH near-deduplication within each snapshot only, a deliberate departure from across-dump deduplication, which the authors found hurt downstream performance because it strips out the common, frequently re-encountered examples that anchor model knowledge.
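The following is a rough, library-agnostic sketch of three of these stages (language identification, Gopher-style heuristics, and per-snapshot MinHash deduplication) using generic Python libraries (fasttext, datasketch) rather than the actual datatrove components; apart from the 0.65 language-confidence cutoff, the thresholds shown are illustrative placeholders, not FineWeb's exact settings.

```python
import fasttext
from datasketch import MinHash, MinHashLSH

# Path to a fastText language-identification model (assumed to be available locally).
lid_model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    """Keep documents identified as English above the confidence threshold."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

def passes_gopher_style(text: str) -> bool:
    """Simplified stand-ins for Gopher-style length, symbol-ratio and repetition rules."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):                         # document-length bounds
        return False
    if sum(c in "#…" for c in text) / max(len(text), 1) > 0.1:    # symbol-to-character ratio
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:              # crude duplicate-line check
        return False
    return True

def dedup_within_snapshot(docs: list[str], threshold: float = 0.75) -> list[str]:
    """MinHash-LSH near-deduplication applied to a single snapshot at a time."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(docs):
        mh = MinHash(num_perm=128)
        words = text.split()
        # Hash word 5-gram shingles of the document into its MinHash signature.
        for shingle in {" ".join(words[j:j + 5]) for j in range(max(len(words) - 4, 1))}:
            mh.update(shingle.encode("utf-8"))
        if not lsh.query(mh):          # no near-duplicate already kept in this snapshot
            lsh.insert(str(i), mh)
            kept.append(text)
    return kept
```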
The team published a series of 1.8B-parameter ablation models showing that FineWeb outperforms C4, RefinedWeb, RedPajama-V2 and Dolma v1.6 on an aggregate of downstream benchmarks including HellaSwag, ARC, MMLU and CommonsenseQA.
FineWeb-Edu
FineWeb-Edu is an educational-quality-filtered subset (about 1.3 T tokens in the default high-quality variant, 5.4 T tokens in the more permissive score-2 variant). Filtering uses an educational-quality classifier: a regression head on a Snowflake-arctic-embed encoder, trained on 450,000 samples scored 0-5 by Llama-3-70B-Instruct for how educational the text is at roughly primary-to-grade-school level. Documents scoring 3 or higher are retained.
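A minimal sketch of scoring one document with the released classifier, assuming the Hub repository id HuggingFaceFW/fineweb-edu-classifier and the single-logit regression output described above; the example text is made up.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "Photosynthesis converts light energy into chemical energy stored in glucose..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    # The model has a single regression output, so the raw logit is the 0-5 score.
    score = model(**inputs).logits.squeeze(-1).item()

print(f"educational score: {score:.2f}")
print("kept for FineWeb-Edu" if round(score) >= 3 else "filtered out")
```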
Models pre-trained on FineWeb-Edu show substantial gains on knowledge-intensive evaluations (MMLU +5 points, ARC +6 points) compared with FineWeb at equal compute, validating the quality-over-quantity hypothesis at trillion-token scale.
Licensing and reception
FineWeb is released under ODC-By 1.0 with the same downstream-responsibility caveat as Common Crawl. Hugging Face also released the full processing pipeline, the ablation model checkpoints, and a detailed technical report documenting the failure modes of every filter the authors tested.
Within months of release, FineWeb and FineWeb-Edu had become default open pre-training corpora, supplanting C4 and RedPajama in academic and industrial settings alike. SmolLM uses FineWeb-Edu as its primary pre-training data, while DCLM-Baseline and OLMo-2 build on closely related classifier-filtered Common Crawl recipes.
Related terms: Common Crawl, C4 (Colossal Clean Crawled Corpus), DCLM (DataComp-LM), RedPajama, Llama 3 / 3.1 / 3.3
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 15: Modern AI, Modern AI