FineWeb is a 15-trillion-token English-language pre-training corpus released by Hugging Face in April 2024 (Penedo, Kydlíček, et al., arXiv:2406.17557). It was the first fully open dataset at Llama 3 scale and the first to publish, alongside the data, a controlled set of ablation studies justifying every filtering decision.
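The corpus is hosted on the Hugging Face Hub and can be streamed with the datasets library. The sketch below assumes the repository id and the 10B-token sample config as listed on the public dataset card, and that documents expose text, url and language_score fields; adjust the names if the Hub layout has changed.

```python
# Minimal sketch: streaming a FineWeb sample from the Hugging Face Hub.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # 10B-token sample; full dumps use snapshot-style config names
    split="train",
    streaming=True,       # avoid downloading the full multi-terabyte corpus
)

# Peek at a few documents and the metadata kept alongside the text.
for doc in fw.take(3):
    print(doc["url"], doc["language_score"], doc["text"][:80])
```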
FineWeb construction
FineWeb processes 96 Common Crawl snapshots from summer 2013 through April 2024 with the open-source datatrove pipeline (a simplified code sketch of these stages follows the list):
- URL filtering against a 4.6 M-domain blocklist of adult and unsafe sites.
- Trafilatura text extraction in place of the default WET extraction, recovering substantially cleaner main-content text.
- fastText language identification retaining only English with confidence > 0.65.
- Gopher-style quality filters (Rae et al. 2021) on document length, symbol ratios and repetition.
- MinHash-LSH near-deduplication within each snapshot only, a deliberate departure from across-dump deduplication, which the authors found hurt downstream performance because it strips out the common, frequently re-encountered examples that anchor model knowledge.
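The following is a rough, library-agnostic sketch of three of these stages (language identification, Gopher-style heuristics, and per-snapshot MinHash deduplication) using generic Python libraries (fasttext, datasketch) rather than the actual datatrove components; apart from the 0.65 language-confidence cutoff, the thresholds shown are illustrative placeholders, not FineWeb's exact settings.

```python
import fasttext
from datasketch import MinHash, MinHashLSH

# Path to a fastText language-identification model (assumed to be available locally).
lid_model = fasttext.load_model("lid.176.bin")

def keep_english(text: str, threshold: float = 0.65) -> bool:
    """Keep documents identified as English above the confidence threshold."""
    labels, probs = lid_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= threshold

def passes_gopher_style(text: str) -> bool:
    """Simplified stand-ins for Gopher-style length, symbol-ratio and repetition rules."""
    words = text.split()
    if not (50 <= len(words) <= 100_000):                         # document-length bounds
        return False
    if sum(c in "#…" for c in text) / max(len(text), 1) > 0.1:    # symbol-to-character ratio
        return False
    lines = [line for line in text.splitlines() if line.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:              # crude duplicate-line check
        return False
    return True

def dedup_within_snapshot(docs: list[str], threshold: float = 0.75) -> list[str]:
    """MinHash-LSH near-deduplication applied to a single snapshot at a time."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, text in enumerate(docs):
        mh = MinHash(num_perm=128)
        words = text.split()
        # Hash word 5-gram shingles of the document into its MinHash signature.
        for shingle in {" ".join(words[j:j + 5]) for j in range(max(len(words) - 4, 1))}:
            mh.update(shingle.encode("utf-8"))
        if not lsh.query(mh):          # no near-duplicate already kept in this snapshot
            lsh.insert(str(i), mh)
            kept.append(text)
    return kept
```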
The team published a series of 1.8B-parameter ablation models showing that FineWeb outperforms C4, RefinedWeb, RedPajama-V2 and Dolma v1.6 on an aggregate of downstream benchmarks including HellaSwag, ARC, MMLU and CommonsenseQA.
FineWeb-Edu
FineWeb-Edu is an educational-quality-filtered subset (about 1.3 T tokens in the default high-quality variant, 5.4 T tokens in the more permissive score-2 variant). Filtering uses an educational-quality classifier: a regression head on a Snowflake-arctic-embed encoder, trained on 450,000 samples scored 0-5 by Llama-3-70B-Instruct for how educational the text is at roughly primary-to-grade-school level. Documents scoring 3 or higher are retained.
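A minimal sketch of scoring one document with the released classifier, assuming the Hub repository id HuggingFaceFW/fineweb-edu-classifier and the single-logit regression output described above; the example text is made up.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "Photosynthesis converts light energy into chemical energy stored in glucose..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    # The model has a single regression output, so the raw logit is the 0-5 score.
    score = model(**inputs).logits.squeeze(-1).item()

print(f"educational score: {score:.2f}")
print("kept for FineWeb-Edu" if round(score) >= 3 else "filtered out")
```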
Models pre-trained on FineWeb-Edu show substantial gains on knowledge-intensive evaluations (MMLU +5 points, ARC +6 points) compared with FineWeb at equal compute, validating the quality-over-quantity hypothesis at trillion-token scale.
Licensing and reception
FineWeb is released under ODC-By 1.0 with the same downstream-responsibility caveat as Common Crawl. Hugging Face also released the full processing pipeline, the ablation model checkpoints, and a detailed technical report documenting the failure modes of every filter the authors tested.
Within months of release, FineWeb and FineWeb-Edu had become default open pre-training corpora, supplanting C4 and RedPajama in academic and industrial settings alike. SmolLM uses FineWeb-Edu as its primary pre-training data, while DCLM-Baseline and OLMo-2 build on closely related classifier-filtered Common Crawl recipes.
Related terms: Common Crawl, C4 (Colossal Clean Crawled Corpus), DCLM (DataComp-LM), RedPajama, Llama 3 / 3.1 / 3.3
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 15: Modern AI, Modern AI