DataComp (Gadre, Ilharco, Fang et al., NeurIPS 2023, arXiv:2304.14108) is a community benchmark for image-text data curation, organised by a collaboration including the University of Washington, Columbia University, LAION, Hugging Face, Stability AI and Apple. It plays the same role for vision-language pre-training that DCLM plays for language-only pre-training.
CommonPool
DataComp begins with CommonPool, an unfiltered set of 12.8 billion image-text pairs extracted from Common Crawl (using the same alt-text recipe as LAION, but without LAION's CLIP filtering). CommonPool is provided as URLs with metadata; participants may filter or re-weight, but they may not introduce external pairs.
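Because CommonPool ships as URL-plus-metadata records rather than images, a submission is essentially a predicate over that metadata. A minimal sketch of the idea, with hypothetical field names (`uid`, `url`, `caption`) that stand in for the actual schema:

```python
# Illustrative metadata-level filtering over CommonPool-style records.
# Field names (uid, url, caption) are hypothetical, not the real schema.

def keep(record):
    """Hypothetical filter: caption must be between 5 and 50 words."""
    caption = record.get("caption", "")
    return 5 <= len(caption.split()) <= 50

pool = [
    {"uid": "a1", "url": "http://example.com/1.jpg",
     "caption": "a photo of a dog in the park"},
    {"uid": "b2", "url": "http://example.com/2.jpg",
     "caption": "img"},
]

# A submission is the list of pair identifiers to retain.
subset = [r["uid"] for r in pool if keep(r)]
```

Real submissions operate the same way at scale: they emit the set of pair identifiers to keep, and the benchmark harness downloads and trains on exactly that subset.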
Tracks
Participants choose a compute scale: small (12.8 M pool pairs, 12.8 M training samples seen), medium (128 M / 128 M), large (1.28 B / 1.28 B), or xlarge (12.8 B / 12.8 B), and submit a filtered subset. The benchmark trains a CLIP model with a fixed OpenCLIP recipe (ViT-B/32 at small and medium, ViT-B/16 at large, ViT-L/14 at xlarge) and evaluates on 38 downstream tasks: ImageNet zero-shot classification, a battery of ImageNet distribution-shift suites, VTAB, WILDS, and retrieval (MS-COCO, Flickr30k).
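Zero-shot classification, the headline evaluation, scores an image embedding against text embeddings of the class names ("a photo of a {class}") and predicts the nearest one. A schematic version with toy embeddings (the real evaluation uses OpenCLIP's trained encoders):

```python
import numpy as np

def zero_shot_predict(image_emb, class_embs):
    """Predict the class whose text embedding has the highest
    cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Toy example: 3 "class name" embeddings in a 4-d space.
class_embs = np.array([[1.0, 0.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0, 0.0],
                       [0.0, 0.0, 1.0, 0.0]])
image_emb = np.array([0.1, 0.9, 0.1, 0.0])
pred = zero_shot_predict(image_emb, class_embs)
```

No task-specific training happens at evaluation time, which is why data curation is the only lever a submission has.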
DataComp-1B
The reference filter, DataComp-1B, intersects CLIP-score filtering (top 30% of pairs by image-text similarity) with image-based filtering (keeping pairs whose images cluster near ImageNet classes), retaining roughly 1.4 billion of CommonPool's 12.8 billion pairs. Trained as a CLIP ViT-L/14, DataComp-1B achieves 79.2% ImageNet zero-shot accuracy, outperforming OpenAI's original CLIP ViT-L/14 (75.5%) and OpenCLIP's LAION-2B-trained ViT-L/14 at matched training compute, despite training on substantially less data.
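The CLIP-score component keeps the pairs whose image and caption embeddings agree most. A sketch with synthetic embeddings, using a quantile threshold to retain the top 30% (the real pipeline scores pairs with a pretrained CLIP model rather than random vectors):

```python
import numpy as np

def clip_score_filter(image_embs, text_embs, keep_frac=0.3):
    """Return indices of the top `keep_frac` pairs by cosine similarity
    between image and text embeddings."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)              # per-pair cosine similarity
    threshold = np.quantile(scores, 1.0 - keep_frac)
    return np.nonzero(scores >= threshold)[0]

# Synthetic pool: captions are noisy copies of their images' embeddings,
# so well-aligned pairs score higher and survive the filter.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(1000, 64))
text_embs = image_embs + rng.normal(scale=2.0, size=(1000, 64))
kept = clip_score_filter(image_embs, text_embs)     # ~300 best-aligned pairs
```

The threshold is relative (a quantile over the pool), so the same code keeps 30% of any pool regardless of its absolute similarity distribution.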
Significance
DataComp's central empirical finding is that CLIP-score-filtered Common Crawl pairs match or beat LAION-2B at equal compute. This has two consequences. First, it provides a clean-room alternative to LAION for organisations wary of LAION's CSAM history (DataComp runs its own NSFW filtering on CommonPool). Second, it institutionalises the data-curation-as-research-problem framing that has come to dominate frontier-model training.
The DataComp team later released DataComp-LM, a sibling benchmark for language-only pre-training, which became known as DCLM.
Related terms: LAION-400M and LAION-5B, CLIP, DCLM (DataComp-LM), Common Crawl
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 9: Neural Networks, Computer Vision