LAION-400M (Schuhmann et al., NeurIPS Data-Centric AI workshop 2021) and LAION-5B (Schuhmann, Beaumont, Vencu et al., NeurIPS Datasets and Benchmarks 2022) are the largest open image-text pair corpora ever released. They underlie nearly every open text-to-image model, including Stable Diffusion 1.x and 2.x, as well as academic replications of CLIP and ALIGN; Midjourney's early versions (v1-v3) are widely believed, though never confirmed, to have drawn on them as well.
Construction
LAION ("Large-scale Artificial Intelligence Open Network", a German non-profit) extracted HTML <img> alt-text pairs from Common Crawl WAT files (where alt-text is the natural-language caption attached to an image). The candidate pairs were then filtered with OpenAI's CLIP ViT-B/32 by computing the cosine similarity of the CLIP image and text embeddings and retaining pairs above a threshold: 0.3 for LAION-400M, and for LAION-5B 0.28 for English pairs and 0.26 for the multilingual portion. The result:
- LAION-400M, 400 million English image-text pairs (April 2021).
- LAION-5B, 5.85 billion image-text pairs (March 2022): 2.32 B English, 2.26 B other-language, 1.27 B unknown-language. Stored as URLs + metadata only, totalling 240 GB compressed.
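The filtering step reduces to thresholding the cosine similarity of paired CLIP embeddings. A minimal NumPy sketch, assuming the image and text embeddings have already been computed (CLIP ViT-B/32 produces 512-dimensional vectors); the function name and toy data are illustrative, not LAION's actual tooling:

```python
import numpy as np

def clip_filter(img_emb: np.ndarray, txt_emb: np.ndarray,
                threshold: float = 0.3) -> np.ndarray:
    """Boolean mask of image-text pairs whose CLIP cosine similarity
    exceeds `threshold` (0.3 for LAION-400M; LAION-5B used 0.28 for
    English and 0.26 for other languages)."""
    # L2-normalize each row so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = (img * txt).sum(axis=1)  # row-wise cosine similarity
    return sim > threshold

# Toy example with 3 candidate pairs of 512-d embeddings.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(3, 512))
txt_emb = img_emb + rng.normal(scale=5.0, size=(3, 512))  # weakly aligned
mask = clip_filter(img_emb, txt_emb)  # one keep/drop decision per pair
```

At LAION's scale this thresholding runs over billions of candidates, which is why the single scalar cutoff (and whatever biases CLIP encodes) so strongly shapes the final corpus.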
Models trained on LAION
Stable Diffusion 1.4 / 1.5 were fine-tuned on the LAION-Aesthetics v2 5+ subset (~600 M pairs with a predicted aesthetic score of 5 or higher). Stable Diffusion 2.x moved to LAION-2B-en. OpenCLIP (the open replication of CLIP) was trained on LAION-400M and LAION-2B-en, and the DataComp benchmark adopted LAION's CLIP-score filtering as one of its baselines. Imagen used LAION-400M in part, Midjourney v1-v3 are believed to have used LAION data, and many academic diffusion models followed.
CSAM controversy
In December 2023 the Stanford Internet Observatory (Thiel, Identifying and Eliminating CSAM in Generative ML Training Data and Models) used PhotoDNA and NCMEC hashes to identify 3,226 suspected child-sexual-abuse-material URLs within LAION-5B, over 1,000 of which were externally validated. LAION immediately took down both LAION-400M and LAION-5B from public download. The cleaned re-release, Re-LAION-5B-research-safe, became available in August 2024 with the flagged URLs and several thousand additional candidates removed.
The controversy had downstream consequences: Stability AI updated its terms to forbid generation of CSAM and shipped safety filters with newer model releases; hosting platforms removed Stable Diffusion 1.5 weights (Runway deleted its Hugging Face repository in 2024); and the legal exposure of training on a dataset that contained illegal material became an active question in jurisdictions including Germany, the UK, and the US.
Other concerns
LAION's CLIP-similarity filter introduces systematic biases: pairs that CLIP scores highly are over-represented, propagating CLIP's existing demographic biases into the next generation of models. The dataset contains trademarked imagery, copyrighted artworks scraped from artists' portfolios (the Andersen v. Stability AI class action centres on this), and personally identifiable medical imagery (in 2022 the artist known as Lapine discovered her private medical photographs in LAION-5B). It also has a heavy English-language and Western-aesthetic skew despite the multilingual portion.
Modern relevance
Despite its withdrawal and the controversy, LAION-5B remains the largest open image-text resource, and Re-LAION-5B-research-safe along with DataComp-1B are among the only viable open substrates for training competitive open text-to-image models. Frontier closed models (DALL-E 3, Imagen 3, Midjourney v6) train on proprietary substitutes that no external party can audit.
Related terms: CLIP, Stable Diffusion, Common Crawl, DataComp
Discussed in:
- Chapter 13: Attention & Transformers, Training Data and Web Corpora
- Chapter 16: Ethics & Safety, Ethics, Safety and Alignment