Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, & Ilya Sutskever (2021). Learning Transferable Visual Models From Natural Language Supervision.
arXiv.
DOI: https://doi.org/10.48550/arXiv.2103.00020
Abstract. Introduces CLIP (Contrastive Language-Image Pre-training), which jointly trains image and text encoders on 400 million image-text pairs using a contrastive objective. The shared embedding space enables zero-shot image classification and underpins modern text-to-image generation.
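The contrastive objective summarized above pairs each image with its caption in a batch and scores all N×N combinations. A minimal numpy sketch of that symmetric loss follows; the function name, batch shapes, and temperature value are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (N, d) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_probs[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly matched embeddings the loss approaches zero, while mismatched pairings are penalized, which is what pushes the two encoders toward a shared embedding space.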
Tags: multimodal clip contrastive
Cited in: