Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, & Ilya Sutskever (2021). Learning Transferable Visual Models From Natural Language Supervision.
arXiv.
DOI: https://doi.org/10.48550/arXiv.2103.00020
Abstract. Introduces CLIP (Contrastive Language-Image Pre-training), which jointly trains image and text encoders on 400 million image-text pairs using a contrastive objective. The shared embedding space enables zero-shot image classification and underpins modern text-to-image generation.
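The contrastive objective summarized above pairs each image with its caption in a batch and scores all N×N combinations. A minimal numpy sketch of that symmetric loss follows; the function name, batch shapes, and temperature value are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (N, d) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: entry (i, j) scores image i against text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -log_probs[idx, idx].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

With perfectly matched embeddings the loss approaches zero, while mismatched pairings are penalized, which is what pushes the two encoders toward a shared embedding space.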
Tags: multimodal clip contrastive
Cited in: