References

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, & Ilya Sutskever (2021)

arXiv.

DOI: https://doi.org/10.48550/arXiv.2103.00020

Abstract. Introduces CLIP (Contrastive Language-Image Pre-training), which jointly trains image and text encoders on 400 million image-text pairs using a contrastive objective. The shared embedding space enables zero-shot image classification and underpins modern text-to-image generation.
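The contrastive objective described above can be sketched in a few lines. This is a minimal NumPy illustration of a symmetric InfoNCE-style loss over a batch of paired embeddings, not CLIP's actual implementation; the function name, temperature default, and toy inputs are assumptions for illustration only.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Matched image-text pairs sit on the diagonal of the similarity
    matrix; the loss is the mean cross-entropy of classifying each
    image toward its text (rows) and each text toward its image
    (columns). Illustrative sketch only, not CLIP's exact code.
    """
    # L2-normalise so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(logits))         # pair i matches pair i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(len(y)), y].mean()

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))
```

With well-aligned pairs the diagonal dominates and the loss is low; scrambling the pairing raises it, which is the signal that pulls matched image and text embeddings together in the shared space.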

Tags: multimodal clip contrastive
