Learning Transferable Visual Models From Natural Language Supervision
Radford, Kim, Hallacy, et al.
What it says
CLIP jointly trains an image encoder (ResNet or ViT variants) and a transformer text encoder on 400M (image, caption) pairs scraped from the web. The loss is contrastive: within a batch of N pairs, the N correct image-text pairings should have the highest cosine similarity among all N² possible pairings, optimized as a symmetric cross-entropy in both the image→text and text→image directions. At inference, zero-shot classification works by encoding the candidate labels as text prompts ("a photo of a {label}") and picking the label whose text embedding is closest to the image embedding.
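A minimal NumPy sketch of the two mechanisms above: the symmetric contrastive loss over a batch, and the zero-shot decision rule. Function names are illustrative, and the temperature of 0.07 is only the paper's initial value (CLIP learns it during training).

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each comes from the same pair.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))     # pair i matches pair i (the diagonal)

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[lb, lb].mean()

    # Average of the image->text and text->image cross-entropies.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

def zero_shot_classify(img_emb, label_txt_embs):
    # Pick the label whose text embedding is closest (cosine) to the image.
    sims = l2_normalize(label_txt_embs) @ l2_normalize(img_emb)
    return int(np.argmax(sims))
```

Correctly matched batches score a much lower loss than mismatched ones, which is the gradient signal that pulls paired embeddings together and pushes everything else apart.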
Why it matters
CLIP made zero-shot image classification work at a useful level of quality and produced encoders whose features generalize to nearly any vision task. Its text encoder conditions Stable Diffusion's image generation, its image encoder is the standard vision backbone in multimodal LLMs, and its joint embedding space is the foundation for open-vocabulary detection and segmentation systems. For most "I have images and text" problems, CLIP (or SigLIP, its sigmoid-loss successor) is the first thing to try.
Read next
- ALIGN (Jia et al., 2021) — a concurrent paper with a similar recipe at larger scale, using noisier web data.
- SigLIP (Zhai et al., 2023) — replaces the softmax contrastive loss with a pairwise sigmoid loss, which scales better.
- OpenCLIP (Ilharco et al., 2022) — the open-weights reproduction that made CLIP broadly usable.