ICML 2021

Learning Transferable Visual Models From Natural Language Supervision

Radford, Kim, Hallacy, et al.

TL;DR
Train an image encoder and a text encoder to embed matching (image, caption) pairs nearby in a shared space. The payoff is zero-shot image classification over arbitrary, user-specified label sets.

What it says

CLIP jointly trains an image encoder (a ResNet or ViT) and a transformer text encoder on 400M (image, caption) pairs scraped from the web. The loss is contrastive: within a batch of N pairs, compute the cosine similarity between every image embedding and every text embedding, then apply a symmetric cross-entropy so that each of the N correct pairings scores highest among the N² candidates. At inference, zero-shot classification works by encoding each candidate label as a text prompt ("a photo of a {label}") and picking the label whose text embedding is closest to the image embedding.
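The objective and the zero-shot inference step can be sketched in a few lines of NumPy. This is a hypothetical minimal sketch, not the paper's code: the real model learns the temperature during training and uses very large batches.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N (image, text) pairs.

    img_emb, txt_emb: (N, D) arrays; row i of each encodes the same pair.
    Sketch only: CLIP treats the temperature as a learned parameter.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N); correct pairs on the diagonal

    def xent_diag(l):
        # Cross-entropy with the diagonal entry as each row's target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the image->text and text->image directions.
    return (xent_diag(logits) + xent_diag(logits.T)) / 2

def zero_shot_classify(img_emb, label_txt_embs):
    """Return the index of the label prompt closest to the image embedding."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = label_txt_embs / np.linalg.norm(label_txt_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))
```

In practice the label prompts ("a photo of a dog", "a photo of a cat", …) are encoded once, and classifying an image is then a single matrix-vector product.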

Why it matters

CLIP made zero-shot image classification work at a useful level of quality and produced encoders whose features transfer to nearly any vision task. Its text encoder conditions Stable Diffusion, its image encoder serves as the vision tower in many multimodal LLMs, and it underpins open-vocabulary detection and segmentation systems. For most “I have images and text” problems, CLIP (or SigLIP, its successor) is the first thing to try.

Related

  • ALIGN (Jia et al., 2021) — a concurrent paper with a similar recipe at larger scale.
  • SigLIP (Zhai et al., 2023) — replaces the softmax contrastive loss with a pairwise sigmoid loss, which scales better with batch size.
  • OpenCLIP (Ilharco et al., 2022) — the open-weights reproduction that made CLIP broadly usable.