ICLR 2021

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, Beyer, Kolesnikov, et al.

TL;DR
Slice an image into 16x16 patches, treat them as tokens, run a plain transformer. With enough data, it matches or beats CNNs on ImageNet and other benchmarks.

What it says

The Vision Transformer (ViT) takes an image, cuts it into non-overlapping 16x16 patches, flattens each patch, linearly projects it to an embedding, adds a positional embedding, and feeds the resulting sequence to a standard transformer encoder. A learnable [CLS] token is prepended to the sequence, and its final representation feeds the classification head. Trained on small datasets like ImageNet-1k, ViT loses to ResNet — CNNs have helpful inductive biases (locality, translation equivariance) that ViT lacks. Trained on hundreds of millions of images (JFT-300M), ViT matches or beats CNNs and scales more gracefully with data and compute.
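The patch-embedding step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the sizes (224x224 input, 768-dim embeddings) match the ViT-Base config, but the projection matrix, [CLS] token, and positional embeddings are random stand-ins for what would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 224x224 RGB image, 16x16 patches, 768-dim embeddings.
H = W = 224
P = 16                     # patch size
C = 3                      # channels
D = 768                    # embedding dimension
N = (H // P) * (W // P)    # number of patches: 14 * 14 = 196

image = rng.standard_normal((H, W, C))

# Cut into non-overlapping PxP patches and flatten each to a P*P*C vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (196, 768)

# Learnable parameters in the real model; random stand-ins here.
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
cls_token = rng.standard_normal((1, D)) * 0.02
pos_embed = rng.standard_normal((N + 1, D)) * 0.02

tokens = patches @ W_proj                        # linear patch embeddings
tokens = np.concatenate([cls_token, tokens], 0)  # prepend [CLS] -> (197, 768)
tokens = tokens + pos_embed                      # add positional embeddings

print(tokens.shape)  # (197, 768)
```

The resulting (197, 768) sequence is what the standard transformer encoder consumes; everything after this point is the ordinary architecture from the NLP literature.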

Why it matters

ViT broke the assumption that computer vision needed convolution. It opened the door to unified architectures across modalities (the same transformer now backs language, vision, audio, and multimodal models) and to very large vision models like those inside CLIP, DINOv2, and modern multimodal LLMs. For large-scale vision work, transformers are now the default.

  • Swin Transformer (Liu et al., 2021) — shifted-window attention for better CV inductive bias.
  • CLIP (Radford et al., 2021) — the language-supervised ViT that powers open-vocabulary vision.
  • DINOv2 (Oquab et al., 2023) — strong self-supervised ViT representations.