An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, Beyer, Kolesnikov, et al.
What it says
The Vision Transformer (ViT) takes an image, cuts it into non-overlapping 16x16 patches, flattens each patch, linearly projects it to an embedding, adds a positional embedding, and feeds the resulting sequence to a standard transformer encoder. A learnable [CLS] token is prepended to the sequence, and its final-layer representation feeds the classification head. Trained on mid-sized datasets like ImageNet-1k, ViT loses to comparable ResNets: CNNs bake in locality and translation equivariance, inductive biases that ViT must instead learn from data. Pre-trained on hundreds of millions of images (JFT-300M), ViT matches or beats state-of-the-art CNNs and scales more gracefully with data and compute.
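The patch-embedding pipeline above can be sketched in a few lines of NumPy. This is a shape-level illustration for a 224x224 RGB image, not the paper's implementation: the projection matrix, [CLS] token, and positional embeddings are random or zero stand-ins for what would be learned parameters.

```python
import numpy as np

def patchify(img, patch=16):
    """Cut an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    p = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    # Reorder to (grid_h, grid_w, patch, patch, C), then flatten each patch.
    return p.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))

patches = patchify(img)                        # (196, 768): 14x14 patches, 16*16*3 each
d_model = 768
W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # stand-in for learned projection
tokens = patches @ W_proj                      # (196, 768)

cls = np.zeros((1, d_model))                   # stand-in for the learnable [CLS] token
seq = np.concatenate([cls, tokens], axis=0)    # (197, 768)
seq = seq + rng.standard_normal(seq.shape) * 0.02  # stand-in for learned positional embeddings
print(seq.shape)  # (197, 768)
```

The sequence of 197 tokens then goes straight into a standard transformer encoder; the only image-specific machinery is this patchification step.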
Why it matters
ViT broke the assumption that computer vision needed convolution. It opened the door to unified architectures across modalities (the same transformer now backs language, vision, audio, and multimodal models) and to very large vision models like those inside CLIP, DINOv2, and modern multimodal LLMs. For large-scale vision work, transformers are now the default.
Read next
- Swin Transformer (Liu et al., 2021) — shifted-window attention for better CV inductive bias.
- CLIP (Radford et al., 2021) — the language-supervised ViT that powers open-vocabulary vision.
- DINOv2 (Oquab et al., 2023) — strong self-supervised ViT representations.