ICLR 2021

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, Beyer, Kolesnikov, et al.

TL;DR
Slice an image into 16x16 patches, treat them as tokens, run a plain transformer. With enough data, it matches or beats CNNs on ImageNet and other benchmarks.

What it says

The Vision Transformer (ViT) takes an image, cuts it into non-overlapping 16x16 patches, flattens each patch, linearly projects it to an embedding, adds a positional embedding, and feeds the resulting sequence to a standard transformer encoder. A learnable [CLS] token is prepended to the sequence, and its final representation feeds the classification head. Trained on small datasets like ImageNet-1k, ViT loses to ResNet — CNNs have helpful inductive biases (locality, translation equivariance) that ViT lacks. Trained on hundreds of millions of images (JFT-300M), ViT matches or beats CNNs and scales more gracefully with data and compute.
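The patch-embedding step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the sizes (224x224 input, 768-dim embeddings) match the ViT-Base config, but the projection matrix, [CLS] token, and positional embeddings are random stand-ins for what would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 224x224 RGB image, 16x16 patches, 768-dim embeddings.
H = W = 224
P = 16                     # patch size
C = 3                      # channels
D = 768                    # embedding dimension
N = (H // P) * (W // P)    # number of patches: 14 * 14 = 196

image = rng.standard_normal((H, W, C))

# Cut into non-overlapping PxP patches and flatten each to a P*P*C vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (196, 768)

# Learnable parameters in the real model; random stand-ins here.
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
cls_token = rng.standard_normal((1, D)) * 0.02
pos_embed = rng.standard_normal((N + 1, D)) * 0.02

tokens = patches @ W_proj                        # linear patch embeddings
tokens = np.concatenate([cls_token, tokens], 0)  # prepend [CLS] -> (197, 768)
tokens = tokens + pos_embed                      # add positional embeddings

print(tokens.shape)  # (197, 768)
```

The resulting (197, 768) sequence is what the standard transformer encoder consumes; everything after this point is the ordinary architecture from the NLP literature.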

Why it matters

ViT broke the assumption that computer vision needed convolution. It opened the door to unified architectures across modalities (the same transformer now backs language, vision, audio, and multimodal models) and to very large vision models like those inside CLIP, DINOv2, and modern multimodal LLMs. For large-scale vision work, transformers are now the default.

  • Swin Transformer (Liu et al., 2021) — shifted-window attention for better CV inductive bias.
  • CLIP (Radford et al., 2021) — the language-supervised ViT that powers open-vocabulary vision.
  • DINOv2 (Oquab et al., 2023) — strong self-supervised ViT representations.