2017 NeurIPS
Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
TL;DR
Replaces recurrence and convolutions with self-attention. Introduces the Transformer architecture that powers every modern LLM.
Why it matters
This is the paper. Every LLM you’ve ever heard of — GPT, Claude, Gemini, LLaMA, Mistral — descends directly from the architecture introduced here. Before this paper, sequence modeling was dominated by RNNs and LSTMs, which processed tokens one at a time and were painful to parallelize on GPUs. The Transformer threw that out and won.
Key contributions
- Self-attention as the only sequence operator. No recurrence, no convolution. Each token attends to every other token via scaled dot-product attention.
- Multi-head attention. Several attention heads run in parallel and capture different relationships between tokens.
- Positional encodings (sinusoidal) inject order information into an otherwise permutation-invariant operator.
- Encoder-decoder layout with residual connections, layer norm, and position-wise feed-forward networks. The basic block has been remixed thousands of times since but the recipe is largely unchanged.
- Massive parallelism. Because every position can be processed at once, training scales beautifully on GPUs and TPUs. This is the property that unlocked the LLM era.
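The scaled dot-product attention from the contributions above fits in a few lines. This is a minimal NumPy sketch (not the paper's code; the function name and toy shapes are mine) of single-head self-attention:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — the core operation of the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (seq_q, seq_k) similarity
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V                              # weighted mix of value vectors

# Toy example: 4 tokens, model dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

Multi-head attention is just this operation run h times on learned linear projections of Q, K, V, with the h outputs concatenated and projected again; because nothing here is sequential over positions, every token is processed at once.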
Why it still matters in 2026
- The decoder block from this paper is still the unit cell of GPT-style models nine years later.
- The math, `softmax(Q @ K.T / sqrt(d_k)) @ V`, is unchanged. Modern improvements (FlashAttention, GQA, RoPE) optimize how it runs and how positions are encoded, not what it computes.
- Reading this paper teaches you the mental model used in every follow-up paper. It’s the rare 15-page paper that pays back the time twentyfold.
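Since position encoding is the part that later work (RoPE) most often replaces, here is a small NumPy sketch of the paper's original sinusoidal scheme (function name and toy shapes are mine): even dimensions get `sin(pos / 10000^(2i/d))`, odd dimensions the matching `cos`.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2) even dim indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = sinusoidal_positions(16, 8)
print(pe.shape)  # (16, 8): added to token embeddings before the first layer
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique fingerprint and relative offsets correspond to fixed linear transforms, which is why the authors hoped it would extrapolate to longer sequences.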
Follow-up reading
- BERT (2018) — encoder-only Transformer, masked language modeling.
- GPT-2 (2019) and GPT-3 (2020) — scale the decoder. Discover that scale alone unlocks emergent capabilities.
- T5 (2019) — encoder-decoder, all NLP tasks as text-to-text.
- FlashAttention (2022) — how to make attention fast and memory-light on real hardware.
- RoPE / RoFormer (2021) — better positional encoding, used by LLaMA.
→ Internal: Transformers & Attention