NeurIPS 2020

Language Models are Few-Shot Learners

Brown, Mann, Ryder, Subbiah, et al.

TL;DR
Scales the GPT decoder to 175B parameters and shows that a single model, with no gradient updates, can do many tasks from a handful of in-context examples.

What it says

The authors train a 175B-parameter autoregressive transformer (GPT-3) on a filtered Common Crawl corpus plus books and Wikipedia. With no fine-tuning at all, they show that dropping a few examples into the prompt lets GPT-3 do translation, question answering, arithmetic, SAT-style analogies, code generation, and more. Performance scales smoothly with model size — bigger models are not just better; they unlock qualitatively new behaviors from in-context examples alone.
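The core mechanic is simple enough to sketch: the "training signal" is just k worked examples concatenated into the prompt, and the model is asked to complete the next line. A minimal illustration, assuming a translation task framing of my own (the helper name, separator, and examples are illustrative, not from the paper):

```python
# Sketch of GPT-3-style few-shot prompting: no gradient updates occur;
# the task is conveyed entirely by k demonstrations placed in the prompt.
# The "=>" separator and example pairs are hypothetical choices for clarity.

def few_shot_prompt(task_description, examples, query):
    """Assemble an in-context learning prompt from k demonstrations."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to complete this line
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("cheese", "fromage"), ("house", "maison")],
    "cat",
)
print(prompt)
```

The paper's finding is that, at 175B parameters, completing such a prompt correctly works across many tasks, and that zero-shot (no examples) and one-shot variants of the same format also improve sharply with scale.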

Why it matters

GPT-3 is the paper that made “prompt it” a serious engineering approach. It validated decoder-only scaling, in-context learning, and general-purpose LLMs as products. Everything in the current chat-assistant ecosystem — ChatGPT, Claude, Gemini — is a descendant of the recipe established here.

Related reading

  • Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the loss curves behind the bet on scale.
  • InstructGPT (Ouyang et al., 2022) — turning a base GPT into a helpful assistant.
  • Chinchilla (Hoffmann et al., 2022) — compute-optimal scaling, correcting GPT-3-era assumptions.