NeurIPS 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou

TL;DR
Asking a large model to "think step by step" — or showing a few examples that include reasoning traces — dramatically improves accuracy on math and multi-step problems.

What it says

For arithmetic, commonsense, and symbolic reasoning tasks, instructing or few-shot-prompting a model to produce intermediate reasoning steps before its final answer significantly improves accuracy, but only for sufficiently large models. On GSM8K, PaLM-540B jumps from ~18% to ~57% solve rate. The effect is emergent: smaller models don’t benefit and sometimes get worse, and the gains are sensitive to the exact prompt format.
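The few-shot recipe is just string construction: prepend one or more worked examples whose answers spell out the intermediate steps, then append the new question. A minimal sketch (the exemplar paraphrases the paper's well-known tennis-ball example; the function name is my own, not from the paper):

```python
# Sketch of few-shot chain-of-thought prompt construction.
# The exemplar paraphrases the tennis-ball example from the paper's Figure 1.

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example whose answer includes the reasoning trace."""
    exemplar = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    )
    return exemplar + f"Q: {question}\nA:"

prompt = build_cot_prompt(
    "A baker had 23 apples and used 20 to make pies. How many are left?"
)
```

Because the exemplar's answer ends with a reasoning trace rather than a bare number, the model's completion tends to continue in the same step-by-step style before committing to an answer.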

Why it matters

Chain-of-thought is the simplest and most durable prompting trick in the LLM toolkit. It spawned an entire literature: self-consistency, tree-of-thoughts, step-by-step verification, and eventually the reasoning-trained models (o1, R1) that internalize CoT. If you’re evaluating an LLM on anything multi-step, “let’s think step by step” is table stakes.

  • Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022) — sample many CoT traces and vote.
  • Tree of Thoughts (Yao et al., 2023) — search over reasoning trees instead of linear traces.
  • Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) — shows that just “let’s think step by step” works zero-shot, with no exemplars at all.
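The self-consistency idea reduces to a majority vote: sample several CoT completions at nonzero temperature, parse the final answer out of each, and return the most common one. A sketch under the assumption that answers have already been extracted as strings (the function name and sample data are illustrative, not from any of the papers):

```python
from collections import Counter

def self_consistency_vote(sampled_answers: list[str]) -> str:
    """Majority vote over final answers parsed from sampled CoT traces."""
    counts = Counter(sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Stand-in for final answers extracted from five sampled reasoning traces.
samples = ["11", "11", "12", "11", "9"]
print(self_consistency_vote(samples))  # → 11
```

The intuition: correct reasoning paths tend to converge on the same final answer, while errors scatter, so voting filters out unlucky traces.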