Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou
What it says
For arithmetic, commonsense, and symbolic reasoning tasks, prompting a model with a few exemplars that spell out intermediate reasoning steps before the final answer substantially improves accuracy, but only for sufficiently large models. On GSM8K, PaLM-540B jumps from ~18% with standard prompting to ~57% with chain-of-thought. The gains are emergent: small models don't benefit and sometimes do worse, and results are sensitive to prompt format.
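The few-shot setup is mechanically simple: each exemplar pairs a question with a worked rationale that ends in the answer, and the test question is appended last. A minimal sketch of a prompt builder in that shape (the tennis-ball exemplar echoes the paper's Figure 1; the builder function and its name are illustrative, not from the paper):

```python
# Few-shot chain-of-thought prompt: worked exemplars, then the new question.
# The paper uses 8 hand-written exemplars of this shape for GSM8K.

EXEMPLARS = [
    {
        "question": "Roger has 5 tennis balls. He buys 2 more cans of "
                    "tennis balls. Each can has 3 tennis balls. How many "
                    "tennis balls does he have now?",
        "rationale": "Roger started with 5 balls. 2 cans of 3 tennis balls "
                     "each is 6 tennis balls. 5 + 6 = 11.",
        "answer": "11",
    },
]

def build_cot_prompt(question: str) -> str:
    """Concatenate worked exemplars, then the unanswered test question."""
    parts = []
    for ex in EXEMPLARS:
        parts.append(f"Q: {ex['question']}\n"
                     f"A: {ex['rationale']} The answer is {ex['answer']}.")
    parts.append(f"Q: {question}\nA:")  # model continues from here
    return "\n\n".join(parts)

print(build_cot_prompt("If 3 cars each have 4 wheels, how many wheels in total?"))
```

The trailing `A:` is what licenses the model to emit a rationale before its answer; with standard prompting the exemplars would pair each question directly with its final answer.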
Why it matters
Chain-of-thought is the simplest and most durable prompting trick in the LLM toolkit. It spawned an entire literature: self-consistency, tree-of-thoughts, step-by-step verification, and eventually the reasoning-trained models (o1, R1) that internalize CoT. If you’re evaluating an LLM on anything multi-step, “let’s think step by step” is table stakes.
Read next
- Self-Consistency Improves Chain of Thought (Wang et al., 2022) — sample many CoT traces and vote.
- Tree of Thoughts (Yao et al., 2023) — search over reasoning trees instead of linear traces.
- Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) — shows that just “let’s think step by step” works zero-shot.
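The self-consistency follow-up in the first item above is a small amount of machinery on top of CoT: sample several completions at nonzero temperature, extract each trace's final answer, and return the most common one. A sketch of just the aggregation step, assuming the final answers have already been extracted from sampled traces (the function name is my own):

```python
from collections import Counter

def self_consistency_vote(final_answers: list[str]) -> str:
    """Majority vote over final answers from sampled CoT traces.

    The sampling itself (temperature > 0, n completions) happens at the
    model API; this is only the voting step described in Wang et al.
    """
    counts = Counter(final_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Three sampled traces might disagree; the vote smooths over one bad trace:
print(self_consistency_vote(["11", "12", "11"]))  # → 11
```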