Training Compute-Optimal Large Language Models
Hoffmann, Borgeaud, Mensch, et al. (2022)
What it says
DeepMind trains over 400 models across a range of sizes and token budgets, then fits a joint loss function in parameter count N and training tokens D (of the form L(N, D) = E + A/N^α + B/D^β). The conclusion overturns the compute-allocation prescription of the Kaplan et al. (2020) scaling laws: for compute-optimal training, parameters and tokens should scale in roughly equal proportion, about 20 tokens per parameter. They train Chinchilla (70B parameters, 1.4T tokens) with the same compute budget as Gopher (280B parameters), and Chinchilla wins on essentially every benchmark.
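The 20-tokens-per-parameter rule turns into a simple allocation: with the common approximation that dense-transformer training costs C ≈ 6ND FLOPs, fixing D = 20N gives N = sqrt(C/120). A minimal sketch of that arithmetic (the 6ND cost model and the flat 20:1 ratio are rough rules of thumb, not the paper's fitted coefficients):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget into model size and token count.

    Assumes C ~ 6 * N * D training FLOPs (standard dense-transformer
    approximation) and D ~ tokens_per_param * N (the rough Chinchilla
    ratio). Both are simplifications for illustration.
    """
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens ~ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")  # ~70B, ~1.40T
```

Running the Gopher-scale budget through this recovers roughly Chinchilla's actual configuration, which is the point of the paper: at fixed compute, a 70B model trained on 1.4T tokens beats a 280B model trained on far fewer.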
Why it matters
Chinchilla reframed how the field thinks about LLM scaling: GPT-3, Gopher, and the other early large models were badly undertrained relative to their size. Post-Chinchilla, models like LLaMA and the open-weights wave that followed deliberately trained smaller models on far more data, which is why a 7B model you can run locally today beats the original GPT-3 at a fraction of the cost.
Read next
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the earlier scaling laws Chinchilla corrects.
- LLaMA (Touvron et al., 2023) — a concrete recipe built on Chinchilla’s data-heavy principle.
- Training Compute-Optimal Large Language Models revisited (various replications) — follow-up analyses of the token-per-parameter ratio.