Training Compute-Optimal Large Language Models
Hoffmann, Borgeaud, Mensch, et al. (2022)
What it says
DeepMind trains over 400 models across a range of sizes and token budgets, then fits a joint loss function in parameter count N and training tokens D (of the form L(N, D) = E + A/N^α + B/D^β). The conclusion overturns the compute-allocation prescription of the Kaplan et al. (2020) scaling laws: for compute-optimal training, parameters and tokens should scale in roughly equal proportion, about 20 tokens per parameter. They train Chinchilla (70B parameters, 1.4T tokens) with the same compute budget as Gopher (280B parameters), and Chinchilla wins on essentially every benchmark.
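The 20-tokens-per-parameter rule turns into a simple allocation: with the common approximation that dense-transformer training costs C ≈ 6ND FLOPs, fixing D = 20N gives N = sqrt(C/120). A minimal sketch of that arithmetic (the 6ND cost model and the flat 20:1 ratio are rough rules of thumb, not the paper's fitted coefficients):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget into model size and token count.

    Assumes C ~ 6 * N * D training FLOPs (standard dense-transformer
    approximation) and D ~ tokens_per_param * N (the rough Chinchilla
    ratio). Both are simplifications for illustration.
    """
    # C = 6 * N * (r * N)  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla's budget: 6 * 70e9 params * 1.4e12 tokens ~ 5.88e23 FLOPs
n, d = chinchilla_optimal(5.88e23)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.2f}T")  # ~70B, ~1.40T
```

Running the Gopher-scale budget through this recovers roughly Chinchilla's actual configuration, which is the point of the paper: at fixed compute, a 70B model trained on 1.4T tokens beats a 280B model trained on far fewer.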
Why it matters
Chinchilla reframed how the field thinks about LLM scaling: GPT-3, Gopher, and the other early large models were badly undertrained relative to their size. Post-Chinchilla, models like LLaMA and the open-weights wave that followed deliberately trained smaller models on far more data, which is why a 7B model you can run locally today beats the original GPT-3 at a fraction of the cost.
Read next
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — the earlier scaling laws Chinchilla corrects.
- LLaMA (Touvron et al., 2023) — a concrete recipe built on Chinchilla’s data-heavy principle.
- Training Compute-Optimal Large Language Models revisited (various replications) — follow-up analyses of the token-per-parameter ratio.