Training Language Models to Follow Instructions with Human Feedback
Ouyang, Wu, Jiang, et al.
What it says
OpenAI describes the three-stage pipeline that shipped as InstructGPT. Stage 1: supervised fine-tuning (SFT) on human-written demonstrations of ideal responses. Stage 2: collect human preferences between pairs of model outputs and train a reward model to predict them. Stage 3: use PPO to fine-tune the SFT model against the reward model, with a KL penalty back to the SFT reference to avoid drift. In blind human evals, the 1.3B InstructGPT model is preferred over the 175B base GPT-3.
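The two learning signals in stages 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the reward model is trained with a pairwise logistic (Bradley-Terry style) loss so the preferred response scores higher, and the PPO stage optimizes the reward-model score minus a KL term toward the SFT reference. The `beta` coefficient here is an illustrative placeholder, not the paper's value.

```python
import math

def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the reward model scores the human-preferred
    response above the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def ppo_reward(rm_score: float, logp_policy: float, logp_ref: float,
               beta: float = 0.02) -> float:
    """Shaped reward for stage 3: reward-model score minus a KL-style
    penalty (log-prob gap to the frozen SFT reference). The penalty
    grows as the policy drifts from the reference, anchoring it.
    beta is an illustrative coefficient, not the paper's setting."""
    return rm_score - beta * (logp_policy - logp_ref)
```

Intuition check: when chosen and rejected scores tie, the loss is log 2; widening the margin drives it toward zero. When the policy matches the reference exactly, the shaped reward is just the reward-model score.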
Why it matters
This is the paper behind ChatGPT and the reason every production chat model since then has gone through a preference-learning stage. It made clear that alignment and usefulness are separate dimensions from raw capability — a smaller model trained with RLHF can feel much better to users than a larger untuned one. The exact recipe (SFT → RM → PPO) is still the baseline to beat.
Read next
- Constitutional AI (Bai et al., 2022) — replacing some human labels with model-generated critiques.
- Direct Preference Optimization (Rafailov et al., 2023) — skip the RM and PPO stages.
- LLaMA-2 paper (Touvron et al., 2023) — a very detailed open account of the same recipe at scale.