Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafailov, Sharma, Mitchell, Ermon, Manning, Finn
What it says
Classical RLHF trains a reward model from preferences, then runs PPO against it. DPO observes that, under the standard KL-regularized RL objective, the optimal policy has a closed-form relationship to the reward function, and that relationship can be inverted to give a simple classification-style loss directly on (prompt, preferred, rejected) triples. No reward model, no rollout sampling, no value function — just a forward pass on the policy and the reference model.
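Concretely, the loss is a logistic loss on the difference of policy-vs-reference log-ratios between the preferred and rejected completions. A minimal sketch (function name, arguments, and the numeric values are illustrative, not from the paper's code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, preferred, rejected) triple.

    Each argument is the summed token log-probability of a completion
    under the policy (pi_*) or the frozen reference model (ref_*).
    beta scales the implicit KL penalty, as in the paper's objective.
    """
    # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected)
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(logits): a binary classification loss on the preference
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the preferred completion, the loss falls toward zero.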
Why it matters
DPO is much simpler and cheaper to implement than PPO-based RLHF, with comparable or better results on public benchmarks. It has become the default preference-learning method in open-weights fine-tuning stacks and has inspired a family of follow-ups (IPO, KTO, ORPO, SimPO), each tweaking the loss or the handling of the reference model.
Read next
- InstructGPT (Ouyang et al., 2022) — the PPO-based RLHF pipeline that DPO replaces.
- KTO (Ethayarajh et al., 2024) — preference learning from unpaired good/bad labels.
- SimPO (Meng et al., 2024) — reference-free variant.