NeurIPS 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov, Sharma, Mitchell, Ermon, Manning, Finn

TL;DR
Derive a closed-form loss that optimizes a policy against preference data without training a separate reward model or running PPO. Much simpler than RLHF, competitive quality.

What it says

Classical RLHF trains a reward model from preferences, then runs PPO against it. DPO observes that, under the standard KL-regularized RL objective, the optimal policy has a closed-form relationship to the reward function, and that relationship can be inverted to give a simple classification-style loss directly on (prompt, preferred, rejected) triples. No reward model, no rollout sampling, no value function — just a forward pass on the policy and the reference model.
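The resulting loss is just negative log-sigmoid of a scaled log-ratio margin. Below is a minimal sketch for a single (preferred, rejected) pair, assuming you already have sequence log-probabilities from the policy and the frozen reference model; the function name and scalar inputs are illustrative, not from the paper's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are sequence log-probabilities log pi(y|x) under the policy
    and the reference model; beta scales the implicit KL penalty that
    keeps the policy close to the reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) == softplus(-margin), computed stably
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

When the policy equals the reference, the margin is zero and the loss is log 2; pushing probability mass from the rejected toward the preferred completion (relative to the reference) drives the loss down. In practice this is computed over batches of token-summed log-probs, but the gradient signal is exactly this classification-style objective.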

Why it matters

DPO is much simpler and cheaper to implement than PPO-based RLHF, with comparable or better results on public benchmarks. It’s become the default preference-learning method in open-weights fine-tuning stacks and inspired a family of follow-ups: IPO, KTO, ORPO, SimPO, each tweaking the loss or reference handling.

  • InstructGPT (Ouyang et al., 2022) — the PPO-based RLHF pipeline that DPO replaces.
  • KTO (Ethayarajh et al., 2024) — preference learning from unpaired good/bad labels.
  • SimPO (Meng et al., 2024) — reference-free variant.