NeurIPS 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafailov, Sharma, Mitchell, Ermon, Manning, Finn

TL;DR
Derive a closed-form loss that optimizes a policy against preference data without training a separate reward model or running PPO. Much simpler than RLHF, competitive quality.

What it says

Classical RLHF trains a reward model from preferences, then runs PPO against it. DPO observes that, under the standard KL-regularized RL objective, the optimal policy has a closed-form relationship to the reward function, and that relationship can be inverted to give a simple classification-style loss directly on (prompt, preferred, rejected) triples. No reward model, no rollout sampling, no value function — just a forward pass on the policy and the reference model.
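The resulting loss is just negative log-sigmoid of a scaled log-ratio margin. Below is a minimal sketch for a single (preferred, rejected) pair, assuming you already have sequence log-probabilities from the policy and the frozen reference model; the function name and scalar inputs are illustrative, not from the paper's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are sequence log-probabilities log pi(y|x) under the policy
    and the reference model; beta scales the implicit KL penalty that
    keeps the policy close to the reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin) == softplus(-margin), computed stably
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
```

When the policy equals the reference, the margin is zero and the loss is log 2; pushing probability mass from the rejected toward the preferred completion (relative to the reference) drives the loss down. In practice this is computed over batches of token-summed log-probs, but the gradient signal is exactly this classification-style objective.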

Why it matters

DPO is much simpler and cheaper to implement than PPO-based RLHF, with comparable or better results on public benchmarks. It’s become the default preference-learning method in open-weights fine-tuning stacks and inspired a family of follow-ups: IPO, KTO, ORPO, SimPO, each tweaking the loss or reference handling.

  • InstructGPT (Ouyang et al., 2022) — the PPO-based RLHF pipeline that DPO replaces.
  • KTO (Ethayarajh et al., 2024) — preference learning from unpaired good/bad labels.
  • SimPO (Meng et al., 2024) — reference-free variant.