RLHF
Reinforcement Learning from Human Feedback: fine-tuning a language model using a reward model trained on human preference data, with reinforcement learning to optimize for the reward.
In one line
Train a reward model from human preference comparisons, then use RL (typically PPO) to fine-tune the language model to score high under that reward.
What it actually means
The RLHF recipe, as popularized by InstructGPT, has three stages. First, supervised fine-tuning on demonstrations of good behavior. Second, collect human preferences: show two model outputs for the same prompt, ask which is better, train a reward model to predict those preferences. Third, use PPO (or similar) to fine-tune the policy so it maximizes the reward model’s score, with a KL penalty against the original SFT model to keep it from drifting into nonsense. DPO, a newer method, skips the explicit reward model and reward-maximization step — it derives an equivalent loss directly from the preferences.
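The reward model in stage two is usually trained with a Bradley-Terry pairwise loss: maximize the probability that the human-chosen response scores higher than the rejected one. A minimal sketch of that objective on scalar scores (the function names and example values are illustrative, not from any particular library):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log P(chosen beats rejected).

    r_chosen and r_rejected are the reward model's scalar scores for
    the human-preferred and human-rejected responses to the same prompt.
    """
    return -math.log(sigmoid(r_chosen - r_rejected))

# Loss is small when the reward model already ranks the chosen response
# higher, and large when it ranks the rejected one higher.
agree = preference_loss(2.0, 0.0)
disagree = preference_loss(0.0, 2.0)
```

Training the reward model means minimizing this loss over many (prompt, chosen, rejected) triplets; DPO rearranges the same preference probability into a loss on the policy's own log-probabilities, which is why no separate reward model is needed.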
Why it matters
RLHF is what turned raw next-token predictors into chat assistants. The difference between a base model and an RLHF’d model is enormous: base models complete text, RLHF models follow instructions, refuse harmful requests, and stay on topic. Almost every production chat model you’ve used — GPT-4, Claude, Gemini — went through some form of preference learning.
Example
1. SFT: supervised training on prompt → response demos
2. RM: train reward model on (prompt, good, bad) triplets
3. RL: PPO with reward = RM(prompt, response) - beta * KL(policy || SFT)
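The shaped reward in step 3 can be sketched as below. The inputs (RM score, per-sequence log-probabilities, beta) are illustrative, and the KL term is the usual single-sample estimate log(policy/SFT) on the sampled response rather than the full divergence:

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_sft: float,
                  beta: float = 0.1) -> float:
    """Reward-model score minus a KL penalty against the frozen SFT model.

    logp_policy - logp_sft is a single-sample estimate of
    KL(policy || SFT) on this response; beta controls how hard the
    policy is pulled back toward the SFT model.
    """
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate
```

If the policy drifts toward text the SFT model finds unlikely (logp_policy much larger than logp_sft on its own samples), the penalty eats into the reward, which is what keeps reward-hacked gibberish from scoring well.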
You’ll hear it when
- Reading about instruction tuning and chat model alignment.
- Comparing RLHF, DPO, KTO, and RLAIF.
- Debugging reward hacking or mode collapse in alignment.
- Discussing why base models sound different from chat models.