Training · advanced
Direct Preference Optimization (DPO)
DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.
Explanation
RLHF requires training a reward model and then optimizing the policy against it with PPO, which is sensitive to hyperparameters and expensive to run. DPO derives a closed-form loss that turns preference learning into a supervised classification problem: raise the log-probability of the preferred response relative to the rejected one, with both measured against a frozen reference model so that the policy stays close to it.
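To make the loss concrete, here is a minimal sketch for a single preference pair, assuming the summed token log-probabilities of each response under the policy and the frozen reference model have already been computed. The function names and the default `beta=0.1` are illustrative assumptions, not values taken from any particular library.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (preferred, rejected) pair.

    Each argument is the summed log-probability of a full response.
    beta (an assumed default here) controls how strongly the policy
    is held close to the reference model.
    """
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the two log-ratios.
    margin = beta * (chosen_ratio - rejected_ratio)
    # Negative log-likelihood that the preferred response wins.
    return -log_sigmoid(margin)

# Policy identical to the reference: the margin is zero.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the preferred response.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

When the policy equals the reference the margin is zero and the loss is log 2. Shifting probability toward the preferred response pushes the loss below that value. In practice the same computation is batched over tensors and averaged.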
The result is similar quality to RLHF in many evaluations, with less code, less tuning, and lower compute. DPO has become a popular default for open-source post-training.
Examples
- Mixtral-8x7B-Instruct, which Mistral AI describes as trained with supervised fine-tuning followed by DPO.
- Zephyr-7B, Hugging Face's fine-tune of Mistral-7B, was an early high-profile DPO success.
When to use direct preference optimization
When you have preference data and want a simpler pipeline than full RLHF.
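As a rough illustration of what "preference data" means here, the sketch below models one record as a prompt plus a preferred and a rejected response, and filters out records that carry no preference signal. The `PreferencePair` class, its field names, and the `usable_pairs` helper are hypothetical. They mirror a common prompt/chosen/rejected layout rather than the schema of any specific dataset or library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # the input shown to the model
    chosen: str    # the response annotators preferred
    rejected: str  # the response annotators ranked lower

def usable_pairs(pairs):
    """Keep only pairs with non-empty fields and two distinct responses."""
    return [
        p for p in pairs
        if p.prompt.strip() and p.chosen.strip() and p.rejected.strip()
        and p.chosen != p.rejected
    ]

pairs = [
    PreferencePair("Define DPO.", "Direct Preference Optimization.", "I don't know."),
    # Identical responses give the loss nothing to separate, so this is dropped.
    PreferencePair("Define RLHF.", "Same answer.", "Same answer."),
]
clean = usable_pairs(pairs)
```

If your data cannot be expressed in this shape, for example because you only have scalar ratings of single responses, it needs to be converted into pairs before DPO applies.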
Frequently asked
What is Direct Preference Optimization?
DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.
What is an example of direct preference optimization?
Mixtral-8x7B-Instruct, which Mistral AI describes as trained with supervised fine-tuning followed by DPO.
How is Direct Preference Optimization related to Reinforcement Learning from Human Feedback?
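Both methods align an LLM with human preference judgments between candidate responses. RLHF first trains a reward model on those judgments and then fine-tunes the LLM with RL to maximize that reward while staying close to a reference model. DPO targets the same regularized objective but optimizes it directly on the preference pairs, so no explicit reward model or RL loop is needed.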
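The relationship can be written out explicitly. The sketch below uses the standard notation from the DPO literature, which is an assumption relative to this page: \pi_\theta is the model being trained, \pi_{\mathrm{ref}} the frozen reference model, r_\phi the learned reward model, y_w and y_l the preferred and rejected responses, \sigma the logistic function, and \beta the regularization strength.

```latex
% RLHF: maximize a learned reward under a KL penalty toward the reference model
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% DPO: the same objective, rewritten as a loss on preference pairs
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The reward model r_\phi does not appear in the second expression: the scaled log-ratio \beta \log (\pi_\theta / \pi_{\mathrm{ref}}) acts as an implicit reward.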
When should I use direct preference optimization?
When you have preference data and want a simpler pipeline than full RLHF.
Is Direct Preference Optimization considered advanced?
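Yes. Direct Preference Optimization is generally considered advanced-level material in the AI and LLM space, because it builds on supervised fine-tuning and on RLHF concepts such as reward models and reference models.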