ModelTerms

Training · advanced

Direct Preference Optimization (DPO)

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

Explanation

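Classic RLHF trains a reward model on preference data and then optimizes the LLM against it with PPO, which is sensitive to hyperparameters and hard to stabilize. DPO derives a closed-form loss that turns preference learning into a supervised classification problem: for each pair, widen the margin between the log-probability of the preferred response and that of the rejected one, with both measured relative to a frozen reference model so the policy stays close to it.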
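The loss described above can be sketched for a single preference pair in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function and argument names are made up for this example, the inputs are assumed to be summed token log-probabilities of each full response, and the strength parameter `beta` (here 0.1) is an illustrative value rather than a recommendation.

```python
import math


def softplus(x):
    # log(1 + exp(x)), written so that large |x| does not overflow
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    All four inputs are assumed to be sequence log-probabilities
    (sums of per-token log-probs) under the policy or the frozen reference.
    """
    # How much more likely the policy makes each response than the reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the preferred and rejected log-ratios
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin)
    return softplus(-margin)


# Policy identical to the reference: zero margin
same = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the preferred response
better = dpo_loss(-12.0, -15.0, -13.0, -14.0)
print(same, better)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). Any shift of probability toward the preferred response relative to the reference pushes the loss below that value, and a shift toward the rejected response pushes it above.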

DPO matches RLHF quality in many evaluations while needing less code, less hyperparameter tuning, and less compute, since it trains no reward model and samples no responses during training. It has become a popular default for open-source post-training.

Examples

  • Mixtral-8x7B-Instruct was trained with supervised fine-tuning followed by DPO, according to Mistral AI's release announcement.
  • Zephyr-7B: an early high-profile DPO success, fine-tuned from Mistral-7B on AI-generated preference pairs (UltraFeedback).

When to use direct preference optimization

Use DPO when you already have pairwise preference data (a prompt with one preferred and one rejected response) and want a simpler pipeline than full RLHF.
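As a rough illustration of what such preference data looks like, the sketch below checks one record before training. The field names `prompt`, `chosen`, and `rejected` are an assumption, following a convention common in open-source preference datasets. The helper `validate_pair` is hypothetical and not part of any library.

```python
# Field names are an assumed convention, not something this page specifies
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_pair(record):
    """Return a list of problems with one preference record (empty if usable)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Identical responses carry no preference signal for DPO to learn from
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


example = {
    "prompt": "Explain what a reference model is in DPO.",
    "chosen": "It is a frozen copy of the starting model that the policy is kept close to.",
    "rejected": "It is the model with the most parameters.",
}
print(validate_pair(example))  # prints []
```

Filtering out pairs that fail checks like these is a cheap first step, since every pair feeds directly into the loss and there is no reward model to smooth over noisy labels.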

Frequently asked

What is Direct Preference Optimization?

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

What is an example of direct preference optimization?

Mixtral-8x7B-Instruct was trained with supervised fine-tuning followed by DPO, according to Mistral AI's release announcement.

How is Direct Preference Optimization related to Reinforcement Learning from Human Feedback?

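Both are post-training methods that align an LLM with preference judgments between candidate responses. RLHF first trains a reward model on those judgments and then fine-tunes the LLM with RL (typically PPO) to maximize that reward while staying close to a reference model. DPO optimizes the same regularized objective directly on the preference pairs, with no separate reward model and no RL loop.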
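The connection can be made concrete with the standard derivation, sketched here in notation that this page does not otherwise use: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $r$ the reward, $\beta$ the regularization strength, and $(x, y_w, y_l)$ a prompt with its preferred and rejected responses. It assumes preferences follow a Bradley-Terry model.

```latex
% RLHF objective: maximize reward while staying close to the reference model
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\!\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]

% Its optimum lets the reward be written in terms of the policy itself
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into the Bradley-Terry model cancels Z(x) and gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The partition function $Z(x)$ depends only on the prompt, so it drops out of the difference between the two responses. That cancellation is what removes the need for an explicit reward model.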

When should I use direct preference optimization?

When you have preference data and want a simpler pipeline than full RLHF.

Is Direct Preference Optimization considered advanced?

Yes. It is listed here as an advanced training topic, since understanding it builds on supervised fine-tuning and RLHF.
