Training · advanced

Direct Preference Optimization (DPO)

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

Published May 29, 2026

Explanation

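Classic RLHF trains a reward model on preference data and then optimizes the LLM against it with PPO, an RL loop that is sensitive to hyperparameters and has to sample from the model during training. DPO derives a closed-form loss that turns preference learning into a supervised classification problem: for each pair, widen the log-probability margin between the preferred and rejected responses, measured relative to a frozen reference model (typically the SFT checkpoint). A coefficient β controls how far the trained model may drift from that reference.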
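The loss described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than a reference implementation: it assumes you already have the summed log-probability of each full response under both the model being trained and the frozen reference model, and the function and argument names are hypothetical. The default beta=0.1 is only a placeholder value.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    All four inputs are summed sequence log-probabilities (hypothetical
    names; how you compute them depends on your training framework).
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Margin between preferred and rejected, scaled by beta.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero, loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy already prefers the chosen response more than the reference does.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

When the policy equals the reference model the loss is log 2, and it falls as the policy shifts probability toward the preferred response relative to the reference. In a real training loop the same expression would be computed on batched tensors and averaged.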

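In many published evaluations DPO reaches quality comparable to PPO-based RLHF with less code, less hyperparameter tuning, and lower compute, since no reward model is trained and no responses are sampled during training. DPO has become a popular default for open-source post-training.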

Examples

Mistral-7B-Instruct-v0.2 was DPO-tuned.
Zephyr-7B, fine-tuned from Mistral-7B, was an early high-profile DPO success.

When to use direct preference optimization

When you have preference data and want a simpler pipeline than full RLHF.
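As a rough illustration of what "preference data" means here, the sketch below models one record as a prompt plus a preferred and a rejected response. The field names prompt, chosen, and rejected follow a common convention in open-source DPO tooling, but the class and the filtering helper are hypothetical and only show the shape of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record (hypothetical container for illustration)."""
    prompt: str
    chosen: str    # the response annotators preferred
    rejected: str  # the response annotators ranked lower


def usable_pairs(pairs):
    """Keep only pairs that carry a real preference signal."""
    kept = []
    for pair in pairs:
        fields = (pair.prompt, pair.chosen, pair.rejected)
        # Drop records with an empty prompt or an empty response.
        if any(not text.strip() for text in fields):
            continue
        # Identical responses give the loss nothing to separate.
        if pair.chosen.strip() == pair.rejected.strip():
            continue
        kept.append(pair)
    return kept


if __name__ == "__main__":
    data = [
        PreferencePair("Explain DPO briefly.",
                       "DPO trains directly on preference pairs.",
                       "I don't know."),
        PreferencePair("Say hello.", "Hello!", "Hello!"),
    ]
    print(len(usable_pairs(data)))
```

If your data already has this shape and passes a basic sanity filter like the one above, it can be fed to a DPO trainer without first fitting a reward model.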

Frequently asked

How is Direct Preference Optimization related to Reinforcement Learning from Human Feedback?

Both methods align an LLM using human preference judgments between candidate responses. RLHF first trains a reward model on those judgments and then fine-tunes the LLM with RL to maximize it. DPO targets the same KL-regularized objective but skips both steps, training directly on the preference pairs with a supervised loss.
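One way to see the connection: a DPO-trained model defines an implicit reward, β times the log-ratio between the policy and the reference model, which plays the role of RLHF's explicit reward model. The sketch below, using hypothetical function names and assuming summed sequence log-probabilities as inputs, computes that implicit reward and the fraction of pairs where the preferred response scores higher.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward, up to a prompt-only term that cancels within a pair."""
    return beta * (policy_logp - ref_logp)


def preference_accuracy(batch, beta=0.1):
    """Fraction of pairs where the preferred response gets the higher implicit reward.

    Each item is (policy_chosen, policy_rejected, ref_chosen, ref_rejected),
    all summed sequence log-probabilities.
    """
    batch = list(batch)
    if not batch:
        raise ValueError("batch is empty")
    correct = 0
    for policy_chosen, policy_rejected, ref_chosen, ref_rejected in batch:
        reward_chosen = implicit_reward(policy_chosen, ref_chosen, beta)
        reward_rejected = implicit_reward(policy_rejected, ref_rejected, beta)
        if reward_chosen > reward_rejected:
            correct += 1
    return correct / len(batch)


if __name__ == "__main__":
    batch = [
        (-10.0, -12.0, -11.0, -11.0),  # policy agrees with the label
        (-12.0, -10.0, -11.0, -11.0),  # policy disagrees with the label
    ]
    print(preference_accuracy(batch))
```

Tracking this accuracy on held-out pairs is a simple way to check whether the model is learning the preferences, much as one would evaluate a standalone reward model.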

Is Direct Preference Optimization considered advanced?

Direct Preference Optimization is generally considered advanced-level material in the AI and LLM space.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Supervised Fine-Tuning · Training

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Sources

DPO paper (arXiv)