Training · intermediate

Preference Data (preference pairs)

Preference data is a collection of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.

Published May 30, 2026

Explanation

For each prompt, you collect two or more candidate responses and a label saying which one a human (or judge model) preferred. Public datasets like Anthropic HH-RLHF, OpenAssistant, and UltraFeedback each contain tens to hundreds of thousands of such comparisons.
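As a rough illustration of that structure, the sketch below models one preference record and expands a best-to-worst ranking of several candidates into (chosen, rejected) pairs. The names `PreferencePair` and `pairs_from_ranking` are hypothetical and not taken from any particular dataset or library; real datasets use their own field names and formats.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One preference record (hypothetical schema for illustration)."""
    prompt: str
    chosen: str    # response the labeler preferred
    rejected: str  # response the labeler ranked lower


def pairs_from_ranking(prompt: str, ranked: list[str]) -> list[PreferencePair]:
    """Expand responses ranked best-to-worst into every (chosen, rejected) pair."""
    # combinations() preserves input order, so `better` always outranks `worse`.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked, 2)
    ]


pairs = pairs_from_ranking(
    "Explain what a hash table is.",
    ["clear answer", "vague answer", "off-topic answer"],
)
for p in pairs:
    print(f"{p.chosen!r} preferred over {p.rejected!r}")
```

A ranking of n responses yields n(n-1)/2 pairs, which is one reason a single labeling pass over several candidates can produce many comparisons.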

Preference data is cheaper and faster to collect than full SFT data because labelers only have to compare, not write. It is the input to both classic RLHF (via a reward model) and direct methods (DPO, IPO, KTO).

Past a certain dataset size, the quality of preference data matters more than its quantity: noisy or inconsistent labels cap how good the resulting model can be.
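One common way to act on label quality, sketched here under the assumption that each pair was labeled by several annotators, is to keep only the pairs where the majority label reaches an agreement threshold. The function name, the "A"/"B" vote encoding, and the 0.75 threshold are illustrative choices, not a standard.

```python
from collections import Counter


def filter_by_agreement(votes_per_pair, min_agreement=0.75):
    """Keep pairs whose majority label reaches `min_agreement`.

    votes_per_pair maps a pair id to a list of votes, each "A" or "B".
    Returns {pair_id: winning_label} for the pairs that pass.
    """
    kept = {}
    for pair_id, votes in votes_per_pair.items():
        if not votes:
            continue  # no labels at all, so nothing to trust
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            kept[pair_id] = label
    return kept


votes = {
    "pair-1": ["A", "A", "A", "B"],  # 75% agreement, kept
    "pair-2": ["A", "B", "B", "A"],  # 50% agreement, dropped
    "pair-3": ["B", "B", "B"],       # unanimous, kept
}
clean = filter_by_agreement(votes)
print(clean)
```

Raising the threshold trades dataset size for label reliability, which is the quality-versus-quantity tension described above.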

Examples

Anthropic HH-RLHF (~170K preference pairs).
UltraFeedback (~64K pairs distilled from GPT-4 judgments).

Frequently asked

How is Preference Data related to Direct Preference Optimization?

Preference data is the training input for Direct Preference Optimization (DPO). DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL, which makes it a simpler, more stable alternative to RLHF.

Direct Preference Optimization · Training

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.
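To make "directly on pairs" concrete, here is a minimal sketch of the DPO loss for a single pair, written with plain floats rather than a deep-learning framework. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model; the function names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the log-probability a model assigns to a full response.
    The loss falls as the policy favors the chosen response more strongly
    than the reference model does, relative to the rejected response.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

No reward model appears anywhere in this computation; the preference pair and the two models' log-probabilities are the only inputs.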

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Reward Model · Training

A reward model is trained on preference data to score model outputs the way humans would. RLHF then optimizes the policy LLM to maximize the reward model's score.
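A reward model is commonly fit to preference pairs with a Bradley-Terry style pairwise loss, which pushes the score of the chosen response above the score of the rejected one. The sketch below shows that loss on plain floats; it assumes the scalar rewards have already been produced by some model, and the function names are hypothetical.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modeled probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))


# The loss is small when the chosen response already scores higher...
low = pairwise_reward_loss(2.0, 0.0)
# ...and large when the ranking is inverted.
high = pairwise_reward_loss(0.0, 2.0)
print(low, high, preference_probability(2.0, 0.0))
```

Only the difference between the two rewards matters, so the absolute scale of the scores is not pinned down by the preference data alone.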

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Synthetic Data · Training

Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.

Sources

Anthropic HH-RLHF dataset