Training · intermediate
Preference Data (preference pairs)
Preference data consists of (chosen, rejected) response pairs, where both responses in a pair answer the same prompt. It is the fuel for DPO and reward-model training.
Explanation
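For each prompt, you collect two or more candidate responses and a label saying which one a human (or judge model) preferred; with more than two candidates, the label is typically a ranking or per-response score that can be broken down into pairs. Public datasets like Anthropic HH-RLHF, OpenAssistant, and UltraFeedback contain tens to hundreds of thousands of such comparisons.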
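As a minimal sketch of what one record might look like, the snippet below defines a pair type and expands scored candidates for a single prompt into (chosen, rejected) pairs. The names `PreferencePair` and `pairs_from_scores`, the example texts, and the numeric scores are illustrative assumptions, not the schema of any particular dataset.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    """One preference record: two responses to the same prompt."""
    prompt: str
    chosen: str
    rejected: str


def pairs_from_scores(prompt, scored_responses):
    """Expand (response, score) candidates into pairwise preferences.

    Assumption: a higher score means the response was preferred.
    Tied candidates are skipped because they carry no preference signal.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if score_a == score_b:
            continue
        if score_a > score_b:
            pairs.append(PreferencePair(prompt, resp_a, resp_b))
        else:
            pairs.append(PreferencePair(prompt, resp_b, resp_a))
    return pairs


# Hypothetical candidates and scores for one prompt.
example_pairs = pairs_from_scores(
    "Explain what a hash map is.",
    [
        ("A hash map stores key-value pairs and looks keys up by hash.", 9.0),
        ("It's a kind of list.", 3.0),
        ("A map that uses hashing.", 3.0),
    ],
)
```

Here the two low-scoring responses are tied, so only the two pairs involving the top-scoring response are produced.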
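Preference data is usually cheaper and faster to collect than SFT demonstration data because labelers only have to compare responses, not write them. It is the input to both classic RLHF (via a reward model) and direct methods such as DPO and IPO. KTO is a related direct method that needs only a per-response desirable/undesirable label, which can be derived from pairs.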
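To make the reward-model route concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss commonly used to train a reward model on such pairs. It assumes the reward model has already produced one scalar reward for each response; the function name `pairwise_reward_loss` is hypothetical, and a real training loop would compute this over batches with an autodiff framework.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the chosen response is scored well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar rewards for a single pair.
loss_correct_order = pairwise_reward_loss(2.0, 0.0)
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)
```

When both rewards are equal the loss is log 2, which is the value for a model that expresses no preference either way.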
Past a certain point, the quality of preference data matters more than its quantity: noisy or inconsistent labels cap how good the resulting model can be.
Examples
- Anthropic HH-RLHF (~170K preference pairs).
- UltraFeedback (~64K prompts whose candidate responses are scored by GPT-4; commonly binarized into chosen/rejected pairs).
Frequently asked
What is Preference Data?
Preference data consists of (chosen, rejected) response pairs, where both responses in a pair answer the same prompt. It is the fuel for DPO and reward-model training.
What is an example of preference data?
Anthropic HH-RLHF (~170K preference pairs).
How is Preference Data related to Direct Preference Optimization?
Preference data is the training input for Direct Preference Optimization (DPO). DPO fine-tunes an LLM directly on (chosen, rejected) pairs without training a separate reward model or running RL. It is a simpler and often more stable alternative to classic RLHF.
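To illustrate how a single pair feeds into DPO, the sketch below computes the standard DPO loss for one example. It assumes you already have the summed log-probability of each full response under the policy being trained and under a frozen reference model; the function name `dpo_pair_loss`, the argument names, and the default `beta=0.1` are illustrative choices rather than any library's API.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a log-probability of the whole response.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp().
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical log-probabilities: the policy starts identical to the reference,
# then shifts probability mass toward the chosen response.
loss_at_start = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
loss_after_shift = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)
```

The loss depends only on how far the policy has moved relative to the reference on each response, so it starts at log 2 when the two models agree and falls as the policy favors the chosen response more than the reference does.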
Is Preference Data considered intermediate?
Preference Data is generally considered intermediate-level material in the AI and LLM space.