Comparison

Preference Data vs Reward Model

Preference Data and Reward Model are both common AI/LLM terms, but they name different things: one is a dataset, the other is a model trained on that kind of dataset. Here is a quick side-by-side.

When you would reach for Preference Data

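Preference Data comes up when the question is about what to train on: collecting, labeling, or curating (chosen, rejected) response pairs for the same prompt, which then feed DPO or reward-model training.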

Example: Anthropic HH-RLHF (~170K preference pairs).
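As a rough illustration of what a single preference record might look like in code, here is a minimal sketch. The `PreferencePair` class and the `is_usable` helper are hypothetical names chosen for this example; they are not the schema of HH-RLHF or of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record: two responses to the same prompt (hypothetical schema)."""
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator did not prefer


def is_usable(pair: PreferencePair) -> bool:
    """Basic sanity check: no empty fields, and the two responses must differ."""
    fields = (pair.prompt, pair.chosen, pair.rejected)
    return all(f.strip() for f in fields) and pair.chosen != pair.rejected


# Toy records, invented purely for illustration.
pairs = [
    PreferencePair(
        prompt="How do I reverse a list in Python?",
        chosen="Use reversed(xs) or xs[::-1]; both leave xs unchanged.",
        rejected="You cannot reverse a list.",
    ),
    PreferencePair(prompt="Say hi.", chosen="Hi!", rejected="Hi!"),  # identical responses
]

usable = [p for p in pairs if is_usable(p)]
```

Filtering out pairs whose two responses are identical is only one plausible cleaning step; real pipelines may apply different or additional checks.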

When you would reach for Reward Model

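Reward Model comes up when the question is about how outputs get scored during training: a model learned from preference data scores responses the way humans would, and RLHF optimizes the policy LLM to maximize that score.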

Example: Anthropic's preference model trained on HH-RLHF data.

Frequently asked

What is the difference between Preference Data and Reward Model?

Preference Data: Preference data is a collection of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training. Reward Model: A reward model is learned from preference data and scores model outputs the way humans would. RLHF then optimizes the policy LLM to maximize the reward model's score.
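To make "learned from preference data" concrete, here is a minimal sketch of a pairwise objective often used for reward-model training, the Bradley-Terry style loss -log(sigmoid(r_chosen - r_rejected)). That this particular loss is used is an assumption for illustration; the text above does not name one. The function names `pairwise_loss` and `preference_accuracy` are hypothetical, and the scores are plain floats standing in for a real model's outputs.

```python
import math


def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable form.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_accuracy(scored_pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the chosen response scores strictly higher."""
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    wins = sum(1 for r_c, r_r in scored_pairs if r_c > r_r)
    return wins / len(scored_pairs)


# Toy (r_chosen, r_rejected) scores, invented purely for illustration.
scores = [(1.5, -0.5), (0.2, 0.9), (2.0, 1.0), (0.0, -3.0)]

mean_loss = sum(pairwise_loss(c, r) for c, r in scores) / len(scores)
accuracy = preference_accuracy(scores)
```

The loss shrinks as the chosen response's score pulls ahead of the rejected one, which is what pushes the reward model toward ranking responses the way the annotators did.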

When should I use Preference Data vs Reward Model?

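Preference Data is the right concept when you are focused on the training signal itself: which (chosen, rejected) pairs you have and how they were gathered. Reward Model applies when you are focused on the scorer learned from those pairs, which RLHF then uses to optimize the policy LLM.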

Are Preference Data and Reward Model the same thing?

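No. Preference Data is a dataset of (chosen, rejected) response pairs; a Reward Model is a model trained on such data to score outputs. They are related, since one is the input to the other, but they sit at different points in the training pipeline.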