Training · advanced

Reward Model

A reward model scores model outputs the way humans would, learned from preference data. RLHF then optimizes the policy LLM to maximize the reward model's score.

Published May 30, 2026

Explanation

Training the reward model is step 2 of the classic RLHF recipe: collect pairs of model outputs for the same prompt, have humans label which of the two is better, then train a model (often a smaller copy of the policy) to predict which output a human would prefer.
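One common way to turn "which is better?" labels into a training signal is a pairwise (Bradley-Terry style) loss: the reward model assigns a scalar score to each output, and the loss is low when the human-preferred output scores higher. The sketch below assumes the scores have already been computed; `pairwise_loss` and `pairwise_accuracy` are illustrative names, not part of any particular library.

```python
import math
from typing import List, Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # For x >= 0 use -log(1 + exp(-x)); for x < 0 use x - log(1 + exp(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(score_pairs: List[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_chosen - r_rejected) over labeled pairs.

    Each tuple is (score of the human-preferred output,
                   score of the rejected output).
    """
    losses = [-log_sigmoid(r_chosen - r_rejected)
              for r_chosen, r_rejected in score_pairs]
    return sum(losses) / len(losses)


def pairwise_accuracy(score_pairs: List[Tuple[float, float]]) -> float:
    """Fraction of pairs where the preferred output gets the higher score."""
    correct = sum(1 for r_chosen, r_rejected in score_pairs
                  if r_chosen > r_rejected)
    return correct / len(score_pairs)
```

When the two scores are equal the loss for that pair is log 2 (about 0.693), and it shrinks toward zero as the preferred output's score pulls ahead. In a real setup the scores would come from a neural network and the loss would be minimized by gradient descent; this sketch only shows the objective.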

The reward model becomes the proxy for human judgment during RL fine-tuning, since asking humans to label every PPO rollout would be impossibly slow.
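In PPO-style RLHF, the scalar handed to the optimizer for each rollout is commonly the reward-model score minus a KL-style penalty that keeps the policy close to its pre-RL starting point (InstructGPT uses a term of this kind). The sketch below assumes that setup and sequence-level log-probabilities; `shaped_rewards`, `beta`, and the argument names are hypothetical, chosen only for illustration.

```python
from typing import List


def shaped_rewards(
    rm_scores: List[float],
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    beta: float = 0.1,
) -> List[float]:
    """Combine reward-model scores with a KL-style penalty, per rollout.

    rm_scores:        reward-model score for each rollout (stands in for a human label)
    policy_logprobs:  log-probability of each rollout under the current policy
    ref_logprobs:     log-probability of each rollout under the frozen reference model
    beta:             penalty strength (assumed value, tuned in practice)
    """
    if not (len(rm_scores) == len(policy_logprobs) == len(ref_logprobs)):
        raise ValueError("inputs must have one entry per rollout")
    return [
        score - beta * (lp_policy - lp_ref)
        for score, lp_policy, lp_ref in zip(rm_scores, policy_logprobs, ref_logprobs)
    ]
```

No human is consulted inside this loop: every rollout is scored automatically, which is what makes RL fine-tuning at scale practical.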

Reward models can be gamed, a failure known as "reward hacking": the policy finds outputs that score highly without being what humans actually prefer. Their accuracy ceilings also limit how well the final LLM can be aligned. DPO sidesteps the explicit reward model entirely by deriving its loss directly from preference data.
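To make the contrast concrete, here is a minimal sketch of the DPO loss for a single preference pair. Instead of a learned reward model, it uses the log-probability ratio between the policy and a frozen reference model as an implicit reward. The function and argument names and the default `beta` are assumptions for illustration; inputs are sequence-level log-probabilities that would be computed elsewhere.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log sigmoid(beta * [(log pi(y_w|x) - log pi_ref(y_w|x))
                                - (log pi(y_l|x) - log pi_ref(y_l|x))])
    """
    # Implicit reward margin between the chosen and the rejected response.
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log sigmoid(margin), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

If the policy is identical to the reference model, the margin is zero and the loss is log 2. The loss drops as the policy raises the probability of the chosen response relative to the rejected one, compared with the reference.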

Examples

Anthropic's preference model trained on HH-RLHF data.
OpenAI's reward model used in InstructGPT.

Frequently asked

How is a reward model related to Reinforcement Learning from Human Feedback?

The reward model is one component of RLHF. It is trained on human preference judgments between candidate responses, and the RL stage of RLHF then fine-tunes the LLM to maximize the score that this model assigns.

Is the reward model considered an advanced topic?

Reward models are generally considered advanced-level material in the AI and LLM space.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Direct Preference Optimization · Training

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is often described as a simpler, more stable alternative to PPO-based RLHF.

Preference Data · Training

Preference data is a collection of (chosen, rejected) response pairs, where both responses in a pair answer the same prompt. It is the fuel for DPO and reward-model training.

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Supervised Fine-Tuning · Training

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

Sources

InstructGPT (arXiv)