Training · advanced
Reward Model
A reward model scores model outputs the way humans would, learned from preference data. RLHF then optimizes the policy LLM to maximize the reward model's score.
Explanation
Training the reward model is step 2 of the classic RLHF recipe, after supervised fine-tuning and before RL: collect pairs of model outputs with human "which is better?" labels, then train a model (often a smaller model from the same family as the policy) to predict which output a human would prefer.
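The pairwise training objective can be sketched with the commonly used Bradley-Terry loss, which pushes the reward of the preferred output above the reward of the rejected one. This is a minimal illustration, not any lab's actual implementation: the linear `reward` function, the two hand-made features, and the toy `PAIRS` data are all assumptions made up for this example, and real reward models are neural networks with a scalar output head.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected)), in stable form."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def reward(weights, features):
    """Hypothetical linear reward model: a dot product over features."""
    return sum(w * f for w, f in zip(weights, features))

def train(pairs, dim, lr=0.5, epochs=200):
    """Fit the weights by stochastic gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Toy preference data: (features of preferred output, features of rejected output).
# Feature 0 stands for "helpfulness", feature 1 for "rudeness" (both invented).
PAIRS = [
    ([0.9, 0.1], [0.2, 0.1]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.1, 0.7]),
]

WEIGHTS = train(PAIRS, dim=2)
```

After training, `reward(WEIGHTS, chosen)` exceeds `reward(WEIGHTS, rejected)` for every pair in the toy data, which is all the loss asks for: it constrains the difference between rewards, not their absolute scale.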
The reward model becomes the proxy for human judgment during RL fine-tuning, since asking humans to label every PPO rollout would be impossibly slow.
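As a rough sketch of that proxy role, the snippet below scores a batch of sampled responses with a reward function instead of a human labeler and turns the scores into a learning signal. Everything here is hypothetical: `toy_reward_model` is a word-overlap heuristic standing in for a trained reward model, and subtracting the batch mean is a simple baseline chosen for illustration (PPO implementations typically use a learned value function instead).

```python
from statistics import mean

def toy_reward_model(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a trained reward model.

    Returns the fraction of response words that also appear in the prompt.
    """
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    overlap = sum(1 for word in response_words if word in prompt_words)
    return overlap / len(response_words)

def score_rollouts(prompt, rollouts, reward_fn):
    """Score each rollout automatically and center the scores on the batch mean."""
    rewards = [reward_fn(prompt, response) for response in rollouts]
    baseline = mean(rewards)
    advantages = [r - baseline for r in rewards]
    return rewards, advantages

PROMPT = "explain reward models"
ROLLOUTS = ["reward models score outputs", "I like turtles"]
REWARDS, ADVANTAGES = score_rollouts(PROMPT, ROLLOUTS, toy_reward_model)
```

The RL step would then increase the probability of responses with positive advantage and decrease it for the rest, with no human in the loop for any individual rollout.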
Reward models can be gamed: the policy may find outputs that score highly without actually being better, a failure known as "reward hacking". The reward model's accuracy also caps how well the final LLM can be aligned. DPO sidesteps the explicit reward model entirely by deriving its loss directly from preference data.
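To illustrate how DPO works without a separate reward model, here is a minimal sketch of its per-pair loss. It assumes the standard DPO formulation, which the text above does not spell out: a frozen reference model, summed log-probabilities of each full response under the policy and the reference, and a temperature `beta`. The function name, argument names, and the default `beta=0.1` are choices made for this example.

```python
import math

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    The log-ratio between policy and reference acts as an implicit reward,
    so no explicit reward model is trained.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss is log(2), i.e. no preference learned yet.
LOSS_AT_INIT = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: the loss drops.
LOSS_AFTER_SHIFT = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss has the same pairwise shape as a reward model's training loss, with the policy's log-ratio against the reference standing in for the reward score.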
Examples
- Anthropic's preference model trained on HH-RLHF data.
- OpenAI's reward model used in InstructGPT.
Frequently asked
What is a reward model?
A reward model scores model outputs the way humans would, learned from preference data. RLHF then optimizes the policy LLM to maximize the reward model's score.
What is an example of a reward model?
Anthropic's preference model trained on HH-RLHF data.
How is a reward model related to Reinforcement Learning from Human Feedback?
The reward model is a component of Reinforcement Learning from Human Feedback (RLHF). RLHF fine-tunes an LLM to maximize the score of a reward model that was itself trained on human preference judgments between candidate responses.
Is the reward model considered an advanced topic?
Yes. The reward model is generally considered advanced-level material in the AI and LLM space.