Training · advanced
Reinforcement Learning from Human Feedback (RLHF)
RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.
Explanation
RLHF has three steps. First, collect pairs of model outputs and have humans pick which one is better. Second, train a reward model that scores outputs the way humans would. Third, fine-tune the LLM with reinforcement learning (usually PPO) to produce outputs the reward model rates highly.
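Steps two and three can be sketched numerically. The snippet below is a minimal illustration, not any lab's actual training code. It assumes the reward model is trained with a Bradley-Terry style pairwise loss and that the RL step subtracts a KL-style penalty to keep the policy close to a reference model. Both are common choices, but the text above does not specify them. The function names and the `beta` parameter are hypothetical.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Step two (assumed form): -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it ranks them wrongly.
    """
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Step three (assumed form): reward-model score minus a drift penalty.

    logp_policy and logp_reference are the log-probabilities of the same
    sampled response under the policy being tuned and under the frozen
    reference model. beta is a hypothetical penalty weight.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Reward model agrees with the human label: low loss.
    print(pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0))
    # Reward model disagrees with the human label: high loss.
    print(pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0))
    # Policy has drifted toward this response, so its reward is discounted.
    print(kl_shaped_reward(reward=1.0, logp_policy=-0.5, logp_reference=-1.0))
```

In a real pipeline the scalar rewards come from a neural reward model and the shaped reward feeds an RL algorithm such as PPO. Only the arithmetic is shown here.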
RLHF is widely credited as a key technique behind ChatGPT's helpful and harmless character. Without it, models tend either to be bland (as with SFT alone) or to show failure modes that humans spot easily but that standard training losses do not penalize.
Newer approaches such as Direct Preference Optimization (DPO) skip the explicit reward model and optimize the LLM directly on preference data, which simplifies the recipe.
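To make "optimize directly on preference data" concrete, here is a minimal sketch of the per-pair DPO loss in its commonly published form. It assumes you already have the summed log-probabilities of the preferred and rejected responses under both the policy being tuned and a frozen reference model. The function name, argument names, and the `beta` default are illustrative assumptions, not taken from any particular library.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    Each ratio is the log-probability of a response under the policy minus
    its log-probability under the reference model. No reward model is used.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_logratio - rejected_logratio)
    # softplus(-z) equals -log(sigmoid(z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the preferred response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the preferred response's probability relative to the reference and lowers the rejected one's, which is how the preference signal reaches the model without a separate reward model or RL loop.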
Examples
- ChatGPT was trained with RLHF, in part to refuse unsafe requests.
- Anthropic's Claude was tuned with Constitutional AI, which pairs an RLHF-style stage with AI-generated feedback in place of some human labels.
Frequently asked
What is Reinforcement Learning from Human Feedback?
RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.
What is an example of Reinforcement Learning from Human Feedback?
ChatGPT was trained with RLHF, in part to refuse unsafe requests.
How is Reinforcement Learning from Human Feedback related to Direct Preference Optimization?
Both methods align an LLM using human preference data. DPO fine-tunes the model directly on (preferred, rejected) pairs without training a separate reward model or running RL, which makes it a simpler and often more stable alternative to RLHF.
Is Reinforcement Learning from Human Feedback considered advanced?
Yes. Reinforcement Learning from Human Feedback is generally considered advanced-level material, since it combines preference data collection, reward modeling, and reinforcement learning.
Related terms
Side-by-side comparisons
- Reinforcement Learning from Human Feedback vs Direct Preference Optimization
- Reinforcement Learning from Human Feedback vs Constitutional AI
- Reinforcement Learning from Human Feedback vs Alignment
- Reinforcement Learning from Human Feedback vs Fine-tuning
- Reinforcement Learning from Human Feedback vs Supervised Fine-Tuning