ModelTerms

Training · advanced

Reinforcement Learning from Human Feedback (RLHF)

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Explanation

RLHF has three steps. First, collect pairs of model outputs and have humans pick which one is better. Second, train a reward model that scores outputs the way humans would. Third, fine-tune the LLM with reinforcement learning (usually PPO) to produce outputs the reward model rates highly.
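The second step can be sketched with a pairwise loss. The snippet below is a minimal illustration, assuming a Bradley-Terry style objective in which the reward model is pushed to score the human-preferred response above the rejected one. The function name and the plain-float scores are hypothetical stand-ins; a real reward model would produce these scores from a neural network.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    Assumption: a Bradley-Terry style loss, which is one common choice for
    reward-model training. The loss shrinks as the chosen response is scored
    further above the rejected one.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores for one labeled pair.
agrees_with_label = pairwise_preference_loss(2.0, 0.5)
disagrees_with_label = pairwise_preference_loss(0.5, 2.0)
print(agrees_with_label < disagrees_with_label)
```

Averaging this loss over many labeled pairs and minimizing it is what lets the reward model imitate the human rankings.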

RLHF was a key technique behind ChatGPT's helpful and harmless character. Without it, models tend either to be bland (the SFT-only case) or to show failure modes that humans spot easily but that the training loss does not penalize.

Newer approaches like DPO skip the explicit reward model and optimize directly on preference data, simplifying the recipe.
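To make the contrast concrete, here is a rough sketch of the per-pair DPO loss, assuming the commonly cited form that compares the policy's log-probabilities with those of a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative assumptions; the log-probabilities would come from the models being trained.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed strength of the pull toward the reference model
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_shift - rejected_shift)))."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_shift - rejected_shift)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Hypothetical sequence log-probabilities for one (preferred, rejected) pair.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

No reward model appears anywhere in the function: the preference signal enters only through which response is labeled chosen and which rejected.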

Examples

  • ChatGPT was trained with RLHF to refuse unsafe requests.
  • Anthropic's Claude is tuned with Constitutional AI plus an RLHF variant.

Frequently asked

What is Reinforcement Learning from Human Feedback?

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

What is an example of reinforcement learning from human feedback?

ChatGPT was trained with RLHF to refuse unsafe requests.

How is Reinforcement Learning from Human Feedback related to Direct Preference Optimization?

Both are ways to train an LLM on human preference data. DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

Is Reinforcement Learning from Human Feedback considered advanced?

Reinforcement Learning from Human Feedback is generally considered advanced-level material in the AI and LLM space.

Direct Preference Optimization · Training

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

Constitutional AI · Safety & Alignment

Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Supervised Fine-Tuning · Training

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

Reward Model · Training

A reward model scores model outputs the way humans would, learned from preference data. RLHF then optimizes the policy LLM to maximize the reward model's score.
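In practice the policy is often not trained on the raw reward-model score alone. The sketch below assumes one common arrangement, in which a penalty discourages the policy from drifting far from a frozen reference model. The function name, the argument names, and the default coefficient are hypothetical, and this is only one possible way to shape the reward.

```python
def kl_shaped_reward(
    reward_score: float,
    policy_logp: float,
    ref_logp: float,
    kl_coef: float = 0.1,  # assumed penalty weight, not taken from the source
) -> float:
    """Reward-model score minus a penalty for drifting from the reference model.

    policy_logp - ref_logp is a per-sample estimate of how much more likely
    the policy finds this response than the reference model does.
    """
    return reward_score - kl_coef * (policy_logp - ref_logp)


# Hypothetical values: a well-scored response the policy has drifted toward.
print(kl_shaped_reward(reward_score=1.0, policy_logp=-1.0, ref_logp=-2.0, kl_coef=0.5))
```

The penalty is a guard against the policy exploiting weaknesses in the reward model by producing text the reference model would consider very unlikely.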

Preference Data · Training

Preference data is a collection of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.
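One way such a record might look in code is sketched below. The class name, field names, and the helper that expands a best-to-worst ranking into pairs are all illustrative assumptions rather than a standard schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreferencePair:
    """One human judgment: for this prompt, `chosen` was preferred over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def pairs_from_ranking(prompt: str, ranked_best_to_worst: List[str]) -> List[PreferencePair]:
    """Expand a best-to-worst ranking into every implied (chosen, rejected) pair."""
    pairs = []
    for i, better in enumerate(ranked_best_to_worst):
        # Every response ranked below `better` counts as rejected against it.
        for worse in ranked_best_to_worst[i + 1:]:
            pairs.append(PreferencePair(prompt, better, worse))
    return pairs


# Hypothetical ranking of three responses to a single prompt.
for pair in pairs_from_ranking("Explain RLHF briefly.", ["answer A", "answer B", "answer C"]):
    print(pair.chosen, ">", pair.rejected)
```

A ranking of n responses yields n * (n - 1) / 2 pairs, so a single ranking task can supply several training examples.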
