Training · intermediate

Supervised Fine-Tuning (SFT)

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

Published May 29, 2026

Explanation

SFT is the workhorse of post-training. The data is typically human-written or human-curated (instruction, response) pairs, and the model is trained with the same next-token-prediction loss as pretraining but on this narrower distribution.
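A minimal sketch of that loss, in plain Python so it runs without a deep-learning framework. The function names (`log_softmax`, `sft_loss`) and the `loss_mask` argument are illustrative, not from any particular library. The mask is an assumption layered on top of the description above: many SFT setups compute the loss only on response tokens, while a mask of all ones gives the plain pretraining-style loss over every position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(logits, token_ids, loss_mask):
    """Mean next-token negative log-likelihood over unmasked target positions.

    logits[t]    : vocabulary scores the model produced after reading token_ids[: t + 1]
    token_ids    : prompt tokens followed by response tokens
    loss_mask[t] : 1 if token_ids[t] should contribute to the loss, else 0
                   (hypothetical convention: 0 for prompt tokens, 1 for response tokens)
    """
    total, count = 0.0, 0
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the token the model should predict at step t
        if loss_mask[t + 1]:
            total += -log_softmax(logits[t])[target]
            count += 1
    return total / max(count, 1)  # avoid dividing by zero if everything is masked


# Toy example: vocabulary of 4 tokens, 2 prompt tokens, 2 response tokens.
token_ids = [0, 1, 2, 3]
loss_mask = [0, 0, 1, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]

# A model that spreads probability evenly over 4 tokens has loss log(4), about 1.386.
print(sft_loss(uniform_logits, token_ids, loss_mask))
```

In a real training loop the same quantity would be computed by the framework's cross-entropy function over batched tensors; the sketch only shows which positions are scored and against which targets.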

SFT alone can produce a useful chat model. RLHF can then improve it further by optimizing the model against a learned reward model, rather than training it to imitate specific reference responses.

Examples

Training a Llama 3 base model on the "chosen" responses from Anthropic's HH-RLHF dataset as a first pass.
Custom SFT on a company's historical support tickets.
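As a rough illustration of the support-ticket case, the sketch below turns ticket records into (prompt, response) pairs. Everything specific here is hypothetical: the field names (`subject`, `customer_message`, `agent_reply`, `resolved`), the prompt template, and the filtering rule are assumptions made for the example, not a standard schema.

```python
def ticket_to_example(ticket):
    """Turn one ticket record into a prompt/response pair for SFT.

    The field names and the prompt template are hypothetical.
    """
    prompt = (
        "You are a support agent. Answer the customer's question.\n\n"
        f"Subject: {ticket['subject'].strip()}\n"
        f"Customer: {ticket['customer_message'].strip()}\n"
        "Agent:"
    )
    # Leading space so that prompt + response reads as one continuous text.
    response = " " + ticket["agent_reply"].strip()
    return {"prompt": prompt, "response": response}


def build_dataset(tickets):
    """Keep only resolved tickets that have a non-empty agent reply."""
    return [
        ticket_to_example(t)
        for t in tickets
        if t.get("resolved") and t.get("agent_reply", "").strip()
    ]


tickets = [
    {
        "subject": "Password reset",
        "customer_message": "I never received the reset email.",
        "agent_reply": "Please check your spam folder; I have also re-sent the link.",
        "resolved": True,
    },
    {
        "subject": "Billing",
        "customer_message": "Why was I charged twice?",
        "agent_reply": "",
        "resolved": False,
    },
]

dataset = build_dataset(tickets)
print(len(dataset))
print(dataset[0]["prompt"] + dataset[0]["response"])
```

Filtering to resolved tickets is one simple form of the human curation mentioned above: the model imitates whatever responses it is given, so unresolved or empty replies are left out.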

Frequently asked

What is Supervised Fine-Tuning?

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

What is an example of supervised fine-tuning?

Training a Llama 3 base model on the "chosen" responses from Anthropic's HH-RLHF dataset as a first pass.

How is Supervised Fine-Tuning related to Fine-tuning?

Supervised Fine-Tuning is a specific form of fine-tuning. Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge; SFT does this using examples that pair an explicit input with a desired output.

Is Supervised Fine-Tuning considered intermediate?

Supervised Fine-Tuning is generally considered intermediate-level material in the AI and LLM space.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Instruction Tuning · Training

Instruction tuning is fine-tuning on examples of (instruction, desired response) pairs so a base model learns to follow natural-language directions.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Direct Preference Optimization · Training

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is generally simpler to implement and more stable to train than reward-model-based RLHF.

Sources

InstructGPT (arXiv)