ModelTerms

Learning path · 24 min · intermediate

The post-training pipeline

Pretraining gives you a base model. This is how it becomes ChatGPT.

Almost everything that makes a frontier LLM useful happens AFTER pretraining. Instruction tuning teaches it to follow directions. RLHF or DPO makes it helpful and harmless. Synthetic data lets the process scale. This path walks the full pipeline.

  1. Pretraining

    Why this step: Where the base model comes from. The expensive part. Sets the stage.

    Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

    Training · intermediate
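The next-token objective in step 1 can be sketched as an average cross-entropy over positions. This is a minimal illustration, assuming the model's outputs are already available as one list of raw scores (logits) per position; the function name `next_token_loss` and the plain-list representation are hypothetical, not part of any particular training framework.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy for predicting token t+1 from position t.

    logits_per_position[t] is a list of unnormalized scores over the
    vocabulary, produced after the model has read token_ids[: t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    total = 0.0
    for t in range(len(token_ids) - 1):
        logits = logits_per_position[t]
        target = token_ids[t + 1]
        # Stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true next token.
        total += log_z - logits[target]
    return total / (len(token_ids) - 1)
```

With uniform logits over a four-token vocabulary the loss equals log(4), the cost of guessing at random; training lowers it by shifting probability toward the tokens that actually follow.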
  2. Instruction Tuning

    Why this step: The first post-training move — teach the base model to follow instructions.

    Instruction tuning is fine-tuning on examples of (instruction, desired response) pairs so a base model learns to follow natural-language directions.

    Training · intermediate
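The (instruction, desired response) pairs in step 2 are usually rendered into a single text template before fine-tuning. The sketch below shows one way to do that; the `### Instruction:` / `### Response:` markers and the `format_example` helper are illustrative assumptions, and real models each define their own chat or prompt template.

```python
def format_example(instruction, response):
    """Render one (instruction, response) pair as prompt and completion text.

    The section markers below are a hypothetical template chosen for
    illustration; they are not a standard.
    """
    prompt = f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"
    return {"prompt": prompt, "completion": response.strip()}


example = format_example(
    "Summarize: The cat sat on the mat.",
    "A cat sat on a mat.",
)
full_text = example["prompt"] + example["completion"]
```

Keeping the prompt and completion separate makes it easy to supervise only the completion later.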
  3. Supervised Fine-Tuning (SFT)

    Why this step: The technique under instruction tuning. The unglamorous workhorse.

    SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

    Training · intermediate
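The loss in step 3 can be sketched as a negative log-likelihood averaged over the desired output. The version below assumes per-token log-probabilities are already computed, and it applies a common (but not universal) choice of masking out prompt tokens so that only response tokens are penalized; the name `sft_loss` and the boolean mask are hypothetical.

```python
def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs[i] is the log-probability the model assigned to the
    i-th target token; is_response_token[i] marks whether that token
    belongs to the desired output rather than the prompt.
    """
    if len(token_log_probs) != len(is_response_token):
        raise ValueError("log-probs and mask must have the same length")
    supervised = [
        -log_prob
        for log_prob, keep in zip(token_log_probs, is_response_token)
        if keep
    ]
    if not supervised:
        raise ValueError("no response tokens to supervise")
    return sum(supervised) / len(supervised)
```

For log-probabilities `[-5.0, -5.0, -1.0, -3.0]` with the first two tokens marked as prompt, the loss is 2.0: the poorly predicted prompt tokens do not count against the model.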
  4. Preference Data (preference pairs)

    Why this step: The fuel for the next stage — pairs of (preferred, rejected) responses.

    Preference data consists of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.

    Training · intermediate
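A preference record from step 4 can be represented as a small data structure. This is one possible shape, assuming plain-text responses; the class name `PreferencePair` and the `validate` checks are illustrative, and real datasets often carry extra fields such as annotator IDs or score margins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str     # the shared prompt both responses answer
    chosen: str     # the response judged better
    rejected: str   # the response judged worse


def validate(pair):
    """Reject records that carry no usable preference signal."""
    if not pair.prompt.strip():
        raise ValueError("prompt must not be empty")
    if pair.chosen == pair.rejected:
        raise ValueError("chosen and rejected responses are identical")
    return pair


pair = validate(
    PreferencePair(
        prompt="Explain recursion in one sentence.",
        chosen="Recursion is when a function solves a problem by calling itself on smaller inputs.",
        rejected="Recursion is recursion.",
    )
)
```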
  5. Reward Model

    Why this step: The scoring function trained from preference data. The proxy for human judgment.

    A reward model scores model outputs the way humans would, learned from preference data. RLHF then optimizes the policy LLM to maximize the reward model's score.

    Training · advanced
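One common way to learn the scoring function in step 5 is a pairwise (Bradley-Terry style) loss that pushes the chosen response's score above the rejected one's. The entry above does not name a specific loss, so treat this as an assumed, widely used choice; `pairwise_reward_loss` and `log_sigmoid` are hypothetical helper names.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(score_chosen, score_rejected):
    """Loss is small when the chosen response outscores the rejected one.

    Both scores are scalar outputs of the reward model for the same prompt.
    """
    return -log_sigmoid(score_chosen - score_rejected)
```

When the two scores are equal the loss is log(2); it falls toward zero as the margin in favor of the chosen response grows, and rises when the model ranks the pair the wrong way round.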
  6. Reinforcement Learning from Human Feedback (RLHF)

    Why this step: The classic recipe. Why ChatGPT felt like ChatGPT and not just a smart autocomplete.

    RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

    Training · advanced
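In many RLHF setups the quantity being maximized in step 6 is the reward-model score minus a penalty for drifting away from a frozen reference model. The entry above does not mention this penalty, so the sketch below is an assumption about a typical recipe; `shaped_reward` and the default `beta` are hypothetical.

```python
def shaped_reward(reward_score, policy_logprob, reference_logprob, beta=0.1):
    """Reward-model score minus a KL-style penalty.

    policy_logprob and reference_logprob are the log-probabilities of the
    same sampled response under the current policy and under the frozen
    reference model. Their difference is a single-sample estimate of the
    KL divergence between the two.
    """
    kl_estimate = policy_logprob - reference_logprob
    return reward_score - beta * kl_estimate
```

A response the policy finds far more likely than the reference does is penalized even if the reward model likes it, which discourages the policy from exploiting quirks of the reward model.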
  7. Direct Preference Optimization (DPO)

    Why this step: The modern, simpler alternative to RLHF. Often the default in 2026.

    DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.

    Training · advanced
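The DPO objective in step 7 can be written directly in terms of log-probabilities, with no reward model in the loop. The sketch below assumes the summed log-probability of each response under the policy and under a frozen reference model is already available; the function name `dpo_loss` and the default `beta` are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair.

    Each argument is the log-probability of a full response given the
    prompt. The loss rewards the policy for raising the preferred response
    relative to the reference more than it raises the rejected one.
    """
    chosen_shift = policy_chosen - ref_chosen
    rejected_shift = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))
```

At initialization the policy equals the reference, both shifts are zero, and the loss is log(2). It decreases as the policy moves probability toward preferred responses and away from rejected ones.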
  8. Constitutional AI (CAI)

    Why this step: Anthropic's twist — use the model to critique itself, scaling alignment beyond human labels.

    Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

    Safety & Alignment · advanced
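The self-critique idea in step 8 can be sketched as a loop that asks a model to critique a draft against each written principle and then revise it. This is a simplified illustration, not Anthropic's implementation: `model` is any hypothetical callable that maps a text prompt to a text reply, and the prompt wording is invented for the example.

```python
def critique_and_revise(model, prompt, draft, principles):
    """Revise a draft once per principle using the model's own critiques.

    model: hypothetical callable, str -> str.
    principles: the written rules (the "constitution") to check against.
    """
    response = draft
    for principle in principles:
        critique = model(
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}\n"
            "Point out any way the response violates the principle."
        )
        response = model(
            f"Critique: {critique}\n"
            f"Response: {response}\n"
            "Rewrite the response so that it addresses the critique."
        )
    return response
```

The revised responses can then serve as training targets, and model-generated comparisons can stand in for some of the human preference labels.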
  9. Synthetic Data

    Why this step: How modern post-training scales beyond expensive human labels.

    Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.

    Training · intermediate
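The "code generated and filtered by tests" pattern in step 9 can be sketched as a filter that keeps only candidate programs passing every test case. This is a toy version under loud assumptions: candidates are Python source strings defining a function named `solve` (a hypothetical convention), and `exec` is used only for brevity. Model-generated code is untrusted, so a real pipeline would run it in a sandbox.

```python
def filter_by_tests(candidates, test_cases, func_name="solve"):
    """Keep candidate programs whose function passes all test cases.

    candidates: list of Python source strings (e.g. sampled from a model).
    test_cases: list of (args_tuple, expected_result) pairs.
    """
    kept = []
    for source in candidates:
        namespace = {}
        try:
            # WARNING: exec runs arbitrary code; sandbox this in practice.
            exec(source, namespace)
            fn = namespace[func_name]
            if all(fn(*args) == expected for args, expected in test_cases):
                kept.append(source)
        except Exception:
            # Syntax errors, missing functions, and crashes all disqualify.
            continue
    return kept


candidates = [
    "def solve(a, b):\n    return a + b",
    "def solve(a, b):\n    return a - b",
    "def solve(a, b) return a + b",
]
tests = [((1, 2), 3), ((0, 0), 0)]
verified = filter_by_tests(candidates, tests)
```

Only the first candidate survives: the second fails a test and the third does not parse. The surviving samples become training data without a human writing or checking them.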
