Training

How models learn from data before they ever talk to a user.

Backpropagation

Backpropagation is the algorithm used to compute how each weight in a neural network should change to reduce error, by propagating gradients backward through the network.

intermediate
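
As a minimal sketch of the idea, the snippet below runs backpropagation by hand on a two-weight network. The network shape, the tanh activation, the squared-error loss, and all function names are assumptions chosen for illustration; real frameworks automate the same chain-rule bookkeeping.

```python
import math

def forward(w1, w2, x, target):
    """Tiny network: x -> tanh(w1 * x) -> w2 * h, scored with squared error."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    return h, y, loss

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y, _ = forward(w1, w2, x, target)
    dloss_dy = 2.0 * (y - target)              # d(loss)/d(y)
    dloss_dw2 = dloss_dy * h                   # because y = w2 * h
    dloss_dh = dloss_dy * w2                   # gradient flowing into the hidden unit
    dloss_dw1 = dloss_dh * (1.0 - h * h) * x   # tanh'(z) = 1 - tanh(z)^2
    return dloss_dw1, dloss_dw2

def finite_difference(w1, w2, x, target, eps=1e-6):
    """Numerical gradients, used only to sanity-check the analytic ones."""
    loss = lambda a, b: forward(a, b, x, target)[2]
    g1 = (loss(w1 + eps, w2) - loss(w1 - eps, w2)) / (2 * eps)
    g2 = (loss(w1, w2 + eps) - loss(w1, w2 - eps)) / (2 * eps)
    return g1, g2
```

Comparing `backward` against `finite_difference` at the same point is a common way to check a hand-written gradient.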

Direct Preference Optimization

DPO fine-tunes an LLM directly on (preferred, rejected) response pairs, without training a separate reward model or running an RL loop. It is generally simpler and more stable to train than RLHF.

advanced
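
The per-pair DPO objective can be sketched in a few lines. This assumes you already have the summed log-probabilities of each response under the model being trained and under a frozen reference copy; the function and argument names are hypothetical, and `beta=0.1` is only an illustrative default.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of sequence log-probabilities."""
    chosen_shift = policy_chosen - ref_chosen        # how far the policy moved on the preferred response
    rejected_shift = policy_rejected - ref_rejected  # same for the rejected response
    margin = beta * (chosen_shift - rejected_shift)
    return softplus(-margin)                         # equals -log(sigmoid(margin))
```

When the policy still matches the reference, the margin is zero and the loss is log 2; it falls as the policy raises the preferred response relative to the rejected one.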

Distillation

Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.

intermediate
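
One common formulation of the imitation objective compares temperature-softened output distributions. The sketch below assumes both models expose raw logits over the same vocabulary; the function names and the temperature of 2.0 are illustrative choices, not a fixed recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge.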

Fine-tuning

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

intermediate

Gradient Descent

Gradient descent is the optimization algorithm at the heart of training: nudge each weight in the direction that reduces the loss, with a small step size set by the learning rate.

intermediate
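
A minimal sketch of the update rule on a one-dimensional problem, assuming a toy loss (w - 3)^2 whose minimum is at w = 3. The function name and the chosen step size are illustrative.

```python
def gradient_descent(grad_fn, w0, learning_rate, steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss (w - 3)^2 has gradient 2 * (w - 3).
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0, learning_rate=0.1, steps=100)
```

Real training applies the same rule to millions or billions of weights at once, with gradients supplied by backpropagation.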

Instruction Tuning

Instruction tuning is fine-tuning on examples of (instruction, desired response) pairs so a base model learns to follow natural-language directions.

intermediate

Learning Rate

The learning rate is the step size used to update weights during training. Too high and training diverges; too low and it crawls or gets stuck.

intermediate
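
The three regimes can be seen on a toy loss w^2, whose gradient is 2w. The helper name and the specific learning rates below are assumptions picked to make each behavior visible.

```python
def run(learning_rate, steps=50, w=1.0):
    """Gradient descent on the loss w^2, starting from w = 1."""
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_high = run(1.1)     # each step overshoots; |w| grows without bound
reasonable = run(0.3)   # shrinks quickly toward the minimum at 0
too_low = run(0.001)    # still close to the starting point after 50 steps
```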

LoRA

LoRA is a parameter-efficient fine-tuning method that freezes a model's original weights and learns small low-rank update matrices alongside them. This makes fine-tuning cheap enough to run on a single GPU in many cases.

intermediate
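
A sketch of the mechanism using plain Python lists, assuming a frozen weight matrix `W` (d_out x d_in) and two trainable factors `A` (rank x d_in) and `B` (d_out x rank). The helper names are hypothetical, and the `alpha / rank` scaling follows the usual convention.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)               # original weights, never updated
    update = matvec(B, matvec(A, x))  # only A and B receive gradients
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one adapted weight matrix."""
    return rank * (d_in + d_out)

full = 4096 * 4096                           # 16,777,216 weights in the frozen matrix
adapter = lora_param_count(4096, 4096, 8)    # 65,536 trainable weights
```

With `B` initialized to zeros, the adapted layer starts out identical to the frozen one.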

Loss Function

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is typically the cross-entropy between the predicted next-token distribution and the actual next token.

intermediate
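
A minimal sketch of next-token cross-entropy, assuming the model's output has already been turned into one probability list per position. The function name is illustrative.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Mean cross-entropy: the average of -log p(correct next token)."""
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

# A model that spreads probability evenly over 4 tokens scores log(4) per position.
uniform = next_token_loss([[0.25] * 4] * 3, [0, 1, 2])
# A model that puts 0.9 on the correct token scores much lower.
confident = next_token_loss([[0.9, 0.05, 0.03, 0.02]] * 3, [0, 0, 0])
```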

Preference Data

Preference data consists of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.

intermediate
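
One possible in-memory shape for a record is sketched below. The class name, field names, validity check, and example text are all hypothetical; actual datasets vary in schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator ranked lower

def is_valid(pair):
    """Basic hygiene check: a pair carries no signal if both responses match."""
    return bool(pair.prompt.strip()) and pair.chosen != pair.rejected

example = PreferencePair(
    prompt="Explain what a learning rate is.",
    chosen="It is the step size used for each weight update.",
    rejected="It is how fast the GPU runs.",
)
```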

Pretraining

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

intermediate

QLoRA

QLoRA trains LoRA adapters on top of a frozen, 4-bit quantized base model, making it possible to fine-tune 70B-class models on a single 48 GB GPU at close to full fine-tuning quality.

advanced
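
The memory argument is simple arithmetic, sketched here for the weights alone. The helper name is hypothetical, and the estimate deliberately ignores quantization metadata, adapter weights, activations, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

sixteen_bit = weight_memory_gb(70e9, 16)   # 140 GB: does not fit on one 48 GB GPU
four_bit = weight_memory_gb(70e9, 4)       # 35 GB: leaves headroom on a 48 GB GPU
```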

Reinforcement Learning from Human Feedback

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

advanced

Reward Model

A reward model is trained on preference data to score model outputs the way human raters would. RLHF then optimizes the policy LLM to maximize the reward model's score.

advanced
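
Reward models are commonly trained with a pairwise (Bradley-Terry style) objective. The sketch below assumes the model has already produced a scalar score for each response; the function names are illustrative.

```python
import math

def preference_probability(score_chosen, score_rejected):
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(score difference)."""
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

def pairwise_loss(score_chosen, score_rejected):
    """Negative log-likelihood of the human's recorded preference."""
    return -math.log(preference_probability(score_chosen, score_rejected))
```

Minimizing this loss pushes the score of the chosen response above the score of the rejected one.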

Scaling Laws

Scaling laws are empirical power-law relationships between model size, training data, training compute, and the resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.

intermediate
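
What "power law" means in practice can be sketched with a single-variable curve. The functional form below covers model size only, and the `scale` and `exponent` constants are made up for illustration; they are not fitted values from any published study.

```python
def power_law_loss(n_params, scale=1.0e13, exponent=0.08):
    """Hypothetical curve L(N) = (scale / N) ** exponent; constants are made up."""
    return (scale / n_params) ** exponent

# Doubling model size multiplies the loss by the same factor at any scale.
ratio_small = power_law_loss(2e9) / power_law_loss(1e9)
ratio_large = power_law_loss(2e11) / power_law_loss(1e11)
```

Both ratios equal 2 ** -0.08, which is why such curves appear as straight lines on a log-log plot and can be extrapolated.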

Supervised Fine-Tuning

SFT is fine-tuning where each training example has an explicit input and a desired output, supervised by a loss that penalizes deviation from that output.

intermediate

Synthetic Data

Synthetic data is training data produced by a model rather than handwritten by humans: for example, instructions distilled from GPT-4, code generated and filtered by tests, or reasoning traces sampled from a stronger model.

intermediate

Training Compute

Training compute is the total number of floating-point operations used to pretrain a model, usually expressed in FLOPs (e.g. 10^25 FLOPs). It is the headline figure that recent AI regulation uses as a threshold; the EU AI Act, for example, sets one at 10^25 FLOPs.

advanced
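
For dense transformer models, a widely used rule of thumb estimates training compute as roughly 6 FLOPs per parameter per training token. The sketch below applies it to a hypothetical 70B-parameter model trained on 1.4 trillion tokens; the function name and the example sizes are assumptions, and the result is an approximation rather than a measured figure.

```python
def estimate_training_flops(num_params, num_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens

flops = estimate_training_flops(70e9, 1.4e12)   # roughly 5.9e23 FLOPs
```

Under this estimate the hypothetical model lands well below 10^25 FLOPs.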