Training · intermediate

Distillation (knowledge distillation)

Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.

Published May 29, 2026

Explanation

The teacher produces outputs (or full output probability distributions) for many inputs, and the student is trained to match them. The teacher's soft probabilities carry more information than the top-1 answer alone: they also show how plausible the teacher finds each alternative. That extra signal is why a student can learn surprisingly well from a teacher with far more capacity than its own.
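The soft-target idea can be sketched in a few lines of plain Python. This is a minimal illustration in the spirit of the Hinton et al. formulation, not a production training loop: the function names, the temperature of 2.0, the 50/50 weighting `alpha`, and the toy logits are all assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (illustrative)."""
    # Soft-target term: match the teacher's softened distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by temperature**2 keeps gradient magnitudes comparable across temperatures.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    # Hard-target term: ordinary cross-entropy against the true label.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_label])

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy logits (assumed values). Both students pick class 0, like the teacher,
# but only the first agrees with the teacher about the runner-up class.
teacher_logits = [4.0, 2.5, -1.0]
good_student = [3.5, 2.0, -0.5]
poor_student = [3.5, -1.0, 2.0]

print(distillation_loss(good_student, teacher_logits, true_label=0))
print(distillation_loss(poor_student, teacher_logits, true_label=0))
```

Both toy students give the same top-1 answer with similar confidence, so the hard-label term barely separates them. The soft-target term does: the second student ranks the teacher's least likely class second, and its loss is noticeably higher.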

Distillation is how many cheap, fast production models are built: a larger model's outputs become the training data for a smaller model. Many small models in use in 2026 are distilled in some form, either from teacher probabilities (the smaller Gemma 2 models, Llama 3.2 1B and 3B) or from teacher-generated synthetic data (Phi-3).
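Using a teacher's outputs as training data can be sketched as a small data-collection step. Everything here is hypothetical scaffolding: `toy_teacher` stands in for a call to a real large model, and the `prompt`/`response` record layout and the `min_chars` filter are assumptions for illustration rather than any particular library's format.

```python
import json

def build_distillation_set(prompts, teacher, min_chars=1):
    """Collect prompt/response pairs from a teacher callable (illustrative)."""
    records = []
    for prompt in prompts:
        response = teacher(prompt).strip()
        # Drop empty or too-short generations rather than train the student on them.
        if len(response) < min_chars:
            continue
        records.append({"prompt": prompt, "response": response})
    return records

def to_jsonl(records):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def toy_teacher(prompt):
    # Hypothetical stand-in: a real pipeline would query the large model here.
    canned = {
        "What is 2 + 2?": "2 + 2 = 4.",
        "Capital of France?": "Paris.",
    }
    return canned.get(prompt, "")

prompts = ["What is 2 + 2?", "Capital of France?", "Unanswered prompt"]
dataset = build_distillation_set(prompts, toy_teacher)
print(to_jsonl(dataset))
```

The resulting file would then serve as an ordinary supervised fine-tuning set for the student. Real pipelines usually add stronger filtering (deduplication, quality checks) before training.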

Examples

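DistilBERT: a 6-layer student of 12-layer BERT-base with about 66M parameters against 110M (roughly 60% of the size), retaining about 97% of BERT's language-understanding performance as reported by its authors.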
Phi-3: small models trained heavily on textbook-quality data and synthetic data from larger teachers.

Frequently asked

How is Distillation related to Fine-tuning?

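Both are training techniques, and they are often combined. Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge. Distillation is defined by where the training signal comes from (a teacher model's outputs) rather than by the size of the dataset, so in practice a student is often a pretrained model fine-tuned on teacher outputs.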

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Quantization · Infrastructure

Quantization reduces model weights from 16- or 32-bit floats to lower-precision types (INT8, INT4) so the model needs less memory and runs faster, usually with minor quality loss.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Synthetic Data · Training

Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.

Sources

Hinton, Vinyals, and Dean. Distilling the Knowledge in a Neural Network. arXiv:1503.02531, 2015.