Training · intermediate
Distillation (knowledge distillation)
Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.
Explanation
The teacher generates outputs (or output probability distributions) for many inputs, and the student is trained to match them. The teacher's soft probabilities carry more information than the top-1 answer alone, because they also show which alternative answers the teacher considers nearly right. As a result, a student can recover much of the teacher's quality despite having far less capacity.
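To make the soft-probability idea concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation in which both models' logits are softened by a temperature and compared with KL divergence; the function names and the example logits are illustrative, not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Multiplying by T^2 is a common convention that keeps gradient
    # magnitudes comparable when the temperature changes.
    return kl * temperature ** 2

# Illustrative logits over three classes (made-up values)
teacher = [4.0, 2.5, 0.5]
close_student = [3.8, 2.4, 0.6]  # nearly agrees with the teacher
far_student = [0.5, 2.5, 4.0]    # ranks the classes in reverse

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is zero when the student's distribution equals the teacher's and grows as the two diverge. In practice this term is often combined with an ordinary cross-entropy loss on ground-truth labels when such labels are available.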
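Distillation is how many cheap, fast production models are built: a larger model's outputs become training data for a smaller one. Several widely used small models rely on it in some form. For example, Gemma 2's 2B and 9B models were trained with knowledge distillation, the Llama 3.2 1B and 3B models used pruning and distillation from larger Llama models, and Phi-3 was trained partly on synthetic data generated by larger models.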
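The "teacher outputs as training data" workflow can be sketched as follows. This is a toy illustration only: `toy_teacher` is a hypothetical stand-in for a call to a real large model, and the `prompt`/`completion` record layout is an assumed convention, not a required format.

```python
import json

def toy_teacher(prompt):
    """Hypothetical stand-in for a call to a large teacher model."""
    return f"Answer to: {prompt}"

def build_distillation_set(prompts, teacher_fn):
    """Pair each prompt with the teacher's output as the training target."""
    records = []
    for prompt in prompts:
        completion = teacher_fn(prompt)
        if completion.strip():  # drop empty generations
            records.append({"prompt": prompt, "completion": completion})
    return records

prompts = ["What is distillation?", "Define a student model."]
dataset = build_distillation_set(prompts, toy_teacher)

# One JSON object per line (JSONL), a common layout for fine-tuning data
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

The resulting records would then be used to train the student with an ordinary supervised objective. Real pipelines usually add filtering for quality and deduplication before training.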
Examples
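- DistilBERT: a 6-layer student of the 12-layer BERT-base, about 40% smaller (roughly 60% of the size), reported to retain about 97% of BERT's language-understanding performance while running about 60% faster.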
- Phi-3: small models trained heavily on textbook-quality data and synthetic data from larger teachers.
Frequently asked
What is Distillation?
Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.
What is an example of distillation?
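DistilBERT: a 6-layer student of the 12-layer BERT-base, about 40% smaller (roughly 60% of the size), reported to retain about 97% of BERT's language-understanding performance while running about 60% faster.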
How is Distillation related to Fine-tuning?
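Distillation and Fine-tuning are both training concepts, and they often overlap. Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge. Distillation is defined by where the training signal comes from: a teacher model's outputs rather than human-written labels. In practice, distilling into a pretrained student is often done by fine-tuning it on teacher-generated data.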
Is Distillation considered intermediate?
Distillation is generally considered intermediate-level material in the AI and LLM space.