Comparison

Distillation vs Pretraining

Distillation and Pretraining are both common AI/LLM terms but cover different ideas. Here is a quick side-by-side.

When you would reach for Distillation

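Distillation comes up when you already have a strong trained model and need a smaller, cheaper one that keeps most of its quality, for example to cut inference cost or latency.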

Example: DistilBERT is a 6-layer student distilled from 12-layer BERT. It is about 60% of the size and keeps more than 95% of the performance (its authors report about 97% on GLUE).
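To make "imitate the outputs of a teacher" concrete, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation where both models' logits are softened with a temperature and compared with KL divergence, scaled by the temperature squared. The function names, the logits, and the temperature of 2.0 are illustrative assumptions, not taken from any particular library or from DistilBERT's training code.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature ** 2


# Hypothetical logits over a 3-token vocabulary (illustrative values only).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # prefers a different token

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close student loss: {loss_close:.4f}")
print(f"far student loss:   {loss_far:.4f}")
```

The student that agrees with the teacher gets the lower loss, so minimizing this quantity pushes the student toward the teacher's output distribution. In practice this term is usually combined with an ordinary loss on the true labels.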

When you would reach for Pretraining

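Pretraining comes up when you are building a base model from scratch: the model starts with no knowledge and must learn general language patterns from a large text corpus before any later adaptation.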

Example: GPT-3 was pretrained on roughly 300B tokens.
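The pretraining objective, predicting the next token, can be sketched as an average negative log-likelihood. The snippet below is a toy illustration in plain Python: the token ids, the 3-token vocabulary, and the hard-coded "model" probabilities are all made-up assumptions standing in for a real model's predictions.

```python
import math


def next_token_loss(probs_per_position, token_ids):
    """Average negative log-likelihood of each actual next token.

    probs_per_position[i] is the predicted distribution over the vocabulary
    for the token that follows token_ids[i].
    """
    targets = token_ids[1:]  # the token at position i+1 is the target for position i
    nll = [-math.log(probs[t]) for probs, t in zip(probs_per_position, targets)]
    return sum(nll) / len(nll)


# Hypothetical 3-token sequence over a vocabulary of size 3.
tokens = [0, 2, 1]

# Toy predictions for the two next-token positions (illustrative values only).
confident = [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]]  # puts 0.8 on each correct token
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]  # knows nothing yet

loss_confident = next_token_loss(confident, tokens)
loss_uniform = next_token_loss(uniform, tokens)
print(f"confident model loss: {loss_confident:.4f}")
print(f"uniform model loss:   {loss_uniform:.4f}")
```

An untrained model behaves like the uniform case, with a loss equal to the log of the vocabulary size. Pretraining lowers this loss by repeating the same computation over a very large corpus and updating the model's parameters.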

Frequently asked

What is the difference between Distillation and Pretraining?

Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model. The student becomes much cheaper to run while retaining much of the teacher's quality.

Pretraining is the initial training phase in which an LLM learns to predict the next token on a very large corpus of general text, typically hundreds of billions to trillions of tokens. It produces a base model that can be adapted later.

When should I use Distillation vs Pretraining?

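Use distillation when a capable teacher model already exists and the goal is a smaller, cheaper model. Use pretraining when you are creating the base model itself. The two are often combined: a large model is pretrained first, then distilled into a smaller student.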

Are Distillation and Pretraining the same thing?

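No. Both are training procedures, but they differ in what the model learns from. Pretraining learns directly from raw text by predicting the next token, while distillation learns from the outputs of an already-trained teacher model. Pretraining creates a base model; distillation compresses an existing one.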