Training · intermediate
Synthetic Data
Synthetic data is training data produced by a model — instructions distilled from GPT-4, code generated and filtered by tests, reasoning traces sampled from a stronger model — rather than handwritten by humans.
Explanation
Modern post-training relies heavily on synthetic data. Smaller open models (Llama-3-Instruct, Phi, Gemma) are often fine-tuned on outputs from much larger teachers, and those outputs are sometimes filtered for correctness by a verifier such as a unit-test suite or an answer checker.
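Verifier-based filtering can be sketched in a few lines. The example below is a minimal illustration, not any lab's actual pipeline: the `candidates` list stands in for samples a teacher model might return, and `passes_tests`, `cases`, and the `add` task are hypothetical names chosen for the sketch.

```python
def passes_tests(source, func_name, cases):
    """Run one candidate in a fresh namespace and check every test case."""
    namespace = {}
    try:
        # Assumption: candidates are trusted. Real pipelines should run
        # untrusted generated code in a sandbox, not with a bare exec().
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Hypothetical teacher samples for the task "write add(a, b)".
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",  # runs, but gives wrong answers
    "def add(a, b)\n    return a + b",   # does not parse
]

# Each case is (arguments, expected result).
cases = [((1, 2), 3), ((-1, 1), 0)]

# Keep only the samples that pass every test; these become training data.
kept = [src for src in candidates if passes_tests(src, "add", cases)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

Only the first candidate survives here. The same pattern applies to math or reasoning data, with the test cases swapped for an answer checker.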
The economics: a fixed budget that buys 10K human-written examples might buy millions of synthetic ones at GPT-4 inference prices. As long as quality is kept high through filtering, ranking, and deduplication, synthetic data yields far more usable examples per dollar.
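A back-of-the-envelope version of this comparison is shown below. Every number in it is an illustrative assumption (the budget, the per-example human cost, the token count, the per-token price, and the share of samples that survive filtering), not a quoted price or a measured keep rate.

```python
def examples_for_budget(budget_usd, cost_per_example_usd, keep_rate=1.0):
    """Number of usable examples a budget buys after filtering."""
    return int(budget_usd / cost_per_example_usd * keep_rate)


# All values below are hypothetical, chosen only to make the arithmetic concrete.
BUDGET_USD = 100_000
HUMAN_COST_PER_EXAMPLE = 10.0    # assumed cost of one human-written example
TOKENS_PER_EXAMPLE = 500         # assumed length of one generated example
PRICE_PER_MILLION_TOKENS = 30.0  # assumed inference price in USD
KEEP_RATE = 0.5                  # assumed share surviving filtering and dedup

synthetic_cost = TOKENS_PER_EXAMPLE * PRICE_PER_MILLION_TOKENS / 1_000_000

human_n = examples_for_budget(BUDGET_USD, HUMAN_COST_PER_EXAMPLE)
synthetic_n = examples_for_budget(BUDGET_USD, synthetic_cost, KEEP_RATE)

print(f"human-written: {human_n:,}")
print(f"synthetic (after filtering): {synthetic_n:,}")
```

Under these assumptions the budget buys 10,000 human-written examples or roughly 3.3 million filtered synthetic ones. The gap narrows as the keep rate falls, which is why the quality controls matter to the economics.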
Risks: model collapse, where quality and diversity degrade when a model is trained on too much of its own output, and blind spots inherited from the teacher.
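One commonly suggested precaution against over-reliance on generated data is to cap its share of the training mix. The sketch below assumes that approach. `build_mix` and the 0.5 cap are hypothetical choices for illustration, not a published recipe, and a cap alone does not guarantee that collapse is avoided.

```python
import random


def build_mix(human, synthetic, max_synthetic_fraction=0.5, seed=0):
    """Return a shuffled training set in which synthetic items make up
    at most `max_synthetic_fraction` of the total."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Solve s / (len(human) + s) <= f for s, the number of synthetic items.
    limit = int(max_synthetic_fraction * len(human) / (1 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(limit, len(synthetic)))
    mix = list(human) + chosen
    rng.shuffle(mix)
    return mix


# Toy data: tuples tagged by origin so the ratio is easy to inspect.
human = [("human", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(1000)]

mix = build_mix(human, synthetic, max_synthetic_fraction=0.5)
n_synthetic = sum(1 for origin, _ in mix if origin == "synthetic")
print(f"{n_synthetic} synthetic of {len(mix)} total")
```

With 100 human examples and a 0.5 cap, only 100 of the 1,000 synthetic examples are used, so all the human data stays in the mix.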
Examples
- Phi-3 was trained heavily on textbook-quality synthetic data.
- Tulu-3 post-training mixes synthetic instructions with human-written data.
Frequently asked
What is an example of synthetic data?
One example is the training data for Phi-3, which was trained heavily on textbook-quality synthetic data.
How is Synthetic Data related to Distillation?
Synthetic Data and Distillation are both training concepts, and they often overlap. Distillation trains a smaller "student" model to imitate the outputs of a larger "teacher" model, so the student becomes much cheaper to run while retaining much of the teacher's quality. When the student learns from text the teacher generated, that text is itself synthetic data.
Is Synthetic Data considered intermediate?
Synthetic Data is generally considered intermediate-level material in the AI and LLM space.