Comparison

Pretraining vs Synthetic Data

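Pretraining and Synthetic Data are both common AI/LLM terms, but they name different kinds of things: pretraining is a stage of training, while synthetic data is a source of training examples. Here is a quick side-by-side.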

When you would reach for Pretraining

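Pretraining comes up when the question is about how a base model is first trained: the next-token prediction objective, the size and mix of the text corpus, and what the resulting base model can do before any adaptation.

Example: GPT-3 was pretrained on roughly 300B tokens.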
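The next-token objective behind pretraining can be illustrated on a toy scale. The sketch below is only an assumption-laden stand-in: it uses a count-based bigram model that conditions on a single previous token, whereas a real LLM is a neural network conditioning on the whole preceding context. The function names (next_token_pairs, fit_bigram, avg_next_token_loss) are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def next_token_pairs(tokens):
    """Turn a token sequence into (current_token, next_token) training pairs."""
    return list(zip(tokens[:-1], tokens[1:]))

def fit_bigram(pairs):
    """Estimate P(next | current) by counting; a toy stand-in for a neural LM."""
    counts = defaultdict(Counter)
    for cur, nxt in pairs:
        counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for cur, ctr in counts.items()
    }

def avg_next_token_loss(model, pairs):
    """Average negative log-likelihood of the observed next tokens."""
    return -sum(math.log(model[cur][nxt]) for cur, nxt in pairs) / len(pairs)

# Whitespace splitting stands in for a real tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
model = fit_bigram(pairs)
loss = avg_next_token_loss(model, pairs)
print(f"{len(pairs)} training pairs, average loss {loss:.4f}")
```

Pretraining at scale minimizes the same kind of average next-token loss, just over a far larger corpus and with a far more expressive model.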

When you would reach for Synthetic Data

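Synthetic Data comes up when the question is about where training examples come from: whether they were generated by a model rather than written by humans, and how that generated data was filtered for quality.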

Example: Phi-3 was trained on a mix of heavily filtered web data and synthetic data, following the "textbook-quality" approach of earlier Phi models.
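One common recipe for synthetic code data is to sample candidate solutions from a model and keep only those that pass tests. The sketch below shows the filtering step only. The candidate strings are hard-coded, hypothetical stand-ins for model output, and passes_tests is an illustrative helper, not part of any library.

```python
def passes_tests(source, func_name, test_cases):
    """Execute candidate source and check it against (args, expected) pairs."""
    namespace = {}
    try:
        # exec is acceptable for this trusted toy input only;
        # real model output should be run in a sandbox.
        exec(source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all reject the candidate.
        return False

# Hypothetical candidates standing in for samples from a code model.
candidates = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # syntax error (missing colon)
]
test_cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

kept = [src for src in candidates if passes_tests(src, "add", test_cases)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

Only the surviving candidates would be added to the training set; the tests act as an automatic quality filter in place of human review.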

Frequently asked

What is the difference between Pretraining and Synthetic Data?

Pretraining: Pretraining is the initial training phase in which an LLM learns to predict the next token over a very large corpus of general text, typically hundreds of billions to trillions of tokens. It produces a base model that can be adapted later.

Synthetic Data: Synthetic data is training data produced by a model rather than written by humans. Examples include instructions distilled from GPT-4, code that is generated and then filtered by tests, and reasoning traces sampled from a stronger model.

When should I use Pretraining vs Synthetic Data?

Pretraining is the right concept when you are focused on the training stage that produces a base model. Synthetic Data applies when you are focused on the origin of the training examples, whichever stage they are used in.

Are Pretraining and Synthetic Data the same thing?

No. Pretraining is a stage of training; Synthetic Data is a kind of training data. They are related, since synthetic data can be part of a pretraining corpus, but they address different parts of the AI stack.