Comparison

Pretraining vs Scaling Laws

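Pretraining and Scaling Laws are both common LLM terms, but they name different things: one is a training phase, the other is an empirical relationship used to plan and forecast that phase. Here is a quick side-by-side.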

When you would reach for Pretraining

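Pretraining comes up when the question is about how a base model is built: what data it sees, what objective it optimizes, and how many tokens it is trained on.

Example: GPT-3 was pretrained on roughly 300B tokens.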

When you would reach for Scaling Laws

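Scaling laws come up when the question is about planning or forecasting: how loss should change as model size, training data, or compute grows.

Example: predicting GPT-4's final loss before training it, by extrapolating from smaller-scale runs.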
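Forecasting loss from smaller runs can be sketched as a power-law fit. The snippet below is a minimal illustration, not the procedure any lab actually used: the compute and loss values are synthetic, the helper names fit_power_law and predict_loss are hypothetical, and the functional form loss ≈ a * compute^(-b) is a simplifying assumption (published scaling-law fits usually add an irreducible-loss term).

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space."""
    # A power law is a straight line after taking logs of both axes.
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** (-b)

# Synthetic measurements standing in for small training runs (made-up numbers).
small_compute = np.array([1e17, 1e18, 1e19, 1e20])  # training FLOPs
small_loss = 20.0 * small_compute ** (-0.05)        # generated from a known law

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate three orders of magnitude beyond the largest small run.
forecast = predict_loss(a, b, 1e23)
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e23 FLOPs: {forecast:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers its parameters. Real measurements are noisy, so a forecast like this is only as good as the assumed functional form and the range of runs it was fitted on.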

Frequently asked

What is the difference between Pretraining and Scaling Laws?

Pretraining: Pretraining is the initial training phase where an LLM learns to predict the next token on a very large corpus of general text, typically hundreds of billions to trillions of tokens. It produces a base model that can be adapted later.

Scaling Laws: Scaling laws are the empirical power-law relationships between model size, training data, training compute, and the resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
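The next-token objective mentioned in the definition above can be made concrete with a small sketch. This is an illustrative NumPy version under stated assumptions, not any particular framework's implementation: the function name next_token_loss is hypothetical, and the logits array stands in for the output of a model that is not shown.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy for predicting token t+1 from position t.

    logits:    array of shape (seq_len, vocab_size), one row per position
    token_ids: integer array of shape (seq_len,)
    """
    preds = logits[:-1]      # the last position has no next token to predict
    targets = token_ids[1:]  # the target at position t is the token at t+1
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = preds - preds.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Average negative log-probability assigned to the true next tokens.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# A model that knows nothing assigns equal probability to every token,
# so its loss equals log(vocab_size).
tokens = np.array([1, 2, 3, 4, 5])
uniform_logits = np.zeros((5, 8))
print(next_token_loss(uniform_logits, tokens), np.log(8))
```

This loss is the quantity that pretraining drives down, and it is the same quantity that scaling laws relate to model size, data, and compute.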

When should I use Pretraining vs Scaling Laws?

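Pretraining is the right concept when you are focused on the training run itself: the data, the next-token objective, and the base model it produces. Scaling laws apply when you are focused on planning or forecasting, such as estimating what loss a larger model or dataset should reach before committing compute to it.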

Are Pretraining and Scaling Laws the same thing?

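No. Pretraining is a training phase; scaling laws are an empirical relationship describing how the loss reached in that phase changes with model size, data, and compute. They are related, since scaling laws are typically measured on pretraining runs and used to plan them, but one is a process and the other is a predictive tool.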