Training · intermediate
Scaling Laws
Scaling laws are empirical power-law relationships between a model's loss and its size (parameter count), training data, and training compute. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
Explanation
Kaplan et al. (2020) showed that LLM loss falls as a smooth power law in compute, parameters, and data. This lets labs fit the curve on small training runs and forecast the loss of a 10x larger model before training it.
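The forecasting idea above can be sketched in a few lines. This is a minimal illustration, not the procedure from the paper: it assumes a pure power law of the form loss = a * compute^(-b) with no irreducible-loss term, fits it by least squares in log-log space, and extrapolates one order of magnitude further. The function names (`fit_power_law`, `predict`) and all data points are hypothetical; the measurements are synthetic values generated from a made-up curve.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                      # slope of the line in log-log space
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope     # (a, b)

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)

# Synthetic "small run" results (assumption: generated from loss = 10 * C**-0.05).
compute = [1e17, 1e18, 1e19, 1e20]         # training FLOPs of the small runs
loss = [10.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
forecast = predict(a, b, 1e21)             # 10x beyond the largest small run
print(f"a={a:.3f}, b={b:.4f}, forecast loss at 1e21 FLOPs: {forecast:.4f}")
```

On real measurements the points will not sit exactly on a line, and a fit that includes a constant loss floor may extrapolate better than the pure power law assumed here.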
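Hoffmann et al. (2022, "Chinchilla") refined this with the compute-optimal frontier: for a given compute budget, there is one parameter count and one token count that together minimize loss. They found that parameters and training tokens should grow in roughly equal proportion as compute grows, which is often summarized as about 20 training tokens per parameter. Most pre-2022 LLMs were oversized relative to their data; "Chinchilla-optimal" training rebalanced toward more tokens.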
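A rough way to see what a compute-optimal split looks like is sketched below. It relies on two common approximations that are assumptions here, not statements from the text above: training compute C is about 6 * N * D FLOPs for N parameters and D tokens, and the optimal ratio D / N is about 20. The function name `compute_optimal` and the constants are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumption: rule-of-thumb optimal D / N

def compute_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) under the two assumptions.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget roughly matching Chinchilla itself: 6 * 70e9 params * 1.4e12 tokens.
n, d = compute_optimal(5.88e23)
print(f"params ≈ {n:.3g}, tokens ≈ {d:.3g}")
```

With that budget the sketch recovers about 70B parameters and 1.4T tokens, the configuration Chinchilla was trained with. Because both N and D scale with the square root of C in this approximation, a 10x larger budget raises each by about 3.2x rather than putting all the extra compute into model size.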
Newer work extends scaling laws to inference compute, distillation, and reasoning (test-time) compute. Scaling laws remain among the field's most durable empirical generalizations.
Examples
- Predicting GPT-4's loss before training based on smaller-scale runs.
- Llama 3 was trained on about 15T tokens, far more than the Chinchilla-optimal amount for its model sizes.
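To put the Llama 3 example in numbers, the sketch below compares tokens per parameter against the roughly 20:1 ratio commonly associated with Chinchilla. The 8B and 70B model sizes and the 20:1 ratio are assumptions brought in for illustration; the list above gives only the 15T token count.

```python
RULE_OF_THUMB = 20.0   # assumption: approximate Chinchilla-optimal tokens per parameter
TRAIN_TOKENS = 15e12   # 15T tokens, from the example above

def tokens_per_param(n_tokens, n_params):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

# Model sizes are assumptions for illustration.
for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    ratio = tokens_per_param(TRAIN_TOKENS, n_params)
    print(f"{name}: {ratio:.0f} tokens/param, "
          f"{ratio / RULE_OF_THUMB:.1f}x the rule of thumb")
```

Training this far past the compute-optimal point spends extra training compute to get a smaller model that is cheaper to run at inference time.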
Frequently asked
What are scaling laws?
Scaling laws are empirical power-law relationships between a model's loss and its size (parameter count), training data, and training compute. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
What is an example of scaling laws?
Predicting GPT-4's loss before training based on smaller-scale runs.
How are scaling laws related to pretraining?
Scaling laws and pretraining are both training concepts. Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text, producing a base model that can be adapted later. Scaling laws describe how the loss reached in that phase depends on model size, data, and compute.
Are scaling laws considered intermediate?
Scaling laws are generally considered intermediate-level material in the AI and LLM space.