Training · intermediate

Scaling Laws

Scaling laws are empirical power-law relationships between model size, training data, training compute, and the resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.

Published May 30, 2026

Explanation

Kaplan et al. (2020) showed that LLM loss falls as a smooth power law in compute, parameters, and data, which lets labs forecast the loss of a 10x larger model before training it.
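To make the forecasting idea concrete, here is a minimal sketch that fits a pure power law to a few small runs and extrapolates to a larger budget. The data points are synthetic and the function names are hypothetical. The form loss = a * compute^(-b) is a simplification, since published fits typically also include an irreducible-loss term.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope  # (a, b)

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** (-b)

# Synthetic, illustrative "small runs" (not real measurements):
# generated from a = 20, b = 0.05 so the fit can be checked.
small_runs_compute = np.array([1e17, 1e18, 1e19, 1e20])
small_runs_loss = 20.0 * small_runs_compute ** (-0.05)

a, b = fit_power_law(small_runs_compute, small_runs_loss)

# Extrapolate to a run with 10x the compute of the largest small run.
forecast = predict_loss(a, b, 1e21)
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e21 FLOPs: {forecast:.3f}")
```

On real data the points would be noisy, so the fit is usually checked on a held-out run before it is trusted for extrapolation.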

Hoffmann et al. (2022, "Chinchilla") refined this with the compute-optimal frontier: for a given compute budget, there is one parameter count and one token count that together minimize loss. Most pre-2022 LLMs were oversized relative to their data; "Chinchilla-optimal" training rebalanced toward more tokens, roughly 20 per parameter.
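As a rough sketch of what "one ideal parameter count and one ideal token count" means in practice, the snippet below splits a compute budget using two common rules of thumb that are assumptions here, not statements from this entry: training compute C is about 6 * N * D FLOPs for N parameters and D tokens, and the compute-optimal ratio is about 20 tokens per parameter. The function name is hypothetical.

```python
import math

def compute_optimal_allocation(flops, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    Both constants are rules of thumb, not exact values.
    """
    params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens

# Example budget: 5.88e23 FLOPs.
params, tokens = compute_optimal_allocation(5.88e23)
print(f"params = {params:.3g}, tokens = {tokens:.3g}")
```

With this budget the sketch lands near 70 billion parameters and 1.4 trillion tokens, which is about the scale of the Chinchilla model itself.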

Newer work extends scaling laws to inference, distillation, and reasoning compute. Scaling laws remain the field's most durable empirical generalization.

Examples

Predicting GPT-4's loss before training based on smaller-scale runs.
Llama 3 8B was trained on about 15T tokens, roughly 1,900 tokens per parameter and far past the Chinchilla-optimal ratio of about 20 for its size.

Frequently asked

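What are scaling laws?

Scaling laws are empirical power-law relationships between model size, training data, training compute, and the resulting loss.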

What is an example of scaling laws?

Predicting GPT-4's loss before training based on smaller-scale runs.

How are scaling laws related to pretraining?

Scaling laws and pretraining are both training concepts. Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text, producing a base model that can be adapted later. Scaling laws describe how the loss reached in that phase falls as parameters, data, and compute grow.

Are scaling laws considered intermediate?

Scaling laws are generally considered intermediate-level material in the AI and LLM space.

Related terms

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Training Compute · Training

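Training compute is the total number of floating-point operations used to pretrain a model, usually expressed in FLOPs (e.g. 10^25 FLOPs). It is the headline number regulators now use to set thresholds; the EU AI Act, for example, presumes systemic risk for general-purpose models trained with more than 10^25 FLOPs.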
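For a sense of scale, training compute for a dense transformer is often approximated as about 6 FLOPs per parameter per training token. The sketch below uses that rule of thumb, which is an assumption rather than something this entry states, and a hypothetical 8-billion-parameter model trained on 15 trillion tokens. The function name and the threshold constant are illustrative.

```python
def estimate_training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Approximate total training FLOPs as 6 * N * D (a rule of thumb)."""
    return flops_per_param_token * n_params * n_tokens

# Illustrative threshold taken from the 10^25 example above.
THRESHOLD_FLOPS = 1e25

# Hypothetical model: 8e9 parameters, 15e12 training tokens.
flops = estimate_training_flops(8e9, 15e12)
print(f"estimated training compute: {flops:.2e} FLOPs")
print("above threshold" if flops > THRESHOLD_FLOPS else "below threshold")
```

The estimate ignores attention cost, recomputation, and hardware utilization, so it is best read as an order-of-magnitude figure.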

Parameter Count · Architecture

Parameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.

Loss Function · Training

A loss function measures how wrong a model's predictions are. Training minimizes it. For LLMs the loss is the cross-entropy of predicted vs. actual next tokens.
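A minimal NumPy sketch of that cross-entropy, assuming the model outputs one row of unnormalized scores (logits) per position and the targets are integer token ids. The function name and the toy shapes are hypothetical.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean next-token cross-entropy in nats.

    logits:  array of shape (positions, vocab_size), unnormalized scores
    targets: integer array of shape (positions,), the actual next tokens
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each actual next token.
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())

# A model that is uniform over a 4-token vocabulary scores log(4) nats.
uniform_loss = next_token_cross_entropy(np.zeros((3, 4)), [0, 1, 2])

# A model that strongly favors the correct token scores close to zero.
confident_logits = np.full((3, 4), -10.0)
confident_logits[np.arange(3), [0, 1, 2]] = 10.0
confident_loss = next_token_cross_entropy(confident_logits, [0, 1, 2])
print(uniform_loss, confident_loss)
```

This is the quantity on the vertical axis of a scaling-law plot: lower values mean the model assigns more probability to the tokens that actually came next.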

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.
