Training · advanced
Training Compute (training FLOPs)
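Training compute is the total number of floating-point operations used to pretrain a model, written as a FLOP count (e.g. 10^25 FLOPs). Here FLOPs is a total count of operations, not FLOP/s, which is a hardware throughput rate. It is the headline number that compute-based AI regulation uses as a threshold.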
Explanation
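Modern frontier training runs sit around 10^25-10^26 FLOPs. Regulatory thresholds fall in the same range: the 2023 Biden executive order on AI set reporting requirements for models trained with more than 10^26 operations, while the EU AI Act presumes systemic risk for general-purpose AI models trained with more than 10^25 FLOPs.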
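For a dense transformer, training compute is roughly 6 × parameters × tokens: about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass. So a 70B-parameter model trained on 15T tokens takes ~6.3 × 10^24 FLOPs. Chinchilla scaling gives the compute-optimal ratio of parameters to tokens for a fixed compute budget, roughly 20 tokens per parameter.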
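The 6 × parameters × tokens rule is easy to check by hand. Below is a minimal sketch of it, with `training_flops` as a hypothetical helper name. It is only an approximation for dense transformers and ignores attention-specific terms.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training compute as 6 * N * D."""
    # Rule of thumb: ~2 FLOPs per parameter per token for the forward pass
    # and ~4 for the backward pass, hence the factor of 6.
    return 6.0 * params * tokens


# The worked example from the text: 70B parameters, 15T tokens.
flops = training_flops(params=70e9, tokens=15e12)
print(f"{flops:.2e}")  # 6.30e+24
```

The same function works for any dense model, as long as the parameter and token counts are both known.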
Training compute is a key input to scaling laws and the main quantity that compute-based regulation targets as models grow.
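The Chinchilla-style split of a fixed budget can be sketched by combining the 6 × parameters × tokens rule with a fixed tokens-per-parameter ratio. The ratio of 20 used here is an assumed rule of thumb rather than an exact constant, and `compute_optimal_split` is a hypothetical helper name.

```python
import math

# Assumed rule of thumb (about 20 training tokens per parameter);
# the fitted value varies with the setup, so treat it as adjustable.
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget: float,
                          tokens_per_param: float = TOKENS_PER_PARAM):
    """Split a FLOP budget C into parameters N and tokens D.

    Uses C = 6 * N * D with D = r * N, which gives N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


params, tokens = compute_optimal_split(1e25)
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}")
```

Under these assumptions, a 10^25 FLOP budget comes out at roughly 290B parameters and 5.8T tokens. Changing `tokens_per_param` shows how sensitive the split is to the assumed ratio.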
Examples
- GPT-3: ~3 × 10^23 FLOPs.
- Llama 3 70B: ~6 × 10^24 FLOPs.
- GPT-4: estimated ~2 × 10^25 FLOPs.
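To put these figures on a common scale, the short sketch below expresses each one as a power of ten and as a multiple of GPT-3. It uses only the approximate values listed above, and the GPT-4 figure is an outside estimate rather than a disclosed number.

```python
import math

# Approximate training-compute figures from the list above (in FLOPs).
examples = {
    "GPT-3": 3e23,
    "Llama 3 70B": 6e24,
    "GPT-4 (estimated)": 2e25,
}

baseline = examples["GPT-3"]
for name, flops in examples.items():
    magnitude = math.log10(flops)   # order of magnitude
    ratio = flops / baseline        # multiple of GPT-3's compute
    print(f"{name}: 10^{magnitude:.1f} FLOPs, about {ratio:.0f}x GPT-3")
```

On these numbers, Llama 3 70B used about 20 times the compute of GPT-3, and GPT-4 about 67 times.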
Frequently asked
What is Training Compute?
Training compute is the total number of floating-point operations used to pretrain a model, written as a FLOP count (e.g. 10^25 FLOPs). It is the headline number that compute-based AI regulation uses as a threshold.
What is an example of training compute?
GPT-3: ~3 × 10^23 FLOPs.
How is Training Compute related to Scaling Laws?
Training compute is one of the variables that scaling laws relate. Scaling laws are the empirical power-law relationships between model size, training data, training compute, and resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
Is Training Compute considered advanced?
Training Compute is generally considered advanced-level material in the AI and LLM space.