ModelTerms

Training · advanced

Training Compute (training FLOPs)

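Training compute is the total number of floating-point operations (FLOPs) used to pretrain a model, usually written in scientific notation (e.g. 10^25 FLOPs). Here FLOPs counts operations, not operations per second (FLOP/s). It is the quantity that AI regulations have used to set thresholds.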

Explanation

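Modern frontier training runs sit around 10^25-10^26 FLOPs. Regulators have set thresholds in this range, but not the same one: the 2023 Biden executive order on AI (since rescinded) attached reporting requirements to models trained with more than 10^26 operations, while the EU AI Act presumes that a general-purpose model poses systemic risk above 10^25 FLOPs.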

For a dense transformer, training compute is roughly 6 × parameters × tokens: about 2 FLOPs per parameter per token for the forward pass and about 4 for the backward pass. So a 70B-parameter model trained on 15T tokens uses ~6.3 × 10^24 FLOPs. Chinchilla scaling laws estimate how to split a fixed compute budget between parameters and tokens.
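The arithmetic in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, and the default of 20 tokens per parameter is an assumed rule of thumb often associated with Chinchilla, not a fixed constant.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Approximate dense-transformer training compute as 6 * N * D."""
    return 6.0 * params * tokens


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens), assuming D = r * N.

    Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    The ratio r (default 20) is an assumption, not a measured constant.
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# The example from the text: 70B parameters trained on 15T tokens.
budget = training_flops(70e9, 15e12)
print(f"{budget:.2e} FLOPs")  # 6.30e+24 FLOPs

# How the same budget would be split under the assumed 20:1 ratio.
n_opt, d_opt = compute_optimal_split(budget)
print(f"{n_opt:.2e} parameters, {d_opt:.2e} tokens")
```

Under the assumed 20:1 ratio, the same 6.3 × 10^24 FLOPs budget corresponds to roughly 2.3 × 10^11 parameters and 4.6 × 10^12 tokens, so the 70B / 15T example uses far more tokens per parameter than that rule of thumb suggests.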

Training compute is a key input to scaling laws, and compute thresholds are likely to remain a main regulatory lever as models grow.

Examples

  • GPT-3: ~3 × 10^23 FLOPs.
  • Llama 3 70B: ~6 × 10^24 FLOPs.
  • GPT-4: estimated ~2 × 10^25 FLOPs.
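To put these estimates side by side, the sketch below measures how far each one sits from the 10^25 and 10^26 levels quoted in the Explanation. It assumes the approximate figures listed above; the dictionary and function names are hypothetical and chosen for illustration only.

```python
import math

# Approximate estimates copied from the list above.
ESTIMATES = {
    "GPT-3": 3e23,
    "Llama 3 70B": 6e24,
    "GPT-4": 2e25,
}

# Illustrative reference levels, taken from the range quoted in the text.
REFERENCE_LEVELS = (1e25, 1e26)


def exceeds(flops: float, level: float) -> bool:
    """Return True if an estimate meets or exceeds a reference level."""
    return flops >= level


def orders_of_magnitude_gap(flops: float, level: float) -> float:
    """Signed log10 gap; negative means the estimate is below the level."""
    return math.log10(flops / level)


for name, flops in ESTIMATES.items():
    for level in REFERENCE_LEVELS:
        gap = orders_of_magnitude_gap(flops, level)
        status = "at or above" if exceeds(flops, level) else "below"
        print(f"{name}: {flops:.0e} FLOPs is {status} {level:.0e} ({gap:+.2f} orders)")
```

Because every input is an estimate, the gaps are only meaningful to about one decimal place.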

Frequently asked

How is Training Compute related to Scaling Laws?

Training compute is one of the variables in scaling laws, the empirical power-law relationships between model size, training data, training compute, and resulting loss. Given a compute budget, scaling laws are used to forecast the loss a model can reach, which is why the budget is usually fixed first and the parameter and token counts are chosen to fit it.

Is Training Compute considered advanced?

Training Compute is generally considered advanced-level material in the AI and LLM space.

Scaling Laws · Training

Scaling laws are the empirical power-law relationship between model size, training data, training compute, and resulting loss. They predict that bigger, more data-fed models keep improving in a smooth, forecastable way.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

Parameter Count · Architecture

Parameter count is the total number of learnable weights in a model — "7B" means 7 billion parameters. It is the most cited model-size metric, though not always the most informative.

GPU · Infrastructure

GPUs are the parallel processors that train and run nearly every modern AI model. Their throughput on matrix multiplication is what makes deep learning practical.
