Training · intermediate

Pretraining

Pretraining is the initial training phase in which an LLM learns to predict the next token across hundreds of billions to trillions of tokens of general text. It produces a base model that can be adapted later.

Published May 29, 2026

Explanation

Pretraining is by far the most expensive phase: weeks to months on thousands of GPUs, costing tens of millions of dollars for frontier models. The data is usually a curated mix of web text, books, code, and academic papers, deduplicated and quality-filtered.
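As a rough illustration of the deduplication and quality-filtering step, the sketch below drops exact duplicates (after case and whitespace normalization) and very short documents. The names `normalize` and `dedupe_and_filter`, the `min_words` threshold, and the word-count heuristic are assumptions made for this example, not a description of any particular lab's pipeline; real pipelines typically add near-duplicate detection and learned quality classifiers.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def dedupe_and_filter(docs, min_words=5):
    """Keep the first copy of each document and drop documents shorter than min_words.

    min_words is a hypothetical quality threshold chosen for illustration.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful training text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Click here",                                      # too short
    "Pretraining data mixes web text, books, code, and papers.",
]
print(dedupe_and_filter(corpus))  # keeps the first and the last document
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters at web scale.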

The objective is almost always next-token prediction; encoder models such as BERT use masked-token prediction instead. No human feedback is involved: the model simply learns the statistical structure of language from raw text at scale.
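The next-token objective can be sketched in a few lines of plain Python. At each position the model emits one score (logit) per vocabulary entry, and the loss is the average negative log-probability assigned to the token that actually comes next. The function names and the toy four-token vocabulary are assumptions for this example; real training code computes the same quantity over batches with a tensor library.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy where position i is scored against token i + 1."""
    targets = token_ids[1:]  # the sequence shifted left by one
    if len(logits_per_position) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# Toy case: vocabulary of 4 tokens and a model that is maximally unsure.
tokens = [0, 1, 2, 3]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(len(tokens) - 1)]
loss = next_token_loss(uniform_logits, tokens)
print(loss, math.log(4))  # a uniform guess over 4 tokens costs log(4)
```

A model that puts more probability on the correct next token drives this number down, and that is the only signal pretraining optimizes.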

The product of pretraining is the base model. It will autocomplete text fluently but is not yet tuned for following instructions or having helpful conversations — that comes in later stages.

Examples

GPT-3 was pretrained on ~300B tokens.
Llama 3 was pretrained on ~15T tokens.

Frequently asked

How is Pretraining related to Fine-tuning?

Pretraining comes first and produces the base model. Fine-tuning then continues training that model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

Instruction Tuning · Training

Instruction tuning is fine-tuning on examples of (instruction, desired response) pairs so a base model learns to follow natural-language directions.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Large Language Model · Foundations

A large language model is a neural network trained on huge amounts of text to predict the next token in a sequence. GPT-4, Claude, and Gemini are all LLMs.

Foundation Model · Foundations

A foundation model is a single large model pretrained on broad data that can be adapted to many downstream tasks. LLMs are the most common type.

Scaling Laws · Training

Scaling laws are empirical power-law relationships between model size, training data, training compute, and the resulting loss. They predict that larger models trained on more data keep improving in a smooth, forecastable way.
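To make "power law" concrete, here is a minimal sketch of loss as a function of parameter count, L(N) = (N_c / N)^alpha. The constants `n_c` and `alpha` below are placeholder values chosen for illustration, not fitted results, and the function name is hypothetical. What the example shows is the shape: every tenfold increase in parameters multiplies the loss by the same factor.

```python
def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss as a power law in parameter count: L(N) = (n_c / N) ** alpha.

    n_c and alpha are placeholder constants for illustration, not fitted values.
    """
    return (n_c / n_params) ** alpha

previous = None
for n in (1e8, 1e9, 1e10, 1e11):
    loss = power_law_loss(n)
    ratio = "" if previous is None else f" (x{loss / previous:.3f} vs previous)"
    print(f"{n:.0e} params -> loss {loss:.3f}{ratio}")
    previous = loss
```

On a log-log plot this relationship is a straight line, which is why loss at a larger scale can be extrapolated from smaller training runs.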

Training Compute · Training

Training compute is the total number of floating-point operations used to pretrain a model, usually written as FLOPs (e.g. 10^25 FLOPs). It is the headline number that regulators have begun to use as a threshold, for example the 10^25 FLOP threshold in the EU AI Act.
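A common back-of-the-envelope estimate for a dense transformer is that training compute is approximately 6 × N × D, where N is the parameter count and D is the number of training tokens. That rule of thumb is an assumption brought in for this example rather than something stated above, and the function name is hypothetical. Using GPT-3's ~300B training tokens and its 175B parameters (the latter figure from the GPT-3 paper):

```python
def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense transformer: about 6 * N * D.

    Roughly 2 FLOPs per parameter per token for the forward pass and
    4 for the backward pass. This is an approximation, not an exact count.
    """
    return 6.0 * n_params * n_tokens

gpt3_flops = estimate_training_flops(n_params=175e9, n_tokens=300e9)
print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")  # prints 3.15e+23
```

By this estimate GPT-3 sits well over an order of magnitude below 10^25 FLOPs.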

Synthetic Data · Training

Synthetic data is training data produced by a model rather than handwritten by humans, for example instructions distilled from GPT-4, code generated and filtered by tests, or reasoning traces sampled from a stronger model.

Sources

GPT-3 paper (arXiv)