ModelTerms

Evaluation · intermediate

Data Contamination (benchmark contamination, test-set contamination)

Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.

Explanation

Frontier models are pretrained on large crawls of the public web. Once a benchmark gets popular, its questions tend to show up in blog posts, GitHub repos, and Stack Overflow, and from there in the next pretraining mix.

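Contamination is detected via n-gram overlap checks between benchmark items and training data, paraphrase searches, or comparison against held-out validation sets. Mitigations include regularly publishing new benchmarks (LiveCodeBench, ARC-AGI-2), embedding canary strings in benchmark files so that ingestion can be detected, and measuring on private held-out variants.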
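The n-gram overlap and canary-string checks can be sketched in a few lines. This is a minimal illustration, not a production detector: the function names, the window size n=8, and the canary value are all assumptions made for the example, and real pipelines run such checks against indexed corpora rather than single documents.

```python
import re


def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    n=8 is an illustrative choice, not a standard value.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


def contains_canary(corpus_doc, canary):
    """True if the benchmark's canary string appears verbatim in the document."""
    return canary in corpus_doc


# Hypothetical data, for illustration only.
CANARY = "BENCHMARK-CANARY-0000"
item = "What is the capital of France and which river runs through it?"
leaked = "Saw this on a quiz: What is the capital of France, and which river runs through it?"
clean = "The weather in the mountains was cold and clear all week long."

print(overlap_fraction(item, leaked))
print(overlap_fraction(item, clean))
print(contains_canary("dataset header " + CANARY, CANARY))
```

A high overlap fraction flags a document for review. Exact n-gram matching misses paraphrased leaks, which is why paraphrase searches are listed as a separate check.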

Contamination also complicates the picture for MMLU-style benchmarks, which have largely "saturated": measured improvement may partly reflect more contamination rather than more capability.

Examples

  • MMLU questions appearing verbatim in pretraining data crawls.
  • LiveCodeBench continually adding newly published problems to stay ahead of contamination.
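The idea behind a refreshed benchmark can be sketched as a date filter: score a model only on problems released after its training cutoff, so those problems cannot have been in its pretraining data. This is a minimal sketch, assuming each problem carries a release date and the cutoff is known; `Problem` and `uncontaminated_subset` are hypothetical names, not part of any benchmark's API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the problem was first published


def uncontaminated_subset(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem set and cutoff, for illustration only.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 15)),
    Problem("p3", date(2024, 9, 30)),
]
cutoff = date(2024, 1, 1)

fresh = uncontaminated_subset(problems, cutoff)
print([p.problem_id for p in fresh])
```

The filter is only as reliable as the stated cutoff. A problem that reuses an older, widely discussed task can still be contaminated even if its release date is recent.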

Frequently asked

How is Data Contamination related to Benchmark?

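Data Contamination and Benchmark are both evaluation concepts. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples. Contamination undermines that comparison: a model that saw a benchmark's items during pretraining is no longer on equal footing with one that did not.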

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Pretraining · Training

Pretraining is the initial training phase where an LLM learns to predict the next token on trillions of tokens of general text. It produces a base model that can be adapted later.

MMLU · Evaluation

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.

HumanEval · Evaluation

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.

Hallucination · Evaluation

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.
