Evaluation · intermediate
Data Contamination (benchmark contamination, test-set contamination)
Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.
Explanation
Frontier models are pretrained on large crawls of the public web. Once a benchmark becomes popular, its questions tend to show up in blog posts, GitHub repos, and Stack Overflow, and from there they end up in the next pretraining mix.
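Contamination is typically detected with n-gram overlap checks between benchmark items and training data, paraphrase searches, or comparison against held-out variants. Mitigations include regularly publishing new benchmarks (LiveCodeBench, ARC-AGI-2), embedding canary strings in benchmark files to detect ingestion, and measuring on private held-out sets.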
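The overlap and canary checks described above can be sketched in a few lines. This is a minimal illustration, not any particular lab's pipeline: the function names, the word-level tokenization, and the default window of 8 words are all assumptions chosen for the example, and real checks usually run over indexed corpora rather than single documents.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the document.

    The window size n=8 is an assumed default for illustration only.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


def contains_canary(corpus_doc: str, canary: str) -> bool:
    """True if the benchmark's canary string appears verbatim in the document."""
    return canary in corpus_doc
```

A fraction near 1.0 suggests the item appears almost verbatim in the document. Exact n-gram matching misses paraphrased leaks, which is why paraphrase searches are listed as a separate check.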
Contamination is one reason to read gains on largely "saturated" MMLU-style benchmarks with caution: measured improvement may partly reflect more contamination rather than more capability.
Examples
- MMLU questions appearing verbatim in pretraining data crawls.
- LiveCodeBench continually adding newly published, date-stamped problems so that models can be scored on problems released after their training cutoff.
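One way to use a regularly refreshed benchmark is to score a model only on problems published after its training cutoff. The sketch below assumes each problem carries a release date. The `Problem` class, its field names, and the dates are hypothetical and do not reflect LiveCodeBench's actual data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem was first published


def uncontaminated_subset(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Illustrative data only; not real benchmark entries.
problems = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 2, 15)),
    Problem("p3", date(2024, 9, 30)),
]
fresh = uncontaminated_subset(problems, training_cutoff=date(2024, 1, 1))
print([p.problem_id for p in fresh])  # prints ['p2', 'p3']
```

The filter only helps if the stated training cutoff is accurate and no later data was added during fine-tuning, so it reduces the risk of contamination but does not rule it out.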
Frequently asked
What is Data Contamination?
Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.
What is an example of data contamination?
MMLU questions appearing verbatim in pretraining data crawls.
How is Data Contamination related to Benchmark?
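Data Contamination and Benchmark are both evaluation concepts. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples. Contamination undermines that comparison: when a benchmark's test items leak into training data, a model's score no longer measures how well it handles unseen problems.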
Is Data Contamination considered intermediate?
Data Contamination is generally considered intermediate-level material in the AI and LLM space.