Comparison

Data Contamination vs Hallucination

Data Contamination and Hallucination are both common AI/LLM terms, but they describe different problems: one affects how a model is measured, the other affects what a model says. Here is a quick side-by-side.

When you would reach for Data Contamination

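Data Contamination comes up when the question is fundamentally about evaluation: does a benchmark score reflect real capability, or did the model see the test items during training?

Example: MMLU questions appearing verbatim in a pretraining data crawl.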
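One common way to look for this kind of leak is to measure word-level n-gram overlap between a benchmark item and a corpus document. The sketch below is a minimal illustration of that idea, not a production decontamination pipeline. The function names, the default n of 8, and the sample question are all assumptions made for this example; the question is invented and is not taken from MMLU.

```python
import re


def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item, corpus_document, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    The default n=8 is an arbitrary choice for illustration, not a recommended setting.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    doc_grams = ngrams(corpus_document, n)
    return len(item_grams & doc_grams) / len(item_grams)


# Invented question used only for illustration (not a real benchmark item).
question = (
    "Which of the following best describes the primary function "
    "of the mitochondria in a eukaryotic cell?"
)
leaked_page = "Forum post: " + question + " Answer: B"
clean_page = "The mitochondria is an organelle."

print(overlap_ratio(question, leaked_page))
print(overlap_ratio(question, clean_page))
```

A ratio near 1 suggests the item appears almost verbatim in the document, and a ratio near 0 suggests it does not. Exact n-gram matching misses paraphrased or translated leaks, so a low ratio is not proof that a benchmark is clean.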

When you would reach for Hallucination

Hallucination comes up when the question is fundamentally about the factual reliability of a model's output: is what the model just stated actually true?

Example: citing a paper that does not exist.
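A simple guard against the fabricated-citation case is to compare each title the model cites against a trusted list of known titles and flag anything that does not match. The sketch below assumes such a list is available. The function names and the `known_titles` list are hypothetical, and the second cited title is deliberately invented for the example.

```python
import re


def normalize(title):
    """Lowercase a title and collapse punctuation and spacing so trivial variants match."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def unverified_citations(cited_titles, known_titles):
    """Return the cited titles that are not found in the trusted list of known titles."""
    known = {normalize(t) for t in known_titles}
    return [t for t in cited_titles if normalize(t) not in known]


# Hypothetical trusted index; in practice this might come from a bibliography database.
known_titles = ["Attention Is All You Need"]

# Titles extracted from a model's answer; the second one is made up for this example.
cited_titles = [
    "attention is all you need.",
    "Quantum Gradient Descent for Sandwich Optimization",
]

print(unverified_citations(cited_titles, known_titles))
```

A title missing from the list is not proof of a hallucination, since the list may simply be incomplete. The check only marks citations that need human review.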

Frequently asked

What is the difference between Data Contamination and Hallucination?

Data Contamination: Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.

Hallucination: A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.

When should I use Data Contamination vs Hallucination?

Data Contamination is the right concept when you are focused on evaluation, that is, on whether a benchmark score can be trusted. Hallucination applies when you are focused on whether a specific model output is factually correct.

Are Data Contamination and Hallucination the same thing?

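No. Data Contamination is a problem with how a model is evaluated: leaked benchmark items inflate its score. Hallucination is a problem with what a model outputs: a confident statement that is factually wrong. They are related but address different parts of the AI stack.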