ModelTerms

Evaluation · beginner

Hallucination

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.

Explanation

LLMs predict likely tokens. "Likely" does not mean "true." When asked about a topic the model did not see much of in training (rare authors, niche software versions, recent events past the knowledge cutoff), it tends to interpolate something fluent and wrong rather than admit it does not know.

To reduce hallucinations: lower the temperature, retrieve real sources (RAG), allow the model to say "I do not know," constrain outputs to specific schemas, and use grounded evals to catch regressions. Hallucinations cannot be fully eliminated with current architectures, so the safer approach is to treat LLM output as a draft to verify.
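As a rough illustration of treating output as a draft to verify, the sketch below checks each claim in an answer against retrieved source text and falls back to "I do not know" when any claim is unsupported. The function names (unsupported_claims, answer_or_abstain) are hypothetical, and the check is a deliberately naive case-insensitive substring match; a real grounded eval would likely use an entailment model or an LLM judge instead.

```python
def _normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences do not matter.
    return " ".join(text.lower().split())


def unsupported_claims(claims, sources):
    """Return the claims that do not appear in any source (naive substring check)."""
    normalized_sources = [_normalize(s) for s in sources]
    missing = []
    for claim in claims:
        needle = _normalize(claim)
        if not any(needle in src for src in normalized_sources):
            missing.append(claim)
    return missing


def answer_or_abstain(claims, sources):
    """Return the joined claims only if every one is supported; otherwise abstain."""
    if unsupported_claims(claims, sources):
        return "I do not know"
    return " ".join(claims)


sources = ["Temperature is a generation parameter that controls randomness."]
print(answer_or_abstain(["temperature is a generation parameter"], sources))
print(answer_or_abstain(["Temperature was invented in 1905"], sources))
```

The first call returns the claim unchanged because it is found in the source; the second abstains because nothing in the source supports it.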

Examples

  • Citing a paper that does not exist.
  • Inventing a function name for a library that does not have one.
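The second example is one of the easier hallucinations to catch mechanically. A minimal sketch, assuming the library is installed locally: import the module and confirm the suggested name actually resolves before trusting it. The helper name symbol_exists is hypothetical. Note that importing a module runs its top-level code, so only do this for packages you already trust.

```python
import importlib


def symbol_exists(module_name, attr_path):
    """Return True if attr_path (e.g. "loads" or "JSONDecoder.decode") resolves in the module."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        # The module itself does not exist or cannot be imported.
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


print(symbol_exists("json", "loads"))  # True: json.loads is a real function
print(symbol_exists("json", "parse"))  # False: Python's json module has no parse()
```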

Frequently asked

How is Hallucination related to Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is one of the main techniques for reducing hallucinations. RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, so the LLM can answer from up-to-date, source-cited, private information rather than from what it absorbed in training, and without retraining.

Is hallucination a beginner-level topic?

Yes. Hallucination is generally considered beginner-level material in the AI and LLM space.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.

Temperature · Inference

Temperature is a generation parameter that controls randomness. 0 is deterministic (always pick the most likely token); higher values produce more diverse, surprising output.
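To make the effect concrete, here is a small sketch of temperature-scaled softmax over a list of logits. It is a common textbook formulation, not any particular provider's implementation; the function name and the sample logits are illustrative assumptions. Temperature 0 is treated as greedy decoding (all probability on the most likely token), matching the description above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpening (T < 1) or flattening (T > 1) them."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: put all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass moves from the top token toward the less likely ones, which is why higher settings produce more diverse output.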

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. Many production teams rely on it to evaluate quality at scale when human review is too slow.

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Faithfulness · Evaluation

Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. It is often treated as the most important RAG quality metric.

Drift Detection · Infrastructure

Drift detection watches for changes in the statistical distribution of inputs, outputs, or quality scores over time — so you can catch a model degrading in production before users complain.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
