ModelTerms

Evaluation · intermediate

Faithfulness (groundedness)

Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. It is often treated as the most important RAG quality metric.

Explanation

A faithful answer introduces no information beyond what the retrieved chunks state or directly imply. An unfaithful answer hallucinates: it confidently asserts facts that the source does not contain.

Standard scoring: an LLM-as-judge reads the pair (retrieved context, generated answer), decomposes the answer into atomic claims, and checks whether each claim is entailed by the context. Pass = all claims entailed; partial-fail = only some claims entailed; fail = at least one major claim contradicts the context or is fabricated.
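
The claim-by-claim scoring described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not any library's implementation: `split_claims`, `score_faithfulness`, and `substring_judge` are hypothetical names, sentence splitting stands in for LLM claim decomposition, and the substring check stands in for an LLM entailment judge.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical judge signature: (context, claim) -> True if the context entails the claim.
Judge = Callable[[str, str], bool]


@dataclass
class FaithfulnessResult:
    supported: List[str]
    unsupported: List[str]

    @property
    def score(self) -> float:
        """Fraction of claims the judge found supported by the context."""
        total = len(self.supported) + len(self.unsupported)
        return len(self.supported) / total if total else 1.0

    @property
    def passed(self) -> bool:
        """Pass only when every claim is supported."""
        return not self.unsupported


def split_claims(answer: str) -> List[str]:
    """Naive stand-in for claim decomposition: one claim per sentence."""
    return [s.strip() for s in answer.split(".") if s.strip()]


def score_faithfulness(context: str, answer: str, judge: Judge) -> FaithfulnessResult:
    """Check each claim in the answer against the context using the given judge."""
    supported: List[str] = []
    unsupported: List[str] = []
    for claim in split_claims(answer):
        (supported if judge(context, claim) else unsupported).append(claim)
    return FaithfulnessResult(supported, unsupported)


def substring_judge(context: str, claim: str) -> bool:
    """Toy judge for demonstration only; a real judge would call an LLM."""
    return claim.lower() in context.lower()


context = "California enacted the policy in 2023. It applies to all employers."
answer = "California enacted the policy in 2024. It applies to all employers."
result = score_faithfulness(context, answer, substring_judge)
print(result.passed, result.score, result.unsupported)
```

In this toy run the 2024 claim lands in `unsupported` because the context says 2023, so the answer does not pass. Swapping `substring_judge` for a call to a real LLM judge leaves the rest of the sketch unchanged.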

Phoenix, Ragas, and Anthropic's eval recipes all include faithfulness-style graders. Most RAG quality dashboards lead with faithfulness as the headline metric: it tends to track user trust, and unsupported claims are among the easiest hallucinations to catch in CI.

Examples

  • Faithfulness eval flags an answer that cited "California enacted X in 2024" when the retrieved policy said 2023; the trace surfaces the original failure.
  • Phoenix's built-in faithfulness evaluator running on 5% of production traces.
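
Running an evaluator on a fixed share of production traffic, as in the second example, needs some rule for choosing which traces to score. One common approach is deterministic hash-based sampling, sketched below. This is an assumption for illustration, not a description of how Phoenix selects traces; `in_sample` and the `trace-N` IDs are hypothetical.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically decide whether a trace falls in the evaluation sample."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare the first 8 bytes of the hash, as an integer, to a threshold
    # that covers `rate` of the 64-bit range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


# Hypothetical trace IDs; in production these would come from the tracing system.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if in_sample(t)]
print(f"{len(sampled)} of {len(trace_ids)} traces selected for faithfulness eval")
```

Hashing the trace ID rather than drawing a random number means the same trace is always in or out of the sample, which keeps reruns and backfills consistent.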

When to use faithfulness

Always for RAG — faithfulness is the single most actionable production metric.

Frequently asked

How is Faithfulness related to Retrieval-Augmented Generation?

Faithfulness is an evaluation metric applied to the output of a Retrieval-Augmented Generation (RAG) system. RAG retrieves relevant documents from a corpus at query time and includes them in the prompt; faithfulness checks that the generated answer stays within what those documents support.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.

Hallucination · Evaluation

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Reference-Free Evaluation · Evaluation

Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

Answer Relevance · Evaluation

Answer relevance measures whether the response actually answers the question asked, independent of whether it is true. It is the complement to faithfulness in RAG evaluation.

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
