ModelTerms

Evaluation · intermediate

Answer Relevance (Q&A relevance)

Answer relevance measures whether the response actually answers the question asked, independent of whether it is true. It is the complement to faithfulness in RAG evaluation.

Explanation

An answer can be faithful (every claim supported by the context) but irrelevant (it answered a different question). Conversely, it can be relevant but unfaithful (it addresses the question, but with facts the context does not support). Neither metric is enough on its own; together they pin down the quality of a RAG answer.

Standard scoring uses an LLM-as-judge: the judge reads the (question, answer) pair, is asked "does this answer the question?", and returns a score on a 1-5 scale or a pass/fail verdict. Some implementations instead generate candidate questions from the answer and check how closely they match the original question.
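Both scoring styles can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular library: the prompt wording, the `judge` callable (any function that sends a prompt to a model and returns its text reply), and the bag-of-words similarity are all assumptions made for the example. A real setup would call an actual judge model and use an embedding model instead of word counts.

```python
import math
import re
from collections import Counter
from typing import Callable, Sequence

# Hypothetical prompt wording; tune it for your own judge model.
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Answer: {answer}
Does the answer address the question that was asked?
Ignore whether the answer is factually correct.
Reply with a single integer from 1 (does not address it) to 5 (fully addresses it)."""


def judge_relevance(question: str, answer: str, judge: Callable[[str], str]) -> int:
    """Ask a judge model for a 1-5 relevance score and parse its reply."""
    reply = judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        raise ValueError(f"judge reply has no 1-5 score: {reply!r}")
    return int(match.group())


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def reverse_question_relevance(question: str, generated: Sequence[str]) -> float:
    """Mean similarity between the original question and questions
    regenerated from the answer (generation itself is left to the caller)."""
    if not generated:
        return 0.0
    q = bag_of_words(question)
    return sum(cosine(q, bag_of_words(g)) for g in generated) / len(generated)


if __name__ == "__main__":
    # A fake judge that always replies "Score: 2", just to exercise the parser.
    print(judge_relevance("what's the cancellation policy?", "Refunds take 5 days.",
                          lambda prompt: "Score: 2"))
    # Questions regenerated from a refund-policy answer drift away from the original.
    print(reverse_question_relevance("what's the cancellation policy?",
                                     ["what is the refund policy?"]))
```

Passing the judge in as a callable keeps the sketch independent of any model provider and makes the parsing logic easy to test with a stub.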

Faithfulness, answer relevance, and retrieval relevance (were the retrieved chunks on-topic?) are together often called the RAG triad. Ragas and Phoenix both provide evaluators along these lines.
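One way to use the three scores together is to map each failing score to the stage of the pipeline it implicates. The sketch below assumes each metric has already been normalized to the range 0 to 1; the `TriadScores` type, the field names, and the 0.5 threshold are hypothetical choices for illustration, not part of Ragas or Phoenix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriadScores:
    """Scores in [0, 1] for one RAG trace (field names are illustrative)."""
    context_relevance: float  # were the retrieved chunks on-topic?
    faithfulness: float       # is the answer supported by those chunks?
    answer_relevance: float   # does the answer address the question?


def diagnose(scores: TriadScores, threshold: float = 0.5) -> list[str]:
    """Return the failing checks, ordered from retrieval to generation.

    The threshold is an assumed cutoff; pick one that matches your judge's scale.
    """
    problems = []
    if scores.context_relevance < threshold:
        problems.append("retrieval: chunks are off-topic")
    if scores.faithfulness < threshold:
        problems.append("generation: claims not supported by context")
    if scores.answer_relevance < threshold:
        problems.append("generation: answer does not address the question")
    return problems


if __name__ == "__main__":
    # Faithful answer to the wrong question: only answer relevance fails.
    print(diagnose(TriadScores(context_relevance=0.9, faithfulness=0.9, answer_relevance=0.2)))
```

Keeping the three scores separate, rather than averaging them, preserves the information about which stage to fix.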

Examples

  • A user asks "what's the cancellation policy?" and the model returns the refund policy. If that text came from the retrieved context, the answer is faithful but scores low on answer relevance.
  • Phoenix's Q&A relevance evaluator running on each trace alongside a faithfulness check.

Frequently asked

How is Answer Relevance related to Faithfulness?

Answer Relevance and Faithfulness are complementary RAG evaluation metrics. Faithfulness measures whether an LLM's answer is supported by the retrieved context; answer relevance measures whether the answer addresses the question that was asked. An answer can pass one check and fail the other, so the two are usually reported together.

Faithfulness · Evaluation

Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. It is often treated as the most important RAG quality metric.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Reference-Free Evaluation · Evaluation

Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.
