Evaluation · intermediate

Answer Relevance (Q&A relevance)

Answer relevance measures whether a response actually answers the question that was asked, independent of whether the response is true. It is the complement to faithfulness in RAG evaluation.

Published May 31, 2026

Explanation

An answer can be faithful (every claim supported by the context) but irrelevant (it answers a different question). It can also be relevant but unfaithful (it addresses the question, but with facts the context does not support). Scoring both metrics catches both failure modes.
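The two metrics are independent, so there are four possible outcomes. A minimal sketch of that 2x2, assuming each metric has already been reduced to a pass/fail verdict (the function name `diagnose` and the label strings are illustrative, not taken from any library):

```python
def diagnose(faithful: bool, relevant: bool) -> str:
    """Map two independent pass/fail verdicts to a failure-mode label.

    Hypothetical helper: the labels are illustrative wording only.
    """
    if faithful and relevant:
        return "good: grounded and on-topic"
    if faithful:
        return "faithful but irrelevant: answered a different question"
    if relevant:
        return "relevant but unfaithful: on-topic, but facts are unsupported"
    return "neither: off-topic and unsupported"


if __name__ == "__main__":
    # Refund policy returned for a cancellation-policy question.
    print(diagnose(faithful=True, relevant=False))
```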

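Standard scoring uses an LLM-as-judge: the judge reads the (question, answer) pair, is asked "does this answer the question?", and returns either a 1-5 score or a pass/fail verdict. Some implementations also work backwards: they generate the questions that the answer would plausibly be answering and check how closely those match the original question.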
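Both scoring styles can be sketched in a few lines. The code below is an illustration under stated assumptions, not any tool's actual implementation: `llm` and `generate_questions` are hypothetical callables you would back with a real model, the prompt wording is made up, and the bag-of-words cosine is a toy stand-in for the embedding similarity a real system would use.

```python
import math
import re
from collections import Counter
from typing import Callable, List

# Illustrative prompt wording (an assumption, not a standard template).
JUDGE_PROMPT = (
    "You are grading answer relevance, not truthfulness.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate from 1 (off-topic) to 5 (fully answers the question). "
    "Reply with a single digit."
)


def judge_relevance(question: str, answer: str, llm: Callable[[str], str]) -> int:
    """Direct scoring: ask the judge model for a 1-5 relevance score."""
    reply = llm(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"\b[1-5]\b", reply)  # first standalone digit in 1..5
    if match is None:
        raise ValueError(f"judge reply contains no 1-5 score: {reply!r}")
    return int(match.group())


def _bow_cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (stand-in for embeddings)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def reverse_question_relevance(
    question: str,
    answer: str,
    generate_questions: Callable[[str], List[str]],
) -> float:
    """Reverse scoring: mean similarity between the original question and
    the questions a model infers from the answer alone."""
    candidates = generate_questions(answer)
    if not candidates:
        return 0.0
    return sum(_bow_cosine(question, q) for q in candidates) / len(candidates)


if __name__ == "__main__":
    # Stubs in place of real model calls, for demonstration only.
    stub_judge = lambda prompt: "Score: 2"
    stub_generator = lambda answer: ["what is the refund policy?"]
    q = "what is the cancellation policy?"
    a = "Refunds are issued within 14 days of purchase."
    print(judge_relevance(q, a, stub_judge))
    print(reverse_question_relevance(q, a, stub_generator))
```

Injecting the model as a callable keeps the scoring logic testable without network access. The word-overlap similarity rates near-identical phrasings such as "refund policy" and "cancellation policy" as close, which is why a real pipeline would swap in a semantic similarity measure.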

Faithfulness, answer relevance, and retrieval relevance (were the retrieved chunks on-topic?) together form what is often called the RAG triad. Ragas and Phoenix both provide evaluators along these lines.
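One way to use the three scores together is to require all of them to clear a bar and to report the weakest one. The sketch below assumes each score is already normalised to the range 0 to 1; the class name `TriadScores`, its methods, and the 0.7 threshold are hypothetical choices for illustration, not the API of Ragas or Phoenix.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TriadScores:
    """Hypothetical container for the three per-response scores (0.0-1.0)."""

    faithfulness: float
    answer_relevance: float
    retrieval_relevance: float

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        scores = asdict(self)
        return min(scores, key=scores.get)

    def passes(self, threshold: float = 0.7) -> bool:
        """True only if every dimension clears the (assumed) threshold."""
        return all(value >= threshold for value in asdict(self).values())


if __name__ == "__main__":
    # Grounded answer built from on-topic chunks, but to the wrong question.
    scores = TriadScores(
        faithfulness=0.95, answer_relevance=0.30, retrieval_relevance=0.90
    )
    print(scores.weakest(), scores.passes())
```

Requiring every dimension to pass, rather than averaging, keeps a high faithfulness score from masking an off-topic answer.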

Examples

A user asks "what's the cancellation policy?" and the model returns the refund policy: faithful but low answer-relevance.
A Phoenix Q&A relevance evaluator runs on each trace alongside a faithfulness evaluator.

Frequently asked

What is Answer Relevance?

Answer relevance measures whether a response actually answers the question that was asked, independent of whether the response is true. It is the complement to faithfulness in RAG evaluation.

What is an example of answer relevance?

A user asks "what's the cancellation policy?" and the model returns the refund policy: faithful but low answer-relevance.

How is Answer Relevance related to Faithfulness?

Answer Relevance and Faithfulness are both evaluation concepts, and they check different things. Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. Answer relevance measures whether the answer addresses the question at all, so an answer can pass one check and fail the other.

Is Answer Relevance considered intermediate?

Answer Relevance is generally considered intermediate-level material in the AI and LLM space.

Related terms

Faithfulness · Evaluation

Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. It is often treated as the most important RAG quality metric.

Retrieval-Augmented Generation · Agents & Tools

RAG retrieves relevant documents from a corpus at query time and includes them in the prompt, letting an LLM answer with up-to-date, source-cited, private information without retraining.

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

Reference-Free Evaluation · Evaluation

Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

Arize Phoenix · Infrastructure

Arize Phoenix is an open-source LLM observability and evaluation tool. It ingests OpenTelemetry traces, renders them in a debug UI, and provides built-in LLM-as-judge evaluators for hallucination, relevance, and toxicity.
