ModelTerms

Evaluation · intermediate

LLM-as-Judge (LLM-as-a-judge)

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. Many production teams rely on it to evaluate quality at scale when human review is too slow or too costly.

Explanation

Show a strong model such as GPT-4 or Claude two candidate responses and ask, "Which one better answers the user's question?" Repeat over thousands of examples and you get win rates, A/B test results, and regression detection without the per-evaluation cost of human raters.
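To make the aggregation step concrete, here is a minimal sketch of turning pairwise verdicts into a win rate. The `Judge` signature, the `win_rate` helper, and `toy_judge` are all hypothetical names introduced for illustration; in practice `toy_judge` would be replaced by a call to whichever LLM you use as the judge.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Assumed judge signature: (question, response_a, response_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def win_rate(examples: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Fraction of decided (non-tie) comparisons that candidate A wins."""
    counts = Counter(judge(q, a, b) for q, a, b in examples)
    decided = counts["A"] + counts["B"]
    return counts["A"] / decided if decided else 0.0


def toy_judge(question: str, a: str, b: str) -> str:
    # Stand-in for a real LLM call: prefers the response that shares
    # more words with the question. Not a serious quality measure.
    keywords = set(question.lower().split())
    score_a = len(keywords & set(a.lower().split()))
    score_b = len(keywords & set(b.lower().split()))
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


examples = [
    ("how do I reset my password",
     "Open settings and reset your password there",
     "Please contact sales"),
    ("what is the refund window",
     "We sell many products",
     "The refund window is 30 days"),
    ("where is my order",
     "Your order shipped today",
     "Your order arrived yesterday"),
]

print(win_rate(examples, toy_judge))  # 0.5: one win for A, one for B, one tie
```

Ties are excluded from the denominator here; counting a tie as half a win is an equally reasonable convention, as long as it is applied consistently across runs.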

Known biases: judges tend to prefer longer responses, responses from their own model family, responses that match their own style, and responses shown in a particular position (often the first). Pairwise comparison with randomized positions helps with the last of these. The conservative approach is to treat judge scores as one of several signals, alongside benchmarks and human spot checks.
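A minimal sketch of position randomization follows, along with a stricter variant that judges both orders and keeps a verdict only when the two agree. The text above mentions only randomization; the both-orders check is an addition for illustration. All function names and the two toy judges are hypothetical.

```python
import random
from typing import Callable

# Assumed judge signature: (question, response_a, response_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]

# Maps a verdict given on swapped inputs back to the original labelling.
UNSWAP = {"A": "B", "B": "A", "tie": "tie"}


def judge_random_order(question: str, first: str, second: str,
                       judge: Judge, rng: random.Random) -> str:
    """Show the pair in a random order; report the verdict relative to `first`/`second`."""
    if rng.random() < 0.5:
        return judge(question, first, second)
    return UNSWAP[judge(question, second, first)]


def judge_both_orders(question: str, first: str, second: str, judge: Judge) -> str:
    """Judge twice with positions swapped; call it a tie if the verdicts disagree."""
    forward = judge(question, first, second)
    backward = UNSWAP[judge(question, second, first)]
    return forward if forward == backward else "tie"


def always_first(question: str, a: str, b: str) -> str:
    # Toy judge with extreme position bias: always picks slot A.
    return "A"


def prefers_longer(question: str, a: str, b: str) -> str:
    # Toy judge with length bias but no position bias.
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


print(judge_both_orders("q", "x", "y", always_first))                 # tie
print(judge_both_orders("q", "a long answer", "short", prefers_longer))  # A
```

Randomization spreads position bias evenly across both candidates, so it averages out over many examples. Judging both orders doubles the judge calls but exposes the bias per example. Neither helps with length bias, as the `prefers_longer` case shows.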

Examples

  • MT-Bench: GPT-4 grades model answers to 80 multi-turn questions.
  • Internal eval: "did this support reply resolve the ticket?" judged by Claude.
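For a single-answer check like the support-ticket example, most of the work is in the prompt and in parsing the judge's reply. The sketch below assumes a made-up prompt wording and a made-up "VERDICT:" output convention. It does not reflect any specific team's setup, and the model call itself is left out.

```python
import re
from typing import Optional

# Hypothetical grading prompt; the wording and the VERDICT convention are assumptions.
PROMPT_TEMPLATE = """You are grading a customer support reply.

Ticket:
{ticket}

Reply:
{reply}

Did the reply resolve the ticket? Explain briefly, then finish with a line
of the form "VERDICT: yes" or "VERDICT: no"."""


def build_prompt(ticket: str, reply: str) -> str:
    """Fill the grading template with one ticket/reply pair."""
    return PROMPT_TEMPLATE.format(ticket=ticket, reply=reply)


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True/False for a clear verdict, or None if the judge went off-format."""
    matches = re.findall(r"VERDICT:\s*(yes|no)\b", judge_output, flags=re.IGNORECASE)
    if not matches:
        return None
    # Use the last verdict in case the judge restates the instructions first.
    return matches[-1].lower() == "yes"


print(parse_verdict("The reply included the reset link.\nVERDICT: yes"))  # True
print(parse_verdict("I am not sure."))                                    # None
```

Returning `None` for unparseable output, rather than defaulting to "no", keeps formatting failures visible so they can be counted separately from real negative verdicts.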

When to use LLM-as-judge

When you need to evaluate thousands of open-ended outputs cheaply and quickly.

Frequently asked

How is LLM-as-Judge related to Benchmark?

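LLM-as-Judge and Benchmark are both evaluation concepts. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing; MMLU, HumanEval, and HELM are common examples. LLM-as-judge is a scoring method, and a benchmark can use it to grade open-ended answers, as MT-Bench does with GPT-4.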

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Hallucination · Evaluation

A hallucination is a confidently stated, plausible-sounding LLM output that is factually wrong. It is a failure mode that often surprises non-expert users.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.

Prompt Engineering · Prompting

Prompt engineering is the craft of writing prompts that reliably produce the behavior you want from an LLM. It blends formatting, examples, tone, and constraints.

Faithfulness · Evaluation

Faithfulness measures whether an LLM's answer is supported by the retrieved context: every claim either appears in the source material or follows directly from it. It is often treated as the most important RAG quality metric.

Answer Relevance · Evaluation

Answer relevance measures whether the response actually answers the question asked, independent of whether it is true. It complements faithfulness in RAG evaluation.

Pairwise Comparison · Evaluation

Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. The verdicts aggregate to a win rate, which makes it a widely used method for comparing model or prompt versions.

Online Evaluation · Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
