Evaluation · intermediate
LLM-as-Judge (LLM-as-a-judge)
LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is a common way for production teams to evaluate quality at scale when human review is too slow or too expensive.
Explanation
Show GPT-4 (or Claude) two candidate responses and ask "which one better answers the user's question?" Repeat over thousands of examples and you have win-rates, A/B test results, and regression detection without the per-eval cost of human raters.
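Turning a pile of pairwise verdicts into a win rate is simple arithmetic. The sketch below assumes each comparison has already been reduced to one of the labels "A", "B", or "tie" (hypothetical labels chosen for illustration), and it counts a tie as half a win, which is one common convention rather than the only option.

```python
from collections import Counter


def win_rate(verdicts, candidate="A"):
    """Fraction of comparisons won by `candidate`.

    Assumption: a "tie" counts as half a win for each side.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts[candidate] + 0.5 * counts["tie"]) / len(verdicts)


# Hypothetical judge verdicts for five prompts.
verdicts = ["A", "B", "A", "tie", "A"]
print(f"A win rate: {win_rate(verdicts):.2f}")
print(f"B win rate: {win_rate(verdicts, candidate='B'):.2f}")
```

Tracking this number across model versions is what makes regression detection possible: a drop in win rate against a fixed baseline flags a candidate for closer review.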
Known biases: judges tend to prefer longer responses, responses from their own model family, and responses that match their own style. In pairwise setups they can also favor a response because of the position it appears in. Pairwise comparison with randomized response order helps with that position bias. Treating judge scores as one of several signals (alongside benchmarks and human spot checks) is the conservative approach.
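Position randomization can be sketched as follows. This is a minimal illustration, not a reference implementation: `call_judge` is a hypothetical callable that takes a prompt string and returns "1", "2", or "tie", standing in for whatever model API you use, and the prompt wording is an assumption.

```python
import random


def build_prompt(question, first, second):
    # Illustrative prompt wording; adjust to your own rubric.
    return (
        "Which response better answers the user's question? "
        "Reply with 1, 2, or tie.\n\n"
        f"Question: {question}\n\n"
        f"Response 1:\n{first}\n\n"
        f"Response 2:\n{second}"
    )


def judge_pair(question, resp_a, resp_b, call_judge, rng=random):
    """Compare two responses, showing them to the judge in random order.

    `call_judge` is a hypothetical callable: prompt -> "1", "2", or "tie".
    Returns "A", "B", or "tie" in terms of the original labels.
    """
    swapped = rng.random() < 0.5
    first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
    raw = call_judge(build_prompt(question, first, second)).strip().lower()
    if raw == "tie":
        return "tie"
    if raw not in ("1", "2"):
        raise ValueError(f"unexpected judge output: {raw!r}")
    picked_first = raw == "1"
    # Map the positional answer back to the original labels.
    if picked_first:
        return "B" if swapped else "A"
    return "A" if swapped else "B"


# A maximally position-biased stub judge: it always picks slot 1.
rng = random.Random(0)
results = [
    judge_pair("q", "resp A", "resp B", lambda prompt: "1", rng)
    for _ in range(1000)
]
print("share of wins for A:", results.count("A") / len(results))
```

With the always-first stub, A wins roughly half the time, so the position preference averages out instead of systematically favoring one candidate. Randomization does not address length or style bias; those need separate controls.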
Examples
- MT-Bench: GPT-4 scoring 80 multi-turn questions.
- Internal eval: "did this support reply resolve the ticket?" judged by Claude.
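The internal-eval example above is a pointwise (single-response) judgment rather than a comparison. A minimal sketch of that pattern, assuming a hypothetical prompt template and a "VERDICT: YES/NO" output convention invented here for illustration, looks like this. The actual model call is left out; only prompt construction and verdict parsing are shown.

```python
import re

# Hypothetical template; the wording and verdict format are assumptions.
JUDGE_TEMPLATE = """You are grading a customer support reply.
Ticket: {ticket}
Reply: {reply}
Did the reply resolve the ticket? Explain briefly, then answer on the
last line as VERDICT: YES or VERDICT: NO."""


def build_judge_prompt(ticket, reply):
    return JUDGE_TEMPLATE.format(ticket=ticket, reply=reply)


def parse_verdict(judge_output):
    """Return True for YES, False for NO, or None if no verdict is found."""
    matches = re.findall(
        r"VERDICT:\s*(YES|NO)\b", judge_output, flags=re.IGNORECASE
    )
    if not matches:
        return None
    # Use the last match so reasoning text earlier in the output is ignored.
    return matches[-1].upper() == "YES"


prompt = build_judge_prompt(
    ticket="I can't reset my password.",
    reply="Use the 'Forgot password' link on the login page.",
)
# Example of what a judge might return (made-up text, not real model output).
sample_output = "The reply gives a concrete next step.\nVERDICT: YES"
print(parse_verdict(sample_output))
```

Returning `None` for unparseable output, instead of guessing, lets you count and inspect malformed judge responses separately from real verdicts.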
When to use LLM-as-Judge
When you need to evaluate thousands of open-ended outputs cheaply and quickly.
Frequently asked
What is LLM-as-Judge?
LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is a common way for production teams to evaluate quality at scale when human review is too slow or too expensive.
What is an example of LLM-as-Judge?
MT-Bench: GPT-4 scoring 80 multi-turn questions.
How is LLM-as-Judge related to Benchmark?
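Both are evaluation methods. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples. LLM-as-judge is a scoring method for open-ended outputs, and the two can be combined: MT-Bench is a benchmark whose answers are scored by GPT-4 as the judge.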
When should I use LLM-as-Judge?
When you need to evaluate thousands of open-ended outputs cheaply and quickly.
Is LLM-as-Judge considered intermediate?
LLM-as-Judge is generally considered intermediate-level material in the AI and LLM space.