Learning path · 28 min · intermediate

How modern teams actually evaluate LLMs

Beyond benchmarks: pairwise comparison, win rates, LLM-as-judge, and online and offline evaluation.

Saying "GPT-4 is better than Llama" used to mean "I asked them both stuff and GPT-4 felt smarter." Modern teams follow a real methodology: offline eval suites in CI, online eval over production traffic, and LLM-as-judge pairwise comparisons that aggregate into win rates. This path walks through that discipline.

Benchmark (also: eval)
Why this step: The standardized-test approach. Start here for the foundational frame.
A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
Evaluation · beginner
Ground Truth (also: gold label)
Why this step: The expensive ingredient that makes reference-based eval possible.
Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.
Evaluation · beginner
Reference-Based Evaluation (also: supervised eval)
Why this step: What you do when you have ground truth.
Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output matches the reference.
Evaluation · beginner
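As a rough illustration of the two simplest reference-based scorers, here is a minimal sketch. The helper names (`normalize`, `exact_match`, `similarity`) are hypothetical, and `difflib.SequenceMatcher` is used as a stand-in for a true edit-distance metric, so treat the numbers as approximate.

```python
import difflib
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(output: str, reference: str) -> bool:
    """True when the normalized output equals the normalized reference."""
    return normalize(output) == normalize(reference)


def similarity(output: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a proxy for edit distance, not the same metric."""
    matcher = difflib.SequenceMatcher(None, normalize(output), normalize(reference))
    return matcher.ratio()
```

Exact match suits short, closed-form answers; a similarity score is more forgiving but can reward answers that look alike and mean different things.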
Reference-Free Evaluation (also: unsupervised eval)
Why this step: What you do when you don't have ground truth, which is the modern production default.
Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"
Evaluation · intermediate
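Self-consistency is one reference-free signal that is easy to sketch: sample the same prompt several times and measure how often the answers agree. The function name `self_consistency` and the simple lowercase/strip normalization below are assumptions for illustration; real answers usually need task-specific normalization.

```python
from collections import Counter


def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    # Light normalization so trivial formatting differences do not count as disagreement.
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

A low agreement share does not prove the answer is wrong, but it is a cheap flag for outputs worth a closer look.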
LLM-as-Judge (also: LLM-as-a-judge)
Why this step: The dominant technique for scaling reference-free eval cheaply.
LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. Many production teams rely on it to evaluate quality at scale when human review is too slow.
Evaluation · intermediate
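A minimal sketch of rubric-style judging, assuming the judge model is wrapped in a hypothetical `call_judge` callable that takes a prompt string and returns the judge's reply. The rubric wording, the 1 to 5 scale, and the `Score: N` reply format are illustrative choices, not a standard.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; real rubrics are usually more specific to the product.
RUBRIC_PROMPT = """You are grading an answer for helpfulness and accuracy.
Question: {question}
Answer: {answer}
Reply with a single line of the form "Score: N", where N is an integer from 1 to 5."""


def judge_score(question: str, answer: str, call_judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge for a 1-5 score; return None if the reply cannot be parsed."""
    reply = call_judge(RUBRIC_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Returning `None` on an unparseable reply keeps malformed judge output from silently counting as a low score.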
Pairwise Comparison (also: A/B eval)
Why this step: The methodology that aggregates into win rate.
Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. The verdicts aggregate into a win rate, which makes it the dominant method for comparing model or prompt versions.
Evaluation · intermediate
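Judges can favor whichever response is shown first, so one common precaution is to ask twice with the order swapped. The sketch below assumes a hypothetical `pick_first(prompt, first, second)` callable that returns `True` when the judge prefers the first response shown; treating an order-dependent verdict as a tie is one possible convention.

```python
from typing import Callable


def compare_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    pick_first: Callable[[str, str, str], bool],
) -> str:
    """Return "A", "B", or "tie" after judging the pair in both presentation orders."""
    a_wins_shown_first = pick_first(prompt, response_a, response_b)
    a_wins_shown_second = not pick_first(prompt, response_b, response_a)
    if a_wins_shown_first and a_wins_shown_second:
        return "A"
    if not a_wins_shown_first and not a_wins_shown_second:
        return "B"
    # The verdict flipped with the order, so treat the comparison as inconclusive.
    return "tie"
```

This doubles the judge calls per pair, which is the price of filtering out verdicts driven by position alone.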
Win Rate
Why this step: The most-cited modern eval scalar.
Win rate is the share of pairwise comparisons one candidate wins against another. It is the standard scalar for "model A is better than model B" in modern LLM evaluation.
Evaluation · beginner
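The arithmetic is simple enough to sketch directly. This example assumes each comparison is recorded as the label `"A"`, `"B"`, or `"tie"`, and it counts a tie as half a win; both the labels and the tie convention are assumptions, and some teams drop ties instead.

```python
def win_rate(outcomes: list[str], candidate: str = "A") -> float:
    """Share of comparisons won by `candidate`, counting each tie as half a win."""
    if not outcomes:
        raise ValueError("need at least one comparison")
    wins = sum(o == candidate for o in outcomes)
    ties = sum(o == "tie" for o in outcomes)
    return (wins + 0.5 * ties) / len(outcomes)
```

For example, two wins, one loss, and one tie give (2 + 0.5) / 4 = 0.625. With only a handful of comparisons the number is noisy, so it is worth reporting the sample size alongside it.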
Elo Rating (also: ELO)
Why this step: How pairwise comparisons compose into a global ranking.
Elo is a rating system, originally from chess and named after Arpad Elo, that converts pairwise wins between players into a single skill number. Chatbot Arena uses an Elo-style rating to rank LLMs from anonymous user votes.
Evaluation · intermediate
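A minimal sketch of the classic Elo update after one comparison. The 400-point scale and the K-factor of 32 are conventional chess defaults used here as assumptions; a given leaderboard may use different constants or a different fitting procedure altogether.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated ratings; score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two equally rated models each have an expected score of 0.5, so a win moves the winner up by 16 points and the loser down by 16. An upset against a higher-rated opponent moves both ratings further.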
Chatbot Arena (also: LMSYS Arena)
Why this step: The public infrastructure that ranks most major models.
Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see responses from two randomly chosen models side by side, vote for the better one, and so contribute to a global Elo-style leaderboard.
Evaluation · beginner
Offline Evaluation (also: offline eval)
Why this step: The CI-runnable test-suite eval pattern.
Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
Evaluation · beginner
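The offline loop can be sketched in a few lines. Here `model` and `scorer` are hypothetical callables, and the dataset is assumed to be a list of dicts with `"input"` and `"expected"` keys; real suites typically also record per-example results for debugging.

```python
from typing import Callable


def run_offline_eval(
    dataset: list[dict],
    model: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run every example through the model, score it, and report aggregate quality."""
    scores = [scorer(model(example["input"]), example["expected"]) for example in dataset]
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return {"n": len(scores), "mean_score": mean_score}
```

Because the dataset is fixed, two runs with different prompts or models produce directly comparable numbers, which is what makes this pattern suitable for CI.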
Online Evaluation (also: online eval)
Why this step: The "actually monitor production" pattern. Modern best practice.
Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
Evaluation · intermediate
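One way to picture online evaluation is a monitor that scores a random sample of traces and tracks a rolling average. The `OnlineMonitor` class, its parameters, and the trace shape below are all assumptions for illustration; production systems usually run this asynchronously against a trace store.

```python
import random
from collections import deque
from typing import Callable, Optional


class OnlineMonitor:
    """Score a random sample of production traces and keep a rolling mean."""

    def __init__(
        self,
        scorer: Callable[[dict], float],
        sample_rate: float = 0.1,
        window: int = 100,
        seed: Optional[int] = None,
    ) -> None:
        self.scorer = scorer
        self.sample_rate = sample_rate
        self.scores: deque = deque(maxlen=window)  # oldest scores fall off automatically
        self.rng = random.Random(seed)

    def observe(self, trace: dict) -> None:
        """Score the trace only if it falls inside the sampling rate."""
        if self.rng.random() < self.sample_rate:
            self.scores.append(self.scorer(trace))

    def rolling_mean(self) -> Optional[float]:
        """Mean score over the current window, or None if nothing has been sampled yet."""
        return sum(self.scores) / len(self.scores) if self.scores else None
```

Sampling keeps scoring cost bounded, and the fixed-size window makes the mean respond to recent quality shifts rather than all-time history.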
Eval-Driven Development (also: EDD)
Why this step: The discipline that ties it all together.
Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.
Evaluation · beginner
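In practice, "graded against the same eval suite" often turns into a regression gate. The sketch below assumes eval results are stored as a mapping from a hypothetical eval name to a score in [0, 1]; the `tolerance` default is an arbitrary illustrative value.

```python
def check_no_regression(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return names of evals where the candidate falls more than `tolerance` below baseline.

    An eval missing from the candidate results is treated as a score of 0.0.
    """
    return sorted(
        name
        for name, base_score in baseline.items()
        if candidate.get(name, 0.0) < base_score - tolerance
    )
```

An empty list means the change can ship as far as the suite is concerned; a non-empty list names exactly which behaviors got worse.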
Data Contamination (also: benchmark contamination)
Why this step: Why benchmark scores quietly lose meaning over time.
Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.
Evaluation · intermediate
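A common first-pass contamination check is n-gram overlap between a benchmark item and a training document. The sketch below uses hypothetical helper names and whitespace tokenization, and the default of 8-token n-grams is an assumption; it only catches near-verbatim leaks, not paraphrases.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n tokens, so there is nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A ratio near 1.0 suggests the item appears almost verbatim in the document; a ratio of 0.0 rules out only exact copies.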