ModelTerms

Learning path · 28 min · intermediate

How modern teams actually evaluate LLMs

Beyond benchmarks: pairwise comparison, win rates, LLM-as-judge, and online and offline evaluation.

Saying "GPT-4 is better than Llama" used to mean "I asked them both stuff and GPT-4 felt smarter." Modern teams follow a real methodology: offline eval suites in CI, online eval over production traffic, and LLM-as-judge pairwise comparisons that aggregate into win rates. This path walks through that discipline.
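
As a rough sketch of how pairwise judgments aggregate into a win rate, the snippet below tallies a list of judge verdicts. The verdict labels ("A", "B", "tie") and the convention of counting a tie as half a win are assumptions made for illustration; teams differ on how they handle ties.

```python
from collections import Counter

def win_rate(verdicts):
    """Share of comparisons candidate A wins.

    Assumed convention: each verdict is "A", "B", or "tie",
    and a tie counts as half a win for A.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    return (counts["A"] + 0.5 * counts["tie"]) / len(verdicts)

# Hypothetical judge verdicts for five prompts.
verdicts = ["A", "A", "B", "tie", "A"]
print(win_rate(verdicts))  # 0.7
```

With few comparisons the number is noisy, so it is usually worth reporting the sample size next to the win rate.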

  1. Benchmark (eval)

    Why this step: The standardized-test approach. Start here for the foundational frame.

    A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

    Evaluation · beginner
  2. Ground Truth (gold label)

    Why this step: The expensive ingredient that makes reference-based eval possible.

    Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.

    Evaluation · beginner
  3. Reference-Based Evaluation (supervised eval)

    Why this step: What you do when you have ground truth.

    Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output "matches the reference."

    Evaluation · beginner
  4. Reference-Free Evaluation (unsupervised eval)

    Why this step: What you do when you have no ground truth, which is the modern production default.

    Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

    Evaluation · intermediate
  5. LLM-as-Judge (LLM-as-a-judge)

    Why this step: The dominant technique for scaling reference-free eval cheaply.

    LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

    Evaluation · intermediate
  6. Pairwise Comparison (A/B eval)

    Why this step: The methodology that aggregates into win rate.

    Pairwise comparison asks a judge, human or LLM, to pick the better of two responses to the same prompt. The judgments aggregate into a win rate; it is the dominant method for comparing model or prompt versions.

    Evaluation · intermediate
  7. Win Rate

    Why this step: The most-cited modern eval scalar.

    Win rate is the share of pairwise comparisons one candidate wins against another. The standard scalar for "model A is better than model B" in modern LLM evaluation.

    Evaluation · beginner
  8. Elo Rating (Elo)

    Why this step: How pairwise comparisons compose into a global ranking.

    Elo is a rating system, originally from chess and named after Arpad Elo, that converts pairwise results between players into a single skill number. Chatbot Arena uses Elo-style ratings to rank LLMs from anonymous user votes.

    Evaluation · intermediate
  9. Chatbot Arena (LMSYS Arena)

    Why this step: The public infrastructure that ranks most major models.

    Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global Elo leaderboard.

    Evaluation · beginner
  10. Offline Evaluation (offline eval)

    Why this step: The CI-runnable test-suite eval pattern.

    Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

    Evaluation · beginner
  11. Online Evaluation (online eval)

    Why this step: The "actually monitor production" pattern. Modern best practice.

    Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.

    Evaluation · intermediate
  12. Eval-Driven Development (EDD)

    Why this step: The discipline that ties it all together.

    Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.

    Evaluation · beginner
  13. Data Contamination (benchmark contamination)

    Why this step: Why benchmark scores quietly lose meaning over time.

    Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.

    Evaluation · intermediate
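
To make step 8 concrete, here is a minimal sketch of the textbook Elo update applied to pairwise votes. The K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration; this is not a description of how any particular leaderboard computes its scores.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k controls how far a single vote moves the ratings (assumed value).
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta

# Hypothetical models, both starting from the same assumed rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0}

# Each vote: (first model, second model, score for the first model).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Because each update depends on the ratings at the moment of the vote, the final numbers from this sequential form depend on the order in which votes arrive.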

You finished the path.
