Evaluation

How we measure whether a model is any good.

Annotation is the process of attaching ground truth or quality labels to data — by humans, sometimes augmented by an LLM. The unglamorous but decisive lever in LLM evaluation.

beginner

Answer Relevance

Answer relevance measures whether the response actually answers the question asked — independent of whether it is true. The complement to faithfulness in RAG eval.

intermediate

ARC-AGI

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score 85%; models stayed below 5% for years.

intermediate

Benchmark

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

beginner

Chatbot Arena

Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global ELO leaderboard.

beginner

Data Contamination

Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.

intermediate

ELO Rating

ELO is a rating system originally from chess that converts pairwise wins between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.

intermediate

Eval-Driven Development

Eval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.

beginner

Faithfulness

Faithfulness measures whether an LLM's answer is supported by the retrieved context — every claim either appears in the source material or follows directly from it. The most important RAG quality metric.

intermediate

Ground Truth

Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.

beginner

Hallucination

A hallucination is a confidently-stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.

beginner

HumanEval

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.

beginner

LLM-as-Judge

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.

intermediate

MMLU

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects from elementary to professional. It is one of the most widely cited LLM benchmarks.

beginner

Offline Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

beginner

Online Evaluation

Online evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.

intermediate

Pairwise Comparison

Pairwise comparison asks a judge — human or LLM — to pick the better of two responses to the same prompt. Aggregates to a win rate; the dominant method for comparing model or prompt versions.

intermediate

Perplexity

Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.

intermediate

Reference-Based Evaluation

Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or LLM-as-judge "matches the reference."

beginner

Reference-Free Evaluation

Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

intermediate

Regression Testing (LLMs)

LLM regression testing is the practice of running every prompt or model change against a fixed set of "must-pass" examples — bug repros, edge cases, known failure modes — to catch quality regressions.

beginner

SWE-bench

SWE-bench is a benchmark of ~2.3K real GitHub issues from popular Python repos. The model must read the codebase, understand the bug, and write a patch that passes the existing tests.

intermediate

User Feedback Loop

A user feedback loop ingests explicit signals — thumbs up/down, edits, regenerates, copy-to-clipboard — back into evaluation and fine-tuning, turning real usage into a continuous quality signal.

intermediate

Win Rate

Win rate is the share of pairwise comparisons one candidate wins against another. The standard scalar for "model A is better than model B" in modern LLM evaluation.

beginner