Category
Evaluation
How we measure whether a model is any good.
Annotation is the process of attaching ground truth or quality labels to data — by humans, sometimes augmented by an LLM. The unglamorous but decisive lever in LLM evaluation.
beginnerAnswer relevance measures whether the response actually answers the question asked — independent of whether it is true. The complement to faithfulness in RAG eval.
intermediateARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans score 85%; models stayed below 5% for years.
intermediateA benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
beginnerChatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two random models' responses side-by-side, vote for the better one, and contribute to a global ELO leaderboard.
beginnerData contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.
intermediateELO is a rating system originally from chess that converts pairwise wins between players into a single skill number. Chatbot Arena uses it to rank LLMs from anonymous user votes.
intermediateEval-driven development is the LLM analog of test-driven development: you write evals for behavior before changing the prompt or model, and every change is graded against the same eval suite.
beginnerFaithfulness measures whether an LLM's answer is supported by the retrieved context — every claim either appears in the source material or follows directly from it. The most important RAG quality metric.
intermediateGround truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.
beginnerA hallucination is a confidently-stated, plausible-sounding LLM output that is factually wrong. It is the failure mode that most often surprises non-expert users.
beginnerHumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
beginnerLLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.
intermediateMMLU is a benchmark of ~16K multiple-choice questions across 57 subjects from elementary to professional. It is one of the most widely cited LLM benchmarks.
beginnerOffline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
beginnerOnline evaluation runs scoring functions over live production traffic — usually a sample of recent traces — to monitor quality continuously instead of relying solely on a fixed offline dataset.
intermediatePairwise comparison asks a judge — human or LLM — to pick the better of two responses to the same prompt. Aggregates to a win rate; the dominant method for comparing model or prompt versions.
intermediatePerplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.
intermediateReference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or LLM-as-judge "matches the reference."
beginnerReference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"
intermediateLLM regression testing is the practice of running every prompt or model change against a fixed set of "must-pass" examples — bug repros, edge cases, known failure modes — to catch quality regressions.
beginnerSWE-bench is a benchmark of ~2.3K real GitHub issues from popular Python repos. The model must read the codebase, understand the bug, and write a patch that passes the existing tests.
intermediateA user feedback loop ingests explicit signals — thumbs up/down, edits, regenerates, copy-to-clipboard — back into evaluation and fine-tuning, turning real usage into a continuous quality signal.
intermediateWin rate is the share of pairwise comparisons one candidate wins against another. The standard scalar for "model A is better than model B" in modern LLM evaluation.
beginner