ModelTerms

Evaluation · beginner

Reference-Based Evaluation (supervised eval)

Reference-based evaluation scores a model output by comparing it against a known correct answer, using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output "matches the reference."

Explanation

When you have ground truth (a labeled dataset where each input has a known correct output), you can replace the open-ended question "is this output good?" with a simpler one: did the model produce something close to the reference? For tasks with a well-defined correct answer (extraction, classification, single-fact Q&A, code that has to pass tests), this is the right tool.

Methods range from strict (exact match) to lenient (BLEU/ROUGE n-gram overlap) to semantic (embedding cosine similarity, or an LLM-as-judge asked "is this semantically equivalent to the reference?"). How strict the method should be depends on how much variation among correct outputs is acceptable.
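To make the strict-to-lenient range concrete, here is a minimal sketch using only the Python standard library. The function names are illustrative, and `unigram_f1` is a simplified stand-in for n-gram overlap metrics, not an implementation of BLEU or ROUGE themselves. The semantic end of the range (embeddings, LLM-as-judge) needs an external model and is not shown.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match(output: str, reference: str) -> float:
    """Strict: 1.0 only if the normalized strings are identical."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def edit_similarity(output: str, reference: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical after normalization."""
    a, b = normalize(output), normalize(reference)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def unigram_f1(output: str, reference: str) -> float:
    """Lenient: F1 over shared whitespace-separated tokens (simplified overlap metric)."""
    out_tokens = normalize(output).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "The capital of France is Paris."
    output = "Paris is the capital of France."
    for metric in (exact_match, edit_similarity, unigram_f1):
        print(metric.__name__, round(metric(output, reference), 3))
```

On a reordered but correct answer like the one above, exact match gives 0 while the overlap score stays well above 0, which is the kind of output variance the choice of method has to account for.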

Limits: collecting references is expensive; references can be wrong or stale; many real tasks (writing, dialogue, writing new code) have no single correct output.

Examples

  • A classifier eval: gold label is "spam", model output is "spam" → match, score 1.
  • A code task: model output is run against pytest; references = test cases; pass/fail is the score.
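Both examples above can be sketched in a few lines. This is a minimal illustration, not a production harness: `accuracy` and `passes_tests` are hypothetical helper names, and the code task is checked with plain in-process comparisons rather than pytest, to keep the sketch self-contained. Executing model-generated code with `exec` is unsafe outside a sandbox.

```python
def accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    """Fraction of predictions that exactly match the gold label (case-insensitive)."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        return 0.0
    matches = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(predictions, gold_labels)
    )
    return matches / len(gold_labels)


def passes_tests(candidate_source: str, function_name: str, test_cases) -> bool:
    """Return True if the candidate function gives the expected result on every test case.

    test_cases is an iterable of (args_tuple, expected_value) pairs; here the
    test cases play the role of the reference.
    """
    namespace: dict = {}
    try:
        # WARNING: runs untrusted code in this process; sandbox it in real use.
        exec(candidate_source, namespace)
        func = namespace[function_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as a fail.
        return False


if __name__ == "__main__":
    # Classifier eval: one score per example, averaged over the dataset.
    print(accuracy(["spam", "ham", "spam"], ["spam", "spam", "spam"]))

    # Code task: pass/fail is the score.
    model_output = "def add(a, b):\n    return a + b\n"
    print(passes_tests(model_output, "add", [((1, 2), 3), ((0, 0), 0)]))
```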

When to use reference-based evaluation

When ground truth is available and one or a small number of outputs are clearly correct — extraction, classification, structured-output, code-with-tests.
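For structured output, a whole-string exact match is often too coarse, and a per-field comparison against the reference can be more informative. The sketch below assumes the model is asked to return a flat JSON object; `field_accuracy` is a hypothetical helper name, and treating unparseable output as a score of 0 is one possible design choice among several.

```python
import json


def field_accuracy(output_json: str, reference: dict) -> float:
    """Fraction of reference fields that the model output reproduces exactly."""
    try:
        parsed = json.loads(output_json)
    except json.JSONDecodeError:
        return 0.0  # output that is not valid JSON gets no credit
    if not isinstance(parsed, dict) or not reference:
        return 0.0
    correct = sum(
        1 for key, expected in reference.items()
        if key in parsed and parsed[key] == expected
    )
    return correct / len(reference)


if __name__ == "__main__":
    reference = {"name": "Ada Lovelace", "birth_year": 1815, "city": "London"}
    output = '{"name": "Ada Lovelace", "birth_year": 1815, "city": "Paris"}'
    print(field_accuracy(output, reference))  # two of the three fields match
```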

Frequently asked

How is Reference-Based Evaluation related to Reference-Free Evaluation?

The two differ in whether a ground-truth answer is available. Reference-based evaluation compares the output against a known correct answer. Reference-free evaluation grades an output without a ground-truth answer to compare against, using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

Reference-Free Evaluation · Evaluation

Reference-free evaluation grades an output without a ground-truth answer to compare against — using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"

Ground Truth · Evaluation

Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

HumanEval · Evaluation

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.

MMLU · Evaluation

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.
