Evaluation · beginner
Reference-Based Evaluation (supervised eval)
Reference-based evaluation scores a model output against a known correct answer, using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output matches the reference.
Explanation
When you have ground truth — a labeled dataset where each input has a known correct output — you can ask the simpler question: did the model produce something close to the reference? For deterministic tasks (extraction, classification, single-fact Q&A, code that has to pass tests) this is the right tool.
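Methods range from strict (exact match) through lenient (edit distance, BLEU/ROUGE n-gram overlap) to semantic (an LLM-as-judge asked "is this semantically equivalent to the reference?", or embedding cosine similarity). The right level of strictness depends on how much variation in the output is acceptable: a class label should match exactly, while a one-sentence answer can be worded several ways and still be correct.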
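To make the strict-to-lenient range concrete, here is a minimal sketch of three surface-level scorers using only the Python standard library. The function names and the normalization step (lowercasing, collapsing whitespace) are illustrative assumptions, and `unigram_f1` is a simplified stand-in for ROUGE-1-style overlap, not the official BLEU or ROUGE implementation.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, minimal normalization)."""
    return " ".join(text.lower().split())


def exact_match(output: str, reference: str) -> float:
    """Strict: 1.0 only if the normalized strings are identical."""
    return 1.0 if normalize(output) == normalize(reference) else 0.0


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(output: str, reference: str) -> float:
    """Lenient: 1 minus edit distance divided by the longer string's length."""
    a, b = normalize(output), normalize(reference)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def unigram_f1(output: str, reference: str) -> float:
    """Lenient: F1 over shared word tokens, ignoring word order."""
    out_tokens = normalize(output).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the reference "Paris is the capital of France" and the output "The capital of France is Paris", `exact_match` returns 0.0 while `unigram_f1` returns 1.0, because the two strings contain the same words in a different order. That gap is the strictness trade-off: overlap metrics tolerate rewording, but they also cannot tell a reordered correct answer from a reordered wrong one.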
Limits: collecting references is expensive; references can be wrong or stale; and many real tasks (writing, dialogue, open-ended code generation) have no single correct output.
Examples
- A classifier eval: gold label is "spam", model output is "spam" → match, score 1.
- A code task: the model's code is run under pytest; the test cases serve as the reference, and pass/fail is the score.
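The two examples above can be sketched in a few lines. This is an illustration under stated assumptions, not a full harness: `accuracy` and `passes_tests` are hypothetical helper names, and the code check uses plain `exec` with input/expected pairs in place of a real pytest run.

```python
from typing import Any, Sequence, Tuple


def accuracy(outputs: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of outputs that match their gold label (case-insensitive)."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(o.strip().lower() == g.strip().lower() for o, g in zip(outputs, gold))
    return matches / len(gold)


def passes_tests(
    candidate_source: str,
    func_name: str,
    cases: Sequence[Tuple[tuple, Any]],
) -> bool:
    """Return True if the candidate function passes every (args, expected) case.

    Assumption: the candidate is trusted. Model-generated code should be run
    in a sandbox or separate process rather than with a bare exec.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as a fail.
        return False


# Classifier eval: one of two predictions matches its gold label.
print(accuracy(["spam", "ham"], ["spam", "spam"]))

# Code task: the test cases act as the reference.
candidate = "def add(a, b):\n    return a + b\n"
print(passes_tests(candidate, "add", [((1, 2), 3), ((0, 0), 0)]))
```

Treating any exception as a failure keeps the score binary, which matches the pass/fail framing of the code example.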
When to use reference-based evaluation
When ground truth is available and one or a small number of outputs are clearly correct: extraction, classification, structured output, and code with tests.
Frequently asked
What is Reference-Based Evaluation?
Reference-based evaluation scores a model output against a known correct answer, using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output matches the reference.
What is an example of reference-based evaluation?
A classifier eval: gold label is "spam", model output is "spam" → match, score 1.
How is Reference-Based Evaluation related to Reference-Free Evaluation?
Both are ways to score model outputs; they differ in whether a ground-truth answer exists. Reference-based evaluation compares the output against that answer, while reference-free evaluation grades an output without one, using rubric-based LLM-as-judge, self-consistency, or property checks like "is the answer grounded in the retrieved context?"
When should I use reference-based evaluation?
When ground truth is available and one or a small number of outputs are clearly correct: extraction, classification, structured output, and code with tests.
Is Reference-Based Evaluation considered beginner?
Yes. Reference-Based Evaluation is generally treated as beginner-level material in AI and LLM evaluation.