Comparison
Annotation vs Reference-Based Evaluation
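Annotation and Reference-Based Evaluation are both LLM evaluation terms, but they name different things: annotation attaches labels to data, while reference-based evaluation scores model output against a known correct answer. Here is a quick side-by-side.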
When you would reach for Annotation
Annotation comes up when you need ground truth or quality labels attached to data, typically by human reviewers and sometimes with LLM assistance.
A team samples 200 production traces weekly, routes them to an internal Argilla instance, and has reviewers label correctness + a category tag.
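The sampling step in that workflow might look like the minimal sketch below. The trace fields (`id`, `input`, `output`) and the function name `sample_for_annotation` are hypothetical, chosen only for illustration; uploading the records to Argilla or any other annotation tool is deliberately left out.

```python
import random


def sample_for_annotation(traces, k=200, seed=None):
    """Draw up to k traces without replacement and wrap each one in an
    annotation record with empty label fields for reviewers to fill in.

    Assumes each trace is a dict with "id", "input" and "output" keys
    (hypothetical schema).
    """
    rng = random.Random(seed)
    chosen = rng.sample(traces, min(k, len(traces)))
    return [
        {
            "trace_id": t["id"],
            "input": t["input"],
            "output": t["output"],
            "correct": None,   # reviewer sets True / False
            "category": None,  # reviewer sets a category tag
        }
        for t in chosen
    ]


# Usage: a fake week of 1,000 production traces.
week = [{"id": i, "input": f"q{i}", "output": f"a{i}"} for i in range(1000)]
batch = sample_for_annotation(week, k=200, seed=0)
```

Fixing the seed makes a given week's sample reproducible, which helps if two reviewers need to label the same batch.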
When you would reach for Reference-Based Evaluation
Use it when ground truth is available and one or a small number of outputs are clearly correct: extraction, classification, structured output, code with tests.
A classifier eval: gold label is "spam", model output is "spam" → match, score 1.
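As a rough illustration of that scoring rule, here is a minimal exact-match sketch. The function names and the normalization (trimming whitespace, lowercasing) are assumptions made for this example, not a prescribed standard.

```python
def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if prediction equals the gold reference after light
    normalization (strip + lowercase, an assumed choice), else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())


def accuracy(predictions, references):
    """Mean exact-match score over paired predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    total = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return total / len(references)


# Usage: the single example from the text, then a small batch.
score = exact_match("spam", "spam")
batch_accuracy = accuracy(["spam", "ham", "spam"], ["spam", "spam", "spam"])
```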
Frequently asked
What is the difference between Annotation and Reference-Based Evaluation?
Annotation is the process of attaching ground truth or quality labels to data, done by humans and sometimes augmented by an LLM. It is the unglamorous but decisive lever in LLM evaluation. Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output "matches the reference."
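When exact match is too strict, edit distance gives partial credit. The sketch below implements Levenshtein distance in plain Python and turns it into a 0-to-1 similarity; the function names and the normalization by the longer string's length are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]


def similarity(prediction: str, reference: str) -> float:
    """Map edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(prediction, reference) / longest
```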
When should I use Annotation vs Reference-Based Evaluation?
Use annotation when you need to create ground truth or quality labels for data. Use reference-based evaluation when ground truth is available and one or a small number of outputs are clearly correct: extraction, classification, structured output, code with tests.
Are Annotation and Reference-Based Evaluation the same thing?
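No. Both belong to evaluation, but annotation produces ground truth or quality labels, while reference-based evaluation scores model output against a known correct answer. They are related but address different steps of the evaluation workflow.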