ModelTerms

Evaluation · beginner

Ground Truth (gold label, gold answer)

Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often a human-curated reference answer.

Explanation

In classic ML, the ground-truth label usually ships with the dataset (cat/dog, spam/not-spam). In LLM evaluation, ground truth typically has to be written by domain experts or extracted from existing artifacts (issue → PR fix, question → cited source). That often makes it the most expensive component of the eval setup.

Strategies to reduce cost: use existing artifacts (tests in code, accepted answers in a Q&A forum, gold passages in a search index), generate labels with a stronger model and have humans spot-check a sample, or pivot to reference-free eval where ground truth is impractical.
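
The spot-check strategy can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the function names `sample_for_review` and `estimated_error_rate`, the sample size, and the record layout are all assumptions made for the example.

```python
import random

def sample_for_review(items, k, seed=0):
    """Pick k model-labeled items uniformly at random for human review."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(items, min(k, len(items)))

def estimated_error_rate(reviewed):
    """Fraction of reviewed items where the human disagreed with the model label."""
    if not reviewed:
        raise ValueError("no reviewed items")
    wrong = sum(1 for r in reviewed if r["human_label"] != r["model_label"])
    return wrong / len(reviewed)

# Hypothetical data: labels produced by a stronger model.
candidates = [{"id": i, "model_label": "spam" if i % 3 == 0 else "ham"} for i in range(100)]
audit = sample_for_review(candidates, k=10)

# Hypothetical human verdicts: pretend the reviewer overturns the first two labels.
for n, item in enumerate(audit):
    flipped = "ham" if item["model_label"] == "spam" else "spam"
    item["human_label"] = flipped if n < 2 else item["model_label"]

print(estimated_error_rate(audit))  # 0.2
```

If the estimated error rate from the audit is too high for the eval's purpose, the model-generated labels probably need a full human pass rather than a spot-check.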

The eval is only as good as the ground truth. A noisy gold set caps how reliably you can detect improvements.
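
To see why noise caps the signal, consider a simplified model (an assumption for illustration, not a claim about any particular dataset): a binary task where each gold label is independently wrong with probability `noise`, regardless of what the model predicts. A model with true accuracy `a` agrees with the gold label when both are right or both are wrong, so its measured accuracy is `a * (1 - noise) + (1 - a) * noise`.

```python
def measured_accuracy(true_accuracy, noise):
    """Expected agreement with a noisy binary gold set.

    Assumes gold-label errors are independent of the model's errors.
    """
    return true_accuracy * (1 - noise) + (1 - true_accuracy) * noise

noise = 0.10
old = measured_accuracy(0.80, noise)
new = measured_accuracy(0.90, noise)

print(round(old, 4), round(new, 4))             # 0.74 0.82
print(round(new - old, 4))                      # 0.08: a true 10-point gain shows up as 8 points
print(round(measured_accuracy(1.0, noise), 4))  # 0.9: even a perfect model tops out here
```

Under this model the measured gap between two systems shrinks by a factor of `1 - 2 * noise`, so small real improvements become harder to tell apart from sampling variation.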

Examples

  • For a coding eval: ground truth = the passing tests in the repo at HEAD.
  • For a customer-support eval: ground truth = the human agent's actual response (with the caveat that humans are not always right either).
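
As a rough sketch of the coding example above, tests can act as the ground truth directly: a candidate solution counts as correct only if every test passes. The test cases, the slug-formatting task, and the helper `passes_all` are hypothetical; a real harness would run the repository's own test suite in a sandbox rather than call functions in-process.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical ground truth for one task: (input, expected output) pairs.
GOLD_TESTS = [
    ("Hello World", "hello-world"),
    ("  Ground  Truth ", "ground-truth"),
    ("already-slugged", "already-slugged"),
]

def passes_all(candidate: Callable[[str], str], tests: Iterable[Tuple[str, str]]) -> bool:
    """Return True only if the candidate matches the expected output on every test."""
    for arg, expected in tests:
        try:
            if candidate(arg) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failure
    return True

# Two hypothetical model outputs for the same task.
def candidate_a(text: str) -> str:
    return "-".join(text.lower().split())

def candidate_b(text: str) -> str:
    return text.lower().replace(" ", "-")

print(passes_all(candidate_a, GOLD_TESTS))  # True
print(passes_all(candidate_b, GOLD_TESTS))  # False
```

The second candidate fails only on the input with extra whitespace, which shows how the coverage of the gold tests determines what the eval can detect.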

Frequently asked

How is Ground Truth related to Reference-Based Evaluation?

Reference-based evaluation is the family of methods that consumes ground truth: it compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output "matches the reference." Without ground truth there is no reference to compare against.
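
Two of the simpler comparisons named above, exact match and edit distance, can be sketched in a few lines. The normalization rule (lowercase, collapse whitespace) and the function names are assumptions for illustration; BLEU, ROUGE, and LLM-as-judge need dedicated tooling and are not shown.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def exact_match(output: str, reference: str) -> bool:
    return normalize(output) == normalize(reference)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

reference = "Paris"
print(exact_match("  paris ", reference))       # True
print(exact_match("Paris, France", reference))  # False
print(edit_distance("kitten", "sitting"))       # 3
```

Exact match is strict and cheap, while edit distance gives partial credit for near misses; which one fits depends on how much surface variation a correct answer can have.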

Reference-Based Evaluation · Evaluation

Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or LLM-as-judge "matches the reference."

Annotation · Evaluation

Annotation is the process of attaching ground truth or quality labels to data, done by humans and sometimes augmented by an LLM. It is the unglamorous but decisive lever in LLM evaluation.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.
