Evaluation · beginner
Ground Truth (gold label, gold answer)
Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often a human-curated reference answer.
Explanation
In classic ML the ground-truth label usually ships with the dataset (cat/dog, spam/not-spam). In LLM evaluation, ground truth typically has to be written by domain experts or extracted from existing artifacts (issue → PR fix, question → cited source). It is often the most expensive component of the eval setup.
Strategies to reduce cost: use existing artifacts (tests in code, accepted answers in a Q&A forum, gold passages in a search index), distill from a stronger model with human spot-check, or pivot to reference-free eval where ground truth is impractical.
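The "distill from a stronger model with human spot-check" strategy can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the function names, the record fields (`id`, `question`, `draft_answer`), and the 10% review fraction are all assumptions made for the example.

```python
import random

def spot_check_sample(candidates, fraction=0.1, seed=0):
    """Pick a reproducible random subset of model-drafted answers for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not candidates:
        return []
    # Always review at least one item, even for very small sets.
    k = max(1, round(len(candidates) * fraction))
    rng = random.Random(seed)  # fixed seed so the review set can be reproduced
    return rng.sample(candidates, k)

def estimated_error_rate(review_verdicts):
    """Share of reviewed drafts a human marked wrong (False means wrong)."""
    wrong = sum(1 for ok in review_verdicts if not ok)
    return wrong / len(review_verdicts)

# Hypothetical drafts produced by a stronger model (placeholder content).
drafts = [{"id": i, "question": f"q{i}", "draft_answer": f"a{i}"} for i in range(50)]
to_review = spot_check_sample(drafts, fraction=0.1)
```

If the error rate estimated from the reviewed sample is too high for your purposes, that is a signal to review more of the set by hand rather than trust the drafted answers as ground truth.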
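The eval is only as good as the ground truth. A noisy gold set caps how reliably you can detect improvements: when some gold answers are wrong, correct outputs get marked wrong, and small real gains are hard to tell apart from labeling error.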
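A small worked calculation can make the effect of gold-label noise concrete. The sketch below assumes a simplified setting that is not stated in the text above: binary labels, and gold-label errors that are independent of the model's errors. Under those assumptions, a model with true accuracy `a` scored against a gold set with error rate `e` shows a measured accuracy of `a(1 - e) + (1 - a)e`.

```python
def observed_accuracy(true_accuracy, label_noise):
    """Measured accuracy against a noisy binary gold set.

    Assumes gold-label errors are independent of model errors.
    The model is scored correct when it is right and the label is right,
    or when it is wrong and the label is wrong in the same way.
    """
    return true_accuracy * (1 - label_noise) + (1 - true_accuracy) * label_noise

noise = 0.10  # assumed: 10% of gold labels are wrong
old = observed_accuracy(0.80, noise)
new = observed_accuracy(0.85, noise)
ceiling = observed_accuracy(1.0, noise)  # best score a perfect model can get
gap = new - old  # measured improvement, smaller than the true 0.05
```

In this setting the measured gap shrinks by a factor of `1 - 2e`, so a true 5-point improvement shows up as roughly 4 points, and even a perfect model cannot score above `1 - e`.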
Examples
- For a coding eval: ground truth = the passing tests in the repo at HEAD.
- For a customer-support eval: ground truth = the human agent's actual response (with the caveat that humans are not always right either).
Frequently asked
How is Ground Truth related to Reference-Based Evaluation?
Reference-based evaluation is the family of methods that uses ground truth: it compares the model output against a known-correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge asked whether the output matches the reference.
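Two of the simpler comparisons named here, exact match and edit distance, can be sketched in a few lines. This is an illustrative sketch only: the normalization rule (lowercasing and collapsing whitespace) is an assumption, and real evals often need task-specific normalization.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization rule)."""
    return " ".join(text.lower().split())

def exact_match(output, reference):
    """True when output and reference are identical after normalization."""
    return normalize(output) == normalize(reference)

def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

reference = "Paris"
is_match = exact_match("  paris ", reference)
distance = edit_distance("kitten", "sitting")
```

Exact match is strict and suits short, canonical answers; edit distance tolerates small surface differences but knows nothing about meaning, which is one reason BLEU, ROUGE, or an LLM judge are used for longer free-form outputs.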
Is Ground Truth considered beginner?
Ground Truth is generally considered beginner-level material in the AI and LLM space.