Evaluation · beginner
Annotation (labeling, human labeling)
Annotation is the process of attaching ground truth or quality labels to data, done by humans and sometimes assisted by an LLM. It is the unglamorous but decisive lever in LLM evaluation.
Explanation
Annotation pipelines for LLM apps usually follow four steps: sample production traces; route them to an annotation UI (Argilla, Phoenix, Label Studio, or a custom internal tool); have humans label correctness, faithfulness, category, or preferred response; and save the labels back to the dataset.
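The sampling and write-back steps can be sketched in a few lines of Python. This is a minimal illustration, not any tool's API: the `Trace` class, `sample_traces`, `save_labels`, and the in-memory `dataset` are all hypothetical names, and the annotation UI itself is replaced by a single hand-written label.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    prompt: str
    response: str
    # annotator name -> that annotator's labels (hypothetical storage layout)
    labels: dict = field(default_factory=dict)


def sample_traces(traces, k, seed=0):
    """Draw a reproducible random sample of up to k traces for annotation."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def save_labels(dataset, trace_id, annotator, **labels):
    """Write one annotator's labels back onto the matching trace."""
    dataset[trace_id].labels[annotator] = labels


# Hypothetical production traces, keyed by id.
dataset = {
    f"t{i}": Trace(f"t{i}", f"question {i}", f"answer {i}") for i in range(1000)
}

# Step 1: sample a batch. Steps 2-3 (the UI and the human) are simulated here.
batch = sample_traces(list(dataset.values()), k=200)

# Step 4: save the reviewer's labels back to the dataset.
save_labels(
    dataset, batch[0].trace_id, "reviewer_a",
    correctness="correct", category="billing",
)
```

Keying labels by annotator keeps multiple reviewers' judgments on the same trace separate, which is what makes agreement checks possible later.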
Annotation quality drives downstream value: sloppy labels create noisy evals that mask real signal. Inter-annotator agreement is the standard quality metric, typically Cohen's kappa for two annotators and Fleiss' kappa or Krippendorff's alpha for three or more. Accuracy on gold-labeled items (gold rate), label drift over time, and per-annotator bias are routine monitors.
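To make the agreement metric concrete, here is a small sketch of Cohen's kappa for two annotators, written with only the standard library. Kappa compares observed agreement p_o with the agreement p_e expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The function name and the sample labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items where both chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        # Kappa is undefined when both annotators used one identical label
        # throughout; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels: the annotators agree on 8 of 10 items.
a = ["faithful", "faithful", "unfaithful", "faithful", "unfaithful",
     "faithful", "unfaithful", "faithful", "faithful", "unfaithful"]
b = ["faithful", "faithful", "unfaithful", "faithful", "faithful",
     "faithful", "unfaithful", "unfaithful", "faithful", "unfaithful"]

kappa = cohens_kappa(a, b)  # p_o = 0.80, p_e = 0.52, kappa is about 0.58
```

Raw agreement here is 80%, but kappa is noticeably lower because with two roughly balanced labels about half of that agreement would occur by chance.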
Modern hybrid pattern: an LLM produces a candidate label cheaply, and a human reviews and corrects only the uncertain cases. This can cut human labeling cost by around 80% while largely preserving label quality.
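One way to implement the routing step is a simple confidence threshold. The sketch below assumes the LLM pre-labeler yields a confidence score in [0, 1] alongside each label; the `Candidate` class, the `route` function, the scores, and the 0.9 threshold are all hypothetical and would need tuning against human-reviewed samples.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    trace_id: str
    label: str         # label proposed by the LLM
    confidence: float  # assumed score in [0, 1] from the pre-labeler


def route(candidates, threshold=0.9):
    """Split LLM pre-labels into auto-accepted and human-review queues."""
    auto, review = [], []
    for c in candidates:
        # Confident labels are accepted; the rest go to a human reviewer.
        (auto if c.confidence >= threshold else review).append(c)
    return auto, review


# Illustrative pre-labels.
candidates = [
    Candidate("t1", "faithful", 0.98),
    Candidate("t2", "unfaithful", 0.55),
    Candidate("t3", "faithful", 0.93),
    Candidate("t4", "faithful", 0.71),
    Candidate("t5", "unfaithful", 0.96),
]

auto, review = route(candidates)
human_share = len(review) / len(candidates)  # fraction humans must review
```

The threshold is the cost and quality dial: raising it sends more items to humans, lowering it trusts the LLM more. Periodically spot-checking the auto-accepted queue is a reasonable guard against silent errors.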
Examples
- A team samples 200 production traces weekly, routes them to an internal Argilla instance, and has reviewers label correctness + a category tag.
- GPT-4 pre-labels 10K traces as faithful / unfaithful; humans review the borderline 1K.
Frequently asked
How is Annotation related to Ground Truth?
Annotation is the process that produces ground truth. Ground truth is the known-correct answer for an eval input: for supervised tasks it is the label used to grade model outputs; for LLM apps it is often a human-curated reference answer.
Is Annotation considered beginner?
Annotation is generally considered beginner-level material in the AI and LLM space.