ModelTerms

Evaluation · beginner

Annotation (labeling, human labeling)

Annotation is the process of attaching ground-truth or quality labels to data, done by humans and sometimes assisted by an LLM. It is the unglamorous but decisive lever in LLM evaluation.

Explanation

A typical annotation pipeline for an LLM app has four steps: sample production traces; route them to an annotation UI (Argilla, Phoenix, Label Studio, or a custom internal tool); have humans label correctness, faithfulness, category, or preferred response; and save the labels back to the dataset.
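The sample, label, and write-back steps can be sketched in plain Python. This is a minimal sketch, not any tool's API: the trace fields (`id`, `input`, `output`), the label fields, and all three function names are hypothetical, and the annotation UI itself is left out.

```python
import random


def sample_traces(traces, k, seed=0):
    """Draw a reproducible random sample of up to k traces for annotation."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def to_tasks(sampled):
    """Turn each sampled trace into an annotation task with empty label fields."""
    return [
        {
            "trace_id": t["id"],
            "input": t["input"],
            "output": t["output"],
            "correctness": None,  # filled in by a human reviewer
            "category": None,     # filled in by a human reviewer
        }
        for t in sampled
    ]


def merge_labels(dataset, labeled_tasks):
    """Write completed labels back onto the dataset, keyed by trace id."""
    by_id = {t["trace_id"]: t for t in labeled_tasks}
    merged = []
    for row in dataset:
        task = by_id.get(row["id"])
        labels = {}
        # Only copy labels from tasks a reviewer has actually completed.
        if task is not None and task["correctness"] is not None:
            labels = {
                "correctness": task["correctness"],
                "category": task["category"],
            }
        merged.append({**row, **labels})
    return merged
```

In a real setup the tasks produced by `to_tasks` would be pushed to the annotation tool and pulled back once reviewed; here they are just dictionaries that a reviewer fills in.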

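Annotation quality drives downstream value: sloppy labels create noisy evals that mask real signal. Inter-annotator agreement is the standard quality metric, typically Cohen's kappa for two annotators and Fleiss' kappa or Krippendorff's alpha for more than two. Routine monitors include gold rate (accuracy on items with a known answer), label drift over time, and per-annotator bias.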
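For two annotators labeling the same items, Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below computes it with the standard library only; the function name and the example labels are illustrative, not taken from any particular tool.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators' usage rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


reviewer_1 = ["ok", "ok", "bad", "ok", "bad", "ok"]
reviewer_2 = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on 5 of 6 items (observed agreement 0.833) and chance agreement is 0.5, so kappa comes out at about 0.67. A value of 1 means perfect agreement and 0 means agreement no better than chance.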

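Modern hybrid pattern: an LLM produces a candidate label cheaply, and a human reviews and corrects only the uncertain cases. This can cut human labeling cost by roughly 80% while preserving quality, though the actual saving depends on how many cases are routed to review.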
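One way to decide which cases count as uncertain is a confidence threshold on the LLM's pre-label. The sketch below assumes each pre-label carries a `confidence` score between 0 and 1; that field, the `0.9` threshold, and the function name are assumptions for illustration, and how a given model produces a trustworthy score is outside its scope.

```python
def route_for_review(prelabels, threshold=0.9):
    """Split LLM pre-labels into an auto-accepted queue and a human-review queue."""
    auto_accepted, needs_review = [], []
    for item in prelabels:
        # Hypothetical 'confidence' field: anything below the threshold goes to a human.
        if item["confidence"] >= threshold:
            auto_accepted.append(item)
        else:
            needs_review.append(item)
    return auto_accepted, needs_review


prelabels = [
    {"trace_id": 1, "label": "faithful", "confidence": 0.98},
    {"trace_id": 2, "label": "unfaithful", "confidence": 0.55},
    {"trace_id": 3, "label": "faithful", "confidence": 0.93},
    {"trace_id": 4, "label": "faithful", "confidence": 0.71},
    {"trace_id": 5, "label": "unfaithful", "confidence": 0.96},
]
auto_accepted, needs_review = route_for_review(prelabels)
review_fraction = len(needs_review) / len(prelabels)
```

The share of items sent to humans (`review_fraction`) is what drives the cost saving. Spot-checking a sample of the auto-accepted queue is a reasonable safeguard, since a confident pre-label can still be wrong.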

Examples

  • A team samples 200 production traces weekly, routes them to an internal Argilla instance, and has reviewers label correctness + a category tag.
  • GPT-4 pre-labels 10K traces as faithful / unfaithful; humans review the borderline 1K.

Frequently asked

How is Annotation related to Ground Truth?

Both are evaluation concepts. Ground truth is the known-correct answer for an eval input: in supervised tasks it is the label used to grade model outputs, and in LLM apps it is often a human-curated reference answer. Annotation is the process that produces those labels and reference answers.
