Evaluation · beginner

Annotation (labeling, human labeling)

Annotation is the process of attaching ground-truth or quality labels to data, done by humans and sometimes assisted by an LLM. It is the unglamorous but decisive lever in LLM evaluation.

Published May 31, 2026

Explanation

Annotation pipelines for LLM apps usually follow the same steps: sample production traces; route them to an annotation UI (Argilla, Phoenix, Label Studio, or a custom internal tool); have humans label correctness, faithfulness, category, or preferred response; and save the labels back to the dataset.
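The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the API of any annotation tool: the `Trace` record, `sample_traces`, and `apply_labels` are hypothetical names, and the annotation UI is stood in for by a hard-coded dictionary of submitted labels.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical record for one production trace."""
    trace_id: str
    prompt: str
    response: str
    labels: dict = field(default_factory=dict)  # filled in by annotators


def sample_traces(traces, k, seed=0):
    """Draw a reproducible random sample of traces for annotation."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def apply_labels(traces, submitted):
    """Write annotator labels back onto the matching traces.

    `submitted` maps trace_id -> {"correctness": ..., "category": ...}.
    """
    by_id = {t.trace_id: t for t in traces}
    for trace_id, labels in submitted.items():
        by_id[trace_id].labels.update(labels)
    return traces


# Stand-in for a week of production traffic.
production = [Trace(f"t{i}", f"question {i}", f"answer {i}") for i in range(1000)]

# Step 1: sample a batch for review.
batch = sample_traces(production, k=200)

# Steps 2-3: in practice the batch goes to an annotation UI;
# here one reviewer's submission is hard-coded for illustration.
submitted = {batch[0].trace_id: {"correctness": "correct", "category": "billing"}}

# Step 4: save the labels back to the dataset.
apply_labels(batch, submitted)
```

Fixing the random seed keeps the sample reproducible, which helps when a batch needs to be re-exported or audited later.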

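Annotation quality drives downstream value. Sloppy labels create noisy evals that mask real signal. Inter-annotator agreement is the standard quality metric: Cohen's kappa for two annotators, with Fleiss' kappa or Krippendorff's alpha when there are more. Gold rate (annotator accuracy on items with known answers), label drift over time, and per-annotator bias are routine monitors.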
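For two annotators, Cohen's kappa compares the observed agreement rate with the agreement expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that formula with the standard library only; the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


F, U = "faithful", "unfaithful"
annotator_a = [F, F, F, F, F, F, U, U, U, U]
annotator_b = [F, F, F, F, F, U, U, U, U, F]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # 0.58: 80% raw agreement, 52% expected by chance
```

The example shows why raw agreement alone can mislead: the annotators agree on 8 of 10 items, but more than half of that agreement would be expected by chance given how often each uses the two labels.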

A common hybrid pattern: an LLM produces a candidate label cheaply, and a human reviews and corrects only the uncertain cases. This can cut cost by roughly 80% while largely preserving quality, depending on how many cases need review.
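One way to implement the routing step is a simple confidence threshold, sketched below. Everything here is an assumption for illustration: the `PreLabel` record, the `route` function, the 0.9 threshold, and above all the `confidence` field, since how an uncertainty score is obtained from an LLM (and whether it is well calibrated) depends on the setup.

```python
from dataclasses import dataclass


@dataclass
class PreLabel:
    """Hypothetical LLM-produced candidate label for one trace."""
    trace_id: str
    label: str
    confidence: float  # assumed score in [0, 1]; not a real API field


def route(prelabels, threshold=0.9):
    """Split pre-labels into auto-accepted and human-review queues."""
    auto, review = [], []
    for p in prelabels:
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


prelabels = [
    PreLabel("t1", "faithful", 0.98),
    PreLabel("t2", "unfaithful", 0.55),
    PreLabel("t3", "faithful", 0.91),
    PreLabel("t4", "unfaithful", 0.72),
]

auto_accepted, needs_review = route(prelabels)

# The review fraction is what determines the human labeling cost.
review_fraction = len(needs_review) / len(prelabels)
```

In practice the threshold would be tuned against a human-labeled sample, and a random slice of the auto-accepted queue is often spot-checked so that errors in confident pre-labels do not go unnoticed.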

Examples

A team samples 200 production traces weekly, routes them to an internal Argilla instance, and has reviewers label correctness + a category tag.
GPT-4 pre-labels 10K traces as faithful / unfaithful; humans review the borderline 1K.

Frequently asked

How is Annotation related to Ground Truth?

Annotation is how ground truth gets produced: annotators attach the labels that later serve as the known-correct answers for eval inputs. For supervised tasks, ground truth is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.

Ground Truth · Evaluation

Ground truth is the known-correct answer for an eval input. For supervised tasks it is the label used to grade model outputs; for LLM apps it is often human-curated reference answers.

Preference Data · Training

Preference data is a collection of (chosen, rejected) response pairs for the same prompt. It is the fuel for DPO and reward-model training.

Reference-Based Evaluation · Evaluation

Reference-based evaluation compares the model output against a known correct answer using exact match, edit distance, BLEU, ROUGE, or an LLM-as-judge check that the output matches the reference.

Offline Evaluation · Evaluation

Offline evaluation runs a fixed dataset of inputs through a candidate model or prompt, scores each output, and reports aggregate quality — the standard way to compare changes before shipping.

Sources

Argilla — Human feedback for LLMs