Evaluation · beginner
Benchmark (eval)
A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
Explanation
A benchmark fixes the questions, the evaluation method, and (usually) the scoring rubric, so you can directly compare GPT-4 to Claude to Llama on the same axis. Without benchmarks, "this model is better" is just vibes.
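As a rough illustration of the three fixed parts named above (questions, evaluation method, scoring), here is a minimal sketch of a benchmark harness. Everything in it is hypothetical: the three toy questions, the `exact_match` scoring rule, and the `model_a` / `model_b` stand-in functions are assumptions made for the example, not part of any real benchmark.

```python
from typing import Callable, List, Tuple

# Hypothetical toy benchmark: a fixed list of (question, reference answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]


def exact_match(prediction: str, reference: str) -> bool:
    """Scoring rule (assumed): case-insensitive match after trimming whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def run_benchmark(model: Callable[[str], str]) -> float:
    """Return the fraction of benchmark questions the model answers correctly."""
    correct = sum(exact_match(model(question), ref) for question, ref in BENCHMARK)
    return correct / len(BENCHMARK)


# Stand-in "models": plain functions mapping a question to an answer.
def model_a(question: str) -> str:
    answers = {
        "2 + 2 = ?": "4",
        "Capital of France?": "paris",
        "Opposite of hot?": "warm",  # wrong on purpose
    }
    return answers.get(question, "")


def model_b(question: str) -> str:
    answers = {
        "2 + 2 = ?": "4",
        "Capital of France?": "Paris",
        "Opposite of hot?": "Cold",
    }
    return answers.get(question, "")


# Both models see the same questions and the same scoring rule,
# so the two numbers are directly comparable.
scores = {name: run_benchmark(fn) for name, fn in [("model_a", model_a), ("model_b", model_b)]}
```

Because the questions and the scoring rule never change between runs, any difference in `scores` comes from the models themselves, which is what makes the comparison meaningful.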
Benchmarks decay: once a benchmark becomes popular, its questions and answers tend to leak into later pretraining corpora, and models can then score well by memorizing them rather than by solving the task. Newer benchmarks (GPQA, ARC-AGI, SWE-bench) are introduced regularly to stay ahead of this contamination.
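One simple way to look for this kind of leakage is to measure how many word n-grams of a benchmark item also appear in a training document. The sketch below is only an assumption about how such a check might look: the helper names `ngrams` and `overlap_ratio`, whitespace tokenization, and the default of n = 5 are illustrative choices, and real contamination studies use more careful normalization and much larger corpora.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus document.

    Returns 0.0 when the item is shorter than n tokens and so has no n-grams.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    shared = item_grams & ngrams(corpus_doc, n)
    return len(shared) / len(item_grams)


# Hypothetical data: a benchmark question and two candidate training documents.
item = "what is the capital city of france today"
leaked_doc = "quiz: what is the capital city of france today answer paris"
clean_doc = "the weather in paris is mild in spring and autumn"

leaked_score = overlap_ratio(item, leaked_doc)
clean_score = overlap_ratio(item, clean_doc)
```

A ratio near 1 suggests the item appears close to verbatim in the document; a ratio near 0 only shows that this particular document does not contain it, not that the benchmark is uncontaminated overall.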
Production teams typically supplement standard benchmarks with their own evals, which measure the things that actually matter for their use case.
Examples
- MMLU: 57 academic subjects, multiple choice.
- HumanEval: 164 Python coding problems.
- SWE-bench: real GitHub issues from Python repos.
Frequently asked
What is a benchmark?
A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
What is an example of a benchmark?
MMLU: 57 academic subjects, multiple choice.
How is a benchmark related to MMLU?
MMLU is one specific benchmark: roughly 16K multiple-choice questions across 57 subjects, ranging from elementary to professional level. It is one of the most widely cited LLM benchmarks.
Is the benchmark concept considered beginner-level?
Yes. Benchmarks are generally considered beginner-level material in the AI and LLM space.