ModelTerms

Evaluation · beginner

Benchmark (eval)

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Explanation

A benchmark fixes the questions, the evaluation method, and (usually) the scoring rubric, so you can directly compare GPT-4 to Claude to Llama on the same axis. Without benchmarks, "this model is better" is just vibes.
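As a rough illustration of what "fixing the questions and the scoring" means, here is a minimal sketch of a benchmark harness. The three questions, the `score` function, and both stub "models" are made up for this example; real benchmarks such as MMLU ship far larger question sets and their own evaluation code.

```python
from typing import Callable

# A tiny, hypothetical benchmark: fixed questions with a fixed answer key.
QUESTIONS = [
    {"prompt": "2 + 2 = ?  A) 3  B) 4  C) 5", "answer": "B"},
    {"prompt": "Capital of France?  A) Paris  B) Rome  C) Oslo", "answer": "A"},
    {"prompt": "H2O is?  A) Salt  B) Gold  C) Water", "answer": "C"},
]


def score(model: Callable[[str], str]) -> float:
    """Fixed scoring rule: exact-match accuracy on the answer letter."""
    correct = sum(
        model(q["prompt"]).strip().upper() == q["answer"] for q in QUESTIONS
    )
    return correct / len(QUESTIONS)


# Two stand-in "models", stubbed as plain functions for illustration.
def always_b(prompt: str) -> str:
    return "B"


def lookup_model(prompt: str) -> str:
    # Knows the answer key, so it answers every question correctly.
    key = {q["prompt"]: q["answer"] for q in QUESTIONS}
    return key[prompt]


if __name__ == "__main__":
    for name, model in [("always_b", always_b), ("lookup_model", lookup_model)]:
        print(f"{name}: {score(model):.2f}")
```

Because both models see the same questions and are graded by the same rule, their scores are directly comparable.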

Benchmarks decay: once a benchmark becomes popular, its data tends to leak into later pretraining corpora, and models can score well by memorizing it. New benchmarks (GPQA, ARC-AGI, SWE-bench) are introduced regularly to stay ahead of contamination.

Production teams typically build their own evals on top of standard benchmarks, measuring what actually matters for their use case.

Examples

  • MMLU: 57 academic subjects, multiple choice.
  • HumanEval: 164 Python coding problems.
  • SWE-bench: real GitHub issues from Python repos.

Frequently asked

What is a benchmark?

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

What is an example of a benchmark?

MMLU: 57 academic subjects, multiple choice.

How is a benchmark related to MMLU?

MMLU is one specific benchmark: ~16K multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.

Is the benchmark concept considered beginner-level?

Yes. Benchmarks are generally considered beginner-level material in the AI and LLM space.

MMLU · Evaluation

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.

HumanEval · Evaluation

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
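The check described above can be sketched as follows. The `add` problem, its tests, and the `passes` helper are hypothetical and are not taken from HumanEval itself. The sketch also runs completions with a bare `exec`, which is unsafe for untrusted model output; an actual harness would run them in a sandbox with timeouts.

```python
# Hypothetical HumanEval-style problem: the prompt is a signature plus docstring.
PROMPT = '''def add(a, b):
    """Return the sum of a and b."""
'''

# Unit tests that the completed function must pass.
TESTS = '''
assert add(1, 2) == 3
assert add(-1, 1) == 0
'''


def passes(completion: str) -> bool:
    """Return True if prompt + completion passes every unit test."""
    namespace: dict = {}
    try:
        exec(PROMPT + completion, namespace)  # define the completed function
        exec(TESTS, namespace)                # run the unit tests against it
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False


if __name__ == "__main__":
    print(passes("    return a + b\n"))  # a correct function body
    print(passes("    return a - b\n"))  # a buggy function body
```

The model only supplies the function body; everything else (prompt, tests, pass/fail rule) is fixed by the benchmark.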

Perplexity · Evaluation

Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.
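Concretely, perplexity is the exponential of the average negative log-probability the model assigned to each actual next token. A minimal sketch, assuming you already have those per-token probabilities (the `perplexity` helper and the sample numbers are illustrative):

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


if __name__ == "__main__":
    # A model that spreads probability evenly over 4 options at every step
    # has a perplexity of about 4: it is "choosing among 4" each time.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
    # A more confident model (higher probability on the true token) scores lower.
    print(perplexity([0.9, 0.8, 0.95]))
```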

LLM-as-Judge · Evaluation

LLM-as-judge uses a strong LLM to score or compare outputs from other LLMs. It is how most production teams evaluate quality at scale when human review is too slow.
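A pairwise comparison might be sketched like this. The `Judge` type, the `compare` helper, and both stub judges are assumptions made for illustration; in practice the judge would be a call to a strong LLM. Asking twice with the answer order swapped is one common precaution against a judge that favors whichever answer appears first, not something the definition above prescribes.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, answer_a, answer_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def compare(judge: Judge, prompt: str, ans_1: str, ans_2: str) -> str:
    """Ask the judge twice with the order swapped; return "1", "2", or "tie"."""
    first = judge(prompt, ans_1, ans_2)   # ans_1 is shown in position A
    second = judge(prompt, ans_2, ans_1)  # ans_1 is shown in position B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # the verdict changed with position, so treat it as inconclusive


# Stub judges standing in for a real LLM call.
def length_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


def position_biased_judge(prompt: str, a: str, b: str) -> str:
    return "A"  # always prefers whatever is shown first


if __name__ == "__main__":
    print(compare(length_judge, "Explain DNS.", "short", "a much longer answer"))
    print(compare(position_biased_judge, "Explain DNS.", "x", "y"))
```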

ARC-AGI · Evaluation

ARC-AGI (Abstraction and Reasoning Corpus) is a benchmark of grid-puzzle tasks designed to require fluid reasoning rather than memorization. Humans are commonly cited as scoring around 85%, while LLMs stayed below roughly 5% for years.

SWE-bench · Evaluation

SWE-bench is a benchmark of ~2.3K real GitHub issues from popular Python repos. The model must read the codebase, understand the issue, and write a patch that passes the tests tied to that issue's original fix without breaking tests that already passed.

Data Contamination · Evaluation

Data contamination is when benchmark questions or answers leak into a model's pretraining corpus, inflating its score because it memorized rather than reasoned.
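One simple way to look for this kind of leakage is to measure n-gram overlap between a benchmark item and a training document. The sketch below is an illustrative heuristic, not a standard tool: the `ngrams` and `overlap_ratio` helpers, the whitespace tokenization, and the default of 5-word n-grams are all assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word sequences in the text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item is shorter than n words, nothing to compare
    return len(item & ngrams(corpus_doc, n)) / len(item)


if __name__ == "__main__":
    item = "what is the capital of france and why"
    leaked = "quiz dump: what is the capital of france and why is it famous"
    clean = "a recipe for sourdough bread with a long cold fermentation"
    print(overlap_ratio(item, leaked))  # high overlap suggests possible leakage
    print(overlap_ratio(item, clean))
```

A high ratio only flags a candidate for review; paraphrased or translated copies of a question would slip past a check like this.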

Chatbot Arena · Evaluation

Chatbot Arena is a public LLM evaluation platform where anonymous users submit prompts, see two randomly chosen models' responses side by side, and vote for the better one. The votes feed a global Elo leaderboard.
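To show how a single vote can move a rating, here is a sketch of the textbook Elo update. The starting ratings and the K-factor of 32 are illustrative assumptions, and the platform's actual rating computation may differ from this simple form.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (r_a, r_b) after one head-to-head vote."""
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta  # what A gains, B loses


if __name__ == "__main__":
    # Two equally rated models: the winner gains 16 points, the loser drops 16.
    print(update(1000, 1000, a_won=True))  # (1016.0, 984.0)
    # An upset (lower-rated model wins) moves the ratings further.
    print(update(1000, 1200, a_won=True))
```

Beating a higher-rated opponent earns more points than beating an equal one, which is why a few upsets can reshuffle a leaderboard.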
