Comparison

Benchmark vs MMLU

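Benchmark and MMLU are both common AI/LLM evaluation terms, but they sit at different levels: a benchmark is the general category of standardized test, while MMLU is one specific benchmark. Here is a quick side-by-side.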

When you would reach for Benchmark

Benchmark comes up when the question is about evaluation in general: how to score models on a fixed task so they can be compared on equal footing.

Example: MMLU, a multiple-choice test covering 57 academic subjects.

When you would reach for MMLU

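MMLU comes up when the question is about a model's breadth of knowledge: how well it answers multiple-choice questions across 57 subjects, reported as an accuracy score.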

Example: GPT-4 scored 86.4% on MMLU (5-shot, original release).
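
To make "5-shot" concrete: the model is shown five solved example questions before the one it has to answer. The following is a minimal sketch in Python. The prompt layout, the function names, and the placeholder questions are all assumptions made for illustration; none of them are taken from MMLU's actual data or from any particular evaluation harness.

```python
# Hypothetical illustration of k-shot prompting for a multiple-choice benchmark.
# The questions and the prompt layout are made up, not MMLU's real data or format.

LETTERS = "ABCD"


def format_question(question, choices, answer=None):
    """Render one question; the answer letter is filled in only for solved examples."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_k_shot_prompt(examples, target, k=5):
    """Prepend the first k solved examples to the unsolved target question."""
    shots = [format_question(q, c, a) for q, c, a in examples[:k]]
    return "\n\n".join(shots + [format_question(*target)])


# Five placeholder solved examples, followed by one unsolved target question.
examples = [
    (f"Example question {i}?", ["w", "x", "y", "z"], "A") for i in range(1, 6)
]
target = ("What is 2 + 2?", ["3", "4", "5", "6"])

prompt = build_k_shot_prompt(examples, target, k=5)
```

The prompt ends with an open "Answer:" line, so the model's next token can be read as its chosen letter. Setting k=0 in this sketch would give the zero-shot version of the same question.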

Frequently asked

What is the difference between Benchmark and MMLU?

Benchmark: A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

MMLU: MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.
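
For a multiple-choice benchmark, "scoring models on a fixed task" comes down to comparing each model's predicted answer letters against the same gold answers. Below is a minimal sketch in Python, assuming plain accuracy as the metric. The function name, the model names, and the answer data are hypothetical, and real evaluation setups may aggregate differently (for example, averaging per subject).

```python
# Hypothetical sketch: score two models on the same fixed set of gold answers.
# All names and data below are invented for illustration.


def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers exactly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty question set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# The same five questions (same gold answers) are used for both models,
# which is what makes the resulting scores comparable.
gold = ["B", "A", "D", "C", "B"]
model_a = ["B", "A", "D", "A", "B"]
model_b = ["B", "C", "D", "A", "A"]

scores = {
    "model_a": accuracy(model_a, gold),
    "model_b": accuracy(model_b, gold),
}
```

Because both models answer an identical question set, the two accuracy values can be compared directly; changing the questions or the prompt setup for one model would break that comparison.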

When should I use Benchmark vs MMLU?

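Benchmark is the right concept when you are talking about evaluation in general or choosing among tests. MMLU applies when you specifically want a score for broad, multi-subject knowledge, or need to compare against published MMLU results.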

Are Benchmark and MMLU the same thing?

No. Benchmark is the general category; MMLU is one specific benchmark. Every MMLU score is a benchmark result, but a benchmark result is not necessarily an MMLU score.