Evaluation · beginner
MMLU (Massive Multitask Language Understanding)
MMLU is a benchmark of roughly 16,000 four-option multiple-choice questions spanning 57 subjects, with difficulty ranging from elementary to professional level. It is one of the most widely cited LLM benchmarks.
Explanation
Subjects range from elementary mathematics to US history to professional medicine. Each question is multiple choice, and the model picks one of A, B, C, or D. The score is plain accuracy: the fraction of questions answered correctly.
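Since the score is plain accuracy, the computation can be sketched in a few lines. The function name `mmlu_accuracy` and the `(subject, predicted, gold)` record layout are assumptions made for this illustration, not part of any official harness. The sketch returns both the accuracy over all questions and the unweighted mean of per-subject accuracies; the two can differ because subjects have different numbers of questions, so it is worth checking which one a given report uses.

```python
from collections import defaultdict


def mmlu_accuracy(records):
    """Compute overall and per-subject-averaged accuracy.

    records: iterable of (subject, predicted_letter, gold_letter) tuples.
    This record layout is hypothetical; gold letters are assumed to be
    uppercase A/B/C/D.
    """
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for subject, predicted, gold in records:
        # Normalize the prediction so "d" and " D" both count as "D".
        per_subject[subject][0] += int(predicted.strip().upper() == gold)
        per_subject[subject][1] += 1

    correct = sum(c for c, _ in per_subject.values())
    total = sum(t for _, t in per_subject.values())

    micro = correct / total  # accuracy over all questions
    macro = sum(c / t for c, t in per_subject.values()) / len(per_subject)
    return micro, macro


if __name__ == "__main__":
    # Toy records invented for illustration; not real model outputs.
    toy = [
        ("elementary_mathematics", "A", "A"),
        ("elementary_mathematics", "B", "C"),
        ("us_history", "d", "D"),
    ]
    micro, macro = mmlu_accuracy(toy)
    print(f"micro={micro:.3f} macro={macro:.3f}")
```

A model that guesses uniformly at random among four options would land near 25% under either average, which is the usual floor to compare against.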
MMLU was a useful frontier signal from 2020 to 2023. By 2024, frontier models scored above 85%, leaving the benchmark largely saturated for the top tier. Newer evals such as MMLU-Pro and GPQA have replaced it for frontier comparisons.
Examples
- GPT-4: 86.4% MMLU (5-shot, original release).
- Llama 3 70B: ~80% MMLU.
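The "5-shot" label on the GPT-4 figure means that each test question is preceded by five solved example questions from the same subject. A minimal sketch of how such a prompt might be assembled follows. The helper names `format_question` and `build_prompt` are hypothetical, the header sentence follows a commonly used MMLU prompt style, and exact formatting varies between evaluation harnesses, which is one reason scores from different sources are not always directly comparable.

```python
CHOICES = "ABCD"


def format_question(question, options, answer=None):
    """Render one question; leave the answer blank for the test item."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(subject, dev_examples, test_question, test_options, k=5):
    """Build a k-shot prompt (hypothetical helper, not an official API).

    dev_examples: list of (question, options, answer_letter) tuples used
    as solved demonstrations; only the first k are included.
    """
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    shots = [format_question(q, opts, ans) for q, opts, ans in dev_examples[:k]]
    # The final block ends with "Answer:" so the model completes the letter.
    return "\n\n".join([header, *shots, format_question(test_question, test_options)])


if __name__ == "__main__":
    # Toy items invented for illustration; not taken from the dataset.
    dev = [("What is 2 + 3?", ["4", "5", "6", "7"], "B")]
    print(build_prompt("elementary mathematics", dev,
                       "What is 4 + 4?", ["6", "7", "8", "9"], k=1))
```

The model's next token (or its highest-probability letter among A, B, C, and D) is then compared with the gold answer.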
Frequently asked
What is MMLU?
MMLU is a benchmark of roughly 16,000 four-option multiple-choice questions spanning 57 subjects, with difficulty ranging from elementary to professional level. It is one of the most widely cited LLM benchmarks.
What is an example of an MMLU score?
GPT-4 scored 86.4% on MMLU (5-shot, original release).
How is MMLU related to benchmarks?
Both are evaluation concepts: MMLU is one specific benchmark. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
Is MMLU considered beginner-level?
MMLU is generally considered beginner-level material in the AI and LLM space.