Evaluation · beginner

MMLU (Massive Multitask Language Understanding)

MMLU is a benchmark of roughly 16,000 four-option multiple-choice questions across 57 subjects, ranging in difficulty from elementary to professional level. It is one of the most widely cited LLM benchmarks.

Published May 29, 2026

Explanation

Subjects range from elementary mathematics to US history to professional medicine. Each question is multiple choice; the model picks A, B, C, or D. The score is plain accuracy: the share of questions answered correctly.
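As a rough illustration of the scoring rule, the sketch below computes accuracy from predicted letters and an answer key. The function name `mmlu_accuracy` and the toy data are hypothetical and not taken from any official evaluation harness.

```python
from typing import Sequence


def mmlu_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Compare letters case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data: 4 questions, 3 answered correctly.
preds = ["A", "c", "D", "B"]
gold = ["A", "C", "B", "B"]
print(f"accuracy = {mmlu_accuracy(preds, gold):.2%}")  # accuracy = 75.00%
```

With four options per question, random guessing lands near 25%, so scores are usually read against that floor.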

MMLU was a useful frontier signal from 2020 to 2023. By 2024, frontier models scored above 85%, and the benchmark is now largely saturated for the top tier. Newer evals such as MMLU-Pro and GPQA have replaced it for frontier comparisons.

Examples

GPT-4: 86.4% MMLU (5-shot, original release).
Llama 3 70B: ~80% MMLU.
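"5-shot" means the prompt shows five solved questions before the one being tested. The sketch below builds such a prompt for any number of shots. The header wording and layout are modeled on a commonly used MMLU prompt format and should be treated as an assumption; `Question` and `build_k_shot_prompt` are hypothetical names, and the toy questions are invented.

```python
from typing import NamedTuple, Sequence

LETTERS = "ABCD"


class Question(NamedTuple):
    text: str
    options: Sequence[str]  # exactly four answer options
    answer: str = ""        # gold letter; left empty for the test question


def format_question(q: Question, include_answer: bool) -> str:
    """Render one question with lettered options and an 'Answer:' line."""
    lines = [q.text]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, q.options)]
    lines.append(f"Answer: {q.answer}" if include_answer else "Answer:")
    return "\n".join(lines)


def build_k_shot_prompt(subject: str, shots: Sequence[Question], test: Question) -> str:
    """Concatenate a header, k solved examples, and the unsolved test question."""
    # Assumed header wording; real harnesses may phrase this differently.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    blocks = [header]
    blocks += [format_question(s, include_answer=True) for s in shots]
    blocks.append(format_question(test, include_answer=False))
    return "\n\n".join(blocks)


# Hypothetical 1-shot example; a 5-shot prompt would pass five solved questions.
shots = [Question("What is 2 + 3?", ["4", "5", "6", "7"], "B")]
test_q = Question("What is 4 + 4?", ["6", "7", "8", "9"])
prompt = build_k_shot_prompt("elementary mathematics", shots, test_q)
print(prompt)
```

The model's next token after the final "Answer:" is then compared with the gold letter. Reported scores can shift with the number of shots and the exact prompt wording, which is why results usually state the shot count.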

Frequently asked

What is an example of an MMLU score?

GPT-4: 86.4% MMLU (5-shot, original release).

How is MMLU related to benchmarks?

MMLU is itself a benchmark: a standardized test that scores models on a fixed task, letting you compare them on equal footing. HumanEval and HELM are other common examples.

Is MMLU considered a beginner-level topic?

MMLU is generally considered beginner-level material in the AI and LLM space.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

HumanEval · Evaluation

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.

Perplexity · Evaluation

Perplexity measures how "surprised" a language model is by held-out text. Lower is better. It is the natural intrinsic eval for next-token prediction.
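One common way to make "surprise" concrete is to take the exponential of the average negative log-probability the model assigned to each held-out token. The sketch below assumes natural-log probabilities are already available; the function name `perplexity` and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical case: the model gives probability 0.25 to each of 4 held-out tokens,
# which is as uncertain as a uniform guess among 4 options (perplexity close to 4).
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))
```

A model that assigns probability 1 to every token would reach the minimum perplexity of 1.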

Sources

MMLU paper (arXiv)