ModelTerms

Comparison

HumanEval vs MMLU

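HumanEval and MMLU are both LLM evaluation benchmarks, but they measure different things: HumanEval tests Python code generation, while MMLU tests multiple-choice knowledge across academic and professional subjects. Here is a quick side-by-side.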

When you would reach for HumanEval

HumanEval comes up when the question is about code generation: whether a model can write a Python function that passes its unit tests.

GPT-4: 67.0% pass@1 on HumanEval (0-shot, original release); later GPT-4 versions have been reported at roughly 88%.
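For readers unfamiliar with the pass@1 metric quoted above, here is a minimal sketch of the unbiased pass@k estimator described in the paper that introduced HumanEval. The function name and argument names are illustrative choices, not part of any official harness. For one problem, n is the number of sampled completions, c is how many of them passed the unit tests, and k is the sample budget being scored.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total completions sampled for the problem
    c: completions that passed all unit tests
    k: number of samples the metric allows (k <= n)
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    # 1 - (ways to pick k samples that all fail) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of passing samples.
print(pass_at_k(n=10, c=3, k=1))
```

The benchmark score is the mean of this value over all problems. With a single greedy sample per problem (n = 1, k = 1), pass@1 is simply the share of problems solved.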

When you would reach for MMLU

MMLU comes up when the question is about broad knowledge and reasoning: how accurately a model answers multiple-choice questions across many subjects.

GPT-4: 86.4% MMLU (5-shot, original release).
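To make "5-shot" and the accuracy figure concrete, the sketch below shows one way a few-shot multiple-choice prompt could be assembled and scored. The `Question` class, the prompt template, and the helper names are assumptions made for illustration; real evaluation harnesses use their own templates and answer-extraction rules.

```python
from dataclasses import dataclass


@dataclass
class Question:
    # Hypothetical record layout, for illustration only.
    subject: str
    prompt: str
    choices: list[str]  # four options, labelled A-D
    answer: str         # gold letter, e.g. "B"


def format_question(q: Question, with_answer: bool) -> str:
    """Render one question; solved examples include the gold letter."""
    lines = [q.prompt]
    for letter, choice in zip("ABCD", q.choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Answer: {q.answer}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(examples: list[Question], target: Question) -> str:
    """Prepend k solved examples (k = 5 for 5-shot) to the unanswered target."""
    blocks = [format_question(e, with_answer=True) for e in examples]
    blocks.append(format_question(target, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: list[str], questions: list[Question]) -> float:
    """Fraction of predicted letters that match the gold letter."""
    correct = sum(
        p.strip().upper() == q.answer for p, q in zip(predictions, questions)
    )
    return correct / len(questions)
```

A score such as 86.4% is this accuracy computed over the test questions, with the solved examples serving only as in-context demonstrations.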

Frequently asked

What is the difference between HumanEval and MMLU?

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body, and a solution counts as correct only if it passes the tests. MMLU is a benchmark of ~16K four-option multiple-choice questions across 57 subjects, from elementary to professional level, scored by accuracy. It is one of the most widely cited LLM benchmarks.
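The HumanEval task shape described above (signature and docstring in, function body out, unit tests as the judge) can be sketched as follows. The toy `add` problem is invented for illustration and is not taken from the benchmark, and the `passes` helper is a hypothetical stand-in for a real harness, which would also sandbox execution and enforce timeouts.

```python
# Prompt given to the model: a signature plus docstring (toy example).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# What the model is expected to produce: only the function body.
COMPLETION = "    return a + b\n"

# Unit tests, wrapped in a check(candidate) function.
TESTS = '''def check(candidate):
    assert candidate(1, 2) == 3
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes the unit tests.

    Warning: exec runs arbitrary code. This sketch has no sandbox or
    timeout, so only use it on code you trust.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)      # define the candidate function
        exec(tests, namespace)                    # define check()
        namespace["check"](namespace[entry_point])
    except Exception:
        return False
    return True


print(passes(PROMPT, COMPLETION, TESTS, "add"))
```

A completion is counted as correct only when every assertion holds, which is why HumanEval results are reported as pass rates rather than text-similarity scores.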

When should I use HumanEval vs MMLU?

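Use HumanEval when you are evaluating code generation, for example when comparing models for a coding assistant. Use MMLU when you are evaluating general knowledge and reasoning across many domains. The two are complementary, so model reports often cite both.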

Are HumanEval and MMLU the same thing?

No. Both are evaluation benchmarks, but HumanEval scores generated code by running unit tests, while MMLU scores multiple-choice answers by accuracy. They are related but measure different capabilities.