Evaluation · beginner

HumanEval

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
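To make the task format concrete, here is a minimal sketch of one problem and a pass/fail check. The `add` problem is a made-up illustration, not an actual dataset entry, and `passes` is a hypothetical helper. The field names (`task_id`, `prompt`, `entry_point`, `test`) and the `check(candidate)` convention follow the layout of the published dataset, as far as this sketch assumes.

```python
# Hypothetical problem written in the HumanEval style (not a real dataset entry).
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

TEST = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''

problem = {
    "task_id": "Example/0",   # illustrative id
    "prompt": PROMPT,         # signature + docstring shown to the model
    "entry_point": "add",     # name of the function under test
    "test": TEST,             # unit tests, exposed as check(candidate)
}


def passes(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion passes every unit test.

    Assumption: exec() in-process is acceptable for this toy example.
    Model-generated code is untrusted, so a real harness should run it
    in a sandbox with a timeout instead.
    """
    program = (
        problem["prompt"]
        + completion
        + "\n\n"
        + problem["test"]
        + f"\ncheck({problem['entry_point']})\n"
    )
    try:
        exec(program, {})  # defines the function, then runs check() on it
    except Exception:
        return False  # failed assertion, runtime error, or syntax error
    return True


good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes(problem, good_completion))
print(passes(problem, bad_completion))
```

A completion counts as correct only if every assertion in `check` passes; a single failing test, exception, or syntax error marks the whole sample as failed.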

Published May 29, 2026

Explanation

The score is pass@k: the percentage of problems for which at least one of k generated solutions passes all unit tests. pass@1, where the model gets a single attempt per problem, is the strictest setting.
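In practice pass@k is usually estimated rather than measured with exactly k samples: generate n ≥ k samples per problem, count the c that pass, and compute 1 − C(n−c, k) / C(n, k), the unbiased estimator described in the HumanEval paper. The sketch below is one way to write that down; the function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, not taken from any particular library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed all tests,
    k: attempt budget being scored. Computes 1 - C(n-c, k) / C(n, k),
    i.e. one minus the chance that k samples drawn without replacement
    from the n are all failures.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Example: 3 problems, 10 samples each, with 10, 3, and 0 passing samples.
results = [(10, 10), (10, 3), (10, 0)]
print(f"pass@1 = {benchmark_pass_at_k(results, 1):.3f}")
print(f"pass@5 = {benchmark_pass_at_k(results, 5):.3f}")
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, which is why pass@1 is the lowest of the pass@k scores for a given set of samples.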

HumanEval was the dominant code benchmark for years and is now largely saturated: frontier models score 80-90% or higher. The community has moved to larger, harder code benchmarks such as LiveCodeBench (regularly refreshed with new problems to limit training-data contamination) and SWE-bench (real GitHub issues).

Examples

GPT-4: ~88% pass@1 on HumanEval.
CodeLlama 34B: ~48% pass@1.

Frequently asked

What is an example HumanEval score?

GPT-4: ~88% pass@1 on HumanEval.

How is HumanEval related to benchmarks?

HumanEval is itself a benchmark: a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU and HELM are other common examples.

Is HumanEval considered beginner?

HumanEval is generally considered beginner-level material in the AI and LLM space.

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

MMLU · Evaluation

MMLU is a benchmark of ~16K multiple-choice questions across 57 subjects from elementary to professional. It is one of the most widely cited LLM benchmarks.

Fine-tuning · Training

Fine-tuning continues training a pretrained model on a smaller, task-specific dataset, adjusting its weights to specialize behavior or knowledge.

SWE-bench · Evaluation

SWE-bench is a benchmark of ~2.3K real GitHub issues from popular Python repos. The model must read the codebase, understand the issue, and write a patch that makes the repository's tests pass, including the tests that failed before the fix.

Sources

HumanEval paper: Chen et al. (2021), "Evaluating Large Language Models Trained on Code" (arXiv:2107.03374)