ModelTerms

Evaluation · beginner

HumanEval

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
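To make the problem format concrete, here is a minimal sketch of a record shaped like a HumanEval problem, plus a tiny checker. The problem itself (add_positive) is invented for illustration and is not in the real dataset, and passes_tests is a hypothetical helper, not the official harness. The field names (task_id, prompt, entry_point, test) follow the published dataset layout.

```python
# A made-up problem in the shape of a HumanEval record (not from the real dataset).
problem = {
    "task_id": "Example/0",  # hypothetical ID
    "prompt": (
        "def add_positive(numbers):\n"
        '    """Return the sum of the positive numbers in the list.\n'
        "    >>> add_positive([1, -2, 3])\n"
        "    4\n"
        '    """\n'
    ),
    "entry_point": "add_positive",
    "test": (
        "def check(candidate):\n"
        "    assert candidate([1, -2, 3]) == 4\n"
        "    assert candidate([]) == 0\n"
        "    assert candidate([-5, -1]) == 0\n"
    ),
}


def passes_tests(problem, completion):
    """Return True if prompt + completion passes every assertion in the tests."""
    program = problem["prompt"] + completion + "\n" + problem["test"]
    namespace = {}
    try:
        # Running generated code with exec is unsafe; real harnesses sandbox it
        # and enforce a timeout. This sketch does neither.
        exec(program, namespace)
        namespace["check"](namespace[problem["entry_point"]])
        return True
    except Exception:
        return False


# Two candidate function bodies a model might produce.
good = "    return sum(n for n in numbers if n > 0)\n"
bad = "    return sum(numbers)\n"

print(passes_tests(problem, good))
print(passes_tests(problem, bad))
```

The model only sees the prompt and returns the function body. A completion counts as correct only if every assertion in the tests holds.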

Explanation

The score is reported as pass@k: the percentage of problems for which at least one of k generated solutions passes all unit tests. pass@1, which gives the model a single attempt per problem, is the strictest setting.
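In practice pass@k is usually estimated rather than measured with exactly k samples: generate n ≥ k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k), the unbiased estimator from the paper that introduced HumanEval. The sketch below assumes that setup. The function names and the sample counts are illustrative, not taken from any real evaluation.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample is C(n-c, k) / C(n, k); pass@k is the complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results, k):
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative counts for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(f"pass@1 = {benchmark_score(results, 1):.3f}")
print(f"pass@5 = {benchmark_score(results, 5):.3f}")
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass, which is why pass@1 is the hardest setting to score well on.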

HumanEval was the dominant code benchmark for years but is now largely saturated: frontier models score 80-90% or higher. The community has moved to larger, harder code benchmarks such as LiveCodeBench (regularly refreshed with new problems to limit contamination) and SWE-bench (real GitHub issues).

Examples

  • GPT-4: ~88% pass@1 on HumanEval in later evaluations (the original GPT-4 technical report listed 67%).
  • CodeLlama 34B: ~48% pass@1.

Frequently asked

What is an example of a HumanEval result?

GPT-4 scores about 88% pass@1 on HumanEval, while CodeLlama 34B scores about 48%.

How is HumanEval related to benchmarks?

HumanEval is one specific benchmark. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Is HumanEval considered a beginner topic?

Yes. HumanEval is generally considered beginner-level material in the AI and LLM space.
