Evaluation · beginner
HumanEval
HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
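As a rough illustration of that structure, the sketch below builds one problem in the same shape: a prompt (signature plus docstring), a model completion (the function body), and a test that defines `check(candidate)`. The problem `running_max`, the completion, and the helper `passes` are hypothetical and written for this example; they are not taken from the benchmark or its official harness.

```python
# Hypothetical problem in the style of a HumanEval task (not a real benchmark item).
PROMPT = '''def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the largest value in numbers[:i + 1].
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# What a model is asked to produce: only the function body.
COMPLETION = '''    result = []
    best = None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result
'''

# Unit tests, wrapped in a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate([]) == []
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion defines a function that passes the tests.

    Assumption: exec() is used only to keep this sketch short. Model-generated
    code is untrusted, so a real evaluation should run it in a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)     # define the candidate function
        exec(test, namespace)                    # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


print(passes(PROMPT, COMPLETION, TEST, "running_max"))
```

A completion counts as correct only if every assertion in the test passes; a single failed assertion or runtime error marks the whole sample as failed.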
Explanation
The score is pass@k: the percentage of problems for which at least one of k generated solutions passes all unit tests. pass@1, where the model gets a single attempt per problem, is the strictest setting.
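In practice pass@k is usually estimated rather than measured with exactly k samples: generate n >= k samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k), the unbiased estimator introduced alongside HumanEval. The function below is a minimal sketch of that formula; the name `pass_at_k` and the example numbers are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random subset of k samples contains no passing sample.
    prob_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - prob_all_fail


# Example with made-up counts: 10 samples, 3 of which pass.
print(pass_at_k(n=10, c=3, k=1))  # fraction of single samples that pass
print(pass_at_k(n=10, c=3, k=5))  # chance that at least one of 5 samples passes
```

The benchmark score is the mean of this per-problem value over all 164 problems. For k = 1 the estimator reduces to c / n, the fraction of samples that pass.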
HumanEval was the dominant code benchmark for years and is now largely saturated: frontier models score 80-90% or higher, which leaves little room to tell them apart. The community has moved to larger, harder code benchmarks such as LiveCodeBench (regularly refreshed with new problems to limit training-data contamination) and SWE-bench (real GitHub issues).
Examples
- GPT-4: ~88% pass@1 on HumanEval for later model versions; the original GPT-4 technical report listed 67%.
- CodeLlama 34B: ~48% pass@1.
Frequently asked
What is HumanEval?
HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.
What is an example of a HumanEval score?
GPT-4: ~88% pass@1 on HumanEval for later model versions; the original GPT-4 technical report listed 67%.
How is HumanEval related to benchmarks?
HumanEval is itself a benchmark, one focused on code generation. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.
Is HumanEval considered a beginner-level topic?
HumanEval is generally considered beginner-level material in the AI and LLM space.