Comparison

HumanEval vs SWE-bench

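HumanEval and SWE-bench are both benchmarks for evaluating code-generating LLMs, but they measure different skills: HumanEval tests writing a single function from a docstring, while SWE-bench tests resolving real GitHub issues inside an existing codebase. Here is a quick side-by-side.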

When you would reach for HumanEval

HumanEval comes up when the question is how well a model writes a short, self-contained Python function from a signature and docstring.

GPT-4: ~88% pass@1 on HumanEval.
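A pass@1 figure like the one above is an average over problems of the chance that a sampled completion passes the unit tests. The sketch below uses the unbiased pass@k estimator from the paper that introduced HumanEval; the per-problem counts in `results` are made-up illustration data, not real model results, and `benchmark_pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustration data only: three problems, 10 samples each.
results = [(10, 10), (10, 5), (10, 0)]
score = benchmark_pass_at_k(results, k=1)
print(f"pass@1 = {score:.2f}")
```

For k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the mean fraction of passing samples.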

When you would reach for SWE-bench

SWE-bench comes up when the question is whether a model or agent can resolve real issues in an existing repository.

A SWE-agent run patching a Django bug, verified by Django's own test suite.
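As a rough sketch of what "verified by the test suite" means, SWE-bench counts an instance as resolved when the tests the fix is supposed to repair now pass and the tests that already passed still pass. The class, function, and test names below are hypothetical and only illustrate that criterion; this is not the official harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical, simplified view of one SWE-bench-style task."""
    instance_id: str
    fail_to_pass: frozenset[str]  # tests that fail before the fix and must pass after
    pass_to_pass: frozenset[str]  # tests that passed before and must keep passing


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """True if every required test is among those that passed after patching."""
    return task.fail_to_pass <= passed_tests and task.pass_to_pass <= passed_tests


# Made-up test names for illustration.
task = TaskInstance(
    instance_id="example-1",
    fail_to_pass=frozenset({"test_bug_is_fixed"}),
    pass_to_pass=frozenset({"test_existing_a", "test_existing_b"}),
)

good_run = {"test_bug_is_fixed", "test_existing_a", "test_existing_b"}
regressed_run = {"test_bug_is_fixed", "test_existing_a"}  # broke test_existing_b

print(is_resolved(task, good_run))
print(is_resolved(task, regressed_run))
```

A patch that fixes the bug but breaks an unrelated test does not count, which is why the second run is rejected.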

Frequently asked

What is the difference between HumanEval and SWE-bench?

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests; the model writes the function body. SWE-bench is a benchmark of ~2.3K real GitHub issues from popular Python repos; the model must read the codebase, understand the issue, and write a patch that passes the repository's tests.
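To make the HumanEval task shape concrete, here is a minimal sketch of one problem in that style: a prompt (signature plus docstring), a model-written body, and a `check` function holding the unit tests. The `add` problem and the `passes` helper are invented for illustration and are not taken from the benchmark. Running generated code with `exec` is unsafe outside a sandbox, so treat this as a toy.

```python
# Prompt given to the model: function signature and docstring only.
PROMPT = '''def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# What the model is expected to produce: the function body.
COMPLETION = "    return a + b\n"

# Unit tests, wrapped in a check(candidate) function.
TEST = '''def check(candidate):
    assert candidate(2, 3) == 5
    assert candidate(-1, 1) == 0
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run prompt + completion, then the tests; any exception counts as a failure."""
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + test, namespace)  # unsafe for untrusted code
        namespace["check"](namespace[entry_point])
    except Exception:
        return False
    return True


print(passes(PROMPT, COMPLETION, TEST, "add"))
print(passes(PROMPT, "    return a - b\n", TEST, "add"))
```

The contrast with SWE-bench is visible in the inputs: everything the model needs here fits in a few lines, with no surrounding codebase to navigate.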

When should I use HumanEval vs SWE-bench?

Use HumanEval when you want a quick, function-level measure of code generation. Use SWE-bench when you want to measure repository-level work: navigating a codebase and fixing a reported issue.

Are HumanEval and SWE-bench the same thing?

No. Both are code-generation benchmarks, but HumanEval measures function-level synthesis while SWE-bench measures repository-level issue resolution. They are related but test different capabilities.