ModelTerms

Evaluation · intermediate

SWE-bench

SWE-bench is a benchmark of ~2.3K real GitHub issues drawn from 12 popular Python repositories. The model must read the codebase, understand the issue, and write a patch that passes the repository's tests, including the tests that accompanied the original human-written fix.

Explanation

Each task hands the model a repository at a specific commit and a natural-language issue. The model produces a patch; the patch is applied and the repo's tests are run. A task counts as resolved when the tests that failed before the fix now pass and the tests that already passed still pass.
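The pass check described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the function name `is_resolved`, the test names, and the dictionary of outcomes are all hypothetical. The two test groups mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields of the SWE-bench dataset, and the sketch assumes a test missing from the results counts as a failure.

```python
from typing import Dict, Iterable


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True if the patched repo satisfies both test groups.

    test_results maps a test name to True (passed) or False (failed).
    A test that is missing from the results is treated as failed
    (an assumption of this sketch).
    """
    # Tests that failed before the fix must now pass.
    fixed = all(test_results.get(name, False) for name in fail_to_pass)
    # Tests that passed before the fix must not have been broken.
    unbroken = all(test_results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


# Hypothetical outcomes after applying a model's patch.
results = {
    "test_bug_repro": True,
    "test_existing_a": True,
    "test_existing_b": False,
}

print(is_resolved(results, ["test_bug_repro"], ["test_existing_a"]))
print(is_resolved(results, ["test_bug_repro"], ["test_existing_a", "test_existing_b"]))
```

The first call prints `True`; the second prints `False`, because the patch fixed the target test but broke one that used to pass.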

SWE-bench Verified is a 500-task, human-validated subset that filters out underspecified issues and unreliable tests. Public leaderboards show frontier agents (Claude, GPT, and open-source agents such as SWE-agent) climbing past 60% on Verified by mid-2025, up from under 5% on the original benchmark about two years earlier.
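To make the percentages concrete, a leaderboard score is the share of tasks resolved. The sketch below is illustrative only: `resolved_rate` is a hypothetical helper, and the outcome list is made-up data rather than real results.

```python
from typing import Sequence


def resolved_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks resolved.

    outcomes holds one boolean per task: True if the task was resolved.
    """
    if not outcomes:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up example: 300 of the 500 Verified tasks resolved.
outcomes = [True] * 300 + [False] * 200
print(resolved_rate(outcomes))  # 60.0
```

On the 500-task Verified subset, a score above 60% therefore means more than 300 resolved tasks.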

It has become a benchmark of choice for measuring "real coding agent" quality.

Examples

  • A SWE-agent run patching a Django bug, verified by Django's own test suite.
  • Claude Sonnet 4 scoring >60% on SWE-bench Verified.

Frequently asked

What is SWE-bench?

SWE-bench is a benchmark of ~2.3K real GitHub issues drawn from 12 popular Python repositories. The model must read the codebase, understand the issue, and write a patch that passes the repository's tests, including the tests that accompanied the original human-written fix.

What is an example of SWE-bench?

A SWE-agent run patching a Django bug, verified by Django's own test suite.

How is SWE-bench related to Benchmark?

SWE-bench is one specific benchmark, focused on software engineering tasks. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are other common examples.

Is SWE-bench considered intermediate?

SWE-bench is generally considered intermediate-level material in the AI and LLM space.
