Evaluation · intermediate

SWE-bench

SWE-bench is a benchmark of ~2.3K real GitHub issues drawn from 12 popular Python repositories. The model must read the codebase, understand the bug, and write a patch that passes the repository's tests, including the tests written for the real fix.

Published May 30, 2026

Explanation

Each task hands the model a repository at a specific commit and a natural-language issue. The model produces a patch; the patch is applied and the repo's tests are run. A task counts as resolved when the tests that failed before the fix now pass and the tests that already passed still pass.
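The pass criterion can be sketched in a few lines of Python. This is a minimal illustration, not the official harness: it assumes the test run has already happened and produced a set of passing test names, and the names `TaskTests` and `is_resolved` are hypothetical. The two test groups mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields in the published dataset.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTests:
    """Tests that decide one task's outcome (illustrative type, not the harness API)."""

    fail_to_pass: frozenset[str]  # failed before the fix; must pass after the patch
    pass_to_pass: frozenset[str]  # passed before the fix; must still pass afterwards


def is_resolved(tests: TaskTests, passed_after_patch: set[str]) -> bool:
    """Return True only if every required test passes once the patch is applied."""
    required = tests.fail_to_pass | tests.pass_to_pass
    return required <= passed_after_patch  # subset check


# Made-up test names, for illustration only.
task = TaskTests(
    fail_to_pass=frozenset({"test_bug_is_fixed"}),
    pass_to_pass=frozenset({"test_existing_behavior"}),
)
print(is_resolved(task, {"test_bug_is_fixed", "test_existing_behavior"}))  # True
print(is_resolved(task, {"test_bug_is_fixed"}))  # False: the patch broke an existing test
```

Requiring both groups means a patch that fixes the bug while breaking unrelated behavior does not count as a pass.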

SWE-bench Verified is a 500-task curated subset that filters out underspecified issues. Public leaderboards show frontier agents (Claude, GPT, and open-source agents such as SWE-agent) climbing past 60% on Verified by mid-2025, up from under 5% two years earlier.
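A leaderboard percentage is the share of tasks resolved. The sketch below assumes one boolean outcome per task; the function name `resolve_rate` and the 300-of-500 split are made up for illustration and are not reported results.

```python
def resolve_rate(resolved_flags: list[bool]) -> float:
    """Percentage of tasks resolved, given one True/False outcome per task."""
    if not resolved_flags:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(resolved_flags) / len(resolved_flags)


# Hypothetical run over a 500-task subset: 300 resolved, 200 not.
outcomes = [True] * 300 + [False] * 200
print(f"{resolve_rate(outcomes):.1f}%")  # 60.0%
```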

It has become the benchmark of choice for measuring "real coding agent" quality.

Examples

A SWE-agent run patching a Django bug, verified by Django's own test suite.
Claude Sonnet 4 scoring >60% on SWE-bench Verified.

Frequently asked

What is SWE-bench?

SWE-bench is a benchmark of ~2.3K real GitHub issues drawn from 12 popular Python repositories. The model must read the codebase, understand the bug, and write a patch that passes the repository's tests, including the tests written for the real fix.

What is an example of SWE-bench?

A SWE-agent run patching a Django bug, verified by Django's own test suite.

How is SWE-bench related to benchmarks in general?

SWE-bench is one specific benchmark in the Evaluation category. A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Is SWE-bench considered intermediate?

SWE-bench is generally considered intermediate-level material in the AI and LLM space.

Related terms

Benchmark · Evaluation

A benchmark is a standardized test that scores models on a fixed task, letting you compare them on equal footing. MMLU, HumanEval, and HELM are common examples.

Agentic Coding · Agents & Tools

Agentic coding is an LLM-driven workflow where the model reads code, plans changes, edits files, runs commands, and iterates against feedback — autonomously closing tasks rather than just suggesting code.

HumanEval · Evaluation

HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests. The model writes the function body.

Agent · Agents & Tools

An AI agent is an LLM-driven system that decides which actions to take, executes them via tools, observes the results, and iterates until a goal is met.
