SWE-bench explained: what it measures and why frontier labs care
SWE-bench has become the benchmark that frontier labs cite most prominently. Here's how it works, why it's credible, and what the scores actually tell you.
SWE-bench evaluates a model's ability to resolve real GitHub issues from popular open-source Python repositories. The model receives the issue description and the repository state at the time the issue was filed, then must produce a patch that makes the previously failing tests pass without breaking the rest of the test suite.
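As a rough illustration, a task instance pairs an issue with a repository snapshot and a set of tests the fix must satisfy. This is a minimal sketch, not the official dataset schema; the field names and values below are illustrative:

```python
# Hypothetical task instance; field names are illustrative, not the
# official SWE-bench schema.
task = {
    "repo": "astropy/astropy",            # open-source Python project
    "base_commit": "abc123",              # repo state when the issue was filed
    "problem_statement": "Unit conversion fails for ...",  # the issue text
    "fail_to_pass": ["test_units.py::test_convert"],       # tests the patch must fix
}

def build_prompt(task: dict) -> str:
    """Assemble the model's input: the issue text plus repository context."""
    return (
        f"Repository: {task['repo']} @ {task['base_commit']}\n"
        f"Issue:\n{task['problem_statement']}\n"
        "Produce a unified diff that resolves this issue."
    )
```

The model's output is then applied as a patch and graded by rerunning the tests.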
Why it matters
Unlike multiple-choice benchmarks, SWE-bench requires the model to write real, executable code that solves a real problem. There's no way to guess: either the patch works or it doesn't. This makes it one of the most externally valid coding benchmarks available.
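The grading logic can be sketched as a pure pass/fail check: the patch must make the previously failing tests pass while keeping the already-passing tests green. This is a simplified sketch, assuming a `run_tests` callable that the real evaluation harness would implement by actually executing pytest in the patched repository:

```python
from typing import Callable

def is_resolved(run_tests: Callable[[list[str]], bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Binary grading sketch: the candidate patch resolves the issue only if
    the tests that failed before the fix now pass (fail_to_pass) AND the
    tests that already passed still pass (pass_to_pass). No partial credit."""
    return run_tests(fail_to_pass) and run_tests(pass_to_pass)
```

Because the outcome is a boolean per issue, a model's headline score is simply the fraction of issues it resolves.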
The Verified subset (the one labs typically cite) filters the dataset to issues that human annotators confirmed are solvable and clearly stated, reducing noise from ambiguous or under-specified tasks.
Current scores (as of mid-2025)
Claude Opus 4 leads at 52%, followed by GPT-5 at 48% (a third-party evaluation) and Claude Sonnet 4 at 44%. These numbers come from scaffolded runs: the model typically has access to file listing, code reading, and test execution tools through an agent harness.
Caveats
Scaffolding differences cause large score variation across published numbers. A model run with a sophisticated agent framework (file search, iterative debugging, test feedback loops) will score substantially higher than the same model run in a simple single-turn setup. When comparing numbers across papers and announcements, the scaffolding setup matters as much as the model itself.
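To make the scaffolding point concrete, here is a minimal sketch of the kind of agent loop a harness provides. It is not any published framework's implementation; the tool names and the action format are hypothetical:

```python
def agent_run(model, tools, issue, max_steps=30):
    """Minimal agent-harness sketch: the model iterates with tool feedback
    (e.g. listing files, reading code, running tests) instead of answering
    in a single turn. `model` maps the conversation history to an action;
    `tools` maps illustrative tool names to callables."""
    history = [issue]
    for _ in range(max_steps):
        action = model(history)  # e.g. {"tool": "run_tests", "args": {...}}
        if action["tool"] == "submit":
            return action["args"]["patch"]      # final candidate patch
        observation = tools[action["tool"]](**action["args"])
        history.append(observation)             # feed tool output back in
    return None  # step budget exhausted without a patch
```

The step budget, tool set, and feedback format all vary across published harnesses, which is exactly why scores for the same model can diverge so widely.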
What it doesn't measure
SWE-bench is Python-only and focused on bug fixes in existing codebases. It doesn't measure feature implementation from scratch, multi-language work, or documentation quality. Its Python-only scope also introduces an ecosystem bias: models trained more heavily on Python will likely score better.