SWE-bench Verified
Coding · % resolved
Real GitHub issues solved end-to-end by the model.
At a glance
🏆 Top score: 72.7 % resolved (Claude Sonnet 4.6)
Total results
19
Models tested
19
Providers
7
Verified · Self-reported
16 · 3
Average
56.72 % resolved
Median
53.6 % resolved
Range
32 – 72.7 % resolved
Latest result
Apr 18, 2026
Score distribution
19 results across 10 equal-width score bands from 32.0 to 72.7 % resolved; counts per band (low to high): 1, 1, 2, 1, 3, 3, 0, 1, 1, 6.
Methodology
The model must generate a patch that resolves a real GitHub issue. The Verified subset contains only issues whose solvability has been human-confirmed.
Limitations
Scaffolding and harness differences cause large variation across published scores, so numbers from different reports are not directly comparable.
By provider
- Anthropic · 6 models · Best: 72.7 % resolved (Claude Sonnet 4.6) · Average: 63.63 % resolved
- OpenAI · 7 models · Best: 71.7 % resolved (o3-pro) · Average: 56.49 % resolved
- Google · 2 models · Best: 63.2 % resolved (Gemini 2.5 Pro) · Average: 58.35 % resolved
- Mistral AI · 1 model · Best: 53.6 % resolved (Devstral) · Average: 53.6 % resolved
- xAI · 1 model · Best: 50 % resolved (Grok 3) · Average: 50 % resolved
- DeepSeek · 1 model · Best: 42 % resolved (DeepSeek V3 (2506)) · Average: 42 % resolved
- Meta · 1 model · Best: 38.2 % resolved (Llama 4 Maverick) · Average: 38.2 % resolved
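The per-provider aggregates above follow from grouping the full leaderboard rows by provider; a minimal sketch in Python, with the rows transcribed from the table on this page:

```python
from collections import defaultdict
from statistics import mean

# (model, provider, % resolved) rows transcribed from the full leaderboard
results = [
    ("Claude Sonnet 4.6", "Anthropic", 72.7),
    ("Claude Opus 4.5", "Anthropic", 72.5),
    ("o3-pro", "OpenAI", 71.7),
    ("o3", "OpenAI", 71.7),
    ("Claude 3.7 Sonnet", "Anthropic", 70.3),
    ("Claude Sonnet 4.5", "Anthropic", 70.3),
    ("o4-mini", "OpenAI", 68.1),
    ("Gemini 2.5 Pro", "Google", 63.2),
    ("GPT-4.1", "OpenAI", 54.6),
    ("Devstral", "Mistral AI", 53.6),
    ("Gemini 2.5 Flash", "Google", 53.5),
    ("Claude Opus 4", "Anthropic", 52.0),
    ("Grok 3", "xAI", 50.0),
    ("o3-mini", "OpenAI", 49.3),
    ("GPT-5", "OpenAI", 48.0),
    ("Claude Sonnet 4", "Anthropic", 44.0),
    ("DeepSeek V3 (2506)", "DeepSeek", 42.0),
    ("Llama 4 Maverick", "Meta", 38.2),
    ("GPT-4.1 mini", "OpenAI", 32.0),
]

# Group scores by provider
by_provider = defaultdict(list)
for _model, provider, score in results:
    by_provider[provider].append(score)

# Rank providers by their best single result, matching the list above
for provider, scores in sorted(by_provider.items(), key=lambda kv: -max(kv[1])):
    print(f"{provider}: {len(scores)} models, "
          f"best {max(scores)}, average {mean(scores):.2f} % resolved")
```

Running this reproduces the figures shown above, e.g. Anthropic: 6 models, best 72.7, average 63.63 % resolved.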
Full leaderboard
| # | Model | Provider | Score (% resolved) |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 72.7 |
| 2 | Claude Opus 4.5 | Anthropic | 72.5 |
| 3 | o3-pro | OpenAI | 71.7 |
| 4 | o3 | OpenAI | 71.7 |
| 5 | Claude 3.7 Sonnet | Anthropic | 70.3 |
| 6 | Claude Sonnet 4.5 | Anthropic | 70.3 |
| 7 | o4-mini | OpenAI | 68.1 |
| 8 | Gemini 2.5 Pro | Google | 63.2 |
| 9 | GPT-4.1 | OpenAI | 54.6 |
| 10 | Devstral | Mistral AI | 53.6 |
| 11 | Gemini 2.5 Flash | Google | 53.5 |
| 12 | Claude Opus 4 | Anthropic | 52 |
| 13 | Grok 3 | xAI | 50 |
| 14 | o3-mini | OpenAI | 49.3 |
| 15 | GPT-5 | OpenAI | 48 |
| 16 | Claude Sonnet 4 | Anthropic | 44 |
| 17 | DeepSeek V3 (2506) | DeepSeek | 42 |
| 18 | Llama 4 Maverick | Meta | 38.2 |
| 19 | GPT-4.1 mini | OpenAI | 32 |
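The at-a-glance summary statistics can be reproduced directly from the 19 leaderboard scores; a minimal sketch in Python:

```python
from statistics import mean, median

# All 19 leaderboard scores (% resolved), as listed in the table above
scores = [72.7, 72.5, 71.7, 71.7, 70.3, 70.3, 68.1, 63.2, 54.6,
          53.6, 53.5, 52.0, 50.0, 49.3, 48.0, 44.0, 42.0, 38.2, 32.0]

print(f"Total results: {len(scores)}")                     # 19
print(f"Average: {mean(scores):.2f} % resolved")           # 56.72
print(f"Median: {median(scores)} % resolved")              # 53.6
print(f"Range: {min(scores)} - {max(scores)} % resolved")  # 32.0 - 72.7
```

The output matches the Average (56.72 % resolved), Median (53.6 % resolved), and Range (32 – 72.7 % resolved) reported at the top of the page.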