SWE-bench Verified

Category: Coding · Metric: % resolved

Real GitHub issues solved end-to-end by the model.

At a glance

🏆 Top score: 72.7 % resolved (Claude Sonnet 4.6, Anthropic)
Total results: 19
Models tested: 19
Providers: 7
Verified / self-reported: 16 / 3
Average: 56.72 % resolved
Median: 53.6 % resolved
Range: 32–72.7 % resolved
Latest result: Apr 18, 2026
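
The aggregate figures above can be reproduced directly from the 19 individual scores in the leaderboard table; a quick check in Python:

```python
import statistics

# Scores (% resolved) transcribed from the leaderboard table on this page.
scores = [72.7, 72.5, 71.7, 71.7, 70.3, 70.3, 68.1, 63.2, 54.6, 53.6,
          53.5, 52, 50, 49.3, 48, 44, 42, 38.2, 32]

average = round(statistics.mean(scores), 2)  # arithmetic mean, rounded to 2 dp
median = statistics.median(scores)           # middle value of the 19 scores
lo, hi = min(scores), max(scores)            # range endpoints

print(average, median, lo, hi)
```

Running this yields the 56.72 average, 53.6 median, and 32–72.7 range reported above.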

Score distribution

[Histogram: 19 results across 10 score bands, spanning 32.0 to 72.7 % resolved.]

Methodology

The model must generate a patch that resolves a real GitHub issue end-to-end. The Verified subset contains only issues whose solvability has been confirmed by human annotators.
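
A simplified sketch of the pass/fail criterion is below. The real harness applies the model's patch inside a container and runs the repository's test suite; an instance counts as resolved only if the designated FAIL_TO_PASS tests now pass and the PASS_TO_PASS tests still pass. The function here is an illustrative reduction of that rule, not the actual harness code.

```python
def is_resolved(fail_to_pass: list[bool], pass_to_pass: list[bool]) -> bool:
    """True only if every previously-failing target test now passes
    (FAIL_TO_PASS) and no previously-passing test was broken
    (PASS_TO_PASS)."""
    return all(fail_to_pass) and all(pass_to_pass)

# Example: the patch fixes both target tests but breaks a regression test,
# so the instance does not count as resolved.
print(is_resolved([True, True], [True, False]))  # False
```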

Limitations

Scores depend heavily on the agent scaffolding and evaluation harness used, so published numbers for the same model can vary widely.

By provider
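
The original per-provider chart did not survive extraction, but a best-score-per-provider breakdown can be recomputed from the leaderboard data; a minimal sketch:

```python
# (provider, score) pairs transcribed from the leaderboard table on this page.
results = [
    ("Anthropic", 72.7), ("Anthropic", 72.5), ("OpenAI", 71.7), ("OpenAI", 71.7),
    ("Anthropic", 70.3), ("Anthropic", 70.3), ("OpenAI", 68.1), ("Google", 63.2),
    ("OpenAI", 54.6), ("Mistral AI", 53.6), ("Google", 53.5), ("Anthropic", 52),
    ("xAI", 50), ("OpenAI", 49.3), ("OpenAI", 48), ("Anthropic", 44),
    ("DeepSeek", 42), ("Meta", 38.2), ("OpenAI", 32),
]

# Keep each provider's highest score.
best: dict[str, float] = {}
for provider, score in results:
    best[provider] = max(best.get(provider, 0.0), score)

# Print providers from strongest to weakest best score.
for provider, score in sorted(best.items(), key=lambda kv: -kv[1]):
    print(f"{provider}: {score}")
```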

Full leaderboard

| # | Model | Provider | Score (% resolved) |
|---|-------|----------|--------------------|
| 1 | Claude Sonnet 4.6 | Anthropic | 72.7 |
| 2 | Claude Opus 4.5 | Anthropic | 72.5 |
| 3 | o3-pro | OpenAI | 71.7 |
| 4 | o3 | OpenAI | 71.7 |
| 5 | Claude 3.7 Sonnet | Anthropic | 70.3 |
| 6 | Claude Sonnet 4.5 | Anthropic | 70.3 |
| 7 | o4-mini | OpenAI | 68.1 |
| 8 | Gemini 2.5 Pro | Google | 63.2 |
| 9 | GPT-4.1 | OpenAI | 54.6 |
| 10 | Devstral | Mistral AI | 53.6 |
| 11 | Gemini 2.5 Flash | Google | 53.5 |
| 12 | Claude Opus 4 | Anthropic | 52 |
| 13 | Grok 3 | xAI | 50 |
| 14 | o3-mini | OpenAI | 49.3 |
| 15 | GPT-5 | OpenAI | 48 |
| 16 | Claude Sonnet 4 | Anthropic | 44 |
| 17 | DeepSeek V3 (2506) | DeepSeek | 42 |
| 18 | Llama 4 Maverick | Meta | 38.2 |
| 19 | GPT-4.1 mini | OpenAI | 32 |
