Aider Polyglot
Coding · % pass@2. Real-world coding edits across 6 programming languages; measures whether the model produces a correct, test-passing edit within two attempts.
At a glance
- Total results: 13
- Models tested: 13
- Providers: 7
- Verified · Self-reported: 13 · 0
- Average: 48.95 % pass@2
- Median: 53.8 % pass@2
- Range: 3.6 – 88 % pass@2
- Latest result: Apr 18, 2026
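As a sanity check, the aggregate figures above follow directly from the 13 scores in the full leaderboard below; a minimal sketch in Python:

```python
from statistics import mean, median

# % pass@2 scores, copied from the full leaderboard table below
scores = [88, 72.9, 72, 61.7, 56.9, 55.1, 53.8,
          53.3, 52.4, 40, 15.6, 11.1, 3.6]

print(round(mean(scores), 2))    # 48.95 (Average)
print(median(scores))            # 53.8  (Median)
print(min(scores), max(scores))  # 3.6 88 (Range)
```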
Score distribution
[Histogram: 13 results across 10 equal score bands from 3.6 to 88.0 % pass@2; per-band counts 2, 1, 0, 0, 1, 3, 3, 0, 2, 1.]
Methodology
225 Exercism-style coding exercises in C++, Go, Java, JavaScript, Python, and Rust. Score = pass_rate_2: the percentage of exercises solved within two attempts, where the model is shown the failing test output after its first attempt.
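For concreteness, here is a minimal sketch of the pass_rate_2 arithmetic. The record format below is illustrative only, not Aider's actual result schema:

```python
from dataclasses import dataclass

@dataclass
class ExerciseResult:
    # Hypothetical per-exercise record; Aider's harness stores
    # results in its own format.
    passed_attempt_1: bool  # tests pass on the first try
    passed_attempt_2: bool  # tests pass after seeing test failures once

def pass_rate_2(results: list[ExerciseResult]) -> float:
    """Percent of exercises solved within two attempts."""
    solved = sum(r.passed_attempt_1 or r.passed_attempt_2 for r in results)
    return 100.0 * solved / len(results)
```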
Limitations
Aider's harness shapes prompts in a specific way, so results may not be directly comparable to other coding benchmarks.
By provider
| Provider | Models | Top model | Best (% pass@2) | Average (% pass@2) |
|---|---|---|---|---|
| OpenAI | 6 | GPT-5 | 88 | 55.25 |
| Google | 1 | Gemini 2.5 Pro | 72.9 | 72.9 |
| DeepSeek | 2 | DeepSeek R1 | 56.9 | 56 |
| xAI | 1 | Grok 3 | 53.3 | 53.3 |
| Alibaba | 1 | Qwen: Qwen3 32B | 40 | 40 |
| Meta | 1 | Llama 4 Maverick | 15.6 | 15.6 |
| Mistral AI | 1 | Codestral | 11.1 | 11.1 |
Full leaderboard
| # | Model | Provider | Score (% pass@2) |
|---|---|---|---|
| 1 | GPT-5 | OpenAI | 88 |
| 2 | Gemini 2.5 Pro | Google | 72.9 |
| 3 | o4-mini | OpenAI | 72 |
| 4 | o1 | OpenAI | 61.7 |
| 5 | DeepSeek R1 | DeepSeek | 56.9 |
| 6 | DeepSeek V3 | DeepSeek | 55.1 |
| 7 | o3 | OpenAI | 53.8 |
| 8 | Grok 3 | xAI | 53.3 |
| 9 | GPT-4.1 | OpenAI | 52.4 |
| 10 | Qwen: Qwen3 32B | Alibaba | 40 |
| 11 | Llama 4 Maverick | Meta | 15.6 |
| 12 | Codestral | Mistral AI | 11.1 |
| 13 | GPT-4o | OpenAI | 3.6 |