HumanEval
Coding · pass@1 % · Python coding benchmark of 164 programming problems.
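Each HumanEval task gives the model a function signature plus a docstring with worked examples; the model must complete the function body, and hidden unit tests decide pass or fail. Below is an illustrative sketch in that style, modeled on the dataset's first problem; the completion and the asserts are stand-ins, not the official test suite.

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # The prompt ends at the docstring; everything below is one possible
    # model-generated completion.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Illustrative stand-ins for the hidden unit tests:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
```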
At a glance
- Total results: 9
- Models tested: 9
- Providers: 7
- Verified: 0 · Self-reported: 9
- Average: 90.44 pass@1 %
- Median: 90.2 pass@1 %
- Range: 85.4 – 94 pass@1 %
- Latest result: Jun 1, 2025
Score distribution
9 results across 10 score bands spanning 85.4 to 94.0 pass@1 %; counts per band, low to high: 1, 0, 0, 2, 1, 1, 0, 1, 1, 2.
Methodology
pass@1 on 164 handwritten Python problems with unit tests. Scores reflect whether the first generation passes all tests.
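In code, this protocol reduces to: sample one completion per problem, run the hidden tests, and report the passing fraction. A minimal sketch, assuming a hypothetical problem format with a `prompt` string and a `check` callable (the official harness instead executes completions in a sandbox); the `pass_at_k` helper is the unbiased estimator from the HumanEval paper (Chen et al., 2021), of which pass@1 is the k = 1 case.

```python
from math import comb

def pass_at_1(problems, generate_completion):
    """Fraction of problems whose single sampled completion passes all tests."""
    passed = 0
    for problem in problems:
        completion = generate_completion(problem["prompt"])  # one sample only
        try:
            problem["check"](completion)  # assumed to raise on any failing test
            passed += 1
        except Exception:
            pass  # wrong output, crash, or timeout all score 0 for this problem
    return passed / len(problems)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: with n samples per problem of which c pass,
    pass@k = 1 - C(n-c, k) / C(n, k); at k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```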
Limitations
Saturated: frontier models now cluster near 95%+, leaving little headroom to differentiate them. The benchmark is also narrow in both language (Python only) and problem style (short, self-contained functions).
By provider
- OpenAI · 2 models · Best: GPT-5, 94 pass@1 % · Average: 93.75 pass@1 %
- Anthropic · 2 models · Best: Claude Opus 4, 93 pass@1 % · Average: 92.5 pass@1 %
- DeepSeek · 1 model · DeepSeek V3, 90.2 pass@1 %
- Meta · 1 model · Llama 3.1 405B, 89 pass@1 %
- xAI · 1 model · Grok 3, 88.5 pass@1 %
- Google · 1 model · Gemini 2 Pro, 88.4 pass@1 %
- Mistral AI · 1 model · Codestral, 85.4 pass@1 %
Full leaderboard
| # | Model | Provider | Score (pass@1 %) |
|---|---|---|---|
| 1 | GPT-5 | OpenAI | 94 |
| 2 | o3-mini | OpenAI | 93.5 |
| 3 | Claude Opus 4 | Anthropic | 93 |
| 4 | Claude Sonnet 4 | Anthropic | 92 |
| 5 | DeepSeek V3 | DeepSeek | 90.2 |
| 6 | Llama 3.1 405B | Meta | 89 |
| 7 | Grok 3 | xAI | 88.5 |
| 8 | Gemini 2 Pro | Google | 88.4 |
| 9 | Codestral | Mistral AI | 85.4 |