HumanEval

Coding · pass@1 %

Python coding benchmark of 164 programming problems.

At a glance

🏆 Top score
GPT-5 (OpenAI) — 94 pass@1 %
Total results
9
Models tested
9
Providers
7
Verified / Self-reported
0 / 9
Average
90.44 pass@1 %
Median
90.2 pass@1 %
Range
85.4 – 94 pass@1 %
Latest result
Jun 1, 2025

Score distribution

[Histogram omitted: 9 results across 10 score bands, spanning 85.4 to 94.0 pass@1 %.]

Methodology

pass@1 on 164 handwritten Python problems, each with its own unit tests. A problem counts as solved only if the model's first generation passes all of its tests; the score is the percentage of problems solved.
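As a sketch of how the metric is computed: the HumanEval paper defines an unbiased pass@k estimator from n samples per problem, of which c pass. With a single sample per problem (n = k = 1), it reduces to the fraction of problems whose first generation passes all tests. The per-problem outcomes below are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem,
    c of them pass all unit tests, k is the sampling budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical first-generation outcomes for four problems.
results = [True, False, True, True]
score = 100 * sum(pass_at_k(1, int(ok), 1) for ok in results) / len(results)
print(f"pass@1 = {score:.1f}%")  # 75.0% on this toy set
```

Reporting pass@1 from one sample per problem is a high-variance estimate; the estimator above lets a benchmark draw many samples (n > k) and still report an unbiased pass@k.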

Limitations

Largely saturated: frontier models cluster near 95%. Coverage is narrow in both language (Python only) and problem style (short, self-contained functions).


Full leaderboard

Showing 9 of 9
#   Model             Provider     Score (pass@1 %)
1   GPT-5             OpenAI       94
2   o3-mini           OpenAI       93.5
3   Claude Opus 4     Anthropic    93
4   Claude Sonnet 4   Anthropic    92
5   DeepSeek V3       DeepSeek     90.2
6   Llama 3.1 405B    Meta         89
7   Grok 3            xAI          88.5
8   Gemini 2 Pro      Google       88.4
9   Codestral         Mistral AI   85.4
