Benchmarks

8 benchmarks tracked · 79 total results · 4 categories

Coding

HumanEval
9 results
Python coding benchmark of 164 programming problems.
🏆 GPT-5 · OpenAI
94% pass@1
SWE-bench Verified
19 results
Real GitHub issues solved end-to-end by the model.
🏆 Claude Sonnet 4.6 · Anthropic
72.7% resolved
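
For readers unfamiliar with the pass@1 unit on the HumanEval card, here is a minimal sketch of how a pass@k score is commonly computed, using the unbiased estimator 1 - C(n-c, k) / C(n, k) popularized alongside HumanEval. The function names `pass_at_k` and `benchmark_pass_at_k` are illustrative, and nothing on this page confirms that the listed results were produced this way.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate for a problem reduces to c / n, so a benchmark-level pass@1 is the average fraction of passing samples across all problems.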

General knowledge

MMLU
13 results
Massive Multitask Language Understanding — 57-subject multiple-choice exam.
🏆 o3 · OpenAI
91% accuracy
MMLU-Pro
7 results
Harder reformulation of MMLU with 10 answer choices and deeper reasoning.
🏆 o3 · OpenAI
81.2% accuracy

Math

Reasoning

GPQA Diamond
7 results
Graduate-level, Google-proof science reasoning questions.
🏆 o3 · OpenAI
87.7% accuracy