8 benchmarks tracked · 79 total results · 4 categories
Python coding benchmark of 164 programming problems.
Real GitHub issues that the model must resolve end-to-end.
Massive Multitask Language Understanding — 57-subject multiple-choice exam.
Harder reformulation of MMLU with 10 answer choices per question and more reasoning-heavy problems.
American Invitational Mathematics Examination, 2024 problems.
Grade-school math word problems.
12,500 competition math problems across 7 subjects.
Graduate-level, Google-proof science reasoning questions.