GPQA Diamond

Reasoning · % accuracy

Graduate-level, Google-proof science reasoning questions.

At a glance

🏆 Top score: o3 (OpenAI), 87.7 % accuracy
Total results: 7
Models tested: 7
Providers: 5
Verified · Self-reported: 1 · 6
Average: 76.56 % accuracy
Median: 74 % accuracy
Range: 68.1 – 87.7 % accuracy
Latest result: Jun 1, 2025
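
As a sanity check, the aggregate figures above follow directly from the seven scores in the full leaderboard below; a minimal Python sketch:

```python
from statistics import mean, median

# Scores copied from the full leaderboard below (% accuracy).
scores = [87.7, 84.6, 78.0, 74.0, 72.0, 71.5, 68.1]

print(f"Average: {mean(scores):.2f} % accuracy")              # 76.56
print(f"Median: {median(scores):g} % accuracy")               # 74
print(f"Range: {min(scores):g} - {max(scores):g} % accuracy") # 68.1 - 87.7
```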

Score distribution

[Histogram: 7 results across 10 score bands spanning 68.1 to 87.7 % accuracy; band counts: 1, 2, 0, 1, 0, 1, 0, 0, 1, 1]
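
The page does not state its binning rule, but splitting the min–max range into ten equal-width bands reproduces the counts above; a small sketch (the helper name is hypothetical):

```python
def score_bands(values, n_bands=10):
    """Count scores in equal-width bands between min and max (top edge inclusive)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bands
    counts = [0] * n_bands
    for v in values:
        # Clamp so the maximum score lands in the last band instead of overflowing.
        i = min(int((v - lo) / width), n_bands - 1)
        counts[i] += 1
    return counts

scores = [87.7, 84.6, 78.0, 74.0, 72.0, 71.5, 68.1]
print(score_bands(scores))  # [1, 2, 0, 1, 0, 1, 0, 0, 1, 1]
```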

Methodology

Expert-written multiple-choice questions in biology, chemistry, and physics, designed to be difficult even with web search. The Diamond subset is the hardest.

Limitations

The test set is small (198 questions in the Diamond subset), so scores carry high variance, and the benchmark measures only a narrow slice of scientific reasoning.

By provider

  • OpenAI · 3 models
    Best: o3, 87.7 % accuracy · Average: 79.23 % accuracy
  • xAI · 1 model
    Best: Grok 3, 84.6 % accuracy · Average: 84.6 % accuracy
  • Anthropic · 1 model
    Best: Claude Opus 4, 74 % accuracy · Average: 74 % accuracy
  • DeepSeek · 1 model
    Best: DeepSeek R1, 71.5 % accuracy · Average: 71.5 % accuracy
  • Google · 1 model
    Best: Gemini 2 Pro, 68.1 % accuracy · Average: 68.1 % accuracy
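
These per-provider aggregates are a straightforward group-by over the leaderboard rows; a sketch of that aggregation, with the data copied from the table below:

```python
from collections import defaultdict

# (model, provider, % accuracy) rows from the full leaderboard below.
results = [
    ("o3", "OpenAI", 87.7),
    ("Grok 3", "xAI", 84.6),
    ("o1", "OpenAI", 78.0),
    ("Claude Opus 4", "Anthropic", 74.0),
    ("GPT-5", "OpenAI", 72.0),
    ("DeepSeek R1", "DeepSeek", 71.5),
    ("Gemini 2 Pro", "Google", 68.1),
]

by_provider = defaultdict(list)
for model, provider, score in results:
    by_provider[provider].append(score)

# Rank providers by their best score, matching the order above.
for provider, scores in sorted(by_provider.items(), key=lambda kv: -max(kv[1])):
    print(f"{provider}: {len(scores)} model(s), "
          f"average {sum(scores) / len(scores):.2f}, best {max(scores):g}")
```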

Full leaderboard

Showing 7 of 7
#   Model           Provider    Score (% accuracy)
1   o3              OpenAI      87.7
2   Grok 3          xAI         84.6
3   o1              OpenAI      78
4   Claude Opus 4   Anthropic   74
5   GPT-5           OpenAI      72
6   DeepSeek R1     DeepSeek    71.5
7   Gemini 2 Pro    Google      68.1
