GPQA Diamond
Reasoning · % accuracy
Graduate-level, Google-proof science reasoning questions.
At a glance
- Total results: 7
- Models tested: 7
- Providers: 5
- Verified · Self-reported: 1 · 6
- Average: 76.56 % accuracy
- Median: 74 % accuracy
- Range: 68.1 – 87.7 % accuracy
- Latest result: Jun 1, 2025
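The summary statistics above can be reproduced directly from the seven scores in the full leaderboard below. A minimal sketch in Python:

```python
import statistics

# Scores from the full leaderboard below (% accuracy)
scores = [87.7, 84.6, 78, 74, 72, 71.5, 68.1]

average = statistics.mean(scores)     # 76.56 when rounded to 2 dp
median = statistics.median(scores)    # 74
low, high = min(scores), max(scores)  # 68.1, 87.7

print(f"Average: {average:.2f} % accuracy")
print(f"Median: {median} % accuracy")
print(f"Range: {low} – {high} % accuracy")
```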
Score distribution
[Histogram: 7 results across 10 equal-width score bands from 68.1 to 87.7 % accuracy; counts per band: 1, 2, 0, 1, 0, 1, 0, 0, 1, 1. Axis ticks at 68.1, 77.9, 87.7.]
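The band counts follow from splitting the observed score range into ten equal-width buckets. A minimal sketch, assuming the chart clamps the maximum score into the top band rather than opening an eleventh:

```python
scores = [87.7, 84.6, 78, 74, 72, 71.5, 68.1]

n_bands = 10
low, high = min(scores), max(scores)
width = (high - low) / n_bands  # 1.96 percentage points per band

counts = [0] * n_bands
for s in scores:
    # Clamp the maximum score into the last band instead of an 11th one.
    band = min(int((s - low) / width), n_bands - 1)
    counts[band] += 1

print(counts)  # [1, 2, 0, 1, 0, 1, 0, 0, 1, 1]
```

This reproduces exactly the band counts shown in the histogram above.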
Methodology
Expert-written multiple-choice questions in biology, chemistry, and physics, designed to be difficult even with web search. The Diamond subset is the hardest.
Limitations
The test set is small, so scores carry high variance. It also measures only a narrow slice of scientific reasoning.
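To make the variance caveat concrete, here is a rough standard-error estimate. The question count is an assumption (the commonly cited 198-question size of the Diamond subset), not something stated on this page:

```python
import math

n_questions = 198  # assumed GPQA Diamond size; not stated on this page
p = 0.7656         # average accuracy from the summary above

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
se = math.sqrt(p * (1 - p) / n_questions)
print(f"Std. error ≈ {se * 100:.1f} percentage points")  # ≈ 3.0
```

On that assumption, gaps of a couple of percentage points between adjacent models fall within one standard error.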
By provider
- OpenAI · 3 models · Average: 79.23 % accuracy · Best: 87.7 % accuracy (o3)
- xAI · 1 model · Average: 84.6 % accuracy · Best: 84.6 % accuracy (Grok 3)
- Anthropic · 1 model · Average: 74 % accuracy · Best: 74 % accuracy (Claude Opus 4)
- DeepSeek · 1 model · Average: 71.5 % accuracy · Best: 71.5 % accuracy (DeepSeek R1)
- Google · 1 model · Average: 68.1 % accuracy · Best: 68.1 % accuracy (Gemini 2 Pro)
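These per-provider figures can be reproduced by grouping the leaderboard rows by provider. A minimal sketch:

```python
from collections import defaultdict

# (model, provider, score) rows from the full leaderboard below
rows = [
    ("o3", "OpenAI", 87.7),
    ("Grok 3", "xAI", 84.6),
    ("o1", "OpenAI", 78),
    ("Claude Opus 4", "Anthropic", 74),
    ("GPT-5", "OpenAI", 72),
    ("DeepSeek R1", "DeepSeek", 71.5),
    ("Gemini 2 Pro", "Google", 68.1),
]

by_provider = defaultdict(list)
for model, provider, score in rows:
    by_provider[provider].append(score)

for provider, scores in by_provider.items():
    avg = sum(scores) / len(scores)
    print(f"{provider}: {len(scores)} model(s), "
          f"average {avg:.2f}, best {max(scores)}")
```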
Full leaderboard
| # | Model | Provider | Score (% accuracy) |
|---|---|---|---|
| 1 | o3 | OpenAI | 87.7 |
| 2 | Grok 3 | xAI | 84.6 |
| 3 | o1 | OpenAI | 78 |
| 4 | Claude Opus 4 | Anthropic | 74 |
| 5 | GPT-5 | OpenAI | 72 |
| 6 | DeepSeek R1 | DeepSeek | 71.5 |
| 7 | Gemini 2 Pro | Google | 68.1 |