GPQA Diamond

Reasoning · % accuracy

Graduate-level, Google-proof science reasoning questions.

At a glance

🏆 Top score: o3 (OpenAI), 87.7 % accuracy
Total results: 7
Models tested: 7
Providers: 5
Verified · Self-reported: 1 · 6
Average: 76.56 % accuracy
Median: 74 % accuracy
Range: 68.1 – 87.7 % accuracy
Latest result: Jun 1, 2025
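
As a sanity check, the aggregate figures above follow directly from the seven scores in the full leaderboard below; a minimal Python sketch:

```python
from statistics import mean, median

# Scores copied from the full leaderboard below (% accuracy).
scores = [87.7, 84.6, 78.0, 74.0, 72.0, 71.5, 68.1]

print(f"Average: {mean(scores):.2f} % accuracy")              # 76.56
print(f"Median: {median(scores):g} % accuracy")               # 74
print(f"Range: {min(scores):g} - {max(scores):g} % accuracy") # 68.1 - 87.7
```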

Score distribution

[Histogram: 7 results across 10 score bands spanning 68.1 to 87.7 % accuracy; band counts: 1, 2, 0, 1, 0, 1, 0, 0, 1, 1]
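
The page does not state its binning rule, but splitting the min–max range into ten equal-width bands reproduces the counts above; a small sketch (the helper name is hypothetical):

```python
def score_bands(values, n_bands=10):
    """Count scores in equal-width bands between min and max (top edge inclusive)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bands
    counts = [0] * n_bands
    for v in values:
        # Clamp so the maximum score lands in the last band instead of overflowing.
        i = min(int((v - lo) / width), n_bands - 1)
        counts[i] += 1
    return counts

scores = [87.7, 84.6, 78.0, 74.0, 72.0, 71.5, 68.1]
print(score_bands(scores))  # [1, 2, 0, 1, 0, 1, 0, 0, 1, 1]
```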

Methodology

Expert-written multiple-choice questions in biology, chemistry, and physics, designed to be difficult even with web search. The Diamond subset is the hardest.

Limitations

The test set is small (198 questions in the Diamond subset), so scores carry high variance, and the benchmark measures only a narrow slice of scientific reasoning.

By provider

  • OpenAI · 3 models
    Best: o3, 87.7 % accuracy · Average: 79.23 % accuracy
  • xAI · 1 model
    Best: Grok 3, 84.6 % accuracy · Average: 84.6 % accuracy
  • Anthropic · 1 model
    Best: Claude Opus 4, 74 % accuracy · Average: 74 % accuracy
  • DeepSeek · 1 model
    Best: DeepSeek R1, 71.5 % accuracy · Average: 71.5 % accuracy
  • Google · 1 model
    Best: Gemini 2 Pro, 68.1 % accuracy · Average: 68.1 % accuracy
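
These per-provider aggregates are a straightforward group-by over the leaderboard rows; a sketch of that aggregation, with the data copied from the table below:

```python
from collections import defaultdict

# (model, provider, % accuracy) rows from the full leaderboard below.
results = [
    ("o3", "OpenAI", 87.7),
    ("Grok 3", "xAI", 84.6),
    ("o1", "OpenAI", 78.0),
    ("Claude Opus 4", "Anthropic", 74.0),
    ("GPT-5", "OpenAI", 72.0),
    ("DeepSeek R1", "DeepSeek", 71.5),
    ("Gemini 2 Pro", "Google", 68.1),
]

by_provider = defaultdict(list)
for model, provider, score in results:
    by_provider[provider].append(score)

# Rank providers by their best score, matching the order above.
for provider, scores in sorted(by_provider.items(), key=lambda kv: -max(kv[1])):
    print(f"{provider}: {len(scores)} model(s), "
          f"average {sum(scores) / len(scores):.2f}, best {max(scores):g}")
```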

Full leaderboard

Showing 7 of 7
#   Model           Provider    Score (% accuracy)
1   o3              OpenAI      87.7
2   Grok 3          xAI         84.6
3   o1              OpenAI      78
4   Claude Opus 4   Anthropic   74
5   GPT-5           OpenAI      72
6   DeepSeek R1     DeepSeek    71.5
7   Gemini 2 Pro    Google      68.1
