MMLU
General knowledge · % accuracy
Massive Multitask Language Understanding — a 57-subject multiple-choice exam.
At a glance
- Total results: 13
- Models tested: 13
- Providers: 9
- Verified / self-reported: 2 / 11
- Average: 85.14 % accuracy
- Median: 85.2 % accuracy
- Range: 75.7 – 91 % accuracy
- Latest result: Jun 1, 2025
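The headline statistics above can be recomputed directly from the 13 scores in the full leaderboard below; a minimal Python sketch:

```python
# Recompute the at-a-glance statistics from the leaderboard scores.
from statistics import mean, median

scores = [91, 90.8, 88.5, 88, 87.5, 87, 85.2, 84.8, 83.1, 82, 82, 81.2, 75.7]

print(f"Average: {mean(scores):.2f} % accuracy")            # 85.14
print(f"Median:  {median(scores):.1f} % accuracy")          # 85.2
print(f"Range:   {min(scores)} - {max(scores)} % accuracy") # 75.7 - 91
```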
Score distribution
13 results across 10 score bands spanning 75.7–91.0 % accuracy. Results per band, lowest to highest: 1, 0, 0, 1, 3, 1, 1, 2, 2, 2.
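Assuming the distribution chart used ten equal-width bands over the observed range (an assumption, but one that reproduces the counts above exactly), the banding can be reconstructed like this:

```python
# Re-bin the 13 scores into 10 equal-width bands over the observed range.
# Assumption: the original chart used equal-width bins; this reproduces
# the per-band counts shown above.
scores = [91, 90.8, 88.5, 88, 87.5, 87, 85.2, 84.8, 83.1, 82, 82, 81.2, 75.7]

lo, hi, n_bands = min(scores), max(scores), 10
width = (hi - lo) / n_bands

counts = [0] * n_bands
for s in scores:
    # Clamp the maximum score into the top band rather than an 11th bin.
    counts[min(int((s - lo) / width), n_bands - 1)] += 1

print(counts)  # [1, 0, 0, 1, 3, 1, 1, 2, 2, 2]
```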
Methodology
5-shot evaluation across 57 subjects (humanities, STEM, social sciences, law, medicine). The score is the percentage of correct answers.
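A minimal sketch of that protocol, where `ask_model` is a hypothetical stand-in for the model API under test, and `dev_set` / `test_set` stand in for MMLU's per-subject splits:

```python
# Sketch of the 5-shot protocol described above. `ask_model`, `dev_set`,
# and `test_set` are hypothetical stand-ins: each item is assumed to be a
# dict with "question", "options" (4 strings), and "answer" (index 0-3).
CHOICES = "ABCD"

def format_item(item):
    """Render one MMLU item: question, lettered options, answer stub."""
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, item["options"]))
    return f"{item['question']}\n{options}\nAnswer:"

def build_prompt(item, few_shot):
    """Prepend five solved dev-set examples, then the unanswered item."""
    shots = "\n\n".join(
        f"{format_item(ex)} {CHOICES[ex['answer']]}" for ex in few_shot
    )
    return f"{shots}\n\n{format_item(item)}"

def mmlu_accuracy(ask_model, test_set, dev_set):
    """Score = percentage of items whose predicted letter is correct."""
    few_shot = dev_set[:5]  # 5-shot: five solved examples per subject
    correct = sum(
        ask_model(build_prompt(q, few_shot)).strip()[:1] == CHOICES[q["answer"]]
        for q in test_set
    )
    return 100 * correct / len(test_set)
```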
Limitations
MMLU is widely used and likely partially contaminated, since its questions appear in many training corpora; as a result, scores near the top are compressed and leading models are hard to distinguish. It does not measure open-ended reasoning or tool use.
By provider
| Provider | Models | Best model | Best (% accuracy) | Average (% accuracy) |
|---|---|---|---|---|
| OpenAI | 3 | o3 | 91 | 87 |
| DeepSeek | 2 | DeepSeek R1 | 90.8 | 89.65 |
| Anthropic | 2 | Claude Opus 4 | 87.5 | 85.3 |
| xAI | 1 | Grok 3 | 87 | 87 |
| Google | 1 | Gemini 2 Pro | 85.2 | 85.2 |
| Microsoft | 1 | Phi-4 | 84.8 | 84.8 |
| Meta | 1 | Llama 3 70B | 82 | 82 |
| Mistral AI | 1 | Mistral Large | 81.2 | 81.2 |
| Cohere | 1 | Command R+ | 75.7 | 75.7 |
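The per-provider averages and bests in the table above follow directly from grouping the leaderboard rows; a minimal sketch:

```python
# Derive the by-provider table from (model, provider, score) rows,
# as listed in the full leaderboard below.
from collections import defaultdict

rows = [
    ("o3", "OpenAI", 91), ("DeepSeek R1", "DeepSeek", 90.8),
    ("DeepSeek V3", "DeepSeek", 88.5), ("GPT-5", "OpenAI", 88),
    ("Claude Opus 4", "Anthropic", 87.5), ("Grok 3", "xAI", 87),
    ("Gemini 2 Pro", "Google", 85.2), ("Phi-4", "Microsoft", 84.8),
    ("Claude Sonnet 4", "Anthropic", 83.1), ("Llama 3 70B", "Meta", 82),
    ("GPT-4o mini", "OpenAI", 82), ("Mistral Large", "Mistral AI", 81.2),
    ("Command R+", "Cohere", 75.7),
]

by_provider = defaultdict(list)
for model, provider, score in rows:
    by_provider[provider].append(score)

# Sort providers by best score, descending; print models/average/best.
for provider, scores in sorted(by_provider.items(), key=lambda kv: -max(kv[1])):
    print(f"{provider}: {len(scores)} models, "
          f"average {sum(scores) / len(scores):g}, best {max(scores):g}")
```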
Full leaderboard
| # | Model | Provider | Score (% accuracy) |
|---|---|---|---|
| 1 | o3 | OpenAI | 91 |
| 2 | DeepSeek R1 | DeepSeek | 90.8 |
| 3 | DeepSeek V3 | DeepSeek | 88.5 |
| 4 | GPT-5 | OpenAI | 88 |
| 5 | Claude Opus 4 | Anthropic | 87.5 |
| 6 | Grok 3 | xAI | 87 |
| 7 | Gemini 2 Pro | Google | 85.2 |
| 8 | Phi-4 | Microsoft | 84.8 |
| 9 | Claude Sonnet 4 | Anthropic | 83.1 |
| 10 | Llama 3 70B | Meta | 82 |
| 11 | GPT-4o mini | OpenAI | 82 |
| 12 | Mistral Large | Mistral AI | 81.2 |
| 13 | Command R+ | Cohere | 75.7 |