MMLU
General knowledge · % accuracy
Massive Multitask Language Understanding — a 57-subject multiple-choice exam.
At a glance
- Total results: 13
- Models tested: 13
- Providers: 9
- Verified / self-reported: 2 / 11
- Average: 85.14 % accuracy
- Median: 85.2 % accuracy
- Range: 75.7 – 91 % accuracy
- Latest result: Jun 1, 2025
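The headline statistics above can be recomputed directly from the 13 scores in the full leaderboard below; a minimal Python sketch:

```python
# Recompute the at-a-glance statistics from the leaderboard scores.
from statistics import mean, median

scores = [91, 90.8, 88.5, 88, 87.5, 87, 85.2, 84.8, 83.1, 82, 82, 81.2, 75.7]

print(f"Average: {mean(scores):.2f} % accuracy")            # 85.14
print(f"Median:  {median(scores):.1f} % accuracy")          # 85.2
print(f"Range:   {min(scores)} - {max(scores)} % accuracy") # 75.7 - 91
```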
Score distribution
13 results across 10 score bands spanning 75.7–91.0 % accuracy. Results per band, lowest to highest: 1, 0, 0, 1, 3, 1, 1, 2, 2, 2.
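Assuming the distribution chart used ten equal-width bands over the observed range (an assumption, but one that reproduces the counts above exactly), the banding can be reconstructed like this:

```python
# Re-bin the 13 scores into 10 equal-width bands over the observed range.
# Assumption: the original chart used equal-width bins; this reproduces
# the per-band counts shown above.
scores = [91, 90.8, 88.5, 88, 87.5, 87, 85.2, 84.8, 83.1, 82, 82, 81.2, 75.7]

lo, hi, n_bands = min(scores), max(scores), 10
width = (hi - lo) / n_bands

counts = [0] * n_bands
for s in scores:
    # Clamp the maximum score into the top band rather than an 11th bin.
    counts[min(int((s - lo) / width), n_bands - 1)] += 1

print(counts)  # [1, 0, 0, 1, 3, 1, 1, 2, 2, 2]
```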
Methodology
5-shot evaluation across 57 subjects (humanities, STEM, social sciences, law, medicine). The score is the percentage of correct answers.
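A minimal sketch of that protocol, where `ask_model` is a hypothetical stand-in for the model API under test, and `dev_set` / `test_set` stand in for MMLU's per-subject splits:

```python
# Sketch of the 5-shot protocol described above. `ask_model`, `dev_set`,
# and `test_set` are hypothetical stand-ins: each item is assumed to be a
# dict with "question", "options" (4 strings), and "answer" (index 0-3).
CHOICES = "ABCD"

def format_item(item):
    """Render one MMLU item: question, lettered options, answer stub."""
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, item["options"]))
    return f"{item['question']}\n{options}\nAnswer:"

def build_prompt(item, few_shot):
    """Prepend five solved dev-set examples, then the unanswered item."""
    shots = "\n\n".join(
        f"{format_item(ex)} {CHOICES[ex['answer']]}" for ex in few_shot
    )
    return f"{shots}\n\n{format_item(item)}"

def mmlu_accuracy(ask_model, test_set, dev_set):
    """Score = percentage of items whose predicted letter is correct."""
    few_shot = dev_set[:5]  # 5-shot: five solved examples per subject
    correct = sum(
        ask_model(build_prompt(q, few_shot)).strip()[:1] == CHOICES[q["answer"]]
        for q in test_set
    )
    return 100 * correct / len(test_set)
```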
Limitations
MMLU is widely used and likely partially contaminated, since its questions appear in many training corpora; as a result, scores near the top are compressed and leading models are hard to distinguish. It does not measure open-ended reasoning or tool use.
By provider
| Provider | Models | Best model | Best (% accuracy) | Average (% accuracy) |
|---|---|---|---|---|
| OpenAI | 3 | o3 | 91 | 87 |
| DeepSeek | 2 | DeepSeek R1 | 90.8 | 89.65 |
| Anthropic | 2 | Claude Opus 4 | 87.5 | 85.3 |
| xAI | 1 | Grok 3 | 87 | 87 |
| Google | 1 | Gemini 2 Pro | 85.2 | 85.2 |
| Microsoft | 1 | Phi-4 | 84.8 | 84.8 |
| Meta | 1 | Llama 3 70B | 82 | 82 |
| Mistral AI | 1 | Mistral Large | 81.2 | 81.2 |
| Cohere | 1 | Command R+ | 75.7 | 75.7 |
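The per-provider averages and bests in the table above follow directly from grouping the leaderboard rows; a minimal sketch:

```python
# Derive the by-provider table from (model, provider, score) rows,
# as listed in the full leaderboard below.
from collections import defaultdict

rows = [
    ("o3", "OpenAI", 91), ("DeepSeek R1", "DeepSeek", 90.8),
    ("DeepSeek V3", "DeepSeek", 88.5), ("GPT-5", "OpenAI", 88),
    ("Claude Opus 4", "Anthropic", 87.5), ("Grok 3", "xAI", 87),
    ("Gemini 2 Pro", "Google", 85.2), ("Phi-4", "Microsoft", 84.8),
    ("Claude Sonnet 4", "Anthropic", 83.1), ("Llama 3 70B", "Meta", 82),
    ("GPT-4o mini", "OpenAI", 82), ("Mistral Large", "Mistral AI", 81.2),
    ("Command R+", "Cohere", 75.7),
]

by_provider = defaultdict(list)
for model, provider, score in rows:
    by_provider[provider].append(score)

# Sort providers by best score, descending; print models/average/best.
for provider, scores in sorted(by_provider.items(), key=lambda kv: -max(kv[1])):
    print(f"{provider}: {len(scores)} models, "
          f"average {sum(scores) / len(scores):g}, best {max(scores):g}")
```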
Full leaderboard
| # | Model | Provider | Score (% accuracy) |
|---|---|---|---|
| 1 | o3 | OpenAI | 91 |
| 2 | DeepSeek R1 | DeepSeek | 90.8 |
| 3 | DeepSeek V3 | DeepSeek | 88.5 |
| 4 | GPT-5 | OpenAI | 88 |
| 5 | Claude Opus 4 | Anthropic | 87.5 |
| 6 | Grok 3 | xAI | 87 |
| 7 | Gemini 2 Pro | Google | 85.2 |
| 8 | Phi-4 | Microsoft | 84.8 |
| 9 | Claude Sonnet 4 | Anthropic | 83.1 |
| 10 | Llama 3 70B | Meta | 82 |
| 11 | GPT-4o mini | OpenAI | 82 |
| 12 | Mistral Large | Mistral AI | 81.2 |
| 13 | Command R+ | Cohere | 75.7 |