MMLU

General knowledge · Metric: % accuracy

Massive Multitask Language Understanding — 57-subject multiple-choice exam.

At a glance

🏆 Top score: o3 (OpenAI), 91% accuracy
Total results: 13
Models tested: 13
Providers: 9
Verified · Self-reported: 2 · 11
Average: 85.14% accuracy
Median: 85.2% accuracy
Range: 75.7 – 91% accuracy
Latest result: Jun 1, 2025

Score distribution

[Histogram omitted: 13 results across 10 score bands, spanning 75.7 to 91.0% accuracy.]

Methodology

5-shot evaluation across 57 subjects (humanities, STEM, social sciences, law, medicine). The score is the percentage of questions answered correctly.
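The 5-shot protocol can be sketched as follows. This is an illustrative stand-in, not the official harness: `ask_model`, the few-shot examples, and the dataset shape are all hypothetical.

```python
# Sketch of 5-shot multiple-choice scoring in the style MMLU uses:
# k worked examples are prepended, then the test question, ending at "Answer:".
# `ask_model` is a hypothetical stand-in for any text-completion call.

FEW_SHOT = [
    # In a real run: five worked examples from the same subject's dev split.
    ("What is the capital of France?", ["Berlin", "Madrid", "Paris", "Rome"], "C"),
]

def build_prompt(question, choices, shots=FEW_SHOT):
    """Format worked examples plus the test question as a single prompt."""
    parts = []
    for q, opts, ans in shots:
        opts_txt = "\n".join(f"{letter}. {o}" for letter, o in zip("ABCD", opts))
        parts.append(f"{q}\n{opts_txt}\nAnswer: {ans}")
    opts_txt = "\n".join(f"{letter}. {o}" for letter, o in zip("ABCD", choices))
    parts.append(f"{question}\n{opts_txt}\nAnswer:")
    return "\n\n".join(parts)

def score(dataset, ask_model):
    """Accuracy = percentage of questions whose predicted letter matches the key."""
    correct = sum(
        ask_model(build_prompt(q, choices)).strip().upper().startswith(key)
        for q, choices, key in dataset
    )
    return 100 * correct / len(dataset)
```

Real harnesses often score by comparing log-probabilities of the four answer letters rather than parsing generated text, which avoids formatting failures; the accuracy definition is the same.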

Limitations

MMLU is widely used, and parts of its test set have leaked into training corpora (contamination), so scores near the top of the range are compressed. It does not measure open-ended reasoning or tool use.

By provider

  • OpenAI · 3 models
    Best: o3, 91% accuracy · Average: 87% accuracy
  • DeepSeek · 2 models
    Best: DeepSeek R1, 90.8% accuracy · Average: 89.65% accuracy
  • Anthropic · 2 models
    Best: Claude Opus 4, 87.5% accuracy · Average: 85.3% accuracy
  • xAI · 1 model
    Best: Grok 3, 87% accuracy · Average: 87% accuracy
  • Google · 1 model
    Best: Gemini 2 Pro, 85.2% accuracy · Average: 85.2% accuracy
  • Microsoft · 1 model
    Best: Phi-4, 84.8% accuracy · Average: 84.8% accuracy
  • Meta · 1 model
    Best: Llama 3 70B, 82% accuracy · Average: 82% accuracy
  • Mistral AI · 1 model
    Best: Mistral Large, 81.2% accuracy · Average: 81.2% accuracy
  • Cohere · 1 model
    Best: Command R+, 75.7% accuracy · Average: 75.7% accuracy

Full leaderboard

Showing 13 of 13
 # | Model           | Provider   | Score (% accuracy)
 1 | o3              | OpenAI     | 91
 2 | DeepSeek R1     | DeepSeek   | 90.8
 3 | DeepSeek V3     | DeepSeek   | 88.5
 4 | GPT-5           | OpenAI     | 88
 5 | Claude Opus 4   | Anthropic  | 87.5
 6 | Grok 3          | xAI        | 87
 7 | Gemini 2 Pro    | Google     | 85.2
 8 | Phi-4           | Microsoft  | 84.8
 9 | Claude Sonnet 4 | Anthropic  | 83.1
10 | Llama 3 70B     | Meta       | 82
11 | GPT-4o mini     | OpenAI     | 82
12 | Mistral Large   | Mistral AI | 81.2
13 | Command R+      | Cohere     | 75.7
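The summary statistics in "At a glance" can be re-derived directly from the 13 leaderboard scores; a minimal check using Python's standard library:

```python
# Re-deriving the "At a glance" statistics from the 13 leaderboard scores.
from statistics import mean, median

scores = [91, 90.8, 88.5, 88, 87.5, 87, 85.2, 84.8, 83.1, 82, 82, 81.2, 75.7]

avg = round(mean(scores), 2)       # 85.14
med = median(scores)               # 85.2 (7th of 13 sorted values)
low, high = min(scores), max(scores)  # 75.7 and 91

print(f"Average: {avg}% · Median: {med}% · Range: {low} – {high}%")
```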
