GSM8K

Math · % accuracy

Grade-school math word problems.

At a glance

🏆 Top score
DeepSeek V3 · DeepSeek · 97.1 % accuracy
Total results
8
Models tested
8
Providers
6
Verified · Self-reported
0 · 8
Average
95.25 % accuracy
Median
95.6 % accuracy
Range
93 – 97.1 % accuracy
Latest result
Jun 1, 2025
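
The average, median, and range above can be reproduced from the eight scores listed in the full leaderboard. The snippet below is a minimal sketch of that arithmetic; the variable names are illustrative, and rounding to two decimals is an assumption about how the page formats its figures.

```python
from statistics import mean, median

# Scores copied from the full leaderboard on this page (% accuracy).
scores = [97.1, 97.0, 96.1, 95.8, 95.4, 94.4, 93.2, 93.0]

# Assumption: the page rounds summary figures to two decimals.
average = round(mean(scores), 2)
middle = round(median(scores), 2)  # mean of the 4th and 5th values for 8 results
low, high = min(scores), max(scores)

print(f"Average: {average} % accuracy")
print(f"Median: {middle} % accuracy")
print(f"Range: {low} to {high} % accuracy")
```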

Score distribution

Histogram: 10 equal-width bands from 93.0 to 97.1 % accuracy.
Results per band, lowest to highest: 2, 0, 0, 1, 0, 1, 1, 1, 0, 2
8 results across 10 score bands
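
The band counts can be recomputed from the leaderboard scores. This is a sketch under two assumptions that the page does not state: the bands are equal-width between the lowest and highest score, and the top score is placed in the last band.

```python
# Scores copied from the full leaderboard on this page (% accuracy).
scores = [97.1, 97.0, 96.1, 95.8, 95.4, 94.4, 93.2, 93.0]

NUM_BANDS = 10  # from "8 results across 10 score bands"
low, high = min(scores), max(scores)
width = (high - low) / NUM_BANDS  # assumption: equal-width bands

counts = [0] * NUM_BANDS
for score in scores:
    # Assumption: the maximum score belongs to the last band, not an 11th one.
    band = min(int((score - low) / width), NUM_BANDS - 1)
    counts[band] += 1

print(counts)
```

Under these assumptions each band is 0.41 points wide, and the result shows two clusters at the extremes with single results in between.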

Methodology

8.5k grade-school math word problems; final numeric answer is checked.
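
Checking the final numeric answer can be sketched as follows. GSM8K reference solutions end with a line of the form `#### <answer>`; everything else here is an assumption for illustration, not the harness behind these scores: the function names are hypothetical, and taking the last number in the model output as its answer is one common but not universal convention.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in `text`, looking only after '####' if present."""
    tail = text.split("####")[-1] if "####" in text else text
    matches = NUMBER.findall(tail)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def is_correct(model_output: str, reference: str) -> bool:
    """Compare the final numbers of a model output and a reference solution."""
    predicted = extract_final_number(model_output)
    expected = extract_final_number(reference)
    if predicted is None or expected is None:
        return False
    return abs(predicted - expected) < 1e-6

def accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of outputs whose final number matches the reference."""
    hits = sum(is_correct(o, r) for o, r in zip(outputs, references))
    return 100.0 * hits / len(references)

# Hypothetical outputs and references, not taken from the dataset.
outputs = ["She pays 3 * 6 = 18 dollars. The total is 18.", "I think 7"]
references = ["3 * 6 = 18\n#### 18", "5 + 3 = 8\n#### 8"]
print(accuracy(outputs, references))
```

Only the final number is compared, so a correct answer reached through flawed reasoning still counts as a hit.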

Limitations

Mostly saturated on frontier models: all 8 results fall within 93 – 97.1 % accuracy, which leaves little headroom for differentiation.
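
One way to see the limited headroom is to estimate the sampling noise on a single score. The sketch below uses the binomial standard error and assumes the scores were measured on the public GSM8K test split of 1,319 problems; this page does not say which split or prompting setup each provider used, so treat the numbers as indicative only. The function name is hypothetical.

```python
import math

# Assumption: scores come from the 1,319-problem GSM8K test split.
TEST_SET_SIZE = 1319

def standard_error_points(accuracy_pct: float, n: int = TEST_SET_SIZE) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

se_top = standard_error_points(97.1)     # highest score on this page
se_bottom = standard_error_points(93.0)  # lowest score on this page

print(f"Standard error at 97.1 %: {se_top:.2f} points")
print(f"Standard error at 93.0 %: {se_bottom:.2f} points")
```

Under that assumption each score carries roughly 0.5 to 0.7 points of sampling noise, so gaps of a few tenths between adjacent leaderboard rows are hard to interpret.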

By provider

  • DeepSeek · 1 model
    DeepSeek V3: 97.1 % accuracy
    Average: 97.1 % accuracy · Best: 97.1 % accuracy
  • OpenAI · 3 models
    o3-mini: 97 % accuracy
    Average: 95.43 % accuracy · Best: 97 % accuracy
  • Microsoft · 1 model
    Phi-4: 95.8 % accuracy
    Average: 95.8 % accuracy · Best: 95.8 % accuracy
  • Anthropic · 1 model
    Claude Opus 4: 95.4 % accuracy
    Average: 95.4 % accuracy · Best: 95.4 % accuracy
  • Google · 1 model
    Gemini 2 Pro: 94.4 % accuracy
    Average: 94.4 % accuracy · Best: 94.4 % accuracy
  • Meta · 1 model
    Llama 3 70B: 93 % accuracy
    Average: 93 % accuracy · Best: 93 % accuracy
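
The per-provider figures can be derived by grouping the leaderboard rows. A minimal sketch, with the rows copied from the full leaderboard and two-decimal rounding assumed:

```python
from collections import defaultdict

# (model, provider, score in % accuracy), copied from the full leaderboard.
results = [
    ("DeepSeek V3", "DeepSeek", 97.1),
    ("o3-mini", "OpenAI", 97.0),
    ("GPT-5", "OpenAI", 96.1),
    ("Phi-4", "Microsoft", 95.8),
    ("Claude Opus 4", "Anthropic", 95.4),
    ("Gemini 2 Pro", "Google", 94.4),
    ("GPT-4o mini", "OpenAI", 93.2),
    ("Llama 3 70B", "Meta", 93.0),
]

by_provider = defaultdict(list)
for _model, provider, score in results:
    by_provider[provider].append(score)

# Assumption: averages are rounded to two decimals, as displayed on the page.
summary = {
    provider: {
        "models": len(scores),
        "average": round(sum(scores) / len(scores), 2),
        "best": max(scores),
    }
    for provider, scores in by_provider.items()
}

# List providers by best score, highest first.
for provider, stats in sorted(summary.items(), key=lambda kv: kv[1]["best"], reverse=True):
    print(provider, stats)
```

OpenAI is the only provider with more than one model, so it is the only entry whose average differs from its best score.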

Full leaderboard

Showing 8 of 8
#   Model           Provider    Score (% accuracy)
1   DeepSeek V3     DeepSeek    97.1
2   o3-mini         OpenAI      97
3   GPT-5           OpenAI      96.1
4   Phi-4           Microsoft   95.8
5   Claude Opus 4   Anthropic   95.4
6   Gemini 2 Pro    Google      94.4
7   GPT-4o mini     OpenAI      93.2
8   Llama 3 70B     Meta        93
