MultiChallenge
Instruction following · Multi-step instruction-following across diverse tasks (math, coding, writing, reasoning); measures aggregate capability breadth.
At a glance
🏆 Top score
58.65 % (Qwen3.5-27B)
Total results
7
Models tested
7
Providers
4
Verified · Self-reported
7 · 0
Average
40.39 %
Median
41.73 %
Range
18.80 – 58.65 %
Latest result
Apr 18, 2026
Score distribution
(histogram of results across score bins, roughly 18 % to 59 %; chart not reproduced here)
Methodology
Each query combines 10+ challenging sub-tasks. A response is scored on whether every sub-task is completed correctly.
Limitations
Strict all-or-nothing scoring: partial credit is not counted.
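Under all-or-nothing scoring, a query earns credit only when every sub-task check passes; missing even one sub-task zeroes out the whole query. A minimal sketch of that aggregation (function names and the list-of-booleans representation are illustrative assumptions, not MultiChallenge's actual harness):

```python
def score_query(subtask_results: list[bool]) -> int:
    """All-or-nothing: 1 only if every sub-task check passed, else 0."""
    return 1 if subtask_results and all(subtask_results) else 0

def benchmark_score(per_query_results: list[list[bool]]) -> float:
    """Mean of per-query all-or-nothing scores, as a percentage."""
    scores = [score_query(r) for r in per_query_results]
    return 100 * sum(scores) / len(scores)

# Example: 3 queries; partial completion earns nothing.
results = [
    [True, True, True],   # all sub-tasks correct -> 1
    [True, False, True],  # one failed sub-task -> 0 (no partial credit)
    [True, True],         # all sub-tasks correct -> 1
]
print(round(benchmark_score(results), 2))  # -> 66.67
```

This is why the metric reads harshly: a model that completes 9 of 10 sub-tasks on every query scores 0 %.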
By provider
- Alibaba · 1 model · Best: 58.65 % (Qwen3.5-27B) · Average: 58.65 %
- Google · 1 model · Best: 46.99 % (Gemma 4 31B) · Average: 46.99 %
- Anthropic · 3 models · Best: 46.62 % (Claude Opus 4.7) · Average: 41.61 %
- OpenAI · 2 models · Best: 33.46 % (GPT-5.4 mini) · Average: 26.13 %
Full leaderboard
Showing 7 of 7

| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 58.65 |
| 2 | Gemma 4 31B | Google | 46.99 |
| 3 | Claude Opus 4.7 | Anthropic | 46.62 |
| 4 | Claude Sonnet 4.6 | Anthropic | 41.73 |
| 5 | Claude Haiku 4.5 | Anthropic | 36.47 |
| 6 | GPT-5.4 mini | OpenAI | 33.46 |
| 7 | GPT-5.4 nano | OpenAI | 18.80 |