MultiChallenge
Instruction following · Multi-step instruction-following across diverse tasks (math, coding, writing, reasoning); measures aggregate capability breadth.
At a glance
🏆 Top score
58.65 % (Qwen3.5-27B)
Total results
7
Models tested
7
Providers
4
Verified · Self-reported
7 · 0
Average
40.39 %
Median
41.73 %
Range
18.80 – 58.65 %
Latest result
Apr 18, 2026
Score distribution
(histogram of results across score bins, roughly 18 % to 59 %; chart not reproduced here)
Methodology
Each query combines 10+ challenging sub-tasks. A response is scored on whether every sub-task is completed correctly.
Limitations
Strict all-or-nothing scoring: partial credit is not counted.
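Under all-or-nothing scoring, a query earns credit only when every sub-task check passes; missing even one sub-task zeroes out the whole query. A minimal sketch of that aggregation (function names and the list-of-booleans representation are illustrative assumptions, not MultiChallenge's actual harness):

```python
def score_query(subtask_results: list[bool]) -> int:
    """All-or-nothing: 1 only if every sub-task check passed, else 0."""
    return 1 if subtask_results and all(subtask_results) else 0

def benchmark_score(per_query_results: list[list[bool]]) -> float:
    """Mean of per-query all-or-nothing scores, as a percentage."""
    scores = [score_query(r) for r in per_query_results]
    return 100 * sum(scores) / len(scores)

# Example: 3 queries; partial completion earns nothing.
results = [
    [True, True, True],   # all sub-tasks correct -> 1
    [True, False, True],  # one failed sub-task -> 0 (no partial credit)
    [True, True],         # all sub-tasks correct -> 1
]
print(round(benchmark_score(results), 2))  # -> 66.67
```

This is why the metric reads harshly: a model that completes 9 of 10 sub-tasks on every query scores 0 %.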
By provider
- Alibaba · 1 model · Best: 58.65 % (Qwen3.5-27B) · Average: 58.65 %
- Google · 1 model · Best: 46.99 % (Gemma 4 31B) · Average: 46.99 %
- Anthropic · 3 models · Best: 46.62 % (Claude Opus 4.7) · Average: 41.61 %
- OpenAI · 2 models · Best: 33.46 % (GPT-5.4 mini) · Average: 26.13 %
Full leaderboard
Showing 7 of 7

| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 58.65 |
| 2 | Gemma 4 31B | Google | 46.99 |
| 3 | Claude Opus 4.7 | Anthropic | 46.62 |
| 4 | Claude Sonnet 4.6 | Anthropic | 41.73 |
| 5 | Claude Haiku 4.5 | Anthropic | 36.47 |
| 6 | GPT-5.4 mini | OpenAI | 33.46 |
| 7 | GPT-5.4 nano | OpenAI | 18.80 |