MRCR v2
Long context · Multi-Round Conversational Reasoning — tests whether a model can maintain facts and context across long multi-turn dialogues.
At a glance
🏆 Top score: 50.78 % (Claude Sonnet 4.6)
Total results: 9
Models tested: 9
Providers: 5
Verified · Self-reported: 9 · 0
Average: 28.67 %
Median: 23.76 %
Range: 15.41 – 50.78 %
Latest result: Apr 18, 2026
Score distribution
(Histogram of the 9 scores; chart not reproduced in text.)
Methodology
Conversational chains of 8–32 turns requiring the model to track entities, preferences, and reasoning steps.
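The benchmark's item format is not published on this page, so the following is a minimal sketch of what a multi-round tracking item could look like, assuming a plant-facts-then-query design; all names (`make_item`, `score`, the entity and preference pools) are hypothetical, and the real grading metric is unknown.

```python
import random

def make_item(n_turns=8, seed=0):
    """Build a synthetic dialogue that plants one fact per turn (later turns
    may overwrite earlier preferences), then asks about one planted fact."""
    rng = random.Random(seed)
    entities = ["Alice", "Bob", "Carol", "Dan"]
    prefs = ["tea", "coffee", "cocoa", "juice"]
    facts = {}
    turns = []
    for _ in range(n_turns):
        who, what = rng.choice(entities), rng.choice(prefs)
        facts[who] = what  # the model must track the *latest* value
        turns.append({"role": "user", "content": f"Note: {who} prefers {what}."})
        turns.append({"role": "assistant", "content": "Noted."})
    target = rng.choice(list(facts))
    turns.append({"role": "user", "content": f"What does {target} prefer?"})
    return turns, facts[target]

def score(answer, expected):
    """Exact-substring grading; a stand-in for the benchmark's real metric."""
    return 1.0 if expected in answer.lower() else 0.0

turns, expected = make_item(n_turns=8, seed=42)
print(len(turns), "turns; expected answer:", expected)
print(score(f"I believe they prefer {expected}.", expected))  # 1.0
```

A chain of 8 fact-planting turns yields 17 messages (16 plant/ack pairs plus the final question); scaling `n_turns` toward 32 stretches the distance between the planted fact and the query, which is the axis this benchmark stresses.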
Limitations
Synthetic conversations may not capture organic dialogue patterns.
By provider
- Anthropic · 3 models · Best: 50.78 % (Claude Sonnet 4.6) · Average: 38.32 %
- Google · 1 model · Best: 33.17 % (Gemma 4 31B) · Average: 33.17 %
Full leaderboard
Showing 9 of 9

| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Claude Sonnet 4.6 | Anthropic | 50.78 |
| 2 | Claude Sonnet 4 | Anthropic | 47.54 |
| 3 | Gemma 4 31B | Google | 33.17 |
| 4 | GPT-5.4 | OpenAI | 29.84 |
| 5 | GPT-5.4 mini | OpenAI | 23.76 |
| 6 | Qwen3.5-27B | Alibaba | 20.72 |
| 7 | Grok 4.20 | xAI | 20.19 |
| 8 | Claude Haiku 4.5 | Anthropic | 16.65 |
| 9 | GPT-5.4 nano | OpenAI | 15.41 |