LongBench v2
Category: Long context. Measures long-context understanding across documents of 32K–2M tokens: whether the model can retrieve and reason over facts buried deep in the input.
At a glance
- 🏆 Top score: 61.11 % (Qwen3.5-27B)
- Total results: 9 (9 verified · 0 self-reported)
- Models tested: 9
- Providers: 4
- Average: 30.04 % · Median: 31.63 %
- Range: 1.11 – 61.11 %
- Latest result: Apr 18, 2026
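The aggregate figures above can be reproduced from the per-model scores in the full leaderboard table; a minimal sketch in Python, with the scores transcribed from that table:

```python
from statistics import mean, median

# Per-model scores (%) transcribed from the full leaderboard table.
scores = [61.11, 54.42, 53.89, 43.72, 31.63, 15.00, 5.56, 3.89, 1.11]

print(f"Average: {mean(scores):.2f} %")                   # 30.04 %
print(f"Median: {median(scores):.2f} %")                  # 31.63 %
print(f"Range: {min(scores):.2f} – {max(scores):.2f} %")  # 1.11 – 61.11 %
```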
Score distribution
(Histogram of per-model scores; bin labels and counts are not recoverable from this page extract.)
Methodology
Multi-task benchmark covering long-doc QA, summarization, and code over 32K–2M token inputs.
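As a rough illustration of how a multiple-choice QA task like this is typically scored, here is a generic accuracy sketch; this is not LongBench v2's actual harness, and the field names and letter-choice format are illustrative assumptions:

```python
def accuracy(examples, predict):
    """Percent of examples where the model picks the gold choice.

    examples: dicts with 'context', 'question', and a gold 'answer' letter;
    predict: callable (context, question) -> predicted letter.
    Both the schema and the letter-choice format are assumptions, not the
    benchmark's actual data format.
    """
    correct = sum(
        predict(ex["context"], ex["question"]) == ex["answer"]
        for ex in examples
    )
    return 100.0 * correct / len(examples)


# Toy usage with a trivial predictor that always answers "A".
toy = [
    {"context": "…long document…", "question": "Q1", "answer": "A"},
    {"context": "…long document…", "question": "Q2", "answer": "B"},
]
print(accuracy(toy, lambda ctx, q: "A"))  # 50.0
```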
Limitations
The dataset covers English and Chinese; coverage of other languages is weak.
By provider
- Alibaba · 1 model · Best: 61.11 % (Qwen3.5-27B) · Average: 61.11 %
- OpenAI · 3 models · Best: 54.42 % (GPT-5.4) · Average: 43.26 %
- Anthropic · 3 models · Best: 53.89 % (Claude Opus 4.7) · Average: 20.19 %
- Google · 2 models · Best: 15.00 % (Gemini 2.5 Flash) · Average: 9.45 %
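The per-provider rollups can be recomputed from the leaderboard rows; a sketch in Python. Note that the provider attribution for the Gemini and Gemma entries is inferred from the model names, since the extracted table leaves their provider column blank:

```python
from collections import defaultdict
from statistics import mean

# (model, provider, score %) rows from the full leaderboard table.
# "Google" for the Gemini and Gemma rows is inferred, not stated in the table.
rows = [
    ("Qwen3.5-27B", "Alibaba", 61.11),
    ("GPT-5.4", "OpenAI", 54.42),
    ("Claude Opus 4.7", "Anthropic", 53.89),
    ("GPT-5.4 mini", "OpenAI", 43.72),
    ("GPT-5.4 nano", "OpenAI", 31.63),
    ("Gemini 2.5 Flash", "Google", 15.00),
    ("Claude Sonnet 4.6", "Anthropic", 5.56),
    ("Gemma 4 31B", "Google", 3.89),
    ("Claude Haiku 4.5", "Anthropic", 1.11),
]

# Group scores by provider, then report model count, best, and average.
by_provider = defaultdict(list)
for model, provider, score in rows:
    by_provider[provider].append(score)

for provider, scores in by_provider.items():
    print(f"{provider}: {len(scores)} model(s), "
          f"best {max(scores):.2f} %, average {mean(scores):.2f} %")
```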
Full leaderboard
Showing 9 of 9
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 61.11 |
| 2 | GPT-5.4 | OpenAI | 54.42 |
| 3 | Claude Opus 4.7 | Anthropic | 53.89 |
| 4 | GPT-5.4 mini | OpenAI | 43.72 |
| 5 | GPT-5.4 nano | OpenAI | 31.63 |
| 6 | Gemini 2.5 Flash | Google | 15.00 |
| 7 | Claude Sonnet 4.6 | Anthropic | 5.56 |
| 8 | Gemma 4 31B | Google | 3.89 |
| 9 | Claude Haiku 4.5 | Anthropic | 1.11 |