LongBench v2
Category: Long context. Measures long-context understanding across documents of 32K–2M tokens: whether the model can retrieve and reason over facts buried deep in the input.
At a glance
- 🏆 Top score: 61.11 % (Qwen3.5-27B)
- Total results: 9 (9 verified · 0 self-reported)
- Models tested: 9
- Providers: 4
- Average: 30.04 % · Median: 31.63 %
- Range: 1.11 – 61.11 %
- Latest result: Apr 18, 2026
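The aggregate figures above can be reproduced from the per-model scores in the full leaderboard table; a minimal sketch in Python, with the scores transcribed from that table:

```python
from statistics import mean, median

# Per-model scores (%) transcribed from the full leaderboard table.
scores = [61.11, 54.42, 53.89, 43.72, 31.63, 15.00, 5.56, 3.89, 1.11]

print(f"Average: {mean(scores):.2f} %")                   # 30.04 %
print(f"Median: {median(scores):.2f} %")                  # 31.63 %
print(f"Range: {min(scores):.2f} – {max(scores):.2f} %")  # 1.11 – 61.11 %
```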
Score distribution
(Histogram of per-model scores; bin labels and counts are not recoverable from this page extract.)
Methodology
Multi-task benchmark covering long-doc QA, summarization, and code over 32K–2M token inputs.
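As a rough illustration of how a multiple-choice QA task like this is typically scored, here is a generic accuracy sketch; this is not LongBench v2's actual harness, and the field names and letter-choice format are illustrative assumptions:

```python
def accuracy(examples, predict):
    """Percent of examples where the model picks the gold choice.

    examples: dicts with 'context', 'question', and a gold 'answer' letter;
    predict: callable (context, question) -> predicted letter.
    Both the schema and the letter-choice format are assumptions, not the
    benchmark's actual data format.
    """
    correct = sum(
        predict(ex["context"], ex["question"]) == ex["answer"]
        for ex in examples
    )
    return 100.0 * correct / len(examples)


# Toy usage with a trivial predictor that always answers "A".
toy = [
    {"context": "…long document…", "question": "Q1", "answer": "A"},
    {"context": "…long document…", "question": "Q2", "answer": "B"},
]
print(accuracy(toy, lambda ctx, q: "A"))  # 50.0
```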
Limitations
The dataset covers English and Chinese; coverage of other languages is weak.
By provider
- Alibaba · 1 model · Best: 61.11 % (Qwen3.5-27B) · Average: 61.11 %
- OpenAI · 3 models · Best: 54.42 % (GPT-5.4) · Average: 43.26 %
- Anthropic · 3 models · Best: 53.89 % (Claude Opus 4.7) · Average: 20.19 %
- Google · 2 models · Best: 15.00 % (Gemini 2.5 Flash) · Average: 9.45 %
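The per-provider rollups can be recomputed from the leaderboard rows; a sketch in Python. Note that the provider attribution for the Gemini and Gemma entries is inferred from the model names, since the extracted table leaves their provider column blank:

```python
from collections import defaultdict
from statistics import mean

# (model, provider, score %) rows from the full leaderboard table.
# "Google" for the Gemini and Gemma rows is inferred, not stated in the table.
rows = [
    ("Qwen3.5-27B", "Alibaba", 61.11),
    ("GPT-5.4", "OpenAI", 54.42),
    ("Claude Opus 4.7", "Anthropic", 53.89),
    ("GPT-5.4 mini", "OpenAI", 43.72),
    ("GPT-5.4 nano", "OpenAI", 31.63),
    ("Gemini 2.5 Flash", "Google", 15.00),
    ("Claude Sonnet 4.6", "Anthropic", 5.56),
    ("Gemma 4 31B", "Google", 3.89),
    ("Claude Haiku 4.5", "Anthropic", 1.11),
]

# Group scores by provider, then report model count, best, and average.
by_provider = defaultdict(list)
for model, provider, score in rows:
    by_provider[provider].append(score)

for provider, scores in by_provider.items():
    print(f"{provider}: {len(scores)} model(s), "
          f"best {max(scores):.2f} %, average {mean(scores):.2f} %")
```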
Full leaderboard
Showing 9 of 9
| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Qwen3.5-27B | Alibaba | 61.11 |
| 2 | GPT-5.4 | OpenAI | 54.42 |
| 3 | Claude Opus 4.7 | Anthropic | 53.89 |
| 4 | GPT-5.4 mini | OpenAI | 43.72 |
| 5 | GPT-5.4 nano | OpenAI | 31.63 |
| 6 | Gemini 2.5 Flash | Google | 15.00 |
| 7 | Claude Sonnet 4.6 | Anthropic | 5.56 |
| 8 | Gemma 4 31B | Google | 3.89 |
| 9 | Claude Haiku 4.5 | Anthropic | 1.11 |