NoLiMa
Long-context information retrieval without literal matching: models must use semantic reasoning to find relevant facts, not string matching.
At a glance
- 🏆 Top score: 83.46 % (Claude Opus 4.7)
- Total results: 6
- Models tested: 6
- Providers: 4
- Verified · Self-reported: 6 · 0
- Average: 31.72 %
- Median: 11.94 %
- Range: 2.92 – 83.46 %
- Latest result: Apr 18, 2026
Score distribution: histogram not reproduced; see the leaderboard below for individual scores.
Methodology
Retrieval tasks in which the query and the relevant passage share no literal token overlap, so the model must rely on semantic understanding rather than keyword matching.
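To make the task concrete, here is a minimal illustrative sketch (not the official NoLiMa harness; the needle and question below are hypothetical examples in the benchmark's style). The question and the hidden fact share no content words, so a literal keyword retriever finds nothing to latch onto; answering requires knowing, semantically, that the entities are related.

```python
# Hypothetical NoLiMa-style item: no lexical overlap between question and needle.
needle = "Yuki lives next to the Semper Opera House."  # hidden fact in a long context
question = "Which character has been to Dresden?"      # needs world knowledge to link

def literal_overlap(a: str, b: str) -> set[str]:
    """Content-word overlap after lowercasing and dropping common stopwords."""
    stop = {"the", "to", "a", "has", "been", "which", "lives", "next"}
    tokens_a = {w.strip(".?,").lower() for w in a.split()} - stop
    tokens_b = {w.strip(".?,").lower() for w in b.split()} - stop
    return tokens_a & tokens_b

# String matching fails: the overlap is empty, yet the needle answers the question.
print(literal_overlap(needle, question))  # -> set()
```

A retriever scoring passages by shared tokens would rank this needle no higher than any distractor, which is exactly the failure mode the benchmark isolates.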
Limitations
Focuses on retrieval — does not measure downstream reasoning on retrieved facts.
By provider
- Anthropic · 2 models: best 83.46 % (Claude Opus 4.7), average 43.19 %
- Alibaba · 1 model: best 73.46 % (Qwen3.5 397B A17B), average 73.46 %
Full leaderboard
Showing 6 of 6

| # | Model | Provider | Score (%) |
|---|---|---|---|
| 1 | Claude Opus 4.7 | Anthropic | 83.46 |
| 2 | Qwen3.5 397B A17B | Alibaba | 73.46 |
| 3 | Grok 4.20 | xAI | 14.02 |
| 4 | GPT-5.4 nano | OpenAI | 9.85 |
| 5 | GPT-5.4 mini | OpenAI | 6.62 |
| 6 | Claude Haiku 4.5 | Anthropic | 2.92 |
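The summary statistics in the "At a glance" section can be recomputed from the leaderboard scores above; a quick sanity check, assuming a simple arithmetic mean and a midpoint median over the six results:

```python
import statistics

# Scores (percent) copied from the leaderboard above.
scores = [83.46, 73.46, 14.02, 9.85, 6.62, 2.92]

mean = statistics.mean(scores)      # ~31.72, matching the reported average
median = statistics.median(scores)  # midpoint of 14.02 and 9.85, ~11.94
lo, hi = min(scores), max(scores)   # 2.92 and 83.46, the reported range

print(mean, median, lo, hi)
```

The wide gap between mean (~31.72 %) and median (~11.94 %) reflects the two outlier scores above 70 % pulling the average up.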