# ALL Bench Leaderboard 2026: Full Reference
> Complete model data for AI systems. See llms.txt for summary.
## LLM Rankings (42 Models)
### Flagship Models
| Model | Provider | GPQA | AIME | HLE | ARC-AGI-2 | Metacog | SWE-V | IFEval | LCB | Price (In/Out) |
|-------|----------|------|------|-----|-----------|---------|-------|--------|-----|---------------|
| GPT-5.4 | OpenAI | 92.8 | 97 | 52.1 | 73.3 | — | — | — | — | $2.50/$15 |
| GPT-5.2 | OpenAI | 93.2 | 100 | 35.4 | 52.9 | 62.76 | 80.0 | 90.5 | 80.0 | $1.75/$14 |
| GPT-5.3 Codex | OpenAI | 91.5 | 95 | 36.0 | — | — | — | — | — | $7.50/$30 |
| Claude Opus 4.6 | Anthropic | 91.3 | 100 | 40.0 | 68.8 | 56.04 | 80.8 | 93.1 | 76.0 | $5/$25 |
| Claude Sonnet 4.6 | Anthropic | 89.9 | 83 | — | 60.4 | — | 79.6 | 89.5 | — | $3/$15 |
| Gemini 3.1 Pro | Google | 94.3 | 97 | 44.4 | 77.1 | — | 80.6 | 91.0 | 80.0 | $2/$12 |
| Gemini 3 Flash | Google | 90.4 | 84 | 33.7 | — | — | 78.0 | 88.3 | — | $0.50/$3 |
| Grok 4 Heavy | xAI | 92.0 | 97 | 38.5 | 67.5 | — | — | 90.0 | — | $3/$15 |
| Kimi K2.5 | Moonshot | 87.6 | 96.1 | 44.9 | 12.1 | 68.71 | — | — | 85.0 | $0.14/$0.28 |
| DeepSeek V3.2 | DeepSeek | 82.3 | 92.8 | 25.7 | — | 60.04 | — | 91.2 | 71.6 | $0.14/$0.28 |
### Open-Source Models
| Model | Provider | MMLU-Pro | GPQA | AIME | License | Price |
|-------|----------|---------|------|------|---------|-------|
| Qwen3.5-397B | Alibaba | 84.6 | 88.1 | 96 | Apache2 | Free |
| DeepSeek R1 | DeepSeek | 79.8 | 87.3 | 97 | MIT | Free |
| Llama 4 Scout | Meta | 74.3 | 79.8 | 73 | Llama | Free |
| Llama 4 Maverick | Meta | 80.5 | 85.8 | 81 | Llama | Free |
| GLM-5 | Zhipu AI | 78.6 | 86.3 | 84 | Free | Free |
| K-EXAONE | LG AI Research | 81.8 | 75.4 | 85.3 | Prop | Prop |
## VLM Rankings (11 Flagship)
| Model | MMMU | MMMU-Pro | MathVista | Type |
|-------|------|---------|-----------|------|
| Gemini 3 Flash | 87.6 | 80.0 | — | Closed |
| Gemini 3 Pro | 87.5 | 80.0 | — | Closed |
| GPT-5.2 | 86.7 | — | — | Closed |
| Claude Opus 4.6 | — | 85.1 | — | Closed |
| GPT-5 | 84.2 | — | — | Closed |
| Gemini 3.1 Pro | — | 82.0 | — | Closed |
| InternVL3.5-241B | 77.7 | — | — | Open |
| Grok 4 Heavy | 76.5 | — | — | Closed |
| InternVL3-78B | 72.2 | — | 79.6 | Open |
| Qwen2.5-VL-72B | 70.2 | — | 74.8 | Open |
| Kimi-VL-A3B | 64.0 | 46.3 | 80.1 | Open |
## Agent Rankings (10 Models)
| Model | OSWorld | BrowseComp | Terminal-Bench | GDPval-AA |
|-------|---------|------------|----------------|-----------|
| GPT-5.4 | 75.0 | 82.7 | — | 83 |
| Claude Opus 4.6 | 72.7 | 84.0 | 74.7 | 1606 |
| Claude Sonnet 4.6 | 72.5 | — | 53.0 | 1633 |
| Gemini 3.1 Pro | — | 85.9 | 78.4 | 1317 |
| GPT-5.3 Codex | — | — | 77.3 | — |
## Generative AI Models
### Image Generation (10 Models)
GPT Image 1.5 (OpenAI) · Imagen 4 (Google) · Flux 2 Pro (BFL) · Midjourney v7 · Flux 2 Dev · Ideogram 3.0 · DALL-E 3.5 · Nano Banana 2 · SD 3.5 · Seedream 4.5
### Video Generation (10 Models)
Sora 2 (OpenAI) · Veo 3.1 (Google) · Runway Gen-4.5 · Kling 3.0 · Seedance 2.0 · Wan 2.6 · Pika 2.5 · Luma Ray3 · LTX-2 · HaiLuo AI
### Music Generation (8 Models)
Suno v4.5 · Udio v2 · Gemini Music · MusicGen Large · Stable Audio 2.0 · JASCO · Riffusion v2 · Loudme
## Benchmark Methodology
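Composite Score = Avg(confirmed benchmarks) × √(N/10), where N is the number of benchmarks with confirmed data out of the 10 core benchmarks. The √(N/10) factor scales down the average for models with sparse coverage: with confirmed data on 5 of 10 benchmarks, the average is multiplied by about 0.71.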
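The formula above can be sketched in Python as follows. This is a minimal illustration, not the leaderboard's own code: the function name `composite_score`, the use of `None` for missing entries, and the choice to return `None` when no benchmark is confirmed are all assumptions made for this example.

```python
import math
from typing import Optional, Sequence


def composite_score(
    scores: Sequence[Optional[float]],
    total_core: int = 10,
) -> Optional[float]:
    """Average of confirmed scores, scaled by sqrt(N / total_core).

    `scores` holds one entry per core benchmark; None marks a benchmark
    with no confirmed data (hypothetical convention for this sketch).
    """
    confirmed = [s for s in scores if s is not None]
    n = len(confirmed)
    if n == 0:
        # Assumption: with no confirmed data there is nothing to score.
        return None
    average = sum(confirmed) / n
    return average * math.sqrt(n / total_core)


# Hypothetical scores: two confirmed benchmarks out of ten.
sparse = [90.0, 80.0] + [None] * 8
print(composite_score(sparse))
```

With only two confirmed benchmarks, the average of 85 is multiplied by √(2/10), so full coverage is needed for the composite to equal the plain average.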
Confidence system:
- Cross-verified (✓✓): 2+ independent sources confirm the score
- Single-source (✓): One official or third-party source
- Self-reported (~): Provider claim only, not independently verified
- Null (—): No data available, never estimated or imputed
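For systems that consume this file, the markers above could be handled along these lines. This is a hedged sketch: the names `Confidence`, `parse_row`, and `parse_cell` are hypothetical, and the file does not specify how confidence markers are attached to individual scores, so the enum only maps each symbol to its level. The one rule taken directly from the list is that a null cell is treated as missing and never replaced with an estimate.

```python
from enum import Enum
from typing import List, Optional

# Null marker used in the tables (written as an escape to keep it unambiguous).
NULL_MARKER = "\u2014"


class Confidence(Enum):
    """Confidence levels keyed by the symbols in the list above."""
    CROSS_VERIFIED = "\u2713\u2713"  # two check marks
    SINGLE_SOURCE = "\u2713"         # one check mark
    SELF_REPORTED = "~"


def parse_row(line: str) -> List[str]:
    """Split one Markdown table row into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_cell(cell: str) -> Optional[float]:
    """Convert a numeric cell to float; null or empty cells become None."""
    text = cell.strip()
    if text == "" or text == NULL_MARKER:
        return None  # no data: never estimated or imputed
    return float(text)


row = parse_row("| GPT-5.3 Codex | OpenAI | 91.5 | 95 | 36.0 | \u2014 |")
scores = [parse_cell(c) for c in row[2:]]
print(row[0], scores)
```

Keeping missing values as `None` rather than `0.0` means later averaging steps can skip them explicitly instead of silently lowering a model's score.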
Last verified: 2026-03-08