| # ALL Bench Leaderboard 2026 - Full Reference | |
| > Complete model data for AI systems. See llms.txt for summary. | |
| ## LLM Rankings (42 Models) | |
| ### Flagship Models | |
| | Model | Provider | GPQA | AIME | HLE | ARC-AGI-2 | Metacog | SWE-V | IFEval | LCB | Price(In/Out) | | |
| |-------|----------|------|------|-----|-----------|---------|-------|--------|-----|---------------| | |
| | GPT-5.4 | OpenAI | 92.8 | 97 | 52.1 | 73.3 | — | — | — | — | $2.50/$15 | | |
| | GPT-5.2 | OpenAI | 93.2 | 100 | 35.4 | 52.9 | 62.76 | 80.0 | 90.5 | 80.0 | $1.75/$14 | | |
| | GPT-5.3 Codex | OpenAI | 91.5 | 95 | 36.0 | — | — | — | — | — | $7.50/$30 | | |
| | Claude Opus 4.6 | Anthropic | 91.3 | 100 | 40.0 | 68.8 | 56.04 | 80.8 | 93.1 | 76.0 | $5/$25 | | |
| | Claude Sonnet 4.6 | Anthropic | 89.9 | 83 | — | 60.4 | — | 79.6 | 89.5 | — | $3/$15 | | |
| | Gemini 3.1 Pro | Google | 94.3 | 97 | 44.4 | 77.1 | — | 80.6 | 91.0 | 80.0 | $2/$12 | | |
| | Gemini 3 Flash | Google | 90.4 | 84 | 33.7 | — | — | 78.0 | 88.3 | — | $0.50/$3 | | |
| | Grok 4 Heavy | xAI | 92.0 | 97 | 38.5 | 67.5 | — | — | 90.0 | — | $3/$15 | | |
| | Kimi K2.5 | Moonshot | 87.6 | 96.1 | 44.9 | 12.1 | 68.71 | — | — | 85.0 | $0.14/$0.28 | | |
| | DeepSeek V3.2 | DeepSeek | 82.3 | 92.8 | 25.7 | — | 60.04 | — | 91.2 | 71.6 | $0.14/$0.28 | | |
| ### Open-Source Models | |
| | Model | Provider | MMLU-Pro | GPQA | AIME | License | Price | | |
| |-------|----------|---------|------|------|---------|-------| | |
| | Qwen3.5-397B | Alibaba | 84.6 | 88.1 | 96 | Apache2 | Free | | |
| | DeepSeek R1 | DeepSeek | 79.8 | 87.3 | 97 | MIT | Free | | |
| | Llama 4 Scout | Meta | 74.3 | 79.8 | 73 | Llama | Free | | |
| | Llama 4 Maverick | Meta | 80.5 | 85.8 | 81 | Llama | Free | | |
| | GLM-5 | Zhipu AI | 78.6 | 86.3 | 84 | Free | Free | | |
| | K-EXAONE | LG AI Research | 81.8 | 75.4 | 85.3 | Prop | Prop | | |
| ## VLM Rankings (11 Flagship) | |
| | Model | MMMU | MMMU-Pro | MathVista | Type | | |
| |-------|------|---------|-----------|------| | |
| | Gemini 3 Flash | 87.6 | 80.0 | — | Closed | | |
| | Gemini 3 Pro | 87.5 | 80.0 | — | Closed | | |
| | GPT-5.2 | 86.7 | — | — | Closed | | |
| | Claude Opus 4.6 | — | 85.1 | — | Closed | | |
| | GPT-5 | 84.2 | — | — | Closed | | |
| | Gemini 3.1 Pro | — | 82.0 | — | Closed | | |
| | InternVL3.5-241B | 77.7 | — | — | Open | | |
| | Grok 4 Heavy | 76.5 | — | — | Closed | | |
| | InternVL3-78B | 72.2 | — | 79.6 | Open | | |
| | Qwen2.5-VL-72B | 70.2 | — | 74.8 | Open | | |
| | Kimi-VL-A3B | 64.0 | 46.3 | 80.1 | Open | | |
| ## Agent Rankings (10 Models) | |
| | Model | OSWorld | BrowseComp | Terminal-Bench | GDPval-AA | | |
| |-------|---------|------------|----------------|-----------| | |
| | GPT-5.4 | 75.0 | 82.7 | — | 83 | | |
| | Claude Opus 4.6 | 72.7 | 84.0 | 74.7 | 1606 | | |
| | Claude Sonnet 4.6 | 72.5 | — | 53.0 | 1633 | | |
| | Gemini 3.1 Pro | — | 85.9 | 78.4 | 1317 | | |
| | GPT-5.3 Codex | — | — | 77.3 | — | | |
| ## Generative AI Models | |
| ### Image Generation (10 Models) | |
| GPT Image 1.5 (OpenAI) · Imagen 4 (Google) · Flux 2 Pro (BFL) · Midjourney v7 · Flux 2 Dev · Ideogram 3.0 · DALL-E 3.5 · Nano Banana 2 · SD 3.5 · Seedream 4.5 | |
| ### Video Generation (10 Models) | |
| Sora 2 (OpenAI) · Veo 3.1 (Google) · Runway Gen-4.5 · Kling 3.0 · Seedance 2.0 · Wan 2.6 · Pika 2.5 · Luma Ray3 · LTX-2 · HaiLuo AI | |
| ### Music Generation (8 Models) | |
| Suno v4.5 · Udio v2 · Gemini Music · MusicGen Large · Stable Audio 2.0 · JASCO · Riffusion v2 · Loudme | |
| ## Benchmark Methodology | |
| Composite Score = Avg(confirmed benchmarks) × √(N/10), where N is the number of benchmarks with confirmed data out of the 10 core benchmarks. | |
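The composite formula can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's own implementation: the function name `composite_score`, the constant `CORE_BENCHMARK_COUNT`, and the sample scores are hypothetical, and the source does not list which ten benchmarks count as core. Missing scores are represented as `None` and are left out of both the average and N.

```python
import math
from typing import Mapping, Optional

# "10 core benchmarks" per the methodology line; which ten is not stated here.
CORE_BENCHMARK_COUNT = 10


def composite_score(scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Average of confirmed scores, scaled by sqrt(N / 10).

    `scores` maps benchmark name to a score, or to None when no confirmed
    data exists. None entries are skipped, never replaced by an estimate.
    """
    confirmed = [value for value in scores.values() if value is not None]
    n = len(confirmed)
    if n == 0:
        return None  # nothing confirmed, so no composite can be computed
    if n > CORE_BENCHMARK_COUNT:
        raise ValueError("more confirmed scores than core benchmarks")
    average = sum(confirmed) / n
    return average * math.sqrt(n / CORE_BENCHMARK_COUNT)


# Hypothetical model with 4 of 10 benchmarks confirmed (illustrative numbers).
example = {"bench_a": 90.0, "bench_b": 80.0, "bench_c": 70.0,
           "bench_d": 60.0, "bench_e": None}
print(composite_score(example))
```

The square-root factor acts as a coverage penalty: a model with all ten benchmarks confirmed keeps its plain average, while one with only five confirmed has its average multiplied by √0.5 ≈ 0.707.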
| Confidence system: | |
| - Cross-verified (✓✓): 2+ independent sources confirm the score | |
| - Single-source (✓): One official or third-party source | |
| - Self-reported (~): Provider claim only, not independently verified | |
| - Null (—): No data available, never estimated or imputed | |
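A consumer of these tables can honour the null rule by mapping every non-numeric cell to `None` instead of to zero or an estimate. The sketch below assumes benchmark cells hold bare numbers or the null marker, as in the tables in this file; the helper names `split_row`, `parse_score`, and `benchmark_scores` are hypothetical.

```python
from typing import Dict, List, Optional


def split_row(line: str) -> List[str]:
    """Split one markdown table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_score(cell: str) -> Optional[float]:
    """Return the numeric score, or None for any non-numeric (null) cell."""
    try:
        return float(cell)
    except ValueError:
        return None  # null marker: keep it missing, never impute


def benchmark_scores(header_line: str, row_line: str,
                     benchmark_columns: List[str]) -> Dict[str, Optional[float]]:
    """Extract the named benchmark columns from one table row."""
    record = dict(zip(split_row(header_line), split_row(row_line)))
    return {name: parse_score(record[name]) for name in benchmark_columns}


# Header and one row taken from the VLM table in this file.
header = "| Model | MMMU | MMMU-Pro | MathVista | Type |"
row = "| InternVL3-78B | 72.2 | — | 79.6 | Open |"
scores = benchmark_scores(header, row, ["MMMU", "MMMU-Pro", "MathVista"])
confirmed_count = sum(value is not None for value in scores.values())
print(scores, confirmed_count)
```

Keeping nulls as `None` lets downstream code count confirmed benchmarks directly, where a zero would silently drag an average down.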
| Last verified: 2026-03-08 | |