title: All Bench Leaderboard
emoji: ššš
colorFrom: indigo
colorTo: pink
sdk: static
pinned: false
license: apache-2.0
short_description: Benchmark metrics for 90+ leading generative AI models
models:
- Qwen/Qwen3.5-122B-A10B
- Qwen/Qwen3.5-27B
- Qwen/Qwen3.5-35B-A3B
- Qwen/Qwen3.5-9B
- Qwen/Qwen3.5-4B
- Qwen/Qwen3-Next-80B-A3B-Thinking
- deepseek-ai/DeepSeek-V3
- deepseek-ai/DeepSeek-R1
- zai-org/GLM-5
- meta-llama/Llama-4-Scout-17B-16E-Instruct
- meta-llama/Llama-4-Maverick-17B-128E-Instruct
- microsoft/phi-4
- upstage/Solar-Open-100B
- K-intelligence/Midm-2.0-Base-Instruct
- Nanbeige/Nanbeige4.1-3B
- MiniMaxAI/MiniMax-M2.5
- stepfun-ai/Step-3.5-Flash
- OpenGVLab/InternVL3-78B
- Qwen/Qwen2.5-VL-72B-Instruct
- Qwen/Qwen3-VL-30B-A3B
- black-forest-labs/FLUX.1-dev
- stabilityai/stable-diffusion-3.5-large
- Lightricks/LTX-Video
- facebook/musicgen-large
- facebook/jasco-chords-drums-melody-1B
tags:
- leaderboard
- benchmark
- evaluation
- llm
- gpt
- claude
- gemini
- deepseek
- qwen
- korean-ai
- sovereign-ai
- arc-agi-2
- gpqa
- mmlu-pro
- swe-bench
- hle
- metacognitive
- final-bench
- aime
- reasoning
- multimodal
- open-source
- closed-source
- ai-comparison
- model-ranking
- k-ai
- exaone
- solar
- llm-benchmark
- agi
ALL Bench Leaderboard 2026: Unified Multi-Modal AI Benchmark
The only leaderboard comparing LLM · VLM · Agent · Image · Video · Music across 6 modalities in a single view.
Why ALL Bench?
Most leaderboards compare only LLM text scores. ALL Bench cross-verifies 42 LLMs + 11 VLMs + 10 Agents + 28 generative models and presents them in a single unified view. Every score carries a confidence badge; hover over it to see the exact source.
v2.1 Features
| Feature | Description |
|---|---|
| LLM Leaderboard | 42 models × 27 metrics with composite 5-axis scoring |
| VLM Benchmark | 11 flagship models (Gemini · GPT · Claude · InternVL3 · Kimi) + 5 edge models |
| Agent Bench | OSWorld · BrowseComp · Terminal-Bench 2.0 · SWE-Pro |
| Generative AI | Image 10 · Video 10 · Music 8 model comparison |
| Intelligence Report | Auto-generated Executive Summary with PDF/DOCX download |
| Confidence System | 3-tier badges: Cross-verified · Single source · Self-reported |
| Analysis Tools | Model Finder · Head-to-Head · Trust Map · Bar Race |
| Free API | 8 Gradio endpoints, no auth required |
Composite Score Formula
Score = Avg(verified benchmarks) × √(N/10)
10 Core Benchmarks: MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · FINAL Bench (Metacognition) · SWE-Pro · BFCL · IFEval · LiveCodeBench
5-Axis Framework: Knowledge · Expert Reasoning · Abstract Reasoning · Metacognition · Execution
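As a minimal sketch of the formula above, the computation might look like the following. It assumes that N is the number of core benchmarks (out of 10) with a verified score for a given model, and that scores are on a 0-100 scale; neither is stated explicitly here. The function name and the input dictionary are hypothetical, and the example scores are made up.

```python
import math

CORE_BENCHMARK_COUNT = 10  # the formula divides N by 10


def composite_score(verified_scores):
    """Return Avg(verified benchmarks) * sqrt(N / 10).

    `verified_scores` maps benchmark name -> score for the benchmarks
    that have a verified result. N is assumed to be len(verified_scores).
    """
    n = len(verified_scores)
    if n == 0:
        return 0.0  # assumption: no verified benchmarks yields a score of 0
    if n > CORE_BENCHMARK_COUNT:
        raise ValueError("expected at most 10 core benchmark scores")
    average = sum(verified_scores.values()) / n
    return average * math.sqrt(n / CORE_BENCHMARK_COUNT)


# Made-up scores for illustration only, not real leaderboard data.
example = {"MMLU-Pro": 80.0, "GPQA": 70.0, "AIME": 90.0, "IFEval": 80.0}
print(round(composite_score(example), 2))  # prints 50.6
```

Under this reading, the square-root factor discounts models with few verified results: a model with all 10 benchmarks verified keeps its plain average, while one with 4 verified benchmarks is scaled by about 0.63.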
FINAL Bench: Metacognitive Benchmark
FINAL Bench measures an AI model's self-correction ability. Error Recovery (ER) explains 94.8% of metacognitive performance variance.
- Dataset: FINAL-Bench/Metacognitive
- Leaderboard: FINAL-Bench/Leaderboard
- Featured in: Seoul Shinmun · Asia Economy · IT Chosun (2026.02.27)
API Usage
from gradio_client import Client
client = Client("VIDRAFT/ALL-Bench")
# Get all LLM data
llm = client.predict(api_name="/get_llm_data")
# Search models
results = client.predict("Claude", "llm", api_name="/search_models")
# Get everything at once
all_data = client.predict(api_name="/get_all_data")
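Remote calls like the ones above can fail transiently, so a script may want to retry them. The following is a minimal sketch of a retry wrapper that works with any object exposing `predict(*args, api_name=...)`, such as the `Client` created above. `call_with_retry` and its `attempts` and `delay_seconds` parameters are hypothetical helpers, not part of `gradio_client`.

```python
import time


def call_with_retry(client, api_name, *args, attempts=3, delay_seconds=2.0):
    """Call client.predict(*args, api_name=api_name), retrying on failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return client.predict(*args, api_name=api_name)
        except Exception as error:  # broad on purpose: failure types vary
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"{api_name} failed after {attempts} attempts") from last_error
```

With the client from the example above, `call_with_retry(client, "/get_llm_data")` would stand in for the first call, and `call_with_retry(client, "/search_models", "Claude", "llm")` for the search.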
Data Files
| File | Contents |
|---|---|
| `llm.json` | 42 LLM models (27 metrics each) |
| `vlm.json` | VLM benchmarks (34 benchmarks) |
| `agent.json` | Agent benchmarks (8 benchmarks) |
| `image.json` | 10 image generation models |
| `video.json` | 10 video generation models |
| `music.json` | 8 music generation models |
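A minimal sketch for loading the files listed above from a local directory is shown here. It assumes the files have already been downloaded and that each one holds valid JSON. The internal schema is not documented here, so the sketch only reports the top-level type and size. `load_data_files` and its `data_dir` parameter are hypothetical.

```python
import json
from pathlib import Path

DATA_FILES = ["llm.json", "vlm.json", "agent.json", "image.json", "video.json", "music.json"]


def load_data_files(data_dir="."):
    """Load every listed file that exists in data_dir; skip missing ones."""
    loaded = {}
    for name in DATA_FILES:
        path = Path(data_dir) / name
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                loaded[name] = json.load(handle)
    return loaded


if __name__ == "__main__":
    for name, content in load_data_files().items():
        # Only top-level facts are printed because the schema is not known here.
        size = len(content) if isinstance(content, (list, dict)) else 1
        print(f"{name}: {type(content).__name__} with {size} top-level entries")
```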
Citation
@misc{allbench2026,
title={ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation},
author={ALL Bench Team},
year={2026},
url={https://huggingface.co/spaces/VIDRAFT/ALL-Bench}
}
ALL Bench Leaderboard v2.1 · Updated 2026.03.08
#AIBenchmark #LLMLeaderboard #GPT5 #Claude #Gemini #ALLBench #FINALBench #Metacognition #VLM #AIAgent #MultiModal #HuggingFace #OpenSource #AIEvaluation #DeepLearning #ARC-AGI