metadata
title: All Bench Leaderboard
emoji: 🏆🏆🏆
colorFrom: indigo
colorTo: pink
sdk: static
pinned: false
license: apache-2.0
short_description: Benchmark metrics for 90+ leading generative AI models
models:
  - Qwen/Qwen3.5-122B-A10B
  - Qwen/Qwen3.5-27B
  - Qwen/Qwen3.5-35B-A3B
  - Qwen/Qwen3.5-9B
  - Qwen/Qwen3.5-4B
  - Qwen/Qwen3-Next-80B-A3B-Thinking
  - deepseek-ai/DeepSeek-V3
  - deepseek-ai/DeepSeek-R1
  - zai-org/GLM-5
  - meta-llama/Llama-4-Scout-17B-16E-Instruct
  - meta-llama/Llama-4-Maverick-17B-128E-Instruct
  - microsoft/phi-4
  - upstage/Solar-Open-100B
  - K-intelligence/Midm-2.0-Base-Instruct
  - Nanbeige/Nanbeige4.1-3B
  - MiniMaxAI/MiniMax-M2.5
  - stepfun-ai/Step-3.5-Flash
  - OpenGVLab/InternVL3-78B
  - Qwen/Qwen2.5-VL-72B-Instruct
  - Qwen/Qwen3-VL-30B-A3B
  - black-forest-labs/FLUX.1-dev
  - stabilityai/stable-diffusion-3.5-large
  - Lightricks/LTX-Video
  - facebook/musicgen-large
  - facebook/jasco-chords-drums-melody-1B
tags:
  - leaderboard
  - benchmark
  - evaluation
  - llm
  - gpt
  - claude
  - gemini
  - deepseek
  - qwen
  - korean-ai
  - sovereign-ai
  - arc-agi-2
  - gpqa
  - mmlu-pro
  - swe-bench
  - hle
  - metacognitive
  - final-bench
  - aime
  - reasoning
  - multimodal
  - open-source
  - closed-source
  - ai-comparison
  - model-ranking
  - k-ai
  - exaone
  - solar
  - llm-benchmark
  - agi

🏆 ALL Bench Leaderboard 2026: Unified Multi-Modal AI Benchmark

The only leaderboard comparing LLM · VLM · Agent · Image · Video · Music across 6 modalities in a single view.


Why ALL Bench?

Most leaderboards only compare LLM text scores. ALL Bench cross-verifies 42 LLMs + 11 VLMs + 10 Agents + 28 generative models and presents them in a single unified view. Every score carries a ✓✓ confidence badge; hover over it to see the exact source.

v2.1 Features

| Feature | Description |
|---|---|
| 📊 LLM Leaderboard | 42 models × 27 metrics with composite 5-axis scoring |
| 👁 VLM Benchmark | 11 flagship models (Gemini · GPT · Claude · InternVL3 · Kimi) + 5 edge models |
| 🤖 Agent Bench | OSWorld · BrowseComp · Terminal-Bench 2.0 · SWE-Pro |
| 🖼🎬🎵 Generative AI | Image 10 · Video 10 · Music 8 model comparison |
| 📄 Intelligence Report | Auto-generated Executive Summary with PDF/DOCX download |
| ✓✓ Confidence System | Cross-verified · Single source · Self-reported (3-tier badges) |
| 🔍 Analysis Tools | Model Finder · Head-to-Head · Trust Map · Bar Race |
| 📡 Free API | 8 Gradio endpoints, no auth required |

Composite Score Formula

Score = Avg(verified benchmarks) × √(N/10)

10 Core Benchmarks: MMLU-Pro · GPQA · AIME · HLE · ARC-AGI-2 · FINAL Bench (Metacognition) · SWE-Pro · BFCL · IFEval · LiveCodeBench
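
The formula can be sketched in Python as follows. This is an illustration under stated assumptions, not the leaderboard's actual implementation: it assumes N is the number of the 10 core benchmarks for which a model has a verified score, and that scores are on a 0 to 100 scale. The function name, the constant, and the handling of an empty list are hypothetical choices.

```python
import math

# Assumption: the "10" in the formula is the count of core benchmarks.
CORE_BENCHMARKS = 10


def composite_score(verified_scores):
    """Average of the verified scores, scaled by sqrt(N / 10).

    N is taken to be len(verified_scores); this reading of N is an assumption.
    """
    n = len(verified_scores)
    if n == 0:
        # Illustrative choice: a model with no verified scores gets 0.
        return 0.0
    average = sum(verified_scores) / n
    return average * math.sqrt(n / CORE_BENCHMARKS)


# Hypothetical scores, for illustration only.
full_coverage = composite_score([80.0] * 10)    # all 10 benchmarks verified
partial_coverage = composite_score([80.0] * 5)  # only 5 benchmarks verified
print(full_coverage, partial_coverage)
```

Under this reading, the square-root factor penalizes sparse coverage: a model with 5 of 10 verified benchmarks keeps about 71% of its average, while a model with all 10 keeps its full average.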

5-Axis Framework: Knowledge · Expert Reasoning · Abstract Reasoning · Metacognition · Execution

FINAL Bench — Metacognitive Benchmark

FINAL Bench measures an AI model's self-correction ability. Error Recovery (ER) explains 94.8% of metacognitive performance variance.

API Usage

from gradio_client import Client
client = Client("VIDRAFT/ALL-Bench")

# Get all LLM data
llm = client.predict(api_name="/get_llm_data")

# Search models
results = client.predict("Claude", "llm", api_name="/search_models")

# Get everything at once
all_data = client.predict(api_name="/get_all_data")
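
The calls above do not show what the endpoints return. The sketch below wraps `client.predict` in a small helper that accepts either a JSON string or already-parsed data. The helper names (`as_records`, `fetch`) and the `StubClient` class are hypothetical and exist only so the example runs offline; the return format of the real endpoints is an assumption and should be checked against the live Space.

```python
import json


def as_records(payload):
    """Normalize an endpoint result to Python objects.

    Assumption: an endpoint may return a JSON string or parsed data.
    """
    if isinstance(payload, str):
        return json.loads(payload)
    return payload


def fetch(client, api_name, *args):
    """Call one endpoint through client.predict and normalize the result."""
    return as_records(client.predict(*args, api_name=api_name))


class StubClient:
    """Offline stand-in for gradio_client.Client, used only in this demo."""

    def predict(self, *args, api_name):
        # Echo the request back as a JSON string instead of hitting the network.
        return json.dumps({"endpoint": api_name, "args": list(args)})


demo = fetch(StubClient(), "/search_models", "Claude", "llm")
print(demo)
```

With the real client, the same helper would be called as `fetch(Client("VIDRAFT/ALL-Bench"), "/get_llm_data")`. Calling `client.view_api()` first prints the endpoints the Space exposes along with their parameters and return types.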

Data Files

| File | Contents |
|---|---|
| llm.json | 42 LLM models (27 metrics each) |
| vlm.json | VLM benchmarks (34 benchmarks) |
| agent.json | Agent benchmarks (8 benchmarks) |
| image.json | 10 image generation models |
| video.json | 10 video generation models |
| music.json | 8 music generation models |
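
A quick way to sanity-check local copies of these files is to count their top-level entries. The sketch below is a rough illustration: the internal schema of each file is not documented here, so it only assumes that each file holds a JSON list or object at the top level. The function name `count_entries` and the directory argument are hypothetical.

```python
import json
from pathlib import Path

# File names as listed in the table above.
DATA_FILES = [
    "llm.json", "vlm.json", "agent.json",
    "image.json", "video.json", "music.json",
]


def count_entries(data_dir):
    """Return {filename: number of top-level entries} for files that exist.

    Assumption: each file contains a JSON list or object at the top level.
    """
    counts = {}
    for name in DATA_FILES:
        path = Path(data_dir) / name
        if not path.exists():
            continue  # skip files that have not been downloaded
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        counts[name] = len(data)  # works for both lists and dicts
    return counts
```

For example, `count_entries(".")` reports the entry counts for whichever of the six files are present in the current directory, which can then be compared against the numbers in the table.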

Citation

@misc{allbench2026,
    title={ALL Bench Leaderboard 2026: Unified Multi-Modal AI Evaluation},
    author={ALL Bench Team},
    year={2026},
    url={https://huggingface.co/spaces/VIDRAFT/ALL-Bench}
}

ALL Bench Leaderboard v2.1 · Updated 2026.03.08

#AIBenchmark #LLMLeaderboard #GPT5 #Claude #Gemini #ALLBench #FINALBench #Metacognition #VLM #AIAgent #MultiModal #HuggingFace #OpenSource #AIEvaluation #DeepLearning #ARC-AGI