🏗️ AI Model Benchmarking Platform

An agentic AI platform that automates the end-to-end lifecycle of AI model evaluation:

🔍 Research — Discover and categorize new models from HuggingFace Hub
📥 Download — Fetch models and prepare them for evaluation
🖥️ Install & Configure — Set up models on GPU infrastructure with appropriate frameworks
📊 Benchmark — Run standardized evaluations with domain-specific datasets & metrics
📄 Report — Generate professional PDF executive summaries

Architecture

┌──────────────────────────────────────────────────┐
│             Manager Agent (CodeAgent)             │
│   Plans workflow, coordinates sub-agents,         │
│   handles errors & retries                        │
└──────┬───────────────┬───────────────┬───────────┘
       │               │               │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│  Research   │ │ Evaluation  │ │   Report    │
│   Agent     │ │   Agent     │ │   Agent     │
│             │ │             │ │             │
│ • Search HF │ │ • Load      │ │ • PDF exec  │
│ • Detect    │ │   models    │ │   summary   │
│   category  │ │ • Run       │ │ • JSON      │
│ • Download  │ │   benchmarks│ │   results   │
│ • Trending  │ │ • Collect   │ │ • Model     │
│   models    │ │   metrics   │ │   comparison│
└─────────────┘ └─────────────┘ └─────────────┘
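
The coordination pattern in the diagram (a manager that runs sub-agents in order and retries a failed step) can be illustrated with a small, deterministic sketch. This is not the platform's code: in the real system the manager is an LLM-driven CodeAgent, whereas `SubAgent`, `Manager`, and `max_retries` below are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubAgent:
    """Hypothetical stand-in for a sub-agent: a name plus a callable doing the work."""
    name: str
    run: Callable[[dict], dict]


@dataclass
class Manager:
    """Runs sub-agents in order and retries a failing step a limited number of times."""
    agents: List[SubAgent]
    max_retries: int = 2
    log: List[str] = field(default_factory=list)

    def run(self, state: dict) -> dict:
        for agent in self.agents:
            for attempt in range(1, self.max_retries + 2):
                try:
                    # Each sub-agent reads the shared state and returns new keys.
                    state = {**state, **agent.run(state)}
                    self.log.append(f"{agent.name}: ok (attempt {attempt})")
                    break
                except Exception as exc:
                    self.log.append(f"{agent.name}: failed (attempt {attempt}): {exc}")
            else:
                # The inner loop never hit `break`, so every attempt failed.
                raise RuntimeError(
                    f"{agent.name} failed after {self.max_retries + 1} attempts"
                )
        return state
```

With three toy sub-agents (research, evaluation, report) the manager threads a single state dictionary through them, so each stage sees what the previous one produced.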

Supported Model Categories & Benchmarks

Category	Benchmarks	Datasets	Key Metrics
Reasoning / LLM	MMLU, GSM8K, ARC-Challenge, TruthfulQA, HumanEval	cais/mmlu, openai/gsm8k, allenai/ai2_arc, truthfulqa/truthful_qa, openai_humaneval	Accuracy, Exact Match, Pass@k
Intent Classification	CLINC150, Banking77, SNIPS	clinc_oos, PolyAI/banking77, benayas/SNIPS	Accuracy, Macro-F1, Weighted-F1
ASR	LibriSpeech-Clean, LibriSpeech-Other, CommonVoice-EN	facebook/librispeech_asr, mozilla-foundation/common_voice_17_0	WER, CER
TTS	LJSpeech	keithito/lj_speech	Intelligibility WER, Real-Time Factor
Machine Translation	FLORES-200 (en→de, en→fr, en→zh)	facebook/flores	BLEU, chrF, COMET
Wakeword / Keyword Spotting	Google Speech Commands v2	google/speech_commands	Accuracy, FAR, FRR
Tool Calling	BFCL v3	gorilla-llm/Berkeley-Function-Calling-Leaderboard	AST Accuracy, Format Adherence
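
For readers unfamiliar with the metrics in the table, here is a minimal pure-Python sketch of three of them: WER/CER as edit distance divided by reference length, the unbiased pass@k estimator, and FAR/FRR. It is an illustration only. The platform computes WER/CER via jiwer, which applies its own text handling, and the binary (keyword vs. non-keyword) framing of FAR/FRR is an assumption here.

```python
from math import comb
from typing import Sequence, Tuple


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def far_frr(labels: Sequence[bool], predictions: Sequence[bool]) -> Tuple[float, float]:
    """False-accept rate (on negatives) and false-reject rate (on positives)."""
    neg = [p for l, p in zip(labels, predictions) if not l]
    pos = [p for l, p in zip(labels, predictions) if l]
    far = sum(1 for p in neg if p) / len(neg)
    frr = sum(1 for p in pos if not p) / len(pos)
    return far, frr
```

Dropping one word from a six-word reference gives a WER of 1/6. Note that WER can exceed 1.0 when the hypothesis contains many insertions, and that these helpers raise `ZeroDivisionError` on an empty reference or an empty class.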

Installation

# Core installation
pip install -e .

# Full installation (all backends)
pip install -e ".[full]"

Dependencies by mode:

Core: smolagents, transformers, datasets, huggingface_hub, fpdf2, scikit-learn
ASR: + jiwer, librosa, soundfile
MT: + sacrebleu, unbabel-comet
LLM: + lighteval, lm-eval, vllm, accelerate
TTS: + speechbrain
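
To see which of these dependencies are present before running a given mode, a check along the following lines may help. It is a sketch rather than part of the package: `MODE_MODULES` and `missing_modules` are hypothetical names, and the import names are assumed from the pip names (several differ, for example fpdf2 imports as `fpdf` and scikit-learn as `sklearn`).

```python
from importlib.util import find_spec
from typing import Dict, List

# Import names assumed for each dependency group listed above.
# pip name -> import name where they differ:
#   fpdf2 -> fpdf, scikit-learn -> sklearn, unbabel-comet -> comet, lm-eval -> lm_eval
MODE_MODULES: Dict[str, List[str]] = {
    "core": ["smolagents", "transformers", "datasets", "huggingface_hub", "fpdf", "sklearn"],
    "asr": ["jiwer", "librosa", "soundfile"],
    "mt": ["sacrebleu", "comet"],
    "llm": ["lighteval", "lm_eval", "vllm", "accelerate"],
    "tts": ["speechbrain"],
}


def missing_modules(mode: str) -> List[str]:
    """Return the core + mode-specific modules that cannot be found in this environment."""
    wanted = list(MODE_MODULES["core"])
    if mode != "core":
        wanted += MODE_MODULES[mode]
    # find_spec only locates a top-level module; it does not import it.
    return [name for name in wanted if find_spec(name) is None]


if __name__ == "__main__":
    for mode in MODE_MODULES:
        print(f"{mode}: missing {missing_modules(mode) or 'nothing'}")
```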

Usage

🤖 Agent Mode (Recommended)

Let the AI agent autonomously orchestrate the full pipeline:

python -m ai_benchmark_platform.main agent \
    "Benchmark openai/whisper-large-v3 on all ASR tasks and generate a report"

python -m ai_benchmark_platform.main agent \
    "Find the top 5 trending LLMs, benchmark them on MMLU and GSM8K with 200 samples, \
     and create a comparison PDF report"

⚡ Direct Mode (No Agent)

Run benchmarks programmatically:

# Run all ASR benchmarks (category passed explicitly)
python -m ai_benchmark_platform.main direct \
    --model openai/whisper-large-v3 \
    --category asr \
    --max-samples 50 \
    --output-dir ./reports

# Run reasoning benchmarks on an LLM with automatic device selection
python -m ai_benchmark_platform.main direct \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --category reasoning \
    --max-samples 100 \
    --device auto

🔍 Research Mode

Discover and categorize models:

python -m ai_benchmark_platform.main research "speech recognition models" --detect --limit 10

📋 List Mode

View available benchmarks:

python -m ai_benchmark_platform.main list --category all
python -m ai_benchmark_platform.main list --category asr
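
The four modes and their flags map naturally onto argparse subcommands. The sketch below reconstructs such a parser from the commands shown above only; the actual main.py may be organized differently, and every default value and the choice of which flags are required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser accepting the agent/direct/research/list commands shown above."""
    parser = argparse.ArgumentParser(prog="ai_benchmark_platform")
    sub = parser.add_subparsers(dest="mode", required=True)

    agent = sub.add_parser("agent", help="Let the agent plan and run the pipeline")
    agent.add_argument("task", help="Natural-language task description")

    direct = sub.add_parser("direct", help="Run benchmarks without an agent")
    direct.add_argument("--model", required=True)
    direct.add_argument("--category")                       # assumed optional
    direct.add_argument("--max-samples", type=int)          # assumed: no default cap
    direct.add_argument("--output-dir", default="./reports")  # assumed default
    direct.add_argument("--device", default="auto")         # assumed default

    research = sub.add_parser("research", help="Discover and categorize models")
    research.add_argument("query")
    research.add_argument("--detect", action="store_true")
    research.add_argument("--limit", type=int, default=10)  # assumed default

    lst = sub.add_parser("list", help="View available benchmarks")
    lst.add_argument("--category", default="all")           # assumed default
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using subcommands keeps each mode's flags separate, so `--category` can mean "category to benchmark" under `direct` and "category to list" under `list` without conflict.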

Python API

from ai_benchmark_platform.agents.orchestrator import (
    create_benchmark_platform,
    run_full_pipeline,
)

# Option 1: Agent-driven pipeline
manager = create_benchmark_platform(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
result = manager.run("Benchmark openai/whisper-large-v3 on ASR tasks and generate a PDF report")

# Option 2: Direct pipeline (no agent)
result = run_full_pipeline(
    model_id="openai/whisper-large-v3",
    category="asr",
    max_samples=100,
    output_dir="./reports",
)
print(f"PDF: {result['pdf_report']}")
print(f"JSON: {result['json_report']}")

Using Individual Tools

from ai_benchmark_platform.tools.model_tools import search_models, detect_model_category
from ai_benchmark_platform.tools.benchmark_tools import run_benchmark_suite

# Search for models
models = search_models("whisper", category="automatic-speech-recognition", limit=5)

# Detect category
category = detect_model_category("openai/whisper-large-v3")

# Run benchmarks
results = run_benchmark_suite("openai/whisper-large-v3", "asr", max_samples=50)

PDF Report Structure

The generated executive summary PDF includes:

Cover Page — Model name, category, timestamp
Executive Summary — Pass/fail counts, key metrics overview
Model Information — HuggingFace metadata, tags, license
Benchmark Results Table — All metrics in a formatted table
Detailed Results — Per-benchmark analysis with sample predictions
Methodology — Description of evaluation pipeline and metrics
Notes & Disclaimer — Reproducibility and comparison guidelines
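
As an illustration of how the Executive Summary's pass/fail counts could be derived from per-benchmark results, consider the sketch below. `ResultRow` and `executive_summary` are hypothetical, simplified names, not the platform's `BenchmarkResult` or report code, and "passed" is assumed to mean "the benchmark completed without an error"; the platform may define it differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResultRow:
    """Hypothetical, simplified record for one benchmark run."""
    benchmark: str
    metrics: Dict[str, float] = field(default_factory=dict)
    error: Optional[str] = None  # None means the run completed


def executive_summary(rows: List[ResultRow]) -> dict:
    """Count completed vs. failed runs and collect the metrics of completed ones."""
    passed = [r for r in rows if r.error is None]
    failed = [r for r in rows if r.error is not None]
    return {
        "passed": len(passed),
        "failed": len(failed),
        "key_metrics": {r.benchmark: r.metrics for r in passed},
        "failures": {r.benchmark: r.error for r in failed},
    }
```

A dictionary like this can feed both the JSON output and the summary page of the PDF, which keeps the two reports consistent.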

Project Structure

ai_benchmark_platform/
├── __init__.py              # Package init with version
├── __main__.py              # Entry point for python -m
├── main.py                  # CLI with 4 modes (agent/direct/research/list)
├── config.py                # Category configs, benchmark registry
├── benchmarks/
│   ├── base.py              # BaseBenchmarkRunner, BenchmarkResult, EvaluationReport
│   ├── runner_factory.py    # Factory: category → runner
│   ├── reasoning_runner.py  # LLM eval via lighteval/lm-eval-harness
│   ├── asr_runner.py        # ASR eval: WER/CER via jiwer
│   ├── intent_runner.py     # Intent classification: accuracy, F1
│   ├── tts_runner.py        # TTS eval: intelligibility, RTF
│   ├── mt_runner.py         # MT eval: BLEU, chrF, COMET
│   ├── wakeword_runner.py   # Keyword spotting: accuracy, FAR/FRR
│   └── tool_calling_runner.py # Function calling: AST accuracy
├── tools/
│   ├── model_tools.py       # smolagents tools: search, download, detect
│   ├── benchmark_tools.py   # smolagents tools: run benchmarks
│   └── report_tools.py      # smolagents tools: generate reports
├── agents/
│   └── orchestrator.py      # Multi-agent system + direct pipeline
├── reports/
│   └── pdf_generator.py     # PDF/JSON report generation
└── utils/

Adding New Model Categories

Add a new ModelCategory enum value in config.py
Define BenchmarkConfig entries with datasets and metrics
Create a new runner class extending BaseBenchmarkRunner
Register it in runner_factory.py

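# In config.py: add a member to the existing enum
class ModelCategory(str, Enum):
    # ...existing categories...
    MY_NEW_CATEGORY = "my_new_category"

# In your_runner.py: implement the runner
from ai_benchmark_platform.benchmarks.base import BaseBenchmarkRunner

class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self): ...
    def run_benchmark(self, *args, **kwargs): ...  # use the signature BaseBenchmarkRunner declares

# In runner_factory.py: register the runner for the new category
RUNNER_MAP[ModelCategory.MY_NEW_CATEGORY] = MyNewRunner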
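
The snippets above rely on a registry pattern: an enum value selects a runner class from a dictionary. The self-contained sketch below shows that pattern end to end. All definitions are stand-ins written for this example; the real `BaseBenchmarkRunner` interface, its constructor arguments, and the `get_runner` helper are assumptions and may not match the package.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ModelCategory(str, Enum):
    MY_NEW_CATEGORY = "my_new_category"


class BaseBenchmarkRunner(ABC):
    """Stand-in for the real base class; its actual interface may differ."""

    def __init__(self, model_id: str):
        self.model_id = model_id

    @abstractmethod
    def load_model(self) -> None: ...

    @abstractmethod
    def run_benchmark(self, benchmark_name: str, max_samples: int) -> dict: ...


class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self) -> None:
        # A real runner would load weights here; this only records a marker.
        self.model = f"loaded:{self.model_id}"

    def run_benchmark(self, benchmark_name: str, max_samples: int) -> dict:
        # A real runner would iterate over the dataset and compute metrics.
        return {"benchmark": benchmark_name, "samples": max_samples, "metrics": {}}


# Registry: category -> runner class.
RUNNER_MAP = {ModelCategory.MY_NEW_CATEGORY: MyNewRunner}


def get_runner(category: str, model_id: str) -> BaseBenchmarkRunner:
    """Look up the runner class for a category string and instantiate it."""
    try:
        runner_cls = RUNNER_MAP[ModelCategory(category)]
    except (ValueError, KeyError):
        raise ValueError(f"No runner registered for category {category!r}") from None
    return runner_cls(model_id)
```

Because `ModelCategory` subclasses `str`, a plain string such as the CLI's `--category` value converts directly with `ModelCategory(category)`, and an unknown value fails early with a clear message.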

Hardware Requirements

Category	Recommended GPU	Notes
Reasoning (1-3B)	a10g-large (24GB)	For larger models use a100
Reasoning (7-13B)	a100-large (80GB)
Reasoning (30B+)	a100x4 (320GB)
ASR	a10g-large (24GB)	Whisper models fit in 24GB
TTS	a10g-large (24GB)
MT	a10g-large (24GB)
Intent	a10g-large (24GB)	Can run on CPU for small models
Wakeword	a10g-large (24GB)	Small models can run on CPU
Tool Calling	a100-large (80GB)	Depends on LLM size
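
The LLM tiers in the table follow from a rough rule of thumb: weights in 16-bit precision take about 2 bytes per parameter, before activations and KV cache are added. The sketch below encodes that estimate and the table's Reasoning tiers. `weight_memory_gb` and `recommend_gpu` are hypothetical helpers, and sending sizes that fall between the listed ranges (for example 5B or 20B) to the next larger tier is an assumption.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (2 bytes/param for fp16/bf16)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param


def recommend_gpu(params_billion: float) -> str:
    """Map an LLM size to the hardware tier listed in the table above."""
    if params_billion <= 3:
        return "a10g-large"   # 24GB
    if params_billion <= 13:
        return "a100-large"   # 80GB
    return "a100x4"           # 320GB


if __name__ == "__main__":
    for size in (3, 13, 30):
        print(f"{size}B: ~{weight_memory_gb(size):.0f} GB weights -> {recommend_gpu(size)}")
```

A 13B model needs roughly 26 GB for fp16 weights alone, which already exceeds a 24GB card, consistent with the table placing the 7-13B range on an 80GB GPU.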

License

Apache-2.0
