πŸ—οΈ AI Model Benchmarking Platform

An agentic AI platform that automates the end-to-end lifecycle of AI model evaluation:

  1. πŸ” Research β€” Discover and categorize new models from HuggingFace Hub
  2. πŸ“₯ Download β€” Fetch models and prepare them for evaluation
  3. πŸ–₯️ Install & Configure β€” Set up models on GPU infrastructure with appropriate frameworks
  4. πŸ“Š Benchmark β€” Run standardized evaluations with domain-specific datasets & metrics
  5. πŸ“„ Report β€” Generate professional PDF executive summaries

Architecture

┌───────────────────────────────────────────────────┐
│             Manager Agent (CodeAgent)             │
│   Plans workflow, coordinates sub-agents,         │
│   handles errors & retries                        │
└──────┬───────────────┬───────────────┬────────────┘
       │               │               │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│  Research   │ │ Evaluation  │ │   Report    │
│   Agent     │ │   Agent     │ │   Agent     │
│             │ │             │ │             │
│ • Search HF │ │ • Load      │ │ • PDF exec  │
│ • Detect    │ │   models    │ │   summary   │
│   category  │ │ • Run       │ │ • JSON      │
│ • Download  │ │   benchmarks│ │   results   │
│ • Trending  │ │ • Collect   │ │ • Model     │
│   models    │ │   metrics   │ │   comparison│
└─────────────┘ └─────────────┘ └─────────────┘

Powered by smolagents for multi-agent orchestration.
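
The control flow in the diagram (a manager that runs sub-agents in order and retries failed steps) can be illustrated without any agent framework. The sketch below is plain Python, not smolagents code: `run_with_retries`, `run_pipeline`, and the three stand-in step functions are hypothetical names, and the steps return placeholder data rather than real results.

```python
import time
from typing import Any, Callable, Optional


def run_with_retries(step: Callable[[Any], Any], payload: Any,
                     max_attempts: int = 3, delay_s: float = 0.0) -> Any:
    """Call step(payload), retrying on any exception up to max_attempts times."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload)
        except Exception as exc:  # a real manager would likely catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise RuntimeError(
        f"{step.__name__} failed after {max_attempts} attempts"
    ) from last_error


# Hypothetical stand-ins for the Research, Evaluation, and Report agents.
def research(request: str) -> dict:
    return {"model_id": request, "category": "asr"}


def evaluate(model: dict) -> dict:
    return {**model, "metrics": {}}  # placeholder: no benchmark is actually run


def report(results: dict) -> str:
    return f"report for {results['model_id']} ({results['category']})"


def run_pipeline(request: str) -> str:
    """Pass the output of each step to the next, retrying each step on failure."""
    payload: Any = request
    for step in (research, evaluate, report):
        payload = run_with_retries(step, payload)
    return payload
```

In the actual platform the manager is an LLM-driven agent that decides which sub-agent to call, so the fixed step order here is a simplification.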

Supported Model Categories & Benchmarks

| Category | Benchmarks | Datasets | Key Metrics |
|---|---|---|---|
| Reasoning / LLM | MMLU, GSM8K, ARC-Challenge, TruthfulQA, HumanEval | cais/mmlu, openai/gsm8k, allenai/ai2_arc, truthfulqa/truthful_qa, openai_humaneval | Accuracy, Exact Match, Pass@k |
| Intent Classification | CLINC150, Banking77, SNIPS | clinc_oos, PolyAI/banking77, benayas/SNIPS | Accuracy, Macro-F1, Weighted-F1 |
| ASR | LibriSpeech-Clean, LibriSpeech-Other, CommonVoice-EN | facebook/librispeech_asr, mozilla-foundation/common_voice_17_0 | WER, CER |
| TTS | LJSpeech | keithito/lj_speech | Intelligibility WER, Real-Time Factor |
| Machine Translation | FLORES-200 (en→de, en→fr, en→zh) | facebook/flores | BLEU, chrF, COMET |
| Wakeword / Keyword Spotting | Google Speech Commands v2 | google/speech_commands | Accuracy, FAR, FRR |
| Tool Calling | BFCL v3 | gorilla-llm/Berkeley-Function-Calling-Leaderboard | AST Accuracy, Format Adherence |
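
To make the ASR metrics in the table concrete, here is a minimal pure-Python sketch of WER and CER as edit distance divided by reference length. It is for illustration only: the platform computes these metrics with jiwer, and the function names below are hypothetical. No text normalization (casing, punctuation) is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` counts one deleted word out of six reference words.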

Installation

# Core installation
pip install -e .

# Full installation (all backends)
pip install -e ".[full]"

Dependencies by mode:

  • Core: smolagents, transformers, datasets, huggingface_hub, fpdf2, scikit-learn
  • ASR: + jiwer, librosa, soundfile
  • MT: + sacrebleu, unbabel-comet
  • LLM: + lighteval, lm-eval, vllm, accelerate
  • TTS: + speechbrain

Usage

🤖 Agent Mode (Recommended)

Let the AI agent autonomously orchestrate the full pipeline:

python -m ai_benchmark_platform.main agent \
    "Benchmark openai/whisper-large-v3 on all ASR tasks and generate a report"
python -m ai_benchmark_platform.main agent \
    "Find the top 5 trending LLMs, benchmark them on MMLU and GSM8K with 200 samples, \
     and create a comparison PDF report"

⚡ Direct Mode (No Agent)

Run benchmarks directly from the command line, without the agent:

# Run all ASR benchmarks (the category is passed explicitly here)
python -m ai_benchmark_platform.main direct \
    --model openai/whisper-large-v3 \
    --category asr \
    --max-samples 50 \
    --output-dir ./reports

# Run an LLM on the reasoning benchmarks with automatic device selection
python -m ai_benchmark_platform.main direct \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --category reasoning \
    --max-samples 100 \
    --device auto

πŸ” Research Mode

Discover and categorize models:

python -m ai_benchmark_platform.main research "speech recognition models" --detect --limit 10

📋 List Mode

View available benchmarks:

python -m ai_benchmark_platform.main list --category all
python -m ai_benchmark_platform.main list --category asr
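
The four modes and the flags shown in the commands above map naturally onto argparse subcommands. The following is a sketch of how such a CLI could be laid out, not the contents of main.py: `build_parser` is a hypothetical name, and every default value is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per mode: agent, direct, research, list."""
    parser = argparse.ArgumentParser(prog="ai_benchmark_platform.main")
    modes = parser.add_subparsers(dest="mode", required=True)

    agent = modes.add_parser("agent", help="let the agent plan and run the pipeline")
    agent.add_argument("task", help="natural-language task description")

    direct = modes.add_parser("direct", help="run benchmarks without an agent")
    direct.add_argument("--model", required=True)
    direct.add_argument("--category", default=None)           # assumed optional
    direct.add_argument("--max-samples", type=int, default=None)
    direct.add_argument("--device", default="auto")           # assumed default
    direct.add_argument("--output-dir", default="./reports")  # assumed default

    research = modes.add_parser("research", help="search and categorize models")
    research.add_argument("query")
    research.add_argument("--detect", action="store_true")
    research.add_argument("--limit", type=int, default=10)    # assumed default

    listing = modes.add_parser("list", help="show available benchmarks")
    listing.add_argument("--category", default="all")         # assumed default
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Using subparsers keeps each mode's flags separate, so `--category` can mean a required benchmark family in one mode and a display filter in another.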

Python API

from ai_benchmark_platform.agents.orchestrator import (
    create_benchmark_platform,
    run_full_pipeline,
)

# Option 1: Agent-driven pipeline
manager = create_benchmark_platform(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
result = manager.run("Benchmark openai/whisper-large-v3 on ASR tasks and generate a PDF report")

# Option 2: Direct pipeline (no agent)
result = run_full_pipeline(
    model_id="openai/whisper-large-v3",
    category="asr",
    max_samples=100,
    output_dir="./reports",
)
print(f"PDF: {result['pdf_report']}")
print(f"JSON: {result['json_report']}")

Using Individual Tools

from ai_benchmark_platform.tools.model_tools import search_models, detect_model_category
from ai_benchmark_platform.tools.benchmark_tools import run_benchmark_suite

# Search for models
models = search_models("whisper", category="automatic-speech-recognition", limit=5)

# Detect category
category = detect_model_category("openai/whisper-large-v3")

# Run benchmarks
results = run_benchmark_suite("openai/whisper-large-v3", "asr", max_samples=50)

PDF Report Structure

The generated executive summary PDF includes:

  1. Cover Page: Model name, category, timestamp
  2. Executive Summary: Pass/fail counts, key metrics overview
  3. Model Information: HuggingFace metadata, tags, license
  4. Benchmark Results Table: All metrics in a formatted table
  5. Detailed Results: Per-benchmark analysis with sample predictions
  6. Methodology: Description of evaluation pipeline and metrics
  7. Notes & Disclaimer: Reproducibility and comparison guidelines

Project Structure

ai_benchmark_platform/
├── __init__.py              # Package init with version
├── __main__.py              # Entry point for python -m
├── main.py                  # CLI with 4 modes (agent/direct/research/list)
├── config.py                # Category configs, benchmark registry
├── benchmarks/
│   ├── base.py              # BaseBenchmarkRunner, BenchmarkResult, EvaluationReport
│   ├── runner_factory.py    # Factory: category → runner
│   ├── reasoning_runner.py  # LLM eval via lighteval/lm-eval-harness
│   ├── asr_runner.py        # ASR eval: WER/CER via jiwer
│   ├── intent_runner.py     # Intent classification: accuracy, F1
│   ├── tts_runner.py        # TTS eval: intelligibility, RTF
│   ├── mt_runner.py         # MT eval: BLEU, chrF, COMET
│   ├── wakeword_runner.py   # Keyword spotting: accuracy, FAR/FRR
│   └── tool_calling_runner.py # Function calling: AST accuracy
├── tools/
│   ├── model_tools.py       # smolagents tools: search, download, detect
│   ├── benchmark_tools.py   # smolagents tools: run benchmarks
│   └── report_tools.py      # smolagents tools: generate reports
├── agents/
│   └── orchestrator.py      # Multi-agent system + direct pipeline
├── reports/
│   └── pdf_generator.py     # PDF/JSON report generation
└── utils/

Adding New Model Categories

  1. Add a new ModelCategory enum value in config.py
  2. Define BenchmarkConfig entries with datasets and metrics
  3. Create a new runner class extending BaseBenchmarkRunner
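  4. Register it in runner_factory.py

# In config.py
from enum import Enum

class ModelCategory(str, Enum):
    # ...existing categories stay as they are...
    MY_NEW_CATEGORY = "my_new_category"

# In your_runner.py
from ai_benchmark_platform.benchmarks.base import BaseBenchmarkRunner

class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self): ...
    # Match the run_benchmark signature defined in benchmarks/base.py
    def run_benchmark(self, *args, **kwargs): ...

# In runner_factory.py
RUNNER_MAP[ModelCategory.MY_NEW_CATEGORY] = MyNewRunner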
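
The registration steps above follow a registry-plus-factory pattern, which the self-contained sketch below demonstrates end to end. The classes here are stand-ins that only mirror the names used in this README: the real `BaseBenchmarkRunner` constructor and `run_benchmark` signature live in benchmarks/base.py and may differ, and `get_runner` is a hypothetical lookup helper.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ModelCategory(str, Enum):
    ASR = "asr"
    MY_NEW_CATEGORY = "my_new_category"


class BaseBenchmarkRunner(ABC):
    """Stand-in for the real base class; its actual interface is an assumption."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    @abstractmethod
    def load_model(self) -> None: ...

    @abstractmethod
    def run_benchmark(self, benchmark_name: str, max_samples: int) -> dict: ...


class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self) -> None:
        self.model = f"loaded:{self.model_id}"  # placeholder for real model loading

    def run_benchmark(self, benchmark_name: str, max_samples: int) -> dict:
        # Placeholder result: a real runner would evaluate and fill in metrics.
        return {"benchmark": benchmark_name, "samples": max_samples, "metrics": {}}


# Registry mapping each category to the runner class that handles it.
RUNNER_MAP: dict[ModelCategory, type[BaseBenchmarkRunner]] = {}
RUNNER_MAP[ModelCategory.MY_NEW_CATEGORY] = MyNewRunner


def get_runner(category: str, model_id: str) -> BaseBenchmarkRunner:
    """Look up the runner class for a category string and instantiate it."""
    key = ModelCategory(category)  # raises ValueError for unknown category strings
    try:
        runner_cls = RUNNER_MAP[key]
    except KeyError:
        raise KeyError(f"no runner registered for category {key.value!r}") from None
    return runner_cls(model_id)
```

Because the lookup goes through the map, adding a category never requires touching the code that calls `get_runner`.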

Hardware Requirements

| Category | Recommended GPU | Notes |
|---|---|---|
| Reasoning (1-3B) | a10g-large (24GB) | For larger models use a100 |
| Reasoning (7-13B) | a100-large (80GB) | |
| Reasoning (30B+) | a100x4 (320GB) | |
| ASR | a10g-large (24GB) | Whisper models fit in 24GB |
| TTS | a10g-large (24GB) | |
| MT | a10g-large (24GB) | |
| Intent | a10g-large (24GB) | Can run on CPU for small models |
| Wakeword | a10g-large (24GB) | Small models can run on CPU |
| Tool Calling | a100-large (80GB) | Depends on LLM size |
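
For reasoning models, the tiers in the table can be turned into a small lookup by parameter count. The sketch below is a rough guide rather than platform code: `weight_memory_gb` and `recommend_gpu` are hypothetical names, and because the table leaves 3-7B and 13-30B unspecified, assigning both gaps to a100-large is an assumption. The memory estimate covers weights only (2 bytes per parameter in fp16 or bf16) and ignores activations and KV cache.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def recommend_gpu(params_billion: float) -> str:
    """Map a parameter count (in billions) to a GPU tier from the table above."""
    if params_billion <= 3:
        return "a10g-large"   # 24GB
    if params_billion < 30:
        return "a100-large"   # 80GB; covers the table's unspecified gaps (assumption)
    return "a100x4"           # 320GB


if __name__ == "__main__":
    for size in (3, 8, 32):
        print(f"{size}B: ~{weight_memory_gb(size):.0f} GB weights -> {recommend_gpu(size)}")
```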

License

Apache-2.0
