AI Model Benchmarking Platform
An Agentic AI platform that automates the end-to-end lifecycle of AI model evaluation:
- Research: Discover and categorize new models from HuggingFace Hub
- Download: Fetch models and prepare them for evaluation
- Install & Configure: Set up models on GPU infrastructure with appropriate frameworks
- Benchmark: Run standardized evaluations with domain-specific datasets & metrics
- Report: Generate professional PDF executive summaries
Architecture
┌──────────────────────────────────────────────────┐
│            Manager Agent (CodeAgent)             │
│     Plans workflow, coordinates sub-agents,      │
│             handles errors & retries             │
└──────┬───────────────┬───────────────┬───────────┘
       │               │               │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│  Research   │ │ Evaluation  │ │   Report    │
│   Agent     │ │   Agent     │ │   Agent     │
│             │ │             │ │             │
│ • Search HF │ │ • Load      │ │ • PDF exec  │
│ • Detect    │ │   models    │ │   summary   │
│   category  │ │ • Run       │ │ • JSON      │
│ • Download  │ │   benchmarks│ │   results   │
│ • Trending  │ │ • Collect   │ │ • Model     │
│   models    │ │   metrics   │ │   comparison│
└─────────────┘ └─────────────┘ └─────────────┘
Powered by smolagents for multi-agent orchestration.
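The coordination pattern in the diagram (a manager that runs sub-agents in order and retries on failure) can be sketched in plain Python. This is not the smolagents API and not the platform's orchestrator: `run_with_retries`, `run_pipeline`, the stage callables, and the default of three attempts are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# A stage is any callable that takes the previous stage's output.
Stage = Callable[[Any], Any]


def run_with_retries(stage: Stage, payload: Any, max_attempts: int = 3) -> Any:
    """Call one stage, retrying on any exception up to max_attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return stage(payload)
        except Exception as exc:  # a real manager would catch narrower errors
            last_error = exc
    raise RuntimeError(f"stage failed after {max_attempts} attempts") from last_error


def run_pipeline(stages: List[Tuple[str, Stage]], task: Any) -> Dict[str, Any]:
    """Run named stages in order, feeding each output into the next stage."""
    outputs: Dict[str, Any] = {}
    payload = task
    for name, stage in stages:
        payload = run_with_retries(stage, payload)
        outputs[name] = payload
    return outputs
```

In this sketch the three sub-agents would be passed as `("research", ...)`, `("evaluation", ...)`, and `("report", ...)` stages; a stage that keeps failing surfaces as a `RuntimeError` rather than being silently skipped.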
Supported Model Categories & Benchmarks
| Category | Benchmarks | Datasets | Key Metrics |
|---|---|---|---|
| Reasoning / LLM | MMLU, GSM8K, ARC-Challenge, TruthfulQA, HumanEval | cais/mmlu, openai/gsm8k, allenai/ai2_arc, truthfulqa/truthful_qa, openai_humaneval | Accuracy, Exact Match, Pass@k |
| Intent Classification | CLINC150, Banking77, SNIPS | clinc_oos, PolyAI/banking77, benayas/SNIPS | Accuracy, Macro-F1, Weighted-F1 |
| ASR | LibriSpeech-Clean, LibriSpeech-Other, CommonVoice-EN | facebook/librispeech_asr, mozilla-foundation/common_voice_17_0 | WER, CER |
| TTS | LJSpeech | keithito/lj_speech | Intelligibility WER, Real-Time Factor |
| Machine Translation | FLORES-200 (en→de, en→fr, en→zh) | facebook/flores | BLEU, chrF, COMET |
| Wakeword / Keyword Spotting | Google Speech Commands v2 | google/speech_commands | Accuracy, FAR, FRR |
| Tool Calling | BFCL v3 | gorilla-llm/Berkeley-Function-Calling-Leaderboard | AST Accuracy, Format Adherence |
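
To make the ASR metrics in the table concrete, here is a minimal sketch of WER and CER as edit distance divided by reference length. The platform lists jiwer for this purpose; the code below is a dependency-free illustration, not the platform's implementation, and it applies no text normalization (casing and punctuation are compared as-is). The function names are chosen for this example only.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: same idea at character level."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.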
Installation
# Core installation
pip install -e .
# Full installation (all backends)
pip install -e ".[full]"
Dependencies by mode:
- Core: smolagents, transformers, datasets, huggingface_hub, fpdf2, scikit-learn
- ASR: + jiwer, librosa, soundfile
- MT: + sacrebleu, unbabel-comet
- LLM: + lighteval, lm-eval, vllm, accelerate
- TTS: + speechbrain
Usage
Agent Mode (Recommended)
Let the AI agent autonomously orchestrate the full pipeline:
python -m ai_benchmark_platform.main agent \
    "Benchmark openai/whisper-large-v3 on all ASR tasks and generate a report"

python -m ai_benchmark_platform.main agent \
    "Find the top 5 trending LLMs, benchmark them on MMLU and GSM8K with 200 samples, \
    and create a comparison PDF report"
Direct Mode (No Agent)
Run benchmarks from the command line without the agent:
# Run all ASR benchmarks on a Whisper model (category passed explicitly)
python -m ai_benchmark_platform.main direct \
    --model openai/whisper-large-v3 \
    --category asr \
    --max-samples 50 \
    --output-dir ./reports

# Benchmark an LLM on the reasoning suite, letting --device auto pick the device
python -m ai_benchmark_platform.main direct \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --category reasoning \
    --max-samples 100 \
    --device auto
Research Mode
Discover and categorize models:
python -m ai_benchmark_platform.main research "speech recognition models" --detect --limit 10
List Mode
View available benchmarks:
python -m ai_benchmark_platform.main list --category all
python -m ai_benchmark_platform.main list --category asr
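The four modes map naturally onto argparse subcommands. The sketch below mirrors the flags that appear in the commands above; it is not the platform's `main.py`, and every default value (`./reports`, `auto`, `10`, `all`) as well as the choice of which flags are required is an assumption made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a four-mode CLI parser: agent, direct, research, list."""
    parser = argparse.ArgumentParser(prog="ai_benchmark_platform")
    sub = parser.add_subparsers(dest="mode", required=True)

    # agent: a single free-text task for the manager agent
    agent = sub.add_parser("agent", help="let the agent run the pipeline")
    agent.add_argument("task")

    # direct: run benchmarks without the agent
    direct = sub.add_parser("direct", help="run benchmarks directly")
    direct.add_argument("--model", required=True)
    direct.add_argument("--category")
    direct.add_argument("--max-samples", type=int)
    direct.add_argument("--output-dir", default="./reports")  # assumed default
    direct.add_argument("--device", default="auto")           # assumed default

    # research: search the Hub, optionally detecting each model's category
    research = sub.add_parser("research", help="discover models")
    research.add_argument("query")
    research.add_argument("--detect", action="store_true")
    research.add_argument("--limit", type=int, default=10)    # assumed default

    # list: show available benchmarks
    list_cmd = sub.add_parser("list", help="list benchmarks")
    list_cmd.add_argument("--category", default="all")        # assumed default
    return parser
```

Calling `build_parser().parse_args([...])` with the argument lists from the examples above yields a namespace whose `mode` attribute selects the code path to run.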
Python API
from ai_benchmark_platform.agents.orchestrator import (
    create_benchmark_platform,
    run_full_pipeline,
)

# Option 1: Agent-driven pipeline
manager = create_benchmark_platform(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")
result = manager.run("Benchmark openai/whisper-large-v3 on ASR tasks and generate a PDF report")

# Option 2: Direct pipeline (no agent)
result = run_full_pipeline(
    model_id="openai/whisper-large-v3",
    category="asr",
    max_samples=100,
    output_dir="./reports",
)
print(f"PDF: {result['pdf_report']}")
print(f"JSON: {result['json_report']}")
Using Individual Tools
from ai_benchmark_platform.tools.model_tools import search_models, detect_model_category
from ai_benchmark_platform.tools.benchmark_tools import run_benchmark_suite
# Search for models
models = search_models("whisper", category="automatic-speech-recognition", limit=5)
# Detect category
category = detect_model_category("openai/whisper-large-v3")
# Run benchmarks
results = run_benchmark_suite("openai/whisper-large-v3", "asr", max_samples=50)
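One plausible way to detect a category is to start from the model's HuggingFace pipeline tag, the same kind of value passed as `category="automatic-speech-recognition"` in the search call above. The mapping below is a hypothetical sketch, not the logic inside `detect_model_category`: only the identifiers `asr` and `reasoning` appear in this README, and the other category names (`tts`, `mt`, `intent`, `wakeword`) are assumed.

```python
from typing import Optional

# HuggingFace pipeline tags on the left; platform category names on the right.
# Category names other than "asr" and "reasoning" are assumptions.
PIPELINE_TAG_TO_CATEGORY = {
    "automatic-speech-recognition": "asr",
    "text-to-speech": "tts",
    "translation": "mt",
    "text-classification": "intent",
    "audio-classification": "wakeword",
    "text-generation": "reasoning",
}


def category_from_pipeline_tag(pipeline_tag: Optional[str]) -> Optional[str]:
    """Return the assumed platform category for a tag, or None if unknown."""
    if pipeline_tag is None:
        return None
    return PIPELINE_TAG_TO_CATEGORY.get(pipeline_tag)
```

A tag alone cannot separate tool-calling models from general LLMs, since both are typically tagged `text-generation`, so a real detector would need extra signals such as model tags or the model card.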
PDF Report Structure
The generated executive summary PDF includes:
- Cover Page: Model name, category, timestamp
- Executive Summary: Pass/fail counts, key metrics overview
- Model Information: HuggingFace metadata, tags, license
- Benchmark Results Table: All metrics in a formatted table
- Detailed Results: Per-benchmark analysis with sample predictions
- Methodology: Description of evaluation pipeline and metrics
- Notes & Disclaimer: Reproducibility and comparison guidelines
Project Structure
ai_benchmark_platform/
├── __init__.py                  # Package init with version
├── __main__.py                  # Entry point for python -m
├── main.py                      # CLI with 4 modes (agent/direct/research/list)
├── config.py                    # Category configs, benchmark registry
├── benchmarks/
│   ├── base.py                  # BaseBenchmarkRunner, BenchmarkResult, EvaluationReport
│   ├── runner_factory.py        # Factory: category → runner
│   ├── reasoning_runner.py      # LLM eval via lighteval/lm-eval-harness
│   ├── asr_runner.py            # ASR eval: WER/CER via jiwer
│   ├── intent_runner.py         # Intent classification: accuracy, F1
│   ├── tts_runner.py            # TTS eval: intelligibility, RTF
│   ├── mt_runner.py             # MT eval: BLEU, chrF, COMET
│   ├── wakeword_runner.py       # Keyword spotting: accuracy, FAR/FRR
│   └── tool_calling_runner.py   # Function calling: AST accuracy
├── tools/
│   ├── model_tools.py           # smolagents tools: search, download, detect
│   ├── benchmark_tools.py       # smolagents tools: run benchmarks
│   └── report_tools.py          # smolagents tools: generate reports
├── agents/
│   └── orchestrator.py          # Multi-agent system + direct pipeline
├── reports/
│   └── pdf_generator.py         # PDF/JSON report generation
└── utils/
Adding New Model Categories
- Add a new ModelCategory enum value in config.py
- Define BenchmarkConfig entries with datasets and metrics
- Create a new runner class extending BaseBenchmarkRunner
- Register it in runner_factory.py
# In config.py
class ModelCategory(str, Enum):
    MY_NEW_CATEGORY = "my_new_category"

# In your_runner.py
class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self): ...
    def run_benchmark(self, ...): ...

# In runner_factory.py
RUNNER_MAP[ModelCategory.MY_NEW_CATEGORY] = MyNewRunner
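The registration steps can be tried end to end with a self-contained sketch of the same enum, base class, and registry. Everything here is a stand-in: the real `BaseBenchmarkRunner` constructor and the `run_benchmark` signature are not shown in this README, so the `model_id` argument, the `benchmark_name` parameter, the returned dict shape, and the `get_runner` helper are all assumptions.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, Type


class ModelCategory(str, Enum):
    MY_NEW_CATEGORY = "my_new_category"


class BaseBenchmarkRunner(ABC):
    """Stand-in for the platform's base class; the real interface may differ."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id

    @abstractmethod
    def load_model(self) -> None: ...

    @abstractmethod
    def run_benchmark(self, benchmark_name: str) -> dict: ...


class MyNewRunner(BaseBenchmarkRunner):
    def load_model(self) -> None:
        # Placeholder: a real runner would load weights here.
        self.model = f"loaded:{self.model_id}"

    def run_benchmark(self, benchmark_name: str) -> dict:
        # Placeholder result; a real runner would compute metrics.
        return {"benchmark": benchmark_name, "model": self.model_id, "metrics": {}}


# Registry from category to runner class, as in runner_factory.py.
RUNNER_MAP: Dict[ModelCategory, Type[BaseBenchmarkRunner]] = {}
RUNNER_MAP[ModelCategory.MY_NEW_CATEGORY] = MyNewRunner


def get_runner(category: str, model_id: str) -> BaseBenchmarkRunner:
    """Hypothetical factory helper: look up and instantiate a runner."""
    try:
        runner_cls = RUNNER_MAP[ModelCategory(category)]
    except (ValueError, KeyError) as exc:
        raise ValueError(f"No runner registered for category {category!r}") from exc
    return runner_cls(model_id)
```

Keeping the registry keyed by the enum means an unknown or unregistered category fails at lookup time with a clear error instead of deep inside a benchmark run.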
Hardware Requirements
| Category | Recommended GPU | Notes |
|---|---|---|
| Reasoning (1-3B) | a10g-large (24GB) | For larger models use a100 |
| Reasoning (7-13B) | a100-large (80GB) | |
| Reasoning (30B+) | a100x4 (320GB) | |
| ASR | a10g-large (24GB) | Whisper models fit in 24GB |
| TTS | a10g-large (24GB) | |
| MT | a10g-large (24GB) | |
| Intent | a10g-large (24GB) | Can run on CPU for small models |
| Wakeword | a10g-large (24GB) | Small models can run on CPU |
| Tool Calling | a100-large (80GB) | Depends on LLM size |
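
The table can be encoded as a small lookup for scripts that provision hardware. This is a sketch only: the function name and category strings are hypothetical, and the table leaves two size ranges unspecified, so the code assumes 3-7B models go to a100-large (following the "for larger models use a100" note) and 13-30B models go to a100x4 (a conservative guess).

```python
from typing import Optional

SMALL_GPU_CATEGORIES = {"asr", "tts", "mt", "intent", "wakeword"}


def recommend_gpu(category: str, params_billion: Optional[float] = None) -> str:
    """Return the recommended GPU flavor from the hardware table."""
    if category == "reasoning":
        if params_billion is None:
            raise ValueError("reasoning models need a parameter count")
        if params_billion <= 3:
            return "a10g-large"   # 24GB
        if params_billion <= 13:
            return "a100-large"   # 80GB; the 3-7B range is an assumption
        return "a100x4"           # 320GB; the 13-30B range is an assumption
    if category == "tool_calling":
        return "a100-large"       # table notes this depends on LLM size
    if category in SMALL_GPU_CATEGORIES:
        return "a10g-large"
    raise ValueError(f"unknown category: {category!r}")
```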
License
Apache-2.0