---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---
# Inference Provider Testing Dashboard
A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.
## Setup

### Prerequisites

- Python 3.8+
- A Hugging Face account with an API token
- Access to the `IPTesting` namespace on Hugging Face
### Installation

1. Clone or navigate to this repository:

   ```bash
   cd InferenceProviderTestingBackend
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set your Hugging Face token as an environment variable:

   ```bash
   export HF_TOKEN="your_huggingface_token_here"
   ```
**Important**: Your `HF_TOKEN` must have:

- Permission to call inference providers
- Write access to the `IPTesting` organization
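To sanity-check the token before launching anything, you can use `huggingface_hub`'s `whoami`, which reads `HF_TOKEN` from the environment. A quick sketch:

```python
# Quick token sanity check (sketch; requires huggingface_hub).
from huggingface_hub import whoami

info = whoami()  # picks up HF_TOKEN from the environment
print("Authenticated as:", info["name"])
# IPTesting should appear among your organizations:
print("Orgs:", [org["name"] for org in info.get("orgs", [])])
```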
## Usage

### Starting the Dashboard

Run the Gradio app:

```bash
python app.py
```
### Initialize Models and Providers

Click the "Fetch and Initialize Models/Providers" button to automatically populate the `models_providers.txt` file with popular models and their available inference providers.

Alternatively, manually edit `models_providers.txt` with your desired model-provider combinations:
```
meta-llama/Llama-3.2-3B-Instruct fireworks-ai
meta-llama/Llama-3.2-3B-Instruct together-ai
Qwen/Qwen2.5-7B-Instruct fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3 together-ai
```
Format: `model_name provider_name` (separated by spaces or tabs)
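For reference, a minimal sketch of how such a file can be parsed; the dashboard's actual parser lives in `utils/io.py` and may differ:

```python
# Minimal sketch of parsing models_providers.txt (illustrative only;
# the real parser lives in utils/io.py).
def read_model_provider_pairs(path="models_providers.txt"):
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            model, provider = line.split(maxsplit=1)  # spaces or tabs
            pairs.append((model, provider))
    return pairs

print(read_model_provider_pairs())
# [('meta-llama/Llama-3.2-3B-Instruct', 'fireworks-ai'), ...]
```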
### Launching Jobs

1. Enter the evaluation tasks in the Tasks field (e.g., `lighteval|mmlu|0|0`)
2. Verify the config file path (default: `models_providers.txt`)
3. Click "Launch Jobs"
The system will:
- Read all model-provider combinations from the config file
- Launch a separate evaluation job for each combination
- Log the job ID and status
- Monitor job progress automatically
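In outline, the launch loop looks roughly like the sketch below, reusing the `read_model_provider_pairs` sketch from above; `launch_eval_job` is a hypothetical stand-in for the real submission code in `utils/jobs.py`:

```python
# Illustrative launch loop (launch_eval_job is a hypothetical stand-in;
# the real logic lives in utils/jobs.py).
from datetime import datetime, timezone

def launch_all(tasks, config_path="models_providers.txt"):
    job_results = {}
    for model, provider in read_model_provider_pairs(config_path):
        job = launch_eval_job(model=model, provider=provider, tasks=tasks)
        job_results[(model, provider)] = {
            "job_id": job.id,
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
        }
        print(f"Launched {job.id} for {model} on {provider}")
    return job_results
```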
### Monitoring Jobs

The Job Results table displays all jobs with:

- **Model**: The model being tested
- **Provider**: The inference provider
- **Last Run**: Timestamp of when the job was last launched
- **Status**: Current status (running/complete/failed/cancelled)
- **Current Score**: Average score from the most recent run
- **Previous Score**: Average score from the prior run (for comparison)
- **Latest Job Id**: The most recent job ID; open `https://huggingface.co/jobs/NAMESPACE/JOBID` to inspect it
The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
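In Gradio, this kind of periodic refresh can be wired with `gr.Timer`; a sketch, assuming a `get_results_dataframe` function that returns the current table:

```python
# Sketch of a 30-second auto-refresh in Gradio (get_results_dataframe is
# a hypothetical function returning the results table).
import gradio as gr

with gr.Blocks() as demo:
    table = gr.Dataframe(label="Job Results")
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    timer.tick(fn=get_results_dataframe, outputs=table)
    refresh_btn.click(fn=get_results_dataframe, outputs=table)
```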
## Configuration

### Tasks Format

The tasks parameter follows the lighteval format. For example:

- `lighteval|mmlu|0` - MMLU benchmark
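Multiple tasks can typically be combined in one string with commas, following the usual lighteval convention (an assumption about the lighteval CLI, not confirmed by this repo):

```
lighteval|mmlu|0,lighteval|gsm8k|0
```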
### Daily Checkpoint

The system automatically saves all results to the Hugging Face dataset at 00:00 (midnight) every day.
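The save is scheduled with APScheduler (see Architecture below). A minimal cron sketch, where `save_checkpoint` is a hypothetical stand-in for the real save routine:

```python
# Midnight checkpoint via APScheduler (save_checkpoint is hypothetical).
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.add_job(save_checkpoint, "cron", hour=0, minute=0)  # 00:00 daily
scheduler.start()
```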
### Data Persistence

All job results are stored in a Hugging Face dataset (`IPTesting/inference-provider-test-results`), which means:

- Results persist across app restarts
- Historical score comparisons are maintained
- Data can be accessed programmatically via the `datasets` library (see the snippet below)
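For example, to pull the results into Python (assuming the default `train` split):

```python
# Load persisted results with the datasets library.
from datasets import load_dataset

ds = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(ds.column_names)
print(ds[0])  # first stored result row
```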
## Architecture

- **Main Thread**: Runs the Gradio interface
- **Monitor Thread**: Updates job statuses every 30 seconds and extracts scores from completed jobs
- **APScheduler**: Background scheduler that handles the daily checkpoint save at midnight (cron-based)
- **Thread Safety**: Uses locks to prevent concurrent-access issues on `job_results` (sketched below)
- **Hugging Face Dataset Storage**: Persists results to the `IPTesting/inference-provider-test-results` dataset
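The locking pattern, sketched with illustrative names (`check_job_status` is hypothetical; the actual identifiers in `app.py` may differ):

```python
# Sketch of the monitor-thread pattern described above (names are
# illustrative, not the actual identifiers in app.py).
import threading
import time

job_results = {}
results_lock = threading.Lock()

def monitor_loop():
    while True:
        with results_lock:  # guard shared state against the Gradio thread
            for entry in job_results.values():
                if entry["status"] == "running":
                    entry["status"] = check_job_status(entry["job_id"])
        time.sleep(30)

threading.Thread(target=monitor_loop, daemon=True).start()
```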
## Troubleshooting

### Jobs Not Launching

- Verify your `HF_TOKEN` is set and has the required permissions
- Check that the `IPTesting` namespace exists and you have access to it
- Review the logs for specific error messages
### Scores Not Appearing
- Scores are extracted from job logs after completion
- The extraction parses the results table that appears in job logs
- It extracts the score for each task (from the first row where the task name appears)
- The final score is the average of all task scores
- Example table format:

  ```
  | Task                     | Version | Metric                  | Value  | Stderr |
  | extended:ifeval:0        |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
  | lighteval:gpqa:diamond:0 |         | gpqa_pass@k_with_k      | 0.5000 | 0.0503 |
  ```

- If scores don't appear, check the console output for extraction or parsing errors
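A sketch of the kind of parsing involved; the real extractor lives in `utils/jobs.py` and may differ:

```python
# Illustrative score extraction from a job-log results table
# (the real extractor lives in utils/jobs.py).
import re

def extract_average_score(log_text):
    scores = {}
    for line in log_text.splitlines():
        # Match rows like:
        # | extended:ifeval:0 | | prompt_level_strict_acc | 0.9100 | 0.0288 |
        m = re.match(r"\|\s*([\w:@-]+)\s*\|[^|]*\|\s*[\w@-]+\s*\|\s*([\d.]+)\s*\|", line)
        if m:
            task, value = m.group(1), float(m.group(2))
            scores.setdefault(task, value)  # keep the first row per task
    # Final score is the average across tasks, as described above
    return sum(scores.values()) / len(scores) if scores else None
```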
## Files

- `app.py` - Main Gradio application with UI and job management
- `utils/` - Utility package with helper modules:
  - `utils/io.py` - I/O operations: model/provider fetching, file operations, dataset persistence
  - `utils/jobs.py` - Job management: launching, monitoring, score extraction
- `models_providers.txt` - Configuration file with model-provider combinations
- `requirements.txt` - Python dependencies
- `README.md` - This file