---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.

Setup

Prerequisites

  • Python 3.8+
  • Hugging Face account with API token
  • Access to the IPTesting namespace on Hugging Face

Installation

  1. Clone or navigate to this repository:
cd InferenceProviderTestingBackend
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your Hugging Face token as an environment variable:
export HF_TOKEN="your_huggingface_token_here"

Important: Your HF_TOKEN must have:

  • Permission to call inference providers
  • Write access to the IPTesting organization
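
You can sanity-check that the token is being picked up with huggingface_hub (a minimal check; note that whoami only confirms which account the token belongs to, not its inference-provider or organization permissions):

import os
from huggingface_hub import whoami

# Raises if the token is missing or invalid; returns account info otherwise
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")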

Usage

Starting the Dashboard

Run the Gradio app:

python app.py

Initialize Models and Providers

  1. Click the "Fetch and Initialize Models/Providers" button to automatically populate the models_providers.txt file with popular models and their available inference providers.

  2. Alternatively, manually edit models_providers.txt with your desired model-provider combinations:

meta-llama/Llama-3.2-3B-Instruct  fireworks-ai
meta-llama/Llama-3.2-3B-Instruct  together-ai
Qwen/Qwen2.5-7B-Instruct  fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3  together-ai

Format: model_name provider_name (separated by spaces or tabs), one combination per line
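
For illustration, such a file can be parsed into (model, provider) pairs like this (a sketch only; the actual parsing lives in utils/io.py and may differ):

def read_model_provider_pairs(path="models_providers.txt"):
    # Each non-empty line holds: model_name provider_name (any whitespace between them)
    pairs = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs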

Launching Jobs

  1. Enter the evaluation tasks in the Tasks field (e.g., lighteval|mmlu|0|0)
  2. Verify the config file path (default: models_providers.txt)
  3. Click "Launch Jobs"

The system will:

  • Read all model-provider combinations from the config file
  • Launch a separate evaluation job for each combination (see the sketch after this list)
  • Log the job ID and status
  • Monitor job progress automatically
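
A rough sketch of the launch loop, where the pairs come from the config file (for example, parsed as in the sketch under "Initialize Models and Providers"); launch_eval_job is a hypothetical stand-in for the job-submission call in utils/jobs.py, which uses Hugging Face's job API:

from datetime import datetime, timezone

# job_results maps (model, provider) -> latest job info (see the Monitoring section)
job_results = {}

def launch_all(pairs, tasks):
    for model, provider in pairs:
        # launch_eval_job is a hypothetical helper that submits one evaluation job
        job_id = launch_eval_job(model=model, provider=provider, tasks=tasks)
        job_results[(model, provider)] = {
            "job_id": job_id,
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
        }
        print(f"Launched {model} on {provider}: job {job_id}")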

Monitoring Jobs

The Job Results table displays all jobs with:

  • Model: The model being tested
  • Provider: The inference provider
  • Last Run: Timestamp of when the job was last launched
  • Status: Current status (running/complete/failed/cancelled)
  • Current Score: Average score from the most recent run
  • Previous Score: Average score from the prior run (for comparison)
  • Latest Job Id: The most recent job ID; substitute it into https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the job

The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
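
For reference, a 30-second auto-refresh can be wired up in Gradio roughly like this (a sketch assuming gr.Timer drives the refresh; build_results_rows is a hypothetical helper, and the app's actual wiring may differ):

import gradio as gr

def refresh_results():
    # build_results_rows is a hypothetical helper that turns job_results into table rows
    return build_results_rows()

with gr.Blocks() as demo:
    table = gr.Dataframe(label="Job Results")
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    refresh_btn.click(refresh_results, outputs=table)
    timer.tick(refresh_results, outputs=table)

demo.launch()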

Configuration

Tasks Format

The tasks parameter follows the lighteval task format. Example:

  • lighteval|mmlu|0 - MMLU benchmark

Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at 00:00 (midnight) every day.
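
The midnight save can be scheduled with APScheduler's cron trigger, roughly as follows (save_daily_checkpoint is a hypothetical name for the save routine):

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
# Run the checkpoint save every day at 00:00
scheduler.add_job(save_daily_checkpoint, trigger="cron", hour=0, minute=0)
scheduler.start()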

Data Persistence

All job results are stored in a HuggingFace dataset (IPTesting/inference-provider-test-results), which means:

  • Results persist across app restarts
  • Historical score comparisons are maintained
  • Data can be accessed programmatically via the HF datasets library (see the example after this list)
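
For example, the stored results can be loaded with the datasets library (the split name here is an assumption):

from datasets import load_dataset

results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results[0])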

Architecture

  • Main Thread: Runs the Gradio interface
  • Monitor Thread: Updates job statuses every 30 seconds and extracts scores from completed jobs (sketched after this list)
  • APScheduler: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
  • Thread-safe: Uses locks to prevent race conditions when threads read or update job_results
  • HuggingFace Dataset Storage: Persists results to IPTesting/inference-provider-test-results dataset
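
A simplified sketch of the monitor thread and its lock (names are illustrative, check_job_status is a hypothetical helper, and the real loop also extracts scores from completed jobs):

import threading
import time

job_results_lock = threading.Lock()

def monitor_loop():
    while True:
        with job_results_lock:
            for entry in job_results.values():
                if entry["status"] == "running":
                    # check_job_status is a hypothetical helper querying the HF job API
                    entry["status"] = check_job_status(entry["job_id"])
        time.sleep(30)

threading.Thread(target=monitor_loop, daemon=True).start()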

Troubleshooting

Jobs Not Launching

  • Verify your HF_TOKEN is set and has the required permissions
  • Check that the IPTesting namespace exists and you have access
  • Review logs for specific error messages

Scores Not Appearing

  • Scores are extracted from job logs after completion
  • The extraction parses the results table that appears in job logs
  • It extracts the score for each task (from the first row where the task name appears)
  • The final score is the average of all task scores (see the parsing sketch after this list)
  • Example table format:
    | Task                    | Version | Metric                | Value  | Stderr |
    | extended:ifeval:0       |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
    | lighteval:gpqa:diamond:0 |        | gpqa_pass@k_with_k     | 0.5000 | 0.0503 |
    
  • If scores don't appear, check console output for extraction errors or parsing issues
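
A rough sketch of how such a log table could be parsed into an average score (illustrative only; the real extraction in utils/jobs.py may differ in detail):

def extract_average_score(log_text):
    # Keep the first row for each task and average the "Value" column
    scores = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 5 or cells[0] in ("", "Task"):
            continue
        task, value = cells[0], cells[3]
        try:
            score = float(value)
        except ValueError:
            continue  # skips separator rows and non-numeric cells
        scores.setdefault(task, score)  # first row where the task name appears wins
    return sum(scores.values()) / len(scores) if scores else None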

Files

  • app.py - Main Gradio application with UI and job management
  • utils/ - Utility package with helper modules:
    • utils/io.py - I/O operations: model/provider fetching, file operations, dataset persistence
    • utils/jobs.py - Job management: launching, monitoring, score extraction
  • models_providers.txt - Configuration file with model-provider combinations
  • requirements.txt - Python dependencies
  • README.md - This file