---
title: InferenceProviderTestingBackend
emoji: 📈
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
---

Inference Provider Testing Dashboard

A Gradio-based dashboard for launching and monitoring evaluation jobs across multiple models and inference providers using Hugging Face's job API.

Setup

Prerequisites

  • Python 3.8+
  • Hugging Face account with API token
  • Access to the IPTesting namespace on Hugging Face

Installation

  1. Clone or navigate to this repository:
cd InferenceProviderTestingBackend
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your Hugging Face token as an environment variable:
export HF_TOKEN="your_huggingface_token_here"

Important: Your HF_TOKEN must have:

  • Permission to call inference providers
  • Write access to the IPTesting organization
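
You can sanity-check that the token is being picked up with huggingface_hub (a minimal check; note that whoami only confirms which account the token belongs to, not its inference-provider or organization permissions):

import os
from huggingface_hub import whoami

# Raises if the token is missing or invalid; returns account info otherwise
info = whoami(token=os.environ["HF_TOKEN"])
print(f"Authenticated as: {info['name']}")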

Usage

Starting the Dashboard

Run the Gradio app:

python app.py

Initialize Models and Providers

  1. Click the "Fetch and Initialize Models/Providers" button to automatically populate the models_providers.txt file with popular models and their available inference providers.

  2. Alternatively, manually edit models_providers.txt with your desired model-provider combinations:

meta-llama/Llama-3.2-3B-Instruct  fireworks-ai
meta-llama/Llama-3.2-3B-Instruct  together-ai
Qwen/Qwen2.5-7B-Instruct  fireworks-ai
mistralai/Mistral-7B-Instruct-v0.3  together-ai

Format: model_name provider_name (separated by spaces or tabs), one combination per line
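
For illustration, such a file can be parsed into (model, provider) pairs like this (a sketch only; the actual parsing lives in utils/io.py and may differ):

def read_model_provider_pairs(path="models_providers.txt"):
    # Each non-empty line holds: model_name provider_name (any whitespace between them)
    pairs = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs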

Launching Jobs

  1. Enter the evaluation tasks in the Tasks field (e.g., lighteval|mmlu|0|0)
  2. Verify the config file path (default: models_providers.txt)
  3. Click "Launch Jobs"

The system will:

  • Read all model-provider combinations from the config file
  • Launch a separate evaluation job for each combination (see the sketch after this list)
  • Log the job ID and status
  • Monitor job progress automatically
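
A rough sketch of the launch loop, where the pairs come from the config file (for example, parsed as in the sketch under "Initialize Models and Providers"); launch_eval_job is a hypothetical stand-in for the job-submission call in utils/jobs.py, which uses Hugging Face's job API:

from datetime import datetime, timezone

# job_results maps (model, provider) -> latest job info (see the Monitoring section)
job_results = {}

def launch_all(pairs, tasks):
    for model, provider in pairs:
        # launch_eval_job is a hypothetical helper that submits one evaluation job
        job_id = launch_eval_job(model=model, provider=provider, tasks=tasks)
        job_results[(model, provider)] = {
            "job_id": job_id,
            "status": "running",
            "last_run": datetime.now(timezone.utc).isoformat(),
        }
        print(f"Launched {model} on {provider}: job {job_id}")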

Monitoring Jobs

The Job Results table displays all jobs with:

  • Model: The model being tested
  • Provider: The inference provider
  • Last Run: Timestamp of when the job was last launched
  • Status: Current status (running/complete/failed/cancelled)
  • Current Score: Average score from the most recent run
  • Previous Score: Average score from the prior run (for comparison)
  • Latest Job Id: The most recent job ID; substitute it into https://huggingface.co/jobs/NAMESPACE/JOBID to inspect the job

The table auto-refreshes every 30 seconds, or you can click "Refresh Results" for manual updates.
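
For reference, a 30-second auto-refresh can be wired up in Gradio roughly like this (a sketch assuming gr.Timer drives the refresh; build_results_rows is a hypothetical helper, and the app's actual wiring may differ):

import gradio as gr

def refresh_results():
    # build_results_rows is a hypothetical helper that turns job_results into table rows
    return build_results_rows()

with gr.Blocks() as demo:
    table = gr.Dataframe(label="Job Results")
    refresh_btn = gr.Button("Refresh Results")
    timer = gr.Timer(30)  # fires every 30 seconds

    refresh_btn.click(refresh_results, outputs=table)
    timer.tick(refresh_results, outputs=table)

demo.launch()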

Configuration

Tasks Format

The tasks parameter follows the lighteval task format. Example:

  • lighteval|mmlu|0 - MMLU benchmark

Daily Checkpoint

The system automatically saves all results to the HuggingFace dataset at 00:00 (midnight) every day.
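
The midnight save can be scheduled with APScheduler's cron trigger, roughly as follows (save_daily_checkpoint is a hypothetical name for the save routine):

from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
# Run the checkpoint save every day at 00:00
scheduler.add_job(save_daily_checkpoint, trigger="cron", hour=0, minute=0)
scheduler.start()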

Data Persistence

All job results are stored in a HuggingFace dataset (IPTesting/inference-provider-test-results), which means:

  • Results persist across app restarts
  • Historical score comparisons are maintained
  • Data can be accessed programmatically via the HF datasets library (see the example after this list)
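
For example, the stored results can be loaded with the datasets library (the split name here is an assumption):

from datasets import load_dataset

results = load_dataset("IPTesting/inference-provider-test-results", split="train")
print(results[0])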

Architecture

  • Main Thread: Runs the Gradio interface
  • Monitor Thread: Updates job statuses every 30 seconds and extracts scores from completed jobs (sketched after this list)
  • APScheduler: Background scheduler that handles daily checkpoint saves at midnight (cron-based)
  • Thread-safe: Uses locks to prevent race conditions when threads read or update job_results
  • HuggingFace Dataset Storage: Persists results to IPTesting/inference-provider-test-results dataset
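
A simplified sketch of the monitor thread and its lock (names are illustrative, check_job_status is a hypothetical helper, and the real loop also extracts scores from completed jobs):

import threading
import time

job_results_lock = threading.Lock()

def monitor_loop():
    while True:
        with job_results_lock:
            for entry in job_results.values():
                if entry["status"] == "running":
                    # check_job_status is a hypothetical helper querying the HF job API
                    entry["status"] = check_job_status(entry["job_id"])
        time.sleep(30)

threading.Thread(target=monitor_loop, daemon=True).start()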

Troubleshooting

Jobs Not Launching

  • Verify your HF_TOKEN is set and has the required permissions
  • Check that the IPTesting namespace exists and you have access
  • Review logs for specific error messages

Scores Not Appearing

  • Scores are extracted from job logs after completion
  • The extraction parses the results table that appears in job logs
  • It extracts the score for each task (from the first row where the task name appears)
  • The final score is the average of all task scores (see the parsing sketch after this list)
  • Example table format:
    | Task                    | Version | Metric                | Value  | Stderr |
    | extended:ifeval:0       |         | prompt_level_strict_acc | 0.9100 | 0.0288 |
    | lighteval:gpqa:diamond:0 |        | gpqa_pass@k_with_k     | 0.5000 | 0.0503 |
    
  • If scores don't appear, check console output for extraction errors or parsing issues
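
A rough sketch of how such a log table could be parsed into an average score (illustrative only; the real extraction in utils/jobs.py may differ in detail):

def extract_average_score(log_text):
    # Keep the first row for each task and average the "Value" column
    scores = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 5 or cells[0] in ("", "Task"):
            continue
        task, value = cells[0], cells[3]
        try:
            score = float(value)
        except ValueError:
            continue  # skips separator rows and non-numeric cells
        scores.setdefault(task, score)  # first row where the task name appears wins
    return sum(scores.values()) / len(scores) if scores else None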

Files

  • app.py - Main Gradio application with UI and job management
  • utils/ - Utility package with helper modules:
    • utils/io.py - I/O operations: model/provider fetching, file operations, dataset persistence
    • utils/jobs.py - Job management: launching, monitoring, score extraction
  • models_providers.txt - Configuration file with model-provider combinations
  • requirements.txt - Python dependencies
  • README.md - This file