|
|
--- |
|
|
title: EvalArena |
|
|
emoji: 🔥
|
|
colorFrom: pink |
|
|
colorTo: indigo |
|
|
sdk: gradio |
|
|
app_file: app.py |
|
|
pinned: true |
|
|
license: cc-by-nc-4.0 |
|
|
short_description: "An AI Judge Evaluation Platform" |
|
|
sdk_version: 5.19.0 |
|
|
--- |
|
|
|
|
|
# EvalArena |
|
|
|
|
|
An AI Judge Evaluation Platform |
|
|
|
|
|
## About |
|
|
|
|
|
EvalArena is a platform that allows users to compare and rate different AI evaluation models (judges). The platform uses a competitive ELO rating system to rank different judge models based on human preferences. |
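For intuition, here is a minimal sketch of an ELO update after a single pairwise vote. The K-factor and starting rating are illustrative defaults, not necessarily the values EvalArena uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A is preferred over judge B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one human preference vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: two judges start at 1000; the human prefers judge A.
print(update_elo(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```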
|
|
|
|
|
## Project Structure |
|
|
|
|
|
After refactoring, the project now has a cleaner structure: |
|
|
|
|
|
``` |
|
|
EvalArena/ |
|
|
│
├── src/                   # Source code
│   ├── app.py             # Application logic
│   ├── config.py          # Constants and configuration
│   ├── data_manager.py    # Dataset loading and management
│   ├── judge.py           # Judge evaluation functionality
│   └── ui.py              # Gradio UI components
│
├── data/                  # Data directory for CSV files
├── models.jsonl           # Model definitions
├── main.py                # Entry point
└── requirements.txt       # Dependencies
|
|
``` |
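As a rough illustration of how the pieces fit together, `main.py` is expected to do little more than build the Gradio interface defined in `src/ui.py` and launch it. The builder function name below is hypothetical; check `src/ui.py` for the real one.

```python
# main.py — minimal sketch; the actual wiring in this repository may differ.
from src.ui import build_demo  # hypothetical name for the Gradio UI builder

if __name__ == "__main__":
    demo = build_demo()   # assemble the Gradio Blocks/Interface from src/ui.py
    demo.launch()         # serve the app locally
```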
|
|
|
|
|
## Setup |
|
|
|
|
|
1. Clone the repository |
|
|
2. Install dependencies: |
|
|
``` |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
3. Create a `.env` file with the API keys for the providers you plan to use (a small loading sketch follows these steps):
|
|
``` |
|
|
OPENAI_API_KEY=your_key_here |
|
|
ANTHROPIC_API_KEY=your_key_here |
|
|
QUALIFIRE_API_KEY=your_qualifire_key_here |
|
|
``` |
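A minimal sketch of how these variables can be read at startup, assuming `python-dotenv` is available (a common companion to `.env` files; if it is not in `requirements.txt`, exporting the variables in your shell works just as well):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "QUALIFIRE_API_KEY"]
missing = [name for name in REQUIRED_KEYS if not os.getenv(name)]
if missing:
    print(f"Warning: missing API keys: {', '.join(missing)}")
```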
|
|
|
|
|
## Running |
|
|
|
|
|
Run the application using: |
|
|
|
|
|
``` |
|
|
python main.py |
|
|
``` |
|
|
|
|
|
This will start the Gradio web interface where you can: |
|
|
|
|
|
- Select test types (grounding, hallucinations, safety, etc.) |
|
|
- Get random examples |
|
|
- See evaluations from two random judge models |
|
|
- Select which judge provided a better evaluation |
|
|
- View the leaderboard of judges ranked by ELO score |
|
|
|
|
|
## Features |
|
|
|
|
|
- Multiple test types (prompt injections, safety, grounding, hallucinations, policy) |
|
|
- ELO-based competitive rating system |
|
|
- Support for various model providers (OpenAI, Anthropic, Together AI) |
|
|
- Detailed evaluations with scoring criteria |
|
|
- Persistent leaderboard (see the persistence sketch after this list)
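How EvalArena actually persists the leaderboard is defined in the source (likely `src/data_manager.py`); the following is only a sketch of the idea, using a hypothetical `data/leaderboard.csv` file:

```python
from pathlib import Path

import pandas as pd

LEADERBOARD_CSV = Path("data/leaderboard.csv")  # hypothetical file name


def load_ratings() -> dict[str, float]:
    """Load judge ratings from disk, or start fresh if no file exists yet."""
    if not LEADERBOARD_CSV.exists():
        return {}
    df = pd.read_csv(LEADERBOARD_CSV)
    return dict(zip(df["judge_id"], df["rating"]))


def save_ratings(ratings: dict[str, float]) -> None:
    """Write ratings back to disk so the leaderboard survives restarts."""
    df = pd.DataFrame(
        sorted(ratings.items(), key=lambda kv: kv[1], reverse=True),
        columns=["judge_id", "rating"],
    )
    df.to_csv(LEADERBOARD_CSV, index=False)
```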
|
|
|
|
|
## Overview |
|
|
|
|
|
This application allows users to: |
|
|
|
|
|
1. View AI-generated outputs based on input prompts |
|
|
2. Compare evaluations from two different AI judges |
|
|
3. Select the better evaluation |
|
|
4. Build a leaderboard of judges ranked by ELO score |
|
|
|
|
|
## Features |
|
|
|
|
|
- **Blind Comparison**: Judge identities are hidden until after selection |
|
|
- **ELO Rating System**: Calculates judge rankings based on user preferences |
|
|
- **Leaderboard**: Track performance of different evaluation models |
|
|
- **Sample Examples**: Includes pre-loaded examples for immediate testing |
|
|
|
|
|
## Setup |
|
|
|
|
|
### Prerequisites |
|
|
|
|
|
- Python 3.10+ (required by Gradio 5)
|
|
- Required packages: gradio, pandas, numpy |
|
|
|
|
|
### Installation |
|
|
|
|
|
1. Clone this repository: |
|
|
|
|
|
``` |
|
|
git clone https://github.com/yourusername/eval-arena.git |
|
|
cd eval-arena |
|
|
``` |
|
|
|
|
|
2. Install dependencies: |
|
|
|
|
|
``` |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
3. Run the application: |
|
|
|
|
|
``` |
|
|
python app.py |
|
|
``` |
|
|
|
|
|
4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860) |
|
|
|
|
|
## Usage |
|
|
|
|
|
1. **Get Random Example**: Click to load a random input/output pair |
|
|
2. **Get Judge Evaluations**: View two anonymous evaluations of the output |
|
|
3. **Select Better Evaluation**: Choose which evaluation you prefer |
|
|
4. **See Results**: Learn which judges you compared and update the leaderboard |
|
|
5. **Leaderboard Tab**: View current rankings of all judges |
|
|
|
|
|
## Extending the Application |
|
|
|
|
|
### Adding New Examples |
|
|
|
|
|
Add new examples in JSON format to the `data/examples` directory: |
|
|
|
|
|
```json |
|
|
{ |
|
|
"id": "example_id", |
|
|
"input": "Your input prompt", |
|
|
"output": "AI-generated output to evaluate" |
|
|
} |
|
|
``` |
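For illustration, a loader for this layout might look like the sketch below; the loader EvalArena actually uses lives in `src/data_manager.py` and may differ.

```python
import json
from pathlib import Path


def load_examples(examples_dir: str = "data/examples") -> list[dict]:
    """Read every *.json file in the examples directory into a list of dicts."""
    examples = []
    for path in sorted(Path(examples_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            example = json.load(f)
        # Keep only records that carry the three fields described above.
        if {"id", "input", "output"} <= example.keys():
            examples.append(example)
    return examples
```

The same pattern applies to the judge definitions in `data/judges`.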
|
|
|
|
|
### Adding New Judges |
|
|
|
|
|
Add new judges in JSON format to the `data/judges` directory: |
|
|
|
|
|
```json |
|
|
{ |
|
|
"id": "judge_id", |
|
|
"name": "Judge Name", |
|
|
"description": "Description of judge's evaluation approach" |
|
|
} |
|
|
``` |
|
|
|
|
|
### Integrating Real Models |
|
|
|
|
|
For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations. |
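A hedged sketch of that change, using the OpenAI Python client as one possible backend. The function signature, prompt template, and model name below are assumptions for illustration, not the project's actual code:

```python
import random

from openai import OpenAI  # one possible backend; any provider client works

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_random_judges_evaluations(example: dict, judges: list[dict]) -> dict[str, str]:
    """Pick two judges at random and request a real evaluation for each (sketch)."""
    judge_a, judge_b = random.sample(judges, 2)
    evaluations = {}
    for judge in (judge_a, judge_b):
        prompt = (
            f"You are {judge['name']}. {judge['description']}\n\n"
            f"Input:\n{example['input']}\n\n"
            f"Output to evaluate:\n{example['output']}\n\n"
            "Evaluate the output and explain your reasoning."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        evaluations[judge["id"]] = response.choices[0].message.content
    return evaluations
```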
|
|
|
|
|
## License |
|
|
|
|
|
CC BY-NC 4.0 (as declared in the Space metadata above)
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this platform in your research, please cite: |
|
|
|
|
|
``` |
|
|
@software{ai_eval_arena, |
|
|
author = {Your Name}, |
|
|
title = {AI Evaluation Judge Arena}, |
|
|
year = {2023}, |
|
|
url = {https://github.com/yourusername/eval-arena} |
|
|
} |
|
|
``` |
|
|
|
|
|
# Start the configuration |
|
|
|
|
|
Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path with your leaderboard's) and `src/about.py` (for the tasks).
|
|
|
|
|
Results files should be stored as JSON files with the following format (`model_dtype` may be `torch.float16`, `torch.bfloat16`, `8bit`, or `4bit`):
|
|
|
|
|
```json |
|
|
{
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "path of the model on the hub: org/model",
        "model_sha": "revision on the hub"
    },
    "results": {
        "task_name": {
            "metric_name": score
        },
        "task_name2": {
            "metric_name": score
        }
    }
}
|
|
``` |
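As a reference for what consuming such a file might look like (the real parsing logic lives in `src/leaderboard/read_evals.py`, described below, and may differ), here is a minimal flattening sketch:

```python
import json


def read_result_file(path: str) -> list[dict]:
    """Flatten one results file into one row per (task, metric) pair."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    config = data["config"]
    rows = []
    for task_name, metrics in data["results"].items():
        for metric_name, score in metrics.items():
            rows.append({
                "model_name": config["model_name"],
                "model_sha": config["model_sha"],
                "task": task_name,
                "metric": metric_name,
                "score": score,
            })
    return rows
```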
|
|
|
|
|
Request files are created automatically by this tool. |
|
|
|
|
|
If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.
|
|
|
|
|
# Code logic for more complex edits |
|
|
|
|
|
You'll find |
|
|
|
|
|
- the main table's column names and properties in `src/display/utils.py`
|
|
- the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
|
|
- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py` |
|
|
|