---
title: EvalArena
emoji: 🥇
colorFrom: pink
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: cc-by-nc-4.0
short_description: "An AI Judge Evaluation Platform"
sdk_version: 5.19.0
---
# EvalArena
An AI Judge Evaluation Platform
## About
EvalArena is a platform that allows users to compare and rate different AI evaluation models (judges). The platform uses a competitive ELO rating system to rank different judge models based on human preferences.
## Project Structure
After refactoring, the project now has a cleaner structure:
```
EvalArena/
│
├── src/                 # Source code
│   ├── app.py           # Application logic
│   ├── config.py        # Constants and configuration
│   ├── data_manager.py  # Dataset loading and management
│   ├── judge.py         # Judge evaluation functionality
│   └── ui.py            # Gradio UI components
│
├── data/                # Data directory for CSV files
├── models.jsonl         # Model definitions
├── main.py              # Entry point
└── requirements.txt     # Dependencies
```
## Setup
1. Clone the repository
2. Install dependencies:
```
pip install -r requirements.txt
```
3. Create a `.env` file with any API keys:
```
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
QUALIFIRE_API_KEY=your_qualifire_key_here
```
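Assuming the application loads these keys with `python-dotenv` (an assumption; check `requirements.txt` and `src/config.py` for how keys are actually read), the loading step looks roughly like this:
```python
# Minimal sketch, assuming python-dotenv is a dependency and a .env file
# sits in the project root.
import os

from dotenv import load_dotenv

load_dotenv()  # copy KEY=value pairs from .env into the process environment

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
qualifire_key = os.getenv("QUALIFIRE_API_KEY")

if not any([openai_key, anthropic_key, qualifire_key]):
    raise RuntimeError("No provider API keys found; check your .env file")
```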
## Running
Run the application using:
```
python main.py
```
This will start the Gradio web interface where you can:
- Select test types (grounding, hallucinations, safety, etc.)
- Get random examples
- See evaluations from two random judge models
- Select which judge provided a better evaluation
- View the leaderboard of judges ranked by ELO score
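For reference, `main.py` just wires the modules under `src/` together and launches the interface. A minimal sketch (assuming `src/ui.py` exposes a `build_demo()` helper that returns a Gradio `Blocks` app, which is a hypothetical name):
```python
# Hypothetical entry-point sketch; the real main.py may be organized differently.
from src.ui import build_demo  # assumed helper that assembles the Gradio UI


def main() -> None:
    demo = build_demo()
    # Serve locally on the default Gradio port; adjust if 7860 is taken.
    demo.launch(server_name="127.0.0.1", server_port=7860)


if __name__ == "__main__":
    main()
```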
## Features
- Multiple test types (prompt injections, safety, grounding, hallucinations, policy)
- ELO-based competitive rating system
- Support for various model providers (OpenAI, Anthropic, Together AI)
- Detailed evaluations with scoring criteria
- Persistent leaderboard
## Overview
This application allows users to:
1. View AI-generated outputs based on input prompts
2. Compare evaluations from two different AI judges
3. Select the better evaluation
4. Build a leaderboard of judges ranked by ELO score
## Features
- **Blind Comparison**: Judge identities are hidden until after selection
- **ELO Rating System**: Calculates judge rankings based on user preferences (see the rating-update sketch after this list)
- **Leaderboard**: Track performance of different evaluation models
- **Sample Examples**: Includes pre-loaded examples for immediate testing
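The rating update behind the leaderboard is the standard Elo formula; a minimal sketch (the K-factor of 32 is an assumed default, not necessarily what EvalArena uses):
```python
# Standard Elo update for a single pairwise comparison.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A beats judge B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two judges start at 1500; judge A wins the comparison.
print(update_elo(1500, 1500, a_won=True))  # -> (1516.0, 1484.0)
```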
## Setup
### Prerequisites
- Python 3.10+ (required by Gradio 5.x)
- Required packages: gradio, pandas, numpy
### Installation
1. Clone this repository:
```
git clone https://github.com/yourusername/eval-arena.git
cd eval-arena
```
2. Install dependencies:
```
pip install -r requirements.txt
```
3. Run the application:
```
python app.py
```
4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860)
## Usage
1. **Get Random Example**: Click to load a random input/output pair
2. **Get Judge Evaluations**: View two anonymous evaluations of the output
3. **Select Better Evaluation**: Choose which evaluation you prefer
4. **See Results**: Learn which judges you compared and update the leaderboard
5. **Leaderboard Tab**: View current rankings of all judges
## Extending the Application
### Adding New Examples
Add new examples in JSON format to the `data/examples` directory:
```json
{
"id": "example_id",
"input": "Your input prompt",
"output": "AI-generated output to evaluate"
}
```
### Adding New Judges
Add new judges in JSON format to the `data/judges` directory:
```json
{
"id": "judge_id",
"name": "Judge Name",
"description": "Description of judge's evaluation approach"
}
```
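Both directories can be read the same way. A minimal loader sketch, assuming one JSON object per file as shown above (the actual logic lives in `src/data_manager.py` and may differ):
```python
# Hypothetical loader sketch; see src/data_manager.py for the real logic.
import json
from pathlib import Path


def load_json_dir(directory: str) -> list[dict]:
    """Read every *.json file in a directory into a list of dicts."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    return records


examples = load_json_dir("data/examples")
judges = load_json_dir("data/judges")
print(f"Loaded {len(examples)} examples and {len(judges)} judges")
```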
### Integrating Real Models
For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations.
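For example, a judge backed by the OpenAI API could be plugged in roughly as follows (a sketch only: the helper name, prompt, and model choice are illustrative, and the real code in `src/judge.py` may be structured differently):
```python
# Sketch of calling a real model from the evaluation path; adapt to the
# actual get_random_judges_evaluations signature.
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


def evaluate_with_openai(input_prompt: str, output: str, model: str = "gpt-4o-mini") -> str:
    """Ask a single model to critique an AI-generated output."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are an evaluation judge. Score the output for "
                           "accuracy, safety, and helpfulness, then explain your reasoning.",
            },
            {
                "role": "user",
                "content": f"Input:\n{input_prompt}\n\nOutput to evaluate:\n{output}",
            },
        ],
    )
    return response.choices[0].message.content
```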
## License
MIT
## Citation
If you use this platform in your research, please cite:
```
@software{ai_eval_arena,
author = {Your Name},
title = {AI Evaluation Judge Arena},
year = {2023},
url = {https://github.com/yourusername/eval-arena}
}
```
# Start the configuration
Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
Results files should have the following format and be stored as JSON files (`model_dtype` is one of `torch.float16`, `torch.bfloat16`, `8bit`, or `4bit`):
```json
{
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "path of the model on the hub: org/model",
        "model_sha": "revision on the hub"
    },
    "results": {
        "task_name": {
            "metric_name": score
        },
        "task_name2": {
            "metric_name": score
        }
    }
}
```
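Conceptually, the leaderboard flattens each results file into one row per task and metric. A simplified sketch of that idea (not the actual parsing code in `src/leaderboard/read_evals.py`):
```python
# Simplified sketch: turn one results file into leaderboard rows.
import json


def flatten_results(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    model_name = data["config"]["model_name"]
    rows = []
    for task_name, metrics in data["results"].items():
        for metric_name, score in metrics.items():
            rows.append(
                {"model": model_name, "task": task_name, "metric": metric_name, "score": score}
            )
    return rows
```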
Request files are created automatically by this tool.
If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.
# Code logic for more complex edits
You'll find
- the main table's column names and properties in `src/display/utils.py`
- the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`