---
title: EvalArena
emoji: πŸ₯‡
colorFrom: pink
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: cc-by-nc-4.0
short_description: "An AI Judge Evaluation Platform"
sdk_version: 5.19.0
---

# EvalArena

An AI Judge Evaluation Platform

## About

EvalArena is a platform for comparing and rating AI evaluation models (judges). It uses a competitive ELO rating system to rank judge models based on human preferences.
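
The ELO update itself is standard pairwise math. Below is a minimal sketch of it; the K-factor and function names are illustrative and not taken from the codebase:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A beats judge B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one human comparison."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: judge A (1500) beats judge B (1520); A gains points, B loses the same amount.
print(update_elo(1500, 1520, a_won=True))
```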

## Project Structure

After refactoring, the project now has a cleaner structure:

```
EvalArena/
β”‚
β”œβ”€β”€ src/                    # Source code
β”‚   β”œβ”€β”€ app.py              # Application logic
β”‚   β”œβ”€β”€ config.py           # Constants and configuration
β”‚   β”œβ”€β”€ data_manager.py     # Dataset loading and management
β”‚   β”œβ”€β”€ judge.py            # Judge evaluation functionality
β”‚   └── ui.py               # Gradio UI components
β”‚
β”œβ”€β”€ data/                   # Data directory for CSV files
β”œβ”€β”€ models.jsonl            # Model definitions
β”œβ”€β”€ main.py                 # Entry point
└── requirements.txt        # Dependencies
```
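
For orientation, the entry point can be as thin as the sketch below; the `create_ui` factory name is an assumption about `src/ui.py`, not something guaranteed by the repository.

```python
# main.py -- illustrative only; the actual module contents may differ
from src.ui import create_ui  # assumed factory that builds the Gradio app


def main() -> None:
    demo = create_ui()
    demo.launch()  # serves the Gradio interface, by default on http://127.0.0.1:7860


if __name__ == "__main__":
    main()
```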

## Setup

1. Clone the repository
2. Install dependencies:
   ```
   pip install -r requirements.txt
   ```
3. Create a `.env` file with the API keys for the providers you plan to use (a quick way to check that they load is sketched after this list):
   ```
   OPENAI_API_KEY=your_key_here
   ANTHROPIC_API_KEY=your_key_here
   QUALIFIRE_API_KEY=your_qualifire_key_here
   ```
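
To check that the keys are picked up, the usual `python-dotenv` pattern looks like this (a sketch; the app may load its environment differently):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads variables from .env into the process environment

for name in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "QUALIFIRE_API_KEY"):
    print(f"{name}: {'set' if os.getenv(name) else 'missing'}")
```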

## Running

Run the application using:

```
python main.py
```

This will start the Gradio web interface where you can:

- Select test types (grounding, hallucinations, safety, etc.)
- Get random examples
- See evaluations from two random judge models
- Select which judge provided a better evaluation
- View the leaderboard of judges ranked by ELO score

## Features

- Multiple test types (prompt injections, safety, grounding, hallucinations, policy)
- ELO-based competitive rating system
- Support for various model providers (OpenAI, Anthropic, Together AI)
- Detailed evaluations with scoring criteria
- Persistent leaderboard

## Overview

This application allows users to:

1. View AI-generated outputs based on input prompts
2. Compare evaluations from two different AI judges
3. Select the better evaluation
4. Build a leaderboard of judges ranked by ELO score

## Features

- **Blind Comparison**: Judge identities are hidden until after selection
- **ELO Rating System**: Calculates judge rankings based on user preferences
- **Leaderboard**: Track performance of different evaluation models
- **Sample Examples**: Includes pre-loaded examples for immediate testing

## Setup

### Prerequisites

- Python 3.6+
- Required packages: gradio, pandas, numpy

### Installation

1. Clone this repository:

```
git clone https://github.com/yourusername/eval-arena.git
cd eval-arena
```

2. Install dependencies:

```
pip install -r requirements.txt
```

3. Run the application:

```
python main.py
```

4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860)

## Usage

1. **Get Random Example**: Click to load a random input/output pair
2. **Get Judge Evaluations**: View two anonymous evaluations of the output
3. **Select Better Evaluation**: Choose which evaluation you prefer
4. **See Results**: Learn which judges you compared and update the leaderboard
5. **Leaderboard Tab**: View current rankings of all judges

## Extending the Application

### Adding New Examples

Add new examples in JSON format to the `data/examples` directory:

```json
{
  "id": "example_id",
  "input": "Your input prompt",
  "output": "AI-generated output to evaluate"
}
```
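
A loader for that directory could look roughly like the following sketch; the actual loading logic lives in `src/data_manager.py` and may differ.

```python
import json
from pathlib import Path


def load_examples(examples_dir="data/examples"):
    """Read every *.json file in the examples directory into a list of dicts."""
    examples = []
    for path in sorted(Path(examples_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            examples.append(json.load(f))
    return examples
```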

### Adding New Judges

Add new judges in JSON format to the `data/judges` directory:

```json
{
  "id": "judge_id",
  "name": "Judge Name",
  "description": "Description of judge's evaluation approach"
}
```

### Integrating Real Models

For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations.
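
As a starting point, a real judge call might look like the sketch below. The OpenAI client usage is standard (assuming the `openai` package is installed), but the model name, the prompt, and how the result is wired back into `get_random_judges_evaluations` are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate_with_model(input_text: str, output_text: str, model: str = "gpt-4o-mini") -> str:
    """Ask a real model to act as a judge and return its written evaluation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are an impartial evaluator. Assess the output for correctness, groundedness, and safety.",
            },
            {
                "role": "user",
                "content": f"Input:\n{input_text}\n\nOutput to evaluate:\n{output_text}",
            },
        ],
    )
    return response.choices[0].message.content
```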

## License

MIT

## Citation

If you use this platform in your research, please cite:

```
@software{ai_eval_arena,
  author = {Your Name},
  title = {AI Evaluation Judge Arena},
  year = {2023},
  url = {https://github.com/yourusername/eval-arena}
}
```

# Start the configuration

Most of the variables you need to change for a default leaderboard are in `src/env.py` (replace the repository paths with your own) and `src/about.py` (for the tasks).

Results files should have the following format and be stored as JSON files (`model_dtype` is one of `torch.float16`, `torch.bfloat16`, `8bit`, or `4bit`; each metric value is a numeric score):

```json
{
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "path of the model on the hub: org/model",
        "model_sha": "revision on the hub"
    },
    "results": {
        "task_name": {
            "metric_name": 0.0
        },
        "task_name2": {
            "metric_name": 0.0
        }
    }
}
```
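
For reference, a file in that format can be flattened into a single leaderboard row roughly as follows (a sketch only; the real parsing lives in `src/leaderboard/read_evals.py`):

```python
import json


def result_file_to_row(path: str) -> dict:
    """Flatten a results JSON file into one flat dict (one leaderboard row)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    row = {
        "model_name": data["config"]["model_name"],
        "model_sha": data["config"]["model_sha"],
    }
    for task_name, metrics in data["results"].items():
        for metric_name, score in metrics.items():
            row[f"{task_name}/{metric_name}"] = score
    return row
```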

Request files are created automatically by this tool.

If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.

# Code logic for more complex edits

You'll find

- the main table's column names and properties in `src/display/utils.py`
- the logic to read all results and request files and convert them into dataframe rows in `src/leaderboard/read_evals.py` and `src/populate.py`
- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`