|
|
--- |
|
|
title: EvalArena |
|
|
emoji: 🔥
|
|
colorFrom: pink |
|
|
colorTo: indigo |
|
|
sdk: gradio |
|
|
app_file: app.py |
|
|
pinned: true |
|
|
license: cc-by-nc-4.0 |
|
|
short_description: "An AI Judge Evaluation Platform" |
|
|
sdk_version: 5.19.0 |
|
|
--- |
|
|
|
|
|
# EvalArena |
|
|
|
|
|
An AI Judge Evaluation Platform |
|
|
|
|
|
## About |
|
|
|
|
|
EvalArena is a platform that allows users to compare and rate different AI evaluation models (judges). The platform uses a competitive ELO rating system to rank different judge models based on human preferences. |
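For intuition, here is a minimal sketch of an ELO update after a single pairwise vote. The K-factor and starting rating are illustrative defaults, not necessarily the values EvalArena uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A is preferred over judge B under the ELO model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one human preference vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: two judges start at 1000; the human prefers judge A.
print(update_elo(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```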
|
|
|
|
|
## Project Structure |
|
|
|
|
|
After refactoring, the project now has a cleaner structure: |
|
|
|
|
|
``` |
|
|
EvalArena/ |
|
|
│
├── src/                   # Source code
│   ├── app.py             # Application logic
│   ├── config.py          # Constants and configuration
│   ├── data_manager.py    # Dataset loading and management
│   ├── judge.py           # Judge evaluation functionality
│   └── ui.py              # Gradio UI components
│
├── data/                  # Data directory for CSV files
├── models.jsonl           # Model definitions
├── main.py                # Entry point
└── requirements.txt       # Dependencies
|
|
``` |
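As a rough illustration of how the pieces fit together, `main.py` is expected to do little more than build the Gradio interface defined in `src/ui.py` and launch it. The builder function name below is hypothetical; check `src/ui.py` for the real one.

```python
# main.py — minimal sketch; the actual wiring in this repository may differ.
from src.ui import build_demo  # hypothetical name for the Gradio UI builder

if __name__ == "__main__":
    demo = build_demo()   # assemble the Gradio Blocks/Interface from src/ui.py
    demo.launch()         # serve the app locally
```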
|
|
|
|
|
## Setup |
|
|
|
|
|
1. Clone the repository |
|
|
2. Install dependencies: |
|
|
``` |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
3. Create a `.env` file with the API keys for the providers you plan to use (a small loading sketch follows these steps):
|
|
``` |
|
|
OPENAI_API_KEY=your_key_here |
|
|
ANTHROPIC_API_KEY=your_key_here |
|
|
QUALIFIRE_API_KEY=your_qualifire_key_here |
|
|
``` |
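A minimal sketch of how these variables can be read at startup, assuming `python-dotenv` is available (a common companion to `.env` files; if it is not in `requirements.txt`, exporting the variables in your shell works just as well):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "QUALIFIRE_API_KEY"]
missing = [name for name in REQUIRED_KEYS if not os.getenv(name)]
if missing:
    print(f"Warning: missing API keys: {', '.join(missing)}")
```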
|
|
|
|
|
## Running |
|
|
|
|
|
Run the application using: |
|
|
|
|
|
``` |
|
|
python main.py |
|
|
``` |
|
|
|
|
|
This will start the Gradio web interface where you can: |
|
|
|
|
|
- Select test types (grounding, hallucinations, safety, etc.) |
|
|
- Get random examples |
|
|
- See evaluations from two random judge models |
|
|
- Select which judge provided a better evaluation |
|
|
- View the leaderboard of judges ranked by ELO score |
|
|
|
|
|
## Features |
|
|
|
|
|
- Multiple test types (prompt injections, safety, grounding, hallucinations, policy) |
|
|
- ELO-based competitive rating system |
|
|
- Support for various model providers (OpenAI, Anthropic, Together AI) |
|
|
- Detailed evaluations with scoring criteria |
|
|
- Persistent leaderboard (see the persistence sketch after this list)
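How EvalArena actually persists the leaderboard is defined in the source (likely `src/data_manager.py`); the following is only a sketch of the idea, using a hypothetical `data/leaderboard.csv` file:

```python
from pathlib import Path

import pandas as pd

LEADERBOARD_CSV = Path("data/leaderboard.csv")  # hypothetical file name


def load_ratings() -> dict[str, float]:
    """Load judge ratings from disk, or start fresh if no file exists yet."""
    if not LEADERBOARD_CSV.exists():
        return {}
    df = pd.read_csv(LEADERBOARD_CSV)
    return dict(zip(df["judge_id"], df["rating"]))


def save_ratings(ratings: dict[str, float]) -> None:
    """Write ratings back to disk so the leaderboard survives restarts."""
    df = pd.DataFrame(
        sorted(ratings.items(), key=lambda kv: kv[1], reverse=True),
        columns=["judge_id", "rating"],
    )
    df.to_csv(LEADERBOARD_CSV, index=False)
```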
|
|
|
|
|
## Overview |
|
|
|
|
|
This application allows users to: |
|
|
|
|
|
1. View AI-generated outputs based on input prompts |
|
|
2. Compare evaluations from two different AI judges |
|
|
3. Select the better evaluation |
|
|
4. Build a leaderboard of judges ranked by ELO score |
|
|
|
|
|
## Features |
|
|
|
|
|
- **Blind Comparison**: Judge identities are hidden until after selection |
|
|
- **ELO Rating System**: Calculates judge rankings based on user preferences |
|
|
- **Leaderboard**: Track performance of different evaluation models |
|
|
- **Sample Examples**: Includes pre-loaded examples for immediate testing |
|
|
|
|
|
## Setup |
|
|
|
|
|
### Prerequisites |
|
|
|
|
|
- Python 3.10+ (required by Gradio 5)
|
|
- Required packages: gradio, pandas, numpy |
|
|
|
|
|
### Installation |
|
|
|
|
|
1. Clone this repository: |
|
|
|
|
|
``` |
|
|
git clone https://github.com/yourusername/eval-arena.git |
|
|
cd eval-arena |
|
|
``` |
|
|
|
|
|
2. Install dependencies: |
|
|
|
|
|
``` |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
3. Run the application: |
|
|
|
|
|
``` |
|
|
python app.py |
|
|
``` |
|
|
|
|
|
4. Open your browser and navigate to the URL displayed in the terminal (typically http://127.0.0.1:7860) |
|
|
|
|
|
## Usage |
|
|
|
|
|
1. **Get Random Example**: Click to load a random input/output pair |
|
|
2. **Get Judge Evaluations**: View two anonymous evaluations of the output |
|
|
3. **Select Better Evaluation**: Choose which evaluation you prefer |
|
|
4. **See Results**: Learn which judges you compared and update the leaderboard |
|
|
5. **Leaderboard Tab**: View current rankings of all judges |
|
|
|
|
|
## Extending the Application |
|
|
|
|
|
### Adding New Examples |
|
|
|
|
|
Add new examples in JSON format to the `data/examples` directory: |
|
|
|
|
|
```json |
|
|
{ |
|
|
"id": "example_id", |
|
|
"input": "Your input prompt", |
|
|
"output": "AI-generated output to evaluate" |
|
|
} |
|
|
``` |
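For illustration, a loader for this layout might look like the sketch below; the loader EvalArena actually uses lives in `src/data_manager.py` and may differ.

```python
import json
from pathlib import Path


def load_examples(examples_dir: str = "data/examples") -> list[dict]:
    """Read every *.json file in the examples directory into a list of dicts."""
    examples = []
    for path in sorted(Path(examples_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            example = json.load(f)
        # Keep only records that carry the three fields described above.
        if {"id", "input", "output"} <= example.keys():
            examples.append(example)
    return examples
```

The same pattern applies to the judge definitions in `data/judges`.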
|
|
|
|
|
### Adding New Judges |
|
|
|
|
|
Add new judges in JSON format to the `data/judges` directory: |
|
|
|
|
|
```json |
|
|
{ |
|
|
"id": "judge_id", |
|
|
"name": "Judge Name", |
|
|
"description": "Description of judge's evaluation approach" |
|
|
} |
|
|
``` |
|
|
|
|
|
### Integrating Real Models |
|
|
|
|
|
For production use, modify the `get_random_judges_evaluations` function to call actual AI evaluation models instead of using the simulated evaluations. |
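A hedged sketch of that change, using the OpenAI Python client as one possible backend. The function signature, prompt template, and model name below are assumptions for illustration, not the project's actual code:

```python
import random

from openai import OpenAI  # one possible backend; any provider client works

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def get_random_judges_evaluations(example: dict, judges: list[dict]) -> dict[str, str]:
    """Pick two judges at random and request a real evaluation for each (sketch)."""
    judge_a, judge_b = random.sample(judges, 2)
    evaluations = {}
    for judge in (judge_a, judge_b):
        prompt = (
            f"You are {judge['name']}. {judge['description']}\n\n"
            f"Input:\n{example['input']}\n\n"
            f"Output to evaluate:\n{example['output']}\n\n"
            "Evaluate the output and explain your reasoning."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        evaluations[judge["id"]] = response.choices[0].message.content
    return evaluations
```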
|
|
|
|
|
## License |
|
|
|
|
|
CC BY-NC 4.0 (as declared in the Space metadata above)
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this platform in your research, please cite: |
|
|
|
|
|
``` |
|
|
@software{ai_eval_arena, |
|
|
author = {Your Name}, |
|
|
title = {AI Evaluation Judge Arena}, |
|
|
year = {2023}, |
|
|
url = {https://github.com/yourusername/eval-arena} |
|
|
} |
|
|
``` |
|
|
|
|
|
# Start the configuration |
|
|
|
|
|
Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path with your leaderboard's) and `src/about.py` (for the tasks).
|
|
|
|
|
Results files should be stored as JSON files with the following format (`model_dtype` may be `torch.float16`, `torch.bfloat16`, `8bit`, or `4bit`):
|
|
|
|
|
```json |
|
|
{
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "path of the model on the hub: org/model",
        "model_sha": "revision on the hub"
    },
    "results": {
        "task_name": {
            "metric_name": score
        },
        "task_name2": {
            "metric_name": score
        }
    }
}
|
|
``` |
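As a reference for what consuming such a file might look like (the real parsing logic lives in `src/leaderboard/read_evals.py`, described below, and may differ), here is a minimal flattening sketch:

```python
import json


def read_result_file(path: str) -> list[dict]:
    """Flatten one results file into one row per (task, metric) pair."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    config = data["config"]
    rows = []
    for task_name, metrics in data["results"].items():
        for metric_name, score in metrics.items():
            rows.append({
                "model_name": config["model_name"],
                "model_sha": config["model_sha"],
                "task": task_name,
                "metric": metric_name,
                "score": score,
            })
    return rows
```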
|
|
|
|
|
Request files are created automatically by this tool. |
|
|
|
|
|
If you encounter a problem on the Space, don't hesitate to restart it to remove the created `eval-queue`, `eval-queue-bk`, `eval-results`, and `eval-results-bk` folders.
|
|
|
|
|
# Code logic for more complex edits |
|
|
|
|
|
You'll find |
|
|
|
|
|
- the main table's column names and properties in `src/display/utils.py`
|
|
- the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
|
|
- the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py` |
|
|
|