---
title: SE-Arena
emoji: 🛠️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
hf_oauth: true
pinned: false
short_description: The chatbot arena for software engineering
---

# SE Arena: Evaluate the Best SE Chatbots with Long-Context Interactions

Welcome to **SE Arena**, an open-source platform designed for evaluating software engineering-focused chatbots. SE Arena benchmarks foundation models (FMs), such as large language models (LLMs), in iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.

## Key Features

- **Advanced Pairwise Comparisons**: Assess chatbots using Elo scores, PageRank, and Newman modularity to understand both global performance and task-specific strengths.
- **Interactive Evaluation**: Test chatbots in multi-round conversations tailored for SE tasks like debugging, code generation, and requirement refinement.
- **Open-Source**: Built on [Hugging Face Spaces](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena), enabling transparency and fostering community-driven innovation.
- **Transparent Leaderboard**: View real-time model rankings across diverse SE workflows, updated using advanced evaluation metrics.
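To make the Elo-based ranking concrete, here is a minimal sketch (not SE Arena's actual implementation) of how pairwise votes can update two models' ratings, with ties scored as half a win for each side; the model names and the `k`-factor of 32 are illustrative assumptions:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Update a pair of Elo ratings.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    # Expected score of A under the standard logistic Elo curve.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (outcome - expected_a)
    # Zero-sum update: whatever A gains, B loses.
    return r_a + delta, r_b - delta


# Start both hypothetical models at 1000 and replay a stream of votes.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner, loser in [("model_a", "model_b"), ("model_a", "model_b")]:
    ratings[winner], ratings[loser] = elo_update(
        ratings[winner], ratings[loser], outcome=1.0
    )
```

Because the update is zero-sum, the rating pool stays constant; repeated wins against the same opponent yield diminishing gains as the expected score approaches 1.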

## Why SE Arena?

Existing evaluation frameworks often do not address the complex, iterative nature of SE tasks. SE Arena fills this gap by:

- Supporting long-context, multi-turn evaluations to capture iterative workflows.
- Allowing anonymous model comparisons to prevent bias.
- Providing rich, multidimensional metrics for more nuanced model evaluations.

## How It Works

1. **Submit a Prompt**: Sign in and input your SE-related task (e.g., debugging, code reviews).
2. **Compare Responses**: Two anonymous chatbots provide responses to your query.
3. **Vote**: Choose the better response, mark as tied, or select "Can't Decide."
4. **Iterative Testing**: Continue the conversation with follow-up prompts to test contextual understanding over multiple rounds.
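Beyond Elo, the votes collected above can also feed a graph-based ranking. The sketch below shows one common way to do this, treating each vote as a directed edge from the loser to the winner and running PageRank over the result; this is an illustration of the idea, not SE Arena's actual pipeline, and the model names are hypothetical:

```python
def pagerank(edges, nodes, damping=0.85, iters=100):
    """PageRank over a vote graph: each edge (loser, winner) endorses the winner."""
    nodes = list(nodes)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out = {n: [] for n in nodes}
    for loser, winner in edges:
        out[loser].append(winner)
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # Dangling nodes (undefeated models) spread their mass evenly.
            targets = out[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


# "a" beat "b" and "c"; "b" beat "c" (edges point loser -> winner).
votes = [("b", "a"), ("c", "a"), ("c", "b")]
scores = pagerank(votes, {"a", "b", "c"})
```

Unlike raw win rates, PageRank weights a victory by the strength of the defeated opponent, which helps when models face uneven slates of competitors.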

## Getting Started

### Prerequisites

- A [Hugging Face](https://huggingface.co) account.
- Basic understanding of software engineering workflows.

### Usage

1. Navigate to the [SE Arena platform](https://huggingface.co/spaces/SE-Arena/Software-Engineering-Arena).
2. Sign in with your Hugging Face account.
3. Enter your SE task prompt and start evaluating model responses.
4. Vote on the better response or continue multi-round interactions to test contextual understanding.

## Contributing

We welcome contributions from the community! Here's how you can help:

1. **Submit Prompts**: Share your SE-related tasks to enrich our evaluation dataset.
2. **Report Issues**: Found a bug or have a feature request? Open an issue in this repository.
3. **Enhance the Codebase**: Fork the repository, make your changes, and submit a pull request.

## Privacy Policy

Your interactions are anonymized and used solely for improving SE Arena and FM benchmarking. By using SE Arena, you agree to our [Terms of Service](#).

## Future Plans

- **Enhanced Metrics**: Add round-wise analysis and context-aware evaluation metrics.
- **Domain-Specific Sub-Leaderboards**: Rankings focused on tasks like debugging, requirement refinement, etc.
- **Advanced Context Compression**: Techniques like LongRoPE and SelfExtend to manage long-term memory.
- **Support for Multimodal Models**: Evaluate models that integrate text, code, and other modalities.

## Contact

For inquiries or feedback, please [open an issue](https://github.com/SE-Arena/Software-Engineer-Arena/issues/new) in this repository. We welcome your contributions and suggestions!