jitinpatronus committed
Commit e5da7a6 · verified · 1 Parent(s): a44fe3a

Update README.md

Files changed (1):
1. README.md +13 -63
README.md CHANGED
@@ -1,15 +1,16 @@
- 
  ---
- title: TRAIL
- emoji: 🥇
+ title: TRAIL Leaderboard
+ emoji: 🏆
  colorFrom: green
  colorTo: indigo
  sdk: gradio
  app_file: app.py
  pinned: true
  license: mit
- short_description: 'TRAIL: Trace Reasoning and Agentic Issue Localization'
+ short_description: Trace Reasoning and Agentic Issue Localization Leaderboard
  sdk_version: 5.19.0
+ tags:
+ - leaderboard
  ---
  # Model Performance Leaderboard

@@ -17,69 +18,18 @@ This is a Hugging Face Space that hosts a leaderboard for comparing model perfor

  ## Features

- - **Submit Model Results**: Share your model's performance metrics
- - **Interactive Leaderboard**: View and sort all submissions
- - **Integrated Backend**: Stores all submissions with timestamp and attribution
- - **Customizable Metrics**: Configure which metrics to display and track
- 
- ## Installation
- 
- ### Setting Up Your Space
- 
- 1. Upload all files to your Hugging Face Space
- 2. Make sure to make `start.sh` executable:
- ```bash
- chmod +x start.sh
- ```
- 3. Configure your Space to use the `start.sh` script as the entry point
- 
- ### Troubleshooting Installation Issues
- 
- If you encounter JSON parsing errors:
- 1. Check if `models.json` exists and is a valid JSON file
- 2. Run `python setup.py` to regenerate configuration files
- 3. If problems persist, delete the `models.json` file and let the setup script create a new one
- 
- ## How to Use
- 
- ### Viewing the Leaderboard
- 
- Navigate to the "Leaderboard" tab to see all submitted models. You can:
- - Sort by any metric (click on the dropdown)
- - Change sort order (ascending/descending)
- - Refresh the leaderboard for the latest submissions
- 
- ### Submitting a Model
- 
- 1. Go to the "Submit Model" tab
- 2. Fill in your model name, your name, and optional description
- 3. Enter values for the requested metrics
- 4. Click "Submit Model"
- 
- ## Configuration
- 
- You can customize this leaderboard by modifying the `models.json` file:
- 
- ```json
- {
-   "title": "TRAIL Performance Leaderboard",
-   "description": "This leaderboard tracks and compares model performance across multiple metrics. Submit your model results to see how they stack up!",
-   "metrics": ["accuracy", "f1_score", "precision", "recall"],
-   "main_metric": "accuracy"
- }
- ```
- 
- - `title`: The title of your leaderboard
- - `description`: A description that appears at the top
- - `metrics`: List of metrics to track
- - `main_metric`: Default metric for sorting
- 
- ## Technical Details
- 
- This leaderboard is built using:
- - Gradio for the UI components
- - A file-based database to store submissions
- - Pandas for data manipulation and display
+ - **Submit Your Answers**: Run your model on the TRAIL dataset and submit your results.
+ - **Leaderboard**: View how your submissions rank.

+ ## Instructions

+ 1. Please refer to our GitHub repository at https://github.com/patronus-ai/trail-benchmark for step-by-step instructions on how to run your model on the TRAIL dataset.
+ 2. Compress the resulting JSON outputs into a ZIP archive whose filename begins with `SWE_` or `GAIA_`, and submit it (see the sketch after this list).
+ 3. Once the evaluation is complete, we will upload the scores (this process will soon be automated).
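
As a rough illustration of step 2, the snippet below packages a set of JSON result files into an archive with the expected prefix. The `outputs/` directory and the archive name are placeholders for this sketch, not part of the official submission tooling; any archiver works, as the only stated requirements are the JSON contents and the `SWE_`/`GAIA_` filename prefix.

```bash
# Sketch only: "outputs/" and the archive name are placeholders --
# point this at wherever your TRAIL run wrote its JSON result files.
# Use the GAIA_ prefix instead for a GAIA-split submission.
zip SWE_my-model-submission.zip outputs/*.json
```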
 
+ ## Benchmarking on TRAIL

+ TRAIL (Trace Reasoning and Agentic Issue Localization) is a benchmark dataset of 148 annotated AI agent execution traces containing 841 errors across reasoning, execution, and planning categories. Created from real-world software engineering and information retrieval tasks, it challenges even state-of-the-art LLMs, with the best model achieving only 11% accuracy, which highlights the difficulty of trace debugging for complex agent workflows.
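
If you want to pull the traces locally before running your model, the dataset can be fetched from the Hugging Face Hub. The repo ID `PatronusAI/TRAIL` below is an assumption for this sketch; confirm the exact identifier and data format in the GitHub repository linked above.

```bash
# Assumption: the TRAIL traces are hosted on the Hub under PatronusAI/TRAIL;
# check the patronus-ai/trail-benchmark repo for the authoritative ID and layout.
huggingface-cli download PatronusAI/TRAIL --repo-type dataset --local-dir trail_data
```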
 
 
 
 
  ## License