meghsn committed
Commit 92c92ae · Parent(s): b667dc2

Readme details

README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: WebAgent Leaderboard
- emoji: 🐠
  colorFrom: purple
  colorTo: green
  sdk: docker
@@ -8,4 +8,121 @@ pinned: false
  license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: BrowserGym Leaderboard
+ emoji: 🏆
  colorFrom: purple
  colorTo: green
  sdk: docker

  license: mit
  ---

+ # BrowserGym Leaderboard
+
+ This leaderboard tracks the performance of various agents on web navigation tasks.
+
+ ## How to Submit Results for New Agents
+
+ ### 1. Create Results Directory
+ Create a new folder in the `results` directory with your agent's name:
+ ```bash
+ results/
+ └── your-agent-name/
+     ├── README.md
+     ├── webarena.json
+     ├── workarena-l1.json
+     ├── workarena-l2.json
+     ├── workarena-l3.json
+     └── miniwob.json
+ ```
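+
+ The layout can be created by hand, or with a short helper. A minimal Python sketch (the agent name is a placeholder):
+
+ ```python
+ from pathlib import Path
+
+ # Minimal sketch: create the expected results layout for a new agent.
+ agent_dir = Path("results") / "your-agent-name"  # placeholder; use your agent's name
+ agent_dir.mkdir(parents=True, exist_ok=True)
+ (agent_dir / "README.md").touch()
+ for benchmark in ["webarena", "workarena-l1", "workarena-l2", "workarena-l3", "miniwob"]:
+     (agent_dir / f"{benchmark}.json").touch()
+ ```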
+
+ ### 2. Add Agent Details
+
+ Create a `README.md` in your agent's folder with the following details:
+
+ #### Required Information
+ - **Model Name**: Base model used (e.g., GPT-4, Claude-2)
+ - **Model Architecture**: Architecture details and any modifications
+ - **Input/Output Format**: How inputs are processed and outputs are generated
+ - **Training Details**: Training configuration, if applicable
+   - Dataset used
+   - Number of training steps
+   - Hardware used
+   - Training time
+
+ #### Optional Information
+ - **Paper Link**: Link to a published paper/preprint, if available
+ - **Code Repository**: Link to a public code implementation
+ - **Additional Notes**: Any special configurations or requirements
+ - **License**: License information for your agent
+
+ Make sure to organize the information in clear sections using Markdown.
+
+ ### 3. Add Benchmark Results
+
+ Create a separate JSON file for each benchmark, following this format:
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Original"
+     }
+ ]
+ ```
+
+ Please add the results for each benchmark in a separate JSON file, named as follows:
+
+ - `webarena.json`
+ - `workarena-l1.json`
+ - `workarena-l2.json`
+ - `workarena-l3.json`
+ - `miniwob.json`
+
+ For a new submission, each file must contain a JSON array with a single object following the format above. The `benchmark` field in each file must exactly match the benchmark name (one of `WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json` (see the validation sketch below).
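+
+ As a sanity check before submitting, a minimal validation sketch (a hypothetical helper, not part of the leaderboard tooling) could look like this:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB"]
+ REQUIRED_FIELDS = {
+     "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
+     "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
+     "reproducible", "comments", "original_or_reproduced",
+ }
+
+ def validate_results_file(path: Path) -> None:
+     entries = json.loads(path.read_text())
+     assert isinstance(entries, list) and len(entries) >= 1, "expected a JSON array of objects"
+     for entry in entries:
+         missing = REQUIRED_FIELDS - entry.keys()
+         assert not missing, f"missing fields: {missing}"
+         assert entry["benchmark"] in BENCHMARKS, f"unknown benchmark: {entry['benchmark']}"
+         # The filename must be the lowercase benchmark name, e.g. WebArena -> webarena.json.
+         assert path.name == entry["benchmark"].lower() + ".json", "filename/benchmark mismatch"
+
+ validate_results_file(Path("results/your-agent-name/webarena.json"))
+ ```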
+
+ ### 4. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## How to Submit Reproducibility Results for Existing Agents
+
+ Open the results file for the agent and benchmark whose results you reproduced.
+
+ ### 1. Add Reproduced Results
+
+ Append the following entry to the JSON file. Ensure you set `original_or_reproduced` to `Reproduced`.
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Reproduced"
+     }
+ ]
+ ```
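+
+ A reproduced entry can also be appended programmatically. A minimal sketch (the path and field values are placeholders):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ path = Path("results/existing-agent/webarena.json")  # placeholder path
+ entries = json.loads(path.read_text())
+ entries.append({
+     "agent_name": "existing-agent",
+     "study_id": "unique-study-identifier-from-agentlab",
+     "date_time": "2024-01-01 12:00:00",
+     "benchmark": "WebArena",
+     "score": 0.0,
+     "std_err": 0.0,
+     "benchmark_specific": "Yes/No",
+     "benchmark_tuned": "Yes/No",
+     "followed_evaluation_protocol": "Yes/No",
+     "reproducible": "Yes/No",
+     "comments": "Additional details",
+     "original_or_reproduced": "Reproduced",  # must be "Reproduced" for reproductions
+ })
+ path.write_text(json.dumps(entries, indent=4))
+ ```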
+
+ ### 2. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## License
+
+ MIT
app.py CHANGED
@@ -16,8 +16,7 @@ import re
  import html
  from typing import Dict, Any

- # BENCHMARKS = ["WorkArena-L1", "WorkArena++-L2", "WorkArena++-L3", "MiniWoB", "WebArena"]
- BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena++-L2", "WorkArena++-L3", "MiniWoB",]

  def sanitize_agent_name(agent_name):
      # Only allow alphanumeric chars, hyphen, underscore
@@ -191,7 +190,6 @@ def main():
  st.title("🏆 BrowserGym Leaderboard")
  st.markdown("Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.")
  # content = create_yall()
- # tab1, tab2, tab3, tab4 = st.tabs(["🏆 WebAgent Leaderboard", "WorkArena++-L2 Leaderboard", "WorkArena++-L3 Leaderboard", "📝 About"])
  tabs = st.tabs(["🏆 Main Leaderboard",] + BENCHMARKS + ["📝 About"])
@@ -268,7 +266,124 @@

  with tabs[-1]:
      st.markdown('''
- ### Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.
  ''')
  for i, benchmark in enumerate(BENCHMARKS, start=1):
      with tabs[i]:
 
  import html
  from typing import Dict, Any

+ BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB",]

  def sanitize_agent_name(agent_name):
      # Only allow alphanumeric chars, hyphen, underscore

  st.title("🏆 BrowserGym Leaderboard")
  st.markdown("Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.")
  # content = create_yall()

  tabs = st.tabs(["🏆 Main Leaderboard",] + BENCHMARKS + ["📝 About"])

  with tabs[0]:

  with tabs[-1]:
      st.markdown('''
+ # BrowserGym Leaderboard
+
+ This leaderboard tracks the performance of various agents on web navigation tasks.
+
+ ## How to Submit Results for New Agents
+
+ ### 1. Create Results Directory
+ Create a new folder in the `results` directory with your agent's name:
+ ```bash
+ results/
+ └── your-agent-name/
+     ├── README.md
+     ├── webarena.json
+     ├── workarena-l1.json
+     ├── workarena-l2.json
+     ├── workarena-l3.json
+     └── miniwob.json
+ ```
+
+ ### 2. Add Agent Details
+
+ Create a `README.md` in your agent's folder with the following details:
+
+ #### Required Information
+ - **Model Name**: Base model used (e.g., GPT-4, Claude-2)
+ - **Model Architecture**: Architecture details and any modifications
+ - **Input/Output Format**: How inputs are processed and outputs are generated
+ - **Training Details**: Training configuration, if applicable
+   - Dataset used
+   - Number of training steps
+   - Hardware used
+   - Training time
+
+ #### Optional Information
+ - **Paper Link**: Link to a published paper/preprint, if available
+ - **Code Repository**: Link to a public code implementation
+ - **Additional Notes**: Any special configurations or requirements
+ - **License**: License information for your agent
+
+ Make sure to organize the information in clear sections using Markdown.
+
+ ### 3. Add Benchmark Results
+
+ Create a separate JSON file for each benchmark, following this format:
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Original"
+     }
+ ]
+ ```
+
+ Please add the results for each benchmark in a separate JSON file, named as follows:
+
+ - `webarena.json`
+ - `workarena-l1.json`
+ - `workarena-l2.json`
+ - `workarena-l3.json`
+ - `miniwob.json`
+
+ For a new submission, each file must contain a JSON array with a single object following the format above. The `benchmark` field in each file must exactly match the benchmark name (one of `WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json`.
+
+ ### 4. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## How to Submit Reproducibility Results for Existing Agents
+
+ Open the results file for the agent and benchmark whose results you reproduced.
+
+ ### 1. Add Reproduced Results
+
+ Append the following entry to the JSON file. Ensure you set `original_or_reproduced` to `Reproduced`.
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Reproduced"
+     }
+ ]
+ ```
+
+ ### 2. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## License
+
+ MIT
  ''')
  for i, benchmark in enumerate(BENCHMARKS, start=1):
      with tabs[i]:
results/Bgym-GPT-3.5/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-3.5",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-3.5/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-3.5",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o-V",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 3.8,
  "std_err": 0.6,
  "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o-V",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 3.0,
  "std_err": 0.6,
  "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Llama-3-70b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Llama-3-70b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Mixtral-8x22b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Mixtral-8x22b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",