Readme details
Browse files

- README.md +120 -3
- app.py +119 -4
- results/Bgym-GPT-3.5/{workarena++-l2.json → workarena-l2.json} +1 -1
- results/Bgym-GPT-3.5/{workarena++-l3.json → workarena-l3.json} +1 -1
- results/Bgym-GPT-4o-V/{workarena++-l2.json → workarena-l2.json} +1 -1
- results/Bgym-GPT-4o-V/{workarena++-l3.json → workarena-l3.json} +1 -1
- results/Bgym-GPT-4o/{workarena++-l2.json → workarena-l2.json} +1 -1
- results/Bgym-GPT-4o/{workarena++-l3.json → workarena-l3.json} +1 -1
- results/Bgym-Llama-3-70b/{workarena++-l2.json → workarena-l2.json} +1 -1
- results/Bgym-Llama-3-70b/{workarena++-l3.json → workarena-l3.json} +1 -1
- results/Bgym-Mixtral-8x22b/{workarena++-l2.json → workarena-l2.json} +1 -1
- results/Bgym-Mixtral-8x22b/{workarena++-l3.json → workarena-l3.json} +1 -1
README.md
CHANGED

Removed lines:

@@ -1,6 +1,6 @@
 ---
-title:
-emoji:
 colorFrom: purple
 colorTo: green
 sdk: docker
@@ -8,4 +8,121 @@ pinned: false
 license: mit
 ---

-
New file content:

---
title: BrowserGym Leaderboard
emoji: π
colorFrom: purple
colorTo: green
sdk: docker
pinned: false
license: mit
---

# BrowserGym Leaderboard

This leaderboard tracks the performance of various agents on web navigation tasks.

## How to Submit Results for New Agents

### 1. Create Results Directory

Create a new folder in the `results` directory with your agent's name:

```bash
results/
└── your-agent-name/
    ├── README.md
    ├── webarena.json
    ├── workarena-l1.json
    ├── workarena-l2.json
    ├── workarena-l3.json
    └── miniwob.json
```
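The skeleton above can be scaffolded with a few lines of Python. This is only a sketch: `scaffold_results` and the file list below are illustrative helpers, not part of the repository.

```python
from pathlib import Path

# One README plus one results file per benchmark, per the layout above.
RESULT_FILES = ["README.md", "webarena.json", "workarena-l1.json",
                "workarena-l2.json", "workarena-l3.json", "miniwob.json"]

def scaffold_results(agent_name: str, root: str = "results") -> Path:
    """Create the agent's results folder with empty placeholder files."""
    agent_dir = Path(root) / agent_name
    agent_dir.mkdir(parents=True, exist_ok=True)
    for name in RESULT_FILES:
        (agent_dir / name).touch()  # empty file; fill in before submitting
    return agent_dir

# scaffold_results("your-agent-name")
```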

### 2. Add Agent Details

Create a `README.md` in your agent's folder with the following details:

#### Required Information

- **Model Name**: Base model used (e.g., GPT-4, Claude-2)
- **Model Architecture**: Architecture details and any modifications
- **Input/Output Format**: How inputs are processed and outputs are generated
- **Training Details**: Training configuration, if applicable
  - Dataset used
  - Number of training steps
  - Hardware used
  - Training time

#### Optional Information

- **Paper Link**: Link to a published paper/preprint, if available
- **Code Repository**: Link to a public code implementation
- **Additional Notes**: Any special configurations or requirements
- **License**: License information for your agent

Organize the information into clear sections using Markdown.

### 3. Add Benchmark Results

Create a separate JSON file for each benchmark, following this format:

```json
[
    {
        "agent_name": "your-agent-name",
        "study_id": "unique-study-identifier-from-agentlab",
        "date_time": "YYYY-MM-DD HH:MM:SS",
        "benchmark": "WebArena",
        "score": 0.0,
        "std_err": 0.0,
        "benchmark_specific": "Yes/No",
        "benchmark_tuned": "Yes/No",
        "followed_evaluation_protocol": "Yes/No",
        "reproducible": "Yes/No",
        "comments": "Additional details",
        "original_or_reproduced": "Original"
    }
]
```

Add each benchmark's results in a separate JSON file, named as follows:

- `webarena.json`
- `workarena-l1.json`
- `workarena-l2.json`
- `workarena-l3.json`
- `miniwob.json`

Each file must contain a JSON array with a single object in the format above. The `benchmark` field must match the benchmark name exactly (`WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json`.
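A submission can be sanity-checked against these rules before opening a PR. A minimal sketch, with the filename-to-benchmark mapping taken from the list above; `validate_file` and its error messages are illustrative, not part of the leaderboard code:

```python
import json
from pathlib import Path

# Filename -> required value of the "benchmark" field.
EXPECTED = {
    "webarena.json": "WebArena",
    "workarena-l1.json": "WorkArena-L1",
    "workarena-l2.json": "WorkArena-L2",
    "workarena-l3.json": "WorkArena-L3",
    "miniwob.json": "MiniWoB",
}
REQUIRED_KEYS = {
    "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
    "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
    "reproducible", "comments", "original_or_reproduced",
}

def validate_file(path: Path) -> list[str]:
    """Return a list of problems found in one results file (empty if OK)."""
    errors = []
    entries = json.loads(path.read_text())
    if not isinstance(entries, list) or not entries:
        return [f"{path.name}: must be a non-empty JSON array"]
    expected = EXPECTED.get(path.name)
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"{path.name}: missing keys {sorted(missing)}")
        if expected and entry.get("benchmark") != expected:
            errors.append(f"{path.name}: benchmark field must be {expected!r}")
    return errors
```

The loop checks every entry in the array, since a file may later hold both an original and reproduced entries.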

### 4. Submit PR

1. Fork the repository.
2. Add your results following the structure above, and include more details about your agent and the submission in the PR comments.
3. Create a pull request to the main branch.

## How to Submit Reproducibility Results for Existing Agents

Open the results file for the agent and benchmark whose results you reproduced.

### 1. Add reproduced results

Append the following entry to the JSON array. Ensure you set `original_or_reproduced` to `Reproduced`.

```json
[
    {
        "agent_name": "your-agent-name",
        "study_id": "unique-study-identifier-from-agentlab",
        "date_time": "YYYY-MM-DD HH:MM:SS",
        "benchmark": "WebArena",
        "score": 0.0,
        "std_err": 0.0,
        "benchmark_specific": "Yes/No",
        "benchmark_tuned": "Yes/No",
        "followed_evaluation_protocol": "Yes/No",
        "reproducible": "Yes/No",
        "comments": "Additional details",
        "original_or_reproduced": "Reproduced"
    }
]
```
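Appending the entry programmatically preserves the existing array and guarantees the flag is set. A sketch; `append_reproduced` is a hypothetical helper, not part of the repository:

```python
import json
from pathlib import Path

def append_reproduced(path: Path, entry: dict) -> None:
    """Append a reproduced-results entry to an existing results file."""
    entries = json.loads(path.read_text())
    # Enforce the flag regardless of what the caller passed in.
    entries.append({**entry, "original_or_reproduced": "Reproduced"})
    path.write_text(json.dumps(entries, indent=4) + "\n")
```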

### 2. Submit PR

1. Fork the repository.
2. Add your results following the structure above, and include more details about your agent and the submission in the PR comments.
3. Create a pull request to the main branch.

## License

MIT
app.py
CHANGED
@@ -16,8 +16,7 @@ import re
 import html
 from typing import Dict, Any
 
-
-BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena++-L2", "WorkArena++-L3", "MiniWoB",]
+BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB",]
 
 def sanitize_agent_name(agent_name):
     # Only allow alphanumeric chars, hyphen, underscore
@@ -191,7 +190,6 @@ def main():
     st.title("π BrowserGym Leaderboard")
     st.markdown("Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.")
     # content = create_yall()
-    # tab1, tab2, tab3, tab4 = st.tabs(["π WebAgent Leaderboard", "WorkArena++-L2 Leaderboard", "WorkArena++-L3 Leaderboard", "π About"])
     tabs = st.tabs(["π Main Leaderboard",] + BENCHMARKS + ["π About"])
 
     with tabs[0]:
@@ -268,7 +266,124 @@
 
     with tabs[-1]:
         st.markdown('''
-
+# BrowserGym Leaderboard
+
+This leaderboard tracks the performance of various agents on web navigation tasks.
+
+## How to Submit Results for New Agents
+
+### 1. Create Results Directory
+Create a new folder in the `results` directory with your agent's name:
+```bash
+results/
+└── your-agent-name/
+    ├── README.md
+    ├── webarena.json
+    ├── workarena-l1.json
+    ├── workarena-l2.json
+    ├── workarena-l3.json
+    └── miniwob.json
+```
+
+### 2. Add Agent Details
+
+Create a `README.md` in your agent's folder with the following details:
+
+#### Required Information
+- **Model Name**: Base model used (e.g., GPT-4, Claude-2)
+- **Model Architecture**: Architecture details and any modifications
+- **Input/Output Format**: How inputs are processed and outputs are generated
+- **Training Details**: Training configuration, if applicable
+  - Dataset used
+  - Number of training steps
+  - Hardware used
+  - Training time
+
+#### Optional Information
+- **Paper Link**: Link to a published paper/preprint, if available
+- **Code Repository**: Link to a public code implementation
+- **Additional Notes**: Any special configurations or requirements
+- **License**: License information for your agent
+
+Organize the information into clear sections using Markdown.
+
+### 3. Add Benchmark Results
+
+Create a separate JSON file for each benchmark, following this format:
+
+```json
+[
+    {
+        "agent_name": "your-agent-name",
+        "study_id": "unique-study-identifier-from-agentlab",
+        "date_time": "YYYY-MM-DD HH:MM:SS",
+        "benchmark": "WebArena",
+        "score": 0.0,
+        "std_err": 0.0,
+        "benchmark_specific": "Yes/No",
+        "benchmark_tuned": "Yes/No",
+        "followed_evaluation_protocol": "Yes/No",
+        "reproducible": "Yes/No",
+        "comments": "Additional details",
+        "original_or_reproduced": "Original"
+    }
+]
+```
+
+Add each benchmark's results in a separate JSON file, named as follows:
+
+- `webarena.json`
+- `workarena-l1.json`
+- `workarena-l2.json`
+- `workarena-l3.json`
+- `miniwob.json`
+
+Each file must contain a JSON array with a single object in the format above. The `benchmark` field must match the benchmark name exactly (`WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json`.
+
+### 4. Submit PR
+
+1. Fork the repository.
+2. Add your results following the structure above, and include more details about your agent and the submission in the PR comments.
+3. Create a pull request to the main branch.
+
+## How to Submit Reproducibility Results for Existing Agents
+
+Open the results file for the agent and benchmark whose results you reproduced.
+
+### 1. Add reproduced results
+
+Append the following entry to the JSON array. Ensure you set `original_or_reproduced` to `Reproduced`.
+
+```json
+[
+    {
+        "agent_name": "your-agent-name",
+        "study_id": "unique-study-identifier-from-agentlab",
+        "date_time": "YYYY-MM-DD HH:MM:SS",
+        "benchmark": "WebArena",
+        "score": 0.0,
+        "std_err": 0.0,
+        "benchmark_specific": "Yes/No",
+        "benchmark_tuned": "Yes/No",
+        "followed_evaluation_protocol": "Yes/No",
+        "reproducible": "Yes/No",
+        "comments": "Additional details",
+        "original_or_reproduced": "Reproduced"
+    }
+]
+```
+
+### 2. Submit PR
+
+1. Fork the repository.
+2. Add your results following the structure above, and include more details about your agent and the submission in the PR comments.
+3. Create a pull request to the main branch.
+
+## License
+
+MIT
         ''')
     for i, benchmark in enumerate(BENCHMARKS, start=1):
         with tabs[i]:
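The tab wiring above relies on index arithmetic: `tabs[0]` is the main leaderboard, `tabs[1]` through `tabs[len(BENCHMARKS)]` line up with `BENCHMARKS`, and `tabs[-1]` is the About tab. A plain-Python sketch of that mapping (tab labels simplified, Streamlit omitted):

```python
BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB"]

# Same list construction as st.tabs(["Main Leaderboard"] + BENCHMARKS + ["About"]).
labels = ["Main Leaderboard"] + BENCHMARKS + ["About"]

# enumerate(..., start=1) skips the main tab and stops before the About tab,
# so each benchmark tab shows the matching benchmark.
for i, benchmark in enumerate(BENCHMARKS, start=1):
    assert labels[i] == benchmark
```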
results/Bgym-GPT-3.5/{workarena++-l2.json → workarena-l2.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-3.5",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L2",
+"benchmark": "WorkArena-L2",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-GPT-3.5/{workarena++-l3.json → workarena-l3.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-3.5",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L3",
+"benchmark": "WorkArena-L3",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l2.json → workarena-l2.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-4o-V",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L2",
+"benchmark": "WorkArena-L2",
 "score": 3.8,
 "std_err": 0.6,
 "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l3.json → workarena-l3.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-4o-V",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L3",
+"benchmark": "WorkArena-L3",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l2.json → workarena-l2.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-4o",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L2",
+"benchmark": "WorkArena-L2",
 "score": 3.0,
 "std_err": 0.6,
 "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l3.json → workarena-l3.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-GPT-4o",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L3",
+"benchmark": "WorkArena-L3",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l2.json → workarena-l2.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-Llama-3-70b",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L2",
+"benchmark": "WorkArena-L2",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l3.json → workarena-l3.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-Llama-3-70b",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L3",
+"benchmark": "WorkArena-L3",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l2.json → workarena-l2.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-Mixtral-8x22b",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L2",
+"benchmark": "WorkArena-L2",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l3.json → workarena-l3.json}
RENAMED
@@ -3,7 +3,7 @@
 "agent_name": "Bgym-Mixtral-8x22b",
 "study_id": "study_id",
 "date_time": "2021-01-01 12:00:00",
-"benchmark": "WorkArena++-L3",
+"benchmark": "WorkArena-L3",
 "score": 0.0,
 "std_err": 0.0,
 "benchmark_specific": "No",