meghsn committed
Commit 92c92ae · Parent(s): b667dc2

Readme details

README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: WebAgent Leaderboard
- emoji: 🐠
  colorFrom: purple
  colorTo: green
  sdk: docker
@@ -8,4 +8,121 @@ pinned: false
  license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: BrowserGym Leaderboard
+ emoji: 🏆
  colorFrom: purple
  colorTo: green
  sdk: docker

  license: mit
  ---

+ # BrowserGym Leaderboard
+
+ This leaderboard tracks the performance of various agents on web navigation tasks.
+
+ ## How to Submit Results for New Agents
+
+ ### 1. Create Results Directory
+ Create a new folder in the `results` directory with your agent's name:
+ ```bash
+ results/
+ └── your-agent-name/
+     ├── README.md
+     ├── webarena.json
+     ├── workarena-l1.json
+     ├── workarena-l2.json
+     ├── workarena-l3.json
+     └── miniwob.json
+ ```
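+
+ The layout can be created by hand, or with a short helper. A minimal Python sketch (the agent name is a placeholder):
+
+ ```python
+ from pathlib import Path
+
+ # Minimal sketch: create the expected results layout for a new agent.
+ agent_dir = Path("results") / "your-agent-name"  # placeholder; use your agent's name
+ agent_dir.mkdir(parents=True, exist_ok=True)
+ (agent_dir / "README.md").touch()
+ for benchmark in ["webarena", "workarena-l1", "workarena-l2", "workarena-l3", "miniwob"]:
+     (agent_dir / f"{benchmark}.json").touch()
+ ```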
+
+ ### 2. Add Agent Details
+
+ Create a `README.md` in your agent's folder with the following details:
+
+ #### Required Information
+ - **Model Name**: Base model used (e.g., GPT-4, Claude-2)
+ - **Model Architecture**: Architecture details and any modifications
+ - **Input/Output Format**: How inputs are processed and outputs are generated
+ - **Training Details**: Training configuration, if applicable
+   - Dataset used
+   - Number of training steps
+   - Hardware used
+   - Training time
+
+ #### Optional Information
+ - **Paper Link**: Link to a published paper/preprint, if available
+ - **Code Repository**: Link to a public code implementation
+ - **Additional Notes**: Any special configurations or requirements
+ - **License**: License information for your agent
+
+ Make sure to organize the information in clear sections using Markdown.
+
+ ### 3. Add Benchmark Results
+
+ Create a separate JSON file for each benchmark, following this format:
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Original"
+     }
+ ]
+ ```
+
+ Please add the results for each benchmark in a separate JSON file, named as follows:
+
+ - `webarena.json`
+ - `workarena-l1.json`
+ - `workarena-l2.json`
+ - `workarena-l3.json`
+ - `miniwob.json`
+
+ For a new submission, each file must contain a JSON array with a single object following the format above. The `benchmark` field in each file must exactly match the benchmark name (one of `WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json` (see the validation sketch below).
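+
+ As a sanity check before submitting, a minimal validation sketch (a hypothetical helper, not part of the leaderboard tooling) could look like this:
+
+ ```python
+ import json
+ from pathlib import Path
+
+ BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB"]
+ REQUIRED_FIELDS = {
+     "agent_name", "study_id", "date_time", "benchmark", "score", "std_err",
+     "benchmark_specific", "benchmark_tuned", "followed_evaluation_protocol",
+     "reproducible", "comments", "original_or_reproduced",
+ }
+
+ def validate_results_file(path: Path) -> None:
+     entries = json.loads(path.read_text())
+     assert isinstance(entries, list) and len(entries) >= 1, "expected a JSON array of objects"
+     for entry in entries:
+         missing = REQUIRED_FIELDS - entry.keys()
+         assert not missing, f"missing fields: {missing}"
+         assert entry["benchmark"] in BENCHMARKS, f"unknown benchmark: {entry['benchmark']}"
+         # The filename must be the lowercase benchmark name, e.g. WebArena -> webarena.json.
+         assert path.name == entry["benchmark"].lower() + ".json", "filename/benchmark mismatch"
+
+ validate_results_file(Path("results/your-agent-name/webarena.json"))
+ ```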
+
+ ### 4. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## How to Submit Reproducibility Results for Existing Agents
+
+ Open the results file for the agent and benchmark whose results you reproduced.
+
+ ### 1. Add Reproduced Results
+
+ Append the following entry to the JSON file. Ensure you set `original_or_reproduced` to `Reproduced`.
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Reproduced"
+     }
+ ]
+ ```
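+
+ A reproduced entry can also be appended programmatically. A minimal sketch (the path and field values are placeholders):
+
+ ```python
+ import json
+ from pathlib import Path
+
+ path = Path("results/existing-agent/webarena.json")  # placeholder path
+ entries = json.loads(path.read_text())
+ entries.append({
+     "agent_name": "existing-agent",
+     "study_id": "unique-study-identifier-from-agentlab",
+     "date_time": "2024-01-01 12:00:00",
+     "benchmark": "WebArena",
+     "score": 0.0,
+     "std_err": 0.0,
+     "benchmark_specific": "Yes/No",
+     "benchmark_tuned": "Yes/No",
+     "followed_evaluation_protocol": "Yes/No",
+     "reproducible": "Yes/No",
+     "comments": "Additional details",
+     "original_or_reproduced": "Reproduced",  # must be "Reproduced" for reproductions
+ })
+ path.write_text(json.dumps(entries, indent=4))
+ ```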
+
+ ### 2. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## License
+
+ MIT
app.py CHANGED
@@ -16,8 +16,7 @@ import re
  import html
  from typing import Dict, Any

- # BENCHMARKS = ["WorkArena-L1", "WorkArena++-L2", "WorkArena++-L3", "MiniWoB", "WebArena"]
- BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena++-L2", "WorkArena++-L3", "MiniWoB",]

  def sanitize_agent_name(agent_name):
      # Only allow alphanumeric chars, hyphen, underscore
@@ -191,7 +190,6 @@ def main():
  st.title("🏆 BrowserGym Leaderboard")
  st.markdown("Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.")
  # content = create_yall()
- # tab1, tab2, tab3, tab4 = st.tabs(["🏆 WebAgent Leaderboard", "WorkArena++-L2 Leaderboard", "WorkArena++-L3 Leaderboard", "📝 About"])
  tabs = st.tabs(["🏆 Main Leaderboard",] + BENCHMARKS + ["📝 About"])
@@ -268,7 +266,124 @@

  with tabs[-1]:
      st.markdown('''
- ### Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.
  ''')
  for i, benchmark in enumerate(BENCHMARKS, start=1):
      with tabs[i]:
 
  import html
  from typing import Dict, Any

+ BENCHMARKS = ["WebArena", "WorkArena-L1", "WorkArena-L2", "WorkArena-L3", "MiniWoB",]

  def sanitize_agent_name(agent_name):
      # Only allow alphanumeric chars, hyphen, underscore

  st.title("🏆 BrowserGym Leaderboard")
  st.markdown("Leaderboard to evaluate LLMs, VLMs, and agents on web navigation tasks.")
  # content = create_yall()

  tabs = st.tabs(["🏆 Main Leaderboard",] + BENCHMARKS + ["📝 About"])

  with tabs[0]:

  with tabs[-1]:
      st.markdown('''
+ # BrowserGym Leaderboard
+
+ This leaderboard tracks the performance of various agents on web navigation tasks.
+
+ ## How to Submit Results for New Agents
+
+ ### 1. Create Results Directory
+ Create a new folder in the `results` directory with your agent's name:
+ ```bash
+ results/
+ └── your-agent-name/
+     ├── README.md
+     ├── webarena.json
+     ├── workarena-l1.json
+     ├── workarena-l2.json
+     ├── workarena-l3.json
+     └── miniwob.json
+ ```
+
+ ### 2. Add Agent Details
+
+ Create a `README.md` in your agent's folder with the following details:
+
+ #### Required Information
+ - **Model Name**: Base model used (e.g., GPT-4, Claude-2)
+ - **Model Architecture**: Architecture details and any modifications
+ - **Input/Output Format**: How inputs are processed and outputs are generated
+ - **Training Details**: Training configuration, if applicable
+   - Dataset used
+   - Number of training steps
+   - Hardware used
+   - Training time
+
+ #### Optional Information
+ - **Paper Link**: Link to a published paper/preprint, if available
+ - **Code Repository**: Link to a public code implementation
+ - **Additional Notes**: Any special configurations or requirements
+ - **License**: License information for your agent
+
+ Make sure to organize the information in clear sections using Markdown.
+
+ ### 3. Add Benchmark Results
+
+ Create a separate JSON file for each benchmark, following this format:
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Original"
+     }
+ ]
+ ```
+
+ Please add the results for each benchmark in a separate JSON file, named as follows:
+
+ - `webarena.json`
+ - `workarena-l1.json`
+ - `workarena-l2.json`
+ - `workarena-l3.json`
+ - `miniwob.json`
+
+ For a new submission, each file must contain a JSON array with a single object following the format above. The `benchmark` field in each file must exactly match the benchmark name (one of `WebArena`, `WorkArena-L1`, `WorkArena-L2`, `WorkArena-L3`, `MiniWoB`), and the filename must be the lowercase benchmark name followed by `.json`.
+
+ ### 4. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## How to Submit Reproducibility Results for Existing Agents
+
+ Open the results file for the agent and benchmark whose results you reproduced.
+
+ ### 1. Add Reproduced Results
+
+ Append the following entry to the JSON file. Ensure you set `original_or_reproduced` to `Reproduced`.
+
+ ```json
+ [
+     {
+         "agent_name": "your-agent-name",
+         "study_id": "unique-study-identifier-from-agentlab",
+         "date_time": "YYYY-MM-DD HH:MM:SS",
+         "benchmark": "WebArena",
+         "score": 0.0,
+         "std_err": 0.0,
+         "benchmark_specific": "Yes/No",
+         "benchmark_tuned": "Yes/No",
+         "followed_evaluation_protocol": "Yes/No",
+         "reproducible": "Yes/No",
+         "comments": "Additional details",
+         "original_or_reproduced": "Reproduced"
+     }
+ ]
+ ```
+
+ ### 2. Submit PR
+
+ 1. Fork the repository
+ 2. Add your results following the structure above, and describe your agent and the submission in the PR comments
+ 3. Create a pull request to the main branch
+
+ ## License
+
+ MIT
  ''')
  for i, benchmark in enumerate(BENCHMARKS, start=1):
      with tabs[i]:
results/Bgym-GPT-3.5/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-3.5",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-3.5/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-3.5",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o-V",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 3.8,
  "std_err": 0.6,
  "benchmark_specific": "No",
results/Bgym-GPT-4o-V/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o-V",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 3.0,
  "std_err": 0.6,
  "benchmark_specific": "No",
results/Bgym-GPT-4o/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-GPT-4o",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Llama-3-70b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Llama-3-70b/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Llama-3-70b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l2.json → workarena-l2.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Mixtral-8x22b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L2",
+ "benchmark": "WorkArena-L2",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",
results/Bgym-Mixtral-8x22b/{workarena++-l3.json → workarena-l3.json} RENAMED
@@ -3,7 +3,7 @@
  "agent_name": "Bgym-Mixtral-8x22b",
  "study_id": "study_id",
  "date_time": "2021-01-01 12:00:00",
- "benchmark": "WorkArena++-L3",
+ "benchmark": "WorkArena-L3",
  "score": 0.0,
  "std_err": 0.0,
  "benchmark_specific": "No",