ServiceNow/browsergym-leaderboard

new-results #10
by jaiswala - opened 22 days ago
base: refs/heads/main ← from: refs/pr/10

Files changed (30): +690 -0

results/GenericAgent-Claude-3.7-Sonnet/README.md +44 -0
results/GenericAgent-Claude-3.7-Sonnet/webarena.json +16 -0
results/GenericAgent-Claude-4-Sonnet/README.md +44 -0
results/GenericAgent-Claude-4-Sonnet/miniwob.json +17 -0
results/GenericAgent-Claude-4-Sonnet/workarena-l1.json +16 -0
results/GenericAgent-Claude-4-Sonnet/workarena-l2.json +16 -0
results/GenericAgent-GPT-4.1-Mini/README.md +44 -0
results/GenericAgent-GPT-4.1-Mini/webarena.json +16 -0
results/GenericAgent-GPT-5-mini/README.md +44 -0
results/GenericAgent-GPT-5-mini/miniwob.json +16 -0
results/GenericAgent-GPT-5-mini/workarena-l1.json +16 -0
results/GenericAgent-GPT-5-mini/workarena-l2.json +16 -0
results/GenericAgent-GPT-5-nano/README.md +44 -0
results/GenericAgent-GPT-5-nano/miniwob.json +16 -0
results/GenericAgent-GPT-5-nano/workarena-l1.json +16 -0
results/GenericAgent-GPT-5-nano/workarena-l2.json +16 -0
results/GenericAgent-GPT-5/README.md +44 -0
results/GenericAgent-GPT-5/miniwob.json +16 -0
results/GenericAgent-GPT-5/workarena-l1.json +16 -0
results/GenericAgent-GPT-5/workarena-l2.json +16 -0
results/GenericAgent-GPT-5/workarena-l3.json +16 -0
results/GenericAgent-GPT-oss-120b/README.md +44 -0
results/GenericAgent-GPT-oss-120b/miniwob.json +16 -0
results/GenericAgent-GPT-oss-120b/workarena-l1.json +16 -0
results/GenericAgent-GPT-oss-120b/workarena-l2.json +16 -0
results/GenericAgent-GPT-oss-20b/README.md +44 -0
results/GenericAgent-GPT-oss-20b/miniwob.json +16 -0
results/GenericAgent-GPT-oss-20b/workarena-l1.json +16 -0
results/GenericAgent-GPT-oss-20b/workarena-l2.json +16 -0
results/OrbyAgent-Claude-3.5-Sonnet/README.md +1 -0

results/GenericAgent-Claude-3.7-Sonnet/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-Claude-3.7-Sonnet
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses Claude-3.7-Sonnet (claude-3-7-sonnet-20250219) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-Claude-3.7-Sonnet/webarena.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-Claude-3.7-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WebArena",
+    "score": 44.6,
+    "std_err": 2.5,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
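The result files in this PR all carry the same twelve keys. A minimal sketch of a pre-submission check follows. The field list and types are inferred from the entries shown here, and the "Yes"/"No" and 0-100 constraints are assumptions rather than a documented leaderboard schema. `validate_entry` is a hypothetical helper, not part of BrowserGym or AgentLab.

```python
from datetime import datetime

# Field names and types as they appear in the result files of this PR.
REQUIRED_FIELDS = {
    "agent_name": str,
    "study_id": str,
    "benchmark": str,
    "score": (int, float),
    "std_err": (int, float),
    "benchmark_specific": str,
    "benchmark_tuned": str,
    "followed_evaluation_protocol": str,
    "reproducible": str,
    "comments": str,
    "original_or_reproduced": str,
    "date_time": str,
}
# Assumption: these four fields only ever hold "Yes" or "No".
YES_NO_FIELDS = (
    "benchmark_specific",
    "benchmark_tuned",
    "followed_evaluation_protocol",
    "reproducible",
)


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one result entry (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif isinstance(entry[name], bool) or not isinstance(entry[name], expected):
            problems.append(f"wrong type for {name}")
    for name in YES_NO_FIELDS:
        if name in entry and entry[name] not in ("Yes", "No"):
            problems.append(f"{name} should be 'Yes' or 'No'")
    score = entry.get("score")
    # Assumption: scores are percentages.
    if isinstance(score, (int, float)) and not 0 <= score <= 100:
        problems.append("score outside 0-100")
    if "date_time" in entry:
        try:
            datetime.strptime(entry["date_time"], "%Y-%m-%d %H:%M:%S")
        except (TypeError, ValueError):
            problems.append("date_time is not 'YYYY-MM-DD HH:MM:SS'")
    return problems


# The WebArena entry from this file, used as a sample input.
SAMPLE = {
    "agent_name": "GenericAgent-Claude-3.7-Sonnet",
    "study_id": "2025-08-07_21-09-16",
    "benchmark": "WebArena",
    "score": 44.6,
    "std_err": 2.5,
    "benchmark_specific": "No",
    "benchmark_tuned": "No",
    "followed_evaluation_protocol": "Yes",
    "reproducible": "Yes",
    "comments": "NA",
    "original_or_reproduced": "Original",
    "date_time": "2025-08-07 21:09:16",
}

print(validate_entry(SAMPLE))
```

Returning a list of problems instead of raising on the first one lets a submitter see every issue in a file at once.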

results/GenericAgent-Claude-4-Sonnet/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-Claude-4-Sonnet
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses claude-4-sonnet (claude-sonnet-4-20250514) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-Claude-4-Sonnet/miniwob.json ADDED

@@ -0,0 +1,17 @@

+[
+  {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 70.7,
+    "std_err": 1.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-Claude-4-Sonnet/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 63.3,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-Claude-4-Sonnet/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-Claude-4-Sonnet",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 40.4,
+    "std_err": 3.2,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-4.1-Mini/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-4.1-Mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-4.1-mini (gpt-4.1-mini-2025-04-14) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-GPT-4.1-Mini/webarena.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-4.1-Mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WebArena",
+    "score": 30.7,
+    "std_err": 2.4,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-mini/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5-mini
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-5-mini (gpt-5-mini-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-GPT-5-mini/miniwob.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 71,
+    "std_err": 1.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-mini/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 60.6,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-mini/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-mini",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 47.7,
+    "std_err": 3.3,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-nano/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5-nano
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-5-nano (gpt-5-nano-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-GPT-5-nano/miniwob.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 64.8,
+    "std_err": 1.9,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-nano/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 40.6,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5-nano/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5-nano",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 3.4,
+    "std_err": 1.2,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-5
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-5 (gpt-5-2025-08-07) as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```
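Each README in this PR opens with a `###` heading naming the agent and sits in a directory that also names it. A small sketch of a check that the two agree is given below. Whether the leaderboard actually requires them to match is an assumption, and `heading_matches_directory` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def heading_matches_directory(readme_path: str, readme_text: str) -> bool:
    """True if the first '### ' heading equals the README's parent directory name."""
    directory = PurePosixPath(readme_path).parent.name
    for line in readme_text.splitlines():
        if line.startswith("### "):
            return line[4:].strip() == directory
    return False  # no '### ' heading found


path = "results/GenericAgent-GPT-5/README.md"
print(heading_matches_directory(path, "### GenericAgent-GPT-5\nThis agent is ..."))
# A hypothetical drifted heading (underscore instead of hyphen) is rejected.
print(heading_matches_directory(path, "### GenericAgent-GPT_5\nThis agent is ..."))
```

The comparison is deliberately case-sensitive, so a heading that differs from the directory only in capitalization is also reported.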

results/GenericAgent-GPT-5/miniwob.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 71.5,
+    "std_err": 1.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 79.1,
+    "std_err": 2.2,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "No",
+    "reproducible": "Yes",
+    "comments": "Increased max_steps from 15 to 30",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]
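This entry sets `followed_evaluation_protocol` to "No" and uses `comments` to say what changed. A small sketch of a consistency check for that pairing is given below, assuming "NA" is the placeholder for "no comment" as in the other entries. `deviation_is_documented` is a hypothetical name, not a leaderboard API.

```python
def deviation_is_documented(entry: dict) -> bool:
    """True unless the entry deviates from the protocol without explaining why."""
    followed = entry.get("followed_evaluation_protocol") == "Yes"
    comment = (entry.get("comments") or "").strip()
    # Assumption: "NA" (or an empty string) means no explanation was given.
    has_explanation = comment.upper() not in ("", "NA", "N/A")
    return followed or has_explanation


entry = {
    "followed_evaluation_protocol": "No",
    "comments": "Increased max_steps from 15 to 30",
}
print(deviation_is_documented(entry))
```

An entry that reports "No" together with "NA" would fail this check, which is the case a reviewer would most want flagged.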

results/GenericAgent-GPT-5/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 69.4,
+    "std_err": 3.0,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-5/workarena-l3.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-5",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L3",
+    "score": 11.5,
+    "std_err": 2.1,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "No",
+    "reproducible": "Yes",
+    "comments": "Increased max_steps from 50 to 100",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-120b/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-oss-120b
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-oss-120b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-GPT-oss-120b/miniwob.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 66.4,
+    "std_err": 1.9,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-120b/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 50.9,
+    "std_err": 2.8,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-120b/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-120b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 11.5,
+    "std_err": 2.1,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-20b/README.md ADDED

@@ -0,0 +1,44 @@

+### GenericAgent-GPT-oss-20b
+This agent is [GenericAgent](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/generic_agent.py) from [AgentLab](https://github.com/ServiceNow/AgentLab).
+It uses gpt-oss-20b as a backend, with the following [flags](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/agents/generic_agent/tmlr_config.py):
+```python
+BASE_FLAGS = GenericPromptFlags(
+    obs=dp.ObsFlags(
+        use_html=False,
+        use_ax_tree=True,
+        use_focused_element=True,
+        use_error_logs=True,
+        use_history=True,
+        use_past_error_logs=False,
+        use_action_history=True,
+        use_think_history=True,
+        use_diff=False,
+        html_type="pruned_html",
+        use_screenshot=False,
+        use_som=False,
+        extract_visible_tag=True,
+        extract_clickable_tag=True,
+        extract_coords="False",
+        filter_visible_elements_only=False,
+    ),
+    action=dp.ActionFlags(
+        multi_actions=False,
+        action_set="bid",
+        long_description=False,
+        individual_examples=False,
+    ),
+    use_plan=False,
+    use_criticise=False,
+    use_thinking=True,
+    use_memory=False,
+    use_concrete_example=True,
+    use_abstract_example=True,
+    use_hints=True,
+    enable_chat=False,
+    max_prompt_tokens=40_000,
+    be_cautious=True,
+    extra_instructions=None,
+)
+```

results/GenericAgent-GPT-oss-20b/miniwob.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "MiniWoB",
+    "score": 64,
+    "std_err": 1.9,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-20b/workarena-l1.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L1",
+    "score": 38.5,
+    "std_err": 2.7,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/GenericAgent-GPT-oss-20b/workarena-l2.json ADDED

@@ -0,0 +1,16 @@

+[
+  {
+    "agent_name": "GenericAgent-GPT-oss-20b",
+    "study_id": "2025-08-07_21-09-16",
+    "benchmark": "WorkArena-L2",
+    "score": 2.6,
+    "std_err": 1.0,
+    "benchmark_specific": "No",
+    "benchmark_tuned": "No",
+    "followed_evaluation_protocol": "Yes",
+    "reproducible": "Yes",
+    "comments": "NA",
+    "original_or_reproduced": "Original",
+    "date_time": "2025-08-07 21:09:16"
+  }
+]

results/OrbyAgent-Claude-3.5-Sonnet/README.md CHANGED

@@ -5,3 +5,4 @@ This agent is developed by [Orby AI](https://www.orby.ai/).
 The agent does not use any benchmark-specific information in the prompts. For WebArena benchmark, we use the original evaluator and task definitions for fair comparison.
 
 It uses Claude-3.5-sonnet-20241022 as a backend, with both screenshot and HTML as inputs. More details can be found in our [research blog](https://www.orby.ai/resources/elevating-automation-orby-ais-generic-agent-framework-and-self-adaptive-interface-learning-technique).
+