Upload 5 files
- agent_submission.md +12 -0
- app.py +6 -2
- benchmark_submission.md +5 -0
- hal.ico +0 -0
- hal.png +0 -0
agent_submission.md
ADDED
@@ -0,0 +1,12 @@
+To submit **a new agent** for evaluation, developers should only need to:
+
+1. Adhere to the standardized I/O format: ensure the agent run file complies with the benchmark-specific I/O format. Depending on HAL's implementation, this could involve:
+   * Providing a specific entry point to the agent (e.g., a Python script or function)
+   * Correctly handling instructions and the submission process. For example, in METR's Vivaria this can mean supplying a *main.py* file as the entry point and managing the *instructions.txt* and *submission.txt* files (see the first sketch below this diff).
+
+2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters (see the second sketch below this diff).
+   * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
+   * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
+   * However, some pieces we are interested in, such as the latency and parameters of LLM calls, are missing there; Weave provides a minimum-effort solution.
+
+3. Use our CLI to run evaluations and upload the results. The same CLI can also be used to rerun existing agent-benchmark pairs from the leaderboard.
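The I/O convention in step 1 can be made concrete with a short sketch. The following *main.py* reads the task from *instructions.txt* and writes its answer to *submission.txt*, matching the Vivaria-style file layout mentioned above. The `solve_task` helper, the OpenAI client, and the model name are placeholders for illustration and are not part of HAL's actual interface.

```python
# main.py -- minimal agent entry point (illustrative sketch, not HAL's actual interface)
from openai import OpenAI


def solve_task(instructions: str) -> str:
    """Placeholder agent logic: a single LLM call; a real agent would do far more."""
    client = OpenAI()  # assumes OPENAI_API_KEY is available in the task environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": instructions}],
    )
    return response.choices[0].message.content


def main() -> None:
    # Read the task instructions provided by the harness ...
    with open("instructions.txt") as f:
        instructions = f.read()
    # ... and write the agent's final answer where the harness expects it.
    with open("submission.txt", "w") as f:
        f.write(solve_task(instructions))


if __name__ == "__main__":
    main()
```

A real agent would replace `solve_task` with its full tool-use loop; the only contract that matters here is where the instructions come from and where the submission goes.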
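For step 2, the Weave-based logging amounts to very little code: initialize a project once and decorate the function that talks to the LLM. The sketch below reuses the OpenAI client from the previous example; the project name `hal-agent-run` is arbitrary, and the exact details of what Weave captures should be checked against its documentation.

```python
import weave
from openai import OpenAI

weave.init("hal-agent-run")  # arbitrary project name for this sketch

client = OpenAI()


@weave.op()  # records inputs, outputs, and latency of every call to this function
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(call_llm("Say hello."))
```

Once `weave.init` has run, Weave's provider integrations are also expected to trace the underlying OpenAI API requests automatically, which is where token counts and cost come from.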
app.py
CHANGED
@@ -247,7 +247,7 @@ class MyTheme(Soft):

my_theme = MyTheme()

-with gr.Blocks(theme=my_theme, css='css.css') as demo:
+with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderboard") as demo:
    # gr.Markdown((Path(__file__).parent / "header.md").read_text(), elem_classes=["text-large"])
    gr.HTML("""
    <style>
@@ -514,6 +514,7 @@ with gr.Blocks(theme=my_theme, css='css.css') as demo:
    """)
    with gr.Group(elem_classes=["grouped-section"]):
        gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
+        gr.Markdown('The agent monitor provides an overview of the recurring errors an agent makes as well as a summary of the steps the agent takes to solve a task. It currently consists of two main components:')

        gr.HTML('<div style="height: 10px;"></div>')
        gr.Markdown("## Failure report for each agent")
@@ -1533,10 +1534,13 @@ with gr.Blocks(theme=my_theme, css='css.css') as demo:

    gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>
            <p>Below we provide a guide on how to add an agent to the leaderboard:</p>""")
+    gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
    gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>
            <p>Below we provide a guide on how to add a benchmark to the leaderboard:</p>""")
+    gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
    gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>
            <p>Below we provide a guide on how to reproduce evaluations:</p>""")
+    gr.Markdown("""Coming soon...""")



@@ -1560,7 +1564,7 @@ async def main():
    # scheduler.add_job(check_and_process_uploads, "interval", hours=1)
    scheduler.start()

-    await demo.launch()
+    await demo.launch(favicon_path="hal.png")

if __name__ == "__main__":
    weave.init(f'leaderboard_{datetime.now().strftime("%Y%m%d%H%M%S")}')
benchmark_submission.md
ADDED
@@ -0,0 +1,5 @@
+To submit **a new benchmark** to the library:
+
+1. Implement the benchmark in a standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each task as well as the task environment that is provided inside the container the agent is run in (see the sketch below this diff).
+
+2. We will encourage developers to support running their tasks on separate VMs and to specify the exact hardware requirements for each task in the task environment.
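To make step 1 more tangible, here is a rough sketch of what a task family can look like under the METR Task Standard linked above. The class and method names follow our reading of the standard and should be verified against its repository; the file name, the arithmetic tasks, and their answers are purely illustrative.

```python
# my_benchmark.py -- illustrative TaskFamily sketch loosely following the METR Task Standard
from typing import TypedDict


class Task(TypedDict):
    problem: str
    answer: str


class TaskFamily:
    # Version of the Task Standard this family targets; check the repo for the current one.
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # One entry per task; the keys act as task identifiers.
        return {
            "add_small": {"problem": "What is 17 + 25?", "answer": "42"},
            "add_large": {"problem": "What is 123 + 877?", "answer": "1000"},
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        # The exact instructions shown to the agent inside its container.
        return f"{t['problem']} Write only the final number to submission.txt."

    @staticmethod
    def score(t: Task, submission: str) -> float | None:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission.strip() == t["answer"] else 0.0
```

The standard also defines optional hooks (such as an `install` step) for setting up the task environment; we omit them here for brevity.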
hal.ico
ADDED
hal.png
ADDED