benediktstroebl committed
Commit 1d41341 · verified · Parent: b69a733

Upload 5 files

Files changed (5)
  1. agent_submission.md +12 -0
  2. app.py +6 -2
  3. benchmark_submission.md +5 -0
  4. hal.ico +0 -0
  5. hal.png +0 -0
agent_submission.md ADDED
@@ -0,0 +1,12 @@
+ To submit **a new agent** for evaluation, developers should only need to:
+
+ 1. Adhere to the standardized I/O format: Ensure the agent run file complies with the benchmark-specific I/O format. Depending on HAL's implementation, this could involve:
+    * Providing a specific entry point to the agent (e.g., a Python script or function)
+    * Correctly handling instructions and the submission process. For example, in METR's Vivaria, this can mean supplying a *main.py* file as the entry point and managing *instructions.txt* and *submission.txt* files.
+
+ 2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters.
+    * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
+    * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
+    * However, some pieces we are interested in, such as the latency and parameters of individual LLM calls, are missing there; Weave provides a minimum-effort solution (see the sketch below).
+
+ 3. Use our CLI to run evaluations and upload the results. The same CLI can also be used to rerun existing agent-benchmark pairs from the leaderboard.
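
As a rough illustration of step 2 in the guide above, the sketch below shows one way to wrap an agent's LLM calls so that Weave records them. This is a minimal sketch, not HAL's actual implementation: it assumes the OpenAI Python client, and the project name, model, and prompt are placeholders.

```python
# Minimal sketch (not HAL's actual setup): wrap an agent's LLM calls so Weave
# traces them. Weave's OpenAI integration captures inputs, outputs, latency,
# and token usage once weave.init() has run; the @weave.op decorator groups
# calls under a traced function.
import weave
from openai import OpenAI

weave.init("example-agent-eval")  # placeholder project name

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@weave.op()
def call_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(call_model("Summarize the task instructions in one sentence."))
```

Because the provider integration logs each call made inside a traced op, this is roughly the minimum-effort path the guide refers to.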
app.py CHANGED
@@ -247,7 +247,7 @@ class MyTheme(Soft):
 
 my_theme = MyTheme()
 
-with gr.Blocks(theme=my_theme, css='css.css') as demo:
+with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderboard") as demo:
     # gr.Markdown((Path(__file__).parent / "header.md").read_text(), elem_classes=["text-large"])
     gr.HTML("""
     <style>
@@ -514,6 +514,7 @@ with gr.Blocks(theme=my_theme, css='css.css') as demo:
     """)
     with gr.Group(elem_classes=["grouped-section"]):
         gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
+        gr.Markdown('The agent monitor provides an overview of the recurring errors an agent makes as well as a summary of the steps the agent takes to solve a task. It currently consists of two main components:')
 
         gr.HTML('<div style="height: 10px;"></div>')
         gr.Markdown("## Failure report for each agent")
@@ -1533,10 +1534,13 @@ with gr.Blocks(theme=my_theme, css='css.css') as demo:
 
     gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent?</h2>
             <p>Below we provide a guide on how to add an agent to the leaderboard:</p>""")
+    gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
     gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark?</h2>
             <p>Below we provide a guide on how to add a benchmark to the leaderboard:</p>""")
+    gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
     gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>
             <p>Below we provide a guide on how to reproduce evaluations:</p>""")
+    gr.Markdown("""Coming soon...""")
 
 
 
@@ -1560,7 +1564,7 @@ async def main():
     # scheduler.add_job(check_and_process_uploads, "interval", hours=1)
     scheduler.start()
 
-    await demo.launch()
+    await demo.launch(favicon_path="hal.png")
 
 if __name__ == "__main__":
     weave.init(f'leaderboard_{datetime.now().strftime("%Y%m%d%H%M%S")}')
benchmark_submission.md ADDED
@@ -0,0 +1,5 @@
+ To submit **a new benchmark** to the library:
+
+ 1. Implement a new benchmark using some standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each task as well as the task environment that is provided inside the container the agent is run in (see the sketch below).
+
+ 2. We will encourage developers to support running their tasks on separate VMs and to specify the exact hardware specifications for each task in the task environment.
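
For point 1 above, a task definition under the METR Task Standard is, roughly, a Python class named `TaskFamily` that exposes the task variants, their instructions, and a scoring function. The sketch below follows the shape documented in the public task-standard repository, but it is illustrative only: the `standard_version`, task contents, and file path are placeholders, and exact method names and signatures should be checked against the spec.

```python
# Illustrative sketch of a task family in the style of the METR Task Standard;
# method names follow the public task-standard repo, but verify signatures
# against the spec. Task contents and paths are placeholders.

class TaskFamily:
    standard_version = "0.2.3"  # placeholder Task Standard version

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # One entry per task variant; each dict is passed to the methods below.
        return {
            "reverse_string": {"input": "hello", "expected": "olleh"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        # The exact instructions shown to the agent inside its container.
        return (
            f"Reverse the string '{t['input']}' and write the result to "
            "/home/agent/submission.txt."
        )

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Score in [0, 1]; returning None signals that manual scoring is needed.
        return 1.0 if submission.strip() == t["expected"] else 0.0
```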
hal.ico ADDED
hal.png ADDED