To submit a new agent for evaluation, developers should only need to:
Adhere to a standardized I/O format: Ensure the agent run file complies with the benchmark-specific I/O format. Depending on HAL's implementation, this could involve:
- Providing a specific entry point to the agent (e.g., a Python script or function)
- Correctly handling instructions and the submission process. For example, in METR's Vivaria, this can mean supplying a main.py file as the entry point and managing *instructions.txt* and *submission.txt* files.
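As a rough illustration, a Vivaria-style entry point could be as simple as the sketch below. The file names come from the example above; the working-directory paths and the placeholder `solve()` function are assumptions for illustration, not any harness's actual contract.

```python
# main.py -- hypothetical entry point sketch. The file names instructions.txt and
# submission.txt come from the Vivaria example above; their locations and the
# solve() placeholder are illustrative assumptions.
from pathlib import Path


def solve(instructions: str) -> str:
    """Placeholder for the agent's actual reasoning / tool-use loop."""
    return f"Answer based on: {instructions[:50]}..."


def main() -> None:
    # Read the task instructions provided by the harness.
    instructions = Path("instructions.txt").read_text()

    # Run the agent and write its final answer where the harness expects it.
    submission = solve(instructions)
    Path("submission.txt").write_text(submission)


if __name__ == "__main__":
    main()
```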
Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters.
- For our own evaluations, we have been relying on Weights & Biases' Weave, which provides integrations for a number of LLM providers.
- Both Vivaria and UK AISI's Inspect provide logging functionality.
- However, they are missing some pieces we are interested in, such as the latency and parameters of individual LLM calls; Weave provides a minimum-effort solution for capturing these.
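A minimal sketch of what this looks like with Weave follows; the project name, model, and prompt are placeholders, and it assumes the weave and openai packages plus an OpenAI API key are available. Initializing Weave patches supported LLM clients so each call is traced along with its parameters, token usage, and latency.

```python
# Hypothetical example of tracing an agent's LLM calls with W&B Weave.
# "hal-agent-demo" and the model name are placeholders.
import weave
from openai import OpenAI

# Initializing Weave patches supported LLM clients (e.g. OpenAI) so that each
# API call is logged with its parameters, token usage, and latency.
weave.init("hal-agent-demo")

client = OpenAI()


@weave.op()  # also trace our own wrapper, so calls are grouped per agent step
def ask_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask_model("Summarize the task instructions in one sentence."))
```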
Use our CLI to run evaluations and upload the results. The same CLI can also be used to rerun existing agent-benchmark pairs from the leaderboard.