To submit **a new agent** for evaluation, developers should only need to:

1. Ensure that the agent exposes a specific entry point (e.g., a Python script or function).
2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters (see the sketch below).
   * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
   * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality as well.
   * However, they are missing some pieces we are interested in, such as the latency and parameters of LLM calls; Weave provides a minimum-effort solution.
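As a rough illustration, here is a minimal sketch of an agent entry point instrumented with Weave, assuming the `weave` and `openai` packages are installed. The project name, model, and `run_agent` function are placeholders for illustration, not part of any harness interface.

```python
# Minimal sketch: an agent entry point whose LLM calls are traced by Weave.
# Names like "agent-eval" and run_agent are illustrative assumptions.
import weave
from openai import OpenAI

weave.init("agent-eval")  # start a Weave project; traced calls are logged under it

client = OpenAI()

@weave.op()  # records inputs, outputs, latency, and token usage for this call
def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def run_agent(task: str) -> str:
    """Entry point the evaluation harness would invoke."""
    return call_llm(f"Solve the following task:\n{task}")

if __name__ == "__main__":
    print(run_agent("Summarize the purpose of this script in one sentence."))
```

With this setup, calling `weave.init` once at the top of the entry point is typically enough for Weave's provider integrations to pick up LLM calls; the `@weave.op()` decorator adds explicit tracing around the agent's own functions.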