
To submit a new agent to the CORE leaderboard, follow these steps:

Run your agent on the CORE-Bench Harness. When developing your agent, ensure that, for each run, it generates a file named agent_trace.log in the base directory in which it is invoked. The content of this file must be valid JSON and include at least the keys cost and agent_trace:
```
{
    "cost": 0.59,
    "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
}
```
- cost: A float representing the total cost (USD) of API calls made by the agent. We recommend using Weave for easy cost logging.
- agent_trace: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by SWE-Bench:
  - Human-readable.
  - Reflects the intermediate steps your system took that led to the final solution.
  - Generated with the inference process, not post-hoc.
If you have any trouble implementing this, feel free to reach out to us for support.
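
As a rough illustration of the requirements above, the following Python sketch accumulates trace steps and cost while the agent runs and then writes agent_trace.log. The TraceLogger class, its method names, and the per-step cost argument are hypothetical and not part of the harness. Obtaining the actual cost figures (for example from Weave) is left to your agent.

```python
import json
from pathlib import Path


class TraceLogger:
    """Hypothetical helper: collects steps and cost, then writes agent_trace.log."""

    def __init__(self, base_dir="."):
        # The file must sit in the base directory the agent is invoked in.
        self.path = Path(base_dir) / "agent_trace.log"
        self.steps = []
        self.cost = 0.0

    def step(self, message, cost=0.0):
        # Record each step as it happens, so the trace is not written post hoc.
        self.steps.append(message)
        self.cost += cost

    def write(self):
        # The two required keys: cost (float, USD) and agent_trace (string).
        payload = {"cost": self.cost, "agent_trace": "\n".join(self.steps)}
        self.path.write_text(json.dumps(payload, indent=4), encoding="utf-8")
        return self.path
```

Under these assumptions, an agent would call step() after every action it takes and write() once at the end of the run.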
Run your agent on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the --use_azure flag) to avoid long experiment times. Set the --experiment_name flag to the name of your agent. You can submit results for any of the three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, or CORE-Bench-Hard.
Submit the following two directories from the harness:
- benchmark/results/[experiment_name]: Contains the results of your agent on each task.
- benchmark/logs/[experiment_name]: Contains the logs of your agent's execution on each task (these are the agent_trace.log files your agent generates).
- The harness generates both directories automatically when you run your agent. Do not modify their contents manually.
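
Before submitting, it may be worth confirming that each trace file parses as JSON and carries the two required keys. The sketch below is a read-only check and does not modify anything. The function name check_trace_file is hypothetical, and because the naming of files inside benchmark/logs/[experiment_name] is not described here, the function takes an explicit file path.

```python
import json
from pathlib import Path


def check_trace_file(path):
    """Return a list of problems found in one trace file (empty if none)."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # ValueError covers both invalid JSON and undecodable bytes.
        return [f"cannot read as JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    cost = data.get("cost")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(cost, bool) or not isinstance(cost, (int, float)):
        problems.append("'cost' is missing or not a number")
    if not isinstance(data.get("agent_trace"), str):
        problems.append("'agent_trace' is missing or not a string")
    return problems
```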
Compress these directories into two .tar.gz or .zip files and email them to zss@princeton.edu. If the files are too large to email, please upload them to Google Drive, Dropbox, etc., and email the link. In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.
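
One possible way to produce the two archives is sketched below with Python's standard tarfile module. The compress_dir function and the archive naming scheme (parent directory name plus experiment name) are assumptions made for illustration. The naming is chosen only so that the results and logs archives, which share the same directory name, do not overwrite each other.

```python
import tarfile
from pathlib import Path


def compress_dir(src_dir, out_dir="."):
    """Pack src_dir into <out_dir>/<parent>_<name>.tar.gz and return that path."""
    src = Path(src_dir)
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    # e.g. benchmark/results/my_agent -> results_my_agent.tar.gz
    out = Path(out_dir) / f"{src.parent.name}_{src.name}.tar.gz"
    with tarfile.open(out, "w:gz") as tar:
        # Store paths relative to the parent so the archive unpacks cleanly.
        tar.add(src, arcname=f"{src.parent.name}/{src.name}")
    return out
```

With an experiment named my_agent (a placeholder), calling compress_dir on benchmark/results/my_agent and on benchmark/logs/my_agent would write results_my_agent.tar.gz and logs_my_agent.tar.gz to the current directory.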
[Optional] We highly encourage you to submit the files of your agent (i.e. benchmark/agents/[agent_name]) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a .tar.gz file and include it in the email.