HAL: Holistic Agent Leaderboard

Imagine you run a travel agency that wants to adopt an AI agent to automate customer bookings and improve efficiency. How would you choose a product?

And what if you are an agent developer building an agent that can browse the web and book hotels and tickets for entire travel itineraries? How do you go about comparing it against past agents?

Or suppose a group of independent researchers claims to have built a generalist agent that can automate offensive web attacks, such as DDoS attacks. How seriously should we take their claims?

As things stand today, none of the stakeholders described above (customers, agent developers, safety researchers) can adequately judge the evidence for AI agent capabilities, for several reasons:

  • Different agent developers often build their own evaluation harnesses for agent benchmarks, making true apples-to-apples comparisons difficult.
  • Many agent benchmarks lack a centralized leaderboard. Even when one exists, it typically does not verify whether the listed agents are implemented correctly.
  • Most importantly, current leaderboards do not include information about the cost of running agents. Cost matters both to downstream customers deciding whether to adopt an agent and to anyone assessing its safety implications: which adversaries could afford to run such an agent, and for how long.

In our recent paper, we showed that AI agent evaluations fall drastically short of the principles of good evaluation, making it hard to verify claims of real-world performance based on benchmark results.

The Holistic Agent Leaderboard aims to address these widespread limitations of current agent evaluation. We will develop a platform that standardizes agent evaluations and makes it easy to measure agent performance on consequential real-world tasks.

We have been here before

For language model evaluations, centralized leaderboards like HELM and the Open LLM Leaderboard have proven essential, as have tools for running standardized evaluations, such as lm-eval-harness.

These tools have allowed downstream users of language models, model developers, and safety researchers to compare model performance across multiple benchmarks that capture different competencies.

We aim to do something similar for agent evaluation.

How agent evaluations differ from model evaluations

For model evaluation, standardizing elements of the input prompt can be useful to ensure models compete on an equal footing. For example, zero-shot vs. few-shot prompting can lead to qualitatively different performance.

For agents, modifications to the system prompt (along with other system designs, such as retrying multiple times, using a verifier, or majority voting) are features rather than confounders to be standardized away, since these methods have been shown to solve real-world tasks of interest more effectively.
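To make this concrete, here is a minimal sketch (in Python) of two such designs, retrying against a verifier and majority voting over repeated samples. The `call_model` function is a hypothetical placeholder for a single LLM API call, not part of any specific agent or library.

```python
from collections import Counter
from typing import Callable, Optional

def call_model(prompt: str) -> str:
    """Hypothetical placeholder for a single LLM call (e.g., via an API client)."""
    raise NotImplementedError

def majority_vote(prompt: str, k: int = 5) -> str:
    """Sample the model k times and return the most common answer."""
    answers = [call_model(prompt) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def retry_with_verifier(prompt: str, verify: Callable[[str], bool], max_attempts: int = 3) -> Optional[str]:
    """Keep sampling until a verifier accepts an answer, or give up."""
    for _ in range(max_attempts):
        answer = call_model(prompt)
        if verify(answer):
            return answer
    return None
```

Both designs trade extra model calls (and therefore cost) for accuracy, which is exactly why cost has to be reported alongside performance.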

But as a side effect, agents vary wildly in how much they cost and how long they take to run, whereas comparing the cost and latency of bare models is relatively straightforward. Understanding the cost and time required to run an agent is key to determining whether its design actually improves on simple baselines (such as running the same model multiple times).

In other words, we are moving away from evaluating AI on a one-dimensional leaderboard and toward a Pareto frontier that considers both performance and cost. Leaderboards are attractive for many reasons (scientifically, to assess capabilities; culturally, to pick winners and losers), but we think there is no meaningful way to collapse these dimensions into one.
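As a sketch of what this looks like in practice, the snippet below computes the Pareto frontier over (cost, accuracy) for a handful of agents; the agent names and numbers are made up purely for illustration.

```python
def pareto_frontier(results):
    """Return agents not dominated on (cost, accuracy).

    An agent is dominated if another agent is at least as accurate and no more
    expensive, and strictly better on at least one of the two dimensions.
    """
    frontier = []
    for name, cost, acc in results:
        dominated = any(
            other_cost <= cost and other_acc >= acc
            and (other_cost < cost or other_acc > acc)
            for _, other_cost, other_acc in results
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda r: r[1])  # sort by cost

# Illustrative, made-up numbers: (name, cost per task in USD, accuracy)
results = [
    ("single model call", 0.10, 0.42),
    ("5x majority vote", 0.50, 0.48),
    ("complex agent pipeline", 10.00, 0.49),
]
print(pareto_frontier(results))
```

All three hypothetical agents sit on the frontier here, which is precisely the point: the "best" agent depends on how much an extra point of accuracy is worth to you, and no single ranking captures that.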

HAL is a third-party, centralized, cost-controlled leaderboard for agent benchmarks

  • Centralized: Evaluations across agent benchmarks are all recorded to a single leaderboard that evaluates every listed agent in the same way.
  • Third-party: Agent developers have competing incentives when reporting accuracy, since they want to claim state-of-the-art performance; an independent third party that runs every evaluation itself removes this conflict of interest.
  • Cost-controlled: For downstream users, understanding the cost of running agents is essential for adoption. For agent developers, cost-controlled evaluations help establish accurate baselines (if an agent is SoTA by 0.1% but costs 100x as much, is it really impressive?). A rough sketch of per-run cost accounting follows this list.
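As an illustration of what such cost accounting might involve, the sketch below converts the token usage logged during an agent run into a dollar cost. The model name and per-token prices are placeholders, not real pricing for any provider.

```python
# Placeholder prices (USD per 1M tokens) -- not real pricing for any provider.
PRICE_PER_1M_TOKENS = {"example-model": {"input": 2.50, "output": 10.00}}

def run_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token counts recorded during an agent run into a dollar cost."""
    prices = PRICE_PER_1M_TOKENS[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

# e.g., an agent run that consumed 1.2M input tokens and 150k output tokens:
print(round(run_cost_usd("example-model", 1_200_000, 150_000), 2))  # 4.5
```

Reporting a number like this alongside accuracy for every run is what makes questions such as the 0.1%-better-at-100x-the-cost one above answerable.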

Who is it for?

We see HAL being useful for four categories of users:

  1. Downstream users and procurers of agents: Customers looking to deploy agents can find existing benchmarks that resemble their tasks of interest, learn which developers are building useful agents (and see agent demos), and identify the state of the art in both cost and accuracy for the tasks they want to solve.
  2. Agent benchmark developers: Reporting results on a centralized leaderboard gives greater visibility to agent benchmarks that measure real-world utility.
  3. Agent developers: HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.
  4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).