benediktstroebl committed on
Commit
7c691e6
•
1 Parent(s): 0b4d0ca
.gitignore ADDED
@@ -0,0 +1,5 @@
1
+ .DS_Store
2
+ **/__pycache__/**
3
+ evals_upload/*
4
+ evals_live/*
5
+ evals_processed/*
README.md CHANGED
@@ -1,10 +1,10 @@
1
  ---
2
- title: Leaderboard 2
3
- emoji: 📊
4
- colorFrom: red
5
- colorTo: yellow
6
  sdk: gradio
7
- sdk_version: 5.4.0
8
  app_file: app.py
9
  pinned: false
10
  ---
 
1
  ---
2
+ title: Agent Leaderboard
3
+ emoji: 🏆
4
+ colorFrom: blue
5
+ colorTo: pink
6
  sdk: gradio
7
+ sdk_version: 4.40.0
8
  app_file: app.py
9
  pinned: false
10
  ---
about.md ADDED
@@ -0,0 +1,50 @@
1
+ # HAL: Holistic Agent Leaderboard
2
+
3
+ Imagine you run a travel agency that wants to adopt an AI agent to automate customer bookings and improve efficiency. How would you choose a product?
4
+
5
+ And what if you are an agent developer building an agent that can browse the web and book hotels and tickets for entire travel itineraries? How do you compare it against past agents?
6
+
7
+ Or suppose a group of independent researchers claims to have built a generalist agent that can automate offensive web operations, such as carrying out DDoS attacks. How seriously should we take their claims?
8
+
9
+ As things stand today, none of the stakeholders described above (customers, agent developers, safety researchers) can properly judge the evidence for AI agent capabilities, for several reasons:
10
+
11
+ - Different agent developers often build their own evaluation harnesses for agent benchmarks, making true apples-to-apples comparisons difficult.
12
+ - Many agent benchmarks lack a centralized leaderboard. Even when they have one, it usually does not verify whether the listed agents are implemented correctly.
13
+ - Most importantly, current leaderboards do not report the cost of running agents. Cost is crucial for downstream customers deciding whether to adopt an agent, and for understanding safety implications: it determines which adversaries can afford to run a given agent, and for how long.
14
+
15
+ [In our recent paper](https://arxiv.org/abs/2407.01502), we showed that AI agent evaluations fall drastically short of the principles of good evaluation, making it hard to verify claims of real-world performance based on benchmark results.
16
+
17
+ The Holistic Agent Leaderboard aims to address the widespread limitations of current agent evaluation. We will develop a platform to standardize agent evaluations and easily measure their performance on consequential real-world tasks.
18
+
19
+ ## We have been here before
20
+
21
+ For language model evaluations, centralized leaderboards like HELM and OpenLLMLeaderboard have proven essential, as have tools to conduct standardized evaluations, such as lm-eval-harness.
22
+
23
+ These tools have allowed downstream users of language models, model developers, and safety researchers to compare model performance across multiple benchmarks that capture different competencies.
24
+
25
+ We aim to do something similar for agent evaluation.
26
+
27
+ ## How agent evaluations differ from model evaluations
28
+
29
+ For model evaluation, standardizing elements of the input prompt can be useful to ensure models compete on an equal footing. For example, zero-shot vs. few-shot prompting can lead to qualitatively different performance.
30
+
31
+ For agents, modifications to the system prompt (along with other system designs such as retrying multiple times and using a verifier or majority voting) are features, since these methods have been shown to solve real-world tasks of interest more effectively.
32
+
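+ To illustrate (with a generic sketch, not HAL code), one such design samples a model several times and takes a majority vote over its answers; `call_model` below is a placeholder for any LLM call:
+
+ ```python
+ from collections import Counter
+
+ def majority_vote(call_model, prompt, n_samples=5):
+     """Call the model several times and return the most common answer.
+
+     `call_model` stands in for any function mapping a prompt to an answer;
+     real agent designs add retries, verifiers, tool use, and so on.
+     """
+     answers = [call_model(prompt) for _ in range(n_samples)]
+     return Counter(answers).most_common(1)[0][0]
+ ```
+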
33
+ But as a side effect, unlike models, whose cost and running time are relatively straightforward to compare, agents can vary wildly in how much they cost and how long they take to run. Understanding the cost and time required to run them is key to determining whether an agent design improves on simple baselines (such as running the same model multiple times).
34
+
35
+ In other words, we are moving away from evaluating AI from the perspective of a one-dimensional leaderboard and toward a Pareto frontier that considers performance and cost. Leaderboards are attractive for many reasons (scientifically, to assess capabilities; culturally, to pick winners and losers), but we think there's no meaningful way to collapse the dimensions into one.
36
+
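+ As a concrete (and generic, not HAL-specific) illustration, a cost-accuracy Pareto frontier can be computed from per-agent results in a few lines:
+
+ ```python
+ def pareto_frontier(agents):
+     """Return the agents not dominated on the (cost, accuracy) plane.
+
+     `agents` is a list of (name, cost, accuracy) tuples; an agent is dominated
+     if another agent is at least as accurate and no more expensive, and strictly
+     better on one of the two dimensions.
+     """
+     frontier = []
+     for name, cost, acc in agents:
+         dominated = any(
+             c <= cost and a >= acc and (c < cost or a > acc)
+             for _, c, a in agents
+         )
+         if not dominated:
+             frontier.append((name, cost, acc))
+     return sorted(frontier, key=lambda x: x[1])
+ ```
+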
37
+ ## HAL is a third-party, centralized, cost-controlled leaderboard for agent benchmarks
38
+
39
+ - Centralized: Evaluations across agent benchmarks are all recorded to a single leaderboard that evaluates every listed agent in the same way.
40
+ - Third-party: Agent developers have a clear conflict of interest when reporting accuracy: they want to show state-of-the-art performance. Independent, third-party evaluation removes this incentive problem.
41
+ - Cost-controlled: For downstream users, understanding the cost of running agents is a key factor in adoption. For agent developers, cost-controlled evaluations help establish accurate baselines (if an agent is SoTA by 0.1% but costs 100x as much, is it really impressive?).
42
+
43
+ ## Who is it for?
44
+
45
+ We see HAL being useful for four categories of users:
46
+
47
+ 1. Downstream users and procurers of agents: Customers looking to deploy agents can get visibility into existing benchmarks that resemble tasks of interest to them, learn which developers are building useful agents (and see agent demos), and identify the state of the art for both cost and accuracy on the tasks they are looking to solve.
48
+ 2. Agent benchmark developers: Reporting results on a centralized leaderboard could allow improved visibility into agent benchmarks that measure real-world utility.
49
+ 3. Agent developers: HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.
50
+ 4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could give a sense of how well agents perform (accuracy) and which adversaries can afford such agents (cost).
agent_monitor/failure_report.py ADDED
@@ -0,0 +1,277 @@
1
+ import asyncio
2
+ from openai import AsyncOpenAI
3
+ from collections import defaultdict
4
+ import weave
5
+ from pydantic import BaseModel
6
+ from abc import ABC, abstractmethod
7
+ import json
8
+ from typing import Dict, List
9
+ from datetime import datetime
10
+ import backoff
11
+ from openai import APITimeoutError, APIError, RateLimitError
12
+
13
+
14
+ class FailureCategory(BaseModel):
15
+ category_id: int
16
+ category_name: str
17
+ description: str
18
+
19
+ class FailureCategories(BaseModel):
20
+ failure_categories: list[FailureCategory]
21
+
22
+ class TaskSummary(BaseModel):
23
+ task_id: str
24
+ summary: str
25
+
26
+ class TaskClassification(BaseModel):
27
+ task_id: str
28
+ category_id: str
29
+ category_name: str
30
+ explanation: str
31
+
32
+ class OverallAnalysis(BaseModel):
33
+ failure_categories: List[Dict]
34
+ task_classifications: Dict[str, Dict]
35
+ summary: str
36
+
37
+ class AsyncLLMClient(ABC):
38
+ @abstractmethod
39
+ async def generate_text(self, prompt, system_message=None, response_format=None):
40
+ pass
41
+
42
+ # class AsyncOpenAIClient(AsyncLLMClient):
43
+ # def __init__(self, model="gpt-4o-mini"):
44
+ # self.model = model
45
+ # self.client = AsyncOpenAI()
46
+
47
+ # async def generate_text(self, prompt, system_message=None, response_format=None):
48
+ # messages = [
49
+ # {"role": "system", "content": system_message or "You are a helpful AI assistant."},
50
+ # {"role": "user", "content": prompt}
51
+ # ]
52
+ # if response_format:
53
+ # response = await self.client.beta.chat.completions.parse(model=self.model, messages=messages, response_format=response_format)
54
+ # else:
55
+ # response = await self.client.chat.completions.create(model=self.model, messages=messages)
56
+ # return response.choices[0].message.content
57
+
58
+
59
+ class AsyncOpenAIClient(AsyncLLMClient):
60
+ def __init__(self, model="gpt-4o-mini", max_tries=5, max_time=300):
61
+ self.model = model
62
+ self.client = AsyncOpenAI()
63
+ self.max_tries = max_tries
64
+ self.max_time = max_time
65
+
66
+ @backoff.on_exception(
67
+ backoff.expo,
68
+ (APITimeoutError, APIError, RateLimitError),
69
+ max_tries=10,
70
+ max_time=300
71
+ )
72
+ async def _make_request(self, messages, response_format=None):
73
+ if response_format:
74
+ return await self.client.beta.chat.completions.parse(
75
+ model=self.model,
76
+ messages=messages,
77
+ response_format=response_format
78
+ )
79
+ else:
80
+ return await self.client.chat.completions.create(
81
+ model=self.model,
82
+ messages=messages
83
+ )
84
+
85
+ async def generate_text(self, prompt, system_message=None, response_format=None):
86
+ messages = [
87
+ {"role": "system", "content": system_message or "You are a helpful AI assistant."},
88
+ {"role": "user", "content": prompt}
89
+ ]
90
+ try:
91
+ response = await self._make_request(messages, response_format)
92
+ return response.choices[0].message.content
93
+ except Exception as e:
94
+ raise Exception(f"Failed after {self.max_tries} attempts or {self.max_time} seconds: {str(e)}")
95
+
96
+ def get_weave_calls(client):
97
+ calls = client.calls()
98
+ processed_calls = []
99
+ for call in calls:
100
+ ChatCompletion = weave.ref(call.output).get()
101
+ choices = [choice.message.content for choice in ChatCompletion.choices]
102
+ output = {
103
+ 'weave_task_id': call.attributes['weave_task_id'],
104
+ 'trace_id': call.trace_id,
105
+ 'project_id': call.project_id,
106
+ 'created_timestamp': ChatCompletion.created,
107
+ 'inputs': dict(call.inputs),
108
+ 'id': call.id,
109
+ 'outputs': {'choices' : choices},
110
+ 'exception': call.exception,
111
+ 'summary': call.summary,
112
+ 'display_name': call.display_name,
113
+ 'attributes': dict(call.attributes),
114
+ "_children": call._children,
115
+ '_feedback': call._feedback,
116
+ }
117
+ processed_calls.append(output)
118
+ return processed_calls
119
+
120
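+ # Failure analysis pipeline: summarize each failed task from its Weave calls,
+ # derive recurring failure categories from the summaries, classify every task
+ # into a category, and generate an overall summary of the agent's behavior.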
+ async def analyze_agent_performance(processed_calls, failed_tasks: list, llm_client):
121
+ task_calls = defaultdict(list)
122
+ for call in processed_calls:
123
+ if call['weave_task_id'] in failed_tasks:
124
+ task_calls[call['weave_task_id']].append(call)
125
+
126
+ for task_id in task_calls:
127
+ task_calls[task_id].sort(key=lambda x: x['created_timestamp'])
128
+
129
+ task_summaries = await asyncio.gather(*[summarize_task(task_id, calls, llm_client) for task_id, calls in task_calls.items()])
130
+
131
+ failure_categories = await identify_failure_categories(task_summaries, llm_client)
132
+ task_classifications = await classify_tasks(task_summaries, failure_categories, llm_client)
133
+ overall_summary = await generate_overall_summary(failure_categories, task_classifications, llm_client)
134
+
135
+ task_classifications = {tc["task_id"]: tc for tc in task_classifications}
136
+
137
+ return dict(OverallAnalysis(
138
+ failure_categories=failure_categories,
139
+ task_classifications=task_classifications,
140
+ summary=overall_summary
141
+ ))
142
+
143
+ async def summarize_task(task_id, calls, llm_client):
144
+ calls_summary = ""
145
+ for i, call in enumerate(calls, 1):
146
+ calls_summary += f"""
147
+ Step {i}:
148
+ Input: {call['inputs']}
149
+ Output: {call['outputs']}
150
+ Timestamp: {datetime.fromtimestamp(call['created_timestamp'])}
151
+ """
152
+
153
+ prompt = f"""
154
+ Summarize the AI agent's performance on the following task:
155
+ Task ID: {task_id}
156
+ Number of steps: {len(calls)}
157
+
158
+ Detailed steps:
159
+ {calls_summary}
160
+
161
+ Provide a brief summary of:
162
+ 1. The main goal of the task (inferred from the inputs and outputs)
163
+ 2. The agent's approach, including key steps and decisions made
164
+ 3. Any significant challenges or errors encountered during the task
165
+ 4. The final outcome and why the task failed. Be detailed about the reason for failure.
166
+
167
+ Keep the summary concise (around 200 words) but include specific details about the agent's performance and any notable aspects of its problem-solving process.
168
+ """
169
+
170
+ system_message = "You are an AI performance analyst tasked with summarizing an AI agent's performance on individual tasks. Focus on the most important aspects of the agent's approach and performance."
171
+ summary = await llm_client.generate_text(prompt, system_message, response_format=TaskSummary)
172
+ return json.loads(summary)
173
+
174
+ async def identify_failure_categories(task_summaries, llm_client):
175
+ summaries_text = "\n\n".join([f"Task {s['task_id']}:\n{s['summary']}" for s in task_summaries])
176
+ prompt = f"""
177
+ Analyze the following summaries of an AI agent's performance across multiple tasks:
178
+
179
+ {summaries_text}
180
+
181
+ Identify recurring categories of failures that the agent faces across these tasks. For each category:
182
+ 1. Provide a short, descriptive name (max 5 words)
183
+ 2. Write a brief description explaining the nature of this failure or challenge category
184
+
185
+ Focus on patterns that appear across multiple tasks and represent specific errors that impacted the agent's performance. Make sure that your categories are distinct and cover a range of recurring issues. The categories should not be too general.
186
+
187
+ Examples for categories could include:
188
+ Incorrect Implementation - The agent made a change to a reasonable area, but its solution didn't correctly address the issue.
189
+ Gave Up Prematurely - The agent decides to stop solving the task after encountering some difficulty.
190
+ Failed Edit Recovery - The agent went into a loop, making repeated failing edits without recovering.
191
+ """
192
+
193
+ system_message = "You are an expert in AI agent analysis, tasked with identifying recurring patterns in agent performance across multiple tasks."
194
+ categories = await llm_client.generate_text(prompt, system_message, response_format=FailureCategories)
195
+ return [dict(category) for category in json.loads(categories)['failure_categories']]
196
+
197
+ async def classify_tasks(task_summaries, failure_categories, llm_client):
198
+ categories_text = "\n".join([f"{cat['category_id']}. {cat['category_name']}: {cat['description']}" for cat in failure_categories])
199
+ classifications = []
200
+
201
+ for task in task_summaries:
202
+ prompt = f"""
203
+ Failure Categories:
204
+ {categories_text}
205
+
206
+ Task Summary:
207
+ {task['summary']}
208
+
209
+ Classify this task into one of the failure categories listed above. Provide:
210
+ 1. The number of the chosen category
211
+ 2. A brief explanation of why this category best fits the task's outcome
212
+
213
+ If the task doesn't clearly fit any category, you may classify it as "0. Other" and explain why.
214
+ """
215
+
216
+ system_message = "You are an AI performance analyst tasked with classifying task outcomes into predefined categories."
217
+ classification = await llm_client.generate_text(prompt, system_message, response_format=TaskClassification)
218
+ classification = json.loads(classification)
219
+
220
+ category_number = classification['category_id']
221
+ if str(category_number) == "0":
222
+ category_name = "Other"
223
+ else:
224
+ for cat in failure_categories:
225
+ if str(cat['category_id']) == str(category_number):
226
+ category_name = cat['category_name']
227
+ break
228
+ else:
229
+ category_name = "Other"
230
+
231
+ explanation = classification['explanation']
232
+
233
+ classifications.append(dict(TaskClassification(
234
+ task_id=task['task_id'],
235
+ category_id=category_number,
236
+ category_name=category_name,
237
+ explanation=explanation
238
+ )))
239
+
240
+ return classifications
241
+
242
+ async def generate_overall_summary(failure_categories, task_classifications, llm_client):
243
+ categories_text = "\n".join([f"{cat['category_name']}: {cat['description']}" for cat in failure_categories])
244
+
245
+ classifications_text = "\n".join([f"Task {tc['task_id']}: {tc['category_name']}" for tc in task_classifications])
246
+
247
+ prompt = f"""
248
+ Failure Categories:
249
+ {categories_text}
250
+
251
+ Task Classifications:
252
+ {classifications_text}
253
+
254
+ Based on the failure categories identified and the classification of tasks, provide an overall summary of the AI agent's performance across all tasks. Include:
255
+ 1. The most common types of failures or challenges
256
+ 2. Any patterns in the agent's performance across different tasks
257
+ 3. Suggestions for areas of improvement in the agent's design or training
258
+
259
+ Keep the summary concise but insightful, focusing on the most significant findings and their implications for AI agent development. Return only the summary itself, without any preceding context.
260
+ """
261
+
262
+ system_message = "You are a senior AI researcher tasked with providing a high-level analysis of an AI agent's performance across multiple tasks."
263
+ return await llm_client.generate_text(prompt, system_message)
264
+
265
+ async def main():
266
+ client = weave.init("citp_agent_eval/usaco_1723148990")
267
+ processed_calls = get_weave_calls(client)
268
+
269
+ weave.finish()
270
+ openai_client = AsyncOpenAIClient(model="gpt-4o-mini")
271
+ failed_tasks = []  # TODO: supply the ids of the tasks that failed in this run
+ overall_analysis = await analyze_agent_performance(processed_calls, failed_tasks, openai_client)
272
+
273
+ with open("agent_performance_analysis.json", "w") as f:
274
+ json.dump(overall_analysis, f, indent=4)  # analyze_agent_performance already returns a plain dict
275
+
276
+ if __name__ == "__main__":
277
+ asyncio.run(main())
agent_monitor/monitor.py ADDED
@@ -0,0 +1,140 @@
1
+ import asyncio
2
+ from collections import defaultdict
3
+ from pydantic import BaseModel
4
+ import json
5
+
6
+ class StepAnalysis(BaseModel):
7
+ description: str
8
+ action_type: str
9
+ assessment: str
10
+ success: bool
11
+ headline: str
12
+
13
+ class TaskSummary(BaseModel):
14
+ overview: str
15
+ key_successes: str
16
+ main_challenges: str
17
+ overall_assessment: str
18
+
19
+
20
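+ # Group Weave calls by task id, order them chronologically, and analyze all tasks
+ # concurrently; when llm_eval is False, placeholder analyses are returned instead
+ # of making LLM calls.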
+ async def analyze_agent_steps(processed_calls, llm_client, llm_eval=False):
21
+ task_calls = defaultdict(list)
22
+ for call in processed_calls:
23
+ task_calls[call['weave_task_id']].append(call)
24
+
25
+ for task_id in task_calls:
26
+ task_calls[task_id].sort(key=lambda x: x['created_timestamp'])
27
+
28
+ tasks = [analyze_task(calls, llm_client, llm_eval) for task_id, calls in task_calls.items()]
29
+ task_analyses = await asyncio.gather(*tasks)
30
+
31
+ return dict(zip(task_calls.keys(), task_analyses))
32
+
33
+ async def analyze_task(calls, llm_client, llm_eval=False):
34
+ if llm_eval:
35
+ step_tasks = [analyze_step(call, i+1, len(calls), llm_client) for i, call in enumerate(calls)]
36
+ steps = await asyncio.gather(*step_tasks)
37
+
38
+ else:
39
+ steps = []
40
+ for i, call in enumerate(calls):
41
+ steps.append({
42
+ 'call_data': call,
43
+ 'analysis': dict(StepAnalysis(
44
+ description="Not available",
45
+ action_type='other',
46
+ success=False,
47
+ assessment="Not available",
48
+ headline="Not available"
49
+ ))
50
+ })
51
+
52
+ try:
53
+ if llm_eval:
54
+ task_analysis = await summarize_task(steps, llm_client)
55
+ return {
56
+ 'steps': steps,
57
+ 'task_analysis': task_analysis
58
+ }
59
+ else:
60
+ return {
61
+ 'steps': steps,
62
+ 'task_analysis': dict(TaskSummary(
63
+ overview="Not available",
64
+ key_successes='Not available',
65
+ main_challenges='Not available',
66
+ overall_assessment="Not available"
67
+ ))
68
+ }
69
+
70
+ except Exception as e:
71
+ print(f"Error in task summarization: {str(e)}")
72
+ return {'steps': steps, 'task_analysis': dict(TaskSummary(
73
+ overview="Not available",
74
+ key_successes='Not available',
75
+ main_challenges='Not available',
76
+ overall_assessment="Not available"
77
+ ))}
78
+
79
+ async def analyze_step(call, step_number, total_steps, llm_client):
80
+ prompt = f"""
81
+ Analyze Step {step_number}/{total_steps} of the AI agent's USACO task solution:
82
+ Input: {call['inputs']}
83
+ Output: {call['outputs']}
84
+ Exception: {call['exception']}
85
+ Summary: {call['summary']}
86
+
87
+ Provide a detailed, technical analysis with the following:
88
+ 1. Specific Description: Describe precisely what the agent did in this step, including any algorithms, data structures, or problem-solving techniques employed.
89
+ 2. Action Classification: Categorize the action as one of:
90
+ - 'plan': Strategizing or outlining an approach
91
+ - 'tool': Using a specific programming construct or algorithm
92
+ - 'retrieve': Accessing or utilizing external information
93
+ - 'other': Any action that doesn't fit the above categories
94
+ 3. Technical Evaluation: Assess the technical merit of the agent's approach. Comment on efficiency, correctness, and adherence to USACO problem-solving best practices.
95
+ 4. Success: Determine if the agent successfully completed its intended action.
96
+ 5. Concise Headline: Write a technically precise headline (max 7 words) that captures the essence of this step.
97
+
98
+ Your analysis should be highly specific to this task. Avoid generalities and focus on the technical details of the agent's approach to this particular problem.
99
+ """
100
+
101
+ system_message = "You are an expert in AI agent design and evaluation. Analyze the AI agent's actions with the depth and specificity expected in a detailed expert review. Focus on providing insights that would be valuable to an AI researcher specializing in AI agent development."
102
+
103
+ analysis = await llm_client.generate_text(prompt, system_message, response_format=StepAnalysis)
104
+
105
+ try:
106
+ analysis = json.loads(analysis)
107
+ except json.JSONDecodeError:
108
+ print(f"Error parsing analysis for step {step_number} of {total_steps} in task {call['weave_task_id']}. Using default values.")
110
+ analysis = dict(StepAnalysis(
111
+ description="Analysis failed",
112
+ action_type='other',
113
+ success=False,
114
+ assessment="Unable to assess due to error"
115
+ ))
116
+
117
+ return {
118
+ 'call_data': call,
119
+ 'analysis': analysis
120
+ }
121
+ async def summarize_task(steps, llm_client):
122
+ steps_summary = "\n".join([f"Step {i+1}: {step['analysis']}" for i, step in enumerate(steps)])
123
+
124
+ prompt = f"""
125
+ Provide a comprehensive analysis of the AI agent's approach to solving this USACO task:
126
+
127
+ {steps_summary}
128
+
129
+ Your analysis should include:
130
+ 1. Technical Overview: Describe the agent's overall problem-solving strategy, highlighting specific actions and techniques used throughout the task.
131
+ 2. Key Achievements: Identify and explain the most significant breakthroughs or efficient implementations demonstrated by the agent.
132
+ 3. Technical Challenges: Analyze the primary obstacles encountered, focusing on difficulties or conceptual misunderstandings in the context of the task.
133
+ 4. Performance Evaluation: Assess the agent's overall performance, considering factors such as time complexity, space efficiency, code quality, and adherence to competitive programming best practices.
134
+
135
+ Your summary should be highly technical and specific to this task. Assume the reader is an expert as well and familiar with the task context. Focus on providing insights that would be valuable to an AI researcher specializing in AI agent development.
136
+ """
137
+
138
+ system_message = "You are an expert AI performance analyst, skilled in evaluating and summarizing AI agent task execution. You are specialized in providing analyses to support AI researchers to develop AI agents."
139
+ analysis = await llm_client.generate_text(prompt, system_message, response_format=TaskSummary)
140
+ return json.loads(analysis)
agent_performance_analysis.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "failure_categories": [
3
+ {
4
+ "category_id": 1,
5
+ "category_name": "Algorithm Implementation Issues",
6
+ "description": "The agent occasionally struggles with implementing the correct algorithms for given tasks, often leading to inefficiencies or logical errors in output."
7
+ },
8
+ {
9
+ "category_id": 2,
10
+ "category_name": "Input Validation Failures",
11
+ "description": "Issues with handling unexpected or malformed inputs arise, resulting in crashes or incorrect results, indicating a lack of robustness in input handling."
12
+ },
13
+ {
14
+ "category_id": 3,
15
+ "category_name": "Inadequate Commenting and Documentation",
16
+ "description": "The agent sometimes fails to adequately comment or document the code, making it harder to understand the thought process and logic behind implementations, especially for complex tasks."
17
+ },
18
+ {
19
+ "category_id": 4,
20
+ "category_name": "Test Case Coverage Gaps",
21
+ "description": "The agent frequently misses edge cases or does not sufficiently test various scenarios, resulting in incomplete solutions that may fail under certain conditions."
22
+ },
23
+ {
24
+ "category_id": 5,
25
+ "category_name": "Problem Decomposition Difficulties",
26
+ "description": "Challenges in effectively breaking down complex problems into manageable steps can lead to oversight and incomplete solution strategies."
27
+ }
28
+ ],
29
+ "task_classifications": [
30
+ {
31
+ "task_id": "1333_platinum_good_bitstrings",
32
+ "category_id": "0",
33
+ "category_name": "Success/Other",
34
+ "explanation": "The task was successfully completed with clear, well-structured steps and good documentation in the Python implementation. There were no significant challenges or errors encountered, indicating effective handling of the problem without falling into any of the predefined failure categories."
35
+ }
36
+ ],
37
+ "summary": "### Overall Summary of AI Agent's Performance\n\n**1. Common Types of Failures:**\nThe AI agent exhibits several recurring issues that hinder its performance:\n- **Algorithm Implementation Issues:** The agent frequently implements algorithms incorrectly, resulting in inefficiencies and logical inconsistencies. This indicates a need for improved algorithm comprehension and application.\n- **Input Validation Failures:** The agent struggles with handling unexpected or malformed inputs, which can lead to crashes or inaccuracies in output. This underscores a critical lack of robustness in its input handling mechanisms.\n- **Inadequate Commenting and Documentation:** There is a consistent shortfall in the agent's ability to adequately comment and document its code, complicating code comprehension and potentially hindering collaborative efforts.\n- **Test Case Coverage Gaps:** The agent often overlooks edge cases during testing, suggesting that its testing framework may not be rigorous enough to ensure comprehensive solution validation.\n- **Problem Decomposition Difficulties:** The inability to efficiently break complex problems into smaller, manageable tasks leads to incomplete or erroneous solutions, highlighting a weakness in high-level problem-solving strategies.\n\n**2. Patterns in the Agent's Performance Across Tasks:**\nThe agent's performance appears to vary based on the complexity of tasks. While it may succeed in simpler or more straightforward tasks (as indicated by the success classification in Task 1333_platinum_good_bitstrings), it shows vulnerabilities in handling tasks that require deeper reasoning, sophisticated algorithm implementations, or robust input validation. This pattern suggests that the agent may benefit from focused training on problem decomposition and robustness.\n\n**3. Suggestions for Areas of Improvement:**\nTo enhance the AI agent's performance across tasks, the following areas should be prioritized:\n- **Enhanced Training on Algorithm Understanding:** Focus on comprehensive training modules that reinforce algorithm selection and implementation strategies.\n- **Robust Input Handling Mechanisms:** Develop more resilient input validation frameworks to handle edge cases and malformed inputs without runtime failures.\n- **Improved Documentation Practices:** Implement guidelines and tools that facilitate better commentary and documentation of code, enhancing maintainability and collaboration.\n- **Expanded Testing Framework:** Create a more exhaustive testing suite that includes a wider variety of edge cases and scenarios to ensure all functions perform as expected in diverse conditions.\n- **Training on Problem Decomposition:** Include training tactics aimed at teaching the agent to effectively break down complex problems, fostering a stepwise approach to problem-solving.\n\nBy addressing these areas, the AI agent can become more reliable and efficient, ultimately leading to improved performance across a wider range of tasks."
38
+ }
agent_submission.md ADDED
@@ -0,0 +1,8 @@
1
+ To submit **a new agent** for evaluation, developers should only need to:
2
+
3
+ 1. Ensure that the agent provides a specific entry point (e.g., a Python script or function).
4
+
5
+ 2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters.
6
+ * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/) which provides integrations for a number of LLM providers.
7
+ * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
8
+ * However, they are currently missing some pieces we are interested in, such as the latency and parameters of LLM calls. Weave provides a minimum-effort solution.
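+
+ As a rough illustration of the kind of integration we have in mind (the project name, model, and wrapper function below are placeholders, not HAL requirements), a minimal sketch using Weave with the OpenAI client might look like this:
+
+ ```python
+ import weave
+ from openai import OpenAI
+
+ # Initializing Weave patches supported LLM clients so that each call's inputs,
+ # outputs, token usage, latency, and (for known models) cost are logged to the named project.
+ weave.init("my-agent-evals")  # placeholder project name
+
+ client = OpenAI()
+
+ @weave.op()  # also traces this wrapper as a single logged step
+ def solve_task(task_prompt: str) -> str:
+     response = client.chat.completions.create(
+         model="gpt-4o-mini",  # placeholder model
+         messages=[{"role": "user", "content": task_prompt}],
+     )
+     return response.choices[0].message.content
+ ```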
agent_submission_core.md ADDED
@@ -0,0 +1,29 @@
1
+ ### To submit **a new agent** to the CORE leaderboard, follow these steps:
2
+
3
+ 1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory from which it is invoked, on each run (a minimal sketch of writing this file appears at the end of this section). The content of this file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:
4
+
5
+ ```json
6
+ {
7
+ "cost": 0.59,
8
+ "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
9
+ }
10
+ ```
11
+
12
+ - **`cost`**: A float representing the total cost (USD) of API calls made by the agent. We recommend using [Weave](https://github.com/wandb/weave) for easy cost logging.
13
+ - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines inspired by [SWE-Bench](https://www.swebench.com/submit.html):
14
+ - Human-readable.
15
+ - Reflects the intermediate steps your system took that led to the final solution.
16
+ - Generated with the inference process, not post-hoc.
17
+
18
+ If you have any trouble implementing this, feel free to reach out to us for support.
19
+
20
+ 2. **Run your agent** on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to be the name of your agent. You can submit results for any of the three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, or CORE-Bench-Hard.
21
+
22
+ 3. **Submit the following two directories from the harness**:
23
+ - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
24
+ - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
25
+ - These files are automatically generated by the harness when you run your agent. You should not modify them manually.
26
+
27
+ Compress these directories into two `.tar.gz` or `.zip` files and email them to [zss@princeton.edu](mailto:zss@princeton.edu). If the files are too large to email, please upload them to Google Drive, Dropbox, etc., and email the link. **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**
28
+
29
+ 4. [Optional] We highly encourage you to submit the files of your agent (i.e. `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
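+
+ For reference, here is a minimal sketch (not part of the harness) of how an agent could write the `agent_trace.log` file described in step 1 at the end of a run; how the cost is computed and what the trace contains is up to your agent:
+
+ ```python
+ import json
+
+ def write_agent_trace(cost: float, agent_trace: str, path: str = "agent_trace.log") -> None:
+     """Write the total API cost (USD) and a human-readable trace in the expected JSON format."""
+     with open(path, "w") as f:
+         json.dump({"cost": cost, "agent_trace": agent_trace}, f, indent=2)
+
+ # e.g. write_agent_trace(0.59, "Installed dependencies, ran main.py, inspected the generated figures, ...")
+ ```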
app.py ADDED
@@ -0,0 +1,1458 @@
1
+ import gradio as gr
2
+ from gradio_leaderboard import Leaderboard, SelectColumns, ColumnFilter
3
+ import config
4
+ from envs import RESULTS_REPO_ID, REPO_ID, API, HF_TOKEN
5
+ from pathlib import Path
6
+ import pandas as pd
7
+ import os
8
+ import json
9
+ from utils.viz import create_scatter_plot, create_flow_chart, create_bar_chart, create_task_success_heatmap, create_leaderboard
10
+ from utils.processing import check_and_process_uploads
11
+ from huggingface_hub import snapshot_download
12
+ from apscheduler.schedulers.background import BackgroundScheduler
13
+ from datetime import datetime
15
+ import re
16
+ import markdown
17
+ import asyncio
18
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
19
+ import weave
20
+ from utils.db import TracePreprocessor
21
+ from gradio.themes.soft import Soft
22
+
23
+ preprocessor = TracePreprocessor()
24
+
26
+
27
+ abs_path = Path(__file__).parent
28
+
29
+ def restart_space():
30
+ API.restart_space(repo_id=REPO_ID, token=HF_TOKEN)
31
+
32
+ # New function to download results
33
+ def download_latest_results():
34
+ print("Downloading latest results...")
35
+ snapshot_download(RESULTS_REPO_ID,
36
+ local_dir= "evals_upload",
37
+ repo_type='dataset',
38
+ tqdm_class=None,
39
+ etag_timeout=30,
40
+ max_workers=4,
41
+ )
42
+ print("Download complete.")
43
+
44
+
45
+ def get_analyzed_traces(agent_name, benchmark_name):
46
+ return preprocessor.get_analyzed_traces(agent_name, benchmark_name)
47
+
48
+ def get_failure_report(agent_name, benchmark_name):
49
+ return preprocessor.get_failure_report(agent_name, benchmark_name)
50
+
51
+ def parse_json_files(folder_path, benchmark_name, aggregate=True):
52
+ return preprocessor.get_parsed_results(benchmark_name, aggregate=aggregate)
53
+
54
+ def update_agent_dropdown(benchmark_name, metric):
55
+ df = parse_json_files(os.path.join(abs_path, "evals_live"), benchmark_name)
56
+ agents = df['Agent Name'].tolist()
57
+ best_agent = get_best_agent(benchmark_name, metric)
58
+ return gr.Dropdown(choices=agents, value=best_agent, label="Select Agent")
59
+
60
+ def get_best_agent(benchmark_name, metric):
61
+ df = parse_json_files(os.path.join(abs_path, "evals_live"), benchmark_name)
62
+ return df.loc[df[metric].idxmax()]['Agent Name']
63
+
64
+ def update_task_analysis(benchmark_name, agent_name):
65
+ if not agent_name:
66
+ return "Please select an agent.", None, None, ""
67
+
68
+ analyzed_traces = get_analyzed_traces(agent_name, benchmark_name)
69
+ if not analyzed_traces:
70
+ return f"No analysis available for agent: {agent_name}", None, None, ""
71
+
72
+ task_ids = list(analyzed_traces.keys())
73
+
74
+ overview, flow_chart, _ = update_task_details(benchmark_name, agent_name, task_ids[0])
75
+
76
+ return overview, flow_chart, gr.Dropdown(choices=task_ids, value=task_ids[0], label="Select Task"), ""
77
+
78
+ def update_task_details(benchmark_name, agent_name, task_id):
79
+ if not task_id:
80
+ return "Please select a task.", None, ""
81
+
82
+ analyzed_traces = get_analyzed_traces(agent_name, benchmark_name)
83
+ if not analyzed_traces or task_id not in analyzed_traces:
84
+ return f"No analysis available for task: {task_id}", None, ""
85
+
86
+ analysis = analyzed_traces[task_id]
87
+
88
+ summary = analysis.get('task_analysis', {})
89
+
90
+ overview = f"### Summary\n\n{summary.get('overview', 'No overview available.')}\n\n"
91
+ # overview += f"### Successes\n{summary.get('key_successes', 'No successes listed.')}\n\n"
92
+ # overview += f"### Challenges\n{summary.get('main_challenges', 'No challenges listed.')}\n\n"
93
+ # overview += f"### Overall Assessment\n{summary.get('overall_assessment', 'No assessment available.')}\n\n"
94
+
95
+ if summary.get('overview', 'No overview available.') != "Not available":
96
+ flow_chart = create_flow_chart(analysis['steps'])
97
+ else:
98
+ flow_chart = None
99
+
100
+ return overview, flow_chart, ""
101
+
102
+
103
+ def format_call_info(step, step_index):
104
+ call_data = step['call_data']
105
+ analysis = step['analysis']
106
+
107
+ def format_json(obj):
108
+ # if isinstance(obj, dict) and 'choices' in obj:
109
+ # # Special handling for message content
110
+ # formatted_content = format_message_content(obj['choices'][0])
111
+ # return f'<div class="message-content">{formatted_content}</div>'
112
+ # else:
113
+ json_str = json.dumps(obj, indent=2)
114
+ json_str = json_str.replace(' ', '&nbsp;')
115
+ json_str = json_str.replace('\n', '<br>')
116
+ return f'<div class="json-wrapper">{json_str}</div>'
117
+
118
+ # Currently not used but we can enable it to format message content
119
+ def format_message_content(content):
120
+ # Convert Markdown to HTML
121
+ html_content = markdown.markdown(content)
122
+
123
+ # Replace ``` code blocks with styled pre blocks
124
+ html_content = re.sub(r'```python\n(.*?)```', lambda m: f'<pre class="code-block">{m.group(1)}</pre>', html_content, flags=re.DOTALL)
125
+
126
+ return html_content
127
+
128
+ formatted_info = f"""
129
+ <style>
130
+ .json-wrapper {{
131
+ white-space: pre-wrap;
132
+ word-wrap: break-word;
133
+ font-family: monospace;
134
+ max-height: 300px;
135
+ overflow-y: auto;
136
+ background-color: #f5f5f5;
137
+ padding: 10px;
138
+ border-radius: 5px;
139
+ }}
140
+ .message-content {{
141
+ white-space: normal;
142
+ word-wrap: break-word;
143
+ font-family: Arial, sans-serif;
144
+ max-height: 500px;
145
+ overflow-y: auto;
146
+ background-color: #ffffff;
147
+ padding: 10px;
148
+ border-radius: 5px;
149
+ border: 1px solid #e0e0e0;
150
+ }}
151
+ .code-block {{
152
+ background-color: #f0f0f0;
153
+ padding: 10px;
154
+ border-radius: 5px;
155
+ font-family: monospace;
156
+ white-space: pre-wrap;
157
+ word-wrap: break-word;
158
+ }}
159
+ </style>
160
+
161
+ <h3>Step {step_index + 1}: {analysis.get('headline', '')}</h3>
162
+
163
+ <h4>Call Metadata</h4>
164
+ <ul>
165
+ <li><strong>Weave Task ID:</strong> {call_data['weave_task_id']}</li>
166
+ <li><strong>Trace ID:</strong> {call_data['trace_id']}</li>
167
+ <li><strong>Project ID:</strong> {call_data['project_id']}</li>
168
+ <li><strong>Created Timestamp:</strong> {datetime.fromtimestamp(call_data['created_timestamp'])}</li>
169
+ <li><strong>Model:</strong> {call_data['inputs']['model']}</li>
170
+ </ul>
171
+
172
+ <h4>Inputs</h4>
173
+ {format_json(call_data['inputs'])}
174
+
175
+ <h4>Outputs</h4>
176
+ {format_json(call_data['outputs'])}
177
+
178
+ <h4>Usage</h4>
179
+ {format_json(call_data['summary'])}
180
+
181
+ <h4>Analysis</h4>
182
+ <ul>
183
+ <li><strong>Description:</strong> {analysis['description']}</li>
184
+ <li><strong>Assessment:</strong> {analysis['assessment']}</li>
185
+ <li><strong>Success:</strong> {analysis['success']}</li>
186
+ <li><strong>Action Type:</strong> {analysis['action_type']}</li>
187
+ </ul>
188
+ """
189
+ return formatted_info
190
+
191
+
192
+ def update_failure_report(agent_name, benchmark_name):
193
+ failure_report = get_failure_report(agent_name, benchmark_name)
194
+ if not failure_report:
195
+ return "No failure report available for this agent.", None
196
+
197
+ # Create overview of failure categories
198
+ categories_overview = "### Failure Categories:\n\n"
199
+ for category in failure_report['failure_categories']:
200
+ categories_overview += f"#### {category['category_name']}\n"
201
+ categories_overview += f"{category['description']}\n\n"
202
+
203
+ # Count tasks affected by each category
204
+ category_counts = {}
205
+ for task, classification in failure_report['task_classifications'].items():
206
+ category_id = classification['category_id']
207
+ category_counts[category_id] = category_counts.get(category_id, 0) + 1
208
+
209
+ # Prepare data for bar chart
210
+ categories = [cat['category_name'] for cat in failure_report['failure_categories']]
211
+ counts = [category_counts.get(str(i+1), 0) for i in range(len(categories))]
212
+
213
+ # Create bar chart
214
+ chart = create_bar_chart(categories, counts, "Failure Categories", "Number of Affected Tasks", "Failure Categories Distribution")
215
+
216
+ return categories_overview, chart
217
+
218
+ from gradio.themes.utils import colors, fonts, sizes
219
+ from typing import Iterable
220
+ class MyTheme(Soft):
221
+ def __init__(
222
+ self,
223
+ *,
224
+ primary_hue: colors.Color | str = colors.blue,
225
+ text_size: sizes.Size | str = sizes.text_lg,
226
+ font: fonts.Font
227
+ | str
228
+ | Iterable[fonts.Font | str] = (
229
+ fonts.GoogleFont("Lato"),
230
+ "ui-sans-serif",
231
+ "sans-serif",
232
+ ),
233
+ font_mono: fonts.Font
234
+ | str
235
+ | Iterable[fonts.Font | str] = (
236
+ fonts.GoogleFont("IBM Plex Mono"),
237
+ "ui-monospace",
238
+ "monospace",
239
+ ),
240
+ ):
241
+ super().__init__(
242
+ primary_hue=primary_hue,
243
+ text_size=text_size,
244
+ font=font,
245
+ font_mono=font_mono,
246
+ )
247
+
248
+ my_theme = MyTheme()
249
+
250
+ with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderboard") as demo:
251
+ # gr.Markdown((Path(__file__).parent / "header.md").read_text(), elem_classes=["text-large"])
252
+ gr.HTML("""
253
+ <style>
254
+ .hal-header {
255
+ color: #ecf0f1;
256
+ border-radius: 10px;
257
+ padding: 40px 20px;
258
+ text-align: center;
259
+ box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
260
+ }
261
+ .hal-title {
262
+ font-size: 2.5em;
263
+ font-weight: 700;
264
+ margin: 0;
265
+ letter-spacing: 2px;
266
+ text-transform: uppercase;
267
+ }
268
+ .hal-subtitle {
269
+ font-size: 1.2em;
270
+ font-weight: 300;
271
+ margin-top: 15px;
272
+ margin-left: auto;
273
+ margin-right: auto;
274
+ line-height: 1.6;
275
+ text-align: center;
276
+ }
277
+ .hal-highlight {
278
+ color: #3498db;
279
+ font-weight: 600;
280
+ }
281
+ </style>
282
+
283
+ <header class="hal-header">
284
+ <h1 class="hal-title">Holistic Agent Leaderboard (HAL)</h1>
285
+ <p class="hal-subtitle">
286
+ A standardized, cost-aware, and third-party leaderboard for evaluating agents.
287
+ </p>
288
+ </header>""")
289
+ gr.HTML("""
290
+ <style>
291
+ .feature-row {
292
+ display: flex;
293
+ justify-content: space-between;
294
+ margin-top: 20px;
295
+ margin-bottom: 20px;
296
+ }
297
+ .feature-column {
298
+ flex: 1;
299
+ padding: 25px;
300
+ background-color: #ffffff;
301
+ border-radius: 10px;
302
+ margin: 0 15px;
303
+ text-align: left;
304
+ box-shadow: 0 6px 12px rgba(0, 0, 0, 0.1);
305
+ display: flex;
306
+ flex-direction: column;
307
+ align-items: flex-start;
308
+ border-top: 5px solid #3498db;
309
+ transition: transform 0.3s ease, box-shadow 0.3s ease;
310
+ }
311
+ .feature-column:hover {
312
+ transform: translateY(-5px);
313
+ box-shadow: 0 5px 10px rgba(0, 0, 0, 0.15);
314
+ }
315
+ .feature-keyword {
316
+ font-size: 1.2em;
317
+ font-weight: bold;
318
+ color: #1b9e77;
319
+ margin-bottom: 10px;
320
+ text-transform: uppercase;
321
+ letter-spacing: 1px;
322
+ }
323
+ .feature-content {
324
+ flex-grow: 1;
325
+ }
326
+ .feature-description {
327
+ font-size: 0.95em;
328
+ line-height: 1.6;
329
+ color: #333;
330
+ }
331
+ </style>
332
+
333
+ <div class="feature-row">
334
+ <div class="feature-column">
335
+ <div class="feature-keyword">Standardized</div>
336
+ <div class="feature-content">
337
+ <p class="feature-description">Evaluations across agent benchmarks are all recorded to a single leaderboard that evaluates every listed agent in the same way.</p>
338
+ </div>
339
+ </div>
340
+ <div class="feature-column">
341
+ <div class="feature-keyword">Cost-controlled</div>
342
+ <div class="feature-content">
343
+ <p class="feature-description">For downstream users, understanding the cost of running agents is a significant need for adoption. For agent developers, cost-controlled evaluations help develop accurate baselines.</p>
344
+ </div>
345
+ </div>
346
+ <div class="feature-column">
347
+ <div class="feature-keyword">Third-party</div>
348
+ <div class="feature-content">
349
+ <p class="feature-description">Agent developers clearly have competing objectives in reporting accuracy: they want to achieve state-of-the-art performance.</p>
350
+ </div>
351
+ </div>
352
+ </div>
353
+ <style>
354
+ .section-heading {
355
+ font-size: 1.8em;
356
+ font-weight: bold;
357
+ color: #2c3e50;
358
+ margin-top: 40px;
359
+ margin-bottom: 20px;
360
+ text-align: left;
361
+ }
362
+ .user-types-container {
363
+ display: grid;
364
+ grid-template-columns: repeat(2, 1fr);
365
+ gap: 20px;
366
+ margin-top: 20px;
367
+ }
368
+ .user-type {
369
+ background-color: #ffffff;
370
+ border-radius: 10px;
371
+ padding: 25px;
372
+ box-shadow: 0 6px 12px rgba(0, 0, 0, 0.1);
373
+ transition: transform 0.3s ease, box-shadow 0.3s ease;
374
+ border-left: 5px solid #3498db;
375
+ }
376
+ .user-type:hover {
377
+ transform: translateY(-5px);
378
+ box-shadow: 0 5px 10px rgba(0, 0, 0, 0.15);
379
+ }
380
+ .user-type-title {
381
+ font-size: 1.2em;
382
+ font-weight: bold;
383
+ color: #3498db;
384
+ margin-bottom: 10px;
385
+ }
386
+ .user-type-description {
387
+ font-size: 0.95em;
388
+ line-height: 1.6;
389
+ color: #333;
390
+ }
391
+ .user-type-links a {
392
+ display: inline-block;
393
+ padding: 5px 12px;
394
+ margin-bottom: 5px;
395
+ background-color: #f0f4f8;
396
+ color: #2c3e50 !important; /* Force the color change */
397
+ text-decoration: none !important; /* Force remove underline */
398
+ border-radius: 15px;
399
+ font-size: 0.85em;
400
+ transition: all 0.3s ease;
401
+ border: 1px solid #e1e8ed;
402
+ }
403
+ .user-type-links a:hover {
404
+ background-color: #3498db;
405
+ color: white !important; /* Force the color change on hover */
406
+ transform: translateY(-2px);
407
+ box-shadow: 0 2px 5px rgba(52, 152, 219, 0.2);
408
+ text-decoration: none !important; /* Ensure no underline on hover */
409
+ }
410
+ .user-type-links a:visited {
411
+ color: #2c3e50 !important; /* Ensure visited links have the same color */
412
+ }
413
+ .user-type-links a::before {
414
+ content: "→";
415
+ margin-right: 5px;
416
+ font-size: 1.1em;
417
+ }
418
+ </style>
419
+
420
+ <h2 class="section-heading">Who is it for?</h2>
421
+ <p>We see HAL being useful for four types of users:</p>
422
+
423
+ <div class="user-types-container">
424
+ <div class="user-type">
425
+ <h3 class="user-type-title">Downstream Users & Procurers</h3>
426
+ <p class="user-type-description">Customers looking to deploy agents can get visibility into existing benchmarks, know developers building useful agents, and identify the state of the art for both cost and accuracy for their tasks of interest.</p>
427
+ <div class="user-type-links">
428
+ <a href="#leaderboards">Leaderboards</a>
429
+ </div>
430
+ </div>
431
+ <div class="user-type">
432
+ <h3 class="user-type-title">Agent Benchmark Developers</h3>
433
+ <p class="user-type-description">Reporting results on a centralized leaderboard could allow improved visibility into agent benchmarks that measure real-world utility.</p>
434
+ <div class="user-type-links">
435
+ <a href="#benchmark-submission">Add a Benchmark</a>
436
+ </div>
437
+ </div>
438
+ <div class="user-type">
439
+ <h3 class="user-type-title">Agent Developers</h3>
440
+ <p class="user-type-description">HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.</p>
441
+ <div class="user-type-links">
442
+ <a href="#agent-submission">Submit an Agent</a>
443
+ <a href="#leaderboards">Leaderboards</a>
444
+ <a href="#reproduction-guide">Reproduction Guide</a>
445
+ </div>
446
+ </div>
447
+ <div class="user-type">
448
+ <h3 class="user-type-title">Safety Researchers</h3>
449
+ <p class="user-type-description">Understanding agent capabilities on real-world safety threats and their associated costs is crucial. For example, Cybench evaluations could provide insights into agent performance and affordability for potential adversaries.</p>
450
+ <div class="user-type-links">
451
+ <a href="#cybench-results">Cybench Leaderboard (coming soon)</a>
452
+ <a href="#agent-monitor">Agent Monitor</a>
453
+ </div>
454
+ </div>
455
+ </div>
456
+ </br>
457
+ <h2 class="section-heading" id="leaderboards">Leaderboards</h2>
458
+ <p>Select a benchmark to see the agent leaderboard. Verified results have been run by the HAL team:</p>
459
+ """)
460
+
461
+ with gr.Tabs() as tabs:
462
+ with gr.Tab("CORE-Bench"):
463
+ gr.HTML("""
464
+ <p>
465
+ CORE-Bench evaluates the ability of agents to computationally reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. The benchmark has tasks at three difficulty levels:
466
+ </p>
467
+ """)
468
+ with gr.Tab("CORE-Bench-Hard"):
469
+ gr.HTML("""
470
+ <p>
471
+ <i><b>CORE-Bench-Hard:</b></i> The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level.
472
+ </p>
473
+ """)
474
+ with gr.Row():
475
+ with gr.Column(scale=2):
476
+ Leaderboard(
477
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_hard'), ci_metrics=["Accuracy", "Total Cost"]),
478
+ select_columns=SelectColumns(
479
+ default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
480
+ cant_deselect=["Agent Name"],
481
+ label="Select Columns to Display:",
482
+ ),
483
+ hide_columns=config.COREBENCH_HIDE_COLUMNS,
484
+ search_columns=config.COREBENCH_SEARCH_COLUMNS,
485
+ )
486
+ # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
487
+ with gr.Row():
488
+ gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Hard")
489
+ with gr.Row():
490
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_hard', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
491
+
492
+ gr.HTML('<div style="height: 30px;"></div>')
493
+ gr.Markdown("## Task success heatmap")
494
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
495
+ with gr.Row():
496
+ task_success_heatmap = gr.Plot()
497
+ demo.load(
498
+ lambda: create_task_success_heatmap(
499
+ preprocessor.get_task_success_data('corebench_hard'),
500
+ 'CORE-Bench-Hard'
501
+ ),
502
+ outputs=[task_success_heatmap]
503
+ )
504
+ with gr.Tab("CORE-Bench-Medium"):
505
+ gr.HTML("""
506
+ <p>
507
+ <i><b>CORE-Bench-Medium:</b></i> The agent is given a Dockerfile and instructions on how to use it to fully reproduce the paper. This level mainly evaluates the agent's ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the level above.
508
+ </p>
509
+ """)
510
+ with gr.Row():
511
+ with gr.Column(scale=2):
512
+ Leaderboard(
513
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_medium'), ci_metrics=["Accuracy", "Total Cost"]),
514
+ select_columns=SelectColumns(
515
+ default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
516
+ cant_deselect=["Agent Name"],
517
+ label="Select Columns to Display:",
518
+ ),
519
+ hide_columns=config.COREBENCH_HIDE_COLUMNS,
520
+ search_columns=config.COREBENCH_SEARCH_COLUMNS,
521
+ )
522
+ # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
523
+ with gr.Row():
524
+ gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Medium")
525
+ with gr.Row():
526
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_medium', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
527
+
528
+ gr.HTML('<div style="height: 30px;"></div>')
529
+ gr.Markdown("## Task success heatmap")
530
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
531
+ with gr.Row():
532
+ task_success_heatmap = gr.Plot()
533
+ demo.load(
534
+ lambda: create_task_success_heatmap(
535
+ preprocessor.get_task_success_data('corebench_medium'),
536
+ 'CORE-Bench-Medium'
537
+ ),
538
+ outputs=[task_success_heatmap]
539
+ )
540
+ with gr.Tab("CORE-Bench-Easy"):
541
+ gr.HTML("""
542
+ <p>
543
+ <i><b>CORE-Bench-Easy:</b></i> The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
544
+ </p>
545
+ """)
546
+ with gr.Row():
547
+ with gr.Column(scale=2):
548
+ Leaderboard(
549
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_easy'), ci_metrics=["Accuracy", "Total Cost"]),
550
+ select_columns=SelectColumns(
551
+ default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
552
+ cant_deselect=["Agent Name"],
553
+ label="Select Columns to Display:",
554
+ ),
555
+ hide_columns=config.COREBENCH_HIDE_COLUMNS,
556
+ search_columns=config.COREBENCH_SEARCH_COLUMNS,
557
+ )
558
+ # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
559
+ with gr.Row():
560
+ gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Easy")
561
+ with gr.Row():
562
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_easy', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
563
+
564
+ gr.HTML('<div style="height: 30px;"></div>')
565
+ gr.Markdown("## Task success heatmap")
566
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
567
+ with gr.Row():
568
+ task_success_heatmap = gr.Plot()
569
+ demo.load(
570
+ lambda: create_task_success_heatmap(
571
+ preprocessor.get_task_success_data('corebench_easy'),
572
+ 'CORE-Bench-Easy'
573
+ ),
574
+ outputs=[task_success_heatmap]
575
+ )
576
+
577
+ gr.Markdown((Path(__file__).parent / "agent_submission_core.md").read_text())
578
+ with gr.Tab("USACO"):
579
+ gr.Markdown("""The USA Computing Olympiad (USACO) is a computer programming competition for pre-college students. This benchmark evaluates the performance of AI agents on a set of 307 USACO tasks. The agents are evaluated based on the number of tasks correctly solved.""")
580
+ with gr.Row():
581
+ with gr.Column(scale=2):
582
+ Leaderboard(
583
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco'), ci_metrics=["Accuracy", "Total Cost"]),
584
+ select_columns=SelectColumns(
585
+ default_selection=config.USACO_ON_LOAD_COLUMNS + ["Verified"],
586
+ cant_deselect=["Agent Name"],
587
+ label="Select Columns to Display:",
588
+ ),
589
+ hide_columns=config.USACO_HIDE_COLUMNS,
590
+ search_columns=config.USACO_SEARCH_COLUMNS,
591
+ )
592
+ gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
593
+ with gr.Row():
594
+ gr.Markdown("### Accuracy vs. Cost for USACO agents")
595
+ with gr.Row():
596
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
597
+
598
+ gr.HTML('<div style="height: 30px;"></div>')
599
+ gr.Markdown("## Task success heatmap")
600
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
601
+ with gr.Row():
602
+ task_success_heatmap = gr.Plot()
603
+ demo.load(
604
+ lambda: create_task_success_heatmap(
605
+ preprocessor.get_task_success_data('usaco'),
606
+ 'USACO'
607
+ ),
608
+ outputs=[task_success_heatmap]
609
+ )
610
+
611
+ gr.HTML("""
612
+ <style>
613
+ .grouped-section {
614
+ border: 2px solid #dee2e6; /* Color matching unactivated tabs */
615
+ border-radius: 10px;
616
+ padding: 30px;
617
+ margin-top: 40px;
618
+ margin-bottom: 40px;
619
+ position: relative;
620
+ }
621
+
622
+ .grouped-section-title {
623
+ font-size: 1.7em;
624
+ font-weight: bold;
625
+ color: #2c3e50;
626
+ margin-bottom: 20px;
627
+ padding-bottom: 10px;
628
+ border-bottom: 2px solid #dee2e6;
629
+ }
630
+ </style>
631
+ """)
632
+ with gr.Group(elem_classes=["grouped-section"]):
633
+ gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
634
+ gr.Markdown('The agent monitor provides an overview of the recurring errors an agent makes as well as a summary of the steps the agent takes to solve a task. It currently consists of two main components:')
635
+
636
+ gr.HTML('<div style="height: 10px;"></div>')
637
+ gr.Markdown("## Failure report for each agent")
638
+ gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
639
+ gr.HTML('<div style="height: 10px;"></div>')
640
+ with gr.Row():
641
+ with gr.Column(scale=1):
642
+ failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
643
+ gr.HTML('<div style="height: 10px;"></div>')
644
+ with gr.Row():
645
+ with gr.Column(scale=1):
646
+ failure_categories_overview = gr.Markdown()
647
+
648
+ with gr.Column(scale=1):
649
+ failure_categories_chart = gr.Plot()
650
+
651
+ # Initialize the failure report agent dropdown with all agents
652
+ demo.load(update_agent_dropdown,
653
+ inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)],
654
+ outputs=[failure_report_agent_dropdown])
655
+
656
+ # Update failure report when agent is selected
657
+ failure_report_agent_dropdown.change(update_failure_report,
658
+ inputs=[failure_report_agent_dropdown, gr.Textbox(value="usaco", visible=False)],
659
+ outputs=[failure_categories_overview, failure_categories_chart])
660
+
661
+ gr.HTML('<div style="height: 30px;"></div>')
662
+ gr.Markdown("## Task overview")
663
+ gr.HTML('<div style="height: 10px;"></div>')
664
+ with gr.Row():
665
+ with gr.Column(scale=1):
666
+ agent_dropdown = gr.Dropdown(label="Select Agent")
667
+ with gr.Column(scale=1):
668
+ task_dropdown = gr.Dropdown(label="Select USACO Task")
669
+ gr.HTML('<div style="height: 10px;"></div>')
670
+ with gr.Row():
671
+ task_overview = gr.Markdown()
672
+ with gr.Row():
673
+ flow_chart = gr.Plot(label="Task Flow")
674
+
675
+ # Initialize the agent dropdown with the best agent
676
+ demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
677
+ demo.load(update_task_analysis, inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
678
+
679
+ agent_dropdown.change(update_task_analysis,
680
+ inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown],
681
+ outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
682
+ task_dropdown.change(update_task_details,
683
+ inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown, task_dropdown],
684
+ outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
685
+
686
+ gr.Markdown("## Raw predictions")
687
+ gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
688
+ with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
689
+ with gr.Row():
690
+ with gr.Column(scale=1):
691
+ raw_agent_dropdown = gr.Dropdown(label="Select Agent")
692
+ with gr.Column(scale=1):
693
+ raw_task_dropdown = gr.Dropdown(label="Select Task")
694
+ with gr.Column(scale=1):
695
+ raw_step_dropdown = gr.Dropdown(label="Select Step")
696
+ with gr.Row():
697
+ raw_call_details = gr.HTML()
698
+
699
+ def update_raw_task_dropdown(agent_name):
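+ # Populate the task and step dropdowns for the selected agent and render the first step's call details.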
700
+ analyzed_traces = get_analyzed_traces(agent_name, "usaco")
701
+ if not analyzed_traces:
702
+ return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
703
+ task_ids = list(analyzed_traces.keys())
704
+ steps = analyzed_traces[task_ids[0]]['steps']
705
+ return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
706
+
707
+ def update_raw_step_dropdown(agent_name, task_id):
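+ # Refresh the step dropdown for the selected task and show its first step's call details.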
708
+ analyzed_traces = get_analyzed_traces(agent_name, "usaco")
709
+ if not analyzed_traces or task_id not in analyzed_traces:
710
+ return gr.Dropdown(choices=[], label="Select Step"), "No data available."
711
+ steps = analyzed_traces[task_id]['steps']
712
+ return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
713
+
714
+ def update_raw_call_details(agent_name, task_id, step_index):
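+ # Render the formatted call details for the selected step of the selected task.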
715
+ analyzed_traces = get_analyzed_traces(agent_name, "usaco")
716
+ if not analyzed_traces or task_id not in analyzed_traces:
717
+ return "No data available for this selection."
718
+ steps = analyzed_traces[task_id]['steps']
719
+ if step_index is None:
720
+ return "Invalid step selection."
721
+ step = steps[step_index]
722
+ return format_call_info(step, step_index)
723
+
724
+ # Initialize the raw agent dropdown with all agents
725
+ demo.load(update_agent_dropdown,
726
+ inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)],
727
+ outputs=[raw_agent_dropdown])
728
+ demo.load(update_raw_task_dropdown,
729
+ inputs=[raw_agent_dropdown],
730
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
731
+ demo.load(update_raw_call_details,
732
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
733
+ outputs=[raw_call_details])
734
+
735
+ raw_agent_dropdown.change(update_raw_task_dropdown,
736
+ inputs=[raw_agent_dropdown],
737
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
738
+ raw_task_dropdown.change(update_raw_step_dropdown,
739
+ inputs=[raw_agent_dropdown, raw_task_dropdown],
740
+ outputs=[raw_step_dropdown, raw_call_details])
741
+ raw_step_dropdown.change(update_raw_call_details,
742
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
743
+ outputs=[raw_call_details])
744
+
745
+
746
+ with gr.Tab("SWE-bench Verified (Mini)"):
747
+ gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. The We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
748
+ with gr.Row():
749
+ with gr.Column(scale=2):
750
+ Leaderboard(
751
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified_mini'), ci_metrics=["Accuracy", "Total Cost"]),
752
+ select_columns=SelectColumns(
753
+ default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
754
+ cant_deselect=["Agent Name"],
755
+ label="Select Columns to Display:",
756
+ ),
757
+ hide_columns=config.SWEBENCH_HIDE_COLUMNS,
758
+ search_columns=config.SWEBENCH_SEARCH_COLUMNS,
759
+ )
760
+ gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
761
+ with gr.Row():
762
+ gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
763
+ with gr.Row():
764
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified_mini', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
765
+
766
+ gr.HTML('<div style="height: 30px;"></div>')
767
+ gr.Markdown("## Task success heatmap")
768
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
769
+ with gr.Row():
770
+ task_success_heatmap = gr.Plot()
771
+ demo.load(
772
+ lambda: create_task_success_heatmap(
773
+ preprocessor.get_task_success_data('swebench_verified_mini'),
774
+ 'SWE-bench Verified (Mini)'
775
+ ),
776
+ outputs=[task_success_heatmap]
777
+ )
778
+
779
+ # gr.HTML("""
780
+ # <style>
781
+ # .grouped-section {
782
+ # border: 2px solid #dee2e6; /* Color matching unactivated tabs */
783
+ # border-radius: 10px;
784
+ # padding: 30px;
785
+ # margin-top: 40px;
786
+ # margin-bottom: 40px;
787
+ # position: relative;
788
+ # }
789
+
790
+ # .grouped-section-title {
791
+ # font-size: 1.7em;
792
+ # font-weight: bold;
793
+ # color: #2c3e50;
794
+ # margin-bottom: 20px;
795
+ # padding-bottom: 10px;
796
+ # border-bottom: 2px solid #dee2e6;
797
+ # }
798
+ # </style>
799
+ # """)
800
+ # with gr.Group(elem_classes=["grouped-section"]):
801
+ # gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
802
+
803
+ # gr.HTML('<div style="height: 10px;"></div>')
804
+ # gr.Markdown("## Failure report for each agent")
805
+ # gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
806
+ # gr.HTML('<div style="height: 10px;"></div>')
807
+ # with gr.Row():
808
+ # with gr.Column(scale=1):
809
+ # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
810
+ # gr.HTML('<div style="height: 10px;"></div>')
811
+ # with gr.Row():
812
+ # with gr.Column(scale=1):
813
+ # failure_categories_overview = gr.Markdown()
814
+
815
+ # with gr.Column(scale=1):
816
+ # failure_categories_chart = gr.Plot()
817
+
818
+ # # Initialize the failure report agent dropdown with all agents
819
+ # demo.load(update_agent_dropdown,
820
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
821
+ # outputs=[failure_report_agent_dropdown])
822
+
823
+ # # Update failure report when agent is selected
824
+ # failure_report_agent_dropdown.change(update_failure_report,
825
+ # inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
826
+ # outputs=[failure_categories_overview, failure_categories_chart])
827
+
828
+ # gr.HTML('<div style="height: 30px;"></div>')
829
+ # gr.Markdown("## Task overview")
830
+ # gr.HTML('<div style="height: 10px;"></div>')
831
+ # with gr.Row():
832
+ # with gr.Column(scale=1):
833
+ # agent_dropdown = gr.Dropdown(label="Select Agent")
834
+ # with gr.Column(scale=1):
835
+ # task_dropdown = gr.Dropdown(label="Select SWE-bench Verified Task")
836
+ # gr.HTML('<div style="height: 10px;"></div>')
837
+ # with gr.Row():
838
+ # task_overview = gr.Markdown()
839
+ # with gr.Row():
840
+ # flow_chart = gr.Plot(label="Task Flow")
841
+
842
+ # # Initialize the agent dropdown with the best agent
843
+ # demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
844
+ # demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
845
+
846
+ # agent_dropdown.change(update_task_analysis,
847
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
848
+ # outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
849
+ # task_dropdown.change(update_task_details,
850
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
851
+ # outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
852
+
853
+ gr.Markdown("## Raw predictions")
854
+ gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
855
+ with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
856
+ with gr.Row():
857
+ with gr.Column(scale=1):
858
+ raw_agent_dropdown = gr.Dropdown(label="Select Agent")
859
+ with gr.Column(scale=1):
860
+ raw_task_dropdown = gr.Dropdown(label="Select Task")
861
+ with gr.Column(scale=1):
862
+ raw_step_dropdown = gr.Dropdown(label="Select Step")
863
+ with gr.Row():
864
+ raw_call_details = gr.HTML()
865
+
866
+ def update_raw_task_dropdown(agent_name):
867
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
868
+ if not analyzed_traces:
869
+ return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
870
+ task_ids = list(analyzed_traces.keys())
871
+ steps = analyzed_traces[task_ids[0]]['steps']
872
+ return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
873
+
874
+ def update_raw_step_dropdown(agent_name, task_id):
875
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
876
+ if not analyzed_traces or task_id not in analyzed_traces:
877
+ return gr.Dropdown(choices=[], label="Select Step"), "No data available."
878
+ steps = analyzed_traces[task_id]['steps']
879
+ return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
880
+
881
+ def update_raw_call_details(agent_name, task_id, step_index):
882
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
883
+ if not analyzed_traces or task_id not in analyzed_traces:
884
+ return "No data available for this selection."
885
+ steps = analyzed_traces[task_id]['steps']
886
+ if step_index is None:
887
+ return "Invalid step selection."
888
+ step = steps[step_index]
889
+ return format_call_info(step, step_index)
890
+
891
+ # Initialize the raw agent dropdown with all agents
892
+ demo.load(update_agent_dropdown,
893
+ inputs=[gr.Textbox(value="swebench_verified_mini", visible=False), gr.Textbox(value="Accuracy", visible=False)],
894
+ outputs=[raw_agent_dropdown])
895
+ demo.load(update_raw_task_dropdown,
896
+ inputs=[raw_agent_dropdown],
897
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
898
+ demo.load(update_raw_call_details,
899
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
900
+ outputs=[raw_call_details])
901
+
902
+ raw_agent_dropdown.change(update_raw_task_dropdown,
903
+ inputs=[raw_agent_dropdown],
904
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
905
+ raw_task_dropdown.change(update_raw_step_dropdown,
906
+ inputs=[raw_agent_dropdown, raw_task_dropdown],
907
+ outputs=[raw_step_dropdown, raw_call_details])
908
+ raw_step_dropdown.change(update_raw_call_details,
909
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
910
+ outputs=[raw_call_details])
911
+
912
+
913
+ with gr.Tab("SWE-bench Verified"):
914
+ gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. The We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
915
+ with gr.Row():
916
+ with gr.Column(scale=2):
917
+ Leaderboard(
918
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified'), ci_metrics=["Accuracy", "Total Cost"]),
919
+ select_columns=SelectColumns(
920
+ default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
921
+ cant_deselect=["Agent Name"],
922
+ label="Select Columns to Display:",
923
+ ),
924
+ hide_columns=config.SWEBENCH_HIDE_COLUMNS,
925
+ search_columns=config.SWEBENCH_SEARCH_COLUMNS,
926
+ )
927
+ gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
928
+ with gr.Row():
929
+ gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
930
+ with gr.Row():
931
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
932
+
933
+ gr.HTML('<div style="height: 30px;"></div>')
934
+ gr.Markdown("## Task success heatmap")
935
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
936
+ with gr.Row():
937
+ task_success_heatmap = gr.Plot()
938
+ demo.load(
939
+ lambda: create_task_success_heatmap(
940
+ preprocessor.get_task_success_data('swebench_verified'),
941
+ 'SWE-bench Verified'
942
+ ),
943
+ outputs=[task_success_heatmap]
944
+ )
945
+
946
+ # gr.HTML("""
947
+ # <style>
948
+ # .grouped-section {
949
+ # border: 2px solid #dee2e6; /* Color matching unactivated tabs */
950
+ # border-radius: 10px;
951
+ # padding: 30px;
952
+ # margin-top: 40px;
953
+ # margin-bottom: 40px;
954
+ # position: relative;
955
+ # }
956
+
957
+ # .grouped-section-title {
958
+ # font-size: 1.7em;
959
+ # font-weight: bold;
960
+ # color: #2c3e50;
961
+ # margin-bottom: 20px;
962
+ # padding-bottom: 10px;
963
+ # border-bottom: 2px solid #dee2e6;
964
+ # }
965
+ # </style>
966
+ # """)
967
+ # with gr.Group(elem_classes=["grouped-section"]):
968
+ # gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
969
+
970
+ # gr.HTML('<div style="height: 10px;"></div>')
971
+ # gr.Markdown("## Failure report for each agent")
972
+ # gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
973
+ # gr.HTML('<div style="height: 10px;"></div>')
974
+ # with gr.Row():
975
+ # with gr.Column(scale=1):
976
+ # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
977
+ # gr.HTML('<div style="height: 10px;"></div>')
978
+ # with gr.Row():
979
+ # with gr.Column(scale=1):
980
+ # failure_categories_overview = gr.Markdown()
981
+
982
+ # with gr.Column(scale=1):
983
+ # failure_categories_chart = gr.Plot()
984
+
985
+ # # Initialize the failure report agent dropdown with all agents
986
+ # demo.load(update_agent_dropdown,
987
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
988
+ # outputs=[failure_report_agent_dropdown])
989
+
990
+ # # Update failure report when agent is selected
991
+ # failure_report_agent_dropdown.change(update_failure_report,
992
+ # inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
993
+ # outputs=[failure_categories_overview, failure_categories_chart])
994
+
995
+ # gr.HTML('<div style="height: 30px;"></div>')
996
+ # gr.Markdown("## Task overview")
997
+ # gr.HTML('<div style="height: 10px;"></div>')
998
+ # with gr.Row():
999
+ # with gr.Column(scale=1):
1000
+ # agent_dropdown = gr.Dropdown(label="Select Agent")
1001
+ # with gr.Column(scale=1):
1002
+ # task_dropdown = gr.Dropdown(label="Select SWE-bench Verified Task")
1003
+ # gr.HTML('<div style="height: 10px;"></div>')
1004
+ # with gr.Row():
1005
+ # task_overview = gr.Markdown()
1006
+ # with gr.Row():
1007
+ # flow_chart = gr.Plot(label="Task Flow")
1008
+
1009
+ # # Initialize the agent dropdown with the best agent
1010
+ # demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
1011
+ # demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1012
+
1013
+ # agent_dropdown.change(update_task_analysis,
1014
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
1015
+ # outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1016
+ # task_dropdown.change(update_task_details,
1017
+ # inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
1018
+ # outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1019
+
1020
+ gr.Markdown("## Raw predictions")
1021
+ gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
1022
+ with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
1023
+ with gr.Row():
1024
+ with gr.Column(scale=1):
1025
+ raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1026
+ with gr.Column(scale=1):
1027
+ raw_task_dropdown = gr.Dropdown(label="Select Task")
1028
+ with gr.Column(scale=1):
1029
+ raw_step_dropdown = gr.Dropdown(label="Select Step")
1030
+ with gr.Row():
1031
+ raw_call_details = gr.HTML()
1032
+
1033
+ def update_raw_task_dropdown(agent_name):
1034
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1035
+ if not analyzed_traces:
1036
+ return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1037
+ task_ids = list(analyzed_traces.keys())
1038
+ steps = analyzed_traces[task_ids[0]]['steps']
1039
+ return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1040
+
1041
+ def update_raw_step_dropdown(agent_name, task_id):
1042
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1043
+ if not analyzed_traces or task_id not in analyzed_traces:
1044
+ return gr.Dropdown(choices=[], label="Select Step"), "No data available."
1045
+ steps = analyzed_traces[task_id]['steps']
1046
+ return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1047
+
1048
+ def update_raw_call_details(agent_name, task_id, step_index):
1049
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
1050
+ if not analyzed_traces or task_id not in analyzed_traces:
1051
+ return "No data available for this selection."
1052
+ steps = analyzed_traces[task_id]['steps']
1053
+ if step_index is None:
1054
+ return "Invalid step selection."
1055
+ step = steps[step_index]
1056
+ return format_call_info(step, step_index)
1057
+
1058
+ # Initialize the raw agent dropdown with all agents
1059
+ demo.load(update_agent_dropdown,
1060
+ inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1061
+ outputs=[raw_agent_dropdown])
1062
+ demo.load(update_raw_task_dropdown,
1063
+ inputs=[raw_agent_dropdown],
1064
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1065
+ demo.load(update_raw_call_details,
1066
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1067
+ outputs=[raw_call_details])
1068
+
1069
+ raw_agent_dropdown.change(update_raw_task_dropdown,
1070
+ inputs=[raw_agent_dropdown],
1071
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1072
+ raw_task_dropdown.change(update_raw_step_dropdown,
1073
+ inputs=[raw_agent_dropdown, raw_task_dropdown],
1074
+ outputs=[raw_step_dropdown, raw_call_details])
1075
+ raw_step_dropdown.change(update_raw_call_details,
1076
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1077
+ outputs=[raw_call_details])
1078
+
1079
+
1080
+
1081
+ with gr.Tab("SWE-bench Lite"):
1082
+ gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Lite is a subset of 300 tasks of the original SWE-bench. We are currently actively developing this platform and this benchmark is not fully implemented yet.""")
1083
+ with gr.Row():
1084
+ with gr.Column(scale=2):
1085
+ Leaderboard(
1086
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite'), ci_metrics=["Accuracy", "Total Cost"]),
1087
+ select_columns=SelectColumns(
1088
+ default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
1089
+ cant_deselect=["Agent Name"],
1090
+ label="Select Columns to Display:",
1091
+ ),
1092
+ hide_columns=config.SWEBENCH_HIDE_COLUMNS,
1093
+ search_columns=config.SWEBENCH_SEARCH_COLUMNS,
1094
+ )
1095
+ gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
1096
+ with gr.Row():
1097
+ gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
1098
+ with gr.Row():
1099
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
1100
+
1101
+ gr.HTML('<div style="height: 30px;"></div>')
1102
+ gr.Markdown("## Task success heatmap")
1103
+ gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
1104
+ with gr.Row():
1105
+ task_success_heatmap = gr.Plot()
1106
+ demo.load(
1107
+ lambda: create_task_success_heatmap(
1108
+ preprocessor.get_task_success_data('swebench_lite'),
1109
+ 'SWE-bench Lite'
1110
+ ),
1111
+ outputs=[task_success_heatmap]
1112
+ )
1113
+
1114
+ gr.HTML("""
1115
+ <style>
1116
+ .grouped-section {
1117
+ border: 2px solid #dee2e6; /* Color matching unactivated tabs */
1118
+ border-radius: 10px;
1119
+ padding: 30px;
1120
+ margin-top: 40px;
1121
+ margin-bottom: 40px;
1122
+ position: relative;
1123
+ }
1124
+
1125
+ .grouped-section-title {
1126
+ font-size: 1.7em;
1127
+ font-weight: bold;
1128
+ color: #2c3e50;
1129
+ margin-bottom: 20px;
1130
+ padding-bottom: 10px;
1131
+ border-bottom: 2px solid #dee2e6;
1132
+ }
1133
+ </style>
1134
+ """)
1135
+ with gr.Group(elem_classes=["grouped-section"]):
1136
+ gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
1137
+
1138
+ gr.HTML('<div style="height: 10px;"></div>')
1139
+ gr.Markdown("## Failure report for each agent")
1140
+ gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
1141
+ gr.HTML('<div style="height: 10px;"></div>')
1142
+ with gr.Row():
1143
+ with gr.Column(scale=1):
1144
+ failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
1145
+ gr.HTML('<div style="height: 10px;"></div>')
1146
+ with gr.Row():
1147
+ with gr.Column(scale=1):
1148
+ failure_categories_overview = gr.Markdown()
1149
+
1150
+ with gr.Column(scale=1):
1151
+ failure_categories_chart = gr.Plot()
1152
+
1153
+ # Initialize the failure report agent dropdown with all agents
1154
+ demo.load(update_agent_dropdown,
1155
+ inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1156
+ outputs=[failure_report_agent_dropdown])
1157
+
1158
+ # Update failure report when agent is selected
1159
+ failure_report_agent_dropdown.change(update_failure_report,
1160
+ inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_lite", visible=False)],
1161
+ outputs=[failure_categories_overview, failure_categories_chart])
1162
+
1163
+ gr.HTML('<div style="height: 30px;"></div>')
1164
+ gr.Markdown("## Task overview")
1165
+ gr.HTML('<div style="height: 10px;"></div>')
1166
+ with gr.Row():
1167
+ with gr.Column(scale=1):
1168
+ agent_dropdown = gr.Dropdown(label="Select Agent")
1169
+ with gr.Column(scale=1):
1170
+ task_dropdown = gr.Dropdown(label="Select SWE-bench Lite Task")
1171
+ gr.HTML('<div style="height: 10px;"></div>')
1172
+ with gr.Row():
1173
+ task_overview = gr.Markdown()
1174
+ with gr.Row():
1175
+ flow_chart = gr.Plot(label="Task Flow")
1176
+
1177
+ # Initialize the agent dropdown with the best agent
1178
+ demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
1179
+ demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1180
+
1181
+ agent_dropdown.change(update_task_analysis,
1182
+ inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown],
1183
+ outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1184
+ task_dropdown.change(update_task_details,
1185
+ inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown, task_dropdown],
1186
+ outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1187
+
1188
+ gr.Markdown("## Raw predictions")
1189
+ gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
1190
+ with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
1191
+ with gr.Row():
1192
+ with gr.Column(scale=1):
1193
+ raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1194
+ with gr.Column(scale=1):
1195
+ raw_task_dropdown = gr.Dropdown(label="Select Task")
1196
+ with gr.Column(scale=1):
1197
+ raw_step_dropdown = gr.Dropdown(label="Select Step")
1198
+ with gr.Row():
1199
+ raw_call_details = gr.HTML()
1200
+
1201
+ def update_raw_task_dropdown(agent_name):
1202
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1203
+ if not analyzed_traces:
1204
+ return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1205
+ task_ids = list(analyzed_traces.keys())
1206
+ steps = analyzed_traces[task_ids[0]]['steps']
1207
+ return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1208
+
1209
+ def update_raw_step_dropdown(agent_name, task_id):
1210
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1211
+ if not analyzed_traces or task_id not in analyzed_traces:
1212
+ return gr.Dropdown(choices=[], label="Select Step"), "No data available."
1213
+ steps = analyzed_traces[task_id]['steps']
1214
+ return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1215
+
1216
+ def update_raw_call_details(agent_name, task_id, step_index):
1217
+ analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
1218
+ if not analyzed_traces or task_id not in analyzed_traces:
1219
+ return "No data available for this selection."
1220
+ steps = analyzed_traces[task_id]['steps']
1221
+ if step_index is None:
1222
+ return "Invalid step selection."
1223
+ step = steps[step_index]
1224
+ return format_call_info(step, step_index)
1225
+
1226
+ # Initialize the raw agent dropdown with all agents
1227
+ demo.load(update_agent_dropdown,
1228
+ inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
1229
+ outputs=[raw_agent_dropdown])
1230
+ demo.load(update_raw_task_dropdown,
1231
+ inputs=[raw_agent_dropdown],
1232
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1233
+ demo.load(update_raw_call_details,
1234
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1235
+ outputs=[raw_call_details])
1236
+
1237
+ raw_agent_dropdown.change(update_raw_task_dropdown,
1238
+ inputs=[raw_agent_dropdown],
1239
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1240
+ raw_task_dropdown.change(update_raw_step_dropdown,
1241
+ inputs=[raw_agent_dropdown, raw_task_dropdown],
1242
+ outputs=[raw_step_dropdown, raw_call_details])
1243
+ raw_step_dropdown.change(update_raw_call_details,
1244
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1245
+ outputs=[raw_call_details])
1246
+
1247
+
1248
+ with gr.Tab("MLAgentBench"):
1249
+ gr.Markdown("""MLAgentBench is a suite of end-to-end Machine Learning (ML) experimentation tasks, where the agent aims to take a given dataset and a machine learning task description and autonomously develop or improve an ML model. We are currently actively developing this platform and this benchmark is not fully implemented yet. In particular, we only include one agent and a subset of tasks for this benchmark.""")
1250
+ with gr.Row():
1251
+ with gr.Column(scale=2):
1252
+ Leaderboard(
1253
+ value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench')),
1254
+ select_columns=SelectColumns(
1255
+ default_selection=config.MLAGENTBENCH_ON_LOAD_COLUMNS + ["Verified"],
1256
+ cant_deselect=["Agent Name"],
1257
+ label="Select Columns to Display:",
1258
+ ),
1259
+ hide_columns=config.MLAGENTBENCH_HIDE_COLUMNS,
1260
+ search_columns=config.MLAGENTBENCH_SEARCH_COLUMNS,
1261
+ )
1262
+ gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
1263
+ with gr.Row():
1264
+ gr.Markdown("### Accuracy vs. Cost for MLAgentBench agents")
1265
+ with gr.Row():
1266
+ scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench', aggregate=False), "Total Cost", "Overall Score", "Total Cost (in USD)", "Overall Score", ["Agent Name"]))
1267
+
1268
+ # gr.HTML('<div style="height: 30px;"></div>')
1269
+ # gr.Markdown("## Task success heatmap")
1270
+ # gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
1271
+ # with gr.Row():
1272
+ # task_success_heatmap = gr.Plot()
1273
+ # demo.load(
1274
+ # lambda: create_task_success_heatmap(
1275
+ # preprocessor.get_task_success_data('usaco'),
1276
+ # 'USACO'
1277
+ # ),
1278
+ # outputs=[task_success_heatmap]
1279
+ # )
1280
+
1281
+ gr.HTML("""
1282
+ <style>
1283
+ .grouped-section {
1284
+ border: 2px solid #dee2e6; /* Color matching unactivated tabs */
1285
+ border-radius: 10px;
1286
+ padding: 30px;
1287
+ margin-top: 40px;
1288
+ margin-bottom: 40px;
1289
+ position: relative;
1290
+ }
1291
+
1292
+ .grouped-section-title {
1293
+ font-size: 1.7em;
1294
+ font-weight: bold;
1295
+ color: #2c3e50;
1296
+ margin-bottom: 20px;
1297
+ padding-bottom: 10px;
1298
+ border-bottom: 2px solid #dee2e6;
1299
+ }
1300
+ </style>
1301
+ """)
1302
+ with gr.Group(elem_classes=["grouped-section"]):
1303
+ gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
1304
+
1305
+ # gr.HTML('<div style="height: 10px;"></div>')
1306
+ # gr.Markdown("## Failure report for each agent")
1307
+ # gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
1308
+ # gr.HTML('<div style="height: 10px;"></div>')
1309
+ # with gr.Row():
1310
+ # with gr.Column(scale=1):
1311
+ # failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
1312
+ # gr.HTML('<div style="height: 10px;"></div>')
1313
+ # with gr.Row():
1314
+ # with gr.Column(scale=1):
1315
+ # failure_categories_overview = gr.Markdown()
1316
+
1317
+ # with gr.Column(scale=1):
1318
+ # failure_categories_chart = gr.Plot()
1319
+
1320
+ # # Initialize the failure report agent dropdown with all agents
1321
+ # demo.load(update_agent_dropdown,
1322
+ # inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
1323
+ # outputs=[failure_report_agent_dropdown])
1324
+
1325
+ # # Update failure report when agent is selected
1326
+ # failure_report_agent_dropdown.change(update_failure_report,
1327
+ # inputs=[failure_report_agent_dropdown, gr.Textbox(value="mlagentbench", visible=False)],
1328
+ # outputs=[failure_categories_overview, failure_categories_chart])
1329
+
1330
+ gr.HTML('<div style="height: 30px;"></div>')
1331
+ gr.Markdown("## Task overview")
1332
+ gr.HTML('<div style="height: 10px;"></div>')
1333
+ with gr.Row():
1334
+ with gr.Column(scale=1):
1335
+ agent_dropdown = gr.Dropdown(label="Select Agent")
1336
+ with gr.Column(scale=1):
1337
+ task_dropdown = gr.Dropdown(label="Select MLAgentBench Task")
1338
+ gr.HTML('<div style="height: 10px;"></div>')
1339
+ with gr.Row():
1340
+ task_overview = gr.Markdown()
1341
+ with gr.Row():
1342
+ flow_chart = gr.Plot(label="Task Flow")
1343
+
1344
+ # Initialize the agent dropdown with the best agent
1345
+ demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)], outputs=[agent_dropdown])
1346
+ demo.load(update_task_analysis, inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1347
+
1348
+ agent_dropdown.change(update_task_analysis,
1349
+ inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown],
1350
+ outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
1351
+ task_dropdown.change(update_task_details,
1352
+ inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown, task_dropdown],
1353
+ outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
1354
+
1355
+ gr.Markdown("## Raw predictions")
1356
+ gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
1357
+ with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
1358
+ with gr.Row():
1359
+ with gr.Column(scale=1):
1360
+ raw_agent_dropdown = gr.Dropdown(label="Select Agent")
1361
+ with gr.Column(scale=1):
1362
+ raw_task_dropdown = gr.Dropdown(label="Select Task")
1363
+ with gr.Column(scale=1):
1364
+ raw_step_dropdown = gr.Dropdown(label="Select Step")
1365
+ with gr.Row():
1366
+ raw_call_details = gr.HTML()
1367
+
1368
+ def update_raw_task_dropdown(agent_name):
1369
+ analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1370
+ if not analyzed_traces:
1371
+ return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
1372
+ task_ids = list(analyzed_traces.keys())
1373
+ steps = analyzed_traces[task_ids[0]]['steps']
1374
+ return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1375
+
1376
+ def update_raw_step_dropdown(agent_name, task_id):
1377
+ analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1378
+ if not analyzed_traces or task_id not in analyzed_traces:
1379
+ return gr.Dropdown(choices=[], label="Select Step"), "No data available."
1380
+ steps = analyzed_traces[task_id]['steps']
1381
+ return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
1382
+
1383
+ def update_raw_call_details(agent_name, task_id, step_index):
1384
+ analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
1385
+ if not analyzed_traces or task_id not in analyzed_traces:
1386
+ return "No data available for this selection."
1387
+ steps = analyzed_traces[task_id]['steps']
1388
+ if step_index is None:
1389
+ return "Invalid step selection."
1390
+ step = steps[step_index]
1391
+ return format_call_info(step, step_index)
1392
+
1393
+ # Initialize the raw agent dropdown with all agents
1394
+ demo.load(update_agent_dropdown,
1395
+ inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
1396
+ outputs=[raw_agent_dropdown])
1397
+ demo.load(update_raw_task_dropdown,
1398
+ inputs=[raw_agent_dropdown],
1399
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1400
+ demo.load(update_raw_call_details,
1401
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1402
+ outputs=[raw_call_details])
1403
+
1404
+ raw_agent_dropdown.change(update_raw_task_dropdown,
1405
+ inputs=[raw_agent_dropdown],
1406
+ outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
1407
+ raw_task_dropdown.change(update_raw_step_dropdown,
1408
+ inputs=[raw_agent_dropdown, raw_task_dropdown],
1409
+ outputs=[raw_step_dropdown, raw_call_details])
1410
+ raw_step_dropdown.change(update_raw_call_details,
1411
+ inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
1412
+ outputs=[raw_call_details])
1413
+
1414
+ with gr.Tab("About"):
1415
+ gr.Markdown((Path(__file__).parent / "about.md").read_text())
1416
+
1417
+ # Will trigger autoscaling of plots when tabs are switched
1418
+ tabs.select(fn=None, inputs=None, outputs=None, js="""
1419
+ function() {
1420
+ setTimeout(function() {
1421
+ window.dispatchEvent(new Event('resize'));
1422
+ }, 100);
1423
+ }
1424
+ """)
1425
+ gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent to HAL leaderboards?</h2>""")
1426
+ gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
1427
+ gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark to HAL?</h2>""")
1428
+ gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
1429
+ gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
1430
+ gr.Markdown("""Coming soon...""")
1431
+
1432
+
1433
+
1434
+
1435
+
1436
+ async def main():
1437
+ # Preprocess traces
1438
+ # preprocessor = TracePreprocessor()
1439
+ # preprocessor.preprocess_traces('evals_live')
1440
+ # preprocessor = TracePreprocessor()
1441
+
1442
+ # Download the results from the Hugging Face Hub
1443
+ # await asyncio.to_thread(download_latest_results)
1444
+
1445
+ # # Check for new uploads and process them
1446
+ # await check_and_process_uploads()
1447
+
1448
+ scheduler = AsyncIOScheduler()
1449
+ scheduler.add_job(restart_space, "interval", hours=1)
1450
+ # scheduler.add_job(download_latest_results, "interval", hours=1)
1451
+ # scheduler.add_job(check_and_process_uploads, "interval", hours=1)
1452
+ scheduler.start()
1453
+
1454
+ await demo.launch(favicon_path="hal.png")
1455
+
1456
+ if __name__ == "__main__":
1457
+ weave.init(f'leaderboard_{datetime.now().strftime("%Y%m%d%H%M%S")}')
1458
+ asyncio.run(main())
benchmark_submission.md ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ To submit **a new benchmark** to the library:
2
+
3
+ 1. Implement a new benchmark using some standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each task as well as the task environment provided inside the container in which the agent runs (see the sketch below this list).
4
+
5
+ 2. We will encourage developers to support running their tasks on separate VMs and to specify the exact hardware requirements for each task in the task environment.
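+
+ As a rough illustration, here is a minimal, hypothetical task family sketched against our reading of the METR Task Standard's `TaskFamily` interface. The method names, fields, and version string below are illustrative assumptions; consult the standard itself for the authoritative interface.
+
+ ```python
+ # Hypothetical sketch only: field and method names follow our reading of the
+ # METR Task Standard's TaskFamily interface and are not authoritative.
+ from typing import Dict, TypedDict
+
+
+ class Task(TypedDict):
+     instructions: str  # exact instructions shown to the agent
+     answer: str        # expected answer used for scoring
+
+
+ class TaskFamily:
+     standard_version = "0.2.3"  # assumed version string, for illustration
+
+     @staticmethod
+     def get_tasks() -> Dict[str, Task]:
+         # One entry per task; keys are task ids.
+         return {
+             "example_addition": {
+                 "instructions": "Compute 2 + 2 and write the result to /home/agent/answer.txt.",
+                 "answer": "4",
+             }
+         }
+
+     @staticmethod
+     def get_instructions(t: Task) -> str:
+         # The exact instructions provided inside the agent's container.
+         return t["instructions"]
+
+     @staticmethod
+     def score(t: Task, submission: str) -> float:
+         # 1.0 for a correct submission, 0.0 otherwise.
+         return 1.0 if submission.strip() == t["answer"] else 0.0
+ ```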
config.py ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+
3
+ TYPES = [
4
+ "str",
5
+ "number",
6
+ "number"
7
+ ]
8
+
9
+ SWEBENCH_ON_LOAD_COLUMNS = [
10
+ "Agent Name",
11
+ "Accuracy",
12
+ "Total Cost",
13
+ "Runs",
14
+ ]
15
+ SWEBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
16
+ SWEBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']
17
+
18
+ USACO_ON_LOAD_COLUMNS = [
19
+ "Agent Name",
20
+ "Accuracy",
21
+ "Total Cost",
22
+ "Runs",
23
+ ]
24
+ USACO_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
25
+ USACO_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']
26
+
27
+ COREBENCH_ON_LOAD_COLUMNS = [
28
+ "Agent Name",
29
+ "Accuracy",
30
+ "Total Cost",
31
+ "Runs",
32
+ ]
33
+ COREBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
34
+ COREBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']
35
+
36
+
37
+
38
+ MLAGENTBENCH_ON_LOAD_COLUMNS = [
39
+ "Agent Name",
40
+ "Overall Score",
41
+ "Total Cost",
42
+ ]
43
+ MLAGENTBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
44
+ MLAGENTBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Accuracy']
45
+
46
+
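+ # Buckets of model parameter counts (in billions) mapped to pandas Intervals for size filters;
+ # likely carried over from the model-leaderboard template.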
47
+ NUMERIC_INTERVALS = {
48
+ "?": pd.Interval(-1, 0, closed="right"),
49
+ "~1.5": pd.Interval(0, 2, closed="right"),
50
+ "~3": pd.Interval(2, 4, closed="right"),
51
+ "~7": pd.Interval(4, 9, closed="right"),
52
+ "~13": pd.Interval(9, 20, closed="right"),
53
+ "~35": pd.Interval(20, 45, closed="right"),
54
+ "~60": pd.Interval(45, 70, closed="right"),
55
+ "70+": pd.Interval(70, 10000, closed="right"),
56
+ }
css.css ADDED
@@ -0,0 +1,48 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ /* Base styles and variables */
2
+ :root {
3
+ --primary-color: #3498db;
4
+ --secondary-color: #2c3e50;
5
+ --background-color: #f8f9fa;
6
+ --text-color: #333;
7
+ --accent-color: #e74c3c;
8
+ --space: 1rem;
9
+ }
10
+
11
+ /* Tabs */
12
+ .tab-nav {
13
+ display: flex;
14
+ background-color: var(--secondary-color);
15
+ border-radius: 8px 8px 0 0;
16
+ overflow: hidden;
17
+ }
18
+
19
+ .tab-nav button {
20
+ padding: 1rem 1.5rem;
21
+ background-color: transparent;
22
+ color: #fff;
23
+ border: none;
24
+ cursor: pointer;
25
+ transition: background-color 0.3s;
26
+ }
27
+
28
+ .tab-nav button:hover,
29
+ .tab-nav button.selected {
30
+ background-color: var(--primary-color);
31
+ }
32
+
33
+
34
+ .svelte-iibkxk .stretch {
35
+ display: none;
36
+ }
37
+
38
+ /* Utility classes */
39
+ .text-center { text-align: center; }
40
+ .text-right { text-align: right; }
41
+ .font-bold { font-weight: 700; }
42
+ .text-small { font-size: 0.875rem; }
43
+ .text-large { font-size: 1.25rem; }
44
+ .mt-1 { margin-top: 1rem; }
45
+ .mb-1 { margin-bottom: 1rem; }
46
+ .ml-1 { margin-left: 1rem; }
47
+ .mr-1 { margin-right: 1rem; }
48
+
envs.py ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from huggingface_hub import HfApi
3
+
4
+ HF_TOKEN = os.getenv('HF_TOKEN', None)
5
+
6
+ RESULTS_REPO_ID = 'agent-evals/results'
7
+ REPO_ID = 'agent-evals/leaderboard'
8
+
9
+ API = HfApi(token=HF_TOKEN)
10
+
hal.ico ADDED
hal.png ADDED
header.md ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ # Holistic Agent Leaderboard (HAL)
2
+
3
+ **A centralized, standardized, cost-aware leaderboard for evaluating agents.**
requirements.txt ADDED
@@ -0,0 +1,108 @@
+ aiofiles==23.2.1
+ aiohappyeyeballs==2.3.5
+ aiohttp==3.10.3
+ aioprocessing==2.0.1
+ aiosignal==1.3.1
+ aiosmtplib==3.0.2
+ analytics-python==1.2.9
+ annotated-types==0.7.0
+ anyio==4.4.0
+ APScheduler
+ async-timeout==4.0.3
+ attrs==24.2.0
+ backoff==2.2.1
+ certifi==2024.7.4
+ charset-normalizer==3.3.2
+ click==8.1.7
+ contourpy==1.2.1
+ cycler==0.12.1
+ distro==1.9.0
+ dnspython==2.6.1
+ docker-pycreds==0.4.0
+ email_validator==2.2.0
+ emoji==2.12.1
+ exceptiongroup==1.2.2
+ fastapi==0.111.1
+ fastapi-cli==0.0.4
+ ffmpy==0.4.0
+ filelock==3.15.4
+ fonttools==4.53.1
+ frozenlist==1.4.1
+ fsspec==2024.6.1
+ gitdb==4.0.11
+ GitPython==3.1.43
+ gql==3.5.0
+ gradio==4.40.0
+ gradio_client==1.2.0
+ gradio_leaderboard==0.0.11
+ graphql-core==3.2.3
+ h11==0.14.0
+ httpcore==1.0.5
+ httptools==0.6.1
+ httpx==0.27.0
+ huggingface-hub==0.24.5
+ idna==3.7
+ importlib_resources==6.4.0
+ janus==1.0.0
+ Jinja2==3.1.4
+ jiter==0.5.0
+ kiwisolver==1.4.5
+ Markdown==3.6
+ markdown-it-py==3.0.0
+ MarkupSafe==2.1.5
+ matplotlib==3.9.1
+ mdurl==0.1.2
+ multidict==6.0.5
+ numpy==1.26.4
+ openai==1.40.3
+ orjson==3.10.6
+ packaging==24.1
+ pandas==2.2.2
+ pillow==10.4.0
+ platformdirs==4.2.2
+ plotly==5.23.0
+ protobuf==5.27.3
+ psutil==6.0.0
+ pyarrow==16.1.0
+ pydantic==2.8.2
+ pydantic_core==2.20.1
+ pydub==0.25.1
+ Pygments==2.18.0
+ pyparsing==3.1.2
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.0.1
+ python-json-logger==2.0.7
+ python-multipart==0.0.9
+ pytz
+ PyYAML==6.0.1
+ regex==2024.7.24
+ requests==2.32.3
+ requests-toolbelt==1.0.0
+ rich==13.7.1
+ ruff==0.5.5
+ scipy==1.14.1
+ semantic-version==2.10.0
+ sentry-sdk==2.12.0
+ setproctitle==1.3.3
+ shellingham==1.5.4
+ six
+ smmap==5.0.1
+ sniffio==1.3.1
+ starlette==0.37.2
+ tenacity==9.0.0
+ tiktoken==0.7.0
+ tomlkit==0.12.0
+ tqdm==4.66.4
+ typer==0.12.3
+ typing_extensions==4.12.2
+ tzdata==2024.1
+ tzlocal
+ urllib3==2.2.2
+ uvicorn==0.30.4
+ uvloop==0.19.0
+ wandb==0.17.6
+ watchfiles==0.22.0
+ weave==0.50.13
+ websockets==12.0
+ Werkzeug==3.0.3
+ yarl==1.9.4
utils/test.txt → scratch.ipynb RENAMED
File without changes
scratch.py ADDED
@@ -0,0 +1,38 @@
+ import json
+ import os
+ from pathlib import Path
+
+ def process_json_files(directory, suffix="_updated"):
+     # Iterate through all JSON files in the directory
+     for filename in os.listdir(directory):
+         if filename.endswith(".json") and "USACO" in filename:
+             file_path = os.path.join(directory, filename)
+
+             # Read the JSON file
+             with open(file_path, 'r') as f:
+                 data = json.load(f)
+
+             # Extract sdict from raw_eval_results
+             sdict = data['raw_eval_results']['sdict']
+
+             # Calculate successful_tasks and failed_tasks
+             successful_tasks = [key for key in sdict if float(sdict[key][0]['result']['fraction_passed']) == 1]
+             failed_tasks = [key for key in sdict if float(sdict[key][0]['result']['fraction_passed']) < 1]
+
+             # Add new key-value pairs to the results
+             data['results']['successful_tasks'] = successful_tasks
+             data['results']['failed_tasks'] = failed_tasks
+
+             # Create new filename with suffix
+             new_filename = f"{Path(filename).stem}{suffix}{Path(filename).suffix}"
+             new_file_path = os.path.join(directory, new_filename)
+
+             # Write updated data to new file
+             with open(new_file_path, 'w') as f:
+                 json.dump(data, f, indent=4)
+
+             print(f"Processed {filename} and saved as {new_filename}")
+
+ # Usage
+ directory_path = "/Users/benediktstroebl/Documents/GitHub/leaderboard/evals_live"
+ process_json_files(directory_path)
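For reference, `process_json_files` assumes each USACO eval file carries `raw_eval_results.sdict`, mapping task keys to per-run results with a `fraction_passed` field. A hypothetical minimal input (all values are illustrative):

```python
# Hypothetical minimal input accepted by process_json_files; every value is made up.
example_eval = {
    "results": {"accuracy": 0.5, "total_cost": 1.23},
    "raw_eval_results": {
        "sdict": {
            "task_001": [{"result": {"fraction_passed": "1.0"}}],   # counted as successful
            "task_002": [{"result": {"fraction_passed": "0.25"}}],  # counted as failed
        }
    },
}
```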
utils/data.py ADDED
@@ -0,0 +1,296 @@
1
+ import json
2
+ from pathlib import Path
3
+ import pandas as pd
4
+ import plotly.express as px
5
+ from utils.pareto import Agent, compute_pareto_frontier
6
+ import plotly.graph_objects as go
7
+ import textwrap
8
+
9
+ # def parse_json_files(folder_path, benchmark_name):
10
+ # # Convert folder path to Path object
11
+ # folder = Path(folder_path)
12
+
13
+ # # List to store data from each file
14
+ # data_list = []
15
+
16
+ # # Iterate through all JSON files in the folder
17
+ # for json_file in folder.glob('*.json'):
18
+ # try:
19
+ # with open(json_file, 'r') as file:
20
+ # data = json.load(file)
21
+
22
+ # # Extract config and results
23
+ # config = data['config']
24
+ # results = data['results']
25
+
26
+ # # Combine config and results into a single dictionary
27
+ # combined_data = {
28
+ # 'agent_name': config['agent_name'],
29
+ # 'benchmark_name': config['benchmark_name'],
30
+ # 'date': config['date']
31
+ # }
32
+
33
+ # # Add results with 'results_' prefix
34
+ # for key, value in results.items():
35
+ # combined_data[f'results_{key}'] = value
36
+
37
+ # data_list.append(combined_data)
38
+ # except Exception as e:
39
+ # print(f"Error processing {json_file}: {e}. Skipping!")
40
+
41
+ # # Create DataFrame from the list of dictionaries
42
+ # df = pd.DataFrame(data_list)
43
+ # df = df[df['benchmark_name'] == benchmark_name]
44
+
45
+ # # sort df by descending accuracy
46
+ # df = df.sort_values(by='results_accuracy', ascending=False)
47
+
48
+ # # round all float columns to 2 decimal places
49
+ # for column in df.select_dtypes(include='float').columns:
50
+ # df[column] = df[column].round(3)
51
+
52
+ # # Rename columns
53
+ # df = df.rename(columns={
54
+ # 'agent_name': 'Agent Name',
55
+ # 'results_total_cost': 'Total Cost',
56
+ # 'results_accuracy': 'Accuracy',
57
+ # 'results_precision': 'Precision',
58
+ # 'results_recall': 'Recall',
59
+ # 'results_f1_score': 'F1 Score',
60
+ # 'results_auc': 'AUC',
61
+ # })
62
+
63
+ # return df
64
+
65
+
66
+ def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
67
+ agents = [Agent(row.results_total_cost, row.results_accuracy) for row in df.itertuples()]
68
+ pareto_frontier = compute_pareto_frontier(agents)
69
+
70
+
71
+ fig = px.scatter(df,
72
+ x=x,
73
+ y=y,
74
+ hover_data=hover_data,
75
+ )
76
+
77
+
78
+ # Sort the Pareto frontier points by x-coordinate
79
+ pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
80
+
81
+ # Add the Pareto frontier line
82
+ fig.add_trace(go.Scatter(
83
+ x=[point[0] for point in pareto_points],
84
+ y=[point[1] for point in pareto_points],
85
+ mode='lines',
86
+ name='Pareto Frontier',
87
+ line=dict(color='black', width=1, dash='dash')
88
+ ))
89
+
90
+ fig.update_yaxes(rangemode="tozero")
91
+ fig.update_xaxes(rangemode="tozero")
92
+
93
+ fig.update_layout(
94
+ width = 600,
95
+ height = 500,
96
+ xaxis_title = x_label,
97
+ yaxis_title = y_label,
98
+ xaxis = dict(
99
+ showline = True,
100
+ linecolor = 'black',
101
+ showgrid = False),
102
+ yaxis = dict(
103
+ showline = True,
104
+ showgrid = False,
105
+ linecolor = 'black'),
106
+ plot_bgcolor = 'white',
107
+ # Legend positioning
108
+ legend=dict(
109
+ yanchor="bottom",
110
+ y=0.01,
111
+ xanchor="right",
112
+ x=0.98,
113
+ bgcolor="rgba(255, 255, 255, 0.5)" # semi-transparent white background
114
+ )
115
+ )
116
+ return fig
117
+
118
+
119
+ import plotly.graph_objects as go
120
+ import textwrap
121
+
122
+ def create_flow_chart(steps):
123
+ node_x = []
124
+ node_y = []
125
+ edge_x = []
126
+ edge_y = []
127
+ node_text = []
128
+ hover_text = []
129
+ node_colors = []
130
+ node_shapes = []
131
+
132
+ # Define color and shape mappings
133
+ color_map = {True: 'green', False: 'red'} # True for success, False for challenges
134
+ shape_map = {
135
+ 'plan': 'octagon',
136
+ 'tool': 'square',
137
+ 'retrieve': 'diamond',
138
+ 'other': 'circle'
139
+ }
140
+
141
+ for i, step in enumerate(steps):
142
+ node_x.append(i)
143
+ node_y.append(0)
144
+
145
+ # Extract Description, Assessment, and new attributes
146
+ analysis = step['analysis']
147
+ if isinstance(analysis, str):
148
+ try:
149
+ analysis = json.loads(analysis)
150
+ except json.JSONDecodeError:
151
+ analysis = {}
152
+
153
+ description = analysis.get('description', 'No description available.')
154
+ assessment = analysis.get('assessment', 'No assessment available.')
155
+ success = analysis.get('success', True) # Assuming True if not specified
156
+ action_type = analysis.get('action_type', 'other') # Default to 'other' if not specified
157
+ step_outline = analysis.get('step_outline', '')
158
+
159
+ # Set node color and shape based on attributes
160
+ node_colors.append(color_map[success])
161
+ node_shapes.append(shape_map.get(action_type, 'circle'))
162
+
163
+ # Wrap text to improve readability
164
+ wrapped_description = '<br>'.join(textwrap.wrap(description, width=50))
165
+ wrapped_assessment = '<br>'.join(textwrap.wrap(assessment, width=50))
166
+ wrapped_outline = textwrap.shorten(step_outline, width=30, placeholder='')
167
+ wrapped_outline = '' if wrapped_outline == '' else f": {wrapped_outline}"
168
+
169
+ node_text_outline = '' if wrapped_outline == '' else f":<br>{textwrap.shorten(step_outline, width=30, placeholder='')}"
170
+ node_text.append(f"Step {i+1}{node_text_outline}")
171
+
172
+ # Create formatted hover text without indentation
173
+ hover_info = f"<b>Step {i+1}{wrapped_outline}</b><br><br>" \
174
+ f"<b>Description:</b><br>" \
175
+ f"{wrapped_description}<br><br>" \
176
+ f"<b>Assessment:</b><br>" \
177
+ f"{wrapped_assessment}<br><br>" \
178
+ f"<b>Successful:</b> {'Yes' if success else 'No'}<br>" \
179
+ f"<b>Action Type:</b> {action_type.capitalize()}"
180
+ hover_text.append(hover_info)
181
+
182
+ if i > 0:
183
+ edge_x.extend([i-1, i, None])
184
+ edge_y.extend([0, 0, None])
185
+
186
+ node_trace = go.Scatter(
187
+ x=node_x, y=node_y,
188
+ mode='markers+text',
189
+ text=node_text,
190
+ textposition="top center",
191
+ showlegend=False,
192
+ hovertext=hover_text,
193
+ hoverinfo='text',
194
+ hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
195
+ marker=dict(
196
+ color=node_colors,
197
+ size=30,
198
+ line_width=2,
199
+ symbol=node_shapes
200
+ ))
201
+
202
+ edge_trace = go.Scatter(
203
+ x=edge_x, y=edge_y,
204
+ line=dict(width=2, color='#888'),
205
+ hoverinfo='none',
206
+ showlegend=False,
207
+ mode='lines')
208
+
209
+ # Create legend traces
210
+ legend_traces = []
211
+
212
+ # Color legend
213
+ for success, color in color_map.items():
214
+ legend_traces.append(go.Scatter(
215
+ x=[None], y=[None],
216
+ mode='markers',
217
+ marker=dict(size=10, color=color),
218
+ showlegend=True,
219
+ name=f"{'Success' if success else 'Issue'}"
220
+ ))
221
+
222
+ # Shape legend
223
+ for action, shape in shape_map.items():
224
+ legend_traces.append(go.Scatter(
225
+ x=[None], y=[None],
226
+ mode='markers',
227
+ marker=dict(size=10, symbol=shape, color='gray'),
228
+ showlegend=True,
229
+ name=f"{action.capitalize()}"
230
+ ))
231
+
232
+ # Combine all traces
233
+ all_traces = [edge_trace, node_trace] + legend_traces
234
+
235
+ layout = go.Layout(
236
+ showlegend=True,
237
+ hovermode='closest',
238
+ margin=dict(b=20,l=5,r=5,t=40),
239
+ xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
240
+ yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
241
+ plot_bgcolor='white',
242
+ paper_bgcolor='white',
243
+ modebar=dict(
244
+ activecolor='#1f77b4', # Color of active tool
245
+ orientation='h', # Horizontal orientation
246
+ bgcolor='rgba(255,255,255,0.8)', # Slightly transparent white background
247
+ color='#777', # Color of inactive tools
248
+ ),
249
+ legend=dict(
250
+ orientation="h",
251
+ yanchor="bottom",
252
+ y=0.02,
253
+ xanchor="right",
254
+ x=1,
255
+ bgcolor='rgba(255,255,255,0.8)',
256
+ bordercolor='rgba(0,0,0,0.1)',
257
+ borderwidth=1
258
+ ),
259
+ )
260
+
261
+ fig = go.Figure(data=all_traces, layout=layout)
262
+
263
+ fig.update_layout(legend=dict(
264
+ orientation="h",
265
+ yanchor="bottom",
266
+ y=1.02,
267
+ xanchor="right",
268
+ x=1,
269
+ bgcolor='rgba(255,255,255,0.8)', # Set legend background to slightly transparent white
270
+ bordercolor='rgba(0,0,0,0.1)', # Add a light border to the legend
271
+ borderwidth=1
272
+ ),
273
+ dragmode='pan'
274
+ )
275
+
276
+ config = {
277
+ 'add': ['pan2d'],
278
+ 'remove': [
279
+ 'zoom2d',
280
+ 'zoomIn2d',
281
+ 'zoomOut2d',
282
+ 'resetScale2d',
283
+ 'hoverClosestCartesian',
284
+ 'hoverCompareCartesian',
285
+ 'toggleSpikelines',
286
+ 'lasso2d',
287
+ 'lasso',
288
+ 'select2d',
289
+ 'select',
290
+ ]
291
+ }
292
+
293
+ # Apply the config to the figure
294
+ fig.update_layout(modebar=config)
295
+
296
+ return fig
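`create_scatter_plot` above expects one row per run with `results_total_cost` and `results_accuracy` columns. A small illustrative call; the agents and numbers are made up:

```python
# Illustrative call to utils.data.create_scatter_plot with made-up data.
import pandas as pd

from utils.data import create_scatter_plot

runs = pd.DataFrame({
    "agent_name": ["Agent A", "Agent B", "Agent C"],
    "results_total_cost": [1.2, 3.4, 0.8],
    "results_accuracy": [0.41, 0.55, 0.30],
})
fig = create_scatter_plot(
    runs,
    x="results_total_cost",
    y="results_accuracy",
    x_label="Total Cost (USD)",
    y_label="Accuracy",
    hover_data=["agent_name"],
)
fig.show()
```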
utils/db.py ADDED
@@ -0,0 +1,361 @@
1
+ import json
2
+ from pathlib import Path
3
+ import sqlite3
4
+ import pickle
5
+ from functools import lru_cache
6
+ import threading
7
+ import pandas as pd
8
+ import ast
9
+ from scipy import stats
10
+ import yaml
11
+ import numpy as np
12
+
13
+ class TracePreprocessor:
14
+ def __init__(self, db_path='preprocessed_traces.db'):
15
+ self.db_path = db_path
16
+ self.local = threading.local()
17
+
18
+ def get_conn(self):
19
+ if not hasattr(self.local, 'conn'):
20
+ self.local.conn = sqlite3.connect(self.db_path)
21
+ return self.local.conn
22
+
23
+ def create_tables(self):
24
+ with self.get_conn() as conn:
25
+ conn.execute('''
26
+ CREATE TABLE IF NOT EXISTS preprocessed_traces (
27
+ benchmark_name TEXT,
28
+ agent_name TEXT,
29
+ date TEXT,
30
+ run_id TEXT,
31
+ raw_logging_results BLOB,
32
+ PRIMARY KEY (benchmark_name, agent_name, run_id)
33
+ )
34
+ ''')
35
+ conn.execute('''
36
+ CREATE TABLE IF NOT EXISTS failure_reports (
37
+ benchmark_name TEXT,
38
+ agent_name TEXT,
39
+ date TEXT,
40
+ run_id TEXT,
41
+ failure_report BLOB,
42
+ PRIMARY KEY (benchmark_name, agent_name, run_id)
43
+ )
44
+ ''')
45
+ conn.execute('''
46
+ CREATE TABLE IF NOT EXISTS parsed_results (
47
+ benchmark_name TEXT,
48
+ agent_name TEXT,
49
+ date TEXT,
50
+ run_id TEXT,
51
+ successful_tasks TEXT,
52
+ failed_tasks TEXT,
53
+ total_cost REAL,
54
+ accuracy REAL,
55
+ precision REAL,
56
+ recall REAL,
57
+ f1_score REAL,
58
+ auc REAL,
59
+ overall_score REAL,
60
+ vectorization_score REAL,
61
+ fathomnet_score REAL,
62
+ feedback_score REAL,
63
+ house_price_score REAL,
64
+ spaceship_titanic_score REAL,
65
+ amp_parkinsons_disease_progression_prediction_score REAL,
66
+ cifar10_score REAL,
67
+ imdb_score REAL,
68
+ PRIMARY KEY (benchmark_name, agent_name, run_id)
69
+ )
70
+ ''')
71
+
72
+ def preprocess_traces(self, processed_dir="evals_live"):
73
+ self.create_tables()
74
+ processed_dir = Path(processed_dir)
75
+ for file in processed_dir.glob('*.json'):
76
+ with open(file, 'r') as f:
77
+ data = json.load(f)
78
+ agent_name = data['config']['agent_name']
79
+ benchmark_name = data['config']['benchmark_name']
80
+ date = data['config']['date']
81
+ config = data['config']
82
+
83
+ try:
84
+ raw_logging_results = pickle.dumps(data['raw_logging_results'])
85
+ with self.get_conn() as conn:
86
+ conn.execute('''
87
+ INSERT OR REPLACE INTO preprocessed_traces
88
+ (benchmark_name, agent_name, date, run_id, raw_logging_results)
89
+ VALUES (?, ?, ?, ?, ?)
90
+ ''', (benchmark_name, agent_name, date, config['run_id'], raw_logging_results))
91
+ except Exception as e:
92
+ print(f"Error preprocessing raw_logging_results in {file}: {e}")
93
+
94
+ try:
95
+ failure_report = pickle.dumps(data['failure_report'])
96
+ with self.get_conn() as conn:
97
+ conn.execute('''
98
+ INSERT INTO failure_reports
99
+ (benchmark_name, agent_name, date, run_id, failure_report)
100
+ VALUES (?, ?, ?, ? ,?)
101
+ ''', (benchmark_name, agent_name, date, config['run_id'], failure_report))
102
+ except Exception as e:
103
+ print(f"Error preprocessing failure_report in {file}: {e}")
104
+
105
+ try:
106
+ config = data['config']
107
+ results = data['results']
108
+ with self.get_conn() as conn:
109
+ conn.execute('''
110
+ INSERT INTO parsed_results
111
+ (benchmark_name, agent_name, date, run_id, successful_tasks, failed_tasks, total_cost, accuracy, precision, recall, f1_score, auc, overall_score, vectorization_score, fathomnet_score, feedback_score, house_price_score, spaceship_titanic_score, amp_parkinsons_disease_progression_prediction_score, cifar10_score, imdb_score)
112
+ VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
113
+ ''', (
114
+ benchmark_name,
115
+ agent_name,
116
+ config['date'],
117
+ config['run_id'],
118
+ str(results.get('successful_tasks')),
119
+ str(results.get('failed_tasks')),
120
+ results.get('total_cost'),
121
+ results.get('accuracy'),
122
+ results.get('precision'),
123
+ results.get('recall'),
124
+ results.get('f1_score'),
125
+ results.get('auc'),
126
+ results.get('overall_score'),
127
+ results.get('vectorization_score'),
128
+ results.get('fathomnet_score'),
129
+ results.get('feedback_score'),
130
+ results.get('house-price_score'),
131
+ results.get('spaceship-titanic_score'),
132
+ results.get('amp-parkinsons-disease-progression-prediction_score'),
133
+ results.get('cifar10_score'),
134
+ results.get('imdb_score')
135
+ ))
136
+ except Exception as e:
137
+ print(f"Error preprocessing parsed results in {file}: {e}")
138
+
139
+ @lru_cache(maxsize=100)
140
+ def get_analyzed_traces(self, agent_name, benchmark_name):
141
+ with self.get_conn() as conn:
142
+ query = '''
143
+ SELECT agent_name, raw_logging_results, date FROM preprocessed_traces
144
+ WHERE benchmark_name = ? AND agent_name = ?
145
+ '''
146
+ df = pd.read_sql_query(query, conn, params=(benchmark_name, agent_name))
147
+
148
+
149
+ # check for each row if raw_logging_results is not None with pickle.loads because it is stored as a byte string
150
+ df = df[df['raw_logging_results'].apply(lambda x: pickle.loads(x) is not None and x != 'None')]
151
+
152
+ if len(df) == 0:
153
+ return None
154
+
155
+ # select latest run
156
+ df = df.sort_values('date', ascending=False).groupby('agent_name').first().reset_index()
157
+
158
+
159
+ return pickle.loads(df['raw_logging_results'][0])
160
+
161
+
162
+ @lru_cache(maxsize=100)
163
+ def get_failure_report(self, agent_name, benchmark_name):
164
+ with self.get_conn() as conn:
165
+ query = '''
166
+ SELECT agent_name, date, failure_report FROM failure_reports
167
+ WHERE benchmark_name = ? AND agent_name = ?
168
+ '''
169
+ df = pd.read_sql_query(query, conn, params=(benchmark_name, agent_name))
170
+
171
+ # Select only rows for which failure report is not None and None is a string
172
+ df = df[df['failure_report'].apply(lambda x: pickle.loads(x) is not None and x != 'None')]
173
+
174
+ if len(df) == 0:
175
+ return None
176
+
177
+
178
+ # if there is multiple failure reports, take the last one
179
+ df = df.sort_values('date', ascending=False).groupby('agent_name').first().reset_index()
180
+
181
+ # if there is a failure report, return the first one
182
+ return pickle.loads(df['failure_report'][0])
183
+
184
+ def _calculate_ci(self, data, confidence=0.95, type='minmax'):
185
+ data = data[np.isfinite(data)]
186
+
187
+ if len(data) < 2:
188
+ return '', '', '' # No CI for less than 2 samples
189
+ n = len(data)
190
+
191
+ mean = np.mean(data)
192
+
193
+ if type == 't':
194
+ sem = stats.sem(data)
195
+ ci = stats.t.interval(confidence, n-1, loc=mean, scale=sem)
196
+
197
+ elif type == 'minmax':
198
+ min = np.min(data)
199
+ max = np.max(data)
200
+ ci = (min, max)
201
+ return mean, ci[0], ci[1]
202
+
203
+ def get_parsed_results(self, benchmark_name, aggregate=True):
204
+ with self.get_conn() as conn:
205
+ query = '''
206
+ SELECT * FROM parsed_results
207
+ WHERE benchmark_name = ?
208
+ ORDER BY accuracy DESC
209
+ '''
210
+ df = pd.read_sql_query(query, conn, params=(benchmark_name,))
211
+
212
+ # Load verified agents
213
+ verified_agents = self.load_verified_agents()
214
+
215
+ # Add 'Verified' column
216
+ df['Verified'] = df.apply(lambda row: '✓' if (benchmark_name, row['agent_name']) in verified_agents else '', axis=1)
217
+
218
+
219
+
220
+ # Add column for how many times an agent_name appears in the DataFrame
221
+ df['Runs'] = df.groupby('agent_name')['agent_name'].transform('count')
222
+
223
+ # Compute the 95% confidence interval for accuracy and cost for agents that have been run more than once
224
+ df['acc_ci'] = None
225
+ df['cost_ci'] = None
226
+
227
+ for agent_name in df['agent_name'].unique():
228
+ agent_df = df[df['agent_name'] == agent_name]
229
+
230
+ if len(agent_df) > 1:
231
+ accuracy_mean, accuracy_lower, accuracy_upper = self._calculate_ci(agent_df['accuracy'], type='minmax')
232
+ cost_mean, cost_lower, cost_upper = self._calculate_ci(agent_df['total_cost'], type='minmax')
233
+
234
+ # format the confidence interval with +/- sign
235
+ # accuracy_ci = f"± {abs(accuracy_mean - accuracy_lower):.3f}"
236
+ # cost_ci = f"± {abs(cost_mean - cost_lower):.3f}"
237
+
238
+ accuracy_ci = f"-{abs(accuracy_mean - accuracy_lower):.3f}/+{abs(accuracy_mean - accuracy_upper):.3f}"
239
+ cost_ci = f"-{abs(cost_mean - cost_lower):.3f}/+{abs(cost_mean - cost_upper):.3f}"
240
+
241
+ df.loc[df['agent_name'] == agent_name, 'acc_ci'] = accuracy_ci
242
+ df.loc[df['agent_name'] == agent_name, 'cost_ci'] = cost_ci
243
+
244
+
245
+ df = df.drop(columns=['successful_tasks', 'failed_tasks', 'run_id'], axis=1)
246
+
247
+ if aggregate:
248
+ # For agents that have been run more than once, compute the average accuracy and cost and use that as the value in the DataFrame
249
+ df = df.groupby('agent_name').agg({
250
+ 'date': 'first',
251
+ 'total_cost': 'mean',
252
+ 'accuracy': 'mean',
253
+ 'precision': 'mean',
254
+ 'recall': 'mean',
255
+ 'f1_score': 'mean',
256
+ 'auc': 'mean',
257
+ 'overall_score': 'mean',
258
+ 'vectorization_score': 'mean',
259
+ 'fathomnet_score': 'mean',
260
+ 'feedback_score': 'mean',
261
+ 'house_price_score': 'mean',
262
+ 'spaceship_titanic_score': 'mean',
263
+ 'amp_parkinsons_disease_progression_prediction_score': 'mean',
264
+ 'cifar10_score': 'mean',
265
+ 'imdb_score': 'mean',
266
+ 'Verified': 'first',
267
+ 'Runs': 'first',
268
+ 'acc_ci': 'first',
269
+ 'cost_ci': 'first'
270
+ }).reset_index()
271
+
272
+ # Round float columns to 3 decimal places
273
+ float_columns = ['total_cost', 'accuracy', 'precision', 'recall', 'f1_score', 'auc', 'overall_score', 'vectorization_score', 'fathomnet_score', 'feedback_score', 'house-price_score', 'spaceship-titanic_score', 'amp-parkinsons-disease-progression-prediction_score', 'cifar10_score', 'imdb_score']
274
+ for column in float_columns:
275
+ if column in df.columns:
276
+ df[column] = df[column].round(3)
277
+
278
+ # sort by accuracy
279
+ df = df.sort_values('accuracy', ascending=False)
280
+
281
+ # Rename columns
282
+ df = df.rename(columns={
283
+ 'agent_name': 'Agent Name',
284
+ 'date': 'Date',
285
+ 'total_cost': 'Total Cost',
286
+ 'accuracy': 'Accuracy',
287
+ 'precision': 'Precision',
288
+ 'recall': 'Recall',
289
+ 'f1_score': 'F1 Score',
290
+ 'auc': 'AUC',
291
+ 'overall_score': 'Overall Score',
292
+ 'vectorization_score': 'Vectorization Score',
293
+ 'fathomnet_score': 'Fathomnet Score',
294
+ 'feedback_score': 'Feedback Score',
295
+ 'house_price_score': 'House Price Score',
296
+ 'spaceship_titanic_score': 'Spaceship Titanic Score',
297
+ 'amp_parkinsons_disease_progression_prediction_score': 'AMP Parkinsons Disease Progression Prediction Score',
298
+ 'cifar10_score': 'CIFAR10 Score',
299
+ 'imdb_score': 'IMDB Score',
300
+ 'acc_ci': 'Accuracy CI',
301
+ 'cost_ci': 'Total Cost CI'
302
+ })
303
+
304
+ return df
305
+
306
+ def get_task_success_data(self, benchmark_name):
307
+ with self.get_conn() as conn:
308
+ query = '''
309
+ SELECT agent_name, accuracy, successful_tasks, failed_tasks
310
+ FROM parsed_results
311
+ WHERE benchmark_name = ?
312
+ '''
313
+ df = pd.read_sql_query(query, conn, params=(benchmark_name,))
314
+
315
+ # for agent_names that have been run more than once, take the run with the highest accuracy
316
+ df = df.sort_values('accuracy', ascending=False).groupby('agent_name').first().reset_index()
317
+
318
+ # Get all unique task IDs
319
+ task_ids = set()
320
+ for tasks in df['successful_tasks']:
321
+ if ast.literal_eval(tasks) is not None:
322
+ task_ids.update(ast.literal_eval(tasks))
323
+ for tasks in df['failed_tasks']:
324
+ if ast.literal_eval(tasks) is not None:
325
+ task_ids.update(ast.literal_eval(tasks))
326
+
327
+ # Create a DataFrame with agent_name, task_ids, and success columns
328
+ data_list = []
329
+ for _, row in df.iterrows():
330
+ agent_name = row['agent_name']
331
+ for task_id in task_ids:
332
+ success = 1 if task_id in row['successful_tasks'] else 0
333
+ data_list.append({
334
+ 'agent_name': agent_name,
335
+ 'task_id': task_id,
336
+ 'success': success
337
+ })
338
+ df = pd.DataFrame(data_list)
339
+
340
+ df = df.rename(columns={
341
+ 'agent_name': 'Agent Name',
342
+ 'task_id': 'Task ID',
343
+ 'success': 'Success'
344
+ })
345
+
346
+ return df
347
+
348
+ def load_verified_agents(self, file_path='verified_agents.yaml'):
349
+ with open(file_path, 'r') as f:
350
+ verified_data = yaml.safe_load(f)
351
+
352
+ verified_agents = set()
353
+ for benchmark, agents in verified_data.items():
354
+ for agent in agents:
355
+ verified_agents.add((benchmark, agent['agent_name']))
356
+
357
+ return verified_agents
358
+
359
+ if __name__ == '__main__':
360
+ preprocessor = TracePreprocessor()
361
+ preprocessor.preprocess_traces()
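A sketch of the intended flow for `TracePreprocessor`: ingest processed eval files once into SQLite, then serve leaderboard and per-task views from the database. The benchmark name below is an assumption:

```python
# Sketch only: preprocess traces, then query aggregated results (benchmark name assumed).
from utils.db import TracePreprocessor

preprocessor = TracePreprocessor()             # defaults to preprocessed_traces.db
preprocessor.preprocess_traces("evals_live")   # ingest processed eval JSON files

# Aggregated leaderboard rows: mean accuracy/cost per agent plus min-max intervals
leaderboard_df = preprocessor.get_parsed_results("usaco", aggregate=True)

# Long-format per-task success matrix used for the heatmap
task_df = preprocessor.get_task_success_data("usaco")
print(leaderboard_df.head())
```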
utils/pareto.py ADDED
@@ -0,0 +1,38 @@
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from dataclasses import dataclass
+
+ @dataclass
+ class Agent:
+     total_cost: float
+     accuracy: float
+
+
+ def cross(point_o: Agent, point_a: Agent, point_b: Agent) -> int:
+     return (point_a.total_cost - point_o.total_cost) * (point_b.accuracy - point_o.accuracy) - (point_a.accuracy - point_o.accuracy) * (point_b.total_cost - point_o.total_cost)
+
+ def compute_hull_side(points: list[Agent]) -> list[Agent]:
+     hull: list[Agent] = []
+     for p in points:
+         while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
+             hull.pop()
+         hull.append(p)
+     return hull
+
+ def is_pareto_efficient(others, candidate):
+     for other in others:
+         if (other.total_cost <= candidate.total_cost and other.accuracy >= candidate.accuracy) and \
+                 (other.total_cost < candidate.total_cost or other.accuracy > candidate.accuracy):
+             return False
+     return True
+
+ def compute_pareto_frontier(points: list[Agent]) -> list[Agent]:
+     points = sorted(list(points), key=lambda p: (p.total_cost, p.accuracy))
+     if len(points) <= 1:
+         return points
+
+     upper_convex_hull = compute_hull_side(list(reversed(points)))
+     pareto_frontier = [agent for agent in upper_convex_hull if is_pareto_efficient(upper_convex_hull, agent)]
+
+     return pareto_frontier
+
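`compute_pareto_frontier` builds the upper convex hull of (total cost, accuracy) points and then drops any hull point that another agent beats on both axes. A small worked example with made-up agents:

```python
# Worked example for utils.pareto.compute_pareto_frontier; the points are made up.
from utils.pareto import Agent, compute_pareto_frontier

agents = [
    Agent(total_cost=0.5, accuracy=0.30),
    Agent(total_cost=1.0, accuracy=0.50),
    Agent(total_cost=1.5, accuracy=0.45),  # dominated: costs more and scores lower than the $1.00 agent
    Agent(total_cost=3.0, accuracy=0.70),
]
frontier = compute_pareto_frontier(agents)
# Expected frontier: the $3.00, $1.00, and $0.50 agents (in that order)
for agent in frontier:
    print(agent.total_cost, agent.accuracy)
```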
utils/processing.py ADDED
@@ -0,0 +1,141 @@
1
+ import os
2
+ import json
3
+ import asyncio
4
+ import aiofiles
5
+ from agent_monitor.monitor import analyze_agent_steps
6
+ from agent_monitor.failure_report import analyze_agent_performance, AsyncOpenAIClient
7
+ import traceback
8
+ from tqdm import tqdm
9
+
10
+ async def check_and_process_uploads():
11
+ upload_dir = "evals_upload"
12
+ processed_dir = "evals_processed"
13
+ live_dir = "evals_live"
14
+
15
+ new_uploads = [f for f in os.listdir(upload_dir) if f.endswith('.json')]
16
+
17
+ if not new_uploads:
18
+ print("No new uploads found.")
19
+ return
20
+
21
+ # check for all new uploads whether they are already in live or processed directory
22
+ # Also check whether the files are actually identical
23
+ unprocessed_uploads = []
24
+ for upload in new_uploads:
25
+ upload_path = os.path.join(upload_dir, upload)
26
+ processed_path = os.path.join(processed_dir, upload)
27
+ live_path = os.path.join(live_dir, upload)
28
+
29
+ if not os.path.exists(live_path) and not os.path.exists(processed_path):
30
+ unprocessed_uploads.append(upload)
31
+ elif os.path.exists(processed_path):
32
+ # with open(upload_path, 'r') as f:
33
+ # new_data = json.load(f)
34
+
35
+ # with open(processed_path, 'r') as f:
36
+ # processed_data = json.load(f)
37
+
38
+ # TODO we can use a better comparison method with exact comparison
39
+ # if new_data != processed_data:
40
+ # unprocessed_uploads.append(upload)
41
+ print(f"Upload {upload} is already in processed directory.")
42
+ elif os.path.exists(live_path):
43
+ with open(upload_path, 'r') as f:
44
+ new_data = json.load(f)
45
+
46
+ with open(live_path, 'r') as f:
47
+ live_data = json.load(f)
48
+
49
+ # if new_data != live_data:
50
+ # unprocessed_uploads.append(upload)
51
+ print(f"Upload {upload} is already in live directory.")
52
+ else:
53
+ unprocessed_uploads.append(upload)
54
+
55
+ print(f"Processing {len(unprocessed_uploads)} new uploads.")
56
+ tasks = []
57
+ for upload in tqdm(unprocessed_uploads):
58
+ upload_path = os.path.join(upload_dir, upload)
59
+ processed_path = os.path.join(processed_dir, upload)
60
+ # tasks.append(process_single_upload(upload_path, processed_path)) # for async processing
61
+ await process_single_upload(upload_path, processed_path)
62
+
63
+
64
+ # await asyncio.gather(*tasks) # for async processing
65
+
66
+
67
+ async def process_single_upload(upload_path, processed_path):
68
+ # Check the structure of the upload
69
+ check_result = await check_upload_structure(upload_path)
70
+
71
+ if check_result['is_valid']:
72
+ # Process the file
73
+ await process_upload(upload_path, processed_path)
74
+
75
+ # Move the file to processed directory
76
+ # await asyncio.to_thread(shutil.move, upload_path, processed_path)
77
+
78
+ else:
79
+ print(f"Upload check failed for {upload_path}: {check_result['message']}")
80
+
81
+
82
+
83
+ async def check_upload_structure(file_path):
84
+ try:
85
+ async with aiofiles.open(file_path, 'r') as f:
86
+ data = json.loads(await f.read())
87
+
88
+ # Check for required keys
89
+ required_keys = ['config', 'results', 'raw_eval_results', 'raw_logging_results']
90
+ missing_keys = [key for key in required_keys if key not in data]
91
+
92
+ if missing_keys:
93
+ return {'is_valid': False, 'message': f"Missing required keys: {', '.join(missing_keys)}"}
94
+
95
+ # Check for specific structure in raw_logging_results
96
+ if not isinstance(data['raw_logging_results'], list):
97
+ return {'is_valid': False, 'message': "raw_logging_results should be a list"}
98
+
99
+ for item in data['raw_logging_results']:
100
+ if not all(key in item for key in ['weave_task_id', 'inputs', 'outputs']):
101
+ return {'is_valid': False, 'message': "Each item in raw_logging_results should have weave_task_id, inputs, and outputs"}
102
+
103
+ return {'is_valid': True, 'message': "File structure is valid"}
104
+
105
+ except json.JSONDecodeError:
106
+ return {'is_valid': False, 'message': "Invalid JSON format"}
107
+ except Exception as e:
108
+ return {'is_valid': False, 'message': f"Unexpected error: {str(e)}"}
109
+
110
+
111
+ async def process_upload(input_path, output_path):
112
+ print(f"Processing {input_path}...")
113
+ # load the file
114
+ with open(input_path, 'r') as f:
115
+ data = json.loads(f.read())
116
+
117
+ assert 'raw_logging_results' in data, "raw_logging_results key not found in the file"
118
+ openai_client = AsyncOpenAIClient(model="gpt-4o-mini")
119
+
120
+ try:
121
+ processed_calls = await analyze_agent_steps(data['raw_logging_results'], openai_client, llm_eval=False)
122
+ # failure_report = await analyze_agent_performance(data['raw_logging_results'], data['results']['failed_tasks'], openai_client)
123
+
124
+ data['raw_logging_results'] = processed_calls
125
+ data['failure_report'] = None
126
+ except Exception as e:
127
+ traceback.print_exc()
128
+ print(f"Error in processing: {str(e)}")
129
+ return
130
+
131
+ with open(output_path, 'w') as f:
132
+ json.dump(data, f, indent=4)
133
+
134
+ print(f"Processing of {input_path} successful. Results saved to {output_path}")
135
+
136
+
137
+
138
+
139
+ if __name__ == "__main__":
140
+ # process single upload for testing
141
+ asyncio.run(process_single_upload("evals_upload/inspect_evalsswe_bench_1729538131_UPLOAD.json", "evals_processed/inspect_evalsswe_bench_1729538131_UPLOAD.json"))
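`check_upload_structure` only accepts files that have `config`, `results`, `raw_eval_results`, and `raw_logging_results` keys, where every logged call carries `weave_task_id`, `inputs`, and `outputs`. A hypothetical minimal upload that passes the check; every value is illustrative:

```python
# Hypothetical minimal upload passing check_upload_structure; all values are made up.
import asyncio
import json
import os

from utils.processing import check_upload_structure

minimal_upload = {
    "config": {"agent_name": "Example Agent", "benchmark_name": "usaco",
               "date": "2024-08-20", "run_id": "example_run_1"},
    "results": {"accuracy": 0.5, "total_cost": 1.23,
                "successful_tasks": ["task_001"], "failed_tasks": ["task_002"]},
    "raw_eval_results": {},
    "raw_logging_results": [
        {"weave_task_id": "task_001", "inputs": {"prompt": "..."}, "outputs": {"answer": "..."}},
    ],
}

os.makedirs("evals_upload", exist_ok=True)
with open("evals_upload/example_UPLOAD.json", "w") as f:
    json.dump(minimal_upload, f, indent=4)

print(asyncio.run(check_upload_structure("evals_upload/example_UPLOAD.json")))
# -> {'is_valid': True, 'message': 'File structure is valid'}
```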
utils/viz.py ADDED
@@ -0,0 +1,744 @@
1
+ import json
2
+ import plotly.express as px
3
+ from utils.pareto import Agent, compute_pareto_frontier
4
+ import plotly.graph_objects as go
5
+ import textwrap
6
+ import numpy as np
7
+ import pandas as pd
8
+ from scipy import stats
9
+
10
+
11
+ def create_leaderboard(df, ci_metrics = None):
12
+ # cast dtypes to string
13
+ df = df.astype(str)
14
+
15
+ # for each metric join metric and metric CI columns
16
+ if ci_metrics:
17
+ for metric in ci_metrics:
18
+ CI_metric = metric + ' CI'
19
+ # for rows in the df for which CI metric is not None, join the metric and CI columns by looping through the CI metrics columns
20
+ for i, row in df.iterrows():
21
+ if str(row[CI_metric]) != 'None':
22
+ df.at[i, metric] = str(row[metric]) + " (" + str(row[CI_metric]) + ")"
23
+
24
+ return df
25
+
26
+ def create_task_success_heatmap(df, benchmark_name):
27
+
28
+ # Calculate agent accuracy
29
+ agent_accuracy = df.groupby('Agent Name')['Success'].mean().sort_values(ascending=False)
30
+
31
+ # Calculate task success rate
32
+ task_success_rate = df.groupby('Task ID')['Success'].mean().sort_values(ascending=False)
33
+
34
+ # Pivot the dataframe to create a matrix of agents vs tasks
35
+ pivot_df = df.pivot(index='Agent Name', columns='Task ID', values='Success')
36
+
37
+ # Sort the pivot table
38
+ pivot_df = pivot_df.reindex(index=agent_accuracy.index, columns=task_success_rate.index)
39
+
40
+ # Calculate tasks solved across all agents
41
+ tasks_solved = (pivot_df.sum(axis=0) > 0).astype(int)
42
+ # Total number of tasks (columns)
43
+ total_tasks = len(pivot_df.columns)
44
+ if 'SWE-bench' in benchmark_name:
45
+ total_tasks = 50 # TODO - remove hardcoding
46
+
47
+ # Add the new row to the pivot table
48
+ tasks_solved_df = pd.DataFrame(tasks_solved).T
49
+ tasks_solved_df.index = [f'<b>Tasks Solved: {tasks_solved.sum()}/{total_tasks} (Any Agent)</b>']
50
+ # print number of tasks solved
51
+ pivot_df = pd.concat([pivot_df, tasks_solved_df])
52
+
53
+ num_agents = len(pivot_df.index)
54
+ row_height = 30 # Fixed height for each row in pixels
55
+ total_height = num_agents * row_height
56
+
57
+ # Create a custom colorscale
58
+ colorscale=[[0, 'white'], [1, '#3498db']]
59
+
60
+ # Create the heatmap
61
+ fig = go.Figure(data=go.Heatmap(
62
+ z=pivot_df.values,
63
+ y=pivot_df.index,
64
+ x=pivot_df.columns,
65
+ colorscale=colorscale,
66
+ showscale=False,
67
+ hovertemplate='<b>Agent:</b> %{y}<br>' +
68
+ '<b>Task:</b> %{x}<br>' +
69
+ '<b>Status:</b> %{z}<extra></extra>'
70
+ ))
71
+
72
+ # Update the layout
73
+ fig.update_layout(
74
+ xaxis_title='Task ID',
75
+ height=total_height + 50, # Add extra space for the new row
76
+ yaxis=dict(
77
+ autorange='reversed',
78
+ showticklabels=True,
79
+ showline=True,
80
+ linecolor='black',
81
+ showgrid=False
82
+ ),
83
+ xaxis=dict(
84
+ side='top',
85
+ showticklabels=False,
86
+ showline=True,
87
+ linecolor='black',
88
+ showgrid=False
89
+ ),
90
+ plot_bgcolor='white',
91
+ paper_bgcolor='white',
92
+ hoverlabel=dict(
93
+ bgcolor="white",
94
+ font_size=12,
95
+ font_family="Arial"
96
+ ),
97
+ modebar=dict(
98
+ activecolor='#1f77b4',
99
+ orientation='h',
100
+ bgcolor='rgba(255,255,255,0.8)',
101
+ color='#777',
102
+ add=['pan2d'],
103
+ remove=[
104
+ 'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
105
+ 'hoverClosestCartesian', 'hoverCompareCartesian',
106
+ 'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
107
+ ]
108
+ ),
109
+ dragmode='pan'
110
+ )
111
+
112
+ return fig
113
+
114
+ def create_bar_chart(categories, values, x_label, y_label, title):
115
+ # Sort categories and values based on values in descending order
116
+ sorted_data = sorted(zip(categories, values), key=lambda x: x[1], reverse=True)
117
+ categories, values = zip(*sorted_data)
118
+
119
+ # get total number of tasks
120
+ total_tasks = sum(values)
121
+
122
+ text_labels = [f"({value/total_tasks:.1%} of failures)" for value in values]
123
+
124
+
125
+ fig = go.Figure(data=[go.Bar(
126
+ y=categories,
127
+ x=values,
128
+ orientation='h',
129
+ marker_color='#3498db', # Same color as the scatter plot
130
+ text=text_labels,
131
+ textposition='auto',
132
+ customdata=[f'{value} tasks ({value/total_tasks:.1%} of failures)' for value in values],
133
+ textfont=dict(color='black', size=14, family='Arial', weight=2),
134
+ hovertemplate='<b>%{y}</b><br>' +
135
+ 'Affected Tasks: %{customdata}<extra></extra>'
136
+ )])
137
+
138
+ fig.update_layout(
139
+ height=600,
140
+ xaxis=dict(
141
+ showline=True,
142
+ linecolor='black',
143
+ showgrid=False
144
+ ),
145
+ yaxis=dict(
146
+ showline=True,
147
+ linecolor='black',
148
+ showgrid=False,
149
+ autorange="reversed" # This will put the category with the highest value at the top
150
+ ),
151
+ plot_bgcolor='white',
152
+ paper_bgcolor='white',
153
+ bargap=0.2,
154
+ bargroupgap=0.1,
155
+ hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
156
+ modebar=dict(
157
+ activecolor='#1f77b4',
158
+ orientation='h',
159
+ bgcolor='rgba(255,255,255,0.8)',
160
+ color='#777',
161
+ add=['pan2d'],
162
+ remove=[
163
+ 'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
164
+ 'hoverClosestCartesian', 'hoverCompareCartesian',
165
+ 'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
166
+ ]
167
+ ),
168
+ dragmode='pan'
169
+ )
170
+
171
+ return fig
172
+
173
+ def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
174
+ # agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
175
+ # instead of creating one Agent object for each row, we can create one Agent object for each unique agent and use the mean of the cost and accuracy values
176
+ unique_agents = df['Agent Name'].unique()
177
+ agents = [Agent(df[df['Agent Name'] == agent]['Total Cost'].mean(), df[df['Agent Name'] == agent]['Accuracy'].mean()) for agent in unique_agents]
178
+
179
+ pareto_frontier = compute_pareto_frontier(agents)
180
+
181
+ fig = go.Figure()
182
+
183
+ # Sort the Pareto frontier points by x-coordinate
184
+ pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
185
+ # Add the Pareto frontier line
186
+ fig.add_trace(go.Scatter(
187
+ x=[point[0] for point in pareto_points],
188
+ y=[point[1] for point in pareto_points],
189
+ mode='lines',
190
+ name='Pareto Frontier',
191
+ hoverinfo=None,
192
+ line=dict(color='black', width=1, dash='dash')
193
+ ))
194
+
195
+ # Plot scatter points and error bars for each agent
196
+ unique_agents = df[hover_data[0]].unique()
197
+ for agent in unique_agents:
198
+ agent_data = df[df[hover_data[0]] == agent]
199
+
200
+ x_value = [np.mean(agent_data[x].values)]
201
+ y_value = [np.mean(agent_data[y].values)]
202
+
203
+ if len(agent_data) > 1:
204
+ # Calculate 95% confidence intervals
205
+ # ci_x = stats.t.interval(0.95, len(agent_data[x])-1, loc=np.mean(agent_data[x]), scale=stats.sem(agent_data[x]))
206
+ # ci_y = stats.t.interval(0.95, len(agent_data[y])-1, loc=np.mean(agent_data[y]), scale=stats.sem(agent_data[y]))
207
+
208
+ # # Add error bars for x (cost)
209
+ # fig.add_trace(go.Scatter(
210
+ # x=x_value,
211
+ # y=y_value,
212
+ # error_x=dict(
213
+ # type='data',
214
+ # symmetric=False,
215
+ # array=[ci_x[1] - x_value],
216
+ # arrayminus=[x_value - ci_x[0]],
217
+ # color='red',
218
+ # ),
219
+ # mode='markers',
220
+ # marker=dict(color='rgba(0,0,0,0)'),
221
+ # showlegend=False,
222
+ # hoverinfo='none'
223
+ # ))
224
+
225
+ # # Add error bars for y (accuracy)
226
+ # fig.add_trace(go.Scatter(
227
+ # x=x_value,
228
+ # y=y_value,
229
+ # error_y=dict(
230
+ # type='data',
231
+ # symmetric=False,
232
+ # array=[ci_y[1] - y_value],
233
+ # arrayminus=[y_value - ci_y[0]],
234
+ # color='green',
235
+ # ),
236
+ # mode='markers',
237
+ # marker=dict(color='rgba(0,0,0,0)'),
238
+ # showlegend=False,
239
+ # hoverinfo='none'
240
+ # ))
241
+
242
+ # Add error bars for x (cost minmax)
243
+ fig.add_trace(go.Scatter(
244
+ x=x_value,
245
+ y=y_value,
246
+ error_x=dict(
247
+ type='data',
248
+ symmetric=False,
249
+ array=[np.max(agent_data[x]) - x_value],
250
+ arrayminus=[x_value - np.min(agent_data[x])],
251
+ color='#fec44f',
252
+ ),
253
+ mode='markers',
254
+ marker=dict(color='rgba(0,0,0,0)', opacity=0),
255
+ showlegend=False,
256
+ hoverinfo=None
257
+ ))
258
+
259
+ # Add error bars for y (accuracy minmax)
260
+ fig.add_trace(go.Scatter(
261
+ x=x_value,
262
+ y=y_value,
263
+ error_y=dict(
264
+ type='data',
265
+ symmetric=False,
266
+ array=[np.max(agent_data[y]) - y_value],
267
+ arrayminus=[y_value - np.min(agent_data[y])],
268
+ color='#bdbdbd',
269
+ ),
270
+ mode='markers',
271
+ marker=dict(color='rgba(0,0,0,0)', opacity=0),
272
+ showlegend=False,
273
+ hoverinfo=None
274
+ ))
275
+
276
+ # Add scatter points for this agent
277
+ fig.add_trace(go.Scatter(
278
+ x=x_value,
279
+ y=y_value,
280
+ mode='markers',
281
+ marker=dict(size=10, color='#3498db'),
282
+ customdata=agent_data[hover_data],
283
+ showlegend=False,
284
+ hovertemplate="<br>".join([
285
+ "<b>Agent</b>: %{customdata[0]}",
286
+ "<b>Total Cost</b>: $%{x:.1f}",
287
+ "<b>Accuracy</b>: %{y:.1%}<extra></extra>",
288
+ ]),
289
+ hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
290
+ ))
291
+
292
+
293
+
294
+ # Add legend entries for error bars
295
+ # fig.add_trace(go.Scatter(
296
+ # x=[None], y=[None], mode='markers',
297
+ # marker=dict(color='red', size=10),
298
+ # name='Cost CI (95%)'
299
+ # ))
300
+ # fig.add_trace(go.Scatter(
301
+ # x=[None], y=[None], mode='markers',
302
+ # marker=dict(color='green', size=10),
303
+ # name='Accuracy CI (95%)'
304
+ # ))
305
+
306
+ # Add legend entries for error bars
307
+ fig.add_trace(go.Scatter(
308
+ x=[None], y=[None], mode='markers',
309
+ marker=dict(color='#fec44f', size=10),
310
+ name='Cost CI (Min-Max)'
311
+ ))
312
+ fig.add_trace(go.Scatter(
313
+ x=[None], y=[None], mode='markers',
314
+ marker=dict(color='#bdbdbd', size=10),
315
+ name='Accuracy CI (Min-Max)'
316
+ ))
317
+
318
+ fig.update_layout(
319
+ height = 600,
320
+ xaxis_title = x_label,
321
+ yaxis_title = y_label,
322
+ xaxis = dict(
323
+ showline = True,
324
+ linecolor = 'black',
325
+ showgrid = False),
326
+ yaxis = dict(
327
+ showline = True,
328
+ showgrid = False,
329
+ linecolor = 'black'),
330
+ plot_bgcolor = 'white',
331
+ legend=dict(
332
+ yanchor="bottom",
333
+ y=0.01,
334
+ xanchor="right",
335
+ x=0.98,
336
+ bgcolor="rgba(255, 255, 255, 0.5)" # semi-transparent white background
337
+ ),
338
+ modebar=dict(
339
+ activecolor='#1f77b4', # Color of active tool
340
+ orientation='h', # Horizontal orientation
341
+ bgcolor='rgba(255,255,255,0.8)', # Slightly transparent white background
342
+ color='#777', # Color of inactive tools
343
+ add = ['pan2d'],
344
+ remove = [
345
+ 'zoom2d',
346
+ 'zoomIn2d',
347
+ 'zoomOut2d',
348
+ 'resetScale2d',
349
+ 'hoverClosestCartesian',
350
+ 'hoverCompareCartesian',
351
+ 'toggleSpikelines',
352
+ 'lasso2d',
353
+ 'lasso',
354
+ 'select2d',
355
+ 'select']
356
+ ),
357
+ dragmode='pan'
358
+ )
359
+
360
+ fig.update_yaxes(rangemode="tozero")
361
+ fig.update_xaxes(rangemode="tozero")
362
+
363
+ return fig
364
+ # def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
365
+ # agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
366
+ # pareto_frontier = compute_pareto_frontier(agents)
367
+
368
+ # fig = go.Figure()
369
+
370
+ # # Function to generate points for error ellipse
371
+ # def error_ellipse(x_center, y_center, x_radius, y_radius, angle, n=50):
372
+ # t = np.linspace(0, 2*np.pi, n)
373
+ # x = x_radius * np.cos(t)
374
+ # y = y_radius * np.sin(t)
375
+ # rotation = np.array([[np.cos(angle), -np.sin(angle)],
376
+ # [np.sin(angle), np.cos(angle)]])
377
+ # xy = np.dot(rotation, np.array([x, y]))
378
+ # return x_center + xy[0], y_center + xy[1]
379
+
380
+ # # Create a color map for agents
381
+ # unique_agents = df['Agent Name'].unique()
382
+ # colors = px.colors.qualitative.Plotly
383
+ # color_map = {agent: colors[i % len(colors)] for i, agent in enumerate(unique_agents)}
384
+
385
+ # # Add scatter points and error ellipses for each agent
386
+ # for agent in unique_agents:
387
+ # agent_data = df[df['Agent Name'] == agent]
388
+
389
+ # # Add scatter points
390
+ # fig.add_trace(go.Scatter(
391
+ # x=agent_data[x],
392
+ # y=agent_data[y],
393
+ # mode='markers',
394
+ # name=agent,
395
+ # marker=dict(size=10, color=color_map[agent]),
396
+ # customdata=agent_data[hover_data] if hover_data else None,
397
+ # hovertemplate="<br>".join([
398
+ # f"<b>Agent</b>: {agent}",
399
+ # f"<b>{x}</b>: ${{x:.1f}}",
400
+ # f"<b>{y}</b>: {{y:.1%}}",
401
+ # ] + ([f"<b>{col}</b>: {{customdata[{i}]}}" for i, col in enumerate(hover_data)] if hover_data else []))
402
+ # ))
403
+
404
+ # # Calculate mean and standard deviation for x and y
405
+ # x_mean = agent_data[x].mean()
406
+ # y_mean = agent_data[y].mean()
407
+ # x_std = agent_data[x].std()
408
+ # y_std = agent_data[y].std()
409
+
410
+ # # Calculate correlation coefficient
411
+ # corr = agent_data[x].corr(agent_data[y])
412
+
413
+ # # Add error ellipses (1 and 2 standard deviations)
414
+ # for n_std, opacity in [(1, 0.5), (2, 0.5)]:
415
+ # chi2_val = chi2.ppf(0.68 if n_std == 1 else 0.95, 2)
416
+ # x_radius = np.sqrt(chi2_val) * x_std
417
+ # y_radius = np.sqrt(chi2_val) * y_std
418
+ # angle = np.arctan2(y_std * corr, x_std)
419
+
420
+ # ellipse_x, ellipse_y = error_ellipse(x_mean, y_mean, x_radius, y_radius, angle)
421
+
422
+ # fig.add_shape(type="path",
423
+ # path=f"M {ellipse_x[0]}, {ellipse_y[0]} " +
424
+ # " ".join([f"L{x},{y}" for x, y in zip(ellipse_x[1:], ellipse_y[1:])]) +
425
+ # " Z",
426
+ # line_color=color_map[agent],
427
+ # line_width=2,
428
+ # opacity=opacity,
429
+ # layer="below")
430
+
431
+ # # Sort the Pareto frontier points by x-coordinate
432
+ # pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
433
+
434
+ # # Add the Pareto frontier line
435
+ # fig.add_trace(go.Scatter(
436
+ # x=[point[0] for point in pareto_points],
437
+ # y=[point[1] for point in pareto_points],
438
+ # mode='lines',
439
+ # name='Pareto Frontier',
440
+ # line=dict(color='black', width=1, dash='dash')
441
+ # ))
442
+
443
+ # fig.update_layout(
444
+ # height = 600,
445
+ # xaxis_title = x_label,
446
+ # yaxis_title = y_label,
447
+ # xaxis = dict(
448
+ # showline = True,
449
+ # linecolor = 'black',
450
+ # showgrid = False),
451
+ # yaxis = dict(
452
+ # showline = True,
453
+ # showgrid = False,
454
+ # linecolor = 'black'),
455
+ # plot_bgcolor = 'white',
456
+ # legend=dict(
457
+ # yanchor="bottom",
458
+ # y=0.01,
459
+ # xanchor="right",
460
+ # x=0.98,
461
+ # bgcolor="rgba(255, 255, 255, 0.5)"
462
+ # ),
463
+ # modebar=dict(
464
+ # activecolor='#1f77b4',
465
+ # orientation='h',
466
+ # bgcolor='rgba(255,255,255,0.8)',
467
+ # color='#777',
468
+ # add = ['pan2d'],
469
+ # remove = [
470
+ # 'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
471
+ # 'hoverClosestCartesian', 'hoverCompareCartesian',
472
+ # 'toggleSpikelines', 'lasso2d', 'lasso',
473
+ # 'select2d', 'select'
474
+ # ]
475
+ # ),
476
+ # dragmode='pan'
477
+ # )
478
+
479
+ # fig.update_yaxes(rangemode="tozero")
480
+ # fig.update_xaxes(rangemode="tozero")
481
+
482
+ # return fig
483
+
484
+ # def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
485
+ # agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
486
+ # pareto_frontier = compute_pareto_frontier(agents)
487
+
488
+ # fig = px.scatter(df,
489
+ # x=x,
490
+ # y=y,
491
+ # custom_data=hover_data)
492
+ # fig.update_traces(
493
+ # hovertemplate="<br>".join([
494
+ # "<b>Agent</b>: %{customdata[0]}",
495
+ # "<b>Total Cost</b>: $%{x:.1f}",
496
+ # "<b>Accuracy</b>: %{y:.1%}",
497
+ # ])
498
+ # )
499
+
500
+ # fig.update_traces(marker=dict(size=10, color='#3498db'),
501
+ # hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),)
502
+
503
+
504
+ # # Sort the Pareto frontier points by x-coordinate
505
+ # pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
506
+
507
+ # # Add the Pareto frontier line
508
+ # fig.add_trace(go.Scatter(
509
+ # x=[point[0] for point in pareto_points],
510
+ # y=[point[1] for point in pareto_points],
511
+ # mode='lines',
512
+ # name='Pareto Frontier',
513
+ # line=dict(color='black', width=1, dash='dash')
514
+ # ))
515
+
516
+ # fig.update_layout(
517
+ # # width = 1150,
518
+ # height = 600,
519
+ # xaxis_title = x_label,
520
+ # yaxis_title = y_label,
521
+ # xaxis = dict(
522
+ # showline = True,
523
+ # linecolor = 'black',
524
+ # showgrid = False),
525
+ # yaxis = dict(
526
+ # showline = True,
527
+ # showgrid = False,
528
+ # linecolor = 'black'),
529
+ # plot_bgcolor = 'white',
530
+ # # Legend positioning
531
+ # legend=dict(
532
+ # yanchor="bottom",
533
+ # y=0.01,
534
+ # xanchor="right",
535
+ # x=0.98,
536
+ # bgcolor="rgba(255, 255, 255, 0.5)" # semi-transparent white background
537
+ # ),
538
+ # modebar=dict(
539
+ # activecolor='#1f77b4', # Color of active tool
540
+ # orientation='h', # Vertical orientation
541
+ # bgcolor='rgba(255,255,255,0.8)', # Slightly transparent white background
542
+ # color='#777', # Color of inactive tools
543
+ # add = ['pan2d'],
544
+ # remove = [
545
+ # 'zoom2d',
546
+ # 'zoomIn2d',
547
+ # 'zoomOut2d',
548
+ # 'resetScale2d',
549
+ # 'hoverClosestCartesian',
550
+ # 'hoverCompareCartesian',
551
+ # 'toggleSpikelines',
552
+ # 'lasso2d',
553
+ # 'lasso',
554
+ # 'select2d',
555
+ # 'select']
556
+ # ),
557
+ # dragmode='pan'
558
+ # )
559
+
560
+ # fig.update_yaxes(rangemode="tozero")
561
+ # fig.update_xaxes(rangemode="tozero")
562
+
563
+ # return fig
564
+
565
+
566
+ import plotly.graph_objects as go
567
+ import textwrap
568
+
569
+ def create_flow_chart(steps):
570
+ node_x = []
571
+ node_y = []
572
+ edge_x = []
573
+ edge_y = []
574
+ node_text = []
575
+ hover_text = []
576
+ node_colors = []
577
+ node_shapes = []
578
+
579
+ # Define color and shape mappings
580
+ color_map = {True: 'green', False: 'red'} # True for success, False for challenges
581
+ shape_map = {
582
+ 'plan': 'octagon',
583
+ 'tool': 'square',
584
+ 'retrieve': 'diamond',
585
+ 'other': 'circle'
586
+ }
587
+
588
+ for i, step in enumerate(steps):
589
+ node_x.append(i)
590
+ node_y.append(0)
591
+
592
+ # Extract Description, Assessment, and new attributes
593
+ analysis = step['analysis']
594
+ if isinstance(analysis, str):
595
+ try:
596
+ analysis = json.loads(analysis)
597
+ except json.JSONDecodeError:
598
+ analysis = {}
599
+
600
+ description = analysis.get('description', 'No description available.')
601
+ assessment = analysis.get('assessment', 'No assessment available.')
602
+ success = analysis.get('success', True) # Assuming True if not specified
603
+ # action_type = analysis.get('action_type', 'other') # Default to 'other' if not specified
604
+ step_headline = analysis.get('headline', '')
605
+
606
+ # Set node color and shape based on attributes
607
+ node_colors.append(color_map[success])
608
+ # node_shapes.append(shape_map.get(action_type, 'circle'))
609
+
610
+ # Wrap text to improve readability
611
+ wrapped_description = '<br>'.join(textwrap.wrap(description, width=90, max_lines=20))
612
+ wrapped_assessment = '<br>'.join(textwrap.wrap(assessment, width=90, max_lines=10))
613
+ wrapped_outline = textwrap.shorten(step_headline, width=50, placeholder='')
614
+ wrapped_outline = '' if wrapped_outline == '' else f": {wrapped_outline}"
615
+
616
+ node_text_outline = '' if wrapped_outline == '' else f":<br>{'<br>'.join(textwrap.wrap(step_headline, width=30, placeholder=''))}"
617
+ node_text.append(f"Step {i+1}{node_text_outline}")
618
+
619
+ # Create formatted hover text without indentation
620
+ hover_info = f"<b>Step {i+1}{wrapped_outline}</b><br><br>" \
621
+ f"<b>Description:</b><br>" \
622
+ f"{wrapped_description}<br><br>" \
623
+ # f"<b>Assessment:</b><br>" \
624
+ # f"{wrapped_assessment}<br><br>" \
625
+ # f"<b>Successful:</b> {'Yes' if success else 'No'}<br>" \
626
+ # f"<b>Action Type:</b> {action_type.capitalize()}"
627
+ hover_text.append(hover_info)
628
+
629
+ if i > 0:
630
+ edge_x.extend([i-1, i, None])
631
+ edge_y.extend([0, 0, None])
632
+
633
+ node_trace = go.Scatter(
634
+ x=node_x, y=node_y,
635
+ mode='markers+text',
636
+ text=node_text,
637
+ textposition="top center",
638
+ showlegend=False,
639
+ hovertext=hover_text,
640
+ hoverinfo='text',
641
+ hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
642
+ marker=dict(
643
+ # color=node_colors,
644
+ color='#3498db',
645
+ size=30,
646
+ line_width=2,
647
+ # symbol=node_shapes
648
+ ))
649
+
650
+ edge_trace = go.Scatter(
651
+ x=edge_x, y=edge_y,
652
+ line=dict(width=2, color='#888'),
653
+ hoverinfo='none',
654
+ showlegend=False,
655
+ mode='lines')
656
+
657
+ # Create legend traces
658
+ legend_traces = []
659
+
660
+ # # Color legend
661
+ # for success, color in color_map.items():
662
+ # legend_traces.append(go.Scatter(
663
+ # x=[None], y=[None],
664
+ # mode='markers',
665
+ # marker=dict(size=10, color=color),
666
+ # showlegend=True,
667
+ # name=f"{'Success' if success else 'Issue'}"
668
+ # ))
669
+
670
+ # # Shape legend
671
+ # for action, shape in shape_map.items():
672
+ # legend_traces.append(go.Scatter(
673
+ # x=[None], y=[None],
674
+ # mode='markers',
675
+ # marker=dict(size=10, symbol=shape, color='gray'),
676
+ # showlegend=True,
677
+ # name=f"{action.capitalize()}"
678
+ # ))
679
+
680
+ # Combine all traces
681
+ all_traces = [edge_trace, node_trace] + legend_traces
682
+
683
+ layout = go.Layout(
684
+ showlegend=True,
685
+ hovermode='closest',
686
+ margin=dict(b=20,l=5,r=5,t=40),
687
+ xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
688
+ yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
689
+ plot_bgcolor='white',
690
+ paper_bgcolor='white',
691
+ modebar=dict(
692
+ activecolor='#1f77b4', # Color of active tool
693
+ orientation='h', # Horizontal orientation
694
+             bgcolor='rgba(255,255,255,0.8)',  # Slightly transparent white background
+             color='#777',  # Color of inactive tools
+         ),
+         legend=dict(
+             orientation="h",
+             yanchor="bottom",
+             y=0.02,
+             xanchor="right",
+             x=1,
+             bgcolor='rgba(255,255,255,0.8)',
+             bordercolor='rgba(0,0,0,0.1)',
+             borderwidth=1
+         ),
+     )
+
+     fig = go.Figure(data=all_traces, layout=layout)
+
+     fig.update_layout(legend=dict(
+             orientation="h",
+             yanchor="bottom",
+             y=1.02,
+             xanchor="right",
+             x=1,
+             bgcolor='rgba(255,255,255,0.8)',  # Set legend background to slightly transparent white
+             bordercolor='rgba(0,0,0,0.1)',  # Add a light border to the legend
+             borderwidth=1
+         ),
+         dragmode='pan'
+     )
+
+     config = {
+         'add': ['pan2d'],
+         'remove': [
+             'zoom2d',
+             'zoomIn2d',
+             'zoomOut2d',
+             'resetScale2d',
+             'hoverClosestCartesian',
+             'hoverCompareCartesian',
+             'toggleSpikelines',
+             'lasso2d',
+             'lasso',
+             'select2d',
+             'select',
+         ]
+     }
+
+     # Apply the config to the figure
+     fig.update_layout(modebar=config)
+
+     return fig
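
The function above returns a Plotly figure describing an agent run's step-by-step flow. As a rough sketch of how such a figure could be surfaced in the Gradio app, the snippet below wires a placeholder figure builder into a `gr.Plot` component; the helper name `build_flow_figure`, the dropdown choices, and the wiring are illustrative assumptions, not part of this commit.

```python
# Illustrative sketch (assumed wiring, not from this commit): rendering a
# step-flow figure like the one built above inside a Gradio Blocks app.
import gradio as gr
import plotly.graph_objects as go


def build_flow_figure(run_name: str) -> go.Figure:
    # Stand-in for the real flow-chart builder; draws a two-step chain only
    # so that this sketch is self-contained and runnable.
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 0], mode="lines",
                             line=dict(width=2, color="#888"),
                             hoverinfo="none", showlegend=False))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 0], mode="markers+text",
                             text=["Step 1", "Step 2"], textposition="top center",
                             hovertext=[f"{run_name}: step 1", f"{run_name}: step 2"],
                             hoverinfo="text", showlegend=False,
                             marker=dict(color="#3498db", size=30, line_width=2)))
    fig.update_layout(xaxis=dict(visible=False), yaxis=dict(visible=False),
                      plot_bgcolor="white", paper_bgcolor="white", dragmode="pan")
    return fig


with gr.Blocks() as demo:
    run_selector = gr.Dropdown(choices=["example run A", "example run B"],
                               value="example run A", label="Agent run")
    flow_plot = gr.Plot(value=build_flow_figure("example run A"), label="Agent step flow")
    run_selector.change(build_flow_figure, inputs=run_selector, outputs=flow_plot)

if __name__ == "__main__":
    demo.launch()
```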
verified_agents.yaml ADDED
@@ -0,0 +1,91 @@
+ # This file contains information about verified agent results for different benchmarks.
+ # Format:
+ # benchmark_name:
+ #   - agent_name: "Name of the agent"
+ #     verification_date: YYYY-MM-DD
+
+ usaco:
+   - agent_name: "USACO Reflexion + Episodic (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-20
+   - agent_name: "USACO Reflexion + Episodic + Semantic (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-20
+   - agent_name: "USACO Reflexion (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-20
+   - agent_name: "USACO Episodic (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-12
+   - agent_name: "USACO Reflexion + Semantic (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-20
+   - agent_name: "USACO Zero-shot (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-11
+   - agent_name: "USACO Semantic (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-12
+   - agent_name: USACO Reflexion + Episodic + Semantic (gpt-4o-2024-05-13)
+     verification_date: 2024-08-25
+   - agent_name: USACO Reflexion + Episodic (gpt-4o-2024-05-13)
+     verification_date: 2024-08-25
+   - agent_name: USACO Reflexion + Semantic (gpt-4o-2024-05-13)
+     verification_date: 2024-08-25
+   - agent_name: Episodic Retrial (2x) (gpt-4o-2024-05-13)
+     verification_date: 2024-08-25
+   - agent_name: Episodic Retrial (3x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-25
+   - agent_name: Episodic Retrial (2x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-25
+   - agent_name: Episodic Retrial (5x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-25
+   - agent_name: Episodic Warming (3 Steps) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-24
+   - agent_name: USACO Episodic (gpt-4o-2024-05-13)
+     verification_date: 2024-08-24
+   - agent_name: USACO Semantic (gpt-4o-2024-05-13)
+     verification_date: 2024-08-24
+   - agent_name: Zero-shot Retrial (2x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-24
+   - agent_name: Zero-shot Retrial (3x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-24
+   - agent_name: Zero-shot Retrial (5x) (gpt-4o-mini-2024-07-18)
+     verification_date: 2024-08-24
+   - agent_name: USACO Zero-shot (gpt-4o-2024-05-13)
+     verification_date: 2024-08-24
+
+
+ swebench_verified:
+   - agent_name: "Agentless (gpt-4o-mini-2024-07-18) (50 Instances)"
+     verification_date: 2024-08-17
+   - agent_name: "SWE-agent (gpt-4o-mini-2024-07-18) (Cost Limit: $1) (50 Instances)"
+     verification_date: 2024-08-19
+
+ mlagentbench:
+   - agent_name: "MLAgentBench ResearchAgent (gpt-4o-mini-2024-07-18)"
+     verification_date: 2024-08-19
+
+
+ corebench_easy:
+   - agent_name: "AutoGPT (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "AutoGPT (GPT-4o-mini)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o-mini)"
+     verification_date: 2024-09-28
+
+ corebench_medium:
+   - agent_name: "AutoGPT (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "AutoGPT (GPT-4o-mini)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o-mini)"
+     verification_date: 2024-09-28
+
+ corebench_hard:
+   - agent_name: "AutoGPT (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "AutoGPT (GPT-4o-mini)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o)"
+     verification_date: 2024-09-28
+   - agent_name: "CORE-Agent (GPT-4o-mini)"
+     verification_date: 2024-09-28
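
The header comment in `verified_agents.yaml` describes its schema: a mapping from benchmark name to a list of `agent_name` / `verification_date` entries. Below is a minimal sketch of how the leaderboard could read the file and check whether a result is verified; the helper names and the use of PyYAML are assumptions, not something this commit shows.

```python
# Minimal sketch (assumed helpers, not from this commit): loading
# verified_agents.yaml and checking whether an agent's result is verified.
import yaml  # PyYAML; assumed to be available in the app environment


def load_verified_agents(path: str = "verified_agents.yaml") -> dict:
    """Return {benchmark_name: {agent_name: verification_date}}.

    PyYAML's safe_load parses the unquoted YYYY-MM-DD values as datetime.date.
    """
    with open(path, "r") as f:
        raw = yaml.safe_load(f) or {}
    return {
        benchmark: {entry["agent_name"]: entry["verification_date"]
                    for entry in (entries or [])}
        for benchmark, entries in raw.items()
    }


def is_verified(verified: dict, benchmark: str, agent_name: str) -> bool:
    """True if the (benchmark, agent) pair appears in the verified list."""
    return agent_name in verified.get(benchmark, {})


if __name__ == "__main__":
    verified = load_verified_agents()
    print(is_verified(verified, "usaco", "USACO Zero-shot (gpt-4o-2024-05-13)"))
```

Keeping the verified list in a small, hand-edited YAML file means new verifications can be added via a pull request and reviewed alongside the rest of the leaderboard code.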