Spaces: Running

benediktstroebl committed on
Commit • 7c691e6 • 1 Parent(s): 0b4d0ca
init v1
Browse files
- .gitignore +5 -0
- README.md +5 -5
- about.md +50 -0
- agent_monitor/failure_report.py +277 -0
- agent_monitor/monitor.py +140 -0
- agent_performance_analysis.json +38 -0
- agent_submission.md +8 -0
- agent_submission_core.md +29 -0
- app.py +1458 -0
- benchmark_submission.md +5 -0
- config.py +56 -0
- css.css +48 -0
- envs.py +10 -0
- hal.ico +0 -0
- hal.png +0 -0
- header.md +3 -0
- requirements.txt +108 -0
- utils/test.txt → scratch.ipynb +0 -0
- scratch.py +38 -0
- utils/data.py +296 -0
- utils/db.py +361 -0
- utils/pareto.py +38 -0
- utils/processing.py +141 -0
- utils/viz.py +744 -0
- verified_agents.yaml +91 -0
.gitignore
ADDED
@@ -0,0 +1,5 @@
.DS_Store
**/__pycache__/**
evals_upload/*
evals_live/*
evals_processed/*
README.md
CHANGED
@@ -1,10 +1,10 @@
 ---
-title: Leaderboard
-emoji:
-colorFrom:
-colorTo:
+title: Agent Leaderboard
+emoji: π
+colorFrom: blue
+colorTo: pink
 sdk: gradio
-sdk_version:
+sdk_version: 4.40.0
 app_file: app.py
 pinned: false
 ---
about.md
ADDED
@@ -0,0 +1,50 @@
# HAL: Holistic Agent Leaderboard

Imagine you run a travel agency that wants to adopt an AI agent to automate customer bookings and improve efficiency. How would you choose a product?

And what if you are an agent developer building an agent that can browse the web and book hotels and tickets for entire travel itineraries? How do you compare it against past agents?

Or suppose a group of independent researchers claims to have built a generalist agent that can automate offensive web tasks such as carrying out DDoS attacks. How seriously should we take their claims?

As things stand today, none of the stakeholders described above (customers, agent developers, safety researchers) can properly judge the evidence for AI agent capabilities, for several reasons:

- Agent developers often build their own evaluation harnesses for agent benchmarks, making true apples-to-apples comparisons difficult.
- Many agent benchmarks lack a centralized leaderboard. Even when one exists, it usually does not verify that the agents listed on it are implemented correctly.
- Most importantly, current leaderboards do not report the cost of running agents. Cost is crucial for downstream customers deciding whether to adopt an agent, and for understanding an agent's safety implications: which adversaries can afford access to it, and for how long.

[In our recent paper](https://arxiv.org/abs/2407.01502), we showed that AI agent evaluations fall drastically short of the principles of good evaluation, making it hard to verify claims of real-world performance based on benchmark results.

The Holistic Agent Leaderboard aims to address these widespread limitations of current agent evaluation. We will develop a platform to standardize agent evaluations and make it easy to measure agent performance on consequential real-world tasks.

## We have been here before

For language model evaluations, centralized leaderboards like HELM and the Open LLM Leaderboard have proven essential, as have tools for conducting standardized evaluations, such as lm-eval-harness.

These tools allow downstream users of language models, model developers, and safety researchers to compare model performance across multiple benchmarks that capture different competencies.

We aim to do something similar for agent evaluation.

## How agent evaluations differ from model evaluations

For model evaluation, standardizing elements of the input prompt can be useful to ensure models compete on an equal footing. For example, zero-shot vs. few-shot prompting can lead to qualitatively different performance.

For agents, modifications to the system prompt (along with other system designs such as retrying multiple times, using a verifier, or majority voting) are features, since these methods have been shown to solve real-world tasks of interest more effectively.

As a side effect, agents can vary wildly in how much they cost and how long they take to run, whereas comparing the cost and time of running bare models is relatively straightforward. Understanding the cost and time required to run an agent is key to determining whether its design improves on simple baselines (such as running the same model multiple times).

In other words, we are moving away from evaluating AI on a one-dimensional leaderboard and toward a Pareto frontier that considers both performance and cost. Leaderboards are attractive for many reasons (scientifically, to assess capabilities; culturally, to pick winners and losers), but we think there is no meaningful way to collapse these dimensions into one.

## HAL is a third-party, centralized, cost-controlled leaderboard for agent benchmarks

- Centralized: Evaluations across agent benchmarks are all recorded to a single leaderboard that evaluates every listed agent in the same way.
- Third-party: Agent developers have a conflict of interest when reporting accuracy: they want to show state-of-the-art performance. Independent evaluation removes that incentive.
- Cost-controlled: For downstream users, understanding the cost of running agents is a significant need for adoption. For agent developers, cost-controlled evaluations provide accurate baselines (if an agent is state of the art by 0.1% but costs 100x as much, is it really impressive?).

## Who is it for?

We see HAL being useful for four categories of users:

1. Downstream users and procurers of agents: Customers looking to deploy agents can get visibility into existing benchmarks that resemble their tasks of interest, learn which developers are building useful agents (and see agent demos), and identify the state of the art in both cost and accuracy for the tasks they want to solve.
2. Agent benchmark developers: Reporting results on a centralized leaderboard improves the visibility of agent benchmarks that measure real-world utility.
3. Agent developers: HAL allows easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.
4. Safety researchers: Understanding the capabilities of agents on real-world safety threats, as well as the cost required to carry them out, is important for safety research. For example, evaluations on Cybench could show how well agents perform (accuracy) and which adversaries can afford such agents (cost).
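The commit also adds a `utils/pareto.py` helper (not shown in this excerpt). As an illustration of the cost/accuracy Pareto framing above only, and not the contents of that file, here is a minimal sketch; the function name, field names, and numbers are all made up, echoing the "0.1% better at 100x the cost" example.

```python
# Minimal sketch: which agents sit on the cost/accuracy Pareto frontier?
# An agent is on the frontier if no other agent is both cheaper and more accurate.

def pareto_frontier(results: list[dict]) -> list[dict]:
    frontier = []
    for a in results:
        dominated = any(
            b["cost"] <= a["cost"] and b["accuracy"] >= a["accuracy"]
            and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
            for b in results
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda r: r["cost"])

agents = [  # illustrative numbers, not real leaderboard results
    {"name": "single run",    "cost": 1.0,   "accuracy": 0.40},
    {"name": "5 retries",     "cost": 5.0,   "accuracy": 0.48},
    {"name": "complex agent", "cost": 100.0, "accuracy": 0.481},
]
print(pareto_frontier(agents))
# All three lie on the frontier, but the frontier view makes it immediately
# visible that the last agent adds only 0.001 accuracy over "5 retries"
# while costing 20x more.
```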
agent_monitor/failure_report.py
ADDED
@@ -0,0 +1,277 @@
import asyncio
from openai import AsyncOpenAI
from collections import defaultdict
import weave
from pydantic import BaseModel
from abc import ABC, abstractmethod
import json
from typing import Dict, List
from datetime import datetime
import backoff
from openai import APITimeoutError, APIError, RateLimitError


class FailureCategory(BaseModel):
    category_id: int
    category_name: str
    description: str

class FailureCategories(BaseModel):
    failure_categories: list[FailureCategory]

class TaskSummary(BaseModel):
    task_id: str
    summary: str

class TaskClassification(BaseModel):
    task_id: str
    category_id: str
    category_name: str
    explanation: str

class OverallAnalysis(BaseModel):
    failure_categories: List[Dict]
    task_classifications: Dict[str, Dict]
    summary: str

class AsyncLLMClient(ABC):
    @abstractmethod
    async def generate_text(self, prompt, system_message=None, response_format=None):
        pass

# class AsyncOpenAIClient(AsyncLLMClient):
#     def __init__(self, model="gpt-4o-mini"):
#         self.model = model
#         self.client = AsyncOpenAI()

#     async def generate_text(self, prompt, system_message=None, response_format=None):
#         messages = [
#             {"role": "system", "content": system_message or "You are a helpful AI assistant."},
#             {"role": "user", "content": prompt}
#         ]
#         if response_format:
#             response = await self.client.beta.chat.completions.parse(model=self.model, messages=messages, response_format=response_format)
#         else:
#             response = await self.client.chat.completions.create(model=self.model, messages=messages)
#         return response.choices[0].message.content


class AsyncOpenAIClient(AsyncLLMClient):
    def __init__(self, model="gpt-4o-mini", max_tries=5, max_time=300):
        self.model = model
        self.client = AsyncOpenAI()
        # Note: the retry limits actually applied are fixed in the backoff decorator
        # below; the values stored here are only used in the error message.
        self.max_tries = max_tries
        self.max_time = max_time

    @backoff.on_exception(
        backoff.expo,
        (APITimeoutError, APIError, RateLimitError),
        max_tries=10,
        max_time=300
    )
    async def _make_request(self, messages, response_format=None):
        if response_format:
            return await self.client.beta.chat.completions.parse(
                model=self.model,
                messages=messages,
                response_format=response_format
            )
        else:
            return await self.client.chat.completions.create(
                model=self.model,
                messages=messages
            )

    async def generate_text(self, prompt, system_message=None, response_format=None):
        messages = [
            {"role": "system", "content": system_message or "You are a helpful AI assistant."},
            {"role": "user", "content": prompt}
        ]
        try:
            response = await self._make_request(messages, response_format)
            return response.choices[0].message.content
        except Exception as e:
            raise Exception(f"Failed after {self.max_tries} attempts or {self.max_time} seconds: {str(e)}")

def get_weave_calls(client):
    # Pull all logged LLM calls from Weave and flatten them into plain dicts.
    calls = client.calls()
    processed_calls = []
    for call in calls:
        ChatCompletion = weave.ref(call.output).get()
        choices = [choice.message.content for choice in ChatCompletion.choices]
        output = {
            'weave_task_id': call.attributes['weave_task_id'],
            'trace_id': call.trace_id,
            'project_id': call.project_id,
            'created_timestamp': ChatCompletion.created,
            'inputs': dict(call.inputs),
            'id': call.id,
            'outputs': {'choices': choices},
            'exception': call.exception,
            'summary': call.summary,
            'display_name': call.display_name,
            'attributes': dict(call.attributes),
            "_children": call._children,
            '_feedback': call._feedback,
        }
        processed_calls.append(output)
    return processed_calls

async def analyze_agent_performance(processed_calls, failed_tasks: list, llm_client):
    task_calls = defaultdict(list)
    for call in processed_calls:
        if call['weave_task_id'] in failed_tasks:
            task_calls[call['weave_task_id']].append(call)

    for task_id in task_calls:
        task_calls[task_id].sort(key=lambda x: x['created_timestamp'])

    task_summaries = await asyncio.gather(*[summarize_task(task_id, calls, llm_client) for task_id, calls in task_calls.items()])

    failure_categories = await identify_failure_categories(task_summaries, llm_client)
    task_classifications = await classify_tasks(task_summaries, failure_categories, llm_client)
    overall_summary = await generate_overall_summary(failure_categories, task_classifications, llm_client)

    task_classifications = {tc["task_id"]: tc for tc in task_classifications}

    return dict(OverallAnalysis(
        failure_categories=failure_categories,
        task_classifications=task_classifications,
        summary=overall_summary
    ))

async def summarize_task(task_id, calls, llm_client):
    calls_summary = ""
    for i, call in enumerate(calls, 1):
        calls_summary += f"""
        Step {i}:
        Input: {call['inputs']}
        Output: {call['outputs']}
        Timestamp: {datetime.fromtimestamp(call['created_timestamp'])}
        """

    prompt = f"""
    Summarize the AI agent's performance on the following task:
    Task ID: {task_id}
    Number of steps: {len(calls)}

    Detailed steps:
    {calls_summary}

    Provide a brief summary of:
    1. The main goal of the task (inferred from the inputs and outputs)
    2. The agent's approach, including key steps and decisions made
    3. Any significant challenges or errors encountered during the task
    4. The final outcome and why the task failed. Be detailed about the reason for failure.

    Keep the summary concise (around 200 words) but include specific details about the agent's performance and any notable aspects of its problem-solving process.
    """

    system_message = "You are an AI performance analyst tasked with summarizing an AI agent's performance on individual tasks. Focus on the most important aspects of the agent's approach and performance."
    summary = await llm_client.generate_text(prompt, system_message, response_format=TaskSummary)
    return json.loads(summary)

async def identify_failure_categories(task_summaries, llm_client):
    summaries_text = "\n\n".join([f"Task {s['task_id']}:\n{s['summary']}" for s in task_summaries])
    prompt = f"""
    Analyze the following summaries of an AI agent's performance across multiple tasks:

    {summaries_text}

    Identify recurring categories of failures that the agent faces across these tasks. For each category:
    1. Provide a short, descriptive name (max 5 words)
    2. Write a brief description explaining the nature of this failure or challenge category

    Focus on patterns that appear across multiple tasks and represent specific errors that impacted the agent's performance. Make sure that your categories are distinct and cover a range of recurring issues. The categories should not be too general.

    Examples for categories could include:
    Incorrect Implementation - The agent made a change to a reasonable area but its solution didn't correctly address the issue.
    Gave Up Prematurely - The agent decided to stop solving the task after encountering some difficulty.
    Failed Edit Recovery - The agent went into a loop, making recurrent failing edits without recovering.
    """

    system_message = "You are an expert in AI agent analysis, tasked with identifying recurring patterns in agent performance across multiple tasks."
    categories = await llm_client.generate_text(prompt, system_message, response_format=FailureCategories)
    return [dict(category) for category in json.loads(categories)['failure_categories']]

async def classify_tasks(task_summaries, failure_categories, llm_client):
    categories_text = "\n".join([f"{cat['category_id']}. {cat['category_name']}: {cat['description']}" for cat in failure_categories])
    classifications = []

    for task in task_summaries:
        prompt = f"""
        Failure Categories:
        {categories_text}

        Task Summary:
        {task['summary']}

        Classify this task into one of the failure categories listed above. Provide:
        1. The number of the chosen category
        2. A brief explanation of why this category best fits the task's outcome

        If the task doesn't clearly fit any category, you may classify it as "0. Other" and explain why.
        """

        system_message = "You are an AI performance analyst tasked with classifying task outcomes into predefined categories."
        classification = await llm_client.generate_text(prompt, system_message, response_format=TaskClassification)
        classification = json.loads(classification)

        category_number = classification['category_id']
        if str(category_number) == "0":
            category_name = "Other"
        else:
            for cat in failure_categories:
                if str(cat['category_id']) == str(category_number):
                    category_name = cat['category_name']
                    break
            else:
                category_name = "Other"

        explanation = classification['explanation']

        classifications.append(dict(TaskClassification(
            task_id=task['task_id'],
            category_id=category_number,
            category_name=category_name,
            explanation=explanation
        )))

    return classifications

async def generate_overall_summary(failure_categories, task_classifications, llm_client):
    categories_text = "\n".join([f"{cat['category_name']}: {cat['description']}" for cat in failure_categories])

    classifications_text = "\n".join([f"Task {tc['task_id']}: {tc['category_name']}" for tc in task_classifications])

    prompt = f"""
    Failure Categories:
    {categories_text}

    Task Classifications:
    {classifications_text}

    Based on the failure categories identified and the classification of tasks, provide an overall summary of the AI agent's performance across all tasks. Include:
    1. The most common types of failures or challenges
    2. Any patterns in the agent's performance across different tasks
    3. Suggestions for areas of improvement in the agent's design or training

    Keep the summary concise but insightful, focusing on the most significant findings and their implications for AI agent development. Only return the summary itself, without any preceding context.
    """

    system_message = "You are a senior AI researcher tasked with providing a high-level analysis of an AI agent's performance across multiple tasks."
    return await llm_client.generate_text(prompt, system_message)

async def main():
    client = weave.init("citp_agent_eval/usaco_1723148990")
    processed_calls = get_weave_calls(client)

    weave.finish()
    openai_client = AsyncOpenAIClient(model="gpt-4o-mini")
    # The IDs of failed tasks to analyze must be supplied here; an empty list
    # means no tasks are analyzed.
    failed_tasks = []
    overall_analysis = await analyze_agent_performance(processed_calls, failed_tasks, openai_client)

    with open("agent_performance_analysis.json", "w") as f:
        json.dump(overall_analysis, f, indent=4)

if __name__ == "__main__":
    asyncio.run(main())
agent_monitor/monitor.py
ADDED
@@ -0,0 +1,140 @@
import asyncio
from collections import defaultdict
from pydantic import BaseModel
import json

class StepAnalysis(BaseModel):
    description: str
    action_type: str
    assessment: str
    success: bool
    headline: str

class TaskSummary(BaseModel):
    overview: str
    key_successes: str
    main_challenges: str
    overall_assessment: str


async def analyze_agent_steps(processed_calls, llm_client, llm_eval=False):
    task_calls = defaultdict(list)
    for call in processed_calls:
        task_calls[call['weave_task_id']].append(call)

    for task_id in task_calls:
        task_calls[task_id].sort(key=lambda x: x['created_timestamp'])

    tasks = [analyze_task(calls, llm_client, llm_eval) for task_id, calls in task_calls.items()]
    task_analyses = await asyncio.gather(*tasks)

    return dict(zip(task_calls.keys(), task_analyses))

async def analyze_task(calls, llm_client, llm_eval=False):
    if llm_eval:
        step_tasks = [analyze_step(call, i+1, len(calls), llm_client) for i, call in enumerate(calls)]
        steps = await asyncio.gather(*step_tasks)
    else:
        steps = []
        for i, call in enumerate(calls):
            steps.append({
                'call_data': call,
                'analysis': dict(StepAnalysis(
                    description="Not available",
                    action_type='other',
                    success=False,
                    assessment="Not available",
                    headline="Not available"
                ))
            })

    try:
        if llm_eval:
            task_analysis = await summarize_task(steps, llm_client)
            return {
                'steps': steps,
                'task_analysis': task_analysis
            }
        else:
            return {
                'steps': steps,
                'task_analysis': dict(TaskSummary(
                    overview="Not available",
                    key_successes='Not available',
                    main_challenges='Not available',
                    overall_assessment="Not available"
                ))
            }
    except Exception as e:
        print(f"Error in task summarization: {str(e)}")
        # Fall back to the same structure as the successful branches so callers
        # can always access 'steps' and 'task_analysis'.
        return {
            'steps': steps,
            'task_analysis': dict(TaskSummary(
                overview="Not available",
                key_successes='Not available',
                main_challenges='Not available',
                overall_assessment="Not available"
            ))
        }

async def analyze_step(call, step_number, total_steps, llm_client):
    prompt = f"""
    Analyze Step {step_number}/{total_steps} of the AI agent's USACO task solution:
    Input: {call['inputs']}
    Output: {call['outputs']}
    Exception: {call['exception']}
    Summary: {call['summary']}

    Provide a detailed, technical analysis with the following:
    1. Specific Description: Describe precisely what the agent did in this step, including any algorithms, data structures, or problem-solving techniques employed.
    2. Action Classification: Categorize the action as one of:
       - 'plan': Strategizing or outlining an approach
       - 'tool': Using a specific programming construct or algorithm
       - 'retrieve': Accessing or utilizing external information
       - 'other': Any action that doesn't fit the above categories
    3. Technical Evaluation: Assess the technical merit of the agent's approach. Comment on efficiency, correctness, and adherence to USACO problem-solving best practices.
    4. Success: Determine if the agent successfully completed its intended action.
    5. Concise Headline: Write a technically precise headline (max 7 words) that captures the essence of this step.

    Your analysis should be highly specific to this task. Avoid generalities and focus on the technical details of the agent's approach to this particular problem.
    """

    system_message = "You are an expert in AI agent design and evaluation. Analyze the AI agent's actions with the depth and specificity expected in a detailed expert review. Focus on providing insights that would be valuable to an AI researcher specializing in AI agent development."

    analysis = await llm_client.generate_text(prompt, system_message, response_format=StepAnalysis)

    try:
        analysis = json.loads(analysis)
    except json.JSONDecodeError:
        print(f"Error parsing analysis for step {step_number} of {total_steps} in task {call['weave_task_id']}. Using default values.")
        analysis = dict(StepAnalysis(
            description="Analysis failed",
            action_type='other',
            success=False,
            assessment="Unable to assess due to error",
            headline="Analysis failed"
        ))

    return {
        'call_data': call,
        'analysis': analysis
    }

async def summarize_task(steps, llm_client):
    steps_summary = "\n".join([f"Step {i+1}: {step['analysis']}" for i, step in enumerate(steps)])

    prompt = f"""
    Provide a comprehensive analysis of the AI agent's approach to solving this USACO task:

    {steps_summary}

    Your analysis should include:
    1. Technical Overview: Describe the agent's overall problem-solving strategy, highlighting specific actions and techniques used throughout the task.
    2. Key Achievements: Identify and explain the most significant breakthroughs or efficient implementations demonstrated by the agent.
    3. Technical Challenges: Analyze the primary obstacles encountered, focusing on difficulties or conceptual misunderstandings in the context of the task.
    4. Performance Evaluation: Assess the agent's overall performance, considering factors such as time complexity, space efficiency, code quality, and adherence to competitive programming best practices.

    Your summary should be highly technical and specific to this task. Assume the reader is an expert as well and familiar with the task context. Focus on providing insights that would be valuable to an AI researcher specializing in AI agent development.
    """

    system_message = "You are an expert AI performance analyst, skilled in evaluating and summarizing AI agent task execution. You are specialized in providing analyses to support AI researchers to develop AI agents."
    analysis = await llm_client.generate_text(prompt, system_message, response_format=TaskSummary)
    return json.loads(analysis)
agent_performance_analysis.json
ADDED
@@ -0,0 +1,38 @@
{
    "failure_categories": [
        {
            "category_id": 1,
            "category_name": "Algorithm Implementation Issues",
            "description": "The agent occasionally struggles with implementing the correct algorithms for given tasks, often leading to inefficiencies or logical errors in output."
        },
        {
            "category_id": 2,
            "category_name": "Input Validation Failures",
            "description": "Issues with handling unexpected or malformed inputs arise, resulting in crashes or incorrect results, indicating a lack of robustness in input handling."
        },
        {
            "category_id": 3,
            "category_name": "Inadequate Commenting and Documentation",
            "description": "The agent sometimes fails to adequately comment or document the code, making it harder to understand the thought process and logic behind implementations, especially for complex tasks."
        },
        {
            "category_id": 4,
            "category_name": "Test Case Coverage Gaps",
            "description": "The agent frequently misses edge cases or does not sufficiently test various scenarios, resulting in incomplete solutions that may fail under certain conditions."
        },
        {
            "category_id": 5,
            "category_name": "Problem Decomposition Difficulties",
            "description": "Challenges in effectively breaking down complex problems into manageable steps can lead to oversight and incomplete solution strategies."
        }
    ],
    "task_classifications": [
        {
            "task_id": "1333_platinum_good_bitstrings",
            "category_id": "0",
            "category_name": "Success/Other",
            "explanation": "The task was successfully completed with clear, well-structured steps and good documentation in the Python implementation. There were no significant challenges or errors encountered, indicating effective handling of the problem without falling into any of the predefined failure categories."
        }
    ],
    "summary": "### Overall Summary of AI Agent's Performance\n\n**1. Common Types of Failures:**\nThe AI agent exhibits several recurring issues that hinder its performance:\n- **Algorithm Implementation Issues:** The agent frequently implements algorithms incorrectly, resulting in inefficiencies and logical inconsistencies. This indicates a need for improved algorithm comprehension and application.\n- **Input Validation Failures:** The agent struggles with handling unexpected or malformed inputs, which can lead to crashes or inaccuracies in output. This underscores a critical lack of robustness in its input handling mechanisms.\n- **Inadequate Commenting and Documentation:** There is a consistent shortfall in the agent's ability to adequately comment and document its code, complicating code comprehension and potentially hindering collaborative efforts.\n- **Test Case Coverage Gaps:** The agent often overlooks edge cases during testing, suggesting that its testing framework may not be rigorous enough to ensure comprehensive solution validation.\n- **Problem Decomposition Difficulties:** The inability to efficiently break complex problems into smaller, manageable tasks leads to incomplete or erroneous solutions, highlighting a weakness in high-level problem-solving strategies.\n\n**2. Patterns in the Agent's Performance Across Tasks:**\nThe agent's performance appears to vary based on the complexity of tasks. While it may succeed in simpler or more straightforward tasks (as indicated by the success classification in Task 1333_platinum_good_bitstrings), it shows vulnerabilities in handling tasks that require deeper reasoning, sophisticated algorithm implementations, or robust input validation. This pattern suggests that the agent may benefit from focused training on problem decomposition and robustness.\n\n**3. Suggestions for Areas of Improvement:**\nTo enhance the AI agent's performance across tasks, the following areas should be prioritized:\n- **Enhanced Training on Algorithm Understanding:** Focus on comprehensive training modules that reinforce algorithm selection and implementation strategies.\n- **Robust Input Handling Mechanisms:** Develop more resilient input validation frameworks to handle edge cases and malformed inputs without runtime failures.\n- **Improved Documentation Practices:** Implement guidelines and tools that facilitate better commentary and documentation of code, enhancing maintainability and collaboration.\n- **Expanded Testing Framework:** Create a more exhaustive testing suite that includes a wider variety of edge cases and scenarios to ensure all functions perform as expected in diverse conditions.\n- **Training on Problem Decomposition:** Include training tactics aimed at teaching the agent to effectively break down complex problems, fostering a stepwise approach to problem-solving.\n\nBy addressing these areas, the AI agent can become more reliable and efficient, ultimately leading to improved performance across a wider range of tasks."
}
agent_submission.md
ADDED
@@ -0,0 +1,8 @@
To submit **a new agent** for evaluation, developers should only need to:

1. Ensure that the agent provides a specific entry point (e.g., a Python script or function).

2. Integrate logging by wrapping all LLM API calls to report cost, latency, and relevant parameters.
   * For our own evaluations, we have been relying on [Weights & Biases' Weave](https://wandb.github.io/weave/), which provides integrations for a number of LLM providers.
   * Both [Vivaria](https://github.com/METR/vivaria) and UK AISI's [Inspect](https://github.com/UKGovernmentBEIS/inspect_ai) provide logging functionality.
   * However, they are missing some pieces we are interested in, such as the latency and parameters of LLM calls. Weave provides a minimum-effort solution.
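As a rough illustration of the logging integration described above, here is a minimal sketch that relies on Weave's automatic tracing of the OpenAI client; the project name, function, and task ID are illustrative assumptions, not part of the HAL harness.

```python
import weave
from openai import OpenAI

weave.init("my-agent-evals")  # illustrative project name
client = OpenAI()  # OpenAI calls made after weave.init are traced (inputs, outputs, usage, latency)

@weave.op()
def solve_task(task_id: str, task_prompt: str) -> str:
    # Attach a task identifier so traces can later be grouped per task.
    with weave.attributes({"weave_task_id": task_id}):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": task_prompt}],
        )
    return response.choices[0].message.content
```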
agent_submission_core.md
ADDED
@@ -0,0 +1,29 @@
### To submit **a new agent** to the CORE leaderboard, follow these steps:

1. **Run your agent on the [CORE-Bench Harness](https://github.com/siegelz/core-bench).** When developing your agent, ensure that it generates a file named `agent_trace.log` in the base directory from which it is invoked, for each run. The content of this file must be in JSON format and **at least** include the keys `cost` and `agent_trace`:

```json
{
    "cost": 0.59,
    "agent_trace": "The agent trace is a string that describes the intermediate steps the agent took to arrive at the final solution. This trace does not need to follow a specific format."
}
```

   - **`cost`**: A float representing the total cost (USD) of API calls made by the agent. We recommend using [Weave](https://github.com/wandb/weave) for easy cost logging.
   - **`agent_trace`**: A string describing the steps your agent took to arrive at its final solution. It should adhere to the following guidelines, inspired by [SWE-Bench](https://www.swebench.com/submit.html):
     - Human-readable.
     - Reflects the intermediate steps your system took that led to the final solution.
     - Generated with the inference process, not post-hoc.

   If you have any trouble implementing this, feel free to reach out to us for support. A sketch of how an agent might write this file is shown after this list.

2. **Run your agent** on all tasks of the test set. You will almost certainly need to run your agent using our Azure VM harness (with the `--use_azure` flag) to avoid long experiment times. Set the `--experiment_name` flag to the name of your agent. You can submit results for any of the three levels of the benchmark: CORE-Bench-Easy, CORE-Bench-Medium, or CORE-Bench-Hard.

3. **Submit the following two directories from the harness**:
   - `benchmark/results/[experiment_name]`: Contains the results of your agent on each task.
   - `benchmark/logs/[experiment_name]`: Contains the logs of your agent's execution on each task (which are the `agent_trace.log` files your agent submits).
   - These files are automatically generated by the harness when you run your agent. You should not manually modify these files.

   Compress these directories into two `.tar.gz` or `.zip` files and email them to [zss@princeton.edu](mailto:zss@princeton.edu). If the files are too large to email, please upload them to Google Drive, Dropbox, etc., and email the link. **In the body of the email, please also include the name of your agent that you wish to be displayed on the leaderboard.**

4. [Optional] We highly encourage you to submit the files of your agent (i.e., `benchmark/agents/[agent_name]`) so we can verify the performance of your agent on the leaderboard. If you choose to do so, compress this directory into a `.tar.gz` file and include it in the email.
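To make the `agent_trace.log` requirement concrete, here is a minimal sketch of how an agent might write the file at the end of a run; the helper name and the way cost and steps are tracked are illustrative assumptions, not part of the harness.

```python
import json

def write_agent_trace(total_cost_usd: float, steps: list[str], path: str = "agent_trace.log") -> None:
    # Emit the two required keys: total API cost in USD and a human-readable trace.
    trace = {
        "cost": total_cost_usd,
        "agent_trace": "\n".join(steps),
    }
    with open(path, "w") as f:
        json.dump(trace, f, indent=2)

# Example usage at the end of a run (illustrative values):
write_agent_trace(0.59, ["Read the task description", "Reproduced the paper's main figure", "Wrote the final report"])
```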
app.py
ADDED
@@ -0,0 +1,1458 @@
1 |
+
import gradio as gr
|
2 |
+
from gradio_leaderboard import Leaderboard, SelectColumns, ColumnFilter
|
3 |
+
import config
|
4 |
+
from envs import RESULTS_REPO_ID, REPO_ID, API, HF_TOKEN
|
5 |
+
from pathlib import Path
|
6 |
+
import pandas as pd
|
7 |
+
import os
|
8 |
+
import json
|
9 |
+
from utils.viz import create_scatter_plot, create_flow_chart, create_bar_chart, create_task_success_heatmap, create_leaderboard
|
10 |
+
from utils.processing import check_and_process_uploads
|
11 |
+
from huggingface_hub import snapshot_download
|
12 |
+
from apscheduler.schedulers.background import BackgroundScheduler
|
13 |
+
from datetime import datetime
|
14 |
+
import json
|
15 |
+
import re
|
16 |
+
import markdown
|
17 |
+
import asyncio
|
18 |
+
from apscheduler.schedulers.asyncio import AsyncIOScheduler
|
19 |
+
import weave
|
20 |
+
from utils.db import TracePreprocessor
|
21 |
+
from gradio.themes.soft import Soft
|
22 |
+
|
23 |
+
preprocessor = TracePreprocessor()
|
24 |
+
|
25 |
+
from datetime import datetime
|
26 |
+
|
27 |
+
abs_path = Path(__file__).parent
|
28 |
+
|
29 |
+
def restart_space():
|
30 |
+
API.restart_space(repo_id=REPO_ID, token=HF_TOKEN)
|
31 |
+
|
32 |
+
# New function to download results
|
33 |
+
def download_latest_results():
|
34 |
+
print("Downloading latest results...")
|
35 |
+
snapshot_download(RESULTS_REPO_ID,
|
36 |
+
local_dir= "evals_upload",
|
37 |
+
repo_type='dataset',
|
38 |
+
tqdm_class=None,
|
39 |
+
etag_timeout=30,
|
40 |
+
max_workers=4,
|
41 |
+
)
|
42 |
+
print("Download complete.")
|
43 |
+
|
44 |
+
|
45 |
+
def get_analyzed_traces(agent_name, benchmark_name):
|
46 |
+
return preprocessor.get_analyzed_traces(agent_name, benchmark_name)
|
47 |
+
|
48 |
+
def get_failure_report(agent_name, benchmark_name):
|
49 |
+
return preprocessor.get_failure_report(agent_name, benchmark_name)
|
50 |
+
|
51 |
+
def parse_json_files(folder_path, benchmark_name, aggregate=True):
|
52 |
+
return preprocessor.get_parsed_results(benchmark_name, aggregate=aggregate)
|
53 |
+
|
54 |
+
def update_agent_dropdown(benchmark_name, metric):
|
55 |
+
df = parse_json_files(os.path.join(abs_path, "evals_live"), benchmark_name)
|
56 |
+
agents = df['Agent Name'].tolist()
|
57 |
+
best_agent = get_best_agent(benchmark_name, metric)
|
58 |
+
return gr.Dropdown(choices=agents, value=best_agent, label="Select Agent")
|
59 |
+
|
60 |
+
def get_best_agent(benchmark_name, metric):
|
61 |
+
df = parse_json_files(os.path.join(abs_path, "evals_live"), benchmark_name)
|
62 |
+
return df.loc[df[metric].idxmax()]['Agent Name']
|
63 |
+
|
64 |
+
def update_task_analysis(benchmark_name, agent_name):
|
65 |
+
if not agent_name:
|
66 |
+
return "Please select an agent.", None, None, ""
|
67 |
+
|
68 |
+
analyzed_traces = get_analyzed_traces(agent_name, benchmark_name)
|
69 |
+
if not analyzed_traces:
|
70 |
+
return f"No analysis available for agent: {agent_name}", None, None, ""
|
71 |
+
|
72 |
+
task_ids = list(analyzed_traces.keys())
|
73 |
+
|
74 |
+
overview, flow_chart, _ = update_task_details(benchmark_name, agent_name, task_ids[0])
|
75 |
+
|
76 |
+
return overview, flow_chart, gr.Dropdown(choices=task_ids, value=task_ids[0], label="Select Task"), ""
|
77 |
+
|
78 |
+
def update_task_details(benchmark_name, agent_name, task_id):
|
79 |
+
if not task_id:
|
80 |
+
return "Please select a task.", None, ""
|
81 |
+
|
82 |
+
analyzed_traces = get_analyzed_traces(agent_name, benchmark_name)
|
83 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
84 |
+
return f"No analysis available for task: {task_id}", None, ""
|
85 |
+
|
86 |
+
analysis = analyzed_traces[task_id]
|
87 |
+
|
88 |
+
summary = analysis.get('task_analysis', {})
|
89 |
+
|
90 |
+
overview = f"### Summary\n\n{summary.get('overview', 'No overview available.')}\n\n"
|
91 |
+
# overview += f"### Successes\n{summary.get('key_successes', 'No successes listed.')}\n\n"
|
92 |
+
# overview += f"### Challenges\n{summary.get('main_challenges', 'No challenges listed.')}\n\n"
|
93 |
+
# overview += f"### Overall Assessment\n{summary.get('overall_assessment', 'No assessment available.')}\n\n"
|
94 |
+
|
95 |
+
if summary.get('overview', 'No overview available.') != "Not available":
|
96 |
+
flow_chart = create_flow_chart(analysis['steps'])
|
97 |
+
else:
|
98 |
+
flow_chart = None
|
99 |
+
|
100 |
+
return overview, flow_chart, ""
|
101 |
+
|
102 |
+
|
103 |
+
def format_call_info(step, step_index):
|
104 |
+
call_data = step['call_data']
|
105 |
+
analysis = step['analysis']
|
106 |
+
|
107 |
+
def format_json(obj):
|
108 |
+
# if isinstance(obj, dict) and 'choices' in obj:
|
109 |
+
# # Special handling for message content
|
110 |
+
# formatted_content = format_message_content(obj['choices'][0])
|
111 |
+
# return f'<div class="message-content">{formatted_content}</div>'
|
112 |
+
# else:
|
113 |
+
json_str = json.dumps(obj, indent=2)
|
114 |
+
json_str = json_str.replace(' ', ' ')
|
115 |
+
json_str = json_str.replace('\n', '<br>')
|
116 |
+
return f'<div class="json-wrapper">{json_str}</div>'
|
117 |
+
|
118 |
+
# Currently not used but we can enable it to format message content
|
119 |
+
def format_message_content(content):
|
120 |
+
# Convert Markdown to HTML
|
121 |
+
html_content = markdown.markdown(content)
|
122 |
+
|
123 |
+
# Replace ``` code blocks with styled pre blocks
|
124 |
+
html_content = re.sub(r'```python\n(.*?)```', lambda m: f'<pre class="code-block">{m.group(1)}</pre>', html_content, flags=re.DOTALL)
|
125 |
+
|
126 |
+
return html_content
|
127 |
+
|
128 |
+
formatted_info = f"""
|
129 |
+
<style>
|
130 |
+
.json-wrapper {{
|
131 |
+
white-space: pre-wrap;
|
132 |
+
word-wrap: break-word;
|
133 |
+
font-family: monospace;
|
134 |
+
max-height: 300px;
|
135 |
+
overflow-y: auto;
|
136 |
+
background-color: #f5f5f5;
|
137 |
+
padding: 10px;
|
138 |
+
border-radius: 5px;
|
139 |
+
}}
|
140 |
+
.message-content {{
|
141 |
+
white-space: normal;
|
142 |
+
word-wrap: break-word;
|
143 |
+
font-family: Arial, sans-serif;
|
144 |
+
max-height: 500px;
|
145 |
+
overflow-y: auto;
|
146 |
+
background-color: #ffffff;
|
147 |
+
padding: 10px;
|
148 |
+
border-radius: 5px;
|
149 |
+
border: 1px solid #e0e0e0;
|
150 |
+
}}
|
151 |
+
.code-block {{
|
152 |
+
background-color: #f0f0f0;
|
153 |
+
padding: 10px;
|
154 |
+
border-radius: 5px;
|
155 |
+
font-family: monospace;
|
156 |
+
white-space: pre-wrap;
|
157 |
+
word-wrap: break-word;
|
158 |
+
}}
|
159 |
+
</style>
|
160 |
+
|
161 |
+
<h3>Step {step_index + 1}: {analysis.get('headline', '')}</h3>
|
162 |
+
|
163 |
+
<h4>Call Metadata</h4>
|
164 |
+
<ul>
|
165 |
+
<li><strong>Weave Task ID:</strong> {call_data['weave_task_id']}</li>
|
166 |
+
<li><strong>Trace ID:</strong> {call_data['trace_id']}</li>
|
167 |
+
<li><strong>Project ID:</strong> {call_data['project_id']}</li>
|
168 |
+
<li><strong>Created Timestamp:</strong> {datetime.fromtimestamp(call_data['created_timestamp'])}</li>
|
169 |
+
<li><strong>Model:</strong> {call_data['inputs']['model']}</li>
|
170 |
+
</ul>
|
171 |
+
|
172 |
+
<h4>Inputs</h4>
|
173 |
+
{format_json(call_data['inputs'])}
|
174 |
+
|
175 |
+
<h4>Outputs</h4>
|
176 |
+
{format_json(call_data['outputs'])}
|
177 |
+
|
178 |
+
<h4>Usage</h4>
|
179 |
+
{format_json(call_data['summary'])}
|
180 |
+
|
181 |
+
<h4>Analysis</h4>
|
182 |
+
<ul>
|
183 |
+
<li><strong>Description:</strong> {analysis['description']}</li>
|
184 |
+
<li><strong>Assessment:</strong> {analysis['assessment']}</li>
|
185 |
+
<li><strong>Success:</strong> {analysis['success']}</li>
|
186 |
+
<li><strong>Action Type:</strong> {analysis['action_type']}</li>
|
187 |
+
</ul>
|
188 |
+
"""
|
189 |
+
return formatted_info
|
190 |
+
|
191 |
+
|
192 |
+
def update_failure_report(agent_name, benchmark_name):
|
193 |
+
failure_report = get_failure_report(agent_name, benchmark_name)
|
194 |
+
if not failure_report:
|
195 |
+
return "No failure report available for this agent.", None
|
196 |
+
|
197 |
+
# Create overview of failure categories
|
198 |
+
categories_overview = "### Failure Categories:\n\n"
|
199 |
+
for category in failure_report['failure_categories']:
|
200 |
+
categories_overview += f"#### {category['category_name']}\n"
|
201 |
+
categories_overview += f"{category['description']}\n\n"
|
202 |
+
|
203 |
+
# Count tasks affected by each category
|
204 |
+
category_counts = {}
|
205 |
+
for task, classification in failure_report['task_classifications'].items():
|
206 |
+
category_id = classification['category_id']
|
207 |
+
category_counts[category_id] = category_counts.get(category_id, 0) + 1
|
208 |
+
|
209 |
+
# Prepare data for bar chart
|
210 |
+
categories = [cat['category_name'] for cat in failure_report['failure_categories']]
|
211 |
+
counts = [category_counts.get(str(i+1), 0) for i in range(len(categories))]
|
212 |
+
|
213 |
+
# Create bar chart
|
214 |
+
chart = create_bar_chart(categories, counts, "Failure Categories", "Number of Affected Tasks", "Failure Categories Distribution")
|
215 |
+
|
216 |
+
return categories_overview, chart
|
217 |
+
|
218 |
+
from gradio.themes.utils import colors, fonts, sizes
|
219 |
+
from typing import Iterable
|
220 |
+
class MyTheme(Soft):
|
221 |
+
def __init__(
|
222 |
+
self,
|
223 |
+
*,
|
224 |
+
primary_hue: colors.Color | str = colors.blue,
|
225 |
+
text_size: sizes.Size | str = sizes.text_lg,
|
226 |
+
font: fonts.Font
|
227 |
+
| str
|
228 |
+
| Iterable[fonts.Font | str] = (
|
229 |
+
fonts.GoogleFont("Lato"),
|
230 |
+
"ui-sans-serif",
|
231 |
+
"sans-serif",
|
232 |
+
),
|
233 |
+
font_mono: fonts.Font
|
234 |
+
| str
|
235 |
+
| Iterable[fonts.Font | str] = (
|
236 |
+
fonts.GoogleFont("IBM Plex Mono"),
|
237 |
+
"ui-monospace",
|
238 |
+
"monospace",
|
239 |
+
),
|
240 |
+
):
|
241 |
+
super().__init__(
|
242 |
+
primary_hue=primary_hue,
|
243 |
+
text_size=text_size,
|
244 |
+
font=font,
|
245 |
+
font_mono=font_mono,
|
246 |
+
)
|
247 |
+
|
248 |
+
my_theme = MyTheme()
|
249 |
+
|
250 |
+
with gr.Blocks(theme=my_theme, css='css.css', title="HAL: Holistic Agent Leaderboard") as demo:
    # gr.Markdown((Path(__file__).parent / "header.md").read_text(), elem_classes=["text-large"])
    gr.HTML("""
    <style>
        .hal-header {
            color: #ecf0f1;
            border-radius: 10px;
            padding: 40px 20px;
            text-align: center;
            box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
        }
        .hal-title {
            font-size: 2.5em;
            font-weight: 700;
            margin: 0;
            letter-spacing: 2px;
            text-transform: uppercase;
        }
        .hal-subtitle {
            font-size: 1.2em;
            font-weight: 300;
            margin-top: 15px;
            margin-left: auto;
            margin-right: auto;
            line-height: 1.6;
            text-align: center;
        }
        .hal-highlight {
            color: #3498db;
            font-weight: 600;
        }
    </style>

    <header class="hal-header">
        <h1 class="hal-title">Holistic Agent Leaderboard (HAL)</h1>
        <p class="hal-subtitle">
            A standardized, cost-aware, and third-party leaderboard for evaluating agents.
        </p>
    </header>""")
    gr.HTML("""
    <style>
        .feature-row {
            display: flex;
            justify-content: space-between;
            margin-top: 20px;
            margin-bottom: 20px;
        }
        .feature-column {
            flex: 1;
            padding: 25px;
            background-color: #ffffff;
            border-radius: 10px;
            margin: 0 15px;
            text-align: left;
            box-shadow: 0 6px 12px rgba(0, 0, 0, 0.1);
            display: flex;
            flex-direction: column;
            align-items: flex-start;
            border-top: 5px solid #3498db;
            transition: transform 0.3s ease, box-shadow 0.3s ease;
        }
        .feature-column:hover {
            transform: translateY(-5px);
            box-shadow: 0 5px 10px rgba(0, 0, 0, 0.15);
        }
        .feature-keyword {
            font-size: 1.2em;
            font-weight: bold;
            color: #1b9e77;
            margin-bottom: 10px;
            text-transform: uppercase;
            letter-spacing: 1px;
        }
        .feature-content {
            flex-grow: 1;
        }
        .feature-description {
            font-size: 0.95em;
            line-height: 1.6;
            color: #333;
        }
    </style>

    <div class="feature-row">
        <div class="feature-column">
            <div class="feature-keyword">Standardized</div>
            <div class="feature-content">
                <p class="feature-description">Evaluations across agent benchmarks are all recorded to a single leaderboard that evaluates every listed agent in the same way.</p>
            </div>
        </div>
        <div class="feature-column">
            <div class="feature-keyword">Cost-controlled</div>
            <div class="feature-content">
                <p class="feature-description">For downstream users, understanding the cost of running agents is a significant need for adoption. For agent developers, cost-controlled evaluations help establish accurate baselines.</p>
            </div>
        </div>
        <div class="feature-column">
            <div class="feature-keyword">Third-party</div>
            <div class="feature-content">
                <p class="feature-description">Agent developers have a conflict of interest when reporting their own results: they want to show state-of-the-art performance. Third-party evaluation removes that incentive.</p>
            </div>
        </div>
    </div>
    <style>
        .section-heading {
            font-size: 1.8em;
            font-weight: bold;
            color: #2c3e50;
            margin-top: 40px;
            margin-bottom: 20px;
            text-align: left;
        }
        .user-types-container {
            display: grid;
            grid-template-columns: repeat(2, 1fr);
            gap: 20px;
            margin-top: 20px;
        }
        .user-type {
            background-color: #ffffff;
            border-radius: 10px;
            padding: 25px;
            box-shadow: 0 6px 12px rgba(0, 0, 0, 0.1);
            transition: transform 0.3s ease, box-shadow 0.3s ease;
            border-left: 5px solid #3498db;
        }
        .user-type:hover {
            transform: translateY(-5px);
            box-shadow: 0 5px 10px rgba(0, 0, 0, 0.15);
        }
        .user-type-title {
            font-size: 1.2em;
            font-weight: bold;
            color: #3498db;
            margin-bottom: 10px;
        }
        .user-type-description {
            font-size: 0.95em;
            line-height: 1.6;
            color: #333;
        }
        .user-type-links a {
            display: inline-block;
            padding: 5px 12px;
            margin-bottom: 5px;
            background-color: #f0f4f8;
            color: #2c3e50 !important; /* Force the color change */
            text-decoration: none !important; /* Force remove underline */
            border-radius: 15px;
            font-size: 0.85em;
            transition: all 0.3s ease;
            border: 1px solid #e1e8ed;
        }
        .user-type-links a:hover {
            background-color: #3498db;
            color: white !important; /* Force the color change on hover */
            transform: translateY(-2px);
            box-shadow: 0 2px 5px rgba(52, 152, 219, 0.2);
            text-decoration: none !important; /* Ensure no underline on hover */
        }
        .user-type-links a:visited {
            color: #2c3e50 !important; /* Ensure visited links have the same color */
        }
        .user-type-links a::before {
            content: "→";
            margin-right: 5px;
            font-size: 1.1em;
        }
    </style>

    <h2 class="section-heading">Who is it for?</h2>
    <p>We see HAL being useful for four types of users:</p>

    <div class="user-types-container">
        <div class="user-type">
            <h3 class="user-type-title">Downstream Users & Procurers</h3>
            <p class="user-type-description">Customers looking to deploy agents can get visibility into existing benchmarks, learn which developers are building useful agents, and identify the state of the art for both cost and accuracy for their tasks of interest.</p>
            <div class="user-type-links">
                <a href="#leaderboards">Leaderboards</a>
            </div>
        </div>
        <div class="user-type">
            <h3 class="user-type-title">Agent Benchmark Developers</h3>
            <p class="user-type-description">Reporting results on a centralized leaderboard could allow improved visibility into agent benchmarks that measure real-world utility.</p>
            <div class="user-type-links">
                <a href="#benchmark-submission">Add a Benchmark</a>
            </div>
        </div>
        <div class="user-type">
            <h3 class="user-type-title">Agent Developers</h3>
            <p class="user-type-description">HAL allows for easy reproduction of past agents, clear comparison with past baselines, and a straightforward way to compete on a leaderboard.</p>
            <div class="user-type-links">
                <a href="#agent-submission">Submit an Agent</a>
                <a href="#leaderboards">Leaderboards</a>
                <a href="#reproduction-guide">Reproduction Guide</a>
            </div>
        </div>
        <div class="user-type">
            <h3 class="user-type-title">Safety Researchers</h3>
            <p class="user-type-description">Understanding agent capabilities on real-world safety threats and their associated costs is crucial. For example, Cybench evaluations could provide insights into agent performance and affordability for potential adversaries.</p>
            <div class="user-type-links">
                <a href="#cybench-results">Cybench Leaderboard (coming soon)</a>
                <a href="#agent-monitor">Agent Monitor</a>
            </div>
        </div>
    </div>
    <br>
    <h2 class="section-heading" id="leaderboards">Leaderboards</h2>
    <p>Select a benchmark to see the agent leaderboard. Verified results have been run by the HAL team:</p>
    """)

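    # Each benchmark tab below follows the same pattern: parse_json_files() reads the
    # processed evaluation results for that benchmark from the local "evals_live"
    # directory, create_leaderboard() turns them into the table shown in the
    # Leaderboard component, and create_scatter_plot() renders accuracy vs. cost.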
    with gr.Tabs() as tabs:
        with gr.Tab("CORE-Bench"):
            gr.HTML("""
            <p>
            CORE-Bench evaluates the ability of agents to computationally reproduce the results of published scientific papers. Agents are given the codebase of a paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. The benchmark has tasks at three difficulty levels:
            </p>
            """)
            with gr.Tab("CORE-Bench-Hard"):
                gr.HTML("""
                <p>
                <i><b>CORE-Bench-Hard:</b></i> The agent is given the codebase of the paper and must install all libraries and dependencies, run the code, and read through the output and figures to answer questions about the paper. This level is most akin to fully reproducing a paper and is the most realistic and challenging level.
                </p>
                """)
                with gr.Row():
                    with gr.Column(scale=2):
                        Leaderboard(
                            value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_hard'), ci_metrics=["Accuracy", "Total Cost"]),
                            select_columns=SelectColumns(
                                default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
                                cant_deselect=["Agent Name"],
                                label="Select Columns to Display:",
                            ),
                            hide_columns=config.COREBENCH_HIDE_COLUMNS,
                            search_columns=config.COREBENCH_SEARCH_COLUMNS,
                        )
                # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
                with gr.Row():
                    gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Hard")
                with gr.Row():
                    scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_hard', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))

                gr.HTML('<div style="height: 30px;"></div>')
                gr.Markdown("## Task success heatmap")
                gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in CORE-Bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
                with gr.Row():
                    task_success_heatmap = gr.Plot()
                demo.load(
                    lambda: create_task_success_heatmap(
                        preprocessor.get_task_success_data('corebench_hard'),
                        'CORE-Bench-Hard'
                    ),
                    outputs=[task_success_heatmap]
                )
            with gr.Tab("CORE-Bench-Medium"):
                gr.HTML("""
                <p>
                <i><b>CORE-Bench-Medium:</b></i> The agent is given a Dockerfile and instructions on how to use the Dockerfile to fully reproduce the paper. This level mainly evaluates agents' ability to use and interact with the terminal. The agent must then answer questions about the output of the code, as in the above level.
                </p>
                """)
                with gr.Row():
                    with gr.Column(scale=2):
                        Leaderboard(
                            value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_medium'), ci_metrics=["Accuracy", "Total Cost"]),
                            select_columns=SelectColumns(
                                default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
                                cant_deselect=["Agent Name"],
                                label="Select Columns to Display:",
                            ),
                            hide_columns=config.COREBENCH_HIDE_COLUMNS,
                            search_columns=config.COREBENCH_SEARCH_COLUMNS,
                        )
                # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
                with gr.Row():
                    gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Medium")
                with gr.Row():
                    scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_medium', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))

                gr.HTML('<div style="height: 30px;"></div>')
                gr.Markdown("## Task success heatmap")
                gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in CORE-Bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
                with gr.Row():
                    task_success_heatmap = gr.Plot()
                demo.load(
                    lambda: create_task_success_heatmap(
                        preprocessor.get_task_success_data('corebench_medium'),
                        'CORE-Bench-Medium'
                    ),
                    outputs=[task_success_heatmap]
                )
            with gr.Tab("CORE-Bench-Easy"):
                gr.HTML("""
                <p>
                <i><b>CORE-Bench-Easy:</b></i> The agent is given the output of the code and must answer questions about the output without running any code. To answer questions, agents must navigate through the terminal output as well as files and figures generated by the code.
                </p>
                """)
                with gr.Row():
                    with gr.Column(scale=2):
                        Leaderboard(
                            value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_easy'), ci_metrics=["Accuracy", "Total Cost"]),
                            select_columns=SelectColumns(
                                default_selection=config.COREBENCH_ON_LOAD_COLUMNS + ["Verified"],
                                cant_deselect=["Agent Name"],
                                label="Select Columns to Display:",
                            ),
                            hide_columns=config.COREBENCH_HIDE_COLUMNS,
                            search_columns=config.COREBENCH_SEARCH_COLUMNS,
                        )
                # gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
                with gr.Row():
                    gr.Markdown("### Accuracy vs. Cost on CORE-Bench-Easy")
                with gr.Row():
                    scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'corebench_easy', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))

                gr.HTML('<div style="height: 30px;"></div>')
                gr.Markdown("## Task success heatmap")
                gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in CORE-Bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
                with gr.Row():
                    task_success_heatmap = gr.Plot()
                demo.load(
                    lambda: create_task_success_heatmap(
                        preprocessor.get_task_success_data('corebench_easy'),
                        'CORE-Bench-Easy'
                    ),
                    outputs=[task_success_heatmap]
                )

            gr.Markdown((Path(__file__).parent / "agent_submission_core.md").read_text())
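        # The CORE-Bench levels above share the COREBENCH_* column settings from config.py;
        # the tabs that follow (USACO, the SWE-bench variants, MLAgentBench) use their own
        # *_ON_LOAD_COLUMNS / *_HIDE_COLUMNS / *_SEARCH_COLUMNS entries in the same file.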
        with gr.Tab("USACO"):
            gr.Markdown("""The USA Computing Olympiad (USACO) is a computer programming competition for pre-college students. This benchmark evaluates the performance of AI agents on a set of 307 USACO tasks. The agents are evaluated based on the number of tasks correctly solved.""")
            with gr.Row():
                with gr.Column(scale=2):
                    Leaderboard(
                        value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco'), ci_metrics=["Accuracy", "Total Cost"]),
                        select_columns=SelectColumns(
                            default_selection=config.USACO_ON_LOAD_COLUMNS + ["Verified"],
                            cant_deselect=["Agent Name"],
                            label="Select Columns to Display:",
                        ),
                        hide_columns=config.USACO_HIDE_COLUMNS,
                        search_columns=config.USACO_SEARCH_COLUMNS,
                    )
            gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
            with gr.Row():
                gr.Markdown("### Accuracy vs. Cost for USACO agents")
            with gr.Row():
                scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'usaco', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))

            gr.HTML('<div style="height: 30px;"></div>')
            gr.Markdown("## Task success heatmap")
            gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
            with gr.Row():
                task_success_heatmap = gr.Plot()
            demo.load(
                lambda: create_task_success_heatmap(
                    preprocessor.get_task_success_data('usaco'),
                    'USACO'
                ),
                outputs=[task_success_heatmap]
            )

            gr.HTML("""
            <style>
                .grouped-section {
                    border: 2px solid #dee2e6; /* Color matching unactivated tabs */
                    border-radius: 10px;
                    padding: 30px;
                    margin-top: 40px;
                    margin-bottom: 40px;
                    position: relative;
                }

                .grouped-section-title {
                    font-size: 1.7em;
                    font-weight: bold;
                    color: #2c3e50;
                    margin-bottom: 20px;
                    padding-bottom: 10px;
                    border-bottom: 2px solid #dee2e6;
                }
            </style>
            """)
            with gr.Group(elem_classes=["grouped-section"]):
                gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
                gr.Markdown('The agent monitor provides an overview of the recurring errors an agent makes as well as a summary of the steps the agent takes to solve a task. It currently consists of two main components:')

                gr.HTML('<div style="height: 10px;"></div>')
                gr.Markdown("## Failure report for each agent")
                gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
                gr.HTML('<div style="height: 10px;"></div>')
                with gr.Row():
                    with gr.Column(scale=1):
                        failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
                gr.HTML('<div style="height: 10px;"></div>')
                with gr.Row():
                    with gr.Column(scale=1):
                        failure_categories_overview = gr.Markdown()

                    with gr.Column(scale=1):
                        failure_categories_chart = gr.Plot()

                # Initialize the failure report agent dropdown with all agents
                demo.load(update_agent_dropdown,
                          inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)],
                          outputs=[failure_report_agent_dropdown])

                # Update failure report when agent is selected
                failure_report_agent_dropdown.change(update_failure_report,
                                                     inputs=[failure_report_agent_dropdown, gr.Textbox(value="usaco", visible=False)],
                                                     outputs=[failure_categories_overview, failure_categories_chart])

                gr.HTML('<div style="height: 30px;"></div>')
                gr.Markdown("## Task overview")
                gr.HTML('<div style="height: 10px;"></div>')
                with gr.Row():
                    with gr.Column(scale=1):
                        agent_dropdown = gr.Dropdown(label="Select Agent")
                    with gr.Column(scale=1):
                        task_dropdown = gr.Dropdown(label="Select USACO Task")
                gr.HTML('<div style="height: 10px;"></div>')
                with gr.Row():
                    task_overview = gr.Markdown()
                with gr.Row():
                    flow_chart = gr.Plot(label="Task Flow")

                # Initialize the agent dropdown with the best agent
                demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
                demo.load(update_task_analysis, inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])

                agent_dropdown.change(update_task_analysis,
                                      inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown],
                                      outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
                task_dropdown.change(update_task_details,
                                     inputs=[gr.Textbox(value="usaco", visible=False), agent_dropdown, task_dropdown],
                                     outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])

            gr.Markdown("## Raw predictions")
            gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
            with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
                with gr.Row():
                    with gr.Column(scale=1):
                        raw_agent_dropdown = gr.Dropdown(label="Select Agent")
                    with gr.Column(scale=1):
                        raw_task_dropdown = gr.Dropdown(label="Select Task")
                    with gr.Column(scale=1):
                        raw_step_dropdown = gr.Dropdown(label="Select Step")
                with gr.Row():
                    raw_call_details = gr.HTML()

                def update_raw_task_dropdown(agent_name):
                    analyzed_traces = get_analyzed_traces(agent_name, "usaco")
                    if not analyzed_traces:
                        return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
                    task_ids = list(analyzed_traces.keys())
                    steps = analyzed_traces[task_ids[0]]['steps']
                    return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(get_analyzed_traces(agent_name, "usaco")[task_ids[0]]['steps'][0], 0)

                def update_raw_step_dropdown(agent_name, task_id):
                    analyzed_traces = get_analyzed_traces(agent_name, "usaco")
                    if not analyzed_traces or task_id not in analyzed_traces:
                        return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
                    steps = analyzed_traces[task_id]['steps']
                    return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)

                def update_raw_call_details(agent_name, task_id, step_index):
                    analyzed_traces = get_analyzed_traces(agent_name, "usaco")
                    if not analyzed_traces or task_id not in analyzed_traces:
                        return "No data available for this selection."
                    steps = analyzed_traces[task_id]['steps']
                    if step_index is None:
                        return "Invalid step selection."
                    step = steps[step_index]
                    return format_call_info(step, step_index)

                # Initialize the raw agent dropdown with all agents
                demo.load(update_agent_dropdown,
                          inputs=[gr.Textbox(value="usaco", visible=False), gr.Textbox(value="Accuracy", visible=False)],
                          outputs=[raw_agent_dropdown])
                demo.load(update_raw_task_dropdown,
                          inputs=[raw_agent_dropdown],
                          outputs=[raw_task_dropdown, raw_step_dropdown])
                demo.load(update_raw_call_details,
                          inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
                          outputs=[raw_call_details])

                raw_agent_dropdown.change(update_raw_task_dropdown,
                                          inputs=[raw_agent_dropdown],
                                          outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
                raw_task_dropdown.change(update_raw_step_dropdown,
                                         inputs=[raw_agent_dropdown, raw_task_dropdown],
                                         outputs=[raw_step_dropdown, raw_call_details])
                raw_step_dropdown.change(update_raw_call_details,
                                         inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
                                         outputs=[raw_call_details])

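        # A minimal, self-contained sketch of the chained-dropdown pattern used in the
        # "Raw predictions" accordion above (agent -> task -> step -> call details).
        # The TRACES dict and the sketch Blocks below are illustrative placeholders,
        # not part of this app; the app itself passes the benchmark name to handlers
        # through a hidden gr.Textbox, as seen in the demo.load calls above.
        #
        #     import gradio as gr
        #
        #     TRACES = {"task_1": ["step A", "step B"], "task_2": ["step C"]}
        #
        #     with gr.Blocks() as sketch:
        #         task = gr.Dropdown(choices=list(TRACES), label="Select Task")
        #         step = gr.Dropdown(label="Select Step")
        #         details = gr.Markdown()
        #         task.change(lambda t: gr.Dropdown(choices=TRACES[t], value=TRACES[t][0]),
        #                     inputs=task, outputs=step)
        #         step.change(lambda s: f"Selected: {s}", inputs=step, outputs=details)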
with gr.Tab("SWE-bench Verified (Mini)"):
|
747 |
+
gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. We are actively developing this platform, and this benchmark is not fully implemented yet.""")
|
748 |
+
with gr.Row():
|
749 |
+
with gr.Column(scale=2):
|
750 |
+
Leaderboard(
|
751 |
+
value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified_mini'), ci_metrics=["Accuracy", "Total Cost"]),
|
752 |
+
select_columns=SelectColumns(
|
753 |
+
default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
|
754 |
+
cant_deselect=["Agent Name"],
|
755 |
+
label="Select Columns to Display:",
|
756 |
+
),
|
757 |
+
hide_columns=config.SWEBENCH_HIDE_COLUMNS,
|
758 |
+
search_columns=config.SWEBENCH_SEARCH_COLUMNS,
|
759 |
+
)
|
760 |
+
gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
|
761 |
+
with gr.Row():
|
762 |
+
gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
|
763 |
+
with gr.Row():
|
764 |
+
scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified_mini', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
|
765 |
+
|
766 |
+
gr.HTML('<div style="height: 30px;"></div>')
|
767 |
+
gr.Markdown("## Task success heatmap")
|
768 |
+
gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
|
769 |
+
with gr.Row():
|
770 |
+
task_success_heatmap = gr.Plot()
|
771 |
+
demo.load(
|
772 |
+
lambda: create_task_success_heatmap(
|
773 |
+
preprocessor.get_task_success_data('swebench_verified_mini'),
|
774 |
+
'SWE-bench Verified'
|
775 |
+
),
|
776 |
+
outputs=[task_success_heatmap]
|
777 |
+
)
|
778 |
+
|
779 |
+
# gr.HTML("""
|
780 |
+
# <style>
|
781 |
+
# .grouped-section {
|
782 |
+
# border: 2px solid #dee2e6; /* Color matching unactivated tabs */
|
783 |
+
# border-radius: 10px;
|
784 |
+
# padding: 30px;
|
785 |
+
# margin-top: 40px;
|
786 |
+
# margin-bottom: 40px;
|
787 |
+
# position: relative;
|
788 |
+
# }
|
789 |
+
|
790 |
+
# .grouped-section-title {
|
791 |
+
# font-size: 1.7em;
|
792 |
+
# font-weight: bold;
|
793 |
+
# color: #2c3e50;
|
794 |
+
# margin-bottom: 20px;
|
795 |
+
# padding-bottom: 10px;
|
796 |
+
# border-bottom: 2px solid #dee2e6;
|
797 |
+
# }
|
798 |
+
# </style>
|
799 |
+
# """)
|
800 |
+
# with gr.Group(elem_classes=["grouped-section"]):
|
801 |
+
# gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
|
802 |
+
|
803 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
804 |
+
# gr.Markdown("## Failure report for each agent")
|
805 |
+
# gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
|
806 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
807 |
+
# with gr.Row():
|
808 |
+
# with gr.Column(scale=1):
|
809 |
+
# failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
|
810 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
811 |
+
# with gr.Row():
|
812 |
+
# with gr.Column(scale=1):
|
813 |
+
# failure_categories_overview = gr.Markdown()
|
814 |
+
|
815 |
+
# with gr.Column(scale=1):
|
816 |
+
# failure_categories_chart = gr.Plot()
|
817 |
+
|
818 |
+
# # Initialize the failure report agent dropdown with all agents
|
819 |
+
# demo.load(update_agent_dropdown,
|
820 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
821 |
+
# outputs=[failure_report_agent_dropdown])
|
822 |
+
|
823 |
+
# # Update failure report when agent is selected
|
824 |
+
# failure_report_agent_dropdown.change(update_failure_report,
|
825 |
+
# inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
|
826 |
+
# outputs=[failure_categories_overview, failure_categories_chart])
|
827 |
+
|
828 |
+
# gr.HTML('<div style="height: 30px;"></div>')
|
829 |
+
# gr.Markdown("## Task overview")
|
830 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
831 |
+
# with gr.Row():
|
832 |
+
# with gr.Column(scale=1):
|
833 |
+
# agent_dropdown = gr.Dropdown(label="Select Agent")
|
834 |
+
# with gr.Column(scale=1):
|
835 |
+
# task_dropdown = gr.Dropdown(label="Select SWE-bench Verified Task")
|
836 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
837 |
+
# with gr.Row():
|
838 |
+
# task_overview = gr.Markdown()
|
839 |
+
# with gr.Row():
|
840 |
+
# flow_chart = gr.Plot(label="Task Flow")
|
841 |
+
|
842 |
+
# # Initialize the agent dropdown with the best agent
|
843 |
+
# demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
|
844 |
+
# demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
845 |
+
|
846 |
+
# agent_dropdown.change(update_task_analysis,
|
847 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
|
848 |
+
# outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
849 |
+
# task_dropdown.change(update_task_details,
|
850 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
|
851 |
+
# outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
|
852 |
+
|
853 |
+
gr.Markdown("## Raw predictions")
|
854 |
+
gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
|
855 |
+
with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
|
856 |
+
with gr.Row():
|
857 |
+
with gr.Column(scale=1):
|
858 |
+
raw_agent_dropdown = gr.Dropdown(label="Select Agent")
|
859 |
+
with gr.Column(scale=1):
|
860 |
+
raw_task_dropdown = gr.Dropdown(label="Select Task")
|
861 |
+
with gr.Column(scale=1):
|
862 |
+
raw_step_dropdown = gr.Dropdown(label="Select Step")
|
863 |
+
with gr.Row():
|
864 |
+
raw_call_details = gr.HTML()
|
865 |
+
|
866 |
+
def update_raw_task_dropdown(agent_name):
|
867 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
|
868 |
+
if not analyzed_traces:
|
869 |
+
return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
|
870 |
+
task_ids = list(analyzed_traces.keys())
|
871 |
+
steps = analyzed_traces[task_ids[0]]['steps']
|
872 |
+
return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(get_analyzed_traces(agent_name, "swebench_verified_mini")[task_ids[0]]['steps'][0], 0)
|
873 |
+
|
874 |
+
def update_raw_step_dropdown(agent_name, task_id):
|
875 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
|
876 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
877 |
+
return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
|
878 |
+
steps = analyzed_traces[task_id]['steps']
|
879 |
+
return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
|
880 |
+
|
881 |
+
def update_raw_call_details(agent_name, task_id, step_index):
|
882 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified_mini")
|
883 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
884 |
+
return "No data available for this selection."
|
885 |
+
steps = analyzed_traces[task_id]['steps']
|
886 |
+
if step_index is None:
|
887 |
+
return "Invalid step selection."
|
888 |
+
step = steps[step_index]
|
889 |
+
return format_call_info(step, step_index)
|
890 |
+
|
891 |
+
# Initialize the raw agent dropdown with all agents
|
892 |
+
demo.load(update_agent_dropdown,
|
893 |
+
inputs=[gr.Textbox(value="swebench_verified_mini", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
894 |
+
outputs=[raw_agent_dropdown])
|
895 |
+
demo.load(update_raw_task_dropdown,
|
896 |
+
inputs=[raw_agent_dropdown],
|
897 |
+
outputs=[raw_task_dropdown, raw_step_dropdown])
|
898 |
+
demo.load(update_raw_call_details,
|
899 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
900 |
+
outputs=[raw_call_details])
|
901 |
+
|
902 |
+
raw_agent_dropdown.change(update_raw_task_dropdown,
|
903 |
+
inputs=[raw_agent_dropdown],
|
904 |
+
outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
|
905 |
+
raw_task_dropdown.change(update_raw_step_dropdown,
|
906 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown],
|
907 |
+
outputs=[raw_step_dropdown, raw_call_details])
|
908 |
+
raw_step_dropdown.change(update_raw_call_details,
|
909 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
910 |
+
outputs=[raw_call_details])
|
911 |
+
|
912 |
+
|
913 |
+
with gr.Tab("SWE-bench Verified"):
|
914 |
+
gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Verified is a human-validated subset of 500 problems reviewed by software engineers. We are actively developing this platform, and this benchmark is not fully implemented yet.""")
|
915 |
+
with gr.Row():
|
916 |
+
with gr.Column(scale=2):
|
917 |
+
Leaderboard(
|
918 |
+
value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified'), ci_metrics=["Accuracy", "Total Cost"]),
|
919 |
+
select_columns=SelectColumns(
|
920 |
+
default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
|
921 |
+
cant_deselect=["Agent Name"],
|
922 |
+
label="Select Columns to Display:",
|
923 |
+
),
|
924 |
+
hide_columns=config.SWEBENCH_HIDE_COLUMNS,
|
925 |
+
search_columns=config.SWEBENCH_SEARCH_COLUMNS,
|
926 |
+
)
|
927 |
+
gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
|
928 |
+
with gr.Row():
|
929 |
+
gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
|
930 |
+
with gr.Row():
|
931 |
+
scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_verified', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
|
932 |
+
|
933 |
+
gr.HTML('<div style="height: 30px;"></div>')
|
934 |
+
gr.Markdown("## Task success heatmap")
|
935 |
+
gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
|
936 |
+
with gr.Row():
|
937 |
+
task_success_heatmap = gr.Plot()
|
938 |
+
demo.load(
|
939 |
+
lambda: create_task_success_heatmap(
|
940 |
+
preprocessor.get_task_success_data('swebench_verified'),
|
941 |
+
'SWE-bench Verified'
|
942 |
+
),
|
943 |
+
outputs=[task_success_heatmap]
|
944 |
+
)
|
945 |
+
|
946 |
+
# gr.HTML("""
|
947 |
+
# <style>
|
948 |
+
# .grouped-section {
|
949 |
+
# border: 2px solid #dee2e6; /* Color matching unactivated tabs */
|
950 |
+
# border-radius: 10px;
|
951 |
+
# padding: 30px;
|
952 |
+
# margin-top: 40px;
|
953 |
+
# margin-bottom: 40px;
|
954 |
+
# position: relative;
|
955 |
+
# }
|
956 |
+
|
957 |
+
# .grouped-section-title {
|
958 |
+
# font-size: 1.7em;
|
959 |
+
# font-weight: bold;
|
960 |
+
# color: #2c3e50;
|
961 |
+
# margin-bottom: 20px;
|
962 |
+
# padding-bottom: 10px;
|
963 |
+
# border-bottom: 2px solid #dee2e6;
|
964 |
+
# }
|
965 |
+
# </style>
|
966 |
+
# """)
|
967 |
+
# with gr.Group(elem_classes=["grouped-section"]):
|
968 |
+
# gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
|
969 |
+
|
970 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
971 |
+
# gr.Markdown("## Failure report for each agent")
|
972 |
+
# gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
|
973 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
974 |
+
# with gr.Row():
|
975 |
+
# with gr.Column(scale=1):
|
976 |
+
# failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
|
977 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
978 |
+
# with gr.Row():
|
979 |
+
# with gr.Column(scale=1):
|
980 |
+
# failure_categories_overview = gr.Markdown()
|
981 |
+
|
982 |
+
# with gr.Column(scale=1):
|
983 |
+
# failure_categories_chart = gr.Plot()
|
984 |
+
|
985 |
+
# # Initialize the failure report agent dropdown with all agents
|
986 |
+
# demo.load(update_agent_dropdown,
|
987 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
988 |
+
# outputs=[failure_report_agent_dropdown])
|
989 |
+
|
990 |
+
# # Update failure report when agent is selected
|
991 |
+
# failure_report_agent_dropdown.change(update_failure_report,
|
992 |
+
# inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_verified", visible=False)],
|
993 |
+
# outputs=[failure_categories_overview, failure_categories_chart])
|
994 |
+
|
995 |
+
# gr.HTML('<div style="height: 30px;"></div>')
|
996 |
+
# gr.Markdown("## Task overview")
|
997 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
998 |
+
# with gr.Row():
|
999 |
+
# with gr.Column(scale=1):
|
1000 |
+
# agent_dropdown = gr.Dropdown(label="Select Agent")
|
1001 |
+
# with gr.Column(scale=1):
|
1002 |
+
# task_dropdown = gr.Dropdown(label="Select SWE-bench Verified Task")
|
1003 |
+
# gr.HTML('<div style="height: 10px;"></div>')
|
1004 |
+
# with gr.Row():
|
1005 |
+
# task_overview = gr.Markdown()
|
1006 |
+
# with gr.Row():
|
1007 |
+
# flow_chart = gr.Plot(label="Task Flow")
|
1008 |
+
|
1009 |
+
# # Initialize the agent dropdown with the best agent
|
1010 |
+
# demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
|
1011 |
+
# demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
1012 |
+
|
1013 |
+
# agent_dropdown.change(update_task_analysis,
|
1014 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown],
|
1015 |
+
# outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
1016 |
+
# task_dropdown.change(update_task_details,
|
1017 |
+
# inputs=[gr.Textbox(value="swebench_verified", visible=False), agent_dropdown, task_dropdown],
|
1018 |
+
# outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
|
1019 |
+
|
1020 |
+
gr.Markdown("## Raw predictions")
|
1021 |
+
gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
|
1022 |
+
with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
|
1023 |
+
with gr.Row():
|
1024 |
+
with gr.Column(scale=1):
|
1025 |
+
raw_agent_dropdown = gr.Dropdown(label="Select Agent")
|
1026 |
+
with gr.Column(scale=1):
|
1027 |
+
raw_task_dropdown = gr.Dropdown(label="Select Task")
|
1028 |
+
with gr.Column(scale=1):
|
1029 |
+
raw_step_dropdown = gr.Dropdown(label="Select Step")
|
1030 |
+
with gr.Row():
|
1031 |
+
raw_call_details = gr.HTML()
|
1032 |
+
|
1033 |
+
def update_raw_task_dropdown(agent_name):
|
1034 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
|
1035 |
+
if not analyzed_traces:
|
1036 |
+
return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
|
1037 |
+
task_ids = list(analyzed_traces.keys())
|
1038 |
+
steps = analyzed_traces[task_ids[0]]['steps']
|
1039 |
+
return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(get_analyzed_traces(agent_name, "swebench_verified")[task_ids[0]]['steps'][0], 0)
|
1040 |
+
|
1041 |
+
def update_raw_step_dropdown(agent_name, task_id):
|
1042 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
|
1043 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
1044 |
+
return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
|
1045 |
+
steps = analyzed_traces[task_id]['steps']
|
1046 |
+
return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
|
1047 |
+
|
1048 |
+
def update_raw_call_details(agent_name, task_id, step_index):
|
1049 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_verified")
|
1050 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
1051 |
+
return "No data available for this selection."
|
1052 |
+
steps = analyzed_traces[task_id]['steps']
|
1053 |
+
if step_index is None:
|
1054 |
+
return "Invalid step selection."
|
1055 |
+
step = steps[step_index]
|
1056 |
+
return format_call_info(step, step_index)
|
1057 |
+
|
1058 |
+
# Initialize the raw agent dropdown with all agents
|
1059 |
+
demo.load(update_agent_dropdown,
|
1060 |
+
inputs=[gr.Textbox(value="swebench_verified", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
1061 |
+
outputs=[raw_agent_dropdown])
|
1062 |
+
demo.load(update_raw_task_dropdown,
|
1063 |
+
inputs=[raw_agent_dropdown],
|
1064 |
+
outputs=[raw_task_dropdown, raw_step_dropdown])
|
1065 |
+
demo.load(update_raw_call_details,
|
1066 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
1067 |
+
outputs=[raw_call_details])
|
1068 |
+
|
1069 |
+
raw_agent_dropdown.change(update_raw_task_dropdown,
|
1070 |
+
inputs=[raw_agent_dropdown],
|
1071 |
+
outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
|
1072 |
+
raw_task_dropdown.change(update_raw_step_dropdown,
|
1073 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown],
|
1074 |
+
outputs=[raw_step_dropdown, raw_call_details])
|
1075 |
+
raw_step_dropdown.change(update_raw_call_details,
|
1076 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
1077 |
+
outputs=[raw_call_details])
|
1078 |
+
|
1079 |
+
|
1080 |
+
|
1081 |
+
with gr.Tab("SWE-bench Lite"):
|
1082 |
+
gr.Markdown("""SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. Lite is a subset of 300 tasks of the original SWE-bench. We are actively developing this platform, and this benchmark is not fully implemented yet.""")
|
1083 |
+
with gr.Row():
|
1084 |
+
with gr.Column(scale=2):
|
1085 |
+
Leaderboard(
|
1086 |
+
value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite'), ci_metrics=["Accuracy", "Total Cost"]),
|
1087 |
+
select_columns=SelectColumns(
|
1088 |
+
default_selection=config.SWEBENCH_ON_LOAD_COLUMNS + ["Verified"],
|
1089 |
+
cant_deselect=["Agent Name"],
|
1090 |
+
label="Select Columns to Display:",
|
1091 |
+
),
|
1092 |
+
hide_columns=config.SWEBENCH_HIDE_COLUMNS,
|
1093 |
+
search_columns=config.SWEBENCH_SEARCH_COLUMNS,
|
1094 |
+
)
|
1095 |
+
gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
|
1096 |
+
with gr.Row():
|
1097 |
+
gr.Markdown("### Accuracy vs. Cost for SWE-bench agents")
|
1098 |
+
with gr.Row():
|
1099 |
+
scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'swebench_lite', aggregate=False), "Total Cost", "Accuracy", "Total Cost (in USD)", "Accuracy", ["Agent Name"]))
|
1100 |
+
|
1101 |
+
gr.HTML('<div style="height: 30px;"></div>')
|
1102 |
+
gr.Markdown("## Task success heatmap")
|
1103 |
+
gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in SWE-bench are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right by the fewest). For agents that have been run more than once, the run with the highest score is shown.")
|
1104 |
+
with gr.Row():
|
1105 |
+
task_success_heatmap = gr.Plot()
|
1106 |
+
demo.load(
|
1107 |
+
lambda: create_task_success_heatmap(
|
1108 |
+
preprocessor.get_task_success_data('swebench_lite'),
|
1109 |
+
'SWE-bench Lite'
|
1110 |
+
),
|
1111 |
+
outputs=[task_success_heatmap]
|
1112 |
+
)
|
1113 |
+
|
1114 |
+
gr.HTML("""
|
1115 |
+
<style>
|
1116 |
+
.grouped-section {
|
1117 |
+
border: 2px solid #dee2e6; /* Color matching unactivated tabs */
|
1118 |
+
border-radius: 10px;
|
1119 |
+
padding: 30px;
|
1120 |
+
margin-top: 40px;
|
1121 |
+
margin-bottom: 40px;
|
1122 |
+
position: relative;
|
1123 |
+
}
|
1124 |
+
|
1125 |
+
.grouped-section-title {
|
1126 |
+
font-size: 1.7em;
|
1127 |
+
font-weight: bold;
|
1128 |
+
color: #2c3e50;
|
1129 |
+
margin-bottom: 20px;
|
1130 |
+
padding-bottom: 10px;
|
1131 |
+
border-bottom: 2px solid #dee2e6;
|
1132 |
+
}
|
1133 |
+
</style>
|
1134 |
+
""")
|
1135 |
+
with gr.Group(elem_classes=["grouped-section"]):
|
1136 |
+
gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")
|
1137 |
+
|
1138 |
+
gr.HTML('<div style="height: 10px;"></div>')
|
1139 |
+
gr.Markdown("## Failure report for each agent")
|
1140 |
+
gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
|
1141 |
+
gr.HTML('<div style="height: 10px;"></div>')
|
1142 |
+
with gr.Row():
|
1143 |
+
with gr.Column(scale=1):
|
1144 |
+
failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
|
1145 |
+
gr.HTML('<div style="height: 10px;"></div>')
|
1146 |
+
with gr.Row():
|
1147 |
+
with gr.Column(scale=1):
|
1148 |
+
failure_categories_overview = gr.Markdown()
|
1149 |
+
|
1150 |
+
with gr.Column(scale=1):
|
1151 |
+
failure_categories_chart = gr.Plot()
|
1152 |
+
|
1153 |
+
# Initialize the failure report agent dropdown with all agents
|
1154 |
+
demo.load(update_agent_dropdown,
|
1155 |
+
inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
1156 |
+
outputs=[failure_report_agent_dropdown])
|
1157 |
+
|
1158 |
+
# Update failure report when agent is selected
|
1159 |
+
failure_report_agent_dropdown.change(update_failure_report,
|
1160 |
+
inputs=[failure_report_agent_dropdown, gr.Textbox(value="swebench_lite", visible=False)],
|
1161 |
+
outputs=[failure_categories_overview, failure_categories_chart])
|
1162 |
+
|
1163 |
+
gr.HTML('<div style="height: 30px;"></div>')
|
1164 |
+
gr.Markdown("## Task overview")
|
1165 |
+
gr.HTML('<div style="height: 10px;"></div>')
|
1166 |
+
with gr.Row():
|
1167 |
+
with gr.Column(scale=1):
|
1168 |
+
agent_dropdown = gr.Dropdown(label="Select Agent")
|
1169 |
+
with gr.Column(scale=1):
|
1170 |
+
task_dropdown = gr.Dropdown(label="Select SWE-bench Lite Task")
|
1171 |
+
gr.HTML('<div style="height: 10px;"></div>')
|
1172 |
+
with gr.Row():
|
1173 |
+
task_overview = gr.Markdown()
|
1174 |
+
with gr.Row():
|
1175 |
+
flow_chart = gr.Plot(label="Task Flow")
|
1176 |
+
|
1177 |
+
# Initialize the agent dropdown with the best agent
|
1178 |
+
demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)], outputs=[agent_dropdown])
|
1179 |
+
demo.load(update_task_analysis, inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
1180 |
+
|
1181 |
+
agent_dropdown.change(update_task_analysis,
|
1182 |
+
inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown],
|
1183 |
+
outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
|
1184 |
+
task_dropdown.change(update_task_details,
|
1185 |
+
inputs=[gr.Textbox(value="swebench_lite", visible=False), agent_dropdown, task_dropdown],
|
1186 |
+
outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])
|
1187 |
+
|
1188 |
+
gr.Markdown("## Raw predictions")
|
1189 |
+
gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
|
1190 |
+
with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
|
1191 |
+
with gr.Row():
|
1192 |
+
with gr.Column(scale=1):
|
1193 |
+
raw_agent_dropdown = gr.Dropdown(label="Select Agent")
|
1194 |
+
with gr.Column(scale=1):
|
1195 |
+
raw_task_dropdown = gr.Dropdown(label="Select Task")
|
1196 |
+
with gr.Column(scale=1):
|
1197 |
+
raw_step_dropdown = gr.Dropdown(label="Select Step")
|
1198 |
+
with gr.Row():
|
1199 |
+
raw_call_details = gr.HTML()
|
1200 |
+
|
1201 |
+
def update_raw_task_dropdown(agent_name):
|
1202 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
|
1203 |
+
if not analyzed_traces:
|
1204 |
+
return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
|
1205 |
+
task_ids = list(analyzed_traces.keys())
|
1206 |
+
steps = analyzed_traces[task_ids[0]]['steps']
|
1207 |
+
return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(get_analyzed_traces(agent_name, "swebench_lite")[task_ids[0]]['steps'][0], 0)
|
1208 |
+
|
1209 |
+
def update_raw_step_dropdown(agent_name, task_id):
|
1210 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
|
1211 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
1212 |
+
return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
|
1213 |
+
steps = analyzed_traces[task_id]['steps']
|
1214 |
+
return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)
|
1215 |
+
|
1216 |
+
def update_raw_call_details(agent_name, task_id, step_index):
|
1217 |
+
analyzed_traces = get_analyzed_traces(agent_name, "swebench_lite")
|
1218 |
+
if not analyzed_traces or task_id not in analyzed_traces:
|
1219 |
+
return "No data available for this selection."
|
1220 |
+
steps = analyzed_traces[task_id]['steps']
|
1221 |
+
if step_index is None:
|
1222 |
+
return "Invalid step selection."
|
1223 |
+
step = steps[step_index]
|
1224 |
+
return format_call_info(step, step_index)
|
1225 |
+
|
1226 |
+
# Initialize the raw agent dropdown with all agents
|
1227 |
+
demo.load(update_agent_dropdown,
|
1228 |
+
inputs=[gr.Textbox(value="swebench_lite", visible=False), gr.Textbox(value="Accuracy", visible=False)],
|
1229 |
+
outputs=[raw_agent_dropdown])
|
1230 |
+
demo.load(update_raw_task_dropdown,
|
1231 |
+
inputs=[raw_agent_dropdown],
|
1232 |
+
outputs=[raw_task_dropdown, raw_step_dropdown])
|
1233 |
+
demo.load(update_raw_call_details,
|
1234 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
1235 |
+
outputs=[raw_call_details])
|
1236 |
+
|
1237 |
+
raw_agent_dropdown.change(update_raw_task_dropdown,
|
1238 |
+
inputs=[raw_agent_dropdown],
|
1239 |
+
outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
|
1240 |
+
raw_task_dropdown.change(update_raw_step_dropdown,
|
1241 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown],
|
1242 |
+
outputs=[raw_step_dropdown, raw_call_details])
|
1243 |
+
raw_step_dropdown.change(update_raw_call_details,
|
1244 |
+
inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
|
1245 |
+
outputs=[raw_call_details])
|
1246 |
+
|
1247 |
+
|
1248 |
+
with gr.Tab("MLAgentBench"):
|
1249 |
+
gr.Markdown("""MLAgentBench is a suite of end-to-end Machine Learning (ML) experimentation tasks, where the agent aims to take a given dataset and a machine learning task description and autonomously develop or improve an ML model. We are actively developing this platform, and this benchmark is not fully implemented yet. In particular, we only include one agent and a subset of tasks for this benchmark.""")
|
1250 |
+
with gr.Row():
|
1251 |
+
with gr.Column(scale=2):
|
1252 |
+
Leaderboard(
|
1253 |
+
value=create_leaderboard(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench')),
|
1254 |
+
select_columns=SelectColumns(
|
1255 |
+
default_selection=config.MLAGENTBENCH_ON_LOAD_COLUMNS + ["Verified"],
|
1256 |
+
cant_deselect=["Agent Name"],
|
1257 |
+
label="Select Columns to Display:",
|
1258 |
+
),
|
1259 |
+
hide_columns=config.MLAGENTBENCH_HIDE_COLUMNS,
|
1260 |
+
search_columns=config.MLAGENTBENCH_SEARCH_COLUMNS,
|
1261 |
+
)
|
1262 |
+
gr.Markdown("""*Error ranges span from the lowest to highest observed values in repeated runs.*""", elem_classes=["text-right"])
|
1263 |
+
with gr.Row():
|
1264 |
+
gr.Markdown("### Accuracy vs. Cost for MLAgentBench agents")
|
1265 |
+
with gr.Row():
|
1266 |
+
scatter_plot = gr.Plot(create_scatter_plot(parse_json_files(os.path.join(abs_path, "evals_live"), 'mlagentbench', aggregate=False), "Total Cost", "Overall Score", "Total Cost (in USD)", "Overall Score", ["Agent Name"]))
|
1267 |
+
|
1268 |
+
# gr.HTML('<div style="height: 30px;"></div>')
|
1269 |
+
# gr.Markdown("## Task success heatmap")
|
1270 |
+
# gr.Markdown("The task success heatmap shows which agent can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks in USACO are sorted by decreasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the least. For agents that have been run more than once, the run with the highest score is shown.")
|
1271 |
+
# with gr.Row():
|
1272 |
+
# task_success_heatmap = gr.Plot()
|
1273 |
+
# demo.load(
|
1274 |
+
# lambda: create_task_success_heatmap(
|
1275 |
+
# preprocessor.get_task_success_data('usaco'),
|
1276 |
+
# 'USACO'
|
1277 |
+
# ),
|
1278 |
+
# outputs=[task_success_heatmap]
|
1279 |
+
# )
|
1280 |
+
|
1281 |
+
gr.HTML("""
|
1282 |
+
<style>
|
1283 |
+
.grouped-section {
|
1284 |
+
border: 2px solid #dee2e6; /* Color matching unactivated tabs */
|
1285 |
+
            border-radius: 10px;
            padding: 30px;
            margin-top: 40px;
            margin-bottom: 40px;
            position: relative;
        }

        .grouped-section-title {
            font-size: 1.7em;
            font-weight: bold;
            color: #2c3e50;
            margin-bottom: 20px;
            padding-bottom: 10px;
            border-bottom: 2px solid #dee2e6;
        }
        </style>
        """)
        with gr.Group(elem_classes=["grouped-section"]):
            gr.Markdown("# Agent monitor", elem_classes=["grouped-section-title"], elem_id="agent-monitor")

            # gr.HTML('<div style="height: 10px;"></div>')
            # gr.Markdown("## Failure report for each agent")
            # gr.Markdown('Select an agent to see why the agent fails to solve tasks correctly. Note that these descriptions (and the failure categories) are generated by LLM-based evaluations of the agent logs and may contain inaccuracies.')
            # gr.HTML('<div style="height: 10px;"></div>')
            # with gr.Row():
            #     with gr.Column(scale=1):
            #         failure_report_agent_dropdown = gr.Dropdown(label="Select Agent for Failure Report")
            # gr.HTML('<div style="height: 10px;"></div>')
            # with gr.Row():
            #     with gr.Column(scale=1):
            #         failure_categories_overview = gr.Markdown()

            #     with gr.Column(scale=1):
            #         failure_categories_chart = gr.Plot()

            # # Initialize the failure report agent dropdown with all agents
            # demo.load(update_agent_dropdown,
            #           inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
            #           outputs=[failure_report_agent_dropdown])

            # # Update failure report when agent is selected
            # failure_report_agent_dropdown.change(update_failure_report,
            #                                      inputs=[failure_report_agent_dropdown, gr.Textbox(value="mlagentbench", visible=False)],
            #                                      outputs=[failure_categories_overview, failure_categories_chart])

            gr.HTML('<div style="height: 30px;"></div>')
            gr.Markdown("## Task overview")
            gr.HTML('<div style="height: 10px;"></div>')
            with gr.Row():
                with gr.Column(scale=1):
                    agent_dropdown = gr.Dropdown(label="Select Agent")
                with gr.Column(scale=1):
                    task_dropdown = gr.Dropdown(label="Select MLAgentBench Task")
            gr.HTML('<div style="height: 10px;"></div>')
            with gr.Row():
                task_overview = gr.Markdown()
            with gr.Row():
                flow_chart = gr.Plot(label="Task Flow")

            # Initialize the agent dropdown with the best agent
            demo.load(update_agent_dropdown, inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)], outputs=[agent_dropdown])
            demo.load(update_task_analysis, inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown], outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])

            agent_dropdown.change(update_task_analysis,
                                  inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown],
                                  outputs=[task_overview, flow_chart, task_dropdown, gr.Textbox(visible=False)])
            task_dropdown.change(update_task_details,
                                 inputs=[gr.Textbox(value="mlagentbench", visible=False), agent_dropdown, task_dropdown],
                                 outputs=[task_overview, flow_chart, gr.Textbox(visible=False)])

            gr.Markdown("## Raw predictions")
            gr.Markdown('Select an agent to see the raw predictions made by the agent for each task. We also provide information on token usage for each call.')
            with gr.Accordion("Expand to inspect raw predictions of agents...", open=False):
                with gr.Row():
                    with gr.Column(scale=1):
                        raw_agent_dropdown = gr.Dropdown(label="Select Agent")
                    with gr.Column(scale=1):
                        raw_task_dropdown = gr.Dropdown(label="Select Task")
                    with gr.Column(scale=1):
                        raw_step_dropdown = gr.Dropdown(label="Select Step")
                with gr.Row():
                    raw_call_details = gr.HTML()

            def update_raw_task_dropdown(agent_name):
                analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
                if not analyzed_traces:
                    return gr.Dropdown(choices=[], label="Select Task"), gr.Dropdown(choices=[], label="Select Step"), f"No raw predictions data available for agent: {agent_name}."
                task_ids = list(analyzed_traces.keys())
                steps = analyzed_traces[task_ids[0]]['steps']
                return gr.Dropdown(choices=task_ids, label="Select Task", value=task_ids[0]), gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(get_analyzed_traces(agent_name, "mlagentbench")[task_ids[0]]['steps'][0], 0)

            def update_raw_step_dropdown(agent_name, task_id):
                analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
                if not analyzed_traces or task_id not in analyzed_traces:
                    return gr.Dropdown(choices=[], label="Select Step", value="No data available.")
                steps = analyzed_traces[task_id]['steps']
                return gr.Dropdown(choices=[(f"Step {i+1}", i) for i in range(len(steps))], label="Select Step", value=0), format_call_info(steps[0], 0)

            def update_raw_call_details(agent_name, task_id, step_index):
                analyzed_traces = get_analyzed_traces(agent_name, "mlagentbench")
                if not analyzed_traces or task_id not in analyzed_traces:
                    return "No data available for this selection."
                steps = analyzed_traces[task_id]['steps']
                if step_index is None:
                    return "Invalid step selection."
                step = steps[step_index]
                return format_call_info(step, step_index)

            # Initialize the raw agent dropdown with all agents
            demo.load(update_agent_dropdown,
                      inputs=[gr.Textbox(value="mlagentbench", visible=False), gr.Textbox(value="Overall Score", visible=False)],
                      outputs=[raw_agent_dropdown])
            demo.load(update_raw_task_dropdown,
                      inputs=[raw_agent_dropdown],
                      outputs=[raw_task_dropdown, raw_step_dropdown])
            demo.load(update_raw_call_details,
                      inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
                      outputs=[raw_call_details])

            raw_agent_dropdown.change(update_raw_task_dropdown,
                                      inputs=[raw_agent_dropdown],
                                      outputs=[raw_task_dropdown, raw_step_dropdown, raw_call_details])
            raw_task_dropdown.change(update_raw_step_dropdown,
                                     inputs=[raw_agent_dropdown, raw_task_dropdown],
                                     outputs=[raw_step_dropdown, raw_call_details])
            raw_step_dropdown.change(update_raw_call_details,
                                     inputs=[raw_agent_dropdown, raw_task_dropdown, raw_step_dropdown],
                                     outputs=[raw_call_details])

    with gr.Tab("About"):
        gr.Markdown((Path(__file__).parent / "about.md").read_text())

    # Will trigger autoscaling of plots when tabs are switched
    tabs.select(fn=None, inputs=None, outputs=None, js="""
    function() {
        setTimeout(function() {
            window.dispatchEvent(new Event('resize'));
        }, 100);
    }
    """)
    gr.HTML("""<h2 class="section-heading" id="agent-submission">How to add an agent to HAL leaderboards?</h2>""")
    gr.Markdown((Path(__file__).parent / "agent_submission.md").read_text())
    gr.HTML("""<h2 class="section-heading" id="benchmark-submission">How to add a benchmark to HAL?</h2>""")
    gr.Markdown((Path(__file__).parent / "benchmark_submission.md").read_text())
    gr.HTML("""<h2 class="section-heading" id="reproduction-guide">How can I run evaluations?</h2>""")
    gr.Markdown("""Coming soon...""")


async def main():
    # Preprocess traces
    # preprocessor = TracePreprocessor()
    # preprocessor.preprocess_traces('evals_live')
    # preprocessor = TracePreprocessor()

    # Download the results from the Hugging Face Hub
    # await asyncio.to_thread(download_latest_results)

    # # Check for new uploads and process them
    # await check_and_process_uploads()

    scheduler = AsyncIOScheduler()
    scheduler.add_job(restart_space, "interval", hours=1)
    # scheduler.add_job(download_latest_results, "interval", hours=1)
    # scheduler.add_job(check_and_process_uploads, "interval", hours=1)
    scheduler.start()

    await demo.launch(favicon_path="hal.png")

if __name__ == "__main__":
    weave.init(f'leaderboard_{datetime.now().strftime("%Y%m%d%H%M%S")}')
    asyncio.run(main())
benchmark_submission.md
ADDED
@@ -0,0 +1,5 @@
To submit **a new benchmark** to the library:

1. Implement a new benchmark using some standard format (such as the [METR Task Standard](https://github.com/METR/task-standard)). This includes specifying the exact instructions for each task as well as the task environment that is provided inside the container the agent is run in.

2. We will encourage developers to support running their tasks on separate VMs and to specify the exact hardware specifications for each task in the task environment.
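Illustrative aside (not part of the committed files): a minimal sketch of the kind of task specification step 1 asks for. The field names below are hypothetical, not the METR Task Standard's actual API; the point is that each task pairs exact instructions with a description of the container environment and hardware the agent runs on.

# Hypothetical task specification; field names are illustrative only.
TASKS = {
    "book-flight-001": {
        "instructions": "Book the cheapest nonstop flight from A to B on the given date.",
        "environment": {
            "docker_image": "benchmark/web-browsing:latest",  # container the agent is run in
            "cpu": 2,
            "memory_gb": 8,  # per-task hardware spec, as encouraged in step 2
        },
    },
}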
config.py
ADDED
@@ -0,0 +1,56 @@
import pandas as pd

TYPES = [
    "str",
    "number",
    "number"
]

SWEBENCH_ON_LOAD_COLUMNS = [
    "Agent Name",
    "Accuracy",
    "Total Cost",
    "Runs",
]
SWEBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
SWEBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']

USACO_ON_LOAD_COLUMNS = [
    "Agent Name",
    "Accuracy",
    "Total Cost",
    "Runs",
]
USACO_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
USACO_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']

COREBENCH_ON_LOAD_COLUMNS = [
    "Agent Name",
    "Accuracy",
    "Total Cost",
    "Runs",
]
COREBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
COREBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Overall Score', 'Vectorization Score', 'Fathomnet Score', 'Feedback Score', 'House Price Score', 'Spaceship Titanic Score', 'AMP Parkinsons Disease Progression Prediction Score', 'CIFAR10 Score', 'IMDB Score']


MLAGENTBENCH_ON_LOAD_COLUMNS = [
    "Agent Name",
    "Overall Score",
    "Total Cost",
]
MLAGENTBENCH_SEARCH_COLUMNS = ['Total Cost', 'Agent Name']
MLAGENTBENCH_HIDE_COLUMNS = ["F1 Score", "AUC", "Precision", "Recall", "benchmark_name", 'Accuracy']


NUMERIC_INTERVALS = {
    "?": pd.Interval(-1, 0, closed="right"),
    "~1.5": pd.Interval(0, 2, closed="right"),
    "~3": pd.Interval(2, 4, closed="right"),
    "~7": pd.Interval(4, 9, closed="right"),
    "~13": pd.Interval(9, 20, closed="right"),
    "~35": pd.Interval(20, 45, closed="right"),
    "~60": pd.Interval(45, 70, closed="right"),
    "70+": pd.Interval(70, 10000, closed="right"),
}
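A minimal sketch of how a mapping like NUMERIC_INTERVALS is typically used, assuming the leaderboard wants to bucket a numeric value (e.g. model size in billions of parameters) into the labels above. The helper below is hypothetical and not part of config.py.

import pandas as pd
from config import NUMERIC_INTERVALS

def size_bucket(value: float) -> str:
    # Return the label of the first interval containing the value, e.g. 7 -> "~7".
    for label, interval in NUMERIC_INTERVALS.items():
        if value in interval:
            return label
    return "?"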
css.css
ADDED
@@ -0,0 +1,48 @@
/* Base styles and variables */
:root {
    --primary-color: #3498db;
    --secondary-color: #2c3e50;
    --background-color: #f8f9fa;
    --text-color: #333;
    --accent-color: #e74c3c;
    --space: 1rem;
}

/* Tabs */
.tab-nav {
    display: flex;
    background-color: var(--secondary-color);
    border-radius: 8px 8px 0 0;
    overflow: hidden;
}

.tab-nav button {
    padding: 1rem 1.5rem;
    background-color: transparent;
    color: #fff;
    border: none;
    cursor: pointer;
    transition: background-color 0.3s;
}

.tab-nav button:hover,
.tab-nav button.selected {
    background-color: var(--primary-color);
}


.svelte-iibkxk .stretch {
    display: none;
}

/* Utility classes */
.text-center { text-align: center; }
.text-right { text-align: right; }
.font-bold { font-weight: 700; }
.text-small { font-size: 0.875rem; }
.text-large { font-size: 1.25rem; }
.mt-1 { margin-top: 1rem; }
.mb-1 { margin-bottom: 1rem; }
.ml-1 { margin-left: 1rem; }
.mr-1 { margin-right: 1rem; }
envs.py
ADDED
@@ -0,0 +1,10 @@
import os
from huggingface_hub import HfApi

HF_TOKEN = os.getenv('HF_TOKEN', None)

RESULTS_REPO_ID = 'agent-evals/results'
REPO_ID = 'agent-evals/leaderboard'

API = HfApi(token=HF_TOKEN)
hal.ico
ADDED
hal.png
ADDED
header.md
ADDED
@@ -0,0 +1,3 @@
# Holistic Agent Leaderboard (HAL)

**A centralized, standardized, cost-aware leaderboard for evaluating agents.**
requirements.txt
ADDED
@@ -0,0 +1,108 @@
aiofiles==23.2.1
aiohappyeyeballs==2.3.5
aiohttp==3.10.3
aioprocessing==2.0.1
aiosignal==1.3.1
aiosmtplib==3.0.2
analytics-python==1.2.9
annotated-types==0.7.0
anyio==4.4.0
APScheduler
async-timeout==4.0.3
attrs==24.2.0
backoff==2.2.1
certifi==2024.7.4
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.2.1
cycler==0.12.1
distro==1.9.0
dnspython==2.6.1
docker-pycreds==0.4.0
email_validator==2.2.0
emoji==2.12.1
exceptiongroup==1.2.2
fastapi==0.111.1
fastapi-cli==0.0.4
ffmpy==0.4.0
filelock==3.15.4
fonttools==4.53.1
frozenlist==1.4.1
fsspec==2024.6.1
gitdb==4.0.11
GitPython==3.1.43
gql==3.5.0
gradio==4.40.0
gradio_client==1.2.0
gradio_leaderboard==0.0.11
graphql-core==3.2.3
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.24.5
idna==3.7
importlib_resources==6.4.0
janus==1.0.0
Jinja2==3.1.4
jiter==0.5.0
kiwisolver==1.4.5
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.1
mdurl==0.1.2
multidict==6.0.5
numpy==1.26.4
openai==1.40.3
orjson==3.10.6
packaging==24.1
pandas==2.2.2
pillow==10.4.0
platformdirs==4.2.2
plotly==5.23.0
protobuf==5.27.3
psutil==6.0.0
pyarrow==16.1.0
pydantic==2.8.2
pydantic_core==2.20.1
pydub==0.25.1
Pygments==2.18.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.9
pytz
PyYAML==6.0.1
regex==2024.7.24
requests==2.32.3
requests-toolbelt==1.0.0
rich==13.7.1
ruff==0.5.5
scipy==1.14.1
semantic-version==2.10.0
sentry-sdk==2.12.0
setproctitle==1.3.3
shellingham==1.5.4
six
smmap==5.0.1
sniffio==1.3.1
starlette==0.37.2
tenacity==9.0.0
tiktoken==0.7.0
tomlkit==0.12.0
tqdm==4.66.4
typer==0.12.3
typing_extensions==4.12.2
tzdata==2024.1
tzlocal
urllib3==2.2.2
uvicorn==0.30.4
uvloop==0.19.0
wandb==0.17.6
watchfiles==0.22.0
weave==0.50.13
websockets==12.0
Werkzeug==3.0.3
yarl==1.9.4
utils/test.txt → scratch.ipynb
RENAMED
File without changes
scratch.py
ADDED
@@ -0,0 +1,38 @@
import json
import os
from pathlib import Path

def process_json_files(directory, suffix="_updated"):
    # Iterate through all JSON files in the directory
    for filename in os.listdir(directory):
        if filename.endswith(".json") and "USACO" in filename:
            file_path = os.path.join(directory, filename)

            # Read the JSON file
            with open(file_path, 'r') as f:
                data = json.load(f)

            # Extract sdict from raw_eval_results
            sdict = data['raw_eval_results']['sdict']

            # Calculate successful_tasks and failed_tasks
            successful_tasks = [key for key in sdict if float(sdict[key][0]['result']['fraction_passed']) == 1]
            failed_tasks = [key for key in sdict if float(sdict[key][0]['result']['fraction_passed']) < 1]

            # Add new key-value pairs to the results
            data['results']['successful_tasks'] = successful_tasks
            data['results']['failed_tasks'] = failed_tasks

            # Create new filename with suffix
            new_filename = f"{Path(filename).stem}{suffix}{Path(filename).suffix}"
            new_file_path = os.path.join(directory, new_filename)

            # Write updated data to new file
            with open(new_file_path, 'w') as f:
                json.dump(data, f, indent=4)

            print(f"Processed {filename} and saved as {new_filename}")

# Usage
directory_path = "/Users/benediktstroebl/Documents/GitHub/leaderboard/evals_live"
process_json_files(directory_path)
utils/data.py
ADDED
@@ -0,0 +1,296 @@
import json
from pathlib import Path
import pandas as pd
import plotly.express as px
from utils.pareto import Agent, compute_pareto_frontier
import plotly.graph_objects as go
import textwrap

# def parse_json_files(folder_path, benchmark_name):
#     # Convert folder path to Path object
#     folder = Path(folder_path)

#     # List to store data from each file
#     data_list = []

#     # Iterate through all JSON files in the folder
#     for json_file in folder.glob('*.json'):
#         try:
#             with open(json_file, 'r') as file:
#                 data = json.load(file)

#             # Extract config and results
#             config = data['config']
#             results = data['results']

#             # Combine config and results into a single dictionary
#             combined_data = {
#                 'agent_name': config['agent_name'],
#                 'benchmark_name': config['benchmark_name'],
#                 'date': config['date']
#             }

#             # Add results with 'results_' prefix
#             for key, value in results.items():
#                 combined_data[f'results_{key}'] = value

#             data_list.append(combined_data)
#         except Exception as e:
#             print(f"Error processing {json_file}: {e}. Skipping!")

#     # Create DataFrame from the list of dictionaries
#     df = pd.DataFrame(data_list)
#     df = df[df['benchmark_name'] == benchmark_name]

#     # sort df by descending accuracy
#     df = df.sort_values(by='results_accuracy', ascending=False)

#     # round all float columns to 2 decimal places
#     for column in df.select_dtypes(include='float').columns:
#         df[column] = df[column].round(3)

#     # Rename columns
#     df = df.rename(columns={
#         'agent_name': 'Agent Name',
#         'results_total_cost': 'Total Cost',
#         'results_accuracy': 'Accuracy',
#         'results_precision': 'Precision',
#         'results_recall': 'Recall',
#         'results_f1_score': 'F1 Score',
#         'results_auc': 'AUC',
#     })

#     return df


def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
    agents = [Agent(row.results_total_cost, row.results_accuracy) for row in df.itertuples()]
    pareto_frontier = compute_pareto_frontier(agents)


    fig = px.scatter(df,
                     x=x,
                     y=y,
                     hover_data=hover_data,
                     )


    # Sort the Pareto frontier points by x-coordinate
    pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])

    # Add the Pareto frontier line
    fig.add_trace(go.Scatter(
        x=[point[0] for point in pareto_points],
        y=[point[1] for point in pareto_points],
        mode='lines',
        name='Pareto Frontier',
        line=dict(color='black', width=1, dash='dash')
    ))

    fig.update_yaxes(rangemode="tozero")
    fig.update_xaxes(rangemode="tozero")

    fig.update_layout(
        width=600,
        height=500,
        xaxis_title=x_label,
        yaxis_title=y_label,
        xaxis=dict(
            showline=True,
            linecolor='black',
            showgrid=False),
        yaxis=dict(
            showline=True,
            showgrid=False,
            linecolor='black'),
        plot_bgcolor='white',
        # Legend positioning
        legend=dict(
            yanchor="bottom",
            y=0.01,
            xanchor="right",
            x=0.98,
            bgcolor="rgba(255, 255, 255, 0.5)"  # semi-transparent white background
        )
    )
    return fig


import plotly.graph_objects as go
import textwrap

def create_flow_chart(steps):
    node_x = []
    node_y = []
    edge_x = []
    edge_y = []
    node_text = []
    hover_text = []
    node_colors = []
    node_shapes = []

    # Define color and shape mappings
    color_map = {True: 'green', False: 'red'}  # True for success, False for challenges
    shape_map = {
        'plan': 'octagon',
        'tool': 'square',
        'retrieve': 'diamond',
        'other': 'circle'
    }

    for i, step in enumerate(steps):
        node_x.append(i)
        node_y.append(0)

        # Extract Description, Assessment, and new attributes
        analysis = step['analysis']
        if isinstance(analysis, str):
            try:
                analysis = json.loads(analysis)
            except json.JSONDecodeError:
                analysis = {}

        description = analysis.get('description', 'No description available.')
        assessment = analysis.get('assessment', 'No assessment available.')
        success = analysis.get('success', True)  # Assuming True if not specified
        action_type = analysis.get('action_type', 'other')  # Default to 'other' if not specified
        step_outline = analysis.get('step_outline', '')

        # Set node color and shape based on attributes
        node_colors.append(color_map[success])
        node_shapes.append(shape_map.get(action_type, 'circle'))

        # Wrap text to improve readability
        wrapped_description = '<br>'.join(textwrap.wrap(description, width=50))
        wrapped_assessment = '<br>'.join(textwrap.wrap(assessment, width=50))
        wrapped_outline = textwrap.shorten(step_outline, width=30, placeholder='')
        wrapped_outline = '' if wrapped_outline == '' else f": {wrapped_outline}"

        node_text_outline = '' if wrapped_outline == '' else f":<br>{textwrap.shorten(step_outline, width=30, placeholder='')}"
        node_text.append(f"Step {i+1}{node_text_outline}")

        # Create formatted hover text without indentation
        hover_info = f"<b>Step {i+1}{wrapped_outline}</b><br><br>" \
                     f"<b>Description:</b><br>" \
                     f"{wrapped_description}<br><br>" \
                     f"<b>Assessment:</b><br>" \
                     f"{wrapped_assessment}<br><br>" \
                     f"<b>Successful:</b> {'Yes' if success else 'No'}<br>" \
                     f"<b>Action Type:</b> {action_type.capitalize()}"
        hover_text.append(hover_info)

        if i > 0:
            edge_x.extend([i-1, i, None])
            edge_y.extend([0, 0, None])

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        text=node_text,
        textposition="top center",
        showlegend=False,
        hovertext=hover_text,
        hoverinfo='text',
        hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
        marker=dict(
            color=node_colors,
            size=30,
            line_width=2,
            symbol=node_shapes
        ))

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=2, color='#888'),
        hoverinfo='none',
        showlegend=False,
        mode='lines')

    # Create legend traces
    legend_traces = []

    # Color legend
    for success, color in color_map.items():
        legend_traces.append(go.Scatter(
            x=[None], y=[None],
            mode='markers',
            marker=dict(size=10, color=color),
            showlegend=True,
            name=f"{'Success' if success else 'Issue'}"
        ))

    # Shape legend
    for action, shape in shape_map.items():
        legend_traces.append(go.Scatter(
            x=[None], y=[None],
            mode='markers',
            marker=dict(size=10, symbol=shape, color='gray'),
            showlegend=True,
            name=f"{action.capitalize()}"
        ))

    # Combine all traces
    all_traces = [edge_trace, node_trace] + legend_traces

    layout = go.Layout(
        showlegend=True,
        hovermode='closest',
        margin=dict(b=20, l=5, r=5, t=40),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        plot_bgcolor='white',
        paper_bgcolor='white',
        modebar=dict(
            activecolor='#1f77b4',  # Color of active tool
            orientation='h',  # Vertical orientation
            bgcolor='rgba(255,255,255,0.8)',  # Slightly transparent white background
            color='#777',  # Color of inactive tools
        ),
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=0.02,
            xanchor="right",
            x=1,
            bgcolor='rgba(255,255,255,0.8)',
            bordercolor='rgba(0,0,0,0.1)',
            borderwidth=1
        ),
    )

    fig = go.Figure(data=all_traces, layout=layout)

    fig.update_layout(legend=dict(
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        bgcolor='rgba(255,255,255,0.8)',  # Set legend background to slightly transparent white
        bordercolor='rgba(0,0,0,0.1)',  # Add a light border to the legend
        borderwidth=1
    ),
        dragmode='pan'
    )

    config = {
        'add': ['pan2d'],
        'remove': [
            'zoom2d',
            'zoomIn2d',
            'zoomOut2d',
            'resetScale2d',
            'hoverClosestCartesian',
            'hoverCompareCartesian',
            'toggleSpikelines',
            'lasso2d',
            'lasso',
            'select2d',
            'select',
        ]
    }

    # Apply the config to the figure
    fig.update_layout(modebar=config)

    return fig
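A small usage sketch for create_scatter_plot, assuming a results DataFrame that carries the results_total_cost and results_accuracy columns the function reads via itertuples(); the toy values, agent names, and axis labels below are made up.

import pandas as pd
from utils.data import create_scatter_plot

# Toy accuracy-vs-cost frame with one row per agent run.
df = pd.DataFrame({
    "results_total_cost": [0.5, 1.2, 3.0],
    "results_accuracy": [0.31, 0.42, 0.45],
    "agent_name": ["Agent A", "Agent B", "Agent C"],
})
fig = create_scatter_plot(df, x="results_total_cost", y="results_accuracy",
                          x_label="Total Cost", y_label="Accuracy",
                          hover_data=["agent_name"])
fig.show()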
utils/db.py
ADDED
@@ -0,0 +1,361 @@
import json
from pathlib import Path
import sqlite3
import pickle
from functools import lru_cache
import threading
import pandas as pd
import ast
from scipy import stats
import yaml
import numpy as np

class TracePreprocessor:
    def __init__(self, db_path='preprocessed_traces.db'):
        self.db_path = db_path
        self.local = threading.local()

    def get_conn(self):
        if not hasattr(self.local, 'conn'):
            self.local.conn = sqlite3.connect(self.db_path)
        return self.local.conn

    def create_tables(self):
        with self.get_conn() as conn:
            conn.execute('''
                CREATE TABLE IF NOT EXISTS preprocessed_traces (
                    benchmark_name TEXT,
                    agent_name TEXT,
                    date TEXT,
                    run_id TEXT,
                    raw_logging_results BLOB,
                    PRIMARY KEY (benchmark_name, agent_name, run_id)
                )
            ''')
            conn.execute('''
                CREATE TABLE IF NOT EXISTS failure_reports (
                    benchmark_name TEXT,
                    agent_name TEXT,
                    date TEXT,
                    run_id TEXT,
                    failure_report BLOB,
                    PRIMARY KEY (benchmark_name, agent_name, run_id)
                )
            ''')
            conn.execute('''
                CREATE TABLE IF NOT EXISTS parsed_results (
                    benchmark_name TEXT,
                    agent_name TEXT,
                    date TEXT,
                    run_id TEXT,
                    successful_tasks TEXT,
                    failed_tasks TEXT,
                    total_cost REAL,
                    accuracy REAL,
                    precision REAL,
                    recall REAL,
                    f1_score REAL,
                    auc REAL,
                    overall_score REAL,
                    vectorization_score REAL,
                    fathomnet_score REAL,
                    feedback_score REAL,
                    house_price_score REAL,
                    spaceship_titanic_score REAL,
                    amp_parkinsons_disease_progression_prediction_score REAL,
                    cifar10_score REAL,
                    imdb_score REAL,
                    PRIMARY KEY (benchmark_name, agent_name, run_id)
                )
            ''')

    def preprocess_traces(self, processed_dir="evals_live"):
        self.create_tables()
        processed_dir = Path(processed_dir)
        for file in processed_dir.glob('*.json'):
            with open(file, 'r') as f:
                data = json.load(f)
                agent_name = data['config']['agent_name']
                benchmark_name = data['config']['benchmark_name']
                date = data['config']['date']
                config = data['config']

            try:
                raw_logging_results = pickle.dumps(data['raw_logging_results'])
                with self.get_conn() as conn:
                    conn.execute('''
                        INSERT OR REPLACE INTO preprocessed_traces
                        (benchmark_name, agent_name, date, run_id, raw_logging_results)
                        VALUES (?, ?, ?, ?, ?)
                    ''', (benchmark_name, agent_name, date, config['run_id'], raw_logging_results))
            except Exception as e:
                print(f"Error preprocessing raw_logging_results in {file}: {e}")

            try:
                failure_report = pickle.dumps(data['failure_report'])
                with self.get_conn() as conn:
                    conn.execute('''
                        INSERT INTO failure_reports
                        (benchmark_name, agent_name, date, run_id, failure_report)
                        VALUES (?, ?, ?, ?, ?)
                    ''', (benchmark_name, agent_name, date, config['run_id'], failure_report))
            except Exception as e:
                print(f"Error preprocessing failure_report in {file}: {e}")

            try:
                config = data['config']
                results = data['results']
                with self.get_conn() as conn:
                    conn.execute('''
                        INSERT INTO parsed_results
                        (benchmark_name, agent_name, date, run_id, successful_tasks, failed_tasks, total_cost, accuracy, precision, recall, f1_score, auc, overall_score, vectorization_score, fathomnet_score, feedback_score, house_price_score, spaceship_titanic_score, amp_parkinsons_disease_progression_prediction_score, cifar10_score, imdb_score)
                        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
                    ''', (
                        benchmark_name,
                        agent_name,
                        config['date'],
                        config['run_id'],
                        str(results.get('successful_tasks')),
                        str(results.get('failed_tasks')),
                        results.get('total_cost'),
                        results.get('accuracy'),
                        results.get('precision'),
                        results.get('recall'),
                        results.get('f1_score'),
                        results.get('auc'),
                        results.get('overall_score'),
                        results.get('vectorization_score'),
                        results.get('fathomnet_score'),
                        results.get('feedback_score'),
                        results.get('house-price_score'),
                        results.get('spaceship-titanic_score'),
                        results.get('amp-parkinsons-disease-progression-prediction_score'),
                        results.get('cifar10_score'),
                        results.get('imdb_score')
                    ))
            except Exception as e:
                print(f"Error preprocessing parsed results in {file}: {e}")

    @lru_cache(maxsize=100)
    def get_analyzed_traces(self, agent_name, benchmark_name):
        with self.get_conn() as conn:
            query = '''
                SELECT agent_name, raw_logging_results, date FROM preprocessed_traces
                WHERE benchmark_name = ? AND agent_name = ?
            '''
            df = pd.read_sql_query(query, conn, params=(benchmark_name, agent_name))

        # check for each row if raw_logging_results is not None with pickle.loads because it is stored as a byte string
        df = df[df['raw_logging_results'].apply(lambda x: pickle.loads(x) is not None and x != 'None')]

        if len(df) == 0:
            return None

        # select latest run
        df = df.sort_values('date', ascending=False).groupby('agent_name').first().reset_index()

        return pickle.loads(df['raw_logging_results'][0])

    @lru_cache(maxsize=100)
    def get_failure_report(self, agent_name, benchmark_name):
        with self.get_conn() as conn:
            query = '''
                SELECT agent_name, date, failure_report FROM failure_reports
                WHERE benchmark_name = ? AND agent_name = ?
            '''
            df = pd.read_sql_query(query, conn, params=(benchmark_name, agent_name))

        # Select only rows for which failure report is not None and None is a string
        df = df[df['failure_report'].apply(lambda x: pickle.loads(x) is not None and x != 'None')]

        if len(df) == 0:
            return None

        # if there is multiple failure reports, take the last one
        df = df.sort_values('date', ascending=False).groupby('agent_name').first().reset_index()

        # if there is a failure report, return the first one
        return pickle.loads(df['failure_report'][0])

    def _calculate_ci(self, data, confidence=0.95, type='minmax'):
        data = data[np.isfinite(data)]

        if len(data) < 2:
            return '', '', ''  # No CI for less than 2 samples
        n = len(data)

        mean = np.mean(data)

        if type == 't':
            sem = stats.sem(data)
            ci = stats.t.interval(confidence, n-1, loc=mean, scale=sem)

        elif type == 'minmax':
            min = np.min(data)
            max = np.max(data)
            ci = (min, max)
        return mean, ci[0], ci[1]

    def get_parsed_results(self, benchmark_name, aggregate=True):
        with self.get_conn() as conn:
            query = '''
                SELECT * FROM parsed_results
                WHERE benchmark_name = ?
                ORDER BY accuracy DESC
            '''
            df = pd.read_sql_query(query, conn, params=(benchmark_name,))

        # Load verified agents
        verified_agents = self.load_verified_agents()

        # Add 'Verified' column
        df['Verified'] = df.apply(lambda row: '✓' if (benchmark_name, row['agent_name']) in verified_agents else '', axis=1)

        # Add column for how many times an agent_name appears in the DataFrame
        df['Runs'] = df.groupby('agent_name')['agent_name'].transform('count')

        # Compute the 95% confidence interval for accuracy and cost for agents that have been run more than once
        df['acc_ci'] = None
        df['cost_ci'] = None

        for agent_name in df['agent_name'].unique():
            agent_df = df[df['agent_name'] == agent_name]

            if len(agent_df) > 1:
                accuracy_mean, accuracy_lower, accuracy_upper = self._calculate_ci(agent_df['accuracy'], type='minmax')
                cost_mean, cost_lower, cost_upper = self._calculate_ci(agent_df['total_cost'], type='minmax')

                # format the confidence interval with +/- sign
                # accuracy_ci = f"± {abs(accuracy_mean - accuracy_lower):.3f}"
                # cost_ci = f"± {abs(cost_mean - cost_lower):.3f}"

                accuracy_ci = f"-{abs(accuracy_mean - accuracy_lower):.3f}/+{abs(accuracy_mean - accuracy_upper):.3f}"
                cost_ci = f"-{abs(cost_mean - cost_lower):.3f}/+{abs(cost_mean - cost_upper):.3f}"

                df.loc[df['agent_name'] == agent_name, 'acc_ci'] = accuracy_ci
                df.loc[df['agent_name'] == agent_name, 'cost_ci'] = cost_ci

        df = df.drop(columns=['successful_tasks', 'failed_tasks', 'run_id'], axis=1)

        if aggregate:
            # For agents that have been run more than once, compute the average accuracy and cost and use that as the value in the DataFrame
            df = df.groupby('agent_name').agg({
                'date': 'first',
                'total_cost': 'mean',
                'accuracy': 'mean',
                'precision': 'mean',
                'recall': 'mean',
                'f1_score': 'mean',
                'auc': 'mean',
                'overall_score': 'mean',
                'vectorization_score': 'mean',
                'fathomnet_score': 'mean',
                'feedback_score': 'mean',
                'house_price_score': 'mean',
                'spaceship_titanic_score': 'mean',
                'amp_parkinsons_disease_progression_prediction_score': 'mean',
                'cifar10_score': 'mean',
                'imdb_score': 'mean',
                'Verified': 'first',
                'Runs': 'first',
                'acc_ci': 'first',
                'cost_ci': 'first'
            }).reset_index()

        # Round float columns to 3 decimal places
        float_columns = ['total_cost', 'accuracy', 'precision', 'recall', 'f1_score', 'auc', 'overall_score', 'vectorization_score', 'fathomnet_score', 'feedback_score', 'house-price_score', 'spaceship-titanic_score', 'amp-parkinsons-disease-progression-prediction_score', 'cifar10_score', 'imdb_score']
        for column in float_columns:
            if column in df.columns:
                df[column] = df[column].round(3)

        # sort by accuracy
        df = df.sort_values('accuracy', ascending=False)

        # Rename columns
        df = df.rename(columns={
            'agent_name': 'Agent Name',
            'date': 'Date',
            'total_cost': 'Total Cost',
            'accuracy': 'Accuracy',
            'precision': 'Precision',
            'recall': 'Recall',
            'f1_score': 'F1 Score',
            'auc': 'AUC',
            'overall_score': 'Overall Score',
            'vectorization_score': 'Vectorization Score',
            'fathomnet_score': 'Fathomnet Score',
            'feedback_score': 'Feedback Score',
            'house_price_score': 'House Price Score',
            'spaceship_titanic_score': 'Spaceship Titanic Score',
            'amp_parkinsons_disease_progression_prediction_score': 'AMP Parkinsons Disease Progression Prediction Score',
            'cifar10_score': 'CIFAR10 Score',
            'imdb_score': 'IMDB Score',
            'acc_ci': 'Accuracy CI',
            'cost_ci': 'Total Cost CI'
        })

        return df

    def get_task_success_data(self, benchmark_name):
        with self.get_conn() as conn:
            query = '''
                SELECT agent_name, accuracy, successful_tasks, failed_tasks
                FROM parsed_results
                WHERE benchmark_name = ?
            '''
            df = pd.read_sql_query(query, conn, params=(benchmark_name,))

        # for agent_names that have been run more than once, take the run with the highest accuracy
        df = df.sort_values('accuracy', ascending=False).groupby('agent_name').first().reset_index()

        # Get all unique task IDs
        task_ids = set()
        for tasks in df['successful_tasks']:
            if ast.literal_eval(tasks) is not None:
                task_ids.update(ast.literal_eval(tasks))
        for tasks in df['failed_tasks']:
            if ast.literal_eval(tasks) is not None:
                task_ids.update(ast.literal_eval(tasks))

        # Create a DataFrame with agent_name, task_ids, and success columns
        data_list = []
        for _, row in df.iterrows():
            agent_name = row['agent_name']
            for task_id in task_ids:
                success = 1 if task_id in row['successful_tasks'] else 0
                data_list.append({
                    'agent_name': agent_name,
                    'task_id': task_id,
                    'success': success
                })
        df = pd.DataFrame(data_list)

        df = df.rename(columns={
            'agent_name': 'Agent Name',
            'task_id': 'Task ID',
            'success': 'Success'
        })

        return df

    def load_verified_agents(self, file_path='verified_agents.yaml'):
        with open(file_path, 'r') as f:
            verified_data = yaml.safe_load(f)

        verified_agents = set()
        for benchmark, agents in verified_data.items():
            for agent in agents:
                verified_agents.add((benchmark, agent['agent_name']))

        return verified_agents

if __name__ == '__main__':
    preprocessor = TracePreprocessor()
    preprocessor.preprocess_traces()
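A usage sketch for TracePreprocessor, assuming processed upload files already sit in evals_live/ and that "usaco" is one of the benchmark_name values they carry (both are assumptions for illustration).

from utils.db import TracePreprocessor

pre = TracePreprocessor()                # backed by preprocessed_traces.db
pre.preprocess_traces("evals_live")      # ingest processed JSON uploads into SQLite
board = pre.get_parsed_results("usaco")  # aggregated, renamed leaderboard DataFrame
print(board[["Agent Name", "Accuracy", "Total Cost", "Runs"]].head())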
utils/pareto.py
ADDED
@@ -0,0 +1,38 @@
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass

@dataclass
class Agent:
    total_cost: float
    accuracy: float


def cross(point_o: Agent, point_a: Agent, point_b: Agent) -> int:
    return (point_a.total_cost - point_o.total_cost) * (point_b.accuracy - point_o.accuracy) - (point_a.accuracy - point_o.accuracy) * (point_b.total_cost - point_o.total_cost)

def compute_hull_side(points: list[Agent]) -> list[Agent]:
    hull: list[Agent] = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def is_pareto_efficient(others, candidate):
    for other in others:
        if (other.total_cost <= candidate.total_cost and other.accuracy >= candidate.accuracy) and \
           (other.total_cost < candidate.total_cost or other.accuracy > candidate.accuracy):
            return False
    return True

def compute_pareto_frontier(points: list[Agent]) -> list[Agent]:
    points = sorted(list(points), key=lambda p: (p.total_cost, p.accuracy))
    if len(points) <= 1:
        return points

    upper_convex_hull = compute_hull_side(list(reversed(points)))
    pareto_frontier = [agent for agent in upper_convex_hull if is_pareto_efficient(upper_convex_hull, agent)]

    return pareto_frontier
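A small, self-contained example of compute_pareto_frontier; the cost and accuracy numbers are arbitrary.

from utils.pareto import Agent, compute_pareto_frontier

agents = [
    Agent(total_cost=0.5, accuracy=0.30),
    Agent(total_cost=1.0, accuracy=0.45),
    Agent(total_cost=2.0, accuracy=0.40),  # dominated: costs more and scores less than the 1.0-cost agent
    Agent(total_cost=3.0, accuracy=0.55),
]
frontier = compute_pareto_frontier(agents)
# The 0.5-, 1.0- and 3.0-cost agents remain on the frontier; the dominated 2.0-cost agent is dropped.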
utils/processing.py
ADDED
@@ -0,0 +1,141 @@
import os
import json
import asyncio
import aiofiles
from agent_monitor.monitor import analyze_agent_steps
from agent_monitor.failure_report import analyze_agent_performance, AsyncOpenAIClient
import traceback
from tqdm import tqdm

async def check_and_process_uploads():
    upload_dir = "evals_upload"
    processed_dir = "evals_processed"
    live_dir = "evals_live"

    new_uploads = [f for f in os.listdir(upload_dir) if f.endswith('.json')]

    if not new_uploads:
        print("No new uploads found.")
        return

    # check for all new uploads whether they are already in live or processed directory
    # Also check whether the files are actually identical
    unprocessed_uploads = []
    for upload in new_uploads:
        upload_path = os.path.join(upload_dir, upload)
        processed_path = os.path.join(processed_dir, upload)
        live_path = os.path.join(live_dir, upload)

        if not os.path.exists(live_path) and not os.path.exists(processed_path):
            unprocessed_uploads.append(upload)
        elif os.path.exists(processed_path):
            # with open(upload_path, 'r') as f:
            #     new_data = json.load(f)

            # with open(processed_path, 'r') as f:
            #     processed_data = json.load(f)

            # TODO we can use a better comparison method with exact comparison
            # if new_data != processed_data:
            #     unprocessed_uploads.append(upload)
            print(f"Upload {upload} is already in processed directory.")
        elif os.path.exists(live_path):
            with open(upload_path, 'r') as f:
                new_data = json.load(f)

            with open(live_path, 'r') as f:
                live_data = json.load(f)

            # if new_data != live_data:
            #     unprocessed_uploads.append(upload)
            print(f"Upload {upload} is already in live directory.")
        else:
            unprocessed_uploads.append(upload)

    print(f"Processing {len(unprocessed_uploads)} new uploads.")
    tasks = []
    for upload in tqdm(unprocessed_uploads):
        upload_path = os.path.join(upload_dir, upload)
        processed_path = os.path.join(processed_dir, upload)
        # tasks.append(process_single_upload(upload_path, processed_path)) # for async processing
        await process_single_upload(upload_path, processed_path)


    # await asyncio.gather(*tasks) # for async processing


async def process_single_upload(upload_path, processed_path):
    # Check the structure of the upload
    check_result = await check_upload_structure(upload_path)

    if check_result['is_valid']:
        # Process the file
        await process_upload(upload_path, processed_path)

        # Move the file to processed directory
        # await asyncio.to_thread(shutil.move, upload_path, processed_path)

    else:
        print(f"Upload check failed for {upload_path}: {check_result['message']}")



async def check_upload_structure(file_path):
    try:
        async with aiofiles.open(file_path, 'r') as f:
            data = json.loads(await f.read())

        # Check for required keys
        required_keys = ['config', 'results', 'raw_eval_results', 'raw_logging_results']
        missing_keys = [key for key in required_keys if key not in data]

        if missing_keys:
            return {'is_valid': False, 'message': f"Missing required keys: {', '.join(missing_keys)}"}

        # Check for specific structure in raw_logging_results
        if not isinstance(data['raw_logging_results'], list):
            return {'is_valid': False, 'message': "raw_logging_results should be a list"}

        for item in data['raw_logging_results']:
            if not all(key in item for key in ['weave_task_id', 'inputs', 'outputs']):
                return {'is_valid': False, 'message': "Each item in raw_logging_results should have weave_task_id, inputs, and outputs"}

        return {'is_valid': True, 'message': "File structure is valid"}

    except json.JSONDecodeError:
        return {'is_valid': False, 'message': "Invalid JSON format"}
    except Exception as e:
        return {'is_valid': False, 'message': f"Unexpected error: {str(e)}"}


async def process_upload(input_path, output_path):
    print(f"Processing {input_path}...")
    # load the file
    with open(input_path, 'r') as f:
        data = json.loads(f.read())

    assert 'raw_logging_results' in data, "raw_logging_results key not found in the file"
    openai_client = AsyncOpenAIClient(model="gpt-4o-mini")

    try:
        processed_calls = await analyze_agent_steps(data['raw_logging_results'], openai_client, llm_eval=False)
        # failure_report = await analyze_agent_performance(data['raw_logging_results'], data['results']['failed_tasks'], openai_client)

        data['raw_logging_results'] = processed_calls
        data['failure_report'] = None
    except Exception as e:
        traceback.print_exc()
        print(f"Error in processing: {str(e)}")
        return

    with open(output_path, 'w') as f:
        json.dump(data, f, indent=4)

    print(f"Processing of {input_path} successful. Results saved to {output_path}")




if __name__ == "__main__":
    # process single upload for testing
    asyncio.run(process_single_upload("evals_upload/inspect_evalsswe_bench_1729538131_UPLOAD.json", "evals_processed/inspect_evalsswe_bench_1729538131_UPLOAD.json"))
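For reference, a minimal upload that would pass check_upload_structure above; only the key names are taken from the code, all values are placeholders.

# Hypothetical minimal upload; every value here is a placeholder.
minimal_upload = {
    "config": {"agent_name": "my-agent", "benchmark_name": "usaco",
               "date": "2024-08-01", "run_id": "run-001"},
    "results": {"accuracy": 0.5, "total_cost": 1.23,
                "successful_tasks": [], "failed_tasks": []},
    "raw_eval_results": {},
    "raw_logging_results": [
        {"weave_task_id": "task_1", "inputs": {"prompt": "..."}, "outputs": {"completion": "..."}},
    ],
}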
utils/viz.py
ADDED
@@ -0,0 +1,744 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import json
|
2 |
+
import plotly.express as px
|
3 |
+
from utils.pareto import Agent, compute_pareto_frontier
|
4 |
+
import plotly.graph_objects as go
|
5 |
+
import textwrap
|
6 |
+
import numpy as np
|
7 |
+
import pandas as pd
|
8 |
+
from scipy import stats
|
9 |
+
|
10 |
+
|
11 |
+
def create_leaderboard(df, ci_metrics = None):
|
12 |
+
# cast dtypes to string
|
13 |
+
df = df.astype(str)
|
14 |
+
|
15 |
+
# for each metric join metric and metric CI columns
|
16 |
+
if ci_metrics:
|
17 |
+
for metric in ci_metrics:
|
18 |
+
CI_metric = metric + ' CI'
|
19 |
+
# for rows in the df for which CI metric is not None, join the metric and CI columns by looping through the CI metrics columns
|
20 |
+
for i, row in df.iterrows():
|
21 |
+
if str(row[CI_metric]) != 'None':
|
22 |
+
df.at[i, metric] = str(row[metric]) + " (" + str(row[CI_metric]) + ")"
|
23 |
+
|
24 |
+
return df
|
25 |
+
|
def create_task_success_heatmap(df, benchmark_name):
    # Calculate agent accuracy
    agent_accuracy = df.groupby('Agent Name')['Success'].mean().sort_values(ascending=False)

    # Calculate task success rate
    task_success_rate = df.groupby('Task ID')['Success'].mean().sort_values(ascending=False)

    # Pivot the dataframe to create a matrix of agents vs tasks
    pivot_df = df.pivot(index='Agent Name', columns='Task ID', values='Success')

    # Sort the pivot table
    pivot_df = pivot_df.reindex(index=agent_accuracy.index, columns=task_success_rate.index)

    # Calculate tasks solved across all agents
    tasks_solved = (pivot_df.sum(axis=0) > 0).astype(int)
    # Total number of tasks (columns)
    total_tasks = len(pivot_df.columns)
    if 'SWE-bench' in benchmark_name:
        total_tasks = 50  # TODO - remove hardcoding

    # Append the "solved by any agent" summary row to the pivot table
    tasks_solved_df = pd.DataFrame(tasks_solved).T
    tasks_solved_df.index = [f'<b>Tasks Solved: {tasks_solved.sum()}/{total_tasks} (Any Agent)</b>']
    pivot_df = pd.concat([pivot_df, tasks_solved_df])

    num_agents = len(pivot_df.index)
    row_height = 30  # Fixed height for each row in pixels
    total_height = num_agents * row_height

    # Create a custom colorscale
    colorscale = [[0, 'white'], [1, '#3498db']]

    # Create the heatmap
    fig = go.Figure(data=go.Heatmap(
        z=pivot_df.values,
        y=pivot_df.index,
        x=pivot_df.columns,
        colorscale=colorscale,
        showscale=False,
        hovertemplate='<b>Agent:</b> %{y}<br>' +
                      '<b>Task:</b> %{x}<br>' +
                      '<b>Status:</b> %{z}<extra></extra>'
    ))

    # Update the layout
    fig.update_layout(
        xaxis_title='Task ID',
        height=total_height + 50,  # Add extra space for the summary row
        yaxis=dict(
            autorange='reversed',
            showticklabels=True,
            showline=True,
            linecolor='black',
            showgrid=False
        ),
        xaxis=dict(
            side='top',
            showticklabels=False,
            showline=True,
            linecolor='black',
            showgrid=False
        ),
        plot_bgcolor='white',
        paper_bgcolor='white',
        hoverlabel=dict(
            bgcolor="white",
            font_size=12,
            font_family="Arial"
        ),
        modebar=dict(
            activecolor='#1f77b4',
            orientation='h',
            bgcolor='rgba(255,255,255,0.8)',
            color='#777',
            add=['pan2d'],
            remove=[
                'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
                'hoverClosestCartesian', 'hoverCompareCartesian',
                'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
            ]
        ),
        dragmode='pan'
    )

    return fig

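
# Example input (sketch, hypothetical values): the heatmap expects one row per
# (agent, task) pair with a binary "Success" column, e.g.
#
#   runs = pd.DataFrame({"Agent Name": ["A", "A", "B", "B"],
#                        "Task ID": ["t1", "t2", "t1", "t2"],
#                        "Success": [1, 0, 1, 1]})
#   create_task_success_heatmap(runs, "USACO")
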
def create_bar_chart(categories, values, x_label, y_label, title):
    # Sort categories and values based on values in descending order
    sorted_data = sorted(zip(categories, values), key=lambda x: x[1], reverse=True)
    categories, values = zip(*sorted_data)

    # get total number of tasks
    total_tasks = sum(values)

    text_labels = [f"({value/total_tasks:.1%} of failures)" for value in values]

    fig = go.Figure(data=[go.Bar(
        y=categories,
        x=values,
        orientation='h',
        marker_color='#3498db',  # Same color as the scatter plot
        text=text_labels,
        textposition='auto',
        customdata=[f'{value} tasks ({value/total_tasks:.1%} of failures)' for value in values],
        textfont=dict(color='black', size=14, family='Arial', weight=2),
        hovertemplate='<b>%{y}</b><br>' +
                      'Affected Tasks: %{customdata}<extra></extra>'
    )])

    fig.update_layout(
        height=600,
        xaxis=dict(
            showline=True,
            linecolor='black',
            showgrid=False
        ),
        yaxis=dict(
            showline=True,
            linecolor='black',
            showgrid=False,
            autorange="reversed"  # This will put the category with the highest value at the top
        ),
        plot_bgcolor='white',
        paper_bgcolor='white',
        bargap=0.2,
        bargroupgap=0.1,
        hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
        modebar=dict(
            activecolor='#1f77b4',
            orientation='h',
            bgcolor='rgba(255,255,255,0.8)',
            color='#777',
            add=['pan2d'],
            remove=[
                'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
                'hoverClosestCartesian', 'hoverCompareCartesian',
                'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
            ]
        ),
        dragmode='pan'
    )

    return fig

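
# Example call (sketch, hypothetical failure categories): categories are failure
# categories and values are how many tasks fall into each; note that x_label, y_label
# and title are accepted but not currently applied to the layout.
#
#   create_bar_chart(["Tool error", "Timeout", "Wrong answer"], [12, 5, 30],
#                    x_label="Number of tasks", y_label="Failure category",
#                    title="Failure breakdown")
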
def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
    # agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
    # Instead of one Agent object per row, create one per unique agent and use the mean of its cost and accuracy values
    unique_agents = df['Agent Name'].unique()
    agents = [Agent(df[df['Agent Name'] == agent]['Total Cost'].mean(), df[df['Agent Name'] == agent]['Accuracy'].mean()) for agent in unique_agents]

    pareto_frontier = compute_pareto_frontier(agents)

    fig = go.Figure()

    # Sort the Pareto frontier points by x-coordinate
    pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
    # Add the Pareto frontier line
    fig.add_trace(go.Scatter(
        x=[point[0] for point in pareto_points],
        y=[point[1] for point in pareto_points],
        mode='lines',
        name='Pareto Frontier',
        hoverinfo='skip',  # suppress hover on the frontier line
        line=dict(color='black', width=1, dash='dash')
    ))

    # Plot scatter points and error bars for each agent
    unique_agents = df[hover_data[0]].unique()
    for agent in unique_agents:
        agent_data = df[df[hover_data[0]] == agent]

        x_value = [np.mean(agent_data[x].values)]
        y_value = [np.mean(agent_data[y].values)]

        if len(agent_data) > 1:
            # 95% confidence intervals are an alternative to the min-max error bars below:
            # ci_x = stats.t.interval(0.95, len(agent_data[x])-1, loc=np.mean(agent_data[x]), scale=stats.sem(agent_data[x]))
            # ci_y = stats.t.interval(0.95, len(agent_data[y])-1, loc=np.mean(agent_data[y]), scale=stats.sem(agent_data[y]))
            #
            # # Add error bars for x (cost)
            # fig.add_trace(go.Scatter(
            #     x=x_value,
            #     y=y_value,
            #     error_x=dict(
            #         type='data',
            #         symmetric=False,
            #         array=[ci_x[1] - x_value],
            #         arrayminus=[x_value - ci_x[0]],
            #         color='red',
            #     ),
            #     mode='markers',
            #     marker=dict(color='rgba(0,0,0,0)'),
            #     showlegend=False,
            #     hoverinfo='none'
            # ))
            #
            # # Add error bars for y (accuracy)
            # fig.add_trace(go.Scatter(
            #     x=x_value,
            #     y=y_value,
            #     error_y=dict(
            #         type='data',
            #         symmetric=False,
            #         array=[ci_y[1] - y_value],
            #         arrayminus=[y_value - ci_y[0]],
            #         color='green',
            #     ),
            #     mode='markers',
            #     marker=dict(color='rgba(0,0,0,0)'),
            #     showlegend=False,
            #     hoverinfo='none'
            # ))

            # Add error bars for x (cost min-max)
            fig.add_trace(go.Scatter(
                x=x_value,
                y=y_value,
                error_x=dict(
                    type='data',
                    symmetric=False,
                    array=[np.max(agent_data[x]) - x_value],
                    arrayminus=[x_value - np.min(agent_data[x])],
                    color='#fec44f',
                ),
                mode='markers',
                marker=dict(color='rgba(0,0,0,0)', opacity=0),
                showlegend=False,
                hoverinfo='skip'  # suppress hover on the invisible helper trace
            ))

            # Add error bars for y (accuracy min-max)
            fig.add_trace(go.Scatter(
                x=x_value,
                y=y_value,
                error_y=dict(
                    type='data',
                    symmetric=False,
                    array=[np.max(agent_data[y]) - y_value],
                    arrayminus=[y_value - np.min(agent_data[y])],
                    color='#bdbdbd',
                ),
                mode='markers',
                marker=dict(color='rgba(0,0,0,0)', opacity=0),
                showlegend=False,
                hoverinfo='skip'
            ))

        # Add scatter points for this agent
        fig.add_trace(go.Scatter(
            x=x_value,
            y=y_value,
            mode='markers',
            marker=dict(size=10, color='#3498db'),
            customdata=agent_data[hover_data],
            showlegend=False,
            hovertemplate="<br>".join([
                "<b>Agent</b>: %{customdata[0]}",
                "<b>Total Cost</b>: $%{x:.1f}",
                "<b>Accuracy</b>: %{y:.1%}<extra></extra>",
            ]),
            hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
        ))

    # Legend entries for the 95% CI error bars (unused alternative)
    # fig.add_trace(go.Scatter(
    #     x=[None], y=[None], mode='markers',
    #     marker=dict(color='red', size=10),
    #     name='Cost CI (95%)'
    # ))
    # fig.add_trace(go.Scatter(
    #     x=[None], y=[None], mode='markers',
    #     marker=dict(color='green', size=10),
    #     name='Accuracy CI (95%)'
    # ))

    # Add legend entries for the min-max error bars
    fig.add_trace(go.Scatter(
        x=[None], y=[None], mode='markers',
        marker=dict(color='#fec44f', size=10),
        name='Cost CI (Min-Max)'
    ))
    fig.add_trace(go.Scatter(
        x=[None], y=[None], mode='markers',
        marker=dict(color='#bdbdbd', size=10),
        name='Accuracy CI (Min-Max)'
    ))

    fig.update_layout(
        height=600,
        xaxis_title=x_label,
        yaxis_title=y_label,
        xaxis=dict(
            showline=True,
            linecolor='black',
            showgrid=False),
        yaxis=dict(
            showline=True,
            showgrid=False,
            linecolor='black'),
        plot_bgcolor='white',
        legend=dict(
            yanchor="bottom",
            y=0.01,
            xanchor="right",
            x=0.98,
            bgcolor="rgba(255, 255, 255, 0.5)"  # semi-transparent white background
        ),
        modebar=dict(
            activecolor='#1f77b4',  # Color of active tool
            orientation='h',  # Horizontal orientation
            bgcolor='rgba(255,255,255,0.8)',  # Slightly transparent white background
            color='#777',  # Color of inactive tools
            add=['pan2d'],
            remove=[
                'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
                'hoverClosestCartesian', 'hoverCompareCartesian',
                'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
            ]
        ),
        dragmode='pan'
    )

    fig.update_yaxes(rangemode="tozero")
    fig.update_xaxes(rangemode="tozero")

    return fig

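
# Example input (sketch, hypothetical runs): each row is one run of an agent; runs of the
# same agent are averaged, and their min-max spread is drawn as error bars.
#
#   runs = pd.DataFrame({"Agent Name": ["A", "A", "B"],
#                        "Total Cost": [1.0, 1.4, 0.3],
#                        "Accuracy": [0.55, 0.60, 0.35]})
#   create_scatter_plot(runs, "Total Cost", "Accuracy",
#                       x_label="Total Cost", y_label="Accuracy",
#                       hover_data=["Agent Name"])
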
# Earlier version: per-run scatter points with chi-square error ellipses (superseded by the
# implementation above; kept for reference).
# def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
#     agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
#     pareto_frontier = compute_pareto_frontier(agents)
#
#     fig = go.Figure()
#
#     # Function to generate points for error ellipse
#     def error_ellipse(x_center, y_center, x_radius, y_radius, angle, n=50):
#         t = np.linspace(0, 2*np.pi, n)
#         x = x_radius * np.cos(t)
#         y = y_radius * np.sin(t)
#         rotation = np.array([[np.cos(angle), -np.sin(angle)],
#                              [np.sin(angle), np.cos(angle)]])
#         xy = np.dot(rotation, np.array([x, y]))
#         return x_center + xy[0], y_center + xy[1]
#
#     # Create a color map for agents
#     unique_agents = df['Agent Name'].unique()
#     colors = px.colors.qualitative.Plotly
#     color_map = {agent: colors[i % len(colors)] for i, agent in enumerate(unique_agents)}
#
#     # Add scatter points and error ellipses for each agent
#     for agent in unique_agents:
#         agent_data = df[df['Agent Name'] == agent]
#
#         # Add scatter points
#         fig.add_trace(go.Scatter(
#             x=agent_data[x],
#             y=agent_data[y],
#             mode='markers',
#             name=agent,
#             marker=dict(size=10, color=color_map[agent]),
#             customdata=agent_data[hover_data] if hover_data else None,
#             hovertemplate="<br>".join([
#                 f"<b>Agent</b>: {agent}",
#                 f"<b>{x}</b>: ${{x:.1f}}",
#                 f"<b>{y}</b>: {{y:.1%}}",
#             ] + ([f"<b>{col}</b>: {{customdata[{i}]}}" for i, col in enumerate(hover_data)] if hover_data else []))
#         ))
#
#         # Calculate mean and standard deviation for x and y
#         x_mean = agent_data[x].mean()
#         y_mean = agent_data[y].mean()
#         x_std = agent_data[x].std()
#         y_std = agent_data[y].std()
#
#         # Calculate correlation coefficient
#         corr = agent_data[x].corr(agent_data[y])
#
#         # Add error ellipses (1 and 2 standard deviations)
#         for n_std, opacity in [(1, 0.5), (2, 0.5)]:
#             chi2_val = chi2.ppf(0.68 if n_std == 1 else 0.95, 2)
#             x_radius = np.sqrt(chi2_val) * x_std
#             y_radius = np.sqrt(chi2_val) * y_std
#             angle = np.arctan2(y_std * corr, x_std)
#
#             ellipse_x, ellipse_y = error_ellipse(x_mean, y_mean, x_radius, y_radius, angle)
#
#             fig.add_shape(type="path",
#                           path=f"M {ellipse_x[0]}, {ellipse_y[0]} " +
#                                " ".join([f"L{x},{y}" for x, y in zip(ellipse_x[1:], ellipse_y[1:])]) +
#                                " Z",
#                           line_color=color_map[agent],
#                           line_width=2,
#                           opacity=opacity,
#                           layer="below")
#
#     # Sort the Pareto frontier points by x-coordinate
#     pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
#
#     # Add the Pareto frontier line
#     fig.add_trace(go.Scatter(
#         x=[point[0] for point in pareto_points],
#         y=[point[1] for point in pareto_points],
#         mode='lines',
#         name='Pareto Frontier',
#         line=dict(color='black', width=1, dash='dash')
#     ))
#
#     fig.update_layout(
#         height=600,
#         xaxis_title=x_label,
#         yaxis_title=y_label,
#         xaxis=dict(
#             showline=True,
#             linecolor='black',
#             showgrid=False),
#         yaxis=dict(
#             showline=True,
#             showgrid=False,
#             linecolor='black'),
#         plot_bgcolor='white',
#         legend=dict(
#             yanchor="bottom",
#             y=0.01,
#             xanchor="right",
#             x=0.98,
#             bgcolor="rgba(255, 255, 255, 0.5)"
#         ),
#         modebar=dict(
#             activecolor='#1f77b4',
#             orientation='h',
#             bgcolor='rgba(255,255,255,0.8)',
#             color='#777',
#             add=['pan2d'],
#             remove=[
#                 'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
#                 'hoverClosestCartesian', 'hoverCompareCartesian',
#                 'toggleSpikelines', 'lasso2d', 'lasso',
#                 'select2d', 'select'
#             ]
#         ),
#         dragmode='pan'
#     )
#
#     fig.update_yaxes(rangemode="tozero")
#     fig.update_xaxes(rangemode="tozero")
#
#     return fig

# Earlier version: px.scatter-based plot without error bars (superseded; kept for reference).
# def create_scatter_plot(df, x: str, y: str, x_label: str = None, y_label: str = None, hover_data: list = None):
#     agents = [Agent(row['Total Cost'], row['Accuracy']) for i, row in df.iterrows()]
#     pareto_frontier = compute_pareto_frontier(agents)
#
#     fig = px.scatter(df,
#                      x=x,
#                      y=y,
#                      custom_data=hover_data)
#     fig.update_traces(
#         hovertemplate="<br>".join([
#             "<b>Agent</b>: %{customdata[0]}",
#             "<b>Total Cost</b>: $%{x:.1f}",
#             "<b>Accuracy</b>: %{y:.1%}",
#         ])
#     )
#
#     fig.update_traces(marker=dict(size=10, color='#3498db'),
#                       hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"))
#
#     # Sort the Pareto frontier points by x-coordinate
#     pareto_points = sorted([(agent.total_cost, agent.accuracy) for agent in pareto_frontier], key=lambda x: x[0])
#
#     # Add the Pareto frontier line
#     fig.add_trace(go.Scatter(
#         x=[point[0] for point in pareto_points],
#         y=[point[1] for point in pareto_points],
#         mode='lines',
#         name='Pareto Frontier',
#         line=dict(color='black', width=1, dash='dash')
#     ))
#
#     fig.update_layout(
#         # width = 1150,
#         height=600,
#         xaxis_title=x_label,
#         yaxis_title=y_label,
#         xaxis=dict(
#             showline=True,
#             linecolor='black',
#             showgrid=False),
#         yaxis=dict(
#             showline=True,
#             showgrid=False,
#             linecolor='black'),
#         plot_bgcolor='white',
#         # Legend positioning
#         legend=dict(
#             yanchor="bottom",
#             y=0.01,
#             xanchor="right",
#             x=0.98,
#             bgcolor="rgba(255, 255, 255, 0.5)"  # semi-transparent white background
#         ),
#         modebar=dict(
#             activecolor='#1f77b4',  # Color of active tool
#             orientation='h',  # Horizontal orientation
#             bgcolor='rgba(255,255,255,0.8)',  # Slightly transparent white background
#             color='#777',  # Color of inactive tools
#             add=['pan2d'],
#             remove=[
#                 'zoom2d', 'zoomIn2d', 'zoomOut2d', 'resetScale2d',
#                 'hoverClosestCartesian', 'hoverCompareCartesian',
#                 'toggleSpikelines', 'lasso2d', 'lasso', 'select2d', 'select'
#             ]
#         ),
#         dragmode='pan'
#     )
#
#     fig.update_yaxes(rangemode="tozero")
#     fig.update_xaxes(rangemode="tozero")
#
#     return fig

def create_flow_chart(steps):
    node_x = []
    node_y = []
    edge_x = []
    edge_y = []
    node_text = []
    hover_text = []
    node_colors = []
    node_shapes = []

    # Define color and shape mappings
    color_map = {True: 'green', False: 'red'}  # True for success, False for challenges
    shape_map = {
        'plan': 'octagon',
        'tool': 'square',
        'retrieve': 'diamond',
        'other': 'circle'
    }

    for i, step in enumerate(steps):
        node_x.append(i)
        node_y.append(0)

        # Extract description, assessment, and related attributes from the step analysis
        analysis = step['analysis']
        if isinstance(analysis, str):
            try:
                analysis = json.loads(analysis)
            except json.JSONDecodeError:
                analysis = {}

        description = analysis.get('description', 'No description available.')
        assessment = analysis.get('assessment', 'No assessment available.')
        success = analysis.get('success', True)  # Assuming True if not specified
        # action_type = analysis.get('action_type', 'other')  # Default to 'other' if not specified
        step_headline = analysis.get('headline', '')

        # Set node color and shape based on attributes
        node_colors.append(color_map[success])
        # node_shapes.append(shape_map.get(action_type, 'circle'))

        # Wrap text to improve readability
        wrapped_description = '<br>'.join(textwrap.wrap(description, width=90, max_lines=20))
        wrapped_assessment = '<br>'.join(textwrap.wrap(assessment, width=90, max_lines=10))
        wrapped_outline = textwrap.shorten(step_headline, width=50, placeholder='')
        wrapped_outline = '' if wrapped_outline == '' else f": {wrapped_outline}"

        node_text_outline = '' if wrapped_outline == '' else f":<br>{'<br>'.join(textwrap.wrap(step_headline, width=30, placeholder=''))}"
        node_text.append(f"Step {i+1}{node_text_outline}")

        # Create formatted hover text without indentation
        hover_info = (
            f"<b>Step {i+1}{wrapped_outline}</b><br><br>"
            f"<b>Description:</b><br>"
            f"{wrapped_description}<br><br>"
            # f"<b>Assessment:</b><br>"
            # f"{wrapped_assessment}<br><br>"
            # f"<b>Successful:</b> {'Yes' if success else 'No'}<br>"
            # f"<b>Action Type:</b> {action_type.capitalize()}"
        )
        hover_text.append(hover_info)

        if i > 0:
            edge_x.extend([i-1, i, None])
            edge_y.extend([0, 0, None])

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        text=node_text,
        textposition="top center",
        showlegend=False,
        hovertext=hover_text,
        hoverinfo='text',
        hoverlabel=dict(bgcolor="white", font_size=12, font_family="Arial"),
        marker=dict(
            # color=node_colors,
            color='#3498db',
            size=30,
            line_width=2,
            # symbol=node_shapes
        ))

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=2, color='#888'),
        hoverinfo='none',
        showlegend=False,
        mode='lines')

    # Create legend traces
    legend_traces = []

    # # Color legend
    # for success, color in color_map.items():
    #     legend_traces.append(go.Scatter(
    #         x=[None], y=[None],
    #         mode='markers',
    #         marker=dict(size=10, color=color),
    #         showlegend=True,
    #         name=f"{'Success' if success else 'Issue'}"
    #     ))

    # # Shape legend
    # for action, shape in shape_map.items():
    #     legend_traces.append(go.Scatter(
    #         x=[None], y=[None],
    #         mode='markers',
    #         marker=dict(size=10, symbol=shape, color='gray'),
    #         showlegend=True,
    #         name=f"{action.capitalize()}"
    #     ))

    # Combine all traces
    all_traces = [edge_trace, node_trace] + legend_traces

    layout = go.Layout(
        showlegend=True,
        hovermode='closest',
        margin=dict(b=20, l=5, r=5, t=40),
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        plot_bgcolor='white',
        paper_bgcolor='white',
        modebar=dict(
            activecolor='#1f77b4',  # Color of active tool
            orientation='h',  # Horizontal orientation
            bgcolor='rgba(255,255,255,0.8)',  # Slightly transparent white background
            color='#777',  # Color of inactive tools
        ),
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=0.02,
            xanchor="right",
            x=1,
            bgcolor='rgba(255,255,255,0.8)',
            bordercolor='rgba(0,0,0,0.1)',
            borderwidth=1
        ),
    )

    fig = go.Figure(data=all_traces, layout=layout)

    fig.update_layout(
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="right",
            x=1,
            bgcolor='rgba(255,255,255,0.8)',  # Set legend background to slightly transparent white
            bordercolor='rgba(0,0,0,0.1)',  # Add a light border to the legend
            borderwidth=1
        ),
        dragmode='pan'
    )

    config = {
        'add': ['pan2d'],
        'remove': [
            'zoom2d',
            'zoomIn2d',
            'zoomOut2d',
            'resetScale2d',
            'hoverClosestCartesian',
            'hoverCompareCartesian',
            'toggleSpikelines',
            'lasso2d',
            'lasso',
            'select2d',
            'select',
        ]
    }

    # Apply the modebar config to the figure
    fig.update_layout(modebar=config)

    return fig
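
# Example input (sketch, hypothetical step contents): each step carries an "analysis"
# blob (dict or JSON string) with at least a description and headline field.
#
#   steps = [
#       {"analysis": {"description": "Agent reads the task statement.",
#                     "headline": "Read task", "success": True}},
#       {"analysis": '{"description": "Agent submits a solution.", "headline": "Submit"}'},
#   ]
#   create_flow_chart(steps)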
verified_agents.yaml
ADDED
@@ -0,0 +1,91 @@
# This file contains information about verified agent results for different benchmarks.
# Format:
# benchmark_name:
#   - agent_name: "Name of the agent"
#     verification_date: YYYY-MM-DD

usaco:
  - agent_name: "USACO Reflexion + Episodic (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-20
  - agent_name: "USACO Reflexion + Episodic + Semantic (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-20
  - agent_name: "USACO Reflexion (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-20
  - agent_name: "USACO Episodic (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-12
  - agent_name: "USACO Reflexion + Semantic (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-20
  - agent_name: "USACO Zero-shot (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-11
  - agent_name: "USACO Semantic (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-12
  - agent_name: "USACO Reflexion + Episodic + Semantic (gpt-4o-2024-05-13)"
    verification_date: 2024-08-25
  - agent_name: "USACO Reflexion + Episodic (gpt-4o-2024-05-13)"
    verification_date: 2024-08-25
  - agent_name: "USACO Reflexion + Semantic (gpt-4o-2024-05-13)"
    verification_date: 2024-08-25
  - agent_name: "Episodic Retrial (2x) (gpt-4o-2024-05-13)"
    verification_date: 2024-08-25
  - agent_name: "Episodic Retrial (3x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-25
  - agent_name: "Episodic Retrial (2x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-25
  - agent_name: "Episodic Retrial (5x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-25
  - agent_name: "Episodic Warming (3 Steps) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-24
  - agent_name: "USACO Episodic (gpt-4o-2024-05-13)"
    verification_date: 2024-08-24
  - agent_name: "USACO Semantic (gpt-4o-2024-05-13)"
    verification_date: 2024-08-24
  - agent_name: "Zero-shot Retrial (2x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-24
  - agent_name: "Zero-shot Retrial (3x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-24
  - agent_name: "Zero-shot Retrial (5x) (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-24
  - agent_name: "USACO Zero-shot (gpt-4o-2024-05-13)"
    verification_date: 2024-08-24

swebench_verified:
  - agent_name: "Agentless (gpt-4o-mini-2024-07-18) (50 Instances)"
    verification_date: 2024-08-17
  - agent_name: "SWE-agent (gpt-4o-mini-2024-07-18) (Cost Limit: $1) (50 Instances)"
    verification_date: 2024-08-19

mlagentbench:
  - agent_name: "MLAgentBench ResearchAgent (gpt-4o-mini-2024-07-18)"
    verification_date: 2024-08-19

corebench_easy:
  - agent_name: "AutoGPT (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "AutoGPT (GPT-4o-mini)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o-mini)"
    verification_date: 2024-09-28

corebench_medium:
  - agent_name: "AutoGPT (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "AutoGPT (GPT-4o-mini)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o-mini)"
    verification_date: 2024-09-28

corebench_hard:
  - agent_name: "AutoGPT (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "AutoGPT (GPT-4o-mini)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o)"
    verification_date: 2024-09-28
  - agent_name: "CORE-Agent (GPT-4o-mini)"
    verification_date: 2024-09-28
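
# Consumption sketch (illustrative only, assuming the app reads this file with PyYAML):
#
#   import yaml
#   with open("verified_agents.yaml") as f:
#       verified = yaml.safe_load(f) or {}
#   usaco_verified = {entry["agent_name"] for entry in verified.get("usaco", [])}
#   "USACO Zero-shot (gpt-4o-2024-05-13)" in usaco_verified  # -> True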