simonrosenberg1 commited on
Commit
bef7ade
·
verified ·
1 Parent(s): 7949205

Show ACP agent results in the leaderboard

Browse files

## Summary

The HF Space currently only loads `results/{model}/` (default OpenHands runs).
The ACP runs (`acp-claude`, `acp-codex`, `acp-gemini`, `openhands_subagents`)
live in `alternative_agents/{type}/{model}/` in the openhands-index-results
repo and were never loaded into the dataframe, so the website silently dropped
them. After OpenHands/openhands-index-results#820–#829 + #830, all the ACP
Claude Code data from the master table in OpenHands/benchmarks#576 is in
the canonical location, but the leaderboard still doesn't show it.

This PR teaches the loader to ingest `alternative_agents/` and adds an
**Agent** column to the leaderboard so OpenHands vs Claude Code vs Codex
vs Gemini CLI are visible at a glance.

## Changes

- **`setup_data.py`** — copy `alternative_agents/` alongside `results/` when fetching the index repo, so all submissions land in the data dir.
- **`simple_data_loader.py`**:
- Factor per-directory loading into `_records_from_agent_dir` and have `_load_from_agent_dirs` walk both `results/` and `alternative_agents/{type}/{model}/`.
- Default `agent_name` per `agent_type` (Claude Code / Codex / Gemini CLI / OpenHands Sub-agents), matching the `AGENT_NAME_BY_TYPE` map in `push_to_index_from_archive.py` in `OpenHands/evaluation`.
- Include `agent_name` in `agent_id` (`name_version_model`) so an OpenHands run and a Claude Code run on the same SDK version + model don't collide into one row.
- Surface `agent_name` on the transformed record.
- **`leaderboard_transformer.py`**:
- Map `agent_name` → "Agent" in `_pretty_column_name`.
- Insert "Agent" into `base_cols` between `id` and `Language Model`.

## Local verification

Cloned the latest `openhands-index-results` and pointed the loader at it.
The loader now returns 29 rows: 24 OpenHands + 2 Claude Code + 1 Codex + 2
OpenHands Sub-agents. The new Claude Code rows match the master table in
OpenHands/benchmarks#576:

```
Claude Code / claude-opus-4-6: swebench 74.4 swtbench 66.7 gaia 66.1 commit0 50.0 swe-bench-multimodal 32.4
Claude Code / claude-sonnet-4-5: swebench 74.4 swtbench 69.3 gaia 63.0 commit0 31.2 swe-bench-multimodal 35.3
```

## Test plan

- [ ] Reviewer: load the Space preview built from this PR, confirm the leaderboard table now has an **Agent** column and shows Claude Code / Codex / OpenHands Sub-agents rows.
- [ ] Confirm the existing OpenHands rows look unchanged (same scores, no missing entries).

Files changed (3) hide show
  1. leaderboard_transformer.py +6 -1
  2. setup_data.py +17 -5
  3. simple_data_loader.py +120 -56
leaderboard_transformer.py CHANGED
@@ -655,6 +655,7 @@ def _pretty_column_name(raw_col: str) -> str:
655
  # Case 1: Handle fixed, special-case mappings first.
656
  fixed_mappings = {
657
  'id': 'id',
 
658
  'SDK version': 'SDK Version',
659
  'Openhands version': 'SDK Version', # Legacy support
660
  'Language model': 'Language Model',
@@ -815,7 +816,11 @@ class DataTransformer:
815
  df_view = df_sorted.copy()
816
 
817
  # --- 3. Add Columns for Agent Openness ---
818
- base_cols = ["id","Language Model","SDK Version","Source"]
 
 
 
 
819
  new_cols = ["Openness"]
820
  ending_cols = ["Date", "Logs", "Visualization"]
821
 
 
655
  # Case 1: Handle fixed, special-case mappings first.
656
  fixed_mappings = {
657
  'id': 'id',
658
+ 'agent_name': 'Agent',
659
  'SDK version': 'SDK Version',
660
  'Openhands version': 'SDK Version', # Legacy support
661
  'Language model': 'Language Model',
 
816
  df_view = df_sorted.copy()
817
 
818
  # --- 3. Add Columns for Agent Openness ---
819
+ # "Agent" sits between id and Language Model so OpenHands vs
820
+ # alternative agents (Claude Code / Codex / Gemini CLI) are visible
821
+ # at a glance, and so the same model with two different agents
822
+ # doesn't look like a duplicate row.
823
+ base_cols = ["id", "Agent", "Language Model", "SDK Version", "Source"]
824
  new_cols = ["Openness"]
825
  ending_cols = ["Date", "Logs", "Visualization"]
826
 
setup_data.py CHANGED
@@ -70,27 +70,39 @@ def fetch_data_from_github():
70
 
71
  # Look for data files in the cloned repository
72
  results_source = clone_dir / "results"
73
-
74
  if not results_source.exists():
75
  print(f"Results directory not found in repository")
76
  return False
77
-
78
  # Check if there are any agent result directories
79
  result_dirs = list(results_source.iterdir())
80
  if not result_dirs:
81
  print(f"No agent results found in {results_source}")
82
  return False
83
-
84
  print(f"Found {len(result_dirs)} agent result directories")
85
-
86
  # Create target directory and copy the results structure
87
  os.makedirs(target_dir.parent, exist_ok=True)
88
  if target_dir.exists():
89
  shutil.rmtree(target_dir)
90
-
91
  # Copy the entire results directory
92
  target_results = target_dir / "results"
93
  shutil.copytree(results_source, target_results)
 
 
 
 
 
 
 
 
 
 
 
 
94
 
95
  print(f"Successfully fetched data from GitHub. Files: {list(target_dir.glob('*'))}")
96
 
 
70
 
71
  # Look for data files in the cloned repository
72
  results_source = clone_dir / "results"
73
+
74
  if not results_source.exists():
75
  print(f"Results directory not found in repository")
76
  return False
77
+
78
  # Check if there are any agent result directories
79
  result_dirs = list(results_source.iterdir())
80
  if not result_dirs:
81
  print(f"No agent results found in {results_source}")
82
  return False
83
+
84
  print(f"Found {len(result_dirs)} agent result directories")
85
+
86
  # Create target directory and copy the results structure
87
  os.makedirs(target_dir.parent, exist_ok=True)
88
  if target_dir.exists():
89
  shutil.rmtree(target_dir)
90
+
91
  # Copy the entire results directory
92
  target_results = target_dir / "results"
93
  shutil.copytree(results_source, target_results)
94
+
95
+ # Also copy alternative_agents/ if present, so the loader can pick up
96
+ # ACP runs (acp-claude, acp-codex, acp-gemini, openhands_subagents, ...)
97
+ # alongside the default OpenHands agent results.
98
+ alt_source = clone_dir / "alternative_agents"
99
+ if alt_source.exists():
100
+ alt_target = target_dir / "alternative_agents"
101
+ shutil.copytree(alt_source, alt_target)
102
+ agent_types = sorted(p.name for p in alt_source.iterdir() if p.is_dir())
103
+ print(f"Found alternative agent types: {agent_types}")
104
+ else:
105
+ print("No alternative_agents/ directory in repository (skipping)")
106
 
107
  print(f"Successfully fetched data from GitHub. Files: {list(target_dir.glob('*'))}")
108
 
simple_data_loader.py CHANGED
@@ -127,55 +127,109 @@ class SimpleLeaderboardViewer:
127
  if benchmark not in self.tag_map[category]:
128
  self.tag_map[category].append(benchmark)
129
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
130
  def _load_from_agent_dirs(self):
131
- """Load data from new agent-centric directory structure (results/YYYYMMDD_model/)."""
132
- results_dir = self.config_path / "results"
133
-
134
- if not results_dir.exists():
135
- return None # Fall back to old format
136
-
 
 
137
  all_records = []
138
  all_validation_errors = []
139
-
140
- # Iterate through each agent directory
141
- for agent_dir in results_dir.iterdir():
142
- if not agent_dir.is_dir():
143
- continue
144
-
145
- # Load and validate using pydantic models
146
- metadata, scores, errors = load_and_validate_agent_data(agent_dir)
147
-
148
- if errors:
149
  all_validation_errors.extend(errors)
150
-
151
- if metadata is None or scores is None:
152
- continue
153
-
154
- # Skip entries that are hidden from the leaderboard
155
- if metadata.get('hide_from_leaderboard', False):
156
- logger.info(f"Skipping {agent_dir.name}: hide_from_leaderboard is True")
157
- continue
158
-
159
- # Create one record per benchmark (mimicking old JSONL format)
160
- for score_entry in scores:
161
- record = {
162
- 'agent_version': metadata.get('agent_version', 'Unknown'),
163
- 'llm_base': metadata.get('model', 'unknown'),
164
- 'openness': metadata.get('openness', 'unknown'),
165
- 'submission_time': score_entry.get('submission_time', metadata.get('submission_time', '')),
166
- 'release_date': metadata.get('release_date', ''), # Model release date
167
- 'parameter_count_b': metadata.get('parameter_count_b'), # Total params in billions
168
- 'active_parameter_count_b': metadata.get('active_parameter_count_b'), # Active params for MoE
169
- 'score': score_entry.get('score'),
170
- 'metric': score_entry.get('metric', 'unknown'),
171
- 'cost_per_instance': score_entry.get('cost_per_instance'),
172
- 'average_runtime': score_entry.get('average_runtime'),
173
- 'tags': [score_entry.get('benchmark')],
174
- 'full_archive': score_entry.get('full_archive', ''), # Download URL for trajectories
175
- 'eval_visualization_page': score_entry.get('eval_visualization_page', ''), # Laminar visualization URL
176
- }
177
- all_records.append(record)
178
-
179
  # Log validation errors if any
180
  if all_validation_errors:
181
  logger.warning(f"Schema validation errors ({len(all_validation_errors)} total):")
@@ -183,10 +237,10 @@ class SimpleLeaderboardViewer:
183
  logger.warning(f" - {error}")
184
  if len(all_validation_errors) > 5:
185
  logger.warning(f" ... and {len(all_validation_errors) - 5} more")
186
-
187
  if not all_records:
188
- return None # Fall back to old format
189
-
190
  return pd.DataFrame(all_records)
191
 
192
  def _load(self):
@@ -206,26 +260,36 @@ class SimpleLeaderboardViewer:
206
  # Group by agent (version + model combination) to aggregate results across datasets
207
  transformed_records = []
208
 
209
- # Create a unique identifier for each agent (version + model)
210
- df['agent_id'] = df['agent_version'] + '_' + df['llm_base']
211
-
 
 
 
 
 
 
 
 
212
  for agent_id in df['agent_id'].unique():
213
  agent_records = df[df['agent_id'] == agent_id]
214
-
215
  # Build a single record for this agent
216
  first_record = agent_records.iloc[0]
217
  agent_version = first_record['agent_version']
218
-
 
219
  # Normalize openness to "open" or "closed"
220
  from aliases import OPENNESS_MAPPING
221
  raw_openness = first_record['openness']
222
  normalized_openness = OPENNESS_MAPPING.get(raw_openness, raw_openness)
223
-
224
  # All 5 categories for the leaderboard
225
  ALL_CATEGORIES = ['Issue Resolution', 'Frontend', 'Greenfield', 'Testing', 'Information Gathering']
226
-
227
  record = {
228
  # Core agent info - use final display names
 
229
  'SDK version': agent_version, # Will become "SDK Version"
230
  'Language model': first_record['llm_base'], # Will become "Language Model"
231
  'openness': normalized_openness, # Will become "Openness" (simplified to "open" or "closed")
@@ -235,7 +299,7 @@ class SimpleLeaderboardViewer:
235
  'parameter_count_b': first_record.get('parameter_count_b'), # Total params in billions
236
  'active_parameter_count_b': first_record.get('active_parameter_count_b'), # Active params for MoE
237
  # Additional columns expected by the transformer
238
- # Use agent_id (version_model) as unique identifier for Pareto frontier calculation
239
  'id': agent_id,
240
  'source': first_record.get('source', ''), # Will become "Source"
241
  'logs': first_record.get('logs', ''), # Will become "Logs"
 
127
  if benchmark not in self.tag_map[category]:
128
  self.tag_map[category].append(benchmark)
129
 
130
+ # Default agent_name when metadata.json doesn't carry one. Matches the
131
+ # default-agent value used by push_to_index_from_archive.py so legacy
132
+ # entries (which omit the field) still group cleanly with new entries.
133
+ DEFAULT_AGENT_NAME = "OpenHands"
134
+
135
+ def _records_from_agent_dir(self, agent_dir: Path, default_agent_name: str | None = None) -> tuple[list[dict], list[str]]:
136
+ """Build per-benchmark records from a single agent directory.
137
+
138
+ Shared by ``_load_from_agent_dirs`` (default OpenHands results) and
139
+ ``_load_from_alternative_agents_dirs`` (acp-claude / acp-codex / etc.).
140
+ Returns ``(records, validation_errors)``. Returns an empty list of
141
+ records when the directory has no scores or is hidden from the
142
+ leaderboard.
143
+ """
144
+ records: list[dict] = []
145
+ metadata, scores, errors = load_and_validate_agent_data(agent_dir)
146
+
147
+ if metadata is None or scores is None:
148
+ return records, errors
149
+
150
+ if metadata.get('hide_from_leaderboard', False):
151
+ logger.info(f"Skipping {agent_dir.name}: hide_from_leaderboard is True")
152
+ return records, errors
153
+
154
+ # Resolve the agent display name. Prefer the value stamped into
155
+ # metadata.json by push-to-index; fall back to the directory's
156
+ # default (e.g. "Claude Code" for acp-claude/) and finally to
157
+ # "OpenHands" for legacy results/ entries that predate the field.
158
+ agent_name = (
159
+ metadata.get('agent_name')
160
+ or default_agent_name
161
+ or self.DEFAULT_AGENT_NAME
162
+ )
163
+
164
+ for score_entry in scores:
165
+ record = {
166
+ 'agent_name': agent_name,
167
+ 'agent_version': metadata.get('agent_version', 'Unknown'),
168
+ 'llm_base': metadata.get('model', 'unknown'),
169
+ 'openness': metadata.get('openness', 'unknown'),
170
+ 'submission_time': score_entry.get('submission_time', metadata.get('submission_time', '')),
171
+ 'release_date': metadata.get('release_date', ''),
172
+ 'parameter_count_b': metadata.get('parameter_count_b'),
173
+ 'active_parameter_count_b': metadata.get('active_parameter_count_b'),
174
+ 'score': score_entry.get('score'),
175
+ 'metric': score_entry.get('metric', 'unknown'),
176
+ 'cost_per_instance': score_entry.get('cost_per_instance'),
177
+ 'average_runtime': score_entry.get('average_runtime'),
178
+ 'tags': [score_entry.get('benchmark')],
179
+ 'full_archive': score_entry.get('full_archive', ''),
180
+ 'eval_visualization_page': score_entry.get('eval_visualization_page', ''),
181
+ }
182
+ records.append(record)
183
+ return records, errors
184
+
185
  def _load_from_agent_dirs(self):
186
+ """Load default-agent results plus any alternative_agents/ entries.
187
+
188
+ Reads ``{config}/results/{model}/`` for default OpenHands runs and
189
+ ``{config}/alternative_agents/{type}/{model}/`` for ACP agent runs
190
+ (acp-claude, acp-codex, acp-gemini, ...) so they all surface in the
191
+ same leaderboard. Returns ``None`` if neither directory yields any
192
+ records (which makes the caller render an empty-state placeholder).
193
+ """
194
  all_records = []
195
  all_validation_errors = []
196
+
197
+ # 1. Default OpenHands agent results
198
+ results_dir = self.config_path / "results"
199
+ if results_dir.exists():
200
+ for agent_dir in results_dir.iterdir():
201
+ if not agent_dir.is_dir():
202
+ continue
203
+ records, errors = self._records_from_agent_dir(agent_dir)
204
+ all_records.extend(records)
 
205
  all_validation_errors.extend(errors)
206
+
207
+ # 2. Alternative agents (one subdirectory per agent_type, then per model)
208
+ # Default agent_name per agent_type matches the AGENT_NAME_BY_TYPE map
209
+ # in OpenHands/evaluation push_to_index_from_archive.py — keeping it
210
+ # in sync ensures rows are labelled the same way the index repo
211
+ # records them.
212
+ agent_type_default_name = {
213
+ 'acp-claude': 'Claude Code',
214
+ 'acp-codex': 'Codex',
215
+ 'acp-gemini': 'Gemini CLI',
216
+ 'openhands_subagents': 'OpenHands Sub-agents',
217
+ }
218
+ alt_dir = self.config_path / "alternative_agents"
219
+ if alt_dir.exists():
220
+ for type_dir in alt_dir.iterdir():
221
+ if not type_dir.is_dir():
222
+ continue
223
+ default_name = agent_type_default_name.get(type_dir.name)
224
+ for agent_dir in type_dir.iterdir():
225
+ if not agent_dir.is_dir():
226
+ continue
227
+ records, errors = self._records_from_agent_dir(
228
+ agent_dir, default_agent_name=default_name
229
+ )
230
+ all_records.extend(records)
231
+ all_validation_errors.extend(errors)
232
+
 
 
233
  # Log validation errors if any
234
  if all_validation_errors:
235
  logger.warning(f"Schema validation errors ({len(all_validation_errors)} total):")
 
237
  logger.warning(f" - {error}")
238
  if len(all_validation_errors) > 5:
239
  logger.warning(f" ... and {len(all_validation_errors) - 5} more")
240
+
241
  if not all_records:
242
+ return None # Caller will render empty-state placeholder
243
+
244
  return pd.DataFrame(all_records)
245
 
246
  def _load(self):
 
260
  # Group by agent (version + model combination) to aggregate results across datasets
261
  transformed_records = []
262
 
263
+ # Create a unique identifier per (agent_name, agent_version, model)
264
+ # tuple. Including agent_name keeps an OpenHands run and a Claude
265
+ # Code run on the same SDK version + model from collapsing into
266
+ # one row when both submit to the leaderboard.
267
+ df['agent_name'] = df['agent_name'].fillna(self.DEFAULT_AGENT_NAME)
268
+ df['agent_id'] = (
269
+ df['agent_name'].astype(str)
270
+ + '_' + df['agent_version'].astype(str)
271
+ + '_' + df['llm_base'].astype(str)
272
+ )
273
+
274
  for agent_id in df['agent_id'].unique():
275
  agent_records = df[df['agent_id'] == agent_id]
276
+
277
  # Build a single record for this agent
278
  first_record = agent_records.iloc[0]
279
  agent_version = first_record['agent_version']
280
+ agent_name = first_record['agent_name']
281
+
282
  # Normalize openness to "open" or "closed"
283
  from aliases import OPENNESS_MAPPING
284
  raw_openness = first_record['openness']
285
  normalized_openness = OPENNESS_MAPPING.get(raw_openness, raw_openness)
286
+
287
  # All 5 categories for the leaderboard
288
  ALL_CATEGORIES = ['Issue Resolution', 'Frontend', 'Greenfield', 'Testing', 'Information Gathering']
289
+
290
  record = {
291
  # Core agent info - use final display names
292
+ 'agent_name': agent_name, # Will become "Agent"
293
  'SDK version': agent_version, # Will become "SDK Version"
294
  'Language model': first_record['llm_base'], # Will become "Language Model"
295
  'openness': normalized_openness, # Will become "Openness" (simplified to "open" or "closed")
 
299
  'parameter_count_b': first_record.get('parameter_count_b'), # Total params in billions
300
  'active_parameter_count_b': first_record.get('active_parameter_count_b'), # Active params for MoE
301
  # Additional columns expected by the transformer
302
+ # Use agent_id (name_version_model) as unique identifier for Pareto frontier calculation
303
  'id': agent_id,
304
  'source': first_record.get('source', ''), # Will become "Source"
305
  'logs': first_record.get('logs', ''), # Will become "Logs"