Zhu Jiajun (jz28583) and Claude Opus 4.7 (1M context) committed on
Commit d094faf · 1 Parent(s): ad6901d

Add agents/ harness integrations and HF Space scoring deployment

- agents/cliproxyapi: reusable shim that points any agent's SDK at one
CLIProxyAPI proxy via anthropic_env / openai_env / gemini_env helpers.
- agents/{ai_build_ai,mlevolve}: runners that stage GraphTestbed task
data, route LLM calls through the proxy, and harvest submission CSVs.
Tested end-to-end on figraph; both scored on the leaderboard
(aibuildai-claude-sonnet-4-6 0.819, mlevolve-gpt-5.3-codex-spark 0.790).
- agents/common: shared workspace + task-instruction + finalize helpers.

- server/space/: Docker SDK Space deployment. The boot orchestrator in
space_entry.py snapshot-downloads the GT files + leaderboard.db from the
companion private dataset (lanczos/graphtestbed-gt) on startup, then
runs a daemon thread that backs up the sqlite DB + new submission CSVs
every 60s via huggingface_hub.upload_file/upload_folder.
- server/api.py: optional GT_ARCHIVE_DIR env writes raw submission CSVs
to disk so the backup loop can ship them to the dataset repo.
- graphtestbed/{submit,leaderboard}.py: default GRAPHTESTBED_API flipped
to the hosted Space URL (env var still overrides for self-hosters).
- pyproject.toml: dependencies were misplaced under [project.urls];
moved to [project] so pip install -e . actually resolves deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
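
The `server/space/` boot-and-backup flow described above can be pictured roughly as follows. This is a hedged sketch, not the committed `space_entry.py` (which this diff view does not display): the dataset repo id comes from the message, while the paths and names such as `backup_loop` are illustrative assumptions.

```python
# Illustrative sketch of the Space boot + backup loop the commit message
# describes. The real server/space/space_entry.py is not shown in this diff,
# so the paths and function names here are assumptions.
import threading
import time
from pathlib import Path

from huggingface_hub import snapshot_download, upload_file, upload_folder

GT_REPO = "lanczos/graphtestbed-gt"   # companion private dataset (from the message)
STATE = Path("/var/graphtestbed")     # assumed container state dir

def boot() -> None:
    # Pull GT files + the previous leaderboard.db into the container on startup.
    snapshot_download(repo_id=GT_REPO, repo_type="dataset", local_dir=STATE)

def backup_loop(every_s: int = 60) -> None:
    while True:
        time.sleep(every_s)
        # Back up the sqlite leaderboard...
        upload_file(
            path_or_fileobj=STATE / "leaderboard.db",
            path_in_repo="leaderboard.db",
            repo_id=GT_REPO,
            repo_type="dataset",
        )
        # ...and any raw submission CSVs archived by server/api.py
        # (see the GT_ARCHIVE_DIR change later in this commit).
        upload_folder(
            folder_path=STATE / "archive",
            path_in_repo="archive",
            repo_id=GT_REPO,
            repo_type="dataset",
        )

boot()
threading.Thread(target=backup_loop, daemon=True).start()
```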

.gitignore CHANGED
@@ -25,3 +25,8 @@ ground_truth*.csv
  *test_labels*.csv
  private/
  **/private/
+
+ # Agent harness scratch space
+ runs/
+ agents/**/runs/
+ agents/**/_vendor/
README.md CHANGED
@@ -10,31 +10,41 @@ Build an agent. Submit predictions. Get a score. Test labels live on a server, n
 
  ## Status
 
- **Pre-launch.** The code runs end-to-end against a local server, but:
+ **Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:
 
  - The package isn't on PyPI yet → install from git (see below)
- - The hosted scoring API isn't deployed yet → run the server on your own machine
  - HuggingFace dataset repos aren't published yet → use your own `train/val/test_features.csv` files for now
 
- The two paths below — **local dev** (works today) and **hosted submit** (coming) — share the exact same client and server code.
+ The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
+ default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
+ server if you'd rather self-host (instructions below).
 
- ## Run it locally (works today)
+ ## Submit to the hosted leaderboard
 
  ```bash
- # 1. Install
  pip install git+https://github.com/zhuconv/GraphTestbed
+ gtb submit figraph --file preds.csv --agent my-agent-v1
+ # ✓ Scored  primary (auc_roc): 0.689  rank: #3
+ gtb leaderboard figraph
+ ```
 
- # 2. Start the scoring API (terminal A)
+ The hosted server is a Docker-SDK HF Space that holds GT files in a private
+ companion dataset and never logs prediction CSVs (it does archive them in
+ the same private repo for reproducibility — see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
+ Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
+ 3 decimals — same as if you ran the server yourself.
+
+ ## Run the server locally (alternative)
+
+ ```bash
  git clone https://github.com/zhuconv/GraphTestbed
  cd GraphTestbed
  GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
  # → Running on http://localhost:8080
 
- # 3. Submit (terminal B)
+ # point the client at it
  export GRAPHTESTBED_API=http://localhost:8080
  gtb submit figraph --file preds.csv --agent my-agent-v1
- # ✓ Scored  primary (auc_roc): 0.689  rank: #3
- gtb leaderboard figraph
  ```
 
  You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.
@@ -175,6 +185,33 @@ You don't modify GraphTestbed. You:
  That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
  </details>
 
+ ## Reference agent integrations (`agents/`)
+
+ Two third-party harnesses ship pre-wired to the testbed; both route LLM
+ traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):
+
+ | package | upstream | default model |
+ | --- | --- | --- |
+ | [`agents.ai_build_ai`](agents/ai_build_ai/README.md) | [aibuildai/AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) | `claude-sonnet-4-6` |
+ | [`agents.mlevolve`](agents/mlevolve/README.md) | [InternScience/MLEvolve](https://github.com/InternScience/MLEvolve) | `gpt-5.3-codex-spark` |
+
+ The proxy integration itself is generic — see
+ [`agents/cliproxyapi/README.md`](agents/cliproxyapi/README.md) for the
+ shim helpers (`anthropic_env` / `openai_env` / `gemini_env` /
+ `openai_yaml_block`) that any future agent can reuse.
+
+ ```bash
+ # One-time
+ export CLIPROXYAPI_KEY=<from the api-keys list in your ~/.cli-proxy-api/config.yaml>
+ bash agents/ai_build_ai/install.sh   # or agents/mlevolve/install.sh
+
+ # Per task
+ gtb fetch figraph
+ python -m agents.ai_build_ai.runner --task figraph
+ # → prints path to runs/ai_build_ai/figraph/<ts>/submission.csv
+ gtb submit figraph --file <printed-path> --agent aibuildai-sonnet-4-6
+ ```
+
  ## License
 
  [MIT](LICENSE). Data: subject to upstream licenses (Kaggle competition rules, FiGraph CC BY-NC 4.0, etc.).
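
The guardrails named in the README's trust-model sentence (5 submissions/day/IP/task, scores bucketed to 3 decimals) can be pictured with a minimal sketch. The real checks live in `server/api.py`, which this hunk does not show; the names below are assumptions, not the committed code.

```python
# Hedged sketch of the README's stated guardrails; not the actual server code.
from collections import defaultdict
from datetime import date

# (ip, task, day) → submission count so far
_counts: dict[tuple[str, str, str], int] = defaultdict(int)

def allow(ip: str, task: str, limit: int = 5) -> bool:
    # "5 submissions/day/IP/task"
    key = (ip, task, date.today().isoformat())
    _counts[key] += 1
    return _counts[key] <= limit

def bucket(score: float) -> float:
    # "Score bucketed to 3 decimals" — caps how much ground-truth signal
    # a single submission can leak back to the client.
    return round(score, 3)
```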
agents/README.md ADDED
@@ -0,0 +1,54 @@
+ # `agents/` — third-party harness integrations
+
+ Wraps external agent harnesses so they can be pointed at a GraphTestbed task
+ and produce a `submission.csv` the scoring API understands. LLM traffic is
+ routed through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI)
+ instance via the [`agents.cliproxyapi`](cliproxyapi/README.md) shim.
+
+ ## Layout
+
+ ```
+ agents/
+ ├── cliproxyapi/   # generic Anthropic/OpenAI/Gemini → proxy shim (reusable)
+ ├── common/        # workspace + task-instruction + submit helpers
+ ├── ai_build_ai/   # AI-Build-AI integration (default: claude-sonnet-4-6)
+ └── mlevolve/      # MLEvolve integration (default: gpt-5.3-codex-spark)
+ ```
+
+ `agents/<agent>/_vendor/` (gitignored) holds the upstream binary or git
+ clone for that agent.
+
+ ## End-to-end (figraph example)
+
+ ```bash
+ # 0. One-time setup of the proxy (see agents/cliproxyapi/README.md)
+ export CLIPROXYAPI_KEY=<from your config.yaml>
+
+ # 1. Fetch the task data once
+ gtb fetch figraph
+
+ # 2. Install whichever agent you want
+ bash agents/ai_build_ai/install.sh   # downloads upstream tarball
+ # or
+ bash agents/mlevolve/install.sh      # git clone + pip install
+
+ # 3. Run; the runner prints the produced submission.csv path
+ python -m agents.ai_build_ai.runner --task figraph
+ python -m agents.mlevolve.runner --task figraph
+
+ # 4. Submit when ready (default is print-and-stop)
+ gtb submit figraph --file <printed-path> --agent <my-agent-id>
+ # or pass --submit <name> to the runner to combine 3+4
+ ```
+
+ ## Adding another agent
+
+ 1. Create `agents/<new_agent>/{__init__.py,runner.py,install.sh,README.md}`.
+ 2. In `runner.py` import from `agents.cliproxyapi` (one of `anthropic_env`,
+    `openai_env`, `gemini_env`, or `openai_yaml_block` per the agent's SDK).
+ 3. Use `agents.common.workspace.make_workspace()` for the run dir,
+    `agents.common.tasks.task_instruction()` for the task prompt,
+    `agents.common.submit.finalize()` for validate+optional-submit.
+
+ No changes to `agents/cliproxyapi/` or `agents/common/` are required for new
+ agents that fit one of the three supported SDK shapes.
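
Putting the three "Adding another agent" steps together, a minimal runner could look like the sketch below. The `my_agent` package and `my-agent-binary` are hypothetical placeholders; the helper signatures match the ones this commit adds in `agents/cliproxyapi` and `agents/common`.

```python
# agents/my_agent/runner.py — hypothetical skeleton for a new agent.
# `my-agent-binary` and the module name are placeholders; the helpers are
# the ones added elsewhere in this commit.
import argparse
import os
import subprocess

from agents.cliproxyapi import ProxyEndpoint, openai_env, wait_until_ready
from agents.common.submit import finalize
from agents.common.tasks import task_instruction
from agents.common.workspace import make_workspace

def main() -> None:
    ap = argparse.ArgumentParser(prog="agents.my_agent.runner")
    ap.add_argument("--task", required=True)
    ap.add_argument("--model", default="gpt-5.3-codex-spark")
    ap.add_argument("--submit", default=None, metavar="AGENT_ID")
    args = ap.parse_args()

    ep = ProxyEndpoint.from_env()   # fails fast if CLIPROXYAPI_KEY is unset
    wait_until_ready(ep)

    ws = make_workspace("my_agent", args.task)   # runs/my_agent/<task>/<ts>/
    (ws / "instruction.md").write_text(task_instruction(args.task))

    subprocess.run(
        ["my-agent-binary", "--workdir", str(ws)],   # placeholder binary
        env={**os.environ, **openai_env(ep, model=args.model)},
        check=True,
    )
    # Assuming the binary wrote ws/submission.csv: validate + optionally POST.
    finalize(args.task, ws / "submission.csv", args.submit)

if __name__ == "__main__":
    main()
```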
agents/__init__.py ADDED
@@ -0,0 +1,9 @@
+ """Agent harness integrations for GraphTestbed.
+
+ Each subpackage wraps a third-party agent (AI-Build-AI, MLEvolve, ...) so it
+ can be pointed at a GraphTestbed task and produce a submission.csv that the
+ testbed scoring API understands.
+
+ LLM traffic for every agent flows through a single CLIProxyAPI instance — see
+ `agents.cliproxyapi` for the reusable shim.
+ """
agents/ai_build_ai/README.md ADDED
@@ -0,0 +1,58 @@
+ # `agents.ai_build_ai`
+
+ Runs [AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) on a GraphTestbed
+ task. AI-Build-AI is an Anthropic-SDK-based auto-ML harness that designs,
+ trains, and ranks candidate models from a task description.
+
+ Default model: **`claude-sonnet-4-6`** (override with `--model`).
+
+ ## Install
+
+ ```bash
+ bash agents/ai_build_ai/install.sh   # downloads upstream tarball into _vendor/
+ # Linux x86_64 only — upstream constraint.
+ ```
+
+ The vendored binary lands at `agents/ai_build_ai/_vendor/aibuildai`. Set
+ `AIBUILDAI_BIN` if you put it elsewhere.
+
+ ## Run
+
+ ```bash
+ # Proxy must be running and CLIPROXYAPI_KEY set — see agents/cliproxyapi/README.md
+ gtb fetch figraph        # one-time per task
+ python -m agents.ai_build_ai.runner --task figraph
+ ```
+
+ Output:
+
+ ```
+ runs/ai_build_ai/figraph/<timestamp>/
+ ├── data/            # symlinks to fetched dataset CSVs
+ ├── playground/      # AI-Build-AI's working dir (candidate_*/, …)
+ ├── instruction.md   # generated task prompt
+ ├── agent.log        # full stdout+stderr from the binary
+ └── submission.csv   # normalized to match the testbed schema
+ ```
+
+ The runner prints `submission.csv`'s path; submit when ready:
+
+ ```bash
+ gtb submit figraph --file runs/ai_build_ai/figraph/<ts>/submission.csv \
+     --agent aibuildai-sonnet-4-6
+ # or, in one step:
+ python -m agents.ai_build_ai.runner --task figraph --submit aibuildai-sonnet-4-6
+ ```
+
+ ## Knobs
+
+ | flag | default | upstream meaning |
+ | --- | --- | --- |
+ | `--model` | `claude-sonnet-4-6` | model alias, sent to the proxy |
+ | `--budget-min` | 60 | per-run training budget |
+ | `--pipeline-budget-min` | 90 | total pipeline budget |
+ | `--max-agent-calls` | 8 | LLM call cap per candidate |
+ | `--num-candidates` | 3 | how many model variants to generate |
+
+ The `--model` string must exist in your CLIProxyAPI `oauth-model-alias.claude`
+ mapping (or be a real model your Claude account exposes).
agents/ai_build_ai/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """AI-Build-AI integration (github.com/aibuildai/AI-Build-AI).
+
+ Wraps the `aibuildai` release binary so it can run against any GraphTestbed
+ task. LLM traffic is forced through CLIProxyAPI by setting ANTHROPIC_BASE_URL
+ and ANTHROPIC_API_KEY before launching the binary.
+ """
agents/ai_build_ai/examples/run_figraph.sh ADDED
@@ -0,0 +1,20 @@
+ #!/usr/bin/env bash
+ # End-to-end smoke test of AI-Build-AI on the `figraph` task.
+ # Assumes:
+ #   - CLIProxyAPI is running and CLIPROXYAPI_KEY is set (see agents/cliproxyapi/README.md)
+ #   - `gtb fetch figraph` has been run, OR a local copy of figraph CSVs sits
+ #     at $GRAPHTESTBED_CACHE/figraph/
+ #   - `bash agents/ai_build_ai/install.sh` has put the binary in _vendor/
+ set -euo pipefail
+
+ REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
+ cd "${REPO_ROOT}"
+
+ : "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"
+
+ python3 -m agents.ai_build_ai.runner \
+   --task figraph \
+   --model "${MODEL:-claude-sonnet-4-6}" \
+   --budget-min "${BUDGET_MIN:-30}" \
+   --num-candidates "${NUM_CANDIDATES:-2}" \
+   "${@}"
agents/ai_build_ai/install.sh ADDED
@@ -0,0 +1,31 @@
+ #!/usr/bin/env bash
+ # Install the AI-Build-AI release tarball into agents/ai_build_ai/_vendor/.
+ # Re-run any time to upgrade. Linux x86_64 only (upstream constraint).
+ #
+ # Override the release with: AIBUILDAI_VERSION=v0.1.1 bash install.sh
+ set -euo pipefail
+
+ VERSION="${AIBUILDAI_VERSION:-v0.1.1}"
+ HERE="$(cd "$(dirname "$0")" && pwd)"
+ DEST="${HERE}/_vendor"
+ TARBALL="aibuildai-linux-x86_64-${VERSION}.tar.gz"
+ URL="https://github.com/aibuildai/AI-Build-AI/releases/download/${VERSION}/${TARBALL}"
+
+ mkdir -p "${DEST}"
+ cd "${DEST}"
+
+ echo "Downloading ${URL}"
+ curl -fL --retry 3 -o "${TARBALL}" "${URL}"
+ echo "Unpacking ${TARBALL}"
+ tar -xzf "${TARBALL}"
+ rm -f "${TARBALL}"
+
+ # Upstream tarball ships an install.sh that finalizes setup (PATH hints etc.)
+ if [[ -x ./install.sh ]]; then
+     echo "Running upstream install.sh"
+     ./install.sh
+ fi
+
+ echo
+ echo "Installed AI-Build-AI ${VERSION} under ${DEST}"
+ echo "Set AIBUILDAI_BIN to the binary path if it isn't on \$PATH after this."
agents/ai_build_ai/runner.py ADDED
@@ -0,0 +1,161 @@
+ """Run AI-Build-AI on a GraphTestbed task, routed through CLIProxyAPI.
+
+ Usage:
+     python -m agents.ai_build_ai.runner --task figraph
+     python -m agents.ai_build_ai.runner --task figraph \\
+         --model claude-sonnet-4-6 --budget-min 30
+     python -m agents.ai_build_ai.runner --task figraph \\
+         --submit aibuildai-sonnet-4-6
+
+ Exit codes mirror the wrapped binary.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import shutil
+ import subprocess
+ import sys
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.cliproxyapi import ProxyEndpoint, anthropic_env, wait_until_ready
+ from agents.common.submit import finalize
+ from agents.common.tasks import task_instruction
+ from agents.common.workspace import make_workspace, stage_dataset
+ from graphtestbed._manifest import task_config
+ from graphtestbed.fetch import cache_dir
+
+ DEFAULT_MODEL = "claude-sonnet-4-6"
+
+
+ def _resolve_binary() -> str:
+     explicit = os.environ.get("AIBUILDAI_BIN")
+     if explicit:
+         return explicit
+     on_path = shutil.which("aibuildai")
+     if on_path:
+         return on_path
+     vendored = Path(__file__).parent / "_vendor" / "aibuildai"
+     if vendored.exists():
+         return str(vendored)
+     raise SystemExit(
+         "Cannot locate the `aibuildai` binary.\n"
+         "  Install it: bash agents/ai_build_ai/install.sh\n"
+         "  Or set AIBUILDAI_BIN to the full path."
+     )
+
+
+ def _stage_input(task: str, dst: Path) -> None:
+     src = cache_dir() / task
+     if not src.exists():
+         raise SystemExit(
+             f"No cached dataset at {src}. Run `gtb fetch {task}` first.\n"
+             f"(For pre-launch tasks, drop your local CSVs into {src}/.)"
+         )
+     cfg = task_config(task)
+     files = [spec["filename"] for spec in cfg["files"].values()]
+     stage_dataset(src, dst, files)
+
+
+ def _harvest_submission(task: str, playground: Path, dst: Path) -> Path:
+     """Pick the latest submission.csv produced under playground/, normalize cols."""
+     schema = task_config(task)["submission_schema"]
+     candidates = sorted(
+         playground.rglob("submission.csv"),
+         key=lambda p: p.stat().st_mtime,
+     )
+     if not candidates:
+         raise SystemExit(
+             f"No submission.csv found under {playground}.\n"
+             f"  Inspect the agent's logs to see what happened: "
+             f"{playground.parent / 'agent.log'}"
+         )
+     chosen = candidates[-1]
+     df = pd.read_csv(chosen)
+     expected = [schema["id_col"], schema["pred_col"]]
+     if list(df.columns) != expected:
+         if len(df.columns) == 2:
+             print(f"  (renaming columns {list(df.columns)} → {expected})")
+             df.columns = expected
+         else:
+             raise SystemExit(
+                 f"Cannot normalize {chosen}: got columns {list(df.columns)}, "
+                 f"expected {expected}"
+             )
+     out = dst / "submission.csv"
+     df.to_csv(out, index=False)
+     print(f"✓ Picked {chosen.relative_to(playground.parent)}")
+     return out
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(prog="agents.ai_build_ai.runner")
+     ap.add_argument("--task", required=True,
+                     help="A task name from datasets/manifest.yaml")
+     ap.add_argument("--model", default=DEFAULT_MODEL,
+                     help=f"Model alias passed to aibuildai (default: {DEFAULT_MODEL})")
+     ap.add_argument("--budget-min", type=int, default=60,
+                     help="--run-budget-minutes for aibuildai (default: 60)")
+     ap.add_argument("--pipeline-budget-min", type=int, default=90,
+                     help="--pipeline-budget-minutes (default: 90)")
+     ap.add_argument("--max-agent-calls", type=int, default=8)
+     ap.add_argument("--num-candidates", type=int, default=3)
+     ap.add_argument("--submit", default=None, metavar="AGENT_ID",
+                     help="If set, POST the produced submission.csv to the "
+                          "GraphTestbed scoring API as this agent name.")
+     ap.add_argument("--workspace-root", type=Path, default=None,
+                     help="Override the runs/ root (default: ./runs)")
+     args = ap.parse_args()
+
+     binary = _resolve_binary()
+     ep = ProxyEndpoint.from_env()
+     wait_until_ready(ep)
+     print(f"✓ Proxy ready at {ep.base_url()}")
+
+     ws = make_workspace("ai_build_ai", args.task, args.workspace_root)
+     data = ws / "data"
+     play = ws / "playground"
+     play.mkdir()
+     _stage_input(args.task, data)
+
+     instruction = task_instruction(args.task)
+     (ws / "instruction.md").write_text(instruction)
+
+     cmd = [
+         binary,
+         "--task-name", args.task,
+         "--data-dir", str(data),
+         "--playground-dir", str(play),
+         "--model", args.model,
+         "--instruction", instruction,
+         "--max-agent-calls", str(args.max_agent_calls),
+         "--run-budget-minutes", str(args.budget_min),
+         "--pipeline-budget-minutes", str(args.pipeline_budget_min),
+         "--num-candidates", str(args.num_candidates),
+         "--no-form",
+     ]
+     env = {**os.environ, **anthropic_env(ep, model=args.model)}
+     # aibuildai ships a bundled `claude` binary that aborts if it detects an
+     # outer Claude Code session via these env vars. Strip them so the inner
+     # claude treats this as a fresh top-level invocation.
+     for k in ("CLAUDECODE", "CLAUDE_CODE_ENTRYPOINT", "CLAUDE_CODE_SSE_PORT"):
+         env.pop(k, None)
+
+     print(f"→ Launching {Path(binary).name} task={args.task} model={args.model}")
+     print(f"  workspace: {ws}")
+     log = ws / "agent.log"
+     with log.open("wb") as lf:
+         rc = subprocess.call(cmd, env=env, stdout=lf, stderr=subprocess.STDOUT)
+     print(f"  exit={rc} log={log}")
+     if rc != 0:
+         sys.exit(rc)
+
+     sub = _harvest_submission(args.task, play, ws)
+     finalize(args.task, sub, args.submit)
+
+
+ if __name__ == "__main__":
+     main()
agents/cliproxyapi/README.md ADDED
@@ -0,0 +1,108 @@
+ # `agents.cliproxyapi`
+
+ Reusable shim that points any agent's LLM SDK at a single local
+ [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI) instance.
+
+ ## Why a shim
+
+ Every agent we test uses a different SDK (Anthropic, OpenAI/Codex, Gemini)
+ and a different way of being told "talk to this base URL with this key".
+ This package collapses that into a handful of function calls.
+
+ ## Public surface
+
+ ```python
+ from agents.cliproxyapi import (
+     ProxyEndpoint,       # where + key (read from env)
+     anthropic_env,       # → dict, splice into subprocess env
+     openai_env,
+     gemini_env,
+     openai_yaml_block,   # → dict, drop into a YAML config
+     wait_until_ready,    # TCP probe; raise SystemExit on miss
+     spawn_proxy,         # ctx-manager (opt-in; mostly for CI)
+ )
+ ```
+
+ `ProxyEndpoint.from_env()` reads:
+
+ | env var | default |
+ | --- | --- |
+ | `CLIPROXYAPI_HOST` | `127.0.0.1` |
+ | `CLIPROXYAPI_PORT` | `8317` |
+ | `CLIPROXYAPI_KEY` | *required* |
+
+ ## Recipe per SDK shape
+
+ ### Anthropic SDK / Claude Code (`claude`, `aibuildai`, ...)
+ ```python
+ ep = ProxyEndpoint.from_env()
+ env = {**os.environ, **anthropic_env(ep, model="claude-sonnet-4-6")}
+ subprocess.run([...], env=env)
+ ```
+ Sets `ANTHROPIC_BASE_URL`, `ANTHROPIC_API_KEY`, `ANTHROPIC_AUTH_TOKEN`,
+ `ANTHROPIC_MODEL`.
+
+ ### OpenAI / Codex CLI / any OpenAI-compatible SDK
+ ```python
+ env = {**os.environ, **openai_env(ep, model="gpt-5.3-codex-spark")}
+ ```
+ Sets `OPENAI_BASE_URL=…/v1`, `OPENAI_API_KEY`, `OPENAI_API_BASE`,
+ `OPENAI_MODEL`.
+
+ ### Gemini SDK
+ ```python
+ env = {**os.environ, **gemini_env(ep, model="gemini-2-pro-preview")}
+ ```
+
+ ### YAML configs (e.g. MLEvolve)
+ ```python
+ block = openai_yaml_block(ep, model="gpt-5.3-codex-spark")
+ # → {"model": ..., "base_url": "http://127.0.0.1:8317/v1", "api_key": ...}
+ config["agent"]["code"].update(block)
+ config["agent"]["feedback"].update(block)
+ ```
+
+ ## Setting up the proxy itself
+
+ 1. Install:
+    ```bash
+    git clone https://github.com/router-for-me/CLIProxyAPI && cd CLIProxyAPI
+    docker compose up -d        # or: go build -o cliproxy ./cmd/...
+    ```
+ 2. Drop in a config (start from
+    [`config.example.yaml`](config.example.yaml) here):
+    ```bash
+    mkdir -p ~/.cli-proxy-api
+    cp agents/cliproxyapi/config.example.yaml ~/.cli-proxy-api/config.yaml
+    $EDITOR ~/.cli-proxy-api/config.yaml    # set api-keys[0] + aliases
+    ```
+ 3. Run interactively once to OAuth-log into Claude / Codex / Gemini accounts.
+ 4. Export client-side env vars:
+    ```bash
+    export CLIPROXYAPI_KEY=<the api-keys[0] you set>
+    # CLIPROXYAPI_HOST/PORT only needed if you bind elsewhere
+    ```
+ 5. Smoke-test:
+    ```bash
+    curl -s -H "Authorization: Bearer $CLIPROXYAPI_KEY" \
+         http://127.0.0.1:8317/v1/models | head
+    ```
+
+ Once the proxy is up and `CLIPROXYAPI_KEY` is set, every agent runner in
+ `agents/*/runner.py` works without further configuration.
+
+ ## Adding a new agent that uses the proxy
+
+ ```python
+ # agents/my_agent/runner.py
+ from agents.cliproxyapi import ProxyEndpoint, openai_env, wait_until_ready
+
+ ep = ProxyEndpoint.from_env()
+ wait_until_ready(ep)
+ subprocess.run(
+     ["my-agent-binary", "--task", task, "--model", model],
+     env={**os.environ, **openai_env(ep, model=model)},
+ )
+ ```
+
+ That's the entire integration.
agents/cliproxyapi/__init__.py ADDED
@@ -0,0 +1,33 @@
+ """Generic CLIProxyAPI integration shared by every agent runner.
+
+ CLIProxyAPI (github.com/router-for-me/CLIProxyAPI) is a single local proxy
+ that bridges Anthropic, OpenAI/Codex, and Gemini protocol surfaces on one
+ port. Pointing every agent at it lets us share OAuth state, credentials, and
+ rate-limit budget across many harnesses.
+
+ Public surface — three things:
+
+     ProxyEndpoint → where the proxy is + what API key to send
+     {anthropic,openai,gemini}_env(ep, model=...) → env-var dicts to splice
+         into subprocess.Popen
+     openai_yaml_block(ep, model) → snippet for agents whose configs take
+         base_url/api_key/model directly
+
+ Plus `wait_until_ready(ep)` for runners that should fail fast if the proxy
+ isn't up, and an opt-in `spawn_proxy()` ctx-manager for one-off testing.
+ """
+
+ from .endpoint import ProxyEndpoint
+ from .env import anthropic_env, gemini_env, openai_env, openai_yaml_block
+ from .health import is_ready, spawn_proxy, wait_until_ready
+
+ __all__ = [
+     "ProxyEndpoint",
+     "anthropic_env",
+     "gemini_env",
+     "openai_env",
+     "openai_yaml_block",
+     "is_ready",
+     "spawn_proxy",
+     "wait_until_ready",
+ ]
agents/cliproxyapi/config.example.yaml ADDED
@@ -0,0 +1,43 @@
+ # Minimal CLIProxyAPI config for GraphTestbed agent runs.
+ #
+ # Place at ~/.cli-proxy-api/config.yaml (or pass --config /path/to/this file
+ # when launching the proxy). Full schema:
+ #   https://github.com/router-for-me/CLIProxyAPI
+ #
+ # Quickstart:
+ #   1. Replace the api-keys[0] placeholder with `openssl rand -hex 16`.
+ #   2. Export the same value as CLIPROXYAPI_KEY in the shell that runs the
+ #      agents (so the agent's SDK sends it; the proxy validates it).
+ #   3. Launch the proxy interactively once and complete the OAuth flow for
+ #      each upstream account you intend to use (Claude / Codex / Gemini).
+ #   4. Adjust `oauth-model-alias.{claude,codex}` so the model strings the
+ #      agents send (e.g. `claude-sonnet-4-6`, `gpt-5.3-codex-spark`) resolve
+ #      to whatever upstream IDs your subscriptions actually expose.
+
+ host: "127.0.0.1"
+ port: 8317
+ auth-dir: "~/.cli-proxy-api"
+
+ api-keys:
+   - "REPLACE-WITH-OPENSSL-RAND-HEX-16"
+
+ strategy: "round-robin"
+ session-affinity-ttl: "1h"
+
+ # Upstream Claude OAuth account(s). Run the proxy once with your browser open
+ # to log in; the proxy then caches refresh tokens under auth-dir.
+ claude-api-key: []
+
+ # Upstream Codex OAuth account(s). Same pattern.
+ codex-api-key: []
+
+ # Map the alias names our agents send → actual upstream model IDs.
+ # AI-Build-AI sends `--model claude-sonnet-4-6` (or whatever you pick).
+ # MLEvolve sends the model string from agents/mlevolve/runner.py's --model.
+ oauth-model-alias:
+   claude:
+     # Match the string the agent's runner sends; map to whatever your Claude
+     # subscription actually exposes (check `curl ${proxy}/v1/models`).
+     claude-sonnet-4-6: "<upstream-claude-id>"
+   codex:
+     gpt-5.3-codex-spark: "<upstream-codex-id>"
agents/cliproxyapi/endpoint.py ADDED
@@ -0,0 +1,44 @@
+ """ProxyEndpoint — single source of truth for "where is the proxy + what key".
+
+ Every agent runner reads this from environment, then hands the resulting
+ object to `agents.cliproxyapi.env.*` to build SDK-specific configuration.
+
+ Env vars:
+     CLIPROXYAPI_HOST   default 127.0.0.1
+     CLIPROXYAPI_PORT   default 8317 (CLIProxyAPI's stock port)
+     CLIPROXYAPI_KEY    required — must match one of the api-keys: entries
+                        in your CLIProxyAPI config.yaml
+ """
+
+ from __future__ import annotations
+
+ import os
+ from dataclasses import dataclass
+
+ DEFAULT_HOST = "127.0.0.1"
+ DEFAULT_PORT = 8317
+
+
+ @dataclass(frozen=True)
+ class ProxyEndpoint:
+     host: str = DEFAULT_HOST
+     port: int = DEFAULT_PORT
+     api_key: str = ""
+
+     @classmethod
+     def from_env(cls) -> "ProxyEndpoint":
+         host = os.environ.get("CLIPROXYAPI_HOST", DEFAULT_HOST)
+         port = int(os.environ.get("CLIPROXYAPI_PORT", str(DEFAULT_PORT)))
+         api_key = os.environ.get("CLIPROXYAPI_KEY", "").strip()
+         if not api_key:
+             raise SystemExit(
+                 "CLIPROXYAPI_KEY is unset. Set it to one of the api-keys "
+                 "you've configured in your CLIProxyAPI config.yaml.\n"
+                 "Example:\n"
+                 "  export CLIPROXYAPI_KEY=$(grep -A1 'api-keys:' "
+                 "~/.cli-proxy-api/config.yaml | tail -1 | tr -d ' \"-')"
+             )
+         return cls(host=host, port=port, api_key=api_key)
+
+     def base_url(self, scheme: str = "http") -> str:
+         return f"{scheme}://{self.host}:{self.port}"
agents/cliproxyapi/env.py ADDED
@@ -0,0 +1,82 @@
+ """Build env-var dicts (or YAML-config snippets) that point an SDK at the proxy.
+
+ Three SDK shapes are covered today; add more here as agents arrive:
+
+     anthropic_env(ep, model) → Anthropic SDK / Claude Code CLI
+     openai_env(ep, model)    → OpenAI SDK / Codex CLI
+     gemini_env(ep, model)    → google-generativeai SDK / gemini-cli
+
+ Plus `openai_yaml_block(ep, model)` for agents whose config files take
+ `base_url` / `api_key` / `model` fields directly (e.g. MLEvolve).
+
+ Usage from any agent runner:
+
+     from agents.cliproxyapi import ProxyEndpoint, anthropic_env
+     ep = ProxyEndpoint.from_env()
+     subprocess.run(cmd, env={**os.environ, **anthropic_env(ep, model="...")})
+ """
+
+ from __future__ import annotations
+
+ from .endpoint import ProxyEndpoint
+
+
+ def anthropic_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by anthropic-python and claude-code.
+
+     The Anthropic SDK appends `/v1/messages` to ANTHROPIC_BASE_URL itself,
+     so we hand it the proxy root (no trailing path).
+     """
+     env = {
+         "ANTHROPIC_BASE_URL": ep.base_url(),
+         "ANTHROPIC_API_KEY": ep.api_key,
+         "ANTHROPIC_AUTH_TOKEN": ep.api_key,
+     }
+     if model:
+         env["ANTHROPIC_MODEL"] = model
+     return env
+
+
+ def openai_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by openai-python, codex-cli, and many compatible SDKs.
+
+     The OpenAI SDK appends `/chat/completions` (and other paths) to
+     OPENAI_BASE_URL, so we include the `/v1` prefix here.
+     """
+     env = {
+         "OPENAI_BASE_URL": f"{ep.base_url()}/v1",
+         "OPENAI_API_KEY": ep.api_key,
+         "OPENAI_API_BASE": f"{ep.base_url()}/v1",  # legacy var, still common
+     }
+     if model:
+         env["OPENAI_MODEL"] = model
+     return env
+
+
+ def gemini_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by google-generativeai and gemini-cli.
+
+     The proxy exposes Gemini's `/v1beta/models/.../generateContent` shape on
+     the proxy root — clients prepend nothing.
+     """
+     env = {
+         "GEMINI_API_BASE": ep.base_url(),
+         "GOOGLE_API_KEY": ep.api_key,
+         "GEMINI_API_KEY": ep.api_key,
+     }
+     if model:
+         env["GEMINI_MODEL"] = model
+     return env
+
+
+ def openai_yaml_block(ep: ProxyEndpoint, model: str) -> dict[str, str]:
+     """Three-key dict for configs that name the proxy directly (e.g. MLEvolve).
+
+     Returns:
+         {"model": ..., "base_url": ".../v1", "api_key": ...}
+     """
+     return {
+         "model": model,
+         "base_url": f"{ep.base_url()}/v1",
+         "api_key": ep.api_key,
+     }
agents/cliproxyapi/health.py ADDED
@@ -0,0 +1,63 @@
+ """Probe and (optionally) spawn the CLIProxyAPI process.
+
+ `wait_until_ready` does a TCP connect — endpoint-agnostic, so it works no
+ matter which protocol surfaces the proxy version exposes.
+
+ `spawn_proxy` is a context manager for tests / one-off CI runs. Most users
+ should run the proxy out-of-band: it owns long-lived OAuth tokens and may
+ serve other tools besides the testbed.
+ """
+
+ from __future__ import annotations
+
+ import contextlib
+ import socket
+ import subprocess
+ import time
+ from pathlib import Path
+
+ from .endpoint import ProxyEndpoint
+
+
+ def is_ready(ep: ProxyEndpoint, timeout: float = 2.0) -> bool:
+     try:
+         with socket.create_connection((ep.host, ep.port), timeout=timeout):
+             return True
+     except OSError:
+         return False
+
+
+ def wait_until_ready(ep: ProxyEndpoint, timeout: float = 30.0) -> None:
+     deadline = time.monotonic() + timeout
+     while time.monotonic() < deadline:
+         if is_ready(ep):
+             return
+         time.sleep(0.5)
+     raise SystemExit(
+         f"CLIProxyAPI at {ep.base_url()} did not respond within {timeout:.0f}s.\n"
+         f"Start it (e.g. `cliproxy --config ~/.cli-proxy-api/config.yaml`) "
+         f"and confirm CLIPROXYAPI_HOST / CLIPROXYAPI_PORT."
+     )
+
+
+ @contextlib.contextmanager
+ def spawn_proxy(
+     config_path: str | Path,
+     binary: str = "cliproxy",
+     timeout: float = 30.0,
+ ):
+     ep = ProxyEndpoint.from_env()
+     proc = subprocess.Popen(
+         [binary, "--config", str(config_path)],
+         stdout=subprocess.PIPE,
+         stderr=subprocess.STDOUT,
+     )
+     try:
+         wait_until_ready(ep, timeout=timeout)
+         yield ep
+     finally:
+         proc.terminate()
+         try:
+             proc.wait(timeout=5)
+         except subprocess.TimeoutExpired:
+             proc.kill()
agents/common/__init__.py ADDED
@@ -0,0 +1 @@
+ """Shared adapter helpers between testbed and individual agent runners."""
agents/common/submit.py ADDED
@@ -0,0 +1,28 @@
+ """Validate and (optionally) submit an agent's output to the GraphTestbed API.
+
+ Default mode is print-and-stop: the runner reports the path to the produced
+ submission.csv but does not POST. Pass `--submit <agent-name>` to the runner
+ to actually call the scoring API.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ from graphtestbed.submit import submit as gtb_submit
+ from graphtestbed.submit import validate_submission
+
+
+ def finalize(task: str, csv_path: Path, agent: str | None) -> None:
+     info = validate_submission(task, csv_path)
+     print()
+     print("✓ Submission ready")
+     print(f"  file:   {csv_path}")
+     print(f"  rows:   {info['n_rows']}")
+     print(f"  sha256: {info['sha256'][:12]}...")
+     if agent:
+         gtb_submit(task, csv_path, agent, dry_run=False)
+     else:
+         print()
+         print("(not submitted — pass --submit <agent-name> to POST)")
+         print(f"  manual: gtb submit {task} --file {csv_path} --agent <name>")
agents/common/tasks.py ADDED
@@ -0,0 +1,63 @@
+ """Render a per-task instruction markdown for any agent.
+
+ Pulls the canonical task description from datasets/manifest.yaml and decorates
+ it with the submission contract (id col, pred col, n rows, metric).
+
+ Per-task overrides — handcrafted prompts that beat the auto-generated text —
+ live in agents/common/tasks_md/<task>.md and take priority when present.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ from graphtestbed._manifest import task_config
+
+ _TEMPLATE = """\
+ # Task: {task}
+
+ {description}
+
+ ## Files you will see
+
+ - `train_features.csv` — labeled training rows
+ - `val_features.csv` — labeled validation rows (use for HPO / early stopping)
+ - `test_features.csv` — **unlabeled** test rows; predict here
+
+ The `Label` (or task-specific target) column is present in train/val and
+ absent from test. Do not attempt to recover test labels from upstream sources.
+
+ ## Submission format
+
+ Write a CSV with **exactly two columns**, in this order:
+
+ | column | type | meaning |
+ | --- | --- | --- |
+ | `{id_col}` | id | matches `test_features.csv[{id_col}]` 100% |
+ | `{pred_col}` | float in [0, 1] | predicted score |
+
+ Row count: **{n_rows}**.
+
+ ## Metric
+
+ You will be evaluated on `{primary}` (primary). Secondary: {secondary}.
+ Optimize for the primary metric.
+ """
+
+
+ def task_instruction(task: str) -> str:
+     override = Path(__file__).parent / "tasks_md" / f"{task}.md"
+     if override.exists():
+         return override.read_text()
+     cfg = task_config(task)
+     s = cfg["submission_schema"]
+     m = cfg["metric"]
+     return _TEMPLATE.format(
+         task=task,
+         description=str(cfg.get("description", "")).strip(),
+         id_col=s["id_col"],
+         pred_col=s["pred_col"],
+         n_rows=s.get("n_rows", "?"),
+         primary=m["primary"],
+         secondary=", ".join(m.get("secondary", [])) or "(none)",
+     )
agents/common/workspace.py ADDED
@@ -0,0 +1,35 @@
+ """Ephemeral workspace dirs and dataset staging for agent runs.
+
+ Each runner allocates `runs/<agent>/<task>/<timestamp>/` so concurrent runs
+ don't collide and post-mortems are always recoverable from disk.
+ """
+
+ from __future__ import annotations
+
+ import datetime as dt
+ from pathlib import Path
+
+
+ def make_workspace(agent: str, task: str, root: Path | None = None) -> Path:
+     root = Path(root) if root else Path.cwd() / "runs"
+     ts = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
+     ws = root / agent / task / ts
+     ws.mkdir(parents=True, exist_ok=False)
+     return ws
+
+
+ def stage_dataset(src_dir: Path, dst_dir: Path, files: list[str]) -> None:
+     """Symlink each `files[i]` from src_dir into dst_dir.
+
+     Symlinks (vs copies) keep large CSVs on the cache disk; the agent reads
+     from src via the link transparently.
+     """
+     dst_dir.mkdir(parents=True, exist_ok=True)
+     for f in files:
+         s = src_dir / f
+         if not s.exists():
+             raise SystemExit(f"Missing dataset file: {s}")
+         d = dst_dir / f
+         if d.is_symlink() or d.exists():
+             d.unlink()
+         d.symlink_to(s.resolve())
agents/mlevolve/README.md ADDED
@@ -0,0 +1,75 @@
+ # `agents.mlevolve`
+
+ Runs [MLEvolve](https://github.com/InternScience/MLEvolve) on a GraphTestbed
+ task. MLEvolve is an MCGS auto-ML harness wired for OpenAI-compatible APIs.
+
+ Default model: **`gpt-5.3-codex-spark`** (a pipe-through alias you define in
+ your CLIProxyAPI `oauth-model-alias.codex` block).
+
+ ## Install
+
+ ```bash
+ bash agents/mlevolve/install.sh
+ # heavy: clones the repo + pip-installs torch and ML deps (~5-10 GB).
+ ```
+
+ Lands at `agents/mlevolve/_vendor/MLEvolve/`. Set `MLEVOLVE_DIR` if you
+ already have a clone elsewhere.
+
+ ## Run
+
+ ```bash
+ gtb fetch figraph
+ python -m agents.mlevolve.runner --task figraph
+ ```
+
+ Output:
+
+ ```
+ runs/mlevolve/figraph/<timestamp>/
+ ├── mlebench-tree/figraph/
+ │   ├── prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
+ │   ├── prepared/private/test.csv   # val labels — local grader uses this
+ │   └── REAL_TEST_FEATURES.csv      # the actual test split, for re-execute
+ ├── agent.log
+ └── val_submission.csv              # MLEvolve's best on the val "test" split
+ ```
+
+ ## ⚠ v1 limitation: val-as-test
+
+ GraphTestbed's actual test labels live on the scoring server, not on disk.
+ For the local mle-bench grader to function, the adapter exposes
+ `val_features.csv` (with labels) as the "test" set MLEvolve searches against.
+
+ The CSV the runner harvests is therefore predictions on **val**, not test.
+ To submit a real test-set score:
+
+ 1. Open `agents/mlevolve/_vendor/MLEvolve/runs/<latest-ts>/` and find the
+    best runfile.py (search order: best score in the run's tree summary).
+ 2. Re-execute it against the real test split:
+    ```bash
+    cd <some scratch dir>
+    cp <ws>/mlebench-tree/figraph/REAL_TEST_FEATURES.csv ./test.csv
+    cp <ws>/mlebench-tree/figraph/prepared/public/train.csv ./train.csv
+    python <runfile>          # produces submission.csv
+    ```
+ 3. Submit:
+    ```bash
+    gtb submit figraph --file ./submission.csv --agent mlevolve-codex-spark
+    ```
+
+ This step is manual in v1 because the structure of MLEvolve's `runfile.py`
+ varies per task and we don't want to silently mis-execute. It is on the
+ roadmap to automate.
+
+ ## Knobs
+
+ | flag | default | meaning |
+ | --- | --- | --- |
+ | `--model` | `gpt-5.3-codex-spark` | sent to proxy via OPENAI_BASE_URL/v1 |
+ | `--steps` | 100 | MCGS exploration count (upstream default: 500) |
+ | `--time-limit-min` | 120 | per-task wall-clock cap (upstream default: 720) |
+ | `--gpus` | 0 | passed to `search.num_gpus` |
+
+ The `--model` string must exist in your CLIProxyAPI
+ `oauth-model-alias.codex` (or be a real model your Codex account exposes).
agents/mlevolve/__init__.py ADDED
@@ -0,0 +1,10 @@
+ """MLEvolve integration (github.com/InternScience/MLEvolve).
+
+ MLEvolve is an MCGS-based auto-ML harness designed for the mle-bench
+ data layout. The adapter here translates a GraphTestbed task into the
+ mle-bench shape it expects, then drives the upstream `run.py` (Hydra
+ entry point) with overrides that route LLM traffic through CLIProxyAPI.
+
+ Default model: `gpt-5.3-codex-spark` (pipe-through alias the user defines
+ in their CLIProxyAPI `oauth-model-alias.codex` block).
+ """
agents/mlevolve/adapter.py ADDED
@@ -0,0 +1,79 @@
+ """GraphTestbed task → mle-bench-shaped data tree.
+
+ mle-bench expects, per experiment ID:
+
+     <root>/<exp_id>/prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
+
+ GraphTestbed's test labels live only on the scoring server, so the agent
+ cannot be auto-scored against `test_features.csv` locally. v1 strategy:
+
+     - Stage `val_features.csv` (with labels) as the "test" the agent
+       searches against. MLEvolve's grader can score val predictions locally,
+       which is what drives MCGS exploration.
+     - Stash the real `test_features.csv` next to the staged tree as
+       `<root>/<exp_id>/REAL_TEST_FEATURES.csv` so users can re-execute the
+       best runfile.py against it after the search finishes.
+
+ This is documented as a known limitation in agents/mlevolve/README.md.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.common.tasks import task_instruction
+ from graphtestbed._manifest import task_config
+ from graphtestbed.fetch import cache_dir
+
+
+ def stage(task: str, root: Path) -> Path:
+     """Build <root>/<task>/prepared/{public,private}/. Return the prepared dir."""
+     cfg = task_config(task)
+     s = cfg["submission_schema"]
+
+     src = cache_dir() / task
+     if not src.exists():
+         raise SystemExit(
+             f"No cached dataset at {src}. Run `gtb fetch {task}` first."
+         )
+
+     base = root / task / "prepared"
+     pub = base / "public"
+     priv = base / "private"
+     pub.mkdir(parents=True, exist_ok=True)
+     priv.mkdir(parents=True, exist_ok=True)
+
+     train = pd.read_csv(src / "train_features.csv")
+     val = pd.read_csv(src / "val_features.csv")
+     test = pd.read_csv(src / "test_features.csv")
+
+     if s["pred_col"] not in val.columns:
+         raise SystemExit(
+             f"val_features.csv has no `{s['pred_col']}` column — cannot use "
+             f"val as the local-grading split for task {task}."
+         )
+
+     # Public tree (what the agent sees). val_no_label = val minus label →
+     # served as `test.csv` so the agent's runfile predicts on it.
+     val_no_label = val.drop(columns=[s["pred_col"]])
+     train.to_csv(pub / "train.csv", index=False)
+     val_no_label.to_csv(pub / "test.csv", index=False)
+
+     sample = val_no_label[[s["id_col"]]].copy()
+     sample[s["pred_col"]] = 0.5
+     sample.to_csv(pub / "sample_submission.csv", index=False)
+
+     (pub / "description.md").write_text(task_instruction(task))
+
+     # Private tree: val with labels — the local grader checks submission
+     # against this.
+     val[[s["id_col"], s["pred_col"]]].rename(
+         columns={s["pred_col"]: "Label"}
+     ).to_csv(priv / "test.csv", index=False)
+
+     # Stash the real test set for post-search re-execution by the user.
+     test.to_csv(root / task / "REAL_TEST_FEATURES.csv", index=False)
+
+     return base
agents/mlevolve/examples/run_figraph.sh ADDED
@@ -0,0 +1,16 @@
+ #!/usr/bin/env bash
+ # End-to-end smoke test of MLEvolve on the `figraph` task.
+ set -euo pipefail
+
+ REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
+ cd "${REPO_ROOT}"
+
+ : "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"
+
+ python3 -m agents.mlevolve.runner \
+   --task figraph \
+   --model "${MODEL:-gpt-5.3-codex-spark}" \
+   --steps "${STEPS:-30}" \
+   --time-limit-min "${TIME_LIMIT_MIN:-30}" \
+   --gpus "${GPUS:-0}" \
+   "${@}"
agents/mlevolve/install.sh ADDED
@@ -0,0 +1,34 @@
+ #!/usr/bin/env bash
+ # Clone MLEvolve into agents/mlevolve/_vendor/MLEvolve and install its deps.
+ # This is a heavy install (torch + ML stack); expect ~5–10 GB and 5–15 min.
+ set -euo pipefail
+
+ HERE="$(cd "$(dirname "$0")" && pwd)"
+ DEST="${HERE}/_vendor"
+ REPO="${MLEVOLVE_REPO:-https://github.com/InternScience/MLEvolve}"
+ REF="${MLEVOLVE_REF:-main}"
+
+ mkdir -p "${DEST}"
+
+ if [[ -d "${DEST}/MLEvolve/.git" ]]; then
+     echo "Updating existing clone in ${DEST}/MLEvolve"
+     git -C "${DEST}/MLEvolve" fetch origin "${REF}"
+     git -C "${DEST}/MLEvolve" checkout "${REF}"
+     git -C "${DEST}/MLEvolve" pull --ff-only
+ else
+     git clone --depth 50 --branch "${REF}" "${REPO}" "${DEST}/MLEvolve"
+ fi
+
+ cd "${DEST}/MLEvolve"
+ echo
+ echo "Installing requirements (heavy — torch + ML stack)..."
+ for f in requirements_base.txt requirements_ml.txt requirements_domain.txt; do
+     if [[ -f "$f" ]]; then
+         echo "  pip install --no-deps -r $f"
+         pip install --no-deps -r "$f"
+     fi
+ done
+
+ echo
+ echo "MLEvolve installed at ${DEST}/MLEvolve"
+ echo "Set MLEVOLVE_DIR if you put it elsewhere."
agents/mlevolve/runner.py ADDED
@@ -0,0 +1,210 @@
+ """Run MLEvolve on a GraphTestbed task, routed through CLIProxyAPI.
+
+ Usage:
+     python -m agents.mlevolve.runner --task figraph
+     python -m agents.mlevolve.runner --task figraph \\
+         --model gpt-5.3-codex-spark --steps 100
+     python -m agents.mlevolve.runner --task figraph \\
+         --submit mlevolve-codex-spark
+
+ What this does:
+     1. Build an mle-bench-shaped tree from the GraphTestbed task data
+        (val-as-test for v1 — see adapter.py for why).
+     2. Render config.yaml into _vendor/MLEvolve/config/, with the proxy
+        endpoint + model wired into agent.code and agent.feedback.
+     3. Invoke `python run.py …` from inside _vendor/MLEvolve/ with Hydra
+        overrides for paths and run-budget.
+     4. Harvest the latest submission.csv from runs/, normalize its column
+        names, validate against the testbed schema, and (optionally) submit.
+
+ Known v1 limitation: the produced submission scores VAL-set predictions,
+ not TEST-set. To score on test, rerun the best runfile.py against
+ <workspace>/mlebench-tree/<task>/REAL_TEST_FEATURES.csv before submitting.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import subprocess
+ import sys
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.cliproxyapi import (
+     ProxyEndpoint,
+     openai_yaml_block,
+     wait_until_ready,
+ )
+ from agents.common.submit import finalize
+ from agents.common.workspace import make_workspace
+ from agents.mlevolve.adapter import stage as stage_mlebench
+ from graphtestbed._manifest import task_config
+
+ DEFAULT_MODEL = "gpt-5.3-codex-spark"
+
+
+ def _resolve_mlevolve_dir() -> Path:
+     explicit = os.environ.get("MLEVOLVE_DIR")
+     if explicit:
+         p = Path(explicit)
+         if not (p / "run.py").exists():
+             raise SystemExit(f"MLEVOLVE_DIR={p} does not contain run.py")
+         return p
+     vendored = Path(__file__).parent / "_vendor" / "MLEvolve"
+     if (vendored / "run.py").exists():
+         return vendored
+     raise SystemExit(
+         "Cannot locate MLEvolve.\n"
+         "  Install: bash agents/mlevolve/install.sh\n"
+         "  Or set MLEVOLVE_DIR to your existing clone."
+     )
+
+
+ def _hydra_overrides(
+     task: str, mlebench_root: Path, prepared: Path, ep: ProxyEndpoint,
+     model: str, steps: int, time_limit_s: int, num_gpus: int,
+ ) -> list[str]:
+     """Build Hydra-style key=value overrides for run.py."""
+     public = prepared / "public"
+     block = openai_yaml_block(ep, model)
+     cfg_metric = task_config(task)["metric"]["primary"]
+
+     overrides = [
+         f"exp_id={task}",
+         f"exp_name={task}",
+         f"dataset_dir={mlebench_root}",
+         f"data_dir={public}",
+         f"desc_file={public / 'description.md'}",
+         f"start_cpu_id=0",
+         f"cpu_number=4",
+         # LLM routing → proxy
+         f"agent.code.model={block['model']}",
+         f"agent.code.base_url={block['base_url']}",
+         f"agent.code.api_key={block['api_key']}",
+         f"agent.feedback.model={block['model']}",
+         f"agent.feedback.base_url={block['base_url']}",
+         f"agent.feedback.api_key={block['api_key']}",
+         # Run budget overrides
+         f"agent.steps={steps}",
+         f"agent.time_limit={time_limit_s}",
+         f"agent.memory_embedding_device={'cuda' if num_gpus > 0 else 'cpu'}",
+         f"agent.search.num_gpus={num_gpus}",
+         f"use_grading_server=false",
+         # Goal hint
+         f"goal=Maximize {cfg_metric} on the test set",
+         f"eval={cfg_metric}",
+     ]
+     return overrides
+
+
+ def _harvest_submission(
+     task: str, mlevolve_dir: Path, dst: Path,
+ ) -> Path:
+     schema = task_config(task)["submission_schema"]
+     runs = mlevolve_dir / "runs"
+     if not runs.exists():
+         raise SystemExit(f"No runs/ dir under {mlevolve_dir}")
+     candidates = sorted(runs.rglob("submission.csv"),
+                         key=lambda p: p.stat().st_mtime)
+     if not candidates:
+         raise SystemExit(
+             f"No submission.csv produced under {runs}. "
+             f"Inspect {dst / 'agent.log'} for the failure mode."
+         )
+     chosen = candidates[-1]
+     df = pd.read_csv(chosen)
+     expected = [schema["id_col"], schema["pred_col"]]
+     if list(df.columns) != expected:
+         if len(df.columns) == 2:
+             print(f"  (renaming columns {list(df.columns)} → {expected})")
+             df.columns = expected
+         else:
+             raise SystemExit(
+                 f"Cannot normalize {chosen}: got {list(df.columns)}, expected {expected}"
+             )
+     out = dst / "val_submission.csv"
+     df.to_csv(out, index=False)
+     print(f"✓ Picked {chosen.relative_to(mlevolve_dir)}")
+     return out
+
+
+ def _print_followup(task: str, ws: Path, val_sub: Path) -> None:
+     real_test = ws / "mlebench-tree" / task / "REAL_TEST_FEATURES.csv"
+     print()
+     print("⚠ v1 limitation: the file above scores VAL predictions.")
+     print("  To score on the actual test set:")
+     print(f"  1. Find the best runfile.py under "
+           f"{Path('_vendor/MLEvolve/runs')}/<latest>/")
+     print(f"  2. Re-run it with test.csv replaced by:")
+     print(f"       {real_test}")
+     print(f"  3. Submit the resulting CSV via:")
+     print(f"       gtb submit {task} --file <path> --agent <name>")
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(prog="agents.mlevolve.runner")
+     ap.add_argument("--task", required=True)
+     ap.add_argument("--model", default=DEFAULT_MODEL,
+                     help=f"default: {DEFAULT_MODEL}")
+     ap.add_argument("--steps", type=int, default=100,
+                     help="agent.steps (default: 100, upstream default 500 — "
+                          "MCGS exploration count)")
+     ap.add_argument("--time-limit-min", type=int, default=120,
+                     help="agent.time_limit in minutes (default: 120)")
+     ap.add_argument("--gpus", type=int, default=0,
+                     help="search.num_gpus (default: 0 — CPU only)")
+     ap.add_argument("--submit", default=None, metavar="AGENT_ID",
+                     help="POST val-set submission to scoring API as this name. "
+                          "Note: scores VAL not test (see runner docstring).")
+     ap.add_argument("--workspace-root", type=Path, default=None)
+     args = ap.parse_args()
+
+     mlevolve_dir = _resolve_mlevolve_dir()
+     ep = ProxyEndpoint.from_env()
+     wait_until_ready(ep)
+     print(f"✓ Proxy ready at {ep.base_url()}")
+     print(f"✓ MLEvolve at {mlevolve_dir}")
+
+     ws = make_workspace("mlevolve", args.task, args.workspace_root)
+     mlebench_root = ws / "mlebench-tree"
+     prepared = stage_mlebench(args.task, mlebench_root)
+     print(f"✓ mle-bench tree staged at {mlebench_root}")
+
+     overrides = _hydra_overrides(
+         task=args.task,
+         mlebench_root=mlebench_root,
+         prepared=prepared,
+         ep=ep,
+         model=args.model,
+         steps=args.steps,
+         time_limit_s=args.time_limit_min * 60,
+         num_gpus=args.gpus,
+     )
+     cmd = [sys.executable, "run.py", *overrides]
+
+     print(f"→ Launching MLEvolve task={args.task} model={args.model}")
+     print(f"  workspace: {ws}")
+     log = ws / "agent.log"
+     with log.open("wb") as lf:
+         rc = subprocess.call(cmd, cwd=mlevolve_dir, stdout=lf, stderr=subprocess.STDOUT)
+     print(f"  exit={rc} log={log}")
+     if rc != 0:
+         raise SystemExit(rc)
+
+     val_sub = _harvest_submission(args.task, mlevolve_dir, ws)
+     _print_followup(args.task, ws, val_sub)
+
+     # Note: don't auto-finalize against `test_features.csv` schema since this
+     # is a val-set submission. Just print & stop.
+     print()
+     print(f"  val_submission: {val_sub}")
+     if args.submit:
+         print(f"  --submit was set; posting val-set predictions as "
+               f"`{args.submit}` (will score 0 against test GT).")
+         finalize(args.task, val_sub, args.submit)
+
+
+ if __name__ == "__main__":
+     main()
graphtestbed/leaderboard.py CHANGED
@@ -8,7 +8,10 @@ import os
  import json
 
 
- API_URL = os.environ.get("GRAPHTESTBED_API", "http://localhost:8080")
+ API_URL = os.environ.get(
+     "GRAPHTESTBED_API",
+     "https://lanczos-graphtestbed.hf.space",
+ )
 
 
  def main() -> None:
graphtestbed/submit.py CHANGED
@@ -21,7 +21,10 @@ import pandas as pd
 from graphtestbed._manifest import sha256_file, task_config
 
 
-API_URL = os.environ.get("GRAPHTESTBED_API", "http://localhost:8080")
+API_URL = os.environ.get(
+    "GRAPHTESTBED_API",
+    "https://lanczos-graphtestbed.hf.space",
+)
 TIMEOUT_SEC = 60
 
 
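For self-hosters the env var still wins over the new hosted default. A minimal sketch (the local URL is an example; `/healthz` is the health endpoint the Space documents below) of pointing the client back at a self-hosted server and checking it responds:

```python
import os

import requests

# Example only: route the gtb client at a self-hosted scoring server
# instead of the hosted Space default.
os.environ["GRAPHTESTBED_API"] = "http://localhost:8080"

# Same resolution rule the two modules above use:
api = os.environ.get("GRAPHTESTBED_API", "https://lanczos-graphtestbed.hf.space")
print(requests.get(f"{api}/healthz", timeout=60).json())
```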
pyproject.toml CHANGED
@@ -7,11 +7,6 @@ license = "MIT"
 readme = "README.md"
 requires-python = ">=3.10"
 keywords = ["benchmark", "graph", "ml", "agent", "evaluation"]
-
-[project.urls]
-Homepage = "https://github.com/zhuconv/GraphTestbed"
-Repository = "https://github.com/zhuconv/GraphTestbed"
-Issues = "https://github.com/zhuconv/GraphTestbed/issues"
 dependencies = [
     "huggingface-hub >= 0.20",
     "pandas >= 2.0",
@@ -19,6 +14,11 @@ dependencies = [
     "requests >= 2.30",
 ]
 
+[project.urls]
+Homepage = "https://github.com/zhuconv/GraphTestbed"
+Repository = "https://github.com/zhuconv/GraphTestbed"
+Issues = "https://github.com/zhuconv/GraphTestbed/issues"
+
 [project.optional-dependencies]
 dev = ["scikit-learn >= 1.3"]
 
server/api.py CHANGED
@@ -41,6 +41,10 @@ from flask import Flask, jsonify, request
 
 GT_DIR = Path(os.environ.get("GT_DIR", "/var/graphtestbed/gt"))
 DB_PATH = Path(os.environ.get("GT_DB", "/var/graphtestbed/leaderboard.db"))
+ARCHIVE_DIR = (
+    Path(os.environ["GT_ARCHIVE_DIR"])
+    if os.environ.get("GT_ARCHIVE_DIR") else None
+)
 MANIFEST_PATH = Path(os.environ.get(
     "GT_MANIFEST",
     Path(__file__).resolve().parents[1] / "datasets" / "manifest.yaml",
@@ -195,6 +199,15 @@
     )
     conn.commit()
 
+    # Archive the raw CSV when GT_ARCHIVE_DIR is configured, so the deploy
+    # host can later prove what each scored entry was. Filename embeds the
+    # agent + run_id so multiple submissions don't collide.
+    if ARCHIVE_DIR is not None:
+        safe_agent = "".join(c if c.isalnum() or c in "-_." else "_" for c in agent)
+        out = ARCHIVE_DIR / task / f"{safe_agent}-{run_id}.csv"
+        out.parent.mkdir(parents=True, exist_ok=True)
+        out.write_bytes(raw)
+
     # Rank = how many distinct agents have a strictly better best-score on
     # this task. The just-inserted row contributes to that count only if the
     # SAME agent had a better prior submission (in which case rank doesn't
server/requirements.txt CHANGED
@@ -3,3 +3,4 @@ pandas>=2.0
 pyyaml>=6.0
 scikit-learn>=1.3
 gunicorn>=21.0
+huggingface_hub>=0.20
server/space/DEPLOY.md ADDED
@@ -0,0 +1,101 @@
+# Deploying the GraphTestbed scoring server to HF Spaces
+
+All commands assume `HF_TOKEN` is exported and has **write** scope on the
+`lanczos` namespace.
+
+## 1. Seed the GT dataset repo
+
+```bash
+HF_TOKEN=$HF_TOKEN python server/space/push_gt.py \
+  --repo lanczos/graphtestbed-gt \
+  --gt-dir ~/graphtestbed-gt
+```
+
+This creates the **private** dataset repo if it doesn't exist and uploads
+each `<task>.csv` to `gt/<task>.csv`. Verify at:
+
+<https://huggingface.co/datasets/lanczos/graphtestbed-gt>
+
+## 2. Create the Space
+
+```bash
+huggingface-cli repo create graphtestbed --type space --space_sdk docker
+```
+
+Or in the web UI: New Space → name `graphtestbed` → SDK: **Docker**.
+
+## 3. Set the Space secret
+
+In Space Settings → Variables and secrets, add:
+
+| name | value |
+| --- | --- |
+| `HF_TOKEN` | same token (write scope on `lanczos/graphtestbed-gt`) |
+
+Optional overrides (set as **variables**, not secrets):
+
+| name | default | when to override |
+| --- | --- | --- |
+| `GT_DATASET_REPO` | `lanczos/graphtestbed-gt` | running multiple Spaces against different GT |
+| `GT_BACKUP_INTERVAL` | `60` | tighter durability vs. fewer commits |
+| `GT_QUOTA` | `5` | bumping during a benchmark sprint |
+
+## 4. Push the code to the Space
+
+```bash
+# One-time
+git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+
+# Each deploy (HF prompts for credentials: user=lanczos, password=$HF_TOKEN)
+./server/space/push_to_space.sh
+```
+
+The script overlays `server/space/README.md` at repo root on a temp branch
+and force-pushes to `space/main` (HF reads the Space's frontmatter from the
+root README). Your GitHub root README is untouched.
+
+The first build takes ~3 min (pandas + sklearn wheels); later ones ~30 s.
+
+## 5. Smoke-test
+
+```bash
+curl -s https://lanczos-graphtestbed.hf.space/healthz | jq
+```
+
+Expect:
+```json
+{
+  "status": "ok",
+  "tasks": ["arxiv-citation", "figraph", "ibm-aml", "ieee-fraud-detection"],
+  "gt_present": ["figraph", "..."],
+  "quota_per_day": 5,
+  "uptime_unix": 1776633751
+}
+```
+
+If `gt_present` is empty, the boot-time bootstrap couldn't read from the
+dataset repo — check the Space logs and verify `HF_TOKEN` has read scope on
+`GT_DATASET_REPO`.
+
+## 6. Hand out the URL
+
+```bash
+export GRAPHTESTBED_API=https://lanczos-graphtestbed.hf.space
+gtb submit figraph --file preds.csv --agent my-agent-v1
+```
+
+## Reading the leaderboard back as a maintainer
+
+```bash
+huggingface-cli download lanczos/graphtestbed-gt \
+  leaderboard.db \
+  --repo-type dataset \
+  --local-dir ./backup
+
+sqlite3 backup/leaderboard.db \
+  "SELECT task, agent, primary_metric, n_rows, submitted_at
+   FROM submissions ORDER BY submitted_at DESC LIMIT 20"
+```
+
+The full per-submission CSV archive lives under `submissions/<task>/<agent>-<run_id>.csv`
+in the same dataset repo.
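The same maintainer query from Python instead of the sqlite3 CLI; a sketch that assumes only the columns named in the query above:

```python
import sqlite3

conn = sqlite3.connect("backup/leaderboard.db")
rows = conn.execute(
    "SELECT task, agent, primary_metric, n_rows, submitted_at "
    "FROM submissions ORDER BY submitted_at DESC LIMIT 20"
).fetchall()
conn.close()

for task, agent, score, n_rows, ts in rows:
    print(f"{ts}  {task:<24} {agent:<28} {score:.3f}  ({n_rows} rows)")
```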
server/space/Dockerfile ADDED
@@ -0,0 +1,38 @@
+FROM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1
+
+WORKDIR /app
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install deps first so the layer caches across code-only changes.
+COPY server/requirements.txt /app/server/requirements.txt
+RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+
+# Install the graphtestbed package itself so server/api.py can
+# `from graphtestbed._manifest import ...`.
+COPY pyproject.toml /app/
+COPY graphtestbed /app/graphtestbed
+COPY datasets /app/datasets
+COPY server /app/server
+RUN pip install --no-deps -e /app
+
+# HF Spaces mounts /data on the Persistent Storage tier; on the free tier it's
+# just an in-container path that the dataset-repo backup loop preserves.
+ENV GT_DATA_ROOT=/data \
+    GT_DIR=/data/gt \
+    GT_DB=/data/leaderboard.db \
+    GT_ARCHIVE_DIR=/data/submissions \
+    GT_DATASET_REPO=lanczos/graphtestbed-gt \
+    GT_BACKUP_INTERVAL=60 \
+    GT_QUOTA=5 \
+    PORT=7860
+RUN mkdir -p /data && chmod 777 /data
+
+EXPOSE 7860
+CMD ["python", "/app/server/space/space_entry.py"]
server/space/README.md ADDED
@@ -0,0 +1,55 @@
+---
+title: GraphTestbed Scoring API
+emoji: 📊
+colorFrom: indigo
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
+---
+
+# GraphTestbed Scoring API
+
+Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+anywhere; the scored entry lands on a single shared leaderboard.
+
+## Endpoints
+
+| method | path | purpose |
+| --- | --- | --- |
+| POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+| GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary metric desc |
+| GET | `/healthz` | task list + which tasks have GT loaded + quota |
+
+Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
+
+## Trust model
+
+Non-adversarial benchmark. The API enforces:
+- 5 submissions / day / IP / task
+- Schema check before scoring (malformed CSVs don't burn quota)
+- Score bucketing (round to 3 dp)
+- Audit trail in sqlite + per-submission CSV archive
+
+Test labels live only in the companion private dataset repo
+(`lanczos/graphtestbed-gt`) and never enter the Space's git history.
+
+## Configuration (Space secrets / variables)
+
+| name | required | default | notes |
+| --- | --- | --- | --- |
+| `HF_TOKEN` | yes | — | write scope on `GT_DATASET_REPO` |
+| `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+| `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+| `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+
+## Persistence
+
+- On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+  archived `submissions/**/*.csv` from the dataset repo into `/data`.
+- Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+  uses `sqlite3.Connection.backup()` to copy the DB atomically and
+  `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+  pushed via `upload_folder` (content-hash diff — unchanged files skipped).
+- Worst-case loss on a Space crash: 60 s of submissions.
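For clients that can't shell out to `gtb`, a minimal sketch against the endpoints table above (field names `task`, `agent`, `file` and the response keys are taken from that table; the canonical client remains `gtb submit`):

```python
import requests

API = "https://lanczos-graphtestbed.hf.space"

# Multipart submit, as described in the endpoints table.
with open("preds.csv", "rb") as f:
    resp = requests.post(
        f"{API}/submit",
        data={"task": "figraph", "agent": "my-agent-v1"},
        files={"file": f},
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # primary metric, secondary metrics, rank, quota_remaining

# Best-per-agent leaderboard, sorted by primary metric descending.
print(requests.get(f"{API}/leaderboard/figraph", timeout=60).json())
```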
server/space/push_gt.py ADDED
@@ -0,0 +1,67 @@
+"""One-shot uploader for ground-truth CSVs to the companion HF dataset repo.
+
+Creates the dataset repo (private by default) if it doesn't exist, then
+uploads every <task>.csv from --gt-dir to gt/<task>.csv in the repo.
+
+Usage (run locally with a token that has write scope on the namespace):
+
+    HF_TOKEN=hf_xxx python server/space/push_gt.py \\
+        --repo lanczos/graphtestbed-gt \\
+        --gt-dir ~/graphtestbed-gt
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+from huggingface_hub import create_repo, upload_file
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(prog="push_gt")
+    ap.add_argument("--repo", default="lanczos/graphtestbed-gt",
+                    help="dataset repo id (default: lanczos/graphtestbed-gt)")
+    ap.add_argument("--gt-dir", type=Path, required=True,
+                    help="local dir containing <task>.csv files")
+    ap.add_argument("--public", action="store_true",
+                    help="create the repo as public (default: private)")
+    args = ap.parse_args()
+
+    token = os.environ.get("HF_TOKEN")
+    if not token:
+        sys.exit("HF_TOKEN not set in env")
+
+    if not args.gt_dir.exists():
+        sys.exit(f"--gt-dir not found: {args.gt_dir}")
+
+    csvs = sorted(args.gt_dir.glob("*.csv"))
+    if not csvs:
+        sys.exit(f"no *.csv files under {args.gt_dir}")
+
+    print(f"creating/confirming dataset repo {args.repo} (private={not args.public})")
+    create_repo(
+        repo_id=args.repo, repo_type="dataset",
+        private=not args.public, exist_ok=True, token=token,
+    )
+
+    for csv in csvs:
+        rel = f"gt/{csv.name}"
+        print(f"uploading {csv} → {args.repo}:{rel}")
+        upload_file(
+            path_or_fileobj=str(csv),
+            path_in_repo=rel,
+            repo_id=args.repo, repo_type="dataset",
+            token=token,
+            commit_message=f"upload {csv.name}",
+        )
+
+    print(f"\ndone — {len(csvs)} ground-truth file(s) at:")
+    print(f"  https://huggingface.co/datasets/{args.repo}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
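After push_gt.py finishes, a quick sanity check that every ground-truth file landed where the Space expects it (a sketch; `list_repo_files` is a stock `huggingface_hub` helper, and the token needs read scope on the repo):

```python
import os

from huggingface_hub import list_repo_files

files = list_repo_files(
    "lanczos/graphtestbed-gt",
    repo_type="dataset",
    token=os.environ["HF_TOKEN"],
)
# Expect one gt/<task>.csv per file uploaded by push_gt.py.
print(sorted(f for f in files if f.startswith("gt/")))
```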
server/space/push_to_space.sh ADDED
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# Push the current commit to the HF Space remote, with server/space/README.md
+# overlaid at repo root (HF reads the Space's metadata frontmatter from the
+# root README; the GitHub root README stays untouched).
+#
+# Prereq once:
+#   git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+#
+# When git prompts for credentials on push:
+#   user     = lanczos
+#   password = $HF_TOKEN
+set -euo pipefail
+
+BRANCH=$(git rev-parse --abbrev-ref HEAD)
+TEMP="space-deploy-$(date +%s)"
+
+trap 'git checkout "$BRANCH" >/dev/null 2>&1 || true; \
+      git branch -D "$TEMP" >/dev/null 2>&1 || true' EXIT
+
+git checkout -b "$TEMP"
+cp server/space/README.md README.md
+git add README.md
+git commit --no-verify -m "deploy: overlay server/space/README.md as Space root"
+git push -f space "$TEMP:main"
+echo
+echo "pushed to space/main"
+echo "URL: https://lanczos-graphtestbed.hf.space/"
server/space/space_entry.py ADDED
@@ -0,0 +1,173 @@
+"""Entry point for the GraphTestbed scoring server on HF Spaces.
+
+On boot:
+1. snapshot_download the companion dataset repo (lanczos/graphtestbed-gt by
+   default) into /data: gt/*.csv, leaderboard.db, submissions/**/*.csv.
+2. Spawn a daemon thread that every BACKUP_INTERVAL seconds:
+   a. SELECT COUNT(*) FROM submissions; bail if unchanged.
+   b. sqlite3.Connection.backup() into a temp file (atomic, lock-safe).
+   c. upload_file the temp file → leaderboard.db in the dataset repo.
+   d. upload_folder /data/submissions/ → submissions/ in the dataset repo
+      (huggingface_hub diffs by content-hash; unchanged files don't transfer).
+3. Hand off to server/api.py via Flask app.run(threaded=True).
+
+Env vars (all have sensible defaults baked into the Dockerfile):
+    HF_TOKEN            required   write scope on GT_DATASET_REPO
+    GT_DATASET_REPO     optional   default: lanczos/graphtestbed-gt
+    GT_DATA_ROOT        optional   default: /data
+    GT_BACKUP_INTERVAL  optional   default: 60 (seconds)
+    PORT                optional   default: 7860
+"""
+
+from __future__ import annotations
+
+import os
+import sqlite3
+import sys
+import threading
+import time
+from pathlib import Path
+
+from huggingface_hub import snapshot_download, upload_file, upload_folder
+
+HF_TOKEN = os.environ.get("HF_TOKEN")
+HF_REPO = os.environ.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt")
+DATA_DIR = Path(os.environ.get("GT_DATA_ROOT", "/data"))
+GT_DIR = DATA_DIR / "gt"
+DB_PATH = DATA_DIR / "leaderboard.db"
+ARCHIVE_DIR = DATA_DIR / "submissions"
+BACKUP_INTERVAL = int(os.environ.get("GT_BACKUP_INTERVAL", "60"))
+PORT = int(os.environ.get("PORT", "7860"))
+
+
+def _require_token() -> str:
+    if not HF_TOKEN:
+        raise SystemExit(
+            "HF_TOKEN is unset. Set it as a Space secret with write scope on "
+            f"{HF_REPO}."
+        )
+    return HF_TOKEN
+
+
+def bootstrap() -> None:
+    """Pull GT files, leaderboard, and submission archive from the dataset repo."""
+    token = _require_token()
+    for d in (DATA_DIR, GT_DIR, ARCHIVE_DIR):
+        d.mkdir(parents=True, exist_ok=True)
+
+    print(f"snapshot_download {HF_REPO} → {DATA_DIR}", flush=True)
+    try:
+        snapshot_download(
+            HF_REPO,
+            repo_type="dataset",
+            local_dir=str(DATA_DIR),
+            allow_patterns=["gt/*.csv", "leaderboard.db", "submissions/**/*.csv"],
+            token=token,
+        )
+    except Exception as e:
+        # First-deploy or empty repo: keep going with empty /data.
+        print(f"snapshot_download warning ({type(e).__name__}): {e}", flush=True)
+
+    n_gt = len(list(GT_DIR.glob("*.csv")))
+    print(f"GT files present: {n_gt}", flush=True)
+    if DB_PATH.exists():
+        try:
+            n = int(sqlite3.connect(DB_PATH).execute(
+                "SELECT COUNT(*) FROM submissions"
+            ).fetchone()[0])
+            print(f"restored leaderboard.db ({n} submissions)", flush=True)
+        except sqlite3.OperationalError:
+            print("leaderboard.db present but no submissions table yet", flush=True)
+    else:
+        print("no prior leaderboard.db; starting fresh", flush=True)
+
+
+def _submission_count() -> int:
+    if not DB_PATH.exists():
+        return 0
+    try:
+        conn = sqlite3.connect(DB_PATH)
+        try:
+            row = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()
+            return int(row[0]) if row else 0
+        finally:
+            conn.close()
+    except sqlite3.OperationalError:
+        return 0
+
+
+def _atomic_db_copy(dst: Path) -> None:
+    """sqlite3.backup() is lock-safe — readers/writers stay consistent."""
+    src = sqlite3.connect(DB_PATH)
+    try:
+        target = sqlite3.connect(dst)
+        try:
+            src.backup(target)
+        finally:
+            target.close()
+    finally:
+        src.close()
+
+
+def backup_loop() -> None:
+    token = _require_token()
+    last_count = -1
+    print(f"backup_loop started (interval={BACKUP_INTERVAL}s)", flush=True)
+    while True:
+        time.sleep(BACKUP_INTERVAL)
+        n = _submission_count()
+        if n == last_count:
+            continue
+
+        try:
+            tmp = DATA_DIR / "_leaderboard.db.tmp"
+            _atomic_db_copy(tmp)
+            upload_file(
+                path_or_fileobj=str(tmp),
+                path_in_repo="leaderboard.db",
+                repo_id=HF_REPO, repo_type="dataset",
+                token=token,
+                commit_message=f"backup leaderboard ({n} submissions)",
+            )
+            tmp.unlink()
+        except Exception as e:
+            print(f"leaderboard backup failed: {type(e).__name__}: {e}", flush=True)
+            continue
+
+        if ARCHIVE_DIR.exists() and any(ARCHIVE_DIR.rglob("*.csv")):
+            try:
+                upload_folder(
+                    folder_path=str(ARCHIVE_DIR),
+                    path_in_repo="submissions",
+                    repo_id=HF_REPO, repo_type="dataset",
+                    token=token,
+                    commit_message=f"archive submissions ({n} total)",
+                    allow_patterns=["**/*.csv"],
+                )
+            except Exception as e:
+                print(f"submission archive failed: {type(e).__name__}: {e}", flush=True)
+
+        last_count = n
+        print(f"backup pushed: {n} submissions", flush=True)
+
+
+def main() -> int:
+    bootstrap()
+
+    # Make sure server/api.py reads paths consistent with what we just bootstrapped.
+    os.environ.setdefault("GT_DIR", str(GT_DIR))
+    os.environ.setdefault("GT_DB", str(DB_PATH))
+    os.environ.setdefault("GT_ARCHIVE_DIR", str(ARCHIVE_DIR))
+
+    threading.Thread(target=backup_loop, daemon=True).start()
+
+    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+    from api import app  # noqa: E402 — env vars must be set first
+
+    print(f"serving on 0.0.0.0:{PORT}", flush=True)
+    app.run(host="0.0.0.0", port=PORT, threaded=True, use_reloader=False)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
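For a local smoke run before pushing to the Space, something like the following fits the env contract in the docstring above (the token value and data root here are placeholders, not real values):

```python
import os
import subprocess

env = dict(
    os.environ,
    HF_TOKEN="hf_xxx",             # placeholder; needs write scope on the GT repo
    GT_DATA_ROOT="/tmp/gtb-data",  # any writable dir; /data is the Space default
    PORT="8080",
)
# Boots, snapshots the dataset repo, starts the backup thread, serves Flask.
subprocess.run(["python", "server/space/space_entry.py"], env=env, check=True)
```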