Zhu Jiajun (jz28583) and Claude Opus 4.7 (1M context) committed on
Commit d094faf · 1 Parent(s): ad6901d

Add agents/ harness integrations and HF Space scoring deployment

- agents/cliproxyapi: reusable shim that points any agent's SDK at one
CLIProxyAPI proxy via anthropic_env / openai_env / gemini_env helpers.
- agents/{ai_build_ai,mlevolve}: runners that stage GraphTestbed task
data, route LLM calls through the proxy, and harvest submission CSVs.
Tested end-to-end on figraph; both scored on the leaderboard
(aibuildai-claude-sonnet-4-6 0.819, mlevolve-gpt-5.3-codex-spark 0.790).
- agents/common: shared workspace + task-instruction + finalize helpers.

- server/space/: Docker SDK Space deployment. The boot orchestrator in
space_entry.py snapshot-downloads the GT files + leaderboard.db from the
companion private dataset (lanczos/graphtestbed-gt) on startup, then
runs a daemon thread that backs up the sqlite DB + new submission CSVs
every 60s via huggingface_hub.upload_file/upload_folder.
- server/api.py: optional GT_ARCHIVE_DIR env writes raw submission CSVs
to disk so the backup loop can ship them to the dataset repo.
- graphtestbed/{submit,leaderboard}.py: default GRAPHTESTBED_API flipped
to the hosted Space URL (env var still overrides for self-hosters).
- pyproject.toml: dependencies were misplaced under [project.urls];
moved to [project] so pip install -e . actually resolves deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
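
The `server/space/` boot-and-backup flow described above can be pictured roughly as follows. This is a hedged sketch, not the committed `space_entry.py` (which this diff view does not display): the dataset repo id comes from the message, while the paths and names such as `backup_loop` are illustrative assumptions.

```python
# Illustrative sketch of the Space boot + backup loop the commit message
# describes. The real server/space/space_entry.py is not shown in this diff,
# so the paths and function names here are assumptions.
import threading
import time
from pathlib import Path

from huggingface_hub import snapshot_download, upload_file, upload_folder

GT_REPO = "lanczos/graphtestbed-gt"   # companion private dataset (from the message)
STATE = Path("/var/graphtestbed")     # assumed container state dir

def boot() -> None:
    # Pull GT files + the previous leaderboard.db into the container on startup.
    snapshot_download(repo_id=GT_REPO, repo_type="dataset", local_dir=STATE)

def backup_loop(every_s: int = 60) -> None:
    while True:
        time.sleep(every_s)
        # Back up the sqlite leaderboard...
        upload_file(
            path_or_fileobj=STATE / "leaderboard.db",
            path_in_repo="leaderboard.db",
            repo_id=GT_REPO,
            repo_type="dataset",
        )
        # ...and any raw submission CSVs archived by server/api.py
        # (see the GT_ARCHIVE_DIR change later in this commit).
        upload_folder(
            folder_path=STATE / "archive",
            path_in_repo="archive",
            repo_id=GT_REPO,
            repo_type="dataset",
        )

boot()
threading.Thread(target=backup_loop, daemon=True).start()
```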

.gitignore CHANGED
@@ -25,3 +25,8 @@ ground_truth*.csv
  *test_labels*.csv
  private/
  **/private/
+
+ # Agent harness scratch space
+ runs/
+ agents/**/runs/
+ agents/**/_vendor/
README.md CHANGED
@@ -10,31 +10,41 @@ Build an agent. Submit predictions. Get a score. Test labels live on a server, n
 
  ## Status
 
- **Pre-launch.** The code runs end-to-end against a local server, but:
+ **Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:
 
  - The package isn't on PyPI yet → install from git (see below)
- - The hosted scoring API isn't deployed yet → run the server on your own machine
  - HuggingFace dataset repos aren't published yet → use your own `train/val/test_features.csv` files for now
 
- The two paths below — **local dev** (works today) and **hosted submit** (coming) — share the exact same client and server code.
+ The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
+ default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
+ server if you'd rather self-host (instructions below).
 
- ## Run it locally (works today)
+ ## Submit to the hosted leaderboard
 
  ```bash
- # 1. Install
  pip install git+https://github.com/zhuconv/GraphTestbed
+ gtb submit figraph --file preds.csv --agent my-agent-v1
+ # ✓ Scored  primary (auc_roc): 0.689  rank: #3
+ gtb leaderboard figraph
+ ```
 
- # 2. Start the scoring API (terminal A)
+ The hosted server is a Docker-SDK HF Space that holds GT files in a private
+ companion dataset and never logs prediction CSVs (it does archive them in
+ the same private repo for reproducibility — see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
+ Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
+ 3 decimals — same as if you ran the server yourself.
+
+ ## Run the server locally (alternative)
+
+ ```bash
  git clone https://github.com/zhuconv/GraphTestbed
  cd GraphTestbed
  GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
  # → Running on http://localhost:8080
 
- # 3. Submit (terminal B)
+ # point the client at it
  export GRAPHTESTBED_API=http://localhost:8080
  gtb submit figraph --file preds.csv --agent my-agent-v1
- # ✓ Scored  primary (auc_roc): 0.689  rank: #3
- gtb leaderboard figraph
  ```
 
  You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.
@@ -175,6 +185,33 @@ You don't modify GraphTestbed. You:
  That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
  </details>
 
+ ## Reference agent integrations (`agents/`)
+
+ Two third-party harnesses ship pre-wired to the testbed; both route LLM
+ traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):
+
+ | package | upstream | default model |
+ | --- | --- | --- |
+ | [`agents.ai_build_ai`](agents/ai_build_ai/README.md) | [aibuildai/AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) | `claude-sonnet-4-6` |
+ | [`agents.mlevolve`](agents/mlevolve/README.md) | [InternScience/MLEvolve](https://github.com/InternScience/MLEvolve) | `gpt-5.3-codex-spark` |
+
+ The proxy integration itself is generic — see
+ [`agents/cliproxyapi/README.md`](agents/cliproxyapi/README.md) for the
+ shim helpers (`anthropic_env` / `openai_env` / `gemini_env` /
+ `openai_yaml_block`) that any future agent can reuse.
+
+ ```bash
+ # One-time
+ export CLIPROXYAPI_KEY=<from the api-keys list in your ~/.cli-proxy-api/config.yaml>
+ bash agents/ai_build_ai/install.sh   # or agents/mlevolve/install.sh
+
+ # Per task
+ gtb fetch figraph
+ python -m agents.ai_build_ai.runner --task figraph
+ # → prints path to runs/ai_build_ai/figraph/<ts>/submission.csv
+ gtb submit figraph --file <printed-path> --agent aibuildai-sonnet-4-6
+ ```
+
  ## License
 
  [MIT](LICENSE). Data: subject to upstream licenses (Kaggle competition rules, FiGraph CC BY-NC 4.0, etc.).
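
The guardrails named in the README's trust-model sentence (5 submissions/day/IP/task, scores bucketed to 3 decimals) can be pictured with a minimal sketch. The real checks live in `server/api.py`, which this hunk does not show; the names below are assumptions, not the committed code.

```python
# Hedged sketch of the README's stated guardrails; not the actual server code.
from collections import defaultdict
from datetime import date

# (ip, task, day) → submission count so far
_counts: dict[tuple[str, str, str], int] = defaultdict(int)

def allow(ip: str, task: str, limit: int = 5) -> bool:
    # "5 submissions/day/IP/task"
    key = (ip, task, date.today().isoformat())
    _counts[key] += 1
    return _counts[key] <= limit

def bucket(score: float) -> float:
    # "Score bucketed to 3 decimals" — caps how much ground-truth signal
    # a single submission can leak back to the client.
    return round(score, 3)
```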
agents/README.md ADDED
@@ -0,0 +1,54 @@
+ # `agents/` — third-party harness integrations
+
+ Wraps external agent harnesses so they can be pointed at a GraphTestbed task
+ and produce a `submission.csv` the scoring API understands. LLM traffic is
+ routed through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI)
+ instance via the [`agents.cliproxyapi`](cliproxyapi/README.md) shim.
+
+ ## Layout
+
+ ```
+ agents/
+ ├── cliproxyapi/   # generic Anthropic/OpenAI/Gemini → proxy shim (reusable)
+ ├── common/        # workspace + task-instruction + submit helpers
+ ├── ai_build_ai/   # AI-Build-AI integration (default: claude-sonnet-4-6)
+ └── mlevolve/      # MLEvolve integration (default: gpt-5.3-codex-spark)
+ ```
+
+ `agents/<agent>/_vendor/` (gitignored) holds the upstream binary or git
+ clone for that agent.
+
+ ## End-to-end (figraph example)
+
+ ```bash
+ # 0. One-time setup of the proxy (see agents/cliproxyapi/README.md)
+ export CLIPROXYAPI_KEY=<from your config.yaml>
+
+ # 1. Fetch the task data once
+ gtb fetch figraph
+
+ # 2. Install whichever agent you want
+ bash agents/ai_build_ai/install.sh   # downloads upstream tarball
+ # or
+ bash agents/mlevolve/install.sh      # git clone + pip install
+
+ # 3. Run; the runner prints the produced submission.csv path
+ python -m agents.ai_build_ai.runner --task figraph
+ python -m agents.mlevolve.runner --task figraph
+
+ # 4. Submit when ready (default is print-and-stop)
+ gtb submit figraph --file <printed-path> --agent <my-agent-id>
+ # or pass --submit <name> to the runner to combine 3+4
+ ```
+
+ ## Adding another agent
+
+ 1. Create `agents/<new_agent>/{__init__.py,runner.py,install.sh,README.md}`.
+ 2. In `runner.py` import from `agents.cliproxyapi` (one of `anthropic_env`,
+    `openai_env`, `gemini_env`, or `openai_yaml_block` per the agent's SDK).
+ 3. Use `agents.common.workspace.make_workspace()` for the run dir,
+    `agents.common.tasks.task_instruction()` for the task prompt,
+    `agents.common.submit.finalize()` for validate+optional-submit.
+
+ No changes to `agents/cliproxyapi/` or `agents/common/` are required for new
+ agents that fit one of the three supported SDK shapes.
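
Putting the three "Adding another agent" steps together, a minimal runner could look like the sketch below. The `my_agent` package and `my-agent-binary` are hypothetical placeholders; the helper signatures match the ones this commit adds in `agents/cliproxyapi` and `agents/common`.

```python
# agents/my_agent/runner.py — hypothetical skeleton for a new agent.
# `my-agent-binary` and the module name are placeholders; the helpers are
# the ones added elsewhere in this commit.
import argparse
import os
import subprocess

from agents.cliproxyapi import ProxyEndpoint, openai_env, wait_until_ready
from agents.common.submit import finalize
from agents.common.tasks import task_instruction
from agents.common.workspace import make_workspace

def main() -> None:
    ap = argparse.ArgumentParser(prog="agents.my_agent.runner")
    ap.add_argument("--task", required=True)
    ap.add_argument("--model", default="gpt-5.3-codex-spark")
    ap.add_argument("--submit", default=None, metavar="AGENT_ID")
    args = ap.parse_args()

    ep = ProxyEndpoint.from_env()   # fails fast if CLIPROXYAPI_KEY is unset
    wait_until_ready(ep)

    ws = make_workspace("my_agent", args.task)   # runs/my_agent/<task>/<ts>/
    (ws / "instruction.md").write_text(task_instruction(args.task))

    subprocess.run(
        ["my-agent-binary", "--workdir", str(ws)],   # placeholder binary
        env={**os.environ, **openai_env(ep, model=args.model)},
        check=True,
    )
    # Assuming the binary wrote ws/submission.csv: validate + optionally POST.
    finalize(args.task, ws / "submission.csv", args.submit)

if __name__ == "__main__":
    main()
```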
agents/__init__.py ADDED
@@ -0,0 +1,9 @@
+ """Agent harness integrations for GraphTestbed.
+
+ Each subpackage wraps a third-party agent (AI-Build-AI, MLEvolve, ...) so it
+ can be pointed at a GraphTestbed task and produce a submission.csv that the
+ testbed scoring API understands.
+
+ LLM traffic for every agent flows through a single CLIProxyAPI instance — see
+ `agents.cliproxyapi` for the reusable shim.
+ """
agents/ai_build_ai/README.md ADDED
@@ -0,0 +1,58 @@
+ # `agents.ai_build_ai`
+
+ Runs [AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) on a GraphTestbed
+ task. AI-Build-AI is an Anthropic-SDK-based auto-ML harness that designs,
+ trains, and ranks candidate models from a task description.
+
+ Default model: **`claude-sonnet-4-6`** (override with `--model`).
+
+ ## Install
+
+ ```bash
+ bash agents/ai_build_ai/install.sh   # downloads upstream tarball into _vendor/
+ # Linux x86_64 only — upstream constraint.
+ ```
+
+ The vendored binary lands at `agents/ai_build_ai/_vendor/aibuildai`. Set
+ `AIBUILDAI_BIN` if you put it elsewhere.
+
+ ## Run
+
+ ```bash
+ # Proxy must be running and CLIPROXYAPI_KEY set — see agents/cliproxyapi/README.md
+ gtb fetch figraph        # one-time per task
+ python -m agents.ai_build_ai.runner --task figraph
+ ```
+
+ Output:
+
+ ```
+ runs/ai_build_ai/figraph/<timestamp>/
+ ├── data/            # symlinks to fetched dataset CSVs
+ ├── playground/      # AI-Build-AI's working dir (candidate_*/, …)
+ ├── instruction.md   # generated task prompt
+ ├── agent.log        # full stdout+stderr from the binary
+ └── submission.csv   # normalized to match the testbed schema
+ ```
+
+ The runner prints `submission.csv`'s path; submit when ready:
+
+ ```bash
+ gtb submit figraph --file runs/ai_build_ai/figraph/<ts>/submission.csv \
+     --agent aibuildai-sonnet-4-6
+ # or, in one step:
+ python -m agents.ai_build_ai.runner --task figraph --submit aibuildai-sonnet-4-6
+ ```
+
+ ## Knobs
+
+ | flag | default | upstream meaning |
+ | --- | --- | --- |
+ | `--model` | `claude-sonnet-4-6` | model alias, sent to the proxy |
+ | `--budget-min` | 60 | per-run training budget |
+ | `--pipeline-budget-min` | 90 | total pipeline budget |
+ | `--max-agent-calls` | 8 | LLM call cap per candidate |
+ | `--num-candidates` | 3 | how many model variants to generate |
+
+ The `--model` string must exist in your CLIProxyAPI `oauth-model-alias.claude`
+ mapping (or be a real model your Claude account exposes).
agents/ai_build_ai/__init__.py ADDED
@@ -0,0 +1,6 @@
+ """AI-Build-AI integration (github.com/aibuildai/AI-Build-AI).
+
+ Wraps the `aibuildai` release binary so it can run against any GraphTestbed
+ task. LLM traffic is forced through CLIProxyAPI by setting ANTHROPIC_BASE_URL
+ and ANTHROPIC_API_KEY before launching the binary.
+ """
agents/ai_build_ai/examples/run_figraph.sh ADDED
@@ -0,0 +1,20 @@
+ #!/usr/bin/env bash
+ # End-to-end smoke test of AI-Build-AI on the `figraph` task.
+ # Assumes:
+ #   - CLIProxyAPI is running and CLIPROXYAPI_KEY is set (see agents/cliproxyapi/README.md)
+ #   - `gtb fetch figraph` has been run, OR a local copy of figraph CSVs sits
+ #     at $GRAPHTESTBED_CACHE/figraph/
+ #   - `bash agents/ai_build_ai/install.sh` has put the binary in _vendor/
+ set -euo pipefail
+
+ REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
+ cd "${REPO_ROOT}"
+
+ : "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"
+
+ python3 -m agents.ai_build_ai.runner \
+   --task figraph \
+   --model "${MODEL:-claude-sonnet-4-6}" \
+   --budget-min "${BUDGET_MIN:-30}" \
+   --num-candidates "${NUM_CANDIDATES:-2}" \
+   "${@}"
agents/ai_build_ai/install.sh ADDED
@@ -0,0 +1,31 @@
+ #!/usr/bin/env bash
+ # Install the AI-Build-AI release tarball into agents/ai_build_ai/_vendor/.
+ # Re-run any time to upgrade. Linux x86_64 only (upstream constraint).
+ #
+ # Override the release with: AIBUILDAI_VERSION=v0.1.1 bash install.sh
+ set -euo pipefail
+
+ VERSION="${AIBUILDAI_VERSION:-v0.1.1}"
+ HERE="$(cd "$(dirname "$0")" && pwd)"
+ DEST="${HERE}/_vendor"
+ TARBALL="aibuildai-linux-x86_64-${VERSION}.tar.gz"
+ URL="https://github.com/aibuildai/AI-Build-AI/releases/download/${VERSION}/${TARBALL}"
+
+ mkdir -p "${DEST}"
+ cd "${DEST}"
+
+ echo "Downloading ${URL}"
+ curl -fL --retry 3 -o "${TARBALL}" "${URL}"
+ echo "Unpacking ${TARBALL}"
+ tar -xzf "${TARBALL}"
+ rm -f "${TARBALL}"
+
+ # Upstream tarball ships an install.sh that finalizes setup (PATH hints etc.)
+ if [[ -x ./install.sh ]]; then
+     echo "Running upstream install.sh"
+     ./install.sh
+ fi
+
+ echo
+ echo "Installed AI-Build-AI ${VERSION} under ${DEST}"
+ echo "Set AIBUILDAI_BIN to the binary path if it isn't on \$PATH after this."
agents/ai_build_ai/runner.py ADDED
@@ -0,0 +1,161 @@
+ """Run AI-Build-AI on a GraphTestbed task, routed through CLIProxyAPI.
+
+ Usage:
+     python -m agents.ai_build_ai.runner --task figraph
+     python -m agents.ai_build_ai.runner --task figraph \\
+         --model claude-sonnet-4-6 --budget-min 30
+     python -m agents.ai_build_ai.runner --task figraph \\
+         --submit aibuildai-sonnet-4-6
+
+ Exit codes mirror the wrapped binary.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import shutil
+ import subprocess
+ import sys
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.cliproxyapi import ProxyEndpoint, anthropic_env, wait_until_ready
+ from agents.common.submit import finalize
+ from agents.common.tasks import task_instruction
+ from agents.common.workspace import make_workspace, stage_dataset
+ from graphtestbed._manifest import task_config
+ from graphtestbed.fetch import cache_dir
+
+ DEFAULT_MODEL = "claude-sonnet-4-6"
+
+
+ def _resolve_binary() -> str:
+     explicit = os.environ.get("AIBUILDAI_BIN")
+     if explicit:
+         return explicit
+     on_path = shutil.which("aibuildai")
+     if on_path:
+         return on_path
+     vendored = Path(__file__).parent / "_vendor" / "aibuildai"
+     if vendored.exists():
+         return str(vendored)
+     raise SystemExit(
+         "Cannot locate the `aibuildai` binary.\n"
+         "  Install it: bash agents/ai_build_ai/install.sh\n"
+         "  Or set AIBUILDAI_BIN to the full path."
+     )
+
+
+ def _stage_input(task: str, dst: Path) -> None:
+     src = cache_dir() / task
+     if not src.exists():
+         raise SystemExit(
+             f"No cached dataset at {src}. Run `gtb fetch {task}` first.\n"
+             f"(For pre-launch tasks, drop your local CSVs into {src}/.)"
+         )
+     cfg = task_config(task)
+     files = [spec["filename"] for spec in cfg["files"].values()]
+     stage_dataset(src, dst, files)
+
+
+ def _harvest_submission(task: str, playground: Path, dst: Path) -> Path:
+     """Pick the latest submission.csv produced under playground/, normalize cols."""
+     schema = task_config(task)["submission_schema"]
+     candidates = sorted(
+         playground.rglob("submission.csv"),
+         key=lambda p: p.stat().st_mtime,
+     )
+     if not candidates:
+         raise SystemExit(
+             f"No submission.csv found under {playground}.\n"
+             f"  Inspect the agent's logs to see what happened: "
+             f"{playground.parent / 'agent.log'}"
+         )
+     chosen = candidates[-1]
+     df = pd.read_csv(chosen)
+     expected = [schema["id_col"], schema["pred_col"]]
+     if list(df.columns) != expected:
+         if len(df.columns) == 2:
+             print(f"  (renaming columns {list(df.columns)} → {expected})")
+             df.columns = expected
+         else:
+             raise SystemExit(
+                 f"Cannot normalize {chosen}: got columns {list(df.columns)}, "
+                 f"expected {expected}"
+             )
+     out = dst / "submission.csv"
+     df.to_csv(out, index=False)
+     print(f"✓ Picked {chosen.relative_to(playground.parent)}")
+     return out
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(prog="agents.ai_build_ai.runner")
+     ap.add_argument("--task", required=True,
+                     help="A task name from datasets/manifest.yaml")
+     ap.add_argument("--model", default=DEFAULT_MODEL,
+                     help=f"Model alias passed to aibuildai (default: {DEFAULT_MODEL})")
+     ap.add_argument("--budget-min", type=int, default=60,
+                     help="--run-budget-minutes for aibuildai (default: 60)")
+     ap.add_argument("--pipeline-budget-min", type=int, default=90,
+                     help="--pipeline-budget-minutes (default: 90)")
+     ap.add_argument("--max-agent-calls", type=int, default=8)
+     ap.add_argument("--num-candidates", type=int, default=3)
+     ap.add_argument("--submit", default=None, metavar="AGENT_ID",
+                     help="If set, POST the produced submission.csv to the "
+                          "GraphTestbed scoring API as this agent name.")
+     ap.add_argument("--workspace-root", type=Path, default=None,
+                     help="Override the runs/ root (default: ./runs)")
+     args = ap.parse_args()
+
+     binary = _resolve_binary()
+     ep = ProxyEndpoint.from_env()
+     wait_until_ready(ep)
+     print(f"✓ Proxy ready at {ep.base_url()}")
+
+     ws = make_workspace("ai_build_ai", args.task, args.workspace_root)
+     data = ws / "data"
+     play = ws / "playground"
+     play.mkdir()
+     _stage_input(args.task, data)
+
+     instruction = task_instruction(args.task)
+     (ws / "instruction.md").write_text(instruction)
+
+     cmd = [
+         binary,
+         "--task-name", args.task,
+         "--data-dir", str(data),
+         "--playground-dir", str(play),
+         "--model", args.model,
+         "--instruction", instruction,
+         "--max-agent-calls", str(args.max_agent_calls),
+         "--run-budget-minutes", str(args.budget_min),
+         "--pipeline-budget-minutes", str(args.pipeline_budget_min),
+         "--num-candidates", str(args.num_candidates),
+         "--no-form",
+     ]
+     env = {**os.environ, **anthropic_env(ep, model=args.model)}
+     # aibuildai ships a bundled `claude` binary that aborts if it detects an
+     # outer Claude Code session via these env vars. Strip them so the inner
+     # claude treats this as a fresh top-level invocation.
+     for k in ("CLAUDECODE", "CLAUDE_CODE_ENTRYPOINT", "CLAUDE_CODE_SSE_PORT"):
+         env.pop(k, None)
+
+     print(f"→ Launching {Path(binary).name} task={args.task} model={args.model}")
+     print(f"  workspace: {ws}")
+     log = ws / "agent.log"
+     with log.open("wb") as lf:
+         rc = subprocess.call(cmd, env=env, stdout=lf, stderr=subprocess.STDOUT)
+     print(f"  exit={rc} log={log}")
+     if rc != 0:
+         sys.exit(rc)
+
+     sub = _harvest_submission(args.task, play, ws)
+     finalize(args.task, sub, args.submit)
+
+
+ if __name__ == "__main__":
+     main()
agents/cliproxyapi/README.md ADDED
@@ -0,0 +1,108 @@
+ # `agents.cliproxyapi`
+
+ Reusable shim that points any agent's LLM SDK at a single local
+ [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI) instance.
+
+ ## Why a shim
+
+ Every agent we test uses a different SDK (Anthropic, OpenAI/Codex, Gemini)
+ and a different way of being told "talk to this base URL with this key".
+ This package collapses that into a handful of function calls.
+
+ ## Public surface
+
+ ```python
+ from agents.cliproxyapi import (
+     ProxyEndpoint,       # where + key (read from env)
+     anthropic_env,       # → dict, splice into subprocess env
+     openai_env,
+     gemini_env,
+     openai_yaml_block,   # → dict, drop into a YAML config
+     wait_until_ready,    # TCP probe; raise SystemExit on miss
+     spawn_proxy,         # ctx-manager (opt-in; mostly for CI)
+ )
+ ```
+
+ `ProxyEndpoint.from_env()` reads:
+
+ | env var | default |
+ | --- | --- |
+ | `CLIPROXYAPI_HOST` | `127.0.0.1` |
+ | `CLIPROXYAPI_PORT` | `8317` |
+ | `CLIPROXYAPI_KEY` | *required* |
+
+ ## Recipe per SDK shape
+
+ ### Anthropic SDK / Claude Code (`claude`, `aibuildai`, ...)
+ ```python
+ ep = ProxyEndpoint.from_env()
+ env = {**os.environ, **anthropic_env(ep, model="claude-sonnet-4-6")}
+ subprocess.run([...], env=env)
+ ```
+ Sets `ANTHROPIC_BASE_URL`, `ANTHROPIC_API_KEY`, `ANTHROPIC_AUTH_TOKEN`,
+ `ANTHROPIC_MODEL`.
+
+ ### OpenAI / Codex CLI / any OpenAI-compatible SDK
+ ```python
+ env = {**os.environ, **openai_env(ep, model="gpt-5.3-codex-spark")}
+ ```
+ Sets `OPENAI_BASE_URL=…/v1`, `OPENAI_API_KEY`, `OPENAI_API_BASE`,
+ `OPENAI_MODEL`.
+
+ ### Gemini SDK
+ ```python
+ env = {**os.environ, **gemini_env(ep, model="gemini-2-pro-preview")}
+ ```
+
+ ### YAML configs (e.g. MLEvolve)
+ ```python
+ block = openai_yaml_block(ep, model="gpt-5.3-codex-spark")
+ # → {"model": ..., "base_url": "http://127.0.0.1:8317/v1", "api_key": ...}
+ config["agent"]["code"].update(block)
+ config["agent"]["feedback"].update(block)
+ ```
+
+ ## Setting up the proxy itself
+
+ 1. Install:
+    ```bash
+    git clone https://github.com/router-for-me/CLIProxyAPI && cd CLIProxyAPI
+    docker compose up -d        # or: go build -o cliproxy ./cmd/...
+    ```
+ 2. Drop in a config (start from
+    [`config.example.yaml`](config.example.yaml) here):
+    ```bash
+    mkdir -p ~/.cli-proxy-api
+    cp agents/cliproxyapi/config.example.yaml ~/.cli-proxy-api/config.yaml
+    $EDITOR ~/.cli-proxy-api/config.yaml    # set api-keys[0] + aliases
+    ```
+ 3. Run interactively once to OAuth-log into Claude / Codex / Gemini accounts.
+ 4. Export client-side env vars:
+    ```bash
+    export CLIPROXYAPI_KEY=<the api-keys[0] you set>
+    # CLIPROXYAPI_HOST/PORT only needed if you bind elsewhere
+    ```
+ 5. Smoke-test:
+    ```bash
+    curl -s -H "Authorization: Bearer $CLIPROXYAPI_KEY" \
+         http://127.0.0.1:8317/v1/models | head
+    ```
+
+ Once the proxy is up and `CLIPROXYAPI_KEY` is set, every agent runner in
+ `agents/*/runner.py` works without further configuration.
+
+ ## Adding a new agent that uses the proxy
+
+ ```python
+ # agents/my_agent/runner.py
+ from agents.cliproxyapi import ProxyEndpoint, openai_env, wait_until_ready
+
+ ep = ProxyEndpoint.from_env()
+ wait_until_ready(ep)
+ subprocess.run(
+     ["my-agent-binary", "--task", task, "--model", model],
+     env={**os.environ, **openai_env(ep, model=model)},
+ )
+ ```
+
+ That's the entire integration.
agents/cliproxyapi/__init__.py ADDED
@@ -0,0 +1,33 @@
+ """Generic CLIProxyAPI integration shared by every agent runner.
+
+ CLIProxyAPI (github.com/router-for-me/CLIProxyAPI) is a single local proxy
+ that bridges Anthropic, OpenAI/Codex, and Gemini protocol surfaces on one
+ port. Pointing every agent at it lets us share OAuth state, credentials, and
+ rate-limit budget across many harnesses.
+
+ Public surface — three things:
+
+     ProxyEndpoint → where the proxy is + what API key to send
+     {anthropic,openai,gemini}_env(ep, model=...) → env-var dicts to splice
+         into subprocess.Popen
+     openai_yaml_block(ep, model) → snippet for agents whose configs take
+         base_url/api_key/model directly
+
+ Plus `wait_until_ready(ep)` for runners that should fail fast if the proxy
+ isn't up, and an opt-in `spawn_proxy()` ctx-manager for one-off testing.
+ """
+
+ from .endpoint import ProxyEndpoint
+ from .env import anthropic_env, gemini_env, openai_env, openai_yaml_block
+ from .health import is_ready, spawn_proxy, wait_until_ready
+
+ __all__ = [
+     "ProxyEndpoint",
+     "anthropic_env",
+     "gemini_env",
+     "openai_env",
+     "openai_yaml_block",
+     "is_ready",
+     "spawn_proxy",
+     "wait_until_ready",
+ ]
agents/cliproxyapi/config.example.yaml ADDED
@@ -0,0 +1,43 @@
+ # Minimal CLIProxyAPI config for GraphTestbed agent runs.
+ #
+ # Place at ~/.cli-proxy-api/config.yaml (or pass --config /path/to/this file
+ # when launching the proxy). Full schema:
+ #   https://github.com/router-for-me/CLIProxyAPI
+ #
+ # Quickstart:
+ #   1. Replace the api-keys[0] placeholder with `openssl rand -hex 16`.
+ #   2. Export the same value as CLIPROXYAPI_KEY in the shell that runs the
+ #      agents (so the agent's SDK sends it; the proxy validates it).
+ #   3. Launch the proxy interactively once and complete the OAuth flow for
+ #      each upstream account you intend to use (Claude / Codex / Gemini).
+ #   4. Adjust `oauth-model-alias.{claude,codex}` so the model strings the
+ #      agents send (e.g. `claude-sonnet-4-6`, `gpt-5.3-codex-spark`) resolve
+ #      to whatever upstream IDs your subscriptions actually expose.
+
+ host: "127.0.0.1"
+ port: 8317
+ auth-dir: "~/.cli-proxy-api"
+
+ api-keys:
+   - "REPLACE-WITH-OPENSSL-RAND-HEX-16"
+
+ strategy: "round-robin"
+ session-affinity-ttl: "1h"
+
+ # Upstream Claude OAuth account(s). Run the proxy once with your browser open
+ # to log in; the proxy then caches refresh tokens under auth-dir.
+ claude-api-key: []
+
+ # Upstream Codex OAuth account(s). Same pattern.
+ codex-api-key: []
+
+ # Map the alias names our agents send → actual upstream model IDs.
+ # AI-Build-AI sends `--model claude-sonnet-4-6` (or whatever you pick).
+ # MLEvolve sends the model string from agents/mlevolve/runner.py's --model.
+ oauth-model-alias:
+   claude:
+     # Match the string the agent's runner sends; map to whatever your Claude
+     # subscription actually exposes (check `curl ${proxy}/v1/models`).
+     claude-sonnet-4-6: "<upstream-claude-id>"
+   codex:
+     gpt-5.3-codex-spark: "<upstream-codex-id>"
agents/cliproxyapi/endpoint.py ADDED
@@ -0,0 +1,44 @@
+ """ProxyEndpoint — single source of truth for "where is the proxy + what key".
+
+ Every agent runner reads this from environment, then hands the resulting
+ object to `agents.cliproxyapi.env.*` to build SDK-specific configuration.
+
+ Env vars:
+     CLIPROXYAPI_HOST   default 127.0.0.1
+     CLIPROXYAPI_PORT   default 8317 (CLIProxyAPI's stock port)
+     CLIPROXYAPI_KEY    required — must match one of the api-keys: entries
+                        in your CLIProxyAPI config.yaml
+ """
+
+ from __future__ import annotations
+
+ import os
+ from dataclasses import dataclass
+
+ DEFAULT_HOST = "127.0.0.1"
+ DEFAULT_PORT = 8317
+
+
+ @dataclass(frozen=True)
+ class ProxyEndpoint:
+     host: str = DEFAULT_HOST
+     port: int = DEFAULT_PORT
+     api_key: str = ""
+
+     @classmethod
+     def from_env(cls) -> "ProxyEndpoint":
+         host = os.environ.get("CLIPROXYAPI_HOST", DEFAULT_HOST)
+         port = int(os.environ.get("CLIPROXYAPI_PORT", str(DEFAULT_PORT)))
+         api_key = os.environ.get("CLIPROXYAPI_KEY", "").strip()
+         if not api_key:
+             raise SystemExit(
+                 "CLIPROXYAPI_KEY is unset. Set it to one of the api-keys "
+                 "you've configured in your CLIProxyAPI config.yaml.\n"
+                 "Example:\n"
+                 "  export CLIPROXYAPI_KEY=$(grep -A1 'api-keys:' "
+                 "~/.cli-proxy-api/config.yaml | tail -1 | tr -d ' \"-')"
+             )
+         return cls(host=host, port=port, api_key=api_key)
+
+     def base_url(self, scheme: str = "http") -> str:
+         return f"{scheme}://{self.host}:{self.port}"
agents/cliproxyapi/env.py ADDED
@@ -0,0 +1,82 @@
+ """Build env-var dicts (or YAML-config snippets) that point an SDK at the proxy.
+
+ Three SDK shapes are covered today; add more here as agents arrive:
+
+     anthropic_env(ep, model) → Anthropic SDK / Claude Code CLI
+     openai_env(ep, model)    → OpenAI SDK / Codex CLI
+     gemini_env(ep, model)    → google-generativeai SDK / gemini-cli
+
+ Plus `openai_yaml_block(ep, model)` for agents whose config files take
+ `base_url` / `api_key` / `model` fields directly (e.g. MLEvolve).
+
+ Usage from any agent runner:
+
+     from agents.cliproxyapi import ProxyEndpoint, anthropic_env
+     ep = ProxyEndpoint.from_env()
+     subprocess.run(cmd, env={**os.environ, **anthropic_env(ep, model="...")})
+ """
+
+ from __future__ import annotations
+
+ from .endpoint import ProxyEndpoint
+
+
+ def anthropic_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by anthropic-python and claude-code.
+
+     The Anthropic SDK appends `/v1/messages` to ANTHROPIC_BASE_URL itself,
+     so we hand it the proxy root (no trailing path).
+     """
+     env = {
+         "ANTHROPIC_BASE_URL": ep.base_url(),
+         "ANTHROPIC_API_KEY": ep.api_key,
+         "ANTHROPIC_AUTH_TOKEN": ep.api_key,
+     }
+     if model:
+         env["ANTHROPIC_MODEL"] = model
+     return env
+
+
+ def openai_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by openai-python, codex-cli, and many compatible SDKs.
+
+     The OpenAI SDK appends `/chat/completions` (and other paths) to
+     OPENAI_BASE_URL, so we include the `/v1` prefix here.
+     """
+     env = {
+         "OPENAI_BASE_URL": f"{ep.base_url()}/v1",
+         "OPENAI_API_KEY": ep.api_key,
+         "OPENAI_API_BASE": f"{ep.base_url()}/v1",  # legacy var, still common
+     }
+     if model:
+         env["OPENAI_MODEL"] = model
+     return env
+
+
+ def gemini_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
+     """Env vars consumed by google-generativeai and gemini-cli.
+
+     The proxy exposes Gemini's `/v1beta/models/.../generateContent` shape on
+     the proxy root — clients prepend nothing.
+     """
+     env = {
+         "GEMINI_API_BASE": ep.base_url(),
+         "GOOGLE_API_KEY": ep.api_key,
+         "GEMINI_API_KEY": ep.api_key,
+     }
+     if model:
+         env["GEMINI_MODEL"] = model
+     return env
+
+
+ def openai_yaml_block(ep: ProxyEndpoint, model: str) -> dict[str, str]:
+     """Three-key dict for configs that name the proxy directly (e.g. MLEvolve).
+
+     Returns:
+         {"model": ..., "base_url": ".../v1", "api_key": ...}
+     """
+     return {
+         "model": model,
+         "base_url": f"{ep.base_url()}/v1",
+         "api_key": ep.api_key,
+     }
agents/cliproxyapi/health.py ADDED
@@ -0,0 +1,63 @@
+ """Probe and (optionally) spawn the CLIProxyAPI process.
+
+ `wait_until_ready` does a TCP connect — endpoint-agnostic, so it works no
+ matter which protocol surfaces the proxy version exposes.
+
+ `spawn_proxy` is a context manager for tests / one-off CI runs. Most users
+ should run the proxy out-of-band: it owns long-lived OAuth tokens and may
+ serve other tools besides the testbed.
+ """
+
+ from __future__ import annotations
+
+ import contextlib
+ import socket
+ import subprocess
+ import time
+ from pathlib import Path
+
+ from .endpoint import ProxyEndpoint
+
+
+ def is_ready(ep: ProxyEndpoint, timeout: float = 2.0) -> bool:
+     try:
+         with socket.create_connection((ep.host, ep.port), timeout=timeout):
+             return True
+     except OSError:
+         return False
+
+
+ def wait_until_ready(ep: ProxyEndpoint, timeout: float = 30.0) -> None:
+     deadline = time.monotonic() + timeout
+     while time.monotonic() < deadline:
+         if is_ready(ep):
+             return
+         time.sleep(0.5)
+     raise SystemExit(
+         f"CLIProxyAPI at {ep.base_url()} did not respond within {timeout:.0f}s.\n"
+         f"Start it (e.g. `cliproxy --config ~/.cli-proxy-api/config.yaml`) "
+         f"and confirm CLIPROXYAPI_HOST / CLIPROXYAPI_PORT."
+     )
+
+
+ @contextlib.contextmanager
+ def spawn_proxy(
+     config_path: str | Path,
+     binary: str = "cliproxy",
+     timeout: float = 30.0,
+ ):
+     ep = ProxyEndpoint.from_env()
+     proc = subprocess.Popen(
+         [binary, "--config", str(config_path)],
+         stdout=subprocess.PIPE,
+         stderr=subprocess.STDOUT,
+     )
+     try:
+         wait_until_ready(ep, timeout=timeout)
+         yield ep
+     finally:
+         proc.terminate()
+         try:
+             proc.wait(timeout=5)
+         except subprocess.TimeoutExpired:
+             proc.kill()
agents/common/__init__.py ADDED
@@ -0,0 +1 @@
+ """Shared adapter helpers between testbed and individual agent runners."""
agents/common/submit.py ADDED
@@ -0,0 +1,28 @@
+ """Validate and (optionally) submit an agent's output to the GraphTestbed API.
+
+ Default mode is print-and-stop: the runner reports the path to the produced
+ submission.csv but does not POST. Pass `--submit <agent-name>` to the runner
+ to actually call the scoring API.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ from graphtestbed.submit import submit as gtb_submit
+ from graphtestbed.submit import validate_submission
+
+
+ def finalize(task: str, csv_path: Path, agent: str | None) -> None:
+     info = validate_submission(task, csv_path)
+     print()
+     print("✓ Submission ready")
+     print(f"  file:   {csv_path}")
+     print(f"  rows:   {info['n_rows']}")
+     print(f"  sha256: {info['sha256'][:12]}...")
+     if agent:
+         gtb_submit(task, csv_path, agent, dry_run=False)
+     else:
+         print()
+         print("(not submitted — pass --submit <agent-name> to POST)")
+         print(f"  manual: gtb submit {task} --file {csv_path} --agent <name>")
agents/common/tasks.py ADDED
@@ -0,0 +1,63 @@
+ """Render a per-task instruction markdown for any agent.
+
+ Pulls the canonical task description from datasets/manifest.yaml and decorates
+ it with the submission contract (id col, pred col, n rows, metric).
+
+ Per-task overrides — handcrafted prompts that beat the auto-generated text —
+ live in agents/common/tasks_md/<task>.md and take priority when present.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ from graphtestbed._manifest import task_config
+
+ _TEMPLATE = """\
+ # Task: {task}
+
+ {description}
+
+ ## Files you will see
+
+ - `train_features.csv` — labeled training rows
+ - `val_features.csv` — labeled validation rows (use for HPO / early stopping)
+ - `test_features.csv` — **unlabeled** test rows; predict here
+
+ The `Label` (or task-specific target) column is present in train/val and
+ absent from test. Do not attempt to recover test labels from upstream sources.
+
+ ## Submission format
+
+ Write a CSV with **exactly two columns**, in this order:
+
+ | column | type | meaning |
+ | --- | --- | --- |
+ | `{id_col}` | id | matches `test_features.csv[{id_col}]` 100% |
+ | `{pred_col}` | float in [0, 1] | predicted score |
+
+ Row count: **{n_rows}**.
+
+ ## Metric
+
+ You will be evaluated on `{primary}` (primary). Secondary: {secondary}.
+ Optimize for the primary metric.
+ """
+
+
+ def task_instruction(task: str) -> str:
+     override = Path(__file__).parent / "tasks_md" / f"{task}.md"
+     if override.exists():
+         return override.read_text()
+     cfg = task_config(task)
+     s = cfg["submission_schema"]
+     m = cfg["metric"]
+     return _TEMPLATE.format(
+         task=task,
+         description=str(cfg.get("description", "")).strip(),
+         id_col=s["id_col"],
+         pred_col=s["pred_col"],
+         n_rows=s.get("n_rows", "?"),
+         primary=m["primary"],
+         secondary=", ".join(m.get("secondary", [])) or "(none)",
+     )
agents/common/workspace.py ADDED
@@ -0,0 +1,35 @@
+ """Ephemeral workspace dirs and dataset staging for agent runs.
+
+ Each runner allocates `runs/<agent>/<task>/<timestamp>/` so concurrent runs
+ don't collide and post-mortems are always recoverable from disk.
+ """
+
+ from __future__ import annotations
+
+ import datetime as dt
+ from pathlib import Path
+
+
+ def make_workspace(agent: str, task: str, root: Path | None = None) -> Path:
+     root = Path(root) if root else Path.cwd() / "runs"
+     ts = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
+     ws = root / agent / task / ts
+     ws.mkdir(parents=True, exist_ok=False)
+     return ws
+
+
+ def stage_dataset(src_dir: Path, dst_dir: Path, files: list[str]) -> None:
+     """Symlink each `files[i]` from src_dir into dst_dir.
+
+     Symlinks (vs copies) keep large CSVs on the cache disk; the agent reads
+     from src via the link transparently.
+     """
+     dst_dir.mkdir(parents=True, exist_ok=True)
+     for f in files:
+         s = src_dir / f
+         if not s.exists():
+             raise SystemExit(f"Missing dataset file: {s}")
+         d = dst_dir / f
+         if d.is_symlink() or d.exists():
+             d.unlink()
+         d.symlink_to(s.resolve())
agents/mlevolve/README.md ADDED
@@ -0,0 +1,75 @@
+ # `agents.mlevolve`
+
+ Runs [MLEvolve](https://github.com/InternScience/MLEvolve) on a GraphTestbed
+ task. MLEvolve is an MCGS auto-ML harness wired for OpenAI-compatible APIs.
+
+ Default model: **`gpt-5.3-codex-spark`** (a pipe-through alias you define in
+ your CLIProxyAPI `oauth-model-alias.codex` block).
+
+ ## Install
+
+ ```bash
+ bash agents/mlevolve/install.sh
+ # heavy: clones the repo + pip-installs torch and ML deps (~5-10 GB).
+ ```
+
+ Lands at `agents/mlevolve/_vendor/MLEvolve/`. Set `MLEVOLVE_DIR` if you
+ already have a clone elsewhere.
+
+ ## Run
+
+ ```bash
+ gtb fetch figraph
+ python -m agents.mlevolve.runner --task figraph
+ ```
+
+ Output:
+
+ ```
+ runs/mlevolve/figraph/<timestamp>/
+ ├── mlebench-tree/figraph/
+ │   ├── prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
+ │   ├── prepared/private/test.csv   # val labels — local grader uses this
+ │   └── REAL_TEST_FEATURES.csv      # the actual test split, for re-execute
+ ├── agent.log
+ └── val_submission.csv              # MLEvolve's best on the val "test" split
+ ```
+
+ ## ⚠ v1 limitation: val-as-test
+
+ GraphTestbed's actual test labels live on the scoring server, not on disk.
+ For the local mle-bench grader to function, the adapter exposes
+ `val_features.csv` (with labels) as the "test" set MLEvolve searches against.
+
+ The CSV the runner harvests is therefore predictions on **val**, not test.
+ To submit a real test-set score:
+
+ 1. Open `agents/mlevolve/_vendor/MLEvolve/runs/<latest-ts>/` and find the
+    best runfile.py (search order: best score in the run's tree summary).
+ 2. Re-execute it against the real test split:
+    ```bash
+    cd <some scratch dir>
+    cp <ws>/mlebench-tree/figraph/REAL_TEST_FEATURES.csv ./test.csv
+    cp <ws>/mlebench-tree/figraph/prepared/public/train.csv ./train.csv
+    python <runfile>          # produces submission.csv
+    ```
+ 3. Submit:
+    ```bash
+    gtb submit figraph --file ./submission.csv --agent mlevolve-codex-spark
+    ```
+
+ This step is manual in v1 because the structure of MLEvolve's `runfile.py`
+ varies per task and we don't want to silently mis-execute. It is on the
+ roadmap to automate.
+
+ ## Knobs
+
+ | flag | default | meaning |
+ | --- | --- | --- |
+ | `--model` | `gpt-5.3-codex-spark` | sent to proxy via OPENAI_BASE_URL/v1 |
+ | `--steps` | 100 | MCGS exploration count (upstream default: 500) |
+ | `--time-limit-min` | 120 | per-task wall-clock cap (upstream default: 720) |
+ | `--gpus` | 0 | passed to `search.num_gpus` |
+
+ The `--model` string must exist in your CLIProxyAPI
+ `oauth-model-alias.codex` (or be a real model your Codex account exposes).
agents/mlevolve/__init__.py ADDED
@@ -0,0 +1,10 @@
+ """MLEvolve integration (github.com/InternScience/MLEvolve).
+
+ MLEvolve is an MCGS-based auto-ML harness designed for the mle-bench
+ data layout. The adapter here translates a GraphTestbed task into the
+ mle-bench shape it expects, then drives the upstream `run.py` (Hydra
+ entry point) with overrides that route LLM traffic through CLIProxyAPI.
+
+ Default model: `gpt-5.3-codex-spark` (pipe-through alias the user defines
+ in their CLIProxyAPI `oauth-model-alias.codex` block).
+ """
agents/mlevolve/adapter.py ADDED
@@ -0,0 +1,79 @@
+ """GraphTestbed task → mle-bench-shaped data tree.
+
+ mle-bench expects, per experiment ID:
+
+     <root>/<exp_id>/prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
+
+ GraphTestbed's test labels live only on the scoring server, so the agent
+ cannot be auto-scored against `test_features.csv` locally. v1 strategy:
+
+     - Stage `val_features.csv` (with labels) as the "test" the agent
+       searches against. MLEvolve's grader can score val predictions locally,
+       which is what drives MCGS exploration.
+     - Stash the real `test_features.csv` next to the staged tree as
+       `<root>/<exp_id>/REAL_TEST_FEATURES.csv` so users can re-execute the
+       best runfile.py against it after the search finishes.
+
+ This is documented as a known limitation in agents/mlevolve/README.md.
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.common.tasks import task_instruction
+ from graphtestbed._manifest import task_config
+ from graphtestbed.fetch import cache_dir
+
+
+ def stage(task: str, root: Path) -> Path:
+     """Build <root>/<task>/prepared/{public,private}/. Return the prepared dir."""
+     cfg = task_config(task)
+     s = cfg["submission_schema"]
+
+     src = cache_dir() / task
+     if not src.exists():
+         raise SystemExit(
+             f"No cached dataset at {src}. Run `gtb fetch {task}` first."
+         )
+
+     base = root / task / "prepared"
+     pub = base / "public"
+     priv = base / "private"
+     pub.mkdir(parents=True, exist_ok=True)
+     priv.mkdir(parents=True, exist_ok=True)
+
+     train = pd.read_csv(src / "train_features.csv")
+     val = pd.read_csv(src / "val_features.csv")
+     test = pd.read_csv(src / "test_features.csv")
+
+     if s["pred_col"] not in val.columns:
+         raise SystemExit(
+             f"val_features.csv has no `{s['pred_col']}` column — cannot use "
+             f"val as the local-grading split for task {task}."
+         )
+
+     # Public tree (what the agent sees). val_no_label = val minus label →
+     # served as `test.csv` so the agent's runfile predicts on it.
+     val_no_label = val.drop(columns=[s["pred_col"]])
+     train.to_csv(pub / "train.csv", index=False)
+     val_no_label.to_csv(pub / "test.csv", index=False)
+
+     sample = val_no_label[[s["id_col"]]].copy()
+     sample[s["pred_col"]] = 0.5
+     sample.to_csv(pub / "sample_submission.csv", index=False)
+
+     (pub / "description.md").write_text(task_instruction(task))
+
+     # Private tree: val with labels — the local grader checks submission
+     # against this.
+     val[[s["id_col"], s["pred_col"]]].rename(
+         columns={s["pred_col"]: "Label"}
+     ).to_csv(priv / "test.csv", index=False)
+
+     # Stash the real test set for post-search re-execution by the user.
+     test.to_csv(root / task / "REAL_TEST_FEATURES.csv", index=False)
+
+     return base
agents/mlevolve/examples/run_figraph.sh ADDED
@@ -0,0 +1,16 @@
+ #!/usr/bin/env bash
+ # End-to-end smoke test of MLEvolve on the `figraph` task.
+ set -euo pipefail
+
+ REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
+ cd "${REPO_ROOT}"
+
+ : "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"
+
+ python3 -m agents.mlevolve.runner \
+   --task figraph \
+   --model "${MODEL:-gpt-5.3-codex-spark}" \
+   --steps "${STEPS:-30}" \
+   --time-limit-min "${TIME_LIMIT_MIN:-30}" \
+   --gpus "${GPUS:-0}" \
+   "${@}"
agents/mlevolve/install.sh ADDED
@@ -0,0 +1,34 @@
+ #!/usr/bin/env bash
+ # Clone MLEvolve into agents/mlevolve/_vendor/MLEvolve and install its deps.
+ # This is a heavy install (torch + ML stack); expect ~5–10 GB and 5–15 min.
+ set -euo pipefail
+
+ HERE="$(cd "$(dirname "$0")" && pwd)"
+ DEST="${HERE}/_vendor"
+ REPO="${MLEVOLVE_REPO:-https://github.com/InternScience/MLEvolve}"
+ REF="${MLEVOLVE_REF:-main}"
+
+ mkdir -p "${DEST}"
+
+ if [[ -d "${DEST}/MLEvolve/.git" ]]; then
+     echo "Updating existing clone in ${DEST}/MLEvolve"
+     git -C "${DEST}/MLEvolve" fetch origin "${REF}"
+     git -C "${DEST}/MLEvolve" checkout "${REF}"
+     git -C "${DEST}/MLEvolve" pull --ff-only
+ else
+     git clone --depth 50 --branch "${REF}" "${REPO}" "${DEST}/MLEvolve"
+ fi
+
+ cd "${DEST}/MLEvolve"
+ echo
+ echo "Installing requirements (heavy — torch + ML stack)..."
+ for f in requirements_base.txt requirements_ml.txt requirements_domain.txt; do
+     if [[ -f "$f" ]]; then
+         echo "  pip install --no-deps -r $f"
+         pip install --no-deps -r "$f"
+     fi
+ done
+
+ echo
+ echo "MLEvolve installed at ${DEST}/MLEvolve"
+ echo "Set MLEVOLVE_DIR if you put it elsewhere."
agents/mlevolve/runner.py ADDED
@@ -0,0 +1,210 @@
+ """Run MLEvolve on a GraphTestbed task, routed through CLIProxyAPI.
+
+ Usage:
+     python -m agents.mlevolve.runner --task figraph
+     python -m agents.mlevolve.runner --task figraph \\
+         --model gpt-5.3-codex-spark --steps 100
+     python -m agents.mlevolve.runner --task figraph \\
+         --submit mlevolve-codex-spark
+
+ What this does:
+     1. Build an mle-bench-shaped tree from the GraphTestbed task data
+        (val-as-test for v1 — see adapter.py for why).
+     2. Render config.yaml into _vendor/MLEvolve/config/, with the proxy
+        endpoint + model wired into agent.code and agent.feedback.
+     3. Invoke `python run.py …` from inside _vendor/MLEvolve/ with Hydra
+        overrides for paths and run-budget.
+     4. Harvest the latest submission.csv from runs/, normalize its column
+        names, validate against the testbed schema, and (optionally) submit.
+
+ Known v1 limitation: the produced submission scores VAL-set predictions,
+ not TEST-set. To score on test, rerun the best runfile.py against
+ <workspace>/mlebench-tree/<task>/REAL_TEST_FEATURES.csv before submitting.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import subprocess
+ import sys
+ from pathlib import Path
+
+ import pandas as pd
+
+ from agents.cliproxyapi import (
+     ProxyEndpoint,
+     openai_yaml_block,
+     wait_until_ready,
+ )
+ from agents.common.submit import finalize
+ from agents.common.workspace import make_workspace
+ from agents.mlevolve.adapter import stage as stage_mlebench
+ from graphtestbed._manifest import task_config
+
+ DEFAULT_MODEL = "gpt-5.3-codex-spark"
+
+
+ def _resolve_mlevolve_dir() -> Path:
+     explicit = os.environ.get("MLEVOLVE_DIR")
+     if explicit:
+         p = Path(explicit)
+         if not (p / "run.py").exists():
+             raise SystemExit(f"MLEVOLVE_DIR={p} does not contain run.py")
+         return p
+     vendored = Path(__file__).parent / "_vendor" / "MLEvolve"
+     if (vendored / "run.py").exists():
+         return vendored
+     raise SystemExit(
+         "Cannot locate MLEvolve.\n"
+         "  Install: bash agents/mlevolve/install.sh\n"
+         "  Or set MLEVOLVE_DIR to your existing clone."
+     )
+
+
+ def _hydra_overrides(
+     task: str, mlebench_root: Path, prepared: Path, ep: ProxyEndpoint,
+     model: str, steps: int, time_limit_s: int, num_gpus: int,
+ ) -> list[str]:
+     """Build Hydra-style key=value overrides for run.py."""
+     public = prepared / "public"
+     block = openai_yaml_block(ep, model)
+     cfg_metric = task_config(task)["metric"]["primary"]
+
+     overrides = [
+         f"exp_id={task}",
+         f"exp_name={task}",
+         f"dataset_dir={mlebench_root}",
+         f"data_dir={public}",
+         f"desc_file={public / 'description.md'}",
+         f"start_cpu_id=0",
+         f"cpu_number=4",
+         # LLM routing → proxy
+         f"agent.code.model={block['model']}",
+         f"agent.code.base_url={block['base_url']}",
+         f"agent.code.api_key={block['api_key']}",
+         f"agent.feedback.model={block['model']}",
+         f"agent.feedback.base_url={block['base_url']}",
+         f"agent.feedback.api_key={block['api_key']}",
+         # Run budget overrides
+         f"agent.steps={steps}",
+         f"agent.time_limit={time_limit_s}",
+         f"agent.memory_embedding_device={'cuda' if num_gpus > 0 else 'cpu'}",
+         f"agent.search.num_gpus={num_gpus}",
+         f"use_grading_server=false",
+         # Goal hint
+         f"goal=Maximize {cfg_metric} on the test set",
+         f"eval={cfg_metric}",
+     ]
+     return overrides
+
+
+ def _harvest_submission(
+     task: str, mlevolve_dir: Path, dst: Path,
+ ) -> Path:
+     schema = task_config(task)["submission_schema"]
+     runs = mlevolve_dir / "runs"
+     if not runs.exists():
+         raise SystemExit(f"No runs/ dir under {mlevolve_dir}")
+     candidates = sorted(runs.rglob("submission.csv"),
+                         key=lambda p: p.stat().st_mtime)
+     if not candidates:
+         raise SystemExit(
+             f"No submission.csv produced under {runs}. "
+             f"Inspect {dst / 'agent.log'} for the failure mode."
+         )
+     chosen = candidates[-1]
+     df = pd.read_csv(chosen)
+     expected = [schema["id_col"], schema["pred_col"]]
+     if list(df.columns) != expected:
+         if len(df.columns) == 2:
+             print(f"  (renaming columns {list(df.columns)} → {expected})")
+             df.columns = expected
+         else:
+             raise SystemExit(
+                 f"Cannot normalize {chosen}: got {list(df.columns)}, expected {expected}"
+             )
+     out = dst / "val_submission.csv"
+     df.to_csv(out, index=False)
+     print(f"✓ Picked {chosen.relative_to(mlevolve_dir)}")
+     return out
+
+
+ def _print_followup(task: str, ws: Path, val_sub: Path) -> None:
+     real_test = ws / "mlebench-tree" / task / "REAL_TEST_FEATURES.csv"
+     print()
+     print("⚠ v1 limitation: the file above scores VAL predictions.")
+     print("  To score on the actual test set:")
+     print(f"  1. Find the best runfile.py under "
+           f"{Path('_vendor/MLEvolve/runs')}/<latest>/")
+     print(f"  2. Re-run it with test.csv replaced by:")
+     print(f"       {real_test}")
+     print(f"  3. Submit the resulting CSV via:")
+     print(f"       gtb submit {task} --file <path> --agent <name>")
+
+
+ def main() -> None:
+     ap = argparse.ArgumentParser(prog="agents.mlevolve.runner")
+     ap.add_argument("--task", required=True)
+     ap.add_argument("--model", default=DEFAULT_MODEL,
+                     help=f"default: {DEFAULT_MODEL}")
+     ap.add_argument("--steps", type=int, default=100,
+                     help="agent.steps (default: 100, upstream default 500 — "
+                          "MCGS exploration count)")
+     ap.add_argument("--time-limit-min", type=int, default=120,
+                     help="agent.time_limit in minutes (default: 120)")
+     ap.add_argument("--gpus", type=int, default=0,
+                     help="search.num_gpus (default: 0 — CPU only)")
+     ap.add_argument("--submit", default=None, metavar="AGENT_ID",
+                     help="POST val-set submission to scoring API as this name. "
+                          "Note: scores VAL not test (see runner docstring).")
+     ap.add_argument("--workspace-root", type=Path, default=None)
+     args = ap.parse_args()
+
+     mlevolve_dir = _resolve_mlevolve_dir()
+     ep = ProxyEndpoint.from_env()
+     wait_until_ready(ep)
+     print(f"✓ Proxy ready at {ep.base_url()}")
+     print(f"✓ MLEvolve at {mlevolve_dir}")
+
+     ws = make_workspace("mlevolve", args.task, args.workspace_root)
+     mlebench_root = ws / "mlebench-tree"
+     prepared = stage_mlebench(args.task, mlebench_root)
+     print(f"✓ mle-bench tree staged at {mlebench_root}")
+
+     overrides = _hydra_overrides(
+         task=args.task,
+         mlebench_root=mlebench_root,
+         prepared=prepared,
+         ep=ep,
+         model=args.model,
+         steps=args.steps,
+         time_limit_s=args.time_limit_min * 60,
+         num_gpus=args.gpus,
+     )
+     cmd = [sys.executable, "run.py", *overrides]
+
+     print(f"→ Launching MLEvolve task={args.task} model={args.model}")
+     print(f"  workspace: {ws}")
+     log = ws / "agent.log"
+     with log.open("wb") as lf:
+         rc = subprocess.call(cmd, cwd=mlevolve_dir, stdout=lf, stderr=subprocess.STDOUT)
+     print(f"  exit={rc} log={log}")
+     if rc != 0:
+         raise SystemExit(rc)
+
+     val_sub = _harvest_submission(args.task, mlevolve_dir, ws)
+     _print_followup(args.task, ws, val_sub)
+
+     # Note: don't auto-finalize against `test_features.csv` schema since this
+     # is a val-set submission. Just print & stop.
+     print()
+     print(f"  val_submission: {val_sub}")
+     if args.submit:
+         print(f"  --submit was set; posting val-set predictions as "
+               f"`{args.submit}` (will score 0 against test GT).")
+         finalize(args.task, val_sub, args.submit)
+
+
+ if __name__ == "__main__":
+     main()
graphtestbed/leaderboard.py CHANGED
@@ -8,7 +8,10 @@ import os
  import json
 
 
- API_URL = os.environ.get("GRAPHTESTBED_API", "http://localhost:8080")
+ API_URL = os.environ.get(
+     "GRAPHTESTBED_API",
+     "https://lanczos-graphtestbed.hf.space",
+ )
 
 
  def main() -> None:
graphtestbed/submit.py CHANGED
@@ -21,7 +21,10 @@ import pandas as pd
 from graphtestbed._manifest import sha256_file, task_config
 
 
-API_URL = os.environ.get("GRAPHTESTBED_API", "http://localhost:8080")
+API_URL = os.environ.get(
+    "GRAPHTESTBED_API",
+    "https://lanczos-graphtestbed.hf.space",
+)
 TIMEOUT_SEC = 60
 
 
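For self-hosters the env var still wins over the new hosted default. A minimal sketch (the local URL is an example; `/healthz` is the health endpoint the Space documents below) of pointing the client back at a self-hosted server and checking it responds:

```python
import os

import requests

# Example only: route the gtb client at a self-hosted scoring server
# instead of the hosted Space default.
os.environ["GRAPHTESTBED_API"] = "http://localhost:8080"

# Same resolution rule the two modules above use:
api = os.environ.get("GRAPHTESTBED_API", "https://lanczos-graphtestbed.hf.space")
print(requests.get(f"{api}/healthz", timeout=60).json())
```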
pyproject.toml CHANGED
@@ -7,11 +7,6 @@ license = "MIT"
 readme = "README.md"
 requires-python = ">=3.10"
 keywords = ["benchmark", "graph", "ml", "agent", "evaluation"]
-
-[project.urls]
-Homepage = "https://github.com/zhuconv/GraphTestbed"
-Repository = "https://github.com/zhuconv/GraphTestbed"
-Issues = "https://github.com/zhuconv/GraphTestbed/issues"
 dependencies = [
     "huggingface-hub >= 0.20",
     "pandas >= 2.0",
@@ -19,6 +14,11 @@ dependencies = [
     "requests >= 2.30",
 ]
 
+[project.urls]
+Homepage = "https://github.com/zhuconv/GraphTestbed"
+Repository = "https://github.com/zhuconv/GraphTestbed"
+Issues = "https://github.com/zhuconv/GraphTestbed/issues"
+
 [project.optional-dependencies]
 dev = ["scikit-learn >= 1.3"]
 
server/api.py CHANGED
@@ -41,6 +41,10 @@ from flask import Flask, jsonify, request
 
 GT_DIR = Path(os.environ.get("GT_DIR", "/var/graphtestbed/gt"))
 DB_PATH = Path(os.environ.get("GT_DB", "/var/graphtestbed/leaderboard.db"))
+ARCHIVE_DIR = (
+    Path(os.environ["GT_ARCHIVE_DIR"])
+    if os.environ.get("GT_ARCHIVE_DIR") else None
+)
 MANIFEST_PATH = Path(os.environ.get(
     "GT_MANIFEST",
     Path(__file__).resolve().parents[1] / "datasets" / "manifest.yaml",
@@ -195,6 +199,15 @@
     )
     conn.commit()
 
+    # Archive the raw CSV when GT_ARCHIVE_DIR is configured, so the deploy
+    # host can later prove what each scored entry was. Filename embeds the
+    # agent + run_id so multiple submissions don't collide.
+    if ARCHIVE_DIR is not None:
+        safe_agent = "".join(c if c.isalnum() or c in "-_." else "_" for c in agent)
+        out = ARCHIVE_DIR / task / f"{safe_agent}-{run_id}.csv"
+        out.parent.mkdir(parents=True, exist_ok=True)
+        out.write_bytes(raw)
+
     # Rank = how many distinct agents have a strictly better best-score on
     # this task. The just-inserted row contributes to that count only if the
     # SAME agent had a better prior submission (in which case rank doesn't
server/requirements.txt CHANGED
@@ -3,3 +3,4 @@ pandas>=2.0
 pyyaml>=6.0
 scikit-learn>=1.3
 gunicorn>=21.0
+huggingface_hub>=0.20
server/space/DEPLOY.md ADDED
@@ -0,0 +1,101 @@
+# Deploying the GraphTestbed scoring server to HF Spaces
+
+All commands assume `HF_TOKEN` is exported and has **write** scope on the
+`lanczos` namespace.
+
+## 1. Seed the GT dataset repo
+
+```bash
+HF_TOKEN=$HF_TOKEN python server/space/push_gt.py \
+  --repo lanczos/graphtestbed-gt \
+  --gt-dir ~/graphtestbed-gt
+```
+
+This creates the **private** dataset repo if it doesn't exist and uploads
+each `<task>.csv` to `gt/<task>.csv`. Verify at:
+
+<https://huggingface.co/datasets/lanczos/graphtestbed-gt>
+
+## 2. Create the Space
+
+```bash
+huggingface-cli repo create graphtestbed --type space --space_sdk docker
+```
+
+Or in the web UI: New Space → name `graphtestbed` → SDK: **Docker**.
+
+## 3. Set the Space secret
+
+In Space Settings → Variables and secrets, add:
+
+| name | value |
+| --- | --- |
+| `HF_TOKEN` | same token (write scope on `lanczos/graphtestbed-gt`) |
+
+Optional overrides (set as **variables**, not secrets):
+
+| name | default | when to override |
+| --- | --- | --- |
+| `GT_DATASET_REPO` | `lanczos/graphtestbed-gt` | running multiple Spaces against different GT |
+| `GT_BACKUP_INTERVAL` | `60` | tighter durability vs. fewer commits |
+| `GT_QUOTA` | `5` | bumping during a benchmark sprint |
+
+## 4. Push the code to the Space
+
+```bash
+# One-time
+git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+
+# Each deploy (HF prompts for credentials: user=lanczos, password=$HF_TOKEN)
+./server/space/push_to_space.sh
+```
+
+The script overlays `server/space/README.md` at repo root on a temp branch
+and force-pushes to `space/main` (HF reads the Space's frontmatter from the
+root README). Your GitHub root README is untouched.
+
+The first build takes ~3 min (pandas + sklearn wheels); later ones ~30 s.
+
+## 5. Smoke-test
+
+```bash
+curl -s https://lanczos-graphtestbed.hf.space/healthz | jq
+```
+
+Expect:
+```json
+{
+  "status": "ok",
+  "tasks": ["arxiv-citation", "figraph", "ibm-aml", "ieee-fraud-detection"],
+  "gt_present": ["figraph", "..."],
+  "quota_per_day": 5,
+  "uptime_unix": 1776633751
+}
+```
+
+If `gt_present` is empty, the boot-time bootstrap couldn't read from the
+dataset repo — check the Space logs and verify `HF_TOKEN` has read scope on
+`GT_DATASET_REPO`.
+
+## 6. Hand out the URL
+
+```bash
+export GRAPHTESTBED_API=https://lanczos-graphtestbed.hf.space
+gtb submit figraph --file preds.csv --agent my-agent-v1
+```
+
+## Reading the leaderboard back as a maintainer
+
+```bash
+huggingface-cli download lanczos/graphtestbed-gt \
+  leaderboard.db \
+  --repo-type dataset \
+  --local-dir ./backup
+
+sqlite3 backup/leaderboard.db \
+  "SELECT task, agent, primary_metric, n_rows, submitted_at
+   FROM submissions ORDER BY submitted_at DESC LIMIT 20"
+```
+
+The full per-submission CSV archive lives under `submissions/<task>/<agent>-<run_id>.csv`
+in the same dataset repo.
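The same maintainer query from Python instead of the sqlite3 CLI; a sketch that assumes only the columns named in the query above:

```python
import sqlite3

conn = sqlite3.connect("backup/leaderboard.db")
rows = conn.execute(
    "SELECT task, agent, primary_metric, n_rows, submitted_at "
    "FROM submissions ORDER BY submitted_at DESC LIMIT 20"
).fetchall()
conn.close()

for task, agent, score, n_rows, ts in rows:
    print(f"{ts}  {task:<24} {agent:<28} {score:.3f}  ({n_rows} rows)")
```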
server/space/Dockerfile ADDED
@@ -0,0 +1,38 @@
+FROM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1
+
+WORKDIR /app
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install deps first so the layer caches across code-only changes.
+COPY server/requirements.txt /app/server/requirements.txt
+RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+
+# Install the graphtestbed package itself so server/api.py can
+# `from graphtestbed._manifest import ...`.
+COPY pyproject.toml /app/
+COPY graphtestbed /app/graphtestbed
+COPY datasets /app/datasets
+COPY server /app/server
+RUN pip install --no-deps -e /app
+
+# HF Spaces mounts /data on the Persistent Storage tier; on the free tier it's
+# just an in-container path that the dataset-repo backup loop preserves.
+ENV GT_DATA_ROOT=/data \
+    GT_DIR=/data/gt \
+    GT_DB=/data/leaderboard.db \
+    GT_ARCHIVE_DIR=/data/submissions \
+    GT_DATASET_REPO=lanczos/graphtestbed-gt \
+    GT_BACKUP_INTERVAL=60 \
+    GT_QUOTA=5 \
+    PORT=7860
+RUN mkdir -p /data && chmod 777 /data
+
+EXPOSE 7860
+CMD ["python", "/app/server/space/space_entry.py"]
server/space/README.md ADDED
@@ -0,0 +1,55 @@
+---
+title: GraphTestbed Scoring API
+emoji: 📊
+colorFrom: indigo
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
+---
+
+# GraphTestbed Scoring API
+
+Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+anywhere; the scored entry lands on a single shared leaderboard.
+
+## Endpoints
+
+| method | path | purpose |
+| --- | --- | --- |
+| POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+| GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary metric desc |
+| GET | `/healthz` | task list + which tasks have GT loaded + quota |
+
+Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
+
+## Trust model
+
+Non-adversarial benchmark. The API enforces:
+- 5 submissions / day / IP / task
+- Schema check before scoring (malformed CSVs don't burn quota)
+- Score bucketing (round to 3 dp)
+- Audit trail in sqlite + per-submission CSV archive
+
+Test labels live only in the companion private dataset repo
+(`lanczos/graphtestbed-gt`) and never enter the Space's git history.
+
+## Configuration (Space secrets / variables)
+
+| name | required | default | notes |
+| --- | --- | --- | --- |
+| `HF_TOKEN` | yes | — | write scope on `GT_DATASET_REPO` |
+| `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+| `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+| `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+
+## Persistence
+
+- On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+  archived `submissions/**/*.csv` from the dataset repo into `/data`.
+- Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+  uses `sqlite3.Connection.backup()` to copy the DB atomically and
+  `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+  pushed via `upload_folder` (content-hash diff — unchanged files skipped).
+- Worst-case loss on a Space crash: 60 s of submissions.
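For clients that can't shell out to `gtb`, a minimal sketch against the endpoints table above (field names `task`, `agent`, `file` and the response keys are taken from that table; the canonical client remains `gtb submit`):

```python
import requests

API = "https://lanczos-graphtestbed.hf.space"

# Multipart submit, as described in the endpoints table.
with open("preds.csv", "rb") as f:
    resp = requests.post(
        f"{API}/submit",
        data={"task": "figraph", "agent": "my-agent-v1"},
        files={"file": f},
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # primary metric, secondary metrics, rank, quota_remaining

# Best-per-agent leaderboard, sorted by primary metric descending.
print(requests.get(f"{API}/leaderboard/figraph", timeout=60).json())
```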
server/space/push_gt.py ADDED
@@ -0,0 +1,67 @@
+"""One-shot uploader for ground-truth CSVs to the companion HF dataset repo.
+
+Creates the dataset repo (private by default) if it doesn't exist, then
+uploads every <task>.csv from --gt-dir to gt/<task>.csv in the repo.
+
+Usage (run locally with a token that has write scope on the namespace):
+
+    HF_TOKEN=hf_xxx python server/space/push_gt.py \\
+        --repo lanczos/graphtestbed-gt \\
+        --gt-dir ~/graphtestbed-gt
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+from huggingface_hub import create_repo, upload_file
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(prog="push_gt")
+    ap.add_argument("--repo", default="lanczos/graphtestbed-gt",
+                    help="dataset repo id (default: lanczos/graphtestbed-gt)")
+    ap.add_argument("--gt-dir", type=Path, required=True,
+                    help="local dir containing <task>.csv files")
+    ap.add_argument("--public", action="store_true",
+                    help="create the repo as public (default: private)")
+    args = ap.parse_args()
+
+    token = os.environ.get("HF_TOKEN")
+    if not token:
+        sys.exit("HF_TOKEN not set in env")
+
+    if not args.gt_dir.exists():
+        sys.exit(f"--gt-dir not found: {args.gt_dir}")
+
+    csvs = sorted(args.gt_dir.glob("*.csv"))
+    if not csvs:
+        sys.exit(f"no *.csv files under {args.gt_dir}")
+
+    print(f"creating/confirming dataset repo {args.repo} (private={not args.public})")
+    create_repo(
+        repo_id=args.repo, repo_type="dataset",
+        private=not args.public, exist_ok=True, token=token,
+    )
+
+    for csv in csvs:
+        rel = f"gt/{csv.name}"
+        print(f"uploading {csv} → {args.repo}:{rel}")
+        upload_file(
+            path_or_fileobj=str(csv),
+            path_in_repo=rel,
+            repo_id=args.repo, repo_type="dataset",
+            token=token,
+            commit_message=f"upload {csv.name}",
+        )
+
+    print(f"\ndone — {len(csvs)} ground-truth file(s) at:")
+    print(f"  https://huggingface.co/datasets/{args.repo}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
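After push_gt.py finishes, a quick sanity check that every ground-truth file landed where the Space expects it (a sketch; `list_repo_files` is a stock `huggingface_hub` helper, and the token needs read scope on the repo):

```python
import os

from huggingface_hub import list_repo_files

files = list_repo_files(
    "lanczos/graphtestbed-gt",
    repo_type="dataset",
    token=os.environ["HF_TOKEN"],
)
# Expect one gt/<task>.csv per file uploaded by push_gt.py.
print(sorted(f for f in files if f.startswith("gt/")))
```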
server/space/push_to_space.sh ADDED
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# Push the current commit to the HF Space remote, with server/space/README.md
+# overlaid at repo root (HF reads the Space's metadata frontmatter from the
+# root README; the GitHub root README stays untouched).
+#
+# Prereq once:
+#   git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+#
+# When git prompts for credentials on push:
+#   user     = lanczos
+#   password = $HF_TOKEN
+set -euo pipefail
+
+BRANCH=$(git rev-parse --abbrev-ref HEAD)
+TEMP="space-deploy-$(date +%s)"
+
+trap 'git checkout "$BRANCH" >/dev/null 2>&1 || true; \
+      git branch -D "$TEMP" >/dev/null 2>&1 || true' EXIT
+
+git checkout -b "$TEMP"
+cp server/space/README.md README.md
+git add README.md
+git commit --no-verify -m "deploy: overlay server/space/README.md as Space root"
+git push -f space "$TEMP:main"
+echo
+echo "pushed to space/main"
+echo "URL: https://lanczos-graphtestbed.hf.space/"
server/space/space_entry.py ADDED
@@ -0,0 +1,173 @@
+"""Entry point for the GraphTestbed scoring server on HF Spaces.
+
+On boot:
+1. snapshot_download the companion dataset repo (lanczos/graphtestbed-gt by
+   default) into /data: gt/*.csv, leaderboard.db, submissions/**/*.csv.
+2. Spawn a daemon thread that every BACKUP_INTERVAL seconds:
+   a. SELECT COUNT(*) FROM submissions; bail if unchanged.
+   b. sqlite3.Connection.backup() into a temp file (atomic, lock-safe).
+   c. upload_file the temp file → leaderboard.db in the dataset repo.
+   d. upload_folder /data/submissions/ → submissions/ in the dataset repo
+      (huggingface_hub diffs by content-hash; unchanged files don't transfer).
+3. Hand off to server/api.py via Flask app.run(threaded=True).
+
+Env vars (all have sensible defaults baked into the Dockerfile):
+    HF_TOKEN            required   write scope on GT_DATASET_REPO
+    GT_DATASET_REPO     optional   default: lanczos/graphtestbed-gt
+    GT_DATA_ROOT        optional   default: /data
+    GT_BACKUP_INTERVAL  optional   default: 60 (seconds)
+    PORT                optional   default: 7860
+"""
+
+from __future__ import annotations
+
+import os
+import sqlite3
+import sys
+import threading
+import time
+from pathlib import Path
+
+from huggingface_hub import snapshot_download, upload_file, upload_folder
+
+HF_TOKEN = os.environ.get("HF_TOKEN")
+HF_REPO = os.environ.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt")
+DATA_DIR = Path(os.environ.get("GT_DATA_ROOT", "/data"))
+GT_DIR = DATA_DIR / "gt"
+DB_PATH = DATA_DIR / "leaderboard.db"
+ARCHIVE_DIR = DATA_DIR / "submissions"
+BACKUP_INTERVAL = int(os.environ.get("GT_BACKUP_INTERVAL", "60"))
+PORT = int(os.environ.get("PORT", "7860"))
+
+
+def _require_token() -> str:
+    if not HF_TOKEN:
+        raise SystemExit(
+            "HF_TOKEN is unset. Set it as a Space secret with write scope on "
+            f"{HF_REPO}."
+        )
+    return HF_TOKEN
+
+
+def bootstrap() -> None:
+    """Pull GT files, leaderboard, and submission archive from the dataset repo."""
+    token = _require_token()
+    for d in (DATA_DIR, GT_DIR, ARCHIVE_DIR):
+        d.mkdir(parents=True, exist_ok=True)
+
+    print(f"snapshot_download {HF_REPO} → {DATA_DIR}", flush=True)
+    try:
+        snapshot_download(
+            HF_REPO,
+            repo_type="dataset",
+            local_dir=str(DATA_DIR),
+            allow_patterns=["gt/*.csv", "leaderboard.db", "submissions/**/*.csv"],
+            token=token,
+        )
+    except Exception as e:
+        # First-deploy or empty repo: keep going with empty /data.
+        print(f"snapshot_download warning ({type(e).__name__}): {e}", flush=True)
+
+    n_gt = len(list(GT_DIR.glob("*.csv")))
+    print(f"GT files present: {n_gt}", flush=True)
+    if DB_PATH.exists():
+        try:
+            n = int(sqlite3.connect(DB_PATH).execute(
+                "SELECT COUNT(*) FROM submissions"
+            ).fetchone()[0])
+            print(f"restored leaderboard.db ({n} submissions)", flush=True)
+        except sqlite3.OperationalError:
+            print("leaderboard.db present but no submissions table yet", flush=True)
+    else:
+        print("no prior leaderboard.db; starting fresh", flush=True)
+
+
+def _submission_count() -> int:
+    if not DB_PATH.exists():
+        return 0
+    try:
+        conn = sqlite3.connect(DB_PATH)
+        try:
+            row = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()
+            return int(row[0]) if row else 0
+        finally:
+            conn.close()
+    except sqlite3.OperationalError:
+        return 0
+
+
+def _atomic_db_copy(dst: Path) -> None:
+    """sqlite3.backup() is lock-safe — readers/writers stay consistent."""
+    src = sqlite3.connect(DB_PATH)
+    try:
+        target = sqlite3.connect(dst)
+        try:
+            src.backup(target)
+        finally:
+            target.close()
+    finally:
+        src.close()
+
+
+def backup_loop() -> None:
+    token = _require_token()
+    last_count = -1
+    print(f"backup_loop started (interval={BACKUP_INTERVAL}s)", flush=True)
+    while True:
+        time.sleep(BACKUP_INTERVAL)
+        n = _submission_count()
+        if n == last_count:
+            continue
+
+        try:
+            tmp = DATA_DIR / "_leaderboard.db.tmp"
+            _atomic_db_copy(tmp)
+            upload_file(
+                path_or_fileobj=str(tmp),
+                path_in_repo="leaderboard.db",
+                repo_id=HF_REPO, repo_type="dataset",
+                token=token,
+                commit_message=f"backup leaderboard ({n} submissions)",
+            )
+            tmp.unlink()
+        except Exception as e:
+            print(f"leaderboard backup failed: {type(e).__name__}: {e}", flush=True)
+            continue
+
+        if ARCHIVE_DIR.exists() and any(ARCHIVE_DIR.rglob("*.csv")):
+            try:
+                upload_folder(
+                    folder_path=str(ARCHIVE_DIR),
+                    path_in_repo="submissions",
+                    repo_id=HF_REPO, repo_type="dataset",
+                    token=token,
+                    commit_message=f"archive submissions ({n} total)",
+                    allow_patterns=["**/*.csv"],
+                )
+            except Exception as e:
+                print(f"submission archive failed: {type(e).__name__}: {e}", flush=True)
+
+        last_count = n
+        print(f"backup pushed: {n} submissions", flush=True)
+
+
+def main() -> int:
+    bootstrap()
+
+    # Make sure server/api.py reads paths consistent with what we just bootstrapped.
+    os.environ.setdefault("GT_DIR", str(GT_DIR))
+    os.environ.setdefault("GT_DB", str(DB_PATH))
+    os.environ.setdefault("GT_ARCHIVE_DIR", str(ARCHIVE_DIR))
+
+    threading.Thread(target=backup_loop, daemon=True).start()
+
+    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+    from api import app  # noqa: E402 — env vars must be set first
+
+    print(f"serving on 0.0.0.0:{PORT}", flush=True)
+    app.run(host="0.0.0.0", port=PORT, threaded=True, use_reloader=False)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
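For a local smoke run before pushing to the Space, something like the following fits the env contract in the docstring above (the token value and data root here are placeholders, not real values):

```python
import os
import subprocess

env = dict(
    os.environ,
    HF_TOKEN="hf_xxx",             # placeholder; needs write scope on the GT repo
    GT_DATA_ROOT="/tmp/gtb-data",  # any writable dir; /data is the Space default
    PORT="8080",
)
# Boots, snapshots the dataset repo, starts the backup thread, serves Flask.
subprocess.run(["python", "server/space/space_entry.py"], env=env, check=True)
```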