Add agents/ harness integrations and HF Space scoring deployment
- agents/cliproxyapi: reusable shim that points any agent's SDK at one
  CLIProxyAPI proxy via anthropic_env / openai_env / gemini_env helpers.
- agents/{ai_build_ai,mlevolve}: runners that stage GraphTestbed task
  data, route LLM calls through the proxy, and harvest submission CSVs.
Tested end-to-end on figraph; both scored on the leaderboard
(aibuildai-claude-sonnet-4-6 0.819, mlevolve-gpt-5.3-codex-spark 0.790).
- agents/common: shared workspace + task-instruction + finalize helpers.
- server/space/: Docker SDK Space deployment. The boot orchestrator in
  space_entry.py snapshot-downloads GT + leaderboard.db from the
  companion private dataset (lanczos/graphtestbed-gt) on startup, then
  runs a daemon thread that backs up sqlite + new submission CSVs
  every 60s via huggingface_hub.upload_file/upload_folder.
- server/api.py: optional GT_ARCHIVE_DIR env writes raw submission CSVs
to disk so the backup loop can ship them to the dataset repo.
- graphtestbed/{submit,leaderboard}.py: default GRAPHTESTBED_API flipped
to the hosted Space URL (env var still overrides for self-hosters).
- pyproject.toml: dependencies were misplaced under [project.urls];
moved to [project] so pip install -e . actually resolves deps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
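The 60-second backup loop described above can be sketched with an injected uploader — a hypothetical stand-in for `huggingface_hub.upload_file`, not the actual `space_entry.py` code — which also makes the tick logic testable offline:

```python
import time
from pathlib import Path


def backup_new_csvs(archive_dir: Path, seen: set, upload) -> list:
    """One backup tick: ship any submission CSVs not uploaded yet.

    `upload(path)` stands in for huggingface_hub.upload_file; the real
    Space code also pushes the sqlite leaderboard on every tick.
    """
    shipped = []
    for csv_path in sorted(archive_dir.glob("*.csv")):
        if csv_path.name not in seen:
            upload(csv_path)  # → dataset repo in the real deployment
            seen.add(csv_path.name)
            shipped.append(csv_path.name)
    return shipped


def backup_loop(archive_dir: Path, upload, interval: int = 60) -> None:
    # Run as a daemon thread in the real Space entrypoint.
    seen: set = set()
    while True:
        backup_new_csvs(archive_dir, seen, upload)
        time.sleep(interval)
```

Keeping the per-tick work in its own function (rather than inline in the loop) is what lets it run under test without threads or network.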
- .gitignore +5 -0
- README.md +46 -9
- agents/README.md +54 -0
- agents/__init__.py +9 -0
- agents/ai_build_ai/README.md +58 -0
- agents/ai_build_ai/__init__.py +6 -0
- agents/ai_build_ai/examples/run_figraph.sh +20 -0
- agents/ai_build_ai/install.sh +31 -0
- agents/ai_build_ai/runner.py +161 -0
- agents/cliproxyapi/README.md +108 -0
- agents/cliproxyapi/__init__.py +33 -0
- agents/cliproxyapi/config.example.yaml +43 -0
- agents/cliproxyapi/endpoint.py +44 -0
- agents/cliproxyapi/env.py +82 -0
- agents/cliproxyapi/health.py +63 -0
- agents/common/__init__.py +1 -0
- agents/common/submit.py +28 -0
- agents/common/tasks.py +63 -0
- agents/common/workspace.py +35 -0
- agents/mlevolve/README.md +75 -0
- agents/mlevolve/__init__.py +10 -0
- agents/mlevolve/adapter.py +79 -0
- agents/mlevolve/examples/run_figraph.sh +16 -0
- agents/mlevolve/install.sh +34 -0
- agents/mlevolve/runner.py +210 -0
- graphtestbed/leaderboard.py +4 -1
- graphtestbed/submit.py +4 -1
- pyproject.toml +5 -5
- server/api.py +13 -0
- server/requirements.txt +1 -0
- server/space/DEPLOY.md +101 -0
- server/space/Dockerfile +38 -0
- server/space/README.md +55 -0
- server/space/push_gt.py +67 -0
- server/space/push_to_space.sh +27 -0
- server/space/space_entry.py +173 -0
.gitignore
@@ -25,3 +25,8 @@ ground_truth*.csv
*test_labels*.csv
private/
**/private/

# Agent harness scratch space
runs/
agents/**/runs/
agents/**/_vendor/
README.md
@@ -10,31 +10,41 @@ Build an agent. Submit predictions. Get a score. Test labels live on a server, n

## Status

**Pre-launch.** The code runs end-to-end. Pieces that aren't fully live yet:

- The package isn't on PyPI yet → install from git (see below)
- HuggingFace dataset repos aren't published yet → use your own `train/val/test_features.csv` files for now

The hosted scoring API at <https://lanczos-graphtestbed.hf.space/> is the
default `gtb submit` target. Set `GRAPHTESTBED_API` to point at a local
server if you'd rather self-host (instructions below).

## Submit to the hosted leaderboard

```bash
pip install git+https://github.com/zhuconv/GraphTestbed
gtb submit figraph --file preds.csv --agent my-agent-v1
# ✓ Scored primary (auc_roc): 0.689 rank: #3
gtb leaderboard figraph
```

The hosted server is a Docker-SDK HF Space that holds GT files in a private
companion dataset and never logs prediction CSVs (it does archive them in
the same private repo for reproducibility — see [`server/space/DEPLOY.md`](server/space/DEPLOY.md)).
Trust model: non-adversarial, 5 submissions/day/IP/task, score bucketed to
3 decimals — same as if you ran the server yourself.

## Run the server locally (alternative)

```bash
git clone https://github.com/zhuconv/GraphTestbed
cd GraphTestbed
GT_DIR=~/path/to/your/ground_truth ./server/run_local.sh
# → Running on http://localhost:8080

# point the client at it
export GRAPHTESTBED_API=http://localhost:8080
gtb submit figraph --file preds.csv --agent my-agent-v1
```

You provide the `ground_truth/<task>.csv` files yourself (one row per test entity, columns `<id_col>,Label`). The CLI never needs to see them.

@@ -175,6 +185,33 @@ You don't modify GraphTestbed. You:
That's it. See [`PROTOCOL.md`](PROTOCOL.md) for edge cases.
</details>

## Reference agent integrations (`agents/`)

Two third-party harnesses ship pre-wired to the testbed; both route LLM
traffic through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI):

| package | upstream | default model |
| --- | --- | --- |
| [`agents.ai_build_ai`](agents/ai_build_ai/README.md) | [aibuildai/AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) | `claude-sonnet-4-6` |
| [`agents.mlevolve`](agents/mlevolve/README.md) | [InternScience/MLEvolve](https://github.com/InternScience/MLEvolve) | `gpt-5.3-codex-spark` |

The proxy integration itself is generic — see
[`agents/cliproxyapi/README.md`](agents/cliproxyapi/README.md) for the
shim helpers (`anthropic_env` / `openai_env` / `gemini_env` /
`openai_yaml_block`) that any future agent can reuse.

```bash
# One-time
export CLIPROXYAPI_KEY=<from your ~/.cli-proxy-api/config.yaml api-keys list>
bash agents/ai_build_ai/install.sh   # or agents/mlevolve/install.sh

# Per task
gtb fetch figraph
python -m agents.ai_build_ai.runner --task figraph
# → prints path to runs/ai_build_ai/figraph/<ts>/submission.csv
gtb submit figraph --file <printed-path> --agent aibuildai-sonnet-4-6
```

## License

[MIT](LICENSE). Data: subject to upstream licenses (Kaggle competition rules, FiGraph CC BY-NC 4.0, etc.).
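A submission is a two-column CSV (the task's `<id_col>` plus a prediction column). A minimal pre-flight check before calling `gtb submit` — a sketch with a hypothetical `check_submission` helper, not part of the shipped CLI — might look like:

```python
import csv


def check_submission(path: str, id_col: str = "id") -> int:
    """Validate a predictions CSV: exactly two columns, first is the id
    column, predictions parse as floats. Returns the number of data rows."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    if len(header) != 2 or header[0] != id_col:
        raise ValueError(f"expected columns [{id_col!r}, <pred>], got {header}")
    for r in body:
        float(r[1])  # raises ValueError if a prediction is not numeric
    return len(body)
```

Catching a malformed file locally saves one of the 5 daily submissions per task.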
agents/README.md
@@ -0,0 +1,54 @@
# `agents/` — third-party harness integrations

Wraps external agent harnesses so they can be pointed at a GraphTestbed task
and produce a `submission.csv` the scoring API understands. LLM traffic is
routed through one local [CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI)
instance via the [`agents.cliproxyapi`](cliproxyapi/README.md) shim.

## Layout

```
agents/
├── cliproxyapi/   # generic Anthropic/OpenAI/Gemini → proxy shim (reusable)
├── common/        # workspace + task-instruction + submit helpers
├── ai_build_ai/   # AI-Build-AI integration (default: claude-sonnet-4-6)
└── mlevolve/      # MLEvolve integration (default: gpt-5.3-codex-spark)
```

`agents/<agent>/_vendor/` (gitignored) holds the upstream binary or git
clone for that agent.

## End-to-end (figraph example)

```bash
# 0. One-time setup of the proxy (see agents/cliproxyapi/README.md)
export CLIPROXYAPI_KEY=<from your config.yaml>

# 1. Fetch the task data once
gtb fetch figraph

# 2. Install whichever agent you want
bash agents/ai_build_ai/install.sh   # downloads upstream tarball
# or
bash agents/mlevolve/install.sh      # git clone + pip install

# 3. Run; the runner prints the produced submission.csv path
python -m agents.ai_build_ai.runner --task figraph
python -m agents.mlevolve.runner --task figraph

# 4. Submit when ready (default is print-and-stop)
gtb submit figraph --file <printed-path> --agent <my-agent-id>
# or pass --submit <name> to the runner to combine 3+4
```

## Adding another agent

1. Create `agents/<new_agent>/{__init__.py,runner.py,install.sh,README.md}`.
2. In `runner.py` import from `agents.cliproxyapi` (one of `anthropic_env`,
   `openai_env`, `gemini_env`, or `openai_yaml_block` per the agent's SDK).
3. Use `agents.common.workspace.make_workspace()` for the run dir,
   `agents.common.tasks.task_instruction()` for the task prompt, and
   `agents.common.submit.finalize()` for validate+optional-submit.

No changes to `agents/cliproxyapi/` or `agents/common/` are required for new
agents that fit one of the three supported SDK shapes.
agents/__init__.py
@@ -0,0 +1,9 @@
"""Agent harness integrations for GraphTestbed.

Each subpackage wraps a third-party agent (AI-Build-AI, MLEvolve, ...) so it
can be pointed at a GraphTestbed task and produce a submission.csv that the
testbed scoring API understands.

LLM traffic for every agent flows through a single CLIProxyAPI instance — see
`agents.cliproxyapi` for the reusable shim.
"""
agents/ai_build_ai/README.md
@@ -0,0 +1,58 @@
# `agents.ai_build_ai`

Runs [AI-Build-AI](https://github.com/aibuildai/AI-Build-AI) on a GraphTestbed
task. AI-Build-AI is an Anthropic-SDK-based auto-ML harness that designs,
trains, and ranks candidate models from a task description.

Default model: **`claude-sonnet-4-6`** (override with `--model`).

## Install

```bash
bash agents/ai_build_ai/install.sh   # downloads upstream tarball into _vendor/
# Linux x86_64 only — upstream constraint.
```

The vendored binary lands at `agents/ai_build_ai/_vendor/aibuildai`. Set
`AIBUILDAI_BIN` if you put it elsewhere.

## Run

```bash
# Proxy must be running and CLIPROXYAPI_KEY set — see agents/cliproxyapi/README.md
gtb fetch figraph   # one-time per task
python -m agents.ai_build_ai.runner --task figraph
```

Output:

```
runs/ai_build_ai/figraph/<timestamp>/
├── data/            # symlinks to fetched dataset CSVs
├── playground/      # AI-Build-AI's working dir (candidate_*/, …)
├── instruction.md   # generated task prompt
├── agent.log        # full stdout+stderr from the binary
└── submission.csv   # normalized to match the testbed schema
```

The runner prints `submission.csv`'s path; submit when ready:

```bash
gtb submit figraph --file runs/ai_build_ai/figraph/<ts>/submission.csv \
    --agent aibuildai-sonnet-4-6
# or, in one step:
python -m agents.ai_build_ai.runner --task figraph --submit aibuildai-sonnet-4-6
```

## Knobs

| flag | default | upstream meaning |
| --- | --- | --- |
| `--model` | `claude-sonnet-4-6` | model alias, sent to the proxy |
| `--budget-min` | 60 | per-run training budget |
| `--pipeline-budget-min` | 90 | total pipeline budget |
| `--max-agent-calls` | 8 | LLM call cap per candidate |
| `--num-candidates` | 3 | how many model variants to generate |

The `--model` string must exist in your CLIProxyAPI `oauth-model-alias.claude`
mapping (or be a real model your Claude account exposes).
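The alias requirement can be illustrated with a sketch of how such a lookup behaves — the mapping shape here is an assumption based on the config keys, not CLIProxyAPI's actual resolution code:

```python
def resolve_model(requested: str, aliases: dict,
                  upstream_models: set) -> str:
    """Resolve a client-sent model string: alias first, then pass-through
    if the upstream account really exposes it; otherwise reject."""
    if requested in aliases:
        return aliases[requested]
    if requested in upstream_models:
        return requested
    raise KeyError(
        f"model {requested!r} has no alias and is not exposed upstream"
    )
```

If a run fails immediately with a model-not-found error from the proxy, this lookup is the first thing to check.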
agents/ai_build_ai/__init__.py
@@ -0,0 +1,6 @@
"""AI-Build-AI integration (github.com/aibuildai/AI-Build-AI).

Wraps the `aibuildai` release binary so it can run against any GraphTestbed
task. LLM traffic is forced through CLIProxyAPI by setting ANTHROPIC_BASE_URL
and ANTHROPIC_API_KEY before launching the binary.
"""
agents/ai_build_ai/examples/run_figraph.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash
# End-to-end smoke test of AI-Build-AI on the `figraph` task.
# Assumes:
#   - CLIProxyAPI is running and CLIPROXYAPI_KEY is set (see agents/cliproxyapi/README.md)
#   - `gtb fetch figraph` has been run, OR a local copy of figraph CSVs sits
#     at $GRAPHTESTBED_CACHE/figraph/
#   - `bash agents/ai_build_ai/install.sh` has put the binary in _vendor/
set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
cd "${REPO_ROOT}"

: "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"

python3 -m agents.ai_build_ai.runner \
  --task figraph \
  --model "${MODEL:-claude-sonnet-4-6}" \
  --budget-min "${BUDGET_MIN:-30}" \
  --num-candidates "${NUM_CANDIDATES:-2}" \
  "${@}"
agents/ai_build_ai/install.sh
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
# Install the AI-Build-AI release tarball into agents/ai_build_ai/_vendor/.
# Re-run any time to upgrade. Linux x86_64 only (upstream constraint).
#
# Override the release with: AIBUILDAI_VERSION=v0.1.1 bash install.sh
set -euo pipefail

VERSION="${AIBUILDAI_VERSION:-v0.1.1}"
HERE="$(cd "$(dirname "$0")" && pwd)"
DEST="${HERE}/_vendor"
TARBALL="aibuildai-linux-x86_64-${VERSION}.tar.gz"
URL="https://github.com/aibuildai/AI-Build-AI/releases/download/${VERSION}/${TARBALL}"

mkdir -p "${DEST}"
cd "${DEST}"

echo "Downloading ${URL}"
curl -fL --retry 3 -o "${TARBALL}" "${URL}"
echo "Unpacking ${TARBALL}"
tar -xzf "${TARBALL}"
rm -f "${TARBALL}"

# Upstream tarball ships an install.sh that finalizes setup (PATH hints etc.)
if [[ -x ./install.sh ]]; then
  echo "Running upstream install.sh"
  ./install.sh
fi

echo
echo "Installed AI-Build-AI ${VERSION} under ${DEST}"
echo "Set AIBUILDAI_BIN to the binary path if it isn't on \$PATH after this."
agents/ai_build_ai/runner.py
@@ -0,0 +1,161 @@
"""Run AI-Build-AI on a GraphTestbed task, routed through CLIProxyAPI.

Usage:
    python -m agents.ai_build_ai.runner --task figraph
    python -m agents.ai_build_ai.runner --task figraph \\
        --model claude-sonnet-4-6 --budget-min 30
    python -m agents.ai_build_ai.runner --task figraph \\
        --submit aibuildai-sonnet-4-6

Exit codes mirror the wrapped binary.
"""

from __future__ import annotations

import argparse
import os
import shutil
import subprocess
import sys
from pathlib import Path

import pandas as pd

from agents.cliproxyapi import ProxyEndpoint, anthropic_env, wait_until_ready
from agents.common.submit import finalize
from agents.common.tasks import task_instruction
from agents.common.workspace import make_workspace, stage_dataset
from graphtestbed._manifest import task_config
from graphtestbed.fetch import cache_dir

DEFAULT_MODEL = "claude-sonnet-4-6"


def _resolve_binary() -> str:
    explicit = os.environ.get("AIBUILDAI_BIN")
    if explicit:
        return explicit
    on_path = shutil.which("aibuildai")
    if on_path:
        return on_path
    vendored = Path(__file__).parent / "_vendor" / "aibuildai"
    if vendored.exists():
        return str(vendored)
    raise SystemExit(
        "Cannot locate the `aibuildai` binary.\n"
        "  Install it: bash agents/ai_build_ai/install.sh\n"
        "  Or set AIBUILDAI_BIN to the full path."
    )


def _stage_input(task: str, dst: Path) -> None:
    src = cache_dir() / task
    if not src.exists():
        raise SystemExit(
            f"No cached dataset at {src}. Run `gtb fetch {task}` first.\n"
            f"(For pre-launch tasks, drop your local CSVs into {src}/.)"
        )
    cfg = task_config(task)
    files = [spec["filename"] for spec in cfg["files"].values()]
    stage_dataset(src, dst, files)


def _harvest_submission(task: str, playground: Path, dst: Path) -> Path:
    """Pick the latest submission.csv produced under playground/, normalize cols."""
    schema = task_config(task)["submission_schema"]
    candidates = sorted(
        playground.rglob("submission.csv"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise SystemExit(
            f"No submission.csv found under {playground}.\n"
            f"  Inspect the agent's logs to see what happened: "
            f"{playground.parent / 'agent.log'}"
        )
    chosen = candidates[-1]
    df = pd.read_csv(chosen)
    expected = [schema["id_col"], schema["pred_col"]]
    if list(df.columns) != expected:
        if len(df.columns) == 2:
            print(f"  (renaming columns {list(df.columns)} → {expected})")
            df.columns = expected
        else:
            raise SystemExit(
                f"Cannot normalize {chosen}: got columns {list(df.columns)}, "
                f"expected {expected}"
            )
    out = dst / "submission.csv"
    df.to_csv(out, index=False)
    print(f"✓ Picked {chosen.relative_to(playground.parent)}")
    return out


def main() -> None:
    ap = argparse.ArgumentParser(prog="agents.ai_build_ai.runner")
    ap.add_argument("--task", required=True,
                    help="A task name from datasets/manifest.yaml")
    ap.add_argument("--model", default=DEFAULT_MODEL,
                    help=f"Model alias passed to aibuildai (default: {DEFAULT_MODEL})")
    ap.add_argument("--budget-min", type=int, default=60,
                    help="--run-budget-minutes for aibuildai (default: 60)")
    ap.add_argument("--pipeline-budget-min", type=int, default=90,
                    help="--pipeline-budget-minutes (default: 90)")
    ap.add_argument("--max-agent-calls", type=int, default=8)
    ap.add_argument("--num-candidates", type=int, default=3)
    ap.add_argument("--submit", default=None, metavar="AGENT_ID",
                    help="If set, POST the produced submission.csv to the "
                         "GraphTestbed scoring API as this agent name.")
    ap.add_argument("--workspace-root", type=Path, default=None,
                    help="Override the runs/ root (default: ./runs)")
    args = ap.parse_args()

    binary = _resolve_binary()
    ep = ProxyEndpoint.from_env()
    wait_until_ready(ep)
    print(f"✓ Proxy ready at {ep.base_url()}")

    ws = make_workspace("ai_build_ai", args.task, args.workspace_root)
    data = ws / "data"
    play = ws / "playground"
    play.mkdir()
    _stage_input(args.task, data)

    instruction = task_instruction(args.task)
    (ws / "instruction.md").write_text(instruction)

    cmd = [
        binary,
        "--task-name", args.task,
        "--data-dir", str(data),
        "--playground-dir", str(play),
        "--model", args.model,
        "--instruction", instruction,
        "--max-agent-calls", str(args.max_agent_calls),
        "--run-budget-minutes", str(args.budget_min),
        "--pipeline-budget-minutes", str(args.pipeline_budget_min),
        "--num-candidates", str(args.num_candidates),
        "--no-form",
    ]
    env = {**os.environ, **anthropic_env(ep, model=args.model)}
    # aibuildai ships a bundled `claude` binary that aborts if it detects an
    # outer Claude Code session via these env vars. Strip them so the inner
    # claude treats this as a fresh top-level invocation.
    for k in ("CLAUDECODE", "CLAUDE_CODE_ENTRYPOINT", "CLAUDE_CODE_SSE_PORT"):
        env.pop(k, None)

    print(f"→ Launching {Path(binary).name} task={args.task} model={args.model}")
    print(f"   workspace: {ws}")
    log = ws / "agent.log"
    with log.open("wb") as lf:
        rc = subprocess.call(cmd, env=env, stdout=lf, stderr=subprocess.STDOUT)
    print(f"   exit={rc} log={log}")
    if rc != 0:
        sys.exit(rc)

    sub = _harvest_submission(args.task, play, ws)
    finalize(args.task, sub, args.submit)


if __name__ == "__main__":
    main()
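`_harvest_submission`'s rename-if-two-columns behavior can be exercised in isolation; this standalone re-implementation (plain lists, no pandas) mirrors the logic above:

```python
def normalize_columns(rows: list, expected: list) -> list:
    """Mirror of the runner's normalization: accept any 2-column header by
    renaming it to the task schema; reject anything else."""
    header, body = rows[0], rows[1:]
    if header == expected:
        return rows
    if len(header) == 2:
        return [expected] + body  # rename header, keep the data rows
    raise SystemExit(
        f"Cannot normalize: got columns {header}, expected {expected}"
    )
```

The permissive 2-column rename matters because each harness names its prediction column differently; only a wrong column *count* is treated as unrecoverable.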
agents/cliproxyapi/README.md
@@ -0,0 +1,108 @@
# `agents.cliproxyapi`

Reusable shim that points any agent's LLM SDK at a single local
[CLIProxyAPI](https://github.com/router-for-me/CLIProxyAPI) instance.

## Why a shim

Every agent we test uses a different SDK (Anthropic, OpenAI/Codex, Gemini)
and a different way of being told "talk to this base URL with this key".
This package collapses that into three function calls.

## Public surface

```python
from agents.cliproxyapi import (
    ProxyEndpoint,      # where + key (read from env)
    anthropic_env,      # → dict, splice into subprocess env
    openai_env,
    gemini_env,
    openai_yaml_block,  # → dict, drop into a YAML config
    wait_until_ready,   # TCP probe; raise SystemExit on miss
    spawn_proxy,        # ctx-manager (opt-in; mostly for CI)
)
```

`ProxyEndpoint.from_env()` reads:

| env var | default |
| --- | --- |
| `CLIPROXYAPI_HOST` | `127.0.0.1` |
| `CLIPROXYAPI_PORT` | `8317` |
| `CLIPROXYAPI_KEY` | *required* |

## Recipe per SDK shape

### Anthropic SDK / Claude Code (`claude`, `aibuildai`, ...)
```python
ep = ProxyEndpoint.from_env()
env = {**os.environ, **anthropic_env(ep, model="claude-sonnet-4-6")}
subprocess.run([...], env=env)
```
Sets `ANTHROPIC_BASE_URL`, `ANTHROPIC_API_KEY`, `ANTHROPIC_AUTH_TOKEN`,
`ANTHROPIC_MODEL`.

### OpenAI / Codex CLI / any OpenAI-compatible SDK
```python
env = {**os.environ, **openai_env(ep, model="gpt-5.3-codex-spark")}
```
Sets `OPENAI_BASE_URL=…/v1`, `OPENAI_API_KEY`, `OPENAI_API_BASE`,
`OPENAI_MODEL`.

### Gemini SDK
```python
env = {**os.environ, **gemini_env(ep, model="gemini-2-pro-preview")}
```

### YAML configs (e.g. MLEvolve)
```python
block = openai_yaml_block(ep, model="gpt-5.3-codex-spark")
# → {"model": ..., "base_url": "http://127.0.0.1:8317/v1", "api_key": ...}
config["agent"]["code"].update(block)
config["agent"]["feedback"].update(block)
```

## Setting up the proxy itself

1. Install:
   ```bash
   git clone https://github.com/router-for-me/CLIProxyAPI && cd CLIProxyAPI
   docker compose up -d   # or: go build -o cliproxy ./cmd/...
   ```
2. Drop in a config (start from
   [`config.example.yaml`](config.example.yaml) here):
   ```bash
   mkdir -p ~/.cli-proxy-api
   cp agents/cliproxyapi/config.example.yaml ~/.cli-proxy-api/config.yaml
   $EDITOR ~/.cli-proxy-api/config.yaml   # set api-keys[0] + aliases
   ```
3. Run interactively once to OAuth-log into Claude / Codex / Gemini accounts.
4. Export client-side env vars:
   ```bash
   export CLIPROXYAPI_KEY=<the api-keys[0] you set>
   # CLIPROXYAPI_HOST/PORT only needed if you bind elsewhere
   ```
5. Smoke-test:
   ```bash
   curl -s -H "Authorization: Bearer $CLIPROXYAPI_KEY" \
        http://127.0.0.1:8317/v1/models | head
   ```

Once the proxy is up and `CLIPROXYAPI_KEY` is set, every agent runner in
`agents/*/runner.py` works without further configuration.

## Adding a new agent that uses the proxy

```python
# agents/my_agent/runner.py
from agents.cliproxyapi import ProxyEndpoint, openai_env, wait_until_ready

ep = ProxyEndpoint.from_env()
wait_until_ready(ep)
subprocess.run(
    ["my-agent-binary", "--task", task, "--model", model],
    env={**os.environ, **openai_env(ep, model=model)},
)
```

That's the entire integration.
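Based on the variables this README says `anthropic_env` sets, the helper's shape is roughly the following — a sketch reconstructed from the documented behavior, not the actual `agents/cliproxyapi/env.py`:

```python
from dataclasses import dataclass


@dataclass
class ProxyEndpoint:
    # Defaults mirror the from_env() table above.
    host: str = "127.0.0.1"
    port: int = 8317
    key: str = ""

    def base_url(self) -> str:
        return f"http://{self.host}:{self.port}"


def anthropic_env(ep: ProxyEndpoint, model: str) -> dict:
    # The four variables the README documents for the Anthropic SDK shape.
    return {
        "ANTHROPIC_BASE_URL": ep.base_url(),
        "ANTHROPIC_API_KEY": ep.key,
        "ANTHROPIC_AUTH_TOKEN": ep.key,
        "ANTHROPIC_MODEL": model,
    }
```

Splicing the dict over `os.environ` (rather than replacing it) is what lets the wrapped binary keep its PATH and locale while its SDK is redirected.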
agents/cliproxyapi/__init__.py
@@ -0,0 +1,33 @@
"""Generic CLIProxyAPI integration shared by every agent runner.

CLIProxyAPI (github.com/router-for-me/CLIProxyAPI) is a single local proxy
that bridges Anthropic, OpenAI/Codex, and Gemini protocol surfaces on one
port. Pointing every agent at it lets us share OAuth state, credentials, and
rate-limit budget across many harnesses.

Public surface — three things:

    ProxyEndpoint → where the proxy is + what API key to send
    {anthropic,openai,gemini}_env(ep, model=...) → env-var dicts to splice
        into subprocess.Popen
    openai_yaml_block(ep, model) → snippet for agents whose configs take
        base_url/api_key/model directly

Plus `wait_until_ready(ep)` for runners that should fail fast if the proxy
isn't up, and an opt-in `spawn_proxy()` ctx-manager for one-off testing.
"""

from .endpoint import ProxyEndpoint
from .env import anthropic_env, gemini_env, openai_env, openai_yaml_block
from .health import is_ready, spawn_proxy, wait_until_ready

__all__ = [
    "ProxyEndpoint",
    "anthropic_env",
    "gemini_env",
    "openai_env",
    "openai_yaml_block",
    "is_ready",
    "spawn_proxy",
    "wait_until_ready",
]
|
agents/cliproxyapi/config.example.yaml
@@ -0,0 +1,43 @@
# Minimal CLIProxyAPI config for GraphTestbed agent runs.
#
# Place at ~/.cli-proxy-api/config.yaml (or pass --config /path/to/this file
# when launching the proxy). Full schema:
# https://github.com/router-for-me/CLIProxyAPI
#
# Quickstart:
#   1. Replace the api-keys[0] placeholder with `openssl rand -hex 16`.
#   2. Export the same value as CLIPROXYAPI_KEY in the shell that runs the
#      agents (so the agent's SDK sends it; the proxy validates it).
#   3. Launch the proxy interactively once and complete the OAuth flow for
#      each upstream account you intend to use (Claude / Codex / Gemini).
#   4. Adjust `oauth-model-alias.{claude,codex}` so the model strings the
#      agents send (e.g. `claude-sonnet-4-6`, `gpt-5.3-codex-spark`) resolve
#      to whatever upstream IDs your subscriptions actually expose.

host: "127.0.0.1"
port: 8317
auth-dir: "~/.cli-proxy-api"

api-keys:
  - "REPLACE-WITH-OPENSSL-RAND-HEX-16"

strategy: "round-robin"
session-affinity-ttl: "1h"

# Upstream Claude OAuth account(s). Run the proxy once with your browser open
# to log in; the proxy then caches refresh tokens under auth-dir.
claude-api-key: []

# Upstream Codex OAuth account(s). Same pattern.
codex-api-key: []

# Map the alias names our agents send → actual upstream model IDs.
# AI-Build-AI sends `--model claude-sonnet-4-6` (or whatever you pick).
# MLEvolve sends the model string from agents/mlevolve/runner.py's --model.
oauth-model-alias:
  claude:
    # Match the string the agent's runner sends; map to whatever your Claude
    # subscription actually exposes (check `curl ${proxy}/v1/models`).
    claude-sonnet-4-6: "<upstream-claude-id>"
  codex:
    gpt-5.3-codex-spark: "<upstream-codex-id>"

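The quickstart's first two steps can be sketched in Python; `secrets.token_hex(16)` is a stand-in for `openssl rand -hex 16`, producing the same 32-hex-char shape:

```python
# Mint a shared secret and export it as the quickstart describes.
# Paste the same value into api-keys[0] of config.yaml.
import os
import secrets

key = secrets.token_hex(16)          # 32 lowercase hex chars
os.environ["CLIPROXYAPI_KEY"] = key  # what the agent SDKs will send
print(key)
```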
agents/cliproxyapi/endpoint.py
@@ -0,0 +1,44 @@
"""ProxyEndpoint — single source of truth for "where is the proxy + what key".

Every agent runner reads this from environment, then hands the resulting
object to `agents.cliproxyapi.env.*` to build SDK-specific configuration.

Env vars:
    CLIPROXYAPI_HOST  default 127.0.0.1
    CLIPROXYAPI_PORT  default 8317 (CLIProxyAPI's stock port)
    CLIPROXYAPI_KEY   required — must match one of the api-keys: entries
                      in your CLIProxyAPI config.yaml
"""

from __future__ import annotations

import os
from dataclasses import dataclass

DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 8317


@dataclass(frozen=True)
class ProxyEndpoint:
    host: str = DEFAULT_HOST
    port: int = DEFAULT_PORT
    api_key: str = ""

    @classmethod
    def from_env(cls) -> "ProxyEndpoint":
        host = os.environ.get("CLIPROXYAPI_HOST", DEFAULT_HOST)
        port = int(os.environ.get("CLIPROXYAPI_PORT", str(DEFAULT_PORT)))
        api_key = os.environ.get("CLIPROXYAPI_KEY", "").strip()
        if not api_key:
            raise SystemExit(
                "CLIPROXYAPI_KEY is unset. Set it to one of the api-keys "
                "you've configured in your CLIProxyAPI config.yaml.\n"
                "Example:\n"
                "  export CLIPROXYAPI_KEY=$(grep -A1 'api-keys:' "
                "~/.cli-proxy-api/config.yaml | tail -1 | tr -d ' \"-')"
            )
        return cls(host=host, port=port, api_key=api_key)

    def base_url(self, scheme: str = "http") -> str:
        return f"{scheme}://{self.host}:{self.port}"

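As a standalone illustration of the endpoint's resolution order (a mini re-sketch of the dataclass above, not an import of it):

```python
# Env vars override the stock 127.0.0.1:8317 defaults, and base_url()
# composes scheme://host:port with no trailing path.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ProxyEndpoint:
    host: str = "127.0.0.1"
    port: int = 8317
    api_key: str = ""

    def base_url(self, scheme: str = "http") -> str:
        return f"{scheme}://{self.host}:{self.port}"

os.environ["CLIPROXYAPI_PORT"] = "9000"  # pretend the user overrode it
ep = ProxyEndpoint(port=int(os.environ["CLIPROXYAPI_PORT"]), api_key="k")
print(ep.base_url())  # → http://127.0.0.1:9000
```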
agents/cliproxyapi/env.py
@@ -0,0 +1,82 @@
"""Build env-var dicts (or YAML-config snippets) that point an SDK at the proxy.

Three SDK shapes are covered today; add more here as agents arrive:

    anthropic_env(ep, model) → Anthropic SDK / Claude Code CLI
    openai_env(ep, model)    → OpenAI SDK / Codex CLI
    gemini_env(ep, model)    → google-generativeai SDK / gemini-cli

Plus `openai_yaml_block(ep, model)` for agents whose config files take
`base_url` / `api_key` / `model` fields directly (e.g. MLEvolve).

Usage from any agent runner:

    from agents.cliproxyapi import ProxyEndpoint, anthropic_env
    ep = ProxyEndpoint.from_env()
    subprocess.run(cmd, env={**os.environ, **anthropic_env(ep, model="...")})
"""

from __future__ import annotations

from .endpoint import ProxyEndpoint


def anthropic_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
    """Env vars consumed by anthropic-python and claude-code.

    The Anthropic SDK appends `/v1/messages` to ANTHROPIC_BASE_URL itself,
    so we hand it the proxy root (no trailing path).
    """
    env = {
        "ANTHROPIC_BASE_URL": ep.base_url(),
        "ANTHROPIC_API_KEY": ep.api_key,
        "ANTHROPIC_AUTH_TOKEN": ep.api_key,
    }
    if model:
        env["ANTHROPIC_MODEL"] = model
    return env


def openai_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
    """Env vars consumed by openai-python, codex-cli, and many compatible SDKs.

    The OpenAI SDK appends `/chat/completions` (and other paths) to
    OPENAI_BASE_URL, so we include the `/v1` prefix here.
    """
    env = {
        "OPENAI_BASE_URL": f"{ep.base_url()}/v1",
        "OPENAI_API_KEY": ep.api_key,
        "OPENAI_API_BASE": f"{ep.base_url()}/v1",  # legacy var, still common
    }
    if model:
        env["OPENAI_MODEL"] = model
    return env


def gemini_env(ep: ProxyEndpoint, model: str | None = None) -> dict[str, str]:
    """Env vars consumed by google-generativeai and gemini-cli.

    The proxy exposes Gemini's `/v1beta/models/.../generateContent` shape on
    the proxy root — clients prepend nothing.
    """
    env = {
        "GEMINI_API_BASE": ep.base_url(),
        "GOOGLE_API_KEY": ep.api_key,
        "GEMINI_API_KEY": ep.api_key,
    }
    if model:
        env["GEMINI_MODEL"] = model
    return env


def openai_yaml_block(ep: ProxyEndpoint, model: str) -> dict[str, str]:
    """Three-key dict for configs that name the proxy directly (e.g. MLEvolve).

    Returns:
        {"model": ..., "base_url": ".../v1", "api_key": ...}
    """
    return {
        "model": model,
        "base_url": f"{ep.base_url()}/v1",
        "api_key": ep.api_key,
    }

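How an env helper's output splices into a child-process environment, sketched without the real module (`base` and `test-key` are placeholder values, not real config):

```python
# Values from the proxy dict win over anything already exported, because
# they appear last in the dict merge.
import os

base = "http://127.0.0.1:8317"
proxy_env = {
    "OPENAI_BASE_URL": f"{base}/v1",  # OpenAI SDK adds /chat/completions itself
    "OPENAI_API_KEY": "test-key",
    "OPENAI_API_BASE": f"{base}/v1",  # legacy var, still common
}
child_env = {**os.environ, **proxy_env}
print(child_env["OPENAI_BASE_URL"])  # → http://127.0.0.1:8317/v1
```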
agents/cliproxyapi/health.py
@@ -0,0 +1,63 @@
"""Probe and (optionally) spawn the CLIProxyAPI process.

`wait_until_ready` does a TCP connect — endpoint-agnostic, so it works no
matter which protocol surfaces the proxy version exposes.

`spawn_proxy` is a context manager for tests / one-off CI runs. Most users
should run the proxy out-of-band: it owns long-lived OAuth tokens and may
serve other tools besides the testbed.
"""

from __future__ import annotations

import contextlib
import socket
import subprocess
import time
from pathlib import Path

from .endpoint import ProxyEndpoint


def is_ready(ep: ProxyEndpoint, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((ep.host, ep.port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(ep: ProxyEndpoint, timeout: float = 30.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready(ep):
            return
        time.sleep(0.5)
    raise SystemExit(
        f"CLIProxyAPI at {ep.base_url()} did not respond within {timeout:.0f}s.\n"
        "Start it (e.g. `cliproxy --config ~/.cli-proxy-api/config.yaml`) "
        "and confirm CLIPROXYAPI_HOST / CLIPROXYAPI_PORT."
    )


@contextlib.contextmanager
def spawn_proxy(
    config_path: str | Path,
    binary: str = "cliproxy",
    timeout: float = 30.0,
):
    ep = ProxyEndpoint.from_env()
    proc = subprocess.Popen(
        [binary, "--config", str(config_path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    try:
        wait_until_ready(ep, timeout=timeout)
        yield ep
    finally:
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()

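The TCP-connect probe behind `is_ready` can be exercised against a throwaway listener on an ephemeral port (no proxy needed for this check):

```python
# Bind an ephemeral listener, then confirm a connect-style probe succeeds.
import socket

srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # kernel picks a free port
srv.listen(1)
host, port = srv.getsockname()

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

ready = probe(host, port)
srv.close()
print(ready)  # → True
```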
agents/common/__init__.py
@@ -0,0 +1 @@
"""Shared adapter helpers between testbed and individual agent runners."""

agents/common/submit.py
@@ -0,0 +1,28 @@
"""Validate and (optionally) submit an agent's output to the GraphTestbed API.

Default mode is print-and-stop: the runner reports the path to the produced
submission.csv but does not POST. Pass `--submit <agent-name>` to the runner
to actually call the scoring API.
"""

from __future__ import annotations

from pathlib import Path

from graphtestbed.submit import submit as gtb_submit
from graphtestbed.submit import validate_submission


def finalize(task: str, csv_path: Path, agent: str | None) -> None:
    info = validate_submission(task, csv_path)
    print()
    print("✓ Submission ready")
    print(f"  file: {csv_path}")
    print(f"  rows: {info['n_rows']}")
    print(f"  sha256: {info['sha256'][:12]}...")
    if agent:
        gtb_submit(task, csv_path, agent, dry_run=False)
    else:
        print()
        print("(not submitted — pass --submit <agent-name> to POST)")
        print(f"  manual: gtb submit {task} --file {csv_path} --agent <name>")

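The summary `finalize` prints can be mimicked with stdlib pieces; `csv`/`hashlib` stand in for the real `validate_submission`, and `NodeID`/`Label` are illustrative column names:

```python
# Row count + sha256 prefix for a submission CSV.
import csv
import hashlib
import io

csv_text = "NodeID,Label\n1,0.91\n2,0.07\n"
n_rows = sum(1 for _ in csv.DictReader(io.StringIO(csv_text)))
sha256 = hashlib.sha256(csv_text.encode()).hexdigest()
print(f"rows: {n_rows}, sha256: {sha256[:12]}...")
```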
agents/common/tasks.py
@@ -0,0 +1,63 @@
"""Render a per-task instruction markdown for any agent.

Pulls the canonical task description from datasets/manifest.yaml and decorates
it with the submission contract (id col, pred col, n rows, metric).

Per-task overrides — handcrafted prompts that beat the auto-generated text —
live in agents/common/tasks_md/<task>.md and take priority when present.
"""

from __future__ import annotations

from pathlib import Path

from graphtestbed._manifest import task_config

_TEMPLATE = """\
# Task: {task}

{description}

## Files you will see

- `train_features.csv` — labeled training rows
- `val_features.csv` — labeled validation rows (use for HPO / early stopping)
- `test_features.csv` — **unlabeled** test rows; predict here

The `Label` (or task-specific target) column is present in train/val and
absent from test. Do not attempt to recover test labels from upstream sources.

## Submission format

Write a CSV with **exactly two columns**, in this order:

| column | type | meaning |
| --- | --- | --- |
| `{id_col}` | id | matches `test_features.csv[{id_col}]` 100% |
| `{pred_col}` | float in [0, 1] | predicted score |

Row count: **{n_rows}**.

## Metric

You will be evaluated on `{primary}` (primary). Secondary: {secondary}.
Optimize for the primary metric.
"""


def task_instruction(task: str) -> str:
    override = Path(__file__).parent / "tasks_md" / f"{task}.md"
    if override.exists():
        return override.read_text()
    cfg = task_config(task)
    s = cfg["submission_schema"]
    m = cfg["metric"]
    return _TEMPLATE.format(
        task=task,
        description=str(cfg.get("description", "")).strip(),
        id_col=s["id_col"],
        pred_col=s["pred_col"],
        n_rows=s.get("n_rows", "?"),
        primary=m["primary"],
        secondary=", ".join(m.get("secondary", [])) or "(none)",
    )

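The `_TEMPLATE.format(...)` call in miniature, on an abbreviated template; the field values (`NodeID`, `auroc`, `1234`) are illustrative, not real figraph schema:

```python
# Same str.format mechanics as task_instruction(), abbreviated template.
template = (
    "# Task: {task}\n\n"
    "| `{id_col}` | id |\n"
    "| `{pred_col}` | float in [0, 1] |\n\n"
    "Row count: **{n_rows}**. Primary metric: `{primary}`.\n"
)
doc = template.format(
    task="figraph", id_col="NodeID", pred_col="Label",
    n_rows=1234, primary="auroc",
)
print(doc.splitlines()[0])  # → # Task: figraph
```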
agents/common/workspace.py
@@ -0,0 +1,35 @@
"""Ephemeral workspace dirs and dataset staging for agent runs.

Each runner allocates `runs/<agent>/<task>/<timestamp>/` so concurrent runs
don't collide and post-mortems are always recoverable from disk.
"""

from __future__ import annotations

import datetime as dt
from pathlib import Path


def make_workspace(agent: str, task: str, root: Path | None = None) -> Path:
    root = Path(root) if root else Path.cwd() / "runs"
    ts = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
    ws = root / agent / task / ts
    ws.mkdir(parents=True, exist_ok=False)
    return ws


def stage_dataset(src_dir: Path, dst_dir: Path, files: list[str]) -> None:
    """Symlink each `files[i]` from src_dir into dst_dir.

    Symlinks (vs copies) keep large CSVs on the cache disk; the agent reads
    from src via the link transparently.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    for f in files:
        s = src_dir / f
        if not s.exists():
            raise SystemExit(f"Missing dataset file: {s}")
        d = dst_dir / f
        if d.is_symlink() or d.exists():
            d.unlink()
        d.symlink_to(s.resolve())

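The staging step in miniature: link one "cached" CSV into a fresh workspace dir instead of copying it. Tempdirs stand in for the real cache and `runs/` tree (POSIX symlinks assumed):

```python
# Link, don't copy: the large file stays on the cache disk and the
# workspace reads it through the symlink transparently.
import tempfile
from pathlib import Path

src_dir = Path(tempfile.mkdtemp())
dst_dir = Path(tempfile.mkdtemp()) / "dataset"
(src_dir / "train_features.csv").write_text("NodeID,Label\n1,1\n")

dst_dir.mkdir(parents=True, exist_ok=True)
link = dst_dir / "train_features.csv"
if link.is_symlink() or link.exists():
    link.unlink()
link.symlink_to((src_dir / "train_features.csv").resolve())

print(link.is_symlink())  # → True
```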
agents/mlevolve/README.md
@@ -0,0 +1,75 @@
# `agents.mlevolve`

Runs [MLEvolve](https://github.com/InternScience/MLEvolve) on a GraphTestbed
task. MLEvolve is an MCGS auto-ML harness wired for OpenAI-compatible APIs.

Default model: **`gpt-5.3-codex-spark`** (a pipe-through alias you define in
your CLIProxyAPI `oauth-model-alias.codex` block).

## Install

```bash
bash agents/mlevolve/install.sh
# heavy: clones the repo + pip-installs torch and ML deps (~5-10 GB).
```

Lands at `agents/mlevolve/_vendor/MLEvolve/`. Set `MLEVOLVE_DIR` if you
already have a clone elsewhere.

## Run

```bash
gtb fetch figraph
python -m agents.mlevolve.runner --task figraph
```

Output:

```
runs/mlevolve/figraph/<timestamp>/
├── mlebench-tree/figraph/
│   ├── prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
│   ├── prepared/private/test.csv   # val labels — local grader uses this
│   └── REAL_TEST_FEATURES.csv      # the actual test split, for re-execute
├── agent.log
└── val_submission.csv              # MLEvolve's best on the val "test" split
```

## ⚠ v1 limitation: val-as-test

GraphTestbed's actual test labels live on the scoring server, not on disk.
For the local mle-bench grader to function, the adapter exposes
`val_features.csv` (with labels) as the "test" set MLEvolve searches against.

The CSV the runner harvests is therefore predictions on **val**, not test.
To submit a real test-set score:

1. Open `agents/mlevolve/_vendor/MLEvolve/runs/<latest-ts>/` and find the
   best `runfile.py` (ranked by best score in the run's tree summary).
2. Re-execute it against the real test split:
   ```bash
   cd <some scratch dir>
   cp <ws>/mlebench-tree/figraph/REAL_TEST_FEATURES.csv ./test.csv
   cp <ws>/mlebench-tree/figraph/prepared/public/train.csv ./train.csv
   python <runfile>   # produces submission.csv
   ```
3. Submit:
   ```bash
   gtb submit figraph --file ./submission.csv --agent mlevolve-codex-spark
   ```

This step is manual in v1 because the structure of MLEvolve's `runfile.py`
varies per task and we don't want to silently mis-execute. It is on the
roadmap to automate.

## Knobs

| flag | default | meaning |
| --- | --- | --- |
| `--model` | `gpt-5.3-codex-spark` | sent to proxy via OPENAI_BASE_URL/v1 |
| `--steps` | 100 | MCGS exploration count (upstream default: 500) |
| `--time-limit-min` | 120 | per-task wall-clock cap (upstream default: 720) |
| `--gpus` | 0 | passed to `search.num_gpus` |

The `--model` string must exist in your CLIProxyAPI
`oauth-model-alias.codex` (or be a real model your Codex account exposes).

agents/mlevolve/__init__.py
@@ -0,0 +1,10 @@
"""MLEvolve integration (github.com/InternScience/MLEvolve).

MLEvolve is an MCGS-based auto-ML harness designed for the mle-bench
data layout. The adapter here translates a GraphTestbed task into the
mle-bench shape it expects, then drives the upstream `run.py` (Hydra
entry point) with overrides that route LLM traffic through CLIProxyAPI.

Default model: `gpt-5.3-codex-spark` (pipe-through alias the user defines
in their CLIProxyAPI `oauth-model-alias.codex` block).
"""

agents/mlevolve/adapter.py
@@ -0,0 +1,79 @@
"""GraphTestbed task → mle-bench-shaped data tree.

mle-bench expects, per experiment ID:

    <root>/<exp_id>/prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}

GraphTestbed's test labels live only on the scoring server, so the agent
cannot be auto-scored against `test_features.csv` locally. v1 strategy:

- Stage `val_features.csv` (with labels) as the "test" the agent
  searches against. MLEvolve's grader can score val predictions locally,
  which is what drives MCGS exploration.
- Stash the real `test_features.csv` next to the staged tree as
  `<root>/<exp_id>/REAL_TEST_FEATURES.csv` so users can re-execute the
  best runfile.py against it after the search finishes.

This is documented as a known limitation in agents/mlevolve/README.md.
"""

from __future__ import annotations

from pathlib import Path

import pandas as pd

from agents.common.tasks import task_instruction
from graphtestbed._manifest import task_config
from graphtestbed.fetch import cache_dir


def stage(task: str, root: Path) -> Path:
    """Build <root>/<task>/prepared/{public,private}/. Return the prepared dir."""
    cfg = task_config(task)
    s = cfg["submission_schema"]

    src = cache_dir() / task
    if not src.exists():
        raise SystemExit(
            f"No cached dataset at {src}. Run `gtb fetch {task}` first."
        )

    base = root / task / "prepared"
    pub = base / "public"
    priv = base / "private"
    pub.mkdir(parents=True, exist_ok=True)
    priv.mkdir(parents=True, exist_ok=True)

    train = pd.read_csv(src / "train_features.csv")
    val = pd.read_csv(src / "val_features.csv")
    test = pd.read_csv(src / "test_features.csv")

    if s["pred_col"] not in val.columns:
        raise SystemExit(
            f"val_features.csv has no `{s['pred_col']}` column — cannot use "
            f"val as the local-grading split for task {task}."
        )

    # Public tree (what the agent sees). val_no_label = val minus label →
    # served as `test.csv` so the agent's runfile predicts on it.
    val_no_label = val.drop(columns=[s["pred_col"]])
    train.to_csv(pub / "train.csv", index=False)
    val_no_label.to_csv(pub / "test.csv", index=False)

    sample = val_no_label[[s["id_col"]]].copy()
    sample[s["pred_col"]] = 0.5
    sample.to_csv(pub / "sample_submission.csv", index=False)

    (pub / "description.md").write_text(task_instruction(task))

    # Private tree: val with labels — the local grader checks submission
    # against this.
    val[[s["id_col"], s["pred_col"]]].rename(
        columns={s["pred_col"]: "Label"}
    ).to_csv(priv / "test.csv", index=False)

    # Stash the real test set for post-search re-execution by the user.
    test.to_csv(root / task / "REAL_TEST_FEATURES.csv", index=False)

    return base

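The val-as-test split from `stage()` with plain dicts instead of pandas: the public "test.csv" is val minus the label column, and the private copy keeps labels for the local grader (`NodeID`/`Label` are illustrative names):

```python
# Two views of the same val rows: one the agent can see, one the
# local grader scores against. Sample submission defaults every row to 0.5.
val_rows = [
    {"NodeID": 1, "Label": 0.9},
    {"NodeID": 2, "Label": 0.1},
]
public_test = [{k: v for k, v in r.items() if k != "Label"} for r in val_rows]
private_test = [{"NodeID": r["NodeID"], "Label": r["Label"]} for r in val_rows]
sample_submission = [{"NodeID": r["NodeID"], "Label": 0.5} for r in public_test]

print(public_test[0])  # → {'NodeID': 1}
```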
agents/mlevolve/examples/run_figraph.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash
# End-to-end smoke test of MLEvolve on the `figraph` task.
set -euo pipefail

REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
cd "${REPO_ROOT}"

: "${CLIPROXYAPI_KEY:?Set CLIPROXYAPI_KEY before running}"

python3 -m agents.mlevolve.runner \
  --task figraph \
  --model "${MODEL:-gpt-5.3-codex-spark}" \
  --steps "${STEPS:-30}" \
  --time-limit-min "${TIME_LIMIT_MIN:-30}" \
  --gpus "${GPUS:-0}" \
  "${@}"

agents/mlevolve/install.sh
@@ -0,0 +1,34 @@
#!/usr/bin/env bash
# Clone MLEvolve into agents/mlevolve/_vendor/MLEvolve and install its deps.
# This is a heavy install (torch + ML stack); expect ~5–10 GB and 5–15 min.
set -euo pipefail

HERE="$(cd "$(dirname "$0")" && pwd)"
DEST="${HERE}/_vendor"
REPO="${MLEVOLVE_REPO:-https://github.com/InternScience/MLEvolve}"
REF="${MLEVOLVE_REF:-main}"

mkdir -p "${DEST}"

if [[ -d "${DEST}/MLEvolve/.git" ]]; then
  echo "Updating existing clone in ${DEST}/MLEvolve"
  git -C "${DEST}/MLEvolve" fetch origin "${REF}"
  git -C "${DEST}/MLEvolve" checkout "${REF}"
  git -C "${DEST}/MLEvolve" pull --ff-only
else
  git clone --depth 50 --branch "${REF}" "${REPO}" "${DEST}/MLEvolve"
fi

cd "${DEST}/MLEvolve"
echo
echo "Installing requirements (heavy — torch + ML stack)..."
for f in requirements_base.txt requirements_ml.txt requirements_domain.txt; do
  if [[ -f "$f" ]]; then
    echo "  pip install --no-deps -r $f"
    pip install --no-deps -r "$f"
  fi
done

echo
echo "MLEvolve installed at ${DEST}/MLEvolve"
echo "Set MLEVOLVE_DIR if you put it elsewhere."

agents/mlevolve/runner.py
@@ -0,0 +1,210 @@
"""Run MLEvolve on a GraphTestbed task, routed through CLIProxyAPI.

Usage:
    python -m agents.mlevolve.runner --task figraph
    python -m agents.mlevolve.runner --task figraph \\
        --model gpt-5.3-codex-spark --steps 100
    python -m agents.mlevolve.runner --task figraph \\
        --submit mlevolve-codex-spark

What this does:
    1. Build an mle-bench-shaped tree from the GraphTestbed task data
       (val-as-test for v1 — see adapter.py for why).
    2. Render config.yaml into _vendor/MLEvolve/config/, with the proxy
       endpoint + model wired into agent.code and agent.feedback.
    3. Invoke `python run.py …` from inside _vendor/MLEvolve/ with Hydra
       overrides for paths and run-budget.
    4. Harvest the latest submission.csv from runs/, normalize its column
       names, validate against the testbed schema, and (optionally) submit.

Known v1 limitation: the produced submission scores VAL-set predictions,
not TEST-set. To score on test, rerun the best runfile.py against
<workspace>/mlebench-tree/<task>/REAL_TEST_FEATURES.csv before submitting.
"""

from __future__ import annotations

import argparse
import os
import subprocess
import sys
from pathlib import Path

import pandas as pd

from agents.cliproxyapi import (
    ProxyEndpoint,
    openai_yaml_block,
    wait_until_ready,
)
from agents.common.submit import finalize
from agents.common.workspace import make_workspace
from agents.mlevolve.adapter import stage as stage_mlebench
from graphtestbed._manifest import task_config

DEFAULT_MODEL = "gpt-5.3-codex-spark"


def _resolve_mlevolve_dir() -> Path:
    explicit = os.environ.get("MLEVOLVE_DIR")
    if explicit:
        p = Path(explicit)
        if not (p / "run.py").exists():
            raise SystemExit(f"MLEVOLVE_DIR={p} does not contain run.py")
        return p
    vendored = Path(__file__).parent / "_vendor" / "MLEvolve"
    if (vendored / "run.py").exists():
        return vendored
    raise SystemExit(
        "Cannot locate MLEvolve.\n"
        "  Install: bash agents/mlevolve/install.sh\n"
        "  Or set MLEVOLVE_DIR to your existing clone."
    )


def _hydra_overrides(
    task: str, mlebench_root: Path, prepared: Path, ep: ProxyEndpoint,
    model: str, steps: int, time_limit_s: int, num_gpus: int,
) -> list[str]:
    """Build Hydra-style key=value overrides for run.py."""
    public = prepared / "public"
    block = openai_yaml_block(ep, model)
    cfg_metric = task_config(task)["metric"]["primary"]

    overrides = [
        f"exp_id={task}",
        f"exp_name={task}",
        f"dataset_dir={mlebench_root}",
        f"data_dir={public}",
        f"desc_file={public / 'description.md'}",
        "start_cpu_id=0",
        "cpu_number=4",
        # LLM routing → proxy
        f"agent.code.model={block['model']}",
        f"agent.code.base_url={block['base_url']}",
        f"agent.code.api_key={block['api_key']}",
        f"agent.feedback.model={block['model']}",
        f"agent.feedback.base_url={block['base_url']}",
        f"agent.feedback.api_key={block['api_key']}",
        # Run budget overrides
        f"agent.steps={steps}",
        f"agent.time_limit={time_limit_s}",
        f"agent.memory_embedding_device={'cuda' if num_gpus > 0 else 'cpu'}",
        f"agent.search.num_gpus={num_gpus}",
        "use_grading_server=false",
        # Goal hint
        f"goal=Maximize {cfg_metric} on the test set",
        f"eval={cfg_metric}",
    ]
    return overrides


def _harvest_submission(
    task: str, mlevolve_dir: Path, dst: Path,
) -> Path:
    schema = task_config(task)["submission_schema"]
    runs = mlevolve_dir / "runs"
    if not runs.exists():
        raise SystemExit(f"No runs/ dir under {mlevolve_dir}")
    candidates = sorted(runs.rglob("submission.csv"),
                        key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise SystemExit(
            f"No submission.csv produced under {runs}. "
            f"Inspect {dst / 'agent.log'} for the failure mode."
        )
    chosen = candidates[-1]
    df = pd.read_csv(chosen)
    expected = [schema["id_col"], schema["pred_col"]]
    if list(df.columns) != expected:
        if len(df.columns) == 2:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Run MLEvolve on a GraphTestbed task, routed through CLIProxyAPI.
|
| 2 |
+
|
| 3 |
+
Usage:
|
| 4 |
+
python -m agents.mlevolve.runner --task figraph
|
| 5 |
+
python -m agents.mlevolve.runner --task figraph \\
|
| 6 |
+
--model gpt-5.3-codex-spark --steps 100
|
| 7 |
+
python -m agents.mlevolve.runner --task figraph \\
|
| 8 |
+
--submit mlevolve-codex-spark
|
| 9 |
+
|
| 10 |
+
What this does:
|
| 11 |
+
1. Build an mle-bench-shaped tree from the GraphTestbed task data
|
| 12 |
+
(val-as-test for v1 — see adapter.py for why).
|
| 13 |
+
2. Render config.yaml into _vendor/MLEvolve/config/, with the proxy
|
| 14 |
+
endpoint + model wired into agent.code and agent.feedback.
|
| 15 |
+
3. Invoke `python run.py …` from inside _vendor/MLEvolve/ with Hydra
|
| 16 |
+
overrides for paths and run-budget.
|
| 17 |
+
4. Harvest the latest submission.csv from runs/, normalize its column
|
| 18 |
+
names, validate against the testbed schema, and (optionally) submit.
|
| 19 |
+Known v1 limitation: the produced submission scores VAL-set predictions,
+not TEST-set. To score on test, rerun the best runfile.py against
+<workspace>/mlebench-tree/<task>/REAL_TEST_FEATURES.csv before submitting.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import subprocess
+import sys
+from pathlib import Path
+
+import pandas as pd
+
+from agents.cliproxyapi import (
+    ProxyEndpoint,
+    openai_yaml_block,
+    wait_until_ready,
+)
+from agents.common.submit import finalize
+from agents.common.workspace import make_workspace
+from agents.mlevolve.adapter import stage as stage_mlebench
+from graphtestbed._manifest import task_config
+
+DEFAULT_MODEL = "gpt-5.3-codex-spark"
+
+
+def _resolve_mlevolve_dir() -> Path:
+    explicit = os.environ.get("MLEVOLVE_DIR")
+    if explicit:
+        p = Path(explicit)
+        if not (p / "run.py").exists():
+            raise SystemExit(f"MLEVOLVE_DIR={p} does not contain run.py")
+        return p
+    vendored = Path(__file__).parent / "_vendor" / "MLEvolve"
+    if (vendored / "run.py").exists():
+        return vendored
+    raise SystemExit(
+        "Cannot locate MLEvolve.\n"
+        "  Install: bash agents/mlevolve/install.sh\n"
+        "  Or set MLEVOLVE_DIR to your existing clone."
+    )
+
+
+def _hydra_overrides(
+    task: str, mlebench_root: Path, prepared: Path, ep: ProxyEndpoint,
+    model: str, steps: int, time_limit_s: int, num_gpus: int,
+) -> list[str]:
+    """Build Hydra-style key=value overrides for run.py."""
+    public = prepared / "public"
+    block = openai_yaml_block(ep, model)
+    cfg_metric = task_config(task)["metric"]["primary"]
+
+    overrides = [
+        f"exp_id={task}",
+        f"exp_name={task}",
+        f"dataset_dir={mlebench_root}",
+        f"data_dir={public}",
+        f"desc_file={public / 'description.md'}",
+        "start_cpu_id=0",
+        "cpu_number=4",
+        # LLM routing → proxy
+        f"agent.code.model={block['model']}",
+        f"agent.code.base_url={block['base_url']}",
+        f"agent.code.api_key={block['api_key']}",
+        f"agent.feedback.model={block['model']}",
+        f"agent.feedback.base_url={block['base_url']}",
+        f"agent.feedback.api_key={block['api_key']}",
+        # Run budget overrides
+        f"agent.steps={steps}",
+        f"agent.time_limit={time_limit_s}",
+        f"agent.memory_embedding_device={'cuda' if num_gpus > 0 else 'cpu'}",
+        f"agent.search.num_gpus={num_gpus}",
+        "use_grading_server=false",
+        # Goal hint
+        f"goal=Maximize {cfg_metric} on the test set",
+        f"eval={cfg_metric}",
+    ]
+    return overrides
+
+
+def _harvest_submission(
+    task: str, mlevolve_dir: Path, dst: Path,
+) -> Path:
+    schema = task_config(task)["submission_schema"]
+    runs = mlevolve_dir / "runs"
+    if not runs.exists():
+        raise SystemExit(f"No runs/ dir under {mlevolve_dir}")
+    candidates = sorted(runs.rglob("submission.csv"),
+                        key=lambda p: p.stat().st_mtime)
+    if not candidates:
+        raise SystemExit(
+            f"No submission.csv produced under {runs}. "
+            f"Inspect {dst / 'agent.log'} for the failure mode."
+        )
+    chosen = candidates[-1]
+    df = pd.read_csv(chosen)
+    expected = [schema["id_col"], schema["pred_col"]]
+    if list(df.columns) != expected:
+        if len(df.columns) == 2:
+            print(f"  (renaming columns {list(df.columns)} → {expected})")
+            df.columns = expected
+        else:
+            raise SystemExit(
+                f"Cannot normalize {chosen}: got {list(df.columns)}, expected {expected}"
+            )
+    out = dst / "val_submission.csv"
+    df.to_csv(out, index=False)
+    print(f"✓ Picked {chosen.relative_to(mlevolve_dir)}")
+    return out
+
+
+def _print_followup(task: str, ws: Path, val_sub: Path) -> None:
+    real_test = ws / "mlebench-tree" / task / "REAL_TEST_FEATURES.csv"
+    print()
+    print("⚠ v1 limitation: the file above scores VAL predictions.")
+    print("  To score on the actual test set:")
+    print(f"  1. Find the best runfile.py under "
+          f"{Path('_vendor/MLEvolve/runs')}/<latest>/")
+    print("  2. Re-run it with test.csv replaced by:")
+    print(f"     {real_test}")
+    print("  3. Submit the resulting CSV via:")
+    print(f"     gtb submit {task} --file <path> --agent <name>")
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser(prog="agents.mlevolve.runner")
+    ap.add_argument("--task", required=True)
+    ap.add_argument("--model", default=DEFAULT_MODEL,
+                    help=f"default: {DEFAULT_MODEL}")
+    ap.add_argument("--steps", type=int, default=100,
+                    help="agent.steps (default: 100, upstream default 500 — "
+                         "MCGS exploration count)")
+    ap.add_argument("--time-limit-min", type=int, default=120,
+                    help="agent.time_limit in minutes (default: 120)")
+    ap.add_argument("--gpus", type=int, default=0,
+                    help="search.num_gpus (default: 0 — CPU only)")
+    ap.add_argument("--submit", default=None, metavar="AGENT_ID",
+                    help="POST val-set submission to scoring API as this name. "
+                         "Note: scores VAL not test (see runner docstring).")
+    ap.add_argument("--workspace-root", type=Path, default=None)
+    args = ap.parse_args()
+
+    mlevolve_dir = _resolve_mlevolve_dir()
+    ep = ProxyEndpoint.from_env()
+    wait_until_ready(ep)
+    print(f"✓ Proxy ready at {ep.base_url()}")
+    print(f"✓ MLEvolve at {mlevolve_dir}")
+
+    ws = make_workspace("mlevolve", args.task, args.workspace_root)
+    mlebench_root = ws / "mlebench-tree"
+    prepared = stage_mlebench(args.task, mlebench_root)
+    print(f"✓ mle-bench tree staged at {mlebench_root}")
+
+    overrides = _hydra_overrides(
+        task=args.task,
+        mlebench_root=mlebench_root,
+        prepared=prepared,
+        ep=ep,
+        model=args.model,
+        steps=args.steps,
+        time_limit_s=args.time_limit_min * 60,
+        num_gpus=args.gpus,
+    )
+    cmd = [sys.executable, "run.py", *overrides]
+
+    print(f"→ Launching MLEvolve task={args.task} model={args.model}")
+    print(f"  workspace: {ws}")
+    log = ws / "agent.log"
+    with log.open("wb") as lf:
+        rc = subprocess.call(cmd, cwd=mlevolve_dir, stdout=lf, stderr=subprocess.STDOUT)
+    print(f"  exit={rc} log={log}")
+    if rc != 0:
+        raise SystemExit(rc)
+
+    val_sub = _harvest_submission(args.task, mlevolve_dir, ws)
+    _print_followup(args.task, ws, val_sub)
+
+    # Note: don't auto-finalize against `test_features.csv` schema since this
+    # is a val-set submission. Just print & stop.
+    print()
+    print(f"  val_submission: {val_sub}")
+    if args.submit:
+        print(f"  --submit was set; posting val-set predictions as "
+              f"`{args.submit}` (will score 0 against test GT).")
+        finalize(args.task, val_sub, args.submit)
+
+
+if __name__ == "__main__":
+    main()
@@ -8,7 +8,10 @@ import os
 import json


-API_URL = os.environ.get(
+API_URL = os.environ.get(
+    "GRAPHTESTBED_API",
+    "https://lanczos-graphtestbed.hf.space",
+)


 def main() -> None:
@@ -21,7 +21,10 @@ import pandas as pd
 from graphtestbed._manifest import sha256_file, task_config


-API_URL = os.environ.get(
+API_URL = os.environ.get(
+    "GRAPHTESTBED_API",
+    "https://lanczos-graphtestbed.hf.space",
+)
 TIMEOUT_SEC = 60

@@ -7,11 +7,6 @@ license = "MIT"
 readme = "README.md"
 requires-python = ">=3.10"
 keywords = ["benchmark", "graph", "ml", "agent", "evaluation"]
-
-[project.urls]
-Homepage = "https://github.com/zhuconv/GraphTestbed"
-Repository = "https://github.com/zhuconv/GraphTestbed"
-Issues = "https://github.com/zhuconv/GraphTestbed/issues"
 dependencies = [
     "huggingface-hub >= 0.20",
     "pandas >= 2.0",
@@ -19,6 +14,11 @@ dependencies = [
     "requests >= 2.30",
 ]

+[project.urls]
+Homepage = "https://github.com/zhuconv/GraphTestbed"
+Repository = "https://github.com/zhuconv/GraphTestbed"
+Issues = "https://github.com/zhuconv/GraphTestbed/issues"
+
 [project.optional-dependencies]
 dev = ["scikit-learn >= 1.3"]
@@ -41,6 +41,10 @@ from flask import Flask, jsonify, request

 GT_DIR = Path(os.environ.get("GT_DIR", "/var/graphtestbed/gt"))
 DB_PATH = Path(os.environ.get("GT_DB", "/var/graphtestbed/leaderboard.db"))
+ARCHIVE_DIR = (
+    Path(os.environ["GT_ARCHIVE_DIR"])
+    if os.environ.get("GT_ARCHIVE_DIR") else None
+)
 MANIFEST_PATH = Path(os.environ.get(
     "GT_MANIFEST",
     Path(__file__).resolve().parents[1] / "datasets" / "manifest.yaml",
@@ -195,6 +199,15 @@ def submit():
     )
     conn.commit()

+    # Archive the raw CSV when GT_ARCHIVE_DIR is configured, so the deploy
+    # host can later prove what each scored entry was. Filename embeds the
+    # agent + run_id so multiple submissions don't collide.
+    if ARCHIVE_DIR is not None:
+        safe_agent = "".join(c if c.isalnum() or c in "-_." else "_" for c in agent)
+        out = ARCHIVE_DIR / task / f"{safe_agent}-{run_id}.csv"
+        out.parent.mkdir(parents=True, exist_ok=True)
+        out.write_bytes(raw)
+
     # Rank = how many distinct agents have a strictly better best-score on
     # this task. The just-inserted row contributes to that count only if the
     # SAME agent had a better prior submission (in which case rank doesn't
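The archive filename embeds a caller-supplied agent id, so the hunk sanitizes it before building a path. The rule in isolation (helper name `safe_name` is illustrative):

```python
def safe_name(agent: str) -> str:
    # Same rule as the archive hunk above: any character outside
    # alphanumerics and "-_." becomes "_", keeping the agent id safe to
    # embed in a filename (no separators, no shell metacharacters).
    return "".join(c if c.isalnum() or c in "-_." else "_" for c in agent)
```

Note this maps many agent ids to the same safe name (e.g. `a/b` and `a b` both become `a_b`); the `run_id` suffix in the archive path is what actually keeps files from colliding.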
@@ -3,3 +3,4 @@ pandas>=2.0
 pyyaml>=6.0
 scikit-learn>=1.3
 gunicorn>=21.0
+huggingface_hub>=0.20
@@ -0,0 +1,101 @@
+# Deploying the GraphTestbed scoring server to HF Spaces
+
+All commands assume `HF_TOKEN` is exported and has **write** scope on the
+`lanczos` namespace.
+
+## 1. Seed the GT dataset repo
+
+```bash
+HF_TOKEN=$HF_TOKEN python server/space/push_gt.py \
+  --repo lanczos/graphtestbed-gt \
+  --gt-dir ~/graphtestbed-gt
+```
+
+This creates the **private** dataset repo if it doesn't exist and uploads
+each `<task>.csv` to `gt/<task>.csv`. Verify at:
+
+<https://huggingface.co/datasets/lanczos/graphtestbed-gt>
+
+## 2. Create the Space
+
+```bash
+huggingface-cli repo create graphtestbed --type space --space_sdk docker
+```
+
+Or in the web UI: New Space → name `graphtestbed` → SDK: **Docker**.
+
+## 3. Set the Space secret
+
+In Space Settings → Variables and secrets, add:
+
+| name | value |
+| --- | --- |
+| `HF_TOKEN` | same token (write scope on `lanczos/graphtestbed-gt`) |
+
+Optional overrides (set as **variables**, not secrets):
+
+| name | default | when to override |
+| --- | --- | --- |
+| `GT_DATASET_REPO` | `lanczos/graphtestbed-gt` | running multiple Spaces against different GT |
+| `GT_BACKUP_INTERVAL` | `60` | tighter durability vs. fewer commits |
+| `GT_QUOTA` | `5` | bumping during a benchmark sprint |
+
+## 4. Push the code to the Space
+
+```bash
+# One-time
+git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+
+# Each deploy (HF prompts for credentials: user=lanczos, password=$HF_TOKEN)
+./server/space/push_to_space.sh
+```
+
+The script overlays `server/space/README.md` at repo root on a temp branch
+and force-pushes to `space/main` (HF reads its frontmatter from the root
+README). Your GitHub root README is untouched.
+
+First build ~3 min (pandas + sklearn wheels). Subsequent builds ~30 s.
+
+## 5. Smoke-test
+
+```bash
+curl -s https://lanczos-graphtestbed.hf.space/healthz | jq
+```
+
+Expect:
+```json
+{
+  "status": "ok",
+  "tasks": ["arxiv-citation", "figraph", "ibm-aml", "ieee-fraud-detection"],
+  "gt_present": ["figraph", "..."],
+  "quota_per_day": 5,
+  "uptime_unix": 1776633751
+}
+```
+
+If `gt_present` is empty, the boot-time bootstrap couldn't read from the
+dataset repo — check the Space logs and verify `HF_TOKEN` has read scope on
+`GT_DATASET_REPO`.
+
+## 6. Hand out the URL
+
+```bash
+export GRAPHTESTBED_API=https://lanczos-graphtestbed.hf.space
+gtb submit figraph --file preds.csv --agent my-agent-v1
+```
+
+## Reading the leaderboard back as a maintainer
+
+```bash
+huggingface-cli download lanczos/graphtestbed-gt \
+  leaderboard.db \
+  --repo-type dataset \
+  --local-dir ./backup
+
+sqlite3 backup/leaderboard.db \
+  "SELECT task, agent, primary_metric, n_rows, submitted_at
+   FROM submissions ORDER BY submitted_at DESC LIMIT 20"
+```
+
+The full per-submission CSV archive lives under `submissions/<task>/<agent>-<run_id>.csv`
+in the same dataset repo.
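The smoke-test step above boils down to comparing `tasks` against `gt_present` in the `/healthz` payload. That check can be sketched without hitting the network (function name and the sample payload below are illustrative, shaped like the example response above):

```python
def missing_gt(health: dict) -> list[str]:
    # Tasks the server advertises but has no ground truth loaded for.
    # A non-empty result means the boot-time bootstrap could not read
    # the GT dataset repo.
    return sorted(set(health["tasks"]) - set(health["gt_present"]))


# Illustrative payload in the shape of the /healthz example above.
sample = {
    "status": "ok",
    "tasks": ["arxiv-citation", "figraph", "ibm-aml", "ieee-fraud-detection"],
    "gt_present": ["figraph", "ibm-aml"],
}
```

In a real monitor you would feed this the parsed JSON from `requests.get(f"{API_URL}/healthz").json()` and alert on any missing task.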
@@ -0,0 +1,38 @@
+FROM python:3.11-slim
+
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1
+
+WORKDIR /app
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install deps first so the layer caches across code-only changes.
+# (The >= spec must be quoted or the shell treats it as a redirection.)
+COPY server/requirements.txt /app/server/requirements.txt
+RUN pip install -r /app/server/requirements.txt "huggingface_hub>=0.20"
+
+# Install the graphtestbed package itself so server/api.py can
+# `from graphtestbed._manifest import ...`.
+COPY pyproject.toml /app/
+COPY graphtestbed /app/graphtestbed
+COPY datasets /app/datasets
+COPY server /app/server
+RUN pip install --no-deps -e /app
+
+# HF Spaces mounts /data on the Persistent Storage tier; on the free tier
+# it's just an in-container path that the dataset-repo backup loop preserves.
+ENV GT_DATA_ROOT=/data \
+    GT_DIR=/data/gt \
+    GT_DB=/data/leaderboard.db \
+    GT_ARCHIVE_DIR=/data/submissions \
+    GT_DATASET_REPO=lanczos/graphtestbed-gt \
+    GT_BACKUP_INTERVAL=60 \
+    GT_QUOTA=5 \
+    PORT=7860
+RUN mkdir -p /data && chmod 777 /data
+
+EXPOSE 7860
+CMD ["python", "/app/server/space/space_entry.py"]
@@ -0,0 +1,55 @@
+---
+title: GraphTestbed Scoring API
+emoji: 📊
+colorFrom: indigo
+colorTo: green
+sdk: docker
+app_port: 7860
+pinned: false
+---
+
+# GraphTestbed Scoring API
+
+Public scoring server for the [GraphTestbed](https://github.com/zhuconv/GraphTestbed)
+benchmark. Anyone can `gtb submit <task> --file preds.csv --agent <name>` from
+anywhere; the scored entry lands on a single shared leaderboard.
+
+## Endpoints
+
+| method | path | purpose |
+| --- | --- | --- |
+| POST | `/submit` | multipart `task=…&agent=…&file=preds.csv` → JSON with primary metric, secondary metrics, leaderboard rank, quota_remaining |
+| GET | `/leaderboard/<task>` | best-per-agent JSON, sorted by primary desc |
+| GET | `/healthz` | tasks list + which have GT loaded + quota |
+
+Full contract: [PROTOCOL.md](https://github.com/zhuconv/GraphTestbed/blob/main/PROTOCOL.md).
+
+## Trust model
+
+Non-adversarial benchmark. The API enforces:
+- 5 submissions / day / IP / task
+- Schema check before scoring (malformed CSVs don't burn quota)
+- Score bucketing (round to 3 dp)
+- Audit trail in sqlite + per-submission CSV archive
+
+Test labels live only in the companion private dataset repo
+(`lanczos/graphtestbed-gt`) and never enter the Space's git history.
+
+## Configuration (Space secrets)
+
+| name | required | default | notes |
+| --- | --- | --- | --- |
+| `HF_TOKEN` | yes | — | write scope on `GT_DATASET_REPO` |
+| `GT_DATASET_REPO` | no | `lanczos/graphtestbed-gt` | private dataset holding GT + leaderboard backups |
+| `GT_BACKUP_INTERVAL` | no | `60` | seconds between sqlite → dataset-repo pushes |
+| `GT_QUOTA` | no | `5` | submissions/day/IP/task |
+
+## Persistence
+
+- On boot: `snapshot_download` pulls `gt/*.csv`, `leaderboard.db`, and any
+  archived `submissions/**/*.csv` from the dataset repo into `/data`.
+- Every 60 s: if `SELECT COUNT(*) FROM submissions` grew, a daemon thread
+  uses `sqlite3.Connection.backup()` to copy the DB atomically and
+  `upload_file`s it back. New submission CSVs in `/data/submissions/` are
+  pushed via `upload_folder` (content-hash diff — unchanged files skipped).
+- Worst-case loss on Space crash: 60 s of submissions.
@@ -0,0 +1,67 @@
+"""One-shot uploader for ground-truth CSVs to the companion HF dataset repo.
+
+Creates the dataset repo (private by default) if it doesn't exist, then
+uploads every <task>.csv from --gt-dir to gt/<task>.csv in the repo.
+
+Usage (run locally with a token that has write scope on the namespace):
+
+    HF_TOKEN=hf_xxx python server/space/push_gt.py \\
+        --repo lanczos/graphtestbed-gt \\
+        --gt-dir ~/graphtestbed-gt
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+from huggingface_hub import create_repo, upload_file
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(prog="push_gt")
+    ap.add_argument("--repo", default="lanczos/graphtestbed-gt",
+                    help="dataset repo id (default: lanczos/graphtestbed-gt)")
+    ap.add_argument("--gt-dir", type=Path, required=True,
+                    help="local dir containing <task>.csv files")
+    ap.add_argument("--public", action="store_true",
+                    help="create the repo as public (default: private)")
+    args = ap.parse_args()
+
+    token = os.environ.get("HF_TOKEN")
+    if not token:
+        sys.exit("HF_TOKEN not set in env")
+
+    if not args.gt_dir.exists():
+        sys.exit(f"--gt-dir not found: {args.gt_dir}")
+
+    csvs = sorted(args.gt_dir.glob("*.csv"))
+    if not csvs:
+        sys.exit(f"no *.csv files under {args.gt_dir}")
+
+    print(f"creating/confirming dataset repo {args.repo} (private={not args.public})")
+    create_repo(
+        repo_id=args.repo, repo_type="dataset",
+        private=not args.public, exist_ok=True, token=token,
+    )
+
+    for csv in csvs:
+        rel = f"gt/{csv.name}"
+        print(f"uploading {csv} → {args.repo}:{rel}")
+        upload_file(
+            path_or_fileobj=str(csv),
+            path_in_repo=rel,
+            repo_id=args.repo, repo_type="dataset",
+            token=token,
+            commit_message=f"upload {csv.name}",
+        )
+
+    print(f"\ndone — {len(csvs)} ground-truth file(s) at:")
+    print(f"  https://huggingface.co/datasets/{args.repo}")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+# Push the current commit to the HF Space remote, with server/space/README.md
+# overlaid at repo root (HF reads the Space's metadata frontmatter from the
+# root README; the GitHub root README stays untouched).
+#
+# Prereq once:
+#   git remote add space https://huggingface.co/spaces/lanczos/graphtestbed
+#
+# When git prompts for credentials on push:
+#   user     = lanczos
+#   password = $HF_TOKEN
+set -euo pipefail
+
+BRANCH=$(git rev-parse --abbrev-ref HEAD)
+TEMP="space-deploy-$(date +%s)"
+
+trap 'git checkout "$BRANCH" >/dev/null 2>&1 || true; \
+      git branch -D "$TEMP" >/dev/null 2>&1 || true' EXIT
+
+git checkout -b "$TEMP"
+cp server/space/README.md README.md
+git add README.md
+git commit --no-verify -m "deploy: overlay server/space/README.md as Space root"
+git push -f space "$TEMP:main"
+echo
+echo "pushed to space/main"
+echo "URL: https://lanczos-graphtestbed.hf.space/"
@@ -0,0 +1,173 @@
+"""Entry point for the GraphTestbed scoring server on HF Spaces.
+
+On boot:
+  1. snapshot_download the companion dataset repo (lanczos/graphtestbed-gt by
+     default) into /data: gt/*.csv, leaderboard.db, submissions/**/*.csv.
+  2. Spawn a daemon thread that every BACKUP_INTERVAL seconds:
+     a. SELECT COUNT(*) FROM submissions; bail if unchanged.
+     b. sqlite3.Connection.backup() into a temp file (atomic, lock-safe).
+     c. upload_file the temp file → leaderboard.db in the dataset repo.
+     d. upload_folder /data/submissions/ → submissions/ in the dataset repo
+        (huggingface_hub diffs by content-hash; unchanged files don't transfer).
+  3. Hand off to server/api.py via Flask app.run(threaded=True).
+
+Env vars (all have sensible defaults baked into the Dockerfile):
+  HF_TOKEN            required   write scope on GT_DATASET_REPO
+  GT_DATASET_REPO     optional   default: lanczos/graphtestbed-gt
+  GT_DATA_ROOT        optional   default: /data
+  GT_BACKUP_INTERVAL  optional   default: 60 (seconds)
+  PORT                optional   default: 7860
+"""
+
+from __future__ import annotations
+
+import os
+import sqlite3
+import sys
+import threading
+import time
+from pathlib import Path
+
+from huggingface_hub import snapshot_download, upload_file, upload_folder
+
+HF_TOKEN = os.environ.get("HF_TOKEN")
+HF_REPO = os.environ.get("GT_DATASET_REPO", "lanczos/graphtestbed-gt")
+DATA_DIR = Path(os.environ.get("GT_DATA_ROOT", "/data"))
+GT_DIR = DATA_DIR / "gt"
+DB_PATH = DATA_DIR / "leaderboard.db"
+ARCHIVE_DIR = DATA_DIR / "submissions"
+BACKUP_INTERVAL = int(os.environ.get("GT_BACKUP_INTERVAL", "60"))
+PORT = int(os.environ.get("PORT", "7860"))
+
+
+def _require_token() -> str:
+    if not HF_TOKEN:
+        raise SystemExit(
+            "HF_TOKEN is unset. Set it as a Space secret with write scope on "
+            f"{HF_REPO}."
+        )
+    return HF_TOKEN
+
+
+def bootstrap() -> None:
+    """Pull GT files, leaderboard, and submission archive from the dataset repo."""
+    token = _require_token()
+    for d in (DATA_DIR, GT_DIR, ARCHIVE_DIR):
+        d.mkdir(parents=True, exist_ok=True)
+
+    print(f"snapshot_download {HF_REPO} → {DATA_DIR}", flush=True)
+    try:
+        snapshot_download(
+            HF_REPO,
+            repo_type="dataset",
+            local_dir=str(DATA_DIR),
+            allow_patterns=["gt/*.csv", "leaderboard.db", "submissions/**/*.csv"],
+            token=token,
+        )
+    except Exception as e:
+        # First-deploy or empty repo: keep going with empty /data.
+        print(f"snapshot_download warning ({type(e).__name__}): {e}", flush=True)
+
+    n_gt = len(list(GT_DIR.glob("*.csv")))
+    print(f"GT files present: {n_gt}", flush=True)
+    if DB_PATH.exists():
+        try:
+            n = int(sqlite3.connect(DB_PATH).execute(
+                "SELECT COUNT(*) FROM submissions"
+            ).fetchone()[0])
+            print(f"restored leaderboard.db ({n} submissions)", flush=True)
+        except sqlite3.OperationalError:
+            print("leaderboard.db present but no submissions table yet", flush=True)
+    else:
+        print("no prior leaderboard.db; starting fresh", flush=True)
+
+
+def _submission_count() -> int:
+    if not DB_PATH.exists():
+        return 0
+    try:
+        conn = sqlite3.connect(DB_PATH)
+        try:
+            row = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()
+            return int(row[0]) if row else 0
+        finally:
+            conn.close()
+    except sqlite3.OperationalError:
+        return 0
+
+
+def _atomic_db_copy(dst: Path) -> None:
+    """sqlite3.backup() is lock-safe — readers/writers stay consistent."""
+    src = sqlite3.connect(DB_PATH)
+    try:
+        target = sqlite3.connect(dst)
+        try:
+            src.backup(target)
+        finally:
+            target.close()
+    finally:
+        src.close()
+
+
+def backup_loop() -> None:
+    token = _require_token()
+    last_count = -1
+    print(f"backup_loop started (interval={BACKUP_INTERVAL}s)", flush=True)
+    while True:
+        time.sleep(BACKUP_INTERVAL)
+        n = _submission_count()
+        if n == last_count:
+            continue
+
+        try:
+            tmp = DATA_DIR / "_leaderboard.db.tmp"
+            _atomic_db_copy(tmp)
+            upload_file(
+                path_or_fileobj=str(tmp),
+                path_in_repo="leaderboard.db",
+                repo_id=HF_REPO, repo_type="dataset",
+                token=token,
+                commit_message=f"backup leaderboard ({n} submissions)",
+            )
+            tmp.unlink()
+        except Exception as e:
+            print(f"leaderboard backup failed: {type(e).__name__}: {e}", flush=True)
+            continue
+
+        if ARCHIVE_DIR.exists() and any(ARCHIVE_DIR.rglob("*.csv")):
+            try:
+                upload_folder(
+                    folder_path=str(ARCHIVE_DIR),
+                    path_in_repo="submissions",
+                    repo_id=HF_REPO, repo_type="dataset",
+                    token=token,
+                    commit_message=f"archive submissions ({n} total)",
+                    allow_patterns=["**/*.csv"],
+                )
+            except Exception as e:
+                print(f"submission archive failed: {type(e).__name__}: {e}", flush=True)
+
+        last_count = n
+        print(f"backup pushed: {n} submissions", flush=True)
+
+
+def main() -> int:
+    bootstrap()
+
+    # Make sure server/api.py reads paths consistent with what we just bootstrapped.
+    os.environ.setdefault("GT_DIR", str(GT_DIR))
+    os.environ.setdefault("GT_DB", str(DB_PATH))
+    os.environ.setdefault("GT_ARCHIVE_DIR", str(ARCHIVE_DIR))
+
+    threading.Thread(target=backup_loop, daemon=True).start()
+
+    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+    from api import app  # noqa: E402 — env vars must be set first
+
+    print(f"serving on 0.0.0.0:{PORT}", flush=True)
+    app.run(host="0.0.0.0", port=PORT, threaded=True, use_reloader=False)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
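The key durability trick in `space_entry.py` is `sqlite3.Connection.backup()`: it snapshots the database page-by-page under sqlite's own locking, so the copy the backup thread uploads is internally consistent even while the Flask thread is writing. A standalone sketch of that pattern (function name `atomic_db_copy` is illustrative):

```python
import sqlite3
from pathlib import Path


def atomic_db_copy(src_path: Path, dst_path: Path) -> None:
    # Same pattern as _atomic_db_copy above: Connection.backup() takes a
    # consistent snapshot of src into dst without needing to stop writers,
    # unlike a plain file copy which can capture a half-written page.
    src = sqlite3.connect(src_path)
    try:
        dst = sqlite3.connect(dst_path)
        try:
            src.backup(dst)
        finally:
            dst.close()
    finally:
        src.close()
```

The uploaded file is the temp copy, not the live DB, which is why the loop can safely `unlink()` it after `upload_file` returns.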