Zhu Jiajun (jz28583)
Add agents/ harness integrations and HF Space scoring deployment
d094faf

agents.mlevolve

Runs MLEvolve on a GraphTestbed task. MLEvolve is an MCGS auto-ML harness wired for OpenAI-compatible APIs.

Default model: gpt-5.3-codex-spark (a pass-through alias you define in your CLIProxyAPI oauth-model-alias.codex block).
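
The runner reaches the model through your proxy, so point OPENAI_BASE_URL at it first. A minimal sketch, assuming CLIProxyAPI listens on localhost port 8317 (port and key here are placeholders; match your proxy config):

export OPENAI_BASE_URL="http://127.0.0.1:8317"
export OPENAI_API_KEY="anything-nonempty"   # many local proxies ignore the key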

Install

bash agents/mlevolve/install.sh
# heavy: clones the repo + pip-installs torch and ML deps (~5-10 GB).

The install lands the vendored clone at agents/mlevolve/_vendor/MLEvolve/. Set MLEVOLVE_DIR if you already have a clone elsewhere, as shown below.
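
For example, to reuse a checkout you already have (a sketch, assuming the runner reads the variable from the environment; install.sh's pip deps are still required):

export MLEVOLVE_DIR="$HOME/src/MLEvolve"
python -m agents.mlevolve.runner --task figraph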

Run

gtb fetch figraph
python -m agents.mlevolve.runner --task figraph

Output:

runs/mlevolve/figraph/<timestamp>/
├── mlebench-tree/figraph/
│   ├── prepared/public/{train.csv,test.csv,description.md,sample_submission.csv}
│   ├── prepared/private/test.csv       # val labels; the local grader uses this
│   └── REAL_TEST_FEATURES.csv          # the actual test split, for re-execution
├── agent.log
└── val_submission.csv                  # MLEvolve's best on the val "test" split

⚠ v1 limitation: val-as-test

GraphTestbed's actual test labels live on the scoring server, not on disk. For the local mle-bench grader to function, the adapter exposes val_features.csv (with labels) as the "test" set MLEvolve searches against.
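
In file terms, the swap amounts to roughly this. A sketch only, with placeholder sources: the adapter does this internally, and only the destination layout (see the output tree above) is real:

cp <val features>  prepared/public/test.csv     # what MLEvolve searches against as "test"
cp <val labels>    prepared/private/test.csv    # what the local mle-bench grader scores
cp <test features> REAL_TEST_FEATURES.csv       # real test split, stashed for re-execution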

The CSV the runner harvests therefore holds predictions on the val split, not the real test split. To submit a real test-set score:

  1. Open agents/mlevolve/_vendor/MLEvolve/runs/<latest-ts>/ and find the best runfile.py (the node with the best score in the run's tree summary).
  2. Re-execute it against the real test split:
    cd <some scratch dir>
    cp <ws>/mlebench-tree/figraph/REAL_TEST_FEATURES.csv ./test.csv
    cp <ws>/mlebench-tree/figraph/prepared/public/train.csv ./train.csv
    python <runfile>      # produces submission.csv
    
  3. Submit:
    gtb submit figraph --file ./submission.csv --agent mlevolve-codex-spark
    

This step is manual in v1 because the structure of MLEvolve's runfile.py varies per task and we don't want to silently mis-execute it. Automating it is on the roadmap; the sketch below shows the intended shape.
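
For reference, roughly what that automation will look like. This is a sketch, not shipped code: it assumes the chosen runfile.py reads ./train.csv and ./test.csv from its working directory and writes submission.csv, which is exactly the assumption v1 refuses to make silently.

#!/usr/bin/env bash
set -euo pipefail
ws="$1"                        # a runs/mlevolve/figraph/<timestamp>/ workspace
runfile="$(realpath "$2")"     # the best runfile.py, still picked by hand
scratch="$(mktemp -d)"
cp "$ws/mlebench-tree/figraph/REAL_TEST_FEATURES.csv"    "$scratch/test.csv"
cp "$ws/mlebench-tree/figraph/prepared/public/train.csv" "$scratch/train.csv"
(cd "$scratch" && python "$runfile")   # expected to write submission.csv
gtb submit figraph --file "$scratch/submission.csv" --agent mlevolve-codex-spark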

Knobs

flag              default              meaning
--model           gpt-5.3-codex-spark  model name sent to the proxy via OPENAI_BASE_URL/v1
--steps           100                  MCGS exploration count (upstream default: 500)
--time-limit-min  120                  per-task wall-clock cap (upstream default: 720)
--gpus            0                    passed through to search.num_gpus
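
Putting the knobs together, a longer run on more hardware might look like this (values illustrative):

python -m agents.mlevolve.runner \
  --task figraph \
  --model gpt-5.3-codex-spark \
  --steps 300 \
  --time-limit-min 360 \
  --gpus 2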

The --model string must exist in your CLIProxyAPI oauth-model-alias.codex (or be a real model your Codex account exposes).
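
If your proxy serves the standard OpenAI-compatible /v1/models listing (an assumption; check your CLIProxyAPI build), you can sanity-check the alias before burning a run:

curl -s "$OPENAI_BASE_URL/v1/models" | grep -o 'gpt-5.3-codex-spark'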