Zhu Jiajun (jz28583) and Claude Opus 4.7 (1M context) committed
Commit bd3e9ac · 1 Parent(s): ab28b31

Single-repo dataset hosting on HF (GLUE-style subdirs)


All 4 tasks now live under one public dataset repo
`lanczos/graphtestbed-data`, organized as one subdir per task:

graphtestbed-data/
  arxiv-citation/{train,val,test}_features.csv + sample_submission.csv
  figraph/...  (+ edges20{14..18}.csv)
  ibm-aml/...
  ieee-fraud-detection/...

Why one repo: easier to manage permissions, README, versioning. New tasks
become a `git push` of a folder, not a new HF repo per dataset.

Changes:
- server/space/push_data.py: one-shot uploader (folder per task), writes
a top-level dataset card too.
- manifest.yaml: every task points at lanczos/graphtestbed-data with a new
`hf_subdir: <task>` field.
- graphtestbed/fetch.py: prepends hf_subdir/ when downloading.
- agents/common/tasks.py: instruction text now points the agent at the HF
source explicitly and reminds it not to query upstream for test labels.

Test labels remain in the private companion repo (lanczos/graphtestbed-gt)
and never enter the public dataset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

agents/common/tasks.py CHANGED
@@ -24,8 +24,13 @@ _TEMPLATE = """\
 - `val_features.csv` — labeled validation rows (use for HPO / early stopping)
 - `test_features.csv` — **unlabeled** test rows; predict here
 
+These are pulled from `lanczos/graphtestbed-data` on HuggingFace (subdir
+`{task}/`). **Train and HPO on these files only** — do not pull from the
+upstream source mentioned above to recover test labels. The benchmark is
+non-adversarial; we trust agent authors to honor the contract.
+
 The `Label` (or task-specific target) column is present in train/val and
-absent from test. Do not attempt to recover test labels from upstream sources.
+absent from test.
 
 ## Submission format

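The `{task}` placeholder in the new instruction text is filled in per task when the template is rendered; a quick sketch of the substitution (assuming the template is expanded with `str.format`, as the `{task}` syntax suggests — the excerpt below is copied from the diff, not the full template):

```python
# Excerpt of the new instruction text; `{task}` is a format placeholder.
TEMPLATE_EXCERPT = (
    "These are pulled from `lanczos/graphtestbed-data` on HuggingFace "
    "(subdir `{task}/`). **Train and HPO on these files only**."
)

# Rendering for one task drops the concrete subdir into the instructions.
rendered = TEMPLATE_EXCERPT.format(task="figraph")
print(rendered)
```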
datasets/manifest.yaml CHANGED
@@ -1,5 +1,6 @@
 ieee-fraud-detection:
-  hf_repo: graphtestbed/ieee-fraud-detection
+  hf_repo: lanczos/graphtestbed-data
+  hf_subdir: ieee-fraud-detection
   hf_revision: main
   files:
     train_features:
@@ -44,7 +45,8 @@ ieee-fraud-detection:
   backend_config:
     competition: ieee-fraud-detection
 arxiv-citation:
-  hf_repo: graphtestbed/arxiv-citation
+  hf_repo: lanczos/graphtestbed-data
+  hf_subdir: arxiv-citation
   hf_revision: main
   files:
     train_features:
@@ -79,7 +81,8 @@ arxiv-citation:
     for this task). The split is balanced enough (~42.7% positive) that AUC-ROC discriminates
     models well.'
 figraph:
-  hf_repo: graphtestbed/figraph
+  hf_repo: lanczos/graphtestbed-data
+  hf_subdir: figraph
   hf_revision: main
   files:
     train_features:
@@ -129,7 +132,8 @@ figraph:
     Metric: AUC-ROC. The FiGraph paper uses AUC-ROC for the company anomaly task (~4.7%
     positive); secondary AUC-PR and F1 reported for context.'
 ibm-aml:
-  hf_repo: graphtestbed/ibm-aml
+  hf_repo: lanczos/graphtestbed-data
+  hf_subdir: ibm-aml
   hf_revision: main
   files:
     train_features:
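One property the edited manifest should now satisfy: every task points at the single public repo, and each `hf_subdir` mirrors its task key. A small sanity-check sketch over a dict shaped like the manifest (the inline data is abridged by hand from the diff, not loaded from the real file):

```python
# Abridged, hand-copied from the manifest diff.
manifest = {
    "ieee-fraud-detection": {"hf_repo": "lanczos/graphtestbed-data",
                             "hf_subdir": "ieee-fraud-detection"},
    "arxiv-citation":       {"hf_repo": "lanczos/graphtestbed-data",
                             "hf_subdir": "arxiv-citation"},
    "figraph":              {"hf_repo": "lanczos/graphtestbed-data",
                             "hf_subdir": "figraph"},
    "ibm-aml":              {"hf_repo": "lanczos/graphtestbed-data",
                             "hf_subdir": "ibm-aml"},
}

# All four tasks share one public repo...
repos = {cfg["hf_repo"] for cfg in manifest.values()}
assert repos == {"lanczos/graphtestbed-data"}
# ...and each subdir matches its task name, so paths are predictable.
assert all(cfg["hf_subdir"] == task for task, cfg in manifest.items())
```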
graphtestbed/fetch.py CHANGED
@@ -29,12 +29,19 @@ def fetch_task(task: str, allow_unverified: bool = False) -> Path:
     out = cache_dir() / task
     out.mkdir(parents=True, exist_ok=True)
 
+    # If hf_subdir is set, the file is laid out as <subdir>/<filename> inside
+    # the repo (GLUE-style single-repo-many-subsets). Older single-repo-per-
+    # task entries leave hf_subdir unset and use bare filenames.
+    hf_subdir = cfg.get("hf_subdir", "").strip("/")
+
     n_unpinned = 0
     for key, spec in cfg["files"].items():
+        path_in_repo = (f"{hf_subdir}/{spec['filename']}"
+                        if hf_subdir else spec["filename"])
         try:
             local = hf_hub_download(
                 repo_id=cfg["hf_repo"],
-                filename=spec["filename"],
+                filename=path_in_repo,
                 revision=cfg.get("hf_revision", "main"),
                 repo_type="dataset",
                 cache_dir=str(out / "_hf_cache"),
server/space/push_data.py ADDED
@@ -0,0 +1,170 @@
+"""One-shot uploader for the agent-visible features (train/val/test) to a
+single public HF dataset repo, organized GLUE-style as one subdir per task.
+
+Layout in the repo:
+
+    lanczos/graphtestbed-data/
+    ├── README.md
+    ├── arxiv-citation/{train,val,test}_features.csv + sample_submission.csv
+    ├── figraph/...
+    ├── ibm-aml/...
+    └── ieee-fraud-detection/...
+
+The test_features.csv MUST already have its label column stripped out — this
+script does NOT strip it. Spot-check before upload by running with --dry-run.
+
+Usage:
+    HF_TOKEN=hf_xxx python server/space/push_data.py \
+        --repo lanczos/graphtestbed-data --src ~/.graphtestbed/data
+    # or one task at a time:
+    HF_TOKEN=hf_xxx python server/space/push_data.py \
+        --repo lanczos/graphtestbed-data --src ~/.graphtestbed/data \
+        --tasks figraph arxiv-citation
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+import tempfile
+from pathlib import Path
+
+import yaml
+from huggingface_hub import HfApi, create_repo
+
+REPO_ROOT = Path(__file__).resolve().parents[2]
+MANIFEST = REPO_ROOT / "datasets" / "manifest.yaml"
+
+FILES = ["train_features.csv", "val_features.csv",
+         "test_features.csv", "sample_submission.csv"]
+
+
+def _readme(tasks: list[str], cfg: dict) -> str:
+    lines = [
+        "---",
+        "license: mit",
+        "tags: [graph, benchmark, fraud-detection, graph-ml]",
+        "---",
+        "",
+        "# GraphTestbed Datasets",
+        "",
+        "Public train/val/test features for the four [GraphTestbed]"
+        "(https://github.com/zhuconv/GraphTestbed) tasks. Test labels are"
+        " held privately by the scoring server.",
+        "",
+        "## Why a single repo",
+        "",
+        "GLUE-style: one repo, one subdir per task, one README. Adding a"
+        " new task is a `git push` of one folder, not a new HF repo.",
+        "",
+        "## Subsets",
+        "",
+        "| Task | id col | metric | rows (train/val/test) | Source |",
+        "| --- | --- | --- | --- | --- |",
+    ]
+    for t in tasks:
+        c = cfg[t]
+        s = c["submission_schema"]
+        m = c["metric"]
+        # Pull the first sentence of the description as the source line
+        desc = (c.get("description", "") or "").split(".")[0]
+        lines.append(
+            f"| `{t}` | `{s['id_col']}` | `{m['primary']}` | "
+            f"see csv | {desc.strip()[:60]} |"
+        )
+    lines += [
+        "",
+        "## Use",
+        "",
+        "```python",
+        "from huggingface_hub import hf_hub_download",
+        "import pandas as pd",
+        "",
+        "p = hf_hub_download(",
+        "    'lanczos/graphtestbed-data', 'arxiv-citation/train_features.csv',",
+        "    repo_type='dataset',",
+        ")",
+        "train = pd.read_csv(p)",
+        "```",
+        "",
+        "**Contract:** treat upstream sources (e.g. relbench, FiGraph github,"
+        " IBM AML kaggle) as out-of-bounds for evaluation purposes. Train +"
+        " HPO on what's in this repo only.",
+        "",
+        "Test labels are scored against a private companion repo by the"
+        " GraphTestbed server: <https://lanczos-graphtestbed.hf.space/>.",
+    ]
+    return "\n".join(lines)
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser(prog="push_data")
+    ap.add_argument("--repo", required=True,
+                    help="HF dataset repo id, e.g. lanczos/graphtestbed-data")
+    ap.add_argument("--src", required=True, type=Path,
+                    help="Local source root (e.g. ~/.graphtestbed/data) — "
+                         "must contain a subdir per task with the 4 CSVs.")
+    ap.add_argument("--tasks", nargs="+", default=None,
+                    help="Limit to these task names (default: all in manifest)")
+    ap.add_argument("--dry-run", action="store_true")
+    args = ap.parse_args()
+
+    cfg = yaml.safe_load(MANIFEST.read_text())
+    tasks = args.tasks or sorted(cfg)
+
+    src_root = args.src.expanduser()
+    missing = []
+    for t in tasks:
+        for f in FILES:
+            if not (src_root / t / f).exists():
+                missing.append(f"{t}/{f}")
+    if missing:
+        sys.exit("Missing files:\n  " + "\n  ".join(missing))
+
+    token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")
+    if not token:
+        sys.exit("Set HF_TOKEN env var with write scope on the namespace.")
+
+    api = HfApi(token=token)
+
+    if args.dry_run:
+        print(f"[dry-run] would push to {args.repo}:")
+        for t in tasks:
+            for f in FILES:
+                p = (src_root / t / f).resolve()
+                size_mb = p.stat().st_size / (1024 * 1024)
+                print(f"  {t}/{f} ({size_mb:.1f} MB)")
+        return
+
+    create_repo(args.repo, repo_type="dataset", token=token,
+                exist_ok=True, private=False)
+
+    # Write the README into a tempdir so we don't dirty the source root
+    with tempfile.TemporaryDirectory() as td:
+        readme = Path(td) / "README.md"
+        readme.write_text(_readme(tasks, cfg))
+        api.upload_file(
+            path_or_fileobj=str(readme),
+            path_in_repo="README.md",
+            repo_id=args.repo,
+            repo_type="dataset",
+            commit_message="Update README (auto from push_data.py)",
+        )
+
+    for t in tasks:
+        # upload_folder follows symlinks via the underlying open() calls.
+        api.upload_folder(
+            folder_path=str(src_root / t),
+            path_in_repo=t,
+            repo_id=args.repo,
+            repo_type="dataset",
+            allow_patterns=FILES,
+            commit_message=f"Push {t} train/val/test features",
+        )
+        print(f"  ✓ {t}/")
+    print("Done.")
+
+
+if __name__ == "__main__":
+    main()
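A note on the safety property the commit message claims: `upload_folder(allow_patterns=FILES)` acts as a whitelist, so stray files in a task's source dir (including any label file accidentally left there) are never uploaded. A rough sketch of that filtering effect using `fnmatch` (huggingface_hub's matching is glob-based; exact semantics may differ at the margins, and the listing below is hypothetical):

```python
import fnmatch

# Same whitelist as push_data.py.
FILES = ["train_features.csv", "val_features.csv",
         "test_features.csv", "sample_submission.csv"]

# Hypothetical directory listing with extra files that must NOT be pushed.
listing = ["train_features.csv", "val_features.csv", "test_features.csv",
           "sample_submission.csv", "test_labels_PRIVATE.csv", "scratch.ipynb"]

# Keep only entries matching some allow-pattern; everything else is skipped.
kept = [f for f in listing if any(fnmatch.fnmatch(f, pat) for pat in FILES)]
print(kept)  # only the four whitelisted CSVs survive
```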