Zhu Jiajun (jz28583) and Claude Opus 4.7 (1M context) committed
Commit 0309359 · 1 parent: d05b5bd

arxiv-citation: ship the heterograph (citations + author/category tables)


Without the relation tables, agents collapsed arxiv-citation into a pure
tabular problem over the per-paper feature CSV (figraph and ibm-aml run the
same way). That is a significant under-fit: the task is a graph task, because
RelBench rel-arxiv:paper-citation is a temporal heterograph.

Added five tables under the arxiv-citation/ subdir on HF (a loading sketch
follows the list):

citations.csv (Paper_ID, References_Paper_ID, Submission_Date) - 1.2M rows;
    filtered to Submission_Date < 2023-01-01 so test-period citations
    (which encode the labels) do not leak.
paperAuthors.csv (Paper_ID, Author_ID, Submission_Date) - 617k rows.
paperCategories.csv (Paper_ID, Category_ID, Submission_Date) - 155k rows.
authors.csv (Author_ID, Name, ORCID) - 144k rows.
categories.csv (Category_ID, Category) - 53 rows.
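
For orientation only (not part of the commit), a minimal sketch of how an agent
might assemble the paper/author heterograph from these tables. It assumes pandas
plus PyTorch Geometric, neither of which this repo mandates; any framework that
accepts typed edge lists works the same way:

```python
import numpy as np
import pandas as pd
import torch
from torch_geometric.data import HeteroData  # assumption: PyG is available

base = "arxiv-citation"  # assumed local copy of the HF subdir
cites = pd.read_csv(f"{base}/citations.csv")      # Paper_ID, References_Paper_ID, Submission_Date
pa = pd.read_csv(f"{base}/paperAuthors.csv")      # Paper_ID, Author_ID, Submission_Date

# Map raw IDs to contiguous node indices, one namespace per node type.
papers = pd.Index(pd.unique(pd.concat([cites["Paper_ID"],
                                       cites["References_Paper_ID"],
                                       pa["Paper_ID"]])))
authors = pd.Index(pd.unique(pa["Author_ID"]))

g = HeteroData()
g["paper"].num_nodes = len(papers)
g["author"].num_nodes = len(authors)
# citations.csv is already filtered to pre-2023 edges, so no extra cutoff here.
g["paper", "cites", "paper"].edge_index = torch.from_numpy(np.stack(
    [papers.get_indexer(cites["Paper_ID"]),
     papers.get_indexer(cites["References_Paper_ID"])])).long()
g["author", "writes", "paper"].edge_index = torch.from_numpy(np.stack(
    [authors.get_indexer(pa["Author_ID"]),
     papers.get_indexer(pa["Paper_ID"])])).long()
```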

The manifest gets one entry per file, and the auto-generated agent instruction
template now lists every file declared in `files:` (previously a hardcoded
three-item list), so agents actually see them.
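
As a quick illustration, calling `_files_block` (added in the tasks.py diff
below) on a stripped-down, hypothetical manifest entry renders one bullet per
declared file, canonical hints first; the real manifest declares more files:

```python
cfg = {"files": {
    "train":     {"filename": "train_features.csv"},
    "test":      {"filename": "test_features.csv"},
    "citations": {"filename": "citations.csv"},
}}
print(_files_block(cfg))
# - `train_features.csv` — labeled training rows
# - `test_features.csv` — **unlabeled** test rows; predict here
# - `citations.csv` — additional task data (see description above)
```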

The mlevolve adapter forwards any non-canonical files into the public tree
beside train.csv/test.csv so its REPL can read them with a relative path.
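
Concretely, after staging, the agent's REPL can read something like the
following (a sketch; it assumes the REPL's working directory is the task's
public tree):

```python
import pandas as pd

# Canonical tabular file staged by the adapter ...
train = pd.read_csv("train.csv")
# ... plus any forwarded relation table, readable by bare filename.
cites = pd.read_csv("citations.csv")
```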

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

agents/common/tasks.py CHANGED
```diff
@@ -20,17 +20,14 @@ _TEMPLATE = """\
 
 ## Files you will see
 
-- `train_features.csv` — labeled training rows
-- `val_features.csv` — labeled validation rows (use for HPO / early stopping)
-- `test_features.csv` — **unlabeled** test rows; predict here
+{files_block}
 
 These are pulled from `lanczos/graphtestbed-data` on HuggingFace (subdir
 `{task}/`). **Train and HPO on these files only** — do not pull from the
 upstream source mentioned above to recover test labels. The benchmark is
 non-adversarial; we trust agent authors to honor the contract.
 
-The `Label` (or task-specific target) column is present in train/val and
-absent from test.
+The label column is present in train/val and absent from test.
 
 ## Submission format
 
@@ -56,6 +53,37 @@ _DTYPE_DESC = {
 }
 
 
+_KNOWN_FILE_HINTS = {
+    "train_features.csv": "labeled training rows",
+    "val_features.csv": "labeled validation rows (use for HPO / early stopping)",
+    "test_features.csv": "**unlabeled** test rows; predict here",
+    "sample_submission.csv": "the schema you must match (column order + row IDs)",
+}
+
+
+def _files_block(cfg: dict) -> str:
+    """Render every file declared in manifest, with a known hint when we have
+    one — otherwise just the filename so the agent knows it's available."""
+    lines = []
+    seen = set()
+    # Show the canonical four first in a fixed order, then everything else
+    # (graph tables, edges, etc.) in manifest declaration order.
+    canonical = ["train_features.csv", "val_features.csv",
+                 "test_features.csv", "sample_submission.csv"]
+    by_name = {spec["filename"]: key for key, spec in cfg["files"].items()}
+    for fn in canonical:
+        if fn in by_name:
+            lines.append(f"- `{fn}` — {_KNOWN_FILE_HINTS[fn]}")
+            seen.add(fn)
+    for key, spec in cfg["files"].items():
+        fn = spec["filename"]
+        if fn in seen:
+            continue
+        hint = _KNOWN_FILE_HINTS.get(fn, "additional task data (see description above)")
+        lines.append(f"- `{fn}` — {hint}")
+    return "\n".join(lines)
+
+
 def task_instruction(task: str) -> str:
     override = Path(__file__).parent / "tasks_md" / f"{task}.md"
     if override.exists():
@@ -69,6 +97,7 @@ def task_instruction(task: str) -> str:
     return _TEMPLATE.format(
         task=task,
         description=str(cfg.get("description", "")).strip(),
+        files_block=_files_block(cfg),
         id_col=s["id_col"],
         pred_col=s["pred_col"],
         n_rows=s.get("n_rows", "?"),
```
agents/mlevolve/adapter.py CHANGED
```diff
@@ -76,4 +76,17 @@ def stage(task: str, root: Path) -> Path:
     # Stash the real test set for post-search re-execution by the user.
     test.to_csv(root / task / "REAL_TEST_FEATURES.csv", index=False)
 
+    # Forward any additional task data files declared in the manifest (graph
+    # edges, relation tables, …) into the public tree so the agent can build
+    # a real graph model instead of treating the task as pure tabular.
+    canonical = {"train_features.csv", "val_features.csv",
+                 "test_features.csv", "sample_submission.csv"}
+    for spec in cfg["files"].values():
+        fn = spec["filename"]
+        if fn in canonical:
+            continue
+        src_path = src / fn
+        if src_path.exists():
+            (pub / fn).write_bytes(src_path.read_bytes())
+
     return base
```
datasets/manifest.yaml CHANGED
```diff
@@ -61,6 +61,21 @@ arxiv-citation:
     sample_submission:
       filename: sample_submission.csv
       sha256: TBD
+    citations:
+      filename: citations.csv
+      sha256: TBD
+    paper_authors:
+      filename: paperAuthors.csv
+      sha256: TBD
+    paper_categories:
+      filename: paperCategories.csv
+      sha256: TBD
+    authors:
+      filename: authors.csv
+      sha256: TBD
+    categories:
+      filename: categories.csv
+      sha256: TBD
   submission_schema:
     id_col: Paper_ID
     pred_col: Label
@@ -74,7 +89,23 @@ arxiv-citation:
   description: 'Predict whether each arXiv paper receives ≥1 citation within 6 months
     after submission. Source: RelBench rel-arxiv:paper-citation (stanford-snap/relbench,
    MIT). Temporal split: train cutoff 2022-01-01, val cutoff 2023-01-01, test from
-    val cutoff onward. Test rows: 193,694 (~42.7% positive).
+    val cutoff onward. Test rows: 193,696 (~42.7% positive).
+
+
+    This is a GRAPH task. Beyond train/val/test_features.csv (one row per paper with
+    pre-extracted scalar features), the subdir also ships the relational tables that
+    let you build the actual paper-author-category-citation heterograph:
+
+    citations.csv (Paper_ID, References_Paper_ID, Submission_Date) — 1.2M
+        edges; filtered to Submission_Date < 2023-01-01 to
+        prevent test-label leakage.
+    paperAuthors.csv (Paper_ID, Author_ID, Submission_Date) — 617k edges.
+    paperCategories.csv (Paper_ID, Category_ID, Submission_Date) — 155k edges.
+    authors.csv (Author_ID, Name, ORCID) — 144k author entities.
+    categories.csv (Category_ID, Category) — 53 category entities.
+
+    A purely tabular model that ignores these will under-fit. Most baselines for this
+    benchmark use a GNN (GraphSAGE / R-GCN / temporal HGN) over the heterograph.
 
 
     Metric: AUC-ROC, matching RelBench rel-arxiv:paper-citation (the official benchmark
```