GraphTestbed Protocol

The contract between agent harnesses, the public client (gtb), and the scoring API (server/api.py).

1. Two-branch single-repo layout

GraphTestbed (single public repo)
├── main branch     → what agents `git clone`
└── server branch   → what the maintainer deploys

Branches are not access control; they're an organizational split. Both are public; both can be cloned by anyone. The actual privacy boundary is:

  • Test labels never enter git (any branch).
  • The scoring server holds GT at /var/graphtestbed/gt/<task>.csv, populated separately from git at deploy time.

2. Data layout (per task)

What the agent sees (HuggingFace public dataset)

graphtestbed/<task>:
  train_features.csv       # labeled
  val_features.csv         # labeled
  test_features.csv        # NO target column
  description.md
  sample_submission.csv
  <auxiliary>.csv          # optional: edges*.csv, train_text.csv, ...

What the server holds privately

/var/graphtestbed/gt/<task>.csv
  <id_col>,Label           # one row per test entity, with the true label

Its ID column matches the ID set of test_features.csv[<id_col>] exactly.
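
For orientation, a minimal sketch of loading a task's files with pandas, assuming they have been downloaded from the HF dataset to data/<task>/ (the local path is illustrative):

import pandas as pd

task = "arxiv-citation"
base = f"data/{task}"  # illustrative local download location

train = pd.read_csv(f"{base}/train_features.csv")  # labeled
val = pd.read_csv(f"{base}/val_features.csv")      # labeled
test = pd.read_csv(f"{base}/test_features.csv")    # no target column

# The test IDs define exactly what a submission must cover (see §3).
print(f"{len(test)} test rows")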

3. Submission contract

Format

Per-task schema in datasets/manifest.yaml:

<task>:
  submission_schema:
    id_col: <name>
    pred_col: <name>
    n_rows: <int>
    pred_dtype: float    # binary classification

CSV must have:

  • Exactly 2 columns: <id_col>, <pred_col>
  • Exactly n_rows rows
  • ID set exactly matching test_features.csv[<id_col>]
  • <pred_col> values: float in [0, 1] for binary
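
A minimal sketch of those four checks in pandas; validate_submission is a hypothetical helper name, and schema mirrors a submission_schema entry from manifest.yaml:

import pandas as pd

def validate_submission(csv_path, test_ids, schema):
    """Check a submission CSV against the manifest schema; raise ValueError on failure."""
    df = pd.read_csv(csv_path)
    id_col, pred_col = schema["id_col"], schema["pred_col"]
    if list(df.columns) != [id_col, pred_col]:
        raise ValueError(f"expected columns [{id_col}, {pred_col}], got {list(df.columns)}")
    if len(df) != schema["n_rows"]:
        raise ValueError(f"expected {schema['n_rows']} rows, got {len(df)}")
    if set(df[id_col]) != set(test_ids):
        raise ValueError("ID set does not match test_features.csv")
    preds = df[pred_col].astype(float)
    if preds.isna().any() or (preds < 0).any() or (preds > 1).any():
        raise ValueError("predictions must be floats in [0, 1]")

This is the check gtb runs locally before any API call (step 1 below).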

Submit

gtb submit <task> --file <preds.csv> --agent <name>

This:

  1. Validates schema locally (no API call if malformed).
  2. POSTs multipart/form-data to $GRAPHTESTBED_API/submit with form fields task=<task>, agent=<name>, and the CSV as file.
  3. Server re-validates schema, checks quota, scores against GT, returns metrics JSON.
  4. Client prints metrics + leaderboard rank + remaining quota.
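
The HTTP call behind the client is plain multipart; a sketch with requests (the task and agent values are illustrative, response fields as in §4):

import os
import requests

api = os.environ["GRAPHTESTBED_API"]
with open("preds.csv", "rb") as f:
    resp = requests.post(
        f"{api}/submit",
        data={"task": "arxiv-citation", "agent": "autopipe-v0.4"},
        files={"file": ("preds.csv", f, "text/csv")},
        timeout=60,
    )
resp.raise_for_status()
metrics = resp.json()
print(metrics["primary"], metrics["leaderboard_rank"], metrics["quota_remaining"])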

4. API endpoints

POST /submit (multipart)

Form fields:

  • task (str, required)
  • agent (str, required)
  • file (CSV, required, ≤50 MB)

Response 200:

{
  "run_id": "8a3f29bce4a1",
  "task": "arxiv-citation",
  "agent": "autopipe-v0.4",
  "primary": 0.689,
  "secondary": {"auc_pr": 0.661, "f1": 0.591},
  "n_rows": 19394,
  "leaderboard_rank": 3,
  "quota_remaining": 4,
  "submitted_at": "2026-04-18T14:23:00"
}

Error responses:

  • 400: missing form fields, malformed CSV
  • 404: unknown task
  • 422: schema check failed (response includes the reason)
  • 429: quota exceeded
  • 503: GT not deployed for this task
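
On the server side, a compressed Flask sketch of this flow; KNOWN_TASKS, validate_schema, check_quota, and score_against_gt are hypothetical stand-ins for the real logic in server/api.py:

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
KNOWN_TASKS = {"arxiv-citation", "figraph"}  # illustrative

@app.post("/submit")
def submit():
    task, agent = request.form.get("task"), request.form.get("agent")
    file = request.files.get("file")
    if not (task and agent and file):
        abort(400)                                 # missing form fields
    if task not in KNOWN_TASKS:
        abort(404)                                 # unknown task
    ok, reason = validate_schema(task, file)       # schema check before quota (§5)
    if not ok:
        return jsonify({"error": reason}), 422
    if not check_quota(task, request.remote_addr):
        abort(429)                                 # quota exceeded
    metrics = score_against_gt(task, file)         # aborts 503 if GT file is absent
    metrics["primary"] = round(metrics["primary"], 3)  # score bucketing (§5)
    return jsonify(metrics)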

GET /leaderboard/<task>

Returns an array sorted by primary descending, keeping only each agent's best submission (not all submissions):

[
  {"agent": "human-baseline", "primary": 0.901, "n_submissions": 1, "first_seen": "..."},
  {"agent": "autopipe-v0.4",  "primary": 0.689, "n_submissions": 4, "first_seen": "..."},
  ...
]

GET /healthz

{
  "status": "ok",
  "tasks": ["ieee-fraud-detection", ...],
  "gt_present": ["arxiv-citation", "figraph"],
  "quota_per_day": 5,
  "uptime_unix": 1745081234
}
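
Both GET endpoints are useful before submitting, e.g. to confirm GT is deployed; a sketch using the same $GRAPHTESTBED_API base:

import os
import requests

api = os.environ["GRAPHTESTBED_API"]
task = "arxiv-citation"

health = requests.get(f"{api}/healthz", timeout=10).json()
if task not in health["gt_present"]:
    raise SystemExit(f"GT not deployed for {task}; /submit would return 503")

board = requests.get(f"{api}/leaderboard/{task}", timeout=10).json()
for row in board:
    print(f'{row["agent"]:<20} {row["primary"]:.3f} ({row["n_submissions"]} submissions)')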

5. Anti-leakage rails (lightweight)

  • Test labels not in git: .gitignore blocks ground_truth* and private/; GT lives only on the server filesystem.
  • Score bucketing: the server rounds scores to 3 decimals before returning them.
  • No per-row feedback: the API returns aggregate metrics only.
  • Quota: 5 submissions / day / IP / task by default.
  • Schema check first: malformed submissions don't burn quota and never reach the scorer.

6. Reproducibility

Every server response includes run_id. The server stores each run in sqlite:

run_id, task, agent, primary_metric, secondary_json, submission_sha256,
n_rows, submitter_ip, submitted_at

To audit a leaderboard entry, the maintainer queries sqlite by run_id and retrieves the submission CSV from submissions/<task>/<agent>/<ts>.csv (the client doesn't push these to git automatically; see §8 if you want that).
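
A sketch of the write path with stdlib sqlite3; the runs table name is an assumption, and the columns are the nine listed above:

import hashlib
import json
import sqlite3
import uuid
from datetime import datetime, timezone

def store_run(db_path, task, agent, primary, secondary, csv_bytes, n_rows, ip):
    """Record one scored submission; assumes a nine-column runs table exists."""
    run_id = uuid.uuid4().hex[:12]  # e.g. "8a3f29bce4a1"
    con = sqlite3.connect(db_path)
    con.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (run_id, task, agent, primary, json.dumps(secondary),
         hashlib.sha256(csv_bytes).hexdigest(), n_rows, ip,
         datetime.now(timezone.utc).isoformat()),
    )
    con.commit()
    con.close()
    return run_id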

7. Versioning

Dataset versions

manifest.yaml pins (hf_repo, hf_revision, sha256) per file. To bump:

  1. Push new files to HF as a new revision.
  2. Update manifest.yaml's sha256 fields.
  3. Existing leaderboard entries reference the old data via their run_id's stored submission_sha256; new submissions use the new data.

For breaking changes (different metric, different test rows), add a v2 task entry: arxiv-citation-v2. Don't mutate v1's manifest.
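
The sha256 fields in step 2 are one hashlib call per file; a sketch (the file list is illustrative):

import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

for name in ["train_features.csv", "val_features.csv", "test_features.csv"]:
    print(name, sha256_of(name))  # paste into manifest.yaml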

Metric versions

Same rule: don't mutate, add <task>-v2 if metric definition changes.

8. Optional: client-side submission archiving

If you want every submission CSV preserved publicly, add a step to graphtestbed/submit.py that opens a PR with the CSV under submissions/<task>/<agent>/<ts>.csv after the API returns. Off by default to keep the client simple.
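
A minimal sketch of that archiving step, covering only the local copy (archive_submission is a hypothetical name; the PR itself, e.g. via git plus gh pr create, is left as a comment):

import shutil
import time
from pathlib import Path

def archive_submission(task, agent, csv_path):
    """Copy a scored submission CSV under submissions/<task>/<agent>/<ts>.csv."""
    ts = time.strftime("%Y%m%dT%H%M%S")
    dest = Path("submissions") / task / agent / f"{ts}.csv"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(csv_path, dest)
    # Branch, commit, and open the PR here (e.g. `gh pr create`) if archiving is enabled.
    return dest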

9. What the agent must NOT assume

  • No persistent state between submissions. Each /submit is independent.
  • No mid-run feedback. One scorer call per submit, full stop.
  • No private leaderboard. Every submission is rate-counted by IP and scored; there is no shadow leaderboard for testing.
  • The agent field is your name. The server takes you at your word; don't impersonate.

10. What the API explicitly does NOT defend against

  • An agent that downloads upstream data (Kaggle, the FiGraph GitHub repo) and tries to recompute test labels. We re-split for time-forward evaluation, so upstream labels don't map directly, but a determined agent could try.
  • An agent that checks out server branch to read scoring code. The branch is public; reading the scoring logic doesn't leak GT.
  • An agent that runs on a private machine and tries to OCR a maintainer's laptop screen. We are not your adversary.

If your threat model includes adversarial agents with arbitrary capability, this design is wrong; use container-isolated evaluation instead.

11. Rationale (why this vs the prior GitHub-Actions design)

We considered using GitHub Actions + a private GT repo. Rejected because:

  • Two repos (public + private GT) are more moving parts than one repo + one small server.
  • A FastAPI/Flask app on a $5/month VM (or free HF Space) gives us the same privacy boundary (GT lives on server fs, not in git).
  • Synchronous scoring (instant response) is a better UX than async PR comments.
  • The trust model is "agents are honest within an academic benchmark," not "agents are adversaries." That doesn't need physical isolation; it needs an API contract that prevents accidental leaks.

The result: ~200 lines of server code, ~150 lines of client code, deploys on free tiers, no maintenance beyond restarting the server if it crashes.