IDMedicine/code-graph-trajeval-v1

Bremin committed 16 days ago

Commit 8fdba4d (verified) · 1 parent: a38a118

Upload README.md with huggingface_hub

Files changed (1)

README.md +62 -0

README.md ADDED

@@ -0,0 +1,62 @@

+---
+license: mit
+---
+# SWE-Bench Trajectory Eval Bundle (v1)
+Companion artifact for the trajectory-probe downstream eval of the
+code-graph-v7 encoders (W1, I6, ...).
+## Contents
+- `traj_full_bundle.tar.gz` (488 MB) — contains:
+  - `specs.jsonl`: 2456 SWE-Bench Verified agent trajectories harvested
+    from the `swe-bench-submissions` S3 bucket. Fields: instance_id, traj_id,
+    repo, base_commit, patches (a single entry: the final model patch), resolved.
+  - `repos/`: blobless partial clones (`--filter=blob:none`) of the 12 target
+    repos (django, sympy, sphinx, matplotlib, scikit-learn, astropy,
+    xarray, pytest, pylint, requests, seaborn, flask), ~671 MB
+    uncompressed. Blobs are fetched lazily on each base_commit checkout.
+  - `graphjepa/`: pipeline code (trajectory_pipeline, trajectory_realize,
+    trajectory_probe, trajectory_harvest) plus scripts/trajectory_full.sh.
+- `harvest.log` — stdout from the S3 harvester that produced specs.jsonl.
+## Downstream workflow
+```bash
+tar -xzf traj_full_bundle.tar.gz
+rsync -a traj_full/graphjepa/ graphjepa/
+mkdir -p outputs/traj_real
+cp traj_full/specs.jsonl outputs/traj_real/
+mv traj_full/repos outputs/traj_real/repos
+# realize trajectories with 4 workers, sharded by repo
+SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
+# follow the per-shard logs (Ctrl-C to stop following)
+tail -f outputs/traj_real/logs/realize_shard*.log
+# merge manifests + probe with each encoder
+cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
+for NAME in W1_softplus_s0 I6_joint_s0; do
+  .venv/bin/python -m graphjepa.trajectory_probe \
+    --manifest outputs/traj_real/manifest.jsonl \
+    --ckpt outputs/$NAME/ckpt_final.pt \
+    --pool mean --split-by repo \
+    --output outputs/traj_real/probe_${NAME}.json
+done
+```
+## Provenance
+Specs harvested from 5 SWE-Bench Verified submissions:
+| Submission | N | Resolved | Rate |
+|---|---|---|---|
+| 20240620_sweagent_claude3.5sonnet | 485 | 168 | 34.6% |
+| 20241022_tools_claude-3-5-sonnet-updated | 483 | 245 | 50.7% |
+| 20241028_agentless-1.5_gpt4o | 495 | 194 | 39.2% |
+| 20241029_OpenHands-CodeAct-2.1-sonnet | 493 | 265 | 53.8% |
+| 20250405_amazon-q-developer-2025 | 500 | 330 | 66.0% |
+| **total** | **2456** | **1202** | **48.9%** |
+
+The bundle covers 500 unique instance_ids and 499 unique base_commits, with a
+median of 5 trajectories per commit (different agents attempting the same task).
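
The summary figures quoted in the README (trajectory count, unique instance_ids, unique base_commits, median trajectories per commit, overall resolve rate) can be recomputed from `specs.jsonl` after unpacking. The following is a minimal sketch, not part of the bundled `graphjepa` code. It assumes each line of `specs.jsonl` is one JSON object carrying the documented `instance_id`, `base_commit`, and `resolved` fields, and that `resolved` is a boolean. The helper names `load_specs` and `summarize` are hypothetical.

```python
import json
from collections import Counter
from statistics import median


def load_specs(path):
    """Read one JSON object per non-empty line (assumed JSONL layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(specs):
    """Recompute the README's summary statistics from spec records.

    Assumes every record has `instance_id`, `base_commit`, and a
    truthy/falsy `resolved` field, as listed in the README.
    """
    # Number of trajectories attached to each base commit.
    per_commit = Counter(s["base_commit"] for s in specs)
    resolved = sum(1 for s in specs if s["resolved"])
    return {
        "trajectories": len(specs),
        "unique_instances": len({s["instance_id"] for s in specs}),
        "unique_commits": len(per_commit),
        # median() raises on empty input, so guard the empty case.
        "median_per_commit": median(per_commit.values()) if per_commit else 0,
        "resolved": resolved,
        "resolve_rate": resolved / len(specs) if specs else 0.0,
    }
```

With the workflow paths from the README, a call such as `summarize(load_specs("outputs/traj_real/specs.jsonl"))` should reproduce the totals in the provenance table if the bundle matches its description. A mismatch would suggest a partial download or a different field layout than assumed here.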