Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,62 @@
---
license: mit
---

# SWE-Bench Trajectory Eval Bundle (v1)

Companion artifact for the trajectory-probe downstream eval of the
code-graph-v7 encoders (W1, I6, ...).

## Contents

- `traj_full_bundle.tar.gz` (488 MB), which contains:
  - `specs.jsonl`: 2456 SWE-Bench Verified agent trajectories harvested
    from the `swe-bench-submissions` S3 bucket. Fields: `instance_id`, `traj_id`,
    `repo`, `base_commit`, `patches` (1 entry = final model patch), `resolved`.
  - `repos/`: blobless partial (`--filter=blob:none`) clones of the 12 target
    repos (django, sympy, sphinx, matplotlib, scikit-learn, astropy,
    xarray, pytest, pylint, requests, seaborn, flask). ~671 MB
    uncompressed. Blobs are fetched lazily on each `base_commit` checkout,
    so a checkout may need network access to the origin remote.
  - `graphjepa/`: pipeline code (`trajectory_pipeline`, `trajectory_realize`,
    `trajectory_probe`, `trajectory_harvest`) plus `scripts/trajectory_full.sh`.
- `harvest.log`: stdout from the S3 harvester that produced `specs.jsonl`.

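
For orientation, here is a minimal sketch of reading `specs.jsonl`. It assumes the usual JSON Lines layout (one JSON object per line) and that `resolved` is a boolean; neither is stated above. `load_specs` and `summarize` are hypothetical helper names, not part of `graphjepa`.

```python
import json
from collections import Counter
from pathlib import Path


def load_specs(path):
    """Read one JSON object per non-empty line (assumed JSON Lines layout)."""
    specs = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                specs.append(json.loads(line))
    return specs


def summarize(specs):
    """Count trajectories overall, per repo, and how many are marked resolved."""
    per_repo = Counter(spec["repo"] for spec in specs)
    # Assumption: `resolved` is a boolean (or at least truthy/falsy).
    n_resolved = sum(1 for spec in specs if spec["resolved"])
    return {
        "n": len(specs),
        "resolved": n_resolved,
        "unique_instances": len({spec["instance_id"] for spec in specs}),
        "per_repo": dict(per_repo),
    }
```

Calling `summarize(load_specs("outputs/traj_real/specs.jsonl"))` after unpacking should give `n == 2456` if the file matches the description above.
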
## Downstream workflow

```bash
tar -xzf traj_full_bundle.tar.gz
rsync -a traj_full/graphjepa/ graphjepa/
mkdir -p outputs/traj_real
cp traj_full/specs.jsonl outputs/traj_real/
mv traj_full/repos outputs/traj_real/repos

# realize (4 sharded workers by repo)
SHARDS=4 bash graphjepa/scripts/trajectory_full.sh
# follow the worker logs (tail -f runs until interrupted with Ctrl-C)
tail -f outputs/traj_real/logs/realize_shard*.log

# merge manifests + probe with each encoder
cat outputs/traj_real/manifest_shard*.jsonl > outputs/traj_real/manifest.jsonl
for NAME in W1_softplus_s0 I6_joint_s0; do
  .venv/bin/python -m graphjepa.trajectory_probe \
    --manifest outputs/traj_real/manifest.jsonl \
    --ckpt "outputs/${NAME}/ckpt_final.pt" \
    --pool mean --split-by repo \
    --output "outputs/traj_real/probe_${NAME}.json"
done
```

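
The `cat` merge step passes lines through unchecked. As an optional alternative, the sketch below does the same concatenation in Python while verifying that every line parses as JSON. It assumes the shard manifests are JSON Lines files and makes no assumption about their fields; `merge_manifests` is a hypothetical helper, not part of `graphjepa`.

```python
import json
from pathlib import Path


def merge_manifests(out_dir, pattern="manifest_shard*.jsonl",
                    merged_name="manifest.jsonl"):
    """Concatenate shard manifests into one JSONL file; return the row count."""
    out_dir = Path(out_dir)
    # Sorted by file name; the default pattern does not match the merged file.
    shards = sorted(out_dir.glob(pattern))
    if not shards:
        raise FileNotFoundError(f"no files matching {pattern} in {out_dir}")
    rows = 0
    with (out_dir / merged_name).open("w", encoding="utf-8") as out:
        for shard in shards:
            with shard.open(encoding="utf-8") as fh:
                for lineno, line in enumerate(fh, start=1):
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    try:
                        json.loads(line)  # validity check only; line is copied as-is
                    except json.JSONDecodeError as exc:
                        raise ValueError(
                            f"{shard.name}:{lineno}: invalid JSON"
                        ) from exc
                    out.write(line + "\n")
                    rows += 1
    return rows
```

For example, `merge_manifests("outputs/traj_real")` writes `outputs/traj_real/manifest.jsonl` and returns the number of rows merged. A truncated line from a worker that died mid-write raises an error here instead of surfacing later in the probe.
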
## Provenance

Specs harvested from 5 SWE-Bench Verified submissions:

| Submission | N | Resolved | Rate |
|---|---|---|---|
| 20240620_sweagent_claude3.5sonnet | 485 | 168 | 34.6% |
| 20241022_tools_claude-3-5-sonnet-updated | 483 | 245 | 50.7% |
| 20241028_agentless-1.5_gpt4o | 495 | 194 | 39.2% |
| 20241029_OpenHands-CodeAct-2.1-sonnet | 493 | 265 | 53.8% |
| 20250405_amazon-q-developer-2025 | 500 | 330 | 66.0% |
| **total** | **2456** | **1202** | **48.9%** |

500 unique instance_ids, 499 unique base_commits (median 5 trajectories
per commit: different agents attempting the same task).
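
The rates in the table can be recomputed from the counts. The short sketch below does so with the numbers copied from the table. `resolve_rate` and `totals` are illustrative helper names, and the total rate is pooled over all trajectories rather than averaged over submissions.

```python
# (N, resolved) per submission, copied from the table above.
SUBMISSIONS = {
    "20240620_sweagent_claude3.5sonnet": (485, 168),
    "20241022_tools_claude-3-5-sonnet-updated": (483, 245),
    "20241028_agentless-1.5_gpt4o": (495, 194),
    "20241029_OpenHands-CodeAct-2.1-sonnet": (493, 265),
    "20250405_amazon-q-developer-2025": (500, 330),
}


def resolve_rate(n, resolved):
    """Resolved share as a percentage, rounded to one decimal place."""
    return round(100 * resolved / n, 1)


def totals(submissions):
    """Pooled (N, resolved, rate) across all submissions."""
    n = sum(n for n, _ in submissions.values())
    resolved = sum(r for _, r in submissions.values())
    return n, resolved, resolve_rate(n, resolved)
```

`totals(SUBMISSIONS)` returns `(2456, 1202, 48.9)`, matching the **total** row.
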