SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Paper • 2509.16941
pip install datasets transformers trl openai
export OPENAI_API_KEY="your-key"
# Or add to .env in project directory
cd experiments
python loop.py \
--model Qwen/Qwen2.5-0.5B-Instruct \
--provider openai-chat \
--iterations 3 \
--max-instances 50 \
--method grpo
| File | Purpose |
|---|---|
| `tau_bridge.py` | Python bridge to the tau Rust harness |
| `swe_bench_runner.py` | Run tau on SWE-Bench Pro instances |
| `evaluate.py` | Docker-based patch evaluation |
| `failure_classifier.py` | LLM-as-judge failure mode classifier |
| `train_v2.py` | GRPO training with TRL `environment_factory` |
| `loop.py` | Main autoresearch loop orchestrator |
┌─────────────────────────────────────────────────────────────┐
│                      AUTORESEARCH LOOP                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. SAMPLE TASK    → SWE-Bench Pro public set               │
│         ↓                                                   │
│  2. RUN AGENT      → Tau harness + model + tools            │
│         ↓                                                   │
│  3. EVALUATE       → Docker: apply patch, run tests         │
│         ↓                                                   │
│  4. IF FAILED:     → Extract last 20 turns                  │
│         ↓          → Classify failure mode                  │
│         ↓          → Store (trajectory, failure_mode)       │
│         ↓                                                   │
│  5. TRAIN          → GRPO with custom reward functions      │
│         ↓          → Target specific failure modes          │
│         ↓                                                   │
│  6. DEPLOY         → Load fine-tuned model into tau         │
│         ↓                                                   │
│  7. REPEAT         → Go to step 1                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
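Steps 1 to 4 of the loop can be sketched in a few lines of Python. This is a minimal illustration, not the code in `loop.py`: the names `FailureRecord`, `autoresearch_iteration`, `run_agent`, `evaluate`, and `classify_failure` are hypothetical stand-ins for the harness, the Docker evaluator, and the LLM judge, and a trajectory is assumed to be a plain list of turns.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class FailureRecord:
    """A failed run: the tail of its trajectory plus the judged failure mode."""
    trajectory: List[str]
    failure_mode: str


def autoresearch_iteration(
    tasks: Sequence[str],
    run_agent: Callable[[str], List[str]],        # hypothetical: task -> list of turns
    evaluate: Callable[[str, List[str]], bool],   # hypothetical: True if tests pass
    classify_failure: Callable[[List[str]], str], # hypothetical: turns -> mode label
    last_n_turns: int = 20,
) -> Tuple[int, List[FailureRecord]]:
    """Run every task once and collect classified failures for training."""
    solved = 0
    failures: List[FailureRecord] = []
    for task in tasks:
        trajectory = run_agent(task)            # step 2: run the agent
        if evaluate(task, trajectory):          # step 3: apply patch, run tests
            solved += 1
            continue
        tail = trajectory[-last_n_turns:]       # step 4: keep the last N turns
        failures.append(FailureRecord(tail, classify_failure(tail)))
    return solved, failures
```

The returned failure records are what a training step (step 5) would consume. Passing the three stages in as callables keeps the sketch independent of how the harness, evaluator, and classifier are actually implemented.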
Group Relative Policy Optimization with custom reward functions:
- `patch_format_reward`: Rewards properly formatted git diffs
- `tool_use_reward`: Rewards correct tool usage patterns
- `reasoning_reward`: Rewards structured problem analysis
- `env_state_reward`: Rewards exploration and test execution

Training runs are logged to Trackio for monitoring:
tau-swe-pro
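To make the reward functions more concrete, here is a hedged sketch of what a patch-format reward could look like, together with the group-relative normalization that gives GRPO its name. It is not the code in `train_v2.py`. It assumes completions arrive as plain strings and follows the `(completions, **kwargs) -> list[float]` shape that TRL's `GRPOTrainer` accepts for custom reward functions. The 0.5/0.5 scoring, the `eps` value, and the helper `group_relative_advantages` are illustrative assumptions.

```python
import re
from statistics import mean, pstdev
from typing import List

# Unified-diff markers: the "diff --git" file header and an "@@ ... @@" hunk header.
DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/\S+", re.MULTILINE)
HUNK_HEADER = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@", re.MULTILINE)


def patch_format_reward(completions: List[str], **kwargs) -> List[float]:
    """Score each completion by how much it looks like a git diff (0.0 to 1.0)."""
    rewards = []
    for text in completions:
        score = 0.0
        if DIFF_HEADER.search(text):
            score += 0.5  # assumed weight: has a file header
        if HUNK_HEADER.search(text):
            score += 0.5  # assumed weight: has at least one hunk header
        rewards.append(score)
    return rewards


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt.

    Uses the population standard deviation; implementations may differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A completion that scores above its group's mean gets a positive advantage, and one below the mean gets a negative advantage. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.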