Tau Agent Self-Improvement on SWE-Bench Pro

Quick Start

1. Install Dependencies

pip install datasets transformers trl openai

2. Set API Keys

export OPENAI_API_KEY="your-key"
# Or add it to a .env file in the project directory
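
A minimal sketch of how a script could resolve the key: it checks the environment first, then falls back to a `.env` file. The helper name `load_api_key` and the simple `KEY=value` parsing are assumptions for illustration, not the repository's actual loader.

```python
import os
from pathlib import Path


def load_api_key(name="OPENAI_API_KEY", env_path=".env"):
    """Return the key from the environment, else from a KEY=value line in env_path."""
    value = os.environ.get(name)
    if value:
        return value

    path = Path(env_path)
    if not path.is_file():
        return None

    for raw in path.read_text().splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        key = key.strip()
        # Tolerate shell-style `export KEY=value` lines.
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key == name:
            return val.strip().strip('"').strip("'")
    return None
```

Calling `load_api_key()` before starting the loop makes a missing key fail early rather than partway through an iteration.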

3. Run the Autoresearch Loop

cd experiments
python loop.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --provider openai-chat \
  --iterations 3 \
  --max-instances 50 \
  --method grpo
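
For orientation, the flags in the command above could be parsed roughly as follows. This is a hypothetical sketch: the flag names come from the command, but the defaults, help strings, and the `build_parser` helper are assumptions and may not match `loop.py`.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the Quick Start command."""
    parser = argparse.ArgumentParser(description="Autoresearch loop (illustrative sketch)")
    parser.add_argument("--model", required=True, help="Model ID to run and fine-tune")
    parser.add_argument("--provider", default="openai-chat", help="Provider name (assumed default)")
    parser.add_argument("--iterations", type=int, default=3, help="Loop iterations (assumed default)")
    parser.add_argument("--max-instances", type=int, default=50, help="Instance cap (assumed default)")
    parser.add_argument("--method", default="grpo", help="Training method (assumed default)")
    return parser


# Parse the same arguments as the Quick Start command.
args = build_parser().parse_args([
    "--model", "Qwen/Qwen2.5-0.5B-Instruct",
    "--provider", "openai-chat",
    "--iterations", "3",
    "--max-instances", "50",
    "--method", "grpo",
])
```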

Experiment Files

| File | Purpose |
|------|---------|
| `tau_bridge.py` | Python bridge to tau Rust harness |
| `swe_bench_runner.py` | Run tau on SWE-Bench Pro instances |
| `evaluate.py` | Docker-based patch evaluation |
| `failure_classifier.py` | LLM-as-judge failure mode classifier |
| `train_v2.py` | GRPO training with TRL environment_factory |
| `loop.py` | Main autoresearch loop orchestrator |

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AUTORESEARCH LOOP                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. SAMPLE TASK    ←  SWE-Bench Pro public set             │
│        ↓                                                     │
│  2. RUN AGENT      →  Tau harness + model + tools           │
│        ↓                                                     │
│  3. EVALUATE       →  Docker: apply patch, run tests        │
│        ↓                                                     │
│  4. IF FAILED:     →  Extract last 20 turns                │
│        ↓              →  Classify failure mode               │
│        ↓              →  Store (trajectory, failure_mode)    │
│        ↓                                                     │
│  5. TRAIN          →  GRPO with custom reward functions     │
│        ↓              →  Target specific failure modes       │
│        ↓                                                     │
│  6. DEPLOY         →  Load fine-tuned model into tau       │
│        ↓                                                     │
│  7. REPEAT         →  Go to step 1                         │
│                                                              │
└─────────────────────────────────────────────────────────────┘
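
Steps 1 to 4 of the diagram can be sketched as one function that takes the agent, evaluator, and classifier as plain callables. All names here (`FailureStore`, `run_iteration`, the stub functions, the `tests_not_run` label) are hypothetical; the real harness, Docker evaluation, and LLM judge are replaced with stubs so the sketch runs on its own.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_TURNS = 20  # the diagram keeps the last 20 turns of a failed run


@dataclass
class FailureStore:
    """Holds (trajectory tail, failure_mode) pairs for later training."""
    records: List[Tuple[List[str], str]] = field(default_factory=list)

    def add(self, trajectory, failure_mode):
        self.records.append((trajectory, failure_mode))


def run_iteration(tasks, run_agent, evaluate, classify, store, max_instances, rng):
    """Sample tasks, run the agent, evaluate patches, and record failures."""
    sampled = rng.sample(tasks, min(max_instances, len(tasks)))
    passed = 0
    for task in sampled:
        trajectory, patch = run_agent(task)
        if evaluate(task, patch):
            passed += 1
            continue
        tail = trajectory[-MAX_TURNS:]  # step 4: keep only the last turns
        store.add(tail, classify(tail))
    return passed, len(sampled)


# Stubs standing in for the tau harness, Docker evaluation, and LLM judge.
def fake_agent(task):
    return [f"turn {i}" for i in range(30)], "diff --git a/x b/x"


def fake_evaluate(task, patch):
    return task % 2 == 0


def fake_classify(tail):
    return "tests_not_run"  # hypothetical failure-mode label


store = FailureStore()
passed, total = run_iteration(
    list(range(10)), fake_agent, fake_evaluate, fake_classify, store, 4, random.Random(0)
)
```

Steps 5 to 7 (train, deploy, repeat) are omitted here; in this sketch they would consume `store.records` and call `run_iteration` again with the updated model.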

Training Methods

GRPO (Recommended)

Group Relative Policy Optimization with custom reward functions:

patch_format_reward: Rewards properly formatted git diffs
tool_use_reward: Rewards correct tool usage patterns
reasoning_reward: Rewards structured problem analysis
env_state_reward: Rewards exploration and test execution
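
As an illustration of what one such reward might look like, the sketch below scores a completion by whether it contains a git diff header and a hunk header, then normalizes rewards within a group by mean and standard deviation, as described for GRPO in DeepSeekMath. The scoring weights, regular expressions, and the assumption that completions arrive as plain strings are illustrative; the actual `patch_format_reward` in `train_v2.py` may differ.

```python
import re
import statistics

# Assumed patterns for a unified git diff; not taken from the repository.
DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/\S+$", re.MULTILINE)
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def patch_format_reward(completions, **kwargs):
    """Return one score per completion: 0.5 for a diff header, 0.5 for a hunk header."""
    rewards = []
    for text in completions:
        score = 0.0
        if DIFF_HEADER.search(text):
            score += 0.5
        if HUNK_HEADER.search(text):
            score += 0.5
        rewards.append(score)
    return rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


group = [
    "diff --git a/foo.py b/foo.py\n--- a/foo.py\n+++ b/foo.py\n@@ -1,2 +1,2 @@\n-old\n+new\n",
    "diff --git a/foo.py b/foo.py\n",
    "I think the fix is to rename the variable.",
]
rewards = patch_format_reward(group)
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group, a completion is only pushed up if it scores better than its siblings for the same prompt; a group with identical rewards yields zero advantage for every member.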

Key Datasets

ScaleAI/SWE-bench_Pro: 731 public instances with problem statements, requirements, and interfaces
Inferact/codex_swebenchpro_traces: 610 real agent trajectories with a 53.9% pass rate
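
A minimal sketch of loading the public set with the `datasets` library installed in step 1. The split name `"test"` and the helper names are assumptions; check the dataset card for the actual splits and columns. The second helper just turns the stated pass rate into an approximate count of passing trajectories.

```python
def load_public_instances(max_instances=50, split="test"):
    """Load up to max_instances rows from the public SWE-Bench Pro set.

    The split name is an assumption; requires network access.
    """
    from datasets import load_dataset  # imported lazily so the helpers below work offline

    dataset = load_dataset("ScaleAI/SWE-bench_Pro", split=split)
    return dataset.select(range(min(max_instances, len(dataset))))


def approx_passing(total, pass_rate):
    """Approximate number of passing items given a total and a pass rate."""
    return round(total * pass_rate)


passing_traces = approx_passing(610, 0.539)  # 329
```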

Trackio Dashboard

Training runs are logged to Trackio for monitoring:

Project: tau-swe-pro
Dashboard: https://huggingface.co/spaces/sujkilla/ml-intern-tau-swe
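
A sketch of logging per-iteration reward summaries to the project above, assuming Trackio's wandb-style `init` / `log` / `finish` calls and a separate `pip install trackio`. The metric names and the helpers `summarize_rewards` and `log_run` are hypothetical, and the training script may log through TRL's own reporting instead.

```python
import statistics


def summarize_rewards(rewards):
    """Collapse a list of rewards into a few scalar metrics."""
    return {
        "reward/mean": statistics.fmean(rewards),
        "reward/max": max(rewards),
        "reward/min": min(rewards),
    }


def log_run(per_iteration_rewards, project="tau-swe-pro", space_id=None):
    """Log one summary per iteration; space_id is optional and assumed to name a Space."""
    import trackio  # imported lazily; requires `pip install trackio`

    init_kwargs = {"project": project}
    if space_id is not None:
        init_kwargs["space_id"] = space_id
    trackio.init(**init_kwargs)
    for iteration, rewards in enumerate(per_iteration_rewards):
        trackio.log({"iteration": iteration, **summarize_rewards(rewards)})
    trackio.finish()


summary = summarize_rewards([0.0, 0.5, 1.0])
```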

Key Papers

SWE-Bench Pro - Scale AI's benchmark
DeepSeekMath (arXiv:2402.03300) - GRPO algorithm
DAPO - Token-level normalization
Dr. GRPO - Removing length bias
