Tau Agent Self-Improvement on SWE-Bench Pro

Quick Start

1. Install Dependencies

pip install datasets transformers trl openai

2. Set API Keys

export OPENAI_API_KEY="your-key"
# Or add to .env in project directory
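
How the project scripts read the .env file is not stated here. As a minimal sketch, assuming a plain KEY=VALUE file, a standard-library loader could look like the following (load_env_file is a hypothetical helper, not part of this repo):

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ.

    Hypothetical helper: variables already set in the shell take
    precedence, so an explicit `export OPENAI_API_KEY=...` is kept.
    Returns the pairs that were actually added.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file() before creating the OpenAI client would make a key stored in .env visible to the process.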

3. Run the Autoresearch Loop

cd experiments
python loop.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --provider openai-chat \
  --iterations 3 \
  --max-instances 50 \
  --method grpo
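
The flags above map naturally onto an argparse parser. The sketch below only mirrors the flag names shown in the command; the defaults and help strings are assumptions, and the real loop.py may differ:

```python
import argparse


def build_parser():
    """Illustrative parser for the flags used in the Quick Start command."""
    parser = argparse.ArgumentParser(
        description="Autoresearch loop (illustrative CLI sketch)"
    )
    parser.add_argument("--model", required=True,
                        help="Hugging Face model ID to run and fine-tune")
    # Defaults below are assumptions, not values taken from loop.py.
    parser.add_argument("--provider", default="openai-chat",
                        help="Backend used to serve the model to the agent")
    parser.add_argument("--iterations", type=int, default=1,
                        help="Number of run/evaluate/train cycles")
    parser.add_argument("--max-instances", type=int, default=50,
                        help="SWE-Bench Pro instances sampled per iteration")
    parser.add_argument("--method", default="grpo",
                        help="Training method")
    return parser
```

Passing the Quick Start arguments to build_parser().parse_args(...) yields a namespace with model, provider, iterations, max_instances, and method attributes.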

Experiment Files

File                   Purpose
tau_bridge.py          Python bridge to tau Rust harness
swe_bench_runner.py    Run tau on SWE-Bench Pro instances
evaluate.py            Docker-based patch evaluation
failure_classifier.py  LLM-as-judge failure mode classifier
train_v2.py            GRPO training with TRL environment_factory
loop.py                Main autoresearch loop orchestrator

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    AUTORESEARCH LOOP                         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. SAMPLE TASK    ←  SWE-Bench Pro public set               │
│        ↓                                                     │
│  2. RUN AGENT      →  Tau harness + model + tools            │
│        ↓                                                     │
│  3. EVALUATE       →  Docker: apply patch, run tests         │
│        ↓                                                     │
│  4. IF FAILED:     →  Extract last 20 turns                  │
│        ↓           →  Classify failure mode                  │
│        ↓           →  Store (trajectory, failure_mode)       │
│        ↓                                                     │
│  5. TRAIN          →  GRPO with custom reward functions      │
│        ↓           →  Target specific failure modes          │
│        ↓                                                     │
│  6. DEPLOY         →  Load fine-tuned model into tau         │
│        ↓                                                     │
│  7. REPEAT         →  Go to step 1                           │
│                                                              │
└──────────────────────────────────────────────────────────────┘
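
The seven steps in the diagram can be sketched as a small driver. This is an illustration only: the stage callables (run_agent, evaluate, classify, train) are hypothetical stand-ins for the scripts listed under Experiment Files, and the instance_id field name is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    instance_id: str
    trajectory: list   # last turns of the failed run
    failure_mode: str


def run_iteration(model_id, tasks, run_agent, evaluate, classify, train,
                  max_turns=20):
    """One pass over steps 1-6; returns the (possibly new) model and failures."""
    failures = []
    for task in tasks:                                   # 1. sampled tasks
        trajectory, patch = run_agent(model_id, task)    # 2. run the agent
        if evaluate(task, patch):                        # 3. apply patch, run tests
            continue
        tail = trajectory[-max_turns:]                   # 4. keep the last turns
        failures.append(
            FailureRecord(task["instance_id"], tail, classify(tail))
        )
    if failures:
        model_id = train(model_id, failures)             # 5-6. train, then deploy
    return model_id, failures


def autoresearch(model_id, tasks, iterations, **stages):
    """Step 7: repeat the iteration with the most recent model."""
    for _ in range(iterations):
        model_id, _failures = run_iteration(model_id, tasks, **stages)
    return model_id
```

Passing the stages as callables keeps the driver independent of how the agent, the Docker evaluation, and the trainer are actually implemented.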

Failure Mode Categories (from Scale AI)

  • wrong_solution: Semantically incorrect patch
  • syntax_error: Compilation/syntax errors
  • tool_error: Misuse of editing tools
  • identified_incorrect_file: Wrong file edited
  • context_overflow_from_listing: Context window exhausted by file or directory listings
  • endless_file_reading: Stuck reading files without making progress
  • misunderstood_problem_statement: Failed comprehension
  • infinite_loop: Execution loop
  • other: Miscellaneous
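
One way to keep these labels consistent between the classifier and the trainer is a shared enum. The sketch below uses the category names listed above; parse_failure_mode is a hypothetical helper that maps a free-form judge reply onto a label, and is not taken from failure_classifier.py:

```python
from enum import Enum


class FailureMode(str, Enum):
    WRONG_SOLUTION = "wrong_solution"
    SYNTAX_ERROR = "syntax_error"
    TOOL_ERROR = "tool_error"
    IDENTIFIED_INCORRECT_FILE = "identified_incorrect_file"
    CONTEXT_OVERFLOW_FROM_LISTING = "context_overflow_from_listing"
    ENDLESS_FILE_READING = "endless_file_reading"
    MISUNDERSTOOD_PROBLEM_STATEMENT = "misunderstood_problem_statement"
    INFINITE_LOOP = "infinite_loop"
    OTHER = "other"


def parse_failure_mode(judge_output: str) -> FailureMode:
    """Map a judge reply onto one category, falling back to OTHER."""
    text = judge_output.strip().lower()
    # Case 1: the judge answered with exactly one label.
    for mode in FailureMode:
        if text == mode.value:
            return mode
    # Case 2: pick the specific label that appears earliest in the reply.
    hits = [
        (text.find(mode.value), mode)
        for mode in FailureMode
        if mode is not FailureMode.OTHER and mode.value in text
    ]
    if hits:
        return min(hits, key=lambda hit: hit[0])[1]
    return FailureMode.OTHER
```

Falling back to OTHER means an unexpected judge reply never raises, at the cost of hiding malformed outputs; logging those cases separately may be worthwhile.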

Training Methods

GRPO (Recommended)

Group Relative Policy Optimization with custom reward functions:

  • patch_format_reward: Rewards properly formatted git diffs
  • tool_use_reward: Rewards correct tool usage patterns
  • reasoning_reward: Rewards structured problem analysis
  • env_state_reward: Rewards exploration and test execution
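
TRL's GRPOTrainer accepts reward functions that take the batch of completions (plus extra keyword arguments) and return one float per completion. As a hedged sketch of what a patch-format reward might look like under that interface, the version below checks for git diff markers; the regexes and the 0.25/0.25/0.5 weights are assumptions, not the values used in train_v2.py:

```python
import re

# Markers of a unified git diff; each pattern matches on a line boundary.
_DIFF_HEADER = re.compile(r"^diff --git a/\S+ b/\S+$", re.MULTILINE)
_FILE_MARKERS = re.compile(r"^--- \S+.*\n\+\+\+ \S+.*$", re.MULTILINE)
_HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@", re.MULTILINE)


def _completion_text(completion):
    """Accept plain strings or conversational message lists."""
    if isinstance(completion, str):
        return completion
    return "\n".join(str(message.get("content", "")) for message in completion)


def patch_format_reward(completions, **kwargs):
    """Return a score in [0, 1] per completion based on diff structure."""
    rewards = []
    for completion in completions:
        text = _completion_text(completion)
        score = 0.0
        if _DIFF_HEADER.search(text):
            score += 0.25   # "diff --git a/... b/..." line present
        if _FILE_MARKERS.search(text):
            score += 0.25   # "---" line followed by "+++" line
        if _HUNK_HEADER.search(text):
            score += 0.5    # at least one "@@ -a,b +c,d @@" hunk
        rewards.append(score)
    return rewards
```

A format-only reward like this says nothing about whether the patch is correct, so it is best combined with the other rewards listed above.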

Key Datasets

  • ScaleAI/SWE-bench_Pro: 731 public instances with problem statements, requirements, and interfaces
  • Inferact/codex_swebenchpro_traces: 610 real agent trajectories with a 53.9% pass rate
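
A possible way to draw a reproducible subset of the public instances (for example, the 50 requested by --max-instances) is sketched below. The helper names are hypothetical, and because the split name is not stated here, the loader simply takes the first split returned by load_dataset:

```python
import random


def sample_instances(records, max_instances, seed=0):
    """Pick up to max_instances records without replacement, reproducibly."""
    records = list(records)
    k = min(max_instances, len(records))
    return random.Random(seed).sample(records, k)


def load_public_instances(max_instances=50, seed=0):
    """Download SWE-Bench Pro and sample a subset (requires network access)."""
    # Imported lazily so sample_instances works without the datasets package.
    from datasets import load_dataset

    dataset_dict = load_dataset("ScaleAI/SWE-bench_Pro")
    # Assumption: use whichever split comes first in the returned DatasetDict.
    first_split = next(iter(dataset_dict.values()))
    return sample_instances(first_split, max_instances, seed)
```

Fixing the seed keeps the sampled subset identical across iterations, which makes pass rates before and after training comparable.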

Trackio Dashboard

Training runs are logged to Trackio for monitoring.

Key Papers

  1. SWE-Bench Pro - Scale AI's benchmark
  2. DeepSeekMath - GRPO algorithm
  3. DAPO - Token-level normalization
  4. Dr. GRPO - Removing length bias
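
For readers new to these papers, the core of GRPO is that each sampled completion is scored relative to the other completions for the same prompt. The sketch below computes those group-relative advantages; with scale_by_std=False it drops the standard-deviation scaling, as Dr. GRPO proposes (the per-response length normalization that Dr. GRPO also removes lives in the loss and is not shown). The function name, the eps value, and the use of the population standard deviation are choices made for this illustration:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, scale_by_std=True, eps=1e-6):
    """Advantages for one prompt's group of sampled completions.

    rewards: one scalar reward per completion in the group.
    scale_by_std=True  -> (r_i - mean) / (std + eps), as in GRPO.
    scale_by_std=False -> r_i - mean, the Dr. GRPO variant.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not scale_by_std:
        return centered
    spread = pstdev(rewards)   # population std; implementations differ here
    return [c / (spread + eps) for c in centered]
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.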