---
title: AgentDebuggerEnv
emoji: 🐞
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# AgentDebuggerEnv

Hackathon Links:

## 🚀 One-Line Pitch

An OpenEnv-backed reinforcement learning environment that trains LLMs to debug code systematically via Group Relative Policy Optimization (GRPO) and secure sandbox execution.

## 💡 Why This Exists

LLMs often hallucinate bug fixes through blind trial and error. Real debugging in production requires hypothesis-driven reasoning, isolation, and verification. We engineered an environment that forces models to observe, hypothesize, and execute code within a secure sandbox, penalizing blind guessing and explicitly rewarding structured problem-solving.

## 🧠 Key Technical Insights & Research Foundations

- **Hypothesis-Driven Debugging (NeurIPS 2025):** Recent research presented at NeurIPS demonstrates that forcing an LLM to formulate a concrete hypothesis before generating code significantly improves debugging accuracy. Inspired by this, our environment mandates a strict OBSERVATION → HYPOTHESIS → ACTION loop: every step the agent takes must be preceded by a formal hypothesis to receive a positive reward (see the compliance sketch after this list).
- **Literature-Backed Reward Criteria:** Our continuous, multi-objective reward-shaping architecture draws on recent findings in LLM reasoning and code generation.
- **Curriculum Learning for RL:** A flat bug distribution caused early policy collapse. We implemented a 3-tier curriculum, introducing complex logic bugs only after structural formatting and syntax localization had stabilized.
- **Hardened Sandboxed Grading:** Evaluating arbitrary LLM-generated fixes introduces severe RCE risks. We engineered a secure execution sandbox that restricts execution time, limits memory, and completely replaces unsafe `exec()` calls, ensuring deterministic and safe grading (sketched after the Architecture list below).
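To make the format mandate concrete, a compliance check of roughly this shape can gate the reward. This is a minimal sketch assuming plain-text section tags; the regex and the partial-credit scheme are illustrative assumptions, not the repo's exact grader:

```python
import re

# Require the three sections in order; DOTALL lets each section span multiple lines.
_SECTION_RE = re.compile(
    r"OBSERVATION:\s*(?P<obs>.+?)\s*"
    r"HYPOTHESIS:\s*(?P<hyp>.+?)\s*"
    r"ACTION:\s*(?P<act>.+)",
    re.DOTALL,
)

def format_compliance(completion: str) -> float:
    """Return 1.0 for a well-formed OBSERVATION -> HYPOTHESIS -> ACTION block.

    Give partial credit (0.5) when all tags are present but out of order,
    so early training still receives a gradient signal.
    """
    if _SECTION_RE.search(completion):
        return 1.0
    tags = ("OBSERVATION:", "HYPOTHESIS:", "ACTION:")
    return 0.5 if all(tag in completion for tag in tags) else 0.0
```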

πŸ—οΈ Architecture Overview

- **OpenEnv Core:** Manages state transitions, agent interactions, and environment telemetry.
- **Grader Subsystem:** Multi-layered evaluation combining a Hard Grader (secure execution, deterministic AST matching) with a Soft Grader (Llama-3.1-70B semantic evaluation); the Hard Grader's sandbox is sketched after this list.
- **Trainer:** Hugging Face TRL GRPO pipeline with dynamic batch scaling based on runtime VRAM detection.
- **Live Monitor:** A Gradio dashboard streaming stdout and Weights & Biases metrics directly from the active training container.
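The Hard Grader's sandbox can be approximated with a fresh subprocess plus OS resource limits. A minimal, POSIX-only sketch, assuming per-task limits of 5 CPU-seconds and 256 MB; the limits and function names are illustrative, not the actual `grader_hard.py` API:

```python
import resource
import subprocess
import sys

def _limit_resources() -> None:
    # Runs in the child just before exec: cap CPU time and address space.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                          # 5 CPU-seconds
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2, 256 * 1024**2))   # 256 MB

def run_sandboxed(code: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Execute untrusted code in a fresh interpreter instead of exec()."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode (no user site, clean env)
            capture_output=True,
            text=True,
            timeout=timeout_s,                   # wall-clock kill switch
            preexec_fn=_limit_resources,         # POSIX-only resource caps
        )
    except subprocess.TimeoutExpired:
        return False, "TIMEOUT"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Running the candidate in an isolated interpreter rather than in-process `exec()` means a malicious fix can at worst exhaust its own time and memory budget.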

## ⚡ What Makes This Impressive

- **Zero-to-One in 250 Steps:** Achieved a ~2.5x increase in total reward within just 250 steps, demonstrating extreme sample efficiency via GRPO.
- **Dynamic Hardware Scaling:** The training pipeline detects hardware capability at runtime (A100/H100 vs. T4) and automatically scales `batch_size`, `grad_accum`, and compute dtype (bfloat16/float16), eliminating OOM errors across deployment environments (see the sketch after this list).
- **Frictionless Deployment:** Bypassed heavy dependency constraints (PyTorch/TRL vs. Gradio pip conflicts) with a lazy-loading runtime that keeps Docker builds deterministic.
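The detection logic amounts to reading VRAM and bfloat16 support at startup. A minimal sketch; the thresholds and returned settings are illustrative, not the pipeline's exact values:

```python
import torch

def training_config() -> dict:
    """Choose batch size, gradient accumulation, and dtype from the detected GPU."""
    if not torch.cuda.is_available():
        return {"per_device_batch": 1, "grad_accum": 16, "dtype": torch.float32}

    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    bf16_ok = torch.cuda.is_bf16_supported()  # True on Ampere+ (A100/H100), False on T4

    if vram_gb >= 40:      # A100/H100 class
        batch, accum = 8, 2
    elif vram_gb >= 15:    # T4/V100 class
        batch, accum = 2, 8
    else:
        batch, accum = 1, 16

    return {
        "per_device_batch": batch,
        "grad_accum": accum,
        "dtype": torch.bfloat16 if bf16_ok else torch.float16,
    }
```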

## 🛠️ Tech Stack

- **Frameworks:** OpenEnv, FastAPI, Docker
- **RL Pipeline:** Hugging Face TRL (GRPO), PEFT (LoRA)
- **Models:** Qwen2.5-Coder-7B-Instruct (base), Llama-3.1-70B (evaluator)
- **Telemetry:** Weights & Biases
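Wiring these pieces together with TRL and PEFT looks roughly like the sketch below. The hyperparameters, the toy dataset, and the single reward function are illustrative stand-ins; the real pipeline adds the hard/soft grader rewards and loads the tiered JSONL bug data:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    """Toy stand-in for the compliance reward described above."""
    required = ("OBSERVATION", "HYPOTHESIS", "ACTION")
    return [1.0 if all(tag in c for tag in required) else 0.0 for c in completions]

# Single-example toy dataset; the real run loads the tiered bug datasets.
dataset = Dataset.from_list(
    [{"prompt": "Fix this bug:\ndef add(a, b):\n    return a - b"}] * 32
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    reward_funcs=[format_reward],
    args=GRPOConfig(
        output_dir="grpo-debugger",
        num_generations=4,               # GRPO group size per prompt
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        bf16=True,                       # set per the hardware-scaling logic above
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
)
trainer.train()
```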

## 📊 Results & Benchmarks

Our training run clearly demonstrates rapid policy adaptation. The model successfully learned the OBSERVATION/HYPOTHESIS/ACTION constraint almost instantly and navigated the tier-2 difficulty bump (step 150) with a textbook drop-and-recover curve.

**Training Results:** W&B Run | HF Blog

*(Note for hackathon judges: live Weights & Biases charts and the Gradio UI are embedded below as evidence of the training run.)*

[Chart: Total Reward] · [Chart: Format Compliance]

[Charts: additional training metrics] · [Screenshot: Gradio UI training monitor]

- **Format Compliance:** Reached the maximum of 1.0 within 50 steps.
- **Total Reward:** Rose from a baseline of ~0.4 to peaks of ~1.0 by step 250.
- **Baseline Solve Rate:** 100.0% validation pass rate on the tiered bug dataset.

## 🔥 Challenges & How They Were Solved

- **Reward Hacking:** Initial RL runs showed the model farming points by writing functionally valid code that sidestepped the actual bug. **Fix:** Recalibrated the Hard Grader to execute both the initial buggy code and the proposed fix, computing the delta so that points are awarded only for genuine regression fixes (see the sketch after this list).
- **Hugging Face Space Build Failures:** The Space hit "resolution-too-deep" pip timeouts due to conflicting requirements between Gradio and trl/accelerate. **Fix:** Stripped `requirements.txt` to the bare minimum for the UI and added a lazy-load script that installs training dependencies post-boot in a background thread (sketched under Quick Start below).
- **Flaky LLM-as-a-Judge:** Using LLMs to grade code functionality proved non-deterministic. **Fix:** Replaced LLM evaluation for execution success with the deterministic Python sandbox, reserving the LLM judge solely for the semantic quality of the hypothesis.
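The delta computation behind the reward-hacking fix can be sketched as follows, reusing the `run_sandboxed` helper from the sandbox sketch above. The test-harness shape (one runnable snippet per test) is an illustrative assumption:

```python
def passed_tests(code: str, tests: list[str]) -> set[int]:
    """Indices of tests that pass when run against `code` in the sandbox."""
    return {
        i for i, test in enumerate(tests)
        if run_sandboxed(code + "\n\n" + test)[0]  # run_sandboxed: see sandbox sketch
    }

def delta_reward(buggy_code: str, fixed_code: str, tests: list[str]) -> float:
    """Award credit only for tests that flip from failing to passing."""
    before = passed_tests(buggy_code, tests)
    after = passed_tests(fixed_code, tests)
    if before - after:                  # a previously passing test now fails
        return -1.0                     # regression: hard penalty
    newly_passing = after - before
    failing_before = len(tests) - len(before)
    return len(newly_passing) / max(1, failing_before)
```

Because credit is proportional to tests that *change* from failing to passing, code that merely keeps already-passing tests green earns nothing.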

## ▶️ Quick Start

We built the training pipeline to be universally runnable, with a specific focus on reproducible execution for judging.

### Run the Training Notebook

The easiest way to re-run the exact GRPO training pipeline is via our Jupyter notebook. It auto-detects hardware and sets configurations accordingly.

1. Open `training/AgentDebuggerEnv_GRPO_Training.ipynb` in Google Colab or Kaggle.
2. Select a GPU runtime (T4, A100, etc.).
3. Run all cells. Dependencies install automatically and results start streaming.
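For reference, the lazy-load workaround described under Challenges can be as small as the sketch below: the Space boots with only the UI requirements, then installs the heavy training stack in a background thread. The package list and function name are illustrative:

```python
import importlib.util
import subprocess
import sys
import threading

HEAVY_DEPS = ["trl", "peft", "accelerate", "bitsandbytes"]  # illustrative list

def _install_heavy_deps() -> None:
    # Runs after the Gradio UI is already serving, so the Space boots fast and
    # pip never has to co-resolve the UI and training dependency trees.
    missing = [pkg for pkg in HEAVY_DEPS if importlib.util.find_spec(pkg) is None]
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])

threading.Thread(target=_install_heavy_deps, daemon=True).start()
```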

## 📂 Code Structure

```
├── data/               # Tiered bug datasets (JSONL)
├── env/                # OpenEnv environment definitions
├── server/             # FastAPI backend & Grader implementations
│   ├── grader_hard.py  # Sandboxed deterministic code execution
│   └── grader_soft.py  # Semantic evaluation logic
├── training/           # GRPO pipeline & Runnable Notebook
└── app.py              # Gradio training monitor
```

## 🤝 If I Had More Time

- **Multi-File Contexts:** Expand the environment to handle complex multi-file repository debugging via Language Server Protocol (LSP) integration.
- **PPO vs. GRPO Benchmarking:** Quantify the compute and efficiency tradeoffs between PPO and GRPO on this specific task.
- **Adversarial Bug Generation:** Implement an adversarial LLM agent that continuously mutates and generates edge-case bugs, creating an infinite, self-sustaining curriculum.

## 👥 Team Endurance