DebugArena: Teaching LLMs to Fix Bugs Through Trial and Error
Meta PyTorch OpenEnv Hackathon × Scaler School of Technology 2026 | Theme 4: Self-Improving Agents
TL;DR
I built an RL environment called DebugArena where an LLM learns to fix buggy Python code through trial and error. I fine-tuned Qwen2.5-Coder-0.5B using GRPO and showed:
- Reward improved from 1.36 → 1.72 (+26%)
- Solve rate improved from 35% → 45% (+10 percentage points)
The Problem
Ask any LLM to write a function that adds two numbers, and it does it perfectly. Ask it to find the bug in this:
def add(a, b):
    return a - b
It might get it. But ask it to fix a subtle off-by-one in a binary search, or a missing edge case in a recursive function, and it starts to struggle.
Why? Because most LLM training data is correct code. There is almost no data of the form: broken function → error → minimal fix.
LLMs learned to write code. Nobody built a gym to teach them to fix it.
What I Built
DebugArena is an OpenEnv reinforcement learning environment. The loop:
- Agent gets a broken Python function
- Sees which tests fail and why
- Proposes a fix
- Environment runs the fix in a sandboxed executor
- Agent receives a reward based on 4 independent signals
- Agent tries again (max 5 attempts per bug)
No correct answers shown. The agent figures it out from test failures alone.
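The loop above could be sketched roughly as follows. This is a minimal illustration, not DebugArena's actual API: `ToyDebugEnv`, `run_episode`, and the `policy` callable are hypothetical names, tests are assumed to be plain `(args, expected)` pairs, and the bare `exec` used here is not a sandbox.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 5  # from the post: max 5 attempts per bug


class ToyDebugEnv:
    """Hypothetical stand-in for the environment; not the real DebugArena API."""

    def __init__(self, buggy_code: str, func_name: str, tests: List[Tuple[tuple, object]]):
        self.buggy_code = buggy_code
        self.func_name = func_name
        self.tests = tests

    def _failures(self, code: str) -> List[str]:
        """Run every test independently and describe each failure."""
        failures = []
        for args, expected in self.tests:
            try:
                namespace: dict = {}
                exec(code, namespace)  # illustration only: plain exec is NOT a sandbox
                got = namespace[self.func_name](*args)
                if got != expected:
                    failures.append(f"{self.func_name}{args}: expected {expected!r}, got {got!r}")
            except Exception as exc:
                failures.append(f"{self.func_name}{args}: raised {type(exc).__name__}")
        return failures

    def reset(self) -> dict:
        """Initial observation: the broken code plus its failing tests."""
        return {"code": self.buggy_code, "failures": self._failures(self.buggy_code)}

    def step(self, proposed_code: str) -> Tuple[dict, bool]:
        """Evaluate a proposed fix; the flag is True when no test fails."""
        failures = self._failures(proposed_code)
        return {"code": proposed_code, "failures": failures}, not failures


def run_episode(env: ToyDebugEnv, policy: Callable[[dict], str]) -> Tuple[bool, int]:
    """Let the policy propose fixes until all tests pass or attempts run out."""
    obs = env.reset()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        obs, solved = env.step(policy(obs))
        if solved:
            return True, attempt
    return False, MAX_ATTEMPTS


# Usage with a trivial hand-written "policy" that flips the operator
env = ToyDebugEnv("def add(a, b):\n    return a - b\n", "add", [((1, 2), 3), ((0, 0), 0)])
solved, attempts = run_episode(env, lambda obs: obs["code"].replace("a - b", "a + b"))
```

In the real setup the policy would be the LLM, prompted with the code and the failure messages from the observation.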
The Model
I used Qwen2.5-Coder-0.5B-Instruct from HuggingFace, a small but capable coding model. Small enough to train fast on a T4 GPU, capable enough to actually learn from the reward signal.
I fine-tuned it using GRPO (Group Relative Policy Optimization) via TRL + Unsloth. GRPO samples 4 candidate fixes per bug, scores them all, and shifts the model toward the outputs that scored above the group average. No value model needed.
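To make "group relative" concrete, here is a small sketch of how the rewards for one group of candidates could be turned into advantages by standardizing within the group. It assumes the common formulation (reward minus group mean, divided by group standard deviation); the function name, the epsilon, and the sample rewards are illustrative, and TRL's internal implementation may differ in details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Standardize rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # spread of rewards inside the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 4 candidate fixes of the same bug
rewards = [3.2, 0.7, 0.2, 0.7]
advantages = group_relative_advantages(rewards)
```

Candidates that beat the group mean get a positive advantage and the rest a negative one, which is why no separate value model is required as a baseline. If every candidate scores the same, all advantages are zero and that group contributes no learning signal.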
Reward Design
4 independent reward signals to prevent reward hacking:
r1 = tests_passing # 0.0 to 1.0, proportion of tests that pass
r2 = full_solve_bonus # +2.0 if ALL tests pass
r3 = format_compliance # +0.2 valid function / -0.3 malformed
r4 = anti_hacking # -1.0 if forbidden imports detected
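One way the four signals could be folded into a single scalar is a plain sum. The post does not spell out the aggregation, so the summation below is an assumption, and `total_reward` and its arguments are hypothetical names. Under a plain sum the best possible score would be 3.2, so the real aggregation may differ (the Results section mentions a maximum reward of 3.0).

```python
def total_reward(passed: int, total: int, well_formed: bool, used_forbidden_import: bool) -> float:
    """Sum the four signals listed above (the summation itself is an assumption)."""
    r1 = passed / total if total else 0.0            # proportion of tests passing
    r2 = 2.0 if total and passed == total else 0.0   # full-solve bonus
    r3 = 0.2 if well_formed else -0.3                # format compliance
    r4 = -1.0 if used_forbidden_import else 0.0      # anti-hacking penalty
    return r1 + r2 + r3 + r4
```

Keeping the signals separate means a fix that passes half the tests still earns partial credit, while a malformed or import-abusing answer is penalized regardless of test results.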
Anti-hacking measures: forbidden imports blocked (os, sys, subprocess), restricted builtins, each test runs independently.
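Forbidden imports can be detected statically with Python's `ast` module, and builtins can be narrowed by passing a custom `__builtins__` mapping to `exec`. The sketch below assumes that approach; the helper names and the allow-list are hypothetical (the post does not list the permitted builtins), and restricting builtins this way is not a complete security boundary on its own.

```python
import ast

FORBIDDEN_MODULES = {"os", "sys", "subprocess"}  # from the post

# Hypothetical allow-list; the real environment's list is not given in the post
SAFE_BUILTINS = {
    "len": len, "range": range, "min": min, "max": max, "sum": sum,
    "abs": abs, "sorted": sorted, "enumerate": enumerate, "int": int,
    "float": float, "str": str, "bool": bool, "list": list, "dict": dict,
}


def uses_forbidden_import(source: str) -> bool:
    """Return True if the code imports one of the forbidden modules."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # malformed code is a format problem, not an import problem
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in FORBIDDEN_MODULES for name in names):
            return True
    return False


def load_function(source: str, func_name: str):
    """Exec the candidate with a reduced builtins table and return the function."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)
    return namespace[func_name]
```

With this builtins table, a candidate that calls `open` fails with `NameError` at call time, and an `import` statement fails because `__import__` is not available.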
The Bug Curriculum
42 hand-crafted bugs across easy, medium, and hard:
Easy:
# Bug: wrong operator
def add(a, b):
    return a - b  # should be a + b
Medium:
# Bug: no empty string check
def first_char(s):
    return s[0]  # crashes on empty string
Hard:
# Bug: swap overwrites before saving
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j+1]:
                arr[j] = arr[j+1]  # overwrites! should be tuple swap
                arr[j+1] = arr[j]
    return arr
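Each curriculum entry needs at least the buggy source, its tests, and a difficulty label. The post does not show the storage format, so the dataclass below is a hypothetical layout, populated with the `first_char` example and assuming the intended behavior for an empty string is to return an empty string.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class BugTask:
    """Hypothetical record for one curriculum entry."""
    difficulty: str                  # "easy", "medium", or "hard"
    func_name: str
    buggy_code: str
    tests: List[Tuple[tuple, Any]]   # (args, expected) pairs


FIRST_CHAR = BugTask(
    difficulty="medium",
    func_name="first_char",
    buggy_code="def first_char(s):\n    return s[0]\n",
    # Assumption: an empty input should yield "" rather than raise
    tests=[(("abc",), "a"), (("",), "")],
)


def pass_fraction(task: BugTask, code: str) -> float:
    """Fraction of tests that pass; each test runs in a fresh namespace."""
    passed = 0
    for args, expected in task.tests:
        try:
            namespace: dict = {}
            exec(code, namespace)  # illustration only, not sandboxed
            if namespace[task.func_name](*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(task.tests)
```

The buggy version passes the ordinary case but crashes on the empty string, so it scores half; a fix such as `return s[0] if s else ""` passes both tests.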
Results
| Metric | Before Training | After GRPO |
|---|---|---|
| Avg Reward | 1.361 | 1.717 |
| Solve Rate | 35% | 45% |
The reward distribution graph (right) shows the trained model getting significantly more episodes at maximum reward (3.0), while the base model was clustered near 0.
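The headline percentages follow directly from the table. A quick check of the arithmetic, using only the numbers reported above:

```python
before_reward, after_reward = 1.361, 1.717
before_solve, after_solve = 0.35, 0.45

# Relative change in average reward
relative_reward_gain = (after_reward - before_reward) / before_reward
# Absolute change in solve rate, in percentage points
solve_rate_gain_pp = (after_solve - before_solve) * 100

print(f"reward: +{relative_reward_gain:.1%}")           # reward: +26.2%
print(f"solve rate: +{solve_rate_gain_pp:.0f} points")  # solve rate: +10 points
```

Note that the reward figure is a relative gain while the solve-rate figure is an absolute difference in percentage points.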
Training Code
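# Import unsloth first so its patches are applied before TRL loads
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit so it fits on a T4
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/qwen2.5-coder-0.5b-instruct',
    max_seq_length=2048,
    load_in_4bit=True,
)

# compute_reward (the reward function) and dataset (the bug prompts) are defined elsewhere
trainer = GRPOTrainer(
    model=model,
    reward_funcs=compute_reward,
    args=GRPOConfig(
        num_train_epochs=3,
        num_generations=4,  # candidate fixes sampled per bug
        learning_rate=1e-5,
    ),
    train_dataset=dataset,
)
trainer.train()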
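The `reward_funcs` argument accepts a plain Python callable that receives the prompts and completions and returns one float per completion. The sketch below shows that shape using only the format-compliance signal. It is not the `compute_reward` used in this project: `format_reward` and `is_valid_function` are hypothetical names, and it assumes completions arrive as plain strings containing only code.

```python
import ast
from typing import List


def is_valid_function(code: str) -> bool:
    """True if the text parses and defines at least one top-level function."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    return any(isinstance(node, ast.FunctionDef) for node in tree.body)


def format_reward(prompts: List[str], completions: List[str], **kwargs) -> List[float]:
    """One score per completion: +0.2 for a valid function, -0.3 otherwise."""
    return [0.2 if is_valid_function(c) else -0.3 for c in completions]
```

A full reward function would follow the same signature and add the test-based and anti-hacking signals on top.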
What's Next
- Auto-generated bugs using an LLM → infinite curriculum
- Curriculum learning: easy first, hard after model improves
- Multi-language: JavaScript and Java
- Larger models: 7B+ parameter experiments
Links
- GitHub: github.com/BharathVikas-T/debugarena
- HuggingFace Space: huggingface.co/spaces/BharathVikas/debugarena
- Training Notebook: kaggle.com/code/bharathvikas/debugarena
Built solo by Bharath Vikas Tadepalli | Meta PyTorch OpenEnv Hackathon × Scaler SST 2026
