DebugArena: Teaching LLMs to Fix Bugs Through Trial and Error

Meta PyTorch OpenEnv Hackathon × Scaler School of Technology 2026 | Theme 4: Self-Improving Agents

TL;DR

I built an RL environment called DebugArena where an LLM learns to fix buggy Python code through trial and error. I fine-tuned Qwen2.5-Coder-0.5B using GRPO and showed:

Average reward improved from 1.36 to 1.72 (+26%)
Solve rate improved from 35% to 45% (+10 percentage points)

The Problem

Ask any LLM to write a function that adds two numbers — it does it perfectly. Ask it to find the bug in this:

def add(a, b):
    return a - b

It might get it. But ask it to fix a subtle off-by-one in a binary search, or a missing edge case in a recursive function — it starts to struggle.

Why? A likely reason is that most code in LLM training data is finished, working code. Data of the form broken function → error → minimal fix is comparatively rare.

LLMs learned to write code. Nobody built a gym to teach them to fix it.

What I Built

DebugArena is an OpenEnv reinforcement learning environment. The loop:

The agent gets a broken Python function
It sees which tests fail and why
It proposes a fix
The environment runs the fix in a sandboxed executor
The agent receives a reward based on 4 independent signals
It tries again (max 5 attempts per bug)

No correct answers shown. The agent figures it out from test failures alone.
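The loop above can be sketched roughly as follows. This is a minimal illustration, not DebugArena's actual code: `Bug`, `run_tests`, `run_episode`, and `toy_fixer` are hypothetical names, and plain `exec()` only stands in for the sandboxed executor (it is not a sandbox).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration; the real environment may differ.
TestCase = Tuple[tuple, object]  # (arguments, expected result)


@dataclass
class Bug:
    func_name: str
    broken_source: str
    tests: List[TestCase]


def run_tests(source: str, func_name: str, tests: List[TestCase]) -> List[str]:
    """Run every test independently and return failure messages (empty = solved)."""
    failures = []
    for args, expected in tests:
        try:
            namespace: dict = {}
            exec(source, namespace)  # stand-in only: exec() is NOT a sandbox
            got = namespace[func_name](*args)
            if got != expected:
                failures.append(f"{func_name}{args}: expected {expected!r}, got {got!r}")
        except Exception as exc:
            failures.append(f"{func_name}{args}: raised {type(exc).__name__}: {exc}")
    return failures


def run_episode(bug: Bug,
                propose_fix: Callable[[str, List[str]], str],
                max_attempts: int = 5) -> Tuple[bool, int]:
    """Let the agent retry until all tests pass or attempts run out."""
    source = bug.broken_source
    failures = run_tests(source, bug.func_name, bug.tests)
    for attempt in range(1, max_attempts + 1):
        # The agent only sees the current code and the failure messages.
        source = propose_fix(source, failures)
        failures = run_tests(source, bug.func_name, bug.tests)
        if not failures:
            return True, attempt
    return False, max_attempts


# Toy "agent" that knows exactly one repair, used here in place of an LLM.
def toy_fixer(source: str, failures: List[str]) -> str:
    return source.replace("a - b", "a + b")


bug = Bug("add", "def add(a, b):\n    return a - b\n", [((2, 3), 5), ((0, 0), 0)])
print(run_episode(bug, toy_fixer))  # (True, 1)
```

In a real run, `propose_fix` would wrap a model call that receives the failure messages as part of its prompt.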

The Model

I used Qwen2.5-Coder-0.5B-Instruct from Hugging Face, a small but capable coding model. It is small enough to train fast on a T4 GPU and capable enough to actually learn from the reward signal.

I fine-tuned it using GRPO (Group Relative Policy Optimization) via TRL + Unsloth. For each bug, GRPO samples 4 candidate fixes, scores them all, and shifts the model toward the candidates that scored above the group average. No separate value model is needed.
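The "group relative" part can be illustrated with a small sketch. It assumes the common formulation in which each candidate's advantage is its reward minus the group mean, divided by the group standard deviation. The function name and the `eps` constant are my own, and the exact normalization differs between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each candidate relative to the other candidates for the same bug."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate fixes for one bug: one good fix, two partial, one penalized.
advantages = group_relative_advantages([3.2, 0.7, 0.7, -0.3])
print(advantages)
```

Candidates above the group mean get a positive advantage and are reinforced, and those below it are discouraged. Because the baseline is the group mean itself, no learned value model is required.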

Reward Design

I use 4 independent reward signals to make reward hacking harder:

r1 = tests_passing      # 0.0 to 1.0 — proportion of tests that pass
r2 = full_solve_bonus   # +2.0 if ALL tests pass
r3 = format_compliance  # +0.2 valid function / -0.3 malformed
r4 = anti_hacking       # -1.0 if forbidden imports detected
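One way to combine the four signals is a plain sum, sketched below. The function name and arguments are hypothetical, and the summing itself is an assumption: under a plain sum the best possible score is 3.2, while the results mention a maximum reward of 3.0, so the environment may combine or cap the signals differently.

```python
def total_reward(passed: int, total: int, well_formed: bool,
                 used_forbidden_import: bool) -> float:
    """Sum the four signals (assumed combination, not confirmed by the source)."""
    r1 = passed / total if total else 0.0                 # proportion of tests passing
    r2 = 2.0 if total and passed == total else 0.0        # full-solve bonus
    r3 = 0.2 if well_formed else -0.3                     # format compliance
    r4 = -1.0 if used_forbidden_import else 0.0           # anti-hacking penalty
    return r1 + r2 + r3 + r4


print(total_reward(passed=2, total=4, well_formed=True, used_forbidden_import=False))
```

The large gap between a partial pass and a full solve is what makes the full-solve bonus the dominant signal.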

Anti-hacking measures: forbidden imports blocked (os, sys, subprocess), restricted builtins, each test runs independently.
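A forbidden-import check like the one described can be sketched with Python's `ast` module. This is an illustration under my own assumptions, not the environment's actual checker: `FORBIDDEN_MODULES` and `uses_forbidden_import` are hypothetical names, and the sketch only catches static `import` statements.

```python
import ast

FORBIDDEN_MODULES = {"os", "sys", "subprocess"}


def uses_forbidden_import(source: str) -> bool:
    """Return True if the code statically imports a forbidden module."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # malformed code is left to the format-compliance signal
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import os.path as p" has alias.name == "os.path"
            if any(alias.name.split(".")[0] in FORBIDDEN_MODULES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like "from . import x"
            if node.module and node.module.split(".")[0] in FORBIDDEN_MODULES:
                return True
    return False


print(uses_forbidden_import("import subprocess\ndef f():\n    return 1"))  # True
print(uses_forbidden_import("import math\ndef f():\n    return 1"))        # False
```

A static check like this does not catch dynamic imports such as `__import__("os")`, which is why it pairs with restricted builtins at execution time.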

The Bug Curriculum

42 hand-crafted bugs across three difficulty levels (easy, medium, and hard):

Easy:

# Bug: wrong operator
def add(a, b):
    return a - b  # should be a + b

Medium:

# Bug: no empty string check
def first_char(s):
    return s[0]  # crashes on empty string

Hard:

# Bug: swap overwrites before saving
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(n - i - 1):
            if arr[j] > arr[j+1]:
                arr[j] = arr[j+1]   # overwrites! should be tuple swap
                arr[j+1] = arr[j]
    return arr
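To see why the hard example fails, here is a small sketch that isolates the swap. The helper names `broken_swap` and `tuple_swap` are mine, for illustration only.

```python
def broken_swap(arr, j):
    arr[j] = arr[j + 1]   # the original arr[j] is lost here
    arr[j + 1] = arr[j]   # copies the same value back, so nothing is swapped


def tuple_swap(arr, j):
    # The right-hand side is evaluated first, so both old values survive.
    arr[j], arr[j + 1] = arr[j + 1], arr[j]


a = [3, 1]
broken_swap(a, 0)
print(a)  # [1, 1]

b = [3, 1]
tuple_swap(b, 0)
print(b)  # [1, 3]
```

The broken version silently duplicates an element instead of raising an error, so only a test that checks the sorted output will expose it.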

Results

Metric	Before Training	After GRPO
Avg Reward	1.361	1.717
Solve Rate	35%	45%

The reward distribution graph (right) shows the trained model getting significantly more episodes at maximum reward (3.0), while the base model was clustered near 0.
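The headline percentages follow from the table with simple arithmetic. The snippet below only re-derives them and also shows the relative solve-rate change, which is a different figure from the 10-point absolute gain.

```python
before_reward, after_reward = 1.361, 1.717
before_solve, after_solve = 0.35, 0.45

# Relative change in average reward
reward_gain = (after_reward - before_reward) / before_reward
print(f"{reward_gain:.1%}")  # 26.2%

# Absolute change in solve rate, in percentage points
solve_points = (after_solve - before_solve) * 100
print(f"{solve_points:.0f} points")  # 10 points

# Relative change in solve rate
solve_relative = (after_solve - before_solve) / before_solve
print(f"{solve_relative:.1%}")  # 28.6%
```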

Training Code

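# Import unsloth before trl so its patches are applied first
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit so it fits on a T4
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/qwen2.5-coder-0.5b-instruct',
    max_seq_length=2048,
    load_in_4bit=True,
)

# Assumption: LoRA adapters are attached here, since 4-bit weights cannot be
# trained directly. The rank and target modules are illustrative values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)

# compute_reward (the reward function) and dataset (the bug prompts)
# are defined elsewhere in the project.
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=compute_reward,
    args=GRPOConfig(
        num_train_epochs=3,
        num_generations=4,      # candidate fixes sampled per bug
        learning_rate=1e-5,
    ),
    train_dataset=dataset,
)
trainer.train()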
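The snippet above passes `compute_reward` without showing it. As a rough idea of its shape, TRL calls a custom reward function with the sampled completions as a keyword argument and expects one float per completion. The sketch below implements only the format-compliance signal. `format_reward` and `completion_text` are hypothetical names, not the project's real code.

```python
import ast


def completion_text(completion):
    """Handle both plain-string and chat-style (list of messages) completions."""
    if isinstance(completion, str):
        return completion
    return completion[0]["content"]


def format_reward(completions, **kwargs):
    """Return +0.2 for code that parses and defines a function, else -0.3."""
    rewards = []
    for completion in completions:
        text = completion_text(completion)
        try:
            tree = ast.parse(text)
            has_function = any(isinstance(node, ast.FunctionDef) for node in tree.body)
        except SyntaxError:
            has_function = False
        rewards.append(0.2 if has_function else -0.3)
    return rewards


print(format_reward(["def add(a, b):\n    return a + b", "not code ("]))  # [0.2, -0.3]
```

A real reward function would likely also strip markdown code fences from the model output before parsing, then run the tests to compute the other signals.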

What's Next

Auto-generated bugs using an LLM — infinite curriculum
Curriculum learning — easy first, hard after model improves
Multi-language — JavaScript and Java
Larger models — 7B+ parameter experiments

BharathVikas/debugarena