DebugArena: Teaching LLMs to Fix Bugs Through Trial and Error

Meta PyTorch OpenEnv Hackathon × Scaler School of Technology 2026 | Theme 4: Self-Improving Agents


TL;DR

I built an RL environment called DebugArena where an LLM learns to fix buggy Python code through trial and error. I fine-tuned Qwen2.5-Coder-0.5B using GRPO and showed:

  • Reward improved from 1.36 → 1.72 (+26%)
  • Solve rate improved from 35% → 45% (+10 percentage points)

[Figure: Before vs After]


The Problem

Ask any LLM to write a function that adds two numbers and it does it perfectly. Ask it to find the bug in this:

def add(a, b):
    return a - b

It might get it. But ask it to fix a subtle off-by-one in a binary search, or a missing edge case in a recursive function, and it starts to struggle.

Why? Because most LLM training data is correct code. There is almost no data of the form: broken function → error → minimal fix.

LLMs learned to write code. Nobody built a gym to teach them to fix it.


What I Built

DebugArena is an OpenEnv reinforcement learning environment. The loop:

  1. Agent gets a broken Python function
  2. Sees which tests fail and why
  3. Proposes a fix
  4. Environment runs it in a sandboxed executor
  5. Gets reward based on 4 independent signals
  6. Tries again (max 5 attempts per bug)

No correct answers shown. The agent figures it out from test failures alone.
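
The loop above can be sketched in a few lines. This is only an illustration, not the actual DebugArena or OpenEnv API: `run_tests`, `episode`, and the `propose_fix` callback are hypothetical names, the test format is assumed, and a plain `exec` stands in for the sandboxed executor.

```python
from typing import Callable, List, Tuple

# Hypothetical test format: (arguments, expected result) pairs.
Tests = List[Tuple[tuple, object]]

def run_tests(source: str, func_name: str, tests: Tests) -> List[str]:
    """Run each test independently; return one message per failing test."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # stand-in for the sandboxed executor
    except Exception as exc:
        return [f"could not load code: {exc!r}"] * len(tests)
    func = namespace.get(func_name)
    if not callable(func):
        return [f"{func_name} is not defined"] * len(tests)
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
            if got != expected:
                failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {exc!r}")
    return failures

def episode(buggy: str, func_name: str, tests: Tests,
            propose_fix: Callable[[str, List[str]], str],
            max_attempts: int = 5) -> int:
    """Return the attempt number that solved the bug, or 0 if it stayed unsolved."""
    source = buggy
    failures = run_tests(source, func_name, tests)
    for attempt in range(1, max_attempts + 1):
        source = propose_fix(source, failures)  # the agent sees only code + failures
        failures = run_tests(source, func_name, tests)
        if not failures:
            return attempt
    return 0

# Toy "agent" that swaps the operator, standing in for the LLM.
toy_agent = lambda src, fails: src.replace("a - b", "a + b")
solved_on = episode("def add(a, b):\n    return a - b\n", "add",
                    [((1, 2), 3), ((0, 0), 0)], toy_agent)
```

The agent callback never receives the expected values directly, only the current source and the failure messages, which mirrors the "no correct answers shown" constraint.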


The Model

I used Qwen2.5-Coder-0.5B-Instruct from HuggingFace, a small but capable coding model. Small enough to train fast on a T4 GPU, capable enough to actually learn from the reward signal.

Fine-tuned using GRPO (Group Relative Policy Optimization) via TRL + Unsloth. GRPO samples 4 candidate fixes per bug, scores them all, and shifts the model toward higher-reward outputs. No value model needed.
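
To make "shifts the model toward higher-reward outputs" concrete: GRPO scores each sample relative to the other samples for the same prompt instead of against a learned value estimate. Below is a minimal sketch of that normalization, assuming the common mean and standard-deviation form. The function name, the epsilon, and the example rewards are illustrative, and TRL's internals may differ in detail.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of candidate fixes for the same bug."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every candidate gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four candidate fixes for one bug (made-up rewards): one solve, three weaker attempts.
advantages = group_relative_advantages([3.0, 0.7, 0.2, -0.3])
```

Candidates above the group mean get a positive advantage and are reinforced; those below are pushed down. If all four candidates score the same, every advantage is zero and that bug contributes no learning signal for the step.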


Reward Design

4 independent reward signals to prevent reward hacking:

r1 = tests_passing      # 0.0 to 1.0: proportion of tests that pass
r2 = full_solve_bonus   # +2.0 if ALL tests pass
r3 = format_compliance  # +0.2 valid function / -0.3 malformed
r4 = anti_hacking       # -1.0 if forbidden imports detected
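
How the four signals are combined is not spelled out here. One plausible reading is a plain sum, which the sketch below assumes; the real environment may weight or clip them differently. The function and argument names are hypothetical.

```python
def total_reward(passed: int, total: int, well_formed: bool,
                 used_forbidden_import: bool) -> float:
    """Combine the four signals listed above (plain summation is an assumption)."""
    r1 = passed / total if total else 0.0            # proportion of tests passing
    r2 = 2.0 if total and passed == total else 0.0   # full-solve bonus
    r3 = 0.2 if well_formed else -0.3                # format compliance
    r4 = -1.0 if used_forbidden_import else 0.0      # anti-hacking penalty
    return r1 + r2 + r3 + r4

# A well-formed fix that passes half the tests, versus one that also imports os.
partial = total_reward(passed=2, total=4, well_formed=True, used_forbidden_import=False)
hacked = total_reward(passed=2, total=4, well_formed=True, used_forbidden_import=True)
```

Keeping the signals separate means a candidate cannot make up for a forbidden import by passing more tests: the penalty is applied regardless of test results.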

Anti-hacking measures: forbidden imports blocked (os, sys, subprocess), restricted builtins, each test runs independently.
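
A rough sketch of how two of these checks could look, using only the standard library. The three module names come from the list above; the builtin allow-list and both function names are assumptions, not the environment's actual configuration.

```python
import ast

FORBIDDEN_MODULES = {"os", "sys", "subprocess"}  # the modules named above

def uses_forbidden_import(source: str) -> bool:
    """Return True if the code imports one of the forbidden modules."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False  # malformed code is left to the format-compliance signal
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # compare the top-level package, so "os.path" also counts as "os"
        if any(name.split(".")[0] in FORBIDDEN_MODULES for name in names):
            return True
    return False

# Assumed allow-list; the real set of permitted builtins is not given here.
SAFE_BUILTINS = {"len": len, "range": range, "min": min, "max": max, "sum": sum,
                 "abs": abs, "sorted": sorted, "enumerate": enumerate, "int": int,
                 "str": str, "list": list, "isinstance": isinstance}

def load_restricted(source: str) -> dict:
    """Exec candidate code with only the allow-listed builtins visible."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)
    return namespace
```

With this setup a candidate that calls `open` or `__import__` fails with a `NameError` at call time, because those names are simply absent from its builtins.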


The Bug Curriculum

42 hand-crafted bugs across easy, medium, and hard:

Easy:

# Bug: wrong operator
def add(a, b):
    return a - b  # should be a + b

Medium:

# Bug: no empty string check
def first_char(s):
    return s[0]  # crashes on empty string

Hard:

# Bug: swap overwrites before saving
def bubble_sort(arr):
    for i in range(len(arr)):
        for j in range(len(arr) - 1 - i):
            if arr[j] > arr[j+1]:
                arr[j] = arr[j+1]   # overwrites! should be tuple swap
                arr[j+1] = arr[j]
    return arr
Results

Before vs After GRPO Fine-tuning

              Before Training   After GRPO
Avg Reward    1.361             1.717
Solve Rate    35%               45%

The reward distribution graph (right) shows the trained model getting significantly more episodes at maximum reward (3.0), while the base model was clustered near 0.
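
The two headline numbers use different units, which is easy to misread: the reward gain is a relative change, while the solve-rate gain is quoted in percentage points. A short check of the arithmetic (the helper name is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before

reward_gain = relative_change(1.361, 1.717)     # about 0.26, the +26% quoted above
solve_gain_points = 45 - 35                     # 10 percentage points
solve_gain_relative = relative_change(35, 45)   # the same gain as a relative change
```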


Training Code

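# Import unsloth before trl so its patches are applied first
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Load the base model in 4-bit along with its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/qwen2.5-coder-0.5b-instruct',
    max_seq_length=2048,
    load_in_4bit=True,
)

# compute_reward (the reward described above) and dataset (the bug prompts)
# are assumed to be defined earlier; they are not shown in this snippet
trainer = GRPOTrainer(
    model=model,
    reward_funcs=compute_reward,
    args=GRPOConfig(
        num_train_epochs=3,
        num_generations=4,   # candidate fixes sampled per bug
        learning_rate=1e-5,
    ),
    train_dataset=dataset,
)
trainer.train()


What's Next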

  1. Auto-generated bugs using an LLM: infinite curriculum
  2. Curriculum learning: easy first, hard after model improves
  3. Multi-language: JavaScript and Java
  4. Larger models: 7B+ parameter experiments

Links

  • GitHub: github.com/BharathVikas-T/debugarena
  • HuggingFace Space: huggingface.co/spaces/BharathVikas/debugarena
  • Training Notebook: kaggle.com/code/bharathvikas/debugarena

Built solo by Bharath Vikas Tadepalli | Meta PyTorch OpenEnv Hackathon × Scaler SST 2026
