
# Training with GRPO on API Debug Environment

Trains a small LLM using GRPO (Group Relative Policy Optimization) on the live API Debug Environment with curriculum learning.

## What is GRPO?

For each prompt, GRPO:

  1. Generates multiple completions (debug attempts)
  2. Scores each with the environment's grader (reward signal)
  3. Updates the model to prefer higher-scoring responses

Over thousands of episodes, the LLM learns to debug API requests purely from reward signals -- no labelled data needed.
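The "prefer higher-scoring responses" step works on *group-relative* advantages: each completion's reward is compared against the other completions for the same prompt. A minimal sketch of that scoring (illustrative only, not the `trl` implementation):

```python
# Minimal sketch of GRPO's group-relative scoring. For one prompt we
# generate a group of completions, grade each with the environment,
# and normalize each reward against the mean/std of its group.

def group_relative_advantages(rewards):
    """Map raw rewards to advantages relative to their group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        return [0.0 for _ in rewards]  # all completions tied: no signal
    return [(r - mean) / std for r in rewards]

# Four debug attempts for the same prompt, graded by the environment:
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Completions above the group mean get positive advantages and are reinforced; those below get negative advantages and are discouraged. Because the baseline is the group itself, no separate value model is needed.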

## Curriculum Learning

The training auto-promotes through difficulty levels:

| Level | Task | Threshold | Max Turns | Skill |
|-------|------|-----------|-----------|-------|
| 1 | `easy` | 0.7 avg reward | 3 | Identify single error type + fields |
| 2 | `classify` | 0.6 avg reward | 4 | Identify ALL error types + fields |
| 3 | `medium` | 0.6 avg reward | 5 | Fix the broken request body |
| 4 | `headers` | 0.5 avg reward | 4 | Fix header-level errors |
| 5 | `response` | 0.5 avg reward | 4 | Validate API response issues |
| 6 | `hard` | -- | 7 | Fix mixed errors + explain reasoning |

Promotion happens when the rolling average reward (window=10) exceeds the threshold for the current level.
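The promotion rule can be sketched as follows. This is a hypothetical illustration of the logic described above; the names `CURRICULUM` and `maybe_promote` and the reset-on-promotion behavior are assumptions, not the actual API of `training/train.py`:

```python
from collections import deque

# (task, promotion threshold) per level; the final level has no threshold.
CURRICULUM = [
    ("easy", 0.7), ("classify", 0.6), ("medium", 0.6),
    ("headers", 0.5), ("response", 0.5), ("hard", None),
]

WINDOW = 10
recent_rewards = deque(maxlen=WINDOW)  # rolling window of episode rewards

def maybe_promote(level, reward):
    """Return the (possibly advanced) curriculum level after one episode."""
    recent_rewards.append(reward)
    _task, threshold = CURRICULUM[level]
    if threshold is None or len(recent_rewards) < WINDOW:
        return level  # final level, or not enough episodes yet
    if sum(recent_rewards) / len(recent_rewards) > threshold:
        recent_rewards.clear()  # start fresh on the harder task
        return level + 1
    return level
```

Clearing the window on promotion avoids carrying easy-task rewards into the average for the harder task.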

## Architecture

```
Dataset prompt ("Debug this broken API request.")
     |
GRPOTrainer calls rollout_func()
     |
rollout_func() connects to live HF Space via WebSocket
     |
env.reset(task=current_task) -> broken API request
     |
LLM generates JSON response -> env.step(action) -> reward
     |  (repeat up to max_turns)
Returns: prompt_ids, completion_ids, logprobs, env_reward
     |
reward_from_env() extracts env_reward
     |
GRPO updates model weights
     |
maybe_promote() checks if agent should advance to next task
```
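The multi-turn middle of the diagram can be sketched as a plain episode loop. This is a hedged sketch only: the `generate()` callable and the exact `reset()`/`step()` return shapes are assumptions; the real `rollout_func()` in `training/train.py` talks to the live HF Space over WebSocket and also returns token IDs and logprobs for the trainer.

```python
# Illustrative episode loop: reset the environment, let the LLM act for
# up to max_turns steps, and accumulate the environment's reward signal.

def rollout(env, generate, prompt, max_turns):
    """Run one multi-turn debug episode and return the total reward."""
    obs = env.reset()                     # broken API request to debug
    total_reward = 0.0
    for _ in range(max_turns):
        action = generate(prompt, obs)    # LLM emits a JSON debug action
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:                          # grader says the episode is over
            break
    return total_reward
```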

## Run on Google Colab (free T4 GPU)

```python
# Cell 1 -- Install (quote the version spec so the shell doesn't treat > as a redirect)
!pip install "trl>=0.26.0" transformers torch datasets openenv-core openai

# Cell 2 -- Clone repo
!git clone https://github.com/Avi-chauhan/api-debug-env.git
%cd api-debug-env

# Cell 3 -- Train
!python training/train.py
```

## Requirements