avichauhan committed
Commit 5dc848a · verified · 1 parent: 92af457

Upload folder using huggingface_hub
README.md CHANGED
@@ -207,6 +207,22 @@ The training auto-promotes through 6 difficulty levels based on rolling average
 
 The environment also supports `task="auto"` which lets the environment itself manage curriculum progression based on session history.
 
+### Training Results
+
+Trained on Google Colab (free T4 GPU) with 64 episodes on the easy task:
+
+| Metric | Value |
+|--------|-------|
+| Runtime | 7m 43s (8 steps) |
+| Mean reward (easy) | 0.172 |
+| Mean completion length | 62 tokens |
+| Loss | -0.003 (converging) |
+| GPU | Tesla T4, bf16 |
+
+The trained model is available at: [avichauhan/api-debug-grpo-qwen-0.5b](https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b)
+
+A Colab notebook is provided at `training/train_colab.ipynb` for one-click training.
+
 ## Setup
 
 ### Prerequisites
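The README context above mentions that `task="auto"` promotes through difficulty levels based on a rolling average of session history. A minimal sketch of that promotion scheme (class name, window size, and 0.7 threshold are all illustrative assumptions, not the environment's actual implementation):

```python
from collections import deque

# Hypothetical level names; the repo's actual curriculum labels may differ.
LEVELS = ["easy", "medium", "hard", "expert", "adversarial", "nightmare"]

class AutoCurriculum:
    """Illustrative rolling-average promotion, not the repo's real code."""

    def __init__(self, window=20, threshold=0.7):
        self.rewards = deque(maxlen=window)  # rolling session history
        self.level = 0
        self.threshold = threshold

    def record(self, reward: float) -> str:
        """Record one episode reward; promote when the rolling mean clears the bar."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.level < len(LEVELS) - 1:
                self.level += 1
            self.rewards.clear()  # restart the window at the new difficulty
        return LEVELS[self.level]

cur = AutoCurriculum(window=5, threshold=0.7)
for r in [0.8, 0.9, 0.7, 0.8, 0.9]:  # strong rolling average over the window
    level = cur.record(r)
print(level)  # "medium" -- promoted after the window's mean cleared 0.7
```

The key design point such a scheme captures: promotion looks only at a bounded recent window, so one early lucky episode cannot trigger advancement on its own.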
TECHNICAL_DEEP_DIVE.md CHANGED
@@ -495,6 +495,13 @@ All five advancement items from the original roadmap have been implemented:
 **How it works**: The model generates JSON debugging attempts, the environment grades them via its deterministic graders, and GRPO updates the policy to prefer higher-scoring responses. The rollout function connects to the live HF Space via WebSocket, runs multi-turn episodes, and returns prompt_ids, completion_ids, logprobs, and env_reward.
 **Key config**: `max_completion_length=128`, `gradient_accumulation_steps=16`, `vllm_gpu_memory_utilization=0.3`. Runs on free Colab T4 GPU.
 
+**Training results** (64 episodes, easy task, Colab T4):
+- Runtime: 7m 43s (8 training steps)
+- Mean reward: 0.172 (easy task, rolling average across all steps)
+- Mean completion length: 62 tokens
+- Loss converged from 0.0 to -0.003, with gradient norms showing active learning
+- Trained model: [avichauhan/api-debug-grpo-qwen-0.5b](https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b)
+
 ### 2. Expanded API Specs and Domains (IMPLEMENTED)
 **What**: Expanded from 30 specs / 6 domains to 45 specs / 9 domains.
 **New domains**: Analytics/Monitoring (dashboards, metrics, alerts), DevOps/Infrastructure (deployments, DNS, load balancers), AI/ML APIs (inference, fine-tuning, embeddings).
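The deep dive says GRPO "updates the policy to prefer higher-scoring responses". The mechanism behind that phrase is group-relative advantage: each completion's reward is normalized against its own sampled group. A hedged sketch of that computation (illustrative only, not TRL's internals):

```python
# Group-relative advantage as used by GRPO-style trainers: a completion's
# reward is compared to its sibling completions for the same prompt, so
# "better than siblings" yields a positive advantage even when all raw
# rewards are small. This is a sketch, not TRL's actual implementation.
def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt; 0.172 matches the run's mean easy reward.
adv = group_advantages([0.0, 0.172, 0.5, 0.172])
print([round(a, 2) for a in adv])  # below-mean completions get negative advantage
```

This is why a low absolute mean reward (0.172 on the easy task) can still drive learning: the gradient signal comes from within-group ranking, not the raw magnitude.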
server/environment.py CHANGED
@@ -259,7 +259,7 @@ class APIDebugEnvironment(Environment):
         if self.episode_done:
             return self._make_observation(
                 feedback="Episode already ended.",
-                reward=0.0,
+                reward=self.best_reward if self.best_reward > 0 else 0.001,
                 done=True,
             )
 
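The change above replays the episode's best reward (floored at 0.001) when `step()` is called after termination, instead of returning a flat 0.0. A standalone sketch of the guard's behavior, using simplified stand-ins for the repo's actual `APIDebugEnvironment` and observation classes:

```python
from dataclasses import dataclass

@dataclass
class Observation:  # simplified stand-in for the repo's observation type
    feedback: str
    reward: float
    done: bool

class ToyEnv:
    """Simplified stand-in mirroring the guard added in environment.py."""

    def __init__(self):
        self.episode_done = False
        self.best_reward = 0.0

    def step(self, graded_reward: float) -> Observation:
        if self.episode_done:
            # Replay the best reward (never 0.0) so a late step after the
            # episode ends still carries the episode's score to the trainer.
            return Observation(
                feedback="Episode already ended.",
                reward=self.best_reward if self.best_reward > 0 else 0.001,
                done=True,
            )
        self.best_reward = max(self.best_reward, graded_reward)
        self.episode_done = True
        return Observation(feedback="graded", reward=graded_reward, done=True)

env = ToyEnv()
env.step(0.8)        # episode ends with reward 0.8
obs = env.step(0.0)  # post-done step replays the best reward
print(obs.reward)  # 0.8
```

The 0.001 floor matters when the episode ended with no positive reward at all: a strictly positive value keeps post-done observations distinguishable from a genuinely ungraded zero.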
tests/test_environment.py CHANGED
@@ -159,11 +159,11 @@ class TestEasyGrader:
         assert obs is not None
         assert obs.done is True
 
-    def test_step_after_done_returns_zero_reward(self):
+    def test_step_after_done_returns_best_reward(self):
         env = make_env("easy", seed=42)
         env.step(perfect_easy_action(env))  # ends episode
         obs = env.step(APIDebugAction(error_type="anything"))
-        assert obs.reward == 0.0
+        assert obs.reward >= 0.001  # never 0.0 -- open interval (0, 1)
         assert obs.done is True
 
     def test_reward_always_non_negative(self):
training/requirements.txt CHANGED
@@ -1,4 +1,4 @@
-trl>=0.26.0
+trl[vllm]>=0.26.0
 transformers
 torch
 datasets
training/train.py CHANGED
@@ -141,7 +141,9 @@ def parse_llm_response(text: str) -> dict:
     return {}
 
 
-def build_action(data: dict) -> APIDebugAction:
+def build_action(data) -> APIDebugAction:
+    if not isinstance(data, dict):
+        return APIDebugAction()
     fixed_req = data.get("fixed_request")
     if isinstance(fixed_req, dict):
         fixed_req = json.dumps(fixed_req)
@@ -281,7 +283,6 @@ grpo_args = GRPOConfig(
     report_to="none",
     bf16=supports_bf16,
     fp16=has_gpu and not supports_bf16,
-    no_cuda=not has_gpu,
     gradient_checkpointing=True,
     vllm_gpu_memory_utilization=0.3,
     dataloader_pin_memory=False,
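The `build_action` hardening above guards against `parse_llm_response` returning non-dict garbage for malformed LLM output. A self-contained sketch of the pattern, with a simplified stand-in for the repo's `APIDebugAction`:

```python
import json
from dataclasses import dataclass

@dataclass
class Action:  # simplified stand-in for the repo's APIDebugAction
    error_type: str = ""
    fixed_request: str = ""

def build_action(data) -> Action:
    # Mirrors the fix above: malformed LLM output can parse to a list,
    # string, or None, so fall back to an empty action instead of
    # raising AttributeError on data.get(...).
    if not isinstance(data, dict):
        return Action()
    fixed_req = data.get("fixed_request") or ""
    if isinstance(fixed_req, dict):
        fixed_req = json.dumps(fixed_req)  # normalize dict payloads to JSON strings
    return Action(error_type=data.get("error_type", ""), fixed_request=fixed_req)

print(build_action(None) == Action())  # True: garbage input yields a safe default
print(build_action({"fixed_request": {"a": 1}}).fixed_request)  # {"a": 1}
```

An empty default action simply earns a low reward from the graders, which is the right failure mode during RL rollouts: a bad generation should score poorly, not crash the training loop.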
training/train_colab.ipynb CHANGED
@@ -17,10 +17,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Cell 1: Install dependencies\n",
-    "!pip install -q trl>=0.26.0 transformers torch datasets openenv-core openai"
-   ]
+   "source": "# Cell 1: Install dependencies (vllm required for fast generation)\n!pip install -q \"trl[vllm]>=0.26.0\" transformers torch datasets openenv-core openai"
   },
   {
    "cell_type": "code",
@@ -88,19 +85,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Cell 6: Upload trained model to HuggingFace\n",
-    "from huggingface_hub import HfApi\n",
-    "\n",
-    "api = HfApi()\n",
-    "api.upload_folder(\n",
-    "    folder_path='./outputs/api-debug-grpo',\n",
-    "    repo_id='avichauhan/api-debug-grpo-qwen-0.5b',\n",
-    "    repo_type='model',\n",
-    "    create_pr=False,\n",
-    ")\n",
-    "print('Model uploaded to: https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b')"
-   ]
+   "source": "# Cell 6: Upload trained model to HuggingFace\nfrom google.colab import userdata\nfrom huggingface_hub import HfApi\n\ntoken = userdata.get('HF_TOKEN')\napi = HfApi(token=token)\n\n# Create repo first (in case it doesn't exist)\napi.create_repo('avichauhan/api-debug-grpo-qwen-0.5b', exist_ok=True)\n\napi.upload_folder(\n    folder_path='./outputs/api-debug-grpo',\n    repo_id='avichauhan/api-debug-grpo-qwen-0.5b',\n    repo_type='model',\n    create_pr=False,\n)\nprint('Model uploaded to: https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b')"
   }
  ],
  "metadata": {
@@ -120,4 +105,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
\ No newline at end of file
+}