avichauhan committed
Commit 5dc848a · verified · 1 parent: 92af457

Upload folder using huggingface_hub
README.md CHANGED
@@ -207,6 +207,22 @@ The training auto-promotes through 6 difficulty levels based on rolling average
 
 The environment also supports `task="auto"` which lets the environment itself manage curriculum progression based on session history.
 
+### Training Results
+
+Trained on Google Colab (free T4 GPU) with 64 episodes on the easy task:
+
+| Metric | Value |
+|--------|-------|
+| Runtime | 7m 43s (8 steps) |
+| Mean reward (easy) | 0.172 |
+| Mean completion length | 62 tokens |
+| Loss | -0.003 (converging) |
+| GPU | Tesla T4, bf16 |
+
+The trained model is available at: [avichauhan/api-debug-grpo-qwen-0.5b](https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b)
+
+A Colab notebook is provided at `training/train_colab.ipynb` for one-click training.
+
 ## Setup
 
 ### Prerequisites
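The README context above mentions that `task="auto"` promotes through difficulty levels based on a rolling average of session history. A minimal sketch of that promotion scheme (class name, window size, and 0.7 threshold are all illustrative assumptions, not the environment's actual implementation):

```python
from collections import deque

# Hypothetical level names; the repo's actual curriculum labels may differ.
LEVELS = ["easy", "medium", "hard", "expert", "adversarial", "nightmare"]

class AutoCurriculum:
    """Illustrative rolling-average promotion, not the repo's real code."""

    def __init__(self, window=20, threshold=0.7):
        self.rewards = deque(maxlen=window)  # rolling session history
        self.level = 0
        self.threshold = threshold

    def record(self, reward: float) -> str:
        """Record one episode reward; promote when the rolling mean clears the bar."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.level < len(LEVELS) - 1:
                self.level += 1
            self.rewards.clear()  # restart the window at the new difficulty
        return LEVELS[self.level]

cur = AutoCurriculum(window=5, threshold=0.7)
for r in [0.8, 0.9, 0.7, 0.8, 0.9]:  # strong rolling average over the window
    level = cur.record(r)
print(level)  # "medium" -- promoted after the window's mean cleared 0.7
```

The key design point such a scheme captures: promotion looks only at a bounded recent window, so one early lucky episode cannot trigger advancement on its own.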
TECHNICAL_DEEP_DIVE.md CHANGED
@@ -495,6 +495,13 @@ All five advancement items from the original roadmap have been implemented:
 **How it works**: The model generates JSON debugging attempts, the environment grades them via its deterministic graders, and GRPO updates the policy to prefer higher-scoring responses. The rollout function connects to the live HF Space via WebSocket, runs multi-turn episodes, and returns prompt_ids, completion_ids, logprobs, and env_reward.
 **Key config**: `max_completion_length=128`, `gradient_accumulation_steps=16`, `vllm_gpu_memory_utilization=0.3`. Runs on free Colab T4 GPU.
 
+**Training results** (64 episodes, easy task, Colab T4):
+- Runtime: 7m 43s (8 training steps)
+- Mean reward: 0.172 (easy task, rolling average across all steps)
+- Mean completion length: 62 tokens
+- Loss converged from 0.0 to -0.003, with gradient norms showing active learning
+- Trained model: [avichauhan/api-debug-grpo-qwen-0.5b](https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b)
+
 ### 2. Expanded API Specs and Domains (IMPLEMENTED)
 **What**: Expanded from 30 specs / 6 domains to 45 specs / 9 domains.
 **New domains**: Analytics/Monitoring (dashboards, metrics, alerts), DevOps/Infrastructure (deployments, DNS, load balancers), AI/ML APIs (inference, fine-tuning, embeddings).
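The deep dive says GRPO "updates the policy to prefer higher-scoring responses". The mechanism behind that phrase is group-relative advantage: each completion's reward is normalized against its own sampled group. A hedged sketch of that computation (illustrative only, not TRL's internals):

```python
# Group-relative advantage as used by GRPO-style trainers: a completion's
# reward is compared to its sibling completions for the same prompt, so
# "better than siblings" yields a positive advantage even when all raw
# rewards are small. This is a sketch, not TRL's actual implementation.
def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt; 0.172 matches the run's mean easy reward.
adv = group_advantages([0.0, 0.172, 0.5, 0.172])
print([round(a, 2) for a in adv])  # below-mean completions get negative advantage
```

This is why a low absolute mean reward (0.172 on the easy task) can still drive learning: the gradient signal comes from within-group ranking, not the raw magnitude.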
server/environment.py CHANGED
@@ -259,7 +259,7 @@ class APIDebugEnvironment(Environment):
         if self.episode_done:
             return self._make_observation(
                 feedback="Episode already ended.",
-                reward=0.0,
+                reward=self.best_reward if self.best_reward > 0 else 0.001,
                 done=True,
             )
 
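The change above replays the episode's best reward (floored at 0.001) when `step()` is called after termination, instead of returning a flat 0.0. A standalone sketch of the guard's behavior, using simplified stand-ins for the repo's actual `APIDebugEnvironment` and observation classes:

```python
from dataclasses import dataclass

@dataclass
class Observation:  # simplified stand-in for the repo's observation type
    feedback: str
    reward: float
    done: bool

class ToyEnv:
    """Simplified stand-in mirroring the guard added in environment.py."""

    def __init__(self):
        self.episode_done = False
        self.best_reward = 0.0

    def step(self, graded_reward: float) -> Observation:
        if self.episode_done:
            # Replay the best reward (never 0.0) so a late step after the
            # episode ends still carries the episode's score to the trainer.
            return Observation(
                feedback="Episode already ended.",
                reward=self.best_reward if self.best_reward > 0 else 0.001,
                done=True,
            )
        self.best_reward = max(self.best_reward, graded_reward)
        self.episode_done = True
        return Observation(feedback="graded", reward=graded_reward, done=True)

env = ToyEnv()
env.step(0.8)        # episode ends with reward 0.8
obs = env.step(0.0)  # post-done step replays the best reward
print(obs.reward)  # 0.8
```

The 0.001 floor matters when the episode ended with no positive reward at all: a strictly positive value keeps post-done observations distinguishable from a genuinely ungraded zero.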
tests/test_environment.py CHANGED
@@ -159,11 +159,11 @@ class TestEasyGrader:
         assert obs is not None
         assert obs.done is True
 
-    def test_step_after_done_returns_zero_reward(self):
+    def test_step_after_done_returns_best_reward(self):
         env = make_env("easy", seed=42)
         env.step(perfect_easy_action(env))  # ends episode
         obs = env.step(APIDebugAction(error_type="anything"))
-        assert obs.reward == 0.0
+        assert obs.reward >= 0.001  # never 0.0 -- open interval (0, 1)
         assert obs.done is True
 
     def test_reward_always_non_negative(self):
training/requirements.txt CHANGED
@@ -1,4 +1,4 @@
-trl>=0.26.0
+trl[vllm]>=0.26.0
 transformers
 torch
 datasets
training/train.py CHANGED
@@ -141,7 +141,9 @@ def parse_llm_response(text: str) -> dict:
     return {}
 
 
-def build_action(data: dict) -> APIDebugAction:
+def build_action(data) -> APIDebugAction:
+    if not isinstance(data, dict):
+        return APIDebugAction()
     fixed_req = data.get("fixed_request")
     if isinstance(fixed_req, dict):
         fixed_req = json.dumps(fixed_req)
@@ -281,7 +283,6 @@ grpo_args = GRPOConfig(
     report_to="none",
     bf16=supports_bf16,
     fp16=has_gpu and not supports_bf16,
-    no_cuda=not has_gpu,
     gradient_checkpointing=True,
     vllm_gpu_memory_utilization=0.3,
     dataloader_pin_memory=False,
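The `build_action` hardening above guards against `parse_llm_response` returning non-dict garbage for malformed LLM output. A self-contained sketch of the pattern, with a simplified stand-in for the repo's `APIDebugAction`:

```python
import json
from dataclasses import dataclass

@dataclass
class Action:  # simplified stand-in for the repo's APIDebugAction
    error_type: str = ""
    fixed_request: str = ""

def build_action(data) -> Action:
    # Mirrors the fix above: malformed LLM output can parse to a list,
    # string, or None, so fall back to an empty action instead of
    # raising AttributeError on data.get(...).
    if not isinstance(data, dict):
        return Action()
    fixed_req = data.get("fixed_request") or ""
    if isinstance(fixed_req, dict):
        fixed_req = json.dumps(fixed_req)  # normalize dict payloads to JSON strings
    return Action(error_type=data.get("error_type", ""), fixed_request=fixed_req)

print(build_action(None) == Action())  # True: garbage input yields a safe default
print(build_action({"fixed_request": {"a": 1}}).fixed_request)  # {"a": 1}
```

An empty default action simply earns a low reward from the graders, which is the right failure mode during RL rollouts: a bad generation should score poorly, not crash the training loop.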
training/train_colab.ipynb CHANGED
@@ -17,10 +17,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Cell 1: Install dependencies\n",
-    "!pip install -q trl>=0.26.0 transformers torch datasets openenv-core openai"
-   ]
+   "source": "# Cell 1: Install dependencies (vllm required for fast generation)\n!pip install -q \"trl[vllm]>=0.26.0\" transformers torch datasets openenv-core openai"
   },
   {
    "cell_type": "code",
@@ -88,19 +85,7 @@
    "execution_count": null,
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Cell 6: Upload trained model to HuggingFace\n",
-    "from huggingface_hub import HfApi\n",
-    "\n",
-    "api = HfApi()\n",
-    "api.upload_folder(\n",
-    "    folder_path='./outputs/api-debug-grpo',\n",
-    "    repo_id='avichauhan/api-debug-grpo-qwen-0.5b',\n",
-    "    repo_type='model',\n",
-    "    create_pr=False,\n",
-    ")\n",
-    "print('Model uploaded to: https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b')"
-   ]
+   "source": "# Cell 6: Upload trained model to HuggingFace\nfrom google.colab import userdata\nfrom huggingface_hub import HfApi\n\ntoken = userdata.get('HF_TOKEN')\napi = HfApi(token=token)\n\n# Create repo first (in case it doesn't exist)\napi.create_repo('avichauhan/api-debug-grpo-qwen-0.5b', exist_ok=True)\n\napi.upload_folder(\n    folder_path='./outputs/api-debug-grpo',\n    repo_id='avichauhan/api-debug-grpo-qwen-0.5b',\n    repo_type='model',\n    create_pr=False,\n)\nprint('Model uploaded to: https://huggingface.co/avichauhan/api-debug-grpo-qwen-0.5b')"
   }
  ],
  "metadata": {
@@ -120,4 +105,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
\ No newline at end of file
+}