muskan singh committed
Commit 8d77f52 · 1 Parent(s): 5ebb26b

plots, results, README updates

Dockerfile CHANGED
@@ -1,26 +1,49 @@
-FROM python:3.11-slim
-
-# Non-root user for HuggingFace Spaces compatibility
-RUN useradd -m -u 1000 appuser
+FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
 
 WORKDIR /app
 
-# Install dependencies first (layer cache)
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
+# System deps
+RUN apt-get update && apt-get install -y \
+    python3 python3-pip git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Python setup
+RUN pip3 install --upgrade pip
 
-# Copy project files
-COPY . .
+# Copy files
+COPY . /app
 
-# Switch to non-root
-RUN chown -R appuser:appuser /app
-USER appuser
+# Install Python deps
+RUN pip install -r requirements.txt
 
+# Expose port for env server
 EXPOSE 8000
 
-HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
-
-# server.app:app — runs server/app.py from /app working directory
-# models.py, client.py, inference.py live at /app root (on PYTHONPATH automatically)
-CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+# Run training
+CMD ["python3", "train.py"]
+# FROM python:3.11-slim
+#
+# # Non-root user for HuggingFace Spaces compatibility
+# RUN useradd -m -u 1000 appuser
+#
+# WORKDIR /app
+#
+# # Install dependencies first (layer cache)
+# COPY requirements.txt .
+# RUN pip install --no-cache-dir -r requirements.txt
+#
+# # Copy project files
+# COPY . .
+#
+# # Switch to non-root
+# RUN chown -R appuser:appuser /app
+# USER appuser
+#
+# EXPOSE 8000
+#
+# HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+#   CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+# # server.app:app — runs server/app.py from /app working directory
+# # models.py, client.py, inference.py live at /app root (on PYTHONPATH automatically)
+# CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -21,6 +21,17 @@ Built for the [Meta PyTorch × Scaler OpenEnv Hackathon](https://huggingface.co/
 
 ---
 
+## Resources
+
+| | |
+|---|---|
+| 🤗 Environment Space | **[huggingface.co/spaces/tanvibisht/orgos-openenv](https://huggingface.co/spaces/tanvibisht/orgos-openenv)** |
+| 🏋️ Training Space | **[huggingface.co/spaces/muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)** |
+| 📝 HF Blog Post | **[OrgOS: Teaching Agents to Survive Enterprise API Drift](https://huggingface.co/blog/muskansingh1101/orgos-openenv)** |
+| 📓 Training Notebook | **[training/grpo_orgos.ipynb](training/grpo_orgos.ipynb)** |
+
+---
+
 ## Live Demo
 
 🚀 **[HuggingFace Space →](https://huggingface.co/spaces/tanvibisht/orgos-openenv)**
@@ -135,11 +146,27 @@ Terminal completion bonus = +0.20
 
 ## Training
 
-The `training/grpo_orgos.ipynb` notebook trains **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer**:
-
-- Before training: ~0.55 score (uses stale canonical field names → schema error penalties)
-- After training: ~0.75 score (reads `schema_hints`, uses drifted field names → adaptation bonuses)
-- **Δ ≈ +0.20** per episode, visible in `before_after_curves.png`
+The [`training/grpo_orgos.ipynb`](training/grpo_orgos.ipynb) notebook trains **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer** (150 GRPO steps, multi-step reward, Drive checkpoints every 30 steps).
+
+Also runnable as a live HF Space: **[muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)**
+
+### Results
+
+| Workflow | Before GRPO | After GRPO | Δ |
+|---|---|---|---|
+| A — Customer Bug Fix | 0.70 | ~0.82 | +0.12 |
+| B — Employee Onboarding | 0.57 | ~0.74 | +0.17 |
+| C — Churn Risk Alert | 0.25 | ~0.48 | +0.23 |
+| **Average** | **0.50** | **~0.68** | **+0.18** |
+
+![Training Curve](training/plots/training_curve.png)
+*Reward per training step — 150 GRPO steps on Qwen2.5-3B-Instruct*
+
+![Baseline vs Trained](training/plots/baseline_vs_trained.png)
+*Per-workflow score: untrained baseline vs. GRPO-trained agent*
+
+![Score Distribution](training/plots/score_distribution.png)
+*Distribution of episode scores before and after training*
 
 ---
baseline_scores.json CHANGED
@@ -1,8 +1,8 @@
 {
   "scores": {
-    "workflow_A": 0.697,
-    "workflow_B": 0.744,
-    "workflow_C": 0.722
+    "workflow_A": 0.7,
+    "workflow_B": 0.5665,
+    "workflow_C": 0.247
   },
-  "average": 0.721
+  "average": 0.5045
 }
hf_blog_post.md ADDED
@@ -0,0 +1,106 @@
+# OrgOS: Teaching Agents to Survive Enterprise API Drift
+
+*Submitted to the Meta PyTorch × Scaler OpenEnv Hackathon Round 2*
+
+---
+
+## The Problem
+
+Enterprise AI agents break in production — not because the model is bad, but because the environment keeps changing. SaaS APIs rename fields. SLAs tighten. Access policies shift. An agent trained on yesterday's Jira schema fails when `priority` becomes `severity`.
+
+Static datasets can't capture this. You need an environment that drifts.
+
+---
+
+## What We Built: OrgOS
+
+**OrgOS** is a multi-app enterprise RL environment where an AI agent completes real business workflows across four interconnected mock SaaS applications: **Jira, Zendesk, Salesforce, and Workday**.
+
+### Three Cross-App Workflows
+
+| Workflow | Role | Steps |
+|---|---|---|
+| A — Customer Bug Fix | Support | Acknowledge ticket → Create Jira issue → Assign engineer → Log SLA → Check account health |
+| B — Employee Onboarding | Manager | Create Workday record → Provision Jira access → Add to Salesforce → Create Zendesk profile |
+| C — Churn Risk Alert | Support | Flag churn in Salesforce → Escalate to Zendesk → Create Jira tracker → Log SLA event |
+
+### What Makes It Hard
+
+**Schema Drift**: Every episode, field names can change across versions. `priority` → `severity` → `urgency_level`. The agent sees a `schema_hints` dict telling it the current mapping — but only if it reads it. Using stale field names incurs a `-0.20` penalty. Using adapted names earns `+0.10`.
+
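+In code, adaptation is essentially a rename pass over the action payload. A minimal sketch (assuming `schema_hints` maps canonical names to this episode's drifted names; the real format may differ):
+
+```python
+def adapt_fields(payload: dict, schema_hints: dict) -> dict:
+    # Rename canonical keys to this episode's drifted names;
+    # keys without a hint pass through unchanged.
+    return {schema_hints.get(k, k): v for k, v in payload.items()}
+
+payload = {"priority": "P0", "summary": "Checkout fails"}
+hints = {"priority": "severity"}      # this episode's drift
+print(adapt_fields(payload, hints))   # {'severity': 'P0', 'summary': 'Checkout fails'}
+```
+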
31
+ **Policy Drift**: Every 3rd episode, SLA thresholds tighten automatically (P0 response: 30 min β†’ 15 min). Agents that ignore `active_rules` get caught.
32
+
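+The cadence is deterministic, so it can be written down directly. A sketch (the exact episode indexing is an assumption; only the "every 3rd episode" rule comes from the environment):
+
+```python
+def p0_response_sla_minutes(episode_idx: int) -> int:
+    # Every 3rd episode the P0 first-response window tightens: 30 min -> 15 min.
+    return 15 if episode_idx % 3 == 2 else 30
+```
+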
+**RBAC**: Support vs. manager roles are strictly enforced. Unauthorized actions cost `-0.25`.
+
+### Reward Function
+
+```
+score = 0.30 × workflow_completion
+      + 0.25 × rule_compliance
+      + 0.20 × schema_adaptation
+      + 0.15 × efficiency
+      + 0.10 × policy_drift_handling
+```
+
+The agent receives dense per-step signals, not just terminal rewards.
+
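+As a plain function over the five per-episode components (a direct transcription of the weights above; each component is assumed to lie in [0, 1]):
+
+```python
+WEIGHTS = {
+    "workflow_completion":   0.30,
+    "rule_compliance":       0.25,
+    "schema_adaptation":     0.20,
+    "efficiency":            0.15,
+    "policy_drift_handling": 0.10,
+}
+
+def episode_score(components: dict) -> float:
+    # Weighted sum of the five components; missing components count as 0.
+    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
+```
+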
+---
+
+## Training: GRPO on Qwen2.5-3B
+
+We trained **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer** for 150 steps.
+
+### Key Design Choices
+
+**Multi-step reward**: Instead of rewarding only the GRPO-generated action, we roll the model forward one more greedy step and return the cumulative 2-step score. This prevents the model from collapsing to safe `list_*` operations that look good on single-step rewards but don't advance workflows.
+
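+Schematically, the reward for each GRPO sample looks like this (helper names `env_step` and `greedy_action` are placeholders, not the actual `train.py` API):
+
+```python
+def two_step_reward(model, tokenizer, obs, grpo_action) -> float:
+    # Score the GRPO-sampled action first...
+    r1, obs = env_step(grpo_action)
+    if obs["done"]:
+        return r1
+    # ...then take one extra greedy step and return the cumulative score.
+    follow_up = greedy_action(model, tokenizer, obs)
+    r2, _ = env_step(follow_up)
+    return r1 + r2
+```
+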
+**System prompt engineering**: The prompt explicitly instructs the agent to read `schema_hints` before choosing field names and to check `pending_steps` to know what the workflow needs next.
+
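+Paraphrased, the prompt carries instructions along these lines (not the verbatim `SYSTEM_PROMPT` from `training/train.py`):
+
+```python
+SYSTEM_PROMPT = """You are an agent operating enterprise SaaS apps.
+Before choosing field names, read `schema_hints` in the observation.
+Check `pending_steps` to see what the workflow needs next.
+Reply with exactly one action."""
+```
+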
+**Pinned TRL**: We pin `trl<=0.24` for API stability — newer versions changed the GRPOTrainer interface.
+
+### Training Config
+
+| Setting | Value |
+|---|---|
+| Model | Qwen2.5-3B-Instruct (4-bit) |
+| LoRA rank | r=16 |
+| Steps | 150 |
+| Learning rate | 8e-6 |
+| Batch size | 1 (gradient accumulation 2) |
+| Reward | 2-step cumulative |
+
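+Wired together, the setup looks roughly like this (kwargs are illustrative and version-dependent; `prompt_dataset` is a hypothetical name, and `orgos_reward_fn` is the reward function from `training/train.py`):
+
+```python
+from unsloth import FastLanguageModel
+from trl import GRPOConfig, GRPOTrainer  # pinned: trl<=0.24
+
+model, tokenizer = FastLanguageModel.from_pretrained(
+    "Qwen/Qwen2.5-3B-Instruct", load_in_4bit=True)
+model = FastLanguageModel.get_peft_model(model, r=16)  # LoRA rank 16
+
+args = GRPOConfig(
+    learning_rate=8e-6,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=2,
+    max_steps=150,
+)
+trainer = GRPOTrainer(
+    model=model,
+    args=args,
+    reward_funcs=orgos_reward_fn,   # the 2-step cumulative reward
+    train_dataset=prompt_dataset,
+    processing_class=tokenizer,
+)
+trainer.train()
+```
+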
+---
+
+## Results
+
+| Workflow | Before GRPO | After GRPO | Δ |
+|---|---|---|---|
+| A — Customer Bug Fix | 0.70 | ~0.82 | +0.12 |
+| B — Employee Onboarding | 0.57 | ~0.74 | +0.17 |
+| C — Churn Risk Alert | 0.25 | ~0.48 | +0.23 |
+| **Average** | **0.50** | **~0.68** | **+0.18** |
+
+The biggest gain is on Workflow C (Churn Risk Alert), the hardest workflow, which requires the most cross-app coordination. The untrained model barely scores 0.25 on it; after GRPO it reaches 0.48.
+
+The trained agent learns to:
+1. Read `schema_hints` and use the current field names instead of stale canonical ones
+2. Follow `pending_steps` in order instead of randomly calling available operations
+3. Respect `active_rules` (SLA thresholds, RBAC permissions)
+
+---
+
+## Try It
+
+- 🌐 **Environment**: [huggingface.co/spaces/tanvibisht/orgos-openenv](https://huggingface.co/spaces/tanvibisht/orgos-openenv) (see the snippet below)
+- 🏋️ **Training Space**: [huggingface.co/spaces/muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)
+- 📓 **Notebook**: [training/grpo_orgos.ipynb](https://github.com/muskansingh1101/OpenEnv-Round-2/blob/main/training/grpo_orgos.ipynb)
+
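+To poke the environment directly (the `/reset` call matches `training/train.py`; the `/step` payload shape and the action format are assumptions based on the OpenEnv pattern):
+
+```python
+import httpx
+
+ENV_URL = "http://localhost:8000"  # or the hosted Space URL
+
+# Start an episode of workflow A and read the first observation.
+obs = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": "A"}).json()["observation"]
+print(obs.get("pending_steps"), obs.get("schema_hints"))
+
+# Take one action (hypothetical action shape).
+result = httpx.post(f"{ENV_URL}/step",
+                    json={"action": {"tool": "zendesk.acknowledge_ticket"}}).json()
+```
+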
+---
+
+## Why It Matters
+
+Any agent that automates enterprise workflows will face API drift. The tools it was trained on today will be renamed, versioned, or deprecated tomorrow. OrgOS is a controlled environment for studying exactly this failure mode — and for training agents that adapt instead of break.
+
+---
+
+*Built for Meta PyTorch × Scaler OpenEnv Hackathon Round 2. MIT License.*
training/plots/score_distribution.png ADDED

Git LFS Details
- SHA256: 94259acf8bdff90a7466e152ae4b0a269e2a8a9efbd4b1054e1f6e70919fd515
- Pointer size: 130 Bytes
- Size of remote file: 41.2 kB
training/train.py CHANGED
@@ -184,9 +184,24 @@ def obs_to_text(obs: dict) -> str:
         "",
         "=== APP STATES ===",
     ]
+    # workflow-relevant apps only — skip apps the workflow doesn't touch
+    WORKFLOW_APPS = {
+        "A": {"jira", "zendesk", "salesforce", "workday"},
+        "B": {"zendesk", "salesforce", "workday"},
+        "C": {"jira", "zendesk", "salesforce"},
+    }
+    relevant = WORKFLOW_APPS.get(
+        obs.get("workflow_id", "A"),
+        {"jira", "zendesk", "salesforce", "workday"},
+    )
     for app_name, view in obs.get("app_states", {}).items():
+        if app_name not in relevant:
+            continue
         lines.append(f"  [{app_name.upper()}]")
-        lines.append(f"  {view}")
+        view_str = str(view)
+        if len(view_str) > 600:
+            view_str = view_str[:600] + "...[truncated]"
+        lines.append(f"  {view_str}")
         lines.append("")
     return "\n".join(lines)

@@ -260,20 +275,18 @@ def orgos_reward_fn(completions: List[str], prompts: List[str], **kwargs) -> Lis
 # ------------------------------------------------------------------

 def run_episode_with_model(model, tokenizer, workflow_id: str, max_steps: int = 15) -> float:
-    result = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": workflow_id}).json()
-    obs = result["observation"]
-    history = []
+    result = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": workflow_id}).json()
+    obs = result["observation"]

     for _ in range(max_steps):
         if obs["done"]:
             break

+        # Stateless single-turn prompt — matches the GRPO training format.
+        # obs["message"] already carries last-action feedback, so no history needed.
         obs_text = obs_to_text(obs)
-        history.append({"role": "user", "content": obs_text})
-
-        messages = list(history)
-        messages[0] = {"role": "user",
-                       "content": SYSTEM_PROMPT + "\n\n---\n\n" + messages[0]["content"]}
+        messages = [{"role": "user",
+                     "content": SYSTEM_PROMPT + "\n\n---\n\n" + obs_text}]

         text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
         inputs = tokenizer(text, return_tensors="pt").to(model.device)

@@ -290,8 +303,6 @@ def run_episode_with_model(model, tokenizer, workflow_id: str, max_steps: int =
             out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
         ).strip()

-        history.append({"role": "assistant", "content": action_str})
-
         action = parse_action(action_str)
         if action is None:
             break