muskan singh committed
Commit 8d77f52 · 1 Parent(s): 5ebb26b

plots, results, README updates

Dockerfile CHANGED
@@ -1,26 +1,49 @@
-FROM python:3.11-slim
-
-# Non-root user for HuggingFace Spaces compatibility
-RUN useradd -m -u 1000 appuser
+FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
 
 WORKDIR /app
 
-# Install dependencies first (layer cache)
-COPY requirements.txt .
-RUN pip install --no-cache-dir -r requirements.txt
+# System deps
+RUN apt-get update && apt-get install -y \
+    python3 python3-pip git && \
+    rm -rf /var/lib/apt/lists/*
+
+# Python setup
+RUN pip3 install --upgrade pip
 
-# Copy project files
-COPY . .
+# Copy files
+COPY . /app
 
-# Switch to non-root
-RUN chown -R appuser:appuser /app
-USER appuser
+# Install Python deps
+RUN pip install -r requirements.txt
 
+# Expose port for env server
 EXPOSE 8000
 
-HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
-
-# server.app:app — runs server/app.py from /app working directory
-# models.py, client.py, inference.py live at /app root (on PYTHONPATH automatically)
-CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+# Run training
+CMD ["python3", "train.py"]
+# FROM python:3.11-slim
+#
+# # Non-root user for HuggingFace Spaces compatibility
+# RUN useradd -m -u 1000 appuser
+#
+# WORKDIR /app
+#
+# # Install dependencies first (layer cache)
+# COPY requirements.txt .
+# RUN pip install --no-cache-dir -r requirements.txt
+#
+# # Copy project files
+# COPY . .
+#
+# # Switch to non-root
+# RUN chown -R appuser:appuser /app
+# USER appuser
+#
+# EXPOSE 8000
+#
+# HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
+#   CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+# # server.app:app — runs server/app.py from /app working directory
+# # models.py, client.py, inference.py live at /app root (on PYTHONPATH automatically)
+# CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -21,6 +21,17 @@ Built for the [Meta PyTorch × Scaler OpenEnv Hackathon](https://huggingface.co/
 
 ---
 
+## Resources
+
+| | |
+|---|---|
+| 🤗 Environment Space | **[huggingface.co/spaces/tanvibisht/orgos-openenv](https://huggingface.co/spaces/tanvibisht/orgos-openenv)** |
+| 🏋️ Training Space | **[huggingface.co/spaces/muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)** |
+| 📝 HF Blog Post | **[OrgOS: Teaching Agents to Survive Enterprise API Drift](https://huggingface.co/blog/muskansingh1101/orgos-openenv)** |
+| 📓 Training Notebook | **[training/grpo_orgos.ipynb](training/grpo_orgos.ipynb)** |
+
+---
+
 ## Live Demo
 
 🚀 **[HuggingFace Space →](https://huggingface.co/spaces/tanvibisht/orgos-openenv)**
@@ -135,11 +146,27 @@ Terminal completion bonus = +0.20
 
 ## Training
 
-The `training/grpo_orgos.ipynb` notebook trains **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer**:
-
-- Before training: ~0.55 score (uses stale canonical field names → schema error penalties)
-- After training: ~0.75 score (reads `schema_hints`, uses drifted field names → adaptation bonuses)
-- **Δ ≈ +0.20** per episode, visible in `before_after_curves.png`
+The [`training/grpo_orgos.ipynb`](training/grpo_orgos.ipynb) notebook trains **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer** (150 GRPO steps, multi-step reward, Drive checkpoints every 30 steps).
+
+Also runnable as a live HF Space: **[muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)**
+
+### Results
+
+| Workflow | Before GRPO | After GRPO | Δ |
+|---|---|---|---|
+| A — Customer Bug Fix | 0.70 | ~0.82 | +0.12 |
+| B — Employee Onboarding | 0.57 | ~0.74 | +0.17 |
+| C — Churn Risk Alert | 0.25 | ~0.48 | +0.23 |
+| **Average** | **0.50** | **~0.68** | **+0.18** |
+
+![Training Curve](training/plots/training_curve.png)
+*Reward per training step — 150 GRPO steps on Qwen2.5-3B-Instruct*
+
+![Baseline vs Trained](training/plots/baseline_vs_trained.png)
+*Per-workflow score: untrained baseline vs. GRPO-trained agent*
+
+![Score Distribution](training/plots/score_distribution.png)
+*Distribution of episode scores before and after training*
 
 ---
baseline_scores.json CHANGED
@@ -1,8 +1,8 @@
 {
   "scores": {
-    "workflow_A": 0.697,
-    "workflow_B": 0.744,
-    "workflow_C": 0.722
+    "workflow_A": 0.7,
+    "workflow_B": 0.5665,
+    "workflow_C": 0.247
   },
-  "average": 0.721
+  "average": 0.5045
 }
hf_blog_post.md ADDED
@@ -0,0 +1,106 @@
+# OrgOS: Teaching Agents to Survive Enterprise API Drift
+
+*Submitted to the Meta PyTorch × Scaler OpenEnv Hackathon Round 2*
+
+---
+
+## The Problem
+
+Enterprise AI agents break in production — not because the model is bad, but because the environment keeps changing. SaaS APIs rename fields. SLAs tighten. Access policies shift. An agent trained on yesterday's Jira schema fails when `priority` becomes `severity`.
+
+Static datasets can't capture this. You need an environment that drifts.
+
+---
+
+## What We Built: OrgOS
+
+**OrgOS** is a multi-app enterprise RL environment where an AI agent completes real business workflows across four interconnected mock SaaS applications: **Jira, Zendesk, Salesforce, and Workday**.
+
+### Three Cross-App Workflows
+
+| Workflow | Role | Steps |
+|---|---|---|
+| A — Customer Bug Fix | Support | Acknowledge ticket → Create Jira issue → Assign engineer → Log SLA → Check account health |
+| B — Employee Onboarding | Manager | Create Workday record → Provision Jira access → Add to Salesforce → Create Zendesk profile |
+| C — Churn Risk Alert | Support | Flag churn in Salesforce → Escalate to Zendesk → Create Jira tracker → Log SLA event |
+
+### What Makes It Hard
+
+**Schema Drift**: Every episode, field names can change across versions. `priority` → `severity` → `urgency_level`. The agent sees a `schema_hints` dict telling it the current mapping — but only if it reads it. Using stale field names incurs a `-0.20` penalty. Using adapted names earns `+0.10`.
+
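+In code, adaptation is essentially a rename pass over the action payload. A minimal sketch (assuming `schema_hints` maps canonical names to this episode's drifted names; the real format may differ):
+
+```python
+def adapt_fields(payload: dict, schema_hints: dict) -> dict:
+    # Rename canonical keys to this episode's drifted names;
+    # keys without a hint pass through unchanged.
+    return {schema_hints.get(k, k): v for k, v in payload.items()}
+
+payload = {"priority": "P0", "summary": "Checkout fails"}
+hints = {"priority": "severity"}      # this episode's drift
+print(adapt_fields(payload, hints))   # {'severity': 'P0', 'summary': 'Checkout fails'}
+```
+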
31
+ **Policy Drift**: Every 3rd episode, SLA thresholds tighten automatically (P0 response: 30 min β†’ 15 min). Agents that ignore `active_rules` get caught.
32
+
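+The cadence is deterministic, so it can be written down directly. A sketch (the exact episode indexing is an assumption; only the "every 3rd episode" rule comes from the environment):
+
+```python
+def p0_response_sla_minutes(episode_idx: int) -> int:
+    # Every 3rd episode the P0 first-response window tightens: 30 min -> 15 min.
+    return 15 if episode_idx % 3 == 2 else 30
+```
+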
+**RBAC**: Support vs. manager roles are strictly enforced. Unauthorized actions cost `-0.25`.
+
+### Reward Function
+
+```
+score = 0.30 × workflow_completion
+      + 0.25 × rule_compliance
+      + 0.20 × schema_adaptation
+      + 0.15 × efficiency
+      + 0.10 × policy_drift_handling
+```
+
+The agent receives dense per-step signals, not just terminal rewards.
+
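+As a plain function over the five per-episode components (a direct transcription of the weights above; each component is assumed to lie in [0, 1]):
+
+```python
+WEIGHTS = {
+    "workflow_completion":   0.30,
+    "rule_compliance":       0.25,
+    "schema_adaptation":     0.20,
+    "efficiency":            0.15,
+    "policy_drift_handling": 0.10,
+}
+
+def episode_score(components: dict) -> float:
+    # Weighted sum of the five components; missing components count as 0.
+    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
+```
+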
+---
+
+## Training: GRPO on Qwen2.5-3B
+
+We trained **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer** for 150 steps.
+
+### Key Design Choices
+
+**Multi-step reward**: Instead of rewarding only the GRPO-generated action, we roll the model forward one more greedy step and return the cumulative 2-step score. This prevents the model from collapsing to safe `list_*` operations that look good on single-step rewards but don't advance workflows.
+
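+Schematically, the reward for each GRPO sample looks like this (helper names `env_step` and `greedy_action` are placeholders, not the actual `train.py` API):
+
+```python
+def two_step_reward(model, tokenizer, obs, grpo_action) -> float:
+    # Score the GRPO-sampled action first...
+    r1, obs = env_step(grpo_action)
+    if obs["done"]:
+        return r1
+    # ...then take one extra greedy step and return the cumulative score.
+    follow_up = greedy_action(model, tokenizer, obs)
+    r2, _ = env_step(follow_up)
+    return r1 + r2
+```
+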
+**System prompt engineering**: The prompt explicitly instructs the agent to read `schema_hints` before choosing field names and to check `pending_steps` to know what the workflow needs next.
+
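+Paraphrased, the prompt carries instructions along these lines (not the verbatim `SYSTEM_PROMPT` from `training/train.py`):
+
+```python
+SYSTEM_PROMPT = """You are an agent operating enterprise SaaS apps.
+Before choosing field names, read `schema_hints` in the observation.
+Check `pending_steps` to see what the workflow needs next.
+Reply with exactly one action."""
+```
+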
+**Pinned TRL**: We pin `trl<=0.24` for API stability — newer versions changed the GRPOTrainer interface.
+
+### Training Config
+
+| Setting | Value |
+|---|---|
+| Model | Qwen2.5-3B-Instruct (4-bit) |
+| LoRA rank | r=16 |
+| Steps | 150 |
+| Learning rate | 8e-6 |
+| Batch size | 1 (gradient accumulation 2) |
+| Reward | 2-step cumulative |
+
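+Wired together, the setup looks roughly like this (kwargs are illustrative and version-dependent; `prompt_dataset` is a hypothetical name, and `orgos_reward_fn` is the reward function from `training/train.py`):
+
+```python
+from unsloth import FastLanguageModel
+from trl import GRPOConfig, GRPOTrainer  # pinned: trl<=0.24
+
+model, tokenizer = FastLanguageModel.from_pretrained(
+    "Qwen/Qwen2.5-3B-Instruct", load_in_4bit=True)
+model = FastLanguageModel.get_peft_model(model, r=16)  # LoRA rank 16
+
+args = GRPOConfig(
+    learning_rate=8e-6,
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=2,
+    max_steps=150,
+)
+trainer = GRPOTrainer(
+    model=model,
+    args=args,
+    reward_funcs=orgos_reward_fn,   # the 2-step cumulative reward
+    train_dataset=prompt_dataset,
+    processing_class=tokenizer,
+)
+trainer.train()
+```
+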
+---
+
+## Results
+
+| Workflow | Before GRPO | After GRPO | Δ |
+|---|---|---|---|
+| A — Customer Bug Fix | 0.70 | ~0.82 | +0.12 |
+| B — Employee Onboarding | 0.57 | ~0.74 | +0.17 |
+| C — Churn Risk Alert | 0.25 | ~0.48 | +0.23 |
+| **Average** | **0.50** | **~0.68** | **+0.18** |
+
+The biggest gain is on Workflow C (Churn Risk Alert), the hardest workflow, which requires the most cross-app coordination. The untrained model barely scores 0.25 on it; after GRPO it reaches 0.48.
+
+The trained agent learns to:
+1. Read `schema_hints` and use the current field names instead of stale canonical ones
+2. Follow `pending_steps` in order instead of randomly calling available operations
+3. Respect `active_rules` (SLA thresholds, RBAC permissions)
+
+---
+
+## Try It
+
+- 🌐 **Environment**: [huggingface.co/spaces/tanvibisht/orgos-openenv](https://huggingface.co/spaces/tanvibisht/orgos-openenv) (see the snippet below)
+- 🏋️ **Training Space**: [huggingface.co/spaces/muskansingh1101/orgos-training](https://huggingface.co/spaces/muskansingh1101/orgos-training)
+- 📓 **Notebook**: [training/grpo_orgos.ipynb](https://github.com/muskansingh1101/OpenEnv-Round-2/blob/main/training/grpo_orgos.ipynb)
+
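+To poke the environment directly (the `/reset` call matches `training/train.py`; the `/step` payload shape and the action format are assumptions based on the OpenEnv pattern):
+
+```python
+import httpx
+
+ENV_URL = "http://localhost:8000"  # or the hosted Space URL
+
+# Start an episode of workflow A and read the first observation.
+obs = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": "A"}).json()["observation"]
+print(obs.get("pending_steps"), obs.get("schema_hints"))
+
+# Take one action (hypothetical action shape).
+result = httpx.post(f"{ENV_URL}/step",
+                    json={"action": {"tool": "zendesk.acknowledge_ticket"}}).json()
+```
+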
+---
+
+## Why It Matters
+
+Any agent that automates enterprise workflows will face API drift. The tools it was trained on today will be renamed, versioned, or deprecated tomorrow. OrgOS is a controlled environment for studying exactly this failure mode — and for training agents that adapt instead of break.
+
+---
+
+*Built for Meta PyTorch × Scaler OpenEnv Hackathon Round 2. MIT License.*
training/plots/score_distribution.png ADDED

Git LFS Details
- SHA256: 94259acf8bdff90a7466e152ae4b0a269e2a8a9efbd4b1054e1f6e70919fd515
- Pointer size: 130 Bytes
- Size of remote file: 41.2 kB
training/train.py CHANGED
@@ -184,9 +184,24 @@ def obs_to_text(obs: dict) -> str:
         "",
         "=== APP STATES ===",
     ]
+    # workflow-relevant apps only — skip apps the workflow doesn't touch
+    WORKFLOW_APPS = {
+        "A": {"jira", "zendesk", "salesforce", "workday"},
+        "B": {"zendesk", "salesforce", "workday"},
+        "C": {"jira", "zendesk", "salesforce"},
+    }
+    relevant = WORKFLOW_APPS.get(
+        obs.get("workflow_id", "A"),
+        {"jira", "zendesk", "salesforce", "workday"},
+    )
     for app_name, view in obs.get("app_states", {}).items():
+        if app_name not in relevant:
+            continue
         lines.append(f"  [{app_name.upper()}]")
-        lines.append(f"  {view}")
+        view_str = str(view)
+        if len(view_str) > 600:
+            view_str = view_str[:600] + "...[truncated]"
+        lines.append(f"  {view_str}")
         lines.append("")
     return "\n".join(lines)

@@ -260,20 +275,18 @@ def orgos_reward_fn(completions: List[str], prompts: List[str], **kwargs) -> Lis
 # ------------------------------------------------------------------

 def run_episode_with_model(model, tokenizer, workflow_id: str, max_steps: int = 15) -> float:
-    result = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": workflow_id}).json()
-    obs = result["observation"]
-    history = []
+    result = httpx.post(f"{ENV_URL}/reset", json={"workflow_id": workflow_id}).json()
+    obs = result["observation"]

     for _ in range(max_steps):
         if obs["done"]:
             break

+        # Stateless single-turn prompt — matches the GRPO training format.
+        # obs["message"] already carries last-action feedback, so no history needed.
         obs_text = obs_to_text(obs)
-        history.append({"role": "user", "content": obs_text})
-
-        messages = list(history)
-        messages[0] = {"role": "user",
-                       "content": SYSTEM_PROMPT + "\n\n---\n\n" + messages[0]["content"]}
+        messages = [{"role": "user",
+                     "content": SYSTEM_PROMPT + "\n\n---\n\n" + obs_text}]

         text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
         inputs = tokenizer(text, return_tensors="pt").to(model.device)

@@ -290,8 +303,6 @@ def run_episode_with_model(model, tokenizer, workflow_id: str, max_steps: int =
             out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
         ).strip()

-        history.append({"role": "assistant", "content": action_str})
-
         action = parse_action(action_str)
         if action is None:
             break