Commit: Upload folder using huggingface_hub

Files changed:

- BLOG.md +105 -0
- README.md +115 -295
- server/app.py +2 -0

BLOG.md (ADDED)
# Ghostexec: A tiny chief-of-staff simulator for agents

**Demo (<2 min):** [Ghostexec on YouTube](https://youtu.be/g4IFZMEzfO8)

---

## The messy day, in one paragraph
Picture a morning where everything arrives at once: a board member's email, a double-booked calendar, a message from home about dinner, a report due at noon, and a teammate who is already annoyed. You cannot "solve" that with a single summary. You **sequence** decisions – who gets a reply, what gets rescheduled, what gets delegated – and each move changes how stressed you are, how people feel, and whether real work moves forward.

**Ghostexec** is a small, trainable environment that captures that feeling. It is built on **[OpenEnv](https://github.com/meta-pytorch/OpenEnv)** so agents, researchers, and judges can talk to it through a **standard HTTP and WebSocket API**, run it on a **public Hugging Face Space**, and plug it into a **real RL or preference-optimization loop**.

---
## What Ghostexec actually is

Ghostexec is an **OpenEnv-compatible "AI chief of staff" simulator**:

- **Inbox, calendar, contacts, tasks**, and **stakeholder moods** live in **JSON scenarios** under `scenarios/` – the story is data, not hardcoded prose in Python.
- Each step, the policy sees a **plain-text briefing** (the same kind of wall-of-text a human assistant might scan), not a raw dump of the whole world object.
- The agent returns **one structured action per step** – for example `reply_email`, `reschedule_meeting`, `complete_task`, or `delegate_task` – with fields validated against a schema.
- **Invalid actions do not crash the server.** They return a controlled signal so learning (or evaluation) can continue.
- **`do_nothing` is penalised** so "freeze and hope it goes away" is not a free winning strategy when fires are burning.

Under the hood, the simulation advances **time**, updates **moods**, and rebuilds **conflicts** after each step, and optional **drift events** in the JSON can reshuffle the situation mid-episode, so the agent is tested on **adaptation**, not on memorizing the first screen.
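As a concrete sketch of that one-action-per-step contract, a step is just a small typed payload that must name a real id from the briefing. The validator below is illustrative (only the `action_type` values and field names mentioned in this post are real; the helper itself is hypothetical), but it shows why malformed actions can be turned into a signal instead of a crash:

```python
# Illustrative sketch of the per-step action contract. Field requirements
# below are assumptions drawn from the action names in this post, not the
# environment's real schema.
REQUIRED_FIELDS = {
    "reply_email": {"email_id", "message_body"},
    "reschedule_meeting": {"meeting_id", "new_time"},
    "complete_task": {"task_id"},
    "delegate_task": {"task_id", "contact_name"},
    "do_nothing": set(),
}

def validate_action(action: dict) -> tuple[bool, str]:
    """Return (ok, reason) instead of raising, mirroring the
    'invalid actions do not crash the server' behaviour."""
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        return False, f"unknown action_type: {kind!r}"
    missing = REQUIRED_FIELDS[kind] - action.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"

print(validate_action(
    {"action_type": "reply_email", "email_id": "e01", "message_body": "On it."}
))  # (True, 'ok')
```

A bad payload such as `{"action_type": "reply_email"}` comes back as `(False, ...)` with the missing fields named, which is the kind of controlled error a learning loop can keep stepping through.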

---
## Why OpenEnv?

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) gives us a **shared contract**: reset, step, schema, health, WebSocket sessions, and tooling to **validate** and **ship** environments. Our manifest is in **`openenv.yaml`** (environment name `ghostexec`, three tasks with graders, FastAPI app entrypoint). That keeps the submission **inspectable** and **reproducible**: judges can open the Space, read the repo, and run tests locally with the same entrypoints we use in Docker.

---
## Rewards: teach the model, certify the run

We use **two layers** of feedback, on purpose:

1. **Dense step reward** (in `server/reward.py`) blends **conflict**, **relationship**, and **task** progress with fixed weights **0.35 / 0.35 / 0.30**, plus bounded shaping and explicit handling of invalid or idle steps. That signal is what you want when **training** with modern RL or GRPO-style methods.
2. **Trajectory graders** (in `graders.py`, wired in `openenv.yaml`) produce **bounded** scores for three **tasks** – easy, medium, and hard scenarios – so hackathon **certification** stays in a well-defined range.

Together: the model can **learn** from rich per-step feedback, while organizers can **score** full trajectories against clear tasks.
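A minimal sketch of that fixed blend is below. The 0.35 / 0.35 / 0.30 weights are from this post; the clamp range, idle pressure, and invalid-action penalty values are assumptions for illustration, and the authoritative logic lives in `server/reward.py`:

```python
# Illustrative sketch of the dense step reward described above.
# Weights are the documented 0.35 / 0.35 / 0.30; the shaping and penalty
# constants here are assumptions, not the real server/reward.py values.
def step_reward(conflict: float, relationship: float, task: float,
                valid: bool = True, idle: bool = False) -> float:
    if not valid:
        # invalid actions return a controlled negative signal, not a crash
        return -0.5
    weighted_base = 0.35 * conflict + 0.35 * relationship + 0.30 * task
    if idle:
        # do_nothing is penalised when work is waiting
        weighted_base -= 0.2
    # keep the signal bounded for stable RL training
    return max(-1.0, min(1.0, weighted_base))

print(round(step_reward(0.8, 0.6, 0.4), 2))  # 0.61
```

Keeping the per-step signal dense and bounded like this is what makes it usable directly as the reward in a GRPO-style loop, while the separate graders handle episode-level certification.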

---
## Try it in sixty seconds

**Short demo video:** [https://youtu.be/g4IFZMEzfO8](https://youtu.be/g4IFZMEzfO8)

**Live Space (public):** [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec)

- Open **`/docs`** on the Space for the interactive API, or **`/web`** for the OpenEnv playground.
- Full README (formatted, with tables and deep links):
  [https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md)

**Local quick start** (from repo root):

```bash
uv sync
uv run server --port 8000
```

Then use **`GhostexecEnv`** from `client.py` (WebSocket session) for **multi-step episodes**, or raw HTTP if you only need a smoke test. The README's "Quick start" section has a copy-paste Python snippet.
**Source:** mirror on GitHub as you prefer; the canonical hackathon artifact is the **Space + repo** layout described in the README.

---
## If you only have two minutes on video

Published walkthrough: [**youtu.be/g4IFZMEzfO8**](https://youtu.be/g4IFZMEzfO8)

A tight arc that works for non-technical viewers:

1. **Show the briefing** – scroll through the same text the model sees. Say: "This is not a quiz; it's a shift at work."
2. **One good action** – e.g. a thoughtful `reply_email` with a real `email_id` from the text; show reward or mood metadata if the UI exposes it.
3. **One bad action** – a wrong id, or `do_nothing` while something urgent waits; show that the world **does not crash**, but the score **hurts**.
4. **One sentence on training** – "We can optimize policies against this API with GRPO / TRL-style loops; graders score whole episodes for the hackathon."

End on: **Ghostexec is the busy day, compressed – so models can practice being calm, fast, and fair before anyone trusts them near a real calendar.**

---
## Theme fit (hackathon)

Ghostexec aligns naturally with **personalized, high-stakes tasks**: executive triage, delegation, and tradeoffs between **people**, **calendar**, and **deadlines**. Diverse **`scenarios/*.json`** files and optional curriculum / perturbation hooks (see README) make it easy to **stress-test** policies without rewriting core engine code.

---
## Where to read more

| Resource | Link |
|----------|------|
| Demo video (<2 min) | [YouTube](https://youtu.be/g4IFZMEzfO8) |
| Full project README (judging sections, layout, commands) | [README on the Hub](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md) |
| Innovation-only deep dive | [`environment-innovation/README.md`](environment-innovation/README.md) in the repo |
| OpenEnv upstream | [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) |

---
## Closing

Ghostexec is not trying to replace human assistants. It is trying to give **models and researchers** a **credible, stressful, and kind** miniature office: text that reads like work, actions that look like tools, and scores that admit **tradeoffs**. If that sounds useful, spin up the Space, break something on purpose, and watch the environment **keep running** – that resilience is part of the point.

*– Ghostexec / OpenEnv submission*
README.md (CHANGED)

Removed (old README):

# Ghostexec

| Item | Location |
|------|--------|
| **HF Space name / manifest** | `ghostexec` in [`openenv.yaml`](openenv.yaml) |
| **Python package** | `openenv-ghostexec` in [`pyproject.toml`](pyproject.toml) (import `ghostexec`) |
| **Public Space** | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
| **Deeper innovation-only brief** | [`environment-innovation/README.md`](environment-innovation/README.md) |
## Deliverables (fill before freeze)

| Deliverable | URL |
|-------------|-----|
| Public HF Space (required) | [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
| Write-up / blog (HF post preferred) | `TODO: paste your post URL` |
| Short demo video (<2 min) | `TODO: paste your video URL` |

---
## Contents

**Judging criteria (this README is organized around them)**

1. [Criterion: Environment Innovation (40%)](#ghostexec-env-innovation)
2. [Criterion: Storytelling & Presentation (30%)](#ghostexec-storytelling)
3. [Criterion: Showing Improvement in Rewards (20%)](#ghostexec-reward-improvement)
4. [Criterion: Reward & Training Pipeline (10%)](#ghostexec-reward-pipeline)

**Reference**

5. [Hackathon themes & checklist](#openenv-hackathon-themes--checklist)
6. [Quick start](#quick-start-python-client)
7. [Actions](#actions-and-fields)
8. [Observation](#observation)
9. [Reward (formula summary)](#reward-formula-summary)
10. [HTTP vs WebSocket](#http-vs-websocket-episode-state)
11. [Running and testing locally](#running-and-testing-locally)
12. [Hugging Face Spaces](#hugging-face-spaces)
13. [Scenarios](#scenarios)
14. [Project layout](#project-layout)
15. [Resources & references](#resources--references)
16. [License](#license)

---
## Criterion: Environment Innovation (40%)

<a id="ghostexec-env-innovation"></a>

**Weight:** 40%

**What it means:**

- Is the environment novel, creative, or genuinely challenging?
- Does it meaningfully test agent behavior in a way that hasn't been done before?

### How Ghostexec answers this

**Challenging world.** The policy sees **one dense natural-language briefing** per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining), not a JSON dump of the world. It must **ground** decisions in real ids from that text, return **valid typed actions**, and accept **time pressure** and **social fallout** when meetings move or mail goes unanswered. Invalid actions **do not crash** the server; they return structured errors so learning signals stay intact.

**Meaningful behavior, not a toy Q&A.** Success needs **comprehension + tool discipline**: a legal JSON schema, multi-step **sequences** (WebSocket sessions for real episodes), and **tradeoffs** across channels (mail vs calendar vs tasks vs relationships). **`do_nothing` is penalised** so "safe" idleness is costly when fires are burning.

**Dynamics, not a static paragraph.** After each valid action, the simulation **advances the clock**, updates **moods**, rebuilds **conflicts**, and can apply **scenario-driven drift** (`after_step` events in JSON): shifted meetings, new deadlines, preference changes – so the agent is tested on **adaptation**, not memorizing the first screen.

**Dual evaluation.** **Dense step rewards** in `server/reward.py` teach fine structure; **trajectory graders** in `graders.py` return scores strictly in **`(0.01, 0.99)`** per the OpenEnv task wiring in `openenv.yaml`. Agents learn from the dense signal; judges get bounded certification scores.
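The strict `(0.01, 0.99)` bound amounts to a one-line clamp on whatever the grader computes. The sketch below is a guess at the shape, not the actual `graders.py` logic: only the open-interval bound is documented here, and averaging step rewards is an assumed aggregation.

```python
# Illustrative grader sketch. Only the strict (0.01, 0.99) output bound is
# documented; the mean-of-step-rewards aggregation is an assumption.
def grade_trajectory(step_rewards: list[float]) -> float:
    if not step_rewards:
        return 0.01
    mean = sum(step_rewards) / len(step_rewards)  # step rewards assumed in [-1, 1]
    score = (mean + 1.0) / 2.0                    # rescale to [0, 1]
    return min(0.99, max(0.01, score))            # clamp into (0.01, 0.99)

print(grade_trajectory([0.6, 0.2, -0.1]))  # always strictly inside (0.01, 0.99)
```

Clamping to an open-style interval like this keeps certification scores comparable across tasks even when a trained policy saturates the dense reward.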
**Honest novelty claim.** Inboxes and calendars are familiar **ingredients**. What is less common is the **composition**: OpenEnv-native packaging, **plain-text-only** observations, **data-defined** scenarios, live dynamics + drift, a dual reward/grader stack, and a **transactional** action API in one trainable, hostable environment.

### Task ladder (difficulty in data)

| Task id | Difficulty | Scenario | What gets harder |
|---------|------------|----------|------------------|
| `phase2_core` | easy | `scenarios/phase2_core.json` | Dense triage: VIP mail, calendar relief, overlapping work. |
| `monday_morning` | medium | `scenarios/monday_morning.json` | Stacked Monday rush, less slack. |
| `dinner_disaster` | hard | `scenarios/dinner_disaster.json` | Personal vs professional collision, escalation risk. |

### 5-minute verification checklist

1. **`openenv.yaml`** – three tasks, `max_steps`, `app: server.app:app`, `name: ghostexec`, grader paths.
2. **`scenarios/*.json`** – world content is **data**, not hardcoded lore in Python.
3. **`server/ghostexec_environment.py`** – `build_briefing_text`, `_apply_action`, post-step dynamics, schema drift hooks.
4. **`server/reward.py`** – fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
5. **`graders.py`** – bounded grader outputs, trajectory consumption.
6. **Live Space** – `/docs` or `POST /reset` + `POST /step`: legal steps change state; illegal steps return errors, not stack traces.

For a **standalone** walkthrough of the innovation angle only, see **[environment-innovation/README.md](environment-innovation/README.md)**.

---
## Criterion: Storytelling & Presentation (30%)

<a id="ghostexec-storytelling"></a>

### The problem (plain language)

An executive's day is **messy**: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice **ripples** – someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a **small simulator** the model must **run**, not a single paragraph to summarize.

### The environment (one sentence)

**You read a realistic staff briefing; you pick one legal "move" (reply, reschedule, delegate, ...); the world updates; you get a score that reflects tension across work, people, and tasks.**

### Demo arc

1. **Show the briefing** – the same text the model sees.
2. **Show one good step vs one bad step** – e.g. thoughtful reply vs invalid id or `do_nothing` while critical mail waits (mood / reward visibly differ).
3. **Name the three "channels"** – calmer calendar, happier stakeholders, tasks moving forward – without math jargon.
4. **End on "what improved"** – after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).
## Criterion: Showing Improvement in Rewards (20%)

<a id="ghostexec-reward-improvement"></a>

| Artifact | Role |
|----------|------|
| `outputs/logs/episode_rewards.jsonl` | Per-step reward trace (gitignored); use for **reward curves** and component debugging. |
| `outputs/trainer_state.json` / training logs | Produced by training scripts when configured; feed into plotting. |
| `outputs/reward_log.csv` | Optional CSV companion for plotting pipelines. |
| `outputs/compliance_manifest.json` | Baseline / compliance metadata for **comparison** charts. |
| `outputs/plots/*.png` | Generated report figures (see command below). |

```bash
uv run python scripts/plot_training_report.py \
  --trainer-history outputs/trainer_state.json \
  --reward-csv outputs/reward_log.csv \
  --baselines-json outputs/compliance_manifest.json \
  --out-dir outputs/plots
```
## Criterion: Reward & Training Pipeline (10%)

<a id="ghostexec-reward-pipeline"></a>

**Weight:** 10%

**What it means:**

- Is the reward logic coherent?
- Does the pipeline produce meaningful improvement in the trained agent's behavior?

### Reward logic (coherent and inspectable)

Phase-4 scoring in `server/reward.py` uses a **fixed** core blend:

\[
\text{weighted\_base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
\]

Then structured adjustments – invalid-action penalties, do-nothing pressure, completion terms – are applied on top of the weighted base, with transparent breakdown fields.

The design rationale is aligned with dense reward-shaping practice (see [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)): fixed channel weights, bounded magnitudes, and sparse end-of-episode rewards avoided for training.

### Training pipeline (entrypoints)

| Step | Command |
|------|---------------------|
| Install | `uv sync` (from repo root) |
| Server (matches Dockerfile) | `uv run server --port 8000` |
| SFT → GRPO script | `uv run python scripts/train_sft_then_grpo.py` (see [Running and testing locally](#running-and-testing-locally) for a full example invocation) |
| Tests | `uv run pytest tests/ -q` |
| Docker build gate | `GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q` |

The pipeline is **meaningful** when tied to the **20% evidence** above: same env URL, logged rewards, and plots that move in the right direction over training – not when loss alone decreases.

---
## OpenEnv Hackathon themes & checklist

| Item | Status |
|------|--------|
| OpenEnv-based env + `openenv.yaml` | In-repo (`openenv-core[core]>=0.2.3`). |
| Short write-up or <2 min video | **You:** publish and paste URLs in [Deliverables](#deliverables-fill-before-freeze). |
| Public HF Space | [Deliverables](#deliverables-fill-before-freeze); deploy with `openenv push --repo-id <your>/ghostexec`. |

---
## Quick start (Python client)

From the repo root (where `pyproject.toml` lives):

```bash
uv sync
uv run server --port 8000
```

```python
from ghostexec import GhostexecAction, GhostexecEnv

with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
    out = env.reset()
    print(out.observation.echoed_message[:400], "...")

    step = env.step(
        GhostexecAction(
            action_type="reply_email",
            email_id="e01",
            message_body=(
                "Marcus – acknowledged. Revised figures and short rationale "
                "before noon. – Exec"
            ),
        )
    )
    print("reward:", step.reward)
    print("metadata keys:", sorted((step.observation.metadata or {}).keys()))
```
```bash
docker build -t ghostexec-env:latest .
```

---
## Actions and fields

`GhostexecAction` (`models.py`):

| `action_type` | Typical fields |
|---------------|----------------|
| `reply_email` | `email_id`, `message_body` |
| `archive_email` | `email_id` |
| `reschedule_meeting` | `meeting_id`, `new_time`, `reason` |
| `cancel_meeting` | `meeting_id`, `reason` |
| `complete_task` | `task_id` |
| `delegate_task` | `task_id`, `contact_name` |
| `send_message` | `contact_name`, `message` |
| `do_nothing` | – (penalised path) |

Malformed HTTP payloads are handled safely so clients do not crash the server.

---
## Observation

- **`echoed_message`** – Full plain-text briefing.
- **`message_length`** – Length of briefing.
- **`reward`**, **`done`**, **`metadata`** – Step outcome; metadata includes `step_ok`, reward breakdown fields, and debug ids.

---

## Reward (formula summary)

Full detail is under [Criterion: Reward & Training Pipeline (10%)](#criterion-reward--training-pipeline-10). Episode logs: `outputs/logs/episode_rewards.jsonl` (gitignored).

---
## HTTP vs WebSocket (episode state)

- **HTTP** `POST /reset` and `POST /step` may use **short-lived** instances; consecutive HTTP calls might not share one in-memory episode.
- **WebSocket `/ws`** (or `GhostexecEnv`) – use for **multi-step episodes** on one session.

Endpoints: **`/web`**, **`/docs`**, **`/health`**, **`/ws`**.

---
## Running and testing locally

```bash
uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
# or
uv run server --port 8000
```

**HTTP smoke:**

```bash
uv run python scripts/http_endpoint_smoke.py --local
```

**Tests:**

```bash
uv run pytest tests/ -q
GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
uv run pytest tests/test_live_server_exhaustive.py -v --tb=short  # server on :8000
```

**SFT → GRPO (example):**

```bash
uv run python scripts/train_sft_then_grpo.py \
  ...
  --curriculum-ramp-ratio 0.60
```
## Hugging Face Spaces

```bash
openenv serve
openenv build
openenv validate --verbose
openenv push
# openenv push --repo-id your-username/ghostexec
```
## Scenarios

| Scenario | Purpose |
|----------|---------|
| `scenarios/schema_drift_test.json` | Drift-event harness |
## Project layout

```
ghostexec/
├── openenv.yaml
├── pyproject.toml
...
├── scripts/
├── notebooks/
├── tests/
├── server/
│   ├── app.py
│   └── ghostexec_environment.py
└── Dockerfile
```
## Resources & references

- [OpenEnv Hub](https://huggingface.co/openenv)
- [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA) (and related talks linked in prior README iterations)

---

## License

BSD-style
Added (new README):

# Ghostexec: The AI Chief-of-Staff Environment
Ghostexec is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment where an LLM acts as an executive chief of staff under pressure: triaging inbox crises, resolving calendar conflicts, protecting stakeholder relationships, and finishing critical tasks.

The agent gets a dense plain-text briefing, takes one structured action, and is scored on three coupled dimensions: conflict reduction, relationship quality, and task progress.

## Submission Package
| Item | Link |
|------|------|
| Public HF Space (required) | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
| OpenEnv manifest | [`openenv.yaml`](openenv.yaml) |
| Training notebook (Colab-ready) | [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) |
| Minimal training script (Unsloth + TRL) | [`scripts/train_sft_then_grpo.py`](scripts/train_sft_then_grpo.py) |
| Mini-blog (required) | `ADD_HF_BLOG_URL_HERE` |
| Demo video <2 minutes (required) | [**YouTube – Ghostexec demo**](https://youtu.be/g4IFZMEzfO8) |
## Why This Environment Is Competitive

- **Novel task composition**: combines language-heavy triage, social reasoning, scheduling constraints, and deadline management in a single trainable loop.
- **Non-trivial behavior**: valid JSON is necessary but not sufficient; the policy must choose useful actions on the right entity ids at the right time.
- **Dynamic world model**: mood shifts, conflict rebuilds, overdue penalties, and scenario drift events force adaptation over a trajectory.
- **Trainable reward signal**: dense step reward for learning plus bounded graders for evaluation.
- **Hackathon fit**: fully OpenEnv-packaged, hostable on HF Spaces, with reproducible training and visible before/after evidence.
## Judging-Criteria Mapping

### 1) Environment Innovation (40%)

- The observation is a realistic text briefing, not a toy tabular state dump.
- Actions are schema-bound (`GhostexecAction`) and validated against live world ids.
- The world evolves after each step (conflict graph, stress, mood, time shifts).
- Drift events in scenario data test robustness to changing conditions.

**Task ladder**

| Task ID | Difficulty | Scenario |
|---------|------------|----------|
| `phase2_core` | easy | `scenarios/phase2_core.json` |
| `monday_morning` | medium | `scenarios/monday_morning.json` |
| `dinner_disaster` | hard | `scenarios/dinner_disaster.json` |
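The drift events mentioned above can be pictured as small timed patches a scenario applies mid-episode. The event layout below (`after_step`, `shift_meeting`, the field names) is a hypothetical sketch of the idea, not the actual `scenarios/*.json` schema:

```python
# Hypothetical sketch of a scenario-driven drift event: a timed patch that
# rewrites part of the world mid-episode. Field names are assumptions; only
# the concept (drift events in scenario data) is from this README.
world = {"meetings": {"m01": {"start": "10:00"}}, "step": 0}

drift_events = [
    {"after_step": 2, "kind": "shift_meeting",
     "meeting_id": "m01", "new_start": "11:30"},
]

def advance(world: dict, events: list[dict]) -> dict:
    """Advance one step and apply any drift events that just became due."""
    world["step"] += 1
    for ev in events:
        if ev["after_step"] == world["step"] and ev["kind"] == "shift_meeting":
            world["meetings"][ev["meeting_id"]]["start"] = ev["new_start"]
    return world

for _ in range(3):
    advance(world, drift_events)
print(world["meetings"]["m01"]["start"])  # 11:30 (the drift fired at step 2)
```

Because the patch fires partway through the trajectory, a policy that memorized the opening briefing will keep acting on a stale calendar, which is exactly the robustness the drift events are meant to test.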
### 2) Storytelling and Presentation (30%)

Ghostexec tells a familiar high-stakes story: too many urgent asks, not enough time, and every action has social and operational consequences.

The demo is easy to follow:

1. show the same briefing the model sees,
2. compare a weak vs a better action choice,
3. show reward movement and policy behavior improvements.
### 3) Showing Improvement in Rewards (20%)

The repo includes persisted training artifacts and plot outputs:

- `output/reward_curve.png`
- `output/loss_curve.png`
- `output/baseline_comparison.png`

**Training evidence plots**

![Reward curve](output/reward_curve.png)
*Reward trend across training progression.*

![Loss curve](output/loss_curve.png)
*SFT/GRPO training loss over optimization steps.*

![Baseline comparison](output/baseline_comparison.png)
*Random vs frozen vs trained policy mean episode reward.*

**Current before/after metrics (from saved artifacts)**

| Metric | Baseline | Trained |
|--------|----------|---------|
| Mean step reward | `0.145` | `0.257` |
| Invalid action rate | `Not logged in saved artifacts` | `Not logged in saved artifacts` |
| Grader score | `Not logged in saved artifacts` | `Not logged in saved artifacts` |
### 4) Reward and Training Pipeline (10%)

Ghostexec uses a coherent weighted reward core plus bounded shaping:

\[
\text{weighted\_base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
\]

It then applies structured adjustments (invalid-action penalties, do-nothing pressure, completion/catastrophic terms) with transparent breakdown fields.

Training is end-to-end and environment-connected (not static-only): SFT warm start, then GRPO with the environment reward plus local shaping functions.
## Quick Start

```bash
uv sync
uv run server --port 8000
```

Python client example:

```python
from ghostexec import GhostexecAction, GhostexecEnv

with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
    out = env.reset()
    print(out.observation.echoed_message[:400], "...")

    step = env.step(
        GhostexecAction(
            action_type="reply_email",
            email_id="e01",
            message_body="Acknowledged. Sending concise revised update before noon.",
        )
    )
    print("reward:", step.reward)
```
## Reproducible Training Commands

```bash
uv run python scripts/train_sft_then_grpo.py \
  ...
  --curriculum-ramp-ratio 0.60
```
Generate post-train plots:

```bash
uv run python scripts/plot_training_report.py \
  --trainer-history outputs/trainer_state.json \
  --reward-csv outputs/reward_log.csv \
  --baselines-json outputs/compliance_manifest.json \
  --out-dir output
```
## OpenEnv and Space Deployment

```bash
openenv serve
openenv build
openenv validate --verbose
openenv push
```
If you need to set the target repo explicitly:

```bash
openenv push --repo-id your-username/ghostexec
```
## Environment API and Contract

- Core endpoints: `/reset`, `/step`, `/state`, `/schema`, `/health`, `/docs`, `/ws`
- Observation contains:
  - `echoed_message` (plain-text briefing),
  - optional metadata (step validity, reward breakdown, ids).
- Action schema: see `GhostexecAction` in [`models.py`](models.py).

Supported `action_type` values:

- `reply_email`
- `archive_email`
- `reschedule_meeting`
- `cancel_meeting`
- `complete_task`
- `delegate_task`
- `send_message`
- `do_nothing`
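For clients that talk to the server directly over HTTP instead of through the Python wrapper, the request bodies can be sketched like this. The `action` wrapper key and the helper names are assumptions for illustration; the authoritative wire format is whatever the server's `/schema` and `/docs` endpoints report:

```python
import json

def reset_body() -> dict:
    # POST /reset: typically an empty JSON body (a seed could go here if supported).
    return {}

def step_body(action_type: str, **fields) -> dict:
    # POST /step: GhostexecAction fields, wrapped under an "action" key (assumed).
    return {"action": {"action_type": action_type, **fields}}

payload = step_body(
    "reply_email",
    email_id="e01",
    message_body="Acknowledged. Sending the revised update before noon.",
)
print(json.dumps(payload, indent=2))
```

The same `step_body` shape works for any of the `action_type` values listed above; only the accompanying fields change.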
## Submission Readiness Checklist

- [x] OpenEnv latest-compatible environment with valid `openenv.yaml`
- [x] Public HF Space deployed and reachable
- [x] Minimal trainable script using Unsloth + TRL
- [x] Colab-ready notebook for reruns
- [x] Training evidence plots embedded in README
- [ ] Add HF blog link
- [x] Add <2 minute YouTube demo link → [youtu.be/g4IFZMEzfO8](https://youtu.be/g4IFZMEzfO8)
## Repository Structure

```text
ghostexec/
├── openenv.yaml
├── pyproject.toml
├── scripts/
├── notebooks/
├── tests/
├── output/
└── server/
    ├── app.py
    ├── ghostexec_environment.py
    └── reward.py
```

## Additional References

- [OpenEnv (Meta PyTorch)](https://github.com/meta-pytorch/OpenEnv)
- [OpenEnv Packaging and Deploying Docs](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)
- [OpenEnv Hub](https://huggingface.co/openenv)
- [Environment Innovation Deep-Dive](environment-innovation/README.md)

## License

BSD-style license, as included in this repository and in the upstream OpenEnv lineage notices.

---

server/app.py CHANGED (+2 -0)

```diff
@@ -68,8 +68,10 @@ def _ghostexec_load_environment_metadata(env, env_name=None):  # type: ignore[no
     space = "modelbuilderhq/ghostexec"
     readme_url = f"https://huggingface.co/spaces/{space}/blob/main/README.md"
     space_url = f"https://huggingface.co/spaces/{space}"
+    demo_video = "https://youtu.be/g4IFZMEzfO8"
     meta.readme_content = (
         "### README\n\n"
+        f"**Demo (<2 min):** [**YouTube**]({demo_video})\n\n"
         f"Formatted documentation (Space card + full markdown): "
         f"[**README.md on Hugging Face**]({readme_url})\n\n"
         f"Space: [**{space}**]({space_url})"
```