modelbuilderhq committed on
Commit 60fe7cd · verified · 1 Parent(s): 98de6fc

Upload folder using huggingface_hub

Files changed (3)
  1. BLOG.md +105 -0
  2. README.md +115 -295
  3. server/app.py +2 -0
BLOG.md ADDED
@@ -0,0 +1,105 @@
# Ghostexec: A tiny chief-of-staff simulator for agents

**Demo (<2 min):** [Ghostexec on YouTube](https://youtu.be/g4IFZMEzfO8)

---

## The messy day, in one paragraph

Picture a morning where everything arrives at once: a board member's email, a double-booked calendar, a message from home about dinner, a report due at noon, and a teammate who is already annoyed. You cannot "solve" that with a single summary. You **sequence** decisions - who gets a reply, what gets rescheduled, what gets delegated - and each move changes how stressed you are, how people feel, and whether real work moves forward.

**Ghostexec** is a small, trainable environment that captures that feeling. It is built on **[OpenEnv](https://github.com/meta-pytorch/OpenEnv)** so agents, researchers, and judges can talk to it through a **standard HTTP and WebSocket API**, run it on a **public Hugging Face Space**, and plug it into a **real RL or preference-optimization loop**.

---

## What Ghostexec actually is

Ghostexec is an **OpenEnv-compatible "AI chief of staff" simulator**:

- **Inbox, calendar, contacts, tasks**, and **stakeholder moods** live in **JSON scenarios** under `scenarios/` - the story is data, not hardcoded prose in Python.
- Each step, the policy sees a **plain-text briefing** (the same kind of wall-of-text a human assistant might scan), not a raw dump of the whole world object.
- The agent returns **one structured action per step** - for example `reply_email`, `reschedule_meeting`, `complete_task`, or `delegate_task` - with fields validated against a schema.
- **Invalid actions do not crash the server.** They return a controlled signal so learning (or evaluation) can continue.
- **`do_nothing` is penalised** so "freeze and hope it goes away" is not a free winning strategy when fires are burning.

Under the hood, the simulation advances **time**, **moods**, and **conflicts**, and optional **drift events** in the JSON can reshuffle the situation mid-episode, so the agent is tested on **adaptation**, not memorizing the first screen.
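The one-structured-action contract above can be sketched as a tiny validator. This is an illustrative sketch, not Ghostexec's actual schema code: the required-field mapping is an assumption based on the action list in this post, and it covers only a few of the action types.

```python
# Sketch of per-step action validation in the spirit described above.
# REQUIRED_FIELDS is an assumed, partial mapping for illustration only.
REQUIRED_FIELDS = {
    "reply_email": {"email_id", "message_body"},
    "reschedule_meeting": {"meeting_id", "new_time"},
    "complete_task": {"task_id"},
    "delegate_task": {"task_id", "contact_name"},
    "do_nothing": set(),
}

def validate_action(action: dict) -> tuple[bool, str]:
    """Return (ok, reason). Invalid actions yield a signal, never a crash."""
    kind = action.get("action_type")
    if kind not in REQUIRED_FIELDS:
        return False, f"unknown action_type: {kind!r}"
    missing = REQUIRED_FIELDS[kind] - action.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"

print(validate_action({"action_type": "reply_email", "email_id": "e01",
                       "message_body": "On it."}))  # (True, 'ok')
print(validate_action({"action_type": "teleport"}))  # controlled error signal
```

The point of the pattern is the return value: a rejected action becomes feedback for the policy rather than a server exception.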

---

## Why OpenEnv?

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) gives us a **shared contract**: reset, step, schema, health, WebSocket sessions, and tooling to **validate** and **ship** environments. Our manifest is in **`openenv.yaml`** (environment name `ghostexec`, three tasks with graders, FastAPI app entrypoint). That keeps the submission **inspectable** and **reproducible** - judges can open the Space, read the repo, and run tests locally with the same entrypoints we use in Docker.

---

## Rewards: teach the model, certify the run

We use **two layers** of feedback, on purpose:

1. **Dense step reward** (in `server/reward.py`) blends **conflict**, **relationship**, and **task** progress with fixed weights **0.35 / 0.35 / 0.30**, plus bounded shaping and explicit handling of invalid or idle steps. That signal is what you want when **training** with modern RL or GRPO-style methods.
2. **Trajectory graders** (in `graders.py`, wired in `openenv.yaml`) produce **bounded** scores for three **tasks** - easy, medium, and hard scenarios - so hackathon **certification** stays in a well-defined range.

Together: the model can **learn** from rich per-step feedback, while organizers can **score** full trajectories against clear tasks.
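The two layers can be sketched like this. The 0.35 / 0.35 / 0.30 blend and the bounded grader range come from this project's docs; the penalty constants and the mean-based trajectory mapping below are assumptions for the sketch, not values from `reward.py` or `graders.py`.

```python
# Illustrative two-layer scoring; constants marked "assumed" are not
# the project's real values.

def step_reward(conflict: float, relationship: float, task: float,
                valid: bool = True, idle: bool = False) -> float:
    base = 0.35 * conflict + 0.35 * relationship + 0.30 * task
    if not valid:
        base -= 0.5   # assumed invalid-action penalty
    if idle:
        base -= 0.2   # assumed do_nothing pressure
    return max(-1.0, min(1.0, base))  # bounded shaping

def grader_score(step_rewards: list[float]) -> float:
    """Map a whole trajectory into a bounded certification range."""
    mean = sum(step_rewards) / max(len(step_rewards), 1)
    scaled = 0.5 + 0.5 * mean          # assumes step rewards in [-1, 1]
    return min(0.99, max(0.01, scaled))

print(round(step_reward(0.4, 0.6, 0.5), 3))   # 0.5
print(grader_score([0.5, 0.2, -0.1]) > 0.5)   # True
```

The design intent is the split itself: the dense function shapes learning per step, while the grader compresses an episode into one bounded number for scoring.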

---

## Try it in sixty seconds

**Short demo video:** [https://youtu.be/g4IFZMEzfO8](https://youtu.be/g4IFZMEzfO8)

**Live Space (public):** [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec)

- Open **`/docs`** on the Space for the interactive API, or **`/web`** for the OpenEnv playground.
- Full README (formatted, with tables and deep links):
  [https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md)

**Local quick start** (from repo root):

```bash
uv sync
uv run server --port 8000
```

Then use **`GhostexecEnv`** from `client.py` (WebSocket session) for **multi-step episodes**, or raw HTTP if you only need a smoke test. The README's "Quick start" section has a copy-paste Python snippet.
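For the raw-HTTP smoke-test route, the request body can be built with nothing but the standard library; the flat payload shape below is an assumption for this sketch - check `/docs` on the running server for the authoritative schema.

```python
import json

# Build the JSON body for a one-step smoke test. The flat
# {"action_type": ..., <fields>} shape is assumed for illustration.
def step_payload(action_type: str, **fields) -> dict:
    return {"action_type": action_type, **fields}

payload = step_payload(
    "reply_email",
    email_id="e01",
    message_body="Acknowledged - details before noon.",
)
print(json.dumps(payload))

# Against a live local server (uv run server --port 8000), roughly:
#   import requests
#   requests.post("http://127.0.0.1:8000/reset").raise_for_status()
#   print(requests.post("http://127.0.0.1:8000/step", json=payload).json())
```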

**Source:** mirror on GitHub as you prefer; the canonical hackathon artifact is the **Space + repo** layout described in the README.

---

## If you only have two minutes on video

Published walkthrough: [**youtu.be/g4IFZMEzfO8**](https://youtu.be/g4IFZMEzfO8)

A tight arc that works for non-technical viewers:

1. **Show the briefing** - scroll through the same text the model sees. Say: "This is not a quiz; it's a shift at work."
2. **One good action** - e.g. a thoughtful `reply_email` with a real `email_id` from the text; show reward or mood metadata if the UI exposes it.
3. **One bad action** - wrong id or `do_nothing` while something urgent waits; show that the world **does not crash**, but the score **hurts**.
4. **One sentence on training** - "We can optimize policies against this API with GRPO / TRL-style loops; graders score whole episodes for the hackathon."

End on: **Ghostexec is the busy day, compressed - so models can practice being calm, fast, and fair before anyone trusts them near a real calendar.**

---

## Theme fit (hackathon)

Ghostexec aligns naturally with **personalized, high-stakes tasks**: executive triage, delegation, and tradeoffs between **people**, **calendar**, and **deadlines**. Diverse **`scenarios/*.json`** and optional curriculum / perturbation hooks (see README) make it easy to **stress-test** policies without rewriting core engine code.
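Because the world is data, a scenario can be probed with nothing but the standard library. The field names in this toy scenario are assumptions made for illustration (only the `after_step` drift hook is named in the project docs); see `scenarios/*.json` for the real schema.

```python
import json

# Toy scenario in the spirit of scenarios/*.json; real field names may differ.
raw = """
{
  "emails": [{"id": "e01", "from": "board", "urgent": true}],
  "meetings": [{"id": "m01", "overlaps": ["m02"]}],
  "tasks": [{"id": "t01", "due": "12:00"}],
  "after_step": [{"step": 3, "event": "shift_meeting", "meeting_id": "m01"}]
}
"""
scenario = json.loads(raw)
urgent = [e["id"] for e in scenario["emails"] if e.get("urgent")]
print("urgent emails:", urgent)                      # ['e01']
print("drift events:", len(scenario["after_step"]))  # 1
```

Swapping in a different JSON file changes the episode without touching engine code, which is exactly the stress-testing hook the paragraph above describes.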

---

## Where to read more

| Resource | Link |
|----------|------|
| Demo video (<2 min) | [YouTube](https://youtu.be/g4IFZMEzfO8) |
| Full project README (judging sections, layout, commands) | [README on the Hub](https://huggingface.co/spaces/modelbuilderhq/ghostexec/blob/main/README.md) |
| Innovation-only deep dive | [`environment-innovation/README.md`](environment-innovation/README.md) in the repo |
| OpenEnv upstream | [github.com/meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) |

---

## Closing

Ghostexec is not trying to replace human assistants. It is trying to give **models and researchers** a **credible, stressful, and kind** miniature office: text that reads like work, actions that look like tools, and scores that admit **tradeoffs**. If that sounds useful, spin up the Space, break something on purpose, and watch the environment **keep running** - that resilience is part of the point.

*- Ghostexec / OpenEnv submission*
README.md CHANGED
@@ -11,326 +11,123 @@ tags:
  - openenv
  ---

- # Ghostexec
-
- **Ghostexec** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment: a busy **executive chief-of-staff** simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a **plain-text briefing**, then emit **one structured action per step** (`reply_email`, `reschedule_meeting`, …). The server returns rewards shaped around **conflict**, **relationships**, and **tasks** - plus trajectory **graders** for hackathon validation. All episode **content** lives in `scenarios/*.json`; the engine is in `server/ghostexec_environment.py` and `server/reward.py`.
-
- | Item | Value |
- |------|-------|
- | **HF Space name / manifest** | `ghostexec` in [`openenv.yaml`](openenv.yaml) |
- | **Python package** | `openenv-ghostexec` in [`pyproject.toml`](pyproject.toml) (import `ghostexec`) |
- | **Public Space** | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
- | **Deeper innovation-only brief** | [`environment-innovation/README.md`](environment-innovation/README.md) |
-
- ---
-
- ## Deliverables (fill before freeze)
-
- | Deliverable | URL |
- |-------------|-----|
- | Public HF Space (required) | [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
- | Write-up / blog (HF post preferred) | `TODO: paste your post URL` |
- | Short demo video (<2 min) | `TODO: paste your video URL` |
-
- ---
-
- ## Contents
-
- **Judging criteria (this README is organized around them)**
-
- 1. [Criterion: Environment Innovation (40%)](#ghostexec-env-innovation)
- 2. [Criterion: Storytelling & Presentation (30%)](#ghostexec-storytelling)
- 3. [Criterion: Showing Improvement in Rewards (20%)](#ghostexec-reward-improvement)
- 4. [Criterion: Reward & Training Pipeline (10%)](#ghostexec-reward-pipeline)
-
- **Reference**
-
- 5. [Hackathon themes & checklist](#openenv-hackathon-themes--checklist)
- 6. [Quick start](#quick-start-python-client)
- 7. [Actions](#actions-and-fields)
- 8. [Observation](#observation)
- 9. [Reward (formula summary)](#reward-formula-summary)
- 10. [HTTP vs WebSocket](#http-vs-websocket-episode-state)
- 11. [Running and testing locally](#running-and-testing-locally)
- 12. [Hugging Face Spaces](#hugging-face-spaces)
- 13. [Scenarios](#scenarios)
- 14. [Project layout](#project-layout)
- 15. [Resources & references](#resources--references)
- 16. [License](#license)
-
- ---
-
- ## Criterion: Environment Innovation (40%)
-
- <a id="ghostexec-env-innovation"></a>
-
- **Weight:** 40%
-
- **What it means:**
-
- - Is the environment novel, creative, or genuinely challenging?
- - Does it meaningfully test agent behavior in a way that hasn't been done before?
-
- ### How Ghostexec answers this
-
- **Challenging world.** The policy sees **one dense natural-language briefing** per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining) - not a JSON dump of the world. It must **ground** decisions in real ids from that text, return **valid typed actions**, and accept **time pressure** and **social fallout** when meetings move or mail goes unanswered. Invalid actions **do not crash** the server; they return structured errors so learning signals stay intact.
-
- **Meaningful behavior, not a toy Q&A.** Success needs **comprehension + tool discipline**: legal JSON schema, multi-step **sequences** (WebSocket sessions for real episodes), and **tradeoffs** across channels (mail vs calendar vs tasks vs relationships). **`do_nothing` is penalised** so "safe" idleness is costly when fires are burning.
-
- **Dynamics, not a static paragraph.** After each valid action, the simulation **advances the clock**, updates **moods**, rebuilds **conflicts**, and can apply **scenario-driven drift** (`after_step` events in JSON): shifted meetings, new deadlines, preference changes - so the agent is tested on **adaptation**, not memorizing the first screen.
-
- **Dual evaluation.** **Dense step rewards** in `server/reward.py` teach fine structure; **trajectory graders** in `graders.py` return scores strictly in **`(0.01, 0.99)`** per OpenEnv task wiring in `openenv.yaml`. Agents learn from the dense signal; judges get bounded certification scores.
-
- **Honest novelty claim.** Inboxes and calendars are familiar **ingredients**. What is less common is the **composition**: OpenEnv-native packaging, **plain-text-only** observations, **data-defined** scenarios, live dynamics + drift, a dual reward/grader stack, and a **transactional** action API in one trainable, hostable environment.
-
- ### Task ladder (difficulty in data)
-
- | Task id | Difficulty | Scenario | What gets harder |
- |---------|------------|----------|------------------|
- | `phase2_core` | easy | `scenarios/phase2_core.json` | Dense triage: VIP mail, calendar relief, overlapping work. |
- | `monday_morning` | medium | `scenarios/monday_morning.json` | Stacked Monday rush, less slack. |
- | `dinner_disaster` | hard | `scenarios/dinner_disaster.json` | Personal vs professional collision, escalation risk. |
-
- ### 5-minute verification checklist
-
- 1. **`openenv.yaml`** - three tasks, `max_steps`, `app: server.app:app`, `name: ghostexec`, grader paths.
- 2. **`scenarios/*.json`** - world content is **data**, not hardcoded lore in Python.
- 3. **`server/ghostexec_environment.py`** - `build_briefing_text`, `_apply_action`, post-step dynamics, schema drift hooks.
- 4. **`server/reward.py`** - fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
- 5. **`graders.py`** - bounded grader outputs, trajectory consumption.
- 6. **Live Space** - `/docs` or `POST /reset` + `POST /step`: legal steps change state; illegal steps return errors, not stack traces.
-
- For a **standalone** walkthrough of the innovation angle only, see **[environment-innovation/README.md](environment-innovation/README.md)**.
-
- ---
-
- ## Criterion: Storytelling & Presentation (30%)
-
- <a id="ghostexec-storytelling"></a>
-
- **Weight:** 30%
-
- **What it means:**
-
- - Can you clearly explain the problem, the environment, and what the agent learned?
- - Is the demo engaging and easy to follow for a non-technical audience?
-
- ### The problem (plain language)
-
- An executive's day is **messy**: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice **ripples** - someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a **small simulator** the model must **run**, not a single paragraph to summarize.
-
- ### The environment (one sentence)
-
- **You read a realistic staff briefing; you pick one legal "move" (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.**
-
- ### What the agent is supposed to learn
-
- - **Read carefully** - wrong `email_id` / `meeting_id` / `task_id` fails cleanly with feedback.
- - **Act under pressure** - clock, `max_steps`, and stress push toward decisions, not endless analysis.
- - **Balance competing goals** - improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff.
- - **Recover from change** - drift events mean the "right" plan from step 1 may not stay right at step 8.
-
- ### Demo tips for a non-technical audience
-
- 1. **Show the briefing first** - let viewers see the same wall of text the model sees (relatable chaos).
- 2. **Show one good step vs one bad step** - e.g. thoughtful reply vs invalid id or `do_nothing` while critical mail waits (mood / reward visibly differ).
- 3. **Name the three "channels"** - calmer calendar, happier stakeholders, tasks moving forward - without math jargon.
- 4. **End on "what improved"** - after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).
-
- ### Hackathon alignment (themes)
-
- **Theme fit (examples):** Ghostexec fits **Theme 3.2 - Personalized tasks** (executive-style inbox, calendar, delegation). **Theme 4** is partially supported via `GHOSTEXEC_CURRICULUM`, `GHOSTEXEC_PERTURB`, and diverse `scenarios/`.
-
- ---
-
- ## Criterion: Showing Improvement in Rewards (20%)
-
- <a id="ghostexec-reward-improvement"></a>
-
- **Weight:** 20%
-
- **What it means:**
-
- - Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline - anything that proves the agent learned something.
-
- ### Where evidence lives in this repo
-
- | Artifact | Role |
- |----------|------|
- | `outputs/logs/episode_rewards.jsonl` | Per-step reward trace (gitignored); use for **reward curves** and component debugging. |
- | `outputs/trainer_state.json` / training logs | Produced by training scripts when configured; feed into plotting. |
- | `outputs/reward_log.csv` | Optional CSV companion for plotting pipelines. |
- | `outputs/compliance_manifest.json` | Baseline / compliance metadata for **comparison** charts. |
- | `outputs/plots/*.png` | Generated report figures (see command below). |
-
- **Plot pack (loss + reward + components + baseline bar):**
-
- ```bash
- uv run python scripts/plot_training_report.py \
-   --trainer-history outputs/trainer_state.json \
-   --reward-csv outputs/reward_log.csv \
-   --baselines-json outputs/compliance_manifest.json \
-   --out-dir outputs/plots
- ```
-
- Writes `loss_curve.png`, `reward_curve.png`, `components_curve.png`, `baseline_comparison.png` under `outputs/plots/`.
-
- **End-to-end notebook:** [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) is intended to **Run All** without manual steps (per project convention).
-
- **Before / after narrative for judges:** same `task_id` and seed - show **lower invalid rate**, **higher mean step reward**, or **clearer grader trajectory** after finetuning. Pair numbers with **one short clip** of two runs side by side on the Space or local server.
-
- ---
-
- ## Criterion: Reward & Training Pipeline (10%)
-
- <a id="ghostexec-reward-pipeline"></a>
-
- **Weight:** 10%
-
- **What it means:**
-
- - Is the reward logic coherent?
- - Does the pipeline produce meaningful improvement in the trained agent's behavior?
-
- ### Reward logic (coherent and inspectable)
-
- Phase-4 scoring in `server/reward.py` uses a **fixed** core blend:

  \[
- \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
  \]

- Then bounded shaping, invalid-step handling, and explicit penalties (including **`do_nothing`**). Components surface on `RewardBreakdown` and in observation **metadata** where configured - so "why did this step score X?" is **auditable**, not a black box.
-
- Design rationale is aligned with dense reward-shaping practice (see [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)) - fixed channel weights, bounded magnitudes, sparse end-of-episode signals avoided for training.
-
- ### Training pipeline (entrypoints)
-
- | Step | Command / artifact |
- |------|--------------------|
- | Install | `uv sync` (from repo root) |
- | Server (matches Dockerfile) | `uv run server --port 8000` |
- | SFT → GRPO script | `uv run python scripts/train_sft_then_grpo.py` (see [Running and testing locally](#running-and-testing-locally) for a full example invocation) |
- | Tests | `uv run pytest tests/ -q` |
- | Docker build gate | `GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q` |
-
- The pipeline is **meaningful** when tied to the **20% evidence** above: same env URL, logged rewards, and plots that move in the right direction over training - not when loss alone decreases.
-
- ---
-
- ## OpenEnv Hackathon themes & checklist
-
- | Item | Status |
- |------|--------|
- | OpenEnv-based env + `openenv.yaml` | In-repo (`openenv-core[core]>=0.2.3`). |
- | Short write-up or <2 min video | **You:** publish and paste URLs in [Deliverables](#deliverables-fill-before-freeze). |
- | Public HF Space | [Deliverables](#deliverables-fill-before-freeze); deploy with `openenv push --repo-id <your>/ghostexec`. |
-
- ---
-
- ## Quick start (Python client)
-
- From the repo root (where `pyproject.toml` lives):

  ```bash
  uv sync
  uv run server --port 8000
  ```

  ```python
  from ghostexec import GhostexecAction, GhostexecEnv

  with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
      out = env.reset()
-     print(out.observation.echoed_message[:500], "…")

      step = env.step(
          GhostexecAction(
              action_type="reply_email",
              email_id="e01",
-             message_body=(
-                 "Marcus - acknowledged. Revised figures and short rationale "
-                 "before noon. - Exec"
-             ),
          )
      )
      print("reward:", step.reward)
-     print("metadata keys:", sorted((step.observation.metadata or {}).keys()))
  ```

- **Docker (optional):**
-
- ```bash
- docker build -t ghostexec-env:latest .
- ```
-
- ---
-
- ## Actions and fields
-
- `GhostexecAction` (`models.py`):
-
- | `action_type` | Typical fields |
- |---------------|----------------|
- | `reply_email` | `email_id`, `message_body` |
- | `archive_email` | `email_id` |
- | `reschedule_meeting` | `meeting_id`, `new_time`, `reason` |
- | `cancel_meeting` | `meeting_id`, `reason` |
- | `complete_task` | `task_id` |
- | `delegate_task` | `task_id`, `contact_name` |
- | `send_message` | `contact_name`, `message` |
- | `do_nothing` | - (penalised path) |
-
- Malformed HTTP payloads are handled safely so clients do not crash the server.
-
- ---
-
- ## Observation
-
- - **`echoed_message`** - Full plain-text briefing.
- - **`message_length`** - Length of briefing.
- - **`reward`**, **`done`**, **`metadata`** - Step outcome; metadata includes `step_ok`, reward breakdown fields, and debug ids.
-
- ---
-
- ## Reward (formula summary)
-
- Full detail is under [Criterion: Reward & Training Pipeline (10%)](#criterion-reward--training-pipeline-10). Episode logs: `outputs/logs/episode_rewards.jsonl` (gitignored).
-
- ---
-
- ## HTTP vs WebSocket (episode state)
-
- - **HTTP** `POST /reset` and `POST /step` may use **short-lived** instances; consecutive HTTP calls might not share one in-memory episode.
- - **WebSocket `/ws`** (or `GhostexecEnv`) - use for **multi-step episodes** on one session.
-
- Endpoints: **`/web`**, **`/docs`**, **`/health`**, **`/ws`**.
-
- ---
-
- ## Running and testing locally
-
- ```bash
- uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
- # or
- uv run server --port 8000
- ```
-
- **HTTP smoke:**
-
- ```bash
- uv run python scripts/http_endpoint_smoke.py --local
- ```
-
- **Tests:**
-
- ```bash
- uv run pytest tests/ -q
- GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
- uv run pytest tests/test_live_server_exhaustive.py -v --tb=short  # server on :8000
- ```
-
- **SFT → GRPO (example):**

  ```bash
@@ -347,36 +144,63 @@ uv run python scripts/train_sft_then_grpo.py \
  --curriculum-ramp-ratio 0.60
  ```

- ---
-
- ## Hugging Face Spaces

  ```bash
  openenv serve
  openenv build
  openenv validate --verbose
  openenv push
- # openenv push --repo-id your-username/ghostexec
  ```

- Use a **public** Space for the default hackathon flow. `openenv.yaml` carries **name**, **version**, and **description** for metadata - keep them in sync with submission needs.
-
- ---
-
- ## Scenarios
-
- | File | Role |
- |------|------|
- | `scenarios/phase2_core.json` | Default dense fixture |
- | `scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json` | Narrative pressure |
- | `scenarios/vip_meltdown_drift.json` | Mood / escalation drift |
- | `scenarios/schema_drift_test.json` | Drift-event harness |
-
- ---
-
- ## Project layout
-
- ```
  ghostexec/
  ├── openenv.yaml
  ├── pyproject.toml
@@ -387,24 +211,20 @@ ghostexec/
  ├── scripts/
  ├── notebooks/
  ├── tests/
  └── server/
      ├── app.py
      ├── ghostexec_environment.py
-     ├── reward.py
-     └── Dockerfile
  ```

- ---
-
- ## Resources & references
-
- - [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) - core stack
- - [Packaging & Deploying](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)
- - [OpenEnv Hub](https://huggingface.co/openenv)
- - [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA) (and related talks linked in prior README iterations)
-
- ---

  ## License

- BSD-style - see license notices in source files (Meta / OpenEnv lineage).
+ # Ghostexec: The AI Chief-of-Staff Environment
+
+ Ghostexec is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compliant environment where an LLM acts as an executive chief-of-staff under pressure: triaging inbox crises, resolving calendar conflicts, protecting stakeholder relationships, and finishing critical tasks.
+
+ The agent gets a dense plain-text briefing, takes one structured action, and is scored on three coupled dimensions: conflict reduction, relationship quality, and task progress.
+
+ ## Submission Package
+ | Item | Link |
+ |------|------|
+ | Public HF Space (required) | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
+ | OpenEnv manifest | [`openenv.yaml`](openenv.yaml) |
+ | Training notebook (Colab-ready) | [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) |
+ | Minimal training script (Unsloth + TRL) | [`scripts/train_sft_then_grpo.py`](scripts/train_sft_then_grpo.py) |
+ | Mini-blog (required) | `ADD_HF_BLOG_URL_HERE` |
+ | Demo video <2 minutes (required) | [**YouTube - Ghostexec demo**](https://youtu.be/g4IFZMEzfO8) |
+
+ ## Why This Environment Is Competitive
+
+ - **Novel task composition**: combines language-heavy triage, social reasoning, scheduling constraints, and deadline management in a single trainable loop.
+ - **Non-trivial behavior**: valid JSON is necessary but not sufficient; the policy must choose useful actions on the right entity ids at the right time.
+ - **Dynamic world model**: mood shifts, conflict rebuilds, overdue penalties, and scenario drift events force adaptation over a trajectory.
+ - **Trainable reward signal**: dense step reward for learning plus bounded graders for evaluation.
+ - **Hackathon fit**: fully OpenEnv-packaged, hostable on HF Spaces, with reproducible training and visible before/after evidence.
+
+ ## Judging-Criteria Mapping
+
+ ### 1) Environment Innovation (40%)
+
+ - The observation is a realistic text briefing, not a toy tabular state dump.
+ - Actions are schema-bound (`GhostexecAction`) and validated against live world ids.
+ - The world evolves after each step (conflict graph, stress, mood, time shifts).
+ - Drift events in scenario data test robustness to changing conditions.
+
+ **Task ladder**
+
+ | Task ID | Difficulty | Scenario |
+ |---------|------------|----------|
+ | `phase2_core` | easy | `scenarios/phase2_core.json` |
+ | `monday_morning` | medium | `scenarios/monday_morning.json` |
+ | `dinner_disaster` | hard | `scenarios/dinner_disaster.json` |
+
+ ### 2) Storytelling and Presentation (30%)
+
+ Ghostexec tells a familiar high-stakes story: too many urgent asks, not enough time, and every action has social and operational consequences.
+
+ The demo is easy to follow:
+
+ 1. show the same briefing the model sees,
+ 2. compare a weak vs a better action choice,
+ 3. show reward movement and policy behavior improvements.
+
+ ### 3) Showing Improvement in Rewards (20%)
+
+ The repo includes persisted training artifacts and plot outputs:
+
+ - `output/reward_curve.png`
+ - `output/loss_curve.png`
+ - `output/baseline_comparison.png`
+
+ **Training evidence plots**
+
+ ![Reward curve](output/reward_curve.png)
+ *Reward trend across training progression.*
+
+ ![Loss curve](output/loss_curve.png)
+ *SFT/GRPO training loss over optimization steps.*
+
+ ![Baseline comparison](output/baseline_comparison.png)
+ *Random vs frozen vs trained policy mean episode reward.*
+
+ **Current before/after metrics (from saved artifacts)**
+
+ | Metric | Baseline | Trained |
+ |--------|----------|---------|
+ | Mean step reward | `0.145` | `0.257` |
+ | Invalid action rate | `Not logged in saved artifacts` | `Not logged in saved artifacts` |
+ | Grader score | `Not logged in saved artifacts` | `Not logged in saved artifacts` |
+
+ ### 4) Reward and Training Pipeline (10%)
+
+ Ghostexec uses a coherent weighted reward core plus bounded shaping:

  \[
+ \text{weighted\_base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
  \]

+ It then applies structured adjustments (invalid-action penalties, do-nothing pressure, completion/catastrophic terms) with transparent breakdown fields.
+
+ Training is end-to-end and environment-connected (not static-only): SFT warm start, then GRPO with environment reward plus local shaping functions.
+
+ ## Quick Start

  ```bash
  uv sync
  uv run server --port 8000
  ```

+ Python client example:
+
  ```python
  from ghostexec import GhostexecAction, GhostexecEnv

  with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
      out = env.reset()
+     print(out.observation.echoed_message[:400], "...")

      step = env.step(
          GhostexecAction(
              action_type="reply_email",
              email_id="e01",
+             message_body="Acknowledged. Sending concise revised update before noon.",
          )
      )
      print("reward:", step.reward)
  ```
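The same reset/step surface supports multi-step episodes over one WebSocket session. The control flow can be sketched against a stand-in environment; `FakeEnv` below is a hypothetical stub for illustration, not part of Ghostexec, and the per-step reward it emits is made up.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    reward: float
    done: bool

@dataclass
class FakeEnv:
    """Stub mirroring the reset/step surface used by the client above."""
    max_steps: int = 3
    steps: int = 0

    def reset(self) -> None:
        self.steps = 0

    def step(self, action: dict) -> StepResult:
        self.steps += 1
        # Pretend every action earns a small fixed reward (assumption).
        return StepResult(reward=0.1, done=self.steps >= self.max_steps)

def run_episode(env) -> float:
    """Reset, then step until the environment reports done."""
    env.reset()
    total, done = 0.0, False
    while not done:
        result = env.step({"action_type": "do_nothing"})
        total, done = total + result.reward, result.done
    return total

print(round(run_episode(FakeEnv()), 2))  # 0.3
```

Swapping `FakeEnv` for a `GhostexecEnv` session gives the real multi-step loop.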
+ ## Reproducible Training Commands

  ```bash
  uv run python scripts/train_sft_then_grpo.py \
  --curriculum-ramp-ratio 0.60
  ```

+ Generate post-train plots:
+
+ ```bash
+ uv run python scripts/plot_training_report.py \
+   --trainer-history outputs/trainer_state.json \
+   --reward-csv outputs/reward_log.csv \
+   --baselines-json outputs/compliance_manifest.json \
+   --out-dir output
+ ```

+ ## OpenEnv and Space Deployment

  ```bash
  openenv serve
  openenv build
  openenv validate --verbose
  openenv push
  ```

+ If needed:
+
+ ```bash
+ openenv push --repo-id your-username/ghostexec
+ ```

+ ## Environment API and Contract

+ - Core endpoints: `/reset`, `/step`, `/state`, `/schema`, `/health`, `/docs`, `/ws`
+ - Observation contains:
+   - `echoed_message` (the plain-text briefing),
+   - optional metadata (step validity, reward breakdown, ids).
+ - Action schema: see `GhostexecAction` in [`models.py`](models.py).

+ Supported `action_type` values:

+ - `reply_email`
+ - `archive_email`
+ - `reschedule_meeting`
+ - `cancel_meeting`
+ - `complete_task`
+ - `delegate_task`
+ - `send_message`
+ - `do_nothing`

+ ## Submission Readiness Checklist

+ - [x] OpenEnv latest-compatible environment with a valid `openenv.yaml`
+ - [x] Public HF Space deployed and reachable
+ - [x] Minimal trainable script using Unsloth + TRL
+ - [x] Colab-ready notebook for reruns
+ - [x] Training evidence plots embedded in the README
+ - [ ] Add HF blog link
+ - [x] Add <2 minute YouTube demo link - [youtu.be/g4IFZMEzfO8](https://youtu.be/g4IFZMEzfO8)

+ ## Repository Structure

+ ```text
  ghostexec/
  ├── openenv.yaml
  ├── pyproject.toml
  ├── scripts/
  ├── notebooks/
  ├── tests/
+ ├── output/
  └── server/
      ├── app.py
      ├── ghostexec_environment.py
+     └── reward.py
  ```

+ ## Additional References

+ - [OpenEnv (Meta PyTorch)](https://github.com/meta-pytorch/OpenEnv)
+ - [OpenEnv Packaging and Deploying Docs](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)
+ - [OpenEnv Hub](https://huggingface.co/openenv)
+ - [Environment Innovation Deep-Dive](environment-innovation/README.md)

  ## License

+ BSD-style license as included in this repository and upstream OpenEnv lineage notices.
server/app.py CHANGED
@@ -68,8 +68,10 @@ def _ghostexec_load_environment_metadata(env, env_name=None):  # type: ignore[no
      space = "modelbuilderhq/ghostexec"
      readme_url = f"https://huggingface.co/spaces/{space}/blob/main/README.md"
      space_url = f"https://huggingface.co/spaces/{space}"
+     demo_video = "https://youtu.be/g4IFZMEzfO8"
      meta.readme_content = (
          "### README\n\n"
+         f"**Demo (<2 min):** [**YouTube**]({demo_video})\n\n"
          f"Formatted documentation (Space card + full markdown): "
          f"[**README.md on Hugging Face**]({readme_url})\n\n"
          f"Space: [**{space}**]({space_url})"