modelbuilderhq committed
Commit d815df7 · verified · 1 Parent(s): d669b0f

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +410 -355
  2. environment-innovation/README.md +147 -0
  3. server/app.py +25 -0
README.md CHANGED
@@ -1,355 +1,410 @@
1
- ---
2
- title: Ghostexec Environment Server
3
- emoji: 📢
4
- colorFrom: pink
5
- colorTo: yellow
6
- sdk: docker
7
- pinned: false
8
- app_port: 7860
9
- base_path: /web
10
- tags:
11
- - openenv
12
- ---
13
-
14
- # Ghostexec
15
-
16
- **Ghostexec** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment that simulates a busy executive’s world: inbox, calendar, contacts, tasks, and stakeholder moods. The agent chooses **structured actions** (reply, reschedule, delegate, …); the server returns a **plain-text briefing** as the main observation and a **scalar reward** shaped around conflict, relationships, and task progress. Scenario data lives in `scenarios/*.json`; nothing is hardcoded in Python for world content.
17
-
18
- **Manifest:** `openenv.yaml` (name **`ghostexec`**, HF Space identifier).
19
- **Package:** `openenv-ghostexec` in `pyproject.toml` (import as `ghostexec`).
20
-
21
- ---
22
-
23
- ## Deliverables
24
-
25
- | Deliverable | URL |
26
- |-------------|-----|
27
- | Public HF Space (required) | `TODO: https://huggingface.co/spaces/<org>/ghostexec` |
28
- | Write-up / blog (HF post preferred) | `TODO: https://huggingface.co/blog/...` |
29
- | Short demo video (&lt;2 min) | `TODO: https://youtube.com/...` |
30
-
31
- Fill these URLs before submission freeze so reviewers can verify everything from one place.
32
-
33
- ---
34
-
35
- ## OpenEnv Hackathon alignment (themes + submission checklist)
36
-
37
- **Theme fit (examples, not exhaustive):** Ghostexec targets **Theme 3.2 — Personalized tasks** (executive-style inbox, calendar, conflicts, delegation via structured actions). **Theme 4** is partially supported via curriculum + perturb (`GHOSTEXEC_CURRICULUM`, `GHOSTEXEC_PERTURB`) and diverse scenarios under `scenarios/`.
38
-
39
- **Minimum submission checklist (fill before freeze):**
40
-
41
- | Item | Status |
42
- |------|--------|
43
- | OpenEnv-based env + `openenv.yaml` | Done in-repo (`openenv-core[core]>=0.2.3` in `pyproject.toml`; aligns with current PyPI release line). |
44
- | Short write-up or &lt;2 min video | **You:** publish and paste links in [Deliverables](#deliverables). |
45
- | Public HF Space URL | **You:** `openenv push` and paste the URL in [Deliverables](#deliverables). |
46
-
47
- ---
48
-
49
- ## Design narrative
50
-
51
- Ghostexec is intentionally built as an **AI Chief of Staff** environment, not a grid-world clone: the model must triage inbox, calendar, stakeholder mood, and task deadlines under conflict pressure while taking only legal structured actions.
52
-
53
- - **Environment Innovation (40%)** — scenario-driven executive operations with competing priorities, conflict queues, and relationship-sensitive outcomes in `scenarios/*.json` + `server/ghostexec_environment.py`.
54
- - **Storytelling & Presentation (30%)** — each scenario encodes a narrative arc (VIP escalations, family/professional collisions, deadline cascades) so policy behavior reads like realistic assistant decisions rather than abstract moves.
55
- - **Showing Improvement in Rewards (20%)** — environment reward remains deterministic, inspectable, and traceable through metadata + episode logs under `outputs/logs/`.
56
- - **Reward Quality (10%)** — fixed weighted core signal (0.35 conflict / 0.35 relationship / 0.30 task), bounded shaping terms, explicit invalid-action handling, and do_nothing penalties.
57
-
58
- This framing gives judges a clear throughline: **realistic executive chaos -> constrained legal actions -> measurable policy improvement on held-out scenarios**.
59
-
60
- ---
61
-
62
- ## Features
63
-
64
- - **Legal action set** — `reply_email`, `archive_email`, `reschedule_meeting`, `cancel_meeting`, `complete_task`, `delegate_task`, `send_message`, `do_nothing` (see `models.py`).
65
- - **Human-readable observations** — `GhostexecObservation.echoed_message` is the full briefing text for the model (not raw JSON).
66
- - **Invalid actions** — Handled in-process: structured metadata (e.g. `step_ok`), no server crash.
67
- - **Reward** — Weighted blend of conflict, relationship, and task signals (see [Reward](#reward)); per-step logging under `outputs/logs/` (gitignored).
68
- - **HTTP + WebSocket** — FastAPI app in `server/app.py`; `GhostexecEnv` uses WebSockets for persistent episodes.
69
-
70
- ---
71
-
72
- ## Quick start (Python client)
73
-
74
- From the repo root (`ghostexec/` — where `pyproject.toml` lives):
75
-
76
- ```bash
77
- uv sync
78
- uv run server --port 8000
79
- ```
80
-
81
- In another terminal or notebook:
82
-
83
- ```python
84
- from ghostexec import GhostexecAction, GhostexecEnv
85
-
86
- with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
87
- out = env.reset()
88
- print(out.observation.echoed_message[:500], "…") # plain-text briefing
89
-
90
- step = env.step(
91
- GhostexecAction(
92
- action_type="reply_email",
93
- email_id="e01",
94
- message_body=(
95
- "Marcus — acknowledged. Revised figures and short rationale "
96
- "before noon. — Exec"
97
- ),
98
- )
99
- )
100
- print("reward:", step.reward)
101
- print("metadata keys:", sorted((step.observation.metadata or {}).keys()))
102
- ```
103
-
104
- **Docker image** (optional): if your OpenEnv client supports it, you can point `GhostexecEnv` at a container built from the root `Dockerfile`. Build from repo root:
105
-
106
- ```bash
107
- docker build -t ghostexec-env:latest .
108
- ```
109
-
110
- ---
111
-
112
- ## Actions and fields
113
-
114
- `GhostexecAction` (`models.py`) includes:
115
-
116
- | `action_type` | Typical fields used |
117
- |------------------------|----------------------|
118
- | `reply_email` | `email_id`, `message_body` |
119
- | `archive_email` | `email_id` |
120
- | `reschedule_meeting` | `meeting_id`, `new_time`, `reason` |
121
- | `cancel_meeting` | `meeting_id`, `reason` |
122
- | `complete_task` | `task_id` |
123
- | `delegate_task` | `task_id`, `contact_name` |
124
- | `send_message` | `contact_name`, `message` (channel text) |
125
- | `do_nothing` | — (intentionally weak / penalised path) |
126
-
127
- Unknown or malformed HTTP payloads deserialize safely to `do_nothing`-style defaults where applicable so older clients do not crash.
128
-
129
- ---
130
-
131
- ## Observation
132
-
133
- `GhostexecObservation`:
134
-
135
- - **`echoed_message`** — Full briefing (emails, conflicts, contacts, tasks, stress, steps remaining).
136
- - **`message_length`** — Length of `echoed_message` for quick checks.
137
- - **`reward`**, **`done`**, **`metadata`** — Step outcome; metadata carries flags such as `step_ok`, reward breakdown fields, and ids for debugging.
138
-
139
- ---
140
-
141
- ## Reward
142
-
143
- Phase-4 scoring (`server/reward.py`) combines three channels with **fixed weights**:
144
-
145
- \[
146
- \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
147
- \]
148
-
149
- Then applies output scaling, invalid-step adjustments, bonuses/penalties, and a floor for `do_nothing`. Full component values are available on `RewardBreakdown` and are mirrored into observation metadata where configured. **Episode reward traces** append to `outputs/logs/episode_rewards.jsonl` (directory gitignored).
150
-
151
- **Reward-engineering provenance.** The design follows the reward-shaping playbook surveyed in *Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications* ([arXiv:2408.10215](https://arxiv.org/abs/2408.10215)): dense per-step shaping around proxy signals (conflict / relationship / task) instead of a single sparse end-of-episode reward, fixed weights to keep channel trade-offs inspectable, and bounded per-step magnitudes to resist hacking.
152
-
153
- ---
154
-
155
- ## HTTP vs WebSocket (episode state)
156
-
157
- - **HTTP** `POST /reset` and `POST /step` often bind to **short-lived** environment instances depending on deployment; consecutive HTTP calls may not share one in-memory episode.
158
- - **Ghostexec** still applies your action against a scenario-primed instance so a lone `POST /step` can return a meaningful reward and metadata.
159
- - **WebSocket `/ws`** — Use this (or `GhostexecEnv(base_url=...)`, which speaks WebSocket) for **multi-step episodes** on the same session.
160
-
161
- Endpoints (typical OpenEnv layout): **`/web`**, **`/docs`**, **`/health`**, **`/ws`**.
162
-
163
- ---
164
-
165
- ## Running and testing locally
166
-
167
- ```bash
168
- # Dev server (package layout)
169
- uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
170
-
171
- # Or console entrypoint (matches Dockerfile)
172
- uv run server --port 8000
173
- ```
174
-
175
- **Smoke script** (HTTP):
176
-
177
- ```bash
178
- uv run python scripts/http_endpoint_smoke.py --local
179
- uv run python scripts/http_endpoint_smoke.py --url http://127.0.0.1:8000
180
- uv run python scripts/http_endpoint_smoke.py --print-curl
181
- ```
182
-
183
- **Tests:**
184
-
185
- ```bash
186
- uv run pytest tests/ -q
187
- ```
188
-
189
- Opt-in Docker build smoke (Phase 1 gate):
190
-
191
- ```bash
192
- GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
193
- ```
194
-
195
- With the server already on port 8000:
196
-
197
- ```bash
198
- uv run pytest tests/test_live_server_exhaustive.py -v --tb=short
199
- ```
200
-
201
- Override live URL (Windows PowerShell example):
202
-
203
- ```powershell
204
- $env:GHOSTEXEC_LIVE_BASE_URL = "http://127.0.0.1:9000"
205
- uv run pytest tests/test_live_server_exhaustive.py -q
206
- ```
207
-
208
- Optional real WebSocket client check:
209
-
210
- ```bash
211
- # Terminal 1
212
- uv run server --port 8000
213
- # Terminal 2
214
- set GHOSTEXEC_WS_BASE_URL=http://127.0.0.1:8000
215
- uv run pytest tests/test_complete_integration.py::test_ghostexec_env_client_against_live_url_if_set -q
216
- ```
217
-
218
- Post-training plot pack (loss + reward + components + baseline bar):
219
-
220
- ```bash
221
- uv run python scripts/plot_training_report.py \
222
- --trainer-history outputs/trainer_state.json \
223
- --reward-csv outputs/reward_log.csv \
224
- --baselines-json outputs/compliance_manifest.json \
225
- --out-dir outputs/plots
226
- ```
227
-
228
- The script writes:
229
- - `outputs/plots/loss_curve.png`
230
- - `outputs/plots/reward_curve.png`
231
- - `outputs/plots/components_curve.png`
232
- - `outputs/plots/baseline_comparison.png`
233
-
234
- SFT before GRPO (with partial live-env usage during SFT data generation and GRPO rewards):
235
-
236
- ```bash
237
- uv run python scripts/train_sft_then_grpo.py \
238
- --model-preset small_iter_fast \
239
- --training-preset hackathon_turbo \
240
- --env-url http://127.0.0.1:8000 \
241
- --generate-sft-from-env \
242
- --sft-samples 120 \
243
- --max-sft-steps 60 \
244
- --max-grpo-steps 120 \
245
- --env-reward-scale 1.0 \
246
- --local-reward-scale 0.35 \
247
- --complexity-curriculum easy_to_full \
248
- --curriculum-ramp-ratio 0.60
249
- ```
250
-
251
- This performs:
252
- - SFT warm-start on JSONL (`prompt` + `completion`) generated from live `/reset` briefings.
253
- - GRPO continuation from the SFT adapter.
254
- - Mixed reward shaping where env-derived reward remains active and local shaping can be down-weighted/up-weighted via scales.
255
- - Optional complexity curriculum (`easy_to_full`) that starts with stronger scaffold/local signals and anneals to env-dominant reward later.
256
- - Stability-first optimization defaults (cosine schedule + warmup + grad clipping + higher GRPO KL beta). Optional `--reward-ema-decay 0..1` smooths the *env* reward channel (defaults come from `--training-preset`). Training always runs the full `max_*_steps` (no early-stop callbacks).
257
-
258
- Recommended model strategy for hackathon iteration speed:
259
- - Start with `--model-preset small_iter_fast` (`unsloth/Qwen2.5-3B-Instruct`) + QLoRA.
260
- - Run many short SFT->GRPO loops, improve reward signals, then scale model size only after curves stabilize.
261
- - Use larger presets only when memory + runtime are consistently stable.
262
- - Use `--training-preset hackathon_turbo` to apply stable aggressive defaults for iterative win-rate.
263
- - Script prints SFT/GRPO LoRA delta checks; if deltas are near zero it stops, so you never mistake a no-op run for real finetuning.
264
-
265
- ---
266
-
267
- ## Hugging Face Spaces
268
-
269
- Full OpenEnv CLI flow from this directory (matches steps 5–8 of the [Packaging & Deploying guide](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)):
270
-
271
- ```bash
272
- openenv serve # local dev server on :8000
273
- openenv build # build the Docker image
274
- openenv validate --verbose # structure + Dockerfile + entrypoint checks
275
- openenv push # deploy to HF Spaces
276
- # openenv push --repo-id your-username/ghostexec
277
- ```
278
-
279
- Use a **public** Space for the default hackathon flow unless you intentionally need a private Space. Authenticate with Hugging Face first (`huggingface-cli login` or equivalent).
280
-
281
- ---
282
-
283
- ## Scenarios
284
-
285
- | File | Role |
286
- |------|------|
287
- | `scenarios/phase2_core.json` | Default dense inbox/calendar/tasks fixture |
288
- | `scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json` | Narrative demos |
289
- | `scenarios/vip_meltdown_drift.json` | Mood / escalation drift |
290
- | `scenarios/schema_drift_test.json` | Drift-event harness |
291
-
292
- ---
293
-
294
- ## Concurrent WebSocket sessions
295
-
296
- `server/app.py` passes **`GhostexecEnvironment`** (the class) into `create_app` with `max_concurrent_envs=1` by default. Increase `max_concurrent_envs` if you need multiple simultaneous WebSocket clients.
297
-
298
- ---
299
-
300
- ## Project layout
301
-
302
- ```
303
- ghostexec/
304
- ├── openenv.yaml # OpenEnv name, version, description
305
- ├── pyproject.toml # Package metadata + optional extras
306
- ├── uv.lock
307
- ├── models.py # World + GhostexecAction / GhostexecObservation
308
- ├── client.py # GhostexecEnv (WebSocket client)
309
- ├── scenarios/ # World JSON (source of truth for episodes)
310
- ├── scripts/ # http_endpoint_smoke.py
311
- ├── tests/
312
- └── server/
313
- ├── app.py # FastAPI + create_app
314
- ├── ghostexec_environment.py
315
- ├── reward.py
316
- └── Dockerfile
317
- ```
318
-
319
- ---
320
-
321
- ## Resources & references
322
-
323
- Ghostexec is built against the official Meta PyTorch OpenEnv stack. Every design choice below is traceable to one of these sources.
324
-
325
- **OpenEnv core.** The Gymnasium-style `reset()` / `step()` / `state` interface in `server/ghostexec_environment.py`, the `EnvClient` subclass in `client.py`, and the `create_app(...)` wiring in `server/app.py` follow the [Packaging & Deploying guide](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html) exactly.
326
-
327
- - Core repo: [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv)
328
- - Docs: [meta-pytorch.org/OpenEnv](https://meta-pytorch.org/OpenEnv/)
329
-
330
- **OpenEnv Hub (Hugging Face).** Target deployment for `openenv push`. The Space metadata at the top of this README + `openenv.yaml` are the knobs HF Spaces reads.
331
-
332
- - Environments: [huggingface.co/openenv](https://huggingface.co/openenv)
333
- - Spaces: [huggingface.co/openenv/spaces](https://huggingface.co/openenv/spaces)
334
-
335
- **Tutorials.** General OpenEnv environment patterns are documented in the official tutorial pages and examples.
336
-
337
- - All tutorials: [OpenEnv/tutorial](https://github.com/meta-pytorch/OpenEnv/tree/main/tutorial)
338
- - Environment examples: [OpenEnv/envs](https://github.com/meta-pytorch/OpenEnv/tree/main/envs)
339
-
340
- **YouTube — Building RL environments.** Talks from Meta / OpenEnv contributors that informed the scenario-driven reset, WebSocket session model, and reward breakdown used here:
341
-
342
- - [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA)
343
- - [OpenEnv Deep Dive](https://www.youtube.com/watch?v=ap4q4sAK4OY)
344
- - [Agentic RL Environments](https://www.youtube.com/watch?v=Jew4lhAiqnw)
345
- - [OpenEnv Livestream (4-hour walkthrough)](https://www.youtube.com/live/kkCNMz0Ptd8)
346
-
347
- **Reward-engineering papers.** See [Reward](#reward) for how each paper maps to specific components of `server/reward.py`.
348
-
349
- - Jnadi, A. (2024). *Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning Applications*. [arXiv:2408.10215](https://arxiv.org/abs/2408.10215). Informs the dense per-step conflict / relationship / task shaping and the bounded-magnitude design.
350
-
351
- ---
352
-
353
- ## License
354
-
355
- BSD-style — see the license notice at the top of each source file (Meta / OpenEnv lineage).
1
+ ---
2
+ title: Ghostexec Environment Server
3
+ emoji: 📢
4
+ colorFrom: pink
5
+ colorTo: yellow
6
+ sdk: docker
7
+ pinned: false
8
+ app_port: 7860
9
+ base_path: /web
10
+ tags:
11
+ - openenv
12
+ ---
13
+
14
+ # Ghostexec
15
+
16
+ **Ghostexec** is an [OpenEnv](https://github.com/meta-pytorch/OpenEnv)-compatible environment: a busy **executive chief-of-staff** simulator with inbox, calendar, contacts, tasks, and stakeholder moods. The agent must read a **plain-text briefing**, then emit **one structured action per step** (`reply_email`, `reschedule_meeting`, …). The server returns rewards shaped around **conflict**, **relationships**, and **tasks**—plus trajectory **graders** for hackathon validation. All episode **content** lives in `scenarios/*.json`; the engine is in `server/ghostexec_environment.py` and `server/reward.py`.
17
+
18
+ | Item | Value |
19
+ |------|--------|
20
+ | **HF Space name / manifest** | `ghostexec` in [`openenv.yaml`](openenv.yaml) |
21
+ | **Python package** | `openenv-ghostexec` in [`pyproject.toml`](pyproject.toml) (import `ghostexec`) |
22
+ | **Public Space** | [modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
23
+ | **Deeper innovation-only brief** | [`environment-innovation/README.md`](environment-innovation/README.md) |
24
+
25
+ ---
26
+
27
+ ## Deliverables (fill before freeze)
28
+
29
+ | Deliverable | URL |
30
+ |-------------|-----|
31
+ | Public HF Space (required) | [https://huggingface.co/spaces/modelbuilderhq/ghostexec](https://huggingface.co/spaces/modelbuilderhq/ghostexec) |
32
+ | Write-up / blog (HF post preferred) | `TODO: paste your post URL` |
33
+ | Short demo video (&lt;2 min) | `TODO: paste your video URL` |
34
+
35
+ ---
36
+
37
+ ## Contents
38
+
39
+ **Judging criteria (this README is organized around them)**
40
+
41
+ 1. [Criterion: Environment Innovation (40%)](#ghostexec-env-innovation)
42
+ 2. [Criterion: Storytelling & Presentation (30%)](#ghostexec-storytelling)
43
+ 3. [Criterion: Showing Improvement in Rewards (20%)](#ghostexec-reward-improvement)
44
+ 4. [Criterion: Reward & Training Pipeline (10%)](#ghostexec-reward-pipeline)
45
+
46
+ **Reference**
47
+
48
+ 5. [Hackathon themes & checklist](#openenv-hackathon-themes--checklist)
49
+ 6. [Quick start](#quick-start-python-client)
50
+ 7. [Actions](#actions-and-fields)
51
+ 8. [Observation](#observation)
52
+ 9. [Reward (formula summary)](#reward-formula-summary)
53
+ 10. [HTTP vs WebSocket](#http-vs-websocket-episode-state)
54
+ 11. [Running and testing locally](#running-and-testing-locally)
55
+ 12. [Hugging Face Spaces](#hugging-face-spaces)
56
+ 13. [Scenarios](#scenarios)
57
+ 14. [Project layout](#project-layout)
58
+ 15. [Resources & references](#resources--references)
59
+ 16. [License](#license)
60
+
61
+ ---
62
+
63
+ ## Criterion: Environment Innovation (40%)
64
+
65
+ <a id="ghostexec-env-innovation"></a>
66
+
67
+ **Weight:** 40%
68
+
69
+ **What it means:**
70
+
71
+ - Is the environment novel, creative, or genuinely challenging?
72
+ - Does it meaningfully test agent behavior in a way that hasn't been done before?
73
+
74
+ ### How Ghostexec answers this
75
+
76
+ **Challenging world.** The policy sees **one dense natural-language briefing** per step (emails, calendar overlaps, contacts with mood, overdue tasks, stress, steps remaining)—not a JSON dump of the world. It must **ground** decisions in real ids from that text, return **valid typed actions**, and accept **time pressure** and **social fallout** when meetings move or mail goes unanswered. Invalid actions **do not crash** the server; they return structured errors so learning signals stay intact.
77
+
78
+ **Meaningful behavior, not a toy Q&A.** Success needs **comprehension + tool discipline**: legal JSON schema, multi-step **sequences** (WebSocket sessions for real episodes), and **tradeoffs** across channels (mail vs calendar vs tasks vs relationships). **`do_nothing` is penalised** so “safe” idleness is costly when fires are burning.
79
+
80
+ **Dynamics, not a static paragraph.** After each valid action, the simulation **advances the clock**, updates **moods**, rebuilds **conflicts**, and can apply **scenario-driven drift** (`after_step` events in JSON): shifted meetings, new deadlines, preference changes—so the agent is tested on **adaptation**, not memorizing the first screen.
81
+
82
+ **Dual evaluation.** **Dense step rewards** in `server/reward.py` teach fine structure; **trajectory graders** in `graders.py` return scores strictly in **`(0.01, 0.99)`** per OpenEnv task wiring in `openenv.yaml`. Agents learn from the dense signal; judges get bounded certification scores.
83
+
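The `(0.01, 0.99)` bound mentioned above can be sketched as a simple clamp (a hypothetical helper for illustration, not the actual `graders.py` implementation):

```python
def bound_grade(raw_score: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Clamp a raw trajectory score into the bounded range expected by the grader wiring."""
    return max(lo, min(hi, raw_score))
```

Keeping grader outputs bounded means a single runaway trajectory cannot dominate certification scores.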
84
+ **Honest novelty claim.** Inboxes and calendars are familiar **ingredients**. What is less common is the **composition**: OpenEnv-native packaging, **plain-text-only** observations, **data-defined** scenarios, live dynamics + drift, dual reward/grader stack, and a **transactional** action API in one trainable, hostable environment.
85
+
86
+ ### Task ladder (difficulty in data)
87
+
88
+ | Task id | Difficulty | Scenario | What gets harder |
89
+ |---------|------------|----------|------------------|
90
+ | `phase2_core` | easy | `scenarios/phase2_core.json` | Dense triage: VIP mail, calendar relief, overlapping work. |
91
+ | `monday_morning` | medium | `scenarios/monday_morning.json` | Stacked Monday rush, less slack. |
92
+ | `dinner_disaster` | hard | `scenarios/dinner_disaster.json` | Personal vs professional collision, escalation risk. |
93
+
94
+ ### 5-minute verification checklist
95
+
96
+ 1. **`openenv.yaml`** — three tasks, `max_steps`, `app: server.app:app`, `name: ghostexec`, grader paths.
97
+ 2. **`scenarios/*.json`** — world content is **data**, not hardcoded lore in Python.
98
+ 3. **`server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, post-step dynamics, schema drift hooks.
99
+ 4. **`server/reward.py`** — fixed 0.35 / 0.35 / 0.30 core, invalid / idle handling, shaping caps.
100
+ 5. **`graders.py`** — bounded grader outputs, trajectory consumption.
101
+ 6. **Live Space** — `/docs` or `POST /reset` + `POST /step`: legal steps change state; illegal steps return errors, not stack traces.
102
+
103
+ For a **standalone** walkthrough of the innovation angle only, see **[environment-innovation/README.md](environment-innovation/README.md)**.
104
+
105
+ ---
106
+
107
+ ## Criterion: Storytelling & Presentation (30%)
108
+
109
+ <a id="ghostexec-storytelling"></a>
110
+
111
+ **Weight:** 30%
112
+
113
+ **What it means:**
114
+
115
+ - Can you clearly explain the problem, the environment, and what the agent learned?
116
+ - Is the demo engaging and easy to follow for a non-technical audience?
117
+
118
+ ### The problem (plain language)
119
+
120
+ An executive’s day is **messy**: urgent email from a board member, a double-booked calendar, a spouse texting about dinner, a report due at noon, and every choice **ripples**—someone feels heard or ignored, a conflict gets better or worse, a task slips or gets done. Ghostexec turns that into a **small simulator** the model must **run**, not a single paragraph to summarize.
121
+
122
+ ### The environment (one sentence)
123
+
124
+ **You read a realistic staff briefing; you pick one legal “move” (reply, reschedule, delegate, …); the world updates; you get a score that reflects tension across work, people, and tasks.**
125
+
126
+ ### What the agent is supposed to learn
127
+
128
+ - **Read carefully** — wrong `email_id` / `meeting_id` / `task_id` fails cleanly with feedback.
129
+ - **Act under pressure** — clock, `max_steps`, and stress push toward decisions, not endless analysis.
130
+ - **Balance competing goals** — improving relationships can conflict with clearing the calendar or finishing tasks; rewards encode that tradeoff.
131
+ - **Recover from change** — drift events mean the “right” plan from step 1 may not stay right at step 8.
132
+
133
+ ### Demo tips for a non-technical audience
134
+
135
+ 1. **Show the briefing first** — let viewers see the same wall of text the model sees (relatable chaos).
136
+ 2. **Show one good step vs one bad step** — e.g. thoughtful reply vs invalid id or `do_nothing` while critical mail waits (mood / reward visibly differ).
137
+ 3. **Name the three “channels”** — calmer calendar, happier stakeholders, tasks moving forward—without math jargon.
138
+ 4. **End on “what improved”** — after training, pick the same scenario and show fewer invalid steps, higher rewards, or a grader curve (ties to the 20% section below).
139
+
140
+ ### Hackathon alignment (themes)
141
+
142
+ **Theme fit (examples):** Ghostexec fits **Theme 3.2 — Personalized tasks** (executive-style inbox, calendar, delegation). **Theme 4** is partially supported via `GHOSTEXEC_CURRICULUM`, `GHOSTEXEC_PERTURB`, and diverse `scenarios/`.
143
+
144
+ ---
145
+
146
+ ## Criterion: Showing Improvement in Rewards (20%)
147
+
148
+ <a id="ghostexec-reward-improvement"></a>
149
+
150
+ **Weight:** 20%
151
+
152
+ **What it means:**
153
+
154
+ - Is there observable evidence of training progress? Reward curves, before/after behavior, comparison against a baseline—anything that proves the agent learned something.
155
+
156
+ ### Where evidence lives in this repo
157
+
158
+ | Artifact | Role |
159
+ |----------|------|
160
+ | `outputs/logs/episode_rewards.jsonl` | Per-step reward trace (gitignored); use for **reward curves** and component debugging. |
161
+ | `outputs/trainer_state.json` / training logs | Produced by training scripts when configured; feed into plotting. |
162
+ | `outputs/reward_log.csv` | Optional CSV companion for plotting pipelines. |
163
+ | `outputs/compliance_manifest.json` | Baseline / compliance metadata for **comparison** charts. |
164
+ | `outputs/plots/*.png` | Generated report figures (see command below). |
165
+
166
+ **Plot pack (loss + reward + components + baseline bar):**
167
+
168
+ ```bash
169
+ uv run python scripts/plot_training_report.py \
170
+ --trainer-history outputs/trainer_state.json \
171
+ --reward-csv outputs/reward_log.csv \
172
+ --baselines-json outputs/compliance_manifest.json \
173
+ --out-dir outputs/plots
174
+ ```
175
+
176
+ Writes `loss_curve.png`, `reward_curve.png`, `components_curve.png`, `baseline_comparison.png` under `outputs/plots/`.
177
+
178
+ **End-to-end notebook:** [`notebooks/ghostexec_unsloth_grpo_hf_api.ipynb`](notebooks/ghostexec_unsloth_grpo_hf_api.ipynb) is intended to **Run All** without manual steps (per project convention).
179
+
180
+ **Before / after narrative for judges:** same `task_id` and seed—show **lower invalid rate**, **higher mean step reward**, or **clearer grader trajectory** after finetuning. Pair numbers with **one short clip** of two runs side by side on the Space or local server.
181
+
182
+ ---
183
+
184
+ ## Criterion: Reward & Training Pipeline (10%)
185
+
186
+ <a id="ghostexec-reward-pipeline"></a>
187
+
188
+ **Weight:** 10%
189
+
190
+ **What it means:**
191
+
192
+ - Is the reward logic coherent?
193
+ - Does the pipeline produce meaningful improvement in the trained agent's behavior?
194
+
195
+ ### Reward logic (coherent and inspectable)
196
+
197
+ Phase-4 scoring in `server/reward.py` uses a **fixed** core blend:
198
+
199
+ \[
200
+ \text{weighted base} = 0.35 \cdot \text{conflict} + 0.35 \cdot \text{relationship} + 0.30 \cdot \text{task}
201
+ \]
202
+
203
+ Then bounded shaping, invalid-step handling, and explicit penalties (including **`do_nothing`**). Components surface on `RewardBreakdown` and in observation **metadata** where configured—so “why did this step score X?” is **auditable**, not a black box.
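As a minimal sketch of the core blend (illustrative only; the real logic in `server/reward.py` adds scaling, bonuses, and floors on top):

```python
# Fixed channel weights; they never move, so trade-offs stay inspectable.
WEIGHTS = {"conflict": 0.35, "relationship": 0.35, "task": 0.30}

def weighted_base(conflict: float, relationship: float, task: float) -> float:
    """Core signal: 0.35 * conflict + 0.35 * relationship + 0.30 * task."""
    return (
        WEIGHTS["conflict"] * conflict
        + WEIGHTS["relationship"] * relationship
        + WEIGHTS["task"] * task
    )
```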
204
+
205
+ Design rationale is aligned with dense reward-shaping practice (see [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)): fixed channel weights, bounded per-step magnitudes, and dense shaping rather than a single sparse end-of-episode reward.
206
+
207
+ ### Training pipeline (entrypoints)
208
+
209
+ | Step | Command / artifact |
210
+ |------|---------------------|
211
+ | Install | `uv sync` (from repo root) |
212
+ | Server (matches Dockerfile) | `uv run server --port 8000` |
213
+ | SFT → GRPO script | `uv run python scripts/train_sft_then_grpo.py` (see [Running and testing locally](#running-and-testing-locally) for a full example invocation) |
214
+ | Tests | `uv run pytest tests/ -q` |
215
+ | Docker build gate | `GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q` |
216
+
217
+ The pipeline is **meaningful** when tied to the **20% evidence** above: same env URL, logged rewards, and plots that move in the right direction over training—not when loss alone decreases.
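One plausible reading of the `--env-reward-scale` / `--local-reward-scale` / `--curriculum-ramp-ratio` flags is a linear anneal from local shaping toward env-dominant reward. A sketch under that assumption (not the script's actual schedule):

```python
def mixed_reward(env_r: float, local_r: float, step: int, total_steps: int,
                 env_scale: float = 1.0, local_scale: float = 0.35,
                 ramp_ratio: float = 0.60) -> float:
    """Blend env and local rewards; local shaping fades over the first ramp_ratio of training."""
    ramp_steps = max(1, int(total_steps * ramp_ratio))
    progress = min(1.0, step / ramp_steps)  # 0 at start, 1 once the ramp completes
    # Env-derived reward stays active throughout; only the local channel anneals.
    return env_scale * env_r + local_scale * (1.0 - progress) * local_r
```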
218
+
219
+ ---
220
+
221
+ ## OpenEnv Hackathon themes & checklist
222
+
223
+ | Item | Status |
224
+ |------|--------|
225
+ | OpenEnv-based env + `openenv.yaml` | In-repo (`openenv-core[core]>=0.2.3`). |
226
+ | Short write-up or &lt;2 min video | **You:** publish and paste URLs in [Deliverables](#deliverables-fill-before-freeze). |
227
+ | Public HF Space | [Deliverables](#deliverables-fill-before-freeze); deploy with `openenv push --repo-id <your>/ghostexec`. |
228
+
229
+ ---
230
+
231
+ ## Quick start (Python client)
232
+
233
+ From the repo root (where `pyproject.toml` lives):
234
+
235
+ ```bash
236
+ uv sync
237
+ uv run server --port 8000
238
+ ```
239
+
240
+ ```python
241
+ from ghostexec import GhostexecAction, GhostexecEnv
242
+
243
+ with GhostexecEnv(base_url="http://127.0.0.1:8000") as env:
244
+ out = env.reset()
245
+ print(out.observation.echoed_message[:500], "…")
246
+
247
+ step = env.step(
248
+ GhostexecAction(
249
+ action_type="reply_email",
250
+ email_id="e01",
251
+ message_body=(
252
+ "Marcus acknowledged. Revised figures and short rationale "
253
+ "before noon. Exec"
254
+ ),
255
+ )
256
+ )
257
+ print("reward:", step.reward)
258
+ print("metadata keys:", sorted((step.observation.metadata or {}).keys()))
259
+ ```
260
+
261
+ **Docker (optional):**
262
+
263
+ ```bash
264
+ docker build -t ghostexec-env:latest .
265
+ ```
266
+
267
+ ---
268
+
269
+ ## Actions and fields
270
+
271
+ `GhostexecAction` (`models.py`):
272
+
273
+ | `action_type` | Typical fields |
274
+ |---------------|----------------|
275
+ | `reply_email` | `email_id`, `message_body` |
276
+ | `archive_email` | `email_id` |
277
+ | `reschedule_meeting` | `meeting_id`, `new_time`, `reason` |
278
+ | `cancel_meeting` | `meeting_id`, `reason` |
279
+ | `complete_task` | `task_id` |
280
+ | `delegate_task` | `task_id`, `contact_name` |
281
+ | `send_message` | `contact_name`, `message` |
282
+ | `do_nothing` | — (penalised path) |
283
+
284
+ Malformed HTTP payloads are handled safely so clients do not crash the server.
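The fallback behavior can be sketched as follows (hypothetical helper; the actual deserialization lives in `models.py` and the server):

```python
# Legal action set from the table above.
VALID_ACTIONS = {
    "reply_email", "archive_email", "reschedule_meeting", "cancel_meeting",
    "complete_task", "delegate_task", "send_message", "do_nothing",
}

def coerce_action(payload: object) -> dict:
    """Return the payload if it names a known action_type; otherwise fall back to do_nothing."""
    if isinstance(payload, dict) and payload.get("action_type") in VALID_ACTIONS:
        return payload
    return {"action_type": "do_nothing"}
```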
---

## Observation

- **`echoed_message`** — Full plain-text briefing.
- **`message_length`** — Length of the briefing.
- **`reward`**, **`done`**, **`metadata`** — Step outcome; metadata includes `step_ok`, reward-breakdown fields, and debug ids.

---

## Reward (formula summary)

Full detail is under [Criterion: Reward & Training Pipeline (10%)](#criterion-reward--training-pipeline-10). Episode logs: `outputs/logs/episode_rewards.jsonl` (gitignored).
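The headline shape of the formula (a fixed 0.35 conflict + 0.35 relationship + 0.30 task core, per the reward docs) can be sketched as below. Shaping terms and invalid/idle handling are omitted; this is not `server/reward.py` itself.

```python
# Sketch of the fixed weighted core described in the reward docs:
# 0.35 * conflict + 0.35 * relationship + 0.30 * task, before the bounded
# shaping terms that the real server/reward.py layers on top.
def core_reward(conflict: float, relationship: float, task: float) -> float:
    return 0.35 * conflict + 0.35 * relationship + 0.30 * task


print(core_reward(1.0, 0.5, 0.0))
```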
---

## HTTP vs WebSocket (episode state)

- **HTTP** — `POST /reset` and `POST /step` may use **short-lived** instances; consecutive HTTP calls might not share one in-memory episode.
- **WebSocket `/ws`** (or `GhostexecEnv`) — use for **multi-step episodes** on one session.

Endpoints: **`/web`**, **`/docs`**, **`/health`**, **`/ws`**.

---

## Running and testing locally

```bash
uv run uvicorn ghostexec.server.app:app --reload --host 0.0.0.0 --port 8000
# or
uv run server --port 8000
```

**HTTP smoke:**

```bash
uv run python scripts/http_endpoint_smoke.py --local
```

**Tests:**

```bash
uv run pytest tests/ -q
GHOSTEXEC_RUN_DOCKER_BUILD=1 uv run pytest tests/test_docker_build.py -q
uv run pytest tests/test_live_server_exhaustive.py -v --tb=short  # server on :8000
```

**SFT then GRPO (example):**

```bash
uv run python scripts/train_sft_then_grpo.py \
  --model-preset small_iter_fast \
  --training-preset hackathon_turbo \
  --env-url http://127.0.0.1:8000 \
  --generate-sft-from-env \
  --sft-samples 120 \
  --max-sft-steps 60 \
  --max-grpo-steps 120 \
  --env-reward-scale 1.0 \
  --local-reward-scale 0.35 \
  --complexity-curriculum easy_to_full \
  --curriculum-ramp-ratio 0.60
```

---

## Hugging Face Spaces

```bash
openenv serve
openenv build
openenv validate --verbose
openenv push
# openenv push --repo-id your-username/ghostexec
```

Use a **public** Space for the default hackathon flow. `openenv.yaml` carries **name**, **version**, and **description** for metadata—keep them in sync with submission needs.

---

## Scenarios

| File | Role |
|------|------|
| `scenarios/phase2_core.json` | Default dense fixture |
| `scenarios/monday_morning.json`, `dinner_disaster.json`, `vip_meltdown.json` | Narrative pressure |
| `scenarios/vip_meltdown_drift.json` | Mood / escalation drift |
| `scenarios/schema_drift_test.json` | Drift-event harness |

---

## Project layout

```
ghostexec/
├── openenv.yaml
├── pyproject.toml
├── models.py
├── client.py
├── graders.py
├── scenarios/
├── scripts/
├── notebooks/
├── tests/
└── server/
    ├── app.py
    ├── ghostexec_environment.py
    ├── reward.py
    └── Dockerfile
```

---

## Resources & references

- [meta-pytorch/OpenEnv](https://github.com/meta-pytorch/OpenEnv) — core stack
- [Packaging & Deploying](https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html)
- [OpenEnv Hub](https://huggingface.co/openenv)
- [Building RL Environments with OpenEnv](https://www.youtube.com/watch?v=0airz7BhBiA) (and related talks linked in prior README iterations)

---

## License

BSD-style — see license notices in source files (Meta / OpenEnv lineage).
environment-innovation/README.md ADDED
# Ghostexec — innovation brief (for reviewers)

**Repository:** [Ghostexec (OpenEnv)](../README.md)
**Public Space:** https://huggingface.co/spaces/modelbuilderhq/ghostexec

This README is a **standalone** walkthrough for reviewers: why the environment is hard, what agent capabilities it stresses, and how to verify claims in code and on the live Space. You can read it **without** opening the rest of the repo narrative.

---

## Contents

1. [How to read this document](#how-to-read-this-document)
2. [Short answers](#short-answers-so-nothing-is-buried)
3. [What Ghostexec is](#1-what-ghostexec-is-one-paragraph)
4. [What the agent observes](#2-what-the-agent-observes-and-why-that-matters)
5. [What the agent can do](#3-what-the-agent-can-do-actions-and-legality)
6. [What changes between steps](#4-what-changes-between-steps-dynamics-and-drift)
7. [How success is scored](#5-how-success-is-scored-two-layers-on-purpose)
8. [Task ladder](#6-the-public-task-ladder-difficulty-in-data-not-vibes)
9. [Reviewer checklist](#7-how-a-reviewer-can-verify-5-minute-checklist)
10. [Closing](#8-closing)
11. [Key files (from repo root)](#key-files-from-repo-root)

---

## How to read this document

We group the argument under **two angles** reviewers typically care about. Everything below maps to one or both:

| Angle | Sections that answer it |
|-------|-------------------------|
| Is the **world** itself interesting and genuinely hard? | [Short answers](#short-answers-so-nothing-is-buried), [§1–§4](#1-what-ghostexec-is-one-paragraph) |
| Does it **stress-test agents** in a way a toy demo would not? | [Short answers](#short-answers-so-nothing-is-buried), [§3–§6](#3-what-the-agent-can-do-actions-and-legality), [§8](#8-closing) |

---

## Short answers (so nothing is buried)

**Is it genuinely challenging?** Yes. The agent must survive **dense natural-language state**, emit **strict structured actions** that **mutate** a multi-entity world, and accept **time pressure**, **social consequences**, and **invalid-action economics** without crashing the server. “Easy” wins are rare because channels **compete**: mail, calendar, tasks, and relationships all pull in different directions.

**Is it a meaningful test of behavior?** Yes. Success requires **grounded parsing** (real ids from the briefing), **tool discipline** (legal JSON schema), **sequencing** over multiple steps (WebSocket sessions for real episodes; HTTP for resets and single steps), and **tradeoffs** reflected in a **multi-channel** reward—not a single template answer.

**Is every ingredient globally novel?** No—and we do not claim otherwise. Inboxes and calendars are familiar. What *is* uncommon is the **composition**: OpenEnv-first packaging, **plain-text-only** observations, **data-driven** scenarios, **live dynamics** and **timed drift**, **dual** evaluation (**dense step rewards** + **trajectory graders** in strict `(0.01, 0.99)`), and a **production-shaped** action API—together—in one environment you can train and ship.

---

### 1. What Ghostexec is (one paragraph)

Ghostexec is an **executive chief-of-staff simulator**. Each episode starts from JSON scenario data under `../scenarios/`, selected by **task id** in `../openenv.yaml`. The **engine** lives in `../server/ghostexec_environment.py` and `../server/reward.py`; the **deployment contract** for Hugging Face / OpenEnv is `../openenv.yaml` (name **`ghostexec`**, FastAPI `server.app:app`, port **8000**). The model never sees raw scenario JSON as its primary observation: it sees a **rendered briefing**—the same class of messy, overlapping information a human would scan under time pressure.

---

### 2. What the agent observes (and why that matters)

After `reset` (or the WebSocket equivalent), the policy receives `GhostexecObservation.echoed_message`: a **single plain-text** block that includes, at minimum:

- A **timestamped header** (simulated “now”).
- **Unread emails** with priority, sender, relationship, subject, and a short preview.
- **Calendar conflicts** in a rolling horizon (overlaps the agent could resolve or worsen).
- **Top contacts** with **mood**, relationship type, and communication preference.
- **Tasks** that are overdue or due soon.
- **Executive stress** and **steps remaining** toward `max_steps` (see `../openenv.yaml`, default **20**).

**Why this matters for “challenging”:** many demos hide structure in JSON observations or tool schemas. Here, the **only** narrative state the model is supposed to “read” like a user is **natural language**, while the **law** of the world is still **typed actions**. That forces **comprehension + compliance** together—hallucinated ids and “vibes-only” plans fail in ways you can measure.
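Concretely, a grounded policy has to lift ids out of the text itself before it can act on them. Schematically (the briefing line below is invented for illustration; the real layout produced by `build_briefing_text` is richer and will differ):

```python
import re

# Invented briefing fragment. The id pattern (e.., m.., t..) matches the
# examples used in this repo's docs; the exact briefing layout is an
# assumption, not the actual build_briefing_text output.
briefing = "[09:00] Unread (e01) Marcus: Q3 figures. Conflict: (m02) overlaps (m03)."
entity_ids = re.findall(r"\b([emt]\d{2})\b", briefing)
print(entity_ids)  # -> ['e01', 'm02', 'm03']
```

An action referencing any id *not* recoverable this way is exactly the hallucination the environment penalises as an invalid step.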

---

### 3. What the agent can do (actions and legality)

Each step the agent returns **exactly one** `GhostexecAction` (`../models.py`): `reply_email`, `archive_email`, `reschedule_meeting`, `cancel_meeting`, `complete_task`, `delegate_task`, `send_message`, or `do_nothing`.

**Validity is enforced against the live world:** a wrong `email_id` / `meeting_id` / `task_id`, missing required fields, or impossible combinations produce an **invalid step**. The server **does not throw**; it returns structured metadata (`step_ok`, error text) so RL and HTTP clients can learn from mistakes instead of dying.

**Valid actions mutate state:** mail can be replied to or archived; meetings moved or cancelled; tasks completed or delegated; direct messages sent. The episode is therefore a **small transactional simulation**, not a static Q&A.

---

### 4. What changes between steps (dynamics and drift)

Ghostexec is **not** a static paragraph with a hidden answer key. After each action, the environment runs **post-step dynamics** (see `../server/ghostexec_environment.py`):

- **Clock:** simulation time advances (default **20 minutes** per step), which can flip tasks into overdue and change what “urgent” means.
- **Mood:** stakeholders move along a mood ladder after real actions (e.g. a thoughtful reply can improve a sender’s mood; cancelling a meeting can upset attendees).
- **Pressure on idle / invalid behavior:** if the agent **`do_nothing`**s or **fails** while **critical** mail is still unanswered, mood pressure can concentrate on the sender who is actually waiting—so “safe” inaction is not safe in the social graph.
- **Stress and conflicts:** the world rebuilds an **active conflict list** (overlaps, unanswered critical mail) and maps that into the **stress** value surfaced in the briefing—so calendar debt is not cosmetic.

**Scenario-driven schema drift:** harder JSON can schedule **`after_step`** events that reshuffle the world mid-episode: shift meetings, move deadlines, change communication preferences, **suppress relationship credit** for certain reply paths, or force moods. That tests **adaptation**, not memorization of the first screen.
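To make timed drift concrete, here is a hypothetical sketch of how an `after_step` event could be represented and triggered. The field names below are illustrative only; the actual schema is defined by the JSON under `../scenarios/` (see `schema_drift_test.json`).

```python
# Hypothetical drift-event shape; the real field names live in the scenario
# JSON schema, not in this sketch.
drift_event = {
    "after_step": 5,
    "effects": [
        {"type": "shift_meeting", "meeting_id": "m02", "minutes": 30},
        {"type": "force_mood", "contact": "Marcus", "mood": "frustrated"},
    ],
}


def due_events(events: list, step: int) -> list:
    """Events whose trigger step has been reached (fired once in practice)."""
    return [event for event in events if step >= event["after_step"]]


print(len(due_events([drift_event], step=5)))  # -> 1
```

The point is the timing contract: a policy that memorises the step-0 briefing misses the world reshuffling underneath it at step 5.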

---

### 5. How success is scored (two layers, on purpose)

**A. Dense step reward (training and fine-grained analysis)** — `../server/reward.py`
A **fixed** weighted core (**0.35 conflict + 0.35 relationship + 0.30 task**) plus **bounded** shaping terms (synergy, tradeoffs, progress-style shaping, scaffold, quality separation). Invalid steps and **`do_nothing`** are handled explicitly (idle is **penalised**, not neutral). Rich `RewardBreakdown` fields can be logged to `outputs/logs/episode_rewards.jsonl` (gitignored) for auditing *why* a step moved.

**B. Trajectory graders (OpenEnv / hackathon validation)** — `../graders.py`
Each public task in `../openenv.yaml` binds a grader (`graders.phase2_core_grader`, etc.). Graders read **trajectory-shaped** payloads (e.g. lists of rewards) and return scores **strictly inside `(0.01, 0.99)`**—the validator-facing layer—while the step engine remains the **dense teaching signal**.

That split is deliberate: **agents learn from fine structure**, **judges certify** with stable bounded scores.
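A minimal grader in that spirit might look like the sketch below. It shows only the contract (consume a trajectory, return a score strictly inside `(0.01, 0.99)`); the real per-task graders in `../graders.py` weigh trajectories differently.

```python
# Contract sketch for a trajectory grader: take a list of step rewards,
# return a score strictly inside (0.01, 0.99). Not the actual graders.py code.
def grade_trajectory(rewards: list) -> float:
    if not rewards:
        return 0.01
    mean = sum(rewards) / len(rewards)
    squashed = 0.5 + 0.5 * max(-1.0, min(1.0, mean))  # clip mean into [-1, 1]
    return min(0.99, max(0.01, squashed))


print(grade_trajectory([0.4, 0.6, -0.2]))
```

Clamping to an open interval keeps validator-facing scores stable even when the dense step reward spikes.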

---

### 6. The public task ladder (difficulty in *data*, not vibes)

| Task id | Difficulty | Scenario file | What gets harder |
|---------|------------|---------------|------------------|
| `phase2_core` | easy | `../scenarios/phase2_core.json` | Dense default triage: VIP mail, calendar relief, overlapping obligations. |
| `monday_morning` | medium | `../scenarios/monday_morning.json` | Stacked Monday rush: more concurrent fires, less slack. |
| `dinner_disaster` | hard | `../scenarios/dinner_disaster.json` | Personal vs professional collision with **escalation risk**. |

All of this is declared in **`../openenv.yaml`** so the Space, CLI, and notebooks agree on **names**, **ports**, and **grader wiring** without a second source of truth.

---

### 7. How a reviewer can verify (5-minute checklist)

1. Open **`../openenv.yaml`** — confirm three tasks, `max_steps`, `app: server.app:app`, **`name: ghostexec`**.
2. Open **`../scenarios/*.json`** — confirm episodes are **data**, not hardcoded Python lore.
3. Skim **`../server/ghostexec_environment.py`** — `build_briefing_text`, `_apply_action`, `_apply_post_action_dynamics`, `_maybe_apply_schema_drift_events`.
4. Skim **`../server/reward.py`** — fixed weights, invalid / idle handling, shaping caps.
5. Open **`../graders.py`** — strict output bounds and trajectory consumption.
6. Open the **public Space**: https://huggingface.co/spaces/modelbuilderhq/ghostexec — use `/docs` or `POST /reset` + `POST /step`: legal actions change state; illegal actions return errors, **not** stack traces.

---

### 8. Closing

**World quality.** The challenge is **interactional and operational**: overlapping human-style goals, strict tool use, evolving social signals, and mid-episode drift—**not** a single binary “did you answer correctly.”

**What this stack proves.** If you strip Ghostexec to one bullet, it is: **plain-text situational awareness + legal structured world edits + multi-channel rewards + timed scenario pressure + OpenEnv-native deployment and graders**—in one coherent package you can train, log, and host.

That is the **innovation case** this repository is built to defend.

---

## Key files (from repo root)

| Path | Role |
|------|------|
| `openenv.yaml` | Space name, port, tasks, graders, `max_steps` |
| `scenarios/*.json` | Episode **data** (world content, drift hooks) |
| `server/ghostexec_environment.py` | Briefing text, actions, dynamics, drift |
| `server/reward.py` | Step reward, fixed 0.35 / 0.35 / 0.30 core + shaping |
| `graders.py` | Trajectory scores in `(0.01, 0.99)` per task |
| `models.py` | `GhostexecAction`, `GhostexecObservation`, `RewardBreakdown` |

For install, tests, training scripts, and the rest of the hackathon submission, see the [main project README](../README.md).
server/app.py CHANGED
Unified diff of the change (README path discovery for the OpenEnv Playground sidebar):

```diff
@@ -28,6 +28,9 @@ Usage:
     python -m server.app
 """
 
+import os
+from pathlib import Path
+
 try:
     import openenv.core.env_server.http_server as _openenv_http
 except Exception as e:  # pragma: no cover
@@ -53,6 +56,28 @@ _openenv_http.serialize_observation = _ghostexec_serialize_observation
 
 from openenv.core.env_server.http_server import create_app  # noqa: E402
 
+
+def _configure_openenv_readme_path() -> None:
+    """OpenEnv Gradio sidebar loads README from /app/README.md or ENV_README_PATH only.
+
+    Our Docker layout copies the repo to /app/env/, so README.md lives at
+    /app/env/README.md. Set ENV_README_PATH before create_app so the Playground
+    shows the README instead of "No README available."
+    """
+    if os.environ.get("ENV_README_PATH"):
+        return
+    _here = Path(__file__).resolve()
+    for candidate in (
+        Path("/app/env/README.md"),  # HF Space / openenv Docker layout
+        _here.parent.parent / "README.md",  # repo root when running from source
+    ):
+        if candidate.is_file():
+            os.environ["ENV_README_PATH"] = str(candidate)
+            return
+
+
+_configure_openenv_readme_path()
+
 try:
     # Editable / normal install (package name `ghostexec`).
     from ghostexec.models import GhostexecAction, GhostexecObservation
```