Spaces: modelbuilderhq/ghostexec


modelbuilderhq committed 6 days ago

Commit f59df3f (verified) · 1 parent: 79bdfcd

Upload folder using huggingface_hub


Files changed (2)

notebooks/ghostexec_unsloth_grpo_hf_api.ipynb +792 -0
outputs/logs/api_dead_live_600.jsonl +398 -0

notebooks/ghostexec_unsloth_grpo_hf_api.ipynb ADDED

	@@ -0,0 +1,792 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# Ghostexec — Unsloth + TRL GRPO against the deployed HF Space API\n",
+        "\n",
+        "Post-train `unsloth/Llama-3.2-3B-Instruct` with GRPO where every reward is fetched over HTTP from the **live** Ghostexec OpenEnv Space.\n",
+        "\n",
+        "- Live endpoint: `https://modelbuilderhq-ghostexec.hf.space`\n",
+        "- Algorithm: TRL `0.22.2` `GRPOTrainer` (no vLLM — HF `generate()` path)\n",
+        "- Base: `unsloth/Llama-3.2-3B-Instruct` (4-bit) + LoRA r=16 + bf16\n",
+        "- Curriculum: exploration schedule across three stages (T=1.0 → 0.7 → 0.5)\n",
+        "- Rewards: three **independent** functions — `env_reward` (live Space) / `format_reward` / `anti_idle_reward`\n",
+        "\n",
+        "### Help Guide phase map (notebook sections mirror `[Participant Help Guide] §18`)\n",
+        "| Phase | Where |\n",
+        "|---|---|\n",
+        "| 1 Pick a narrow task | section 1 |\n",
+        "| 2 Build the environment | section 2 (already deployed; health check here) |\n",
+        "| 3 Build rewards | section 3 |\n",
+        "| 4 Deploy | section 4 (confirm) |\n",
+        "| 5 Train small | section 5 (Stage B) |\n",
+        "| 6 Inspect for hacking | section 6 |\n",
+        "| 7 Add curriculum | section 7 (Stages C + D) |\n",
+        "| 8 Train bigger | section 8 (knobs, not action) |\n",
+        "| 9 Save and demo | section 9 |"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 1 — Pick a narrow task\n",
+        "\n",
+        "Single-step action selection from a plain-text executive briefing. The model reads the briefing from `/reset` and must emit exactly one JSON action matching `GhostexecAction`. The deployed Space scores that action and returns a reward from `/step`. That reward is the learning signal.\n",
+        "\n",
+        "Legal `action_type` values: `reply_email, archive_email, reschedule_meeting, cancel_meeting, complete_task, delegate_task, send_message, do_nothing`.\n",
+        "\n",
+        "The scenario is fixed on the deployed Space (`phase2_core`), so the curriculum is an **exploration schedule** (temperature / num_generations / learning rate) across three training stages rather than a scenario switch."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 2 — Build the environment (already deployed on HF Spaces)\n",
+        "\n",
+        "The next cell is the exact Unsloth install snippet. Restart the runtime after it finishes if Colab asks you to."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "%%capture\n",
+        "import os, importlib.util\n",
+        "!pip install --upgrade -qqq uv\n",
+        "if importlib.util.find_spec(\"torch\") is None or \"COLAB_\" in \"\".join(os.environ.keys()):\n",
+        "    try: import numpy; get_numpy = f\"numpy=={numpy.__version__}\"\n",
+        "    except: get_numpy = \"numpy\"\n",
+        "    !uv pip install -qqq \\\n",
+        "        \"torch>=2.8.0\" \"triton>=3.4.0\" {get_numpy} torchvision bitsandbytes \"transformers==4.56.2\" trackio \\\n",
+        "        \"unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo\" \\\n",
+        "        \"unsloth[base] @ git+https://github.com/unslothai/unsloth\" \\\n",
+        "        git+https://github.com/triton-lang/triton.git@0add68262ab0a2e33b84524346cb27cbb2787356#subdirectory=python/triton_kernels\n",
+        "elif importlib.util.find_spec(\"unsloth\") is None:\n",
+        "    !uv pip install -qqq unsloth trackio\n",
+        "!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "%pip install -q requests pydantic matplotlib pandas tqdm huggingface_hub datasets\n",
+        "print(\"aux deps installed\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import os, sys, json, time, random, re, math, pathlib\n",
+        "from typing import Any\n",
+        "\n",
+        "GHOSTEXEC_ENV_URL = os.environ.get(\"GHOSTEXEC_ENV_URL\", \"https://modelbuilderhq-ghostexec.hf.space\")\n",
+        "MODEL_ID          = os.environ.get(\"MODEL_ID\", \"unsloth/Llama-3.2-3B-Instruct\")\n",
+        "RUN_NAME          = os.environ.get(\"RUN_NAME\", \"ghostexec-unsloth-grpo\")\n",
+        "HUB_REPO_ID       = os.environ.get(\"HUB_REPO_ID\", \"\")\n",
+        "OUT = pathlib.Path(\"/content/ghostexec_out\") if os.path.exists(\"/content\") else pathlib.Path(\"./ghostexec_out\")\n",
+        "OUT.mkdir(parents=True, exist_ok=True)\n",
+        "\n",
+        "try:\n",
+        "    from google.colab import userdata  # type: ignore\n",
+        "    if not os.environ.get(\"HF_TOKEN\"):\n",
+        "        try: os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\") or \"\"\n",
+        "        except Exception: pass\n",
+        "except Exception:\n",
+        "    pass\n",
+        "\n",
+        "print(\"Endpoint :\", GHOSTEXEC_ENV_URL)\n",
+        "print(\"Model    :\", MODEL_ID)\n",
+        "print(\"Output   :\", OUT)\n",
+        "print(\"HF token :\", \"set\" if os.environ.get(\"HF_TOKEN\") else \"missing (needed only for push_to_hub)\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 2.1 HTTP client to the deployed Space\n",
+        "\n",
+        "Every reward in this notebook comes from this class — we never run Ghostexec locally."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import requests\n",
+        "\n",
+        "class GhostexecSpace:\n",
+        "    def __init__(self, url: str, timeout: float = 60.0, max_retries: int = 4):\n",
+        "        self.url = url.rstrip(\"/\")\n",
+        "        self.timeout = timeout\n",
+        "        self.max_retries = max_retries\n",
+        "        self.latency_ms: list[float] = []\n",
+        "\n",
+        "    def _post(self, path: str, payload: dict) -> dict:\n",
+        "        last_err: Exception | None = None\n",
+        "        for attempt in range(self.max_retries):\n",
+        "            try:\n",
+        "                t0 = time.perf_counter()\n",
+        "                r = requests.post(f\"{self.url}{path}\", json=payload, timeout=self.timeout)\n",
+        "                self.latency_ms.append((time.perf_counter() - t0) * 1000.0)\n",
+        "                r.raise_for_status()\n",
+        "                return r.json()\n",
+        "            except Exception as e:\n",
+        "                last_err = e\n",
+        "                if attempt < self.max_retries - 1:  # skip the pointless sleep after the final attempt\n",
+        "                    time.sleep(min(2 ** attempt, 8.0))\n",
+        "        raise RuntimeError(f\"POST {path} failed after {self.max_retries} tries: {last_err}\")\n",
+        "\n",
+        "    def reset(self) -> dict:\n",
+        "        return self._post(\"/reset\", {})\n",
+        "\n",
+        "    def step(self, action: dict) -> tuple[float, dict]:\n",
+        "        raw = self._post(\"/step\", {\"action\": action})\n",
+        "        reward = raw.get(\"reward\")\n",
+        "        if reward is None:\n",
+        "            reward = (raw.get(\"observation\") or {}).get(\"reward\", 0.0)\n",
+        "        try:    return float(reward), raw\n",
+        "        except Exception: return 0.0, raw\n",
+        "\n",
+        "env = GhostexecSpace(GHOSTEXEC_ENV_URL)\n",
+        "print(\"Health reset ...\")\n",
+        "_obs = env.reset()\n",
+        "print(\"reset keys:\", sorted(_obs.keys()))\n",
+        "_brief = ((_obs.get(\"observation\") or _obs).get(\"echoed_message\") or \"\")[:400]\n",
+        "print(\"briefing preview:\\n\", _brief)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
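+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "A minimal sketch for reading the `latency_ms` samples that `GhostexecSpace._post` records, useful for spotting a cold or throttled Space before a training stage. The helper name `latency_summary` is hypothetical, and the nearest-rank p95 is just one reasonable choice of percentile rule."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import math, statistics\n",
+        "\n",
+        "def latency_summary(samples_ms: list[float]) -> dict:\n",
+        "    # Hypothetical helper (not part of the Space API): summarise recorded round-trip times.\n",
+        "    if not samples_ms:\n",
+        "        return {\"n\": 0}\n",
+        "    ordered = sorted(samples_ms)\n",
+        "    # Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.\n",
+        "    rank = max(1, math.ceil(0.95 * len(ordered)))\n",
+        "    return {\n",
+        "        \"n\": len(ordered),\n",
+        "        \"median_ms\": round(statistics.median(ordered), 1),\n",
+        "        \"p95_ms\": round(ordered[rank - 1], 1),\n",
+        "        \"max_ms\": round(ordered[-1], 1),\n",
+        "    }\n",
+        "\n",
+        "# `env` is the GhostexecSpace client created in the cell above.\n",
+        "print(latency_summary(env.latency_ms))"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },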
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 2.2 Verifier sanity check (Help Guide §8)\n",
+        "\n",
+        "Fire every legal `action_type` once against the deployed Space. If the rewards are all identical, or `do_nothing` is not the lowest-scoring action, abort: GRPO cannot learn from a degenerate verifier."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "LEGAL_ACTION_TYPES = [\n",
+        "    \"reply_email\", \"archive_email\", \"reschedule_meeting\", \"cancel_meeting\",\n",
+        "    \"complete_task\", \"delegate_task\", \"send_message\", \"do_nothing\",\n",
+        "]\n",
+        "\n",
+        "def _smoke_action(action_type: str) -> dict:\n",
+        "    return {\n",
+        "        \"action_type\":   action_type,\n",
+        "        \"email_id\":      \"email_01\" if \"email\" in action_type else \"\",\n",
+        "        \"message_body\":  \"Acknowledged. Will follow up shortly.\",\n",
+        "        \"meeting_id\":    \"meeting_01\" if \"meeting\" in action_type else \"\",\n",
+        "        \"new_time\":      \"2025-01-02T15:00:00\" if action_type == \"reschedule_meeting\" else \"\",\n",
+        "        \"reason\":        \"scheduling conflict\",\n",
+        "        \"task_id\":       \"task_01\" if \"task\" in action_type else \"\",\n",
+        "        \"contact_name\": \"Alex\",\n",
+        "        \"message\":       \"\",\n",
+        "    }\n",
+        "\n",
+        "rewards_by_action: dict[str, float] = {}\n",
+        "for at in LEGAL_ACTION_TYPES:\n",
+        "    env.reset()\n",
+        "    r, _ = env.step(_smoke_action(at))\n",
+        "    rewards_by_action[at] = round(r, 4)\n",
+        "print(json.dumps(rewards_by_action, indent=2))\n",
+        "\n",
+        "uniq = set(rewards_by_action.values())\n",
+        "assert len(uniq) > 1, \"Verifier is constant across actions — env can't teach anything.\"\n",
+        "assert rewards_by_action[\"do_nothing\"] <= min(rewards_by_action.values()) + 1e-6, \\\n",
+        "    \"do_nothing is not the worst/floor — reward shape probably broken.\"\n",
+        "print(\"\\nverifier OK — rewards are discriminating and do_nothing is the floor.\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 3 — Build rewards\n",
+        "\n",
+        "Three independent reward functions per Help Guide §7. Keeping them independent means we can plot each component, watch their correlations, and catch hacking (e.g. env reward climbs while format reward collapses)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from pydantic import BaseModel\n",
+        "from typing import Literal\n",
+        "\n",
+        "GhostexecActionType = Literal[\n",
+        "    \"reply_email\", \"archive_email\", \"reschedule_meeting\", \"cancel_meeting\",\n",
+        "    \"complete_task\", \"delegate_task\", \"send_message\", \"do_nothing\",\n",
+        "]\n",
+        "\n",
+        "class GhostexecAction(BaseModel):\n",
+        "    action_type:   GhostexecActionType = \"do_nothing\"\n",
+        "    email_id:      str = \"\"\n",
+        "    message_body:  str = \"\"\n",
+        "    meeting_id:    str = \"\"\n",
+        "    new_time:      str = \"\"\n",
+        "    reason:        str = \"\"\n",
+        "    task_id:       str = \"\"\n",
+        "    contact_name: str = \"\"\n",
+        "    message:       str = \"\"\n",
+        "\n",
+        "def _extract_json(text: str) -> dict:\n",
+        "    s = text.strip()\n",
+        "    s = re.sub(r\"^```(?:json)?\\s*|\\s*```$\", \"\", s, flags=re.IGNORECASE | re.MULTILINE).strip()\n",
+        "    start, end = s.find(\"{\"), s.rfind(\"}\")\n",
+        "    if start == -1 or end <= start: raise ValueError(\"no json object\")\n",
+        "    return json.loads(s[start:end+1])\n",
+        "\n",
+        "def parse_action_strict(text: str) -> dict:\n",
+        "    obj = _extract_json(text)\n",
+        "    GhostexecAction(**obj)\n",
+        "    return obj\n",
+        "\n",
+        "def parse_action(text: str) -> dict:\n",
+        "    try: return parse_action_strict(text)\n",
+        "    except Exception: return {\"action_type\": \"do_nothing\"}\n",
+        "\n",
+        "assert parse_action_strict('```json\\n{\"action_type\":\"archive_email\",\"email_id\":\"email_01\"}\\n```')[\"action_type\"] == \"archive_email\"\n",
+        "assert parse_action(\"garbage\")[\"action_type\"] == \"do_nothing\"\n",
+        "print(\"parser OK\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "def _completion_text(c) -> str:\n",
+        "    if isinstance(c, list) and c and isinstance(c[0], dict):\n",
+        "        return c[0].get(\"content\", \"\")\n",
+        "    return c if isinstance(c, str) else str(c)\n",
+        "\n",
+        "def env_reward(completions, prompts=None, **_) -> list[float]:\n",
+        "    out: list[float] = []\n",
+        "    for c in completions:\n",
+        "        text = _completion_text(c)\n",
+        "        action = parse_action(text)\n",
+        "        try:\n",
+        "            env.reset()\n",
+        "            r, _ = env.step(action)\n",
+        "        except Exception:\n",
+        "            r = -1.0\n",
+        "        out.append(float(r))\n",
+        "    return out\n",
+        "\n",
+        "def format_reward(completions, **_) -> list[float]:\n",
+        "    out: list[float] = []\n",
+        "    for c in completions:\n",
+        "        text = _completion_text(c)\n",
+        "        try:\n",
+        "            parse_action_strict(text); out.append(0.1)\n",
+        "        except Exception:\n",
+        "            out.append(-0.1)\n",
+        "    return out\n",
+        "\n",
+        "def anti_idle_reward(completions, **_) -> list[float]:\n",
+        "    out: list[float] = []\n",
+        "    for c in completions:\n",
+        "        text = _completion_text(c)\n",
+        "        act = parse_action(text)\n",
+        "        out.append(-0.05 if act.get(\"action_type\") == \"do_nothing\" else 0.0)\n",
+        "    return out\n",
+        "\n",
+        "_dummy = '{\"action_type\":\"archive_email\",\"email_id\":\"email_01\"}'\n",
+        "print(\"format   :\", format_reward([_dummy]))\n",
+        "print(\"anti_idle:\", anti_idle_reward([_dummy]))"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
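+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "A worked example of how the three components might reach the optimizer. This sketch assumes TRL's default behaviour of summing the reward functions with unit weights and normalising within each group of `num_generations` completions by the group mean and sample standard deviation. The reward values are made up for illustration, and `group_advantages` is a hypothetical helper, not a TRL API.\n",
+        "\n",
+        "With a group of two, the normalised advantages come out close to +0.71 and -0.71 whenever the two totals differ, and 0 when they tie, so only the ordering within the pair carries signal."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import statistics\n",
+        "\n",
+        "def group_advantages(totals: list[float], eps: float = 1e-4) -> list[float]:\n",
+        "    # Group-relative advantage: (reward - group mean) / (group sample std + eps).\n",
+        "    mean = statistics.fmean(totals)\n",
+        "    std = statistics.stdev(totals) if len(totals) > 1 else 0.0\n",
+        "    return [(t - mean) / (std + eps) for t in totals]\n",
+        "\n",
+        "# Illustrative (made-up) component rewards for one prompt with two completions.\n",
+        "env_r  = [0.40, -0.20]\n",
+        "fmt_r  = [0.10, -0.10]\n",
+        "idle_r = [0.00, -0.05]\n",
+        "totals = [e + f + i for e, f, i in zip(env_r, fmt_r, idle_r)]\n",
+        "print(\"summed rewards:\", [round(t, 3) for t in totals])\n",
+        "print(\"advantages    :\", [round(a, 3) for a in group_advantages(totals)])"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },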
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from transformers import TrainerCallback\n",
+        "\n",
+        "class HackingTripwire(TrainerCallback):\n",
+        "    \"\"\"Stop training on mode collapse or reward-format divergence (Help Guide §8).\"\"\"\n",
+        "    def __init__(self, min_unique_ratio: float = 0.2):\n",
+        "        self.min_unique_ratio = min_unique_ratio\n",
+        "\n",
+        "    def on_log(self, args, state, control, logs=None, **kw):\n",
+        "        logs = logs or {}\n",
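+        "        # Explicit None checks so that a logged ratio of 0.0 is not skipped as falsy.\n",
+        "        uniq = logs.get(\"completions/unique_ratio\")\n",
+        "        if uniq is None: uniq = logs.get(\"completions/mean_unique\")\n",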
+        "        env_r = logs.get(\"rewards/env_reward/mean\")\n",
+        "        fmt_r = logs.get(\"rewards/format_reward/mean\")\n",
+        "        if uniq is not None and uniq < self.min_unique_ratio:\n",
+        "            print(f\"[TRIPWIRE] unique_ratio={uniq:.2f} < {self.min_unique_ratio} — stopping.\")\n",
+        "            control.should_training_stop = True\n",
+        "        if env_r is not None and fmt_r is not None and env_r > 0.8 and fmt_r < 0.0:\n",
+        "            print(f\"[TRIPWIRE] env_r={env_r:.2f} but fmt_r={fmt_r:.2f} — possible hack. stopping.\")\n",
+        "            control.should_training_stop = True"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
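+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "`HackingTripwire` looks for a `completions/unique_ratio` (or `completions/mean_unique`) entry in the trainer logs. If the trainer does not emit those keys, a comparable number can be computed directly from a batch of decoded completions. A minimal sketch: `unique_ratio` is a hypothetical helper, and collapsing whitespace before comparing is our own choice."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "def unique_ratio(completions: list[str]) -> float:\n",
+        "    # Share of distinct completions after collapsing whitespace; 1.0 means all differ.\n",
+        "    if not completions:\n",
+        "        return 0.0\n",
+        "    normalised = {\" \".join(c.split()) for c in completions}\n",
+        "    return len(normalised) / len(completions)\n",
+        "\n",
+        "# Three identical idle actions plus one distinct action: two distinct strings out of four.\n",
+        "_idle = '{\"action_type\":\"do_nothing\"}'\n",
+        "_archive = '{\"action_type\":\"archive_email\",\"email_id\":\"email_01\"}'\n",
+        "print(\"unique ratio:\", unique_ratio([_idle, _idle, _idle, _archive]))"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },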
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 4 — Deploy\n",
+        "\n",
+        "Already done. Live Space: [`modelbuilderhq/ghostexec`](https://huggingface.co/spaces/modelbuilderhq/ghostexec). The health-check cell above confirmed `/reset` + `/step` are green."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 5 — Train small\n",
+        "\n",
+        "Load `unsloth/Llama-3.2-3B-Instruct` in 4-bit with Unsloth, attach LoRA, then run one **short** GRPO stage to prove the loop works end-to-end. vLLM is not used anywhere in this notebook — rollouts go through the standard HF `generate()` path inside `GRPOTrainer`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "# IMPORTANT: import unsloth before transformers so its kernels patch cleanly.\n",
+        "from unsloth import FastLanguageModel\n",
+        "import torch\n",
+        "\n",
+        "MAX_SEQ_LENGTH = 2048\n",
+        "\n",
+        "policy, tokenizer = FastLanguageModel.from_pretrained(\n",
+        "    model_name=MODEL_ID,\n",
+        "    max_seq_length=MAX_SEQ_LENGTH,\n",
+        "    load_in_4bit=True,\n",
+        "    dtype=None,                 # auto-detect: bf16 on Ampere or newer GPUs, fp16 on older ones such as T4\n",
+        ")\n",
+        "\n",
+        "policy = FastLanguageModel.get_peft_model(\n",
+        "    policy,\n",
+        "    r=16, lora_alpha=32, lora_dropout=0.0,\n",
+        "    target_modules=[\"q_proj\",\"k_proj\",\"v_proj\",\"o_proj\",\"gate_proj\",\"up_proj\",\"down_proj\"],\n",
+        "    bias=\"none\",\n",
+        "    use_gradient_checkpointing=\"unsloth\",\n",
+        "    random_state=3407,\n",
+        ")\n",
+        "\n",
+        "if tokenizer.pad_token is None:\n",
+        "    tokenizer.pad_token = tokenizer.eos_token\n",
+        "tokenizer.padding_side = \"left\"\n",
+        "\n",
+        "print(\"policy loaded:\", MODEL_ID)\n",
+        "policy.print_trainable_parameters()"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "SYSTEM_PROMPT = (\n",
+        "    \"You are Ghostexec, an AI Chief of Staff. You receive a plain-text briefing of an executive's \"\n",
+        "    \"inbox, calendar and tasks. You must choose the single best next action.\\n\\n\"\n",
+        "    \"Legal action_type values: reply_email, archive_email, reschedule_meeting, cancel_meeting, \"\n",
+        "    \"complete_task, delegate_task, send_message, do_nothing.\\n\\n\"\n",
+        "    \"Output ONLY a compact JSON object with these keys (no prose, no code fences):\\n\"\n",
+        "    \"{\\\"action_type\\\": <one of the legal values>, \\\"email_id\\\": \\\"\\\", \\\"message_body\\\": \\\"\\\", \"\n",
+        "    \"\\\"meeting_id\\\": \\\"\\\", \\\"new_time\\\": \\\"\\\", \\\"reason\\\": \\\"\\\", \\\"task_id\\\": \\\"\\\", \"\n",
+        "    \"\\\"contact_name\\\": \\\"\\\", \\\"message\\\": \\\"\\\"}.\\n\\n\"\n",
+        "    \"Rules: prioritise VIP/board/critical items, match tone to sender mood, never choose do_nothing \"\n",
+        "    \"if any critical item is unresolved.\"\n",
+        ")\n",
+        "\n",
+        "def build_prompt(briefing: str) -> list[dict]:\n",
+        "    return [\n",
+        "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+        "        {\"role\": \"user\",   \"content\": f\"BRIEFING:\\n{briefing}\\n\\nReturn one JSON action.\"},\n",
+        "    ]\n",
+        "\n",
+        "def render_chat(messages: list[dict]) -> str:\n",
+        "    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from datasets import Dataset\n",
+        "from tqdm.auto import tqdm\n",
+        "\n",
+        "def fetch_briefing() -> str:\n",
+        "    obs = env.reset()\n",
+        "    inner = obs.get(\"observation\") or obs\n",
+        "    brief = inner.get(\"echoed_message\") or inner.get(\"message\") or \"\"\n",
+        "    if not brief:\n",
+        "        raise RuntimeError(f\"Space returned no briefing: keys={list(inner.keys())}\")\n",
+        "    return brief\n",
+        "\n",
+        "N_BRIEFINGS = int(os.environ.get(\"N_BRIEFINGS\", \"24\"))\n",
+        "briefings: list[str] = []\n",
+        "for _ in tqdm(range(N_BRIEFINGS), desc=\"sampling /reset\"):\n",
+        "    briefings.append(fetch_briefing())\n",
+        "\n",
+        "print(f\"fetched {len(briefings)} briefings ({len(set(briefings))} unique)\")\n",
+        "train_ds = Dataset.from_list([{\"prompt\": build_prompt(b)} for b in briefings])\n",
+        "print(train_ds)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 5.1 Baselines — random policy + frozen model (Help Guide §19)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "N_EVAL = int(os.environ.get(\"N_EVAL\", \"8\"))\n",
+        "\n",
+        "def random_policy_reward() -> list[float]:\n",
+        "    rs: list[float] = []\n",
+        "    for _ in range(N_EVAL):\n",
+        "        at = random.choice(LEGAL_ACTION_TYPES)\n",
+        "        env.reset()\n",
+        "        r, _ = env.step(_smoke_action(at))\n",
+        "        rs.append(r)\n",
+        "    return rs\n",
+        "\n",
+        "@torch.no_grad()\n",
+        "def evaluate_policy(model, n: int = N_EVAL, temperature: float = 0.2) -> list[float]:\n",
+        "    FastLanguageModel.for_inference(model)\n",
+        "    rs: list[float] = []\n",
+        "    for i in range(n):\n",
+        "        brief = briefings[i % len(briefings)]\n",
+        "        prompt_text = render_chat(build_prompt(brief))\n",
+        "        inputs = tokenizer(prompt_text, return_tensors=\"pt\", truncation=True, max_length=MAX_SEQ_LENGTH).to(model.device)\n",
+        "        out = model.generate(\n",
+        "            **inputs,\n",
+        "            max_new_tokens=128,\n",
+        "            do_sample=(temperature > 0),\n",
+        "            temperature=max(temperature, 1e-5),\n",
+        "            pad_token_id=tokenizer.pad_token_id,\n",
+        "        )\n",
+        "        completion = tokenizer.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
+        "        action = parse_action(completion)\n",
+        "        env.reset()\n",
+        "        r, _ = env.step(action)\n",
+        "        rs.append(r)\n",
+        "    FastLanguageModel.for_training(model)\n",
+        "    return rs\n",
+        "\n",
+        "print(\"Random baseline ...\")\n",
+        "random_rewards = random_policy_reward()\n",
+        "print(\" mean:\", sum(random_rewards) / len(random_rewards))\n",
+        "\n",
+        "print(\"Frozen-base baseline ...\")\n",
+        "frozen_rewards = evaluate_policy(policy, n=N_EVAL, temperature=0.2)\n",
+        "print(\" mean:\", sum(frozen_rewards) / len(frozen_rewards))\n",
+        "\n",
+        "baselines = {\"random\": random_rewards, \"frozen\": frozen_rewards}"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "### 5.2 Stage B — first GRPO stage (broad exploration, short)\n",
+        "\n",
+        "T=1.0, num_generations=2, max_steps=20. Purpose: prove the training loop runs, the Space is reachable from the training process, and rewards move."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "from trl import GRPOConfig, GRPOTrainer\n",
+        "\n",
+        "reward_funcs = [env_reward, format_reward, anti_idle_reward]\n",
+        "stage_logs: dict[str, list[dict]] = {}\n",
+        "\n",
+        "def grpo_config(name: str, *, temperature: float, num_generations: int, max_steps: int, lr: float) -> GRPOConfig:\n",
+        "    return GRPOConfig(\n",
+        "        output_dir=str(OUT / f\"stage_{name}\"),\n",
+        "        per_device_train_batch_size=1,\n",
+        "        gradient_accumulation_steps=4,\n",
+        "        num_generations=num_generations,\n",
+        "        max_prompt_length=1920,\n",
+        "        max_completion_length=128,\n",
+        "        temperature=temperature,\n",
+        "        learning_rate=lr,\n",
+        "        beta=0.04,\n",
+        "        max_steps=max_steps,\n",
+        "        logging_steps=1,\n",
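+        "        bf16=torch.cuda.is_bf16_supported(),      # bf16 needs Ampere or newer\n",
+        "        fp16=not torch.cuda.is_bf16_supported(),  # fall back to fp16 on GPUs such as T4\n",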
+        "        report_to=\"none\",\n",
+        "        save_strategy=\"no\",\n",
+        "        remove_unused_columns=False,\n",
+        "        log_completions=True,\n",
+        "    )\n",
+        "\n",
+        "def run_stage(name: str, **kw) -> None:\n",
+        "    print(f\"\\n=== Stage {name} → {kw} ===\")\n",
+        "    trainer = GRPOTrainer(\n",
+        "        model=policy,\n",
+        "        args=grpo_config(name, **kw),\n",
+        "        train_dataset=train_ds,\n",
+        "        reward_funcs=reward_funcs,\n",
+        "        processing_class=tokenizer,\n",
+        "        callbacks=[HackingTripwire()],\n",
+        "    )\n",
+        "    trainer.train()\n",
+        "    stage_logs[name] = list(trainer.state.log_history)\n",
+        "    adapter_dir = OUT / f\"adapter_stage_{name}\"\n",
+        "    trainer.model.save_pretrained(adapter_dir)\n",
+        "    tokenizer.save_pretrained(adapter_dir)\n",
+        "    print(f\"stage {name} adapter → {adapter_dir}\")\n",
+        "\n",
+        "run_stage(\"B\", temperature=1.0, num_generations=2, max_steps=20, lr=5e-6)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 6 — Inspect for hacking\n",
+        "\n",
+        "Don't trust the mean reward alone. Sample six post-Stage-B completions, parse them, hit the Space live, and print the full trio (completion / parsed action / reward). Look for obviously pathological outputs (repeated identical JSON, prose-only outputs, empty fields)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "FastLanguageModel.for_inference(policy)\n",
+        "for i in range(6):\n",
+        "    brief = briefings[i % len(briefings)]\n",
+        "    prompt_text = render_chat(build_prompt(brief))\n",
+        "    inputs = tokenizer(prompt_text, return_tensors=\"pt\", truncation=True, max_length=MAX_SEQ_LENGTH).to(policy.device)\n",
+        "    out = policy.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7,\n",
+        "                         pad_token_id=tokenizer.pad_token_id)\n",
+        "    completion = tokenizer.decode(out[0][inputs[\"input_ids\"].shape[1]:], skip_special_tokens=True)\n",
+        "    act = parse_action(completion)\n",
+        "    env.reset(); r, _ = env.step(act)\n",
+        "    print(f\"\\n--- sample {i} ---\")\n",
+        "    print(\"completion:\", completion.strip()[:200])\n",
+        "    print(\"parsed    :\", json.dumps(act))\n",
+        "    print(\"reward    :\", round(r, 4))\n",
+        "FastLanguageModel.for_training(policy)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 7 — Add curriculum\n",
+        "\n",
+        "The deployed Space scenario is fixed, so the curriculum is an **exploration schedule**: Stage C exploits what Stage B found (T=0.7) and Stage D hardens (T=0.5, lower lr)."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "run_stage(\"C\", temperature=0.7, num_generations=2, max_steps=25, lr=5e-6)\n",
+        "run_stage(\"D\", temperature=0.5, num_generations=2, max_steps=15, lr=2e-6)"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 8 — Train bigger (knobs, not action)\n",
+        "\n",
+        "Only after the loop is stable should you scale. If you rent an L4 or A100 with HF credits:\n",
+        "\n",
+        "- `MODEL_ID` → `unsloth/Qwen3-4B-Instruct-2507` or `unsloth/Llama-3.1-8B-Instruct`\n",
+        "- `N_BRIEFINGS` ↑ (more prompt diversity)\n",
+        "- `num_generations` ↑ and `max_steps` ↑ (more rollouts per prompt, more updates)\n",
+        "\n",
+        "All other cells are unchanged. Don't add features until you've watched a full stable run on this small config."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Phase 9 — Save and demo\n",
+        "\n",
+        "Re-evaluate on the same `N_EVAL` prompts, plot the before/after + reward curves, save the LoRA adapter (no 4-bit merge per Help Guide §16), and write a compliance manifest."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "print(\"Evaluating trained policy ...\")\n",
+        "trained_rewards = evaluate_policy(policy, n=N_EVAL, temperature=0.2)\n",
+        "print(\" trained mean:\", sum(trained_rewards) / len(trained_rewards))\n",
+        "\n",
+        "def _mean(xs): return sum(xs) / max(len(xs), 1)\n",
+        "summary = {\n",
+        "    \"random\":  _mean(baselines[\"random\"]),\n",
+        "    \"frozen\":  _mean(baselines[\"frozen\"]),\n",
+        "    \"trained\": _mean(trained_rewards),\n",
+        "}\n",
+        "print(json.dumps(summary, indent=2))"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
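+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "With only `N_EVAL` episodes per condition, the means above are noisy. A minimal sketch that reports each mean with its standard error so that small differences are not over-read. `mean_and_sem` is a hypothetical helper, and it reuses `baselines` and `trained_rewards` from the cells above."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import statistics\n",
+        "\n",
+        "def mean_and_sem(xs: list[float]) -> tuple[float, float]:\n",
+        "    # Mean and standard error of the mean (sample std / sqrt(n)); SEM is 0.0 when n < 2.\n",
+        "    n = len(xs)\n",
+        "    if n == 0:\n",
+        "        return 0.0, 0.0\n",
+        "    m = statistics.fmean(xs)\n",
+        "    sem = statistics.stdev(xs) / n ** 0.5 if n > 1 else 0.0\n",
+        "    return m, sem\n",
+        "\n",
+        "for label, xs in {**baselines, \"trained\": trained_rewards}.items():\n",
+        "    m, sem = mean_and_sem(xs)\n",
+        "    print(f\"{label:8s} mean={m:+.3f} +/- {sem:.3f} (n={len(xs)})\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },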
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "import pandas as pd, matplotlib.pyplot as plt\n",
+        "\n",
+        "plt.figure(figsize=(6, 4))\n",
+        "plt.bar(list(summary.keys()), list(summary.values()), color=[\"#888\", \"#1f77b4\", \"#2ca02c\"])\n",
+        "plt.title(\"Ghostexec: mean reward vs deployed HF Space\")\n",
+        "plt.ylabel(\"mean episode reward (higher is better)\")\n",
+        "plt.axhline(0.0, color=\"black\", linewidth=0.5)\n",
+        "plt.tight_layout()\n",
+        "plt.savefig(OUT / \"before_after.png\", dpi=150)\n",
+        "plt.show()\n",
+        "\n",
+        "rows = []\n",
+        "step_counter = 0\n",
+        "for name, log in stage_logs.items():\n",
+        "    for entry in log:\n",
+        "        r = entry.get(\"rewards/env_reward/mean\", entry.get(\"reward\"))\n",
+        "        if r is None: continue\n",
+        "        step_counter += 1\n",
+        "        rows.append({\n",
+        "            \"stage\": name, \"global_step\": step_counter, \"env\": r,\n",
+        "            \"fmt\":  entry.get(\"rewards/format_reward/mean\", 0.0),\n",
+        "            \"idle\": entry.get(\"rewards/anti_idle_reward/mean\", 0.0),\n",
+        "        })\n",
+        "df = pd.DataFrame(rows)\n",
+        "df.to_csv(OUT / \"reward_log.csv\", index=False)\n",
+        "\n",
+        "if not df.empty:\n",
+        "    plt.figure(figsize=(8, 4))\n",
+        "    for name, sub in df.groupby(\"stage\"):\n",
+        "        plt.plot(sub[\"global_step\"], sub[\"env\"], label=f\"stage {name}\")\n",
+        "    plt.xlabel(\"global step\"); plt.ylabel(\"mean env_reward\")\n",
+        "    plt.title(\"Ghostexec GRPO — reward vs step (Unsloth)\")\n",
+        "    plt.legend(); plt.tight_layout()\n",
+        "    plt.savefig(OUT / \"reward_curve.png\", dpi=150); plt.show()\n",
+        "\n",
+        "    plt.figure(figsize=(8, 4))\n",
+        "    plt.plot(df[\"global_step\"], df[\"env\"],  label=\"env_reward\")\n",
+        "    plt.plot(df[\"global_step\"], df[\"fmt\"],  label=\"format_reward\")\n",
+        "    plt.plot(df[\"global_step\"], df[\"idle\"], label=\"anti_idle_reward\")\n",
+        "    plt.xlabel(\"global step\"); plt.ylabel(\"mean component reward\")\n",
+        "    plt.title(\"Reward components — hacking-watch\")\n",
+        "    plt.legend(); plt.tight_layout()\n",
+        "    plt.savefig(OUT / \"components.png\", dpi=150); plt.show()\n",
+        "else:\n",
+        "    print(\"No numeric reward log found — skipping curve plots.\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "final_adapter = OUT / \"adapter_final\"\n",
+        "policy.save_pretrained(final_adapter)\n",
+        "tokenizer.save_pretrained(final_adapter)\n",
+        "print(\"final adapter →\", final_adapter)\n",
+        "\n",
+        "if HUB_REPO_ID and os.environ.get(\"HF_TOKEN\"):\n",
+        "    from huggingface_hub import HfApi, login\n",
+        "    login(token=os.environ[\"HF_TOKEN\"], add_to_git_credential=False)\n",
+        "    policy.push_to_hub(HUB_REPO_ID, commit_message=f\"ghostexec GRPO adapter ({RUN_NAME})\")\n",
+        "    tokenizer.push_to_hub(HUB_REPO_ID)\n",
+        "    api = HfApi()\n",
+        "    for fname in (\"reward_log.csv\", \"before_after.png\", \"reward_curve.png\", \"components.png\"):\n",
+        "        p = OUT / fname\n",
+        "        if p.exists():\n",
+        "            api.upload_file(path_or_fileobj=str(p), path_in_repo=fname, repo_id=HUB_REPO_ID)\n",
+        "    print(\"pushed adapter + artefacts →\", HUB_REPO_ID)\n",
+        "else:\n",
+        "    print(\"HUB_REPO_ID / HF_TOKEN not set — skipping push.\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "metadata": {},
+      "source": [
+        "manifest = {\n",
+        "    \"env_url\":    GHOSTEXEC_ENV_URL,\n",
+        "    \"model\":      MODEL_ID,\n",
+        "    \"run\":        RUN_NAME,\n",
+        "    \"stack\":      {\"unsloth\": True, \"trl\": \"0.22.2\"},\n",
+        "    \"rewards\": {\n",
+        "        \"random_mean\":  summary[\"random\"],\n",
+        "        \"frozen_mean\":  summary[\"frozen\"],\n",
+        "        \"trained_mean\": summary[\"trained\"],\n",
+        "        \"improvement_vs_frozen\": summary[\"trained\"] - summary[\"frozen\"],\n",
+        "    },\n",
+        "    \"stages\":       list(stage_logs.keys()),\n",
+        "    \"reward_fns\":   [\"env_reward\", \"format_reward\", \"anti_idle_reward\"],\n",
+        "    \"curriculum\":   \"exploration schedule (T=1.0→0.7→0.5)\",\n",
+        "    \"tripwire\":     \"HackingTripwire (unique_ratio<0.2 or env↑/fmt↓)\",\n",
+        "    \"adapter_path\": str(final_adapter),\n",
+        "    \"mean_space_latency_ms\": round(sum(env.latency_ms) / max(len(env.latency_ms), 1), 1),\n",
+        "    \"n_space_calls\":         len(env.latency_ms),\n",
+        "}\n",
+        "print(json.dumps(manifest, indent=2))\n",
+        "(OUT / \"manifest.json\").write_text(json.dumps(manifest, indent=2))\n",
+        "print(\"\\nmanifest →\", OUT / \"manifest.json\")"
+      ],
+      "execution_count": null,
+      "outputs": []
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python",
+      "version": "3.10"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}
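
The plotting cell in this notebook turns `stage_logs` into one row per logged step before building the DataFrame. A minimal standalone sketch of that flattening follows. It assumes `stage_logs` maps a stage name to a list of log dicts (for example a copy of each trainer's `state.log_history`); the helper name `flatten_stage_logs` and the demo data are hypothetical and only illustrate the logic.

```python
def flatten_stage_logs(stage_logs):
    """Flatten {stage: [log entries]} into rows with a running global step.

    Entries without an env reward (e.g. runtime summaries) are skipped,
    mirroring the `if r is None: continue` guard in the notebook cell.
    """
    rows = []
    step = 0
    for stage, log in stage_logs.items():
        for entry in log:
            # Prefer the per-function mean; fall back to the aggregate "reward" key.
            env = entry.get("rewards/env_reward/mean", entry.get("reward"))
            if env is None:
                continue
            step += 1
            rows.append({
                "stage": stage,
                "global_step": step,
                "env": env,
                "fmt": entry.get("rewards/format_reward/mean", 0.0),
                "idle": entry.get("rewards/anti_idle_reward/mean", 0.0),
            })
    return rows


# Hypothetical demo data: two stages, one entry with no reward key at all.
demo_logs = {
    "B": [
        {"rewards/env_reward/mean": 0.1, "rewards/format_reward/mean": 0.5},
        {"train_runtime": 12.3},
    ],
    "C": [{"reward": 0.2}],
}
rows = flatten_stage_logs(demo_logs)
for row in rows:
    print(row)
```

Because `global_step` keeps counting across stages, the per-stage curves sit end to end on one x-axis instead of overlapping at step 1.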

outputs/logs/api_dead_live_600.jsonl CHANGED Viewed
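
Each record in this log carries the same fields: `idx`, `ok`, `error`, `action`, `reward`, and `step_ok`. A minimal sketch of summarising such records per `action_type` follows. The helper `summarize_log` is hypothetical and not part of the repository; the three sample lines are copied from the log itself.

```python
import json
from collections import defaultdict


def summarize_log(lines):
    """Return {action_type: {"n", "step_ok_rate", "mean_reward"}} for JSONL lines."""
    stats = defaultdict(lambda: {"n": 0, "ok": 0, "reward_sum": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        rec = json.loads(line)
        s = stats[rec["action"]["action_type"]]
        s["n"] += 1
        s["ok"] += int(rec["step_ok"])
        s["reward_sum"] += rec["reward"]
    return {
        name: {
            "n": s["n"],
            "step_ok_rate": s["ok"] / s["n"],
            "mean_reward": s["reward_sum"] / s["n"],
        }
        for name, s in stats.items()
    }


sample = [
    '{"idx": 202, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}',
    '{"idx": 203, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}',
    '{"idx": 208, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}',
]
summary_by_action = summarize_log(sample)
print(json.dumps(summary_by_action, indent=2))
```

To run it over the full file, pass an open file handle instead of `sample`, since iterating a text file yields one line at a time.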

@@ -200,3 +200,401 @@
 {"idx": 199, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
 {"idx": 200, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
 {"idx": 201, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}

+{"idx": 202, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 203, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 204, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 205, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 206, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 207, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 208, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 209, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 210, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 211, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 212, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 213, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 214, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 215, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 216, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 217, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 218, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 219, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 220, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 221, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 222, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 223, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 224, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 225, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 226, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 227, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 228, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 229, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 230, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 231, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 232, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 233, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 234, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 235, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 236, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 237, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 238, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 239, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 240, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 241, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 242, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 243, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 244, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 245, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 246, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 247, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 248, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 249, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 250, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 251, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 252, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 253, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 254, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 255, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 256, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 257, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 258, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 259, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 260, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 261, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 262, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 263, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 264, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 265, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 266, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 267, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 268, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 269, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 270, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 271, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 272, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 273, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 274, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 275, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 276, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 277, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 278, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 279, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 280, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 281, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 282, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 283, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 284, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 285, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 286, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 287, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 288, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 289, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 290, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 291, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 292, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 293, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 294, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 295, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 296, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 297, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 298, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 299, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 300, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 301, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 302, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 303, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 304, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 305, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 306, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 307, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 308, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 309, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 310, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 311, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 312, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 313, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 314, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 315, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 316, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 317, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 318, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 319, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 320, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 321, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 322, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 323, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 324, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 325, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 326, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 327, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 328, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 329, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 330, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 331, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 332, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 333, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 334, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 335, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 336, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 337, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 338, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 339, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 340, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 341, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 342, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 343, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 344, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 345, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 346, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 347, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 348, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 349, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 350, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 351, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 352, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 353, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 354, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 355, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 356, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 357, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 358, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 359, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 360, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 361, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 362, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 363, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 364, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 365, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 366, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 367, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 368, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 369, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 370, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 371, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 372, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 373, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 374, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 375, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 376, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 377, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 378, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 379, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 380, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 381, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 382, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 383, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 384, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 385, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 386, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 387, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 388, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 389, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 390, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 391, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 392, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 393, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 394, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 395, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 396, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 397, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 398, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 399, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 400, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 401, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 402, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 403, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 404, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 405, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 406, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 407, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 408, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 409, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 410, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 411, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 412, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 413, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 414, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 415, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 416, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 417, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 418, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 419, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 420, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 421, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 422, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 423, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 424, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 425, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 426, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 427, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 428, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 429, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 430, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 431, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 432, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 433, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 434, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 435, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 436, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 437, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 438, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 439, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 440, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 441, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 442, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 443, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 444, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 445, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 446, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 447, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 448, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 449, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 450, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 451, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 452, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 453, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 454, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 455, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 456, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 457, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 458, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 459, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 460, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 461, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 462, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 463, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 464, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 465, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 466, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 467, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 468, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 469, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 470, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 471, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 472, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 473, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 474, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 475, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 476, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 477, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 478, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 479, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 480, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 481, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 482, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 483, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 484, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 485, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 486, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 487, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 488, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 489, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 490, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 491, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 492, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 493, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 494, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 495, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 496, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 497, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 498, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 499, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 500, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 501, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 502, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 503, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 504, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 505, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 506, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 507, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 508, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 509, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 510, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 511, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 512, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 513, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 514, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 515, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 516, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 517, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 518, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 519, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 520, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 521, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 522, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 523, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 524, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 525, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 526, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 527, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 528, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 529, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 530, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 531, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 532, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 533, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 534, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 535, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 536, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 537, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 538, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 539, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 540, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 541, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 542, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 543, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 544, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 545, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 546, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 547, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 548, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 549, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 550, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 551, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 552, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 553, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 554, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 555, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 556, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 557, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 558, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 559, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 560, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 561, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 562, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 563, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 564, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 565, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 566, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 567, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 568, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 569, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 570, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 571, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 572, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 573, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 574, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 575, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 576, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 577, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 578, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 579, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 580, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 581, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 582, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 583, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}
+{"idx": 584, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m10", "reason": "dead test"}, "reward": 0.35616, "step_ok": true}
+{"idx": 585, "ok": true, "error": null, "action": {"action_type": "cancel_meeting", "meeting_id": "m99", "reason": "dead test"}, "reward": -0.25, "step_ok": false}
+{"idx": 586, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t07"}, "reward": 0.29663999999999996, "step_ok": true}
+{"idx": 587, "ok": true, "error": null, "action": {"action_type": "complete_task", "task_id": "t09"}, "reward": -0.25, "step_ok": false}
+{"idx": 588, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Jordan Lee"}, "reward": 0.1584, "step_ok": true}
+{"idx": 589, "ok": true, "error": null, "action": {"action_type": "delegate_task", "task_id": "t08", "contact_name": "Nobody"}, "reward": -0.25, "step_ok": false}
+{"idx": 590, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Jamie Liu", "message_body": "Quick sync please."}, "reward": 0.013439999999999999, "step_ok": true}
+{"idx": 591, "ok": true, "error": null, "action": {"action_type": "send_message", "contact_name": "Nobody", "message_body": "hello"}, "reward": -0.25, "step_ok": false}
+{"idx": 592, "ok": true, "error": null, "action": {"action_type": "do_nothing"}, "reward": -0.15, "step_ok": true}
+{"idx": 593, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e01", "message_body": "On it now."}, "reward": 0.13776, "step_ok": true}
+{"idx": 594, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "e14", "message_body": "Acknowledged."}, "reward": 0.006719999999999999, "step_ok": true}
+{"idx": 595, "ok": true, "error": null, "action": {"action_type": "reply_email", "email_id": "nope_999", "message_body": "x"}, "reward": -0.25, "step_ok": false}
+{"idx": 596, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "e09"}, "reward": 0.0, "step_ok": true}
+{"idx": 597, "ok": true, "error": null, "action": {"action_type": "archive_email", "email_id": "bad_id"}, "reward": -0.25, "step_ok": false}
+{"idx": 598, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m02", "new_time": "2026-04-21T18:00:00"}, "reward": 0.37296, "step_ok": true}
+{"idx": 599, "ok": true, "error": null, "action": {"action_type": "reschedule_meeting", "meeting_id": "m03", "new_time": "2026-04-21T09:30:00"}, "reward": -0.25, "step_ok": false}