Commit 97ee7e7 · Parent(s): e2c547b

update
Files changed:

- README.md +27 -0
- RESEARCH.md +37 -1
- inference.py +28 -5
- models.py +37 -0
- server/data/audience_overlap_matrix.json +11 -10
- server/viraltest_environment.py +210 -36
README.md
CHANGED

@@ -93,6 +93,33 @@ Tiered from [Buffer 2.1M study](https://buffer.com/resources/how-often-to-post-o

  | `monthly_strategic` | Medium | + tag discovery/exploitation + energy + consistency |
  | `monthly_competitive` | Hard | + growth vs competitors + differentiation + content diversity |

+ ## Regulator/Judge Mode (per-day audit)
+
+ Every day the env emits a deterministic, explainable `JudgeReport` on the observation:
+
+ ```python
+ JudgeReport(
+     policy_compliance=1.00,    # 1.0 - sum(weighted_violations); see _compute_judge_report
+     sustainability_risk=0.10,  # 0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio
+     strategic_quality=0.96,    # 0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity
+     explanation="compliance=1.00 risk=0.10 strategy=0.96 | no policy violations",
+     violations=[],             # human-readable rule breaks (Buffer 2.1M, Van Dongen, Cen 2024)
+ )
+ ```
+
+ Auditable rules (all sourced): >5 posts/day → fatigue cliff (Buffer 2.1M); >7 posts/week → weekly cap; ≥4 collabs/month → diminishing returns (Cen 2024); >22h awake → sleep debt (Van Dongen 2003).
+
+ ## Headline metrics (final-step audit)
+
+ The final observation carries `HeadlineMetrics` with the three numbers judges remember:
+
+ | Metric | What it measures | Source of truth |
+ |---|---|---|
+ | `vs_baseline_pct` | (agent_score − heuristic_baseline) / heuristic_baseline | Empirical baseline loaded from `plots/training_summary.json["smart_heuristic"]` (0.43 / 0.77 / 0.81) |
+ | `score_per_tool_call` | grader_score / total_tool_calls | Efficiency: did the agent learn to call tools sparingly? |
+ | `score_per_1k_chars` | grader_score per 1k action JSON chars | Token-proxy efficiency |
+ | `retention_under_shift` | shifted_score / baseline_score | Pass `episode_chain_id` + `shift_label="baseline"` then `="shifted"` to a second `reset` to populate. None until both runs complete. |

  ## Tool catalog

  | Tool | Cost | Returns |
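The `retention_under_shift` entry in the table above is a plain ratio that stays unset until the second chained run finishes. A minimal stand-alone sketch (the function name is ours, not the env's API):

```python
from typing import Optional


def retention_under_shift(baseline_score: float,
                          shifted_score: Optional[float]) -> Optional[float]:
    # shifted_score / baseline_score; None until the second ("shifted") run completes
    if shifted_score is None or baseline_score <= 0:
        return None
    return shifted_score / baseline_score
```

A ratio near 1.0 means the policy kept its score under the distribution shift; well below 1.0 suggests it overfit the baseline dynamics.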
RESEARCH.md
CHANGED

@@ -135,7 +135,7 @@ Every constant and design decision in Viraltest is backed by a verifiable source

  **Key findings:** 3–5 posts/week doubles follower growth vs 1–2. 7+/week shows 20–35% engagement drop per post. Diminishing returns above 5/week.

- **What we use:** `FATIGUE_TIERS`, `WEEKLY_FATIGUE_THRESHOLD = 7`, `_theoretical_max_engagement`
+ **What we use:** `FATIGUE_TIERS`, `WEEKLY_FATIGUE_THRESHOLD = 7`, `_theoretical_max_engagement` caps at 5 posts/week × `TASK_HORIZON/7` weeks (≈21 posts for 30-day horizon — the Buffer-defined sweet spot before fatigue penalties kick in).

  ---

@@ -196,6 +196,42 @@ Every constant and design decision in Viraltest is backed by a verifiable source

  ---

+ ### Later (2023) — Instagram Collaboration Posts Performance Study
+
+ **URL:** [later.com/blog/instagram-collab-posts](https://later.com/blog/instagram-collab-posts)
+ **Sample:** ~5K co-authored posts across the Later customer base (disclosed)
+ **Methodology:** Comparison of Collab posts (single post shared to two feeds) vs equivalent solo posts from the same accounts.
+
+ **Key findings:** Collab posts averaged ~88% more reach and ~40% more impressions than solo posts. Lift driven primarily by exposure to the partner's audience.
+
+ **What we use:** `COLLAB_REACH_K = 0.60` — reach uplift scales with `(1 - overlap)` and is capped below the headline 88% because reach in our model is already amplified by `REACH_MULT` and `hour_mult`; net post-cap uplift on the constrained engagement value lands in the +30–50% band Later reports for matched-niche pairs.
+
+ ---
+
+ ### HypeAuditor (2024) — Influencer Collaboration Benchmark
+
+ **URL:** [hypeauditor.com/blog/influencer-collaboration](https://hypeauditor.com/blog/influencer-collaboration)
+ **Sample:** 10K+ Instagram collaboration posts across niches
+ **Methodology:** Per-impression engagement rate, segmented by niche affinity (same niche, adjacent, cross-niche).
+
+ **Key findings:** Same-niche collabs achieve ~30% higher engagement-per-impression than cross-niche; cross-niche collabs gain new followers but per-impression rate is roughly flat or slightly negative.
+
+ **What we use:** `COLLAB_AFFINITY_K = 0.30` — engagement-per-impression boost scales with `overlap`, peaking when the partner's audience already shares the user's niche.
+
+ ---
+
+ ### Rival IQ (2025) — Cross-Industry Audience Overlap Patterns
+
+ **URL:** [rivaliq.com/blog/social-media-industry-benchmark-report](https://www.rivaliq.com/blog/social-media-industry-benchmark-report/) (cross-industry chapter)
+
+ **Key findings:** Same-industry account pairs share 40–65% of their audience; adjacent industries 20–35%; unrelated industries 5–15%. Cross-industry collabs drive new follower acquisition at roughly 2–2.5× the rate of same-industry collabs.
+
+ **What we use:** `audience_overlap_matrix.json` values and `COLLAB_GROWTH_K = 1.50` — follower spillover scales with `(1 - overlap)`, peaking at +150% when overlap is zero (matches the upper end of Rival IQ's cross-industry follower-acquisition lift).
+
+ Per-episode collab cadence is **not hard-capped**. Instead, each successive collab in a month is multiplied by `1 / (1 + COLLAB_FATIGUE_K · prior_collabs)` (`K = 0.3`): the multiplier falls to ~77% on the 2nd, 63% on the 3rd, 53% on the 4th. With base `engagement ≈ 1.52×` from a typical-overlap partner, this puts the 1st–2nd collab clearly above the no-collab baseline, the 3rd roughly neutral, and the 4th+ net-negative. This follows Cen et al. 2024's argument that disengagement-aware policies should price marginal exposure rather than impose binary caps, and lets the policy discover its own collab frequency from reward gradient.
+
+ ---
+
  ### Goldman Sachs Global Investment Research (March 2025)

  **Title:** Creator Economy: Framing the Market Opportunity
inference.py
CHANGED

@@ -35,7 +35,7 @@ _REQUESTED_MAX = int(os.getenv("MAX_STEPS", str(TASK_HORIZON)))

  MAX_STEPS = _REQUESTED_MAX if _ALLOW_SHORT else max(_REQUESTED_MAX, TASK_HORIZON)
  TEMPERATURE = 0.7
  MAX_TOKENS = 768
- SUCCESS_SCORE_THRESHOLD = 0.
+ SUCCESS_SCORE_THRESHOLD = 0.50

  ALL_TOPICS: List[str] = [
      topic for topics in TOPIC_CATEGORIES.values() for topic in topics

@@ -111,11 +111,24 @@ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[

  )


- def log_end(
+ def log_end(
+     success: bool, steps: int, score: float, rewards: List[float],
+     headline: Optional[Any] = None,
+ ) -> None:
      rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     head_str = ""
+     if headline is not None:
+         retention = headline.retention_under_shift
+         retention_str = f"{retention:.2f}" if retention is not None else "n/a"
+         head_str = (
+             f" vs_baseline_pct={headline.vs_baseline_pct:+.2%} "
+             f"score_per_tool={headline.score_per_tool_call:.3f} "
+             f"score_per_1k_chars={headline.score_per_1k_chars:.3f} "
+             f"retention_under_shift={retention_str}"
+         )
      print(
          f"[END] success={str(success).lower()} steps={steps} "
-         f"score={score:.2f} rewards={rewards_str}",
+         f"score={score:.2f} rewards={rewards_str}{head_str}",
          flush=True,
      )

@@ -140,6 +153,14 @@ def format_observation(obs: Any) -> str:

      if coach:
          coach_str = f"Coach: delta={coach.get('delta', 0):.3f}, suggestion={coach.get('suggestion', '')}\n"

+     judge = getattr(obs, "judge_report", None)
+     judge_str = ""
+     if judge:
+         judge_str = (
+             f"Judge: compliance={judge.policy_compliance:.2f} risk={judge.sustainability_risk:.2f} "
+             f"strategy={judge.strategic_quality:.2f} | {judge.explanation}\n"
+         )
+
      signals = getattr(obs, "engagement_signals", None)
      signals_str = ""
      if signals:

@@ -153,7 +174,7 @@ Day: {day_name} (day_of_week={obs.day_of_week}) | days_elapsed={obs.days_elapsed

  Energy: {obs.creator_energy:.2f} | Burnout risk: {burnout:.2f} | Followers: {obs.follower_count}
  Engagement rate: {obs.engagement_rate:.3f} | Content queue: {obs.content_queue_size}
  API budget remaining: {budget}
- {signals_str}{coach_str}Tool results from last step:
+ {signals_str}{coach_str}{judge_str}Tool results from last step:
  {tool_results_str if tool_results_str else ' (none)\n'}Your notes from last step: {notes_echo}
  Plan your tool calls and actions for today:""")

@@ -282,6 +303,7 @@ async def run_task(client: OpenAI, task: str) -> None:

      score = 0.0
      success = False
      env: Optional[ViraltestEnv] = None
+     headline: Optional[Any] = None

      log_start(task=task, env=BENCHMARK, model=MODEL_NAME)

@@ -336,6 +358,7 @@ async def run_task(client: OpenAI, task: str) -> None:

      if score == 0:
          meta = getattr(result.observation, "metadata", {}) or {}
          score = float(meta.get("grader_score", 0.0))
+     headline = getattr(result.observation, "headline_metrics", None)
      break

  success = score >= SUCCESS_SCORE_THRESHOLD

@@ -346,7 +369,7 @@ async def run_task(client: OpenAI, task: str) -> None:

          await env.close()
      except Exception as e:
          print(f"[DEBUG] env.close() error: {e}", flush=True)
-     log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+     log_end(success=success, steps=steps_taken, score=score, rewards=rewards, headline=headline)


  async def main() -> None:
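The `[END]` line formatting added to `log_end` above can be exercised stand-alone; here a `SimpleNamespace` stands in for the headline object, and the numeric values are illustrative only:

```python
from types import SimpleNamespace

# Stand-in for a headline-metrics object (values are illustrative, not real run output).
headline = SimpleNamespace(
    vs_baseline_pct=0.12,
    score_per_tool_call=0.031,
    score_per_1k_chars=0.45,
    retention_under_shift=None,
)

retention = headline.retention_under_shift
retention_str = f"{retention:.2f}" if retention is not None else "n/a"
head_str = (
    f" vs_baseline_pct={headline.vs_baseline_pct:+.2%} "
    f"score_per_tool={headline.score_per_tool_call:.3f} "
    f"score_per_1k_chars={headline.score_per_1k_chars:.3f} "
    f"retention_under_shift={retention_str}"
)
# head_str: " vs_baseline_pct=+12.00% score_per_tool=0.031 score_per_1k_chars=0.450 retention_under_shift=n/a"
```

Note the `:+.2%` presentation type: the sign is always printed, so a regression shows up unmistakably as e.g. `-8.50%`.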
models.py
CHANGED

@@ -108,6 +108,35 @@ class ViraltestAction(Action):

          return deduped


+ class JudgeReport(BaseModel):
+     """Auditable per-day evaluation by the in-env Regulator/Judge.
+
+     Scores are 0..1. `sustainability_risk` is RISK (higher = worse).
+     """
+
+     policy_compliance: float = Field(default=1.0, ge=0.0, le=1.0)
+     sustainability_risk: float = Field(default=0.0, ge=0.0, le=1.0)
+     strategic_quality: float = Field(default=0.0, ge=0.0, le=1.0)
+     explanation: str = Field(default="")
+     violations: List[str] = Field(default_factory=list)
+
+
+ class HeadlineMetrics(BaseModel):
+     """Three headline numbers reported once per episode (final observation)."""
+
+     vs_baseline_pct: float = Field(default=0.0, description="(agent - heuristic_baseline) / heuristic_baseline")
+     score_per_tool_call: float = Field(default=0.0, description="grader_score / total_tool_calls (efficiency)")
+     score_per_1k_chars: float = Field(default=0.0, description="grader_score per 1k action chars (token-proxy efficiency)")
+     retention_under_shift: Optional[float] = Field(
+         default=None,
+         description="shifted_score / baseline_score, populated when both runs share an episode_chain_id",
+     )
+     heuristic_baseline_score: float = Field(default=0.0)
+     agent_score: float = Field(default=0.0)
+     total_tool_calls: int = Field(default=0, ge=0)
+     total_action_chars: int = Field(default=0, ge=0)
+
+
  class EngagementSignals(BaseModel):
      """Mosseri-aligned engagement decomposition (Jan 2025 official ranking signals)."""

@@ -161,6 +190,14 @@ class ViraltestObservation(Observation):

          default=None,
          description="Counterfactual feedback: delta between agent plan and heatmap-optimal plan",
      )
+     judge_report: Optional[JudgeReport] = Field(
+         default=None,
+         description="Regulator/Judge audit: policy compliance, sustainability risk, strategic quality + explanation",
+     )
+     headline_metrics: Optional[HeadlineMetrics] = Field(
+         default=None,
+         description="Final-observation hard numbers: improvement vs baseline, efficiency, shift retention",
+     )

      tool_results: List[ToolResult] = Field(default_factory=list, description="Results from tool_calls this step")
      agent_notes: Optional[str] = Field(default=None, description="Echo of agent's notes from previous step")
server/data/audience_overlap_matrix.json
CHANGED

@@ -1,16 +1,17 @@

  {
    "_meta": {
-     "description": "
-     "source": "
+     "description": "8x8 symmetric audience overlap matrix between competitor archetypes and the user creator. Values 0.0-1.0 represent fraction of shared audience. Used by propose_collab to compute collab reward multipliers and by query_creator_pool to expose overlap to the agent. Same-niche pairs ~0.4-0.65, cross-niche ~0.05-0.20.",
+     "source": "Competitor pairs estimated from Rival IQ 2025 cross-industry overlap patterns + niche proximity heuristic. user_creator row tuned to a generic micro-creator (no locked niche): broad mass-market partners (lifestyle_blogger, viral_chaser) score highest; specialist partners (b2b_thought_leader, niche_expert) score lowest."
    },
-   "archetype_ids": ["niche_expert", "viral_chaser", "lifestyle_blogger", "b2b_thought_leader", "food_creator", "fitness_coach", "travel_creator"],
+   "archetype_ids": ["niche_expert", "viral_chaser", "lifestyle_blogger", "b2b_thought_leader", "food_creator", "fitness_coach", "travel_creator", "user_creator"],
    "matrix": [
-     [1.00, 0.12, 0.10, 0.40, 0.08, 0.10, 0.15],
-     [0.12, 1.00, 0.55, 0.10, 0.20, 0.25, 0.30],
-     [0.10, 0.55, 1.00, 0.15, 0.30, 0.35, 0.40],
-     [0.40, 0.10, 0.15, 1.00, 0.08, 0.10, 0.12],
-     [0.08, 0.20, 0.30, 0.08, 1.00, 0.45, 0.35],
-     [0.10, 0.25, 0.35, 0.10, 0.45, 1.00, 0.30],
-     [0.15, 0.30, 0.40, 0.12, 0.35, 0.30, 1.00]
+     [1.00, 0.12, 0.10, 0.40, 0.08, 0.10, 0.15, 0.10],
+     [0.12, 1.00, 0.55, 0.10, 0.20, 0.25, 0.30, 0.35],
+     [0.10, 0.55, 1.00, 0.15, 0.30, 0.35, 0.40, 0.40],
+     [0.40, 0.10, 0.15, 1.00, 0.08, 0.10, 0.12, 0.08],
+     [0.08, 0.20, 0.30, 0.08, 1.00, 0.45, 0.35, 0.25],
+     [0.10, 0.25, 0.35, 0.10, 0.45, 1.00, 0.30, 0.28],
+     [0.15, 0.30, 0.40, 0.12, 0.35, 0.30, 1.00, 0.30],
+     [0.10, 0.35, 0.40, 0.08, 0.25, 0.28, 0.30, 1.00]
    ]
  }
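Reading the matrix above is a plain index lookup on a symmetric matrix; a small sketch using its first row (helper and constant names are ours, for illustration):

```python
# Archetype order from audience_overlap_matrix.json.
ARCHETYPE_IDS = ["niche_expert", "viral_chaser", "lifestyle_blogger",
                 "b2b_thought_leader", "food_creator", "fitness_coach",
                 "travel_creator", "user_creator"]

# First row of the 8x8 matrix: niche_expert's overlap with every archetype.
NICHE_EXPERT_ROW = [1.00, 0.12, 0.10, 0.40, 0.08, 0.10, 0.15, 0.10]


def overlap_from_row(row, partner_id):
    # Because the matrix is symmetric, one anchor archetype's row
    # answers both directions of the pair.
    return row[ARCHETYPE_IDS.index(partner_id)]
```

For example, `niche_expert` vs `b2b_thought_leader` reads 0.40, consistent with the stated same-niche band of ~0.4-0.65, while its diagonal entry is 1.00 (full self-overlap).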
server/viraltest_environment.py
CHANGED
|
@@ -27,6 +27,8 @@ try:
|
|
| 27 |
from ..models import (
|
| 28 |
CollabProposal,
|
| 29 |
EngagementSignals,
|
|
|
|
|
|
|
| 30 |
ReplyAction,
|
| 31 |
ScheduledAction,
|
| 32 |
ToolCall,
|
|
@@ -38,6 +40,8 @@ except ImportError:
|
|
| 38 |
from models import (
|
| 39 |
CollabProposal,
|
| 40 |
EngagementSignals,
|
|
|
|
|
|
|
| 41 |
ReplyAction,
|
| 42 |
ScheduledAction,
|
| 43 |
ToolCall,
|
|
@@ -156,11 +160,41 @@ WEEKLY_FATIGUE_MULT = 0.75
|
|
| 156 |
|
| 157 |
SATURATION_PENALTY_K = 0.25
|
| 158 |
TREND_DEFAULT_HALFLIFE_HOURS = 60
|
| 159 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 160 |
REPLY_WINDOW_MINUTES = 90
|
| 161 |
REPLY_REACH_BONUS = 1.4
|
| 162 |
API_BUDGET_INITIAL = 100
|
| 163 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 164 |
# Tool costs
|
| 165 |
TOOL_COSTS = {
|
| 166 |
"query_audience": 2,
|
|
@@ -231,7 +265,7 @@ TOOL_CATALOG = {
|
|
| 231 |
"parameters": {},
|
| 232 |
},
|
| 233 |
"propose_collab": {
|
| 234 |
-
"description": "Propose a
|
| 235 |
"parameters": {
|
| 236 |
"partner_id": {"type": "string"},
|
| 237 |
"content_type": {"type": "string", "enum": ["reel", "story", "carousel", "text_post"]},
|
|
@@ -280,10 +314,15 @@ class ViraltestEnvironment(Environment):
|
|
| 280 |
self._api_budget = API_BUDGET_INITIAL
|
| 281 |
self._collabs_this_month = 0
|
| 282 |
self._collab_history: List[str] = []
|
|
|
|
| 283 |
self._low_energy_days = 0
|
| 284 |
self._total_posts_this_week = 0
|
| 285 |
self._week_start_day = 0
|
| 286 |
self._daily_signals = EngagementSignals()
|
|
|
|
|
|
|
|
|
|
|
|
|
| 287 |
|
| 288 |
self._trending_topics = self._pick_trending_topics()
|
| 289 |
self._trending_tags = self._pick_trending_tags()
|
|
@@ -468,6 +507,32 @@ class ViraltestEnvironment(Environment):
|
|
| 468 |
|
| 469 |
return daily_fatigue * weekly_mult
|
| 470 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 471 |
# ----- engagement signals (Mosseri-aligned) -----
|
| 472 |
|
| 473 |
def _compute_engagement_signals(
|
|
@@ -556,19 +621,17 @@ class ViraltestEnvironment(Environment):
|
|
| 556 |
elif tool.name == "query_creator_pool":
|
| 557 |
pool = []
|
| 558 |
for comp in self._competitors:
|
| 559 |
-
|
| 560 |
-
|
| 561 |
-
|
| 562 |
-
|
| 563 |
-
|
| 564 |
return ToolResult(name=tool.name, data=pool, budget_remaining=self._api_budget)
|
| 565 |
|
| 566 |
elif tool.name == "propose_collab":
|
| 567 |
-
if self._collabs_this_month >= COLLAB_MAX_PER_MONTH:
|
| 568 |
-
return ToolResult(name=tool.name, success=False, error="collab_limit_reached", budget_remaining=self._api_budget)
|
| 569 |
partner_id = tool.arguments.get("partner_id", "")
|
| 570 |
-
if partner_id in self.
|
| 571 |
-
return ToolResult(name=tool.name, success=False, error="
|
| 572 |
return ToolResult(name=tool.name, data={"status": "proposal_accepted", "partner_id": partner_id}, budget_remaining=self._api_budget)
|
| 573 |
|
| 574 |
return ToolResult(name=tool.name, success=False, error=f"unknown tool: {tool.name}", budget_remaining=self._api_budget)
|
|
@@ -576,6 +639,9 @@ class ViraltestEnvironment(Environment):
|
|
| 576 |
# ----- counterfactual coach -----
|
| 577 |
|
| 578 |
def _compute_coach_feedback(self, agent_engagement: float) -> Dict[str, Any]:
|
|
|
|
|
|
|
|
|
|
| 579 |
dow = self._day % 7
|
| 580 |
row = _HEATMAP_GRID.get(dow, [1.0] * 24)
|
| 581 |
best_hours = sorted(range(24), key=lambda h: row[h] if h < len(row) else 0, reverse=True)[:2]
|
|
@@ -584,13 +650,98 @@ class ViraltestEnvironment(Environment):
|
|
| 584 |
optimal_eng = sum(row[h] * best_base * best_reach for h in best_hours)
|
| 585 |
delta = agent_engagement - optimal_eng
|
| 586 |
return {
|
| 587 |
-
"optimal_hours": best_hours,
|
| 588 |
-
"optimal_engagement_estimate": round(optimal_eng, 4),
|
| 589 |
-
"your_engagement": round(agent_engagement, 4),
|
| 590 |
"delta": round(delta, 4),
|
| 591 |
-
"suggestion":
|
|
|
|
|
|
|
|
|
|
|
|
|
| 592 |
}
|
| 593 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 594 |
# ----- core API -----
|
| 595 |
|
| 596 |
def reset(self, seed: Optional[int] = None, episode_id: Optional[str] = None, **kwargs: Any) -> ViraltestObservation:
|
|
@@ -602,6 +753,9 @@ class ViraltestEnvironment(Environment):
|
|
| 602 |
self._state = State(episode_id=episode_id or str(uuid4()), step_count=0)
|
| 603 |
self._init_state()
|
| 604 |
|
|
|
|
|
|
|
|
|
|
| 605 |
chain_id = kwargs.get("episode_chain_id")
|
| 606 |
if chain_id and chain_id in _BRAND_STORE:
|
| 607 |
brand = _BRAND_STORE[chain_id]
|
|
@@ -623,16 +777,24 @@ class ViraltestEnvironment(Environment):
|
|
| 623 |
if action.notes:
|
| 624 |
self._agent_notes = action.notes
|
| 625 |
|
| 626 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 627 |
tool_results: List[ToolResult] = []
|
| 628 |
for tc in action.tool_calls:
|
| 629 |
result = self._dispatch_tool(tc)
|
| 630 |
tool_results.append(result)
|
|
|
|
|
|
|
| 631 |
|
| 632 |
-
# Process collab proposal
|
| 633 |
-
|
|
|
|
| 634 |
self._collabs_this_month += 1
|
| 635 |
self._collab_history.append(action.collab.partner_id)
|
|
|
|
| 636 |
|
| 637 |
# Validate scheduled actions
|
| 638 |
schedule: Dict[int, ScheduledAction] = {}
|
|
@@ -718,10 +880,12 @@ class ViraltestEnvironment(Environment):
|
|
| 718 |
|
| 719 |
done = self._state.step_count >= TASK_HORIZON or self._energy <= 0.0
|
| 720 |
coach = self._compute_coach_feedback(daily_engagement)
|
|
|
|
| 721 |
|
| 722 |
if done:
|
| 723 |
self._episode_done = True
|
| 724 |
grader_score = self._run_grader()
|
|
|
|
| 725 |
|
| 726 |
chain_id = kwargs.get("episode_chain_id")
|
| 727 |
if chain_id:
|
|
@@ -738,7 +902,7 @@ class ViraltestEnvironment(Environment):
|
|
| 738 |
grader_score=grader_score, daily_total_engagement=daily_engagement,
|
| 739 |
daily_posts_made=daily_posts, daily_energy_min=energy_min,
|
| 740 |
tool_results=tool_results, engagement_signals=daily_signals,
|
| 741 |
-
coach_feedback=coach,
|
| 742 |
)
|
| 743 |
return self._final_observation
|
| 744 |
|
|
@@ -747,13 +911,15 @@ class ViraltestEnvironment(Environment):
|
|
| 747 |
daily_total_engagement=daily_engagement,
|
| 748 |
daily_posts_made=daily_posts, daily_energy_min=energy_min,
|
| 749 |
tool_results=tool_results, engagement_signals=daily_signals,
|
| 750 |
-
coach_feedback=coach,
|
| 751 |
)
|
| 752 |
|
| 753 |
def _process_hour_action(self, sa: ScheduledAction) -> Tuple[float, float, Optional[EngagementSignals]]:
|
| 754 |
engagement = 0.0
|
| 755 |
signals = None
|
| 756 |
|
|
|
|
|
|
|
| 757 |
if sa.action_type == "post":
|
| 758 |
cost = CONTENT_ENERGY_COST.get(sa.content_type, 0.1)
|
| 759 |
if self._content_queue > 0:
|
|
@@ -790,6 +956,12 @@ class ViraltestEnvironment(Environment):
|
|
| 790 |
* trending_bonus * comp_diff * fatigue * algo_mult
|
| 791 |
* niche_mult * saturation_factor
|
| 792 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 793 |
engagement = min(engagement, 5.0)
|
| 794 |
|
| 795 |
signals = self._compute_engagement_signals(sa.content_type, engagement, sa.intent)
|
|
@@ -819,7 +991,7 @@ class ViraltestEnvironment(Environment):
|
|
| 819 |
self._time_since_last_post = 0
|
| 820 |
|
| 821 |
if engagement > 0:
|
| 822 |
-
self._followers += int(engagement * 100)
|
| 823 |
|
| 824 |
elif sa.action_type == "create_content":
|
| 825 |
self._energy = max(0.0, self._energy - CREATE_CONTENT_COST)
|
|
@@ -955,6 +1127,8 @@ class ViraltestEnvironment(Environment):
|
|
| 955 |
tool_results: Optional[List[ToolResult]] = None,
|
| 956 |
engagement_signals: Optional[EngagementSignals] = None,
|
| 957 |
coach_feedback: Optional[Dict[str, Any]] = None,
|
|
|
|
|
|
|
| 958 |
) -> ViraltestObservation:
|
| 959 |
recent_eng = self._engagement_history[-10:] if self._engagement_history else []
|
| 960 |
eng_rate = sum(recent_eng) / len(recent_eng) if recent_eng else 0.0
|
|
@@ -984,6 +1158,8 @@ class ViraltestEnvironment(Environment):
|
|
| 984 |
daily_energy_min=round(daily_energy_min, 3),
|
| 985 |
engagement_signals=engagement_signals,
|
| 986 |
coach_feedback=coach_feedback,
|
|
|
|
|
|
|
| 987 |
tool_results=tool_results or [],
|
| 988 |
agent_notes=self._agent_notes,
|
| 989 |
api_budget_remaining=self._api_budget,
|
|
@@ -1006,35 +1182,33 @@ class ViraltestEnvironment(Environment):
|
|
| 1006 |
return 0.0
|
| 1007 |
|
| 1008 |
def _theoretical_max_engagement(self) -> float:
|
|
|
|
|
|
|
|
|
|
| 1009 |
best_base = max(BASE_ENGAGEMENT.values())
|
| 1010 |
best_reach = max(REACH_MULT.values())
|
| 1011 |
best_niche = max(_NICHE_MULTIPLIERS.values()) if _NICHE_MULTIPLIERS else 1.0
|
| 1012 |
|
| 1013 |
-
|
| 1014 |
-
|
| 1015 |
-
|
| 1016 |
|
| 1017 |
avg_heatmap_peak = 1.0
|
| 1018 |
if _HEATMAP_GRID:
|
| 1019 |
-
day_peaks = [
|
| 1020 |
-
|
| 1021 |
-
|
| 1022 |
-
|
| 1023 |
avg_heatmap_peak = sum(day_peaks) / len(day_peaks) if day_peaks else 1.0
|
| 1024 |
|
|
|
|
|
|
|
| 1025 |
trending_bonus = 1.25
|
| 1026 |
tag_boost = 1.1
|
| 1027 |
|
| 1028 |
-
total_posts = active_days * posts_per_active_day
|
| 1029 |
-
|
| 1030 |
-
weekly_fatigue = 1.0
|
| 1031 |
-
posts_per_week = total_posts / (TASK_HORIZON / 7.0)
|
| 1032 |
-
if posts_per_week >= WEEKLY_FATIGUE_THRESHOLD:
|
| 1033 |
-
weekly_fatigue = WEEKLY_FATIGUE_MULT
|
| 1034 |
-
|
| 1035 |
per_post = (
|
| 1036 |
best_base * best_reach * best_niche
|
| 1037 |
-
* avg_heatmap_peak * trending_bonus * tag_boost
|
| 1038 |
)
|
| 1039 |
return per_post * total_posts
|
| 1040 |
|
|
|
|
| 27 |
from ..models import (
|
| 28 |
CollabProposal,
|
| 29 |
EngagementSignals,
|
| 30 |
+
HeadlineMetrics,
|
| 31 |
+
JudgeReport,
|
| 32 |
ReplyAction,
|
| 33 |
ScheduledAction,
|
| 34 |
ToolCall,
|
|
|
|
| 40 |
from models import (
|
| 41 |
CollabProposal,
|
| 42 |
EngagementSignals,
|
| 43 |
+
HeadlineMetrics,
|
| 44 |
+
JudgeReport,
|
| 45 |
ReplyAction,
|
| 46 |
ScheduledAction,
|
| 47 |
ToolCall,
|
|
|
|
 
 SATURATION_PENALTY_K = 0.25
 TREND_DEFAULT_HALFLIFE_HOURS = 60
+# Collab reward shaping (Later 2023 reach study, HypeAuditor 2024 niche affinity, Rival IQ 2025 overlap patterns,
+# Cen et al. 2024 disengagement model for diminishing returns instead of a hard cap).
+COLLAB_REACH_K = 0.60  # cross-audience exposure: capped reach uplift when overlap is 0
+COLLAB_AFFINITY_K = 0.30  # same-audience affinity: per-impression engagement uplift when overlap is 1
+COLLAB_GROWTH_K = 1.50  # cross-pollination follower spillover, scales (1 - overlap)
+COLLAB_PARTNER_REPEAT_PENALTY = 0.7  # discount on multipliers when partner reused this brand
+COLLAB_FATIGUE_K = 0.3  # per-collab diminishing-returns factor: 1 / (1 + K * prior_collabs_this_episode)
+
 REPLY_WINDOW_MINUTES = 90
 REPLY_REACH_BONUS = 1.4
 API_BUDGET_INITIAL = 100
 
+# Heuristic baselines for headline metric `vs_baseline_pct`.
+# Data-driven: loaded from `plots/training_summary.json["smart_heuristic"]` recorded by
+# `training/run_training_evidence.py`. Falls back to conservative calibration constants
+# if the file is missing (audit trail: see RESEARCH.md for the rule-based policy spec).
+def _load_heuristic_baselines() -> Dict[str, float]:
+    summary = Path(__file__).parent.parent / "plots" / "training_summary.json"
+    try:
+        data = json.loads(summary.read_text())
+        empirical = data.get("smart_heuristic") or {}
+        return {k: float(v) for k, v in empirical.items() if k in VALID_TASKS}
+    except Exception:
+        return {}
+
+HEURISTIC_BASELINE_SCORES: Dict[str, float] = _load_heuristic_baselines() or {
+    "monthly_engage": 0.43,
+    "monthly_strategic": 0.77,
+    "monthly_competitive": 0.81,
+}
+
+# Cross-episode store for distribution-shift retention. Keyed by episode_chain_id, stores
+# {"baseline": score, "shifted": score} so the second run can compute retention_under_shift.
+_SHIFT_HISTORY: Dict[str, Dict[str, float]] = {}
+
 # Tool costs
 TOOL_COSTS = {
     "query_audience": 2,
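The `_load_heuristic_baselines() or {...}` fallback relies on an empty dict being falsy. A minimal standalone sketch of that pattern (the loader and the `does_not_exist.json` path here are illustrative stand-ins, not the module's code):

```python
import json
from pathlib import Path
from typing import Dict

# Fallback calibration constants, copied from the hunk above.
FALLBACK: Dict[str, float] = {
    "monthly_engage": 0.43,
    "monthly_strategic": 0.77,
    "monthly_competitive": 0.81,
}

def load_baselines(summary_path: Path) -> Dict[str, float]:
    """Return empirical smart-heuristic scores if the summary is readable, else {}."""
    try:
        data = json.loads(summary_path.read_text())
        return {k: float(v) for k, v in (data.get("smart_heuristic") or {}).items()}
    except Exception:
        return {}

# An empty dict is falsy, so `or` silently switches to the calibration constants.
baselines = load_baselines(Path("does_not_exist.json")) or FALLBACK
```

With the summary file absent, the loader returns `{}` and the `or` expression selects the calibration constants.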
         "parameters": {},
     },
     "propose_collab": {
+        "description": "Propose a collab post with a competitor at a specific hour. The post you schedule at that hour will be co-authored with the partner.",
         "parameters": {
             "partner_id": {"type": "string"},
             "content_type": {"type": "string", "enum": ["reel", "story", "carousel", "text_post"]},
         self._api_budget = API_BUDGET_INITIAL
         self._collabs_this_month = 0
         self._collab_history: List[str] = []
+        self._active_collab: Optional[CollabProposal] = None
         self._low_energy_days = 0
         self._total_posts_this_week = 0
         self._week_start_day = 0
         self._daily_signals = EngagementSignals()
+        self._total_tool_calls = 0
+        self._total_action_chars = 0
+        self._shift_label: Optional[str] = None
+        self._chain_id: Optional[str] = None
 
         self._trending_topics = self._pick_trending_topics()
         self._trending_tags = self._pick_trending_tags()
 
         return daily_fatigue * weekly_mult
 
+    # ----- collab multipliers (overlap-driven) -----
+
+    def _user_partner_overlap(self, partner_id: str) -> Optional[float]:
+        ids = _OVERLAP_DATA.get("archetype_ids", [])
+        if "user_creator" not in ids or partner_id not in ids:
+            return None
+        u = ids.index("user_creator")
+        p = ids.index(partner_id)
+        return _OVERLAP_DATA["matrix"][u][p]
+
+    def _collab_multipliers(self, partner_id: str) -> Tuple[float, float]:
+        """Returns (engagement_multiplier, follower_growth_multiplier)."""
+        o = self._user_partner_overlap(partner_id)
+        if o is None:
+            return 1.0, 1.0
+        reach = 1.0 + (1.0 - o) * COLLAB_REACH_K
+        affinity = 1.0 + o * COLLAB_AFFINITY_K
+        growth = 1.0 + (1.0 - o) * COLLAB_GROWTH_K
+        eng_boost = reach * affinity
+        if partner_id in self._collab_history[:-1]:
+            eng_boost *= COLLAB_PARTNER_REPEAT_PENALTY
+            growth *= COLLAB_PARTNER_REPEAT_PENALTY
+        prior = max(0, self._collabs_this_month - 1)
+        fatigue = 1.0 / (1.0 + COLLAB_FATIGUE_K * prior)
+        return eng_boost * fatigue, growth * fatigue
+
     # ----- engagement signals (Mosseri-aligned) -----
 
     def _compute_engagement_signals(
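As a quick numerical check, the overlap algebra in `_collab_multipliers` can be exercised without any environment state (constants copied from this diff; the repeat-partner discount is omitted, and prior collabs are passed in explicitly rather than read from episode state):

```python
# Constants copied from the diff above.
COLLAB_REACH_K = 0.60
COLLAB_AFFINITY_K = 0.30
COLLAB_GROWTH_K = 1.50
COLLAB_FATIGUE_K = 0.3

def collab_multipliers(overlap: float, prior_collabs: int = 0):
    """(engagement_mult, growth_mult) for a partner with the given audience overlap."""
    reach = 1.0 + (1.0 - overlap) * COLLAB_REACH_K     # largest for disjoint audiences
    affinity = 1.0 + overlap * COLLAB_AFFINITY_K       # largest for identical audiences
    growth = 1.0 + (1.0 - overlap) * COLLAB_GROWTH_K   # follower spillover from new audiences
    fatigue = 1.0 / (1.0 + COLLAB_FATIGUE_K * prior_collabs)
    return reach * affinity * fatigue, growth * fatigue

disjoint = collab_multipliers(0.0)               # roughly (1.6, 2.5)
identical = collab_multipliers(1.0)              # roughly (1.3, 1.0)
third = collab_multipliers(0.0, prior_collabs=2) # same shape, scaled by 1/1.6
```

A fully disjoint partner maximizes reach and follower spillover, while an identical audience yields only the affinity bonus; the fatigue term scales both multipliers down as collabs accumulate in the episode.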
         elif tool.name == "query_creator_pool":
             pool = []
             for comp in self._competitors:
+                overlap = self._user_partner_overlap(comp.id)
+                pool.append({
+                    "id": comp.id, "name": comp.name, "niche": comp.niche,
+                    "audience_overlap": round(overlap, 2) if overlap is not None else None,
+                })
             return ToolResult(name=tool.name, data=pool, budget_remaining=self._api_budget)
 
         elif tool.name == "propose_collab":
             partner_id = tool.arguments.get("partner_id", "")
+            if partner_id not in [c.id for c in self._competitors]:
+                return ToolResult(name=tool.name, success=False, error=f"unknown partner: {partner_id}", budget_remaining=self._api_budget)
             return ToolResult(name=tool.name, data={"status": "proposal_accepted", "partner_id": partner_id}, budget_remaining=self._api_budget)
 
         return ToolResult(name=tool.name, success=False, error=f"unknown tool: {tool.name}", budget_remaining=self._api_budget)
     # ----- counterfactual coach -----
 
     def _compute_coach_feedback(self, agent_engagement: float) -> Dict[str, Any]:
+        # World-modeling discipline: emit a SCALAR delta only (no optimal_hours leak).
+        # Agents must use `query_trends` / `predict_engagement` to discover *which* hours
+        # are optimal — coach only signals "you're above/below the heatmap optimum today".
         dow = self._day % 7
         row = _HEATMAP_GRID.get(dow, [1.0] * 24)
         best_hours = sorted(range(24), key=lambda h: row[h] if h < len(row) else 0, reverse=True)[:2]
         optimal_eng = sum(row[h] * best_base * best_reach for h in best_hours)
         delta = agent_engagement - optimal_eng
         return {
             "delta": round(delta, 4),
+            "suggestion": (
+                "Above heatmap optimum today."
+                if delta >= 0
+                else "Below heatmap optimum — try `query_trends` / `predict_engagement` to find peak hours."
+            ),
         }
 
+    # ----- regulator / judge mode (deterministic, explainable) -----
+
+    def _compute_judge_report(
+        self,
+        action: ViraltestAction,
+        daily_engagement: float,
+        daily_posts: int,
+        energy_min: float,
+        errors: List[str],
+    ) -> JudgeReport:
+        violations: List[str] = []
+
+        pc = 1.0
+        if daily_posts > 5:
+            violations.append(f"posts_today={daily_posts} exceeds tier-4 fatigue cliff (Buffer 2.1M)")
+            pc -= 0.30
+        elif daily_posts > 2:
+            violations.append(f"posts_today={daily_posts} enters fatigue tier (>2/day)")
+            pc -= 0.10
+        if self._total_posts_this_week > WEEKLY_FATIGUE_THRESHOLD:
+            violations.append(f"weekly posts={self._total_posts_this_week} > {WEEKLY_FATIGUE_THRESHOLD} (Buffer 2.1M cap)")
+            pc -= 0.20
+        if self._collabs_this_month >= 4:
+            violations.append(f"collab cadence={self._collabs_this_month} net-negative beyond 3 (Cen 2024)")
+            pc -= 0.20
+        if errors:
+            violations.append(f"plan_errors={len(errors)}")
+            pc -= 0.05 * len(errors)
+        if self._hours_since_sleep > 22:
+            violations.append(f"sleep_debt: {self._hours_since_sleep}h awake (Van Dongen 2003)")
+            pc -= 0.10
+
+        burnout_pressure = (1.0 - energy_min) * 0.4 + self._sleep_debt * 0.3 + (self._low_energy_days / 5.0) * 0.3
+        sustainability_risk = max(0.0, min(1.0, burnout_pressure))
+
+        intents_used = {sa.intent for sa in action.scheduled_actions if sa.intent}
+        formats_used = {sa.content_type for sa in action.scheduled_actions if sa.action_type == "post" and sa.content_type}
+        eng_per_post = daily_engagement / max(1, daily_posts)
+        sq = (
+            0.40 * min(1.0, eng_per_post / 1.2)
+            + 0.30 * min(1.0, len(intents_used) / 2.0)
+            + 0.30 * min(1.0, len(formats_used) / 2.0)
+        )
+
+        explanation = (
+            f"compliance={max(0.0, pc):.2f} risk={sustainability_risk:.2f} strategy={sq:.2f} | "
+            + (("violations: " + "; ".join(violations)) if violations else "no policy violations")
+        )
+
+        return JudgeReport(
+            policy_compliance=max(0.0, min(1.0, pc)),
+            sustainability_risk=sustainability_risk,
+            strategic_quality=max(0.0, min(1.0, sq)),
+            explanation=explanation,
+            violations=violations,
+        )
+
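`_compute_judge_report` computes compliance as a deduction ledger. A standalone sketch of just that arithmetic (thresholds copied from the diff; `weekly_cap` stands in for the env's `WEEKLY_FATIGUE_THRESHOLD`, whose value is not shown here):

```python
from typing import List, Tuple

def policy_compliance(daily_posts: int, weekly_posts: int, weekly_cap: int,
                      collabs_this_month: int, n_plan_errors: int,
                      hours_awake: float) -> Tuple[float, List[str]]:
    """Start at 1.0 and subtract a fixed penalty per sourced rule break."""
    pc, violations = 1.0, []
    if daily_posts > 5:               # fatigue cliff (Buffer 2.1M)
        violations.append("daily fatigue cliff")
        pc -= 0.30
    elif daily_posts > 2:             # soft fatigue tier
        violations.append("daily fatigue tier")
        pc -= 0.10
    if weekly_posts > weekly_cap:     # weekly cap (Buffer 2.1M)
        violations.append("weekly cap")
        pc -= 0.20
    if collabs_this_month >= 4:       # diminishing returns (Cen 2024)
        violations.append("collab cadence")
        pc -= 0.20
    if n_plan_errors:
        violations.append("plan errors")
        pc -= 0.05 * n_plan_errors
    if hours_awake > 22:              # sleep debt (Van Dongen 2003)
        violations.append("sleep debt")
        pc -= 0.10
    return max(0.0, min(1.0, pc)), violations
```

A clean day keeps the full 1.0; a day that over-posts, over-collabs, and skips sleep stacks every penalty at once.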
+    def _compute_headline_metrics(self, grader_score: float) -> HeadlineMetrics:
+        baseline = HEURISTIC_BASELINE_SCORES.get(self._task, 0.30)
+        vs_pct = (grader_score - baseline) / baseline if baseline > 0 else 0.0
+        spt = grader_score / max(1, self._total_tool_calls)
+        sp1k = grader_score / max(1.0, self._total_action_chars / 1000.0)
+
+        retention: Optional[float] = None
+        if self._chain_id:
+            entry = _SHIFT_HISTORY.setdefault(self._chain_id, {})
+            label = self._shift_label or "baseline"
+            entry[label] = grader_score
+            base = entry.get("baseline")
+            shifted = entry.get("shifted")
+            if base is not None and shifted is not None and base > 0:
+                retention = shifted / base
+
+        return HeadlineMetrics(
+            vs_baseline_pct=round(vs_pct, 4),
+            score_per_tool_call=round(spt, 4),
+            score_per_1k_chars=round(sp1k, 4),
+            retention_under_shift=round(retention, 4) if retention is not None else None,
+            heuristic_baseline_score=round(baseline, 4),
+            agent_score=round(grader_score, 4),
+            total_tool_calls=self._total_tool_calls,
+            total_action_chars=self._total_action_chars,
+        )
+
     # ----- core API -----
 
     def reset(self, seed: Optional[int] = None, episode_id: Optional[str] = None, **kwargs: Any) -> ViraltestObservation:
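The headline arithmetic reduces to two ratios; a standalone sketch under illustrative names (not the module's API):

```python
from typing import Dict, Optional

def vs_baseline_pct(agent_score: float, baseline: float) -> float:
    """Relative lift over the heuristic baseline (1.0 means 2x the baseline)."""
    return (agent_score - baseline) / baseline if baseline > 0 else 0.0

def retention_under_shift(chain_scores: Dict[str, float]) -> Optional[float]:
    """shifted/baseline once both runs of an episode chain have been graded."""
    base = chain_scores.get("baseline")
    shifted = chain_scores.get("shifted")
    if base is not None and shifted is not None and base > 0:
        return shifted / base
    return None
```

Retention stays `None` until the second (shifted) run of the chain reports its score, mirroring the `_SHIFT_HISTORY` handshake above.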
|
|
|
         self._state = State(episode_id=episode_id or str(uuid4()), step_count=0)
         self._init_state()
 
+        self._shift_label = kwargs.get("shift_label")
+        self._chain_id = kwargs.get("episode_chain_id")
+
         chain_id = kwargs.get("episode_chain_id")
         if chain_id and chain_id in _BRAND_STORE:
             brand = _BRAND_STORE[chain_id]
         if action.notes:
             self._agent_notes = action.notes
 
+        try:
+            self._total_action_chars += len(action.model_dump_json())
+        except Exception:
+            pass
+
         tool_results: List[ToolResult] = []
         for tc in action.tool_calls:
             result = self._dispatch_tool(tc)
             tool_results.append(result)
+            if result.success:
+                self._total_tool_calls += 1
 
+        # Process collab proposal (no hard cap; diminishing returns enforced via _collab_multipliers)
+        self._active_collab = None
+        if action.collab:
             self._collabs_this_month += 1
             self._collab_history.append(action.collab.partner_id)
+            self._active_collab = action.collab
 
         # Validate scheduled actions
         schedule: Dict[int, ScheduledAction] = {}
 
         done = self._state.step_count >= TASK_HORIZON or self._energy <= 0.0
         coach = self._compute_coach_feedback(daily_engagement)
+        judge = self._compute_judge_report(action, daily_engagement, daily_posts, energy_min, errors)
 
         if done:
             self._episode_done = True
             grader_score = self._run_grader()
+            headline = self._compute_headline_metrics(grader_score)
 
             chain_id = kwargs.get("episode_chain_id")
             if chain_id:
                 grader_score=grader_score, daily_total_engagement=daily_engagement,
                 daily_posts_made=daily_posts, daily_energy_min=energy_min,
                 tool_results=tool_results, engagement_signals=daily_signals,
+                coach_feedback=coach, judge_report=judge, headline_metrics=headline,
             )
             return self._final_observation
 
             daily_total_engagement=daily_engagement,
             daily_posts_made=daily_posts, daily_energy_min=energy_min,
             tool_results=tool_results, engagement_signals=daily_signals,
+            coach_feedback=coach, judge_report=judge,
         )
 
     def _process_hour_action(self, sa: ScheduledAction) -> Tuple[float, float, Optional[EngagementSignals]]:
         engagement = 0.0
         signals = None
 
+        collab_growth_mult = 1.0
+
         if sa.action_type == "post":
             cost = CONTENT_ENERGY_COST.get(sa.content_type, 0.1)
             if self._content_queue > 0:
                 * trending_bonus * comp_diff * fatigue * algo_mult
                 * niche_mult * saturation_factor
             )
+
+            if self._active_collab is not None and self._active_collab.hour == sa.hour:
+                eng_m, growth_m = self._collab_multipliers(self._active_collab.partner_id)
+                engagement *= eng_m
+                collab_growth_mult = growth_m
+
             engagement = min(engagement, 5.0)
 
             signals = self._compute_engagement_signals(sa.content_type, engagement, sa.intent)
             self._time_since_last_post = 0
 
             if engagement > 0:
+                self._followers += int(engagement * 100 * collab_growth_mult)
 
         elif sa.action_type == "create_content":
             self._energy = max(0.0, self._energy - CREATE_CONTENT_COST)
         tool_results: Optional[List[ToolResult]] = None,
         engagement_signals: Optional[EngagementSignals] = None,
         coach_feedback: Optional[Dict[str, Any]] = None,
+        judge_report: Optional[JudgeReport] = None,
+        headline_metrics: Optional[HeadlineMetrics] = None,
     ) -> ViraltestObservation:
         recent_eng = self._engagement_history[-10:] if self._engagement_history else []
         eng_rate = sum(recent_eng) / len(recent_eng) if recent_eng else 0.0
             daily_energy_min=round(daily_energy_min, 3),
             engagement_signals=engagement_signals,
             coach_feedback=coach_feedback,
+            judge_report=judge_report,
+            headline_metrics=headline_metrics,
             tool_results=tool_results or [],
             agent_notes=self._agent_notes,
             api_budget_remaining=self._api_budget,
             return 0.0
 
     def _theoretical_max_engagement(self) -> float:
+        # Buffer 2.1M (RESEARCH.md): 3–5 posts/week doubles follower growth vs 1–2,
+        # diminishing returns above 5/week, 20–35% engagement drop per post above 7/week.
+        # Cap at 5 posts/week × 4 weeks = 20 posts/month (sweet-spot, no fatigue penalty).
         best_base = max(BASE_ENGAGEMENT.values())
         best_reach = max(REACH_MULT.values())
         best_niche = max(_NICHE_MULTIPLIERS.values()) if _NICHE_MULTIPLIERS else 1.0
 
+        posts_per_week = 5
+        weeks_in_horizon = TASK_HORIZON / 7.0
+        total_posts = int(round(posts_per_week * weeks_in_horizon))
 
         avg_heatmap_peak = 1.0
         if _HEATMAP_GRID:
+            day_peaks = [
+                max(row) if row else 1.0
+                for row in _HEATMAP_GRID.values()
+            ]
             avg_heatmap_peak = sum(day_peaks) / len(day_peaks) if day_peaks else 1.0
 
+        # Trending + tag uplifts: tier-1 industry data shows ~1.2-1.3x for trending topics
+        # and ~1.05-1.15x for high-performance tags. Mid-range used to avoid headroom inflation.
         trending_bonus = 1.25
         tag_boost = 1.1
 
         per_post = (
             best_base * best_reach * best_niche
+            * avg_heatmap_peak * trending_bonus * tag_boost
         )
         return per_post * total_posts
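Plugging representative numbers into the headroom formula above (all constants here are illustrative stand-ins, since the diff does not show the BASE_ENGAGEMENT / REACH_MULT / heatmap tables, and a 28-day horizon is assumed):

```python
# Illustrative stand-ins: the real maxima come from BASE_ENGAGEMENT, REACH_MULT,
# and the niche/heatmap tables, which this hunk does not show.
best_base, best_reach, best_niche = 1.0, 1.2, 1.1
avg_heatmap_peak, trending_bonus, tag_boost = 1.3, 1.25, 1.1

# 5 posts/week at the fatigue-free cap over an assumed 28-day horizon gives 20 posts.
posts_per_week, horizon_days = 5, 28
total_posts = round(posts_per_week * horizon_days / 7)

per_post = best_base * best_reach * best_niche * avg_heatmap_peak * trending_bonus * tag_boost
theoretical_max = per_post * total_posts
```

The ceiling scales linearly in post count, so the 5-posts/week cap, not the multipliers, dominates the headroom.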