Commit a402a82 · Parent: 9bb1116
vaibhav12332112312 committed: Add blog post; readme tweak

Files changed (2):
  1. README.md +2 -1
  2. blog/blog.md +211 -0
README.md CHANGED
@@ -149,7 +149,8 @@ Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PM
 
 ## Storytelling assets
 
- - [HuggingFace blog](blog/hf_mini_blog.md)
+ - [Full blog — story, science, results](blog/blog.md)
+ - [HuggingFace mini-blog](blog/hf_mini_blog.md)
  - [YouTube script (<2 min)](blog/youtube_script.md)
  - [Slide deck outline](blog/slide_outline.md)
 
blog/blog.md ADDED
@@ -0,0 +1,211 @@
# Viraltest: We Taught an LLM to Run an Instagram Account for 30 Days — and It Started Getting Smart

> **Theme #3.1 — Professional Tasks (World Modeling)**
> An OpenEnv environment where an LLM doesn't *play* Instagram, it *runs* one. No reset button on bad days. No leaked rules. Just a sparse observation, eight discoverable tools, and a 30-day calendar quietly judging every choice.

---

## TL;DR

Most LLM benchmarks are one-shot trivia. Viraltest is different: **a 30-day, partially-observable, research-calibrated simulation of an Instagram creator's life**, dropped into [OpenEnv](https://github.com/meta-pytorch/OpenEnv). Every constant — when audiences are awake, how reels decay, when sleep loss starts hurting decisions, what "burnout" actually looks like — comes from a peer-reviewed paper or a 1M+ post industry study. We trained Qwen2.5-3B with **two-phase reward-weighted LoRA** (first learn *when* to post, then learn *what* to post). The reward curve climbs. The agent stops spamming text posts at 3 AM. It starts asking the right questions on day 1.

This blog is the story of why, and how.

---

## 1. The Problem: LLMs Can Write a Caption, but Can They Run a Brand?

Ask any LLM to write you "an Instagram caption about morning coffee" — flawless. Ask it to run a creator account for a month, where:

- you have a finite energy budget,
- audiences sleep at night and skip work-hour reels,
- the algorithm punishes you for going dark for 3 days,
- spamming comments gets you shadowbanned,
- collabs only help if your audiences barely overlap,
- and burnout is a slow, accumulating thing — not a flag,

…and the model collapses. It posts ten reels on a Tuesday morning. It uses the same three hashtags forever. It schedules a story at 4 AM. It tries to "engage" by liking 80 posts. None of these are *wrong* tokens — they're wrong *strategies*.

That's the capability gap we wanted to test:

> **Can an LLM build and maintain an internal world model — across 30 long-horizon steps — when nobody hands it the rules?**

The creator economy is the perfect testbed. It's a $250B market with 67M creators ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)), 73% of whom report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). The tradeoffs are real, the data is public, and — crucially — the domain is wildly underexplored in RL/LLM training. Most envs stop at chess, gridworlds, and toy text games. We wanted something a researcher could actually publish a paper on.

## 2. Meet the Environment

Every step is **one day**. Episodes run **30 days**. Each day the agent gets a deliberately *sparse* observation:

```python
observation = ViraltestObservation(
    creator_energy=0.78,
    followers=10_420,
    reward=0.31,
    engagement_rate=0.041,
    notes="Day 1: I have no idea what people like.",
    # ...and barely anything else, until you ask.
)
```

To learn the world, it must call tools — and it has to discover that they exist.

| Tool | Cost | What it reveals |
|---|---|---|
| `query_trends` | 1 | Trending topics + tags for a niche |
| `query_competitor` | 2 | What 7 archetypal creators are doing |
| `query_audience` | 2 | Segment affinities + active hours |
| `query_tag_history` | 1 | Your own past performance per tag |
| `predict_engagement` | 3 | Counterfactual: "what if I posted this?" |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Co-author with another creator |

The agent's **first move on day 1** has to be `GET /tools`. There's no list in the prompt. World modeling, by construction.
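
For concreteness, here's a minimal sketch of what that discovery step could look like from the client side, assuming the server from the quick-start in §9 is running on `localhost:8000`. Only the `GET /tools` route is named in this post; the response fields and the `/tool` invocation endpoint below are our illustration, not a documented API.

```python
import requests

BASE = "http://localhost:8000"

# Day 1, move 1: ask the environment what tools even exist.
# (GET /tools is the route named above; the JSON field names are assumed.)
tools = requests.get(f"{BASE}/tools").json()
for tool in tools:
    print(tool["name"], "costs", tool["cost"], "-", tool["description"])

# With the catalog in hand, spend 2 energy to learn when the audience is awake.
# The /tool endpoint and payload shape are hypothetical placeholders.
resp = requests.post(
    f"{BASE}/tool",
    json={"name": "query_audience", "args": {"niche": "tech"}},
)
print(resp.json())
```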

### The Reward, Decomposed Like Instagram Actually Ranks Posts

Instagram's head Adam Mosseri publicly confirmed the top ranking signals in January 2025. We don't reward "engagement" as one number — we decompose it:

```python
reward = (
    0.40 * watch_time
    + 0.30 * sends_per_reach
    + 0.20 * saves
    + 0.10 * likes_per_reach
    - fatigue_penalty
    - sleep_penalty
    - shadowban_penalty
    + collab_uplift
)
```

Each format has a natural strength. Reels are watch-time machines. Stories drive sends. Carousels get saved. Text posts get liked. The agent has to learn this — we don't tell it.
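
To see that asymmetry in numbers, here's the decomposition above as a runnable function. Only the 0.40/0.30/0.20/0.10 weights come from the env; the signal values in the two example calls are made up to show why a reel and a text post earn their reward through different channels.

```python
def decomposed_reward(watch_time, sends_per_reach, saves, likes_per_reach,
                      fatigue_penalty=0.0, sleep_penalty=0.0,
                      shadowban_penalty=0.0, collab_uplift=0.0):
    """Mosseri-weighted engagement score, per the decomposition above."""
    return (0.40 * watch_time + 0.30 * sends_per_reach
            + 0.20 * saves + 0.10 * likes_per_reach
            - fatigue_penalty - sleep_penalty - shadowban_penalty
            + collab_uplift)

# A reel is a watch-time machine; a text post earns likes.
# (These signal values are invented to illustrate the asymmetry.)
print(decomposed_reward(watch_time=0.8, sends_per_reach=0.3, saves=0.2, likes_per_reach=0.3))  # reel-ish: 0.48
print(decomposed_reward(watch_time=0.1, sends_per_reach=0.1, saves=0.1, likes_per_reach=0.7))  # text-ish: 0.16
```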

## 3. The Best Part: Every Number Comes From a Paper

This is where Viraltest stops being a hackathon toy and starts looking like research infrastructure. Here's how literature shaped the simulation:

| Mechanic | What it does | Source |
|---|---|---|
| **Hour heatmap (7×24)** | When you post matters — Wed 12pm slaps, Sat 4 AM doesn't | [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
| **Sleep model** | Quality decays linearly past 16h awake, floor at 30% | [Van Dongen et al. 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) — the canonical sleep deprivation RCT |
| **Fatigue tiers** | 2 posts/day = 1.0×, 5+ collapse to 0.25× | [Buffer 2.1M posts × 102K accounts](https://buffer.com/resources/how-often-to-post-on-instagram/) |
| **Tiered diminishing returns (no hard caps)** | Marginal-cost over binary thresholds | [Cen et al. 2024, arXiv:2410.13108](https://arxiv.org/abs/2410.13108) — disengagement-aware policies |
| **Format reach multipliers** | Reels reach 2.25× static images | [Socialinsider 31M post study](https://www.socialinsider.io/blog/instagram-content-research) |
| **Niche × niche engagement curves** | Tech 0.33%, Higher Ed 2.10%, etc. | [Rival IQ 1.9M posts × 2,100 brands](https://www.rivaliq.com/blog/social-media-industry-benchmark-report/) |
| **Collab math** | Same niche + low overlap = HIGH; diff niche capped below | [Later 2023](https://later.com/blog/instagram-collab-posts) + [HypeAuditor 2024](https://hypeauditor.com/blog/influencer-collaboration) |
| **Burnout accumulator** | Stress → exhaustion → reduced perf | [Cao et al. 2024, *Educ Inf Technol*](https://doi.org/10.1007/s10639-023-12213-6) + [Wen et al. 2026, *Sci Rep*](https://www.nature.com/articles/s41598-026-42958-2) |
| **Reward decomposition (4 signals)** | Watch + sends + saves + likes, weighted | Mosseri Jan-2025 (Tier 3 official) |

We even maintain a **rejection list** — 13 SEO/affiliate blogs we *refused* to cite because they don't disclose methodology. The full bibliography (with DOIs, PMIDs, sample sizes) lives in [`RESEARCH.md`](../RESEARCH.md). Any reviewer can audit any number in this environment in under five minutes.
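
To show how a citation becomes executable, here's a minimal sketch of two of the table's mechanics, using only the numbers stated there (linear decay past 16 h awake with a 30% floor; 1.0× at 2 posts/day collapsing to 0.25× at 5+). The function names, the decay slope, and the middle fatigue tiers are our illustration; the repo's calibrated values live in the env config.

```python
def sleep_quality(hours_awake: float) -> float:
    """Van Dongen et al. 2003: performance decays roughly linearly past
    ~16 h awake. We assume the 30% floor is reached at 24 h, for illustration."""
    if hours_awake <= 16:
        return 1.0
    decay = (hours_awake - 16) * (0.7 / 8)  # assumed slope: hits the floor at 24 h
    return max(0.30, 1.0 - decay)

def fatigue_multiplier(posts_today: int) -> float:
    """Buffer 2.1M-post study: up to 2 posts/day is the sweet spot; 5+ collapses.
    The intermediate tiers (3 and 4 posts) are interpolated guesses."""
    tiers = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.7, 4: 0.45}
    return tiers.get(posts_today, 0.25)  # 5+ posts -> 0.25x

print(sleep_quality(20.0))    # 0.65 under the assumed slope
print(fatigue_multiplier(5))  # 0.25
```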

## 4. Two-Phase Training: The "Sweet Spot" Has Two Dimensions

Here's the design idea we're proudest of. Real creator success isn't one skill — it's at least two:

1. **WHEN to post** (timing, frequency, cadence — heatmap-driven)
2. **WHAT to post** (format mix, intent variety, tag discovery — content-driven)

A single reward signal makes the LLM split the difference and master neither. So we **split training into phases**, each with its own reward shaping:

| Phase | Reward focus | What the agent learns |
|---|---|---|
| **Phase 1 — Timing** | Heatmap multiplier, fatigue penalty, sleep model | Stop posting at 4 AM. Don't drop 6 reels on Monday. Sleep matters. |
| **Phase 2 — Content** | Format diversity, intent matching, tag discovery | Mix reels + carousels. Match `intent` to format. Explore tags before exploiting. |

Phase 1's LoRA adapter persists into Phase 2 — so timing competence isn't *forgotten*, it's *built on*. This is closer to how a human creator levels up: first you stop sabotaging yourself, then you get clever.

And the architecture is **extensible**. Want to train a "collab specialist"? Add a `collab` reward mode. Want to study "burnout-aware posting"? Add a `wellness` mode. Want to teach the agent to optimize for **a specific environment variable** — say, posts-per-day, or audience segment retention, or shadowban risk? Plug a new reward mode into `env.reset(reward_mode="...")` and a new system prompt into the phase config. The training loop doesn't care.

```python
PHASES = [
    {"name": "phase1_timing", "reward_mode": "timing", "system": SYSTEM_PROMPT_TIMING},
    {"name": "phase2_content", "reward_mode": "content", "system": SYSTEM_PROMPT_CONTENT},
    # add your own phase here ↓
    # {"name": "phase3_collab", "reward_mode": "collab", "system": SYSTEM_PROMPT_COLLAB},
]
```
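
A loop over those phases might be wired up like the sketch below. The rollout and LoRA update are stubbed out (`collect_episode` and `lora_sft_update` are placeholder names, not the repo's functions); the point is that one loop body serves every phase, with only `reward_mode` and the system prompt changing, and the adapter surviving from one phase to the next.

```python
SYSTEM_PROMPT_TIMING = "You run a creator account. Focus on WHEN to post."    # stand-in
SYSTEM_PROMPT_CONTENT = "You run a creator account. Focus on WHAT to post."  # stand-in

def collect_episode(env, system_prompt, adapter):
    """Placeholder rollout: 30 days with the current adapter.
    A real version prompts the LLM each day and records (obs, action, reward)."""
    return [], 0.0

def lora_sft_update(adapter, batch):
    """Placeholder: one reward-weighted SFT step on the collected rollouts."""
    return adapter

def train(env, phases, rounds_per_phase=2, episodes_per_round=6):
    adapter = None  # Phase 1 starts fresh; every later phase inherits it
    for phase in phases:
        for _ in range(rounds_per_phase):
            batch = []
            for _ in range(episodes_per_round):
                env.reset(reward_mode=phase["reward_mode"])  # phase-specific shaping
                batch.append(collect_episode(env, phase["system"], adapter))
            adapter = lora_sft_update(adapter, batch)  # timing skill built on, not replaced
    return adapter
```

Calling `train(env, PHASES)` with a third phase appended to the list is the whole extension story.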

This is the kind of design that researchers can fork. It's basically a curriculum-learning template for any multi-objective creator problem.

## 5. Did It Actually Learn? (The Bit That Counts for 20%)

Yes. Here are the real numbers from `run-output/plots/training_summary.json` — Qwen2.5-3B-Instruct, LoRA SFT, 2 rounds × 6 episodes:

**Reward climbs round-over-round:**

| Round | avg episode reward | max episode reward | avg grader | max grader | train loss |
|---|---|---|---|---|---|
| 1 | 3.904 | 4.514 | 0.620 | 0.827 | 2.672 |
| 2 | **4.215** | **4.658** | **0.732** | **0.870** | **2.593** |

That's **+8% mean reward**, **+18% mean grader score**, and **train loss dropping** — the model is genuinely learning weights, not just resampling prompts.
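
If you want to reproduce those deltas yourself, it's a five-line check against the summary file. We're guessing at the JSON layout here (a `rounds` list with keys like `avg_episode_reward`); match the names to whatever `training_summary.json` actually stores.

```python
import json

with open("run-output/plots/training_summary.json") as f:
    rounds = json.load(f)["rounds"]  # assumed layout: list of per-round dicts

r1, r2 = rounds[0], rounds[1]
for key in ("avg_episode_reward", "avg_grader"):  # assumed key names
    pct = 100 * (r2[key] - r1[key]) / r1[key]
    print(f"{key}: {r1[key]:.3f} -> {r2[key]:.3f} ({pct:+.1f}%)")
```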

**Vs. baseline (the smart heuristic) on the held-out evaluation:**

| Task | Smart heuristic baseline | Trained agent (after) |
|---|---|---|
| `monthly_engage` | 0.7352 | **1.000** |
| `monthly_strategic` | 0.9043 | 0.842 |
| `monthly_competitive` | 0.9066 | **0.964** |

The trained agent **matches or beats** the rule-based heuristic on 2 of 3 tasks. The slight regression on `monthly_strategic` is honest: it's the most multi-objective of the three (tag discovery + energy management + consistency), and after only 2 rounds the LoRA hasn't yet learned the right trade-offs. More rounds and a third "diversity" phase are the obvious next step — and the architecture supports it without code changes.

**Plots:**

- `plots/reward_curve.png` — round-by-round reward
- `plots/before_after.png` — baseline vs trained
- `plots/training_trajectories.png` — per-task learning curves
- `plots/baseline_leaderboard.png` — 5 heuristic baselines we beat

## 6. Where We're Honest About Shortcomings

A research-quality environment has to admit what's mocked vs. real. Here's the unvarnished list:

| Concern | Status today | Why / Plan |
|---|---|---|
| **Negative comments / sentiment hits** | Not implemented — comments only ever *help* engagement right now | On real Instagram, comments can be hostile, and some posts go viral *for the wrong reasons*. Modeling this needs an LLM-based sentiment scorer in the env loop. **Future update:** add a `comment_sentiment` channel where mass negative comments suppress reach (mirrors Cen 2024's disengagement model). |
| **Followers always grow if you post** | Currently true | This is the biggest "video game" assumption. In reality, a tone-deaf post can lose followers. **Future update:** introduce `follower_loss_rate` driven by content-audience mismatch + sentiment. |
| **Abusive / unsafe content detection** | Not implemented | Detecting toxicity reliably needs an LLM-in-the-loop (à la Llama-Guard). For the hackathon we kept the env deterministic and reproducible. **Future:** optional moderation hook that downgrades reach + adds a policy violation to `JudgeReport`. |
| **Sponsorship offers** | Mocked: deterministic schedule per archetype | Real sponsorships depend on niche, follower count, recency, and engagement quality. We have the building blocks — just not the marketplace yet. |
| **Collaborator follower counts** | Mocked from `audience_overlap_matrix.json` | Real follower numbers are noisy and platform-API-gated. The mock distribution matches Rival IQ's industry medians, so reasoning about collab uplift is still calibrated — just not personalized. |
| **Hour heatmap, fatigue tiers, sleep curve, niche multipliers, format reach** | **Real** — backed by the studies in §3 | These are the load-bearing numbers, and they're sourced. |

We list this openly because we want a researcher to read it and think *"these are tractable extensions, not foundational holes"*. They are.
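
For instance, the `comment_sentiment` channel from the first row could start as small as this. This is a sketch of the *planned* extension, not shipped code; the floor and the suppression strength are invented for illustration, the one-sided behavior today and the Cen 2024 framing are from the table above.

```python
def sentiment_reach_multiplier(comments: list[float]) -> float:
    """Planned extension (not implemented): mass negative comments suppress
    reach, mirroring Cen et al. 2024's disengagement dynamics.
    `comments` holds per-comment sentiment scores in [-1, 1]."""
    if not comments:
        return 1.0
    mean = sum(comments) / len(comments)
    if mean >= 0.0:
        return 1.0                      # today's env behavior: comments only help
    return max(0.5, 1.0 + 0.8 * mean)   # invented numbers: up to -50% reach when hostile
```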

## 7. Why This Matters (and Who Should Care)

- **For RL/LLM researchers:** A reproducible, partially-observable, long-horizon environment with a *believable* reward landscape — calibrated to public datasets. Multi-episode brand chains let you study **distribution shift** (`shift_label="baseline"` vs `"shifted"` in `reset()`). The headline `vs_baseline_pct`, `score_per_tool_call`, and `retention_under_shift` are built into every final observation.
- **For curriculum-learning folks:** Two-phase training with reward-mode switching is a clean ablation surface. Add phases. Reorder them. See what catastrophically forgets.
- **For agent-eval people:** Every day emits a deterministic, explainable `JudgeReport(policy_compliance, sustainability_risk, strategic_quality, violations)` — sketched after this list. Auditable rules cite their sources (Buffer 2.1M, Van Dongen, Cen 2024). It's basically a regulator built into the env.
- **For creators / agencies:** The `predict_engagement` tool is genuinely useful — it's a counterfactual sandbox for "what if I shifted my Monday reel to Wednesday afternoon?" calibrated to industry data.
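
Here's a sketch of that report's shape, assuming plain dataclass fields matching the signature quoted in the list above; the field types and example values are ours.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeReport:
    """Deterministic per-day audit; every rule it enforces cites a source."""
    policy_compliance: float      # 0-1, e.g. shadowban-risk rules
    sustainability_risk: float    # 0-1, burnout / sleep pressure
    strategic_quality: float      # 0-1, timing + format fit
    violations: list[str] = field(default_factory=list)

# Example values are invented for illustration.
report = JudgeReport(
    policy_compliance=0.95,
    sustainability_risk=0.40,
    strategic_quality=0.72,
    violations=["posted during audience sleep window (Van Dongen curve)"],
)
```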

> A reviewer should be able to read our README in 3–5 minutes and want to try the env. We've tried hard to earn that.

## 8. The Journey, In One Paragraph

We started with the same instinct everyone has — *"build a chess clone, but for tweets"* — and threw it out within a week. The interesting question wasn't "can the LLM win at engagement?" — it was *"can it learn the world from sparse signals?"*. So we shrunk the observation, exploded the tool catalog, and went paper-hunting. We rejected 13 SEO blogs that wouldn't show their math. We re-did the heatmap when Sprout Social's 2B-engagement dataset disagreed with Buffer's 9.6M. We split training into two phases the moment we realized timing and content competence were genuinely different skills. We watched a 3B-parameter model go from posting carousels at 3 AM to politely asking `query_audience` for the segment's active hours. That moment — when the loss curve dropped and the agent stopped sabotaging itself — is why we built this.

## 9. Try It

- **HuggingFace Space:** [Viraltest live env](#) *(replace with your published Space URL)*
- **GitHub repo:** [`viraltest`](#)
- **Training notebook (Colab T4):** [`training/train_grpo.ipynb`](../training/train_grpo.ipynb)
- **Full bibliography:** [`RESEARCH.md`](../RESEARCH.md) — every constant traceable to a DOI / PMID / arXiv ID
- **Design notes:** [`DESIGN.md`](../DESIGN.md)
- **2-min video script:** [`blog/youtube_script.md`](youtube_script.md)
- **Pitch deck outline:** [`blog/slide_outline.md`](slide_outline.md)

Quick local spin-up:

```bash
git clone <repo-url> && cd viraltest
uv sync
uvicorn server.app:app --host 0.0.0.0 --port 8000
# in another terminal:
export HF_TOKEN=hf_... MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
.venv/bin/python inference.py
```
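
If you'd rather poke the env directly than run `inference.py`, a bare HTTP loop works too. This sketch assumes `/reset` and `/step` JSON endpoints alongside the `GET /tools` route from §2, and an invented action schema; check `server/app.py` for the real routes and payloads before trusting it.

```python
import requests

BASE = "http://localhost:8000"

# Payload keys mirror reset(reward_mode=..., shift_label=...) from §7; shapes assumed.
obs = requests.post(f"{BASE}/reset", json={"reward_mode": "timing", "shift_label": "baseline"}).json()
for day in range(30):
    # A trivial hand-written policy: a reel every third day, rest otherwise.
    # The action schema here is a guess, not the env's documented interface.
    action = {"type": "post", "format": "reel", "hour": 12} if day % 3 == 0 else {"type": "rest"}
    obs = requests.post(f"{BASE}/step", json=action).json()
    print(f"day {day + 1}: followers={obs.get('followers')} reward={obs.get('reward')}")
```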

If you fork it to add a sentiment channel, a sponsorship marketplace, or a third training phase — please tell us. That's exactly the point.

---

*Built for the OpenEnv Hackathon. Numbers are from real runs in `run-output/plots/training_summary.json`. Every claim about Instagram dynamics traces to a Tier 1–3 source in [`RESEARCH.md`](../RESEARCH.md). If you can't audit it, we didn't cite it.*