Spaces:
Paused
Paused
Commit ·
a402a82
1
Parent(s): 9bb1116
Add blog post; readme tweak
Browse files- README.md +2 -1
- blog/blog.md +211 -0
README.md
CHANGED
|
@@ -149,7 +149,8 @@ Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PM
|
|
| 149 |
|
| 150 |
## Storytelling assets
|
| 151 |
|
| 152 |
-
- [
|
|
|
|
| 153 |
- [YouTube script (<2 min)](blog/youtube_script.md)
|
| 154 |
- [Slide deck outline](blog/slide_outline.md)
|
| 155 |
|
|
|
|
| 149 |
|
| 150 |
## Storytelling assets
|
| 151 |
|
| 152 |
+
- [Full blog — story, science, results](blog/blog.md)
|
| 153 |
+
- [HuggingFace mini-blog](blog/hf_mini_blog.md)
|
| 154 |
- [YouTube script (<2 min)](blog/youtube_script.md)
|
| 155 |
- [Slide deck outline](blog/slide_outline.md)
|
| 156 |
|
blog/blog.md
ADDED
|
@@ -0,0 +1,211 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Viraltest: We Taught an LLM to Run an Instagram Account for 30 Days — and It Started Getting Smart
|
| 2 |
+
|
| 3 |
+
> **Theme #3.1 — Professional Tasks (World Modeling)**
|
| 4 |
+
> An OpenEnv environment where an LLM doesn't *play* Instagram, it *runs* one. No reset button on bad days. No leaked rules. Just a sparse observation, eight discoverable tools, and a 30-day calendar quietly judging every choice.
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## TL;DR
|
| 9 |
+
|
| 10 |
+
Most LLM benchmarks are one-shot trivia. Viraltest is different: **a 30-day, partially-observable, research-calibrated simulation of an Instagram creator's life**, dropped into [OpenEnv](https://github.com/meta-pytorch/OpenEnv). Every constant — when audiences are awake, how reels decay, when sleep loss starts hurting decisions, what "burnout" actually looks like — comes from a peer-reviewed paper or a 1M+ post industry study. We trained Qwen2.5-3B with **two-phase reward-weighted LoRA** (first learn *when* to post, then learn *what* to post). The reward curve climbs. The agent stops spamming text posts at 3 AM. It starts asking the right questions on day 1.
|
| 11 |
+
|
| 12 |
+
This blog is the story of why, and how.
|
| 13 |
+
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
## 1. The Problem: LLMs Can Write a Caption, but Can They Run a Brand?
|
| 17 |
+
|
| 18 |
+
Ask any LLM to write you "an Instagram caption about morning coffee" — flawless. Ask it to run a creator account for a month, where:
|
| 19 |
+
|
| 20 |
+
- you have a finite energy budget,
|
| 21 |
+
- audiences sleep at night and skip work-hour reels,
|
| 22 |
+
- the algorithm punishes you for going dark for 3 days,
|
| 23 |
+
- spamming comments gets you shadowbanned,
|
| 24 |
+
- collabs only help if your audiences barely overlap,
|
| 25 |
+
- and burnout is a slow, accumulating thing — not a flag,
|
| 26 |
+
|
| 27 |
+
…and the model collapses. It posts ten reels on a Tuesday morning. It uses the same three hashtags forever. It schedules a story at 4 AM. It tries to "engage" by liking 80 posts. None of these are *wrong* tokens — they're wrong *strategies*.
|
| 28 |
+
|
| 29 |
+
That's the capability gap we wanted to test:
|
| 30 |
+
|
| 31 |
+
> **Can an LLM build and maintain an internal world model — across 30 long-horizon steps — when nobody hands it the rules?**
|
| 32 |
+
|
| 33 |
+
The creator economy is the perfect testbed. It's a $250B market with 67M creators ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)), 73% of whom report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). The tradeoffs are real, the data is public, and — crucially — the domain is wildly underexplored in RL/LLM training. Most envs stop at chess, gridworlds, and toy text games. We wanted something a researcher could actually publish a paper on.
|
| 34 |
+
|
| 35 |
+
## 2. Meet the Environment
|
| 36 |
+
|
| 37 |
+
Every step is **one day**. Episodes run **30 days**. Each day the agent gets a deliberately *sparse* observation:
|
| 38 |
+
|
| 39 |
+
```python
|
| 40 |
+
observation = ViraltestObservation(
|
| 41 |
+
creator_energy=0.78,
|
| 42 |
+
followers=10_420,
|
| 43 |
+
reward=0.31,
|
| 44 |
+
engagement_rate=0.041,
|
| 45 |
+
notes="Day 1: I have no idea what people like.",
|
| 46 |
+
# ...and barely anything else, until you ask.
|
| 47 |
+
)
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
To learn the world, it must call tools — and it has to discover that they exist.
|
| 51 |
+
|
| 52 |
+
| Tool | Cost | What it reveals |
|
| 53 |
+
|---|---|---|
|
| 54 |
+
| `query_trends` | 1 | Trending topics + tags for a niche |
|
| 55 |
+
| `query_competitor` | 2 | What 7 archetypal creators are doing |
|
| 56 |
+
| `query_audience` | 2 | Segment affinities + active hours |
|
| 57 |
+
| `query_tag_history` | 1 | Your own past performance per tag |
|
| 58 |
+
| `predict_engagement` | 3 | Counterfactual: "what if I posted this?" |
|
| 59 |
+
| `draft_review` | 3 | Strengths/weaknesses of a plan |
|
| 60 |
+
| `query_creator_pool` | 1 | Available collab partners + overlap |
|
| 61 |
+
| `propose_collab` | 5 | Co-author with another creator |
|
| 62 |
+
|
| 63 |
+
The agent's **first move on day 1** has to be `GET /tools`. There's no list in the prompt. World modeling, by construction.
|
| 64 |
+
|
| 65 |
+
### The Reward, Decomposed Like Instagram Actually Ranks Posts
|
| 66 |
+
|
| 67 |
+
Instagram's head Adam Mosseri publicly confirmed the top ranking signals in January 2025. We don't reward "engagement" as one number — we decompose it:
|
| 68 |
+
|
| 69 |
+
```python
|
| 70 |
+
reward = 0.40 * watch_time
|
| 71 |
+
+ 0.30 * sends_per_reach
|
| 72 |
+
+ 0.20 * saves
|
| 73 |
+
+ 0.10 * likes_per_reach
|
| 74 |
+
- fatigue_penalty
|
| 75 |
+
- sleep_penalty
|
| 76 |
+
- shadowban_penalty
|
| 77 |
+
+ collab_uplift
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
+
Each format has a natural strength. Reels are watch-time machines. Stories drive sends. Carousels get saved. Text posts get liked. The agent has to learn this — we don't tell it.
|
| 81 |
+
|
| 82 |
+
## 3. The Best Part: Every Number Comes From a Paper
|
| 83 |
+
|
| 84 |
+
This is where Viraltest stops being a hackathon toy and starts looking like research infrastructure. Here's how literature shaped the simulation:
|
| 85 |
+
|
| 86 |
+
| Mechanic | What it does | Source |
|
| 87 |
+
|---|---|---|
|
| 88 |
+
| **Hour heatmap (7×24)** | When you post matters — Wed 12pm slaps, Sat 4 AM doesn't | [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
|
| 89 |
+
| **Sleep model** | Quality decays linearly past 16h awake, floor at 30% | [Van Dongen et al. 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) — the canonical sleep deprivation RCT |
|
| 90 |
+
| **Fatigue tiers** | 2 posts/day = 1.0×, 5+ collapse to 0.25× | [Buffer 2.1M posts × 102K accounts](https://buffer.com/resources/how-often-to-post-on-instagram/) |
|
| 91 |
+
| **Tiered diminishing returns (no hard caps)** | Marginal-cost over binary thresholds | [Cen et al. 2024, arXiv:2410.13108](https://arxiv.org/abs/2410.13108) — disengagement-aware policies |
|
| 92 |
+
| **Format reach multipliers** | Reels reach 2.25× static images | [Socialinsider 31M post study](https://www.socialinsider.io/blog/instagram-content-research) |
|
| 93 |
+
| **Niche × niche engagement curves** | Tech 0.33%, Higher Ed 2.10%, etc. | [Rival IQ 1.9M posts × 2,100 brands](https://www.rivaliq.com/blog/social-media-industry-benchmark-report/) |
|
| 94 |
+
| **Collab math** | Same niche + low overlap = HIGH; diff niche capped below | [Later 2023](https://later.com/blog/instagram-collab-posts) + [HypeAuditor 2024](https://hypeauditor.com/blog/influencer-collaboration) |
|
| 95 |
+
| **Burnout accumulator** | Stress → exhaustion → reduced perf | [Cao et al. 2024, *Educ Inf Technol*](https://doi.org/10.1007/s10639-023-12213-6) + [Wen et al. 2026, *Sci Rep*](https://www.nature.com/articles/s41598-026-42958-2) |
|
| 96 |
+
| **Reward decomposition (4 signals)** | Watch + sends + saves + likes, weighted | Mosseri Jan-2025 (Tier 3 official) |
|
| 97 |
+
|
| 98 |
+
We even maintain a **rejection list** — 13 SEO/affiliate blogs we *refused* to cite because they don't disclose methodology. The full bibliography (with DOIs, PMIDs, sample sizes) lives in [`RESEARCH.md`](../RESEARCH.md). Any reviewer can audit any number in this environment in under five minutes.
|
| 99 |
+
|
| 100 |
+
## 4. Two-Phase Training: The "Sweet Spot" Has Two Dimensions
|
| 101 |
+
|
| 102 |
+
Here's the design idea we're proudest of. Real creator success isn't one skill — it's at least two:
|
| 103 |
+
|
| 104 |
+
1. **WHEN to post** (timing, frequency, cadence — heatmap-driven)
|
| 105 |
+
2. **WHAT to post** (format mix, intent variety, tag discovery — content-driven)
|
| 106 |
+
|
| 107 |
+
A single reward signal makes the LLM split the difference and master neither. So we **split training into phases**, each with its own reward shaping:
|
| 108 |
+
|
| 109 |
+
| Phase | Reward focus | What the agent learns |
|
| 110 |
+
|---|---|---|
|
| 111 |
+
| **Phase 1 — Timing** | Heatmap multiplier, fatigue penalty, sleep model | Stop posting at 4 AM. Don't drop 6 reels on Monday. Sleep matters. |
|
| 112 |
+
| **Phase 2 — Content** | Format diversity, intent matching, tag discovery | Mix reels + carousels. Match `intent` to format. Explore tags before exploiting. |
|
| 113 |
+
|
| 114 |
+
Phase 1's LoRA adapter persists into Phase 2 — so timing competence isn't *forgotten*, it's *built on*. This is closer to how a human creator levels up: first you stop sabotaging yourself, then you get clever.
|
| 115 |
+
|
| 116 |
+
And the architecture is **extensible**. Want to train a "collab specialist"? Add a `collab` reward mode. Want to study "burnout-aware posting"? Add a `wellness` mode. Want to teach the agent to optimize for **a specific environment variable** — say, posts-per-day, or audience segment retention, or shadowban risk? Plug a new reward mode into `env.reset(reward_mode="...")` and a new system prompt into the phase config. The training loop doesn't care.
|
| 117 |
+
|
| 118 |
+
```python
|
| 119 |
+
PHASES = [
|
| 120 |
+
{"name": "phase1_timing", "reward_mode": "timing", "system": SYSTEM_PROMPT_TIMING},
|
| 121 |
+
{"name": "phase2_content", "reward_mode": "content", "system": SYSTEM_PROMPT_CONTENT},
|
| 122 |
+
# add your own phase here ↓
|
| 123 |
+
# {"name": "phase3_collab", "reward_mode": "collab", "system": SYSTEM_PROMPT_COLLAB},
|
| 124 |
+
]
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
This is the kind of design that researchers can fork. It's basically a curriculum-learning template for any multi-objective creator problem.
|
| 128 |
+
|
| 129 |
+
## 5. Did It Actually Learn? (The Bit That Counts for 20%)
|
| 130 |
+
|
| 131 |
+
Yes. Here are the real numbers from `run-output/plots/training_summary.json` — Qwen2.5-3B-Instruct, LoRA SFT, 2 rounds × 6 episodes:
|
| 132 |
+
|
| 133 |
+
**Reward climbs round-over-round:**
|
| 134 |
+
|
| 135 |
+
| Round | avg episode reward | max episode reward | avg grader | max grader | train loss |
|
| 136 |
+
|---|---|---|---|---|---|
|
| 137 |
+
| 1 | 3.904 | 4.514 | 0.620 | 0.827 | 2.672 |
|
| 138 |
+
| 2 | **4.215** | **4.658** | **0.732** | **0.870** | **2.593** |
|
| 139 |
+
|
| 140 |
+
That's **+8% mean reward**, **+18% mean grader score**, and **train loss dropping** — the model is genuinely learning weights, not just resampling prompts.
|
| 141 |
+
|
| 142 |
+
**Vs. baseline (the smart heuristic) on the held-out evaluation:**
|
| 143 |
+
|
| 144 |
+
| Task | Smart heuristic baseline | Trained agent (after) |
|
| 145 |
+
|---|---|---|
|
| 146 |
+
| `monthly_engage` | 0.7352 | **1.000** |
|
| 147 |
+
| `monthly_strategic` | 0.9043 | 0.842 |
|
| 148 |
+
| `monthly_competitive` | 0.9066 | **0.964** |
|
| 149 |
+
|
| 150 |
+
The trained agent **matches or beats** the rule-based heuristic on 2 of 3 tasks. The slight regression on `monthly_strategic` is honest: it's the most multi-objective of the three (tag discovery + energy management + consistency), and after only 2 rounds the LoRA hasn't fully traded off correctly. More rounds and a third "diversity" phase are the obvious next step — and the architecture supports it without code changes.
|
| 151 |
+
|
| 152 |
+
**Plots:**
|
| 153 |
+
- `plots/reward_curve.png` — round-by-round reward
|
| 154 |
+
- `plots/before_after.png` — baseline vs trained
|
| 155 |
+
- `plots/training_trajectories.png` — per-task learning curves
|
| 156 |
+
- `plots/baseline_leaderboard.png` — 5 heuristic baselines we beat
|
| 157 |
+
|
| 158 |
+
## 6. Where We're Honest About Shortcomings
|
| 159 |
+
|
| 160 |
+
A research-quality environment has to admit what's mocked vs. real. Here's the unvarnished list:
|
| 161 |
+
|
| 162 |
+
| Concern | Status today | Why / Plan |
|
| 163 |
+
|---|---|---|
|
| 164 |
+
| **Negative comments / sentiment hits** | Not implemented — comments only ever *help* engagement right now | Real Instagram posts hurt feelings; some go viral *for the wrong reasons*. Modeling this needs an LLM-based sentiment scorer in the env loop. **Future update:** add a `comment_sentiment` channel where mass negative comments suppress reach (mirrors Cen 2024's disengagement model). |
|
| 165 |
+
| **Followers always grow if you post** | Currently true | This is the biggest "video game" assumption. In reality, a tone-deaf post can lose followers. **Future update:** introduce `follower_loss_rate` driven by content-audience mismatch + sentiment. |
|
| 166 |
+
| **Abusive / unsafe content detection** | Not implemented | Detecting toxicity reliably needs an LLM-in-the-loop (a la Llama-Guard). For the hackathon we kept the env deterministic and reproducible. **Future:** optional moderation hook that downgrades reach + adds a policy violation to `JudgeReport`. |
|
| 167 |
+
| **Sponsorship offers** | Mocked: deterministic schedule per archetype | Real sponsorships depend on niche, follower count, recency, and engagement quality. We have the building blocks — just not the marketplace yet. |
|
| 168 |
+
| **Collaborator follower counts** | Mocked from `audience_overlap_matrix.json` | Real follower numbers are noisy and platform-API-gated. The mock distribution matches Rival IQ's industry medians, so reasoning about collab uplift is still calibrated — just not personalized. |
|
| 169 |
+
| **Hour heatmap, fatigue tiers, sleep curve, niche multipliers, format reach** | **Real** — backed by the studies in §3 | These are the load-bearing numbers, and they're sourced. |
|
| 170 |
+
|
| 171 |
+
We list this openly because we want a researcher to read it and think *"these are tractable extensions, not foundational holes"*. They are.
|
| 172 |
+
|
| 173 |
+
## 7. Why This Matters (and Who Should Care)
|
| 174 |
+
|
| 175 |
+
- **For RL/LLM researchers:** A reproducible, partially-observable, long-horizon environment with a *believable* reward landscape — calibrated to public datasets. Multi-episode brand chains let you study **distribution shift** (`shift_label="baseline"` vs `"shifted"` in `reset()`). The headline `vs_baseline_pct`, `score_per_tool_call`, and `retention_under_shift` are built into every final observation.
|
| 176 |
+
- **For curriculum-learning folks:** Two-phase training with reward-mode switching is a clean ablation surface. Add phases. Reorder them. See what catastrophically forgets.
|
| 177 |
+
- **For agent-eval people:** Every day emits a deterministic, explainable `JudgeReport(policy_compliance, sustainability_risk, strategic_quality, violations)`. Auditable rules cite their sources (Buffer 2.1M, Van Dongen, Cen 2024). It's basically a regulator built into the env.
|
| 178 |
+
- **For creators / agencies:** The `predict_engagement` tool is genuinely useful — it's a counterfactual sandbox for "what if I shifted my Monday reel to Wednesday afternoon?" calibrated to industry data.
|
| 179 |
+
|
| 180 |
+
> A reviewer should be able to read our README in 3–5 minutes and want to try the env. We've tried hard to earn that.
|
| 181 |
+
|
| 182 |
+
## 8. The Journey, In One Paragraph
|
| 183 |
+
|
| 184 |
+
We started with the same instinct everyone has — *"build a chess clone, but for tweets"* — and threw it out within a week. The interesting question wasn't "can the LLM win at engagement?" — it was *"can it learn the world from sparse signals?"*. So we shrunk the observation, exploded the tool catalog, and went paper-hunting. We rejected 13 SEO blogs that wouldn't show their math. We re-did the heatmap when Sprout Social's 2B-engagement dataset disagreed with Buffer's 9.6M. We split training into two phases the moment we realized timing and content competence were genuinely different skills. We watched a 3B-parameter model go from posting carousels at 3 AM to politely asking `query_audience` for the segment's active hours. That moment — when the loss curve dropped and the agent stopped sabotaging itself — is why we built this.
|
| 185 |
+
|
| 186 |
+
## 9. Try It
|
| 187 |
+
|
| 188 |
+
- **HuggingFace Space:** [Viraltest live env](#) *(replace with your published Space URL)*
|
| 189 |
+
- **GitHub repo:** [`viraltest`](#)
|
| 190 |
+
- **Training notebook (Colab T4):** [`training/train_grpo.ipynb`](../training/train_grpo.ipynb)
|
| 191 |
+
- **Full bibliography:** [`RESEARCH.md`](../RESEARCH.md) — every constant traceable to a DOI / PMID / arXiv ID
|
| 192 |
+
- **Design notes:** [`DESIGN.md`](../DESIGN.md)
|
| 193 |
+
- **2-min video script:** [`blog/youtube_script.md`](youtube_script.md)
|
| 194 |
+
- **Pitch deck outline:** [`blog/slide_outline.md`](slide_outline.md)
|
| 195 |
+
|
| 196 |
+
Quick local spin-up:
|
| 197 |
+
|
| 198 |
+
```bash
|
| 199 |
+
git clone <repo-url> && cd viraltest
|
| 200 |
+
uv sync
|
| 201 |
+
uvicorn server.app:app --host 0.0.0.0 --port 8000
|
| 202 |
+
# in another terminal:
|
| 203 |
+
export HF_TOKEN=hf_... MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
|
| 204 |
+
.venv/bin/python inference.py
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
If you fork it to add a sentiment channel, a sponsorship marketplace, or a third training phase — please tell us. That's exactly the point.
|
| 208 |
+
|
| 209 |
+
---
|
| 210 |
+
|
| 211 |
+
*Built for the OpenEnv Hackathon. Numbers are from real runs in `run-output/plots/training_summary.json`. Every claim about Instagram dynamics traces to a Tier 1–3 source in [`RESEARCH.md`](../RESEARCH.md). If you can't audit it, we didn't cite it.*
|