Lightloom · speak your world into being
Speak → your story unrolls as a living storyboard world.

The whole pitch fits in one line: you speak, and a painted world unrolls in front of you, live. No stock art, no cloud API. Every pixel is made on the fly by a handful of tiny models sharing a single ZeroGPU slot.
This is the honest version of how it went, including the parts that broke on me.
Voice in, mural out. Four small models take turns on one slot:
flowchart LR
V[voice] -->|Parakeet-CTC| T[transcript]
T -->|split per phrase| D[Director · MiniCPM5-1B]
D -->|shot + art style| P[Painter · FLUX.2-klein-4B · 4 steps]
P --> M[one continuous mural]
M -->|Depth-Anything| Z[relief]
M -.->|at the end| A[Art Director · MiniCPM-V · names it, films it]
The clever part isn't any single model. It's that they're all small enough to ride one slot and still feel instant. No single model is the star; the orchestra is.
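The hand-offs in the flowchart can be sketched as plain function calls. This is a minimal illustration, not the Space's code: `Stages`, `split_phrases`, `run_turn`, and the punctuation-based phrase splitting are all assumptions, and each stage is an injected callable standing in for a real model.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stages:
    """Hypothetical stand-ins for the four live models."""
    transcribe: Callable[[bytes], str]      # voice -> transcript
    direct: Callable[[str], str]            # phrase -> shot + art style
    paint: Callable[[str, List[Any]], Any]  # shot + mural so far -> new strip
    depth: Callable[[Any], Any]             # strip -> relief


def split_phrases(transcript: str) -> List[str]:
    """Split on punctuation (an assumption; the real splitter is not shown)."""
    parts = re.split(r"[.,;!?]+", transcript)
    return [p.strip() for p in parts if p.strip()]


def run_turn(audio: bytes, stages: Stages, mural: List[Any]) -> List[Any]:
    """Run one spoken turn through the stages, one phrase at a time."""
    reliefs = []
    for phrase in split_phrases(stages.transcribe(audio)):
        shot = stages.direct(phrase)
        strip = stages.paint(shot, mural)  # the painter sees the mural so far
        mural.append(strip)
        reliefs.append(stages.depth(strip))
    return reliefs
```

Because the stages run strictly one after another, only one model needs the GPU at any moment, which is what lets them share a slot.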
I don't generate separate images and stitch them. Each phrase outpaints the right edge of the mural, with the previous edge fed back in as the only "truth" the model is allowed to keep:
previous strip       new canvas (768 px tall, the live size)
┌──────────────┐     ┌─────────────┬──────────────────────┐
│ ...world     │ ──► │ overlap 212 │ new 556 (painted)    │
│              │     │ mask = 0    │ mask = 255           │
└──────────────┘     │ (KEEP)      │ (GENERATE)           │
                     └─────────────┴──────────────────────┘
                       ▲ the model continues THIS to the right
Simple idea, fussy in practice. The seam between scenes kept reading as a cut. Two fixes that finally killed it: seed the fresh region with the carried edge's average colour (so a weak 4-step strip doesn't leak a grey band, the dreaded grey smear), and ramp the mask 0→255 over ~40 px at the join so the new scene emerges from the old one instead of slamming into it.
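Both seam fixes are easy to show on a single row of pixels, since the mask is the same on every row. The sketch below is an illustration under assumptions: the function names are hypothetical, the ramp is linear, and it is placed in the last 40 px of the overlap (the text only says "at the join").

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def mask_row(overlap: int = 212, new: int = 556, ramp: int = 40) -> List[int]:
    """One row of the outpainting mask: 0 = keep, 255 = generate."""
    start = overlap - ramp  # assumed: the ramp ends exactly at the join
    row = []
    for x in range(overlap + new):
        if x < start:
            row.append(0)                    # carried edge, untouched
        elif x < overlap:
            row.append(round(255 * (x - start) / ramp))  # linear fade-in
        else:
            row.append(255)                  # fresh region, fully generated
    return row


def average_colour(pixels: Sequence[RGB]) -> RGB:
    """Mean colour of the carried edge, per channel."""
    n = len(pixels)
    return tuple(round(sum(p[c] for p in pixels) / n) for c in range(3))


def seed_row(edge: Sequence[RGB], new: int = 556) -> List[RGB]:
    """Carried edge followed by a fresh region pre-filled with its mean colour."""
    return list(edge) + [average_colour(edge)] * new
```

Pre-filling with the mean colour means that wherever the 4-step painter under-delivers, what shows through is a plausible tone from the previous scene rather than neutral grey.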
1. The 16,383-pixel wall. Long sessions grow a wide panorama. Mine kept crashing on save with a cryptic error. It turns out WebP has a hard 16,383 px limit per side and I was sailing past it. Switched the final stitch to JPEG and added a "skip the corrupt strip, keep going" read. After that my test panorama happily reached 19,116 px wide. One line of format, an hour of confusion.
2. "No depth, please." My first 3D pass moved the camera through the scene with little whip-glides and a zoom pulse. On paper: cinematic. In practice: motion-sick in ten seconds, and it fought the flatness that makes the live scroll calm. So the live view stays dead flat, and the real 3D moved to a separate, opt-in button: a depth-displaced mesh you drag around on your own browser GPU. Same depth map, totally different place. Killing a feature you're proud of is the job.
3. The cache that lied to me. Near the end I shipped a UI ("Ask your world") and a latency fix, tested live, and… nothing changed. I re-read the code five times convinced I was insane. The browser was serving a stale controller.js because my static routes sent no cache headers. One Cache-Control: no-cache and every deploy reached the screen again. The most human bug of the week.
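Two of the fixes above are small enough to sketch. The first part picks an output format from the panorama's size and reads strips leniently (item 1); the second adds the `Cache-Control: no-cache` header to every response (item 3). Everything here is an illustration under assumptions: the names are hypothetical, I simply switched to JPEG rather than choosing per image, and Python's standard-library `http.server` stands in for whatever actually serves the static files.

```python
from http.server import SimpleHTTPRequestHandler

WEBP_MAX_SIDE = 16383  # WebP cannot encode a side longer than this
JPEG_MAX_SIDE = 65535  # JPEG's own per-side ceiling


def pick_format(width: int, height: int) -> str:
    """Choose a format that can hold an image of this size."""
    longest = max(width, height)
    if longest <= WEBP_MAX_SIDE:
        return "WEBP"
    if longest <= JPEG_MAX_SIDE:
        return "JPEG"
    raise ValueError(f"{width}x{height} px is too large for WebP or JPEG")


def read_strips(paths, load):
    """Load every strip that can be read; report the ones that cannot."""
    strips, skipped = [], []
    for path in paths:
        try:
            strips.append(load(path))
        except OSError:  # unreadable or corrupt file: skip it, keep going
            skipped.append(path)
    return strips, skipped


class NoCacheHandler(SimpleHTTPRequestHandler):
    """Static file handler that makes browsers revalidate on every load."""

    def end_headers(self):
        self.send_header("Cache-Control", "no-cache")
        super().end_headers()
```

Note that `no-cache` does not forbid caching; it tells the browser to check with the server before reusing a stored copy, which is enough to stop a stale controller.js from surviving a deploy. The handler could be tried locally with `ThreadingHTTPServer(("127.0.0.1", 8000), NoCacheHandler).serve_forever()`, which serves the current directory.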
Perceived speed is not real speed. Parakeet is alignment-based CTC, so it structurally can't hallucinate filler, which is great for hands-off narration. But on one shared slot, by the time a phrase is transcribed and painted, the text felt late. The fix wasn't faster inference; it was showing your spoken words the instant they're transcribed, letting the (slower) paint catch up behind them. Speaking feels fluid the moment you see your words, even if the picture lags a beat. UX, not FLOPs.
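The "words first, picture later" idea can be sketched with `asyncio`. This is an illustration rather than the Space's code: `narrate`, `paint`, and `emit` are hypothetical names, with `emit` standing in for whatever pushes events to the browser.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Tuple

Event = Tuple[str, str]


async def narrate(
    phrases: Iterable[str],
    paint: Callable[[str], Awaitable[str]],
    emit: Callable[[Event], None],
) -> None:
    """Emit each transcript at once; paint behind it, in spoken order."""
    queue: asyncio.Queue = asyncio.Queue()

    async def painter() -> None:
        while True:
            phrase = await queue.get()
            if phrase is None:  # sentinel: no more phrases
                return
            emit(("strip", await paint(phrase)))

    task = asyncio.create_task(painter())
    for phrase in phrases:
        emit(("text", phrase))   # the words never wait for the paint
        await queue.put(phrase)  # painting is queued, not awaited here
    await queue.put(None)
    await task
```

The single painter task keeps strips in spoken order and keeps GPU work serial, while the text events never wait on it.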
Everything below runs locally, live, on one slot, verifiable at /health:
| Model | Job | Params | Live? |
|---|---|---|---|
| NVIDIA Parakeet-CTC | voice → text | 1.10B | ✓ |
| MiniCPM5-1B | Director: shot + art style per phrase | 1.00B | ✓ |
| FLUX.2-klein-4B | Painter: 4-step, CFG-free strip | 4.00B | ✓ |
| Depth-Anything-V2-Small | relief | 0.025B | ✓ |
| MiniCPM-V-4.6 | Art Director: names & films the finish | 1.30B | post |
About 6.13 B parameters sit on the live slot (the four live models; MiniCPM-V only runs at the end). Each spoken phrase takes ~1.3 s, after a one-time ~25 s warm-up that is covered by an ambient scroll so you never stare at a spinner. A rank-16 style LoRA, trained on Modal and fused at warm-up, gives the whole thing one hand-painted look for 0 B of extra runtime parameters: it folds into klein's existing weights.
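Why a fused LoRA is free at runtime shows up in a toy example. This sketch assumes the standard LoRA update, W' = W + scale · (B · A), and uses tiny hand-written matrices; the function names are illustrative and nothing here is taken from the Space's code.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_lora(w: Matrix, a: Matrix, b: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A), which has exactly the shape of W."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def count(m: Matrix) -> int:
    """Number of entries in a matrix."""
    return sum(len(row) for row in m)


w = [[1.0, 0.0], [0.0, 1.0]]  # base weight, 2 x 2
b = [[1.0], [2.0]]            # LoRA "up" factor, 2 x 1 (rank 1 here)
a = [[3.0, 4.0]]              # LoRA "down" factor, 1 x 2
fused = fuse_lora(w, a, b)
```

After fusing, `a` and `b` can be discarded: `fused` has as many entries as `w`, so the style rides along without adding to the parameter count.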

If you take one thing from this: the constraint was the design. One GPU slot is what forced the whole speak→outpaint→scroll loop into existence. I'd never have found it with infinite compute.
Speak a world into being. [link to the Space]