I built a world you can talk into existence

Community Article
Published June 15, 2026

Lightloom: Build Small / Thousand Token Wood. A few sleepless days, one GPU slot, way too much coffee.

![Speak a phrase, watch a strip of the world bloom](⟦screenshot: the live scroll mid-sentence⟧)

The whole pitch fits in one line: you speak, and a painted world unrolls in front of you, live. No stock art, no cloud API. Every pixel is made on the fly by a handful of tiny models sharing a single ZeroGPU slot.

This is the honest version of how it went, including the parts that broke on me.


The shape of it

Voice in, mural out. Four small models take turns on one slot:

flowchart LR
  V[🎙 voice] -->|Parakeet-CTC| T[transcript]
  T -->|split per phrase| D[Director · MiniCPM5-1B]
  D -->|shot + art style| P[Painter · FLUX.2-klein-4B · 4 steps]
  P --> M[one continuous mural]
  M -->|Depth-Anything| Z[relief]
  M -.->|at the end| A[Art Director · MiniCPM-V · names it, films it]

The clever part isn't any single model. It's that they're all small enough to ride one slot and still feel instant. No single model is the star; the orchestra is.


The trick that makes it one painting (not a slideshow)

I don't generate separate images and stitch them. Each phrase outpaints the right edge of the mural, with the previous edge fed back in as the only "truth" the model is allowed to keep:

   previous strip            new canvas  (768 px tall, the live size)
   ┌────────────┐     ┌─────────────┬──────────────────────┐
   │  ...world  │ →   │ overlap 212 │  new 556 (painted)   │
   │            │     │  mask = 0   │     mask = 255       │
   └────────────┘     │  (KEEP)     │     (GENERATE)       │
                      └─────────────┴──────────────────────┘
                        ▲ the model continues THIS to the right

Simple idea, fussy in practice. The seam between scenes kept reading as a cut. Two fixes finally killed it: seed the fresh region with the carried edge's average colour (so a weak 4-step strip doesn't leak a grey band, the dreaded grey smear), and ramp the mask 0→255 over ~40 px at the join so the new scene emerges from the old one instead of slamming into it.
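Here is a rough sketch of those two fixes in plain Python. It is not the Space's actual code: the function names are made up, the ramp is assumed to sit in the last 40 px of the overlap, and the mask is shown as a single per-column profile because it is the same on every row.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]

CANVAS = 768    # canvas width in px (212 kept + 556 painted, as in the diagram)
OVERLAP = 212   # carried-over edge of the previous strip
RAMP = 40       # approximate width of the soft join


def mask_profile(canvas: int = CANVAS, overlap: int = OVERLAP, ramp: int = RAMP) -> List[int]:
    """Per-column mask value: 0 = keep, 255 = generate, linear ramp before the seam."""
    start = overlap - ramp  # assumption: the ramp ends exactly at the seam
    profile = []
    for x in range(canvas):
        t = (x - start) / ramp        # 0.0 at the ramp start, 1.0 at the seam
        t = min(1.0, max(0.0, t))     # clamp outside the ramp
        profile.append(round(255 * t))
    return profile


def mean_colour(pixels: Sequence[Pixel]) -> Pixel:
    """Average RGB of the carried edge, used to seed the fresh region."""
    n = len(pixels)
    return tuple(round(sum(p[c] for p in pixels) / n) for c in range(3))


def seed_row(edge_row: Sequence[Pixel], fill: Pixel, new_width: int = CANVAS - OVERLAP) -> List[Pixel]:
    """One canvas row: the kept edge followed by a flat fill of the mean colour."""
    return list(edge_row) + [fill] * new_width
```

The point of the flat fill is that the painter starts from something already close to the scene's palette rather than from neutral grey, so an under-cooked 4-step result fails towards the right colour.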


Three things that fought back

1. The 16,383-pixel wall. Long sessions grow a wide panorama. Mine kept crashing on save with a cryptic error. It turns out WebP has a hard limit of 16,383 px per side and I was sailing past it. I switched the final stitch to JPEG and added a "skip the corrupt strip, keep going" read. After that my test panorama happily reached 19,116 px wide. One line of format, an hour of confusion.
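A guard like the following would have caught this before the save. It is a sketch only: `pick_format` and `load_strips` are hypothetical names, the loader is passed in so the example stays library-free, and the JPEG cap of 65,500 px is libjpeg's usual default rather than something from this project.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")

WEBP_MAX_SIDE = 16383   # WebP's maximum width or height in pixels
JPEG_MAX_SIDE = 65500   # libjpeg's default cap (assumed; the format itself allows 65535)


def pick_format(width: int, height: int) -> str:
    """Choose an output format that can actually hold an image of this size."""
    longest = max(width, height)
    if longest <= WEBP_MAX_SIDE:
        return "WEBP"
    if longest <= JPEG_MAX_SIDE:
        return "JPEG"
    raise ValueError(f"{width}x{height} is too large for WebP or JPEG")


def load_strips(paths: Iterable[str], loader: Callable[[str], T]) -> List[T]:
    """Load every strip that can be read; skip corrupt ones instead of aborting."""
    good = []
    for path in paths:
        try:
            good.append(loader(path))
        except (OSError, ValueError):
            continue  # a damaged strip should not take the whole panorama down
    return good
```

With these limits, `pick_format(19116, 768)` returns `"JPEG"`, which matches the width the test panorama reached.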

2. "No depth, please." My first 3D pass moved the camera through the scene with little whip-glides and a zoom pulse. On paper: cinematic. In practice: motion-sick in ten seconds, and it fought the flatness that makes the live scroll calm. So the live view stays dead flat, and the real 3D moved to a separate, opt-in button: a depth-displaced mesh you drag around on your own browser's GPU. Same depth map, totally different place. Killing a feature you're proud of is the job.
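For anyone unfamiliar with the term, a depth-displaced mesh is just a flat grid whose vertices are pushed out by the depth map. The real thing runs in the browser; the Python below only illustrates the geometry, and `displaced_mesh` and its `scale` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def displaced_mesh(depth: Sequence[Sequence[float]], scale: float = 0.2) -> Tuple[List[Vertex], List[Triangle]]:
    """Turn a depth grid (at least 2x2) into vertices and triangles.

    x and y are normalised to 0..1; z is the depth value times `scale`.
    """
    rows, cols = len(depth), len(depth[0])
    vertices = [
        (x / (cols - 1), y / (rows - 1), depth[y][x] * scale)
        for y in range(rows)
        for x in range(cols)
    ]
    triangles = []
    for y in range(rows - 1):
        for x in range(cols - 1):
            i = y * cols + x                                 # top-left corner of this cell
            triangles.append((i, i + 1, i + cols))             # upper-left triangle
            triangles.append((i + 1, i + cols + 1, i + cols))  # lower-right triangle
    return vertices, triangles
```

The mural itself becomes the texture on that grid, so the same depth map that gives the flat view its relief also gives the 3D view its shape.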

3. The cache that lied to me. Near the end I shipped a UI ("Ask your world") and a latency fix, tested live, and… nothing changed. I re-read the code five times convinced I was insane. The browser was serving a stale controller.js because my static routes sent no cache headers. One Cache-Control: no-cache and every deploy reached the screen again. The most human bug of the week.
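The fix is framework-specific, so here is a framework-neutral sketch as a tiny WSGI middleware. The Space's real server stack isn't stated here, and both `no_cache_static` and the `/static/` prefix are assumptions for illustration.

```python
def no_cache_static(app, prefix="/static/"):
    """Wrap a WSGI app so responses under `prefix` carry Cache-Control: no-cache."""

    def wrapped(environ, start_response):
        if not environ.get("PATH_INFO", "").startswith(prefix):
            return app(environ, start_response)  # other routes pass through untouched

        def patched_start(status, headers, exc_info=None):
            # Drop any existing Cache-Control header, then add ours.
            headers = [(k, v) for k, v in headers if k.lower() != "cache-control"]
            headers.append(("Cache-Control", "no-cache"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return wrapped
```

Note that `no-cache` does not forbid storing the file; it makes the browser revalidate with the server before reusing it, which is exactly what a freshly deployed controller.js needs.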


The thing I didn't expect to learn

Perceived speed is not real speed. Parakeet is alignment-based CTC: its output is tied to the audio frames, so it structurally can't hallucinate filler the way a free-running decoder can, which is great for hands-off narration. But on one shared slot, by the time a phrase had been transcribed and painted, the text felt late. The fix wasn't faster inference; it was showing your spoken words the instant they're transcribed and letting the (slower) paint catch up behind them. Speaking feels fluid the moment you see your words, even if the picture lags a beat. UX, not FLOPs.
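The pattern is "acknowledge first, do the slow work behind a queue". A minimal asyncio sketch, with made-up names and a short sleep standing in for the paint step, assuming a single painter because there is only one slot:

```python
import asyncio
from typing import List, Optional, Tuple

Event = Tuple[str, str]


async def painter(queue: "asyncio.Queue[Optional[str]]", events: List[Event]) -> None:
    """Single worker: one slot means strips are painted one at a time, in order."""
    while True:
        phrase = await queue.get()
        if phrase is None:              # sentinel: narration is over
            break
        await asyncio.sleep(0.01)       # stand-in for the slow paint step
        events.append(("paint", phrase))


async def narrate(phrases: List[str]) -> List[Event]:
    events: List[Event] = []
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    worker = asyncio.create_task(painter(queue, events))
    for phrase in phrases:
        events.append(("text", phrase))  # show the words immediately
        queue.put_nowait(phrase)         # the paint catches up later
    queue.put_nowait(None)
    await worker
    return events
```

Running `asyncio.run(narrate(["a hill", "a river"]))` records both `text` events before either `paint` event, which is the ordering the viewer experiences.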


The ledger (the part judges love)

Everything below runs locally, live, on one slot, verifiable at /health:

| Model | Job | Params | Live? |
|---|---|---|---|
| NVIDIA Parakeet-CTC | voice → text | 1.10B | ✓ |
| MiniCPM5-1B | Director: shot + art style per phrase | 1.00B | ✓ |
| FLUX.2-klein-4B | Painter: 4-step, CFG-free strip | 4.00B | ✓ |
| Depth-Anything-V2-Small | relief | 0.025B | ✓ |
| MiniCPM-V-4.6 | Art Director: names & films the finish | 1.30B | post |

6.13 B parameters on the live slot (the four live models; the Art Director runs afterwards). ~1.3 s per spoken phrase, ~25 s one-time warm-up (covered by an ambient scroll so you never stare at a spinner). A rank-16 style LoRA, trained on Modal and fused at warm-up, gives the whole thing one hand-painted look for 0 B of extra runtime parameters, because it folds into klein's existing weights.
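Why a fused LoRA costs nothing at runtime falls out of the usual merge formula. This assumes the standard LoRA formulation with an α/r scaling, which may not be exactly what the training run used; d and k simply stand for the shape of whichever klein weight matrix is adapted.

$$
W' = W + \frac{\alpha}{r}\,BA, \qquad W \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r = 16
$$

W' has the same d × k shape as W, so after the merge the parameter count is unchanged. Left unmerged, the adapter would add r(d + k) parameters for every adapted matrix.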

![Step into the finished world in 3D](⟦screenshot: Explore in 3D⟧)


What I'd do next

  • A second voice track painting over the first, so two people build one world together.
  • Ambient audio that reads the scene's palette (the model's already in the repo, just not wired).
  • Smarter long-session memory. Past ~20 strips the style can drift, and I heal it by re-injecting the first clean strip as a structure anchor. It works, but it's a patch, not a plan.

If you take one thing from this: the constraint was the design. One GPU slot is what forced the whole speak→outpaint→scroll loop into existence. I'd never have found it with infinite compute.

🌅 Speak a world into being. ⟦link to the Space⟧
