Sozai — Build Memories Together

Community Article
Published June 15, 2026

A real-time, two-person collaborative photo album that turns your snapshots into watercolour paintings, captions them by actually looking at them, and pins them to a shared map — all running on ZeroGPU.

👉 Try it live: huggingface.co/spaces/build-small-hackathon/storybook_test


TL;DR

Sozai is a shared scrapbook for two. Make a room, send a friend the code, and build a photo album together in real time — live cursors, shared edits, the works. A MiniCPM-V vision model captions your photos by actually looking at them, a FLUX.2-klein pipeline "develops" them into watercolour paintings, NeMo speech-to-text lets you dictate captions, and you can pin everything to a watercolour 3D globe. It runs on Hugging Face ZeroGPU, with Phoenix tracing baked right into the app.


What is Sozai?

Sozai (素材 — Japanese for "material" or "raw ingredients") is a small collaborative web app for building a shared scrapbook with one other person. You make a room, send a friend the code, and the two of you drop photos, write captions, pin memories to a map, and "develop" your pictures into soft watercolour paintings — together, live, with cursors and edits syncing in real time.

It's part photo album, part darkroom, part shared canvas. The name fits: your photos are the raw material; the app helps you turn them into something you keep.

How it works

Open the splash screen and you get three doors:

  • Make a room — pick a room-code length (4–12 characters), grab your code, and you're in.
  • Join by code — type a friend's code to drop into their room.
  • Continue without a room — run the whole thing solo, no collaboration needed.

Rooms are hybrid: each has a secure internal ID plus a short, human-friendly share code drawn from an unambiguous alphabet (no 0/O or 1/I/L to misread). Rooms cap at two people, persist your scrapbook so a returning visitor sees it as they left it, and clean themselves up automatically after they go idle — nothing lingers forever.
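
To make the hybrid idea concrete, here is a minimal sketch of how a room could pair a long random internal ID with a short share code. The exact alphabet, the function names (`make_share_code`, `make_room`), and the use of Python's `secrets` module are assumptions for illustration, not Sozai's actual code; only the 4 to 12 length range and the excluded look-alike characters come from the description above.

```python
import secrets
import string

# Assumed alphabet: uppercase letters and digits, minus the look-alikes
# named in the article (0/O and 1/I/L).
AMBIGUOUS = set("0O1IL")
ALPHABET = "".join(
    c for c in string.ascii_uppercase + string.digits if c not in AMBIGUOUS
)


def make_share_code(length: int = 6) -> str:
    """Return a short, human-friendly code of 4 to 12 characters."""
    if not 4 <= length <= 12:
        raise ValueError("room-code length must be between 4 and 12")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def make_room(code_length: int = 6) -> dict:
    """Pair an unguessable internal ID with a short code meant for sharing."""
    return {
        "room_id": secrets.token_urlsafe(16),  # internal, never typed by hand
        "share_code": make_share_code(code_length),
    }
```

Keeping the two identifiers separate means the short code can stay easy to read aloud while the internal ID carries the entropy.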

The AI features

Sozai leans on a stack of open-source models, each wired to degrade gracefully if it can't load, so the app never hard-fails on a missing dependency:

Auto-captioning with a vision model. A MiniCPM-V vision-language model genuinely looks at each photo to suggest a title, a one-line caption, or a set of tags. The pixels go to the model, not just a filename — so the suggestions are grounded in what's actually in the picture. Inside a Space it runs as a quantized GGUF build via llama.cpp; locally it runs through transformers.

"Proceed to Development" — watercolour img2img. This is the darkroom magic. Each photo is sent through a FLUX.2-klein img2img pipeline and repainted as a soft watercolour, with an optional trained scene LoRA steering the style. It's a staged rollout: with no LoRA you get base-model watercolour, and dropping in the trained adapter switches on the full painterly look automatically.

Voice → text captions. An NVIDIA NeMo streaming ASR model (Nemotron) lets you dictate captions instead of typing, with selectable latency/accuracy profiles and roughly 40 supported languages plus auto-detect.

A quiet safety gate. Every upload passes through a Falconsai NSFW classifier. It fails open: if the classifier cannot load, uploads proceed unchecked rather than being blocked, so the gate adds safety when it is available without getting in the way.
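
The "degrade gracefully" and "fail open" behaviour can be sketched in a few lines. This is an illustration under assumptions: `load_optional`, `allow_upload`, and the 0.5 threshold are hypothetical names and values, and the classifier is modelled as any callable that returns an NSFW score between 0 and 1.

```python
from typing import Callable, Optional

Classifier = Callable[[bytes], float]


def load_optional(loader: Callable[[], Classifier]) -> Optional[Classifier]:
    """Try to build a model; return None instead of raising if it can't load."""
    try:
        return loader()
    except Exception:
        return None


def allow_upload(
    image_bytes: bytes,
    classifier: Optional[Classifier],
    threshold: float = 0.5,  # assumed cut-off, not taken from the app
) -> bool:
    """Fail open: a missing classifier never blocks an upload."""
    if classifier is None:
        return True
    return classifier(image_bytes) < threshold
```

The same `load_optional` shape works for the captioner, the ASR model, and the img2img pipeline: the feature is simply switched off when its loader returns `None`.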

More than the models

A few other touches round out the experience:

A watercolour globe. Memories don't just sit in a list — you can pin photos to a 3D globe (powered by Cesium) with an optional Stamen "Watercolor" basemap, so the map itself matches the painted aesthetic of your developed photos. (The globe activates when a Cesium ion token is configured.)

Host-approved rooms. Joining isn't a free-for-all: when someone enters your code, the room host gets a request and approves or declines them before they're let in. Combined with the two-person cap, a room stays just between the people you actually invited.
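
One way to picture the join flow is as a tiny state machine: a guest's request sits in a pending set until the host decides. The class below is a hypothetical sketch (the `Room` class, its method names, and the returned status strings are all assumptions), showing only the two rules stated above: host approval and a two-person cap.

```python
class Room:
    """Hypothetical in-memory model of a host-approved, two-person room."""

    MAX_PEOPLE = 2

    def __init__(self, host_id: str):
        self.host_id = host_id
        self.members = {host_id}
        self.pending = set()

    def request_join(self, guest_id: str) -> str:
        """Called when someone enters the share code."""
        if guest_id in self.members:
            return "joined"
        if len(self.members) >= self.MAX_PEOPLE:
            return "full"
        self.pending.add(guest_id)
        return "pending"

    def decide(self, host_id: str, guest_id: str, approve: bool) -> str:
        """Only the host may approve or decline a pending request."""
        if host_id != self.host_id:
            raise PermissionError("only the host can decide")
        if guest_id not in self.pending:
            return "unknown"
        self.pending.discard(guest_id)
        if not approve:
            return "declined"
        if len(self.members) >= self.MAX_PEOPLE:
            return "full"
        self.members.add(guest_id)
        return "joined"
```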

Phone photos just work. HEIC/HEIF images straight off an iPhone are supported, so you don't have to convert anything before dropping them in.

Built for ZeroGPU

The whole backend is written with Hugging Face ZeroGPU in mind, which is finicky in a specific way: a GPU is only attached while a @spaces.GPU function is running, and CUDA must not be initialized in the main process. Sozai handles this carefully throughout — models load on CPU at startup and only move to the GPU inside the decorated worker, large pipelines build lazily on first use, and only picklable data (the image in, the image out) ever crosses the GPU boundary. The same code runs unchanged on a local machine, where the spaces decorator becomes a transparent no-op.
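
A rough sketch of that pattern, with stand-ins instead of real models so it runs anywhere. `spaces.GPU` is the real decorator from the `spaces` package; everything else here (the `gpu` alias, the local fallback, `get_pipeline`, `develop`, and the byte-reversing placeholder) is assumed for illustration and is not Sozai's code.

```python
import functools

try:
    import spaces  # present on Hugging Face Spaces

    gpu = spaces.GPU
except ImportError:
    # Local stand-in so the same code runs without the package.
    def gpu(fn=None, **_kwargs):
        """Accept both bare @gpu and parameterized @gpu(...) usage."""
        if fn is None:
            return lambda f: f
        return fn


@functools.lru_cache(maxsize=1)
def get_pipeline():
    """Hypothetical lazy builder: runs on first use, then is cached.

    A real version would load weights on CPU and never touch CUDA here.
    """
    return {"device": "cpu"}


@gpu
def develop(image_bytes: bytes) -> bytes:
    """Only picklable data (bytes in, bytes out) crosses this boundary."""
    pipe = get_pipeline()
    # A real pipeline would be moved to the GPU at this point, inside the
    # decorated call, because that is the only time a GPU is attached.
    assert pipe["device"] == "cpu"
    return image_bytes[::-1]  # placeholder for the painted image
```

Passing plain bytes rather than model objects or tensors keeps the function's arguments and return value safe to serialize.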

The heavy FLUX weights (~14 GB) are preloaded into the cache at build time via the preload_from_hub setting in the Space's README metadata, so the GPU worker loads them read-only and never downloads mid-request.

Under the hood

Gradio is used purely as the backend engine (FastAPI under the hood); the entire UI is hand-built. The real-time layer is worth a note: Spaces' router silently drops custom WebSocket upgrades, so Sozai mirrors what Gradio itself does and syncs over Server-Sent Events for server→client messages plus plain POST requests for client→server messages. That channel carries presence, live cursors, chat, shared selections, map pins, and a server-authoritative lock so only one person runs an AI action at a time.
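
The transport can be sketched without any web framework: each client holds a queue that its SSE stream drains, and every POST from one client is fanned out to the other. The names below (`format_sse`, `RoomHub`, and its methods) are hypothetical, and the HTTP wiring is deliberately left out; this is only a sketch of the message flow and the single AI-action lock.

```python
import json
import queue
import threading
from typing import Dict, Optional


def format_sse(event: str, data: dict) -> str:
    """Serialize one message as a Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


class RoomHub:
    """Per-room fan-out plus a server-side lock for AI actions."""

    def __init__(self):
        self._guard = threading.Lock()
        self._subscribers: Dict[str, queue.Queue] = {}
        self._ai_holder: Optional[str] = None

    def subscribe(self, client_id: str) -> queue.Queue:
        """Called when a client opens its SSE stream."""
        q = queue.Queue()
        with self._guard:
            self._subscribers[client_id] = q
        return q

    def publish(self, sender_id: str, event: str, data: dict) -> None:
        """Called from a POST handler; delivers to everyone but the sender."""
        frame = format_sse(event, data)
        with self._guard:
            targets = [q for cid, q in self._subscribers.items() if cid != sender_id]
        for q in targets:
            q.put(frame)

    def try_acquire_ai_lock(self, client_id: str) -> bool:
        """The server decides who may run an AI action right now."""
        with self._guard:
            if self._ai_holder in (None, client_id):
                self._ai_holder = client_id
                return True
            return False

    def release_ai_lock(self, client_id: str) -> None:
        with self._guard:
            if self._ai_holder == client_id:
                self._ai_holder = None
```

In a real server, the SSE endpoint would loop over `q.get()` and write each frame to the open response, while the POST endpoint would call `publish`.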

State lives in a local SQLite database (rooms, participants, room_state) with optimistic concurrency on saves, WAL mode, and incremental vacuuming so the file stays compact as rooms come and go. A background thread expires rooms after an hour of inactivity and enforces a hard 24-hour cap on every room's lifetime.
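
Here is a small sketch of what those storage choices can look like with Python's built-in `sqlite3` module. The column layout, the `version` counter used for optimistic concurrency, and all function names are assumptions; only the `room_state` name, WAL mode, incremental vacuuming, and the one-hour and 24-hour limits come from the description above.

```python
import sqlite3

IDLE_LIMIT_S = 60 * 60        # expire after an hour of inactivity
HARD_LIMIT_S = 24 * 60 * 60   # hard cap on a room's lifetime


def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # auto_vacuum has to be chosen before the first table is created.
    conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
    conn.execute("PRAGMA journal_mode = WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS room_state ("
        "room_id TEXT PRIMARY KEY, version INTEGER NOT NULL, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_state(conn, room_id: str, expected_version: int, payload: str) -> bool:
    """Write only if nobody else has saved since expected_version was read."""
    cur = conn.execute(
        "UPDATE room_state SET payload = ?, version = version + 1 "
        "WHERE room_id = ? AND version = ?",
        (payload, room_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False means the caller lost the race


def is_expired(created_at: float, last_active: float, now: float) -> bool:
    """Idle for an hour, or older than 24 hours, whichever comes first."""
    return now - last_active >= IDLE_LIMIT_S or now - created_at >= HARD_LIMIT_S


def reclaim_space(conn) -> None:
    """Hand freed pages back to the filesystem after rooms are deleted."""
    conn.execute("PRAGMA incremental_vacuum").fetchall()
    conn.commit()
```

When `save_state` returns `False`, the client that lost the race can reload the latest state and retry, so neither person silently overwrites the other's edit.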

For observability, Arize Phoenix runs in-process and is embedded directly in the app via a same-origin reverse proxy — every model call (caption, NSFW, ASR, img2img) is emitted as an OpenTelemetry span with OpenInference attributes, so you can watch latency, token counts, and errors live without running a separate collector.

There's also an easter egg: a oneko-style pet sprite library (76 animals) you can summon to wander the page.

The stack at a glance

  • Vision captions: openbmb/MiniCPM-V-4 (transformers / GGUF via llama.cpp)
  • Watercolour develop: black-forest-labs/FLUX.2-klein-4B + optional scene LoRA
  • Speech-to-text: nvidia/nemotron-3.5-asr-streaming-0.6b (NeMo)
  • Safety gate: Falconsai/nsfw_image_detection
  • Backend: Gradio Server (FastAPI), SQLite, SSE realtime
  • Observability: Arize Phoenix + OpenTelemetry, embedded in-app
  • Runtime: Hugging Face ZeroGPU

Try it

Make a room, send the code to someone you'd want to build a memory with, and start dropping in photos. Watch your snapshots get captioned, painted into watercolours, and pinned to the map — together, in real time.

Sozai — build memories together.


👉 Try it live: huggingface.co/spaces/build-small-hackathon/storybook_test
