RPG-HaloTales-V1
A self-contained, turn-based, illustrated and voiced RPG storyteller, packaged as a Lemonade Omni Model. On each turn it returns three things: a freshly generated 512×256 widescreen scene image, narration audio, and the same narration as second-person prose. Any OpenAI-compatible omni client can run it: pull the model, select it, and send messages to play.
Components
| Modality | Model |
|---|---|
| Planner LLM | Qwen3.6-27B (dense) |
| Image generation + editing | Flux-2-Klein-9B-GGUF |
| Speech-to-text | Whisper-Large-v3-Turbo |
| Text-to-speech | kokoro-v1 |
The storyteller behavior lives entirely in the collection's `system_prompt` and the planner's pinned server arguments; Lemonade's OmniRouter executes the `generate_image` / `edit_image` / `text_to_speech` tools server-side and embeds the results in each reply. World and character state are carried in the conversation history.
Usage
lemonade pull lemonade-sdk/RPG-HaloTales-V1
Select RPG-HaloTales-V1 in an omni client and send a first message that establishes the setting, the player's role, and any key characters. Each subsequent message is the player's action; the model advances the scene. Each reply also appends a compact `---CONTINUITY---` record (the model's fixed character/scene canon plus a turn-protocol footer) below the narration. That record is metadata used to keep appearances and behavior consistent, not part of the story text.
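A client that wants to show only the story text can split each reply at the continuity marker. The following is a minimal sketch, assuming the marker appears literally as `---CONTINUITY---` in the reply text; `split_reply` is a hypothetical helper name, not part of Lemonade or any client library.

```python
CONTINUITY_MARKER = "---CONTINUITY---"


def split_reply(reply_text: str) -> tuple[str, str]:
    """Return (narration, continuity_record).

    The record is an empty string when the marker is absent.
    Assumption: the marker occurs at most once, below the narration.
    """
    narration, marker, record = reply_text.partition(CONTINUITY_MARKER)
    return narration.strip(), (record.strip() if marker else "")
```

If the client trims history, keeping the record in the stored message matters, since the model relies on it for consistency across turns; only the displayed text needs the split.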
Sampling: temperature 0.3 is baked into the planner and is load-bearing. In testing, raising it to 0.45 reintroduced turns that skip image/audio generation. Let the baked-in defaults stand; avoid overriding sampling from the client.
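One way a client wrapper might honor this is to drop sampling fields before a request is sent. This is a sketch under assumptions: the payload follows the OpenAI chat-completions shape, the set of keys in `SAMPLING_KEYS` is a guess at what a client might send, and `build_turn_request` is a hypothetical helper.

```python
# Assumption: these are the sampling fields a client might try to set.
SAMPLING_KEYS = {
    "temperature", "top_p", "top_k", "min_p",
    "presence_penalty", "frequency_penalty",
}


def build_turn_request(messages, **client_options):
    """Build a chat request body, discarding any client-side sampling overrides."""
    payload = {"model": "RPG-HaloTales-V1", "messages": list(messages)}
    # Pass through non-sampling options (e.g. stream) untouched.
    payload.update(
        {k: v for k, v in client_options.items() if k not in SAMPLING_KEYS}
    )
    return payload
```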
Design lessons
These are the changes that turned an inconsistent storyteller into a coherent, fast one. They were established with an automated harness that plays multi-turn adventures across genres and measures per-turn format adherence, latency, image canvas, and prose-vs-audio match (by transcribing the narration back through Whisper):
- Turn structure as a strict two-response protocol: the system prompt specifies the turn as exactly two responses (Response 1: one image call and one `text_to_speech` call together, nothing else; Response 2: the narration verbatim plus the continuity record) and includes a full worked example, including literal tool-call JSON. Emitting both tool calls in a single response saves an entire planner round-trip per turn.
- Reasoning off, not budgeted: llama.cpp's `--reasoning-budget` is not enforced when `tools` are present in the request (the chat template pre-fills the think tag, so the budget sampler never arms), which made "briefly thinking" planners stall for minutes. `--reasoning off` eliminates the tail entirely, and Qwen tool calling is more reliable without thinking.
- Guardrails at the token level, not just the prompt: in multi-turn history, past media is replaced by placeholder text (`[generated image]`), and after a couple of turns the planner starts imitating those records instead of calling the tools; no prompt wording reliably prevented it. The planner args therefore hard-ban the `[`-family tokens (making the placeholder text unemittable) and add a small positive bias on the `<tool_call>` opener token, which anchors every turn's first response. The continuity record also opens with a `TOOLS:` line and ends with a `NEXT:` protocol footer, so the history itself keeps testifying that every turn runs the tools.
- Character and place continuity across turns: a maintained record pins every named character and location to a fixed, highly specific physical descriptor (age, build, skin, hair, eyes, one distinguishing mark, dress, signature prop) that is reused verbatim in every image prompt. Concrete values ("deep-set grey eyes") hold; vague adjectives ("hard eyes") drift.
- Generate fresh, don't over-edit: few-step image editing cannot track scene changes, so edited frames lag behind the prose. Fresh generations with a pinned STYLE line and verbatim descriptors keep characters and palette stable and depict each turn's actual action; `edit_image` is reserved for single-detail changes within an unchanged shot.
- Image matches the words: image prompts are built action-first (style, then this turn's single focal action and props, then character descriptors, then setting), with the canvas pinned to 512×256 on every call, so the picture depicts the narration's moment rather than a static portrait.
- Voice matches text: the narration is composed once, passed to `text_to_speech`, and then repeated word-for-word as the printed text, so the audio and the on-screen words are identical.
- Active, dramatic pacing: the world pushes back. No attempt lands exactly as intended, NPCs act on their own agendas, earlier hooks get paid off on screen, and each turn ends mid-pressure with the scene itself showing a few distinct risky ways forward. There is never a menu and never a skipped image, even in quiet aftermath or epilogue beats.
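Two of the lessons above are mechanical enough to sketch: the action-first prompt order with a pinned canvas, and a prose-vs-audio check of the kind the harness performs. This is an illustration under assumptions, not the model's or the harness's actual code. The names `ImageRequest`, `build_image_request`, and `narration_match` are hypothetical, the `width`/`height` fields are a guess at how a canvas might be passed, and the word-level `difflib` ratio stands in for whatever metric the harness really uses.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher

CANVAS = (512, 256)  # width x height, pinned on every image call


@dataclass(frozen=True)
class ImageRequest:
    prompt: str
    width: int
    height: int


def build_image_request(style, action, descriptors, setting):
    """Assemble the prompt action-first: style, focal action, descriptors, setting.

    `descriptors` are the fixed per-character strings from the continuity
    record; they are inserted verbatim so appearances do not drift.
    """
    parts = [style, action, *descriptors, setting]
    prompt = ". ".join(p.strip().rstrip(".") for p in parts if p.strip())
    return ImageRequest(prompt=prompt, width=CANVAS[0], height=CANVAS[1])


def _words(text):
    # Lowercase and drop punctuation so only the spoken words are compared.
    return re.findall(r"[a-z0-9']+", text.lower())


def narration_match(prose, transcript):
    """Word-level similarity in [0, 1] between printed prose and an STT transcript."""
    matcher = SequenceMatcher(None, _words(prose), _words(transcript), autojunk=False)
    return matcher.ratio()
```

In use, the descriptor strings would come from the continuity record and the transcript from running the narration audio back through Whisper; a score noticeably below 1.0 would flag a turn where the printed text and the audio diverged.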