RPG-HaloTales-V1

A self-contained, turn-based illustrated and voiced RPG storyteller, packaged as a Lemonade Omni Model. On each turn it returns three things: a freshly generated 512×256 widescreen scene image, narration audio, and the same narration as second-person prose. Any OpenAI-compatible omni client can run it — pull the model, select it, and send messages to play.

Components

Modality	Model
Planner LLM	Qwen3.6-27B (dense)
Image generation + editing	Flux-2-Klein-9B-GGUF
Speech-to-text	Whisper-Large-v3-Turbo
Text-to-speech	kokoro-v1

The storyteller behavior lives entirely in the collection's system_prompt and the planner's pinned server arguments; Lemonade's OmniRouter executes the generate_image / edit_image / text_to_speech tools server-side and embeds the results in each reply. World and character state are carried in the conversation history.

Usage

lemonade pull lemonade-sdk/RPG-HaloTales-V1

Select RPG-HaloTales-V1 in an omni client and send a first message that establishes the setting, the player's role, and any key characters. Each subsequent message is the player's action; the model advances the scene. Each reply also appends a compact ---CONTINUITY--- record (the model's fixed character/scene canon plus a turn-protocol footer) below the narration — it is metadata used to keep appearances and behavior consistent, not part of the story text.

Sampling: temperature 0.3 is baked into the planner and is load-bearing — in testing, raising it to 0.45 reintroduced turns that skip image/audio generation. Let the baked-in defaults stand; avoid overriding sampling from the client.

Design lessons

These are the changes that turned an inconsistent storyteller into a coherent, fast one. They were established with an automated harness that plays multi-turn adventures across genres and measures per-turn format adherence, latency, image canvas, and prose-vs-audio match (by transcribing the narration back through Whisper):

Turn structure as a strict two-response protocol — the system prompt specifies the turn as exactly two responses (Response 1: one image call and one text_to_speech call together, nothing else; Response 2: the narration verbatim plus the continuity record) and includes a full worked example, including literal tool-call JSON. Emitting both tool calls in a single response saves an entire planner round-trip per turn.
Reasoning off, not budgeted — llama.cpp's --reasoning-budget is not enforced when tools are present in the request (the chat template pre-fills the think tag, so the budget sampler never arms), which made "briefly thinking" planners stall for minutes. --reasoning off eliminates the tail entirely, and Qwen tool calling is more reliable without thinking.
Guardrails at the token level, not just the prompt — in multi-turn history, past media is replaced by placeholder text ([generated image]), and after a couple of turns the planner starts imitating those records instead of calling the tools — no prompt wording reliably prevented it. The planner args therefore hard-ban the [-family tokens (making the placeholder text unemittable) and add a small positive bias on the <tool_call> opener token, which anchors every turn's first response. The continuity record also opens with a TOOLS: line and ends with a NEXT: protocol footer, so the history itself keeps testifying that every turn runs the tools.
Character and place continuity across turns — a maintained record pins every named character and location to a fixed, highly specific physical descriptor (age, build, skin, hair, eyes, one distinguishing mark, dress, signature prop) that is reused verbatim in every image prompt. Concrete values ("deep-set grey eyes") hold; vague adjectives ("hard eyes") drift.
Generate fresh, don't over-edit — few-step image editing cannot track scene changes, so edited frames lag behind the prose. Fresh generations with a pinned STYLE line and verbatim descriptors keep characters and palette stable and depict each turn's actual action; edit_image is reserved for single-detail changes within an unchanged shot.
Image matches the words — image prompts are built action-first (style, then this turn's single focal action and props, then character descriptors, then setting), with the canvas pinned to 512×256 on every call, so the picture depicts the narration's moment rather than a static portrait.
Voice matches text — the narration is composed once, passed to text_to_speech, and then repeated word-for-word as the printed text, so the audio and the on-screen words are identical.
Active, dramatic pacing — the world pushes back: no attempt lands exactly as intended, NPCs act on their own agendas, earlier hooks get paid off on screen, and each turn ends mid-pressure with the scene itself showing a few distinct risky ways forward — never a menu, and never a skipped image, even in quiet aftermath or epilogue beats.

Downloads last month: -; Downloads are not tracked for this model. How to track