RPG-HaloTales-V1

A self-contained, turn-based illustrated and voiced RPG storyteller, packaged as a Lemonade Omni Model. On each turn it returns three things: a freshly generated 512Γ—256 widescreen scene image, narration audio, and the same narration as second-person prose. Any OpenAI-compatible omni client can run it β€” pull the model, select it, and send messages to play.

Components

Modality Model
Planner LLM Qwen3.6-27B (dense)
Image generation + editing Flux-2-Klein-9B-GGUF
Speech-to-text Whisper-Large-v3-Turbo
Text-to-speech kokoro-v1

The storyteller behavior lives entirely in the collection's system_prompt and the planner's pinned server arguments; Lemonade's OmniRouter executes the generate_image / edit_image / text_to_speech tools server-side and embeds the results in each reply. World and character state are carried in the conversation history.

Usage

lemonade pull lemonade-sdk/RPG-HaloTales-V1

Select RPG-HaloTales-V1 in an omni client and send a first message that establishes the setting, the player's role, and any key characters. Each subsequent message is the player's action; the model advances the scene. Each reply also appends a compact ---CONTINUITY--- record (the model's fixed character/scene canon plus a turn-protocol footer) below the narration β€” it is metadata used to keep appearances and behavior consistent, not part of the story text.

Sampling: temperature 0.3 is baked into the planner and is load-bearing β€” in testing, raising it to 0.45 reintroduced turns that skip image/audio generation. Let the baked-in defaults stand; avoid overriding sampling from the client.

Design lessons

These are the changes that turned an inconsistent storyteller into a coherent, fast one. They were established with an automated harness that plays multi-turn adventures across genres and measures per-turn format adherence, latency, image canvas, and prose-vs-audio match (by transcribing the narration back through Whisper):

  • Turn structure as a strict two-response protocol β€” the system prompt specifies the turn as exactly two responses (Response 1: one image call and one text_to_speech call together, nothing else; Response 2: the narration verbatim plus the continuity record) and includes a full worked example, including literal tool-call JSON. Emitting both tool calls in a single response saves an entire planner round-trip per turn.
  • Reasoning off, not budgeted β€” llama.cpp's --reasoning-budget is not enforced when tools are present in the request (the chat template pre-fills the think tag, so the budget sampler never arms), which made "briefly thinking" planners stall for minutes. --reasoning off eliminates the tail entirely, and Qwen tool calling is more reliable without thinking.
  • Guardrails at the token level, not just the prompt β€” in multi-turn history, past media is replaced by placeholder text ([generated image]), and after a couple of turns the planner starts imitating those records instead of calling the tools β€” no prompt wording reliably prevented it. The planner args therefore hard-ban the [-family tokens (making the placeholder text unemittable) and add a small positive bias on the <tool_call> opener token, which anchors every turn's first response. The continuity record also opens with a TOOLS: line and ends with a NEXT: protocol footer, so the history itself keeps testifying that every turn runs the tools.
  • Character and place continuity across turns β€” a maintained record pins every named character and location to a fixed, highly specific physical descriptor (age, build, skin, hair, eyes, one distinguishing mark, dress, signature prop) that is reused verbatim in every image prompt. Concrete values ("deep-set grey eyes") hold; vague adjectives ("hard eyes") drift.
  • Generate fresh, don't over-edit β€” few-step image editing cannot track scene changes, so edited frames lag behind the prose. Fresh generations with a pinned STYLE line and verbatim descriptors keep characters and palette stable and depict each turn's actual action; edit_image is reserved for single-detail changes within an unchanged shot.
  • Image matches the words β€” image prompts are built action-first (style, then this turn's single focal action and props, then character descriptors, then setting), with the canvas pinned to 512Γ—256 on every call, so the picture depicts the narration's moment rather than a static portrait.
  • Voice matches text β€” the narration is composed once, passed to text_to_speech, and then repeated word-for-word as the printed text, so the audio and the on-screen words are identical.
  • Active, dramatic pacing β€” the world pushes back: no attempt lands exactly as intended, NPCs act on their own agendas, earlier hooks get paid off on screen, and each turn ends mid-pressure with the scene itself showing a few distinct risky ways forward β€” never a menu, and never a skipped image, even in quiet aftermath or epilogue beats.
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support