---
title: DramaBox
emoji: 🎭
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning - DramaBox demo
---

# DramaBox: Expressive TTS with Voice Cloning

Built on LTX-2 by Lightricks. DramaBox is Resemble AI's expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.

Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions; an optional voice reference (10+ seconds) clones the target timbre. DramaBox is an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.

- 🤗 Model: ResembleAI/Dramabox
- 🎭 Demo Space: ResembleAI/Dramabox (ZeroGPU)
- 🏗️ Base model: Lightricks/LTX-2
- 📜 License: LTX-2 Community License (see LICENSE)

## Models

Auto-downloaded from the HF model repo on first run.

| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| `unsloth/gemma-3-12b-it-bnb-4bit` | ~8 GB | Text encoder |

**VRAM:** ~24 GB peak · **Speed:** ~2.5 s / generation (warm server, H100)

## Quick Start

### Warm server (recommended)

```python
from src.inference_server import TTSServer

server = TTSServer(device="cuda")

server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",   # optional, 10+ seconds
)
```

### CLI

```bash
python src/inference.py \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
  --output output.wav \
  --cfg-scale 2.5 --stg-scale 1.5
```

### Gradio app

```bash
CUDA_VISIBLE_DEVICES=4 python app.py
```

## Inference Settings

| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
| `steps` | 30 | Euler flow matching |
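For example, to bias toward a more natural delivery while giving the auto-estimated length extra headroom, lower `cfg-scale` and raise `duration-multiplier`. Note that `--cfg-scale` and `--stg-scale` appear in the CLI example above, while `--duration-multiplier` is inferred from the parameter name here and should be checked against the script's help output:

```bash
python src/inference.py \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
  --output natural.wav \
  --cfg-scale 2.0 \
  --duration-multiplier 1.2   # flag name inferred from the table above
```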

## Prompt Writing Guide

Structure: `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`

**Inside quotes** (model produces actual sounds):

- Laughs: "Hahaha", "Hehehe" (always one word, never separated)
- Sounds: "Mmmmm", "Ugh", "Argh", "Ahhh", "Hmm"

**Outside quotes** (stage directions):

- She sighs deeply. · He gulps nervously. · A long pause.
- Her voice cracks. · He clears his throat. · She scoffs.

**Avoid inside quotes** (model speaks them literally): Ahem, Pfft, Sigh, Gasp, Cough.

### Tips

  • Match gender/age in the speaker description to the voice reference
  • Break long dialogue into segments with action directions in between
  • End the prompt at the last closing quote mark (no trailing description)

## Watermarking

Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with Resemble Perth, an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

```python
import librosa
import perth

# Load the generated audio and query the Perth detector.
wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))   # confidence ≈ 1.0
```

Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable it for debugging.
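In the warm-server path, the same switch looks like this (reusing the `server` from Quick Start):

```python
server.generate_to_file(
    prompt='A woman speaks warmly, "Testing without a watermark."',
    output="debug.wav",
    voice_ref="reference.wav",
    watermark=False,   # debugging only; leave enabled for anything you share
)
```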

## Training a LoRA on top of DramaBox

You can fine-tune your own LoRA using DramaBox itself as the base; no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.

### 1. Prepare your index file

The preprocessor accepts four formats. The `text` field is the target transcript; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:

```
A woman speaks warmly, "<your transcript here>"
```

Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.

**Format A: manifest (JSONL)**, recommended for new datasets:

{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}

Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
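If your dataset starts as a plain list of (path, transcript) pairs, a small script can emit this manifest; the snippet below is an illustrative sketch, not part of the repo:

```python
import json

# Hypothetical input: (audio path, transcript) pairs you already have.
samples = [
    ("wavs/spk01_001.wav", 'A woman speaks warmly, "Hello, how are you today?"'),
    ("wavs/spk01_002.wav", "Hello, how are you today?"),  # plain TTS, no wrapper
]

with open("your_data.jsonl", "w", encoding="utf-8") as f:
    for path, text in samples:
        # json.dumps handles escaping of the inner quotes for us.
        f.write(json.dumps({"audio_filepath": path, "text": text}) + "\n")
```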

**Format B: TSV** (tab-separated), simplest, one line per sample:

```
wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav	Hello, how are you today?
```

**Format C: gemini_synthetic**, `~`-separated, used for prompted synthetic data:

```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```

**Format D: libriheavy**, `~`-separated, for unprompted text-only data:

```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```

### 2. Preprocess

```bash
python src/preprocess.py \
  --dataset-type manifest \
  --index your_data.jsonl \
  --audio-dir /path/to/wavs \
  --output-dir /path/to/preprocessed/ \
  --checkpoint /path/to/dramabox-audio-components.safetensors \
  --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
  --max-duration 20.0 --min-duration 2.0
```

Output layout (training-ready `.pt` files):

```
preprocessed/
├── audio_latents/sample_*.pt     # audio VAE-encoded latents
├── conditions/sample_*.pt        # Gemma text embeddings
└── latents/sample_*.pt           # dummy video latents (placeholder)
```
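As a sanity check before training, you can load one emitted tensor from each subdirectory and inspect its shape. This assumes the `.pt` files are ordinary `torch.save` artifacts; the exact shapes and dict layout are not documented here, so inspect rather than assert:

```python
import glob
import torch

# Peek at one preprocessed sample from each subdirectory.
for sub in ("audio_latents", "conditions", "latents"):
    path = sorted(glob.glob(f"/path/to/preprocessed/{sub}/sample_*.pt"))[0]
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        print(sub, {k: getattr(v, "shape", type(v)) for k, v in obj.items()})
    else:
        print(sub, getattr(obj, "shape", type(obj)))
```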

### 3. Train

Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with Hugging Face `accelerate`. Any flag passed on the CLI overrides the YAML.

```bash
accelerate launch src/train.py \
  --config configs/training_args.example.yaml
```

The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` across 48 transformer blocks (288 LoRA pairs total). Defaults: rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
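A trimmed sketch of what the training YAML might contain, using the keys documented above (`data_dir`, `speaker_index`, `checkpoint`, `full_checkpoint`, `val_config`). The hyperparameter key names, the speaker-index filename, and the checkpoint-to-file mapping below are assumptions; defer to `configs/training_args.example.yaml` for the real schema:

```yaml
data_dir: /path/to/preprocessed/
speaker_index: /path/to/preprocessed/speaker_index.json   # assumed filename
checkpoint: /path/to/dramabox-dit-v1.safetensors          # assumed mapping
full_checkpoint: /path/to/dramabox-audio-components.safetensors
# Defaults quoted in this README; key names are assumptions:
lora_rank: 128
lora_alpha: 128
lora_dropout: 0.1
learning_rate: 1.0e-4
lr_scheduler: cosine
warmup_steps: 500
max_steps: 10000
# Optional: enables per-save-step validation wavs (see below)
val_config: configs/val_config.example.yaml
```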

To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.

### Inference with your trained LoRA

```bash
python src/inference.py \
  --lora /path/to/your/lora_step_5000.safetensors \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "..."' \
  --output output.wav
```

Always load the LoRA at inference rather than pre-merging it: pre-merged checkpoints have produced degraded output in our runs.

## Language

English.

## License & acknowledgement

DramaBox is a Resemble AI fine-tune of LTX-2, distributed under the LTX-2 Community License Agreement (see LICENSE). Thanks again to Lightricks for releasing the base model.