---
title: DramaBox
emoji: 🎭
colorFrom: red
colorTo: indigo
sdk: gradio
sdk_version: 5.7.1
app_file: app.py
pinned: true
license: other
license_name: ltx-2-community
license_link: https://huggingface.co/ResembleAI/Dramabox/blob/main/LICENSE
hf_oauth: false
short_description: Expressive TTS with voice cloning - DramaBox demo
---

# DramaBox: Expressive TTS with Voice Cloning

Built on LTX-2 by Lightricks. DramaBox is Resemble AI's expressive TTS, trained on top of the LTX-2.3 audio branch under the LTX-2 Community License. Huge thanks to the Lightricks team for open-sourcing the base.

Prompt-driven TTS with voice cloning. The prompt itself controls speaker identity, emotion, delivery style, laughs, sighs, pauses, and transitions; an optional voice reference (10+ seconds) clones the target timbre. DramaBox is an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.

- 🤗 Model: ResembleAI/Dramabox
- 🎭 Demo Space: ResembleAI/Dramabox (ZeroGPU)
- 🏗️ Base model: Lightricks/LTX-2
- 📜 License: LTX-2 Community License (see LICENSE)

## Models

Auto-downloaded from the HF model repo on first run.

| File | Size | Description |
|---|---|---|
| `dramabox-dit-v1.safetensors` | 6.6 GB | DiT transformer (LoRA already merged into base) |
| `dramabox-audio-components.safetensors` | 1.9 GB | Audio embeddings connector + audio text projection + audio VAE + vocoder |
| `unsloth/gemma-3-12b-it-bnb-4bit` | ~8 GB | Text encoder |

**VRAM:** ~24 GB peak · **Speed:** ~2.5 s / generation (warm server, H100)

## Quick Start

### Warm server (recommended)

```python
from src.inference_server import TTSServer

server = TTSServer(device="cuda")

server.generate_to_file(
    prompt='A woman speaks warmly, "Hello, how are you today?" She laughs, "Hahaha, it is so good to see you!"',
    output="output.wav",
    voice_ref="reference.wav",   # optional, 10+ seconds
)
```

### CLI

```bash
python src/inference.py \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
  --output output.wav \
  --cfg-scale 2.5 --stg-scale 1.5
```

### Gradio app

```bash
CUDA_VISIBLE_DEVICES=4 python app.py
```

## Inference Settings

| Parameter | Default | Notes |
|---|---|---|
| `cfg-scale` | 2.5 | Lower = more natural, higher = more text-faithful |
| `stg-scale` | 1.5 | Skip-token guidance |
| `rescale` | 0 | No rescaling |
| `modality` | 1 | No modality guidance |
| `duration-multiplier` | 1.1 | 10% breathing room on auto-estimated length |
| `steps` | 30 | Euler flow matching |
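For example, to bias toward a more natural delivery while giving the auto-estimated length extra headroom, lower `cfg-scale` and raise `duration-multiplier`. Note that `--cfg-scale` and `--stg-scale` appear in the CLI example above, while `--duration-multiplier` is inferred from the parameter name here and should be checked against the script's help output:

```bash
python src/inference.py \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "Hello, how are you today?"' \
  --output natural.wav \
  --cfg-scale 2.0 \
  --duration-multiplier 1.2   # flag name inferred from the table above
```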

## Prompt Writing Guide

Structure: `<speaker description>, "<dialogue>" <action direction> "<more dialogue>"`

**Inside quotes** (model produces actual sounds):

- Laughs: "Hahaha", "Hehehe" (always one word, never separated)
- Sounds: "Mmmmm", "Ugh", "Argh", "Ahhh", "Hmm"

**Outside quotes** (stage directions):

- She sighs deeply. · He gulps nervously. · A long pause.
- Her voice cracks. · He clears his throat. · She scoffs.

**Avoid inside quotes** (model speaks them literally): Ahem, Pfft, Sigh, Gasp, Cough.

### Tips

  • Match gender/age in the speaker description to the voice reference
  • Break long dialogue into segments with action directions in between
  • End the prompt at the last closing quote mark (no trailing description)

## Watermarking

Every audio output from `inference.py` and `inference_server.TTSServer.generate_to_file` is automatically watermarked with Resemble Perth, an imperceptible neural watermark that survives MP3 compression, audio editing, and common manipulations while maintaining nearly 100% detection accuracy.

```python
import librosa
import perth

# Load the generated audio and query the Perth detector.
wav, sr = librosa.load("output.wav", sr=None, mono=True)
detector = perth.PerthImplicitWatermarker()
print(detector.get_watermark(wav, sample_rate=sr))   # confidence ≈ 1.0
```

Pass `--no-watermark` to `inference.py` (or `watermark=False` to `generate_to_file`) to disable it for debugging.
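In the warm-server path, the same switch looks like this (reusing the `server` from Quick Start):

```python
server.generate_to_file(
    prompt='A woman speaks warmly, "Testing without a watermark."',
    output="debug.wav",
    voice_ref="reference.wav",
    watermark=False,   # debugging only; leave enabled for anything you share
)
```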

## Training a LoRA on top of DramaBox

You can fine-tune your own LoRA using DramaBox itself as the base; no need to start from raw LTX-2.3. Useful for adding a specific speaker, language flavour, or style on top of the existing expressive prior.

### 1. Prepare your index file

The preprocessor accepts four formats. The `text` field is the target transcript; if you want to attach a scene-style prompt (the part the model conditions on at inference time), prepend it to the transcript in the same format the model was trained on:

```
A woman speaks warmly, "<your transcript here>"
```

Both forms are supported, with or without the prompt wrapper. Without the wrapper the model treats the entry as plain text-to-speech.

**Format A: manifest (JSONL)**, recommended for new datasets:

{"audio_filepath": "wavs/spk01_001.wav", "text": "A woman speaks warmly, \"Hello, how are you today?\""}
{"audio_filepath": "wavs/spk01_002.wav", "text": "Hello, how are you today?"}
{"audio_filepath": "wavs/spk02_001.flac", "text": "An exhausted father sighs, \"Sweetie, daddy is asking very nicely.\"", "duration": 4.7}

Fields: `audio_filepath` (or `audio_path`) is required, `text` (or `transcript`) is required, `duration` is optional.
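If your dataset starts as a plain list of (path, transcript) pairs, a small script can emit this manifest; the snippet below is an illustrative sketch, not part of the repo:

```python
import json

# Hypothetical input: (audio path, transcript) pairs you already have.
samples = [
    ("wavs/spk01_001.wav", 'A woman speaks warmly, "Hello, how are you today?"'),
    ("wavs/spk01_002.wav", "Hello, how are you today?"),  # plain TTS, no wrapper
]

with open("your_data.jsonl", "w", encoding="utf-8") as f:
    for path, text in samples:
        # json.dumps handles escaping of the inner quotes for us.
        f.write(json.dumps({"audio_filepath": path, "text": text}) + "\n")
```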

**Format B: TSV** (tab-separated), simplest, one line per sample:

```
wavs/spk01_001.wav	A woman speaks warmly, "Hello, how are you today?"
wavs/spk01_002.wav	Hello, how are you today?
```

**Format C: gemini_synthetic**, `~`-separated, used for prompted synthetic data:

```
id~speaker~lang~sr~samples~dur~phonemes~text
spk01_001~spk01~en~24000~93000~3.875~_~A woman speaks warmly, "Hello, how are you today?"
```

**Format D: libriheavy**, `~`-separated, for unprompted text-only data:

```
id~speaker~lang~samples~dur_ms~phonemes~text
spk01_001~spk01~en~93000~3875~_~Hello, how are you today?
```

### 2. Preprocess

```bash
python src/preprocess.py \
  --dataset-type manifest \
  --index your_data.jsonl \
  --audio-dir /path/to/wavs \
  --output-dir /path/to/preprocessed/ \
  --checkpoint /path/to/dramabox-audio-components.safetensors \
  --gemma-root /path/to/gemma-3-12b-it-bnb-4bit/ \
  --max-duration 20.0 --min-duration 2.0
```

Output layout (training-ready `.pt` files):

```
preprocessed/
├── audio_latents/sample_*.pt     # audio VAE-encoded latents
├── conditions/sample_*.pt        # Gemma text embeddings
└── latents/sample_*.pt           # dummy video latents (placeholder)
```
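As a sanity check before training, you can load one emitted tensor from each subdirectory and inspect its shape. This assumes the `.pt` files are ordinary `torch.save` artifacts; the exact shapes and dict layout are not documented here, so inspect rather than assert:

```python
import glob
import torch

# Peek at one preprocessed sample from each subdirectory.
for sub in ("audio_latents", "conditions", "latents"):
    path = sorted(glob.glob(f"/path/to/preprocessed/{sub}/sample_*.pt"))[0]
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        print(sub, {k: getattr(v, "shape", type(v)) for k, v in obj.items()})
    else:
        print(sub, getattr(obj, "shape", type(obj)))
```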

### 3. Train

Copy `configs/training_args.example.yaml`, point `data_dir` / `speaker_index` at your preprocessed output, set `checkpoint` + `full_checkpoint` to the DramaBox files, then launch with Hugging Face `accelerate`. Any flag passed on the CLI overrides the YAML.

```bash
accelerate launch src/train.py \
  --config configs/training_args.example.yaml
```

The trainer attaches a fresh LoRA to the audio branch on top of the DramaBox checkpoint. LoRA targets: `audio_attn1.{to_q,to_k,to_v,to_out.0}` + `audio_ff.{net.0.proj,net.2}` across 48 transformer blocks (288 LoRA pairs total). Defaults: rank 128 / alpha 128 / dropout 0.1, cosine LR schedule from 1e-4 with 500-step warmup over 10k steps.
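A trimmed sketch of what the training YAML might contain, using the keys documented above (`data_dir`, `speaker_index`, `checkpoint`, `full_checkpoint`, `val_config`). The hyperparameter key names, the speaker-index filename, and the checkpoint-to-file mapping below are assumptions; defer to `configs/training_args.example.yaml` for the real schema:

```yaml
data_dir: /path/to/preprocessed/
speaker_index: /path/to/preprocessed/speaker_index.json   # assumed filename
checkpoint: /path/to/dramabox-dit-v1.safetensors          # assumed mapping
full_checkpoint: /path/to/dramabox-audio-components.safetensors
# Defaults quoted in this README; key names are assumptions:
lora_rank: 128
lora_alpha: 128
lora_dropout: 0.1
learning_rate: 1.0e-4
lr_scheduler: cosine
warmup_steps: 500
max_steps: 10000
# Optional: enables per-save-step validation wavs (see below)
val_config: configs/val_config.example.yaml
```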

To monitor training, set `val_config: configs/val_config.example.yaml` in your training YAML; `src/validate.py` is then spawned at every save step to generate one wav per speaker entry, so you can A/B listen during the run.

### Inference with your trained LoRA

```bash
python src/inference.py \
  --lora /path/to/your/lora_step_5000.safetensors \
  --voice-sample reference.wav \
  --prompt 'A woman speaks warmly, "..."' \
  --output output.wav
```

Always load the LoRA at inference rather than pre-merging it: pre-merged checkpoints have produced degraded output in our runs.

## Language

English.

## License & acknowledgement

DramaBox is a Resemble AI fine-tune of LTX-2, distributed under the LTX-2 Community License Agreement (see LICENSE). Thanks again to Lightricks for releasing the base model.