LAION-Box Emotional v0.7 — expressive voice-acting TTS (fully-merged checkpoints)

Three fully-merged, standalone audio-DiT checkpoints of the LAION-Box / DramaBox text-to-speech model, fine-tuned to produce more emotionally expressive speech. Each file is a complete model — no LoRA loading required.

Lineage (important)

Base: Lightricks/LTX-2.3, specifically the LTX-2.3 3.3B audio-only DiT (flow-matching audio latent transformer, AVTransformer3DModel, caption_channels=3840, metadata model_version 2.3.0). Note that this is LTX-2.3, not "LTX-2", and that the audio DiT has 3.3B parameters: the "22b" in the original filename refers to the full multimodal LTX-2.3, not this audio branch.
DramaBox (ResembleAI/Dramabox): Resemble AI's expressive TTS, an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.
run16 "v0.7" LoRA (rank 256): LAION's continued fine-tune on the diverse DramaBox tuning mix (German 70% + diverse voice-acting 30%), taken at step 19,500 and merged in.
Emotion LoRA (rank 32, α 32): trained here for 10 epochs on a high-emotion subset, then merged in.

The result is a single standalone DiT you use exactly like the base LTX-2.3 audio model.
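
For readers unfamiliar with what "merged in" means in the lineage above, here is a minimal sketch of folding a LoRA into a base weight matrix. It assumes the standard LoRA formulation, W' = W + (α / r) · B · A. The merge script actually used for these checkpoints is not part of this card, and `matmul`, `merge_lora`, and the toy matrices are hypothetical names for illustration only.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight: d_out x d_in, lora_a: rank x d_in, lora_b: d_out x rank.
    """
    rank = len(lora_a)  # the LoRA rank is the number of rows of A
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: 2x2 weight with a rank-1 adapter and alpha == rank,
# so the scale is 1.0 (as with the emotion LoRA's rank 32, alpha 32).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # rank x d_in
B = [[0.5], [1.0]]    # d_out x rank
merged = merge_lora(W, A, B, alpha=1.0)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why each checkpoint here loads as a plain DiT.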

Files

The model (this repo)

file	source LoRA step	flow loss	notes
`LAION-Box-Emotional-v0.7_best1_step850.safetensors`	850 (~epoch 9.7)	0.111	strongest emotional fit (recommended)
`LAION-Box-Emotional-v0.7_best2_step800.safetensors`	800 (~epoch 9.1)	0.123	near-best
`LAION-Box-Emotional-v0.7_best3_step150.safetensors`	150 (~epoch 1.7)	0.145	lightest adaptation, closest to base
`dramabox-audio-components.safetensors`	—	—	VAE + vocoder + audio connector (from `ResembleAI/Dramabox`, ~1.9 GB). Required to turn DiT latents into a waveform.
`inference.py`, `download_components.py`	—	—	runnable example + fetch the two third-party foundation models below

Each `*Emotional*.safetensors` file is LTX-2.3 base + DramaBox + run16 LoRA (rank 256) + emotion LoRA (rank 32, α = 32), all merged. The three are interchangeable standalone checkpoints.

Components needed for inference

role	what	where	size
audio DiT	this repo's `*Emotional*.safetensors`	✅ included	6.1 GB each
VAE + vocoder	`dramabox-audio-components.safetensors`	✅ included	1.9 GB
text / prompt encoder	`unsloth/gemma-3-12b-it-bnb-4bit` (Google Gemma 3 12B, 4-bit)	⬇️ `download_components.py`	~7.4 GB
reference denoiser (RE-USE)	`nvidia/RE-USE` (SEMamba)	⬇️ `download_components.py`	small
pipeline code	DramaBox / LTX-2.3 `ltx2` core + `src/`	`ResembleAI/Dramabox`	—

The two foundation models (Google Gemma as the prompt encoder, NVIDIA RE-USE as the reference denoiser) are not re-hosted here — they are fetched from their canonical repos by download_components.py, under their own licenses (Gemma / NVIDIA). Everything DramaBox/LTX-2.3-specific (the DiT + VAE + vocoder) is in this repo.

Selection note: on this small (5.6k-sample) fine-tune, flow-matching loss is flat across epochs and only weakly tied to emotional expressivity — A/B the three checkpoints on your own prompts rather than trusting the loss ranking.
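
One way to set up such an A/B comparison is to render the same prompt, voice reference, and seed through all three checkpoints. The sketch below only builds the command lines, reusing the `src/inference.py` flags shown in the Usage section; `build_ab_commands` and the `ab_*.wav` output names are hypothetical and not part of the repo.

```python
import shlex

CHECKPOINTS = [
    "LAION-Box-Emotional-v0.7_best1_step850.safetensors",
    "LAION-Box-Emotional-v0.7_best2_step800.safetensors",
    "LAION-Box-Emotional-v0.7_best3_step150.safetensors",
]


def build_ab_commands(prompt, voice_ref, seed=42, cfg_scale=2.5, stg_scale=1.5):
    """Return one argv list per checkpoint; only the checkpoint and output differ."""
    commands = []
    for ckpt in CHECKPOINTS:
        tag = ckpt.split("_")[1]  # "best1", "best2", "best3"
        commands.append([
            "python", "src/inference.py",
            "--checkpoint", ckpt,
            "--full-checkpoint", "dramabox-audio-components.safetensors",
            "--prompt", prompt,
            "--voice-ref", voice_ref,
            "--output", f"ab_{tag}.wav",
            "--cfg-scale", str(cfg_scale),
            "--stg-scale", str(stg_scale),
            "--seed", str(seed),
        ])
    return commands


commands = build_ab_commands(
    "A woman, trembling with grief: 'I can't do this anymore.'",
    "reference_voice.wav",
)
for cmd in commands:
    print(shlex.join(cmd))  # paste into a shell, or pass cmd to subprocess.run
```

Holding the seed and guidance scales fixed keeps the checkpoint as the only variable between the three output files.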

Requirements

pip install torch safetensors librosa soundfile huggingface_hub transformers
# + the DramaBox / LTX-2.3 pipeline (ltx2 core + src/) from ResembleAI/Dramabox
python download_components.py         # fetches Gemma + RE-USE

A GPU with roughly 24 GB of memory or more is needed (bf16 weights; Gemma can optionally be loaded in 4-bit).

Usage

CLI (DramaBox `src/inference.py`)

python src/inference.py \
    --checkpoint LAION-Box-Emotional-v0.7_best1_step850.safetensors \
    --full-checkpoint dramabox-audio-components.safetensors \
    --prompt "A woman, trembling with grief: 'I can't do this anymore.'" \
    --voice-ref reference_voice.wav \
    --output out.wav \
    --cfg-scale 2.5 --stg-scale 1.5 --seed 42

Python (`TTSServer`)

import sys; sys.path.insert(0, "DramaBox/src")
from inference_server import TTSServer

tts = TTSServer(
    checkpoint="LAION-Box-Emotional-v0.7_best1_step850.safetensors",  # this DiT
    full_checkpoint="dramabox-audio-components.safetensors",           # VAE + vocoder
    gemma_root="<gemma snapshot dir from download_components.py>",     # prompt encoder
    device="cuda", dtype="bf16", bnb_4bit=True,
)
tts.generate_to_file(
    prompt="An old man, warm and amused, chuckling: 'You remind me of myself at your age.'",
    output="out.wav",
    voice_ref="reference_voice.wav",   # 5-10 s clean speaker reference
    cfg_scale=2.5, stg_scale=1.5, duration_multiplier=1.1,
    ref_duration=10.0, denoise_ref=True, seed=42,
)

Prompting for emotion

The model conditions on a natural-language description (via Gemma) + a voice reference. Put the emotional direction in the prompt ("furious, shouting", "tender and hushed", "nervous, voice shaking"). This checkpoint biases delivery toward stronger emotion than the base. Key knobs: cfg_scale (↑ = follows the emotional prompt harder, ~2–4), stg_scale (stability, ~1–2), voice_ref (timbre), denoise_ref (clean the reference via RE-USE).
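
As an illustration of these knobs, the sketch below composes a prompt in the "speaker, emotional direction: 'line'" pattern used by the examples on this card and lays out a small grid over `cfg_scale` and `stg_scale` within the ranges suggested above. `make_prompt`, `sweep_settings`, and the output file names are hypothetical helpers; the resulting dictionaries are meant to be passed to `tts.generate_to_file(**kwargs)` from the Python example, which is left commented out here.

```python
from itertools import product


def make_prompt(speaker, direction, line):
    """Put the emotional direction between the speaker description and the line."""
    return f"{speaker}, {direction}: '{line}'"


def sweep_settings(prompt, voice_ref,
                   cfg_scales=(2.0, 3.0, 4.0),
                   stg_scales=(1.0, 1.5, 2.0),
                   seed=42):
    """Return one kwargs dict per (cfg_scale, stg_scale) combination."""
    runs = []
    for cfg, stg in product(cfg_scales, stg_scales):
        runs.append({
            "prompt": prompt,
            "voice_ref": voice_ref,
            "output": f"sweep_cfg{cfg}_stg{stg}.wav",
            "cfg_scale": cfg,
            "stg_scale": stg,
            "seed": seed,  # fixed seed so only the guidance scales vary
        })
    return runs


prompt = make_prompt("A woman", "trembling with grief", "I can't do this anymore.")
runs = sweep_settings(prompt, "reference_voice.wav")
for kwargs in runs:
    print(kwargs["output"], kwargs["cfg_scale"], kwargs["stg_scale"])
    # tts.generate_to_file(**kwargs)  # requires the TTSServer instance from above
```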

Training (emotion stage)

rank 32, α 32, dropout 0.0; lr 1e-4, 8× GPU, grad-accum 8, bf16; 10 epochs (~875 steps), trained directly on precomputed DramaBox latents (tgt_latent + cond). The 3 lowest flow-matching-loss checkpoints were merged and shipped.
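
The step and epoch figures on this card can be cross-checked with a little arithmetic. The sketch below assumes a per-GPU batch size of 1, which the card does not state; under that assumption the effective batch is 8 GPUs × 8 accumulation steps = 64 clips per optimizer step, which reproduces both the ~875 total steps and the approximate epochs listed in the Files table.

```python
CLIPS = 5596          # size of the emotion fine-tune set
GPUS = 8
GRAD_ACCUM = 8
PER_GPU_BATCH = 1     # assumption: not stated on the card
EPOCHS = 10

effective_batch = GPUS * GRAD_ACCUM * PER_GPU_BATCH
steps_per_epoch = CLIPS / effective_batch
total_steps = steps_per_epoch * EPOCHS


def step_to_epoch(step):
    """Convert an optimizer step into a fractional epoch."""
    return step / steps_per_epoch


print(f"effective batch: {effective_batch}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"total steps:     {total_steps:.0f}")
for step in (850, 800, 150):
    print(f"step {step} is about epoch {step_to_epoch(step):.1f}")
```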

Emotion data provenance

The fine-tune set is the most emotional subset (5,596 clips) of the DramaBox mix. Every clip was scored with laion/Empathic-Insight-Voice-Plus (40 EmoNet emotions), and the top 10 % of each source dataset (30 % for Elise) were kept, ranked by intensity Σ(score − per-dim mean), i.e. the summed deviation of each emotion score from that dimension's mean. Dataset: TTS-AGI/emotional-voice-acting-subset-v0.7 (private).
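
The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the script used to build the dataset: the per-dimension mean is taken over all clips (the card does not say whether it is global or per source), the kept count is rounded to the nearest clip, and `select_top_emotional` and the clip dictionary layout are hypothetical.

```python
from collections import defaultdict


def select_top_emotional(clips, default_frac=0.10, overrides=None):
    """Keep the highest-intensity fraction of clips within each source dataset.

    clips: list of {"id": ..., "source": str, "scores": [float, ...]}.
    overrides: optional {source: fraction}, e.g. {"Elise": 0.30}.
    """
    overrides = overrides or {}
    n_dims = len(clips[0]["scores"])
    # Assumption: per-dimension mean is computed over the whole pool of clips.
    dim_mean = [sum(c["scores"][d] for c in clips) / len(clips) for d in range(n_dims)]

    def intensity(clip):
        return sum(s - m for s, m in zip(clip["scores"], dim_mean))

    by_source = defaultdict(list)
    for clip in clips:
        by_source[clip["source"]].append(clip)

    kept = []
    for source, group in by_source.items():
        frac = overrides.get(source, default_frac)
        k = max(1, round(frac * len(group)))  # assumption: nearest-clip rounding
        kept.extend(sorted(group, key=intensity, reverse=True)[:k])
    return kept


# Toy pool: two sources of 10 clips each, two emotion dimensions per clip.
pool = [{"id": f"a{i}", "source": "A", "scores": [float(i), 0.0]} for i in range(10)]
pool += [{"id": f"e{i}", "source": "Elise", "scores": [float(i), 1.0]} for i in range(10)]
kept = select_top_emotional(pool, overrides={"Elise": 0.30})
print([c["id"] for c in kept])
```

With 40 emotion dimensions per clip, the same function applies unchanged; only the length of each `scores` list differs.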

Credits & license

Prompt encoder: Google Gemma (Gemma license). Reference denoiser: NVIDIA RE-USE.
This emotional fine-tune is by LAION. It is governed by the LTX-2 Community License and is for research/evaluation use.
