LAION-Box Emotional v0.7: expressive voice-acting TTS (fully-merged checkpoints)

Three fully-merged, standalone audio-DiT checkpoints of the LAION-Box / DramaBox text-to-speech model, fine-tuned to produce more emotionally expressive speech. Each file is a complete model; no LoRA loading is required.


Lineage (important)

  • Base: Lightricks/LTX-2.3, specifically the LTX-2.3 3.3B audio-only DiT (flow-matching audio latent transformer, AVTransformer3DModel, caption_channels=3840, metadata model_version 2.3.0). Note that the base is LTX-2.3, not "LTX-2". The audio DiT has 3.3B parameters; the "22b" in the original filename refers to the full multimodal LTX-2.3, not this audio branch.
  • DramaBox (ResembleAI/Dramabox): Resemble AI's expressive TTS, an IC-LoRA fine-tune of the LTX-2.3 3.3B audio-only model.
  • run16 "v0.7" LoRA (rank 256): LAION's continued fine-tune on the diverse DramaBox tuning mix (German 70% + diverse voice-acting 30%), taken at step 19,500 and merged in.
  • Emotion LoRA (rank 32, α 32): trained here for 10 epochs on a high-emotion subset, then merged in.

The result is a single standalone DiT you use exactly like the base LTX-2.3 audio model.
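To confirm which lineage a downloaded file belongs to, the header of a .safetensors file can be read without loading any tensors. The sketch below relies only on the documented safetensors layout (an 8-byte little-endian length followed by a JSON header). That model_version is stored under the header's optional "__metadata__" entry is an assumption, and the helper names are illustrative.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def checkpoint_metadata(path):
    """Return the free-form string metadata, or {} if the file has none.

    Assumption: fields such as model_version live under "__metadata__".
    """
    return read_safetensors_header(path).get("__metadata__", {})
```

Calling checkpoint_metadata("LAION-Box-Emotional-v0.7_best1_step850.safetensors").get("model_version") should then show the version string, if the merge preserved it.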


Files

The model (this repo)

| file | source LoRA step | flow loss | notes |
| --- | --- | --- | --- |
| LAION-Box-Emotional-v0.7_best1_step850.safetensors | 850 (~epoch 9.7) | 0.111 | strongest emotional fit (recommended) |
| LAION-Box-Emotional-v0.7_best2_step800.safetensors | 800 (~epoch 9.1) | 0.123 | near-best |
| LAION-Box-Emotional-v0.7_best3_step150.safetensors | 150 (~epoch 1.7) | 0.145 | lightest adaptation, closest to base |
| dramabox-audio-components.safetensors | - | - | VAE + vocoder + audio connector (from ResembleAI/Dramabox, ~1.9 GB). Required to turn DiT latents into a waveform. |
| inference.py, download_components.py | - | - | runnable example, plus a script that fetches the two third-party foundation models listed below |

Each *Emotional*.safetensors is LTX-2.3 base + DramaBox + run16 LoRA + emotion LoRA, all merged (emotion LoRA: α=32, rank=32). The three files are interchangeable standalone checkpoints.
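As a rough illustration of what "merged" means, the sketch below folds a low-rank update into a dense weight with the standard LoRA rule W' = W + (α / rank) · B · A. With α = 32 and rank = 32 the scale is exactly 1.0. This assumes the checkpoints were merged with that standard rule; the actual merge script is not part of this card, and merge_lora and its toy list-of-lists matrices are hypothetical (a real merge would operate on torch tensors).

```python
def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) as a new matrix.

    weight: out_dim x in_dim, lora_a: rank x in_dim, lora_b: out_dim x rank.
    Plain nested lists keep the sketch dependency-free.
    """
    if len(lora_a) != rank or any(len(row) != rank for row in lora_b):
        raise ValueError("LoRA factor shapes do not match the given rank")
    scale = alpha / rank  # 32 / 32 == 1.0 for the emotion LoRA
    merged = [row[:] for row in weight]  # copy so the input stays untouched
    for i in range(len(weight)):
        for j in range(len(weight[0])):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged
```

After such a merge the LoRA factors are no longer needed at inference time, which is why each file loads like a plain base checkpoint.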

Components needed for inference

| role | what | where | size |
| --- | --- | --- | --- |
| audio DiT | this repo's *Emotional*.safetensors | ✅ included | 6.1 GB each |
| VAE + vocoder | dramabox-audio-components.safetensors | ✅ included | 1.9 GB |
| text / prompt encoder | unsloth/gemma-3-12b-it-bnb-4bit (Google Gemma 3 12B, 4-bit) | ⬇️ download_components.py | ~7.4 GB |
| reference denoiser (RE-USE) | nvidia/RE-USE (SEMamba) | ⬇️ download_components.py | small |
| pipeline code | DramaBox / LTX-2.3 ltx2 core + src/ | ResembleAI/Dramabox | - |

The two foundation models (Google Gemma as the prompt encoder, NVIDIA RE-USE as the reference denoiser) are not re-hosted here; download_components.py fetches them from their canonical repos, under their own licenses (Gemma / NVIDIA). Everything DramaBox/LTX-2.3-specific (the DiT + VAE + vocoder) is in this repo.

Selection note: on this small (5.6k-sample) fine-tune, flow-matching loss is flat across epochs and only weakly tied to emotional expressivity, so A/B the three checkpoints on your own prompts rather than trusting the loss ranking.
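One way to run such an A/B comparison is to render the same prompt, voice reference and seed with each checkpoint and listen to the results side by side. The sketch below only builds the command lines, using the flags of the DramaBox CLI shown in the Usage section; the ab_commands helper and the ab_<tag>.wav output names are hypothetical.

```python
import shlex

CHECKPOINTS = [
    "LAION-Box-Emotional-v0.7_best1_step850.safetensors",
    "LAION-Box-Emotional-v0.7_best2_step800.safetensors",
    "LAION-Box-Emotional-v0.7_best3_step150.safetensors",
]


def ab_commands(prompt, voice_ref, seed=42, cfg_scale=2.5, stg_scale=1.5):
    """Build one inference command per checkpoint with identical settings."""
    commands = []
    for ckpt in CHECKPOINTS:
        tag = ckpt.split("_")[1]  # "best1", "best2", "best3"
        commands.append([
            "python", "src/inference.py",
            "--checkpoint", ckpt,
            "--full-checkpoint", "dramabox-audio-components.safetensors",
            "--prompt", prompt,
            "--voice-ref", voice_ref,
            "--output", f"ab_{tag}.wav",  # hypothetical naming scheme
            "--cfg-scale", str(cfg_scale),
            "--stg-scale", str(stg_scale),
            "--seed", str(seed),
        ])
    return commands


def as_shell(command):
    """Render an argv list as a copy-pasteable shell line."""
    return shlex.join(command)
```

Each returned list can be passed to subprocess.run(cmd, check=True); keeping the seed fixed means the checkpoint is the only thing that differs between the three clips.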


Requirements

pip install torch safetensors librosa soundfile huggingface_hub transformers
# + the DramaBox / LTX-2.3 pipeline (ltx2 core + src/) from ResembleAI/Dramabox
python download_components.py         # fetches Gemma + RE-USE

GPU: at least ~24 GB of VRAM (bf16; Gemma can optionally be loaded in 4-bit).

Usage

CLI (DramaBox src/inference.py)

python src/inference.py \
    --checkpoint LAION-Box-Emotional-v0.7_best1_step850.safetensors \
    --full-checkpoint dramabox-audio-components.safetensors \
    --prompt "A woman, trembling with grief: 'I can't do this anymore.'" \
    --voice-ref reference_voice.wav \
    --output out.wav \
    --cfg-scale 2.5 --stg-scale 1.5 --seed 42

Python (TTSServer)

import sys; sys.path.insert(0, "DramaBox/src")
from inference_server import TTSServer

tts = TTSServer(
    checkpoint="LAION-Box-Emotional-v0.7_best1_step850.safetensors",  # this DiT
    full_checkpoint="dramabox-audio-components.safetensors",           # VAE + vocoder
    gemma_root="<gemma snapshot dir from download_components.py>",     # prompt encoder
    device="cuda", dtype="bf16", bnb_4bit=True,
)
tts.generate_to_file(
    prompt="An old man, warm and amused, chuckling: 'You remind me of myself at your age.'",
    output="out.wav",
    voice_ref="reference_voice.wav",   # 5-10 s clean speaker reference
    cfg_scale=2.5, stg_scale=1.5, duration_multiplier=1.1,
    ref_duration=10.0, denoise_ref=True, seed=42,
)

Prompting for emotion

The model conditions on a natural-language description (via Gemma) + a voice reference. Put the emotional direction in the prompt ("furious, shouting", "tender and hushed", "nervous, voice shaking"). This checkpoint biases delivery toward stronger emotion than the base. Key knobs: cfg_scale (↑ = follows the emotional prompt harder, ~2–4), stg_scale (stability, ~1–2), voice_ref (timbre), denoise_ref (clean the reference via RE-USE).
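The two example prompts in this card share a simple shape: a speaker description, an emotional direction, then the quoted line. The sketch below assembles prompts in that shape and produces a small cfg_scale sweep over the suggested ~2 to 4 range. Both helpers are hypothetical conveniences, and the assumption that this exact shape works best is inferred only from the examples here.

```python
def build_prompt(speaker, direction, line):
    """Compose "<speaker>, <emotional direction>: '<spoken line>'"."""
    return f"{speaker}, {direction}: '{line}'"


def cfg_sweep(low=2.0, high=4.0, steps=5):
    """Evenly spaced cfg_scale values from low to high, inclusive."""
    if steps < 2:
        return [low]
    width = (high - low) / (steps - 1)
    return [round(low + i * width, 2) for i in range(steps)]
```

For example, build_prompt("A woman", "trembling with grief", "I can't do this anymore.") reproduces the CLI example prompt, and each value from cfg_sweep() can be passed as cfg_scale to hear how strongly the delivery follows the direction.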


Training (emotion stage)

rank 32, α 32, dropout 0.0; lr 1e-4, 8× GPU, grad-accum 8, bf16; 10 epochs (~875 steps), trained directly on precomputed DramaBox latents (tgt_latent + cond). The 3 lowest flow-matching-loss checkpoints were merged and shipped.
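The step and epoch figures quoted in this card are mutually consistent if the per-GPU batch size is 1, giving an effective batch of 8 × 8 = 64 clips per optimizer step. That batch size is an assumption (it is not stated above); the clip count of 5,596 comes from the data provenance section. A quick check:

```python
CLIPS = 5596          # size of the emotion subset
GPUS = 8
GRAD_ACCUM = 8
PER_GPU_BATCH = 1     # assumption: not stated in this card


def steps_per_epoch(clips=CLIPS, gpus=GPUS, accum=GRAD_ACCUM, per_gpu=PER_GPU_BATCH):
    """Optimizer steps needed to see every clip once."""
    return clips / (gpus * accum * per_gpu)


def step_to_epoch(step):
    """Convert an optimizer step count to a fractional epoch."""
    return step / steps_per_epoch()


for step in (850, 800, 150):
    print(step, round(step_to_epoch(step), 1))
```

Under that assumption one epoch is about 87.4 steps, so 10 epochs come to roughly 875 steps and the three shipped checkpoints fall near epochs 9.7, 9.1 and 1.7, matching the file table.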

Emotion data provenance

Fine-tune set = top-emotional subset (5,596 clips) of the DramaBox mix: every clip was scored with laion/Empathic-Insight-Voice-Plus (40 EmoNet emotions), and the top 10 % per source dataset (30 % for Elise) by intensity Σ(score − per-dim mean) was kept. Dataset: TTS-AGI/emotional-voice-acting-subset-v0.7 (private).
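The selection rule can be sketched as follows. This is a minimal reading of the description above, not the actual filtering script: it assumes the per-dimension mean is taken over the whole scored pool, that the cut-off count is rounded to the nearest clip, and that clips arrive as (clip_id, source, intensity) tuples. All function names are hypothetical.

```python
from collections import defaultdict


def emotion_intensity(scores):
    """Intensity per clip: sum over emotions of (score - mean of that emotion).

    scores: one list of per-emotion scores for each clip (40 values in practice).
    Assumption: the mean is computed across all clips passed in.
    """
    n_dims = len(scores[0])
    means = [sum(row[d] for row in scores) / len(scores) for d in range(n_dims)]
    return [sum(row[d] - means[d] for d in range(n_dims)) for row in scores]


def select_top(clips, default_frac=0.10, per_source_frac=None):
    """Keep the highest-intensity fraction of clips within each source dataset."""
    per_source_frac = per_source_frac or {}
    by_source = defaultdict(list)
    for clip_id, source, intensity in clips:
        by_source[source].append((intensity, clip_id))
    kept = []
    for source, items in by_source.items():
        frac = per_source_frac.get(source, default_frac)
        k = max(1, round(len(items) * frac))  # assumption: nearest-clip rounding
        items.sort(reverse=True)              # highest intensity first
        kept.extend(clip_id for _, clip_id in items[:k])
    return kept
```

Calling select_top(clips, per_source_frac={"Elise": 0.30}) mirrors the 10 % default with the 30 % exception for Elise.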

Credits & license

  • LTX-2.3 base model © Lightricks, under the LTX-2 Community License.
  • DramaBox expressive TTS © Resemble AI (IC-LoRA fine-tune of LTX-2.3).
  • Prompt encoder: Google Gemma (Gemma license). Reference denoiser: NVIDIA RE-USE.
  • This emotional fine-tune by LAION. Governed by the LTX-2 Community License; research/eval use.