LAIONBox v0.6-wip: Run 15 (LoRA-only, rank 256)

LoRA fine-tune of the DramaBox audio-only DiT (LTX-2.3-22B-Dev, ~3.3B parameters, ResembleAI) for expressive voice-acting TTS. Pure LoRA (rank 256 / alpha 256, 5 epochs): no AdaLN and no auxiliary losses, only the flow-matching loss. Scaled up from the v0.5 baseline (r128 / 3 epochs).
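For readers unfamiliar with the objective, the following is a minimal sketch of a flow-matching loss under one common convention: a straight-line path from noise to data, with the velocity target equal to data minus noise. The function name, the `predict_velocity` callback, and the flat-list data layout are illustrative assumptions. The training code for this run is not shown in this card and may use a different parameterization or timestep sampling.

```python
import random


def flow_matching_loss(predict_velocity, x_data, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t) is a hypothetical stand-in for the model; it takes
    the noisy sample and the timestep and returns a list of the same length.
    """
    # Draw Gaussian noise with the same shape as the data.
    noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    # Sample a timestep in [0, 1).
    t = rng.random()
    # Point on the straight line between noise (t = 0) and data (t = 1).
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, x_data)]
    # Along a straight line the velocity is constant: data minus noise.
    target = [x - n for n, x in zip(noise, x_data)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x_data)


if __name__ == "__main__":
    rng = random.Random(0)
    # A dummy predictor that always outputs zero velocity.
    loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), [0.5] * 16, rng)
    print(f"loss with a zero predictor: {loss:.4f}")
```

In a LoRA-only run like this one, only the adapter weights would receive gradients from such a loss; the base DiT stays frozen.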

Evaluation (means over 72 samples; sidon_normalized enhanced refs: 5 speakers x 2 seeds x 6 prompts EN+DE)

nisqa_mos/scoreq/CE/PQ are held-out scorers. 🔊 Audio: see v06_eval_results.html (base64 players) in this repo.

Verdict: Run 5 (full fine-tune + AdaLN, v0.3-wip) has the best raw quality; Run 15 (this model, LoRA r256) has the best speaker similarity (RAW ~0.92, VC->SIDON ~0.947) while matching the r128 baseline on quality.

RAW

| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.601 | 3.647 | 4.369 | 4.636 | 6.188 | 7.896 | 0.879 |
| Run 10: LoRA r128 baseline (v0.5) | 4.481 | 3.448 | 4.201 | 4.5 | 5.982 | 7.793 | 0.876 |
| Run 12: frozen DiT + AdaLN (best) | 4.378 | 3.395 | 4.057 | 4.428 | 6.023 | 7.858 | 0.894 |
| Run 13: frozen DiT + AdaLN (best) | 4.434 | 3.498 | 4.138 | 4.471 | 5.997 | 7.789 | 0.88 |
| Run 15: LoRA r256 5ep (s2440) | 4.474 | 3.385 | 4.184 | 4.487 | 6.0 | 7.851 | 0.919 |
| Run 15: LoRA r256 5ep (s2460) | 4.487 | 3.418 | 4.205 | 4.504 | 6.003 | 7.841 | 0.916 |
| Run 15: LoRA r256 5ep (s2465) | 4.491 | 3.446 | 4.199 | 4.512 | 6.0 | 7.841 | 0.916 |

SIDON

| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.747 | 3.793 | 4.667 | 4.74 | 6.273 | 7.921 | 0.87 |
| Run 10: LoRA r128 baseline (v0.5) | 4.656 | 3.588 | 4.545 | 4.647 | 6.116 | 7.912 | 0.873 |
| Run 12: frozen DiT + AdaLN (best) | 4.616 | 3.537 | 4.488 | 4.606 | 6.168 | 7.962 | 0.89 |
| Run 13: frozen DiT + AdaLN (best) | 4.654 | 3.659 | 4.528 | 4.645 | 6.12 | 7.911 | 0.877 |
| Run 15: LoRA r256 5ep (s2440) | 4.648 | 3.553 | 4.543 | 4.622 | 6.133 | 7.947 | 0.918 |
| Run 15: LoRA r256 5ep (s2460) | 4.661 | 3.603 | 4.56 | 4.636 | 6.132 | 7.941 | 0.915 |
| Run 15: LoRA r256 5ep (s2465) | 4.649 | 3.596 | 4.55 | 4.634 | 6.13 | 7.94 | 0.914 |

VC->SIDON

| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.713 | 3.823 | 4.575 | 4.683 | 6.223 | 7.908 | 0.92 |
| Run 10: LoRA r128 baseline (v0.5) | 4.647 | 3.729 | 4.533 | 4.639 | 6.082 | 7.905 | 0.92 |
| Run 12: frozen DiT + AdaLN (best) | 4.579 | 3.65 | 4.477 | 4.589 | 6.126 | 7.963 | 0.919 |
| Run 13: frozen DiT + AdaLN (best) | 4.622 | 3.767 | 4.5 | 4.625 | 6.093 | 7.933 | 0.923 |
| Run 15: LoRA r256 5ep (s2440) | 4.643 | 3.646 | 4.526 | 4.605 | 6.088 | 7.932 | 0.946 |
| Run 15: LoRA r256 5ep (s2460) | 4.638 | 3.643 | 4.501 | 4.606 | 6.078 | 7.924 | 0.948 |
| Run 15: LoRA r256 5ep (s2465) | 4.651 | 3.696 | 4.524 | 4.617 | 6.099 | 7.943 | 0.947 |
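
The speaker-similarity part of the verdict can be checked directly from the three tables above. The short sketch below copies the spk_sim values for the r128 baseline (Run 10) and the final Run 15 checkpoint (s2465) and computes the gain per post-processing setting. The dictionary layout and function name are illustrative only.

```python
# spk_sim values copied from the tables above.
# run10 = LoRA r128 baseline (v0.5), run15 = LoRA r256 5ep (s2465).
SPK_SIM = {
    "RAW": {"run10": 0.876, "run15": 0.916},
    "SIDON": {"run10": 0.873, "run15": 0.914},
    "VC->SIDON": {"run10": 0.920, "run15": 0.947},
}


def spk_sim_gain(table):
    """Return Run 15 minus Run 10 speaker similarity for each setting."""
    return {
        setting: round(row["run15"] - row["run10"], 3)
        for setting, row in table.items()
    }


gains = spk_sim_gain(SPK_SIM)
for setting, gain in gains.items():
    print(f"{setting}: +{gain:.3f}")
```

The gain is about 0.04 on RAW and SIDON outputs and about 0.027 after voice conversion, where every run already starts from a higher spk_sim.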

Training

  • Base: ltx-2.3-22b-dev-audio-only-v13-merged (DramaBox audio-only DiT)
  • LoRA rank 256, alpha 256 (scaling 1.0), dropout 0.0; pure flow-matching loss; 5 epochs / 2465 steps; lr 1e-4 cosine; bf16; 8xA100
  • Checkpoint: lora_r256_step2465.safetensors (453M params)
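
The "scaling 1.0" above follows from the usual LoRA convention scaling = alpha / rank. The following is a minimal pure-Python sketch of that convention for a single linear layer, assuming the standard formulation y = W x + scaling * B (A x). The helper names and the 4096-wide layer in the usage lines are hypothetical and are not taken from the DramaBox architecture.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scaling):
    """Frozen base projection plus the scaled low-rank update, for one input vector."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


if __name__ == "__main__":
    print(lora_scaling(alpha=256, rank=256))        # this run: rank 256, alpha 256
    print(lora_param_count(4096, 4096, rank=256))   # hypothetical 4096 x 4096 layer
```

With alpha equal to rank, the low-rank update is added at full strength, so no extra rescaling is needed when merging the adapter into the base weights.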

Inference

This is a LoRA adapter for the DramaBox DiT. Pass only the LoRA checkpoint (no --adaln-checkpoint):

python inference_adaln.py \
  --checkpoint ltx-2.3-22b-dev-audio-only-v13-merged.safetensors \
  --full-checkpoint ltx-2.3-22b-dev.safetensors \
  --lora-checkpoint lora_r256_step2465.safetensors \
  --prompt "A warm, slightly husky 35-year-old woman, high-quality studio recording. 'I never thought I would say this out loud.'" \
  --voice-sample reference_speaker.wav --output out.wav --seed 42

Standalone (no reference voice): drop --voice-sample, add --no-ref, and use a full prompt. Post-process the output with Sidon; for maximum speaker similarity, use Chatterbox-VC -> Sidon.
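
For scripted use, the two invocation modes can be assembled programmatically. The helper below is a hypothetical sketch: `build_command` is not part of the repo, it only uses flags named in this card, and it assumes that --no-ref is a boolean switch and that --output and --seed apply in both modes.

```python
import shlex


def build_command(prompt, voice_sample=None, output="out.wav", seed=42):
    """Build the argv list for inference_adaln.py (hypothetical helper).

    With voice_sample set, the reference-voice mode is used; otherwise the
    standalone mode is selected by adding --no-ref.
    """
    cmd = [
        "python", "inference_adaln.py",
        "--checkpoint", "ltx-2.3-22b-dev-audio-only-v13-merged.safetensors",
        "--full-checkpoint", "ltx-2.3-22b-dev.safetensors",
        "--lora-checkpoint", "lora_r256_step2465.safetensors",
        "--prompt", prompt,
    ]
    if voice_sample is not None:
        cmd += ["--voice-sample", voice_sample]
    else:
        # Assumption: --no-ref takes no argument.
        cmd += ["--no-ref"]
    cmd += ["--output", output, "--seed", str(seed)]
    return cmd


if __name__ == "__main__":
    prompt = "A warm, slightly husky 35-year-old woman, high-quality studio recording."
    # Print a shell-safe command line for the standalone mode.
    print(shlex.join(build_command(prompt)))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell-quoting problems with prompts that contain quotes.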
