LAIONBox v0.6-wip: Run 15 (LoRA-only, rank 256)
LoRA fine-tune of the DramaBox audio-only DiT (LTX-2.3-22B-Dev, ~3.3B, ResembleAI) for expressive voice-acting TTS. Pure LoRA (rank 256 / alpha 256, 5 epochs): no AdaLN, no auxiliary losses, just flow matching. Scaled up from the v0.5 baseline (r128 / 3 epochs).
Evaluation (means over 72 samples; sidon_normalized enhanced refs: 5 speakers x 2 seeds x 6 prompts EN+DE)
nisqa_mos/scoreq/CE/PQ are held-out scorers. Audio: see v06_eval_results.html (base64 players) in this repo.
Verdict: Run 5 (full fine-tune + AdaLN, v0.3-wip) has the best raw quality; Run 15 (this model, LoRA r256) has the best speaker similarity (raw 0.92, VC->SIDON 0.947) while matching the r128 baseline on quality.
RAW
| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.601 | 3.647 | 4.369 | 4.636 | 6.188 | 7.896 | 0.879 |
| Run 10: LoRA r128 baseline (v0.5) | 4.481 | 3.448 | 4.201 | 4.5 | 5.982 | 7.793 | 0.876 |
| Run 12: frozen DiT + AdaLN (best) | 4.378 | 3.395 | 4.057 | 4.428 | 6.023 | 7.858 | 0.894 |
| Run 13: frozen DiT + AdaLN (best) | 4.434 | 3.498 | 4.138 | 4.471 | 5.997 | 7.789 | 0.88 |
| Run 15: LoRA r256 5ep (s2440) | 4.474 | 3.385 | 4.184 | 4.487 | 6.0 | 7.851 | 0.919 |
| Run 15: LoRA r256 5ep (s2460) | 4.487 | 3.418 | 4.205 | 4.504 | 6.003 | 7.841 | 0.916 |
| Run 15: LoRA r256 5ep (s2465) | 4.491 | 3.446 | 4.199 | 4.512 | 6.0 | 7.841 | 0.916 |
SIDON
| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.747 | 3.793 | 4.667 | 4.74 | 6.273 | 7.921 | 0.87 |
| Run 10: LoRA r128 baseline (v0.5) | 4.656 | 3.588 | 4.545 | 4.647 | 6.116 | 7.912 | 0.873 |
| Run 12: frozen DiT + AdaLN (best) | 4.616 | 3.537 | 4.488 | 4.606 | 6.168 | 7.962 | 0.89 |
| Run 13: frozen DiT + AdaLN (best) | 4.654 | 3.659 | 4.528 | 4.645 | 6.12 | 7.911 | 0.877 |
| Run 15: LoRA r256 5ep (s2440) | 4.648 | 3.553 | 4.543 | 4.622 | 6.133 | 7.947 | 0.918 |
| Run 15: LoRA r256 5ep (s2460) | 4.661 | 3.603 | 4.56 | 4.636 | 6.132 | 7.941 | 0.915 |
| Run 15: LoRA r256 5ep (s2465) | 4.649 | 3.596 | 4.55 | 4.634 | 6.13 | 7.94 | 0.914 |
VC->SIDON
| model | mos | utmos | nisqa_mos | scoreq | CE | PQ | spk_sim |
|---|---|---|---|---|---|---|---|
| Run 5: full-FT DiT + AdaLN (v0.3) | 4.713 | 3.823 | 4.575 | 4.683 | 6.223 | 7.908 | 0.92 |
| Run 10: LoRA r128 baseline (v0.5) | 4.647 | 3.729 | 4.533 | 4.639 | 6.082 | 7.905 | 0.92 |
| Run 12: frozen DiT + AdaLN (best) | 4.579 | 3.65 | 4.477 | 4.589 | 6.126 | 7.963 | 0.919 |
| Run 13: frozen DiT + AdaLN (best) | 4.622 | 3.767 | 4.5 | 4.625 | 6.093 | 7.933 | 0.923 |
| Run 15: LoRA r256 5ep (s2440) | 4.643 | 3.646 | 4.526 | 4.605 | 6.088 | 7.932 | 0.946 |
| Run 15: LoRA r256 5ep (s2460) | 4.638 | 3.643 | 4.501 | 4.606 | 6.078 | 7.924 | 0.948 |
| Run 15: LoRA r256 5ep (s2465) | 4.651 | 3.696 | 4.524 | 4.617 | 6.099 | 7.943 | 0.947 |
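As a quick cross-check of the speaker-similarity figures quoted in the verdict, the sketch below averages the spk_sim column over the three Run 15 checkpoints (s2440, s2460, s2465) for the RAW and VC->SIDON conditions and compares them with the Run 10 baseline. That the verdict figures are checkpoint means is an assumption; the values are copied from the tables, and the variable names are illustrative only.

```python
from statistics import mean

# spk_sim values copied from the tables above (Run 15: s2440, s2460, s2465).
run15_spk_sim = {
    "raw": [0.919, 0.916, 0.916],
    "vc->sidon": [0.946, 0.948, 0.947],
}

# Run 10 (LoRA r128 baseline, v0.5) spk_sim for the same two conditions.
run10_spk_sim = {"raw": 0.876, "vc->sidon": 0.92}

for condition, values in run15_spk_sim.items():
    avg = mean(values)                       # mean over the three checkpoints
    gain = avg - run10_spk_sim[condition]    # difference to the r128 baseline
    print(f"{condition}: mean spk_sim {avg:.3f} ({gain:+.3f} vs. r128 baseline)")
```

Under that assumption the means come out at about 0.917 (raw) and 0.947 (VC->SIDON), consistent with the rounded figures in the verdict.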
Training
- Base: ltx-2.3-22b-dev-audio-only-v13-merged (DramaBox audio-only DiT)
- LoRA rank 256, alpha 256 (scaling 1.0), dropout 0.0; pure flow-matching loss; 5 epochs / 2465 steps; lr 1e-4 cosine; bf16; 8xA100
- Checkpoint: lora_r256_step2465.safetensors (453M params)
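The arithmetic behind these settings can be sketched as follows. This is a minimal illustration using the usual LoRA convention (scale = alpha / rank; an adapted linear layer gains an A matrix of shape rank x d_in and a B matrix of shape d_out x rank), not the training code for this run. The function names and the layer width of 2048 are hypothetical; the actual DiT layer shapes are not stated here.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scale factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_params_per_linear(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer:
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Values stated in the training notes: rank 256, alpha 256.
print(lora_scaling(alpha=256, rank=256))  # 1.0, the "scaling 1.0" above

# 5 epochs over 2465 optimizer steps.
print(2465 // 5)  # 493 steps per epoch

# HYPOTHETICAL layer width, for illustration only.
d = 2048
r256 = lora_params_per_linear(d, d, rank=256)
r128 = lora_params_per_linear(d, d, rank=128)
print(r256, r128, r256 / r128)  # doubling the rank doubles the adapter size
```

For a fixed set of adapted layers, the adapter size grows linearly with rank, so the r256 adapter is twice the size of an r128 adapter over the same layers.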
Inference
LoRA adapter for the DramaBox DiT. Pass only the LoRA checkpoint (no --adaln-checkpoint):
python inference_adaln.py \
--checkpoint ltx-2.3-22b-dev-audio-only-v13-merged.safetensors \
--full-checkpoint ltx-2.3-22b-dev.safetensors \
--lora-checkpoint lora_r256_step2465.safetensors \
--prompt "A warm, slightly husky 35-year-old woman, high-quality studio recording. 'I never thought I would say this out loud.'" \
--voice-sample reference_speaker.wav --output out.wav --seed 42
Standalone (no reference voice): drop --voice-sample, add --no-ref, and use a full prompt. Post-process with Sidon; for maximum speaker similarity, use Chatterbox-VC -> Sidon.
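For scripted use, the two invocation modes can be assembled in Python. This is a sketch, assuming inference_adaln.py and the checkpoints sit in the working directory; it uses only the flags shown above, and build_command is a hypothetical helper, not part of this repo.

```python
import shlex
from typing import List, Optional


def build_command(prompt: str,
                  output: str = "out.wav",
                  voice_sample: Optional[str] = None,
                  seed: int = 42) -> List[str]:
    """Assemble the argv for inference_adaln.py (hypothetical helper).

    With voice_sample set, the reference-voice mode is used; without it,
    --no-ref is passed instead, as described for standalone use.
    """
    cmd = [
        "python", "inference_adaln.py",
        "--checkpoint", "ltx-2.3-22b-dev-audio-only-v13-merged.safetensors",
        "--full-checkpoint", "ltx-2.3-22b-dev.safetensors",
        "--lora-checkpoint", "lora_r256_step2465.safetensors",
        "--prompt", prompt,
    ]
    if voice_sample is not None:
        cmd += ["--voice-sample", voice_sample]
    else:
        cmd.append("--no-ref")
    cmd += ["--output", output, "--seed", str(seed)]
    return cmd


prompt = ("A warm, slightly husky 35-year-old woman, high-quality studio "
          "recording. 'I never thought I would say this out loud.'")

cloned = build_command(prompt, voice_sample="reference_speaker.wav")
standalone = build_command(prompt, output="standalone.wav")

# Print shell-quoted commands for inspection before running them.
print(shlex.join(cloned))
print(shlex.join(standalone))
```

To execute a command, pass the list to subprocess.run(cmd, check=True). Building the argv as a list avoids shell-quoting problems with the nested quotes in the prompt.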