LAIONBox v0.6-wip — Run 15 (LoRA-only, rank 256)

LoRA fine-tune of the DramaBox audio-only DiT (LTX-2.3-22B-Dev, ~3.3B, ResembleAI) for expressive voice-acting TTS. Pure LoRA (rank 256 / alpha 256, 5 epochs) — no AdaLN, no auxiliary losses, just flow matching. Scaled up from the v0.5 baseline (r128 / 3 epochs).

Evaluation (means over 72 samples; sidon_normalized enhanced refs: 5 speakers x 2 seeds x 6 prompts EN+DE)

nisqa_mos, scoreq, CE and PQ are held-out scorers. 🔊 Audio: see v06_eval_results.html (base64 players) in this repo.

Verdict: Run 5 (full fine-tune + AdaLN, v0.3-wip) has the best raw quality; Run 15 (this model, LoRA r256) has the best speaker similarity (about 0.92 raw and 0.947 after VC->SIDON, versus 0.876 and 0.92 for the r128 baseline) while roughly matching the r128 baseline (Run 10) on the quality metrics.

RAW

model	mos	utmos	nisqa_mos	scoreq	CE	PQ	spk_sim
Run 5 — full-FT DiT + AdaLN (v0.3)	4.601	3.647	4.369	4.636	6.188	7.896	0.879
Run 10 — LoRA r128 baseline (v0.5)	4.481	3.448	4.201	4.5	5.982	7.793	0.876
Run 12 — frozen DiT + AdaLN (best)	4.378	3.395	4.057	4.428	6.023	7.858	0.894
Run 13 — frozen DiT + AdaLN (best)	4.434	3.498	4.138	4.471	5.997	7.789	0.88
Run 15 — LoRA r256 5ep (s2440)	4.474	3.385	4.184	4.487	6.0	7.851	0.919
Run 15 — LoRA r256 5ep (s2460)	4.487	3.418	4.205	4.504	6.003	7.841	0.916
Run 15 — LoRA r256 5ep (s2465)	4.491	3.446	4.199	4.512	6.0	7.841	0.916

SIDON

model	mos	utmos	nisqa_mos	scoreq	CE	PQ	spk_sim
Run 5 — full-FT DiT + AdaLN (v0.3)	4.747	3.793	4.667	4.74	6.273	7.921	0.87
Run 10 — LoRA r128 baseline (v0.5)	4.656	3.588	4.545	4.647	6.116	7.912	0.873
Run 12 — frozen DiT + AdaLN (best)	4.616	3.537	4.488	4.606	6.168	7.962	0.89
Run 13 — frozen DiT + AdaLN (best)	4.654	3.659	4.528	4.645	6.12	7.911	0.877
Run 15 — LoRA r256 5ep (s2440)	4.648	3.553	4.543	4.622	6.133	7.947	0.918
Run 15 — LoRA r256 5ep (s2460)	4.661	3.603	4.56	4.636	6.132	7.941	0.915
Run 15 — LoRA r256 5ep (s2465)	4.649	3.596	4.55	4.634	6.13	7.94	0.914

VC->SIDON

model	mos	utmos	nisqa_mos	scoreq	CE	PQ	spk_sim
Run 5 — full-FT DiT + AdaLN (v0.3)	4.713	3.823	4.575	4.683	6.223	7.908	0.92
Run 10 — LoRA r128 baseline (v0.5)	4.647	3.729	4.533	4.639	6.082	7.905	0.92
Run 12 — frozen DiT + AdaLN (best)	4.579	3.65	4.477	4.589	6.126	7.963	0.919
Run 13 — frozen DiT + AdaLN (best)	4.622	3.767	4.5	4.625	6.093	7.933	0.923
Run 15 — LoRA r256 5ep (s2440)	4.643	3.646	4.526	4.605	6.088	7.932	0.946
Run 15 — LoRA r256 5ep (s2460)	4.638	3.643	4.501	4.606	6.078	7.924	0.948
Run 15 — LoRA r256 5ep (s2465)	4.651	3.696	4.524	4.617	6.099	7.943	0.947
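The speaker-similarity figures quoted in the verdict can be recovered from the tables by averaging the three Run 15 checkpoints per condition and comparing against the Run 10 (r128) row. The snippet below is only a sketch of that arithmetic: the numbers are copied from the spk_sim columns above, while the dictionary and function names are illustrative and not part of the repo's evaluation code.

```python
from statistics import mean

# spk_sim values copied from the Run 15 rows (checkpoints s2440, s2460, s2465).
RUN15_SPK_SIM = {
    "raw": [0.919, 0.916, 0.916],
    "sidon": [0.918, 0.915, 0.914],
    "vc->sidon": [0.946, 0.948, 0.947],
}

# spk_sim of the LoRA r128 baseline (Run 10), one value per condition.
RUN10_SPK_SIM = {"raw": 0.876, "sidon": 0.873, "vc->sidon": 0.92}


def summarize(run15, baseline):
    """Return {condition: (mean over checkpoints, gain over the baseline)}."""
    summary = {}
    for condition, values in run15.items():
        avg = mean(values)
        summary[condition] = (round(avg, 3), round(avg - baseline[condition], 3))
    return summary


summary = summarize(RUN15_SPK_SIM, RUN10_SPK_SIM)
for condition, (avg, gain) in summary.items():
    print(f"{condition:10s} mean={avg:.3f} gain_vs_r128={gain:+.3f}")
```

Averaging over checkpoints is a choice made here for illustration; the card itself reports each checkpoint separately.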

Training

Base: ltx-2.3-22b-dev-audio-only-v13-merged (DramaBox audio-only DiT)
LoRA rank 256, alpha 256 (scaling 1.0), dropout 0.0; pure flow-matching loss; 5 epochs / 2465 steps; lr 1e-4 cosine; bf16; 8xA100
Checkpoint: lora_r256_step2465.safetensors (453M params)
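For readers unfamiliar with the "alpha 256 (scaling 1.0)" notation, here is a minimal sketch of the standard LoRA update, h = W x + (alpha / rank) * B (A x), written in pure Python on toy matrices. It assumes the common alpha / rank convention, which is consistent with the scaling of 1.0 stated above; the helper names and the toy values are hypothetical, and the actual training code may be organized differently. The steps-per-epoch figure assumes equally sized epochs.

```python
def lora_scaling(alpha, rank):
    """Scaling applied to the low-rank update under the alpha / rank convention."""
    return alpha / rank


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha, rank):
    """Frozen base projection plus the scaled low-rank correction B (A x)."""
    base = matvec(w, x)               # frozen weight path
    update = matvec(b, matvec(a, x))  # down-project with A, up-project with B
    scale = lora_scaling(alpha, rank)
    return [h + scale * u for h, u in zip(base, update)]


# Values from the Training section: rank 256, alpha 256, 2465 steps over 5 epochs.
print("scaling:", lora_scaling(256, 256))
print("steps per epoch (assuming equal epochs):", 2465 // 5)

# Toy rank-1 example (made-up numbers, not model weights).
W = [[1, 0], [0, 1]]
A = [[1, 1]]    # shape (rank, d_in) = (1, 2)
B = [[2], [3]]  # shape (d_out, rank) = (2, 1)
print(lora_forward(W, A, B, [1, 2], alpha=1, rank=1))
```

With alpha equal to the rank, doubling the rank from 128 to 256 leaves the update's scale factor unchanged and only adds capacity to A and B.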

Inference

LoRA adapter for the DramaBox DiT — pass only the LoRA (no --adaln-checkpoint):

python inference_adaln.py \
  --checkpoint ltx-2.3-22b-dev-audio-only-v13-merged.safetensors \
  --full-checkpoint ltx-2.3-22b-dev.safetensors \
  --lora-checkpoint lora_r256_step2465.safetensors \
  --prompt "A warm, slightly husky 35-year-old woman, high-quality studio recording. 'I never thought I would say this out loud.'" \
  --voice-sample reference_speaker.wav --output out.wav --seed 42

Standalone (no reference voice): drop --voice-sample, add --no-ref, and use a full prompt. Post-process the output with Sidon; for maximum speaker similarity, run Chatterbox-VC first and then Sidon.
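The reference-conditioned and standalone invocations differ only in the --voice-sample / --no-ref flags, so a small wrapper can assemble either command line. The sketch below is hypothetical: build_command is not part of the repo, and it only reuses the flags and file names shown in this card without checking how inference_adaln.py parses them.

```python
import shlex


def build_command(prompt, output="out.wav", seed=42, voice_sample=None):
    """Assemble the argv list for inference_adaln.py (hypothetical helper).

    With voice_sample set, the reference-conditioned form is produced;
    otherwise --voice-sample is dropped and --no-ref is added.
    """
    cmd = [
        "python", "inference_adaln.py",
        "--checkpoint", "ltx-2.3-22b-dev-audio-only-v13-merged.safetensors",
        "--full-checkpoint", "ltx-2.3-22b-dev.safetensors",
        "--lora-checkpoint", "lora_r256_step2465.safetensors",
        "--prompt", prompt,
    ]
    if voice_sample is not None:
        cmd += ["--voice-sample", voice_sample]
    else:
        cmd += ["--no-ref"]
    cmd += ["--output", output, "--seed", str(seed)]
    return cmd


standalone = build_command("A calm narrator. 'Good evening.'")
cloned = build_command("A calm narrator. 'Good evening.'",
                       voice_sample="reference_speaker.wav")
# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(standalone))
print(shlex.join(cloned))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the script and checkpoints.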
