LAIONBox v0.4-wip
Speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and 5 differentiable auxiliary losses.
What's New in v0.4 (vs v0.3)
LAIONBox v0.4 uses a two-stage training approach that produces higher quality speech than v0.3:
- Stage 1 (Run 7): LoRA rank-64 fine-tuning of the DiT backbone (113M trainable params, 3.2% of DiT), which teaches the model better speech patterns
- Stage 2 (Run 8): Merge LoRA weights into the DiT, freeze the merged model, and train a fresh AdaLN-Zero speaker conditioning network on top, which learns speaker identity without disturbing the improved backbone
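This "LoRA-merge then AdaLN" recipe is reported to outperform standalone LoRA (Run 7). Against standalone AdaLN-Zero (v0.3 / Run 5), it improves UTMOS on raw output and nearly every metric after post-processing, while raw MOS, NISQA, and SpkSim are slightly lower.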
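The merge step in Stage 2 can be pictured with the standard LoRA update, W' = W + (alpha / rank) * B * A. The following is a minimal pure-Python sketch under that assumption; `merge_lora`, `matmul`, and the toy matrices are illustrative names and values, not taken from the training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy example: a 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[0.5], [0.0]]   # shape (out_features, rank)
lora_a = [[0.0, 2.0]]     # shape (rank, in_features)
merged = merge_lora(weight, lora_b, lora_a, alpha=1, rank=1)
```

With equal rank and alpha the scale factor is 1, so the merged weight is simply the base weight plus the adapter product. After merging, the adapter matrices are no longer needed at inference time.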
Key Improvements over v0.3
| Metric | v0.3 (Run 5) | v0.4 (Run 8) | Delta |
|---|---|---|---|
| UTMOS | 3.630 | 3.655 | +0.025 |
| MOS | 4.575 | 4.501 | -0.074 |
| SpkSim | 0.887 | 0.884 | -0.003 |
| NISQA | 4.334 | 4.198 | -0.136 |
| UTMOS (Sidon) | 3.763 | 3.855 | +0.092 |
| SpkSim (VC→Sidon) | 0.927 | 0.934 | +0.007 |
The real gains show after post-processing: +0.092 UTMOS with Sidon and +0.007 SpkSim with VC→Sidon. v0.4 produces audio that responds better to speech restoration and voice conversion.
Architecture
Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
↓
SpeakerAdaLNZero (455M params)
bottleneck_dim=512
↓
Per-block scale/shift deltas (48 blocks × 9 params)
↓
Added to timestep embeddings in each DiT transformer block
↓
LoRA-merged DiT backbone (3.3B params, frozen during Stage 2)
The DiT backbone in v0.4 contains merged LoRA rank-64 weights from Stage 1, giving it better speech generation capabilities than the vanilla DramaBox backbone used in v0.3. The AdaLN-Zero network is trained on top of this improved backbone.
AdaLN-Zero Design
The conditioning network uses zero-initialization on all output projections, meaning it starts with zero conditioning effect and gradually learns to modulate the DiT's behavior. This ensures:
- Graceful degradation to vanilla generation when no speaker reference is provided
- Stable training: the model never "forgets" how to generate speech
- The frozen DiT backbone is never modified during Stage 2
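The zero-initialization property can be illustrated with a toy example. This is a hedged sketch, not the real `SpeakerAdaLNZero` module: the class `ZeroInitSpeakerDeltas`, the function `modulate`, and the `h * (1 + scale) + shift` form are assumptions chosen to show that zero output projections leave the backbone's own modulation untouched at the start of training.

```python
class ZeroInitSpeakerDeltas:
    """Toy conditioning head whose output projections start at exactly zero."""

    def __init__(self, cond_dim, hidden_dim):
        self.w_scale = [[0.0] * cond_dim for _ in range(hidden_dim)]
        self.w_shift = [[0.0] * cond_dim for _ in range(hidden_dim)]

    def __call__(self, cond):
        def project(w):
            return [sum(wi * ci for wi, ci in zip(row, cond)) for row in w]
        return project(self.w_scale), project(self.w_shift)


def modulate(hidden, base_scale, base_shift, d_scale, d_shift):
    """Apply the base (timestep) modulation plus the speaker deltas."""
    return [h * (1.0 + s + ds) + b + db
            for h, s, b, ds, db in zip(hidden, base_scale, base_shift, d_scale, d_shift)]


head = ZeroInitSpeakerDeltas(cond_dim=4, hidden_dim=3)
hidden = [0.2, -1.0, 3.5]
base_scale, base_shift = [0.1, 0.0, -0.2], [0.0, 0.5, 1.0]

d_scale, d_shift = head([0.9, 0.1, -0.4, 2.0])  # any speaker embedding
conditioned = modulate(hidden, base_scale, base_shift, d_scale, d_shift)
unconditioned = modulate(hidden, base_scale, base_shift, [0.0] * 3, [0.0] * 3)
```

Because every delta is zero at initialization, `conditioned` equals `unconditioned` regardless of the speaker embedding; the deltas only become nonzero as the output projections are trained.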
Training Details
Stage 1: LoRA Fine-Tuning (Run 7)
- Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
- LoRA config: Rank 64, alpha 64, targeting audio_attn1 + audio_ff modules
- Trainable params: 113M (3.2% of DiT)
- Learning rate: 4e-5 with cosine schedule
- Epochs: 2 (340 steps, effective batch size 128)
- Precision: bf16 mixed precision
- Note: LoRA training in bf16 is effective for ~1 epoch before the bf16 ULP floor freezes updates (see Learnings section)
Stage 2: AdaLN-Zero on Frozen Merged DiT (Run 8)
- Base model: Stage 1 LoRA merged into vanilla DiT → merged_dit.safetensors
- DiT: Completely frozen (lr=0), no gradients, no weight saves
- AdaLN-Zero: 455M parameters, freshly initialized
- Learning rate: 7e-5 with cosine schedule, 25% warmup
- Epochs: 6 (1,020 steps, effective batch size 128)
- Best checkpoint: Step 850 (epoch 5.0 at 170 steps per epoch, selected by validation)
- Precision: bf16 DiT + fp32 AdaLN
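The Stage 2 schedule can be sketched as follows. This is a minimal illustration assuming linear warmup and cosine decay to zero; the source only states "cosine schedule, 25% warmup", so the warmup shape, the decay floor, and the function name `lr_at` are assumptions.

```python
import math

# 21,734 samples at an effective batch size of 128
steps_per_epoch = math.ceil(21_734 / 128)
total_steps = steps_per_epoch * 6


def lr_at(step, total_steps=1020, peak_lr=7e-5, warmup_frac=0.25):
    """Linear warmup to peak_lr, then cosine decay towards zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate peaks at step 255 and has decayed to a small fraction of the peak by the later epochs.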
Training Data
21,734 samples from 3 sources:
- DramaBox best-of-25 (9,404 samples, weight 0.4): high-quality TTS generations
- Podcast data (10,014 samples, weight 0.4): natural conversational speech
- Emolia emotional speech (2,316 samples, weight 0.2): expressive/emotional range
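One plausible reading of the weights above is that each training example's source is drawn with the listed probability, independent of the source's size. The sketch below illustrates that interpretation only; the source keys and `sample_sources` are hypothetical, and the actual sampler may work differently.

```python
import random

# Per-source sampling weights as listed above (keys are illustrative names).
SOURCES = {"dramabox_best_of_25": 0.4, "podcast": 0.4, "emolia": 0.2}


def sample_sources(n, seed=0):
    """Draw n source names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(10_000)
share = {name: draws.count(name) / len(draws) for name in SOURCES}
```

Under this reading, the Emolia set (about 11% of the samples) would be seen roughly twice as often per sample as the other two sources.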
Auxiliary Losses
5 differentiable losses computed through the DramaBox decoder (ReFL-style):
| Loss | Weight | Purpose |
|---|---|---|
| WavLM-SV speaker similarity | adaptive (≤20) | Match speaker identity |
| Orange-tbr speaker similarity | adaptive (≤20) | Secondary speaker matching |
| DNSMOS OVR quality | adaptive (≤20) | Speech quality |
| VoiceCLAP-HOW naturalness | adaptive (≤20) | Speaking style/naturalness |
| VoiceCLAP-WHAT content accuracy | 0 (disabled in Run 8) | Content preservation |
Auxiliary loss coefficients are adaptive with exponential moving average targeting a 6:1 ratio between flow loss and total auxiliary contribution.
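A hedged sketch of how such adaptive coefficients might be computed is shown below. The class `AdaptiveAuxWeights`, the EMA decay of 0.99, and the equal split of the auxiliary budget across losses are all assumptions made for illustration; only the 6:1 target and the cap of 20 come from the description above.

```python
class AdaptiveAuxWeights:
    """Scale aux losses so their weighted sum is about flow_loss / target_ratio."""

    def __init__(self, names, target_ratio=6.0, max_coeff=20.0, decay=0.99):
        self.names = list(names)
        self.target_ratio = target_ratio
        self.max_coeff = max_coeff
        self.decay = decay
        self.ema = {}  # running averages of the flow loss and each raw aux loss

    def _update(self, key, value):
        prev = self.ema.get(key, value)
        self.ema[key] = self.decay * prev + (1.0 - self.decay) * value
        return self.ema[key]

    def coefficients(self, flow_loss, aux_losses):
        flow = self._update("flow", flow_loss)
        budget = flow / self.target_ratio / len(self.names)  # equal share per loss
        coeffs = {}
        for name in self.names:
            avg = self._update(name, aux_losses[name])
            coeffs[name] = min(self.max_coeff, budget / max(avg, 1e-8))
        return coeffs


names = ["wavlm_sv", "orange_tbr", "dnsmos", "clap_how"]
aux = {name: 0.05 for name in names}
coeffs = AdaptiveAuxWeights(names).coefficients(flow_loss=0.6, aux_losses=aux)
total_aux = sum(coeffs[name] * aux[name] for name in names)
```

In this toy case each coefficient is 0.5 and the weighted auxiliary total is one sixth of the flow loss. When an auxiliary loss becomes very small, the cap keeps its coefficient from growing without bound.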
Evaluation Results
Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model, with 3 post-processing variants.
Raw Output (DramaBox decoder only)
| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| Vanilla DramaBox | 4.412 | 3.136 | 4.107 | 0.818 | n/a | n/a |
| LAIONBox v0.3 (Run 5 AdaLN) | 4.575 | 3.630 | 4.334 | 0.887 | 6.058 | 7.675 |
| LAIONBox v0.4 (Run 8) | 4.501 | 3.655 | 4.198 | 0.884 | 6.013 | 7.686 |
Sidon Enhanced (speech restoration)
| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| Vanilla DramaBox | 4.663 | 3.280 | 4.516 | 0.813 | n/a | n/a |
| LAIONBox v0.3 | 4.722 | 3.763 | 4.635 | 0.874 | 6.200 | 7.864 |
| LAIONBox v0.4 | 4.751 | 3.855 | 4.670 | 0.875 | 6.183 | 7.878 |
ChatterboxVC → Sidon (voice conversion + restoration)
| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| LAIONBox v0.3 | 4.677 | 3.756 | 4.508 | 0.927 | 6.029 | 7.783 |
| LAIONBox v0.4 | 4.690 | 3.835 | 4.514 | 0.934 | 6.109 | 7.874 |
Improvements over Vanilla DramaBox (raw → raw)
- UTMOS: +16.6% (3.136 → 3.655)
- Speaker Similarity: +8.1% (0.818 → 0.884)
- MOS: +2.0% (4.412 → 4.501)
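The percentages above are relative gains over the vanilla baseline. The short sketch below recomputes them from the rounded table entries; the helper name `relative_gain_pct` is illustrative, and because the inputs are already rounded to three decimals, the results can differ from the listed figures in the last digit.

```python
def relative_gain_pct(baseline, value):
    """Relative improvement of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# (vanilla DramaBox, LAIONBox v0.4) on raw output, from the table above
RAW = {"UTMOS": (3.136, 3.655), "SpkSim": (0.818, 0.884), "MOS": (4.412, 4.501)}
gains = {name: relative_gain_pct(base, new) for name, (base, new) in RAW.items()}
```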
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 6.2 GB | LoRA-merged DiT backbone (3.3B params, bf16) |
| speaker_adaln.pt | 1.7 GB | AdaLN-Zero speaker conditioning network (455M params, fp32), step 850 |
| scripts/inference_adaln.py | 26 KB | Full inference script with speaker conditioning, LoRA support, batch generation |
| scripts/speaker_adaln.py | 7 KB | AdaLN-Zero module definition (SpeakerAdaLNZero class) |
| training_args.json | 1 KB | Training hyperparameters for Run 8 |
| eval/eval_full_report.html | 0.3 MB | Interactive evaluation report with metric tables and audio players |
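As a rough sanity check, the checkpoint sizes follow from parameter count times bytes per parameter. The sketch below assumes the listed "GB" figures are binary gibibytes and ignores file headers and metadata; `size_gib` is an illustrative helper.

```python
def size_gib(num_params, bytes_per_param):
    """Approximate tensor storage in GiB, ignoring file metadata."""
    return num_params * bytes_per_param / 2**30


dit_gib = size_gib(3.3e9, 2)    # bf16: 2 bytes per parameter
adaln_gib = size_gib(455e6, 4)  # fp32: 4 bytes per parameter
```

Both estimates land close to the sizes listed in the table, which suggests the files contain little beyond the weights themselves.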
Inference
Prerequisites
# Clone DramaBox
git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt
# Download model files
pip install huggingface_hub
huggingface-cli download laion/laionbox-v0.4-wip --local-dir laionbox-v0.4
Basic Usage
python laionbox-v0.4/scripts/inference_adaln.py \
--checkpoint laionbox-v0.4/model.safetensors \
--adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
--full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
--dramabox-dir DramaBox \
--ref-audio /path/to/reference_speaker.wav \
--prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
--output output.wav \
--device cuda:0
Without Speaker Conditioning (vanilla mode)
Omit --adaln-checkpoint to run as standard DramaBox with the improved LoRA-merged backbone:
python laionbox-v0.4/scripts/inference_adaln.py \
--checkpoint laionbox-v0.4/model.safetensors \
--full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
--dramabox-dir DramaBox \
--prompt "A calm narrator reading a story." \
--output output.wav
Batch Generation
Generate multiple samples with different seeds for best-of-N selection:
python laionbox-v0.4/scripts/inference_adaln.py \
--checkpoint laionbox-v0.4/model.safetensors \
--adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
--full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
--dramabox-dir DramaBox \
--ref-audio speaker_ref.wav \
--prompt "Excited announcer. 'And the winner is...!'" \
--output-dir ./outputs \
--seeds 42,123,456 \
--device cuda:0
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --checkpoint | required | Path to model.safetensors (LoRA-merged DiT) |
| --adaln-checkpoint | optional | Path to speaker_adaln.pt (omit for unconditioned) |
| --full-checkpoint | required | Path to original DramaBox ltx-2.3-22b-dev.safetensors (for decoder/conditioner) |
| --ref-audio | optional | Reference audio for speaker conditioning (WAV/MP3, 3-15s recommended) |
| --prompt | required | Text prompt describing speech style + content |
| --cfg-scale | 4.5 | Classifier-free guidance scale |
| --stg-scale | 0.0 | Spatiotemporal guidance scale |
| --steps | 50 | Number of diffusion steps |
| --duration | auto | Target duration in seconds (auto-estimated from text) |
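When driving the script from Python, the command line can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository; it only uses flags and paths that appear in the commands and table above, and `build_command` and its keyword names are assumptions.

```python
import shlex


def build_command(prompt, output, ref_audio=None, adaln_checkpoint=None,
                  cfg_scale=4.5, steps=50,
                  model_dir="laionbox-v0.4", dramabox_dir="DramaBox"):
    """Assemble the inference_adaln.py argument list."""
    cmd = [
        "python", f"{model_dir}/scripts/inference_adaln.py",
        "--checkpoint", f"{model_dir}/model.safetensors",
        "--full-checkpoint", f"{dramabox_dir}/models/ltx-2.3-22b-dev.safetensors",
        "--dramabox-dir", dramabox_dir,
        "--prompt", prompt,
        "--output", output,
        "--cfg-scale", str(cfg_scale),
        "--steps", str(steps),
    ]
    # Speaker conditioning needs both the AdaLN checkpoint and a reference clip.
    if adaln_checkpoint and ref_audio:
        cmd += ["--adaln-checkpoint", adaln_checkpoint, "--ref-audio", ref_audio]
    return cmd


cmd = build_command("A calm narrator reading a story.", "output.wav")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it as a list avoids shell-quoting problems with prompts that contain quotes.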
Post-Processing Pipeline
For best quality, apply the Generate → ChatterboxVC → Sidon pipeline:
Step 1: Generate with LAIONBox
# See inference commands above
Step 2: ChatterboxVC Voice Conversion (optional, +0.05 SpkSim)
ChatterboxVC converts the output to match the target speaker more closely:
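import torchaudio
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
# Keyword name assumed from the upstream Chatterbox VC example: target_voice_path
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# The returned tensor is assumed to be (channels, samples) already; vc.sr is the model sample rate
torchaudio.save("output_vc.wav", wav, vc.sr)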
Step 3: Sidon Speech Restoration (+0.25 MOS, +0.47 NISQA for v0.4)
Sidon removes DAC vocoder artifacts and improves perceptual quality:
from sidon import Sidon
model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output_vc.wav") # or "output.wav" if skipping VC
enhanced.save("output_final.wav") # 48kHz mono
Expected Quality with Full Pipeline
| Metric | Raw | + Sidon | + VC → Sidon |
|---|---|---|---|
| MOS | 4.50 | 4.75 | 4.69 |
| UTMOS | 3.66 | 3.86 | 3.84 |
| NISQA | 4.20 | 4.67 | 4.51 |
| SpkSim | 0.88 | 0.88 | 0.93 |
Learnings from the Training Campaign (Runs 5–9)
What Worked
| Technique | Impact |
|---|---|
| LoRA-merge + fresh AdaLN | Best overall recipe. LoRA improves the backbone; AdaLN adds speaker identity cleanly on top. |
| FP32 master weights | Solves bf16 ULP floor for LoRA training. CPU-offloaded fp32 copies (~2.5 GB RAM) keep Adam updates alive. Zero throughput impact. |
| Separate optimizers | Different LRs for DiT vs AdaLN (4e-5 vs 1e-5) and independent freeze/unfreeze. |
| Weight debug logging | Hash + Δ% tracking exposed training freezes that loss curves alone cannot detect. |
| Sidon post-processing | Raises MOS by 0.15 to 0.25 and NISQA by 0.30 to 0.47 across the evaluated models. Essential for production. |
| Adaptive aux loss coefficients | EMA-based coefficient scaling prevents any single auxiliary loss from dominating. |
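The weight debug logging idea can be sketched in a few lines. This is an assumption-laden illustration: the source does not define "Δ%", so the sketch takes it to mean the percentage of entries that changed between two snapshots, and `weight_fingerprint` and `changed_pct` are hypothetical helper names.

```python
import hashlib
import struct


def weight_fingerprint(weights):
    """Short SHA-256 prefix over the float32 bytes of a flat weight list."""
    packed = struct.pack(f"<{len(weights)}f", *weights)
    return hashlib.sha256(packed).hexdigest()[:12]


def changed_pct(before, after):
    """Percentage of entries whose value differs between two snapshots."""
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return 100.0 * changed / len(before)


before = [0.1, 0.2, 0.3, 0.4]
after = [0.1, 0.2, 0.3, 0.5]
frozen = list(before)
```

An unchanged fingerprint (or 0% changed) across optimizer steps signals that updates are being rounded away even if the loss curve still fluctuates from batch to batch.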
What Didn't Work
| Approach | Problem |
|---|---|
| Standard full fine-tune | 3.5B params at lr=2e-6 doesn't converge in 1 epoch. Too slow, too expensive. |
| LoRA in bf16 beyond epoch 1 | bf16 has only 8 bits of significand precision (7 stored plus 1 implicit); for weights ~0.01, ULP ≈ 8.58e-5, but Adam updates are ~1.72e-5. Updates round to zero. |
| Stacking LoRA on merged model | Diminishing returns: adding more LoRA on top of LoRA+AdaLN doesn't improve quality. |
| Training AdaLN beyond 5 epochs | Peak at epoch 4–5; epochs 5–6 show mild overfitting. |
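The bf16 rounding problem and the fp32 master-weight fix can be reproduced with the standard library alone. The sketch below emulates bfloat16 by rounding a float32 to its top 16 bits (round-to-nearest-even); the helper names are illustrative, and the emulation is only intended for small positive finite values. Note that the exact bf16 spacing just above 0.01 is 2^-14 (about 6.1e-5), the same order of magnitude as the figure quoted in the table.

```python
import struct


def to_f32(x):
    """Round a Python float to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a small positive finite float to bfloat16 (nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_ulp(x):
    """Gap between to_bf16(x) and the next larger bf16 value."""
    base = to_bf16(x)
    bits = struct.unpack("<I", struct.pack("<f", base))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 0x10000))[0] - base


update = 1.72e-5             # typical Adam step size quoted above
start = to_bf16(0.01)
w_bf16, master = start, start
for _ in range(100):
    w_bf16 = to_bf16(w_bf16 + update)  # bf16-only: every update rounds away
    master = to_f32(master + update)   # fp32 master copy keeps accumulating
```

After 100 steps the bf16-only weight is still at its starting value, while casting the fp32 master copy back to bf16 yields a visibly larger weight. This is the mechanism behind keeping fp32 master weights for LoRA training.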
Limitations
- Speaker conditioning requires reference audio (3–15 seconds recommended); without it the model runs as vanilla DramaBox with improved backbone
- Best results require the VC → Sidon post-processing pipeline, which adds latency (~5s for VC + ~3s for Sidon on GPU)
- Trained primarily on English and German speech
- The LoRA-merged backbone has slightly different characteristics than vanilla DramaBox; prompts may need minor adjustment
Citation
@misc{laionbox2026v04,
title={LAIONBox v0.4: Two-Stage Speaker-Conditioned Audio Generation with LoRA-Merged DiT and AdaLN-Zero},
author={LAION},
year={2026},
url={https://huggingface.co/laion/laionbox-v0.4-wip}
}
Model Details
- Total model size: 3.8B parameters (3.3B DiT + 455M AdaLN)
- DiT tensor type: BF16 (safetensors)
- AdaLN tensor type: FP32 (PyTorch)
- Training hardware: 8× GPU (DDP, bf16 mixed precision)
- Training time: ~6 hours Stage 1 (Run 7) + ~17 hours Stage 2 (Run 8)
- Framework: PyTorch + Accelerate
- License: Apache 2.0
- Organization: LAION e.V.