LAIONBox v0.4-wip

Speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and 5 differentiable auxiliary losses.

What's New in v0.4 (vs v0.3)

LAIONBox v0.4 uses a two-stage training approach that produces higher quality speech than v0.3:

Stage 1 (Run 7): LoRA rank-64 fine-tuning of the DiT backbone (113M trainable params, 3.2% of DiT) — teaches the model better speech patterns
Stage 2 (Run 8): Merge LoRA weights into the DiT, freeze the merged model, and train a fresh AdaLN-Zero speaker conditioning network on top — learns speaker identity without disturbing the improved backbone

This "LoRA-merge then AdaLN" recipe outperforms standalone LoRA (Run 7) and, on UTMOS and most post-processed metrics, standalone AdaLN-Zero (v0.3 / Run 5); raw MOS, NISQA, and SpkSim are slightly lower than v0.3.
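
The merge step in Stage 2 can be illustrated with a framework-free sketch. It assumes the common LoRA convention W' = W + (alpha / rank) * B @ A; the function names and the toy matrices are hypothetical and are not taken from the LAIONBox training code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new nested list."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, rank) @ (rank, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (the run described here uses rank 64, alpha 64).
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]   # shape (rank, in)
lora_b = [[2.0], [4.0]]  # shape (out, rank)
merged = merge_lora(weight, lora_a, lora_b, alpha=1.0, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

With rank 64 and alpha 64 the scale factor is 1.0, so under this convention the merged weight is simply W + B @ A. After merging, the adapter matrices are no longer needed at inference time.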

Key Improvements over v0.3

Metric	v0.3 (Run 5)	v0.4 (Run 8)	Delta
UTMOS	3.630	3.655	+0.025
MOS	4.575	4.501	-0.074
SpkSim	0.887	0.884	-0.003
NISQA	4.334	4.198	-0.136
UTMOS (Sidon)	3.763	3.855	+0.092
SpkSim (VC→Sidon)	0.927	0.934	+0.007

The real gains show after post-processing: +0.092 UTMOS with Sidon and +0.007 SpkSim with VC→Sidon — v0.4 produces audio that responds better to speech restoration and voice conversion.

Architecture

Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
                                         ↓
                          SpeakerAdaLNZero (455M params)
                          bottleneck_dim=512
                                         ↓
                    Per-block scale/shift deltas (48 blocks × 9 params)
                                         ↓
                Added to timestep embeddings in each DiT transformer block
                                         ↓
               LoRA-merged DiT backbone (3.3B params, frozen during Stage 2)

The DiT backbone in v0.4 contains merged LoRA rank-64 weights from Stage 1, giving it better speech generation capabilities than the vanilla DramaBox backbone used in v0.3. The AdaLN-Zero network is trained on top of this improved backbone.

AdaLN-Zero Design

The conditioning network zero-initializes all output projections, so it starts with no conditioning effect and gradually learns to modulate the DiT's behavior. Combined with the frozen backbone, this ensures:

Graceful degradation to vanilla generation when no speaker reference is provided
Stable training — the model never "forgets" how to generate speech
The frozen DiT backbone is never modified during Stage 2
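
The effect of zero-initialization can be shown in a small sketch. It is a plain-Python toy, not the SpeakerAdaLNZero class from scripts/speaker_adaln.py: the class name, the 4-dimensional speaker embedding, and the single block with 9 modulation parameters are assumptions chosen for illustration.

```python
class ZeroInitProjection:
    """Linear output projection whose weights and bias start at zero."""

    def __init__(self, in_dim, out_dim):
        self.weight = [[0.0] * in_dim for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [
            sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(self.weight, self.bias)
        ]


def modulate(timestep_params, speaker_embedding, projection):
    """Add speaker-dependent deltas to one block's timestep modulation parameters."""
    deltas = projection(speaker_embedding)
    return [t + d for t, d in zip(timestep_params, deltas)]


projection = ZeroInitProjection(in_dim=4, out_dim=9)
timestep_params = [0.1 * i for i in range(9)]  # stand-in for one block's 9 parameters
speaker_embedding = [0.3, -1.2, 0.7, 2.0]      # stand-in for the fused speaker features

# At initialization the deltas are all zero, so the backbone behaves as if unconditioned.
at_init = modulate(timestep_params, speaker_embedding, projection)

# Once training moves a weight away from zero, the speaker starts to influence the block.
projection.weight[0][0] = 1.0
after_update = modulate(timestep_params, speaker_embedding, projection)
```

Because the deltas are added rather than substituted, a missing reference can be handled by skipping the addition, which leaves the backbone's own modulation untouched.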

Training Details

Stage 1: LoRA Fine-Tuning (Run 7)

Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
LoRA config: Rank 64, alpha 64, targeting audio_attn1 + audio_ff modules
Trainable params: 113M (3.2% of DiT)
Learning rate: 4e-5 with cosine schedule
Epochs: 2 (340 steps, effective batch size 128)
Precision: bf16 mixed precision
Note: LoRA training in bf16 is effective for ~1 epoch before the bf16 ULP floor freezes updates (see Learnings section)

Stage 2: AdaLN-Zero on Frozen Merged DiT (Run 8)

Base model: Stage 1 LoRA merged into vanilla DiT → merged_dit.safetensors
DiT: Completely frozen (lr=0), no gradients, no weight saves
AdaLN-Zero: 455M parameters, freshly initialized
Learning rate: 7e-5 with cosine schedule, 25% warmup
Epochs: 6 (1,020 steps, effective batch size 128)
Best checkpoint: Step 850 (epoch 5.0 at 170 steps per epoch, selected by validation)
Precision: bf16 DiT + fp32 AdaLN
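
The step counts and the Stage 2 learning-rate schedule can be reproduced with a short sketch. It assumes linear warmup, cosine decay to zero, and that the last partial batch of each epoch is kept; the function lr_at is hypothetical and the actual scheduler implementation may differ.

```python
import math

SAMPLES = 21_734
EFFECTIVE_BATCH = 128

steps_per_epoch = math.ceil(SAMPLES / EFFECTIVE_BATCH)  # 170
total_steps = 6 * steps_per_epoch                       # 1020 for Stage 2


def lr_at(step, total=total_steps, peak=7e-5, warmup_frac=0.25):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = int(total * warmup_frac)
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


for step in (0, 255, 640, 1020):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions warmup ends at step 255, where the rate reaches its 7e-5 peak, and the rate falls to zero at step 1,020.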

Training Data

21,734 samples from 3 sources:

DramaBox best-of-25 (9,404 samples, weight 0.4) — high-quality TTS generations
Podcast data (10,014 samples, weight 0.4) — natural conversational speech
Emolia emotional speech (2,316 samples, weight 0.2) — expressive/emotional range
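
One way to read the source weights is as per-draw sampling probabilities. The sketch below makes that assumption explicit; the dictionary keys and the sample_sources helper are hypothetical, and the real data loader may apply the weights differently.

```python
import random

# Source name -> (number of samples, sampling weight), as listed above.
SOURCES = {
    "dramabox_best_of_25": (9_404, 0.4),
    "podcast": (10_014, 0.4),
    "emolia": (2_316, 0.2),
}


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to each source's weight."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[name][1] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(10_000)
share = {name: draws.count(name) / len(draws) for name in SOURCES}

# Relative chance that any single sample from a source is drawn.
per_sample_rate = {name: weight / size for name, (size, weight) in SOURCES.items()}
```

Under this reading the Emolia set, which is about 11% of the samples, supplies about 20% of the draws, so each emotional-speech clip is seen roughly twice as often as a podcast clip.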

Auxiliary Losses

5 differentiable losses computed through the DramaBox decoder (ReFL-style):

Loss	Weight	Purpose
WavLM-SV speaker similarity	adaptive (≤20)	Match speaker identity
Orange-tbr speaker similarity	adaptive (≤20)	Secondary speaker matching
DNSMOS OVR quality	adaptive (≤20)	Speech quality
VoiceCLAP-HOW naturalness	adaptive (≤20)	Speaking style/naturalness
VoiceCLAP-WHAT content accuracy	0 (disabled in Run 8)	Content preservation

Auxiliary loss coefficients are adaptive with exponential moving average targeting a 6:1 ratio between flow loss and total auxiliary contribution.
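
A minimal sketch of such a scheme follows. The exact update rule is not given here, so everything in the class is an assumption: it splits the auxiliary budget (flow loss divided by 6) equally across the active losses, tracks each loss with an exponential moving average, and caps every coefficient at 20.

```python
class AdaptiveAuxWeights:
    """EMA-based coefficients that aim for a fixed flow-to-auxiliary ratio."""

    def __init__(self, target_ratio=6.0, max_coef=20.0, decay=0.99, eps=1e-8):
        self.target_ratio = target_ratio
        self.max_coef = max_coef
        self.decay = decay
        self.eps = eps
        self.ema = {}

    def _update(self, key, value):
        previous = self.ema.get(key)
        if previous is None:
            self.ema[key] = value  # first observation seeds the average
        else:
            self.ema[key] = self.decay * previous + (1.0 - self.decay) * value

    def step(self, flow_loss, aux_losses):
        """Update the averages and return one coefficient per auxiliary loss."""
        self._update("flow", flow_loss)
        for name, value in aux_losses.items():
            self._update(name, value)
        # Each active loss gets an equal share of flow / target_ratio.
        budget = self.ema["flow"] / self.target_ratio / len(aux_losses)
        return {
            name: min(self.max_coef, budget / max(self.ema[name], self.eps))
            for name in aux_losses
        }


weights = AdaptiveAuxWeights()
# Placeholder loss values, chosen only to show one uncapped and one capped coefficient.
coefs = weights.step(flow_loss=0.6, aux_losses={"wavlm_sv": 0.05, "dnsmos": 0.001})
```

In this toy call the first coefficient comes out near 1.0, while the second would need 50 to reach its share and is clipped to 20, so the total auxiliary contribution stays below the 6:1 target instead of exceeding it.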

Evaluation Results

Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model, with 3 post-processing variants.

Raw Output (DramaBox decoder only)

Model	MOS ↑	UTMOS ↑	NISQA ↑	SpkSim ↑	CE ↑	PQ ↑
Vanilla DramaBox	4.412	3.136	4.107	0.818	—	—
LAIONBox v0.3 (Run 5 AdaLN)	4.575	3.630	4.334	0.887	6.058	7.675
LAIONBox v0.4 (Run 8)	4.501	3.655	4.198	0.884	6.013	7.686

Sidon Enhanced (speech restoration)

Model	MOS ↑	UTMOS ↑	NISQA ↑	SpkSim ↑	CE ↑	PQ ↑
Vanilla DramaBox	4.663	3.280	4.516	0.813	—	—
LAIONBox v0.3	4.722	3.763	4.635	0.874	6.200	7.864
LAIONBox v0.4	4.751	3.855	4.670	0.875	6.183	7.878

ChatterboxVC → Sidon (voice conversion + restoration)

Model	MOS ↑	UTMOS ↑	NISQA ↑	SpkSim ↑	CE ↑	PQ ↑
LAIONBox v0.3	4.677	3.756	4.508	0.927	6.029	7.783
LAIONBox v0.4	4.690	3.835	4.514	0.934	6.109	7.874

Improvements over Vanilla DramaBox (raw → raw)

UTMOS: +16.6% (3.136 → 3.655)
Speaker Similarity: +8.1% (0.818 → 0.884)
MOS: +2.0% (4.412 → 4.501)

Files

File	Size	Description
`model.safetensors`	6.2 GB	LoRA-merged DiT backbone (3.3B params, bf16)
`speaker_adaln.pt`	1.7 GB	AdaLN-Zero speaker conditioning network (455M params, fp32), step 850
`scripts/inference_adaln.py`	26 KB	Full inference script with speaker conditioning, LoRA support, batch generation
`scripts/speaker_adaln.py`	7 KB	AdaLN-Zero module definition (SpeakerAdaLNZero class)
`training_args.json`	1 KB	Training hyperparameters for Run 8
`eval/eval_full_report.html`	0.3 MB	Interactive evaluation report with metric tables and audio players

Inference

Prerequisites

# Clone DramaBox
git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt

# Download model files
pip install huggingface_hub
huggingface-cli download laion/laionbox-v0.4-wip --local-dir laionbox-v0.4

Basic Usage

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
    --output output.wav \
    --device cuda:0

Without Speaker Conditioning (vanilla mode)

Omit --adaln-checkpoint to run as standard DramaBox with the improved LoRA-merged backbone:

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --prompt "A calm narrator reading a story." \
    --output output.wav

Batch Generation

Generate multiple samples with different seeds for best-of-N selection:

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --ref-audio speaker_ref.wav \
    --prompt "Excited announcer. 'And the winner is...!'" \
    --output-dir ./outputs \
    --seeds 42,123,456 \
    --device cuda:0
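
Best-of-N selection then reduces to scoring each output and keeping the highest. The sketch below assumes the scores have already been computed by a metric of your choice (for example UTMOS or speaker similarity); the pick_best helper and the score values are placeholders, not measured results.

```python
def pick_best(scores):
    """Return (seed, score) for the highest-scoring candidate."""
    if not scores:
        raise ValueError("no candidates to choose from")
    return max(scores.items(), key=lambda item: item[1])


# Placeholder scores keyed by seed, for illustration only.
scores = {42: 3.61, 123: 3.78, 456: 3.70}
best_seed, best_score = pick_best(scores)
print(f"keep seed {best_seed} (score {best_score})")
```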

Key Parameters

Parameter	Default	Description
`--checkpoint`	required	Path to `model.safetensors` (LoRA-merged DiT)
`--adaln-checkpoint`	optional	Path to `speaker_adaln.pt` (omit for unconditioned)
`--full-checkpoint`	required	Path to original DramaBox `ltx-2.3-22b-dev.safetensors` (for decoder/conditioner)
`--ref-audio`	optional	Reference audio for speaker conditioning (WAV/MP3, 3-15s recommended)
`--prompt`	required	Text prompt describing speech style + content
`--cfg-scale`	4.5	Classifier-free guidance scale
`--stg-scale`	0.0	Spatiotemporal guidance scale
`--steps`	50	Number of diffusion steps
`--duration`	auto	Target duration in seconds (auto-estimated from text)

Post-Processing Pipeline

For best quality, apply the Generate → ChatterboxVC → Sidon pipeline:

Step 1: Generate with LAIONBox

# See inference commands above

Step 2: ChatterboxVC Voice Conversion (optional, +0.05 SpkSim)

ChatterboxVC converts the output to match the target speaker more closely:

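from chatterbox.vc import ChatterboxVC
import torchaudio

# Load the pretrained voice-conversion model onto the GPU
vc = ChatterboxVC.from_pretrained(device="cuda")
# Convert the generated audio toward the reference speaker's voice
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# generate() returns a (1, num_samples) tensor; vc.sr is the model's output rate (24 kHz)
torchaudio.save("output_vc.wav", wav, vc.sr)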

Step 3: Sidon Speech Restoration (+0.25 MOS, +0.47 NISQA on v0.4)

Sidon removes DAC vocoder artifacts and improves perceptual quality:

from sidon import Sidon

model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output_vc.wav")  # or "output.wav" if skipping VC
enhanced.save("output_final.wav")  # 48kHz mono

Expected Quality with Full Pipeline

Metric	Raw	+ Sidon	+ VC → Sidon
MOS	4.50	4.75	4.69
UTMOS	3.66	3.86	3.84
NISQA	4.20	4.67	4.51
SpkSim	0.88	0.88	0.93

Learnings from the Training Campaign (Runs 5–9)

What Worked

Technique	Impact
LoRA-merge + fresh AdaLN	Best overall recipe. LoRA improves the backbone; AdaLN adds speaker identity cleanly on top.
FP32 master weights	Solves bf16 ULP floor for LoRA training. CPU-offloaded fp32 copies (~2.5 GB RAM) keep Adam updates alive. Zero throughput impact.
Separate optimizers	Different LRs for DiT vs AdaLN (4e-5 vs 1e-5) and independent freeze/unfreeze.
Weight debug logging	Hash + Δ% tracking exposed training freezes that loss curves alone cannot detect.
Sidon post-processing	Gains of +0.15 to +0.25 MOS and +0.30 to +0.47 NISQA across the evaluated models. Essential for production.
Adaptive aux loss coefficients	EMA-based coefficient scaling prevents any single auxiliary loss from dominating.

What Didn't Work

Approach	Problem
Standard full fine-tune	3.5B params at lr=2e-6 doesn't converge in 1 epoch. Too slow, too expensive.
LoRA in bf16 beyond epoch 1	bf16 has 8 bits of significand precision (7 stored); for weights ~0.01, ULP ≈ 8.58e-5, but Adam updates are ~1.72e-5, well under half a ULP, so they round to zero.
Stacking LoRA on merged model	Diminishing returns: adding more LoRA on top of LoRA+AdaLN doesn't improve quality.
Training AdaLN beyond 5 epochs	Peak at epoch 4–5; epochs 5–6 show mild overfitting.
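
The bf16 rounding floor and the master-weight workaround can be reproduced without a GPU. The sketch below emulates bfloat16 rounding with the standard library; to_bf16 and bf16_ulp are hypothetical helpers, and the master copy is a Python float (float64) standing in for the fp32 copy. For a weight of exactly 0.01 the bf16 spacing is 2^-14 ≈ 6.1e-5, the same order of magnitude as the figure quoted in the table.

```python
import math
import struct


def to_bf16(x):
    """Round x to the nearest bfloat16 value (ties to even); no NaN/inf handling."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def bf16_ulp(x):
    """Spacing between adjacent bfloat16 values around a normal, non-zero x."""
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 8)


weight = to_bf16(0.01)
update = 1.72e-5  # typical Adam update size quoted above

bf16_only = weight  # weight stored and updated in bf16
master = weight     # higher-precision master copy
for _ in range(10):
    bf16_only = to_bf16(bf16_only + update)  # each update rounds away
    master += update                         # updates accumulate

print(bf16_ulp(weight))   # spacing near 0.01
print(bf16_only == weight)
print(to_bf16(master) - weight)
```

Each individual update is smaller than half the spacing, so the bf16-only weight never moves, while the accumulated master copy crosses several spacings and changes the bf16 value once it is cast back.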

Limitations

Speaker conditioning requires reference audio (3–15 seconds recommended); without it the model runs as vanilla DramaBox with improved backbone
Best results require the VC → Sidon post-processing pipeline, which adds latency (~5s for VC + ~3s for Sidon on GPU)
Trained primarily on English and German speech
The LoRA-merged backbone has slightly different characteristics than vanilla DramaBox — prompts may need minor adjustment

Citation

@misc{laionbox2026v04,
  title={LAIONBox v0.4: Two-Stage Speaker-Conditioned Audio Generation with LoRA-Merged DiT and AdaLN-Zero},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.4-wip}
}

Model Details

Total model size: 3.8B parameters (3.3B DiT + 455M AdaLN)
DiT tensor type: BF16 (safetensors)
AdaLN tensor type: FP32 (PyTorch)
Training hardware: 8× GPU (DDP, bf16 mixed precision)
Training time: ~6 hours Stage 1 (Run 7) + ~17 hours Stage 2 (Run 8)
Framework: PyTorch + Accelerate
License: Apache 2.0
Organization: LAION e.V.
