LAIONBox v0.3-wip
Speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and 5 differentiable auxiliary losses.
What's Different from DramaBox
LAIONBox v0.3 adds a 152M-parameter AdaLN-Zero speaker conditioning network (speaker_adaln.pt) that injects speaker identity into every transformer block of the 3.3B DiT backbone. The base DiT weights (model.safetensors) are identical to the DramaBox v13-merged pretrained model — only the AdaLN network was trained.
Architecture
Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
↓
SpeakerAdaLNZero (152M params)
bottleneck_dim=512
↓
Per-block scale/shift deltas (48 blocks × 9 params)
↓
Added to timestep embeddings in each DiT transformer block
AdaLN-Zero zero-initializes all output projections, so the conditioning deltas are exactly zero at the start of training and the model initially matches vanilla DramaBox. When no speaker reference is provided, no deltas are added and the model falls back to vanilla DramaBox behavior.
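The data flow in the diagram can be sketched in a few lines of NumPy. This is a toy illustration, not the contents of scripts/speaker_adaln.py: the class name `SpeakerAdaLNZeroSketch`, the hidden width `HIDDEN_DIM = 16`, the tanh activation, and the single-layer bottleneck are all assumptions made for the example. Only the embedding sizes, bottleneck_dim, and the 48 × 9 layout come from the diagram.

```python
import numpy as np

# Embedding sizes, bottleneck size, and block layout are taken from the diagram.
EMBED_DIMS = {"wavlm_sv": 512, "orange_tbr": 128, "clap_how": 768, "clap_what": 768}
BOTTLENECK_DIM = 512
NUM_BLOCKS = 48
PARAMS_PER_BLOCK = 9
HIDDEN_DIM = 16  # hypothetical toy width; the real DiT width is not stated here


class SpeakerAdaLNZeroSketch:
    """Toy stand-in for the speaker conditioning network (not the released module)."""

    def __init__(self, rng):
        in_dim = sum(EMBED_DIMS.values())
        out_dim = PARAMS_PER_BLOCK * HIDDEN_DIM
        # Input projection into the bottleneck: ordinary small random init.
        self.w_in = rng.standard_normal((in_dim, BOTTLENECK_DIM)) * 0.02
        self.b_in = np.zeros(BOTTLENECK_DIM)
        # One output projection per DiT block, zero-initialized (the "Zero" part).
        self.w_out = np.zeros((NUM_BLOCKS, BOTTLENECK_DIM, out_dim))
        self.b_out = np.zeros((NUM_BLOCKS, out_dim))

    def __call__(self, embeddings):
        # Concatenate the four reference-audio embeddings into one vector.
        x = np.concatenate([embeddings[name] for name in EMBED_DIMS])
        h = np.tanh(x @ self.w_in + self.b_in)  # activation is an assumption
        deltas = np.einsum("b,kbo->ko", h, self.w_out) + self.b_out
        # Shape: (blocks, modulation params per block, hidden width)
        return deltas.reshape(NUM_BLOCKS, PARAMS_PER_BLOCK, HIDDEN_DIM)


rng = np.random.default_rng(0)
net = SpeakerAdaLNZeroSketch(rng)
emb = {name: rng.standard_normal(dim) for name, dim in EMBED_DIMS.items()}

deltas = net(emb)
# Stand-in for the per-block modulation derived from the timestep embedding.
timestep_mod = rng.standard_normal((NUM_BLOCKS, PARAMS_PER_BLOCK, HIDDEN_DIM))
conditioned = timestep_mod + deltas  # deltas are added, as in the diagram

print(deltas.shape, float(np.abs(deltas).max()))
```

With freshly zero-initialized output projections every delta is exactly zero, so the conditioned modulation equals the unconditioned one. Training then moves the deltas away from zero.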
Training
- Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
- Mode: Full fine-tuning (no LoRA) with separate AdaLN-Zero optimizer
- Data: 21,734 samples from 3 sources:
  - DramaBox best-of-25 (9,404 samples, weight 0.4)
  - Podcast data (10,014 samples, weight 0.4)
  - Emolia emotional speech (2,316 samples, weight 0.2)
- Learning rates: DiT lr=1e-5 (effectively frozen in bf16), AdaLN lr=5e-5
- Epochs: 5 (850 optimizer steps, effective batch size 128)
- Auxiliary losses (differentiable, ReFL-style through decoder):
  - WavLM-SV speaker similarity
  - Orange-tbr speaker similarity
  - DNSMOS OVR quality
  - VoiceCLAP-HOW naturalness
  - VoiceCLAP-WHAT content accuracy
- Key insight: With mixed_precision="bf16" and lr=1e-5, Adam updates (5e-6) are below the bf16 ULP (2.8e-4) for all DiT parameters, so they are rounded away when the weights are stored. The DiT backbone is effectively frozen after warmup, and only the fp32 AdaLN network learns. This suits AdaLN-Zero conditioning here, since only the AdaLN network is meant to change.
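
The rounding argument in the key insight can be checked with a small standard-library simulation. It is a sketch under stated assumptions: the weight magnitude `WEIGHT = 0.036` is hypothetical (chosen so that |w| × 2^-7 is close to the quoted 2.8e-4), every step applies the same-signed update of 5e-6, and the weights themselves are stored in bf16 after each step, as the text describes. The helpers `to_bf16` and `to_fp32` are illustrative and not part of the training code.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bf16 value (ties to even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for ties-to-even
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


WEIGHT = 0.036  # hypothetical weight magnitude, not taken from the model
UPDATE = 5e-6   # per-step Adam update size quoted in the text
STEPS = 850     # optimizer steps quoted in the text

w_bf16 = to_bf16(WEIGHT)
w_fp32 = to_fp32(WEIGHT)
start_bf16, start_fp32 = w_bf16, w_fp32

for _ in range(STEPS):
    w_bf16 = to_bf16(w_bf16 - UPDATE)  # weight stored in bf16 after every step
    w_fp32 = to_fp32(w_fp32 - UPDATE)  # weight stored in fp32 after every step

print("bf16 total change:", start_bf16 - w_bf16)
print("fp32 total change:", start_fp32 - w_fp32)
```

Under these assumptions the bf16 weight never moves, because each update is far below half a ULP and is rounded back to the same value, while the fp32 weight accumulates roughly 850 × 5e-6 ≈ 4.25e-3.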
Evaluation Results
Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model. Post-processing variants: Raw, Sidon (speech restoration), VC→Sidon (voice conversion + restoration).
| Model | Variant | MOS↑ | UTMOS↑ | SpkSim↑ | NISQA↑ |
|---|---|---|---|---|---|
| Vanilla DramaBox | Raw | 4.455 | 3.176 | 0.818 | 4.134 |
| Vanilla DramaBox | Sidon | 4.672 | 3.323 | 0.813 | 4.524 |
| LAIONBox v0.3 | Raw | 4.575 | 3.630 | 0.887 | 4.334 |
| LAIONBox v0.3 | Sidon | 4.722 | 3.763 | 0.874 | 4.635 |
| LAIONBox v0.3 | VC→Sidon | 4.677 | 3.756 | 0.927 | 4.508 |
Key improvements over vanilla DramaBox (raw → raw):
- UTMOS: +14.3% (3.176 → 3.630)
- Speaker Similarity: +8.4% (0.818 → 0.887)
- MOS: +2.7% (4.455 → 4.575)
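The percentages above are relative gains of the raw LAIONBox row over the raw vanilla row. A minimal check, using only the numbers from the table (the helper name `relative_gain_percent` is illustrative):

```python
# Raw-variant scores copied from the evaluation table.
vanilla_raw = {"MOS": 4.455, "UTMOS": 3.176, "SpkSim": 0.818, "NISQA": 4.134}
laionbox_raw = {"MOS": 4.575, "UTMOS": 3.630, "SpkSim": 0.887, "NISQA": 4.334}


def relative_gain_percent(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


gains = {
    metric: round(relative_gain_percent(vanilla_raw[metric], laionbox_raw[metric]), 1)
    for metric in vanilla_raw
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain}%")
```

The same calculation applied to the NISQA column gives about +4.8% for raw versus raw.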
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 6.2 GB | DiT backbone (3.3B params, bf16); same as DramaBox v13-merged |
| speaker_adaln.pt | 1.7 GB | AdaLN-Zero speaker conditioning (152M params, fp32) |
| scripts/inference_adaln.py | 26 KB | Inference script with speaker conditioning |
| scripts/speaker_adaln.py | 7 KB | AdaLN-Zero module definition |
| training_args.json | 1 KB | Training hyperparameters |
| eval_enhanced_report.html | 108 MB | Interactive eval grid with embedded audio |
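One way to fetch only the files needed for inference is the `huggingface_hub` client. The sketch below assumes the repository id `laion/laionbox-v0.3-wip`, which is inferred from the citation URL rather than stated in this section, and the helper name `download_inference_files` is hypothetical. It skips the 108 MB eval report and the training arguments.

```python
from typing import Dict

REPO_ID = "laion/laionbox-v0.3-wip"  # assumption: inferred from the citation URL

# Files from the table that the inference commands actually use.
INFERENCE_FILES = [
    "model.safetensors",
    "speaker_adaln.pt",
    "scripts/inference_adaln.py",
    "scripts/speaker_adaln.py",
]


def download_inference_files(local_dir: str) -> Dict[str, str]:
    """Download each inference file and return a {repo path: local path} map."""
    # Imported lazily so the module loads without the dependency installed;
    # requires `pip install huggingface_hub`.
    from huggingface_hub import hf_hub_download

    return {
        name: hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in INFERENCE_FILES
    }
```

Calling `download_inference_files("./laionbox")` would download roughly 8 GB, almost all of it the two checkpoint files.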
Inference
Prerequisites
You need the DramaBox repository and its dependencies:
git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt
Basic Usage
python scripts/inference_adaln.py \
--checkpoint model.safetensors \
--adaln-checkpoint speaker_adaln.pt \
--full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
--dramabox-dir /path/to/DramaBox \
--ref-audio /path/to/reference_speaker.wav \
--prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
--output output.wav \
--device cuda:0
Without AdaLN (vanilla DramaBox mode)
Simply omit --adaln-checkpoint to run as standard DramaBox:
python scripts/inference_adaln.py \
--checkpoint model.safetensors \
--full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
--dramabox-dir /path/to/DramaBox \
--ref-audio /path/to/reference_speaker.wav \
--prompt "Your prompt here" \
--output output.wav
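When generating for several reference speakers, it can be convenient to assemble these commands programmatically. The sketch below uses only the flags shown in the two commands above; the function name `build_inference_command` and the example paths are hypothetical, and nothing is executed until the command is passed to `subprocess.run`.

```python
import shlex
from typing import List, Optional


def build_inference_command(
    ref_audio: str,
    prompt: str,
    output: str,
    dramabox_dir: str,
    full_checkpoint: str,
    adaln_checkpoint: Optional[str] = "speaker_adaln.pt",
    device: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for scripts/inference_adaln.py.

    Pass adaln_checkpoint=None to get the vanilla DramaBox command.
    """
    cmd = ["python", "scripts/inference_adaln.py", "--checkpoint", "model.safetensors"]
    if adaln_checkpoint is not None:
        cmd += ["--adaln-checkpoint", adaln_checkpoint]
    cmd += [
        "--full-checkpoint", full_checkpoint,
        "--dramabox-dir", dramabox_dir,
        "--ref-audio", ref_audio,
        "--prompt", prompt,
        "--output", output,
    ]
    if device is not None:
        cmd += ["--device", device]
    return cmd


conditioned_cmd = build_inference_command(
    ref_audio="/path/to/reference_speaker.wav",
    prompt="Warm, conversational tone. 'Hello, welcome to the show.'",
    output="output.wav",
    dramabox_dir="/path/to/DramaBox",
    full_checkpoint="/path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors",
    device="cuda:0",
)
vanilla_cmd = build_inference_command(
    ref_audio="/path/to/reference_speaker.wav",
    prompt="Your prompt here",
    output="output_vanilla.wav",
    dramabox_dir="/path/to/DramaBox",
    full_checkpoint="/path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors",
    adaln_checkpoint=None,
)

# Print a copy-pasteable shell command; to execute it instead, use
# subprocess.run(conditioned_cmd, check=True).
print(shlex.join(conditioned_cmd))
```

Building the command as a list avoids shell-quoting problems with prompts that contain quotes, as the example prompt does.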
Post-Processing: Sidon + ChatterboxVC
For best quality, apply post-processing to the generated audio:
Step 1: Sidon Speech Restoration
Sidon cleans up DAC vocoder artifacts and raises MOS by roughly 0.15 to 0.22 points in the evaluation above (4.575 → 4.722 for LAIONBox, 4.455 → 4.672 for vanilla DramaBox):
pip install sidon
from sidon import Sidon
model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output.wav")
enhanced.save("output_sidon.wav") # 48kHz mono
Step 2: ChatterboxVC Voice Conversion (optional)
ChatterboxVC converts the output to match a target speaker's voice, boosting SpkSim from ~0.87 to ~0.93:
pip install chatterbox-tts
import torchaudio
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
# Convert the generated audio to the reference speaker's voice
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# generate() returns a (1, num_samples) tensor at vc.sr (24 kHz), so save it directly
torchaudio.save("output_vc.wav", wav, vc.sr)
# Then run Sidon on output_vc.wav for best results
Recommended Pipeline
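For maximum speaker similarity: Generate → ChatterboxVC → Sidon
In the evaluation above this reaches SpkSim 0.927 with MOS 4.677 and NISQA 4.508. If overall quality matters more than speaker match, Sidon alone scored higher on MOS (4.722) and NISQA (4.635).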
Limitations
- The DiT backbone is identical to vanilla DramaBox v13-merged — all improvements come from the AdaLN conditioning
- Speaker conditioning requires reference audio; without it, the model behaves identically to vanilla DramaBox
- The highest speaker similarity requires the VC→Sidon post-processing pipeline, which adds latency
- Trained primarily on English and German speech
License
The DiT weights carry the same license as the DramaBox base model. The AdaLN-Zero conditioning network and training code are released under Apache 2.0.
Citation
@misc{laionbox2026,
  title={LAIONBox v0.3: Speaker-Conditioned Audio Generation with AdaLN-Zero},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.3-wip}
}