LAIONBox v0.3-wip

Speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and 5 differentiable auxiliary losses.

What's Different from DramaBox

LAIONBox v0.3 adds a 152M-parameter AdaLN-Zero speaker conditioning network (speaker_adaln.pt) that injects speaker identity into every transformer block of the 3.3B DiT backbone. The base DiT weights (model.safetensors) are identical to the DramaBox v13-merged pretrained model; only the AdaLN network changed during training.

Architecture

Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
                                         ↓
                              SpeakerAdaLNZero (152M params)
                              bottleneck_dim=512
                                         ↓
                        Per-block scale/shift deltas (48 blocks × 9 params)
                                         ↓
                    Added to timestep embeddings in each DiT transformer block

The AdaLN-Zero design means the model starts with zero conditioning (all output projections zero-initialized), so it degrades gracefully to vanilla DramaBox when no speaker reference is provided.
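
The zero-initialization argument can be illustrated with a toy, pure-Python sketch. It is not the code in scripts/speaker_adaln.py: the class name TinySpeakerAdaLNZero, the ReLU bottleneck, the toy bottleneck width of 8, and the use of a single scalar per modulation parameter are all assumptions made for illustration. Only the embedding sizes and the 48 × 9 layout come from the diagram above.

```python
import random

# Embedding sizes and block layout as listed in the architecture diagram.
EMBED_DIMS = {"wavlm_sv": 512, "orange_tbr": 128, "clap_how": 768, "clap_what": 768}
NUM_BLOCKS, PARAMS_PER_BLOCK = 48, 9


def linear(x, weight, bias):
    """Compute W @ x + b for plain Python lists (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class TinySpeakerAdaLNZero:
    """Hypothetical toy module: a bottleneck MLP with a zero-initialized output."""

    def __init__(self, in_dim, bottleneck_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # The input projection gets a small random init.
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                     for _ in range(bottleneck_dim)]
        self.b_in = [0.0] * bottleneck_dim
        # AdaLN-Zero: the output projection starts at exactly zero.
        self.w_out = [[0.0] * bottleneck_dim for _ in range(out_dim)]
        self.b_out = [0.0] * out_dim

    def __call__(self, speaker_embedding):
        hidden = [max(0.0, h) for h in linear(speaker_embedding, self.w_in, self.b_in)]
        flat = linear(hidden, self.w_out, self.b_out)
        # Reshape to one row of deltas per transformer block.
        return [flat[i * PARAMS_PER_BLOCK:(i + 1) * PARAMS_PER_BLOCK]
                for i in range(NUM_BLOCKS)]


in_dim = sum(EMBED_DIMS.values())  # concatenated speaker embedding size
module = TinySpeakerAdaLNZero(in_dim, bottleneck_dim=8,
                              out_dim=NUM_BLOCKS * PARAMS_PER_BLOCK)
rng = random.Random(1)
embedding = [rng.gauss(0.0, 1.0) for _ in range(in_dim)]
deltas = module(embedding)

# At initialization every delta is zero, so adding it to a timestep
# embedding leaves that embedding unchanged.
timestep_embedding = [0.3, -1.2, 0.7]  # made-up values
conditioned = [t + deltas[0][0] for t in timestep_embedding]
print(in_dim, len(deltas), len(deltas[0]), conditioned == timestep_embedding)
```

Because the output projection starts at zero, the conditioned embedding equals the original one regardless of the reference audio, which is why an untrained AdaLN network reproduces the base model exactly.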

Training

Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
Mode: Full fine-tuning (no LoRA), with a separate optimizer for the AdaLN-Zero network
Data: 21,734 samples from 3 sources:
- DramaBox best-of-25 (9,404 samples, weight 0.4)
- Podcast data (10,014 samples, weight 0.4)
- Emolia emotional speech (2,316 samples, weight 0.2)
Learning rates: DiT lr=1e-5 (effectively frozen in bf16), AdaLN lr=5e-5
Epochs: 5 (850 optimizer steps, effective batch size 128)
Auxiliary losses (differentiable, ReFL-style through decoder):
1. WavLM-SV speaker similarity
2. Orange-tbr speaker similarity
3. DNSMOS OVR quality
4. VoiceCLAP-HOW naturalness
5. VoiceCLAP-WHAT content accuracy
Key insight: With mixed_precision="bf16" and lr=1e-5, Adam updates (~5e-6) are below the bf16 ULP (~2.8e-4) for all DiT parameters, so they are lost to rounding. The DiT backbone is therefore effectively frozen after warmup, and only the fp32 AdaLN network learns. This is consistent with the AdaLN-Zero design, in which conditioning is learned on top of an unchanged backbone.
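
The rounding argument and the step count can be sanity-checked with the standard library alone. The sketch below emulates bf16 by rounding a float32 bit pattern to its top 16 bits. The helper names and the example weight of 0.05 are illustrative assumptions, not values from the training run, and the step count assumes that each epoch's final partial batch counts as one optimizer step.

```python
import math
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Handles ordinary finite values only; NaN, infinity and overflow
    are out of scope for this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the dropped 16 bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_ulp(x: float) -> float:
    """Spacing between neighbouring bf16 values around a normal, non-zero x."""
    _, exponent = math.frexp(abs(x))      # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 8)          # 8 bits of precision incl. the implicit one


weight = round_to_bf16(0.05)              # example weight magnitude (assumed)
small_update = 5e-6                       # update size quoted in the text
print(bf16_ulp(weight))                   # bf16 spacing at this magnitude
print(round_to_bf16(weight - small_update) == weight)  # does the update vanish?
print(round_to_bf16(weight - 2e-4) == weight)          # a much larger update

# Step count: 5 epochs over the three data sources at effective batch size 128.
samples = 9_404 + 10_014 + 2_316
steps = 5 * math.ceil(samples / 128)
print(samples, steps)
```

For a weight near 0.05 the bf16 spacing is 2^-12 ≈ 2.4e-4, the same order as the figure quoted above, so a 5e-6 update rounds back to the original value while a 2e-4 update does not. The step count works out to 170 steps per epoch, or 850 in total.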

Evaluation Results

Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model. Post-processing variants: Raw, Sidon (speech restoration), VC→Sidon (voice conversion + restoration).

Model	Variant	MOS↑	UTMOS↑	SpkSim↑	NISQA↑
Vanilla DramaBox	Raw	4.455	3.176	0.818	4.134
Vanilla DramaBox	Sidon	4.672	3.323	0.813	4.524
LAIONBox v0.3	Raw	4.575	3.630	0.887	4.334
LAIONBox v0.3	Sidon	4.722	3.763	0.874	4.635
LAIONBox v0.3	VC→Sidon	4.677	3.756	0.927	4.508

Key improvements over vanilla DramaBox (raw → raw):

UTMOS: +14.3% (3.176 → 3.630)
Speaker Similarity: +8.4% (0.818 → 0.887)
MOS: +2.7% (4.455 → 4.575)
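
The percentages above are relative changes between the two Raw rows of the results table. A short check, with the scores copied from the table and a hypothetical helper named relative_gain:

```python
# Raw-variant scores copied from the evaluation table.
vanilla = {"MOS": 4.455, "UTMOS": 3.176, "SpkSim": 0.818, "NISQA": 4.134}
laionbox = {"MOS": 4.575, "UTMOS": 3.630, "SpkSim": 0.887, "NISQA": 4.334}


def relative_gain(before: float, after: float) -> float:
    """Relative change in percent, rounded to one decimal place."""
    return round(100.0 * (after - before) / before, 1)


gains = {name: relative_gain(vanilla[name], laionbox[name]) for name in vanilla}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```

The same calculation applied to NISQA, which is not listed above, gives roughly +4.8% (4.134 → 4.334).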

Files

File	Size	Description
`model.safetensors`	6.2 GB	DiT backbone (3.3B params, bf16) — same as DramaBox v13-merged
`speaker_adaln.pt`	1.7 GB	AdaLN-Zero speaker conditioning (152M params, fp32)
`scripts/inference_adaln.py`	26 KB	Inference script with speaker conditioning
`scripts/speaker_adaln.py`	7 KB	AdaLN-Zero module definition
`training_args.json`	1 KB	Training hyperparameters
`eval_enhanced_report.html`	108 MB	Interactive eval grid with embedded audio

Inference

Prerequisites

You need the DramaBox repository and its dependencies:

git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt

Basic Usage

python scripts/inference_adaln.py \
    --checkpoint model.safetensors \
    --adaln-checkpoint speaker_adaln.pt \
    --full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir /path/to/DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
    --output output.wav \
    --device cuda:0

Without AdaLN (vanilla DramaBox mode)

Simply omit --adaln-checkpoint to run as standard DramaBox:

python scripts/inference_adaln.py \
    --checkpoint model.safetensors \
    --full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir /path/to/DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Your prompt here" \
    --output output.wav
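
When generating many samples it can be convenient to assemble these command lines programmatically. The helper below is a sketch: build_command and its parameter names are hypothetical, and it uses only the flags shown in the two examples above. It leaves out --adaln-checkpoint when no AdaLN checkpoint is given, which corresponds to the vanilla DramaBox mode.

```python
import shlex
from typing import List, Optional


def build_command(prompt: str, ref_audio: str, output: str, dramabox_dir: str,
                  adaln_checkpoint: Optional[str] = "speaker_adaln.pt",
                  device: Optional[str] = None) -> List[str]:
    """Assemble an argv list for scripts/inference_adaln.py."""
    cmd = [
        "python", "scripts/inference_adaln.py",
        "--checkpoint", "model.safetensors",
        "--full-checkpoint", f"{dramabox_dir}/models/ltx-2.3-22b-dev.safetensors",
        "--dramabox-dir", dramabox_dir,
        "--ref-audio", ref_audio,
        "--prompt", prompt,
        "--output", output,
    ]
    # Speaker conditioning is optional; omitting the flag gives vanilla mode.
    if adaln_checkpoint is not None:
        cmd += ["--adaln-checkpoint", adaln_checkpoint]
    if device is not None:
        cmd += ["--device", device]
    return cmd


cmd = build_command(
    prompt="Warm, conversational tone. 'Hello, welcome to the show.'",
    ref_audio="/path/to/reference_speaker.wav",
    output="output.wav",
    dramabox_dir="/path/to/DramaBox",
    device="cuda:0",
)
# Print a shell-quoted version; pass `cmd` to subprocess.run(cmd, check=True) to execute.
print(shlex.join(cmd))
```

Building an argv list rather than a single string avoids shell-quoting problems with prompts that contain quotes, as the example prompt does.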

Post-Processing: Sidon + ChatterboxVC

For best quality, apply post-processing to the generated audio:

Step 1: Sidon Speech Restoration

Sidon cleans up DAC vocoder artifacts and improves MOS by ~0.15-0.22 points in the evaluation above (4.575 → 4.722 for LAIONBox, 4.455 → 4.672 for vanilla DramaBox):

pip install sidon

from sidon import Sidon

model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output.wav")
enhanced.save("output_sidon.wav")  # 48kHz mono

Step 2: ChatterboxVC Voice Conversion (optional)

ChatterboxVC converts the output to match a target speaker's voice. In the evaluation above, adding it before Sidon raises SpkSim from 0.874 to 0.927:

pip install chatterbox-tts

import torchaudio
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
# Convert output.wav to the voice of the reference speaker
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# generate() returns a (1, num_samples) tensor at vc.sr (24 kHz)
torchaudio.save("output_vc.wav", wav, vc.sr)
# Then run Sidon on output_vc.wav for best results

Recommended Pipeline

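For maximum speaker similarity: Generate → ChatterboxVC → Sidon

In the evaluation above, this pipeline reaches SpkSim 0.927 with MOS 4.677 and NISQA 4.508. Sidon alone gives the highest MOS (4.722) and NISQA (4.635), at a lower SpkSim of 0.874.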

Limitations

The DiT backbone is identical to vanilla DramaBox v13-merged — all improvements come from the AdaLN conditioning
Speaker conditioning requires reference audio; without it, the model behaves identically to vanilla DramaBox
The highest speaker similarity comes from the VC→Sidon post-processing pipeline, which adds latency
Trained primarily on English and German speech

License

Same license as the DramaBox base model. The AdaLN-Zero conditioning network and training code are released under Apache 2.0.

Citation

@misc{laionbox2026,
  title={LAIONBox v0.3: Speaker-Conditioned Audio Generation with AdaLN-Zero},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.3-wip}
}
