LAIONBox v0.3-wip

A speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and five differentiable auxiliary losses.

What's Different from DramaBox

LAIONBox v0.3 adds a 152M-parameter AdaLN-Zero speaker conditioning network (speaker_adaln.pt) that injects speaker identity into every transformer block of the 3.3B DiT backbone. The base DiT weights (model.safetensors) are identical to the DramaBox v13-merged pretrained model: in practice, only the AdaLN network was updated during training (see Training).

Architecture

Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
                                         ↓
                              SpeakerAdaLNZero (152M params)
                              bottleneck_dim=512
                                         ↓
                        Per-block scale/shift deltas (48 blocks × 9 params)
                                         ↓
                    Added to timestep embeddings in each DiT transformer block

The AdaLN-Zero design zero-initializes all output projections, so training starts from exactly the vanilla DramaBox behavior, and the model degrades gracefully to vanilla DramaBox when no speaker reference is provided.
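To make the zero-initialization point concrete, here is a minimal pure-Python sketch. It is not the module in scripts/speaker_adaln.py: the class name `ToySpeakerAdaLNZero`, the ReLU activation, the tiny bottleneck, and the use of one scalar per modulation parameter are all assumptions made for illustration. Only the embedding widths and the 48 × 9 output shape come from the diagram above.

```python
import random

# Embedding widths taken from the architecture diagram above.
EMBED_DIMS = {"wavlm_sv": 512, "orange_tbr": 128, "clap_how": 768, "clap_what": 768}
INPUT_DIM = sum(EMBED_DIMS.values())  # width of the concatenated speaker embedding


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


class ToySpeakerAdaLNZero:
    """Toy stand-in (hypothetical): random input projection, zero-initialized output."""

    def __init__(self, input_dim, bottleneck_dim, num_blocks, params_per_block, seed=0):
        rng = random.Random(seed)
        self.num_blocks = num_blocks
        self.params_per_block = params_per_block
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(input_dim)]
                     for _ in range(bottleneck_dim)]
        # AdaLN-Zero: the output projection starts at exactly zero.
        self.w_out = [[0.0] * bottleneck_dim
                      for _ in range(num_blocks * params_per_block)]

    def __call__(self, speaker_embedding):
        # ReLU is an assumed activation; the real module may use something else.
        hidden = [max(0.0, h) for h in matvec(self.w_in, speaker_embedding)]
        flat = matvec(self.w_out, hidden)
        # Reshape to one row of scale/shift deltas per transformer block.
        p = self.params_per_block
        return [flat[i * p:(i + 1) * p] for i in range(self.num_blocks)]


rng = random.Random(1)
embedding = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]
model = ToySpeakerAdaLNZero(INPUT_DIM, bottleneck_dim=8, num_blocks=48, params_per_block=9)
deltas = model(embedding)

print(INPUT_DIM, len(deltas), len(deltas[0]))        # 2176 48 9
print(all(d == 0.0 for row in deltas for d in row))  # True
```

Because every delta is zero before training, adding the deltas to the timestep embeddings leaves each block's modulation unchanged, which is why an untrained conditioning network reproduces the base model.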

Training

  • Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
  • Mode: Full fine-tuning (no LoRA) with separate AdaLN-Zero optimizer
  • Data: 21,734 samples from 3 sources:
    • DramaBox best-of-25 (9,404 samples, weight 0.4)
    • Podcast data (10,014 samples, weight 0.4)
    • Emolia emotional speech (2,316 samples, weight 0.2)
  • Learning rates: DiT lr=1e-5 (effectively frozen in bf16), AdaLN lr=5e-5
  • Epochs: 5 (850 optimizer steps, effective batch size 128)
  • Auxiliary losses (differentiable, ReFL-style through decoder):
    1. WavLM-SV speaker similarity
    2. Orange-tbr speaker similarity
    3. DNSMOS OVR quality
    4. VoiceCLAP-HOW naturalness
    5. VoiceCLAP-WHAT content accuracy
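  • Key insight: With mixed_precision="bf16" and lr=1e-5, Adam updates (about 5e-6) fall below the bf16 ULP (about 2.8e-4) for all DiT parameters, so they are rounded away when written back to the bf16 weights. The DiT backbone is therefore effectively frozen after warmup, and only the fp32 AdaLN network learns. This is consistent with AdaLN-Zero conditioning, where the conditioning network adapts on top of an unchanged backbone.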
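The rounding argument in the key insight can be checked with a small standard-library emulation of bfloat16. This is a sketch under assumptions: `to_bf16` and `bf16_ulp` are hypothetical helpers, the weight magnitude 0.05 is only an illustrative value, and a real optimizer step involves more than subtracting a constant. The ULP depends on the weight's magnitude, so the exact figure differs from parameter to parameter.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); NaN/inf not handled."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of a float32; round on the 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bf16 values near a nonzero normal x (8 significand bits)."""
    _, exponent = math.frexp(abs(x))
    return 2.0 ** (exponent - 8)


weight = to_bf16(0.05)  # illustrative weight magnitude (assumption)
update = 5e-6           # per-step update size quoted in the key insight

bf16_weight = weight
high_precision_weight = weight
for _ in range(850):    # 850 optimizer steps, as in the training run
    bf16_weight = to_bf16(bf16_weight - update)             # rounded back to bf16 each step
    high_precision_weight = high_precision_weight - update  # no bf16 rounding

print(bf16_ulp(weight))                # 0.000244140625
print(bf16_weight == weight)           # True: every update was rounded away
print(high_precision_weight < weight)  # True: the same updates accumulate without rounding
```

An update only changes a bf16 weight once it exceeds roughly half an ULP, which is why the higher-precision AdaLN parameters keep learning while the bf16 backbone stays put.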

Evaluation Results

Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model. Post-processing variants: Raw, Sidon (speech restoration), VC→Sidon (voice conversion + restoration).
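As a quick illustration of the evaluation grid, the sketch below enumerates every prompt, speaker, and seed combination. The identifiers are placeholders, since this card does not list the actual prompts, speakers, or seed values.

```python
from itertools import product

# Placeholder identifiers (assumptions); only the counts come from the text above.
prompts = [f"prompt_{i}" for i in range(6)]
speakers = [f"speaker_{i}" for i in range(5)]
seeds = [0, 1]

# One generated sample per (prompt, speaker, seed) combination.
eval_grid = list(product(prompts, speakers, seeds))
print(len(eval_grid))  # 60
```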

| Model | Variant | MOS↑ | UTMOS↑ | SpkSim↑ | NISQA↑ |
|---|---|---|---|---|---|
| Vanilla DramaBox | Raw | 4.455 | 3.176 | 0.818 | 4.134 |
| Vanilla DramaBox | Sidon | 4.672 | 3.323 | 0.813 | 4.524 |
| LAIONBox v0.3 | Raw | 4.575 | 3.630 | 0.887 | 4.334 |
| LAIONBox v0.3 | Sidon | 4.722 | 3.763 | 0.874 | 4.635 |
| LAIONBox v0.3 | VC→Sidon | 4.677 | 3.756 | 0.927 | 4.508 |

Key improvements over vanilla DramaBox (raw → raw):

  • UTMOS: +14.3% (3.176 → 3.630)
  • Speaker Similarity: +8.4% (0.818 → 0.887)
  • MOS: +2.7% (4.455 → 4.575)
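The percentages above are relative gains over the vanilla raw scores. A short sketch that recomputes them from the raw rows of the results table (the dictionary layout and function name are illustrative choices, not part of the evaluation code):

```python
# Raw (no post-processing) scores copied from the results table.
RAW_SCORES = {
    "Vanilla DramaBox": {"MOS": 4.455, "UTMOS": 3.176, "SpkSim": 0.818, "NISQA": 4.134},
    "LAIONBox v0.3": {"MOS": 4.575, "UTMOS": 3.630, "SpkSim": 0.887, "NISQA": 4.334},
}


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


baseline = RAW_SCORES["Vanilla DramaBox"]
improved = RAW_SCORES["LAIONBox v0.3"]
gains = {
    metric: round(relative_gain_percent(baseline[metric], improved[metric]), 1)
    for metric in baseline
}
print(gains)  # {'MOS': 2.7, 'UTMOS': 14.3, 'SpkSim': 8.4, 'NISQA': 4.8}
```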

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 6.2 GB | DiT backbone (3.3B params, bf16), same as DramaBox v13-merged |
| speaker_adaln.pt | 1.7 GB | AdaLN-Zero speaker conditioning (152M params, fp32) |
| scripts/inference_adaln.py | 26 KB | Inference script with speaker conditioning |
| scripts/speaker_adaln.py | 7 KB | AdaLN-Zero module definition |
| training_args.json | 1 KB | Training hyperparameters |
| eval_enhanced_report.html | 108 MB | Interactive eval grid with embedded audio |

Inference

Prerequisites

You need the DramaBox repository and its dependencies:

git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt

Basic Usage

python scripts/inference_adaln.py \
    --checkpoint model.safetensors \
    --adaln-checkpoint speaker_adaln.pt \
    --full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir /path/to/DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
    --output output.wav \
    --device cuda:0

Without AdaLN (vanilla DramaBox mode)

Simply omit --adaln-checkpoint to run as standard DramaBox:

python scripts/inference_adaln.py \
    --checkpoint model.safetensors \
    --full-checkpoint /path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir /path/to/DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Your prompt here" \
    --output output.wav
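When scripting many generations, it can help to assemble the two command variants programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only the flags shown in the commands above and leaves out --adaln-checkpoint for vanilla mode.

```python
import shlex


def build_inference_command(prompt, ref_audio, output, dramabox_dir, full_checkpoint,
                            checkpoint="model.safetensors", adaln_checkpoint=None,
                            device=None):
    """Assemble the argv list for scripts/inference_adaln.py (hypothetical helper)."""
    cmd = ["python", "scripts/inference_adaln.py", "--checkpoint", checkpoint]
    if adaln_checkpoint is not None:
        # Omitting this flag runs the script as vanilla DramaBox.
        cmd += ["--adaln-checkpoint", adaln_checkpoint]
    cmd += [
        "--full-checkpoint", full_checkpoint,
        "--dramabox-dir", dramabox_dir,
        "--ref-audio", ref_audio,
        "--prompt", prompt,
        "--output", output,
    ]
    if device is not None:
        cmd += ["--device", device]
    return cmd


common = dict(
    prompt="Warm, conversational tone. 'Hello, welcome to the show.'",
    ref_audio="/path/to/reference_speaker.wav",
    output="output.wav",
    dramabox_dir="/path/to/DramaBox",
    full_checkpoint="/path/to/DramaBox/models/ltx-2.3-22b-dev.safetensors",
)
conditioned = build_inference_command(adaln_checkpoint="speaker_adaln.pt",
                                      device="cuda:0", **common)
vanilla = build_inference_command(**common)

# Print a shell-safe version of the speaker-conditioned command.
print(shlex.join(conditioned))
```

Passing the resulting list to `subprocess.run(conditioned, check=True)` would launch the script. Building the command as a list avoids shell-quoting problems with prompts that contain quotes.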

Post-Processing: Sidon + ChatterboxVC

For best quality, apply post-processing to the generated audio:

Step 1: Sidon Speech Restoration

Sidon cleans up DAC vocoder artifacts and improves MOS by roughly 0.15 to 0.22 points in the evaluation above:

pip install sidon

from sidon import Sidon

model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output.wav")
enhanced.save("output_sidon.wav")  # 48kHz mono

Step 2: ChatterboxVC Voice Conversion (optional)

ChatterboxVC converts the output to match a target speaker's voice, boosting SpkSim from ~0.87 to ~0.93:

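pip install chatterbox-tts

import torchaudio
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
# Convert the generated audio toward the reference speaker's voice.
# In the upstream chatterbox-tts package the reference is passed as target_voice_path.
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# generate() returns a (1, samples) tensor at vc.sr (24 kHz), so save it directly.
torchaudio.save("output_vc.wav", wav, vc.sr)
# Then run Sidon on output_vc.wav for best results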

Recommended Pipeline

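For the highest speaker similarity: Generate → ChatterboxVC → Sidon

In the evaluation above, this pipeline reaches SpkSim 0.927 with MOS 4.677 and NISQA 4.508. Sidon alone scores slightly higher on MOS (4.722) and NISQA (4.635), at lower speaker similarity (0.874).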

Limitations

  • The DiT backbone is identical to vanilla DramaBox v13-merged — all improvements come from the AdaLN conditioning
  • Speaker conditioning requires reference audio; without it, the model behaves identically to vanilla DramaBox
  • The highest speaker similarity comes from the VC→Sidon post-processing pipeline, which adds latency
  • Trained primarily on English and German speech

License

Same as DramaBox base model. The AdaLN-Zero conditioning network and training code are released under Apache 2.0.

Citation

@misc{laionbox2026,
  title={LAIONBox v0.3: Speaker-Conditioned Audio Generation with AdaLN-Zero},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.3-wip}
}