LAIONBox v0.4-wip

Speaker-conditioned audio generation model based on DramaBox (LTX-2.3-22B-dev Audio-Only), fine-tuned with AdaLN-Zero speaker conditioning and 5 differentiable auxiliary losses.

What's New in v0.4 (vs v0.3)

LAIONBox v0.4 uses a two-stage training approach that produces higher quality speech than v0.3:

  1. Stage 1 (Run 7): LoRA rank-64 fine-tuning of the DiT backbone (113M trainable params, 3.2% of DiT), which teaches the model better speech patterns
  2. Stage 2 (Run 8): Merge the LoRA weights into the DiT, freeze the merged model, and train a fresh AdaLN-Zero speaker conditioning network on top, which learns speaker identity without disturbing the improved backbone

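This "LoRA-merge then AdaLN" recipe outperforms both standalone LoRA (Run 7) and standalone AdaLN-Zero (v0.3 / Run 5) on most metrics after post-processing. On raw output, v0.4 improves UTMOS but trails v0.3 slightly on MOS, NISQA, and SpkSim (see the table below).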

Key Improvements over v0.3

| Metric | v0.3 (Run 5) | v0.4 (Run 8) | Delta |
|---|---|---|---|
| UTMOS | 3.630 | 3.655 | +0.025 |
| MOS | 4.575 | 4.501 | -0.074 |
| SpkSim | 0.887 | 0.884 | -0.003 |
| NISQA | 4.334 | 4.198 | -0.136 |
| UTMOS (Sidon) | 3.763 | 3.855 | +0.092 |
| SpkSim (VC→Sidon) | 0.927 | 0.934 | +0.007 |

The real gains show after post-processing: +0.092 UTMOS with Sidon and +0.007 SpkSim with VC→Sidon — v0.4 produces audio that responds better to speech restoration and voice conversion.

Architecture

Reference Audio → [WavLM-SV(512) + Orange-tbr(128) + CLAP-HOW(768) + CLAP-WHAT(768)]
                                         ↓
                          SpeakerAdaLNZero (455M params)
                          bottleneck_dim=512
                                         ↓
                    Per-block scale/shift deltas (48 blocks × 9 params)
                                         ↓
                Added to timestep embeddings in each DiT transformer block
                                         ↓
               LoRA-merged DiT backbone (3.3B params, frozen during Stage 2)

The DiT backbone in v0.4 contains merged LoRA rank-64 weights from Stage 1, giving it better speech generation capabilities than the vanilla DramaBox backbone used in v0.3. The AdaLN-Zero network is trained on top of this improved backbone.

AdaLN-Zero Design

The conditioning network uses zero-initialization on all output projections, meaning it starts with zero conditioning effect and gradually learns to modulate the DiT's behavior. This ensures:

  • Graceful degradation to vanilla generation when no speaker reference is provided
  • Stable training: the model never "forgets" how to generate speech
  • The frozen DiT backbone is never modified during Stage 2
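
The effect of zero-initialization can be illustrated with a toy conditioning head. This is a minimal sketch, not the SpeakerAdaLNZero implementation: the class name, the ReLU bottleneck, and the tiny dimensions are assumptions chosen for readability, and plain Python lists stand in for tensors. It shows that a head whose output projection starts at zero leaves the timestep modulation unchanged.

```python
import random


def linear(weight, bias, x):
    """Dense layer on plain Python lists: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class ZeroInitSpeakerHead:
    """Hypothetical toy head: speaker embedding -> bottleneck -> modulation delta."""

    def __init__(self, embed_dim, bottleneck_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Input projection: ordinary small random initialization.
        self.w_in = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
                     for _ in range(bottleneck_dim)]
        self.b_in = [0.0] * bottleneck_dim
        # Output projection: all zeros, so the head initially emits a zero delta.
        self.w_out = [[0.0] * bottleneck_dim for _ in range(out_dim)]
        self.b_out = [0.0] * out_dim

    def __call__(self, speaker_embedding):
        hidden = [max(0.0, h) for h in linear(self.w_in, self.b_in, speaker_embedding)]
        return linear(self.w_out, self.b_out, hidden)


def modulate(timestep_modulation, delta):
    """Add the speaker delta to a block's timestep modulation parameters."""
    return [m + d for m, d in zip(timestep_modulation, delta)]


head = ZeroInitSpeakerHead(embed_dim=8, bottleneck_dim=4, out_dim=9)
speaker = [0.5] * 8                      # stand-in for a speaker embedding
base = [0.1 * i for i in range(9)]       # stand-in for one block's 9 modulation params
conditioned = modulate(base, head(speaker))
print(conditioned == base)               # True: no conditioning effect at initialization
```

Because the delta is exactly zero at the start of training, the first forward pass through the frozen backbone matches unconditioned generation, and gradients flowing into the output projection gradually introduce the speaker effect.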

Training Details

Stage 1: LoRA Fine-Tuning (Run 7)

  • Base model: LTX-2.3-22B-dev audio-only (v13-merged), 3.3B parameters
  • LoRA config: Rank 64, alpha 64, targeting audio_attn1 + audio_ff modules
  • Trainable params: 113M (3.2% of DiT)
  • Learning rate: 4e-5 with cosine schedule
  • Epochs: 2 (340 steps, effective batch size 128)
  • Precision: bf16 mixed precision
  • Note: LoRA training in bf16 is effective for ~1 epoch before the bf16 ULP floor freezes updates (see Learnings section)

Stage 2: AdaLN-Zero on Frozen Merged DiT (Run 8)

  • Base model: Stage 1 LoRA merged into vanilla DiT → merged_dit.safetensors
  • DiT: Completely frozen (lr=0), no gradients, no weight saves
  • AdaLN-Zero: 455M parameters, freshly initialized
  • Learning rate: 7e-5 with cosine schedule, 25% warmup
  • Epochs: 6 (1,020 steps, effective batch size 128)
  • Best checkpoint: Step 850 (epoch 5.0 at 170 steps per epoch, selected by validation)
  • Precision: bf16 DiT + fp32 AdaLN
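
The step counts for both stages follow from the dataset size (21,734 samples) and the effective batch size. A minimal sketch of that arithmetic, assuming the final partial batch of each epoch still counts as one optimizer step; the helper names are illustrative and not taken from the training code:

```python
import math

DATASET_SIZE = 21_734        # total training samples stated in this card
EFFECTIVE_BATCH_SIZE = 128   # effective batch size used in both stages


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    # Assumption: a trailing partial batch is kept rather than dropped.
    return math.ceil(num_samples / batch_size)


def total_steps(epochs: int) -> int:
    return epochs * steps_per_epoch(DATASET_SIZE, EFFECTIVE_BATCH_SIZE)


def epoch_at(step: int) -> float:
    """Convert a logged optimizer step back into a (fractional) epoch index."""
    return step / steps_per_epoch(DATASET_SIZE, EFFECTIVE_BATCH_SIZE)


print(steps_per_epoch(DATASET_SIZE, EFFECTIVE_BATCH_SIZE))  # 170
print(total_steps(2), total_steps(6))                        # 340 1020
```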

Training Data

21,734 samples from 3 sources:

  • DramaBox best-of-25 (9,404 samples, weight 0.4): high-quality TTS generations
  • Podcast data (10,014 samples, weight 0.4): natural conversational speech
  • Emolia emotional speech (2,316 samples, weight 0.2): expressive/emotional range
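
The card does not state how the source weights are applied. One plausible reading, sketched below as an assumption, treats each weight as the probability of drawing from that source, followed by a uniform draw within the source. Under that reading, smaller sources are oversampled per clip. The source keys and function names are hypothetical.

```python
import random

# (name, number of samples, source weight) as listed above
SOURCES = [
    ("dramabox_best_of_25", 9_404, 0.4),
    ("podcast", 10_014, 0.4),
    ("emolia", 2_316, 0.2),
]


def per_sample_probability(sources):
    """Probability of drawing one specific clip from each source."""
    return {name: weight / count for name, count, weight in sources}


def draw(sources, rng):
    """Pick a source by weight, then a sample index uniformly within it."""
    names = [name for name, _, _ in sources]
    weights = [weight for _, _, weight in sources]
    counts = {name: count for name, count, _ in sources}
    name = rng.choices(names, weights=weights, k=1)[0]
    return name, rng.randrange(counts[name])


probs = per_sample_probability(SOURCES)
# Under this assumption an Emolia clip is seen about twice as often as a DramaBox clip.
print(round(probs["emolia"] / probs["dramabox_best_of_25"], 2))  # 2.03
```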

Auxiliary Losses

5 differentiable losses computed through the DramaBox decoder (ReFL-style):

| Loss | Weight | Purpose |
|---|---|---|
| WavLM-SV speaker similarity | adaptive (≤20) | Match speaker identity |
| Orange-tbr speaker similarity | adaptive (≤20) | Secondary speaker matching |
| DNSMOS OVR quality | adaptive (≤20) | Speech quality |
| VoiceCLAP-HOW naturalness | adaptive (≤20) | Speaking style/naturalness |
| VoiceCLAP-WHAT content accuracy | 0 (disabled in Run 8) | Content preservation |

Auxiliary loss coefficients adapt via an exponential moving average (EMA) that targets a 6:1 ratio between the flow loss and the total auxiliary contribution. Each coefficient is capped at 20.
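
A minimal sketch of one way such coefficients could be computed. The exact update rule is not given in this card, so the even split of the auxiliary budget across active losses, the EMA decay of 0.99, and the class and key names are all assumptions. Only the 6:1 target ratio and the cap of 20 come from the text above.

```python
class AdaptiveAuxWeights:
    """Hypothetical EMA-based scaling that keeps total aux loss near flow_loss / ratio."""

    def __init__(self, target_ratio=6.0, max_coeff=20.0, decay=0.99):
        self.target_ratio = target_ratio
        self.max_coeff = max_coeff
        self.decay = decay
        self.ema = {}  # running average of each raw loss value

    def _track(self, name, value):
        previous = self.ema.get(name)
        if previous is None:
            self.ema[name] = value  # first observation seeds the average
        else:
            self.ema[name] = self.decay * previous + (1.0 - self.decay) * value
        return self.ema[name]

    def update(self, flow_loss, aux_losses):
        """Return one coefficient per active auxiliary loss."""
        if not aux_losses:
            return {}
        flow = self._track("_flow", flow_loss)
        # Assumption: the auxiliary budget is split evenly across active losses.
        budget = flow / self.target_ratio / len(aux_losses)
        coeffs = {}
        for name, value in aux_losses.items():
            smoothed = max(self._track(name, value), 1e-8)
            coeffs[name] = min(self.max_coeff, budget / smoothed)
        return coeffs


weights = AdaptiveAuxWeights()
raw = {"wavlm_sv": 0.05, "dnsmos": 0.5}
coeffs = weights.update(flow_loss=0.6, aux_losses=raw)
total_aux = sum(coeffs[name] * raw[name] for name in raw)
print(round(0.6 / total_aux, 3))  # 6.0
```

Scaling by smoothed rather than instantaneous values keeps the coefficients from jumping when a single batch produces an unusually large or small auxiliary loss, and the cap stops a near-zero loss from receiving an unbounded weight.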

Evaluation Results

Evaluated on 6 prompts × 5 speakers × 2 seeds = 60 samples per model, in up to three variants: raw output, Sidon-enhanced, and ChatterboxVC → Sidon.

Raw Output (DramaBox decoder only)

| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| Vanilla DramaBox | 4.412 | 3.136 | 4.107 | 0.818 | n/a | n/a |
| LAIONBox v0.3 (Run 5 AdaLN) | 4.575 | 3.630 | 4.334 | 0.887 | 6.058 | 7.675 |
| LAIONBox v0.4 (Run 8) | 4.501 | 3.655 | 4.198 | 0.884 | 6.013 | 7.686 |

Sidon Enhanced (speech restoration)

| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| Vanilla DramaBox | 4.663 | 3.280 | 4.516 | 0.813 | n/a | n/a |
| LAIONBox v0.3 | 4.722 | 3.763 | 4.635 | 0.874 | 6.200 | 7.864 |
| LAIONBox v0.4 | 4.751 | 3.855 | 4.670 | 0.875 | 6.183 | 7.878 |

ChatterboxVC → Sidon (voice conversion + restoration)

| Model | MOS ↑ | UTMOS ↑ | NISQA ↑ | SpkSim ↑ | CE ↑ | PQ ↑ |
|---|---|---|---|---|---|---|
| LAIONBox v0.3 | 4.677 | 3.756 | 4.508 | 0.927 | 6.029 | 7.783 |
| LAIONBox v0.4 | 4.690 | 3.835 | 4.514 | 0.934 | 6.109 | 7.874 |

Improvements over Vanilla DramaBox (raw → raw)

  • UTMOS: +16.6% (3.136 → 3.655)
  • Speaker Similarity: +8.1% (0.818 → 0.884)
  • MOS: +2.0% (4.412 → 4.501)

Files

| File | Size | Description |
|---|---|---|
| model.safetensors | 6.2 GB | LoRA-merged DiT backbone (3.3B params, bf16) |
| speaker_adaln.pt | 1.7 GB | AdaLN-Zero speaker conditioning network (455M params, fp32), step 850 |
| scripts/inference_adaln.py | 26 KB | Full inference script with speaker conditioning, LoRA support, batch generation |
| scripts/speaker_adaln.py | 7 KB | AdaLN-Zero module definition (SpeakerAdaLNZero class) |
| training_args.json | 1 KB | Training hyperparameters for Run 8 |
| eval/eval_full_report.html | 0.3 MB | Interactive evaluation report with metric tables and audio players |

Inference

Prerequisites

# Clone DramaBox
git clone https://github.com/LTX-Video/DramaBox.git
cd DramaBox
pip install -r requirements.txt

# Download model files
pip install huggingface_hub
huggingface-cli download laion/laionbox-v0.4-wip --local-dir laionbox-v0.4

Basic Usage

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --ref-audio /path/to/reference_speaker.wav \
    --prompt "Warm, conversational tone. 'Hello, welcome to the show.'" \
    --output output.wav \
    --device cuda:0

Without Speaker Conditioning (vanilla mode)

Omit --adaln-checkpoint to run as standard DramaBox with the improved LoRA-merged backbone:

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --prompt "A calm narrator reading a story." \
    --output output.wav

Batch Generation

Generate multiple samples with different seeds for best-of-N selection:

python laionbox-v0.4/scripts/inference_adaln.py \
    --checkpoint laionbox-v0.4/model.safetensors \
    --adaln-checkpoint laionbox-v0.4/speaker_adaln.pt \
    --full-checkpoint DramaBox/models/ltx-2.3-22b-dev.safetensors \
    --dramabox-dir DramaBox \
    --ref-audio speaker_ref.wav \
    --prompt "Excited announcer. 'And the winner is...!'" \
    --output-dir ./outputs \
    --seeds 42,123,456 \
    --device cuda:0
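
Once several seeds have been generated, best-of-N selection reduces to scoring each file and keeping the top one. A minimal sketch, assuming you supply your own scoring function (for example a UTMOS or speaker-similarity scorer). The file names and scores below are placeholders, since the script's output naming scheme is not documented here.

```python
from typing import Callable, Sequence


def best_of_n(paths: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate path with the highest score."""
    if not paths:
        raise ValueError("no candidates to choose from")
    return max(paths, key=score)


# Placeholder scores standing in for a real metric computed on each file.
fake_scores = {"seed_42.wav": 3.61, "seed_123.wav": 3.74, "seed_456.wav": 3.58}
print(best_of_n(list(fake_scores), fake_scores.__getitem__))  # seed_123.wav
```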

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| --checkpoint | required | Path to model.safetensors (LoRA-merged DiT) |
| --adaln-checkpoint | optional | Path to speaker_adaln.pt (omit for unconditioned) |
| --full-checkpoint | required | Path to original DramaBox ltx-2.3-22b-dev.safetensors (for decoder/conditioner) |
| --ref-audio | optional | Reference audio for speaker conditioning (WAV/MP3, 3-15s recommended) |
| --prompt | required | Text prompt describing speech style + content |
| --cfg-scale | 4.5 | Classifier-free guidance scale |
| --stg-scale | 0.0 | Spatiotemporal guidance scale |
| --steps | 50 | Number of diffusion steps |
| --duration | auto | Target duration in seconds (auto-estimated from text) |

Post-Processing Pipeline

For best quality, apply the Generate → ChatterboxVC → Sidon pipeline:

Step 1: Generate with LAIONBox

# See inference commands above

Step 2: ChatterboxVC Voice Conversion (optional, +0.05 SpkSim)

ChatterboxVC converts the output to match the target speaker more closely:

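import torchaudio
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
# Convert the generated audio toward the reference speaker's voice.
# The keyword name follows the upstream Chatterbox VC example; check your installed version.
wav = vc.generate(
    audio="output.wav",
    target_voice_path="reference_speaker.wav",
)
# generate() returns a (1, num_samples) tensor at vc.sr (24 kHz).
torchaudio.save("output_vc.wav", wav, vc.sr)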

Step 3: Sidon Speech Restoration (+0.15 to +0.25 MOS, +0.30 to +0.47 NISQA)

Sidon removes DAC vocoder artifacts and improves perceptual quality:

from sidon import Sidon

model = Sidon.from_pretrained("sarulab-speech/sidon-v0.1")
enhanced = model.enhance("output_vc.wav")  # or "output.wav" if skipping VC
enhanced.save("output_final.wav")  # 48kHz mono

Expected Quality with Full Pipeline

| Metric | Raw | + Sidon | + VC → Sidon |
|---|---|---|---|
| MOS | 4.50 | 4.75 | 4.69 |
| UTMOS | 3.66 | 3.86 | 3.84 |
| NISQA | 4.20 | 4.67 | 4.51 |
| SpkSim | 0.88 | 0.88 | 0.93 |

Learnings from the Training Campaign (Runs 5–9)

What Worked

| Technique | Impact |
|---|---|
| LoRA-merge + fresh AdaLN | Best overall recipe. LoRA improves the backbone; AdaLN adds speaker identity cleanly on top. |
| FP32 master weights | Solves bf16 ULP floor for LoRA training. CPU-offloaded fp32 copies (~2.5 GB RAM) keep Adam updates alive. Zero throughput impact. |
| Separate optimizers | Different LRs for DiT vs AdaLN (4e-5 vs 1e-5) and independent freeze/unfreeze. |
| Weight debug logging | Hash + Δ% tracking exposed training freezes that loss curves alone cannot detect. |
| Sidon post-processing | Gains of +0.15 to +0.25 MOS and +0.30 to +0.47 NISQA across the evaluated models. Essential for production. |
| Adaptive aux loss coefficients | EMA-based coefficient scaling prevents any single auxiliary loss from dominating. |

What Didn't Work

| Approach | Problem |
|---|---|
| Standard full fine-tune | 3.5B params at lr=2e-6 doesn't converge in 1 epoch. Too slow, too expensive. |
| LoRA in bf16 beyond epoch 1 | bf16 stores 7 mantissa bits (8 bits of precision with the implicit leading bit); for weights ~0.01, ULP ≈ 8.58e-5 (2^-14 ≈ 6.1e-5 for a weight of exactly 0.01), but Adam updates are ~1.72e-5. Updates round to zero. |
| Stacking LoRA on merged model | Diminishing returns: adding more LoRA on top of LoRA+AdaLN doesn't improve quality. |
| Training AdaLN beyond 5 epochs | Peak at epoch 4–5; epochs 5–6 show mild overfitting. |
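
The bf16 freeze described in the "LoRA in bf16" row can be reproduced with a few lines of standard-library Python. This is an illustrative sketch, not the training code: `to_bf16` emulates round-to-nearest-even bfloat16 storage by truncating a float32 bit pattern, and an ordinary Python float stands in for the fp32 master copy. Near 0.01 adjacent bf16 values are 2^-14 ≈ 6.1e-5 apart, so a 1.72e-5 update is less than half a step and is rounded away.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                     # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


UPDATE = 1.72e-5            # typical Adam step size quoted in the table
start = to_bf16(0.01)       # weight as it would be stored in bf16
w_bf16 = start              # weight updated directly in bf16
w_master = start            # higher-precision master copy (Python float stand-in)

for _ in range(4):
    w_bf16 = to_bf16(w_bf16 + UPDATE)   # each update is rounded away
    w_master += UPDATE                   # updates accumulate in the master copy

print(w_bf16 == start)                   # True: the bf16 weight never moved
print(to_bf16(w_master) > start)         # True: the master copy crossed a bf16 step
```

This is why the fp32 master weights listed under "What Worked" matter: the accumulated update becomes visible in bf16 only once it exceeds half the local spacing.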

Limitations

  • Speaker conditioning requires reference audio (3–15 seconds recommended); without it the model runs as vanilla DramaBox with improved backbone
  • Best results require the VC β†’ Sidon post-processing pipeline, which adds latency (~5s for VC + ~3s for Sidon on GPU)
  • Trained primarily on English and German speech
  • The LoRA-merged backbone has slightly different characteristics than vanilla DramaBox, so prompts may need minor adjustment

Citation

@misc{laionbox2026v04,
  title={LAIONBox v0.4: Two-Stage Speaker-Conditioned Audio Generation with LoRA-Merged DiT and AdaLN-Zero},
  author={LAION},
  year={2026},
  url={https://huggingface.co/laion/laionbox-v0.4-wip}
}

Model Details

  • Total model size: 3.8B parameters (3.3B DiT + 455M AdaLN)
  • DiT tensor type: BF16 (safetensors)
  • AdaLN tensor type: FP32 (PyTorch)
  • Training hardware: 8× GPU (DDP, bf16 mixed precision)
  • Training time: ~6 hours Stage 1 (Run 7) + ~17 hours Stage 2 (Run 8)
  • Framework: PyTorch + Accelerate
  • License: Apache 2.0
  • Organization: LAION e.V.