LAIONBox v0.5 WIP - DramaBox LoRA-Only (Run 10)

Overview

This is a LoRA-only fine-tuning of the DramaBox 3.29B DiT (Diffusion Transformer) for expressive voice synthesis. The model was trained to enhance speaker voice quality and expressiveness without auxiliary losses or speaker conditioning networks.

Key differences from prior runs:

  • LoRA-only: No AdaLN speaker conditioning, no auxiliary losses
  • Clean architecture: Pure flow matching loss on LoRA parameters
  • Efficient: Rank-128 LoRA = 226M trainable parameters (about 6.9% of the 3.29B-parameter base DiT, or about 6.4% of the roughly 3.5B total once the LoRA weights are added)
  • Fast convergence: 3 epochs, ~4 hours training on 8×H100

Model Details

Architecture

  • Base Model: DramaBox Audio-Only DiT (LTX-2.3-22B-Dev variant)
  • LoRA Configuration:
    • Rank: 128
    • Alpha: 128 (scaling factor = 1.0)
    • Target modules: Audio attention and feedforward layers
      • audio_attn1.to_q, audio_attn1.to_k, audio_attn1.to_v, audio_attn1.to_out.0
      • audio_ff.net.0.proj, audio_ff.net.2
    • Total trainable parameters: ~226M
    • Dropout: 0.0
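
As a rough illustration of the configuration above, the sketch below computes the LoRA scaling factor (alpha / rank) and the number of parameters a single rank-128 adapter adds to one linear layer. The layer width of 4096 is a hypothetical placeholder chosen for the example; this card does not state the DramaBox hidden size, so the per-layer count is not expected to reproduce the ~226M total.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling: the low-rank update B @ A is multiplied by alpha / rank."""
    return alpha / rank


def lora_params_per_linear(in_features: int, out_features: int, rank: int) -> int:
    """A has shape (rank, in_features) and B has shape (out_features, rank)."""
    return rank * (in_features + out_features)


RANK, ALPHA = 128, 128
HYPOTHETICAL_WIDTH = 4096  # assumption for illustration only, not the real layer size

scaling = lora_scaling(ALPHA, RANK)
params = lora_params_per_linear(HYPOTHETICAL_WIDTH, HYPOTHETICAL_WIDTH, RANK)
print(f"scaling = {scaling}, parameters per square layer = {params:,}")
```

With alpha equal to rank, the adapter output is added to the frozen layer output without any extra rescaling.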

Training Configuration

Hyperparameters:

  • Learning rate: 1e-4 (linear warmup for 100 steps)
  • Optimizer: AdamW
  • Effective batch size: 128 (16 gradient accumulation steps × 8 GPUs, which implies a per-GPU micro-batch of 1)
  • Epochs: 3
  • Total steps: 1479
  • Loss function: Flow matching (per-token MSE with loss masking)
  • Mixed precision: BF16
  • Gradient checkpointing: Enabled (use_reentrant=False)
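
The loss listed above (flow matching, per-token MSE with loss masking) can be sketched as follows. This is a minimal NumPy illustration, not the training script: it assumes a linear interpolation path between clean latents and noise with a velocity target of `noise - clean_latents`, which is one common flow-matching convention and is not confirmed by this card. All function names and tensor shapes are hypothetical.

```python
import numpy as np


def noisy_latents(clean_latents, noise, t):
    """Linear path between data (t=0) and noise (t=1); t has shape (batch,)."""
    t = np.asarray(t, dtype=clean_latents.dtype).reshape(-1, 1, 1)
    return (1.0 - t) * clean_latents + t * noise


def masked_flow_matching_loss(pred_velocity, clean_latents, noise, token_mask):
    """Per-token MSE against the target velocity, averaged over unmasked tokens.

    pred_velocity, clean_latents, noise: (batch, tokens, channels)
    token_mask: (batch, tokens), 1 for real tokens and 0 for padding
    """
    target_velocity = noise - clean_latents  # assumed target convention
    per_token_mse = ((pred_velocity - target_velocity) ** 2).mean(axis=-1)
    mask = token_mask.astype(per_token_mse.dtype)
    return float((per_token_mse * mask).sum() / max(mask.sum(), 1.0))


rng = np.random.default_rng(0)
clean = rng.normal(size=(2, 4, 3))
noise = rng.normal(size=(2, 4, 3))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # trailing tokens are padding

x_t = noisy_latents(clean, noise, t=[0.25, 0.75])  # what the DiT would receive
fake_prediction = np.zeros_like(clean)             # stand-in for the model output
print(masked_flow_matching_loss(fake_prediction, clean, noise, mask))
```

Masking matters when clips in a batch have different lengths: padded positions contribute nothing to the numerator or the denominator, so short clips are not diluted by padding.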

Data:

  • ~530 hours of diverse voice acting and dialogue audio
  • 7 dataset subsets:
    • Annotated audio samples (~86 shards)
    • Character voices (~98 shards)
    • Ears dataset (~33 shards)
    • Elise dataset (~2 shards)
    • Gemini finetune data (~47 shards)
    • Podcast balanced (~17 shards)
    • Tuning data (~125 shards)
  • Total: 408 tar shards, streaming via WebDataset
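
As a quick consistency check, the per-subset shard counts listed above can be tallied against the stated total. The counts are approximate in the source (each is marked "~"), and a subset's share of shards is not necessarily its share of audio hours, so the percentages below are only indicative.

```python
# Shard counts copied from the list above (approximate in the source).
SHARDS = {
    "Annotated audio samples": 86,
    "Character voices": 98,
    "Ears dataset": 33,
    "Elise dataset": 2,
    "Gemini finetune data": 47,
    "Podcast balanced": 17,
    "Tuning data": 125,
}

total_shards = sum(SHARDS.values())

# Print subsets from largest to smallest with their share of all shards.
for name, count in sorted(SHARDS.items(), key=lambda item: -item[1]):
    print(f"{name:<26}{count:>4}  {count / total_shards:6.1%}")
print(f"{'Total':<26}{total_shards:>4}")
```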

Training Environment:

  • 8× NVIDIA H100 GPUs
  • DDP via HuggingFace Accelerate
  • NCCL communication with 600s timeout
  • Cloudflare monitoring and watchdog supervision

Why LoRA-Only?

Previous runs (Run 5-9) explored various approaches:

  • Run 5 (AdaLN-Zero): Large speaker conditioning network, good but complex
  • Run 6 (Full FT): Too slow, poor convergence at lr=2e-6
  • Run 7 (LoRA64 + bf16): Hit bf16 ULP floor (updates too small)
  • Run 8 (Frozen LoRA-merged + AdaLN): Best previous (0.115 flow loss)
  • Run 9 (LoRA128 + fp32 master + AdaLN): Marginal gains over Run 8
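
The "bf16 ULP floor" noted for Run 7 can be illustrated with a small simulation. bfloat16 keeps only 8 bits of significand precision, so near a weight of 1.0 the spacing between representable values is 2^-7 (about 0.0078), and any update smaller than half of that is rounded away. The sketch below emulates bf16 rounding in pure Python; the weight and update sizes are hypothetical values chosen to show the effect, not numbers from the training logs.

```python
import struct


def round_to_fp32(value: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def round_to_bf16(value: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the top 16 bits of an IEEE float32, so the low 16 bits are
    rounded away. Illustrative only: NaN and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (((bits + rounding_bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


weight = 1.0          # hypothetical weight value
small_update = 1e-4   # hypothetical update, below half a bf16 ULP at 1.0
large_update = 5e-3   # hypothetical update, above half a bf16 ULP at 1.0

print("bf16, small update:", round_to_bf16(weight + small_update))
print("bf16, large update:", round_to_bf16(weight + large_update))
print("fp32, small update:", round_to_fp32(weight + small_update))
```

Keeping the trainable weights in fp32, as Runs 9 and 10 do, preserves small updates that a pure bf16 weight would silently drop.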

Run 10 (LoRA-Only) simplifies the architecture by:

  1. Removing the 455M-parameter AdaLN speaker conditioning network
  2. Removing all auxiliary losses (speaker loss, KL divergence)
  3. Training the LoRA weights in fp32, with gradient checkpointing to reduce activation memory
  4. Letting the LoRA weights absorb speaker and expressiveness directly

Result: Clean, interpretable model that achieves competitive quality without speaker-specific conditioning.

Evaluation Results

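Evaluated on 6 prompts × 5 reference speakers × 2 seeds = 60 speaker-conditioned samples, plus 12 unconditional samples, for 72 samples per variant. VC→Sidon and SpkSim require a reference speaker, so they cover only the 60 speaker-conditioned samples.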

Metrics Explanation

Speech Quality (SQA):

  • MOS: Mean Opinion Score (1-5, higher better)
  • UTMOS: UTokyo-SaruLab MOS prediction system (predicted MOS, 1-5, higher better)
  • NISQA_MOS: No-reference speech quality assessment (1-5, higher better)
  • DNSMOS_OVRL: Overall DNS MOS (1-5, higher better)

Aesthetics (AudioBox):

  • CE: Content Enjoyment (1-10, higher better)
  • CU: Content Usefulness (1-10, higher better)
  • PC: Production Complexity (1-10; describes how complex the audio scene is rather than how good it is; not reported in the tables below)
  • PQ: Production Quality (1-10, higher better)

Speaker Similarity (WavLM-SV):

  • SpkSim: Speaker similarity to reference (0-1, higher better)

Performance by Variant

RAW (Direct Model Output)

Metric         Mean    Std Dev   N
MOS            4.41    0.32      72
UTMOS          3.32    0.61      72
NISQA_MOS      4.03    0.44      72
DNSMOS_OVRL    3.36    0.13      72
CE             5.91    0.35      72
CU             6.87    0.45      72
PQ             7.66    0.48      72
SpkSim         0.892   0.104     60

SIDON (Speech Restoration Post-Processing)

Metric         Mean    Std Dev   N
MOS            4.67    0.17      72
UTMOS          3.55    0.64      72
NISQA_MOS      4.52    0.27      72
DNSMOS_OVRL    3.43    0.10      72
CE             6.12    0.26      72
CU             7.06    0.27      72
PQ             7.91    0.16      72
SpkSim         0.889   0.101     60

VC→SIDON (Voice Conversion + Restoration)

Metric         Mean    Std Dev   N
MOS            4.65    0.21      60
UTMOS          3.71    0.56      60
NISQA_MOS      4.49    0.34      60
DNSMOS_OVRL    3.46    0.07      60
CE             6.08    0.26      60
CU             7.07    0.27      60
PQ             7.91    0.15      60
SpkSim         0.932   0.057     60

Key Observations

  1. Sidon Improvement: +0.26 MOS (4.41 → 4.67) and about +0.49 NISQA_MOS (4.03 → 4.52) with speech restoration
  2. Voice Conversion Quality: VC→Sidon raises speaker similarity by 0.040 absolute, about 4.5% relative (0.932 vs 0.892)
  3. Stability: Std dev is low for most metrics, which indicates consistent quality; UTMOS is the exception at 0.56-0.64
  4. Competitiveness: Raw MOS of 4.41 is claimed to exceed many commercial TTS systems, though no commercial baselines appear in the tables above
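
The deltas in the observations above can be recomputed from the table means. The sketch below copies the relevant means from the three tables and derives the absolute and relative changes; because it works from rounded means, results can differ from quoted figures by one unit in the last digit.

```python
# Means copied from the Raw, Sidon, and VC→Sidon tables above.
RAW = {"MOS": 4.41, "NISQA_MOS": 4.03, "SpkSim": 0.892}
SIDON = {"MOS": 4.67, "NISQA_MOS": 4.52, "SpkSim": 0.889}
VC_SIDON = {"MOS": 4.65, "NISQA_MOS": 4.49, "SpkSim": 0.932}

# Absolute gains from Sidon restoration over the raw output.
mos_gain = round(SIDON["MOS"] - RAW["MOS"], 2)
nisqa_gain = round(SIDON["NISQA_MOS"] - RAW["NISQA_MOS"], 2)

# Speaker similarity gain from voice conversion, absolute and relative to raw.
spk_gain_abs = round(VC_SIDON["SpkSim"] - RAW["SpkSim"], 3)
spk_gain_rel_pct = round(100 * (VC_SIDON["SpkSim"] - RAW["SpkSim"]) / RAW["SpkSim"], 1)

print(f"Sidon vs Raw:    MOS {mos_gain:+.2f}, NISQA_MOS {nisqa_gain:+.2f}")
print(f"VC→Sidon vs Raw: SpkSim {spk_gain_abs:+.3f} ({spk_gain_rel_pct:+.1f}% relative)")
```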

Usage

Installation

pip install peft torch torchaudio transformers

Loading the LoRA

import torch
from peft import PeftModel
from transformers import AutoModel

# Load the base DramaBox model (requires DramaBox repo access).
# "dramabox-path" is a placeholder for the local or Hub path of the base model.
base_model = AutoModel.from_pretrained("dramabox-path", torch_dtype=torch.bfloat16, device_map="auto")

# Attach the LoRA adapter to the frozen base model
model = PeftModel.from_pretrained(base_model, "laion/laionbox-v0.5-wip")
model.eval()  # disable training-only behavior before inference

# Inference with model
output = model.generate(...)

Inference via DramaBox Pipeline

python inference_adaln.py \
    --checkpoint path/to/base/dit \
    --lora-checkpoint laionbox-v0.5-wip/lora_step1479.safetensors \
    --output output.wav \
    --voice-sample reference.wav \
    --prompt "Your text here" \
    --seed 42

Checkpoint Details

  • File: lora_step1479.safetensors
  • Size: 865 MB
  • Format: Safetensors (LoRA weights only)
  • Step: 1479 / 1479 (final training step)
  • Training Time: ~4 hours on 8×H100
  • Final Flow Loss: 0.5693
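
The file size above is consistent with the ~226M trainable parameters being stored in fp32. The sketch below shows the arithmetic, assuming the reported "865 MB" means mebibytes (2^20 bytes) and ignoring the small safetensors header; both are assumptions rather than facts stated in this card.

```python
def estimated_param_count(file_size_mib: float, bytes_per_param: int) -> float:
    """Estimate how many parameters fit in a weights file of the given size."""
    return file_size_mib * 1024**2 / bytes_per_param


FILE_SIZE_MIB = 865  # from the checkpoint details above, read as MiB (assumption)

as_fp32 = estimated_param_count(FILE_SIZE_MIB, bytes_per_param=4)
as_bf16 = estimated_param_count(FILE_SIZE_MIB, bytes_per_param=2)

print(f"if stored as fp32: {as_fp32 / 1e6:.1f}M parameters")
print(f"if stored as bf16: {as_bf16 / 1e6:.1f}M parameters")
```

Only the fp32 reading lands near the stated ~226M, which matches the description of the LoRA weights being trained in fp32.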

Files in This Repository

  • lora_step1479.safetensors - LoRA weights (rank 128)
  • README.md - This file
  • eval_full_report.html - Interactive evaluation report with audio samples and comparison tables
  • training_metrics.json - Per-step training logs (loss, weight deltas, etc.)
  • evaluation_scores.json - Detailed scores for all 204 samples across 3 variants

Evaluation HTML Report

The included eval_full_report.html provides:

  • Aggregate metric tables (all variants, all models)
  • Delta vs baseline comparisons
  • Sidon improvement metrics
  • Interactive audio player for all samples
  • Side-by-side model comparison across:
    • 6 diverse prompts (English & German)
    • 5 reference speakers (Chris, Fairy, Samantha, Goblin, SpongeBob)
    • 2 random seeds per speaker
    • 3 variants (Raw / Sidon / VC→Sidon)
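
The evaluation grid described above can be enumerated to check the sample counts. The sketch below is one reading that is consistent with the N columns in the results tables and with the 204 samples in evaluation_scores.json: it assumes the unconditional samples are 6 prompts × 2 seeds, and that VC→Sidon is applied only to speaker-conditioned samples. Prompt IDs and seed values are hypothetical placeholders.

```python
from itertools import product

SPEAKERS = ["Chris", "Fairy", "Samantha", "Goblin", "SpongeBob"]
PROMPTS = [f"prompt_{i}" for i in range(1, 7)]  # hypothetical IDs for the 6 prompts
SEEDS = [0, 1]                                  # hypothetical values for the 2 seeds

# Every (prompt, speaker, seed) combination with a reference voice.
conditioned = list(product(PROMPTS, SPEAKERS, SEEDS))
# Assumption: unconditional samples use the same prompts and seeds, with no speaker.
unconditional = list(product(PROMPTS, [None], SEEDS))

samples_per_variant = {
    "Raw": len(conditioned) + len(unconditional),
    "Sidon": len(conditioned) + len(unconditional),
    "VC→Sidon": len(conditioned),  # voice conversion needs a reference speaker
}
total_samples = sum(samples_per_variant.values())

for variant, count in samples_per_variant.items():
    print(f"{variant:<10}{count:>4}")
print(f"{'Total':<10}{total_samples:>4}")
```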

To view: Download eval_full_report.html and open it in a web browser, or access it online at [Cloudflare tunnel URL; not listed in this card]

Training Timeline

Step 1 → 100:    Warmup phase (LR: 1e-6 → 1e-4)
Step 100 → 740:  Main training phase (LR: 1e-4 constant)
Step 740 → 1479: Late training phase (LR: 1e-4, convergence)

Best checkpoint: Step 1460 (flow loss: 0.1154)
Final checkpoint: Step 1479 (flow loss: 0.5693)
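
The learning-rate schedule in the timeline above (linear warmup from 1e-6 to 1e-4 over the first 100 steps, then constant) can be written as a small function. This is a sketch of the schedule as described, not the scheduler code from the training script; the function name and the exact interpolation endpoints are assumptions.

```python
def learning_rate(step: int, warmup_steps: int = 100,
                  start_lr: float = 1e-6, peak_lr: float = 1e-4) -> float:
    """Linear warmup from start_lr at step 1 to peak_lr at warmup_steps, then constant."""
    if step >= warmup_steps:
        return peak_lr
    if step <= 1:
        return start_lr
    progress = (step - 1) / (warmup_steps - 1)
    return start_lr + progress * (peak_lr - start_lr)


TOTAL_STEPS = 1479
schedule = [learning_rate(step) for step in range(1, TOTAL_STEPS + 1)]

for step in (1, 50, 100, 740, 1479):
    print(f"step {step:>4}: lr = {schedule[step - 1]:.2e}")
```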

Reproduction

To reproduce this training:

  1. Prepare diverse voice acting dataset (~530 hours)
  2. Create WebDataset tar shards (7-8 subsets recommended)
  3. Run training script:
    accelerate launch --num_processes=8 train_dramabox_lora_only.py \
        --data-dir /path/to/dataset \
        --lora-rank 128 \
        --lr 1e-4 \
        --epochs 3 \
        --grad-accum 16
    
  4. Run evaluation:
    python run_full_enhanced_eval.py --lora-path lora_step1479.safetensors
    

Citation

If you use this model, please cite:

@misc{laionbox_v0.5_run10,
  title={LAIONBox v0.5 WIP: DramaBox LoRA-Only Fine-tuning (Run 10)},
  author={LAION Community},
  year={2026},
  howpublished={\url{https://huggingface.co/laion/laionbox-v0.5-wip}},
  note={LoRA Rank-128 on DramaBox 3.29B DiT, trained on diverse voice acting data}
}

License

This model is provided under the same license as the base DramaBox model. See DramaBox repository for details.

Acknowledgments

  • Base Model: ResembleAI's DramaBox
  • Evaluation Metrics: SQA (Universal Speech Quality Assessment), AudioBox Aesthetics, WavLM Speaker Verification
  • Post-Processing: Sidon speech restoration, ChatterboxVC voice conversion
  • Infrastructure: LAION, HuggingFace, Cloudflare

Changelog

Run 10 (LoRA-Only, Final)

  • Removed AdaLN speaker conditioning network
  • Removed all auxiliary losses (speaker loss, KL divergence)
  • Pure LoRA training (rank 128) with gradient checkpointing
  • Final flow loss: 0.5693 at step 1479
  • Evaluation results: MOS 4.41 (Raw), 4.67 (Sidon), 4.65 (VC→Sidon)

Previous Runs

  • Run 9: LoRA128 + fp32 master + AdaLN (marginal improvements)
  • Run 8: Frozen LoRA-merged DiT + AdaLN-Zero (best previous)
  • Run 7: LoRA64 in bf16 (hit ULP floor)
  • Run 6: Standard full fine-tune (poor convergence)
  • Run 5: AdaLN-Zero speaker conditioning (baseline for speaker approach)

Last Updated: June 28, 2026
Training Date: June 27-28, 2026
Repository: https://huggingface.co/laion/laionbox-v0.5-wip
