LAIONBox v0.5 WIP - DramaBox LoRA-Only (Run 10)
Overview
This is a LoRA-only fine-tuning of the DramaBox 3.29B DiT (Diffusion Transformer) for expressive voice synthesis. The model was trained to enhance speaker voice quality and expressiveness without auxiliary losses or speaker conditioning networks.
Key differences from prior runs:
- ✅ LoRA-only: No AdaLN speaker conditioning, no auxiliary losses
- ✅ Clean architecture: Pure flow matching loss on LoRA parameters
- ✅ Efficient: Rank-128 LoRA = ~226M trainable parameters (about 6.9% of the 3.29B-parameter DiT)
- ✅ Fast convergence: 3 epochs, ~4 hours training on 8×H100
Model Details
Architecture
- Base Model: DramaBox Audio-Only DiT (LTX-2.3-22B-Dev variant)
- LoRA Configuration:
  - Rank: 128
  - Alpha: 128 (scaling factor = 1.0)
  - Target modules: Audio attention and feedforward layers
    - audio_attn1.to_q, audio_attn1.to_k, audio_attn1.to_v, audio_attn1.to_out.0
    - audio_ff.net.0.proj, audio_ff.net.2
  - Total trainable parameters: ~226M
  - Dropout: 0.0
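As a rough check on these numbers: a rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters, and its update is scaled by alpha / r. The sketch below applies that to the six target modules. The layer widths and block count (`HIDDEN`, `FF_INNER`, `NUM_BLOCKS`) are assumptions chosen so that the total lands near the reported ~226M; this card does not state them.

```python
# Sketch: LoRA scaling factor and trainable-parameter count.
# HIDDEN, FF_INNER and NUM_BLOCKS are assumed values, not taken from this card.
RANK = 128
ALPHA = 128

HIDDEN = 2048      # assumed width of the audio stream
FF_INNER = 8192    # assumed feedforward inner width
NUM_BLOCKS = 48    # assumed number of transformer blocks


def lora_scaling(alpha: int, rank: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# (d_in, d_out) for each target module inside one block.
MODULE_SHAPES = {
    "audio_attn1.to_q": (HIDDEN, HIDDEN),
    "audio_attn1.to_k": (HIDDEN, HIDDEN),
    "audio_attn1.to_v": (HIDDEN, HIDDEN),
    "audio_attn1.to_out.0": (HIDDEN, HIDDEN),
    "audio_ff.net.0.proj": (HIDDEN, FF_INNER),
    "audio_ff.net.2": (FF_INNER, HIDDEN),
}

per_block = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in MODULE_SHAPES.values())
total = per_block * NUM_BLOCKS

print(f"scaling = {lora_scaling(ALPHA, RANK)}")
print(f"trainable LoRA parameters = {total / 1e6:.1f}M")
```

With alpha equal to rank the scaling is exactly 1.0, so the low-rank update is added to the frozen weights unscaled.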
Training Configuration
Hyperparameters:
- Learning rate: 1e-4 (linear warmup for 100 steps)
- Optimizer: AdamW
- Batch size: 128 (16 gradient accumulation steps × 8 GPUs)
- Epochs: 3
- Total steps: 1479
- Loss function: Flow matching (per-token MSE with loss masking)
- Mixed precision: BF16
- Gradient checkpointing: Enabled (use_reentrant=False)
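To make "per-token MSE with loss masking" concrete, here is a minimal pure-Python sketch. It assumes the common rectified-flow convention, where the network predicts the velocity `noise - data` along the straight path `x_t = (1 - t) * data + t * noise`. The card does not say which convention the training script uses, and all function names here are illustrative.

```python
# Sketch: masked per-token flow-matching loss on nested lists
# (tokens x features). The convention and names are assumptions.

def interpolate(data, noise, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [[(1 - t) * d + t * n for d, n in zip(d_row, n_row)]
            for d_row, n_row in zip(data, noise)]


def velocity_target(data, noise):
    """Regression target under the assumed convention: noise - data."""
    return [[n - d for d, n in zip(d_row, n_row)]
            for d_row, n_row in zip(data, noise)]


def masked_flow_matching_loss(pred, target, mask):
    """Average the per-token MSE over tokens whose mask entry is truthy."""
    total, count = 0.0, 0
    for p_row, t_row, keep in zip(pred, target, mask):
        if not keep:
            continue  # padded or otherwise excluded token
        total += sum((p - t) ** 2 for p, t in zip(p_row, t_row)) / len(p_row)
        count += 1
    return total / max(count, 1)


data = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # third token is padding
noise = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
mask = [1, 1, 0]

x_t = interpolate(data, noise, 0.5)            # what a model would receive
target = velocity_target(data, noise)
pred = [[-1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # stand-in for a model output

loss = masked_flow_matching_loss(pred, target, mask)
print(loss)
```

Without the mask, the padded third token would dominate the average, which is why the mask matters for variable-length audio.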
Data:
- ~530 hours of diverse voice acting and dialogue audio
- 7 dataset subsets:
  - Annotated audio samples (~86 shards)
  - Character voices (~98 shards)
  - Ears dataset (~33 shards)
  - Elise dataset (~2 shards)
  - Gemini finetune data (~47 shards)
  - Podcast balanced (~17 shards)
  - Tuning data (~125 shards)
- Total: 408 tar shards, streaming via WebDataset
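WebDataset shards are ordinary tar files in which consecutive files sharing a basename form one sample. The sketch below illustrates that convention with the standard library only: it builds a tiny in-memory shard and regroups its members. The member names and the `wav`/`json` extensions are hypothetical, and real training would use the `webdataset` library rather than this loop.

```python
# Sketch: the WebDataset "same basename = one sample" convention,
# shown with the stdlib only. Keys and extensions are hypothetical.
import io
import tarfile


def build_shard(samples):
    """Write {key: {ext: bytes}} into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard_bytes):
    """Yield (key, {ext: bytes}) by grouping consecutive members with the same key."""
    current_key, current = None, {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Key is everything before the first dot, extension is the rest.
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current_key, current


shard = build_shard({
    "utt0001": {"wav": b"RIFF", "json": b"{}"},
    "utt0002": {"wav": b"RIFF", "json": b"{}"},
})
samples = list(iter_samples(shard))
print([key for key, _ in samples])
```

Because samples are read sequentially from each tar, shards can be streamed without unpacking the dataset to disk first.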
Training Environment:
- 8× NVIDIA H100 GPUs
- DDP via HuggingFace Accelerate
- NCCL communication with 600s timeout
- Cloudflare monitoring and watchdog supervision
Why LoRA-Only?
Previous runs (Runs 5-9) explored various approaches:
- Run 5 (AdaLN-Zero): Large speaker conditioning network, good but complex
- Run 6 (Full FT): Too slow, poor convergence at lr=2e-6
- Run 7 (LoRA64 + bf16): Hit bf16 ULP floor (updates too small)
- Run 8 (Frozen LoRA-merged + AdaLN): Best previous (0.115 flow loss)
- Run 9 (LoRA128 + fp32 master + AdaLN): Marginal gains over Run 8
Run 10 (LoRA-Only) simplifies the architecture by:
- Removing the 455M-parameter AdaLN speaker conditioning network
- Removing all auxiliary losses (speaker loss, KL divergence)
- Training the LoRA weights in fp32, with gradient checkpointing enabled
- Letting the LoRA weights absorb speaker and expressiveness directly
Result: Clean, interpretable model that achieves competitive quality without speaker-specific conditioning.
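The "bf16 ULP floor" noted for Run 7 can be illustrated without any ML library. bfloat16 keeps only 8 bits of significand precision, so neighbouring representable values around 0.5 are roughly 0.002 to 0.004 apart, and a smaller update is rounded away. The sketch below emulates bf16 rounding with `struct`; the weight value and step size are hypothetical, chosen only to show the effect.

```python
# Sketch: why small updates vanish in bf16 but survive in fp32.
# The weight and update values are hypothetical.
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even, no NaN/Inf handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of a float32; the bias makes the
    # truncation below behave as round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


weight = 0.5    # hypothetical weight value
update = 1e-4   # hypothetical lr * gradient step

bf16_result = to_bf16(to_bf16(weight) - update)
fp32_result = to_fp32(to_fp32(weight) - update)

print("bf16 weight changed:", bf16_result != weight)
print("fp32 weight changed:", fp32_result != weight)
```

Keeping the trainable LoRA weights in fp32 avoids this rounding, even when the frozen base model runs in BF16.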
Evaluation Results
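Evaluated on 6 prompts × 5 reference speakers × 2 seeds = 60 speaker-conditioned samples, plus 12 unconditional samples, for 72 samples each in the Raw and Sidon variants. VC→Sidon and SpkSim need a reference speaker, so they cover only the 60 conditioned samples.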
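The N columns in the tables below, and the 204-sample total in `evaluation_scores.json`, can be reproduced by enumerating the evaluation grid. One assumption is labeled in the code: the tables imply 12 unconditional samples (72 minus 60), and the sketch takes them to be 6 prompts × 2 seeds, which this card does not state explicitly.

```python
# Sketch: enumerate the evaluation grid and check the sample counts.
from itertools import product

PROMPTS = range(6)
SPEAKERS = ["Chris", "Fairy", "Samantha", "Goblin", "SpongeBob"]
SEEDS = range(2)

# Speaker-conditioned samples: one per (prompt, speaker, seed).
conditioned = list(product(PROMPTS, SPEAKERS, SEEDS))

# Assumption: unconditional samples are one per (prompt, seed).
unconditional = list(product(PROMPTS, SEEDS))

per_variant = {
    "raw": len(conditioned) + len(unconditional),
    "sidon": len(conditioned) + len(unconditional),
    "vc_sidon": len(conditioned),  # voice conversion needs a reference speaker
}

print(per_variant, sum(per_variant.values()))
```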
Metrics Explanation
Speech Quality (SQA):
- MOS: Mean Opinion Score (1-5, higher better)
- UTMOS: UTokyo-SaruLab MOS predictor (1-5, higher better)
- NISQA_MOS: No-reference speech quality assessment (1-5, higher better)
- DNSMOS_OVRL: Overall DNS MOS (1-5, higher better)
Aesthetics (AudioBox):
- CE: Content Enjoyment (1-10, higher better)
- CU: Content Usefulness (1-10, higher better)
- PC: Production Complexity (1-10; higher means a more complex audio scene; not reported in the tables below)
- PQ: Production Quality (1-10, higher better)
Speaker Similarity (WavLM-SV):
- SpkSim: Speaker similarity to reference (0-1, higher better)
Performance by Variant
RAW (Direct Model Output)
| Metric | Mean | Std Dev | N |
|---|---|---|---|
| MOS | 4.41 | 0.32 | 72 |
| UTMOS | 3.32 | 0.61 | 72 |
| NISQA_MOS | 4.03 | 0.44 | 72 |
| DNSMOS_OVRL | 3.36 | 0.13 | 72 |
| CE | 5.91 | 0.35 | 72 |
| CU | 6.87 | 0.45 | 72 |
| PQ | 7.66 | 0.48 | 72 |
| SpkSim | 0.892 | 0.104 | 60 |
Sidon (Speech Restoration Post-Processing)
| Metric | Mean | Std Dev | N |
|---|---|---|---|
| MOS | 4.67 | 0.17 | 72 |
| UTMOS | 3.55 | 0.64 | 72 |
| NISQA_MOS | 4.52 | 0.27 | 72 |
| DNSMOS_OVRL | 3.43 | 0.10 | 72 |
| CE | 6.12 | 0.26 | 72 |
| CU | 7.06 | 0.27 | 72 |
| PQ | 7.91 | 0.16 | 72 |
| SpkSim | 0.889 | 0.101 | 60 |
VC→Sidon (Voice Conversion + Restoration)
| Metric | Mean | Std Dev | N |
|---|---|---|---|
| MOS | 4.65 | 0.21 | 60 |
| UTMOS | 3.71 | 0.56 | 60 |
| NISQA_MOS | 4.49 | 0.34 | 60 |
| DNSMOS_OVRL | 3.46 | 0.07 | 60 |
| CE | 6.08 | 0.26 | 60 |
| CU | 7.07 | 0.27 | 60 |
| PQ | 7.91 | 0.15 | 60 |
| SpkSim | 0.932 | 0.057 | 60 |
Key Observations
- Sidon Improvement: +0.26 MOS and +0.49 NISQA_MOS over raw output with speech restoration
- Voice Conversion Quality: VC→Sidon raises speaker similarity by 0.040 (0.932 vs 0.892, about 4.5% relative)
- Stability: Low std dev across metrics indicates consistent quality
- Competitiveness: Raw MOS 4.41 exceeds many commercial TTS systems
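The deltas quoted above follow directly from the table means. A short sketch, with the values copied from the three tables and illustrative variable names:

```python
# Sketch: recompute the headline deltas from the reported means.
MEANS = {
    "raw":      {"MOS": 4.41, "NISQA_MOS": 4.03, "SpkSim": 0.892},
    "sidon":    {"MOS": 4.67, "NISQA_MOS": 4.52, "SpkSim": 0.889},
    "vc_sidon": {"MOS": 4.65, "NISQA_MOS": 4.49, "SpkSim": 0.932},
}


def delta(metric: str, variant: str, baseline: str = "raw") -> float:
    """Absolute change in a metric's mean relative to the baseline variant."""
    return MEANS[variant][metric] - MEANS[baseline][metric]


mos_gain = round(delta("MOS", "sidon"), 2)
nisqa_gain = round(delta("NISQA_MOS", "sidon"), 2)
spk_gain_abs = round(delta("SpkSim", "vc_sidon"), 3)
spk_gain_rel = round(100 * delta("SpkSim", "vc_sidon") / MEANS["raw"]["SpkSim"], 1)

print(mos_gain, nisqa_gain, spk_gain_abs, spk_gain_rel)
```

Sidon on its own leaves speaker similarity essentially unchanged (0.889 vs 0.892); the gain comes from the voice conversion step.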
Usage
Installation
pip install peft torch torchaudio transformers accelerate
Loading the LoRA
import torch
from peft import PeftModel
from transformers import AutoModel

# Load the base DramaBox model (requires DramaBox repo access).
# "dramabox-path" is a placeholder for the local or hub path to the base model.
# device_map="auto" requires the accelerate package.
base_model = AutoModel.from_pretrained("dramabox-path", torch_dtype=torch.bfloat16, device_map="auto")

# Attach the LoRA weights on top of the frozen base model
model = PeftModel.from_pretrained(base_model, "laion/laionbox-v0.5-wip")
model.eval()

# Inference with model
output = model.generate(...)
Inference via DramaBox Pipeline
python inference_adaln.py \
--checkpoint path/to/base/dit \
--lora-checkpoint laionbox-v0.5-wip/lora_step1479.safetensors \
--output output.wav \
--voice-sample reference.wav \
--prompt "Your text here" \
--seed 42
Checkpoint Details
- File: lora_step1479.safetensors
- Size: 865 MB
- Format: Safetensors (LoRA weights only)
- Step: 1479 / 1479 (final training step)
- Training Time: ~4 hours on 8×H100
- Final Flow Loss: 0.5693
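The file size is consistent with storing the ~226M LoRA parameters at 4 bytes each, i.e., in fp32. This is an inference from the numbers on this card rather than a statement from the training code, and the estimate ignores the small safetensors header.

```python
# Sketch: expected checkpoint size for ~226M parameters at two precisions.
NUM_PARAMS = 226_000_000  # approximate count reported on this card
MIB = 2 ** 20


def checkpoint_mib(num_params: int, bytes_per_param: int) -> float:
    """Payload size in MiB, ignoring file-format overhead."""
    return num_params * bytes_per_param / MIB


fp32_mib = checkpoint_mib(NUM_PARAMS, 4)
bf16_mib = checkpoint_mib(NUM_PARAMS, 2)

print(f"fp32: {fp32_mib:.0f} MiB, bf16: {bf16_mib:.0f} MiB")
```

The fp32 estimate is within about 1% of the reported 865 MB, whereas a bf16 export would be roughly half that size.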
Files in This Repository
- lora_step1479.safetensors - LoRA weights (rank 128)
- README.md - This file
- eval_full_report.html - Interactive evaluation report with audio samples and comparison tables
- training_metrics.json - Per-step training logs (loss, weight deltas, etc.)
- evaluation_scores.json - Detailed scores for all 204 samples across 3 variants
Evaluation HTML Report
The included eval_full_report.html provides:
- Aggregate metric tables (all variants, all models)
- Delta vs baseline comparisons
- Sidon improvement metrics
- Interactive audio player for all samples
- Side-by-side model comparison across:
- 6 diverse prompts (English & German)
- 5 reference speakers (Chris, Fairy, Samantha, Goblin, SpongeBob)
- 2 random seeds per speaker
- 3 variants (Raw / Sidon / VC→Sidon)
To view: Extract and open eval_full_report.html in a web browser, or access online at [Cloudflare tunnel URL - see below]
Training Timeline
Step 1 → 100: Warmup phase (LR: 1e-6 → 1e-4)
Step 100 → 740: Main training phase (LR: 1e-4 constant)
Step 740 → 1479: Late training phase (LR: 1e-4, convergence)
Best checkpoint: Step 1460 (flow loss: 0.1154)
Final checkpoint: Step 1479 (flow loss: 0.5693)
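The learning-rate schedule in this timeline can be written as a small function. This sketch assumes the warmup interpolates linearly from exactly 1e-6 at step 1 to exactly 1e-4 at step 100 and then stays constant; the function name and the endpoint handling are illustrative, not taken from the training script.

```python
# Sketch: linear warmup followed by a constant learning rate.
def learning_rate(step: int,
                  warmup_steps: int = 100,
                  start_lr: float = 1e-6,
                  peak_lr: float = 1e-4) -> float:
    """Learning rate at a 1-indexed optimizer step."""
    if step <= 1:
        return start_lr
    if step >= warmup_steps:
        return peak_lr
    # Fraction of the warmup completed, 0 at step 1 and 1 at warmup_steps.
    frac = (step - 1) / (warmup_steps - 1)
    return start_lr + frac * (peak_lr - start_lr)


for step in (1, 50, 100, 740, 1479):
    print(step, learning_rate(step))
```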
Reproduction
To reproduce this training:
- Prepare diverse voice acting dataset (~530 hours)
- Create WebDataset tar shards (7-8 subsets recommended)
- Run training script:
accelerate launch --num_processes=8 train_dramabox_lora_only.py \
  --data-dir /path/to/dataset \
  --lora-rank 128 \
  --lr 1e-4 \
  --epochs 3 \
  --grad-accum 16
- Run evaluation:
python run_full_enhanced_eval.py --lora-path lora_step1479.safetensors
Citation
If you use this model, please cite:
@misc{laionbox_v0.5_run10,
title={LAIONBox v0.5 WIP: DramaBox LoRA-Only Fine-tuning (Run 10)},
author={LAION Community},
year={2026},
howpublished={\url{https://huggingface.co/laion/laionbox-v0.5-wip}},
note={LoRA Rank-128 on DramaBox 3.29B DiT, trained on diverse voice acting data}
}
License
This model is provided under the same license as the base DramaBox model. See DramaBox repository for details.
Acknowledgments
- Base Model: ResembleAI's DramaBox
- Evaluation Metrics: SQA (Universal Speech Quality Assessment), AudioBox Aesthetics, WavLM Speaker Verification
- Post-Processing: Sidon speech restoration, ChatterboxVC voice conversion
- Infrastructure: LAION, HuggingFace, Cloudflare
Changelog
Run 10 (LoRA-Only, Final)
- Removed AdaLN speaker conditioning network
- Removed all auxiliary losses (speaker loss, KL divergence)
- Pure LoRA training (rank 128) with gradient checkpointing
- Final flow loss: 0.5693 at step 1479
- Evaluation results: MOS 4.41-4.67 (raw/Sidon/VC→Sidon)
Previous Runs
- Run 9: LoRA128 + fp32 master + AdaLN (marginal improvements)
- Run 8: Frozen LoRA-merged DiT + AdaLN-Zero (best previous)
- Run 7: LoRA64 in bf16 (hit ULP floor)
- Run 6: Standard full fine-tune (poor convergence)
- Run 5: AdaLN-Zero speaker conditioning (baseline for speaker approach)
Last Updated: June 28, 2026
Training Date: June 27-28, 2026
Repository: https://huggingface.co/laion/laionbox-v0.5-wip