LAIONBox v0.5 WIP - DramaBox LoRA-Only (Run 10)

Overview

This is a LoRA-only fine-tune of the DramaBox 3.29B DiT (Diffusion Transformer) for expressive voice synthesis. It was trained to improve voice quality and expressiveness without auxiliary losses or a speaker conditioning network.

Key differences from prior runs:

✅ LoRA-only: No AdaLN speaker conditioning, no auxiliary losses
✅ Clean architecture: Pure flow matching loss on LoRA parameters
✅ Efficient: Rank-128 LoRA = ~226M trainable parameters (about 6.9% relative to the 3.29B base DiT, or about 6.4% of the ~3.5B combined total)
✅ Fast convergence: 3 epochs, ~4 hours training on 8×H100

Model Details

Architecture

Base Model: DramaBox Audio-Only DiT (LTX-2.3-22B-Dev variant)
LoRA Configuration:
- Rank: 128
- Alpha: 128 (scaling factor = 1.0)
- Target modules: Audio attention and feedforward layers
  - audio_attn1.to_q, audio_attn1.to_k, audio_attn1.to_v, audio_attn1.to_out.0
  - audio_ff.net.0.proj, audio_ff.net.2
- Total trainable parameters: ~226M
- Dropout: 0.0
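
The scaling factor and the trainable-parameter count follow from standard LoRA arithmetic, sketched below. The module names are the ones listed above; the layer widths (`HIDDEN`, `FF_INNER`) and the block count (`NUM_BLOCKS`) are hypothetical values chosen for illustration and are not taken from the DramaBox source. With these assumed values the total happens to land near the stated ~226M, which is a consistency check and not a confirmation of the real dimensions.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A (standard LoRA: alpha / rank)."""
    return alpha / rank


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Target module names as listed in this card.
TARGET_MODULES = [
    "audio_attn1.to_q", "audio_attn1.to_k", "audio_attn1.to_v", "audio_attn1.to_out.0",
    "audio_ff.net.0.proj", "audio_ff.net.2",
]

# HYPOTHETICAL dimensions, for illustration only (not taken from DramaBox).
HIDDEN, FF_INNER, NUM_BLOCKS = 2048, 8192, 48

# (in_features, out_features) per target module under the assumed widths.
shapes = {name: (HIDDEN, HIDDEN) for name in TARGET_MODULES[:4]}
shapes["audio_ff.net.0.proj"] = (HIDDEN, FF_INNER)
shapes["audio_ff.net.2"] = (FF_INNER, HIDDEN)

per_block = sum(lora_param_count(i, o, rank=128) for i, o in shapes.values())
total = per_block * NUM_BLOCKS

print(lora_scaling(alpha=128, rank=128))  # 1.0
print(per_block, total)                   # 4718592 226492416
```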

Training Configuration

Hyperparameters:

Learning rate: 1e-4 (linear warmup for 100 steps)
Optimizer: AdamW
Effective batch size: 128 (16 gradient accumulation steps × 8 GPUs, implying a per-GPU micro-batch of 1)
Epochs: 3
Total steps: 1479
Loss function: Flow matching (per-token MSE with loss masking)
Mixed precision: BF16
Gradient checkpointing: Enabled (use_reentrant=False)
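
A minimal pure-Python sketch of a per-token flow matching MSE with loss masking is shown below. It assumes the common straight-path formulation (data at t=0, noise at t=1, target velocity = noise minus data); the training script may use a different sign convention or time weighting, and the function names here are illustrative, not taken from the DramaBox code.

```python
def interpolate(data, noise, t):
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def masked_flow_matching_loss(pred_velocity, data, noise, mask):
    """Per-token MSE between predicted and target velocity, averaged over kept tokens.

    pred_velocity, data, noise: lists of tokens, each token a list of floats.
    mask: 1 for real tokens, 0 for padding (padding contributes nothing).
    """
    total, kept = 0.0, 0
    for v_pred, d, n, m in zip(pred_velocity, data, noise, mask):
        if not m:
            continue
        # ASSUMED target: velocity of the straight path, noise - data.
        target = [ni - di for di, ni in zip(d, n)]
        total += sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)
        kept += 1
    return total / max(kept, 1)


data = [[0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
noise = [[1.0, 1.0], [0.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]  # third token is padding
pred = [[1.0, 0.0], [-2.0, 0.0], [100.0, 100.0]]

print(masked_flow_matching_loss(pred, data, noise, mask))  # 1.0
```

The padded token's large error is ignored, so the loss depends only on the two real tokens.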

Data:

~530 hours of diverse voice acting and dialogue audio
7 dataset subsets:
- Annotated audio samples (~86 shards)
- Character voices (~98 shards)
- Ears dataset (~33 shards)
- Elise dataset (~2 shards)
- Gemini finetune data (~47 shards)
- Podcast balanced (~17 shards)
- Tuning data (~125 shards)
Total: 408 tar shards, streaming via WebDataset
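
The shard total and the step counts can be cross-checked with a few lines of arithmetic. The dictionary keys are illustrative labels for the subsets above, not actual directory names, and the per-GPU micro-batch of 1 is an assumption inferred from 128 = 16 × 8.

```python
# Shard counts per subset, as listed in this card (keys are illustrative labels).
SHARDS = {
    "annotated_audio": 86,
    "character_voices": 98,
    "ears": 33,
    "elise": 2,
    "gemini_finetune": 47,
    "podcast_balanced": 17,
    "tuning_data": 125,
}
total_shards = sum(SHARDS.values())

GRAD_ACCUM_STEPS = 16
NUM_GPUS = 8
MICRO_BATCH_PER_GPU = 1  # ASSUMPTION: inferred from the stated batch size of 128
effective_batch = GRAD_ACCUM_STEPS * NUM_GPUS * MICRO_BATCH_PER_GPU

TOTAL_STEPS, EPOCHS = 1479, 3
steps_per_epoch, remainder = divmod(TOTAL_STEPS, EPOCHS)
# Samples consumed per epoch if every optimizer step uses one full effective batch.
samples_per_epoch = steps_per_epoch * effective_batch

print(total_shards)                  # 408
print(effective_batch)               # 128
print(steps_per_epoch, remainder)    # 493 0
print(samples_per_epoch)             # 63104
```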

Training Environment:

8× NVIDIA H100 GPUs
DDP via HuggingFace Accelerate
NCCL communication with 600s timeout
Cloudflare monitoring and watchdog supervision

Why LoRA-Only?

Previous runs (Runs 5-9) explored various approaches:

Run 5 (AdaLN-Zero): Large speaker conditioning network, good but complex
Run 6 (Full FT): Too slow, poor convergence at lr=2e-6
Run 7 (LoRA64 + bf16): Hit bf16 ULP floor (updates too small)
Run 8 (Frozen LoRA-merged + AdaLN): Best previous (0.115 flow loss)
Run 9 (LoRA128 + fp32 master + AdaLN): Marginal gains over Run 8

Run 10 (LoRA-Only) simplifies the architecture by:

Removing the 455M-parameter AdaLN speaker conditioning network
Removing all auxiliary losses (speaker loss, KL divergence)
Training the LoRA parameters alone in fp32, with gradient checkpointing keeping memory within budget
Letting the LoRA weights absorb speaker and expressiveness directly

Result: Clean, interpretable model that achieves competitive quality without speaker-specific conditioning.

Evaluation Results

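Evaluated on 6 prompts × 5 reference speakers × 2 seeds = 60 speaker-conditioned samples, plus unconditional variants that bring the total to 72 samples per variant. Speaker similarity and the VC→Sidon variant need a reference speaker, so they cover only the 60 conditioned samples.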
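
The sample counts in the tables below (N = 72 and N = 60) can be reproduced by enumerating the evaluation grid. The prompt IDs and seed values are placeholders, and the assumption that the unconditional samples are one per prompt and seed (6 × 2 = 12) is inferred from 72 − 60, not stated in the card.

```python
from itertools import product

PROMPTS = [f"prompt_{i}" for i in range(6)]  # placeholder IDs
SPEAKERS = ["Chris", "Fairy", "Samantha", "Goblin", "SpongeBob"]
SEEDS = [0, 1]  # placeholder seed values

# Speaker-conditioned samples: every prompt x speaker x seed combination.
conditioned = list(product(PROMPTS, SPEAKERS, SEEDS))
# ASSUMPTION: unconditional samples use no reference speaker, one per prompt and seed.
unconditional = list(product(PROMPTS, [None], SEEDS))

samples_per_variant = {
    "raw": len(conditioned) + len(unconditional),
    "sidon": len(conditioned) + len(unconditional),
    "vc_sidon": len(conditioned),  # voice conversion needs a reference speaker
}

print(len(conditioned), len(unconditional))  # 60 12
print(samples_per_variant)                   # {'raw': 72, 'sidon': 72, 'vc_sidon': 60}
print(sum(samples_per_variant.values()))     # 204
```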

Metrics Explanation

Speech Quality (SQA):

MOS: Mean Opinion Score (1-5, higher better)
UTMOS: UTokyo-SaruLab MOS predictor (1-5, higher better)
NISQA_MOS: No-reference speech quality assessment (1-5, higher better)
DNSMOS_OVRL: Overall DNS MOS (1-5, higher better)

Aesthetics (AudioBox):

CE: Content Enjoyment (1-10, higher better)
CU: Content Usefulness (1-10, higher better)
PC: Production Complexity (1-10; describes complexity and is not a quality score; not reported in the tables below)
PQ: Production Quality (1-10, higher better)

Speaker Similarity (WavLM-SV):

SpkSim: Speaker similarity to reference (0-1, higher better)

Performance by Variant

RAW (Direct Model Output)

Metric	Mean	Std Dev	N
MOS	4.41	0.32	72
UTMOS	3.32	0.61	72
NISQA_MOS	4.03	0.44	72
DNSMOS_OVRL	3.36	0.13	72
CE	5.91	0.35	72
CU	6.87	0.45	72
PQ	7.66	0.48	72
SpkSim	0.892	0.104	60

SIDON (Speech Restoration Post-Processing)

Metric	Mean	Std Dev	N
MOS	4.67	0.17	72
UTMOS	3.55	0.64	72
NISQA_MOS	4.52	0.27	72
DNSMOS_OVRL	3.43	0.10	72
CE	6.12	0.26	72
CU	7.06	0.27	72
PQ	7.91	0.16	72
SpkSim	0.889	0.101	60

VC→SIDON (Voice Conversion + Restoration)

Metric	Mean	Std Dev	N
MOS	4.65	0.21	60
UTMOS	3.71	0.56	60
NISQA_MOS	4.49	0.34	60
DNSMOS_OVRL	3.46	0.07	60
CE	6.08	0.26	60
CU	7.07	0.27	60
PQ	7.91	0.15	60
SpkSim	0.932	0.057	60

Key Observations

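Sidon Improvement: Speech restoration raises MOS by 0.26 (4.41 → 4.67) and NISQA_MOS by 0.49 (4.03 → 4.52)
Voice Conversion Quality: VC→Sidon raises speaker similarity by 0.040 absolute (0.932 vs 0.892, about 4.5% relative)
Stability: Sidon post-processing roughly halves the MOS standard deviation (0.32 → 0.17); UTMOS remains the most variable metric (std dev 0.56-0.64)
Competitiveness: Raw MOS of 4.41 is in the upper range of the 1-5 scale; the tables in this card do not include commercial TTS systems for direct comparison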
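
The deltas quoted in these observations can be recomputed from the table means. The snippet below copies only the relevant means from the three tables; the `delta` helper is an illustrative function, not part of the evaluation scripts.

```python
# Mean scores copied from the RAW, SIDON, and VC→SIDON tables in this card.
MEANS = {
    "raw": {"MOS": 4.41, "NISQA_MOS": 4.03, "SpkSim": 0.892},
    "sidon": {"MOS": 4.67, "NISQA_MOS": 4.52, "SpkSim": 0.889},
    "vc_sidon": {"MOS": 4.65, "NISQA_MOS": 4.49, "SpkSim": 0.932},
}


def delta(metric, variant, baseline="raw", digits=2):
    """Absolute change in a metric's mean relative to the baseline variant."""
    return round(MEANS[variant][metric] - MEANS[baseline][metric], digits)


mos_gain = delta("MOS", "sidon")
nisqa_gain = delta("NISQA_MOS", "sidon")
spk_gain = delta("SpkSim", "vc_sidon", digits=3)
spk_gain_relative_pct = round(100 * spk_gain / MEANS["raw"]["SpkSim"], 1)

print(mos_gain, nisqa_gain)             # 0.26 0.49
print(spk_gain, spk_gain_relative_pct)  # 0.04 4.5
```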

Usage

Installation

pip install peft torch torchaudio transformers

Loading the LoRA

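import torch
from peft import PeftModel
from transformers import AutoModel

# Load the base DramaBox model (requires DramaBox repo access).
# "dramabox-path" is a placeholder for your local path or repo ID.
base_model = AutoModel.from_pretrained("dramabox-path", torch_dtype=torch.bfloat16, device_map="auto")

# Attach the LoRA adapter. PeftModel.from_pretrained reads the adapter config itself,
# so a separate PeftConfig call is not needed. This assumes the repo provides a
# PEFT adapter_config.json, which the file list in this card does not show.
model = PeftModel.from_pretrained(base_model, "laion/laionbox-v0.5-wip")

# Inference with model (arguments depend on the DramaBox model class)
output = model.generate(...)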

Inference via DramaBox Pipeline

python inference_adaln.py \
    --checkpoint path/to/base/dit \
    --lora-checkpoint laionbox-v0.5-wip/lora_step1479.safetensors \
    --output output.wav \
    --voice-sample reference.wav \
    --prompt "Your text here" \
    --seed 42

Checkpoint Details

File: lora_step1479.safetensors
Size: 865 MB
Format: Safetensors (LoRA weights only)
Step: 1479 / 1479 (final training step)
Training Time: ~4 hours on 8×H100
Final Flow Loss: 0.5693
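
To check the parameter count of a downloaded checkpoint without loading tensors, the safetensors header can be read directly: the file starts with an 8-byte little-endian header length followed by a JSON header mapping tensor names to dtype, shape, and data offsets. The sketch below uses only the standard library; the helper names are illustrative, and the tensor names inside lora_step1479.safetensors are not documented in this card.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as {tensor_name: info}."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def count_parameters(header):
    """Total number of elements across all tensors listed in the header."""
    return sum(prod(entry["shape"]) for entry in header.values())


def dtypes(header):
    """Set of storage dtypes used by the tensors, e.g. {'F32'}."""
    return {entry["dtype"] for entry in header.values()}
```

Usage, assuming the file has been downloaded locally: `header = read_safetensors_header("lora_step1479.safetensors")`, then `count_parameters(header)` and `dtypes(header)`. As a rough plausibility check, ~226M parameters at 4 bytes each is about 904 MB (about 862 MiB), which is in the neighborhood of the stated 865 MB and would be consistent with fp32 storage; that reading is an inference, not something the card states.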

Files in This Repository

lora_step1479.safetensors - LoRA weights (rank 128)
README.md - This file
eval_full_report.html - Interactive evaluation report with audio samples and comparison tables
training_metrics.json - Per-step training logs (loss, weight deltas, etc.)
evaluation_scores.json - Detailed scores for all 204 samples across 3 variants

Evaluation HTML Report

The included eval_full_report.html provides:

Aggregate metric tables (all variants, all models)
Delta vs baseline comparisons
Sidon improvement metrics
Interactive audio player for all samples
Side-by-side model comparison across:
- 6 diverse prompts (English & German)
- 5 reference speakers (Chris, Fairy, Samantha, Goblin, SpongeBob)
- 2 random seeds per speaker
- 3 variants (Raw / Sidon / VC→Sidon)

To view: Download eval_full_report.html and open it in a web browser, or access it online at [Cloudflare tunnel URL]

Training Timeline

Step 1 → 100:    Warmup phase (LR: 1e-6 → 1e-4)
Step 100 → 740:  Main training phase (LR: 1e-4 constant)
Step 740 → 1479: Late training phase (LR: 1e-4, convergence)

Best checkpoint: Step 1460 (flow loss: 0.1154)
Final checkpoint: Step 1479 (flow loss: 0.5693); this is the checkpoint shipped in this repository
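
The learning-rate schedule described in the timeline can be written as a small function. This is a sketch that assumes exact linear interpolation from 1e-6 at step 1 to 1e-4 at step 100 and a constant rate afterwards; the training script's scheduler may differ in its endpoints or step indexing.

```python
def learning_rate(step, peak_lr=1e-4, start_lr=1e-6, warmup_steps=100):
    """Linear warmup from start_lr (step 1) to peak_lr (step warmup_steps), then constant."""
    step = max(step, 1)
    if step >= warmup_steps:
        return peak_lr
    frac = (step - 1) / (warmup_steps - 1)  # 0.0 at step 1, approaches 1.0 at warmup end
    return start_lr + frac * (peak_lr - start_lr)


for s in (1, 50, 100, 740, 1479):
    print(s, learning_rate(s))
```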

Reproduction

To reproduce this training:

Prepare diverse voice acting dataset (~530 hours)
Create WebDataset tar shards (7-8 subsets recommended)

Run training script:

accelerate launch --num_processes=8 train_dramabox_lora_only.py \
    --data-dir /path/to/dataset \
    --lora-rank 128 \
    --lr 1e-4 \
    --epochs 3 \
    --grad-accum 16

Run evaluation:

python run_full_enhanced_eval.py --lora-path lora_step1479.safetensors

Citation

If you use this model, please cite:

@misc{laionbox_v0.5_run10,
  title={LAIONBox v0.5 WIP: DramaBox LoRA-Only Fine-tuning (Run 10)},
  author={LAION Community},
  year={2026},
  howpublished={\url{https://huggingface.co/laion/laionbox-v0.5-wip}},
  note={LoRA Rank-128 on DramaBox 3.29B DiT, trained on diverse voice acting data}
}

License

This model is provided under the same license as the base DramaBox model. See DramaBox repository for details.

Acknowledgments

Base Model: ResembleAI's DramaBox
Evaluation Metrics: SQA (Universal Speech Quality Assessment), AudioBox Aesthetics, WavLM Speaker Verification
Post-Processing: Sidon speech restoration, ChatterboxVC voice conversion
Infrastructure: LAION, HuggingFace, Cloudflare

Changelog

Run 10 (LoRA-Only, Final)

Removed AdaLN speaker conditioning network
Removed all auxiliary losses (speaker loss, KL divergence)
Pure LoRA training (rank 128) with gradient checkpointing
Final flow loss: 0.5693 at step 1479
Evaluation results: MOS 4.41 / 4.67 / 4.65 (Raw / Sidon / VC→Sidon)

Previous Runs

Run 9: LoRA128 + fp32 master + AdaLN (marginal improvements)
Run 8: Frozen LoRA-merged DiT + AdaLN-Zero (best previous)
Run 7: LoRA64 in bf16 (hit ULP floor)
Run 6: Standard full fine-tune (poor convergence)
Run 5: AdaLN-Zero speaker conditioning (baseline for speaker approach)

Last Updated: June 28, 2026
Training Date: June 27-28, 2026
Repository: https://huggingface.co/laion/laionbox-v0.5-wip
