LaionBox Ablation Checkpoints
LoRA checkpoints and full training/evaluation code from an auxiliary-loss ablation study on DramaBox (LTX-2.3 22B audio-only flow-matching TTS model).
This repository accompanies the LaionBox voice-cloning fine-tuning project. It contains:
- Trained LoRA adapters (.safetensors) for each ablation condition
- Training metrics (metrics.jsonl) logged every 10 optimizer steps
- Full training and evaluation scripts under code/
- YAML configs for every ablation condition
Motivation
LaionBox v0.1-wip was trained with a differentiable reward LoRA (5 epochs on DramaBox + Emolia data). This ablation study investigates which non-differentiable auxiliary losses improve voice quality when combined with the base flow-matching objective. All conditions start from the same v0.1-wip checkpoint and use identical hyperparameters; only the active auxiliary losses differ.
Training Setup
| Parameter | Value |
|---|---|
| Base model | LTX-2.3 22B audio-only (ltx-2.3-22b-dev-audio-only-v13-merged) |
| Resume from | LaionBox v0.1-wip (5-epoch diff reward LoRA) |
| LoRA rank / alpha | 128 / 128 |
| Learning rate | 2e-5, cosine schedule, 40 warmup steps |
| Effective batch size | 256 (8 GPUs x 1 micro-batch x 32 grad accum) |
| Duration | 2 podcast-anchored epochs (~234 optimizer steps) |
| CLAP model | gijs/voiceclap-lco-7b-lora (7B, INT4 quantized) |
| Reward type | Non-differentiable scalar (no grad through CLAP) |
| Aux trigger | sigma < 0.4 (~40% of micro-steps) |
| Data mix | 40% DramaBox, 20% Emolia, 40% Podcast |
| Text dropout | 10% |
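The learning-rate and batch-size rows can be read as the schedule below. This is a minimal sketch, assuming linear warmup from zero and cosine decay to zero over the ~234 optimizer steps; the actual warmup shape and final learning rate are set by the training script and may differ. The function name lr_at_step is hypothetical.

```python
import math

PEAK_LR = 2e-5       # from the table
WARMUP_STEPS = 40    # from the table
TOTAL_STEPS = 234    # approximate step count from the table


def lr_at_step(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at an optimizer step.

    Assumed shape: linear warmup from 0, then cosine decay to 0.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Effective batch size: GPUs x micro-batch x gradient-accumulation steps.
effective_batch = 8 * 1 * 32
```

Under these assumptions the peak of 2e-5 is reached at step 40 and the rate falls to half of that midway through the decay phase.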
Auxiliary Losses
| Loss | Description |
|---|---|
| Naturalness | CLAP text-audio cosine similarity between generated audio and positive/negative text prompts. Positive: "Realistic, genuine, spontaneous, authentic...". Negative: "distorted, unnatural, robotic..." |
| Quality MLP | P(real) from a binary classifier (MLP head on CLAP embeddings) trained to distinguish real vs. synthetic speech |
| Centroid | cos(emb, real_centroid) - cos(emb, synth_centroid) using pre-computed CLAP embedding centroids |
| Speaker Similarity | WavLM-SV cosine similarity between reference speaker embedding and generated speaker embedding |
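The centroid row gives an explicit formula, which can be sketched in plain Python as follows. The embeddings here are ordinary lists of floats; in the real pipeline they would be CLAP embeddings, and the function names are hypothetical rather than taken from the training script.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid_score(emb, real_centroid, synth_centroid):
    """cos(emb, real_centroid) - cos(emb, synth_centroid).

    Positive values mean the embedding sits closer to the real-speech
    centroid than to the synthetic-speech centroid.
    """
    return cosine(emb, real_centroid) - cosine(emb, synth_centroid)
```

The same cosine helper describes the speaker-similarity row, with WavLM-SV speaker embeddings in place of CLAP embeddings.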
All auxiliary losses are individually normalized via EMA-based adaptive coefficients to maintain approximately the same magnitude as the flow-matching loss, with a coefficient cap of 10.0.
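One way such a coefficient could be computed is sketched below. This is an illustration under stated assumptions, not the training script's implementation: the class name, the decay of 0.99, and the choice to scale by the ratio of the two running magnitudes are all hypothetical. Only the cap of 10.0 comes from the text above.

```python
class AdaptiveCoeff:
    """EMA-based coefficient that scales an auxiliary loss toward the flow-loss magnitude."""

    def __init__(self, decay=0.99, cap=10.0, eps=1e-8):
        self.decay = decay      # assumed EMA decay, not from the source
        self.cap = cap          # coefficient cap stated in the README
        self.eps = eps          # avoids division by zero
        self.ema_flow = None
        self.ema_aux = None

    def _ema(self, prev, value):
        # The first observation initializes the average.
        if prev is None:
            return value
        return self.decay * prev + (1.0 - self.decay) * value

    def update(self, flow_loss, aux_loss):
        """Update both running averages and return the capped coefficient."""
        self.ema_flow = self._ema(self.ema_flow, abs(flow_loss))
        self.ema_aux = self._ema(self.ema_aux, abs(aux_loss))
        return min(self.cap, self.ema_flow / (self.ema_aux + self.eps))
```

With this form, coeff * aux_loss stays near the flow-loss magnitude while the ratio is below the cap; a very small auxiliary loss hits the cap instead of being amplified without bound.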
Ablation Conditions
| Condition | Active Losses | Config | Status |
|---|---|---|---|
| nat_only | naturalness | finetune_nat_only.yaml | Complete |
| ablation_nat_quality | naturalness + quality MLP | ablation_nat_quality.yaml | Complete |
| ablation_nat_centroid | naturalness + centroid | ablation_nat_centroid.yaml | Pending |
| ablation_nat_quality_speaker | naturalness + quality MLP + speaker sim | ablation_nat_quality_speaker.yaml | Pending |
| ablation_nat_speaker | naturalness + speaker sim | ablation_nat_speaker.yaml | Pending |
Completed Results
| Condition | Best Flow Loss | Best Flow Step | Best Nat Score | Best Nat Step | Quality Prob |
|---|---|---|---|---|---|
| nat_only | 0.528 | 160 | 0.111 | 190 | 0.50 (disabled) |
| ablation_nat_quality | 0.528 | 160 | 0.459 | 190 | ~0.88-0.92 |
Key observation: Adding the quality MLP raises the best naturalness score from 0.111 to 0.459 while the best flow loss stays at 0.528 (step 160 in both runs), suggesting that the quality classifier provides a complementary training signal.
Repository Structure
.
├── README.md
├── nat_only/
│   ├── best_flow_step160.safetensors    # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors     # LoRA: best naturalness (0.111)
│   └── metrics.jsonl                    # 23 log entries, steps 10-230
├── ablation_nat_quality/
│   ├── best_flow_step160.safetensors    # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors     # LoRA: best naturalness (0.459)
│   └── metrics.jsonl                    # 23 log entries, steps 10-230
└── code/
    ├── scripts/
    │   ├── dramabox_finetune_train_multi_aux.py    # Main training script (~102 KB)
    │   ├── run_comprehensive_eval.py               # Multi-model eval + HTML report (~36 KB)
    │   ├── run_ablation_eval.py                    # Ablation eval wrapper (~5 KB)
    │   ├── run_ablation_overnight.py               # Sequential train + eval runner (~15 KB)
    │   └── ablation_status_server.py               # HTTP monitoring dashboard (~6 KB)
    └── configs/
        ├── finetune_nat_only.yaml
        ├── ablation_nat_quality.yaml
        ├── ablation_nat_centroid.yaml
        ├── ablation_nat_quality_speaker.yaml
        └── ablation_nat_speaker.yaml
Code Overview
Training (dramabox_finetune_train_multi_aux.py)
The main training script supports:
- Multi-GPU training via accelerate (tested on 8x GPU)
- Bucket-weighted sampling across DramaBox, Emolia, and podcast data sources
- Three independent auxiliary losses (naturalness, quality MLP, centroid/speaker sim), each with EMA-adaptive coefficients
- Shifted logit-normal timestep sampling from the DramaBox training recipe
- LoRA fine-tuning with configurable rank, alpha, and dropout
- Checkpoint management: saves every N steps, keeps last K, promotes checkpoints that improve best metrics
- Built-in HTTP monitoring server for real-time loss curves
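Bucket-weighted sampling with the 40/20/40 data mix from the training setup can be illustrated as below. This is a minimal sketch using the standard library; the bucket keys and the function name are hypothetical, and the training script may draw samples differently (for example per batch rather than per item).

```python
import random

# Data mix from the training setup; the dictionary keys are illustrative names.
DATA_MIX = {"dramabox": 0.4, "emolia": 0.2, "podcast": 0.4}


def sample_buckets(n, mix=DATA_MIX, seed=0):
    """Draw n source buckets with probability proportional to their mix weight."""
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Each returned name would then select which dataset the next training example is read from.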
# Launch training for a single ablation condition (paths are relative to code/)
accelerate launch --num_processes=8 scripts/dramabox_finetune_train_multi_aux.py \
--config configs/ablation_nat_quality.yaml
Evaluation (run_comprehensive_eval.py)
The evaluation pipeline:
- Generates audio for multiple models x reference voices x prompts (parallelized across GPUs)
- Scores all generated audio with: CLAP-small, CLAP-large (7B), centroid distance, quality MLP, speaker similarity
- Builds an interactive HTML report with embedded MP3 audio for side-by-side comparison
# Run comprehensive evaluation
python scripts/run_comprehensive_eval.py \
--output-dir eval_output \
--num-gpus 8 \
--seeds 42
Overnight Runner (run_ablation_overnight.py)
Orchestrates the full ablation study: runs 4 training experiments sequentially, finds best checkpoints from each, then runs a combined evaluation across all conditions + baselines.
python scripts/run_ablation_overnight.py
Status Server (ablation_status_server.py)
Lightweight HTTP server for monitoring training progress in real time during long runs.
Usage
These are LoRA adapters for the LTX-2.3 22B audio-only model. To use them for inference:
# Load with DramaBox inference pipeline
python inference.py \
--checkpoint <base_model_path> \
--lora <checkpoint.safetensors> \
--lora-rank 128
Loading a specific checkpoint
from huggingface_hub import hf_hub_download
# Download the best-naturalness checkpoint from nat+quality ablation
ckpt_path = hf_hub_download(
repo_id="TTS-AGI/laionbox-ablation-checkpoints",
filename="ablation_nat_quality/best_nat_step190.safetensors",
)
print(f"Downloaded to: {ckpt_path}")
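To check what a downloaded adapter contains without loading any weights, the safetensors header can be read directly: the file starts with an 8-byte little-endian length, followed by a JSON header describing each tensor. The sketch below uses only the standard library; the helper names are hypothetical, and the tensor names inside these particular checkpoints are not documented here, so inspect the output rather than assuming a naming scheme.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file, without the metadata entry."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # unsigned 64-bit, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return header


def summarize(path):
    """Map each tensor name to its (dtype, shape)."""
    header = read_safetensors_header(path)
    return {name: (info["dtype"], tuple(info["shape"])) for name, info in header.items()}
```

Calling summarize(ckpt_path) on the path returned by hf_hub_download lists the adapter's tensors, which is a quick way to confirm the LoRA rank from the tensor shapes.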
Metrics Format
Each metrics.jsonl file contains one JSON object per logging step with the following fields:
| Field | Description |
|---|---|
| step | Optimizer step number |
| flow_loss | Flow-matching loss (primary training objective) |
| lr | Current learning rate |
| clap_text_reward | Raw CLAP text-audio cosine similarity |
| quality_prob | Quality MLP P(real) score (0.5 if disabled) |
| naturalness_reward | Combined naturalness signal |
| centroid_score | Centroid-based real/fake score (0.0 if disabled) |
| aux1_loss / aux2_loss / aux3_loss | Individual auxiliary loss values |
| coeff1 / coeff2 / coeff3 | EMA-adaptive coefficients |
| total_loss | flow_loss + sum(coeff_i * aux_i_loss) |
| mode_counts | Cumulative counts of voice_clone_fwd, unconditional, voice_clone_rev |
| tgt_tokens / ref_tokens | Token counts for current batch |
| steps_per_sec / elapsed_sec / eta_sec | Timing statistics |
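The fields above are enough to locate the best steps in a run and to sanity-check total_loss. The following is a minimal sketch using only the standard library. It assumes that the "best naturalness" step is the one with the highest naturalness_reward and that a missing auxiliary field counts as zero; neither assumption is confirmed by the training script, and the function names are hypothetical.

```python
import json


def load_metrics(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


def best_steps(records):
    """Return (step with lowest flow_loss, step with highest naturalness_reward)."""
    best_flow = min(records, key=lambda r: r["flow_loss"])
    best_nat = max(records, key=lambda r: r["naturalness_reward"])
    return best_flow["step"], best_nat["step"]


def recompute_total(record):
    """flow_loss + sum(coeff_i * aux_i_loss); missing terms are treated as 0."""
    total = record["flow_loss"]
    for i in (1, 2, 3):
        total += record.get(f"coeff{i}", 0.0) * record.get(f"aux{i}_loss", 0.0)
    return total


# Example (assumes the file exists locally):
#   with open("nat_only/metrics.jsonl") as f:
#       records = load_metrics(f)
#   print(best_steps(records))
```

Comparing recompute_total(record) with the logged total_loss is a quick consistency check on a downloaded metrics file.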
Citation
If you use these checkpoints or training code, please cite:
@misc{laionbox-ablation-2026,
title={LaionBox Auxiliary Loss Ablation Study},
author={LAION},
year={2026},
url={https://huggingface.co/TTS-AGI/laionbox-ablation-checkpoints}
}
License
Apache 2.0