LaionBox Ablation Checkpoints

LoRA checkpoints and full training/evaluation code from an auxiliary-loss ablation study on DramaBox (LTX-2.3 22B audio-only flow-matching TTS model).

This repository accompanies the LaionBox voice-cloning fine-tuning project. It contains:

  • Trained LoRA adapters (.safetensors) for each ablation condition
  • Training metrics (metrics.jsonl) logged every 10 optimizer steps
  • Full training and evaluation scripts under code/
  • YAML configs for every ablation condition

Motivation

LaionBox v0.1-wip was trained with a differentiable reward LoRA (5 epochs on DramaBox + Emolia data). This ablation study investigates which non-differentiable auxiliary losses improve voice quality when combined with the base flow-matching objective. All conditions start from the same v0.1-wip checkpoint and use identical hyperparameters; only the active auxiliary losses differ.


Training Setup

| Parameter | Value |
| --- | --- |
| Base model | LTX-2.3 22B audio-only (ltx-2.3-22b-dev-audio-only-v13-merged) |
| Resume from | LaionBox v0.1-wip (5-epoch diff reward LoRA) |
| LoRA rank / alpha | 128 / 128 |
| Learning rate | 2e-5, cosine schedule, 40 warmup steps |
| Effective batch size | 256 (8 GPUs x 1 micro-batch x 32 grad accum) |
| Duration | 2 podcast-anchored epochs (~234 optimizer steps) |
| CLAP model | gijs/voiceclap-lco-7b-lora (7B, INT4 quantized) |
| Reward type | Non-differentiable scalar (no grad through CLAP) |
| Aux trigger | sigma < 0.4 (~40% of micro-steps) |
| Data mix | 40% DramaBox, 20% Emolia, 40% Podcast |
| Text dropout | 10% |
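
As a rough illustration of the schedule and batch-size rows above, the sketch below computes the effective batch size and a warmup-plus-cosine learning rate. The linear warmup shape, the decay to zero, and the name `lr_at` are assumptions made for this example; the training script may implement the schedule differently.

```python
import math

NUM_GPUS, MICRO_BATCH, GRAD_ACCUM = 8, 1, 32
EFFECTIVE_BATCH = NUM_GPUS * MICRO_BATCH * GRAD_ACCUM  # 8 * 1 * 32 = 256

PEAK_LR = 2e-5
WARMUP_STEPS = 40
TOTAL_STEPS = 234  # approximate value from the table


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step.

    Assumed shape: linear warmup to PEAK_LR, then cosine decay to zero.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at step 39, stays at the peak at step 40, and reaches zero at step 234.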

Auxiliary Losses

| Loss | Description |
| --- | --- |
| Naturalness | CLAP text-audio cosine similarity between generated audio and positive/negative text prompts. Positive: "Realistic, genuine, spontaneous, authentic...". Negative: "distorted, unnatural, robotic..." |
| Quality MLP | P(real) from a binary classifier (MLP head on CLAP embeddings) trained to distinguish real vs. synthetic speech |
| Centroid | cos(emb, real_centroid) - cos(emb, synth_centroid) using pre-computed CLAP embedding centroids |
| Speaker Similarity | WavLM-SV cosine similarity between reference speaker embedding and generated speaker embedding |
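
The centroid score is the only entry above given as an explicit formula, so it can be sketched directly. The example below uses plain Python lists in place of CLAP embeddings; the function names `cosine` and `centroid_score` are hypothetical and are not taken from the training script.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid_score(emb, real_centroid, synth_centroid):
    """cos(emb, real_centroid) - cos(emb, synth_centroid).

    Positive values mean the embedding lies closer (in angle) to the
    real-speech centroid than to the synthetic-speech centroid.
    """
    return cosine(emb, real_centroid) - cosine(emb, synth_centroid)
```

With toy 2-D vectors, an embedding aligned with the real centroid and orthogonal to the synthetic one scores 1.0, and an embedding equidistant from both scores 0.0.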

Each auxiliary loss is scaled by its own EMA-based adaptive coefficient so that its weighted magnitude stays approximately equal to that of the flow-matching loss. Coefficients are capped at 10.0.
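
One plausible way to realize such a coefficient is sketched below. It is an assumption, not the repository's implementation: the ratio-of-EMAs rule, the decay of 0.99, and the class name `EmaCoefficient` are all chosen for illustration. Only the cap of 10.0 comes from the text above.

```python
class EmaCoefficient:
    """Adaptive weight that keeps coeff * aux_loss near the flow-loss magnitude.

    Hypothetical sketch: tracks exponential moving averages of both loss
    magnitudes and returns their ratio, clipped to a maximum value.
    """

    def __init__(self, decay=0.99, cap=10.0, eps=1e-8):
        self.decay = decay      # assumed EMA decay
        self.cap = cap          # coefficient cap stated in the README
        self.eps = eps          # avoids division by zero
        self.ema_flow = None
        self.ema_aux = None

    def _ema(self, previous, value):
        if previous is None:    # first observation initializes the average
            return value
        return self.decay * previous + (1.0 - self.decay) * value

    def update(self, flow_loss, aux_loss):
        """Record one step and return the coefficient for the auxiliary loss."""
        self.ema_flow = self._ema(self.ema_flow, abs(flow_loss))
        self.ema_aux = self._ema(self.ema_aux, abs(aux_loss))
        return min(self.cap, self.ema_flow / (self.ema_aux + self.eps))
```

In this sketch an auxiliary loss half the size of the flow loss receives a coefficient near 2, while a very small auxiliary loss is held at the cap rather than being amplified without bound.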


Ablation Conditions

| Condition | Active Losses | Config | Status |
| --- | --- | --- | --- |
| nat_only | naturalness | finetune_nat_only.yaml | Complete |
| ablation_nat_quality | naturalness + quality MLP | ablation_nat_quality.yaml | Complete |
| ablation_nat_centroid | naturalness + centroid | ablation_nat_centroid.yaml | Pending |
| ablation_nat_quality_speaker | naturalness + quality MLP + speaker sim | ablation_nat_quality_speaker.yaml | Pending |
| ablation_nat_speaker | naturalness + speaker sim | ablation_nat_speaker.yaml | Pending |

Completed Results

| Condition | Best Flow Loss | Best Flow Step | Best Nat Score | Best Nat Step | Quality Prob |
| --- | --- | --- | --- | --- | --- |
| nat_only | 0.528 | 160 | 0.111 | 190 | 0.50 (disabled) |
| ablation_nat_quality | 0.528 | 160 | 0.459 | 190 | ~0.88-0.92 |

Key observation: Adding the quality MLP raises the best naturalness reward from 0.111 to 0.459 while the best flow loss is unchanged (0.528 at step 160 in both runs), which suggests that the quality classifier provides a complementary training signal.


Repository Structure

.
├── README.md
├── nat_only/
│   ├── best_flow_step160.safetensors     # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors      # LoRA: best naturalness (0.111)
│   └── metrics.jsonl                     # 23 log entries, steps 10-230
├── ablation_nat_quality/
│   ├── best_flow_step160.safetensors     # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors      # LoRA: best naturalness (0.459)
│   └── metrics.jsonl                     # 23 log entries, steps 10-230
└── code/
    ├── scripts/
    │   ├── dramabox_finetune_train_multi_aux.py   # Main training script (~102 KB)
    │   ├── run_comprehensive_eval.py              # Multi-model eval + HTML report (~36 KB)
    │   ├── run_ablation_eval.py                   # Ablation eval wrapper (~5 KB)
    │   ├── run_ablation_overnight.py              # Sequential train + eval runner (~15 KB)
    │   └── ablation_status_server.py              # HTTP monitoring dashboard (~6 KB)
    └── configs/
        ├── finetune_nat_only.yaml
        ├── ablation_nat_quality.yaml
        ├── ablation_nat_centroid.yaml
        ├── ablation_nat_quality_speaker.yaml
        └── ablation_nat_speaker.yaml

Code Overview

Training (dramabox_finetune_train_multi_aux.py)

The main training script supports:

  • Multi-GPU training via accelerate (tested on 8x GPU)
  • Bucket-weighted sampling across DramaBox, Emolia, and podcast data sources
  • Three auxiliary loss slots (naturalness, quality MLP, and centroid or speaker similarity), each with its own EMA-adaptive coefficient
  • Shifted logit-normal timestep sampling from the DramaBox training recipe
  • LoRA fine-tuning with configurable rank, alpha, and dropout
  • Checkpoint management: saves every N steps, keeps last K, promotes checkpoints that improve best metrics
  • Built-in HTTP monitoring server for real-time loss curves

# Launch training for a single ablation condition
accelerate launch --num_processes=8 scripts/dramabox_finetune_train_multi_aux.py \
    --config configs/ablation_nat_quality.yaml
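
To make the bucket-weighted sampling item concrete, here is a minimal sketch using the 40/20/40 data mix from the training setup. The two-stage procedure (pick a bucket by weight, then pick an example uniformly within it) and the names `BUCKET_WEIGHTS`, `sample_bucket`, and `sample_batch` are assumptions; the actual sampler in the training script may work differently.

```python
import random

# Data mix from the Training Setup section.
BUCKET_WEIGHTS = {"dramabox": 0.4, "emolia": 0.2, "podcast": 0.4}


def sample_bucket(rng):
    """Pick a data source name with probability proportional to its weight."""
    names = list(BUCKET_WEIGHTS)
    weights = [BUCKET_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def sample_batch(buckets, rng, n):
    """Draw n examples: choose a bucket by weight, then an item uniformly.

    `buckets` maps each source name to a non-empty list of examples.
    """
    return [rng.choice(buckets[sample_bucket(rng)]) for _ in range(n)]
```

Sampling the bucket first keeps the mix independent of how many examples each source contains, so a small source is not drowned out by a large one.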

Evaluation (run_comprehensive_eval.py)

The evaluation pipeline:

  1. Generates audio for multiple models x reference voices x prompts (parallelized across GPUs)
  2. Scores all generated audio with: CLAP-small, CLAP-large (7B), centroid distance, quality MLP, speaker similarity
  3. Builds an interactive HTML report with embedded MP3 audio for side-by-side comparison

# Run comprehensive evaluation
python scripts/run_comprehensive_eval.py \
    --output-dir eval_output \
    --num-gpus 8 \
    --seeds 42
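
Step 1 of the pipeline amounts to enumerating a grid of generation jobs and spreading them over GPUs. The sketch below shows one simple way to do that with round-robin assignment. The function `build_jobs` and its arguments are hypothetical; how the evaluation script actually partitions work is not documented here.

```python
from itertools import product


def build_jobs(models, voices, prompts, seeds, num_gpus):
    """Assign every (model, voice, prompt, seed) combination to a GPU.

    Returns a list with one job list per GPU, filled round-robin so that
    the per-GPU job counts differ by at most one.
    """
    jobs = [[] for _ in range(num_gpus)]
    for index, combo in enumerate(product(models, voices, prompts, seeds)):
        jobs[index % num_gpus].append(combo)
    return jobs
```

For example, 2 models, 2 reference voices, 3 prompts, and 1 seed give 12 jobs, which 8 GPUs would split as four GPUs with two jobs and four with one.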

Overnight Runner (run_ablation_overnight.py)

Orchestrates the full ablation study: runs 4 training experiments sequentially, finds best checkpoints from each, then runs a combined evaluation across all conditions + baselines.

python scripts/run_ablation_overnight.py

Status Server (ablation_status_server.py)

Lightweight HTTP server for monitoring training progress in real time during long runs.


Usage

These are LoRA adapters for the LTX-2.3 22B audio-only model. To use them for inference:

# Load with DramaBox inference pipeline
python inference.py \
    --checkpoint <base_model_path> \
    --lora <checkpoint.safetensors> \
    --lora-rank 128

Loading a specific checkpoint

from huggingface_hub import hf_hub_download

# Download the best-naturalness checkpoint from nat+quality ablation
ckpt_path = hf_hub_download(
    repo_id="TTS-AGI/laionbox-ablation-checkpoints",
    filename="ablation_nat_quality/best_nat_step190.safetensors",
)
print(f"Downloaded to: {ckpt_path}")

Metrics Format

Each metrics.jsonl file contains one JSON object per logging step with the following fields:

| Field | Description |
| --- | --- |
| step | Optimizer step number |
| flow_loss | Flow-matching loss (primary training objective) |
| lr | Current learning rate |
| clap_text_reward | Raw CLAP text-audio cosine similarity |
| quality_prob | Quality MLP P(real) score (0.5 if disabled) |
| naturalness_reward | Combined naturalness signal |
| centroid_score | Centroid-based real/fake score (0.0 if disabled) |
| aux1_loss / aux2_loss / aux3_loss | Individual auxiliary loss values |
| coeff1 / coeff2 / coeff3 | EMA-adaptive coefficients |
| total_loss | flow_loss + sum(coeff_i * aux_i_loss) |
| mode_counts | Cumulative counts of voice_clone_fwd, unconditional, voice_clone_rev |
| tgt_tokens / ref_tokens | Token counts for current batch |
| steps_per_sec / elapsed_sec / eta_sec | Timing statistics |
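
The fields above are enough to recover the "best step" numbers reported earlier and to sanity-check total_loss. The following sketch uses only the standard library. The helper names are hypothetical, and it assumes that absent auxiliary fields can be treated as zero, which may not match how the script logs disabled losses.

```python
import json


def load_metrics(lines):
    """Parse JSONL: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def best_steps(records):
    """Return (step with lowest flow_loss, step with highest naturalness_reward)."""
    best_flow = min(records, key=lambda r: r["flow_loss"])
    best_nat = max(records, key=lambda r: r["naturalness_reward"])
    return best_flow["step"], best_nat["step"]


def recompute_total(record):
    """flow_loss + sum(coeff_i * aux_i_loss), treating missing fields as 0."""
    total = record["flow_loss"]
    for i in (1, 2, 3):
        total += record.get(f"coeff{i}", 0.0) * record.get(f"aux{i}_loss", 0.0)
    return total
```

Because `load_metrics` accepts any iterable of strings, an open file handle for a `metrics.jsonl` file can be passed to it directly.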

Citation

If you use these checkpoints or training code, please cite:

@misc{laionbox-ablation-2026,
  title={LaionBox Auxiliary Loss Ablation Study},
  author={LAION},
  year={2026},
  url={https://huggingface.co/TTS-AGI/laionbox-ablation-checkpoints}
}

License

Apache 2.0
