LaionBox Ablation Checkpoints

LoRA checkpoints and full training/evaluation code from an auxiliary-loss ablation study on DramaBox (LTX-2.3 22B audio-only flow-matching TTS model).

This repository accompanies the LaionBox voice-cloning fine-tuning project. It contains:

Trained LoRA adapters (.safetensors) for each ablation condition
Training metrics (metrics.jsonl) logged every 10 optimizer steps
Full training and evaluation scripts under code/
YAML configs for every ablation condition

Motivation

LaionBox v0.1-wip was trained with a differentiable reward LoRA (5 epochs on DramaBox + Emolia data). This ablation study investigates which non-differentiable auxiliary losses improve voice quality when combined with the base flow-matching objective. All conditions start from the same v0.1-wip checkpoint and use identical hyperparameters; only the active auxiliary losses differ.

Training Setup

Parameter	Value
Base model	LTX-2.3 22B audio-only (`ltx-2.3-22b-dev-audio-only-v13-merged`)
Resume from	LaionBox v0.1-wip (5-epoch diff reward LoRA)
LoRA rank / alpha	128 / 128
Learning rate	2e-5, cosine schedule, 40 warmup steps
Effective batch size	256 (8 GPUs x 1 micro-batch x 32 grad accum)
Duration	2 podcast-anchored epochs (~234 optimizer steps)
CLAP model	`gijs/voiceclap-lco-7b-lora` (7B, INT4 quantized)
Reward type	Non-differentiable scalar (no grad through CLAP)
Aux trigger	sigma < 0.4 (~40% of micro-steps)
Data mix	40% DramaBox, 20% Emolia, 40% Podcast
Text dropout	10%
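
The batch-size and schedule rows above can be sanity-checked with a short sketch. It assumes linear warmup from zero followed by cosine decay to zero over the ~234 optimizer steps; the exact warmup shape and the final learning-rate floor used by the training script are not stated in this README, so `lr_at_step` is a hypothetical helper, not the script's scheduler.

```python
import math

# Values taken from the Training Setup table.
NUM_GPUS = 8
MICRO_BATCH = 1
GRAD_ACCUM = 32
PEAK_LR = 2e-5
WARMUP_STEPS = 40
TOTAL_STEPS = 234  # approximate, per the table


def effective_batch_size(num_gpus: int, micro_batch: int, grad_accum: int) -> int:
    """Samples contributing to one optimizer step."""
    return num_gpus * micro_batch * grad_accum


def lr_at_step(step: int, peak_lr: float = PEAK_LR,
               warmup: int = WARMUP_STEPS, total: int = TOTAL_STEPS) -> float:
    """Assumed schedule: linear warmup from 0, then cosine decay to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(effective_batch_size(NUM_GPUS, MICRO_BATCH, GRAD_ACCUM))  # 256
```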

Auxiliary Losses

Loss	Description
Naturalness	CLAP text-audio cosine similarity between generated audio and positive/negative text prompts. Positive: "Realistic, genuine, spontaneous, authentic...". Negative: "distorted, unnatural, robotic..."
Quality MLP	P(real) from a binary classifier (MLP head on CLAP embeddings) trained to distinguish real vs. synthetic speech
Centroid	`cos(emb, real_centroid) - cos(emb, synth_centroid)` using pre-computed CLAP embedding centroids
Speaker Similarity	WavLM-SV cosine similarity between reference speaker embedding and generated speaker embedding
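
As an illustration of the centroid row above, here is a minimal pure-Python sketch of `cos(emb, real_centroid) - cos(emb, synth_centroid)`. The function names are hypothetical, and the toy 2-D vectors stand in for real CLAP embeddings; the training script presumably operates on tensors instead.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_score(emb: Sequence[float],
                   real_centroid: Sequence[float],
                   synth_centroid: Sequence[float]) -> float:
    """Positive when emb is closer to the real centroid than to the synthetic one."""
    return cosine(emb, real_centroid) - cosine(emb, synth_centroid)


# Toy example: an embedding aligned with the "real" centroid.
print(centroid_score([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # 1.0
```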

Each auxiliary loss is scaled by its own EMA-based adaptive coefficient so that its magnitude stays approximately equal to that of the flow-matching loss. Coefficients are capped at 10.0.
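
One plausible way to realize such an adaptive coefficient is sketched below. It keeps exponential moving averages of the flow-loss and auxiliary-loss magnitudes and uses their ratio, clipped at the cap. The class name, the decay of 0.99, and the ratio form are assumptions for illustration; only the cap of 10.0 comes from this README.

```python
class AdaptiveCoeff:
    """Hypothetical EMA-ratio coefficient; not the training script's actual code."""

    def __init__(self, decay: float = 0.99, cap: float = 10.0, eps: float = 1e-8):
        self.decay = decay      # assumed EMA decay
        self.cap = cap          # coefficient cap from the README
        self.eps = eps
        self.ema_flow = None
        self.ema_aux = None

    def _ema(self, prev, value: float) -> float:
        # The first observation initializes the average directly.
        if prev is None:
            return value
        return self.decay * prev + (1.0 - self.decay) * value

    def update(self, flow_loss: float, aux_loss: float) -> float:
        """Update both averages and return the capped coefficient."""
        self.ema_flow = self._ema(self.ema_flow, abs(flow_loss))
        self.ema_aux = self._ema(self.ema_aux, abs(aux_loss))
        return min(self.cap, self.ema_flow / (self.ema_aux + self.eps))


coeff = AdaptiveCoeff()
c = coeff.update(flow_loss=0.5, aux_loss=0.25)
# c * aux_loss is now roughly the same size as flow_loss.
```

With this form, a very small auxiliary loss would otherwise produce a very large coefficient, which is the case the cap guards against.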

Ablation Conditions

Condition	Active Losses	Config	Status
`nat_only`	naturalness	`finetune_nat_only.yaml`	Complete
`ablation_nat_quality`	naturalness + quality MLP	`ablation_nat_quality.yaml`	Complete
`ablation_nat_centroid`	naturalness + centroid	`ablation_nat_centroid.yaml`	Pending
`ablation_nat_quality_speaker`	naturalness + quality MLP + speaker sim	`ablation_nat_quality_speaker.yaml`	Pending
`ablation_nat_speaker`	naturalness + speaker sim	`ablation_nat_speaker.yaml`	Pending

Completed Results

Condition	Best Flow Loss	Best Flow Step	Best Nat Score	Best Nat Step	Quality Prob
`nat_only`	0.528	160	0.111	190	0.50 (disabled)
`ablation_nat_quality`	0.528	160	0.459	190	~0.88-0.92

Key observation: Adding the quality MLP raises the best naturalness score from 0.111 to 0.459 while the best flow loss stays at 0.528 (reached at step 160 in both runs), suggesting that the quality classifier provides a complementary training signal.

Repository Structure

.
├── README.md
├── nat_only/
│   ├── best_flow_step160.safetensors     # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors      # LoRA: best naturalness (0.111)
│   └── metrics.jsonl                     # 23 log entries, steps 10-230
├── ablation_nat_quality/
│   ├── best_flow_step160.safetensors     # LoRA: best flow loss (0.528)
│   ├── best_nat_step190.safetensors      # LoRA: best naturalness (0.459)
│   └── metrics.jsonl                     # 23 log entries, steps 10-230
└── code/
    ├── scripts/
    │   ├── dramabox_finetune_train_multi_aux.py   # Main training script (~102 KB)
    │   ├── run_comprehensive_eval.py              # Multi-model eval + HTML report (~36 KB)
    │   ├── run_ablation_eval.py                   # Ablation eval wrapper (~5 KB)
    │   ├── run_ablation_overnight.py              # Sequential train + eval runner (~15 KB)
    │   └── ablation_status_server.py              # HTTP monitoring dashboard (~6 KB)
    └── configs/
        ├── finetune_nat_only.yaml
        ├── ablation_nat_quality.yaml
        ├── ablation_nat_centroid.yaml
        ├── ablation_nat_quality_speaker.yaml
        └── ablation_nat_speaker.yaml

Code Overview

Training (`dramabox_finetune_train_multi_aux.py`)

The main training script supports:

Multi-GPU training via accelerate (tested on 8x GPU)
Bucket-weighted sampling across DramaBox, Emolia, and podcast data sources
Three independent auxiliary loss slots (naturalness, quality MLP, and centroid or speaker similarity), each with its own EMA-adaptive coefficient
Shifted logit-normal timestep sampling from the DramaBox training recipe
LoRA fine-tuning with configurable rank, alpha, and dropout
Checkpoint management: saves every N steps, keeps last K, promotes checkpoints that improve best metrics
Built-in HTTP monitoring server for real-time loss curves
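
The checkpoint-management behaviour in the list above (keep the last K, promote new bests) can be sketched as in-memory bookkeeping. `CheckpointTracker` is a hypothetical name, and the sketch only decides what to promote or evict; the real script also writes and deletes `.safetensors` files.

```python
from collections import deque


class CheckpointTracker:
    """Tracks rolling checkpoints and best-metric promotions (illustrative only)."""

    def __init__(self, keep_last: int):
        self.keep_last = keep_last
        self.recent = deque()                    # steps of rolling checkpoints
        self.best_flow = (None, float("inf"))    # (step, value), lower is better
        self.best_nat = (None, float("-inf"))    # (step, value), higher is better

    def record(self, step: int, flow_loss: float, nat_score: float):
        """Return (promoted, evicted) for a newly saved checkpoint."""
        promoted = []
        if flow_loss < self.best_flow[1]:
            self.best_flow = (step, flow_loss)
            promoted.append("best_flow")
        if nat_score > self.best_nat[1]:
            self.best_nat = (step, nat_score)
            promoted.append("best_nat")
        self.recent.append(step)
        evicted = []
        while len(self.recent) > self.keep_last:
            evicted.append(self.recent.popleft())
        return promoted, evicted


tracker = CheckpointTracker(keep_last=2)
tracker.record(10, flow_loss=0.60, nat_score=0.05)
tracker.record(20, flow_loss=0.55, nat_score=0.04)
promoted, evicted = tracker.record(30, flow_loss=0.58, nat_score=0.10)
# promoted == ["best_nat"], evicted == [10]
```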

# Launch training for a single ablation condition
accelerate launch --num_processes=8 scripts/dramabox_finetune_train_multi_aux.py \
    --config configs/ablation_nat_quality.yaml
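
For intuition, bucket-weighted sampling with the 40/20/40 data mix from the Training Setup table might look like the sketch below. The bucket keys and the `sample_buckets` helper are hypothetical; the actual script may weight at a different granularity (for example per shard or per duration bucket).

```python
import random
from collections import Counter

# Data mix from the Training Setup table; the key names are illustrative.
BUCKET_WEIGHTS = {"dramabox": 0.4, "emolia": 0.2, "podcast": 0.4}


def sample_buckets(n: int, weights: dict = BUCKET_WEIGHTS, seed: int = 0) -> list:
    """Draw n source-bucket names with probability proportional to their weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


counts = Counter(sample_buckets(10_000))
for name, target in BUCKET_WEIGHTS.items():
    print(f"{name}: target {target:.0%}, sampled {counts[name] / 10_000:.1%}")
```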

Evaluation (`run_comprehensive_eval.py`)

The evaluation pipeline:

Generates audio for multiple models x reference voices x prompts (parallelized across GPUs)
Scores all generated audio with: CLAP-small, CLAP-large (7B), centroid distance, quality MLP, speaker similarity
Builds an interactive HTML report with embedded MP3 audio for side-by-side comparison

# Run comprehensive evaluation
python scripts/run_comprehensive_eval.py \
    --output-dir eval_output \
    --num-gpus 8 \
    --seeds 42
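
The "models x reference voices x prompts" grid and its parallelization across GPUs can be pictured as follows. This is a sketch under the assumption of a simple round-robin split; `build_jobs` and `shard_round_robin` are hypothetical helpers, and the actual script may distribute work differently.

```python
from itertools import product


def build_jobs(models, voices, prompts, seeds):
    """Enumerate every (model, voice, prompt, seed) combination to generate."""
    return [
        {"model": m, "voice": v, "prompt": p, "seed": s}
        for m, v, p, s in product(models, voices, prompts, seeds)
    ]


def shard_round_robin(jobs, num_gpus):
    """Assign job i to GPU i % num_gpus so shard sizes differ by at most one."""
    shards = [[] for _ in range(num_gpus)]
    for i, job in enumerate(jobs):
        shards[i % num_gpus].append(job)
    return shards


# Placeholder names, not the actual evaluation set.
jobs = build_jobs(["model_a", "model_b"], ["voice_1", "voice_2", "voice_3"],
                  ["prompt_1", "prompt_2"], [42])
shards = shard_round_robin(jobs, num_gpus=8)
print(len(jobs), [len(s) for s in shards])
```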

Overnight Runner (`run_ablation_overnight.py`)

Orchestrates the full ablation study: runs 4 training experiments sequentially, finds best checkpoints from each, then runs a combined evaluation across all conditions + baselines.

python scripts/run_ablation_overnight.py
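
Conceptually, the sequential part of such a runner amounts to building one training command per config and executing them in order. The sketch below only constructs the commands, reusing the `accelerate launch` invocation shown earlier; `build_train_command` and `plan_runs` are hypothetical names, and the real runner's checkpoint selection and evaluation steps are not reproduced here.

```python
import shlex

TRAIN_SCRIPT = "scripts/dramabox_finetune_train_multi_aux.py"


def build_train_command(config_path: str, num_processes: int = 8) -> list:
    """Build the argv list for one training run."""
    return [
        "accelerate", "launch", f"--num_processes={num_processes}",
        TRAIN_SCRIPT, "--config", config_path,
    ]


def plan_runs(config_paths) -> list:
    """One command per config, in the order they should run."""
    return [build_train_command(p) for p in config_paths]


for cmd in plan_runs(["configs/ablation_nat_quality.yaml",
                      "configs/ablation_nat_centroid.yaml"]):
    # To execute sequentially: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```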

Status Server (`ablation_status_server.py`)

Lightweight HTTP server for monitoring training progress in real time during long runs.

Usage

These are LoRA adapters for the LTX-2.3 22B audio-only model. To use them for inference:

# Load with DramaBox inference pipeline
python inference.py \
    --checkpoint <base_model_path> \
    --lora <checkpoint.safetensors> \
    --lora-rank 128

Loading a specific checkpoint

from huggingface_hub import hf_hub_download

# Download the best-naturalness checkpoint from nat+quality ablation
ckpt_path = hf_hub_download(
    repo_id="TTS-AGI/laionbox-ablation-checkpoints",
    filename="ablation_nat_quality/best_nat_step190.safetensors",
)
print(f"Downloaded to: {ckpt_path}")

Metrics Format

Each metrics.jsonl file contains one JSON object per logging step with the following fields:

Field	Description
`step`	Optimizer step number
`flow_loss`	Flow-matching loss (primary training objective)
`lr`	Current learning rate
`clap_text_reward`	Raw CLAP text-audio cosine similarity
`quality_prob`	Quality MLP P(real) score (0.5 if disabled)
`naturalness_reward`	Combined naturalness signal
`centroid_score`	Centroid-based real/fake score (0.0 if disabled)
`aux1_loss` / `aux2_loss` / `aux3_loss`	Individual auxiliary loss values
`coeff1` / `coeff2` / `coeff3`	EMA-adaptive coefficients
`total_loss`	`flow_loss + sum(coeff_i * aux_i_loss)`
`mode_counts`	Cumulative counts of voice_clone_fwd, unconditional, voice_clone_rev
`tgt_tokens` / `ref_tokens`	Token counts for current batch
`steps_per_sec` / `elapsed_sec` / `eta_sec`	Timing statistics
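
A small sketch for reading a metrics file with the fields above, using only the standard library. It assumes that "best flow" means the minimum `flow_loss` and "best naturalness" the maximum `naturalness_reward`; the README does not state which field the training script uses for promotion, so treat `best_steps` as illustrative. `recompute_total` follows the `total_loss` formula from the table.

```python
import json


def load_metrics(lines):
    """Parse JSONL lines (for example an open file handle), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def best_steps(records):
    """Return (step with lowest flow_loss, step with highest naturalness_reward)."""
    best_flow = min(records, key=lambda r: r["flow_loss"])
    best_nat = max(records, key=lambda r: r["naturalness_reward"])
    return best_flow["step"], best_nat["step"]


def recompute_total(record):
    """flow_loss + sum(coeff_i * aux_i_loss); missing terms count as 0."""
    aux = sum(
        record.get(f"coeff{i}", 0.0) * record.get(f"aux{i}_loss", 0.0)
        for i in (1, 2, 3)
    )
    return record["flow_loss"] + aux


# Synthetic rows for illustration only, not real training output.
sample = [
    '{"step": 10, "flow_loss": 0.60, "naturalness_reward": 0.05, "coeff1": 2.0, "aux1_loss": 0.1}',
    '{"step": 20, "flow_loss": 0.53, "naturalness_reward": 0.04}',
    '{"step": 30, "flow_loss": 0.55, "naturalness_reward": 0.11}',
]
records = load_metrics(sample)
print(best_steps(records))
```

For a downloaded file, pass the handle directly, for example `with open("nat_only/metrics.jsonl") as f: records = load_metrics(f)`.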

Citation

If you use these checkpoints or training code, please cite:

@misc{laionbox-ablation-2026,
  title={LaionBox Auxiliary Loss Ablation Study},
  author={LAION},
  year={2026},
  url={https://huggingface.co/TTS-AGI/laionbox-ablation-checkpoints}
}

License

Apache 2.0
