Upload README.md with huggingface_hub

README.md (changed):

---
license: apache-2.0
tags:
- text-to-speech
- speech-synthesis
- autoencoder
- flow-matching
- audio
---
# BlueCodec – Speech Autoencoder

This repository contains the pretrained **Speech Autoencoder (AE)** from the Light-BlueTTS system – Stage 1 of a three-stage TTS pipeline.

The AE encodes 44.1 kHz audio into a 24-dim continuous latent space at ~86 Hz and decodes it back to a waveform via a causal dilated decoder.

**Total inference model size:** ~71M parameters (AE decoder ~25M, TTL ~45M, DP ~0.5M)

---
## Checkpoint

| File | Contents |
|---|---|
| `ae_latest.safetensors` | Encoder + Decoder weights (202 tensors, ~204 MB) |

Keys are prefixed with `encoder.*` and `decoder.*`.
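
The prefixes make it easy to split the checkpoint into its two halves. A minimal loading sketch, assuming the `LatentEncoder` / `LatentDecoder1D` classes from `models/autoencoder/` are constructed from `configs/tts.json` (constructors not shown here):

```python
from safetensors.torch import load_file

state = load_file("ae_latest.safetensors")  # dict[str, Tensor], 202 tensors

# Split by prefix into per-module state dicts.
encoder_sd = {k.removeprefix("encoder."): v for k, v in state.items() if k.startswith("encoder.")}
decoder_sd = {k.removeprefix("decoder."): v for k, v in state.items() if k.startswith("decoder.")}

# decoder = LatentDecoder1D(...)        # constructor args come from configs/tts.json
# decoder.load_state_dict(decoder_sd)
```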

---

## Architecture

The encoder takes a concatenated log-linear (1025-ch) + log-mel (228-ch) spectrogram (FFT 2048, hop 512) and maps it to a 24-dim latent at ~86 Hz.

| Component | Details |
|---|---|
| Input | 1253-channel spectrogram (1025 log-linear + 228 log-mel) |
| Encoder (~25.6M) | Conv1d stem (1253→512) + 10 ConvNeXt blocks (intermediate 2048) + proj (512→24) |
| Decoder (~25.3M) | CausalConv1d stem (24→512) + 10 causal dilated ConvNeXt blocks + VocoderHead |
| Decoder dilations | `[1, 2, 4, 1, 2, 4, 1, 1, 1, 1]` |
| Discriminators (train only) | MPD (periods 2,3,5,7,11) + MRD (FFTs 512/1024/2048) |
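
The input features can be reproduced with standard torchaudio transforms. This is a sketch of the feature layout only, not the repo's `LinearMelSpectrogram` (window, normalization, and the log floor are assumptions):

```python
import torch
import torchaudio

SR, N_FFT, HOP = 44100, 2048, 512  # ~86 frames per second at 44.1 kHz

linear = torchaudio.transforms.Spectrogram(n_fft=N_FFT, hop_length=HOP, power=1.0)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=N_FFT, hop_length=HOP, n_mels=228, power=1.0
)

def encoder_input(wav: torch.Tensor) -> torch.Tensor:
    """wav: (B, samples) at 44.1 kHz -> (B, 1253, frames)."""
    log_lin = linear(wav).clamp_min(1e-5).log()  # (B, 1025, frames)
    log_mel = mel(wav).clamp_min(1e-5).log()     # (B, 228, frames)
    return torch.cat([log_lin, log_mel], dim=1)  # 1025 + 228 = 1253 channels
```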

**Generator loss (training):**
```
L_G = 45 * L_recon + 1 * L_adv + 0.1 * L_fm
```
Reconstruction uses multi-resolution mel L1 on 3 scales: (FFT 1024, 64 mels), (FFT 2048, 128 mels), (FFT 4096, 128 mels).
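
A hedged sketch of the reconstruction term; the per-scale hop lengths and the averaging over scales are assumptions, and the adversarial / feature-matching terms from the MPD/MRD discriminators are not shown:

```python
import torch
import torch.nn.functional as F
import torchaudio

SR = 44100
SCALES = [(1024, 64), (2048, 128), (4096, 128)]  # (n_fft, n_mels)

mel_transforms = [
    torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=n_fft, hop_length=n_fft // 4, n_mels=n_mels
    )
    for n_fft, n_mels in SCALES
]

def mel_l1(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Multi-resolution log-mel L1 between predicted and reference waveforms."""
    loss = pred.new_zeros(())
    for mel in mel_transforms:
        loss = loss + F.l1_loss(mel(pred).clamp_min(1e-5).log(),
                                mel(target).clamp_min(1e-5).log())
    return loss / len(mel_transforms)

# L_G = 45 * mel_l1(fake, real) + 1 * adv_loss + 0.1 * fm_loss
```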

---

## Full Pipeline – Light-BlueTTS

Training is split into three independent stages run in order:

1. **Speech Autoencoder** – *this repo* – encodes audio into a 24-dim continuous latent space
2. **Text-to-Latent (TTL)** – flow-matching model that maps text + reference speech to latents
3. **Duration Predictor (DP)** – utterance-level duration estimator

### Text-to-Latent Module

Operates on *compressed* latents: the 24-dim latent is reshaped to 144-dim at ~14 Hz (compression factor K_c = 6).
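
The compression itself is a frame-grouping reshape; a minimal sketch, assuming a (batch, time, channels) layout (the repo's actual tensor layout may differ):

```python
import torch

K_C = 6  # compression factor: 24-dim @ ~86 Hz -> 144-dim @ ~14 Hz

def compress(z: torch.Tensor) -> torch.Tensor:
    """z: (B, T, 24) -> (B, T // K_C, 144); trims T to a multiple of K_C."""
    b, t, d = z.shape
    t = t - t % K_C
    return z[:, :t].reshape(b, t // K_C, d * K_C)

def decompress(zc: torch.Tensor) -> torch.Tensor:
    """zc: (B, T', 144) -> (B, T' * K_C, 24)."""
    b, tc, dc = zc.shape
    return zc.reshape(b, tc * K_C, dc // K_C)
```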

| Component | Details |
|---|---|
| Reference Encoder (~4.8M) | Conv1d (144→256) + 6 ConvNeXt blocks (k=5) + 2 cross-attn layers → 50 style tokens |
| Text Encoder (~6.9M) | Char embedding (256-dim) + 6 ConvNeXt blocks + 4 self-attn blocks (RoPE) + 2 style cross-attn layers |
| Vector Field Estimator (~33M) | proj_in (144→512) + 4× superblock + 4 final ConvNeXt blocks + proj_out (512→144) |
| VF superblock | 4× dilated ConvNeXt (d=1,2,4,8) + time injection + 2× ConvNeXt + text cross-attn + style cross-attn |

**Flow-matching objective (L1, masked):**
```
L_TTL = E[ || m · (v(z_t, z_ref, c, t) - (z₁ - (1 - σ_min)·z₀)) ||₁ ]
```
σ_min = 1e-8, p_uncond = 0.05. Inference: Euler method, NFE=32, CFG=3.
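
Inference integrates the learned vector field with a plain Euler loop plus classifier-free guidance. A sketch using the settings above; the `VectorFieldEstimator` call signature and how conditioning is dropped for the unconditional branch are assumptions:

```python
import torch

@torch.no_grad()
def sample_latents(vf, text, z_ref, num_frames, nfe=32, cfg=3.0):
    """Euler integration from z_0 ~ N(0, I) to compressed latents (B, T', 144)."""
    z = torch.randn(1, num_frames, 144)
    ts = torch.linspace(0.0, 1.0, nfe + 1)
    for i in range(nfe):
        t = ts[i].expand(z.shape[0])
        v_cond = vf(z, text, z_ref, t)      # conditional vector field
        v_uncond = vf(z, None, None, t)     # conditioning dropped (p_uncond branch)
        v = v_uncond + cfg * (v_cond - v_uncond)
        z = z + (ts[i + 1] - ts[i]) * v     # Euler step
    return z
```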

### Duration Predictor (~0.5M)

Utterance-level (not phoneme-level).

| Component | Details |
|---|---|
| DP Reference Encoder | Linear (144→64) + 4 ConvNeXt blocks + 2 cross-attn → 64-dim embedding |
| DP Text Encoder | Char embedding (64-dim) + 6 ConvNeXt blocks + 2 self-attn + utterance token → 64-dim |
| Estimator | Linear(192→128) + PReLU + Linear(128→1) → scalar log-duration |
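
The estimator head itself is tiny; the sketch below mirrors the table row (how the 192-dim input is assembled from the reference and text embeddings is an internal detail of `TTSDurationModel` and is not shown):

```python
import torch
import torch.nn as nn

estimator = nn.Sequential(
    nn.Linear(192, 128),
    nn.PReLU(),
    nn.Linear(128, 1),  # scalar log-duration for the whole utterance
)

log_dur = estimator(torch.randn(1, 192))
duration = log_dur.exp()  # back to linear scale (units depend on training)
```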

---

## Training the Autoencoder

Multi-GPU training uses PyTorch DDP, e.g. on 4 GPUs:

```bash
torchrun --nproc_per_node=4 src/train_autoencoder.py \
    --arch_config configs/tts.json
```

**Key hyperparameters:**

| Parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.8, β₂=0.99, wd=0.01) |
| Learning rate | 2e-4 with cosine annealing to 1e-6 |
| Batch size | 128 |
| Crop length | 0.19 s (~8,379 samples at 44.1 kHz) |
| Total iterations | 1,500,000 |
| Hardware (paper) | 2× RTX 3090 |
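
In PyTorch terms, the optimizer and schedule rows above map roughly to the following (a sketch; `generator` stands in for the AE encoder + decoder, and any warmup used in `train_autoencoder.py` is not shown):

```python
import torch

TOTAL_ITERS = 1_500_000

generator = torch.nn.Linear(1, 1)  # placeholder for the AE encoder + decoder
optimizer = torch.optim.AdamW(
    generator.parameters(), lr=2e-4, betas=(0.8, 0.99), weight_decay=0.01
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=TOTAL_ITERS, eta_min=1e-6
)
# per step: loss on a 0.19 s crop (batch 128) -> optimizer.step(); scheduler.step()
```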

Resume from checkpoint:

```bash
torchrun --nproc_per_node=4 src/train_autoencoder.py \
    --resume checkpoints/ae/ae_latest.pt
```

Training dataset: ~5.9M files / ~10,000 hours of audio.

---

## Reducing Model Size

All dimensions are controlled by `configs/tts.json`. Key levers for the AE:

| Change | Param reduction |
|---|---|
| `encoder.idim`: 1253 → 228 (mel-only input) | −3.7M |
| `encoder.hdim`: 512 → 256 | −10M |
| `encoder.intermediate_dim`: 2048 → 1024 | −10.5M |
| `encoder.num_layers` / `decoder.num_layers`: 10 → 6 | −8.4M each |

> Reducing `idim` to 228 also requires updating `LinearMelSpectrogram` to output mel-only features (set `n_mels=228` and remove the log-linear concatenation in `models/utils.py`).
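
The same levers can be applied by editing the config before training. A sketch, assuming the dotted names above correspond to nested keys in `configs/tts.json` (the output filename is arbitrary):

```python
import json

with open("configs/tts.json") as f:
    cfg = json.load(f)

# Assumed nesting: "encoder.hdim" -> cfg["encoder"]["hdim"], etc.
cfg["encoder"]["hdim"] = 256               # 512 -> 256, ~ -10M params
cfg["encoder"]["intermediate_dim"] = 1024  # 2048 -> 1024, ~ -10.5M params
cfg["encoder"]["num_layers"] = 6           # 10 -> 6, ~ -8.4M params
cfg["decoder"]["num_layers"] = 6

with open("configs/tts_small.json", "w") as f:
    json.dump(cfg, f, indent=2)
```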

---

## Repository Structure

```
training/
├── src/
│   ├── train_autoencoder.py           # Stage 1: AE training (multi-GPU DDP)
│   ├── train_text_to_latent.py        # Stage 2: TTL flow-matching training
│   └── train_duration_predictor.py    # Stage 3: Duration predictor training
├── models/
│   ├── autoencoder/
│   │   ├── latent_encoder.py          # LatentEncoder (mel → 24-dim latent)
│   │   ├── latent_decoder.py          # LatentDecoder1D (latent → waveform)
│   │   ├── discriminators.py          # MPD + MRD for GAN training
│   │   └── modules.py                 # Shared: ConvNeXtBlock, CausalConvNeXtBlock, etc.
│   └── text2latent/
│       ├── text_encoder.py            # TextEncoder with ConvNeXt + self-attn + style cross-attn
│       ├── reference_encoder.py       # ReferenceEncoder (audio → style tokens)
│       ├── vf_estimator.py            # VectorFieldEstimator (flow-matching backbone)
│       ├── duration_predictor.py      # TTSDurationModel (full DP model)
│       └── dp_network.py              # DPNetwork (backward-compatible wrapper)
├── compute_latent_stats.py            # Compute latent mean/std (run before Stage 2/3)
└── configs/
    └── tts.json                       # Single config file for all stages
```