notmax123 committed · verified
Commit e09fff1 · 1 Parent(s): 0ce63bb

Upload README.md with huggingface_hub

Files changed (1): README.md +7 -132
README.md CHANGED
@@ -4,153 +4,28 @@ tags:
  - text-to-speech
  - speech-synthesis
  - autoencoder
- - flow-matching
  - audio
  ---

  # BlueCodec — Speech Autoencoder

- This repository contains the pretrained **Speech Autoencoder (AE)** from the Light-BlueTTS system — Stage 1 of a three-stage TTS pipeline.
-
- The AE encodes 44.1 kHz audio into a 24-dim continuous latent space at ~86 Hz, and decodes back to waveform via a causal dilated decoder.
-
- **Total inference model size:** ~71M parameters (AE decoder ~25M, TTL ~45M, DP ~0.5M)
-
- ---
-
- ## Checkpoint
-
- | File | Contents |
- |---|---|
- | `ae_latest.safetensors` | Encoder + Decoder weights (202 tensors, ~204 MB) |
-
- Keys are prefixed with `encoder.*` and `decoder.*`.
+ A neural speech autoencoder that compresses 44.1 kHz audio into a compact continuous latent representation, used as the first stage of the Light-BlueTTS text-to-speech system.
+
+ The encoder turns raw audio into a 24-dim latent sequence at ~86 Hz. Downstream TTS modules (flow-matching, duration prediction) operate entirely in this latent space, making synthesis fast and lightweight. The decoder reconstructs full-quality waveforms from those latents at inference time.

  ---

  ## Architecture

- Encodes via a concatenated log-linear (1025-ch) + log-mel (228-ch) spectrogram (FFT 2048, hop 512) into a 24-dim latent at ~86 Hz.
-
  | Component | Details |
  |---|---|
- | Input | 1253-channel spectrogram (1025 log-linear + 228 log-mel) |
+ | Input | 1253-channel spectrogram (1025 log-linear + 228 log-mel, FFT 2048, hop 512) |
- | Encoder (~25.6M) | Conv1d stem (1253→512) + 10 ConvNeXt blocks (intermediate 2048) + proj (512→24) |
+ | Encoder (~25.6M) | Conv1d stem (1253→512) + 10 ConvNeXt blocks + proj (512→24) |
  | Decoder (~25.3M) | CausalConv1d stem (24→512) + 10 causal dilated ConvNeXt blocks + VocoderHead |
+ | Latent | 24-dim @ ~86 Hz |
- | Decoder dilations | `[1, 2, 4, 1, 2, 4, 1, 1, 1, 1]` |
- | Discriminators (train only) | MPD (periods 2,3,5,7,11) + MRD (FFTs 512/1024/2048) |
-
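As an illustration of that input featurization, here is a minimal sketch using stock `torchaudio` (the repo's `LinearMelSpectrogram` in `models/utils.py` is the reference implementation; the `power` and epsilon choices below are assumptions):

```python
import torch
import torchaudio

sr, n_fft, hop = 44_100, 2048, 512
linear = torchaudio.transforms.Spectrogram(n_fft=n_fft, hop_length=hop, power=1.0)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=n_fft, hop_length=hop, n_mels=228, power=1.0
)

wav = torch.randn(1, sr)                 # (batch, samples): one second of dummy audio
feats = torch.cat(
    [linear(wav).clamp_min(1e-5).log(),  # (1, 1025, frames) log-linear
     mel(wav).clamp_min(1e-5).log()],    # (1, 228, frames)  log-mel
    dim=1,
)                                        # (1, 1253, frames), ~86 frames/s at hop 512
```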
- **Generator loss (training):**
- ```
- L_G = 45 * L_recon + 1 * L_adv + 0.1 * L_fm
- ```
- Reconstruction uses multi-resolution mel L1 on 3 scales: (FFT 1024, 64 mels), (FFT 2048, 128 mels), (FFT 4096, 128 mels).
-
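A minimal sketch of that multi-resolution mel L1 term, assuming log-magnitude mels, a per-scale hop of `n_fft // 4`, and a simple mean over the three scales:

```python
import torch
import torchaudio

SCALES = [(1024, 64), (2048, 128), (4096, 128)]  # (n_fft, n_mels) per the README
mels = [
    torchaudio.transforms.MelSpectrogram(
        sample_rate=44_100, n_fft=n_fft, hop_length=n_fft // 4, n_mels=n_mels
    )
    for n_fft, n_mels in SCALES
]

def multi_res_mel_l1(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """L1 between log-mel spectrograms, averaged over the three scales."""
    losses = [
        (m(fake).clamp_min(1e-5).log() - m(real).clamp_min(1e-5).log()).abs().mean()
        for m in mels
    ]
    return torch.stack(losses).mean()

# total generator loss: L_G = 45 * multi_res_mel_l1(fake, real) + l_adv + 0.1 * l_fm
```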
- ---
-
- ## Full Pipeline — Light-BlueTTS
-
- Training is split into three independent stages run in order:
-
- 1. **Speech Autoencoder** ← *this repo* — encodes audio into a 24-dim continuous latent space
- 2. **Text-to-Latent (TTL)** — flow-matching model that maps text + reference speech to latents
- 3. **Duration Predictor (DP)** — utterance-level duration estimator
-
- ### Text-to-Latent Module
-
- Operates on *compressed* latents: the 24-dim latent is reshaped to 144-dim at ~14 Hz (compression factor K_c = 6).
-
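A minimal sketch of that compression, assuming a `(batch, frames, channels)` latent layout:

```python
import torch

K_c = 6                                        # compression factor
z = torch.randn(2, 516, 24)                    # (batch, frames @ ~86 Hz, 24)
B, T, C = z.shape
T = (T // K_c) * K_c                           # drop any remainder frames
z_c = z[:, :T].reshape(B, T // K_c, C * K_c)   # (batch, frames @ ~14 Hz, 144)
```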
- | Component | Details |
- |---|---|
- | Reference Encoder (~4.8M) | Conv1d (144→256) + 6 ConvNeXt blocks (k=5) + 2 cross-attn layers → 50 style tokens |
- | Text Encoder (~6.9M) | Char embedding (256-dim) + 6 ConvNeXt blocks + 4 self-attn blocks (RoPE) + 2 style cross-attn layers |
- | Vector Field Estimator (~33M) | proj_in (144→512) + 4× superblock + 4 final ConvNeXt blocks + proj_out (512→144) |
- | VF superblock | 4× dilated ConvNeXt (d=1,2,4,8) + time injection + 2× ConvNeXt + text cross-attn + style cross-attn |
-
- **Flow-matching objective (L1, masked):**
- ```
- L_TTL = E [ || m · (v(z_t, z_ref, c, t) - (z₁ - (1 - σ_min)·z₀)) ||₁ ]
- ```
- σ_min = 1e-8, p_uncond = 0.05. Inference: Euler method, NFE=32, CFG=3.
-
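A minimal sampling sketch under those settings; `vf` stands in for the repo's `VectorFieldEstimator`, and the convention of passing `None` for dropped conditioning is an assumption:

```python
import torch

def sample_latents(vf, shape, text, style, nfe: int = 32, cfg: float = 3.0):
    """Euler integration of the learned vector field from z_0 ~ N(0, I) to z_1."""
    z = torch.randn(shape)
    dt = 1.0 / nfe
    for i in range(nfe):
        t = torch.full((shape[0],), i * dt)
        v_cond = vf(z, t, text, style)             # conditional velocity
        v_uncond = vf(z, t, None, None)            # unconditional (dropped cond.)
        v = v_uncond + cfg * (v_cond - v_uncond)   # classifier-free guidance
        z = z + dt * v                             # Euler step toward z_1
    return z
```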
- ### Duration Predictor (~0.5M)
-
- Predicts a single duration for the whole utterance (utterance-level, not phoneme-level).
-
- | Component | Details |
- |---|---|
- | DP Reference Encoder | Linear (144→64) + 4 ConvNeXt blocks + 2 cross-attn → 64-dim embedding |
- | DP Text Encoder | Char embedding (64-dim) + 6 ConvNeXt blocks + 2 self-attn + utterance token → 64-dim |
- | Estimator | Linear(192→128) + PReLU + Linear(128→1) → scalar log-duration |
-
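A minimal sketch of the estimator head as specified in the table; the class name is hypothetical, and how the 192-dim input is assembled from the reference/text embeddings is not spelled out here:

```python
import torch
import torch.nn as nn

class DPEstimatorHead(nn.Module):
    """Hypothetical wrapper for the documented Linear(192->128) + PReLU + Linear(128->1) head."""
    def __init__(self, in_dim: int = 192):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.PReLU(), nn.Linear(128, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, 192) concatenated features -> (batch,) scalar log-duration
        return self.net(h).squeeze(-1)
```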
- ---
-
- ## Training the Autoencoder
-
- Multi-GPU training via PyTorch DDP on 4× GPUs.
-
- ```bash
- torchrun --nproc_per_node=4 src/train_autoencoder.py \
-     --arch_config configs/tts.json
- ```
-
- **Key hyperparameters:**
-
- | Parameter | Value |
- |---|---|
- | Optimizer | AdamW (β₁=0.8, β₂=0.99, wd=0.01) |
- | Learning rate | 2e-4 with cosine annealing to 1e-6 |
- | Batch size | 128 |
- | Crop length | 0.19 s (~8,379 samples at 44.1 kHz) |
- | Total iterations | 1,500,000 |
- | Hardware (paper) | 2× RTX 3090 |
-
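A minimal sketch of that optimizer and schedule in stock PyTorch (stepping the scheduler once per iteration is an assumption):

```python
import torch

model = torch.nn.Linear(24, 24)  # stand-in module; the real AE comes from the repo
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.8, 0.99), weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_500_000, eta_min=1e-6)

for step in range(1_500_000):
    # ... forward pass, loss, loss.backward() ...
    opt.step()
    opt.zero_grad()
    sched.step()  # anneals lr from 2e-4 down to 1e-6
```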
- Resume from checkpoint:
- ```bash
- torchrun --nproc_per_node=4 src/train_autoencoder.py \
-     --resume checkpoints/ae/ae_latest.pt
- ```
-
- Training dataset: ~5.9M files / ~10,000 hours of audio.

  ---

- ## Reducing Model Size
-
- All dimensions are controlled by `configs/tts.json`. Key levers for the AE:
-
- | Change | Param reduction |
- |---|---|
- | `encoder.idim`: 1253 → 228 (mel-only input) | −3.7M |
- | `encoder.hdim`: 512 → 256 | −10M |
- | `encoder.intermediate_dim`: 2048 → 1024 | −10.5M |
- | Reduce `encoder.num_layers` / `decoder.num_layers` 10 → 6 | −8.4M each |
-
- > Reducing `idim` to 228 also requires updating `LinearMelSpectrogram` to output mel-only (set `n_mels=228` and remove the log-linear concatenation in `models/utils.py`).
-
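A minimal sketch of applying those levers programmatically; the nested `encoder`/`decoder` key layout inside `configs/tts.json` is assumed from the table above, so check the actual file first:

```python
import json

with open("configs/tts.json") as f:
    cfg = json.load(f)

cfg["encoder"]["hdim"] = 256               # ~ -10M params
cfg["encoder"]["intermediate_dim"] = 1024  # ~ -10.5M params
cfg["encoder"]["num_layers"] = 6           # ~ -8.4M params
cfg["decoder"]["num_layers"] = 6           # ~ -8.4M params

with open("configs/tts.json", "w") as f:
    json.dump(cfg, f, indent=2)
```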
- ---
-
- ## Repository Structure

- ```
- training/
- ├── src/
- │   ├── train_autoencoder.py         # Stage 1: AE training (multi-GPU DDP)
- │   ├── train_text_to_latent.py      # Stage 2: TTL flow-matching training
- │   └── train_duration_predictor.py  # Stage 3: Duration predictor training
- ├── models/
- │   ├── autoencoder/
- │   │   ├── latent_encoder.py        # LatentEncoder (mel → 24-dim latent)
- │   │   ├── latent_decoder.py        # LatentDecoder1D (latent → waveform)
- │   │   ├── discriminators.py        # MPD + MRD for GAN training
- │   │   └── modules.py               # Shared: ConvNeXtBlock, CausalConvNeXtBlock, etc.
- │   └── text2latent/
- │       ├── text_encoder.py          # TextEncoder with ConvNeXt + self-attn + style cross-attn
- │       ├── reference_encoder.py     # ReferenceEncoder (audio → style tokens)
- │       ├── vf_estimator.py          # VectorFieldEstimator (flow-matching backbone)
- │       ├── duration_predictor.py    # TTSDurationModel (full DP model)
- │       └── dp_network.py            # DPNetwork (backward-compatible wrapper)
- ├── compute_latent_stats.py          # Compute latent mean/std (run before Stage 2/3)
- └── configs/
-     └── tts.json                     # Single config file for all stages
- ```
+ ## Checkpoint
+
+ `ae_latest.safetensors` — encoder + decoder weights (~204 MB). Keys are prefixed with `encoder.*` and `decoder.*`.
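For reference, a minimal loading sketch using only the stock `safetensors` API and the documented key prefixes (instantiating `LatentEncoder` / `LatentDecoder1D` from `models/autoencoder/` is left out):

```python
from safetensors.torch import load_file

state = load_file("ae_latest.safetensors")  # flat dict of named tensors

encoder_sd = {k.removeprefix("encoder."): v
              for k, v in state.items() if k.startswith("encoder.")}
decoder_sd = {k.removeprefix("decoder."): v
              for k, v in state.items() if k.startswith("decoder.")}

# encoder.load_state_dict(encoder_sd)
# decoder.load_state_dict(decoder_sd)
print(f"{len(encoder_sd)} encoder tensors, {len(decoder_sd)} decoder tensors")
```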