metadata
language:
  - en
tags:
  - tts
  - text-to-speech
  - safetensors
  - cake
license: apache-2.0
base_model: YatharthS/LuxTTS

LuxTTS (Safetensors / FP16)

This is a converted version of YatharthS/LuxTTS, a flow-matching-based text-to-speech model. All credit for the original model, training, and research goes to the original authors.

What changed

The original PyTorch checkpoints (model.pt and vocoder/vocos.bin) have been converted to safetensors format in float16 precision for use with Cake. The conversion applies the following transformations:

  • Format: .pt / .bin → .safetensors (safer, faster loading, memory-mappable).
  • Precision: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
  • Key remapping: The nested fm_decoder.encoders.{stack}.layers.{layer} hierarchy is flattened to fm_decoder.layers.{flat_index} using the stack sizes [2, 2, 4, 4, 4] (16 layers total). Similarly, text_encoder.encoders.0.layers is flattened to text_encoder.layers. Per-stack components (time_emb, downsample, out_combiner) are reorganized under fm_decoder.stack_time_emb, fm_decoder.downsample, and fm_decoder.out_combiner respectively.
  • Config: architectures field and feature extraction parameters (n_fft, hop_length, n_mels, sample_rate) are added to config.json.
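The layer flattening described in the key-remapping item can be sketched in a few lines of Python. This is an illustration of the index arithmetic only, not the actual conversion script: the function name `remap_key` and the tensor suffixes used in the examples are hypothetical, and the per-stack components (time_emb, downsample, out_combiner) are left out because their original key layout is not spelled out here.

```python
import re
from itertools import accumulate

# Stack sizes as stated above; they sum to 16 layers.
STACK_SIZES = [2, 2, 4, 4, 4]
# Flat index of the first layer in each stack: [0, 2, 4, 8, 12].
STACK_OFFSETS = [0, *accumulate(STACK_SIZES)][:-1]

_FM_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
_TEXT_LAYER = re.compile(r"^text_encoder\.encoders\.0\.layers\.(.+)$")


def remap_key(key: str) -> str:
    """Map a nested checkpoint key to its flattened name (hypothetical helper)."""
    m = _FM_LAYER.match(key)
    if m:
        stack, layer, rest = int(m.group(1)), int(m.group(2)), m.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise ValueError(f"unexpected stack/layer index in {key!r}")
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"
    m = _TEXT_LAYER.match(key)
    if m:
        return f"text_encoder.layers.{m.group(1)}"
    # Keys that match neither pattern are passed through unchanged.
    return key


# The suffix "norm.weight" is made up for the example.
print(remap_key("fm_decoder.encoders.2.layers.0.norm.weight"))
print(remap_key("text_encoder.encoders.0.layers.3.norm.weight"))
```

With these stack sizes, layer 0 of stack 2 lands at flat index 4 and the last layer of the last stack (stack 4, layer 3) at flat index 15.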

No weights were retrained or fine-tuned. Apart from the FP32 → FP16 cast, which rounds every value to half precision, the conversion leaves the weights unchanged.
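To get a feel for what the FP16 cast does to an individual value, the sketch below rounds a number to IEEE 754 half precision using only the Python standard library. It is a simplified illustration: the helper name `to_fp16_and_back` is hypothetical, Python floats are 64-bit rather than FP32, and the real conversion was presumably done with a tensor library rather than `struct`.

```python
import struct


def to_fp16_and_back(x: float) -> float:
    """Round x to the nearest half-precision value and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Values that are exactly representable in FP16 survive unchanged.
print(to_fp16_and_back(0.5))   # 0.5

# Other values are rounded; FP16 keeps 11 significant bits, so the
# relative error for normal-range values is at most 2**-11.
print(to_fp16_and_back(0.1))   # 0.0999755859375

samples = [0.1, 0.2, 0.3, 1.7, 3.14159, 123.456]
worst = max(abs(to_fp16_and_back(v) - v) / abs(v) for v in samples)
print(f"worst relative error over samples: {worst:.2e}")
```

Halving the bytes per weight (4 for FP32, 2 for FP16) is also what accounts for the drop from roughly 530 MB to roughly 266 MB.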

Model details

| Component | File | Size |
|---|---|---|
| Main model (flow-matching decoder + text encoder) | model.safetensors | 235 MB |
| Vocoder (Vocos) | vocos.safetensors | 31 MB |

  • Architecture: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
  • Vocoder: Vocos (iSTFT-based, 8 layers, 512 dim)
  • Sample rate: 24 kHz (with 48 kHz upsampler head)
  • Vocabulary: 360 tokens (characters + punctuation)
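The tensor listing of either file can be inspected without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header describing every tensor. The sketch below relies only on that file layout and the standard library; the helper names `read_safetensors_header` and `summarize` are hypothetical, and the path in the usage note assumes the file has been downloaded locally.

```python
import json
import struct
from math import prod


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a safetensors file, without the metadata entry."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(path: str) -> dict:
    """Count tensors and parameters and list the dtypes found in the header."""
    header = read_safetensors_header(path)
    return {
        "tensors": len(header),
        "parameters": sum(prod(info["shape"]) for info in header.values()),
        "dtypes": sorted({info["dtype"] for info in header.values()}),
    }
```

Calling `summarize("model.safetensors")` on a local copy should list only the F16 dtype if the file matches the description above, and the header keys can be checked against the flattened naming scheme.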

Original project