language:
  - en
tags:
  - tts
  - text-to-speech
  - safetensors
  - cake
license: apache-2.0
base_model: YatharthS/LuxTTS

LuxTTS (Safetensors / FP16)

This repository contains a converted version of YatharthS/LuxTTS, a flow-matching text-to-speech model. All credit for the original model, training, and research goes to the original authors.

What changed

The original PyTorch checkpoints (model.pt and vocoder/vocos.bin) have been converted to safetensors format in float16 precision for use with Cake. The conversion makes the following changes:

Format: .pt / .bin → .safetensors (safer, faster loading, memory-mappable).
Precision: FP32 → FP16, reducing total size from ~530 MB to ~266 MB.
Key remapping: The nested fm_decoder.encoders.{stack}.layers.{layer} hierarchy is flattened to fm_decoder.layers.{flat_index} using the stack sizes [2, 2, 4, 4, 4] (16 layers total). Similarly, text_encoder.encoders.0.layers is flattened to text_encoder.layers. Per-stack components (time_emb, downsample, out_combiner) are reorganized under fm_decoder.stack_time_emb, fm_decoder.downsample, and fm_decoder.out_combiner respectively.
Config: An architectures field and the feature extraction parameters (n_fft, hop_length, n_mels, sample_rate) are added to config.json.
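The key remapping described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual conversion script: the function name `remap_key` and the parameter suffixes in the usage lines (such as `norm.bias`) are hypothetical, and the per-stack components (time_emb, downsample, out_combiner) are left untouched because their source key names are not spelled out here.

```python
import re
from itertools import accumulate

# Stack sizes quoted in this README: 2 + 2 + 4 + 4 + 4 = 16 layers.
STACK_SIZES = [2, 2, 4, 4, 4]
# Flat index of the first layer of each stack: [0, 2, 4, 8, 12].
STACK_OFFSETS = [0, *accumulate(STACK_SIZES)][:-1]

_DECODER_LAYER = re.compile(r"^fm_decoder\.encoders\.(\d+)\.layers\.(\d+)\.(.+)$")
_TEXT_LAYER_PREFIX = "text_encoder.encoders.0.layers."


def remap_key(key: str) -> str:
    """Map a nested checkpoint key to its flattened name (hypothetical helper)."""
    match = _DECODER_LAYER.match(key)
    if match:
        stack, layer, rest = int(match.group(1)), int(match.group(2)), match.group(3)
        if stack >= len(STACK_SIZES) or layer >= STACK_SIZES[stack]:
            raise KeyError(f"layer index out of range for the assumed stack sizes: {key}")
        return f"fm_decoder.layers.{STACK_OFFSETS[stack] + layer}.{rest}"
    if key.startswith(_TEXT_LAYER_PREFIX):
        return "text_encoder.layers." + key[len(_TEXT_LAYER_PREFIX):]
    # Anything else (including per-stack components) passes through unchanged
    # in this sketch.
    return key


# Stack 2 starts at flat index 4, so its layer 3 becomes flat layer 7.
print(remap_key("fm_decoder.encoders.2.layers.3.norm.bias"))  # fm_decoder.layers.7.norm.bias
print(remap_key("text_encoder.encoders.0.layers.1.norm.bias"))  # text_encoder.layers.1.norm.bias
```

Deriving the offsets from the stack sizes, rather than hard-coding them, keeps the flat index consistent with the 16-layer total stated above.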

No weights were retrained or fine-tuned. Apart from the rounding introduced by the FP32 → FP16 cast, the conversion changes only the file format, key names, and config.
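To give a feel for what the FP32 → FP16 cast costs, here is a small standard-library sketch. It uses Python's `struct` half-precision format (`e`) to round a few made-up example values; the values are illustrative and are not taken from the model, and Python floats are double precision, so this only approximates the FP32 starting point.

```python
import struct

BYTES_FP32 = struct.calcsize("<f")  # 4 bytes per weight
BYTES_FP16 = struct.calcsize("<e")  # 2 bytes per weight


def to_fp16_and_back(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relative_error(x: float) -> float:
    """Relative rounding error introduced by storing x as FP16."""
    return abs(to_fp16_and_back(x) - x) / abs(x)


# Hypothetical weight values, chosen inside FP16's normal range.
for value in (1.0, 0.1, -0.0123, 3.14159):
    print(value, to_fp16_and_back(value), relative_error(value))
```

For values in FP16's normal range the relative error is at most 2^-11 (about 0.05%), and halving the bytes per weight is consistent with the drop from ~530 MB to ~266 MB.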

Model details

| Component | File | Size |
| --- | --- | --- |
| Main model (flow-matching decoder + text encoder) | `model.safetensors` | 235 MB |
| Vocoder (Vocos) | `vocos.safetensors` | 31 MB |

Architecture: Flow-matching TTS with conformer-based decoder (16 layers across 5 stacks) and 4-layer text encoder
Vocoder: Vocos (iSTFT-based, 8 layers, 512 dim)
Sample rate: 24 kHz (with 48 kHz upsampler head)
Vocabulary: 360 tokens (characters + punctuation)
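One way to check the dtype and payload size of the files listed above without loading any weights is to read the safetensors header directly: the file starts with an 8-byte little-endian header length, followed by a JSON header describing each tensor. The sketch below uses only the standard library; the helper names are hypothetical, and the demo runs against a tiny synthetic file rather than the real `model.safetensors`.

```python
import json
import os
import struct
import tempfile
from collections import Counter


def read_safetensors_header(path):
    """Return the per-tensor JSON header of a safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header):
    """Count tensors per dtype and sum the tensor payload size in bytes."""
    dtypes = Counter(info["dtype"] for info in header.values())
    n_bytes = sum(
        info["data_offsets"][1] - info["data_offsets"][0]
        for info in header.values()
    )
    return dtypes, n_bytes


# Demo on a synthetic file holding one 2x3 FP16 tensor (6 values * 2 bytes).
demo_header = {"w": {"dtype": "F16", "shape": [2, 3], "data_offsets": [0, 12]}}
raw = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(raw)) + raw + bytes(12))
    dtypes, n_bytes = summarize(read_safetensors_header(demo_path))

print(dtypes, n_bytes)
```

Pointing `read_safetensors_header` at a downloaded `model.safetensors` or `vocos.safetensors` should report `F16` entries if the conversion described here was applied.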

Original project

Model: YatharthS/LuxTTS
License: Apache 2.0