BananaMind TTS V2
BananaMind TTS V2 is a small from-scratch English single-speaker text-to-speech model trained on LJSpeech. It combines the Tacotron-lite acoustic model with a self-trained HiFi-GAN vocoder, so it no longer depends on the Griffin-Lim fallback used in V1.
This release is named V2 for packaging and release purposes. In the local training code, the HiFi-GAN work was sometimes called V3.
Training code: https://github.com/Banaxi-Tech/bananamind-tts-v1-training-code
What This Model Is
- English-only TTS
- Single speaker
- Character-input Tacotron-lite acoustic model
- Self-trained HiFi-GAN neural vocoder
- Trained from scratch on LJSpeech
- 22.05 kHz audio output
- Weights provided as safetensors for inference
What This Model Is Not
- Not voice cloning
- No speaker embeddings
- No reference audio conditioning
- No multi-speaker support
- No pretrained TTS checkpoint was used
Files
- model.safetensors: Tacotron-lite acoustic model weights
- vocoder.safetensors: default BF16 HiFi-GAN generator-only vocoder
- FP32/vocoder.safetensors: FP32 HiFi-GAN generator-only vocoder
- full_vocoder/vocoder.pt: full epoch 28 vocoder training checkpoint with generator, discriminators, optimizers, config, epoch, and step
- config.json: Hugging Face custom model config for AutoModel
- configuration_bananamind_tts.py: custom AutoConfig implementation
- modeling_bananamind_tts.py: custom AutoModel implementation with HiFi-GAN loading
- model_config.json: sidecar metadata with acoustic config, tokenizer, vocoder metadata, epoch, and step
- generate.py: local generation example
Usage
Install runtime dependencies:
pip install torch numpy safetensors transformers huggingface_hub
Use with Transformers remote code:
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "Banaxi-Tech/BananaMind-TTS-V2",
    trust_remote_code=True,
)
model.eval()

with torch.inference_mode():
    out = model.tts(
        "Hello from BananaMind TTS version two.",
        normalize_wav=True,
    )

model.save_wav("sample.wav", out.waveform, out.sample_rate)
From this folder, you can also run:
python generate.py
Vocoder Options
The default vocoder is vocoder.safetensors, a BF16 generator-only HiFi-GAN export. It is small and intended for normal inference.
To switch to the FP32 generator-only vocoder:
model.reload_vocoder("FP32/vocoder.safetensors", dtype="float32")
The full training vocoder checkpoint is included at full_vocoder/vocoder.pt. It is much larger because it includes the discriminators and optimizer states needed for training resume. It is not needed for normal text-to-speech generation.
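As a rough guide to how the two generator-only exports compare, the sketch below estimates raw weight storage from the approximate generator parameter count given in the Training Summary (3.56M). It assumes every parameter is stored in a single dtype and ignores file header overhead, so the real file sizes may differ. The helper name `approx_weight_megabytes` is illustrative and is not part of this repository.

```python
# Approximate generator parameter count, taken from the Training Summary.
GENERATOR_PARAMS = 3_560_000

# Storage cost per parameter for the two exported precisions.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def approx_weight_megabytes(num_params: int, dtype: str) -> float:
    """Estimate raw weight storage in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1_000_000


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        size = approx_weight_megabytes(GENERATOR_PARAMS, dtype)
        print(f"{dtype}: ~{size:.1f} MB")
```

Under these assumptions the BF16 export needs roughly half the storage of the FP32 export. The full training checkpoint is not estimated here because the sizes of its discriminators and optimizer states are not stated.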
Text Normalization Note
The tokenizer is character based and strips unsupported characters, including digits. Write numbers as words:
- Use: one plus one is two
- Not: 1 plus 1 is 2
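One way to apply this rule automatically is to spell out digit runs before passing text to the model. The following is a minimal sketch, not part of this repository: the names `number_to_words` and `spell_out_digits` are hypothetical, only non-negative integers are handled, and words are joined with spaces rather than hyphens on the assumption that the tokenizer may strip hyphens.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only handles 0 to 999999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (" " + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def _replace_match(match: re.Match) -> str:
    digits = match.group()
    if len(digits) > 6:
        # Long runs (IDs, phone numbers) are read digit by digit.
        return " ".join(ONES[int(d)] for d in digits)
    return number_to_words(int(digits))


def spell_out_digits(text: str) -> str:
    """Replace every run of ASCII digits with its spelled-out form."""
    return re.sub(r"[0-9]+", _replace_match, text)


if __name__ == "__main__":
    print(spell_out_digits("1 plus 1 is 2"))
```

The result of `spell_out_digits` can then be passed to `model.tts` in place of the raw text. Decimals, ordinals, dates, and currency are outside this sketch and would need their own rules.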
Training Summary
- Acoustic model type: Tacotron-lite
- Vocoder type: HiFi-GAN generator
- Dataset: LJSpeech
- Sample rate: 22050 Hz
- Acoustic epoch: 80
- Acoustic step: 64529
- Vocoder epoch: 28
- Vocoder step: 42000
- Approximate acoustic parameters: 9.49M
- Approximate vocoder generator parameters: 3.56M
Current Limitations
- Character input is less robust than phoneme input for English pronunciation.
- Numbers and unusual symbols must be normalized before synthesis.
- Quality is limited by LJSpeech-only training and a small single-speaker architecture.
- This is a fixed-voice research/demo model, not a production voice system.
Intended Use
This model is intended for English single-speaker text-to-speech generation with the included custom Transformers model code.
Out-of-Scope Use
Do not present this model as a voice cloning model or use it to impersonate any person. It has no voice-cloning capability.
