BananaMind TTS V2


BananaMind TTS V2 is a small from-scratch English single-speaker text-to-speech model trained on LJSpeech. It combines the Tacotron-lite acoustic model with a self-trained HiFi-GAN vocoder, so it no longer depends on the Griffin-Lim fallback used in V1.

This release is named V2 for packaging and release purposes; in the local training code, the HiFi-GAN work is sometimes referred to as V3.

Training code: https://github.com/Banaxi-Tech/bananamind-tts-v1-training-code

What This Model Is

  • English-only TTS
  • Single speaker
  • Character-input Tacotron-lite acoustic model
  • Self-trained HiFi-GAN neural vocoder
  • Trained from scratch on LJSpeech
  • 22.05 kHz audio output
  • Weights provided as safetensors for inference

What This Model Is Not

  • Not voice cloning
  • No speaker embeddings
  • No reference audio conditioning
  • No multi-speaker support
  • No pretrained TTS checkpoint was used

Files

  • model.safetensors: Tacotron-lite acoustic model weights
  • vocoder.safetensors: default BF16 HiFi-GAN generator-only vocoder
  • FP32/vocoder.safetensors: FP32 HiFi-GAN generator-only vocoder
  • full_vocoder/vocoder.pt: full vocoder training checkpoint from epoch 28, including the generator, discriminators, optimizer states, config, epoch, and step
  • config.json: Hugging Face custom model config for AutoModel
  • configuration_bananamind_tts.py: custom AutoConfig implementation
  • modeling_bananamind_tts.py: custom AutoModel implementation with HiFi-GAN loading
  • model_config.json: sidecar metadata with acoustic config, tokenizer, vocoder metadata, epoch, and step
  • generate.py: local generation example

Usage

Install runtime dependencies:

pip install torch numpy safetensors transformers huggingface_hub

Use with Transformers remote code:

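import torch
from transformers import AutoModel

# trust_remote_code=True runs the custom modeling code shipped in the repository.
model = AutoModel.from_pretrained(
    "Banaxi-Tech/BananaMind-TTS-V2",
    trust_remote_code=True,
)
model.eval()

# Write numbers as words: the character tokenizer drops digits.
with torch.inference_mode():
    out = model.tts(
        "Hello from BananaMind TTS version two.",
        normalize_wav=True,
    )

# out carries the generated waveform and its sample rate (22050 Hz).
model.save_wav("sample.wav", out.waveform, out.sample_rate)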
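Tacotron-style attention models can become less stable on long inputs, so one common pattern is to synthesize longer passages sentence by sentence and join the results. The helpers below are a minimal sketch of that idea; `split_sentences` and `join_with_pauses` are hypothetical names, not part of the model's remote code, and the sketch assumes `out.waveform` is a float CPU tensor or NumPy array.

```python
import re

import numpy as np


def split_sentences(text: str) -> list[str]:
    """Naively split text after '.', '?' or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def join_with_pauses(chunks, sample_rate: int, pause_s: float = 0.25) -> np.ndarray:
    """Concatenate waveforms, inserting pause_s seconds of silence between them."""
    pause = np.zeros(int(round(pause_s * sample_rate)), dtype=np.float32)
    pieces = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            pieces.append(pause)
        # np.asarray accepts NumPy arrays and float CPU torch tensors;
        # reshape(-1) flattens shapes such as (1, N) to (N,).
        pieces.append(np.asarray(chunk, dtype=np.float32).reshape(-1))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)
```

With the model loaded as above, you could call `model.tts` once per entry of `split_sentences(text)` inside `torch.inference_mode()`, pass the resulting `out.waveform` values and `out.sample_rate` to `join_with_pauses`, and save the result. Whether `model.save_wav` accepts a NumPy array directly is not documented here, so wrapping the result with `torch.from_numpy` may be necessary.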

From a local copy of this repository, you can also run the bundled example script:

python generate.py

Vocoder Options

The default vocoder is vocoder.safetensors, a BF16 generator-only HiFi-GAN export. It is small and intended for normal inference.

To switch to the FP32 generator-only vocoder:

model.reload_vocoder("FP32/vocoder.safetensors", dtype="float32")
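A back-of-the-envelope estimate shows why the BF16 export is the smaller default: BF16 stores each weight in 2 bytes and FP32 in 4. The sketch below multiplies the approximate generator parameter count from the Training Summary by those widths; it ignores the small safetensors header, so the figures are estimates rather than measured file sizes.

```python
# Approximate generator parameter count, taken from the Training Summary.
GENERATOR_PARAMS = 3.56e6

# Storage width of each dtype in bytes.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}


def approx_weight_mb(n_params: float, dtype: str) -> float:
    """Estimate raw weight storage in megabytes (1 MB = 10**6 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e6


for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{approx_weight_mb(GENERATOR_PARAMS, dtype):.1f} MB")
# prints:
# bfloat16: ~7.1 MB
# float32: ~14.2 MB
```

BF16 keeps the FP32 exponent range but has fewer mantissa bits, so the FP32 copy is the option to reach for if full weight precision matters for your use.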

The full training vocoder checkpoint is included at full_vocoder/vocoder.pt. It is much larger because it includes the discriminators and optimizer states needed for training resume. It is not needed for normal text-to-speech generation.
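If you only need inference weights from the full checkpoint, the generator entry can be separated from the discriminators and optimizer states. This is a minimal sketch assuming the loaded checkpoint is a plain dict with the generator state dict under a key such as `"generator"`; that key name and the helper `extract_generator_state` are hypothetical, so inspect the real keys first.

```python
from typing import Any, Mapping


def extract_generator_state(checkpoint: Mapping[str, Any], key: str = "generator") -> dict:
    """Return only the generator weights from a full vocoder training checkpoint.

    ASSUMPTION: the checkpoint is a dict whose generator state dict is stored
    under `key`. The real key name in full_vocoder/vocoder.pt is not documented
    here; list the checkpoint's keys first and pass the right one.
    """
    if key not in checkpoint:
        raise KeyError(f"{key!r} not found; available keys: {sorted(checkpoint)}")
    # Discriminators, optimizer states, config, epoch, and step are left behind.
    return dict(checkpoint[key])
```

The file itself would typically be loaded with `torch.load("full_vocoder/vocoder.pt", map_location="cpu")`. Recent PyTorch versions default to `weights_only=True`, which can reject checkpoints containing arbitrary Python objects, so only relax that setting for files you trust. The extracted dict could then be written with `safetensors.torch.save_file`.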

Text Normalization Note

The tokenizer is character based and strips unsupported characters, including digits. Write numbers as words:

  • Use one plus one is two
  • Not 1 plus 1 is 2
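Number normalization can be automated before calling `model.tts`. The sketch below spells out digit runs and then mimics the documented behavior of dropping unsupported characters. The character set `ALLOWED` is a hypothetical stand-in (the real tokenizer vocabulary is stored in model_config.json and may keep different punctuation or handle case differently), and all function names are illustrative.

```python
import re
import string

# ASSUMPTION: hypothetical character set; the real tokenizer vocabulary is
# stored in model_config.json and may differ.
ALLOWED = set(string.ascii_letters + " '.,?!")

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0 <= n < 1000; larger numbers are read digit by digit."""
    if n >= 1000:
        return " ".join(ONES[int(d)] for d in str(n))
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] if rest == 0 else f"{TENS[tens]} {ONES[rest]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def spell_out_numbers(text: str) -> str:
    """Replace each run of ASCII digits with English words."""
    return re.sub(r"[0-9]+", lambda m: number_to_words(int(m.group())), text)


def strip_unsupported(text: str) -> str:
    """Drop characters outside ALLOWED, mimicking the tokenizer's filtering."""
    return "".join(ch for ch in text if ch in ALLOWED)


print(strip_unsupported("1 plus 1 is 2"))                      # digits vanish: " plus  is "
print(strip_unsupported(spell_out_numbers("1 plus 1 is 2")))   # one plus one is two
```

The digit-by-digit fallback for numbers of 1000 and above keeps the sketch short; years, dates, and currency amounts are usually better rewritten by hand (for example, "twenty twenty four").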

Training Summary

  • Acoustic model type: Tacotron-lite
  • Vocoder type: HiFi-GAN generator
  • Dataset: LJSpeech
  • Sample rate: 22050 Hz
  • Acoustic epoch: 80
  • Acoustic step: 64529
  • Vocoder epoch: 28
  • Vocoder step: 42000
  • Approximate acoustic parameters: 9.49M
  • Approximate vocoder generator parameters: 3.56M
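The epoch and step counts above can be cross-checked with a simple division, which gives the average number of training steps per epoch for each model. The acoustic ratio is not a whole number, so it should be read as an average rather than an exact per-epoch count; the card does not say how steps are counted.

```python
def steps_per_epoch(step: int, epoch: int) -> float:
    """Average number of training steps per epoch."""
    return step / epoch


# Figures copied from the Training Summary above.
for name, step, epoch in [("acoustic", 64529, 80), ("vocoder", 42000, 28)]:
    print(f"{name}: ~{steps_per_epoch(step, epoch):.1f} steps per epoch")
# prints:
# acoustic: ~806.6 steps per epoch
# vocoder: ~1500.0 steps per epoch
```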

Current Limitations

  • Character input is less robust than phoneme input for English pronunciation.
  • Numbers and unusual symbols must be normalized before synthesis.
  • Quality is limited by LJSpeech-only training and a small single-speaker architecture.
  • This is a fixed-voice research/demo model, not a production voice system.

Intended Use

This model is intended for English single-speaker text-to-speech generation with the included custom Transformers model code.

Out-of-Scope Use

Do not present this model as a voice cloning model or use it to impersonate any person. It has no voice-cloning capability.
