BananaMind TTS V2

BananaMind TTS V2 is a small from-scratch English single-speaker text-to-speech model trained on LJSpeech. It combines the Tacotron-lite acoustic model with a self-trained HiFi-GAN vocoder, so it no longer depends on the Griffin-Lim fallback used in V1.

This release is named V2 for packaging and release purposes. In the local training code, the HiFi-GAN work was sometimes called V3.

Training code: https://github.com/Banaxi-Tech/bananamind-tts-v1-training-code

What This Model Is

English-only TTS
Single speaker
Character-input Tacotron-lite acoustic model
Self-trained HiFi-GAN neural vocoder
Trained from scratch on LJSpeech
22.05 kHz audio output
Weights provided as safetensors for inference

What This Model Is Not

Not voice cloning
No speaker embeddings
No reference audio conditioning
No multi-speaker support
No pretrained TTS checkpoint was used

Files

model.safetensors: Tacotron-lite acoustic model weights
vocoder.safetensors: default BF16 HiFi-GAN generator-only vocoder
FP32/vocoder.safetensors: FP32 HiFi-GAN generator-only vocoder
full_vocoder/vocoder.pt: full vocoder training checkpoint from epoch 28, containing the generator, discriminators, optimizers, config, epoch, and step
config.json: Hugging Face custom model config for AutoModel
configuration_bananamind_tts.py: custom AutoConfig implementation
modeling_bananamind_tts.py: custom AutoModel implementation with HiFi-GAN loading
model_config.json: sidecar metadata with acoustic config, tokenizer, vocoder metadata, epoch, and step
generate.py: local generation example

Usage

Install runtime dependencies:

pip install torch numpy safetensors transformers huggingface_hub

Use with Transformers remote code:

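import torch
from transformers import AutoModel

# trust_remote_code=True is required: the model class is defined in this
# repository (modeling_bananamind_tts.py), not in the transformers library.
model = AutoModel.from_pretrained(
    "Banaxi-Tech/BananaMind-TTS-V2",
    trust_remote_code=True,
)
model.eval()  # put all modules in evaluation mode

# The tokenizer strips digits, so numbers in the text are written as words.
with torch.inference_mode():
    out = model.tts(
        "Hello from BananaMind TTS version two.",
        normalize_wav=True,
    )

# Write the generated waveform to disk at the model's sample rate.
model.save_wav("sample.wav", out.waveform, out.sample_rate)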

From this folder, you can also run:

python generate.py

Vocoder Options

The default vocoder is vocoder.safetensors, a BF16 generator-only HiFi-GAN export. BF16 stores each weight in two bytes, so the file is about half the size of the FP32 export. It is intended for normal inference.

To switch to the FP32 generator-only vocoder:

model.reload_vocoder("FP32/vocoder.safetensors", dtype="float32")

The full training vocoder checkpoint is included at full_vocoder/vocoder.pt. It is much larger because it includes the discriminators and optimizer states needed for training resume. It is not needed for normal text-to-speech generation.
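To see what the BF16 export trades away against the FP32 one, the sketch below rounds a value to bfloat16 precision using only the Python standard library. It is illustrative and not part of the BananaMind code: to_bfloat16 is a hypothetical helper, and it ignores NaN and infinity handling.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of an IEEE float32: the sign bit,
    all 8 exponent bits, and 7 of the 23 mantissa bits.
    Hypothetical helper; NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


for value in (1.0, 0.1, 0.003):
    rounded = to_bfloat16(value)
    rel_err = abs(rounded - value) / abs(value)
    print(f"{value!r:>6} -> {rounded!r}  (relative error {rel_err:.2e})")
```

Values such as 1.0 survive exactly, while most others pick up a relative error of up to about 0.4 percent. Whether that difference is audible in this vocoder is not something the sketch can show, so compare both exports by ear if it matters.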

Text Normalization Note

The tokenizer is character based and strips unsupported characters, including digits. Write numbers as words:

Use: one plus one is two
Not: 1 plus 1 is 2
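Because digits are stripped, it can help to spell out numbers before calling the model. The following is a minimal sketch, not part of the BananaMind code: number_to_words and normalize_numbers are hypothetical helpers, they cover only non-negative integers, and they join words with spaces on the assumption that hyphens may not be in the tokenizer's character set.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = [
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety",
]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 (hypothetical helper)."""
    if not 0 <= n <= 999_999:
        raise ValueError("this sketch only supports 0..999999")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{_ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {number_to_words(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{number_to_words(thousands)} thousand"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def _spell(match: re.Match) -> str:
    digits = match.group()
    if len(digits) > 6:
        # Too long for number_to_words: read the digits one by one.
        return " ".join(_ONES[int(d)] for d in digits)
    return number_to_words(int(digits))


def normalize_numbers(text: str) -> str:
    """Replace every run of ASCII digits with spelled-out words."""
    return re.sub(r"[0-9]+", _spell, text)


print(normalize_numbers("1 plus 1 is 2"))
```

The result of normalize_numbers can be passed to model.tts in place of the raw text. Decimals, ordinals, dates, and currency are not handled and would need their own rules.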

Training Summary

Acoustic model type: Tacotron-lite
Vocoder type: HiFi-GAN generator
Dataset: LJSpeech
Sample rate: 22050 Hz
Acoustic epoch: 80
Acoustic step: 64529
Vocoder epoch: 28
Vocoder step: 42000
Approximate acoustic parameters: 9.49M
Approximate vocoder generator parameters: 3.56M
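
The parameter counts above give a rough idea of how large the weight files are. The sketch below multiplies each count by the bytes per parameter of the storage type. It ignores file headers and metadata, weight_megabytes is a hypothetical helper, and float32 storage for the acoustic model is assumed here rather than taken from the training summary.

```python
# Approximate parameter counts from the training summary.
ACOUSTIC_PARAMS = 9.49e6
VOCODER_PARAMS = 3.56e6

# Bytes used to store one parameter in each tensor type.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_megabytes(params: float, dtype: str) -> float:
    """Raw tensor payload in MB (10**6 bytes), ignoring file headers."""
    return params * BYTES_PER_PARAM[dtype] / 1e6


print("vocoder, bfloat16:", weight_megabytes(VOCODER_PARAMS, "bfloat16"), "MB")
print("vocoder, float32: ", weight_megabytes(VOCODER_PARAMS, "float32"), "MB")
# Assumption: the acoustic weights are stored as float32.
print("acoustic, float32:", weight_megabytes(ACOUSTIC_PARAMS, "float32"), "MB")
```

Under these assumptions the generator-only vocoder comes to roughly 7.1 MB in BF16 and 14.2 MB in FP32, and the acoustic model to roughly 38 MB. The full training checkpoint is larger than any of these because it also stores the discriminators and optimizer states.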

Current Limitations

Character input is less robust than phoneme input for English pronunciation.
Numbers and unusual symbols must be normalized before synthesis.
Quality is limited by LJSpeech-only training and a small single-speaker architecture.
This is a fixed-voice research/demo model, not a production voice system.

Intended Use

This model is intended for English single-speaker text-to-speech generation with the included custom Transformers model code.

Out-of-Scope Use

Do not present this model as a voice cloning model or use it to impersonate any person. It has no voice-cloning capability.
