BananaMind TTS V1

BananaMind TTS V1 is a small from-scratch English single-speaker text-to-speech acoustic model trained on LJSpeech. It is a fixed-voice TTS model, not a voice-cloning system.

What This Model Is

  • English-only TTS
  • Single speaker
  • Character-input Tacotron-lite acoustic model
  • Trained from scratch on LJSpeech
  • 22.05 kHz mel/audio configuration
  • Weights provided as safetensors (F32 tensors)

What This Model Is Not

  • Not voice cloning
  • No speaker embeddings
  • No reference audio conditioning
  • No multi-speaker support
  • No pretrained TTS checkpoint was used

Files

  • model.safetensors: Tacotron-lite acoustic model weights
  • config.json: Hugging Face custom model config
  • configuration_bananamind_tts.py: custom AutoConfig implementation
  • modeling_bananamind_tts.py: custom AutoModel implementation
  • model_config.json: sidecar metadata with model config, tokenizer, epoch, and step

Usage

Use with Transformers remote code:

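from transformers import AutoModel

# trust_remote_code=True is needed because the model class is defined in
# the repository's modeling_bananamind_tts.py, not in Transformers itself.
model = AutoModel.from_pretrained(
    "Banaxi-Tech/BananaMind-TTS-V1",
    trust_remote_code=True,
)

# Synthesize speech from text; the result carries the waveform and its sample rate.
out = model.tts(
    "Hello from Banana TTS. This is a simple speech test.",
    normalize_wav=True,
)

# Write the generated audio to a WAV file.
model.save_wav("sample.wav", out.waveform, out.sample_rate)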

Install runtime dependencies:

pip install torch numpy safetensors transformers

Current Limitations

  • Audio is generated with a Griffin-Lim fallback vocoder, so it can sound robotic.
  • Character input means numbers and unusual symbols are not spoken unless normalized first.
  • Quality is limited by the small model, LJSpeech-only training, and lack of a trained neural vocoder.

Text Normalization Note

Digits are stripped by the current tokenizer, so write numbers as words:

  • Use: "one plus one is two"
  • Not: "1 plus 1 is 2"
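
Since digits are dropped before they reach the model, one option is to spell them out in a preprocessing step. The following is a minimal sketch of such a step, not part of this repository: the helper names `int_to_words` and `spell_out_digits` are hypothetical, it handles only plain integers, and it joins words with spaces rather than hyphens so that the output contains only letters and spaces.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = [
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety",
]


def int_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999999 in English (hypothetical helper)."""
    if n < 0 or n >= 1_000_000:
        raise ValueError("only integers from 0 to 999999 are supported")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else _TENS[tens] + " " + _ONES[ones]
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = _ONES[hundreds] + " hundred"
        return head if rest == 0 else head + " " + int_to_words(rest)
    thousands, rest = divmod(n, 1000)
    head = int_to_words(thousands) + " thousand"
    return head if rest == 0 else head + " " + int_to_words(rest)


def spell_out_digits(text: str) -> str:
    """Replace every run of ASCII digits in text with English words."""

    def replace(match: re.Match) -> str:
        digits = match.group(0)
        # Very long runs and runs with leading zeros are read digit by digit.
        if len(digits) > 6 or (len(digits) > 1 and digits[0] == "0"):
            return " ".join(_ONES[int(d)] for d in digits)
        return int_to_words(int(digits))

    return re.sub(r"[0-9]+", replace, text)


print(spell_out_digits("1 plus 1 is 2"))  # one plus one is two
```

The result of `spell_out_digits` can be passed to `model.tts` in place of the raw text. Decimals, ordinals such as "2nd", currency, and dates are not handled by this sketch and would need their own rules.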

Training Summary

  • Model type: Tacotron-lite acoustic model
  • Dataset: LJSpeech
  • Sample rate: 22050 Hz
  • Epoch: 37
  • Step: 30000
  • Approximate acoustic parameters: 9.49M

Intended Use

This model is intended for English single-speaker text-to-speech generation using the inference code included in this repository.

Out-of-Scope Use

Do not present this model as a voice cloning model or use it to impersonate any person. It has no voice-cloning capability.
