BananaMind TTS V1

BananaMind TTS V1 is a small English single-speaker text-to-speech acoustic model trained from scratch on LJSpeech. It produces one fixed voice and is not a voice-cloning system.

What This Model Is

English-only TTS
Single speaker
Character-input Tacotron-lite acoustic model
Trained from scratch on LJSpeech
22.05 kHz mel/audio configuration
Weights provided as safetensors

What This Model Is Not

Not voice cloning
No speaker embeddings
No reference audio conditioning
No multi-speaker support
No pretrained TTS checkpoint was used

Files

model.safetensors: Tacotron-lite acoustic model weights
config.json: Hugging Face custom model config
configuration_bananamind_tts.py: custom AutoConfig implementation
modeling_bananamind_tts.py: custom AutoModel implementation
model_config.json: sidecar metadata with model config, tokenizer, epoch, and step

Usage

Use with Transformers remote code:

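from transformers import AutoModel

# Load the custom model class shipped in the repository (requires trust_remote_code).
model = AutoModel.from_pretrained(
    "Banaxi-Tech/BananaMind-TTS-V1",
    trust_remote_code=True,
)

# Synthesize speech from text. Write numbers as words; digits are stripped.
out = model.tts(
    "Hello from Banana TTS. This is a simple speech test.",
    normalize_wav=True,
)

# Write the generated waveform to a WAV file at the model's sample rate.
model.save_wav("sample.wav", out.waveform, out.sample_rate)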
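To synthesize several sentences in one go, a small wrapper around the two calls shown above may be convenient. The following is a minimal sketch, not part of the repository: the helper name `synthesize_lines`, the `wavs` output directory, and the `line_NNN.wav` file naming are all hypothetical. It relies only on `model.tts(...)`, `model.save_wav(...)`, `out.waveform`, and `out.sample_rate` as used in the example above.

```python
from pathlib import Path


def synthesize_lines(model, lines, out_dir="wavs"):
    """Synthesize each non-empty line to its own WAV file.

    `model` is expected to expose `tts` and `save_wav` as in the usage
    example; the helper itself is a hypothetical convenience wrapper.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    for index, text in enumerate(lines, start=1):
        text = text.strip()
        if not text:
            # Skip blank lines, but keep numbering aligned with the input.
            continue
        out = model.tts(text, normalize_wav=True)
        wav_file = out_path / f"line_{index:03d}.wav"
        model.save_wav(str(wav_file), out.waveform, out.sample_rate)
        written.append(wav_file)
    return written
```

With a loaded model, `synthesize_lines(model, ["First sentence.", "Second sentence."])` would write `wavs/line_001.wav` and `wavs/line_002.wav`.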

Install runtime dependencies:

pip install torch numpy safetensors transformers

Current Limitations

Audio is generated with a Griffin-Lim fallback vocoder, so it can sound robotic.
Input is character-level, so numbers and unusual symbols are not spoken unless they are normalized to words first.
Quality is limited by the small model, LJSpeech-only training, and lack of a trained neural vocoder.
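For readers unfamiliar with Griffin-Lim, the sketch below illustrates the general idea: phase is not predicted by any trained network but estimated iteratively from a magnitude spectrogram, which is one reason the output can sound robotic. This is a generic NumPy illustration, not the code shipped in `modeling_bananamind_tts.py`. The frame size `n_fft=1024`, hop `hop=256`, Hann window, and iteration count are assumptions, and the step that maps the model's mel spectrogram back to a linear magnitude spectrogram is omitted.

```python
import numpy as np


def stft(x, n_fft=1024, hop=256):
    """Short-time Fourier transform, returned as (frames, bins)."""
    window = np.hanning(n_fft)
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=1024, hop=256):
    """Inverse STFT by windowed overlap-add with window-power normalization."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window**2
    out /= np.maximum(norm, 1e-8)
    # Drop the padding that stft() added on both sides.
    return out[n_fft // 2 : -(n_fft // 2)]


def griffin_lim(mag, n_iter=32, n_fft=1024, hop=256, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `mag`."""
    rng = np.random.default_rng(seed)
    # Start from random phase; only the magnitude is known.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        wav = istft(mag * phase, n_fft, hop)
        # Keep the phase of the re-analyzed signal, restore the target magnitude.
        phase = np.exp(1j * np.angle(stft(wav, n_fft, hop)))
    return istft(mag * phase, n_fft, hop)
```

A trained neural vocoder would replace this loop with a learned mapping from spectrogram to waveform, which is the improvement the limitations above point to.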

Text Normalization Note

Digits are stripped by the current tokenizer. Write numbers as words:

Use: one plus one is two
Not: 1 plus 1 is 2
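Because digits are stripped, it can help to spell out numbers before calling the model. The following is a minimal sketch using only the standard library. The function names `int_to_words` and `spell_out_digits` are hypothetical and not part of this repository. It handles non-negative integers below one million and reads anything else digit by digit. Decimals, ordinals, currency, and dates are not handled.

```python
import re

_ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_TENS = [
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety",
]


def int_to_words(n):
    """Spell out an integer in the range 0..999999 using plain spaces."""
    if n < 0 or n >= 1_000_000:
        raise ValueError("this sketch only handles 0 <= n < 1,000,000")
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = f"{_ONES[hundreds]} hundred"
        return head if rest == 0 else f"{head} {int_to_words(rest)}"
    thousands, rest = divmod(n, 1000)
    head = f"{int_to_words(thousands)} thousand"
    return head if rest == 0 else f"{head} {int_to_words(rest)}"


def _replace_match(match):
    digits = match.group()
    # Leading zeros or very long runs are read digit by digit.
    if (len(digits) > 1 and digits[0] == "0") or len(digits) > 6:
        return " ".join(_ONES[int(d)] for d in digits)
    return int_to_words(int(digits))


def spell_out_digits(text):
    """Replace every run of ASCII digits in `text` with English words."""
    return re.sub(r"[0-9]+", _replace_match, text)
```

For example, `spell_out_digits("1 plus 1 is 2")` returns `"one plus one is two"`, which can then be passed to `model.tts(...)`.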

Training Summary

Model type: Tacotron-lite acoustic model
Dataset: LJSpeech
Sample rate: 22050 Hz
Epoch: 37
Step: 30000
Approximate acoustic parameters: 9.49M
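As a rough check on download size, the parameter count can be converted to bytes. This is a back-of-the-envelope sketch, assuming every parameter is stored as a 32-bit float (the F32 tensor type listed for the weights) and ignoring the small safetensors header. The function name `estimate_weight_bytes` is hypothetical.

```python
def estimate_weight_bytes(num_params, bytes_per_param=4):
    """Approximate raw tensor storage, excluding file-format overhead."""
    return num_params * bytes_per_param


# Approximate parameter count from the training summary (9.49M).
size_bytes = estimate_weight_bytes(9_490_000)
size_mb = size_bytes / 1e6        # decimal megabytes
size_mib = size_bytes / 2**20     # binary mebibytes

print(f"{size_mb:.2f} MB, {size_mib:.2f} MiB")
```

Under these assumptions the weights come to roughly 38 MB, so `model.safetensors` should be of about that size.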

Intended Use

This model is intended for English single-speaker text-to-speech generation with the included local inference script.

Out-of-Scope Use

Do not present this model as a voice cloning model or use it to impersonate any person. It has no voice-cloning capability.

Safetensors

Model size: 9.5M params
Tensor type: F32