Shona MMS TTS Fine-Tune

This is a Shona fine-tune of Meta’s MMS TTS model, based on the original facebook/mms-tts-sna checkpoint.

The model is publicly available here:

https://huggingface.co/manassehzw/sna-tts-v1

At about 36.3M parameters (F32 weights), it is lightweight enough to run locally on most machines, including CPU-only setups.

Requirements

You’ll need:

  • Python 3.10+
  • torch
  • transformers
  • soundfile
  • fastapi (only for the API example; installs pydantic as a dependency)
  • uvicorn (only for the API example)

Install dependencies:

pip install torch transformers soundfile fastapi uvicorn

Quick Start: Local Inference

Create a Python script called run_tts.py:

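import torch
import soundfile as sf
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "manassehzw/sna-tts-v1"

# Shona text to synthesize
text = "Mangwanani. Ndamuka zvakanaka nhasi."

# Download (or load from the local cache) the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = VitsModel.from_pretrained(MODEL_ID)

# Use a GPU when one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Tokenize the text and move the tensors to the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(device)

# Inference only, so skip gradient tracking
with torch.no_grad():
    output = model(**inputs).waveform

# Drop the batch dimension and convert to a NumPy array for soundfile
waveform = output.squeeze().cpu().numpy()

# Write the audio at the model's native sampling rate
sf.write("shona_tts.wav", waveform, model.config.sampling_rate)

print("Saved audio to shona_tts.wav")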

Run it:

python run_tts.py

This will generate:

shona_tts.wav
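
run_tts.py synthesizes one short string. For longer passages, one option is to split the text into sentences first and synthesize them one at a time, since the model is reported to work best with full sentences. The sketch below is only an illustration: `split_sentences` is a hypothetical helper, and the rule it uses (break after `.`, `!`, or `?` followed by whitespace) is an assumption about how the input is punctuated, not something specified by the model.

```python
import re

# Break after ., ! or ? when followed by whitespace. This punctuation rule
# is an assumption for illustration, not part of the model or its tokenizer.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences in `text`, with whitespace trimmed."""
    parts = _SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]
```

Each returned sentence could then be passed through the tokenizer and model exactly as in run_tts.py, and the resulting waveforms joined (for example with numpy.concatenate) before writing the file.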

Using the Model in a FastAPI Endpoint

You can also wrap the model in a small FastAPI service and expose it as an HTTP API.

Create a file called api.py:

import io

import torch
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "manassehzw/sna-tts-v1"

app = FastAPI(title="Shona MMS TTS API")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model once at startup, not inside each request
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = VitsModel.from_pretrained(MODEL_ID)
model = model.to(device)
model.eval()


class TTSRequest(BaseModel):
    # JSON body: {"text": "..."}
    text: str


@app.post("/tts")
def generate_speech(request: TTSRequest):
    # Tokenize the request text and move the tensors to the model's device
    inputs = tokenizer(request.text, return_tensors="pt").to(device)

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        waveform = model(**inputs).waveform

    # Drop the batch dimension and convert to a NumPy array for soundfile
    audio = waveform.squeeze().cpu().numpy()

    # Encode the audio as WAV in memory instead of writing to disk
    buffer = io.BytesIO()
    sf.write(buffer, audio, model.config.sampling_rate, format="WAV")
    buffer.seek(0)

    # Return the WAV bytes as a downloadable file
    return StreamingResponse(
        buffer,
        media_type="audio/wav",
        headers={
            "Content-Disposition": "attachment; filename=shona_tts.wav"
        },
    )

Run the API locally:

uvicorn api:app --host 0.0.0.0 --port 8000

Then send text to the API:

curl -X POST "http://localhost:8000/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Mangwanani. Ndamuka zvakanaka nhasi."}' \
  --output shona_tts.wav

This will save the generated speech as:

shona_tts.wav

You can also open the interactive FastAPI docs in your browser:

http://localhost:8000/docs
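
The curl request above can also be made from Python. The following is a minimal sketch using only the standard library, assuming the API is running locally on port 8000 as started above; `build_tts_request` and `synthesize_to_file` are hypothetical helper names chosen for this example.

```python
import json
import urllib.request

# Assumes the server was started with the uvicorn command shown earlier
API_URL = "http://localhost:8000/tts"


def build_tts_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying {"text": ...} as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str = "shona_tts.wav", url: str = API_URL) -> None:
    """Send `text` to the API and write the returned WAV bytes to `path`."""
    with urllib.request.urlopen(build_tts_request(text, url)) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
```

With the server running, calling synthesize_to_file("Mangwanani. Ndamuka zvakanaka nhasi.") should save the response to shona_tts.wav, the same as the curl command.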

Notes

  • Works best with full Shona sentences rather than short fragments.
  • CPU inference is supported, but GPU inference will be faster.
  • The model is lightweight, so short sentence generation should be quick locally.
  • Load the model once when your API starts, not inside every request.
  • For production use, put the API behind your normal authentication, rate limiting, and monitoring setup.
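
api.py follows the "load once" note by loading the model at import time. If you would rather delay loading until the first request, a lazy variant might look like the sketch below. `get_tts`, `load_from_hub`, and the `loader` parameter are hypothetical names introduced for this example; only the `from_pretrained` calls come from the scripts above.

```python
import threading

MODEL_ID = "manassehzw/sna-tts-v1"

_cache = {}
_lock = threading.Lock()


def load_from_hub(model_id: str):
    """Load the tokenizer and model; heavy imports are deferred to first use."""
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)
    model.eval()
    return tokenizer, model


def get_tts(model_id: str = MODEL_ID, loader=load_from_hub):
    """Return (tokenizer, model), calling `loader` only on the first request.

    `loader` is a hypothetical hook so the caching logic can be exercised
    without downloading the real model.
    """
    # The lock keeps two concurrent first requests from both loading the model
    with _lock:
        if model_id not in _cache:
            _cache[model_id] = loader(model_id)
        return _cache[model_id]
```

Inside a request handler you would call `tokenizer, model = get_tts()` and reuse the returned objects; the trade-off is that the first request pays the loading cost instead of application startup.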