Shona MMS TTS Fine-Tune

This is a Shona fine-tune of Meta’s MMS TTS model, based on the original facebook/mms-tts-sna checkpoint.

The model is publicly available here:

https://huggingface.co/manassehzw/sna-tts-v1

At 36.3M parameters, the model is lightweight and runs locally on most machines, including CPU-only setups.

Requirements

You’ll need:

Python 3.10+
torch
transformers
soundfile
fastapi (only needed for the API example)
uvicorn (only needed for the API example)

Install dependencies:

pip install torch transformers soundfile fastapi uvicorn

Quick Start: Local Inference

Create a Python script called run_tts.py:

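import torch
import soundfile as sf
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "manassehzw/sna-tts-v1"

text = "Mangwanani. Ndamuka zvakanaka nhasi."

# Download (or load from the local cache) the tokenizer and model weights.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = VitsModel.from_pretrained(MODEL_ID)

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Tokenize the text and move the input tensors to the model's device.
inputs = tokenizer(text, return_tensors="pt").to(device)

# Run inference without tracking gradients.
with torch.no_grad():
    output = model(**inputs).waveform

# Drop the batch dimension and convert to a NumPy array for soundfile.
waveform = output.squeeze().cpu().numpy()

# Write the audio at the model's own sampling rate.
sf.write("shona_tts.wav", waveform, model.config.sampling_rate)

print("Saved audio to shona_tts.wav")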

Run it:

python run_tts.py

This will generate:

shona_tts.wav
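
To sanity-check the generated file without extra dependencies, a small helper built on Python's standard `wave` module can report its sample rate and duration. This is a minimal sketch, not part of the model card: the `wav_info` name is hypothetical, and it assumes the file is a standard PCM WAV that the `wave` module can open.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, n_channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        n_channels = wf.getnchannels()
        # Duration is the number of frames divided by frames per second.
        duration = wf.getnframes() / sample_rate
    return sample_rate, n_channels, duration


# Example (after running run_tts.py):
#   rate, channels, seconds = wav_info("shona_tts.wav")
#   print(f"{rate} Hz, {channels} channel(s), {seconds:.2f} s")
```

The reported sample rate should match `model.config.sampling_rate`, since that is the value passed to `sf.write`.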

Using the Model in a FastAPI Endpoint

You can also wrap the model in a small FastAPI service and expose it as an HTTP API.

Create a file called api.py:

import io

import torch
import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoTokenizer, VitsModel

MODEL_ID = "manassehzw/sna-tts-v1"

app = FastAPI(title="Shona MMS TTS API")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and model once at import time so every request reuses them.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = VitsModel.from_pretrained(MODEL_ID)
model = model.to(device)
model.eval()


class TTSRequest(BaseModel):
    text: str


@app.post("/tts")
def generate_speech(request: TTSRequest):
    # Tokenize the request text and move the tensors to the model's device.
    inputs = tokenizer(request.text, return_tensors="pt").to(device)

    with torch.no_grad():
        waveform = model(**inputs).waveform

    # Drop the batch dimension and convert to a NumPy array for soundfile.
    audio = waveform.squeeze().cpu().numpy()

    # Encode the audio as WAV in memory instead of writing a file to disk.
    buffer = io.BytesIO()
    sf.write(buffer, audio, model.config.sampling_rate, format="WAV")
    buffer.seek(0)

    return StreamingResponse(
        buffer,
        media_type="audio/wav",
        headers={
            "Content-Disposition": "attachment; filename=shona_tts.wav"
        },
    )
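
The service above loads the tokenizer and model when the module is imported. If you would rather defer loading until the first request, the same load-once behaviour can be had with a small thread-safe wrapper. The sketch below is an illustration only: `LazyModel` is a hypothetical helper, not part of FastAPI or transformers, and the loader you pass in is assumed to be whatever function builds your tokenizer and model.

```python
import threading


class LazyModel:
    """Build an expensive object once, on first use, and reuse it afterwards."""

    def __init__(self, loader):
        self._loader = loader          # zero-argument callable that builds the object
        self._lock = threading.Lock()  # guards the first load across worker threads
        self._value = None
        self._loaded = False

    def get(self):
        # Fast path: already loaded, no locking needed.
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock so only one thread runs the loader.
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value
```

In an endpoint you would call `get()` on a module-level `LazyModel` instance at the start of each request; only the first call pays the loading cost.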

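Run the API locally. Note that --host 0.0.0.0 listens on all network interfaces; use --host 127.0.0.1 to accept connections from this machine only: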

uvicorn api:app --host 0.0.0.0 --port 8000

Then send text to the API:

curl -X POST "http://localhost:8000/tts" \
  -H "Content-Type: application/json" \
  -d '{"text": "Mangwanani. Ndamuka zvakanaka nhasi."}' \
  --output shona_tts.wav

This will save the generated speech as:

shona_tts.wav

You can also open the interactive FastAPI docs in your browser:

http://localhost:8000/docs
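
The curl call above can also be made from Python using only the standard library. This is a minimal sketch under the same assumptions as the curl example (the API is running on localhost port 8000 and exposes POST /tts); the function names `build_tts_request` and `save_speech` are hypothetical.

```python
import json
import urllib.request


def build_tts_request(text, base_url="http://localhost:8000"):
    """Build the JSON POST request that the /tts endpoint expects."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(text, out_path="shona_tts.wav", base_url="http://localhost:8000"):
    """Send text to the running API and write the returned WAV bytes to disk."""
    request = build_tts_request(text, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


# Example (with the API running):
#   save_speech("Mangwanani. Ndamuka zvakanaka nhasi.")
```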

Notes

The model works best with complete Shona sentences rather than short fragments.
CPU inference is supported, but GPU inference will be faster.
Because the model is small, short sentences should synthesize quickly on a local machine.
Load the model once when your API starts, not inside every request.
For production use, put the API behind your normal authentication, rate limiting, and monitoring setup.
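
One way to act on the first note for longer passages is to feed the model one complete sentence at a time and join the results. The sketch below assumes this helps; it is not taken from the model card. `split_sentences` and `synthesize_long` are hypothetical helpers, the punctuation-based split is deliberately naive, and `synth` stands for any function you supply that turns one sentence into a list of float samples.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long(text, synth, sample_rate, pause_seconds=0.25):
    """Synthesize each sentence separately and join them with short silences.

    synth: callable that takes one sentence and returns a list of float samples.
    """
    silence = [0.0] * int(sample_rate * pause_seconds)
    samples = []
    for i, sentence in enumerate(split_sentences(text)):
        if i > 0:
            samples.extend(silence)  # pause between consecutive sentences
        samples.extend(synth(sentence))
    return samples
```

With the model from the quick start, `synth` could wrap the tokenizer and model call and return `waveform.squeeze().tolist()`, with `sample_rate` set to `model.config.sampling_rate`.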

Model size: 36.3M params (Safetensors, tensor type F32)
