VITS Open Bible — Ewe

A multispeaker text-to-speech model for Ewe, trained from scratch on the Open Bible corpus using the VITS architecture (end-to-end TTS with adversarial learning, 22,050 Hz output) via the Coqui TTS framework.

Unlike zero-shot TTS models, this model cannot clone unseen voices: VITS is conditioned on speaker embeddings learned during training, so a speaker name from the training set must be supplied at inference time.

Files

File	Purpose
`model_last.pth`	Trained model weights.
`config.json`	Coqui TTS model configuration.
`speakers.pth`	Speaker ID → embedding mapping.

Intended use

Multispeaker TTS for Ewe using one of the training-set speaker voices.
Research on multilingual TTS, low-resource TTS evaluation, and listening studies on Open Bible–style read speech.

How to use

Install Coqui TTS:

pip install TTS

Download the checkpoint and run inference:

import torch
from huggingface_hub import hf_hub_download
from TTS.tts.utils.speakers import SpeakerManager
from TTS.utils.synthesizer import Synthesizer

repo_id  = "multilingual-tts/VITS-OpenBible-Ewe"
ckpt     = hf_hub_download(repo_id, "model_last.pth")
config   = hf_hub_download(repo_id, "config.json")
speakers = hf_hub_download(repo_id, "speakers.pth")

use_cuda = torch.cuda.is_available()
synthesizer = Synthesizer(
    tts_checkpoint=ckpt,
    tts_config_path=config,
    tts_speakers_file=speakers,
    use_cuda=use_cuda,
)

# Coqui's Synthesizer may not inject the speakers file into the model config
# automatically — restore the SpeakerManager manually when needed.
if synthesizer.tts_model.speaker_manager is None:
    synthesizer.tts_model.speaker_manager = SpeakerManager(
        speaker_id_file_path=speakers
    )

# List available speaker names
print(sorted(synthesizer.tts_model.speaker_manager.speaker_names))

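wav = synthesizer.tts(
    text="...",          # text to synthesise in Ewe
    speaker_name="...",  # one of the speaker names printed above
    split_sentences=True,
)

# Write the synthesised waveform to disk at the model's output sample rate
synthesizer.save_wav(wav, "output.wav")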
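
Because the model only accepts speaker names seen during training, it can help to validate the name before calling the synthesizer. The following is a minimal sketch, not part of Coqui TTS: `resolve_speaker` is a hypothetical helper, and the example names are placeholders standing in for the real list printed above.

```python
import difflib


def resolve_speaker(requested, available):
    """Return `requested` if it is a known speaker name, else raise with suggestions.

    Hypothetical helper for illustration; it is not a Coqui TTS function.
    """
    names = sorted(available)
    if requested in names:
        return requested
    # Suggest up to three similarly spelled names to make typos easy to spot.
    close = difflib.get_close_matches(requested, names, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(close)}?" if close else ""
    raise ValueError(f"Unknown speaker {requested!r}.{hint}")


# Placeholder names (assumption); the real ones come from the speaker manager.
example_names = ["speaker_a", "speaker_b"]
print(resolve_speaker("speaker_a", example_names))
```

In practice the second argument would be `synthesizer.tts_model.speaker_manager.speaker_names`, and the returned value would be passed as `speaker_name`.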

Training data

Source: davidguzmanr/open-bible-resources, config Ewe
Size: approximately 22,195 utterances
Speakers: multispeaker; speaker identity is fixed to one of the training-set voices and selected by name at inference time
Sample rate: 22,050 Hz

Training procedure

Architecture: VITS (Conditional Variational Autoencoder + adversarial training).
Tokenizer: grapheme-level, built from the training transcripts.
Optimizer: AdamW, learning rate 2e-4.
Training budget: 500,000 optimizer updates on 2 GPUs with mixed precision (bf16).
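
To illustrate what a grapheme-level tokenizer built from transcripts involves, here is a minimal sketch under stated assumptions. The function names, the `<pad>` symbol, and the NFC normalization step are illustrative choices, not the upstream implementation; the symbol set actually used by this checkpoint is whatever the released `config.json` defines.

```python
import unicodedata


def build_vocab(transcripts):
    """Collect every character seen in the transcripts into a symbol -> id map."""
    chars = set()
    for text in transcripts:
        # NFC normalization (an assumption) keeps composed and decomposed
        # forms of accented letters from becoming separate symbols.
        chars.update(unicodedata.normalize("NFC", text))
    symbols = ["<pad>"] + sorted(chars)  # id 0 reserved for padding (assumption)
    return {symbol: index for index, symbol in enumerate(symbols)}


def encode(text, vocab):
    """Map text to a list of ids, silently dropping characters not in the vocabulary."""
    normalized = unicodedata.normalize("NFC", text)
    return [vocab[ch] for ch in normalized if ch in vocab]


vocab = build_vocab(["akpe", "mawu"])
print(encode("akpe", vocab))
```

Dropping out-of-vocabulary characters is one possible policy; raising an error instead would surface unexpected input earlier.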

Audio preprocessing and training are reproducible via the upstream open-bible-models repo.

Evaluation

The model was evaluated alongside other Open Bible TTS systems on character and word error rate (computed on transcripts from Meta's Omnilingual ASR) and on UTMOSv2 naturalness scores. See the open-bible-models repository for the evaluation pipeline and the open-bible-surveys repository for the human-listening survey methodology.
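
For readers unfamiliar with the error-rate metrics, the sketch below shows how character and word error rates are conventionally computed from a reference text and an ASR transcript of the synthesised audio, using Levenshtein distance. It is an illustration only: the function names are hypothetical, no text normalization is applied, and the actual evaluation pipeline in the open-bible-models repository may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the reference length."""
    if not ref_units:
        raise ValueError("reference is empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character error rate."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, since insertions count as errors.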

Paper

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (arXiv:2106.06103, published Jun 11, 2021).