OmniVoice Yoruba Fine-tune

This model is a fine-tuned version of k2-fsa/OmniVoice on a Yoruba speech dataset.

Training Data

Total Duration: ~9.6 hours of Yoruba speech
Utterances: 6,989
Sampling Rate: 24 kHz
Data Sources:
- IroyinSpeech dataset
- Google Yoruba Speech Dataset (OpenSLR 86)
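
As a quick sanity check on these figures, the average utterance length follows directly from the total duration and the utterance count. The snippet below is only arithmetic on the numbers quoted above; the variable names are illustrative, and since the duration is approximate, so is the result (roughly 5 seconds per utterance).

```python
# Figures quoted in the Training Data section (the duration is approximate).
total_hours = 9.6
num_utterances = 6989

total_seconds = total_hours * 3600
avg_utterance_seconds = total_seconds / num_utterances

print(f"Total audio: {total_seconds:.0f} s")
print(f"Average utterance length: {avg_utterance_seconds:.2f} s")
```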

Evaluation

Intelligibility (WER): Generated speech was transcribed with a Yoruba fine-tuned Whisper model (ccibeekeoc42/whisper-small-yoruba-07-17) and compared against the input text, giving a normalized word error rate (WER) of ~11.5%. This indicates that the synthesized Yoruba is highly intelligible, which in turn suggests that tones are rendered accurately.
Speaker Similarity: Cosine similarity scores for zero-shot voice cloning range from 0.2 to 0.4. The speech is clear and accurate, but the cloned voice only partly matches the reference speaker; closer voice identity replication may require longer training or more diverse speaker data.
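
For readers who want to compute comparable numbers, the two metrics above can be sketched in plain Python. This is not the evaluation script used for this model: the normalization (lowercasing and stripping ASCII punctuation while keeping tone marks), the function names, and the use of plain lists as speaker embeddings are all assumptions made for illustration. The speaker-embedding model behind the reported scores is not stated, so embeddings are taken as given.

```python
import math
import string
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, split into words (tone marks are kept).

    Assumed normalization; the one used for the reported WER is not documented.
    """
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Here `reference` would be the text given to the TTS model and `hypothesis` the Whisper transcript of the generated audio. A corpus-level WER is usually computed as total errors over total reference words rather than as a mean of per-utterance scores.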

Credits

Special thanks to the creators of the IroyinSpeech corpus and the contributors/maintainers of the Google Yoruba Speech Dataset on OpenSLR.

Usage & Voice Selection (Zero-shot Voice Cloning)

Because this model is built on top of OmniVoice, the OmniVoice repository must be cloned and installed before use. The following Python snippet clones and installs it into /content/OmniVoice if that directory is missing (the path suits a Google Colab-style environment; change it elsewhere), then synthesizes Yoruba speech using a reference audio clip of your choice.

import os
import subprocess
import sys

# Where the OmniVoice repository lives (the /content prefix assumes a Colab-style environment)
omnivoice_path = '/content/OmniVoice'

# If the environment lacks OmniVoice, clone and install it automatically.
# check=True stops the script if git or pip fails instead of continuing silently.
if not os.path.exists(omnivoice_path):
    print("OmniVoice directory not found. Cloning and installing...")
    subprocess.run(['git', 'clone', 'https://github.com/k2-fsa/OmniVoice.git', omnivoice_path], check=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', '-e', omnivoice_path], check=True)

# Ensure the OmniVoice directory is in the Python path
if omnivoice_path not in sys.path:
    sys.path.append(omnivoice_path)

from omnivoice import OmniVoice
import soundfile as sf

# Load the fine-tuned model
print("Loading OmniVoice model from Sam4rano/omnivoice-yoruba-tts...")
model = OmniVoice.from_pretrained("Sam4rano/omnivoice-yoruba-tts")

# 1. Choose your reference audio (controls the speaker voice identity)
# Provide a path to a 5-10 second clip of the target voice (male or female)
ref_audio_path = "path/to/your_reference_audio.wav"
ref_text = "The exact text transcription spoken in the reference audio."

# 2. Synthesize Yoruba Speech
print("Synthesizing speech...")
audio = model.generate(
    text="Ẹ kú àárọ̀.",
    ref_audio=ref_audio_path,
    ref_text=ref_text
)

# 3. Save output
out_path = "output_test.wav"
sf.write(out_path, audio[0], 24000)  # 24000 Hz, matching the 24 kHz training data
print(f"✅ Synthesis complete! Saved to {out_path}")
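
Since the snippet above asks for a 5-10 second reference clip, it can help to check the clip length before synthesis. The following is a minimal sketch using only the standard-library `wave` module; the helper names `wav_duration_seconds` and `is_usable_reference` are hypothetical and not part of OmniVoice. It assumes an uncompressed PCM WAV file, which is the only kind `wave` can read.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference(path: str, min_seconds: float = 5.0, max_seconds: float = 10.0) -> bool:
    """True if the clip length falls inside the recommended 5-10 second range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

Calling `is_usable_reference(ref_audio_path)` before `model.generate(...)` gives an early warning when the clip is too short or too long, without loading the model.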
