𓋹 VoiceTut-TTS
An Open-Source Text-to-Speech Model for Egyptian Arabic & Code-Switching
VoiceTut-TTS is an Egyptian-Arabic text-to-speech model fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.
Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️
🔗 Links
- 🎧 Audio demo (VoiceTut vs. base OmniVoice): https://mohammedaly22.github.io/VoiceTuT-TTS/
- 🚀 Interactive Space: https://huggingface.co/spaces/mohammedaly22/VoiceTut-TTS
- 💻 GitHub (code, notebooks): https://github.com/MohammedAly22/VoiceTuT-TTS
- 📦 PyPI: https://pypi.org/project/voicetut-tts/
✨ Features
- 🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
- 🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
- 🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
- 🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
- 🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + diacritics & name dictionaries.
- ⚡ True streaming — long text is split into sentences and yielded as audio chunks.
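To give a feel for what a normalization step involves, here is a minimal sketch. It is not the package's actual pipeline: the function names and the `<time ...>` tag format are illustrative assumptions. It converts Eastern Arabic digits to Western digits and tags clock times such as 3:30 so that a later stage could verbalize them.

```python
import re

# Eastern Arabic digits (٠-٩) mapped to Western digits (0-9).
_DIGIT_MAP = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

# Clock times such as 3:30 or 12:05 (hours 0-23, minutes 00-59).
_TIME_RE = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")


def unify_digits(text: str) -> str:
    """Replace Eastern Arabic digits with their Western equivalents."""
    return text.translate(_DIGIT_MAP)


def tag_times(text: str) -> str:
    """Wrap clock times in a hypothetical <time h=.. m=..> tag."""
    def _repl(match: re.Match) -> str:
        hours, minutes = int(match.group(1)), int(match.group(2))
        return f"<time h={hours} m={minutes}>"

    return _TIME_RE.sub(_repl, unify_digits(text))
```

A real pipeline would go on to expand each tagged span into Egyptian-Arabic words; that step is omitted here because it depends on dictionaries this README does not describe.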
📦 Installation
```bash
# PyTorch matching your CUDA version (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# OmniVoice backbone (not on PyPI; install from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git

pip install voicetut-tts
```
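After installing, a quick check like the one below can confirm that the packages are importable before loading any model weights. This is only a sketch: `missing_packages` is a hypothetical helper, and the OmniVoice import name is not stated in this README, so it is left out of the check.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "torch" and "voicetut_tts" are the import names used in this README.
    missing = missing_packages(["torch", "voicetut_tts"])
    print("missing:", missing or "none")
```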
🚀 Usage
```python
from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
# Text: "How are you, how are you doing today?"
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
# Text: "The weather is very nice today"; ref_text: "This is my voice"
tts.synthesize(
    "النهارده الجو حلو اوي",
    ref_audio="my_voice.wav",
    ref_text="ده الصوت بتاعي",
    output="clone.wav",
)

# 3) Code-switching + generation params
# Text: "I have a meeting at 3:30 and I have the presentation with me"
tts.synthesize(
    "عندي meeting الساعة 3:30 ومعايا ال presentation",
    speaker="Asmaa",
    num_step=48,
    guidance_scale=2.5,
    speed=1.05,
    output="cs.wav",
)
```
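Since cloning works from a few seconds of reference audio, it can help to check the clip's length before passing it as `ref_audio`. The sketch below uses only the standard-library `wave` module. The 3 to 15 second bounds are an assumption for illustration, not documented limits of the model, and `check_reference_clip` is a hypothetical helper.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference_clip(path: str, min_s: float = 3.0, max_s: float = 15.0) -> float:
    """Raise ValueError if the clip falls outside the assumed duration bounds."""
    duration = wav_duration_seconds(path)
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"reference clip is {duration:.2f} s; expected {min_s}-{max_s} s"
        )
    return duration
```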
Streaming long text:
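```python
# `long_paragraph` is your input text; `play` stands in for your own
# audio playback function (it is not part of voicetut_tts).
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    play(chunk)  # plays each sentence as it's generated

# Alternatively, synthesize the full text into a single file.
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")
```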
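The streaming pattern (split long text into sentences, then yield one audio chunk per sentence) can be sketched as follows. This is an illustration under stated assumptions, not the internals of `tts.stream`: the splitting regex is a guess at reasonable sentence boundaries, and `synthesize` is any caller-supplied function that turns a sentence into audio samples.

```python
import re
from typing import Callable, Iterator, List, Sequence, Tuple

# Split on whitespace that follows sentence-ending punctuation,
# including the Arabic question mark. The punctuation stays attached.
_SENT_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, dropping empty fragments."""
    return [s.strip() for s in _SENT_END.split(text) if s.strip()]


def stream_sentences(
    text: str,
    synthesize: Callable[[str], Sequence[float]],
    sample_rate: int = 24000,
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (sample_rate, audio) for each sentence as soon as it is ready."""
    for sentence in split_sentences(text):
        yield sample_rate, synthesize(sentence)
```

Because the generator yields after each sentence, playback can begin once the first sentence is synthesized rather than after the whole paragraph. Requiring whitespace after the punctuation keeps tokens such as `3.30` intact.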
🗣️ Built-in Voices
| | Male | Female |
|---|---|---|
| Names | Abdelrahman, Abdullah, Kamal, Hossam, Mohamed, Omar, Sayed, Zaki, Aly | Asmaa, Esraa, Hanan, Sarah, Yasmin, Omnia |
Each voice ships with a reference clip and Arabic style tags (e.g. شبابي "youthful", حيوي "lively", هادي "calm"). Browse and listen in the Space.
📊 Performance
Measured on a single NVIDIA T4 (Colab), `float16`, `num_step=32`. Reproduce with `examples/04_evaluation.ipynb`.
| Metric | Value |
|---|---|
| Real-time factor (RTF, mean) | 1.13× |
| RTF (best) | 0.49× |
| Time-to-first-audio (streaming) | 1.68 s |
| Peak VRAM (fp16) | 2.93 GB |
| WER — Egyptian Arabic | 0.40 |
| WER — English | 0.07 |
| Speaker similarity (cloning, cosine) | 0.83 |
| Naturalness (UTMOS, 1–5) | 3.47 |
| Sampling rate | 24 kHz |
On an A100 or H100, expect markedly lower RTF and time-to-first-audio (TTFA).
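For readers unfamiliar with these metrics, the sketch below shows how three of them are commonly computed. It rests on assumptions: RTF is taken as synthesis time divided by audio duration (consistent with lower being better above), WER uses plain whitespace tokenization with no text normalization, and speaker similarity is the cosine between two embedding vectors. The evaluation notebook may differ in these details.

```python
import math
from typing import Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute per second of generated audio (lower is faster)."""
    return synthesis_seconds / audio_seconds


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Under this definition, a mean RTF of 1.13 would mean that 10 s of audio takes about 11.3 s to generate on the T4, slightly slower than real time, while the best case of 0.49 is roughly twice as fast as real time.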
🏗️ Training
- Base model: k2-fsa/OmniVoice (Qwen3-0.6B text backbone + Higgs audio tokenizer)
- Data: ~380 h Egyptian-Arabic YouTube podcasts (`language_id = arz`)
- Steps: 20,000 · LR: 3e-5 · bf16
⚠️ Responsible Use
Voice cloning is provided to enable beneficial use cases — voice assistants, accessibility, educational and creative content. Do not use it to impersonate real people, produce deceptive or misleading audio, or harm, harass, or defraud anyone. Always obtain consent before cloning a real person's voice, and disclose synthetic audio where appropriate.
📜 License & Citation
Apache-2.0.
```bibtex
@software{voicetut_tts_2026,
  author = {Mohammed Aly},
  title  = {VoiceTut-TTS: Egyptian Arabic \& Code-Switching Text-to-Speech},
  year   = {2026},
  url    = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note   = {Fine-tuned from OmniVoice}
}
```