𓋹 VoiceTut-TTS

An Open-Source Text-to-Speech Model for Egyptian Arabic & Code-Switching

VoiceTut-TTS is an Egyptian-Arabic text-to-speech model fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.

Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️

🔗 Links

🎧 Audio demo (VoiceTut vs. base OmniVoice): https://mohammedaly22.github.io/VoiceTuT-TTS/
🚀 Interactive Space: https://huggingface.co/spaces/mohammedaly22/VoiceTut-TTS
💻 GitHub (code, notebooks): https://github.com/MohammedAly22/VoiceTuT-TTS
📦 PyPI: https://pypi.org/project/voicetut-tts/

✨ Features

🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + diacritics & name dictionaries.
⚡ True streaming — long text is split into sentences and yielded as audio chunks.
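
To make the code-switching bullet concrete, here is a minimal sketch of how mixed input could be split into Arabic-script and Latin-script runs. It is an illustration only: `script_of` and `script_runs` are hypothetical helpers, not part of `voicetut_tts`, and the package's own pipeline may work quite differently.

```python
import re

# Hypothetical helpers for illustration; not part of voicetut_tts.
_ARABIC = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
_LATIN = re.compile(r"[A-Za-z]")

def script_of(token):
    """Return "ar" if the token contains any Arabic letter, else "en" if it
    contains a Latin letter, else "other" (digits, punctuation, symbols)."""
    if _ARABIC.search(token):
        return "ar"
    if _LATIN.search(token):
        return "en"
    return "other"

def script_runs(text):
    """Group consecutive whitespace-delimited tokens sharing a script
    into (script, text) runs."""
    runs = []
    for token in text.split():
        tag = script_of(token)
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1] + " " + token)
        else:
            runs.append((tag, token))
    return runs

# The example sentence from the feature list yields three runs: ar, en, ar.
runs = script_runs("عندي meeting بكرة")
```

A real normalizer would also have to decide what to do with the "other" tokens (for example the time `3:30` in the usage example), which this sketch simply passes through.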

📦 Installation

# PyTorch build matching your CUDA version; cu121 is an example (see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# OmniVoice backbone (not on PyPI — install from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
pip install voicetut-tts
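
After installing, a quick check like the following can confirm that the packages are visible to your interpreter before loading any weights. This is a small sketch using only the standard library; `missing_packages` is a hypothetical helper. It checks `torch` and `voicetut_tts` (the import name used in Usage). The OmniVoice import name is not stated here, so it is left out.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# "voicetut_tts" is the import name used in the Usage section.
missing = missing_packages(["torch", "voicetut_tts"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages found.")
```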

🚀 Usage

from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
# Text: "How are you, how are you doing today?"
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
# Text: "The weather is really nice today"; ref_text: "This is my voice"
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي", output="clone.wav")

# 3) Code-switching + generation params
# Text: "I have a meeting at 3:30 and I have the presentation with me"
tts.synthesize("عندي meeting الساعة 3:30 ومعايا ال presentation",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.05, output="cs.wav")

Streaming long text:

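# long_paragraph is your input text; play() is a placeholder for your own playback function
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    play(chunk)                    # plays each sentence as it's generated

# Or synthesize the whole text into a single file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")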
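
The idea behind sentence-level streaming can be sketched in a few lines. The code below is not the package's implementation: `split_sentences` and `stream_sentences` are hypothetical names, the splitter is deliberately naive (it will also split after abbreviations such as "Dr."), and `synthesize_one` stands in for whatever turns one sentence into audio.

```python
import re

# Split on whitespace that follows a sentence-ending mark:
# Latin . ! ? or the Arabic question mark (؟).
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")

def split_sentences(text):
    """Split text after sentence-ending punctuation, dropping empty pieces."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

def stream_sentences(text, synthesize_one):
    """Yield (sentence, audio) pairs lazily, one sentence at a time.

    `synthesize_one` is a hypothetical callable; a real model call
    would go in its place.
    """
    for sentence in split_sentences(text):
        yield sentence, synthesize_one(sentence)
```

Because `stream_sentences` is a generator, the first sentence's audio is available as soon as it has been synthesized, without waiting for the rest of the text. That is why time-to-first-audio can be much shorter than total synthesis time.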

🗣️ Built-in Voices

	Male	Female
Names	Abdelrahman, Abdullah, Kamal, Hossam, Mohamed, Omar, Sayed, Zaki, Aly	Asmaa, Esraa, Hanan, Sarah, Yasmin, Omnia

Each voice ships with a reference clip + Arabic style tags (e.g. شبابي "youthful", حيوي "lively", هادي "calm"). Browse and listen in the Space.
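
To compare the voices offline, one option is to render the same sentence with every built-in speaker. The sketch below relies only on the `synthesize(text, speaker=..., output=...)` call shown in Usage; `render_all_voices` itself is a hypothetical helper, not something the package provides.

```python
from pathlib import Path

# Voice names copied from the table above.
MALE = ["Abdelrahman", "Abdullah", "Kamal", "Hossam", "Mohamed",
        "Omar", "Sayed", "Zaki", "Aly"]
FEMALE = ["Asmaa", "Esraa", "Hanan", "Sarah", "Yasmin", "Omnia"]

def render_all_voices(tts, text, out_dir, voices=MALE + FEMALE):
    """Synthesize `text` once per voice and return the output file paths.

    `tts` is any object exposing synthesize(text, speaker=..., output=...).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in voices:
        path = out_dir / f"{name}.wav"
        tts.synthesize(text, speaker=name, output=str(path))
        paths.append(path)
    return paths
```

With a loaded model this would be called as `render_all_voices(tts, "ازيك عامل ايه النهاردة؟", "voice_samples")`, producing one WAV file per speaker.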

📊 Performance

Measured on a single NVIDIA T4 (Colab), float16, num_step=32. Reproduce with examples/04_evaluation.ipynb.

Metric	Value
Real-time factor (RTF, mean)	1.13×
RTF (best)	0.49×
Time-to-first-audio (TTFA, streaming)	1.68 s
Peak VRAM (fp16)	2.93 GB
WER — Egyptian Arabic	0.40
WER — English	0.07
Speaker similarity (cloning, cosine)	0.83
Naturalness (UTMOS, 1–5)	3.47
Sampling rate	24 kHz
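
The RTF rows are easier to read with the definition spelled out. Assuming the conventional definition (synthesis time divided by audio duration, so lower is faster, which matches the note below about faster GPUs), a mean RTF of 1.13 means 10 s of audio takes about 11.3 s to generate on the T4, and the best case of 0.49 takes about 4.9 s. The helper names below are illustrative only. The evaluation notebook is the reference for how the reported numbers were actually measured.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced.

    Assumes the conventional definition: values below 1.0 mean
    faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds

def synthesis_time(rtf, audio_seconds):
    """Expected wall-clock seconds needed to synthesize `audio_seconds` of audio."""
    return rtf * audio_seconds

mean_case = synthesis_time(1.13, 10.0)   # about 11.3 s for 10 s of audio
best_case = synthesis_time(0.49, 10.0)   # about 4.9 s for 10 s of audio
```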

On A100 / H100 expect markedly lower RTF and TTFA.
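
For readers less familiar with the WER rows in the table, the metric is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows the standard computation. It does not reproduce the evaluation setup (which ASR model transcribes the audio, or how text is normalized before scoring); those details live in the evaluation notebook.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution plus one deletion against five reference words.
example = word_error_rate("a b c d e", "a x c d")
```

In these terms, a WER of 0.40 corresponds to roughly two word errors per five reference words, and 0.07 to about seven per hundred.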

🏗️ Training

Base model: k2-fsa/OmniVoice (Qwen3-0.6B text backbone + Higgs audio tokenizer)
Data: ~380 h Egyptian-Arabic YouTube podcasts (language_id = arz)
Steps: 20,000 · LR: 3e-5 · bf16

⚠️ Responsible Use

Voice cloning is provided to enable beneficial use cases — voice assistants, accessibility, educational and creative content. Do not use it to impersonate real people, produce deceptive or misleading audio, or harm, harass, or defraud anyone. Always obtain consent before cloning a real person's voice, and disclose synthetic audio where appropriate.

📜 License & Citation

Apache-2.0.

@software{voicetut_tts_2026,
  author  = {Mohammed Aly},
  title   = {VoiceTut-TTS: Egyptian Arabic \& Code-Switching Text-to-Speech},
  year    = {2026},
  url     = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note    = {Fine-tuned from OmniVoice}
}