𓋹 VoiceTut-TTS

An Open-Source Text-to-Speech Model for Egyptian Arabic & Code-Switching

VoiceTut-TTS is an Egyptian-Arabic text-to-speech model fine-tuned from OmniVoice on ~380 hours of Egyptian podcast speech. It produces natural Egyptian speech with seamless Arabic ↔ English code-switching, ships 15 built-in studio voices, supports zero-shot voice cloning, and includes a robust Egyptian-Arabic text normalization pipeline plus true streaming for long text.

Why "VoiceTut"? Tut — after the boy-king Tutankhamun (توت عنخ آمون) — anchors the model in Egyptian identity, just as our companion ASR model QwenCleo-ASR is named after Cleopatra. Together they form an Egyptian speech stack: Cleo listens, Tut speaks. 🎙️🗣️

✨ Features

  • 🎯 Egyptian-first — fine-tuned specifically on Egyptian Arabic, not generic MSA.
  • 🔀 Code-switching — handles real Arabic + English mixed speech (عندي meeting بكرة).
  • 🗣️ 15 built-in voices — male & female studio speakers, each with style tags.
  • 🧬 Zero-shot cloning — clone any voice from a few seconds of reference audio.
  • 🔢 Robust normalization — numbers, dates, times, currencies, phones, emails, URLs, abbreviations + diacritics & name dictionaries.
  • True streaming — long text is split into sentences and yielded as audio chunks.
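
To make the code-switching and normalization bullets concrete, here is a minimal sketch of how mixed text might be segmented into Arabic, English, and numeric runs before each run is verbalized. It is an illustration only, not the package's actual pipeline: the names `classify` and `segment` are hypothetical, and the character ranges cover Arabic letters without diacritics or punctuation.

```python
import re
from itertools import groupby
from typing import List, Tuple

# Hypothetical helpers for illustration; not part of voicetut_tts.
_NUMERIC = re.compile(r"\d+(:\d+)?")                 # plain numbers and h:mm times
_ARABIC = re.compile(r"[\u0621-\u064A\u0671-\u06D3]")  # Arabic letters (basic + extended)
_LATIN = re.compile(r"[A-Za-z]")


def classify(token: str) -> str:
    """Label a whitespace-delimited token as 'num', 'ar', 'en', or 'other'."""
    if _NUMERIC.fullmatch(token):
        return "num"
    if _ARABIC.search(token):
        return "ar"
    if _LATIN.search(token):
        return "en"
    return "other"


def segment(text: str) -> List[Tuple[str, str]]:
    """Group consecutive tokens that share a label into (label, run) pairs."""
    tokens = text.split()
    return [(label, " ".join(run)) for label, run in groupby(tokens, key=classify)]


if __name__ == "__main__":
    for label, run in segment("عندي meeting الساعة 3:30 ومعايا ال presentation"):
        print(label, run)
```

A real normalizer would then expand each `num` run into spoken Egyptian Arabic and handle dates, currencies, URLs, and similar patterns, which this sketch does not attempt.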

📦 Installation

# 1) Install the PyTorch build that matches your CUDA version
#    (cu121 shown as an example; see https://pytorch.org)
pip install torch --index-url https://download.pytorch.org/whl/cu121
# 2) OmniVoice backbone (not on PyPI; install from GitHub)
pip install git+https://github.com/k2-fsa/OmniVoice.git
# 3) VoiceTut-TTS itself
pip install voicetut-tts
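
After installing, a quick check that the packages are importable can save a confusing traceback later. The sketch below uses only the standard library. The helper name `missing_modules` is hypothetical, and only `torch` and `voicetut_tts` are checked because the import name of the OmniVoice package is not stated here.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str] = ("torch", "voicetut_tts")) -> List[str]:
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required modules were found.")
```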

🚀 Usage

from voicetut_tts import VoiceTutTTS

tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")

# 1) Built-in speaker
tts.synthesize("ازيك عامل ايه النهاردة؟", speaker="Mohamed", output="out.wav")

# 2) Zero-shot voice cloning
tts.synthesize("النهارده الجو حلو اوي",
               ref_audio="my_voice.wav", ref_text="ده الصوت بتاعي", output="clone.wav")

# 3) Code-switching + generation params
tts.synthesize("عندي meeting الساعة 3:30 ومعايا ال presentation",
               speaker="Asmaa", num_step=48, guidance_scale=2.5, speed=1.05, output="cs.wav")

Streaming long text:

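# Stream: yields (sample_rate, chunk) pairs, one sentence at a time
for sr, chunk in tts.stream(long_paragraph, speaker="Sayed"):
    play(chunk)  # play() is a placeholder for your own audio sink

# Or synthesize the whole paragraph into a single file
tts.synthesize_long(long_paragraph, "long.wav", speaker="Sayed")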
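
The sentence-by-sentence behaviour described above can be pictured with a small generator. This is a sketch under stated assumptions, not the library's implementation: `split_sentences` and `stream_sentences` are hypothetical names, the splitting rule (break after `.`, `!`, `?`, or `؟`) is a guess, and `synth` stands in for whatever produces audio samples for one sentence.

```python
import re
from typing import Callable, Iterator, List, Sequence, Tuple

# Split after Latin or Arabic sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text: str) -> List[str]:
    """Break text into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def stream_sentences(
    text: str,
    synth: Callable[[str], Sequence[float]],
    sample_rate: int = 24000,
) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (sample_rate, chunk) per sentence, so playback can start early."""
    for sentence in split_sentences(text):
        yield sample_rate, synth(sentence)


if __name__ == "__main__":
    # Stand-in synthesizer: one silent sample per character.
    def fake_synth(sentence: str) -> List[float]:
        return [0.0] * len(sentence)

    for sr, chunk in stream_sentences("ازيك؟ عامل ايه. تمام!", fake_synth):
        print(sr, len(chunk))
```

Because each chunk is yielded as soon as its sentence is done, the first audio is available after one sentence rather than after the whole paragraph.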

🗣️ Built-in Voices

| Gender | Names |
| --- | --- |
| Male | Abdelrahman, Abdullah, Kamal, Hossam, Mohamed, Omar, Sayed, Zaki, Aly |
| Female | Asmaa, Esraa, Hanan, Sarah, Yasmin, Omnia |

Each voice ships with a reference clip + Arabic style tags (e.g. شبابي, حيوي, هادي). Browse and listen in the Space.
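
To compare the voices locally, one option is to render the same sentence with every built-in speaker. The sketch below relies only on the `synthesize(text, speaker=..., output=...)` call shown in the Usage section. The helper `render_all_voices`, the `VOICES` dictionary, and the output file naming are hypothetical.

```python
from pathlib import Path
from typing import List

# Speaker names as listed in the table above.
VOICES = {
    "male": ["Abdelrahman", "Abdullah", "Kamal", "Hossam", "Mohamed",
             "Omar", "Sayed", "Zaki", "Aly"],
    "female": ["Asmaa", "Esraa", "Hanan", "Sarah", "Yasmin", "Omnia"],
}


def render_all_voices(tts, text: str, out_dir: str = "voice_samples") -> List[Path]:
    """Synthesize `text` once per built-in speaker and return the output paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for gender, names in VOICES.items():
        for name in names:
            path = out / f"{gender}_{name}.wav"
            tts.synthesize(text, speaker=name, output=str(path))
            paths.append(path)
    return paths


# Example (requires the model to be installed and downloaded):
#   from voicetut_tts import VoiceTutTTS
#   tts = VoiceTutTTS.from_pretrained("mohammedaly22/VoiceTut-TTS")
#   render_all_voices(tts, "ازيك عامل ايه النهاردة؟")
```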

📊 Performance

Measured on a single NVIDIA T4 (Colab), float16, num_step=32. Reproduce with examples/04_evaluation.ipynb.
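
For readers who want to time their own hardware, the latency rows in the table below can be approximated from any stream of `(sample_rate, chunk)` pairs. This sketch assumes RTF means wall-clock synthesis time divided by audio duration (lower is better), which matches the note that faster GPUs give a lower RTF, but the evaluation notebook may define or measure it differently. `measure_stream` is a hypothetical helper, and `len(chunk)` is assumed to give the number of samples.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple


def measure_stream(
    chunks: Iterable[Tuple[int, Sequence[float]]],
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume a lazy audio stream and report TTFA, RTF, and audio duration."""
    start = clock()
    ttfa = None
    audio_seconds = 0.0
    for sample_rate, chunk in chunks:
        if ttfa is None:
            ttfa = clock() - start          # time until the first chunk arrived
        audio_seconds += len(chunk) / sample_rate
    total = clock() - start
    rtf = total / audio_seconds if audio_seconds else float("inf")
    return {"ttfa_s": ttfa, "rtf": rtf, "audio_s": audio_seconds}


# Example (requires the model):
#   stats = measure_stream(tts.stream(long_paragraph, speaker="Sayed"))
```

The stream must be lazy (a generator) for the numbers to be meaningful, since the timer starts before the first chunk is requested.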

| Metric | Value |
| --- | --- |
| Real-time factor (RTF, mean) | 1.13× |
| RTF (best) | 0.49× |
| Time-to-first-audio (streaming) | 1.68 s |
| Peak VRAM (fp16) | 2.93 GB |
| WER, Egyptian Arabic | 0.40 |
| WER, English | 0.07 |
| Speaker similarity (cloning, cosine) | 0.83 |
| Naturalness (UTMOS, 1–5) | 3.47 |
| Sampling rate | 24 kHz |

On A100 / H100, expect markedly lower RTF and time-to-first-audio (TTFA).
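
The WER and speaker-similarity rows in the table above rest on two standard formulas, sketched here in plain Python. Which ASR system transcribed the audio, how text was normalized before scoring, and which speaker encoder produced the embeddings are not stated here, so this shows only the arithmetic. The function names are hypothetical.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))        # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                         # deletion
                cur[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(word_error_rate("a b c d e", "a x c e"))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Under this definition a WER of 0.40 means that, on average, two edits are needed per five reference words.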

🏗️ Training

  • Base model: k2-fsa/OmniVoice (Qwen3-0.6B text backbone + Higgs audio tokenizer)
  • Data: ~380 h Egyptian-Arabic YouTube podcasts (language_id = arz)
  • Steps: 20,000 · LR: 3e-5 · bf16

⚠️ Responsible Use

Voice cloning is provided to enable beneficial use cases — voice assistants, accessibility, educational and creative content. Do not use it to impersonate real people, produce deceptive or misleading audio, or harm, harass, or defraud anyone. Always obtain consent before cloning a real person's voice, and disclose synthetic audio where appropriate.

📜 License & Citation

Apache-2.0.

@software{voicetut_tts_2026,
  author  = {Mohammed Aly},
  title   = {VoiceTut-TTS: Egyptian Arabic \& Code-Switching Text-to-Speech},
  year    = {2026},
  url     = {https://github.com/MohammedAly22/VoiceTuT-TTS},
  note    = {Fine-tuned from OmniVoice}
}