Leva-TTS: Low-Latency Code-Switching TTS (Levantine Arabic ↔ English)
A production-oriented Levantine text-to-speech model: a fine-tuned XTTS-v2 optimized for real-time conversational agents.
| KPI | Target | Measured | Status |
|---|---|---|---|
| Peak VRAM (inference) | ≤ 3 GB | 2.13 GB | ✅ |
| Time-to-First-Audio (p50) | < 300 ms | 565 ms | ⚠️ |
| Real-Time Factor (RTF) | < 0.3 | 0.21 | ✅ |
| Streaming output | required | chunked PCM + WebSocket | ✅ |
Leva-TTS is a text-to-speech model for Levantine Arabic / English code-switching, built by fine-tuning XTTS-v2 on 50,000 synthetic utterances generated with Lahgtna-OmniVoice v2. It handles natural intra-sentence switching between Levantine dialect and English, supports 10 built-in speakers and zero-shot voice cloning, and offers a streaming generator for low-latency conversational use.
- Base model: `coqui/XTTS-v2` (GPT autoregressive backbone + HiFi-GAN decoder)
- Languages: Levantine Arabic (`ar`), English (`en`), and code-switch mixes
- Sample rate: 24 kHz
- Speakers: Badr, Mohamed, Saad, Rami, Fadi (M) · Amina, Fatma, Lamyaa, Mona, Haneen (F)
Key Features
| Feature | Details |
|---|---|
| Natural code-switching | Intra-sentence Arabic ↔ English |
| Streaming output | Target: first audio chunk < 300 ms (measured ~565 ms) |
| Low VRAM | ≤ 3 GB at inference |
| Levantine dialect | ق→/ʔ/ (glottal stop), ج→/ʒ/, il- article, b- prefix |
| Smart text front-end | Partial diacritics on homographs + Levantine lexicon |
| 10 speakers | 5 male + 5 female, diverse Levantine accents |
| WebSocket streaming | FastAPI server with real-time chunked PCM |
| Pipecat ready | Drop-in TTSService for voice agents |
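As a rough illustration of what "chunked PCM" can mean in practice, the sketch below converts float samples to 16-bit little-endian PCM and slices the result into fixed-size byte chunks. The sample width, byte order, chunk size, and both helper names are assumptions made for this example; the card does not specify the server's wire format.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit little-endian PCM bytes.

    Out-of-range values are clipped. The 16-bit width is an assumption.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_bytes(pcm, chunk_size=4096):
    """Yield successive byte chunks; the last one may be shorter."""
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]


pcm = float_to_pcm16([0.0, 0.5, -1.0, 2.0])  # 2.0 is clipped to 1.0
chunks = list(chunk_bytes(pcm, chunk_size=4))
```

Each chunk could then be sent as one binary WebSocket message; joining the chunks in order reproduces the original byte stream.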
Quick start (pip)
conda create -n leva-tts python=3.10 -y && conda activate leva-tts
sudo apt-get install -y espeak-ng ffmpeg libsndfile1
# Install PyTorch first so pip locks a CUDA build matching your GPU driver.
# (torch >= 2.9 ships CUDA-13 wheels that fail on common CUDA-12.x drivers.)
pip install torch==2.3.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
pip install leva-tts
Leva-TTS uses the maintained `coqui-tts` fork (same `TTS`/XTTS modules); the unmaintained `TTS` package pins `numpy==1.22.0` and cannot resolve on modern Python. A plain `pip install leva-tts` resolves cleanly.
from leva_tts import LevaTTS, SPEAKERS
import soundfile as sf
tts = LevaTTS(device="cuda", preprocess_text=True, verbose=False)
# auto-downloads this checkpoint + the 10 reference speakers on first use
# 1) Built-in speaker (speaker must be one of SPEAKERS, else ValueError)
wav, sr = tts.synthesize("ูููููู ุฃูุง ุนู ุฃุดุชุบู ุนูู the project",
                         speaker="Badr", temperature=0.65)
sf.write("out.wav", wav, sr) # sr == 24000
# 2) Zero-shot voice cloning (your own 3 to 10 s clip)
wav, sr = tts.zero_shot_synthesize("ูุงููู the meeting ูุงูุช important ูุชูุฑ",
"my_voice.wav")
# 3) Streaming generators
for chunk in tts.stream("ุจูุฏููู ุฃุญูููู ุนู the new feature", speaker="Amina"):
    ...  # play / forward each chunk
for chunk in tts.zero_shot_stream("ููู ุนู ูุดุชุบู", "my_voice.wav"):
    ...
Generation parameters (optional, per-call on every method):
temperature, length_penalty, repetition_penalty, top_k, top_p, speed.
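When consuming one of the streaming generators, it can be useful to measure how long the first chunk takes to arrive, since that is what the time-to-first-audio KPI describes. The helper below is a minimal sketch that works with any iterator of chunks. `consume_stream` and `fake_stream` are hypothetical names for this example, and the stand-in generator yields dummy bytes because the card does not state the chunk type returned by `tts.stream`.

```python
import time


def consume_stream(chunks, clock=time.perf_counter):
    """Collect all chunks and return (chunks, seconds_until_first_chunk).

    The latency is None if the iterator yields nothing.
    """
    start = clock()
    first_latency = None
    collected = []
    for chunk in chunks:
        if first_latency is None:
            first_latency = clock() - start
        collected.append(chunk)
    return collected, first_latency


def fake_stream():
    # Stand-in for tts.stream(...): yields three dummy 4-byte chunks.
    for i in range(3):
        yield bytes([i]) * 4


collected, ttfa = consume_stream(fake_stream())
```

With a real model, `fake_stream()` would be replaced by a call such as `tts.stream(text, speaker="Amina")`, and each chunk could be played or forwarded inside the loop instead of collected.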
For the FastAPI streaming server, Pipecat integration, the Gradio demo, evaluation, and fine-tuning, clone the repo: https://github.com/MohammedAly22/Leva-TTS
Files in this repo
| File | Description |
|---|---|
| `best_model.pth` | Fine-tuned XTTS-v2 checkpoint (GPT + decoder) |
| `config.json` | XTTS-v2 config |
| `reference_audios/` | The 10 built-in speaker reference clips + `references.json` |
| `sample_wavs/` | Audio sample comparisons (Base XTTS-v2 vs Lahgtna v2 vs Leva-TTS) |
Manual download:
huggingface-cli download mohammedaly22/leva-tts
Audio samples: model comparison
Each sentence below was synthesized with three models, listed in order of progression: Base XTTS-v2 → Lahgtna v2 → Leva-TTS. The comparison clips are in `sample_wavs/`.
Code-switching (Levantine + English)
ูููููู ุฃูุง ุนู ุฃุดุชุบู ุนูู the new project ุงููู ุญููุชูู ุนูู โ Badr (M)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
ูุงููู the weather today ูุชูุฑ ุญูู ุจุฏู ุฃุทูุน ุจุฑุง โ Fatma (F)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
ุจูุฏููู ุฃุญูููู ุนู the meeting ุงููู ูุงู ู ูู ูุชูุฑ ุงูููู โ Mona (F)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
Pure Levantine Arabic
ูููู ุงูููู ุ ุฅูุช ุดู ุนู ุชุนู ู ููููููุ โ Badr (M)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
ูููููู ุฑุญ ุฃุฑูุญ ุนูู ุงูุจูุช ูุจูุฑุง ุจุฑุฌุน โ Amina (F)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
ุดู ุฑุฃูู ูุทูุน ูุชู ุดู ุดูู ุจุนุฏ ุงูุดุบู ุฅุฐุง ุงูุฌู ูุงู ู ููุญุ โ Rami (M)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
Pure English
Hello, how are you doing today? - Lamyaa (F)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
The project deadline is next Friday. - Mohamed (M)
Base XTTS-v2
Lahgtna v2 (Levantine fine-tune)
Leva-TTS (this model)
Evaluation
Speaker Mohamed · NVIDIA H100 · Whisper large-v3 ASR round-trip · UTMOS (reference-free MOS).
| Metric | Value |
|---|---|
| Peak VRAM (inference) | 2.13 GB |
| RTF p50 / p95 | 0.36 / 0.53 |
| TTFA p50 / p95 (batch) | 1194 / 1743 ms |
| TTFA streaming (first chunk) | ~565 ms |
| CER (mean) | 0.255 |
| WER (mean) | 0.496 |
| UTMOS | 3.13 / 5.0 |
| Category | CER ↓ | WER ↓ | UTMOS ↑ |
|---|---|---|---|
| Pure English | 0.144 | 0.190 | 3.35 |
| Pure Levantine Arabic | 0.236 | 0.544 | 2.97 |
| Code-Switching | 0.330 | 0.602 | 3.19 |
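CER and WER in an ASR round-trip are typically edit-distance ratios between the input text and the transcript of the synthesized audio. The sketch below shows that calculation in plain Python. It is an illustration only: the repo's evaluation script may apply text normalization (casing, punctuation, diacritics) that is not reproduced here, and the function names are chosen for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


example_wer = wer("see you next friday", "see you friday")  # one deleted word
example_cer = cer("abcd", "abxd")                           # one substitution
```

Because the denominator is the reference length, both ratios can exceed 1.0 when the transcript contains many insertions.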
An optimized inference path (TF32 + torch.compile on the GPT) lowers RTF p95 by ~6%, reduces TTFA, and slightly improves UTMOS (3.13 to 3.24). See the repo's scripts/evaluate.py --optimize.
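For reference, RTF is conventionally the synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows that arithmetic together with a nearest-rank percentile for p50/p95. The timings are made-up illustration values, not measurements from this model, and the nearest-rank method is an assumption; scripts/evaluate.py may compute percentiles differently.

```python
import math


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical (synthesis seconds, audio seconds) pairs for four utterances.
runs = [(0.5, 2.0), (0.75, 2.0), (1.0, 2.0), (0.25, 2.0)]
rtfs = [real_time_factor(s, a) for s, a in runs]
p50 = percentile(rtfs, 50)
p95 = percentile(rtfs, 95)
```

The same `percentile` helper applies to a list of per-utterance TTFA values in milliseconds.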
How it was built
- Text collection: 50K Levantine / code-switching / English sentences.
- Synthesis: audio generated with Lahgtna-OmniVoice v2 (`apc` language code).
- Data prep: 24 kHz audio, paired with a Levantine text front-end (number/date/currency verbalization, partial diacritics on homographs, dialect lexicon).
- Fine-tuning: XTTS-v2 GPT fine-tuned on the synthetic corpus.
A text front-end runs before synthesis (enabled via preprocess_text=True):
language-aware normalization of numbers, floats, dates, times, currency,
percentages, URLs, emails, phone numbers and codes, plus partial diacritics and a
Levantine lexicon.
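Language-aware normalization implies first deciding which parts of a code-switched sentence are Arabic and which are English. The sketch below shows one simple way to do that by grouping whitespace-separated tokens by script. It is not the front-end shipped with Leva-TTS: `split_by_script`, the `ar`/`en` labels, and the rule that any token without Arabic-script characters (including digits and punctuation) counts as `en` are all assumptions made for illustration.

```python
import re

# Any character from the basic Arabic Unicode block (letters and diacritics).
_ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def split_by_script(text):
    """Group tokens into consecutive (lang, text) runs, lang in {'ar', 'en'}."""
    segments = []
    for token in text.split():
        lang = "ar" if _ARABIC_CHAR.search(token) else "en"
        if segments and segments[-1][0] == lang:
            # Same language as the previous token: extend the current run.
            segments[-1] = (lang, segments[-1][1] + " " + token)
        else:
            segments.append((lang, token))
    return segments


# Illustrative code-switched sentence.
segments = split_by_script("والله the meeting كانت important كتير")
```

Each run could then be handed to a normalizer for its own language, for example to verbalize numbers or dates differently in Arabic and English.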
Limitations & intended use
- Optimized for Levantine dialect + English code-switching; other Arabic dialects (Egyptian, Gulf, MSA) are out of distribution.
- Trained on synthetic speech, so voices reflect the Lahgtna v2 generator.
- License CC-BY-NC-4.0 (inherited from XTTS-v2): research / non-commercial use.
Citation
@software{leva_tts_2026,
author = {Mohammed Aly},
title = {Leva-TTS: Low-Latency Code-Switching TTS for Levantine Arabic and English},
year = {2026},
url = {https://github.com/MohammedAly22/Leva-TTS}
}
Built on Coqui XTTS-v2 and Lahgtna-OmniVoice v2.