GitHub | Hugging Face | Cookbooks | Demo
3arab-TTS
An independent Arabic Text-to-Speech (TTS) model based on the Rectified Flow Diffusion Transformer (RF-DiT) architecture.
The acoustic model was trained entirely from scratch on Arabic speech data using random initialization, with independently developed training and inference pipelines.
⚠️ Experimental Release
This project is currently in an early experimental stage and should not yet be considered production-ready.
The current version was trained on approximately 400–500 hours of carefully filtered Arabic speech (SNR > 20 dB). Because large-scale open Arabic speech datasets are scarce, synthesis quality may still vary depending on:
- text length
- punctuation & formatting
- inference settings
- reference audio quality
- dialect variation
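For reference, the SNR > 20 dB threshold mentioned above corresponds to a speech-to-noise power ratio above 100. The source does not say how SNR was estimated for the training data, so the following is only a minimal sketch that assumes separate speech and noise segments are already available; the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average power of a sequence of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def passes_snr_filter(speech, noise, threshold_db=20.0):
    """Keep a clip only if its SNR exceeds the threshold (20 dB here)."""
    return snr_db(speech, noise) > threshold_db
```

In practice the noise segment would have to come from a voice activity detector or a similar estimator, which this sketch leaves out.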
The model was trained on text without diacritics, e.g., "هذا الكتاب هو رحلة نحو القلب" ("This book is a journey toward the heart").
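Since the model was trained on undiacritized text, it may help to strip diacritics from input before synthesis. The sketch below is an assumption about useful preprocessing, not part of the project's documented pipeline, and `strip_diacritics` is a hypothetical helper name.

```python
import re

# Arabic short vowels, tanwin, shadda and sukun (U+064B to U+0652), the
# remaining combining marks up to U+065F, and the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u065F\u0670]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving base letters untouched."""
    return _DIACRITICS.sub("", text)
```

For example, passing the fully vowelled word for "book" (`"\u0643\u0650\u062A\u064E\u0627\u0628\u064C"`) returns the four bare letters `"\u0643\u062A\u0627\u0628"`.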
Some artifacts, instability, repetition, or pronunciation mistakes may still occur during generation, especially on long or complex sentences.
Future versions will focus on:
- scaling training data
- improving stability
- enhancing pronunciation accuracy
- reducing audio artifacts
- improving expressive speech generation
Community Contributions Welcome
Contributions are highly appreciated, including:
- Arabic speech datasets
- training improvements
- inference optimizations
- bug fixes
- evaluation & testing
- documentation improvements
Technical Specifications & Requirements
| Specification | Value / Description |
|---|---|
| Total Parameters | ~553.4 Million |
| Core Architecture | model_dim: 1280, 12 Transformer layers, 20 attention heads, mlp_ratio: 2.875 |
| Latent Space | 32-dimensional continuous latent space via DACVAE |
| Sample Rate | 44,100 Hz |
| Current Training Data | ~400–500 hours of high-quality Arabic speech (SNR > 20 dB) |
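The per-head and feed-forward widths implied by the table can be derived with simple arithmetic. The sketch below uses a hypothetical `DiTConfig` container (not a class from the project) and assumes the usual conventions that `head_dim = model_dim / heads` and that the MLP hidden width is `model_dim * mlp_ratio`. The table does not break down how the ~553.4 million parameters are distributed, so no parameter count is attempted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTConfig:
    """Hypothetical container for the values listed in the table above."""
    model_dim: int = 1280
    num_layers: int = 12
    num_heads: int = 20
    mlp_ratio: float = 2.875
    latent_dim: int = 32

    @property
    def head_dim(self) -> int:
        # Each attention head receives an equal slice of the model width.
        assert self.model_dim % self.num_heads == 0
        return self.model_dim // self.num_heads

    @property
    def mlp_hidden_dim(self) -> int:
        # Assumed convention: hidden width = model_dim * mlp_ratio.
        return int(self.model_dim * self.mlp_ratio)


cfg = DiTConfig()
print(cfg.head_dim, cfg.mlp_hidden_dim)
```

Under these assumptions each head is 64-dimensional and the MLP hidden width is 3680.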
Project Overview
This project is a diffusion-based TTS system inspired by recent architectures:
- Echo-TTS
- Irodori-TTS
Audio samples: Example 1, Example 2, Example 3, Example 4, Example 5, Example 6.
Architecture
Instead of relying on the discrete audio tokens common in many TTS systems, this model generates continuous latent representations using DACVAE.
| Component | Description |
|---|---|
| RF-DiT | Diffusion transformer responsible for step-by-step generation of acoustic latent representations |
| DACVAE | Encodes/decodes audio into a high-fidelity continuous latent space |
| Arabic Text Encoder | Processes Arabic text representations (hidden_size: 768) |
| Continuous Latent Space | Preserves fine acoustic details and minimizes spectral distortion |
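The step-by-step generation performed by the RF-DiT can be illustrated with a plain Euler integrator over a velocity field, which is the standard way to sample from a rectified flow. This is a toy sketch under stated assumptions: time runs from 0 (noise) to 1 (data), and `toy_velocity` is a hypothetical stand-in for the trained transformer, which in the real model would also be conditioned on text and reference audio.

```python
import random

LATENT_DIM = 32  # matches the 32-dimensional latent space in the table


def toy_velocity(x, t, target):
    """Ideal straight-line velocity toward a fixed target: (target - x) / (1 - t)."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(velocity_fn, num_steps=40, dim=LATENT_DIM, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 (data)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always strictly below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.5] * LATENT_DIM
sample = euler_sample(lambda x, t: toy_velocity(x, t, target))
```

With this toy field the integrator lands on the target after the final step. With a learned field, more steps generally trade speed for accuracy, which is what a step-count setting controls.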
Continuous Latent Space
The system converts audio into compact continuous latent vectors (32-dim), which the diffusion model then learns to generate directly. This approach enables:
- Smoother temporal generation
- Reduced quantization artifacts
- Preservation of fine acoustic details (breathing, vocal characteristics, prosody)
- Improved stability for longer utterances
Style & Pitch Control
The RF-DiT architecture supports conditional style embedding, allowing control over:
- Speaker identity & pitch/timbre
- Speech rate & rhythm
- Expressive characteristics
(Based on inference settings and the provided reference audio)
Integrated watermarking: SilentCipher is integrated to apply robust, imperceptible audio watermarks directly to generated outputs, promoting responsible AI usage.
Roadmap & Upcoming Updates
| Feature | Planned Updates |
|---|---|
| Speakers | Expand support to a larger pool of male & female speakers |
| Training Data | Scale to ~1000–2000 hours of high-quality Arabic speech |
| Quality & Stability | Improve pronunciation accuracy & reduce spectral artifacts |
| Voice Cloning | Experimental support for zero-shot voice cloning (3–10 s reference) |
| Expressivity | Integration of fine-grained emotional & stylistic controls |
Usage
Installation
```bash
git clone https://github.com/sherif1313/3arab-TTS.git
cd 3arab-TTS
uv sync
# uv run python app.py
```
```python
import gradio as gr
from pathlib import Path
from datetime import datetime
from huggingface_hub import hf_hub_download
from arabic_tts.inference_runtime import RuntimeKey, SamplingRequest, get_cached_runtime, save_wav

CHECKPOINT_ID = "sherif1313/3arab-TTS-500M-v1"
CODEC_REPO = "sherif1313/DACVAE-Arabic-32dim"


def get_local_checkpoint(repo_id: str) -> str:
    """Download the checkpoint from the Hub, trying common file extensions."""
    for ext in [".pt", ".safetensors", ".bin", ".ckpt"]:
        try:
            return hf_hub_download(repo_id=repo_id, filename=f"checkpoint{ext}", cache_dir=".cache")
        except Exception:
            continue
    # Fall back to the conventional safetensors file name.
    return hf_hub_download(repo_id=repo_id, filename="model.safetensors", cache_dir=".cache")


CHECKPOINT_PATH = get_local_checkpoint(CHECKPOINT_ID)


def estimate_seconds(text: str) -> float:
    """Rough duration heuristic: 1.3 s per 10 characters, clamped to 3-20 s."""
    return max(3.0, min(20.0, len(text.strip()) / 10 * 1.3))


def generate(m_dev, m_prec, c_dev, c_prec, text, ref, steps, cands):
    if not text:
        return [], "⚠️ Enter the text first"
    key = RuntimeKey(
        checkpoint=CHECKPOINT_PATH,
        model_device=m_dev, model_precision=m_prec,
        codec_repo=CODEC_REPO, codec_device=c_dev, codec_precision=c_prec,
        compile_model=False, compile_dynamic=False,
    )
    runtime, _ = get_cached_runtime(key)
    secs = estimate_seconds(text)
    res = runtime.synthesize(SamplingRequest(
        text=text, ref_wav=ref, no_ref=ref is None, seconds=secs,
        num_steps=int(steps), num_candidates=int(cands), decode_mode="sequential",
    ))
    # Write every candidate to ./out with a timestamped file name.
    out = Path("out")
    out.mkdir(exist_ok=True)
    paths = []
    for i, audio in enumerate(res.audios):
        p = out / f"gen_{datetime.now().strftime('%H%M%S')}_{i}.wav"
        save_wav(p, audio, res.sample_rate)
        paths.append(str(p))
    return paths, f"✅ Generation complete | duration: {secs:.1f}s\n" + "\n".join(res.messages)


with gr.Blocks(title="Fast TTS") as app:
    gr.Markdown(f"### Arabic speech generator | Model: `{CHECKPOINT_ID}`")
    with gr.Row():
        with gr.Column(scale=1):
            txt = gr.Textbox(label="Text", placeholder="Write the text here...")
            ref = gr.Audio(label="Reference file", type="filepath")
            steps = gr.Slider(10, 80, value=40, step=1, label="Generation steps")
            cands = gr.Slider(1, 4, value=1, step=1, label="Number of candidates")
            d1 = gr.Dropdown(["cuda", "cpu"], value="cuda", label="Model device")
            d2 = gr.Dropdown(["bf16", "fp32"], value="fp32", label="Model precision")
            d3 = gr.Dropdown(["cuda", "cpu"], value="cuda", label="Codec device")
            d4 = gr.Dropdown(["bf16", "fp32"], value="fp32", label="Codec precision")
            btn = gr.Button("Generate", variant="primary")
        with gr.Column(scale=1):
            out_audio = gr.Files(label="Generated files")
            log = gr.Textbox(label="Log", lines=4)
    btn.click(generate, inputs=[d1, d2, d3, d4, txt, ref, steps, cands], outputs=[out_audio, log])

if __name__ == "__main__":
    app.launch(server_name="0.0.0.0", server_port=7860)
```
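The duration heuristic in the script above caps every request at 20 seconds, which corresponds to roughly 153 characters; anything longer is squeezed into the same budget. One possible workaround, sketched here as an assumption and not as a feature of the project, is to split long input at sentence punctuation and synthesize each chunk separately. `split_for_tts` is a hypothetical helper, and `estimate_seconds` is repeated only so the block runs on its own.

```python
import re

MAX_SECONDS = 20.0
# Inverse of the heuristic: 20 s / 1.3 s per 10 characters, about 153 characters.
CHARS_PER_CHUNK = int(MAX_SECONDS / 1.3 * 10)

# Split after Latin or Arabic sentence punctuation (U+061F question mark,
# U+060C comma, U+061B semicolon) followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?\u061F\u060C\u061B])\s+")


def estimate_seconds(text: str) -> float:
    """Same heuristic as the demo script: 1.3 s per 10 characters, clamped to 3-20 s."""
    return max(3.0, min(20.0, len(text.strip()) / 10 * 1.3))


def split_for_tts(text: str, max_chars: int = CHARS_PER_CHUNK) -> list:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than `max_chars` is kept whole by this sketch, so very long unpunctuated text would still exceed the budget.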
Acknowledgments
- Aratako/Irodori-TTS
- jordand/echo-tts-base
- LlamaForCausalLM
- facebook/dacvae-watermarked (audio latent encoder)
All model training, pipeline implementation, and acoustic weights were developed independently. No proprietary acoustic weights, private datasets, or closed-source pipelines were used during development.
License
Licensed under the Apache 2.0 License.