🌍 3arab-TTS

An independent Arabic Text-to-Speech (TTS) model based on the Rectified Flow Diffusion Transformer (RF-DiT) architecture.

The acoustic model was trained entirely from scratch on Arabic speech data using random initialization, with independently developed training and inference pipelines.

⚠️ Experimental Release

This project is currently in an early experimental stage and should not yet be considered production-ready.

The current version was trained on approximately 400–500 hours of carefully filtered Arabic speech (SNR > 20dB). Due to the limited availability of large-scale open Arabic speech datasets, synthesis quality may still vary depending on:

text length
punctuation & formatting
inference settings
reference audio quality
dialect variation
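
For reference, the `SNR > 20dB` threshold mentioned above means that signal power is more than 100 times the noise power. The sketch below shows the standard decibel formula and a filter built on it. How signal and noise power were estimated for this dataset is not documented here, so `snr_db`, `keep_clip`, and the power inputs are illustrative assumptions rather than the project's actual filtering code.

```python
import math

SNR_THRESHOLD_DB = 20.0  # threshold quoted in this model card


def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def keep_clip(signal_power: float, noise_power: float) -> bool:
    """Hypothetical filter: keep a clip only if its SNR exceeds the threshold."""
    return snr_db(signal_power, noise_power) > SNR_THRESHOLD_DB


print(snr_db(100.0, 1.0))     # power ratio of 100 sits exactly at the threshold
print(keep_clip(500.0, 1.0))  # ratio above 100 passes the filter
print(keep_clip(10.0, 1.0))   # ratio below 100 is rejected
```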

The model was trained without diacritics, e.g., "هذا الكتاب هو رحلة نحو القلب"
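
Because the training text carried no diacritics, it may help to strip them from input text before synthesis. The following is a minimal sketch and not part of the released pipeline: `strip_diacritics` is a hypothetical helper, and removing the tatweel (kashida) character along with the harakat is an assumption.

```python
import re

# Arabic harakat (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks so the input resembles the training text."""
    return _DIACRITICS.sub("", text)


print(strip_diacritics("هَذَا الْكِتَابُ"))
```

Letters, spaces, and punctuation are left untouched; only the combining marks and the elongation character are dropped.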

Some artifacts, instability, repetition, or pronunciation mistakes may still occur during generation, especially on long or complex sentences.

Future versions will focus on:

scaling training data
improving stability
enhancing pronunciation accuracy
reducing audio artifacts
improving expressive speech generation

🤝 Community Contributions Welcome

Contributions are highly appreciated, including:

Arabic speech datasets
training improvements
inference optimizations
bug fixes
evaluation & testing
documentation improvements

📊 Technical Specifications & Requirements

| Specification | Value / Description |
|---|---|
| Total Parameters | ~553.4 million |
| Core Architecture | `model_dim: 1280`, `12` Transformer layers, `20` attention heads, `mlp_ratio: 2.875` |
| Latent Space | 32-dimensional continuous latent space via `DACVAE` |
| Sample Rate | `44100 Hz` |
| Current Training Data | ~400–500 hours of high-quality Arabic speech (`SNR > 20dB`) |
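
The core architecture values above imply a few derived sizes. The sketch below computes them under the conventional definitions (per-head dimension = `model_dim / heads`, MLP hidden width = `model_dim * mlp_ratio`). Whether the implementation rounds or pads these widths is not stated in this card, so treat the results as an assumption. `DiTConfig` is a hypothetical container, not a class from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiTConfig:
    """Hypothetical container for the values listed in the specification table."""
    model_dim: int = 1280
    num_layers: int = 12
    num_heads: int = 20
    mlp_ratio: float = 2.875
    latent_dim: int = 32

    @property
    def head_dim(self) -> int:
        # Conventional multi-head attention splits model_dim evenly across heads.
        assert self.model_dim % self.num_heads == 0
        return self.model_dim // self.num_heads

    @property
    def mlp_hidden_dim(self) -> int:
        # Conventional meaning of an MLP expansion ratio.
        return int(self.model_dim * self.mlp_ratio)


cfg = DiTConfig()
print(cfg.head_dim)        # 64
print(cfg.mlp_hidden_dim)  # 3680
```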

📌 Project Overview

This project is a diffusion-based TTS system inspired by modern architectures:

Echo-TTS
Irodori-TTS

🏗️ Architecture

Instead of relying on the discrete audio tokens used by many TTS systems, this model generates continuous latent representations in the DACVAE latent space.

| Component | Description |
|---|---|
| RF-DiT | Diffusion transformer responsible for step-by-step generation of acoustic latent representations |
| DACVAE | Encodes/decodes audio into a high-fidelity continuous latent space |
| Arabic Text Encoder | Processes Arabic text representations (`hidden_size: 768`) |
| Continuous Latent Space | Preserves fine acoustic details and minimizes spectral distortion |
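
To illustrate the step-by-step generation performed by the RF-DiT component, the sketch below integrates a rectified-flow ODE with fixed Euler steps, moving a sample from noise (t = 0) towards data (t = 1). It is a toy under stated assumptions: the time direction, the `euler_sample` helper, and the closed-form `toy_velocity` field (standing in for the trained transformer) are all illustrative and not taken from the project's code.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the network. On the straight path
# x_t = (1 - t) * noise + t * target, the velocity is (target - x_t) / (1 - t).
target = [0.5] * 32  # 32 matches the latent dimension quoted in this card


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(32)]
sample = euler_sample(toy_velocity, noise, num_steps=40)
error = max(abs(s - g) for s, g in zip(sample, target))
print(error)  # close to zero: the sample has been transported onto the target
```

In a real system the velocity comes from the trained network conditioned on text and reference audio, and more steps generally trade speed for accuracy. The `num_steps` field of `SamplingRequest` in the usage section presumably controls the analogous quantity.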

🌊 Continuous Latent Space

The system converts audio into compact continuous latent vectors (32-dim), which the diffusion model then learns to generate directly. This approach enables:

✅ Smoother temporal generation
✅ Reduced quantization artifacts
✅ Preservation of fine acoustic details (breathing, vocal characteristics, prosody)
✅ Improved stability for longer utterances
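
The quantization point can be made concrete with a toy comparison. The sketch below snaps a latent vector to the nearest entry of a small codebook, as a discrete-token codec would, and measures the error that a continuous latent avoids by construction. The codebook, the 2-dimensional vectors, and the helper name are invented for illustration and are unrelated to DACVAE internals.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def nearest_code(x: Point, codebook: List[Point]) -> Point:
    """Return the codebook entry closest to x in Euclidean distance."""
    return min(codebook, key=lambda c: math.dist(x, c))


codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
latent = (0.3, 0.8)

quantized = nearest_code(latent, codebook)  # discrete path: snapped to a code
continuous = latent                         # continuous path: kept as-is

print(quantized, math.dist(latent, quantized))    # nonzero quantization error
print(continuous, math.dist(latent, continuous))  # zero error
```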

🎛️ Style & Pitch Control

The `RF-DiT` architecture supports conditional style embedding, allowing control over:

Speaker identity & pitch/timbre
Speech rate & rhythm
Expressive characteristics

(Based on inference settings and the provided reference audio)

Integrated Watermarking: SilentCipher is integrated to apply robust, inaudible watermarks directly to generated audio, promoting responsible AI usage.

🚀 Roadmap & Upcoming Updates

| Feature | Planned Updates |
|---|---|
| Speakers | Expand support to a larger pool of male & female speakers |
| Training Data | Scale to ~1000–2000 hours of high-quality Arabic speech |
| Quality & Stability | Improve pronunciation accuracy & reduce spectral artifacts |
| Voice Cloning | Experimental support for Zero-Shot Voice Cloning (3–10s reference) |
| Expressivity | Integration of fine-grained emotional & stylistic controls |

🚀 Usage

Installation

git clone https://github.com/sherif1313/3arab-TTS.git
cd 3arab-TTS
uv sync


# app.py: launch the demo with `uv run python app.py`
import gradio as gr
from pathlib import Path
from datetime import datetime
from huggingface_hub import hf_hub_download  
from arabic_tts.inference_runtime import RuntimeKey, SamplingRequest, get_cached_runtime, save_wav

CHECKPOINT_ID = "sherif1313/3arab-TTS-500M-v1" 
CODEC_REPO = "sherif1313/DACVAE-Arabic-32dim"

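def get_local_checkpoint(repo_id: str) -> str:
    """Download the checkpoint from the Hub and return its local path."""
    # Try the common checkpoint file extensions in order.
    for ext in [".pt", ".safetensors", ".bin", ".ckpt"]:
        try:
            return hf_hub_download(repo_id=repo_id, filename=f"checkpoint{ext}", cache_dir=".cache")
        except Exception:
            # No file under this name (or the download failed); try the next extension.
            continue

    # Fall back to the default safetensors filename.
    return hf_hub_download(repo_id=repo_id, filename="model.safetensors", cache_dir=".cache")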

CHECKPOINT_PATH = get_local_checkpoint(CHECKPOINT_ID) 

def estimate_seconds(text: str) -> float:
    # Rough duration heuristic: about 0.13 s per character, clamped to the 3-20 s range.
    return max(3.0, min(20.0, len(text.strip()) / 10 * 1.3))

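def generate(m_dev, m_prec, c_dev, c_prec, text, ref, steps, cands):
    """Synthesize `text` and return (list of WAV paths, log message)."""
    # Reject empty or whitespace-only input (message: "Enter the text first").
    if not text or not text.strip():
        return [], "⚠️ أدخل النص أولاً"

    # Key describing which checkpoint, codec, devices, and precisions to load.
    key = RuntimeKey(
        checkpoint=CHECKPOINT_PATH,
        model_device=m_dev, model_precision=m_prec,
        codec_repo=CODEC_REPO, codec_device=c_dev, codec_precision=c_prec,
        compile_model=False, compile_dynamic=False
    )

    runtime, _ = get_cached_runtime(key)
    secs = estimate_seconds(text)

    # Without a reference clip, request reference-free synthesis.
    res = runtime.synthesize(SamplingRequest(
        text=text, ref_wav=ref, no_ref=ref is None, seconds=secs,
        num_steps=int(steps), num_candidates=int(cands), decode_mode="sequential"
    ))

    # Save each candidate as a timestamped WAV file under ./out.
    out = Path("out")
    out.mkdir(exist_ok=True)
    paths = []
    for i, audio in enumerate(res.audios):
        p = out / f"gen_{datetime.now().strftime('%H%M%S')}_{i}.wav"
        save_wav(p, audio, res.sample_rate)
        paths.append(str(p))

    # Log message: "Generation done | duration: <secs> s", followed by runtime messages.
    return paths, f"✅ تم التوليد | مدة: {secs:.1f}ث\n" + "\n".join(res.messages)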

# Gradio UI. Labels are in Arabic; English glosses are given in the comments.
with gr.Blocks(title="TTS سريع") as app:  # "Fast TTS"
    gr.Markdown(f"### 🎙️ مولد صوت عربي | النموذج: `{CHECKPOINT_ID}`")  # "Arabic speech generator | Model"
    with gr.Row():
        with gr.Column(scale=1):
            txt = gr.Textbox(label="النص", placeholder="اكتب النص هنا...")  # "Text", "Type the text here..."
            ref = gr.Audio(label="ملف مرجعي", type="filepath")  # "Reference file"
            steps = gr.Slider(10, 80, value=40, step=1, label="خطوات التوليد")  # "Generation steps"
            cands = gr.Slider(1, 4, value=1, step=1, label="عدد المرشحين")  # "Number of candidates"
            d1 = gr.Dropdown(["cuda","cpu"], value="cuda", label="جهاز النموذج")  # "Model device"
            d2 = gr.Dropdown(["bf16","fp32"], value="fp32", label="دقة النموذج")  # "Model precision"
            d3 = gr.Dropdown(["cuda","cpu"], value="cuda", label="جهاز الكوديك")  # "Codec device"
            d4 = gr.Dropdown(["bf16","fp32"], value="fp32", label="دقة الكوديك")  # "Codec precision"
            btn = gr.Button("🔊 توليد", variant="primary")  # "Generate"
        with gr.Column(scale=1):
            out_audio = gr.Files(label="الملفات المولدة")  # "Generated files"
            log = gr.Textbox(label="السجل", lines=4)  # "Log"
    # The input order must match the parameter order of generate().
    btn.click(generate, inputs=[d1,d2,d3,d4,txt,ref,steps,cands], outputs=[out_audio, log])

if __name__ == "__main__":
    app.launch(server_name="0.0.0.0", server_port=7860)
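
Since `estimate_seconds` in the demo above caps the requested duration at 20 seconds, inputs longer than roughly 150 characters (20 / 0.13 ≈ 154) receive the same duration budget as shorter ones. One possible workaround is to split long text at punctuation and synthesize each chunk separately. The sketch below has not been evaluated with the model; `split_text` and the 150-character budget are assumptions, not part of the repository.

```python
import re
from typing import List


def estimate_seconds(text: str) -> float:
    # Same heuristic as in the demo: about 0.13 s per character, clamped to 3-20 s.
    return max(3.0, min(20.0, len(text.strip()) / 10 * 1.3))


def split_text(text: str, max_chars: int = 150) -> List[str]:
    """Split after Arabic or Latin punctuation, then pack the pieces greedily
    into chunks of at most max_chars characters. A single piece longer than
    max_chars is kept whole."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?،؛؟])\s+", text) if p.strip()]
    chunks: List[str] = []
    for piece in pieces:
        if chunks and len(chunks[-1]) + 1 + len(piece) <= max_chars:
            chunks[-1] = chunks[-1] + " " + piece
        else:
            chunks.append(piece)
    return chunks


long_text = "هذا الكتاب هو رحلة نحو القلب. " * 8
for chunk in split_text(long_text):
    # Each chunk stays under the cap, so its estimated duration is not clamped.
    print(len(chunk), round(estimate_seconds(chunk), 1))
```

Each chunk could then be passed to `generate` (or directly to `runtime.synthesize`) and the resulting audio concatenated. Joins between chunks may be audible, so splitting at sentence boundaries is likely safer than splitting mid-clause.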

🙏 Acknowledgments

Aratako/Irodori-TTS
jordand/echo-tts-base
LlamaForCausalLM
facebook/dacvae-watermarked (Audio latent encoder)

All model training, pipeline implementation, and acoustic weights were developed independently. No proprietary acoustic weights, private datasets, or closed-source pipelines were used during development.

📜 License

Licensed under the Apache 2.0 License.
