💜 Github   |   🤗 Hugging Face   |   📚 Cookbooks  
🖥️ Demo  

๐ŸŒ 3arab-TTS

An independent Arabic Text-to-Speech (TTS) model based on the Rectified Flow Diffusion Transformer (RF-DiT) architecture.

The acoustic model was trained entirely from scratch on Arabic speech data using random initialization, with independently developed training and inference pipelines.

โš ๏ธ What's New

Current Version: v2

  • ~553M parameters
  • ~700 hours of Arabic speech
  • 48 kHz audio generation
  • DACVAE latent codec
  • RF-DiT acoustic model

Due to the limited availability of large-scale open Arabic speech datasets, a significant portion of the training data was collected from publicly available Arabic content and carefully filtered for quality.

The current release does not include integrated audio watermarking. Support for optional SilentCipher watermarking may be added in future inference releases without affecting audio quality.

The current release suggests that open-source Arabic TTS systems can reach a level of quality and naturalness comparable to many production-grade solutions. With approximately 700 hours of carefully curated Arabic speech and a ~553M-parameter RF-DiT architecture, 3arab-TTS aims to provide a strong baseline for next-generation Arabic speech synthesis.

Future versions will focus on improving expressive speech generation.

๐Ÿค Community Contributions Welcome

Contributions are highly appreciated, including:

Arabic speech datasets
training improvements
inference optimizations
bug fixes
evaluation & testing
documentation improvements

📊 Technical Specifications & Requirements

| Specification | Value / Description |
|---|---|
| Total Parameters | ~553.4 million |
| Core Architecture | model_dim: 1280, 12 Transformer layers, 20 attention heads, mlp_ratio: 2.875 |
| Latent Space | 32-dimensional continuous latent space via DACVAE |
| Sample Rate | 48 kHz |
| Current Training Data | ~700 hours of carefully filtered Arabic speech |

Evaluation

The model has been evaluated primarily through human listening tests and qualitative assessment.

Current strengths:

  • Natural Arabic pronunciation

  • Good speaker consistency

  • High-fidelity DACVAE latent reconstruction

  • Strong performance on Modern Standard Arabic

Current limitations:

  • Limited expressive and emotional control

  • Arabic numerals are reliably supported for simple numbers (e.g., 1–10). Larger numbers may produce better results when written in words rather than digits. For example, writing "ثمانمائة وخمسة وأربعون" may yield more accurate pronunciation than "845".

  • Colloquial Arabic dialects are partially supported. The model can often pronounce dialectal text, but quality, pronunciation accuracy, and naturalness may be noticeably lower than for Modern Standard Arabic (MSA).

  • Some Arabic words may require explicit diacritics (Tashkeel) to ensure correct pronunciation, especially when multiple readings are possible. For example:

    • Without diacritics: "ويستغفرون للذين أمنوا"
    • Preferred: "ويستغفرون للذين آمَنُوا"
  • Pronunciation quality may vary for rare words, names, religious texts, and highly ambiguous unvocalized Arabic sentences.

  • For best results, remove unnecessary symbols such as "-", "_", ",", ".", and other non-linguistic characters from the input text.

  • Generated speech quality is influenced by both input text and inference configuration. Achieving the best results may require careful tuning of generation parameters, including Seed, Number of Steps, guidance strength, and reference conditioning settings. Different voices, speaking styles, and text types may benefit from different parameter configurations.
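The text-related limitations above can be partly handled before synthesis. The following is a minimal preprocessing sketch, not part of the 3arab-TTS code base: the function names are hypothetical, the symbol set is only the four characters named in the list, and the threshold of 10 for "simple" numbers is taken from the example range given above.

```python
import re

# Symbols the limitations list names as unhelpful: "-", "_", ",", "."
_SYMBOLS = re.compile(r"[-_,.]")
_WHITESPACE = re.compile(r"\s+")
# Arabic diacritic marks (tashkeel): fathatan (U+064B) through sukun (U+0652).
_TASHKEEL = re.compile(r"[\u064B-\u0652]")
# In Python 3, \d also matches Arabic-Indic digits, and int() can parse them.
_DIGITS = re.compile(r"\d+")


def clean_text(text: str) -> str:
    """Replace the listed symbols with spaces and collapse whitespace.

    Note: this also splits decimals such as "3.5"; spell those out first.
    """
    return _WHITESPACE.sub(" ", _SYMBOLS.sub(" ", text)).strip()


def numbers_to_spell_out(text: str, max_simple: int = 10) -> list[str]:
    """Return digit runs above max_simple, which may read better as words."""
    return [run for run in _DIGITS.findall(text) if int(run) > max_simple]


def has_tashkeel(text: str) -> bool:
    """Report whether the text carries any explicit diacritics."""
    return _TASHKEEL.search(text) is not None


sample = "السعر 845 - تقريبا."
print(clean_text(sample))
print(numbers_to_spell_out(sample))  # ['845']
print(has_tashkeel(sample))          # False
```

A check like `numbers_to_spell_out` only flags candidates; converting them to Arabic words and adding diacritics to ambiguous words is still left to the user or to a separate normalization tool.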

📌 Project Overview

This project is a diffusion-based Arabic Text-to-Speech system that builds on ideas from modern latent-space speech synthesis architectures. It is inspired by:

  • Echo-TTS
  • Irodori-TTS

๐Ÿ—๏ธ Architecture

Instead of relying on discrete audio tokens common in traditional TTS systems, this model generates Continuous Latent Representations using DACVAE.

| Component | Description |
|---|---|
| RF-DiT | 12-layer diffusion transformer with 20 attention heads |
| DACVAE | Continuous latent audio codec (32-dim latent space) |
| Arabic Text Encoder | 512-dimensional transformer encoder (10 layers, 8 heads) |
| Reference Speaker Encoder | 768-dimensional transformer encoder (8 layers, 12 heads) |
| Continuous Latent Space | Preserves fine acoustic details and minimizes spectral distortion |
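As a quick sanity check on these numbers, the sketch below derives the per-head width of each transformer and the feed-forward width implied by the RF-DiT's `mlp_ratio` of 2.875 from the specifications. The `TransformerSpec` class and its field names are illustrative assumptions, not names from the 3arab-TTS code base, and the usual convention that the MLP hidden width equals `model_dim * mlp_ratio` is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerSpec:
    """Hypothetical container for the dimensions listed in the table."""
    name: str
    model_dim: int
    num_layers: int
    num_heads: int

    @property
    def head_dim(self) -> int:
        # Each attention head gets an equal slice of the model width.
        if self.model_dim % self.num_heads != 0:
            raise ValueError(f"{self.name}: model_dim not divisible by num_heads")
        return self.model_dim // self.num_heads


RF_DIT = TransformerSpec("RF-DiT", model_dim=1280, num_layers=12, num_heads=20)
TEXT_ENCODER = TransformerSpec("Arabic Text Encoder", model_dim=512, num_layers=10, num_heads=8)
SPEAKER_ENCODER = TransformerSpec("Reference Speaker Encoder", model_dim=768, num_layers=8, num_heads=12)

# Assumed convention: hidden width = model_dim * mlp_ratio (1280 * 2.875).
RF_DIT_MLP_HIDDEN = int(RF_DIT.model_dim * 2.875)

for spec in (RF_DIT, TEXT_ENCODER, SPEAKER_ENCODER):
    print(f"{spec.name}: head_dim={spec.head_dim}")
print(f"RF-DiT MLP hidden width: {RF_DIT_MLP_HIDDEN}")
```

Under these assumptions all three transformers work out to 64-dimensional attention heads, and the RF-DiT feed-forward width is 3680.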

🌊 Continuous Latent Space

The system converts audio into compact continuous latent vectors (32-dim), which the diffusion model then learns to generate directly. This approach enables:

  • ✅ Smoother temporal generation
  • ✅ Reduced quantization artifacts
  • ✅ Preservation of fine acoustic details (breathing, vocal characteristics, prosody)
  • ✅ Improved stability for longer utterances
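To make the generation step concrete, the sketch below integrates a rectified-flow velocity field with fixed-step Euler updates, moving a 32-dimensional vector from Gaussian noise at t = 0 to a target at t = 1. It is a toy under stated assumptions: `toy_velocity` stands in for the trained RF-DiT and simply points at a known target, the time convention (noise at 0, data at 1) is one common choice and may differ from the project's, and `euler_sample`, `seed`, and `num_steps` are illustrative names rather than the repository's API.

```python
import random
from typing import Callable, List

Vector = List[float]
LATENT_DIM = 32  # matches the 32-dim DACVAE latent width


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
    seed: int,
    dim: int = LATENT_DIM,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)  # the seed fixes the starting noise
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays strictly below 1 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the trained network. On the straight path
# x_t = (1 - t) * noise + t * target, the velocity is (target - x_t) / (1 - t).
TARGET: Vector = [0.5] * LATENT_DIM


def toy_velocity(x: Vector, t: float) -> Vector:
    return [(target_i - xi) / (1.0 - t) for target_i, xi in zip(TARGET, x)]


latent = euler_sample(toy_velocity, num_steps=8, seed=0)
print(max(abs(value - 0.5) for value in latent))  # close to zero
```

Because the toy path is a straight line, Euler integration lands on the target for any step count. A learned velocity field is only approximately straight, so with a real model the number of steps trades speed against integration accuracy, and the seed selects the starting noise.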

๐ŸŽ›๏ธ Style & Pitch Control

The RF-DiT architecture supports conditional style embedding, allowing control over:

  • Speaker identity & pitch/timbre
  • Speech rate & rhythm
  • Expressive characteristics
    (depending on inference settings)
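One inference setting that affects how strongly the output follows its conditioning is the guidance strength. In diffusion and flow models this is commonly implemented as classifier-free guidance, which blends a conditional and an unconditional prediction. Whether 3arab-TTS uses exactly this formulation is an assumption, and the function and argument names below are hypothetical.

```python
from typing import List, Sequence


def apply_guidance(
    v_cond: Sequence[float],
    v_uncond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Classifier-free guidance: v = v_uncond + scale * (v_cond - v_uncond)."""
    if len(v_cond) != len(v_uncond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


# scale = 0 ignores the conditioning, scale = 1 returns the conditional
# prediction unchanged, and scale > 1 extrapolates past it.
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(apply_guidance(cond, uncond, 0.0))  # [0.0, 1.0]
print(apply_guidance(cond, uncond, 1.0))  # [1.0, 2.0]
print(apply_guidance(cond, uncond, 3.0))  # [3.0, 4.0]
```

Higher scales usually tighten adherence to the text or reference voice at some cost in naturalness, which is one reason different voices and text types may need different values.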

🚀 Roadmap & Upcoming Updates

| Feature | Planned Updates |
|---|---|
| Speakers | Expand support to a larger pool of male & female speakers |
| Training Data | Scale to ~1000–2000 hours of high-quality Arabic speech |
| Quality & Stability | Improve pronunciation accuracy & reduce spectral artifacts |
| Voice Cloning | Experimental research toward zero-shot voice adaptation using 3–12 second reference audio |
| Expressivity | Integration of fine-grained emotional & stylistic controls |

🎧 Audio Samples

Audio demonstrations are available on the Hugging Face model page.

The samples include:

  • Male and female voices
  • Modern Standard Arabic
  • Long-form synthesis examples
  • Reference-conditioned generation

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

https://github.com/sherif1313/3arab-TTS

Installation

    git clone https://github.com/sherif1313/3arab-TTS.git
    cd 3arab-TTS
    uv sync

Training

  • Architecture: RF-DiT
  • Parameters: ~553M
  • Latent Codec: DACVAE (32-dim)
  • Training Data: ~700 hours of Arabic speech
  • Sample Rate: 48 kHz
  • Text Encoder: 512-dim, 10 layers, 8 heads
  • Speaker Encoder: 768-dim, 8 layers, 12 heads
  • Random Initialization
  • Fully trained from scratch

๐Ÿ™ Acknowledgments

Aratako/Irodori-TTS
jordand/echo-tts-base
LlamaForCausalLM
facebook/dacvae-watermarked (Audio latent encoder)

All model training, pipeline implementation, and acoustic model weights were developed independently and trained from scratch. No proprietary acoustic models, private datasets, or closed-source training pipelines were used during development.

📜 License

Licensed under the Apache 2.0 License.
