Rade-ASR-CTC-3B-fa

Persian (Farsi) fine-tuned version of Meta AI's Omnilingual ASR CTC-3B model, released and maintained by Rade AI.

Open in Colab

TL;DR: A fast, non-autoregressive (CTC) speech-to-text model specialized for Persian, built on top of Meta's 3-billion-parameter Omnilingual ASR encoder. It transcribes Persian audio clips shorter than 40 s and runs ~199× faster than real time in fp16 on a single RTX 4090. On Persian test sets, after text normalization, it reaches CER ≈ 4 % on FLEURS (clean read speech) and ≈ 18 % on Common Voice (noisier, crowd-sourced).


Model description

| Property | Value |
|---|---|
| Base model | facebook/omniASR-CTC-3B (Omnilingual ASR, omniASR_CTC_3B_v2) |
| Architecture | Wav2Vec2-style encoder + CTC head |
| Parameters | ~3.08 B (3,080,423,636) |
| Fine-tuned for | Persian / Farsi (pes_Arab) |
| Task | Automatic Speech Recognition (speech → text) |
| Toolkit | fairseq2 via the omnilingual-asr package |
| License | Apache-2.0 |

This checkpoint takes the strong multilingual representations of Omnilingual ASR's 3B CTC encoder and adapts them to Persian, giving accurate, low-latency transcription for Iranian Persian.

Evaluation

Measured by Rade with greedy CTC decoding (fp16) on three Persian test sets. Both reference and hypothesis are normalized before scoring so that orthography-only differences don't count as errors: ك→ک and ي→ی are unified, ZWNJ (نیم‌فاصله) is converted to a space, punctuation and diacritics are stripped, and whitespace is collapsed.
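
The normalization steps listed above can be sketched in a few lines of Python. This is an illustration rather than the exact script used for the reported numbers: the function name `normalize_fa` is hypothetical, and treating "diacritics" as Unicode category `Mn` and "punctuation" as the `P*` categories is an assumption.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner (nim-fasele)

# Arabic kaf/yeh -> Persian kaf/yeh, and ZWNJ -> plain space.
_CHAR_MAP = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc", ZWNJ: " "})


def normalize_fa(text: str) -> str:
    """Normalize Persian text before scoring (hypothetical helper)."""
    text = text.translate(_CHAR_MAP)
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Mn":          # combining marks, e.g. fatha/kasra
            continue
        if category.startswith("P"):  # punctuation, e.g. the Arabic comma
            continue
        kept.append(ch)
    # split()/join collapses any run of whitespace to a single space.
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(normalize_fa("كتاب\u200cها، خوب!"))
```

Applying the same function to both the reference and the hypothesis is what keeps spelling-only variants from being scored as recognition errors.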

| Test set | Clips | WER | CER |
|---|---|---|---|
| FLEURS fa_ir (read speech) | 871 | 19.6 % | 4.4 % |
| VisualEars golden fa (curated: clean/farfield/obstructed) | 6,669 | 22.9 % | 4.3 % |
| Common Voice 17.0 fa (crowd-sourced) | 10,355 | 21.8 % | 17.8 % |

CER is the more faithful metric for Persian. Persian WER is inflated by orthographic and spacing variation (نیم‌فاصله/ZWNJ, affix spacing, compound spelling) that doesn't reflect actual mis-recognition. FLEURS, for example, sits at 19.6 % WER but only 4.4 % CER, which indicates that most of its "word errors" differ from the reference by only a character or so. On clean, well-curated speech (FLEURS, VisualEars) the model reaches CER ≈ 4 %, and it stays robust across recording conditions (VisualEars far-field 4.5 % CER, obstructed 4.2 % CER). On noisier crowd-sourced audio (Common Voice: spontaneous speech, varied mics and accents, loan words) CER rises to ≈ 18 %.
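
To make the WER/CER gap concrete, here is a minimal sketch that scores one made-up sentence pair with a plain Levenshtein distance. The helper names and the example strings are illustrative, not taken from the evaluation code, and this sketch counts spaces as characters when computing CER. The reference writes the verb as two tokens (as it would appear after ZWNJ is turned into a space), while the hypothesis writes it joined.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if equal)
            ))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = "می روم خانه"   # three tokens after normalization
    hyp = "میروم خانه"    # same words, first two written joined
    print(f"WER {wer(ref, hyp):.1%}  CER {cer(ref, hyp):.1%}")
    # prints: WER 66.7%  CER 9.1%
```

A single missing space costs two of three words under WER but only one of eleven characters under CER, which is the effect described above.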

Speed & hardware

CTC models are non-autoregressive: they decode a whole clip in one forward pass, so they're very fast. Measured by Rade on a single RTX 4090 (8.8 s Persian clip, batch_size=1):

| Precision | Inference latency | Speed | Peak VRAM |
|---|---|---|---|
| FP16 (recommended) | 44 ms | ~199× real time | 6.4 GB |
| FP32 | 102 ms | ~87× real time | 12.7 GB |

FP16 and FP32 produce identical transcripts, so FP16 is the recommended default (half the VRAM, roughly 2.3× lower latency). Batched throughput on the 4090 reaches **208× real time** (the 222-min FLEURS set transcribes in ~64 s). A 16 GB GPU such as a Colab T4 is enough for fp16 (larger cards such as the L4 also work). CPU works but is slow.
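
The "× real time" figures are simply audio duration divided by processing time. The sketch below recomputes them from the rounded numbers quoted in this section; the function name is illustrative. The small differences from the table (~199× and ~87×) presumably come from the table using unrounded timings.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute."""
    return audio_seconds / processing_seconds


# Single 8.8 s clip, latencies from the table (44 ms and 102 ms).
fp16 = speedup_over_real_time(8.8, 0.044)
fp32 = speedup_over_real_time(8.8, 0.102)

# Batched run: the 222-minute FLEURS set in about 64 s.
batched = speedup_over_real_time(222 * 60, 64)

if __name__ == "__main__":
    print(f"fp16 {fp16:.0f}x, fp32 {fp32:.0f}x, batched {batched:.0f}x")
    # prints: fp16 200x, fp32 86x, batched 208x
```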

Files in this repo

| File | What it is | Size | When to use |
|---|---|---|---|
| model_fp16.pt | Consolidated fp16 weights, single file | ~6.2 GB | Recommended. Smaller, faster download; fp16 inference. |
| pp_00/tp_00/sdp_00.pt, sdp_01.pt | Original fp32 FSDP checkpoint shards | ~12 GB | If you want full fp32 precision weights. |
| config.json | Model metadata (arch, tokenizer, vocab) | | Read by tooling; you don't load it directly. |
| notebook.ipynb | Ready-to-run Colab/Kaggle notebook | | One-click demo (powers the "Open in Colab" button). |

Both weight files produce identical transcripts at fp16. The single model_fp16.pt is just half the download — prefer it unless you specifically need the fp32 master weights.
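
As a sanity check on the file sizes, the weights alone should take roughly parameter count × bytes per parameter. The snippet below does that arithmetic with the parameter count from the model description; it ignores any checkpoint metadata, so it is only an estimate.

```python
PARAMS = 3_080_423_636  # parameter count from the model description


def weights_gb(n_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"fp16: {weights_gb(PARAMS, 2):.2f} GB, fp32: {weights_gb(PARAMS, 4):.2f} GB")
    # prints: fp16: 6.16 GB, fp32: 12.32 GB
```

These estimates line up with the ~6.2 GB and ~12 GB downloads listed above.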

Usage

1) Install the Omnilingual ASR runtime (the model loads through fairseq2's asset system):

# system dependency for audio I/O
sudo apt-get install -y libsndfile1   # (Linux)  /  brew install libsndfile (macOS)

# Need omnilingual-asr 0.2.0 (it registers the 3b_v2 architecture this model uses).
# --ignore-requires-python: 0.2.0's metadata caps python at "<=3.12" (read as <=3.12.0), which
# wrongly excludes Python 3.12.x (e.g. Colab). The flag installs it anyway; it runs fine on 3.12.
pip install --ignore-requires-python omnilingual-asr==0.2.0 huggingface_hub
# Pin the whole torch stack to the CUDA 12.8 build that fairseq2 needs.
# (Without this you hit `libcudart.so.13` / torchvision::nms errors from a mismatched torchaudio/torchvision.)
pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
# On Colab/Jupyter, restart the runtime once after this install.
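
After restarting the runtime, it can help to confirm that the pinned versions are the ones actually installed. The following is a small, optional sketch using only the standard library; `find_mismatches` is a hypothetical helper, and it assumes CUDA wheels report a local suffix such as `+cu128` that should be ignored when comparing.

```python
from importlib import metadata

# Versions pinned in the install commands above.
EXPECTED = {
    "omnilingual-asr": "0.2.0",
    "torch": "2.8.0",
    "torchaudio": "2.8.0",
    "torchvision": "0.23.0",
}


def find_mismatches(expected: dict) -> dict:
    """Return {package: reason} for every package that is missing or off-pin."""
    problems = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Compare without the local build tag (e.g. "2.8.0+cu128" -> "2.8.0").
        if installed.split("+")[0] != wanted:
            problems[name] = f"found {installed}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = find_mismatches(EXPECTED)
    for package, reason in issues.items():
        print(f"{package}: {reason}")
    if not issues:
        print("All pinned packages match.")
```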

2) Download the weights and register a fairseq2 asset card that points at them, reusing Omnilingual ASR's official Persian-capable tokenizer. The asset card's checkpoint: field can point at either the single model_fp16.pt file or the shard directory; both work.

import pathlib, torch
from huggingface_hub import hf_hub_download

# --- Recommended: download ONLY the single fp16 file (~6.2 GB) ---
ckpt = hf_hub_download("RadeAI/Rade-ASR-CTC-3B-fa", "model_fp16.pt")

# --- Alternative: full fp32 shards (~12 GB) ---
# from huggingface_hub import snapshot_download
# ckpt = snapshot_download("RadeAI/Rade-ASR-CTC-3B-fa", allow_patterns=["pp_00/tp_00/*"])

# Register a fairseq2 asset card pointing at the downloaded checkpoint
asset_dir = pathlib.Path.home() / ".config/fairseq2/assets"
asset_dir.mkdir(parents=True, exist_ok=True)
(asset_dir / "rade.yaml").write_text(f"""\
name: rade_CTC_3B_fa
model_family: wav2vec2_asr
model_arch: 3b_v2
checkpoint: "{ckpt}"
tokenizer_ref: omniASR_tokenizer_written_v2
""")

3) Transcribe (clips must be < 40 s):

from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline
pipe = ASRInferencePipeline(
    model_card="rade_CTC_3B_fa",
    device="cuda" if torch.cuda.is_available() else "cpu",
    dtype=torch.float16,   # fp16: ~199x real time, 6.4 GB VRAM, identical output to fp32
)
text = pipe.transcribe(["sample_fa.wav"], lang=["pes_Arab"], batch_size=1)
print(text[0])
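
For more than one file, a thin wrapper can build the per-file `lang` list and pair each path with its transcript. This is a sketch under two assumptions: that `transcribe` expects one language code per audio item, and that it returns one string per input in the same order, as the single-file call above suggests. `transcribe_files` is a hypothetical helper, not part of omnilingual-asr.

```python
from typing import Dict, Sequence


def transcribe_files(pipe, paths: Sequence[str], batch_size: int = 4,
                     lang: str = "pes_Arab") -> Dict[str, str]:
    """Transcribe several clips and return {path: transcript}."""
    paths = list(paths)
    # One language code per clip, mirroring the single-file example.
    texts = pipe.transcribe(paths, lang=[lang] * len(paths), batch_size=batch_size)
    return dict(zip(paths, texts))
```

With the `pipe` object created above, `transcribe_files(pipe, ["a.wav", "b.wav"])` would return a dictionary keyed by file name. Every clip still has to be shorter than 40 s.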

A ready-to-run notebook is provided: notebook.ipynb — or just click Open in Colab (also in the "Use this model" menu at the top of this page).

Limitations

  • Audio length: CTC models accept clips shorter than 40 seconds. For longer audio, split into < 40 s chunks (e.g. on silence) and concatenate the results.
  • Domain: best on clear Persian speech; very noisy audio or heavy code-switching may degrade accuracy.
  • Language code: use pes_Arab (Western/Iranian Persian). Note that for CTC models the lang argument is informational — CTC decoding is language-agnostic at inference time.
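
For the audio-length limit above, one way to split a long recording is to cut at the quietest point near the end of each window. The sketch below only plans the cut positions on a mono sample sequence; it is an illustration, not the chunker used by Rade. The function name and the default values (39 s windows, a 5 s search region, 20 ms energy frames) are assumptions.

```python
from typing import List, Sequence, Tuple


def plan_chunks(samples: Sequence[float], sample_rate: int,
                max_seconds: float = 39.0, search_seconds: float = 5.0,
                frame_seconds: float = 0.02) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices, each chunk at most max_seconds long."""
    if search_seconds >= max_seconds:
        raise ValueError("search_seconds must be smaller than max_seconds")
    max_len = int(max_seconds * sample_rate)
    search = int(search_seconds * sample_rate)
    frame = max(1, int(frame_seconds * sample_rate))
    n = len(samples)
    if n == 0:
        return []

    chunks = []
    start = 0
    while n - start > max_len:
        hard_end = start + max_len
        best_cut, best_energy = hard_end, float("inf")
        # Scan the tail of the window and keep the lowest-energy frame.
        for pos in range(hard_end - search, hard_end - frame + 1, frame):
            energy = sum(x * x for x in samples[pos:pos + frame])
            if energy < best_energy:
                best_cut, best_energy = pos + frame // 2, energy
        chunks.append((start, best_cut))
        start = best_cut
    chunks.append((start, n))  # remaining audio already fits in one chunk
    return chunks
```

Each `(start, end)` pair can then be sliced out of the waveform, transcribed as its own clip, and the resulting texts joined with spaces in order.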

License & attribution

Released under Apache-2.0, consistent with the base Omnilingual ASR model and code. This is a derivative fine-tune of facebook/omniASR-CTC-3B; all credit for the base architecture and pre-training goes to the Meta AI Omnilingual ASR Team.

@misc{omnilingualasr2025,
  title={{Omnilingual ASR}: Open-Source Multilingual Speech Recognition for 1600+ Languages},
  author={{Omnilingual ASR Team}},
  year={2025},
  url={https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/}
}

Contact

Built and maintained by Rade AI. For questions, collaboration, or custom Persian speech/NLP models, get in touch:


Introduction (translated from Persian)

This model is a Persian fine-tuned version of Meta's Omnilingual ASR CTC-3B model, released by Rade.

  • It converts Persian speech to text (clips shorter than 40 seconds).
  • It uses a CTC (non-autoregressive) architecture, which makes it very fast: in fp16, about 199× faster than real time on a single RTX 4090.
  • In fp16 it needs only 6.4 GB of VRAM (a 16 GB GPU is enough).
  • Accuracy (with text normalization): on Persian FLEURS (clean speech), CER is about 4 % (WER 19.6 %); on Persian Common Voice 17 (noisy conversational data, 10,355 clips), CER is about 18 % (WER 21.8 %). For Persian, CER is the more reliable metric because WER is inflated by spelling and ZWNJ (نیم‌فاصله) differences.

Usage is covered in the Usage section above. For a quick test, use the Open in Colab button (top of this page, in the "Use this model" menu) or open the notebook.ipynb notebook.

Contact Rade: Telegram @Rade_admin, phone: 09368647499
