Rade-ASR-CTC-3B-fa
Persian (Farsi) fine-tuned version of Meta AI's Omnilingual ASR CTC-3B model, released and maintained by Rade AI.
TL;DR: A fast, non-autoregressive (CTC) speech-to-text model specialized for Persian, built on top of Meta's 3-billion-parameter Omnilingual ASR encoder. It transcribes Persian audio clips (< 40 s) and runs ~199× faster than real time in fp16 on a single RTX 4090. On Persian test sets, with text normalization applied to both reference and hypothesis, it reaches CER ≈ 4 % on FLEURS (clean read speech) and ≈ 18 % on Common Voice (noisier, crowd-sourced).
Model description
| | |
|---|---|
| Base model | facebook/omniASR-CTC-3B (Omnilingual ASR, omniASR_CTC_3B_v2) |
| Architecture | Wav2Vec2-style encoder + CTC head |
| Parameters | ~3.08 B (3,080,423,636) |
| Fine-tuned for | Persian / Farsi (pes_Arab) |
| Task | Automatic Speech Recognition (speech → text) |
| Toolkit | fairseq2 via the omnilingual-asr package |
| License | Apache-2.0 |
This checkpoint takes the strong multilingual representations of Omnilingual ASR's 3B CTC encoder and adapts them to Persian, giving accurate, low-latency transcription for Iranian Persian.
Evaluation
Measured by Rade with greedy CTC decoding (fp16) on three Persian test sets. Both reference and hypothesis are normalized before scoring (unify ك→ک and ي→ی, convert ZWNJ (نیم‌فاصله) to a space, strip punctuation and diacritics, collapse whitespace) so that orthography-only differences don't count as errors.
| Test set | Clips | WER | CER |
|---|---|---|---|
| FLEURS fa_ir (read speech) | 871 | 19.6 % | 4.4 % |
| VisualEars golden fa (curated: clean/farfield/obstructed) | 6,669 | 22.9 % | 4.3 % |
| Common Voice 17.0 fa (crowd-sourced) | 10,355 | 21.8 % | 17.8 % |
CER is the more faithful metric for Persian. Persian WER is inflated by orthographic and spacing variation (نیم‌فاصله/ZWNJ, affix spacing, compound spelling) that doesn't reflect actual mis-recognition: FLEURS sits at 19.6 % WER but only 4.4 % CER, which suggests that most "word errors" are small spelling or spacing differences. On clean, well-curated speech (FLEURS, VisualEars) the model reaches CER ≈ 4 %, and it stays robust across recording conditions (VisualEars far-field 4.5 % / obstructed 4.2 % CER). On noisier crowd-sourced audio (Common Voice: spontaneous speech, varied mics/accents, loan words) CER rises to ≈ 18 %.
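The normalization and scoring described in this section can be sketched roughly as follows. This is a minimal illustration, not Rade's evaluation script. The function names (`normalize_fa`, `edit_distance`, `cer`, `wer`) are hypothetical, and several details are assumptions: punctuation and diacritics are identified by Unicode category, punctuation is replaced by a space rather than deleted, and CER counts spaces as characters.

```python
import unicodedata
from typing import Sequence

# Arabic kaf/yeh -> Persian kaf/yeh, and ZWNJ -> space
_CHAR_MAP = str.maketrans({"\u0643": "\u06a9", "\u064a": "\u06cc", "\u200c": " "})


def normalize_fa(text: str) -> str:
    """Apply the normalization steps listed above (assumed implementation)."""
    text = text.translate(_CHAR_MAP)
    # Drop combining marks (Arabic-script diacritics are Unicode category Mn).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Replace punctuation (Unicode categories starting with P) by a space.
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate on normalized text (spaces count as characters)."""
    r, h = normalize_fa(ref), normalize_fa(hyp)
    return edit_distance(r, h) / max(len(r), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate on normalized, whitespace-split text."""
    r, h = normalize_fa(ref).split(), normalize_fa(hyp).split()
    return edit_distance(r, h) / max(len(r), 1)
```

Under these assumptions, a reference written with a ZWNJ (`"کتاب\u200cها"`) scored against a hypothesis that joins the word (`"کتابها"`) gives a WER of 1.0 but a CER of 1/7, which illustrates why spacing variation inflates WER far more than CER.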
Speed & hardware
CTC models are non-autoregressive — they decode a whole clip in one forward pass, so they're very fast. Measured by Rade on a single RTX 4090 (8.8 s Persian clip, batch_size=1):
| Precision | Inference latency | Speed | Peak VRAM |
|---|---|---|---|
| FP16 (recommended) | 44 ms | ~199× real time | 6.4 GB |
| FP32 | 102 ms | ~87× real time | 12.7 GB |
FP16 and FP32 produce identical transcripts, so FP16 is the recommended default (half the VRAM, more than 2× faster). Batched throughput on the 4090 reaches **208× real time** (the 222-min FLEURS set transcribes in ~64 s). A 16 GB GPU (e.g. a Colab T4) is enough for fp16, and a 24 GB L4 leaves extra headroom. CPU works but is slow.
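The "× real time" figures are audio duration divided by processing time. A small sketch of that arithmetic follows; the function name `speed_factor` is illustrative, and the inputs are the rounded numbers quoted in this section, so the results only approximately reproduce the table.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


print(round(speed_factor(8.8, 0.044)))    # single 8.8 s clip, fp16 latency 44 ms
print(round(speed_factor(8.8, 0.102)))    # same clip, fp32 latency 102 ms
print(round(speed_factor(222 * 60, 64)))  # 222-min FLEURS set in ~64 s, batched
```

With these rounded inputs the three results come out at about 200, 86 and 208. The small gaps to the table's ~199× and ~87× are presumably due to rounding in the reported latencies.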
Files in this repo
| File | What it is | Size | When to use |
|---|---|---|---|
| model_fp16.pt | Consolidated fp16 weights, single file | ~6.2 GB | Recommended. Smaller, faster download; fp16 inference. |
| pp_00/tp_00/sdp_00.pt, sdp_01.pt | Original fp32 FSDP checkpoint shards | ~12 GB | If you want full fp32 precision weights. |
| config.json | Model metadata (arch, tokenizer, vocab) | n/a | Read by tooling; you don't load it directly. |
| notebook.ipynb | Ready-to-run Colab/Kaggle notebook | n/a | One-click demo (powers the "Open in Colab" button). |
Both weight files produce identical transcripts at fp16. The single `model_fp16.pt` is just half the download; prefer it unless you specifically need the fp32 master weights.
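The two download sizes follow from the parameter count: each parameter takes 2 bytes in fp16 and 4 bytes in fp32. A quick check, where the helper name `weights_gb` is illustrative, GB means 10^9 bytes, and any checkpoint metadata overhead is ignored:

```python
NUM_PARAMS = 3_080_423_636  # parameter count from the model description table


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


print(f"fp16: {weights_gb(NUM_PARAMS, 2):.2f} GB")
print(f"fp32: {weights_gb(NUM_PARAMS, 4):.2f} GB")
```

This gives roughly 6.16 GB and 12.32 GB, in line with the ~6.2 GB and ~12 GB listed in the table.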
Usage
1) Install the Omnilingual ASR runtime (the model loads through fairseq2's asset system):
# system dependency for audio I/O
sudo apt-get install -y libsndfile1 # (Linux) / brew install libsndfile (macOS)
# Need omnilingual-asr 0.2.0 (it registers the 3b_v2 architecture this model uses).
# --ignore-requires-python: 0.2.0's metadata caps python at "<=3.12" (read as <=3.12.0), which
# wrongly excludes Python 3.12.x (e.g. Colab). The flag installs it anyway; it runs fine on 3.12.
pip install --ignore-requires-python omnilingual-asr==0.2.0 huggingface_hub
# Pin the whole torch stack to the CUDA 12.8 build that fairseq2 needs.
# (Without this you hit `libcudart.so.13` / torchvision::nms errors from a mismatched torchaudio/torchvision.)
pip install torch==2.8.0 torchaudio==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
# On Colab/Jupyter, restart the runtime once after this install.
2) Download the weights and register a fairseq2 asset card that points at them, reusing Omnilingual ASR's official Persian-capable tokenizer. The asset card's `checkpoint:` field can point at either the single model_fp16.pt file or the shard directory; both work.
import pathlib, torch
from huggingface_hub import hf_hub_download
# --- Recommended: download ONLY the single fp16 file (~6.2 GB) ---
ckpt = hf_hub_download("RadeAI/Rade-ASR-CTC-3B-fa", "model_fp16.pt")
# --- Alternative: full fp32 shards (~12 GB) ---
# from huggingface_hub import snapshot_download
# ckpt = snapshot_download("RadeAI/Rade-ASR-CTC-3B-fa", allow_patterns=["pp_00/tp_00/*"])
# Register a fairseq2 asset card pointing at the downloaded checkpoint
asset_dir = pathlib.Path.home() / ".config/fairseq2/assets"
asset_dir.mkdir(parents=True, exist_ok=True)
(asset_dir / "rade.yaml").write_text(f"""\
name: rade_CTC_3B_fa
model_family: wav2vec2_asr
model_arch: 3b_v2
checkpoint: "{ckpt}"
tokenizer_ref: omniASR_tokenizer_written_v2
""")
3) Transcribe (clips must be < 40 s):
import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline

pipe = ASRInferencePipeline(
    model_card="rade_CTC_3B_fa",  # name of the asset card registered in step 2
    device="cuda" if torch.cuda.is_available() else "cpu",
    dtype=torch.float16,  # fp16: ~199x real time, 6.4 GB VRAM, identical output to fp32
)
# lang takes one language code per input file
text = pipe.transcribe(["sample_fa.wav"], lang=["pes_Arab"], batch_size=1)
print(text[0])
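For several clips at once, the same `transcribe` call can be wrapped in a small helper. The sketch below assumes, as the example above suggests, that `transcribe` takes one language code per input file and returns the transcripts in input order. The helper name `transcribe_persian` and the default `batch_size=4` are illustrative choices, not part of the omnilingual-asr API.

```python
from pathlib import Path
from typing import Dict, Iterable, Union


def transcribe_persian(
    pipe, paths: Iterable[Union[str, Path]], batch_size: int = 4
) -> Dict[str, str]:
    """Transcribe Persian audio files and map each path to its transcript.

    `pipe` is expected to expose transcribe(files, lang=..., batch_size=...),
    like the pipeline object created above (assumption: output order matches input).
    """
    files = [str(p) for p in paths]
    if not files:
        return {}
    # One "pes_Arab" language code per file, mirroring the single-file example.
    texts = pipe.transcribe(files, lang=["pes_Arab"] * len(files), batch_size=batch_size)
    return dict(zip(files, texts))
```

A call such as `transcribe_persian(pipe, ["a.wav", "b.wav"])` would then return a dictionary keyed by file path. Every clip still has to respect the 40 s limit.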
A ready-to-run notebook is provided: notebook.ipynb — or just click Open in Colab (also in the "Use this model" menu at the top of this page).
Limitations
- Audio length: CTC models accept clips shorter than 40 seconds. For longer audio, split into < 40 s chunks (e.g. on silence) and concatenate the results.
- Domain: best on clear Persian speech; very noisy audio or heavy code-switching may degrade accuracy.
- Language code: use `pes_Arab` (Western/Iranian Persian). Note that for CTC models the `lang` argument is informational: CTC decoding is language-agnostic at inference time.
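The splitting step for long audio can be sketched as below. For simplicity this cuts at fixed intervals rather than on silence, so a word may be split at a chunk boundary. The function name `chunk_bounds` and the 30 s default (chosen to leave a margin under the 40 s limit) are assumptions for illustration.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int, sample_rate: int, max_seconds: float = 30.0
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive chunks.

    Every chunk is at most `max_seconds` long; the last one may be shorter.
    """
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    max_len = int(sample_rate * max_seconds)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


# Example: a 100 s recording at 16 kHz is cut into 30 + 30 + 30 + 10 s pieces.
for start, end in chunk_bounds(100 * 16_000, 16_000):
    print(start, end)
```

Each `(start, end)` slice of the waveform can then be saved as its own audio file, transcribed, and the resulting texts joined with spaces. Cutting on detected silence instead of fixed boundaries would avoid splitting words, at the cost of a little extra code.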
License & attribution
Released under Apache-2.0, consistent with the base Omnilingual ASR model and code. This is a derivative fine-tune of facebook/omniASR-CTC-3B; all credit for the base architecture and pre-training goes to the Meta AI Omnilingual ASR Team.
@misc{omnilingualasr2025,
title={{Omnilingual ASR}: Open-Source Multilingual Speech Recognition for 1600+ Languages},
author={{Omnilingual ASR Team}},
year={2025},
url={https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/}
}
Contact
Built and maintained by Rade AI. For questions, collaboration, or custom Persian speech/NLP models, get in touch:
- Telegram: @Rade_admin
- Phone: +98 936 864 7499
- Hugging Face: RadeAI
Introduction (translated from Persian)
This model is a Persian fine-tune of Meta's Omnilingual ASR CTC-3B model, released by Rade.
- It converts Persian speech to text (clips shorter than 40 seconds).
- It uses a CTC (non-autoregressive) architecture, which makes it very fast: in fp16 it runs about 199× faster than real time on a single RTX 4090.
- In fp16 it needs only 6.4 GB of VRAM (a 16 GB GPU is enough).
- Accuracy (with text normalization): on Persian FLEURS (clean speech) CER is about 4 % (WER 19.6 %), and on Persian Common Voice 17 (noisy, colloquial data, 10,355 clips) CER is about 18 % (WER 21.8 %). For Persian, CER is the more reliable metric because WER is inflated by spelling and ZWNJ differences.
Usage is described in the Usage section above. For a quick test, use the Open in Colab button (top of this page, "Use this model" menu) or open the notebook.ipynb notebook.
Contact Rade: Telegram @Rade_admin, phone 09368647499