whisper-small-arabic
Fine-tuned openai/whisper-small for Arabic speech recognition.
Results
| Metric | Baseline (openai/whisper-small) | Fine-tuned (this model) | Improvement |
|---|---|---|---|
| WER (↓) | 42.69 | 20.61 | −22.08 abs / −51.7 % rel |
Evaluated on 300 held-out clips from the Mozilla Common Voice Arabic test split. Baseline and fine-tuned numbers use identical decoding configuration (greedy, language=arabic, task=transcribe).
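For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration only. The function name `word_error_rate` is hypothetical, and the card does not state which library or text normalization produced the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Improvement figures recomputed from the two WER values in the table.
baseline_wer, finetuned_wer = 42.69, 20.61
absolute_gain = baseline_wer - finetuned_wer        # about 22.08 points
relative_gain = 100 * absolute_gain / baseline_wer  # about 51.7 %
```

The reported scores are percentages, so a return value of 0.2061 from this function would correspond to the 20.61 in the table.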
Training data
Mozilla Common Voice (Arabic), the crowdsourced Arabic speech corpus distributed by the Mozilla Foundation. Train/dev/test were combined, shuffled with seed=42, and the first 25,000 clips were used for fine-tuning. Audio is at 16 kHz mono.
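The selection step (pool the splits, shuffle with a fixed seed, keep the first N clips) can be sketched in plain Python as follows. The helper `combine_shuffle_take` and the toy file names are hypothetical, since the card does not show the actual preprocessing code. With the Hugging Face `datasets` library the equivalent would probably be `concatenate_datasets([...]).shuffle(seed=42).select(range(25000))`, and Python's `random` module will not reproduce the same ordering as that library.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def combine_shuffle_take(splits: Sequence[Sequence[T]], n: int, seed: int = 42) -> list[T]:
    """Concatenate the splits, shuffle deterministically, keep the first n items."""
    pooled = [item for split in splits for item in split]
    random.Random(seed).shuffle(pooled)  # local RNG, global random state is untouched
    return pooled[:n]


# Toy stand-ins for the train/dev/test clip lists (hypothetical file names).
train = [f"train_{i}.mp3" for i in range(6)]
dev = [f"dev_{i}.mp3" for i in range(2)]
test = [f"test_{i}.mp3" for i in range(2)]
subset = combine_shuffle_take([train, dev, test], n=5)
```

Fixing the seed makes the subset reproducible: calling the helper again with the same inputs returns the same five clips.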
Training procedure
Single Tesla T4 (16 GB) on Kaggle, mixed-precision fp16. Long audio is handled by Whisper's standard 30-second windowing.
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Base model | openai/whisper-small (244 M params) |
| Training samples | 25,000 (Common Voice ar, shuffled with seed 42) |
| Per-device batch size | 8 |
| Gradient accumulation | 4 (effective batch 32) |
| Optimizer | AdamW (HF default) |
| Learning rate | 1e-5 |
| LR schedule | linear, warmup 500 steps |
| Weight decay | 0.01 |
| Max steps | 4,000 (in two stages, see below) |
| Mixed precision | fp16 |
| Eval / save every | 500 steps |
| Early stopping patience | 3 evals |
| generation_max_length (eval) | 80 |
| Forced language / task | arabic / transcribe |
| Seed | 42 |
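As a rough check on these settings, the snippet below collects the table values in a dictionary and derives the effective batch size and the approximate number of passes over the data. The key names follow the usual `Seq2SeqTrainingArguments` spelling, which is an assumption here because the training script is not shown, and the epoch estimate assumes all 25,000 selected clips were used for training.

```python
# Values copied from the hyperparameter table; key names are assumed to
# match Seq2SeqTrainingArguments and are not taken from the author's script.
hparams = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "max_steps": 4000,
    "fp16": True,
    "eval_steps": 500,
    "save_steps": 500,
    "generation_max_length": 80,
    "seed": 42,
}
num_devices = 1            # single Tesla T4
training_samples = 25_000  # assumption: every selected clip is a training clip

effective_batch = (hparams["per_device_train_batch_size"]
                   * hparams["gradient_accumulation_steps"]
                   * num_devices)
samples_seen = effective_batch * hparams["max_steps"]
approx_epochs = samples_seen / training_samples
print(effective_batch, samples_seen, approx_epochs)
```

Under those assumptions, 4,000 steps at an effective batch of 32 cover 128,000 samples, or roughly five passes over the 25,000 clips.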
Two-stage learning rate schedule
The first 2,000 steps used learning_rate = 1e-5 and converged to WER ≈ 21.3. Validation loss had plateaued, so training was resumed from the 2,000-step checkpoint at the lower learning_rate = 5e-6 for another 2,000 steps. The smaller LR reduced WER by a further 0.7 points (21.31 to 20.61) and produced the published checkpoint at step 3,000.
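To make the "linear, warmup 500 steps" schedule concrete, here is a small sketch of a linear warmup followed by linear decay, the shape commonly used by Hugging Face's linear scheduler. The function name is hypothetical. The card does not say how many total steps the scheduler was configured with in each stage, or whether warmup was repeated after resuming, so `total_steps=2000` below is purely an assumption for illustration.

```python
def linear_warmup_decay(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Stage 1, ASSUMING the scheduler spans exactly 2,000 steps (not stated in the card).
stage1 = [linear_warmup_decay(s, 1e-5, 500, 2000) for s in (0, 250, 500, 1250, 2000)]

# Stage 2 would use the same shape with peak_lr=5e-6; its warmup and
# total step count are likewise unknown.
```

Under this assumption the rate climbs to 1e-5 at step 500 and then falls linearly, so the actual rate seen at any given step depends on the configured total, which is why the peak values alone do not fully describe the schedule.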
Validation curve
| Step | Training loss | Validation loss | WER ↓ |
|---|---|---|---|
| 500 | 2.3978 | 0.3251 | 28.53 |
| 1000 | 1.4769 | 0.2624 | 26.93 |
| 1500 | 0.9937 | 0.2583 | 22.21 |
| 2000 | 0.7566 | 0.2517 | 21.31 |
| 2500 | 0.6933 | 0.2582 | 21.25 |
| 3000 | 0.5785 | 0.2596 | 20.61 |
| 3500 | 0.4998 | 0.2593 | 20.68 |
| 4000 | 0.4839 | 0.2589 | 20.74 |
Learning rate is dropped from 1e-5 to 5e-6 after step 2,000. Step 3,000 is the published checkpoint (lowest validation WER).
The published checkpoint is step 3,000, which has the lowest WER on the held-out validation split. Training loss continued to fall after that point, but validation loss flattened and WER ticked back up: classic mild overfitting, so the earlier checkpoint was kept as the final model.
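The checkpoint choice can be reproduced from the validation table with a few lines of Python. This is only a sketch of the selection rule, not the author's code. It also shows that selecting by validation loss instead of WER would have picked a different step.

```python
# (step, validation loss, WER) rows copied from the validation table.
curve = [
    (500, 0.3251, 28.53),
    (1000, 0.2624, 26.93),
    (1500, 0.2583, 22.21),
    (2000, 0.2517, 21.31),
    (2500, 0.2582, 21.25),
    (3000, 0.2596, 20.61),
    (3500, 0.2593, 20.68),
    (4000, 0.2589, 20.74),
]

best_by_wer = min(curve, key=lambda row: row[2])   # lowest WER
best_by_loss = min(curve, key=lambda row: row[1])  # lowest validation loss
print(best_by_wer[0], best_by_loss[0])  # 3000 2000
```

In the Hugging Face Trainer this kind of choice is usually expressed through `metric_for_best_model` and `greater_is_better`. Whether the author used those options is not stated.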
Quick use
```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

REPO = "Omar10lfc/whisper-small-arabic"
processor = WhisperProcessor.from_pretrained(REPO)
model = WhisperForConditionalGeneration.from_pretrained(REPO)

# Force Arabic transcription instead of relying on language detection.
model.generation_config.language = "arabic"
model.generation_config.task = "transcribe"

# Whisper expects 16 kHz mono audio.
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)
features = processor(speech, sampling_rate=16000, return_tensors="pt").input_features

ids = model.generate(features, max_new_tokens=440)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
For long audio (lectures, podcasts), use the HF pipeline with chunking:
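```python
from transformers import pipeline

REPO = "Omar10lfc/whisper-small-arabic"

asr = pipeline(
    "automatic-speech-recognition",
    model=REPO,
    chunk_length_s=30,   # split long audio into 30-second chunks
    stride_length_s=5,   # overlap between neighbouring chunks
    device=0,            # or -1 for CPU
)
print(asr("lecture.wav", generate_kwargs={"language": "arabic", "task": "transcribe"})["text"])
```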
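To see what `chunk_length_s=30` and `stride_length_s=5` imply, the sketch below lists the overlapping windows for a given audio length. It is a simplified illustration of the overlap arithmetic, not the pipeline's actual implementation, and it assumes the 5-second stride applies on both sides of each chunk, so consecutive chunks start 20 seconds apart. The function name `chunk_windows` is hypothetical.

```python
def chunk_windows(duration_s: float, chunk_length_s: float = 30.0, stride_length_s: float = 5.0):
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    step = chunk_length_s - 2 * stride_length_s  # assumed: stride on both sides
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        windows.append((start, end))
        if start + chunk_length_s >= duration_s:
            break  # this chunk already reaches the end of the audio
        start += step
    return windows


print(chunk_windows(70.0))  # [(0.0, 30.0), (20.0, 50.0), (40.0, 70.0)]
```

Neighbouring windows share 10 seconds of audio, which gives the decoder context on both sides of every chunk boundary when the partial transcripts are merged.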
Intended use & limitations
- Intended use: Modern Standard Arabic speech-to-text for lectures, news, podcasts, and dictation. Used as the ASR front-end of an Arabic audio understanding pipeline (Whisper → AraBART summarization → CAMeL-BERT/FAISS semantic search).
- Out of scope: code-switching with English beyond a few words, speaker diarization, regional dialects under-represented in Common Voice (Maghrebi, Khaleeji), very long form (>30 s without chunking).
- Bias: Common Voice is volunteer-recorded; speaker demographics skew toward MSA-fluent contributors and may underrepresent dialects, accents, and acoustic conditions outside that distribution.
Citation
```bibtex
@misc{whisper-small-arabic,
  title        = {whisper-small-arabic: Fine-tuned Whisper for Arabic on Mozilla Common Voice},
  author       = {{Omar10lfc}},
  year         = {2026},
  howpublished = {Hugging Face},
  note         = {Fine-tune of openai/whisper-small on Mozilla Common Voice Arabic, WER 20.61.}
}
```
Built on top of:
```bibtex
@article{radford2022whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and others},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}
```