whisper-small-arabic

Fine-tuned openai/whisper-small for Arabic speech recognition.

Results

Metric	Baseline (`openai/whisper-small`)	Fine-tuned (this model)	Improvement
WER (↓)	42.69	20.61	−22.08 abs / −51.7 % rel

Evaluated on 300 held-out clips from the Mozilla Common Voice Arabic test split. Baseline and fine-tuned numbers use identical decoding configuration (greedy, language=arabic, task=transcribe).
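For readers unfamiliar with the metric, the sketch below shows how a word error rate and the improvement figures in the table can be computed. It is a minimal pure-Python illustration, not the evaluation script used for this card: the function names are hypothetical, the card does not say which WER implementation or text normalization was applied, and corpus-level WER is normally computed by summing edits over all clips rather than averaging per-clip scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 1) or match (cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


def improvement(baseline: float, finetuned: float) -> tuple[float, float]:
    """Return (absolute reduction, relative reduction in percent)."""
    absolute = baseline - finetuned
    return absolute, 100.0 * absolute / baseline


abs_gain, rel_gain = improvement(42.69, 20.61)
print(f"{abs_gain:.2f} abs / {rel_gain:.1f} % rel")  # 22.08 abs / 51.7 % rel
```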

Training data

Mozilla Common Voice — Arabic, the crowdsourced Arabic speech corpus distributed by the Mozilla Foundation. Train/dev/test were combined, shuffled with seed=42, and the first 25,000 clips were used for fine-tuning. Audio is at 16 kHz mono.
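The subset selection described above (pool the splits, shuffle with a fixed seed, keep the first 25,000 clips) can be sketched as follows. This is a library-agnostic illustration using Python's standard `random` module; the function name is hypothetical, and because the original run's shuffling implementation is not stated, the resulting order will not match the one used for training.

```python
import random


def select_training_subset(splits, n_samples=25_000, seed=42):
    """Pool all splits, shuffle deterministically, and keep the first n_samples."""
    pool = [clip for split in splits for clip in split]
    random.Random(seed).shuffle(pool)  # private RNG, so global random state is untouched
    return pool[:n_samples]


# Toy example with clip IDs standing in for audio records.
train, dev, test = list(range(5)), list(range(5, 8)), list(range(8, 10))
subset = select_training_subset([train, dev, test], n_samples=4)
print(len(subset))
```

The card does not say how the 300 evaluation clips were kept out of this pool, so a reproduction would need to filter them explicitly before sampling.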

Training procedure

Single Tesla T4 (16 GB) on Kaggle, mixed-precision fp16. Long audio is handled by Whisper's standard 30-second windowing.
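Whisper operates on a fixed 30-second input window, which at 16 kHz is 480,000 samples. The sketch below shows the pad-or-trim idea in plain Python; in practice the Hugging Face feature extractor performs this step. It assumes clips longer than 30 s are truncated during training, which the card does not state explicitly, and the helper name is illustrative.

```python
SAMPLE_RATE = 16_000                           # Hz, as stated in the card
WINDOW_SECONDS = 30                            # Whisper's fixed input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def pad_or_trim(samples, length=WINDOW_SAMPLES):
    """Cut a waveform to `length` samples, or right-pad it with silence (zeros)."""
    samples = list(samples[:length])
    return samples + [0.0] * (length - len(samples))


short_clip = [0.5] * 100
print(len(pad_or_trim(short_clip)))
```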

Hyperparameters


Setting	Value
Base model	`openai/whisper-small` (244 M params)
Training samples	25,000 (Common Voice ar, shuffled with seed 42)
Per-device batch size	8
Gradient accumulation	4 (effective batch 32)
Optimizer	AdamW (HF default)
Learning rate	1e-5
LR schedule	linear, warmup 500 steps
Weight decay	0.01
Max steps	4,000 (in two stages — see below)
Mixed precision	fp16
Eval / save every	500 steps
Early stopping patience	3 evals
`generation_max_length` (eval)	80
Forced language / task	`arabic` / `transcribe`
Seed	42
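As a rough guide to reproducing this setup, the table maps onto keyword arguments of `Seq2SeqTrainingArguments` in `transformers` roughly as follows. This is a sketch, not the original training script: the last four entries are assumptions that the card does not state, and the dictionary is kept as plain Python so it can be inspected without installing anything.

```python
# Values taken from the hyperparameter table above.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 1e-5,          # stage 1; stage 2 used 5e-6
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "max_steps": 4_000,             # total across both stages
    "fp16": True,
    "eval_steps": 500,
    "save_steps": 500,
    "generation_max_length": 80,
    "seed": 42,
    # Assumptions (not stated in the card), typically needed so that
    # WER is computed at eval time and drives best-checkpoint selection:
    "predict_with_generate": True,
    "load_best_model_at_end": True,
    "metric_for_best_model": "wer",
    "greater_is_better": False,
}

# Early stopping is configured separately via a trainer callback,
# e.g. EarlyStoppingCallback(early_stopping_patience=3).
EARLY_STOPPING_PATIENCE = 3

effective_batch = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
print(effective_batch)  # 32
```

The evaluation strategy must also be set to step-based evaluation; the keyword for that setting differs between `transformers` versions, so it is left out here.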

Two-stage learning rate schedule

The first 2,000 steps used learning_rate = 1e-5 and reached WER 21.31. Validation loss had plateaued, so training was resumed from the 2,000-step checkpoint at a lower learning_rate = 5e-6 for another 2,000 steps. The smaller LR lowered WER by a further 0.70 points (21.31 to 20.61); the published checkpoint is from step 3,000.
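To make the shape of such a schedule concrete, the sketch below evaluates a linear warmup-then-decay learning rate for each stage. It rests on a loud assumption: each stage is treated as an independent 2,000-step run with its own 500-step warmup and linear decay to zero. The card does not say whether warmup was repeated in stage 2 or over what horizon the decay was computed, so this is one possible reading rather than the schedule actually used. The names `linear_lr`, `STAGES`, and `two_stage_lr` are illustrative.

```python
def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed stage layout: two back-to-back runs of 2,000 steps each.
STAGES = [
    {"peak_lr": 1e-5, "steps": 2_000},
    {"peak_lr": 5e-6, "steps": 2_000},
]


def two_stage_lr(global_step, warmup_steps=500):
    """Learning rate at a global step under the assumed two-stage layout."""
    offset = 0
    for stage in STAGES:
        if global_step < offset + stage["steps"]:
            local_step = global_step - offset
            return linear_lr(local_step, stage["peak_lr"], warmup_steps, stage["steps"])
        offset += stage["steps"]
    return 0.0  # past the end of training


for s in (250, 500, 1_999, 2_500, 3_000):
    print(s, two_stage_lr(s))
```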

Validation curve

Step	Training loss	Validation loss	WER ↓
500	2.3978	0.3251	28.53
1000	1.4769	0.2624	26.93
1500	0.9937	0.2583	22.21
2000	0.7566	0.2517	21.31
2500	0.6933	0.2582	21.25
3000	0.5785	0.2596	20.61
3500	0.4998	0.2593	20.68
4000	0.4839	0.2589	20.74

The learning rate was dropped from 1e-5 to 5e-6 after step 2,000. The published checkpoint is step 3,000, which has the lowest WER on the held-out validation split (20.61). Training loss continued to fall after that point, but validation loss stayed flat at about 0.259 and WER rose slightly (20.68, then 20.74). This is mild overfitting, so the step-3,000 checkpoint was kept as the final model.
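The checkpoint choice can be checked directly against the validation curve. The sketch below picks the step with the lowest WER and also simulates a patience-3 early-stopping rule over the same numbers; the helper names are illustrative, and the early-stopping logic is a simplified model of the usual "stop after N evaluations without improvement" rule.

```python
# WER values copied from the validation curve table.
VALIDATION_WER = {
    500: 28.53, 1000: 26.93, 1500: 22.21, 2000: 21.31,
    2500: 21.25, 3000: 20.61, 3500: 20.68, 4000: 20.74,
}


def best_checkpoint(curve):
    """Return the step with the lowest WER."""
    return min(curve, key=curve.get)


def early_stop_step(curve, patience=3):
    """Return the step at which early stopping would trigger, or None."""
    best, bad_evals = float("inf"), 0
    for step in sorted(curve):
        if curve[step] < best:
            best, bad_evals = curve[step], 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None


print(best_checkpoint(VALIDATION_WER))   # 3000
print(early_stop_step(VALIDATION_WER))   # None
```

Only two evaluations after step 3,000 failed to improve, so a patience of 3 would not have stopped the run before the 4,000-step limit.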

Quick use

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

REPO = "Omar10lfc/whisper-small-arabic"
processor = WhisperProcessor.from_pretrained(REPO)
model     = WhisperForConditionalGeneration.from_pretrained(REPO)
# Pin language and task so generation does not auto-detect them.
model.generation_config.language = "arabic"
model.generation_config.task     = "transcribe"

# Load as 16 kHz mono, the format stated under "Training data".
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)
features  = processor(speech, sampling_rate=16000, return_tensors="pt").input_features
ids       = model.generate(features, max_new_tokens=440)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

For long audio (lectures, podcasts), use the HF pipeline with chunking:

from transformers import pipeline

REPO = "Omar10lfc/whisper-small-arabic"

# Split long audio into 30 s chunks that overlap by 5 s on each side.
asr = pipeline(
    "automatic-speech-recognition",
    model=REPO,
    chunk_length_s=30,
    stride_length_s=5,
    device=0,                    # or -1 for CPU
)
print(asr("lecture.wav", generate_kwargs={"language": "arabic", "task": "transcribe"})["text"])
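To see what `chunk_length_s=30` and `stride_length_s=5` imply, the sketch below computes approximate chunk boundaries for a recording. It is a simplified model, assuming the stride applies to both sides of each chunk so that consecutive chunks start 20 s apart; the pipeline's exact handling of the final chunk may differ, and `chunk_bounds` is a hypothetical helper, not part of `transformers`.

```python
def chunk_bounds(total_s, chunk_s=30, stride_s=5):
    """Return (start, end) times in seconds for overlapping chunks."""
    step = chunk_s - 2 * stride_s  # distance between consecutive chunk starts
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_s")
    bounds, start = [], 0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the recording is fully covered
        start += step
    return bounds


print(chunk_bounds(70))  # [(0, 30), (20, 50), (40, 70)]
```

Each pair of neighboring chunks shares 10 s of audio, which gives the pipeline context on both sides of a boundary when it merges the per-chunk transcripts.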

Intended use & limitations

Intended use: Modern Standard Arabic speech-to-text — lectures, news, podcasts, dictation. Used as the ASR front-end of an Arabic audio understanding pipeline (Whisper → AraBART summarization → CAMeL-BERT/FAISS semantic search).
Out of scope: code-switching with English beyond a few words, speaker diarization, regional dialects under-represented in Common Voice (Maghrebi, Khaleeji), very long form (>30 s without chunking).
Bias: Common Voice is volunteer-recorded; speaker demographics skew toward MSA-fluent contributors and may underrepresent dialects, accents, and acoustic conditions outside that distribution.

Citation

@misc{whisper-small-arabic,
  title  = {whisper-small-arabic: Fine-tuned Whisper for Arabic on Mozilla Common Voice},
  author = {{Omar10lfc}},
  year   = {2026},
  howpublished = {Hugging Face},
  note = {Fine-tune of openai/whisper-small on Mozilla Common Voice Arabic, WER 20.61.}
}

Built on top of:

@article{radford2022whisper,
  title  = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and others},
  journal= {arXiv preprint arXiv:2212.04356},
  year   = {2022}
}
