whisper-small-arabic

Fine-tuned openai/whisper-small for Arabic speech recognition.

Results

Metric     Baseline (openai/whisper-small)    Fine-tuned (this model)    Improvement
WER (↓)    42.69                              20.61                      −22.08 abs / −51.7% rel

Evaluated on 300 held-out clips from the Mozilla Common Voice Arabic test split. Baseline and fine-tuned numbers use identical decoding configuration (greedy, language=arabic, task=transcribe).
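
For reference, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal, dependency-free illustration and not the evaluation script used for this model, which is not shown in this card; the `wer` helper is hypothetical. The last lines reproduce the absolute and relative improvement from the table above.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Numbers taken from the results table.
baseline_wer, finetuned_wer = 42.69, 20.61
absolute_gain = baseline_wer - finetuned_wer
relative_gain = 100.0 * absolute_gain / baseline_wer
print(f"{absolute_gain:.2f} abs / {relative_gain:.1f} % rel")
```

Corpus-level WER is normally computed by summing errors and reference words over all clips rather than averaging per-clip scores, so a per-utterance helper like this one is only a building block.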

Training data

Mozilla Common Voice (Arabic), the crowdsourced Arabic speech corpus distributed by the Mozilla Foundation. The train, dev, and test splits were combined and shuffled with seed=42, and the first 25,000 clips were used for fine-tuning. Audio is 16 kHz mono.
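
The "combine, shuffle with a fixed seed, take the first N" step can be sketched as follows. This is a standard-library stand-in with hypothetical clip IDs, not the author's data-loading code. With the Hugging Face `datasets` library the equivalent would presumably be `concatenate_datasets([...]).shuffle(seed=42).select(range(25_000))`, though the resulting order differs between libraries.

```python
import random


def select_training_subset(clips, n_samples=25_000, seed=42):
    """Shuffle a combined pool of clips deterministically and keep the first n_samples."""
    pool = list(clips)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(pool)  # fixed seed -> same order on every run
    return pool[:n_samples]


# Toy stand-in for the combined train/dev/test splits (hypothetical clip IDs).
train = [f"train_{i}" for i in range(6)]
dev = ["dev_0", "dev_1"]
test = ["test_0", "test_1"]

subset = select_training_subset(train + dev + test, n_samples=5)
print(subset)
```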

Training procedure

Single Tesla T4 (16 GB) on Kaggle, mixed-precision fp16. Long audio is handled by Whisper's standard 30-second windowing.

Hyperparameters

Base model                      openai/whisper-small (244 M params)
Training samples                25,000 (Common Voice ar, shuffled with seed 42)
Per-device batch size           8
Gradient accumulation           4 (effective batch 32)
Optimizer                       AdamW (HF default)
Learning rate                   1e-5
LR schedule                     linear, warmup 500 steps
Weight decay                    0.01
Max steps                       4,000 (in two stages; see below)
Mixed precision                 fp16
Eval / save                     every 500 steps
Early stopping patience         3 evals
generation_max_length (eval)    80
Forced language / task          arabic / transcribe
Seed                            42
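
As a rough guide to how these values map onto a training configuration, the sketch below collects them as keyword arguments named after `transformers.Seq2SeqTrainingArguments`. That mapping is an assumption, since the training script is not part of this card. The arithmetic at the end checks the effective batch size and estimates how many passes over the 25,000 clips the full 4,000 steps amount to.

```python
# Keyword arguments mirroring the hyperparameter table; the names follow
# transformers.Seq2SeqTrainingArguments (an assumption about the training script).
training_kwargs = dict(
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    weight_decay=0.01,
    max_steps=4000,              # total across both stages
    fp16=True,
    eval_strategy="steps",       # spelled `evaluation_strategy` in older transformers releases
    eval_steps=500,
    save_steps=500,
    predict_with_generate=True,  # assumed: needed to score WER on generated text
    generation_max_length=80,
    seed=42,
)

effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
samples_seen = effective_batch * training_kwargs["max_steps"]
epochs = samples_seen / 25_000   # 25,000 training clips

print(effective_batch, samples_seen, epochs)
```

In a real script these would be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`. The early-stopping patience of 3 is typically wired in separately through `EarlyStoppingCallback(early_stopping_patience=3)`.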

Two-stage learning rate schedule

The first 2,000 steps used learning_rate = 1e-5 and converged to WER ≈ 21.3. Validation loss had plateaued, so training was resumed from the 2,000-step checkpoint at the lower learning_rate = 5e-6 for another 2,000 steps. The smaller LR gained a further 0.7 points of WER (21.31 to 20.61) and produced the published checkpoint at step 3,000.
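
To make the shape of a "linear, warmup 500 steps" schedule concrete, here is a small sketch of linear warmup followed by linear decay, the same form as the Hugging Face linear scheduler. The card does not say how total steps or warmup were set in each stage, so the 2,000-step length and 500-step warmup used for both stages below are assumptions for illustration only.

```python
def linear_warmup_decay(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed layout: each stage is its own 2,000-step run with a 500-step warmup.
for name, peak in [("stage 1", 1e-5), ("stage 2", 5e-6)]:
    lrs = [linear_warmup_decay(s, peak, 500, 2000) for s in (0, 250, 500, 1250, 2000)]
    print(name, [f"{lr:.2e}" for lr in lrs])
```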

Validation curve

Step    Training loss    Validation loss    WER (↓)
500     2.3978           0.3251             28.53
1000    1.4769           0.2624             26.93
1500    0.9937           0.2583             22.21
2000    0.7566           0.2517             21.31
2500    0.6933           0.2582             21.25
3000    0.5785           0.2596             20.61
3500    0.4998           0.2593             20.68
4000    0.4839           0.2589             20.74

The learning rate drops from 1e-5 to 5e-6 after step 2,000.

The published checkpoint is step 3,000, which has the lowest WER on the held-out validation split. Training loss kept falling after that point, but validation loss stayed flat and WER ticked back up, a sign of mild overfitting, so the step-3,000 checkpoint was kept as the final model.
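
The checkpoint choice can be reproduced from the validation curve. The sketch below picks the step with the lowest WER and counts evaluations without improvement against the patience of 3. Treating both stages as one continuous log is a simplifying assumption, and `best_checkpoint` is a hypothetical helper, not part of any library.

```python
# (step, validation WER) pairs copied from the validation curve.
eval_log = [(500, 28.53), (1000, 26.93), (1500, 22.21), (2000, 21.31),
            (2500, 21.25), (3000, 20.61), (3500, 20.68), (4000, 20.74)]


def best_checkpoint(log, patience=3):
    """Return (best_step, best_wer, evals_without_improvement, stopped_early)."""
    best_step, best_wer, stale = None, float("inf"), 0
    for step, wer in log:
        if wer < best_wer:
            best_step, best_wer, stale = step, wer, 0
        else:
            stale += 1
            if stale >= patience:
                return best_step, best_wer, stale, True
    return best_step, best_wer, stale, False


print(best_checkpoint(eval_log))  # (3000, 20.61, 2, False)
```

Only two evaluations follow the best one without improving on it, so under this reading the run ended at the 4,000-step limit rather than through early stopping.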

Quick use

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

REPO = "Omar10lfc/whisper-small-arabic"
processor = WhisperProcessor.from_pretrained(REPO)
model     = WhisperForConditionalGeneration.from_pretrained(REPO)
# Force Arabic transcription instead of relying on language auto-detection.
model.generation_config.language = "arabic"
model.generation_config.task     = "transcribe"

# Whisper expects 16 kHz mono input; librosa resamples on load.
speech, _ = librosa.load("audio.wav", sr=16000, mono=True)
# Convert the waveform to log-mel input features (padded or truncated to 30 s).
features  = processor(speech, sampling_rate=16000, return_tensors="pt").input_features
ids       = model.generate(features, max_new_tokens=440)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

For long audio (lectures, podcasts), use the HF pipeline with chunking:

from transformers import pipeline

REPO = "Omar10lfc/whisper-small-arabic"
asr = pipeline(
    "automatic-speech-recognition",
    model=REPO,
    chunk_length_s=30,           # split long audio into 30-second windows
    stride_length_s=5,           # overlap between neighbouring windows
    device=0,                    # or -1 for CPU
)
print(asr("lecture.wav", generate_kwargs={"language": "arabic", "task": "transcribe"})["text"])
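
To see what `chunk_length_s=30` and `stride_length_s=5` imply, the sketch below lists the overlapping windows for a clip of a given duration. It is a simplified model in seconds, assuming the stride applies on both sides of each chunk so that successive windows start 20 s apart. The real pipeline works on sample indices and merges the overlapping transcripts itself; `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(duration_s: float, chunk_s: float = 30.0, stride_s: float = 5.0):
    """List (start, end) windows in seconds for chunked long-form transcription."""
    step = chunk_s - 2 * stride_s  # stride assumed on both the left and right side
    if step <= 0:
        raise ValueError("chunk_s must be larger than twice stride_s")
    windows, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if start + chunk_s > duration_s:  # this window already runs past the end
            break
        start += step
    return windows


for window in chunk_windows(70.0):
    print(window)
```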

Intended use & limitations

  • Intended use: Modern Standard Arabic speech-to-text: lectures, news, podcasts, dictation. Used as the ASR front-end of an Arabic audio understanding pipeline (Whisper → AraBART summarization → CAMeL-BERT/FAISS semantic search).
  • Out of scope: code-switching with English beyond a few words, speaker diarization, regional dialects under-represented in Common Voice (Maghrebi, Khaleeji), very long form (>30 s without chunking).
  • Bias: Common Voice is volunteer-recorded; speaker demographics skew toward MSA-fluent contributors and may underrepresent dialects, accents, and acoustic conditions outside that distribution.

Citation

@misc{whisper-small-arabic,
  title  = {whisper-small-arabic: Fine-tuned Whisper for Arabic on Mozilla Common Voice},
  author = {{Omar10lfc}},
  year   = {2026},
  howpublished = {Hugging Face},
  note = {Fine-tune of openai/whisper-small on Mozilla Common Voice Arabic, WER 20.61.}
}

Built on top of:

@article{radford2022whisper,
  title  = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and others},
  journal= {arXiv preprint arXiv:2212.04356},
  year   = {2022}
}