Model Card: whisper-small-darija (DODa fine-tune)
Fine-tuned OpenAI Whisper Small for Moroccan Darija speech recognition, trained on the DODa audio dataset. It is the project's own ML component within the Multilingual Smart Airport Wayfinding Assistant (Master IT 2026).
Hub repository: Oussamawork/whisper-small-darija
Base model: openai/whisper-small
Training config: config/doda_darija.yaml · Reproduce: notebooks/finetune_whisper_colab.ipynb
Model description
Whisper has no dedicated Darija language token; Darija audio is handled under
the arabic token with task=transcribe. Out-of-the-box whisper-small
transcribes Moroccan Darija poorly (wrong script/language on many clips).
This checkpoint is fine-tuned on ~9h46m of parallel Darija speech/text from DODa
using the Arabic-script column darija_Arab_new.
- Parameters: ~244M (Whisper Small)
- Input: 16 kHz mono audio (log-Mel features, max 30 s per clip)
- Output: Arabic-script Darija transcript string
- Language token: arabic · Task: transcribe
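The input requirements above (16 kHz, mono, at most 30 s) can be checked before a clip is sent to the model. The following is a minimal sketch using only the standard-library `wave` module, so it assumes uncompressed PCM WAV input; the function name `inspect_wav` and the returned keys are illustrative and not part of this project.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # Hz, from the input spec above
MAX_CLIP_SECONDS = 30.0      # maximum clip length from the input spec above


def inspect_wav(path):
    """Read a PCM WAV header and report whether the clip matches the input spec."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / sample_rate
    return {
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": duration,
        "needs_resampling": sample_rate != TARGET_SAMPLE_RATE,
        "needs_downmix": channels != 1,
        "exceeds_window": duration > MAX_CLIP_SECONDS,
    }
```

A clip flagged with `needs_resampling` or `needs_downmix` should be converted by the audio loader first; a clip flagged with `exceeds_window` would need to be split or truncated.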
Intended use
- Speech-to-text for Moroccan Darija passenger utterances in the airport wayfinding assistant (case study: multilingual voice + text interface).
- Research / academic evaluation of Darija ASR fine-tuning on public data.
Out-of-scope / misuse
- Not validated for MSA-only, French, or English airport speech (use base Whisper or language-appropriate models).
- Not trained on real terminal noise, PA announcements, or code-mixed FR/EN/AR airport dialogue; expect lower accuracy in those conditions.
- Not a certified accessibility or safety-critical transcription system.
Training data
| Field | Value |
|---|---|
| Dataset | atlasia/DODa-audio-dataset |
| Size | 12,743 clips (~9h46m); 22 rows with empty darija_Arab_new dropped |
| Text column | darija_Arab_new (Arabic script) |
| Train/eval split | No Hub eval split; 90/10 carve from train, grouped by sentence (no transcript leakage across parallel recordings) |
| Train clips | ~11,462 |
| Eval clips | 1,259 |
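The sentence-grouped split described in the table can be sketched as follows. This illustrates the general technique, not the project's actual splitting code: the function name `grouped_split`, the exact-match grouping on the stripped transcript, and the use of `random.Random` are assumptions. Because whole sentence groups are assigned to one side, the eval share of clips is only approximately 10%, which is consistent with 1,259 of 12,721 clips.

```python
import random
from collections import defaultdict


def grouped_split(rows, text_key="darija_Arab_new", eval_fraction=0.1, seed=42):
    """Split rows so every recording of a given sentence lands on one side only."""
    # Group row indices by transcript (assumed grouping key: stripped text).
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[text_key].strip()].append(index)

    # Shuffle the unique sentences, not the clips, with a fixed seed.
    sentences = sorted(groups)
    random.Random(seed).shuffle(sentences)

    n_eval_groups = max(1, round(len(sentences) * eval_fraction))
    eval_sentences = set(sentences[:n_eval_groups])

    train_idx, eval_idx = [], []
    for sentence, indices in groups.items():
        target = eval_idx if sentence in eval_sentences else train_idx
        target.extend(indices)
    return sorted(train_idx), sorted(eval_idx)
```

Splitting on sentences rather than clips matters here because the dataset contains parallel recordings of the same text; a clip-level split would let the model see eval transcripts during training and understate WER.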
Training procedure
| Hyperparameter | Value |
|---|---|
| Base checkpoint | openai/whisper-small |
| Optimizer LR | 1e-5 |
| Warmup steps | 300 |
| Max steps | 3000 |
| Batch size (train / eval) | 16 / 8 |
| FP16 | yes |
| Gradient checkpointing | yes |
| Eval / save every | 300 steps |
| Best checkpoint | lowest eval WER (load_best_model_at_end) |
| Seed | 42 |
Hardware: Google Colab T4 (16 GB) · Wall time: ~3h 23min
cd asr_finetuning
python -m src.train --config config/doda_darija.yaml
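To put the step budget in context, the hyperparameters above can be converted into epochs and time per step. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation (neither is stated in the table), and uses the approximate train-clip count from the training-data table.

```python
max_steps = 3000
train_batch_size = 16
train_clips = 11_462               # approximate, from the training-data table
eval_every = 300
wall_time_s = 3 * 3600 + 23 * 60   # ~3h 23min

# Assumption: one device, no gradient accumulation, so one step = one batch.
samples_seen = max_steps * train_batch_size
epochs = samples_seen / train_clips
evaluations = max_steps // eval_every
# Average over the whole run, so it includes evaluation and checkpoint time.
seconds_per_step = wall_time_s / max_steps

print(f"samples seen:     {samples_seen}")
print(f"epochs:           {epochs:.2f}")
print(f"evaluations:      {evaluations}")
print(f"seconds per step: {seconds_per_step:.2f}")
```

Under these assumptions the run covers about 4.2 passes over the training clips, with 10 evaluation points at roughly 4 s per step on average.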
Evaluation
Held-out DODa eval split (1,259 samples). Metrics via jiwer (WER/CER reported
as percentages).
Headline results
| Model | WER ↓ | CER ↓ |
|---|---|---|
| Base openai/whisper-small | 108.18% | 63.76% |
| Fine-tuned (this checkpoint) | 28.75% | 9.63% |
| Δ (absolute) | −79.43 pp | −54.13 pp |
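A WER above 100%, as for the base model, is possible because insertions count as errors while the denominator is the number of reference words. The sketch below is a plain-Python stand-in for the edit-distance computation behind WER and CER; it is not jiwer's implementation, applies no text normalisation, and its function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level errors divided by reference length (reference must be non-empty)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level errors (spaces included) divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Two reference words, four unrelated hypothesis words:
# 2 substitutions + 2 insertions = 4 errors over 2 reference words.
print(word_error_rate("a b", "x y z w"))  # 2.0, i.e. 200% WER
```

This is why a model that hallucinates extra words or writes in the wrong script can score worse than an empty transcript, which would sit at exactly 100%.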
WER during training (eval every 300 steps)
| Step | WER | CER |
|---|---|---|
| 300 | 45.43% | 15.67% |
| 600 | 36.50% | 12.17% |
| 900 | 33.13% | 11.81% |
| 1200 | 31.35% | 10.60% |
| 1500 | 30.13% | 10.09% |
| 3000 (final) | 28.75% | 9.63% |
Final metrics are also written to outputs/whisper-small-doda-darija/final_metrics.txt
after training.
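The evaluation history in the table above shows diminishing returns, which can be made explicit by expressing each interval as WER points gained per 100 steps. This is a small illustrative calculation over the reported numbers only; the helper name `wer_gain_per_100_steps` is hypothetical.

```python
# (step, WER %, CER %) copied from the training-history table
history = [
    (300, 45.43, 15.67),
    (600, 36.50, 12.17),
    (900, 33.13, 11.81),
    (1200, 31.35, 10.60),
    (1500, 30.13, 10.09),
    (3000, 28.75, 9.63),
]


def wer_gain_per_100_steps(history):
    """Absolute WER reduction per 100 steps between consecutive reported evals."""
    gains = []
    for (step_a, wer_a, _), (step_b, wer_b, _) in zip(history, history[1:]):
        rate = (wer_a - wer_b) / (step_b - step_a) * 100
        gains.append((step_a, step_b, round(rate, 2)))
    return gains


for start, end, gain in wer_gain_per_100_steps(history):
    print(f"{start:>4} -> {end:<4}: {gain:.2f} pp per 100 steps")
```

The gain falls from about 3 pp per 100 steps in the first interval to under 0.1 pp per 100 steps over the last 1,500 steps, so most of the improvement arrives early in the run.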
How to use
Python (project helper)
from src.transcribe import WhisperTranscriber
stt = WhisperTranscriber("Oussamawork/whisper-small-darija") # or local path
text = stt.transcribe("clip.wav")
Transformers
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
model_id = "Oussamawork/whisper-small-darija"
processor = WhisperProcessor.from_pretrained(model_id, language="arabic", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id).to("cuda" if torch.cuda.is_available() else "cpu")
# ... feature extraction + model.generate (see src/transcribe.py)
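The elided feature-extraction and generation step could look roughly like the sketch below. It is not the contents of src/transcribe.py: the function name `transcribe_file` and the use of `librosa` for loading are assumptions, and passing `language` and `task` to `generate` requires a reasonably recent Transformers release (older releases set the decoder prompt through `processor.get_decoder_prompt_ids` instead).

```python
def transcribe_file(path, model_id="Oussamawork/whisper-small-darija"):
    """Transcribe one audio file with the fine-tuned checkpoint (illustrative sketch)."""
    # Heavy dependencies are imported lazily so importing this module stays cheap.
    import librosa  # assumption: any loader yielding 16 kHz mono float audio works
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
    model.eval()

    # Load and resample to the 16 kHz mono input the model expects.
    audio, _ = librosa.load(path, sr=16_000, mono=True)

    # Convert the waveform to log-Mel input features.
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
    input_features = inputs.input_features.to(device)

    # Decode with the Arabic language token and the transcribe task.
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features, language="arabic", task="transcribe"
        )

    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Calling `transcribe_file("clip.wav")` downloads the checkpoint on first use and returns the transcript as a string. For repeated calls, loading the processor and model once outside the function avoids reloading the weights each time.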
CLI
python -m src.transcribe --model.name Oussamawork/whisper-small-darija --audio clip.wav
Limitations
- Dialect coverage: trained on DODa only; other Maghrebi/regional variants may differ.
- Script: targets Arabic-script Darija (darija_Arab_new), not Arabizi.
- WER > 0: ~29% WER is roughly one word error per three to four reference words; usable for agent input with LLM correction, not for verbatim legal transcription.
- Language ID: Whisper reports arabic for Darija; Darija vs MSA is not disambiguated at the token level.
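Because the checkpoint targets Arabic script only, a downstream component may want a cheap check that a transcript (or a reference text used for evaluation) is actually in Arabic script rather than Arabizi or Latin-script French. The following is a hypothetical helper, not part of this project; it only inspects the basic Arabic Unicode block (U+0600 to U+06FF), and the 0.8 threshold is an arbitrary illustrative value.

```python
def arabic_script_ratio(text):
    """Fraction of alphabetic characters in the basic Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def looks_like_arabic_script(text, threshold=0.8):
    """True when most letters are Arabic-script; the threshold is illustrative."""
    return arabic_script_ratio(text) >= threshold


print(looks_like_arabic_script("سلام"))          # True: all letters are Arabic
print(looks_like_arabic_script("salam 3likom"))  # False: Latin letters only
```

A check like this cannot tell Darija from MSA, which share the same script; it only catches wrong-script output.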
License & attribution
- Whisper: OpenAI model license; see openai/whisper-small.
- DODa: see the dataset card for terms (gated dataset; research use).
- This fine-tune: academic project artifact; cite DODa and Whisper if you build on this checkpoint.
Citation
@misc{whisper-small-darija-2026,
title = {whisper-small fine-tuned on DODa for Moroccan Darija ASR},
author = {Oussamawork},
year = {2026},
url = {https://huggingface.co/Oussamawork/whisper-small-darija}
}