Model Card — `whisper-small-darija` (DODa fine-tune)

Fine-tuned OpenAI Whisper Small for Moroccan Darija speech recognition, trained on the DODa audio dataset. It is the ML component built in-house for the Multilingual Smart Airport Wayfinding Assistant (Master IT 2026).

Hub repository: Oussamawork/whisper-small-darija
Base model: openai/whisper-small
Training config: config/doda_darija.yaml · Reproduce: notebooks/finetune_whisper_colab.ipynb

Model description

Whisper has no dedicated Darija language token; Darija audio is handled under the arabic token with task=transcribe. Out-of-the-box whisper-small transcribes Moroccan Darija poorly (wrong script/language on many clips). This checkpoint is fine-tuned on ~9h46m of parallel Darija speech/text from DODa using the Arabic-script column darija_Arab_new.

Parameters: ~244M (Whisper Small)
Input: 16 kHz mono audio (log-Mel features, max 30 s per clip)
Output: Arabic-script Darija transcript string
Language token: arabic · Task: transcribe
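
The input constraints above can be checked before a clip is sent to the model. The following is a minimal sketch using only the standard-library `wave` module; `check_wav` is a hypothetical helper (it is not part of this project's `src/` package) and it only reports what would need fixing, it does not resample or split anything.

```python
import wave

TARGET_RATE = 16_000   # Hz, per the input spec above
MAX_SECONDS = 30.0     # maximum clip length per the input spec above


def check_wav(path):
    """Return a list of problems to fix before inference (empty list = OK)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        seconds = wav.getnframes() / rate

    problems = []
    if rate != TARGET_RATE:
        problems.append(f"resample {rate} Hz -> {TARGET_RATE} Hz")
    if channels != 1:
        problems.append(f"downmix {channels} channels to mono")
    if seconds > MAX_SECONDS:
        problems.append(f"split or truncate: {seconds:.1f} s > {MAX_SECONDS:.0f} s")
    return problems
```

A call such as `check_wav("clip.wav")` returns an empty list when the file is already 16 kHz mono and no longer than 30 s.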

Intended use

Speech-to-text for Moroccan Darija passenger utterances in the airport wayfinding assistant (case study: multilingual voice + text interface).
Research / academic evaluation of Darija ASR fine-tuning on public data.

Out-of-scope / misuse

Not validated for Modern Standard Arabic (MSA)-only, French, or English airport speech (use base Whisper or language-appropriate models).
Not trained on real terminal noise, PA announcements, or code-mixed FR/EN/AR airport dialogue — expect lower accuracy in those conditions.
Not a certified accessibility or safety-critical transcription system.

Training data

Field	Value
Dataset	`atlasia/DODa-audio-dataset`
Size	12,743 clips (~9h46m); 22 rows with empty `darija_Arab_new` dropped
Text column	`darija_Arab_new` (Arabic script)
Train/eval split	No Hub eval split → 90/10 carve from `train`, grouped by sentence (no transcript leakage across parallel recordings)
Train clips	~11,462
Eval clips	1,259
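
The "grouped by sentence" split in the table can be illustrated as follows. This is a sketch of the general technique rather than the project's actual splitting code: `grouped_split` is a hypothetical function, and reusing seed 42 for the shuffle is an assumption. Rows are bucketed by transcript, and whole buckets are assigned to one side, so parallel recordings of the same sentence never straddle train and eval.

```python
import random
from collections import defaultdict


def grouped_split(rows, key="darija_Arab_new", eval_fraction=0.1, seed=42):
    """Split rows so all rows sharing a transcript land on the same side."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key].strip()].append(row)

    sentences = sorted(groups)                 # fixed order before shuffling
    random.Random(seed).shuffle(sentences)     # deterministic for a given seed

    n_eval = max(1, round(len(sentences) * eval_fraction))
    evaluation = [r for s in sentences[:n_eval] for r in groups[s]]
    train = [r for s in sentences[n_eval:] for r in groups[s]]
    return train, evaluation
```

Because the 10% is taken over distinct sentences rather than clips, the clip-level ratio is only approximately 90/10 when sentences have different numbers of recordings.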

Training procedure

Hyperparameter	Value
Base checkpoint	`openai/whisper-small`
Optimizer LR	`1e-5`
Warmup steps	300
Max steps	3000
Batch size (train / eval)	16 / 8
FP16	yes
Gradient checkpointing	yes
Eval / save every	300 steps
Best checkpoint	lowest eval WER (`load_best_model_at_end`)
Seed	42
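
As a rough sanity check on the schedule, the values above imply a little over four passes through the training set. The arithmetic below assumes no gradient accumulation (`grad_accum = 1`), which this card does not state, and treats the approximate train-clip count from the training-data table as exact.

```python
import math

train_clips = 11_462   # "Train clips" row (given as approximate in the card)
batch_size = 16        # train batch size
max_steps = 3_000
eval_every = 300
grad_accum = 1         # ASSUMPTION: not stated in the card

steps_per_epoch = math.ceil(train_clips / (batch_size * grad_accum))
epochs = max_steps / steps_per_epoch
num_evals = max_steps // eval_every

print(steps_per_epoch)    # 717
print(round(epochs, 2))   # 4.18
print(num_evals)          # 10
```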

Hardware: Google Colab T4 (16 GB) · Wall time: ~3h 23min

cd asr_finetuning
python -m src.train --config config/doda_darija.yaml

Evaluation

Held-out DODa eval split (1,259 samples). Metrics via jiwer (WER/CER reported as percentages).

Headline results

Model	WER ↓	CER ↓
Base `openai/whisper-small`	108.18%	63.76%
Fine-tuned (this checkpoint)	28.75%	9.63%
Δ (absolute)	−79.43 pp	−54.13 pp
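
A base-model WER above 100% is not a reporting error: WER divides substitutions, deletions, and insertions by the number of reference words, so a hypothesis with many wrong or extra words can exceed 1.0. The sketch below implements the standard edit-distance definition in plain Python to make that concrete. It is not jiwer itself and applies no text normalization, so its numbers on real transcripts may differ from the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Three reference words, four entirely wrong hypothesis words:
# 3 substitutions + 1 insertion = 4 edits over 3 words.
print(round(wer("a b c", "x y z w") * 100, 2))  # 133.33
```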

WER and CER during training (evaluated every 300 steps; selected steps shown)

Step	WER	CER
300	45.43%	15.67%
600	36.50%	12.17%
900	33.13%	11.81%
1200	31.35%	10.60%
1500	30.13%	10.09%
3000 (final)	28.75%	9.63%
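
To read the curve above, it can help to normalize each improvement to a common 300-step interval. The snippet below does that with the table's own numbers and also picks the lowest-WER row, which is the selection rule named under "Best checkpoint". It is illustrative arithmetic only, not output from the training run.

```python
# (step, WER %, CER %) rows copied from the table above
history = [
    (300, 45.43, 15.67),
    (600, 36.50, 12.17),
    (900, 33.13, 11.81),
    (1200, 31.35, 10.60),
    (1500, 30.13, 10.09),
    (3000, 28.75, 9.63),
]

# Lowest eval WER wins
best_step, best_wer, _ = min(history, key=lambda row: row[1])

# WER improvement in percentage points per 300 training steps
gains = [
    (s1, round((w0 - w1) / (s1 - s0) * 300, 2))
    for (s0, w0, _), (s1, w1, _) in zip(history, history[1:])
]

print(best_step, best_wer)
for step, gain in gains:
    print(f"up to step {step}: {gain} pp per 300 steps")
```

The per-interval gain shrinks at every listed step, which suggests most of the benefit arrives in the first third of training.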

Final metrics are also written to outputs/whisper-small-doda-darija/final_metrics.txt after training.

How to use

Python (project helper)

from src.transcribe import WhisperTranscriber

stt = WhisperTranscriber("Oussamawork/whisper-small-darija")  # or local path
text = stt.transcribe("clip.wav")

Transformers

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Oussamawork/whisper-small-darija"
processor = WhisperProcessor.from_pretrained(model_id, language="arabic", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id).to("cuda" if torch.cuda.is_available() else "cpu")
# ... feature extraction + model.generate (see src/transcribe.py)

CLI

python -m src.transcribe --model.name Oussamawork/whisper-small-darija --audio clip.wav

Limitations

Dialect coverage: trained on DODa only; other Maghrebi/regional variants may differ.
Script: targets Arabic-script Darija (darija_Arab_new), not Arabizi.
WER > 0: ~29% WER means roughly 29 errors (substitutions, deletions, or insertions) per 100 reference words, about one every three to four words; usable for agent input with LLM correction, not for verbatim legal transcription.
Language ID: Whisper reports arabic for Darija; Darija vs MSA is not disambiguated at the token level.
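
Since the model targets Arabic script rather than Arabizi, a downstream component may want to confirm which script a piece of text is in, for example when comparing transcripts against typed passenger input. The helper below is a hypothetical sketch (`arabic_script_ratio` is not part of this project) that measures the share of letters in the basic Arabic Unicode block U+0600 to U+06FF. It cannot tell Darija from MSA.

```python
def arabic_script_ratio(text):
    """Fraction of alphabetic characters that fall in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


print(arabic_script_ratio("سلام"))    # 1.0
print(arabic_script_ratio("salam"))   # 0.0
```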

License & attribution

Whisper: OpenAI model license — see openai/whisper-small.
DODa: see the dataset card for terms (gated dataset; research use).
This fine-tune: academic project artifact; cite DODa and Whisper if you build on this checkpoint.

Citation

@misc{whisper-small-darija-2026,
  title  = {whisper-small fine-tuned on DODa for Moroccan Darija ASR},
  author = {Oussamawork},
  year   = {2026},
  url    = {https://huggingface.co/Oussamawork/whisper-small-darija}
}

