Model Card: whisper-small-darija (DODa fine-tune)

Fine-tuned OpenAI Whisper Small for Moroccan Darija speech recognition, trained on the DODa audio dataset. This checkpoint is the project's own ML component in the Multilingual Smart Airport Wayfinding Assistant (Master IT 2026).

Hub repository: Oussamawork/whisper-small-darija
Base model: openai/whisper-small
Training config: config/doda_darija.yaml · Reproduce: notebooks/finetune_whisper_colab.ipynb


Model description

Whisper has no dedicated Darija language token, so Darija audio is handled under the arabic token with task=transcribe. Out of the box, whisper-small transcribes Moroccan Darija poorly, producing the wrong script or language on many clips. This checkpoint is fine-tuned on ~9h46m of parallel Darija speech/text from DODa, using the Arabic-script column darija_Arab_new as the target text.

  • Parameters: ~244M (Whisper Small)
  • Input: 16 kHz mono audio (log-Mel features, max 30 s per clip)
  • Output: Arabic-script Darija transcript string
  • Language token: arabic · Task: transcribe
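
The input contract above (16 kHz, mono, at most 30 s) can be sketched with NumPy alone. This is an illustrative sketch, not the project's preprocessing code: the function name prepare_audio is hypothetical, and the linear-interpolation resampler is a stand-in for a proper band-limited one. In practice, WhisperProcessor handles the padding and the log-Mel step itself.

```python
import numpy as np

TARGET_SR = 16_000                    # sampling rate Whisper expects
MAX_SECONDS = 30                      # Whisper processes at most 30 s per window
MAX_SAMPLES = TARGET_SR * MAX_SECONDS


def prepare_audio(wave: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono, resample to 16 kHz, then pad or trim to 30 s.

    Hypothetical helper for illustration only.
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 2:                # (samples, channels) -> mono
        wave = wave.mean(axis=1)
    if sr != TARGET_SR:
        # Assumption: linear interpolation is good enough for a sketch;
        # a real pipeline would use a band-limited resampler.
        n_out = int(round(len(wave) * TARGET_SR / sr))
        old_t = np.arange(len(wave)) / sr
        new_t = np.arange(n_out) / TARGET_SR
        wave = np.interp(new_t, old_t, wave).astype(np.float32)
    if len(wave) >= MAX_SAMPLES:
        return wave[:MAX_SAMPLES]     # trim long clips
    return np.pad(wave, (0, MAX_SAMPLES - len(wave)))  # zero-pad short clips
```

A one-second stereo clip at 44.1 kHz comes out as a single float32 channel of 480,000 samples (30 s at 16 kHz), most of it zero padding.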

Intended use

  • Speech-to-text for Moroccan Darija passenger utterances in the airport wayfinding assistant (case study: multilingual voice + text interface).
  • Research / academic evaluation of Darija ASR fine-tuning on public data.

Out-of-scope / misuse

  • Not validated for MSA-only, French, or English airport speech (use base Whisper or language-appropriate models).
  • Not trained on real terminal noise, PA announcements, or code-mixed FR/EN/AR airport dialogue; expect lower accuracy in those conditions.
  • Not a certified accessibility or safety-critical transcription system.

Training data

| Field | Value |
| --- | --- |
| Dataset | atlasia/DODa-audio-dataset |
| Size | 12,743 clips (~9h46m); 22 rows with empty darija_Arab_new dropped |
| Text column | darija_Arab_new (Arabic script) |
| Train/eval split | No Hub eval split → 90/10 carve from train, grouped by sentence (no transcript leakage across parallel recordings) |
| Train clips | ~11,462 |
| Eval clips | 1,259 |
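
The sentence-grouped split can be sketched as follows. This is a minimal illustration of the idea, not the repository's split code: the function name grouped_split and its defaults are hypothetical, and rows are assumed to be plain dicts carrying the darija_Arab_new column.

```python
import random
from collections import defaultdict


def grouped_split(rows, text_key="darija_Arab_new", eval_fraction=0.1, seed=42):
    """Split rows so every clip sharing a transcript lands on the same side.

    Hypothetical helper; returns (train_indices, eval_indices).
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        text = (row.get(text_key) or "").strip()
        if not text:                  # drop rows with an empty transcript
            continue
        groups[text].append(idx)

    sentences = sorted(groups)        # deterministic order before shuffling
    random.Random(seed).shuffle(sentences)
    n_eval = max(1, round(len(sentences) * eval_fraction))
    eval_sentences = set(sentences[:n_eval])

    train_idx, eval_idx = [], []
    for text, idxs in groups.items():
        (eval_idx if text in eval_sentences else train_idx).extend(idxs)
    return sorted(train_idx), sorted(eval_idx)
```

Because the fraction is applied to sentences rather than clips, the eval share of clips only approximates 10% when sentences have different numbers of recordings.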

Training procedure

| Hyperparameter | Value |
| --- | --- |
| Base checkpoint | openai/whisper-small |
| Optimizer LR | 1e-5 |
| Warmup steps | 300 |
| Max steps | 3000 |
| Batch size (train / eval) | 16 / 8 |
| FP16 | yes |
| Gradient checkpointing | yes |
| Eval / save every | 300 steps |
| Best checkpoint | lowest eval WER (load_best_model_at_end) |
| Seed | 42 |

Hardware: Google Colab T4 (16 GB) · Wall time: ~3h 23min

cd asr_finetuning
python -m src.train --config config/doda_darija.yaml
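
As a rough sanity check on the schedule, the hyperparameters above translate into a little over four passes through the training set. The arithmetic below assumes a single GPU and no gradient accumulation, neither of which the card states explicitly.

```python
train_clips = 11_462   # from the training-data section
batch_size = 16        # train batch size
max_steps = 3_000
warmup_steps = 300
eval_every = 300

# Assumption: one device and gradient_accumulation_steps = 1,
# so each optimizer step consumes exactly batch_size clips.
samples_seen = batch_size * max_steps
steps_per_epoch = train_clips / batch_size
epochs = samples_seen / train_clips
warmup_share = warmup_steps / max_steps
n_evals = max_steps // eval_every

print(f"samples seen:    {samples_seen}")
print(f"steps per epoch: {steps_per_epoch:.1f}")
print(f"approx. epochs:  {epochs:.2f}")
print(f"warmup share:    {warmup_share:.0%}")
print(f"evaluations:     {n_evals}")
```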

Evaluation

Evaluated on the held-out DODa eval split (1,259 samples). Metrics are computed with jiwer; WER and CER are reported as percentages.

Headline results

| Model | WER ↓ | CER ↓ |
| --- | --- | --- |
| Base openai/whisper-small | 108.18% | 63.76% |
| Fine-tuned (this checkpoint) | 28.75% | 9.63% |
| Δ (absolute) | −79.43 pp | −54.13 pp |
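
The base model's WER above 100% is not a typo. WER divides substitutions, deletions, and insertions by the number of reference words, so a hypothesis with many extra words can exceed 100%. The pure-Python sketch below illustrates the definition. It is not the evaluation code used for this card, which relies on jiwer, and the sample strings are placeholders rather than Darija text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, kept to two rows of the DP table.
    prev = list(range(len(hyp) + 1))   # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                     # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c", "a b c"))   # identical strings
print(word_error_rate("a b", "x y z w"))   # 2 substitutions + 2 insertions over 2 words
```

The second call returns 2.0, that is 200% WER, because the four edits are divided by only two reference words.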

WER during training (eval every 300 steps)

| Step | WER | CER |
| --- | --- | --- |
| 300 | 45.43% | 15.67% |
| 600 | 36.50% | 12.17% |
| 900 | 33.13% | 11.81% |
| 1200 | 31.35% | 10.60% |
| 1500 | 30.13% | 10.09% |
| 3000 (final) | 28.75% | 9.63% |

Final metrics are also written to outputs/whisper-small-doda-darija/final_metrics.txt after training.
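
The best-checkpoint rule (lowest eval WER) and the relative improvement over the base model can be reproduced from the logged values above. This is a small illustration over the rows reported in this card, not the trainer's own selection logic, and it only covers the steps listed here.

```python
# (step, WER %, CER %) rows as reported in this card
eval_log = [
    (300, 45.43, 15.67),
    (600, 36.50, 12.17),
    (900, 33.13, 11.81),
    (1200, 31.35, 10.60),
    (1500, 30.13, 10.09),
    (3000, 28.75, 9.63),
]
base_wer = 108.18  # openai/whisper-small on the same eval split

# Pick the row with the lowest WER (lower is better).
best_step, best_wer, best_cer = min(eval_log, key=lambda row: row[1])
relative_reduction = (base_wer - best_wer) / base_wer

print(f"best step: {best_step} (WER {best_wer:.2f}%, CER {best_cer:.2f}%)")
print(f"relative WER reduction vs. base: {relative_reduction:.1%}")
```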

How to use

Python (project helper)

from src.transcribe import WhisperTranscriber

stt = WhisperTranscriber("Oussamawork/whisper-small-darija")  # or local path
text = stt.transcribe("clip.wav")

Transformers

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "Oussamawork/whisper-small-darija"
processor = WhisperProcessor.from_pretrained(model_id, language="arabic", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id).to("cuda" if torch.cuda.is_available() else "cpu")
# ... feature extraction + model.generate (see src/transcribe.py)
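
The elided feature-extraction and generation step could look roughly like the sketch below. It is not the project's src/transcribe.py. It assumes librosa is installed for audio loading and a recent transformers release whose Whisper generate() accepts language and task arguments. The function name transcribe is chosen for illustration only.

```python
SAMPLE_RATE = 16_000
MODEL_ID = "Oussamawork/whisper-small-darija"


def transcribe(path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file (illustrative sketch, single clip up to 30 s)."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import librosa
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device).eval()

    # Load as 16 kHz mono; the processor pads to 30 s and computes log-Mel features.
    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    features = processor(
        audio, sampling_rate=SAMPLE_RATE, return_tensors="pt"
    ).input_features.to(device)

    with torch.no_grad():
        # Assumption: this transformers version supports language/task here.
        ids = model.generate(features, language="arabic", task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
```

Calling transcribe("clip.wav") downloads the checkpoint on first use. Clips longer than 30 s would need chunking, which this sketch does not attempt.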

CLI

python -m src.transcribe --model.name Oussamawork/whisper-small-darija --audio clip.wav

Limitations

  • Dialect coverage: trained on DODa only; other Maghrebi/regional variants may differ.
  • Script: targets Arabic-script Darija (darija_Arab_new), not Arabizi.
  • WER > 0: at ~29% WER, roughly one word in every three or four is wrong; the output is usable as agent input with LLM correction, but not as verbatim legal transcription.
  • Language ID: Whisper reports arabic for Darija; Darija vs MSA is not disambiguated at the token level.

License & attribution

  • Whisper: OpenAI model license; see openai/whisper-small.
  • DODa: see the dataset card for terms (gated dataset; research use).
  • This fine-tune: academic project artifact; cite DODa and Whisper if you build on this checkpoint.

Citation

@misc{whisper-small-darija-2026,
  title  = {whisper-small fine-tuned on DODa for Moroccan Darija ASR},
  author = {Oussamawork},
  year   = {2026},
  url    = {https://huggingface.co/Oussamawork/whisper-small-darija}
}