Whisper Pidgin v1

LoRA adapter on openai/whisper-large-v3-turbo fine-tuned for Nigerian Pidgin English (Naija, pcm) automatic speech recognition.

Trained on ~8.6 hours of curated Pidgin audio with a single LoRA fine-tune on a free Kaggle T4 GPU. Achieves 21.37% WER / 9.90% CER on the held-out test set — an 8.2 pp absolute (28% relative) improvement over the strongest published Pidgin ASR baseline on the same data.

The adapter is 26 MB; combined with the 809M-parameter base model it runs in real time on a laptop CPU via faster-whisper.

Results

| Model | Test WER | Test CER |
|---|---|---|
| Whisper Pidgin v1 (this model) | 21.37% | 9.90% |
| Wav2Vec2-XLSR-53 (published baseline on same test set) | 29.6% | — |

Test split: 893 clips, ~1.78 hours, from asr-nigerian-pidgin/nigerian-pidgin-1.0.

How to use

Quick start — transformers + peft

import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

BASE = "openai/whisper-large-v3-turbo"
ADAPTER = "michaelodafe/whisper-pidgin-v1"

processor = WhisperProcessor.from_pretrained(BASE, language="english", task="transcribe")
base = WhisperForConditionalGeneration.from_pretrained(BASE, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload().to("cuda")
model.generation_config.language = "english"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.generation_config.suppress_tokens = []

# audio: a 16 kHz mono numpy array (see the loading note below)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", torch.float16)  # match the model's fp16 weights
out = model.generate(input_features, max_length=225)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
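
If you are starting from an audio file rather than an array, librosa is one convenient way to get the expected 16 kHz mono input (the file path below is a placeholder):

import librosa

# Decode and resample to 16 kHz mono; "clip.wav" is a placeholder path
audio, _ = librosa.load("clip.wav", sr=16000, mono=True)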

Production path — faster-whisper (4–6× faster)

The source repo includes a one-shot script that merges the adapter into the base, exports to CTranslate2 int8_float16, and runs streaming inference via faster-whisper:

git clone https://github.com/michaelodafe/Naija-Pidgin-Whisper.git
cd Naija-Pidgin-Whisper
pip install -r requirements.txt
HF_HUB_DISABLE_XET=1 python infer/01_merge_and_convert.py   # ~5 min, one-time
python infer/02_streaming_demo.py                            # live mic → transcript
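
Once converted, the CTranslate2 model loads directly with faster-whisper. A minimal sketch, assuming the conversion script writes to a local directory (the model path and audio file below are placeholders; check the repo scripts for the exact output location):

from faster_whisper import WhisperModel

# "ct2-whisper-pidgin" is a placeholder for the conversion script's output directory;
# int8 compute works on CPU (the exported int8_float16 weights target GPU)
model = WhisperModel("ct2-whisper-pidgin", device="cpu", compute_type="int8")

segments, _ = model.transcribe("clip.wav", language="en", beam_size=5)
for seg in segments:
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")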

Training data

| Source | Clips | Hours | Notes |
|---|---|---|---|
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | 4,277 | ~8.6 h | 10 native speakers, 16 kHz, CC-BY-4.0 |
| Rexe/nigerian-pidgin-speech | 73 | ~0.05 h | Eval-only; single YouTube source |

Combined and re-published as michaelodafe/pidgin-asr-combined with a unified schema:

  • train: 2,708 clips · 5.41 h
  • validation: 677 clips · 1.37 h
  • test: 893 clips · 1.78 h
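
The combined dataset loads with the datasets library; a minimal sketch, assuming the audio column is named "audio":

from datasets import Audio, load_dataset

ds = load_dataset("michaelodafe/pidgin-asr-combined")
# Decode audio at 16 kHz to match Whisper's expected input rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
print(ds)  # DatasetDict with train / validation / test splits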

Training procedure

  • Base model: openai/whisper-large-v3-turbo (809M params)
  • Method: LoRA fine-tune (PEFT)
  • LoRA config: r=32, alpha=64, target_modules=["q_proj","v_proj"], dropout=0.05
  • Trainable parameters: ~3M
  • Effective batch size: 16 (4 per device × 4 grad-accum)
  • Optimizer: AdamW, learning rate 1e-4, warmup ratio 0.05
  • Epochs: 5 (845 steps total)
  • Mixed precision: fp16
  • Hardware: Kaggle free tier, 1× NVIDIA T4 (16 GB VRAM)
  • Training time: ~3 h 47 min
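
For reference, the LoRA setup above corresponds roughly to the following PEFT configuration (a sketch, not the exact training script; the bias setting is an assumption):

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",  # assumption; not stated in the config above
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # should report roughly the ~3M figure above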

Validation trajectory:

| Step | Train loss | Val loss | Val WER | Val CER |
|---|---|---|---|---|
| 200 | 2.91 | 0.81 | 25.97% | 12.72% |
| 400 | 2.25 | 0.73 | 23.25% | 11.49% |
| 600 | 1.94 | 0.71 | 22.39% | 11.23% |
| 800 | 2.09 | 0.70 | 21.96% | 11.02% |

Test WER (21.37%) was slightly better than validation WER (21.96%), indicating clean generalization with no sign of overfitting.
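
For context, WER and CER of hypotheses against references can be computed with jiwer; shown here purely as an illustration, not necessarily the exact evaluation script behind the numbers above:

import jiwer

references = ["wetin dey happen", "how you dey"]
hypotheses = ["wetin they happen", "how you dey"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")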

Limitations and bias

  • Domain: Trained on read-style news Pidgin (BBC News Pidgin register, single-speaker recordings). Casual conversational Pidgin, shouting, music backgrounds, and group conversation will all show higher error rates.
  • Orthography: The model normalizes some Pidgin orthographic variants (hin ↔ him, kain ↔ kind, neva ↔ never). This is partially a label-inconsistency artifact in the source dataset itself; future versions could use orthography-aware metrics.
  • Code-switching: Pidgin↔Standard English mid-utterance is handled, but heavy code-switching with Yoruba / Igbo / Hausa was not in training and is likely to fail.
  • 30-second window: Audio longer than 30 seconds is silently truncated by Whisper's input encoder. For longer-form audio, segment with VAD first (see infer/02_streaming_demo.py in the source repo and the sketch after this list).
  • Speaker coverage: Training data has 10 speakers, all aged 20–28, recorded in a single accent register. Older speakers or different regional accents may underperform.
  • Number/format: The model sometimes outputs 60 000 where the reference is 60000, etc. A simple postprocess pass (in infer/decode.py) fixes most of these.
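
For long-form audio, one way to segment speech before transcription is Silero VAD; a minimal sketch (the chunking strategy and file path are assumptions, not the repo's exact logic):

import torch

# Load Silero VAD and its helper functions from torch.hub
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("long_recording.wav", sampling_rate=16000)  # placeholder path
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=16000)

# Transcribe each detected speech region separately; regions longer than 30 s
# would still need further splitting before being fed to Whisper
for ts in speech_timestamps:
    chunk = wav[ts["start"]:ts["end"]].numpy()
    # ... pass `chunk` to the transcription code from the quick-start example above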

Decode-time enhancements (Path A)

The source repo ships a decode.py helper that adds two zero-cost enhancements:

  • initial_prompt hotwords — a Pidgin-style context sentence listing common Nigerian proper nouns and Pidgin function words. Biases the decoder toward correct vocabulary, especially proper nouns.
  • Postprocess — strips punctuation the labels don't use, merges digit groups, drops in-number commas.

Together these provide an additional ~1–2 pp WER improvement over the 21.37% reported above, at zero inference cost. Both are already enabled in the streaming demo and the HF Inference Endpoint handler.
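
A rough illustration of both ideas on the faster-whisper path (the prompt text, regexes, and model path here are illustrative, not the exact contents of infer/decode.py):

import re
from faster_whisper import WhisperModel

# Illustrative hotword prompt; the real one lists more Nigerian proper nouns and Pidgin function words
INITIAL_PROMPT = "Dis na Naija Pidgin tori about Lagos, Abuja, NEPA, wahala, wetin, dey, una."

def postprocess(text: str) -> str:
    text = re.sub(r"(\d)[ ,](?=\d{3}\b)", r"\1", text)  # "60 000" / "60,000" -> "60000"
    text = re.sub(r"[^\w\s']", "", text)                 # strip punctuation the labels don't use
    return " ".join(text.split())

model = WhisperModel("ct2-whisper-pidgin", device="cpu", compute_type="int8")  # placeholder path
segments, _ = model.transcribe("clip.wav", language="en", initial_prompt=INITIAL_PROMPT)
print(" ".join(postprocess(seg.text) for seg in segments))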

License and attribution

  • Code & adapter weights: MIT
  • Base model (openai/whisper-large-v3-turbo): MIT (Whisper)
  • Training data (asr-nigerian-pidgin/nigerian-pidgin-1.0): CC-BY-4.0 — attribution to the original dataset authors is required for any downstream use.

If you use this model in research or a product, please cite:

Citation

@misc{odafe2026pidginwhisper,
  title  = {Whisper Pidgin v1: Nigerian Pidgin English Speech-to-Text},
  author = {Odafe, Michael},
  year   = {2026},
  url    = {https://huggingface.co/michaelodafe/whisper-pidgin-v1},
  note   = {LoRA fine-tune of openai/whisper-large-v3-turbo}
}

Acknowledgments

  • The asr-nigerian-pidgin/nigerian-pidgin-1.0 dataset team for releasing the only sizeable open Pidgin ASR corpus.
  • OpenAI for Whisper.
  • HuggingFace for hosting and the transformers / datasets / peft libraries.
  • SYSTRAN for faster-whisper and CTranslate2.
  • The Silero team for VAD.
  • Kaggle for free GPU compute.