whisper-distant-voices

This model is a fine-tuned version of openai/whisper-small on the google/fleurs dataset.

Model description

whisper-distant-voices targets multilingual transcription of Swahili (sw), English (en), French (fr), and Arabic (ar). It is designed for community voice transcription in contexts such as elections, crisis response, and civic reporting across East Africa and the broader Global South.

Try it live: katoernest/distant-voices-transcription

Intended uses & limitations

Intended uses:

  • Transcribing audio from community reporters and field workers
  • Supporting multilingual voice input in civic tech applications
  • Automatic speech recognition across Swahili, English, French, and Arabic
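
A minimal sketch of how these uses might look in code, assuming the checkpoint is published on the Hub as katoernest/whisper-distant-voices and that the Transformers automatic-speech-recognition pipeline is used. The helper names `generation_kwargs` and `transcribe` are hypothetical and exist only for this illustration.

```python
# Assumed Hub id for this checkpoint.
MODEL_ID = "katoernest/whisper-distant-voices"


def generation_kwargs(language=None):
    """Build generate() options; language=None lets Whisper detect the language."""
    kwargs = {"task": "transcribe"}
    if language is not None:
        kwargs["language"] = language  # e.g. "sw", "en", "fr", "ar"
    return kwargs


def transcribe(audio_path, language=None):
    """Transcribe one audio file (downloads the model on first use)."""
    from transformers import pipeline  # lazy import: heavy dependency

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=generation_kwargs(language))
    return result["text"]
```

Calling `transcribe("report.wav")` leaves language detection to the model, while `transcribe("report.wav", language="sw")` pins the output to Swahili; the file name is a placeholder.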

Limitations:

  • Fine-tuned on only 500 samples per language, so accuracy may be lower than that of the base whisper-small on formal speech
  • May struggle with heavy accents, background noise, or code-switching between languages mid-sentence
  • Training was limited to 10 steps; a full run (2000 steps) is expected to improve performance significantly
  • Not suitable for medical, legal, or safety-critical transcription without further evaluation

Training and evaluation data

Fine-tuned on the google/fleurs dataset:

  • sw_ke: 500 samples (Swahili, Kenya)
  • en_us: 500 samples (English, US)
  • fr_fr: 500 samples (French, France)
  • ar_eg: 500 samples (Arabic, Egypt)

Total: ~2,000 samples, 90/10 train/eval split, shuffled with seed 42.
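
The arithmetic behind these totals can be sketched as follows, assuming exactly 500 samples were kept per FLEURS config and that the 10% eval share is taken from the combined pool. The name `split_sizes` is hypothetical.

```python
# Per-config sample counts as listed above.
SAMPLES_PER_CONFIG = {"sw_ke": 500, "en_us": 500, "fr_fr": 500, "ar_eg": 500}
EVAL_PERCENT = 10  # 90/10 train/eval split


def split_sizes(total, eval_percent=EVAL_PERCENT):
    """Return (n_train, n_eval) using integer arithmetic."""
    n_eval = total * eval_percent // 100
    return total - n_eval, n_eval


total = sum(SAMPLES_PER_CONFIG.values())
n_train, n_eval = split_sizes(total)
print(f"total={total} train={n_train} eval={n_eval}")
```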

Training procedure

The model was fine-tuned using the Hugging Face Seq2SeqTrainer on Google Colab with a T4 GPU.

Data preparation

  • Audio resampled to 16 kHz and converted to log-mel spectrograms using WhisperProcessor
  • Transcriptions tokenized with a max length of 448 tokens
  • Dataset shuffled with seed 42 and split 90/10 train/eval
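
The preparation steps above could be sketched per example as shown here. This is an illustration under stated assumptions, not the exact training script: it assumes FLEURS-style rows with `audio` and `transcription` columns, and the helper names `load_processor` and `prepare_example` are hypothetical.

```python
SAMPLING_RATE = 16_000   # Hz, as stated above
MAX_LABEL_TOKENS = 448   # label length cap, as stated above


def load_processor(checkpoint="openai/whisper-small"):
    """Load the feature extractor and tokenizer pair (downloads on first use)."""
    from transformers import WhisperProcessor  # lazy import: heavy dependency

    return WhisperProcessor.from_pretrained(checkpoint)


def prepare_example(example, processor):
    """Turn one row into log-mel input features and label token ids."""
    audio = example["audio"]
    if audio["sampling_rate"] != SAMPLING_RATE:
        # Resampling is expected to happen upstream, for instance by casting
        # the audio column to 16 kHz with the datasets library.
        raise ValueError("audio must be resampled to 16 kHz first")
    features = processor.feature_extractor(
        audio["array"], sampling_rate=SAMPLING_RATE
    )
    labels = processor.tokenizer(
        example["transcription"], truncation=True, max_length=MAX_LABEL_TOKENS
    )
    return {
        "input_features": features.input_features[0],
        "labels": labels.input_ids,
    }
```

A function of this shape would typically be applied to every row with the dataset's `map` method before the shuffle and split.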

Multilingual setup

No language token was forced during training, so the model predicts the language of each sample itself across all four languages (Swahili, English, French, Arabic).

Compute

  • Hardware: NVIDIA T4 GPU (Google Colab)
  • Training time: ~45 minutes (10-step prototype run)
  • Framework: Transformers 5.0.0, PyTorch 2.11.0+cu128

Intended next steps

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 200
  • training_steps: 10
  • mixed_precision_training: Native AMP
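
To make the interaction between these values concrete, the sketch below recomputes the effective batch size and the learning rate that a linear schedule with warmup would produce. It assumes the warmup of 200 steps was in effect for the 10-step run, which would mean the run ended early in warmup and never approached the nominal 1e-05. The mapping to `Seq2SeqTrainingArguments` is also an assumption (for example, `fp16=True` for Native AMP on a T4), and all helper names are hypothetical.

```python
LEARNING_RATE = 1e-5
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 2
WARMUP_STEPS = 200
MAX_STEPS = 10


def effective_batch_size(per_device=PER_DEVICE_BATCH, accum=GRAD_ACCUM, n_devices=1):
    """Samples contributing to one optimizer step."""
    return per_device * accum * n_devices


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from zero, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


def build_training_args(output_dir="whisper-distant-voices"):
    """One possible mapping of the listed values onto Transformers arguments."""
    from transformers import Seq2SeqTrainingArguments  # lazy import

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,  # placeholder directory name
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=PER_DEVICE_BATCH,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=GRAD_ACCUM,
        lr_scheduler_type="linear",
        warmup_steps=WARMUP_STEPS,
        max_steps=MAX_STEPS,
        optim="adamw_torch_fused",
        fp16=True,  # assumption: Native AMP on a T4 means fp16
        seed=42,
    )


samples_seen = effective_batch_size() * MAX_STEPS
lr_at_end = linear_schedule_lr(MAX_STEPS, LEARNING_RATE, WARMUP_STEPS, MAX_STEPS)
print(f"samples seen: {samples_seen}, LR after {MAX_STEPS} steps: {lr_at_end:.1e}")
```

Under these assumptions the prototype saw 160 samples and its learning rate stayed at or below roughly 5% of the nominal value, which is consistent with treating this checkpoint as a pipeline check.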

Training results

Training Steps | Training Loss | Validation Loss
10             | -             | -

Note: This checkpoint is a prototype run (10 steps) used to validate the training pipeline end-to-end. Loss values were not recorded before the run completed. A full training run of 2000 steps is in progress and results will be updated here upon completion.

What to expect after full training:

  • Training loss should decrease steadily from ~3.0 toward ~0.5–1.0
  • Validation WER (Word Error Rate) target: below 30% across all four languages
  • Evaluation checkpoints saved every 500 steps
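
For reference, the WER target above can be checked with a word-level edit distance. The card does not say which tool is used for evaluation, so the following is a standard-library sketch of the usual definition (substitutions, deletions, and insertions divided by the number of reference words), with whitespace tokenization as a simplifying assumption. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


WER_TARGET = 0.30  # "below 30%" as stated above

score = word_error_rate("a b c d", "a x c")
print(score, score < WER_TARGET)
```

In this toy case one substitution and one deletion against four reference words give a WER of 0.5, which would miss the target.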

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.11.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2