How to use katoernest/whisper-distant-voices with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="katoernest/whisper-distant-voices")

# Load model directly
# AutoModelForSpeechSeq2Seq is the auto class that resolves Whisper checkpoints.
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("katoernest/whisper-distant-voices")
model = AutoModelForSpeechSeq2Seq.from_pretrained("katoernest/whisper-distant-voices")
```
whisper-distant-voices
This model is a fine-tuned version of openai/whisper-small on the google/fleurs dataset.
Model description
whisper-distant-voices is a fine-tuned version of openai/whisper-small for multilingual transcription of Swahili (sw), English (en), French (fr), and Arabic (ar). It is designed for community voice transcription in contexts such as elections, crisis response, and civic reporting across East Africa and the broader Global South.
Try it live: katoernest/distant-voices-transcription
Intended uses & limitations
Intended uses:
- Transcribing audio from community reporters and field workers
- Supporting multilingual voice input in civic tech applications
- Automatic speech recognition across Swahili, English, French, and Arabic
Limitations:
- Fine-tuned on only 500 samples per language, so accuracy is expected to be lower than the base whisper-small on formal speech
- May struggle with heavy accents, background noise, or code-switching between languages mid-sentence
- Training was limited to 10 steps; a full run (2000 steps) is expected to improve performance significantly
- Not suitable for medical, legal, or safety-critical transcription without further evaluation
Training and evaluation data
Fine-tuned on the google/fleurs dataset:
- sw_ke: 500 samples (Swahili, Kenya)
- en_us: 500 samples (English, US)
- fr_fr: 500 samples (French, France)
- ar_eg: 500 samples (Arabic, Egypt)
Total: ~2,000 samples, 90/10 train/eval split, shuffled with seed 42.
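As a rough check on these numbers, the sketch below reproduces the sample counts and the 90/10 split sizes using only the standard library. It is a stand-in, not the code used for this model: the card does not say which function performed the shuffle and split, and `split_indices` is a hypothetical helper, so the assignment of individual samples to each split will differ from the original run.

```python
import random

# Per-language sample counts as listed in this card.
SAMPLES_PER_LANGUAGE = {"sw_ke": 500, "en_us": 500, "fr_fr": 500, "ar_eg": 500}


def split_indices(total, eval_fraction=0.1, seed=42):
    """Shuffle indices 0..total-1 with a fixed seed and split them train/eval.

    Hypothetical helper: illustrates a seeded 90/10 split, not the exact
    procedure used to build this model's training data.
    """
    indices = list(range(total))
    random.Random(seed).shuffle(indices)
    n_eval = round(total * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


total_samples = sum(SAMPLES_PER_LANGUAGE.values())
train_idx, eval_idx = split_indices(total_samples)
print(total_samples, len(train_idx), len(eval_idx))
```

If all 2,000 samples are kept, this gives 1,800 training and 200 evaluation examples.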
Training procedure
The model was fine-tuned using the Hugging Face Seq2SeqTrainer on Google Colab with a T4 GPU.
Data preparation
- Audio resampled to 16 kHz and converted to log-mel spectrograms using WhisperProcessor
- Transcriptions tokenized with a max length of 448 tokens
- Dataset shuffled with seed 42 and split 90/10 train/eval
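The feature-extraction and tokenization steps above can be sketched as a single function. This is a minimal sketch, not the author's script. `prepare_example` and its argument names are hypothetical, and the audio is assumed to be already resampled to 16 kHz. `processor` is assumed to be a `WhisperProcessor`, which exposes a `feature_extractor` and a `tokenizer`. Truncating long transcriptions is also an assumption; the card only states the maximum length.

```python
MAX_LABEL_LENGTH = 448  # maximum transcription length in tokens, per this card
TARGET_SAMPLING_RATE = 16_000  # sampling rate stated in this card


def prepare_example(waveform, sampling_rate, transcription, processor):
    """Turn one (audio, text) pair into model inputs and label ids.

    Hypothetical helper. `processor` is assumed to be a transformers
    WhisperProcessor; `waveform` is a 1-D sequence of audio samples.
    """
    if sampling_rate != TARGET_SAMPLING_RATE:
        raise ValueError(f"expected 16 kHz audio, got {sampling_rate} Hz")

    # Log-mel spectrogram features computed by the Whisper feature extractor.
    features = processor.feature_extractor(
        waveform, sampling_rate=sampling_rate
    ).input_features[0]

    # Token ids for the transcription. Truncation at the maximum length is an
    # assumption; the card does not say how longer transcriptions were handled.
    labels = processor.tokenizer(
        transcription, max_length=MAX_LABEL_LENGTH, truncation=True
    ).input_ids

    return {"input_features": features, "labels": labels}
```

Taking the processor as an argument keeps the function independent of how the dataset and processor were loaded, which the card does not specify.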
Multilingual setup
No language token was forced during training. The model learns to predict the language of each sample across all four languages (Swahili, English, French, Arabic).
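At inference time the same choice applies: language detection can be left to the model, or the language can be pinned when it is known in advance. The sketch below assumes the Transformers speech-recognition pipeline forwards `generate_kwargs` to Whisper's `generate`, which accepts `language` and `task`. `build_generate_kwargs`, `transcribe`, and the audio file name are hypothetical.

```python
# Languages covered by this model, keyed by the codes used in this card.
SUPPORTED_LANGUAGES = {
    "sw": "swahili",
    "en": "english",
    "fr": "french",
    "ar": "arabic",
}


def build_generate_kwargs(language_code=None):
    """Return generation options; an empty dict leaves detection to the model."""
    if language_code is None:
        return {}
    if language_code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language_code!r}")
    return {"task": "transcribe", "language": SUPPORTED_LANGUAGES[language_code]}


def transcribe(audio_path, language_code=None):
    """Transcribe one audio file (hypothetical helper, downloads the model)."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="katoernest/whisper-distant-voices",
    )
    result = pipe(audio_path, generate_kwargs=build_generate_kwargs(language_code))
    return result["text"]


# Example calls (the file name is a placeholder):
# transcribe("field_report.wav")        # language predicted by the model
# transcribe("field_report.wav", "sw")  # language pinned to Swahili
```

Pinning the language may help on short or noisy clips where automatic detection is unreliable, though that has not been evaluated for this checkpoint.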
Compute
- Hardware: NVIDIA T4 GPU (Google Colab)
- Training time: ~45 minutes (10-step prototype run)
- Framework: Transformers 5.0.0, PyTorch 2.11.0+cu128
Intended next steps
- Full training run of 2000 steps on expanded data
- Evaluation on held-out community audio from East Africa
- Integration into the live demo Space: katoernest/distant-voices-transcription
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 200
- training_steps: 10
- mixed_precision_training: Native AMP
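These settings have one consequence that is easy to miss: with 200 warmup steps and only 10 training steps, the run ends while the learning rate is still warming up. The sketch below assumes the usual linear-warmup rule, where the learning rate is scaled by `step / warmup_steps` during warmup. `warmup_lr` is a hypothetical helper, not a library function, and it ignores the decay phase because this run never reaches it.

```python
# Values taken from the hyperparameter list in this card.
LEARNING_RATE = 1e-5
WARMUP_STEPS = 200
TRAINING_STEPS = 10
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2


def warmup_lr(step, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Learning rate at optimizer step `step` (0-indexed) under linear warmup.

    Hypothetical helper: assumes lr = base_lr * step / warmup_steps during
    warmup and does not model the decay that follows it.
    """
    return base_lr * min(step, warmup_steps) / warmup_steps


effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
samples_seen = effective_batch * TRAINING_STEPS
final_step_lr = warmup_lr(TRAINING_STEPS - 1)

print(effective_batch, samples_seen)
print(f"{final_step_lr:.2e}")
```

Under these assumptions the effective batch size is 16, the run sees 160 samples (less than a tenth of the roughly 1,800 training samples), and the learning rate at the last update is about 4.5e-7, well below the nominal 1e-05.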
Training results
| Training Steps | Training Loss | Validation Loss |
|---|---|---|
| 10 | - | - |
Note: This checkpoint is a prototype run (10 steps) used to validate the training pipeline end-to-end. Loss values were not recorded before the run completed. A full training run of 2000 steps is in progress and results will be updated here upon completion.
What to expect after full training:
- Training loss should decrease steadily from ~3.0 toward ~0.5–1.0
- Validation WER (Word Error Rate) target: below 30% across all four languages
- Evaluation checkpoints saved every 500 steps
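For reference, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The function below is a dependency-free sketch of that definition. It is not the evaluation code used for this model, and it applies no text normalization (casing, punctuation), which a real evaluation would likely need.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from output
            insertion = current[j - 1] + 1  # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)
```

A target of below 30% corresponds to `word_error_rate(reference, hypothesis) < 0.30`, typically computed over the whole evaluation set rather than per utterance.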
Framework versions
- Transformers 5.0.0
- PyTorch 2.11.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2