Whisper Small — Krio Speech-to-Text Model
Fine-tuned OpenAI Whisper Small for high-accuracy Krio language automatic speech recognition (ASR).
Overview
This model transcribes Krio speech with a 4.78% word error rate (WER) and a 2.12% character error rate (CER) on a held-out test set of 341 clips. It was fine-tuned on a 30-hour Krio voice dataset and is optimized for clean, conversational Krio speech.
Krio is the lingua franca of Sierra Leone, spoken by millions. To our knowledge, this is the first publicly available Whisper model fine-tuned specifically for Krio ASR.
Performance
| Metric | Score |
|---|---|
| Test WER | 4.78% |
| Test CER | 2.12% |
| Test Loss | 0.0906 |
| Training Samples | 6,126 |
| Test Samples | 341 |
| Base Model | openai/whisper-small (244M params) |
Model Details
- Architecture: Encoder-decoder transformer (Whisper)
- Training Data: MosesJoshuaCoker/30_hours_krio_voice
- Language: Krio (kri)
- Sample Rate: 16 kHz
- Max Audio Length: 30 seconds
- Training Duration: ~5 hours on Tesla T4 GPU
- Training Epochs: 8
- Batch Size: 16 per step with 2 gradient accumulation steps (effective 32, which matches 192 steps/epoch on 6,126 training samples)
- Learning Rate: 1e-5 with linear warmup + decay
- Regularization: SpecAugment (mask_time_prob=0.05, mask_feature_prob=0.05)
Usage
Quick Start
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="MosesJoshuaCoker/novax-krio-v3",
)
result = pipe("krio_speech.wav")
print(result["text"])
Advanced: Direct Model & Processor
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
# Load audio (16 kHz mono)
audio, sr = librosa.load("krio_speech.wav", sr=16000)
# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
Batch Processing
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
# Process multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audios = [librosa.load(f, sr=16000)[0] for f in audio_files]
# Zero-pad every clip to the length of the longest one
# (the Whisper feature extractor then pads or truncates each clip to 30 s)
max_len = max(len(a) for a in audios)
padded = [librosa.util.fix_length(a, size=max_len) for a in audios]
inputs = processor(
    padded,
    sampling_rate=16000,
    return_tensors="pt"
)
# Decode the whole batch without tracking gradients
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
transcriptions = processor.batch_decode(predicted_ids, skip_special_tokens=True)
for audio_file, transcript in zip(audio_files, transcriptions):
    print(f"{audio_file}: {transcript}")
Training Details
Data Preparation
- Total samples: 6,807 (30 hours of audio)
- Train/Val/Test split: 90% / 5% / 5%
- Train samples: 6,126
- Val samples: 340
- Test samples: 341
- Audio filtering: Removed clips >30 seconds (Whisper hard limit)
- Preprocessing: 16 kHz resampling, log-mel feature extraction
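The split sizes above are consistent with truncating the 90% and 5% shares of 6,807 clips and giving the remainder to the test set. A minimal sketch of that arithmetic and of the 30-second filter follows; the function names and the `(path, duration_seconds)` clip representation are hypothetical and not taken from the actual training code.

```python
def split_counts(total, train_frac=0.90, val_frac=0.05):
    """Return (train, val, test) sizes; the test set takes the remainder."""
    train = int(total * train_frac)
    val = int(total * val_frac)
    return train, val, total - train - val


def drop_long_clips(clips, max_seconds=30.0):
    """Keep only (path, duration_seconds) pairs at or under the limit."""
    return [(path, dur) for path, dur in clips if dur <= max_seconds]


print(split_counts(6807))  # (6126, 340, 341)
```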
Training Configuration
- Optimizer: AdamW with 1e-5 learning rate
- Warmup steps: 100
- Total steps: 1,536 (8 epochs × 192 steps/epoch)
- Gradient accumulation: 2 steps
- Mixed precision: FP16 on Tesla T4
- Gradient checkpointing: Enabled for memory efficiency
- Regularization: SpecAugment on encoder
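The step count and learning-rate schedule can be checked with a few lines of arithmetic. The sketch below assumes an effective batch of 32 (16 per step times 2 accumulation steps), since that is what reproduces 192 steps per epoch, and assumes the decay runs linearly to zero at the final step; neither detail is confirmed by the training code itself.

```python
import math

BASE_LR = 1e-5
WARMUP_STEPS = 100
EPOCHS = 8
TRAIN_SAMPLES = 6126
EFFECTIVE_BATCH = 32  # assumption: 16 per step x 2 accumulation steps

steps_per_epoch = math.ceil(TRAIN_SAMPLES / EFFECTIVE_BATCH)
total_steps = EPOCHS * steps_per_epoch


def learning_rate(step):
    """Linear warmup to BASE_LR, then linear decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / (total_steps - WARMUP_STEPS)


print(steps_per_epoch, total_steps)
for step in (0, 50, 100, total_steps):
    print(step, learning_rate(step))
```

With only 100 warmup steps out of 1,536, the model spends roughly 93% of training on the decaying part of the schedule.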
Data Augmentation
- SpecAugment: Time masking (5%) + frequency masking (5%)
- Normalization: Krio-safe (lowercase + whitespace collapse only)
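The normalization step is small enough to sketch directly. The version below does only what the bullet above describes (lowercasing and whitespace collapsing), presumably so that Krio letters such as ɛ and ɔ reach the scorer unchanged; the function name `normalize_krio` is hypothetical.

```python
import re


def normalize_krio(text):
    """Lowercase and collapse runs of whitespace; leave every other character untouched."""
    return re.sub(r"\s+", " ", text.lower()).strip()


print(normalize_krio("  Kɛnayt   DƐN\tfɔ "))
```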
Limitations
- Audio quality: Trained on clean conversational speech; may struggle with whispered or very noisy audio
- Audio length: Handles clips up to 30 seconds; longer clips are silently truncated
- Phonetics: Optimized for standard Krio phonetics; strong accents may reduce accuracy
- Domain: Trained on general Krio speech; specialized domains (music, code-switching) not extensively tested
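One way to work around the 30-second limit is to split long recordings into windows before transcription and join the resulting text. The helper below is a minimal sketch with a hypothetical name; it cuts at fixed offsets, so it can split a word at a window boundary. The Transformers ASR pipeline also accepts a `chunk_length_s` argument that handles long audio with overlapping chunks, which is likely the more robust option.

```python
def split_into_windows(samples, sample_rate=16000, max_seconds=30):
    """Split a 1-D sample sequence into consecutive windows of at most max_seconds."""
    window = sample_rate * max_seconds
    return [samples[i:i + window] for i in range(0, len(samples), window)]


# A 70-second clip at 16 kHz becomes windows of 30 s, 30 s and 10 s.
windows = split_into_windows([0.0] * (16000 * 70))
print([len(w) / 16000 for w in windows])
```

Each window can then be passed to the processor and model exactly as in the usage examples.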
Examples
Sample Predictions
Sample 1 (perfect match):
- Reference: kɛnayt dɛn fɔ muf kɔmɔt to di amalɛkayt dɛn mek i nɔ dɔnawe wit dɛn ɛn di amalɛkayt dɛn togɛda...
- Prediction: kɛnayt dɛn fɔ muf kɔmɔt to di amalɛkayt dɛn mek i nɔ dɔnawe wit dɛn ɛn di amalɛkayt dɛn togɛda...
- WER: 0%
Sample 2 (two misspelled words):
- Reference: asaia na agia in bɔypikin di myuzishian dɛn na di tɛmpul
- Prediction: asia na agia in bɔypikin di myuzishan dɛn na di tɛmpul
- WER: ~18% (2 substitutions in 11 words)
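Per-sample WER is the word-level edit distance between reference and prediction, divided by the number of reference words. The sketch below is a plain-Python implementation for checking individual pairs such as Sample 2; it is not the evaluation script used for the reported scores, and it assumes a non-empty reference.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), prediction.split()
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "asaia na agia in bɔypikin di myuzishian dɛn na di tɛmpul"
prediction = "asia na agia in bɔypikin di myuzishan dɛn na di tɛmpul"
print(f"{word_error_rate(reference, prediction):.1%}")
```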
Comparison with Baseline
This fine-tuned model significantly outperforms the vanilla Whisper Small model:
| Model | WER | CER |
|---|---|---|
| Whisper Small (baseline) | ~50% | ~30% |
| Whisper Small + Fine-tuning (this model) | 4.78% | 2.12% |
| Improvement (relative) | ~90% reduction | ~93% reduction |
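The improvement row is a relative reduction, computed as (baseline - fine-tuned) / baseline. Because the baseline figures are approximate, the result should be read as approximate too. A small sketch, with a hypothetical function name:

```python
def relative_reduction(baseline, fine_tuned):
    """Percentage drop in error rate relative to the baseline."""
    return 100.0 * (baseline - fine_tuned) / baseline


# Baseline values are the approximate figures from the table above.
print(round(relative_reduction(50.0, 4.78), 1))  # WER
print(round(relative_reduction(30.0, 2.12), 1))  # CER
```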
Citation
If you use this model in your research or product, please cite:
@misc{novax_krio_v3_2024,
  title={Whisper Small — Krio Speech-to-Text (novax-krio-v3)},
  author={MosesJoshuaCoker},
  year={2024},
  publisher={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/MosesJoshuaCoker/novax-krio-v3}},
  note={Fine-tuned on MosesJoshuaCoker/30\_hours\_krio\_voice}
}
License
This model is licensed under the Apache License 2.0, same as the base Whisper model. See the LICENSE file for details.
Acknowledgments
- Base model: OpenAI's Whisper
- Training data: MosesJoshuaCoker/30_hours_krio_voice
- Framework: Hugging Face Transformers
Contact & Support
For questions, issues, or suggestions:
- Open an issue on the model page
- Check existing discussions for common troubleshooting
Last Updated: January 2025
Base Model Version: openai/whisper-small
Training Framework: Hugging Face Transformers 4.40+
Model ID: MosesJoshuaCoker/novax-krio-v3