Whisper Small — Krio Speech-to-Text Model

Fine-tuned OpenAI Whisper Small for high-accuracy Krio language automatic speech recognition (ASR).

Overview

This model transcribes Krio speech with 4.78% Word Error Rate (WER) and 2.12% Character Error Rate (CER) on held-out test data. It was trained on 30 hours of Krio voice data and is optimized for clean, conversational Krio speech.

Krio is the lingua franca of Sierra Leone, spoken by millions of people as a first or second language. To our knowledge, this is the first publicly available Whisper model fine-tuned specifically for Krio ASR.

Performance

Metric | Score
--- | ---
Test WER | 4.78%
Test CER | 2.12%
Test Loss | 0.0906
Training Samples | 6,126
Test Samples | 341
Base Model | openai/whisper-small (244M params)

Model Details

  • Architecture: Encoder-decoder transformer (Whisper)
  • Training Data: MosesJoshuaCoker/30_hours_krio_voice
  • Language: Krio (kri)
  • Sample Rate: 16 kHz
  • Max Audio Length: 30 seconds
  • Training Duration: ~5 hours on Tesla T4 GPU
  • Training Epochs: 8
  • Batch Size: 16 (effective, with gradient accumulation)
  • Learning Rate: 1e-5 with linear warmup + decay
  • Regularization: SpecAugment (mask_time_prob=0.05, mask_feature_prob=0.05)
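
The learning-rate schedule listed above can be sketched in a few lines. This is a minimal illustration, assuming the usual linear-warmup, linear-decay shape (ramp from 0 to the peak, then fall linearly to 0 at the last step) and using the 100 warmup steps and 1,536 total steps reported under Training Configuration. The function name `lr_at_step` is hypothetical and not taken from the training code.

```python
PEAK_LR = 1e-5       # peak learning rate from the model card
WARMUP_STEPS = 100   # warmup steps from the training configuration
TOTAL_STEPS = 1536   # 8 epochs x 192 steps/epoch


def lr_at_step(step: int) -> float:
    """Learning rate at an optimizer step (assumed linear warmup + linear decay)."""
    if step < WARMUP_STEPS:
        # Ramp linearly from 0 up to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from the peak down to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


if __name__ == "__main__":
    for step in (0, 50, 100, 818, 1536):
        print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions the rate reaches 1e-5 at step 100 and is back to half of that at step 818, the midpoint of the decay phase.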

Usage

Quick Start

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", 
                model="MosesJoshuaCoker/novax-krio-v3")

result = pipe("krio_speech.wav")
print(result["text"])

Advanced: Direct Model & Processor

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/novax-krio-v3")

# Load audio (16 kHz mono)
audio, sr = librosa.load("krio_speech.wav", sr=16000)

# Transcribe
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Batch Processing

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("MosesJoshuaCoker/novax-krio-v3")
processor = WhisperProcessor.from_pretrained("MosesJoshuaCoker/novax-krio-v3")

# Process multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
audios = [librosa.load(f, sr=16000)[0] for f in audio_files]

# The Whisper feature extractor pads or truncates every clip to 30 seconds,
# so the raw arrays can be passed as a list without manual padding.
inputs = processor(
    audios,
    sampling_rate=16000,
    return_tensors="pt"
)

with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)

transcriptions = processor.batch_decode(predicted_ids, skip_special_tokens=True)
for audio_file, transcript in zip(audio_files, transcriptions):
    print(f"{audio_file}: {transcript}")

Training Details

Data Preparation

  • Total samples: 6,807 (30 hours of audio)
  • Train/Val/Test split: 90% / 5% / 5%
  • Train samples: 6,126
  • Val samples: 340
  • Test samples: 341
  • Audio filtering: Removed clips >30 seconds (Whisper hard limit)
  • Preprocessing: 16 kHz resampling, log-mel feature extraction
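
The split sizes and the length filter above can be reproduced with simple arithmetic. The sketch below is illustrative only: the exact splitting procedure is not stated, so the rounding used here (floor for train, remainder halved between validation and test) is an assumption that happens to reproduce the reported counts. The helper names `split_sizes` and `keep_short_clips` are hypothetical.

```python
def split_sizes(n_total: int, train_frac: float = 0.90) -> tuple[int, int, int]:
    """Return (train, val, test) sizes; val and test share the remainder."""
    n_train = int(n_total * train_frac)
    n_val = (n_total - n_train) // 2
    n_test = n_total - n_train - n_val
    return n_train, n_val, n_test


def keep_short_clips(durations_s: list[float], max_s: float = 30.0) -> list[int]:
    """Indices of clips that fit in Whisper's 30-second input window."""
    return [i for i, d in enumerate(durations_s) if d <= max_s]


if __name__ == "__main__":
    print(split_sizes(6807))                     # sizes for the 6,807 samples
    print(keep_short_clips([4.2, 30.0, 31.5]))   # the 31.5 s clip is dropped
```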

Training Configuration

  • Optimizer: AdamW with 1e-5 learning rate
  • Warmup steps: 100
  • Total steps: 1,536 (8 epochs × 192 steps/epoch)
  • Gradient accumulation: 2 steps
  • Mixed precision: FP16 on Tesla T4
  • Gradient checkpointing: Enabled for memory efficiency
  • Regularization: SpecAugment on encoder
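
The step count above follows from the training-set size and the effective batch size. A small check, assuming one optimizer step per effective batch and that the final partial batch is kept (the per-device batch size is not stated, so both candidate effective sizes are shown):

```python
import math

TRAIN_SAMPLES = 6126
EPOCHS = 8


def steps_per_epoch(n_samples: int, effective_batch: int) -> int:
    """Optimizer steps per epoch, assuming the last partial batch is kept."""
    return math.ceil(n_samples / effective_batch)


if __name__ == "__main__":
    for effective_batch in (16, 32):
        per_epoch = steps_per_epoch(TRAIN_SAMPLES, effective_batch)
        print(f"effective batch {effective_batch}: "
              f"{per_epoch} steps/epoch, {per_epoch * EPOCHS} total")
```

Under this assumption the reported 192 steps per epoch corresponds to an effective batch of 32 samples (for example 16 per device with 2 accumulation steps), while an effective batch of 16 would give 383 steps per epoch.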

Data Augmentation

  • SpecAugment: Time masking (5%) + frequency masking (5%)
  • Normalization: Krio-safe (lowercase + whitespace collapse only)
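
The normalization described above amounts to two operations. A minimal sketch, assuming "lowercase + whitespace collapse only" means nothing else (punctuation, ɛ, ɔ) is altered; the function name `normalize_krio` is hypothetical:

```python
def normalize_krio(text: str) -> str:
    """Lowercase and collapse runs of whitespace; leave all other characters alone."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with one space collapses it.
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(normalize_krio("  Kɛnayt   DƐN\tfɔ muf \n"))
```

Applying the same function to both references and predictions before scoring keeps WER and CER from penalizing casing or spacing differences.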

Limitations

  1. Audio quality: Trained on clean conversational speech; accuracy may drop on whispered or very noisy audio
  2. Audio length: Handles clips up to 30 seconds; longer clips are silently truncated
  3. Phonetics: Optimized for standard Krio phonetics; strong accents may reduce accuracy
  4. Domain: Trained on general Krio speech; specialized domains (music, code-switching) have not been extensively tested
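
Because audio beyond 30 seconds is truncated, longer recordings need to be split before transcription. The sketch below shows one way to compute window boundaries at 16 kHz. The function name `chunk_bounds` and the fixed, non-overlapping windows are illustrative assumptions, not part of this model's code. The Transformers ASR pipeline also accepts a `chunk_length_s` argument for chunked long-form inference, which may be the simpler route.

```python
SAMPLE_RATE = 16_000  # Hz, the rate the model expects


def chunk_bounds(n_samples: int, chunk_s: float = 30.0,
                 sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive windows of at most chunk_s seconds."""
    chunk_len = int(chunk_s * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_s must be positive")
    return [
        (start, min(start + chunk_len, n_samples))
        for start in range(0, n_samples, chunk_len)
    ]


if __name__ == "__main__":
    # A 75-second recording is split into 30 s + 30 s + 15 s windows.
    for start, end in chunk_bounds(75 * SAMPLE_RATE):
        print(start, end, (end - start) / SAMPLE_RATE)
```

Each slice `audio[start:end]` can then be transcribed separately and the texts joined. Hard cuts can land mid-word, so splitting at pauses or using overlapping windows is likely to give cleaner results.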

Examples

Sample Predictions

Sample 1 (perfect match):

  • Reference: kɛnayt dɛn fɔ muf kɔmɔt to di amalɛkayt dɛn mek i nɔ dɔnawe wit dɛn ɛn di amalɛkayt dɛn togɛda...
  • Prediction: kɛnayt dɛn fɔ muf kɔmɔt to di amalɛkayt dɛn mek i nɔ dɔnawe wit dɛn ɛn di amalɛkayt dɛn togɛda...
  • WER: 0%

Sample 2 (two spelling errors):

  • Reference: asaia na agia in bɔypikin di myuzishian dɛn na di tɛmpul
  • Prediction: asia na agia in bɔypikin di myuzishan dɛn na di tɛmpul
  • WER: ~18% (2 substitutions in 11 reference words)
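
Word error rate is the word-level edit distance divided by the number of reference words, and character error rate is the same ratio over characters. The dependency-free sketch below applies both to Sample 2. The function names are hypothetical, and the evaluation for this model may have used a library instead, which is not stated.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "asaia na agia in bɔypikin di myuzishian dɛn na di tɛmpul"
    hyp = "asia na agia in bɔypikin di myuzishan dɛn na di tɛmpul"
    print(f"WER: {wer(ref, hyp):.1%}")  # 2 of 11 words differ
    print(f"CER: {cer(ref, hyp):.1%}")
```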

Comparison with Baseline

This fine-tuned model significantly outperforms the vanilla Whisper Small model:

Model | WER | CER
--- | --- | ---
Whisper Small (baseline) | ~50% | ~30%
Whisper Small + fine-tuning (this model) | 4.78% | 2.12%
Improvement | 89.6% reduction | 92.9% reduction
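
The improvement figures are relative error reductions, computed as (baseline - fine-tuned) / baseline. A quick check follows, treating the rounded baseline values in the table as exact, which is an assumption since the unrounded baselines are not given:

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Fraction of the baseline error removed by fine-tuning."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline - finetuned) / baseline


if __name__ == "__main__":
    print(f"WER reduction: {relative_reduction(50.0, 4.78):.1%}")
    print(f"CER reduction: {relative_reduction(30.0, 2.12):.1%}")
```

With the rounded ~50% baseline the WER reduction comes out near 90.4%; the reported 89.6% would correspond to an unrounded baseline WER of roughly 46%. The CER figure matches the rounded ~30% baseline.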

Citation

If you use this model in your research or product, please cite:

@misc{novax_krio_v3_2024,
  title={Whisper Small — Krio Speech-to-Text (novax-krio-v3)},
  author={{MosesJoshuaCoker}},
  year={2024},
  publisher={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/MosesJoshuaCoker/novax-krio-v3}},
  note={Fine-tuned on MosesJoshuaCoker/30_hours_krio_voice}
}

License

This model is licensed under the Apache License 2.0, same as the base Whisper model. See the LICENSE file for details.

Contact & Support

For questions, issues, or suggestions:

  • Open an issue on the model page
  • Check existing discussions for common troubleshooting

Last Updated: January 2025
Base Model Version: openai/whisper-small
Training Framework: Hugging Face Transformers 4.40+
Model ID: MosesJoshuaCoker/novax-krio-v3
