Whisper Tiny Dioula (Dioula Speech-to-Text)
This is a fine-tuned version of openai/whisper-tiny for automatic speech recognition (ASR) in Dioula (Jula), a Manding language spoken by over 20 million people across West Africa (primarily in Ivory Coast, Burkina Faso, and Mali).
Model Details
Model Description
This model was trained to provide Speech-to-Text capabilities for the Dioula language. Because Dioula is a low-resource language, fine-tuning Whisper Tiny is a foundational step towards integrating it into modern NLP pipelines, educational software, and accessibility tools.
- Developed by: Soumana Dama, full-stack developer and AI engineer; Founder & Lead AI Engineer at Scoinvestigator AI
- Model type: Whisper (Speech-to-Text / Automatic Speech Recognition)
- Language(s) (NLP): Dioula / Jula (ISO 639-3: dyu)
- License: Apache 2.0
- Finetuned from model: openai/whisper-tiny
Model Sources
- Developer Contacts:
  - LinkedIn: Soumana Dama
  - Scoinvestigator AI: scoinvestigator.com
  - GitHub: Damasoumana1
  - Portfolio: soumanadama.netlify.app
  - Email: soumanadama93@gmail.com
Uses
Direct Use
This model is intended for transcribing spoken Dioula audio into written Dioula text. It can be used for:
- Transcription of podcasts, radio broadcasts, or voice notes in Dioula.
- Integration into multilingual AI voice assistants.
- Educational tools aiming to teach Dioula or provide reading assistance.
Out-of-Scope Use
This model is not designed for translating audio directly into other languages (e.g., Dioula to English) without an external translation pipeline. Due to its "tiny" architecture, it may struggle with strong background noise and with accents not represented in the training data. It should not be used for high-stakes clinical or legal transcription without human supervision.
Bias, Risks, and Limitations
As with many models trained on crowdsourced data (like Common Voice), the model may exhibit biases toward the specific accents, ages, and genders of the most frequent contributors. The current Word Error Rate (WER) of 85.53% on the validation set shows that the model is at an early stage and will produce frequent transcription errors. Human supervision is required for critical tasks.
How to Get Started with the Model
Use the code below to get started with the model. Note that using generate() directly is recommended over the pipeline abstraction for maximum stability with custom configurations.
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
# Load model and processor
model_id = "Dama12/whisper-tiny-dioula"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Load your audio file (librosa resamples it to 16 kHz)
audio_path = "path_to_your_dioula_audio.wav"
audio_array, sampling_rate = librosa.load(audio_path, sr=16000)
# Process and Predict
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)
predicted_ids = model.generate(input_features)
# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
Training Details
Training Data
The model was fine-tuned on a curated, normalized Dioula dataset derived from Mozilla Common Voice.
- Training set: 8,978 validated audio-text pairs.
- Validation set: 558 audio-text pairs.
All audio files were resampled and normalized to 16,000 Hz.
Training Procedure
The model was trained on Kaggle using NVIDIA T4 x2 GPUs. The dataset was processed using a custom PyTorch Dataset loader to apply lazy loading of audio spectrograms, preventing system RAM bottlenecks.
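The author's actual loader is not shown in this card, so the following is only a minimal sketch of the lazy-loading idea: the dataset keeps (audio path, transcript) pairs in memory and computes features only when an item is requested. The class name `LazySpeechDataset`, the file names, and the stand-in callables are hypothetical. In a real training script the class would typically subclass `torch.utils.data.Dataset`, and the three callables would plausibly wrap `librosa.load(path, sr=16000)[0]`, `processor.feature_extractor(audio, sampling_rate=16000).input_features[0]`, and `processor.tokenizer(text).input_ids`.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


class LazySpeechDataset:
    """Map-style dataset that stores only (audio_path, transcript) pairs.

    Audio decoding and feature extraction happen inside __getitem__, so
    spectrograms are never all held in RAM at once.
    """

    def __init__(
        self,
        records: Sequence[Tuple[str, str]],
        load_audio: Callable[[str], Any],
        extract_features: Callable[[Any], Any],
        tokenize: Callable[[str], List[int]],
    ) -> None:
        self.records = records
        self.load_audio = load_audio
        self.extract_features = extract_features
        self.tokenize = tokenize

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        path, text = self.records[index]
        audio = self.load_audio(path)  # decoded only when this item is requested
        return {
            "input_features": self.extract_features(audio),
            "labels": self.tokenize(text),
        }


# Stand-in callables (hypothetical) so the sketch runs without audio files
# or model weights.
loaded_paths: List[str] = []


def fake_load_audio(path: str) -> List[float]:
    loaded_paths.append(path)
    return [0.0] * 16000  # one second of silence at 16 kHz


def fake_extract_features(audio: List[float]) -> List[int]:
    return [len(audio)]  # placeholder for a log-Mel spectrogram


def fake_tokenize(text: str) -> List[int]:
    return [len(word) for word in text.split()]  # placeholder token ids


dataset = LazySpeechDataset(
    records=[
        ("clip_0.wav", "o cɛ to gundo la"),
        ("clip_1.wav", "placeholder transcript"),
    ],
    load_audio=fake_load_audio,
    extract_features=fake_extract_features,
    tokenize=fake_tokenize,
)

item = dataset[0]
print(len(dataset), loaded_paths)  # only the requested clip has been loaded
```

The trade-off is that every epoch recomputes features on the fly, which costs CPU time but keeps memory use roughly constant regardless of dataset size.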
Training Hyperparameters
- Training regime: fp16 mixed precision
- Learning rate: 1e-05
- Train batch size (per device): 8
- Eval batch size (per device): 4
- Gradient accumulation steps: 2
- Optimizer: AdamW
- Warmup steps: 500
- Max steps: 4000
- Generation max length: 225
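As a rough illustration, the values above can be collected into keyword arguments and used to estimate the effective batch size. The key names follow the fields of `Seq2SeqTrainingArguments` in transformers, but the training script itself is not shown in this card, so the mapping is an assumption. The epoch estimate also assumes that both T4 GPUs took part in data-parallel training and that each of the 4,000 steps is one optimizer update.

```python
# Hyperparameters from the list above, keyed by the corresponding
# Seq2SeqTrainingArguments field names (assumed mapping).
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 4,
    "gradient_accumulation_steps": 2,
    "warmup_steps": 500,
    "max_steps": 4000,
    "fp16": True,
    "generation_max_length": 225,
}

NUM_GPUS = 2           # assumption: both T4 GPUs were used data-parallel
TRAIN_EXAMPLES = 8978  # training-set size reported under Training Data

# Examples consumed per optimizer update.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * NUM_GPUS
    * training_kwargs["gradient_accumulation_steps"]
)

# Total examples seen over the run and the approximate number of epochs.
examples_seen = training_kwargs["max_steps"] * effective_batch_size
approx_epochs = examples_seen / TRAIN_EXAMPLES

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen}")
print(f"approximate epochs: {approx_epochs:.2f}")
```

If only one GPU was active, the effective batch size and the epoch estimate would both halve.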
Evaluation
Metrics
The model was evaluated using the Word Error Rate (WER) metric on the 558-sample validation split.
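For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The sketch below is a plain-Python illustration, not the evaluation code used for this model, which is not shown in this card. The helper name `word_error_rate` is hypothetical, and the sample pair is the one quoted in the Results section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


reference = "o cɛ to gundo la"
prediction = "o tchè bi to gondola"
sample_wer = word_error_rate(reference, prediction)
print(f"WER: {sample_wer:.2f}")  # 4 word edits over 5 reference words -> 0.80
```

Evaluation libraries usually aggregate edits and reference words over the whole split rather than averaging per-sentence scores, so a single pair like this is only indicative.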
Results
At step 4000, the model achieved the following metrics on the validation set:
- Validation Loss: 1.3226
- Word Error Rate (WER): 85.53%
While the WER leaves substantial room for improvement, the model does capture some phonetic structure and vocabulary: for example, it predicted "o tchè bi to gondola" for the reference "o cɛ to gundo la". Future iterations will focus on larger base models (e.g., whisper-base or whisper-small) to reduce the WER.
Environmental Impact
- Hardware Type: 2x Tesla T4 GPUs (Kaggle Compute)
- Hours used: ~1.5 hours
- Cloud Provider: Kaggle
- Compute Region: Global