Whisper Tiny Dioula (Dioula Speech-to-Text)

This is a fine-tuned version of openai/whisper-tiny for automatic speech recognition (ASR) in Dioula (Jula), a Manding language spoken by over 20 million people across West Africa, primarily in Ivory Coast, Burkina Faso, and Mali.

Model Details

Model Description

This model was trained to provide Speech-to-Text capabilities for the Dioula language. Because Dioula is a low-resource language, fine-tuning Whisper Tiny is a foundational step toward integrating it into modern NLP pipelines, educational software, and accessibility tools.

  • Developed by: Soumana Dama, full-stack developer and AI engineer; Founder & Lead AI Engineer at Scoinvestigator AI
  • Model type: Whisper (Speech-to-Text / Automatic Speech Recognition)
  • Language(s) (NLP): Dioula / Jula (ISO 639-3: dyu)
  • License: Apache 2.0
  • Finetuned from model: openai/whisper-tiny

Uses

Direct Use

This model is intended for transcribing spoken Dioula audio into written Dioula text. It can be used for:

  • Transcription of podcasts, radio broadcasts, or voice notes in Dioula.
  • Integration into multilingual AI voice assistants.
  • Educational tools aiming to teach Dioula or provide reading assistance.

Out-of-Scope Use

This model does not translate audio into other languages (e.g., Dioula to English); that requires an external translation pipeline. Because of its "tiny" architecture, it may struggle with strong background noise and with accents not represented in the training data. It should not be used for high-stakes clinical or legal transcription without human supervision.


Bias, Risks, and Limitations

As with many models trained on crowdsourced data such as Common Voice, the model may be biased toward the accents, ages, and genders of the most frequent contributors. The current Word Error Rate (WER) of 85.53% shows that the model is at an early stage and will produce frequent transcription errors. Human supervision is required for critical tasks.


How to Get Started with the Model

Use the code below to get started with the model. Note that using generate() directly is recommended over the pipeline abstraction for maximum stability with custom configurations.

import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load model and processor
model_id = "Dama12/whisper-tiny-dioula"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load the audio file (librosa resamples it to 16 kHz mono on load)
audio_path = "path_to_your_dioula_audio.wav"
audio_array, sampling_rate = librosa.load(audio_path, sr=16000)

# Process and Predict
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device)
predicted_ids = model.generate(input_features)

# Decode transcription
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
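
Whisper's feature extractor pads or truncates each input to 30 seconds, so longer recordings such as podcasts or radio broadcasts need to be split before they go through the snippet above. The following is a minimal sketch of one way to do that, assuming plain non-overlapping windows; `split_into_chunks` is a hypothetical helper, not part of this repository or of transformers, and naive splitting can cut a word at a chunk boundary.

```python
def split_into_chunks(samples, sampling_rate=16000, chunk_seconds=30.0):
    """Split a 1-D sequence of audio samples into consecutive fixed-length chunks.

    Works with Python lists and NumPy arrays alike, since it only uses slicing.
    The last chunk may be shorter than chunk_seconds.
    """
    if sampling_rate <= 0 or chunk_seconds <= 0:
        raise ValueError("sampling_rate and chunk_seconds must be positive")
    chunk_len = int(sampling_rate * chunk_seconds)
    return [samples[start:start + chunk_len]
            for start in range(0, len(samples), chunk_len)]


# Demo on 70 seconds of silence at 16 kHz (a stand-in for real audio).
dummy_audio = [0.0] * (70 * 16000)
chunks = split_into_chunks(dummy_audio)
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be passed to the processor and `generate()` in turn, and the decoded strings joined with spaces.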

Training Details

Training Data

The model was fine-tuned on a curated, normalized Dioula dataset derived from Mozilla Common Voice.

  • Training set: 8,978 validated audio-text pairs.
  • Validation set: 558 audio-text pairs.

All audio files were resampled and normalized to 16,000 Hz.
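
Because the training audio is at 16,000 Hz, it can be useful to check the sample rate of your own files before transcription. This is a small sketch using only the Python standard library; `wav_sample_rate` and `needs_resampling` are hypothetical helper names, and the `wave` module only reads uncompressed PCM WAV files.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=16000):
    """True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


# Demo: write one second of 16-bit mono silence at 44.1 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_44k.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
    wav_file.setframerate(44100)
    wav_file.writeframes(b"\x00\x00" * 44100)

print(wav_sample_rate(demo_path), needs_resampling(demo_path))  # 44100 True
```

Files that need resampling can be loaded with `librosa.load(path, sr=16000)`, as in the getting-started code.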

Training Procedure

The model was trained on Kaggle using two NVIDIA T4 GPUs. A custom PyTorch Dataset loaded the audio spectrograms lazily, one item at a time, so the full set of features never had to fit in system RAM.
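
The training code itself is not included in this card, so the following is only an illustrative sketch of the lazy-loading idea, not the author's implementation. `LazyAudioDataset`, `load_features`, and `encode_text` are hypothetical names. To stay dependency-free the class does not subclass `torch.utils.data.Dataset`, but a map-style dataset only needs `__len__` and `__getitem__`.

```python
class LazyAudioDataset:
    """Map-style dataset that computes audio features only when an item is requested."""

    def __init__(self, records, load_features, encode_text):
        # records: list of (audio_path, transcript) pairs. Only these small
        # tuples stay in memory; spectrograms are computed per item.
        self.records = records
        self.load_features = load_features  # callable: path -> features
        self.encode_text = encode_text      # callable: transcript -> label ids

    def __len__(self):
        return len(self.records)

    def __getitem__(self, index):
        audio_path, transcript = self.records[index]
        return {
            "input_features": self.load_features(audio_path),
            "labels": self.encode_text(transcript),
        }


# Demo with stand-in callables that record which files were actually loaded.
loaded = []

def fake_load_features(path):
    loaded.append(path)
    return [0.0, 0.0, 0.0, 0.0]  # placeholder for a log-Mel spectrogram

dataset = LazyAudioDataset(
    records=[("a.wav", "first transcript"), ("b.wav", "second transcript")],
    load_features=fake_load_features,
    encode_text=str.split,  # placeholder for a real tokenizer
)
item = dataset[1]
print(len(dataset), loaded)  # 2 ['b.wav']
```

In a real setup, `load_features` would wrap the audio loading and the Whisper feature extractor, and `encode_text` would wrap the tokenizer.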

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Learning rate: 1e-05
  • Train batch size (per device): 8
  • Eval batch size (per device): 4
  • Gradient accumulation steps: 2
  • Optimizer: AdamW
  • Warmup steps: 500
  • Max steps: 4000
  • Generation max length: 225
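
As a rough sanity check, the hyperparameters above can be turned into an effective batch size and an approximate number of passes over the training set. This sketch assumes both T4 GPUs were used in data-parallel mode, which the card does not state explicitly; with a single GPU the figures would be halved.

```python
per_device_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 2            # assumption: both T4 GPUs used in data-parallel mode
max_steps = 4000
warmup_steps = 500
train_examples = 8978   # size of the training set reported above

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
# Total examples seen over the whole run, and the equivalent number of epochs.
examples_seen = effective_batch_size * max_steps
approx_epochs = examples_seen / train_examples
# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / max_steps

print(f"effective batch size: {effective_batch_size}")  # 32
print(f"examples seen: {examples_seen}")                # 128000
print(f"approx. epochs: {approx_epochs:.2f}")           # 14.26
print(f"warmup fraction: {warmup_fraction:.1%}")        # 12.5%
```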

Evaluation

Metrics

The model was evaluated using the Word Error Rate (WER) metric on the 558-sample validation split.
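
WER is the word-level edit distance (substitutions, deletions, and insertions) between the prediction and the reference, divided by the number of reference words. The sketch below is a plain-Python version for illustration; it is not necessarily the implementation used to produce the reported score, and it applies no text normalization beyond whitespace splitting.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1      # reference word missing from hypothesis
            insertion = current_row[j - 1] + 1  # extra word in hypothesis
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# The reference/prediction pair quoted in the Results section of this card.
reference = "o cɛ to gundo la"
hypothesis = "o tchè bi to gondola"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 80%
```

This pair scores 80% even though the prediction sounds close to the reference, because WER counts every spelling or word-boundary difference as a full word error.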

Results

At step 4000, the model achieved the following metrics on the validation set:

  • Validation Loss: 1.3226
  • Word Error Rate (WER): 85.53%

Although the WER leaves substantial room for improvement, the model's output is often phonetically close to the reference even when spelling and word boundaries differ (e.g., it predicts "o tchè bi to gondola" for the reference "o cɛ to gundo la"). Future iterations will focus on larger base models (e.g., whisper-base or whisper-small) to reduce the WER.


Environmental Impact

  • Hardware Type: 2x Tesla T4 GPUs (Kaggle Compute)
  • Hours used: ~1.5 hours
  • Cloud Provider: Kaggle
  • Compute Region: Global
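
For a rough sense of scale, the figures above can be converted into an upper bound on GPU energy use. This sketch assumes that the ~1.5 hours is wall-clock time with both GPUs running, and that each T4 draws its full 70 W rated board power throughout; actual draw is typically lower, and CPU, memory, and data-center overhead are not included.

```python
num_gpus = 2
hours_used = 1.5            # assumption: wall-clock time, both GPUs active
t4_rated_power_watts = 70   # rated board power of an NVIDIA T4

gpu_hours = num_gpus * hours_used
# Upper bound: every GPU at full rated power for the whole run.
max_energy_kwh = gpu_hours * t4_rated_power_watts / 1000

print(f"GPU-hours: {gpu_hours}")                          # 3.0
print(f"GPU energy upper bound: {max_energy_kwh:.2f} kWh")  # 0.21 kWh
```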

Model size: 37.8M params (Safetensors, tensor type F32)