---
license: apache-2.0
language:
- ln
tags:
- automatic-speech-recognition
- speech-recognition
- whisper
- lingala
- bantu-languages
- lora
- peft
datasets:
- google/waxal
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: peft
---
BLI ASR 0
BLI ASR 0 is an automatic speech recognition model for Lingala developed by the Bantu Language Initiative.
The model is based on OpenAI Whisper large-v3 and adapted with LoRA for Lingala speech transcription. It is intended as an early research and community-oriented ASR model for under-resourced Bantu languages, starting with Lingala.
Project website and examples:
https://bantulanguageinitiative.com/en
Quick Test Dataset
A small cropped real-world Lingala audio benchmark is available here:
BantuLanguagesInitiative/lingala_real_eval_benchmark_croped
It contains short audio clips from several domains such as news, catechesis, comedy, cartoon and interview speech. The inference notebook loads this dataset directly, plays the selected audio sample, and transcribes it with BLI ASR 0.
Model Description
- Model name: BLI ASR 0
- Task: Automatic Speech Recognition
- Language: Lingala
- Base model: openai/whisper-large-v3
- Adaptation method: LoRA / PEFT
- Training dataset: Waxal Lingala ASR
- Output: Lingala transcription from speech audio
This model transcribes Lingala speech into text. It is not a translation model.
Dataset
The model was trained on the Waxal Lingala ASR dataset.
The dataset was split into:
| Split | Approx. number of samples | Usage |
|---|---|---|
| Train | 14,400 | Model training |
| Validation | 1,844 | Validation during development |
| Test | 1,866 | Final held-out evaluation |
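For orientation, the approximate counts in the table can be turned into split proportions. This is only a small arithmetic sketch based on the numbers above; the dictionary keys are illustrative names, not identifiers from the dataset.

```python
# Approximate sample counts taken from the split table above.
splits = {"train": 14_400, "validation": 1_844, "test": 1_866}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

# Print each split with its share of the full dataset.
for name, share in shares.items():
    print(f"{name:<10} {splits[name]:>6} samples  {share:6.1%}")
print(f"{'total':<10} {total:>6} samples")
```

With these counts the split is close to 80/10/10 (about 79.5% train, 10.2% validation, 10.3% test).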
Text Post-processing
We applied a light normalization pipeline to the training and evaluation transcriptions.
The goal was not to impose a strict Lingala orthography, but to reduce noise and improve consistency. The post-processing included:
- Unicode normalization
- lowercasing
- whitespace normalization
- punctuation and symbol cleanup
- preservation of the original raw transcription when available
- creation of a normalized transcription field used for training/evaluation
We intentionally avoided aggressive spelling correction because Lingala has substantial orthographic variation across speakers, regions, and data sources.
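The steps listed above could be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used for this release: the choice of NFC as the Unicode form, the rule for which characters count as punctuation or symbols, and the field names `text` and `text_normalized` are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_transcription(text: str) -> str:
    """Light normalization: Unicode form, case, symbols, whitespace."""
    # Assumption: NFC is used as the canonical Unicode form.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()

    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        # Keep letters, digits, combining marks (tone marks), apostrophes
        # and whitespace; replace every other character with a space.
        if category[0] in ("L", "N", "M") or ch in "'’" or ch.isspace():
            cleaned.append(ch)
        else:
            cleaned.append(" ")

    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", "".join(cleaned)).strip()


def add_normalized_field(example: dict) -> dict:
    """Keep the raw transcription and add a normalized copy next to it."""
    return {**example, "text_normalized": normalize_transcription(example["text"])}


print(add_normalized_field({"text": "  Mbote,   NA  yo! "}))
```

No spelling correction is applied here, in line with the decision to leave orthographic variation untouched.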
Training Details
The model was fine-tuned from openai/whisper-large-v3 using LoRA.
Main training choices:
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA |
| Task token | transcribe |
| Language token | Lingala |
| Precision | bf16 |
| Optimizer | AdamW |
| Evaluation strategy | small random validation subsets during training |
| Final evaluation | full validation/test split |
| Dataset | Waxal Lingala ASR |
Performance
We report CER rather than WER for this release.
| Metric | Value |
|---|---|
| CER (normalized transcriptions) | 0.1703 (17.03%) |
We do not report WER in this first release because it is not a fair measure in the current Lingala ASR setting. Our data does not follow a single, consistently enforced orthography, and WER heavily penalizes spelling variants, word-segmentation differences, and silence-related insertions and deletions. We plan to release an adjusted WER metric that better accounts for linguistic and contextual variation.
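To make the difference between the two metrics concrete, here is a small self-contained sketch of CER and WER built on a plain Levenshtein distance. It is not the evaluation code used for this release, and the sentence pair is an illustrative example of a segmentation variant, not a sample from the dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative pair: same words, different word segmentation.
reference = "nakokende na zando"
hypothesis = "na ko kende na zando"

print(f"CER = {cer(reference, hypothesis):.2f}")
print(f"WER = {wer(reference, hypothesis):.2f}")
```

For this pair the hypothesis differs from the reference only by two inserted spaces, so CER is about 0.11 while WER reaches 1.00. This is the kind of segmentation penalty described above.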
Intended Use
This model can be used for:
- Lingala speech transcription
- research on low-resource ASR
- dataset bootstrapping
- assisted transcription before human correction
- evaluation of ASR pipelines for Bantu languages
The model is especially useful for first-pass transcription ahead of review by human annotators.
Limitations
This is an early release and still has important limitations:
- silence handling still needs improvement
- the model may hallucinate text during long silent regions
- performance can degrade with music, jingles, intros, outros, and strong background noise
- performance in real-world media with overlapping speech is still limited
- the training data is not broad enough to cover all common Lingala varieties
- the model may struggle with recent slang, popular urban expressions, and code-switching
- the model is not yet robust across domains such as news, sermons, informal conversation, street interviews, and music-heavy content
Example Inference in a Notebook
!pip install -U transformers peft accelerate soundfile librosa torchao
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa

base_model = "openai/whisper-large-v3"
adapter_model = "BantuLanguagesInitiative/bli-asr-0"

# Use bf16 on GPU and fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

processor = WhisperProcessor.from_pretrained(
    base_model,
    language="lingala",
    task="transcribe",
)

# Load the base model, attach the LoRA adapter, then merge it into the base weights.
model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=dtype,
)
model = PeftModel.from_pretrained(model, adapter_model)
model = model.merge_and_unload()
model.to(device)
model.eval()

# Load the audio resampled to 16 kHz, the rate passed to the feature extractor below.
audio_path = "example.mp3"
audio, sr = librosa.load(audio_path, sr=16000)

inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
)
input_features = inputs.input_features.to(device=device, dtype=dtype)

# Force Lingala transcription (not translation).
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="lingala",
    task="transcribe",
)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
        max_new_tokens=225,
    )

# Decode the generated token IDs into text, dropping special tokens.
text = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]
print(text)