license: apache-2.0
language:
- ln
tags:
- automatic-speech-recognition
- speech-recognition
- whisper
- lingala
- bantu-languages
- lora
- peft
datasets:
- google/waxal
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: peft

BLI ASR 0

BLI ASR 0 is an automatic speech recognition model for Lingala developed by the Bantu Language Initiative.

The model is based on OpenAI Whisper large-v3 and adapted with LoRA for Lingala speech transcription. It is intended as an early research and community-oriented ASR model for under-resourced Bantu languages, starting with Lingala.

Project website and examples:
https://bantulanguageinitiative.com/en

Quick Test Dataset

A small cropped real-world Lingala audio benchmark is available here:

BantuLanguagesInitiative/lingala_real_eval_benchmark_croped

It contains short audio clips from several domains such as news, catechesis, comedy, cartoon and interview speech. The inference notebook loads this dataset directly, plays the selected audio sample, and transcribes it with BLI ASR 0.

Model Description

Model name: BLI ASR 0
Task: Automatic Speech Recognition
Language: Lingala
Base model: openai/whisper-large-v3
Adaptation method: LoRA / PEFT
Training dataset: Waxal Lingala ASR
Output: Lingala transcription from speech audio

This model transcribes Lingala speech into text. It is not a translation model.

Dataset

The model was trained on the Waxal Lingala ASR dataset.

The dataset was split into:

Split	Approx. number of samples	Usage
Train	14,400	Model training
Validation	1,844	Validation during development
Test	1,866	Final held-out evaluation

Text Post-processing

We applied a light normalization pipeline to the training and evaluation transcriptions.

The goal was not to impose a strict Lingala orthography, but to reduce noise and improve consistency. The post-processing included:

Unicode normalization
lowercasing
whitespace normalization
punctuation and symbol cleanup
preservation of the original raw transcription when available
creation of a normalized transcription field used for training/evaluation

We intentionally avoided aggressive spelling correction because Lingala has substantial orthographic variation across speakers, regions, and data sources.
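The steps listed above can be sketched as a small function. This is a minimal illustration, not the project's actual pipeline: the names `normalize_transcript` and `_is_kept`, the choice of NFC, and the exact set of characters kept are assumptions. Combining marks are kept on purpose so that tone diacritics on letters such as ɛ and ɔ survive the cleanup.

```python
import unicodedata


def _is_kept(ch: str) -> bool:
    """Keep letters, combining marks (tone diacritics), digits, and apostrophes."""
    return unicodedata.category(ch)[0] in ("L", "M", "N") or ch == "'"


def normalize_transcript(raw: str) -> dict:
    """Return the raw transcription alongside a lightly normalized version."""
    # Unicode normalization (NFC is an assumption; the card does not name a form).
    text = unicodedata.normalize("NFC", raw)
    # Lowercasing.
    text = text.lower()
    # Punctuation and symbol cleanup: anything not kept becomes a space.
    text = "".join(ch if _is_kept(ch) else " " for ch in text)
    # Whitespace normalization: collapse runs of spaces and trim the ends.
    text = " ".join(text.split())
    # Preserve the raw text and expose the normalized field separately.
    return {"raw": raw, "normalized": text}


result = normalize_transcript("  Mbote,   NA  yo! ")
print(result["normalized"])  # mbote na yo
```

No spelling correction is applied, in line with the choice described above.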

Training Details

The model was fine-tuned from openai/whisper-large-v3 using LoRA.
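As a rough illustration of why LoRA keeps fine-tuning cheap: for a frozen weight matrix of shape d_out x d_in, a rank-r adapter trains two small matrices with r * (d_in + d_out) parameters in total. The sketch below does that arithmetic for a single 1280 x 1280 attention projection (1280 is the hidden size of Whisper large-v3). The rank `r = 16` is a hypothetical value chosen for illustration; this card does not state the rank, scaling, or target modules actually used.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in a rank-r LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


d_model = 1280  # hidden size of Whisper large-v3
r = 16          # hypothetical rank, not taken from this card

full = d_model * d_model                               # frozen weights in one projection
adapter = lora_trainable_params(d_model, d_model, r)   # trainable LoRA weights
fraction = adapter / full

print(f"frozen: {full:,}  trainable: {adapter:,}  ratio: {fraction:.3%}")
```

With these assumed values, the adapter for one projection holds 40,960 trainable parameters against 1,638,400 frozen ones, or 2.5%.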

Main training choices:

Parameter	Value
Base model	`openai/whisper-large-v3`
Fine-tuning method	LoRA
Task token	transcribe
Language token	Lingala
Precision	bf16
Optimizer	AdamW
Evaluation strategy	small random validation subsets during training
Final evaluation	full validation/test split
Dataset	Waxal Lingala ASR
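Evaluating on a small random validation subset during training can be sketched as below. The subset size of 200, the fixed seed, and the helper name `sample_eval_indices` are assumptions for illustration; only the validation size (about 1,844 samples) comes from this card. Whether the subset was re-drawn at each evaluation step is not stated; changing the seed re-draws it.

```python
import random


def sample_eval_indices(n_total: int, n_subset: int, seed: int) -> list[int]:
    """Draw a reproducible random subset of example indices without replacement."""
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    k = min(n_subset, n_total)
    return sorted(rng.sample(range(n_total), k))


# Hypothetical subset size and seed; 1844 is the approximate validation size.
indices = sample_eval_indices(n_total=1844, n_subset=200, seed=0)
print(len(indices), indices[:5])
```

The final numbers would still be computed on the full validation and test splits, as the table states.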

Performance

We report CER rather than WER for this release.

Metric	Value
CER (normalized text)	0.1703

We do not report WER in this first release because it would not be a fully fair measure in the current Lingala ASR setting. Our data does not follow a single, consistently applied orthography, and WER heavily penalizes spelling variants, word-segmentation differences, and silence-related insertions and deletions. We plan to release a corrected WER metric that better accounts for linguistic and contextual variation.
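To make the difference concrete, the sketch below computes CER and WER from a plain Levenshtein distance and applies both to a hypothetical segmentation variant ("na lingi yo" versus "nalingi yo"). The function names and the example pair are illustrative assumptions; the card does not say which tool produced the reported CER.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate on whitespace tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "na lingi yo"    # hypothetical reference transcription
hypothesis = "nalingi yo"    # same words, one space missing

print(f"CER = {cer(reference, hypothesis):.3f}")  # CER = 0.091
print(f"WER = {wer(reference, hypothesis):.3f}")  # WER = 0.667
```

A single missing space costs one character out of eleven under CER, but two word errors out of three under WER.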

Intended Use

This model can be used for:

Lingala speech transcription
research on low-resource ASR
dataset bootstrapping
assisted transcription before human correction
evaluation of ASR pipelines for Bantu languages

The model is especially useful as a first-pass transcription model before review by human annotators.

Limitations

This is an early release and still has important limitations:

silence handling still needs improvement
the model may hallucinate text during long silent regions
performance can degrade with music, jingles, intros, outros, and strong background noise
performance in real-world media with overlapping speech is still limited
the training data is not broad enough to cover all common Lingala varieties
the model may struggle with recent slang, popular urban expressions, and code-switching
the model is not yet robust across domains such as news, sermons, informal conversation, street interviews, and music-heavy content

Example Inference in a Notebook

!pip install -U transformers peft accelerate soundfile librosa torchao

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa

base_model = "openai/whisper-large-v3"
adapter_model = "BantuLanguagesInitiative/bli-asr-0"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

processor = WhisperProcessor.from_pretrained(
    base_model,
    language="lingala",
    task="transcribe",
)

model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=dtype,
)

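# Load the LoRA adapter on top of the base model, then fold the adapter
# weights into the base weights so inference runs on a plain Whisper model.
model = PeftModel.from_pretrained(model, adapter_model)
model = model.merge_and_unload()
model.to(device)
model.eval()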

# Whisper expects 16 kHz mono audio; librosa resamples and downmixes on load.
audio_path = "example.mp3"
audio, sr = librosa.load(audio_path, sr=16000)

inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
)

input_features = inputs.input_features.to(device=device, dtype=dtype)

# generate() accepts the language and task directly, which avoids building
# forced_decoder_ids by hand.
with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        language="lingala",
        task="transcribe",
        max_new_tokens=225,
    )

text = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]

print(text)
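By default, Whisper's feature extractor pads or truncates its input to 30 seconds, so the example above transcribes at most the first 30 seconds of a file. For longer recordings, one simple option is to cut the waveform into consecutive windows and transcribe each one. The helper below only computes the window boundaries; its name and the fixed, non-overlapping windows are assumptions for illustration, and cutting at fixed positions can split a word in two.

```python
def chunk_bounds(n_samples: int, sr: int = 16000, chunk_s: float = 30.0) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive windows of at most chunk_s seconds."""
    step = int(sr * chunk_s)
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]


# A 70-second clip at 16 kHz yields windows of 30 s, 30 s, and 10 s.
bounds = chunk_bounds(n_samples=70 * 16000)
print(bounds)

# Each window would then go through the same steps as the example above,
# e.g. audio[start:end] for start, end in bounds, and the texts joined in order.
```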
