---
license: apache-2.0
language:
- ln
tags:
- automatic-speech-recognition
- speech-recognition
- whisper
- lingala
- bantu-languages
- lora
- peft
datasets:
- google/waxal
base_model: openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
library_name: peft
---

BLI ASR 0

BLI ASR 0 is an automatic speech recognition model for Lingala developed by the Bantu Language Initiative.

The model is based on OpenAI Whisper large-v3 and adapted with LoRA for Lingala speech transcription. It is intended as an early research and community-oriented ASR model for under-resourced Bantu languages, starting with Lingala.

Project website and examples:
https://bantulanguageinitiative.com/en

Quick Test Dataset

A small cropped real-world Lingala audio benchmark is available here:

BantuLanguagesInitiative/lingala_real_eval_benchmark_croped

It contains short audio clips from several domains such as news, catechesis, comedy, cartoon and interview speech. The inference notebook loads this dataset directly, plays the selected audio sample, and transcribes it with BLI ASR 0.

Model Description

  • Model name: BLI ASR 0
  • Task: Automatic Speech Recognition
  • Language: Lingala
  • Base model: openai/whisper-large-v3
  • Adaptation method: LoRA / PEFT
  • Training dataset: Waxal Lingala ASR
  • Output: Lingala transcription from speech audio

This model transcribes Lingala speech into text. It is not a translation model.

Dataset

The model was trained on the Waxal Lingala ASR dataset.

The dataset was split into:

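| Split      | Approx. number of samples | Usage                         |
|------------|---------------------------|-------------------------------|
| Train      | 14,400                    | Model training                |
| Validation | 1,844                     | Validation during development |
| Test       | 1,866                     | Final held-out evaluation     |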

Text Post-processing

We applied a light normalization pipeline to the training and evaluation transcriptions.

The goal was not to impose a strict Lingala orthography, but to reduce noise and improve consistency. The post-processing included:

  • Unicode normalization
  • lowercasing
  • whitespace normalization
  • punctuation and symbol cleanup
  • preservation of the original raw transcription when available
  • creation of a normalized transcription field used for training/evaluation

We intentionally avoided aggressive spelling correction because Lingala has substantial orthographic variation across speakers, regions, and data sources.
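
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the NFC normalization form, the choice to keep apostrophes and hyphens, and the field names `transcription` and `transcription_normalized` are all assumptions made for the example. Punctuation is detected by Unicode category rather than with `\w`, so that combining tone marks (as in ɛ́ or ɔ́) survive the cleanup.

```python
import unicodedata

# Assumption: apostrophes and hyphens are kept because they can occur inside words.
KEEP = {"'", "-"}


def normalize_transcription(text: str) -> str:
    """Light normalization: Unicode NFC, lowercase, punctuation/symbol cleanup, whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    cleaned = []
    for ch in text:
        # Unicode categories starting with "P" (punctuation) or "S" (symbol) are replaced
        # by a space; letters, digits and combining marks are left untouched.
        is_punct_or_symbol = unicodedata.category(ch)[0] in ("P", "S")
        cleaned.append(" " if is_punct_or_symbol and ch not in KEEP else ch)
    # split() + join collapses any run of whitespace and trims both ends.
    return " ".join("".join(cleaned).split())


def add_normalized_field(example: dict) -> dict:
    """Keep the raw transcription and add a normalized copy (hypothetical field names)."""
    out = dict(example)
    out["transcription_normalized"] = normalize_transcription(example["transcription"])
    return out


print(normalize_transcription("  Mbote,   NA  yo! "))
```

No spelling correction is attempted here, in line with the choice to leave orthographic variants as they are.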

Training Details

The model was fine-tuned from openai/whisper-large-v3 using LoRA.

Main training choices:

| Parameter           | Value                                           |
|---------------------|-------------------------------------------------|
| Base model          | openai/whisper-large-v3                         |
| Fine-tuning method  | LoRA                                            |
| Task token          | transcribe                                      |
| Language token      | Lingala                                         |
| Precision           | bf16                                            |
| Optimizer           | AdamW                                           |
| Evaluation strategy | small random validation subsets during training |
| Final evaluation    | full validation/test split                      |
| Dataset             | Waxal Lingala ASR                               |
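
Evaluating on a small random validation subset keeps intermediate checks cheap. How the subsets were actually drawn is not described here, so the sketch below shows only one common way to do it: the helper name `sample_eval_indices`, the subset size of 200, and the use of the training step as a seed are hypothetical choices.

```python
import random


def sample_eval_indices(num_examples: int, subset_size: int, seed: int) -> list[int]:
    """Return a sorted, reproducible random subset of example indices."""
    rng = random.Random(seed)  # local RNG, so the global random state is not touched
    k = min(subset_size, num_examples)
    return sorted(rng.sample(range(num_examples), k))


# Example: draw 200 of the ~1,844 validation samples, seeded by a (hypothetical) step number.
indices = sample_eval_indices(num_examples=1844, subset_size=200, seed=1000)
print(len(indices), indices[:5])
```

Seeding per evaluation step gives a different subset each time while keeping every run reproducible; the final numbers would still come from the full validation and test splits.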

Performance

We report CER rather than WER for this release.

| Metric                | Value           |
|-----------------------|-----------------|
| CER (normalized text) | 0.1703 (17.03%) |

We do not report WER in this first release because WER is not fully fair for the current Lingala ASR setting. Lingala does not yet have a single widely enforced normalized orthography in our data, and WER strongly penalizes spelling variants, segmentation differences, and silence-related insertions/deletions. We plan to release a corrected WER metric that better accounts for linguistic and contextual variation.
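
To make the difference concrete, here is a small sketch of both metrics using the standard Levenshtein edit distance. It is not the scoring script used for this release, and the sentence pair is only an illustration of a spelling variant (`ndako` / `ndaku`).

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


ref = "nazali na ndako"
hyp = "nazali na ndaku"   # one-character spelling variant
print(f"CER = {cer(ref, hyp):.4f}")  # 1 edit over 15 characters
print(f"WER = {wer(ref, hyp):.4f}")  # 1 wrong word out of 3
```

A single differing character costs one fifteenth of the reference at the character level but a full third at the word level, which is why spelling variation inflates WER much more than CER.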

Intended Use

This model can be used for:

  • Lingala speech transcription
  • research on low-resource ASR
  • dataset bootstrapping
  • assisted transcription before human correction
  • evaluation of ASR pipelines for Bantu languages

The model is especially useful as a first-pass transcription model before review by human annotators.

Limitations

This is an early release and still has important limitations:

  • silence handling still needs improvement
  • the model may hallucinate text during long silent regions
  • performance can degrade with music, jingles, intros, outros, and strong background noise
  • performance in real-world media with overlapping speech is still limited
  • the training data is not general enough to cover all common Lingala varieties
  • the model may struggle with recent slang, popular urban expressions, and code-switching
  • the model is not yet robust across all domains such as news, sermons, informal conversation, street interviews, and music-heavy content
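
One possible partial mitigation for the silence-related issues, which is not part of this release, is to trim leading and trailing silence before transcription. The sketch below is a simple energy-based trimmer in plain Python; the function name, the 20 ms frame size, and the -40 dB threshold are assumptions chosen for illustration, and it does nothing about long silences in the middle of a clip.

```python
import math


def trim_silence(samples, sample_rate=16000, frame_ms=20, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS level is below threshold_db.

    `samples` is a sequence of floats in [-1.0, 1.0]; the threshold is
    relative to full scale (amplitude 1.0).
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    threshold = 10 ** (threshold_db / 20)  # convert dB to linear amplitude

    voiced_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            voiced_starts.append(start)

    if not voiced_starts:
        return samples[:0]  # nothing above the threshold: return an empty clip

    first = voiced_starts[0]
    last = min(len(samples), voiced_starts[-1] + frame_len)
    return samples[first:last]


# 0.2 s of silence, 0.1 s of signal, 0.2 s of silence at 16 kHz
clip = [0.0] * 3200 + [0.5] * 1600 + [0.0] * 3200
print(len(trim_silence(clip)))
```

A fixed threshold like this will misfire on noisy recordings or music, so it is best treated as a starting point rather than a fix for the limitations listed above.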

Example Inference in a Notebook

!pip install -U transformers peft accelerate soundfile librosa torchao

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import librosa

base_model = "openai/whisper-large-v3"
adapter_model = "BantuLanguagesInitiative/bli-asr-0"

# Use bf16 on GPU (the training precision) and fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# The processor bundles the feature extractor and the tokenizer of the base model.
processor = WhisperProcessor.from_pretrained(
    base_model,
    language="lingala",
    task="transcribe",
)

model = WhisperForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=dtype,
)

# Load the LoRA adapter, then merge it into the base weights for plain inference.
model = PeftModel.from_pretrained(model, adapter_model)
model = model.merge_and_unload()
model.to(device)
model.eval()

# Whisper expects 16 kHz mono audio; librosa resamples on load.
# Note: the feature extractor pads or truncates each clip to 30 seconds.
audio_path = "example.mp3"
audio, sr = librosa.load(audio_path, sr=16000)

inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt",
)

# Match the model's device and dtype.
input_features = inputs.input_features.to(device=device, dtype=dtype)

# Force the Lingala language token and the transcribe task token at the start of decoding.
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="lingala",
    task="transcribe",
)

with torch.no_grad():
    generated_ids = model.generate(
        input_features,
        forced_decoder_ids=forced_decoder_ids,
        max_new_tokens=225,
    )

# Decode token IDs to text, dropping special tokens.
text = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)[0]

print(text)
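
The example above transcribes a single clip, and the Whisper feature extractor pads or truncates its input to 30 seconds, so longer recordings need to be split first. The helper below is a minimal sketch of fixed-length, non-overlapping chunking; `split_into_chunks` is a hypothetical name, and cutting at fixed boundaries can split words, so a real pipeline may prefer overlap or silence-aware boundaries.

```python
def split_into_chunks(num_samples: int, sample_rate: int = 16000, chunk_seconds: int = 30):
    """Return (start, end) sample indices of consecutive chunks covering the audio."""
    chunk_len = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A 70-second recording at 16 kHz yields two full chunks and one 10-second remainder.
for start, end in split_into_chunks(16000 * 70):
    print(start, end)
    # Each slice audio[start:end] could then go through the same
    # feature extraction and generate() steps as a single clip.
```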