| Language | WER | vs V1 |
|---|---|---|
| 🇳🇬 Pidgin | 14.7% | ↓ 2.1pp |
| 🇳🇬 Nigerian English | 19.6% | ↓ 1.5pp |
| 🇳🇬 Yoruba | 22.3% | ↓ 6.5pp |
| 🇳🇬 Hausa | 25.8% | ↓ 5.2pp |
| 🇳🇬 Igbo | 30.5% | ↓ 11.4pp |

Nigeria's Voice in AI. Now Sharper.

NaijaVox-2.0 is the second generation of Axiveri's open-weight automatic speech recognition model for Nigerian languages: Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 with PEFT LoRA fine-tuning, NaijaVox-2.0 lowers WER by 1.5 to 11.4 percentage points per language relative to V1 through a larger and more diverse training corpus (25,866 samples across 7 datasets), deeper LoRA adaptation (r=64 targeting attention and feed-forward layers), SpecAugment, and realistic noise augmentation for real-world robustness.

"Every Nigerian deserves to be heard and understood by AI, in their own language, with their own voice."

← NaijaVox-V1: the original model


📈 V1 → V2 Improvement

Evaluated on identical test sets with identical methodology (50 samples/language, strict WER, no normalization):

| Language | V1 WER | V2 WER | Absolute Δ | Relative Gain |
|---|---|---|---|---|
| 🇳🇬 Yoruba | 28.8% | 22.3% | −6.5pp | +22.6% |
| 🇳🇬 Hausa | 31.0% | 25.8% | −5.2pp | +16.8% |
| 🇳🇬 Igbo | 41.9% | 30.5% | −11.4pp | +27.2% |
| 🇳🇬 Nigerian English | 21.1% | 19.6% | −1.5pp | +7.1% |
| 🇳🇬 Nigerian Pidgin | 16.8% | 14.7% | −2.1pp | +12.5% |
| Average | 27.9% | 22.58% | −5.3pp | +19.1% |

Igbo sees the largest jump (+27.2% relative), driven by WaxalNLP Igbo TTS data and Nigerian Common Voice Igbo samples, combined with SpecAugment frequency masking.
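
The two gain columns can be reproduced from the V1 and V2 WER figures alone. The sketch below uses only numbers from the table above; the function name `gains` is illustrative and is not part of the released code.

```python
# WER figures (percent) copied from the V1 -> V2 comparison table.
WER = {
    "Yoruba": (28.8, 22.3),
    "Hausa": (31.0, 25.8),
    "Igbo": (41.9, 30.5),
    "Nigerian English": (21.1, 19.6),
    "Nigerian Pidgin": (16.8, 14.7),
}


def gains(v1: float, v2: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative gain in percent)."""
    absolute_pp = v2 - v1
    relative_pct = (v1 - v2) / v1 * 100
    return absolute_pp, relative_pct


for language, (v1, v2) in WER.items():
    delta, rel = gains(v1, v2)
    print(f"{language:<17} {delta:+.1f}pp  {rel:+.1f}%")

# Unweighted mean over the five languages (50 samples each).
avg_v1 = sum(v1 for v1, _ in WER.values()) / len(WER)
avg_v2 = sum(v2 for _, v2 in WER.values()) / len(WER)
print(f"{'Average':<17} {avg_v1:.2f}% -> {avg_v2:.2f}%")
```

The absolute change is a difference of percentages (percentage points), while the relative gain divides that difference by the V1 error rate, which is why the two columns rank the languages differently.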


🗣️ Languages Supported

| Language | ISO Code | Script | Token |
|---|---|---|---|
| Yoruba | yo | Latin + full diacritics (ẹ, ọ, ṣ, à, á, etc.) | `<\|yo\|>` |
| Hausa | ha | Latin + special chars (ƙ, ƴ, ɗ, etc.) | `<\|ha\|>` |
| Igbo | ig | Latin + diacritics | `<\|ig\|>` |
| Nigerian Pidgin | pcm | Latin | `<\|pcm\|>` |
| Nigerian English | en | Latin | `<\|en\|>` |

Note: <|ig|> and <|pcm|> are custom language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.


🚀 Quick Start

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-2.0",
    device=0  # use GPU, or remove for CPU
)

result = pipe("your_audio.wav")
print(result["text"])

Specifying Language

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model     = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-2.0")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-2.0")
vocab     = processor.tokenizer.get_vocab()

LANG_TOKENS = {
    "yoruba":           "<|yo|>",
    "hausa":            "<|ha|>",
    "igbo":             "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin":           "<|pcm|>",
}

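def transcribe(audio_array, sampling_rate, language="yoruba"):
    """Transcribe one utterance. Whisper's feature extractor expects 16 kHz audio."""
    if language not in LANG_TOKENS:
        raise ValueError(f"Unsupported language: {language!r}")

    # Force the decoder prompt: language token, task token, no timestamps.
    lang_id         = vocab[LANG_TOKENS[language]]
    transcribe_id   = vocab["<|transcribe|>"]
    notimestamps_id = vocab["<|notimestamps|>"]
    forced_ids      = [[1, lang_id], [2, transcribe_id], [3, notimestamps_id]]

    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            # Leave room for the 4 prompt tokens within Whisper's 448 decoder positions.
            max_new_tokens=444,
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)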
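
The `transcribe` function takes raw samples and a sampling rate, and Whisper's feature extractor rejects audio that is not 16 kHz. As a minimal sketch, assuming the input is a 16-bit PCM WAV file already at 16 kHz, the samples can be read with the standard library alone. The helper name `load_wav_mono_16k` is hypothetical and not part of this repository.

```python
import struct
import wave


def load_wav_mono_16k(path: str) -> tuple[list[float], int]:
    """Read a 16-bit PCM WAV file and return (samples in [-1, 1], sample rate)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        rate = wav.getframerate()
        if rate != 16000:
            raise ValueError(f"expected 16 kHz audio, got {rate} Hz; resample first")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    # Samples are little-endian signed 16-bit integers, interleaved by channel.
    ints = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Average the channels of each frame down to mono, then scale to [-1, 1].
    samples = [
        sum(ints[i:i + channels]) / channels / 32768.0
        for i in range(0, len(ints), channels)
    ]
    return samples, rate
```

The returned pair can then be passed on as `transcribe(samples, rate, language="igbo")`. For other formats or sample rates, an audio library that resamples to 16 kHz would be needed instead.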

📊 Benchmark Results

Evaluated on FLEURS test splits (Yoruba, Hausa, Igbo), Nigerian Pidgin ASR test set, and Nigerian Accented English dataset. 50 samples per language, greedy decoding, strict WER via jiwer (no text normalization). Identical methodology to V1 for direct comparison.

| Language | WER (%) | Accuracy (%) | Test Set | Samples |
|---|---|---|---|---|
| 🇳🇬 Nigerian Pidgin | 14.7 | 85.3 | asr-nigerian-pidgin/nigerian-pidgin-1.0 | 50 |
| 🇳🇬 Nigerian English | 19.6 | 80.4 | benjaminogbonna/nigerian_accented_english | 50 |
| 🇳🇬 Yoruba | 22.3 | 77.7 | google/fleurs yo_ng | 50 |
| 🇳🇬 Hausa | 25.8 | 74.2 | google/fleurs ha_ng | 50 |
| 🇳🇬 Igbo | 30.5 | 69.5 | google/fleurs ig_ng | 50 |
| Average | 22.58 | 77.42 | - | 250 |

Lower WER is better; accuracy here is 100 − WER. Human-level transcription is roughly 5–10% WER.
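
For readers unfamiliar with the metric, strict WER is the word-level edit distance between reference and hypothesis divided by the number of reference words, with no lowercasing or punctuation stripping. The benchmark itself uses jiwer; the pure-Python function below is only an illustrative sketch of the same definition, and the name `word_error_rate` is not from this repository. The example pair is one of the Pidgin samples shown in this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of iteration i, prev[j] is the distance between
    # the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "on top di injury and her uncle no even carry her go hospital for treatment"
# One inserted word ("and") against 14 reference words.
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

Because nothing is normalized, "Yes." and "yes" count as different words. A corpus-level score pools errors and reference words across all utterances rather than averaging per-utterance rates.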


🛡️ Robustness Improvements over V1

SpecAugment

Frequency masking (up to 27 mel bins) and time masking (up to 100 time steps) applied to mel spectrograms during training. This prevents over-reliance on specific frequency bands or time positions, improving generalization to real-world recordings.
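
A minimal sketch of the masking step follows, written with plain lists so it runs without extra dependencies. Only the maximum widths (27 mel bins, 100 time steps) come from this card. Applying one mask of each kind, drawing widths uniformly, and filling with 0.0 are assumptions for illustration, and `spec_augment` is a hypothetical name rather than the training code.

```python
import random


def spec_augment(mel, max_freq_mask=27, max_time_mask=100, mask_value=0.0, rng=random):
    """Return a copy of `mel` (a list of mel-bin rows) with one frequency band
    and one time span overwritten by `mask_value`."""
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]

    # Frequency mask: blank f consecutive mel bins, 0 <= f <= max_freq_mask.
    f = rng.randint(0, min(max_freq_mask, n_mels))
    f0 = rng.randint(0, n_mels - f)
    for m in range(f0, f0 + f):
        out[m] = [mask_value] * n_frames

    # Time mask: blank t consecutive frames, 0 <= t <= max_time_mask.
    t = rng.randint(0, min(max_time_mask, n_frames))
    t0 = rng.randint(0, n_frames - t)
    for row in out:
        row[t0:t0 + t] = [mask_value] * t

    return out


# Toy spectrogram: 128 mel bins by 300 frames, all ones.
mel = [[1.0] * 300 for _ in range(128)]
augmented = spec_augment(mel, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```

A real pipeline would apply this to tensors inside the data collator, and only during training, so evaluation inputs stay unmasked.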

Noise Augmentation

30% of training samples received realistic background noise injection at random SNR levels before mel extraction. This directly trains the model for common Nigerian recording conditions: market noise, phone compression artifacts, outdoor ambient sound, and crowd audio.
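
Mixing noise at a target SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the requested decibel value. The sketch below illustrates this on plain lists of samples. The 30% probability is from this card; the SNR range of 5 to 20 dB and the function names are assumptions, since the card only says "random SNR levels".

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`."""
    # Loop the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        return list(speech)
    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def maybe_add_noise(speech, noise, p=0.3, snr_range=(5.0, 20.0), rng=random):
    """With probability `p`, mix in noise at an SNR drawn uniformly from `snr_range`."""
    if rng.random() >= p:
        return list(speech)
    return mix_at_snr(speech, noise, rng.uniform(*snr_range))
```

Applying the mix to the waveform, before mel extraction, matches the order described above and lets SpecAugment operate on the already noisy spectrogram.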

Code-Switching Robustness

Trained on Nigerian Pidgin and Nigerian English together with Yoruba, Hausa, and Igbo, all of which contain natural code-switching patterns present in everyday Nigerian speech, media, and social content.


🎙️ Sample Transcriptions

Real audio samples from the FLEURS test, Nigerian English, and Pidgin datasets: data the model never saw during training. Transcriptions were generated by the published merged model.

Yoruba

| Reference Audio | NaijaVox-2.0 Output |
|---|---|
| àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn | àwọn èèyàn ti mọ̀ nípa àwọn kẹmíkà pèèpèé bí wúrà fàdákà àti kọpa àtijọ́ torí pé a lè rí wọn |
| àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi | àwọn ará ìrà náà ló kọ́kọ́ bẹ̀rẹ̀ sí ní sin ewúrẹ́ ní bí ọdún 1500 sẹ́yìn ní òkè sagrosi |

Hausa

| Reference Audio | NaijaVox-2.0 Output |
|---|---|
| an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙi | an kwatanta feretin gine-ginen da ke yin sararin samaniya hong kong da ginshiki mai walƙiy |
| aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga ab | aristotle masanin falsafani ya yi tunanin cewa kome ya kunshi ca kuda daya ko fiye daga ab |

Igbo

| Reference Audio | NaijaVox-2.0 Output |
|---|---|
| ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu | akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwà nye monto |
| ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka | ka agha adara mba briten jiri ndị agha elu mmiri gbochie ndị jamanị inweta enyemaka |

Nigerian English

| Reference Audio | NaijaVox-2.0 Output |
|---|---|
| Did it change plain? Yes. yes. Ok that means he was correct so this is if he's right that | Did it change green? Yes. Ok that means she was correct. So this is if its red then its no |
| Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University. | Ebube Nwagbo studied Mass Communication at Nnamdi Azikiwe University. |

Nigerian Pidgin

| Reference Audio | NaijaVox-2.0 Output |
|---|---|
| on top di injury her uncle no even carry her go hospital for treatment | on top di injury and her uncle no even carry her go hospital for treatment |
| she tell don jazzy for december 2016 say as she be | she tell don jazzy for december 2016 say i should be |

🏗️ Model Architecture

Input Audio (16 kHz)
        │
        ▼
Whisper-large-v3 Encoder  (frozen during fine-tuning)
        │  1500 × 1280 features
        ▼
Whisper Decoder + LoRA    (r=64, alpha=128, fine-tuned)
  target modules: q_proj, k_proj, v_proj, out_proj, fc1, fc2
  V1: attention only (q/k/v/out); V2: adds feed-forward (fc1/fc2)
        │
        ▼
Extended Tokenizer         (vocab: 51,868 tokens)
  + <|ig|> Igbo token
  + <|pcm|> Nigerian Pidgin token
        │
        ▼
Transcript

V2 publishes a fully merged standalone model, so no PEFT dependency is required. Load it directly with transformers.
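
"Merged" means the low-rank update has been folded into the base weights, so inference needs no adapter code. In the standard LoRA formulation the merged weight is W + (alpha / r) · B · A. The toy example below sketches that arithmetic on tiny matrices; it is not the author's export script, and the card does not say which utility performed the merge (PEFT's `merge_and_unload()` is one common option).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the weight a merged checkpoint stores."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: W is 2x3 (d_out x d_in), rank r = 1, so B is 2x1 and A is 1x3.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.0]]
B = [[1.0],
     [2.0]]
merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)
```

After merging, the layer has the same shape and cost as the original linear layer, which is why the published checkpoint loads with plain `transformers`.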


📦 Training Details

| Parameter | V1 | V2 |
|---|---|---|
| Base model | openai/whisper-large-v3 | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) | LoRA (PEFT) |
| LoRA rank | 32 | 64 |
| LoRA alpha | 64 | 128 |
| Target modules | q/k/v/out_proj | q/k/v/out_proj + fc1/fc2 |
| LoRA dropout | 0.05 | 0.05 |
| Training precision | fp16 | fp16 |
| Effective batch size | 16 | 32 |
| Learning rate | 1e-3 | 5e-4 |
| Warmup steps | 50 | 200 |
| Epochs (best) | 2 | 3 of 5 |
| SpecAugment | ❌ | ✅ |
| Noise augmentation | ❌ | ✅ (30% of samples) |
| Total training samples | 13,866 | 25,866 |
| GPU | Tesla T4 × 2 (Kaggle) | Tesla T4 × 2 (Kaggle) |
| Total training time | ~20 hours | ~40 hours |
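
To make the rank and target-module rows concrete: LoRA adds r · (d_in + d_out) trainable parameters to each adapted linear layer, and the update is scaled by alpha / r. The sketch below applies that to one attention block (q/k/v/out) and, for V2, one feed-forward pair (fc1/fc2). The model width of 1280 appears in the architecture diagram; the feed-forward width of 5120 is an assumption taken from the Whisper large configuration and is not stated in this card. Function names are illustrative.

```python
D_MODEL = 1280  # feature width shown in the architecture diagram
D_FFN = 5120    # assumption: feed-forward width of Whisper large


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    return r * (d_in + d_out)


def per_block_params(r: int, include_ffn: bool) -> int:
    """LoRA parameters for one attention block (q/k/v/out) plus, optionally, fc1 and fc2."""
    total = 4 * lora_params(D_MODEL, D_MODEL, r)
    if include_ffn:
        total += lora_params(D_MODEL, D_FFN, r)  # fc1: 1280 -> 5120
        total += lora_params(D_FFN, D_MODEL, r)  # fc2: 5120 -> 1280
    return total


CONFIGS = {
    "V1": {"r": 32, "alpha": 64, "include_ffn": False},
    "V2": {"r": 64, "alpha": 128, "include_ffn": True},
}

for name, cfg in CONFIGS.items():
    scale = cfg["alpha"] / cfg["r"]
    count = per_block_params(cfg["r"], cfg["include_ffn"])
    print(f"{name}: scale={scale:.1f}  params per block={count:,}")
```

Both versions keep alpha / r at 2, so V2 raises adapter capacity (higher rank, more target modules) without changing the scaling applied to the update. These are per-block figures only; the total depends on how many layers the adapters were attached to, which the card does not enumerate.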

Training Datasets

| Dataset | Language(s) | Samples | New in V2 |
|---|---|---|---|
| google/fleurs (yo_ng, ha_ng, ig_ng) | Yoruba, Hausa, Igbo | 8,437 | - |
| benjaminogbonna/nigerian_accented_english_dataset | Nigerian English | 2,721 | - |
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | Nigerian Pidgin | 2,708 | - |
| Tundragoon/IroyinSpeech | Yoruba | 2,500 | ✅ |
| google/WaxalNLP (ha/ig/yo/pcm) | Hausa, Igbo, Yoruba, Pidgin | 6,000 | ✅ |
| benjaminogbonna/nigerian_common_voice_dataset | en/ha/ig/yo | 2,000 | ✅ |
| vpetukhov/bible_tts_hausa | Hausa | 1,500 | ✅ |
| Total | 5 languages | 25,866 | |

✅ Intended Use

  • 🏦 Fintech & banking: voice transactions and customer service in Nigerian languages
  • 📱 Mobile apps: voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
  • 🎙️ Media & journalism: transcribing interviews and broadcasts
  • 🏥 Healthcare: patient intake and medical documentation
  • 📚 Education: language learning tools and accessibility
  • 🔬 Research: low-resource ASR study for West African languages
  • ♿ Accessibility: assistive technology for Nigerians with disabilities

🚫 Prohibited Use

  • ❌ Non-consensual surveillance: transcribing calls without consent of all parties
  • ❌ Fraud facilitation: forging spoken statements or supporting advance-fee fraud
  • ❌ Deepfake pipelines: combining with TTS to fake audio attributed to real people
  • ❌ Discriminatory systems: denying services based on language or accent identification
  • ❌ Political disinformation: generating or verifying false transcripts of political speech

👤 Creator

Emmanuel Ariyo (Ememzyvisuals), Founder, Axiveri

NaijaVox was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Nigerian language speakers.


👥 About Axiveri

Axiveri is building Africa's AI infrastructure: open models, open data, and open tools for African languages and developers.


📄 Citation

@misc{naijavox2026,
  title        = {NaijaVox-2.0: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-2.0}}
}

📜 License

This model is released under the Apache License 2.0 with the following additional behavioral restrictions. Use of this model constitutes acceptance of both the Apache 2.0 terms and these conditions.

Apache 2.0 Terms

Free to use, modify, distribute, and use commercially with attribution. Full terms: apache.org/licenses/LICENSE-2.0

Additional Conditions (Binding)

The following uses are explicitly prohibited regardless of the Apache 2.0 permissions:

  1. Non-consensual surveillance: transcribing private calls or conversations without the informed consent of all parties involved.
  2. Fraud and impersonation: using transcription output to forge or misrepresent spoken statements, support advance-fee fraud, or impersonate individuals.
  3. Synthetic media abuse: combining this model with TTS systems to fabricate audio-text pairs attributed to real, identifiable people without their consent.
  4. Discriminatory gatekeeping: using the model's language or accent detection to deny individuals access to services, employment, housing, or legal rights.
  5. Political disinformation: generating, falsifying, or selectively editing transcripts of political speech to deceive or manipulate public opinion.

These restrictions constitute a binding behavioral license condition under Axiveri's Responsible AI Use Policy. Violation of these conditions terminates your license to use this model.


Built in Nigeria 🇳🇬, for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
Second model in the NaijaVox series by Axiveri
