| Language | WER |
|---|---|
| 🇳🇬 Pidgin | 16.8% |
| 🇳🇬 Nigerian English | 21.1% |
| 🇳🇬 Yoruba | 28.8% |
| 🇳🇬 Hausa | 31.0% |
| 🇳🇬 Igbo | 41.9% |
Nigeria's Voice in AI.
NaijaVox-V1 is the first open-weight automatic speech recognition model fine-tuned specifically for Nigerian languages: Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 using parameter-efficient LoRA fine-tuning, NaijaVox is designed to be practical, deployable, and accessible to Nigerian developers and researchers.
"Every Nigerian deserves to be heard and understood by AI, in their own language, with their own voice."
Languages Supported
| Language | ISO Code | Script | Token |
|---|---|---|---|
| Yoruba | yo | Latin + full diacritics (ẹ, ọ, ṣ, à, á, etc.) | <\|yo\|> |
| Hausa | ha | Latin + special chars (ƙ, ƴ, ɗ, etc.) | <\|ha\|> |
| Igbo | ig | Latin + diacritics | <\|ig\|> |
| Nigerian Pidgin | pcm | Latin | <\|pcm\|> |
| Nigerian English | en | Latin | <\|en\|> |
Note: <|ig|> and <|pcm|> are new language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.
Quick Start
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-V1",
    device=0,  # use GPU, or remove for CPU
)
result = pipe("your_audio.wav")
print(result["text"])
Specifying Language
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-V1")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-V1")
vocab = processor.tokenizer.get_vocab()

# Map language to token
LANG_TOKENS = {
    "yoruba": "<|yo|>",
    "hausa": "<|ha|>",
    "igbo": "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin": "<|pcm|>",
}

def transcribe(audio_array, sampling_rate, language="yoruba"):
    # Look up the ids of the prompt tokens in the extended vocabulary
    lang_id = vocab[LANG_TOKENS[language]]
    transcribe_id = vocab["<|transcribe|>"]  # renamed so it does not shadow the function
    notimestamps_id = vocab["<|notimestamps|>"]
    # Pin decoder positions 1-3 to the language, task, and no-timestamps tokens
    forced_ids = [[1, lang_id], [2, transcribe_id], [3, notimestamps_id]]
    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features
    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            max_new_tokens=448,
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)
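The `forced_ids` list in `transcribe` pairs each decoder position with the token id that must appear there. The sketch below reproduces that construction with a toy vocabulary so it runs without downloading the model. The integer ids in `TOY_VOCAB` are invented for illustration and are not the real NaijaVox token ids, and `build_forced_decoder_ids` is a hypothetical helper name, not part of this repository.

```python
# Toy vocabulary: these ids are made up for illustration only.
TOY_VOCAB = {
    "<|startoftranscript|>": 100,
    "<|en|>": 101,
    "<|yo|>": 102,
    "<|ha|>": 103,
    "<|ig|>": 104,
    "<|pcm|>": 105,
    "<|transcribe|>": 200,
    "<|notimestamps|>": 201,
}

LANG_TOKENS = {
    "yoruba": "<|yo|>",
    "hausa": "<|ha|>",
    "igbo": "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin": "<|pcm|>",
}


def build_forced_decoder_ids(vocab, language):
    """Return [position, token_id] pairs that pin the decoder prompt."""
    if language not in LANG_TOKENS:
        raise ValueError(f"unsupported language: {language!r}")
    prompt = [LANG_TOKENS[language], "<|transcribe|>", "<|notimestamps|>"]
    missing = [tok for tok in prompt if tok not in vocab]
    if missing:
        raise KeyError(f"tokens missing from vocabulary: {missing}")
    # Position 0 holds <|startoftranscript|>, so forced tokens start at 1.
    return [[pos, vocab[tok]] for pos, tok in enumerate(prompt, start=1)]


print(build_forced_decoder_ids(TOY_VOCAB, "igbo"))
```

Checking for missing tokens up front is a design choice for this sketch: if the base Whisper tokenizer were loaded instead of the extended one, `<|ig|>` and `<|pcm|>` would be absent, and a clear error is easier to diagnose than a bare `KeyError` deep inside a transcription call.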
Benchmark Results
Evaluated on the FLEURS test splits (Yoruba, Hausa, Igbo), the Nigerian Pidgin ASR test set, and the Nigerian Accented English dataset, using 50 samples per language and greedy decoding.
| Language | WER (%) | Test Set | Samples |
|---|---|---|---|
| 🇳🇬 Nigerian Pidgin | 16.8 | asr-nigerian-pidgin/nigerian-pidgin-1.0 | 50 |
| 🇳🇬 Nigerian English | 21.1 | benjaminogbonna/nigerian_accented_english | 50 |
| 🇳🇬 Yoruba | 28.8 | google/fleurs yo_ng | 50 |
| 🇳🇬 Hausa | 31.0 | google/fleurs ha_ng | 50 |
| 🇳🇬 Igbo | 41.9 | google/fleurs ig_ng | 50 |
| Average | 27.9 | - | 250 |
Lower WER is better. Human-level transcription is approximately 5-10% WER.
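For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + deletions + insertions) between the model output and the reference, divided by the number of reference words. The following is a minimal sketch of that definition. It assumes plain whitespace tokenization and no text normalization; the card does not say which normalization or scoring tool was used for the table above, so scores from this sketch may not match it exactly. The example pair is the first Nigerian Pidgin sample from the Sample Transcriptions section.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word edits / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "untop di injury and her uncle no even carry her go hospital for treatment"

# 3 word edits over 14 reference words
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 21.4%
```

Because WER divides by the reference length, a single merged word such as "untop" for "on top" counts as two errors (one deletion and one substitution), which is one reason short utterances can show high WER from small mistakes.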
Sample Transcriptions
Real audio samples from the FLEURS test, Nigerian English, and Pidgin datasets: data the model never saw during training. Transcriptions were generated live by the published model itself.
Yoruba
| Reference transcription | NaijaVox output |
|---|---|
| àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn ní ìṣẹ̀dá ní ipò àdáyéba wọn kò dẹ̀ le jù láti wú wọn jáda láti inú ilẹ̀ pẹ̀lú irinṣẹ́ àtilẹ̀bá | àwọn èèyàn ti mọ̀ nípa wọn kẹ́míkà pépè bí wúra fadaka àti kọpa àtijọ́ torí pé a lè rí wọ́n ní ìṣẹ̀dá ní ipò àdáyébá wọn kò dẹ̀ lé ju láti wu wọn jájájá látinú ilẹ̀ pẹ̀lú irinṣẹ́ à |
| àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi | àwọn ará ìránu olókọ́kọ́ bẹ̀rẹ̀ sí ni sin ewu rẹ̀ níbi ọdún 1500 sẹ́yìn ìní òkè ságrọ́sì |
Hausa
| Reference transcription | NaijaVox output |
|---|---|
| an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor | an kwatanta ferretton gine-ginen da ke yin sararin samaniya hong kong da yin shiki mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor |
| aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta | aristotle masanin falsafani ya yi tunanin cewa komai ya kunshi ca kuɗaɗaya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta |
Igbo
| Reference transcription | NaijaVox output |
|---|---|
| ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu nye ụmụ ntụgharị nke ihendọta | akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwanye ugwu onyewu mọntụgharị nke ihe ndọta |
| ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka | ta agha dara mba briteni jiri ndị agha elu mmiri gbochi ndị jamani inweta enyemaka |
Nigerian English
| Reference transcription | NaijaVox output |
|---|---|
| Closing the Google assistant app prevents it from working with your headphones. | Closing the Google assistant app prevents it from working with your headphones. ✓ |
| Head south on Ibo Road towards Emir Road | Head south on Ibo Road towards Emir Road ✓ |
Nigerian Pidgin
| Reference transcription | NaijaVox output |
|---|---|
| on top di injury her uncle no even carry her go hospital for treatment | untop di injury and her uncle no even carry her go hospital for treatment |
| she tell don jazzy for december 2016 say as she be | she tell don jazzy for december 2016 say as she be ✓ |
Model Architecture
Input Audio (16kHz)
        │
        ▼
Whisper-large-v3 Encoder (frozen during fine-tuning)
        │  1500 × 1280 features
        ▼
Whisper Decoder + LoRA (r=32, alpha=64, fine-tuned)
    target modules: q_proj, k_proj, v_proj, out_proj
    trainable params: 31,457,280 / 1,574,950,400 (1.997%)
        │
        ▼
Extended Tokenizer (vocab: 51,868 tokens)
    + <|ig|>  Igbo token
    + <|pcm|> Nigerian Pidgin token
        │
        ▼
Transcript
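The trainable-parameter figure in the diagram can be checked with simple arithmetic. A LoRA adapter of rank r on a linear layer with input size d_in and output size d_out adds r × d_in + d_out × r weights. The sketch below applies that to the four target modules, using the hidden size (1280) and layer counts (32 encoder, 32 decoder) from the public whisper-large-v3 configuration. It is an arithmetic check only, not the author's training code, and the per-layer module counts assume adapters sit on every matching attention projection.

```python
D_MODEL = 1280        # hidden size, matches the 1500 x 1280 encoder output
ENCODER_LAYERS = 32   # whisper-large-v3 public config
DECODER_LAYERS = 32
RANK = 32             # LoRA r reported in this card
PROJECTIONS = 4       # q_proj, k_proj, v_proj, out_proj


def lora_params_per_linear(d_in, d_out, rank):
    """Weights added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


per_projection = lora_params_per_linear(D_MODEL, D_MODEL, RANK)

# Each encoder layer has one attention block (self-attention).
encoder_modules = ENCODER_LAYERS * PROJECTIONS
# Each decoder layer has two attention blocks (self- and cross-attention).
decoder_modules = DECODER_LAYERS * PROJECTIONS * 2

decoder_only = decoder_modules * per_projection
encoder_and_decoder = (encoder_modules + decoder_modules) * per_projection

print(f"per projection:      {per_projection:,}")
print(f"decoder only:        {decoder_only:,}")
print(f"encoder and decoder: {encoder_and_decoder:,}")
print(f"share of total:      {encoder_and_decoder / 1_574_950_400:.3%}")
```

Under these assumptions the reported 31,457,280 equals the encoder-and-decoder total, while decoder-only adapters would give 20,971,520. That is consistent with adapters also being attached to the encoder's attention projections while the base encoder weights stay frozen, though the card does not confirm this.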
Training Details
| Parameter | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, out_proj |
| Training precision | fp16 |
| Batch size | 8 × gradient accumulation 2 = 16 effective |
| Learning rate | 1e-3 with 50 warmup steps |
| Epochs | 2 (selected by WER, not loss) |
| Total training samples | 13,866 |
| GPU | Tesla T4 × 2 (Kaggle) |
| Training time | ~20 hours total |
Training Datasets
| Dataset | Language | Samples | License |
|---|---|---|---|
| google/fleurs (yo_ng) | Yoruba | 2,339 | CC-BY 4.0 |
| google/fleurs (ha_ng) | Hausa | 3,259 | CC-BY 4.0 |
| google/fleurs (ig_ng) | Igbo | 2,839 | CC-BY 4.0 |
| benjaminogbonna/nigerian_accented_english_dataset | Nigerian English | 2,721 | Apache 2.0 |
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | Nigerian Pidgin | 2,708 | CC-BY 4.0 |
| Total | 5 languages | 13,866 |
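The two tables above are enough to estimate how many optimizer steps the run took. The sketch below sums the per-dataset sample counts and divides by the effective batch size of 16. It assumes every listed sample is seen once per epoch (no held-out validation split) and that the final partial batch is kept; neither detail is stated in the card, so treat the step counts as an estimate.

```python
import math

# Sample counts from the Training Datasets table.
TRAIN_SAMPLES = {
    "yoruba": 2_339,
    "hausa": 3_259,
    "igbo": 2_839,
    "nigerian_english": 2_721,
    "pidgin": 2_708,
}

BATCH_SIZE = 8
GRAD_ACCUMULATION = 2
EPOCHS = 2
WARMUP_STEPS = 50

total_samples = sum(TRAIN_SAMPLES.values())
effective_batch = BATCH_SIZE * GRAD_ACCUMULATION
# Assumption: the last partial batch still triggers an optimizer step.
steps_per_epoch = math.ceil(total_samples / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(f"total samples:   {total_samples:,}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup share:    {WARMUP_STEPS / total_steps:.1%}")

for language, count in TRAIN_SAMPLES.items():
    print(f"{language:>17}: {count / total_samples:.1%} of training data")
```

The per-language shares show the mix is fairly balanced, with Hausa the largest slice and Yoruba the smallest.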
Intended Use
NaijaVox-V1 is designed for:
- Fintech & banking: voice-based transactions and customer service in Nigerian languages
- Mobile apps: voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
- Media & journalism: transcribing interviews and broadcasts in Nigerian languages
- Healthcare: patient intake and medical documentation
- Education: language learning tools and accessibility
- Research: low-resource ASR study for West African languages
- Accessibility: assistive technology for Nigerians with disabilities
Prohibited Use
The following uses are explicitly prohibited under this model's responsible use policy:
- Non-consensual surveillance: transcribing phone calls, conversations, or audio without the knowledge and consent of all parties
- Fraud facilitation: using transcription output to generate misleading records, forge spoken statements, or support advance-fee fraud (419 scams)
- Deepfake pipelines: combining with TTS models to create fake audio-text pairs attributed to real people
- Discriminatory systems: building applications that deny services based on language or accent identification derived from this model
- Political disinformation: generating or verifying false transcripts of political speech
While the Apache 2.0 license permits broad commercial use, these prohibited uses apply as a binding behavioral restriction under Axiveri's Responsible AI Use Policy.
Roadmap
NaijaVox-V2 (Planned)
- Add ÌròyìnSpeech (42 hrs clean Yoruba, CC-BY); target Yoruba WER < 15%
- Add Mozilla Common Voice Nigerian Pidgin (14 hrs); target Pidgin WER < 10%
- Add OpenSLR SLR70 Lagos-accented English
- Noise augmentation (SpecAugment, crowd noise, phone audio)
- Code-switching support (Yoruba-English, Pidgin-English)
NaijaVox-Omni (Future)
- Combine NaijaVox STT + YarnGPT TTS into full speech pipeline
- Emotion and tone detection
- Real-time streaming transcription API
- Additional languages: Tiv, Igede, Efik, Ibibio
Creator
Emmanuel Ariyo (Ememzyvisuals), Founder, Axiveri
NaijaVox-V1 was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Yoruba, Hausa, Igbo, Nigerian Pidgin, and Nigerian English speakers.
About Axiveri
Axiveri is building Africa's AI infrastructure: open models, open data, and open tools for African languages and developers.
- Africlaude Series: African language models
- NaijaVox: Nigerian speech recognition (this model)
Citation
@misc{naijavox2026,
  title        = {NaijaVox-V1: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-V1}},
  note         = {Whisper-large-v3 fine-tuned on Yoruba, Hausa, Igbo,
                  Nigerian Pidgin and Nigerian-accented English}
}
License
Apache 2.0: free for commercial and research use with attribution. Additional behavioral restrictions apply as described in the Prohibited Use section above.
Built in Nigeria 🇳🇬, for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
First model in the NaijaVox series by Axiveri