Language	WER
🇳🇬 Pidgin	16.8%
🇳🇬 Nigerian English	21.1%
🇳🇬 Yoruba	28.8%
🇳🇬 Hausa	31.0%
🇳🇬 Igbo	41.9%

Nigeria's Voice in AI.

NaijaVox-V1 is the first open-weight automatic speech recognition model fine-tuned specifically for Nigerian languages — Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 using parameter-efficient LoRA fine-tuning, NaijaVox is designed to be practical, deployable, and accessible to Nigerian developers and researchers.

"Every Nigerian deserves to be heard and understood by AI — in their own language, with their own voice."

🗣️ Languages Supported

Language	ISO Code	Script	Token
Yoruba	`yo`	Latin + full diacritics (ẹ, ọ, ṣ, à, á, etc.)	`<|yo|>`
Hausa	`ha`	Latin + special chars (ƙ, ƴ, ɗ, etc.)	`<|ha|>`
Igbo	`ig`	Latin + diacritics	`<|ig|>`
Nigerian Pidgin	`pcm`	Latin	`<|pcm|>`
Nigerian English	`en`	Latin	`<|en|>`

Note: `<|ig|>` and `<|pcm|>` are new language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.

🚀 Quick Start

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-V1",
    device=0  # use GPU, or remove for CPU
)

result = pipe("your_audio.wav")
print(result["text"])

Specifying Language

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model     = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-V1")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-V1")
vocab     = processor.tokenizer.get_vocab()

# Map language to token
LANG_TOKENS = {
    "yoruba":           "<|yo|>",
    "hausa":            "<|ha|>",
    "igbo":             "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin":           "<|pcm|>",
}

def transcribe(audio_array, sampling_rate, language="yoruba"):
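    # Decoder prompt layout: position 0 is <|startoftranscript|>,
    # followed by the language, task, and timestamp-mode tokens.
    lang_id         = vocab[LANG_TOKENS[language]]
    task_id         = vocab["<|transcribe|>"]
    notimestamps_id = vocab["<|notimestamps|>"]
    forced_ids = [[1, lang_id], [2, task_id], [3, notimestamps_id]]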

    # The Whisper feature extractor expects 16 kHz audio; resample beforehand if needed.
    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            max_new_tokens=448
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)
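The forced-token list in `transcribe` can also be built by a small helper that fails early on an unsupported language or a tokenizer missing the added tokens. This is a minimal sketch, not part of the published code: `build_forced_decoder_ids` is a hypothetical name, and the toy vocabulary with its IDs is made up for illustration. In practice you would pass `processor.tokenizer.get_vocab()`.

```python
# Language names and tokens as listed in the model card.
LANG_TOKENS = {
    "yoruba": "<|yo|>",
    "hausa": "<|ha|>",
    "igbo": "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin": "<|pcm|>",
}


def build_forced_decoder_ids(vocab, language):
    """Return [[position, token_id], ...] for decoder positions 1 to 3.

    Position 0 is reserved for <|startoftranscript|>, so the language,
    task, and timestamp-mode tokens occupy positions 1, 2, and 3.
    """
    if language not in LANG_TOKENS:
        raise ValueError(
            f"Unsupported language {language!r}; choose from {sorted(LANG_TOKENS)}"
        )
    needed = [LANG_TOKENS[language], "<|transcribe|>", "<|notimestamps|>"]
    missing = [tok for tok in needed if tok not in vocab]
    if missing:
        # Typically means the base Whisper tokenizer was loaded instead of
        # the extended one shipped with this repository.
        raise KeyError(f"Tokenizer vocabulary is missing: {missing}")
    return [[pos, vocab[tok]] for pos, tok in enumerate(needed, start=1)]


# Toy vocabulary with made-up IDs, for illustration only.
toy_vocab = {"<|ig|>": 10, "<|transcribe|>": 20, "<|notimestamps|>": 30}
print(build_forced_decoder_ids(toy_vocab, "igbo"))
```

Checking the vocabulary up front turns a silent fallback to the wrong language token into an explicit error.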

📊 Benchmark Results

Evaluated on the FLEURS test splits (Yoruba, Hausa, Igbo), the Nigerian Pidgin ASR test set, and the Nigerian Accented English dataset, using 50 samples per language and greedy decoding.

Language	WER (%)	Test Set	Samples
🇳🇬 Nigerian Pidgin	16.8	asr-nigerian-pidgin/nigerian-pidgin-1.0	50
🇳🇬 Nigerian English	21.1	benjaminogbonna/nigerian_accented_english_dataset	50
🇳🇬 Yoruba	28.8	google/fleurs yo_ng	50
🇳🇬 Hausa	31.0	google/fleurs ha_ng	50
🇳🇬 Igbo	41.9	google/fleurs ig_ng	50
Average	27.9	—	250

Lower WER = better. Human-level transcription is typically cited at approximately 5–10% WER.
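For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain standard-library implementation, not the evaluation script used for this card. It applies no text normalization, which is an assumption, since the card does not state how casing and punctuation were handled. The example pair is the first Nigerian Pidgin sample from the Sample Transcriptions section.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word dropped)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "untop di injury and her uncle no even carry her go hospital for treatment"
print(f"Sample WER: {wer(reference, hypothesis):.1%}")  # 3 errors / 14 words -> 21.4%

# The reported average matches the unweighted mean of the five per-language scores.
per_language = [16.8, 21.1, 28.8, 31.0, 41.9]
print(f"Mean WER: {sum(per_language) / len(per_language):.1f}")  # 27.9
```

With only 50 utterances per language, a single sentence like this one can move a language's score noticeably, which is worth keeping in mind when comparing against other models.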

🎙️ Sample Transcriptions

Real audio samples from the FLEURS test, Nigerian English, and Pidgin datasets, none of which the model saw during training. All outputs were generated by the published model itself.

Yoruba

Reference Transcription	Audio	NaijaVox Output
àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn ní ìṣẹ̀dá ní ipò àdáyéba wọn kò dẹ̀ le jù láti wú wọn jáda láti inú ilẹ̀ pẹ̀lú irinṣẹ́ àtilẹ̀bá		àwọn èèyàn ti mọ̀ nípa wọn kẹ́míkà pépè bí wúra fadaka àti kọpa àtijọ́ torí pé a lè rí wọ́n ní ìṣẹ̀dá ní ipò àdáyébá wọn kò dẹ̀ lé ju láti wu wọn jájájá látinú ilẹ̀ pẹ̀lú irinṣẹ́ à
àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi		àwọn ará ìránu olókọ́kọ́ bẹ̀rẹ̀ sí ni sin ewu rẹ́ níbi ọdún 1500 sẹ́yìn ìní òkè ságrọ́sì

Hausa

Reference Transcription	Audio	NaijaVox Output
an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor		an kwatanta ferretton gine-ginen da ke yin sararin samaniya hong kong da yin shiki mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor
aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta		aristotle masanin falsafani ya yi tunanin cewa komai ya kunshi ca kuɗaɗaya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta

Igbo

Reference Transcription	Audio	NaijaVox Output
ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu nye ụmụ ntụgharị nke ihendọta		akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwanye ugwu onyewu mọntụgharị nke ihe ndọta
ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka		ta agha dara mba briteni jiri ndị agha elu mmiri gbochi ndị jamani inweta enyemaka

Nigerian English

Reference Transcription	Audio	NaijaVox Output
Closing the Google assistant app prevents it from working with your headphones.		Closing the Google assistant app prevents it from working with your headphones. ✅
Head south on Ibo Road towards Emir Road		Head south on Ibo Road towards Emir Road ✅

Nigerian Pidgin

Reference Transcription	Audio	NaijaVox Output
on top di injury her uncle no even carry her go hospital for treatment		untop di injury and her uncle no even carry her go hospital for treatment
she tell don jazzy for december 2016 say as she be		she tell don jazzy for december 2016 say as she be ✅

🏗️ Model Architecture

Input Audio (16kHz)
        │
        ▼
Whisper-large-v3 Encoder  (frozen during fine-tuning)
        │  1500 × 1280 features
        ▼
Whisper Decoder + LoRA    (r=32, alpha=64, fine-tuned)
  target modules: q_proj, k_proj, v_proj, out_proj
  trainable params: 31,457,280 / 1,574,950,400 (1.997%)
        │
        ▼
Extended Tokenizer         (vocab: 51,868 tokens)
  + <|ig|> Igbo token
  + <|pcm|> Nigerian Pidgin token
        │
        ▼
Transcript
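The trainable-parameter figure in the diagram can be sanity-checked with a little arithmetic. A LoRA adapter on a `d_in × d_out` linear layer adds `r × (d_in + d_out)` parameters. The sketch below assumes the standard Whisper-large layout (32 encoder layers, 32 decoder layers, hidden size 1280, with each decoder layer holding both a self-attention and a cross-attention block). Under that assumption, the reported 31,457,280 is reproduced exactly when adapters sit on the attention projections of both the encoder and the decoder. A decoder-only placement gives a smaller number. The card does not confirm where the adapters were attached, so treat this as one reading that fits the figure rather than a statement of the training setup.

```python
D_MODEL = 1280       # hidden size, matching the 1500 x 1280 encoder output above
ENCODER_LAYERS = 32  # assumption: standard Whisper-large depth
DECODER_LAYERS = 32  # assumption: standard Whisper-large depth
RANK = 32            # LoRA r from the model card


def lora_param_count(d_model, rank, attention_blocks, projections_per_block=4):
    """Parameters added by LoRA on square d_model x d_model projections.

    Each adapted projection gets A (rank x d_model) and B (d_model x rank).
    projections_per_block=4 covers q_proj, k_proj, v_proj, out_proj.
    """
    per_projection = rank * (d_model + d_model)
    return attention_blocks * projections_per_block * per_projection


encoder_blocks = ENCODER_LAYERS      # one self-attention block per layer
decoder_blocks = DECODER_LAYERS * 2  # self-attention + cross-attention

decoder_only = lora_param_count(D_MODEL, RANK, decoder_blocks)
encoder_and_decoder = lora_param_count(D_MODEL, RANK, encoder_blocks + decoder_blocks)

REPORTED_TRAINABLE = 31_457_280
REPORTED_TOTAL = 1_574_950_400

print(f"decoder only:        {decoder_only:,}")
print(f"encoder and decoder: {encoder_and_decoder:,}")
print(f"trainable share:     {100 * REPORTED_TRAINABLE / REPORTED_TOTAL:.3f}%")
```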

📦 Training Details

Parameter	Value
Base model	openai/whisper-large-v3
Fine-tuning method	LoRA (PEFT)
LoRA rank	32
LoRA alpha	64
Target modules	q_proj, k_proj, v_proj, out_proj
Training precision	fp16
Batch size	8 × gradient accumulation 2 = 16 effective
Learning rate	1e-3 with 50 warmup steps
Epochs	2 (selected by WER, not loss)
Total training samples	13,866
GPU	Tesla T4 × 2 (Kaggle)
Training time	~20 hours total
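The LoRA rows of the table map directly onto a PEFT configuration. The following is a hedged sketch rather than the actual training script: `build_lora_config` is a hypothetical helper, and any setting not listed in the table (dropout, bias handling, and so on) is left at the library default, which may differ from what was used. PEFT is imported lazily so the constants can be inspected without it installed.

```python
# Values taken from the Training Details table.
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj"],
}


def lora_scaling(r, lora_alpha):
    """Standard LoRA scaling factor applied to the low-rank update (alpha / r)."""
    return lora_alpha / r


def build_lora_config(**overrides):
    """Create a peft.LoraConfig from the table values plus optional overrides."""
    from peft import LoraConfig  # requires the `peft` package

    return LoraConfig(**{**LORA_KWARGS, **overrides})


print(lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))  # 2.0
```

With rank 32 and alpha 64, the adapter update is scaled by 2.0 before being added to the frozen base weights.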

Training Datasets

Dataset	Language	Samples	License
google/fleurs (yo_ng)	Yoruba	2,339	CC-BY 4.0
google/fleurs (ha_ng)	Hausa	3,259	CC-BY 4.0
google/fleurs (ig_ng)	Igbo	2,839	CC-BY 4.0
benjaminogbonna/nigerian_accented_english_dataset	Nigerian English	2,721	Apache 2.0
asr-nigerian-pidgin/nigerian-pidgin-1.0	Nigerian Pidgin	2,708	CC-BY 4.0
Total	5 languages	13,866
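The dataset sizes and batch settings together imply a rough training schedule. The sketch below checks the table total and estimates the number of optimizer steps. It assumes all 13,866 samples were used for training and that the effective batch size of 16 is global. With two GPUs the real step count could differ, so the figures are approximate.

```python
import math

# Per-language sample counts from the Training Datasets table.
TRAIN_SAMPLES = {
    "Yoruba": 2339,
    "Hausa": 3259,
    "Igbo": 2839,
    "Nigerian English": 2721,
    "Nigerian Pidgin": 2708,
}

BATCH_SIZE = 8     # from Training Details
GRAD_ACCUM = 2     # from Training Details
EPOCHS = 2         # from Training Details
WARMUP_STEPS = 50  # from Training Details


def optimizer_steps(total_samples, batch_size, grad_accum, epochs):
    """Return (steps per epoch, total steps) for a given effective batch size."""
    effective_batch = batch_size * grad_accum
    per_epoch = math.ceil(total_samples / effective_batch)
    return per_epoch, per_epoch * epochs


total = sum(TRAIN_SAMPLES.values())
per_epoch, total_steps = optimizer_steps(total, BATCH_SIZE, GRAD_ACCUM, EPOCHS)

print(f"total samples:   {total:,}")
print(f"steps per epoch: {per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup share:    {WARMUP_STEPS / total_steps:.1%}")
for language, count in TRAIN_SAMPLES.items():
    print(f"{language:<17} {count / total:.1%} of training data")
```

Under these assumptions the 50 warmup steps cover roughly the first 3% of training, and no single language accounts for more than about a quarter of the data.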

✅ Intended Use

NaijaVox-V1 is designed for:

🏦 Fintech & banking — voice-based transactions and customer service in Nigerian languages
📱 Mobile apps — voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
🎙️ Media & journalism — transcribing interviews and broadcasts in Nigerian languages
🏥 Healthcare — patient intake and medical documentation
📚 Education — language learning tools and accessibility
🔬 Research — low-resource ASR study for West African languages
♿ Accessibility — assistive technology for Nigerians with disabilities

🚫 Prohibited Use

The following uses are explicitly prohibited under this model's responsible use policy:

❌ Non-consensual surveillance — transcribing phone calls, conversations, or audio without the knowledge and consent of all parties
❌ Fraud facilitation — using transcription output to generate misleading records, forge spoken statements, or support advance-fee fraud (419 scams)
❌ Deepfake pipelines — combining with TTS models to create fake audio-text pairs attributed to real people
❌ Discriminatory systems — building applications that deny services based on language or accent identification derived from this model
❌ Political disinformation — generating or verifying false transcripts of political speech

While the Apache 2.0 license permits broad commercial use, these prohibited uses apply as a binding behavioral restriction under Axiveri's Responsible AI Use Policy.

🗺️ Roadmap

NaijaVox-V2 (Planned)

Add ÌròyìnSpeech (42hrs clean Yoruba, CC-BY) — target Yoruba WER < 15%
Add Mozilla Common Voice Nigerian Pidgin (14hrs) — target Pidgin WER < 10%
Add OpenSLR SLR70 Lagos-accented English
Noise augmentation (SpecAugment, crowd noise, phone audio)
Code-switching support (Yoruba-English, Pidgin-English)

NaijaVox-Omni (Future)

Combine NaijaVox STT + YarnGPT TTS into full speech pipeline
Emotion and tone detection
Real-time streaming transcription API
Additional languages: Tiv, Igede, Efik, Ibibio

👤 Creator

Emmanuel Ariyo (Ememzyvisuals) — Founder, Axiveri

NaijaVox-V1 was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Yoruba, Hausa, Igbo, Nigerian Pidgin, and Nigerian English speakers.

👥 About Axiveri

Axiveri is building Africa's AI infrastructure — open models, open data, and open tools for African languages and developers.

🌍 Africlaude Series — African language models
🗣️ NaijaVox — Nigerian speech recognition (this model)

📄 Citation

@misc{naijavox2026,
  title        = {NaijaVox-V1: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-V1}},
  note         = {Whisper-large-v3 fine-tuned on Yoruba, Hausa, Igbo,
                  Nigerian Pidgin and Nigerian-accented English}
}

📜 License

Apache 2.0 — free for commercial and research use with attribution. Additional behavioral restrictions apply as described in the Prohibited Use section above.

Built in Nigeria 🇳🇬 — for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
First model in the NaijaVox series by Axiveri
