| Language | WER |
| --- | --- |
| 🇳🇬 Pidgin | 16.8% |
| 🇳🇬 Nigerian English | 21.1% |
| 🇳🇬 Yoruba | 28.8% |
| 🇳🇬 Hausa | 31.0% |
| 🇳🇬 Igbo | 41.9% |

Nigeria's Voice in AI.

NaijaVox-V1 is the first open-weight automatic speech recognition model fine-tuned specifically for Nigerian languages: Yoruba (with full diacritics), Hausa, Igbo, Nigerian Pidgin, and Nigerian-accented English. Built on OpenAI Whisper-large-v3 using parameter-efficient LoRA fine-tuning, NaijaVox is designed to be practical, deployable, and accessible to Nigerian developers and researchers.

"Every Nigerian deserves to be heard and understood by AI, in their own language, with their own voice."


πŸ—£οΈ Languages Supported

Language ISO Code Script Token
Yoruba yo Latin + full diacritics (ẹ, ọ, ṣ, à, Ñ, etc.) <|yo|>
Hausa ha Latin + special chars (Ζ™, Ζ΄, Ι—, etc.) <|ha|>
Igbo ig Latin + diacritics <|ig|>
Nigerian Pidgin pcm Latin <|pcm|>
Nigerian English en Latin <|en|>

Note: <\|ig\|> and <\|pcm\|> are new language tokens added to the Whisper vocabulary. The extended tokenizer is included in this repository.
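Because two of the language tokens only exist in the extended tokenizer, it can be worth confirming that the vocabulary you loaded actually contains them before transcribing. The following is a minimal sketch, not part of the released code: `missing_language_tokens` is a hypothetical helper, and the two dictionaries are stand-ins with placeholder IDs rather than real Whisper token IDs.

```python
# Language tokens listed in the table above.
LANGUAGE_TOKENS = ["<|yo|>", "<|ha|>", "<|ig|>", "<|pcm|>", "<|en|>"]


def missing_language_tokens(vocab):
    """Return the language tokens that have no entry in a token -> id mapping."""
    return [token for token in LANGUAGE_TOKENS if token not in vocab]


# Hypothetical stand-ins for tokenizer vocabularies; the IDs are placeholders,
# not the real Whisper token IDs.
base_like_vocab = {"<|yo|>": 1, "<|ha|>": 2, "<|en|>": 3}
extended_like_vocab = {**base_like_vocab, "<|ig|>": 4, "<|pcm|>": 5}

print(missing_language_tokens(base_like_vocab))      # ['<|ig|>', '<|pcm|>']
print(missing_language_tokens(extended_like_vocab))  # []
```

With a real tokenizer, the mapping returned by `processor.tokenizer.get_vocab()` can be passed in directly. A non-empty result would suggest that the base Whisper tokenizer was loaded instead of the one shipped in this repository.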


🚀 Quick Start

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Axiveri/NaijaVox-V1",
    device=0  # use GPU, or remove for CPU
)

result = pipe("your_audio.wav")
print(result["text"])

Specifying Language

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

model     = WhisperForConditionalGeneration.from_pretrained("Axiveri/NaijaVox-V1")
processor = WhisperProcessor.from_pretrained("Axiveri/NaijaVox-V1")
vocab     = processor.tokenizer.get_vocab()

# Map language to token
LANG_TOKENS = {
    "yoruba":           "<|yo|>",
    "hausa":            "<|ha|>",
    "igbo":             "<|ig|>",
    "nigerian_english": "<|en|>",
    "pidgin":           "<|pcm|>",
}

def transcribe(audio_array, sampling_rate, language="yoruba"):
    # Look up the IDs of the prompt tokens that steer decoding.
    # (Named *_id so the local variable does not shadow this function.)
    lang_id         = vocab[LANG_TOKENS[language]]
    transcribe_id   = vocab["<|transcribe|>"]
    notimestamps_id = vocab["<|notimestamps|>"]
    # [position, token_id] pairs forced at decoder positions 1-3.
    forced_ids = [[1, lang_id], [2, transcribe_id], [3, notimestamps_id]]

    # Convert raw audio (expected at 16 kHz) into log-mel input features.
    inputs = processor.feature_extractor(
        audio_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    with torch.no_grad():
        generated = model.generate(
            input_features=inputs,
            forced_decoder_ids=forced_ids,
            max_new_tokens=448
        )
    return processor.tokenizer.decode(generated[0], skip_special_tokens=True)
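The `transcribe` function above takes raw samples and a sampling rate, and the model expects 16 kHz input. The sketch below shows one way to get both from a WAV file using only the Python standard library. `load_wav_mono` and `EXPECTED_RATE` are hypothetical names, not part of this repository. The sketch assumes 16-bit PCM files and rejects other sampling rates instead of resampling them, which is better left to a dedicated audio library.

```python
import array
import sys
import wave

EXPECTED_RATE = 16_000  # matches the "Input Audio (16kHz)" stage of the model


def load_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (samples, sampling_rate).

    Samples are floats in [-1.0, 1.0]; multi-channel audio is averaged to mono.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    if rate != EXPECTED_RATE:
        raise ValueError(
            f"expected {EXPECTED_RATE} Hz audio, got {rate} Hz; resample first"
        )
    # Average the interleaved channels of each frame into one mono sample.
    mono = [
        sum(pcm[i:i + channels]) / channels / 32768.0
        for i in range(0, len(pcm), channels)
    ]
    return mono, rate
```

A call would then look like `samples, rate = load_wav_mono("your_audio.wav")` followed by `transcribe(samples, rate, language="pidgin")`.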

📊 Benchmark Results

Evaluated on the FLEURS test splits (Yoruba, Hausa, Igbo), the Nigerian Pidgin ASR test set, and the Nigerian Accented English dataset, using 50 samples per language and greedy decoding.

| Language | WER (%) | Test Set | Samples |
| --- | --- | --- | --- |
| 🇳🇬 Nigerian Pidgin | 16.8 | asr-nigerian-pidgin/nigerian-pidgin-1.0 | 50 |
| 🇳🇬 Nigerian English | 21.1 | benjaminogbonna/nigerian_accented_english | 50 |
| 🇳🇬 Yoruba | 28.8 | google/fleurs yo_ng | 50 |
| 🇳🇬 Hausa | 31.0 | google/fleurs ha_ng | 50 |
| 🇳🇬 Igbo | 41.9 | google/fleurs ig_ng | 50 |
| Average | 27.9 | - | 250 |

Lower WER is better. Human-level transcription is approximately 5–10% WER.
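For readers who want to check a score by hand, the sketch below computes word error rate as the word-level edit distance divided by the number of reference words. This card does not say which tool or text normalization produced the reported numbers, so the whitespace tokenization with no normalization used here is an assumption, and the function names are illustrative. The example sentence pair is the first Nigerian Pidgin sample shown in this card.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of ref_word
                current[j - 1] + 1,      # insertion of hyp_word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def wer(reference_text, hypothesis_text):
    """Word error rate as a fraction of the reference length."""
    reference = reference_text.split()
    hypothesis = hypothesis_text.split()
    if not reference:
        raise ValueError("reference transcript is empty")
    return word_edit_distance(reference, hypothesis) / len(reference)


# First Nigerian Pidgin sample from this card (reference vs. model output).
reference = "on top di injury her uncle no even carry her go hospital for treatment"
hypothesis = "untop di injury and her uncle no even carry her go hospital for treatment"
print(f"{wer(reference, hypothesis):.1%}")  # 3 errors over 14 reference words

# Unweighted mean of the five per-language WERs reported in the table.
per_language = {
    "pidgin": 16.8,
    "nigerian_english": 21.1,
    "yoruba": 28.8,
    "hausa": 31.0,
    "igbo": 41.9,
}
average = sum(per_language.values()) / len(per_language)
print(round(average, 1))
```

The second print reproduces the 27.9 in the Average row, which indicates that the figure is a simple mean over languages and not a mean weighted by word count.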


πŸŽ™οΈ Sample Transcriptions

Real audio samples from FLEURS test, Nigerian English, and Pidgin datasets β€” data the model never saw during training. Transcriptions generated live by the published model itself.

Yoruba

| Reference transcription | NaijaVox Output |
| --- | --- |
| àwọn èyàn ti mọ̀ nípa àwọn kemika pepe bí wúrà fàdákà àti kọ́pa àtijọ́ torípé a lè rí wọn ní ìṣẹ̀dá ní ipò àdáyéba wọn kò dẹ̀ le jù láti wú wọn jáda láti inú ilẹ̀ pẹ̀lú irinṣẹ́ àtilẹ̀bá | àwọn èèyàn ti mọ̀ nípa wọn kẹ́míkà pépè bí wúra fadaka àti kọpa àtijọ́ torí pé a lè rí wọ́n ní ìṣẹ̀dá ní ipò àdáyébá wọn kò dẹ̀ lé ju láti wu wọn jájájá látinú ilẹ̀ pẹ̀lú irinṣẹ́ à |
| àwọn ara ìrano lo kọ́kọ́ bẹ̀rẹ̀ si ni sin ewure ní bíi ọdún 15,0000 sẹ́yìn ní oke sagrosi | àwọn ará ìránu olókọ́kọ́ bẹ̀rẹ̀ sí ni sin ewu rẹ́ níbi ọdún 1500 sẹ́yìn ìní òkè ságrọ́sì |

Hausa

| Reference transcription | NaijaVox Output |
| --- | --- |
| an kwatanta faretin gine-ginen da ke yin sararin samaniyar hong kong da ginshiƙi mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor | an kwatanta ferretton gine-ginen da ke yin sararin samaniya hong kong da yin shiki mai walƙiya wanda aka bayyana ta gaban ruwan victoria harbor |
| aristotle masanin falsafa ne yayi tunanin cewa komai ya kunshi cakuda daya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta | aristotle masanin falsafani ya yi tunanin cewa komai ya kunshi ca kuɗaɗaya ko fiye daga abubuwa hudu sun kasance ƙasa ruwa iska da wuta |

Igbo

| Reference transcription | NaijaVox Output |
| --- | --- |
| ka akara rossby na-adị obere karịa ka arụmarụ na-adịkwu obere nke kpakpando n'ikwanye ugwu nye ụmụ ntụgharị nke ihendọta | akara rossby na-adị obere karịa ka arụmarụ na-adịkwa obere nke kpakpando n'ịkwanye ugwu onyewu mọntụgharị nke ihe ndọta |
| ka agha dara mba britenị jiri ndị agha elu mmiri gbochie ndị jamani inweta enyemaka | ta agha dara mba briteni jiri ndị agha elu mmiri gbochi ndị jamani inweta enyemaka |

Nigerian English

| Reference transcription | NaijaVox Output |
| --- | --- |
| Closing the Google assistant app prevents it from working with your headphones. | Closing the Google assistant app prevents it from working with your headphones. ✅ |
| Head south on Ibo Road towards Emir Road | Head south on Ibo Road towards Emir Road ✅ |

Nigerian Pidgin

| Reference transcription | NaijaVox Output |
| --- | --- |
| on top di injury her uncle no even carry her go hospital for treatment | untop di injury and her uncle no even carry her go hospital for treatment |
| she tell don jazzy for december 2016 say as she be | she tell don jazzy for december 2016 say as she be ✅ |

πŸ—οΈ Model Architecture

Input Audio (16kHz)
        β”‚
        β–Ό
Whisper-large-v3 Encoder  (frozen during fine-tuning)
        β”‚  1500 Γ— 1280 features
        β–Ό
Whisper Decoder + LoRA    (r=32, alpha=64, fine-tuned)
  target modules: q_proj, k_proj, v_proj, out_proj
  trainable params: 31,457,280 / 1,574,950,400 (1.997%)
        β”‚
        β–Ό
Extended Tokenizer         (vocab: 51,868 tokens)
  + <|ig|> Igbo token
  + <|pcm|> Nigerian Pidgin token
        β”‚
        β–Ό
Transcript
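The trainable-parameter figure in the diagram can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not the author's accounting. It assumes the usual LoRA layout of two bias-free matrices per adapted projection, and it takes Whisper-large-v3's width (1280) and depth (32 encoder and 32 decoder layers) from the public model configuration. Under the further assumption that adapters sit on every matching projection in both stacks, the count matches the reported number exactly. Adapters on the decoder alone would give 20,971,520, so the match suggests, but does not prove, that the encoder's attention projections also carry adapters while its base weights stay frozen.

```python
# Whisper-large-v3 dimensions (from the public model configuration).
D_MODEL = 1280
ENCODER_LAYERS = 32
DECODER_LAYERS = 32

# Values reported in this card.
RANK = 32
TARGETS_PER_ATTENTION = 4  # q_proj, k_proj, v_proj, out_proj
REPORTED_TRAINABLE = 31_457_280
REPORTED_TOTAL = 1_574_950_400

# LoRA adds two matrices per adapted projection: A (rank x in) and B (out x rank).
params_per_projection = RANK * D_MODEL + D_MODEL * RANK

# Assumption: adapters on every matching projection. Each encoder layer has one
# attention block; each decoder layer has two (self-attention, cross-attention).
attention_blocks = ENCODER_LAYERS * 1 + DECODER_LAYERS * 2
trainable = attention_blocks * TARGETS_PER_ATTENTION * params_per_projection

# Same formula restricted to the decoder, for comparison.
decoder_only = DECODER_LAYERS * 2 * TARGETS_PER_ATTENTION * params_per_projection

print(trainable, decoder_only)
print(f"{trainable / REPORTED_TOTAL:.3%}")
```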

📦 Training Details

| Parameter | Value |
| --- | --- |
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, out_proj |
| Training precision | fp16 |
| Batch size | 8 × gradient accumulation 2 = 16 effective |
| Learning rate | 1e-3 with 50 warmup steps |
| Epochs | 2 (selected by WER, not loss) |
| Total training samples | 13,866 |
| GPU | Tesla T4 × 2 (Kaggle) |
| Training time | ~20 hours total |
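To put the batch size and warmup values in context, the sketch below derives an approximate optimizer step count from the numbers in the table. It is an estimate under stated assumptions, not a log from the actual run: it treats 16 as the global effective batch size, as the table states, and assumes the final partial batch of each epoch still triggers an update.

```python
import math

# Values reported in the training table.
TOTAL_SAMPLES = 13_866
PER_STEP_BATCH = 8
GRAD_ACCUMULATION = 2
EPOCHS = 2
WARMUP_STEPS = 50

effective_batch = PER_STEP_BATCH * GRAD_ACCUMULATION
# Assumption: the final partial batch of each epoch still triggers an update.
steps_per_epoch = math.ceil(TOTAL_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_share = WARMUP_STEPS / total_steps

print(effective_batch, steps_per_epoch, total_steps)
print(f"warmup covers {warmup_share:.1%} of training")
```

Under these assumptions the run is roughly 1,700 optimizer steps, so the 50 warmup steps cover only about 3% of training at the 1e-3 learning rate.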

Training Datasets

| Dataset | Language | Samples | License |
| --- | --- | --- | --- |
| google/fleurs (yo_ng) | Yoruba | 2,339 | CC-BY 4.0 |
| google/fleurs (ha_ng) | Hausa | 3,259 | CC-BY 4.0 |
| google/fleurs (ig_ng) | Igbo | 2,839 | CC-BY 4.0 |
| benjaminogbonna/nigerian_accented_english_dataset | Nigerian English | 2,721 | Apache 2.0 |
| asr-nigerian-pidgin/nigerian-pidgin-1.0 | Nigerian Pidgin | 2,708 | CC-BY 4.0 |
| Total | 5 languages | 13,866 | |

✅ Intended Use

NaijaVox-V1 is designed for:

  • 🏦 Fintech & banking: voice-based transactions and customer service in Nigerian languages
  • 📱 Mobile apps: voice input for Yoruba, Hausa, Igbo, and Pidgin speakers
  • 🎙️ Media & journalism: transcribing interviews and broadcasts in Nigerian languages
  • 🏥 Healthcare: patient intake and medical documentation
  • 📚 Education: language learning tools and accessibility
  • 🔬 Research: low-resource ASR study for West African languages
  • ♿ Accessibility: assistive technology for Nigerians with disabilities

🚫 Prohibited Use

The following uses are explicitly prohibited under this model's responsible use policy:

  • ❌ Non-consensual surveillance: transcribing phone calls, conversations, or audio without the knowledge and consent of all parties
  • ❌ Fraud facilitation: using transcription output to generate misleading records, forge spoken statements, or support advance-fee fraud (419 scams)
  • ❌ Deepfake pipelines: combining with TTS models to create fake audio-text pairs attributed to real people
  • ❌ Discriminatory systems: building applications that deny services based on language or accent identification derived from this model
  • ❌ Political disinformation: generating or verifying false transcripts of political speech

While the Apache 2.0 license permits broad commercial use, these prohibited uses apply as a binding behavioral restriction under Axiveri's Responsible AI Use Policy.


πŸ—ΊοΈ Roadmap

NaijaVox-V2 (Planned)

  • Add ÌrΓ²yΓ¬nSpeech (42hrs clean Yoruba, CC-BY) β€” target Yoruba WER < 15%
  • Add Mozilla Common Voice Nigerian Pidgin (14hrs) β€” target Pidgin WER < 10%
  • Add OpenSLR SLR70 Lagos-accented English
  • Noise augmentation (SpecAugment, crowd noise, phone audio)
  • Code-switching support (Yoruba-English, Pidgin-English)

NaijaVox-Omni (Future)

  • Combine NaijaVox STT + YarnGPT TTS into full speech pipeline
  • Emotion and tone detection
  • Real-time streaming transcription API
  • Additional languages: Tiv, Igede, Efik, Ibibio

👤 Creator

Emmanuel Ariyo (Ememzyvisuals), Founder, Axiveri

NaijaVox-V1 was conceived, built, and trained by Emmanuel Ariyo, combining ML engineering with a Nigerian cultural design identity to bring open-weight speech recognition to Yoruba, Hausa, Igbo, Nigerian Pidgin, and Nigerian English speakers.


👥 About Axiveri

Axiveri is building Africa's AI infrastructure: open models, open data, and open tools for African languages and developers.

  • 🌍 Africlaude Series: African language models
  • 🗣️ NaijaVox: Nigerian speech recognition (this model)

📄 Citation

@misc{naijavox2026,
  title        = {NaijaVox-V1: Open-Weight Speech Recognition for Nigerian Languages},
  author       = {Ariyo, Emmanuel (Ememzyvisuals)},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Axiveri/NaijaVox-V1}},
  note         = {Whisper-large-v3 fine-tuned on Yoruba, Hausa, Igbo,
                  Nigerian Pidgin and Nigerian-accented English}
}

📜 License

Apache 2.0: free for commercial and research use with attribution. Additional behavioral restrictions apply as described in the Prohibited Use section above.


Built in Nigeria 🇳🇬, for Nigeria and the world.
Created by Emmanuel Ariyo (Ememzyvisuals)
First model in the NaijaVox series by Axiveri
