
# HIYACCENT: An Improved Nigerian-Accented Speech Recognition System Based on Contrastive Learning

The overall objective of this research was to develop a more robust model for Nigerian English speakers, whose English pronunciation is heavily influenced by their mother tongue. To this end, the Wav2Vec-HIYACCENT model was proposed, which introduces a new layer into Facebook's wav2vec 2.0 to capture the disparity between the baseline model and Nigerian English speech. A CTC loss was also added on top of the model, which adds flexibility to the speech-text alignment. This resulted in an improvement of over 20% in performance on Nigerian-Accented English (NAE).
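
To make the alignment idea concrete, here is a minimal, hypothetical sketch of attaching a CTC head to encoder features in PyTorch. The layer sizes and vocabulary below are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
hidden_dim = 1024   # wav2vec2-large hidden size
vocab_size = 32     # character vocabulary size, including the CTC blank

# A linear head maps each frame's encoder features to vocabulary logits.
ctc_head = nn.Linear(hidden_dim, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# features: (batch, frames, hidden_dim) from the (adapted) wav2vec 2.0 encoder
features = torch.randn(2, 100, hidden_dim)
log_probs = ctc_head(features).log_softmax(dim=-1).transpose(0, 1)  # (frames, batch, vocab)

# targets: padded label indices; the length tensors give true frame/label counts
targets = torch.randint(1, vocab_size, (2, 20))
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```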

The model was fine-tuned from facebook/wav2vec2-large on English using the UISpeech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.
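
For example, a minimal sketch of loading and resampling an arbitrary file with librosa (the path below is a placeholder):

```python
import librosa

# librosa resamples on load; sr=16_000 guarantees the 16 kHz input the model expects.
speech, sampling_rate = librosa.load("/path/to/file.wav", sr=16_000)
```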

The script used for training can be found here: https://github.com/amceejay/HIYACCENT-NE-Speech-Recognition-System

## Usage

The model can be used directly (without a language model) as follows.

### Using the ASRecognition library

```python
from asrecognition import ASREngine

# "en" selects English; the original snippet passed "fr", which appears to be a typo.
asr = ASREngine("en", model_path="codeceejay/HIYACCENT_Wav2Vec2")

audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]
transcriptions = asr.transcribe(audio_paths)
```

### Writing your own inference script

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "codeceejay/HIYACCENT_Wav2Vec2"
SAMPLES = 10

# You can use common_voice or timit; Nigerian Accented Speeches can also be found here: https://openslr.org/70/
test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the dataset: we need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```
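
To quantify transcription quality on such a sample, you can compute the word error rate (WER). A minimal sketch using the jiwer package, which is an assumption here (the original repository may use a different evaluation script):

```python
from jiwer import wer

# Compare references against the predictions produced above.
references = [test_dataset[i]["sentence"] for i in range(len(predicted_sentences))]
print(f"WER: {wer(references, predicted_sentences):.2%}")
```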
