---
language: ar
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Sinai Voice Arabic Speech Recognition Model
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ar
      type: common_voice
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 40.2
---

# Sinai Voice Arabic Speech Recognition Model

# نموذج **صوت سيناء** للتعرف على الأصوات العربية الفصحى و تحويلها إلى نصوص

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Arabic using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.

## Usage

Please install:

- [PyTorch](https://pytorch.org/)
- `$ pip3 install jiwer lang_trans torchaudio datasets transformers`

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

dataset = load_dataset("common_voice", "ar", split="test[:10]")

resamplers = {  # all three sampling rates exist in the test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    return example

dataset = dataset.map(prepare_example)
processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").eval()

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        predicted = torch.argmax(model(inputs.input_values).logits, dim=-1)
    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
    batch["predicted"] = processor.tokenizer.batch_decode(predicted)
    return batch

dataset = dataset.map(predict, batched=True, batch_size=1, remove_columns=["speech"])

for reference, predicted in zip(dataset["sentence"], dataset["predicted"]):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```

Here is the output:

```
reference: ألديك قلم ؟
predicted: ألديك قلم
--
reference: ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.
predicted: ليست نارك مسافة على هذه الأرض أبعد من يوم أمس
--
reference: إنك تكبر المشكلة.
predicted: إنك تكبر المشكلة
--
reference: يرغب أن يلتقي بك.
predicted: يرغب أن يلتقي بك
--
reference: إنهم لا يعرفون لماذا حتى.
predicted: إنهم لا يعرفون لماذا حتى
--
reference: سيسعدني مساعدتك أي وقت تحب.
predicted: سيسعدن مساعثتك أي وقد تحب
--
reference: أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة.
predicted: أحب نظرية علمية إلي هي أن أحلقتز حلم كوينا بالكامل من الأمت عن المفقودة
--
reference: سأشتري له قلماً.
predicted: سأشتري له قلما
--
reference: أين المشكلة ؟
predicted: أين المشكل
--
reference: وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ
predicted: ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون
```
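To transcribe a recording of your own instead of Common Voice samples, the same processor and model can be applied to a local file. A minimal sketch, assuming a placeholder path `audio.wav` that holds a mono recording `torchaudio` can read:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").eval()

# "audio.wav" is a placeholder; substitute the path to your own mono recording.
speech, sampling_rate = torchaudio.load("audio.wav")
if sampling_rate != 16000:  # the model expects 16 kHz input
    speech = torchaudio.transforms.Resample(sampling_rate, 16000)(speech)

inputs = processor(speech.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```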
## Evaluation

The evaluation script below is adapted from [elgeish/wav2vec2-large-xlsr-53-arabic](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic).

The model can be evaluated as follows on the Arabic test data of Common Voice:

```python
import jiwer
import torch
import torchaudio
from datasets import load_dataset
from lang_trans.arabic import buckwalter
from transformers import set_seed, Wav2Vec2ForCTC, Wav2Vec2Processor

set_seed(42)
test_split = load_dataset("common_voice", "ar", split="test")

resamplers = {  # all three sampling rates exist in the test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    return example

test_split = test_split.map(prepare_example)
processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").to("cuda").eval()

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        predicted = torch.argmax(model(inputs.input_values.to("cuda")).logits, dim=-1)
    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
    batch["predicted"] = processor.batch_decode(predicted)
    return batch

test_split = test_split.map(predict, batched=True, batch_size=16, remove_columns=["speech"])

transformation = jiwer.Compose([
    # normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
    jiwer.SubstituteRegexes({
        r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "",
        "\u06D6": "",
        r"[\|\{]": "A",
        "p": "h",
        "ک": "k",
        "ی": "y",
    }),
    # default transformation below
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
])

metrics = jiwer.compute_measures(
    truth=[buckwalter.trans(s) for s in test_split["sentence"]],  # Buckwalter transliteration
    hypothesis=test_split["predicted"],
    truth_transform=transformation,
    hypothesis_transform=transformation,
)
print(f"WER: {metrics['wer']:.2%}")
```

**Test Result**: 40.2%
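As a quick sanity check of the metric itself, `jiwer.wer` can be applied directly to a single reference/prediction pair from the usage example above. This minimal sketch skips the Buckwalter transliteration and regex normalization of the full evaluation, so punctuation differences also count as errors:

```python
import jiwer

reference = "ليست هناك مسافة على هذه الأرض أبعد من يوم أمس."
hypothesis = "ليست نارك مسافة على هذه الأرض أبعد من يوم أمس"

# With plain whitespace tokenization, "هناك" -> "نارك" is one substitution and
# "أمس." vs "أمس" (the dropped final period) is another: 2 of 10 tokens, i.e. 20%.
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```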