---
language: fa
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- label: Common Voice sample 687
  src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian/resolve/main/sample687.flac
- label: Common Voice sample 1671
  src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian/resolve/main/sample1671.flac
model-index:
- name: XLSR Wav2Vec2 Persian (Farsi) by Mehrdad Farahani
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice fa
      type: common_voice
      args: fa
    metrics:
    - name: Test WER
      type: wer
      value: 32.09
    - name: Test CER
      type: cer
      value: 8.23
---

# Wav2Vec2-Large-XLSR-53 Persian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Persian (Farsi) using [Common Voice](https://huggingface.co/datasets/common_voice). When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:

```bash
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
!pip install hazm
```

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import numpy as np
import hazm
import six
import re

# Load a small slice of the Persian Common Voice test split
dataset = load_dataset("common_voice", "fa", split="test[:2%]")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-persian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-persian").to(device)
```
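Before running the full test-split pipeline below, you can sanity-check the model on a single recording. A minimal sketch, assuming a hypothetical local file `sample.wav` (any sample rate; the audio is resampled to the 16 kHz the model expects):

```python
# "sample.wav" is a hypothetical local recording; replace with your own file.
speech_array, sampling_rate = torchaudio.load("sample.wav")
speech = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sampling_rate, target_sr=16_000)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values.to(device),
                   attention_mask=inputs.attention_mask.to(device)).logits

# Greedy CTC decoding: pick the most likely token at each frame
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```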
Next, the transcripts are normalized and the audio files are read and resampled:

```python
# Preprocessing the datasets: normalizing the texts
_normalizer = hazm.Normalizer()

def multiple_replace(mapping, text):
    pattern = "|".join(map(re.escape, mapping.keys()))
    return re.sub(pattern, lambda m: mapping[m.group()], str(text))

def convert_weirdos(input_str):
    # Map Arabic and presentation-form characters to their Persian equivalents
    mapping = {
        'ك': 'ک', 'دِ': 'د', 'بِ': 'ب', 'زِ': 'ز', 'ذِ': 'ذ', 'شِ': 'ش', 'سِ': 'س',
        'ى': 'ی', 'ي': 'ی', 'أ': 'ا', 'ؤ': 'و', 'ے': 'ی', 'ۀ': 'ه', 'ﭘ': 'پ',
        'ﮐ': 'ک', 'ﯽ': 'ی', 'ﺎ': 'ا', 'ﺑ': 'ب', 'ﺘ': 'ت', 'ﺧ': 'خ', 'ﺩ': 'د',
        'ﺱ': 'س', 'ﻀ': 'ض', 'ﻌ': 'ع', 'ﻟ': 'ل', 'ﻡ': 'م', 'ﻢ': 'م', 'ﻪ': 'ه',
        'ﻮ': 'و', 'ئ': 'ی', 'ﺍ': 'ا', 'ة': 'ه', 'ﯾ': 'ی', 'ﯿ': 'ی', 'ﺒ': 'ب',
        'ﺖ': 'ت', 'ﺪ': 'د', 'ﺮ': 'ر', 'ﺴ': 'س', 'ﺷ': 'ش', 'ﺸ': 'ش', 'ﻋ': 'ع',
        'ﻤ': 'م', 'ﻥ': 'ن', 'ﻧ': 'ن', 'ﻭ': 'و', 'ﺭ': 'ر', 'ﮔ': 'گ',
    }
    # Replace punctuation and stray symbols with spaces
    mapping.update(**{
        '#': ' ', '!': ' ', '؟': ' ', '?': ' ', '«': ' ', '»': ' ', 'ء': ' ',
        '،': ' ', '(': ' ', ')': ' ', '؛': ' ', "'ٔ": ' ', '٬': ' ', 'ٔ': ' ',
        ',': ' ', '.': ' ', '-': ' ', ';': ' ', ':': ' ', '"': ' ', '“': ' ',
        '%': ' ', '‘': ' ', '”': ' ', '�': ' ', '–': ' ', '…': ' ', '_': ' ',
    })
    return multiple_replace(mapping, input_str)

PERSIAN_ALPHA = "\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC"  # noqa: E501
PERSIAN_DIGIT = "\u06F0-\u06F9"
COMMON_ARABIC_ALPHA = "\u0629\u0643\u0649-\u064B\u064D\u06D5"
COMMON_ARABIC_DIGIT = "\u0660-\u0669"
ZWNJ = "\u200c"
ENGLISH = "a-z0-9\&"
PERSIAN = PERSIAN_ALPHA + PERSIAN_DIGIT + COMMON_ARABIC_ALPHA + COMMON_ARABIC_DIGIT + ZWNJ

def normalizer(text):
    text = text.lower()
    text = _normalizer.normalize(text)

    # Strip zero-width and directionality marks
    text = text.replace("\u200c", " ")
    text = text.replace("\u200d", " ")
    text = text.replace("\u200e", " ")
    text = text.replace("\u200f", " ")
    text = text.replace("\ufeff", " ")

    text = convert_weirdos(text)

    # Replace non-initial "آ" with "ا"
    words = [
        word.replace("آ", "ا") if "آ" in word and not word.startswith("آ") else word
        for word in text.split()
    ]
    text = " ".join(words)

    if not text or not len(text) > 2:
        return None

    # Drop sentences that still contain Latin characters or digits
    en_text = re.sub(r"[^" + ENGLISH + "+]", " ", six.ensure_str(text))
    en_text = re.sub(r"\s+", " ", en_text)
    if len(en_text) > 1:
        return None

    return text

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

def remove_special_characters(batch):
    text = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    batch["sentence"] = normalizer(text)
    return batch

# We need to read the audio files as float arrays, resampled to 16 kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch

def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch

dataset = dataset.map(remove_special_characters)
dataset = dataset.map(
    speech_file_to_array_fn,
    remove_columns=list(set(dataset.column_names) - set(['sentence', 'path'])),
)
result = dataset.map(predict)
```
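To see what the cleaning step does, it can help to run `normalizer` on a raw transcript by hand. A minimal sketch (the input string is illustrative, not taken from the dataset):

```python
# Illustrative string mixing Arabic letter forms ('ك', 'ي') with punctuation
raw = "سلام، دنيا!"
print(normalizer(raw))  # expected: "سلام دنیا" (Persian letter forms, punctuation stripped)
```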
## Prediction

```python
max_items = np.random.randint(0, len(result), 20).tolist()

for i in max_items:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print('---')
```

```text
reference: اطلاعات مسری است
predicted: اطلاعات مسری است
---
reference: نه منظورم اینه که وقتی که ساکته چه کاریه خودمونه بندازیم زحمت
predicted: نه منظورم اینه که وقتی که ساکت چی کاریه خودمونو بندازیم زحمت
---
reference: من آب پرتقال می خورم لطفا
predicted: من آپ ارتغال می خورم لطفا
---
reference: وقت آن رسیده آنها را که قدم پیش میگذارند بزرگ بداریم
predicted: وقت آ رسیده آنها را که قدم پیش میگذارند بزرگ بداریم
---
reference: سیم باتری دارید
predicted: سیم باتری دارید
---
reference: این بهتره تا اینکه به بهونه درس و مشق هر روز بره خونه شون
predicted: این بهتره تا اینکه به بهمونه درسومش خرروز بره خونه اشون
---
reference: ژاکت تنگ است
predicted: ژاکت تنگ است
---
reference: آت و اشغال های خیابان
predicted: آت و اشغال های خیابان
---
reference: من به این روند اعتراض دارم
predicted: من به این لوند تراج دارم
---
reference: کرایه این مکان چند است
predicted: کرایه این مکان چند است
---
reference: ولی این فرصت این سهم جوانی اعطا نشده است
predicted: ولی این فرصت این سحم جوانی اتان نشده است
---
reference: متوجه فاجعهای محیطی میشوم
predicted: متوجه فاجایهای محیطی میشوم
---
reference: ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن
predicted: ترافیک شدید ی هم بودا دیدن نور ماشینا و چراغ لامپهای مراکز تجاری حس خولی بهم میدادن
---
reference: این مورد عمل ها مربوط به تخصص شما می شود
predicted: این مورد عملها مربوط به تخصص شما میشود
---
reference: انرژی خیلی کمی دارم
predicted: انرژی خیلی کمی دارم
---
reference: زیادی خوبی کردنم تهش داستانه
predicted: زیادی خوبی کردنم ترش داستانه
---
reference: بردهای که پادشاه شود
predicted: برده ای که پاده شاه شود
---
reference: یونسکو
predicted: یونسکو
---
reference: شما اخراج هستید
predicted: شما اخراج هستید
---
reference: من سفر کردن را دوست دارم
predicted: من سفر کردم را دوست دارم
```

## Evaluation

```python
# Download an external CER metric script (WER ships with 🤗 Datasets)
!mkdir cer
!wget -O cer/cer.py https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese/raw/main/cer.py

wer = load_metric("wer")
cer = load_metric("./cer")

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
print("CER: {:.2f}".format(100 * cer.compute(predictions=result["predicted"], references=result["sentence"])))
```

**Test Result**:
- WER: 32.09%
- CER: 8.23%

## Training

The Common Voice `train` and `validation` splits were used for training. The notebook used for training can be found [here](https://colab.research.google.com/github/m3hrdadfi/notebooks/blob/main/Fine_Tune_XLSR_Wav2Vec2_on_Persian_ASR_with_%F0%9F%A4%97_Transformers_ipynb.ipynb).
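For reference, the training splits named above can be loaded the same way as the test slice; a minimal sketch (the full fine-tuning pipeline lives in the linked notebook):

```python
from datasets import load_dataset

# "train+validation" concatenates the two Common Voice splits used for fine-tuning
train_dataset = load_dataset("common_voice", "fa", split="train+validation")
test_dataset = load_dataset("common_voice", "fa", split="test")
```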