---
language: fa
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- label: Common Voice sample 687
  src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian/resolve/main/sample687.flac
- label: Common Voice sample 1671
  src: https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian/resolve/main/sample1671.flac
model-index:
- name: XLSR Wav2Vec2 Persian (Farsi) by Mehrdad Farahani
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice fa
      type: common_voice
      args: fa
    metrics:
    - name: Test WER
      type: wer
      value: 32.09
    - name: Test CER
      type: cer
      value: 8.23
---
# Wav2Vec2-Large-XLSR-53-Persian
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Persian (Farsi) using [Common Voice](https://huggingface.co/datasets/common_voice). When using this model, make sure that your speech input is sampled at 16 kHz.
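If your audio is at a different sampling rate, resample it before inference. A minimal sketch with `torchaudio` and `librosa` (the path `audio.wav` is a placeholder):

```python
import numpy as np
import torchaudio
import librosa

# Load an arbitrary audio file; `audio.wav` is a placeholder path.
speech_array, sampling_rate = torchaudio.load("audio.wav")
speech = speech_array.squeeze().numpy()

# The model expects 16 kHz input, so resample anything else.
if sampling_rate != 16_000:
    speech = librosa.resample(np.asarray(speech), orig_sr=sampling_rate, target_sr=16_000)
```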
## Usage
The model can be used directly (without a language model) as follows:
```bash
!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install torchaudio
!pip install librosa
!pip install jiwer
!pip install hazm
```
```python
import re

import torch
import torchaudio
import librosa
import numpy as np
import six
import hazm
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# Load a small slice of the Common Voice Persian test set.
dataset = load_dataset("common_voice", "fa", split="test[:2%]")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = Wav2Vec2Processor.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-persian")
model = Wav2Vec2ForCTC.from_pretrained("m3hrdadfi/wav2vec2-large-xlsr-persian").to(device)
# Preprocessing the datasets: normalizing the texts.
_normalizer = hazm.Normalizer()

def multiple_replace(mapping, text):
    # Apply every replacement in `mapping` in a single regex pass.
    pattern = "|".join(map(re.escape, mapping.keys()))
    return re.sub(pattern, lambda m: mapping[m.group()], str(text))
def convert_weirdos(input_str):
    # Characters: map Arabic presentation forms and lookalike letters
    # to their standard Persian equivalents.
    mapping = {
        'ك': 'ک',
        'دِ': 'د',
        'بِ': 'ب',
        'زِ': 'ز',
        'ذِ': 'ذ',
        'شِ': 'ش',
        'سِ': 'س',
        'ى': 'ی',
        'ي': 'ی',
        'أ': 'ا',
        'ؤ': 'و',
        "ے": "ی",
        "ۀ": "ه",
        "ﭘ": "پ",
        "ﮐ": "ک",
        "ﯽ": "ی",
        "ﺎ": "ا",
        "ﺑ": "ب",
        "ﺘ": "ت",
        "ﺧ": "خ",
        "ﺩ": "د",
        "ﺱ": "س",
        "ﻀ": "ض",
        "ﻌ": "ع",
        "ﻟ": "ل",
        "ﻡ": "م",
        "ﻢ": "م",
        "ﻪ": "ه",
        "ﻮ": "و",
        "ئ": "ی",
        'ﺍ': "ا",
        'ة': "ه",
        'ﯾ': "ی",
        'ﯿ': "ی",
        'ﺒ': "ب",
        'ﺖ': "ت",
        'ﺪ': "د",
        'ﺮ': "ر",
        'ﺴ': "س",
        'ﺷ': "ش",
        'ﺸ': "ش",
        'ﻋ': "ع",
        'ﻤ': "م",
        'ﻥ': "ن",
        'ﻧ': "ن",
        'ﻭ': "و",
        'ﺭ': "ر",
        "ﮔ": "گ",
    }
    # Notation: map punctuation to spaces.
    mapping.update({
        "#": " ",
        "!": " ",
        "؟": " ",
        "?": " ",
        "«": " ",
        "»": " ",
        "ء": " ",
        "،": " ",
        "(": " ",
        ")": " ",
        "؛": " ",
        "'ٔ": " ",
        "٬": " ",
        'ٔ': " ",
        ",": " ",
        ".": " ",
        "-": " ",
        ";": " ",
        ":": " ",
        '"': " ",
        "“": " ",
        "%": " ",
        "‘": " ",
        "”": " ",
        "�": " ",
        "–": " ",
        "…": " ",
        "_": " ",
    })
    return multiple_replace(mapping, input_str)
PERSIAN_ALPHA = "\u0621-\u0628\u062A-\u063A\u0641-\u0642\u0644-\u0648\u064E-\u0651\u0655\u067E\u0686\u0698\u06A9\u06AF\u06BE\u06CC" # noqa: E501
PERSIAN_DIGIT = "\u06F0-\u06F9"
COMMON_ARABIC_ALPHA = "\u0629\u0643\u0649-\u064B\u064D\u06D5"
COMMON_ARABIC_DIGIT = "\u0660-\u0669"
ZWNJ = "\u200c"
ENGLISH = "a-z0-9\&"
PERSIAN = PERSIAN_ALPHA + PERSIAN_DIGIT + COMMON_ARABIC_ALPHA + COMMON_ARABIC_DIGIT + ZWNJ
def normalizer(text):
    text = text.lower()
    text = _normalizer.normalize(text)

    # Replace zero-width and directional marks with plain spaces.
    text = text.replace("\u200c", " ")
    text = text.replace("\u200d", " ")
    text = text.replace("\u200e", " ")
    text = text.replace("\u200f", " ")
    text = text.replace("\ufeff", " ")

    text = convert_weirdos(text)

    # A non-initial "آ" inside a word is normalized to "ا".
    words = [word.replace("آ", "ا") if "آ" in word and not word.startswith("آ") else word for word in text.split()]
    text = " ".join(words)

    # Drop empty or very short transcriptions.
    if not text or len(text) <= 2:
        return None

    # Drop sentences that still contain Latin letters or digits.
    en_text = re.sub(r"[^" + ENGLISH + "+]", " ", six.ensure_str(text))
    en_text = re.sub(r"\s+", " ", en_text)
    if len(en_text) > 1:
        return None

    return text
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

def remove_special_characters(batch):
    text = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    text = normalizer(text)
    batch["sentence"] = text
    return batch
# We need to read the audio files as arrays and resample them to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16_000)
    batch["speech"] = speech_array
    return batch
def predict(batch):
    features = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    # Greedy (argmax) CTC decoding without a language model.
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)[0]
    return batch
dataset = dataset.map(remove_special_characters)
dataset = dataset.map(speech_file_to_array_fn, remove_columns=list(set(dataset.column_names) - set(['sentence', 'path'])))
result = dataset.map(predict)
```
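To transcribe a single file outside the `datasets` pipeline, here is a minimal sketch that reuses the `processor` and `model` loaded above (the path `sample.flac` is a placeholder):

```python
# Load one file and resample it to the 16 kHz rate the model expects.
speech_array, sampling_rate = torchaudio.load("sample.flac")
speech = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sampling_rate, target_sr=16_000)

# Greedy CTC decoding, exactly as in `predict` above.
features = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(features.input_values.to(device), attention_mask=features.attention_mask.to(device)).logits
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```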
## Prediction
```python
# Inspect 20 random reference/prediction pairs.
sample_indices = np.random.randint(0, len(result), 20).tolist()
for i in sample_indices:
    reference, predicted = result["sentence"][i], result["predicted"][i]
    print("reference:", reference)
    print("predicted:", predicted)
    print('---')
```
```text
reference: اطلاعات مسری است
predicted: اطلاعات مسری است
---
reference: نه منظورم اینه که وقتی که ساکته چه کاریه خودمونه بندازیم زحمت
predicted: نه منظورم اینه که وقتی که ساکت چی کاریه خودمونو بندازیم زحمت
---
reference: من آب پرتقال می خورم لطفا
predicted: من آپ ارتغال می خورم لطفا
---
reference: وقت آن رسیده آنها را که قدم پیش میگذارند بزرگ بداریم
predicted: وقت آ رسیده آنها را که قدم پیش میگذارند بزرگ بداریم
---
reference: سیم باتری دارید
predicted: سیم باتری دارید
---
reference: این بهتره تا اینکه به بهونه درس و مشق هر روز بره خونه شون
predicted: این بهتره تا اینکه به بهمونه درسومش خرروز بره خونه اشون
---
reference: ژاکت تنگ است
predicted: ژاکت تنگ است
---
reference: آت و اشغال های خیابان
predicted: آت و اشغال های خیابان
---
reference: من به این روند اعتراض دارم
predicted: من به این لوند تراج دارم
---
reference: کرایه این مکان چند است
predicted: کرایه این مکان چند است
---
reference: ولی این فرصت این سهم جوانی اعطا نشده است
predicted: ولی این فرصت این سحم جوانی اتان نشده است
---
reference: متوجه فاجعهای محیطی میشوم
predicted: متوجه فاجایهای محیطی میشوم
---
reference: ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن
predicted: ترافیک شدید ی هم بودا دیدن نور ماشینا و چراغ لامپهای مراکز تجاری حس خولی بهم میدادن
---
reference: این مورد عمل ها مربوط به تخصص شما می شود
predicted: این مورد عملها مربوط به تخصص شما میشود
---
reference: انرژی خیلی کمی دارم
predicted: انرژی خیلی کمی دارم
---
reference: زیادی خوبی کردنم تهش داستانه
predicted: زیادی خوبی کردنم ترش داستانه
---
reference: بردهای که پادشاه شود
predicted: برده ای که پاده شاه شود
---
reference: یونسکو
predicted: یونسکو
---
reference: شما اخراج هستید
predicted: شما اخراج هستید
---
reference: من سفر کردن را دوست دارم
predicted: من سفر کردم را دوست دارم
```
## Evaluation
```python
# The CER metric is not bundled with `datasets`, so fetch the script
# from another model repository first.
!mkdir cer
!wget -O cer/cer.py https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese/raw/main/cer.py

wer = load_metric("wer")
cer = load_metric("./cer")

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
print("CER: {:.2f}".format(100 * cer.compute(predictions=result["predicted"], references=result["sentence"])))
```
**Test Results**:
- WER: 32.09%
- CER: 8.23%
## Training
The Common Voice `train` and `validation` splits were used for training.
The script used for training can be found [here](https://colab.research.google.com/github/m3hrdadfi/notebooks/blob/main/Fine_Tune_XLSR_Wav2Vec2_on_Persian_ASR_with_%F0%9F%A4%97_Transformers_ipynb.ipynb).
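For reference, a minimal sketch of loading those splits with `datasets` (concatenating `train` and `validation` into a single training set is an assumption about the setup, not taken verbatim from the notebook):

```python
from datasets import load_dataset

# Common Voice Persian: train + validation for fine-tuning, test for evaluation.
train_dataset = load_dataset("common_voice", "fa", split="train+validation")
test_dataset = load_dataset("common_voice", "fa", split="test")
```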