Upload README.md

3155938 over 2 years ago

8.64 kB

	---
	language: th
	datasets:
	- common_voice
	tags:
	- audio
	- automatic-speech-recognition
	- hf-asr-leaderboard
	- robust-speech-event
	- speech
	- xlsr-fine-tuning
	license: cc-by-sa-4.0
	model-index:
	- name: XLS-R-53 - Thai
	results:
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice 7
	type: mozilla-foundation/common_voice_7_0
	args: th
	metrics:
	- name: Test WER
	type: wer
	value: 0.9524
	- name: Test SER
	type: ser
	value: 1.2346
	- name: Test CER
	type: cer
	value: 0.1623
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Robust Speech Event - Dev Data
	type: speech-recognition-community-v2/dev_data
	args: sv
	metrics:
	- name: Test WER
	type: wer
	value: null
	- name: Test SER
	type: ser
	value: null
	- name: Test CER
	type: cer
	value: null
	---

	# `wav2vec2-large-xlsr-53-th`
	Finetuning `wav2vec2-large-xlsr-53` on Thai [Common Voice 7.0](https://commonvoice.mozilla.org/en/datasets)

	[Read more on our blog](https://medium.com/airesearch-in-th/airesearch-in-th-3c1019a99cd)

	We finetune [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) based on [Fine-tuning Wav2Vec2 for English ASR](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_tuning_Wav2Vec2_for_English_ASR.ipynb) using Thai examples of [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets). The notebooks and scripts can be found in [vistec-ai/wav2vec2-large-xlsr-53-th](https://github.com/vistec-ai/wav2vec2-large-xlsr-53-th). The pretrained model and processor can be found at [airesearch/wav2vec2-large-xlsr-53-th](https://huggingface.co/airesearch/wav2vec2-large-xlsr-53-th).

	## `robust-speech-event`

	Add `syllable_tokenize`, `word_tokenize` ([PyThaiNLP](https://github.com/PyThaiNLP/pythainlp)) and [deepcut](https://github.com/rkcosmos/deepcut) tokenizers to `eval.py` from [robust-speech-event](https://github.com/huggingface/transformers/tree/master/examples/research_projects/robust-speech-event#evaluation)

	```
	> python eval.py --model_id ./ --dataset mozilla-foundation/common_voice_7_0 --config th --split test --log_outputs --thai_tokenizer newmm/syllable/deepcut/cer
	```

	### Eval results on Common Voice 7 "test":

	\| \| WER PyThaiNLP 2.3.1 \| WER deepcut \| SER \| CER \|
	\|---------------------------------\|---------------------\|-------------\|---------\|---------\|
	\| Only Tokenization \| 0.9524% \| 2.5316% \| 1.2346% \| 0.1623% \|
	\| Cleaning rules and Tokenization \| TBD \| TBD \| TBD \| TBD \|


	## Usage

	```
	#load pretrained processor and model
	processor = Wav2Vec2Processor.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")
	model = Wav2Vec2ForCTC.from_pretrained("airesearch/wav2vec2-large-xlsr-53-th")

	#function to resample to 16_000
	def speech_file_to_array_fn(batch,
	text_col="sentence",
	fname_col="path",
	resampling_to=16000):
	speech_array, sampling_rate = torchaudio.load(batch[fname_col])
	resampler=torchaudio.transforms.Resample(sampling_rate, resampling_to)
	batch["speech"] = resampler(speech_array)[0].numpy()
	batch["sampling_rate"] = resampling_to
	batch["target_text"] = batch[text_col]
	return batch

	#get 2 examples as sample input
	test_dataset = test_dataset.map(speech_file_to_array_fn)
	inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

	#infer
	with torch.no_grad():
	logits = model(inputs.input_values,).logits

	predicted_ids = torch.argmax(logits, dim=-1)

	print("Prediction:", processor.batch_decode(predicted_ids))
	print("Reference:", test_dataset["sentence"][:2])

	>> Prediction: ['และ เขา ก็ สัมผัส ดีบุก', 'คุณ สามารถ รับทราบ เมื่อ ข้อความ นี้ ถูก อ่าน แล้ว']
	>> Reference: ['และเขาก็สัมผัสดีบุก', 'คุณสามารถรับทราบเมื่อข้อความนี้ถูกอ่านแล้ว']
	```

	## Datasets

	Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) contains 133 validated hours of Thai (255 total hours) at 5GB. We pre-tokenize with `pythainlp.tokenize.word_tokenize`. We preprocess the dataset using cleaning rules described in `notebooks/cv-preprocess.ipynb` by [@tann9949](https://github.com/tann9949). We then deduplicate and split as described in [ekapolc/Thai_commonvoice_split](https://github.com/ekapolc/Thai_commonvoice_split) in order to 1) avoid data leakage due to random splits after cleaning in [Common Voice Corpus 7.0](https://commonvoice.mozilla.org/en/datasets) and 2) preserve the majority of the data for the training set. The dataset loading script is `scripts/th_common_voice_70.py`. You can use this scripts together with `train_cleand.tsv`, `validation_cleaned.tsv` and `test_cleaned.tsv` to have the same splits as we do. The resulting dataset is as follows:

	```
	DatasetDict({
	train: Dataset({
	features: ['path', 'sentence'],
	num_rows: 86586
	})
	test: Dataset({
	features: ['path', 'sentence'],
	num_rows: 2502
	})
	validation: Dataset({
	features: ['path', 'sentence'],
	num_rows: 3027
	})
	})
	```

	## Training

	We fintuned using the following configuration on a single V100 GPU and chose the checkpoint with the lowest validation loss. The finetuning script is `scripts/wav2vec2_finetune.py`

	```
	# create model
	model = Wav2Vec2ForCTC.from_pretrained(
	"facebook/wav2vec2-large-xlsr-53",
	attention_dropout=0.1,
	hidden_dropout=0.1,
	feat_proj_dropout=0.0,
	mask_time_prob=0.05,
	layerdrop=0.1,
	gradient_checkpointing=True,
	ctc_loss_reduction="mean",
	pad_token_id=processor.tokenizer.pad_token_id,
	vocab_size=len(processor.tokenizer)
	)
	model.freeze_feature_extractor()
	training_args = TrainingArguments(
	output_dir="../data/wav2vec2-large-xlsr-53-thai",
	group_by_length=True,
	per_device_train_batch_size=32,
	gradient_accumulation_steps=1,
	per_device_eval_batch_size=16,
	metric_for_best_model='wer',
	evaluation_strategy="steps",
	eval_steps=1000,
	logging_strategy="steps",
	logging_steps=1000,
	save_strategy="steps",
	save_steps=1000,
	num_train_epochs=100,
	fp16=True,
	learning_rate=1e-4,
	warmup_steps=1000,
	save_total_limit=3,
	report_to="tensorboard"
	)
	```

	## Evaluation

	We benchmark on the test set using WER with words tokenized by [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) 2.3.1 and [deepcut](https://github.com/rkcosmos/deepcut), and CER. We also measure performance when spell correction using [TNC](http://www.arts.chula.ac.th/ling/tnc/) ngrams is applied. Evaluation codes can be found in `notebooks/wav2vec2_finetuning_tutorial.ipynb`. Benchmark is performed on `test-unique` split.

	\| \| WER PyThaiNLP 2.3.1 \| WER deepcut \| CER \|
	\|--------------------------------\|---------------------\|----------------\|----------------\|
	\| [Kaldi from scratch](https://github.com/vistec-AI/commonvoice-th) \| 23.04 \| \| 7.57 \|
	\| Ours without spell correction \| 13.634024 \| 8.152052 \| 2.813019 \|
	\| Ours with spell correction \| 17.996397 \| 14.167975 \| 5.225761 \|
	\| Google Web Speech API※ \| 13.711234 \| 10.860058 \| 7.357340 \|
	\| Microsoft Bing Speech API※ \| 12.578819 \| 9.620991 \| 5.016620 \|
	\| Amazon Transcribe※ \| 21.86334 \| 14.487553 \| 7.077562 \|
	\| NECTEC AI for Thai Partii API※ \| 20.105887 \| 15.515631 \| 9.551027 \|

	※ APIs are not finetuned with Common Voice 7.0 data

	## LICENSE

	[cc-by-sa 4.0](https://github.com/vistec-AI/wav2vec2-large-xlsr-53-th/blob/main/LICENSE)

	## Ackowledgements
	* model training and validation notebooks/scripts [@cstorm125](https://github.com/cstorm125/)
	* dataset cleaning scripts [@tann9949](https://github.com/tann9949)
	* dataset splits [@ekapolc](https://github.com/ekapolc/) and [@14mss](https://github.com/14mss)
	* running the training [@mrpeerat](https://github.com/mrpeerat)
	* spell correction [@wannaphong](https://github.com/wannaphong)