Upload README.md

563903f about 2 years ago

No virus

10.4 kB

	---
	language: zh-HK
	license: apache-2.0
	tags:
	- automatic-speech-recognition
	- generated_from_trainer
	- hf-asr-leaderboard
	- robust-speech-event
	datasets:
	- common_voice
	model-index:
	- name: Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM
	results:
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice
	type: common_voice
	args: zh-HK
	metrics:
	- name: Test CER
	type: cer
	value: 24.09
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice 7
	type: mozilla-foundation/common_voice_7_0
	args: zh-HK
	metrics:
	- name: Test CER
	type: cer
	value: 23.1
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice 8
	type: mozilla-foundation/common_voice_8_0
	args: zh-HK
	metrics:
	- name: Test CER
	type: cer
	value: 23.02
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Robust Speech Event - Dev Data
	type: speech-recognition-community-v2/dev_data
	args: zh-HK
	metrics:
	- name: Test CER
	type: cer
	value: 56.86
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Robust Speech Event - Test Data
	type: speech-recognition-community-v2/eval_data
	args: zh-HK
	metrics:
	- name: Test CER
	type: cer
	value: 55.76
	---

	# Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM

	Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM is an automatic speech recognition model based on the [XLS-R](https://arxiv.org/abs/2111.09296) architecture. This model is a fine-tuned version of [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the `zh-HK` subset of the [Common Voice](https://huggingface.co/datasets/common_voice) dataset. A 5-gram Language model, trained on multiple [PyCantonese](https://pycantonese.org/data.html) corpora, was then subsequently added to this model.

	This model was trained using HuggingFace's PyTorch framework and is part of the [Robust Speech Challenge Event](https://discuss.huggingface.co/t/open-to-the-community-robust-speech-recognition-challenge/13614) organized by HuggingFace. All training was done on a Tesla V100, sponsored by OVH.

	All necessary scripts used for training could be found in the [Files and versions](https://huggingface.co/w11wo/wav2vec2-xls-r-300m-zh-HK-lm-v2/tree/main) tab, as well as the [Training metrics](https://huggingface.co/w11wo/wav2vec2-xls-r-300m-zh-HK-lm-v2/tensorboard) logged via Tensorboard.

	As for the N-gram language model training, we followed the [blog post tutorial](https://huggingface.co/blog/wav2vec2-with-ngram) provided by HuggingFace.

	## Model

	\| Model \| #params \| Arch. \| Training/Validation data (text) \|
	\| --------------------------------- \| ------- \| ----- \| ------------------------------- \|
	\| `wav2vec2-xls-r-300m-zh-HK-lm-v2` \| 300M \| XLS-R \| `Common Voice zh-HK` Dataset \|

	## Evaluation Results

	The model achieves the following results on evaluation without a language model:

	\| Dataset \| CER \|
	\| -------------------------------- \| ------ \|
	\| `Common Voice` \| 31.73% \|
	\| `Common Voice 7` \| 23.11% \|
	\| `Common Voice 8` \| 23.02% \|
	\| `Robust Speech Event - Dev Data` \| 56.60% \|

	With the addition of the language model, it achieves the following results:

	\| Dataset \| CER \|
	\| -------------------------------- \| ------ \|
	\| `Common Voice` \| 24.09% \|
	\| `Common Voice 7` \| 23.10% \|
	\| `Common Voice 8` \| 23.02% \|
	\| `Robust Speech Event - Dev Data` \| 56.86% \|

	## Training procedure

	The training process did not involve the addition of a language model. The following results were simply lifted from the original automatic speech recognition [model training](https://huggingface.co/w11wo/wav2vec2-xls-r-300m-zh-HK-v2).

	### Training hyperparameters

	The following hyperparameters were used during training:

	- `learning_rate`: 0.0001
	- `train_batch_size`: 8
	- `eval_batch_size`: 8
	- `seed`: 42
	- `gradient_accumulation_steps`: 4
	- `total_train_batch_size`: 32
	- `optimizer`: Adam with `betas=(0.9, 0.999)` and `epsilon=1e-08`
	- `lr_scheduler_type`: linear
	- `lr_scheduler_warmup_steps`: 2000
	- `num_epochs`: 100.0
	- `mixed_precision_training`: Native AMP

	### Training results

	\| Training Loss \| Epoch \| Step \| Validation Loss \| Wer \| Cer \|
	\| :-----------: \| :---: \| :---: \| :-------------: \| :----: \| :----: \|
	\| 69.8341 \| 1.34 \| 500 \| 80.0722 \| 1.0 \| 1.0 \|
	\| 6.6418 \| 2.68 \| 1000 \| 6.6346 \| 1.0 \| 1.0 \|
	\| 6.2419 \| 4.02 \| 1500 \| 6.2909 \| 1.0 \| 1.0 \|
	\| 6.0813 \| 5.36 \| 2000 \| 6.1150 \| 1.0 \| 1.0 \|
	\| 5.9677 \| 6.7 \| 2500 \| 6.0301 \| 1.1386 \| 1.0028 \|
	\| 5.9296 \| 8.04 \| 3000 \| 5.8975 \| 1.2113 \| 1.0058 \|
	\| 5.6434 \| 9.38 \| 3500 \| 5.5404 \| 2.1624 \| 1.0171 \|
	\| 5.1974 \| 10.72 \| 4000 \| 4.5440 \| 2.1702 \| 0.9366 \|
	\| 4.3601 \| 12.06 \| 4500 \| 3.3839 \| 2.2464 \| 0.8998 \|
	\| 3.9321 \| 13.4 \| 5000 \| 2.8785 \| 2.3097 \| 0.8400 \|
	\| 3.6462 \| 14.74 \| 5500 \| 2.5108 \| 1.9623 \| 0.6663 \|
	\| 3.5156 \| 16.09 \| 6000 \| 2.2790 \| 1.6479 \| 0.5706 \|
	\| 3.32 \| 17.43 \| 6500 \| 2.1450 \| 1.8337 \| 0.6244 \|
	\| 3.1918 \| 18.77 \| 7000 \| 1.8536 \| 1.9394 \| 0.6017 \|
	\| 3.1139 \| 20.11 \| 7500 \| 1.7205 \| 1.9112 \| 0.5638 \|
	\| 2.8995 \| 21.45 \| 8000 \| 1.5478 \| 1.0624 \| 0.3250 \|
	\| 2.7572 \| 22.79 \| 8500 \| 1.4068 \| 1.1412 \| 0.3367 \|
	\| 2.6881 \| 24.13 \| 9000 \| 1.3312 \| 2.0100 \| 0.5683 \|
	\| 2.5993 \| 25.47 \| 9500 \| 1.2553 \| 2.0039 \| 0.6450 \|
	\| 2.5304 \| 26.81 \| 10000 \| 1.2422 \| 2.0394 \| 0.5789 \|
	\| 2.4352 \| 28.15 \| 10500 \| 1.1582 \| 1.9970 \| 0.5507 \|
	\| 2.3795 \| 29.49 \| 11000 \| 1.1160 \| 1.8255 \| 0.4844 \|
	\| 2.3287 \| 30.83 \| 11500 \| 1.0775 \| 1.4123 \| 0.3780 \|
	\| 2.2622 \| 32.17 \| 12000 \| 1.0704 \| 1.7445 \| 0.4894 \|
	\| 2.2225 \| 33.51 \| 12500 \| 1.0272 \| 1.7237 \| 0.5058 \|
	\| 2.1843 \| 34.85 \| 13000 \| 0.9756 \| 1.8042 \| 0.5028 \|
	\| 2.1 \| 36.19 \| 13500 \| 0.9527 \| 1.8909 \| 0.6055 \|
	\| 2.0741 \| 37.53 \| 14000 \| 0.9418 \| 1.9026 \| 0.5880 \|
	\| 2.0179 \| 38.87 \| 14500 \| 0.9363 \| 1.7977 \| 0.5246 \|
	\| 2.0615 \| 40.21 \| 15000 \| 0.9635 \| 1.8112 \| 0.5599 \|
	\| 1.9448 \| 41.55 \| 15500 \| 0.9249 \| 1.7250 \| 0.4914 \|
	\| 1.8966 \| 42.89 \| 16000 \| 0.9023 \| 1.5829 \| 0.4319 \|
	\| 1.8662 \| 44.24 \| 16500 \| 0.9002 \| 1.4833 \| 0.4230 \|
	\| 1.8136 \| 45.58 \| 17000 \| 0.9076 \| 1.1828 \| 0.2987 \|
	\| 1.7908 \| 46.92 \| 17500 \| 0.8774 \| 1.5773 \| 0.4258 \|
	\| 1.7354 \| 48.26 \| 18000 \| 0.8727 \| 1.5037 \| 0.4024 \|
	\| 1.6739 \| 49.6 \| 18500 \| 0.8636 \| 1.1239 \| 0.2789 \|
	\| 1.6457 \| 50.94 \| 19000 \| 0.8516 \| 1.2269 \| 0.3104 \|
	\| 1.5847 \| 52.28 \| 19500 \| 0.8399 \| 1.3309 \| 0.3360 \|
	\| 1.5971 \| 53.62 \| 20000 \| 0.8441 \| 1.3153 \| 0.3335 \|
	\| 1.602 \| 54.96 \| 20500 \| 0.8590 \| 1.2932 \| 0.3433 \|
	\| 1.5063 \| 56.3 \| 21000 \| 0.8334 \| 1.1312 \| 0.2875 \|
	\| 1.4631 \| 57.64 \| 21500 \| 0.8474 \| 1.1698 \| 0.2999 \|
	\| 1.4997 \| 58.98 \| 22000 \| 0.8638 \| 1.4279 \| 0.3854 \|
	\| 1.4301 \| 60.32 \| 22500 \| 0.8550 \| 1.2737 \| 0.3300 \|
	\| 1.3798 \| 61.66 \| 23000 \| 0.8266 \| 1.1802 \| 0.2934 \|
	\| 1.3454 \| 63.0 \| 23500 \| 0.8235 \| 1.3816 \| 0.3711 \|
	\| 1.3678 \| 64.34 \| 24000 \| 0.8550 \| 1.6427 \| 0.5035 \|
	\| 1.3761 \| 65.68 \| 24500 \| 0.8510 \| 1.6709 \| 0.4907 \|
	\| 1.2668 \| 67.02 \| 25000 \| 0.8515 \| 1.5842 \| 0.4505 \|
	\| 1.2835 \| 68.36 \| 25500 \| 0.8283 \| 1.5353 \| 0.4221 \|
	\| 1.2961 \| 69.7 \| 26000 \| 0.8339 \| 1.5743 \| 0.4369 \|
	\| 1.2656 \| 71.05 \| 26500 \| 0.8331 \| 1.5331 \| 0.4217 \|
	\| 1.2556 \| 72.39 \| 27000 \| 0.8242 \| 1.4708 \| 0.4109 \|
	\| 1.2043 \| 73.73 \| 27500 \| 0.8245 \| 1.4469 \| 0.4031 \|
	\| 1.2722 \| 75.07 \| 28000 \| 0.8202 \| 1.4924 \| 0.4096 \|
	\| 1.202 \| 76.41 \| 28500 \| 0.8290 \| 1.3807 \| 0.3719 \|
	\| 1.1679 \| 77.75 \| 29000 \| 0.8195 \| 1.4097 \| 0.3749 \|
	\| 1.1967 \| 79.09 \| 29500 \| 0.8059 \| 1.2074 \| 0.3077 \|
	\| 1.1241 \| 80.43 \| 30000 \| 0.8137 \| 1.2451 \| 0.3270 \|
	\| 1.1414 \| 81.77 \| 30500 \| 0.8117 \| 1.2031 \| 0.3121 \|
	\| 1.132 \| 83.11 \| 31000 \| 0.8234 \| 1.4266 \| 0.3901 \|
	\| 1.0982 \| 84.45 \| 31500 \| 0.8064 \| 1.3712 \| 0.3607 \|
	\| 1.0797 \| 85.79 \| 32000 \| 0.8167 \| 1.3356 \| 0.3562 \|
	\| 1.0119 \| 87.13 \| 32500 \| 0.8215 \| 1.2754 \| 0.3268 \|
	\| 1.0216 \| 88.47 \| 33000 \| 0.8163 \| 1.2512 \| 0.3184 \|
	\| 1.0375 \| 89.81 \| 33500 \| 0.8137 \| 1.2685 \| 0.3290 \|
	\| 0.9794 \| 91.15 \| 34000 \| 0.8220 \| 1.2724 \| 0.3255 \|
	\| 1.0207 \| 92.49 \| 34500 \| 0.8165 \| 1.2906 \| 0.3361 \|
	\| 1.0169 \| 93.83 \| 35000 \| 0.8153 \| 1.2819 \| 0.3305 \|
	\| 1.0127 \| 95.17 \| 35500 \| 0.8187 \| 1.2832 \| 0.3252 \|
	\| 0.9978 \| 96.51 \| 36000 \| 0.8111 \| 1.2612 \| 0.3210 \|
	\| 0.9923 \| 97.85 \| 36500 \| 0.8076 \| 1.2278 \| 0.3122 \|
	\| 1.0451 \| 99.2 \| 37000 \| 0.8086 \| 1.2451 \| 0.3156 \|

	## Disclaimer

	Do consider the biases which came from pre-training datasets that may be carried over into the results of this model.

	## Authors

	Wav2Vec2 XLS-R 300M Cantonese (zh-HK) LM was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development are done on OVH Cloud.

	## Framework versions

	- Transformers 4.17.0.dev0
	- Pytorch 1.10.2+cu102
	- Datasets 1.18.4.dev0
	- Tokenizers 0.11.0