not-tanh/wav2vec2-large-xlsr-53-vietnamese

@@ -24,12 +24,12 @@ model-index:
     metrics:
        - name: Test WER
          type: wer
-         value: 52.486188
 ---
-# Wav2Vec2-Large-XLSR-53-vietnamese #TODO: replace language with your {language}, *e.g.* French
-Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Vietnamese using the [Common Voice](https://huggingface.co/datasets/common_voice), and [Vivos dataset](https://ailab.hcmus.edu.vn/vivos).
 When using this model, make sure that your speech input is sampled at 16kHz.
 ## Usage
@@ -42,10 +42,10 @@ import torchaudio
 from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-test_dataset = load_dataset("common_voice", "vi", split="test") #TODO: replace {lang_id} in your language code here. Make sure the code is one of the *ISO codes* of [this](https://huggingface.co/languages) site.
-processor = Wav2Vec2Processor.from_pretrained("{model_id}") #TODO: replace {model_id} with your model id. The model id consists of {your_username}/{your_modelname}, *e.g.* `elgeish/wav2vec2-large-xlsr-53-arabic`
-model = Wav2Vec2ForCTC.from_pretrained("{model_id}") #TODO: replace {model_id} with your model id. The model id consists of {your_username}/{your_modelname}, *e.g.* `elgeish/wav2vec2-large-xlsr-53-arabic`
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
@@ -71,7 +71,7 @@ print("Reference:", test_dataset["sentence"][:2])
 ## Evaluation
-The model can be evaluated as follows on the {language} test data of Common Voice.  # TODO: replace #TODO: replace language with your {language}, *e.g.* French
 ```python
@@ -88,7 +88,7 @@ processor = Wav2Vec2Processor.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-v
 model = Wav2Vec2ForCTC.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
 model.to("cuda")
-chars_to_ignore_regex = '[\\\\,\\\\?\\\\.\\\\!\\\\-\\\\;\\\\:\\\\"\\\\“%\\\\'�]'
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 # Preprocessing the datasets.
@@ -118,12 +118,12 @@ result = test_dataset.map(evaluate, batched=True, batch_size=8)
 print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
 ```
-**Test Result**: 52.486188%
 ## Training
 ## TODO
-The Common Voice `train`, `validation`, and `vivos` datasets were used for training
-The script used for training can be found ... # TODO: fill in a link to your training script here. If you trained your model in a colab, simply fill in the link here. If you trained the model locally, it would be great if you could upload the training script on github and paste the link here.

     metrics:
        - name: Test WER
          type: wer
+         value: 40.745856
 ---
+# Wav2Vec2-Large-XLSR-53-Vietnamese
+Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Vietnamese using the [Common Voice](https://huggingface.co/datasets/common_voice), [VIVOS](https://ailab.hcmus.edu.vn/vivos), and [FOSD](https://data.mendeley.com/datasets/k9sxg2twv4/4) datasets.
 When using this model, make sure that your speech input is sampled at 16kHz.
 ## Usage
 from datasets import load_dataset
 from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+test_dataset = load_dataset("common_voice", "vi", split="test")
+processor = Wav2Vec2Processor.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
+model = Wav2Vec2ForCTC.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 ## Evaluation
+The model can be evaluated as follows on the Vietnamese test data of Common Voice.
 ```python
 model = Wav2Vec2ForCTC.from_pretrained("not-tanh/wav2vec2-large-xlsr-53-vietnamese")
 model.to("cuda")
+chars_to_ignore_regex = r'[,?.!\-;:"“%\'�]'
 resampler = torchaudio.transforms.Resample(48_000, 16_000)
 # Preprocessing the datasets.
 print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
 ```
+**Test Result**: 40.745856%
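
For readers who want to sanity-check what the reported figure measures, here is a minimal sketch of a corpus-level word error rate in plain Python. The helper `word_error_rate` is hypothetical and written for illustration only; it is not the implementation behind `wer.compute`, and it assumes whitespace tokenization.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word-level edits / total reference words.

    Illustrative helper (an assumption, not the metric used by the model card).
    """
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping only the previous row.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# One word ("các") is missing from the prediction: 1 edit over 4 reference words.
score = word_error_rate(["xin chào các bạn"], ["xin chào bạn"])
print("WER: {:2f}".format(100 * score))
```

Note that the `{:2f}` format spec sets a minimum field width of 2 rather than a precision of 2, so the value is printed with Python's default six decimal places, consistent with the six-decimal figure reported above.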
 ## Training
 ## TODO
+The Common Voice `train` and `validation` splits, along with the VIVOS and FOSD datasets, were used for training.
+The script used for training can be found ... # TODO
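
The change to `chars_to_ignore_regex` in this diff replaces a heavily escaped pattern with a raw string. As a rough sketch of how such a pattern is typically applied to transcripts before scoring, the snippet below strips the listed punctuation and lowercases the text. The `normalize` helper and the lowercasing step are assumptions for illustration; the preprocessing lines themselves are not shown in this diff.

```python
import re

# Pattern copied from the updated model card. Inside the raw string,
# \- is a literal hyphen and \' is a literal apostrophe within the class.
chars_to_ignore_regex = r'[,?.!\-;:"“%\'�]'


def normalize(sentence):
    """Remove ignored characters and lowercase (hypothetical helper)."""
    return re.sub(chars_to_ignore_regex, "", sentence).lower()


print(normalize("Xin chào, bạn khỏe không?"))
```

Using a raw string keeps the backslashes readable and avoids the multiple layers of escaping seen in the removed line.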