CuongLD committed
Commit 8a7ee00
1 Parent(s): 720badd

Add readme file

Files changed (1)
  1. README.md +224 -0
README.md ADDED
@@ -0,0 +1,224 @@
---
language: vi
datasets:
- common_voice
- infore_25h  # https://files.huylenguyen.com/25hours.zip (Password: BroughtToYouByInfoRe)
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Cuong-Cong XLSR Wav2Vec2 Large 53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice vi
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 58.63
---

# Wav2Vec2-Large-XLSR-53-Vietnamese

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Vietnamese using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset and the [Infore_25h dataset](https://files.huylenguyen.com/25hours.zip) (Password: BroughtToYouByInfoRe).

When using this model, make sure that your speech input is sampled at 16kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "vi", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

## Evaluation

The model can be evaluated as follows on the Vietnamese test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "vi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model = Wav2Vec2ForCTC.from_pretrained("CuongLD/wav2vec2-large-xlsr-vietnamese")
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to normalize the transcriptions and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on the test set and collect the predicted transcriptions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 58.63 %

## Training

The Common Voice `train` and `validation` splits and the `Infore_25h` dataset were used for training.

The script used for training can be found [here](https://drive.google.com/file/d/1AW9R8IlsapiSGh9n3aECf23t-zhk3wUh/view?usp=sharing).

Your model is then available under *huggingface.co/CuongLD/wav2vec2-large-xlsr-vietnamese* for everybody to use 🎉.

## How to evaluate my trained checkpoint

Having uploaded your model, you should now evaluate it in a final step. This should be as simple as copying the evaluation code from your model card into a Python script and running it. Make sure to note the final result on the model card **both** under the YAML tags at the very top **and** below your evaluation code under "Test Results".

## Rules of training and evaluation

In this section, we will quickly go over what data is allowed to be used as training data, what kind of data preprocessing is allowed to be used, and how the model should be evaluated.

To make it very simple regarding the first point: **All data except the official Common Voice `test` data set can be used as training data**. For models trained in a language that is not included in Common Voice, the author of the model is responsible for leaving a reasonable amount of data for evaluation.
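
For example, if your extra training data has no official split (as with the Infore_25h corpus), one simple way to hold out evaluation data is `Dataset.train_test_split`. The snippet below is only a sketch: the CSV loader and the file name `infore_25h.csv` are placeholders for however you have stored the extra data.

```python
from datasets import load_dataset

# Placeholder: load the extra corpus from a local CSV listing audio paths and
# transcriptions; adjust the loader and column names to your own data.
infore_dataset = load_dataset("csv", data_files={"data": "infore_25h.csv"})["data"]

# Hold out 10% of the extra data so the model is never evaluated on
# sentences it has seen during training.
splits = infore_dataset.train_test_split(test_size=0.1, seed=42)
infore_train = splits["train"]
infore_eval = splits["test"]
```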

Second, the rules regarding preprocessing are not as straightforward. It is allowed (and recommended) to normalize the data to only have lower-case characters. It is also allowed (and recommended) to remove typographical symbols and punctuation marks. A list of such symbols can *e.g.* be found [here](https://en.wikipedia.org/wiki/List_of_typographical_symbols_and_punctuation_marks) - however, here we already must be careful. We should **not** remove a symbol that would change the meaning of the words, *e.g.* in English, we should not remove the single quotation mark `'` since it would change the meaning of the word `"it's"` to `"its"`, which would then be incorrect. So the golden rule here is to not remove any characters that could change the meaning of a word into another word. This is not always obvious and should be given some consideration. As another example, it is fine to remove the "hyphen-minus" sign "`-`" since it doesn't change the meaning of a word to another one. *E.g.* "`fine-tuning`" would be changed to "`finetuning`", which still has the same meaning.

Since those choices are not always obvious, when in doubt feel free to ask on Slack or, even better, post on the forum, as was done, *e.g.*, [here](https://discuss.huggingface.co/t/spanish-asr-fine-tuning-wav2vec2/4586).
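
As a concrete illustration of these rules, the sketch below shows one possible normalization function for Vietnamese: it lowercases the text and strips a list of punctuation marks while leaving all letters, including Vietnamese diacritics, untouched. The exact set of characters to remove is an assumption that should be adapted to your corpus.

```python
import re

# Symbols that can safely be removed without changing word meaning
# (assumption: extend or shrink this list based on your own corpus).
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\”]'

def normalize_text(sentence):
    # Lowercase and strip the listed typographical symbols; Vietnamese
    # letters and diacritics are left untouched.
    sentence = re.sub(chars_to_ignore_regex, "", sentence).lower()
    # Collapse any double spaces left behind by the removal.
    return re.sub(r"\s+", " ", sentence).strip()

print(normalize_text("Xin chào, bạn khỏe không?"))  # -> "xin chào bạn khỏe không"
```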

## Tips and tricks

This section summarizes a couple of tips and tricks across various topics. It will continuously be updated during the week.

### How to combine multiple datasets into one

Check out [this](https://discuss.huggingface.co/t/how-to-combine-local-data-files-with-an-official-dataset/4685) post.
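
As a rough sketch (the exact steps depend on how your extra data is stored), combining Common Voice with a local dataset boils down to aligning the column names on both sides and then calling `concatenate_datasets`. The CSV file name below is a placeholder.

```python
from datasets import load_dataset, concatenate_datasets

common_voice_train = load_dataset("common_voice", "vi", split="train+validation")

# Placeholder: the Infore_25h data loaded from a local CSV with "path" and
# "sentence" columns; adjust the loader to match how the corpus is stored.
infore_train = load_dataset("csv", data_files={"train": "infore_25h.csv"})["train"]

# Keep only the columns both datasets share so that their schemas match.
common_voice_train = common_voice_train.remove_columns(
    [c for c in common_voice_train.column_names if c not in ["path", "sentence"]]
)

# concatenate_datasets requires identical features on all datasets.
combined_train = concatenate_datasets([common_voice_train, infore_train])
```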

### How to effectively preprocess the data
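
No official recipe is given here, but one general tip (sketched below, assuming the column names match Common Voice) is to do all text and audio preprocessing once with `Dataset.map`, parallelize it with `num_proc`, and drop columns you no longer need so the cached dataset stays small.

```python
import re
import torchaudio
from datasets import load_dataset

dataset = load_dataset("common_voice", "vi", split="train")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def prepare(batch):
    # Normalize the transcription and load + resample the audio in one pass.
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

# num_proc parallelizes the work; remove_columns drops metadata that is no
# longer needed so that the processed dataset stays small on disk.
dataset = dataset.map(
    prepare,
    num_proc=4,
    remove_columns=[c for c in dataset.column_names if c not in ["path", "sentence"]],
)
```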

### How to efficiently load datasets with limited RAM and hard drive space

Check out [this](https://discuss.huggingface.co/t/german-asr-fine-tuning-wav2vec2/4558/8?u=patrickvonplaten) post.
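
If the dataset does not fit into memory or onto disk, one option (a sketch, assuming a reasonably recent version of `datasets`) is to load it in streaming mode, so that examples are downloaded and processed on the fly instead of being materialized locally.

```python
from datasets import load_dataset

# Streaming mode returns an IterableDataset: nothing is written to disk and
# examples are yielded one by one as they are downloaded.
streamed_train = load_dataset("common_voice", "vi", split="train", streaming=True)

for example in streamed_train:
    print(example["sentence"])
    break
```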

### How to do hyperparameter tuning
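
No official recipe is given here either. A very simple approach is a small grid search over a few values of the learning rate and warmup steps. The sketch below assumes that `model_init`, `train_dataset`, `eval_dataset`, `data_collator`, and `processor` have already been prepared as in the fine-tuning blog post; those names are placeholders, and the grid values are only examples.

```python
from transformers import Trainer, TrainingArguments

learning_rates = [1e-4, 3e-4]
warmup_steps_options = [500, 1000]

for lr in learning_rates:
    for warmup in warmup_steps_options:
        training_args = TrainingArguments(
            output_dir=f"./wav2vec2-vi-lr{lr}-warmup{warmup}",
            per_device_train_batch_size=16,
            evaluation_strategy="steps",
            num_train_epochs=5,
            learning_rate=lr,
            warmup_steps=warmup,
            save_total_limit=1,
        )
        trainer = Trainer(
            model_init=model_init,                  # placeholder: returns a fresh Wav2Vec2ForCTC
            args=training_args,
            train_dataset=train_dataset,            # placeholder: preprocessed training set
            eval_dataset=eval_dataset,              # placeholder: held-out evaluation set
            data_collator=data_collator,            # placeholder: CTC padding collator
            tokenizer=processor.feature_extractor,  # placeholder: Wav2Vec2Processor
        )
        trainer.train()
        print(f"lr={lr}, warmup={warmup}:", trainer.evaluate())
```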

### How to preprocess and evaluate character-based languages
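
For languages without clear word boundaries, one common trick (sketched below, assuming `result` comes from the evaluation script above) is to insert spaces between all characters so that the word-level WER metric effectively measures a character error rate.

```python
from datasets import load_metric

wer = load_metric("wer")

def to_chars(text):
    # Drop the original spaces and re-join every character with a space, so
    # that each character is treated as a "word" by the WER metric.
    return " ".join(list(text.replace(" ", "")))

cer = wer.compute(
    predictions=[to_chars(p) for p in result["pred_strings"]],
    references=[to_chars(r) for r in result["sentence"]],
)
print("CER: {:.2f}".format(100 * cer))
```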

## Further reading material

It is recommended that you take some time to read up on how Wav2vec2 works in theory. Getting a better understanding of the theory and the inner mechanisms of the model often helps when fine-tuning the model.

**However**, if you don't like reading blog posts/papers, don't worry - it is by no means necessary to go through the theory to fine-tune Wav2Vec2 on your language of choice.

If you are interested in learning more about the model though, here are a couple of resources that are important to better understand Wav2Vec2:

- [Facebook's Wav2Vec2 blog post](https://ai.facebook.com/blog/wav2vec-state-of-the-art-speech-recognition-through-self-supervision/)
- [Official Wav2Vec2 paper](https://arxiv.org/abs/2006.11477)
- [Official XLSR Wav2vec2 paper](https://arxiv.org/pdf/2006.13979.pdf)
- [Hugging Face Blog](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2)
- [How does CTC (Connectionist Temporal Classification) work](https://distill.pub/2017/ctc/)

It helps to have a good understanding of the following points:

- How was XLSR-Wav2Vec2 pretrained? -> Feature vectors were masked and had to be predicted by the model; very similar in spirit to the masked language modeling of BERT.

- What parts of XLSR-Wav2Vec2 are responsible for what? What is the feature extractor part used for? -> It extracts feature vectors from the 1D raw audio waveform. What is the transformer part doing? -> It maps feature vectors to contextualized feature vectors; ...

- What part of the model needs to be fine-tuned? -> The pretrained model **does not** include a language head to classify the contextualized features to letters. This is randomly initialized when loading the pretrained checkpoint and has to be fine-tuned. Also, note that the authors recommend **not** fine-tuning the feature extractor any further.

- What data was used to pretrain XLSR-Wav2Vec2? The checkpoint we will use for further fine-tuning was pretrained on **53** languages.

- What languages are considered to be similar by XLSR-Wav2Vec2? In the official [XLSR Wav2Vec2 paper](https://arxiv.org/pdf/2006.13979.pdf), the authors show nicely which languages share a common contextualized latent space. It might be useful for you to extend your training data with data of other languages that are considered to be very similar by the model (or you).

## FAQ

- Can a participant fine-tune models for more than one language?
  Yes! A participant can fine-tune models in as many languages as she/he likes.
- Can a participant use extra data (apart from the Common Voice data)?
  Yes! All data except the official Common Voice `test` data can be used for training.
  If a participant wants to train a model on a language that is not part of Common Voice (which is very much encouraged!), the participant should make sure that some test data is held out to make sure the model is not overfitting.
- Can we fine-tune for high-resource languages?
  Yes! We do not really recommend fine-tuning models in English, since there are already so many fine-tuned speech recognition models in English. However, it is very much appreciated if participants want to fine-tune models in other "high-resource" languages, such as French, Spanish, or German. For such cases, one probably needs to train locally and might have to apply tricks such as lazy data loading (check the ["How to efficiently load datasets with limited RAM and hard drive space"](#how-to-efficiently-load-datasets-with-limited-ram-and-hard-drive-space) section for more details).