masoudmzb committed on
Commit
4129748
1 Parent(s): 2824913

Upload README.md

  1. README.md +226 -0
README.md ADDED
@@ -0,0 +1,226 @@
1
+ # wav2vec 2.0 multilingual (Fine-tuned)
2
+ The base model is pretrained on speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz. Note that the base model should be fine-tuned on a downstream task, such as automatic speech recognition. Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for more information.
3
+
4
+ [Paper](https://arxiv.org/abs/2006.13979)
5
+
6
+ Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli
7
+
8
+ **Abstract** This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to a comparable system. Our approach enables a single multilingual speech recognition model which is competitive to strong individual models. Analysis shows that the latent discrete speech representations are shared across languages with increased sharing for related languages. We hope to catalyze research in low-resource speech understanding by releasing XLSR-53, a large model pretrained in 53 languages.
9
+
10
+ The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
11
+
12
+
13
+
14
+ This model is [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) fine-tuned on Persian (Farsi) using [Common Voice](https://huggingface.co/datasets/common_voice) plus our own dataset (1/3 of the total dataset). When using this model, make sure that your speech input is sampled at 16 kHz.
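
A quick way to confirm the sampling rate of a WAV file before running inference is sketched below. It uses only Python's standard `wave` module, so it handles uncompressed PCM WAV files only. The helper names (`get_sample_rate`, `is_16khz`, `write_silence`) are hypothetical and exist just for this illustration. Files at another rate need resampling first.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the text above


def get_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """True if the file is already sampled at 16 kHz."""
    return get_sample_rate(path) == TARGET_SAMPLE_RATE


def write_silence(path, sample_rate, seconds=1):
    """Write a mono 16-bit silent WAV file (demo input only)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


# Demo on two throwaway files written to a temporary directory.
demo_dir = tempfile.mkdtemp()
ok_path = os.path.join(demo_dir, "ok_16k.wav")
bad_path = os.path.join(demo_dir, "needs_resampling_44k.wav")
write_silence(ok_path, 16_000)
write_silence(bad_path, 44_100)

for path in (ok_path, bad_path):
    print(os.path.basename(path), get_sample_rate(path), is_16khz(path))
```
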
15
+
16
+ ## Evaluation: 🌡️
17
+ We evaluated the model on a private dataset containing different types of audio. Unfortunately, the test and validation data are not publicly available, but you can see a sample of the dataset at [this link](https://github.com/shenasa-ai/speech2text#part-of-our-dataset-v01--):
18
+
19
+ | Name | WER on test dataset |
+ | :----------------------------------------------------------: | :-----------------: |
+ | [m3hrdadfi/wav2vec2-large-xlsr-persian-v3](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3) | 0.56754 |
+ | [This New Model](https://huggingface.co/masoudmzb/wav2vec2-xlsr-multilingual-53-fa) | **0.40815** |
+ | Base Multilingual Model | 0.69746 |
24
+
25
+ - The table shows that adding more data gives a much better result: WER drops from 0.56754 for the checkpoint we started from to 0.40815 for this model.
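
To put the table in relative terms, the short sketch below computes the relative WER reduction of this model against the other two rows. The WER values are copied from the table; the helper `relative_reduction` is a hypothetical name used only here.

```python
def relative_reduction(baseline_wer, improved_wer):
    """Fraction by which `improved_wer` is lower than `baseline_wer`."""
    return (baseline_wer - improved_wer) / baseline_wer


# WER values taken from the evaluation table above.
wer_persian_v3 = 0.56754
wer_this_model = 0.40815
wer_base_multilingual = 0.69746

print(f"vs. m3hrdadfi v3:      {relative_reduction(wer_persian_v3, wer_this_model):.1%}")
print(f"vs. base multilingual: {relative_reduction(wer_base_multilingual, wer_this_model):.1%}")
```
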
26
+
27
+
28
+ ## How to use❓
29
+
30
+ ### Use Fine-Tuned Model
31
+
32
+ This model is fine-tuned from [m3hrdadfi/wav2vec2-large-xlsr-persian-v3](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3), so the training and evaluation process is the same.
33
+
34
+ > ```bash
+ > # Required packages (the leading "!" runs each command from a notebook cell)
+ > !pip install git+https://github.com/huggingface/datasets.git
+ > !pip install git+https://github.com/huggingface/transformers.git
+ > !pip install torchaudio
+ > !pip install librosa
+ > !pip install jiwer
+ > !pip install parsivar
+ > !pip install num2fawords
+ > ```
44
+
45
+
46
+
47
+ **Normalizer**
48
+
49
+ ```bash
+ # Normalizer (normalizer.py imports dictionary.py, so download both)
+ !wget -O dictionary.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/raw/main/dictionary.py
+ !wget -O normalizer.py https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/raw/main/normalizer.py
+ ```
55
+
56
+
57
+
58
+ If you are not sure whether your transcriptions are clean (for example, they may contain unusual characters or characters from other alphabets), use this code provided by [m3hrdadfi/wav2vec2-large-xlsr-persian-v3](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3).
59
+
60
+
61
+
62
+ **Cleaning** (Fill the data part with your own data dir)
63
+
64
+ ```python
+ import os
+
+ import pandas as pd
+
+ from normalizer import normalizer
+
+
+ def cleaning(text):
+     # Non-string entries (e.g. NaN) become None so they can be dropped below.
+     if not isinstance(text, str):
+         return None
+
+     return normalizer({"sentence": text}, return_dict=False)
+
+
+ # Edit these parts with your own data directory.
+ data_dir = "data"
+
+ test = pd.read_csv(f"{data_dir}/yourtest.tsv", sep="\t")
+ test["path"] = data_dir + "/clips/" + test["path"]
+ print(f"Step 0: {len(test)}")
+
+ # Keep only the rows whose audio file exists on disk.
+ test["status"] = test["path"].apply(lambda path: True if os.path.exists(path) else None)
+ test = test.dropna(subset=["status"])
+ test = test.drop(columns=["status"])
+ print(f"Step 1: {len(test)}")
+
+ # Normalize the transcriptions and drop the rows that could not be cleaned.
+ test["sentence"] = test["sentence"].apply(cleaning)
+ test = test.dropna(subset=["sentence"])
+ print(f"Step 2: {len(test)}")
+
+ test = test.reset_index(drop=True)
+ print(test.head())
+
+ test = test[["path", "sentence"]]
+ test.to_csv("/content/test.csv", sep="\t", encoding="utf-8", index=False)
+ ```
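
Before running the full normalizer, it can help to see how many transcriptions contain characters from other alphabets at all. The sketch below is not the normalizer from `m3hrdadfi`'s repository. It is a rough, hypothetical pre-check that assumes a clean transcription contains only characters from the Unicode Arabic block (which includes the Persian letters), the zero-width non-joiner, and whitespace.

```python
import re

# Assumed allow-list, for illustration only: the Arabic Unicode block
# (U+0600 to U+06FF), the zero-width non-joiner (U+200C), and whitespace.
ALLOWED_TEXT = re.compile(r"[\u0600-\u06FF\u200c\s]*")


def has_foreign_chars(text):
    """True if `text` contains any character outside the allow-list."""
    return ALLOWED_TEXT.fullmatch(text) is None


samples = ["سلام دنیا", "سلام world", "hello"]
for sample in samples:
    print(repr(sample), has_foreign_chars(sample))
```

Rows flagged this way are candidates for normalization or manual review, not necessarily errors; digits and punctuation from other scripts, for instance, are also flagged by this allow-list.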
98
+
99
+
100
+
101
+ **Prediction**
102
+
103
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ import librosa
+ import torch
+ import torchaudio
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ from datasets import load_dataset, load_metric
+
+ import IPython.display as ipd
+
+ model_name_or_path = "masoudmzb/wav2vec2-xlsr-multilingual-53-fa"
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ print(model_name_or_path, device)
+
+ processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
+ model = Wav2Vec2ForCTC.from_pretrained(model_name_or_path).to(device)
+
+
+ def speech_file_to_array_fn(batch):
+     # Load the audio file and resample it to the rate the model expects (16 kHz).
+     speech_array, sampling_rate = torchaudio.load(batch["path"])
+     speech_array = speech_array.squeeze().numpy()
+     speech_array = librosa.resample(
+         np.asarray(speech_array),
+         orig_sr=sampling_rate,
+         target_sr=processor.feature_extractor.sampling_rate,
+     )
+
+     batch["speech"] = speech_array
+     return batch
+
+
+ def predict(batch):
+     features = processor(
+         batch["speech"],
+         sampling_rate=processor.feature_extractor.sampling_rate,
+         return_tensors="pt",
+         padding=True
+     )
+
+     input_values = features.input_values.to(device)
+     attention_mask = features.attention_mask.to(device)
+
+     with torch.no_grad():
+         logits = model(input_values, attention_mask=attention_mask).logits
+
+     # Greedy decoding: take the most likely token at every frame.
+     pred_ids = torch.argmax(logits, dim=-1)
+
+     batch["predicted"] = processor.batch_decode(pred_ids)
+     return batch
+
+
+ # Edit these parts with your own data directory.
+ dataset = load_dataset("csv", data_files={"test": "/path_to/your_test.csv"}, delimiter="\t")["test"]
+ dataset = dataset.map(speech_file_to_array_fn)
+ result = dataset.map(predict, batched=True, batch_size=4)
+ ```
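
The `argmax` followed by `processor.batch_decode` in the prediction code is greedy CTC decoding. As a simplified illustration of the idea (not the `transformers` implementation), the sketch below collapses repeated frame-level ids and then drops the blank token. The blank id `0` and the two-letter vocabulary are assumptions made for the example; the real ids depend on the model's vocabulary.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse consecutive repeats, then remove blanks (greedy CTC decoding)."""
    collapsed = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary and per-frame argmax output.
vocab = {1: "a", 2: "b"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 2]

token_ids = ctc_greedy_collapse(frame_ids)
print(token_ids, "".join(vocab[i] for i in token_ids))
```

The blank between the two runs of `2` is what lets the decoder output the same letter twice in a row.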
156
+
157
+
158
+
159
+ **WER Score**
160
+
161
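+ ```python
+ # `load_metric` is only available in older `datasets` releases;
+ # with newer ones, the same metric can be loaded with `evaluate.load("wer")`.
+ wer = load_metric("wer")
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["predicted"], references=result["sentence"])))
+ ```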
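
For readers unfamiliar with the metric, the sketch below shows what a word error rate measures for a single sentence pair: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. It is a minimal pure-Python illustration with a hypothetical function name, not the implementation behind `load_metric("wer")`, which aggregates over the whole test set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current

    return previous[-1] / len(ref_words)


print(word_error_rate("a b c d", "a x c"))
```
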
166
+
167
+
168
+
169
+ **Output**
170
+
171
+ ```python
+ # Print 20 randomly chosen reference/prediction pairs.
+ max_items = np.random.randint(0, len(result), 20).tolist()
+ for i in max_items:
+     reference, predicted = result["sentence"][i], result["predicted"][i]
+     print("reference:", reference)
+     print("predicted:", predicted)
+     print('---')
+ ```
180
+
181
+
182
+
183
+
184
+
185
+ ## Training details: 🔭
186
+
187
+ A model had already been trained on the Persian Mozilla dataset, so we decided to continue from it. The model is warm-started from `m3hrdadfi`’s [checkpoint](https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3).
+ - For more details, take a look at config.json in the model repository on the 🤗 Model Hub.
+ - The model was trained for 84000 steps, equal to 12.42 epochs.
+ - The base model for fine-tuning was https://huggingface.co/m3hrdadfi/wav2vec2-large-xlsr-persian-v3/tree/main
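
The step and epoch counts above imply the number of optimizer steps per epoch, which the short calculation below works out. The batch size is not stated in this README, so the number of training examples cannot be recovered from these figures alone.

```python
# Figures taken from the training details above.
total_steps = 84_000
total_epochs = 12.42

# One epoch is a full pass over the training set.
steps_per_epoch = total_steps / total_epochs
print(f"about {steps_per_epoch:.0f} optimizer steps per epoch")
```
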
191
+
192
+ ## Fine Tuning Recommendations: 🐤
193
+ For fine-tuning, you can check the links below, but keep a few tips in mind. You may need gradient accumulation to reach a larger effective batch size. There are many hyperparameters, so make sure you set them properly:
194
+
195
+ - learning_rate
+ - attention_dropout
+ - hidden_dropout
+ - feat_proj_dropout
+ - mask_time_prob
+ - layer_drop
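
The sketch below groups the hyperparameters listed above in one place and shows how gradient accumulation raises the effective batch size. Every value is an illustrative placeholder, not the setting used to train this model, and the class and method names are hypothetical. Note that `transformers`' `Wav2Vec2Config` spells the layer-drop parameter `layerdrop`, and `TrainingArguments` names the accumulation setting `gradient_accumulation_steps`.

```python
from dataclasses import asdict, dataclass


@dataclass
class FineTuneHyperparameters:
    # Placeholder values for illustration only; tune them for your data.
    learning_rate: float = 3e-4
    attention_dropout: float = 0.1
    hidden_dropout: float = 0.1
    feat_proj_dropout: float = 0.0
    mask_time_prob: float = 0.05
    layerdrop: float = 0.1  # listed as layer_drop above
    per_device_batch_size: int = 4
    gradient_accumulation_steps: int = 8

    def effective_batch_size(self, num_devices=1):
        """Examples contributing to each optimizer update."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


hyperparameters = FineTuneHyperparameters()
print(asdict(hyperparameters))
print("effective batch size:", hyperparameters.effective_batch_size())
```

With these placeholder numbers, a GPU that only fits 4 examples at a time still updates the weights as if the batch held 32.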
201
+
202
+
203
+
204
+ ### Fine Tuning Examples 👷‍♂️👷‍♀️
205
+
206
+ | Dataset | Fine Tuning Example |
+ | ------------------------------------------------ | ------------------------------------------------------------ |
+ | Fine Tune on Mozilla Turkish Dataset | <a href="https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb"><img src="https://img.shields.io/static/v1?label=Colab&message=Fine-tuning Example&logo=Google%20Colab&color=f9ab00"></a> |
+ | Sample Code for Other Datasets and Other Languages | [github_link](https://github.com/m3hrdadfi/notebooks/) |
210
+
211
+
212
+ ## Contact us: 🤝
213
+ If you have a technical question regarding the model, pretraining, code or publication, please create an issue in the repository. This is the fastest way to reach us.
214
+
215
+ ## Citation: ↩️
216
+ We have not published a paper on this work. However, if you publish work that uses it, please cite us properly with an entry like the one below.
217
+ ```bibtex
+ @misc{wav2vec2-xlsr-multilingual-53-fa,
+   author = {Paparnchi, Seyyed Mohammad Masoud},
+   title = {wav2vec2-xlsr-multilingual-53-fa},
+   year = 2021,
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/Hamtech-ai/wav2vec2-fa}},
+ }
+ ```