---
language: ml
datasets:
- [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus)
- [Openslr Malayalam Speech Corpus](http://openslr.org/63/)
- [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/)
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Malayalam XLSR Wav2Vec2 Large 53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Test split of combined dataset using all datasets mentioned above
      type: custom
      args: ml
    metrics:
    - name: Test WER
      type: wer
      value: 39.46
---

# Wav2Vec2-Large-XLSR-53-ml

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Malayalam using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), the [Openslr Malayalam Speech Corpus](http://openslr.org/63/), and the [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/).
When using this model, make sure that your speech input is sampled at 16 kHz.
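
If your recordings use a different sampling rate, resample them first. A minimal sketch with torchaudio (the file path is hypothetical):

```python
import torchaudio

# Hypothetical input file; replace with your own recording.
waveform, sample_rate = torchaudio.load("example.wav")
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)
```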

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = ...  # TODO: load the test split of the combined dataset

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
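
The test-split placeholder above (and in the evaluation snippet below) depends on how you prepared the combined dataset. A minimal sketch, assuming the prepared split was saved locally with `Dataset.save_to_disk` (the path is hypothetical):

```python
from datasets import load_from_disk

# Hypothetical location of the saved test split of the combined dataset.
test_dataset = load_from_disk("./malayalam_combined_dataset/test")
```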

## Evaluation

The model can be evaluated as follows on the test split of the combined custom dataset. For more details on dataset preparation, check the notebooks mentioned at the end of this file.

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = ...  # TODO: load the test split of the combined dataset

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model.to("cuda")

# Punctuation, the Unicode replacement character and a few stray Latin
# letters (U, t, r, n, l, e) are removed from the reference transcripts.
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�Utrnle\_]'
# Zero-width joiner/non-joiner and the left-to-right mark are also stripped.
unicode_ignore_regex = r'[\u200d\u200c\u200e]'

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to normalize the transcripts and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
    batch["sentence"] = re.sub(unicode_ignore_regex, '', batch["sentence"])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted token ids.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 39.46 %
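
For intuition about the reported number: the `wer` metric counts word-level substitutions, insertions and deletions against the reference. A toy example with made-up strings:

```python
from datasets import load_metric

wer = load_metric("wer")
# One substituted word out of four reference words → WER = 0.25.
print(wer.compute(predictions=["the cat sat down"], references=["the cat sat up"]))
```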

## Training

A combined dataset was created using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), the [Openslr Malayalam Speech Corpus](http://openslr.org/63/), and the [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/). The datasets were downloaded and converted to HF Dataset format using [this notebook](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/make_hf_dataset.ipynb).
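
That notebook handles the real corpus layouts; the sketch below only illustrates the target format, with hypothetical file paths and the `path`/`sentence` columns used by the snippets above:

```python
from datasets import Dataset

# Hypothetical records; the preparation notebook builds these from the
# three corpora listed above.
records = {
    "path": ["clips/utt_0001.wav", "clips/utt_0002.wav"],
    "sentence": ["<transcript 1>", "<transcript 2>"],
}

dataset = Dataset.from_dict(records)
# Assumed split parameters; the actual split behind the 39.46 % WER may differ.
splits = dataset.train_test_split(test_size=0.1, seed=42)
splits["test"].save_to_disk("./malayalam_combined_dataset/test")
```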

The notebook used for training and evaluation can be found [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/fine-tune-xlsr-wav2vec2-on-malayalam-asr-with-transformers.ipynb).