sid330 committed
Commit
4ffbe9f
1 Parent(s): 7385727

Deleting Readme due to dataset not being allowed during YAML metadata verification

Files changed (1)
  1. README.md +0 -182
README.md DELETED
@@ -1,182 +0,0 @@
---
language: ml
datasets:
- Indic TTS Malayalam Speech Corpus
- Openslr Malayalam Speech Corpus
- SMC Malayalam Speech Corpus
- IIIT-H Indic Speech Databases
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Malayalam XLSR Wav2Vec2 Large 53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Test split of combined dataset using all datasets mentioned above
      type: custom
      args: ml
    metrics:
    - name: Test WER
      type: wer
      value: 28.43
---

# Wav2Vec2-Large-XLSR-53-ml

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on ml (Malayalam) using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html). The notebooks used to train the model are available [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/). When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = <load-test-split-of-combined-dataset> # Details on loading this dataset in the evaluation section

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")

# The source audio is 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
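
The snippet above hardcodes a 48 kHz to 16 kHz resampler, matching the corpora used here. For audio at other sample rates, a small helper along these lines can normalize any input before it reaches the processor (a sketch; `to_16k` is an illustrative name, not part of this model card):

```python
import torchaudio

def to_16k(path):
    # Load an audio file and resample to the 16 kHz the model expects
    speech_array, sampling_rate = torchaudio.load(path)
    if sampling_rate != 16_000:
        speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)
    return speech_array.squeeze().numpy()
```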


## Evaluation

The model can be evaluated as follows on the test split of the combined custom dataset. For more details on dataset preparation, check the notebooks mentioned at the end of this file.

```python
import re
from pathlib import Path

import torch
import torchaudio
import datasets
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The custom dataset needs to be created using the notebook mentioned at the end of this file
data_dir = Path('<path-to-custom-dataset>')

dataset_folders = {
    'iiit': 'iiit_mal_abi',
    'openslr': 'openslr',
    'indic-tts': 'indic-tts-ml',
    'msc-reviewed': 'msc-reviewed-speech-v1.0+20200825',
}

# Set directories for datasets
openslr_male_dir = data_dir / dataset_folders['openslr'] / 'male'
openslr_female_dir = data_dir / dataset_folders['openslr'] / 'female'
iiit_dir = data_dir / dataset_folders['iiit']
indic_tts_male_dir = data_dir / dataset_folders['indic-tts'] / 'male'
indic_tts_female_dir = data_dir / dataset_folders['indic-tts'] / 'female'
msc_reviewed_dir = data_dir / dataset_folders['msc-reviewed']

# Load the datasets (one JSON file per sample)
openslr_male = load_dataset("json", data_files=[f"{str(openslr_male_dir.absolute())}/sample_{i}.json" for i in range(2023)], split="train")
openslr_female = load_dataset("json", data_files=[f"{str(openslr_female_dir.absolute())}/sample_{i}.json" for i in range(2103)], split="train")
iiit = load_dataset("json", data_files=[f"{str(iiit_dir.absolute())}/sample_{i}.json" for i in range(1000)], split="train")
indic_tts_male = load_dataset("json", data_files=[f"{str(indic_tts_male_dir.absolute())}/sample_{i}.json" for i in range(5649)], split="train")
indic_tts_female = load_dataset("json", data_files=[f"{str(indic_tts_female_dir.absolute())}/sample_{i}.json" for i in range(2950)], split="train")
msc_reviewed = load_dataset("json", data_files=[f"{str(msc_reviewed_dir.absolute())}/sample_{i}.json" for i in range(1541)], split="train")

# Create test split as 20%, with a fixed random seed for reproducibility
test_size = 0.2
random_seed = 1
openslr_male_splits = openslr_male.train_test_split(test_size=test_size, seed=random_seed)
openslr_female_splits = openslr_female.train_test_split(test_size=test_size, seed=random_seed)
iiit_splits = iiit.train_test_split(test_size=test_size, seed=random_seed)
indic_tts_male_splits = indic_tts_male.train_test_split(test_size=test_size, seed=random_seed)
indic_tts_female_splits = indic_tts_female.train_test_split(test_size=test_size, seed=random_seed)
msc_reviewed_splits = msc_reviewed.train_test_split(test_size=test_size, seed=random_seed)

# Get combined test dataset
split_list = [openslr_male_splits, openslr_female_splits, indic_tts_male_splits, indic_tts_female_splits, msc_reviewed_splits, iiit_splits]
test_dataset = datasets.concatenate_datasets([split['test'] for split in split_list])

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model.to("cuda")

# All non-16 kHz audio in the combined dataset is 48 kHz
resamplers = {
    48000: torchaudio.transforms.Resample(48_000, 16_000),
}

# Punctuation and stray Latin characters to strip from transcripts
chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\"\\“\\%\\‘\\”\\�Utrnle\\_]'
# Invisible left-to-right mark
unicode_ignore_regex = r'[\u200e]'

# Preprocessing the datasets.
# We need to clean the transcripts and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
    batch["sentence"] = re.sub(unicode_ignore_regex, '', batch["sentence"])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    # Resample if it's not in 16 kHz
    if sampling_rate != 16000:
        batch["speech"] = resamplers[sampling_rate](speech_array).squeeze().numpy()
    else:
        batch["speech"] = speech_array.squeeze().numpy()
    # If more than one dimension is present, pick first one
    if batch["speech"].ndim > 1:
        batch["speech"] = batch["speech"][0]
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on GPU and decode the predicted token ids
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result (WER)**: 28.43 %
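
For context, WER is the word-level edit distance divided by the number of reference words, computed with the same `load_metric("wer")` API used above. A toy sanity check (the example strings are made up):

```python
from datasets import load_metric

wer = load_metric("wer")
# One substituted word out of four reference words -> WER = 0.25
print(wer.compute(predictions=["the cat sat down"], references=["the cat sat up"]))
```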

## Training

A combined dataset was created using [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html). The datasets were downloaded and converted to HF Dataset format using [this notebook](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/make_hf_dataset.ipynb).
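
Judging from the evaluation script above, each `sample_{i}.json` holds at least the audio file `path` and its transcript under `sentence`. A minimal sketch of writing one such record (the field names are inferred from the evaluation script; the notebook's actual schema may contain more fields):

```python
import json
from pathlib import Path

# Hypothetical record; "path" and "sentence" are the fields the
# evaluation script reads, everything else here is illustrative.
sample = {
    "path": "/data/openslr/male/audio_0.wav",
    "sentence": "<malayalam-transcript>",
}
Path("sample_0.json").write_text(json.dumps(sample, ensure_ascii=False))
```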

The notebook used for training and evaluation can be found [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/fine-tune-xlsr-wav2vec2-on-malayalam-asr-with-transformers_v2.ipynb).
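
For orientation, the common XLSR fine-tuning week recipe loads the base checkpoint with a CTC head sized to the target-language vocabulary and freezes the convolutional feature extractor. The sketch below follows that general recipe; the hyperparameters are illustrative placeholders, not the values used for this model (see the notebook for the actual configuration):

```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, TrainingArguments

# Reuse the published processor, which carries the Malayalam character vocabulary
processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_extractor()  # keep the pretrained CNN front-end fixed

# Illustrative arguments only; the linked notebook has the real ones
training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-malayalam",
    per_device_train_batch_size=16,
    num_train_epochs=30,
    fp16=True,
    learning_rate=3e-4,
    warmup_steps=500,
)
```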