---
language: ml
datasets:
- Indic TTS Malayalam Speech Corpus
- Openslr Malayalam Speech Corpus
- SMC Malayalam Speech Corpus
- IIIT-H Indic Speech Databases
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Malayalam XLSR Wav2Vec2 Large 53
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Test split of combined dataset using all datasets mentioned above
      type: custom
      args: ml
    metrics:
      - name: Test WER
        type: wer
        value: 28.43
---
# Wav2Vec2-Large-XLSR-53-ml

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on ml (Malayalam) using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html). The notebooks used to train the model are available [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/). When using this model, make sure that your speech input is sampled at 16 kHz.
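A minimal sketch of preparing such an input with torchaudio (the file name here is hypothetical):

```python
import torchaudio

# Hypothetical local recording; replace with your own file
speech_array, sampling_rate = torchaudio.load("malayalam_clip.wav")

# The model expects 16 kHz input, so resample when necessary
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)

speech = speech_array.squeeze().numpy()
if speech.ndim > 1:  # stereo file: keep the first channel
    speech = speech[0]
```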
## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = <load-test-split-of-combined-dataset> # Details on loading this dataset in the evaluation section

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")

# Resample 48 kHz audio to the 16 kHz expected by the model
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
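For quick transcription of a single recording, the same checkpoint can also be wrapped in the high-level `pipeline` API. This is not part of the original notebooks and assumes a transformers version that ships the `automatic-speech-recognition` pipeline (plus ffmpeg for decoding audio files):

```python
from transformers import pipeline

# Wrap the fine-tuned checkpoint in an ASR pipeline
asr = pipeline("automatic-speech-recognition", model="gvs/wav2vec2-large-xlsr-malayalam")

# Hypothetical local file; the pipeline handles decoding and resampling to 16 kHz
print(asr("malayalam_clip.wav")["text"])
```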
## Evaluation

The model can be evaluated as follows on the test split of the combined custom dataset. For more details on dataset preparation, check the notebooks mentioned at the end of this file.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric, concatenate_datasets
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
from pathlib import Path

# The custom dataset needs to be created using the notebook mentioned at the end of this file
data_dir = Path('<path-to-custom-dataset>')

dataset_folders = {
    'iiit': 'iiit_mal_abi',
    'openslr': 'openslr',
    'indic-tts': 'indic-tts-ml',
    'msc-reviewed': 'msc-reviewed-speech-v1.0+20200825',
}

# Set directories for datasets
openslr_male_dir = data_dir / dataset_folders['openslr'] / 'male'
openslr_female_dir = data_dir / dataset_folders['openslr'] / 'female'
iiit_dir = data_dir / dataset_folders['iiit']
indic_tts_male_dir = data_dir / dataset_folders['indic-tts'] / 'male'
indic_tts_female_dir = data_dir / dataset_folders['indic-tts'] / 'female'
msc_reviewed_dir = data_dir / dataset_folders['msc-reviewed']

# Load the datasets
openslr_male = load_dataset("json", data_files=[f"{str(openslr_male_dir.absolute())}/sample_{i}.json" for i in range(2023)], split="train")
openslr_female = load_dataset("json", data_files=[f"{str(openslr_female_dir.absolute())}/sample_{i}.json" for i in range(2103)], split="train")
iiit = load_dataset("json", data_files=[f"{str(iiit_dir.absolute())}/sample_{i}.json" for i in range(1000)], split="train")
indic_tts_male = load_dataset("json", data_files=[f"{str(indic_tts_male_dir.absolute())}/sample_{i}.json" for i in range(5649)], split="train")
indic_tts_female = load_dataset("json", data_files=[f"{str(indic_tts_female_dir.absolute())}/sample_{i}.json" for i in range(2950)], split="train")
msc_reviewed = load_dataset("json", data_files=[f"{str(msc_reviewed_dir.absolute())}/sample_{i}.json" for i in range(1541)], split="train")

# Create a 20% test split for each dataset, with a fixed random seed
test_size = 0.2
random_seed = 1
openslr_male_splits = openslr_male.train_test_split(test_size=test_size, seed=random_seed)
openslr_female_splits = openslr_female.train_test_split(test_size=test_size, seed=random_seed)
iiit_splits = iiit.train_test_split(test_size=test_size, seed=random_seed)
indic_tts_male_splits = indic_tts_male.train_test_split(test_size=test_size, seed=random_seed)
indic_tts_female_splits = indic_tts_female.train_test_split(test_size=test_size, seed=random_seed)
msc_reviewed_splits = msc_reviewed.train_test_split(test_size=test_size, seed=random_seed)

# Get combined test dataset
split_list = [openslr_male_splits, openslr_female_splits, indic_tts_male_splits, indic_tts_female_splits, msc_reviewed_splits, iiit_splits]
test_dataset = concatenate_datasets([split['test'] for split in split_list])

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
model.to("cuda")

# Resamplers keyed by source sampling rate; only 48 kHz audio needs resampling here
resamplers = {
    48000: torchaudio.transforms.Resample(48_000, 16_000),
}

# Punctuation and stray Latin characters to strip from the reference transcriptions
chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�Utrnle\_]'
unicode_ignore_regex = r'[\u200e]'  # left-to-right mark

# Preprocessing the datasets.
# We need to clean up the reference text and read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
    batch["sentence"] = re.sub(unicode_ignore_regex, '', batch["sentence"])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    # Resample if it is not already at 16 kHz
    if sampling_rate != 16000:
        batch["speech"] = resamplers[sampling_rate](speech_array).squeeze().numpy()
    else:
        batch["speech"] = speech_array.squeeze().numpy()
    # If more than one dimension is present, pick the first channel
    if batch["speech"].ndim > 1:
        batch["speech"] = batch["speech"][0]
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the test set and collect the predicted strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
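As a quick illustration of what the reported metric measures, the same `load_metric("wer")` scorer can be run on made-up strings (purely illustrative; it requires the `jiwer` package):

```python
from datasets import load_metric

wer = load_metric("wer")

# One substituted word out of four reference words gives a WER of 0.25 (25%)
print(wer.compute(predictions=["this is the test"], references=["this is a test"]))
```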
**Test Result (WER)**: 28.43 %  
## Training
A combined dataset was created using [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html). The datasets were downloaded and converted to the HF Dataset format using [this notebook](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/make_hf_dataset.ipynb).
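Each converted sample ends up as its own JSON file (`sample_0.json`, `sample_1.json`, ...) holding at least the audio `path` and the reference `sentence`, which is what the evaluation code above reads; a minimal, hypothetical sketch of writing one such file:

```python
import json
from pathlib import Path

# Hypothetical sample; real paths and transcriptions come from the corpora above
sample = {
    "path": "/data/custom-dataset/openslr/male/audio_0001.wav",
    "sentence": "<Malayalam transcription of the clip>",
}

out_dir = Path("<path-to-custom-dataset>") / "openslr" / "male"
out_dir.mkdir(parents=True, exist_ok=True)
with open(out_dir / "sample_0.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)
```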
The notebook used for training and evaluation can be found [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/fine-tune-xlsr-wav2vec2-on-malayalam-asr-with-transformers_v2.ipynb).