gvs committed on
Commit
99255d8
1 Parent(s): 99c1a2d

Update README.md

Files changed (1)
  1. README.md +33 -14
README.md CHANGED
@@ -4,6 +4,7 @@ datasets:
 - Indic TTS Malayalam Speech Corpus
 - Openslr Malayalam Speech Corpus
 - SMC Malayalam Speech Corpus
+- IIIT-H Indic Speech Databases
 metrics:
 - wer
 tags:
@@ -25,12 +26,12 @@ model-index:
 metrics:
 - name: Test WER
 type: wer
-value: 39.46
+value: 28.43
 ---
 
 # Wav2Vec2-Large-XLSR-53-ml
 
-Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on ml using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/).
+Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on ml using the [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html).
 When using this model, make sure that your speech input is sampled at 16kHz.
 
 ## Usage
@@ -84,35 +85,44 @@ import re
 from datasets import load_dataset, load_metric
 from pathlib import Path
 
+# The custom dataset needs to be created using the notebook mentioned at the end of this file
 data_dir = Path('<path-to-custom-dataset>')
 
 dataset_folders = {
+    'iiit': 'iiit_mal_abi',
     'openslr': 'openslr',
     'indic-tts': 'indic-tts-ml',
+    'msc-reviewed': 'msc-reviewed-speech-v1.0+20200825',
 }
 
 # Set directories for datasets
 openslr_male_dir = data_dir / dataset_folders['openslr'] / 'male'
 openslr_female_dir = data_dir / dataset_folders['openslr'] / 'female'
+iiit_dir = data_dir / dataset_folders['iiit']
 indic_tts_male_dir = data_dir / dataset_folders['indic-tts'] / 'male'
 indic_tts_female_dir = data_dir / dataset_folders['indic-tts'] / 'female'
+msc_reviewed_dir = data_dir / dataset_folders['msc-reviewed']
 
-# Load the datasets, total count is set manually
+# Load the datasets
 openslr_male = load_dataset("json", data_files=[f"{str(openslr_male_dir.absolute())}/sample_{i}.json" for i in range(2023)], split="train")
 openslr_female = load_dataset("json", data_files=[f"{str(openslr_female_dir.absolute())}/sample_{i}.json" for i in range(2103)], split="train")
+iiit = load_dataset("json", data_files=[f"{str(iiit_dir.absolute())}/sample_{i}.json" for i in range(1000)], split="train")
 indic_tts_male = load_dataset("json", data_files=[f"{str(indic_tts_male_dir.absolute())}/sample_{i}.json" for i in range(5649)], split="train")
 indic_tts_female = load_dataset("json", data_files=[f"{str(indic_tts_female_dir.absolute())}/sample_{i}.json" for i in range(2950)], split="train")
+msc_reviewed = load_dataset("json", data_files=[f"{str(msc_reviewed_dir.absolute())}/sample_{i}.json" for i in range(1541)], split="train")
 
 # Create test split as 20%, set random seed as well.
 test_size = 0.2
 random_seed = 1
 openslr_male_splits = openslr_male.train_test_split(test_size=test_size, seed=random_seed)
 openslr_female_splits = openslr_female.train_test_split(test_size=test_size, seed=random_seed)
+iiit_splits = iiit.train_test_split(test_size=test_size, seed=random_seed)
 indic_tts_male_splits = indic_tts_male.train_test_split(test_size=test_size, seed=random_seed)
 indic_tts_female_splits = indic_tts_female.train_test_split(test_size=test_size, seed=random_seed)
+msc_reviewed_splits = msc_reviewed.train_test_split(test_size=test_size, seed=random_seed)
 
 # Get combined test dataset
-split_list = [openslr_male_splits, openslr_female_splits, indic_tts_male_splits, indic_tts_female_splits]
+split_list = [openslr_male_splits, openslr_female_splits, indic_tts_male_splits, indic_tts_female_splits, msc_reviewed_splits, iiit_splits]
 test_dataset = datasets.concatenate_datasets([split['test'] for split in split_list])
 
 wer = load_metric("wer")
@@ -121,19 +131,28 @@ processor = Wav2Vec2Processor.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam
 model = Wav2Vec2ForCTC.from_pretrained("gvs/wav2vec2-large-xlsr-malayalam")
 model.to("cuda")
 
-chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�Utrnle\_]'
-unicode_ignore_regex = r'[\u200c\u200d\u200e]'
-
-resampler = torchaudio.transforms.Resample(48_000, 16_000)
+resamplers = {
+    48000: torchaudio.transforms.Resample(48_000, 16_000),
+}
+
+chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�Utrnle\_]'
+unicode_ignore_regex = r'[\u200e]'
 
 # Preprocessing the datasets.
 # We need to read the audio files as arrays
 def speech_file_to_array_fn(batch):
-    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
-    batch["sentence"] = re.sub(unicode_ignore_regex, '', batch["sentence"])
-    speech_array, sampling_rate = torchaudio.load(batch["path"])
-    batch["speech"] = resampler(speech_array).squeeze().numpy()
-    return batch
+    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"])
+    batch["sentence"] = re.sub(unicode_ignore_regex, '', batch["sentence"])
+    speech_array, sampling_rate = torchaudio.load(batch["path"])
+    # Resample if it's not in 16kHz
+    if sampling_rate != 16000:
+        batch["speech"] = resamplers[sampling_rate](speech_array).squeeze().numpy()
+    else:
+        batch["speech"] = speech_array.squeeze().numpy()
+    # If more than one dimension is present, pick first one
+    if batch["speech"].ndim > 1:
+        batch["speech"] = batch["speech"][0]
+    return batch
 
 test_dataset = test_dataset.map(speech_file_to_array_fn)
 
@@ -154,11 +173,11 @@ result = test_dataset.map(evaluate, batched=True, batch_size=8)
 print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
 ```
 
-**Test Result**: 39.46 %
+**Test Result (WER)**: 28.43 %
 
 
 ## Training
 
-A combined dataset was created using [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/). The datasets were downloaded and was converted to HF Dataset format using [this notebook](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/make_hf_dataset.ipynb)
+A combined dataset was created using [Indic TTS Malayalam Speech Corpus (via Kaggle)](https://www.kaggle.com/kavyamanohar/indic-tts-malayalam-speech-corpus), [Openslr Malayalam Speech Corpus](http://openslr.org/63/), [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) and [IIIT-H Indic Speech Databases](http://speech.iiit.ac.in/index.php/research-svl/69.html). The datasets were downloaded and converted to the HF Dataset format using [this notebook](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/make_hf_dataset.ipynb).
 
 The notebook used for training and evaluation can be found [here](https://github.com/gauthamsuresh09/wav2vec2-large-xlsr-53-malayalam/blob/main/fine-tune-xlsr-wav2vec2-on-malayalam-asr-with-transformers.ipynb).
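The load-bearing pattern in the evaluation code above is a seeded 20% test split per corpus, concatenated into one combined test set. A pure-Python sketch of the same idea — this is not Hugging Face's exact shuffling algorithm, and only the OpenSLR sample counts are taken from the snippet; everything else is illustrative:

```python
import random

def train_test_split(samples, test_size=0.2, seed=1):
    # Deterministically shuffle a copy, then carve off the first
    # test_size fraction as the test split (the rest is train).
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(samples) * test_size))
    return {"test": shuffled[:n_test], "train": shuffled[n_test:]}

# Corpus sizes as in the snippet (OpenSLR male/female sample counts)
corpora = {"openslr_male": list(range(2023)), "openslr_female": list(range(2103))}
splits = [train_test_split(ids) for ids in corpora.values()]

# Combined test set, analogous to
# datasets.concatenate_datasets([s['test'] for s in split_list])
combined_test = [x for s in splits for x in s["test"]]
```

Seeding each split means the combined test set is reproducible across runs, which matters when the reported WER is tied to a specific evaluation subset.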
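The commit replaces a single hard-coded 48 kHz resampler with a dict keyed by source sampling rate, so only inputs whose rate differs from 16 kHz get resampled. To illustrate the control flow, here is a naive linear-interpolation resampler standing in for `torchaudio.transforms.Resample` (which additionally applies proper low-pass filtering); the function names are hypothetical:

```python
def naive_resample(samples, orig_rate, target_rate):
    # Linear-interpolation resampler -- an illustrative stand-in for
    # torchaudio.transforms.Resample(orig_rate, target_rate).
    if orig_rate == target_rate:
        return list(samples)
    ratio = orig_rate / target_rate
    n_out = int(len(samples) * target_rate // orig_rate)
    out = []
    for i in range(n_out):
        pos = i * ratio
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# Same shape as the snippet: one resampler per known source rate
resamplers = {48000: lambda x: naive_resample(x, 48000, 16000)}

def to_16k(samples, rate):
    # Resample only when the source isn't already 16 kHz
    return resamplers[rate](samples) if rate != 16000 else samples
```

The dict shape makes it easy to register more source rates as corpora with different recording setups are added.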
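The transcript cleanup removes punctuation and invisible control characters (zero-width joiner/non-joiner and the left-to-right mark, which commonly appear in Malayalam text). A self-contained version of that step, with the character classes rewritten as raw strings; note the stray Latin letters `Utrnle` are in the original class and will also delete those letters wherever they occur:

```python
import re

# Punctuation and stray characters removed from transcripts (as in the snippet)
chars_to_ignore_regex = r'[,?.!\-;:"“%‘”�Utrnle_]'
# Zero-width non-joiner/joiner and left-to-right mark
unicode_ignore_regex = r'[\u200c\u200d\u200e]'

def clean_sentence(sentence):
    sentence = re.sub(chars_to_ignore_regex, '', sentence)
    return re.sub(unicode_ignore_regex, '', sentence)
```

Stripping zero-width characters keeps visually identical reference and predicted strings from being scored as different words by the WER metric.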
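The reported metric comes from `load_metric("wer")`: word-level Levenshtein distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal reference implementation for intuition — a sketch, not the library's code:

```python
def wer(reference, hypothesis):
    # Word error rate: edit distance over words / number of reference words
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)
```

Multiplying by 100, as the evaluation snippet does, gives the percentage form in which the 28.43 % result is reported.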
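The custom dataset layout assumed by the snippet is one JSON file per utterance (`sample_0.json`, `sample_1.json`, ...), each holding at least the `path` and `sentence` fields used during preprocessing. A stdlib-only sketch of reading such a folder, standing in for `load_dataset("json", data_files=[...], split="train")`; the helper name and demo records are hypothetical:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_json_samples(folder, count):
    # Read sample_0.json .. sample_{count-1}.json into a list of dicts
    return [json.loads((Path(folder) / f"sample_{i}.json").read_text(encoding="utf-8"))
            for i in range(count)]

# Tiny demo against a synthetic folder
with TemporaryDirectory() as d:
    for i in range(3):
        record = {"path": f"clip_{i}.wav", "sentence": "നമസ്കാരം"}
        (Path(d) / f"sample_{i}.json").write_text(json.dumps(record, ensure_ascii=False),
                                                  encoding="utf-8")
    rows = load_json_samples(d, 3)
```

The explicit `range(count)` in the snippet's `data_files` list is why the per-corpus sample counts (2023, 2103, ...) appear verbatim in the evaluation code.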