Tejas2000 committed
Commit 7aeb50b
1 Parent(s): 6cb07e8

Update README.md

Files changed (1): README.md (+81 -1)
README.md CHANGED
@@ -26,4 +26,84 @@ model-index:
  - name: Test WER
    type: wer
    value: 12.7
- ---
+ ---
+
+ # Wav2Vec2-Large-XLSR-53-Marathi
+ Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16 kHz. The dataset contains only female voices, but the model works well for male voices too. Trained on Google Colab Pro on a Tesla P100 16 GB GPU.<br>
+ **WER (Word Error Rate) on the Test Set**: 12.70 %
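+
+ If your audio is not already at 16 kHz, here is a minimal sketch of the resampling step (the file path below is a placeholder, not part of this repository):
+ ```python
+ import torchaudio
+
+ # Load a recording and resample it to the 16 kHz the model expects.
+ speech_array, sampling_rate = torchaudio.load("my_recording.wav")  # placeholder path
+ if sampling_rate != 16_000:
+     speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)
+ ```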
+ ## Usage
+ The model can be used directly (without a language model) as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
+ ```python
+ import torch, torchaudio
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Since Marathi is not present in Common Voice, the script for reading this
+ # dataset can be picked up from the evaluation script below
+ mr_test_dataset = all_data['test']
+
+ processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
+ model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
+
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)  # first arg: input sampling rate, second arg: output sampling rate
+ # Preprocessing the datasets. We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+ mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
+ inputs = processor(mr_test_dataset["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+ predicted_ids = torch.argmax(logits, dim=-1)
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ print("Reference:", mr_test_dataset["actual_text"][:5])
+ ```
+ ## Evaluation
+ Evaluated on 10% of the Marathi data of OpenSLR SLR64.
+ ```python
+ import os, re, torch, torchaudio
+ from datasets import Dataset, load_metric
+ import pandas as pd
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ # Below is a custom script for reading the Marathi dataset, since it is not present in Common Voice
+ dataset_path = "./OpenSLR-64_Marathi/mr_in_female/"  # TODO: include the path of the dataset extracted from http://openslr.org/64/
+ audio_df = pd.read_csv(os.path.join(dataset_path, 'line_index.tsv'), sep='\t', header=None)
+ audio_df.columns = ['path_in_folder', 'actual_text']
+ audio_df['path_in_folder'] = audio_df['path_in_folder'].apply(lambda x: dataset_path + x + '.wav')
+ audio_df = audio_df.sample(frac=1, random_state=2020).reset_index(drop=True)  # the seed is important for reproducibility of the WER score
+ all_data = Dataset.from_pandas(audio_df)
+ all_data = all_data.train_test_split(test_size=0.10, seed=2020)  # the seed is important for reproducibility of the WER score
+
+ mr_test_dataset = all_data['test']
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
+ model = Wav2Vec2ForCTC.from_pretrained("sumedh/wav2vec2-large-xlsr-marathi")
+ model.to("cuda")
+
+ chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+ # Preprocessing the datasets. We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
+     speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+ mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+ result = mr_test_dataset.map(evaluate, batched=True, batch_size=8)
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
+ ```
+
+ ## Training
+ The train-test split ratio was 90:10.
+ The training notebook is available on Colab [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).
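+
+ As a rough guide, here is a minimal sketch of a typical CTC fine-tuning setup over that 90:10 split, reusing `all_data` and `processor` from the evaluation script above. The data collator and every hyperparameter below are illustrative assumptions, not the values used for this model; see the linked notebook and the run summary below for the actual configuration.
+ ```python
+ from transformers import Wav2Vec2ForCTC, TrainingArguments, Trainer
+
+ # Assumption: all_data has been preprocessed so each split has
+ # input_values (audio) and labels (tokenized text) columns.
+ model = Wav2Vec2ForCTC.from_pretrained(
+     "facebook/wav2vec2-large-xlsr-53",
+     ctc_loss_reduction="mean",
+     pad_token_id=processor.tokenizer.pad_token_id,
+     vocab_size=len(processor.tokenizer),
+ )
+ model.freeze_feature_extractor()  # the CNN feature encoder is usually kept frozen for XLSR fine-tuning
+
+ training_args = TrainingArguments(
+     output_dir="./wav2vec2-large-xlsr-marathi",
+     per_device_train_batch_size=16,  # assumption: sized for a 16 GB P100
+     num_train_epochs=30,             # assumption
+     learning_rate=3e-4,              # assumption
+     fp16=True,
+     evaluation_strategy="steps",
+     save_total_limit=2,
+ )
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     data_collator=data_collator,  # assumed: a collator that pads input_values and labels
+     train_dataset=all_data["train"],
+     eval_dataset=all_data["test"],
+     tokenizer=processor.feature_extractor,
+ )
+ trainer.train()
+ ```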
+
+ ## Training Config and Summary
+ The Weights & Biases run summary is available [here](https://wandb.ai/wandb/xlsr/runs/3itdhtb8/overview?workspace=user-sumedhkhodke).