---
license: apache-2.0
datasets:
- openslr
language:
- mr
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- speech_to_text
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
model-index:
- name: XLSR Wav2Vec2 Large 53 Marathi by Sumedh Khodke
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR mr
      type: openslr
    metrics:
    - name: Test WER
      type: wer
      value: 12.7
---

# Wav2Vec2-Large-XLSR-53-Marathi
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. When using this model, make sure that your speech input is sampled at 16 kHz. The dataset contains only female voices, but the model also works well for male voices. Trained on Google Colab Pro with a Tesla P100 16 GB GPU.

**WER (Word Error Rate) on the Test Set**: 12.70 %

## Usage
The model can be used directly, without a language model, as follows, given that your dataset has Marathi `actual_text` and `path_in_folder` columns:
```python
import torch, torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Since Marathi is not present in Common Voice, the script for building the
# `all_data` dataset below can be picked up from the evaluation script further down
mr_test_dataset = all_data['test']

processor = Wav2Vec2Processor.from_pretrained("Tejas2000/SpeechRecog")
model = Wav2Vec2ForCTC.from_pretrained("Tejas2000/SpeechRecog")

resampler = torchaudio.transforms.Resample(48_000, 16_000)  # first arg - input sample rate, second arg - output sample rate
# Preprocessing the datasets. We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
inputs = processor(mr_test_dataset["speech"][:5], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", mr_test_dataset["actual_text"][:5])
```
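To transcribe a single local audio file instead of a dataset, a minimal sketch (the `sample.wav` path is a placeholder; the OpenSLR SLR64 recordings are 48 kHz WAVs, and other rates are handled by resampling from the file's reported rate):
```python
import torch, torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("Tejas2000/SpeechRecog")
model = Wav2Vec2ForCTC.from_pretrained("Tejas2000/SpeechRecog")

speech_array, sampling_rate = torchaudio.load("sample.wav")  # hypothetical file path
# Resample from the file's native rate to the 16 kHz the model expects
speech = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
print("Prediction:", processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```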
## Evaluation
Evaluated on 10% of the Marathi data from OpenSLR SLR64.
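For reference, WER is the word-level edit distance between the predictions and the references, normalized by reference length:

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where S, D, and I count substituted, deleted, and inserted words and N is the number of words in the references, so the 12.70 % above corresponds to roughly 13 word errors per 100 reference words.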
```python
import os, re, torch, torchaudio
from datasets import Dataset, load_metric
import pandas as pd
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Below is a custom script for reading the Marathi dataset, since it is not present in Common Voice
dataset_path = "./OpenSLR-64_Marathi/mr_in_female/"  # TODO: include the path of the dataset extracted from http://openslr.org/64/
audio_df = pd.read_csv(os.path.join(dataset_path, 'line_index.tsv'), sep='\t', header=None)
audio_df.columns = ['path_in_folder', 'actual_text']
audio_df['path_in_folder'] = audio_df['path_in_folder'].apply(lambda x: dataset_path + x + '.wav')
audio_df = audio_df.sample(frac=1, random_state=2020).reset_index(drop=True)  # seed is important for reproducibility of the WER score
all_data = Dataset.from_pandas(audio_df)
all_data = all_data.train_test_split(test_size=0.10, seed=2020)  # seed is important for reproducibility of the WER score

mr_test_dataset = all_data['test']
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("Tejas2000/SpeechRecog")
model = Wav2Vec2ForCTC.from_pretrained("Tejas2000/SpeechRecog")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets. We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["actual_text"] = re.sub(chars_to_ignore_regex, '', batch["actual_text"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path_in_folder"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
mr_test_dataset = mr_test_dataset.map(speech_file_to_array_fn)
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch
result = mr_test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["actual_text"])))
```
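Note that `load_metric` has been removed from recent releases of `datasets`; on newer environments, an equivalent sketch uses the standalone `evaluate` package instead (same metric, different import):
```python
import evaluate  # pip install evaluate jiwer

wer = evaluate.load("wer")
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"],
                                             references=result["actual_text"])))
```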

## Training
The train-test split ratio was 90:10.
The training notebook is available on Colab [here](https://colab.research.google.com/drive/1wX46fjExcgU5t3AsWhSPTipWg_aMDg2f?usp=sharing).
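For context, a minimal sketch of a typical XLSR-53 CTC fine-tuning setup; all hyperparameters here are illustrative assumptions, and the actual values are in the W&B run summary linked below:
```python
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, TrainingArguments

# Assumption: the processor (tokenizer + feature extractor) was already built
# from the Marathi transcripts and pushed with the model checkpoint
processor = Wav2Vec2Processor.from_pretrained("Tejas2000/SpeechRecog")

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()  # keep the convolutional feature extractor frozen

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-marathi",  # hypothetical output path
    per_device_train_batch_size=16,              # assumption: what fits a 16 GB P100 with fp16
    gradient_accumulation_steps=2,               # assumption
    num_train_epochs=30,                         # assumption
    learning_rate=3e-4,                          # assumption: a common XLSR fine-tuning rate
    warmup_steps=500,                            # assumption
    fp16=True,
    evaluation_strategy="steps",
    save_total_limit=2,
)
# A Trainer would then be constructed with these args, the 90:10 split from the
# evaluation script above, and a CTC padding data collator.
```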

## Training Config and Summary
The Weights & Biases run summary is available [here](https://wandb.ai/wandb/xlsr/runs/3itdhtb8/overview?workspace=user-sumedhkhodke).