---
language: mr
datasets:
- interspeech_2021_asr
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Marathi 2 by Gunjan Chhablani
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: InterSpeech 2021 ASR mr
      type: interspeech_2021_asr
    metrics:
    - name: Test WER
      type: wer
      value: 14.53
---

# Wav2Vec2-Large-XLSR-53-Marathi

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Marathi using a part of the [InterSpeech 2021 Marathi](https://navana-tech.github.io/IS21SS-indicASRchallenge/data.html) dataset. Please keep the composition of this training data in mind before using the model for your task, although it also works well for male voices. When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows, assuming you have a dataset with Marathi `sentence` and `path` fields:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = #TODO: WRITE YOUR CODE TO LOAD THE TEST DATASET. For a sample, see the Colab link in the Training section.

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")

resampler = torchaudio.transforms.Resample(8_000, 16_000)  # The original data has an 8 kHz sampling rate. Change this to match your input.

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
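
Loading the test set is left as a TODO above. As a purely illustrative sketch (the manifest file name and TSV layout below are assumptions, not part of this model card), a local manifest with `path` and `sentence` columns could be loaded with `datasets`:

```python
from datasets import load_dataset

# Hypothetical manifest: a tab-separated file with one "path<TAB>sentence" row per clip.
test_dataset = load_dataset(
    "csv",
    data_files={"test": "test_manifest.tsv"},  # assumed file name
    delimiter="\t",
)["test"]
```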

## Evaluation

The model can be evaluated as follows on the test set of the Marathi data from the InterSpeech 2021 challenge.

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = #TODO: WRITE YOUR CODE TO LOAD THE TEST DATASET. For a sample, see the Colab link in the Training section.

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr-2")
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�\–\…]'
resampler = torchaudio.transforms.Resample(8_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays and normalize the transcriptions.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"),
                       attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
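
To see where the model struggles, the same `wer` metric can also score individual sentences. A small follow-up sketch, reusing the `wer` and `result` objects from the script above:

```python
# Print the first five predictions next to their references with per-sentence WER.
for pred, ref in zip(result["pred_strings"][:5], result["sentence"][:5]):
    sentence_wer = 100 * wer.compute(predictions=[pred], references=[ref])
    print(f"WER {sentence_wer:6.2f} | prediction: {pred} | reference: {ref}")
```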

**Test Result**: 19.98 % (evaluated on 555 examples from the test set)

**Test Result on 10% of OpenSLR74 data**: 64.64 %

## Training

5,000 examples from the InterSpeech 2021 Marathi dataset were used for training.
The Colab notebook used for training can be found [here](https://colab.research.google.com/drive/1sIwGOLJPQqhKm_wVZDkzRuoJqAEgArFr?usp=sharing).
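
The notebook follows the common XLSR fine-tuning-week recipe. Below is a minimal sketch of the model and `Trainer` setup; the hyperparameters are illustrative assumptions rather than the exact values used, and `processor`, `train_dataset`, `test_dataset`, and the CTC data collator are assumed to be prepared as in the notebook:

```python
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

# Illustrative setup, not the exact configuration from the notebook.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    mask_time_prob=0.05,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_extractor()  # keep the convolutional feature encoder frozen

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-mr-2",
    per_device_train_batch_size=16,  # assumed; adjust to GPU memory
    gradient_accumulation_steps=2,
    num_train_epochs=30,             # assumed
    learning_rate=3e-4,              # assumed
    fp16=True,
    evaluation_strategy="steps",
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # the 5,000 InterSpeech examples, prepared as in the notebook
    eval_dataset=test_dataset,
    tokenizer=processor.feature_extractor,
    # data_collator: a CTC padding collator (e.g. the DataCollatorCTCWithPadding
    # defined in the notebook) is also required here.
)
trainer.train()
```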