---
language: mr
datasets:
- openslr
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Marathi by Gunjan Chhablani
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR mr
      type: openslr
    metrics:
    - name: Test WER
      type: wer
      value: 14.53
---

# Wav2Vec2-Large-XLSR-53-Marathi

This model is [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) fine-tuned on Marathi using the [OpenSLR SLR64](http://openslr.org/64/) dataset. Note that this dataset contains only female voices. Keep this in mind before using the model for your task, although it also works very well on male voices. When using this model, make sure that your speech input is sampled at 16 kHz.
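
As a quick sanity check for the 16 kHz requirement, the following minimal sketch reads the sampling rate from a WAV header using only the standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model's code, and the `wave` module only handles uncompressed PCM WAV files, so other formats would need a different reader.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A file for which `needs_resampling` returns `True` should be resampled to 16 kHz before it is passed to the processor.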

## Usage

The model can be used directly (without a language model) as follows, assuming you have a dataset with Marathi `sentence` and `path` fields:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# TODO: Load your test dataset here. For a sample, see the Colab notebook linked in the Training section.
# test_dataset = ...

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr")

# The original data is sampled at 48 kHz. Change the source rate to match your input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into an array resampled to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token at each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
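
The snippet above expects records with `path` and `sentence` fields. As a rough sketch of how such records could be built, the code below parses a tab-separated index file. The file layout (one `<file id><TAB><sentence>` pair per line, with audio stored as `<file id>.wav` in a single directory) and the helper name `load_records` are assumptions made for illustration; check them against the data you actually downloaded and against the Colab notebook.

```python
import csv
import os


def load_records(index_path, audio_dir):
    """Build a list of {"path", "sentence"} records from a tab-separated index.

    Assumed (hypothetical) layout: each line is "<file id>\t<sentence>" and the
    audio for a line lives at "<audio_dir>/<file id>.wav".
    """
    records = []
    with open(index_path, encoding="utf-8", newline="") as index_file:
        reader = csv.reader(index_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            file_id, sentence = row[0].strip(), row[1].strip()
            records.append({
                "path": os.path.join(audio_dir, file_id + ".wav"),
                "sentence": sentence,
            })
    return records
```

The resulting list can then be wrapped in a `datasets.Dataset` (for example with `Dataset.from_list` in recent versions of `datasets`) so that `.map` works as in the usage code.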

## Evaluation

The model can be evaluated on 10% of the OpenSLR Marathi data as follows:

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# TODO: Load your test dataset here. For a sample, see the Colab notebook linked in the Training section.
# test_dataset = ...

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr")
model = Wav2Vec2ForCTC.from_pretrained("gchhablani/wav2vec2-large-xlsr-mr")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“\%\‘\”\�\–\…]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: strip ignored punctuation from the reference text
# and read each audio file into an array resampled to 16 kHz.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions greedily.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(
            inputs.input_values.to("cuda"),
            attention_mask=inputs.attention_mask.to("cuda"),
        ).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 14.53 %
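
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The function below is a minimal per-sentence sketch for illustration; `word_error_rate` is a hypothetical helper, not the implementation behind `load_metric("wer")`, which may aggregate over the whole test set differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

Under this definition, one substituted word in a four-word reference gives a WER of 0.25, that is, 25%.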

## Training

The model was trained on 90% of the OpenSLR Marathi dataset. The Colab notebook used for training is available [here](https://colab.research.google.com/drive/1_BbLyLqDUsXG3RpSULfLRjC6UY3RjwME?usp=sharing).
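
A 90/10 split like the one described above can be sketched as follows. This is only an illustration: the shuffling, the seed, and the helper name `split_train_test` are assumptions, and the notebook may split the data differently (for instance with `Dataset.train_test_split(test_size=0.1)` from the `datasets` library).

```python
import random


def split_train_test(records, test_fraction=0.1, seed=42):
    """Shuffle records reproducibly and split them into (train, test) lists.

    The seed value is an arbitrary assumption chosen for this sketch.
    """
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]
```

Fixing the seed keeps the test portion stable across runs, so a reported WER always refers to the same held-out examples.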