---
language: or
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: odia XLSR Wav2Vec2 Large 2000
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice or
      type: common_voice
      args: or
    metrics:
      - name: Test WER
        type: wer
        value: 54.6
---

# Wav2Vec2-Large-XLSR-53-or

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Odia using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "or", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("danurahul/wav2vec2-large-xlsr-or")
model = Wav2Vec2ForCTC.from_pretrained("danurahul/wav2vec2-large-xlsr-or")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset: read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
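
The snippet above uses clips from the Common Voice test split, which are 48 kHz. To transcribe your own recording, a minimal sketch is shown below; the file name `my_recording.wav` is only a placeholder, and the audio is resampled to the 16 kHz rate the model expects.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("danurahul/wav2vec2-large-xlsr-or")
model = Wav2Vec2ForCTC.from_pretrained("danurahul/wav2vec2-large-xlsr-or")

# "my_recording.wav" is a hypothetical local file; replace it with your own audio.
speech_array, sampling_rate = torchaudio.load("my_recording.wav")

# Resample whatever rate the file has to the 16 kHz the model was trained on.
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)

# Collapse to a mono 1-D array before feeding the processor.
speech = speech_array.mean(dim=0).numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

print("Prediction:", processor.batch_decode(torch.argmax(logits, dim=-1)))
```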

## Evaluation

The model can be evaluated as follows on the Odia test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "or", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("danurahul/wav2vec2-large-xlsr-or")
model = Wav2Vec2ForCTC.from_pretrained("danurahul/wav2vec2-large-xlsr-or")
model.to("cuda")

chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\"\\“]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset: normalize the transcripts and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 54.6 %

## Training

The Common Voice `train`, `validation`, and `test` splits were used for training as well as for prediction and testing.

The script used for training can be found at [https://github.com/rahul-art/wav2vec2_or](https://github.com/rahul-art/wav2vec2_or).
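
For reference, the splits can be loaded and combined with `datasets` roughly as sketched below. This is only an illustration of the data setup, not the training script itself; combining `train` and `validation` into one training set is an assumption, and the exact fine-tuning configuration lives in the repository linked above.

```python
from datasets import load_dataset

# Load the Odia ("or") Common Voice splits used during fine-tuning.
# Merging train and validation here is an assumption for illustration;
# see the linked repository for the exact setup.
train_dataset = load_dataset("common_voice", "or", split="train+validation")
test_dataset = load_dataset("common_voice", "or", split="test")

print(len(train_dataset), len(test_dataset))
```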