---
language: nl
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: wav2vec2-large-xlsr-53-Dutch by Mehdi Hosseini Moghadam
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice nl
      type: common_voice
      args: nl
    metrics:
    - name: Test WER
      type: wer
      value: 26.494162
---

# wav2vec2-large-xlsr-53-Dutch

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Dutch using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.

When using this model, make sure that your speech input is sampled at 16 kHz.
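
If your own audio is not already at 16 kHz, resample it before passing it to the processor. Below is a minimal sketch using `torchaudio`; the file name is a placeholder for your own recording:

```python
import torchaudio

# Hypothetical local file; replace with your own recording.
speech_array, sampling_rate = torchaudio.load("my_dutch_recording.wav")

# Resample to the 16 kHz rate the model expects, if necessary.
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)
```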

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "nl", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-Dutch")
model = Wav2Vec2ForCTC.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-Dutch")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

## Evaluation

The model can be evaluated as follows on the Dutch test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "nl", split="test")

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-Dutch")
model = Wav2Vec2ForCTC.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-Dutch")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\;\:\"\“]'

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays and normalize the transcriptions.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model over the test set in batches and decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 26.494162 %
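
For reference, WER (word error rate) is the fraction of substituted, inserted, and deleted words relative to the reference transcription. A toy check of the same `wer` metric used above (it relies on `jiwer` being installed):

```python
from datasets import load_metric

wer = load_metric("wer")

# One substituted word out of four -> WER of 0.25.
print(wer.compute(predictions=["de kat zit hier"], references=["de kat zat hier"]))
```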

## Training

The Common Voice `train` and `validation` splits were used for training.
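
A minimal sketch of how those two splits can be loaded together for fine-tuning; the training hyperparameters themselves are not part of this card and are not shown:

```python
from datasets import load_dataset

# Combine the Dutch Common Voice train and validation splits for fine-tuning;
# the test split is kept aside for evaluation only.
train_dataset = load_dataset("common_voice", "nl", split="train+validation")
```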