---
language: de
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: wav2vec2-large-xlsr-53-German by Mehdi Hosseini Moghadam
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice de
      type: common_voice
      args: de
    metrics:
    - name: Test WER
      type: wer
      value: 25.284593
---

# wav2vec2-large-xlsr-53-German

[facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) fine-tuned on German using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.

When using this model, make sure that your speech input is sampled at 16 kHz.
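
A quick way to check the sampling rate before inference is to read it from the file header. The sketch below is only an illustration and assumes uncompressed WAV input read with Python's standard `wave` module; Common Voice clips are MP3 and need an audio loader such as `torchaudio` instead. The helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of this model's code.

```python
import io
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def wav_sample_rate(source):
    """Return the sampling rate (Hz) stored in a WAV header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_SAMPLE_RATE):
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(source) != target_rate


def make_silent_wav(sample_rate, n_frames=160):
    """Build a tiny mono 16-bit WAV in memory (demo helper only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


print(needs_resampling(make_silent_wav(48_000)))  # True
print(needs_resampling(make_silent_wav(16_000)))  # False
```

Files that report any other rate should be resampled to 16 kHz before they are passed to the processor.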

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "de", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-German")
model = Wav2Vec2ForCTC.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-German")

# The resampler assumes 48 kHz input and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset:
# read each audio file and resample it into a 1-D array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Inference only, so no gradients are needed.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
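
The usage snippet decodes greedily: it takes the `argmax` over the logits for every frame and lets `processor.batch_decode` turn the resulting ids into text. The toy sketch below illustrates what that CTC decoding step amounts to (collapse repeated ids, then drop the blank token). The vocabulary, the blank id of 0, and the function names are made up for illustration; the real vocabulary and token ids come from the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "h", "a", "l", "o", "|"]
BLANK_ID = 0
WORD_DELIMITER = "|"  # assumed word separator symbol


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


def ctc_greedy_decode(frame_scores):
    """Greedy CTC decoding of a list of per-frame score vectors."""
    # 1. Pick the best-scoring token for every frame.
    best_ids = [argmax(scores) for scores in frame_scores]
    # 2. Collapse runs of identical consecutive ids.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks, then map ids to characters.
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


def one_hot(token_id):
    """Score vector that puts all mass on one token (demo helper only)."""
    return [1.0 if i == token_id else 0.0 for i in range(len(VOCAB))]


# Frames: h h <pad> a l <pad> l o o
frames = [one_hot(i) for i in [1, 1, 0, 2, 3, 0, 3, 4, 4]]
print(ctc_greedy_decode(frames))  # hallo
```

The blank between the two `l` frames is what keeps the double letter in "hallo" from being collapsed into one.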

## Evaluation

The model can be evaluated as follows on the German test data of Common Voice.

```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "de", split="test[:15%]")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-German")
model = Wav2Vec2ForCTC.from_pretrained("MehdiHosseiniMoghadam/wav2vec2-large-xlsr-53-German")
model.to("cuda")

# Punctuation removed from the reference transcripts before scoring.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'

# The resampler assumes 48 kHz input and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset:
# normalize the reference text and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and greedily decode the predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
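
To see what the reference normalization in the evaluation script does, the sketch below applies the same character class and lowercasing to a couple of made-up sentences. The function name `normalize_transcript` and the example sentences are illustrative assumptions, not part of the original script.

```python
import re

# Same character class as in the evaluation script above.
CHARS_TO_IGNORE = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'


def normalize_transcript(sentence):
    """Delete the ignored punctuation and lowercase the result."""
    return re.sub(CHARS_TO_IGNORE, "", sentence).lower()


print(normalize_transcript("Hallo, wie geht's? Gut!"))  # hallo wie geht's gut
print(normalize_transcript("Nord-Süd"))                 # nordsüd
```

Note that hyphens are deleted rather than replaced by a space, so hyphenated compounds become a single word in the reference, and the ASCII apostrophe is left untouched.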

**Test Result**: 25.284593 %
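
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The following is a minimal standalone sketch of that definition, not the implementation behind the `wer` metric used above; the function names and example sentences are made up for illustration.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing in hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words, hyp_words = reference.split(), prediction.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


references = ["das ist ein test", "guten morgen"]
predictions = ["das ist test", "guten morgen"]
print("WER: {:.2f} %".format(100 * word_error_rate(references, predictions)))  # WER: 16.67 %
```

Here one word out of six reference words is missing from the predictions, giving 1/6, or about 16.67 %.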

## Training

10% of the Common Voice `train` and `validation` splits was used for training.

## Testing

15% of the Common Voice `test` split was used for testing.