---
language: de
datasets:
- common_voice
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 German by Florian Zimmermeister
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice de
      type: common_voice
      args: de
    metrics:
    - name: Test WER
      type: wer
      value: 15.36
---
# Wav2Vec2-Large-XLSR-53-German
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on German using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "de", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("flozi00/wav2vec-xlsr-german")
model = Wav2Vec2ForCTC.from_pretrained("flozi00/wav2vec-xlsr-german")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
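
To transcribe a recording of your own instead of a Common Voice sample, the same pipeline applies. The sketch below is an illustration rather than part of the original recipe: `my_recording.wav` is a hypothetical path, and the file is assumed to be a mono speech clip whose sample rate may differ from 16 kHz.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("flozi00/wav2vec-xlsr-german")
model = Wav2Vec2ForCTC.from_pretrained("flozi00/wav2vec-xlsr-german")

# "my_recording.wav" is a hypothetical example path, not a file shipped with the model.
speech_array, sampling_rate = torchaudio.load("my_recording.wav")

# The model expects 16 kHz input, so resample whenever the file uses another rate.
if sampling_rate != 16_000:
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)

inputs = processor(speech_array.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))
```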
## Evaluation
The model can be evaluated as follows on German Common Voice data (the script below uses the validation split).
```python
import datasets
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import soundfile as sf
import torch
from jiwer import wer
import librosa
import re
import time

# Punctuation to strip from the reference transcriptions.
chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\"\\“\\%\\‘\\”\\�]'
pattern = re.compile("[A-Za-z0-9 ]+")

model_name = "flozi00/wav2vec-xlsr-german"
model = Wav2Vec2ForCTC.from_pretrained(model_name).to("cuda")
processor = Wav2Vec2Processor.from_pretrained(model_name)

print("loading and cleaning")
val_dataset = datasets.load_dataset("common_voice", "de", split="validation")
val_dataset = val_dataset.remove_columns(["client_id", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment"])
val_dataset = val_dataset.rename_column("path", "file")
val_dataset = val_dataset.rename_column("sentence", "text")
# Optionally evaluate on a subset (every 30th example):
# val_dataset = val_dataset.filter(lambda example, indice: indice % 30 == 0, with_indices=True)

print(val_dataset)

def map_to_array(batch):
    # Normalize the reference text: strip punctuation, uppercase, pad with a space.
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["text"]).upper() + " "
    try:
        # Reuse the cached 16 kHz wav copy if a previous run already wrote it.
        speech_array, sampling_rate = sf.read(batch["file"] + ".wav")
    except Exception:
        # Otherwise decode the original clip, resample to 16 kHz, and cache it as wav.
        speech_array, sampling_rate = librosa.load(batch["file"], sr=16000, res_type='kaiser_fast')
        sf.write(batch["file"] + ".wav", speech_array, sampling_rate, subtype='PCM_24')
    batch["speech"] = speech_array
    return batch

print("reading audio")
start = time.time()
val_dataset = val_dataset.map(map_to_array)
print(time.time() - start)

# Keep only examples whose normalized text is plain ASCII letters, digits, and spaces.
val_dataset = val_dataset.filter(lambda example: pattern.fullmatch(example["text"]) is not None)

def map_to_pred(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

print("predicting")
result = val_dataset.map(map_to_pred, batched=True, batch_size=32, remove_columns=["speech"])

print("WER:", wer(result["text"], result["transcription"]))
```
**Test Result**: 15.36 %
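
For context, the word error rate above is computed with `jiwer.wer`, which scores word-level substitutions, insertions, and deletions against the reference text. A toy example (made-up strings, not model output):

```python
from jiwer import wer

# One substituted word out of two -> WER = 0.5; the 15.36 % above corresponds to 0.1536.
print(wer("HALLO WELT", "HALLO WELD"))  # 0.5
```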