---
language: eu
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Basque by pcuenq
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice eu
      type: common_voice
      args: eu
    metrics:
    - name: Test WER
      type: wer
      value: 15.34
---

# Wav2Vec2-Large-XLSR-53-EU

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Basque using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-eu")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-eu")

# Common Voice clips are 48 kHz; the model expects 16 kHz input.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
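As a quick alternative (an addition to this card, not part of the original workflow), recent `transformers` releases can run the same checkpoint through the high-level `pipeline` API, which handles file loading, resampling to 16 kHz, and CTC decoding. The audio path below is a placeholder:

```python
from transformers import pipeline

# Builds an automatic speech recognition pipeline around the checkpoint above.
asr = pipeline("automatic-speech-recognition", model="pcuenq/wav2vec2-large-xlsr-53-eu")

# "speech_eu.wav" is a placeholder path to any Basque audio file.
print(asr("speech_eu.wav"))
```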
## Evaluation

The model can be evaluated as follows on the Basque test data of Common Voice.
```python
import re

import torch
import torchaudio
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test")
wer = load_metric("wer")

model_name = "pcuenq/wav2vec2-large-xlsr-53-eu"

processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

## Audio pre-processing

def speech_file_to_array_fn(batch):
    speech_array, sample_rate = torchaudio.load(batch["path"])
    # Resample every clip to the 16 kHz rate the model expects.
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sample_rate, target_sr=16_000)
    return batch

# Text transformation and audio resampling
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of processes for `map`, or None to disable multiprocessing
num_proc = 16
test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER metric computation
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
**Test Result**: 15.34 %
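For context (an addition to the card): WER is word-level edit distance, the number of substitutions, deletions, and insertions divided by the number of reference words. A toy check with the same `load_metric("wer")` used above, on made-up Basque strings:

```python
from datasets import load_metric

wer = load_metric("wer")

# Hypothetical sentences: one substitution in three reference words -> WER = 1/3.
print(wer.compute(predictions=["kaixo mundu guztia"], references=["kaixo mundu osoa"]))  # ~0.3333
```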
## Training

The Common Voice `train` and `validation` splits were used for training. Training ran for 22 + 20 epochs (two consecutive runs) with the following parameters; an illustrative sketch of how they map onto `transformers` settings follows the list.
- Batch size: 16, with 2 gradient accumulation steps (effective batch size: 32).
- Learning rate: 2.5e-4
- Activation dropout: 0.05
- Attention dropout: 0.1
- Hidden dropout: 0.05
- Feature projection dropout: 0.05
- Mask time probability: 0.08
- Layer dropout: 0.05
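The exact training script is not included in the card. As an illustrative reconstruction only, the listed values correspond to real `Wav2Vec2Config` and `TrainingArguments` options; the output directory, `fp16` flag, and `ctc_loss_reduction` choice below are assumptions, not taken from the card:

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments

# Dropout and masking values from the list above, applied as config overrides.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    activation_dropout=0.05,
    attention_dropout=0.1,
    hidden_dropout=0.05,
    feat_proj_dropout=0.05,
    mask_time_prob=0.08,
    layerdrop=0.05,
    ctc_loss_reduction="mean",  # assumption: common choice in XLSR fine-tuning recipes
)

# Assumed arguments for the first 22-epoch run; the second run would reuse them
# with num_train_epochs=20.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-53-eu",  # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch size 32
    learning_rate=2.5e-4,
    num_train_epochs=22,
    fp16=True,
)
```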