Pedro Cuenca committed
Commit 661b20f
1 Parent(s): c23b401

* Add model card.

Files changed (1):

README.md ADDED (+148, -0)
---
language: eu
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Basque by pcuenq
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice eu
      type: common_voice
      args: eu
    metrics:
    - name: Test WER
      type: wer
      value: 15.34
---

# Wav2Vec2-Large-XLSR-53-EU

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Basque using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
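
If your audio comes from a source other than Common Voice, you can resample it to 16 kHz with `torchaudio` before calling the processor. A minimal sketch, using a hypothetical local file name:

```python
import torchaudio

# "my_recording.wav" is a placeholder for your own audio file.
waveform, sample_rate = torchaudio.load("my_recording.wav")
if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)
```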

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-eu")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-eu")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
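
Alternatively, recent versions of `transformers` provide an `automatic-speech-recognition` pipeline that wraps the same processor and model. A minimal sketch, assuming a hypothetical local audio file and that `ffmpeg` is available for decoding:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="pcuenq/wav2vec2-large-xlsr-53-eu")

# "audio.wav" is a placeholder; the pipeline decodes the file and resamples
# it to the model's 16 kHz rate.
print(asr("audio.wav")["text"])
```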

## Evaluation

The model can be evaluated as follows on the Basque test data of Common Voice.

```python
import re

import librosa
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "eu", split="test")
wer = load_metric("wer")

model_name = "pcuenq/wav2vec2-large-xlsr-53-eu"

processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

## Audio pre-processing

def speech_file_to_array_fn(batch):
    speech_array, sample_rate = torchaudio.load(batch["path"])
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sample_rate, target_sr=16_000)
    return batch

# Text transformation and audio resampling
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of CPUs or None
num_proc = 16
test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER metric computation
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 15.34 %

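For reference, WER counts word-level substitutions, insertions, and deletions against the reference transcript. A toy illustration (this assumes the `jiwer` package, which `load_metric("wer")` uses under the hood, is installed):

```python
from datasets import load_metric

wer = load_metric("wer")

# One substitution ("mundua" for "mundu") out of two reference words -> WER 0.5.
print(wer.compute(predictions=["kaixo mundua"], references=["kaixo mundu"]))
```
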
## Training

The Common Voice `train` and `validation` splits were used for training. Training was performed for 22 + 20 epochs with the following parameters (a sketch of how these map onto the `transformers` API follows the list):

- Batch size 16, 2 gradient accumulation steps.
- Learning rate: 2.5e-4
- Activation dropout: 0.05
- Attention dropout: 0.1
- Hidden dropout: 0.05
- Feature proj. dropout: 0.05
- Mask time probability: 0.08
- Layer dropout: 0.05
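
These values correspond to standard `transformers` configuration and trainer knobs. A minimal sketch of that mapping, not the exact training script (the output directory and per-run epoch counts are illustrative assumptions, and tokenizer/vocabulary setup is omitted):

```python
from transformers import TrainingArguments, Wav2Vec2ForCTC

# Dropout and masking values from the list above, passed as config overrides.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    activation_dropout=0.05,
    attention_dropout=0.1,
    hidden_dropout=0.05,
    feat_proj_dropout=0.05,
    mask_time_prob=0.08,
    layerdrop=0.05,
)

# Optimization settings from the list above.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-53-eu",  # hypothetical path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2.5e-4,
    num_train_epochs=22,  # first run; a second run added 20 more epochs
)
```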