Pedro Cuenca committed
Commit e41df72 · 1 Parent(s): c665395

* Add README.


Training details to be added later. Training was performed incrementally due to
the difficulty of manipulating the dataset, and a self-contained script is not
available right now.

The evaluation script, however, is correct as far as I can tell, and
represents the pre-processing steps that I followed.

Note that the `wer` metric produces a memory error when used on large
datasets. I created a version that accumulates the base measures and
then computes the result, instead of doing it on all samples at once.

Files changed (1):

README.md ADDED (+190, -0)
---
language: es
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 Spanish by pcuenq
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice es
      type: common_voice
      args: es
    metrics:
    - name: Test WER
      type: wer
      value: 13.47
---

# Wav2Vec2-Large-XLSR-53-Spanish

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Spanish using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "es", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
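
The snippet above runs inference on CPU. If a GPU is available, the model and inputs can be moved to it, as the evaluation script below does. A minimal sketch, assuming a CUDA device:

```python
# Optional: move the model and tensors to a CUDA device for faster inference.
model.to("cuda")
with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"),
                   attention_mask=inputs.attention_mask.to("cuda")).logits
```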


## Evaluation

The model can be evaluated as follows on the Spanish test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "es", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

def replace_diacritics(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ì', 'í', sentence)
    sentence = re.sub('ù', 'ú', sentence)
    sentence = re.sub('ò', 'ó', sentence)
    sentence = re.sub('à', 'á', sentence)
    batch["sentence"] = sentence
    return batch

def replace_additional(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ã', 'a', sentence)   # Portuguese, as in São Paulo
    sentence = re.sub('ō', 'o', sentence)   # Japanese
    sentence = re.sub('ê', 'e', sentence)   # Português
    batch["sentence"] = sentence
    return batch

## Audio pre-processing

# I tried to perform the resampling using a `torchaudio` `Resample` transform,
# but found that the process deadlocked when using multiple processes.
# Perhaps my torchaudio is using the wrong sox library under the hood, I'm not sure.
# Fortunately, `librosa` seems to work fine, so that's what I'll use for now.

import librosa
def speech_file_to_array_fn(batch):
    speech_array, _ = torchaudio.load(batch["path"])
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=48_000, target_sr=16_000)
    return batch

# One-pass mapping function

# Text transformation and audio resampling
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = replace_diacritics(batch)
    batch = replace_additional(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of CPUs or None
num_proc = 16

test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER metric computation
# `wer.compute` crashes on my computer with more than ~10000 samples.
# Until I confirm on a different one, I created a "chunked" version of the computation.
# It gives the same results as `wer.compute` for smaller datasets.

import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    return float(S + D + I) / float(H + S + D)

print("WER: {:.2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
# print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
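
Because `chunked_wer` accumulates the same integer counts (hits, substitutions, deletions, insertions) that a one-shot computation uses, the two should agree on any dataset small enough for both to run. A quick sanity check, assuming `result` from the script above:

```python
# Sanity check (illustrative): the chunked accumulation should match
# jiwer's one-shot WER on the same data.
refs = result["sentence"][:1000]
preds = result["pred_strings"][:1000]
assert abs(chunked_wer(refs, preds, chunk_size=100) - jiwer.wer(refs, preds)) < 1e-12
```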

**Test Result**: 13.47 %


## Training

The Common Voice `train` and `validation` datasets were used for training.

Training details to be added later: training was performed incrementally, and a self-contained script is not available yet.
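
For reference, a minimal sketch of how both splits can be loaded together with the `datasets` library (illustrative only; this is not the actual training script):

```python
from datasets import load_dataset

# Illustrative: load the Common Voice Spanish train and validation splits
# as a single dataset, since both were used for training.
train_dataset = load_dataset("common_voice", "es", split="train+validation")
```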