---
language: pt
datasets:
- common_voice
- mls
- cetuc
- lapsbm
- voxforge
- tedx
- sid
metrics:
- wer
tags:
- audio
- speech
- wav2vec2
- pt
- portuguese-speech-corpus
- automatic-speech-recognition
- PyTorch
license: apache-2.0
---

# bp500-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This is a demonstration of a fine-tuned Wav2vec model for Brazilian Portuguese, trained on the following datasets:

- [CETUC](http://www02.smt.ufrj.br/~igor.quintanilha/alcaim.tar.gz): contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the [CETEN-Folha](https://www.linguateca.pt/cetenfolha/) corpus;
- [Common Voice 7.0](https://commonvoice.mozilla.org/pt): a project proposed by the Mozilla Foundation with the goal of creating open datasets in different languages. Volunteers donate and validate speech through the [official site](https://commonvoice.mozilla.org/pt);
- [Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totalling 700 utterances. The audio was recorded at 22.05 kHz without environmental control;
- [Multilingual Librispeech (MLS)](https://arxiv.org/abs/2012.03411): a massive dataset available in many languages, based on public-domain audiobook recordings such as those from [LibriVox](https://librivox.org/). The full dataset contains about 6k hours of transcribed speech; the Portuguese subset [used in this work](http://www.openslr.org/94/) (mostly the Brazilian variant) has approximately 284 hours of speech from 55 audiobooks read by 62 speakers;
- [VoxForge](http://www.voxforge.org/): a project aimed at building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.

These datasets (together with the Multilingual TEDx Portuguese set and SID, listed in the table below) were combined to build a larger Brazilian Portuguese dataset. All data was used for training, except the Common Voice dev and test sets, which were used for validation and testing, respectively. We also built test sets for each of the gathered datasets.

| Dataset                        |  Train | Valid |  Test |
|--------------------------------|-------:|------:|------:|
| CETUC                          |  93.9h |    -- |  5.4h |
| Common Voice                   |  37.6h |  8.9h |  9.5h |
| LaPS BM                        |   0.8h |    -- |  0.1h |
| MLS                            | 161.0h |    -- |  3.7h |
| Multilingual TEDx (Portuguese) | 144.2h |    -- |  1.8h |
| SID                            |   5.0h |    -- |  1.0h |
| VoxForge                       |   2.8h |    -- |  0.1h |
| Total                          | 437.2h |  8.9h | 21.6h |

The original model was fine-tuned using [fairseq](https://github.com/pytorch/fairseq). This notebook uses a converted version of the original one. The original fairseq model is available [here](https://drive.google.com/file/d/1J8aR1ltDLQFe-dVrGuyxoRm2uyJjCWgf/view?usp=sharing).

#### Summary

Word error rate (WER) per test set:

|                                         | CETUC |    CV |  LaPS |   MLS |   SID |  TEDx |    VF |   AVG |
|-----------------------------------------|------:|------:|------:|------:|------:|------:|------:|------:|
| bp\_500 (demonstration below)           | 0.051 | 0.136 | 0.032 | 0.118 | 0.095 | 0.248 | 0.082 | 0.108 |
| bp\_500 + 4-gram (demonstration below)  | 0.032 | 0.097 | 0.022 | 0.114 | 0.125 | 0.246 | 0.065 | 0.100 |

#### Transcription examples

| Ground truth | Transcription (errors in bold) |
|--------------|--------------------------------|
| não há um departamento de mediadores independente das federações e das agremiações | não há um **dearamento** de mediadores independente das federações e das **agrebiações** |
| mas que bodega | **masque** bodega |
| a cortina abriu o show começou | a cortina abriu o **chô** começou |
| por sorte havia uma passadeira | **busote avinhoa** **passadeiro** |
| estou maravilhada está tudo pronto | **stou** estou maravilhada está tudo pronto |
67
+
68
+ ## Demonstration
69
+
70
+
71
+ ```python
72
+ MODEL_NAME = "lgris/bp500-xlsr"
73
+ ```
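
Before the step-by-step walkthrough below, note that the converted checkpoint can also be exercised end to end with the high-level `transformers` ASR pipeline. A minimal sketch, with greedy decoding and no language model; `sample.wav` stands for any 16 kHz mono audio file of yours:

```python
from transformers import pipeline

# Quick sanity check with the high-level ASR pipeline (greedy CTC decoding).
# "sample.wav" is a placeholder for any 16 kHz mono audio file.
asr = pipeline("automatic-speech-recognition", model=MODEL_NAME)
print(asr("sample.wav")["text"])
```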

### Imports and dependencies

```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```

```python
import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
```

### Helpers

```python
chars_to_ignore_regex = r'[,?.!;:"]'  # punctuation stripped from reference texts

def map_to_array(batch):
    # Load the waveform; files are assumed to be 16 kHz mono already.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference: strip punctuation, lowercase, unify apostrophes.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch
```
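
`map_to_array` assumes every file is already at 16 kHz. If your audio is at a different rate, a resampling step along these lines can be added first (a sketch using `torchaudio.transforms.Resample`; the helper name is ours):

```python
def load_and_resample(path, target_sr=16_000):
    # Resample only when the file's native rate differs from the target.
    speech, sr = torchaudio.load(path)
    if sr != target_sr:
        speech = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(speech)
    return speech.squeeze(0).numpy()
```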

```python
def calc_metrics(truths, hypos):
    wers, mers, wils = [], [], []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except ValueError:  # skip empty references/hypotheses
            continue
    # Average the per-utterance scores.
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil
```
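
A quick check of the helper on a toy pair (the values come straight from `jiwer`):

```python
truths = ["a cortina abriu o show começou"]
hypos = ["a cortina abriu o chô começou"]
wer, mer, wil = calc_metrics(truths, hypos)
print(f"WER={wer:.3f} MER={mer:.3f} WIL={wil:.3f}")  # one of six words differs
```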

```python
def load_data(dataset):
    # Each test set is a CSV with "path" (audio file) and "sentence" (reference) columns.
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
```
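
For reference, each `test.csv` is expected to look roughly like this (illustrative rows; only the `path` and `sentence` columns are used by the helpers above):

```python
# cetuc_dataset/test.csv (illustrative)
# path,sentence
# cetuc_dataset/audios/0001.wav,a cortina abriu o show começou
# cetuc_dataset/audios/0002.wav,estou maravilhada está tudo pronto
```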

### Model

```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # pyctcdecode expects the labels ordered by their vocabulary index.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            # Beam-search decoding rescored by the KenLM n-gram model.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Plain greedy (argmax) CTC decoding.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
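
`batch_predict` is written for batched `datasets.map` calls. To transcribe a single file, a thin wrapper along these lines works (our sketch; `sample.wav` is a placeholder path):

```python
def predict_file(stt, path):
    # Wrap one utterance in the batch layout batch_predict expects.
    speech, _ = torchaudio.load(path)
    batch = {"speech": [speech.squeeze(0).numpy()], "sampling_rate": [16_000]}
    return stt.batch_predict(batch)["predicted"][0]

# Example (after instantiating STT below):
# print(predict_file(stt, "sample.wav"))
```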

### Download datasets

```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
```

```python
%cd bp_dataset
```

    /content/bp_dataset

### Tests

```python
stt = STT(MODEL_NAME)
```

#### CETUC

```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.05159097808687998

#### Common Voice

```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.13659981509705973

#### LaPS

```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.03196969696969697

#### MLS

```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.1178481066463896

#### SID

```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.09544588416964224

#### TEDx

```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.24868046340420813

#### VoxForge

```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.08246076839826841

### Tests with LM

```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # 4-gram LM trained on Wikipedia text
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # 4-gram LM trained on the BP corpus
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
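
`pyctcdecode` also exposes the LM weight and the word insertion bonus when building the decoder. A sketch of tuning them (the `alpha`/`beta` values here are only illustrative defaults, not the settings used for the reported results):

```python
decoder = build_ctcdecoder(
    list(stt.sorted_dict.keys()),
    'pt-BR-wiki.word.4-gram.arpa',
    alpha=0.5,  # weight of the n-gram LM scores
    beta=1.5,   # word insertion bonus
)
```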

#### CETUC

```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.03222801788375573

#### Common Voice

```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.09713866021093655

#### LaPS

```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.022310606060606065

#### MLS

```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.11408590958696524

#### SID

```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.12502797252979136

#### TEDx

```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.24603179403904793

#### VoxForge

```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.06542207792207791