---
language: pt
datasets:
- common_voice
- mls
- cetuc
- lapsbm
- voxforge
- tedx
- sid
metrics:
- wer
tags:
- audio
- speech
- wav2vec2
- pt
- portuguese-speech-corpus
- automatic-speech-recognition
- PyTorch
license: apache-2.0
model-index:
- name: bp500-base100k_voxpopuli
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
---

# bp500-base100k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This is a demonstration of a Wav2vec 2.0 model fine-tuned for Brazilian Portuguese using the following datasets:

- [CETUC](http://www02.smt.ufrj.br/~igor.quintanilha/alcaim.tar.gz): contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the [CETEN-Folha](https://www.linguateca.pt/cetenfolha/) corpus.
- [Common Voice 7.0](https://commonvoice.mozilla.org/pt): a project proposed by the Mozilla Foundation with the goal of creating open datasets in different languages. Volunteers donate and validate speech through the [official site](https://commonvoice.mozilla.org/pt).
- [Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totaling 700 utterances in Brazilian Portuguese. The audio was recorded at 22.05 kHz without environmental control.
- [Multilingual Librispeech (MLS)](https://arxiv.org/abs/2012.03411): a massive dataset available in many languages, based on public-domain audiobook recordings such as [LibriVox](https://librivox.org/). The full dataset contains about 6k hours of transcribed speech; the Portuguese subset [used in this work](http://www.openslr.org/94/) (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- [Multilingual TEDx](http://www.openslr.org/100): a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech.
- [Sidney](https://igormq.github.io/datasets/) (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata such as place of birth, age, gender, education, and occupation.
- [VoxForge](http://www.voxforge.org/): a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset. All data was used for training, except for the Common Voice dev/test sets, which were used for validation and test, respectively. We also made test sets for all the gathered datasets.

| Dataset                        | Train  | Valid | Test  |
|--------------------------------|-------:|------:|------:|
| CETUC                          | 94.0h  | --    | 5.4h  |
| Common Voice                   | 37.8h  | 8.9h  | 9.5h  |
| LaPS BM                        | 0.8h   | --    | 0.1h  |
| MLS                            | 161.0h | --    | 3.7h  |
| Multilingual TEDx (Portuguese) | 148.9h | --    | 1.8h  |
| SID                            | 7.2h   | --    | 1.0h  |
| VoxForge                       | 3.9h   | --    | 0.1h  |
| Total                          | 453.6h | 8.9h  | 21.6h |

The original model was fine-tuned using [fairseq](https://github.com/pytorch/fairseq). This notebook uses a converted version of the original one. The original fairseq model is available [here](https://drive.google.com/file/d/10iESR5AQxuxF5F7w3wLbpc_9YMsYbY9H/view?usp=sharing).
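
If you only need the converted checkpoint, it can also be loaded directly from the Hugging Face Hub. A minimal sketch (the repository name is the same `MODEL_NAME` used in the demonstration below):

```python
# Minimal sketch: load the converted checkpoint straight from the Hub.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("lgris/bp500-base100k_voxpopuli")
processor = Wav2Vec2Processor.from_pretrained("lgris/bp500-base100k_voxpopuli")
```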

#### Summary

|                                                           | CETUC | CV    | LaPS  | MLS   | SID   | TEDx  | VF    | AVG   |
|-----------------------------------------------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| bp\_500-base100k_voxpopuli (demonstration below)          | 0.142 | 0.201 | 0.052 | 0.224 | 0.102 | 0.317 | 0.048 | 0.155 |
| bp\_500-base100k_voxpopuli + 4-gram (demonstration below) | 0.099 | 0.149 | 0.047 | 0.192 | 0.115 | 0.371 | 0.127 | 0.157 |

#### Transcription examples

| Text                                               | Transcription                                             |
|----------------------------------------------------|-----------------------------------------------------------|
| qual o instagram dele                              | **qualo** **está** **gramedele**                          |
| o capitão foi expulso do exército porque era doido | o **capitãl** foi **exposo** do exército porque era doido |
| também por que não                                 | também **porque** não                                     |
| não existe tempo como o presente                   | não existe tempo como *o* presente                        |
| eu pulei para salvar rachel                        | eu pulei para salvar **haquel**                           |
| augusto cezar passos marinho                       | augusto **cesa** **passoesmarinho**                       |

## Demonstration

```python
MODEL_NAME = "lgris/bp500-base100k_voxpopuli"
```

### Imports and dependencies

```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```

```python
import jiwer
import torchaudio
from datasets import load_dataset
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from pyctcdecode import build_ctcdecoder
import torch
import re
```

### Helpers

```python
# Punctuation to strip from the reference transcriptions.
chars_to_ignore_regex = r'[,?.!;:"]'

def map_to_array(batch):
    # Load the audio and keep it as a 1-D numpy array (files are assumed to be 16 kHz).
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference: strip punctuation, lowercase, fix curly apostrophes.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch
```

```python
def calc_metrics(truths, hypos):
    # Average sentence-level WER, MER, and WIL over all (truth, hypothesis) pairs.
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except ValueError:  # skip empty reference strings
            pass
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil
```
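
As a quick sanity check, `calc_metrics` can be exercised on toy strings (made-up sentences, not dataset samples):

```python
# Toy example: one perfect hypothesis and one with a single substituted word.
truths = ["o gato dorme no sofá", "hoje vai chover"]
hypos = ["o gato dorme no sofá", "hoje vai choler"]
wer, mer, wil = calc_metrics(truths, hypos)
print(wer)  # average of 0.0 and 1/3 -> about 0.167
```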

```python
def load_data(dataset):
    # Each dataset directory is expected to ship a test.csv with "path" and "sentence" columns.
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
```
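
For reference, `load_data` assumes each `<dataset>/test.csv` has at least a `path` column (audio file location) and a `sentence` column (reference transcription). A hypothetical file could start like this (illustrative paths, not real data):

```
path,sentence
cetuc_dataset/audios/0001.wav,não existe tempo como o presente
cetuc_dataset/audios/0002.wav,também por que não
```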

### Model

```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # pyctcdecode expects the labels ordered by their vocabulary index.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Beam-search decoding with the KenLM language model.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Plain greedy CTC decoding.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
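
Outside of the `datasets.map` pipeline, the same class can transcribe a single file by building a one-element batch by hand. A minimal sketch (`some_audio.wav` is a placeholder and is assumed to be 16 kHz mono):

```python
# Hand-built batch for a single file (hypothetical path, assumed 16 kHz mono).
stt = STT(MODEL_NAME)
speech, sr = torchaudio.load("some_audio.wav")
batch = {"speech": [speech.squeeze(0).numpy()], "sampling_rate": [sr]}
print(stt.batch_predict(batch)["predicted"][0])
```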

### Download datasets

```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
```

```python
%cd bp_dataset
```

    /content/bp_dataset

### Tests

```python
stt = STT(MODEL_NAME)
```

#### CETUC

```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.1419179499917191

#### Common Voice

```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.20079950312040154

#### LaPS

```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.052780934343434324

#### MLS

```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.22413887199364113

#### SID

```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.1019041538671034

#### TEDx

```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.31711268778273327

#### VoxForge

```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.04826433982683982

### Tests with LM

```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # 4-gram LM trained on Wikipedia text
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # 4-gram LM trained on the BP dataset
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
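
Passing `lm=` makes `STT` build a `pyctcdecode` beam-search decoder over the CTC vocabulary instead of doing greedy decoding. The decoder can also be built standalone to verify that the ARPA file loads (a sketch, using the same call the class makes internally):

```python
# Standalone check that the 4-gram ARPA file is readable by pyctcdecode.
decoder = build_ctcdecoder(list(stt.sorted_dict.keys()), 'pt-BR-wiki.word.4-gram.arpa')
```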

#### CETUC

```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.099518615112877

#### Common Voice

```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.1488912889506362

#### LaPS

```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.047080176767676764

#### MLS

```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.19220291966887196

#### SID

```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.11535498771650306

#### TEDx

```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.3707890073539895

#### VoxForge

```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.12682088744588746