---
language: pt
datasets:
- common_voice
- mls
- cetuc
- lapsbm
- voxforge
- tedx
- sid
metrics:
- wer
tags:
- audio
- speech
- wav2vec2
- pt
- portuguese-speech-corpus
- automatic-speech-recognition
- PyTorch
license: apache-2.0
model-index:
- name: bp400-xlsr
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
---

# bp400-xlsr: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This is a demonstration of a fine-tuned Wav2vec 2.0 model for Brazilian Portuguese, built from the following datasets:

- [CETUC](http://www02.smt.ufrj.br/~igor.quintanilha/alcaim.tar.gz): contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the [CETEN-Folha](https://www.linguateca.pt/cetenfolha/) corpus.
- [Common Voice 7.0](https://commonvoice.mozilla.org/pt): a project proposed by the Mozilla Foundation with the goal of creating an open dataset in different languages. Volunteers donate and validate speech through the [official site](https://commonvoice.mozilla.org/pt).
- [Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totalling 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
- [Multilingual Librispeech (MLS)](https://arxiv.org/abs/2012.03411): a massive dataset available in many languages, based on public-domain audiobook recordings such as [LibriVox](https://librivox.org/). The full dataset contains roughly 6,000 hours of transcribed speech; the Portuguese subset [used in this work](http://www.openslr.org/94/) (mostly the Brazilian variant) has approximately 284 hours of speech from 55 audiobooks read by 62 speakers.
- [Multilingual TEDx](http://www.openslr.org/100): a collection of audio recordings of TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian variant) contains 164 hours of transcribed speech.
- [Sidney](https://igormq.github.io/datasets/) (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata such as place of birth, age, gender, education, and occupation.
- [VoxForge](http://www.voxforge.org/): a project aimed at building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates ranging from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset (a rough assembly sketch follows the table below). All data was used for training, except for the Common Voice dev and test sets, which were used for validation and testing, respectively. We also built test sets for each of the gathered datasets.

| Dataset                        |  Train | Valid |  Test |
|--------------------------------|-------:|------:|------:|
| CETUC                          |  93.9h |    -- |  5.4h |
| Common Voice                   |  37.6h |  8.9h |  9.5h |
| LaPS BM                        |   0.8h |    -- |  0.1h |
| MLS                            | 161.0h |    -- |  3.7h |
| Multilingual TEDx (Portuguese) | 144.2h |    -- |  1.8h |
| SID                            |   5.0h |    -- |  1.0h |
| VoxForge                       |   2.8h |    -- |  0.1h |
| Total                          | 437.2h |  8.9h | 21.6h |

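For illustration only, a minimal sketch of how the per-corpus splits could be concatenated with the `datasets` library. The actual training was done with fairseq, and the file paths below are hypothetical placeholders assumed to follow the same `path`/`sentence` CSV schema used by the test sets later in this demo:

```python
from datasets import load_dataset, concatenate_datasets

# Hypothetical per-corpus train CSVs with `path` and `sentence` columns.
csvs = ['cetuc/train.csv', 'commonvoice/train.csv', 'lapsbm/train.csv',
        'mls/train.csv', 'tedx/train.csv', 'sid/train.csv', 'voxforge/train.csv']
parts = [load_dataset('csv', data_files={'train': f})['train'] for f in csvs]
bp_train = concatenate_datasets(parts)  # one combined training split
```
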
The original model was fine-tuned using [fairseq](https://github.com/pytorch/fairseq). This notebook uses a converted version of the original model. The link to the original fairseq model is available [here](https://drive.google.com/drive/folders/1eRUExXRF2XK8JxUjIzbLBkLa5wuR3nig?usp=sharing).

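The conversion itself is not part of this notebook. As a rough sketch, a fairseq wav2vec 2.0 checkpoint can typically be converted with the conversion script shipped in the `transformers` repository; the script path, flags, and file names below are assumptions, so check the script's `--help` for the current interface:

```python
# Sketch only: converting a fairseq checkpoint to the transformers format.
# `checkpoint_best.pt` and `dict.ltr.txt` are hypothetical placeholders.
!python transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
    --checkpoint_path checkpoint_best.pt \
    --dict_path dict.ltr.txt \
    --pytorch_dump_folder_path bp400-xlsr
```
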
#### Summary

Word error rate (WER) on each test set:

| | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
|----------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| bp\_400 (demonstration below) | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
| bp\_400 + 3-gram | 0.033 | 0.095 | 0.046 | **0.123** | 0.112 | 0.212 | 0.123 | 0.106 |
| bp\_400 + 4-gram (demonstration below) | 0.030 | 0.096 | 0.043 | 0.106 | 0.118 | 0.229 | 0.117 | 0.105 |
| bp\_400 + 5-gram | 0.033 | 0.094 | 0.043 | **0.123** | **0.111** | **0.210** | **0.123** | **0.105** |
| bp\_400 + Transf. | **0.032** | **0.092** | **0.036** | 0.130 | 0.115 | 0.215 | 0.125 | 0.106 |

#### Transcription examples

| Text | Transcription |
|------|---------------|
| alguém sabe a que horas começa o jantar | alguém sabe a que horas **começo** jantar |
| lila covas ainda não sabe o que vai fazer no fundo | **lilacovas** ainda não sabe o que vai fazer no fundo |
| que tal um pouco desse bom spaghetti | **quetá** um pouco **deste** bom **ispaguete** |
| hong kong em cantonês significa porto perfumado | **rongkong** **en** **cantones** significa porto perfumado |
| vamos hackear esse problema | vamos **rackar** esse problema |
| apenas a poucos metros há uma estação de ônibus | apenas **ha** poucos metros **á** uma estação de ônibus |
| relâmpago e trovão sempre andam juntos | **relampagotrevão** sempre andam juntos |

## Demonstration


```python
MODEL_NAME = "lgris/bp400-xlsr"
```

### Imports and dependencies


```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio==0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```


```python
import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
```

### Helpers


```python
chars_to_ignore_regex = r'[,?.!;:"]'

def map_to_array(batch):
    # Load the audio; files are assumed to already be 16 kHz mono.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference text: strip punctuation and lowercase.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch
```


```python
def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # e.g. empty string
            pass
    # Average the per-utterance scores.
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil
```

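Note that this helper averages per-utterance scores rather than pooling errors over the whole corpus, so short and long utterances weigh equally. A quick sanity check on toy strings (hypothetical examples):

```python
truths = ["alguém sabe a que horas começa o jantar"]
hypos = ["alguém sabe a que horas começo jantar"]
wer, mer, wil = calc_metrics(truths, hypos)
print(wer, mer, wil)
```
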

```python
def load_data(dataset):
    # Each test CSV is expected to provide at least `path` and `sentence` columns.
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
```

### Model


```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # Sort the vocabulary by token id so it matches the CTC output order.
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            # Beam-search decoding with the n-gram language model.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy (argmax) CTC decoding.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```

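The class is written for batched `datasets.map` calls, but it also works on a hand-built batch. A minimal single-file sketch, where the WAV path is a hypothetical placeholder assumed to be 16 kHz mono:

```python
speech, _ = torchaudio.load("example.wav")  # hypothetical 16 kHz mono file
batch = {"speech": [speech.squeeze(0).numpy()], "sampling_rate": [16_000]}
print(STT(MODEL_NAME).batch_predict(batch)["predicted"][0])
```
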
### Download datasets


```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
```

### Tests


```python
stt = STT(MODEL_NAME)
```

#### CETUC


```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.05159104708285062


#### Common Voice


```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.14031426198658084


#### LaPS


```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.07432133838383838


#### MLS


```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.11678793514817509


#### SID


```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.12152357273433984


#### TEDx


```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.24666815906766504


#### VoxForge


```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.11873106060606062

### Tests with LM


```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # trained with wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # trained with bp
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```
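
To build a similar language model yourself, KenLM's `lmplz` tool can estimate a word-level 4-gram ARPA file from a plain-text corpus. This is a sketch assuming the KenLM binaries are compiled and on your PATH, with a hypothetical `corpus.txt`; the pip install above only provides the Python bindings:

```python
# Sketch only: estimate a 4-gram ARPA LM from a normalized text corpus.
!lmplz -o 4 < corpus.txt > my-pt-4-gram.arpa
```
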
#### CETUC


```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.030266462438593742


#### Common Voice


```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.09577710237417715


#### LaPS


```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.043617424242424235


#### MLS


```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.10642133314350002


#### SID


```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.11839021001747055


#### TEDx


```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.22929952467810416


#### VoxForge


```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.11716314935064935