---
language: pt
datasets:
- common_voice
- mls
- cetuc
- lapsbm
- voxforge
- tedx
- sid
metrics:
- wer
tags:
- audio
- speech
- wav2vec2
- pt
- portuguese-speech-corpus
- automatic-speech-recognition
- PyTorch
license: apache-2.0
---

# bp500-base10k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset

This is a demonstration of a Wav2vec 2.0 model fine-tuned for Brazilian Portuguese using the following datasets:

- [CETUC](http://www02.smt.ufrj.br/~igor.quintanilha/alcaim.tar.gz): approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the [CETEN-Folha](https://www.linguateca.pt/cetenfolha/) corpus.
- [Common Voice 7.0](https://commonvoice.mozilla.org/pt): a project proposed by the Mozilla Foundation to create a wide, open dataset in many languages; volunteers donate and validate speech through the [official site](https://commonvoice.mozilla.org/pt).
- [Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, for a total of 700 utterances. The audio was recorded at 22.05 kHz without environmental control.
- [Multilingual Librispeech (MLS)](https://arxiv.org/abs/2012.03411): a massive multilingual dataset based on public-domain audiobook recordings such as [LibriVox](https://librivox.org/), containing a total of 6k hours of transcribed speech across languages. The Portuguese subset [used in this work](http://www.openslr.org/94/) (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- [Multilingual TEDx](http://www.openslr.org/100): a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian variant) contains 164 hours of transcribed speech.
- [Sidney](https://igormq.github.io/datasets/) (SID): 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata such as place of birth, age, gender, education, and occupation.
- [VoxForge](http://www.voxforge.org/): a project that builds open datasets for acoustic modeling. The Brazilian Portuguese corpus contains approximately 100 speakers and 4,130 utterances, with sample rates ranging from 16 kHz to 44.1 kHz.

These datasets were combined to build a larger Brazilian Portuguese dataset. All data was used for training except the Common Voice dev and test sets, which were used for validation and testing, respectively. We also built test sets for each of the gathered datasets.

| Dataset                        |  Train | Valid |  Test |
|--------------------------------|-------:|------:|------:|
| CETUC                          |  94.0h |    -- |  5.4h |
| Common Voice                   |  37.8h |  8.9h |  9.5h |
| LaPS BM                        |   0.8h |    -- |  0.1h |
| MLS                            | 161.0h |    -- |  3.7h |
| Multilingual TEDx (Portuguese) | 148.9h |    -- |  1.8h |
| SID                            |   7.2h |    -- |  1.0h |
| VoxForge                       |   3.9h |    -- |  0.1h |
| Total                          | 453.6h |  8.9h | 21.6h |
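
As a quick sanity check, the per-dataset hours above add up to the reported totals:

```python
# Hours per dataset, copied from the table above
train_hours = {"CETUC": 94.0, "Common Voice": 37.8, "LaPS BM": 0.8, "MLS": 161.0,
               "Multilingual TEDx": 148.9, "SID": 7.2, "VoxForge": 3.9}
test_hours = {"CETUC": 5.4, "Common Voice": 9.5, "LaPS BM": 0.1, "MLS": 3.7,
              "Multilingual TEDx": 1.8, "SID": 1.0, "VoxForge": 0.1}

print(round(sum(train_hours.values()), 1))  # 453.6
print(round(sum(test_hours.values()), 1))   # 21.6
```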

The original model was fine-tuned using [fairseq](https://github.com/pytorch/fairseq). This notebook uses a version of that model converted to the Hugging Face Transformers format. The original fairseq checkpoint is available [here](https://drive.google.com/file/d/19kkENi8uvczmw9OLSdqnjvKqBE53cl_W/view?usp=sharing).

#### Summary

Word error rate (WER) on each test set (lower is better):

| | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
|----------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| bp\_500-base10k_voxpopuli (demonstration below) | 0.120 | 0.249 | 0.039 | 0.227 | 0.169 | 0.349 | 0.116 | 0.181 |
| bp\_500-base10k_voxpopuli + 4-gram (demonstration below) | 0.074 | 0.174 | 0.032 | 0.182 | 0.181 | 0.349 | 0.111 | 0.157 |

#### Transcription examples

| Text | Transcription |
|------|---------------|
| suco de uva e água misturam bem | suco **deúva** e água **misturão** bem |
| culpa do dinheiro | **cupa** do dinheiro |
| eu amo shooters call of duty é o meu favorito | eu **omo** **shúters cofedete** é meu favorito |
| você pode explicar por que isso acontece | você pode explicar **por** que isso **ontece** |
| no futuro você desejará ter começado a investir hoje | no futuro você desejará **a** ter começado a investir hoje |


## Demonstration


```python
MODEL_NAME = "lgris/bp500-base10k_voxpopuli"
```

### Imports and dependencies


```python
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
```


```python
import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
```

### Helpers


```python
# Punctuation to strip from the reference transcriptions
chars_to_ignore_regex = r'[,?.!;:"]'

def map_to_array(batch):
    # Load the audio and store it as a 1-D numpy array
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference: strip punctuation, lowercase, unify apostrophes
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch
```
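
For example, the normalization in `map_to_array` strips punctuation and lowercases while leaving accented characters untouched (the sample sentence here is made up):

```python
import re

chars_to_ignore_regex = r'[,?.!;:"]'

normalized = re.sub(chars_to_ignore_regex, '', "Suco de uva, e água!").lower()
print(normalized)  # suco de uva e água
```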


```python
def calc_metrics(truths, hypos):
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # skip utterances jiwer cannot score (e.g., empty strings)
            pass
    wer = sum(wers) / len(wers)
    mer = sum(mers) / len(mers)
    wil = sum(wils) / len(wils)
    return wer, mer, wil
```
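
`calc_metrics` averages the per-utterance scores computed by jiwer. For intuition, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; a minimal self-contained version (illustrative only, not a replacement for jiwer) can be sketched as:

```python
def word_error_rate(truth: str, hypo: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = truth.split(), hypo.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)

# First transcription example above: 1 deletion + 2 substitutions over 7 words
print(word_error_rate("suco de uva e água misturam bem",
                      "suco deúva e água misturão bem"))  # 3/7 ≈ 0.429
```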


```python
def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
```

### Model


```python
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # Vocabulary sorted by token id, as expected by pyctcdecode
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Beam-search decoding with the n-gram language model
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy (argmax) CTC decoding
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
```
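
In the `else` branch above, `torch.argmax` followed by `batch_decode` performs greedy CTC decoding: pick the most likely token at every frame, then collapse consecutive repeats and drop the blank token. The collapse step can be sketched as follows (the token ids here are made up; in the real model the blank id comes from the tokenizer):

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Collapse repeated frame-level ids and remove blanks (the CTC rule)."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# The blank between the two 7s keeps them from being merged
print(ctc_greedy_collapse([0, 7, 7, 0, 7, 5, 5, 0]))  # [7, 7, 5]
```

When a language model is supplied instead, `pyctcdecode` runs a beam search over the frame posteriors and rescores candidate word sequences with the n-gram LM, which is what produces the improved WER in the tables above.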

### Download datasets


```python
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
```


```python
%cd bp_dataset
```

    /content/bp_dataset


### Tests


```python
stt = STT(MODEL_NAME)
```

#### CETUC


```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.12096759949218888


#### Common Voice


```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.24977003159495725


#### LaPS


```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.039769570707070705


#### MLS


```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.2269637077788063


#### SID


```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.1691680138494731


#### TEDx


```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.34908555859018014


#### VoxForge


```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.11649350649350651


### Tests with LM


```python
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # 4-gram LM trained on Wikipedia text
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # 4-gram LM trained on the BP dataset
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
```

#### CETUC


```python
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
```

    CETUC WER: 0.07499558425787961


#### Common Voice


```python
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
```

    CV WER: 0.17442648452610307


#### LaPS


```python
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
```

    Laps WER: 0.032774621212121206


#### MLS


```python
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
```

    MLS WER: 0.18213620321569274


#### SID


```python
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
```

    Sid WER: 0.18102544972868206


#### TEDx


```python
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
```

    TEDx WER: 0.3491402028105601


#### VoxForge


```python
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
```

    VoxForge WER: 0.11189529220779222