---
language: es
datasets:
- common_voice
- ciempiess_test
- hub4ne_es_LDC98S74
- callhome_es_LDC96S35
tags:
- audio
- automatic-speech-recognition
- spanish
- xlrs-53-spanish
- ciempiess
- cimpiess-unam
license: cc-by-4.0
widget:
model-index:
- name: wav2vec2-large-xlsr-53-spanish-ep5-944h
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 10.0
      type: mozilla-foundation/common_voice_10_0
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: <unk>
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 10.0
      type: mozilla-foundation/common_voice_10_0
      split: dev
      args:
        language: es
    metrics:
    - name: Dev WER
      type: wer
      value: <unk>
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CIEMPIESS-TEST
      type: ciempiess/ciempiess_test
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 11.17
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: 1997 Spanish Broadcast News Speech (HUB4-NE)
      type: HUB4NE_LDC98S74
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 7.48
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CALLHOME Spanish Speech (Test)
      type: callhome_LDC96S35
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 39.12
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CALLHOME Spanish Speech (Dev)
      type: callhome_LDC96S35
      split: dev
      args:
        language: es
    metrics:
    - name: Dev WER
      type: wer
      value: 40.39
---

# wav2vec2-large-xlsr-53-spanish-ep5-944h

The "wav2vec2-large-xlsr-53-spanish-ep5-944h" is an acoustic model suitable for Automatic Speech Recognition in Spanish. It is the result of fine-tuning the model "facebook/wav2vec2-large-xlsr-53" with around 944 hours of Spanish data gathered or developed by the [CIEMPIESS-UNAM Project](https://huggingface.co/ciempiess) since 2012. Most of the data is available at the CIEMPIESS-UNAM Project homepage http://www.ciempiess.org/. The rest can be found in public repositories such as the [LDC](https://www.ldc.upenn.edu/) or [OpenSLR](https://openslr.org/).

The specific list of corpora used to fine-tune the model is:

- [CIEMPIESS-LIGHT (18h25m)](https://catalog.ldc.upenn.edu/LDC2017S23)
- [CIEMPIESS-BALANCE (18h20m)](https://catalog.ldc.upenn.edu/LDC2018S11)
- [CIEMPIESS-FEM (13h54m)](https://catalog.ldc.upenn.edu/LDC2019S07)
- [CHM150 (1h38m)](https://catalog.ldc.upenn.edu/LDC2016S04)
- [TEDX_SPANISH (24h29m)](https://openslr.org/67/)
- [LIBRIVOX_SPANISH (73h01m)](https://catalog.ldc.upenn.edu/LDC2020S01)
- [WIKIPEDIA_SPANISH (25h37m)](https://catalog.ldc.upenn.edu/LDC2021S07)
- [VOXFORGE_SPANISH (49h42m)](http://www.voxforge.org/es)
- [MOZILLA COMMON VOICE 10.0 (320h22m)](https://commonvoice.mozilla.org/es)
- [HEROICO (16h33m)](https://catalog.ldc.upenn.edu/LDC2006S37)
- [LATINO-40 (6h48m)](https://catalog.ldc.upenn.edu/LDC95S28)
- [CALLHOME_SPANISH (13h22m)](https://catalog.ldc.upenn.edu/LDC96S35)
- [HUB4NE_SPANISH (31h41m)](https://catalog.ldc.upenn.edu/LDC98S74)
- [FISHER_SPANISH (127h22m)](https://catalog.ldc.upenn.edu/LDC2010S01)
- [Chilean Spanish speech data set (7h08m)](https://openslr.org/71/)
- [Colombian Spanish speech data set (7h34m)](https://openslr.org/72/)
- [Peruvian Spanish speech data set (9h13m)](https://openslr.org/73/)
- [Argentinian Spanish speech data set (8h01m)](https://openslr.org/61/)
- [Puerto Rico Spanish speech data set (1h00m)](https://openslr.org/74/)
- [MediaSpeech Spanish (10h00m)](https://openslr.org/108/)
- [DIMEX100-LIGHT (6h09m)](https://turing.iimas.unam.mx/~luis/DIME/CORPUS-DIMEX.html)
- [DIMEX100-NIÑOS (08h09m)](https://turing.iimas.unam.mx/~luis/DIME/CORPUS-DIMEX.html)
- [GOLEM-UNIVERSUM (00h10m)](https://turing.iimas.unam.mx/~luis/DIME/CORPUS-DIMEX.html)
- [GLISSANDO (6h40m)](https://glissando.labfon.uned.es/es)
- TELE_con_CIENCIA (28h16m) **Unpublished material**
- UNSHAREABLE MATERIAL (118h22m) **Not available for sharing**

The fine-tuning process was performed in November 2022 on the servers of the Language and Voice Lab (https://lvl.ru.is/) at Reykjavík University (Iceland) by Carlos Daniel Hernández Mena.

# Evaluation
```python
import torch
from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2ForCTC

# Load the processor and model.
MODEL_NAME = "carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h"
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_NAME)

# Load the dataset
from datasets import load_dataset, load_metric, Audio
ds = load_dataset("common_voice", "es", split="test")

# Normalize the transcriptions
import re
chars_to_ignore_regex = '[\,\?\.\!\;\:\"\“\%\‘\”\�\(\)\*]'

def remove_special_characters(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    return batch

ds = ds.map(remove_special_characters)

# Downsample to 16kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Process the dataset
def prepare_dataset(batch):
    audio = batch["audio"]
    # Batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

ds = ds.map(prepare_dataset, remove_columns=ds.column_names, num_proc=1)

# Define the evaluation metric
import numpy as np
wer_metric = load_metric("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    # We do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

# Do the evaluation (with batch_size=1)
model = model.to(torch.device("cuda"))

def map_to_result(batch):
    with torch.no_grad():
        input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
        logits = model(input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["sentence"] = processor.decode(batch["labels"], group_tokens=False)
    return batch

results = ds.map(map_to_result, remove_columns=ds.column_names)

# Compute the overall WER now.
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["sentence"])))
```
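The `torch.argmax` plus `batch_decode` step in the script above performs greedy (best-path) CTC decoding: pick the most likely symbol per frame, collapse consecutive repeats, then drop blanks. As a toy illustration of that operation only, here is a self-contained sketch with a hypothetical five-symbol vocabulary (not the model's real vocabulary):

```python
# Toy illustration of greedy (best-path) CTC decoding.
# The five-symbol vocabulary below is hypothetical, not the model's real one.
vocab = ["<pad>", "h", "o", "l", "a"]  # index 0 acts as the CTC blank

def greedy_ctc_decode(frame_logits):
    """frame_logits: one list of scores per frame. Take the argmax per frame,
    collapse repeated symbols, then drop blanks."""
    ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_logits]
    collapsed = [i for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    return "".join(vocab[i] for i in collapsed if i != 0)

# Seven frames whose per-frame argmax is "h h o l l <blank> a":
# repeats collapse, the blank separates the two l's, giving "hola".
frames = [1, 1, 2, 3, 3, 0, 4]
logits = [[1.0 if j == f else 0.0 for j in range(5)] for f in frames]
print(greedy_ctc_decode(logits))  # hola
```

Note how the blank between the two `l` frames is what keeps them from collapsing into one; this is why CTC vocabularies reserve a blank symbol.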

**Test Result**: <unk>
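The WER figures reported in this card are word-level edit distances divided by the number of reference words. As a self-contained sketch of what the `wer` metric computes (a hand-rolled Levenshtein distance for illustration, not the library's implementation):

```python
# Minimal word error rate (WER): word-level Levenshtein distance
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution over three reference words -> WER of 1/3.
print(wer("hola como estas", "hola como estan"))
```

Because insertions also count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.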
# BibTeX entry and citation info

*When publishing results based on these models, please refer to:*
```bibtex
@misc{mena2022xlrs53spanish,
  title={Acoustic Model in Spanish: wav2vec2-large-xlsr-53-spanish-ep5-944h.},
  author={Hernandez Mena, Carlos Daniel},
  year={2022},
  url={https://huggingface.co/carlosdanielhernandezmena/wav2vec2-large-xlsr-53-spanish-ep5-944h},
}
```
# Acknowledgements

The author wants to thank the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html) at the [Facultad de Ingeniería (FI)](https://www.ingenieria.unam.mx/) of the [Universidad Nacional Autónoma de México (UNAM)](https://www.unam.mx/). He also thanks the social service students for all their hard work.

Special thanks to Jón Guðnason, head of the Language and Voice Lab, for providing the computational power that made this model possible. The author also thanks the "Language Technology Programme for Icelandic 2019-2023", which is managed and coordinated by Almannarómur and funded by the Icelandic Ministry of Education, Science and Culture.