cahya committed
Commit 772a66e
1 Parent(s): d88653c

enabled lm

Files changed (2):
1. README.md +150 -98
2. preprocessor_config.json +1 -0
README.md CHANGED
@@ -1,105 +1,157 @@
  ---
- language:
- - lg
- license: apache-2.0
+ language: lg
+ datasets:
+ - mozilla-foundation/common_voice_7_0
+ metrics:
+ - wer
  tags:
+ - audio
  - automatic-speech-recognition
- - mozilla-foundation/common_voice_7_0
- - generated_from_trainer
- datasets:
+ - speech
  - common_voice
+ - lg
+ - robust-speech-event
+ license: apache-2.0
  model-index:
- - name: ''
-   results: []
+ - name: Wav2Vec2 Luganda by Indonesian-NLP
+   results:
+   - task:
+       name: Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice lg
+       type: common_voice
+       args: lg
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 7.53
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 7
+       type: mozilla-foundation/common_voice_7_0
+       args: lg
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 8.147
+     - name: Test CER
+       type: cer
+       value: 2.802
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- #
-
- This model is a fine-tuned version of [indonesian-nlp/wav2vec2-luganda](https://huggingface.co/indonesian-nlp/wav2vec2-luganda) on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - LG dataset.
- It achieves the following results on the evaluation set:
- - Loss: 8.8279
- - Wer: 1.0123
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 1e-08
- - train_batch_size: 64
- - eval_batch_size: 2
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 256
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 10
- - num_epochs: 10.0
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer |
- |:-------------:|:-----:|:----:|:---------------:|:------:|
- | 8.1684 | 0.25 | 10 | 8.8588 | 1.0125 |
- | 8.1428 | 0.5 | 20 | 8.8569 | 1.0125 |
- | 8.1333 | 0.75 | 30 | 8.8552 | 1.0124 |
- | 8.7873 | 1.03 | 40 | 8.8532 | 1.0124 |
- | 8.1298 | 1.28 | 50 | 8.8516 | 1.0124 |
- | 8.1445 | 1.53 | 60 | 8.8499 | 1.0123 |
- | 8.1635 | 1.78 | 70 | 8.8483 | 1.0124 |
- | 8.7587 | 2.05 | 80 | 8.8468 | 1.0125 |
- | 8.1424 | 2.3 | 90 | 8.8454 | 1.0124 |
- | 8.1318 | 2.55 | 100 | 8.8440 | 1.0124 |
- | 8.1469 | 2.81 | 110 | 8.8428 | 1.0125 |
- | 8.7602 | 3.08 | 120 | 8.8416 | 1.0125 |
- | 8.1584 | 3.33 | 130 | 8.8405 | 1.0126 |
- | 8.142 | 3.58 | 140 | 8.8394 | 1.0126 |
- | 8.1285 | 3.83 | 150 | 8.8384 | 1.0124 |
- | 8.7756 | 4.1 | 160 | 8.8371 | 1.0124 |
- | 8.0991 | 4.35 | 170 | 8.8363 | 1.0125 |
- | 8.1442 | 4.6 | 180 | 8.8354 | 1.0124 |
- | 8.1294 | 4.86 | 190 | 8.8346 | 1.0124 |
- | 8.7276 | 5.13 | 200 | 8.8338 | 1.0125 |
- | 8.1439 | 5.38 | 210 | 8.8329 | 1.0124 |
- | 8.1115 | 5.63 | 220 | 8.8322 | 1.0124 |
- | 8.1501 | 5.88 | 230 | 8.8316 | 1.0125 |
- | 8.7143 | 6.15 | 240 | 8.8308 | 1.0124 |
- | 8.143 | 6.4 | 250 | 8.8302 | 1.0124 |
- | 8.1528 | 6.65 | 260 | 8.8300 | 1.0125 |
- | 8.1293 | 6.91 | 270 | 8.8297 | 1.0124 |
- | 8.7519 | 7.18 | 280 | 8.8293 | 1.0125 |
- | 8.1153 | 7.43 | 290 | 8.8289 | 1.0124 |
- | 8.1292 | 7.68 | 300 | 8.8288 | 1.0124 |
- | 8.0904 | 7.93 | 310 | 8.8284 | 1.0124 |
- | 8.7425 | 8.2 | 320 | 8.8283 | 1.0125 |
- | 8.0963 | 8.45 | 330 | 8.8281 | 1.0124 |
- | 8.1112 | 8.7 | 340 | 8.8281 | 1.0124 |
- | 8.124 | 8.96 | 350 | 8.8281 | 1.0125 |
- | 8.7327 | 9.23 | 360 | 8.8279 | 1.0123 |
- | 8.1261 | 9.48 | 370 | 8.8279 | 1.0126 |
- | 8.1259 | 9.73 | 380 | 8.8279 | 1.0124 |
- | 8.1116 | 9.98 | 390 | 8.8279 | 1.0123 |
-
-
- ### Framework versions
-
- - Transformers 4.17.0.dev0
- - Pytorch 1.10.2+cu102
- - Datasets 1.18.3
- - Tokenizers 0.11.0
+ # Automatic Speech Recognition for Luganda
+
+ This is the model built for the
+ [Mozilla Luganda Automatic Speech Recognition competition](https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition).
+ It is a [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
+ model fine-tuned on the [Luganda Common Voice dataset](https://huggingface.co/datasets/common_voice) version 7.0.
+
+ We also provide a [live demo](https://huggingface.co/spaces/indonesian-nlp/luganda-asr) to test the model.
+
+ When using this model, make sure that your speech input is sampled at 16kHz.
+
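+ If your data is at a different rate, the `datasets` library can also resample on the fly; a minimal sketch of that alternative (the manual `torchaudio` approach is shown in the snippets below):
+
+ ```python
+ from datasets import load_dataset, Audio
+
+ ds = load_dataset("common_voice", "lg", split="test[:2%]")
+ # Decode the audio column at 16 kHz instead of the native 48 kHz
+ ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
+ ```
+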
56
+ ## Usage
57
+ The model can be used directly (without a language model) as follows:
58
+ ```python
59
+ import torch
60
+ import torchaudio
61
+ from datasets import load_dataset
62
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
63
+
64
+ test_dataset = load_dataset("common_voice", "lg", split="test[:2%]")
65
+
66
+ processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
67
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
68
+
69
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
70
+
71
+ # Preprocessing the datasets.
72
+ # We need to read the aduio files as arrays
73
+ def speech_file_to_array_fn(batch):
74
+ if "audio" in batch:
75
+ speech_array = torch.tensor(batch["audio"]["array"])
76
+ else:
77
+ speech_array, sampling_rate = torchaudio.load(batch["path"])
78
+ batch["speech"] = resampler(speech_array).squeeze().numpy()
79
+ return batch
80
+
81
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
82
+ inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
83
+
84
+ with torch.no_grad():
85
+ logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
86
+
87
+ predicted_ids = torch.argmax(logits, dim=-1)
88
+
89
+ print("Prediction:", processor.batch_decode(predicted_ids))
90
+ print("Reference:", test_dataset[:2]["sentence"])
91
+ ```
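+
+ Since this repository also ships a KenLM language model (see `processor_class` in `preprocessor_config.json`), decoding can be LM-boosted with `Wav2Vec2ProcessorWithLM`. A minimal sketch, reusing `test_dataset` from the snippet above and assuming `pyctcdecode` and `kenlm` are installed:
+
+ ```python
+ import torch
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+
+ inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # The LM processor decodes raw logits (beam search + KenLM), not argmax ids
+ print("Prediction:", processor.batch_decode(logits.numpy()).text)
+ ```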
+
+ ## Evaluation
+
+ The model can be evaluated as follows on the Luganda test data of Common Voice.
+
+ ```python
+ import torch
+ import torchaudio
+ from datasets import load_dataset, load_metric
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+ import re
+
+ test_dataset = load_dataset("common_voice", "lg", split="test")
+ wer = load_metric("wer")
+
+ processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-luganda")
+ model.to("cuda")
+
+ chars_to_ignore = [",", "?", ".", "!", "-", ";", ":", "%", "'", '"', "�", "‘", "’"]
+ # re.escape keeps characters such as "-" from being read as a regex range
+ chars_to_ignore_regex = f'[{re.escape("".join(chars_to_ignore))}]'
+
+ resampler = torchaudio.transforms.Resample(48_000, 16_000)
+
+ # Preprocessing the datasets.
+ # We need to read the audio files as arrays
+ def speech_file_to_array_fn(batch):
+     batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
+     if "audio" in batch:
+         # ensure float32 for the resampler and the model
+         speech_array = torch.tensor(batch["audio"]["array"], dtype=torch.float32)
+     else:
+         speech_array, sampling_rate = torchaudio.load(batch["path"])
+     batch["speech"] = resampler(speech_array).squeeze().numpy()
+     return batch
+
+ test_dataset = test_dataset.map(speech_file_to_array_fn)
+
+ # Run batched inference on the GPU, then greedy CTC decoding
+ def evaluate(batch):
+     inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
+
+     with torch.no_grad():
+         logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
+
+     pred_ids = torch.argmax(logits, dim=-1)
+     batch["pred_strings"] = processor.batch_decode(pred_ids)
+     return batch
+
+ result = test_dataset.map(evaluate, batched=True, batch_size=8)
+
+ print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
+ ```
+
+ WER without KenLM: 15.38 %
+
+ WER with KenLM:
+
+ **Test Result**: 7.53 %
+
+ ## Training
+
+ The Common Voice `train`, `validation`, and ... datasets were used for training as well as ... and ... # TODO
+
+ The script used for training can be found [here](https://github.com/indonesian-nlp/luganda-asr).
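+
+ A minimal sketch of loading the training splits with the `datasets` library (only `train` and `validation` are shown, since the remaining splits are still marked TODO above; the gated Common Voice 7 dataset requires an access token):
+
+ ```python
+ from datasets import load_dataset
+
+ # "train+validation" concatenates the two named splits
+ train_data = load_dataset(
+     "mozilla-foundation/common_voice_7_0", "lg",
+     split="train+validation", use_auth_token=True,
+ )
+ ```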
preprocessor_config.json CHANGED
@@ -5,5 +5,6 @@
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
+ "processor_class": "Wav2Vec2ProcessorWithLM",
  "sampling_rate": 16000
  }
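
The added `processor_class` key tells `AutoProcessor` which processor class to instantiate, so the LM-aware processor loads without naming it explicitly. A minimal sketch:

```python
from transformers import AutoProcessor

# Resolves to Wav2Vec2ProcessorWithLM via preprocessor_config.json
processor = AutoProcessor.from_pretrained("indonesian-nlp/wav2vec2-luganda")
```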