xekri committed
Commit c5471f3
1 Parent(s): 7d728c1

Corrects README

Files changed (1):
  1. README.md +132 -104
README.md CHANGED
@@ -1,104 +1,132 @@
- ---
- language:
- - eo
- license: apache-2.0
- tags:
- - automatic-speech-recognition
- - mozilla-foundation/common_voice_13_0
- - generated_from_trainer
- datasets:
- - common_voice_13_0
- metrics:
- - wer
- model-index:
- - name: wav2vec2-common_voice_13_0-eo-10
-   results:
-   - task:
-       name: Automatic Speech Recognition
-       type: automatic-speech-recognition
-     dataset:
-       name: MOZILLA-FOUNDATION/COMMON_VOICE_13_0 - EO
-       type: common_voice_13_0
-       config: eo
-       split: validation
-       args: 'Config: eo, Training split: train, Eval split: validation'
-     metrics:
-     - name: Wer
-       type: wer
-       value: 0.06566915357190017
- ---
-
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # wav2vec2-common_voice_13_0-eo-10
-
- This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the MOZILLA-FOUNDATION/COMMON_VOICE_13_0 - EO dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0454
- - Cer: 0.0118
- - Wer: 0.0657
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 16
- - eval_batch_size: 8
- - seed: 42
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 32
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 5
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Cer | Validation Loss | Wer |
- |:-------------:|:-----:|:-----:|:------:|:---------------:|:------:|
- | 2.9894 | 0.22 | 1000 | 1.0 | 2.9257 | 1.0 |
- | 0.7104 | 0.44 | 2000 | 0.0457 | 0.2129 | 0.2538 |
- | 0.2853 | 0.67 | 3000 | 0.0274 | 0.1109 | 0.1583 |
- | 0.2327 | 0.89 | 4000 | 0.0231 | 0.0909 | 0.1320 |
- | 0.1917 | 1.11 | 5000 | 0.0206 | 0.0775 | 0.1188 |
- | 0.1803 | 1.33 | 6000 | 0.0184 | 0.0698 | 0.1055 |
- | 0.1661 | 1.56 | 7000 | 0.0169 | 0.0645 | 0.0961 |
- | 0.1635 | 1.78 | 8000 | 0.0170 | 0.0639 | 0.0964 |
- | 0.1555 | 2.0 | 9000 | 0.0156 | 0.0592 | 0.0881 |
- | 0.1386 | 2.22 | 10000 | 0.0147 | 0.0559 | 0.0821 |
- | 0.1338 | 2.45 | 11000 | 0.0146 | 0.0548 | 0.0831 |
- | 0.1307 | 2.67 | 12000 | 0.0137 | 0.0529 | 0.0759 |
- | 0.1297 | 2.89 | 13000 | 0.0134 | 0.0504 | 0.0745 |
- | 0.1201 | 3.11 | 14000 | 0.0131 | 0.0499 | 0.0734 |
- | 0.1152 | 3.34 | 15000 | 0.0128 | 0.0484 | 0.0712 |
- | 0.1144 | 3.56 | 16000 | 0.0125 | 0.0477 | 0.0695 |
- | 0.1179 | 3.78 | 17000 | 0.0122 | 0.0468 | 0.0679 |
- | 0.1112 | 4.0 | 18000 | 0.0121 | 0.0468 | 0.0676 |
- | 0.1141 | 4.23 | 19000 | 0.0121 | 0.0462 | 0.0668 |
- | 0.1085 | 4.45 | 20000 | 0.0119 | 0.0458 | 0.0664 |
- | 0.105 | 4.67 | 21000 | 0.0119 | 0.0456 | 0.0660 |
- | 0.1072 | 4.89 | 22000 | 0.0119 | 0.0454 | 0.0658 |
-
-
- ### Framework versions
-
- - Transformers 4.29.2
- - Pytorch 2.0.1+cu117
- - Datasets 2.12.0
- - Tokenizers 0.13.3

+ ---
+ language:
+ - eo
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - mozilla-foundation/common_voice_13_0
+ - generated_from_trainer
+ datasets:
+ - common_voice_13_0
+ metrics:
+ - wer
+ - cer
+ model-index:
+ - name: wav2vec2-common_voice_13_0-eo-10
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: mozilla-foundation/common_voice_13_0
+       type: common_voice_13_0
+       config: eo
+       split: validation
+       args: 'Config: eo, Training split: train, Eval split: validation'
+     metrics:
+     - name: WER
+       type: wer
+       value: 0.0656526475637132
+     - name: CER
+       type: cer
+       value: 0.0118
+ ---
+
+ # wav2vec2-common_voice_13_0-eo-10, an Esperanto speech recognizer
+
+ This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) Esperanto dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.0453
+ - Cer: 0.0118
+ - Wer: 0.0657
+
+ The first 10 examples in the evaluation set:
+
+ | Actual<br>Predicted | CER |
+ |:--------------------|:----|
+ | `la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo`<br>`la orienta parto apud benino kaj niĝerio estis nomita sklafmarbordo` | 0.014925373134328358 |
+ | `en la sekva jaro li ricevis premion`<br>`en la sekva jaro li ricevis premion` | 0.0 |
+ | `ŝi studis historion ĉe la universitato de brita kolumbio`<br>`ŝi studis historion ĉe la universitato de brita kolumbio` | 0.0 |
+ | `larĝaj ŝtupoj kuras al la fasado`<br>`larĝaj ŝtupoj kuras al la fasado` | 0.0 |
+ | `la municipo ĝuas duan epokon de etendo kaj disvolviĝo`<br>`la municipo ĝuas duan eepokon de etendo kaj disvolviĝo` | 0.018867924528301886 |
+ | `li estis ankaŭ katedrestro kaj dekano`<br>`li estis ankaŭ katedristo kaj dekano` | 0.05405405405405406 |
+ | `librovendejo apartenas al la muzeo`<br>`librovendejo apartenas al la muzeo` | 0.0 |
+ | `ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵaro de arbaroj`<br>`ĝi estas kutime malfacile videbla kaj troviĝas en subkreskaĵo de arbaroj` | 0.02702702702702703 |
+ | `unue ili estas ruĝaj poste brunaj`<br>`unue ili estas ruĝaj poste brunaj` | 0.0 |
+ | `la loĝantaro laboras en la proksima ĉefurbo`<br>`la loĝantaro laboras en la proksima ĉefurbo` | 0.0 |
+
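The per-example CER values in the table above are character error rates: the Levenshtein edit distance between prediction and reference, divided by the reference length. A minimal sketch (the `edit_distance` and `cer` helpers below are illustrative, not part of this repo):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, prediction: str) -> float:
    # Character error rate: edits needed, normalized by reference length.
    return edit_distance(reference, prediction) / len(reference)

# First row of the table: one substituted character (v -> f) over 67 characters.
ref = "la orienta parto apud benino kaj niĝerio estis nomita sklavmarbordo"
hyp = "la orienta parto apud benino kaj niĝerio estis nomita sklafmarbordo"
print(cer(ref, hyp))  # ≈ 0.0149, matching the table
```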
+
+ ## Model description
+
+ See [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).
+
+ ## Intended uses & limitations
+
+ Speech recognition for Esperanto. The base model was pretrained and fine-tuned on 16 kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16 kHz.
+
+ The output is all lowercase, with no punctuation.
+
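If your audio is at a different rate, resample it to 16 kHz before running inference. A dependency-free linear-interpolation sketch of the idea (for real audio, prefer a proper resampler such as `torchaudio.functional.resample` or `librosa.resample`):

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Naive linear-interpolation resampler, for illustration only."""
    if src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional position in source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

one_second_8k = [0.0] * 8000                   # 1 s of silence at 8 kHz
print(len(resample_linear(one_second_8k, 8000)))  # 16000 samples at 16 kHz
```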
+
+ ## Training and evaluation data
+
+ The training split was set to `train` and the eval split to `validation`. Some files were filtered out of both splits due to bad data; see [xekri/wav2vec2-common_voice_13_0-eo-3](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-3) for a detailed discussion. In summary, I used `xekri/wav2vec2-common_voice_13_0-eo-3` to detect bad files, then hardcoded those files into the trainer code so they are filtered out.
+
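That filtering step can be sketched as follows; the `BAD_FILES` entries and the `path` field name are illustrative placeholders, not the actual list or schema used in the trainer code:

```python
# Hypothetical hardcoded list of clips flagged as bad by the detector model.
BAD_FILES = {
    "common_voice_eo_12345.mp3",
    "common_voice_eo_67890.mp3",
}

def keep_example(example: dict) -> bool:
    # Predicate suitable for datasets.Dataset.filter: drop known-bad clips.
    return example["path"] not in BAD_FILES

# With Hugging Face datasets this would be: dataset = dataset.filter(keep_example)
rows = [{"path": "common_voice_eo_12345.mp3"}, {"path": "common_voice_eo_00001.mp3"}]
kept = [r for r in rows if keep_example(r)]
print(len(kept))  # 1: the flagged clip is dropped
```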
+
+ ## Training procedure
+
+ I used a modified version of [`run_speech_recognition_ctc.py`](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition) for training. See [`run_speech_recognition_ctc.py`](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10/blob/main/run_speech_recognition_ctc.py) in this repo.
+
+ The parameters to the trainer are in [train.json](https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-10/blob/main/train.json) in this repo.
+
+ The key changes between this training run and `xekri/wav2vec2-common_voice_13_0-eo-3`, aside from the filtering and the use of the full training and validation sets, are:
+
+ * Layer drop probability is 20%
+ * Train for only 5 epochs
+
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 3e-05
+ - train_batch_size: 16
+ - eval_batch_size: 8
+ - seed: 42
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 32
+ - layerdrop: 0.2
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 5
+ - mixed_precision_training: Native AMP
+
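The `total_train_batch_size` above is not a separate setting: it is the per-device batch size multiplied by the gradient accumulation steps. A quick check:

```python
train_batch_size = 16            # per-device batch size
gradient_accumulation_steps = 2  # optimizer steps once every 2 batches
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)    # 32
```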
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Cer | Validation Loss | Wer |
+ |:-------------:|:-----:|:-----:|:------:|:---------------:|:------:|
+ | 2.9894 | 0.22 | 1000 | 1.0 | 2.9257 | 1.0 |
+ | 0.7104 | 0.44 | 2000 | 0.0457 | 0.2129 | 0.2538 |
+ | 0.2853 | 0.67 | 3000 | 0.0274 | 0.1109 | 0.1583 |
+ | 0.2327 | 0.89 | 4000 | 0.0231 | 0.0909 | 0.1320 |
+ | 0.1917 | 1.11 | 5000 | 0.0206 | 0.0775 | 0.1188 |
+ | 0.1803 | 1.33 | 6000 | 0.0184 | 0.0698 | 0.1055 |
+ | 0.1661 | 1.56 | 7000 | 0.0169 | 0.0645 | 0.0961 |
+ | 0.1635 | 1.78 | 8000 | 0.0170 | 0.0639 | 0.0964 |
+ | 0.1555 | 2.0 | 9000 | 0.0156 | 0.0592 | 0.0881 |
+ | 0.1386 | 2.22 | 10000 | 0.0147 | 0.0559 | 0.0821 |
+ | 0.1338 | 2.45 | 11000 | 0.0146 | 0.0548 | 0.0831 |
+ | 0.1307 | 2.67 | 12000 | 0.0137 | 0.0529 | 0.0759 |
+ | 0.1297 | 2.89 | 13000 | 0.0134 | 0.0504 | 0.0745 |
+ | 0.1201 | 3.11 | 14000 | 0.0131 | 0.0499 | 0.0734 |
+ | 0.1152 | 3.34 | 15000 | 0.0128 | 0.0484 | 0.0712 |
+ | 0.1144 | 3.56 | 16000 | 0.0125 | 0.0477 | 0.0695 |
+ | 0.1179 | 3.78 | 17000 | 0.0122 | 0.0468 | 0.0679 |
+ | 0.1112 | 4.0 | 18000 | 0.0121 | 0.0468 | 0.0676 |
+ | 0.1141 | 4.23 | 19000 | 0.0121 | 0.0462 | 0.0668 |
+ | 0.1085 | 4.45 | 20000 | 0.0119 | 0.0458 | 0.0664 |
+ | 0.105 | 4.67 | 21000 | 0.0119 | 0.0456 | 0.0660 |
+ | 0.1072 | 4.89 | 22000 | 0.0119 | 0.0454 | 0.0658 |
+
+
+ ### Framework versions
+
+ - Transformers 4.29.2
+ - Pytorch 2.0.1+cu117
+ - Datasets 2.12.0
+ - Tokenizers 0.13.3