xekri committed on
Commit d8c7361
1 Parent(s): 2d3f123

Update model card

Files changed (1)
  1. README.md +114 -16
README.md CHANGED
@@ -13,41 +13,139 @@ model-index:
  results: []
  ---
 
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # wav2vec2-common_voice_13_0-eo-3
-
- This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the MOZILLA-FOUNDATION/COMMON_VOICE_13_0 - EO dataset.
  It achieves the following results on the evaluation set:
  - Loss: 0.2191
  - Cer: 0.0208
  - Wer: 0.0687
 
  ## Model description
 
- More information needed
 
  ## Intended uses & limitations
 
- More information needed
 
  ## Training and evaluation data
 
- More information needed
 
  ## Training procedure
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
- - learning_rate: 5e-06
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
  - gradient_accumulation_steps: 4
  - total_train_batch_size: 32
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 500
  - num_epochs: 100
@@ -96,13 +194,13 @@ The following hyperparameters were used during training:
  | 0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688 |
  | 0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686 |
  | 0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685 |
- | 0.0325 | 85.47 | 40000 | 0.2172 | 0.0209 | 0.0687 |
- | 0.0316 | 87.6 | 41000 | 0.2181 | 0.0208 | 0.0678 |
- | 0.0302 | 89.73 | 42000 | 0.2171 | 0.0208 | 0.0679 |
- | 0.0318 | 91.87 | 43000 | 0.2179 | 0.0211 | 0.0702 |
- | 0.0314 | 94.0 | 44000 | 0.2186 | 0.0208 | 0.0690 |
- | 0.0309 | 96.13 | 45000 | 0.2193 | 0.0210 | 0.0696 |
- | 0.031 | 98.27 | 46000 | 0.2191 | 0.0208 | 0.0686 |
 
 
  ### Framework versions
 
  results: []
  ---
 
+ # wav2vec2-common_voice_13_0-eo-3, an Esperanto speech recognizer
 
+ This model is a fine-tuned version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the [mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) Esperanto dataset.
 
  It achieves the following results on the evaluation set:
+
  - Loss: 0.2191
  - Cer: 0.0208
  - Wer: 0.0687
 
  ## Model description
 
+ See [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53).
 
  ## Intended uses & limitations
 
+ Speech recognition for Esperanto. The base model was pretrained and fine-tuned on speech audio sampled at 16 kHz, so make sure your speech input is also sampled at 16 kHz.
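Common Voice clips are typically distributed at 48 kHz, so a resampling step is usually needed. Below is a minimal sketch using plain NumPy linear interpolation; it is illustrative only, since proper resampling applies an anti-aliasing filter (e.g. `torchaudio`/`librosa` resampling, or casting the dataset column with `datasets.Audio(sampling_rate=16_000)`):

```py
import numpy as np

def resample_to_16k(audio: np.ndarray, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Map a mono waveform onto a 16 kHz time grid by linear interpolation."""
    if orig_sr == target_sr:
        return audio
    n_out = int(round(len(audio) / orig_sr * target_sr))
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio)

# One second of a 440 Hz tone at 48 kHz becomes 16,000 samples at 16 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000)
resampled = resample_to_16k(tone, orig_sr=48_000)
```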
 
  ## Training and evaluation data
 
+ The training split was set to `train[:15000]` while the eval split was set to `validation[:1500]`.
 
  ## Training procedure
 
+ I used [`run_speech_recognition_ctc.py`](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition) with the following `train.json` file passed to it:
+
+ ```json
+ {
+   "dataset_name": "mozilla-foundation/common_voice_13_0",
+   "model_name_or_path": "facebook/wav2vec2-large-xlsr-53",
+   "dataset_config_name": "eo",
+   "output_dir": "./wav2vec2-common_voice_13_0-eo-3",
+   "train_split_name": "train[:15000]",
+   "eval_split_name": "validation[:1500]",
+   "eval_metrics": ["cer", "wer"],
+   "overwrite_output_dir": true,
+   "preprocessing_num_workers": 8,
+   "num_train_epochs": 100,
+   "per_device_train_batch_size": 8,
+   "gradient_accumulation_steps": 4,
+   "gradient_checkpointing": true,
+   "learning_rate": 3e-5,
+   "warmup_steps": 500,
+   "evaluation_strategy": "steps",
+   "text_column_name": "sentence",
+   "length_column_name": "input_length",
+   "save_steps": 1000,
+   "eval_steps": 1000,
+   "layerdrop": 0.1,
+   "save_total_limit": 3,
+   "freeze_feature_encoder": true,
+   "chars_to_ignore": "-!\"'(),.:;=?_`¨«¸»ʼ‑–—‘’“”„…‹›♫?",
+   "chars_to_substitute": {
+     "przy": "pŝe",
+     "byn": "bin",
+     "cx": "ĉ",
+     "sx": "ŝ",
+     "ﬁ": "fi",
+     "ﬂ": "fl",
+     "ǔ": "ŭ",
+     "ñ": "nj",
+     "á": "a",
+     "é": "e",
+     "ü": "ŭ",
+     "y": "j",
+     "qu": "ku"
+   },
+   "fp16": true,
+   "group_by_length": true,
+   "push_to_hub": true,
+   "do_train": true,
+   "do_eval": true
+ }
+ ```
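The script reads such a file through `HfArgumentParser`, which matches each top-level JSON key to a field on its argument dataclasses. A stdlib-only sketch of that mapping (the `TrainConfig` class and `parse_json_config` helper here are illustrative, not part of the script; unknown keys are simply dropped, whereas the real parser distributes them across several dataclasses):

```py
import json
from dataclasses import dataclass, fields

@dataclass
class TrainConfig:
    # A tiny illustrative subset of the real argument dataclasses
    dataset_name: str
    learning_rate: float
    num_train_epochs: int

def parse_json_config(text: str, cls):
    raw = json.loads(text)
    known = {f.name for f in fields(cls)}
    # Keep only the keys this dataclass declares
    return cls(**{k: v for k, v in raw.items() if k in known})

cfg = parse_json_config(
    '{"dataset_name": "mozilla-foundation/common_voice_13_0", '
    '"learning_rate": 3e-5, "num_train_epochs": 100, "do_train": true}',
    TrainConfig,
)
```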
+
+ I went through the dataset to find non-speech characters, and these were placed in `chars_to_ignore`. In addition, there were character sequences that could be transcribed to Esperanto phonemes, and these were placed as a dictionary in `chars_to_substitute`. This required adding such an argument to the program:
+
+ ```py
+ def dict_field(default=None, metadata=None):
+     return field(default_factory=lambda: default, metadata=metadata)
+
+ @dataclass
+ class DataTrainingArguments:
+     ...
+     chars_to_substitute: Optional[Dict[str, str]] = dict_field(
+         default=None,
+         metadata={"help": "A dict of characters to replace."},
+     )
+ ```
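The `default_factory` wrapper is needed because dataclasses reject a mutable value such as a dict as a plain `default`. A self-contained sketch of the same pattern (the `Args` class is illustrative):

```py
from dataclasses import dataclass, field
from typing import Dict, Optional

def dict_field(default=None, metadata=None):
    # dataclasses disallow a dict as `default`, so hand it over via a factory.
    # Note the lambda returns the stored object itself, so instances would
    # share it if a dict default were supplied.
    return field(default_factory=lambda: default, metadata=metadata)

@dataclass
class Args:
    chars_to_substitute: Optional[Dict[str, str]] = dict_field(default=None)

a = Args()
b = Args(chars_to_substitute={"cx": "ĉ"})
```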
+
+ Then I copied `remove_special_characters` to do the actual substitution:
+
+ ```py
+ def remove_special_characters(batch):
+     text = batch[text_column_name]
+     if chars_to_ignore_regex is not None:
+         text = re.sub(chars_to_ignore_regex, "", batch[text_column_name])
+     batch["target_text"] = text.lower() + " "
+     return batch
+
+ def substitute_characters(batch):
+     text: str = batch["target_text"]
+     if data_args.chars_to_substitute is not None:
+         for k, v in data_args.chars_to_substitute.items():
+             # str.replace returns a new string, so rebind the result
+             text = text.replace(k, v)
+     batch["target_text"] = text.lower()
+     return batch
+
+ with training_args.main_process_first(desc="dataset map special characters removal"):
+     raw_datasets = raw_datasets.map(
+         remove_special_characters,
+         remove_columns=[text_column_name],
+         desc="remove special characters from datasets",
+     )
+
+ with training_args.main_process_first(desc="dataset map special characters substitute"):
+     raw_datasets = raw_datasets.map(
+         substitute_characters,
+         desc="substitute special characters in datasets",
+     )
+ ```
+
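Applied to an already-lowercased transcript, the substitution behaves like this (a standalone sketch using a few entries from `chars_to_substitute`):

```py
def substitute(text: str, table: dict) -> str:
    # Apply each replacement in turn; str.replace returns a new string.
    for k, v in table.items():
        text = text.replace(k, v)
    return text.lower()

table = {"cx": "ĉ", "sx": "ŝ", "qu": "ku"}
out = substitute("cxu sxi parolas", table)  # → "ĉu ŝi parolas"
```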
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
+ - learning_rate: 3e-05
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
  - gradient_accumulation_steps: 4
  - total_train_batch_size: 32
  - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - layerdrop: 0.1
  - lr_scheduler_type: linear
  - lr_scheduler_warmup_steps: 500
  - num_epochs: 100
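The total train batch size above is derived rather than set directly; assuming a single-GPU run, it is the per-device batch size times the gradient accumulation steps:

```py
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
n_gpus = 1  # assumption: single-GPU training
total_train_batch_size = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
```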
 
  | 0.0341 | 78.93 | 37000 | 0.0208 | 0.2170 | 0.0688 |
  | 0.032 | 81.07 | 38000 | 0.0209 | 0.2157 | 0.0686 |
  | 0.0318 | 83.33 | 39000 | 0.0209 | 0.2166 | 0.0685 |
+ | 0.0325 | 85.47 | 40000 | 0.0209 | 0.2172 | 0.0687 |
+ | 0.0316 | 87.6 | 41000 | 0.0208 | 0.2181 | 0.0678 |
+ | 0.0302 | 89.73 | 42000 | 0.0208 | 0.2171 | 0.0679 |
+ | 0.0318 | 91.87 | 43000 | 0.0211 | 0.2179 | 0.0702 |
+ | 0.0314 | 94.0 | 44000 | 0.0208 | 0.2186 | 0.0690 |
+ | 0.0309 | 96.13 | 45000 | 0.0210 | 0.2193 | 0.0696 |
+ | 0.031 | 98.27 | 46000 | 0.0208 | 0.2191 | 0.0686 |
 
 
  ### Framework versions