Plim committed on
Commit
e2f6d01
1 Parent(s): 89ae304

Training in progress, step 14000

Files changed (32)
  1. .gitattributes +1 -0
  2. .ipynb_checkpoints/README-checkpoint.md +105 -0
  3. .ipynb_checkpoints/create_lm-checkpoint.ipynb +309 -0
  4. .ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_predictions-checkpoint.txt +0 -0
  5. .ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_targets-checkpoint.txt +0 -0
  6. .ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_predictions-checkpoint.txt +0 -0
  7. .ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_targets-checkpoint.txt +0 -0
  8. .ipynb_checkpoints/mozilla-foundation_common_voice_8_0_fr_test_eval_results-checkpoint.txt +2 -0
  9. .ipynb_checkpoints/preprocessor_config-checkpoint.json +10 -0
  10. .ipynb_checkpoints/run-checkpoint.sh +2 -2
  11. alphabet.json +1 -0
  12. config.json +1 -1
  13. create_lm.ipynb +344 -0
  14. keep_model/pytorch_model.bin +3 -0
  15. langague_model/5gram.bin +3 -0
  16. langague_model/attrs.json +1 -0
  17. langague_model/unigrams.txt +0 -0
  18. pytorch_model.bin +1 -1
  19. run.sh +2 -2
  20. training_args.bin +1 -1
  21. wandb/debug-internal.log +1 -1
  22. wandb/debug.log +1 -1
  23. wandb/latest-run +1 -1
  24. wandb/run-20220206_201634-uhiy9e2t/files/conda-environment.yaml +0 -0
  25. wandb/run-20220206_201634-uhiy9e2t/files/config.yaml +0 -0
  26. wandb/run-20220206_201634-uhiy9e2t/files/output.log +1491 -0
  27. wandb/run-20220206_201634-uhiy9e2t/files/requirements.txt +183 -0
  28. wandb/run-20220206_201634-uhiy9e2t/files/wandb-metadata.json +61 -0
  29. wandb/run-20220206_201634-uhiy9e2t/files/wandb-summary.json +0 -0
  30. wandb/run-20220206_201634-uhiy9e2t/logs/debug-internal.log +0 -0
  31. wandb/run-20220206_201634-uhiy9e2t/logs/debug.log +26 -0
  32. wandb/run-20220206_201634-uhiy9e2t/run-uhiy9e2t.wandb +3 -0
.gitattributes CHANGED
@@ -26,3 +26,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zstandard filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  wandb/run-20220203_170643-2fkfdtzb/run-2fkfdtzb.wandb filter=lfs diff=lfs merge=lfs -text
+ wandb/run-20220206_201634-uhiy9e2t/run-uhiy9e2t.wandb filter=lfs diff=lfs merge=lfs -text
.ipynb_checkpoints/README-checkpoint.md ADDED
@@ -0,0 +1,105 @@
+ ---
+ language:
+ - fr
+ license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - mozilla-foundation/common_voice_8_0
+ - generated_from_trainer
+ - robust-speech-event
+ model-index:
+ - name: XLS-R-1B - French
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Common Voice 8
+       type: mozilla-foundation/common_voice_8_0
+       args: fr
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 18.33
+     - name: Test CER
+       type: cer
+       value: 5.60
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Robust Speech Event - Dev Data
+       type: speech-recognition-community-v2/dev_data
+       args: fr
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 60.25
+     - name: Test CER
+       type: cer
+       value: 15.68
+ ---
+
+ ## Model description
+
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - FR dataset.
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 7.5e-05
+ - train_batch_size: 16
+ - eval_batch_size: 16
+ - seed: 42
+ - gradient_accumulation_steps: 8
+ - total_train_batch_size: 128 (16 per device × 8 accumulation steps)
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 2000
+ - num_epochs: 4.0
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Wer |
+ |:-------------:|:-----:|:-----:|:---------------:|:------:|
+ | 0.9827 | 0.29 | 1000 | inf | 0.2937 |
+ | 1.0203 | 0.57 | 2000 | inf | 0.2711 |
+ | 1.0048 | 0.86 | 3000 | inf | 0.2620 |
+ | 0.9858 | 1.15 | 4000 | inf | 0.2522 |
+ | 0.9709 | 1.43 | 5000 | inf | 0.2365 |
+ | 0.9347 | 1.72 | 6000 | inf | 0.2332 |
+ | 0.9256 | 2.01 | 7000 | inf | 0.2261 |
+ | 0.8936 | 2.29 | 8000 | inf | 0.2203 |
+ | 0.877 | 2.58 | 9000 | inf | 0.2096 |
+ | 0.8393 | 2.87 | 10000 | inf | 0.2017 |
+ | 0.8156 | 3.15 | 11000 | inf | 0.1936 |
+ | 0.8015 | 3.44 | 12000 | inf | 0.1880 |
+ | 0.774 | 3.73 | 13000 | inf | 0.1834 |
+
+ It achieves its best result on the validation set at step 13000:
+ - Wer: 0.1834
+
+ An issue occurred when computing the validation loss, which is why it is reported as `inf` in the table above.
+
+ ### Framework versions
+
+ - Transformers 4.17.0.dev0
+ - Pytorch 1.10.2+cu102
+ - Datasets 1.18.3.dev0
+ - Tokenizers 0.11.0
+
+ ### Evaluation Commands
+
+ 1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:
+
+ ```bash
+ python eval.py --model_id Plim/xls-r-1b-cv_8-fr --dataset mozilla-foundation/common_voice_8_0 --config fr --split test
+ ```
+
+ 2. To evaluate on `speech-recognition-community-v2/dev_data`:
+
+ ```bash
+ python eval.py --model_id Plim/xls-r-1b-cv_8-fr --dataset speech-recognition-community-v2/dev_data --config fr --split validation --chunk_length_s 5.0 --stride_length_s 1.0
+ ```
.ipynb_checkpoints/create_lm-checkpoint.ipynb ADDED
@@ -0,0 +1,309 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "7b5f7142",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import transformers\n",
+ "from datasets import load_dataset\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "4ad6422f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "username = \"Plim\" # change to your username\n",
+ "target_lang = \"fr\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "37b2c1d6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f230feb459c441a9a11e53b867e8914a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/2.60k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a4a8fa35d48f4a6db8072baed6b2389b",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/29.6k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using custom data configuration en-fr-lang1=en,lang2=fr\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading and preparing dataset europarl_bilingual/en-fr (download: 278.07 MiB, generated: 643.66 MiB, post-processed: Unknown size, total: 921.72 MiB) to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950...\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "aa00b5d6dc154449861dddcf9f0d2fc8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/142M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "563096fc78454333b5ae23e87a7e3469",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/140M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "eba05e5151b34505b9a43e383cb6cfe0",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/9.30M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0 examples [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset europarl_bilingual downloaded and prepared to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950. Subsequent calls will reuse this data.\n"
+ ]
+ }
+ ],
+ "source": [
+ "dataset = load_dataset(\"europarl_bilingual\", lang1=\"en\", lang2=target_lang, split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "81259294",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_text(batch):\n",
+ " target_lang = \"fr\"\n",
+ " chars_to_ignore_regex = '[^a-zàâäçéèêëîïôöùûüÿ\\'’ ]'\n",
+ " text = batch[\"translation\"][target_lang]\n",
+ " batch[\"text\"] = re.sub(chars_to_ignore_regex, \"\", text.lower()).replace('’', \"'\")\n",
+ " return batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "2dec7b80",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "00d998de52544f6c8750c53bc0c85d66",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0ex [00:00, ?ex/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset = dataset.map(extract_text, remove_columns=dataset.column_names)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "c6feaf74",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "461a219cdb6d42b2b890ec028c336e7f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset.push_to_hub(f\"{target_lang}_corpora_parliament_processed\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "b0e6ae25",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"text.txt\", \"w\") as file:\n",
+ " file.write(\" \".join(dataset[\"text\"]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "f95596a5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"5gram.arpa\", \"r\") as read_file, open(\"5gram_correct.arpa\", \"w\") as write_file:\n",
+ " has_added_eos = False\n",
+ " for line in read_file:\n",
+ " if not has_added_eos and \"ngram 1=\" in line:\n",
+ " count=line.strip().split(\"=\")[-1]\n",
+ " write_file.write(line.replace(f\"{count}\", f\"{int(count)+1}\"))\n",
+ " elif not has_added_eos and \"<s>\" in line:\n",
+ " write_file.write(line)\n",
+ " write_file.write(line.replace(\"<s>\", \"</s>\"))\n",
+ " has_added_eos = True\n",
+ " else:\n",
+ " write_file.write(line)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "f6489f25",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "file ./config.json not found\n"
+ ]
+ },
+ {
+ "ename": "OSError",
+ "evalue": "Can't load config for './'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './' is the correct path to a directory containing a config.json file",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mOSError\u001b[0m Traceback (most recent call last)",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:585\u001b[0m, in \u001b[0;36mPretrainedConfig._get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 583\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 584\u001b[0m \u001b[38;5;66;03m# Load from URL or cache if already cached\u001b[39;00m\n\u001b[0;32m--> 585\u001b[0m resolved_config_file \u001b[38;5;241m=\u001b[39m \u001b[43mcached_path\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 586\u001b[0m \u001b[43m \u001b[49m\u001b[43mconfig_file\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 587\u001b[0m \u001b[43m \u001b[49m\u001b[43mcache_dir\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcache_dir\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 588\u001b[0m \u001b[43m \u001b[49m\u001b[43mforce_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mforce_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 589\u001b[0m \u001b[43m \u001b[49m\u001b[43mproxies\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mproxies\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 590\u001b[0m \u001b[43m \u001b[49m\u001b[43mresume_download\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mresume_download\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 591\u001b[0m \u001b[43m \u001b[49m\u001b[43mlocal_files_only\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlocal_files_only\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 592\u001b[0m \u001b[43m \u001b[49m\u001b[43muse_auth_token\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muse_auth_token\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 593\u001b[0m \u001b[43m \u001b[49m\u001b[43muser_agent\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43muser_agent\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 594\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 596\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m RepositoryNotFoundError \u001b[38;5;28;01mas\u001b[39;00m err:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/file_utils.py:1861\u001b[0m, in \u001b[0;36mcached_path\u001b[0;34m(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)\u001b[0m\n\u001b[1;32m 1859\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m urlparse(url_or_filename)\u001b[38;5;241m.\u001b[39mscheme \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m 1860\u001b[0m \u001b[38;5;66;03m# File, but it doesn't exist.\u001b[39;00m\n\u001b[0;32m-> 1861\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfile \u001b[39m\u001b[38;5;132;01m{\u001b[39;00murl_or_filename\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m not found\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 1862\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 1863\u001b[0m \u001b[38;5;66;03m# Something unknown\u001b[39;00m\n",
+ "\u001b[0;31mOSError\u001b[0m: file ./config.json not found",
+ "\nDuring handling of the above exception, another exception occurred:\n",
+ "\u001b[0;31mOSError\u001b[0m Traceback (most recent call last)",
+ "Input \u001b[0;32mIn [1]\u001b[0m, in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mtransformers\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m AutoProcessor\n\u001b[0;32m----> 3\u001b[0m processor \u001b[38;5;241m=\u001b[39m \u001b[43mAutoProcessor\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43m./\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/models/auto/processing_auto.py:178\u001b[0m, in \u001b[0;36mAutoProcessor.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 176\u001b[0m \u001b[38;5;66;03m# Otherwise, load config, if it can be loaded.\u001b[39;00m\n\u001b[1;32m 177\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(config, PretrainedConfig):\n\u001b[0;32m--> 178\u001b[0m config \u001b[38;5;241m=\u001b[39m \u001b[43mAutoConfig\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mfrom_pretrained\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 180\u001b[0m model_type \u001b[38;5;241m=\u001b[39m config_class_to_model_type(\u001b[38;5;28mtype\u001b[39m(config)\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m)\n\u001b[1;32m 182\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(config, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mprocessor_class\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mNone\u001b[39;00m) \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py:617\u001b[0m, in \u001b[0;36mAutoConfig.from_pretrained\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 615\u001b[0m kwargs[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mname_or_path\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m pretrained_model_name_or_path\n\u001b[1;32m 616\u001b[0m trust_remote_code \u001b[38;5;241m=\u001b[39m kwargs\u001b[38;5;241m.\u001b[39mpop(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtrust_remote_code\u001b[39m\u001b[38;5;124m\"\u001b[39m, \u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[0;32m--> 617\u001b[0m config_dict, _ \u001b[38;5;241m=\u001b[39m \u001b[43mPretrainedConfig\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mget_config_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 618\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mauto_map\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mAutoConfig\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mauto_map\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m 619\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m trust_remote_code:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:537\u001b[0m, in \u001b[0;36mPretrainedConfig.get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 535\u001b[0m original_kwargs \u001b[38;5;241m=\u001b[39m copy\u001b[38;5;241m.\u001b[39mdeepcopy(kwargs)\n\u001b[1;32m 536\u001b[0m \u001b[38;5;66;03m# Get config dict associated with the base config file\u001b[39;00m\n\u001b[0;32m--> 537\u001b[0m config_dict, kwargs \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mcls\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_get_config_dict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mpretrained_model_name_or_path\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 539\u001b[0m \u001b[38;5;66;03m# That config file may point us toward another config file to use.\u001b[39;00m\n\u001b[1;32m 540\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mconfiguration_files\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m config_dict:\n",
+ "File \u001b[0;32m/opt/conda/lib/python3.8/site-packages/transformers/configuration_utils.py:626\u001b[0m, in \u001b[0;36mPretrainedConfig._get_config_dict\u001b[0;34m(cls, pretrained_model_name_or_path, **kwargs)\u001b[0m\n\u001b[1;32m 624\u001b[0m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m err:\n\u001b[1;32m 625\u001b[0m logger\u001b[38;5;241m.\u001b[39merror(err)\n\u001b[0;32m--> 626\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mEnvironmentError\u001b[39;00m(\n\u001b[1;32m 627\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mCan\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt load config for \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpretrained_model_name_or_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m. If you were trying to load it from \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 628\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mhttps://huggingface.co/models\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m, make sure you don\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt have a local directory with the same name. \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 629\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mOtherwise, make sure \u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mpretrained_model_name_or_path\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m is the correct path to a directory \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 630\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mcontaining a \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mconfiguration_file\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m file\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 631\u001b[0m )\n\u001b[1;32m 633\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 634\u001b[0m \u001b[38;5;66;03m# Load config dict\u001b[39;00m\n\u001b[1;32m 635\u001b[0m config_dict \u001b[38;5;241m=\u001b[39m \u001b[38;5;28mcls\u001b[39m\u001b[38;5;241m.\u001b[39m_dict_from_json_file(resolved_config_file)\n",
+ "\u001b[0;31mOSError\u001b[0m: Can't load config for './'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure './' is the correct path to a directory containing a config.json file"
+ ]
+ }
+ ],
+ "source": [
+ "from transformers import AutoProcessor\n",
+ "\n",
+ "processor = AutoProcessor.from_pretrained(\"./\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ab24f645",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
.ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_predictions-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_mozilla-foundation_common_voice_8_0_fr_test_targets-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_predictions-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/log_speech-recognition-community-v2_dev_data_fr_validation_targets-checkpoint.txt ADDED
The diff for this file is too large to render. See raw diff
 
.ipynb_checkpoints/mozilla-foundation_common_voice_8_0_fr_test_eval_results-checkpoint.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.18333515105245937
+ CER: 0.05606368028384753
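These figures match the scores in the README. For reference, WER and CER of this form can be recomputed from the prediction/target logs committed alongside this file; a sketch using the `jiwer` library (the log file names are assumed to match the eval logs in this commit):

```python
import jiwer

# Hypothetical re-computation from the committed eval log files (names assumed).
with open("log_mozilla-foundation_common_voice_8_0_fr_test_predictions.txt") as f:
    predictions = f.read().splitlines()
with open("log_mozilla-foundation_common_voice_8_0_fr_test_targets.txt") as f:
    targets = f.read().splitlines()

print("WER:", jiwer.wer(targets, predictions))  # word error rate
print("CER:", jiwer.cer(targets, predictions))  # character error rate
```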
.ipynb_checkpoints/preprocessor_config-checkpoint.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "do_normalize": true,
+ "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+ "feature_size": 1,
+ "padding_side": "right",
+ "padding_value": 0,
+ "processor_class": "Wav2Vec2ProcessorWithLM",
+ "return_attention_mask": true,
+ "sampling_rate": 16000
+ }
.ipynb_checkpoints/run-checkpoint.sh CHANGED
@@ -20,8 +20,8 @@ python run_speech_recognition_ctc.py \
  --mask_feature_prob="0.25" \
  --mask_time_length="10" \
  --mask_time_prob="0.75" \
- --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
- --num_train_epochs="4.0" \
+ --model_name_or_path="./checkpoint-13000" \
+ --num_train_epochs="6.0" \
  --output_dir="./" \
  --overwrite_output_dir \
  --per_device_train_batch_size="16" \
alphabet.json ADDED
@@ -0,0 +1 @@
+ {"labels": [" ", "'", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "\u00e0", "\u00e2", "\u00e4", "\u00e7", "\u00e8", "\u00e9", "\u00ea", "\u00eb", "\u00ee", "\u00ef", "\u00f4", "\u00f6", "\u00f9", "\u00fb", "\u00fc", "\u00ff", "\u2047", ""], "is_bpe": false}
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "facebook/wav2vec2-xls-r-1b",
+ "_name_or_path": "./checkpoint-13000",
  "activation_dropout": 0.1,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
create_lm.ipynb ADDED
@@ -0,0 +1,344 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "d354f2ac",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import transformers\n",
+ "from datasets import load_dataset\n",
+ "import re"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "fe33d468",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "username = \"Plim\" # change to your username\n",
+ "target_lang = \"fr\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f84ba325",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "f230feb459c441a9a11e53b867e8914a",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/2.60k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a4a8fa35d48f4a6db8072baed6b2389b",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/29.6k [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Using custom data configuration en-fr-lang1=en,lang2=fr\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Downloading and preparing dataset europarl_bilingual/en-fr (download: 278.07 MiB, generated: 643.66 MiB, post-processed: Unknown size, total: 921.72 MiB) to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950...\n"
+ ]
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "aa00b5d6dc154449861dddcf9f0d2fc8",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/142M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "563096fc78454333b5ae23e87a7e3469",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/140M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "eba05e5151b34505b9a43e383cb6cfe0",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Downloading: 0%| | 0.00/9.30M [00:00<?, ?B/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0 examples [00:00, ? examples/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Dataset europarl_bilingual downloaded and prepared to /workspace/.cache/huggingface/datasets/europarl_bilingual/en-fr-lang1=en,lang2=fr/8.0.0/2ab0200e7729616bfd4a4df6bfb29b31746ceb5a59f8c75c02ca35e1ebead950. Subsequent calls will reuse this data.\n"
+ ]
+ }
+ ],
+ "source": [
+ "dataset = load_dataset(\"europarl_bilingual\", lang1=\"en\", lang2=target_lang, split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "c26261e9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_text(batch):\n",
+ " target_lang = \"fr\"\n",
+ " chars_to_ignore_regex = '[^a-zàâäçéèêëîïôöùûüÿ\\'’ ]'\n",
+ " text = batch[\"translation\"][target_lang]\n",
+ " batch[\"text\"] = re.sub(chars_to_ignore_regex, \"\", text.lower()).replace('’', \"'\")\n",
+ " return batch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "5434c0b7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "00d998de52544f6c8750c53bc0c85d66",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "0ex [00:00, ?ex/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset = dataset.map(extract_text, remove_columns=dataset.column_names)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "e1c780b8",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "461a219cdb6d42b2b890ec028c336e7f",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "Pushing dataset shards to the dataset hub: 0%| | 0/1 [00:00<?, ?it/s]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "dataset.push_to_hub(f\"{target_lang}_corpora_parliament_processed\", split=\"train\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "41c0ab30",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"text.txt\", \"w\") as file:\n",
+ " file.write(\" \".join(dataset[\"text\"]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "4d6bfb67",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "with open(\"language_model/5gram.arpa\", \"r\") as read_file, open(\"language_model/5gram_correct.arpa\", \"w\") as write_file:\n",
+ " has_added_eos = False\n",
+ " for line in read_file:\n",
+ " if not has_added_eos and \"ngram 1=\" in line:\n",
+ " count=line.strip().split(\"=\")[-1]\n",
+ " write_file.write(line.replace(f\"{count}\", f\"{int(count)+1}\"))\n",
+ " elif not has_added_eos and \"<s>\" in line:\n",
+ " write_file.write(line)\n",
+ " write_file.write(line.replace(\"<s>\", \"</s>\"))\n",
+ " has_added_eos = True\n",
+ " else:\n",
+ " write_file.write(line)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "3407085c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import AutoProcessor\n",
+ "\n",
+ "processor = AutoProcessor.from_pretrained(\"./\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "5a60df92",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vocab_dict = processor.tokenizer.get_vocab()\n",
+ "sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "cd1a94ea",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Loading the LM will be faster if you build a binary file.\n",
+ "Reading /workspace/xls-r-1b-cv_8-fr/language_model/5gram_correct.arpa\n",
+ "----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100\n",
+ "****************************************************************************************************\n"
+ ]
+ }
+ ],
+ "source": [
+ "from pyctcdecode import build_ctcdecoder\n",
+ "\n",
+ "decoder = build_ctcdecoder(\n",
+ " labels=list(sorted_vocab_dict.keys()),\n",
+ " kenlm_model_path=\"language_model/5gram_correct.arpa\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "e627079a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import Wav2Vec2ProcessorWithLM\n",
+ "\n",
+ "processor_with_lm = Wav2Vec2ProcessorWithLM(\n",
+ " feature_extractor=processor.feature_extractor,\n",
+ " tokenizer=processor.tokenizer,\n",
+ " decoder=decoder\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "bc665f62",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "processor_with_lm.save_pretrained(\"Plim/xls-r-1b-cv_8-fr\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7bcbb30b",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.8"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
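The notebook builds and saves the LM-boosted processor but stops short of using it; a hedged decoding sketch (a 16 kHz float array `audio` is assumed to exist):

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

# Sketch only: load the published checkpoint and decode with the 5-gram LM.
processor = Wav2Vec2ProcessorWithLM.from_pretrained("Plim/xls-r-1b-cv_8-fr")
model = Wav2Vec2ForCTC.from_pretrained("Plim/xls-r-1b-cv_8-fr")

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs a pyctcdecode beam search over the CTC logits
print(processor.batch_decode(logits.numpy()).text)
```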
keep_model/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a7ac9a4075231a9b1f2ef054fe1161fdf7235b6c7bd018f7505d44da3332960
+ size 3850548401
langague_model/5gram.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:726c0eaeadf24aa621faaddc6640ddc431a65f45e1b16ff0e6a9af565facd09f
+ size 2075344331
langague_model/attrs.json ADDED
@@ -0,0 +1 @@
+ {"alpha": 0.5, "beta": 1.5, "unk_score_offset": -10.0, "score_boundary": true}
langague_model/unigrams.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4a7ac9a4075231a9b1f2ef054fe1161fdf7235b6c7bd018f7505d44da3332960
+ oid sha256:581f4c8322c68fc308f22b68839669bf48755ac5706016bfe60c0234ba26e947
  size 3850548401
run.sh CHANGED
@@ -20,8 +20,8 @@ python run_speech_recognition_ctc.py \
  --mask_feature_prob="0.25" \
  --mask_time_length="10" \
  --mask_time_prob="0.75" \
- --model_name_or_path="facebook/wav2vec2-xls-r-1b" \
- --num_train_epochs="4.0" \
+ --model_name_or_path="./checkpoint-13000" \
+ --num_train_epochs="6.0" \
  --output_dir="./" \
  --overwrite_output_dir \
  --per_device_train_batch_size="16" \
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:af168053e52049b75214ffe031549b7a5ed7c0e774b3f727dc7f6c7d61dd0f9c
+ oid sha256:e48a427c7cb614f5fbcb1a4088c5c4abce496c5c8b09e2e526d99a2586ab2194
  size 2991
wandb/debug-internal.log CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb/logs/debug-internal.log
+ run-20220206_201634-uhiy9e2t/logs/debug-internal.log
wandb/debug.log CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb/logs/debug.log
+ run-20220206_201634-uhiy9e2t/logs/debug.log
wandb/latest-run CHANGED
@@ -1 +1 @@
- run-20220203_170643-2fkfdtzb
+ run-20220206_201634-uhiy9e2t
wandb/run-20220206_201634-uhiy9e2t/files/conda-environment.yaml ADDED
File without changes
wandb/run-20220206_201634-uhiy9e2t/files/config.yaml ADDED
The diff for this file is too large to render. See raw diff
 
wandb/run-20220206_201634-uhiy9e2t/files/output.log ADDED
@@ -0,0 +1,1491 @@