zohirjonsharipov committed on
Commit
6895c84
1 Parent(s): e9f9ac0

Upload 16 files

README.md CHANGED
@@ -1,3 +1,126 @@
  ---
+ language:
+ - uz
  license: apache-2.0
+ tags:
+ - automatic-speech-recognition
+ - generated_from_trainer
+ - hf-asr-leaderboard
+ - mozilla-foundation/common_voice_8_0
+ - robust-speech-event
+ datasets:
+ - mozilla-foundation/common_voice_8_0
+ base_model: facebook/wav2vec2-xls-r-300m
+ model-index:
+ - name: XLS-R-300M Uzbek CV8
+   results:
+   - task:
+       type: automatic-speech-recognition
+       name: Automatic Speech Recognition
+     dataset:
+       name: Common Voice 8
+       type: mozilla-foundation/common_voice_8_0
+       args: uz
+     metrics:
+     - type: wer
+       value: 15.065
+       name: Test WER (with LM)
+     - type: cer
+       value: 3.077
+       name: Test CER (with LM)
+     - type: wer
+       value: 32.88
+       name: Test WER (no LM)
+     - type: cer
+       value: 6.53
+       name: Test CER (no LM)
  ---
+
+ # XLS-R-300M Uzbek CV8
+
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the MOZILLA-FOUNDATION/COMMON_VOICE_8_0 - UZ dataset.
+ It achieves the following results on the validation set:
+ - Loss: 0.3063
+ - Wer: 0.3852
+ - Cer: 0.0777
+
+ ## Model description
+
+ For a description of the model architecture, see [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m).
+
+ The model vocabulary consists of the [Modern Latin alphabet for Uzbek](https://en.wikipedia.org/wiki/Uzbek_alphabet), with punctuation removed.
+ Note that the characters <‘> and <’> do not count as punctuation, as <‘> modifies \<o\> and \<g\>, and <’> indicates the glottal stop or a long vowel.
+
+ The decoder uses a KenLM language model built on Common Voice text.
+
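As a rough usage sketch (not an official snippet from this commit): the checkpoint can be loaded with `AutoModelForCTC`, and because `preprocessor_config.json` sets `processor_class` to `Wav2Vec2ProcessorWithLM`, `AutoProcessor` should return the LM-boosted processor. The repository id and the input file below are placeholders.

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

REPO_ID = "<this-model-repo-id>"  # placeholder -- replace with the actual Hub id

processor = AutoProcessor.from_pretrained(REPO_ID)  # Wav2Vec2ProcessorWithLM
model = AutoModelForCTC.from_pretrained(REPO_ID)

# Load a mono recording ("audio.wav" is a hypothetical file) and resample to 16 kHz.
waveform, sr = torchaudio.load("audio.wav")
target_sr = processor.feature_extractor.sampling_rate  # 16000
if sr != target_sr:
    waveform = torchaudio.functional.resample(waveform, sr, target_sr)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=target_sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# With the LM processor, batch_decode runs pyctcdecode beam search over the KenLM model.
print(processor.batch_decode(logits.numpy()).text[0])
```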
+ ## Intended uses & limitations
+
+ This model is expected to be of some utility for low-fidelity use cases such as:
+ - Draft video captions
+ - Indexing of recorded broadcasts
+
+ The model is not reliable enough to use as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of any of the contributors to the Common Voice dataset or of any other speakers.
+
+ ## Training and evaluation data
+
+ 50% of the official Common Voice `train` split was used as training data, and 50% of the official `dev` split was used as validation data. The full `test` split was used for the final evaluation of the model without LM, while the model with LM was evaluated on only 500 examples from the `test` split.
+
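For reference, a minimal sketch of loading the same splits with 🤗 Datasets; the slice expressions mirror `run.sh` in this commit (`train[:50%]`, `validation[50%:]`), and Common Voice 8 is gated, so an authentication token is required.

```python
from datasets import load_dataset

# Splits as used in run.sh below.
cv = "mozilla-foundation/common_voice_8_0"
train = load_dataset(cv, "uz", split="train[:50%]", use_auth_token=True)
valid = load_dataset(cv, "uz", split="validation[50%:]", use_auth_token=True)
test = load_dataset(cv, "uz", split="test", use_auth_token=True)
```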
+ The KenLM language model was compiled from the target sentences of the `train` + `other` dataset splits.
+
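The exact recipe used to build that language model is not part of this commit. The following is only a sketch of how such a decoder is typically assembled with `kenlm` and `pyctcdecode` (both listed in `requirements.txt`); the corpus dump, the 5-gram order, and the local paths are assumptions.

```python
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import AutoFeatureExtractor, AutoTokenizer, Wav2Vec2ProcessorWithLM

# 1. Dump the target sentences of the train + other splits to a plain-text corpus.
corpus = load_dataset("mozilla-foundation/common_voice_8_0", "uz",
                      split="train+other", use_auth_token=True)
with open("corpus.txt", "w") as f:
    for sentence in corpus["sentence"]:
        f.write(sentence.lower() + "\n")

# 2. Train an n-gram model with the kenlm binaries (run outside Python), e.g.:
#    lmplz -o 5 < corpus.txt > uz.arpa        # 5-gram order is an assumption

# 3. Wrap the CTC vocabulary and the n-gram model into a beam-search decoder and
#    save a Wav2Vec2ProcessorWithLM next to the fine-tuned model.
model_dir = "xls-r-uzbek-cv8"  # assumed local output directory, as in run.sh
tokenizer = AutoTokenizer.from_pretrained(model_dir)
vocab = [tok for tok, _ in sorted(tokenizer.get_vocab().items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels=vocab, kenlm_model_path="uz.arpa")
processor = Wav2Vec2ProcessorWithLM(
    feature_extractor=AutoFeatureExtractor.from_pretrained(model_dir),
    tokenizer=tokenizer,
    decoder=decoder,
)
processor.save_pretrained(model_dir)
```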
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 3e-05
+ - train_batch_size: 32
+ - eval_batch_size: 8
+ - seed: 42
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 128
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 100.0
+ - mixed_precision_training: Native AMP
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
+ |:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
+ | 3.1401 | 3.25 | 500 | 3.1146 | 1.0 | 1.0 |
+ | 2.7484 | 6.49 | 1000 | 2.2842 | 1.0065 | 0.7069 |
+ | 1.0899 | 9.74 | 1500 | 0.5414 | 0.6125 | 0.1351 |
+ | 0.9465 | 12.99 | 2000 | 0.4566 | 0.5635 | 0.1223 |
+ | 0.8771 | 16.23 | 2500 | 0.4212 | 0.5366 | 0.1161 |
+ | 0.8346 | 19.48 | 3000 | 0.3994 | 0.5144 | 0.1102 |
+ | 0.8127 | 22.73 | 3500 | 0.3819 | 0.4944 | 0.1051 |
+ | 0.7833 | 25.97 | 4000 | 0.3705 | 0.4798 | 0.1011 |
+ | 0.7603 | 29.22 | 4500 | 0.3661 | 0.4704 | 0.0992 |
+ | 0.7424 | 32.47 | 5000 | 0.3529 | 0.4577 | 0.0957 |
+ | 0.7251 | 35.71 | 5500 | 0.3410 | 0.4473 | 0.0928 |
+ | 0.7106 | 38.96 | 6000 | 0.3401 | 0.4428 | 0.0919 |
+ | 0.7027 | 42.21 | 6500 | 0.3355 | 0.4353 | 0.0905 |
+ | 0.6927 | 45.45 | 7000 | 0.3308 | 0.4296 | 0.0885 |
+ | 0.6828 | 48.7 | 7500 | 0.3246 | 0.4204 | 0.0863 |
+ | 0.6706 | 51.95 | 8000 | 0.3250 | 0.4233 | 0.0868 |
+ | 0.6629 | 55.19 | 8500 | 0.3264 | 0.4159 | 0.0849 |
+ | 0.6556 | 58.44 | 9000 | 0.3213 | 0.4100 | 0.0835 |
+ | 0.6484 | 61.69 | 9500 | 0.3182 | 0.4124 | 0.0837 |
+ | 0.6407 | 64.93 | 10000 | 0.3171 | 0.4050 | 0.0825 |
+ | 0.6375 | 68.18 | 10500 | 0.3150 | 0.4039 | 0.0822 |
+ | 0.6363 | 71.43 | 11000 | 0.3129 | 0.3991 | 0.0810 |
+ | 0.6307 | 74.67 | 11500 | 0.3114 | 0.3986 | 0.0807 |
+ | 0.6232 | 77.92 | 12000 | 0.3103 | 0.3895 | 0.0790 |
+ | 0.6216 | 81.17 | 12500 | 0.3086 | 0.3891 | 0.0790 |
+ | 0.6174 | 84.41 | 13000 | 0.3082 | 0.3881 | 0.0785 |
+ | 0.6196 | 87.66 | 13500 | 0.3059 | 0.3875 | 0.0782 |
+ | 0.6174 | 90.91 | 14000 | 0.3084 | 0.3862 | 0.0780 |
+ | 0.6169 | 94.16 | 14500 | 0.3070 | 0.3860 | 0.0779 |
+ | 0.6166 | 97.4 | 15000 | 0.3066 | 0.3855 | 0.0778 |
+
+
+ ### Framework versions
+
+ - Transformers 4.16.2
+ - Pytorch 1.10.2+cu102
+ - Datasets 1.18.3
+ - Tokenizers 0.11.0
mozilla-foundation_common_voice_8_0_uz_test[_500]_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.150650789255054
+ CER: 0.03076592082616179
mozilla-foundation_common_voice_8_0_uz_test_eval_results.txt ADDED
@@ -0,0 +1,2 @@
+ WER: 0.3288115957647439
+ CER: 0.06534626547372732
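These two files report the same scores as the model-index metrics in `README.md`, expressed as fractions rather than percentages (0.1507 / 0.0308 with LM on the 500-example subset, 0.3288 / 0.0653 without LM on the full test split). A minimal sketch of computing WER/CER with `jiwer` (listed in `requirements.txt`); the strings below are placeholders, not real model output.

```python
import jiwer

reference = "salom dunyo"   # placeholder reference transcript
hypothesis = "salom dunya"  # placeholder model output

print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```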
preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0,
+   "processor_class": "Wav2Vec2ProcessorWithLM",
+   "return_attention_mask": true,
+   "sampling_rate": 16000
+ }
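A small sketch of what this configuration implies for inputs, assuming the repository has been cloned to the current directory (a Hub model id would work the same way): audio must be resampled to 16 kHz, and attention masks are returned.

```python
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(".")   # "." is an assumed local clone
print(feature_extractor.sampling_rate)          # 16000 -- resample inputs to this rate
print(feature_extractor.return_attention_mask)  # True
```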
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fa3ebb6bf2207974a80f51af9ea129771bc599abed9a0f00f4b93e0bf058b624
+ size 1262058993
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ unidecode
+ tensorboard
+ torch
+ torchaudio
+ jiwer
+ soundfile
+ transformers
+ datasets
+ pyctcdecode
+ https://github.com/kpu/kenlm/archive/master.zip
run.sh ADDED
@@ -0,0 +1,38 @@
+ python ~/xls-r-uzbek-cv8/run_speech_recognition_ctc.py \
+ --dataset_name="mozilla-foundation/common_voice_8_0" \
+ --model_name_or_path="facebook/wav2vec2-xls-r-300m" \
+ --dataset_config_name="uz" \
+ --output_dir="~/xls-r-uzbek-cv8" \
+ --train_split_name="train[:50%]" \
+ --eval_split_name="validation[50%:]" \
+ --overwrite_output_dir \
+ --num_train_epochs="100" \
+ --per_device_train_batch_size="32" \
+ --per_device_eval_batch_size="8" \
+ --gradient_accumulation_steps="4" \
+ --learning_rate="3e-5" \
+ --warmup_steps="500" \
+ --length_column_name="input_length" \
+ --evaluation_strategy="steps" \
+ --text_column_name="sentence" \
+ --eval_metrics wer cer \
+ --save_steps="500" \
+ --eval_steps="500" \
+ --logging_steps="100" \
+ --min_duration_in_seconds="0.2" \
+ --layerdrop="0.01" \
+ --activation_dropout="0.1" \
+ --save_total_limit="3" \
+ --freeze_feature_encoder \
+ --feat_proj_dropout="0.05" \
+ --mask_time_prob="0.50" \
+ --mask_time_length="10" \
+ --mask_feature_prob="0.15" \
+ --mask_feature_length="64" \
+ --gradient_checkpointing \
+ --use_auth_token \
+ --fp16 \
+ --group_by_length \
+ --do_train --do_eval \
+ --push_to_hub
+ # --chars_to_ignore \ # default to all punct
run_speech_recognition_ctc.py ADDED
@@ -0,0 +1,760 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+ # Copyright 2021 The HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+
16
+ """ Fine-tuning a 🤗 Transformers CTC model for automatic speech recognition"""
17
+
18
+ import functools
19
+ import json
20
+ import logging
21
+ import os
22
+ import re
23
+ import string
24
+ import sys
25
+ import unidecode
26
+ import warnings
27
+ from dataclasses import dataclass, field
28
+ from typing import Dict, List, Optional, Union
29
+
30
+ import datasets
31
+ import numpy as np
32
+ import torch
33
+ from datasets import DatasetDict, load_dataset, load_metric
34
+
35
+ import transformers
36
+ from transformers import (
37
+ AutoConfig,
38
+ AutoFeatureExtractor,
39
+ AutoModelForCTC,
40
+ AutoProcessor,
41
+ AutoTokenizer,
42
+ HfArgumentParser,
43
+ Trainer,
44
+ TrainingArguments,
45
+ Wav2Vec2Processor,
46
+ set_seed,
47
+ )
48
+ from transformers.trainer_utils import get_last_checkpoint, is_main_process
49
+ from transformers.utils import check_min_version
50
+ from transformers.utils.versions import require_version
51
+
52
+
53
+ # Will error if the minimal version of Transformers is not installed. Remove at your own risks.
54
+ check_min_version("4.16.0.dev0")
55
+
56
+ require_version("datasets>=1.13.3", "To fix: pip install -r examples/pytorch/text-classification/requirements.txt")
57
+
58
+
59
+ logger = logging.getLogger(__name__)
60
+
61
+
62
+ def list_field(default=None, metadata=None):
63
+ return field(default_factory=lambda: default, metadata=metadata)
64
+
65
+
66
+ @dataclass
67
+ class ModelArguments:
68
+ """
69
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
70
+ """
71
+
72
+ model_name_or_path: str = field(
73
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
74
+ )
75
+ tokenizer_name_or_path: Optional[str] = field(
76
+ default=None,
77
+ metadata={"help": "Path to pretrained tokenizer or tokenizer identifier from huggingface.co/models"},
78
+ )
79
+ cache_dir: Optional[str] = field(
80
+ default=None,
81
+ metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
82
+ )
83
+ freeze_feature_encoder: bool = field(
84
+ default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
85
+ )
86
+ attention_dropout: float = field(
87
+ default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
88
+ )
89
+ activation_dropout: float = field(
90
+ default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
91
+ )
92
+ feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
93
+ hidden_dropout: float = field(
94
+ default=0.0,
95
+ metadata={
96
+ "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
97
+ },
98
+ )
99
+ final_dropout: float = field(
100
+ default=0.0,
101
+ metadata={"help": "The dropout probability for the final projection layer."},
102
+ )
103
+ mask_time_prob: float = field(
104
+ default=0.05,
105
+ metadata={
106
+ "help": "Probability of each feature vector along the time axis to be chosen as the start of the vector"
107
+ "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature"
108
+ "vectors will be masked along the time axis."
109
+ },
110
+ )
111
+ mask_time_length: int = field(
112
+ default=10,
113
+ metadata={"help": "Length of vector span to mask along the time axis."},
114
+ )
115
+ mask_feature_prob: float = field(
116
+ default=0.0,
117
+ metadata={
118
+ "help": "Probability of each feature vector along the feature axis to be chosen as the start of the vector"
119
+ "span to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature bins will be masked along the time axis."
120
+ },
121
+ )
122
+ mask_feature_length: int = field(
123
+ default=10,
124
+ metadata={"help": "Length of vector span to mask along the feature axis."},
125
+ )
126
+ layerdrop: float = field(default=0.0, metadata={"help": "The LayerDrop probability."})
127
+ ctc_loss_reduction: Optional[str] = field(
128
+ default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."}
129
+ )
130
+
131
+
132
+ @dataclass
133
+ class DataTrainingArguments:
134
+ """
135
+ Arguments pertaining to what data we are going to input our model for training and eval.
136
+
137
+ Using `HfArgumentParser` we can turn this class
138
+ into argparse arguments to be able to specify them on
139
+ the command line.
140
+ """
141
+
142
+ dataset_name: str = field(
143
+ metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
144
+ )
145
+ dataset_config_name: str = field(
146
+ default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
147
+ )
148
+ train_split_name: str = field(
149
+ default="train",
150
+ metadata={
151
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
152
+ },
153
+ )
154
+ eval_split_name: str = field(
155
+ default="validation",
156
+ metadata={
157
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'validation'"
158
+ },
159
+ )
160
+ audio_column_name: str = field(
161
+ default="audio",
162
+ metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
163
+ )
164
+ text_column_name: str = field(
165
+ default="text",
166
+ metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"},
167
+ )
168
+ overwrite_cache: bool = field(
169
+ default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
170
+ )
171
+ preprocessing_num_workers: Optional[int] = field(
172
+ default=None,
173
+ metadata={"help": "The number of processes to use for the preprocessing."},
174
+ )
175
+ max_train_samples: Optional[int] = field(
176
+ default=None,
177
+ metadata={
178
+ "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
179
+ "value if set."
180
+ },
181
+ )
182
+ max_eval_samples: Optional[int] = field(
183
+ default=None,
184
+ metadata={
185
+ "help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
186
+ "value if set."
187
+ },
188
+ )
189
+ chars_to_ignore: Optional[List[str]] = list_field(
190
+ default=None,
191
+ metadata={"help": "A list of characters to remove from the transcripts."},
192
+ )
193
+ eval_metrics: List[str] = list_field(
194
+ default=["wer"],
195
+ metadata={"help": "A list of metrics the model should be evaluated on. E.g. `'wer cer'`"},
196
+ )
197
+ max_duration_in_seconds: float = field(
198
+ default=20.0,
199
+ metadata={
200
+ "help": "Filter audio files that are longer than `max_duration_in_seconds` seconds to 'max_duration_in_seconds`"
201
+ },
202
+ )
203
+ min_duration_in_seconds: float = field(
204
+ default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}
205
+ )
206
+ preprocessing_only: bool = field(
207
+ default=False,
208
+ metadata={
209
+ "help": "Whether to only do data preprocessing and skip training. "
210
+ "This is especially useful when data preprocessing errors out in distributed training due to timeout. "
211
+ "In this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` "
212
+ "so that the cached datasets can consequently be loaded in distributed training"
213
+ },
214
+ )
215
+ use_auth_token: bool = field(
216
+ default=False,
217
+ metadata={
218
+ "help": "If :obj:`True`, will use the token generated when running"
219
+ ":obj:`transformers-cli login` as HTTP bearer authorization for remote files."
220
+ },
221
+ )
222
+ unk_token: str = field(
223
+ default="[UNK]",
224
+ metadata={"help": "The unk token for the tokenizer"},
225
+ )
226
+ pad_token: str = field(
227
+ default="[PAD]",
228
+ metadata={"help": "The padding token for the tokenizer"},
229
+ )
230
+ word_delimiter_token: str = field(
231
+ default="|",
232
+ metadata={"help": "The word delimiter token for the tokenizer"},
233
+ )
234
+ phoneme_language: Optional[str] = field(
235
+ default=None,
236
+ metadata={
237
+ "help": "The target language that should be used be"
238
+ " passed to the tokenizer for tokenization. Note that"
239
+ " this is only relevant if the model classifies the"
240
+ " input audio to a sequence of phoneme sequences."
241
+ },
242
+ )
243
+
244
+
245
+ @dataclass
246
+ class DataCollatorCTCWithPadding:
247
+ """
248
+ Data collator that will dynamically pad the inputs received.
249
+ Args:
250
+ processor (:class:`~transformers.AutoProcessor`)
251
+ The processor used for proccessing the data.
252
+ padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
253
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
254
+ among:
255
+ * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
256
+ sequence if provided).
257
+ * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
258
+ maximum acceptable input length for the model if that argument is not provided.
259
+ * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
260
+ different lengths).
261
+ max_length (:obj:`int`, `optional`):
262
+ Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
263
+ max_length_labels (:obj:`int`, `optional`):
264
+ Maximum length of the ``labels`` returned list and optionally padding length (see above).
265
+ pad_to_multiple_of (:obj:`int`, `optional`):
266
+ If set will pad the sequence to a multiple of the provided value.
267
+ This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
268
+ 7.5 (Volta).
269
+ """
270
+
271
+ processor: AutoProcessor
272
+ padding: Union[bool, str] = "longest"
273
+ pad_to_multiple_of: Optional[int] = None
274
+ pad_to_multiple_of_labels: Optional[int] = None
275
+
276
+ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
277
+ # split inputs and labels since they have to be of different lenghts and need
278
+ # different padding methods
279
+ input_features = [{"input_values": feature["input_values"]} for feature in features]
280
+ label_features = [{"input_ids": feature["labels"]} for feature in features]
281
+
282
+ batch = self.processor.pad(
283
+ input_features,
284
+ padding=self.padding,
285
+ pad_to_multiple_of=self.pad_to_multiple_of,
286
+ return_tensors="pt",
287
+ )
288
+
289
+ with self.processor.as_target_processor():
290
+ labels_batch = self.processor.pad(
291
+ label_features,
292
+ padding=self.padding,
293
+ pad_to_multiple_of=self.pad_to_multiple_of_labels,
294
+ return_tensors="pt",
295
+ )
296
+
297
+ # replace padding with -100 to ignore loss correctly
298
+ labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
299
+
300
+ batch["labels"] = labels
301
+
302
+ return batch
303
+
304
+
305
+ def create_vocabulary_from_data(
306
+ datasets: DatasetDict,
307
+ word_delimiter_token: Optional[str] = None,
308
+ unk_token: Optional[str] = None,
309
+ pad_token: Optional[str] = None,
310
+ ):
311
+ # Given training and test labels create vocabulary
312
+ def extract_all_chars(batch):
313
+ all_text = " ".join(batch["target_text"])
314
+ vocab = list(set(all_text))
315
+ return {"vocab": [vocab], "all_text": [all_text]}
316
+
317
+ vocabs = datasets.map(
318
+ extract_all_chars,
319
+ batched=True,
320
+ batch_size=-1,
321
+ keep_in_memory=True,
322
+ remove_columns=datasets["train"].column_names,
323
+ )
324
+
325
+ # take union of all unique characters in each dataset
326
+ vocab_set = functools.reduce(
327
+ lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
328
+ )
329
+
330
+ vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}
331
+
332
+ # replace white space with delimiter token
333
+ if word_delimiter_token is not None:
334
+ vocab_dict[word_delimiter_token] = vocab_dict[" "]
335
+ del vocab_dict[" "]
336
+
337
+ # add unk and pad token
338
+ if unk_token is not None:
339
+ vocab_dict[unk_token] = len(vocab_dict)
340
+
341
+ if pad_token is not None:
342
+ vocab_dict[pad_token] = len(vocab_dict)
343
+
344
+ return vocab_dict
345
+
346
+
347
+ def main():
348
+ # See all possible arguments in src/transformers/training_args.py
349
+ # or by passing the --help flag to this script.
350
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
351
+
352
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
353
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
354
+ # If we pass only one argument to the script and it's the path to a json file,
355
+ # let's parse it to get our arguments.
356
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
357
+ else:
358
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
359
+
360
+ # Detecting last checkpoint.
361
+ last_checkpoint = None
362
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
363
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
364
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
365
+ raise ValueError(
366
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
367
+ "Use --overwrite_output_dir to overcome."
368
+ )
369
+ elif last_checkpoint is not None:
370
+ logger.info(
371
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
372
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
373
+ )
374
+
375
+ # Setup logging
376
+ logging.basicConfig(
377
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
378
+ datefmt="%m/%d/%Y %H:%M:%S",
379
+ handlers=[logging.StreamHandler(sys.stdout)],
380
+ )
381
+ logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
382
+
383
+ # Log on each process the small summary:
384
+ logger.warning(
385
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
386
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
387
+ )
388
+ # Set the verbosity to info of the Transformers logger (on main process only):
389
+ if is_main_process(training_args.local_rank):
390
+ transformers.utils.logging.set_verbosity_info()
391
+ logger.info("Training/evaluation parameters %s", training_args)
392
+
393
+ # Set seed before initializing model.
394
+ set_seed(training_args.seed)
395
+
396
+ # 1. First, let's load the dataset
397
+ raw_datasets = DatasetDict()
398
+
399
+ if training_args.do_train:
400
+ raw_datasets["train"] = load_dataset(
401
+ data_args.dataset_name,
402
+ data_args.dataset_config_name,
403
+ split=data_args.train_split_name,
404
+ use_auth_token=data_args.use_auth_token,
405
+ )
406
+
407
+ if data_args.audio_column_name not in raw_datasets["train"].column_names:
408
+ raise ValueError(
409
+ f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'. "
410
+ "Make sure to set `--audio_column_name` to the correct audio column - one of "
411
+ f"{', '.join(raw_datasets['train'].column_names)}."
412
+ )
413
+
414
+ if data_args.text_column_name not in raw_datasets["train"].column_names:
415
+ raise ValueError(
416
+ f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. "
417
+ "Make sure to set `--text_column_name` to the correct text column - one of "
418
+ f"{', '.join(raw_datasets['train'].column_names)}."
419
+ )
420
+
421
+ if data_args.max_train_samples is not None:
422
+ raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples))
423
+
424
+ if training_args.do_eval:
425
+ raw_datasets["eval"] = load_dataset(
426
+ data_args.dataset_name,
427
+ data_args.dataset_config_name,
428
+ split=data_args.eval_split_name,
429
+ use_auth_token=data_args.use_auth_token,
430
+ )
431
+
432
+ if data_args.max_eval_samples is not None:
433
+ raw_datasets["eval"] = raw_datasets["eval"].select(range(data_args.max_eval_samples))
434
+
435
+ # 2. We remove some special characters from the datasets
436
+ # that make training complicated and do not help in transcribing the speech
437
+ # E.g. characters, such as `,` and `.` do not really have an acoustic characteristic
438
+ # that could be easily picked up by the model
439
+ if data_args.chars_to_ignore is None:
440
+ chars_to_ignore_regex = f'[{re.escape(string.punctuation)}]'
441
+ else:
442
+ chars_to_ignore_regex = f'[{"".join(data_args.chars_to_ignore)}]'
443
+ print("chars_to_ignore", chars_to_ignore_regex)
444
+ text_column_name = data_args.text_column_name
445
+
446
+ def remove_special_characters(batch):
447
+ if chars_to_ignore_regex is not None:
448
+ batch["target_text"] = re.sub(
449
+ chars_to_ignore_regex,
450
+ "",
451
+ re.sub("['`´]", "’", # elsewhere probably meant as glottal stop
452
+ re.sub("([og])['`´]", "\g<1>‘", # after o/g indicate modified char
453
+ unidecode.unidecode(batch[text_column_name]).lower()
454
+ )
455
+ )
456
+ ) + " "
457
+ else:
458
+ batch["target_text"] = batch[text_column_name].lower() + " "
459
+ return batch
460
+
461
+ with training_args.main_process_first(desc="dataset map special characters removal"):
462
+ raw_datasets = raw_datasets.map(
463
+ remove_special_characters,
464
+ remove_columns=[text_column_name],
465
+ desc="remove special characters from datasets",
466
+ )
467
+
468
+ num_workers = data_args.preprocessing_num_workers
469
+
470
+ def is_transcript_in_length_range(text):
471
+ return 3 < len(text) < 200
472
+
473
+ raw_datasets = raw_datasets.filter(
474
+ is_transcript_in_length_range,
475
+ num_proc=num_workers,
476
+ input_columns=["target_text"],
477
+ )
478
+
479
+ # save special tokens for tokenizer
480
+ word_delimiter_token = data_args.word_delimiter_token
481
+ unk_token = data_args.unk_token
482
+ pad_token = data_args.pad_token
483
+
484
+ # 3. Next, let's load the config as we might need it to create
485
+ # the tokenizer
486
+ # load config
487
+ config = AutoConfig.from_pretrained(
488
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
489
+ )
490
+
491
+ # 4. Next, if no tokenizer file is defined,
492
+ # we create the vocabulary of the model by extracting all unique characters from
493
+ # the training and evaluation datasets
494
+ # We need to make sure that only first rank saves vocabulary
495
+ # make sure all processes wait until vocab is created
496
+ tokenizer_name_or_path = model_args.tokenizer_name_or_path
497
+ tokenizer_kwargs = {}
498
+ if tokenizer_name_or_path is None:
499
+ # save vocab in training output dir
500
+ tokenizer_name_or_path = training_args.output_dir
501
+
502
+ vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")
503
+
504
+ with training_args.main_process_first():
505
+ if training_args.overwrite_output_dir and os.path.isfile(vocab_file):
506
+ os.remove(vocab_file)
507
+
508
+ with training_args.main_process_first(desc="dataset map vocabulary creation"):
509
+ if not os.path.isfile(vocab_file):
510
+ os.makedirs(tokenizer_name_or_path, exist_ok=True)
511
+ vocab_dict = create_vocabulary_from_data(
512
+ raw_datasets,
513
+ word_delimiter_token=word_delimiter_token,
514
+ unk_token=unk_token,
515
+ pad_token=pad_token,
516
+ )
517
+
518
+ # save vocab dict to be loaded into tokenizer
519
+ with open(vocab_file, "w") as file:
520
+ json.dump(vocab_dict, file)
521
+
522
+ # if tokenizer has just been created
523
+ # it is defined by `tokenizer_class` if present in config else by `model_type`
524
+ tokenizer_kwargs = {
525
+ "config": config if config.tokenizer_class is not None else None,
526
+ "tokenizer_type": config.model_type if config.tokenizer_class is None else None,
527
+ "unk_token": unk_token,
528
+ "pad_token": pad_token,
529
+ "word_delimiter_token": word_delimiter_token,
530
+ }
531
+
532
+ # 5. Now we can instantiate the feature extractor, tokenizer and model
533
+ # Note for distributed training, the .from_pretrained methods guarantee that only
534
+ # one local process can concurrently download model & vocab.
535
+
536
+ # load feature_extractor and tokenizer
537
+ tokenizer = AutoTokenizer.from_pretrained(
538
+ tokenizer_name_or_path,
539
+ use_auth_token=data_args.use_auth_token,
540
+ **tokenizer_kwargs,
541
+ )
542
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
543
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
544
+ )
545
+
546
+ # adapt config
547
+ config.update(
548
+ {
549
+ "feat_proj_dropout": model_args.feat_proj_dropout,
550
+ "attention_dropout": model_args.attention_dropout,
551
+ "hidden_dropout": model_args.hidden_dropout,
552
+ "final_dropout": model_args.final_dropout,
553
+ "mask_time_prob": model_args.mask_time_prob,
554
+ "mask_time_length": model_args.mask_time_length,
555
+ "mask_feature_prob": model_args.mask_feature_prob,
556
+ "mask_feature_length": model_args.mask_feature_length,
557
+ "gradient_checkpointing": training_args.gradient_checkpointing,
558
+ "layerdrop": model_args.layerdrop,
559
+ "ctc_loss_reduction": model_args.ctc_loss_reduction,
560
+ "ctc_zero_infinity": True,
561
+ "pad_token_id": tokenizer.pad_token_id,
562
+ "vocab_size": len(tokenizer),
563
+ "activation_dropout": model_args.activation_dropout,
564
+ }
565
+ )
566
+
567
+ # create model
568
+ model = AutoModelForCTC.from_pretrained(
569
+ model_args.model_name_or_path,
570
+ cache_dir=model_args.cache_dir,
571
+ config=config,
572
+ use_auth_token=data_args.use_auth_token,
573
+ )
574
+
575
+ # freeze encoder
576
+ if model_args.freeze_feature_encoder:
577
+ model.freeze_feature_encoder()
578
+
579
+ # 6. Now we preprocess the datasets including loading the audio, resampling and normalization
580
+ # Thankfully, `datasets` takes care of automatically loading and resampling the audio,
581
+ # so that we just need to set the correct target sampling rate and normalize the input
582
+ # via the `feature_extractor`
583
+
584
+ # make sure that dataset decodes audio with correct sampling rate
585
+ dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate
586
+ if dataset_sampling_rate != feature_extractor.sampling_rate:
587
+ raw_datasets = raw_datasets.cast_column(
588
+ data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
589
+ )
590
+
591
+ # derive max & min input length for sample rate & max duration
592
+ max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate
593
+ min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate
594
+ audio_column_name = data_args.audio_column_name
595
+
596
+ # `phoneme_language` is only relevant if the model is fine-tuned on phoneme classification
597
+ phoneme_language = data_args.phoneme_language
598
+
599
+ # Preprocessing the datasets.
600
+ # We need to read the audio files as arrays and tokenize the targets.
601
+ def prepare_dataset(batch):
602
+ # load audio
603
+ sample = batch[audio_column_name]
604
+
605
+ inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
606
+ batch["input_values"] = inputs.input_values[0]
607
+ batch["input_length"] = len(batch["input_values"])
608
+
609
+ # encode targets
610
+ additional_kwargs = {}
611
+ if phoneme_language is not None:
612
+ additional_kwargs["phonemizer_lang"] = phoneme_language
613
+
614
+ batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids
615
+ return batch
616
+
617
+ with training_args.main_process_first(desc="dataset map preprocessing"):
618
+ vectorized_datasets = raw_datasets.map(
619
+ prepare_dataset,
620
+ remove_columns=next(iter(raw_datasets.values())).column_names,
621
+ num_proc=num_workers,
622
+ desc="preprocess datasets",
623
+ )
624
+
625
+ def is_audio_in_length_range(length):
626
+ return length > min_input_length and length < max_input_length
627
+
628
+ # filter data that is shorter than min_input_length
629
+ vectorized_datasets = vectorized_datasets.filter(
630
+ is_audio_in_length_range,
631
+ num_proc=num_workers,
632
+ input_columns=["input_length"],
633
+ )
634
+
635
+ # 7. Next, we can prepare the training.
636
+ # Let's use word error rate (WER) as our evaluation metric,
637
+ # instantiate a data collator and the trainer
638
+
639
+ # Define evaluation metrics during training, *i.e.* word error rate, character error rate
640
+ eval_metrics = {metric: load_metric(metric) for metric in data_args.eval_metrics}
641
+
642
+ # for large datasets it is advised to run the preprocessing on a
643
+ # single machine first with ``args.preprocessing_only`` since there will mostly likely
644
+ # be a timeout when running the script in distributed mode.
645
+ # In a second step ``args.preprocessing_only`` can then be set to `False` to load the
646
+ # cached dataset
647
+ if data_args.preprocessing_only:
648
+ logger.info(f"Data preprocessing finished. Files cached at {vectorized_datasets.cache_files}")
649
+ return
650
+
651
+ def compute_metrics(pred):
652
+ pred_logits = pred.predictions
653
+ pred_ids = np.argmax(pred_logits, axis=-1)
654
+
655
+ pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id
656
+
657
+ pred_str = tokenizer.batch_decode(pred_ids)
658
+ # we do not want to group tokens when computing the metrics
659
+ label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)
660
+
661
+ metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}
662
+
663
+ return metrics
664
+
665
+ # Now save everything to be able to create a single processor later
666
+ if is_main_process(training_args.local_rank):
667
+ # save feature extractor, tokenizer and config
668
+ feature_extractor.save_pretrained(training_args.output_dir)
669
+ tokenizer.save_pretrained(training_args.output_dir)
670
+ config.save_pretrained(training_args.output_dir)
671
+
672
+ try:
673
+ processor = AutoProcessor.from_pretrained(training_args.output_dir)
674
+ except (OSError, KeyError):
675
+ warnings.warn(
676
+ "Loading a processor from a feature extractor config that does not"
677
+ " include a `processor_class` attribute is deprecated and will be removed in v5. Please add the following "
678
+ " attribute to your `preprocessor_config.json` file to suppress this warning: "
679
+ " `'processor_class': 'Wav2Vec2Processor'`",
680
+ FutureWarning,
681
+ )
682
+ processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir)
683
+
684
+ # Instantiate custom data collator
685
+ data_collator = DataCollatorCTCWithPadding(processor=processor)
686
+
687
+ # Initialize Trainer
688
+ trainer = Trainer(
689
+ model=model,
690
+ data_collator=data_collator,
691
+ args=training_args,
692
+ compute_metrics=compute_metrics,
693
+ train_dataset=vectorized_datasets["train"] if training_args.do_train else None,
694
+ eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None,
695
+ tokenizer=feature_extractor,
696
+ )
697
+
698
+ # 8. Finally, we can start training
699
+
700
+ # Training
701
+ if training_args.do_train:
702
+
703
+ # use last checkpoint if exist
704
+ if last_checkpoint is not None:
705
+ checkpoint = last_checkpoint
706
+ elif os.path.isdir(model_args.model_name_or_path):
707
+ checkpoint = model_args.model_name_or_path
708
+ else:
709
+ checkpoint = None
710
+
711
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
712
+ trainer.save_model()
713
+
714
+ metrics = train_result.metrics
715
+ max_train_samples = (
716
+ data_args.max_train_samples
717
+ if data_args.max_train_samples is not None
718
+ else len(vectorized_datasets["train"])
719
+ )
720
+ metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"]))
721
+
722
+ trainer.log_metrics("train", metrics)
723
+ trainer.save_metrics("train", metrics)
724
+ trainer.save_state()
725
+
726
+ # Evaluation
727
+ results = {}
728
+ if training_args.do_eval:
729
+ logger.info("*** Evaluate ***")
730
+ metrics = trainer.evaluate()
731
+ max_eval_samples = (
732
+ data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"])
733
+ )
734
+ metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"]))
735
+
736
+ trainer.log_metrics("eval", metrics)
737
+ trainer.save_metrics("eval", metrics)
738
+
739
+ # Write model card and (optionally) push to hub
740
+ config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na"
741
+ kwargs = {
742
+ "finetuned_from": model_args.model_name_or_path,
743
+ "tasks": "speech-recognition",
744
+ "tags": ["automatic-speech-recognition", data_args.dataset_name],
745
+ "dataset_args": f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split: {data_args.eval_split_name}",
746
+ "dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}",
747
+ }
748
+ if "common_voice" in data_args.dataset_name:
749
+ kwargs["language"] = config_name
750
+
751
+ if training_args.push_to_hub:
752
+ trainer.push_to_hub(**kwargs)
753
+ else:
754
+ trainer.create_model_card(**kwargs)
755
+
756
+ return results
757
+
758
+
759
+ if __name__ == "__main__":
760
+ main()
special_tokens_map.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "additional_special_tokens": [
+     {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     },
+     {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false
+     }
+   ],
+   "bos_token": "<s>",
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "[PAD]",
+   "unk_token": "[UNK]"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "added_tokens_decoder": {
+     "29": {
+       "content": "[UNK]",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": false
+     },
+     "30": {
+       "content": "[PAD]",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": true,
+       "single_word": false,
+       "special": false
+     },
+     "31": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<s>",
+     "</s>"
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "do_lower_case": false,
+   "eos_token": "</s>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "processor_class": "Wav2Vec2ProcessorWithLM",
+   "replace_word_delimiter_char": " ",
+   "target_lang": null,
+   "tokenizer_class": "Wav2Vec2CTCTokenizer",
+   "unk_token": "[UNK]",
+   "word_delimiter_token": "|"
+ }
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "epoch": 100.0,
+   "train_loss": 0.9527478711016767,
+   "train_runtime": 101833.8069,
+   "train_samples": 19726,
+   "train_samples_per_second": 19.371,
+   "train_steps_per_second": 0.151
+ }
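A quick back-of-the-envelope check (not part of the original logs): 19,726 training samples at an effective batch size of 128 give about 154 optimizer steps per epoch, i.e. roughly 15,400 steps over 100 epochs, which matches the `global_step` recorded in `trainer_state.json` below and is consistent with `train_steps_per_second * train_runtime` ≈ 15,377.

```python
# Rough consistency check for train_results.json (assumed arithmetic, not logged output).
train_samples = 19726
effective_batch = 32 * 4                            # per-device batch x gradient accumulation
steps_per_epoch = train_samples // effective_batch  # 154
print(steps_per_epoch * 100)                        # ~15400 total optimizer steps
print(round(0.151 * 101833.8069))                   # 15377, steps/sec x runtime
```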
trainer_state.json ADDED
@@ -0,0 +1,1249 @@
1
+ {
2
+ "best_metric": null,
3
+ "best_model_checkpoint": null,
4
+ "epoch": 99.99837925445705,
5
+ "global_step": 15400,
6
+ "is_hyper_param_search": false,
7
+ "is_local_process_zero": true,
8
+ "is_world_process_zero": true,
9
+ "log_history": [
10
+ {
11
+ "epoch": 0.65,
12
+ "learning_rate": 5.940000000000001e-06,
13
+ "loss": 11.9235,
14
+ "step": 100
15
+ },
16
+ {
17
+ "epoch": 1.3,
18
+ "learning_rate": 1.1940000000000001e-05,
19
+ "loss": 5.1849,
20
+ "step": 200
21
+ },
22
+ {
23
+ "epoch": 1.95,
24
+ "learning_rate": 1.794e-05,
25
+ "loss": 3.7405,
26
+ "step": 300
27
+ },
28
+ {
29
+ "epoch": 2.6,
30
+ "learning_rate": 2.394e-05,
31
+ "loss": 3.3052,
32
+ "step": 400
33
+ },
34
+ {
35
+ "epoch": 3.25,
36
+ "learning_rate": 2.994e-05,
37
+ "loss": 3.1401,
38
+ "step": 500
39
+ },
40
+ {
41
+ "epoch": 3.25,
42
+ "eval_cer": 1.0,
43
+ "eval_loss": 3.114637851715088,
44
+ "eval_runtime": 219.3333,
45
+ "eval_samples_per_second": 24.734,
46
+ "eval_steps_per_second": 3.096,
47
+ "eval_wer": 1.0,
48
+ "step": 500
49
+ },
50
+ {
51
+ "epoch": 3.89,
52
+ "learning_rate": 2.98006711409396e-05,
53
+ "loss": 3.0799,
54
+ "step": 600
55
+ },
56
+ {
57
+ "epoch": 4.54,
58
+ "learning_rate": 2.9599328859060405e-05,
59
+ "loss": 3.0534,
60
+ "step": 700
61
+ },
62
+ {
63
+ "epoch": 5.19,
64
+ "learning_rate": 2.9397986577181207e-05,
65
+ "loss": 2.994,
66
+ "step": 800
67
+ },
68
+ {
69
+ "epoch": 5.84,
70
+ "learning_rate": 2.9196644295302013e-05,
71
+ "loss": 2.9428,
72
+ "step": 900
73
+ },
74
+ {
75
+ "epoch": 6.49,
76
+ "learning_rate": 2.8995302013422818e-05,
77
+ "loss": 2.7484,
78
+ "step": 1000
79
+ },
80
+ {
81
+ "epoch": 6.49,
82
+ "eval_cer": 0.706927100586246,
83
+ "eval_loss": 2.284235954284668,
84
+ "eval_runtime": 222.6226,
85
+ "eval_samples_per_second": 24.369,
86
+ "eval_steps_per_second": 3.05,
87
+ "eval_wer": 1.0065119583234667,
88
+ "step": 1000
89
+ },
90
+ {
91
+ "epoch": 7.14,
92
+ "learning_rate": 2.8793959731543624e-05,
93
+ "loss": 2.0772,
94
+ "step": 1100
95
+ },
96
+ {
97
+ "epoch": 7.79,
98
+ "learning_rate": 2.859261744966443e-05,
99
+ "loss": 1.4967,
100
+ "step": 1200
101
+ },
102
+ {
103
+ "epoch": 8.44,
104
+ "learning_rate": 2.8391275167785235e-05,
105
+ "loss": 1.2731,
106
+ "step": 1300
107
+ },
108
+ {
109
+ "epoch": 9.09,
110
+ "learning_rate": 2.818993288590604e-05,
111
+ "loss": 1.1742,
112
+ "step": 1400
113
+ },
114
+ {
115
+ "epoch": 9.74,
116
+ "learning_rate": 2.7988590604026846e-05,
117
+ "loss": 1.0899,
118
+ "step": 1500
119
+ },
120
+ {
121
+ "epoch": 9.74,
122
+ "eval_cer": 0.13506382454221202,
123
+ "eval_loss": 0.5414122343063354,
124
+ "eval_runtime": 217.7702,
125
+ "eval_samples_per_second": 24.912,
126
+ "eval_steps_per_second": 3.118,
127
+ "eval_wer": 0.6124792801326071,
128
+ "step": 1500
129
+ },
130
+ {
131
+ "epoch": 10.39,
132
+ "learning_rate": 2.778724832214765e-05,
133
+ "loss": 1.0544,
134
+ "step": 1600
135
+ },
136
+ {
137
+ "epoch": 11.04,
138
+ "learning_rate": 2.7585906040268457e-05,
139
+ "loss": 1.0284,
140
+ "step": 1700
141
+ },
142
+ {
143
+ "epoch": 11.69,
144
+ "learning_rate": 2.7384563758389263e-05,
145
+ "loss": 0.9865,
146
+ "step": 1800
147
+ },
148
+ {
149
+ "epoch": 12.34,
150
+ "learning_rate": 2.7183221476510065e-05,
151
+ "loss": 0.9705,
152
+ "step": 1900
153
+ },
154
+ {
155
+ "epoch": 12.99,
156
+ "learning_rate": 2.698187919463087e-05,
157
+ "loss": 0.9465,
158
+ "step": 2000
159
+ },
160
+ {
161
+ "epoch": 12.99,
162
+ "eval_cer": 0.12229891609980818,
163
+ "eval_loss": 0.4565887749195099,
164
+ "eval_runtime": 219.4838,
165
+ "eval_samples_per_second": 24.717,
166
+ "eval_steps_per_second": 3.094,
167
+ "eval_wer": 0.5634619938432394,
168
+ "step": 2000
169
+ },
170
+ {
171
+ "epoch": 13.64,
172
+ "learning_rate": 2.6780536912751676e-05,
173
+ "loss": 0.9395,
174
+ "step": 2100
175
+ },
176
+ {
177
+ "epoch": 14.29,
178
+ "learning_rate": 2.6579194630872482e-05,
179
+ "loss": 0.9216,
180
+ "step": 2200
181
+ },
182
+ {
183
+ "epoch": 14.93,
184
+ "learning_rate": 2.6377852348993287e-05,
185
+ "loss": 0.9011,
186
+ "step": 2300
187
+ },
188
+ {
189
+ "epoch": 15.58,
190
+ "learning_rate": 2.6176510067114093e-05,
191
+ "loss": 0.9011,
192
+ "step": 2400
193
+ },
194
+ {
195
+ "epoch": 16.23,
196
+ "learning_rate": 2.59751677852349e-05,
197
+ "loss": 0.8771,
198
+ "step": 2500
199
+ },
200
+ {
201
+ "epoch": 16.23,
202
+ "eval_cer": 0.11608208983968754,
203
+ "eval_loss": 0.42124298214912415,
204
+ "eval_runtime": 218.6532,
205
+ "eval_samples_per_second": 24.811,
206
+ "eval_steps_per_second": 3.105,
207
+ "eval_wer": 0.5365557660430973,
208
+ "step": 2500
209
+ },
210
+ {
211
+ "epoch": 16.88,
212
+ "learning_rate": 2.5773825503355704e-05,
213
+ "loss": 0.8719,
214
+ "step": 2600
215
+ },
216
+ {
217
+ "epoch": 17.53,
218
+ "learning_rate": 2.557248322147651e-05,
219
+ "loss": 0.8616,
220
+ "step": 2700
221
+ },
222
+ {
223
+ "epoch": 18.18,
224
+ "learning_rate": 2.5371140939597315e-05,
225
+ "loss": 0.8573,
226
+ "step": 2800
227
+ },
228
+ {
229
+ "epoch": 18.83,
230
+ "learning_rate": 2.5169798657718124e-05,
231
+ "loss": 0.8467,
232
+ "step": 2900
233
+ },
234
+ {
235
+ "epoch": 19.48,
236
+ "learning_rate": 2.496845637583893e-05,
237
+ "loss": 0.8346,
238
+ "step": 3000
239
+ },
240
+ {
241
+ "epoch": 19.48,
242
+ "eval_cer": 0.11022733400611667,
243
+ "eval_loss": 0.3994467258453369,
244
+ "eval_runtime": 213.3023,
245
+ "eval_samples_per_second": 25.433,
246
+ "eval_steps_per_second": 3.183,
247
+ "eval_wer": 0.5143855079327492,
248
+ "step": 3000
249
+ },
250
+ {
251
+ "epoch": 20.13,
252
+ "learning_rate": 2.4767114093959732e-05,
253
+ "loss": 0.8386,
254
+ "step": 3100
255
+ },
256
+ {
257
+ "epoch": 20.78,
258
+ "learning_rate": 2.4565771812080538e-05,
259
+ "loss": 0.8283,
260
+ "step": 3200
261
+ },
262
+ {
263
+ "epoch": 21.43,
264
+ "learning_rate": 2.4364429530201343e-05,
265
+ "loss": 0.8198,
266
+ "step": 3300
267
+ },
268
+ {
269
+ "epoch": 22.08,
270
+ "learning_rate": 2.416308724832215e-05,
271
+ "loss": 0.8164,
272
+ "step": 3400
273
+ },
274
+ {
275
+ "epoch": 22.73,
276
+ "learning_rate": 2.3961744966442955e-05,
277
+ "loss": 0.8127,
278
+ "step": 3500
279
+ },
280
+ {
281
+ "epoch": 22.73,
282
+ "eval_cer": 0.10510827446479058,
283
+ "eval_loss": 0.3818908929824829,
284
+ "eval_runtime": 218.4137,
285
+ "eval_samples_per_second": 24.838,
286
+ "eval_steps_per_second": 3.109,
287
+ "eval_wer": 0.49437603599336966,
288
+ "step": 3500
289
+ },
290
+ {
291
+ "epoch": 23.38,
292
+ "learning_rate": 2.376040268456376e-05,
293
+ "loss": 0.8081,
294
+ "step": 3600
295
+ },
296
+ {
297
+ "epoch": 24.03,
298
+ "learning_rate": 2.3559060402684566e-05,
299
+ "loss": 0.7988,
300
+ "step": 3700
301
+ },
302
+ {
303
+ "epoch": 24.67,
304
+ "learning_rate": 2.335771812080537e-05,
305
+ "loss": 0.7903,
306
+ "step": 3800
307
+ },
308
+ {
309
+ "epoch": 25.32,
310
+ "learning_rate": 2.3156375838926177e-05,
311
+ "loss": 0.788,
312
+ "step": 3900
313
+ },
314
+ {
315
+ "epoch": 25.97,
316
+ "learning_rate": 2.2955033557046982e-05,
317
+ "loss": 0.7833,
318
+ "step": 4000
319
+ },
320
+ {
321
+ "epoch": 25.97,
322
+ "eval_cer": 0.10109853708140422,
323
+ "eval_loss": 0.3704679310321808,
324
+ "eval_runtime": 216.2288,
325
+ "eval_samples_per_second": 25.089,
326
+ "eval_steps_per_second": 3.14,
327
+ "eval_wer": 0.4797537295761307,
328
+ "step": 4000
329
+ },
330
+ {
331
+ "epoch": 26.62,
332
+ "learning_rate": 2.2753691275167788e-05,
333
+ "loss": 0.7724,
334
+ "step": 4100
335
+ },
336
+ {
337
+ "epoch": 27.27,
338
+ "learning_rate": 2.255234899328859e-05,
339
+ "loss": 0.7695,
340
+ "step": 4200
341
+ },
342
+ {
343
+ "epoch": 27.92,
344
+ "learning_rate": 2.2351006711409396e-05,
345
+ "loss": 0.7653,
346
+ "step": 4300
347
+ },
348
+ {
349
+ "epoch": 28.57,
350
+ "learning_rate": 2.2151677852348994e-05,
351
+ "loss": 0.7728,
352
+ "step": 4400
353
+ },
354
+ {
355
+ "epoch": 29.22,
356
+ "learning_rate": 2.19503355704698e-05,
357
+ "loss": 0.7603,
358
+ "step": 4500
359
+ },
360
+ {
361
+ "epoch": 29.22,
362
+ "eval_cer": 0.0992381113790261,
363
+ "eval_loss": 0.36611661314964294,
364
+ "eval_runtime": 217.7866,
365
+ "eval_samples_per_second": 24.91,
366
+ "eval_steps_per_second": 3.118,
367
+ "eval_wer": 0.4704001894387876,
368
+ "step": 4500
369
+ },
370
+ {
371
+ "epoch": 29.87,
372
+ "learning_rate": 2.1748993288590605e-05,
373
+ "loss": 0.7563,
374
+ "step": 4600
375
+ },
376
+ {
377
+ "epoch": 30.52,
378
+ "learning_rate": 2.154765100671141e-05,
379
+ "loss": 0.7618,
380
+ "step": 4700
381
+ },
382
+ {
383
+ "epoch": 31.17,
384
+ "learning_rate": 2.1346308724832217e-05,
385
+ "loss": 0.7467,
386
+ "step": 4800
387
+ },
388
+ {
389
+ "epoch": 31.82,
390
+ "learning_rate": 2.114697986577181e-05,
391
+ "loss": 0.7514,
392
+ "step": 4900
393
+ },
394
+ {
395
+ "epoch": 32.47,
396
+ "learning_rate": 2.0945637583892617e-05,
397
+ "loss": 0.7424,
398
+ "step": 5000
399
+ },
400
+ {
401
+ "epoch": 32.47,
402
+ "eval_cer": 0.09569444337449638,
403
+ "eval_loss": 0.3528956174850464,
404
+ "eval_runtime": 218.9072,
405
+ "eval_samples_per_second": 24.782,
406
+ "eval_steps_per_second": 3.102,
407
+ "eval_wer": 0.4577314705185887,
408
+ "step": 5000
409
+ },
410
+ {
411
+ "epoch": 33.12,
412
+ "learning_rate": 2.0744295302013423e-05,
413
+ "loss": 0.748,
414
+ "step": 5100
415
+ },
416
+ {
417
+ "epoch": 33.76,
418
+ "learning_rate": 2.054295302013423e-05,
419
+ "loss": 0.7357,
420
+ "step": 5200
421
+ },
422
+ {
423
+ "epoch": 34.41,
424
+ "learning_rate": 2.0341610738255034e-05,
425
+ "loss": 0.7357,
426
+ "step": 5300
427
+ },
428
+ {
429
+ "epoch": 35.06,
430
+ "learning_rate": 2.014026845637584e-05,
431
+ "loss": 0.735,
432
+ "step": 5400
433
+ },
434
+ {
435
+ "epoch": 35.71,
436
+ "learning_rate": 1.9938926174496645e-05,
437
+ "loss": 0.7251,
438
+ "step": 5500
439
+ },
440
+ {
441
+ "epoch": 35.71,
442
+ "eval_cer": 0.09283254627953377,
443
+ "eval_loss": 0.34103500843048096,
444
+ "eval_runtime": 218.8353,
445
+ "eval_samples_per_second": 24.79,
446
+ "eval_steps_per_second": 3.103,
447
+ "eval_wer": 0.4472827373904807,
448
+ "step": 5500
449
+ },
450
+ {
451
+ "epoch": 36.36,
452
+ "learning_rate": 1.973758389261745e-05,
453
+ "loss": 0.7306,
454
+ "step": 5600
455
+ },
456
+ {
457
+ "epoch": 37.01,
458
+ "learning_rate": 1.9536241610738256e-05,
459
+ "loss": 0.7185,
460
+ "step": 5700
461
+ },
462
+ {
463
+ "epoch": 37.66,
464
+ "learning_rate": 1.9334899328859062e-05,
465
+ "loss": 0.7135,
466
+ "step": 5800
467
+ },
468
+ {
469
+ "epoch": 38.31,
470
+ "learning_rate": 1.9133557046979864e-05,
471
+ "loss": 0.726,
472
+ "step": 5900
473
+ },
474
+ {
475
+ "epoch": 38.96,
476
+ "learning_rate": 1.893221476510067e-05,
477
+ "loss": 0.7106,
478
+ "step": 6000
479
+ },
480
+ {
481
+ "epoch": 38.96,
482
+ "eval_cer": 0.09190425933486893,
483
+ "eval_loss": 0.3401394486427307,
484
+ "eval_runtime": 218.6772,
485
+ "eval_samples_per_second": 24.808,
486
+ "eval_steps_per_second": 3.105,
487
+ "eval_wer": 0.4427539663746152,
488
+ "step": 6000
489
+ },
490
+ {
491
+ "epoch": 39.61,
492
+ "learning_rate": 1.8730872483221475e-05,
493
+ "loss": 0.7157,
494
+ "step": 6100
495
+ },
496
+ {
497
+ "epoch": 40.26,
498
+ "learning_rate": 1.852953020134228e-05,
499
+ "loss": 0.7055,
500
+ "step": 6200
501
+ },
502
+ {
503
+ "epoch": 40.91,
504
+ "learning_rate": 1.8328187919463086e-05,
505
+ "loss": 0.7108,
506
+ "step": 6300
507
+ },
508
+ {
509
+ "epoch": 41.56,
510
+ "learning_rate": 1.8126845637583892e-05,
511
+ "loss": 0.705,
512
+ "step": 6400
513
+ },
514
+ {
515
+ "epoch": 42.21,
516
+ "learning_rate": 1.7925503355704698e-05,
517
+ "loss": 0.7027,
518
+ "step": 6500
519
+ },
520
+ {
521
+ "epoch": 42.21,
522
+ "eval_cer": 0.09045597762866982,
523
+ "eval_loss": 0.3354834318161011,
524
+ "eval_runtime": 219.2927,
525
+ "eval_samples_per_second": 24.739,
526
+ "eval_steps_per_second": 3.096,
527
+ "eval_wer": 0.4352652143026285,
528
+ "step": 6500
529
+ },
530
+ {
531
+ "epoch": 42.86,
532
+ "learning_rate": 1.7724161073825503e-05,
533
+ "loss": 0.7103,
534
+ "step": 6600
535
+ },
536
+ {
537
+ "epoch": 43.51,
538
+ "learning_rate": 1.752281879194631e-05,
539
+ "loss": 0.7011,
540
+ "step": 6700
541
+ },
542
+ {
543
+ "epoch": 44.16,
544
+ "learning_rate": 1.7321476510067114e-05,
545
+ "loss": 0.7004,
546
+ "step": 6800
547
+ },
548
+ {
549
+ "epoch": 44.8,
550
+ "learning_rate": 1.712013422818792e-05,
551
+ "loss": 0.702,
552
+ "step": 6900
553
+ },
554
+ {
555
+ "epoch": 45.45,
556
+ "learning_rate": 1.6918791946308722e-05,
557
+ "loss": 0.6927,
558
+ "step": 7000
559
+ },
560
+ {
561
+ "epoch": 45.45,
562
+ "eval_cer": 0.08853392291751727,
563
+ "eval_loss": 0.33077025413513184,
564
+ "eval_runtime": 215.9851,
565
+ "eval_samples_per_second": 25.117,
566
+ "eval_steps_per_second": 3.144,
567
+ "eval_wer": 0.4296412502959981,
568
+ "step": 7000
569
+ },
570
+ {
571
+ "epoch": 46.1,
572
+ "learning_rate": 1.6717449664429528e-05,
573
+ "loss": 0.691,
574
+ "step": 7100
575
+ },
576
+ {
577
+ "epoch": 46.75,
578
+ "learning_rate": 1.6516107382550333e-05,
579
+ "loss": 0.6833,
580
+ "step": 7200
581
+ },
582
+ {
583
+ "epoch": 47.4,
584
+ "learning_rate": 1.631476510067114e-05,
585
+ "loss": 0.692,
586
+ "step": 7300
587
+ },
588
+ {
589
+ "epoch": 48.05,
590
+ "learning_rate": 1.6113422818791948e-05,
591
+ "loss": 0.6871,
592
+ "step": 7400
593
+ },
594
+ {
595
+ "epoch": 48.7,
596
+ "learning_rate": 1.5914093959731546e-05,
597
+ "loss": 0.6828,
598
+ "step": 7500
599
+ },
600
+ {
601
+ "epoch": 48.7,
602
+ "eval_cer": 0.08625364959286336,
603
+ "eval_loss": 0.324627548456192,
604
+ "eval_runtime": 221.033,
605
+ "eval_samples_per_second": 24.544,
606
+ "eval_steps_per_second": 3.072,
607
+ "eval_wer": 0.42043570921146106,
608
+ "step": 7500
609
+ },
610
+ {
611
+ "epoch": 49.35,
612
+ "learning_rate": 1.5712751677852352e-05,
613
+ "loss": 0.6789,
614
+ "step": 7600
615
+ },
616
+ {
617
+ "epoch": 50.0,
618
+ "learning_rate": 1.5511409395973158e-05,
619
+ "loss": 0.6811,
620
+ "step": 7700
621
+ },
622
+ {
623
+ "epoch": 50.65,
624
+ "learning_rate": 1.531006711409396e-05,
625
+ "loss": 0.683,
626
+ "step": 7800
627
+ },
628
+ {
629
+ "epoch": 51.3,
630
+ "learning_rate": 1.5108724832214764e-05,
631
+ "loss": 0.6765,
632
+ "step": 7900
633
+ },
634
+ {
635
+ "epoch": 51.95,
636
+ "learning_rate": 1.4907382550335571e-05,
637
+ "loss": 0.6706,
638
+ "step": 8000
639
+ },
640
+ {
641
+ "epoch": 51.95,
642
+ "eval_cer": 0.08681216248488163,
643
+ "eval_loss": 0.32503771781921387,
644
+ "eval_runtime": 215.1259,
645
+ "eval_samples_per_second": 25.218,
646
+ "eval_steps_per_second": 3.156,
647
+ "eval_wer": 0.42327729102533745,
648
+ "step": 8000
649
+ },
650
+ {
651
+ "epoch": 52.6,
652
+ "learning_rate": 1.4706040268456375e-05,
653
+ "loss": 0.6777,
654
+ "step": 8100
655
+ },
656
+ {
657
+ "epoch": 53.25,
658
+ "learning_rate": 1.450469798657718e-05,
659
+ "loss": 0.6754,
660
+ "step": 8200
661
+ },
662
+ {
663
+ "epoch": 53.89,
664
+ "learning_rate": 1.4303355704697986e-05,
665
+ "loss": 0.6675,
666
+ "step": 8300
667
+ },
668
+ {
669
+ "epoch": 54.54,
670
+ "learning_rate": 1.4102013422818792e-05,
671
+ "loss": 0.6627,
672
+ "step": 8400
673
+ },
674
+ {
675
+ "epoch": 55.19,
676
+ "learning_rate": 1.3900671140939599e-05,
677
+ "loss": 0.6629,
678
+ "step": 8500
679
+ },
680
+ {
681
+ "epoch": 55.19,
682
+ "eval_cer": 0.0849055150259227,
683
+ "eval_loss": 0.3263927102088928,
684
+ "eval_runtime": 213.0987,
685
+ "eval_samples_per_second": 25.458,
686
+ "eval_steps_per_second": 3.186,
687
+ "eval_wer": 0.4158773383850343,
688
+ "step": 8500
689
+ },
690
+ {
691
+ "epoch": 55.84,
692
+ "learning_rate": 1.3699328859060405e-05,
693
+ "loss": 0.6632,
694
+ "step": 8600
695
+ },
696
+ {
697
+ "epoch": 56.49,
698
+ "learning_rate": 1.3497986577181208e-05,
699
+ "loss": 0.6558,
700
+ "step": 8700
701
+ },
702
+ {
703
+ "epoch": 57.14,
704
+ "learning_rate": 1.3296644295302014e-05,
705
+ "loss": 0.6691,
706
+ "step": 8800
707
+ },
708
+ {
709
+ "epoch": 57.79,
710
+ "learning_rate": 1.309530201342282e-05,
711
+ "loss": 0.6633,
712
+ "step": 8900
713
+ },
714
+ {
715
+ "epoch": 58.44,
716
+ "learning_rate": 1.2893959731543625e-05,
717
+ "loss": 0.6556,
718
+ "step": 9000
719
+ },
720
+ {
721
+ "epoch": 58.44,
722
+ "eval_cer": 0.08353812139374003,
723
+ "eval_loss": 0.3212815821170807,
724
+ "eval_runtime": 220.228,
725
+ "eval_samples_per_second": 24.634,
726
+ "eval_steps_per_second": 3.083,
727
+ "eval_wer": 0.40995737627279183,
728
+ "step": 9000
729
+ },
730
+ {
731
+ "epoch": 59.09,
732
+ "learning_rate": 1.269261744966443e-05,
733
+ "loss": 0.6584,
734
+ "step": 9100
735
+ },
736
+ {
737
+ "epoch": 59.74,
738
+ "learning_rate": 1.2491275167785236e-05,
739
+ "loss": 0.6537,
740
+ "step": 9200
741
+ },
742
+ {
743
+ "epoch": 60.39,
744
+ "learning_rate": 1.228993288590604e-05,
745
+ "loss": 0.6633,
746
+ "step": 9300
747
+ },
748
+ {
749
+ "epoch": 61.04,
750
+ "learning_rate": 1.2088590604026846e-05,
751
+ "loss": 0.6474,
752
+ "step": 9400
753
+ },
754
+ {
755
+ "epoch": 61.69,
756
+ "learning_rate": 1.1887248322147651e-05,
757
+ "loss": 0.6484,
758
+ "step": 9500
759
+ },
760
+ {
761
+ "epoch": 61.69,
762
+ "eval_cer": 0.0836613794112889,
763
+ "eval_loss": 0.31816166639328003,
764
+ "eval_runtime": 216.5133,
765
+ "eval_samples_per_second": 25.056,
766
+ "eval_steps_per_second": 3.136,
767
+ "eval_wer": 0.41241416054937247,
768
+ "step": 9500
769
+ },
770
+ {
771
+ "epoch": 62.34,
772
+ "learning_rate": 1.1687919463087248e-05,
773
+ "loss": 0.6547,
774
+ "step": 9600
775
+ },
776
+ {
777
+ "epoch": 62.99,
778
+ "learning_rate": 1.1486577181208054e-05,
779
+ "loss": 0.6481,
780
+ "step": 9700
781
+ },
782
+ {
783
+ "epoch": 63.64,
784
+ "learning_rate": 1.128523489932886e-05,
785
+ "loss": 0.648,
786
+ "step": 9800
787
+ },
788
+ {
789
+ "epoch": 64.29,
790
+ "learning_rate": 1.1083892617449665e-05,
791
+ "loss": 0.6471,
792
+ "step": 9900
793
+ },
794
+ {
795
+ "epoch": 64.93,
796
+ "learning_rate": 1.088255033557047e-05,
797
+ "loss": 0.6407,
798
+ "step": 10000
799
+ },
800
+ {
801
+ "epoch": 64.93,
802
+ "eval_cer": 0.08248272461847792,
803
+ "eval_loss": 0.3171332776546478,
804
+ "eval_runtime": 217.1179,
805
+ "eval_samples_per_second": 24.986,
806
+ "eval_steps_per_second": 3.127,
807
+ "eval_wer": 0.4050142079090694,
808
+ "step": 10000
809
+ },
810
+ {
811
+ "epoch": 65.58,
812
+ "learning_rate": 1.0681208053691274e-05,
813
+ "loss": 0.6446,
814
+ "step": 10100
815
+ },
816
+ {
817
+ "epoch": 66.23,
818
+ "learning_rate": 1.047986577181208e-05,
819
+ "loss": 0.6383,
820
+ "step": 10200
821
+ },
822
+ {
823
+ "epoch": 66.88,
824
+ "learning_rate": 1.0278523489932886e-05,
825
+ "loss": 0.6413,
826
+ "step": 10300
827
+ },
828
+ {
829
+ "epoch": 67.53,
830
+ "learning_rate": 1.0077181208053691e-05,
831
+ "loss": 0.6494,
832
+ "step": 10400
833
+ },
834
+ {
835
+ "epoch": 68.18,
836
+ "learning_rate": 9.875838926174497e-06,
837
+ "loss": 0.6375,
838
+ "step": 10500
839
+ },
840
+ {
841
+ "epoch": 68.18,
842
+ "eval_cer": 0.0822362085833802,
843
+ "eval_loss": 0.31498104333877563,
844
+ "eval_runtime": 218.6829,
845
+ "eval_samples_per_second": 24.808,
846
+ "eval_steps_per_second": 3.105,
847
+ "eval_wer": 0.4038598152971821,
848
+ "step": 10500
849
+ },
850
+ {
851
+ "epoch": 68.83,
852
+ "learning_rate": 9.6744966442953e-06,
853
+ "loss": 0.6359,
854
+ "step": 10600
855
+ },
856
+ {
857
+ "epoch": 69.48,
858
+ "learning_rate": 9.473154362416108e-06,
859
+ "loss": 0.638,
860
+ "step": 10700
861
+ },
862
+ {
863
+ "epoch": 70.13,
864
+ "learning_rate": 9.271812080536914e-06,
865
+ "loss": 0.6405,
866
+ "step": 10800
867
+ },
868
+ {
869
+ "epoch": 70.78,
870
+ "learning_rate": 9.070469798657719e-06,
871
+ "loss": 0.6388,
872
+ "step": 10900
873
+ },
874
+ {
875
+ "epoch": 71.43,
876
+ "learning_rate": 8.869127516778525e-06,
877
+ "loss": 0.6363,
878
+ "step": 11000
879
+ },
880
+ {
881
+ "epoch": 71.43,
882
+ "eval_cer": 0.08095355483826237,
883
+ "eval_loss": 0.3129253089427948,
884
+ "eval_runtime": 218.9143,
885
+ "eval_samples_per_second": 24.781,
886
+ "eval_steps_per_second": 3.102,
887
+ "eval_wer": 0.3991238456073881,
888
+ "step": 11000
889
+ },
890
+ {
891
+ "epoch": 72.08,
892
+ "learning_rate": 8.66778523489933e-06,
893
+ "loss": 0.6369,
894
+ "step": 11100
895
+ },
896
+ {
897
+ "epoch": 72.73,
898
+ "learning_rate": 8.466442953020134e-06,
899
+ "loss": 0.635,
900
+ "step": 11200
901
+ },
902
+ {
903
+ "epoch": 73.38,
904
+ "learning_rate": 8.26510067114094e-06,
905
+ "loss": 0.6337,
906
+ "step": 11300
907
+ },
908
+ {
909
+ "epoch": 74.03,
910
+ "learning_rate": 8.063758389261745e-06,
911
+ "loss": 0.6308,
912
+ "step": 11400
913
+ },
914
+ {
915
+ "epoch": 74.67,
916
+ "learning_rate": 7.862416107382551e-06,
917
+ "loss": 0.6307,
918
+ "step": 11500
919
+ },
920
+ {
921
+ "epoch": 74.67,
922
+ "eval_cer": 0.08074170512060026,
923
+ "eval_loss": 0.3114279508590698,
924
+ "eval_runtime": 219.6542,
925
+ "eval_samples_per_second": 24.698,
926
+ "eval_steps_per_second": 3.091,
927
+ "eval_wer": 0.3986206488278475,
928
+ "step": 11500
929
+ },
930
+ {
931
+ "epoch": 75.32,
932
+ "learning_rate": 7.661073825503357e-06,
933
+ "loss": 0.6335,
934
+ "step": 11600
935
+ },
936
+ {
937
+ "epoch": 75.97,
938
+ "learning_rate": 7.459731543624161e-06,
939
+ "loss": 0.628,
940
+ "step": 11700
941
+ },
942
+ {
943
+ "epoch": 76.62,
944
+ "learning_rate": 7.260402684563759e-06,
945
+ "loss": 0.6324,
946
+ "step": 11800
947
+ },
948
+ {
949
+ "epoch": 77.27,
950
+ "learning_rate": 7.059060402684564e-06,
951
+ "loss": 0.6317,
952
+ "step": 11900
953
+ },
954
+ {
955
+ "epoch": 77.92,
956
+ "learning_rate": 6.857718120805369e-06,
957
+ "loss": 0.6232,
958
+ "step": 12000
959
+ },
960
+ {
961
+ "epoch": 77.92,
962
+ "eval_cer": 0.07899683380967422,
963
+ "eval_loss": 0.31030353903770447,
964
+ "eval_runtime": 220.4971,
965
+ "eval_samples_per_second": 24.603,
966
+ "eval_steps_per_second": 3.079,
967
+ "eval_wer": 0.3895335069855553,
968
+ "step": 12000
969
+ },
970
+ {
971
+ "epoch": 78.57,
972
+ "learning_rate": 6.656375838926175e-06,
973
+ "loss": 0.6295,
974
+ "step": 12100
975
+ },
976
+ {
977
+ "epoch": 79.22,
978
+ "learning_rate": 6.4550335570469795e-06,
979
+ "loss": 0.6234,
980
+ "step": 12200
981
+ },
982
+ {
983
+ "epoch": 79.87,
984
+ "learning_rate": 6.255704697986578e-06,
985
+ "loss": 0.6172,
986
+ "step": 12300
987
+ },
988
+ {
989
+ "epoch": 80.52,
990
+ "learning_rate": 6.054362416107383e-06,
991
+ "loss": 0.6203,
992
+ "step": 12400
993
+ },
994
+ {
995
+ "epoch": 81.17,
996
+ "learning_rate": 5.853020134228188e-06,
997
+ "loss": 0.6216,
998
+ "step": 12500
999
+ },
1000
+ {
1001
+ "epoch": 81.17,
1002
+ "eval_cer": 0.0789506120530934,
1003
+ "eval_loss": 0.30863967537879944,
1004
+ "eval_runtime": 218.4142,
1005
+ "eval_samples_per_second": 24.838,
1006
+ "eval_steps_per_second": 3.109,
1007
+ "eval_wer": 0.3891191096376983,
1008
+ "step": 12500
1009
+ },
1010
+ {
1011
+ "epoch": 81.82,
1012
+ "learning_rate": 5.651677852348994e-06,
1013
+ "loss": 0.6203,
1014
+ "step": 12600
1015
+ },
1016
+ {
1017
+ "epoch": 82.47,
1018
+ "learning_rate": 5.4503355704697986e-06,
1019
+ "loss": 0.6209,
1020
+ "step": 12700
1021
+ },
1022
+ {
1023
+ "epoch": 83.12,
1024
+ "learning_rate": 5.248993288590604e-06,
1025
+ "loss": 0.6257,
1026
+ "step": 12800
1027
+ },
1028
+ {
1029
+ "epoch": 83.76,
1030
+ "learning_rate": 5.04765100671141e-06,
1031
+ "loss": 0.6245,
1032
+ "step": 12900
1033
+ },
1034
+ {
1035
+ "epoch": 84.41,
1036
+ "learning_rate": 4.8463087248322145e-06,
1037
+ "loss": 0.6174,
1038
+ "step": 13000
1039
+ },
1040
+ {
1041
+ "epoch": 84.41,
1042
+ "eval_cer": 0.07851150536557558,
1043
+ "eval_loss": 0.3082079291343689,
1044
+ "eval_runtime": 215.0269,
1045
+ "eval_samples_per_second": 25.229,
1046
+ "eval_steps_per_second": 3.158,
1047
+ "eval_wer": 0.3880535164574947,
1048
+ "step": 13000
1049
+ },
1050
+ {
1051
+ "epoch": 85.06,
1052
+ "learning_rate": 4.64496644295302e-06,
1053
+ "loss": 0.6222,
1054
+ "step": 13100
1055
+ },
1056
+ {
1057
+ "epoch": 85.71,
1058
+ "learning_rate": 4.443624161073826e-06,
1059
+ "loss": 0.6113,
1060
+ "step": 13200
1061
+ },
1062
+ {
1063
+ "epoch": 86.36,
1064
+ "learning_rate": 4.2422818791946304e-06,
1065
+ "loss": 0.6238,
1066
+ "step": 13300
1067
+ },
1068
+ {
1069
+ "epoch": 87.01,
1070
+ "learning_rate": 4.040939597315437e-06,
1071
+ "loss": 0.618,
1072
+ "step": 13400
1073
+ },
1074
+ {
1075
+ "epoch": 87.66,
1076
+ "learning_rate": 3.8395973154362425e-06,
1077
+ "loss": 0.6196,
1078
+ "step": 13500
1079
+ },
1080
+ {
1081
+ "epoch": 87.66,
1082
+ "eval_cer": 0.07821876757389704,
1083
+ "eval_loss": 0.30590009689331055,
1084
+ "eval_runtime": 213.2382,
1085
+ "eval_samples_per_second": 25.441,
1086
+ "eval_steps_per_second": 3.184,
1087
+ "eval_wer": 0.3874911200568316,
1088
+ "step": 13500
1089
+ },
1090
+ {
1091
+ "epoch": 88.31,
1092
+ "learning_rate": 3.6382550335570468e-06,
1093
+ "loss": 0.6174,
1094
+ "step": 13600
1095
+ },
1096
+ {
1097
+ "epoch": 88.96,
1098
+ "learning_rate": 3.4369127516778524e-06,
1099
+ "loss": 0.6128,
1100
+ "step": 13700
1101
+ },
1102
+ {
1103
+ "epoch": 89.61,
1104
+ "learning_rate": 3.235570469798658e-06,
1105
+ "loss": 0.6246,
1106
+ "step": 13800
1107
+ },
1108
+ {
1109
+ "epoch": 90.26,
1110
+ "learning_rate": 3.034228187919463e-06,
1111
+ "loss": 0.6097,
1112
+ "step": 13900
1113
+ },
1114
+ {
1115
+ "epoch": 90.91,
1116
+ "learning_rate": 2.8328859060402687e-06,
1117
+ "loss": 0.6174,
1118
+ "step": 14000
1119
+ },
1120
+ {
1121
+ "epoch": 90.91,
1122
+ "eval_cer": 0.07799151060404132,
1123
+ "eval_loss": 0.30842480063438416,
1124
+ "eval_runtime": 212.6251,
1125
+ "eval_samples_per_second": 25.514,
1126
+ "eval_steps_per_second": 3.193,
1127
+ "eval_wer": 0.3862479280132607,
1128
+ "step": 14000
1129
+ },
1130
+ {
1131
+ "epoch": 91.56,
1132
+ "learning_rate": 2.631543624161074e-06,
1133
+ "loss": 0.6194,
1134
+ "step": 14100
1135
+ },
1136
+ {
1137
+ "epoch": 92.21,
1138
+ "learning_rate": 2.430201342281879e-06,
1139
+ "loss": 0.6167,
1140
+ "step": 14200
1141
+ },
1142
+ {
1143
+ "epoch": 92.86,
1144
+ "learning_rate": 2.2288590604026842e-06,
1145
+ "loss": 0.614,
1146
+ "step": 14300
1147
+ },
1148
+ {
1149
+ "epoch": 93.51,
1150
+ "learning_rate": 2.0275167785234902e-06,
1151
+ "loss": 0.615,
1152
+ "step": 14400
1153
+ },
1154
+ {
1155
+ "epoch": 94.16,
1156
+ "learning_rate": 1.8261744966442954e-06,
1157
+ "loss": 0.6169,
1158
+ "step": 14500
1159
+ },
1160
+ {
1161
+ "epoch": 94.16,
1162
+ "eval_cer": 0.07787595621258926,
1163
+ "eval_loss": 0.30701127648353577,
1164
+ "eval_runtime": 215.1709,
1165
+ "eval_samples_per_second": 25.213,
1166
+ "eval_steps_per_second": 3.156,
1167
+ "eval_wer": 0.3859519299076486,
1168
+ "step": 14500
1169
+ },
1170
+ {
1171
+ "epoch": 94.8,
1172
+ "learning_rate": 1.6248322147651008e-06,
1173
+ "loss": 0.6123,
1174
+ "step": 14600
1175
+ },
1176
+ {
1177
+ "epoch": 95.45,
1178
+ "learning_rate": 1.423489932885906e-06,
1179
+ "loss": 0.6133,
1180
+ "step": 14700
1181
+ },
1182
+ {
1183
+ "epoch": 96.1,
1184
+ "learning_rate": 1.2221476510067115e-06,
1185
+ "loss": 0.6068,
1186
+ "step": 14800
1187
+ },
1188
+ {
1189
+ "epoch": 96.75,
1190
+ "learning_rate": 1.020805369127517e-06,
1191
+ "loss": 0.6135,
1192
+ "step": 14900
1193
+ },
1194
+ {
1195
+ "epoch": 97.4,
1196
+ "learning_rate": 8.194630872483221e-07,
1197
+ "loss": 0.6166,
1198
+ "step": 15000
1199
+ },
1200
+ {
1201
+ "epoch": 97.4,
1202
+ "eval_cer": 0.07776810544723402,
1203
+ "eval_loss": 0.30662447214126587,
1204
+ "eval_runtime": 214.9469,
1205
+ "eval_samples_per_second": 25.239,
1206
+ "eval_steps_per_second": 3.159,
1207
+ "eval_wer": 0.3855079327492304,
1208
+ "step": 15000
1209
+ },
1210
+ {
1211
+ "epoch": 98.05,
1212
+ "learning_rate": 6.181208053691276e-07,
1213
+ "loss": 0.6189,
1214
+ "step": 15100
1215
+ },
1216
+ {
1217
+ "epoch": 98.7,
1218
+ "learning_rate": 4.167785234899329e-07,
1219
+ "loss": 0.6085,
1220
+ "step": 15200
1221
+ },
1222
+ {
1223
+ "epoch": 99.35,
1224
+ "learning_rate": 2.1543624161073826e-07,
1225
+ "loss": 0.6135,
1226
+ "step": 15300
1227
+ },
1228
+ {
1229
+ "epoch": 100.0,
1230
+ "learning_rate": 1.4093959731543625e-08,
1231
+ "loss": 0.6047,
1232
+ "step": 15400
1233
+ },
1234
+ {
1235
+ "epoch": 100.0,
1236
+ "step": 15400,
1237
+ "total_flos": 2.652008062738907e+20,
1238
+ "train_loss": 0.9527478711016767,
1239
+ "train_runtime": 101833.8069,
1240
+ "train_samples_per_second": 19.371,
1241
+ "train_steps_per_second": 0.151
1242
+ }
1243
+ ],
1244
+ "max_steps": 15400,
1245
+ "num_train_epochs": 100,
1246
+ "total_flos": 2.652008062738907e+20,
1247
+ "trial_name": null,
1248
+ "trial_params": null
1249
+ }
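The block above appears to be the tail of the trainer state that the `transformers` Trainer writes out (each evaluation entry records `eval_wer`/`eval_cer` at a checkpoint). A minimal sketch, assuming the standard `trainer_state.json` layout with a `log_history` list (file name and path are hypothetical), of pulling the WER/CER curve out of it:

```python
import json

# Assumption: the file follows the standard trainer_state.json layout with a "log_history" list.
with open("trainer_state.json") as f:
    state = json.load(f)

# Keep only the evaluation entries (the ones that report eval_wer / eval_cer).
evals = [entry for entry in state["log_history"] if "eval_wer" in entry]
for entry in evals:
    print(f'step {entry["step"]:>6}  WER {entry["eval_wer"]:.4f}  CER {entry["eval_cer"]:.4f}')
```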
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e07ae4abc8c48013db20dc56c875f2a1b7115ee1ed5e58dc64886f0b18aec42a
3
+ size 3055
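`training_args.bin` is uploaded as a Git LFS pointer; it is most likely the serialized `TrainingArguments` object that the Trainer stores next to its checkpoints. A minimal sketch, assuming it was written with `torch.save` as usual (so `transformers` must be importable when unpickling):

```python
import torch

# Assumption: training_args.bin is a pickled transformers.TrainingArguments object.
args = torch.load("training_args.bin")
print(args.learning_rate, args.num_train_epochs, args.gradient_accumulation_steps)
```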
uz_cv8_text.txt ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
@@ -0,0 +1,33 @@
1
+ {
2
+ "[PAD]": 30,
3
+ "[UNK]": 29,
4
+ "a": 1,
5
+ "b": 2,
6
+ "c": 3,
7
+ "d": 4,
8
+ "e": 5,
9
+ "f": 6,
10
+ "g": 7,
11
+ "h": 8,
12
+ "i": 9,
13
+ "j": 10,
14
+ "k": 11,
15
+ "l": 12,
16
+ "m": 13,
17
+ "n": 14,
18
+ "o": 15,
19
+ "p": 16,
20
+ "q": 17,
21
+ "r": 18,
22
+ "s": 19,
23
+ "t": 20,
24
+ "u": 21,
25
+ "v": 22,
26
+ "w": 23,
27
+ "x": 24,
28
+ "y": 25,
29
+ "z": 26,
30
+ "|": 0,
31
+ "‘": 27,
32
+ "’": 28
33
+ }
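The vocabulary above is a flat character-to-id map: `|` (id 0) serves as the word delimiter, `[UNK]`/`[PAD]` are the CTC special tokens, and `‘`/`’` are kept as ordinary letters of the alphabet. A minimal sketch, not part of the upload, of loading it into a CTC tokenizer:

```python
from transformers import Wav2Vec2CTCTokenizer

# Sketch only: build a CTC tokenizer from the vocab.json shown above.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
)
print(tokenizer.convert_tokens_to_ids(["o", "‘", "z"]))  # e.g. [15, 27, 26] given the ids above
```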
with_ngram_LM.ipynb ADDED
@@ -0,0 +1,350 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 8,
6
+ "id": "072d16f1",
7
+ "metadata": {},
8
+ "outputs": [
9
+ {
10
+ "name": "stdout",
11
+ "output_type": "stream",
12
+ "text": [
13
+ "/workspace/xls-r-uzbek-cv8\n"
14
+ ]
15
+ }
16
+ ],
17
+ "source": [
18
+ "%cd ~/xls-r-uzbek-cv8"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": 10,
24
+ "id": "12382315",
25
+ "metadata": {},
26
+ "outputs": [
27
+ {
28
+ "name": "stdout",
29
+ "output_type": "stream",
30
+ "text": [
31
+ "\u001b[33mWARNING: Ignoring invalid distribution -ransformers (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
32
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ip (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
33
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution - (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
34
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ransformers (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
35
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ip (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
36
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution - (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
37
+ "\u001b[0mCollecting https://github.com/kpu/kenlm/archive/master.zip (from -r requirements.txt (line 10))\n",
38
+ " Using cached https://github.com/kpu/kenlm/archive/master.zip (541 kB)\n",
39
+ " Preparing metadata (setup.py) ... \u001b[?25ldone\n",
40
+ "\u001b[?25hRequirement already satisfied: unidecode in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.3.2)\n",
41
+ "Collecting tensorboard\n",
42
+ " Using cached tensorboard-2.8.0-py3-none-any.whl (5.8 MB)\n",
43
+ "Requirement already satisfied: torch in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 3)) (1.10.2)\n",
44
+ "Requirement already satisfied: torchaudio in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 4)) (0.10.2)\n",
45
+ "Requirement already satisfied: jiwer~=2.3.0 in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 5)) (2.3.0)\n",
46
+ "Requirement already satisfied: soundfile~=0.10.3 in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 6)) (0.10.3.post1)\n",
47
+ "Collecting transformers~=4.16.2\n",
48
+ " Using cached transformers-4.16.2-py3-none-any.whl (3.5 MB)\n",
49
+ "Collecting datasets~=1.18.3\n",
50
+ " Using cached datasets-1.18.3-py3-none-any.whl (311 kB)\n",
51
+ "Requirement already satisfied: pyctcdecode in /opt/conda/lib/python3.8/site-packages (from -r requirements.txt (line 9)) (0.3.0)\n",
52
+ "Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (3.19.4)\n",
53
+ "Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (1.8.1)\n",
54
+ "Collecting google-auth-oauthlib<0.5,>=0.4.1\n",
55
+ " Using cached google_auth_oauthlib-0.4.6-py2.py3-none-any.whl (18 kB)\n",
56
+ "Requirement already satisfied: google-auth<3,>=1.6.3 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (2.6.0)\n",
57
+ "Requirement already satisfied: numpy>=1.12.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (1.19.2)\n",
58
+ "Requirement already satisfied: setuptools>=41.0.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (50.3.1.post20201107)\n",
59
+ "Requirement already satisfied: requests<3,>=2.21.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (2.24.0)\n",
60
+ "Requirement already satisfied: markdown>=2.6.8 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (3.3.6)\n",
61
+ "Requirement already satisfied: grpcio>=1.24.3 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (1.43.0)\n",
62
+ "Requirement already satisfied: wheel>=0.26 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (0.35.1)\n",
63
+ "Requirement already satisfied: absl-py>=0.4 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (1.0.0)\n",
64
+ "Requirement already satisfied: werkzeug>=0.11.15 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (2.0.2)\n",
65
+ "Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /opt/conda/lib/python3.8/site-packages (from tensorboard->-r requirements.txt (line 2)) (0.6.1)\n",
66
+ "Requirement already satisfied: typing-extensions in /opt/conda/lib/python3.8/site-packages (from torch->-r requirements.txt (line 3)) (4.0.1)\n",
67
+ "Requirement already satisfied: python-Levenshtein==0.12.2 in /opt/conda/lib/python3.8/site-packages (from jiwer~=2.3.0->-r requirements.txt (line 5)) (0.12.2)\n",
68
+ "Requirement already satisfied: cffi>=1.0 in /opt/conda/lib/python3.8/site-packages (from soundfile~=0.10.3->-r requirements.txt (line 6)) (1.14.3)\n",
69
+ "Requirement already satisfied: filelock in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (3.0.12)\n",
70
+ "Requirement already satisfied: regex!=2019.12.17 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (2022.1.18)\n",
71
+ "Requirement already satisfied: sacremoses in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (0.0.47)\n",
72
+ "Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (21.3)\n",
73
+ "Requirement already satisfied: tokenizers!=0.11.3,>=0.10.1 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (0.11.4)\n",
74
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.1.0 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (0.4.0)\n",
75
+ "Requirement already satisfied: pyyaml>=5.1 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (5.4.1)\n",
76
+ "Requirement already satisfied: tqdm>=4.27 in /opt/conda/lib/python3.8/site-packages (from transformers~=4.16.2->-r requirements.txt (line 7)) (4.62.3)\n",
77
+ "Requirement already satisfied: multiprocess in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (0.70.12.2)\n",
78
+ "Requirement already satisfied: dill in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (0.3.4)\n",
79
+ "Requirement already satisfied: pandas in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (1.4.0)\n",
80
+ "Requirement already satisfied: aiohttp in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (3.8.1)\n",
81
+ "Requirement already satisfied: xxhash in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (2.0.2)\n",
82
+ "Requirement already satisfied: pyarrow!=4.0.0,>=3.0.0 in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (6.0.1)\n",
83
+ "Requirement already satisfied: fsspec[http]>=2021.05.0 in /opt/conda/lib/python3.8/site-packages (from datasets~=1.18.3->-r requirements.txt (line 8)) (2022.1.0)\n",
84
+ "Requirement already satisfied: pygtrie<3.0,>=2.1 in /opt/conda/lib/python3.8/site-packages (from pyctcdecode->-r requirements.txt (line 9)) (2.4.2)\n",
85
+ "Requirement already satisfied: hypothesis<7,>=6.14 in /opt/conda/lib/python3.8/site-packages (from pyctcdecode->-r requirements.txt (line 9)) (6.36.1)\n",
86
+ "Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from absl-py>=0.4->tensorboard->-r requirements.txt (line 2)) (1.15.0)\n",
87
+ "Requirement already satisfied: pycparser in /opt/conda/lib/python3.8/site-packages (from cffi>=1.0->soundfile~=0.10.3->-r requirements.txt (line 6)) (2.20)\n",
88
+ "Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard->-r requirements.txt (line 2)) (4.8)\n",
89
+ "Requirement already satisfied: cachetools<6.0,>=2.0.0 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard->-r requirements.txt (line 2)) (5.0.0)\n",
90
+ "Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.8/site-packages (from google-auth<3,>=1.6.3->tensorboard->-r requirements.txt (line 2)) (0.2.8)\n",
91
+ "Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/conda/lib/python3.8/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard->-r requirements.txt (line 2)) (1.3.1)\n",
92
+ "Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/conda/lib/python3.8/site-packages (from hypothesis<7,>=6.14->pyctcdecode->-r requirements.txt (line 9)) (2.4.0)\n",
93
+ "Requirement already satisfied: attrs>=19.2.0 in /opt/conda/lib/python3.8/site-packages (from hypothesis<7,>=6.14->pyctcdecode->-r requirements.txt (line 9)) (21.4.0)\n",
94
+ "Requirement already satisfied: importlib-metadata>=4.4 in /opt/conda/lib/python3.8/site-packages (from markdown>=2.6.8->tensorboard->-r requirements.txt (line 2)) (4.10.1)\n",
95
+ "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.8/site-packages (from packaging>=20.0->transformers~=4.16.2->-r requirements.txt (line 7)) (3.0.7)\n",
96
+ "Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard->-r requirements.txt (line 2)) (3.0.4)\n",
97
+ "Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard->-r requirements.txt (line 2)) (2.10)\n",
98
+ "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard->-r requirements.txt (line 2)) (1.25.11)\n",
99
+ "Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests<3,>=2.21.0->tensorboard->-r requirements.txt (line 2)) (2020.12.5)\n",
100
+ "Requirement already satisfied: yarl<2.0,>=1.0 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (1.7.2)\n",
101
+ "Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (4.0.2)\n",
102
+ "Requirement already satisfied: aiosignal>=1.1.2 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (1.2.0)\n",
103
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (6.0.2)\n",
104
+ "Requirement already satisfied: frozenlist>=1.1.1 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (1.3.0)\n",
105
+ "Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /opt/conda/lib/python3.8/site-packages (from aiohttp->datasets~=1.18.3->-r requirements.txt (line 8)) (2.0.10)\n",
106
+ "Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.8/site-packages (from pandas->datasets~=1.18.3->-r requirements.txt (line 8)) (2.8.2)\n",
107
+ "Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.8/site-packages (from pandas->datasets~=1.18.3->-r requirements.txt (line 8)) (2021.1)\n",
108
+ "Requirement already satisfied: joblib in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers~=4.16.2->-r requirements.txt (line 7)) (1.1.0)\n",
109
+ "Requirement already satisfied: click in /opt/conda/lib/python3.8/site-packages (from sacremoses->transformers~=4.16.2->-r requirements.txt (line 7)) (8.0.3)\n",
110
+ "Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.8/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard->-r requirements.txt (line 2)) (3.7.0)\n",
111
+ "Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /opt/conda/lib/python3.8/site-packages (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard->-r requirements.txt (line 2)) (0.4.8)\n",
112
+ "Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.8/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard->-r requirements.txt (line 2)) (3.2.0)\n",
113
+ "Building wheels for collected packages: kenlm\n",
114
+ " Building wheel for kenlm (setup.py) ... \u001b[?25ldone\n",
115
+ "\u001b[?25h Created wheel for kenlm: filename=kenlm-0.0.0-cp38-cp38-linux_x86_64.whl size=2348591 sha256=d5c8e5430d89f59ddde39bc78aec471c1e66ef43b6cde792711b2e97d7b8b9dc\n",
116
+ " Stored in directory: /tmp/pip-ephem-wheel-cache-hhcfnszu/wheels/ff/08/4e/a3ddc0e786e0f3c1fcd2e7a82c4324c02fc3ae2638471406d2\n",
117
+ "Successfully built kenlm\n",
118
+ "\u001b[33mWARNING: Ignoring invalid distribution -ransformers (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
119
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ip (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
120
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution - (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
121
+ "\u001b[0mInstalling collected packages: kenlm, transformers, google-auth-oauthlib, tensorboard, datasets\n",
122
+ "\u001b[33m WARNING: The script transformers-cli is installed in '/workspace/.local/bin' which is not on PATH.\n",
123
+ " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
124
+ "\u001b[0m\u001b[33m WARNING: The script google-oauthlib-tool is installed in '/workspace/.local/bin' which is not on PATH.\n",
125
+ " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
126
+ "\u001b[0m\u001b[33m WARNING: The script tensorboard is installed in '/workspace/.local/bin' which is not on PATH.\n",
127
+ " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
128
+ "\u001b[0m\u001b[33m WARNING: The script datasets-cli is installed in '/workspace/.local/bin' which is not on PATH.\n",
129
+ " Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.\u001b[0m\u001b[33m\n",
130
+ "\u001b[0mSuccessfully installed datasets-1.18.3 google-auth-oauthlib-0.4.6 kenlm-0.0.0 tensorboard-2.8.0 transformers-4.16.2\n",
131
+ "\u001b[33mWARNING: Ignoring invalid distribution -ransformers (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
132
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution -ip (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
133
+ "\u001b[0m\u001b[33mWARNING: Ignoring invalid distribution - (/opt/conda/lib/python3.8/site-packages)\u001b[0m\u001b[33m\n",
134
+ "\u001b[0m"
135
+ ]
136
+ }
137
+ ],
138
+ "source": [
139
+ "!python -m pip install -r requirements.txt --user"
140
+ ]
141
+ },
142
+ {
143
+ "cell_type": "code",
144
+ "execution_count": 14,
145
+ "id": "3969d63a",
146
+ "metadata": {},
147
+ "outputs": [],
148
+ "source": [
149
+ "from transformers import AutoFeatureExtractor, AutoTokenizer, pipeline\n",
150
+ "from datasets import Audio, Dataset, DatasetDict, load_dataset, load_metric\n",
151
+ "\n",
152
+ "import re\n",
153
+ "import string\n",
154
+ "import unidecode"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": 12,
160
+ "id": "daff17fd",
161
+ "metadata": {},
162
+ "outputs": [
163
+ {
164
+ "name": "stderr",
165
+ "output_type": "stream",
166
+ "text": [
167
+ "Reusing dataset common_voice (/workspace/.cache/huggingface/datasets/mozilla-foundation___common_voice/uz/8.0.0/b8bc4d453193c06a43269b46cd87f075c70f152ac963b7f28f7a2760c45ec3e8)\n"
168
+ ]
169
+ },
170
+ {
171
+ "data": {
172
+ "application/vnd.jupyter.widget-view+json": {
173
+ "model_id": "a8aad37a859241ff81ac932edc204bf8",
174
+ "version_major": 2,
175
+ "version_minor": 0
176
+ },
177
+ "text/plain": [
178
+ " 0%| | 0/5 [00:00<?, ?it/s]"
179
+ ]
180
+ },
181
+ "metadata": {},
182
+ "output_type": "display_data"
183
+ },
184
+ {
185
+ "data": {
186
+ "text/plain": [
187
+ "DatasetDict({\n",
188
+ " train: Dataset({\n",
189
+ " features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],\n",
190
+ " num_rows: 39456\n",
191
+ " })\n",
192
+ " test: Dataset({\n",
193
+ " features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],\n",
194
+ " num_rows: 11598\n",
195
+ " })\n",
196
+ " validation: Dataset({\n",
197
+ " features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],\n",
198
+ " num_rows: 10849\n",
199
+ " })\n",
200
+ " other: Dataset({\n",
201
+ " features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],\n",
202
+ " num_rows: 119461\n",
203
+ " })\n",
204
+ " invalidated: Dataset({\n",
205
+ " features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],\n",
206
+ " num_rows: 11276\n",
207
+ " })\n",
208
+ "})"
209
+ ]
210
+ },
211
+ "execution_count": 12,
212
+ "metadata": {},
213
+ "output_type": "execute_result"
214
+ }
215
+ ],
216
+ "source": [
217
+ "dataset_dict = load_dataset(\"mozilla-foundation/common_voice_8_0\", \"uz\", use_auth_token=True)\n",
218
+ "dataset_dict"
219
+ ]
220
+ },
221
+ {
222
+ "cell_type": "code",
223
+ "execution_count": 17,
224
+ "id": "c22aee32",
225
+ "metadata": {},
226
+ "outputs": [],
227
+ "source": [
228
+ "chars_to_ignore_regex=f\"[{re.escape(string.punctuation)}]\" \n",
229
+ "\n",
230
+ "def remove_special_characters(batch):\n",
231
+ " batch[\"text\"] = re.sub(\n",
232
+ " chars_to_ignore_regex, \n",
233
+ " \"\", \n",
234
+ " re.sub(\"['`´]\", \"’\", # elsewhere probably meant as glottal stop\n",
235
+ " re.sub(\"([og])['`´]\", \"\\g<1>‘\", # after o/g indicate modified char\n",
236
+ " unidecode.unidecode(batch[\"sentence\"]).lower()\n",
237
+ " )\n",
238
+ " )\n",
239
+ " ) + \" \"\n",
240
+ " return batch"
241
+ ]
242
+ },
243
+ {
244
+ "cell_type": "code",
245
+ "execution_count": 18,
246
+ "id": "f28dc522",
247
+ "metadata": {},
248
+ "outputs": [
249
+ {
250
+ "data": {
251
+ "application/vnd.jupyter.widget-view+json": {
252
+ "model_id": "4b8d2f0df8ea46bdaee2c94996583c5e",
253
+ "version_major": 2,
254
+ "version_minor": 0
255
+ },
256
+ "text/plain": [
257
+ "0ex [00:00, ?ex/s]"
258
+ ]
259
+ },
260
+ "metadata": {},
261
+ "output_type": "display_data"
262
+ }
263
+ ],
264
+ "source": [
265
+ "dataset = dataset_dict[\"train\"].map(remove_special_characters, remove_columns=dataset_dict[\"train\"].column_names)"
266
+ ]
267
+ },
268
+ {
269
+ "cell_type": "code",
270
+ "execution_count": 23,
271
+ "id": "38e02d29",
272
+ "metadata": {},
273
+ "outputs": [
274
+ {
275
+ "name": "stdout",
276
+ "output_type": "stream",
277
+ "text": [
278
+ " 0 244494 2030240 uz_cv8_train.txt\n"
279
+ ]
280
+ }
281
+ ],
282
+ "source": [
283
+ "text_data = \"uz_cv8_train.txt\"\n",
284
+ "with open(text_data, \"w\") as fs:\n",
285
+ " fs.write(\" \".join(dataset[\"text\"]))\n",
286
+ "\n",
287
+ "!wc $text_data"
288
+ ]
289
+ },
290
+ {
291
+ "cell_type": "code",
292
+ "execution_count": 26,
293
+ "id": "7b3d70f0",
294
+ "metadata": {},
295
+ "outputs": [
296
+ {
297
+ "name": "stdout",
298
+ "output_type": "stream",
299
+ "text": [
300
+ "--2022-02-07 03:18:36-- https://kheafield.com/code/kenlm.tar.gz\n",
301
+ "Resolving kheafield.com (kheafield.com)... 35.196.63.85\n",
302
+ "Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.\n",
303
+ "HTTP request sent, awaiting response... 200 OK\n",
304
+ "Length: 491090 (480K) [application/x-gzip]\n",
305
+ "Saving to: ‘STDOUT’\n",
306
+ "\n",
307
+ "- 100%[===================>] 479.58K 2.31MB/s in 0.2s \n",
308
+ "\n",
309
+ "2022-02-07 03:18:37 (2.31 MB/s) - written to stdout [491090/491090]\n",
310
+ "\n",
311
+ "/bin/bash: line 1: cmake: command not found\n"
312
+ ]
313
+ }
314
+ ],
315
+ "source": [
316
+ "!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz\n",
317
+ "!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2\n"
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "code",
322
+ "execution_count": null,
323
+ "id": "65118a69",
324
+ "metadata": {},
325
+ "outputs": [],
326
+ "source": []
327
+ }
328
+ ],
329
+ "metadata": {
330
+ "kernelspec": {
331
+ "display_name": "Python 3 (ipykernel)",
332
+ "language": "python",
333
+ "name": "python3"
334
+ },
335
+ "language_info": {
336
+ "codemirror_mode": {
337
+ "name": "ipython",
338
+ "version": 3
339
+ },
340
+ "file_extension": ".py",
341
+ "mimetype": "text/x-python",
342
+ "name": "python",
343
+ "nbconvert_exporter": "python",
344
+ "pygments_lexer": "ipython3",
345
+ "version": "3.8.8"
346
+ }
347
+ },
348
+ "nbformat": 4,
349
+ "nbformat_minor": 5
350
+ }
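The notebook breaks off where the kenlm build fails (`cmake: command not found`), but the LM-boosted results reported in the README imply the ARPA model was eventually built from the exported text. A minimal sketch, with hypothetical file names (`uz_cv8.arpa` standing in for the lmplz output), of how such a model is typically hooked into CTC decoding with `pyctcdecode`, which already appears in the notebook's requirements:

```python
import json
from pyctcdecode import build_ctcdecoder

# Sketch only: labels must be ordered by their ids in vocab.json (index 0 is the word delimiter "|").
with open("vocab.json") as f:
    vocab = json.load(f)
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

# "uz_cv8.arpa" is a hypothetical name for the kenlm ARPA file built from the uploaded text.
decoder = build_ctcdecoder(labels, kenlm_model_path="uz_cv8.arpa")
```

In practice the decoder is then wrapped, together with the feature extractor and tokenizer, in `transformers.Wav2Vec2ProcessorWithLM`, so that `processor.batch_decode(logits)` applies the language model during decoding.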