marid bakrianoo committed on
Commit 9391f5a · verified · 0 Parent(s):

Duplicate from bakrianoo/sinai-voice-ar-stt

Co-authored-by: Abu Bakr Soliman <bakrianoo@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,17 @@
1
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
2
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.h5 filter=lfs diff=lfs merge=lfs -text
5
+ *.tflite filter=lfs diff=lfs merge=lfs -text
6
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
7
+ *.ot filter=lfs diff=lfs merge=lfs -text
8
+ *.onnx filter=lfs diff=lfs merge=lfs -text
9
+ *.arrow filter=lfs diff=lfs merge=lfs -text
10
+ *.ftz filter=lfs diff=lfs merge=lfs -text
11
+ *.joblib filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.pb filter=lfs diff=lfs merge=lfs -text
15
+ *.pt filter=lfs diff=lfs merge=lfs -text
16
+ *.pth filter=lfs diff=lfs merge=lfs -text
17
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,163 @@
1
+ ---
2
+ language:
3
+ - ar
4
+ license: apache-2.0
5
+ tags:
6
+ - automatic-speech-recognition
7
+ - hf-asr-leaderboard
8
+ - robust-speech-event
9
+ datasets:
10
+ - mozilla-foundation/common_voice_8_0
11
+ metrics:
12
+ - wer
13
+ - cer
14
+ model-index:
15
+ - name: Sinai Voice Arabic Speech Recognition Model
16
+ results:
17
+ - task:
18
+ type: automatic-speech-recognition
19
+ name: Speech Recognition
20
+ dataset:
21
+ type: mozilla-foundation/common_voice_8_0
22
+ name: Common Voice ar
23
+ args: ar
24
+ metrics:
25
+ - type: wer
26
+ value: 0.181
27
+ name: Test WER
28
+ - type: cer
29
+ value: 0.049
30
+ name: Test CER
31
+ - task:
32
+ name: Automatic Speech Recognition
33
+ type: automatic-speech-recognition
34
+ dataset:
35
+ name: Robust Speech Event - Dev Data
36
+ type: speech-recognition-community-v2/dev_data
37
+ args: ar
38
+ metrics:
39
+ - name: Test WER
40
+ type: wer
41
+ value: 93.03
42
+ - task:
43
+ name: Automatic Speech Recognition
44
+ type: automatic-speech-recognition
45
+ dataset:
46
+ name: Robust Speech Event - Test Data
47
+ type: speech-recognition-community-v2/eval_data
48
+ args: ar
49
+ metrics:
50
+ - name: Test WER
51
+ type: wer
52
+ value: 90.79
53
+ widget:
54
+ - example_title: Example 1
55
+ src: https://huggingface.co/bakrianoo/sinai-voice-ar-stt/raw/main/examples/common_voice_ar_19077324.mp3
56
+ - example_title: Example 2
57
+ src: https://huggingface.co/bakrianoo/sinai-voice-ar-stt/raw/main/examples/common_voice_ar_19205138.mp3
58
+ - example_title: Example 3
59
+ src: https://huggingface.co/bakrianoo/sinai-voice-ar-stt/raw/main/examples/common_voice_ar_19331711.mp3
60
+ ---
61
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
62
+ should probably proofread and complete it, then remove this comment. -->
63
+
64
+ # Sinai Voice Arabic Speech Recognition Model
65
+
66
+ # نموذج **صوت سيناء** للتعرف على الأصوات العربية الفصحى و تحويلها إلى نصوص (the **Sinai Voice** model for recognizing Modern Standard Arabic speech and converting it to text)
67
+
68
+ This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the Common Voice 8.0 Arabic dataset (`mozilla-foundation/common_voice_8_0`, config `ar`).
69
+ It achieves the following results on the evaluation set:
70
+ - Loss: 0.2141
71
+ - Wer: 0.1808
72
+
73
+ Detailed evaluation metrics:
74
+ - eval_loss = 0.2141
75
+ - eval_samples = 10388
76
+ - eval_wer = 0.181
77
+ - eval_cer = 0.049
78
+
79
+ #### Evaluation Commands
80
+ 1. To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`
81
+
82
+ ```bash
83
+ python eval.py --model_id bakrianoo/sinai-voice-ar-stt --dataset mozilla-foundation/common_voice_8_0 --config ar --split test
84
+ ```
85
+
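+ The script writes the aggregate scores to `mozilla-foundation_common_voice_8_0_ar_test_eval_results.txt` and, when run with `--log_outputs`, also dumps per-utterance predictions and references (the `log_*_predictions.txt` / `log_*_targets.txt` files in this repository). As a minimal sketch (assuming the two-lines-per-utterance format, index line then text line, that `eval.py` produces), the scores can be recomputed from those logs:
+
+ ```python
+ from datasets import load_metric  # same metric loader eval.py uses
+
+ def read_log(path):
+     # eval.py writes two lines per utterance: the example index, then the text
+     with open(path, encoding="utf-8") as f:
+         lines = [line.rstrip("\n") for line in f]
+     return lines[1::2]
+
+ predictions = read_log("log_mozilla-foundation_common_voice_8_0_ar_test_predictions.txt")
+ references = read_log("log_mozilla-foundation_common_voice_8_0_ar_test_targets.txt")
+
+ print("WER:", load_metric("wer").compute(predictions=predictions, references=references))
+ print("CER:", load_metric("cer").compute(predictions=predictions, references=references))
+ ```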
86
+
87
+ ### Inference Without LM
88
+
89
+ ```python
90
+ from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
91
+ import torchaudio
92
+ import torch
93
+
94
+ def speech_file_to_array_fn(voice_path, resampling_to=16000):
95
+     # load the audio and resample it to the 16 kHz rate the model expects
96
+     speech_array, sampling_rate = torchaudio.load(voice_path)
97
+     resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
98
+     return resampler(speech_array)[0].numpy(), resampling_to
99
+
100
+ # load the model
101
+ cp = "bakrianoo/sinai-voice-ar-stt"
102
+ processor = Wav2Vec2Processor.from_pretrained(cp)
103
+ model = Wav2Vec2ForCTC.from_pretrained(cp)
104
+
105
+ # recognize the text in a sample sound file
106
+ sound_path = './my_voice.mp3'
107
+
108
+ sample, sr = speech_file_to_array_fn(sound_path)
109
+ inputs = processor([sample], sampling_rate=sr, return_tensors="pt", padding=True)
110
+
111
+ with torch.no_grad():
112
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
113
+
114
+ predicted_ids = torch.argmax(logits, dim=-1)
115
+
116
+ print("Prediction:", processor.batch_decode(predicted_ids))
117
+ ```
118
+
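+ The same checkpoint can also be loaded through the 🤗 `pipeline` API, which handles resampling and CTC decoding internally (this mirrors how `eval.py` runs the model). A minimal sketch, assuming a local audio file path:
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ # hypothetical local file; decoding .mp3 input requires ffmpeg
+ sound_path = "./my_voice.mp3"
+
+ asr = pipeline(
+     "automatic-speech-recognition",
+     model="bakrianoo/sinai-voice-ar-stt",
+     device=0 if torch.cuda.is_available() else -1,
+ )
+
+ print("Prediction:", asr(sound_path)["text"])
+ ```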
119
+ ### Training hyperparameters
120
+
121
+ The following hyperparameters were used during training (a `TrainingArguments` sketch follows the list):
122
+ - learning_rate: 0.0002
123
+ - train_batch_size: 32
124
+ - eval_batch_size: 10
125
+ - seed: 42
126
+ - distributed_type: multi-GPU
127
+ - num_devices: 8
128
+ - total_train_batch_size: 256
129
+ - total_eval_batch_size: 80
130
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
131
+ - lr_scheduler_type: linear
132
+ - lr_scheduler_warmup_steps: 1000
133
+ - num_epochs: 10
134
+ - mixed_precision_training: Native AMP
135
+
136
+
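+ As a rough sketch (not the exact launch command), these values map onto the standard 🤗 `TrainingArguments` consumed by the bundled `run_speech_recognition_ctc_bnb.py`; the totals of 256/80 are the per-device batch sizes multiplied by the 8 GPUs:
+
+ ```python
+ from transformers import TrainingArguments
+
+ # assumed Trainer argument names; the author's actual launch flags may have differed
+ training_args = TrainingArguments(
+     output_dir="./sinai-voice-ar-stt",  # hypothetical output path
+     learning_rate=2e-4,
+     per_device_train_batch_size=32,     # x 8 GPUs -> total 256
+     per_device_eval_batch_size=10,      # x 8 GPUs -> total 80
+     seed=42,
+     lr_scheduler_type="linear",
+     warmup_steps=1000,
+     num_train_epochs=10,
+     fp16=True,                          # "Native AMP" mixed precision
+ )
+ ```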
137
+ ### Training results
138
+
139
+ | Training Loss | Epoch | Step | Validation Loss | Wer |
140
+ |:-------------:|:-----:|:-----:|:---------------:|:------:|
141
+ | 1.354 | 0.64 | 1000 | 0.4109 | 0.4493 |
142
+ | 0.5886 | 1.28 | 2000 | 0.2798 | 0.3099 |
143
+ | 0.4977 | 1.92 | 3000 | 0.2387 | 0.2673 |
144
+ | 0.4253 | 2.56 | 4000 | 0.2266 | 0.2523 |
145
+ | 0.3942 | 3.2 | 5000 | 0.2171 | 0.2437 |
146
+ | 0.3619 | 3.84 | 6000 | 0.2076 | 0.2253 |
147
+ | 0.3245 | 4.48 | 7000 | 0.2088 | 0.2186 |
148
+ | 0.308 | 5.12 | 8000 | 0.2086 | 0.2206 |
149
+ | 0.2881 | 5.76 | 9000 | 0.2089 | 0.2105 |
150
+ | 0.2557 | 6.4 | 10000 | 0.2015 | 0.2004 |
151
+ | 0.248 | 7.04 | 11000 | 0.2044 | 0.1953 |
152
+ | 0.2251 | 7.68 | 12000 | 0.2058 | 0.1932 |
153
+ | 0.2052 | 8.32 | 13000 | 0.2117 | 0.1878 |
154
+ | 0.1976 | 8.96 | 14000 | 0.2104 | 0.1825 |
155
+ | 0.1845 | 9.6 | 15000 | 0.2156 | 0.1821 |
156
+
157
+
158
+ ### Framework versions
159
+
160
+ - Transformers 4.16.2
161
+ - Pytorch 1.10.2+cu113
162
+ - Datasets 1.18.3
163
+ - Tokenizers 0.11.0
added_tokens.json ADDED
@@ -0,0 +1 @@
1
+ {"<s>": 44, "</s>": 45}
all_results.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "epoch": 10.0,
3
+ "eval_loss": 0.21412786841392517,
4
+ "eval_runtime": 70.9089,
5
+ "eval_samples": 10388,
6
+ "eval_samples_per_second": 146.498,
7
+ "eval_steps_per_second": 1.833,
8
+ "eval_wer": 0.18078979457836977,
9
+ "train_loss": 0.1316310991176183,
10
+ "train_runtime": 23113.6031,
11
+ "train_samples": 399991,
12
+ "train_samples_per_second": 173.054,
13
+ "train_steps_per_second": 0.676
14
+ }
config.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "_name_or_path": "facebook/wav2vec2-xls-r-300m",
3
+ "activation_dropout": 0.0,
4
+ "adapter_kernel_size": 3,
5
+ "adapter_stride": 2,
6
+ "add_adapter": false,
7
+ "apply_spec_augment": true,
8
+ "architectures": [
9
+ "Wav2Vec2ForCTC"
10
+ ],
11
+ "attention_dropout": 0.0,
12
+ "bos_token_id": 1,
13
+ "classifier_proj_size": 256,
14
+ "codevector_dim": 768,
15
+ "contrastive_logits_temperature": 0.1,
16
+ "conv_bias": true,
17
+ "conv_dim": [
18
+ 512,
19
+ 512,
20
+ 512,
21
+ 512,
22
+ 512,
23
+ 512,
24
+ 512
25
+ ],
26
+ "conv_kernel": [
27
+ 10,
28
+ 3,
29
+ 3,
30
+ 3,
31
+ 3,
32
+ 2,
33
+ 2
34
+ ],
35
+ "conv_stride": [
36
+ 5,
37
+ 2,
38
+ 2,
39
+ 2,
40
+ 2,
41
+ 2,
42
+ 2
43
+ ],
44
+ "ctc_loss_reduction": "mean",
45
+ "ctc_zero_infinity": false,
46
+ "diversity_loss_weight": 0.1,
47
+ "do_stable_layer_norm": true,
48
+ "eos_token_id": 2,
49
+ "feat_extract_activation": "gelu",
50
+ "feat_extract_dropout": 0.0,
51
+ "feat_extract_norm": "layer",
52
+ "feat_proj_dropout": 0.0,
53
+ "feat_quantizer_dropout": 0.0,
54
+ "final_dropout": 0.0,
55
+ "hidden_act": "gelu",
56
+ "hidden_dropout": 0.0,
57
+ "hidden_size": 1024,
58
+ "initializer_range": 0.02,
59
+ "intermediate_size": 4096,
60
+ "layer_norm_eps": 1e-05,
61
+ "layerdrop": 0.0,
62
+ "mask_feature_length": 10,
63
+ "mask_feature_min_masks": 0,
64
+ "mask_feature_prob": 0.0,
65
+ "mask_time_length": 10,
66
+ "mask_time_min_masks": 2,
67
+ "mask_time_prob": 0.05,
68
+ "model_type": "wav2vec2",
69
+ "num_adapter_layers": 3,
70
+ "num_attention_heads": 16,
71
+ "num_codevector_groups": 2,
72
+ "num_codevectors_per_group": 320,
73
+ "num_conv_pos_embedding_groups": 16,
74
+ "num_conv_pos_embeddings": 128,
75
+ "num_feat_extract_layers": 7,
76
+ "num_hidden_layers": 24,
77
+ "num_negatives": 100,
78
+ "output_hidden_size": 1024,
79
+ "pad_token_id": 43,
80
+ "proj_codevector_dim": 768,
81
+ "tdnn_dilation": [
82
+ 1,
83
+ 2,
84
+ 3,
85
+ 1,
86
+ 1
87
+ ],
88
+ "tdnn_dim": [
89
+ 512,
90
+ 512,
91
+ 512,
92
+ 512,
93
+ 1500
94
+ ],
95
+ "tdnn_kernel": [
96
+ 5,
97
+ 3,
98
+ 3,
99
+ 1,
100
+ 1
101
+ ],
102
+ "torch_dtype": "float32",
103
+ "transformers_version": "4.16.2",
104
+ "use_weighted_layer_sum": false,
105
+ "vocab_size": 46,
106
+ "xvector_output_dim": 512
107
+ }
eval.py ADDED
@@ -0,0 +1,137 @@
1
+ #!/usr/bin/env python3
2
+ import argparse
3
+ import re
4
+ from typing import Dict
5
+
6
+ import torch
7
+ from datasets import Audio, Dataset, load_dataset, load_metric
8
+
9
+ from transformers import AutoFeatureExtractor, pipeline
10
+
11
+
12
+ def log_results(result: Dataset, args: Dict[str, str]):
13
+ """DO NOT CHANGE. This function computes and logs the result metrics."""
14
+
15
+ log_outputs = args.log_outputs
16
+ dataset_id = "_".join(args.dataset.split("/") + [args.config, args.split])
17
+
18
+ # load metric
19
+ wer = load_metric("wer")
20
+ cer = load_metric("cer")
21
+
22
+ # compute metrics
23
+ wer_result = wer.compute(references=result["target"], predictions=result["prediction"])
24
+ cer_result = cer.compute(references=result["target"], predictions=result["prediction"])
25
+
26
+ # print & log results
27
+ result_str = f"WER: {wer_result}\n" f"CER: {cer_result}"
28
+ print(result_str)
29
+
30
+ with open(f"{dataset_id}_eval_results.txt", "w") as f:
31
+ f.write(result_str)
32
+
33
+ # log all results in text file. Possibly interesting for analysis
34
+ if log_outputs is not None:
35
+ pred_file = f"log_{dataset_id}_predictions.txt"
36
+ target_file = f"log_{dataset_id}_targets.txt"
37
+
38
+ with open(pred_file, "w") as p, open(target_file, "w") as t:
39
+
40
+ # mapping function to write output
41
+ def write_to_file(batch, i):
42
+ p.write(f"{i}" + "\n")
43
+ p.write(batch["prediction"] + "\n")
44
+ t.write(f"{i}" + "\n")
45
+ t.write(batch["target"] + "\n")
46
+
47
+ result.map(write_to_file, with_indices=True)
48
+
49
+
50
+ def normalize_text(text: str) -> str:
51
+ """ADAPT TO YOUR USE CASE. This function normalizes the target text."""
52
+
53
+ chars_to_ignore_regex = '[zx.rﺃ“—`»NٍqAُ«☭ﻻْۛjQ,R?IDdٌOwemھa\'cۙMJ:”ًکWXZ؛;(ۘ…P)YCFٰۗsiۖklSng–fh\-Ep!ٓLVِۚBtyUTKHڨvbuGچَ؟]'
54
+
55
+ text = re.sub(chars_to_ignore_regex, "", text.lower())
56
+
57
+ # In addition, we can normalize the target text, e.g. removing newline characters etc.
58
+ # note that order is important here!
59
+ token_sequences_to_ignore = ["\n\n", "\n", " ", " "]
60
+
61
+ for t in token_sequences_to_ignore:
62
+ text = " ".join(text.split(t))
63
+
64
+ return text
65
+
66
+
67
+ def main(args):
68
+ # load dataset
69
+ dataset = load_dataset(args.dataset, args.config, split=args.split, use_auth_token=True)
70
+
71
+ # for testing: only process the first two examples as a test
72
+ # dataset = dataset.select(range(10))
73
+
74
+ # load processor
75
+ feature_extractor = AutoFeatureExtractor.from_pretrained(args.model_id)
76
+ sampling_rate = feature_extractor.sampling_rate
77
+
78
+ # resample audio
79
+ dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
80
+
81
+ # load eval pipeline
82
+ if args.device is None:
83
+ args.device = 0 if torch.cuda.is_available() else -1
84
+ asr = pipeline("automatic-speech-recognition", model=args.model_id, device=args.device)
85
+
86
+ # map function to decode audio
87
+ def map_to_pred(batch):
88
+ prediction = asr(
89
+ batch["audio"]["array"], chunk_length_s=args.chunk_length_s, stride_length_s=args.stride_length_s
90
+ )
91
+
92
+ batch["prediction"] = prediction["text"]
93
+ batch["target"] = normalize_text(batch["sentence"])
94
+ return batch
95
+
96
+ # run inference on all examples
97
+ result = dataset.map(map_to_pred, remove_columns=dataset.column_names, batch_size=5)
98
+
99
+ # compute and log_results
100
+ # do not change function below
101
+ log_results(result, args)
102
+
103
+
104
+ if __name__ == "__main__":
105
+ parser = argparse.ArgumentParser()
106
+
107
+ parser.add_argument(
108
+ "--model_id", type=str, required=True, help="Model identifier. Should be loadable with 🤗 Transformers"
109
+ )
110
+ parser.add_argument(
111
+ "--dataset",
112
+ type=str,
113
+ required=True,
114
+ help="Dataset name to evaluate the `model_id`. Should be loadable with 🤗 Datasets",
115
+ )
116
+ parser.add_argument(
117
+ "--config", type=str, required=True, help="Config of the dataset. *E.g.* `'en'` for Common Voice"
118
+ )
119
+ parser.add_argument("--split", type=str, required=True, help="Split of the dataset. *E.g.* `'test'`")
120
+ parser.add_argument(
121
+ "--chunk_length_s", type=float, default=None, help="Chunk length in seconds. Defaults to 5 seconds."
122
+ )
123
+ parser.add_argument(
124
+ "--stride_length_s", type=float, default=None, help="Stride of the audio chunks. Defaults to 1 second."
125
+ )
126
+ parser.add_argument(
127
+ "--log_outputs", action="store_true", help="If defined, write outputs to log file for analysis."
128
+ )
129
+ parser.add_argument(
130
+ "--device",
131
+ type=int,
132
+ default=None,
133
+ help="The device to run the pipeline on. -1 for CPU (default), 0 for the first GPU and so on.",
134
+ )
135
+ args = parser.parse_args()
136
+
137
+ main(args)
eval_results.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "epoch": 10.0,
3
+ "eval_loss": 0.21412786841392517,
4
+ "eval_runtime": 70.9089,
5
+ "eval_samples": 10388,
6
+ "eval_samples_per_second": 146.498,
7
+ "eval_steps_per_second": 1.833,
8
+ "eval_wer": 0.18078979457836977
9
+ }
examples/common_voice_ar_19077324.mp3 ADDED
Binary file (34.1 kB).
 
examples/common_voice_ar_19205138.mp3 ADDED
Binary file (29 kB).
 
examples/common_voice_ar_19331711.mp3 ADDED
Binary file (21.4 kB).
 
log_mozilla-foundation_common_voice_8_0_ar_test_predictions.txt ADDED
The diff for this file is too large to render.
 
log_mozilla-foundation_common_voice_8_0_ar_test_targets.txt ADDED
The diff for this file is too large to render.
 
mozilla-foundation_common_voice_8_0_ar_test_eval_results.txt ADDED
@@ -0,0 +1,2 @@
1
+ WER: 0.18172268907563024
2
+ CER: 0.04875182561226061
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "do_normalize": true,
3
+ "feature_extractor_type": "Wav2Vec2FeatureExtractor",
4
+ "feature_size": 1,
5
+ "padding_side": "right",
6
+ "padding_value": 0,
7
+ "return_attention_mask": true,
8
+ "sampling_rate": 16000
9
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:588e6341d51008b353be1115b1e1e34d86bad4f676b32277cba57e5f7cff526a
3
+ size 1262112241
run_speech_recognition_ctc_bnb.py ADDED
@@ -0,0 +1,754 @@
1
+ #!/usr/bin/env python
2
+ # coding=utf-8
3
+ # Copyright 2021 The HuggingFace Inc. team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ Fine-tuning a 🤗 Transformers CTC model for automatic speech recognition"""
17
+
18
+ import functools
19
+ import json
20
+ import logging
21
+ import os
22
+ import re
23
+ import sys
24
+ import warnings
25
+ from dataclasses import dataclass, field
26
+ from typing import Dict, List, Optional, Union
27
+
28
+ import datasets
29
+ import numpy as np
30
+ import torch
31
+ from datasets import DatasetDict, load_dataset, load_metric
32
+
33
+ import bitsandbytes as bnb
34
+ import transformers
35
+ from transformers import (
36
+ AutoConfig,
37
+ AutoFeatureExtractor,
38
+ AutoModelForCTC,
39
+ AutoProcessor,
40
+ AutoTokenizer,
41
+ HfArgumentParser,
42
+ Trainer,
43
+ TrainingArguments,
44
+ Wav2Vec2Processor,
45
+ set_seed,
46
+ )
47
+ from transformers.trainer_pt_utils import get_parameter_names
48
+ from transformers.trainer_utils import get_last_checkpoint, is_main_process
49
+ from transformers.utils import check_min_version
50
+ from transformers.utils.versions import require_version
51
+
52
+ logger = logging.getLogger(__name__)
53
+
54
+
55
+ def list_field(default=None, metadata=None):
56
+ return field(default_factory=lambda: default, metadata=metadata)
57
+
58
+
59
+ @dataclass
60
+ class ModelArguments:
61
+ """
62
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
63
+ """
64
+
65
+ model_name_or_path: str = field(
66
+ metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
67
+ )
68
+ tokenizer_name_or_path: Optional[str] = field(
69
+ default=None,
70
+ metadata={"help": "Path to pretrained tokenizer or tokenizer identifier from huggingface.co/models"},
71
+ )
72
+ cache_dir: Optional[str] = field(
73
+ default=None,
74
+ metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
75
+ )
76
+ freeze_feature_encoder: bool = field(
77
+ default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
78
+ )
79
+ attention_dropout: float = field(
80
+ default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
81
+ )
82
+ activation_dropout: float = field(
83
+ default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
84
+ )
85
+ feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
86
+ hidden_dropout: float = field(
87
+ default=0.0,
88
+ metadata={
89
+ "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
90
+ },
91
+ )
92
+ final_dropout: float = field(
93
+ default=0.0,
94
+ metadata={"help": "The dropout probability for the final projection layer."},
95
+ )
96
+ mask_time_prob: float = field(
97
+ default=0.05,
98
+ metadata={
99
+ "help": "Probability of each feature vector along the time axis to be chosen as the start of the vector"
100
+ "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature"
101
+ "vectors will be masked along the time axis."
102
+ },
103
+ )
104
+ mask_time_length: int = field(
105
+ default=10,
106
+ metadata={"help": "Length of vector span to mask along the time axis."},
107
+ )
108
+ mask_feature_prob: float = field(
109
+ default=0.0,
110
+ metadata={
111
+ "help": "Probability of each feature vector along the feature axis to be chosen as the start of the vector"
112
+ "span to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature bins will be masked along the time axis."
113
+ },
114
+ )
115
+ mask_feature_length: int = field(
116
+ default=10,
117
+ metadata={"help": "Length of vector span to mask along the feature axis."},
118
+ )
119
+ layerdrop: float = field(default=0.0, metadata={"help": "The LayerDrop probability."})
120
+ ctc_loss_reduction: Optional[str] = field(
121
+ default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."}
122
+ )
123
+
124
+
125
+ @dataclass
126
+ class DataTrainingArguments:
127
+ """
128
+ Arguments pertaining to what data we are going to input our model for training and eval.
129
+
130
+ Using `HfArgumentParser` we can turn this class
131
+ into argparse arguments to be able to specify them on
132
+ the command line.
133
+ """
134
+
135
+ dataset_name: str = field(
136
+ metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
137
+ )
138
+ dataset_config_name: str = field(
139
+ default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
140
+ )
141
+ train_split_name: str = field(
142
+ default="train+validation",
143
+ metadata={
144
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train+validation'"
145
+ },
146
+ )
147
+ eval_split_name: str = field(
148
+ default="test",
149
+ metadata={
150
+ "help": "The name of the evaluation data set split to use (via the datasets library). Defaults to 'test'"
151
+ },
152
+ )
153
+ audio_column_name: str = field(
154
+ default="audio",
155
+ metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
156
+ )
157
+ text_column_name: str = field(
158
+ default="text",
159
+ metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"},
160
+ )
161
+ overwrite_cache: bool = field(
162
+ default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
163
+ )
164
+ preprocessing_num_workers: Optional[int] = field(
165
+ default=None,
166
+ metadata={"help": "The number of processes to use for the preprocessing."},
167
+ )
168
+ max_train_samples: Optional[int] = field(
169
+ default=None,
170
+ metadata={
171
+ "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
172
+ "value if set."
173
+ },
174
+ )
175
+ max_eval_samples: Optional[int] = field(
176
+ default=None,
177
+ metadata={
178
+ "help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
179
+ "value if set."
180
+ },
181
+ )
182
+ chars_to_ignore: Optional[List[str]] = list_field(
183
+ default=['چ', 'y', 'ۗ', 'n', 'J', 'C', 'K', 'V', 'g', ';', 'M', '?', 'u', 'S', 'ٌ', 'h', 'ً', '“', 'ۛ', 'r', 'P', '–', 'ﻻ', 'W', 'p', "'", 'o', 'Z', 'ۘ', 'ٰ', 'D', 'B', 'U', 'ﺃ', 'E', 'a', '»', '(', 'X', 'f', 'َ', '\\', 'l', 'x', 'v', 'ۖ', 'w', '”', 'ٍ', 'F', 'j', 'H', '…', '`', 'ڨ', 'O', ',', 'q', 'A', 'ِ', 'ٓ', '!', '؛', 'I', 't', 'ک', 'z', 'k', 's', '؟', 'd', 'G', 'ۚ', 'T', '—', 'R', ')', '«', 'Q', '☭', 'L', 'N', '-', 'Y', 'e', '.', 'c', ':', 'i', 'm', 'ُ', 'ۙ', 'ْ', 'b', 'ھ'],
184
+ metadata={"help": "A list of characters to remove from the transcripts."},
185
+ )
186
+ eval_metrics: List[str] = list_field(
187
+ default=["wer"],
188
+ metadata={"help": "A list of metrics the model should be evaluated on. E.g. `'wer cer'`"},
189
+ )
190
+ max_duration_in_seconds: float = field(
191
+ default=20.0,
192
+ metadata={
193
+ "help": "Filter out audio files that are longer than `max_duration_in_seconds` seconds."
194
+ },
195
+ )
196
+ min_duration_in_seconds: float = field(
197
+ default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}
198
+ )
199
+ preprocessing_only: bool = field(
200
+ default=False,
201
+ metadata={
202
+ "help": "Whether to only do data preprocessing and skip training. "
203
+ "This is especially useful when data preprocessing errors out in distributed training due to timeout. "
204
+ "In this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` "
205
+ "so that the cached datasets can consequently be loaded in distributed training"
206
+ },
207
+ )
208
+ use_auth_token: bool = field(
209
+ default=False,
210
+ metadata={
211
+ "help": "If :obj:`True`, will use the token generated when running"
212
+ ":obj:`transformers-cli login` as HTTP bearer authorization for remote files."
213
+ },
214
+ )
215
+ unk_token: str = field(
216
+ default="[UNK]",
217
+ metadata={"help": "The unk token for the tokenizer"},
218
+ )
219
+ pad_token: str = field(
220
+ default="[PAD]",
221
+ metadata={"help": "The padding token for the tokenizer"},
222
+ )
223
+ word_delimiter_token: str = field(
224
+ default="|",
225
+ metadata={"help": "The word delimiter token for the tokenizer"},
226
+ )
227
+ phoneme_language: Optional[str] = field(
228
+ default=None,
229
+ metadata={
230
+ "help": "The target language that should be"
231
+ " passed to the tokenizer for tokenization. Note that"
232
+ " this is only relevant if the model classifies the"
233
+ " input audio to a sequence of phoneme sequences."
234
+ },
235
+ )
236
+
237
+
238
+ @dataclass
239
+ class DataCollatorCTCWithPadding:
240
+ """
241
+ Data collator that will dynamically pad the inputs received.
242
+ Args:
243
+ processor (:class:`~transformers.AutoProcessor`)
244
+ The processor used for processing the data.
245
+ padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
246
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
247
+ among:
248
+ * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
249
+ sequence is provided).
250
+ * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
251
+ maximum acceptable input length for the model if that argument is not provided.
252
+ * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
253
+ different lengths).
254
+ max_length (:obj:`int`, `optional`):
255
+ Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
256
+ max_length_labels (:obj:`int`, `optional`):
257
+ Maximum length of the ``labels`` returned list and optionally padding length (see above).
258
+ pad_to_multiple_of (:obj:`int`, `optional`):
259
+ If set will pad the sequence to a multiple of the provided value.
260
+ This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
261
+ 7.5 (Volta).
262
+ """
263
+
264
+ processor: AutoProcessor
265
+ padding: Union[bool, str] = "longest"
266
+ pad_to_multiple_of: Optional[int] = None
267
+ pad_to_multiple_of_labels: Optional[int] = None
268
+
269
+ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
270
+ # split inputs and labels since they have to be of different lengths and need
271
+ # different padding methods
272
+ input_features = [{"input_values": feature["input_values"]} for feature in features]
273
+ label_features = [{"input_ids": feature["labels"]} for feature in features]
274
+
275
+ batch = self.processor.pad(
276
+ input_features,
277
+ padding=self.padding,
278
+ pad_to_multiple_of=self.pad_to_multiple_of,
279
+ return_tensors="pt",
280
+ )
281
+
282
+ with self.processor.as_target_processor():
283
+ labels_batch = self.processor.pad(
284
+ label_features,
285
+ padding=self.padding,
286
+ pad_to_multiple_of=self.pad_to_multiple_of_labels,
287
+ return_tensors="pt",
288
+ )
289
+
290
+ # replace padding with -100 to ignore loss correctly
291
+ labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
292
+
293
+ batch["labels"] = labels
294
+
295
+ return batch
296
+
297
+
298
+ def create_vocabulary_from_data(
299
+ datasets: DatasetDict,
300
+ word_delimiter_token: Optional[str] = None,
301
+ unk_token: Optional[str] = None,
302
+ pad_token: Optional[str] = None,
303
+ ):
304
+ # Given training and test labels create vocabulary
305
+ def extract_all_chars(batch):
306
+ all_text = " ".join(batch["target_text"])
307
+ vocab = list(set(all_text))
308
+ return {"vocab": [vocab], "all_text": [all_text]}
309
+
310
+ vocabs = datasets.map(
311
+ extract_all_chars,
312
+ batched=True,
313
+ batch_size=-1,
314
+ keep_in_memory=True,
315
+ remove_columns=datasets["train"].column_names,
316
+ )
317
+
318
+ # take union of all unique characters in each dataset
319
+ vocab_set = functools.reduce(
320
+ lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
321
+ )
322
+
323
+ vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}
324
+
325
+ # replace white space with delimiter token
326
+ if word_delimiter_token is not None:
327
+ vocab_dict[word_delimiter_token] = vocab_dict[" "]
328
+ del vocab_dict[" "]
329
+
330
+ # add unk and pad token
331
+ if unk_token is not None:
332
+ vocab_dict[unk_token] = len(vocab_dict)
333
+
334
+ if pad_token is not None:
335
+ vocab_dict[pad_token] = len(vocab_dict)
336
+
337
+ return vocab_dict
338
+
339
+
340
+ def main():
341
+ # See all possible arguments in src/transformers/training_args.py
342
+ # or by passing the --help flag to this script.
343
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
344
+
345
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
346
+ if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
347
+ # If we pass only one argument to the script and it's the path to a json file,
348
+ # let's parse it to get our arguments.
349
+ model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
350
+ else:
351
+ model_args, data_args, training_args = parser.parse_args_into_dataclasses()
352
+
353
+ # Detecting last checkpoint.
354
+ last_checkpoint = None
355
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
356
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
357
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
358
+ raise ValueError(
359
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
360
+ "Use --overwrite_output_dir to overcome."
361
+ )
362
+ elif last_checkpoint is not None:
363
+ logger.info(
364
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
365
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
366
+ )
367
+
368
+ # Setup logging
369
+ logging.basicConfig(
370
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
371
+ datefmt="%m/%d/%Y %H:%M:%S",
372
+ handlers=[logging.StreamHandler(sys.stdout)],
373
+ )
374
+ logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
375
+
376
+ # Log on each process the small summary:
377
+ logger.warning(
378
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
379
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
380
+ )
381
+ # Set the verbosity to info of the Transformers logger (on main process only):
382
+ if is_main_process(training_args.local_rank):
383
+ transformers.utils.logging.set_verbosity_info()
384
+ logger.info("Training/evaluation parameters %s", training_args)
385
+
386
+ # Set seed before initializing model.
387
+ set_seed(training_args.seed)
388
+
389
+ # 1. First, let's load the dataset
390
+ raw_datasets = DatasetDict()
391
+
392
+ if training_args.do_train:
393
+ raw_datasets["train"] = load_dataset(
394
+ data_args.dataset_name,
395
+ data_args.dataset_config_name,
396
+ split=data_args.train_split_name,
397
+ use_auth_token=data_args.use_auth_token,
398
+ )
399
+
400
+ if data_args.audio_column_name not in raw_datasets["train"].column_names:
401
+ raise ValueError(
402
+ f"--audio_column_name '{data_args.audio_column_name}' not found in dataset '{data_args.dataset_name}'. "
403
+ "Make sure to set `--audio_column_name` to the correct audio column - one of "
404
+ f"{', '.join(raw_datasets['train'].column_names)}."
405
+ )
406
+
407
+ if data_args.text_column_name not in raw_datasets["train"].column_names:
408
+ raise ValueError(
409
+ f"--text_column_name {data_args.text_column_name} not found in dataset '{data_args.dataset_name}'. "
410
+ "Make sure to set `--text_column_name` to the correct text column - one of "
411
+ f"{', '.join(raw_datasets['train'].column_names)}."
412
+ )
413
+
414
+ if data_args.max_train_samples is not None:
415
+ raw_datasets["train"] = raw_datasets["train"].select(range(data_args.max_train_samples))
416
+
417
+ if training_args.do_eval:
418
+ raw_datasets["eval"] = load_dataset(
419
+ data_args.dataset_name,
420
+ data_args.dataset_config_name,
421
+ split=data_args.eval_split_name,
422
+ use_auth_token=data_args.use_auth_token,
423
+ )
424
+
425
+ if data_args.max_eval_samples is not None:
426
+ raw_datasets["eval"] = raw_datasets["eval"].select(range(data_args.max_eval_samples))
427
+
428
+ # 2. We remove some special characters from the datasets
429
+ # that make training complicated and do not help in transcribing the speech
430
+ # E.g. characters, such as `,` and `.` do not really have an acoustic characteristic
431
+ # that could be easily picked up by the model
432
+ chars_to_ignore_regex = (
433
+ f'[{"".join(data_args.chars_to_ignore)}]' if data_args.chars_to_ignore is not None else None
434
+ )
435
+ text_column_name = data_args.text_column_name
436
+
437
+ def remove_special_characters(batch):
438
+ if chars_to_ignore_regex is not None:
439
+ batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " "
440
+ else:
441
+ batch["target_text"] = batch[text_column_name].lower() + " "
442
+ return batch
443
+
444
+ with training_args.main_process_first(desc="dataset map special characters removal"):
445
+ raw_datasets = raw_datasets.map(
446
+ remove_special_characters,
447
+ remove_columns=[text_column_name],
448
+ desc="remove special characters from datasets",
449
+ )
450
+
451
+ # save special tokens for tokenizer
452
+ word_delimiter_token = data_args.word_delimiter_token
453
+ unk_token = data_args.unk_token
454
+ pad_token = data_args.pad_token
455
+
456
+ # 3. Next, let's load the config as we might need it to create
457
+ # the tokenizer
458
+ # load config
459
+ config = AutoConfig.from_pretrained(
460
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
461
+ )
462
+
463
+ # 4. Next, if no tokenizer file is defined,
464
+ # we create the vocabulary of the model by extracting all unique characters from
465
+ # the training and evaluation datasets
466
+ # We need to make sure that only first rank saves vocabulary
467
+ # make sure all processes wait until vocab is created
468
+ tokenizer_name_or_path = model_args.tokenizer_name_or_path
469
+ tokenizer_kwargs = {}
470
+ if tokenizer_name_or_path is None:
471
+ # save vocab in training output dir
472
+ tokenizer_name_or_path = training_args.output_dir
473
+
474
+ vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")
475
+
476
+ with training_args.main_process_first():
477
+ if training_args.overwrite_output_dir and os.path.isfile(vocab_file):
478
+ os.remove(vocab_file)
479
+
480
+ with training_args.main_process_first(desc="dataset map vocabulary creation"):
481
+ if not os.path.isfile(vocab_file):
482
+ os.makedirs(tokenizer_name_or_path, exist_ok=True)
483
+ vocab_dict = create_vocabulary_from_data(
484
+ raw_datasets,
485
+ word_delimiter_token=word_delimiter_token,
486
+ unk_token=unk_token,
487
+ pad_token=pad_token,
488
+ )
489
+
490
+ # save vocab dict to be loaded into tokenizer
491
+ with open(vocab_file, "w") as file:
492
+ json.dump(vocab_dict, file)
493
+
494
+ # if tokenizer has just been created
495
+ # it is defined by `tokenizer_class` if present in config else by `model_type`
496
+ tokenizer_kwargs = {
497
+ "config": config if config.tokenizer_class is not None else None,
498
+ "tokenizer_type": config.model_type if config.tokenizer_class is None else None,
499
+ "unk_token": unk_token,
500
+ "pad_token": pad_token,
501
+ "word_delimiter_token": word_delimiter_token,
502
+ }
503
+
504
+ # 5. Now we can instantiate the feature extractor, tokenizer and model
505
+ # Note for distributed training, the .from_pretrained methods guarantee that only
506
+ # one local process can concurrently download model & vocab.
507
+
508
+ # load feature_extractor and tokenizer
509
+ tokenizer = AutoTokenizer.from_pretrained(
510
+ tokenizer_name_or_path,
511
+ use_auth_token=data_args.use_auth_token,
512
+ **tokenizer_kwargs,
513
+ )
514
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
515
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
516
+ )
517
+
518
+ # adapt config
519
+ config.update(
520
+ {
521
+ "feat_proj_dropout": model_args.feat_proj_dropout,
522
+ "attention_dropout": model_args.attention_dropout,
523
+ "hidden_dropout": model_args.hidden_dropout,
524
+ "final_dropout": model_args.final_dropout,
525
+ "mask_time_prob": model_args.mask_time_prob,
526
+ "mask_time_length": model_args.mask_time_length,
527
+ "mask_feature_prob": model_args.mask_feature_prob,
528
+ "mask_feature_length": model_args.mask_feature_length,
529
+ "gradient_checkpointing": training_args.gradient_checkpointing,
530
+ "layerdrop": model_args.layerdrop,
531
+ "ctc_loss_reduction": model_args.ctc_loss_reduction,
532
+ "pad_token_id": tokenizer.pad_token_id,
533
+ "vocab_size": len(tokenizer),
534
+ "activation_dropout": model_args.activation_dropout,
535
+ }
536
+ )
537
+
538
+ # create model
539
+ model = AutoModelForCTC.from_pretrained(
540
+ model_args.model_name_or_path,
541
+ cache_dir=model_args.cache_dir,
542
+ config=config,
543
+ use_auth_token=data_args.use_auth_token,
544
+ )
545
+
546
+ # freeze encoder
547
+ if model_args.freeze_feature_encoder:
548
+ model.freeze_feature_encoder()
549
+
550
+ # 6. Now we preprocess the datasets including loading the audio, resampling and normalization
551
+ # Thankfully, `datasets` takes care of automatically loading and resampling the audio,
552
+ # so that we just need to set the correct target sampling rate and normalize the input
553
+ # via the `feature_extractor`
554
+
555
+ # make sure that dataset decodes audio with correct sampling rate
556
+ dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate
557
+ if dataset_sampling_rate != feature_extractor.sampling_rate:
558
+ raw_datasets = raw_datasets.cast_column(
559
+ data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
560
+ )
561
+
562
+ # derive max & min input length for sample rate & max duration
563
+ max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate
564
+ min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate
565
+ audio_column_name = data_args.audio_column_name
566
+ num_workers = data_args.preprocessing_num_workers
567
+
568
+ # `phoneme_language` is only relevant if the model is fine-tuned on phoneme classification
569
+ phoneme_language = data_args.phoneme_language
570
+
571
+ # Preprocessing the datasets.
572
+ # We need to read the audio files as arrays and tokenize the targets.
573
+ def prepare_dataset(batch):
574
+ # load audio
575
+ sample = batch[audio_column_name]
576
+
577
+ inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
578
+ batch["input_values"] = inputs.input_values[0]
579
+ batch["input_length"] = len(batch["input_values"])
580
+
581
+ # encode targets
582
+ additional_kwargs = {}
583
+ if phoneme_language is not None:
584
+ additional_kwargs["phonemizer_lang"] = phoneme_language
585
+
586
+ batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids
587
+ return batch
588
+
589
+ with training_args.main_process_first(desc="dataset map preprocessing"):
590
+ vectorized_datasets = raw_datasets.map(
591
+ prepare_dataset,
592
+ remove_columns=next(iter(raw_datasets.values())).column_names,
593
+ num_proc=num_workers,
594
+ desc="preprocess datasets",
595
+ )
596
+
597
+ def is_audio_in_length_range(length):
598
+ return length > min_input_length and length < max_input_length
599
+
600
+ # filter data that is shorter than min_input_length
601
+ vectorized_datasets = vectorized_datasets.filter(
602
+ is_audio_in_length_range,
603
+ num_proc=num_workers,
604
+ input_columns=["input_length"],
605
+ )
606
+
607
+ # 7. Next, we can prepare the training.
608
+ # Let's use word error rate (WER) as our evaluation metric,
609
+ # instantiate a data collator and the trainer
610
+
611
+ # Define evaluation metrics during training, *i.e.* word error rate, character error rate
612
+ eval_metrics = {metric: load_metric(metric) for metric in data_args.eval_metrics}
613
+
614
+ # for large datasets it is advised to run the preprocessing on a
615
+ # single machine first with ``args.preprocessing_only`` since there will most likely
616
+ # be a timeout when running the script in distributed mode.
617
+ # In a second step ``args.preprocessing_only`` can then be set to `False` to load the
618
+ # cached dataset
619
+ if data_args.preprocessing_only:
620
+ logger.info(f"Data preprocessing finished. Files cached at {vectorized_datasets.cache_files}")
621
+ return
622
+
623
+ def compute_metrics(pred):
624
+ pred_logits = pred.predictions
625
+ pred_ids = np.argmax(pred_logits, axis=-1)
626
+
627
+ pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id
628
+
629
+ pred_str = tokenizer.batch_decode(pred_ids)
630
+ # we do not want to group tokens when computing the metrics
631
+ label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)
632
+
633
+ metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}
634
+
635
+ return metrics
636
+
637
+ # Now save everything to be able to create a single processor later
638
+ if is_main_process(training_args.local_rank):
639
+ # save feature extractor, tokenizer and config
640
+ feature_extractor.save_pretrained(training_args.output_dir)
641
+ tokenizer.save_pretrained(training_args.output_dir)
642
+ config.save_pretrained(training_args.output_dir)
643
+
644
+ try:
645
+ processor = AutoProcessor.from_pretrained(training_args.output_dir)
646
+ except (OSError, KeyError):
647
+ warnings.warn(
648
+ "Loading a processor from a feature extractor config that does not"
649
+ " include a `processor_class` attribute is deprecated and will be removed in v5. Please add the following "
650
+ " attribute to your `preprocessor_config.json` file to suppress this warning: "
651
+ " `'processor_class': 'Wav2Vec2Processor'`",
652
+ FutureWarning,
653
+ )
654
+ processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir)
655
+
656
+ # Instantiate custom data collator
657
+ data_collator = DataCollatorCTCWithPadding(processor=processor)
658
+
659
+ decay_parameters = get_parameter_names(model, [torch.nn.LayerNorm])
660
+ decay_parameters = [name for name in decay_parameters if "bias" not in name]
661
+ optimizer_grouped_parameters = [
662
+ {
663
+ "params": [p for n, p in model.named_parameters() if n in decay_parameters],
664
+ "weight_decay": training_args.weight_decay,
665
+ },
666
+ {
667
+ "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
668
+ "weight_decay": 0.0,
669
+ },
670
+ ]
671
+ optimizer = bnb.optim.Adam8bit(
672
+ params=optimizer_grouped_parameters,
673
+ lr=training_args.learning_rate,
674
+ betas=(training_args.adam_beta1, training_args.adam_beta2),
675
+ eps=training_args.adam_epsilon,
676
+ )
677
+
678
+ optimizers = (optimizer, None)
679
+
680
+ # Initialize Trainer
681
+ trainer = Trainer(
682
+ model=model,
683
+ data_collator=data_collator,
684
+ args=training_args,
685
+ compute_metrics=compute_metrics,
686
+ train_dataset=vectorized_datasets["train"] if training_args.do_train else None,
687
+ eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None,
688
+ tokenizer=feature_extractor,
689
+ optimizers=optimizers,
690
+ )
691
+
692
+ # 8. Finally, we can start training
693
+
694
+ # Training
695
+ if training_args.do_train:
696
+
697
+ # use the last checkpoint if one exists
698
+ if last_checkpoint is not None:
699
+ checkpoint = last_checkpoint
700
+ elif os.path.isdir(model_args.model_name_or_path):
701
+ checkpoint = model_args.model_name_or_path
702
+ else:
703
+ checkpoint = None
704
+
705
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
706
+ trainer.save_model()
707
+
708
+ metrics = train_result.metrics
709
+ max_train_samples = (
710
+ data_args.max_train_samples
711
+ if data_args.max_train_samples is not None
712
+ else len(vectorized_datasets["train"])
713
+ )
714
+ metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"]))
715
+
716
+ trainer.log_metrics("train", metrics)
717
+ trainer.save_metrics("train", metrics)
718
+ trainer.save_state()
719
+
720
+ # Evaluation
721
+ results = {}
722
+ if training_args.do_eval:
723
+ logger.info("*** Evaluate ***")
724
+ metrics = trainer.evaluate()
725
+ max_eval_samples = (
726
+ data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"])
727
+ )
728
+ metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"]))
729
+
730
+ trainer.log_metrics("eval", metrics)
731
+ trainer.save_metrics("eval", metrics)
732
+
733
+ # Write model card and (optionally) push to hub
734
+ config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na"
735
+ kwargs = {
736
+ "finetuned_from": model_args.model_name_or_path,
737
+ "tasks": "speech-recognition",
738
+ "tags": ["automatic-speech-recognition", data_args.dataset_name],
739
+ "dataset_args": f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split: {data_args.eval_split_name}",
740
+ "dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}",
741
+ }
742
+ if "common_voice" in data_args.dataset_name:
743
+ kwargs["language"] = config_name
744
+
745
+ if training_args.push_to_hub:
746
+ trainer.push_to_hub(**kwargs)
747
+ else:
748
+ trainer.create_model_card(**kwargs)
749
+
750
+ return results
751
+
752
+
753
+ if __name__ == "__main__":
754
+ main()
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "additional_special_tokens": [{"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}]}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": "[UNK]", "bos_token": "<s>", "eos_token": "</s>", "pad_token": "[PAD]", "do_lower_case": false, "word_delimiter_token": "|", "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "/workspace/cv-corpus-8.0-2022-01-19/output", "tokenizer_class": "Wav2Vec2CTCTokenizer"}
train-experiments.py ADDED
@@ -0,0 +1,835 @@
1
+ import pandas as pd
2
+ from tqdm.auto import tqdm
3
+ import random
4
+ from p_tqdm import p_map
5
+ from datasets import load_dataset, load_metric, Audio
6
+ from datasets import load_from_disk, concatenate_datasets
7
+ import torchaudio
8
+
9
+ import functools
10
+ import json
11
+ import logging
12
+ import os
13
+ import re
14
+ import sys
15
+ import warnings
16
+ from dataclasses import dataclass, field
17
+ from typing import Dict, List, Optional, Union
18
+ from datasets import concatenate_datasets, load_dataset
19
+
20
+ import datasets
21
+ import numpy as np
22
+ import torch
23
+ from datasets import DatasetDict, load_dataset, load_metric, Dataset
24
+
25
+ import bitsandbytes as bnb
26
+ import transformers
27
+ from transformers import (
28
+ AutoConfig,
29
+ AutoFeatureExtractor,
30
+ AutoModelForCTC,
31
+ AutoProcessor,
32
+ AutoTokenizer,
33
+ HfArgumentParser,
34
+ Trainer,
35
+ TrainingArguments,
36
+ Wav2Vec2Processor,
37
+ set_seed,
38
+ )
39
+ from transformers.trainer_pt_utils import get_parameter_names
40
+ from transformers.trainer_utils import get_last_checkpoint, is_main_process
41
+ from transformers.utils import check_min_version
42
+ from transformers.utils.versions import require_version
43
+
44
+ logger = logging.getLogger(__name__)
45
+
46
+ def list_field(default=None, metadata=None):
47
+ return field(default_factory=lambda: default, metadata=metadata)
48
+
49
+ @dataclass
50
+ class ModelArguments:
51
+ """
52
+ Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
53
+ """
54
+
55
+ model_name_or_path: str = field(
56
+ metadata={"help": ""}, default="hf-test/xls-r-dummy"
57
+ )
58
+ tokenizer_name_or_path: Optional[str] = field(
59
+ default=None,
60
+ metadata={"help": "hf-test/xls-r-dummy"},
61
+ )
62
+ cache_dir: Optional[str] = field(
63
+ default=None,
64
+ metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
65
+ )
66
+ freeze_feature_encoder: bool = field(
67
+ default=True, metadata={"help": "Whether to freeze the feature encoder layers of the model."}
68
+ )
69
+ attention_dropout: float = field(
70
+ default=0.0, metadata={"help": "The dropout ratio for the attention probabilities."}
71
+ )
72
+ activation_dropout: float = field(
73
+ default=0.0, metadata={"help": "The dropout ratio for activations inside the fully connected layer."}
74
+ )
75
+ feat_proj_dropout: float = field(default=0.0, metadata={"help": "The dropout ratio for the projected features."})
76
+ hidden_dropout: float = field(
77
+ default=0.0,
78
+ metadata={
79
+ "help": "The dropout probability for all fully connected layers in the embeddings, encoder, and pooler."
80
+ },
81
+ )
82
+ final_dropout: float = field(
83
+ default=0.0,
84
+ metadata={"help": "The dropout probability for the final projection layer."},
85
+ )
86
+ mask_time_prob: float = field(
87
+ default=0.05,
88
+ metadata={
89
+ "help": "Probability of each feature vector along the time axis to be chosen as the start of the vector"
90
+ "span to be masked. Approximately ``mask_time_prob * sequence_length // mask_time_length`` feature"
91
+ "vectors will be masked along the time axis."
92
+ },
93
+ )
94
+ mask_time_length: int = field(
95
+ default=10,
96
+ metadata={"help": "Length of vector span to mask along the time axis."},
97
+ )
98
+ mask_feature_prob: float = field(
99
+ default=0.0,
100
+ metadata={
101
+ "help": "Probability of each feature vector along the feature axis to be chosen as the start of the vector"
102
+ "span to be masked. Approximately ``mask_feature_prob * sequence_length // mask_feature_length`` feature bins will be masked along the time axis."
103
+ },
104
+ )
105
+ mask_feature_length: int = field(
106
+ default=10,
107
+ metadata={"help": "Length of vector span to mask along the feature axis."},
108
+ )
109
+ layerdrop: float = field(default=0.0, metadata={"help": "The LayerDrop probability."})
110
+ ctc_loss_reduction: Optional[str] = field(
111
+ default="mean", metadata={"help": "The way the ctc loss should be reduced. Should be one of 'mean' or 'sum'."}
112
+ )
113
+
114
+
115
+ # In[4]:
116
+
117
+
118
+ @dataclass
119
+ class DataTrainingArguments:
120
+ """
121
+ Arguments pertaining to what data we are going to input our model for training and eval.
122
+
123
+ Using `HfArgumentParser` we can turn this class
124
+ into argparse arguments to be able to specify them on
125
+ the command line.
126
+ """
127
+
128
+ dataset_name: str = field(
129
+ metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
130
+ )
131
+ dataset_config_name: str = field(
132
+ default="ab", metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
133
+ )
134
+ train_split_name: str = field(
135
+ default="train+validation",
136
+ metadata={
137
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
138
+ },
139
+ )
140
+ eval_split_name: str = field(
141
+ default="test",
142
+ metadata={
143
+ "help": "The name of the training data set split to use (via the datasets library). Defaults to 'train'"
144
+ },
145
+ )
146
+ audio_column_name: str = field(
147
+ default="audio",
148
+ metadata={"help": "The name of the dataset column containing the audio data. Defaults to 'audio'"},
149
+ )
150
+ text_column_name: str = field(
151
+ default="text",
152
+ metadata={"help": "The name of the dataset column containing the text data. Defaults to 'text'"},
153
+ )
154
+ overwrite_cache: bool = field(
155
+ default=False, metadata={"help": "Overwrite the cached preprocessed datasets or not."}
156
+ )
157
+ preprocessing_num_workers: Optional[int] = field(
158
+ default=None,
159
+ metadata={"help": "The number of processes to use for the preprocessing."},
160
+ )
161
+ max_train_samples: Optional[int] = field(
162
+ default=None,
163
+ metadata={
164
+ "help": "For debugging purposes or quicker training, truncate the number of training examples to this "
165
+ "value if set."
166
+ },
167
+ )
168
+ max_eval_samples: Optional[int] = field(
169
+ default=None,
170
+ metadata={
171
+ "help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
172
+ "value if set."
173
+ },
174
+ )
175
+ chars_to_ignore: Optional[List[str]] = list_field(
176
+ default=None,
177
+ metadata={"help": "A list of characters to remove from the transcripts."},
178
+ )
179
+ eval_metrics: List[str] = list_field(
180
+ default=["wer"],
181
+ metadata={"help": "A list of metrics the model should be evaluated on. E.g. `'wer cer'`"},
182
+ )
183
+ max_duration_in_seconds: float = field(
184
+ default=20.0,
185
+ metadata={
186
+ "help": "Filter audio files that are longer than `max_duration_in_seconds` seconds to 'max_duration_in_seconds`"
187
+ },
188
+ )
189
+ min_duration_in_seconds: float = field(
190
+ default=0.0, metadata={"help": "Filter audio files that are shorter than `min_duration_in_seconds` seconds"}
191
+ )
192
+ preprocessing_only: bool = field(
193
+ default=False,
194
+ metadata={
195
+ "help": "Whether to only do data preprocessing and skip training. "
196
+ "This is especially useful when data preprocessing errors out in distributed training due to timeout. "
197
+ "In this case, one should run the preprocessing in a non-distributed setup with `preprocessing_only=True` "
198
+ "so that the cached datasets can consequently be loaded in distributed training"
199
+ },
200
+ )
201
+ use_auth_token: bool = field(
202
+ default=False,
203
+ metadata={
204
+ "help": "If :obj:`True`, will use the token generated when running"
205
+ ":obj:`transformers-cli login` as HTTP bearer authorization for remote files."
206
+ },
207
+ )
208
+ unk_token: str = field(
209
+ default="[UNK]",
210
+ metadata={"help": "The unk token for the tokenizer"},
211
+ )
212
+ pad_token: str = field(
213
+ default="[PAD]",
214
+ metadata={"help": "The padding token for the tokenizer"},
215
+ )
216
+ word_delimiter_token: str = field(
217
+ default="|",
218
+ metadata={"help": "The word delimiter token for the tokenizer"},
219
+ )
220
+ phoneme_language: Optional[str] = field(
221
+ default=None,
222
+ metadata={
223
+ "help": "The target language that should be used be"
224
+ " passed to the tokenizer for tokenization. Note that"
225
+ " this is only relevant if the model classifies the"
226
+ " input audio to a sequence of phoneme sequences."
227
+ },
228
+ )
229
+
230
+
231
+ # In[5]:
232
+
233
+
234
+ @dataclass
235
+ class DataCollatorCTCWithPadding:
236
+ """
237
+ Data collator that will dynamically pad the inputs received.
238
+ Args:
239
+ processor (:class:`~transformers.AutoProcessor`)
240
+ The processor used for processing the data.
241
+ padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
242
+ Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
243
+ among:
244
+ * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
245
+ sequence is provided).
246
+ * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
247
+ maximum acceptable input length for the model if that argument is not provided.
248
+ * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
249
+ different lengths).
250
+ max_length (:obj:`int`, `optional`):
251
+ Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
252
+ max_length_labels (:obj:`int`, `optional`):
253
+ Maximum length of the ``labels`` returned list and optionally padding length (see above).
254
+ pad_to_multiple_of (:obj:`int`, `optional`):
255
+ If set will pad the sequence to a multiple of the provided value.
256
+ This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
257
+ 7.5 (Volta).
258
+ """
259
+
260
+ processor: AutoProcessor
261
+ padding: Union[bool, str] = "longest"
262
+ pad_to_multiple_of: Optional[int] = None
263
+ pad_to_multiple_of_labels: Optional[int] = None
264
+
265
+ def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
266
+ # split inputs and labels since they have to be of different lengths and need
267
+ # different padding methods
268
+ input_features = [{"input_values": feature["input_values"]} for feature in features]
269
+ label_features = [{"input_ids": feature["labels"]} for feature in features]
270
+
271
+ batch = self.processor.pad(
272
+ input_features,
273
+ padding=self.padding,
274
+ pad_to_multiple_of=self.pad_to_multiple_of,
275
+ return_tensors="pt",
276
+ )
277
+
278
+ with self.processor.as_target_processor():
279
+ labels_batch = self.processor.pad(
280
+ label_features,
281
+ padding=self.padding,
282
+ pad_to_multiple_of=self.pad_to_multiple_of_labels,
283
+ return_tensors="pt",
284
+ )
285
+
286
+ # replace padding with -100 to ignore loss correctly
287
+ labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
288
+
289
+ batch["labels"] = labels
290
+
291
+ return batch
292
+
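+ # Illustrative usage sketch (comments only, not part of the original training run),
+ # assuming a `processor` has already been loaded:
+ #
+ #   collator = DataCollatorCTCWithPadding(processor=processor)
+ #   batch = collator([
+ #       {"input_values": [0.1, 0.2, 0.3], "labels": [5, 9]},
+ #       {"input_values": [0.4, 0.5], "labels": [7]},
+ #   ])
+ #   # -> batch["input_values"] is padded to shape (2, 3) and batch["labels"]
+ #   #    uses -100 wherever label padding should be ignored by the CTC loss.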
293
+ # download the augmented Dataset from
294
+ # https://huggingface.co/datasets/bakrianoo/arabic-cv8-augmented
295
+
296
+ base_path = "/workspace/cv-corpus-8.0-2022-01-19"
297
+
298
+ # load augmented datasets
299
+ train_ar_df = pd.read_csv(f"{base_path}/train.tsv", sep="\t")
300
+ train_ar_df["audio"] = train_ar_df["path"]
301
+
302
+ test_ar_df = pd.read_csv(f"{base_path}/test.tsv", sep="\t")
303
+ test_ar_df["audio"] = test_ar_df["path"]
304
+
305
+ train_ar_df = train_ar_df.sample(frac=1, random_state=101, ignore_index=True)
306
+
307
+ raw_datasets = DatasetDict()
308
+
309
+ # select Dataset range
310
+ from_rows = 0
311
+ to_rows = 500_000
312
+
313
+ saved_vecs_path = f"{base_path}/saved_vec_dataset-{from_rows}-{to_rows}.ds"
314
+
315
+ raw_datasets["train"] = Dataset.from_pandas(train_ar_df.iloc[from_rows:to_rows])
316
+ raw_datasets["eval"] = Dataset.from_pandas(test_ar_df)
317
+
318
+ # Audio casting
319
+ raw_datasets["train"] = raw_datasets["train"].cast_column("audio", datasets.features.Audio(sampling_rate=16000))
320
+ raw_datasets["eval"] = raw_datasets["eval"].cast_column("audio", datasets.features.Audio(sampling_rate=16000))
321
+
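+ # Note: casting the "audio" column to Audio(sampling_rate=16000) makes `datasets`
+ # decode and resample each mp3 lazily on access, so sample["array"] in
+ # prepare_dataset below is already a 16 kHz mono waveform.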
322
+
323
+ parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
324
+
325
+ model_args, data_args, training_args = parser.parse_dict({
326
+ "dataset_name": "mozilla-foundation/common_voice_8_0",
327
+ "model_name_or_path": "facebook/wav2vec2-xls-r-300m",
328
+ "dataset_config_name": "ar",
329
+ "overwrite_output_dir": False,
330
+
331
+ # "preprocessing_only": True,
332
+
333
+ "output_dir": f"{base_path}/output",
334
+ "text_column_name": "sentence",
335
+
336
+ "freeze_feature_encoder": True,
337
+ "gradient_checkpointing": True,
338
+ "group_by_length": False,
339
+ "push_to_hub": False,
340
+ "use_auth_token": True,
341
+ "do_train": True,
342
+ "do_eval": True,
343
+
344
+ "per_device_train_batch_size":32,
345
+ "gradient_accumulation_steps":1,
346
+ "per_device_eval_batch_size":10,
347
+
348
+ "metric_for_best_model":'wer',
349
+ "evaluation_strategy":"steps",
350
+ "eval_steps":1000,
351
+ "logging_strategy":"steps",
352
+ "logging_steps":500,
353
+ "save_strategy":"steps",
354
+ "save_steps":1000,
355
+ "num_train_epochs":10,
356
+ "fp16":True,
357
+ "learning_rate":2e-4,
358
+ "warmup_steps":1000,
359
+ "save_total_limit":8,
360
+ "chars_to_ignore": [':', 'T', '؟', 'ۖ', '…', 'x', 'چ', '?', '.', 'ْ', 'g', '☭', 'w', ';', ',', 'a', 'ۙ', 'e', '`', '“', '!', 'n', 's', '؛', 'ﺃ', 'r', 'ٓ', 'c', '-', 't', 'u', 'l', 'o', '»', 'ٰ', 'ۗ', 'h', 'ڨ', 'ۚ', 'S', '—', 'ٌ', 'm', '”', 'd', 'ۛ', 'H', 'ُ', 'ﻻ', 'y', 'M', 'ھ', 'ک', 'ٍ', 'A', 'ۘ', 'ِ', '–', 'i', 'f', "'", 'ً', '«', 'َ'] + ['\\', '(',')','-','b','c','d','e','g','i','k','p','q','r','u','v','x'],
361
+
362
+ })
363
+
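+ # parse_dict routes each key above to whichever dataclass declares it, so e.g.
+ # "learning_rate" ends up in training_args, "chars_to_ignore" in data_args, and
+ # "freeze_feature_encoder" in model_args.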
364
+
365
+ # See all possible arguments in src/transformers/training_args.py
366
+ # or by passing the --help flag to this script.
367
+ # We now keep distinct sets of args, for a cleaner separation of concerns.
368
+
369
+ # Detecting last checkpoint.
370
+ last_checkpoint = None
371
+ if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
372
+ last_checkpoint = get_last_checkpoint(training_args.output_dir)
373
+ if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
374
+ raise ValueError(
375
+ f"Output directory ({training_args.output_dir}) already exists and is not empty. "
376
+ "Use --overwrite_output_dir to overcome."
377
+ )
378
+ elif last_checkpoint is not None:
379
+ logger.info(
380
+ f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
381
+ "the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
382
+ )
383
+
384
+
385
+ # Setup logging
386
+ logging.basicConfig(
387
+ format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
388
+ datefmt="%m/%d/%Y %H:%M:%S",
389
+ handlers=[logging.StreamHandler(sys.stdout)],
390
+ )
391
+ logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
392
+
393
+ # Log on each process the small summary:
394
+ logger.warning(
395
+ f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
396
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
397
+ )
398
+
399
+ # Set the verbosity to info of the Transformers logger (on main process only):
400
+ if is_main_process(training_args.local_rank):
401
+ transformers.utils.logging.set_verbosity_info()
402
+ logger.info("Training/evaluation parameters %s", training_args)
403
+
404
+
405
+ # Set seed before initializing model.
406
+ set_seed(training_args.seed)
407
+
408
+
409
+ ### Load Dataset
410
+
411
+
412
+ chars_to_ignore_regex = (
413
+ f'[{"".join(data_args.chars_to_ignore)}]' if data_args.chars_to_ignore is not None else None
414
+ )
415
+ text_column_name = data_args.text_column_name
416
+
417
+
418
+ def remove_special_characters(batch):
419
+ if chars_to_ignore_regex is not None:
420
+ batch["target_text"] = re.sub(chars_to_ignore_regex, "", batch[text_column_name]).lower() + " "
421
+ else:
422
+ batch["target_text"] = batch[text_column_name].lower() + " "
423
+ return batch
424
+
425
+ with training_args.main_process_first(desc="dataset map special characters removal"):
426
+
427
+ raw_datasets = raw_datasets.map(
428
+ remove_special_characters,
429
+ remove_columns=[text_column_name],
430
+ desc="remove special characters from datasets",
431
+ )
432
+
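+ # Illustrative example (assumed input, not taken from the dataset): with the
+ # chars_to_ignore list above, a transcript such as "مَرْحَبًا!" would lose its
+ # diacritics and punctuation and become the target_text "مرحبا " (note the
+ # trailing space appended by remove_special_characters).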
433
+
434
+ data_args.word_delimiter_token  # leftover notebook-cell inspection; has no effect
435
+
436
+
437
+ # save special tokens for tokenizer
438
+ word_delimiter_token = data_args.word_delimiter_token
439
+ unk_token = data_args.unk_token
440
+ pad_token = data_args.pad_token
441
+
442
+ # 3. Next, let's load the config as we might need it to create
443
+ # the tokenizer
444
+ # load config
445
+ config = AutoConfig.from_pretrained(
446
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
447
+ )
448
+
449
+ def create_vocabulary_from_data(
450
+ datasets: DatasetDict,
451
+ word_delimiter_token: Optional[str] = None,
452
+ unk_token: Optional[str] = None,
453
+ pad_token: Optional[str] = None,
454
+ ):
455
+ # Given training and test labels create vocabulary
456
+ def extract_all_chars(batch):
457
+ all_text = " ".join(batch["target_text"])
458
+ vocab = list(set(all_text))
459
+ return {"vocab": [vocab], "all_text": [all_text]}
460
+
461
+ vocabs = datasets.map(
462
+ extract_all_chars,
463
+ batched=True,
464
+ batch_size=-1,
465
+ keep_in_memory=True,
466
+ remove_columns=datasets["train"].column_names,
467
+ )
468
+
469
+ # take union of all unique characters in each dataset
470
+ vocab_set = functools.reduce(
471
+ lambda vocab_1, vocab_2: set(vocab_1["vocab"][0]) | set(vocab_2["vocab"][0]), vocabs.values()
472
+ )
473
+
474
+
475
+ vocab_dict = {v: k for k, v in enumerate(sorted(list(vocab_set)))}
476
+
477
+ # replace white space with delimiter token
478
+ if word_delimiter_token is not None:
479
+ vocab_dict[word_delimiter_token] = vocab_dict[" "]
480
+ del vocab_dict[" "]
481
+
482
+ # add unk and pad token
483
+ if unk_token is not None:
484
+ vocab_dict[unk_token] = len(vocab_dict)
485
+
486
+ if pad_token is not None:
487
+ vocab_dict[pad_token] = len(vocab_dict)
488
+
489
+ return vocab_dict
490
+
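+ # The function above yields a mapping like the vocab.json shipped in this repo,
+ # e.g. {"|": 0, "\"": 1, "،": 2, ..., "[UNK]": 42, "[PAD]": 43}: the sorted unique
+ # characters are enumerated, the space character is swapped for the word delimiter
+ # token, and the unk/pad tokens are appended at the end.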
491
+
492
+ raw_datasets["train"] = raw_datasets["train"].remove_columns("file_id")
493
+
494
+
495
+ # 4. Next, if no tokenizer file is defined,
496
+ # we create the vocabulary of the model by extracting all unique characters from
497
+ # the training and evaluation datasets
498
+ # We need to make sure that only first rank saves vocabulary
499
+ # make sure all processes wait until vocab is created
500
+ tokenizer_name_or_path = model_args.tokenizer_name_or_path
501
+ tokenizer_kwargs = {}
502
+ if tokenizer_name_or_path is None:
503
+ # save vocab in training output dir
504
+ tokenizer_name_or_path = training_args.output_dir
505
+
506
+ vocab_file = os.path.join(tokenizer_name_or_path, "vocab.json")
507
+
508
+ with training_args.main_process_first():
509
+ if training_args.overwrite_output_dir and os.path.isfile(vocab_file):
510
+ os.remove(vocab_file)
511
+
512
+ with training_args.main_process_first(desc="dataset map vocabulary creation"):
513
+ if not os.path.isfile(vocab_file):
514
+ os.makedirs(tokenizer_name_or_path, exist_ok=True)
515
+ vocab_dict = create_vocabulary_from_data(
516
+ raw_datasets,
517
+ word_delimiter_token=word_delimiter_token,
518
+ unk_token=unk_token,
519
+ pad_token=pad_token,
520
+ )
521
+
522
+ # save vocab dict to be loaded into tokenizer
523
+ with open(vocab_file, "w") as file:
524
+ json.dump(vocab_dict, file)
525
+
526
+ # if tokenizer has just been created
527
+ # it is defined by `tokenizer_class` if present in config else by `model_type`
528
+ tokenizer_kwargs = {
529
+ "config": config if config.tokenizer_class is not None else None,
530
+ "tokenizer_type": config.model_type if config.tokenizer_class is None else None,
531
+ "unk_token": unk_token,
532
+ "pad_token": pad_token,
533
+ "word_delimiter_token": word_delimiter_token,
534
+ }
535
+
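+ # Note (assumption about this checkpoint): the wav2vec2 config carries no
+ # tokenizer_class, so AutoTokenizer below is resolved from tokenizer_type="wav2vec2"
+ # together with the freshly written vocab.json and the special tokens above.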
536
+
537
+ # 5. Now we can instantiate the feature extractor, tokenizer and model
538
+ # Note for distributed training, the .from_pretrained methods guarantee that only
539
+ # one local process can concurrently download model & vocab.
540
+
541
+ # load feature_extractor and tokenizer
542
+ tokenizer = AutoTokenizer.from_pretrained(
543
+ tokenizer_name_or_path,
544
+ use_auth_token=data_args.use_auth_token,
545
+ **tokenizer_kwargs,
546
+ )
547
+ feature_extractor = AutoFeatureExtractor.from_pretrained(
548
+ model_args.model_name_or_path, cache_dir=model_args.cache_dir, use_auth_token=data_args.use_auth_token
549
+ )
550
+
551
+
552
+ # adapt config
553
+ config.update(
554
+ {
555
+ "feat_proj_dropout": model_args.feat_proj_dropout,
556
+ "attention_dropout": model_args.attention_dropout,
557
+ "hidden_dropout": model_args.hidden_dropout,
558
+ "final_dropout": model_args.final_dropout,
559
+ "mask_time_prob": model_args.mask_time_prob,
560
+ "mask_time_length": model_args.mask_time_length,
561
+ "mask_feature_prob": model_args.mask_feature_prob,
562
+ "mask_feature_length": model_args.mask_feature_length,
563
+ "gradient_checkpointing": training_args.gradient_checkpointing,
564
+ "layerdrop": model_args.layerdrop,
565
+ "ctc_loss_reduction": model_args.ctc_loss_reduction,
566
+ "pad_token_id": tokenizer.pad_token_id,
567
+ "vocab_size": len(tokenizer),
568
+ "activation_dropout": model_args.activation_dropout,
569
+ }
570
+ )
571
+
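+ # Updating the config before loading ensures the CTC head is sized to the new
+ # vocabulary (vocab_size=len(tokenizer)) and that pad_token_id is used as the
+ # CTC blank token, matching the tokenizer built above.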
572
+
573
+ # create model
574
+ model = AutoModelForCTC.from_pretrained(
575
+ model_args.model_name_or_path,
576
+ cache_dir=model_args.cache_dir,
577
+ config=config,
578
+ use_auth_token=data_args.use_auth_token,
579
+ )
580
+
581
+ # freeze encoder
582
+ if model_args.freeze_feature_encoder:
583
+ model.freeze_feature_encoder()
584
+
585
+
586
+ # 6. Now we preprocess the datasets including loading the audio, resampling and normalization
587
+ # Thankfully, `datasets` takes care of automatically loading and resampling the audio,
588
+ # so that we just need to set the correct target sampling rate and normalize the input
589
+ # via the `feature_extractor`
590
+
591
+ # make sure that dataset decodes audio with correct sampling rate
592
+ dataset_sampling_rate = next(iter(raw_datasets.values())).features[data_args.audio_column_name].sampling_rate
593
+ if dataset_sampling_rate != feature_extractor.sampling_rate:
594
+ raw_datasets = raw_datasets.cast_column(
595
+ data_args.audio_column_name, datasets.features.Audio(sampling_rate=feature_extractor.sampling_rate)
596
+ )
597
+
598
+ # derive max & min input length for sample rate & max duration
599
+ max_input_length = data_args.max_duration_in_seconds * feature_extractor.sampling_rate
600
+ min_input_length = data_args.min_duration_in_seconds * feature_extractor.sampling_rate
601
+
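+ # With the values used here (max 20.0 s, min 0.0 s at 16 kHz), the filter below keeps
+ # examples longer than 0 and shorter than 20.0 * 16000 = 320,000 input values.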
602
+ audio_column_name = data_args.audio_column_name
603
+ num_workers = data_args.preprocessing_num_workers
604
+
605
+ # `phoneme_language` is only relevant if the model is fine-tuned on phoneme classification
606
+ phoneme_language = data_args.phoneme_language
607
+
608
+
609
+ # Preprocessing the datasets.
610
+ # We need to read the audio files as arrays and tokenize the targets.
611
+ def prepare_dataset(batch):
612
+ # load audio
613
+ sample = batch[audio_column_name]
614
+
615
+ inputs = feature_extractor(sample["array"], sampling_rate=sample["sampling_rate"])
616
+ batch["input_values"] = inputs.input_values[0]
617
+ batch["input_length"] = len(batch["input_values"])
618
+
619
+ # encode targets
620
+ additional_kwargs = {}
621
+ if phoneme_language is not None:
622
+ additional_kwargs["phonemizer_lang"] = phoneme_language
623
+
624
+ batch["labels"] = tokenizer(batch["target_text"], **additional_kwargs).input_ids
625
+ return batch
626
+
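+ # Illustrative output (assumed clip length): for a 2-second clip at 16 kHz,
+ # prepare_dataset returns roughly
+ #   {"input_values": [...32,000 floats...], "input_length": 32000, "labels": [9, 33, ...]}
+ # where "labels" are character ids from the tokenizer's vocabulary.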
627
+ def vectorizing_record(audio_path, target_text):
628
+ batch = {}
629
+
630
+ array, sampling_rate = torchaudio.load(audio_path, format="mp3")
631
+
632
+ batch["input_values"] = array.mean(axis=0)
633
+ batch["input_length"] = len(array)
634
+
635
+ # encode targets
636
+ additional_kwargs = {}
637
+ if phoneme_language is not None:
638
+ additional_kwargs["phonemizer_lang"] = phoneme_language
639
+
640
+ batch["labels"] = tokenizer(target_text, **additional_kwargs).input_ids
641
+ return batch
642
+
643
+
644
+ # In[ ]:
645
+
646
+ print(f"========\n\n{num_workers}\n\n========")
647
+ with training_args.main_process_first(desc="dataset map preprocessing"):
648
+ saved_vecs_path = f"{base_path}/saved_vec_dataset-{from_rows}-{to_rows}.ds"
649
+ if not os.path.exists(saved_vecs_path):
650
+
651
+ vectorized_datasets = raw_datasets.map(
652
+ prepare_dataset,
653
+ remove_columns=next(iter(raw_datasets.values())).column_names,
654
+ num_proc=num_workers,
655
+ desc="preprocess datasets",
656
+ )
657
+
658
+
659
+ def is_audio_in_length_range(length):
660
+ return length > min_input_length and length < max_input_length
661
+
662
+ # filter data that is shorter than min_input_length
663
+ vectorized_datasets = vectorized_datasets.filter(
664
+ is_audio_in_length_range,
665
+ num_proc=num_workers,
666
+ input_columns=["input_length"],
667
+ )
668
+
669
+ # save to local disk
670
+ vectorized_datasets.save_to_disk(saved_vecs_path)
671
+ else:
672
+ # read from disk
673
+ vectorized_datasets = load_from_disk(saved_vecs_path)
674
+
675
+ print(vectorized_datasets)
676
+
677
+ # 7. Next, we can prepare the training.
678
+ # Let's use word error rate (WER) as our evaluation metric,
679
+ # instantiate a data collator and the trainer
680
+
681
+ # Define evaluation metrics during training, *i.e.* word error rate, character error rate
682
+ eval_metrics = {metric: load_metric(metric) for metric in data_args.eval_metrics}
683
+
684
+ vectorized_datasets["train"] = vectorized_datasets["train"].remove_columns("input_length")
685
+ vectorized_datasets["eval"] = vectorized_datasets["eval"].remove_columns("input_length")
686
+
687
+ # for large datasets it is advised to run the preprocessing on a
688
+ # single machine first with ``args.preprocessing_only`` since there will most likely
689
+ # be a timeout when running the script in distributed mode.
690
+ # In a second step ``args.preprocessing_only`` can then be set to `False` to load the
691
+ # cached dataset
692
+ if data_args.preprocessing_only:
693
+ logger.info(f"Data preprocessing finished. Files cached at {vectorized_datasets.cache_files}")
694
+
695
+
696
+
697
+ def compute_metrics(pred):
698
+ pred_logits = pred.predictions
699
+ pred_ids = np.argmax(pred_logits, axis=-1)
700
+
701
+ pred.label_ids[pred.label_ids == -100] = tokenizer.pad_token_id
702
+
703
+ pred_str = tokenizer.batch_decode(pred_ids)
704
+
705
+ # we do not want to group tokens when computing the metrics
706
+ label_str = tokenizer.batch_decode(pred.label_ids, group_tokens=False)
707
+
708
+ metrics = {k: v.compute(predictions=pred_str, references=label_str) for k, v in eval_metrics.items()}
709
+ return metrics
710
+
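+ # With the default eval_metrics=["wer"], compute_metrics returns e.g. {"wer": 0.45},
+ # which the Trainer logs as eval_wer and compares via metric_for_best_model="wer".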
711
+ # Now save everything to be able to create a single processor later
712
+ if is_main_process(training_args.local_rank):
713
+ # save feature extractor, tokenizer and config
714
+ feature_extractor.save_pretrained(training_args.output_dir)
715
+ tokenizer.save_pretrained(training_args.output_dir)
716
+ config.save_pretrained(training_args.output_dir)
717
+
718
+ try:
719
+ processor = AutoProcessor.from_pretrained(training_args.output_dir)
720
+ except (OSError, KeyError):
721
+ warnings.warn(
722
+ "Loading a processor from a feature extractor config that does not"
723
+ " include a `processor_class` attribute is deprecated and will be removed in v5. Please add the following "
724
+ " attribute to your `preprocessor_config.json` file to suppress this warning: "
725
+ " `'processor_class': 'Wav2Vec2Processor'`",
726
+ FutureWarning,
727
+ )
728
+ processor = Wav2Vec2Processor.from_pretrained(training_args.output_dir)
729
+
730
+ # Instantiate custom data collator
731
+ data_collator = DataCollatorCTCWithPadding(processor=processor)
732
+
733
+
734
+ decay_parameters = get_parameter_names(model, [torch.nn.LayerNorm])
735
+ decay_parameters = [name for name in decay_parameters if "bias" not in name]
736
+
737
+ optimizer_grouped_parameters = [
738
+ {
739
+ "params": [p for n, p in model.named_parameters() if n in decay_parameters],
740
+ "weight_decay": training_args.weight_decay,
741
+ },
742
+ {
743
+ "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
744
+ "weight_decay": 0.0,
745
+ },
746
+ ]
747
+
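+ # The grouping above applies weight decay only to parameters that are neither
+ # LayerNorm weights nor biases, the usual convention when fine-tuning transformers.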
748
+ optimizer = bnb.optim.Adam8bit(
749
+ params=optimizer_grouped_parameters,
750
+ lr=training_args.learning_rate,
751
+ betas=(training_args.adam_beta1, training_args.adam_beta2),
752
+ eps=training_args.adam_epsilon,
753
+ )
754
+
755
+ optimizers = (optimizer, None)
756
+
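+ # Passing (optimizer, None) keeps the 8-bit bitsandbytes Adam defined above while
+ # letting the Trainer create its own learning-rate scheduler (linear warmup/decay
+ # from the TrainingArguments).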
757
+
758
+ # Initialize Trainer
759
+ trainer = Trainer(
760
+ model=model,
761
+ data_collator=data_collator,
762
+ args=training_args,
763
+ compute_metrics=compute_metrics,
764
+ train_dataset=vectorized_datasets["train"] if training_args.do_train else None,
765
+ eval_dataset=vectorized_datasets["eval"] if training_args.do_eval else None,
766
+ tokenizer=feature_extractor,
767
+ optimizers=optimizers,
768
+ )
769
+
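+ # Note: the feature_extractor is passed as `tokenizer` so the Trainer saves it
+ # alongside every checkpoint and the full processor can be re-created later.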
770
+
771
+
772
+ # 8. Finally, we can start training
773
+
774
+ # Training
775
+ if training_args.do_train and not data_args.preprocessing_only:
776
+
777
+ # use the last checkpoint if one exists
778
+ if last_checkpoint is not None:
779
+ checkpoint = last_checkpoint
780
+ elif os.path.isdir(model_args.model_name_or_path):
781
+ checkpoint = model_args.model_name_or_path
782
+ else:
783
+ checkpoint = None
784
+
785
+ train_result = trainer.train(resume_from_checkpoint=checkpoint)
786
+ trainer.save_model()
787
+
788
+ metrics = train_result.metrics
789
+ max_train_samples = (
790
+ data_args.max_train_samples
791
+ if data_args.max_train_samples is not None
792
+ else len(vectorized_datasets["train"])
793
+ )
794
+ metrics["train_samples"] = min(max_train_samples, len(vectorized_datasets["train"]))
795
+
796
+ trainer.log_metrics("train", metrics)
797
+ trainer.save_metrics("train", metrics)
798
+ trainer.save_state()
799
+
800
+
801
+ # Evaluation
802
+ results = {}
803
+ if training_args.do_eval and not data_args.preprocessing_only:
804
+ logger.info("*** Evaluate ***")
805
+ metrics = trainer.evaluate()
806
+ max_eval_samples = (
807
+ data_args.max_eval_samples if data_args.max_eval_samples is not None else len(vectorized_datasets["eval"])
808
+ )
809
+ metrics["eval_samples"] = min(max_eval_samples, len(vectorized_datasets["eval"]))
810
+
811
+ trainer.log_metrics("eval", metrics)
812
+ trainer.save_metrics("eval", metrics)
813
+
814
+ # Write model card and (optionally) push to hub
815
+ config_name = data_args.dataset_config_name if data_args.dataset_config_name is not None else "na"
816
+ kwargs = {
817
+ "finetuned_from": model_args.model_name_or_path,
818
+ "tasks": "speech-recognition",
819
+ "tags": ["automatic-speech-recognition", data_args.dataset_name],
820
+ "dataset_args": f"Config: {config_name}, Training split: {data_args.train_split_name}, Eval split: {data_args.eval_split_name}",
821
+ "dataset": f"{data_args.dataset_name.upper()} - {config_name.upper()}",
822
+ }
823
+
824
+ if not data_args.preprocessing_only:
825
+ if "common_voice" in data_args.dataset_name:
826
+ kwargs["language"] = config_name
827
+
828
+
829
+ if training_args.push_to_hub:
830
+ trainer.push_to_hub(**kwargs)
831
+ else:
832
+ trainer.create_model_card(**kwargs)
833
+
834
+ print(results)
835
+
train_results.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "epoch": 10.0,
3
+ "train_loss": 0.1316310991176183,
4
+ "train_runtime": 23113.6031,
5
+ "train_samples": 399991,
6
+ "train_samples_per_second": 173.054,
7
+ "train_steps_per_second": 0.676
8
+ }
trainer_state.json ADDED
@@ -0,0 +1,346 @@
1
+ {
2
+ "best_metric": 0.4493387111903199,
3
+ "best_model_checkpoint": "/workspace/cv-corpus-8.0-2022-01-19/output/checkpoint-1000",
4
+ "epoch": 10.0,
5
+ "global_step": 15630,
6
+ "is_hyper_param_search": false,
7
+ "is_local_process_zero": true,
8
+ "is_world_process_zero": true,
9
+ "log_history": [
10
+ {
11
+ "epoch": 0.32,
12
+ "learning_rate": 9.92e-05,
13
+ "loss": 6.7916,
14
+ "step": 500
15
+ },
16
+ {
17
+ "epoch": 0.64,
18
+ "learning_rate": 0.00019920000000000002,
19
+ "loss": 1.354,
20
+ "step": 1000
21
+ },
22
+ {
23
+ "epoch": 0.64,
24
+ "eval_loss": 0.4108898937702179,
25
+ "eval_runtime": 71.0896,
26
+ "eval_samples_per_second": 146.125,
27
+ "eval_steps_per_second": 1.829,
28
+ "eval_wer": 0.4493387111903199,
29
+ "step": 1000
30
+ },
31
+ {
32
+ "epoch": 0.96,
33
+ "learning_rate": 0.00019321941216678059,
34
+ "loss": 0.7084,
35
+ "step": 1500
36
+ },
37
+ {
38
+ "epoch": 1.28,
39
+ "learning_rate": 0.00018638414217361587,
40
+ "loss": 0.5886,
41
+ "step": 2000
42
+ },
43
+ {
44
+ "epoch": 1.28,
45
+ "eval_loss": 0.2797781527042389,
46
+ "eval_runtime": 71.0704,
47
+ "eval_samples_per_second": 146.165,
48
+ "eval_steps_per_second": 1.829,
49
+ "eval_wer": 0.3099334021198762,
50
+ "step": 2000
51
+ },
52
+ {
53
+ "epoch": 1.6,
54
+ "learning_rate": 0.00017954887218045112,
55
+ "loss": 0.5386,
56
+ "step": 2500
57
+ },
58
+ {
59
+ "epoch": 1.92,
60
+ "learning_rate": 0.0001727136021872864,
61
+ "loss": 0.4977,
62
+ "step": 3000
63
+ },
64
+ {
65
+ "epoch": 1.92,
66
+ "eval_loss": 0.23867885768413544,
67
+ "eval_runtime": 71.0833,
68
+ "eval_samples_per_second": 146.138,
69
+ "eval_steps_per_second": 1.829,
70
+ "eval_wer": 0.2673295188068662,
71
+ "step": 3000
72
+ },
73
+ {
74
+ "epoch": 2.24,
75
+ "learning_rate": 0.00016587833219412168,
76
+ "loss": 0.4531,
77
+ "step": 3500
78
+ },
79
+ {
80
+ "epoch": 2.56,
81
+ "learning_rate": 0.00015905673274094328,
82
+ "loss": 0.4253,
83
+ "step": 4000
84
+ },
85
+ {
86
+ "epoch": 2.56,
87
+ "eval_loss": 0.22657370567321777,
88
+ "eval_runtime": 71.5875,
89
+ "eval_samples_per_second": 145.109,
90
+ "eval_steps_per_second": 1.816,
91
+ "eval_wer": 0.2523215458212175,
92
+ "step": 4000
93
+ },
94
+ {
95
+ "epoch": 2.88,
96
+ "learning_rate": 0.00015222146274777856,
97
+ "loss": 0.413,
98
+ "step": 4500
99
+ },
100
+ {
101
+ "epoch": 3.2,
102
+ "learning_rate": 0.0001453861927546138,
103
+ "loss": 0.3942,
104
+ "step": 5000
105
+ },
106
+ {
107
+ "epoch": 3.2,
108
+ "eval_loss": 0.21706202626228333,
109
+ "eval_runtime": 72.0993,
110
+ "eval_samples_per_second": 144.079,
111
+ "eval_steps_per_second": 1.803,
112
+ "eval_wer": 0.2437294812869337,
113
+ "step": 5000
114
+ },
115
+ {
116
+ "epoch": 3.52,
117
+ "learning_rate": 0.00013855092276144907,
118
+ "loss": 0.3741,
119
+ "step": 5500
120
+ },
121
+ {
122
+ "epoch": 3.84,
123
+ "learning_rate": 0.00013171565276828435,
124
+ "loss": 0.3619,
125
+ "step": 6000
126
+ },
127
+ {
128
+ "epoch": 3.84,
129
+ "eval_loss": 0.20762862265110016,
130
+ "eval_runtime": 72.0546,
131
+ "eval_samples_per_second": 144.168,
132
+ "eval_steps_per_second": 1.804,
133
+ "eval_wer": 0.22530719444705,
134
+ "step": 6000
135
+ },
136
+ {
137
+ "epoch": 4.16,
138
+ "learning_rate": 0.00012489405331510595,
139
+ "loss": 0.3435,
140
+ "step": 6500
141
+ },
142
+ {
143
+ "epoch": 4.48,
144
+ "learning_rate": 0.00011805878332194122,
145
+ "loss": 0.3245,
146
+ "step": 7000
147
+ },
148
+ {
149
+ "epoch": 4.48,
150
+ "eval_loss": 0.2087564468383789,
151
+ "eval_runtime": 71.9965,
152
+ "eval_samples_per_second": 144.285,
153
+ "eval_steps_per_second": 1.806,
154
+ "eval_wer": 0.21862864646843636,
155
+ "step": 7000
156
+ },
157
+ {
158
+ "epoch": 4.8,
159
+ "learning_rate": 0.0001112235133287765,
160
+ "loss": 0.3135,
161
+ "step": 7500
162
+ },
163
+ {
164
+ "epoch": 5.12,
165
+ "learning_rate": 0.0001044019138755981,
166
+ "loss": 0.308,
167
+ "step": 8000
168
+ },
169
+ {
170
+ "epoch": 5.12,
171
+ "eval_loss": 0.2086208015680313,
172
+ "eval_runtime": 68.9232,
173
+ "eval_samples_per_second": 150.718,
174
+ "eval_steps_per_second": 1.886,
175
+ "eval_wer": 0.22063596285526685,
176
+ "step": 8000
177
+ },
178
+ {
179
+ "epoch": 5.44,
180
+ "learning_rate": 9.756664388243337e-05,
181
+ "loss": 0.292,
182
+ "step": 8500
183
+ },
184
+ {
185
+ "epoch": 5.76,
186
+ "learning_rate": 9.073137388926864e-05,
187
+ "loss": 0.2881,
188
+ "step": 9000
189
+ },
190
+ {
191
+ "epoch": 5.76,
192
+ "eval_loss": 0.20888157188892365,
193
+ "eval_runtime": 70.2391,
194
+ "eval_samples_per_second": 147.895,
195
+ "eval_steps_per_second": 1.851,
196
+ "eval_wer": 0.21048682112372197,
197
+ "step": 9000
198
+ },
199
+ {
200
+ "epoch": 6.08,
201
+ "learning_rate": 8.389610389610389e-05,
202
+ "loss": 0.2717,
203
+ "step": 9500
204
+ },
205
+ {
206
+ "epoch": 6.4,
207
+ "learning_rate": 7.706083390293917e-05,
208
+ "loss": 0.2557,
209
+ "step": 10000
210
+ },
211
+ {
212
+ "epoch": 6.4,
213
+ "eval_loss": 0.20148096978664398,
214
+ "eval_runtime": 70.1726,
215
+ "eval_samples_per_second": 148.035,
216
+ "eval_steps_per_second": 1.853,
217
+ "eval_wer": 0.20035643935840916,
218
+ "step": 10000
219
+ },
220
+ {
221
+ "epoch": 6.72,
222
+ "learning_rate": 7.022556390977444e-05,
223
+ "loss": 0.2536,
224
+ "step": 10500
225
+ },
226
+ {
227
+ "epoch": 7.04,
228
+ "learning_rate": 6.33902939166097e-05,
229
+ "loss": 0.248,
230
+ "step": 11000
231
+ },
232
+ {
233
+ "epoch": 7.04,
234
+ "eval_loss": 0.2043762356042862,
235
+ "eval_runtime": 70.2184,
236
+ "eval_samples_per_second": 147.938,
237
+ "eval_steps_per_second": 1.851,
238
+ "eval_wer": 0.19529124847575274,
239
+ "step": 11000
240
+ },
241
+ {
242
+ "epoch": 7.36,
243
+ "learning_rate": 5.655502392344498e-05,
244
+ "loss": 0.2308,
245
+ "step": 11500
246
+ },
247
+ {
248
+ "epoch": 7.68,
249
+ "learning_rate": 4.971975393028025e-05,
250
+ "loss": 0.2251,
251
+ "step": 12000
252
+ },
253
+ {
254
+ "epoch": 7.68,
255
+ "eval_loss": 0.20575200021266937,
256
+ "eval_runtime": 70.8946,
257
+ "eval_samples_per_second": 146.527,
258
+ "eval_steps_per_second": 1.834,
259
+ "eval_wer": 0.19315261232529782,
260
+ "step": 12000
261
+ },
262
+ {
263
+ "epoch": 8.0,
264
+ "learning_rate": 4.291182501708818e-05,
265
+ "loss": 0.2207,
266
+ "step": 12500
267
+ },
268
+ {
269
+ "epoch": 8.32,
270
+ "learning_rate": 3.6076555023923446e-05,
271
+ "loss": 0.2052,
272
+ "step": 13000
273
+ },
274
+ {
275
+ "epoch": 8.32,
276
+ "eval_loss": 0.21170856058597565,
277
+ "eval_runtime": 70.7957,
278
+ "eval_samples_per_second": 146.732,
279
+ "eval_steps_per_second": 1.836,
280
+ "eval_wer": 0.18776850201669637,
281
+ "step": 13000
282
+ },
283
+ {
284
+ "epoch": 8.64,
285
+ "learning_rate": 2.9241285030758714e-05,
286
+ "loss": 0.2026,
287
+ "step": 13500
288
+ },
289
+ {
290
+ "epoch": 8.96,
291
+ "learning_rate": 2.2406015037593985e-05,
292
+ "loss": 0.1976,
293
+ "step": 14000
294
+ },
295
+ {
296
+ "epoch": 8.96,
297
+ "eval_loss": 0.21043309569358826,
298
+ "eval_runtime": 71.1895,
299
+ "eval_samples_per_second": 145.92,
300
+ "eval_steps_per_second": 1.826,
301
+ "eval_wer": 0.18249695150548728,
302
+ "step": 14000
303
+ },
304
+ {
305
+ "epoch": 9.28,
306
+ "learning_rate": 1.5570745044429256e-05,
307
+ "loss": 0.1875,
308
+ "step": 14500
309
+ },
310
+ {
311
+ "epoch": 9.6,
312
+ "learning_rate": 8.735475051264526e-06,
313
+ "loss": 0.1845,
314
+ "step": 15000
315
+ },
316
+ {
317
+ "epoch": 9.6,
318
+ "eval_loss": 0.21563756465911865,
319
+ "eval_runtime": 71.0722,
320
+ "eval_samples_per_second": 146.161,
321
+ "eval_steps_per_second": 1.829,
322
+ "eval_wer": 0.18212175218084609,
323
+ "step": 15000
324
+ },
325
+ {
326
+ "epoch": 9.92,
327
+ "learning_rate": 1.9138755980861244e-06,
328
+ "loss": 0.1837,
329
+ "step": 15500
330
+ },
331
+ {
332
+ "epoch": 10.0,
333
+ "step": 15630,
334
+ "total_flos": 9.942412569719006e+20,
335
+ "train_loss": 0.1316310991176183,
336
+ "train_runtime": 23113.6031,
337
+ "train_samples_per_second": 173.054,
338
+ "train_steps_per_second": 0.676
339
+ }
340
+ ],
341
+ "max_steps": 15630,
342
+ "num_train_epochs": 10,
343
+ "total_flos": 9.942412569719006e+20,
344
+ "trial_name": null,
345
+ "trial_params": null
346
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a1b80bcfc69cf9fa9cf637cef6f631649190d0c441246f20b284d86514490bf
3
+ size 3055
vocab.json ADDED
@@ -0,0 +1 @@
1
+ {"\"": 1, "،": 2, "ء": 3, "آ": 4, "أ": 5, "ؤ": 6, "إ": 7, "ئ": 8, "ا": 9, "ب": 10, "ة": 11, "ت": 12, "ث": 13, "ج": 14, "ح": 15, "خ": 16, "د": 17, "ذ": 18, "ر": 19, "ز": 20, "س": 21, "ش": 22, "ص": 23, "ض": 24, "ط": 25, "ظ": 26, "ع": 27, "غ": 28, "ـ": 29, "ف": 30, "ق": 31, "ك": 32, "ل": 33, "م": 34, "ن": 35, "ه": 36, "و": 37, "ى": 38, "ي": 39, "ّ": 40, "ی": 41, "|": 0, "[UNK]": 42, "[PAD]": 43}