wissamantoun committed on
Commit
d75604d
1 Parent(s): 30a4418

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,277 @@
+ ---
+ language: fr
+ license: mit
+ tags:
+ - deberta-v2
+ - text-classification
+ base_model: almanach/camembertav2-base
+ datasets:
+ - FLUE-PAWS-X
+ metrics:
+ - accuracy
+ pipeline_tag: text-classification
+ library_name: transformers
+ model-index:
+ - name: almanach/camembertav2-base-pawsx
+   results:
+   - task:
+       type: text-classification
+       name: Paraphrase Identification
+     dataset:
+       type: flue-paws-x
+       name: FLUE-PAWS-X
+     metrics:
+     - name: accuracy
+       type: accuracy
+       value: 0.93511
+       verified: false
+ ---
+
+ # Model Card for almanach/camembertav2-base-pawsx
+
+ almanach/camembertav2-base-pawsx is a deberta-v2 model for text classification. It was fine-tuned on the FLUE-PAWS-X dataset for paraphrase identification and reaches an accuracy of 0.93511 on the FLUE-PAWS-X dev set (1988 examples).
+
+ The model is one of several fine-tuned variants of almanach/camembertav2-base.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Developed by:** Wissam Antoun (PhD student at Almanach, Inria-Paris)
+ - **Model type:** deberta-v2
+ - **Language(s) (NLP):** French
+ - **License:** MIT
+ - **Finetuned from model:** almanach/camembertav2-base
+
+ ### Model Sources
+
+ - **Repository:** https://github.com/WissamAntoun/camemberta
+ - **Paper:** https://arxiv.org/abs/2411.08868
+
+ ## Uses
+
+ The model classifies whether two French sentences are paraphrases of each other (paraphrase identification).
+
+ ## Bias, Risks, and Limitations
+
+ The model may reflect biases present in its training data, and it may not generalize well to other datasets or tasks.
+
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ model = AutoModelForSequenceClassification.from_pretrained("almanach/camembertav2-base-pawsx")
+ tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base-pawsx")
+
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ # Pass the sentence pair as a dict with "text" and "text_pair" keys.
+ result = classifier({
+     "text": "Le livre est très intéressant et j'ai appris beaucoup de choses.",
+     "text_pair": "Le livre est très ennuyeux et je n'ai rien appris.",
+ })
+ print(result)
+ ```
+
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on the FLUE-PAWS-X dataset.
+
+ - Dataset Name: FLUE-PAWS-X
+ - Dataset Size:
+   - Train: 49399
+   - Dev: 1988
+   - Test: 2000
+
+
+ ### Training Procedure
+
+ The model was trained with the run_classification.py script from the Hugging Face Transformers repository.
+
+
+
+ #### Training Hyperparameters
+
+ ```yml
+ accelerator_config: '{''split_batches'': False, ''dispatch_batches'': None, ''even_batches'':
+   True, ''use_seedable_sampler'': True, ''non_blocking'': False, ''gradient_accumulation_kwargs'':
+   None}'
+ adafactor: false
+ adam_beta1: 0.9
+ adam_beta2: 0.999
+ adam_epsilon: 1.0e-08
+ auto_find_batch_size: false
+ base_model: camembertv2
+ base_model_name: camembertav2-base-bf16-p2-17000
+ batch_eval_metrics: false
+ bf16: false
+ bf16_full_eval: false
+ data_seed: 666.0
+ dataloader_drop_last: false
+ dataloader_num_workers: 0
+ dataloader_persistent_workers: false
+ dataloader_pin_memory: true
+ dataloader_prefetch_factor: .nan
+ ddp_backend: .nan
+ ddp_broadcast_buffers: .nan
+ ddp_bucket_cap_mb: .nan
+ ddp_find_unused_parameters: .nan
+ ddp_timeout: 1800
+ debug: '[]'
+ deepspeed: .nan
+ disable_tqdm: false
+ dispatch_batches: .nan
+ do_eval: true
+ do_predict: false
+ do_train: true
+ epoch: 5.999028340080971
+ eval_accumulation_steps: 4
+ eval_accuracy: 0.9351106639839034
+ eval_delay: 0
+ eval_do_concat_batches: true
+ eval_loss: 0.4311606884002685
+ eval_on_start: false
+ eval_runtime: 5.8632
+ eval_samples: 1988
+ eval_samples_per_second: 339.064
+ eval_steps: .nan
+ eval_steps_per_second: 42.468
+ eval_strategy: epoch
+ eval_use_gather_object: false
+ evaluation_strategy: epoch
+ fp16: false
+ fp16_backend: auto
+ fp16_full_eval: false
+ fp16_opt_level: O1
+ fsdp: '[]'
+ fsdp_config: '{''min_num_params'': 0, ''xla'': False, ''xla_fsdp_v2'': False, ''xla_fsdp_grad_ckpt'':
+   False}'
+ fsdp_min_num_params: 0
+ fsdp_transformer_layer_cls_to_wrap: .nan
+ full_determinism: false
+ gradient_accumulation_steps: 2
+ gradient_checkpointing: false
+ gradient_checkpointing_kwargs: .nan
+ greater_is_better: true
+ group_by_length: false
+ half_precision_backend: auto
+ hub_always_push: false
+ hub_model_id: .nan
+ hub_private_repo: false
+ hub_strategy: every_save
+ hub_token: <HUB_TOKEN>
+ ignore_data_skip: false
+ include_inputs_for_metrics: false
+ include_num_input_tokens_seen: false
+ include_tokens_per_second: false
+ jit_mode_eval: false
+ label_names: .nan
+ label_smoothing_factor: 0.0
+ learning_rate: 5.0e-05
+ length_column_name: length
+ load_best_model_at_end: true
+ local_rank: 0
+ log_level: debug
+ log_level_replica: warning
+ log_on_each_node: true
+ logging_dir: /scratch/camembertv2/runs/results/flue-PAWS-X/camembertav2-base-bf16-p2-17000/max_seq_length-148-gradient_accumulation_steps-2-precision-fp32-learning_rate-5e-05-epochs-6-lr_scheduler-linear-warmup_steps-0/SEED-666/logs
+ logging_first_step: false
+ logging_nan_inf_filter: true
+ logging_steps: 100
+ logging_strategy: steps
+ lr_scheduler_kwargs: '{}'
+ lr_scheduler_type: linear
+ max_grad_norm: 1.0
+ max_steps: -1
+ metric_for_best_model: accuracy
+ mp_parameters: .nan
+ name: camembertv2/runs/results/flue-PAWS-X/camembertav2-base-bf16-p2-17000/max_seq_length-148-gradient_accumulation_steps-2-precision-fp32-learning_rate-5e-05-epochs-6-lr_scheduler-linear-warmup_steps-0
+ neftune_noise_alpha: .nan
+ no_cuda: false
+ num_train_epochs: 6.0
+ optim: adamw_torch
+ optim_args: .nan
+ optim_target_modules: .nan
+ output_dir: /scratch/camembertv2/runs/results/flue-PAWS-X/camembertav2-base-bf16-p2-17000/max_seq_length-148-gradient_accumulation_steps-2-precision-fp32-learning_rate-5e-05-epochs-6-lr_scheduler-linear-warmup_steps-0/SEED-666
+ overwrite_output_dir: false
+ past_index: -1
+ per_device_eval_batch_size: 8
+ per_device_train_batch_size: 8
+ per_gpu_eval_batch_size: .nan
+ per_gpu_train_batch_size: .nan
+ prediction_loss_only: false
+ push_to_hub: false
+ push_to_hub_model_id: .nan
+ push_to_hub_organization: .nan
+ push_to_hub_token: <PUSH_TO_HUB_TOKEN>
+ ray_scope: last
+ remove_unused_columns: true
+ report_to: '[''tensorboard'']'
+ restore_callback_states_from_checkpoint: false
+ resume_from_checkpoint: .nan
+ run_name: /scratch/camembertv2/runs/results/flue-PAWS-X/camembertav2-base-bf16-p2-17000/max_seq_length-148-gradient_accumulation_steps-2-precision-fp32-learning_rate-5e-05-epochs-6-lr_scheduler-linear-warmup_steps-0/SEED-666
+ save_on_each_node: false
+ save_only_model: false
+ save_safetensors: true
+ save_steps: 500
+ save_strategy: epoch
+ save_total_limit: .nan
+ seed: 666
+ skip_memory_metrics: true
+ split_batches: .nan
+ tf32: .nan
+ torch_compile: true
+ torch_compile_backend: inductor
+ torch_compile_mode: .nan
+ torch_empty_cache_steps: .nan
+ torchdynamo: .nan
+ total_flos: 1.3373133118742268e+16
+ tpu_metrics_debug: false
+ tpu_num_cores: .nan
+ train_loss: 0.1195580537627343
+ train_runtime: 3073.2453
+ train_samples: 49399
+ train_samples_per_second: 96.443
+ train_steps_per_second: 6.027
+ use_cpu: false
+ use_ipex: false
+ use_legacy_prediction_loop: false
+ use_mps_device: false
+ warmup_ratio: 0.0
+ warmup_steps: 0
+ weight_decay: 0.0
+
+ ```
+
+ #### Results
+
+ **Accuracy:** 0.93511 (FLUE-PAWS-X dev set)
+
+ ## Technical Specifications
+
+ ### Model Architecture and Objective
+
+ deberta-v2 encoder with a sequence classification head (`DebertaV2ForSequenceClassification`).
+
+ ## Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{antoun2024camembert20smarterfrench,
+       title={CamemBERT 2.0: A Smarter French Language Model Aged to Perfection},
+       author={Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
+       year={2024},
+       eprint={2411.08868},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2411.08868},
+ }
+ ```
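The README hyperparameters above specify a linear schedule (`lr_scheduler_type: linear`) with no warmup and a peak learning rate of 5e-05. As a rough sketch, assuming the rate decays linearly to zero over the 18522 optimizer steps reported as `global_step` in trainer_state.json, the logged learning rates can be reproduced as shown below. The function name `linear_lr` is illustrative and is not part of the training script.

```python
def linear_lr(step: int, peak_lr: float = 5e-5, total_steps: int = 18522) -> float:
    """Learning rate after `step` optimizer updates: linear decay to zero, no warmup."""
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / total_steps


# Compare against the values logged at steps 100 and 3000 in trainer_state.json
# (4.9730050750458916e-05 and 4.190152251376741e-05).
print(linear_lr(100))
print(linear_lr(3000))
```

The agreement with the log suggests the schedule ran over the full 18522 steps without warmup, which matches `warmup_steps: 0` in the hyperparameters.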
all_results.json ADDED
@@ -0,0 +1,15 @@
+ {
+     "epoch": 5.999028340080971,
+     "eval_accuracy": 0.9351106639839034,
+     "eval_loss": 0.43116068840026855,
+     "eval_runtime": 5.8632,
+     "eval_samples": 1988,
+     "eval_samples_per_second": 339.064,
+     "eval_steps_per_second": 42.468,
+     "total_flos": 1.3373133118742268e+16,
+     "train_loss": 0.11955805376273435,
+     "train_runtime": 3073.2453,
+     "train_samples": 49399,
+     "train_samples_per_second": 96.443,
+     "train_steps_per_second": 6.027
+ }
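The throughput figures in all_results.json can be cross-checked against the sample counts and runtimes in the same file. The sketch below assumes that training throughput counts every sample once per epoch over 6 epochs (`num_train_epochs` in the README hyperparameters); the variable names are illustrative.

```python
# Figures copied from all_results.json.
train_samples = 49399
num_epochs = 6  # assumption: taken from num_train_epochs in the README
train_runtime_s = 3073.2453
eval_samples = 1988
eval_runtime_s = 5.8632

# Samples processed per second of wall-clock time.
train_throughput = train_samples * num_epochs / train_runtime_s
eval_throughput = eval_samples / eval_runtime_s

print(f"train: {train_throughput:.3f} samples/s")
print(f"eval:  {eval_throughput:.3f} samples/s")
```

Both values land within rounding distance of the reported `train_samples_per_second` (96.443) and `eval_samples_per_second` (339.064).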
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "_name_or_path": "/scratch/camembertv2/runs/models/camembertav2-base-bf16/post/ckpt-p2-17000/pt/discriminator/",
+   "architectures": [
+     "DebertaV2ForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 1,
+   "conv_act": "gelu",
+   "conv_kernel_size": 0,
+   "embedding_size": 768,
+   "eos_token_id": 2,
+   "finetuning_task": "paws-x",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "0": 0,
+     "1": 1
+   },
+   "layer_norm_eps": 1e-07,
+   "max_position_embeddings": 1024,
+   "max_relative_positions": -1,
+   "model_name": "camembertav2-base-bf16",
+   "model_type": "deberta-v2",
+   "norm_rel_ebd": "layer_norm",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_dropout": 0,
+   "pooler_hidden_act": "gelu",
+   "pooler_hidden_size": 768,
+   "pos_att_type": [
+     "p2c",
+     "c2p"
+   ],
+   "position_biased_input": false,
+   "position_buckets": 256,
+   "relative_attention": true,
+   "share_att_key": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.2",
+   "type_vocab_size": 0,
+   "vocab_size": 32768
+ }
eval_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 5.999028340080971,
+     "eval_accuracy": 0.9351106639839034,
+     "eval_loss": 0.43116068840026855,
+     "eval_runtime": 5.8632,
+     "eval_samples": 1988,
+     "eval_samples_per_second": 339.064,
+     "eval_steps_per_second": 42.468
+ }
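Because `eval_accuracy` is a fraction of `eval_samples`, it can be turned back into a count of correct predictions. The following is a small sketch using only the two numbers from eval_results.json; the helper name `correct_predictions` is hypothetical.

```python
def correct_predictions(accuracy: float, n_samples: int) -> int:
    """Recover the number of correct predictions from an accuracy fraction."""
    return round(accuracy * n_samples)


eval_samples = 1988
correct = correct_predictions(0.9351106639839034, eval_samples)
errors = eval_samples - correct

print(f"{correct} correct, {errors} wrong out of {eval_samples}")
```

Reading the metric as a count makes it easier to judge how much a difference in the fourth decimal place actually matters on a dev set of this size.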
logs/events.out.tfevents.1724538746.nefgpu39.342676.0 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c3c8434c50b65d8f695e9ac1147368e0795b222d0921156b0bc0167807242de
+ size 47364
logs/events.out.tfevents.1724541826.nefgpu39.342676.1 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:406ee4c0e83f33b703fd7682ffcec39d3ef80afe8a928a6d561c635f76bb86d1
+ size 369
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9371cfd1659c178ad4a6505e86138bb9e9736d795c8a43822534a277616801cf
+ size 444859368
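The LFS pointer gives the checkpoint size in bytes, which allows a quick upper bound on the parameter count. This sketch assumes all weights are stored as float32 (`torch_dtype` in config.json) at 4 bytes each; since a safetensors file also contains a small JSON header, the true count is slightly lower than the figure computed here.

```python
SAFETENSORS_BYTES = 444_859_368  # "size" field of the LFS pointer
BYTES_PER_FLOAT32 = 4            # assumption: every tensor is float32

# Upper bound: ignores the header bytes at the start of the file.
max_params = SAFETENSORS_BYTES // BYTES_PER_FLOAT32

print(f"at most about {max_params / 1e6:.1f}M parameters")
```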
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "[CLS]",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "errors": "replace",
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "[UNK]"
+ }
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+     "epoch": 5.999028340080971,
+     "total_flos": 1.3373133118742268e+16,
+     "train_loss": 0.11955805376273435,
+     "train_runtime": 3073.2453,
+     "train_samples": 49399,
+     "train_samples_per_second": 96.443,
+     "train_steps_per_second": 6.027
+ }
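The fractional `epoch` value (5.999... rather than 6.0) follows from how batches and gradient accumulation divide the training set. The sketch below assumes a single device, a per-device batch size of 8, and 2 accumulation steps (both from the README hyperparameters), with any leftover batch at the end of an epoch not triggering its own optimizer update. This is an interpretation that happens to match the logged numbers, not a statement about the Trainer internals.

```python
import math

train_samples = 49399
per_device_batch = 8   # per_device_train_batch_size
grad_accum = 2         # gradient_accumulation_steps
num_epochs = 6         # num_train_epochs

# Batches the dataloader yields per epoch (the last one is partial).
batches_per_epoch = math.ceil(train_samples / per_device_batch)
# Optimizer updates per epoch: one per full group of `grad_accum` batches.
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = num_epochs * updates_per_epoch
# Fraction of the data seen, in epochs, after the last update.
final_epoch = total_updates * grad_accum / batches_per_epoch

print(batches_per_epoch, updates_per_epoch, total_updates, final_epoch)
```

Under these assumptions the total comes to 18522 updates, which is the `global_step` recorded in trainer_state.json, and the final epoch value agrees with the `epoch` field above.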
trainer_state.json ADDED
@@ -0,0 +1,1391 @@
+ {
+   "best_metric": 0.9351106639839034,
+   "best_model_checkpoint": "/scratch/camembertv2/runs/results/flue-PAWS-X/camembertav2-base-bf16-p2-17000/max_seq_length-148-gradient_accumulation_steps-2-precision-fp32-learning_rate-5e-05-epochs-6-lr_scheduler-linear-warmup_steps-0/SEED-666/checkpoint-18522",
+   "epoch": 5.999028340080971,
+   "eval_steps": 500,
+   "global_step": 18522,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.032388663967611336,
+       "grad_norm": 5.7791056632995605,
+       "learning_rate": 4.9730050750458916e-05,
+       "loss": 0.5409,
+       "step": 100
+     },
+     {
+       "epoch": 0.06477732793522267,
+       "grad_norm": 3.9132766723632812,
+       "learning_rate": 4.946010150091783e-05,
+       "loss": 0.3386,
+       "step": 200
+     },
+     {
+       "epoch": 0.09716599190283401,
+       "grad_norm": 19.378026962280273,
+       "learning_rate": 4.9190152251376743e-05,
+       "loss": 0.3131,
+       "step": 300
+     },
+     {
+       "epoch": 0.12955465587044535,
+       "grad_norm": 15.864294052124023,
+       "learning_rate": 4.892020300183566e-05,
+       "loss": 0.2785,
+       "step": 400
+     },
+     {
+       "epoch": 0.16194331983805668,
+       "grad_norm": 12.896875381469727,
+       "learning_rate": 4.865025375229457e-05,
+       "loss": 0.2929,
+       "step": 500
+     },
+     {
+       "epoch": 0.19433198380566802,
+       "grad_norm": 21.26336669921875,
+       "learning_rate": 4.8380304502753484e-05,
+       "loss": 0.2847,
+       "step": 600
+     },
+     {
+       "epoch": 0.22672064777327935,
+       "grad_norm": 5.422796249389648,
+       "learning_rate": 4.81103552532124e-05,
+       "loss": 0.2673,
+       "step": 700
+     },
+     {
+       "epoch": 0.2591093117408907,
+       "grad_norm": 3.9689624309539795,
+       "learning_rate": 4.784040600367131e-05,
+       "loss": 0.2561,
+       "step": 800
+     },
+     {
+       "epoch": 0.291497975708502,
+       "grad_norm": 8.32761001586914,
+       "learning_rate": 4.7570456754130226e-05,
+       "loss": 0.2618,
+       "step": 900
+     },
+     {
+       "epoch": 0.32388663967611336,
+       "grad_norm": 3.8402862548828125,
+       "learning_rate": 4.730050750458914e-05,
+       "loss": 0.2535,
+       "step": 1000
+     },
+     {
+       "epoch": 0.3562753036437247,
+       "grad_norm": 38.90117645263672,
+       "learning_rate": 4.703055825504805e-05,
+       "loss": 0.2476,
+       "step": 1100
+     },
+     {
+       "epoch": 0.38866396761133604,
+       "grad_norm": 8.632582664489746,
+       "learning_rate": 4.6760609005506967e-05,
+       "loss": 0.2504,
+       "step": 1200
+     },
+     {
+       "epoch": 0.42105263157894735,
+       "grad_norm": 2.50603985786438,
+       "learning_rate": 4.649065975596588e-05,
+       "loss": 0.2375,
+       "step": 1300
+     },
+     {
+       "epoch": 0.4534412955465587,
+       "grad_norm": 22.682083129882812,
+       "learning_rate": 4.6220710506424794e-05,
+       "loss": 0.2304,
+       "step": 1400
+     },
+     {
+       "epoch": 0.48582995951417,
+       "grad_norm": 3.2134041786193848,
+       "learning_rate": 4.595076125688371e-05,
+       "loss": 0.2342,
+       "step": 1500
+     },
+     {
+       "epoch": 0.5182186234817814,
+       "grad_norm": 9.210716247558594,
+       "learning_rate": 4.568081200734262e-05,
+       "loss": 0.2447,
+       "step": 1600
+     },
+     {
+       "epoch": 0.5506072874493927,
+       "grad_norm": 7.8480143547058105,
+       "learning_rate": 4.5410862757801535e-05,
+       "loss": 0.2161,
+       "step": 1700
+     },
+     {
+       "epoch": 0.582995951417004,
+       "grad_norm": 1.0597597360610962,
+       "learning_rate": 4.514091350826045e-05,
+       "loss": 0.2234,
+       "step": 1800
+     },
+     {
+       "epoch": 0.6153846153846154,
+       "grad_norm": 1.914153814315796,
+       "learning_rate": 4.487096425871936e-05,
+       "loss": 0.2001,
+       "step": 1900
+     },
+     {
+       "epoch": 0.6477732793522267,
+       "grad_norm": 7.569839000701904,
+       "learning_rate": 4.4601015009178276e-05,
+       "loss": 0.2354,
+       "step": 2000
+     },
+     {
+       "epoch": 0.680161943319838,
+       "grad_norm": 9.363743782043457,
+       "learning_rate": 4.433106575963719e-05,
+       "loss": 0.1989,
+       "step": 2100
+     },
+     {
+       "epoch": 0.7125506072874493,
+       "grad_norm": 0.2488391101360321,
+       "learning_rate": 4.40611165100961e-05,
+       "loss": 0.2227,
+       "step": 2200
+     },
+     {
+       "epoch": 0.7449392712550608,
+       "grad_norm": 6.931863307952881,
+       "learning_rate": 4.379116726055502e-05,
+       "loss": 0.2296,
+       "step": 2300
+     },
+     {
+       "epoch": 0.7773279352226721,
+       "grad_norm": 2.7890470027923584,
+       "learning_rate": 4.352121801101393e-05,
+       "loss": 0.2161,
+       "step": 2400
+     },
+     {
+       "epoch": 0.8097165991902834,
+       "grad_norm": 0.6333386301994324,
+       "learning_rate": 4.3251268761472844e-05,
+       "loss": 0.21,
+       "step": 2500
+     },
+     {
+       "epoch": 0.8421052631578947,
+       "grad_norm": 0.40448668599128723,
+       "learning_rate": 4.298131951193176e-05,
+       "loss": 0.2115,
+       "step": 2600
+     },
+     {
+       "epoch": 0.8744939271255061,
+       "grad_norm": 10.913817405700684,
+       "learning_rate": 4.271137026239067e-05,
+       "loss": 0.1923,
+       "step": 2700
+     },
+     {
+       "epoch": 0.9068825910931174,
+       "grad_norm": 0.3693946301937103,
+       "learning_rate": 4.2441421012849585e-05,
+       "loss": 0.2019,
+       "step": 2800
+     },
+     {
+       "epoch": 0.9392712550607287,
+       "grad_norm": 2.760298252105713,
+       "learning_rate": 4.21714717633085e-05,
+       "loss": 0.2113,
+       "step": 2900
+     },
+     {
+       "epoch": 0.97165991902834,
+       "grad_norm": 11.557462692260742,
+       "learning_rate": 4.190152251376741e-05,
+       "loss": 0.2073,
+       "step": 3000
+     },
+     {
+       "epoch": 0.9998380566801619,
+       "eval_accuracy": 0.9144869215291751,
+       "eval_loss": 0.28149500489234924,
+       "eval_runtime": 6.5505,
+       "eval_samples_per_second": 303.488,
+       "eval_steps_per_second": 38.012,
+       "step": 3087
+     },
230
+ {
231
+ "epoch": 1.0040485829959513,
232
+ "grad_norm": 11.277145385742188,
233
+ "learning_rate": 4.1631573264226326e-05,
234
+ "loss": 0.2007,
235
+ "step": 3100
236
+ },
237
+ {
238
+ "epoch": 1.0364372469635628,
239
+ "grad_norm": 7.342209815979004,
240
+ "learning_rate": 4.136162401468524e-05,
241
+ "loss": 0.1369,
242
+ "step": 3200
243
+ },
244
+ {
245
+ "epoch": 1.0688259109311742,
246
+ "grad_norm": 4.37772798538208,
247
+ "learning_rate": 4.1091674765144154e-05,
248
+ "loss": 0.1951,
249
+ "step": 3300
250
+ },
251
+ {
252
+ "epoch": 1.1012145748987854,
253
+ "grad_norm": 10.176764488220215,
254
+ "learning_rate": 4.0821725515603074e-05,
255
+ "loss": 0.1621,
256
+ "step": 3400
257
+ },
258
+ {
259
+ "epoch": 1.1336032388663968,
260
+ "grad_norm": 0.42846062779426575,
261
+ "learning_rate": 4.055177626606198e-05,
262
+ "loss": 0.1593,
263
+ "step": 3500
264
+ },
265
+ {
266
+ "epoch": 1.165991902834008,
267
+ "grad_norm": 0.37047576904296875,
268
+ "learning_rate": 4.0281827016520895e-05,
269
+ "loss": 0.1654,
270
+ "step": 3600
271
+ },
272
+ {
273
+ "epoch": 1.1983805668016194,
274
+ "grad_norm": 0.9531276822090149,
275
+ "learning_rate": 4.001187776697981e-05,
276
+ "loss": 0.1686,
277
+ "step": 3700
278
+ },
279
+ {
280
+ "epoch": 1.2307692307692308,
281
+ "grad_norm": 6.070562362670898,
282
+ "learning_rate": 3.974192851743873e-05,
283
+ "loss": 0.1741,
284
+ "step": 3800
285
+ },
286
+ {
287
+ "epoch": 1.263157894736842,
288
+ "grad_norm": 13.438559532165527,
289
+ "learning_rate": 3.9471979267897636e-05,
290
+ "loss": 0.1887,
291
+ "step": 3900
292
+ },
293
+ {
294
+ "epoch": 1.2955465587044535,
295
+ "grad_norm": 1.3613296747207642,
296
+ "learning_rate": 3.920203001835655e-05,
297
+ "loss": 0.1598,
298
+ "step": 4000
299
+ },
300
+ {
301
+ "epoch": 1.3279352226720649,
302
+ "grad_norm": 12.832125663757324,
303
+ "learning_rate": 3.893208076881546e-05,
304
+ "loss": 0.1735,
305
+ "step": 4100
306
+ },
307
+ {
308
+ "epoch": 1.360323886639676,
309
+ "grad_norm": 11.818835258483887,
310
+ "learning_rate": 3.8662131519274384e-05,
311
+ "loss": 0.1838,
312
+ "step": 4200
313
+ },
314
+ {
315
+ "epoch": 1.3927125506072875,
316
+ "grad_norm": 0.46300917863845825,
317
+ "learning_rate": 3.839218226973329e-05,
318
+ "loss": 0.168,
319
+ "step": 4300
320
+ },
321
+ {
322
+ "epoch": 1.425101214574899,
323
+ "grad_norm": 5.4205403327941895,
324
+ "learning_rate": 3.8122233020192204e-05,
325
+ "loss": 0.1675,
326
+ "step": 4400
327
+ },
328
+ {
329
+ "epoch": 1.45748987854251,
330
+ "grad_norm": 1.2334966659545898,
331
+ "learning_rate": 3.785228377065112e-05,
332
+ "loss": 0.1651,
333
+ "step": 4500
334
+ },
335
+ {
336
+ "epoch": 1.4898785425101215,
337
+ "grad_norm": 1.531936526298523,
338
+ "learning_rate": 3.758233452111004e-05,
339
+ "loss": 0.1577,
340
+ "step": 4600
341
+ },
342
+ {
343
+ "epoch": 1.522267206477733,
344
+ "grad_norm": 14.39586353302002,
345
+ "learning_rate": 3.7312385271568945e-05,
346
+ "loss": 0.1925,
347
+ "step": 4700
348
+ },
349
+ {
350
+ "epoch": 1.5546558704453441,
351
+ "grad_norm": 17.419506072998047,
352
+ "learning_rate": 3.704243602202786e-05,
353
+ "loss": 0.1405,
354
+ "step": 4800
355
+ },
356
+ {
357
+ "epoch": 1.5870445344129553,
358
+ "grad_norm": 16.734468460083008,
359
+ "learning_rate": 3.677248677248677e-05,
360
+ "loss": 0.1511,
361
+ "step": 4900
362
+ },
363
+ {
364
+ "epoch": 1.6194331983805668,
365
+ "grad_norm": 0.26591989398002625,
366
+ "learning_rate": 3.650253752294569e-05,
367
+ "loss": 0.1704,
368
+ "step": 5000
369
+ },
370
+ {
371
+ "epoch": 1.6518218623481782,
372
+ "grad_norm": 11.863933563232422,
373
+ "learning_rate": 3.62325882734046e-05,
374
+ "loss": 0.1498,
375
+ "step": 5100
376
+ },
377
+ {
378
+ "epoch": 1.6842105263157894,
379
+ "grad_norm": 0.1933823674917221,
380
+ "learning_rate": 3.5962639023863514e-05,
381
+ "loss": 0.1556,
382
+ "step": 5200
383
+ },
384
+ {
385
+ "epoch": 1.7165991902834008,
386
+ "grad_norm": 5.462769508361816,
387
+ "learning_rate": 3.569268977432243e-05,
388
+ "loss": 0.1656,
389
+ "step": 5300
390
+ },
391
+ {
392
+ "epoch": 1.7489878542510122,
393
+ "grad_norm": 0.5079367160797119,
394
+ "learning_rate": 3.542274052478135e-05,
395
+ "loss": 0.1637,
396
+ "step": 5400
397
+ },
398
+ {
399
+ "epoch": 1.7813765182186234,
400
+ "grad_norm": 0.9675686955451965,
401
+ "learning_rate": 3.5152791275240255e-05,
402
+ "loss": 0.1475,
403
+ "step": 5500
404
+ },
405
+ {
406
+ "epoch": 1.8137651821862348,
407
+ "grad_norm": 13.242751121520996,
408
+ "learning_rate": 3.488284202569917e-05,
409
+ "loss": 0.1799,
410
+ "step": 5600
411
+ },
412
+ {
413
+ "epoch": 1.8461538461538463,
414
+ "grad_norm": 0.37781623005867004,
415
+ "learning_rate": 3.461289277615808e-05,
416
+ "loss": 0.1562,
417
+ "step": 5700
418
+ },
419
+ {
420
+ "epoch": 1.8785425101214575,
421
+ "grad_norm": 11.740013122558594,
422
+ "learning_rate": 3.4342943526617e-05,
423
+ "loss": 0.1798,
424
+ "step": 5800
425
+ },
426
+ {
427
+ "epoch": 1.9109311740890689,
428
+ "grad_norm": 7.2465009689331055,
429
+ "learning_rate": 3.407299427707591e-05,
430
+ "loss": 0.163,
431
+ "step": 5900
432
+ },
433
+ {
434
+ "epoch": 1.9433198380566803,
435
+ "grad_norm": 1.1531609296798706,
436
+ "learning_rate": 3.380304502753482e-05,
437
+ "loss": 0.1347,
438
+ "step": 6000
439
+ },
440
+ {
441
+ "epoch": 1.9757085020242915,
442
+ "grad_norm": 3.8277270793914795,
443
+ "learning_rate": 3.353309577799374e-05,
444
+ "loss": 0.1719,
445
+ "step": 6100
446
+ },
447
+ {
448
+ "epoch": 2.0,
449
+ "eval_accuracy": 0.9225352112676056,
450
+ "eval_loss": 0.3341009020805359,
451
+ "eval_runtime": 5.9971,
452
+ "eval_samples_per_second": 331.492,
453
+ "eval_steps_per_second": 41.52,
454
+ "step": 6175
455
+ },
456
+ {
457
+ "epoch": 2.0080971659919027,
458
+ "grad_norm": 31.751121520996094,
459
+ "learning_rate": 3.326314652845266e-05,
460
+ "loss": 0.1353,
461
+ "step": 6200
462
+ },
463
+ {
464
+ "epoch": 2.0404858299595143,
465
+ "grad_norm": 6.975146770477295,
466
+ "learning_rate": 3.2993197278911564e-05,
467
+ "loss": 0.105,
468
+ "step": 6300
469
+ },
470
+ {
471
+ "epoch": 2.0728744939271255,
472
+ "grad_norm": 2.166383981704712,
473
+ "learning_rate": 3.272324802937048e-05,
474
+ "loss": 0.1191,
475
+ "step": 6400
476
+ },
477
+ {
478
+ "epoch": 2.1052631578947367,
479
+ "grad_norm": 14.97061824798584,
480
+ "learning_rate": 3.245329877982939e-05,
481
+ "loss": 0.1143,
482
+ "step": 6500
483
+ },
484
+ {
485
+ "epoch": 2.1376518218623484,
486
+ "grad_norm": 26.100902557373047,
487
+ "learning_rate": 3.218334953028831e-05,
488
+ "loss": 0.1116,
489
+ "step": 6600
490
+ },
491
+ {
492
+ "epoch": 2.1700404858299596,
493
+ "grad_norm": 1.7684509754180908,
494
+ "learning_rate": 3.191340028074722e-05,
495
+ "loss": 0.1035,
496
+ "step": 6700
497
+ },
498
+ {
499
+ "epoch": 2.2024291497975708,
500
+ "grad_norm": 0.07033982872962952,
501
+ "learning_rate": 3.164345103120613e-05,
502
+ "loss": 0.1092,
503
+ "step": 6800
504
+ },
505
+ {
506
+ "epoch": 2.234817813765182,
507
+ "grad_norm": 0.3883877396583557,
508
+ "learning_rate": 3.1373501781665046e-05,
509
+ "loss": 0.1079,
510
+ "step": 6900
511
+ },
512
+ {
513
+ "epoch": 2.2672064777327936,
514
+ "grad_norm": 7.267897129058838,
515
+ "learning_rate": 3.110355253212397e-05,
516
+ "loss": 0.1172,
517
+ "step": 7000
518
+ },
519
+ {
520
+ "epoch": 2.299595141700405,
521
+ "grad_norm": 5.313385963439941,
522
+ "learning_rate": 3.0833603282582874e-05,
523
+ "loss": 0.1346,
524
+ "step": 7100
525
+ },
526
+ {
527
+ "epoch": 2.331983805668016,
528
+ "grad_norm": 19.593353271484375,
529
+ "learning_rate": 3.056365403304179e-05,
530
+ "loss": 0.1318,
531
+ "step": 7200
532
+ },
533
+ {
534
+ "epoch": 2.3643724696356276,
535
+ "grad_norm": 2.72308611869812,
536
+ "learning_rate": 3.0293704783500704e-05,
537
+ "loss": 0.1138,
538
+ "step": 7300
539
+ },
540
+ {
541
+ "epoch": 2.396761133603239,
542
+ "grad_norm": 8.528360366821289,
543
+ "learning_rate": 3.0023755533959618e-05,
544
+ "loss": 0.1267,
545
+ "step": 7400
546
+ },
547
+ {
548
+ "epoch": 2.42914979757085,
549
+ "grad_norm": 3.0180954933166504,
550
+ "learning_rate": 2.9753806284418528e-05,
551
+ "loss": 0.1,
552
+ "step": 7500
553
+ },
554
+ {
555
+ "epoch": 2.4615384615384617,
556
+ "grad_norm": 11.93277645111084,
557
+ "learning_rate": 2.9483857034877442e-05,
558
+ "loss": 0.1322,
559
+ "step": 7600
560
+ },
561
+ {
562
+ "epoch": 2.493927125506073,
563
+ "grad_norm": 0.44001469016075134,
564
+ "learning_rate": 2.921390778533636e-05,
565
+ "loss": 0.1148,
566
+ "step": 7700
567
+ },
568
+ {
569
+ "epoch": 2.526315789473684,
570
+ "grad_norm": 9.29806900024414,
571
+ "learning_rate": 2.8943958535795273e-05,
572
+ "loss": 0.0979,
573
+ "step": 7800
574
+ },
575
+ {
576
+ "epoch": 2.5587044534412957,
577
+ "grad_norm": 0.31297165155410767,
578
+ "learning_rate": 2.8674009286254183e-05,
579
+ "loss": 0.131,
580
+ "step": 7900
581
+ },
582
+ {
583
+ "epoch": 2.591093117408907,
584
+ "grad_norm": 0.13621266186237335,
585
+ "learning_rate": 2.8404060036713097e-05,
586
+ "loss": 0.1314,
587
+ "step": 8000
588
+ },
589
+ {
590
+ "epoch": 2.623481781376518,
591
+ "grad_norm": 14.039895057678223,
592
+ "learning_rate": 2.8134110787172014e-05,
593
+ "loss": 0.1137,
594
+ "step": 8100
595
+ },
596
+ {
597
+ "epoch": 2.6558704453441297,
598
+ "grad_norm": 0.19245870411396027,
599
+ "learning_rate": 2.7864161537630927e-05,
600
+ "loss": 0.1169,
601
+ "step": 8200
602
+ },
603
+ {
604
+ "epoch": 2.688259109311741,
605
+ "grad_norm": 37.211849212646484,
606
+ "learning_rate": 2.7594212288089838e-05,
607
+ "loss": 0.1052,
608
+ "step": 8300
609
+ },
610
+ {
611
+ "epoch": 2.720647773279352,
612
+ "grad_norm": 4.244668483734131,
613
+ "learning_rate": 2.732426303854875e-05,
614
+ "loss": 0.1126,
615
+ "step": 8400
616
+ },
617
+ {
618
+ "epoch": 2.753036437246964,
619
+ "grad_norm": 0.1630948930978775,
620
+ "learning_rate": 2.705431378900767e-05,
621
+ "loss": 0.1371,
622
+ "step": 8500
623
+ },
624
+ {
625
+ "epoch": 2.785425101214575,
626
+ "grad_norm": 9.61884593963623,
627
+ "learning_rate": 2.6784364539466582e-05,
628
+ "loss": 0.1309,
629
+ "step": 8600
630
+ },
631
+ {
632
+ "epoch": 2.817813765182186,
633
+ "grad_norm": 12.167716026306152,
634
+ "learning_rate": 2.6514415289925492e-05,
635
+ "loss": 0.1219,
636
+ "step": 8700
637
+ },
638
+ {
639
+ "epoch": 2.850202429149798,
640
+ "grad_norm": 4.939400672912598,
641
+ "learning_rate": 2.624446604038441e-05,
642
+ "loss": 0.1097,
643
+ "step": 8800
644
+ },
645
+ {
646
+ "epoch": 2.882591093117409,
647
+ "grad_norm": 10.696518898010254,
648
+ "learning_rate": 2.5974516790843323e-05,
649
+ "loss": 0.1477,
650
+ "step": 8900
651
+ },
652
+ {
653
+ "epoch": 2.91497975708502,
654
+ "grad_norm": 16.492700576782227,
655
+ "learning_rate": 2.5704567541302237e-05,
656
+ "loss": 0.1158,
657
+ "step": 9000
658
+ },
659
+ {
660
+ "epoch": 2.9473684210526314,
661
+ "grad_norm": 1.9808906316757202,
662
+ "learning_rate": 2.5434618291761147e-05,
663
+ "loss": 0.1113,
664
+ "step": 9100
665
+ },
666
+ {
667
+ "epoch": 2.979757085020243,
668
+ "grad_norm": 0.19531062245368958,
669
+ "learning_rate": 2.5164669042220064e-05,
670
+ "loss": 0.1018,
671
+ "step": 9200
672
+ },
673
+ {
674
+ "epoch": 2.999838056680162,
675
+ "eval_accuracy": 0.9255533199195171,
676
+ "eval_loss": 0.3031398355960846,
677
+ "eval_runtime": 6.0146,
678
+ "eval_samples_per_second": 330.531,
679
+ "eval_steps_per_second": 41.399,
680
+ "step": 9262
681
+ },
682
+ {
683
+ "epoch": 3.0121457489878543,
684
+ "grad_norm": 0.03815652057528496,
685
+ "learning_rate": 2.4894719792678978e-05,
686
+ "loss": 0.1136,
687
+ "step": 9300
688
+ },
689
+ {
690
+ "epoch": 3.0445344129554655,
691
+ "grad_norm": 7.544350624084473,
692
+ "learning_rate": 2.462477054313789e-05,
693
+ "loss": 0.0829,
694
+ "step": 9400
695
+ },
696
+ {
697
+ "epoch": 3.076923076923077,
698
+ "grad_norm": 29.120431900024414,
699
+ "learning_rate": 2.4354821293596805e-05,
700
+ "loss": 0.0879,
701
+ "step": 9500
702
+ },
703
+ {
704
+ "epoch": 3.1093117408906883,
705
+ "grad_norm": 5.044871807098389,
706
+ "learning_rate": 2.408487204405572e-05,
707
+ "loss": 0.0895,
708
+ "step": 9600
709
+ },
710
+ {
711
+ "epoch": 3.1417004048582995,
712
+ "grad_norm": 0.42247331142425537,
713
+ "learning_rate": 2.3814922794514633e-05,
714
+ "loss": 0.1025,
715
+ "step": 9700
716
+ },
717
+ {
718
+ "epoch": 3.174089068825911,
719
+ "grad_norm": 19.73353385925293,
720
+ "learning_rate": 2.3544973544973546e-05,
721
+ "loss": 0.0755,
722
+ "step": 9800
723
+ },
724
+ {
725
+ "epoch": 3.2064777327935223,
726
+ "grad_norm": 8.545650482177734,
727
+ "learning_rate": 2.327502429543246e-05,
728
+ "loss": 0.0968,
729
+ "step": 9900
730
+ },
731
+ {
732
+ "epoch": 3.2388663967611335,
733
+ "grad_norm": 0.08128660172224045,
734
+ "learning_rate": 2.3005075045891374e-05,
735
+ "loss": 0.0906,
736
+ "step": 10000
737
+ },
738
+ {
739
+ "epoch": 3.2712550607287447,
740
+ "grad_norm": 0.023657312616705894,
741
+ "learning_rate": 2.2735125796350287e-05,
742
+ "loss": 0.0759,
743
+ "step": 10100
744
+ },
745
+ {
746
+ "epoch": 3.3036437246963564,
747
+ "grad_norm": 0.2730010151863098,
748
+ "learning_rate": 2.24651765468092e-05,
749
+ "loss": 0.0846,
750
+ "step": 10200
751
+ },
752
+ {
753
+ "epoch": 3.3360323886639676,
754
+ "grad_norm": 0.05432628467679024,
755
+ "learning_rate": 2.2195227297268115e-05,
756
+ "loss": 0.1003,
757
+ "step": 10300
758
+ },
759
+ {
760
+ "epoch": 3.3684210526315788,
761
+ "grad_norm": 0.24408192932605743,
762
+ "learning_rate": 2.192527804772703e-05,
763
+ "loss": 0.0864,
764
+ "step": 10400
765
+ },
766
+ {
767
+ "epoch": 3.4008097165991904,
768
+ "grad_norm": 0.038083989173173904,
769
+ "learning_rate": 2.1655328798185942e-05,
770
+ "loss": 0.0567,
771
+ "step": 10500
772
+ },
773
+ {
774
+ "epoch": 3.4331983805668016,
775
+ "grad_norm": 0.1453145295381546,
776
+ "learning_rate": 2.1385379548644856e-05,
777
+ "loss": 0.0673,
778
+ "step": 10600
779
+ },
780
+ {
781
+ "epoch": 3.465587044534413,
782
+ "grad_norm": 8.611372947692871,
783
+ "learning_rate": 2.111543029910377e-05,
784
+ "loss": 0.0892,
785
+ "step": 10700
786
+ },
787
+ {
788
+ "epoch": 3.4979757085020244,
789
+ "grad_norm": 0.1930588334798813,
790
+ "learning_rate": 2.0845481049562683e-05,
791
+ "loss": 0.0914,
792
+ "step": 10800
793
+ },
794
+ {
795
+ "epoch": 3.5303643724696356,
796
+ "grad_norm": 0.07848736643791199,
797
+ "learning_rate": 2.0575531800021597e-05,
798
+ "loss": 0.0705,
799
+ "step": 10900
800
+ },
801
+ {
802
+ "epoch": 3.562753036437247,
803
+ "grad_norm": 1.9848402738571167,
804
+ "learning_rate": 2.030558255048051e-05,
805
+ "loss": 0.0991,
806
+ "step": 11000
807
+ },
808
+ {
809
+ "epoch": 3.5951417004048585,
810
+ "grad_norm": 7.717613220214844,
811
+ "learning_rate": 2.0035633300939424e-05,
812
+ "loss": 0.091,
813
+ "step": 11100
814
+ },
815
+ {
816
+ "epoch": 3.6275303643724697,
817
+ "grad_norm": 0.13098488748073578,
818
+ "learning_rate": 1.9765684051398338e-05,
819
+ "loss": 0.0819,
820
+ "step": 11200
821
+ },
822
+ {
823
+ "epoch": 3.659919028340081,
824
+ "grad_norm": 18.124767303466797,
825
+ "learning_rate": 1.949573480185725e-05,
826
+ "loss": 0.0778,
827
+ "step": 11300
828
+ },
829
+ {
830
+ "epoch": 3.6923076923076925,
831
+ "grad_norm": 31.650789260864258,
832
+ "learning_rate": 1.9225785552316165e-05,
833
+ "loss": 0.099,
834
+ "step": 11400
835
+ },
836
+ {
837
+ "epoch": 3.7246963562753037,
838
+ "grad_norm": 2.110788106918335,
839
+ "learning_rate": 1.895583630277508e-05,
840
+ "loss": 0.0847,
841
+ "step": 11500
842
+ },
843
+ {
844
+ "epoch": 3.757085020242915,
845
+ "grad_norm": 6.184129238128662,
846
+ "learning_rate": 1.8685887053233992e-05,
847
+ "loss": 0.1054,
848
+ "step": 11600
849
+ },
850
+ {
851
+ "epoch": 3.7894736842105265,
852
+ "grad_norm": 0.05220253765583038,
853
+ "learning_rate": 1.8415937803692906e-05,
854
+ "loss": 0.0987,
855
+ "step": 11700
856
+ },
857
+ {
858
+ "epoch": 3.8218623481781377,
859
+ "grad_norm": 4.092259883880615,
860
+ "learning_rate": 1.8145988554151823e-05,
861
+ "loss": 0.0791,
862
+ "step": 11800
863
+ },
864
+ {
865
+ "epoch": 3.854251012145749,
866
+ "grad_norm": 5.052979946136475,
867
+ "learning_rate": 1.7876039304610734e-05,
868
+ "loss": 0.0998,
869
+ "step": 11900
870
+ },
871
+ {
872
+ "epoch": 3.8866396761133606,
873
+ "grad_norm": 0.6650911569595337,
874
+ "learning_rate": 1.760609005506965e-05,
875
+ "loss": 0.1091,
876
+ "step": 12000
877
+ },
878
+ {
879
+ "epoch": 3.919028340080972,
880
+ "grad_norm": 0.04196294769644737,
881
+ "learning_rate": 1.733614080552856e-05,
882
+ "loss": 0.0733,
883
+ "step": 12100
884
+ },
885
+ {
886
+ "epoch": 3.951417004048583,
887
+ "grad_norm": 8.808391571044922,
888
+ "learning_rate": 1.7066191555987478e-05,
889
+ "loss": 0.1036,
890
+ "step": 12200
891
+ },
892
+ {
893
+ "epoch": 3.983805668016194,
894
+ "grad_norm": 0.19184300303459167,
895
+ "learning_rate": 1.6796242306446388e-05,
896
+ "loss": 0.1012,
897
+ "step": 12300
898
+ },
899
+ {
900
+ "epoch": 4.0,
901
+ "eval_accuracy": 0.9265593561368209,
902
+ "eval_loss": 0.3330951929092407,
903
+ "eval_runtime": 6.0161,
904
+ "eval_samples_per_second": 330.449,
905
+ "eval_steps_per_second": 41.389,
906
+ "step": 12350
907
+ },
908
+ {
909
+ "epoch": 4.016194331983805,
910
+ "grad_norm": 0.1378416270017624,
911
+ "learning_rate": 1.6526293056905305e-05,
912
+ "loss": 0.0653,
913
+ "step": 12400
914
+ },
915
+ {
916
+ "epoch": 4.048582995951417,
917
+ "grad_norm": 45.89274978637695,
918
+ "learning_rate": 1.6256343807364216e-05,
919
+ "loss": 0.0768,
920
+ "step": 12500
921
+ },
922
+ {
923
+ "epoch": 4.080971659919029,
924
+ "grad_norm": 5.21702241897583,
925
+ "learning_rate": 1.5986394557823133e-05,
926
+ "loss": 0.0446,
927
+ "step": 12600
928
+ },
929
+ {
930
+ "epoch": 4.113360323886639,
931
+ "grad_norm": 5.125068664550781,
932
+ "learning_rate": 1.5716445308282043e-05,
933
+ "loss": 0.0407,
934
+ "step": 12700
935
+ },
936
+ {
937
+ "epoch": 4.145748987854251,
938
+ "grad_norm": 0.0502559095621109,
939
+ "learning_rate": 1.544649605874096e-05,
940
+ "loss": 0.0613,
941
+ "step": 12800
942
+ },
943
+ {
944
+ "epoch": 4.178137651821863,
945
+ "grad_norm": 15.276534080505371,
946
+ "learning_rate": 1.517654680919987e-05,
947
+ "loss": 0.0852,
948
+ "step": 12900
949
+ },
950
+ {
951
+ "epoch": 4.2105263157894735,
952
+ "grad_norm": 10.263306617736816,
953
+ "learning_rate": 1.4906597559658786e-05,
954
+ "loss": 0.0664,
955
+ "step": 13000
956
+ },
957
+ {
958
+ "epoch": 4.242914979757085,
959
+ "grad_norm": 0.17894484102725983,
960
+ "learning_rate": 1.4636648310117698e-05,
961
+ "loss": 0.0402,
962
+ "step": 13100
963
+ },
964
+ {
965
+ "epoch": 4.275303643724697,
966
+ "grad_norm": 0.08714679628610611,
967
+ "learning_rate": 1.4366699060576613e-05,
968
+ "loss": 0.0647,
969
+ "step": 13200
970
+ },
971
+ {
972
+ "epoch": 4.3076923076923075,
973
+ "grad_norm": 0.9923548102378845,
974
+ "learning_rate": 1.4096749811035525e-05,
975
+ "loss": 0.0529,
976
+ "step": 13300
977
+ },
978
+ {
979
+ "epoch": 4.340080971659919,
980
+ "grad_norm": 0.03006882779300213,
981
+ "learning_rate": 1.382680056149444e-05,
982
+ "loss": 0.0776,
983
+ "step": 13400
984
+ },
985
+ {
986
+ "epoch": 4.372469635627531,
987
+ "grad_norm": 3.831218719482422,
988
+ "learning_rate": 1.3556851311953352e-05,
989
+ "loss": 0.0555,
990
+ "step": 13500
991
+ },
992
+ {
993
+ "epoch": 4.4048582995951415,
994
+ "grad_norm": 0.18301153182983398,
995
+ "learning_rate": 1.3286902062412268e-05,
996
+ "loss": 0.076,
997
+ "step": 13600
998
+ },
999
+ {
1000
+ "epoch": 4.437246963562753,
1001
+ "grad_norm": 27.794418334960938,
1002
+ "learning_rate": 1.301695281287118e-05,
1003
+ "loss": 0.0585,
1004
+ "step": 13700
1005
+ },
1006
+ {
1007
+ "epoch": 4.469635627530364,
1008
+ "grad_norm": 0.12287744879722595,
1009
+ "learning_rate": 1.2747003563330095e-05,
1010
+ "loss": 0.0469,
1011
+ "step": 13800
1012
+ },
1013
+ {
1014
+ "epoch": 4.502024291497976,
1015
+ "grad_norm": 0.07235557585954666,
1016
+ "learning_rate": 1.2477054313789007e-05,
1017
+ "loss": 0.051,
1018
+ "step": 13900
1019
+ },
1020
+ {
1021
+ "epoch": 4.534412955465587,
1022
+ "grad_norm": 0.013693173415958881,
1023
+ "learning_rate": 1.220710506424792e-05,
1024
+ "loss": 0.051,
1025
+ "step": 14000
1026
+ },
1027
+ {
1028
+ "epoch": 4.566801619433198,
1029
+ "grad_norm": 3.417243242263794,
1030
+ "learning_rate": 1.1937155814706834e-05,
1031
+ "loss": 0.0501,
1032
+ "step": 14100
1033
+ },
1034
+ {
1035
+ "epoch": 4.59919028340081,
1036
+ "grad_norm": 0.20653334259986877,
1037
+ "learning_rate": 1.1667206565165748e-05,
1038
+ "loss": 0.0655,
1039
+ "step": 14200
1040
+ },
1041
+ {
1042
+ "epoch": 4.631578947368421,
1043
+ "grad_norm": 0.392309308052063,
1044
+ "learning_rate": 1.1397257315624662e-05,
1045
+ "loss": 0.042,
1046
+ "step": 14300
1047
+ },
1048
+ {
1049
+ "epoch": 4.663967611336032,
1050
+ "grad_norm": 0.3196917474269867,
1051
+ "learning_rate": 1.1127308066083575e-05,
1052
+ "loss": 0.0778,
1053
+ "step": 14400
1054
+ },
1055
+ {
1056
+ "epoch": 4.696356275303644,
1057
+ "grad_norm": 0.024839555844664574,
1058
+ "learning_rate": 1.0857358816542489e-05,
1059
+ "loss": 0.0504,
1060
+ "step": 14500
1061
+ },
1062
+ {
1063
+ "epoch": 4.728744939271255,
1064
+ "grad_norm": 0.03586389869451523,
1065
+ "learning_rate": 1.0587409567001405e-05,
1066
+ "loss": 0.0595,
1067
+ "step": 14600
1068
+ },
1069
+ {
1070
+ "epoch": 4.761133603238866,
1071
+ "grad_norm": 28.398067474365234,
1072
+ "learning_rate": 1.0317460317460318e-05,
1073
+ "loss": 0.0611,
1074
+ "step": 14700
1075
+ },
1076
+ {
1077
+ "epoch": 4.793522267206478,
1078
+ "grad_norm": 0.17493298649787903,
1079
+ "learning_rate": 1.0047511067919232e-05,
1080
+ "loss": 0.0633,
1081
+ "step": 14800
1082
+ },
1083
+ {
1084
+ "epoch": 4.825910931174089,
1085
+ "grad_norm": 11.968632698059082,
1086
+ "learning_rate": 9.777561818378146e-06,
1087
+ "loss": 0.0733,
1088
+ "step": 14900
1089
+ },
1090
+ {
1091
+ "epoch": 4.8582995951417,
1092
+ "grad_norm": 0.1989884227514267,
1093
+ "learning_rate": 9.50761256883706e-06,
1094
+ "loss": 0.0579,
1095
+ "step": 15000
1096
+ },
1097
+ {
1098
+ "epoch": 4.890688259109312,
1099
+ "grad_norm": 0.01483625266700983,
1100
+ "learning_rate": 9.237663319295973e-06,
1101
+ "loss": 0.0526,
1102
+ "step": 15100
1103
+ },
1104
+ {
1105
+ "epoch": 4.923076923076923,
1106
+ "grad_norm": 59.61151885986328,
1107
+ "learning_rate": 8.967714069754887e-06,
1108
+ "loss": 0.0514,
1109
+ "step": 15200
1110
+ },
1111
+ {
1112
+ "epoch": 4.955465587044534,
1113
+ "grad_norm": 0.20129810273647308,
1114
+ "learning_rate": 8.6977648202138e-06,
1115
+ "loss": 0.057,
1116
+ "step": 15300
1117
+ },
1118
+ {
1119
+ "epoch": 4.987854251012146,
1120
+ "grad_norm": 0.1253397911787033,
1121
+ "learning_rate": 8.427815570672714e-06,
1122
+ "loss": 0.0667,
1123
+ "step": 15400
1124
+ },
1125
+ {
1126
+ "epoch": 4.9998380566801615,
1127
+ "eval_accuracy": 0.9280684104627767,
1128
+ "eval_loss": 0.3633970022201538,
1129
+ "eval_runtime": 5.8931,
1130
+ "eval_samples_per_second": 337.342,
1131
+ "eval_steps_per_second": 42.253,
1132
+ "step": 15437
1133
+ },
1134
+ {
1135
+ "epoch": 5.020242914979757,
1136
+ "grad_norm": 0.3967650830745697,
1137
+ "learning_rate": 8.157866321131628e-06,
1138
+ "loss": 0.044,
1139
+ "step": 15500
1140
+ },
1141
+ {
1142
+ "epoch": 5.052631578947368,
1143
+ "grad_norm": 0.5851262807846069,
1144
+ "learning_rate": 7.887917071590541e-06,
1145
+ "loss": 0.0332,
1146
+ "step": 15600
1147
+ },
1148
+ {
1149
+ "epoch": 5.08502024291498,
1150
+ "grad_norm": 8.97497272491455,
1151
+ "learning_rate": 7.617967822049456e-06,
1152
+ "loss": 0.0329,
1153
+ "step": 15700
1154
+ },
1155
+ {
1156
+ "epoch": 5.117408906882591,
1157
+ "grad_norm": 3.0476999282836914,
1158
+ "learning_rate": 7.3480185725083695e-06,
1159
+ "loss": 0.0539,
1160
+ "step": 15800
1161
+ },
1162
+ {
1163
+ "epoch": 5.149797570850202,
1164
+ "grad_norm": 0.06740197539329529,
1165
+ "learning_rate": 7.078069322967283e-06,
1166
+ "loss": 0.0391,
1167
+ "step": 15900
1168
+ },
1169
+ {
1170
+ "epoch": 5.182186234817814,
1171
+ "grad_norm": 0.9729518890380859,
1172
+ "learning_rate": 6.808120073426197e-06,
1173
+ "loss": 0.04,
1174
+ "step": 16000
1175
+ },
1176
+ {
1177
+ "epoch": 5.2145748987854255,
1178
+ "grad_norm": 0.013517899438738823,
1179
+ "learning_rate": 6.5381708238851105e-06,
1180
+ "loss": 0.0549,
1181
+ "step": 16100
1182
+ },
1183
+ {
1184
+ "epoch": 5.246963562753036,
1185
+ "grad_norm": 0.11938950419425964,
1186
+ "learning_rate": 6.268221574344024e-06,
1187
+ "loss": 0.0303,
1188
+ "step": 16200
1189
+ },
1190
+ {
1191
+ "epoch": 5.279352226720648,
1192
+ "grad_norm": 0.1333564966917038,
1193
+ "learning_rate": 5.998272324802937e-06,
1194
+ "loss": 0.0516,
1195
+ "step": 16300
1196
+ },
1197
+ {
1198
+ "epoch": 5.3117408906882595,
1199
+ "grad_norm": 0.005177986808121204,
1200
+ "learning_rate": 5.728323075261851e-06,
1201
+ "loss": 0.0389,
1202
+ "step": 16400
1203
+ },
1204
+ {
1205
+ "epoch": 5.34412955465587,
1206
+ "grad_norm": 0.1303425431251526,
1207
+ "learning_rate": 5.458373825720764e-06,
1208
+ "loss": 0.0305,
1209
+ "step": 16500
1210
+ },
1211
+ {
1212
+ "epoch": 5.376518218623482,
1213
+ "grad_norm": 0.05090256780385971,
1214
+ "learning_rate": 5.188424576179678e-06,
1215
+ "loss": 0.0393,
1216
+ "step": 16600
1217
+ },
1218
+ {
1219
+ "epoch": 5.4089068825910935,
1220
+ "grad_norm": 0.02094370871782303,
1221
+ "learning_rate": 4.918475326638592e-06,
1222
+ "loss": 0.0469,
1223
+ "step": 16700
1224
+ },
1225
+ {
1226
+ "epoch": 5.441295546558704,
1227
+ "grad_norm": 0.15372493863105774,
1228
+ "learning_rate": 4.6485260770975054e-06,
1229
+ "loss": 0.0443,
1230
+ "step": 16800
1231
+ },
1232
+ {
1233
+ "epoch": 5.473684210526316,
1234
+ "grad_norm": 0.010713031515479088,
1235
+ "learning_rate": 4.378576827556419e-06,
1236
+ "loss": 0.0486,
1237
+ "step": 16900
1238
+ },
1239
+ {
1240
+ "epoch": 5.506072874493928,
1241
+ "grad_norm": 0.3018239438533783,
1242
+ "learning_rate": 4.108627578015333e-06,
1243
+ "loss": 0.032,
1244
+ "step": 17000
1245
+ },
1246
+ {
1247
+ "epoch": 5.538461538461538,
1248
+ "grad_norm": 7.911437034606934,
1249
+ "learning_rate": 3.8386783284742465e-06,
1250
+ "loss": 0.026,
1251
+ "step": 17100
1252
+ },
1253
+ {
1254
+ "epoch": 5.57085020242915,
1255
+ "grad_norm": 0.027054868638515472,
1256
+ "learning_rate": 3.5687290789331606e-06,
1257
+ "loss": 0.0319,
1258
+ "step": 17200
1259
+ },
1260
+ {
1261
+ "epoch": 5.603238866396762,
1262
+ "grad_norm": 11.172321319580078,
1263
+ "learning_rate": 3.2987798293920743e-06,
1264
+ "loss": 0.0382,
1265
+ "step": 17300
1266
+ },
1267
+ {
1268
+ "epoch": 5.635627530364372,
1269
+ "grad_norm": 0.013983375392854214,
1270
+ "learning_rate": 3.0288305798509884e-06,
1271
+ "loss": 0.0326,
1272
+ "step": 17400
1273
+ },
1274
+ {
1275
+ "epoch": 5.668016194331984,
1276
+ "grad_norm": 0.19562393426895142,
1277
+ "learning_rate": 2.758881330309902e-06,
1278
+ "loss": 0.0309,
1279
+ "step": 17500
1280
+ },
1281
+ {
1282
+ "epoch": 5.700404858299595,
1283
+ "grad_norm": 0.030886543914675713,
1284
+ "learning_rate": 2.4889320807688157e-06,
1285
+ "loss": 0.0292,
1286
+ "step": 17600
1287
+ },
1288
+ {
1289
+ "epoch": 5.732793522267206,
1290
+ "grad_norm": 0.24884428083896637,
1291
+ "learning_rate": 2.2189828312277294e-06,
1292
+ "loss": 0.0508,
1293
+ "step": 17700
1294
+ },
1295
+ {
1296
+ "epoch": 5.765182186234818,
1297
+ "grad_norm": 0.27042293548583984,
1298
+ "learning_rate": 1.949033581686643e-06,
1299
+ "loss": 0.0355,
1300
+ "step": 17800
1301
+ },
1302
+ {
1303
+ "epoch": 5.797570850202429,
1304
+ "grad_norm": 0.02953782118856907,
1305
+ "learning_rate": 1.6790843321455568e-06,
1306
+ "loss": 0.0238,
1307
+ "step": 17900
1308
+ },
1309
+ {
1310
+ "epoch": 5.82995951417004,
1311
+ "grad_norm": 0.1051710918545723,
1312
+ "learning_rate": 1.4091350826044704e-06,
1313
+ "loss": 0.0495,
1314
+ "step": 18000
1315
+ },
1316
+ {
1317
+ "epoch": 5.862348178137652,
1318
+ "grad_norm": 0.07639595121145248,
1319
+ "learning_rate": 1.1391858330633841e-06,
1320
+ "loss": 0.0364,
1321
+ "step": 18100
1322
+ },
1323
+ {
1324
+ "epoch": 5.894736842105263,
1325
+ "grad_norm": 0.010212107561528683,
1326
+ "learning_rate": 8.692365835222979e-07,
1327
+ "loss": 0.0515,
1328
+ "step": 18200
1329
+ },
1330
+ {
1331
+ "epoch": 5.9271255060728745,
1332
+ "grad_norm": 3.152235984802246,
1333
+ "learning_rate": 5.992873339812116e-07,
1334
+ "loss": 0.0294,
1335
+ "step": 18300
1336
+ },
1337
+ {
1338
+ "epoch": 5.959514170040486,
1339
+ "grad_norm": 8.367449760437012,
1340
+ "learning_rate": 3.293380844401253e-07,
1341
+ "loss": 0.0503,
1342
+ "step": 18400
1343
+ },
1344
+ {
1345
+ "epoch": 5.991902834008097,
1346
+ "grad_norm": 0.03374595567584038,
1347
+ "learning_rate": 5.938883489903899e-08,
1348
+ "loss": 0.0289,
1349
+ "step": 18500
1350
+ },
1351
+ {
1352
+ "epoch": 5.999028340080971,
1353
+ "eval_accuracy": 0.9351106639839034,
1354
+ "eval_loss": 0.43116068840026855,
1355
+ "eval_runtime": 5.8835,
1356
+ "eval_samples_per_second": 337.896,
1357
+ "eval_steps_per_second": 42.322,
1358
+ "step": 18522
1359
+ },
1360
+ {
1361
+ "epoch": 5.999028340080971,
1362
+ "step": 18522,
1363
+ "total_flos": 1.3373133118742268e+16,
1364
+ "train_loss": 0.11955805376273435,
1365
+ "train_runtime": 3073.2453,
1366
+ "train_samples_per_second": 96.443,
1367
+ "train_steps_per_second": 6.027
1368
+ }
1369
+ ],
1370
+ "logging_steps": 100,
1371
+ "max_steps": 18522,
1372
+ "num_input_tokens_seen": 0,
1373
+ "num_train_epochs": 6,
1374
+ "save_steps": 500,
1375
+ "stateful_callbacks": {
1376
+ "TrainerControl": {
1377
+ "args": {
1378
+ "should_epoch_stop": false,
1379
+ "should_evaluate": false,
1380
+ "should_log": false,
1381
+ "should_save": true,
1382
+ "should_training_stop": true
1383
+ },
1384
+ "attributes": {}
1385
+ }
1386
+ },
1387
+ "total_flos": 1.3373133118742268e+16,
1388
+ "train_batch_size": 8,
1389
+ "trial_name": null,
1390
+ "trial_params": null
1391
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cc310a7102e14d6f438fd6965da5fa79eaf78ebd7f5d4491aa5a5821a6385c1a
3
+ size 5560
vocab.txt ADDED
The diff for this file is too large to render.