ce-lery committed on
Commit 025d0db
1 Parent(s): cb4b638

feat: fine-tuning japanese-mistral-300M-base with databricks-dolly-15k-ja

README.md ADDED
---
base_model: ce-lery/japanese-mistral-300m-base
tags:
- generated_from_trainer
model-index:
- name: checkpoints-finetuning
  results: []
---

# japanese-mistral-300m-instruction

## Overview

Welcome to my model card!

This model's features are:

- Suppresses unknown-word generation by using byte fallback in the SentencePiece tokenizer and converting it to the Hugging Face Tokenizers format
- Pretrained on the Wikipedia and CC-100 datasets
- Uses the [Mistral 300M](https://huggingface.co/ce-lery/japanese-mistral-300m-base/blob/main/config.json) architecture
- Fine-tuned from [ce-lery/japanese-mistral-300m-base](https://huggingface.co/ce-lery/japanese-mistral-300m-base) with [kunishou/databricks-dolly-15k-ja](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja)

Yukkuri shite ittene! (Take it easy!)

## How to use the model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "ce-lery/japanese-mistral-300m-instruction"
torch.set_float32_matmul_precision('high')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True).to(device)

MAX_ASSISTANT_LENGTH = 100
MAX_INPUT_LENGTH = 128
INPUT_PROMPT = r'<s>\n以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n入力:\n{input}\n[SEP]\n応答:\n'
NO_INPUT_PROMPT = r'<s>\n以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n[SEP]\n指示:\n{instruction}\n[SEP]\n応答:\n'

def prepare_input(instruction, input_text):
    """Build the instruction-tuning prompt, with or without an input field."""
    if input_text != "":
        prompt = INPUT_PROMPT.format(instruction=instruction, input=input_text)
    else:
        prompt = NO_INPUT_PROMPT.format(instruction=instruction)
    return prompt

def format_output(output):
    """Strip special tokens and unescape the literal '\\n' used in the prompt."""
    output = output.lstrip("<s>").rstrip("</s>").replace("[SEP]", "").replace("\\n", "\n")
    return output

def generate_response(instruction, input_text):
    prompt = prepare_input(instruction, input_text)
    token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    n = len(token_ids[0])  # prompt length in tokens

    with torch.no_grad():
        output_ids = model.generate(
            token_ids.to(model.device),
            min_length=n,
            max_length=min(MAX_INPUT_LENGTH, n + MAX_ASSISTANT_LENGTH),
            top_p=0.95,
            top_k=50,
            temperature=0.4,
            do_sample=True,
            no_repeat_ngram_size=2,
            num_beams=3,
            pad_token_id=tokenizer.pad_token_id,
            bos_token_id=tokenizer.bos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            bad_words_ids=[[tokenizer.unk_token_id]]
        )

    output = tokenizer.decode(output_ids.tolist()[0])
    formatted_output_all = format_output(output)
    response = f"Assistant:{formatted_output_all.split('応答:')[-1].strip()}"

    return formatted_output_all, response

instruction = "あなたは何でも正確に答えられるAIです。"
questions = [
    "日本で一番高い山は?",
    "日本で一番広い湖は?",
    "世界で一番高い山は?",
    "世界で一番広い湖は?",
    "冗談を言ってください。",
]

# Generate and print a response for each question
for question in questions:
    formatted_output_all, response = generate_response(instruction, question)
    print(response)
```

## Recipe

If you want to rebuild this model, you can refer to [this GitHub repository](https://github.com/ce-lery/japanese-mistral-300m-recipe).

I wrote the recipe for constructing this model there. It covers, for example (a sketch of the tokenizer step follows this list):

- Preprocessing with SentencePiece
- Pretraining with FlashAttention 2, torch.compile, and DeepSpeed
- Fine-tuning with databricks-dolly-15k-ja

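As a minimal sketch of that preprocessing step, the snippet below trains a SentencePiece model with byte fallback and wraps it in the `T5Tokenizer` class recorded in `tokenizer_config.json`. `corpus.txt` and `tokenizer-out` are placeholders; the real script is in the recipe repository above.

```python
import sentencepiece as spm
from transformers import T5Tokenizer

# Train a SentencePiece model with byte fallback: characters missing from the
# vocabulary decompose into byte pieces instead of becoming [UNK].
# "corpus.txt" is a placeholder for the wikipedia/cc100 pretraining text.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spiece",
    vocab_size=50257,                        # matches vocab_size in config.json
    byte_fallback=True,
    pad_id=0, unk_id=1, bos_id=2, eos_id=3,  # IDs as in tokenizer_config.json
    pad_piece="[PAD]", unk_piece="[UNK]",    # piece names as in special_tokens_map.json
)

# Wrap spiece.model in the tokenizer class this repository ships.
tokenizer = T5Tokenizer(
    vocab_file="spiece.model",
    bos_token="<s>", eos_token="</s>",
    unk_token="[UNK]", pad_token="[PAD]",
    extra_ids=0,
)
tokenizer.save_pretrained("tokenizer-out")  # placeholder output directory
```
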
If you find any mistakes, errors, and so on, please create an issue.
If you create a pull request, I will be very happy!

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a sketch reconstructing them follows the list):
- learning_rate: 5e-06
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 64
- total_train_batch_size: 256
- optimizer: Adam with betas=(0.9,0.95) and epsilon=0.0001
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 1000
- num_epochs: 200
- mixed_precision_training: Native AMP

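A minimal sketch of the fine-tuning setup under these hyperparameters, assuming `datasets` and `TrainingArguments`; the DeepSpeed/multi-GPU launcher wiring and the prompt-formatting collator are omitted, and `output_dir` follows the model-index name:

```python
from datasets import load_dataset
from transformers import TrainingArguments

# Fine-tuning data: instruction/input/output records from the dataset above.
dataset = load_dataset("kunishou/databricks-dolly-15k-ja")

args = TrainingArguments(
    output_dir="checkpoints-finetuning",
    learning_rate=5e-6,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=64,  # 4 x 64 (x data-parallel ranks) = 256 reported total
    num_train_epochs=200,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-4,
    seed=42,
    fp16=True,                       # "Native AMP" mixed precision
    evaluation_strategy="steps",
    eval_steps=40,                   # matches the cadence of the results table below
    save_steps=40,
    logging_steps=40,
)
```
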
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 3.595 | 3.51 | 40 | 3.5299 |
| 3.4769 | 7.02 | 80 | 3.3722 |
| 3.3037 | 10.53 | 120 | 3.1871 |
| 3.1255 | 14.05 | 160 | 3.0088 |
| 2.9615 | 17.56 | 200 | 2.8684 |
| 2.8468 | 21.07 | 240 | 2.7808 |
| 2.7699 | 24.58 | 280 | 2.7205 |
| 2.7139 | 28.09 | 320 | 2.6793 |
| 2.6712 | 31.6 | 360 | 2.6509 |
| 2.6356 | 35.12 | 400 | 2.6294 |
| 2.6048 | 38.63 | 440 | 2.6120 |
| 2.5823 | 42.14 | 480 | 2.5974 |
| 2.5536 | 45.65 | 520 | 2.5849 |
| 2.5293 | 49.16 | 560 | 2.5740 |
| 2.5058 | 52.67 | 600 | 2.5644 |
| 2.482 | 56.19 | 640 | 2.5556 |
| 2.4575 | 59.7 | 680 | 2.5477 |
| 2.4339 | 63.21 | 720 | 2.5405 |
| 2.4073 | 66.72 | 760 | 2.5350 |
| 2.3845 | 70.23 | 800 | 2.5303 |
| 2.3606 | 73.74 | 840 | 2.5253 |
| 2.329 | 77.26 | 880 | 2.5215 |
| 2.3071 | 80.77 | 920 | 2.5185 |
| 2.2768 | 84.28 | 960 | 2.5155 |
| 2.2479 | 87.79 | 1000 | 2.5144 |
| 2.2181 | 91.3 | 1040 | 2.5151 |
| 2.1901 | 94.81 | 1080 | 2.5139 |
| 2.1571 | 98.33 | 1120 | 2.5148 |
| 2.1308 | 101.84 | 1160 | 2.5166 |
| 2.1032 | 105.35 | 1200 | 2.5193 |
| 2.0761 | 108.86 | 1240 | 2.5204 |
| 2.0495 | 112.37 | 1280 | 2.5269 |
| 2.0231 | 115.88 | 1320 | 2.5285 |
| 2.0021 | 119.4 | 1360 | 2.5328 |
| 1.9793 | 122.91 | 1400 | 2.5383 |
| 1.9575 | 126.42 | 1440 | 2.5442 |
| 1.9368 | 129.93 | 1480 | 2.5488 |
| 1.9216 | 133.44 | 1520 | 2.5534 |
| 1.902 | 136.95 | 1560 | 2.5584 |
| 1.8885 | 140.47 | 1600 | 2.5609 |
| 1.8728 | 143.98 | 1640 | 2.5657 |
| 1.8605 | 147.49 | 1680 | 2.5697 |
| 1.8476 | 151.0 | 1720 | 2.5741 |
| 1.8402 | 154.51 | 1760 | 2.5770 |
| 1.8274 | 158.02 | 1800 | 2.5803 |
| 1.8218 | 161.54 | 1840 | 2.5829 |
| 1.8144 | 165.05 | 1880 | 2.5847 |
| 1.8097 | 168.56 | 1920 | 2.5867 |
| 1.8076 | 172.07 | 1960 | 2.5883 |
| 1.8014 | 175.58 | 2000 | 2.5892 |
| 1.8001 | 179.09 | 2040 | 2.5899 |
| 1.7987 | 182.61 | 2080 | 2.5903 |
| 1.7971 | 186.12 | 2120 | 2.5906 |
| 1.7979 | 189.63 | 2160 | 2.5907 |
| 1.7975 | 193.14 | 2200 | 2.5907 |


### Framework versions

- Transformers 4.35.2
- Pytorch 2.1.1+cu121
- Datasets 2.14.5
- Tokenizers 0.14.1

all_results.json ADDED
{
    "epoch": 193.14,
    "eval_loss": 2.513946056365967,
    "eval_runtime": 4.6377,
    "eval_samples": 313,
    "eval_samples_per_second": 67.49,
    "eval_steps_per_second": 17.034,
    "perplexity": 12.353581936499637,
    "train_loss": 2.2616969472711737,
    "train_runtime": 20093.6832,
    "train_samples": 2916,
    "train_samples_per_second": 29.024,
    "train_steps_per_second": 0.109
}

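The reported perplexity is simply the exponential of the evaluation loss:

```python
import math

# Perplexity is exp(eval_loss) for causal language modeling evaluation.
print(math.exp(2.513946056365967))  # 12.353581936499637, as reported above
```
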
config.json ADDED
{
    "_name_or_path": "../../pretrain/train/checkpoints-mistral-300M-FA2",
    "architectures": [
        "MistralForCausalLM"
    ],
    "bos_token_id": 0,
    "eos_token_id": 0,
    "hidden_act": "silu",
    "hidden_size": 1024,
    "initializer_range": 0.02,
    "intermediate_size": 2400,
    "max_position_embeddings": 4096,
    "model_type": "mistral",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-05,
    "rope_theta": 10000.0,
    "sliding_window": 1024,
    "tie_word_embeddings": false,
    "torch_dtype": "float32",
    "transformers_version": "4.35.2",
    "use_cache": true,
    "vocab_size": 50257
}

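For reference, the architecture above can be rebuilt with `MistralConfig`; a sketch that instantiates it with random weights to count parameters:

```python
from transformers import MistralConfig, MistralForCausalLM

# Reconstruct the config above (values copied from config.json).
config = MistralConfig(
    vocab_size=50257,
    hidden_size=1024,
    intermediate_size=2400,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=8,
    max_position_embeddings=4096,
    sliding_window=1024,
    rms_norm_eps=1e-05,
    rope_theta=10000.0,
    tie_word_embeddings=False,
    bos_token_id=0,
    eos_token_id=0,
)
model = MistralForCausalLM(config)

# ~355M parameters; at float32 this matches the ~1.42 GB model.safetensors.
print(sum(p.numel() for p in model.parameters()))
```
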
eval_results.json ADDED
{
    "epoch": 193.14,
    "eval_loss": 2.513946056365967,
    "eval_runtime": 4.6377,
    "eval_samples": 313,
    "eval_samples_per_second": 67.49,
    "eval_steps_per_second": 17.034,
    "perplexity": 12.353581936499637
}

generation_config.json ADDED
{
    "_from_model_config": true,
    "bos_token_id": 0,
    "eos_token_id": 0,
    "transformers_version": "4.35.2"
}

logs/events.out.tfevents.1702213518.90c313ded1af.749823.0 ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0ab5e18d681e45befea0f89ce8bf0b3c905f0204bdceb0a194a8d5b35e106d37
size 28067

logs/events.out.tfevents.1702233619.90c313ded1af.749823.1 ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:a7c98962865b1824fe059a075d55138896ddd9ae996d36db1edbab0308142f09
size 359

model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:e3413550b9d3e9e0b5ea6a87537fe728760798a2ee96e83f75ec1d60b3aaa256
size 1421709600

special_tokens_map.json ADDED
{
    "bos_token": {
        "content": "<s>",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    },
    "eos_token": {
        "content": "</s>",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    },
    "pad_token": {
        "content": "[PAD]",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    },
    "unk_token": {
        "content": "[UNK]",
        "lstrip": false,
        "normalized": false,
        "rstrip": false,
        "single_word": false
    }
}

spiece.model ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:604cb0c2f073ba13f04739ced6f8310f4f00ab344feea6cb5c4012af3876c684
size 1249735

tokenizer.json ADDED
The diff for this file is too large to render.

tokenizer_config.json ADDED
{
    "added_tokens_decoder": {
        "0": {
            "content": "[PAD]",
            "lstrip": false,
            "normalized": false,
            "rstrip": false,
            "single_word": false,
            "special": true
        },
        "1": {
            "content": "[UNK]",
            "lstrip": false,
            "normalized": false,
            "rstrip": false,
            "single_word": false,
            "special": true
        },
        "2": {
            "content": "<s>",
            "lstrip": false,
            "normalized": false,
            "rstrip": false,
            "single_word": false,
            "special": true
        },
        "3": {
            "content": "</s>",
            "lstrip": false,
            "normalized": false,
            "rstrip": false,
            "single_word": false,
            "special": true
        }
    },
    "additional_special_tokens": [],
    "bos_token": "<s>",
    "clean_up_tokenization_spaces": true,
    "eos_token": "</s>",
    "extra_ids": 0,
    "legacy": true,
    "model_max_length": 50000,
    "pad_token": "[PAD]",
    "sp_model_kwargs": {},
    "tokenizer_class": "T5Tokenizer",
    "unk_token": "[UNK]"
}

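A small sketch of the byte-fallback behavior this tokenizer configuration is meant to give (assuming the checkpoint loads as in the usage example above; the exact pieces printed depend on the learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ce-lery/japanese-mistral-300m-instruction", use_fast=False
)

# In-vocabulary text tokenizes into normal pieces; a character the vocabulary
# lacks should fall back to byte pieces (e.g. "<0xF0>"-style tokens), not [UNK].
print(tokenizer.tokenize("日本で一番高い山は?"))
print(tokenizer.tokenize("🦙"))
```
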
train_results.json ADDED
{
    "epoch": 193.14,
    "train_loss": 2.2616969472711737,
    "train_runtime": 20093.6832,
    "train_samples": 2916,
    "train_samples_per_second": 29.024,
    "train_steps_per_second": 0.109
}

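These figures are internally consistent: train_runtime times train_samples_per_second recovers roughly train_samples times the 200 training epochs.

```python
# Throughput consistency: runtime x samples/sec ~= samples x epochs.
print(20093.6832 * 29.024)  # ~583199
print(2916 * 200)           # 583200
```
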
trainer_state.json ADDED
{
    "best_metric": 2.513946056365967,
    "best_model_checkpoint": "checkpoints-finetuning/checkpoint-1080",
    "epoch": 193.14128943758573,
    "eval_steps": 40,
    "global_step": 2200,
    "is_hyper_param_search": false,
    "is_local_process_zero": true,
    "is_world_process_zero": true,
    "log_history": [
        {"epoch": 3.51, "learning_rate": 2.0000000000000002e-07, "loss": 3.595, "step": 40},
        {"epoch": 3.51, "eval_loss": 3.5299072265625, "eval_runtime": 5.0148, "eval_samples_per_second": 62.416, "eval_steps_per_second": 15.753, "step": 40},
        {"epoch": 7.02, "learning_rate": 4.0000000000000003e-07, "loss": 3.4769, "step": 80},
        {"epoch": 7.02, "eval_loss": 3.3721721172332764, "eval_runtime": 4.4435, "eval_samples_per_second": 70.441, "eval_steps_per_second": 17.779, "step": 80},
        {"epoch": 10.53, "learning_rate": 6.000000000000001e-07, "loss": 3.3037, "step": 120},
        {"epoch": 10.53, "eval_loss": 3.1870808601379395, "eval_runtime": 4.6407, "eval_samples_per_second": 67.446, "eval_steps_per_second": 17.023, "step": 120},
        {"epoch": 14.05, "learning_rate": 8.000000000000001e-07, "loss": 3.1255, "step": 160},
        {"epoch": 14.05, "eval_loss": 3.0087945461273193, "eval_runtime": 4.7026, "eval_samples_per_second": 66.559, "eval_steps_per_second": 16.799, "step": 160},
        {"epoch": 17.56, "learning_rate": 1.0000000000000002e-06, "loss": 2.9615, "step": 200},
        {"epoch": 17.56, "eval_loss": 2.8684051036834717, "eval_runtime": 4.6401, "eval_samples_per_second": 67.455, "eval_steps_per_second": 17.026, "step": 200},
        {"epoch": 21.07, "learning_rate": 1.2000000000000002e-06, "loss": 2.8468, "step": 240},
        {"epoch": 21.07, "eval_loss": 2.780834436416626, "eval_runtime": 4.4221, "eval_samples_per_second": 70.78, "eval_steps_per_second": 17.865, "step": 240},
        {"epoch": 24.58, "learning_rate": 1.4000000000000001e-06, "loss": 2.7699, "step": 280},
        {"epoch": 24.58, "eval_loss": 2.720453977584839, "eval_runtime": 4.5663, "eval_samples_per_second": 68.546, "eval_steps_per_second": 17.301, "step": 280},
        {"epoch": 28.09, "learning_rate": 1.6000000000000001e-06, "loss": 2.7139, "step": 320},
        {"epoch": 28.09, "eval_loss": 2.679349422454834, "eval_runtime": 4.7784, "eval_samples_per_second": 65.504, "eval_steps_per_second": 16.533, "step": 320},
        {"epoch": 31.6, "learning_rate": 1.8000000000000001e-06, "loss": 2.6712, "step": 360},
        {"epoch": 31.6, "eval_loss": 2.650853395462036, "eval_runtime": 4.6713, "eval_samples_per_second": 67.005, "eval_steps_per_second": 16.912, "step": 360},
        {"epoch": 35.12, "learning_rate": 2.0000000000000003e-06, "loss": 2.6356, "step": 400},
        {"epoch": 35.12, "eval_loss": 2.6293749809265137, "eval_runtime": 4.6364, "eval_samples_per_second": 67.51, "eval_steps_per_second": 17.039, "step": 400},
        {"epoch": 38.63, "learning_rate": 2.2e-06, "loss": 2.6048, "step": 440},
        {"epoch": 38.63, "eval_loss": 2.611950635910034, "eval_runtime": 4.5116, "eval_samples_per_second": 69.377, "eval_steps_per_second": 17.511, "step": 440},
        {"epoch": 42.14, "learning_rate": 2.4000000000000003e-06, "loss": 2.5823, "step": 480},
        {"epoch": 42.14, "eval_loss": 2.597449541091919, "eval_runtime": 4.5213, "eval_samples_per_second": 69.228, "eval_steps_per_second": 17.473, "step": 480},
        {"epoch": 45.65, "learning_rate": 2.6e-06, "loss": 2.5536, "step": 520},
        {"epoch": 45.65, "eval_loss": 2.5848779678344727, "eval_runtime": 4.4398, "eval_samples_per_second": 70.498, "eval_steps_per_second": 17.793, "step": 520},
        {"epoch": 49.16, "learning_rate": 2.8000000000000003e-06, "loss": 2.5293, "step": 560},
        {"epoch": 49.16, "eval_loss": 2.574049472808838, "eval_runtime": 4.6572, "eval_samples_per_second": 67.208, "eval_steps_per_second": 16.963, "step": 560},
        {"epoch": 52.67, "learning_rate": 3e-06, "loss": 2.5058, "step": 600},
        {"epoch": 52.67, "eval_loss": 2.5643808841705322, "eval_runtime": 4.6849, "eval_samples_per_second": 66.81, "eval_steps_per_second": 16.863, "step": 600},
        {"epoch": 56.19, "learning_rate": 3.2000000000000003e-06, "loss": 2.482, "step": 640},
        {"epoch": 56.19, "eval_loss": 2.555607557296753, "eval_runtime": 4.491, "eval_samples_per_second": 69.695, "eval_steps_per_second": 17.591, "step": 640},
        {"epoch": 59.7, "learning_rate": 3.4000000000000005e-06, "loss": 2.4575, "step": 680},
        {"epoch": 59.7, "eval_loss": 2.547734260559082, "eval_runtime": 4.6182, "eval_samples_per_second": 67.776, "eval_steps_per_second": 17.106, "step": 680},
        {"epoch": 63.21, "learning_rate": 3.6000000000000003e-06, "loss": 2.4339, "step": 720},
        {"epoch": 63.21, "eval_loss": 2.5405359268188477, "eval_runtime": 4.5137, "eval_samples_per_second": 69.345, "eval_steps_per_second": 17.502, "step": 720},
        {"epoch": 66.72, "learning_rate": 3.8000000000000005e-06, "loss": 2.4073, "step": 760},
        {"epoch": 66.72, "eval_loss": 2.5350451469421387, "eval_runtime": 4.6034, "eval_samples_per_second": 67.993, "eval_steps_per_second": 17.161, "step": 760},
        {"epoch": 70.23, "learning_rate": 4.000000000000001e-06, "loss": 2.3845, "step": 800},
        {"epoch": 70.23, "eval_loss": 2.530299186706543, "eval_runtime": 4.6325, "eval_samples_per_second": 67.566, "eval_steps_per_second": 17.053, "step": 800},
        {"epoch": 73.74, "learning_rate": 4.2000000000000004e-06, "loss": 2.3606, "step": 840},
        {"epoch": 73.74, "eval_loss": 2.525312662124634, "eval_runtime": 4.4668, "eval_samples_per_second": 70.072, "eval_steps_per_second": 17.686, "step": 840},
        {"epoch": 77.26, "learning_rate": 4.4e-06, "loss": 2.329, "step": 880},
        {"epoch": 77.26, "eval_loss": 2.5215225219726562, "eval_runtime": 4.4699, "eval_samples_per_second": 70.023, "eval_steps_per_second": 17.674, "step": 880},
        {"epoch": 80.77, "learning_rate": 4.600000000000001e-06, "loss": 2.3071, "step": 920},
        {"epoch": 80.77, "eval_loss": 2.5184576511383057, "eval_runtime": 4.3807, "eval_samples_per_second": 71.45, "eval_steps_per_second": 18.034, "step": 920},
        {"epoch": 84.28, "learning_rate": 4.800000000000001e-06, "loss": 2.2768, "step": 960},
        {"epoch": 84.28, "eval_loss": 2.515460729598999, "eval_runtime": 4.6634, "eval_samples_per_second": 67.119, "eval_steps_per_second": 16.941, "step": 960},
        {"epoch": 87.79, "learning_rate": 5e-06, "loss": 2.2479, "step": 1000},
        {"epoch": 87.79, "eval_loss": 2.514392852783203, "eval_runtime": 4.5583, "eval_samples_per_second": 68.665, "eval_steps_per_second": 17.331, "step": 1000},
        {"epoch": 91.3, "learning_rate": 4.986304738420684e-06, "loss": 2.2181, "step": 1040},
        {"epoch": 91.3, "eval_loss": 2.515076160430908, "eval_runtime": 4.6324, "eval_samples_per_second": 67.568, "eval_steps_per_second": 17.054, "step": 1040},
        {"epoch": 94.81, "learning_rate": 4.9453690018345144e-06, "loss": 2.1901, "step": 1080},
        {"epoch": 94.81, "eval_loss": 2.513946056365967, "eval_runtime": 4.635, "eval_samples_per_second": 67.53, "eval_steps_per_second": 17.044, "step": 1080},
        {"epoch": 98.33, "learning_rate": 4.8776412907378845e-06, "loss": 2.1571, "step": 1120},
        {"epoch": 98.33, "eval_loss": 2.514775037765503, "eval_runtime": 4.7132, "eval_samples_per_second": 66.41, "eval_steps_per_second": 16.762, "step": 1120},
        {"epoch": 101.84, "learning_rate": 4.783863644106502e-06, "loss": 2.1308, "step": 1160},
        {"epoch": 101.84, "eval_loss": 2.5165762901306152, "eval_runtime": 4.6347, "eval_samples_per_second": 67.535, "eval_steps_per_second": 17.046, "step": 1160},
        {"epoch": 105.35, "learning_rate": 4.665063509461098e-06, "loss": 2.1032, "step": 1200},
        {"epoch": 105.35, "eval_loss": 2.5192971229553223, "eval_runtime": 4.6292, "eval_samples_per_second": 67.614, "eval_steps_per_second": 17.066, "step": 1200},
        {"epoch": 108.86, "learning_rate": 4.522542485937369e-06, "loss": 2.0761, "step": 1240},
        {"epoch": 108.86, "eval_loss": 2.5203866958618164, "eval_runtime": 4.6638, "eval_samples_per_second": 67.113, "eval_steps_per_second": 16.939, "step": 1240},
        {"epoch": 112.37, "learning_rate": 4.357862063693486e-06, "loss": 2.0495, "step": 1280},
        {"epoch": 112.37, "eval_loss": 2.5268709659576416, "eval_runtime": 4.6504, "eval_samples_per_second": 67.306, "eval_steps_per_second": 16.988, "step": 1280},
        {"epoch": 115.88, "learning_rate": 4.172826515897146e-06, "loss": 2.0231, "step": 1320},
        {"epoch": 115.88, "eval_loss": 2.5284526348114014, "eval_runtime": 4.6029, "eval_samples_per_second": 68.0, "eval_steps_per_second": 17.163, "step": 1320},
        {"epoch": 119.4, "learning_rate": 3.969463130731183e-06, "loss": 2.0021, "step": 1360},
        {"epoch": 119.4, "eval_loss": 2.5327632427215576, "eval_runtime": 4.7118, "eval_samples_per_second": 66.429, "eval_steps_per_second": 16.767, "step": 1360},
        {"epoch": 122.91, "learning_rate": 3.7500000000000005e-06, "loss": 1.9793, "step": 1400},
        {"epoch": 122.91, "eval_loss": 2.5382816791534424, "eval_runtime": 4.6299, "eval_samples_per_second": 67.603, "eval_steps_per_second": 17.063, "step": 1400},
        {"epoch": 126.42, "learning_rate": 3.516841607689501e-06, "loss": 1.9575, "step": 1440},
        {"epoch": 126.42, "eval_loss": 2.5441536903381348, "eval_runtime": 4.6442, "eval_samples_per_second": 67.396, "eval_steps_per_second": 17.01, "step": 1440},
        {"epoch": 129.93, "learning_rate": 3.272542485937369e-06, "loss": 1.9368, "step": 1480},
        {"epoch": 129.93, "eval_loss": 2.5487852096557617, "eval_runtime": 4.6396, "eval_samples_per_second": 67.462, "eval_steps_per_second": 17.027, "step": 1480},
        {"epoch": 133.44, "learning_rate": 3.019779227044398e-06, "loss": 1.9216, "step": 1520},
        {"epoch": 133.44, "eval_loss": 2.5533745288848877, "eval_runtime": 4.6038, "eval_samples_per_second": 67.987, "eval_steps_per_second": 17.16, "step": 1520},
        {"epoch": 136.95, "learning_rate": 2.761321158169134e-06, "loss": 1.902, "step": 1560},
        {"epoch": 136.95, "eval_loss": 2.558429479598999, "eval_runtime": 4.605, "eval_samples_per_second": 67.969, "eval_steps_per_second": 17.155, "step": 1560},
        {"epoch": 140.47, "learning_rate": 2.5e-06, "loss": 1.8885, "step": 1600},
        {"epoch": 140.47, "eval_loss": 2.560931444168091, "eval_runtime": 4.6137, "eval_samples_per_second": 67.842, "eval_steps_per_second": 17.123, "step": 1600},
        {"epoch": 143.98, "learning_rate": 2.238678841830867e-06, "loss": 1.8728, "step": 1640},
        {"epoch": 143.98, "eval_loss": 2.565746307373047, "eval_runtime": 4.6085, "eval_samples_per_second": 67.918, "eval_steps_per_second": 17.142, "step": 1640},
        {"epoch": 147.49, "learning_rate": 1.9802207729556023e-06, "loss": 1.8605, "step": 1680},
        {"epoch": 147.49, "eval_loss": 2.569748640060425, "eval_runtime": 4.6652, "eval_samples_per_second": 67.092, "eval_steps_per_second": 16.934, "step": 1680},
        {"epoch": 151.0, "learning_rate": 1.7274575140626318e-06, "loss": 1.8476, "step": 1720},
        {"epoch": 151.0, "eval_loss": 2.5741446018218994, "eval_runtime": 4.7429, "eval_samples_per_second": 65.994, "eval_steps_per_second": 16.657, "step": 1720},
        {"epoch": 154.51, "learning_rate": 1.4831583923105e-06, "loss": 1.8402, "step": 1760},
        {"epoch": 154.51, "eval_loss": 2.5770394802093506, "eval_runtime": 4.6184, "eval_samples_per_second": 67.772, "eval_steps_per_second": 17.105, "step": 1760},
        {"epoch": 158.02, "learning_rate": 1.2500000000000007e-06, "loss": 1.8274, "step": 1800},
        {"epoch": 158.02, "eval_loss": 2.580260992050171, "eval_runtime": 4.5687, "eval_samples_per_second": 68.509, "eval_steps_per_second": 17.291, "step": 1800},
        {"epoch": 161.54, "learning_rate": 1.0305368692688175e-06, "loss": 1.8218, "step": 1840},
        {"epoch": 161.54, "eval_loss": 2.582859992980957, "eval_runtime": 4.6266, "eval_samples_per_second": 67.653, "eval_steps_per_second": 17.075, "step": 1840},
        {"epoch": 165.05, "learning_rate": 8.271734841028553e-07, "loss": 1.8144, "step": 1880},
        {"epoch": 165.05, "eval_loss": 2.5846669673919678, "eval_runtime": 4.601, "eval_samples_per_second": 68.029, "eval_steps_per_second": 17.17, "step": 1880},
        {"epoch": 168.56, "learning_rate": 6.421379363065142e-07, "loss": 1.8097, "step": 1920},
        {"epoch": 168.56, "eval_loss": 2.5867464542388916, "eval_runtime": 4.593, "eval_samples_per_second": 68.148, "eval_steps_per_second": 17.2, "step": 1920},
        {"epoch": 172.07, "learning_rate": 4.774575140626317e-07, "loss": 1.8076, "step": 1960},
        {"epoch": 172.07, "eval_loss": 2.5882575511932373, "eval_runtime": 4.601, "eval_samples_per_second": 68.028, "eval_steps_per_second": 17.17, "step": 1960},
        {"epoch": 175.58, "learning_rate": 3.3493649053890325e-07, "loss": 1.8014, "step": 2000},
        {"epoch": 175.58, "eval_loss": 2.589245080947876, "eval_runtime": 4.5976, "eval_samples_per_second": 68.079, "eval_steps_per_second": 17.183, "step": 2000},
        {"epoch": 179.09, "learning_rate": 2.1613635589349756e-07, "loss": 1.8001, "step": 2040},
        {"epoch": 179.09, "eval_loss": 2.589866876602173, "eval_runtime": 4.5824, "eval_samples_per_second": 68.305, "eval_steps_per_second": 17.24, "step": 2040},
        {"epoch": 182.61, "learning_rate": 1.223587092621162e-07, "loss": 1.7987, "step": 2080},
        {"epoch": 182.61, "eval_loss": 2.5903093814849854, "eval_runtime": 4.6146, "eval_samples_per_second": 67.829, "eval_steps_per_second": 17.12, "step": 2080},
        {"epoch": 186.12, "learning_rate": 5.463099816548578e-08, "loss": 1.7971, "step": 2120},
        {"epoch": 186.12, "eval_loss": 2.590583562850952, "eval_runtime": 4.609, "eval_samples_per_second": 67.911, "eval_steps_per_second": 17.141, "step": 2120},
        {"epoch": 189.63, "learning_rate": 1.3695261579316776e-08, "loss": 1.7979, "step": 2160},
        {"epoch": 189.63, "eval_loss": 2.5907208919525146, "eval_runtime": 4.6125, "eval_samples_per_second": 67.859, "eval_steps_per_second": 17.127, "step": 2160},
        {"epoch": 193.14, "learning_rate": 0.0, "loss": 1.7975, "step": 2200},
        {"epoch": 193.14, "eval_loss": 2.590698719024658, "eval_runtime": 4.6213, "eval_samples_per_second": 67.729, "eval_steps_per_second": 17.095, "step": 2200},
        {"epoch": 193.14, "step": 2200, "total_flos": 1.0517861659312128e+18, "train_loss": 2.2616969472711737, "train_runtime": 20093.6832, "train_samples_per_second": 29.024, "train_steps_per_second": 0.109}
    ],
    "logging_steps": 40,
    "max_steps": 2200,
    "num_train_epochs": 200,
    "save_steps": 40,
    "total_flos": 1.0517861659312128e+18,
    "trial_name": null,
    "trial_params": null
}

training_args.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6f815d8b0257d2db0dc8b65f8b60c00f75c67865ca45e0bab814f141b01b092e
size 4536