colbyford committed
Commit 93ac42b · 1 Parent(s): af223f7

Add H100-trained Phi-4 model assets

README.md CHANGED
@@ -1,202 +1,38 @@
 ---
 base_model: microsoft/phi-4
 library_name: peft
 ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->



- ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->

-
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
 ### Framework versions

- - PEFT 0.15.2

 ---
 base_model: microsoft/phi-4
 library_name: peft
+ model_name: peleke-phi-4
+ tags:
+ - base_model:adapter:microsoft/phi-4
+ - lora
+ - sft
+ - transformers
+ - trl
+ licence: gpl-3
+ pipeline_tag: text-generation
 ---

+ # Model Card for peleke-phi-4

+ This model is a fine-tuned version of [microsoft/phi-4](https://huggingface.co/microsoft/phi-4).
+ It has been trained using [TRL](https://github.com/huggingface/trl).

+ ## Quick start

+ ```python
+ # Coming Soon...
+ ```

+ ## Training procedure


+ This model was trained with SFT.

 ### Framework versions

+ - PEFT 0.17.0
+ - TRL: 0.19.1
+ - Transformers: 4.54.0
+ - Pytorch: 2.7.1
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
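The card's Quick start block is still a placeholder ("Coming Soon..."), so the following is a minimal, hypothetical loading sketch rather than the card's official example. The adapter repo id `colbyford/peleke-phi-4` and the prompt are assumptions, and the embedding resize reflects the four extra tokens this commit adds to the tokenizer.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_id = "colbyford/peleke-phi-4"  # assumed repo id; adjust to wherever this adapter is published

# The tokenizer committed alongside the adapter already contains the added
# <epi>/</epi>/Antigen/Antibody tokens.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4", torch_dtype=torch.bfloat16, device_map="auto"
)
base.resize_token_embeddings(len(tokenizer))  # widen embeddings to cover the new token ids

model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

prompt = "..."  # the card does not yet document the expected prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```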
adapter_config.json CHANGED
@@ -20,15 +20,18 @@
   "megatron_core": "megatron.core",
   "modules_to_save": null,
   "peft_type": "LORA",
+  "qalora_group_size": 16,
   "r": 8,
   "rank_pattern": {},
   "revision": null,
   "target_modules": [
-    "o_proj",
-    "qkv_proj"
+    "qkv_proj",
+    "o_proj"
   ],
+  "target_parameters": null,
   "task_type": "CAUSAL_LM",
   "trainable_token_indices": null,
   "use_dora": false,
+  "use_qalora": false,
   "use_rslora": false
 }
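For readers reconstructing the adapter, the fields visible in this hunk map onto a peft `LoraConfig` roughly as sketched below; values not shown in the diff (for example `lora_alpha` and `lora_dropout`) are left at library defaults rather than guessed.

```python
from peft import LoraConfig

# Sketch of a LoraConfig matching only the fields visible in this hunk.
lora_config = LoraConfig(
    r=8,                                    # "r": 8
    target_modules=["qkv_proj", "o_proj"],  # phi-4 attention uses a fused qkv_proj plus o_proj
    task_type="CAUSAL_LM",                  # "task_type": "CAUSAL_LM"
    use_dora=False,                         # "use_dora": false
    use_rslora=False,                       # "use_rslora": false
)
```

The new `qalora_group_size`, `use_qalora`, and `target_parameters` keys appear to be config fields serialized by the newer PEFT version at their defaults; with `"use_qalora": false` they do not change the adapter's behavior.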
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:1d6851a5df88091ca237b847c4659b23f98d16951db397593e87ced71c766a38
- size 2084824272
+ oid sha256:20c0b64281bc70f776d92e118ff8db05bb2541587ffd1cb94d239632980b20f9
+ size 2084803792
optimizer.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:d488fe514b22a65589b90ffc1639497cc18bee6eac9bfb258ab143195cde3c2d
- size 59117259
rng_state.pth DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:5671e5814bc2b50e8ca672217a479b935e2e09221f419b16f33a4d73cf1ea4f2
- size 14645
scaler.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:cc302c1ea6f8760b538d34aa63eb10e75b8051a0525238af5b340d810a2f312c
- size 1383
scheduler.pt DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:76d499ff7f9713e6ce83515f64a51745b01dce3e0c41d165acffd8052e740f14
- size 1465
special_tokens_map.json CHANGED
@@ -1,4 +1,20 @@
 {
+  "additional_special_tokens": [
+    {
+      "content": "<epi>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</epi>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
   "bos_token": {
     "content": "<|endoftext|>",
     "lstrip": true,
tokenizer.json CHANGED
@@ -1,6 +1,11 @@
 {
   "version": "1.0",
-  "truncation": null,
+  "truncation": {
+    "direction": "Right",
+    "max_length": 800,
+    "strategy": "LongestFirst",
+    "stride": 0
+  },
   "padding": null,
   "added_tokens": [
     {
@@ -866,6 +871,42 @@
       "rstrip": true,
       "normalized": false,
       "special": true
+    },
+    {
+      "id": 100352,
+      "content": "<epi>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "id": 100353,
+      "content": "</epi>",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": false,
+      "special": true
+    },
+    {
+      "id": 100354,
+      "content": "Antigen",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": true,
+      "special": false
+    },
+    {
+      "id": 100355,
+      "content": "Antibody",
+      "single_word": false,
+      "lstrip": false,
+      "rstrip": false,
+      "normalized": true,
+      "special": false
     }
   ],
   "normalizer": null,
tokenizer_config.json CHANGED
@@ -768,8 +768,44 @@
       "rstrip": true,
       "single_word": false,
       "special": true
+    },
+    "100352": {
+      "content": "<epi>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100353": {
+      "content": "</epi>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100354": {
+      "content": "Antigen",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "100355": {
+      "content": "Antibody",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
     }
   },
+  "additional_special_tokens": [
+    "<epi>",
+    "</epi>"
+  ],
   "bos_token": "<|endoftext|>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
trainer_state.json DELETED
@@ -1,1177 +0,0 @@
1
- {
2
- "best_global_step": null,
3
- "best_metric": null,
4
- "best_model_checkpoint": null,
5
- "epoch": 3.0,
6
- "eval_steps": 500,
7
- "global_step": 3177,
8
- "is_hyper_param_search": false,
9
- "is_local_process_zero": true,
10
- "is_world_process_zero": true,
11
- "log_history": [
12
- {
13
- "epoch": 0.023607176581680833,
14
- "grad_norm": 0.8597469329833984,
15
- "learning_rate": 0.00017600000000000002,
16
- "loss": 4.9024,
17
- "mean_token_accuracy": 0.2909739096462727,
18
- "num_tokens": 86323.0,
19
- "step": 25
20
- },
21
- {
22
- "epoch": 0.047214353163361665,
23
- "grad_norm": 0.7611781358718872,
24
- "learning_rate": 0.0001986040609137056,
25
- "loss": 4.3579,
26
- "mean_token_accuracy": 0.3365874075889587,
27
- "num_tokens": 170088.0,
28
- "step": 50
29
- },
30
- {
31
- "epoch": 0.0708215297450425,
32
- "grad_norm": 0.8384687900543213,
33
- "learning_rate": 0.00019701776649746192,
34
- "loss": 4.2579,
35
- "mean_token_accuracy": 0.33202287018299104,
36
- "num_tokens": 248841.0,
37
- "step": 75
38
- },
39
- {
40
- "epoch": 0.09442870632672333,
41
- "grad_norm": 0.8371444940567017,
42
- "learning_rate": 0.00019543147208121828,
43
- "loss": 3.9475,
44
- "mean_token_accuracy": 0.3752408367395401,
45
- "num_tokens": 331838.0,
46
- "step": 100
47
- },
48
- {
49
- "epoch": 0.11803588290840415,
50
- "grad_norm": 1.3010390996932983,
51
- "learning_rate": 0.00019384517766497464,
52
- "loss": 3.6464,
53
- "mean_token_accuracy": 0.4282234400510788,
54
- "num_tokens": 416982.0,
55
- "step": 125
56
- },
57
- {
58
- "epoch": 0.141643059490085,
59
- "grad_norm": 1.102060079574585,
60
- "learning_rate": 0.000192258883248731,
61
- "loss": 3.6204,
62
- "mean_token_accuracy": 0.42927508473396303,
63
- "num_tokens": 501322.0,
64
- "step": 150
65
- },
66
- {
67
- "epoch": 0.1652502360717658,
68
- "grad_norm": 1.1159696578979492,
69
- "learning_rate": 0.00019067258883248732,
70
- "loss": 3.5222,
71
- "mean_token_accuracy": 0.4415357458591461,
72
- "num_tokens": 591622.0,
73
- "step": 175
74
- },
75
- {
76
- "epoch": 0.18885741265344666,
77
- "grad_norm": 3.7764205932617188,
78
- "learning_rate": 0.00018908629441624365,
79
- "loss": 3.6167,
80
- "mean_token_accuracy": 0.42590244352817536,
81
- "num_tokens": 681171.0,
82
- "step": 200
83
- },
84
- {
85
- "epoch": 0.21246458923512748,
86
- "grad_norm": 1.3314100503921509,
87
- "learning_rate": 0.0001875,
88
- "loss": 3.7327,
89
- "mean_token_accuracy": 0.4113136053085327,
90
- "num_tokens": 762751.0,
91
- "step": 225
92
- },
93
- {
94
- "epoch": 0.2360717658168083,
95
- "grad_norm": 1.455350637435913,
96
- "learning_rate": 0.00018591370558375636,
97
- "loss": 3.3504,
98
- "mean_token_accuracy": 0.47106861472129824,
99
- "num_tokens": 851051.0,
100
- "step": 250
101
- },
102
- {
103
- "epoch": 0.25967894239848915,
104
- "grad_norm": 1.2478972673416138,
105
- "learning_rate": 0.00018432741116751272,
106
- "loss": 3.4177,
107
- "mean_token_accuracy": 0.46144390881061553,
108
- "num_tokens": 937834.0,
109
- "step": 275
110
- },
111
- {
112
- "epoch": 0.28328611898017,
113
- "grad_norm": 1.6548963785171509,
114
- "learning_rate": 0.00018274111675126904,
115
- "loss": 3.4914,
116
- "mean_token_accuracy": 0.450729478597641,
117
- "num_tokens": 1017911.0,
118
- "step": 300
119
- },
120
- {
121
- "epoch": 0.3068932955618508,
122
- "grad_norm": 1.7462213039398193,
123
- "learning_rate": 0.0001811548223350254,
124
- "loss": 3.2948,
125
- "mean_token_accuracy": 0.4845814883708954,
126
- "num_tokens": 1103573.0,
127
- "step": 325
128
- },
129
- {
130
- "epoch": 0.3305004721435316,
131
- "grad_norm": 1.2442463636398315,
132
- "learning_rate": 0.00017956852791878173,
133
- "loss": 3.351,
134
- "mean_token_accuracy": 0.473716744184494,
135
- "num_tokens": 1190283.0,
136
- "step": 350
137
- },
138
- {
139
- "epoch": 0.35410764872521244,
140
- "grad_norm": 0.9460684657096863,
141
- "learning_rate": 0.00017798223350253808,
142
- "loss": 3.3257,
143
- "mean_token_accuracy": 0.48253663778305056,
144
- "num_tokens": 1275756.0,
145
- "step": 375
146
- },
147
- {
148
- "epoch": 0.3777148253068933,
149
- "grad_norm": 1.3959977626800537,
150
- "learning_rate": 0.0001763959390862944,
151
- "loss": 3.3016,
152
- "mean_token_accuracy": 0.48651802659034726,
153
- "num_tokens": 1359058.0,
154
- "step": 400
155
- },
156
- {
157
- "epoch": 0.40132200188857414,
158
- "grad_norm": 1.5676658153533936,
159
- "learning_rate": 0.00017480964467005077,
160
- "loss": 3.0587,
161
- "mean_token_accuracy": 0.5262654614448548,
162
- "num_tokens": 1447793.0,
163
- "step": 425
164
- },
165
- {
166
- "epoch": 0.42492917847025496,
167
- "grad_norm": 1.6668205261230469,
168
- "learning_rate": 0.00017322335025380712,
169
- "loss": 2.9721,
170
- "mean_token_accuracy": 0.542415668964386,
171
- "num_tokens": 1533910.0,
172
- "step": 450
173
- },
174
- {
175
- "epoch": 0.4485363550519358,
176
- "grad_norm": 1.1585041284561157,
177
- "learning_rate": 0.00017163705583756348,
178
- "loss": 2.9116,
179
- "mean_token_accuracy": 0.5552564060688019,
180
- "num_tokens": 1620704.0,
181
- "step": 475
182
- },
183
- {
184
- "epoch": 0.4721435316336166,
185
- "grad_norm": 1.7781224250793457,
186
- "learning_rate": 0.0001700507614213198,
187
- "loss": 2.9046,
188
- "mean_token_accuracy": 0.5538024806976318,
189
- "num_tokens": 1706296.0,
190
- "step": 500
191
- },
192
- {
193
- "epoch": 0.49575070821529743,
194
- "grad_norm": 2.213106632232666,
195
- "learning_rate": 0.00016846446700507614,
196
- "loss": 3.0646,
197
- "mean_token_accuracy": 0.5227190399169922,
198
- "num_tokens": 1791100.0,
199
- "step": 525
200
- },
201
- {
202
- "epoch": 0.5193578847969783,
203
- "grad_norm": 4.082686901092529,
204
- "learning_rate": 0.0001668781725888325,
205
- "loss": 3.081,
206
- "mean_token_accuracy": 0.523969988822937,
207
- "num_tokens": 1869553.0,
208
- "step": 550
209
- },
210
- {
211
- "epoch": 0.5429650613786591,
212
- "grad_norm": 1.7952462434768677,
213
- "learning_rate": 0.00016529187817258885,
214
- "loss": 3.2597,
215
- "mean_token_accuracy": 0.496154887676239,
216
- "num_tokens": 1949366.0,
217
- "step": 575
218
- },
219
- {
220
- "epoch": 0.56657223796034,
221
- "grad_norm": 1.9742423295974731,
222
- "learning_rate": 0.0001637055837563452,
223
- "loss": 3.0122,
224
- "mean_token_accuracy": 0.5359121215343475,
225
- "num_tokens": 2034686.0,
226
- "step": 600
227
- },
228
- {
229
- "epoch": 0.5901794145420207,
230
- "grad_norm": 1.3130273818969727,
231
- "learning_rate": 0.00016211928934010153,
232
- "loss": 3.0088,
233
- "mean_token_accuracy": 0.5349772965908051,
234
- "num_tokens": 2121997.0,
235
- "step": 625
236
- },
237
- {
238
- "epoch": 0.6137865911237016,
239
- "grad_norm": 1.7737120389938354,
240
- "learning_rate": 0.00016053299492385786,
241
- "loss": 2.8518,
242
- "mean_token_accuracy": 0.5664570724964142,
243
- "num_tokens": 2209709.0,
244
- "step": 650
245
- },
246
- {
247
- "epoch": 0.6373937677053825,
248
- "grad_norm": 1.397939682006836,
249
- "learning_rate": 0.00015894670050761421,
250
- "loss": 2.7886,
251
- "mean_token_accuracy": 0.5750785338878631,
252
- "num_tokens": 2295052.0,
253
- "step": 675
254
- },
255
- {
256
- "epoch": 0.6610009442870632,
257
- "grad_norm": 2.435749053955078,
258
- "learning_rate": 0.00015736040609137057,
259
- "loss": 3.0218,
260
- "mean_token_accuracy": 0.5341324102878571,
261
- "num_tokens": 2378262.0,
262
- "step": 700
263
- },
264
- {
265
- "epoch": 0.6846081208687441,
266
- "grad_norm": 1.3271034955978394,
267
- "learning_rate": 0.00015577411167512693,
268
- "loss": 2.9578,
269
- "mean_token_accuracy": 0.5410768449306488,
270
- "num_tokens": 2463204.0,
271
- "step": 725
272
- },
273
- {
274
- "epoch": 0.7082152974504249,
275
- "grad_norm": 1.8051323890686035,
276
- "learning_rate": 0.00015418781725888325,
277
- "loss": 2.9002,
278
- "mean_token_accuracy": 0.5550252544879913,
279
- "num_tokens": 2547027.0,
280
- "step": 750
281
- },
282
- {
283
- "epoch": 0.7318224740321058,
284
- "grad_norm": 0.9294273257255554,
285
- "learning_rate": 0.0001526015228426396,
286
- "loss": 2.7967,
287
- "mean_token_accuracy": 0.572112466096878,
288
- "num_tokens": 2635107.0,
289
- "step": 775
290
- },
291
- {
292
- "epoch": 0.7554296506137866,
293
- "grad_norm": 1.3318583965301514,
294
- "learning_rate": 0.00015101522842639594,
295
- "loss": 2.908,
296
- "mean_token_accuracy": 0.5564561116695405,
297
- "num_tokens": 2719167.0,
298
- "step": 800
299
- },
300
- {
301
- "epoch": 0.7790368271954674,
302
- "grad_norm": 1.1354563236236572,
303
- "learning_rate": 0.0001494289340101523,
304
- "loss": 2.8887,
305
- "mean_token_accuracy": 0.5594349646568298,
306
- "num_tokens": 2805060.0,
307
- "step": 825
308
- },
309
- {
310
- "epoch": 0.8026440037771483,
311
- "grad_norm": 3.2933058738708496,
312
- "learning_rate": 0.00014784263959390862,
313
- "loss": 2.8668,
314
- "mean_token_accuracy": 0.5591881954669953,
315
- "num_tokens": 2887082.0,
316
- "step": 850
317
- },
318
- {
319
- "epoch": 0.826251180358829,
320
- "grad_norm": 1.8824466466903687,
321
- "learning_rate": 0.00014625634517766498,
322
- "loss": 2.834,
323
- "mean_token_accuracy": 0.5663117682933807,
324
- "num_tokens": 2973661.0,
325
- "step": 875
326
- },
327
- {
328
- "epoch": 0.8498583569405099,
329
- "grad_norm": 1.2233247756958008,
330
- "learning_rate": 0.00014467005076142133,
331
- "loss": 2.7768,
332
- "mean_token_accuracy": 0.5772299838066101,
333
- "num_tokens": 3060750.0,
334
- "step": 900
335
- },
336
- {
337
- "epoch": 0.8734655335221907,
338
- "grad_norm": 1.7638064622879028,
339
- "learning_rate": 0.0001430837563451777,
340
- "loss": 2.678,
341
- "mean_token_accuracy": 0.5990130162239075,
342
- "num_tokens": 3143868.0,
343
- "step": 925
344
- },
345
- {
346
- "epoch": 0.8970727101038716,
347
- "grad_norm": 1.5392266511917114,
348
- "learning_rate": 0.00014149746192893402,
349
- "loss": 2.6573,
350
- "mean_token_accuracy": 0.5954342544078827,
351
- "num_tokens": 3231027.0,
352
- "step": 950
353
- },
354
- {
355
- "epoch": 0.9206798866855525,
356
- "grad_norm": 1.8433501720428467,
357
- "learning_rate": 0.00013991116751269035,
358
- "loss": 2.9661,
359
- "mean_token_accuracy": 0.5433392310142517,
360
- "num_tokens": 3314925.0,
361
- "step": 975
362
- },
363
- {
364
- "epoch": 0.9442870632672332,
365
- "grad_norm": 1.3953369855880737,
366
- "learning_rate": 0.0001383248730964467,
367
- "loss": 2.7778,
368
- "mean_token_accuracy": 0.5806617629528046,
369
- "num_tokens": 3398916.0,
370
- "step": 1000
371
- },
372
- {
373
- "epoch": 0.9678942398489141,
374
- "grad_norm": 2.4013938903808594,
375
- "learning_rate": 0.00013673857868020306,
376
- "loss": 2.6282,
377
- "mean_token_accuracy": 0.6033717966079712,
378
- "num_tokens": 3486241.0,
379
- "step": 1025
380
- },
381
- {
382
- "epoch": 0.9915014164305949,
383
- "grad_norm": 1.3695052862167358,
384
- "learning_rate": 0.0001351522842639594,
385
- "loss": 2.8629,
386
- "mean_token_accuracy": 0.5644518744945526,
387
- "num_tokens": 3569625.0,
388
- "step": 1050
389
- },
390
- {
391
- "epoch": 1.0151085930122756,
392
- "grad_norm": 1.5885721445083618,
393
- "learning_rate": 0.00013356598984771574,
394
- "loss": 2.7245,
395
- "mean_token_accuracy": 0.583740828037262,
396
- "num_tokens": 3655264.0,
397
- "step": 1075
398
- },
399
- {
400
- "epoch": 1.0387157695939566,
401
- "grad_norm": 1.957524299621582,
402
- "learning_rate": 0.00013197969543147207,
403
- "loss": 2.5976,
404
- "mean_token_accuracy": 0.6069298982620239,
405
- "num_tokens": 3740644.0,
406
- "step": 1100
407
- },
408
- {
409
- "epoch": 1.0623229461756374,
410
- "grad_norm": 2.9976248741149902,
411
- "learning_rate": 0.00013039340101522843,
412
- "loss": 2.7362,
413
- "mean_token_accuracy": 0.5855110836029053,
414
- "num_tokens": 3825368.0,
415
- "step": 1125
416
- },
417
- {
418
- "epoch": 1.0859301227573182,
419
- "grad_norm": 3.186262845993042,
420
- "learning_rate": 0.00012880710659898478,
421
- "loss": 2.8484,
422
- "mean_token_accuracy": 0.5662666463851929,
423
- "num_tokens": 3908678.0,
424
- "step": 1150
425
- },
426
- {
427
- "epoch": 1.1095372993389991,
428
- "grad_norm": 1.8036489486694336,
429
- "learning_rate": 0.00012722081218274114,
430
- "loss": 2.67,
431
- "mean_token_accuracy": 0.5938792788982391,
432
- "num_tokens": 3996393.0,
433
- "step": 1175
434
- },
435
- {
436
- "epoch": 1.13314447592068,
437
- "grad_norm": 2.489654779434204,
438
- "learning_rate": 0.00012563451776649747,
439
- "loss": 2.8123,
440
- "mean_token_accuracy": 0.5695419406890869,
441
- "num_tokens": 4080596.0,
442
- "step": 1200
443
- },
444
- {
445
- "epoch": 1.1567516525023607,
446
- "grad_norm": 3.2093520164489746,
447
- "learning_rate": 0.00012404822335025382,
448
- "loss": 2.517,
449
- "mean_token_accuracy": 0.6239076638221741,
450
- "num_tokens": 4164481.0,
451
- "step": 1225
452
- },
453
- {
454
- "epoch": 1.1803588290840414,
455
- "grad_norm": 3.1341941356658936,
456
- "learning_rate": 0.00012246192893401015,
457
- "loss": 2.5919,
458
- "mean_token_accuracy": 0.6061871576309205,
459
- "num_tokens": 4253278.0,
460
- "step": 1250
461
- },
462
- {
463
- "epoch": 1.2039660056657224,
464
- "grad_norm": 1.433424711227417,
465
- "learning_rate": 0.0001208756345177665,
466
- "loss": 2.4236,
467
- "mean_token_accuracy": 0.6387936723232269,
468
- "num_tokens": 4338131.0,
469
- "step": 1275
470
- },
471
- {
472
- "epoch": 1.2275731822474032,
473
- "grad_norm": 1.5914385318756104,
474
- "learning_rate": 0.00011928934010152283,
475
- "loss": 2.8934,
476
- "mean_token_accuracy": 0.5650608110427856,
477
- "num_tokens": 4418463.0,
478
- "step": 1300
479
- },
480
- {
481
- "epoch": 1.251180358829084,
482
- "grad_norm": 1.8357518911361694,
483
- "learning_rate": 0.00011770304568527919,
484
- "loss": 2.5408,
485
- "mean_token_accuracy": 0.6213865375518799,
486
- "num_tokens": 4504100.0,
487
- "step": 1325
488
- },
489
- {
490
- "epoch": 1.274787535410765,
491
- "grad_norm": 2.181213855743408,
492
- "learning_rate": 0.00011611675126903555,
493
- "loss": 2.4895,
494
- "mean_token_accuracy": 0.6242103600502014,
495
- "num_tokens": 4590236.0,
496
- "step": 1350
497
- },
498
- {
499
- "epoch": 1.2983947119924457,
500
- "grad_norm": 2.054617404937744,
501
- "learning_rate": 0.00011453045685279189,
502
- "loss": 2.5319,
503
- "mean_token_accuracy": 0.6264574217796326,
504
- "num_tokens": 4673342.0,
505
- "step": 1375
506
- },
507
- {
508
- "epoch": 1.3220018885741265,
509
- "grad_norm": 2.1715738773345947,
510
- "learning_rate": 0.00011294416243654824,
511
- "loss": 2.5042,
512
- "mean_token_accuracy": 0.6200724172592164,
513
- "num_tokens": 4760597.0,
514
- "step": 1400
515
- },
516
- {
517
- "epoch": 1.3456090651558075,
518
- "grad_norm": 3.023545265197754,
519
- "learning_rate": 0.00011135786802030457,
520
- "loss": 2.763,
521
- "mean_token_accuracy": 0.5873644530773163,
522
- "num_tokens": 4842611.0,
523
- "step": 1425
524
- },
525
- {
526
- "epoch": 1.3692162417374882,
527
- "grad_norm": 1.7163723707199097,
528
- "learning_rate": 0.00010977157360406091,
529
- "loss": 2.4856,
530
- "mean_token_accuracy": 0.6349835538864136,
531
- "num_tokens": 4924684.0,
532
- "step": 1450
533
- },
534
- {
535
- "epoch": 1.392823418319169,
536
- "grad_norm": 2.5177738666534424,
537
- "learning_rate": 0.00010818527918781727,
538
- "loss": 2.6193,
539
- "mean_token_accuracy": 0.6035743832588196,
540
- "num_tokens": 5011388.0,
541
- "step": 1475
542
- },
543
- {
544
- "epoch": 1.41643059490085,
545
- "grad_norm": 1.9504915475845337,
546
- "learning_rate": 0.00010659898477157362,
547
- "loss": 2.5341,
548
- "mean_token_accuracy": 0.6178851091861725,
549
- "num_tokens": 5104282.0,
550
- "step": 1500
551
- },
552
- {
553
- "epoch": 1.4400377714825308,
554
- "grad_norm": 1.2298864126205444,
555
- "learning_rate": 0.00010501269035532994,
556
- "loss": 2.4975,
557
- "mean_token_accuracy": 0.6323371386528015,
558
- "num_tokens": 5186482.0,
559
- "step": 1525
560
- },
561
- {
562
- "epoch": 1.4636449480642115,
563
- "grad_norm": 2.7627272605895996,
564
- "learning_rate": 0.0001034263959390863,
565
- "loss": 2.5773,
566
- "mean_token_accuracy": 0.6132258200645446,
567
- "num_tokens": 5267604.0,
568
- "step": 1550
569
- },
570
- {
571
- "epoch": 1.4872521246458923,
572
- "grad_norm": 2.1530709266662598,
573
- "learning_rate": 0.00010184010152284265,
574
- "loss": 2.7489,
575
- "mean_token_accuracy": 0.5849710512161255,
576
- "num_tokens": 5345250.0,
577
- "step": 1575
578
- },
579
- {
580
- "epoch": 1.510859301227573,
581
- "grad_norm": 3.3525829315185547,
582
- "learning_rate": 0.00010025380710659899,
583
- "loss": 2.3328,
584
- "mean_token_accuracy": 0.6535279083251954,
585
- "num_tokens": 5432188.0,
586
- "step": 1600
587
- },
588
- {
589
- "epoch": 1.534466477809254,
590
- "grad_norm": 3.1491880416870117,
591
- "learning_rate": 9.866751269035533e-05,
592
- "loss": 2.4587,
593
- "mean_token_accuracy": 0.6317604184150696,
594
- "num_tokens": 5518949.0,
595
- "step": 1625
596
- },
597
- {
598
- "epoch": 1.5580736543909348,
599
- "grad_norm": 3.790788412094116,
600
- "learning_rate": 9.708121827411169e-05,
601
- "loss": 2.5356,
602
- "mean_token_accuracy": 0.6312965428829194,
603
- "num_tokens": 5598790.0,
604
- "step": 1650
605
- },
606
- {
607
- "epoch": 1.5816808309726156,
608
- "grad_norm": 2.399170160293579,
609
- "learning_rate": 9.549492385786802e-05,
610
- "loss": 2.4952,
611
- "mean_token_accuracy": 0.6315456974506378,
612
- "num_tokens": 5685520.0,
613
- "step": 1675
614
- },
615
- {
616
- "epoch": 1.6052880075542966,
617
- "grad_norm": 1.783835530281067,
618
- "learning_rate": 9.390862944162437e-05,
619
- "loss": 2.5257,
620
- "mean_token_accuracy": 0.629066880941391,
621
- "num_tokens": 5764729.0,
622
- "step": 1700
623
- },
624
- {
625
- "epoch": 1.6288951841359773,
626
- "grad_norm": 2.1746981143951416,
627
- "learning_rate": 9.232233502538072e-05,
628
- "loss": 2.3915,
629
- "mean_token_accuracy": 0.6475781321525573,
630
- "num_tokens": 5847096.0,
631
- "step": 1725
632
- },
633
- {
634
- "epoch": 1.652502360717658,
635
- "grad_norm": 2.853606700897217,
636
- "learning_rate": 9.073604060913706e-05,
637
- "loss": 2.4254,
638
- "mean_token_accuracy": 0.6364664602279663,
639
- "num_tokens": 5929730.0,
640
- "step": 1750
641
- },
642
- {
643
- "epoch": 1.676109537299339,
644
- "grad_norm": 2.6709866523742676,
645
- "learning_rate": 8.91497461928934e-05,
646
- "loss": 2.2418,
647
- "mean_token_accuracy": 0.6732782328128815,
648
- "num_tokens": 6021265.0,
649
- "step": 1775
650
- },
651
- {
652
- "epoch": 1.6997167138810199,
653
- "grad_norm": 3.0063297748565674,
654
- "learning_rate": 8.756345177664976e-05,
655
- "loss": 2.441,
656
- "mean_token_accuracy": 0.6447321319580078,
657
- "num_tokens": 6102483.0,
658
- "step": 1800
659
- },
660
- {
661
- "epoch": 1.7233238904627006,
662
- "grad_norm": 2.609477996826172,
663
- "learning_rate": 8.597715736040608e-05,
664
- "loss": 2.3247,
665
- "mean_token_accuracy": 0.667446813583374,
666
- "num_tokens": 6186450.0,
667
- "step": 1825
668
- },
669
- {
670
- "epoch": 1.7469310670443816,
671
- "grad_norm": 1.4909838438034058,
672
- "learning_rate": 8.439086294416244e-05,
673
- "loss": 2.1653,
674
- "mean_token_accuracy": 0.6888451766967774,
675
- "num_tokens": 6272397.0,
676
- "step": 1850
677
- },
678
- {
679
- "epoch": 1.7705382436260622,
680
- "grad_norm": 2.26397442817688,
681
- "learning_rate": 8.28045685279188e-05,
682
- "loss": 2.4152,
683
- "mean_token_accuracy": 0.6488711929321289,
684
- "num_tokens": 6353819.0,
685
- "step": 1875
686
- },
687
- {
688
- "epoch": 1.7941454202077431,
689
- "grad_norm": 2.281494379043579,
690
- "learning_rate": 8.121827411167512e-05,
691
- "loss": 2.2089,
692
- "mean_token_accuracy": 0.6806192409992218,
693
- "num_tokens": 6439448.0,
694
- "step": 1900
695
- },
696
- {
697
- "epoch": 1.8177525967894241,
698
- "grad_norm": 2.4258992671966553,
699
- "learning_rate": 7.963197969543148e-05,
700
- "loss": 2.5648,
701
- "mean_token_accuracy": 0.6230374383926391,
702
- "num_tokens": 6523821.0,
703
- "step": 1925
704
- },
705
- {
706
- "epoch": 1.8413597733711047,
707
- "grad_norm": 2.492616653442383,
708
- "learning_rate": 7.804568527918782e-05,
709
- "loss": 2.4212,
710
- "mean_token_accuracy": 0.643797037601471,
711
- "num_tokens": 6609653.0,
712
- "step": 1950
713
- },
714
- {
715
- "epoch": 1.8649669499527857,
716
- "grad_norm": 2.589484930038452,
717
- "learning_rate": 7.645939086294416e-05,
718
- "loss": 2.1783,
719
- "mean_token_accuracy": 0.6852576458454132,
720
- "num_tokens": 6697310.0,
721
- "step": 1975
722
- },
723
- {
724
- "epoch": 1.8885741265344664,
725
- "grad_norm": 1.6886556148529053,
726
- "learning_rate": 7.48730964467005e-05,
727
- "loss": 2.2581,
728
- "mean_token_accuracy": 0.6677005457878112,
729
- "num_tokens": 6789851.0,
730
- "step": 2000
731
- },
732
- {
733
- "epoch": 1.9121813031161472,
734
- "grad_norm": 3.7452311515808105,
735
- "learning_rate": 7.328680203045686e-05,
736
- "loss": 2.3258,
737
- "mean_token_accuracy": 0.6635724091529847,
738
- "num_tokens": 6877195.0,
739
- "step": 2025
740
- },
741
- {
742
- "epoch": 1.9357884796978282,
743
- "grad_norm": 2.047663688659668,
744
- "learning_rate": 7.170050761421319e-05,
745
- "loss": 2.3441,
746
- "mean_token_accuracy": 0.6537975025177002,
747
- "num_tokens": 6963541.0,
748
- "step": 2050
749
- },
750
- {
751
- "epoch": 1.959395656279509,
752
- "grad_norm": 3.105921506881714,
753
- "learning_rate": 7.011421319796955e-05,
754
- "loss": 2.2945,
755
- "mean_token_accuracy": 0.6703400027751922,
756
- "num_tokens": 7049829.0,
757
- "step": 2075
758
- },
759
- {
760
- "epoch": 1.9830028328611897,
761
- "grad_norm": 2.07450532913208,
762
- "learning_rate": 6.852791878172589e-05,
763
- "loss": 2.4282,
764
- "mean_token_accuracy": 0.6384662842750549,
765
- "num_tokens": 7138596.0,
766
- "step": 2100
767
- },
768
- {
769
- "epoch": 2.0066100094428707,
770
- "grad_norm": 3.5083131790161133,
771
- "learning_rate": 6.694162436548223e-05,
772
- "loss": 2.43,
773
- "mean_token_accuracy": 0.640663161277771,
774
- "num_tokens": 7223056.0,
775
- "step": 2125
776
- },
777
- {
778
- "epoch": 2.0302171860245513,
779
- "grad_norm": 1.5466697216033936,
780
- "learning_rate": 6.541878172588833e-05,
781
- "loss": 2.4533,
782
- "mean_token_accuracy": 0.6328469634056091,
783
- "num_tokens": 7309998.0,
784
- "step": 2150
785
- },
786
- {
787
- "epoch": 2.0538243626062322,
788
- "grad_norm": 1.704559087753296,
789
- "learning_rate": 6.383248730964467e-05,
790
- "loss": 2.31,
791
- "mean_token_accuracy": 0.6707417845726014,
792
- "num_tokens": 7390437.0,
793
- "step": 2175
794
- },
795
- {
796
- "epoch": 2.0774315391879132,
797
- "grad_norm": 2.0818848609924316,
798
- "learning_rate": 6.224619289340103e-05,
799
- "loss": 2.4,
800
- "mean_token_accuracy": 0.6547241771221161,
801
- "num_tokens": 7472360.0,
802
- "step": 2200
803
- },
804
- {
805
- "epoch": 2.101038715769594,
806
- "grad_norm": 2.9945547580718994,
807
- "learning_rate": 6.065989847715736e-05,
808
- "loss": 2.2873,
809
- "mean_token_accuracy": 0.6657208549976349,
810
- "num_tokens": 7555482.0,
811
- "step": 2225
812
- },
813
- {
814
- "epoch": 2.1246458923512748,
815
- "grad_norm": 3.4089205265045166,
816
- "learning_rate": 5.907360406091371e-05,
817
- "loss": 2.2813,
818
- "mean_token_accuracy": 0.6657274806499481,
819
- "num_tokens": 7646453.0,
820
- "step": 2250
821
- },
822
- {
823
- "epoch": 2.1482530689329558,
824
- "grad_norm": 3.965263605117798,
825
- "learning_rate": 5.748730964467005e-05,
826
- "loss": 2.2457,
827
- "mean_token_accuracy": 0.6795996761322022,
828
- "num_tokens": 7729020.0,
829
- "step": 2275
830
- },
831
- {
832
- "epoch": 2.1718602455146363,
833
- "grad_norm": 3.042095899581909,
834
- "learning_rate": 5.59010152284264e-05,
835
- "loss": 2.2448,
836
- "mean_token_accuracy": 0.6785978388786316,
837
- "num_tokens": 7813751.0,
838
- "step": 2300
839
- },
840
- {
841
- "epoch": 2.1954674220963173,
842
- "grad_norm": 2.43571400642395,
843
- "learning_rate": 5.431472081218274e-05,
844
- "loss": 2.3119,
845
- "mean_token_accuracy": 0.6689464914798736,
846
- "num_tokens": 7895366.0,
847
- "step": 2325
848
- },
849
- {
850
- "epoch": 2.2190745986779983,
851
- "grad_norm": 1.749118685722351,
852
- "learning_rate": 5.272842639593909e-05,
853
- "loss": 2.4028,
854
- "mean_token_accuracy": 0.6496596956253051,
855
- "num_tokens": 7978864.0,
856
- "step": 2350
857
- },
858
- {
859
- "epoch": 2.242681775259679,
860
- "grad_norm": 4.0993547439575195,
861
- "learning_rate": 5.114213197969543e-05,
862
- "loss": 2.0971,
863
- "mean_token_accuracy": 0.702848870754242,
864
- "num_tokens": 8066059.0,
865
- "step": 2375
866
- },
867
- {
868
- "epoch": 2.26628895184136,
869
- "grad_norm": 2.478060007095337,
870
- "learning_rate": 4.955583756345178e-05,
871
- "loss": 2.3131,
872
- "mean_token_accuracy": 0.6597225499153138,
873
- "num_tokens": 8150324.0,
874
- "step": 2400
875
- },
876
- {
877
- "epoch": 2.289896128423041,
878
- "grad_norm": 3.1821959018707275,
879
- "learning_rate": 4.7969543147208126e-05,
880
- "loss": 2.2792,
881
- "mean_token_accuracy": 0.6698596668243408,
882
- "num_tokens": 8237632.0,
883
- "step": 2425
884
- },
885
- {
886
- "epoch": 2.3135033050047213,
887
- "grad_norm": 1.7894206047058105,
888
- "learning_rate": 4.638324873096447e-05,
889
- "loss": 2.1695,
890
- "mean_token_accuracy": 0.687961802482605,
891
- "num_tokens": 8324924.0,
892
- "step": 2450
893
- },
894
- {
895
- "epoch": 2.3371104815864023,
896
- "grad_norm": 2.1389191150665283,
897
- "learning_rate": 4.479695431472081e-05,
898
- "loss": 2.0925,
899
- "mean_token_accuracy": 0.7001671254634857,
900
- "num_tokens": 8411403.0,
901
- "step": 2475
902
- },
903
- {
904
- "epoch": 2.360717658168083,
905
- "grad_norm": 2.390141248703003,
906
- "learning_rate": 4.321065989847716e-05,
907
- "loss": 2.3656,
908
- "mean_token_accuracy": 0.657674194574356,
909
- "num_tokens": 8496536.0,
910
- "step": 2500
911
- },
912
- {
913
- "epoch": 2.384324834749764,
914
- "grad_norm": 2.3317668437957764,
915
- "learning_rate": 4.162436548223351e-05,
916
- "loss": 1.9864,
917
- "mean_token_accuracy": 0.7178370308876038,
918
- "num_tokens": 8588032.0,
919
- "step": 2525
920
- },
921
- {
922
- "epoch": 2.407932011331445,
923
- "grad_norm": 2.255056619644165,
924
- "learning_rate": 4.003807106598985e-05,
925
- "loss": 2.1131,
926
- "mean_token_accuracy": 0.7006520819664002,
927
- "num_tokens": 8672178.0,
928
- "step": 2550
929
- },
930
- {
931
- "epoch": 2.4315391879131254,
932
- "grad_norm": 2.2023000717163086,
933
- "learning_rate": 3.84517766497462e-05,
934
- "loss": 2.0892,
935
- "mean_token_accuracy": 0.7081982207298279,
936
- "num_tokens": 8753494.0,
937
- "step": 2575
938
- },
939
- {
940
- "epoch": 2.4551463644948064,
941
- "grad_norm": 3.6177306175231934,
942
- "learning_rate": 3.686548223350254e-05,
943
- "loss": 2.2059,
944
- "mean_token_accuracy": 0.6876888346672058,
945
- "num_tokens": 8836935.0,
946
- "step": 2600
947
- },
948
- {
949
- "epoch": 2.4787535410764874,
950
- "grad_norm": 2.292860746383667,
951
- "learning_rate": 3.527918781725888e-05,
952
- "loss": 2.3883,
953
- "mean_token_accuracy": 0.6509598207473755,
954
- "num_tokens": 8921920.0,
955
- "step": 2625
956
- },
957
- {
958
- "epoch": 2.502360717658168,
959
- "grad_norm": 2.749070167541504,
960
- "learning_rate": 3.369289340101523e-05,
961
- "loss": 2.3556,
962
- "mean_token_accuracy": 0.6616550016403199,
963
- "num_tokens": 9005311.0,
964
- "step": 2650
965
- },
966
- {
967
- "epoch": 2.525967894239849,
968
- "grad_norm": 2.500455617904663,
969
- "learning_rate": 3.210659898477157e-05,
970
- "loss": 2.1479,
971
- "mean_token_accuracy": 0.6956441640853882,
972
- "num_tokens": 9089703.0,
973
- "step": 2675
974
- },
975
- {
976
- "epoch": 2.54957507082153,
977
- "grad_norm": 1.5603734254837036,
978
- "learning_rate": 3.052030456852792e-05,
979
- "loss": 2.2351,
980
- "mean_token_accuracy": 0.6813873088359833,
981
- "num_tokens": 9177025.0,
982
- "step": 2700
983
- },
984
- {
985
- "epoch": 2.5731822474032104,
986
- "grad_norm": 3.4635727405548096,
987
- "learning_rate": 2.8934010152284264e-05,
988
- "loss": 2.1598,
989
- "mean_token_accuracy": 0.6917344307899476,
990
- "num_tokens": 9261429.0,
991
- "step": 2725
992
- },
993
- {
994
- "epoch": 2.5967894239848914,
995
- "grad_norm": 2.4915714263916016,
996
- "learning_rate": 2.7347715736040606e-05,
997
- "loss": 2.2634,
998
- "mean_token_accuracy": 0.6734067058563232,
999
- "num_tokens": 9348692.0,
1000
- "step": 2750
1001
- },
1002
- {
1003
- "epoch": 2.620396600566572,
1004
- "grad_norm": 2.7267868518829346,
1005
- "learning_rate": 2.576142131979696e-05,
1006
- "loss": 2.2268,
1007
- "mean_token_accuracy": 0.6815747046470642,
1008
- "num_tokens": 9432715.0,
1009
- "step": 2775
1010
- },
1011
- {
1012
- "epoch": 2.644003777148253,
1013
- "grad_norm": 3.085237503051758,
1014
- "learning_rate": 2.41751269035533e-05,
1015
- "loss": 2.1703,
1016
- "mean_token_accuracy": 0.6912525415420532,
1017
- "num_tokens": 9518928.0,
1018
- "step": 2800
1019
- },
1020
- {
1021
- "epoch": 2.667610953729934,
1022
- "grad_norm": 2.2934255599975586,
1023
- "learning_rate": 2.2588832487309646e-05,
1024
- "loss": 2.2124,
1025
- "mean_token_accuracy": 0.6754102325439453,
1026
- "num_tokens": 9606238.0,
1027
- "step": 2825
1028
- },
1029
- {
1030
- "epoch": 2.691218130311615,
1031
- "grad_norm": 3.9280519485473633,
1032
- "learning_rate": 2.100253807106599e-05,
1033
- "loss": 2.0278,
1034
- "mean_token_accuracy": 0.7123878049850464,
1035
- "num_tokens": 9693011.0,
1036
- "step": 2850
1037
- },
1038
- {
1039
- "epoch": 2.7148253068932955,
1040
- "grad_norm": 2.0387489795684814,
1041
- "learning_rate": 1.9416243654822337e-05,
1042
- "loss": 2.177,
1043
- "mean_token_accuracy": 0.6935857963562012,
1044
- "num_tokens": 9776929.0,
1045
- "step": 2875
1046
- },
1047
- {
1048
- "epoch": 2.7384324834749765,
1049
- "grad_norm": 2.9819560050964355,
1050
- "learning_rate": 1.782994923857868e-05,
1051
- "loss": 2.0651,
1052
- "mean_token_accuracy": 0.7067358446121216,
1053
- "num_tokens": 9862456.0,
1054
- "step": 2900
1055
- },
1056
- {
1057
- "epoch": 2.762039660056657,
1058
- "grad_norm": 2.204148054122925,
1059
- "learning_rate": 1.6243654822335024e-05,
1060
- "loss": 2.046,
1061
- "mean_token_accuracy": 0.707190637588501,
1062
- "num_tokens": 9952073.0,
1063
- "step": 2925
1064
- },
1065
- {
1066
- "epoch": 2.785646836638338,
1067
- "grad_norm": 3.0924429893493652,
1068
- "learning_rate": 1.4657360406091371e-05,
1069
- "loss": 2.209,
1070
- "mean_token_accuracy": 0.6862516689300537,
1071
- "num_tokens": 10037799.0,
1072
- "step": 2950
1073
- },
1074
- {
1075
- "epoch": 2.809254013220019,
1076
- "grad_norm": 2.6368651390075684,
1077
- "learning_rate": 1.3071065989847717e-05,
1078
- "loss": 2.2244,
1079
- "mean_token_accuracy": 0.6805661177635193,
1080
- "num_tokens": 10124423.0,
1081
- "step": 2975
1082
- },
1083
- {
1084
- "epoch": 2.8328611898017,
1085
- "grad_norm": 1.7311336994171143,
1086
- "learning_rate": 1.148477157360406e-05,
1087
- "loss": 2.1135,
1088
- "mean_token_accuracy": 0.7051725935935974,
1089
- "num_tokens": 10202586.0,
1090
- "step": 3000
1091
- },
1092
- {
1093
- "epoch": 2.8564683663833805,
1094
- "grad_norm": 2.4750070571899414,
1095
- "learning_rate": 9.898477157360408e-06,
1096
- "loss": 2.0525,
1097
- "mean_token_accuracy": 0.7098425531387329,
1098
- "num_tokens": 10288808.0,
1099
- "step": 3025
1100
- },
1101
- {
1102
- "epoch": 2.8800755429650615,
1103
- "grad_norm": 2.8913192749023438,
1104
- "learning_rate": 8.312182741116751e-06,
1105
- "loss": 1.7762,
1106
- "mean_token_accuracy": 0.7606491017341613,
1107
- "num_tokens": 10375567.0,
1108
- "step": 3050
1109
- },
1110
- {
1111
- "epoch": 2.903682719546742,
1112
- "grad_norm": 2.6616008281707764,
1113
- "learning_rate": 6.725888324873096e-06,
1114
- "loss": 2.064,
1115
- "mean_token_accuracy": 0.7075335788726806,
1116
- "num_tokens": 10462771.0,
1117
- "step": 3075
1118
- },
1119
- {
1120
- "epoch": 2.927289896128423,
1121
- "grad_norm": 2.2261228561401367,
1122
- "learning_rate": 5.139593908629442e-06,
1123
- "loss": 2.2238,
1124
- "mean_token_accuracy": 0.6874805259704589,
1125
- "num_tokens": 10542846.0,
1126
- "step": 3100
1127
- },
1128
- {
1129
- "epoch": 2.950897072710104,
1130
- "grad_norm": 2.295609712600708,
1131
- "learning_rate": 3.5532994923857873e-06,
1132
- "loss": 2.1835,
1133
- "mean_token_accuracy": 0.6867920517921448,
1134
- "num_tokens": 10625357.0,
1135
- "step": 3125
1136
- },
1137
- {
1138
- "epoch": 2.9745042492917846,
1139
- "grad_norm": 1.7816609144210815,
1140
- "learning_rate": 1.967005076142132e-06,
1141
- "loss": 2.3207,
1142
- "mean_token_accuracy": 0.6663406753540039,
1143
- "num_tokens": 10706184.0,
1144
- "step": 3150
1145
- },
1146
- {
1147
- "epoch": 2.9981114258734656,
1148
- "grad_norm": 2.8482465744018555,
1149
- "learning_rate": 3.807106598984772e-07,
1150
- "loss": 2.1249,
1151
- "mean_token_accuracy": 0.6941254138946533,
1152
- "num_tokens": 10793477.0,
1153
- "step": 3175
1154
- }
1155
- ],
1156
- "logging_steps": 25,
1157
- "max_steps": 3177,
1158
- "num_input_tokens_seen": 0,
1159
- "num_train_epochs": 3,
1160
- "save_steps": 500,
1161
- "stateful_callbacks": {
1162
- "TrainerControl": {
1163
- "args": {
1164
- "should_epoch_stop": false,
1165
- "should_evaluate": false,
1166
- "should_log": false,
1167
- "should_save": true,
1168
- "should_training_stop": true
1169
- },
1170
- "attributes": {}
1171
- }
1172
- },
1173
- "total_flos": 1.6851959205842534e+18,
1174
- "train_batch_size": 9,
1175
- "trial_name": null,
1176
- "trial_params": null
1177
- }
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:b27be49042add3dc0d0be39c1a676fe5900e3b7cd0b1ac58cd0800c00a999270
+ oid sha256:1cf3a3cf49d10705b84367db680227f45497b70bf105d033348f9c874d1a31e3
 size 6161
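Although trainer_state.json is deleted in this commit, the values it carried (3 epochs over 3,177 steps, per-device batch size 9, logging every 25 steps, checkpoints every 500, a learning rate warming up to roughly 2e-4 and decaying linearly) outline the SFT run behind these assets. A hypothetical TRL reconstruction, not the author's actual training script, might look like:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Toy stand-in: the real training data is not part of this commit.
train_dataset = Dataset.from_dict({"text": ["example training text"]})

peft_config = LoraConfig(r=8, target_modules=["qkv_proj", "o_proj"], task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="peleke-phi-4",        # hypothetical
    num_train_epochs=3,               # trainer_state.json: "num_train_epochs": 3
    per_device_train_batch_size=9,    # trainer_state.json: "train_batch_size": 9
    learning_rate=2e-4,               # consistent with the logged learning-rate schedule
    logging_steps=25,                 # trainer_state.json: "logging_steps": 25
    save_steps=500,                   # trainer_state.json: "save_steps": 500
)

trainer = SFTTrainer(
    model="microsoft/phi-4",          # SFTTrainer accepts a model id and loads it
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```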