learn3r committed on
Commit 1f1cb2f
1 Parent(s): 4444f08

End of training

Files changed (5)
  1. README.md +2 -2
  2. all_results.json +13 -0
  3. eval_results.json +8 -0
  4. train_results.json +8 -0
  5. trainer_state.json +1170 -0
README.md CHANGED
@@ -4,7 +4,7 @@ base_model: google/long-t5-tglobal-xl
  tags:
  - generated_from_trainer
  datasets:
- - scrolls
+ - tau/scrolls
  model-index:
  - name: longt5_xl_sfd_20
  results: []
@@ -15,7 +15,7 @@ should probably proofread and complete it, then remove this comment. -->

  # longt5_xl_sfd_20

- This model is a fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the scrolls dataset.
+ This model is a fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the tau/scrolls summ_screen_fd dataset.
  It achieves the following results on the evaluation set:
  - Loss: 4.8167

all_results.json ADDED
@@ -0,0 +1,13 @@
+ {
+     "epoch": 19.48,
+     "eval_loss": 4.81671667098999,
+     "eval_runtime": 80.254,
+     "eval_samples": 338,
+     "eval_samples_per_second": 4.212,
+     "eval_steps_per_second": 0.536,
+     "train_loss": 0.8494854368801628,
+     "train_runtime": 68771.7044,
+     "train_samples": 3673,
+     "train_samples_per_second": 1.068,
+     "train_steps_per_second": 0.004
+ }
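The rate metrics in all_results.json can be cross-checked against the raw counts. The sketch below is only a sanity check, not the Trainer's own code: it assumes the training sample count is the dataset size multiplied by the nominal `num_train_epochs` (20, from trainer_state.json) and that steps per second is `global_step` (280) divided by `train_runtime`. The helper name `throughput` is made up for illustration.

```python
# Values copied from all_results.json and trainer_state.json in this commit.
results = {
    "eval_runtime": 80.254,
    "eval_samples": 338,
    "train_runtime": 68771.7044,
    "train_samples": 3673,
}
NUM_TRAIN_EPOCHS = 20  # "num_train_epochs" in trainer_state.json
GLOBAL_STEP = 280      # "global_step" in trainer_state.json


def throughput(results, num_epochs, global_step):
    """Recompute the rate metrics, rounded to 3 decimals like the logged values."""
    # Assumption: samples processed = dataset size * nominal number of epochs.
    train_sps = results["train_samples"] * num_epochs / results["train_runtime"]
    train_steps_ps = global_step / results["train_runtime"]
    eval_sps = results["eval_samples"] / results["eval_runtime"]
    return round(train_sps, 3), round(train_steps_ps, 3), round(eval_sps, 3)


print(throughput(results, NUM_TRAIN_EPOCHS, GLOBAL_STEP))
# -> (1.068, 0.004, 4.212)
```

Under these assumptions the recomputed values agree with the logged `train_samples_per_second`, `train_steps_per_second`, and `eval_samples_per_second`.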
eval_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "epoch": 19.48,
+     "eval_loss": 4.81671667098999,
+     "eval_runtime": 80.254,
+     "eval_samples": 338,
+     "eval_samples_per_second": 4.212,
+     "eval_steps_per_second": 0.536
+ }
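To give the final `eval_loss` a more intuitive scale, it can be converted to a perplexity. This is a minimal sketch that assumes the logged loss is a mean per-token cross-entropy in nats, which the commit itself does not state; the function name `perplexity` is illustrative.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


final_eval_loss = 4.81671667098999   # "eval_loss" in eval_results.json
best_eval_loss = 2.2994935512542725  # "best_metric" in trainer_state.json

print(f"final: {perplexity(final_eval_loss):.1f}")
print(f"best:  {perplexity(best_eval_loss):.1f}")
```

If the assumption holds, the gap between the two numbers shows how far the final checkpoint has drifted from the best one recorded in trainer_state.json.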
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+     "epoch": 19.48,
+     "train_loss": 0.8494854368801628,
+     "train_runtime": 68771.7044,
+     "train_samples": 3673,
+     "train_samples_per_second": 1.068,
+     "train_steps_per_second": 0.004
+ }
trainer_state.json ADDED
@@ -0,0 +1,1170 @@
+ {
+   "best_metric": 2.2994935512542725,
+   "best_model_checkpoint": "/exports/eddie/scratch/s1970716/models/longt5_xl_sfd_20/checkpoint-28",
+   "epoch": 19.47826086956522,
+   "eval_steps": 500,
+   "global_step": 280,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.14,
+       "grad_norm": 8.068708419799805,
+       "learning_rate": 0.001,
+       "loss": 3.274,
+       "step": 2
+     },
+     {
+       "epoch": 0.28,
+       "grad_norm": 1.4994572401046753,
+       "learning_rate": 0.001,
+       "loss": 3.2963,
+       "step": 4
+     },
+     {
+       "epoch": 0.42,
+       "grad_norm": 1.0570803880691528,
+       "learning_rate": 0.001,
+       "loss": 3.3164,
+       "step": 6
+     },
+     {
+       "epoch": 0.56,
+       "grad_norm": 1.2446849346160889,
+       "learning_rate": 0.001,
+       "loss": 3.0866,
+       "step": 8
+     },
+     {
+       "epoch": 0.7,
+       "grad_norm": 0.721084713935852,
+       "learning_rate": 0.001,
+       "loss": 2.8976,
+       "step": 10
+     },
+     {
+       "epoch": 0.83,
+       "grad_norm": 1.2132383584976196,
+       "learning_rate": 0.001,
+       "loss": 2.8298,
+       "step": 12
+     },
+     {
+       "epoch": 0.97,
+       "grad_norm": 0.4689762592315674,
+       "learning_rate": 0.001,
+       "loss": 2.9377,
+       "step": 14
+     },
+     {
+       "epoch": 0.97,
+       "eval_loss": 2.7965147495269775,
+       "eval_runtime": 81.4763,
+       "eval_samples_per_second": 4.148,
+       "eval_steps_per_second": 0.528,
+       "step": 14
+     },
+     {
+       "epoch": 1.11,
+       "grad_norm": 0.42892181873321533,
+       "learning_rate": 0.001,
+       "loss": 2.741,
+       "step": 16
+     },
+     {
+       "epoch": 1.25,
+       "grad_norm": 0.4487678110599518,
+       "learning_rate": 0.001,
+       "loss": 2.4441,
+       "step": 18
+     },
+     {
+       "epoch": 1.39,
+       "grad_norm": 0.4653552770614624,
+       "learning_rate": 0.001,
+       "loss": 2.432,
+       "step": 20
+     },
+     {
+       "epoch": 1.53,
+       "grad_norm": 0.35275548696517944,
+       "learning_rate": 0.001,
+       "loss": 2.4016,
+       "step": 22
+     },
+     {
+       "epoch": 1.67,
+       "grad_norm": 0.43277695775032043,
+       "learning_rate": 0.001,
+       "loss": 2.391,
+       "step": 24
+     },
+     {
+       "epoch": 1.81,
+       "grad_norm": 0.3408297300338745,
+       "learning_rate": 0.001,
+       "loss": 2.3911,
+       "step": 26
+     },
+     {
+       "epoch": 1.95,
+       "grad_norm": 0.3205319344997406,
+       "learning_rate": 0.001,
+       "loss": 2.3247,
+       "step": 28
+     },
+     {
+       "epoch": 1.95,
+       "eval_loss": 2.2994935512542725,
+       "eval_runtime": 81.4693,
+       "eval_samples_per_second": 4.149,
+       "eval_steps_per_second": 0.528,
+       "step": 28
+     },
+     {
+       "epoch": 2.09,
+       "grad_norm": 0.4033512771129608,
+       "learning_rate": 0.001,
+       "loss": 2.0701,
+       "step": 30
+     },
+     {
+       "epoch": 2.23,
+       "grad_norm": 0.36825311183929443,
+       "learning_rate": 0.001,
+       "loss": 2.0968,
+       "step": 32
+     },
+     {
+       "epoch": 2.37,
+       "grad_norm": 0.5080482363700867,
+       "learning_rate": 0.001,
+       "loss": 2.0681,
+       "step": 34
+     },
+     {
+       "epoch": 2.5,
+       "grad_norm": 0.4196927845478058,
+       "learning_rate": 0.001,
+       "loss": 2.0914,
+       "step": 36
+     },
+     {
+       "epoch": 2.64,
+       "grad_norm": 0.3230506479740143,
+       "learning_rate": 0.001,
+       "loss": 2.0317,
+       "step": 38
+     },
+     {
+       "epoch": 2.78,
+       "grad_norm": 0.2733004689216614,
+       "learning_rate": 0.001,
+       "loss": 1.9723,
+       "step": 40
+     },
+     {
+       "epoch": 2.92,
+       "grad_norm": 0.2709517776966095,
+       "learning_rate": 0.001,
+       "loss": 1.9943,
+       "step": 42
+     },
+     {
+       "epoch": 2.99,
+       "eval_loss": 2.3308048248291016,
+       "eval_runtime": 81.5083,
+       "eval_samples_per_second": 4.147,
+       "eval_steps_per_second": 0.528,
+       "step": 43
+     },
+     {
+       "epoch": 3.06,
+       "grad_norm": 0.3230663537979126,
+       "learning_rate": 0.001,
+       "loss": 1.9093,
+       "step": 44
+     },
+     {
+       "epoch": 3.2,
+       "grad_norm": 0.3976946175098419,
+       "learning_rate": 0.001,
+       "loss": 1.7682,
+       "step": 46
+     },
+     {
+       "epoch": 3.34,
+       "grad_norm": 0.42008209228515625,
+       "learning_rate": 0.001,
+       "loss": 1.7119,
+       "step": 48
+     },
+     {
+       "epoch": 3.48,
+       "grad_norm": 0.31828513741493225,
+       "learning_rate": 0.001,
+       "loss": 1.7283,
+       "step": 50
+     },
+     {
+       "epoch": 3.62,
+       "grad_norm": 0.2448839396238327,
+       "learning_rate": 0.001,
+       "loss": 1.6905,
+       "step": 52
+     },
+     {
+       "epoch": 3.76,
+       "grad_norm": 0.25552132725715637,
+       "learning_rate": 0.001,
+       "loss": 1.6645,
+       "step": 54
+     },
+     {
+       "epoch": 3.9,
+       "grad_norm": 15.679224014282227,
+       "learning_rate": 0.001,
+       "loss": 1.7056,
+       "step": 56
+     },
+     {
+       "epoch": 3.97,
+       "eval_loss": 2.3368992805480957,
+       "eval_runtime": 81.4742,
+       "eval_samples_per_second": 4.149,
+       "eval_steps_per_second": 0.528,
+       "step": 57
+     },
+     {
+       "epoch": 4.03,
+       "grad_norm": 0.29547178745269775,
+       "learning_rate": 0.001,
+       "loss": 1.564,
+       "step": 58
+     },
+     {
+       "epoch": 4.17,
+       "grad_norm": 0.31610924005508423,
+       "learning_rate": 0.001,
+       "loss": 1.3607,
+       "step": 60
+     },
+     {
+       "epoch": 4.31,
+       "grad_norm": 0.32351407408714294,
+       "learning_rate": 0.001,
+       "loss": 1.4158,
+       "step": 62
+     },
+     {
+       "epoch": 4.45,
+       "grad_norm": 0.5101042985916138,
+       "learning_rate": 0.001,
+       "loss": 1.4694,
+       "step": 64
+     },
+     {
+       "epoch": 4.59,
+       "grad_norm": 0.41575145721435547,
+       "learning_rate": 0.001,
+       "loss": 1.4755,
+       "step": 66
+     },
+     {
+       "epoch": 4.73,
+       "grad_norm": 0.3269899785518646,
+       "learning_rate": 0.001,
+       "loss": 1.4268,
+       "step": 68
+     },
+     {
+       "epoch": 4.87,
+       "grad_norm": 0.4077276587486267,
+       "learning_rate": 0.001,
+       "loss": 1.4471,
+       "step": 70
+     },
+     {
+       "epoch": 4.94,
+       "eval_loss": 2.553175926208496,
+       "eval_runtime": 81.5149,
+       "eval_samples_per_second": 4.146,
+       "eval_steps_per_second": 0.528,
+       "step": 71
+     },
+     {
+       "epoch": 5.01,
+       "grad_norm": 0.37493908405303955,
+       "learning_rate": 0.001,
+       "loss": 1.4436,
+       "step": 72
+     },
+     {
+       "epoch": 5.15,
+       "grad_norm": 0.8398223519325256,
+       "learning_rate": 0.001,
+       "loss": 1.1776,
+       "step": 74
+     },
+     {
+       "epoch": 5.29,
+       "grad_norm": 0.621316134929657,
+       "learning_rate": 0.001,
+       "loss": 1.192,
+       "step": 76
+     },
+     {
+       "epoch": 5.43,
+       "grad_norm": 0.5988876819610596,
+       "learning_rate": 0.001,
+       "loss": 1.1561,
+       "step": 78
+     },
+     {
+       "epoch": 5.57,
+       "grad_norm": 0.561390221118927,
+       "learning_rate": 0.001,
+       "loss": 1.2129,
+       "step": 80
+     },
+     {
+       "epoch": 5.7,
+       "grad_norm": 0.32573097944259644,
+       "learning_rate": 0.001,
+       "loss": 1.19,
+       "step": 82
+     },
+     {
+       "epoch": 5.84,
+       "grad_norm": 0.3272527754306793,
+       "learning_rate": 0.001,
+       "loss": 1.1933,
+       "step": 84
+     },
+     {
+       "epoch": 5.98,
+       "grad_norm": 0.36107558012008667,
+       "learning_rate": 0.001,
+       "loss": 1.1932,
+       "step": 86
+     },
+     {
+       "epoch": 5.98,
+       "eval_loss": 2.696089744567871,
+       "eval_runtime": 81.5294,
+       "eval_samples_per_second": 4.146,
+       "eval_steps_per_second": 0.527,
+       "step": 86
+     },
+     {
+       "epoch": 6.12,
+       "grad_norm": 0.4167131781578064,
+       "learning_rate": 0.001,
+       "loss": 0.9285,
+       "step": 88
+     },
+     {
+       "epoch": 6.26,
+       "grad_norm": 0.38736867904663086,
+       "learning_rate": 0.001,
+       "loss": 0.9568,
+       "step": 90
+     },
+     {
+       "epoch": 6.4,
+       "grad_norm": 0.3212537169456482,
+       "learning_rate": 0.001,
+       "loss": 0.9538,
+       "step": 92
+     },
+     {
+       "epoch": 6.54,
+       "grad_norm": 0.2966512143611908,
+       "learning_rate": 0.001,
+       "loss": 0.9133,
+       "step": 94
+     },
+     {
+       "epoch": 6.68,
+       "grad_norm": 0.3149372935295105,
+       "learning_rate": 0.001,
+       "loss": 0.9374,
+       "step": 96
+     },
+     {
+       "epoch": 6.82,
+       "grad_norm": 0.3140605092048645,
+       "learning_rate": 0.001,
+       "loss": 0.9585,
+       "step": 98
+     },
+     {
+       "epoch": 6.96,
+       "grad_norm": 0.33559679985046387,
+       "learning_rate": 0.001,
+       "loss": 0.9199,
+       "step": 100
+     },
+     {
+       "epoch": 6.96,
+       "eval_loss": 2.645321846008301,
+       "eval_runtime": 81.5044,
+       "eval_samples_per_second": 4.147,
+       "eval_steps_per_second": 0.528,
+       "step": 100
+     },
+     {
+       "epoch": 7.1,
+       "grad_norm": 0.3616858720779419,
+       "learning_rate": 0.001,
+       "loss": 0.7517,
+       "step": 102
+     },
+     {
+       "epoch": 7.23,
+       "grad_norm": 0.4970415234565735,
+       "learning_rate": 0.001,
+       "loss": 0.7378,
+       "step": 104
+     },
+     {
+       "epoch": 7.37,
+       "grad_norm": 0.6654688119888306,
+       "learning_rate": 0.001,
+       "loss": 0.7864,
+       "step": 106
+     },
+     {
+       "epoch": 7.51,
+       "grad_norm": 0.51229327917099,
+       "learning_rate": 0.001,
+       "loss": 0.762,
+       "step": 108
+     },
+     {
+       "epoch": 7.65,
+       "grad_norm": 0.4524416923522949,
+       "learning_rate": 0.001,
+       "loss": 0.7342,
+       "step": 110
+     },
+     {
+       "epoch": 7.79,
+       "grad_norm": 0.48206427693367004,
+       "learning_rate": 0.001,
+       "loss": 0.7706,
+       "step": 112
+     },
+     {
+       "epoch": 7.93,
+       "grad_norm": 0.4534417688846588,
+       "learning_rate": 0.001,
+       "loss": 0.7571,
+       "step": 114
+     },
+     {
+       "epoch": 8.0,
+       "eval_loss": 3.0977730751037598,
+       "eval_runtime": 81.5778,
+       "eval_samples_per_second": 4.143,
+       "eval_steps_per_second": 0.527,
+       "step": 115
+     },
+     {
+       "epoch": 8.07,
+       "grad_norm": 0.306815505027771,
+       "learning_rate": 0.001,
+       "loss": 0.6809,
+       "step": 116
+     },
+     {
+       "epoch": 8.21,
+       "grad_norm": 0.34183812141418457,
+       "learning_rate": 0.001,
+       "loss": 0.5853,
+       "step": 118
+     },
+     {
+       "epoch": 8.35,
+       "grad_norm": 0.3781261444091797,
+       "learning_rate": 0.001,
+       "loss": 0.5819,
+       "step": 120
+     },
+     {
+       "epoch": 8.49,
+       "grad_norm": 0.36344149708747864,
+       "learning_rate": 0.001,
+       "loss": 0.6059,
+       "step": 122
+     },
+     {
+       "epoch": 8.63,
+       "grad_norm": 0.38990476727485657,
+       "learning_rate": 0.001,
+       "loss": 0.5929,
+       "step": 124
+     },
+     {
+       "epoch": 8.77,
+       "grad_norm": 0.34000781178474426,
+       "learning_rate": 0.001,
+       "loss": 0.5887,
+       "step": 126
+     },
+     {
+       "epoch": 8.9,
+       "grad_norm": 0.32895970344543457,
+       "learning_rate": 0.001,
+       "loss": 0.6287,
+       "step": 128
+     },
+     {
+       "epoch": 8.97,
+       "eval_loss": 3.145782709121704,
+       "eval_runtime": 81.5735,
+       "eval_samples_per_second": 4.144,
+       "eval_steps_per_second": 0.527,
+       "step": 129
+     },
+     {
+       "epoch": 9.04,
+       "grad_norm": 0.36275872588157654,
+       "learning_rate": 0.001,
+       "loss": 0.5983,
+       "step": 130
+     },
+     {
+       "epoch": 9.18,
+       "grad_norm": 0.3596336245536804,
+       "learning_rate": 0.001,
+       "loss": 0.4615,
+       "step": 132
+     },
+     {
+       "epoch": 9.32,
+       "grad_norm": 0.37557095289230347,
+       "learning_rate": 0.001,
+       "loss": 0.4756,
+       "step": 134
+     },
+     {
+       "epoch": 9.46,
+       "grad_norm": 0.39249515533447266,
+       "learning_rate": 0.001,
+       "loss": 0.4546,
+       "step": 136
+     },
+     {
+       "epoch": 9.6,
+       "grad_norm": 0.3760348856449127,
+       "learning_rate": 0.001,
+       "loss": 0.4792,
+       "step": 138
+     },
+     {
+       "epoch": 9.74,
+       "grad_norm": 0.3137217164039612,
+       "learning_rate": 0.001,
+       "loss": 0.4674,
+       "step": 140
+     },
+     {
+       "epoch": 9.88,
+       "grad_norm": 0.40549594163894653,
+       "learning_rate": 0.001,
+       "loss": 0.4939,
+       "step": 142
+     },
+     {
+       "epoch": 9.95,
+       "eval_loss": 3.5685999393463135,
+       "eval_runtime": 81.5958,
+       "eval_samples_per_second": 4.142,
+       "eval_steps_per_second": 0.527,
+       "step": 143
+     },
+     {
+       "epoch": 10.02,
+       "grad_norm": 0.4173819422721863,
+       "learning_rate": 0.001,
+       "loss": 0.5055,
+       "step": 144
+     },
+     {
+       "epoch": 10.16,
+       "grad_norm": 0.280066579580307,
+       "learning_rate": 0.001,
+       "loss": 0.3353,
+       "step": 146
+     },
+     {
+       "epoch": 10.3,
+       "grad_norm": 0.30166783928871155,
+       "learning_rate": 0.001,
+       "loss": 0.351,
+       "step": 148
+     },
+     {
+       "epoch": 10.43,
+       "grad_norm": 0.28606531023979187,
+       "learning_rate": 0.001,
+       "loss": 0.3834,
+       "step": 150
+     },
+     {
+       "epoch": 10.57,
+       "grad_norm": 0.2835221588611603,
+       "learning_rate": 0.001,
+       "loss": 0.3718,
+       "step": 152
+     },
+     {
+       "epoch": 10.71,
+       "grad_norm": 0.3148328959941864,
+       "learning_rate": 0.001,
+       "loss": 0.3692,
+       "step": 154
+     },
+     {
+       "epoch": 10.85,
+       "grad_norm": 0.3502219021320343,
+       "learning_rate": 0.001,
+       "loss": 0.38,
+       "step": 156
+     },
+     {
+       "epoch": 10.99,
+       "grad_norm": 0.3344653844833374,
+       "learning_rate": 0.001,
+       "loss": 0.376,
+       "step": 158
+     },
+     {
+       "epoch": 10.99,
+       "eval_loss": 3.425977945327759,
+       "eval_runtime": 81.532,
+       "eval_samples_per_second": 4.146,
+       "eval_steps_per_second": 0.527,
+       "step": 158
+     },
+     {
+       "epoch": 11.13,
+       "grad_norm": 0.32332998514175415,
+       "learning_rate": 0.001,
+       "loss": 0.2827,
+       "step": 160
+     },
+     {
+       "epoch": 11.27,
+       "grad_norm": 0.35432103276252747,
+       "learning_rate": 0.001,
+       "loss": 0.2966,
+       "step": 162
+     },
+     {
+       "epoch": 11.41,
+       "grad_norm": 0.29032111167907715,
+       "learning_rate": 0.001,
+       "loss": 0.2954,
+       "step": 164
+     },
+     {
+       "epoch": 11.55,
+       "grad_norm": 0.3170696198940277,
+       "learning_rate": 0.001,
+       "loss": 0.2738,
+       "step": 166
+     },
+     {
+       "epoch": 11.69,
+       "grad_norm": 0.3339516520500183,
+       "learning_rate": 0.001,
+       "loss": 0.2786,
+       "step": 168
+     },
+     {
+       "epoch": 11.83,
+       "grad_norm": 0.3187398910522461,
+       "learning_rate": 0.001,
+       "loss": 0.315,
+       "step": 170
+     },
+     {
+       "epoch": 11.97,
+       "grad_norm": 0.2842791974544525,
+       "learning_rate": 0.001,
+       "loss": 0.313,
+       "step": 172
+     },
+     {
+       "epoch": 11.97,
+       "eval_loss": 3.9301607608795166,
+       "eval_runtime": 81.5908,
+       "eval_samples_per_second": 4.143,
+       "eval_steps_per_second": 0.527,
+       "step": 172
+     },
+     {
+       "epoch": 12.1,
+       "grad_norm": 0.2522130012512207,
+       "learning_rate": 0.001,
+       "loss": 0.2504,
+       "step": 174
+     },
+     {
+       "epoch": 12.24,
+       "grad_norm": 0.23560765385627747,
+       "learning_rate": 0.001,
+       "loss": 0.212,
+       "step": 176
+     },
+     {
+       "epoch": 12.38,
+       "grad_norm": 0.24140460789203644,
+       "learning_rate": 0.001,
+       "loss": 0.2156,
+       "step": 178
+     },
+     {
+       "epoch": 12.52,
+       "grad_norm": 0.2790488302707672,
+       "learning_rate": 0.001,
+       "loss": 0.2474,
+       "step": 180
+     },
+     {
+       "epoch": 12.66,
+       "grad_norm": 0.2879179120063782,
+       "learning_rate": 0.001,
+       "loss": 0.2486,
+       "step": 182
+     },
+     {
+       "epoch": 12.8,
+       "grad_norm": 0.3126004934310913,
+       "learning_rate": 0.001,
+       "loss": 0.2499,
+       "step": 184
+     },
+     {
+       "epoch": 12.94,
+       "grad_norm": 0.3011338412761688,
+       "learning_rate": 0.001,
+       "loss": 0.2562,
+       "step": 186
+     },
+     {
+       "epoch": 12.94,
+       "eval_loss": 3.743312120437622,
+       "eval_runtime": 81.5885,
+       "eval_samples_per_second": 4.143,
+       "eval_steps_per_second": 0.527,
+       "step": 186
+     },
+     {
+       "epoch": 13.08,
+       "grad_norm": 0.24417123198509216,
+       "learning_rate": 0.001,
+       "loss": 0.2166,
+       "step": 188
+     },
+     {
+       "epoch": 13.22,
+       "grad_norm": 0.21955759823322296,
+       "learning_rate": 0.001,
+       "loss": 0.1767,
+       "step": 190
+     },
+     {
+       "epoch": 13.36,
+       "grad_norm": 0.20537225902080536,
+       "learning_rate": 0.001,
+       "loss": 0.1715,
+       "step": 192
+     },
+     {
+       "epoch": 13.5,
+       "grad_norm": 0.21406413614749908,
+       "learning_rate": 0.001,
+       "loss": 0.1857,
+       "step": 194
+     },
+     {
+       "epoch": 13.63,
+       "grad_norm": 0.21677067875862122,
+       "learning_rate": 0.001,
+       "loss": 0.1881,
+       "step": 196
+     },
+     {
+       "epoch": 13.77,
+       "grad_norm": 0.2592070996761322,
+       "learning_rate": 0.001,
+       "loss": 0.2022,
+       "step": 198
+     },
+     {
+       "epoch": 13.91,
+       "grad_norm": 0.23913638293743134,
+       "learning_rate": 0.001,
+       "loss": 0.2051,
+       "step": 200
+     },
+     {
+       "epoch": 13.98,
+       "eval_loss": 3.911346197128296,
+       "eval_runtime": 81.5425,
+       "eval_samples_per_second": 4.145,
+       "eval_steps_per_second": 0.527,
+       "step": 201
+     },
+     {
+       "epoch": 14.05,
+       "grad_norm": 0.19888806343078613,
+       "learning_rate": 0.001,
+       "loss": 0.1774,
+       "step": 202
+     },
+     {
+       "epoch": 14.19,
+       "grad_norm": 0.17841410636901855,
+       "learning_rate": 0.001,
+       "loss": 0.1409,
+       "step": 204
+     },
+     {
+       "epoch": 14.33,
+       "grad_norm": 0.22502601146697998,
+       "learning_rate": 0.001,
+       "loss": 0.1432,
+       "step": 206
+     },
+     {
+       "epoch": 14.47,
+       "grad_norm": 0.21947847306728363,
+       "learning_rate": 0.001,
+       "loss": 0.1487,
+       "step": 208
+     },
+     {
+       "epoch": 14.61,
+       "grad_norm": 0.20319664478302002,
+       "learning_rate": 0.001,
+       "loss": 0.1753,
+       "step": 210
+     },
+     {
+       "epoch": 14.75,
+       "grad_norm": 0.20484566688537598,
+       "learning_rate": 0.001,
+       "loss": 0.1627,
+       "step": 212
+     },
+     {
+       "epoch": 14.89,
+       "grad_norm": 0.24411869049072266,
+       "learning_rate": 0.001,
+       "loss": 0.1802,
+       "step": 214
+     },
+     {
+       "epoch": 14.96,
+       "eval_loss": 4.0449538230896,
+       "eval_runtime": 81.5583,
+       "eval_samples_per_second": 4.144,
+       "eval_steps_per_second": 0.527,
+       "step": 215
+     },
+     {
+       "epoch": 15.03,
+       "grad_norm": 0.23610645532608032,
+       "learning_rate": 0.001,
+       "loss": 0.1881,
+       "step": 216
+     },
+     {
+       "epoch": 15.17,
+       "grad_norm": 0.17829175293445587,
+       "learning_rate": 0.001,
+       "loss": 0.123,
+       "step": 218
+     },
+     {
+       "epoch": 15.3,
+       "grad_norm": 0.178519606590271,
+       "learning_rate": 0.001,
+       "loss": 0.1166,
+       "step": 220
+     },
+     {
+       "epoch": 15.44,
+       "grad_norm": 0.19595706462860107,
+       "learning_rate": 0.001,
+       "loss": 0.135,
+       "step": 222
+     },
+     {
+       "epoch": 15.58,
+       "grad_norm": 0.20790521800518036,
+       "learning_rate": 0.001,
+       "loss": 0.1494,
+       "step": 224
+     },
+     {
+       "epoch": 15.72,
+       "grad_norm": 0.1832074671983719,
+       "learning_rate": 0.001,
+       "loss": 0.1488,
+       "step": 226
+     },
+     {
+       "epoch": 15.86,
+       "grad_norm": 0.17795896530151367,
+       "learning_rate": 0.001,
+       "loss": 0.1448,
+       "step": 228
+     },
+     {
+       "epoch": 16.0,
+       "grad_norm": 0.20039702951908112,
+       "learning_rate": 0.001,
+       "loss": 0.1378,
+       "step": 230
+     },
+     {
+       "epoch": 16.0,
+       "eval_loss": 3.939739227294922,
+       "eval_runtime": 81.6032,
+       "eval_samples_per_second": 4.142,
+       "eval_steps_per_second": 0.527,
+       "step": 230
+     },
+     {
+       "epoch": 16.14,
+       "grad_norm": 0.19622142612934113,
+       "learning_rate": 0.001,
+       "loss": 0.3001,
+       "step": 232
+     },
+     {
+       "epoch": 16.28,
+       "grad_norm": 19.05455207824707,
+       "learning_rate": 0.001,
+       "loss": 0.2708,
+       "step": 234
+     },
+     {
+       "epoch": 16.42,
+       "grad_norm": 29.798582077026367,
+       "learning_rate": 0.001,
+       "loss": 0.2154,
+       "step": 236
+     },
+     {
+       "epoch": 16.56,
+       "grad_norm": 8.835821151733398,
+       "learning_rate": 0.001,
+       "loss": 0.1348,
+       "step": 238
+     },
+     {
+       "epoch": 16.7,
+       "grad_norm": 0.3760863244533539,
+       "learning_rate": 0.001,
+       "loss": 0.6235,
+       "step": 240
+     },
+     {
+       "epoch": 16.83,
+       "grad_norm": 0.3473583459854126,
+       "learning_rate": 0.001,
+       "loss": 0.1445,
+       "step": 242
+     },
+     {
+       "epoch": 16.97,
+       "grad_norm": 0.4041793942451477,
+       "learning_rate": 0.001,
+       "loss": 0.1546,
+       "step": 244
+     },
+     {
+       "epoch": 16.97,
+       "eval_loss": 4.307888984680176,
+       "eval_runtime": 81.6566,
+       "eval_samples_per_second": 4.139,
+       "eval_steps_per_second": 0.527,
+       "step": 244
+     },
+     {
+       "epoch": 17.11,
+       "grad_norm": 0.2586219906806946,
+       "learning_rate": 0.001,
+       "loss": 0.1188,
+       "step": 246
+     },
+     {
+       "epoch": 17.25,
+       "grad_norm": 0.4334220886230469,
+       "learning_rate": 0.001,
+       "loss": 0.1041,
+       "step": 248
+     },
+     {
+       "epoch": 17.39,
+       "grad_norm": 17.520734786987305,
+       "learning_rate": 0.001,
+       "loss": 0.1108,
+       "step": 250
+     },
+     {
+       "epoch": 17.53,
+       "grad_norm": 0.5943770408630371,
+       "learning_rate": 0.001,
+       "loss": 0.1146,
+       "step": 252
+     },
+     {
+       "epoch": 17.67,
+       "grad_norm": 0.4325353503227234,
+       "learning_rate": 0.001,
+       "loss": 0.1325,
+       "step": 254
+     },
+     {
+       "epoch": 17.81,
+       "grad_norm": 0.41412413120269775,
+       "learning_rate": 0.001,
+       "loss": 0.1491,
+       "step": 256
+     },
+     {
+       "epoch": 17.95,
+       "grad_norm": 0.19986829161643982,
+       "learning_rate": 0.001,
+       "loss": 0.1375,
+       "step": 258
+     },
+     {
+       "epoch": 17.95,
+       "eval_loss": 4.552526950836182,
+       "eval_runtime": 81.6054,
+       "eval_samples_per_second": 4.142,
+       "eval_steps_per_second": 0.527,
+       "step": 258
+     },
+     {
+       "epoch": 18.09,
+       "grad_norm": 0.7999384999275208,
+       "learning_rate": 0.001,
+       "loss": 0.1155,
+       "step": 260
+     },
+     {
+       "epoch": 18.23,
+       "grad_norm": 0.17563021183013916,
+       "learning_rate": 0.001,
+       "loss": 0.1006,
+       "step": 262
+     },
+     {
+       "epoch": 18.37,
+       "grad_norm": 0.17661228775978088,
+       "learning_rate": 0.001,
+       "loss": 0.1062,
+       "step": 264
+     },
+     {
+       "epoch": 18.5,
+       "grad_norm": 0.17768113315105438,
+       "learning_rate": 0.001,
+       "loss": 0.1059,
+       "step": 266
+     },
+     {
+       "epoch": 18.64,
+       "grad_norm": 0.15412819385528564,
+       "learning_rate": 0.001,
+       "loss": 0.0981,
+       "step": 268
+     },
+     {
+       "epoch": 18.78,
+       "grad_norm": 0.1754271388053894,
+       "learning_rate": 0.001,
+       "loss": 0.0988,
+       "step": 270
+     },
+     {
+       "epoch": 18.92,
+       "grad_norm": 0.15736614167690277,
+       "learning_rate": 0.001,
+       "loss": 0.1005,
+       "step": 272
+     },
+     {
+       "epoch": 18.99,
+       "eval_loss": 4.900540828704834,
+       "eval_runtime": 81.5789,
+       "eval_samples_per_second": 4.143,
+       "eval_steps_per_second": 0.527,
+       "step": 273
+     },
+     {
+       "epoch": 19.06,
+       "grad_norm": 0.1531495302915573,
+       "learning_rate": 0.001,
+       "loss": 0.0844,
+       "step": 274
+     },
+     {
+       "epoch": 19.2,
+       "grad_norm": 0.15237411856651306,
+       "learning_rate": 0.001,
+       "loss": 0.0752,
+       "step": 276
+     },
+     {
+       "epoch": 19.34,
+       "grad_norm": 0.1433786153793335,
+       "learning_rate": 0.001,
+       "loss": 0.0782,
+       "step": 278
+     },
+     {
+       "epoch": 19.48,
+       "grad_norm": 0.1296713650226593,
+       "learning_rate": 0.001,
+       "loss": 0.0808,
+       "step": 280
+     },
+     {
+       "epoch": 19.48,
+       "eval_loss": 4.81671667098999,
+       "eval_runtime": 81.4692,
+       "eval_samples_per_second": 4.149,
+       "eval_steps_per_second": 0.528,
+       "step": 280
+     },
+     {
+       "epoch": 19.48,
+       "step": 280,
+       "total_flos": 4.895208054457934e+18,
+       "train_loss": 0.8494854368801628,
+       "train_runtime": 68771.7044,
+       "train_samples_per_second": 1.068,
+       "train_steps_per_second": 0.004
+     }
+   ],
+   "logging_steps": 2,
+   "max_steps": 280,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 20,
+   "save_steps": 500,
+   "total_flos": 4.895208054457934e+18,
+   "train_batch_size": 8,
+   "trial_name": null,
+   "trial_params": null
+ }
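The `log_history` records above can be scanned for the best evaluation point and for unusually large gradient norms. The sketch below works on a trimmed copy of the log (every evaluation record, plus a few training records picked for illustration). The helper names `best_eval` and `grad_norm_spikes` and the threshold of 5.0 are assumptions made for this example, not part of the Trainer API.

```python
# Trimmed copy of "log_history" from trainer_state.json: all evaluation
# records, plus a handful of training records chosen for illustration.
LOG_HISTORY = [
    {"step": 2, "grad_norm": 8.068708419799805, "loss": 3.274},
    {"step": 14, "eval_loss": 2.7965147495269775},
    {"step": 28, "grad_norm": 0.3205319344997406, "loss": 2.3247},
    {"step": 28, "eval_loss": 2.2994935512542725},
    {"step": 43, "eval_loss": 2.3308048248291016},
    {"step": 56, "grad_norm": 15.679224014282227, "loss": 1.7056},
    {"step": 57, "eval_loss": 2.3368992805480957},
    {"step": 71, "eval_loss": 2.553175926208496},
    {"step": 86, "eval_loss": 2.696089744567871},
    {"step": 100, "eval_loss": 2.645321846008301},
    {"step": 115, "eval_loss": 3.0977730751037598},
    {"step": 129, "eval_loss": 3.145782709121704},
    {"step": 143, "eval_loss": 3.5685999393463135},
    {"step": 158, "eval_loss": 3.425977945327759},
    {"step": 172, "eval_loss": 3.9301607608795166},
    {"step": 186, "eval_loss": 3.743312120437622},
    {"step": 201, "eval_loss": 3.911346197128296},
    {"step": 215, "eval_loss": 4.0449538230896},
    {"step": 230, "eval_loss": 3.939739227294922},
    {"step": 234, "grad_norm": 19.05455207824707, "loss": 0.2708},
    {"step": 236, "grad_norm": 29.798582077026367, "loss": 0.2154},
    {"step": 238, "grad_norm": 8.835821151733398, "loss": 0.1348},
    {"step": 244, "eval_loss": 4.307888984680176},
    {"step": 250, "grad_norm": 17.520734786987305, "loss": 0.1108},
    {"step": 258, "eval_loss": 4.552526950836182},
    {"step": 273, "eval_loss": 4.900540828704834},
    {"step": 280, "grad_norm": 0.1296713650226593, "loss": 0.0808},
    {"step": 280, "eval_loss": 4.81671667098999},
]


def best_eval(log_history):
    """Return the evaluation record with the lowest eval_loss."""
    evals = [rec for rec in log_history if "eval_loss" in rec]
    return min(evals, key=lambda rec: rec["eval_loss"])


def grad_norm_spikes(log_history, threshold=5.0):
    """Return the steps whose logged grad_norm exceeds a hypothetical threshold."""
    return [
        rec["step"]
        for rec in log_history
        if rec.get("grad_norm", 0.0) > threshold
    ]


best = best_eval(LOG_HISTORY)
final = [rec for rec in LOG_HISTORY if "eval_loss" in rec][-1]
print("best eval:", best["step"], best["eval_loss"])
print("final eval:", final["step"], final["eval_loss"])
print("grad_norm spikes at steps:", grad_norm_spikes(LOG_HISTORY))
```

The lowest evaluation loss falls at step 28, which agrees with `best_metric` and the `checkpoint-28` path in `best_model_checkpoint`. The final evaluation loss at step 280 is roughly twice that value, while the training loss keeps falling throughout the log.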