gitGut01 committed on
Commit
19d1dda
•
1 Parent(s): 61b62f9

add datasets

dataset/dataset_ciaworld.txt ADDED
The diff for this file is too large to render. See raw diff
 
dataset/dataset_edsheeran.txt ADDED
The diff for this file is too large to render. See raw diff
 
dataset/dataset_haiku.txt ADDED
The diff for this file is too large to render. See raw diff
 
dataset/dataset_shakespeare.txt ADDED
The diff for this file is too large to render. See raw diff
 
train_info/train_info_edsheeran.txt ADDED
@@ -0,0 +1,111 @@
+ Overriding config with config/finetune_shakespeare.py:
+ import time
+
+ out_dir = 'out-shakespeare'
+ eval_interval = 5
+ eval_iters = 40
+ wandb_log = False # feel free to turn on
+ wandb_project = 'shakespeare'
+ wandb_run_name = 'ft-' + str(time.time())
+
+ dataset = 'shakespeare'
+ init_from = 'gpt2' # this is the largest GPT-2 model
+
+ # only save checkpoints if the validation loss improves
+ always_save_checkpoint = False
+
+ # the number of examples per iter:
+ # 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+ # shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+ batch_size = 1
+ gradient_accumulation_steps = 32
+ max_iters = 120
+
+ # finetune at constant LR
+ learning_rate = 3e-5
+ decay_lr = False
+
+ Initializing from OpenAI GPT-2 weights: gpt2
+ loading weights from pretrained gpt: gpt2
+ forcing vocab_size=50257, block_size=1024, bias=True
+ overriding dropout rate to 0.0
+ number of parameters: 123.65M
+ Downloading (…)lve/main/config.json: 100% 665/665 [00:00<00:00, 98.3kB/s]
+ Downloading pytorch_model.bin: 100% 548M/548M [00:05<00:00, 92.8MB/s]
+ Downloading (…)neration_config.json: 100% 124/124 [00:00<00:00, 19.2kB/s]
+ using fused AdamW: True
+ compiling the model... (takes a ~minute)
+ [2023-03-21 14:22:50,795] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+ step 0: train loss 3.4423, val loss 3.0369
+ iter 0: loss 3.2863, time 77202.23ms, mfu -100.00%
+ iter 1: loss 2.7469, time 22529.17ms, mfu -100.00%
+ iter 2: loss 3.7087, time 23101.21ms, mfu -100.00%
+ iter 3: loss 3.6040, time 23363.38ms, mfu -100.00%
+ iter 4: loss 2.6769, time 23118.49ms, mfu -100.00%
+ step 5: train loss 3.4339, val loss 2.9363
+ saving checkpoint to out-shakespeare
+ iter 5: loss 3.1141, time 30621.41ms, mfu 2.35%
+ iter 6: loss 3.3365, time 23426.49ms, mfu 2.42%
+ iter 7: loss 3.8965, time 23144.13ms, mfu 2.49%
+ iter 8: loss 3.4058, time 23061.69ms, mfu 2.55%
+ iter 9: loss 3.2569, time 23230.68ms, mfu 2.60%
+ step 10: train loss 3.2385, val loss 2.9982
+ iter 10: loss 3.1935, time 25160.57ms, mfu 2.63%
+ iter 11: loss 3.9526, time 23125.77ms, mfu 2.68%
+ iter 12: loss 2.4570, time 23136.22ms, mfu 2.72%
+ iter 13: loss 3.5092, time 23120.81ms, mfu 2.76%
+ iter 14: loss 3.4771, time 23226.29ms, mfu 2.79%
+ step 15: train loss 2.9026, val loss 2.8705
+ saving checkpoint to out-shakespeare
+ iter 15: loss 3.4825, time 30931.56ms, mfu 2.75%
+ iter 16: loss 3.3583, time 23307.64ms, mfu 2.78%
+ iter 17: loss 2.2991, time 23143.53ms, mfu 2.81%
+ iter 18: loss 3.2513, time 23131.39ms, mfu 2.84%
+ iter 19: loss 2.9859, time 23160.12ms, mfu 2.87%
+ step 20: train loss 2.9491, val loss 2.7808
+ saving checkpoint to out-shakespeare
+ iter 20: loss 3.0525, time 30909.27ms, mfu 2.81%
+ iter 21: loss 2.9295, time 23294.73ms, mfu 2.84%
+ iter 22: loss 2.2879, time 23094.34ms, mfu 2.87%
+ iter 23: loss 1.8019, time 23103.56ms, mfu 2.89%
+ iter 24: loss 3.4942, time 23172.01ms, mfu 2.91%
+ step 25: train loss 2.8004, val loss 2.8107
+ iter 25: loss 2.2264, time 25127.64ms, mfu 2.91%
+ iter 26: loss 3.4194, time 23174.40ms, mfu 2.93%
+ iter 27: loss 2.8144, time 23152.02ms, mfu 2.94%
+ iter 28: loss 3.0488, time 23133.18ms, mfu 2.96%
+ iter 29: loss 3.1027, time 23085.89ms, mfu 2.98%
+ step 30: train loss 2.6644, val loss 2.6210
+ saving checkpoint to out-shakespeare
+ iter 30: loss 2.4424, time 31309.61ms, mfu 2.91%
+ iter 31: loss 3.0193, time 23415.64ms, mfu 2.92%
+ iter 32: loss 2.8735, time 23054.64ms, mfu 2.94%
+ iter 33: loss 2.9842, time 23053.71ms, mfu 2.96%
+ iter 34: loss 2.8148, time 23136.92ms, mfu 2.97%
+ step 35: train loss 2.8676, val loss 2.5965
+ saving checkpoint to out-shakespeare
+ iter 35: loss 2.8556, time 31228.61ms, mfu 2.91%
+ iter 36: loss 2.1186, time 23332.51ms, mfu 2.92%
+ iter 37: loss 2.4768, time 23039.16ms, mfu 2.94%
+ iter 38: loss 2.7992, time 23035.59ms, mfu 2.96%
+ iter 39: loss 2.7109, time 23218.08ms, mfu 2.97%
+ step 40: train loss 2.5840, val loss 2.6467
+ iter 40: loss 3.0349, time 25092.98ms, mfu 2.96%
+ iter 41: loss 2.8766, time 23084.39ms, mfu 2.98%
+ iter 42: loss 2.5366, time 23099.15ms, mfu 2.99%
+ iter 43: loss 2.7461, time 23183.70ms, mfu 3.00%
+ iter 44: loss 1.4962, time 23190.74ms, mfu 3.01%
+ step 45: train loss 2.6357, val loss 2.6529
+ iter 45: loss 2.1228, time 25011.92ms, mfu 3.00%
+ iter 46: loss 1.9382, time 23127.95ms, mfu 3.01%
+ iter 47: loss 1.7129, time 23168.21ms, mfu 3.02%
+ iter 48: loss 2.4555, time 23162.14ms, mfu 3.03%
+ iter 49: loss 1.3368, time 23152.22ms, mfu 3.03%
+ step 50: train loss 2.3167, val loss 2.6496
+ iter 50: loss 2.3815, time 24969.84ms, mfu 3.02%
+ iter 51: loss 1.5433, time 23013.56ms, mfu 3.03%
+ iter 52: loss 2.5276, time 22951.87ms, mfu 3.04%
+ iter 53: loss 2.0912, time 22989.47ms, mfu 3.05%
+ iter 54: loss 1.6236, time 23016.77ms, mfu 3.06%
+ step 55: train loss 2.2718, val loss 2.6701
+ iter 55: loss 0.9116, time 24910.16ms, mfu 3.04%
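
The config comments in the log above do a bit of arithmetic (tokens per optimizer iteration, iterations per epoch). A minimal sketch of that calculation, using only the values shown in the config (`batch_size=1`, `gradient_accumulation_steps=32`, the 1024-token GPT-2 context, and the stated 301,966-token shakespeare dataset):

```python
# Tokens processed per optimizer iteration, per the config comment:
batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024  # tokens per example (GPT-2 context length)

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_iter)  # 32768

# With ~301,966 tokens in the dataset, one epoch is roughly:
dataset_tokens = 301_966
iters_per_epoch = dataset_tokens / tokens_per_iter
print(round(iters_per_epoch, 1))  # 9.2
```

So this run's `max_iters = 120` corresponds to roughly 13 passes over the data.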
train_info/train_info_haiku.txt ADDED
@@ -0,0 +1,42 @@
+ # finetune at constant LR
+ learning_rate = 3e-5
+ decay_lr = False
+
+ Initializing from OpenAI GPT-2 weights: gpt2
+ loading weights from pretrained gpt: gpt2
+ forcing vocab_size=50257, block_size=1024, bias=True
+ overriding dropout rate to 0.0
+ number of parameters: 123.65M
+ using fused AdamW: True
+ compiling the model... (takes a ~minute)
+ [2023-03-21 15:03:01,696] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+ step 0: train loss 7.3575, val loss 7.4530
+ iter 0: loss 7.3959, time 55528.06ms, mfu -100.00%
+ iter 1: loss 7.4243, time 22248.52ms, mfu -100.00%
+ iter 2: loss 7.3179, time 22821.48ms, mfu -100.00%
+ iter 3: loss 7.5001, time 23404.71ms, mfu -100.00%
+ iter 4: loss 7.4802, time 23247.54ms, mfu -100.00%
+ step 5: train loss 7.2418, val loss 7.4663
+ iter 5: loss 7.3052, time 24918.41ms, mfu 2.88%
+ iter 6: loss 6.9456, time 23189.74ms, mfu 2.90%
+ iter 7: loss 6.6510, time 23306.99ms, mfu 2.92%
+ iter 8: loss 6.3013, time 23235.93ms, mfu 2.94%
+ iter 9: loss 6.0171, time 23170.33ms, mfu 2.96%
+ step 10: train loss 5.9558, val loss 5.9625
+ saving checkpoint to out-shakespeare
+ iter 10: loss 5.9322, time 31040.11ms, mfu 2.89%
+ iter 11: loss 5.8374, time 23361.17ms, mfu 2.91%
+ iter 12: loss 5.6069, time 23241.27ms, mfu 2.93%
+ iter 13: loss 5.6613, time 23180.06ms, mfu 2.95%
+ iter 14: loss 5.2928, time 23169.15ms, mfu 2.96%
+ step 15: train loss 5.4229, val loss 5.4202
+ saving checkpoint to out-shakespeare
+ iter 15: loss 5.3205, time 31057.72ms, mfu 2.90%
+ iter 16: loss 5.4608, time 23320.27ms, mfu 2.91%
+ iter 17: loss 5.2379, time 23176.04ms, mfu 2.93%
+ iter 18: loss 5.1430, time 23211.53ms, mfu 2.95%
+ iter 19: loss 5.5525, time 23232.59ms, mfu 2.96%
+ step 20: train loss 5.1232, val loss 5.0514
+ saving checkpoint to out-shakespeare
+ iter 20: loss 5.1371, time 31097.85ms, mfu 2.90%
+ iter 21: loss 4.9530, time 23374.38ms, mfu 2.92%
train_info/train_info_shakespeare.txt ADDED
@@ -0,0 +1,185 @@
+ Overriding config with config/finetune_shakespeare.py:
+ import time
+
+ out_dir = 'out-shakespeare'
+ eval_interval = 5
+ eval_iters = 40
+ wandb_log = False # feel free to turn on
+ wandb_project = 'shakespeare'
+ wandb_run_name = 'ft-' + str(time.time())
+
+ dataset = 'shakespeare'
+ init_from = 'gpt2' # this is the largest GPT-2 model
+
+ # only save checkpoints if the validation loss improves
+ always_save_checkpoint = False
+
+ # the number of examples per iter:
+ # 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+ # shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+ batch_size = 1
+ gradient_accumulation_steps = 32
+ max_iters = 1000
+
+ # finetune at constant LR
+ learning_rate = 3e-5
+ decay_lr = False
+
+ Initializing from OpenAI GPT-2 weights: gpt2
+ loading weights from pretrained gpt: gpt2
+ forcing vocab_size=50257, block_size=1024, bias=True
+ overriding dropout rate to 0.0
+ number of parameters: 123.65M
+ using fused AdamW: True
+ compiling the model... (takes a ~minute)
+ [2023-03-20 21:31:13,957] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+ step 0: train loss 4.1871, val loss 4.0326
+ iter 0: loss 4.8126, time 53610.16ms, mfu -100.00%
+ iter 1: loss 3.8469, time 22853.81ms, mfu -100.00%
+ iter 2: loss 4.1342, time 23058.41ms, mfu -100.00%
+ iter 3: loss 4.2060, time 23164.17ms, mfu -100.00%
+ iter 4: loss 4.6711, time 23070.16ms, mfu -100.00%
+ step 5: train loss 4.3096, val loss 3.9636
+ saving checkpoint to out-shakespeare
+ iter 5: loss 3.4577, time 30970.06ms, mfu 2.32%
+ iter 6: loss 2.9587, time 23298.83ms, mfu 2.40%
+ iter 7: loss 3.2116, time 23132.08ms, mfu 2.47%
+ iter 8: loss 3.4900, time 23106.50ms, mfu 2.53%
+ iter 9: loss 3.8003, time 23125.60ms, mfu 2.59%
+ step 10: train loss 3.6215, val loss 3.4816
+ saving checkpoint to out-shakespeare
+ iter 10: loss 3.6364, time 30978.89ms, mfu 2.56%
+ iter 11: loss 3.4725, time 23263.91ms, mfu 2.61%
+ iter 12: loss 3.4080, time 23053.16ms, mfu 2.67%
+ iter 13: loss 3.9510, time 23091.76ms, mfu 2.71%
+ iter 14: loss 3.6421, time 23142.46ms, mfu 2.75%
+ step 15: train loss 3.5292, val loss 3.2960
+ saving checkpoint to out-shakespeare
+ iter 15: loss 3.2916, time 31036.47ms, mfu 2.71%
+ iter 16: loss 3.8844, time 23232.40ms, mfu 2.74%
+ iter 17: loss 3.2954, time 23076.36ms, mfu 2.78%
+ iter 18: loss 2.9807, time 23073.19ms, mfu 2.81%
+ iter 19: loss 3.4524, time 23090.94ms, mfu 2.84%
+ step 20: train loss 3.4621, val loss 3.3625
+ iter 20: loss 3.3737, time 25115.53ms, mfu 2.85%
+ iter 21: loss 3.6565, time 23165.72ms, mfu 2.87%
+ iter 22: loss 3.3047, time 23174.77ms, mfu 2.89%
+ iter 23: loss 3.8091, time 23135.82ms, mfu 2.92%
+ iter 24: loss 3.1955, time 23097.90ms, mfu 2.94%
+ step 25: train loss 3.5139, val loss 3.2854
+ saving checkpoint to out-shakespeare
+ iter 25: loss 3.8481, time 30838.74ms, mfu 2.87%
+ iter 26: loss 3.2716, time 23304.59ms, mfu 2.90%
+ iter 27: loss 3.3729, time 23056.31ms, mfu 2.92%
+ iter 28: loss 3.3545, time 23107.46ms, mfu 2.94%
+ iter 29: loss 2.7101, time 23209.45ms, mfu 2.95%
+ step 30: train loss 3.3706, val loss 3.2958
+ iter 30: loss 3.0968, time 25123.31ms, mfu 2.94%
+ iter 31: loss 2.9495, time 23116.72ms, mfu 2.96%
+ iter 32: loss 3.0179, time 23101.19ms, mfu 2.97%
+ iter 33: loss 2.9648, time 23117.17ms, mfu 2.99%
+ iter 34: loss 3.6522, time 23132.76ms, mfu 3.00%
+ step 35: train loss 3.3923, val loss 3.2125
+ saving checkpoint to out-shakespeare
+ iter 35: loss 3.2469, time 31079.08ms, mfu 2.93%
+ iter 36: loss 3.1450, time 23273.02ms, mfu 2.95%
+ iter 37: loss 3.4624, time 23046.04ms, mfu 2.96%
+ iter 38: loss 3.4371, time 23102.73ms, mfu 2.98%
+ iter 39: loss 3.3130, time 23178.65ms, mfu 2.99%
+ step 40: train loss 3.3233, val loss 3.2543
+ iter 40: loss 3.0743, time 25069.68ms, mfu 2.98%
+ iter 41: loss 3.1269, time 23084.39ms, mfu 2.99%
+ iter 42: loss 3.6785, time 23076.30ms, mfu 3.00%
+ iter 43: loss 3.3787, time 23075.87ms, mfu 3.01%
+ iter 44: loss 3.2637, time 23098.68ms, mfu 3.02%
+ step 45: train loss 3.1971, val loss 3.2642
+ iter 45: loss 3.1861, time 25003.67ms, mfu 3.01%
+ iter 46: loss 3.4037, time 23106.62ms, mfu 3.02%
+ iter 47: loss 3.4947, time 23109.37ms, mfu 3.03%
+ iter 48: loss 3.3276, time 23098.50ms, mfu 3.04%
+ iter 49: loss 2.9062, time 23171.38ms, mfu 3.04%
+ step 50: train loss 3.2188, val loss 3.2460
+ iter 50: loss 3.5280, time 25111.46ms, mfu 3.02%
+ iter 51: loss 3.5470, time 23143.40ms, mfu 3.03%
+ iter 52: loss 3.1881, time 23109.22ms, mfu 3.04%
+ iter 53: loss 3.4332, time 23083.68ms, mfu 3.05%
+ iter 54: loss 3.1956, time 23117.10ms, mfu 3.05%
+ step 55: train loss 3.2902, val loss 3.1846
+ saving checkpoint to out-shakespeare
+ iter 55: loss 3.4816, time 31132.51ms, mfu 2.98%
+ iter 56: loss 3.2971, time 23207.94ms, mfu 2.99%
+ iter 57: loss 2.9543, time 23064.74ms, mfu 3.00%
+ iter 58: loss 2.8729, time 23093.16ms, mfu 3.01%
+ iter 59: loss 3.0883, time 23129.34ms, mfu 3.02%
+ step 60: train loss 3.1288, val loss 3.1545
+ saving checkpoint to out-shakespeare
+ iter 60: loss 3.7098, time 31022.27ms, mfu 2.95%
+ iter 61: loss 3.4157, time 23229.02ms, mfu 2.97%
+ iter 62: loss 3.0020, time 23059.02ms, mfu 2.98%
+ iter 63: loss 3.0751, time 23063.51ms, mfu 2.99%
+ iter 64: loss 2.9081, time 23134.60ms, mfu 3.01%
+ step 65: train loss 3.2254, val loss 3.1772
+ iter 65: loss 3.3802, time 25114.58ms, mfu 2.99%
+ iter 66: loss 3.1073, time 23118.96ms, mfu 3.00%
+ iter 67: loss 3.1010, time 23081.32ms, mfu 3.01%
+ iter 68: loss 3.2594, time 23058.54ms, mfu 3.02%
+ iter 69: loss 3.4402, time 23062.45ms, mfu 3.03%
+ step 70: train loss 3.1511, val loss 3.2315
+ iter 70: loss 3.4094, time 24967.39ms, mfu 3.02%
+ iter 71: loss 3.0997, time 23070.28ms, mfu 3.03%
+ iter 72: loss 2.1573, time 23072.48ms, mfu 3.04%
+ iter 73: loss 3.3926, time 23060.80ms, mfu 3.04%
+ iter 74: loss 3.2284, time 23080.48ms, mfu 3.05%
+ step 75: train loss 3.1102, val loss 3.1017
+ saving checkpoint to out-shakespeare
+ iter 75: loss 3.3760, time 31003.52ms, mfu 2.98%
+ iter 76: loss 3.3387, time 23207.33ms, mfu 2.99%
+ iter 77: loss 2.9299, time 23040.87ms, mfu 3.00%
+ iter 78: loss 2.9623, time 23069.43ms, mfu 3.01%
+ iter 79: loss 3.0674, time 23111.04ms, mfu 3.02%
+ step 80: train loss 3.0574, val loss 3.2178
+ iter 80: loss 2.6808, time 25072.69ms, mfu 3.01%
+ iter 81: loss 2.7986, time 23144.88ms, mfu 3.02%
+ iter 82: loss 2.9121, time 23094.25ms, mfu 3.03%
+ iter 83: loss 2.7153, time 23114.27ms, mfu 3.03%
+ iter 84: loss 2.8444, time 23089.41ms, mfu 3.04%
+ step 85: train loss 2.9855, val loss 3.2298
+ iter 85: loss 3.0517, time 25033.77ms, mfu 3.03%
+ iter 86: loss 2.5920, time 23088.89ms, mfu 3.03%
+ iter 87: loss 3.1241, time 23084.88ms, mfu 3.04%
+ iter 88: loss 2.5355, time 23070.40ms, mfu 3.05%
+ iter 89: loss 3.4543, time 23060.05ms, mfu 3.06%
+ step 90: train loss 3.0426, val loss 3.2664
+ iter 90: loss 3.3099, time 24997.54ms, mfu 3.04%
+ iter 91: loss 2.8099, time 23108.94ms, mfu 3.04%
+ iter 92: loss 3.2419, time 23103.54ms, mfu 3.05%
+ iter 93: loss 3.4718, time 23089.71ms, mfu 3.06%
+ iter 94: loss 3.0708, time 23137.11ms, mfu 3.06%
+ step 95: train loss 3.0225, val loss 3.2529
+ iter 95: loss 2.8545, time 25072.26ms, mfu 3.04%
+ iter 96: loss 3.3059, time 23120.57ms, mfu 3.05%
+ iter 97: loss 2.7528, time 23111.60ms, mfu 3.06%
+ iter 98: loss 3.1788, time 23106.26ms, mfu 3.06%
+ iter 99: loss 2.9023, time 23103.06ms, mfu 3.07%
+ step 100: train loss 2.9153, val loss 3.2140
+ iter 100: loss 3.0090, time 24968.37ms, mfu 3.05%
+ iter 101: loss 3.0753, time 23093.87ms, mfu 3.05%
+ iter 102: loss 3.1295, time 23108.81ms, mfu 3.06%
+ iter 103: loss 2.9033, time 23136.51ms, mfu 3.06%
+ iter 104: loss 3.1117, time 23127.17ms, mfu 3.07%
+ step 105: train loss 2.9402, val loss 3.2071
+ iter 105: loss 2.8862, time 25050.88ms, mfu 3.05%
+ iter 106: loss 2.6040, time 23141.23ms, mfu 3.05%
+ iter 107: loss 3.1831, time 23146.47ms, mfu 3.06%
+ iter 108: loss 3.1619, time 23078.47ms, mfu 3.06%
+ iter 109: loss 3.0995, time 23098.26ms, mfu 3.07%
+ step 110: train loss 2.7568, val loss 3.2857
+ iter 110: loss 3.0392, time 24959.72ms, mfu 3.05%
+ iter 111: loss 3.1982, time 23121.36ms, mfu 3.06%
+ iter 112: loss 3.1794, time 23124.92ms, mfu 3.06%
+ iter 113: loss 2.8230, time 23138.96ms, mfu 3.07%
+ iter 114: loss 2.2634, time 23121.12ms, mfu 3.07%
+ step 115: train loss 2.8576, val loss 3.2603
+ iter 115: loss 3.0414, time 24960.16ms, mfu 3.05%
+ iter 116: loss 2.2827, time 23077.89ms, mfu 3.06%
+ iter 117: loss 2.5435, time 23054.11ms, mfu 3.06%
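
Every per-iteration line in these logs follows the same `iter N: loss X, time Yms, mfu Z%` format, which makes them easy to post-process (e.g. to plot the loss curve). A small sketch of a parser for that format; the helper name `parse_iter_line` is hypothetical, not part of the training script:

```python
import re

# Matches the "iter N: loss X, time Yms, mfu Z%" lines shown in these logs.
ITER_RE = re.compile(r"iter (\d+): loss ([\d.]+), time ([\d.]+)ms, mfu (-?[\d.]+)%")

def parse_iter_line(line):
    """Return a dict of the fields in one iteration line, or None if it
    is some other line (step summaries, checkpoint messages, etc.)."""
    m = ITER_RE.search(line)
    if m is None:
        return None
    it, loss, ms, mfu = m.groups()
    return {"iter": int(it), "loss": float(loss),
            "time_ms": float(ms), "mfu": float(mfu)}

sample = "iter 117: loss 2.5435, time 23054.11ms, mfu 3.06%"
print(parse_iter_line(sample)["loss"])  # 2.5435
```

Applied over a whole log file, this yields the (iter, loss) series that the checkpointing decisions above are summarizing every `eval_interval` iterations.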
train_info/train_info_trump.txt ADDED
@@ -0,0 +1,207 @@
+ Overriding config with config/finetune_shakespeare.py:
+ import time
+
+ out_dir = 'out-shakespeare'
+ eval_interval = 5
+ eval_iters = 40
+ wandb_log = False # feel free to turn on
+ wandb_project = 'shakespeare'
+ wandb_run_name = 'ft-' + str(time.time())
+
+ dataset = 'shakespeare'
+ init_from = 'gpt2' # this is the largest GPT-2 model
+
+ # only save checkpoints if the validation loss improves
+ always_save_checkpoint = False
+
+ # the number of examples per iter:
+ # 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+ # shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+ batch_size = 1
+ gradient_accumulation_steps = 32
+ max_iters = 300
+
+ # finetune at constant LR
+ learning_rate = 3e-5
+ decay_lr = False
+
+ Initializing from OpenAI GPT-2 weights: gpt2
+ loading weights from pretrained gpt: gpt2
+ forcing vocab_size=50257, block_size=1024, bias=True
+ overriding dropout rate to 0.0
+ number of parameters: 123.65M
+ Downloading (…)lve/main/config.json: 100% 665/665 [00:00<00:00, 88.4kB/s]
+ Downloading pytorch_model.bin: 100% 548M/548M [00:01<00:00, 289MB/s]
+ Downloading (…)neration_config.json: 100% 124/124 [00:00<00:00, 22.5kB/s]
+ using fused AdamW: True
+ compiling the model... (takes a ~minute)
+ [2023-03-21 06:17:18,366] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+ step 0: train loss 3.3086, val loss 3.2349
+ iter 0: loss 3.4443, time 75907.68ms, mfu -100.00%
+ iter 1: loss 3.6624, time 23156.16ms, mfu -100.00%
+ iter 2: loss 4.4039, time 23248.46ms, mfu -100.00%
+ iter 3: loss 3.2693, time 22877.27ms, mfu -100.00%
+ iter 4: loss 3.4597, time 22906.52ms, mfu -100.00%
+ step 5: train loss 3.2166, val loss 3.2212
+ saving checkpoint to out-shakespeare
+ iter 5: loss 3.2885, time 30843.38ms, mfu 2.33%
+ iter 6: loss 3.2423, time 23117.67ms, mfu 2.41%
+ iter 7: loss 3.2239, time 23014.83ms, mfu 2.48%
+ iter 8: loss 3.3878, time 23083.71ms, mfu 2.54%
+ iter 9: loss 3.0245, time 23127.68ms, mfu 2.60%
+ step 10: train loss 3.1367, val loss 3.0886
+ saving checkpoint to out-shakespeare
+ iter 10: loss 3.2588, time 31026.66ms, mfu 2.57%
+ iter 11: loss 2.7963, time 23215.41ms, mfu 2.62%
+ iter 12: loss 3.0799, time 23045.69ms, mfu 2.67%
+ iter 13: loss 3.0391, time 23081.70ms, mfu 2.72%
+ iter 14: loss 2.9285, time 23144.99ms, mfu 2.76%
+ step 15: train loss 3.0614, val loss 3.0357
+ saving checkpoint to out-shakespeare
+ iter 15: loss 2.9088, time 31131.17ms, mfu 2.71%
+ iter 16: loss 2.8854, time 23203.33ms, mfu 2.75%
+ iter 17: loss 2.8941, time 23045.51ms, mfu 2.79%
+ iter 18: loss 3.1116, time 23058.43ms, mfu 2.82%
+ iter 19: loss 3.1542, time 23076.86ms, mfu 2.85%
+ step 20: train loss 2.9382, val loss 2.9662
+ saving checkpoint to out-shakespeare
+ iter 20: loss 2.8674, time 30800.95ms, mfu 2.80%
+ iter 21: loss 3.0158, time 23210.44ms, mfu 2.83%
+ iter 22: loss 3.0376, time 23028.93ms, mfu 2.86%
+ iter 23: loss 2.5614, time 23053.57ms, mfu 2.88%
+ iter 24: loss 3.0086, time 23135.53ms, mfu 2.90%
+ step 25: train loss 2.9386, val loss 2.9689
+ iter 25: loss 2.8633, time 25037.75ms, mfu 2.90%
+ iter 26: loss 3.2887, time 23087.04ms, mfu 2.92%
+ iter 27: loss 2.7507, time 23061.28ms, mfu 2.94%
+ iter 28: loss 3.0676, time 23047.93ms, mfu 2.96%
+ iter 29: loss 2.7316, time 23042.36ms, mfu 2.98%
+ step 30: train loss 2.9721, val loss 2.9042
+ saving checkpoint to out-shakespeare
+ iter 30: loss 2.7163, time 30867.03ms, mfu 2.91%
+ iter 31: loss 2.9423, time 23225.75ms, mfu 2.93%
+ iter 32: loss 2.9405, time 23012.47ms, mfu 2.95%
+ iter 33: loss 2.9208, time 23059.76ms, mfu 2.96%
+ iter 34: loss 2.9996, time 23121.13ms, mfu 2.98%
+ step 35: train loss 2.9496, val loss 2.8374
+ saving checkpoint to out-shakespeare
+ iter 35: loss 2.8072, time 31122.96ms, mfu 2.91%
+ iter 36: loss 2.9798, time 23209.16ms, mfu 2.93%
+ iter 37: loss 2.8476, time 23019.32ms, mfu 2.95%
+ iter 38: loss 2.7276, time 23056.09ms, mfu 2.97%
+ iter 39: loss 2.8636, time 23101.19ms, mfu 2.98%
+ step 40: train loss 2.8282, val loss 2.9073
+ iter 40: loss 2.7667, time 25022.64ms, mfu 2.97%
+ iter 41: loss 2.6111, time 23100.99ms, mfu 2.98%
+ iter 42: loss 3.1776, time 23107.88ms, mfu 3.00%
+ iter 43: loss 2.7963, time 23090.82ms, mfu 3.01%
+ iter 44: loss 3.2658, time 23084.78ms, mfu 3.02%
+ step 45: train loss 2.8171, val loss 2.8487
+ iter 45: loss 3.0523, time 24981.39ms, mfu 3.00%
+ iter 46: loss 2.6204, time 23087.28ms, mfu 3.01%
+ iter 47: loss 2.8938, time 23081.95ms, mfu 3.02%
+ iter 48: loss 3.1726, time 23092.57ms, mfu 3.03%
+ iter 49: loss 3.7836, time 23077.55ms, mfu 3.04%
+ step 50: train loss 2.8675, val loss 2.7787
+ saving checkpoint to out-shakespeare
+ iter 50: loss 3.0882, time 30881.37ms, mfu 2.97%
+ iter 51: loss 2.8358, time 23200.14ms, mfu 2.98%
+ iter 52: loss 2.9847, time 23008.69ms, mfu 3.00%
+ iter 53: loss 3.1992, time 23066.07ms, mfu 3.01%
+ iter 54: loss 2.4085, time 23118.93ms, mfu 3.02%
+ step 55: train loss 2.8049, val loss 2.7507
+ saving checkpoint to out-shakespeare
+ iter 55: loss 2.9964, time 31115.78ms, mfu 2.95%
+ iter 56: loss 2.9647, time 23212.73ms, mfu 2.96%
+ iter 57: loss 2.8880, time 23003.95ms, mfu 2.98%
+ iter 58: loss 2.8726, time 23053.90ms, mfu 2.99%
+ iter 59: loss 2.6470, time 23124.33ms, mfu 3.00%
+ step 60: train loss 2.8041, val loss 2.8827
+ iter 60: loss 2.8115, time 24978.80ms, mfu 2.99%
+ iter 61: loss 2.6765, time 23058.07ms, mfu 3.00%
+ iter 62: loss 2.6801, time 23052.27ms, mfu 3.01%
+ iter 63: loss 3.4295, time 23048.58ms, mfu 3.03%
+ iter 64: loss 2.5933, time 23062.70ms, mfu 3.03%
+ step 65: train loss 2.7894, val loss 2.7606
+ iter 65: loss 2.5231, time 24991.85ms, mfu 3.02%
+ iter 66: loss 2.8913, time 23099.31ms, mfu 3.03%
+ iter 67: loss 2.9515, time 23106.81ms, mfu 3.04%
+ iter 68: loss 2.8017, time 23098.12ms, mfu 3.04%
+ iter 69: loss 2.7759, time 23110.16ms, mfu 3.05%
+ step 70: train loss 2.8044, val loss 2.8498
+ iter 70: loss 2.9694, time 25009.31ms, mfu 3.03%
+ iter 71: loss 3.3238, time 23090.32ms, mfu 3.04%
+ iter 72: loss 2.6931, time 23086.35ms, mfu 3.05%
+ iter 73: loss 2.6097, time 23085.74ms, mfu 3.05%
+ iter 74: loss 2.1781, time 23096.25ms, mfu 3.06%
+ step 75: train loss 2.7755, val loss 2.6869
+ saving checkpoint to out-shakespeare
+ iter 75: loss 2.9208, time 30879.90ms, mfu 2.99%
+ iter 76: loss 2.7619, time 23186.69ms, mfu 3.00%
+ iter 77: loss 2.8394, time 23017.46ms, mfu 3.01%
+ iter 78: loss 2.5907, time 23049.26ms, mfu 3.02%
+ iter 79: loss 2.5660, time 23102.38ms, mfu 3.03%
+ step 80: train loss 2.7759, val loss 2.7603
+ iter 80: loss 2.6889, time 25011.13ms, mfu 3.01%
+ iter 81: loss 2.6940, time 23088.64ms, mfu 3.02%
+ iter 82: loss 2.6596, time 23050.35ms, mfu 3.03%
+ iter 83: loss 2.7638, time 23066.22ms, mfu 3.04%
+ iter 84: loss 2.6515, time 23059.01ms, mfu 3.05%
+ step 85: train loss 2.7404, val loss 2.7290
+ iter 85: loss 3.1829, time 24970.26ms, mfu 3.03%
+ iter 86: loss 2.5451, time 23052.03ms, mfu 3.04%
+ iter 87: loss 2.4363, time 23051.53ms, mfu 3.05%
+ iter 88: loss 2.8023, time 23039.12ms, mfu 3.05%
+ iter 89: loss 2.4755, time 23044.45ms, mfu 3.06%
+ step 90: train loss 2.7140, val loss 2.7692
+ iter 90: loss 2.7225, time 24960.52ms, mfu 3.04%
+ iter 91: loss 2.4655, time 23037.54ms, mfu 3.05%
+ iter 92: loss 2.5291, time 23029.37ms, mfu 3.06%
+ iter 93: loss 2.7720, time 23032.99ms, mfu 3.06%
+ iter 94: loss 2.7614, time 23039.50ms, mfu 3.07%
+ step 95: train loss 2.7932, val loss 2.7953
+ iter 95: loss 2.6881, time 24974.66ms, mfu 3.05%
+ iter 96: loss 2.9315, time 23044.89ms, mfu 3.06%
+ iter 97: loss 2.7099, time 23035.52ms, mfu 3.06%
+ iter 98: loss 2.6858, time 23036.10ms, mfu 3.07%
+ iter 99: loss 2.5341, time 23048.24ms, mfu 3.07%
+ step 100: train loss 2.6788, val loss 2.8138
+ iter 100: loss 2.7993, time 25008.37ms, mfu 3.05%
+ iter 101: loss 2.5996, time 23052.62ms, mfu 3.06%
+ iter 102: loss 2.7768, time 23059.09ms, mfu 3.07%
+ iter 103: loss 2.6378, time 23046.82ms, mfu 3.07%
+ iter 104: loss 2.7511, time 23043.40ms, mfu 3.08%
+ step 105: train loss 2.7542, val loss 2.6568
+ saving checkpoint to out-shakespeare
+ iter 105: loss 2.6596, time 31000.96ms, mfu 3.00%
+ iter 106: loss 2.8566, time 23195.71ms, mfu 3.01%
+ iter 107: loss 2.6284, time 22995.46ms, mfu 3.02%
+ iter 108: loss 2.6670, time 23031.45ms, mfu 3.03%
+ iter 109: loss 2.4732, time 23093.11ms, mfu 3.04%
+ step 110: train loss 2.7094, val loss 2.6684
+ iter 110: loss 2.5577, time 25028.10ms, mfu 3.02%
+ iter 111: loss 2.9250, time 23089.98ms, mfu 3.03%
+ iter 112: loss 2.6274, time 23072.14ms, mfu 3.04%
+ iter 113: loss 2.5337, time 23078.52ms, mfu 3.05%
+ iter 114: loss 2.7248, time 23061.41ms, mfu 3.05%
+ step 115: train loss 2.7062, val loss 2.7398
+ iter 115: loss 2.7654, time 24968.79ms, mfu 3.04%
+ iter 116: loss 2.6394, time 23049.91ms, mfu 3.04%
+ iter 117: loss 2.5259, time 23068.72ms, mfu 3.05%
+ iter 118: loss 2.8312, time 23061.73ms, mfu 3.06%
+ iter 119: loss 2.6137, time 23049.41ms, mfu 3.06%
+ step 120: train loss 2.6704, val loss 2.7120
+ iter 120: loss 2.6794, time 24958.89ms, mfu 3.05%
+ iter 121: loss 2.7400, time 23040.45ms, mfu 3.05%
+ iter 122: loss 2.6322, time 23047.61ms, mfu 3.06%
+ iter 123: loss 2.4416, time 23062.33ms, mfu 3.06%
+ iter 124: loss 2.6756, time 23048.99ms, mfu 3.07%
+ step 125: train loss 2.5866, val loss 2.6882
+ iter 125: loss 2.6490, time 24950.30ms, mfu 3.05%
+ iter 126: loss 2.5888, time 23027.86ms, mfu 3.06%
+ iter 127: loss 2.3960, time 23012.31ms, mfu 3.06%
+ iter 128: loss 2.6581, time 23025.51ms, mfu 3.07%
+ iter 129: loss 2.6202, time 23042.65ms, mfu 3.07%
+ step 130: train loss 2.6151, val loss 2.6532
+ saving checkpoint to out-shakespeare
+ iter 130: loss 2.8148, time 31009.76ms, mfu 3.00%
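
The `mfu` column in these logs reports model FLOPs utilization: FLOPs per second actually achieved, as a fraction of the accelerator's peak. A minimal sketch of the idea, using the common ~6N FLOPs-per-token approximation for a forward+backward pass of an N-parameter transformer and an assumed peak throughput; both are illustrative assumptions, so this simplified estimate will not reproduce the exact percentages above, which come from the training script's own accounting and hardware:

```python
# Model FLOPs utilization (MFU): achieved FLOPs/s divided by peak FLOPs/s.
# Assumption (not from the log): a fwd+bwd pass costs ~6*N FLOPs per token.

def estimate_mfu(n_params, tokens_per_iter, iter_time_s, peak_flops):
    flops_per_iter = 6 * n_params * tokens_per_iter  # rough fwd+bwd cost
    achieved_flops_per_s = flops_per_iter / iter_time_s
    return achieved_flops_per_s / peak_flops

# Plugging in values visible in the log (123.65M params, 32,768 tokens/iter,
# ~23s per iter) against a hypothetical peak:
mfu = estimate_mfu(n_params=123.65e6, tokens_per_iter=32768,
                   iter_time_s=23.1, peak_flops=312e12)
print(f"{mfu:.2%}")
```

A low MFU like the single-digit percentages above is a hint that the run is bottlenecked somewhere other than raw matrix-multiply throughput.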
ckpt_edsheeran.pt → weights/ckpt_edsheeran.pt RENAMED
File without changes
ckpt_haiku.pt → weights/ckpt_haiku.pt RENAMED
File without changes
ckpt_math.pt → weights/ckpt_math.pt RENAMED
File without changes
ckpt_shakespear.pt → weights/ckpt_shakespear.pt RENAMED
File without changes
ckpt_trump.pt → weights/ckpt_trump.pt RENAMED
File without changes
ckpt_world_facts_cia.pt → weights/ckpt_world_facts_cia.pt RENAMED
File without changes