gitGut01 committed
Commit • 19d1dda
1 Parent(s): 61b62f9
add datasets
- dataset/dataset_ciaworld.txt +0 -0
- dataset/dataset_edsheeran.txt +0 -0
- dataset/dataset_haiku.txt +0 -0
- dataset/dataset_shakespeare.txt +0 -0
- train_info/train_info_edsheeran.txt +111 -0
- train_info/train_info_haiku.txt +42 -0
- train_info/train_info_shakespeare.txt +185 -0
- train_info/train_info_trump.txt +207 -0
- ckpt_edsheeran.pt → weights/ckpt_edsheeran.pt +0 -0
- ckpt_haiku.pt → weights/ckpt_haiku.pt +0 -0
- ckpt_math.pt → weights/ckpt_math.pt +0 -0
- ckpt_shakespear.pt → weights/ckpt_shakespear.pt +0 -0
- ckpt_trump.pt → weights/ckpt_trump.pt +0 -0
- ckpt_world_facts_cia.pt → weights/ckpt_world_facts_cia.pt +0 -0
dataset/dataset_ciaworld.txt
ADDED
The diff for this file is too large to render.
dataset/dataset_edsheeran.txt
ADDED
The diff for this file is too large to render.
dataset/dataset_haiku.txt
ADDED
The diff for this file is too large to render.
dataset/dataset_shakespeare.txt
ADDED
The diff for this file is too large to render.
train_info/train_info_edsheeran.txt
ADDED
@@ -0,0 +1,111 @@
+Overriding config with config/finetune_shakespeare.py:
+import time
+
+out_dir = 'out-shakespeare'
+eval_interval = 5
+eval_iters = 40
+wandb_log = False # feel free to turn on
+wandb_project = 'shakespeare'
+wandb_run_name = 'ft-' + str(time.time())
+
+dataset = 'shakespeare'
+init_from = 'gpt2' # this is the largest GPT-2 model
+
+# only save checkpoints if the validation loss improves
+always_save_checkpoint = False
+
+# the number of examples per iter:
+# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+batch_size = 1
+gradient_accumulation_steps = 32
+max_iters = 120
+
+# finetune at constant LR
+learning_rate = 3e-5
+decay_lr = False
+
+Initializing from OpenAI GPT-2 weights: gpt2
+loading weights from pretrained gpt: gpt2
+forcing vocab_size=50257, block_size=1024, bias=True
+overriding dropout rate to 0.0
+number of parameters: 123.65M
+Downloading (…)lve/main/config.json: 100% 665/665 [00:00<00:00, 98.3kB/s]
+Downloading pytorch_model.bin: 100% 548M/548M [00:05<00:00, 92.8MB/s]
+Downloading (…)neration_config.json: 100% 124/124 [00:00<00:00, 19.2kB/s]
+using fused AdamW: True
+compiling the model... (takes a ~minute)
+[2023-03-21 14:22:50,795] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+step 0: train loss 3.4423, val loss 3.0369
+iter 0: loss 3.2863, time 77202.23ms, mfu -100.00%
+iter 1: loss 2.7469, time 22529.17ms, mfu -100.00%
+iter 2: loss 3.7087, time 23101.21ms, mfu -100.00%
+iter 3: loss 3.6040, time 23363.38ms, mfu -100.00%
+iter 4: loss 2.6769, time 23118.49ms, mfu -100.00%
+step 5: train loss 3.4339, val loss 2.9363
+saving checkpoint to out-shakespeare
+iter 5: loss 3.1141, time 30621.41ms, mfu 2.35%
+iter 6: loss 3.3365, time 23426.49ms, mfu 2.42%
+iter 7: loss 3.8965, time 23144.13ms, mfu 2.49%
+iter 8: loss 3.4058, time 23061.69ms, mfu 2.55%
+iter 9: loss 3.2569, time 23230.68ms, mfu 2.60%
+step 10: train loss 3.2385, val loss 2.9982
+iter 10: loss 3.1935, time 25160.57ms, mfu 2.63%
+iter 11: loss 3.9526, time 23125.77ms, mfu 2.68%
+iter 12: loss 2.4570, time 23136.22ms, mfu 2.72%
+iter 13: loss 3.5092, time 23120.81ms, mfu 2.76%
+iter 14: loss 3.4771, time 23226.29ms, mfu 2.79%
+step 15: train loss 2.9026, val loss 2.8705
+saving checkpoint to out-shakespeare
+iter 15: loss 3.4825, time 30931.56ms, mfu 2.75%
+iter 16: loss 3.3583, time 23307.64ms, mfu 2.78%
+iter 17: loss 2.2991, time 23143.53ms, mfu 2.81%
+iter 18: loss 3.2513, time 23131.39ms, mfu 2.84%
+iter 19: loss 2.9859, time 23160.12ms, mfu 2.87%
+step 20: train loss 2.9491, val loss 2.7808
+saving checkpoint to out-shakespeare
+iter 20: loss 3.0525, time 30909.27ms, mfu 2.81%
+iter 21: loss 2.9295, time 23294.73ms, mfu 2.84%
+iter 22: loss 2.2879, time 23094.34ms, mfu 2.87%
+iter 23: loss 1.8019, time 23103.56ms, mfu 2.89%
+iter 24: loss 3.4942, time 23172.01ms, mfu 2.91%
+step 25: train loss 2.8004, val loss 2.8107
+iter 25: loss 2.2264, time 25127.64ms, mfu 2.91%
+iter 26: loss 3.4194, time 23174.40ms, mfu 2.93%
+iter 27: loss 2.8144, time 23152.02ms, mfu 2.94%
+iter 28: loss 3.0488, time 23133.18ms, mfu 2.96%
+iter 29: loss 3.1027, time 23085.89ms, mfu 2.98%
+step 30: train loss 2.6644, val loss 2.6210
+saving checkpoint to out-shakespeare
+iter 30: loss 2.4424, time 31309.61ms, mfu 2.91%
+iter 31: loss 3.0193, time 23415.64ms, mfu 2.92%
+iter 32: loss 2.8735, time 23054.64ms, mfu 2.94%
+iter 33: loss 2.9842, time 23053.71ms, mfu 2.96%
+iter 34: loss 2.8148, time 23136.92ms, mfu 2.97%
+step 35: train loss 2.8676, val loss 2.5965
+saving checkpoint to out-shakespeare
+iter 35: loss 2.8556, time 31228.61ms, mfu 2.91%
+iter 36: loss 2.1186, time 23332.51ms, mfu 2.92%
+iter 37: loss 2.4768, time 23039.16ms, mfu 2.94%
+iter 38: loss 2.7992, time 23035.59ms, mfu 2.96%
+iter 39: loss 2.7109, time 23218.08ms, mfu 2.97%
+step 40: train loss 2.5840, val loss 2.6467
+iter 40: loss 3.0349, time 25092.98ms, mfu 2.96%
+iter 41: loss 2.8766, time 23084.39ms, mfu 2.98%
+iter 42: loss 2.5366, time 23099.15ms, mfu 2.99%
+iter 43: loss 2.7461, time 23183.70ms, mfu 3.00%
+iter 44: loss 1.4962, time 23190.74ms, mfu 3.01%
+step 45: train loss 2.6357, val loss 2.6529
+iter 45: loss 2.1228, time 25011.92ms, mfu 3.00%
+iter 46: loss 1.9382, time 23127.95ms, mfu 3.01%
+iter 47: loss 1.7129, time 23168.21ms, mfu 3.02%
+iter 48: loss 2.4555, time 23162.14ms, mfu 3.03%
+iter 49: loss 1.3368, time 23152.22ms, mfu 3.03%
+step 50: train loss 2.3167, val loss 2.6496
+iter 50: loss 2.3815, time 24969.84ms, mfu 3.02%
+iter 51: loss 1.5433, time 23013.56ms, mfu 3.03%
+iter 52: loss 2.5276, time 22951.87ms, mfu 3.04%
+iter 53: loss 2.0912, time 22989.47ms, mfu 3.05%
+iter 54: loss 1.6236, time 23016.77ms, mfu 3.06%
+step 55: train loss 2.2718, val loss 2.6701
+iter 55: loss 0.9116, time 24910.16ms, mfu 3.04%
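The tokens-per-iteration arithmetic in the config comments above can be checked directly. A minimal sketch (the variable names mirror the config values shown in the log; `block_size = 1024` is the forced GPT-2 context length):

```python
# Tokens processed per optimizer iteration, per the config comments:
# batch_size * gradient_accumulation_steps * block_size tokens.
batch_size = 1
gradient_accumulation_steps = 32
block_size = 1024  # tokens per example (GPT-2 context length)

tokens_per_iter = batch_size * gradient_accumulation_steps * block_size
print(tokens_per_iter)  # 32768, matching the "32,768 tokens/iter" comment

# The shakespeare dataset is stated to have 301,966 tokens,
# so one epoch is roughly 9.2 iterations.
dataset_tokens = 301_966
print(round(dataset_tokens / tokens_per_iter, 1))  # 9.2
```

This is why `max_iters = 120` above corresponds to roughly 13 epochs over the fine-tuning corpus.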
train_info/train_info_haiku.txt
ADDED
@@ -0,0 +1,42 @@
+# finetune at constant LR
+learning_rate = 3e-5
+decay_lr = False
+
+Initializing from OpenAI GPT-2 weights: gpt2
+loading weights from pretrained gpt: gpt2
+forcing vocab_size=50257, block_size=1024, bias=True
+overriding dropout rate to 0.0
+number of parameters: 123.65M
+using fused AdamW: True
+compiling the model... (takes a ~minute)
+[2023-03-21 15:03:01,696] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+step 0: train loss 7.3575, val loss 7.4530
+iter 0: loss 7.3959, time 55528.06ms, mfu -100.00%
+iter 1: loss 7.4243, time 22248.52ms, mfu -100.00%
+iter 2: loss 7.3179, time 22821.48ms, mfu -100.00%
+iter 3: loss 7.5001, time 23404.71ms, mfu -100.00%
+iter 4: loss 7.4802, time 23247.54ms, mfu -100.00%
+step 5: train loss 7.2418, val loss 7.4663
+iter 5: loss 7.3052, time 24918.41ms, mfu 2.88%
+iter 6: loss 6.9456, time 23189.74ms, mfu 2.90%
+iter 7: loss 6.6510, time 23306.99ms, mfu 2.92%
+iter 8: loss 6.3013, time 23235.93ms, mfu 2.94%
+iter 9: loss 6.0171, time 23170.33ms, mfu 2.96%
+step 10: train loss 5.9558, val loss 5.9625
+saving checkpoint to out-shakespeare
+iter 10: loss 5.9322, time 31040.11ms, mfu 2.89%
+iter 11: loss 5.8374, time 23361.17ms, mfu 2.91%
+iter 12: loss 5.6069, time 23241.27ms, mfu 2.93%
+iter 13: loss 5.6613, time 23180.06ms, mfu 2.95%
+iter 14: loss 5.2928, time 23169.15ms, mfu 2.96%
+step 15: train loss 5.4229, val loss 5.4202
+saving checkpoint to out-shakespeare
+iter 15: loss 5.3205, time 31057.72ms, mfu 2.90%
+iter 16: loss 5.4608, time 23320.27ms, mfu 2.91%
+iter 17: loss 5.2379, time 23176.04ms, mfu 2.93%
+iter 18: loss 5.1430, time 23211.53ms, mfu 2.95%
+iter 19: loss 5.5525, time 23232.59ms, mfu 2.96%
+step 20: train loss 5.1232, val loss 5.0514
+saving checkpoint to out-shakespeare
+iter 20: loss 5.1371, time 31097.85ms, mfu 2.90%
+iter 21: loss 4.9530, time 23374.38ms, mfu 2.92%
train_info/train_info_shakespeare.txt
ADDED
@@ -0,0 +1,185 @@
+Overriding config with config/finetune_shakespeare.py:
+import time
+
+out_dir = 'out-shakespeare'
+eval_interval = 5
+eval_iters = 40
+wandb_log = False # feel free to turn on
+wandb_project = 'shakespeare'
+wandb_run_name = 'ft-' + str(time.time())
+
+dataset = 'shakespeare'
+init_from = 'gpt2' # this is the largest GPT-2 model
+
+# only save checkpoints if the validation loss improves
+always_save_checkpoint = False
+
+# the number of examples per iter:
+# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+batch_size = 1
+gradient_accumulation_steps = 32
+max_iters = 1000
+
+# finetune at constant LR
+learning_rate = 3e-5
+decay_lr = False
+
+Initializing from OpenAI GPT-2 weights: gpt2
+loading weights from pretrained gpt: gpt2
+forcing vocab_size=50257, block_size=1024, bias=True
+overriding dropout rate to 0.0
+number of parameters: 123.65M
+using fused AdamW: True
+compiling the model... (takes a ~minute)
+[2023-03-20 21:31:13,957] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+step 0: train loss 4.1871, val loss 4.0326
+iter 0: loss 4.8126, time 53610.16ms, mfu -100.00%
+iter 1: loss 3.8469, time 22853.81ms, mfu -100.00%
+iter 2: loss 4.1342, time 23058.41ms, mfu -100.00%
+iter 3: loss 4.2060, time 23164.17ms, mfu -100.00%
+iter 4: loss 4.6711, time 23070.16ms, mfu -100.00%
+step 5: train loss 4.3096, val loss 3.9636
+saving checkpoint to out-shakespeare
+iter 5: loss 3.4577, time 30970.06ms, mfu 2.32%
+iter 6: loss 2.9587, time 23298.83ms, mfu 2.40%
+iter 7: loss 3.2116, time 23132.08ms, mfu 2.47%
+iter 8: loss 3.4900, time 23106.50ms, mfu 2.53%
+iter 9: loss 3.8003, time 23125.60ms, mfu 2.59%
+step 10: train loss 3.6215, val loss 3.4816
+saving checkpoint to out-shakespeare
+iter 10: loss 3.6364, time 30978.89ms, mfu 2.56%
+iter 11: loss 3.4725, time 23263.91ms, mfu 2.61%
+iter 12: loss 3.4080, time 23053.16ms, mfu 2.67%
+iter 13: loss 3.9510, time 23091.76ms, mfu 2.71%
+iter 14: loss 3.6421, time 23142.46ms, mfu 2.75%
+step 15: train loss 3.5292, val loss 3.2960
+saving checkpoint to out-shakespeare
+iter 15: loss 3.2916, time 31036.47ms, mfu 2.71%
+iter 16: loss 3.8844, time 23232.40ms, mfu 2.74%
+iter 17: loss 3.2954, time 23076.36ms, mfu 2.78%
+iter 18: loss 2.9807, time 23073.19ms, mfu 2.81%
+iter 19: loss 3.4524, time 23090.94ms, mfu 2.84%
+step 20: train loss 3.4621, val loss 3.3625
+iter 20: loss 3.3737, time 25115.53ms, mfu 2.85%
+iter 21: loss 3.6565, time 23165.72ms, mfu 2.87%
+iter 22: loss 3.3047, time 23174.77ms, mfu 2.89%
+iter 23: loss 3.8091, time 23135.82ms, mfu 2.92%
+iter 24: loss 3.1955, time 23097.90ms, mfu 2.94%
+step 25: train loss 3.5139, val loss 3.2854
+saving checkpoint to out-shakespeare
+iter 25: loss 3.8481, time 30838.74ms, mfu 2.87%
+iter 26: loss 3.2716, time 23304.59ms, mfu 2.90%
+iter 27: loss 3.3729, time 23056.31ms, mfu 2.92%
+iter 28: loss 3.3545, time 23107.46ms, mfu 2.94%
+iter 29: loss 2.7101, time 23209.45ms, mfu 2.95%
+step 30: train loss 3.3706, val loss 3.2958
+iter 30: loss 3.0968, time 25123.31ms, mfu 2.94%
+iter 31: loss 2.9495, time 23116.72ms, mfu 2.96%
+iter 32: loss 3.0179, time 23101.19ms, mfu 2.97%
+iter 33: loss 2.9648, time 23117.17ms, mfu 2.99%
+iter 34: loss 3.6522, time 23132.76ms, mfu 3.00%
+step 35: train loss 3.3923, val loss 3.2125
+saving checkpoint to out-shakespeare
+iter 35: loss 3.2469, time 31079.08ms, mfu 2.93%
+iter 36: loss 3.1450, time 23273.02ms, mfu 2.95%
+iter 37: loss 3.4624, time 23046.04ms, mfu 2.96%
+iter 38: loss 3.4371, time 23102.73ms, mfu 2.98%
+iter 39: loss 3.3130, time 23178.65ms, mfu 2.99%
+step 40: train loss 3.3233, val loss 3.2543
+iter 40: loss 3.0743, time 25069.68ms, mfu 2.98%
+iter 41: loss 3.1269, time 23084.39ms, mfu 2.99%
+iter 42: loss 3.6785, time 23076.30ms, mfu 3.00%
+iter 43: loss 3.3787, time 23075.87ms, mfu 3.01%
+iter 44: loss 3.2637, time 23098.68ms, mfu 3.02%
+step 45: train loss 3.1971, val loss 3.2642
+iter 45: loss 3.1861, time 25003.67ms, mfu 3.01%
+iter 46: loss 3.4037, time 23106.62ms, mfu 3.02%
+iter 47: loss 3.4947, time 23109.37ms, mfu 3.03%
+iter 48: loss 3.3276, time 23098.50ms, mfu 3.04%
+iter 49: loss 2.9062, time 23171.38ms, mfu 3.04%
+step 50: train loss 3.2188, val loss 3.2460
+iter 50: loss 3.5280, time 25111.46ms, mfu 3.02%
+iter 51: loss 3.5470, time 23143.40ms, mfu 3.03%
+iter 52: loss 3.1881, time 23109.22ms, mfu 3.04%
+iter 53: loss 3.4332, time 23083.68ms, mfu 3.05%
+iter 54: loss 3.1956, time 23117.10ms, mfu 3.05%
+step 55: train loss 3.2902, val loss 3.1846
+saving checkpoint to out-shakespeare
+iter 55: loss 3.4816, time 31132.51ms, mfu 2.98%
+iter 56: loss 3.2971, time 23207.94ms, mfu 2.99%
+iter 57: loss 2.9543, time 23064.74ms, mfu 3.00%
+iter 58: loss 2.8729, time 23093.16ms, mfu 3.01%
+iter 59: loss 3.0883, time 23129.34ms, mfu 3.02%
+step 60: train loss 3.1288, val loss 3.1545
+saving checkpoint to out-shakespeare
+iter 60: loss 3.7098, time 31022.27ms, mfu 2.95%
+iter 61: loss 3.4157, time 23229.02ms, mfu 2.97%
+iter 62: loss 3.0020, time 23059.02ms, mfu 2.98%
+iter 63: loss 3.0751, time 23063.51ms, mfu 2.99%
+iter 64: loss 2.9081, time 23134.60ms, mfu 3.01%
+step 65: train loss 3.2254, val loss 3.1772
+iter 65: loss 3.3802, time 25114.58ms, mfu 2.99%
+iter 66: loss 3.1073, time 23118.96ms, mfu 3.00%
+iter 67: loss 3.1010, time 23081.32ms, mfu 3.01%
+iter 68: loss 3.2594, time 23058.54ms, mfu 3.02%
+iter 69: loss 3.4402, time 23062.45ms, mfu 3.03%
+step 70: train loss 3.1511, val loss 3.2315
+iter 70: loss 3.4094, time 24967.39ms, mfu 3.02%
+iter 71: loss 3.0997, time 23070.28ms, mfu 3.03%
+iter 72: loss 2.1573, time 23072.48ms, mfu 3.04%
+iter 73: loss 3.3926, time 23060.80ms, mfu 3.04%
+iter 74: loss 3.2284, time 23080.48ms, mfu 3.05%
+step 75: train loss 3.1102, val loss 3.1017
+saving checkpoint to out-shakespeare
+iter 75: loss 3.3760, time 31003.52ms, mfu 2.98%
+iter 76: loss 3.3387, time 23207.33ms, mfu 2.99%
+iter 77: loss 2.9299, time 23040.87ms, mfu 3.00%
+iter 78: loss 2.9623, time 23069.43ms, mfu 3.01%
+iter 79: loss 3.0674, time 23111.04ms, mfu 3.02%
+step 80: train loss 3.0574, val loss 3.2178
+iter 80: loss 2.6808, time 25072.69ms, mfu 3.01%
+iter 81: loss 2.7986, time 23144.88ms, mfu 3.02%
+iter 82: loss 2.9121, time 23094.25ms, mfu 3.03%
+iter 83: loss 2.7153, time 23114.27ms, mfu 3.03%
+iter 84: loss 2.8444, time 23089.41ms, mfu 3.04%
+step 85: train loss 2.9855, val loss 3.2298
+iter 85: loss 3.0517, time 25033.77ms, mfu 3.03%
+iter 86: loss 2.5920, time 23088.89ms, mfu 3.03%
+iter 87: loss 3.1241, time 23084.88ms, mfu 3.04%
+iter 88: loss 2.5355, time 23070.40ms, mfu 3.05%
+iter 89: loss 3.4543, time 23060.05ms, mfu 3.06%
+step 90: train loss 3.0426, val loss 3.2664
+iter 90: loss 3.3099, time 24997.54ms, mfu 3.04%
+iter 91: loss 2.8099, time 23108.94ms, mfu 3.04%
+iter 92: loss 3.2419, time 23103.54ms, mfu 3.05%
+iter 93: loss 3.4718, time 23089.71ms, mfu 3.06%
+iter 94: loss 3.0708, time 23137.11ms, mfu 3.06%
+step 95: train loss 3.0225, val loss 3.2529
+iter 95: loss 2.8545, time 25072.26ms, mfu 3.04%
+iter 96: loss 3.3059, time 23120.57ms, mfu 3.05%
+iter 97: loss 2.7528, time 23111.60ms, mfu 3.06%
+iter 98: loss 3.1788, time 23106.26ms, mfu 3.06%
+iter 99: loss 2.9023, time 23103.06ms, mfu 3.07%
+step 100: train loss 2.9153, val loss 3.2140
+iter 100: loss 3.0090, time 24968.37ms, mfu 3.05%
+iter 101: loss 3.0753, time 23093.87ms, mfu 3.05%
+iter 102: loss 3.1295, time 23108.81ms, mfu 3.06%
+iter 103: loss 2.9033, time 23136.51ms, mfu 3.06%
+iter 104: loss 3.1117, time 23127.17ms, mfu 3.07%
+step 105: train loss 2.9402, val loss 3.2071
+iter 105: loss 2.8862, time 25050.88ms, mfu 3.05%
+iter 106: loss 2.6040, time 23141.23ms, mfu 3.05%
+iter 107: loss 3.1831, time 23146.47ms, mfu 3.06%
+iter 108: loss 3.1619, time 23078.47ms, mfu 3.06%
+iter 109: loss 3.0995, time 23098.26ms, mfu 3.07%
+step 110: train loss 2.7568, val loss 3.2857
+iter 110: loss 3.0392, time 24959.72ms, mfu 3.05%
+iter 111: loss 3.1982, time 23121.36ms, mfu 3.06%
+iter 112: loss 3.1794, time 23124.92ms, mfu 3.06%
+iter 113: loss 2.8230, time 23138.96ms, mfu 3.07%
+iter 114: loss 2.2634, time 23121.12ms, mfu 3.07%
+step 115: train loss 2.8576, val loss 3.2603
+iter 115: loss 3.0414, time 24960.16ms, mfu 3.05%
+iter 116: loss 2.2827, time 23077.89ms, mfu 3.06%
+iter 117: loss 2.5435, time 23054.11ms, mfu 3.06%
train_info/train_info_trump.txt
ADDED
@@ -0,0 +1,207 @@
+Overriding config with config/finetune_shakespeare.py:
+import time
+
+out_dir = 'out-shakespeare'
+eval_interval = 5
+eval_iters = 40
+wandb_log = False # feel free to turn on
+wandb_project = 'shakespeare'
+wandb_run_name = 'ft-' + str(time.time())
+
+dataset = 'shakespeare'
+init_from = 'gpt2' # this is the largest GPT-2 model
+
+# only save checkpoints if the validation loss improves
+always_save_checkpoint = False
+
+# the number of examples per iter:
+# 1 batch_size * 32 grad_accum * 1024 tokens = 32,768 tokens/iter
+# shakespeare has 301,966 tokens, so 1 epoch ~= 9.2 iters
+batch_size = 1
+gradient_accumulation_steps = 32
+max_iters = 300
+
+# finetune at constant LR
+learning_rate = 3e-5
+decay_lr = False
+
+Initializing from OpenAI GPT-2 weights: gpt2
+loading weights from pretrained gpt: gpt2
+forcing vocab_size=50257, block_size=1024, bias=True
+overriding dropout rate to 0.0
+number of parameters: 123.65M
+Downloading (…)lve/main/config.json: 100% 665/665 [00:00<00:00, 88.4kB/s]
+Downloading pytorch_model.bin: 100% 548M/548M [00:01<00:00, 289MB/s]
+Downloading (…)neration_config.json: 100% 124/124 [00:00<00:00, 22.5kB/s]
+using fused AdamW: True
+compiling the model... (takes a ~minute)
+[2023-03-21 06:17:18,366] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
+step 0: train loss 3.3086, val loss 3.2349
+iter 0: loss 3.4443, time 75907.68ms, mfu -100.00%
+iter 1: loss 3.6624, time 23156.16ms, mfu -100.00%
+iter 2: loss 4.4039, time 23248.46ms, mfu -100.00%
+iter 3: loss 3.2693, time 22877.27ms, mfu -100.00%
+iter 4: loss 3.4597, time 22906.52ms, mfu -100.00%
+step 5: train loss 3.2166, val loss 3.2212
+saving checkpoint to out-shakespeare
+iter 5: loss 3.2885, time 30843.38ms, mfu 2.33%
+iter 6: loss 3.2423, time 23117.67ms, mfu 2.41%
+iter 7: loss 3.2239, time 23014.83ms, mfu 2.48%
+iter 8: loss 3.3878, time 23083.71ms, mfu 2.54%
+iter 9: loss 3.0245, time 23127.68ms, mfu 2.60%
+step 10: train loss 3.1367, val loss 3.0886
+saving checkpoint to out-shakespeare
+iter 10: loss 3.2588, time 31026.66ms, mfu 2.57%
+iter 11: loss 2.7963, time 23215.41ms, mfu 2.62%
+iter 12: loss 3.0799, time 23045.69ms, mfu 2.67%
+iter 13: loss 3.0391, time 23081.70ms, mfu 2.72%
+iter 14: loss 2.9285, time 23144.99ms, mfu 2.76%
+step 15: train loss 3.0614, val loss 3.0357
+saving checkpoint to out-shakespeare
+iter 15: loss 2.9088, time 31131.17ms, mfu 2.71%
+iter 16: loss 2.8854, time 23203.33ms, mfu 2.75%
+iter 17: loss 2.8941, time 23045.51ms, mfu 2.79%
+iter 18: loss 3.1116, time 23058.43ms, mfu 2.82%
+iter 19: loss 3.1542, time 23076.86ms, mfu 2.85%
+step 20: train loss 2.9382, val loss 2.9662
+saving checkpoint to out-shakespeare
+iter 20: loss 2.8674, time 30800.95ms, mfu 2.80%
+iter 21: loss 3.0158, time 23210.44ms, mfu 2.83%
+iter 22: loss 3.0376, time 23028.93ms, mfu 2.86%
+iter 23: loss 2.5614, time 23053.57ms, mfu 2.88%
+iter 24: loss 3.0086, time 23135.53ms, mfu 2.90%
+step 25: train loss 2.9386, val loss 2.9689
+iter 25: loss 2.8633, time 25037.75ms, mfu 2.90%
+iter 26: loss 3.2887, time 23087.04ms, mfu 2.92%
+iter 27: loss 2.7507, time 23061.28ms, mfu 2.94%
+iter 28: loss 3.0676, time 23047.93ms, mfu 2.96%
+iter 29: loss 2.7316, time 23042.36ms, mfu 2.98%
+step 30: train loss 2.9721, val loss 2.9042
+saving checkpoint to out-shakespeare
+iter 30: loss 2.7163, time 30867.03ms, mfu 2.91%
+iter 31: loss 2.9423, time 23225.75ms, mfu 2.93%
+iter 32: loss 2.9405, time 23012.47ms, mfu 2.95%
+iter 33: loss 2.9208, time 23059.76ms, mfu 2.96%
+iter 34: loss 2.9996, time 23121.13ms, mfu 2.98%
+step 35: train loss 2.9496, val loss 2.8374
+saving checkpoint to out-shakespeare
+iter 35: loss 2.8072, time 31122.96ms, mfu 2.91%
+iter 36: loss 2.9798, time 23209.16ms, mfu 2.93%
+iter 37: loss 2.8476, time 23019.32ms, mfu 2.95%
+iter 38: loss 2.7276, time 23056.09ms, mfu 2.97%
+iter 39: loss 2.8636, time 23101.19ms, mfu 2.98%
+step 40: train loss 2.8282, val loss 2.9073
+iter 40: loss 2.7667, time 25022.64ms, mfu 2.97%
+iter 41: loss 2.6111, time 23100.99ms, mfu 2.98%
+iter 42: loss 3.1776, time 23107.88ms, mfu 3.00%
+iter 43: loss 2.7963, time 23090.82ms, mfu 3.01%
+iter 44: loss 3.2658, time 23084.78ms, mfu 3.02%
+step 45: train loss 2.8171, val loss 2.8487
+iter 45: loss 3.0523, time 24981.39ms, mfu 3.00%
+iter 46: loss 2.6204, time 23087.28ms, mfu 3.01%
+iter 47: loss 2.8938, time 23081.95ms, mfu 3.02%
+iter 48: loss 3.1726, time 23092.57ms, mfu 3.03%
+iter 49: loss 3.7836, time 23077.55ms, mfu 3.04%
+step 50: train loss 2.8675, val loss 2.7787
+saving checkpoint to out-shakespeare
+iter 50: loss 3.0882, time 30881.37ms, mfu 2.97%
+iter 51: loss 2.8358, time 23200.14ms, mfu 2.98%
+iter 52: loss 2.9847, time 23008.69ms, mfu 3.00%
+iter 53: loss 3.1992, time 23066.07ms, mfu 3.01%
+iter 54: loss 2.4085, time 23118.93ms, mfu 3.02%
+step 55: train loss 2.8049, val loss 2.7507
+saving checkpoint to out-shakespeare
+iter 55: loss 2.9964, time 31115.78ms, mfu 2.95%
+iter 56: loss 2.9647, time 23212.73ms, mfu 2.96%
+iter 57: loss 2.8880, time 23003.95ms, mfu 2.98%
+iter 58: loss 2.8726, time 23053.90ms, mfu 2.99%
+iter 59: loss 2.6470, time 23124.33ms, mfu 3.00%
+step 60: train loss 2.8041, val loss 2.8827
+iter 60: loss 2.8115, time 24978.80ms, mfu 2.99%
+iter 61: loss 2.6765, time 23058.07ms, mfu 3.00%
+iter 62: loss 2.6801, time 23052.27ms, mfu 3.01%
+iter 63: loss 3.4295, time 23048.58ms, mfu 3.03%
+iter 64: loss 2.5933, time 23062.70ms, mfu 3.03%
+step 65: train loss 2.7894, val loss 2.7606
+iter 65: loss 2.5231, time 24991.85ms, mfu 3.02%
+iter 66: loss 2.8913, time 23099.31ms, mfu 3.03%
+iter 67: loss 2.9515, time 23106.81ms, mfu 3.04%
+iter 68: loss 2.8017, time 23098.12ms, mfu 3.04%
+iter 69: loss 2.7759, time 23110.16ms, mfu 3.05%
+step 70: train loss 2.8044, val loss 2.8498
+iter 70: loss 2.9694, time 25009.31ms, mfu 3.03%
+iter 71: loss 3.3238, time 23090.32ms, mfu 3.04%
+iter 72: loss 2.6931, time 23086.35ms, mfu 3.05%
+iter 73: loss 2.6097, time 23085.74ms, mfu 3.05%
+iter 74: loss 2.1781, time 23096.25ms, mfu 3.06%
+step 75: train loss 2.7755, val loss 2.6869
+saving checkpoint to out-shakespeare
+iter 75: loss 2.9208, time 30879.90ms, mfu 2.99%
+iter 76: loss 2.7619, time 23186.69ms, mfu 3.00%
+iter 77: loss 2.8394, time 23017.46ms, mfu 3.01%
+iter 78: loss 2.5907, time 23049.26ms, mfu 3.02%
+iter 79: loss 2.5660, time 23102.38ms, mfu 3.03%
+step 80: train loss 2.7759, val loss 2.7603
+iter 80: loss 2.6889, time 25011.13ms, mfu 3.01%
+iter 81: loss 2.6940, time 23088.64ms, mfu 3.02%
+iter 82: loss 2.6596, time 23050.35ms, mfu 3.03%
+iter 83: loss 2.7638, time 23066.22ms, mfu 3.04%
+iter 84: loss 2.6515, time 23059.01ms, mfu 3.05%
+step 85: train loss 2.7404, val loss 2.7290
+iter 85: loss 3.1829, time 24970.26ms, mfu 3.03%
+iter 86: loss 2.5451, time 23052.03ms, mfu 3.04%
+iter 87: loss 2.4363, time 23051.53ms, mfu 3.05%
+iter 88: loss 2.8023, time 23039.12ms, mfu 3.05%
+iter 89: loss 2.4755, time 23044.45ms, mfu 3.06%
+step 90: train loss 2.7140, val loss 2.7692
+iter 90: loss 2.7225, time 24960.52ms, mfu 3.04%
+iter 91: loss 2.4655, time 23037.54ms, mfu 3.05%
+iter 92: loss 2.5291, time 23029.37ms, mfu 3.06%
+iter 93: loss 2.7720, time 23032.99ms, mfu 3.06%
+iter 94: loss 2.7614, time 23039.50ms, mfu 3.07%
+step 95: train loss 2.7932, val loss 2.7953
+iter 95: loss 2.6881, time 24974.66ms, mfu 3.05%
+iter 96: loss 2.9315, time 23044.89ms, mfu 3.06%
+iter 97: loss 2.7099, time 23035.52ms, mfu 3.06%
+iter 98: loss 2.6858, time 23036.10ms, mfu 3.07%
+iter 99: loss 2.5341, time 23048.24ms, mfu 3.07%
+step 100: train loss 2.6788, val loss 2.8138
+iter 100: loss 2.7993, time 25008.37ms, mfu 3.05%
+iter 101: loss 2.5996, time 23052.62ms, mfu 3.06%
+iter 102: loss 2.7768, time 23059.09ms, mfu 3.07%
+iter 103: loss 2.6378, time 23046.82ms, mfu 3.07%
+iter 104: loss 2.7511, time 23043.40ms, mfu 3.08%
+step 105: train loss 2.7542, val loss 2.6568
+saving checkpoint to out-shakespeare
+iter 105: loss 2.6596, time 31000.96ms, mfu 3.00%
+iter 106: loss 2.8566, time 23195.71ms, mfu 3.01%
+iter 107: loss 2.6284, time 22995.46ms, mfu 3.02%
+iter 108: loss 2.6670, time 23031.45ms, mfu 3.03%
+iter 109: loss 2.4732, time 23093.11ms, mfu 3.04%
+step 110: train loss 2.7094, val loss 2.6684
+iter 110: loss 2.5577, time 25028.10ms, mfu 3.02%
+iter 111: loss 2.9250, time 23089.98ms, mfu 3.03%
+iter 112: loss 2.6274, time 23072.14ms, mfu 3.04%
+iter 113: loss 2.5337, time 23078.52ms, mfu 3.05%
+iter 114: loss 2.7248, time 23061.41ms, mfu 3.05%
+step 115: train loss 2.7062, val loss 2.7398
+iter 115: loss 2.7654, time 24968.79ms, mfu 3.04%
+iter 116: loss 2.6394, time 23049.91ms, mfu 3.04%
+iter 117: loss 2.5259, time 23068.72ms, mfu 3.05%
+iter 118: loss 2.8312, time 23061.73ms, mfu 3.06%
+iter 119: loss 2.6137, time 23049.41ms, mfu 3.06%
+step 120: train loss 2.6704, val loss 2.7120
+iter 120: loss 2.6794, time 24958.89ms, mfu 3.05%
+iter 121: loss 2.7400, time 23040.45ms, mfu 3.05%
+iter 122: loss 2.6322, time 23047.61ms, mfu 3.06%
+iter 123: loss 2.4416, time 23062.33ms, mfu 3.06%
+iter 124: loss 2.6756, time 23048.99ms, mfu 3.07%
+step 125: train loss 2.5866, val loss 2.6882
+iter 125: loss 2.6490, time 24950.30ms, mfu 3.05%
+iter 126: loss 2.5888, time 23027.86ms, mfu 3.06%
+iter 127: loss 2.3960, time 23012.31ms, mfu 3.06%
+iter 128: loss 2.6581, time 23025.51ms, mfu 3.07%
+iter 129: loss 2.6202, time 23042.65ms, mfu 3.07%
+step 130: train loss 2.6151, val loss 2.6532
+saving checkpoint to out-shakespeare
+iter 130: loss 2.8148, time 31009.76ms, mfu 3.00%
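The `always_save_checkpoint = False` setting in the configs explains why a "saving checkpoint" line follows only some eval steps in the logs: a checkpoint is written only when the validation loss improves on the best seen so far. A minimal sketch of that gating rule (the helper `should_save` is hypothetical, not the actual nanoGPT code, which additionally skips saving at iteration 0):

```python
# Hypothetical sketch of the checkpoint gating implied by
# always_save_checkpoint = False: save only on a new best val loss.
def should_save(val_loss, best_val_loss, always_save_checkpoint=False):
    """Return (save_now, updated_best_val_loss)."""
    if val_loss < best_val_loss or always_save_checkpoint:
        return True, min(val_loss, best_val_loss)
    return False, best_val_loss

# Val losses from the first four eval steps of the edsheeran log:
best = float("inf")
for step, val in [(0, 3.0369), (5, 2.9363), (10, 2.9982), (15, 2.8705)]:
    save, best = should_save(val, best)
    print(step, save)  # step 10 prints False: 2.9982 did not beat 2.9363
```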
ckpt_edsheeran.pt → weights/ckpt_edsheeran.pt
RENAMED
File without changes

ckpt_haiku.pt → weights/ckpt_haiku.pt
RENAMED
File without changes

ckpt_math.pt → weights/ckpt_math.pt
RENAMED
File without changes

ckpt_shakespear.pt → weights/ckpt_shakespear.pt
RENAMED
File without changes

ckpt_trump.pt → weights/ckpt_trump.pt
RENAMED
File without changes

ckpt_world_facts_cia.pt → weights/ckpt_world_facts_cia.pt
RENAMED
File without changes