nanogpt_finetuned_models / train_info / train_info_haiku.txt
# finetune at constant LR
learning_rate = 3e-5
decay_lr = False
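
For reference, the three override lines above are the only hyperparameters recorded in this log. A fuller constant-LR finetuning config in nanoGPT's style might look like the sketch below; apart from learning_rate, decay_lr, the dropout override, the gpt2 init, the out-shakespeare output directory, and the every-5-iteration eval cadence (all visible further down in the log), the values are illustrative assumptions.

# Hypothetical nanoGPT-style config for constant-LR finetuning.
# Values marked "assumed" are not recorded anywhere in this log.
init_from = 'gpt2'               # start from OpenAI GPT-2 weights (matches the log)
out_dir = 'out-shakespeare'      # checkpoint directory (matches the log)
eval_interval = 5                # matches the step 0/5/10/... eval cadence below
eval_iters = 40                  # assumed: batches averaged per loss estimate
dataset = 'shakespeare'          # assumed dataset name
batch_size = 1                   # assumed
gradient_accumulation_steps = 32 # assumed
dropout = 0.0                    # the log shows dropout overridden to 0.0
learning_rate = 3e-5             # finetune at constant LR
decay_lr = False                 # no learning-rate decay schedule
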
Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
using fused AdamW: True
compiling the model... (takes a ~minute)
[2023-03-21 15:03:01,696] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
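
The setup lines above cover loading the pretrained GPT-2 (124M) weights, overriding dropout, and compiling the model with PyTorch 2.x; the inductor warning is emitted during compilation. A minimal sketch of that step, assuming nanoGPT's GPT class and its import path:

import torch
from model import GPT  # nanoGPT's model definition (assumed import path)

# Load GPT-2 weights and override dropout to 0.0, as reported in the log.
model = GPT.from_pretrained('gpt2', dict(dropout=0.0))
model.to('cuda')

# torch.compile makes the first iteration slow while kernels are generated,
# consistent with iter 0 taking ~55 s versus ~23 s for later iterations.
model = torch.compile(model)
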
step 0: train loss 7.3575, val loss 7.4530
iter 0: loss 7.3959, time 55528.06ms, mfu -100.00%
iter 1: loss 7.4243, time 22248.52ms, mfu -100.00%
iter 2: loss 7.3179, time 22821.48ms, mfu -100.00%
iter 3: loss 7.5001, time 23404.71ms, mfu -100.00%
iter 4: loss 7.4802, time 23247.54ms, mfu -100.00%
step 5: train loss 7.2418, val loss 7.4663
iter 5: loss 7.3052, time 24918.41ms, mfu 2.88%
iter 6: loss 6.9456, time 23189.74ms, mfu 2.90%
iter 7: loss 6.6510, time 23306.99ms, mfu 2.92%
iter 8: loss 6.3013, time 23235.93ms, mfu 2.94%
iter 9: loss 6.0171, time 23170.33ms, mfu 2.96%
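
In the iteration lines, mfu is model FLOPs utilization: the training FLOP/s actually sustained, as a fraction of the GPU's advertised peak. It reads -100.00% for the first few iterations, evidently as a placeholder while the running estimate warms up, so the first real reading appears at iter 5. A rough sketch of this kind of estimate for a GPT-2-sized model follows; the 6*N + 12*L*H*Q*T per-token accounting is the usual transformer approximation, and the 312 TFLOPS default (A100 bfloat16 peak) plus whatever batch and accumulation settings are fed in are assumptions, not values recorded in this log.

def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 fwdbwd_per_iter, dt, peak_flops=312e12):
    # Approximate FLOPs per token: 6*N for the parameter matmuls plus
    # 12*L*H*Q*T for attention, then scale by sequence length and the
    # number of forward+backward passes per optimizer iteration.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * fwdbwd_per_iter
    flops_achieved = flops_per_iter / dt   # FLOP/s actually sustained (dt in seconds)
    return flops_achieved / peak_flops     # fraction of peak, i.e. MFU

For GPT-2 small the model dimensions would be n_layer=12, n_head=12, head_dim=64, block_size=1024 and n_params about 124e6; dt would come from the per-iteration times in the log (roughly 23 s here).
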
step 10: train loss 5.9558, val loss 5.9625
saving checkpoint to out-shakespeare
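
Checkpoints are written to out-shakespeare at steps 10, 15, and 20 but not at step 5, where the validation loss rose, which is consistent with saving only when validation loss improves. A minimal sketch of such an eval-time save; the checkpoint contents shown are assumptions, not anything recorded in this log.

import os
import torch

def maybe_save_checkpoint(model, optimizer, val_loss, best_val_loss, iter_num, out_dir):
    # Assumed policy: only write a checkpoint when validation loss improves.
    if val_loss >= best_val_loss:
        return best_val_loss
    checkpoint = {
        'model': model.state_dict(),         # assumed checkpoint contents
        'optimizer': optimizer.state_dict(),
        'iter_num': iter_num,
        'best_val_loss': val_loss,
    }
    print(f"saving checkpoint to {out_dir}")
    torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    return val_loss
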
iter 10: loss 5.9322, time 31040.11ms, mfu 2.89%
iter 11: loss 5.8374, time 23361.17ms, mfu 2.91%
iter 12: loss 5.6069, time 23241.27ms, mfu 2.93%
iter 13: loss 5.6613, time 23180.06ms, mfu 2.95%
iter 14: loss 5.2928, time 23169.15ms, mfu 2.96%
step 15: train loss 5.4229, val loss 5.4202
saving checkpoint to out-shakespeare
iter 15: loss 5.3205, time 31057.72ms, mfu 2.90%
iter 16: loss 5.4608, time 23320.27ms, mfu 2.91%
iter 17: loss 5.2379, time 23176.04ms, mfu 2.93%
iter 18: loss 5.1430, time 23211.53ms, mfu 2.95%
iter 19: loss 5.5525, time 23232.59ms, mfu 2.96%
step 20: train loss 5.1232, val loss 5.0514
saving checkpoint to out-shakespeare
iter 20: loss 5.1371, time 31097.85ms, mfu 2.90%
iter 21: loss 4.9530, time 23374.38ms, mfu 2.92%