# finetune at constant LR
learning_rate = 3e-5
decay_lr = False

Initializing from OpenAI GPT-2 weights: gpt2
loading weights from pretrained gpt: gpt2
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 123.65M
using fused AdamW: True
compiling the model... (takes a ~minute)
[2023-03-21 15:03:01,696] torch._inductor.utils: [WARNING] make_fallback(aten.addmv): a decomposition exists, we should switch to it
step 0: train loss 7.3575, val loss 7.4530
iter 0: loss 7.3959, time 55528.06ms, mfu -100.00%
iter 1: loss 7.4243, time 22248.52ms, mfu -100.00%
iter 2: loss 7.3179, time 22821.48ms, mfu -100.00%
iter 3: loss 7.5001, time 23404.71ms, mfu -100.00%
iter 4: loss 7.4802, time 23247.54ms, mfu -100.00%
step 5: train loss 7.2418, val loss 7.4663
iter 5: loss 7.3052, time 24918.41ms, mfu 2.88%
iter 6: loss 6.9456, time 23189.74ms, mfu 2.90%
iter 7: loss 6.6510, time 23306.99ms, mfu 2.92%
iter 8: loss 6.3013, time 23235.93ms, mfu 2.94%
iter 9: loss 6.0171, time 23170.33ms, mfu 2.96%
step 10: train loss 5.9558, val loss 5.9625
saving checkpoint to out-shakespeare
iter 10: loss 5.9322, time 31040.11ms, mfu 2.89%
iter 11: loss 5.8374, time 23361.17ms, mfu 2.91%
iter 12: loss 5.6069, time 23241.27ms, mfu 2.93%
iter 13: loss 5.6613, time 23180.06ms, mfu 2.95%
iter 14: loss 5.2928, time 23169.15ms, mfu 2.96%
step 15: train loss 5.4229, val loss 5.4202
saving checkpoint to out-shakespeare
iter 15: loss 5.3205, time 31057.72ms, mfu 2.90%
iter 16: loss 5.4608, time 23320.27ms, mfu 2.91%
iter 17: loss 5.2379, time 23176.04ms, mfu 2.93%
iter 18: loss 5.1430, time 23211.53ms, mfu 2.95%
iter 19: loss 5.5525, time 23232.59ms, mfu 2.96%
step 20: train loss 5.1232, val loss 5.0514
saving checkpoint to out-shakespeare
iter 20: loss 5.1371, time 31097.85ms, mfu 2.90%
iter 21: loss 4.9530, time 23374.38ms, mfu 2.92%
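
The `decay_lr = False` setting above effectively pins the optimizer's learning rate at `learning_rate` for the entire finetuning run instead of following a warmup-plus-decay schedule. A minimal sketch of how such a toggle is typically wired into a training loop is shown below; the `get_lr` helper and the warmup/decay parameters here are illustrative assumptions, not the repository's exact code.

```python
import math

# Values mirroring the settings above; the remaining names are assumptions.
learning_rate = 3e-5    # constant LR used for finetuning
decay_lr = False        # disable the schedule entirely
warmup_iters = 100      # hypothetical warmup length
lr_decay_iters = 20000  # hypothetical decay horizon
min_lr = 3e-6           # hypothetical floor for the decayed LR

def get_lr(it: int) -> float:
    """Return the learning rate for iteration `it`.

    With decay_lr=False this is simply the constant learning_rate;
    otherwise it follows linear warmup then cosine decay down to min_lr.
    """
    if not decay_lr:
        return learning_rate
    if it < warmup_iters:
        return learning_rate * (it + 1) / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return min_lr + coeff * (learning_rate - min_lr)

# Inside the training loop, the chosen rate would be applied per iteration:
# for param_group in optimizer.param_groups:
#     param_group["lr"] = get_lr(iter_num)
```

Keeping the rate constant (and small) is a common choice for finetuning a pretrained checkpoint, since the goal is to nudge the existing weights rather than re-run a full schedule from scratch.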