Running pytorch 2.4.1+cu121
running with DDP: False, device: cuda, world size: 1
total desired batch size: 524288
=> calculated gradient accumulation steps: 64
/scratch/user/mtseng/llm.c/myenv/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
DataLoader: total number of tokens: 10,255,324,043 across 103 files
DataLoader: total number of tokens: 100,000,000 across 1 files
num decayed parameter tensors: 85, with 116,454,144 parameters
num non-decayed parameter tensors: 73, with 46,848 parameters
W1011 12:42:56.582000 46971979007168 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] xindex is not in var_ranges, defaulting to unknown range.
val loss: 10.971158
saving model checkpoint to ./results/gpt2-124M-gqa/step_0.pth
W1011 12:44:01.863000 46971979007168 torch/fx/experimental/symbolic_shapes.py:4449] [0/1] xindex is not in var_ranges, defaulting to unknown range.
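The header values and the per-step throughput figures are linked by simple arithmetic. The sketch below is a sanity check, not the training script's code. It assumes the accumulation count is the total batch size in tokens divided by (micro-batch tokens × world size), and that tok/s is the total batch size divided by the wall-clock step time. Neither formula is printed in the log, and the variable and function names are illustrative.

```python
# Values copied from the log header.
total_batch_tokens = 524288   # "total desired batch size"
grad_accum_steps = 64         # "calculated gradient accumulation steps"
world_size = 1                # "world size: 1"

# Assumption: grad_accum_steps = total_batch_tokens // (micro_batch_tokens * world_size),
# so the tokens processed per micro-step can be recovered by inverting it.
micro_batch_tokens = total_batch_tokens // (grad_accum_steps * world_size)


def tokens_per_second(step_ms: float) -> float:
    """Throughput implied by one optimizer step that consumes the full batch."""
    return total_batch_tokens / (step_ms / 1000.0)


print(micro_batch_tokens)  # 8192
# Step times taken from the log lines for step 1 and step 2.
print(tokens_per_second(66829.84))
print(tokens_per_second(3752.54))
```

Under those assumptions each micro-step processes 8192 tokens, and the computed rates agree with the logged tok/s values for steps 1 and 2 to within rounding. The much slower first step coincides with the symbolic-shapes warnings, which suggests one-time compilation overhead, although the log does not state this.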
step 1/76294 | train loss 10.973977 | norm 2.6923 | lr 1.20e-04 | (66829.84 ms | 7845 tok/s) step 2/76294 | train loss 10.572808 | norm 2.8579 | lr 1.22e-04 | (3752.54 ms | 139715 tok/s) step 3/76294 | train loss 10.050116 | norm 2.7256 | lr 1.23e-04 | (3713.48 ms | 141185 tok/s) step 4/76294 | train loss 9.703547 | norm 2.5647 | lr 1.25e-04 | (3688.06 ms | 142158 tok/s) step 5/76294 | train loss 9.492823 | norm 2.3707 | lr 1.26e-04 | (3728.77 ms | 140606 tok/s) step 6/76294 | train loss 9.328242 | norm 2.0044 | lr 1.28e-04 | (3697.46 ms | 141797 tok/s) step 7/76294 | train loss 9.218893 | norm 1.8325 | lr 1.29e-04 | (3741.91 ms | 140112 tok/s) step 8/76294 | train loss 9.117081 | norm 1.7689 | lr 1.31e-04 | (3763.30 ms | 139316 tok/s) step 9/76294 | train loss 9.013759 | norm 1.6787 | lr 1.32e-04 | (3714.42 ms | 141149 tok/s) step 10/76294 | train loss 8.913359 | norm 1.6835 | lr 1.34e-04 | (3741.83 ms | 140115 tok/s) step 11/76294 | train loss 8.765006 | norm 1.6334 | lr 1.35e-04 | (3717.19 ms | 141044 tok/s) step 12/76294 | train loss 8.712566 | norm 1.5210 | lr 1.37e-04 | (3731.07 ms | 140520 tok/s) step 13/76294 | train loss 8.571910 | norm 1.5247 | lr 1.38e-04 | (3747.14 ms | 139917 tok/s) step 14/76294 | train loss 8.473577 | norm 1.4222 | lr 1.40e-04 | (3733.60 ms | 140424 tok/s) step 15/76294 | train loss 8.397326 | norm 1.3248 | lr 1.41e-04 | (3731.99 ms | 140485 tok/s) step 16/76294 | train loss 8.234661 | norm 1.3511 | lr 1.43e-04 | (3766.43 ms | 139200 tok/s) step 17/76294 | train loss 8.189455 | norm 1.2069 | lr 1.44e-04 | (3736.40 ms | 140319 tok/s) step 18/76294 | train loss 8.015874 | norm 1.2731 | lr 1.46e-04 | (3765.16 ms | 139247 tok/s) step 19/76294 | train loss 8.027077 | norm 1.1523 | lr 1.47e-04 | (3744.51 ms | 140015 tok/s) step 20/76294 | train loss 7.872913 | norm 1.0886 | lr 1.49e-04 | (3748.95 ms | 139849 tok/s) step 21/76294 | train loss 7.860012 | norm 1.0757 | lr 1.50e-04 | (3768.99 ms | 139106 tok/s) step 22/76294 | train loss 
7.736823 | norm 0.9649 | lr 1.52e-04 | (3789.26 ms | 138361 tok/s) step 23/76294 | train loss 7.669961 | norm 0.9195 | lr 1.53e-04 | (3753.02 ms | 139698 tok/s) step 24/76294 | train loss 7.623647 | norm 0.8613 | lr 1.55e-04 | (3784.22 ms | 138546 tok/s) step 25/76294 | train loss 7.535639 | norm 0.7969 | lr 1.56e-04 | (3760.10 ms | 139435 tok/s) step 26/76294 | train loss 7.476774 | norm 0.7351 | lr 1.58e-04 | (3791.51 ms | 138279 tok/s) step 27/76294 | train loss 7.481842 | norm 0.7498 | lr 1.59e-04 | (3766.29 ms | 139205 tok/s) step 28/76294 | train loss 7.382748 | norm 0.6611 | lr 1.61e-04 | (3823.68 ms | 137116 tok/s) step 29/76294 | train loss 7.260289 | norm 0.6308 | lr 1.62e-04 | (3823.08 ms | 137137 tok/s) step 30/76294 | train loss 7.294071 | norm 0.5931 | lr 1.64e-04 | (3877.23 ms | 135222 tok/s) step 31/76294 | train loss 7.266265 | norm 0.5516 | lr 1.65e-04 | (3778.19 ms | 138767 tok/s) step 32/76294 | train loss 7.211483 | norm 0.5550 | lr 1.67e-04 | (3800.29 ms | 137960 tok/s) step 33/76294 | train loss 7.145764 | norm 0.5210 | lr 1.68e-04 | (3776.38 ms | 138834 tok/s) step 34/76294 | train loss 7.187470 | norm 0.4761 | lr 1.70e-04 | (3816.93 ms | 137359 tok/s) step 35/76294 | train loss 7.174496 | norm 0.4620 | lr 1.71e-04 | (3806.71 ms | 137727 tok/s) step 36/76294 | train loss 7.048912 | norm 0.4566 | lr 1.73e-04 | (3799.11 ms | 138003 tok/s) step 37/76294 | train loss 7.001124 | norm 0.4157 | lr 1.74e-04 | (3831.62 ms | 136832 tok/s) step 38/76294 | train loss 6.979667 | norm 0.4217 | lr 1.76e-04 | (3786.70 ms | 138455 tok/s) step 39/76294 | train loss 6.993701 | norm 0.3478 | lr 1.77e-04 | (3784.63 ms | 138531 tok/s) step 40/76294 | train loss 6.909603 | norm 0.4263 | lr 1.79e-04 | (3816.01 ms | 137392 tok/s) step 41/76294 | train loss 6.947084 | norm 0.5073 | lr 1.80e-04 | (3785.78 ms | 138489 tok/s) step 42/76294 | train loss 7.061844 | norm 0.8239 | lr 1.82e-04 | (3791.42 ms | 138283 tok/s) step 43/76294 | train loss 6.839359 | norm 0.7674 | 
lr 1.83e-04 | (3817.52 ms | 137337 tok/s) step 44/76294 | train loss 6.865248 | norm 0.3843 | lr 1.85e-04 | (3791.65 ms | 138274 tok/s) step 45/76294 | train loss 6.842014 | norm 0.5377 | lr 1.86e-04 | (3788.17 ms | 138401 tok/s) step 46/76294 | train loss 6.819368 | norm 0.4363 | lr 1.88e-04 | (3909.54 ms | 134105 tok/s) step 47/76294 | train loss 6.779409 | norm 0.4028 | lr 1.89e-04 | (3797.60 ms | 138058 tok/s) step 48/76294 | train loss 6.805227 | norm 0.4432 | lr 1.91e-04 | (3795.94 ms | 138118 tok/s) step 49/76294 | train loss 6.770819 | norm 0.3883 | lr 1.93e-04 | (3825.31 ms | 137058 tok/s) step 50/76294 | train loss 6.700094 | norm 0.3955 | lr 1.94e-04 | (3794.57 ms | 138168 tok/s) step 51/76294 | train loss 6.710415 | norm 0.3748 | lr 1.96e-04 | (4409.47 ms | 118900 tok/s) step 52/76294 | train loss 6.735385 | norm 0.3496 | lr 1.97e-04 | (3797.43 ms | 138064 tok/s) step 53/76294 | train loss 6.756082 | norm 0.2944 | lr 1.99e-04 | (3820.69 ms | 137223 tok/s) step 54/76294 | train loss 6.748868 | norm 0.3659 | lr 2.00e-04 | (3813.81 ms | 137471 tok/s) step 55/76294 | train loss 6.646012 | norm 0.3231 | lr 2.02e-04 | (3822.17 ms | 137170 tok/s) step 56/76294 | train loss 6.631675 | norm 0.3413 | lr 2.03e-04 | (3821.86 ms | 137181 tok/s) step 57/76294 | train loss 6.652728 | norm 0.3105 | lr 2.05e-04 | (3796.74 ms | 138089 tok/s) step 58/76294 | train loss 6.640766 | norm 0.3470 | lr 2.06e-04 | (3796.74 ms | 138089 tok/s) step 59/76294 | train loss 6.533360 | norm 0.3397 | lr 2.08e-04 | (3841.67 ms | 136474 tok/s) step 60/76294 | train loss 6.664685 | norm 0.4046 | lr 2.09e-04 | (3803.79 ms | 137833 tok/s) step 61/76294 | train loss 6.530439 | norm 0.2969 | lr 2.11e-04 | (3828.09 ms | 136958 tok/s) step 62/76294 | train loss 6.525674 | norm 0.4276 | lr 2.12e-04 | (3804.78 ms | 137797 tok/s) step 63/76294 | train loss 6.488318 | norm 0.3196 | lr 2.14e-04 | (3809.34 ms | 137632 tok/s) step 64/76294 | train loss 6.573791 | norm 0.3804 | lr 2.15e-04 | (4179.35 ms 
| 125447 tok/s) step 65/76294 | train loss 6.489616 | norm 0.3428 | lr 2.17e-04 | (3795.41 ms | 138137 tok/s) step 66/76294 | train loss 6.541927 | norm 0.3606 | lr 2.18e-04 | (3849.20 ms | 136207 tok/s) step 67/76294 | train loss 6.383108 | norm 0.3936 | lr 2.20e-04 | (3802.16 ms | 137892 tok/s) step 68/76294 | train loss 6.428723 | norm 0.4297 | lr 2.21e-04 | (3853.42 ms | 136058 tok/s) step 69/76294 | train loss 6.424401 | norm 0.3804 | lr 2.23e-04 | (3825.81 ms | 137040 tok/s) step 70/76294 | train loss 6.498914 | norm 0.3203 | lr 2.24e-04 | (3907.09 ms | 134189 tok/s) step 71/76294 | train loss 6.448039 | norm 0.3026 | lr 2.26e-04 | (3830.78 ms | 136862 tok/s) step 72/76294 | train loss 6.453261 | norm 0.3433 | lr 2.27e-04 | (3828.19 ms | 136955 tok/s) step 73/76294 | train loss 6.443392 | norm 0.2749 | lr 2.29e-04 | (3801.68 ms | 137910 tok/s) step 74/76294 | train loss 6.452988 | norm 0.3360 | lr 2.30e-04 | (3838.80 ms | 136576 tok/s) step 75/76294 | train loss 6.430051 | norm 0.3360 | lr 2.32e-04 | (3810.58 ms | 137587 tok/s) step 76/76294 | train loss 6.443998 | norm 0.3016 | lr 2.33e-04 | (3809.82 ms | 137615 tok/s) step 77/76294 | train loss 6.391025 | norm 0.3093 | lr 2.35e-04 | (3867.50 ms | 135562 tok/s) step 78/76294 | train loss 6.450660 | norm 0.3436 | lr 2.36e-04 | (3952.89 ms | 132634 tok/s) step 79/76294 | train loss 6.367066 | norm 0.3428 | lr 2.38e-04 | (3888.84 ms | 134819 tok/s) step 80/76294 | train loss 6.350414 | norm 0.5467 | lr 2.39e-04 | (3801.56 ms | 137914 tok/s) step 81/76294 | train loss 6.490496 | norm 0.7317 | lr 2.41e-04 | (3807.78 ms | 137689 tok/s) step 82/76294 | train loss 6.373075 | norm 0.6193 | lr 2.42e-04 | (3847.61 ms | 136263 tok/s) step 83/76294 | train loss 6.421629 | norm 0.4495 | lr 2.44e-04 | (3832.60 ms | 136797 tok/s) step 84/76294 | train loss 6.344560 | norm 0.4079 | lr 2.45e-04 | (3803.22 ms | 137854 tok/s) step 85/76294 | train loss 6.224625 | norm 0.4372 | lr 2.47e-04 | (3833.95 ms | 136749 tok/s) step 
86/76294 | train loss 6.338327 | norm 0.4083 | lr 2.48e-04 | (3802.96 ms | 137863 tok/s) step 87/76294 | train loss 6.315450 | norm 0.3662 | lr 2.50e-04 | (3810.89 ms | 137576 tok/s) step 88/76294 | train loss 6.294254 | norm 0.4103 | lr 2.51e-04 | (3888.68 ms | 134824 tok/s) step 89/76294 | train loss 6.309763 | norm 0.3572 | lr 2.53e-04 | (3812.67 ms | 137512 tok/s) step 90/76294 | train loss 6.291923 | norm 0.3530 | lr 2.54e-04 | (4317.35 ms | 121437 tok/s) step 91/76294 | train loss 6.335164 | norm 0.3850 | lr 2.56e-04 | (4046.90 ms | 129553 tok/s) step 92/76294 | train loss 6.371280 | norm 0.3611 | lr 2.57e-04 | (3799.25 ms | 137998 tok/s) step 93/76294 | train loss 6.316596 | norm 0.3412 | lr 2.59e-04 | (3847.14 ms | 136280 tok/s) step 94/76294 | train loss 6.270893 | norm 0.2854 | lr 2.60e-04 | (3802.63 ms | 137875 tok/s) step 95/76294 | train loss 6.160252 | norm 0.3483 | lr 2.62e-04 | (3803.53 ms | 137842 tok/s) step 96/76294 | train loss 6.280005 | norm 0.3110 | lr 2.63e-04 | (3825.88 ms | 137037 tok/s) step 97/76294 | train loss 6.169206 | norm 0.4052 | lr 2.65e-04 | (3829.76 ms | 136899 tok/s) step 98/76294 | train loss 6.206565 | norm 0.3865 | lr 2.67e-04 | (3802.37 ms | 137885 tok/s) step 99/76294 | train loss 6.274752 | norm 0.4776 | lr 2.68e-04 | (3866.60 ms | 135594 tok/s) step 100/76294 | train loss 6.318696 | norm 0.4146 | lr 2.70e-04 | (3800.97 ms | 137935 tok/s) step 101/76294 | train loss 6.265903 | norm 0.5208 | lr 2.71e-04 | (3831.04 ms | 136853 tok/s) step 102/76294 | train loss 6.202161 | norm 0.4818 | lr 2.73e-04 | (3800.05 ms | 137969 tok/s) step 103/76294 | train loss 6.256533 | norm 0.4576 | lr 2.74e-04 | (3861.81 ms | 135762 tok/s) step 104/76294 | train loss 6.161304 | norm 0.4381 | lr 2.76e-04 | (3798.49 ms | 138026 tok/s) step 105/76294 | train loss 6.215577 | norm 0.4168 | lr 2.77e-04 | (3835.76 ms | 136684 tok/s) step 106/76294 | train loss 6.281837 | norm 0.5496 | lr 2.79e-04 | (3800.39 ms | 137956 tok/s) step 107/76294 | train 
loss 6.215004 | norm 0.4137 | lr 2.80e-04 | (3848.86 ms | 136219 tok/s) step 108/76294 | train loss 6.238954 | norm 0.4107 | lr 2.82e-04 | (3820.46 ms | 137232 tok/s) step 109/76294 | train loss 6.156774 | norm 0.5864 | lr 2.83e-04 | (3867.58 ms | 135560 tok/s) step 110/76294 | train loss 6.186093 | norm 0.3991 | lr 2.85e-04 | (3797.33 ms | 138068 tok/s) step 111/76294 | train loss 6.206845 | norm 0.3156 | lr 2.86e-04 | (3809.61 ms | 137622 tok/s) step 112/76294 | train loss 6.187704 | norm 0.3362 | lr 2.88e-04 | (3904.76 ms | 134269 tok/s) step 113/76294 | train loss 6.208045 | norm 0.3481 | lr 2.89e-04 | (3800.24 ms | 137962 tok/s) step 114/76294 | train loss 6.497286 | norm 1.1163 | lr 2.91e-04 | (3810.22 ms | 137601 tok/s) step 115/76294 | train loss 6.150635 | norm 0.3915 | lr 2.92e-04 | (3802.46 ms | 137881 tok/s) step 116/76294 | train loss 6.122416 | norm 0.4230 | lr 2.94e-04 | (3832.64 ms | 136796 tok/s) step 117/76294 | train loss 6.091107 | norm 0.3493 | lr 2.95e-04 | (3799.96 ms | 137972 tok/s) step 118/76294 | train loss 6.157835 | norm 0.3626 | lr 2.97e-04 | (3832.44 ms | 136803 tok/s) step 119/76294 | train loss 6.235362 | norm 0.4232 | lr 2.98e-04 | (3800.10 ms | 137967 tok/s) step 120/76294 | train loss 6.187137 | norm 0.4559 | lr 3.00e-04 | (3812.37 ms | 137523 tok/s) step 121/76294 | train loss 6.060850 | norm 0.5003 | lr 3.01e-04 | (3833.67 ms | 136759 tok/s) step 122/76294 | train loss 6.159577 | norm 0.5741 | lr 3.03e-04 | (3809.19 ms | 137638 tok/s) step 123/76294 | train loss 6.160774 | norm 0.4282 | lr 3.04e-04 | (3813.20 ms | 137493 tok/s) step 124/76294 | train loss 6.102098 | norm 0.4691 | lr 3.06e-04 | (3810.35 ms | 137596 tok/s) step 125/76294 | train loss 6.226477 | norm 0.5458 | lr 3.07e-04 | (3810.74 ms | 137582 tok/s) step 126/76294 | train loss 6.137618 | norm 0.4476 | lr 3.09e-04 | (3808.00 ms | 137681 tok/s) step 127/76294 | train loss 6.090816 | norm 0.5425 | lr 3.10e-04 | (3807.96 ms | 137682 tok/s) step 128/76294 | train loss 
6.136693 | norm 0.4661 | lr 3.12e-04 | (3808.15 ms | 137675 tok/s) step 129/76294 | train loss 6.193569 | norm 0.5888 | lr 3.13e-04 | (3832.97 ms | 136784 tok/s) step 130/76294 | train loss 6.058520 | norm 0.5208 | lr 3.15e-04 | (3883.43 ms | 135006 tok/s) step 131/76294 | train loss 6.107296 | norm 0.4700 | lr 3.16e-04 | (3888.85 ms | 134818 tok/s) step 132/76294 | train loss 6.086199 | norm 0.4192 | lr 3.18e-04 | (3806.92 ms | 137720 tok/s) step 133/76294 | train loss 6.088540 | norm 0.4379 | lr 3.19e-04 | (3854.91 ms | 136005 tok/s) step 134/76294 | train loss 6.093202 | norm 0.4422 | lr 3.21e-04 | (3799.75 ms | 137980 tok/s) step 135/76294 | train loss 6.079446 | norm 0.5125 | lr 3.22e-04 | (3824.59 ms | 137084 tok/s) step 136/76294 | train loss 6.056699 | norm 0.3802 | lr 3.24e-04 | (3826.97 ms | 136998 tok/s) step 137/76294 | train loss 6.058701 | norm 0.4732 | lr 3.25e-04 | (3800.35 ms | 137958 tok/s) step 138/76294 | train loss 6.070695 | norm 0.5206 | lr 3.27e-04 | (3991.42 ms | 131354 tok/s) step 139/76294 | train loss 6.059684 | norm 0.4968 | lr 3.28e-04 | (3799.85 ms | 137976 tok/s) step 140/76294 | train loss 6.068952 | norm 0.4040 | lr 3.30e-04 | (3804.83 ms | 137795 tok/s) step 141/76294 | train loss 6.026955 | norm 0.3952 | lr 3.31e-04 | (3822.76 ms | 137149 tok/s) step 142/76294 | train loss 5.999803 | norm 0.3684 | lr 3.33e-04 | (3811.57 ms | 137552 tok/s) step 143/76294 | train loss 5.993890 | norm 0.3968 | lr 3.34e-04 | (3802.86 ms | 137867 tok/s) step 144/76294 | train loss 6.075047 | norm 0.4535 | lr 3.36e-04 | (3864.49 ms | 135668 tok/s) step 145/76294 | train loss 5.983877 | norm 0.3666 | lr 3.38e-04 | (3803.75 ms | 137834 tok/s) step 146/76294 | train loss 6.028826 | norm 0.3612 | lr 3.39e-04 | (3854.51 ms | 136019 tok/s) step 147/76294 | train loss 6.052479 | norm 0.4545 | lr 3.41e-04 | (3803.61 ms | 137840 tok/s) step 148/76294 | train loss 6.052212 | norm 0.5561 | lr 3.42e-04 | (3839.22 ms | 136561 tok/s) step 149/76294 | train loss 
6.059466 | norm 0.5710 | lr 3.44e-04 | (3836.11 ms | 136672 tok/s) step 150/76294 | train loss 6.099411 | norm 0.6828 | lr 3.45e-04 | (3808.37 ms | 137667 tok/s) step 151/76294 | train loss 6.031685 | norm 0.6587 | lr 3.47e-04 | (3874.69 ms | 135311 tok/s) step 152/76294 | train loss 6.017876 | norm 0.6024 | lr 3.48e-04 | (3807.81 ms | 137688 tok/s) step 153/76294 | train loss 6.043781 | norm 0.6212 | lr 3.50e-04 | (3829.91 ms | 136893 tok/s) step 154/76294 | train loss 5.922680 | norm 0.4974 | lr 3.51e-04 | (3803.44 ms | 137846 tok/s) step 155/76294 | train loss 5.910867 | norm 0.6499 | lr 3.53e-04 | (3806.16 ms | 137747 tok/s) step 156/76294 | train loss 5.976318 | norm 0.6365 | lr 3.54e-04 | (3806.71 ms | 137727 tok/s) step 157/76294 | train loss 6.069532 | norm 0.5633 | lr 3.56e-04 | (3857.87 ms | 135901 tok/s) step 158/76294 | train loss 5.994080 | norm 0.4150 | lr 3.57e-04 | (3802.36 ms | 137885 tok/s) step 159/76294 | train loss 5.978808 | norm 0.4265 | lr 3.59e-04 | (3806.09 ms | 137750 tok/s) step 160/76294 | train loss 6.048437 | norm 0.4095 | lr 3.60e-04 | (3824.26 ms | 137095 tok/s) step 161/76294 | train loss 5.887503 | norm 0.4147 | lr 3.62e-04 | (3807.54 ms | 137697 tok/s) step 162/76294 | train loss 5.954743 | norm 0.3450 | lr 3.63e-04 | (3808.56 ms | 137660 tok/s) step 163/76294 | train loss 5.949151 | norm 0.3209 | lr 3.65e-04 | (3831.48 ms | 136837 tok/s) step 164/76294 | train loss 5.942934 | norm 0.3273 | lr 3.66e-04 | (3804.55 ms | 137806 tok/s) step 165/76294 | train loss 5.967479 | norm 0.3829 | lr 3.68e-04 | (3810.68 ms | 137584 tok/s) step 166/76294 | train loss 5.964025 | norm 0.4426 | lr 3.69e-04 | (3828.64 ms | 136938 tok/s) step 167/76294 | train loss 5.973143 | norm 0.6205 | lr 3.71e-04 | (3810.50 ms | 137590 tok/s) step 168/76294 | train loss 5.967710 | norm 0.8175 | lr 3.72e-04 | (3805.38 ms | 137776 tok/s) step 169/76294 | train loss 5.990646 | norm 0.5673 | lr 3.74e-04 | (3836.88 ms | 136645 tok/s) step 170/76294 | train loss 
5.966996 | norm 0.5024 | lr 3.75e-04 | (3804.84 ms | 137795 tok/s) step 171/76294 | train loss 5.925948 | norm 0.4621 | lr 3.77e-04 | (3894.24 ms | 134632 tok/s) step 172/76294 | train loss 5.951741 | norm 0.4348 | lr 3.78e-04 | (3802.50 ms | 137880 tok/s) step 173/76294 | train loss 5.982956 | norm 0.4122 | lr 3.80e-04 | (3834.96 ms | 136713 tok/s) step 174/76294 | train loss 5.862164 | norm 0.4063 | lr 3.81e-04 | (3805.44 ms | 137773 tok/s) step 175/76294 | train loss 5.948396 | norm 0.3918 | lr 3.83e-04 | (3807.37 ms | 137703 tok/s) step 176/76294 | train loss 5.968335 | norm 0.4212 | lr 3.84e-04 | (3826.67 ms | 137009 tok/s) step 177/76294 | train loss 5.896030 | norm 0.4974 | lr 3.86e-04 | (3812.74 ms | 137509 tok/s) step 178/76294 | train loss 5.968899 | norm 0.5300 | lr 3.87e-04 | (3833.29 ms | 136772 tok/s) step 179/76294 | train loss 5.885706 | norm 0.4405 | lr 3.89e-04 | (3811.62 ms | 137550 tok/s) step 180/76294 | train loss 5.902092 | norm 0.3547 | lr 3.90e-04 | (3831.62 ms | 136832 tok/s) step 181/76294 | train loss 5.911314 | norm 0.4094 | lr 3.92e-04 | (3809.16 ms | 137639 tok/s) step 182/76294 | train loss 5.900958 | norm 0.3332 | lr 3.93e-04 | (3800.93 ms | 137937 tok/s) step 183/76294 | train loss 5.802586 | norm 0.3757 | lr 3.95e-04 | (3923.90 ms | 133614 tok/s) step 184/76294 | train loss 5.852753 | norm 0.3803 | lr 3.96e-04 | (3804.56 ms | 137805 tok/s) step 185/76294 | train loss 5.826636 | norm 0.4271 | lr 3.98e-04 | (3813.64 ms | 137477 tok/s) step 186/76294 | train loss 5.872246 | norm 0.3669 | lr 3.99e-04 | (3842.73 ms | 136436 tok/s) step 187/76294 | train loss 5.853525 | norm 0.5742 | lr 4.01e-04 | (3845.54 ms | 136337 tok/s) step 188/76294 | train loss 5.933128 | norm 0.8393 | lr 4.02e-04 | (3814.13 ms | 137459 tok/s) step 189/76294 | train loss 5.932065 | norm 0.7302 | lr 4.04e-04 | (3814.15 ms | 137459 tok/s) step 190/76294 | train loss 5.947619 | norm 0.9532 | lr 4.05e-04 | (3810.01 ms | 137608 tok/s) step 191/76294 | train loss 
5.915179 | norm 0.8139 | lr 4.07e-04 | (4268.12 ms | 122838 tok/s) step 192/76294 | train loss 5.946787 | norm 0.8170 | lr 4.09e-04 | (3884.45 ms | 134971 tok/s) step 193/76294 | train loss 5.940802 | norm 0.5627 | lr 4.10e-04 | (3865.88 ms | 135619 tok/s) step 194/76294 | train loss 5.888989 | norm 0.5204 | lr 4.12e-04 | (3803.59 ms | 137840 tok/s) step 195/76294 | train loss 5.888504 | norm 0.4257 | lr 4.13e-04 | (3817.08 ms | 137353 tok/s) step 196/76294 | train loss 6.173963 | norm 1.3620 | lr 4.15e-04 | (3804.04 ms | 137824 tok/s) step 197/76294 | train loss 5.856583 | norm 0.6237 | lr 4.16e-04 | (3820.54 ms | 137229 tok/s) step 198/76294 | train loss 5.883880 | norm 0.7872 | lr 4.18e-04 | (3804.88 ms | 137793 tok/s) step 199/76294 | train loss 5.876805 | norm 0.6196 | lr 4.19e-04 | (3823.92 ms | 137107 tok/s) step 200/76294 | train loss 5.841710 | norm 0.4815 | lr 4.21e-04 | (3801.43 ms | 137919 tok/s) step 201/76294 | train loss 5.876703 | norm 0.5544 | lr 4.22e-04 | (3828.24 ms | 136953 tok/s) step 202/76294 | train loss 5.885362 | norm 0.3950 | lr 4.24e-04 | (3800.46 ms | 137954 tok/s) step 203/76294 | train loss 5.843430 | norm 0.3821 | lr 4.25e-04 | (3825.40 ms | 137054 tok/s) step 204/76294 | train loss 5.828236 | norm 0.4019 | lr 4.27e-04 | (3799.47 ms | 137990 tok/s) step 205/76294 | train loss 5.903079 | norm 0.3643 | lr 4.28e-04 | (3807.74 ms | 137690 tok/s) step 206/76294 | train loss 5.802477 | norm 0.4012 | lr 4.30e-04 | (3828.01 ms | 136961 tok/s) step 207/76294 | train loss 5.869931 | norm 0.4301 | lr 4.31e-04 | (3804.25 ms | 137816 tok/s) step 208/76294 | train loss 5.886185 | norm 0.4830 | lr 4.33e-04 | (3802.76 ms | 137870 tok/s) step 209/76294 | train loss 5.948040 | norm 0.4961 | lr 4.34e-04 | (3849.18 ms | 136208 tok/s) step 210/76294 | train loss 5.815902 | norm 0.5849 | lr 4.36e-04 | (3817.51 ms | 137338 tok/s) step 211/76294 | train loss 5.846799 | norm 1.0508 | lr 4.37e-04 | (3835.92 ms | 136679 tok/s) step 212/76294 | train loss 
5.864277 | norm 1.1643 | lr 4.39e-04 | (3802.21 ms | 137890 tok/s) step 213/76294 | train loss 5.815387 | norm 0.7045 | lr 4.40e-04 | (3867.53 ms | 135562 tok/s) step 214/76294 | train loss 5.864858 | norm 0.5762 | lr 4.42e-04 | (3803.50 ms | 137844 tok/s) step 215/76294 | train loss 5.850428 | norm 0.5465 | lr 4.43e-04 | (3920.06 ms | 133745 tok/s) step 216/76294 | train loss 5.822092 | norm 0.5083 | lr 4.45e-04 | (3839.76 ms | 136542 tok/s) step 217/76294 | train loss 5.936616 | norm 0.5381 | lr 4.46e-04 | (3806.18 ms | 137747 tok/s) step 218/76294 | train loss 5.881813 | norm 0.4724 | lr 4.48e-04 | (3804.50 ms | 137807 tok/s) step 219/76294 | train loss 5.860651 | norm 0.5074 | lr 4.49e-04 | (3830.12 ms | 136886 tok/s) step 220/76294 | train loss 5.857191 | norm 0.5410 | lr 4.51e-04 | (3803.42 ms | 137847 tok/s) step 221/76294 | train loss 5.843690 | norm 0.7158 | lr 4.52e-04 | (3805.68 ms | 137765 tok/s) step 222/76294 | train loss 5.870971 | norm 1.1130 | lr 4.54e-04 | (3825.59 ms | 137048 tok/s) step 223/76294 | train loss 5.754541 | norm 0.6674 | lr 4.55e-04 | (3808.12 ms | 137676 tok/s) step 224/76294 | train loss 5.847232 | norm 0.4959 | lr 4.57e-04 | (3824.11 ms | 137101 tok/s) step 225/76294 | train loss 5.853219 | norm 0.6081 | lr 4.58e-04 | (3808.18 ms | 137674 tok/s) step 226/76294 | train loss 5.809850 | norm 0.8674 | lr 4.60e-04 | (3804.35 ms | 137813 tok/s) step 227/76294 | train loss 5.843626 | norm 0.7419 | lr 4.61e-04 | (3838.60 ms | 136583 tok/s) step 228/76294 | train loss 5.786310 | norm 0.6541 | lr 4.63e-04 | (3803.48 ms | 137844 tok/s) step 229/76294 | train loss 5.791254 | norm 0.5818 | lr 4.64e-04 | (4042.37 ms | 129698 tok/s) step 230/76294 | train loss 5.754129 | norm 0.4932 | lr 4.66e-04 | (3800.73 ms | 137944 tok/s) step 231/76294 | train loss 5.773458 | norm 0.5570 | lr 4.67e-04 | (3832.23 ms | 136810 tok/s) step 232/76294 | train loss 5.762952 | norm 0.5743 | lr 4.69e-04 | (3802.45 ms | 137882 tok/s) step 233/76294 | train loss 
5.781806 | norm 0.4035 | lr 4.70e-04 | (3898.83 ms | 134473 tok/s) step 234/76294 | train loss 5.806579 | norm 0.4443 | lr 4.72e-04 | (3802.68 ms | 137873 tok/s) step 235/76294 | train loss 5.710004 | norm 0.4319 | lr 4.73e-04 | (3814.67 ms | 137440 tok/s) step 236/76294 | train loss 5.748392 | norm 0.5154 | lr 4.75e-04 | (3832.38 ms | 136805 tok/s) step 237/76294 | train loss 5.751234 | norm 0.5494 | lr 4.76e-04 | (3805.45 ms | 137773 tok/s) step 238/76294 | train loss 5.710639 | norm 0.4062 | lr 4.78e-04 | (3826.63 ms | 137010 tok/s) step 239/76294 | train loss 5.716084 | norm 0.3963 | lr 4.79e-04 | (3949.25 ms | 132756 tok/s) step 240/76294 | train loss 5.699065 | norm 0.4243 | lr 4.81e-04 | (3805.91 ms | 137756 tok/s) step 241/76294 | train loss 5.729066 | norm 0.3676 | lr 4.83e-04 | (3833.66 ms | 136759 tok/s) step 242/76294 | train loss 5.685411 | norm 0.4902 | lr 4.84e-04 | (3803.19 ms | 137855 tok/s) step 243/76294 | train loss 5.724210 | norm 0.5280 | lr 4.86e-04 | (3807.96 ms | 137682 tok/s) step 244/76294 | train loss 5.725984 | norm 0.6890 | lr 4.87e-04 | (3821.28 ms | 137202 tok/s) step 245/76294 | train loss 5.734766 | norm 0.7994 | lr 4.89e-04 | (3804.84 ms | 137795 tok/s) step 246/76294 | train loss 5.821327 | norm 0.8914 | lr 4.90e-04 | (3802.76 ms | 137870 tok/s) step 247/76294 | train loss 5.826526 | norm 2.1735 | lr 4.92e-04 | (3857.25 ms | 135923 tok/s) step 248/76294 | train loss 5.799942 | norm 0.6994 | lr 4.93e-04 | (3804.23 ms | 137817 tok/s) step 249/76294 | train loss 5.732520 | norm 0.6159 | lr 4.95e-04 | (3809.40 ms | 137630 tok/s) step 250/76294 | train loss 5.755344 | norm 0.7657 | lr 4.96e-04 | (3825.49 ms | 137051 tok/s) val loss: 5.840369 saving model checkpoint to ./results/gpt2-124M-gqa/step_250.pth step 251/76294 | train loss 5.757165 | norm 0.6019 | lr 4.98e-04 | (3817.13 ms | 137351 tok/s) step 252/76294 | train loss 5.766583 | norm 0.6177 | lr 4.99e-04 | (3836.03 ms | 136675 tok/s) step 253/76294 | train loss 5.939019 | norm 
0.5540 | lr 5.01e-04 | (3804.01 ms | 137825 tok/s) step 254/76294 | train loss 5.736877 | norm 0.4983 | lr 5.02e-04 | (3801.19 ms | 137927 tok/s) step 255/76294 | train loss 5.773116 | norm 0.4498 | lr 5.04e-04 | (3906.87 ms | 134196 tok/s) step 256/76294 | train loss 5.712934 | norm 0.4914 | lr 5.05e-04 | (3856.75 ms | 135940 tok/s) step 257/76294 | train loss 5.722631 | norm 0.4534 | lr 5.07e-04 | (3804.75 ms | 137798 tok/s) step 258/76294 | train loss 5.702128 | norm 0.4471 | lr 5.08e-04 | (3802.16 ms | 137892 tok/s) step 259/76294 | train loss 5.749182 | norm 0.4792 | lr 5.10e-04 | (3840.60 ms | 136512 tok/s) step 260/76294 | train loss 5.658570 | norm 0.4246 | lr 5.11e-04 | (3800.80 ms | 137941 tok/s) step 261/76294 | train loss 5.671929 | norm 0.3956 | lr 5.13e-04 | (3850.66 ms | 136155 tok/s) step 262/76294 | train loss 5.694971 | norm 0.4103 | lr 5.14e-04 | (3800.02 ms | 137970 tok/s) step 263/76294 | train loss 5.756762 | norm 0.4734 | lr 5.16e-04 | (3826.16 ms | 137027 tok/s) step 264/76294 | train loss 5.619910 | norm 0.5921 | lr 5.17e-04 | (3799.48 ms | 137989 tok/s) step 265/76294 | train loss 5.642555 | norm 0.7108 | lr 5.19e-04 | (3850.50 ms | 136161 tok/s) step 266/76294 | train loss 5.664345 | norm 0.7147 | lr 5.20e-04 | (3802.88 ms | 137866 tok/s) step 267/76294 | train loss 5.691973 | norm 0.7564 | lr 5.22e-04 | (3828.63 ms | 136939 tok/s) step 268/76294 | train loss 5.647472 | norm 0.7192 | lr 5.23e-04 | (3802.46 ms | 137881 tok/s) step 269/76294 | train loss 5.635749 | norm 0.5955 | lr 5.25e-04 | (3832.76 ms | 136791 tok/s) step 270/76294 | train loss 5.694452 | norm 0.6258 | lr 5.26e-04 | (3801.16 ms | 137928 tok/s) step 271/76294 | train loss 5.689376 | norm 0.6058 | lr 5.28e-04 | (3806.38 ms | 137739 tok/s) step 272/76294 | train loss 5.624961 | norm 0.5823 | lr 5.29e-04 | (3844.81 ms | 136363 tok/s) step 273/76294 | train loss 5.709867 | norm 0.6483 | lr 5.31e-04 | (3809.32 ms | 137633 tok/s) step 274/76294 | train loss 5.641621 | norm 
0.6351 | lr 5.32e-04 | (3802.67 ms | 137874 tok/s) step 275/76294 | train loss 5.628467 | norm 0.5382 | lr 5.34e-04 | (3832.11 ms | 136815 tok/s) step 276/76294 | train loss 5.689814 | norm 0.5744 | lr 5.35e-04 | (3872.56 ms | 135385 tok/s) step 277/76294 | train loss 5.642345 | norm 0.4551 | lr 5.37e-04 | (3805.73 ms | 137763 tok/s) step 278/76294 | train loss 5.611440 | norm 0.4411 | lr 5.38e-04 | (3804.12 ms | 137821 tok/s) step 279/76294 | train loss 5.673392 | norm 0.5582 | lr 5.40e-04 | (3826.90 ms | 137001 tok/s) step 280/76294 | train loss 5.684496 | norm 0.4795 | lr 5.41e-04 | (3808.26 ms | 137671 tok/s) step 281/76294 | train loss 5.631978 | norm 0.4170 | lr 5.43e-04 | (3802.01 ms | 137898 tok/s) step 282/76294 | train loss 5.637095 | norm 0.4589 | lr 5.44e-04 | (3832.72 ms | 136793 tok/s) step 283/76294 | train loss 5.656037 | norm 0.5966 | lr 5.46e-04 | (3806.68 ms | 137728 tok/s) step 284/76294 | train loss 5.651752 | norm 0.6348 | lr 5.47e-04 | (3803.17 ms | 137855 tok/s) step 285/76294 | train loss 5.577105 | norm 0.5644 | lr 5.49e-04 | (3837.32 ms | 136629 tok/s) step 286/76294 | train loss 5.574926 | norm 0.7322 | lr 5.50e-04 | (3807.58 ms | 137696 tok/s) step 287/76294 | train loss 5.619219 | norm 0.7189 | lr 5.52e-04 | (3826.00 ms | 137033 tok/s) step 288/76294 | train loss 5.646163 | norm 0.6169 | lr 5.54e-04 | (3832.32 ms | 136807 tok/s) step 289/76294 | train loss 5.614922 | norm 0.7041 | lr 5.55e-04 | (3813.42 ms | 137485 tok/s) step 290/76294 | train loss 5.611612 | norm 0.8093 | lr 5.57e-04 | (3806.44 ms | 137737 tok/s) step 291/76294 | train loss 5.633851 | norm 0.6252 | lr 5.58e-04 | (3803.40 ms | 137847 tok/s) step 292/76294 | train loss 5.547648 | norm 0.6871 | lr 5.60e-04 | (3835.90 ms | 136679 tok/s) step 293/76294 | train loss 5.557603 | norm 0.4966 | lr 5.61e-04 | (3803.45 ms | 137845 tok/s) step 294/76294 | train loss 5.622142 | norm 0.4468 | lr 5.63e-04 | (3823.75 ms | 137114 tok/s) step 295/76294 | train loss 5.602447 | norm 
0.4391 | lr 5.64e-04 | (3815.79 ms | 137400 tok/s) step 296/76294 | train loss 5.541949 | norm 0.4484 | lr 5.66e-04 | (3890.04 ms | 134777 tok/s) step 297/76294 | train loss 5.596147 | norm 0.4775 | lr 5.67e-04 | (3819.13 ms | 137279 tok/s) step 298/76294 | train loss 5.570645 | norm 0.4817 | lr 5.69e-04 | (3865.16 ms | 135644 tok/s) step 299/76294 | train loss 5.544794 | norm 0.4322 | lr 5.70e-04 | (3803.74 ms | 137835 tok/s) step 300/76294 | train loss 5.440669 | norm 0.5007 | lr 5.72e-04 | (3845.39 ms | 136342 tok/s) step 301/76294 | train loss 5.725451 | norm 0.5920 | lr 5.73e-04 | (3824.13 ms | 137100 tok/s) step 302/76294 | train loss 5.601756 | norm 0.8656 | lr 5.75e-04 | (7423.45 ms | 70626 tok/s) step 303/76294 | train loss 5.528790 | norm 0.7381 | lr 5.76e-04 | (3778.71 ms | 138748 tok/s) step 304/76294 | train loss 5.581843 | norm 1.0000 | lr 5.78e-04 | (3786.74 ms | 138454 tok/s) step 305/76294 | train loss 5.551764 | norm 0.9361 | lr 5.79e-04 | (3805.80 ms | 137760 tok/s) step 306/76294 | train loss 5.690222 | norm 1.3339 | lr 5.81e-04 | (3785.39 ms | 138503 tok/s) step 307/76294 | train loss 5.704316 | norm 1.1113 | lr 5.82e-04 | (3807.09 ms | 137713 tok/s) step 308/76294 | train loss 5.710366 | norm 1.7778 | lr 5.84e-04 | (3793.16 ms | 138219 tok/s) step 309/76294 | train loss 5.644241 | norm 0.9606 | lr 5.85e-04 | (3790.07 ms | 138332 tok/s) step 310/76294 | train loss 5.529939 | norm 0.9399 | lr 5.87e-04 | (4095.36 ms | 128020 tok/s) step 311/76294 | train loss 5.590877 | norm 0.7573 | lr 5.88e-04 | (3783.24 ms | 138582 tok/s) step 312/76294 | train loss 5.610851 | norm 0.7343 | lr 5.90e-04 | (3863.80 ms | 135692 tok/s) step 313/76294 | train loss 5.614040 | norm 0.6403 | lr 5.91e-04 | (3786.01 ms | 138480 tok/s) step 314/76294 | train loss 5.535630 | norm 0.7293 | lr 5.93e-04 | (3818.94 ms | 137286 tok/s) step 315/76294 | train loss 5.631450 | norm 0.5913 | lr 5.94e-04 | (3788.84 ms | 138377 tok/s) step 316/76294 | train loss 5.605722 | norm 
0.6397 | lr 5.96e-04 | (3815.39 ms | 137414 tok/s) step 317/76294 | train loss 5.558656 | norm 0.5246 | lr 5.97e-04 | (3788.83 ms | 138377 tok/s) step 318/76294 | train loss 5.544639 | norm 0.5478 | lr 5.99e-04 | (3800.68 ms | 137946 tok/s) step 319/76294 | train loss 5.636907 | norm 0.6368 | lr 6.00e-04 | (3817.06 ms | 137354 tok/s) step 320/76294 | train loss 5.490314 | norm 0.5564 | lr 6.02e-04 | (3795.02 ms | 138151 tok/s) step 321/76294 | train loss 5.524955 | norm 0.4763 | lr 6.03e-04 | (3790.20 ms | 138327 tok/s) step 322/76294 | train loss 5.582210 | norm 0.6136 | lr 6.05e-04 | (3824.82 ms | 137075 tok/s) step 323/76294 | train loss 5.574318 | norm 0.7232 | lr 6.06e-04 | (3793.85 ms | 138194 tok/s) step 324/76294 | train loss 5.565970 | norm 0.5133 | lr 6.08e-04 | (3799.39 ms | 137993 tok/s) step 325/76294 | train loss 5.620723 | norm 0.6423 | lr 6.09e-04 | (3817.99 ms | 137320 tok/s) step 326/76294 | train loss 5.616973 | norm 0.6792 | lr 6.11e-04 | (3795.50 ms | 138134 tok/s) step 327/76294 | train loss 5.476462 | norm 0.7075 | lr 6.12e-04 | (3845.95 ms | 136322 tok/s) step 328/76294 | train loss 5.538185 | norm 0.5488 | lr 6.14e-04 | (3795.97 ms | 138117 tok/s) step 329/76294 | train loss 5.521571 | norm 0.4964 | lr 6.15e-04 | (3841.10 ms | 136494 tok/s) step 330/76294 | train loss 5.480236 | norm 0.4483 | lr 6.17e-04 | (3796.05 ms | 138114 tok/s) step 331/76294 | train loss 5.452101 | norm 0.5294 | lr 6.18e-04 | (3804.47 ms | 137808 tok/s) step 332/76294 | train loss 5.559510 | norm 0.7885 | lr 6.20e-04 | (3817.68 ms | 137332 tok/s) step 333/76294 | train loss 5.534914 | norm 0.8861 | lr 6.21e-04 | (3864.51 ms | 135668 tok/s) step 334/76294 | train loss 5.476709 | norm 0.5134 | lr 6.23e-04 | (3802.53 ms | 137879 tok/s) step 335/76294 | train loss 5.495186 | norm 0.7662 | lr 6.25e-04 | (3801.89 ms | 137902 tok/s) step 336/76294 | train loss 5.493467 | norm 0.8028 | lr 6.26e-04 | (3817.20 ms | 137349 tok/s) step 337/76294 | train loss 5.476600 | norm 
0.5368 | lr 6.28e-04 | (3814.37 ms | 137451 tok/s) step 338/76294 | train loss 5.436097 | norm 0.5754 | lr 6.29e-04 | (3828.28 ms | 136951 tok/s) step 339/76294 | train loss 5.451192 | norm 0.7666 | lr 6.31e-04 | (3836.57 ms | 136655 tok/s) step 340/76294 | train loss 5.550617 | norm 2.0947 | lr 6.32e-04 | (3883.16 ms | 135016 tok/s) step 341/76294 | train loss 5.504744 | norm 1.1934 | lr 6.34e-04 | (3801.40 ms | 137920 tok/s) step 342/76294 | train loss 5.499391 | norm 1.1604 | lr 6.35e-04 | (3928.91 ms | 133444 tok/s) step 343/76294 | train loss 5.487790 | norm 0.9295 | lr 6.37e-04 | (3830.96 ms | 136856 tok/s) step 344/76294 | train loss 5.486214 | norm 1.3660 | lr 6.38e-04 | (3806.98 ms | 137718 tok/s) step 345/76294 | train loss 5.549658 | norm 1.4649 | lr 6.40e-04 | (3828.75 ms | 136934 tok/s) step 346/76294 | train loss 5.476922 | norm 1.1751 | lr 6.41e-04 | (3807.60 ms | 137695 tok/s) step 347/76294 | train loss 5.475142 | norm 0.7791 | lr 6.43e-04 | (3801.62 ms | 137912 tok/s) step 348/76294 | train loss 5.534559 | norm 1.4323 | lr 6.44e-04 | (3847.10 ms | 136281 tok/s) step 349/76294 | train loss 5.467546 | norm 0.7997 | lr 6.46e-04 | (3796.56 ms | 138095 tok/s) step 350/76294 | train loss 5.547449 | norm 0.9340 | lr 6.47e-04 | (3820.71 ms | 137223 tok/s) step 351/76294 | train loss 5.547568 | norm 0.9337 | lr 6.49e-04 | (3795.16 ms | 138147 tok/s) step 352/76294 | train loss 5.477867 | norm 0.8418 | lr 6.50e-04 | (3806.96 ms | 137718 tok/s) step 353/76294 | train loss 5.450146 | norm 0.8334 | lr 6.52e-04 | (3949.50 ms | 132748 tok/s) step 354/76294 | train loss 5.468596 | norm 0.8077 | lr 6.53e-04 | (3797.78 ms | 138051 tok/s) step 355/76294 | train loss 5.431776 | norm 0.7442 | lr 6.55e-04 | (3822.55 ms | 137156 tok/s) step 356/76294 | train loss 5.454779 | norm 0.6493 | lr 6.56e-04 | (3825.35 ms | 137056 tok/s) step 357/76294 | train loss 5.431255 | norm 0.5010 | lr 6.58e-04 | (3806.26 ms | 137744 tok/s) step 358/76294 | train loss 5.492458 | norm 
0.4471 | lr 6.59e-04 | (3798.89 ms | 138011 tok/s) step 359/76294 | train loss 5.433412 | norm 0.5179 | lr 6.61e-04 | (3961.88 ms | 132333 tok/s) step 360/76294 | train loss 5.411390 | norm 0.4714 | lr 6.62e-04 | (3801.03 ms | 137933 tok/s) step 361/76294 | train loss 5.425312 | norm 0.4993 | lr 6.64e-04 | (3825.81 ms | 137040 tok/s) step 362/76294 | train loss 5.418686 | norm 0.6910 | lr 6.65e-04 | (3798.96 ms | 138008 tok/s) step 363/76294 | train loss 5.408182 | norm 0.6551 | lr 6.67e-04 | (3828.74 ms | 136935 tok/s) step 364/76294 | train loss 5.357925 | norm 0.5329 | lr 6.68e-04 | (3799.09 ms | 138004 tok/s) step 365/76294 | train loss 5.339309 | norm 0.5762 | lr 6.70e-04 | (3804.19 ms | 137819 tok/s) step 366/76294 | train loss 5.384847 | norm 0.6174 | lr 6.71e-04 | (3823.80 ms | 137112 tok/s) step 367/76294 | train loss 5.393558 | norm 0.5407 | lr 6.73e-04 | (3804.78 ms | 137797 tok/s) step 368/76294 | train loss 5.365499 | norm 0.4938 | lr 6.74e-04 | (3797.71 ms | 138054 tok/s) step 369/76294 | train loss 5.373020 | norm 0.4906 | lr 6.76e-04 | (3831.10 ms | 136850 tok/s) step 370/76294 | train loss 5.437249 | norm 0.4521 | lr 6.77e-04 | (3797.32 ms | 138068 tok/s) step 371/76294 | train loss 5.361161 | norm 0.4368 | lr 6.79e-04 | (3805.23 ms | 137781 tok/s) step 372/76294 | train loss 5.362906 | norm 0.4481 | lr 6.80e-04 | (3818.55 ms | 137300 tok/s) step 373/76294 | train loss 5.400624 | norm 0.5113 | lr 6.82e-04 | (3801.68 ms | 137910 tok/s) step 374/76294 | train loss 5.383924 | norm 0.6044 | lr 6.83e-04 | (3819.08 ms | 137281 tok/s) step 375/76294 | train loss 5.373066 | norm 0.8759 | lr 6.85e-04 | (3802.46 ms | 137881 tok/s) step 376/76294 | train loss 5.412484 | norm 1.0965 | lr 6.86e-04 | (3798.15 ms | 138038 tok/s) step 377/76294 | train loss 5.420721 | norm 0.8445 | lr 6.88e-04 | (3828.93 ms | 136928 tok/s) step 378/76294 | train loss 5.411042 | norm 1.1730 | lr 6.89e-04 | (3799.77 ms | 137979 tok/s) step 379/76294 | train loss 5.408028 | norm 
0.8330 | lr 6.91e-04 | (3844.76 ms | 136364 tok/s) step 380/76294 | train loss 5.404286 | norm 0.6918 | lr 6.92e-04 | (3797.28 ms | 138069 tok/s) step 381/76294 | train loss 5.373581 | norm 0.7430 | lr 6.94e-04 | (3802.85 ms | 137867 tok/s) step 382/76294 | train loss 5.382613 | norm 0.6957 | lr 6.95e-04 | (4271.15 ms | 122751 tok/s) step 383/76294 | train loss 5.357070 | norm 0.5562 | lr 6.97e-04 | (3796.51 ms | 138097 tok/s) step 384/76294 | train loss 5.333647 | norm 0.4929 | lr 6.99e-04 | (3813.57 ms | 137479 tok/s) step 385/76294 | train loss 5.319456 | norm 0.6203 | lr 7.00e-04 | (3798.38 ms | 138029 tok/s) step 386/76294 | train loss 5.374433 | norm 0.7504 | lr 7.02e-04 | (3810.45 ms | 137592 tok/s) step 387/76294 | train loss 5.382246 | norm 0.5921 | lr 7.03e-04 | (3820.38 ms | 137235 tok/s) step 388/76294 | train loss 5.448175 | norm 0.6976 | lr 7.05e-04 | (3951.91 ms | 132667 tok/s) step 389/76294 | train loss 5.338475 | norm 0.4938 | lr 7.06e-04 | (3799.63 ms | 137984 tok/s) step 390/76294 | train loss 5.273352 | norm 0.4485 | lr 7.08e-04 | (3840.88 ms | 136502 tok/s) step 391/76294 | train loss 5.390316 | norm 0.5808 | lr 7.09e-04 | (3800.32 ms | 137959 tok/s) step 392/76294 | train loss 5.315199 | norm 1.0375 | lr 7.11e-04 | (3827.43 ms | 136982 tok/s) step 393/76294 | train loss 5.381537 | norm 1.0993 | lr 7.12e-04 | (3796.86 ms | 138085 tok/s) step 394/76294 | train loss 5.326484 | norm 0.8221 | lr 7.14e-04 | (3893.97 ms | 134641 tok/s) step 395/76294 | train loss 5.369729 | norm 0.9275 | lr 7.15e-04 | (3804.58 ms | 137805 tok/s) step 396/76294 | train loss 5.391437 | norm 0.9758 | lr 7.17e-04 | (3822.90 ms | 137144 tok/s) step 397/76294 | train loss 5.383577 | norm 1.0109 | lr 7.18e-04 | (3811.13 ms | 137568 tok/s) step 398/76294 | train loss 5.379007 | norm 0.9904 | lr 7.20e-04 | (3801.18 ms | 137928 tok/s) step 399/76294 | train loss 5.394880 | norm 0.8578 | lr 7.21e-04 | (3821.90 ms | 137180 tok/s) step 400/76294 | train loss 5.361844 | norm 
0.5705 | lr 7.23e-04 | (3798.91 ms | 138010 tok/s) step 401/76294 | train loss 5.383592 | norm 0.5992 | lr 7.24e-04 | (3798.42 ms | 138028 tok/s) step 402/76294 | train loss 5.324593 | norm 0.6631 | lr 7.26e-04 | (3839.80 ms | 136540 tok/s) step 403/76294 | train loss 5.326512 | norm 0.5018 | lr 7.27e-04 | (3801.75 ms | 137907 tok/s) step 404/76294 | train loss 5.293872 | norm 0.5662 | lr 7.29e-04 | (3825.90 ms | 137037 tok/s) step 405/76294 | train loss 5.341769 | norm 0.5409 | lr 7.30e-04 | (3797.10 ms | 138076 tok/s) step 406/76294 | train loss 5.385432 | norm 0.5582 | lr 7.32e-04 | (3803.91 ms | 137829 tok/s) step 407/76294 | train loss 5.345273 | norm 0.7497 | lr 7.33e-04 | (3824.78 ms | 137077 tok/s) step 408/76294 | train loss 5.328507 | norm 0.9365 | lr 7.35e-04 | (3801.98 ms | 137899 tok/s) step 409/76294 | train loss 5.358222 | norm 0.6096 | lr 7.36e-04 | (3797.36 ms | 138067 tok/s) step 410/76294 | train loss 5.246535 | norm 0.5377 | lr 7.38e-04 | (3829.76 ms | 136898 tok/s) step 411/76294 | train loss 5.311440 | norm 0.7007 | lr 7.39e-04 | (3799.67 ms | 137983 tok/s) step 412/76294 | train loss 5.270452 | norm 0.6708 | lr 7.41e-04 | (3803.45 ms | 137846 tok/s) step 413/76294 | train loss 5.239758 | norm 0.5281 | lr 7.42e-04 | (3822.29 ms | 137166 tok/s) step 414/76294 | train loss 5.227363 | norm 0.6145 | lr 7.44e-04 | (3924.77 ms | 133584 tok/s) step 415/76294 | train loss 5.298184 | norm 0.5974 | lr 7.45e-04 | (3795.71 ms | 138127 tok/s) step 416/76294 | train loss 5.253067 | norm 0.5538 | lr 7.47e-04 | (3826.35 ms | 137020 tok/s) step 417/76294 | train loss 5.311471 | norm 0.5871 | lr 7.48e-04 | (3798.55 ms | 138023 tok/s) step 418/76294 | train loss 5.253590 | norm 0.5926 | lr 7.50e-04 | (3807.23 ms | 137709 tok/s) step 419/76294 | train loss 5.243934 | norm 0.5498 | lr 7.51e-04 | (3824.47 ms | 137088 tok/s) step 420/76294 | train loss 5.160328 | norm 0.4681 | lr 7.53e-04 | (3802.21 ms | 137890 tok/s) step 421/76294 | train loss 5.257531 | norm 
0.4074 | lr 7.54e-04 | (3808.92 ms | 137647 tok/s) step 422/76294 | train loss 5.175920 | norm 0.3340 | lr 7.56e-04 | (3805.71 ms | 137764 tok/s) step 423/76294 | train loss 5.209051 | norm 0.3545 | lr 7.57e-04 | (3799.49 ms | 137989 tok/s) step 424/76294 | train loss 5.167995 | norm 0.4204 | lr 7.59e-04 | (3836.50 ms | 136658 tok/s) step 425/76294 | train loss 5.257149 | norm 0.6688 | lr 7.60e-04 | (3800.49 ms | 137953 tok/s) step 426/76294 | train loss 5.224015 | norm 0.7580 | lr 7.62e-04 | (3804.84 ms | 137795 tok/s) step 427/76294 | train loss 5.246913 | norm 0.6776 | lr 7.63e-04 | (3823.01 ms | 137140 tok/s) step 428/76294 | train loss 5.208283 | norm 0.6584 | lr 7.65e-04 | (3801.77 ms | 137906 tok/s) step 429/76294 | train loss 5.198548 | norm 0.4785 | lr 7.66e-04 | (3803.36 ms | 137849 tok/s) step 430/76294 | train loss 5.174828 | norm 0.4694 | lr 7.68e-04 | (3827.41 ms | 136982 tok/s) step 431/76294 | train loss 5.285746 | norm 0.5311 | lr 7.70e-04 | (3802.25 ms | 137889 tok/s) step 432/76294 | train loss 5.216725 | norm 0.8037 | lr 7.71e-04 | (3804.74 ms | 137799 tok/s) step 433/76294 | train loss 5.234146 | norm 0.9720 | lr 7.73e-04 | (3825.16 ms | 137063 tok/s) step 434/76294 | train loss 5.206200 | norm 0.7018 | lr 7.74e-04 | (3806.84 ms | 137723 tok/s) step 435/76294 | train loss 5.292982 | norm 0.9445 | lr 7.76e-04 | (3867.06 ms | 135578 tok/s) step 436/76294 | train loss 5.243612 | norm 0.7804 | lr 7.77e-04 | (3801.69 ms | 137909 tok/s) step 437/76294 | train loss 5.365932 | norm 0.6339 | lr 7.79e-04 | (3806.35 ms | 137740 tok/s) step 438/76294 | train loss 5.197178 | norm 0.5124 | lr 7.80e-04 | (3836.90 ms | 136644 tok/s) step 439/76294 | train loss 5.220163 | norm 0.5431 | lr 7.82e-04 | (3802.48 ms | 137880 tok/s) step 440/76294 | train loss 5.171553 | norm 0.5020 | lr 7.83e-04 | (3801.87 ms | 137903 tok/s) step 441/76294 | train loss 5.212368 | norm 0.4806 | lr 7.85e-04 | (3832.28 ms | 136808 tok/s) step 442/76294 | train loss 5.040841 | norm 
0.4454 | lr 7.86e-04 | (3799.74 ms | 137980 tok/s) step 443/76294 | train loss 5.173247 | norm 0.4887 | lr 7.88e-04 | (3825.61 ms | 137047 tok/s) step 444/76294 | train loss 5.154223 | norm 0.5380 | lr 7.89e-04 | (3799.14 ms | 138002 tok/s) step 445/76294 | train loss 5.206408 | norm 0.4205 | lr 7.91e-04 | (3819.47 ms | 137267 tok/s) step 446/76294 | train loss 5.108850 | norm 0.4441 | lr 7.92e-04 | (3820.50 ms | 137230 tok/s) step 447/76294 | train loss 5.206084 | norm 0.5841 | lr 7.94e-04 | (3800.93 ms | 137937 tok/s) step 448/76294 | train loss 5.108384 | norm 0.6561 | lr 7.95e-04 | (3805.82 ms | 137760 tok/s) step 449/76294 | train loss 5.192156 | norm 0.6298 | lr 7.97e-04 | (3857.65 ms | 135909 tok/s) step 450/76294 | train loss 5.146430 | norm 0.5039 | lr 7.98e-04 | (3798.63 ms | 138020 tok/s) step 451/76294 | train loss 5.195024 | norm 0.4346 | lr 8.00e-04 | (3804.15 ms | 137820 tok/s) step 452/76294 | train loss 5.139457 | norm 0.4702 | lr 8.01e-04 | (3826.27 ms | 137023 tok/s) step 453/76294 | train loss 5.138968 | norm 0.4995 | lr 8.03e-04 | (3801.09 ms | 137931 tok/s) step 454/76294 | train loss 5.127535 | norm 0.6477 | lr 8.04e-04 | (3826.10 ms | 137029 tok/s) step 455/76294 | train loss 5.183962 | norm 0.8663 | lr 8.06e-04 | (3809.00 ms | 137645 tok/s) step 456/76294 | train loss 5.116596 | norm 0.6941 | lr 8.07e-04 | (3821.83 ms | 137182 tok/s) step 457/76294 | train loss 5.252749 | norm 0.5937 | lr 8.09e-04 | (3805.39 ms | 137775 tok/s) step 458/76294 | train loss 5.102275 | norm 0.5557 | lr 8.10e-04 | (3816.95 ms | 137358 tok/s) step 459/76294 | train loss 5.167690 | norm 0.6432 | lr 8.12e-04 | (3801.93 ms | 137901 tok/s) step 460/76294 | train loss 5.158487 | norm 1.0337 | lr 8.13e-04 | (3798.06 ms | 138041 tok/s) step 461/76294 | train loss 5.183106 | norm 0.6766 | lr 8.15e-04 | (3840.25 ms | 136525 tok/s) step 462/76294 | train loss 5.115167 | norm 0.7363 | lr 8.16e-04 | (3799.71 ms | 137981 tok/s) step 463/76294 | train loss 5.167182 | norm 
0.9369 | lr 8.18e-04 | (3806.91 ms | 137720 tok/s) step 464/76294 | train loss 5.210318 | norm 0.7584 | lr 8.19e-04 | (3823.66 ms | 137117 tok/s) step 465/76294 | train loss 5.129128 | norm 0.5585 | lr 8.21e-04 | (3978.80 ms | 131770 tok/s) step 466/76294 | train loss 5.200756 | norm 0.6073 | lr 8.22e-04 | (3799.03 ms | 138006 tok/s) step 467/76294 | train loss 5.192446 | norm 0.5454 | lr 8.24e-04 | (3838.57 ms | 136584 tok/s) step 468/76294 | train loss 5.184371 | norm 0.6157 | lr 8.25e-04 | (3800.84 ms | 137940 tok/s) step 469/76294 | train loss 5.117798 | norm 0.4865 | lr 8.27e-04 | (3854.25 ms | 136028 tok/s) step 470/76294 | train loss 5.116889 | norm 0.3933 | lr 8.28e-04 | (3797.91 ms | 138046 tok/s) step 471/76294 | train loss 5.060505 | norm 0.3786 | lr 8.30e-04 | (3802.79 ms | 137869 tok/s) step 472/76294 | train loss 5.115436 | norm 0.3558 | lr 8.31e-04 | (3824.11 ms | 137101 tok/s) step 473/76294 | train loss 5.019555 | norm 0.3890 | lr 8.33e-04 | (3798.94 ms | 138009 tok/s) step 474/76294 | train loss 5.086886 | norm 0.4458 | lr 8.34e-04 | (3806.47 ms | 137736 tok/s) step 475/76294 | train loss 4.983830 | norm 0.5821 | lr 8.36e-04 | (3878.03 ms | 135194 tok/s) step 476/76294 | train loss 5.096136 | norm 0.7783 | lr 8.37e-04 | (3811.24 ms | 137564 tok/s) step 477/76294 | train loss 5.015116 | norm 0.7147 | lr 8.39e-04 | (3799.31 ms | 137996 tok/s) step 478/76294 | train loss 5.085171 | norm 0.6627 | lr 8.41e-04 | (3802.95 ms | 137863 tok/s) step 479/76294 | train loss 5.087062 | norm 0.6079 | lr 8.42e-04 | (3833.16 ms | 136777 tok/s) step 480/76294 | train loss 5.125390 | norm 0.6156 | lr 8.44e-04 | (3798.55 ms | 138023 tok/s) step 481/76294 | train loss 5.069616 | norm 0.6626 | lr 8.45e-04 | (3821.45 ms | 137196 tok/s) step 482/76294 | train loss 5.108305 | norm 0.6761 | lr 8.47e-04 | (3823.01 ms | 137140 tok/s) step 483/76294 | train loss 5.079213 | norm 0.7468 | lr 8.48e-04 | (3804.26 ms | 137816 tok/s) step 484/76294 | train loss 5.092807 | norm 
0.5738 | lr 8.50e-04 | (3798.33 ms | 138031 tok/s) step 485/76294 | train loss 5.076204 | norm 0.4866 | lr 8.51e-04 | (3874.25 ms | 135326 tok/s) step 486/76294 | train loss 5.076656 | norm 0.6747 | lr 8.53e-04 | (3817.39 ms | 137342 tok/s) step 487/76294 | train loss 5.078126 | norm 0.3694 | lr 8.54e-04 | (3839.57 ms | 136549 tok/s) step 488/76294 | train loss 5.016982 | norm 0.3837 | lr 8.56e-04 | (3806.27 ms | 137743 tok/s) step 489/76294 | train loss 5.048003 | norm 0.3494 | lr 8.57e-04 | (3830.78 ms | 136862 tok/s) step 490/76294 | train loss 4.949324 | norm 0.4401 | lr 8.59e-04 | (3806.47 ms | 137736 tok/s) step 491/76294 | train loss 4.984612 | norm 0.5493 | lr 8.60e-04 | (3814.46 ms | 137447 tok/s) step 492/76294 | train loss 5.032763 | norm 0.5300 | lr 8.62e-04 | (3830.42 ms | 136875 tok/s) step 493/76294 | train loss 5.002477 | norm 0.5549 | lr 8.63e-04 | (3830.42 ms | 136875 tok/s) step 494/76294 | train loss 5.005189 | norm 0.6228 | lr 8.65e-04 | (3807.20 ms | 137710 tok/s) step 495/76294 | train loss 5.093349 | norm 0.5215 | lr 8.66e-04 | (3864.53 ms | 135667 tok/s) step 496/76294 | train loss 5.026928 | norm 0.7339 | lr 8.68e-04 | (3894.78 ms | 134613 tok/s) step 497/76294 | train loss 5.080836 | norm 0.6563 | lr 8.69e-04 | (3805.38 ms | 137775 tok/s) step 498/76294 | train loss 5.057483 | norm 0.7787 | lr 8.71e-04 | (3837.14 ms | 136635 tok/s) step 499/76294 | train loss 5.068163 | norm 0.7323 | lr 8.72e-04 | (3802.73 ms | 137872 tok/s) step 500/76294 | train loss 5.043596 | norm 1.3223 | lr 8.74e-04 | (3805.11 ms | 137785 tok/s) val loss: 5.140162 saving model checkpoint to ./results/gpt2-124M-gqa/step_500.pth step 501/76294 | train loss 5.181555 | norm 1.7007 | lr 8.75e-04 | (3807.04 ms | 137715 tok/s) step 502/76294 | train loss 5.050561 | norm 1.3643 | lr 8.77e-04 | (3806.86 ms | 137722 tok/s) step 503/76294 | train loss 5.075872 | norm 0.7145 | lr 8.78e-04 | (3800.52 ms | 137952 tok/s) step 504/76294 | train loss 5.026802 | norm 0.7922 | lr 
8.80e-04 | (3823.36 ms | 137128 tok/s) step 505/76294 | train loss 5.068057 | norm 0.6354 | lr 8.81e-04 | (3797.42 ms | 138064 tok/s) step 506/76294 | train loss 5.038336 | norm 0.7470 | lr 8.83e-04 | (3802.92 ms | 137865 tok/s) step 507/76294 | train loss 5.023240 | norm 0.6098 | lr 8.84e-04 | (3824.49 ms | 137087 tok/s) step 508/76294 | train loss 5.080378 | norm 0.6087 | lr 8.86e-04 | (3799.99 ms | 137971 tok/s) step 509/76294 | train loss 5.120679 | norm 0.8008 | lr 8.87e-04 | (3821.13 ms | 137208 tok/s) step 510/76294 | train loss 5.035172 | norm 0.9382 | lr 8.89e-04 | (3806.43 ms | 137737 tok/s) step 511/76294 | train loss 5.083925 | norm 0.5719 | lr 8.90e-04 | (3801.65 ms | 137911 tok/s) step 512/76294 | train loss 5.024813 | norm 0.8109 | lr 8.92e-04 | (3850.50 ms | 136161 tok/s) step 513/76294 | train loss 5.056838 | norm 0.8281 | lr 8.93e-04 | (3800.72 ms | 137945 tok/s) step 514/76294 | train loss 5.118059 | norm 0.8725 | lr 8.95e-04 | (3805.31 ms | 137778 tok/s) step 515/76294 | train loss 5.058600 | norm 0.9191 | lr 8.96e-04 | (3825.78 ms | 137041 tok/s) step 516/76294 | train loss 5.011086 | norm 0.9177 | lr 8.98e-04 | (3799.37 ms | 137994 tok/s) step 517/76294 | train loss 5.121015 | norm 1.7233 | lr 8.99e-04 | (3796.62 ms | 138094 tok/s) step 518/76294 | train loss 5.143663 | norm 0.6014 | lr 9.01e-04 | (7092.40 ms | 73923 tok/s) step 519/76294 | train loss 5.105508 | norm 0.6164 | lr 9.02e-04 | (3979.16 ms | 131759 tok/s) step 520/76294 | train loss 5.040832 | norm 1.2343 | lr 9.04e-04 | (3787.62 ms | 138421 tok/s) step 521/76294 | train loss 5.056862 | norm 0.4575 | lr 9.05e-04 | (3791.31 ms | 138287 tok/s) step 522/76294 | train loss 5.023995 | norm 0.6432 | lr 9.07e-04 | (3811.05 ms | 137571 tok/s) step 523/76294 | train loss 5.043204 | norm 0.6313 | lr 9.08e-04 | (3790.26 ms | 138325 tok/s) step 524/76294 | train loss 5.026201 | norm 0.5664 | lr 9.10e-04 | (3788.09 ms | 138404 tok/s) step 525/76294 | train loss 5.027553 | norm 0.5262 | lr 
9.11e-04 | (3842.27 ms | 136453 tok/s) step 526/76294 | train loss 4.959237 | norm 0.4075 | lr 9.13e-04 | (3792.95 ms | 138227 tok/s) step 527/76294 | train loss 4.947224 | norm 0.5837 | lr 9.15e-04 | (3794.81 ms | 138159 tok/s) step 528/76294 | train loss 4.978811 | norm 0.5905 | lr 9.16e-04 | (3813.60 ms | 137479 tok/s) step 529/76294 | train loss 5.020999 | norm 0.5411 | lr 9.18e-04 | (3799.89 ms | 137975 tok/s) step 530/76294 | train loss 4.939855 | norm 0.4345 | lr 9.19e-04 | (3819.81 ms | 137255 tok/s) step 531/76294 | train loss 5.038038 | norm 0.5053 | lr 9.21e-04 | (3793.16 ms | 138219 tok/s) step 532/76294 | train loss 5.052317 | norm 0.4468 | lr 9.22e-04 | (3794.71 ms | 138163 tok/s) step 533/76294 | train loss 4.886298 | norm 0.4065 | lr 9.24e-04 | (3828.34 ms | 136949 tok/s) step 534/76294 | train loss 5.025489 | norm 0.4844 | lr 9.25e-04 | (3871.60 ms | 135419 tok/s) step 535/76294 | train loss 4.884147 | norm 0.4008 | lr 9.27e-04 | (3793.93 ms | 138191 tok/s) step 536/76294 | train loss 5.017484 | norm 0.3926 | lr 9.28e-04 | (3801.21 ms | 137927 tok/s) step 537/76294 | train loss 4.927680 | norm 0.4241 | lr 9.30e-04 | (3815.41 ms | 137413 tok/s) step 538/76294 | train loss 4.935799 | norm 0.5655 | lr 9.31e-04 | (3862.12 ms | 135751 tok/s) step 539/76294 | train loss 4.944407 | norm 0.6192 | lr 9.33e-04 | (3814.91 ms | 137431 tok/s) step 540/76294 | train loss 4.921499 | norm 0.6208 | lr 9.34e-04 | (3798.68 ms | 138019 tok/s) step 541/76294 | train loss 4.908991 | norm 0.4827 | lr 9.36e-04 | (3803.90 ms | 137829 tok/s) step 542/76294 | train loss 4.933350 | norm 0.4998 | lr 9.37e-04 | (3823.25 ms | 137131 tok/s) step 543/76294 | train loss 4.983672 | norm 0.5543 | lr 9.39e-04 | (3821.86 ms | 137181 tok/s) step 544/76294 | train loss 4.948181 | norm 0.5234 | lr 9.40e-04 | (3799.41 ms | 137992 tok/s) step 545/76294 | train loss 4.868810 | norm 0.4012 | lr 9.42e-04 | (3794.04 ms | 138187 tok/s) step 546/76294 | train loss 4.949229 | norm 0.3816 | lr 
9.43e-04 | (3830.71 ms | 136865 tok/s) step 547/76294 | train loss 4.957602 | norm 0.3801 | lr 9.45e-04 | (3794.32 ms | 138177 tok/s) step 548/76294 | train loss 4.910153 | norm 0.3901 | lr 9.46e-04 | (3803.85 ms | 137831 tok/s) step 549/76294 | train loss 4.906832 | norm 0.5652 | lr 9.48e-04 | (3818.84 ms | 137290 tok/s) step 550/76294 | train loss 4.870793 | norm 0.4097 | lr 9.49e-04 | (3799.63 ms | 137984 tok/s) step 551/76294 | train loss 4.961486 | norm 0.3910 | lr 9.51e-04 | (3796.17 ms | 138110 tok/s) step 552/76294 | train loss 4.999930 | norm 0.6827 | lr 9.52e-04 | (3835.08 ms | 136708 tok/s) step 553/76294 | train loss 4.929609 | norm 0.5143 | lr 9.54e-04 | (3795.27 ms | 138142 tok/s) step 554/76294 | train loss 4.981391 | norm 0.4058 | lr 9.55e-04 | (3802.49 ms | 137880 tok/s) step 555/76294 | train loss 4.984918 | norm 0.4534 | lr 9.57e-04 | (3815.64 ms | 137405 tok/s) step 556/76294 | train loss 4.852058 | norm 0.3307 | lr 9.58e-04 | (3797.34 ms | 138067 tok/s) step 557/76294 | train loss 4.915627 | norm 0.4339 | lr 9.60e-04 | (3794.02 ms | 138188 tok/s) step 558/76294 | train loss 4.965073 | norm 0.6434 | lr 9.61e-04 | (3834.82 ms | 136718 tok/s) step 559/76294 | train loss 4.915994 | norm 0.6765 | lr 9.63e-04 | (3797.24 ms | 138071 tok/s) step 560/76294 | train loss 4.873122 | norm 0.6632 | lr 9.64e-04 | (3799.03 ms | 138006 tok/s) step 561/76294 | train loss 4.870944 | norm 0.5905 | lr 9.66e-04 | (3816.37 ms | 137379 tok/s) step 562/76294 | train loss 4.853683 | norm 0.5548 | lr 9.67e-04 | (3798.73 ms | 138016 tok/s) step 563/76294 | train loss 4.926503 | norm 0.7558 | lr 9.69e-04 | (3797.62 ms | 138057 tok/s) step 564/76294 | train loss 4.949180 | norm 0.8031 | lr 9.70e-04 | (3840.16 ms | 136528 tok/s) step 565/76294 | train loss 4.855750 | norm 0.7516 | lr 9.72e-04 | (3795.11 ms | 138148 tok/s) step 566/76294 | train loss 5.064032 | norm 0.6524 | lr 9.73e-04 | (3808.20 ms | 137673 tok/s) step 567/76294 | train loss 4.964111 | norm 0.6448 | lr 
9.75e-04 | (3822.93 ms | 137143 tok/s) step 568/76294 | train loss 4.904237 | norm 0.6511 | lr 9.76e-04 | (3798.54 ms | 138024 tok/s) step 569/76294 | train loss 4.914377 | norm 0.5749 | lr 9.78e-04 | (3804.30 ms | 137815 tok/s) step 570/76294 | train loss 4.978334 | norm 0.5203 | lr 9.79e-04 | (3798.36 ms | 138030 tok/s) step 571/76294 | train loss 4.910886 | norm 0.6009 | lr 9.81e-04 | (3824.29 ms | 137094 tok/s) step 572/76294 | train loss 4.926702 | norm 0.7058 | lr 9.82e-04 | (3797.47 ms | 138063 tok/s) step 573/76294 | train loss 4.896276 | norm 1.0416 | lr 9.84e-04 | (4149.64 ms | 126345 tok/s) step 574/76294 | train loss 4.883007 | norm 0.7144 | lr 9.86e-04 | (3792.54 ms | 138242 tok/s) step 575/76294 | train loss 4.962392 | norm 0.5964 | lr 9.87e-04 | (3893.90 ms | 134644 tok/s) step 576/76294 | train loss 4.916070 | norm 0.6311 | lr 9.89e-04 | (3800.33 ms | 137959 tok/s) step 577/76294 | train loss 4.940452 | norm 0.6282 | lr 9.90e-04 | (3801.42 ms | 137919 tok/s) step 578/76294 | train loss 4.876210 | norm 0.6112 | lr 9.92e-04 | (3822.18 ms | 137170 tok/s) step 579/76294 | train loss 4.861098 | norm 0.6603 | lr 9.93e-04 | (3802.29 ms | 137887 tok/s) step 580/76294 | train loss 4.956231 | norm 0.5953 | lr 9.95e-04 | (3794.56 ms | 138168 tok/s) step 581/76294 | train loss 4.902445 | norm 0.5167 | lr 9.96e-04 | (3829.11 ms | 136922 tok/s) step 582/76294 | train loss 4.882167 | norm 0.5314 | lr 9.98e-04 | (3796.43 ms | 138100 tok/s) step 583/76294 | train loss 4.879946 | norm 0.4249 | lr 9.99e-04 | (3801.88 ms | 137902 tok/s) step 584/76294 | train loss 4.854416 | norm 0.3668 | lr 1.00e-03 | (3817.17 ms | 137350 tok/s) step 585/76294 | train loss 4.879431 | norm 0.3530 | lr 1.00e-03 | (3803.46 ms | 137845 tok/s) step 586/76294 | train loss 4.793570 | norm 0.3518 | lr 1.00e-03 | (3795.33 ms | 138140 tok/s) step 587/76294 | train loss 4.893016 | norm 0.3528 | lr 1.01e-03 | (3833.37 ms | 136769 tok/s) step 588/76294 | train loss 4.834765 | norm 0.3726 | lr 
1.01e-03 | (3798.70 ms | 138018 tok/s) step 589/76294 | train loss 4.819828 | norm 0.4552 | lr 1.01e-03 | (3800.39 ms | 137957 tok/s) step 590/76294 | train loss 4.795924 | norm 0.4851 | lr 1.01e-03 | (3821.84 ms | 137182 tok/s) step 591/76294 | train loss 4.856579 | norm 0.4975 | lr 1.01e-03 | (3802.55 ms | 137878 tok/s) step 592/76294 | train loss 4.882172 | norm 0.3791 | lr 1.01e-03 | (3795.88 ms | 138120 tok/s) step 593/76294 | train loss 4.786117 | norm 0.4098 | lr 1.01e-03 | (3839.12 ms | 136565 tok/s) step 594/76294 | train loss 4.824802 | norm 0.3945 | lr 1.02e-03 | (3799.22 ms | 137999 tok/s) step 595/76294 | train loss 4.808616 | norm 0.3437 | lr 1.02e-03 | (3883.12 ms | 135017 tok/s) step 596/76294 | train loss 4.862611 | norm 0.4004 | lr 1.02e-03 | (3799.42 ms | 137992 tok/s) step 597/76294 | train loss 4.844928 | norm 0.3791 | lr 1.02e-03 | (3852.74 ms | 136082 tok/s) step 598/76294 | train loss 4.864400 | norm 0.3895 | lr 1.02e-03 | (3795.79 ms | 138123 tok/s) step 599/76294 | train loss 4.873534 | norm 0.4014 | lr 1.02e-03 | (3821.76 ms | 137185 tok/s) step 600/76294 | train loss 4.789302 | norm 0.4469 | lr 1.02e-03 | (3818.00 ms | 137320 tok/s) step 601/76294 | train loss 4.783144 | norm 0.4600 | lr 1.03e-03 | (3801.45 ms | 137918 tok/s) step 602/76294 | train loss 4.786394 | norm 0.4812 | lr 1.03e-03 | (3802.60 ms | 137876 tok/s) step 603/76294 | train loss 4.768991 | norm 0.3717 | lr 1.03e-03 | (3805.57 ms | 137768 tok/s) step 604/76294 | train loss 4.815871 | norm 0.4126 | lr 1.03e-03 | (3800.33 ms | 137959 tok/s) step 605/76294 | train loss 4.723863 | norm 0.4086 | lr 1.03e-03 | (3850.53 ms | 136160 tok/s) step 606/76294 | train loss 4.759244 | norm 0.4290 | lr 1.03e-03 | (3805.31 ms | 137778 tok/s) step 607/76294 | train loss 4.765461 | norm 0.4654 | lr 1.04e-03 | (3805.20 ms | 137782 tok/s) step 608/76294 | train loss 4.812407 | norm 0.4869 | lr 1.04e-03 | (3824.30 ms | 137094 tok/s) step 609/76294 | train loss 4.739980 | norm 0.5550 | lr 
1.04e-03 | (3796.33 ms | 138104 tok/s) step 610/76294 | train loss 4.781221 | norm 0.4829 | lr 1.04e-03 | (3797.92 ms | 138046 tok/s) step 611/76294 | train loss 4.797896 | norm 0.5330 | lr 1.04e-03 | (3828.96 ms | 136927 tok/s) step 612/76294 | train loss 4.804084 | norm 0.5220 | lr 1.04e-03 | (3797.42 ms | 138064 tok/s) step 613/76294 | train loss 4.802406 | norm 0.4027 | lr 1.04e-03 | (3807.34 ms | 137704 tok/s) step 614/76294 | train loss 4.835801 | norm 0.4655 | lr 1.05e-03 | (3797.99 ms | 138044 tok/s) step 615/76294 | train loss 4.758856 | norm 0.5673 | lr 1.05e-03 | (3806.75 ms | 137726 tok/s) step 616/76294 | train loss 4.813650 | norm 0.6268 | lr 1.05e-03 | (3878.21 ms | 135188 tok/s) step 617/76294 | train loss 4.779370 | norm 0.3796 | lr 1.05e-03 | (3800.93 ms | 137937 tok/s) step 618/76294 | train loss 4.714851 | norm 0.3623 | lr 1.05e-03 | (3835.50 ms | 136694 tok/s) step 619/76294 | train loss 4.738998 | norm 0.5303 | lr 1.05e-03 | (3800.28 ms | 137961 tok/s) step 620/76294 | train loss 4.767353 | norm 0.7414 | lr 1.05e-03 | (3802.49 ms | 137880 tok/s) step 621/76294 | train loss 5.026783 | norm 0.7858 | lr 1.06e-03 | (3820.69 ms | 137223 tok/s) step 622/76294 | train loss 4.799723 | norm 0.6615 | lr 1.06e-03 | (4232.37 ms | 123876 tok/s) step 623/76294 | train loss 4.873271 | norm 0.7709 | lr 1.06e-03 | (3798.25 ms | 138034 tok/s) step 624/76294 | train loss 4.793836 | norm 0.5939 | lr 1.06e-03 | (3837.35 ms | 136628 tok/s) step 625/76294 | train loss 4.760114 | norm 0.6738 | lr 1.06e-03 | (3812.90 ms | 137504 tok/s) step 626/76294 | train loss 4.803283 | norm 4.1969 | lr 1.06e-03 | (3800.29 ms | 137960 tok/s) step 627/76294 | train loss 4.917343 | norm 0.9367 | lr 1.07e-03 | (3821.81 ms | 137183 tok/s) step 628/76294 | train loss 4.894681 | norm 1.1239 | lr 1.07e-03 | (3803.44 ms | 137846 tok/s) step 629/76294 | train loss 4.953161 | norm 0.8838 | lr 1.07e-03 | (3800.72 ms | 137944 tok/s) step 630/76294 | train loss 4.873314 | norm 0.8306 | lr 
1.07e-03 | (3827.25 ms | 136988 tok/s) step 631/76294 | train loss 4.897504 | norm 1.7124 | lr 1.07e-03 | (3798.64 ms | 138020 tok/s) step 632/76294 | train loss 5.083768 | norm 0.9559 | lr 1.07e-03 | (3876.05 ms | 135263 tok/s) step 633/76294 | train loss 4.922335 | norm 1.2570 | lr 1.07e-03 | (3819.34 ms | 137272 tok/s) step 634/76294 | train loss 4.863170 | norm 0.8390 | lr 1.08e-03 | (3801.70 ms | 137909 tok/s) step 635/76294 | train loss 4.946705 | norm 0.9664 | lr 1.08e-03 | (3799.74 ms | 137980 tok/s) step 636/76294 | train loss 4.811970 | norm 0.7712 | lr 1.08e-03 | (3974.84 ms | 131902 tok/s) step 637/76294 | train loss 4.882799 | norm 0.9865 | lr 1.08e-03 | (3795.26 ms | 138143 tok/s) step 638/76294 | train loss 4.890009 | norm 0.7998 | lr 1.08e-03 | (3811.87 ms | 137541 tok/s) step 639/76294 | train loss 4.898188 | norm 0.6945 | lr 1.08e-03 | (3794.29 ms | 138178 tok/s) step 640/76294 | train loss 4.807732 | norm 0.6429 | lr 1.09e-03 | (3810.23 ms | 137600 tok/s) step 641/76294 | train loss 4.896730 | norm 0.4569 | lr 1.09e-03 | (3827.99 ms | 136962 tok/s) step 642/76294 | train loss 4.820106 | norm 0.4405 | lr 1.09e-03 | (3809.97 ms | 137610 tok/s) step 643/76294 | train loss 4.786839 | norm 0.4010 | lr 1.09e-03 | (3796.55 ms | 138096 tok/s) step 644/76294 | train loss 4.754480 | norm 0.4821 | lr 1.09e-03 | (3827.85 ms | 136967 tok/s) step 645/76294 | train loss 4.799548 | norm 0.5725 | lr 1.09e-03 | (3796.90 ms | 138083 tok/s) step 646/76294 | train loss 4.738804 | norm 0.5268 | lr 1.09e-03 | (3810.87 ms | 137577 tok/s) step 647/76294 | train loss 4.720801 | norm 0.4240 | lr 1.10e-03 | (3829.18 ms | 136919 tok/s) step 648/76294 | train loss 4.811533 | norm 0.5203 | lr 1.10e-03 | (3803.35 ms | 137849 tok/s) step 649/76294 | train loss 4.821206 | norm 0.5070 | lr 1.10e-03 | (3817.44 ms | 137340 tok/s) step 650/76294 | train loss 4.701691 | norm 0.4091 | lr 1.10e-03 | (3804.83 ms | 137795 tok/s) step 651/76294 | train loss 4.757176 | norm 0.4109 | lr 
1.10e-03 | (3798.92 ms | 138010 tok/s) step 652/76294 | train loss 4.695519 | norm 0.3291 | lr 1.10e-03 | (3835.39 ms | 136698 tok/s) step 653/76294 | train loss 4.666175 | norm 0.3206 | lr 1.10e-03 | (3803.31 ms | 137850 tok/s) step 654/76294 | train loss 4.723676 | norm 0.3039 | lr 1.11e-03 | (3805.81 ms | 137760 tok/s) step 655/76294 | train loss 4.648787 | norm 0.3304 | lr 1.11e-03 | (3797.34 ms | 138067 tok/s) step 656/76294 | train loss 4.721763 | norm 0.3756 | lr 1.11e-03 | (3806.05 ms | 137751 tok/s) step 657/76294 | train loss 4.724658 | norm 0.6255 | lr 1.11e-03 | (3835.08 ms | 136708 tok/s) step 658/76294 | train loss 4.732018 | norm 0.4714 | lr 1.11e-03 | (3795.19 ms | 138145 tok/s) step 659/76294 | train loss 4.728409 | norm 0.4582 | lr 1.11e-03 | (3831.60 ms | 136832 tok/s) step 660/76294 | train loss 4.737513 | norm 0.4622 | lr 1.12e-03 | (3824.08 ms | 137102 tok/s) step 661/76294 | train loss 4.709082 | norm 0.4105 | lr 1.12e-03 | (3801.49 ms | 137917 tok/s) step 662/76294 | train loss 4.713185 | norm 0.3611 | lr 1.12e-03 | (3816.08 ms | 137389 tok/s) step 663/76294 | train loss 4.715084 | norm 0.3470 | lr 1.12e-03 | (3809.99 ms | 137609 tok/s) step 664/76294 | train loss 4.664967 | norm 0.4176 | lr 1.12e-03 | (4026.99 ms | 130194 tok/s) step 665/76294 | train loss 4.751350 | norm 0.4399 | lr 1.12e-03 | (3806.28 ms | 137743 tok/s) step 666/76294 | train loss 4.686249 | norm 0.4955 | lr 1.12e-03 | (3805.93 ms | 137756 tok/s) step 667/76294 | train loss 4.703388 | norm 0.5257 | lr 1.13e-03 | (3851.01 ms | 136143 tok/s) step 668/76294 | train loss 4.681674 | norm 0.5169 | lr 1.13e-03 | (3807.07 ms | 137714 tok/s) step 669/76294 | train loss 4.683882 | norm 0.5092 | lr 1.13e-03 | (3809.04 ms | 137643 tok/s) step 670/76294 | train loss 4.697188 | norm 2.0863 | lr 1.13e-03 | (3834.37 ms | 136734 tok/s) step 671/76294 | train loss 4.737782 | norm 0.4213 | lr 1.13e-03 | (3804.67 ms | 137801 tok/s) step 672/76294 | train loss 4.678672 | norm 0.4840 | lr 
1.13e-03 | (3805.10 ms | 137786 tok/s) step 673/76294 | train loss 4.679389 | norm 0.5934 | lr 1.14e-03 | (3846.81 ms | 136292 tok/s) step 674/76294 | train loss 4.646740 | norm 0.6303 | lr 1.14e-03 | (3804.55 ms | 137805 tok/s) step 675/76294 | train loss 4.707082 | norm 0.5791 | lr 1.14e-03 | (3861.91 ms | 135759 tok/s) step 676/76294 | train loss 4.656905 | norm 0.6158 | lr 1.14e-03 | (3804.25 ms | 137816 tok/s) step 677/76294 | train loss 4.721159 | norm 0.7025 | lr 1.14e-03 | (3897.08 ms | 134533 tok/s) step 678/76294 | train loss 4.654892 | norm 0.5178 | lr 1.14e-03 | (3808.41 ms | 137666 tok/s) step 679/76294 | train loss 4.651889 | norm 0.5564 | lr 1.14e-03 | (3807.11 ms | 137713 tok/s) step 680/76294 | train loss 4.634154 | norm 0.5268 | lr 1.15e-03 | (3830.19 ms | 136883 tok/s) step 681/76294 | train loss 4.713994 | norm 0.6243 | lr 1.15e-03 | (3813.06 ms | 137498 tok/s) step 682/76294 | train loss 4.645876 | norm 1.2354 | lr 1.15e-03 | (3811.81 ms | 137543 tok/s) step 683/76294 | train loss 4.715025 | norm 0.7421 | lr 1.15e-03 | (3805.02 ms | 137789 tok/s) step 684/76294 | train loss 4.679682 | norm 0.9446 | lr 1.15e-03 | (3801.71 ms | 137908 tok/s) step 685/76294 | train loss 4.696798 | norm 0.5391 | lr 1.15e-03 | (3845.10 ms | 136352 tok/s) step 686/76294 | train loss 4.662431 | norm 1.1221 | lr 1.15e-03 | (3795.43 ms | 138137 tok/s) step 687/76294 | train loss 4.847211 | norm 0.5199 | lr 1.16e-03 | (3805.60 ms | 137768 tok/s) step 688/76294 | train loss 4.709170 | norm 2.4401 | lr 1.16e-03 | (3822.33 ms | 137165 tok/s) step 689/76294 | train loss 4.684415 | norm 1.8874 | lr 1.16e-03 | (3800.82 ms | 137941 tok/s) step 690/76294 | train loss 4.786690 | norm 1.1438 | lr 1.16e-03 | (3800.89 ms | 137938 tok/s) step 691/76294 | train loss 4.686177 | norm 0.9623 | lr 1.16e-03 | (3835.23 ms | 136703 tok/s) step 692/76294 | train loss 4.712460 | norm 0.9761 | lr 1.16e-03 | (3799.10 ms | 138003 tok/s) step 693/76294 | train loss 4.736419 | norm 0.8147 | lr 
1.17e-03 | (3812.25 ms | 137527 tok/s) step 694/76294 | train loss 4.721713 | norm 1.6051 | lr 1.17e-03 | (3817.96 ms | 137322 tok/s) step 695/76294 | train loss 4.763046 | norm 0.6044 | lr 1.17e-03 | (3802.65 ms | 137874 tok/s) step 696/76294 | train loss 4.725577 | norm 0.8458 | lr 1.17e-03 | (3818.38 ms | 137306 tok/s) step 697/76294 | train loss 4.738531 | norm 0.9484 | lr 1.17e-03 | (3906.17 ms | 134221 tok/s) step 698/76294 | train loss 4.747079 | norm 0.9604 | lr 1.17e-03 | (3866.15 ms | 135610 tok/s) step 699/76294 | train loss 4.705425 | norm 0.9103 | lr 1.17e-03 | (3797.13 ms | 138075 tok/s) step 700/76294 | train loss 4.675272 | norm 1.0383 | lr 1.18e-03 | (3804.08 ms | 137823 tok/s) step 701/76294 | train loss 4.736641 | norm 0.6882 | lr 1.18e-03 | (3818.40 ms | 137306 tok/s) step 702/76294 | train loss 4.685051 | norm 0.6018 | lr 1.18e-03 | (3806.16 ms | 137747 tok/s) step 703/76294 | train loss 4.693089 | norm 0.4976 | lr 1.18e-03 | (3803.90 ms | 137829 tok/s) step 704/76294 | train loss 4.761346 | norm 0.5408 | lr 1.18e-03 | (3799.67 ms | 137982 tok/s) step 705/76294 | train loss 4.535616 | norm 0.4644 | lr 1.18e-03 | (3803.43 ms | 137846 tok/s) step 706/76294 | train loss 4.643229 | norm 0.4319 | lr 1.18e-03 | (3972.72 ms | 131972 tok/s) step 707/76294 | train loss 4.636609 | norm 0.4474 | lr 1.19e-03 | (3798.90 ms | 138010 tok/s) step 708/76294 | train loss 4.645967 | norm 0.3605 | lr 1.19e-03 | (3840.33 ms | 136522 tok/s) step 709/76294 | train loss 4.650891 | norm 0.4265 | lr 1.19e-03 | (3798.26 ms | 138034 tok/s) step 710/76294 | train loss 4.660038 | norm 0.4289 | lr 1.19e-03 | (3802.55 ms | 137878 tok/s) step 711/76294 | train loss 4.627226 | norm 0.3838 | lr 1.19e-03 | (3815.94 ms | 137394 tok/s) step 712/76294 | train loss 4.575212 | norm 0.3224 | lr 1.19e-03 | (3796.79 ms | 138087 tok/s) step 713/76294 | train loss 4.606234 | norm 0.3995 | lr 1.20e-03 | (3820.67 ms | 137224 tok/s) step 714/76294 | train loss 4.593599 | norm 0.3526 | lr 
1.20e-03 | (3795.56 ms | 138132 tok/s) step 715/76294 | train loss 4.611602 | norm 0.3183 | lr 1.20e-03 | (3812.70 ms | 137511 tok/s) step 716/76294 | train loss 4.619802 | norm 0.3956 | lr 1.20e-03 | (4019.65 ms | 130431 tok/s) step 717/76294 | train loss 4.535412 | norm 0.3530 | lr 1.20e-03 | (3803.41 ms | 137847 tok/s) step 718/76294 | train loss 4.550380 | norm 0.3083 | lr 1.20e-03 | (3924.72 ms | 133586 tok/s) step 719/76294 | train loss 4.589515 | norm 0.2743 | lr 1.20e-03 | (3797.54 ms | 138060 tok/s) step 720/76294 | train loss 4.626167 | norm 0.3151 | lr 1.20e-03 | (3798.93 ms | 138009 tok/s) step 721/76294 | train loss 4.512656 | norm 0.3081 | lr 1.20e-03 | (3818.32 ms | 137309 tok/s) step 722/76294 | train loss 4.544876 | norm 0.3381 | lr 1.20e-03 | (3802.66 ms | 137874 tok/s) step 723/76294 | train loss 4.563670 | norm 0.3824 | lr 1.20e-03 | (3794.73 ms | 138162 tok/s) step 724/76294 | train loss 4.563369 | norm 0.4099 | lr 1.20e-03 | (3829.55 ms | 136906 tok/s) step 725/76294 | train loss 4.614543 | norm 0.5371 | lr 1.20e-03 | (3796.90 ms | 138083 tok/s) step 726/76294 | train loss 4.554152 | norm 0.6251 | lr 1.20e-03 | (3802.60 ms | 137876 tok/s) step 727/76294 | train loss 4.586936 | norm 0.6277 | lr 1.20e-03 | (3817.40 ms | 137342 tok/s) step 728/76294 | train loss 4.526227 | norm 0.7053 | lr 1.20e-03 | (3794.63 ms | 138166 tok/s) step 729/76294 | train loss 4.586399 | norm 0.5833 | lr 1.20e-03 | (3794.05 ms | 138187 tok/s) step 730/76294 | train loss 4.643964 | norm 0.4459 | lr 1.20e-03 | (3854.13 ms | 136033 tok/s) step 731/76294 | train loss 4.494591 | norm 0.4294 | lr 1.20e-03 | (3801.45 ms | 137918 tok/s) step 732/76294 | train loss 4.579269 | norm 0.3625 | lr 1.20e-03 | (3799.33 ms | 137995 tok/s) step 733/76294 | train loss 4.532958 | norm 0.3107 | lr 1.20e-03 | (3819.40 ms | 137270 tok/s) step 734/76294 | train loss 4.574043 | norm 0.3312 | lr 1.20e-03 | (3884.96 ms | 134953 tok/s) step 735/76294 | train loss 4.502861 | norm 0.3276 | lr 
1.20e-03 | (3797.69 ms | 138054 tok/s) step 736/76294 | train loss 4.548028 | norm 0.3565 | lr 1.20e-03 | (4687.57 ms | 111846 tok/s) step 737/76294 | train loss 4.621146 | norm 0.3745 | lr 1.20e-03 | (3801.19 ms | 137927 tok/s) step 738/76294 | train loss 4.543201 | norm 0.3067 | lr 1.20e-03 | (3807.78 ms | 137688 tok/s) step 739/76294 | train loss 4.540053 | norm 0.2778 | lr 1.20e-03 | (3796.00 ms | 138116 tok/s) step 740/76294 | train loss 4.559952 | norm 0.2845 | lr 1.20e-03 | (3802.83 ms | 137868 tok/s) step 741/76294 | train loss 4.579475 | norm 0.4917 | lr 1.20e-03 | (3819.92 ms | 137251 tok/s) step 742/76294 | train loss 4.549293 | norm 0.3640 | lr 1.20e-03 | (3802.97 ms | 137863 tok/s) step 743/76294 | train loss 4.480515 | norm 0.4784 | lr 1.20e-03 | (4218.36 ms | 124287 tok/s) step 744/76294 | train loss 4.518069 | norm 0.6133 | lr 1.20e-03 | (3872.03 ms | 135404 tok/s) step 745/76294 | train loss 4.496179 | norm 0.4361 | lr 1.20e-03 | (3806.54 ms | 137734 tok/s) step 746/76294 | train loss 4.520130 | norm 0.4454 | lr 1.20e-03 | (3797.53 ms | 138060 tok/s) step 747/76294 | train loss 4.511398 | norm 0.4465 | lr 1.20e-03 | (3842.18 ms | 136456 tok/s) step 748/76294 | train loss 4.447997 | norm 0.4523 | lr 1.20e-03 | (3796.99 ms | 138080 tok/s) step 749/76294 | train loss 4.552755 | norm 0.5451 | lr 1.20e-03 | (3800.59 ms | 137949 tok/s) step 750/76294 | train loss 4.466392 | norm 0.4665 | lr 1.20e-03 | (3817.83 ms | 137326 tok/s) val loss: 4.532447 saving model checkpoint to ./results/gpt2-124M-gqa/step_750.pth step 751/76294 | train loss 4.470891 | norm 0.3692 | lr 1.20e-03 | (3793.71 ms | 138199 tok/s) step 752/76294 | train loss 4.522638 | norm 0.4342 | lr 1.20e-03 | (3814.82 ms | 137435 tok/s) step 753/76294 | train loss 4.571917 | norm 0.5069 | lr 1.20e-03 | (3804.46 ms | 137809 tok/s) step 754/76294 | train loss 4.482885 | norm 0.5489 | lr 1.20e-03 | (3831.22 ms | 136846 tok/s) step 755/76294 | train loss 4.521100 | norm 0.5060 | lr 1.20e-03 | 
(3822.76 ms | 137149 tok/s) step 756/76294 | train loss 4.523591 | norm 0.4100 | lr 1.20e-03 | (3793.64 ms | 138202 tok/s) step 757/76294 | train loss 4.462473 | norm 0.3450 | lr 1.20e-03 | (3800.41 ms | 137956 tok/s) step 758/76294 | train loss 4.508999 | norm 0.3724 | lr 1.20e-03 | (3799.90 ms | 137974 tok/s) step 759/76294 | train loss 4.526753 | norm 0.3825 | lr 1.20e-03 | (3825.15 ms | 137063 tok/s) step 760/76294 | train loss 4.467501 | norm 0.3573 | lr 1.20e-03 | (3797.58 ms | 138058 tok/s) step 761/76294 | train loss 4.467376 | norm 0.3877 | lr 1.20e-03 | (3806.95 ms | 137719 tok/s) step 762/76294 | train loss 4.442953 | norm 0.3107 | lr 1.20e-03 | (3818.29 ms | 137310 tok/s) step 763/76294 | train loss 4.470520 | norm 0.3002 | lr 1.20e-03 | (4117.63 ms | 127328 tok/s) step 764/76294 | train loss 4.437695 | norm 0.3390 | lr 1.20e-03 | (3824.13 ms | 137100 tok/s) step 765/76294 | train loss 4.440620 | norm 0.4369 | lr 1.20e-03 | (3798.42 ms | 138028 tok/s) step 766/76294 | train loss 4.496768 | norm 0.4863 | lr 1.20e-03 | (3797.24 ms | 138071 tok/s) step 767/76294 | train loss 4.569575 | norm 0.4913 | lr 1.20e-03 | (3852.15 ms | 136103 tok/s) step 768/76294 | train loss 4.389648 | norm 0.5009 | lr 1.20e-03 | (3795.62 ms | 138130 tok/s) step 769/76294 | train loss 4.467929 | norm 0.5533 | lr 1.20e-03 | (3803.36 ms | 137849 tok/s) step 770/76294 | train loss 4.476799 | norm 0.4752 | lr 1.20e-03 | (3853.30 ms | 136062 tok/s) step 771/76294 | train loss 4.425078 | norm 0.4862 | lr 1.20e-03 | (3814.88 ms | 137432 tok/s) step 772/76294 | train loss 4.472411 | norm 0.4857 | lr 1.20e-03 | (3797.15 ms | 138074 tok/s) step 773/76294 | train loss 4.456757 | norm 0.4984 | lr 1.20e-03 | (4006.71 ms | 130853 tok/s) step 774/76294 | train loss 4.453773 | norm 0.6042 | lr 1.20e-03 | (3797.44 ms | 138064 tok/s) step 775/76294 | train loss 4.532097 | norm 0.6328 | lr 1.20e-03 | (3847.94 ms | 136252 tok/s) step 776/76294 | train loss 4.489301 | norm 0.5464 | lr 1.20e-03 | 
(3798.29 ms | 138033 tok/s) step 777/76294 | train loss 4.475792 | norm 0.5141 | lr 1.20e-03 | (3802.70 ms | 137872 tok/s) step 778/76294 | train loss 4.514676 | norm 0.4802 | lr 1.20e-03 | (3819.26 ms | 137275 tok/s) step 779/76294 | train loss 4.446939 | norm 0.3945 | lr 1.20e-03 | (3803.81 ms | 137832 tok/s) step 780/76294 | train loss 4.487207 | norm 0.4342 | lr 1.20e-03 | (3803.45 ms | 137845 tok/s) step 781/76294 | train loss 4.494960 | norm 0.4031 | lr 1.20e-03 | (3803.91 ms | 137829 tok/s) step 782/76294 | train loss 4.538570 | norm 0.4547 | lr 1.20e-03 | (3796.29 ms | 138105 tok/s) step 783/76294 | train loss 4.483261 | norm 0.4762 | lr 1.20e-03 | (3882.40 ms | 135042 tok/s) step 784/76294 | train loss 4.457002 | norm 0.5110 | lr 1.20e-03 | (3797.77 ms | 138051 tok/s) step 785/76294 | train loss 4.482006 | norm 0.4369 | lr 1.20e-03 | (3801.17 ms | 137928 tok/s) step 786/76294 | train loss 4.498071 | norm 0.4304 | lr 1.20e-03 | (3821.46 ms | 137196 tok/s) step 787/76294 | train loss 4.446136 | norm 0.3719 | lr 1.20e-03 | (3828.70 ms | 136936 tok/s) step 788/76294 | train loss 4.449642 | norm 0.3466 | lr 1.20e-03 | (3820.06 ms | 137246 tok/s) step 789/76294 | train loss 4.404840 | norm 0.2695 | lr 1.20e-03 | (3801.12 ms | 137930 tok/s) step 790/76294 | train loss 4.388740 | norm 0.2629 | lr 1.20e-03 | (3799.66 ms | 137983 tok/s) step 791/76294 | train loss 4.400074 | norm 0.2641 | lr 1.20e-03 | (3865.02 ms | 135649 tok/s) step 792/76294 | train loss 4.450519 | norm 0.2819 | lr 1.20e-03 | (3796.60 ms | 138094 tok/s) step 793/76294 | train loss 4.441277 | norm 0.3444 | lr 1.20e-03 | (3823.87 ms | 137109 tok/s) step 794/76294 | train loss 4.511074 | norm 0.4655 | lr 1.20e-03 | (3797.92 ms | 138046 tok/s) step 795/76294 | train loss 4.473820 | norm 0.6313 | lr 1.20e-03 | (3809.63 ms | 137622 tok/s) step 796/76294 | train loss 4.421328 | norm 0.6170 | lr 1.20e-03 | (3798.57 ms | 138022 tok/s) step 797/76294 | train loss 4.497255 | norm 0.6328 | lr 1.20e-03 | 
(3802.49 ms | 137880 tok/s) step 798/76294 | train loss 4.457414 | norm 0.5456 | lr 1.20e-03 | (3932.65 ms | 133317 tok/s) step 799/76294 | train loss 4.461659 | norm 0.4498 | lr 1.20e-03 | (3803.00 ms | 137862 tok/s) step 800/76294 | train loss 4.511483 | norm 0.4083 | lr 1.20e-03 | (3804.42 ms | 137810 tok/s) step 801/76294 | train loss 4.453733 | norm 0.3581 | lr 1.20e-03 | (3824.78 ms | 137077 tok/s) step 802/76294 | train loss 4.401231 | norm 0.3533 | lr 1.20e-03 | (3821.38 ms | 137199 tok/s) step 803/76294 | train loss 4.463452 | norm 0.3558 | lr 1.20e-03 | (3806.99 ms | 137717 tok/s) step 804/76294 | train loss 4.341486 | norm 0.3354 | lr 1.20e-03 | (3798.52 ms | 138024 tok/s) step 805/76294 | train loss 4.401066 | norm 0.3094 | lr 1.20e-03 | (3806.04 ms | 137752 tok/s) step 806/76294 | train loss 4.406378 | norm 0.3476 | lr 1.20e-03 | (3856.22 ms | 135959 tok/s) step 807/76294 | train loss 4.406153 | norm 0.3959 | lr 1.20e-03 | (3810.05 ms | 137607 tok/s) step 808/76294 | train loss 4.462713 | norm 0.3356 | lr 1.20e-03 | (3829.29 ms | 136915 tok/s) step 809/76294 | train loss 4.399265 | norm 0.3724 | lr 1.20e-03 | (3899.69 ms | 134443 tok/s) step 810/76294 | train loss 4.439600 | norm 0.3988 | lr 1.20e-03 | (3823.30 ms | 137130 tok/s) step 811/76294 | train loss 4.373953 | norm 0.3429 | lr 1.20e-03 | (3806.63 ms | 137730 tok/s) step 812/76294 | train loss 4.401480 | norm 0.3379 | lr 1.20e-03 | (3819.01 ms | 137284 tok/s) step 813/76294 | train loss 4.381671 | norm 0.3432 | lr 1.20e-03 | (4065.67 ms | 128955 tok/s) step 814/76294 | train loss 4.499572 | norm 0.3926 | lr 1.20e-03 | (3901.60 ms | 134378 tok/s) step 815/76294 | train loss 4.416818 | norm 0.3421 | lr 1.20e-03 | (3892.93 ms | 134677 tok/s) step 816/76294 | train loss 4.358050 | norm 0.3443 | lr 1.20e-03 | (3797.14 ms | 138075 tok/s) step 817/76294 | train loss 4.421663 | norm 0.3335 | lr 1.20e-03 | (3860.88 ms | 135795 tok/s) step 818/76294 | train loss 4.430489 | norm 0.3377 | lr 1.20e-03 | 
(3801.47 ms | 137917 tok/s) step 819/76294 | train loss 4.354394 | norm 0.3320 | lr 1.20e-03 | (3802.69 ms | 137873 tok/s) step 820/76294 | train loss 4.382137 | norm 0.3276 | lr 1.20e-03 | (3823.33 ms | 137129 tok/s) step 821/76294 | train loss 4.367378 | norm 0.4627 | lr 1.20e-03 | (3837.59 ms | 136619 tok/s) step 822/76294 | train loss 4.425168 | norm 0.5433 | lr 1.20e-03 | (3824.91 ms | 137072 tok/s) step 823/76294 | train loss 4.400776 | norm 0.6781 | lr 1.20e-03 | (3967.05 ms | 132161 tok/s) step 824/76294 | train loss 4.387663 | norm 0.4626 | lr 1.20e-03 | (3797.93 ms | 138046 tok/s) step 825/76294 | train loss 4.398641 | norm 0.5274 | lr 1.20e-03 | (3839.62 ms | 136547 tok/s) step 826/76294 | train loss 4.427072 | norm 0.4675 | lr 1.20e-03 | (3801.08 ms | 137931 tok/s) step 827/76294 | train loss 4.344629 | norm 0.4426 | lr 1.20e-03 | (3813.51 ms | 137482 tok/s) step 828/76294 | train loss 4.434073 | norm 0.4175 | lr 1.20e-03 | (3835.30 ms | 136701 tok/s) step 829/76294 | train loss 4.265455 | norm 0.3847 | lr 1.20e-03 | (3804.04 ms | 137824 tok/s) step 830/76294 | train loss 4.344467 | norm 0.5266 | lr 1.20e-03 | (3832.29 ms | 136808 tok/s) step 831/76294 | train loss 4.400757 | norm 0.4226 | lr 1.20e-03 | (3802.11 ms | 137894 tok/s) step 832/76294 | train loss 4.451308 | norm 0.4156 | lr 1.20e-03 | (3799.60 ms | 137985 tok/s) step 833/76294 | train loss 4.397631 | norm 0.3633 | lr 1.20e-03 | (3867.61 ms | 135559 tok/s) step 834/76294 | train loss 4.377021 | norm 0.3860 | lr 1.20e-03 | (3802.71 ms | 137872 tok/s) step 835/76294 | train loss 4.429913 | norm 0.3595 | lr 1.20e-03 | (3865.68 ms | 135626 tok/s) step 836/76294 | train loss 4.349862 | norm 0.3712 | lr 1.20e-03 | (3795.64 ms | 138129 tok/s) step 837/76294 | train loss 4.380701 | norm 0.3888 | lr 1.20e-03 | (3840.65 ms | 136510 tok/s) step 838/76294 | train loss 4.428060 | norm 0.4105 | lr 1.20e-03 | (3795.16 ms | 138146 tok/s) step 839/76294 | train loss 4.510098 | norm 0.4322 | lr 1.20e-03 | 
(3809.18 ms | 137638 tok/s) step 840/76294 | train loss 4.378586 | norm 0.4174 | lr 1.20e-03 | (3799.88 ms | 137975 tok/s) step 841/76294 | train loss 4.350452 | norm 0.3874 | lr 1.20e-03 | (3804.53 ms | 137806 tok/s) step 842/76294 | train loss 4.345348 | norm 0.3085 | lr 1.20e-03 | (3821.43 ms | 137197 tok/s) step 843/76294 | train loss 4.350604 | norm 0.3008 | lr 1.20e-03 | (3800.55 ms | 137950 tok/s) step 844/76294 | train loss 4.394919 | norm 0.3109 | lr 1.20e-03 | (3894.81 ms | 134612 tok/s) step 845/76294 | train loss 4.401634 | norm 0.3903 | lr 1.20e-03 | (3800.59 ms | 137949 tok/s) step 846/76294 | train loss 4.364585 | norm 0.5181 | lr 1.20e-03 | (3913.88 ms | 133956 tok/s) step 847/76294 | train loss 4.300776 | norm 0.5367 | lr 1.20e-03 | (3798.22 ms | 138035 tok/s) step 848/76294 | train loss 4.352232 | norm 0.4250 | lr 1.20e-03 | (3801.13 ms | 137930 tok/s) step 849/76294 | train loss 4.350521 | norm 0.3927 | lr 1.20e-03 | (3824.45 ms | 137089 tok/s) step 850/76294 | train loss 4.429572 | norm 0.3104 | lr 1.20e-03 | (3802.92 ms | 137865 tok/s) step 851/76294 | train loss 4.339036 | norm 0.2784 | lr 1.20e-03 | (3799.14 ms | 138002 tok/s) step 852/76294 | train loss 4.336338 | norm 0.2985 | lr 1.20e-03 | (3836.36 ms | 136663 tok/s) step 853/76294 | train loss 4.355555 | norm 0.2695 | lr 1.20e-03 | (3796.34 ms | 138104 tok/s) step 854/76294 | train loss 4.341925 | norm 0.2867 | lr 1.20e-03 | (3888.72 ms | 134823 tok/s) step 855/76294 | train loss 4.288145 | norm 0.2953 | lr 1.20e-03 | (3815.45 ms | 137412 tok/s) step 856/76294 | train loss 4.382767 | norm 0.3093 | lr 1.20e-03 | (3801.08 ms | 137931 tok/s) step 857/76294 | train loss 4.300941 | norm 0.3996 | lr 1.20e-03 | (3845.11 ms | 136352 tok/s) step 858/76294 | train loss 4.343958 | norm 0.4833 | lr 1.20e-03 | (3800.40 ms | 137956 tok/s) step 859/76294 | train loss 4.336622 | norm 0.4698 | lr 1.20e-03 | (3797.69 ms | 138054 tok/s) step 860/76294 | train loss 4.410268 | norm 0.4064 | lr 1.20e-03 | 
(3827.81 ms | 136968 tok/s) step 861/76294 | train loss 4.322435 | norm 0.4112 | lr 1.20e-03 | (3795.74 ms | 138125 tok/s) step 862/76294 | train loss 4.383589 | norm 0.3356 | lr 1.20e-03 | (3818.52 ms | 137301 tok/s) step 863/76294 | train loss 4.350991 | norm 0.2843 | lr 1.20e-03 | (3797.59 ms | 138058 tok/s) step 864/76294 | train loss 4.349699 | norm 0.2983 | lr 1.20e-03 | (3918.91 ms | 133784 tok/s) step 865/76294 | train loss 4.307905 | norm 0.2800 | lr 1.20e-03 | (3799.15 ms | 138001 tok/s) step 866/76294 | train loss 4.306241 | norm 0.3082 | lr 1.20e-03 | (3854.43 ms | 136022 tok/s) step 867/76294 | train loss 4.353792 | norm 0.3251 | lr 1.20e-03 | (3798.18 ms | 138037 tok/s) step 868/76294 | train loss 4.315105 | norm 0.3680 | lr 1.20e-03 | (3806.31 ms | 137742 tok/s) step 869/76294 | train loss 4.361730 | norm 0.3556 | lr 1.20e-03 | (3820.63 ms | 137225 tok/s) step 870/76294 | train loss 4.388369 | norm 0.8728 | lr 1.20e-03 | (3802.95 ms | 137863 tok/s) step 871/76294 | train loss 4.282565 | norm 0.2885 | lr 1.20e-03 | (4098.78 ms | 127913 tok/s) step 872/76294 | train loss 4.271465 | norm 0.2824 | lr 1.20e-03 | (3796.42 ms | 138101 tok/s) step 873/76294 | train loss 4.310425 | norm 0.3095 | lr 1.20e-03 | (3803.10 ms | 137858 tok/s) step 874/76294 | train loss 4.265750 | norm 0.3229 | lr 1.20e-03 | (3818.59 ms | 137299 tok/s) step 875/76294 | train loss 4.349170 | norm 0.3160 | lr 1.20e-03 | (3806.32 ms | 137741 tok/s) step 876/76294 | train loss 4.298642 | norm 0.3085 | lr 1.20e-03 | (3822.50 ms | 137158 tok/s) step 877/76294 | train loss 4.345303 | norm 0.6215 | lr 1.20e-03 | (3801.76 ms | 137907 tok/s) step 878/76294 | train loss 4.382883 | norm 0.8029 | lr 1.20e-03 | (3798.24 ms | 138034 tok/s) step 879/76294 | train loss 4.428228 | norm 1.2662 | lr 1.20e-03 | (3838.09 ms | 136601 tok/s) step 880/76294 | train loss 4.391432 | norm 0.6691 | lr 1.20e-03 | (3800.61 ms | 137948 tok/s) step 881/76294 | train loss 4.488982 | norm 1.7853 | lr 1.20e-03 | 
(3805.43 ms | 137774 tok/s) step 882/76294 | train loss 4.430027 | norm 1.9784 | lr 1.20e-03 | (3826.74 ms | 137006 tok/s) step 883/76294 | train loss 4.401073 | norm 1.2093 | lr 1.20e-03 | (3827.78 ms | 136969 tok/s) step 884/76294 | train loss 4.364349 | norm 1.3386 | lr 1.20e-03 | (3803.15 ms | 137856 tok/s) step 885/76294 | train loss 4.366924 | norm 0.6863 | lr 1.20e-03 | (4107.30 ms | 127648 tok/s) step 886/76294 | train loss 4.290114 | norm 0.8634 | lr 1.20e-03 | (3804.37 ms | 137812 tok/s) step 887/76294 | train loss 4.429829 | norm 0.8198 | lr 1.20e-03 | (3924.09 ms | 133607 tok/s) step 888/76294 | train loss 4.375212 | norm 0.7736 | lr 1.20e-03 | (3810.88 ms | 137577 tok/s) step 889/76294 | train loss 4.391481 | norm 0.5311 | lr 1.20e-03 | (3846.25 ms | 136311 tok/s) step 890/76294 | train loss 4.318640 | norm 0.4408 | lr 1.20e-03 | (3836.21 ms | 136668 tok/s) step 891/76294 | train loss 4.345456 | norm 0.5073 | lr 1.20e-03 | (3813.21 ms | 137493 tok/s) step 892/76294 | train loss 4.374644 | norm 1.0559 | lr 1.20e-03 | (3807.30 ms | 137706 tok/s) step 893/76294 | train loss 4.334461 | norm 0.4010 | lr 1.20e-03 | (3873.78 ms | 135343 tok/s) step 894/76294 | train loss 4.378806 | norm 5.7122 | lr 1.20e-03 | (3805.87 ms | 137758 tok/s) step 895/76294 | train loss 4.415453 | norm 0.4612 | lr 1.20e-03 | (3817.45 ms | 137340 tok/s) step 896/76294 | train loss 4.335478 | norm 0.4503 | lr 1.20e-03 | (3830.74 ms | 136863 tok/s) step 897/76294 | train loss 4.350575 | norm 0.4479 | lr 1.20e-03 | (3828.07 ms | 136959 tok/s) step 898/76294 | train loss 4.371380 | norm 0.6079 | lr 1.20e-03 | (3810.20 ms | 137601 tok/s) step 899/76294 | train loss 4.361083 | norm 0.5229 | lr 1.20e-03 | (3809.76 ms | 137617 tok/s) step 900/76294 | train loss 4.371122 | norm 0.5440 | lr 1.20e-03 | (3805.34 ms | 137777 tok/s) step 901/76294 | train loss 4.375383 | norm 0.3761 | lr 1.20e-03 | (3840.80 ms | 136505 tok/s) step 902/76294 | train loss 4.345207 | norm 0.3635 | lr 1.20e-03 | 
(3807.75 ms | 137690 tok/s) step 903/76294 | train loss 4.291062 | norm 1.8099 | lr 1.20e-03 | (3825.04 ms | 137067 tok/s) step 904/76294 | train loss 4.325805 | norm 0.3405 | lr 1.20e-03 | (3837.62 ms | 136618 tok/s) step 905/76294 | train loss 4.252574 | norm 0.3856 | lr 1.20e-03 | (3810.55 ms | 137589 tok/s) step 906/76294 | train loss 4.331892 | norm 0.3551 | lr 1.20e-03 | (3926.01 ms | 133542 tok/s) step 907/76294 | train loss 4.306589 | norm 0.3539 | lr 1.20e-03 | (3850.29 ms | 136168 tok/s) step 908/76294 | train loss 4.274110 | norm 0.3265 | lr 1.20e-03 | (3832.50 ms | 136801 tok/s) step 909/76294 | train loss 4.296875 | norm 0.3405 | lr 1.20e-03 | (3830.55 ms | 136870 tok/s) step 910/76294 | train loss 4.303616 | norm 0.3823 | lr 1.20e-03 | (3796.96 ms | 138081 tok/s) step 911/76294 | train loss 4.280109 | norm 0.3313 | lr 1.20e-03 | (3833.96 ms | 136748 tok/s) step 912/76294 | train loss 4.283741 | norm 0.3242 | lr 1.20e-03 | (3798.51 ms | 138025 tok/s) step 913/76294 | train loss 4.286853 | norm 0.3334 | lr 1.20e-03 | (3855.81 ms | 135974 tok/s) step 914/76294 | train loss 4.267423 | norm 0.2924 | lr 1.20e-03 | (3797.95 ms | 138045 tok/s) step 915/76294 | train loss 4.351677 | norm 0.3895 | lr 1.20e-03 | (3806.68 ms | 137728 tok/s) step 916/76294 | train loss 4.341975 | norm 0.3824 | lr 1.20e-03 | (3830.96 ms | 136855 tok/s) step 917/76294 | train loss 4.318256 | norm 0.4560 | lr 1.20e-03 | (3818.60 ms | 137299 tok/s) step 918/76294 | train loss 4.217251 | norm 0.4685 | lr 1.20e-03 | (3796.81 ms | 138087 tok/s) step 919/76294 | train loss 4.462077 | norm 0.3700 | lr 1.20e-03 | (3858.32 ms | 135885 tok/s) step 920/76294 | train loss 4.237272 | norm 0.4557 | lr 1.20e-03 | (3800.83 ms | 137940 tok/s) step 921/76294 | train loss 4.279399 | norm 0.5049 | lr 1.20e-03 | (3831.77 ms | 136827 tok/s) step 922/76294 | train loss 4.326188 | norm 0.8931 | lr 1.20e-03 | (3823.41 ms | 137126 tok/s) step 923/76294 | train loss 4.346191 | norm 0.9772 | lr 1.20e-03 | 
(3801.03 ms | 137933 tok/s) step 924/76294 | train loss 4.333104 | norm 0.6573 | lr 1.20e-03 | (3801.74 ms | 137908 tok/s) step 925/76294 | train loss 4.372249 | norm 0.7364 | lr 1.20e-03 | (3844.08 ms | 136388 tok/s) step 926/76294 | train loss 4.300489 | norm 0.6050 | lr 1.20e-03 | (3809.25 ms | 137636 tok/s) step 927/76294 | train loss 4.412057 | norm 0.4842 | lr 1.20e-03 | (3968.76 ms | 132104 tok/s) step 928/76294 | train loss 4.344876 | norm 0.4254 | lr 1.20e-03 | (3850.52 ms | 136160 tok/s) step 929/76294 | train loss 4.288085 | norm 0.3123 | lr 1.20e-03 | (3801.11 ms | 137930 tok/s) step 930/76294 | train loss 4.304236 | norm 0.2874 | lr 1.20e-03 | (3823.22 ms | 137133 tok/s) step 931/76294 | train loss 4.363297 | norm 0.2788 | lr 1.20e-03 | (3822.42 ms | 137161 tok/s) step 932/76294 | train loss 4.271602 | norm 0.3202 | lr 1.20e-03 | (3801.13 ms | 137929 tok/s) step 933/76294 | train loss 4.332655 | norm 0.3427 | lr 1.20e-03 | (3800.09 ms | 137967 tok/s) step 934/76294 | train loss 4.256177 | norm 0.4108 | lr 1.20e-03 | (3833.57 ms | 136762 tok/s) step 935/76294 | train loss 4.303058 | norm 0.3818 | lr 1.20e-03 | (3796.62 ms | 138093 tok/s) step 936/76294 | train loss 4.260407 | norm 0.3352 | lr 1.20e-03 | (3811.05 ms | 137571 tok/s) step 937/76294 | train loss 4.322376 | norm 0.3839 | lr 1.20e-03 | (3796.77 ms | 138088 tok/s) step 938/76294 | train loss 4.219247 | norm 0.2592 | lr 1.20e-03 | (3842.53 ms | 136443 tok/s) step 939/76294 | train loss 4.245533 | norm 0.2482 | lr 1.20e-03 | (3835.82 ms | 136682 tok/s) step 940/76294 | train loss 4.291022 | norm 0.2599 | lr 1.20e-03 | (3917.90 ms | 133819 tok/s) step 941/76294 | train loss 4.281989 | norm 0.2761 | lr 1.20e-03 | (4482.57 ms | 116961 tok/s) step 942/76294 | train loss 4.303400 | norm 0.2472 | lr 1.20e-03 | (3838.94 ms | 136571 tok/s) step 943/76294 | train loss 4.305167 | norm 0.3059 | lr 1.20e-03 | (4372.75 ms | 119899 tok/s) step 944/76294 | train loss 4.245562 | norm 0.3704 | lr 1.20e-03 | 
(3820.22 ms | 137240 tok/s) step 945/76294 | train loss 4.258888 | norm 0.3808 | lr 1.20e-03 | (3951.64 ms | 132676 tok/s) step 946/76294 | train loss 4.240307 | norm 0.3523 | lr 1.20e-03 | (3814.97 ms | 137429 tok/s) step 947/76294 | train loss 4.222216 | norm 0.3146 | lr 1.20e-03 | (4608.82 ms | 113757 tok/s) step 948/76294 | train loss 4.211462 | norm 0.2819 | lr 1.20e-03 | (3795.77 ms | 138124 tok/s) step 949/76294 | train loss 4.237491 | norm 0.2801 | lr 1.20e-03 | (3822.55 ms | 137156 tok/s) step 950/76294 | train loss 4.236825 | norm 0.2594 | lr 1.20e-03 | (3794.03 ms | 138188 tok/s) step 951/76294 | train loss 4.232575 | norm 0.2914 | lr 1.20e-03 | (3803.53 ms | 137843 tok/s) step 952/76294 | train loss 4.144501 | norm 0.2723 | lr 1.20e-03 | (3824.65 ms | 137081 tok/s) step 953/76294 | train loss 4.287350 | norm 0.2489 | lr 1.20e-03 | (3797.99 ms | 138043 tok/s) step 954/76294 | train loss 4.253815 | norm 0.2665 | lr 1.20e-03 | (4185.89 ms | 125251 tok/s) step 955/76294 | train loss 4.276869 | norm 0.2927 | lr 1.20e-03 | (3849.00 ms | 136214 tok/s) step 956/76294 | train loss 4.307074 | norm 0.3368 | lr 1.20e-03 | (3794.99 ms | 138153 tok/s) step 957/76294 | train loss 4.295305 | norm 0.3493 | lr 1.20e-03 | (4014.12 ms | 130611 tok/s) step 958/76294 | train loss 4.276824 | norm 0.3726 | lr 1.20e-03 | (3809.39 ms | 137631 tok/s) step 959/76294 | train loss 4.282786 | norm 0.3596 | lr 1.20e-03 | (5492.67 ms | 95452 tok/s) step 960/76294 | train loss 4.261883 | norm 0.3056 | lr 1.20e-03 | (3790.38 ms | 138321 tok/s) step 961/76294 | train loss 4.225372 | norm 0.3238 | lr 1.20e-03 | (3859.54 ms | 135842 tok/s) step 962/76294 | train loss 4.261631 | norm 0.3080 | lr 1.20e-03 | (3790.75 ms | 138307 tok/s) step 963/76294 | train loss 4.207123 | norm 0.2840 | lr 1.20e-03 | (3857.18 ms | 135925 tok/s) step 964/76294 | train loss 4.159183 | norm 0.3050 | lr 1.20e-03 | (3797.31 ms | 138068 tok/s) step 965/76294 | train loss 4.239944 | norm 0.2867 | lr 1.20e-03 | 
(3845.87 ms | 136325 tok/s) step 966/76294 | train loss 4.194913 | norm 0.2493 | lr 1.20e-03 | (9347.35 ms | 56090 tok/s) step 967/76294 | train loss 4.166304 | norm 0.2467 | lr 1.20e-03 | (3971.48 ms | 132013 tok/s) step 968/76294 | train loss 4.196281 | norm 0.2711 | lr 1.20e-03 | (3816.46 ms | 137376 tok/s) step 969/76294 | train loss 4.179584 | norm 0.2412 | lr 1.20e-03 | (3915.29 ms | 133908 tok/s) step 970/76294 | train loss 4.183441 | norm 0.2638 | lr 1.20e-03 | (3866.58 ms | 135595 tok/s) step 971/76294 | train loss 4.216402 | norm 0.2972 | lr 1.20e-03 | (3787.36 ms | 138431 tok/s) step 972/76294 | train loss 4.233347 | norm 0.3499 | lr 1.20e-03 | (3856.45 ms | 135951 tok/s) step 973/76294 | train loss 4.266411 | norm 0.3372 | lr 1.20e-03 | (3797.94 ms | 138045 tok/s) step 974/76294 | train loss 4.295884 | norm 0.3013 | lr 1.20e-03 | (4558.75 ms | 115007 tok/s) step 975/76294 | train loss 4.201497 | norm 0.4561 | lr 1.20e-03 | (3793.29 ms | 138214 tok/s) step 976/76294 | train loss 4.246611 | norm 0.4219 | lr 1.20e-03 | (4125.88 ms | 127073 tok/s) step 977/76294 | train loss 4.184311 | norm 0.7511 | lr 1.20e-03 | (3806.98 ms | 137718 tok/s) step 978/76294 | train loss 4.212726 | norm 0.4055 | lr 1.20e-03 | (3805.17 ms | 137783 tok/s) step 979/76294 | train loss 4.330221 | norm 0.3651 | lr 1.20e-03 | (3817.08 ms | 137353 tok/s) step 980/76294 | train loss 4.182662 | norm 0.3409 | lr 1.20e-03 | (3799.90 ms | 137974 tok/s) step 981/76294 | train loss 4.249273 | norm 0.3263 | lr 1.20e-03 | (3794.33 ms | 138177 tok/s) step 982/76294 | train loss 4.233955 | norm 0.2994 | lr 1.20e-03 | (3849.08 ms | 136211 tok/s) step 983/76294 | train loss 4.205619 | norm 0.2987 | lr 1.20e-03 | (3799.01 ms | 138006 tok/s) step 984/76294 | train loss 4.185711 | norm 0.3306 | lr 1.20e-03 | (3801.58 ms | 137913 tok/s) step 985/76294 | train loss 4.330877 | norm 0.4000 | lr 1.20e-03 | (3827.53 ms | 136978 tok/s) step 986/76294 | train loss 4.186242 | norm 0.4495 | lr 1.20e-03 | 
(3798.69 ms | 138018 tok/s) step 987/76294 | train loss 4.243281 | norm 0.3845 | lr 1.20e-03 | (3809.10 ms | 137641 tok/s) step 988/76294 | train loss 4.204252 | norm 0.4106 | lr 1.20e-03 | (3799.51 ms | 137988 tok/s) step 989/76294 | train loss 4.229506 | norm 0.3424 | lr 1.20e-03 | (3819.43 ms | 137269 tok/s) step 990/76294 | train loss 4.209149 | norm 0.3443 | lr 1.20e-03 | (3798.41 ms | 138028 tok/s) step 991/76294 | train loss 4.248751 | norm 0.3231 | lr 1.20e-03 | (3818.46 ms | 137304 tok/s) step 992/76294 | train loss 4.140013 | norm 0.3269 | lr 1.20e-03 | (3803.30 ms | 137851 tok/s) step 993/76294 | train loss 4.202605 | norm 0.2982 | lr 1.20e-03 | (3800.95 ms | 137936 tok/s) step 994/76294 | train loss 4.167215 | norm 0.3145 | lr 1.20e-03 | (3837.09 ms | 136637 tok/s) step 995/76294 | train loss 4.170773 | norm 0.5885 | lr 1.20e-03 | (3798.19 ms | 138036 tok/s) step 996/76294 | train loss 4.254020 | norm 0.2937 | lr 1.20e-03 | (3942.79 ms | 132974 tok/s) step 997/76294 | train loss 4.209854 | norm 0.2534 | lr 1.20e-03 | (9392.49 ms | 55820 tok/s) step 998/76294 | train loss 4.207761 | norm 0.2652 | lr 1.20e-03 | (3852.53 ms | 136089 tok/s) step 999/76294 | train loss 4.237841 | norm 0.2786 | lr 1.20e-03 | (3794.56 ms | 138168 tok/s) step 1000/76294 | train loss 4.218008 | norm 0.3106 | lr 1.20e-03 | (3844.84 ms | 136362 tok/s) val loss: 4.225113 saving model checkpoint to ./results/gpt2-124M-gqa/step_1000.pth step 1001/76294 | train loss 4.113557 | norm 0.8084 | lr 1.20e-03 | (3836.25 ms | 136667 tok/s) step 1002/76294 | train loss 4.209691 | norm 0.5074 | lr 1.20e-03 | (3793.16 ms | 138219 tok/s) step 1003/76294 | train loss 4.255528 | norm 0.5578 | lr 1.20e-03 | (3804.52 ms | 137807 tok/s) step 1004/76294 | train loss 4.239764 | norm 0.4589 | lr 1.20e-03 | (3786.96 ms | 138445 tok/s) step 1005/76294 | train loss 4.275149 | norm 0.5365 | lr 1.20e-03 | (3819.64 ms | 137261 tok/s) step 1006/76294 | train loss 4.276976 | norm 0.3714 | lr 1.20e-03 | (3793.53 
ms | 138206 tok/s) step 1007/76294 | train loss 4.179741 | norm 0.3382 | lr 1.20e-03 | (3798.00 ms | 138043 tok/s) step 1008/76294 | train loss 4.194944 | norm 0.4417 | lr 1.20e-03 | (3839.90 ms | 136537 tok/s) step 1009/76294 | train loss 4.214240 | norm 0.3083 | lr 1.20e-03 | (3800.80 ms | 137941 tok/s) step 1010/76294 | train loss 4.135323 | norm 0.3001 | lr 1.20e-03 | (3797.09 ms | 138076 tok/s) step 1011/76294 | train loss 4.153452 | norm 0.3471 | lr 1.20e-03 | (3844.68 ms | 136367 tok/s) step 1012/76294 | train loss 4.236577 | norm 0.2892 | lr 1.20e-03 | (3793.06 ms | 138223 tok/s) step 1013/76294 | train loss 4.188565 | norm 0.3411 | lr 1.20e-03 | (3799.56 ms | 137987 tok/s) step 1014/76294 | train loss 4.120244 | norm 0.3347 | lr 1.20e-03 | (3866.02 ms | 135615 tok/s) step 1015/76294 | train loss 4.283399 | norm 0.3214 | lr 1.20e-03 | (3839.80 ms | 136540 tok/s) step 1016/76294 | train loss 4.164058 | norm 0.2977 | lr 1.20e-03 | (4662.93 ms | 112437 tok/s) step 1017/76294 | train loss 4.243944 | norm 0.3858 | lr 1.20e-03 | (3787.05 ms | 138442 tok/s) step 1018/76294 | train loss 4.187786 | norm 0.3779 | lr 1.20e-03 | (4887.34 ms | 107275 tok/s) step 1019/76294 | train loss 4.193231 | norm 0.3861 | lr 1.20e-03 | (3808.43 ms | 137665 tok/s) step 1020/76294 | train loss 4.192583 | norm 0.3280 | lr 1.20e-03 | (3793.65 ms | 138201 tok/s) step 1021/76294 | train loss 4.211674 | norm 0.3656 | lr 1.20e-03 | (3848.50 ms | 136232 tok/s) step 1022/76294 | train loss 4.157859 | norm 0.3362 | lr 1.20e-03 | (3799.02 ms | 138006 tok/s) step 1023/76294 | train loss 4.166050 | norm 0.3166 | lr 1.20e-03 | (3800.78 ms | 137942 tok/s) step 1024/76294 | train loss 4.184124 | norm 0.3114 | lr 1.20e-03 | (3819.71 ms | 137259 tok/s) step 1025/76294 | train loss 4.184557 | norm 0.2589 | lr 1.20e-03 | (3809.26 ms | 137635 tok/s) step 1026/76294 | train loss 4.179842 | norm 0.2806 | lr 1.20e-03 | (3798.61 ms | 138021 tok/s) step 1027/76294 | train loss 4.228981 | norm 0.2753 | lr 
1.20e-03 | (3939.53 ms | 133084 tok/s) step 1028/76294 | train loss 4.186743 | norm 0.2698 | lr 1.20e-03 | (3796.05 ms | 138114 tok/s) step 1029/76294 | train loss 4.145663 | norm 0.2423 | lr 1.20e-03 | (3798.08 ms | 138040 tok/s) step 1030/76294 | train loss 4.204756 | norm 0.2595 | lr 1.20e-03 | (3826.90 ms | 137001 tok/s) step 1031/76294 | train loss 4.256236 | norm 0.2652 | lr 1.20e-03 | (3802.54 ms | 137878 tok/s) step 1032/76294 | train loss 4.205287 | norm 0.2946 | lr 1.20e-03 | (3794.95 ms | 138154 tok/s) step 1033/76294 | train loss 4.138499 | norm 0.3783 | lr 1.20e-03 | (3846.55 ms | 136301 tok/s) step 1034/76294 | train loss 4.150883 | norm 0.3289 | lr 1.20e-03 | (3805.89 ms | 137757 tok/s) step 1035/76294 | train loss 4.172613 | norm 0.2778 | lr 1.20e-03 | (3806.19 ms | 137746 tok/s) step 1036/76294 | train loss 4.160307 | norm 0.3040 | lr 1.20e-03 | (3835.73 ms | 136685 tok/s) step 1037/76294 | train loss 4.180175 | norm 0.2427 | lr 1.20e-03 | (3809.65 ms | 137621 tok/s) step 1038/76294 | train loss 4.150042 | norm 0.2601 | lr 1.20e-03 | (3806.99 ms | 137717 tok/s) step 1039/76294 | train loss 4.149447 | norm 0.2481 | lr 1.20e-03 | (3839.81 ms | 136540 tok/s) step 1040/76294 | train loss 4.162637 | norm 0.2431 | lr 1.20e-03 | (3806.86 ms | 137722 tok/s) step 1041/76294 | train loss 4.111892 | norm 0.2966 | lr 1.20e-03 | (3833.78 ms | 136755 tok/s) step 1042/76294 | train loss 4.114770 | norm 0.3219 | lr 1.20e-03 | (3811.68 ms | 137548 tok/s) step 1043/76294 | train loss 4.153299 | norm 0.3302 | lr 1.20e-03 | (3820.16 ms | 137242 tok/s) step 1044/76294 | train loss 4.156106 | norm 0.2999 | lr 1.20e-03 | (3821.24 ms | 137204 tok/s) step 1045/76294 | train loss 4.116846 | norm 0.2856 | lr 1.20e-03 | (3813.03 ms | 137499 tok/s) step 1046/76294 | train loss 4.185602 | norm 0.3269 | lr 1.20e-03 | (3834.80 ms | 136718 tok/s) step 1047/76294 | train loss 4.138142 | norm 0.3428 | lr 1.20e-03 | (3808.53 ms | 137662 tok/s) step 1048/76294 | train loss 4.102535 | 
norm 0.3436 | lr 1.20e-03 | (3808.73 ms | 137654 tok/s) step 1049/76294 | train loss 4.126035 | norm 0.2968 | lr 1.20e-03 | (3836.13 ms | 136671 tok/s) step 1050/76294 | train loss 4.245786 | norm 0.3151 | lr 1.20e-03 | (3808.54 ms | 137661 tok/s) step 1051/76294 | train loss 4.104943 | norm 0.2885 | lr 1.20e-03 | (3901.63 ms | 134377 tok/s) step 1052/76294 | train loss 4.120414 | norm 0.2964 | lr 1.20e-03 | (3833.70 ms | 136758 tok/s) step 1053/76294 | train loss 4.152625 | norm 0.2547 | lr 1.20e-03 | (3909.99 ms | 134089 tok/s) step 1054/76294 | train loss 4.154342 | norm 0.2910 | lr 1.20e-03 | (3807.35 ms | 137704 tok/s) step 1055/76294 | train loss 4.146430 | norm 0.5484 | lr 1.20e-03 | (3880.59 ms | 135105 tok/s) step 1056/76294 | train loss 4.168981 | norm 0.3594 | lr 1.20e-03 | (3813.65 ms | 137477 tok/s) step 1057/76294 | train loss 4.129364 | norm 0.4314 | lr 1.20e-03 | (3817.90 ms | 137324 tok/s) step 1058/76294 | train loss 4.071616 | norm 0.5508 | lr 1.20e-03 | (3833.31 ms | 136772 tok/s) step 1059/76294 | train loss 4.216693 | norm 1.1929 | lr 1.20e-03 | (3881.66 ms | 135068 tok/s) step 1060/76294 | train loss 4.195191 | norm 0.3823 | lr 1.20e-03 | (3804.64 ms | 137802 tok/s) step 1061/76294 | train loss 4.152861 | norm 0.4973 | lr 1.20e-03 | (3897.77 ms | 134510 tok/s) step 1062/76294 | train loss 4.168340 | norm 0.4345 | lr 1.20e-03 | (3800.60 ms | 137949 tok/s) step 1063/76294 | train loss 4.171692 | norm 0.4281 | lr 1.20e-03 | (3911.98 ms | 134021 tok/s) step 1064/76294 | train loss 4.215360 | norm 0.3783 | lr 1.20e-03 | (3807.70 ms | 137692 tok/s) step 1065/76294 | train loss 4.129913 | norm 0.4672 | lr 1.20e-03 | (3807.53 ms | 137698 tok/s) step 1066/76294 | train loss 4.101521 | norm 0.3725 | lr 1.20e-03 | (6736.71 ms | 77826 tok/s) step 1067/76294 | train loss 4.203729 | norm 0.4851 | lr 1.20e-03 | (4289.18 ms | 122235 tok/s) step 1068/76294 | train loss 4.189898 | norm 0.3880 | lr 1.20e-03 | (14139.90 ms | 37079 tok/s) step 1069/76294 | train 
loss 4.080474 | norm 0.3288 | lr 1.20e-03 | (5034.34 ms | 104142 tok/s) step 1070/76294 | train loss 4.163663 | norm 0.4310 | lr 1.20e-03 | (3881.40 ms | 135077 tok/s) step 1071/76294 | train loss 4.191867 | norm 0.4583 | lr 1.20e-03 | (3797.93 ms | 138046 tok/s) step 1072/76294 | train loss 4.188503 | norm 0.3788 | lr 1.20e-03 | (5313.23 ms | 98676 tok/s) step 1073/76294 | train loss 4.142936 | norm 1.2462 | lr 1.20e-03 | (3787.82 ms | 138414 tok/s) step 1074/76294 | train loss 4.232235 | norm 0.3316 | lr 1.20e-03 | (15334.59 ms | 34190 tok/s) step 1075/76294 | train loss 4.225456 | norm 0.3794 | lr 1.20e-03 | (3870.79 ms | 135447 tok/s) step 1076/76294 | train loss 4.156327 | norm 0.2754 | lr 1.20e-03 | (3894.96 ms | 134607 tok/s) step 1077/76294 | train loss 4.118316 | norm 0.2968 | lr 1.20e-03 | (3775.22 ms | 138876 tok/s) step 1078/76294 | train loss 4.223460 | norm 0.4537 | lr 1.20e-03 | (3798.18 ms | 138036 tok/s) step 1079/76294 | train loss 4.151937 | norm 0.3761 | lr 1.20e-03 | (3778.69 ms | 138749 tok/s) step 1080/76294 | train loss 4.166075 | norm 0.3601 | lr 1.20e-03 | (3788.52 ms | 138389 tok/s) step 1081/76294 | train loss 4.094265 | norm 0.3566 | lr 1.20e-03 | (3803.66 ms | 137838 tok/s) step 1082/76294 | train loss 4.103951 | norm 0.4237 | lr 1.20e-03 | (3791.23 ms | 138290 tok/s) step 1083/76294 | train loss 4.196867 | norm 0.4006 | lr 1.20e-03 | (3831.07 ms | 136852 tok/s) step 1084/76294 | train loss 4.176070 | norm 0.3298 | lr 1.20e-03 | (3804.05 ms | 137824 tok/s) step 1085/76294 | train loss 4.154821 | norm 0.2428 | lr 1.20e-03 | (3802.50 ms | 137880 tok/s) step 1086/76294 | train loss 4.198344 | norm 0.3128 | lr 1.20e-03 | (3824.60 ms | 137083 tok/s) step 1087/76294 | train loss 4.151577 | norm 0.2931 | lr 1.20e-03 | (3881.56 ms | 135071 tok/s) step 1088/76294 | train loss 4.137939 | norm 0.2673 | lr 1.20e-03 | (3800.20 ms | 137963 tok/s) step 1089/76294 | train loss 4.091512 | norm 0.2898 | lr 1.20e-03 | (3818.81 ms | 137291 tok/s) step 
1090/76294 | train loss 4.203011 | norm 0.3282 | lr 1.20e-03 | (3805.13 ms | 137785 tok/s) step 1091/76294 | train loss 4.145440 | norm 0.3737 | lr 1.20e-03 | (3883.11 ms | 135017 tok/s) step 1092/76294 | train loss 4.156861 | norm 0.3422 | lr 1.20e-03 | (3812.63 ms | 137514 tok/s) step 1093/76294 | train loss 4.390555 | norm 0.3008 | lr 1.20e-03 | (3936.63 ms | 133182 tok/s) step 1094/76294 | train loss 4.149233 | norm 0.2826 | lr 1.20e-03 | (3818.67 ms | 137296 tok/s) step 1095/76294 | train loss 4.244281 | norm 0.2983 | lr 1.20e-03 | (3858.57 ms | 135876 tok/s) step 1096/76294 | train loss 4.215629 | norm 0.2604 | lr 1.20e-03 | (3848.03 ms | 136248 tok/s) step 1097/76294 | train loss 4.088986 | norm 0.2756 | lr 1.20e-03 | (3815.95 ms | 137394 tok/s) step 1098/76294 | train loss 4.153977 | norm 0.2793 | lr 1.20e-03 | (3808.44 ms | 137665 tok/s) step 1099/76294 | train loss 4.221720 | norm 0.3424 | lr 1.20e-03 | (3857.99 ms | 135897 tok/s) step 1100/76294 | train loss 4.141594 | norm 0.3290 | lr 1.20e-03 | (3812.60 ms | 137514 tok/s) step 1101/76294 | train loss 4.206364 | norm 0.3217 | lr 1.20e-03 | (4002.76 ms | 130982 tok/s) step 1102/76294 | train loss 4.202968 | norm 0.2925 | lr 1.20e-03 | (3835.00 ms | 136711 tok/s) step 1103/76294 | train loss 4.074564 | norm 0.4182 | lr 1.20e-03 | (3847.01 ms | 136284 tok/s) step 1104/76294 | train loss 4.216186 | norm 0.4540 | lr 1.20e-03 | (3817.62 ms | 137334 tok/s) step 1105/76294 | train loss 4.183201 | norm 0.5910 | lr 1.20e-03 | (3849.31 ms | 136203 tok/s) step 1106/76294 | train loss 4.138267 | norm 0.8533 | lr 1.20e-03 | (3813.00 ms | 137500 tok/s) step 1107/76294 | train loss 4.208014 | norm 0.7669 | lr 1.20e-03 | (3863.81 ms | 135692 tok/s) step 1108/76294 | train loss 4.223702 | norm 0.6316 | lr 1.20e-03 | (3813.35 ms | 137488 tok/s) step 1109/76294 | train loss 4.212905 | norm 0.9582 | lr 1.20e-03 | (3855.29 ms | 135992 tok/s) step 1110/76294 | train loss 4.213256 | norm 0.5381 | lr 1.20e-03 | (3817.21 ms | 
137348 tok/s) step 1111/76294 | train loss 4.226676 | norm 0.4906 | lr 1.20e-03 | (3871.27 ms | 135430 tok/s) step 1112/76294 | train loss 4.162899 | norm 0.3519 | lr 1.20e-03 | (3819.56 ms | 137264 tok/s) step 1113/76294 | train loss 4.120259 | norm 0.3084 | lr 1.20e-03 | (3815.67 ms | 137404 tok/s) step 1114/76294 | train loss 4.191637 | norm 0.2654 | lr 1.20e-03 | (3834.58 ms | 136726 tok/s) step 1115/76294 | train loss 4.152769 | norm 0.2562 | lr 1.20e-03 | (3832.88 ms | 136787 tok/s) step 1116/76294 | train loss 4.153417 | norm 0.4741 | lr 1.20e-03 | (3876.37 ms | 135252 tok/s) step 1117/76294 | train loss 4.144415 | norm 0.2660 | lr 1.20e-03 | (3810.39 ms | 137594 tok/s) step 1118/76294 | train loss 4.143090 | norm 0.3010 | lr 1.20e-03 | (3811.51 ms | 137554 tok/s) step 1119/76294 | train loss 4.147620 | norm 0.3180 | lr 1.20e-03 | (3838.89 ms | 136573 tok/s) step 1120/76294 | train loss 4.185957 | norm 0.3380 | lr 1.20e-03 | (3808.70 ms | 137655 tok/s) step 1121/76294 | train loss 4.171134 | norm 0.3191 | lr 1.20e-03 | (4015.94 ms | 130552 tok/s) step 1122/76294 | train loss 4.134660 | norm 2.6799 | lr 1.20e-03 | (3878.82 ms | 135167 tok/s) step 1123/76294 | train loss 4.144782 | norm 0.4213 | lr 1.20e-03 | (4006.87 ms | 130847 tok/s) step 1124/76294 | train loss 4.132621 | norm 0.4162 | lr 1.20e-03 | (3870.71 ms | 135450 tok/s) step 1125/76294 | train loss 4.154946 | norm 0.3732 | lr 1.20e-03 | (3808.60 ms | 137659 tok/s) step 1126/76294 | train loss 4.092269 | norm 0.3726 | lr 1.20e-03 | (3839.53 ms | 136550 tok/s) step 1127/76294 | train loss 4.129809 | norm 0.3260 | lr 1.20e-03 | (3812.84 ms | 137506 tok/s) step 1128/76294 | train loss 4.251111 | norm 0.3411 | lr 1.20e-03 | (3811.09 ms | 137569 tok/s) step 1129/76294 | train loss 4.181740 | norm 0.3389 | lr 1.20e-03 | (3829.30 ms | 136915 tok/s) step 1130/76294 | train loss 4.136877 | norm 0.2448 | lr 1.20e-03 | (3813.13 ms | 137495 tok/s) step 1131/76294 | train loss 4.184201 | norm 0.2843 | lr 1.20e-03 
| (3831.77 ms | 136826 tok/s) step 1132/76294 | train loss 4.225546 | norm 0.2717 | lr 1.20e-03 | (3806.97 ms | 137718 tok/s) step 1133/76294 | train loss 4.037324 | norm 0.2624 | lr 1.20e-03 | (3805.08 ms | 137786 tok/s) step 1134/76294 | train loss 4.196375 | norm 0.2787 | lr 1.20e-03 | (3867.33 ms | 135568 tok/s) step 1135/76294 | train loss 4.153691 | norm 0.3700 | lr 1.20e-03 | (3805.13 ms | 137784 tok/s) step 1136/76294 | train loss 4.108810 | norm 0.4161 | lr 1.20e-03 | (3814.55 ms | 137444 tok/s) step 1137/76294 | train loss 4.081626 | norm 0.3416 | lr 1.20e-03 | (3828.45 ms | 136945 tok/s) step 1138/76294 | train loss 4.165952 | norm 0.3151 | lr 1.20e-03 | (3807.11 ms | 137713 tok/s) step 1139/76294 | train loss 4.091455 | norm 0.2991 | lr 1.20e-03 | (3836.26 ms | 136666 tok/s) step 1140/76294 | train loss 4.086214 | norm 0.4053 | lr 1.20e-03 | (3808.10 ms | 137677 tok/s) step 1141/76294 | train loss 4.140710 | norm 0.3398 | lr 1.20e-03 | (3805.08 ms | 137786 tok/s) step 1142/76294 | train loss 4.115072 | norm 0.3147 | lr 1.20e-03 | (3850.77 ms | 136152 tok/s) step 1143/76294 | train loss 4.115953 | norm 0.2572 | lr 1.20e-03 | (3804.32 ms | 137814 tok/s) step 1144/76294 | train loss 4.110931 | norm 0.2330 | lr 1.20e-03 | (3810.67 ms | 137584 tok/s) step 1145/76294 | train loss 4.039727 | norm 0.2617 | lr 1.20e-03 | (4579.52 ms | 114485 tok/s) step 1146/76294 | train loss 4.065438 | norm 0.3599 | lr 1.20e-03 | (3799.48 ms | 137989 tok/s) step 1147/76294 | train loss 4.176052 | norm 0.3357 | lr 1.20e-03 | (3830.79 ms | 136861 tok/s) step 1148/76294 | train loss 4.155504 | norm 0.3229 | lr 1.20e-03 | (3830.75 ms | 136863 tok/s) step 1149/76294 | train loss 4.117048 | norm 0.4288 | lr 1.20e-03 | (3802.42 ms | 137883 tok/s) step 1150/76294 | train loss 4.127543 | norm 0.4404 | lr 1.20e-03 | (3839.05 ms | 136567 tok/s) step 1151/76294 | train loss 4.115839 | norm 0.3352 | lr 1.20e-03 | (3800.53 ms | 137951 tok/s) step 1152/76294 | train loss 4.128324 | norm 
0.3411 | lr 1.20e-03 | (3808.23 ms | 137672 tok/s) step 1153/76294 | train loss 4.204343 | norm 0.4140 | lr 1.20e-03 | (3973.13 ms | 131959 tok/s) step 1154/76294 | train loss 4.146330 | norm 0.4815 | lr 1.20e-03 | (3809.91 ms | 137612 tok/s) step 1155/76294 | train loss 4.161165 | norm 0.5218 | lr 1.20e-03 | (3800.37 ms | 137957 tok/s) step 1156/76294 | train loss 4.214021 | norm 0.3897 | lr 1.20e-03 | (3922.87 ms | 133649 tok/s) step 1157/76294 | train loss 4.073225 | norm 0.3322 | lr 1.20e-03 | (3804.44 ms | 137810 tok/s) step 1158/76294 | train loss 4.063029 | norm 0.3096 | lr 1.20e-03 | (3807.44 ms | 137701 tok/s) step 1159/76294 | train loss 4.127791 | norm 0.2750 | lr 1.20e-03 | (3823.83 ms | 137111 tok/s) step 1160/76294 | train loss 4.128951 | norm 0.2505 | lr 1.20e-03 | (3832.61 ms | 136797 tok/s) step 1161/76294 | train loss 4.121876 | norm 0.2719 | lr 1.20e-03 | (3801.10 ms | 137931 tok/s) step 1162/76294 | train loss 4.187659 | norm 0.2662 | lr 1.20e-03 | (3835.32 ms | 136700 tok/s) step 1163/76294 | train loss 4.095087 | norm 0.2129 | lr 1.20e-03 | (3804.06 ms | 137823 tok/s) step 1164/76294 | train loss 4.173164 | norm 0.2408 | lr 1.20e-03 | (3812.50 ms | 137518 tok/s) step 1165/76294 | train loss 4.092972 | norm 0.2361 | lr 1.20e-03 | (3822.87 ms | 137145 tok/s) step 1166/76294 | train loss 4.133196 | norm 0.2762 | lr 1.20e-03 | (3804.81 ms | 137796 tok/s) step 1167/76294 | train loss 4.073291 | norm 0.2790 | lr 1.20e-03 | (3801.32 ms | 137922 tok/s) step 1168/76294 | train loss 4.195372 | norm 0.3159 | lr 1.20e-03 | (3845.16 ms | 136350 tok/s) step 1169/76294 | train loss 4.073090 | norm 0.3714 | lr 1.20e-03 | (3800.41 ms | 137956 tok/s) step 1170/76294 | train loss 4.087532 | norm 0.3531 | lr 1.20e-03 | (3805.91 ms | 137756 tok/s) step 1171/76294 | train loss 4.124863 | norm 0.3684 | lr 1.20e-03 | (3822.79 ms | 137148 tok/s) step 1172/76294 | train loss 4.225910 | norm 0.3238 | lr 1.20e-03 | (3805.02 ms | 137788 tok/s) step 1173/76294 | train loss 
4.125582 | norm 0.2805 | lr 1.20e-03 | (3801.54 ms | 137915 tok/s) step 1174/76294 | train loss 4.076119 | norm 0.3232 | lr 1.20e-03 | (3836.09 ms | 136673 tok/s) step 1175/76294 | train loss 4.280651 | norm 0.2769 | lr 1.20e-03 | (3833.49 ms | 136765 tok/s) step 1176/76294 | train loss 4.099788 | norm 0.2953 | lr 1.20e-03 | (3807.58 ms | 137696 tok/s) step 1177/76294 | train loss 4.101317 | norm 0.2874 | lr 1.20e-03 | (3827.50 ms | 136979 tok/s) step 1178/76294 | train loss 4.095580 | norm 0.2713 | lr 1.20e-03 | (3808.45 ms | 137665 tok/s) step 1179/76294 | train loss 4.116885 | norm 0.2977 | lr 1.20e-03 | (5638.12 ms | 92990 tok/s) step 1180/76294 | train loss 4.053898 | norm 0.2284 | lr 1.20e-03 | (3797.53 ms | 138060 tok/s) step 1181/76294 | train loss 4.112285 | norm 0.2576 | lr 1.20e-03 | (3799.22 ms | 137999 tok/s) step 1182/76294 | train loss 4.136948 | norm 0.2642 | lr 1.20e-03 | (3829.03 ms | 136925 tok/s) step 1183/76294 | train loss 4.121374 | norm 0.2791 | lr 1.20e-03 | (3801.13 ms | 137930 tok/s) step 1184/76294 | train loss 4.151022 | norm 0.2979 | lr 1.20e-03 | (3805.03 ms | 137788 tok/s) step 1185/76294 | train loss 4.105379 | norm 0.2879 | lr 1.20e-03 | (3990.90 ms | 131371 tok/s) step 1186/76294 | train loss 4.081739 | norm 0.2978 | lr 1.20e-03 | (3803.28 ms | 137852 tok/s) step 1187/76294 | train loss 4.127200 | norm 0.3698 | lr 1.20e-03 | (3801.90 ms | 137902 tok/s) step 1188/76294 | train loss 4.106554 | norm 0.4307 | lr 1.20e-03 | (3819.54 ms | 137265 tok/s) step 1189/76294 | train loss 4.121559 | norm 0.4293 | lr 1.20e-03 | (3802.95 ms | 137864 tok/s) step 1190/76294 | train loss 4.102410 | norm 0.3591 | lr 1.20e-03 | (3804.03 ms | 137824 tok/s) step 1191/76294 | train loss 4.158323 | norm 0.3247 | lr 1.20e-03 | (3825.99 ms | 137033 tok/s) step 1192/76294 | train loss 4.038476 | norm 0.2778 | lr 1.20e-03 | (3807.02 ms | 137716 tok/s) step 1193/76294 | train loss 4.138670 | norm 0.2716 | lr 1.20e-03 | (3840.39 ms | 136520 tok/s) step 
1194/76294 | train loss 4.088631 | norm 0.2827 | lr 1.20e-03 | (3804.78 ms | 137797 tok/s) step 1195/76294 | train loss 4.128084 | norm 0.2557 | lr 1.20e-03 | (3808.14 ms | 137676 tok/s) step 1196/76294 | train loss 4.074112 | norm 0.2259 | lr 1.20e-03 | (3830.08 ms | 136887 tok/s) step 1197/76294 | train loss 4.091698 | norm 0.2431 | lr 1.20e-03 | (3805.05 ms | 137787 tok/s) step 1198/76294 | train loss 4.060294 | norm 0.2507 | lr 1.20e-03 | (3803.23 ms | 137853 tok/s) step 1199/76294 | train loss 4.108218 | norm 0.2577 | lr 1.20e-03 | (3887.48 ms | 134866 tok/s) step 1200/76294 | train loss 4.086551 | norm 0.3057 | lr 1.20e-03 | (3992.44 ms | 131320 tok/s) step 1201/76294 | train loss 4.135148 | norm 0.3533 | lr 1.20e-03 | (3855.03 ms | 136001 tok/s) step 1202/76294 | train loss 4.093212 | norm 0.2870 | lr 1.20e-03 | (3797.92 ms | 138046 tok/s) step 1203/76294 | train loss 4.152500 | norm 0.2891 | lr 1.20e-03 | (3843.58 ms | 136406 tok/s) step 1204/76294 | train loss 4.086674 | norm 0.3196 | lr 1.20e-03 | (3799.14 ms | 138002 tok/s) step 1205/76294 | train loss 4.100490 | norm 0.2528 | lr 1.20e-03 | (3929.90 ms | 133410 tok/s) step 1206/76294 | train loss 4.066695 | norm 0.2726 | lr 1.20e-03 | (3800.54 ms | 137951 tok/s) step 1207/76294 | train loss 4.047853 | norm 0.2387 | lr 1.20e-03 | (4697.93 ms | 111600 tok/s) step 1208/76294 | train loss 4.119299 | norm 0.2905 | lr 1.20e-03 | (3854.77 ms | 136010 tok/s) step 1209/76294 | train loss 4.044055 | norm 0.2653 | lr 1.20e-03 | (3870.20 ms | 135468 tok/s) step 1210/76294 | train loss 4.144698 | norm 0.2700 | lr 1.20e-03 | (3800.39 ms | 137956 tok/s) step 1211/76294 | train loss 4.130726 | norm 0.3068 | lr 1.20e-03 | (3872.87 ms | 135375 tok/s) step 1212/76294 | train loss 4.076177 | norm 0.2784 | lr 1.20e-03 | (4454.39 ms | 117701 tok/s) step 1213/76294 | train loss 4.050791 | norm 0.2842 | lr 1.20e-03 | (4900.95 ms | 106977 tok/s) step 1214/76294 | train loss 4.148543 | norm 0.3410 | lr 1.20e-03 | (3848.81 ms | 
136221 tok/s) step 1215/76294 | train loss 4.086359 | norm 0.3222 | lr 1.20e-03 | (3993.73 ms | 131278 tok/s) step 1216/76294 | train loss 4.101053 | norm 0.2916 | lr 1.20e-03 | (3790.56 ms | 138314 tok/s) step 1217/76294 | train loss 4.065489 | norm 0.3360 | lr 1.20e-03 | (3816.83 ms | 137362 tok/s) step 1218/76294 | train loss 4.049535 | norm 0.3637 | lr 1.20e-03 | (3792.60 ms | 138240 tok/s) step 1219/76294 | train loss 4.067066 | norm 0.3262 | lr 1.20e-03 | (3841.23 ms | 136489 tok/s) step 1220/76294 | train loss 4.113521 | norm 0.2726 | lr 1.20e-03 | (3789.73 ms | 138345 tok/s) step 1221/76294 | train loss 4.091295 | norm 0.2788 | lr 1.20e-03 | (3852.29 ms | 136098 tok/s) step 1222/76294 | train loss 4.071040 | norm 0.3529 | lr 1.20e-03 | (3793.35 ms | 138212 tok/s) step 1223/76294 | train loss 4.195817 | norm 0.3221 | lr 1.20e-03 | (3826.45 ms | 137017 tok/s) step 1224/76294 | train loss 4.024841 | norm 0.2701 | lr 1.20e-03 | (3796.85 ms | 138085 tok/s) step 1225/76294 | train loss 4.111527 | norm 0.2929 | lr 1.20e-03 | (4010.55 ms | 130727 tok/s) step 1226/76294 | train loss 4.079255 | norm 0.2773 | lr 1.20e-03 | (3827.49 ms | 136979 tok/s) step 1227/76294 | train loss 4.144891 | norm 0.2550 | lr 1.20e-03 | (3804.73 ms | 137799 tok/s) step 1228/76294 | train loss 4.025485 | norm 0.2479 | lr 1.20e-03 | (3817.59 ms | 137335 tok/s) step 1229/76294 | train loss 4.056368 | norm 0.2953 | lr 1.20e-03 | (3801.94 ms | 137900 tok/s) step 1230/76294 | train loss 4.083600 | norm 0.3875 | lr 1.20e-03 | (3955.13 ms | 132559 tok/s) step 1231/76294 | train loss 4.118594 | norm 0.4065 | lr 1.20e-03 | (3798.33 ms | 138031 tok/s) step 1232/76294 | train loss 4.083808 | norm 0.3118 | lr 1.20e-03 | (3851.78 ms | 136116 tok/s) step 1233/76294 | train loss 4.066496 | norm 0.3309 | lr 1.20e-03 | (3847.16 ms | 136279 tok/s) step 1234/76294 | train loss 4.123811 | norm 0.4084 | lr 1.20e-03 | (3821.83 ms | 137183 tok/s) step 1235/76294 | train loss 4.067742 | norm 0.3570 | lr 1.20e-03 
| (3803.80 ms | 137833 tok/s) step 1236/76294 | train loss 4.092199 | norm 0.3725 | lr 1.20e-03 | (3802.98 ms | 137862 tok/s) step 1237/76294 | train loss 4.117425 | norm 0.2725 | lr 1.20e-03 | (3853.01 ms | 136072 tok/s) step 1238/76294 | train loss 4.097805 | norm 0.2530 | lr 1.20e-03 | (3797.91 ms | 138046 tok/s) step 1239/76294 | train loss 4.011930 | norm 0.2625 | lr 1.20e-03 | (3851.89 ms | 136112 tok/s) step 1240/76294 | train loss 4.134293 | norm 0.2788 | lr 1.20e-03 | (3812.70 ms | 137511 tok/s) step 1241/76294 | train loss 4.022815 | norm 0.2779 | lr 1.20e-03 | (3889.22 ms | 134805 tok/s) step 1242/76294 | train loss 4.034825 | norm 0.2336 | lr 1.20e-03 | (3803.63 ms | 137839 tok/s) step 1243/76294 | train loss 4.116505 | norm 0.2693 | lr 1.20e-03 | (3900.15 ms | 134428 tok/s) step 1244/76294 | train loss 4.083922 | norm 0.2383 | lr 1.20e-03 | (6016.89 ms | 87136 tok/s) step 1245/76294 | train loss 4.043285 | norm 0.2559 | lr 1.20e-03 | (3808.54 ms | 137661 tok/s) step 1246/76294 | train loss 4.023325 | norm 0.2705 | lr 1.20e-03 | (3834.41 ms | 136732 tok/s) step 1247/76294 | train loss 4.078954 | norm 0.2738 | lr 1.20e-03 | (3829.27 ms | 136916 tok/s) step 1248/76294 | train loss 4.079802 | norm 0.2328 | lr 1.20e-03 | (3802.86 ms | 137867 tok/s) step 1249/76294 | train loss 4.033801 | norm 0.2637 | lr 1.20e-03 | (3831.39 ms | 136840 tok/s) step 1250/76294 | train loss 4.087594 | norm 0.2926 | lr 1.20e-03 | (3802.03 ms | 137897 tok/s) val loss: 4.063883 saving model checkpoint to ./results/gpt2-124M-gqa/step_1250.pth step 1251/76294 | train loss 4.043890 | norm 0.3303 | lr 1.20e-03 | (4187.64 ms | 125199 tok/s) step 1252/76294 | train loss 4.000176 | norm 0.3202 | lr 1.20e-03 | (3795.38 ms | 138139 tok/s) step 1253/76294 | train loss 4.157294 | norm 0.3303 | lr 1.20e-03 | (3876.22 ms | 135258 tok/s) step 1254/76294 | train loss 4.010808 | norm 0.3049 | lr 1.20e-03 | (3779.60 ms | 138715 tok/s) step 1255/76294 | train loss 4.061401 | norm 0.2845 | lr 
1.20e-03 | (3813.73 ms | 137474 tok/s) step 1256/76294 | train loss 4.051802 | norm 0.2752 | lr 1.20e-03 | (3866.16 ms | 135610 tok/s) step 1257/76294 | train loss 4.008605 | norm 0.2385 | lr 1.20e-03 | (3787.77 ms | 138416 tok/s) step 1258/76294 | train loss 4.102630 | norm 0.2646 | lr 1.20e-03 | (3813.42 ms | 137485 tok/s) step 1259/76294 | train loss 4.072704 | norm 0.2920 | lr 1.20e-03 | (3813.10 ms | 137497 tok/s) step 1260/76294 | train loss 4.019634 | norm 0.2969 | lr 1.20e-03 | (3797.90 ms | 138047 tok/s) step 1261/76294 | train loss 4.032969 | norm 0.3262 | lr 1.20e-03 | (3801.13 ms | 137929 tok/s) step 1262/76294 | train loss 3.958702 | norm 0.3586 | lr 1.20e-03 | (3821.89 ms | 137180 tok/s) step 1263/76294 | train loss 4.014338 | norm 0.3016 | lr 1.20e-03 | (3805.68 ms | 137765 tok/s) step 1264/76294 | train loss 3.997779 | norm 0.3239 | lr 1.20e-03 | (3794.33 ms | 138177 tok/s) step 1265/76294 | train loss 4.031574 | norm 1.0749 | lr 1.20e-03 | (3828.80 ms | 136933 tok/s) step 1266/76294 | train loss 4.098074 | norm 0.6345 | lr 1.20e-03 | (3802.60 ms | 137876 tok/s) step 1267/76294 | train loss 4.074055 | norm 0.3672 | lr 1.20e-03 | (4001.53 ms | 131022 tok/s) step 1268/76294 | train loss 4.446981 | norm 18.1075 | lr 1.20e-03 | (3805.33 ms | 137777 tok/s) step 1269/76294 | train loss 4.328196 | norm 5.5405 | lr 1.20e-03 | (3841.03 ms | 136497 tok/s) step 1270/76294 | train loss 4.300000 | norm 1.1545 | lr 1.20e-03 | (3825.83 ms | 137039 tok/s) step 1271/76294 | train loss 4.182594 | norm 1.0204 | lr 1.20e-03 | (3811.10 ms | 137569 tok/s) step 1272/76294 | train loss 4.224440 | norm 0.8805 | lr 1.20e-03 | (3811.74 ms | 137546 tok/s) step 1273/76294 | train loss 4.193529 | norm 1.3237 | lr 1.20e-03 | (3838.60 ms | 136583 tok/s) step 1274/76294 | train loss 4.224033 | norm 0.4554 | lr 1.20e-03 | (3812.28 ms | 137526 tok/s) step 1275/76294 | train loss 4.150534 | norm 0.4570 | lr 1.20e-03 | (3816.47 ms | 137375 tok/s) step 1276/76294 | train loss 4.100317 | 
norm 0.4393 | lr 1.20e-03 | (3832.59 ms | 136797 tok/s) step 1277/76294 | train loss 4.158400 | norm 0.3722 | lr 1.20e-03 | (3815.85 ms | 137398 tok/s) step 1278/76294 | train loss 4.075294 | norm 0.4241 | lr 1.20e-03 | (3813.19 ms | 137493 tok/s) step 1279/76294 | train loss 4.216254 | norm 0.3242 | lr 1.20e-03 | (3914.68 ms | 133929 tok/s) step 1280/76294 | train loss 4.069150 | norm 0.2834 | lr 1.20e-03 | (3810.21 ms | 137601 tok/s) step 1281/76294 | train loss 4.092966 | norm 0.3047 | lr 1.20e-03 | (3818.63 ms | 137297 tok/s) step 1282/76294 | train loss 4.069697 | norm 0.4609 | lr 1.20e-03 | (3870.61 ms | 135453 tok/s) step 1283/76294 | train loss 4.133125 | norm 0.4456 | lr 1.20e-03 | (3964.74 ms | 132238 tok/s) step 1284/76294 | train loss 4.044640 | norm 0.3206 | lr 1.20e-03 | (3811.05 ms | 137570 tok/s) step 1285/76294 | train loss 4.115199 | norm 0.2579 | lr 1.20e-03 | (3832.13 ms | 136814 tok/s) step 1286/76294 | train loss 4.050292 | norm 0.2682 | lr 1.20e-03 | (3811.19 ms | 137566 tok/s) step 1287/76294 | train loss 4.031883 | norm 0.2964 | lr 1.20e-03 | (3928.39 ms | 133461 tok/s) step 1288/76294 | train loss 4.047246 | norm 0.2892 | lr 1.20e-03 | (3836.30 ms | 136665 tok/s) step 1289/76294 | train loss 4.266251 | norm 0.3510 | lr 1.20e-03 | (3814.17 ms | 137458 tok/s) step 1290/76294 | train loss 4.078880 | norm 0.2805 | lr 1.20e-03 | (3834.62 ms | 136725 tok/s) step 1291/76294 | train loss 4.069827 | norm 0.2828 | lr 1.20e-03 | (3845.13 ms | 136351 tok/s) step 1292/76294 | train loss 4.136434 | norm 0.3042 | lr 1.20e-03 | (3904.86 ms | 134266 tok/s) step 1293/76294 | train loss 4.069510 | norm 0.3397 | lr 1.20e-03 | (3835.41 ms | 136697 tok/s) step 1294/76294 | train loss 4.044916 | norm 0.2818 | lr 1.20e-03 | (3808.52 ms | 137662 tok/s) step 1295/76294 | train loss 4.011393 | norm 0.2640 | lr 1.20e-03 | (3844.37 ms | 136378 tok/s) step 1296/76294 | train loss 4.177850 | norm 0.2538 | lr 1.20e-03 | (3808.28 ms | 137670 tok/s) step 1297/76294 | train 
loss 4.016519 | norm 0.2291 | lr 1.20e-03 | (3832.57 ms | 136798 tok/s) step 1298/76294 | train loss 4.062892 | norm 0.2209 | lr 1.20e-03 | (3809.40 ms | 137630 tok/s) step 1299/76294 | train loss 4.101274 | norm 0.2565 | lr 1.20e-03 | (3813.19 ms | 137493 tok/s) step 1300/76294 | train loss 3.998966 | norm 0.2485 | lr 1.20e-03 | (3828.15 ms | 136956 tok/s) step 1301/76294 | train loss 4.002561 | norm 0.2262 | lr 1.20e-03 | (3817.32 ms | 137344 tok/s) step 1302/76294 | train loss 4.043101 | norm 0.2289 | lr 1.20e-03 | (3810.38 ms | 137595 tok/s) step 1303/76294 | train loss 4.025701 | norm 0.2566 | lr 1.20e-03 | (3857.33 ms | 135920 tok/s) step 1304/76294 | train loss 4.037360 | norm 0.2646 | lr 1.20e-03 | (3808.58 ms | 137660 tok/s) step 1305/76294 | train loss 4.017794 | norm 0.2356 | lr 1.20e-03 | (3810.93 ms | 137575 tok/s) step 1306/76294 | train loss 4.030561 | norm 0.2803 | lr 1.20e-03 | (3829.34 ms | 136914 tok/s) step 1307/76294 | train loss 4.053242 | norm 0.2610 | lr 1.20e-03 | (3810.75 ms | 137581 tok/s) step 1308/76294 | train loss 4.000504 | norm 0.2277 | lr 1.20e-03 | (3813.59 ms | 137479 tok/s) step 1309/76294 | train loss 4.022268 | norm 0.2157 | lr 1.20e-03 | (3898.71 ms | 134477 tok/s) step 1310/76294 | train loss 4.005424 | norm 0.2551 | lr 1.20e-03 | (3807.17 ms | 137711 tok/s) step 1311/76294 | train loss 4.020539 | norm 0.6464 | lr 1.20e-03 | (3812.40 ms | 137522 tok/s) step 1312/76294 | train loss 4.012374 | norm 0.2190 | lr 1.20e-03 | (3829.94 ms | 136892 tok/s) step 1313/76294 | train loss 4.086694 | norm 0.3085 | lr 1.20e-03 | (3812.22 ms | 137528 tok/s) step 1314/76294 | train loss 4.080784 | norm 0.2767 | lr 1.20e-03 | (3803.05 ms | 137860 tok/s) step 1315/76294 | train loss 4.064542 | norm 0.3038 | lr 1.20e-03 | (3833.51 ms | 136764 tok/s) step 1316/76294 | train loss 4.040591 | norm 0.3255 | lr 1.20e-03 | (3803.69 ms | 137837 tok/s) step 1317/76294 | train loss 3.994589 | norm 0.3009 | lr 1.20e-03 | (3863.14 ms | 135715 tok/s) step 
1318/76294 | train loss 4.015312 | norm 0.2708 | lr 1.20e-03 | (3806.75 ms | 137726 tok/s) step 1319/76294 | train loss 3.976575 | norm 0.2785 | lr 1.20e-03 | (3808.80 ms | 137652 tok/s) step 1320/76294 | train loss 4.066978 | norm 0.2720 | lr 1.20e-03 | (3828.56 ms | 136941 tok/s) step 1321/76294 | train loss 4.045263 | norm 0.2626 | lr 1.20e-03 | (3803.90 ms | 137829 tok/s) step 1322/76294 | train loss 3.999840 | norm 0.2523 | lr 1.20e-03 | (3837.09 ms | 136637 tok/s) step 1323/76294 | train loss 4.021214 | norm 0.2697 | lr 1.20e-03 | (3805.90 ms | 137756 tok/s) step 1324/76294 | train loss 4.048979 | norm 0.3055 | lr 1.20e-03 | (3854.46 ms | 136021 tok/s) step 1325/76294 | train loss 4.053737 | norm 0.2629 | lr 1.20e-03 | (3806.57 ms | 137732 tok/s) step 1326/76294 | train loss 4.026376 | norm 0.2413 | lr 1.20e-03 | (3830.12 ms | 136885 tok/s) step 1327/76294 | train loss 4.057744 | norm 0.2860 | lr 1.20e-03 | (3857.84 ms | 135902 tok/s) step 1328/76294 | train loss 4.051757 | norm 0.2534 | lr 1.20e-03 | (3809.02 ms | 137644 tok/s) step 1329/76294 | train loss 4.048117 | norm 0.2564 | lr 1.20e-03 | (3902.19 ms | 134357 tok/s) step 1330/76294 | train loss 4.009339 | norm 0.2682 | lr 1.20e-03 | (3851.37 ms | 136130 tok/s) step 1331/76294 | train loss 4.104054 | norm 0.3432 | lr 1.20e-03 | (3827.34 ms | 136985 tok/s) step 1332/76294 | train loss 3.973596 | norm 0.2823 | lr 1.20e-03 | (3802.17 ms | 137892 tok/s) step 1333/76294 | train loss 3.987442 | norm 0.2627 | lr 1.20e-03 | (3800.24 ms | 137962 tok/s) step 1334/76294 | train loss 4.006278 | norm 0.2537 | lr 1.20e-03 | (3876.02 ms | 135265 tok/s) step 1335/76294 | train loss 4.049210 | norm 0.3362 | lr 1.20e-03 | (3873.65 ms | 135347 tok/s) step 1336/76294 | train loss 4.036821 | norm 0.3207 | lr 1.20e-03 | (4100.00 ms | 127875 tok/s) step 1337/76294 | train loss 4.041003 | norm 0.3521 | lr 1.20e-03 | (3859.37 ms | 135848 tok/s) step 1338/76294 | train loss 4.007470 | norm 0.3552 | lr 1.20e-03 | (3802.46 ms | 
137881 tok/s) step 1339/76294 | train loss 4.077505 | norm 0.3622 | lr 1.20e-03 | (3905.85 ms | 134232 tok/s) step 1340/76294 | train loss 4.045042 | norm 0.3118 | lr 1.20e-03 | (3805.32 ms | 137778 tok/s) step 1341/76294 | train loss 3.985272 | norm 0.3306 | lr 1.20e-03 | (3812.29 ms | 137526 tok/s) step 1342/76294 | train loss 4.011927 | norm 0.2773 | lr 1.20e-03 | (3831.98 ms | 136819 tok/s) step 1343/76294 | train loss 3.975012 | norm 0.2824 | lr 1.20e-03 | (3827.10 ms | 136993 tok/s) step 1344/76294 | train loss 4.009303 | norm 0.2562 | lr 1.20e-03 | (3805.05 ms | 137787 tok/s) step 1345/76294 | train loss 4.022622 | norm 0.2469 | lr 1.20e-03 | (3841.72 ms | 136472 tok/s) step 1346/76294 | train loss 3.999641 | norm 0.2691 | lr 1.20e-03 | (4128.65 ms | 126988 tok/s) step 1347/76294 | train loss 4.032977 | norm 0.2631 | lr 1.20e-03 | (3807.57 ms | 137696 tok/s) step 1348/76294 | train loss 4.017999 | norm 0.3310 | lr 1.20e-03 | (3829.14 ms | 136921 tok/s) step 1349/76294 | train loss 4.063111 | norm 0.2303 | lr 1.20e-03 | (3807.14 ms | 137712 tok/s) step 1350/76294 | train loss 4.051726 | norm 0.2536 | lr 1.20e-03 | (3822.73 ms | 137150 tok/s) step 1351/76294 | train loss 4.038021 | norm 0.2254 | lr 1.20e-03 | (3805.61 ms | 137767 tok/s) step 1352/76294 | train loss 4.045160 | norm 0.2371 | lr 1.20e-03 | (3812.83 ms | 137506 tok/s) step 1353/76294 | train loss 3.990730 | norm 0.2321 | lr 1.20e-03 | (3840.94 ms | 136500 tok/s) step 1354/76294 | train loss 3.986646 | norm 0.2476 | lr 1.20e-03 | (3816.26 ms | 137383 tok/s) step 1355/76294 | train loss 3.963356 | norm 0.3190 | lr 1.20e-03 | (3812.70 ms | 137511 tok/s) step 1356/76294 | train loss 3.943236 | norm 0.3283 | lr 1.20e-03 | (4089.80 ms | 128194 tok/s) step 1357/76294 | train loss 4.057174 | norm 0.2944 | lr 1.20e-03 | (3810.91 ms | 137575 tok/s) step 1358/76294 | train loss 3.945028 | norm 0.2946 | lr 1.20e-03 | (3862.45 ms | 135740 tok/s) step 1359/76294 | train loss 4.037523 | norm 0.2616 | lr 1.20e-03 
| (3815.59 ms | 137407 tok/s) step 1360/76294 | train loss 3.996661 | norm 0.2472 | lr 1.20e-03 | (3934.60 ms | 133251 tok/s) step 1361/76294 | train loss 3.983171 | norm 0.2742 | lr 1.20e-03 | (3811.08 ms | 137569 tok/s) step 1362/76294 | train loss 4.156028 | norm 0.2454 | lr 1.20e-03 | (3847.20 ms | 136278 tok/s) step 1363/76294 | train loss 3.970282 | norm 0.2605 | lr 1.20e-03 | (3816.98 ms | 137357 tok/s) step 1364/76294 | train loss 4.013409 | norm 0.2721 | lr 1.20e-03 | (3814.92 ms | 137431 tok/s) step 1365/76294 | train loss 4.043626 | norm 0.2761 | lr 1.20e-03 | (3932.84 ms | 133310 tok/s) step 1366/76294 | train loss 3.938683 | norm 0.2328 | lr 1.20e-03 | (3909.94 ms | 134091 tok/s) step 1367/76294 | train loss 3.992804 | norm 0.2514 | lr 1.20e-03 | (3811.48 ms | 137555 tok/s) step 1368/76294 | train loss 4.011627 | norm 0.2354 | lr 1.20e-03 | (3819.09 ms | 137281 tok/s) step 1369/76294 | train loss 3.915960 | norm 0.2157 | lr 1.20e-03 | (3883.34 ms | 135010 tok/s) step 1370/76294 | train loss 4.122808 | norm 0.2873 | lr 1.20e-03 | (3808.08 ms | 137678 tok/s) step 1371/76294 | train loss 3.994288 | norm 0.3252 | lr 1.20e-03 | (3809.63 ms | 137622 tok/s) step 1372/76294 | train loss 4.055542 | norm 0.3143 | lr 1.20e-03 | (3839.85 ms | 136539 tok/s) step 1373/76294 | train loss 3.982726 | norm 0.3251 | lr 1.20e-03 | (3844.69 ms | 136367 tok/s) step 1374/76294 | train loss 3.973426 | norm 0.3157 | lr 1.20e-03 | (3805.52 ms | 137770 tok/s) step 1375/76294 | train loss 3.992515 | norm 0.2646 | lr 1.20e-03 | (3809.89 ms | 137612 tok/s) step 1376/76294 | train loss 4.009295 | norm 0.2506 | lr 1.20e-03 | (3846.91 ms | 136288 tok/s) step 1377/76294 | train loss 4.001326 | norm 0.2340 | lr 1.20e-03 | (3804.30 ms | 137815 tok/s) step 1378/76294 | train loss 4.015655 | norm 0.2816 | lr 1.20e-03 | (3836.43 ms | 136661 tok/s) step 1379/76294 | train loss 3.974761 | norm 0.2367 | lr 1.20e-03 | (3807.87 ms | 137685 tok/s) step 1380/76294 | train loss 4.058529 | norm 
0.2573 | lr 1.20e-03 | (3856.81 ms | 135938 tok/s) step 1381/76294 | train loss 3.954159 | norm 0.2888 | lr 1.20e-03 | (3807.28 ms | 137707 tok/s) step 1382/76294 | train loss 3.974750 | norm 0.2669 | lr 1.20e-03 | (3810.98 ms | 137573 tok/s) step 1383/76294 | train loss 3.991869 | norm 0.2748 | lr 1.20e-03 | (3832.29 ms | 136808 tok/s) step 1384/76294 | train loss 3.973259 | norm 0.2896 | lr 1.20e-03 | (3806.63 ms | 137730 tok/s) step 1385/76294 | train loss 4.162143 | norm 0.3116 | lr 1.20e-03 | (3810.47 ms | 137592 tok/s) step 1386/76294 | train loss 3.937236 | norm 0.4111 | lr 1.20e-03 | (3841.10 ms | 136494 tok/s) step 1387/76294 | train loss 4.088784 | norm 0.3462 | lr 1.20e-03 | (3797.69 ms | 138054 tok/s) step 1388/76294 | train loss 3.996688 | norm 0.4332 | lr 1.20e-03 | (3812.58 ms | 137515 tok/s) step 1389/76294 | train loss 4.015551 | norm 0.3762 | lr 1.20e-03 | (3806.87 ms | 137722 tok/s) step 1390/76294 | train loss 4.022593 | norm 0.3165 | lr 1.20e-03 | (3809.64 ms | 137622 tok/s) step 1391/76294 | train loss 3.960428 | norm 0.3352 | lr 1.20e-03 | (3831.02 ms | 136853 tok/s) step 1392/76294 | train loss 4.040752 | norm 0.2547 | lr 1.20e-03 | (3810.16 ms | 137602 tok/s) step 1393/76294 | train loss 3.968271 | norm 0.2573 | lr 1.20e-03 | (3933.37 ms | 133292 tok/s) step 1394/76294 | train loss 3.906520 | norm 0.2183 | lr 1.20e-03 | (3811.66 ms | 137548 tok/s) step 1395/76294 | train loss 3.972641 | norm 0.2309 | lr 1.20e-03 | (3811.28 ms | 137562 tok/s) step 1396/76294 | train loss 3.912897 | norm 0.2161 | lr 1.20e-03 | (3828.33 ms | 136949 tok/s) step 1397/76294 | train loss 4.005535 | norm 0.2656 | lr 1.20e-03 | (3852.73 ms | 136082 tok/s) step 1398/76294 | train loss 3.991634 | norm 0.2303 | lr 1.20e-03 | (3831.69 ms | 136830 tok/s) step 1399/76294 | train loss 3.906500 | norm 0.3252 | lr 1.20e-03 | (3810.99 ms | 137573 tok/s) step 1400/76294 | train loss 3.976430 | norm 0.2531 | lr 1.20e-03 | (3825.45 ms | 137053 tok/s) step 1401/76294 | train loss 
4.008296 | norm 0.2251 | lr 1.20e-03 | (3813.15 ms | 137495 tok/s) step 1402/76294 | train loss 4.012649 | norm 0.2549 | lr 1.20e-03 | (3836.23 ms | 136667 tok/s) step 1403/76294 | train loss 3.978527 | norm 0.2756 | lr 1.20e-03 | (3841.47 ms | 136481 tok/s) step 1404/76294 | train loss 3.974427 | norm 0.2338 | lr 1.20e-03 | (3804.49 ms | 137808 tok/s) step 1405/76294 | train loss 3.983783 | norm 0.2057 | lr 1.20e-03 | (3840.23 ms | 136525 tok/s) step 1406/76294 | train loss 4.059590 | norm 0.2202 | lr 1.20e-03 | (3806.08 ms | 137750 tok/s) step 1407/76294 | train loss 3.918048 | norm 0.2113 | lr 1.20e-03 | (3841.98 ms | 136463 tok/s) step 1408/76294 | train loss 4.062019 | norm 0.2212 | lr 1.20e-03 | (3805.78 ms | 137761 tok/s) step 1409/76294 | train loss 3.982137 | norm 0.2757 | lr 1.20e-03 | (3810.89 ms | 137576 tok/s) step 1410/76294 | train loss 4.003899 | norm 0.2221 | lr 1.20e-03 | (3829.36 ms | 136913 tok/s) step 1411/76294 | train loss 4.015596 | norm 0.2845 | lr 1.20e-03 | (3808.65 ms | 137657 tok/s) step 1412/76294 | train loss 3.982402 | norm 0.3099 | lr 1.20e-03 | (3808.82 ms | 137651 tok/s) step 1413/76294 | train loss 3.968749 | norm 0.3418 | lr 1.20e-03 | (3838.48 ms | 136587 tok/s) step 1414/76294 | train loss 3.983217 | norm 0.4189 | lr 1.20e-03 | (3805.59 ms | 137768 tok/s) step 1415/76294 | train loss 3.996715 | norm 0.3180 | lr 1.20e-03 | (3815.93 ms | 137394 tok/s) step 1416/76294 | train loss 3.977916 | norm 0.2576 | lr 1.20e-03 | (3827.43 ms | 136982 tok/s) step 1417/76294 | train loss 3.961386 | norm 0.2571 | lr 1.20e-03 | (3810.33 ms | 137596 tok/s) step 1418/76294 | train loss 3.960364 | norm 0.2348 | lr 1.20e-03 | (3806.64 ms | 137730 tok/s) step 1419/76294 | train loss 4.003110 | norm 0.2338 | lr 1.20e-03 | (3844.25 ms | 136382 tok/s) step 1420/76294 | train loss 3.942572 | norm 0.5892 | lr 1.20e-03 | (3805.11 ms | 137785 tok/s) step 1421/76294 | train loss 4.007181 | norm 0.2402 | lr 1.20e-03 | (3814.12 ms | 137460 tok/s) step 
1422/76294 | train loss 4.000376 | norm 0.1945 | lr 1.20e-03 | (3836.47 ms | 136659 tok/s) step 1423/76294 | train loss 3.968993 | norm 0.2106 | lr 1.20e-03 | (3833.22 ms | 136775 tok/s) step 1424/76294 | train loss 3.994246 | norm 0.2294 | lr 1.20e-03 | (3809.93 ms | 137611 tok/s) step 1425/76294 | train loss 3.937989 | norm 0.2658 | lr 1.20e-03 | (3853.57 ms | 136053 tok/s) step 1426/76294 | train loss 4.013507 | norm 0.3217 | lr 1.20e-03 | (3804.34 ms | 137813 tok/s) step 1427/76294 | train loss 3.964373 | norm 1.8892 | lr 1.20e-03 | (3853.26 ms | 136064 tok/s) step 1428/76294 | train loss 4.022993 | norm 0.2897 | lr 1.20e-03 | (3804.67 ms | 137801 tok/s) step 1429/76294 | train loss 3.997186 | norm 0.3408 | lr 1.20e-03 | (3813.46 ms | 137484 tok/s) step 1430/76294 | train loss 4.004281 | norm 0.2881 | lr 1.20e-03 | (3826.33 ms | 137021 tok/s) step 1431/76294 | train loss 3.957139 | norm 0.2692 | lr 1.20e-03 | (3806.48 ms | 137735 tok/s) step 1432/76294 | train loss 4.022152 | norm 0.2500 | lr 1.20e-03 | (3804.47 ms | 137809 tok/s) step 1433/76294 | train loss 4.081089 | norm 0.2913 | lr 1.20e-03 | (3836.96 ms | 136642 tok/s) step 1434/76294 | train loss 3.987507 | norm 0.4225 | lr 1.20e-03 | (3807.33 ms | 137705 tok/s) step 1435/76294 | train loss 4.001281 | norm 0.3415 | lr 1.20e-03 | (3982.35 ms | 131653 tok/s) step 1436/76294 | train loss 3.995997 | norm 0.2948 | lr 1.20e-03 | (3807.84 ms | 137686 tok/s) step 1437/76294 | train loss 3.944530 | norm 0.2870 | lr 1.20e-03 | (3924.96 ms | 133578 tok/s) step 1438/76294 | train loss 4.002085 | norm 0.3668 | lr 1.20e-03 | (3799.00 ms | 138007 tok/s) step 1439/76294 | train loss 3.916732 | norm 0.2825 | lr 1.20e-03 | (3822.28 ms | 137166 tok/s) step 1440/76294 | train loss 3.977933 | norm 0.3226 | lr 1.20e-03 | (3826.65 ms | 137010 tok/s) step 1441/76294 | train loss 3.986763 | norm 0.2713 | lr 1.20e-03 | (6866.88 ms | 76350 tok/s) step 1442/76294 | train loss 3.970701 | norm 0.3499 | lr 1.20e-03 | (3848.55 ms | 
136230 tok/s) step 1443/76294 | train loss 4.084884 | norm 0.3373 | lr 1.20e-03 | (3806.52 ms | 137734 tok/s) step 1444/76294 | train loss 3.964323 | norm 0.6761 | lr 1.20e-03 | (3817.99 ms | 137320 tok/s) step 1445/76294 | train loss 3.973657 | norm 0.3205 | lr 1.20e-03 | (3805.65 ms | 137766 tok/s) step 1446/76294 | train loss 4.000296 | norm 0.2928 | lr 1.20e-03 | (3796.57 ms | 138095 tok/s) step 1447/76294 | train loss 3.997440 | norm 0.3113 | lr 1.20e-03 | (3978.59 ms | 131777 tok/s) step 1448/76294 | train loss 3.956027 | norm 0.3063 | lr 1.20e-03 | (3922.99 ms | 133645 tok/s) step 1449/76294 | train loss 3.964408 | norm 0.2293 | lr 1.20e-03 | (3797.17 ms | 138073 tok/s) step 1450/76294 | train loss 3.946758 | norm 0.2637 | lr 1.20e-03 | (3821.88 ms | 137181 tok/s) step 1451/76294 | train loss 3.991263 | norm 0.2587 | lr 1.20e-03 | (3824.09 ms | 137101 tok/s) step 1452/76294 | train loss 3.954562 | norm 0.2183 | lr 1.20e-03 | (3803.01 ms | 137861 tok/s) step 1453/76294 | train loss 3.960188 | norm 0.2532 | lr 1.20e-03 | (3826.09 ms | 137030 tok/s) step 1454/76294 | train loss 3.970378 | norm 0.2626 | lr 1.20e-03 | (3805.00 ms | 137789 tok/s) step 1455/76294 | train loss 3.927813 | norm 0.3457 | lr 1.20e-03 | (3832.70 ms | 136793 tok/s) step 1456/76294 | train loss 3.976879 | norm 0.3572 | lr 1.20e-03 | (3799.31 ms | 137996 tok/s) step 1457/76294 | train loss 3.991875 | norm 0.3413 | lr 1.20e-03 | (3826.40 ms | 137019 tok/s) step 1458/76294 | train loss 3.962215 | norm 0.2889 | lr 1.20e-03 | (3807.35 ms | 137704 tok/s) step 1459/76294 | train loss 3.952512 | norm 0.2617 | lr 1.20e-03 | (3833.02 ms | 136782 tok/s) step 1460/76294 | train loss 3.953378 | norm 0.2363 | lr 1.20e-03 | (3806.30 ms | 137742 tok/s) step 1461/76294 | train loss 3.990661 | norm 0.2488 | lr 1.20e-03 | (3842.90 ms | 136430 tok/s) step 1462/76294 | train loss 3.945627 | norm 0.2585 | lr 1.20e-03 | (3807.16 ms | 137711 tok/s) step 1463/76294 | train loss 4.012582 | norm 0.2820 | lr 1.20e-03 
| (3821.63 ms | 137190 tok/s) step 1464/76294 | train loss 3.946739 | norm 0.3160 | lr 1.20e-03 | (3832.63 ms | 136796 tok/s) step 1465/76294 | train loss 3.944539 | norm 0.2528 | lr 1.20e-03 | (3825.54 ms | 137049 tok/s) step 1466/76294 | train loss 3.919201 | norm 0.2327 | lr 1.20e-03 | (3813.44 ms | 137484 tok/s) step 1467/76294 | train loss 3.952076 | norm 0.2605 | lr 1.20e-03 | (3888.88 ms | 134817 tok/s) step 1468/76294 | train loss 3.935993 | norm 0.2686 | lr 1.20e-03 | (3802.38 ms | 137884 tok/s) step 1469/76294 | train loss 3.941176 | norm 0.2930 | lr 1.20e-03 | (3808.86 ms | 137650 tok/s) step 1470/76294 | train loss 3.979657 | norm 0.2706 | lr 1.20e-03 | (3820.79 ms | 137220 tok/s) step 1471/76294 | train loss 3.945979 | norm 0.2256 | lr 1.20e-03 | (3859.04 ms | 135860 tok/s) step 1472/76294 | train loss 3.896916 | norm 0.2005 | lr 1.20e-03 | (3966.28 ms | 132186 tok/s) step 1473/76294 | train loss 3.994264 | norm 0.2735 | lr 1.20e-03 | (3812.93 ms | 137502 tok/s) step 1474/76294 | train loss 3.906049 | norm 0.2650 | lr 1.20e-03 | (3803.35 ms | 137849 tok/s) step 1475/76294 | train loss 3.961805 | norm 0.2886 | lr 1.20e-03 | (3861.52 ms | 135772 tok/s) step 1476/76294 | train loss 4.047066 | norm 0.2674 | lr 1.20e-03 | (3798.54 ms | 138024 tok/s) step 1477/76294 | train loss 3.922350 | norm 0.2310 | lr 1.20e-03 | (3957.14 ms | 132492 tok/s) step 1478/76294 | train loss 3.940261 | norm 0.2690 | lr 1.20e-03 | (3798.22 ms | 138035 tok/s) step 1479/76294 | train loss 3.959671 | norm 0.2713 | lr 1.20e-03 | (3843.79 ms | 136399 tok/s) step 1480/76294 | train loss 3.910774 | norm 0.3306 | lr 1.20e-03 | (3814.53 ms | 137445 tok/s) step 1481/76294 | train loss 3.972435 | norm 0.3044 | lr 1.20e-03 | (3804.74 ms | 137798 tok/s) step 1482/76294 | train loss 3.978393 | norm 0.2541 | lr 1.20e-03 | (3825.09 ms | 137065 tok/s) step 1483/76294 | train loss 3.951480 | norm 0.2474 | lr 1.20e-03 | (3805.03 ms | 137788 tok/s) step 1484/76294 | train loss 4.011662 | norm 
0.2498 | lr 1.20e-03 | (3802.82 ms | 137868 tok/s) step 1485/76294 | train loss 3.944464 | norm 0.2463 | lr 1.20e-03 | (3849.93 ms | 136181 tok/s) step 1486/76294 | train loss 3.898654 | norm 0.2426 | lr 1.20e-03 | (3806.32 ms | 137741 tok/s) step 1487/76294 | train loss 3.943816 | norm 0.2377 | lr 1.20e-03 | (3915.17 ms | 133912 tok/s) step 1488/76294 | train loss 3.917642 | norm 0.2548 | lr 1.20e-03 | (3800.71 ms | 137945 tok/s) step 1489/76294 | train loss 3.978812 | norm 0.1973 | lr 1.20e-03 | (3810.85 ms | 137578 tok/s) step 1490/76294 | train loss 3.933191 | norm 0.2195 | lr 1.20e-03 | (3804.86 ms | 137794 tok/s) step 1491/76294 | train loss 3.896975 | norm 0.2150 | lr 1.20e-03 | (3853.28 ms | 136063 tok/s) step 1492/76294 | train loss 3.980740 | norm 0.2708 | lr 1.20e-03 | (3806.45 ms | 137737 tok/s) step 1493/76294 | train loss 3.886084 | norm 0.4641 | lr 1.20e-03 | (3807.38 ms | 137703 tok/s) step 1494/76294 | train loss 3.937436 | norm 0.4658 | lr 1.20e-03 | (3825.03 ms | 137068 tok/s) step 1495/76294 | train loss 3.995251 | norm 0.3349 | lr 1.20e-03 | (3807.50 ms | 137699 tok/s) step 1496/76294 | train loss 3.997295 | norm 0.3696 | lr 1.20e-03 | (3827.71 ms | 136972 tok/s) step 1497/76294 | train loss 4.000952 | norm 0.2894 | lr 1.20e-03 | (3808.27 ms | 137671 tok/s) step 1498/76294 | train loss 4.051672 | norm 0.2751 | lr 1.20e-03 | (3805.28 ms | 137779 tok/s) step 1499/76294 | train loss 3.946579 | norm 0.2717 | lr 1.20e-03 | (3863.61 ms | 135699 tok/s) step 1500/76294 | train loss 3.970594 | norm 0.2543 | lr 1.20e-03 | (3802.14 ms | 137893 tok/s) val loss: 3.974434 saving model checkpoint to ./results/gpt2-124M-gqa/step_1500.pth step 1501/76294 | train loss 3.957699 | norm 0.2146 | lr 1.20e-03 | (3889.27 ms | 134804 tok/s) step 1502/76294 | train loss 3.952188 | norm 0.2760 | lr 1.20e-03 | (3820.64 ms | 137225 tok/s) step 1503/76294 | train loss 3.963577 | norm 0.2745 | lr 1.20e-03 | (3827.34 ms | 136985 tok/s) step 1504/76294 | train loss 3.973452 | 
norm 0.2580 | lr 1.20e-03 | (3799.87 ms | 137975 tok/s) step 1505/76294 | train loss 3.941449 | norm 0.2113 | lr 1.20e-03 | (3826.86 ms | 137002 tok/s) step 1506/76294 | train loss 4.006050 | norm 0.2421 | lr 1.20e-03 | (3850.35 ms | 136166 tok/s) step 1507/76294 | train loss 3.918592 | norm 0.2091 | lr 1.20e-03 | (3917.89 ms | 133819 tok/s) step 1508/76294 | train loss 3.942857 | norm 0.2246 | lr 1.20e-03 | (3807.11 ms | 137713 tok/s) step 1509/76294 | train loss 3.959090 | norm 0.2343 | lr 1.20e-03 | (3840.76 ms | 136506 tok/s) step 1510/76294 | train loss 3.982196 | norm 0.2046 | lr 1.20e-03 | (3814.95 ms | 137430 tok/s) step 1511/76294 | train loss 3.962194 | norm 0.2622 | lr 1.20e-03 | (3794.30 ms | 138178 tok/s) step 1512/76294 | train loss 3.935723 | norm 0.2085 | lr 1.20e-03 | (3829.08 ms | 136923 tok/s) step 1513/76294 | train loss 3.922122 | norm 0.1994 | lr 1.20e-03 | (3798.97 ms | 138008 tok/s) step 1514/76294 | train loss 3.900694 | norm 0.2043 | lr 1.20e-03 | (3811.29 ms | 137562 tok/s) step 1515/76294 | train loss 3.962951 | norm 0.2471 | lr 1.20e-03 | (3839.81 ms | 136540 tok/s) step 1516/76294 | train loss 3.961153 | norm 0.2388 | lr 1.20e-03 | (3898.19 ms | 134495 tok/s) step 1517/76294 | train loss 3.923845 | norm 0.2167 | lr 1.20e-03 | (3806.47 ms | 137736 tok/s) step 1518/76294 | train loss 3.929433 | norm 0.2380 | lr 1.20e-03 | (3985.54 ms | 131547 tok/s) step 1519/76294 | train loss 3.936420 | norm 0.2330 | lr 1.20e-03 | (3799.19 ms | 138000 tok/s) step 1520/76294 | train loss 3.932677 | norm 0.2438 | lr 1.20e-03 | (3801.44 ms | 137918 tok/s) step 1521/76294 | train loss 3.990559 | norm 0.2362 | lr 1.20e-03 | (3821.62 ms | 137190 tok/s) step 1522/76294 | train loss 3.927655 | norm 0.3095 | lr 1.20e-03 | (3821.72 ms | 137186 tok/s) step 1523/76294 | train loss 3.927423 | norm 0.3805 | lr 1.20e-03 | (3806.52 ms | 137734 tok/s) step 1524/76294 | train loss 3.990452 | norm 0.4041 | lr 1.20e-03 | (3802.40 ms | 137883 tok/s) step 1525/76294 | train 
loss 3.961691 | norm 0.3641 | lr 1.20e-03 | (3797.71 ms | 138054 tok/s) step 1526/76294 | train loss 3.930133 | norm 0.3548 | lr 1.20e-03 | (5005.61 ms | 104740 tok/s) step 1527/76294 | train loss 3.954200 | norm 0.2841 | lr 1.20e-03 | (3794.73 ms | 138162 tok/s) step 1528/76294 | train loss 3.929319 | norm 0.2499 | lr 1.20e-03 | (3825.53 ms | 137050 tok/s) step 1529/76294 | train loss 3.994899 | norm 0.2587 | lr 1.20e-03 | (3797.34 ms | 138067 tok/s) step 1530/76294 | train loss 3.941400 | norm 0.2554 | lr 1.20e-03 | (3827.89 ms | 136965 tok/s) step 1531/76294 | train loss 4.023817 | norm 0.2232 | lr 1.20e-03 | (3795.85 ms | 138121 tok/s) step 1532/76294 | train loss 4.031172 | norm 0.2265 | lr 1.20e-03 | (3851.31 ms | 136132 tok/s) step 1533/76294 | train loss 3.902419 | norm 0.2400 | lr 1.19e-03 | (3978.53 ms | 131779 tok/s) step 1534/76294 | train loss 3.994167 | norm 0.2854 | lr 1.19e-03 | (3820.51 ms | 137230 tok/s) step 1535/76294 | train loss 4.002052 | norm 0.2949 | lr 1.19e-03 | (3828.38 ms | 136948 tok/s) step 1536/76294 | train loss 3.998241 | norm 0.2796 | lr 1.19e-03 | (3825.23 ms | 137061 tok/s) step 1537/76294 | train loss 3.953608 | norm 0.3453 | lr 1.19e-03 | (3836.62 ms | 136654 tok/s) step 1538/76294 | train loss 3.990091 | norm 0.2799 | lr 1.19e-03 | (3802.11 ms | 137894 tok/s) step 1539/76294 | train loss 3.943515 | norm 0.2859 | lr 1.19e-03 | (3799.38 ms | 137993 tok/s) step 1540/76294 | train loss 3.944602 | norm 0.2266 | lr 1.19e-03 | (3841.27 ms | 136488 tok/s) step 1541/76294 | train loss 3.966152 | norm 0.2282 | lr 1.19e-03 | (3799.01 ms | 138007 tok/s) step 1542/76294 | train loss 3.940730 | norm 0.2565 | lr 1.19e-03 | (3808.65 ms | 137657 tok/s) step 1543/76294 | train loss 3.945801 | norm 0.2538 | lr 1.19e-03 | (3833.89 ms | 136751 tok/s) step 1544/76294 | train loss 3.982646 | norm 0.2462 | lr 1.19e-03 | (3802.64 ms | 137875 tok/s) step 1545/76294 | train loss 3.922831 | norm 0.2328 | lr 1.19e-03 | (3813.13 ms | 137496 tok/s) step 
1546/76294 | train loss 3.970153 | norm 0.2615 | lr 1.19e-03 | (3807.41 ms | 137702 tok/s) step 1547/76294 | train loss 3.976472 | norm 0.2534 | lr 1.19e-03 | (3892.19 ms | 134702 tok/s) step 1548/76294 | train loss 3.915743 | norm 0.2750 | lr 1.19e-03 | (3800.18 ms | 137964 tok/s) step 1549/76294 | train loss 3.993851 | norm 0.2963 | lr 1.19e-03 | (3803.83 ms | 137832 tok/s) step 1550/76294 | train loss 3.935050 | norm 0.2872 | lr 1.19e-03 | (3818.86 ms | 137289 tok/s) step 1551/76294 | train loss 3.919178 | norm 0.2941 | lr 1.19e-03 | (3802.64 ms | 137875 tok/s) step 1552/76294 | train loss 4.008629 | norm 0.2371 | lr 1.19e-03 | (3799.45 ms | 137991 tok/s) step 1553/76294 | train loss 3.944412 | norm 0.2757 | lr 1.19e-03 | (3840.58 ms | 136513 tok/s) step 1554/76294 | train loss 3.957492 | norm 0.2794 | lr 1.19e-03 | (3997.76 ms | 131145 tok/s) step 1555/76294 | train loss 3.986176 | norm 0.3162 | lr 1.19e-03 | (3863.49 ms | 135703 tok/s) step 1556/76294 | train loss 3.942036 | norm 0.2705 | lr 1.19e-03 | (4006.06 ms | 130874 tok/s) step 1557/76294 | train loss 3.894155 | norm 0.2717 | lr 1.19e-03 | (3805.72 ms | 137763 tok/s) step 1558/76294 | train loss 3.884803 | norm 0.2730 | lr 1.19e-03 | (3821.74 ms | 137186 tok/s) step 1559/76294 | train loss 3.960163 | norm 0.2463 | lr 1.19e-03 | (3830.73 ms | 136864 tok/s) step 1560/76294 | train loss 3.988522 | norm 0.2213 | lr 1.19e-03 | (4162.26 ms | 125962 tok/s) step 1561/76294 | train loss 3.904301 | norm 0.2048 | lr 1.19e-03 | (3801.25 ms | 137925 tok/s) step 1562/76294 | train loss 4.023426 | norm 0.2365 | lr 1.19e-03 | (3841.71 ms | 136473 tok/s) step 1563/76294 | train loss 3.886589 | norm 0.2685 | lr 1.19e-03 | (3804.44 ms | 137810 tok/s) step 1564/76294 | train loss 3.918655 | norm 0.2589 | lr 1.19e-03 | (3891.53 ms | 134725 tok/s) step 1565/76294 | train loss 3.916720 | norm 0.2879 | lr 1.19e-03 | (3838.83 ms | 136575 tok/s) step 1566/76294 | train loss 4.074875 | norm 0.3162 | lr 1.19e-03 | (3974.10 ms | 
131926 tok/s) step 1567/76294 | train loss 3.952560 | norm 0.3078 | lr 1.19e-03 | (3799.50 ms | 137989 tok/s) step 1568/76294 | train loss 3.939098 | norm 0.2958 | lr 1.19e-03 | (3810.15 ms | 137603 tok/s) step 1569/76294 | train loss 3.883531 | norm 0.2569 | lr 1.19e-03 | (3818.80 ms | 137291 tok/s) step 1570/76294 | train loss 3.913866 | norm 0.2541 | lr 1.19e-03 | (3823.41 ms | 137126 tok/s) step 1571/76294 | train loss 3.939625 | norm 0.2038 | lr 1.19e-03 | (3798.22 ms | 138035 tok/s) step 1572/76294 | train loss 3.932806 | norm 0.2558 | lr 1.19e-03 | (3797.01 ms | 138079 tok/s) step 1573/76294 | train loss 3.936224 | norm 0.2407 | lr 1.19e-03 | (3903.57 ms | 134310 tok/s) step 1574/76294 | train loss 3.938566 | norm 0.2874 | lr 1.19e-03 | (3795.19 ms | 138145 tok/s) step 1575/76294 | train loss 3.948067 | norm 0.2425 | lr 1.19e-03 | (3824.79 ms | 137076 tok/s) step 1576/76294 | train loss 3.919589 | norm 0.2199 | lr 1.19e-03 | (3796.69 ms | 138091 tok/s) step 1577/76294 | train loss 3.930106 | norm 0.2573 | lr 1.19e-03 | (3799.50 ms | 137989 tok/s) step 1578/76294 | train loss 3.981354 | norm 0.2748 | lr 1.19e-03 | (3966.33 ms | 132185 tok/s) step 1579/76294 | train loss 3.891061 | norm 0.2803 | lr 1.19e-03 | (3801.68 ms | 137909 tok/s) step 1580/76294 | train loss 3.992369 | norm 0.2795 | lr 1.19e-03 | (3842.92 ms | 136430 tok/s) step 1581/76294 | train loss 3.867823 | norm 0.2832 | lr 1.19e-03 | (3797.34 ms | 138067 tok/s) step 1582/76294 | train loss 3.943718 | norm 0.3127 | lr 1.19e-03 | (3823.46 ms | 137124 tok/s) step 1583/76294 | train loss 3.942707 | norm 0.4902 | lr 1.19e-03 | (3797.47 ms | 138062 tok/s) step 1584/76294 | train loss 3.920207 | norm 0.5853 | lr 1.19e-03 | (3820.54 ms | 137229 tok/s) step 1585/76294 | train loss 3.939390 | norm 0.4276 | lr 1.19e-03 | (3799.54 ms | 137987 tok/s) step 1586/76294 | train loss 3.917135 | norm 0.4343 | lr 1.19e-03 | (3807.49 ms | 137699 tok/s) step 1587/76294 | train loss 3.935476 | norm 0.3386 | lr 1.19e-03 
| (3820.85 ms | 137218 tok/s) step 1588/76294 | train loss 3.960660 | norm 0.3629 | lr 1.19e-03 | (3809.24 ms | 137636 tok/s) step 1589/76294 | train loss 3.957811 | norm 0.2754 | lr 1.19e-03 | (3798.43 ms | 138027 tok/s) step 1590/76294 | train loss 3.980404 | norm 0.4621 | lr 1.19e-03 | (3829.59 ms | 136904 tok/s) step 1591/76294 | train loss 3.866357 | norm 0.2395 | lr 1.19e-03 | (3800.64 ms | 137947 tok/s) step 1592/76294 | train loss 3.888922 | norm 0.2519 | lr 1.19e-03 | (3831.62 ms | 136832 tok/s) step 1593/76294 | train loss 3.976729 | norm 0.2401 | lr 1.19e-03 | (3835.44 ms | 136696 tok/s) step 1594/76294 | train loss 3.890961 | norm 0.2452 | lr 1.19e-03 | (4358.25 ms | 120298 tok/s) step 1595/76294 | train loss 3.888855 | norm 0.2440 | lr 1.19e-03 | (3798.05 ms | 138041 tok/s) step 1596/76294 | train loss 3.934419 | norm 0.2751 | lr 1.19e-03 | (3827.33 ms | 136985 tok/s) step 1597/76294 | train loss 3.914926 | norm 0.2682 | lr 1.19e-03 | (3820.71 ms | 137223 tok/s) step 1598/76294 | train loss 3.891623 | norm 0.2718 | lr 1.19e-03 | (3804.27 ms | 137816 tok/s) step 1599/76294 | train loss 3.951720 | norm 0.2592 | lr 1.19e-03 | (3805.66 ms | 137766 tok/s) step 1600/76294 | train loss 3.979849 | norm 0.2295 | lr 1.19e-03 | (3830.35 ms | 136877 tok/s) step 1601/76294 | train loss 3.879244 | norm 0.2605 | lr 1.19e-03 | (3800.36 ms | 137957 tok/s) step 1602/76294 | train loss 3.938118 | norm 0.2340 | lr 1.19e-03 | (3802.54 ms | 137878 tok/s) step 1603/76294 | train loss 3.921107 | norm 0.2079 | lr 1.19e-03 | (3921.39 ms | 133700 tok/s) step 1604/76294 | train loss 3.912747 | norm 0.2290 | lr 1.19e-03 | (3865.39 ms | 135637 tok/s) step 1605/76294 | train loss 3.917293 | norm 0.2311 | lr 1.19e-03 | (3818.98 ms | 137285 tok/s) step 1606/76294 | train loss 3.872808 | norm 0.2342 | lr 1.19e-03 | (3800.80 ms | 137942 tok/s) step 1607/76294 | train loss 3.950778 | norm 0.2373 | lr 1.19e-03 | (3804.03 ms | 137824 tok/s) step 1608/76294 | train loss 3.950235 | norm 
0.2353 | lr 1.19e-03 | (3802.12 ms | 137894 tok/s) step 1609/76294 | train loss 3.894424 | norm 0.2121 | lr 1.19e-03 | (3836.23 ms | 136668 tok/s) step 1610/76294 | train loss 3.956726 | norm 0.2407 | lr 1.19e-03 | (3825.86 ms | 137038 tok/s) step 1611/76294 | train loss 3.929561 | norm 0.2471 | lr 1.19e-03 | (3809.41 ms | 137630 tok/s) step 1612/76294 | train loss 3.897571 | norm 0.2788 | lr 1.19e-03 | (3810.30 ms | 137598 tok/s) step 1613/76294 | train loss 3.918097 | norm 0.2497 | lr 1.19e-03 | (3809.27 ms | 137635 tok/s) step 1614/76294 | train loss 3.945740 | norm 0.2294 | lr 1.19e-03 | (3823.95 ms | 137106 tok/s) step 1615/76294 | train loss 3.922148 | norm 0.2312 | lr 1.19e-03 | (3811.93 ms | 137539 tok/s) step 1616/76294 | train loss 3.908084 | norm 0.2286 | lr 1.19e-03 | (3868.95 ms | 135512 tok/s) step 1617/76294 | train loss 3.957880 | norm 0.2426 | lr 1.19e-03 | (3805.80 ms | 137760 tok/s) step 1618/76294 | train loss 3.893174 | norm 0.2200 | lr 1.19e-03 | (3890.05 ms | 134777 tok/s) step 1619/76294 | train loss 3.893147 | norm 0.2524 | lr 1.19e-03 | (3825.25 ms | 137060 tok/s) step 1620/76294 | train loss 3.939650 | norm 0.2691 | lr 1.19e-03 | (3806.84 ms | 137723 tok/s) step 1621/76294 | train loss 3.958316 | norm 0.2733 | lr 1.19e-03 | (3807.00 ms | 137717 tok/s) step 1622/76294 | train loss 3.951157 | norm 0.2825 | lr 1.19e-03 | (3937.97 ms | 133137 tok/s) step 1623/76294 | train loss 3.954379 | norm 0.2334 | lr 1.19e-03 | (3798.62 ms | 138021 tok/s) step 1624/76294 | train loss 3.923719 | norm 0.2253 | lr 1.19e-03 | (3829.12 ms | 136921 tok/s) step 1625/76294 | train loss 3.896884 | norm 0.2373 | lr 1.19e-03 | (3802.03 ms | 137897 tok/s) step 1626/76294 | train loss 3.874641 | norm 0.2249 | lr 1.19e-03 | (3824.48 ms | 137087 tok/s) step 1627/76294 | train loss 3.923081 | norm 0.2207 | lr 1.19e-03 | (3828.65 ms | 136938 tok/s) step 1628/76294 | train loss 3.926605 | norm 0.1959 | lr 1.19e-03 | (3808.83 ms | 137651 tok/s) step 1629/76294 | train loss 
4.000711 | norm 0.3179 | lr 1.19e-03 | (3938.24 ms | 133127 tok/s) step 1630/76294 | train loss 4.038303 | norm 0.2967 | lr 1.19e-03 | (3799.81 ms | 137977 tok/s) step 1631/76294 | train loss 3.887640 | norm 0.2709 | lr 1.19e-03 | (3842.84 ms | 136432 tok/s) step 1632/76294 | train loss 3.927288 | norm 0.2422 | lr 1.19e-03 | (3824.66 ms | 137081 tok/s) step 1633/76294 | train loss 3.883174 | norm 0.2910 | lr 1.19e-03 | (3807.32 ms | 137705 tok/s) step 1634/76294 | train loss 3.967012 | norm 0.2902 | lr 1.19e-03 | (3826.03 ms | 137032 tok/s) step 1635/76294 | train loss 3.982394 | norm 0.3204 | lr 1.19e-03 | (3803.59 ms | 137840 tok/s) step 1636/76294 | train loss 3.958740 | norm 0.3790 | lr 1.19e-03 | (3801.48 ms | 137917 tok/s) step 1637/76294 | train loss 3.927820 | norm 0.2773 | lr 1.19e-03 | (3829.32 ms | 136914 tok/s) step 1638/76294 | train loss 3.895252 | norm 0.2802 | lr 1.19e-03 | (4078.67 ms | 128544 tok/s) step 1639/76294 | train loss 3.930685 | norm 0.2960 | lr 1.19e-03 | (3799.45 ms | 137990 tok/s) step 1640/76294 | train loss 3.911210 | norm 0.2510 | lr 1.19e-03 | (3862.13 ms | 135751 tok/s) step 1641/76294 | train loss 3.971370 | norm 0.3082 | lr 1.19e-03 | (3802.54 ms | 137879 tok/s) step 1642/76294 | train loss 3.923934 | norm 0.3047 | lr 1.19e-03 | (3811.43 ms | 137557 tok/s) step 1643/76294 | train loss 3.873786 | norm 0.2845 | lr 1.19e-03 | (3825.91 ms | 137036 tok/s) step 1644/76294 | train loss 3.981431 | norm 0.2608 | lr 1.19e-03 | (3828.18 ms | 136955 tok/s) step 1645/76294 | train loss 3.909370 | norm 0.2693 | lr 1.19e-03 | (3806.80 ms | 137724 tok/s) step 1646/76294 | train loss 3.948833 | norm 0.2429 | lr 1.19e-03 | (3804.39 ms | 137811 tok/s) step 1647/76294 | train loss 3.909559 | norm 0.2452 | lr 1.19e-03 | (3833.86 ms | 136752 tok/s) step 1648/76294 | train loss 3.916933 | norm 0.2119 | lr 1.19e-03 | (3809.15 ms | 137639 tok/s) step 1649/76294 | train loss 3.904113 | norm 0.2483 | lr 1.19e-03 | (3804.65 ms | 137802 tok/s) step 
1650/76294 | train loss 3.909007 | norm 0.2950 | lr 1.19e-03 | (3835.30 ms | 136701 tok/s) step 1651/76294 | train loss 3.897952 | norm 0.2905 | lr 1.19e-03 | (3801.46 ms | 137918 tok/s) step 1652/76294 | train loss 3.846596 | norm 0.2252 | lr 1.19e-03 | (3811.87 ms | 137541 tok/s) step 1653/76294 | train loss 3.961024 | norm 0.2697 | lr 1.19e-03 | (3828.06 ms | 136959 tok/s) step 1654/76294 | train loss 3.895944 | norm 0.2986 | lr 1.19e-03 | (3837.96 ms | 136606 tok/s) step 1655/76294 | train loss 3.905850 | norm 0.3297 | lr 1.19e-03 | (3829.95 ms | 136892 tok/s) step 1656/76294 | train loss 3.897525 | norm 0.2788 | lr 1.19e-03 | (3819.06 ms | 137282 tok/s) step 1657/76294 | train loss 3.937736 | norm 0.2742 | lr 1.19e-03 | (3834.54 ms | 136728 tok/s) step 1658/76294 | train loss 3.886147 | norm 0.2792 | lr 1.19e-03 | (4782.22 ms | 109633 tok/s) step 1659/76294 | train loss 3.937240 | norm 0.2209 | lr 1.19e-03 | (3801.49 ms | 137916 tok/s) step 1660/76294 | train loss 3.978759 | norm 0.2520 | lr 1.19e-03 | (3829.36 ms | 136913 tok/s) step 1661/76294 | train loss 3.899492 | norm 0.2478 | lr 1.19e-03 | (3825.28 ms | 137059 tok/s) step 1662/76294 | train loss 3.968176 | norm 0.2178 | lr 1.19e-03 | (3803.14 ms | 137856 tok/s) step 1663/76294 | train loss 3.890142 | norm 0.2384 | lr 1.19e-03 | (3799.65 ms | 137983 tok/s) step 1664/76294 | train loss 3.940699 | norm 0.2288 | lr 1.19e-03 | (3826.43 ms | 137018 tok/s) step 1665/76294 | train loss 3.877048 | norm 0.2151 | lr 1.19e-03 | (3804.57 ms | 137805 tok/s) step 1666/76294 | train loss 3.891504 | norm 0.1910 | lr 1.19e-03 | (3818.53 ms | 137301 tok/s) step 1667/76294 | train loss 3.946386 | norm 0.2135 | lr 1.19e-03 | (3820.21 ms | 137241 tok/s) step 1668/76294 | train loss 3.814697 | norm 0.2725 | lr 1.19e-03 | (4005.85 ms | 130881 tok/s) step 1669/76294 | train loss 3.934451 | norm 0.2677 | lr 1.19e-03 | (3797.20 ms | 138072 tok/s) step 1670/76294 | train loss 3.961663 | norm 0.2392 | lr 1.19e-03 | (3836.25 ms | 
136667 tok/s) step 1671/76294 | train loss 3.894945 | norm 0.2729 | lr 1.19e-03 | (3802.79 ms | 137869 tok/s) step 1672/76294 | train loss 3.900401 | norm 0.2767 | lr 1.19e-03 | (3805.61 ms | 137767 tok/s) step 1673/76294 | train loss 3.879949 | norm 0.2961 | lr 1.19e-03 | (3827.04 ms | 136996 tok/s) step 1674/76294 | train loss 3.899005 | norm 0.2382 | lr 1.19e-03 | (3811.44 ms | 137556 tok/s) step 1675/76294 | train loss 3.911707 | norm 0.3059 | lr 1.19e-03 | (3799.90 ms | 137974 tok/s) step 1676/76294 | train loss 3.878026 | norm 0.4757 | lr 1.19e-03 | (3834.75 ms | 136720 tok/s) step 1677/76294 | train loss 3.871407 | norm 0.3547 | lr 1.19e-03 | (3801.48 ms | 137917 tok/s) step 1678/76294 | train loss 4.050517 | norm 0.3170 | lr 1.19e-03 | (3809.83 ms | 137614 tok/s) step 1679/76294 | train loss 3.967597 | norm 0.2943 | lr 1.19e-03 | (3824.64 ms | 137082 tok/s) step 1680/76294 | train loss 3.915198 | norm 0.3185 | lr 1.19e-03 | (3805.45 ms | 137773 tok/s) step 1681/76294 | train loss 3.937572 | norm 0.2434 | lr 1.19e-03 | (3804.90 ms | 137793 tok/s) step 1682/76294 | train loss 3.869648 | norm 0.2530 | lr 1.19e-03 | (3839.67 ms | 136545 tok/s) step 1683/76294 | train loss 3.918998 | norm 0.2605 | lr 1.19e-03 | (3805.46 ms | 137772 tok/s) step 1684/76294 | train loss 3.870557 | norm 0.2521 | lr 1.19e-03 | (3809.69 ms | 137620 tok/s) step 1685/76294 | train loss 3.864915 | norm 0.2617 | lr 1.19e-03 | (3849.98 ms | 136179 tok/s) step 1686/76294 | train loss 3.899189 | norm 0.2569 | lr 1.19e-03 | (3802.32 ms | 137886 tok/s) step 1687/76294 | train loss 3.895765 | norm 0.2307 | lr 1.19e-03 | (3837.58 ms | 136620 tok/s) step 1688/76294 | train loss 3.927610 | norm 0.2240 | lr 1.19e-03 | (3806.23 ms | 137745 tok/s) step 1689/76294 | train loss 3.927506 | norm 0.2163 | lr 1.19e-03 | (3816.30 ms | 137381 tok/s) step 1690/76294 | train loss 3.929856 | norm 0.2478 | lr 1.19e-03 | (3801.94 ms | 137900 tok/s) step 1691/76294 | train loss 3.888617 | norm 0.2199 | lr 1.19e-03 
| (3910.66 ms | 134066 tok/s) step 1692/76294 | train loss 3.897008 | norm 0.2311 | lr 1.19e-03 | (3804.39 ms | 137811 tok/s) step 1693/76294 | train loss 3.888374 | norm 0.2469 | lr 1.19e-03 | (3809.72 ms | 137619 tok/s) step 1694/76294 | train loss 3.878832 | norm 0.2407 | lr 1.19e-03 | (3870.00 ms | 135475 tok/s) step 1695/76294 | train loss 3.893174 | norm 0.2888 | lr 1.19e-03 | (3800.82 ms | 137941 tok/s) step 1696/76294 | train loss 3.843660 | norm 0.4778 | lr 1.19e-03 | (3807.26 ms | 137707 tok/s) step 1697/76294 | train loss 3.994011 | norm 0.4526 | lr 1.19e-03 | (3836.95 ms | 136642 tok/s) step 1698/76294 | train loss 3.889783 | norm 0.2949 | lr 1.19e-03 | (3804.98 ms | 137790 tok/s) step 1699/76294 | train loss 3.912735 | norm 0.2979 | lr 1.19e-03 | (3812.10 ms | 137533 tok/s) step 1700/76294 | train loss 3.940713 | norm 0.2664 | lr 1.19e-03 | (3830.87 ms | 136859 tok/s) step 1701/76294 | train loss 3.882600 | norm 0.2632 | lr 1.19e-03 | (3814.04 ms | 137463 tok/s) step 1702/76294 | train loss 3.908133 | norm 0.2273 | lr 1.19e-03 | (3803.32 ms | 137850 tok/s) step 1703/76294 | train loss 3.860942 | norm 0.2329 | lr 1.19e-03 | (3841.07 ms | 136495 tok/s) step 1704/76294 | train loss 3.916438 | norm 0.2401 | lr 1.19e-03 | (3804.34 ms | 137813 tok/s) step 1705/76294 | train loss 3.913138 | norm 0.2308 | lr 1.19e-03 | (3811.86 ms | 137541 tok/s) step 1706/76294 | train loss 4.018132 | norm 0.2537 | lr 1.19e-03 | (3832.18 ms | 136812 tok/s) step 1707/76294 | train loss 3.924074 | norm 0.2857 | lr 1.19e-03 | (3836.18 ms | 136669 tok/s) step 1708/76294 | train loss 3.948426 | norm 0.2420 | lr 1.19e-03 | (3828.64 ms | 136939 tok/s) step 1709/76294 | train loss 3.890546 | norm 0.2185 | lr 1.19e-03 | (3810.37 ms | 137595 tok/s) step 1710/76294 | train loss 3.941794 | norm 0.2351 | lr 1.19e-03 | (3829.48 ms | 136908 tok/s) step 1711/76294 | train loss 3.891015 | norm 0.2905 | lr 1.19e-03 | (3808.11 ms | 137677 tok/s) step 1712/76294 | train loss 3.886820 | norm 
0.2605 | lr 1.19e-03 | (3830.81 ms | 136861 tok/s) step 1713/76294 | train loss 4.048428 | norm 0.2188 | lr 1.19e-03 | (3833.59 ms | 136762 tok/s) step 1714/76294 | train loss 3.873963 | norm 0.2736 | lr 1.19e-03 | (3914.89 ms | 133922 tok/s) step 1715/76294 | train loss 3.913472 | norm 0.2316 | lr 1.19e-03 | (3916.82 ms | 133856 tok/s) step 1716/76294 | train loss 3.864242 | norm 0.2015 | lr 1.19e-03 | (3806.02 ms | 137752 tok/s) step 1717/76294 | train loss 3.893646 | norm 0.2182 | lr 1.19e-03 | (4135.58 ms | 126775 tok/s) step 1718/76294 | train loss 3.871258 | norm 0.2413 | lr 1.19e-03 | (3798.62 ms | 138021 tok/s) step 1719/76294 | train loss 3.902469 | norm 0.2283 | lr 1.19e-03 | (3806.60 ms | 137731 tok/s) step 1720/76294 | train loss 3.904796 | norm 0.2273 | lr 1.19e-03 | (3798.27 ms | 138033 tok/s) step 1721/76294 | train loss 3.874725 | norm 0.2009 | lr 1.19e-03 | (3820.79 ms | 137220 tok/s) step 1722/76294 | train loss 3.996216 | norm 0.2353 | lr 1.19e-03 | (3821.89 ms | 137180 tok/s) step 1723/76294 | train loss 3.864191 | norm 0.2404 | lr 1.19e-03 | (3812.89 ms | 137504 tok/s) step 1724/76294 | train loss 3.913051 | norm 0.2060 | lr 1.19e-03 | (3898.58 ms | 134482 tok/s) step 1725/76294 | train loss 3.960999 | norm 0.2111 | lr 1.19e-03 | (3864.54 ms | 135666 tok/s) step 1726/76294 | train loss 3.939038 | norm 0.2212 | lr 1.19e-03 | (3828.49 ms | 136944 tok/s) step 1727/76294 | train loss 3.954136 | norm 0.2488 | lr 1.19e-03 | (3804.82 ms | 137796 tok/s) step 1728/76294 | train loss 3.877458 | norm 0.2635 | lr 1.19e-03 | (3827.89 ms | 136965 tok/s) step 1729/76294 | train loss 3.872163 | norm 0.2432 | lr 1.19e-03 | (3804.05 ms | 137824 tok/s) step 1730/76294 | train loss 3.831843 | norm 0.2263 | lr 1.19e-03 | (3957.82 ms | 132469 tok/s) step 1731/76294 | train loss 3.863754 | norm 0.2212 | lr 1.19e-03 | (3823.09 ms | 137137 tok/s) step 1732/76294 | train loss 3.900121 | norm 0.2534 | lr 1.19e-03 | (3807.82 ms | 137687 tok/s) step 1733/76294 | train loss 
3.938813 | norm 0.2922 | lr 1.19e-03 | (3801.22 ms | 137926 tok/s) step 1734/76294 | train loss 3.849850 | norm 0.3943 | lr 1.19e-03 | (3843.51 ms | 136409 tok/s) step 1735/76294 | train loss 3.905799 | norm 0.4140 | lr 1.19e-03 | (3809.02 ms | 137644 tok/s) step 1736/76294 | train loss 3.942790 | norm 0.6514 | lr 1.19e-03 | (3833.61 ms | 136761 tok/s) step 1737/76294 | train loss 3.912157 | norm 0.3317 | lr 1.19e-03 | (3805.74 ms | 137762 tok/s) step 1738/76294 | train loss 3.936703 | norm 0.2966 | lr 1.19e-03 | (3809.25 ms | 137636 tok/s) step 1739/76294 | train loss 3.942637 | norm 0.2505 | lr 1.19e-03 | (3831.12 ms | 136850 tok/s) step 1740/76294 | train loss 3.928576 | norm 0.2636 | lr 1.19e-03 | (3817.45 ms | 137340 tok/s) step 1741/76294 | train loss 3.941968 | norm 0.3210 | lr 1.19e-03 | (3802.56 ms | 137878 tok/s) step 1742/76294 | train loss 3.993384 | norm 0.2667 | lr 1.19e-03 | (3834.25 ms | 136738 tok/s) step 1743/76294 | train loss 3.920486 | norm 0.2536 | lr 1.19e-03 | (3806.13 ms | 137748 tok/s) step 1744/76294 | train loss 3.894594 | norm 0.2702 | lr 1.19e-03 | (3882.06 ms | 135054 tok/s) step 1745/76294 | train loss 3.853109 | norm 0.3123 | lr 1.19e-03 | (3801.58 ms | 137913 tok/s) step 1746/76294 | train loss 3.947424 | norm 0.2746 | lr 1.19e-03 | (3858.43 ms | 135881 tok/s) step 1747/76294 | train loss 3.918422 | norm 0.2806 | lr 1.19e-03 | (3806.95 ms | 137719 tok/s) step 1748/76294 | train loss 3.865035 | norm 0.2652 | lr 1.19e-03 | (3810.51 ms | 137590 tok/s) step 1749/76294 | train loss 3.903327 | norm 0.2983 | lr 1.19e-03 | (3832.31 ms | 136807 tok/s) step 1750/76294 | train loss 3.912476 | norm 0.2867 | lr 1.19e-03 | (3808.17 ms | 137674 tok/s) val loss: 3.899432 saving model checkpoint to ./results/gpt2-124M-gqa/step_1750.pth step 1751/76294 | train loss 3.945502 | norm 0.2907 | lr 1.19e-03 | (3879.13 ms | 135156 tok/s) step 1752/76294 | train loss 3.874115 | norm 0.2563 | lr 1.19e-03 | (3820.16 ms | 137242 tok/s) step 1753/76294 | train 
loss 3.895066 | norm 0.2495 | lr 1.19e-03 | (3791.47 ms | 138281 tok/s) step 1754/76294 | train loss 3.892200 | norm 0.2695 | lr 1.19e-03 | (3811.53 ms | 137553 tok/s) step 1755/76294 | train loss 3.883728 | norm 0.2040 | lr 1.19e-03 | (3794.51 ms | 138170 tok/s) step 1756/76294 | train loss 3.879591 | norm 0.2467 | lr 1.19e-03 | (3795.57 ms | 138131 tok/s) step 1757/76294 | train loss 3.909260 | norm 0.2011 | lr 1.19e-03 | (3848.87 ms | 136219 tok/s) step 1758/76294 | train loss 3.916470 | norm 0.2191 | lr 1.19e-03 | (3796.27 ms | 138106 tok/s) step 1759/76294 | train loss 3.860245 | norm 0.2350 | lr 1.19e-03 | (3796.72 ms | 138090 tok/s) step 1760/76294 | train loss 3.888478 | norm 0.2313 | lr 1.19e-03 | (3820.11 ms | 137244 tok/s) step 1761/76294 | train loss 3.828058 | norm 0.2524 | lr 1.19e-03 | (3795.26 ms | 138143 tok/s) step 1762/76294 | train loss 3.884492 | norm 0.2508 | lr 1.19e-03 | (3802.33 ms | 137886 tok/s) step 1763/76294 | train loss 3.880297 | norm 0.2152 | lr 1.19e-03 | (3822.87 ms | 137145 tok/s) step 1764/76294 | train loss 3.911861 | norm 0.2028 | lr 1.19e-03 | (3805.87 ms | 137758 tok/s) step 1765/76294 | train loss 3.868705 | norm 0.2354 | lr 1.19e-03 | (3837.73 ms | 136614 tok/s) step 1766/76294 | train loss 3.920320 | norm 0.2660 | lr 1.19e-03 | (3854.04 ms | 136036 tok/s) step 1767/76294 | train loss 3.921915 | norm 0.2145 | lr 1.19e-03 | (3801.99 ms | 137898 tok/s) step 1768/76294 | train loss 3.882948 | norm 0.2196 | lr 1.19e-03 | (3830.89 ms | 136858 tok/s) step 1769/76294 | train loss 3.961160 | norm 0.2080 | lr 1.19e-03 | (3801.80 ms | 137905 tok/s) step 1770/76294 | train loss 3.937987 | norm 0.1914 | lr 1.19e-03 | (3878.92 ms | 135164 tok/s) step 1771/76294 | train loss 3.864526 | norm 0.1824 | lr 1.19e-03 | (3802.40 ms | 137883 tok/s) step 1772/76294 | train loss 3.940292 | norm 0.2343 | lr 1.19e-03 | (3810.43 ms | 137593 tok/s) step 1773/76294 | train loss 3.896658 | norm 0.2464 | lr 1.19e-03 | (3823.45 ms | 137124 tok/s) step 
1774/76294 | train loss 3.964983 | norm 0.2584 | lr 1.19e-03 | (3803.81 ms | 137832 tok/s) step 1775/76294 | train loss 3.900402 | norm 0.3239 | lr 1.19e-03 | (3804.68 ms | 137801 tok/s) step 1776/76294 | train loss 3.886096 | norm 0.2865 | lr 1.19e-03 | (3845.13 ms | 136351 tok/s) step 1777/76294 | train loss 3.874289 | norm 0.2213 | lr 1.19e-03 | (3805.56 ms | 137769 tok/s) step 1778/76294 | train loss 3.910405 | norm 0.2489 | lr 1.19e-03 | (3857.43 ms | 135916 tok/s) step 1779/76294 | train loss 3.843925 | norm 0.2771 | lr 1.19e-03 | (3815.62 ms | 137406 tok/s) step 1780/76294 | train loss 3.975071 | norm 0.2823 | lr 1.19e-03 | (3858.50 ms | 135879 tok/s) step 1781/76294 | train loss 3.821934 | norm 0.2178 | lr 1.19e-03 | (3813.64 ms | 137477 tok/s) step 1782/76294 | train loss 3.945597 | norm 0.2341 | lr 1.19e-03 | (3812.64 ms | 137513 tok/s) step 1783/76294 | train loss 3.877394 | norm 0.2455 | lr 1.19e-03 | (3826.99 ms | 136998 tok/s) step 1784/76294 | train loss 3.863844 | norm 0.2167 | lr 1.19e-03 | (3860.21 ms | 135818 tok/s) step 1785/76294 | train loss 3.878043 | norm 0.2282 | lr 1.19e-03 | (3808.07 ms | 137678 tok/s) step 1786/76294 | train loss 3.912697 | norm 0.2509 | lr 1.19e-03 | (3839.82 ms | 136540 tok/s) step 1787/76294 | train loss 3.990311 | norm 0.2373 | lr 1.19e-03 | (3804.52 ms | 137807 tok/s) step 1788/76294 | train loss 3.886893 | norm 0.2471 | lr 1.19e-03 | (3816.55 ms | 137372 tok/s) step 1789/76294 | train loss 3.919562 | norm 0.2228 | lr 1.19e-03 | (3807.48 ms | 137699 tok/s) step 1790/76294 | train loss 3.902315 | norm 0.2014 | lr 1.19e-03 | (3816.59 ms | 137371 tok/s) step 1791/76294 | train loss 3.884756 | norm 0.2267 | lr 1.19e-03 | (3804.53 ms | 137806 tok/s) step 1792/76294 | train loss 3.837881 | norm 0.2275 | lr 1.19e-03 | (3832.15 ms | 136813 tok/s) step 1793/76294 | train loss 3.890275 | norm 0.2176 | lr 1.19e-03 | (3806.22 ms | 137745 tok/s) step 1794/76294 | train loss 3.909537 | norm 0.2474 | lr 1.19e-03 | (3813.54 ms | 
137481 tok/s) step 1795/76294 | train loss 3.973357 | norm 0.3326 | lr 1.19e-03 | (3807.90 ms | 137684 tok/s) step 1796/76294 | train loss 3.859277 | norm 0.4419 | lr 1.19e-03 | (3818.07 ms | 137317 tok/s) step 1797/76294 | train loss 3.972077 | norm 0.4010 | lr 1.19e-03 | (3829.56 ms | 136906 tok/s) step 1798/76294 | train loss 3.878769 | norm 0.3674 | lr 1.19e-03 | (3807.74 ms | 137690 tok/s) step 1799/76294 | train loss 3.945901 | norm 0.2995 | lr 1.19e-03 | (3817.29 ms | 137346 tok/s) step 1800/76294 | train loss 3.883907 | norm 0.2660 | lr 1.19e-03 | (3886.00 ms | 134917 tok/s) step 1801/76294 | train loss 3.876884 | norm 0.2383 | lr 1.19e-03 | (3810.86 ms | 137577 tok/s) step 1802/76294 | train loss 3.901619 | norm 0.2419 | lr 1.19e-03 | (3815.24 ms | 137420 tok/s) step 1803/76294 | train loss 3.865563 | norm 0.2298 | lr 1.19e-03 | (3844.62 ms | 136369 tok/s) step 1804/76294 | train loss 3.863529 | norm 0.2537 | lr 1.19e-03 | (3810.80 ms | 137579 tok/s) step 1805/76294 | train loss 3.830153 | norm 0.2303 | lr 1.19e-03 | (3807.29 ms | 137706 tok/s) step 1806/76294 | train loss 3.880001 | norm 0.2182 | lr 1.19e-03 | (3836.50 ms | 136658 tok/s) step 1807/76294 | train loss 3.813772 | norm 0.2222 | lr 1.19e-03 | (3809.90 ms | 137612 tok/s) step 1808/76294 | train loss 3.841585 | norm 0.2251 | lr 1.19e-03 | (3853.94 ms | 136040 tok/s) step 1809/76294 | train loss 3.874083 | norm 0.2010 | lr 1.19e-03 | (3834.00 ms | 136747 tok/s) step 1810/76294 | train loss 3.862336 | norm 0.2239 | lr 1.19e-03 | (3810.54 ms | 137589 tok/s) step 1811/76294 | train loss 3.876795 | norm 0.2286 | lr 1.19e-03 | (3835.76 ms | 136684 tok/s) step 1812/76294 | train loss 3.866217 | norm 0.2006 | lr 1.19e-03 | (3803.33 ms | 137850 tok/s) step 1813/76294 | train loss 3.855509 | norm 0.2248 | lr 1.19e-03 | (3808.08 ms | 137678 tok/s) step 1814/76294 | train loss 3.870482 | norm 0.2652 | lr 1.19e-03 | (3843.24 ms | 136418 tok/s) step 1815/76294 | train loss 3.901432 | norm 0.2603 | lr 1.19e-03 
| (3809.25 ms | 137636 tok/s) step 1816/76294 | train loss 3.901359 | norm 0.2474 | lr 1.19e-03 | (3808.16 ms | 137675 tok/s) step 1817/76294 | train loss 3.863230 | norm 0.2332 | lr 1.19e-03 | (3838.86 ms | 136574 tok/s) step 1818/76294 | train loss 3.915838 | norm 0.2408 | lr 1.19e-03 | (3803.79 ms | 137833 tok/s) step 1819/76294 | train loss 3.862199 | norm 0.2643 | lr 1.19e-03 | (3830.02 ms | 136889 tok/s) step 1820/76294 | train loss 3.862152 | norm 0.2209 | lr 1.19e-03 | (3827.76 ms | 136970 tok/s) step 1821/76294 | train loss 3.815523 | norm 0.1938 | lr 1.19e-03 | (3870.07 ms | 135473 tok/s) step 1822/76294 | train loss 3.893648 | norm 0.2126 | lr 1.19e-03 | (3808.27 ms | 137671 tok/s) step 1823/76294 | train loss 3.994969 | norm 0.2169 | lr 1.19e-03 | (3828.47 ms | 136945 tok/s) step 1824/76294 | train loss 3.870131 | norm 0.2390 | lr 1.19e-03 | (3824.92 ms | 137072 tok/s) step 1825/76294 | train loss 3.862512 | norm 0.2896 | lr 1.19e-03 | (3809.82 ms | 137615 tok/s) step 1826/76294 | train loss 3.907018 | norm 0.3172 | lr 1.19e-03 | (3823.19 ms | 137134 tok/s) step 1827/76294 | train loss 3.944045 | norm 0.3721 | lr 1.19e-03 | (3871.66 ms | 135417 tok/s) step 1828/76294 | train loss 3.876152 | norm 0.3908 | lr 1.19e-03 | (3998.18 ms | 131132 tok/s) step 1829/76294 | train loss 3.915725 | norm 0.2642 | lr 1.19e-03 | (3807.09 ms | 137714 tok/s) step 1830/76294 | train loss 3.893938 | norm 0.2847 | lr 1.19e-03 | (3834.65 ms | 136724 tok/s) step 1831/76294 | train loss 3.906979 | norm 0.2193 | lr 1.19e-03 | (3830.02 ms | 136889 tok/s) step 1832/76294 | train loss 3.882864 | norm 0.2321 | lr 1.19e-03 | (3808.85 ms | 137650 tok/s) step 1833/76294 | train loss 3.903242 | norm 0.2308 | lr 1.19e-03 | (3831.93 ms | 136821 tok/s) step 1834/76294 | train loss 3.874081 | norm 0.2462 | lr 1.19e-03 | (3814.91 ms | 137431 tok/s) step 1835/76294 | train loss 3.898084 | norm 0.2588 | lr 1.19e-03 | (3804.67 ms | 137801 tok/s) step 1836/76294 | train loss 3.939064 | norm 
0.3094 | lr 1.19e-03 | (3859.47 ms | 135845 tok/s) step 1837/76294 | train loss 3.877986 | norm 0.2889 | lr 1.19e-03 | (3809.13 ms | 137640 tok/s) step 1838/76294 | train loss 3.923547 | norm 0.2486 | lr 1.19e-03 | (3833.45 ms | 136767 tok/s) step 1839/76294 | train loss 3.819582 | norm 0.2650 | lr 1.19e-03 | (3803.46 ms | 137845 tok/s) step 1840/76294 | train loss 3.892317 | norm 0.2835 | lr 1.19e-03 | (3813.09 ms | 137497 tok/s) step 1841/76294 | train loss 3.907163 | norm 0.2238 | lr 1.19e-03 | (3829.38 ms | 136912 tok/s) step 1842/76294 | train loss 3.887661 | norm 0.2341 | lr 1.19e-03 | (3808.93 ms | 137647 tok/s) step 1843/76294 | train loss 3.963612 | norm 0.2233 | lr 1.19e-03 | (3817.33 ms | 137344 tok/s) step 1844/76294 | train loss 3.836849 | norm 0.2571 | lr 1.19e-03 | (3832.52 ms | 136800 tok/s) step 1845/76294 | train loss 3.870162 | norm 0.2214 | lr 1.19e-03 | (3802.46 ms | 137881 tok/s) step 1846/76294 | train loss 3.846503 | norm 0.2672 | lr 1.19e-03 | (3826.33 ms | 137021 tok/s) step 1847/76294 | train loss 3.949588 | norm 0.3221 | lr 1.19e-03 | (3826.05 ms | 137031 tok/s) step 1848/76294 | train loss 3.801530 | norm 0.3354 | lr 1.19e-03 | (3869.41 ms | 135496 tok/s) step 1849/76294 | train loss 3.892283 | norm 0.2199 | lr 1.19e-03 | (3808.05 ms | 137679 tok/s) step 1850/76294 | train loss 3.824883 | norm 0.2542 | lr 1.19e-03 | (3932.84 ms | 133310 tok/s) step 1851/76294 | train loss 3.869567 | norm 0.2459 | lr 1.19e-03 | (3829.02 ms | 136925 tok/s) step 1852/76294 | train loss 3.891585 | norm 0.2395 | lr 1.19e-03 | (3809.09 ms | 137641 tok/s) step 1853/76294 | train loss 3.849679 | norm 0.1991 | lr 1.19e-03 | (3831.50 ms | 136836 tok/s) step 1854/76294 | train loss 3.868787 | norm 0.2198 | lr 1.19e-03 | (3808.74 ms | 137654 tok/s) step 1855/76294 | train loss 3.817088 | norm 0.2105 | lr 1.19e-03 | (3890.53 ms | 134760 tok/s) step 1856/76294 | train loss 3.870625 | norm 0.2153 | lr 1.19e-03 | (4446.88 ms | 117900 tok/s) step 1857/76294 | train loss 
3.894189 | norm 0.1910 | lr 1.19e-03 | (4058.39 ms | 129186 tok/s) step 1858/76294 | train loss 3.872677 | norm 0.2038 | lr 1.19e-03 | (3827.61 ms | 136975 tok/s) step 1859/76294 | train loss 3.889856 | norm 0.2027 | lr 1.19e-03 | (3829.02 ms | 136925 tok/s) step 1860/76294 | train loss 3.914551 | norm 0.2131 | lr 1.19e-03 | (3801.56 ms | 137914 tok/s) step 1861/76294 | train loss 3.847432 | norm 0.2489 | lr 1.19e-03 | (3836.70 ms | 136651 tok/s) step 1862/76294 | train loss 3.859304 | norm 0.3253 | lr 1.19e-03 | (3802.07 ms | 137895 tok/s) step 1863/76294 | train loss 3.807865 | norm 0.3335 | lr 1.19e-03 | (3810.24 ms | 137600 tok/s) step 1864/76294 | train loss 3.823669 | norm 0.3068 | lr 1.19e-03 | (3822.87 ms | 137145 tok/s) step 1865/76294 | train loss 3.896463 | norm 0.2853 | lr 1.19e-03 | (3807.15 ms | 137711 tok/s) step 1866/76294 | train loss 3.821582 | norm 0.2224 | lr 1.19e-03 | (3802.92 ms | 137864 tok/s) step 1867/76294 | train loss 3.829507 | norm 0.2277 | lr 1.19e-03 | (3838.65 ms | 136582 tok/s) step 1868/76294 | train loss 3.836351 | norm 0.2369 | lr 1.19e-03 | (3806.72 ms | 137727 tok/s) step 1869/76294 | train loss 3.817941 | norm 0.2196 | lr 1.19e-03 | (4194.06 ms | 125007 tok/s) step 1870/76294 | train loss 3.814858 | norm 0.2099 | lr 1.19e-03 | (3800.95 ms | 137936 tok/s) step 1871/76294 | train loss 3.790480 | norm 0.2301 | lr 1.19e-03 | (3808.82 ms | 137651 tok/s) step 1872/76294 | train loss 3.941015 | norm 0.2784 | lr 1.19e-03 | (3830.51 ms | 136872 tok/s) step 1873/76294 | train loss 3.852001 | norm 0.2929 | lr 1.19e-03 | (3807.74 ms | 137690 tok/s) step 1874/76294 | train loss 3.888695 | norm 0.2853 | lr 1.19e-03 | (3805.15 ms | 137784 tok/s) step 1875/76294 | train loss 3.869340 | norm 0.3091 | lr 1.19e-03 | (3940.58 ms | 133048 tok/s) step 1876/76294 | train loss 3.874114 | norm 0.2582 | lr 1.19e-03 | (3805.37 ms | 137776 tok/s) step 1877/76294 | train loss 3.827355 | norm 0.2287 | lr 1.19e-03 | (3812.20 ms | 137529 tok/s) step 
1878/76294 | train loss 3.824650 | norm 0.2445 | lr 1.19e-03 | (3826.30 ms | 137022 tok/s) step 1879/76294 | train loss 3.862799 | norm 0.2859 | lr 1.19e-03 | (3806.84 ms | 137722 tok/s) step 1880/76294 | train loss 3.891190 | norm 0.2833 | lr 1.19e-03 | (3808.06 ms | 137679 tok/s) step 1881/76294 | train loss 3.881311 | norm 0.2458 | lr 1.19e-03 | (3841.46 ms | 136481 tok/s) step 1882/76294 | train loss 3.847356 | norm 0.2920 | lr 1.19e-03 | (3797.58 ms | 138058 tok/s) step 1883/76294 | train loss 3.920144 | norm 0.2819 | lr 1.19e-03 | (3951.12 ms | 132693 tok/s) step 1884/76294 | train loss 3.858687 | norm 0.2948 | lr 1.19e-03 | (3804.41 ms | 137811 tok/s) step 1885/76294 | train loss 3.888759 | norm 0.2763 | lr 1.19e-03 | (3974.67 ms | 131907 tok/s) step 1886/76294 | train loss 3.854335 | norm 0.2627 | lr 1.19e-03 | (3800.37 ms | 137957 tok/s) step 1887/76294 | train loss 3.852944 | norm 0.2744 | lr 1.19e-03 | (3832.07 ms | 136816 tok/s) step 1888/76294 | train loss 3.801833 | norm 0.2535 | lr 1.19e-03 | (3804.06 ms | 137823 tok/s) step 1889/76294 | train loss 3.924400 | norm 0.2113 | lr 1.19e-03 | (3832.55 ms | 136799 tok/s) step 1890/76294 | train loss 3.850152 | norm 0.2240 | lr 1.19e-03 | (3818.59 ms | 137299 tok/s) step 1891/76294 | train loss 3.843031 | norm 0.2703 | lr 1.19e-03 | (3804.82 ms | 137796 tok/s) step 1892/76294 | train loss 3.808228 | norm 0.2504 | lr 1.19e-03 | (3802.78 ms | 137870 tok/s) step 1893/76294 | train loss 3.899382 | norm 0.2395 | lr 1.19e-03 | (3835.54 ms | 136692 tok/s) step 1894/76294 | train loss 3.880085 | norm 0.2315 | lr 1.19e-03 | (3800.82 ms | 137941 tok/s) step 1895/76294 | train loss 3.811596 | norm 0.2229 | lr 1.19e-03 | (3865.54 ms | 135631 tok/s) step 1896/76294 | train loss 3.854189 | norm 0.1989 | lr 1.19e-03 | (3806.28 ms | 137743 tok/s) step 1897/76294 | train loss 3.798692 | norm 0.2227 | lr 1.19e-03 | (3925.82 ms | 133549 tok/s) step 1898/76294 | train loss 3.867543 | norm 0.2147 | lr 1.19e-03 | (3801.02 ms | 
137934 tok/s) step 1899/76294 | train loss 3.826869 | norm 0.2200 | lr 1.19e-03 | (3858.85 ms | 135866 tok/s) step 1900/76294 | train loss 3.859660 | norm 0.2363 | lr 1.19e-03 | (3801.43 ms | 137919 tok/s) step 1901/76294 | train loss 3.815408 | norm 0.4033 | lr 1.19e-03 | (3845.07 ms | 136353 tok/s) step 1902/76294 | train loss 3.886020 | norm 0.2275 | lr 1.19e-03 | (3830.59 ms | 136869 tok/s) step 1903/76294 | train loss 3.858219 | norm 0.2189 | lr 1.19e-03 | (3807.46 ms | 137700 tok/s) step 1904/76294 | train loss 3.890738 | norm 0.1924 | lr 1.19e-03 | (3817.71 ms | 137330 tok/s) step 1905/76294 | train loss 4.049669 | norm 0.2347 | lr 1.19e-03 | (3812.11 ms | 137532 tok/s) step 1906/76294 | train loss 3.860631 | norm 0.2931 | lr 1.19e-03 | (3803.55 ms | 137842 tok/s) step 1907/76294 | train loss 3.897769 | norm 0.2902 | lr 1.19e-03 | (3833.77 ms | 136755 tok/s) step 1908/76294 | train loss 3.832602 | norm 0.2821 | lr 1.19e-03 | (4218.39 ms | 124286 tok/s) step 1909/76294 | train loss 3.840194 | norm 0.3189 | lr 1.19e-03 | (3829.34 ms | 136913 tok/s) step 1910/76294 | train loss 3.901925 | norm 0.2734 | lr 1.19e-03 | (3809.53 ms | 137625 tok/s) step 1911/76294 | train loss 3.866059 | norm 0.2338 | lr 1.19e-03 | (3811.68 ms | 137548 tok/s) step 1912/76294 | train loss 3.791746 | norm 0.2044 | lr 1.19e-03 | (3914.81 ms | 133924 tok/s) step 1913/76294 | train loss 3.879404 | norm 0.2057 | lr 1.19e-03 | (3888.83 ms | 134819 tok/s) step 1914/76294 | train loss 3.856535 | norm 0.2338 | lr 1.19e-03 | (3806.16 ms | 137747 tok/s) step 1915/76294 | train loss 3.800000 | norm 0.2307 | lr 1.19e-03 | (3815.48 ms | 137411 tok/s) step 1916/76294 | train loss 3.873289 | norm 0.2526 | lr 1.19e-03 | (3820.44 ms | 137232 tok/s) step 1917/76294 | train loss 3.830193 | norm 0.2854 | lr 1.19e-03 | (3810.72 ms | 137582 tok/s) step 1918/76294 | train loss 3.904143 | norm 0.3149 | lr 1.19e-03 | (3827.99 ms | 136962 tok/s) step 1919/76294 | train loss 3.840612 | norm 0.2719 | lr 1.19e-03 
| (3810.59 ms | 137587 tok/s) step 1920/76294 | train loss 3.811078 | norm 0.2570 | lr 1.19e-03 | (3805.60 ms | 137768 tok/s) step 1921/76294 | train loss 3.873837 | norm 0.2298 | lr 1.19e-03 | (3870.77 ms | 135448 tok/s) step 1922/76294 | train loss 3.790073 | norm 0.2350 | lr 1.19e-03 | (3809.36 ms | 137632 tok/s) step 1923/76294 | train loss 3.793350 | norm 0.2341 | lr 1.19e-03 | (3837.31 ms | 136629 tok/s) step 1924/76294 | train loss 3.785614 | norm 0.2373 | lr 1.19e-03 | (3807.57 ms | 137696 tok/s) step 1925/76294 | train loss 3.829091 | norm 0.2392 | lr 1.19e-03 | (3810.07 ms | 137606 tok/s) step 1926/76294 | train loss 3.851223 | norm 0.2374 | lr 1.19e-03 | (3831.97 ms | 136819 tok/s) step 1927/76294 | train loss 3.865168 | norm 0.3031 | lr 1.19e-03 | (3832.84 ms | 136788 tok/s) step 1928/76294 | train loss 3.902107 | norm 0.2915 | lr 1.19e-03 | (3806.20 ms | 137746 tok/s) step 1929/76294 | train loss 3.794555 | norm 0.2509 | lr 1.19e-03 | (3830.40 ms | 136875 tok/s) step 1930/76294 | train loss 3.756463 | norm 0.2348 | lr 1.19e-03 | (3807.52 ms | 137698 tok/s) step 1931/76294 | train loss 3.846802 | norm 0.2205 | lr 1.19e-03 | (4076.10 ms | 128625 tok/s) step 1932/76294 | train loss 3.839563 | norm 0.2209 | lr 1.19e-03 | (3801.36 ms | 137921 tok/s) step 1933/76294 | train loss 3.749818 | norm 0.2256 | lr 1.19e-03 | (3816.13 ms | 137387 tok/s) step 1934/76294 | train loss 3.861076 | norm 0.2115 | lr 1.19e-03 | (3808.66 ms | 137657 tok/s) step 1935/76294 | train loss 3.853564 | norm 0.2201 | lr 1.19e-03 | (3811.37 ms | 137559 tok/s) step 1936/76294 | train loss 3.854190 | norm 0.2207 | lr 1.19e-03 | (3833.04 ms | 136781 tok/s) step 1937/76294 | train loss 3.829618 | norm 0.1967 | lr 1.19e-03 | (3821.06 ms | 137210 tok/s) step 1938/76294 | train loss 3.829920 | norm 0.1981 | lr 1.19e-03 | (3831.03 ms | 136853 tok/s) step 1939/76294 | train loss 3.763400 | norm 0.2148 | lr 1.19e-03 | (3817.22 ms | 137348 tok/s) step 1940/76294 | train loss 3.796703 | norm 
0.2332 | lr 1.19e-03 | (3804.45 ms | 137809 tok/s) step 1941/76294 | train loss 3.861409 | norm 0.2603 | lr 1.19e-03 | (3833.26 ms | 136773 tok/s) step 1942/76294 | train loss 3.786189 | norm 0.2773 | lr 1.19e-03 | (3809.41 ms | 137630 tok/s) step 1943/76294 | train loss 3.867368 | norm 0.2965 | lr 1.19e-03 | (3810.16 ms | 137603 tok/s) step 1944/76294 | train loss 3.815642 | norm 0.2681 | lr 1.19e-03 | (3830.32 ms | 136878 tok/s) step 1945/76294 | train loss 3.815644 | norm 0.2613 | lr 1.19e-03 | (3811.87 ms | 137541 tok/s) step 1946/76294 | train loss 3.768403 | norm 0.2788 | lr 1.19e-03 | (3807.11 ms | 137713 tok/s) step 1947/76294 | train loss 3.887763 | norm 0.2959 | lr 1.19e-03 | (3834.16 ms | 136741 tok/s) step 1948/76294 | train loss 3.817326 | norm 0.3327 | lr 1.19e-03 | (3804.02 ms | 137825 tok/s) step 1949/76294 | train loss 3.941165 | norm 0.3095 | lr 1.19e-03 | (3815.68 ms | 137404 tok/s) step 1950/76294 | train loss 3.841254 | norm 0.2910 | lr 1.19e-03 | (4091.10 ms | 128153 tok/s) step 1951/76294 | train loss 3.850758 | norm 0.2183 | lr 1.19e-03 | (3963.22 ms | 132288 tok/s) step 1952/76294 | train loss 3.844551 | norm 0.2430 | lr 1.19e-03 | (3802.42 ms | 137883 tok/s) step 1953/76294 | train loss 3.849532 | norm 0.1944 | lr 1.19e-03 | (3807.22 ms | 137709 tok/s) step 1954/76294 | train loss 3.823452 | norm 0.2149 | lr 1.19e-03 | (3827.41 ms | 136983 tok/s) step 1955/76294 | train loss 3.855597 | norm 0.2104 | lr 1.19e-03 | (3810.86 ms | 137577 tok/s) step 1956/76294 | train loss 3.855107 | norm 0.1995 | lr 1.19e-03 | (3804.04 ms | 137824 tok/s) step 1957/76294 | train loss 3.749804 | norm 0.2068 | lr 1.19e-03 | (3830.46 ms | 136873 tok/s) step 1958/76294 | train loss 3.876938 | norm 0.1953 | lr 1.19e-03 | (3813.26 ms | 137491 tok/s) step 1959/76294 | train loss 3.788355 | norm 0.2024 | lr 1.19e-03 | (3808.53 ms | 137662 tok/s) step 1960/76294 | train loss 3.797639 | norm 0.2085 | lr 1.19e-03 | (3827.45 ms | 136981 tok/s) step 1961/76294 | train loss 
3.870858 | norm 0.2025 | lr 1.19e-03 | (3808.36 ms | 137668 tok/s) step 1962/76294 | train loss 3.866061 | norm 0.2329 | lr 1.19e-03 | (3813.26 ms | 137491 tok/s) step 1963/76294 | train loss 3.815792 | norm 0.2875 | lr 1.19e-03 | (3809.00 ms | 137645 tok/s) step 1964/76294 | train loss 3.820995 | norm 0.3023 | lr 1.19e-03 | (3812.46 ms | 137520 tok/s) step 1965/76294 | train loss 3.743429 | norm 0.2304 | lr 1.19e-03 | (3809.76 ms | 137617 tok/s) step 1966/76294 | train loss 3.789401 | norm 0.2372 | lr 1.19e-03 | (3803.68 ms | 137837 tok/s) step 1967/76294 | train loss 3.830890 | norm 0.2687 | lr 1.19e-03 | (3838.71 ms | 136579 tok/s) step 1968/76294 | train loss 3.821736 | norm 0.2547 | lr 1.19e-03 | (3803.22 ms | 137854 tok/s) step 1969/76294 | train loss 3.823925 | norm 0.2350 | lr 1.19e-03 | (3811.83 ms | 137542 tok/s) step 1970/76294 | train loss 3.855290 | norm 0.2142 | lr 1.19e-03 | (3831.18 ms | 136847 tok/s) step 1971/76294 | train loss 3.809190 | norm 0.2045 | lr 1.19e-03 | (3809.07 ms | 137642 tok/s) step 1972/76294 | train loss 3.876323 | norm 0.2105 | lr 1.19e-03 | (3878.95 ms | 135162 tok/s) step 1973/76294 | train loss 3.794997 | norm 0.1954 | lr 1.19e-03 | (3804.97 ms | 137790 tok/s) step 1974/76294 | train loss 3.765813 | norm 0.2104 | lr 1.19e-03 | (3810.74 ms | 137582 tok/s) step 1975/76294 | train loss 3.799056 | norm 0.2222 | lr 1.19e-03 | (3828.09 ms | 136958 tok/s) step 1976/76294 | train loss 3.893638 | norm 0.2555 | lr 1.19e-03 | (5470.29 ms | 95843 tok/s) step 1977/76294 | train loss 3.762403 | norm 0.2601 | lr 1.19e-03 | (5973.64 ms | 87767 tok/s) step 1978/76294 | train loss 3.875037 | norm 0.2757 | lr 1.19e-03 | (3814.92 ms | 137431 tok/s) step 1979/76294 | train loss 3.786041 | norm 0.2306 | lr 1.19e-03 | (3820.55 ms | 137228 tok/s) step 1980/76294 | train loss 3.824680 | norm 0.3282 | lr 1.19e-03 | (3821.60 ms | 137191 tok/s) step 1981/76294 | train loss 3.876967 | norm 0.3652 | lr 1.19e-03 | (3799.46 ms | 137990 tok/s) step 
1982/76294 | train loss 3.981528 | norm 0.3627 | lr 1.19e-03 | (3819.23 ms | 137276 tok/s) step 1983/76294 | train loss 3.860155 | norm 0.3977 | lr 1.19e-03 | (3800.33 ms | 137959 tok/s) step 1984/76294 | train loss 3.846730 | norm 0.3374 | lr 1.19e-03 | (3825.70 ms | 137044 tok/s) step 1985/76294 | train loss 3.824561 | norm 0.3007 | lr 1.19e-03 | (5346.06 ms | 98070 tok/s) step 1986/76294 | train loss 3.778210 | norm 0.2467 | lr 1.19e-03 | (3798.34 ms | 138031 tok/s) step 1987/76294 | train loss 3.880780 | norm 0.2830 | lr 1.19e-03 | (3868.47 ms | 135529 tok/s) step 1988/76294 | train loss 3.881752 | norm 0.2408 | lr 1.19e-03 | (3795.05 ms | 138150 tok/s) step 1989/76294 | train loss 3.780396 | norm 0.2548 | lr 1.19e-03 | (3832.90 ms | 136786 tok/s) step 1990/76294 | train loss 3.844299 | norm 0.2234 | lr 1.19e-03 | (3797.08 ms | 138077 tok/s) step 1991/76294 | train loss 3.836112 | norm 0.2394 | lr 1.19e-03 | (3877.06 ms | 135228 tok/s) step 1992/76294 | train loss 3.818016 | norm 0.2603 | lr 1.19e-03 | (3917.53 ms | 133831 tok/s) step 1993/76294 | train loss 3.835729 | norm 0.2510 | lr 1.19e-03 | (3807.45 ms | 137701 tok/s) step 1994/76294 | train loss 3.841316 | norm 0.2235 | lr 1.19e-03 | (4266.45 ms | 122886 tok/s) step 1995/76294 | train loss 3.968071 | norm 0.2824 | lr 1.19e-03 | (3972.77 ms | 131971 tok/s) step 1996/76294 | train loss 3.885101 | norm 0.2455 | lr 1.19e-03 | (3824.09 ms | 137101 tok/s) step 1997/76294 | train loss 3.888791 | norm 0.2742 | lr 1.19e-03 | (3823.17 ms | 137134 tok/s) step 1998/76294 | train loss 3.798039 | norm 0.2737 | lr 1.19e-03 | (4052.42 ms | 129377 tok/s) step 1999/76294 | train loss 3.839304 | norm 0.3318 | lr 1.19e-03 | (3794.29 ms | 138178 tok/s) step 2000/76294 | train loss 3.904334 | norm 0.3182 | lr 1.19e-03 | (3845.71 ms | 136331 tok/s) val loss: 3.845704 saving model checkpoint to ./results/gpt2-124M-gqa/step_2000.pth step 2001/76294 | train loss 3.806220 | norm 0.3178 | lr 1.19e-03 | (3851.88 ms | 136112 tok/s) 
step 2002/76294 | train loss 3.926526 | norm 0.2737 | lr 1.19e-03 | (3793.75 ms | 138198 tok/s) step 2003/76294 | train loss 3.835924 | norm 0.2178 | lr 1.19e-03 | (3844.61 ms | 136370 tok/s) step 2004/76294 | train loss 3.821974 | norm 0.2506 | lr 1.19e-03 | (3795.42 ms | 138137 tok/s) step 2005/76294 | train loss 3.835040 | norm 0.2377 | lr 1.19e-03 | (4157.43 ms | 126109 tok/s) step 2006/76294 | train loss 3.835636 | norm 0.2112 | lr 1.19e-03 | (3840.42 ms | 136518 tok/s) step 2007/76294 | train loss 3.793629 | norm 0.2176 | lr 1.19e-03 | (3803.27 ms | 137852 tok/s) step 2008/76294 | train loss 3.897966 | norm 0.1928 | lr 1.19e-03 | (3821.36 ms | 137199 tok/s) step 2009/76294 | train loss 3.867364 | norm 0.2042 | lr 1.19e-03 | (3796.26 ms | 138107 tok/s) step 2010/76294 | train loss 3.773149 | norm 0.2184 | lr 1.19e-03 | (3812.86 ms | 137505 tok/s) step 2011/76294 | train loss 3.830924 | norm 0.1949 | lr 1.19e-03 | (3828.74 ms | 136935 tok/s) step 2012/76294 | train loss 3.771070 | norm 0.2779 | lr 1.19e-03 | (3947.34 ms | 132821 tok/s) step 2013/76294 | train loss 3.797003 | norm 0.3147 | lr 1.19e-03 | (3800.54 ms | 137951 tok/s) step 2014/76294 | train loss 3.787781 | norm 0.3254 | lr 1.19e-03 | (3824.75 ms | 137078 tok/s) step 2015/76294 | train loss 3.811366 | norm 0.2830 | lr 1.19e-03 | (3799.92 ms | 137973 tok/s) step 2016/76294 | train loss 3.825792 | norm 0.2679 | lr 1.19e-03 | (3814.91 ms | 137431 tok/s) step 2017/76294 | train loss 3.870764 | norm 0.2185 | lr 1.19e-03 | (3832.12 ms | 136814 tok/s) step 2018/76294 | train loss 3.834168 | norm 0.2329 | lr 1.19e-03 | (3810.03 ms | 137607 tok/s) step 2019/76294 | train loss 3.813802 | norm 0.2301 | lr 1.19e-03 | (3835.96 ms | 136677 tok/s) step 2020/76294 | train loss 3.860939 | norm 0.2162 | lr 1.19e-03 | (3806.55 ms | 137733 tok/s) step 2021/76294 | train loss 3.769001 | norm 0.2007 | lr 1.19e-03 | (3831.03 ms | 136853 tok/s) step 2022/76294 | train loss 3.831640 | norm 0.2027 | lr 1.19e-03 | (3834.97 ms 
| 136712 tok/s) step 2023/76294 | train loss 3.810308 | norm 0.2158 | lr 1.19e-03 | (3829.89 ms | 136894 tok/s) step 2024/76294 | train loss 3.832012 | norm 0.2220 | lr 1.19e-03 | (3810.70 ms | 137583 tok/s) step 2025/76294 | train loss 3.851738 | norm 0.2091 | lr 1.19e-03 | (3932.63 ms | 133317 tok/s) step 2026/76294 | train loss 3.826197 | norm 0.2219 | lr 1.19e-03 | (3845.87 ms | 136325 tok/s) step 2027/76294 | train loss 3.753825 | norm 0.1884 | lr 1.19e-03 | (3807.06 ms | 137715 tok/s) step 2028/76294 | train loss 3.812279 | norm 0.7876 | lr 1.19e-03 | (4166.38 ms | 125838 tok/s) step 2029/76294 | train loss 3.793926 | norm 0.2109 | lr 1.19e-03 | (3858.96 ms | 135862 tok/s) step 2030/76294 | train loss 3.817849 | norm 0.2145 | lr 1.19e-03 | (3860.99 ms | 135791 tok/s) step 2031/76294 | train loss 3.971134 | norm 0.2310 | lr 1.19e-03 | (3808.62 ms | 137658 tok/s) step 2032/76294 | train loss 3.775737 | norm 0.1888 | lr 1.19e-03 | (3883.79 ms | 134994 tok/s) step 2033/76294 | train loss 3.838807 | norm 0.2173 | lr 1.19e-03 | (3801.65 ms | 137911 tok/s) step 2034/76294 | train loss 3.799396 | norm 0.2212 | lr 1.19e-03 | (3835.01 ms | 136711 tok/s) step 2035/76294 | train loss 3.831820 | norm 0.2396 | lr 1.19e-03 | (3807.90 ms | 137684 tok/s) step 2036/76294 | train loss 3.868589 | norm 0.2372 | lr 1.19e-03 | (3813.85 ms | 137469 tok/s) step 2037/76294 | train loss 3.829705 | norm 0.2432 | lr 1.19e-03 | (3855.91 ms | 135970 tok/s) step 2038/76294 | train loss 3.800151 | norm 0.2907 | lr 1.19e-03 | (3848.12 ms | 136245 tok/s) step 2039/76294 | train loss 3.840589 | norm 0.4494 | lr 1.19e-03 | (3839.13 ms | 136564 tok/s) step 2040/76294 | train loss 3.799307 | norm 0.4404 | lr 1.19e-03 | (3819.32 ms | 137273 tok/s) step 2041/76294 | train loss 3.805670 | norm 0.2931 | lr 1.19e-03 | (3837.46 ms | 136624 tok/s) step 2042/76294 | train loss 3.825688 | norm 0.3480 | lr 1.19e-03 | (3820.23 ms | 137240 tok/s) step 2043/76294 | train loss 3.802540 | norm 0.2796 | lr 
1.19e-03 | (3820.95 ms | 137214 tok/s) step 2044/76294 | train loss 3.827257 | norm 0.3046 | lr 1.19e-03 | (3849.35 ms | 136202 tok/s) step 2045/76294 | train loss 3.864895 | norm 0.2800 | lr 1.19e-03 | (3820.60 ms | 137227 tok/s) step 2046/76294 | train loss 3.819493 | norm 0.2861 | lr 1.19e-03 | (3817.76 ms | 137329 tok/s) step 2047/76294 | train loss 3.758508 | norm 0.2715 | lr 1.19e-03 | (3870.48 ms | 135458 tok/s) step 2048/76294 | train loss 3.850300 | norm 0.2834 | lr 1.19e-03 | (3902.35 ms | 134352 tok/s) step 2049/76294 | train loss 3.832757 | norm 0.2842 | lr 1.19e-03 | (3809.40 ms | 137630 tok/s) step 2050/76294 | train loss 3.762156 | norm 0.2417 | lr 1.19e-03 | (3820.72 ms | 137222 tok/s) step 2051/76294 | train loss 3.916360 | norm 0.2838 | lr 1.19e-03 | (3845.96 ms | 136322 tok/s) step 2052/76294 | train loss 3.858079 | norm 0.3292 | lr 1.19e-03 | (3888.74 ms | 134822 tok/s) step 2053/76294 | train loss 3.753655 | norm 0.2933 | lr 1.19e-03 | (3990.04 ms | 131399 tok/s) step 2054/76294 | train loss 3.842751 | norm 0.2652 | lr 1.19e-03 | (3813.55 ms | 137480 tok/s) step 2055/76294 | train loss 3.854479 | norm 0.2760 | lr 1.19e-03 | (3823.55 ms | 137121 tok/s) step 2056/76294 | train loss 3.783990 | norm 0.2414 | lr 1.19e-03 | (3818.63 ms | 137297 tok/s) step 2057/76294 | train loss 3.775344 | norm 0.2235 | lr 1.19e-03 | (3847.73 ms | 136259 tok/s) step 2058/76294 | train loss 3.795138 | norm 0.2123 | lr 1.19e-03 | (3817.95 ms | 137322 tok/s) step 2059/76294 | train loss 3.835246 | norm 0.2055 | lr 1.19e-03 | (3819.91 ms | 137251 tok/s) step 2060/76294 | train loss 3.744388 | norm 0.2224 | lr 1.19e-03 | (3843.02 ms | 136426 tok/s) step 2061/76294 | train loss 3.790483 | norm 0.2611 | lr 1.19e-03 | (3879.43 ms | 135146 tok/s) step 2062/76294 | train loss 3.796393 | norm 0.2175 | lr 1.19e-03 | (3840.05 ms | 136532 tok/s) step 2063/76294 | train loss 3.772320 | norm 0.2094 | lr 1.19e-03 | (3829.71 ms | 136900 tok/s) step 2064/76294 | train loss 3.795849 | 
norm 0.2020 | lr 1.19e-03 | (3823.13 ms | 137136 tok/s) step 2065/76294 | train loss 3.715068 | norm 0.2131 | lr 1.19e-03 | (3871.20 ms | 135433 tok/s) step 2066/76294 | train loss 3.872791 | norm 0.2266 | lr 1.19e-03 | (3812.58 ms | 137515 tok/s) step 2067/76294 | train loss 3.806168 | norm 0.2116 | lr 1.19e-03 | (3831.49 ms | 136837 tok/s) step 2068/76294 | train loss 3.742188 | norm 0.2178 | lr 1.19e-03 | (4054.69 ms | 129304 tok/s) step 2069/76294 | train loss 3.800721 | norm 0.2015 | lr 1.19e-03 | (3818.56 ms | 137300 tok/s) step 2070/76294 | train loss 3.793772 | norm 0.2062 | lr 1.19e-03 | (3810.72 ms | 137582 tok/s) step 2071/76294 | train loss 3.840641 | norm 0.2261 | lr 1.19e-03 | (3841.99 ms | 136463 tok/s) step 2072/76294 | train loss 3.816139 | norm 0.2676 | lr 1.19e-03 | (3984.17 ms | 131593 tok/s) step 2073/76294 | train loss 3.850919 | norm 0.2906 | lr 1.19e-03 | (3990.09 ms | 131398 tok/s) step 2074/76294 | train loss 3.776957 | norm 0.2165 | lr 1.19e-03 | (3827.06 ms | 136995 tok/s) step 2075/76294 | train loss 3.922728 | norm 0.2105 | lr 1.19e-03 | (3887.90 ms | 134851 tok/s) step 2076/76294 | train loss 3.816859 | norm 0.2269 | lr 1.19e-03 | (3813.02 ms | 137499 tok/s) step 2077/76294 | train loss 3.762137 | norm 0.2292 | lr 1.19e-03 | (3819.87 ms | 137253 tok/s) step 2078/76294 | train loss 3.791373 | norm 0.2378 | lr 1.19e-03 | (3828.50 ms | 136943 tok/s) step 2079/76294 | train loss 3.804456 | norm 0.2230 | lr 1.19e-03 | (3809.67 ms | 137620 tok/s) step 2080/76294 | train loss 3.808963 | norm 0.2404 | lr 1.19e-03 | (3808.92 ms | 137647 tok/s) step 2081/76294 | train loss 3.821315 | norm 0.2325 | lr 1.19e-03 | (3863.87 ms | 135690 tok/s) step 2082/76294 | train loss 3.794866 | norm 0.2399 | lr 1.19e-03 | (3805.85 ms | 137758 tok/s) step 2083/76294 | train loss 3.874703 | norm 0.2997 | lr 1.19e-03 | (3878.79 ms | 135168 tok/s) step 2084/76294 | train loss 3.807640 | norm 0.2839 | lr 1.19e-03 | (3805.18 ms | 137783 tok/s) step 2085/76294 | train 
loss 3.814058 | norm 0.2729 | lr 1.19e-03 | (3897.63 ms | 134515 tok/s) step 2086/76294 | train loss 3.825286 | norm 0.2675 | lr 1.19e-03 | (3823.72 ms | 137115 tok/s) step 2087/76294 | train loss 3.799432 | norm 0.2269 | lr 1.19e-03 | (3821.57 ms | 137192 tok/s) step 2088/76294 | train loss 3.810978 | norm 0.2188 | lr 1.19e-03 | (3831.05 ms | 136852 tok/s) step 2089/76294 | train loss 3.808019 | norm 0.2123 | lr 1.19e-03 | (3817.78 ms | 137328 tok/s) step 2090/76294 | train loss 3.754338 | norm 0.2012 | lr 1.19e-03 | (3807.50 ms | 137699 tok/s) step 2091/76294 | train loss 3.762157 | norm 0.2051 | lr 1.19e-03 | (3880.02 ms | 135125 tok/s) step 2092/76294 | train loss 3.899072 | norm 0.2501 | lr 1.19e-03 | (3806.23 ms | 137745 tok/s) step 2093/76294 | train loss 3.785408 | norm 0.2281 | lr 1.19e-03 | (3812.92 ms | 137503 tok/s) step 2094/76294 | train loss 3.742445 | norm 0.1776 | lr 1.19e-03 | (3830.63 ms | 136867 tok/s) step 2095/76294 | train loss 3.826317 | norm 0.2181 | lr 1.19e-03 | (3810.38 ms | 137595 tok/s) step 2096/76294 | train loss 3.785658 | norm 0.2157 | lr 1.19e-03 | (3899.80 ms | 134440 tok/s) step 2097/76294 | train loss 3.787501 | norm 0.2573 | lr 1.19e-03 | (3820.18 ms | 137242 tok/s) step 2098/76294 | train loss 3.731814 | norm 0.2916 | lr 1.19e-03 | (3830.59 ms | 136869 tok/s) step 2099/76294 | train loss 3.845819 | norm 0.2404 | lr 1.19e-03 | (5012.94 ms | 104587 tok/s) step 2100/76294 | train loss 3.755916 | norm 0.2381 | lr 1.19e-03 | (3824.88 ms | 137073 tok/s) step 2101/76294 | train loss 3.856810 | norm 0.2433 | lr 1.19e-03 | (3806.43 ms | 137737 tok/s) step 2102/76294 | train loss 3.775755 | norm 0.2359 | lr 1.19e-03 | (3802.46 ms | 137881 tok/s) step 2103/76294 | train loss 3.865717 | norm 0.2241 | lr 1.19e-03 | (3829.37 ms | 136912 tok/s) step 2104/76294 | train loss 3.769375 | norm 0.2232 | lr 1.19e-03 | (3806.46 ms | 137737 tok/s) step 2105/76294 | train loss 3.885407 | norm 0.2441 | lr 1.19e-03 | (3820.44 ms | 137232 tok/s) step 
2106/76294 | train loss 3.732309 | norm 0.2400 | lr 1.19e-03 | (3804.34 ms | 137813 tok/s) step 2107/76294 | train loss 3.809714 | norm 0.2253 | lr 1.19e-03 | (3809.82 ms | 137615 tok/s) step 2108/76294 | train loss 3.827713 | norm 0.2208 | lr 1.19e-03 | (3834.52 ms | 136728 tok/s) step 2109/76294 | train loss 3.845460 | norm 0.2284 | lr 1.19e-03 | (3805.80 ms | 137760 tok/s) step 2110/76294 | train loss 3.880033 | norm 0.2615 | lr 1.19e-03 | (3804.20 ms | 137818 tok/s) step 2111/76294 | train loss 3.901551 | norm 0.2718 | lr 1.19e-03 | (3840.83 ms | 136504 tok/s) step 2112/76294 | train loss 3.910170 | norm 0.2615 | lr 1.19e-03 | (3805.88 ms | 137758 tok/s) step 2113/76294 | train loss 3.792255 | norm 0.2646 | lr 1.19e-03 | (3807.75 ms | 137690 tok/s) step 2114/76294 | train loss 3.870009 | norm 0.2449 | lr 1.19e-03 | (3882.75 ms | 135030 tok/s) step 2115/76294 | train loss 3.831520 | norm 0.2014 | lr 1.19e-03 | (3900.92 ms | 134401 tok/s) step 2116/76294 | train loss 3.899325 | norm 0.2653 | lr 1.19e-03 | (3807.01 ms | 137717 tok/s) step 2117/76294 | train loss 3.767173 | norm 0.2316 | lr 1.19e-03 | (3827.97 ms | 136963 tok/s) step 2118/76294 | train loss 3.937341 | norm 0.2343 | lr 1.19e-03 | (3809.45 ms | 137628 tok/s) step 2119/76294 | train loss 3.791639 | norm 0.2337 | lr 1.19e-03 | (3865.78 ms | 135623 tok/s) step 2120/76294 | train loss 3.858008 | norm 0.2270 | lr 1.19e-03 | (3813.57 ms | 137480 tok/s) step 2121/76294 | train loss 3.879084 | norm 0.2301 | lr 1.19e-03 | (3834.39 ms | 136733 tok/s) step 2122/76294 | train loss 3.814033 | norm 0.2405 | lr 1.19e-03 | (3805.27 ms | 137779 tok/s) step 2123/76294 | train loss 3.849288 | norm 0.3644 | lr 1.19e-03 | (3857.79 ms | 135904 tok/s) step 2124/76294 | train loss 3.810914 | norm 0.4327 | lr 1.19e-03 | (3805.80 ms | 137760 tok/s) step 2125/76294 | train loss 3.817340 | norm 0.3088 | lr 1.19e-03 | (3808.38 ms | 137667 tok/s) step 2126/76294 | train loss 3.812326 | norm 0.3029 | lr 1.19e-03 | (3826.07 ms | 
137030 tok/s) step 2127/76294 | train loss 3.852551 | norm 0.2563 | lr 1.19e-03 | (4821.77 ms | 108734 tok/s) step 2128/76294 | train loss 3.833415 | norm 0.2468 | lr 1.19e-03 | (4602.55 ms | 113913 tok/s) step 2129/76294 | train loss 3.859377 | norm 0.2598 | lr 1.19e-03 | (3807.83 ms | 137687 tok/s) step 2130/76294 | train loss 3.794982 | norm 0.2230 | lr 1.19e-03 | (3810.07 ms | 137606 tok/s) step 2131/76294 | train loss 3.843108 | norm 0.2179 | lr 1.19e-03 | (3824.57 ms | 137084 tok/s) step 2132/76294 | train loss 3.891774 | norm 0.2174 | lr 1.19e-03 | (3804.13 ms | 137821 tok/s) step 2133/76294 | train loss 3.809224 | norm 0.2444 | lr 1.19e-03 | (3802.77 ms | 137870 tok/s) step 2134/76294 | train loss 3.805376 | norm 0.2229 | lr 1.18e-03 | (4105.70 ms | 127697 tok/s) step 2135/76294 | train loss 3.766266 | norm 0.2139 | lr 1.18e-03 | (4377.59 ms | 119766 tok/s) step 2136/76294 | train loss 3.839796 | norm 0.1983 | lr 1.18e-03 | (3795.51 ms | 138134 tok/s) step 2137/76294 | train loss 3.843279 | norm 0.2079 | lr 1.18e-03 | (3830.80 ms | 136861 tok/s) step 2138/76294 | train loss 3.811944 | norm 0.1920 | lr 1.18e-03 | (3798.88 ms | 138011 tok/s) step 2139/76294 | train loss 3.819953 | norm 0.2236 | lr 1.18e-03 | (3894.76 ms | 134614 tok/s) step 2140/76294 | train loss 3.821296 | norm 0.2285 | lr 1.18e-03 | (3798.48 ms | 138026 tok/s) step 2141/76294 | train loss 3.844826 | norm 0.1913 | lr 1.18e-03 | (3823.52 ms | 137122 tok/s) step 2142/76294 | train loss 3.746763 | norm 0.1984 | lr 1.18e-03 | (3796.00 ms | 138116 tok/s) step 2143/76294 | train loss 3.814322 | norm 0.2095 | lr 1.18e-03 | (3839.32 ms | 136557 tok/s) step 2144/76294 | train loss 3.824649 | norm 0.1808 | lr 1.18e-03 | (3800.63 ms | 137948 tok/s) step 2145/76294 | train loss 3.811392 | norm 0.2133 | lr 1.18e-03 | (3810.86 ms | 137577 tok/s) step 2146/76294 | train loss 3.808072 | norm 0.2256 | lr 1.18e-03 | (3839.31 ms | 136558 tok/s) step 2147/76294 | train loss 3.885219 | norm 0.3702 | lr 1.18e-03 
| (3817.16 ms | 137350 tok/s) step 2148/76294 | train loss 3.844619 | norm 0.2971 | lr 1.18e-03 | (3824.58 ms | 137084 tok/s) step 2149/76294 | train loss 3.805037 | norm 0.3095 | lr 1.18e-03 | (3815.12 ms | 137424 tok/s) step 2150/76294 | train loss 3.858460 | norm 0.2560 | lr 1.18e-03 | (3830.81 ms | 136861 tok/s) step 2151/76294 | train loss 3.788205 | norm 0.2622 | lr 1.18e-03 | (3809.43 ms | 137629 tok/s) step 2152/76294 | train loss 3.849710 | norm 0.2531 | lr 1.18e-03 | (3801.22 ms | 137926 tok/s) step 2153/76294 | train loss 3.784130 | norm 0.2696 | lr 1.18e-03 | (3833.31 ms | 136772 tok/s) step 2154/76294 | train loss 3.813459 | norm 0.2424 | lr 1.18e-03 | (3839.91 ms | 136537 tok/s) step 2155/76294 | train loss 3.801726 | norm 0.2497 | lr 1.18e-03 | (3904.36 ms | 134283 tok/s) step 2156/76294 | train loss 3.811096 | norm 0.2179 | lr 1.18e-03 | (3860.65 ms | 135803 tok/s) step 2157/76294 | train loss 3.801416 | norm 0.2610 | lr 1.18e-03 | (3807.28 ms | 137707 tok/s) step 2158/76294 | train loss 3.736177 | norm 0.2666 | lr 1.18e-03 | (3823.87 ms | 137109 tok/s) step 2159/76294 | train loss 3.868981 | norm 0.2505 | lr 1.18e-03 | (3807.30 ms | 137706 tok/s) step 2160/76294 | train loss 3.846268 | norm 0.2863 | lr 1.18e-03 | (3859.27 ms | 135851 tok/s) step 2161/76294 | train loss 3.820389 | norm 0.4602 | lr 1.18e-03 | (3811.38 ms | 137559 tok/s) step 2162/76294 | train loss 3.785831 | norm 0.4291 | lr 1.18e-03 | (3836.40 ms | 136661 tok/s) step 2163/76294 | train loss 3.867046 | norm 0.2766 | lr 1.18e-03 | (3810.93 ms | 137575 tok/s) step 2164/76294 | train loss 3.772218 | norm 0.2706 | lr 1.18e-03 | (3811.82 ms | 137543 tok/s) step 2165/76294 | train loss 4.000285 | norm 0.2770 | lr 1.18e-03 | (3826.54 ms | 137014 tok/s) step 2166/76294 | train loss 3.814498 | norm 0.2473 | lr 1.18e-03 | (3967.15 ms | 132157 tok/s) step 2167/76294 | train loss 3.805944 | norm 0.2507 | lr 1.18e-03 | (3836.35 ms | 136663 tok/s) step 2168/76294 | train loss 3.810331 | norm 
0.2173 | lr 1.18e-03 | (3818.26 ms | 137311 tok/s) step 2169/76294 | train loss 3.814877 | norm 0.2175 | lr 1.18e-03 | (3819.01 ms | 137284 tok/s) step 2170/76294 | train loss 3.840691 | norm 0.2060 | lr 1.18e-03 | (3804.70 ms | 137800 tok/s) step 2171/76294 | train loss 3.772186 | norm 0.2392 | lr 1.18e-03 | (3809.16 ms | 137639 tok/s) step 2172/76294 | train loss 3.822289 | norm 0.2665 | lr 1.18e-03 | (3818.32 ms | 137309 tok/s) step 2173/76294 | train loss 3.799171 | norm 0.2122 | lr 1.18e-03 | (3808.98 ms | 137645 tok/s) step 2174/76294 | train loss 3.814402 | norm 0.2135 | lr 1.18e-03 | (3878.60 ms | 135175 tok/s) step 2175/76294 | train loss 3.829379 | norm 0.2132 | lr 1.18e-03 | (3807.29 ms | 137706 tok/s) step 2176/76294 | train loss 3.869248 | norm 0.1836 | lr 1.18e-03 | (3868.31 ms | 135534 tok/s) step 2177/76294 | train loss 3.870663 | norm 0.2074 | lr 1.18e-03 | (3808.06 ms | 137679 tok/s) step 2178/76294 | train loss 3.804502 | norm 0.2026 | lr 1.18e-03 | (3833.80 ms | 136754 tok/s) step 2179/76294 | train loss 3.878087 | norm 0.1948 | lr 1.18e-03 | (3880.53 ms | 135107 tok/s) step 2180/76294 | train loss 3.771951 | norm 0.2120 | lr 1.18e-03 | (3944.10 ms | 132930 tok/s) step 2181/76294 | train loss 3.782619 | norm 0.2070 | lr 1.18e-03 | (3805.11 ms | 137785 tok/s) step 2182/76294 | train loss 3.773631 | norm 0.2233 | lr 1.18e-03 | (3812.29 ms | 137526 tok/s) step 2183/76294 | train loss 3.857650 | norm 0.3045 | lr 1.18e-03 | (3832.76 ms | 136791 tok/s) step 2184/76294 | train loss 3.906336 | norm 0.3414 | lr 1.18e-03 | (3808.18 ms | 137674 tok/s) step 2185/76294 | train loss 3.758492 | norm 0.3164 | lr 1.18e-03 | (3815.26 ms | 137419 tok/s) step 2186/76294 | train loss 3.870820 | norm 0.2294 | lr 1.18e-03 | (3881.68 ms | 135067 tok/s) step 2187/76294 | train loss 3.846024 | norm 0.2353 | lr 1.18e-03 | (3813.73 ms | 137474 tok/s) step 2188/76294 | train loss 3.909088 | norm 0.3102 | lr 1.18e-03 | (3819.83 ms | 137254 tok/s) step 2189/76294 | train loss 
3.775804 | norm 0.2170 | lr 1.18e-03 | (3836.91 ms | 136643 tok/s) step 2190/76294 | train loss 3.842084 | norm 0.2123 | lr 1.18e-03 | (3809.85 ms | 137614 tok/s) step 2191/76294 | train loss 3.850882 | norm 0.7754 | lr 1.18e-03 | (4168.37 ms | 125778 tok/s) step 2192/76294 | train loss 6.298288 | norm 49.1055 | lr 1.18e-03 | (4788.81 ms | 109482 tok/s) step 2193/76294 | train loss 4.579095 | norm 18.4496 | lr 1.18e-03 | (3863.03 ms | 135719 tok/s) step 2194/76294 | train loss 3.873689 | norm 0.7668 | lr 1.18e-03 | (4836.93 ms | 108393 tok/s) step 2195/76294 | train loss 3.941435 | norm 2.4453 | lr 1.18e-03 | (3802.62 ms | 137875 tok/s) step 2196/76294 | train loss 3.987914 | norm 0.8506 | lr 1.18e-03 | (3823.69 ms | 137116 tok/s) step 2197/76294 | train loss 3.858023 | norm 0.4853 | lr 1.18e-03 | (3853.89 ms | 136041 tok/s) step 2198/76294 | train loss 3.905847 | norm 0.4721 | lr 1.18e-03 | (3834.75 ms | 136720 tok/s) step 2199/76294 | train loss 3.793671 | norm 0.3160 | lr 1.18e-03 | (3820.59 ms | 137227 tok/s) step 2200/76294 | train loss 3.900797 | norm 0.3204 | lr 1.18e-03 | (3805.87 ms | 137758 tok/s) step 2201/76294 | train loss 3.878596 | norm 0.2723 | lr 1.18e-03 | (3827.35 ms | 136985 tok/s) step 2202/76294 | train loss 3.853979 | norm 0.2674 | lr 1.18e-03 | (3821.06 ms | 137210 tok/s) step 2203/76294 | train loss 3.920400 | norm 0.2579 | lr 1.18e-03 | (3823.19 ms | 137134 tok/s) step 2204/76294 | train loss 3.831589 | norm 0.2661 | lr 1.18e-03 | (3810.31 ms | 137597 tok/s) step 2205/76294 | train loss 3.851412 | norm 0.2507 | lr 1.18e-03 | (3856.91 ms | 135935 tok/s) step 2206/76294 | train loss 3.850504 | norm 0.2448 | lr 1.18e-03 | (3796.29 ms | 138105 tok/s) step 2207/76294 | train loss 3.844193 | norm 0.2050 | lr 1.18e-03 | (3850.80 ms | 136150 tok/s) step 2208/76294 | train loss 3.811395 | norm 0.2120 | lr 1.18e-03 | (3836.80 ms | 136647 tok/s) step 2209/76294 | train loss 3.840090 | norm 0.2127 | lr 1.18e-03 | (3799.77 ms | 137979 tok/s) step 
2210/76294 | train loss 3.850396 | norm 0.2158 | lr 1.18e-03 | (3828.26 ms | 136952 tok/s) step 2211/76294 | train loss 3.773171 | norm 0.1920 | lr 1.18e-03 | (4043.70 ms | 129655 tok/s) step 2212/76294 | train loss 3.874840 | norm 0.1936 | lr 1.18e-03 | (3802.99 ms | 137862 tok/s) step 2213/76294 | train loss 3.808382 | norm 0.1978 | lr 1.18e-03 | (3837.12 ms | 136636 tok/s) step 2214/76294 | train loss 3.820799 | norm 0.1874 | lr 1.18e-03 | (5347.23 ms | 98049 tok/s) step 2215/76294 | train loss 3.814876 | norm 0.1852 | lr 1.18e-03 | (3806.31 ms | 137742 tok/s) step 2216/76294 | train loss 3.804134 | norm 0.1721 | lr 1.18e-03 | (3832.00 ms | 136818 tok/s) step 2217/76294 | train loss 3.832844 | norm 0.1896 | lr 1.18e-03 | (3808.07 ms | 137678 tok/s) step 2218/76294 | train loss 3.746720 | norm 0.1822 | lr 1.18e-03 | (3806.38 ms | 137739 tok/s) step 2219/76294 | train loss 3.759756 | norm 0.1984 | lr 1.18e-03 | (3850.25 ms | 136170 tok/s) step 2220/76294 | train loss 3.801445 | norm 0.1862 | lr 1.18e-03 | (3808.25 ms | 137672 tok/s) step 2221/76294 | train loss 3.801623 | norm 0.1786 | lr 1.18e-03 | (3847.28 ms | 136275 tok/s) step 2222/76294 | train loss 3.783144 | norm 0.1849 | lr 1.18e-03 | (3811.82 ms | 137543 tok/s) step 2223/76294 | train loss 3.756951 | norm 0.2096 | lr 1.18e-03 | (3816.05 ms | 137390 tok/s) step 2224/76294 | train loss 3.802324 | norm 0.2144 | lr 1.18e-03 | (3843.13 ms | 136422 tok/s) step 2225/76294 | train loss 3.764587 | norm 0.2204 | lr 1.18e-03 | (3814.30 ms | 137453 tok/s) step 2226/76294 | train loss 3.869358 | norm 0.2178 | lr 1.18e-03 | (3847.74 ms | 136259 tok/s) step 2227/76294 | train loss 3.910386 | norm 0.2194 | lr 1.18e-03 | (5076.98 ms | 103268 tok/s) step 2228/76294 | train loss 3.797106 | norm 0.2357 | lr 1.18e-03 | (3864.37 ms | 135672 tok/s) step 2229/76294 | train loss 3.818386 | norm 0.2364 | lr 1.18e-03 | (3804.18 ms | 137819 tok/s) step 2230/76294 | train loss 3.799469 | norm 0.2051 | lr 1.18e-03 | (3817.36 ms | 
137343 tok/s) step 2231/76294 | train loss 3.803301 | norm 0.2271 | lr 1.18e-03 | (4279.02 ms | 122525 tok/s) step 2232/76294 | train loss 3.780286 | norm 0.2530 | lr 1.18e-03 | (3807.24 ms | 137708 tok/s) step 2233/76294 | train loss 3.754332 | norm 0.2614 | lr 1.18e-03 | (4195.15 ms | 124975 tok/s) step 2234/76294 | train loss 3.809497 | norm 0.2067 | lr 1.18e-03 | (4193.07 ms | 125037 tok/s) step 2235/76294 | train loss 3.735949 | norm 0.2096 | lr 1.18e-03 | (3845.50 ms | 136338 tok/s) step 2236/76294 | train loss 3.868367 | norm 0.2272 | lr 1.18e-03 | (3811.97 ms | 137537 tok/s) step 2237/76294 | train loss 3.792179 | norm 0.2283 | lr 1.18e-03 | (3831.75 ms | 136827 tok/s) step 2238/76294 | train loss 3.839573 | norm 0.2118 | lr 1.18e-03 | (3809.58 ms | 137624 tok/s) step 2239/76294 | train loss 3.815856 | norm 0.1929 | lr 1.18e-03 | (3825.90 ms | 137036 tok/s) step 2240/76294 | train loss 3.799816 | norm 0.2176 | lr 1.18e-03 | (3806.98 ms | 137718 tok/s) step 2241/76294 | train loss 3.817174 | norm 0.2260 | lr 1.18e-03 | (3832.46 ms | 136802 tok/s) step 2242/76294 | train loss 3.798883 | norm 0.2262 | lr 1.18e-03 | (3809.79 ms | 137616 tok/s) step 2243/76294 | train loss 3.793917 | norm 0.2488 | lr 1.18e-03 | (3808.28 ms | 137671 tok/s) step 2244/76294 | train loss 3.826403 | norm 0.1907 | lr 1.18e-03 | (3848.02 ms | 136249 tok/s) step 2245/76294 | train loss 3.746790 | norm 0.2074 | lr 1.18e-03 | (3811.25 ms | 137563 tok/s) step 2246/76294 | train loss 3.770683 | norm 0.2738 | lr 1.18e-03 | (3816.93 ms | 137358 tok/s) step 2247/76294 | train loss 3.772446 | norm 0.3120 | lr 1.18e-03 | (3832.26 ms | 136809 tok/s) step 2248/76294 | train loss 3.807273 | norm 0.3442 | lr 1.18e-03 | (3808.88 ms | 137649 tok/s) step 2249/76294 | train loss 3.742647 | norm 0.2890 | lr 1.18e-03 | (3809.14 ms | 137639 tok/s) step 2250/76294 | train loss 3.824586 | norm 0.2887 | lr 1.18e-03 | (3842.28 ms | 136452 tok/s) val loss: 3.789963 saving model checkpoint to 
./results/gpt2-124M-gqa/step_2250.pth step 2251/76294 | train loss 3.866443 | norm 0.2835 | lr 1.18e-03 | (4104.82 ms | 127725 tok/s) step 2252/76294 | train loss 3.856323 | norm 0.2271 | lr 1.18e-03 | (3753.02 ms | 139697 tok/s) step 2253/76294 | train loss 3.799759 | norm 0.2258 | lr 1.18e-03 | (3786.67 ms | 138456 tok/s) step 2254/76294 | train loss 3.700429 | norm 0.2496 | lr 1.18e-03 | (3784.08 ms | 138551 tok/s) step 2255/76294 | train loss 3.767056 | norm 0.2333 | lr 1.18e-03 | (3763.82 ms | 139297 tok/s) step 2256/76294 | train loss 3.731547 | norm 0.2174 | lr 1.18e-03 | (3830.08 ms | 136887 tok/s) step 2257/76294 | train loss 3.777705 | norm 0.2163 | lr 1.18e-03 | (3771.56 ms | 139011 tok/s) step 2258/76294 | train loss 3.829810 | norm 0.2115 | lr 1.18e-03 | (3776.78 ms | 138819 tok/s) step 2259/76294 | train loss 3.787204 | norm 0.1978 | lr 1.18e-03 | (3804.05 ms | 137824 tok/s) step 2260/76294 | train loss 3.823804 | norm 0.1987 | lr 1.18e-03 | (3780.43 ms | 138685 tok/s) step 2261/76294 | train loss 3.872124 | norm 0.2631 | lr 1.18e-03 | (3782.07 ms | 138625 tok/s) step 2262/76294 | train loss 3.788054 | norm 0.3395 | lr 1.18e-03 | (3816.84 ms | 137362 tok/s) step 2263/76294 | train loss 3.787549 | norm 0.2811 | lr 1.18e-03 | (3789.92 ms | 138338 tok/s) step 2264/76294 | train loss 3.843030 | norm 0.2668 | lr 1.18e-03 | (3791.59 ms | 138277 tok/s) step 2265/76294 | train loss 3.809692 | norm 0.2411 | lr 1.18e-03 | (3828.92 ms | 136928 tok/s) step 2266/76294 | train loss 3.775290 | norm 0.2382 | lr 1.18e-03 | (3792.59 ms | 138240 tok/s) step 2267/76294 | train loss 3.856947 | norm 0.2244 | lr 1.18e-03 | (3798.49 ms | 138026 tok/s) step 2268/76294 | train loss 3.818617 | norm 0.2097 | lr 1.18e-03 | (3902.00 ms | 134364 tok/s) step 2269/76294 | train loss 3.808627 | norm 0.1946 | lr 1.18e-03 | (3800.55 ms | 137950 tok/s) step 2270/76294 | train loss 3.829415 | norm 0.2090 | lr 1.18e-03 | (3916.53 ms | 133866 tok/s) step 2271/76294 | train loss 3.781771 | 
norm 0.2047 | lr 1.18e-03 | (3950.74 ms | 132706 tok/s) step 2272/76294 | train loss 3.837107 | norm 0.2318 | lr 1.18e-03 | (3803.49 ms | 137844 tok/s) step 2273/76294 | train loss 3.848502 | norm 0.2221 | lr 1.18e-03 | (3822.51 ms | 137158 tok/s) step 2274/76294 | train loss 3.707090 | norm 0.2280 | lr 1.18e-03 | (3803.83 ms | 137831 tok/s) step 2275/76294 | train loss 3.868584 | norm 0.2754 | lr 1.18e-03 | (3822.74 ms | 137150 tok/s) step 2276/76294 | train loss 3.818964 | norm 0.2710 | lr 1.18e-03 | (3827.85 ms | 136967 tok/s) step 2277/76294 | train loss 3.749637 | norm 0.2392 | lr 1.18e-03 | (3807.37 ms | 137703 tok/s) step 2278/76294 | train loss 3.792447 | norm 0.2231 | lr 1.18e-03 | (3811.51 ms | 137554 tok/s) step 2279/76294 | train loss 3.788675 | norm 0.2305 | lr 1.18e-03 | (3802.64 ms | 137875 tok/s) step 2280/76294 | train loss 3.853548 | norm 0.2659 | lr 1.18e-03 | (3852.31 ms | 136097 tok/s) step 2281/76294 | train loss 3.828154 | norm 0.2684 | lr 1.18e-03 | (3804.67 ms | 137801 tok/s) step 2282/76294 | train loss 3.822178 | norm 0.2016 | lr 1.18e-03 | (3854.76 ms | 136011 tok/s) step 2283/76294 | train loss 3.799693 | norm 0.2378 | lr 1.18e-03 | (3828.95 ms | 136927 tok/s) step 2284/76294 | train loss 3.842247 | norm 0.2101 | lr 1.18e-03 | (3816.86 ms | 137361 tok/s) step 2285/76294 | train loss 3.808029 | norm 0.1965 | lr 1.18e-03 | (3808.24 ms | 137672 tok/s) step 2286/76294 | train loss 3.892779 | norm 0.2339 | lr 1.18e-03 | (3916.09 ms | 133881 tok/s) step 2287/76294 | train loss 3.890371 | norm 0.2214 | lr 1.18e-03 | (3810.45 ms | 137592 tok/s) step 2288/76294 | train loss 3.764940 | norm 0.2921 | lr 1.18e-03 | (3853.18 ms | 136066 tok/s) step 2289/76294 | train loss 3.735865 | norm 0.2710 | lr 1.18e-03 | (4213.19 ms | 124440 tok/s) step 2290/76294 | train loss 3.758580 | norm 0.2367 | lr 1.18e-03 | (3851.75 ms | 136117 tok/s) step 2291/76294 | train loss 3.830699 | norm 0.2402 | lr 1.18e-03 | (3803.93 ms | 137828 tok/s) step 2292/76294 | train 
loss 3.755657 | norm 0.2314 | lr 1.18e-03 | (3830.23 ms | 136882 tok/s) step 2293/76294 | train loss 3.755280 | norm 0.1953 | lr 1.18e-03 | (3828.17 ms | 136955 tok/s) step 2294/76294 | train loss 3.861895 | norm 0.2265 | lr 1.18e-03 | (3834.25 ms | 136738 tok/s) step 2295/76294 | train loss 3.866604 | norm 0.2045 | lr 1.18e-03 | (3966.83 ms | 132168 tok/s) step 2296/76294 | train loss 3.797435 | norm 0.1925 | lr 1.18e-03 | (3813.63 ms | 137478 tok/s) step 2297/76294 | train loss 3.870638 | norm 0.2372 | lr 1.18e-03 | (3807.16 ms | 137711 tok/s) step 2298/76294 | train loss 3.784256 | norm 0.2344 | lr 1.18e-03 | (3855.42 ms | 135987 tok/s) step 2299/76294 | train loss 3.760217 | norm 0.2385 | lr 1.18e-03 | (3807.52 ms | 137698 tok/s) step 2300/76294 | train loss 3.883880 | norm 0.2647 | lr 1.18e-03 | (3814.12 ms | 137460 tok/s) step 2301/76294 | train loss 3.864906 | norm 0.2047 | lr 1.18e-03 | (4023.72 ms | 130299 tok/s) step 2302/76294 | train loss 3.835495 | norm 0.2241 | lr 1.18e-03 | (3818.27 ms | 137310 tok/s) step 2303/76294 | train loss 3.763788 | norm 0.2803 | lr 1.18e-03 | (3830.05 ms | 136888 tok/s) step 2304/76294 | train loss 3.794814 | norm 0.2736 | lr 1.18e-03 | (3808.39 ms | 137667 tok/s) step 2305/76294 | train loss 3.800753 | norm 0.2321 | lr 1.18e-03 | (3805.55 ms | 137769 tok/s) step 2306/76294 | train loss 3.778667 | norm 0.2771 | lr 1.18e-03 | (3836.39 ms | 136662 tok/s) step 2307/76294 | train loss 3.823604 | norm 0.2522 | lr 1.18e-03 | (3814.57 ms | 137444 tok/s) step 2308/76294 | train loss 3.776585 | norm 0.2263 | lr 1.18e-03 | (3812.74 ms | 137510 tok/s) step 2309/76294 | train loss 3.788845 | norm 0.3006 | lr 1.18e-03 | (3828.40 ms | 136947 tok/s) step 2310/76294 | train loss 3.776483 | norm 0.2631 | lr 1.18e-03 | (3815.52 ms | 137409 tok/s) step 2311/76294 | train loss 3.742524 | norm 0.2908 | lr 1.18e-03 | (3832.13 ms | 136814 tok/s) step 2312/76294 | train loss 3.754471 | norm 0.2935 | lr 1.18e-03 | (3809.90 ms | 137612 tok/s) step 
2313/76294 | train loss 3.792065 | norm 0.2327 | lr 1.18e-03 | (3839.06 ms | 136567 tok/s) step 2314/76294 | train loss 3.862291 | norm 0.2557 | lr 1.18e-03 | (3833.64 ms | 136760 tok/s) step 2315/76294 | train loss 3.817576 | norm 0.2106 | lr 1.18e-03 | (3808.89 ms | 137649 tok/s) step 2316/76294 | train loss 3.796595 | norm 0.2028 | lr 1.18e-03 | (3918.93 ms | 133783 tok/s) step 2317/76294 | train loss 3.818362 | norm 0.2040 | lr 1.18e-03 | (3837.22 ms | 136632 tok/s) step 2318/76294 | train loss 3.800520 | norm 0.2088 | lr 1.18e-03 | (4770.88 ms | 109893 tok/s) step 2319/76294 | train loss 3.783835 | norm 0.2101 | lr 1.18e-03 | (3822.95 ms | 137142 tok/s) step 2320/76294 | train loss 3.777975 | norm 0.1988 | lr 1.18e-03 | (3816.86 ms | 137361 tok/s) step 2321/76294 | train loss 3.810858 | norm 0.2290 | lr 1.18e-03 | (3835.28 ms | 136701 tok/s) step 2322/76294 | train loss 3.793130 | norm 0.2074 | lr 1.18e-03 | (3856.98 ms | 135932 tok/s) step 2323/76294 | train loss 3.791773 | norm 0.2127 | lr 1.18e-03 | (3811.45 ms | 137556 tok/s) step 2324/76294 | train loss 3.757770 | norm 0.2183 | lr 1.18e-03 | (3824.49 ms | 137087 tok/s) step 2325/76294 | train loss 3.823706 | norm 0.2372 | lr 1.18e-03 | (3815.56 ms | 137408 tok/s) step 2326/76294 | train loss 3.865640 | norm 0.2609 | lr 1.18e-03 | (3823.39 ms | 137126 tok/s) step 2327/76294 | train loss 3.965150 | norm 0.2644 | lr 1.18e-03 | (3825.25 ms | 137060 tok/s) step 2328/76294 | train loss 3.834403 | norm 0.1998 | lr 1.18e-03 | (3918.32 ms | 133804 tok/s) step 2329/76294 | train loss 3.740493 | norm 0.2439 | lr 1.18e-03 | (3809.99 ms | 137609 tok/s) step 2330/76294 | train loss 3.774248 | norm 0.2150 | lr 1.18e-03 | (3810.35 ms | 137596 tok/s) step 2331/76294 | train loss 3.760198 | norm 0.2130 | lr 1.18e-03 | (3830.94 ms | 136856 tok/s) step 2332/76294 | train loss 3.752291 | norm 0.1968 | lr 1.18e-03 | (3830.24 ms | 136881 tok/s) step 2333/76294 | train loss 3.775729 | norm 0.2047 | lr 1.18e-03 | (3835.10 ms | 
136708 tok/s) step 2334/76294 | train loss 3.870501 | norm 0.2044 | lr 1.18e-03 | (3809.61 ms | 137623 tok/s) step 2335/76294 | train loss 3.823581 | norm 0.2034 | lr 1.18e-03 | (3836.25 ms | 136667 tok/s) step 2336/76294 | train loss 3.763898 | norm 0.1985 | lr 1.18e-03 | (3803.81 ms | 137832 tok/s) step 2337/76294 | train loss 3.981163 | norm 0.3019 | lr 1.18e-03 | (3836.44 ms | 136660 tok/s) step 2338/76294 | train loss 3.824670 | norm 0.5211 | lr 1.18e-03 | (3824.30 ms | 137094 tok/s) step 2339/76294 | train loss 3.849592 | norm 0.4085 | lr 1.18e-03 | (3887.13 ms | 134878 tok/s) step 2340/76294 | train loss 3.785081 | norm 0.8184 | lr 1.18e-03 | (3840.69 ms | 136509 tok/s) step 2341/76294 | train loss 3.886049 | norm 0.8841 | lr 1.18e-03 | (3810.77 ms | 137581 tok/s) step 2342/76294 | train loss 3.836790 | norm 0.4229 | lr 1.18e-03 | (3834.19 ms | 136740 tok/s) step 2343/76294 | train loss 3.841805 | norm 0.6448 | lr 1.18e-03 | (3847.33 ms | 136273 tok/s) step 2344/76294 | train loss 3.887899 | norm 0.5909 | lr 1.18e-03 | (3809.32 ms | 137633 tok/s) step 2345/76294 | train loss 3.825361 | norm 0.3486 | lr 1.18e-03 | (3826.48 ms | 137016 tok/s) step 2346/76294 | train loss 3.832983 | norm 0.2603 | lr 1.18e-03 | (3817.47 ms | 137339 tok/s) step 2347/76294 | train loss 3.912894 | norm 0.2548 | lr 1.18e-03 | (3860.35 ms | 135814 tok/s) step 2348/76294 | train loss 3.801165 | norm 0.2451 | lr 1.18e-03 | (3809.54 ms | 137625 tok/s) step 2349/76294 | train loss 3.806019 | norm 0.2042 | lr 1.18e-03 | (3834.38 ms | 136733 tok/s) step 2350/76294 | train loss 3.804027 | norm 0.2324 | lr 1.18e-03 | (3824.55 ms | 137085 tok/s) step 2351/76294 | train loss 3.839930 | norm 0.2010 | lr 1.18e-03 | (3836.36 ms | 136663 tok/s) step 2352/76294 | train loss 3.811371 | norm 0.2495 | lr 1.18e-03 | (3805.55 ms | 137769 tok/s) step 2353/76294 | train loss 3.806954 | norm 0.2266 | lr 1.18e-03 | (3805.80 ms | 137760 tok/s) step 2354/76294 | train loss 3.820738 | norm 0.2127 | lr 1.18e-03 
| (3831.36 ms | 136841 tok/s) step 2355/76294 | train loss 3.836213 | norm 0.2249 | lr 1.18e-03 | (3803.14 ms | 137857 tok/s) step 2356/76294 | train loss 3.758573 | norm 0.2121 | lr 1.18e-03 | (3818.39 ms | 137306 tok/s) step 2357/76294 | train loss 3.735878 | norm 0.2088 | lr 1.18e-03 | (3839.68 ms | 136545 tok/s) step 2358/76294 | train loss 3.752906 | norm 0.2117 | lr 1.18e-03 | (3802.47 ms | 137881 tok/s) step 2359/76294 | train loss 3.761604 | norm 0.1965 | lr 1.18e-03 | (3814.60 ms | 137442 tok/s) step 2360/76294 | train loss 3.804860 | norm 0.2076 | lr 1.18e-03 | (3829.38 ms | 136912 tok/s) step 2361/76294 | train loss 3.938076 | norm 0.2326 | lr 1.18e-03 | (3800.89 ms | 137938 tok/s) step 2362/76294 | train loss 3.797717 | norm 0.1919 | lr 1.18e-03 | (3859.57 ms | 135841 tok/s) step 2363/76294 | train loss 3.771041 | norm 0.2124 | lr 1.18e-03 | (3861.56 ms | 135771 tok/s) step 2364/76294 | train loss 3.788822 | norm 0.2495 | lr 1.18e-03 | (3857.43 ms | 135916 tok/s) step 2365/76294 | train loss 3.801076 | norm 0.2560 | lr 1.18e-03 | (3801.50 ms | 137916 tok/s) step 2366/76294 | train loss 3.765694 | norm 0.2450 | lr 1.18e-03 | (3809.68 ms | 137620 tok/s) step 2367/76294 | train loss 3.843357 | norm 0.2311 | lr 1.18e-03 | (3857.44 ms | 135916 tok/s) step 2368/76294 | train loss 3.751296 | norm 0.2429 | lr 1.18e-03 | (3805.31 ms | 137778 tok/s) step 2369/76294 | train loss 3.855617 | norm 0.1973 | lr 1.18e-03 | (3809.06 ms | 137642 tok/s) step 2370/76294 | train loss 3.803872 | norm 0.2264 | lr 1.18e-03 | (3806.48 ms | 137735 tok/s) step 2371/76294 | train loss 3.823904 | norm 0.2023 | lr 1.18e-03 | (5039.71 ms | 104031 tok/s) step 2372/76294 | train loss 3.996103 | norm 0.2313 | lr 1.18e-03 | (4294.72 ms | 122077 tok/s) step 2373/76294 | train loss 3.786104 | norm 0.2007 | lr 1.18e-03 | (3826.28 ms | 137023 tok/s) step 2374/76294 | train loss 3.903665 | norm 0.2307 | lr 1.18e-03 | (3800.83 ms | 137941 tok/s) step 2375/76294 | train loss 3.829650 | norm 
0.2259 | lr 1.18e-03 | (3799.69 ms | 137982 tok/s) step 2376/76294 | train loss 3.790070 | norm 0.2010 | lr 1.18e-03 | (3810.33 ms | 137596 tok/s) step 2377/76294 | train loss 3.808435 | norm 0.1929 | lr 1.18e-03 | (3801.42 ms | 137919 tok/s) step 2378/76294 | train loss 3.861797 | norm 0.1979 | lr 1.18e-03 | (3800.59 ms | 137949 tok/s) step 2379/76294 | train loss 3.764619 | norm 0.1828 | lr 1.18e-03 | (3890.79 ms | 134751 tok/s) step 2380/76294 | train loss 3.760593 | norm 0.1921 | lr 1.18e-03 | (3799.65 ms | 137983 tok/s) step 2381/76294 | train loss 3.691760 | norm 0.2044 | lr 1.18e-03 | (3809.47 ms | 137627 tok/s) step 2382/76294 | train loss 3.753984 | norm 0.1960 | lr 1.18e-03 | (3824.76 ms | 137077 tok/s) step 2383/76294 | train loss 3.746951 | norm 0.2171 | lr 1.18e-03 | (3858.73 ms | 135871 tok/s) step 2384/76294 | train loss 3.830670 | norm 0.2022 | lr 1.18e-03 | (3802.30 ms | 137887 tok/s) step 2385/76294 | train loss 3.798848 | norm 0.2134 | lr 1.18e-03 | (3852.66 ms | 136085 tok/s) step 2386/76294 | train loss 3.798380 | norm 0.1899 | lr 1.18e-03 | (3802.80 ms | 137869 tok/s) step 2387/76294 | train loss 3.762088 | norm 0.2431 | lr 1.18e-03 | (3809.11 ms | 137641 tok/s) step 2388/76294 | train loss 3.793015 | norm 0.2250 | lr 1.18e-03 | (3828.99 ms | 136926 tok/s) step 2389/76294 | train loss 3.818465 | norm 0.2085 | lr 1.18e-03 | (3807.47 ms | 137700 tok/s) step 2390/76294 | train loss 3.776443 | norm 0.2061 | lr 1.18e-03 | (3806.04 ms | 137752 tok/s) step 2391/76294 | train loss 3.758892 | norm 0.2011 | lr 1.18e-03 | (3832.01 ms | 136818 tok/s) step 2392/76294 | train loss 3.798699 | norm 0.2158 | lr 1.18e-03 | (3804.34 ms | 137813 tok/s) step 2393/76294 | train loss 3.802784 | norm 0.2289 | lr 1.18e-03 | (3809.15 ms | 137639 tok/s) step 2394/76294 | train loss 3.775512 | norm 0.1892 | lr 1.18e-03 | (3822.57 ms | 137156 tok/s) step 2395/76294 | train loss 3.772958 | norm 0.2094 | lr 1.18e-03 | (3836.00 ms | 136676 tok/s) step 2396/76294 | train loss 
3.787867 | norm 0.2386 | lr 1.18e-03 | (3806.23 ms | 137745 tok/s) step 2397/76294 | train loss 3.830341 | norm 0.2404 | lr 1.18e-03 | (3872.69 ms | 135381 tok/s) step 2398/76294 | train loss 3.679437 | norm 0.2176 | lr 1.18e-03 | (3905.35 ms | 134249 tok/s) step 2399/76294 | train loss 3.786866 | norm 0.2373 | lr 1.18e-03 | (3829.63 ms | 136903 tok/s) step 2400/76294 | train loss 3.821519 | norm 0.2535 | lr 1.18e-03 | (3823.67 ms | 137117 tok/s) step 2401/76294 | train loss 3.760518 | norm 0.2609 | lr 1.18e-03 | (3803.01 ms | 137861 tok/s) step 2402/76294 | train loss 3.727636 | norm 0.2284 | lr 1.18e-03 | (3835.21 ms | 136704 tok/s) step 2403/76294 | train loss 3.761796 | norm 0.2090 | lr 1.18e-03 | (3876.35 ms | 135253 tok/s) step 2404/76294 | train loss 3.748950 | norm 0.2216 | lr 1.18e-03 | (3799.73 ms | 137980 tok/s) step 2405/76294 | train loss 3.781215 | norm 0.2192 | lr 1.18e-03 | (3813.56 ms | 137480 tok/s) step 2406/76294 | train loss 3.789858 | norm 0.2455 | lr 1.18e-03 | (3824.73 ms | 137078 tok/s) step 2407/76294 | train loss 3.792203 | norm 0.2383 | lr 1.18e-03 | (3835.88 ms | 136680 tok/s) step 2408/76294 | train loss 3.904265 | norm 0.2156 | lr 1.18e-03 | (3807.54 ms | 137697 tok/s) step 2409/76294 | train loss 3.785889 | norm 0.2502 | lr 1.18e-03 | (3828.30 ms | 136950 tok/s) step 2410/76294 | train loss 3.813371 | norm 0.2531 | lr 1.18e-03 | (3807.66 ms | 137693 tok/s) step 2411/76294 | train loss 3.764482 | norm 0.2468 | lr 1.18e-03 | (3927.38 ms | 133496 tok/s) step 2412/76294 | train loss 3.790665 | norm 0.2375 | lr 1.18e-03 | (3801.35 ms | 137922 tok/s) step 2413/76294 | train loss 3.737018 | norm 0.3370 | lr 1.18e-03 | (3835.68 ms | 136687 tok/s) step 2414/76294 | train loss 3.781098 | norm 0.3249 | lr 1.18e-03 | (3808.57 ms | 137660 tok/s) step 2415/76294 | train loss 3.776475 | norm 0.2589 | lr 1.18e-03 | (3833.36 ms | 136770 tok/s) step 2416/76294 | train loss 3.795134 | norm 0.2468 | lr 1.18e-03 | (3799.51 ms | 137988 tok/s) step 
2417/76294 | train loss 3.814184 | norm 0.2333 | lr 1.18e-03 | (3858.09 ms | 135893 tok/s) step 2418/76294 | train loss 3.782032 | norm 0.2161 | lr 1.18e-03 | (3803.56 ms | 137842 tok/s) step 2419/76294 | train loss 3.789926 | norm 0.2670 | lr 1.18e-03 | (3803.42 ms | 137846 tok/s) step 2420/76294 | train loss 3.821965 | norm 0.2599 | lr 1.18e-03 | (3820.66 ms | 137225 tok/s) step 2421/76294 | train loss 3.795406 | norm 0.2359 | lr 1.18e-03 | (3807.96 ms | 137682 tok/s) step 2422/76294 | train loss 3.811798 | norm 0.2394 | lr 1.18e-03 | (3814.49 ms | 137446 tok/s) step 2423/76294 | train loss 3.820280 | norm 0.2451 | lr 1.18e-03 | (3803.74 ms | 137835 tok/s) step 2424/76294 | train loss 3.800359 | norm 0.1947 | lr 1.18e-03 | (3804.51 ms | 137807 tok/s) step 2425/76294 | train loss 3.834652 | norm 0.2183 | lr 1.18e-03 | (3826.98 ms | 136998 tok/s) step 2426/76294 | train loss 3.661557 | norm 0.2034 | lr 1.18e-03 | (3803.55 ms | 137842 tok/s) step 2427/76294 | train loss 3.774072 | norm 0.2524 | lr 1.18e-03 | (3824.60 ms | 137083 tok/s) step 2428/76294 | train loss 3.822106 | norm 0.1972 | lr 1.18e-03 | (3827.20 ms | 136990 tok/s) step 2429/76294 | train loss 3.729669 | norm 0.2209 | lr 1.18e-03 | (3805.05 ms | 137788 tok/s) step 2430/76294 | train loss 3.794024 | norm 0.2198 | lr 1.18e-03 | (3868.45 ms | 135529 tok/s) step 2431/76294 | train loss 3.819634 | norm 0.2169 | lr 1.18e-03 | (3836.39 ms | 136662 tok/s) step 2432/76294 | train loss 3.713382 | norm 0.2288 | lr 1.18e-03 | (3814.68 ms | 137440 tok/s) step 2433/76294 | train loss 3.760031 | norm 0.2133 | lr 1.18e-03 | (3835.49 ms | 136694 tok/s) step 2434/76294 | train loss 3.831082 | norm 0.2176 | lr 1.18e-03 | (3801.58 ms | 137913 tok/s) step 2435/76294 | train loss 3.727151 | norm 0.2182 | lr 1.18e-03 | (3805.51 ms | 137771 tok/s) step 2436/76294 | train loss 3.743421 | norm 0.2109 | lr 1.18e-03 | (3830.98 ms | 136855 tok/s) step 2437/76294 | train loss 3.779965 | norm 0.1873 | lr 1.18e-03 | (3807.19 ms | 
137710 tok/s) step 2438/76294 | train loss 3.828388 | norm 0.2220 | lr 1.18e-03 | (3853.64 ms | 136050 tok/s) step 2439/76294 | train loss 3.788277 | norm 0.2020 | lr 1.18e-03 | (3801.74 ms | 137908 tok/s) step 2440/76294 | train loss 3.775033 | norm 0.2345 | lr 1.18e-03 | (3827.79 ms | 136969 tok/s) step 2441/76294 | train loss 3.795696 | norm 0.2456 | lr 1.18e-03 | (3952.21 ms | 132657 tok/s) step 2442/76294 | train loss 3.783851 | norm 0.3085 | lr 1.18e-03 | (3847.01 ms | 136285 tok/s) step 2443/76294 | train loss 3.777098 | norm 0.2903 | lr 1.18e-03 | (3838.22 ms | 136597 tok/s) step 2444/76294 | train loss 3.774969 | norm 0.3048 | lr 1.18e-03 | (3805.01 ms | 137789 tok/s) step 2445/76294 | train loss 3.745219 | norm 0.2790 | lr 1.18e-03 | (3893.88 ms | 134644 tok/s) step 2446/76294 | train loss 3.804311 | norm 0.2687 | lr 1.18e-03 | (3827.99 ms | 136962 tok/s) step 2447/76294 | train loss 3.768508 | norm 0.2549 | lr 1.18e-03 | (3827.66 ms | 136973 tok/s) step 2448/76294 | train loss 3.749460 | norm 0.2264 | lr 1.18e-03 | (3802.02 ms | 137897 tok/s) step 2449/76294 | train loss 3.797534 | norm 0.2129 | lr 1.18e-03 | (3804.87 ms | 137794 tok/s) step 2450/76294 | train loss 3.769860 | norm 0.1892 | lr 1.18e-03 | (3821.17 ms | 137206 tok/s) step 2451/76294 | train loss 3.797463 | norm 0.2179 | lr 1.18e-03 | (3803.45 ms | 137845 tok/s) step 2452/76294 | train loss 3.800702 | norm 0.2146 | lr 1.18e-03 | (3840.29 ms | 136523 tok/s) step 2453/76294 | train loss 3.804483 | norm 0.2261 | lr 1.18e-03 | (3802.53 ms | 137879 tok/s) step 2454/76294 | train loss 3.789988 | norm 0.2275 | lr 1.18e-03 | (4402.08 ms | 119100 tok/s) step 2455/76294 | train loss 3.783283 | norm 0.2404 | lr 1.18e-03 | (3819.35 ms | 137271 tok/s) step 2456/76294 | train loss 3.784209 | norm 0.2091 | lr 1.18e-03 | (3842.32 ms | 136451 tok/s) step 2457/76294 | train loss 3.676295 | norm 0.2297 | lr 1.18e-03 | (3852.64 ms | 136085 tok/s) step 2458/76294 | train loss 3.826086 | norm 0.2130 | lr 1.18e-03 
| (3823.42 ms | 137125 tok/s) step 2459/76294 | train loss 3.756247 | norm 0.2189 | lr 1.18e-03 | (3807.28 ms | 137707 tok/s) step 2460/76294 | train loss 3.735018 | norm 0.2567 | lr 1.18e-03 | (3827.67 ms | 136973 tok/s) step 2461/76294 | train loss 3.753889 | norm 0.3028 | lr 1.18e-03 | (3803.43 ms | 137846 tok/s) step 2462/76294 | train loss 3.714873 | norm 0.2453 | lr 1.18e-03 | (3895.38 ms | 134592 tok/s) step 2463/76294 | train loss 3.794545 | norm 0.2338 | lr 1.18e-03 | (3802.70 ms | 137873 tok/s) step 2464/76294 | train loss 3.767546 | norm 0.2796 | lr 1.18e-03 | (3825.95 ms | 137035 tok/s) step 2465/76294 | train loss 3.717902 | norm 0.2489 | lr 1.18e-03 | (3803.53 ms | 137842 tok/s) step 2466/76294 | train loss 3.763576 | norm 0.2354 | lr 1.18e-03 | (3836.05 ms | 136674 tok/s) step 2467/76294 | train loss 3.786481 | norm 0.2535 | lr 1.18e-03 | (3843.66 ms | 136403 tok/s) step 2468/76294 | train loss 3.756901 | norm 0.2431 | lr 1.18e-03 | (3929.13 ms | 133436 tok/s) step 2469/76294 | train loss 3.656842 | norm 0.2553 | lr 1.18e-03 | (3815.34 ms | 137416 tok/s) step 2470/76294 | train loss 3.767308 | norm 0.2222 | lr 1.18e-03 | (3806.71 ms | 137727 tok/s) step 2471/76294 | train loss 3.784499 | norm 0.2353 | lr 1.18e-03 | (3803.65 ms | 137838 tok/s) step 2472/76294 | train loss 3.770154 | norm 0.2432 | lr 1.18e-03 | (3834.59 ms | 136726 tok/s) step 2473/76294 | train loss 3.723708 | norm 0.2681 | lr 1.18e-03 | (3805.81 ms | 137760 tok/s) step 2474/76294 | train loss 3.712202 | norm 0.2136 | lr 1.18e-03 | (3806.73 ms | 137727 tok/s) step 2475/76294 | train loss 3.733768 | norm 0.1957 | lr 1.18e-03 | (3826.45 ms | 137017 tok/s) step 2476/76294 | train loss 3.746364 | norm 0.2193 | lr 1.18e-03 | (3809.00 ms | 137645 tok/s) step 2477/76294 | train loss 3.757278 | norm 0.2095 | lr 1.18e-03 | (3805.38 ms | 137775 tok/s) step 2478/76294 | train loss 3.747096 | norm 0.2423 | lr 1.18e-03 | (3837.58 ms | 136619 tok/s) step 2479/76294 | train loss 3.773107 | norm 
0.2330 | lr 1.18e-03 | (3807.33 ms | 137705 tok/s) step 2480/76294 | train loss 3.700474 | norm 0.2574 | lr 1.18e-03 | (4172.72 ms | 125647 tok/s) step 2481/76294 | train loss 3.795253 | norm 0.2366 | lr 1.18e-03 | (3985.86 ms | 131537 tok/s) step 2482/76294 | train loss 3.781361 | norm 0.2062 | lr 1.18e-03 | (3808.34 ms | 137668 tok/s) step 2483/76294 | train loss 3.921213 | norm 0.2002 | lr 1.18e-03 | (3861.99 ms | 135756 tok/s) step 2484/76294 | train loss 3.719634 | norm 0.2046 | lr 1.18e-03 | (3830.29 ms | 136879 tok/s) step 2485/76294 | train loss 3.806018 | norm 0.1916 | lr 1.18e-03 | (3816.57 ms | 137371 tok/s) step 2486/76294 | train loss 3.773408 | norm 0.1790 | lr 1.18e-03 | (3825.61 ms | 137047 tok/s) step 2487/76294 | train loss 3.734799 | norm 0.1869 | lr 1.18e-03 | (3946.26 ms | 132857 tok/s) step 2488/76294 | train loss 3.732522 | norm 0.2614 | lr 1.18e-03 | (3804.69 ms | 137801 tok/s) step 2489/76294 | train loss 3.759431 | norm 0.2299 | lr 1.18e-03 | (3813.25 ms | 137491 tok/s) step 2490/76294 | train loss 3.795110 | norm 0.2290 | lr 1.18e-03 | (3832.43 ms | 136803 tok/s) step 2491/76294 | train loss 3.778929 | norm 0.2347 | lr 1.18e-03 | (3809.56 ms | 137624 tok/s) step 2492/76294 | train loss 3.735335 | norm 0.2133 | lr 1.18e-03 | (3804.10 ms | 137822 tok/s) step 2493/76294 | train loss 3.718259 | norm 0.2117 | lr 1.18e-03 | (3834.13 ms | 136742 tok/s) step 2494/76294 | train loss 3.850048 | norm 0.1939 | lr 1.18e-03 | (3809.19 ms | 137638 tok/s) step 2495/76294 | train loss 3.770570 | norm 0.3183 | lr 1.18e-03 | (3928.31 ms | 133464 tok/s) step 2496/76294 | train loss 3.737877 | norm 0.2298 | lr 1.18e-03 | (3805.28 ms | 137779 tok/s) step 2497/76294 | train loss 3.755340 | norm 0.1927 | lr 1.18e-03 | (3810.49 ms | 137591 tok/s) step 2498/76294 | train loss 3.773581 | norm 0.1959 | lr 1.18e-03 | (3917.41 ms | 133835 tok/s) step 2499/76294 | train loss 3.759531 | norm 0.6944 | lr 1.18e-03 | (3827.30 ms | 136986 tok/s) step 2500/76294 | train loss 
3.783510 | norm 0.1815 | lr 1.18e-03 | (3828.10 ms | 136958 tok/s) val loss: 3.750427 saving model checkpoint to ./results/gpt2-124M-gqa/step_2500.pth step 2501/76294 | train loss 3.771616 | norm 0.2000 | lr 1.18e-03 | (4216.81 ms | 124333 tok/s) step 2502/76294 | train loss 3.788606 | norm 0.2568 | lr 1.18e-03 | (3813.79 ms | 137472 tok/s) step 2503/76294 | train loss 3.759341 | norm 0.3244 | lr 1.18e-03 | (3745.68 ms | 139971 tok/s) step 2504/76294 | train loss 3.795068 | norm 0.3963 | lr 1.18e-03 | (3883.26 ms | 135012 tok/s) step 2505/76294 | train loss 3.781796 | norm 0.3433 | lr 1.18e-03 | (3782.44 ms | 138611 tok/s) step 2506/76294 | train loss 3.826833 | norm 0.2675 | lr 1.18e-03 | (4007.02 ms | 130842 tok/s) step 2507/76294 | train loss 3.821491 | norm 0.2535 | lr 1.18e-03 | (3777.32 ms | 138799 tok/s) step 2508/76294 | train loss 3.761728 | norm 0.2330 | lr 1.18e-03 | (3774.02 ms | 138920 tok/s) step 2509/76294 | train loss 3.733860 | norm 0.2742 | lr 1.18e-03 | (3774.59 ms | 138899 tok/s) step 2510/76294 | train loss 3.745993 | norm 0.2139 | lr 1.18e-03 | (3807.93 ms | 137683 tok/s) step 2511/76294 | train loss 3.758991 | norm 0.2174 | lr 1.18e-03 | (3773.44 ms | 138942 tok/s) step 2512/76294 | train loss 3.764672 | norm 0.2190 | lr 1.18e-03 | (3915.48 ms | 133901 tok/s) step 2513/76294 | train loss 3.758188 | norm 0.2550 | lr 1.18e-03 | (3948.55 ms | 132780 tok/s) step 2514/76294 | train loss 3.768778 | norm 0.3227 | lr 1.18e-03 | (3907.68 ms | 134169 tok/s) step 2515/76294 | train loss 3.762488 | norm 0.3736 | lr 1.18e-03 | (3847.26 ms | 136276 tok/s) step 2516/76294 | train loss 3.798252 | norm 0.3112 | lr 1.18e-03 | (3817.56 ms | 137336 tok/s) step 2517/76294 | train loss 3.814573 | norm 0.2302 | lr 1.18e-03 | (3787.83 ms | 138414 tok/s) step 2518/76294 | train loss 3.717348 | norm 0.2548 | lr 1.18e-03 | (3791.32 ms | 138286 tok/s) step 2519/76294 | train loss 3.748128 | norm 0.2859 | lr 1.18e-03 | (3820.14 ms | 137243 tok/s) step 2520/76294 | train 
loss 3.757748 | norm 0.2284 | lr 1.18e-03 | (3834.98 ms | 136712 tok/s) step 2521/76294 | train loss 3.764560 | norm 0.2233 | lr 1.18e-03 | (3822.69 ms | 137152 tok/s) step 2522/76294 | train loss 3.761795 | norm 0.2335 | lr 1.18e-03 | (4031.30 ms | 130054 tok/s) step 2523/76294 | train loss 3.770927 | norm 0.2458 | lr 1.18e-03 | (3796.47 ms | 138099 tok/s) step 2524/76294 | train loss 3.805067 | norm 0.2357 | lr 1.18e-03 | (3839.32 ms | 136557 tok/s) step 2525/76294 | train loss 3.703940 | norm 0.2289 | lr 1.18e-03 | (3796.07 ms | 138113 tok/s) step 2526/76294 | train loss 3.796694 | norm 0.2007 | lr 1.18e-03 | (3800.04 ms | 137969 tok/s) step 2527/76294 | train loss 3.834929 | norm 0.2274 | lr 1.18e-03 | (3896.40 ms | 134557 tok/s) step 2528/76294 | train loss 3.751225 | norm 0.2243 | lr 1.18e-03 | (3798.97 ms | 138008 tok/s) step 2529/76294 | train loss 3.808358 | norm 0.2277 | lr 1.18e-03 | (3856.81 ms | 135938 tok/s) step 2530/76294 | train loss 3.742116 | norm 0.1966 | lr 1.18e-03 | (3802.98 ms | 137863 tok/s) step 2531/76294 | train loss 3.764551 | norm 0.2036 | lr 1.18e-03 | (3879.98 ms | 135126 tok/s) step 2532/76294 | train loss 3.737432 | norm 0.2011 | lr 1.18e-03 | (3868.83 ms | 135516 tok/s) step 2533/76294 | train loss 3.782471 | norm 0.2039 | lr 1.18e-03 | (3799.64 ms | 137984 tok/s) step 2534/76294 | train loss 3.770843 | norm 0.2255 | lr 1.18e-03 | (3824.70 ms | 137080 tok/s) step 2535/76294 | train loss 3.741483 | norm 0.2162 | lr 1.18e-03 | (3803.71 ms | 137836 tok/s) step 2536/76294 | train loss 3.756745 | norm 0.2161 | lr 1.18e-03 | (3825.13 ms | 137064 tok/s) step 2537/76294 | train loss 3.723372 | norm 0.1827 | lr 1.18e-03 | (3800.94 ms | 137936 tok/s) step 2538/76294 | train loss 3.718822 | norm 0.1990 | lr 1.18e-03 | (3833.84 ms | 136753 tok/s) step 2539/76294 | train loss 3.828701 | norm 0.2012 | lr 1.18e-03 | (3805.71 ms | 137763 tok/s) step 2540/76294 | train loss 3.796787 | norm 0.1861 | lr 1.18e-03 | (3906.70 ms | 134202 tok/s) step 
2541/76294 | train loss 3.745435 | norm 0.2343 | lr 1.18e-03 | (3804.52 ms | 137807 tok/s) step 2542/76294 | train loss 3.771292 | norm 0.2347 | lr 1.18e-03 | (3859.83 ms | 135832 tok/s) step 2543/76294 | train loss 3.779193 | norm 0.1940 | lr 1.18e-03 | (3806.27 ms | 137743 tok/s) step 2544/76294 | train loss 3.763454 | norm 0.1908 | lr 1.18e-03 | (3951.06 ms | 132695 tok/s) step 2545/76294 | train loss 3.681332 | norm 0.2105 | lr 1.18e-03 | (3834.40 ms | 136733 tok/s) step 2546/76294 | train loss 3.726139 | norm 0.2279 | lr 1.18e-03 | (3918.93 ms | 133783 tok/s) step 2547/76294 | train loss 3.766773 | norm 0.2466 | lr 1.18e-03 | (3847.69 ms | 136260 tok/s) step 2548/76294 | train loss 3.710739 | norm 0.2771 | lr 1.18e-03 | (3832.71 ms | 136793 tok/s) step 2549/76294 | train loss 3.803422 | norm 0.2713 | lr 1.17e-03 | (3827.05 ms | 136995 tok/s) step 2550/76294 | train loss 3.781037 | norm 0.2773 | lr 1.17e-03 | (3812.80 ms | 137507 tok/s) step 2551/76294 | train loss 3.779491 | norm 0.2231 | lr 1.17e-03 | (3831.43 ms | 136839 tok/s) step 2552/76294 | train loss 3.737078 | norm 0.2234 | lr 1.17e-03 | (3807.75 ms | 137690 tok/s) step 2553/76294 | train loss 3.805502 | norm 0.1966 | lr 1.17e-03 | (3855.45 ms | 135986 tok/s) step 2554/76294 | train loss 3.711640 | norm 0.2574 | lr 1.17e-03 | (3806.96 ms | 137718 tok/s) step 2555/76294 | train loss 3.751678 | norm 0.2305 | lr 1.17e-03 | (3813.62 ms | 137478 tok/s) step 2556/76294 | train loss 3.765383 | norm 0.2095 | lr 1.17e-03 | (3834.92 ms | 136714 tok/s) step 2557/76294 | train loss 3.775422 | norm 0.2257 | lr 1.17e-03 | (4083.54 ms | 128391 tok/s) step 2558/76294 | train loss 3.707182 | norm 0.2171 | lr 1.17e-03 | (3835.66 ms | 136688 tok/s) step 2559/76294 | train loss 3.783918 | norm 0.2304 | lr 1.17e-03 | (3808.18 ms | 137674 tok/s) step 2560/76294 | train loss 3.723776 | norm 0.2274 | lr 1.17e-03 | (3829.58 ms | 136905 tok/s) step 2561/76294 | train loss 3.707542 | norm 0.2544 | lr 1.17e-03 | (3806.80 ms | 
137724 tok/s) step 2562/76294 | train loss 3.753221 | norm 0.3448 | lr 1.17e-03 | (3861.96 ms | 135757 tok/s) step 2563/76294 | train loss 3.767500 | norm 0.2943 | lr 1.17e-03 | (3815.04 ms | 137427 tok/s) step 2564/76294 | train loss 3.726095 | norm 0.2259 | lr 1.17e-03 | (3814.83 ms | 137434 tok/s) step 2565/76294 | train loss 3.834288 | norm 0.2801 | lr 1.17e-03 | (3834.58 ms | 136726 tok/s) step 2566/76294 | train loss 3.781874 | norm 0.2570 | lr 1.17e-03 | (3916.56 ms | 133864 tok/s) step 2567/76294 | train loss 3.724696 | norm 0.2468 | lr 1.17e-03 | (3812.68 ms | 137512 tok/s) step 2568/76294 | train loss 3.838385 | norm 0.2583 | lr 1.17e-03 | (3859.49 ms | 135844 tok/s) step 2569/76294 | train loss 3.788535 | norm 0.2454 | lr 1.17e-03 | (3918.82 ms | 133787 tok/s) step 2570/76294 | train loss 3.903458 | norm 0.2344 | lr 1.17e-03 | (3827.98 ms | 136962 tok/s) step 2571/76294 | train loss 3.828698 | norm 0.2348 | lr 1.17e-03 | (4270.77 ms | 122762 tok/s) step 2572/76294 | train loss 3.724704 | norm 0.2506 | lr 1.17e-03 | (3840.95 ms | 136499 tok/s) step 2573/76294 | train loss 3.747939 | norm 0.2510 | lr 1.17e-03 | (3823.65 ms | 137117 tok/s) step 2574/76294 | train loss 3.778720 | norm 0.2381 | lr 1.17e-03 | (3810.67 ms | 137584 tok/s) step 2575/76294 | train loss 3.778796 | norm 0.2360 | lr 1.17e-03 | (3842.88 ms | 136431 tok/s) step 2576/76294 | train loss 3.789695 | norm 0.2651 | lr 1.17e-03 | (3812.39 ms | 137522 tok/s) step 2577/76294 | train loss 3.781499 | norm 0.2515 | lr 1.17e-03 | (3816.20 ms | 137385 tok/s) step 2578/76294 | train loss 3.875936 | norm 0.2490 | lr 1.17e-03 | (3836.88 ms | 136644 tok/s) step 2579/76294 | train loss 3.735679 | norm 0.2688 | lr 1.17e-03 | (3819.33 ms | 137272 tok/s) step 2580/76294 | train loss 3.777251 | norm 0.2236 | lr 1.17e-03 | (3818.38 ms | 137306 tok/s) step 2581/76294 | train loss 3.748753 | norm 0.2282 | lr 1.17e-03 | (3820.58 ms | 137227 tok/s) step 2582/76294 | train loss 3.762034 | norm 0.2212 | lr 1.17e-03 
| (3809.29 ms | 137634 tok/s) step 2583/76294 | train loss 3.777739 | norm 0.1943 | lr 1.17e-03 | (3920.17 ms | 133741 tok/s) step 2584/76294 | train loss 3.743990 | norm 0.1977 | lr 1.17e-03 | (3813.83 ms | 137470 tok/s) step 2585/76294 | train loss 3.749320 | norm 0.2119 | lr 1.17e-03 | (3865.09 ms | 135647 tok/s) step 2586/76294 | train loss 3.733414 | norm 0.2142 | lr 1.17e-03 | (3914.90 ms | 133921 tok/s) step 2587/76294 | train loss 3.849869 | norm 0.2064 | lr 1.17e-03 | (3912.14 ms | 134016 tok/s) step 2588/76294 | train loss 3.677850 | norm 0.1961 | lr 1.17e-03 | (4073.68 ms | 128701 tok/s) step 2589/76294 | train loss 3.747393 | norm 0.1872 | lr 1.17e-03 | (4285.94 ms | 122327 tok/s) step 2590/76294 | train loss 3.699230 | norm 0.1901 | lr 1.17e-03 | (3814.23 ms | 137456 tok/s) step 2591/76294 | train loss 3.788575 | norm 0.2001 | lr 1.17e-03 | (3834.58 ms | 136726 tok/s) step 2592/76294 | train loss 3.713164 | norm 0.1785 | lr 1.17e-03 | (3917.69 ms | 133826 tok/s) step 2593/76294 | train loss 3.770464 | norm 0.1795 | lr 1.17e-03 | (3952.95 ms | 132632 tok/s) step 2594/76294 | train loss 3.802529 | norm 0.1961 | lr 1.17e-03 | (3904.95 ms | 134262 tok/s) step 2595/76294 | train loss 3.769745 | norm 0.2549 | lr 1.17e-03 | (3781.75 ms | 138636 tok/s) step 2596/76294 | train loss 3.793537 | norm 0.3290 | lr 1.17e-03 | (3814.24 ms | 137455 tok/s) step 2597/76294 | train loss 3.747530 | norm 0.3263 | lr 1.17e-03 | (3786.40 ms | 138466 tok/s) step 2598/76294 | train loss 3.747903 | norm 0.2514 | lr 1.17e-03 | (3823.50 ms | 137123 tok/s) step 2599/76294 | train loss 3.760464 | norm 0.2298 | lr 1.17e-03 | (3794.75 ms | 138161 tok/s) step 2600/76294 | train loss 3.768177 | norm 0.1947 | lr 1.17e-03 | (3848.27 ms | 136240 tok/s) step 2601/76294 | train loss 3.704861 | norm 0.2822 | lr 1.17e-03 | (3795.11 ms | 138148 tok/s) step 2602/76294 | train loss 3.728470 | norm 0.2738 | lr 1.17e-03 | (3826.72 ms | 137007 tok/s) step 2603/76294 | train loss 3.756933 | norm 
0.3112 | lr 1.17e-03 | (3797.40 ms | 138065 tok/s) step 2604/76294 | train loss 3.673610 | norm 0.2468 | lr 1.17e-03 | (3824.06 ms | 137103 tok/s) step 2605/76294 | train loss 3.806798 | norm 0.2182 | lr 1.17e-03 | (4309.04 ms | 121672 tok/s) step 2606/76294 | train loss 3.781680 | norm 0.2604 | lr 1.17e-03 | (4808.82 ms | 109026 tok/s) step 2607/76294 | train loss 3.750947 | norm 0.2725 | lr 1.17e-03 | (4205.57 ms | 124665 tok/s) step 2608/76294 | train loss 3.730700 | norm 0.2754 | lr 1.17e-03 | (3823.25 ms | 137131 tok/s) step 2609/76294 | train loss 3.821984 | norm 0.2504 | lr 1.17e-03 | (3998.46 ms | 131122 tok/s) step 2610/76294 | train loss 3.729636 | norm 0.3441 | lr 1.17e-03 | (3871.77 ms | 135413 tok/s) step 2611/76294 | train loss 3.736360 | norm 0.2770 | lr 1.17e-03 | (3801.15 ms | 137929 tok/s) step 2612/76294 | train loss 3.728862 | norm 0.2334 | lr 1.17e-03 | (3823.15 ms | 137135 tok/s) step 2613/76294 | train loss 3.762311 | norm 0.2552 | lr 1.17e-03 | (3828.81 ms | 136933 tok/s) step 2614/76294 | train loss 3.741220 | norm 0.2384 | lr 1.17e-03 | (3807.29 ms | 137706 tok/s) step 2615/76294 | train loss 3.771865 | norm 0.2200 | lr 1.17e-03 | (3800.53 ms | 137951 tok/s) step 2616/76294 | train loss 3.725413 | norm 0.2198 | lr 1.17e-03 | (3839.03 ms | 136568 tok/s) step 2617/76294 | train loss 3.798722 | norm 0.2211 | lr 1.17e-03 | (3847.04 ms | 136283 tok/s) step 2618/76294 | train loss 3.738043 | norm 0.2293 | lr 1.17e-03 | (3825.00 ms | 137069 tok/s) step 2619/76294 | train loss 3.763432 | norm 0.2137 | lr 1.17e-03 | (3835.94 ms | 136678 tok/s) step 2620/76294 | train loss 3.813725 | norm 0.1897 | lr 1.17e-03 | (3814.70 ms | 137439 tok/s) step 2621/76294 | train loss 3.724750 | norm 0.2566 | lr 1.17e-03 | (3812.76 ms | 137509 tok/s) step 2622/76294 | train loss 3.692322 | norm 0.2519 | lr 1.17e-03 | (3840.89 ms | 136502 tok/s) step 2623/76294 | train loss 3.718796 | norm 0.2408 | lr 1.17e-03 | (3815.68 ms | 137404 tok/s) step 2624/76294 | train loss 
3.706007 | norm 0.2779 | lr 1.17e-03 | (3927.82 ms | 133481 tok/s) step 2625/76294 | train loss 3.773642 | norm 0.2283 | lr 1.17e-03 | (3810.86 ms | 137577 tok/s) step 2626/76294 | train loss 3.720982 | norm 0.2234 | lr 1.17e-03 | (3855.36 ms | 135989 tok/s) step 2627/76294 | train loss 3.927141 | norm 0.2350 | lr 1.17e-03 | (3866.41 ms | 135601 tok/s) step 2628/76294 | train loss 3.723557 | norm 0.2229 | lr 1.17e-03 | (3814.18 ms | 137457 tok/s) step 2629/76294 | train loss 3.758745 | norm 0.2182 | lr 1.17e-03 | (3841.22 ms | 136490 tok/s) step 2630/76294 | train loss 3.735137 | norm 0.2301 | lr 1.17e-03 | (3821.00 ms | 137212 tok/s) step 2631/76294 | train loss 3.707880 | norm 0.2159 | lr 1.17e-03 | (3837.72 ms | 136615 tok/s) step 2632/76294 | train loss 3.668932 | norm 0.1952 | lr 1.17e-03 | (3987.34 ms | 131488 tok/s) step 2633/76294 | train loss 3.734625 | norm 0.1935 | lr 1.17e-03 | (3841.93 ms | 136465 tok/s) step 2634/76294 | train loss 3.762706 | norm 0.2044 | lr 1.17e-03 | (3823.10 ms | 137137 tok/s) step 2635/76294 | train loss 3.764181 | norm 0.2368 | lr 1.17e-03 | (3846.01 ms | 136320 tok/s) step 2636/76294 | train loss 3.816489 | norm 0.2145 | lr 1.17e-03 | (3815.82 ms | 137399 tok/s) step 2637/76294 | train loss 3.745516 | norm 0.2583 | lr 1.17e-03 | (3896.27 ms | 134561 tok/s) step 2638/76294 | train loss 3.761522 | norm 0.1986 | lr 1.17e-03 | (3813.92 ms | 137467 tok/s) step 2639/76294 | train loss 3.729027 | norm 0.2333 | lr 1.17e-03 | (3817.12 ms | 137352 tok/s) step 2640/76294 | train loss 3.728663 | norm 0.2086 | lr 1.17e-03 | (3835.20 ms | 136704 tok/s) step 2641/76294 | train loss 3.770585 | norm 0.2184 | lr 1.17e-03 | (3821.40 ms | 137198 tok/s) step 2642/76294 | train loss 3.743721 | norm 0.2104 | lr 1.17e-03 | (3894.16 ms | 134634 tok/s) step 2643/76294 | train loss 3.725951 | norm 0.2157 | lr 1.17e-03 | (3817.28 ms | 137346 tok/s) step 2644/76294 | train loss 3.740389 | norm 0.2305 | lr 1.17e-03 | (3840.37 ms | 136520 tok/s) step 
2645/76294 | train loss 3.748272 | norm 0.2534 | lr 1.17e-03 | (3814.91 ms | 137431 tok/s) step 2646/76294 | train loss 3.736817 | norm 0.3869 | lr 1.17e-03 | (3840.40 ms | 136519 tok/s) step 2647/76294 | train loss 3.802260 | norm 0.3736 | lr 1.17e-03 | (3812.07 ms | 137534 tok/s) step 2648/76294 | train loss 3.742586 | norm 0.2606 | lr 1.17e-03 | (3862.59 ms | 135735 tok/s) step 2649/76294 | train loss 3.767014 | norm 0.2278 | lr 1.17e-03 | (3811.38 ms | 137559 tok/s) step 2650/76294 | train loss 3.779515 | norm 0.2259 | lr 1.17e-03 | (3835.51 ms | 136693 tok/s) step 2651/76294 | train loss 3.725314 | norm 0.2403 | lr 1.17e-03 | (3807.97 ms | 137682 tok/s) step 2652/76294 | train loss 3.749289 | norm 0.2206 | lr 1.17e-03 | (3861.24 ms | 135782 tok/s) step 2653/76294 | train loss 3.731980 | norm 0.1979 | lr 1.17e-03 | (3813.76 ms | 137473 tok/s) step 2654/76294 | train loss 3.771991 | norm 0.2031 | lr 1.17e-03 | (3816.36 ms | 137379 tok/s) step 2655/76294 | train loss 3.736649 | norm 0.2112 | lr 1.17e-03 | (3833.17 ms | 136777 tok/s) step 2656/76294 | train loss 3.829392 | norm 0.2040 | lr 1.17e-03 | (3810.16 ms | 137603 tok/s) step 2657/76294 | train loss 3.857275 | norm 0.2523 | lr 1.17e-03 | (3806.98 ms | 137718 tok/s) step 2658/76294 | train loss 3.829485 | norm 0.2040 | lr 1.17e-03 | (3873.20 ms | 135363 tok/s) step 2659/76294 | train loss 3.760656 | norm 0.1935 | lr 1.17e-03 | (3808.85 ms | 137650 tok/s) step 2660/76294 | train loss 3.768560 | norm 0.2223 | lr 1.17e-03 | (4120.80 ms | 127230 tok/s) step 2661/76294 | train loss 3.759239 | norm 0.2312 | lr 1.17e-03 | (3803.19 ms | 137855 tok/s) step 2662/76294 | train loss 3.746945 | norm 0.2668 | lr 1.17e-03 | (3889.90 ms | 134782 tok/s) step 2663/76294 | train loss 3.750408 | norm 0.2393 | lr 1.17e-03 | (3805.98 ms | 137754 tok/s) step 2664/76294 | train loss 3.776317 | norm 0.2105 | lr 1.17e-03 | (3857.13 ms | 135927 tok/s) step 2665/76294 | train loss 3.723081 | norm 0.2570 | lr 1.17e-03 | (3800.60 ms | 
137949 tok/s) step 2666/76294 | train loss 3.769130 | norm 0.2454 | lr 1.17e-03 | (3807.62 ms | 137694 tok/s) step 2667/76294 | train loss 3.804758 | norm 0.1938 | lr 1.17e-03 | (3833.90 ms | 136751 tok/s) step 2668/76294 | train loss 3.832542 | norm 0.2199 | lr 1.17e-03 | (3808.00 ms | 137681 tok/s) step 2669/76294 | train loss 3.783982 | norm 0.2596 | lr 1.17e-03 | (3836.18 ms | 136669 tok/s) step 2670/76294 | train loss 3.742905 | norm 0.2329 | lr 1.17e-03 | (3810.15 ms | 137603 tok/s) step 2671/76294 | train loss 3.744265 | norm 0.2505 | lr 1.17e-03 | (4473.91 ms | 117188 tok/s) step 2672/76294 | train loss 3.713442 | norm 0.2759 | lr 1.17e-03 | (3799.60 ms | 137985 tok/s) step 2673/76294 | train loss 3.773629 | norm 0.2583 | lr 1.17e-03 | (3810.75 ms | 137581 tok/s) step 2674/76294 | train loss 3.721325 | norm 0.2262 | lr 1.17e-03 | (4122.64 ms | 127173 tok/s) step 2675/76294 | train loss 3.885960 | norm 0.2426 | lr 1.17e-03 | (3838.41 ms | 136590 tok/s) step 2676/76294 | train loss 3.747628 | norm 0.2532 | lr 1.17e-03 | (3823.10 ms | 137137 tok/s) step 2677/76294 | train loss 3.789368 | norm 0.2423 | lr 1.17e-03 | (3940.21 ms | 133061 tok/s) step 2678/76294 | train loss 3.789744 | norm 0.2316 | lr 1.17e-03 | (3846.65 ms | 136297 tok/s) step 2679/76294 | train loss 3.780761 | norm 0.2256 | lr 1.17e-03 | (3801.16 ms | 137928 tok/s) step 2680/76294 | train loss 3.704010 | norm 0.2242 | lr 1.17e-03 | (3837.33 ms | 136628 tok/s) step 2681/76294 | train loss 3.751676 | norm 0.2000 | lr 1.17e-03 | (3801.20 ms | 137927 tok/s) step 2682/76294 | train loss 3.789359 | norm 0.2337 | lr 1.17e-03 | (3941.85 ms | 133005 tok/s) step 2683/76294 | train loss 3.778521 | norm 0.2185 | lr 1.17e-03 | (3797.07 ms | 138077 tok/s) step 2684/76294 | train loss 3.717012 | norm 0.2164 | lr 1.17e-03 | (3825.22 ms | 137061 tok/s) step 2685/76294 | train loss 3.757261 | norm 0.2456 | lr 1.17e-03 | (3800.51 ms | 137952 tok/s) step 2686/76294 | train loss 3.730971 | norm 0.2225 | lr 1.17e-03 
| (3851.66 ms | 136120 tok/s) step 2687/76294 | train loss 3.828881 | norm 0.2071 | lr 1.17e-03 | (3798.22 ms | 138035 tok/s) step 2688/76294 | train loss 3.755176 | norm 0.2213 | lr 1.17e-03 | (4201.93 ms | 124773 tok/s) step 2689/76294 | train loss 3.763538 | norm 0.2290 | lr 1.17e-03 | (3795.09 ms | 138149 tok/s) step 2690/76294 | train loss 3.718034 | norm 0.2035 | lr 1.17e-03 | (3881.92 ms | 135059 tok/s) step 2691/76294 | train loss 3.714783 | norm 0.2249 | lr 1.17e-03 | (3799.46 ms | 137990 tok/s) step 2692/76294 | train loss 3.782967 | norm 0.2992 | lr 1.17e-03 | (3918.90 ms | 133784 tok/s) step 2693/76294 | train loss 3.757077 | norm 0.3889 | lr 1.17e-03 | (3797.92 ms | 138046 tok/s) step 2694/76294 | train loss 3.793057 | norm 0.3004 | lr 1.17e-03 | (3878.14 ms | 135191 tok/s) step 2695/76294 | train loss 3.760207 | norm 0.2168 | lr 1.17e-03 | (3799.11 ms | 138003 tok/s) step 2696/76294 | train loss 3.764899 | norm 0.2852 | lr 1.17e-03 | (3849.72 ms | 136189 tok/s) step 2697/76294 | train loss 3.805774 | norm 0.2281 | lr 1.17e-03 | (3815.61 ms | 137406 tok/s) step 2698/76294 | train loss 3.776649 | norm 0.2248 | lr 1.17e-03 | (3838.71 ms | 136579 tok/s) step 2699/76294 | train loss 3.750133 | norm 0.2590 | lr 1.17e-03 | (3839.99 ms | 136534 tok/s) step 2700/76294 | train loss 3.730345 | norm 0.1983 | lr 1.17e-03 | (4075.04 ms | 128658 tok/s) step 2701/76294 | train loss 3.723236 | norm 0.2360 | lr 1.17e-03 | (3827.92 ms | 136964 tok/s) step 2702/76294 | train loss 3.713687 | norm 0.2485 | lr 1.17e-03 | (3808.86 ms | 137649 tok/s) step 2703/76294 | train loss 3.760399 | norm 0.2460 | lr 1.17e-03 | (3839.47 ms | 136552 tok/s) step 2704/76294 | train loss 3.778941 | norm 0.2267 | lr 1.17e-03 | (3805.52 ms | 137771 tok/s) step 2705/76294 | train loss 3.782284 | norm 0.2124 | lr 1.17e-03 | (3825.21 ms | 137061 tok/s) step 2706/76294 | train loss 3.762281 | norm 0.2339 | lr 1.17e-03 | (3811.45 ms | 137556 tok/s) step 2707/76294 | train loss 3.797905 | norm 
0.2218 | lr 1.17e-03 | (3799.30 ms | 137996 tok/s) step 2708/76294 | train loss 3.841556 | norm 0.2170 | lr 1.17e-03 | (3832.38 ms | 136805 tok/s) step 2709/76294 | train loss 3.713512 | norm 0.2221 | lr 1.17e-03 | (3910.63 ms | 134067 tok/s) step 2710/76294 | train loss 3.754480 | norm 0.2219 | lr 1.17e-03 | (3827.90 ms | 136965 tok/s) step 2711/76294 | train loss 3.778017 | norm 0.2176 | lr 1.17e-03 | (3801.52 ms | 137915 tok/s) step 2712/76294 | train loss 3.757580 | norm 0.2303 | lr 1.17e-03 | (3832.12 ms | 136814 tok/s) step 2713/76294 | train loss 3.694983 | norm 0.1918 | lr 1.17e-03 | (3807.35 ms | 137704 tok/s) step 2714/76294 | train loss 3.729616 | norm 0.2210 | lr 1.17e-03 | (3897.97 ms | 134503 tok/s) step 2715/76294 | train loss 3.803427 | norm 0.2207 | lr 1.17e-03 | (3804.75 ms | 137798 tok/s) step 2716/76294 | train loss 3.769233 | norm 0.2298 | lr 1.17e-03 | (3850.31 ms | 136168 tok/s) step 2717/76294 | train loss 3.725131 | norm 0.2400 | lr 1.17e-03 | (3805.59 ms | 137768 tok/s) step 2718/76294 | train loss 3.736478 | norm 0.2400 | lr 1.17e-03 | (3867.41 ms | 135566 tok/s) step 2719/76294 | train loss 3.702308 | norm 0.2112 | lr 1.17e-03 | (3799.93 ms | 137973 tok/s) step 2720/76294 | train loss 3.794706 | norm 0.2128 | lr 1.17e-03 | (3856.28 ms | 135957 tok/s) step 2721/76294 | train loss 3.744792 | norm 0.2220 | lr 1.17e-03 | (3800.52 ms | 137952 tok/s) step 2722/76294 | train loss 3.741278 | norm 0.2098 | lr 1.17e-03 | (3829.42 ms | 136911 tok/s) step 2723/76294 | train loss 3.759044 | norm 0.1938 | lr 1.17e-03 | (3804.00 ms | 137825 tok/s) step 2724/76294 | train loss 3.749456 | norm 0.1852 | lr 1.17e-03 | (3843.47 ms | 136410 tok/s) step 2725/76294 | train loss 3.741572 | norm 0.1923 | lr 1.17e-03 | (3801.02 ms | 137933 tok/s) step 2726/76294 | train loss 3.728006 | norm 0.1911 | lr 1.17e-03 | (3807.79 ms | 137688 tok/s) step 2727/76294 | train loss 3.788579 | norm 0.2407 | lr 1.17e-03 | (3823.92 ms | 137107 tok/s) step 2728/76294 | train loss 
3.813119 | norm 0.2400 | lr 1.17e-03 | (3803.10 ms | 137858 tok/s) step 2729/76294 | train loss 3.809126 | norm 0.2009 | lr 1.17e-03 | (3803.88 ms | 137830 tok/s) step 2730/76294 | train loss 3.776402 | norm 0.2197 | lr 1.17e-03 | (4006.00 ms | 130876 tok/s) step 2731/76294 | train loss 3.764888 | norm 0.2286 | lr 1.17e-03 | (3802.80 ms | 137869 tok/s) step 2732/76294 | train loss 3.737762 | norm 0.2159 | lr 1.17e-03 | (3838.40 ms | 136590 tok/s) step 2733/76294 | train loss 3.749223 | norm 0.2511 | lr 1.17e-03 | (3801.71 ms | 137908 tok/s) step 2734/76294 | train loss 3.798848 | norm 0.2782 | lr 1.17e-03 | (3823.37 ms | 137127 tok/s) step 2735/76294 | train loss 3.720298 | norm 0.3149 | lr 1.17e-03 | (3850.31 ms | 136168 tok/s) step 2736/76294 | train loss 3.763579 | norm 0.3158 | lr 1.17e-03 | (3805.19 ms | 137783 tok/s) step 2737/76294 | train loss 3.821295 | norm 0.2602 | lr 1.17e-03 | (3877.32 ms | 135219 tok/s) step 2738/76294 | train loss 3.826642 | norm 0.3448 | lr 1.17e-03 | (3927.65 ms | 133487 tok/s) step 2739/76294 | train loss 3.774663 | norm 0.2364 | lr 1.17e-03 | (3805.03 ms | 137788 tok/s) step 2740/76294 | train loss 3.774323 | norm 0.2427 | lr 1.17e-03 | (4001.86 ms | 131011 tok/s) step 2741/76294 | train loss 3.763011 | norm 0.2474 | lr 1.17e-03 | (3857.18 ms | 135925 tok/s) step 2742/76294 | train loss 3.759482 | norm 0.2317 | lr 1.17e-03 | (3812.76 ms | 137509 tok/s) step 2743/76294 | train loss 3.752092 | norm 0.2118 | lr 1.17e-03 | (3823.58 ms | 137120 tok/s) step 2744/76294 | train loss 3.775651 | norm 0.2108 | lr 1.17e-03 | (3810.67 ms | 137584 tok/s) step 2745/76294 | train loss 3.735036 | norm 0.2089 | lr 1.17e-03 | (3806.79 ms | 137724 tok/s) step 2746/76294 | train loss 3.709968 | norm 0.2383 | lr 1.17e-03 | (3836.14 ms | 136671 tok/s) step 2747/76294 | train loss 3.703368 | norm 0.2559 | lr 1.17e-03 | (3808.64 ms | 137658 tok/s) step 2748/76294 | train loss 3.721438 | norm 0.2688 | lr 1.17e-03 | (3833.79 ms | 136755 tok/s) step 
2749/76294 | train loss 3.760919 | norm 0.2234 | lr 1.17e-03 | (3885.53 ms | 134933 tok/s) step 2750/76294 | train loss 3.733169 | norm 0.2125 | lr 1.17e-03 | (3865.41 ms | 135636 tok/s) val loss: 3.724697 saving model checkpoint to ./results/gpt2-124M-gqa/step_2750.pth step 2751/76294 | train loss 3.847764 | norm 0.2252 | lr 1.17e-03 | (3803.14 ms | 137857 tok/s) step 2752/76294 | train loss 3.730897 | norm 0.2428 | lr 1.17e-03 | (3825.98 ms | 137034 tok/s) step 2753/76294 | train loss 3.759465 | norm 0.2590 | lr 1.17e-03 | (3797.13 ms | 138075 tok/s) step 2754/76294 | train loss 3.709504 | norm 0.1883 | lr 1.17e-03 | (3822.75 ms | 137149 tok/s) step 2755/76294 | train loss 3.682573 | norm 0.2308 | lr 1.17e-03 | (3802.76 ms | 137870 tok/s) step 2756/76294 | train loss 3.765089 | norm 0.2151 | lr 1.17e-03 | (3824.26 ms | 137095 tok/s) step 2757/76294 | train loss 3.672479 | norm 0.2978 | lr 1.17e-03 | (3887.11 ms | 134879 tok/s) step 2758/76294 | train loss 3.728591 | norm 0.2577 | lr 1.17e-03 | (3799.66 ms | 137983 tok/s) step 2759/76294 | train loss 3.780755 | norm 0.2525 | lr 1.17e-03 | (3832.99 ms | 136783 tok/s) step 2760/76294 | train loss 3.822397 | norm 0.2442 | lr 1.17e-03 | (3802.11 ms | 137894 tok/s) step 2761/76294 | train loss 3.679832 | norm 0.2150 | lr 1.17e-03 | (3806.94 ms | 137719 tok/s) step 2762/76294 | train loss 3.728312 | norm 0.2441 | lr 1.17e-03 | (3827.29 ms | 136987 tok/s) step 2763/76294 | train loss 3.764827 | norm 0.2290 | lr 1.17e-03 | (3808.23 ms | 137672 tok/s) step 2764/76294 | train loss 3.778102 | norm 0.2676 | lr 1.17e-03 | (3824.79 ms | 137076 tok/s) step 2765/76294 | train loss 3.799597 | norm 0.2335 | lr 1.17e-03 | (3805.19 ms | 137782 tok/s) step 2766/76294 | train loss 3.763324 | norm 0.2352 | lr 1.17e-03 | (3830.15 ms | 136884 tok/s) step 2767/76294 | train loss 3.738325 | norm 0.2664 | lr 1.17e-03 | (3809.61 ms | 137623 tok/s) step 2768/76294 | train loss 3.698577 | norm 0.2763 | lr 1.17e-03 | (3800.47 ms | 137953 tok/s) 
step 2769/76294 | train loss 3.804863 | norm 0.2916 | lr 1.17e-03 | (3843.34 ms | 136415 tok/s) step 2770/76294 | train loss 3.691951 | norm 0.2119 | lr 1.17e-03 | (3809.90 ms | 137612 tok/s) step 2771/76294 | train loss 3.757493 | norm 0.2688 | lr 1.17e-03 | (3837.77 ms | 136613 tok/s) step 2772/76294 | train loss 3.841005 | norm 0.2416 | lr 1.17e-03 | (3827.17 ms | 136991 tok/s) step 2773/76294 | train loss 3.766060 | norm 0.2453 | lr 1.17e-03 | (3828.24 ms | 136953 tok/s) step 2774/76294 | train loss 3.623937 | norm 0.2144 | lr 1.17e-03 | (3805.49 ms | 137771 tok/s) step 2775/76294 | train loss 3.773386 | norm 0.2279 | lr 1.17e-03 | (3808.66 ms | 137657 tok/s) step 2776/76294 | train loss 3.755677 | norm 0.2408 | lr 1.17e-03 | (3808.09 ms | 137677 tok/s) step 2777/76294 | train loss 3.682448 | norm 0.1913 | lr 1.17e-03 | (3995.00 ms | 131236 tok/s) step 2778/76294 | train loss 3.765951 | norm 0.2539 | lr 1.17e-03 | (3799.41 ms | 137992 tok/s) step 2779/76294 | train loss 3.698621 | norm 0.1860 | lr 1.17e-03 | (3840.99 ms | 136498 tok/s) step 2780/76294 | train loss 3.777379 | norm 0.2228 | lr 1.17e-03 | (3803.30 ms | 137851 tok/s) step 2781/76294 | train loss 3.771511 | norm 0.2344 | lr 1.17e-03 | (3861.28 ms | 135781 tok/s) step 2782/76294 | train loss 3.729634 | norm 0.2511 | lr 1.17e-03 | (3801.45 ms | 137918 tok/s) step 2783/76294 | train loss 3.814306 | norm 0.3108 | lr 1.17e-03 | (3808.29 ms | 137670 tok/s) step 2784/76294 | train loss 3.748241 | norm 0.2700 | lr 1.17e-03 | (3826.82 ms | 137004 tok/s) step 2785/76294 | train loss 3.679391 | norm 0.2484 | lr 1.17e-03 | (3806.97 ms | 137718 tok/s) step 2786/76294 | train loss 3.754589 | norm 0.2597 | lr 1.17e-03 | (3903.07 ms | 134327 tok/s) step 2787/76294 | train loss 3.739664 | norm 0.2366 | lr 1.17e-03 | (3818.03 ms | 137319 tok/s) step 2788/76294 | train loss 3.815749 | norm 0.2489 | lr 1.17e-03 | (3807.14 ms | 137712 tok/s) step 2789/76294 | train loss 3.789690 | norm 0.2391 | lr 1.17e-03 | (3828.26 ms 
| 136952 tok/s) step 2790/76294 | train loss 3.759295 | norm 0.2661 | lr 1.17e-03 | (3808.65 ms | 137657 tok/s) step 2791/76294 | train loss 3.714321 | norm 0.2030 | lr 1.17e-03 | (4186.32 ms | 125238 tok/s) step 2792/76294 | train loss 3.800789 | norm 0.2540 | lr 1.17e-03 | (3835.06 ms | 136709 tok/s) step 2793/76294 | train loss 3.759562 | norm 0.2298 | lr 1.17e-03 | (3806.79 ms | 137724 tok/s) step 2794/76294 | train loss 3.686179 | norm 0.1921 | lr 1.17e-03 | (3810.59 ms | 137587 tok/s) step 2795/76294 | train loss 3.757445 | norm 0.2005 | lr 1.17e-03 | (3838.88 ms | 136573 tok/s) step 2796/76294 | train loss 3.757592 | norm 0.2189 | lr 1.17e-03 | (3809.20 ms | 137637 tok/s) step 2797/76294 | train loss 3.758946 | norm 0.1829 | lr 1.17e-03 | (3884.10 ms | 134983 tok/s) step 2798/76294 | train loss 3.756830 | norm 0.1820 | lr 1.17e-03 | (3837.02 ms | 136639 tok/s) step 2799/76294 | train loss 3.720485 | norm 0.2074 | lr 1.17e-03 | (3826.15 ms | 137027 tok/s) step 2800/76294 | train loss 3.789263 | norm 0.1767 | lr 1.17e-03 | (3808.32 ms | 137669 tok/s) step 2801/76294 | train loss 3.770962 | norm 0.2088 | lr 1.17e-03 | (3830.26 ms | 136881 tok/s) step 2802/76294 | train loss 3.811765 | norm 0.2293 | lr 1.17e-03 | (3810.71 ms | 137583 tok/s) step 2803/76294 | train loss 3.682085 | norm 0.2338 | lr 1.17e-03 | (3903.16 ms | 134324 tok/s) step 2804/76294 | train loss 3.699657 | norm 0.2016 | lr 1.17e-03 | (3896.76 ms | 134545 tok/s) step 2805/76294 | train loss 3.707756 | norm 0.1921 | lr 1.17e-03 | (3806.85 ms | 137722 tok/s) step 2806/76294 | train loss 3.746988 | norm 0.2158 | lr 1.17e-03 | (3833.44 ms | 136767 tok/s) step 2807/76294 | train loss 3.680139 | norm 0.2086 | lr 1.17e-03 | (3907.97 ms | 134159 tok/s) step 2808/76294 | train loss 3.734230 | norm 0.2048 | lr 1.17e-03 | (3808.05 ms | 137679 tok/s) step 2809/76294 | train loss 3.768744 | norm 0.2295 | lr 1.17e-03 | (3832.94 ms | 136785 tok/s) step 2810/76294 | train loss 3.728715 | norm 0.1967 | lr 
1.17e-03 | (3810.78 ms | 137580 tok/s) step 2811/76294 | train loss 3.703222 | norm 0.1930 | lr 1.17e-03 | (3803.68 ms | 137837 tok/s) step 2812/76294 | train loss 3.710998 | norm 0.2399 | lr 1.17e-03 | (3846.89 ms | 136289 tok/s) step 2813/76294 | train loss 3.703760 | norm 0.2528 | lr 1.17e-03 | (3811.43 ms | 137557 tok/s) step 2814/76294 | train loss 3.758986 | norm 0.2174 | lr 1.17e-03 | (3818.44 ms | 137304 tok/s) step 2815/76294 | train loss 3.813434 | norm 0.1952 | lr 1.17e-03 | (3808.05 ms | 137679 tok/s) step 2816/76294 | train loss 3.739382 | norm 0.2139 | lr 1.17e-03 | (3840.04 ms | 136532 tok/s) step 2817/76294 | train loss 3.772003 | norm 0.2346 | lr 1.17e-03 | (3805.63 ms | 137767 tok/s) step 2818/76294 | train loss 3.742634 | norm 0.2251 | lr 1.17e-03 | (3861.85 ms | 135761 tok/s) step 2819/76294 | train loss 3.713821 | norm 0.2006 | lr 1.17e-03 | (3813.57 ms | 137480 tok/s) step 2820/76294 | train loss 3.743258 | norm 0.2117 | lr 1.17e-03 | (3877.03 ms | 135229 tok/s) step 2821/76294 | train loss 3.732376 | norm 0.2260 | lr 1.17e-03 | (3802.06 ms | 137896 tok/s) step 2822/76294 | train loss 3.802816 | norm 0.2129 | lr 1.17e-03 | (3838.74 ms | 136578 tok/s) step 2823/76294 | train loss 3.723796 | norm 0.2303 | lr 1.17e-03 | (3805.93 ms | 137755 tok/s) step 2824/76294 | train loss 3.776576 | norm 0.2103 | lr 1.17e-03 | (3859.11 ms | 135857 tok/s) step 2825/76294 | train loss 3.768732 | norm 0.3159 | lr 1.17e-03 | (3816.14 ms | 137387 tok/s) step 2826/76294 | train loss 3.708726 | norm 0.3214 | lr 1.17e-03 | (3820.37 ms | 137235 tok/s) step 2827/76294 | train loss 3.730058 | norm 0.2667 | lr 1.17e-03 | (3824.66 ms | 137081 tok/s) step 2828/76294 | train loss 3.755123 | norm 0.2559 | lr 1.17e-03 | (3807.20 ms | 137710 tok/s) step 2829/76294 | train loss 3.878362 | norm 0.2285 | lr 1.17e-03 | (3831.75 ms | 136827 tok/s) step 2830/76294 | train loss 3.722800 | norm 0.2034 | lr 1.17e-03 | (3814.21 ms | 137457 tok/s) step 2831/76294 | train loss 3.779302 | 
norm 0.2716 | lr 1.17e-03 | (3838.79 ms | 136576 tok/s) step 2832/76294 | train loss 3.825939 | norm 0.2771 | lr 1.17e-03 | (3875.13 ms | 135296 tok/s) step 2833/76294 | train loss 3.753129 | norm 0.2433 | lr 1.17e-03 | (3827.30 ms | 136986 tok/s) step 2834/76294 | train loss 3.755293 | norm 0.2205 | lr 1.17e-03 | (3877.04 ms | 135229 tok/s) step 2835/76294 | train loss 3.741461 | norm 0.2070 | lr 1.17e-03 | (3799.39 ms | 137993 tok/s) step 2836/76294 | train loss 3.733347 | norm 0.2442 | lr 1.17e-03 | (3848.55 ms | 136230 tok/s) step 2837/76294 | train loss 3.699858 | norm 0.2203 | lr 1.17e-03 | (3805.05 ms | 137787 tok/s) step 2838/76294 | train loss 3.729365 | norm 0.2236 | lr 1.17e-03 | (3933.92 ms | 133274 tok/s) step 2839/76294 | train loss 3.774340 | norm 0.1904 | lr 1.17e-03 | (3805.94 ms | 137755 tok/s) step 2840/76294 | train loss 3.765912 | norm 0.2283 | lr 1.17e-03 | (3843.30 ms | 136416 tok/s) step 2841/76294 | train loss 3.767967 | norm 0.2397 | lr 1.17e-03 | (3829.12 ms | 136921 tok/s) step 2842/76294 | train loss 3.738141 | norm 0.2192 | lr 1.17e-03 | (3809.53 ms | 137625 tok/s) step 2843/76294 | train loss 3.752883 | norm 0.2297 | lr 1.17e-03 | (3831.24 ms | 136845 tok/s) step 2844/76294 | train loss 3.700182 | norm 0.2741 | lr 1.17e-03 | (3833.46 ms | 136766 tok/s) step 2845/76294 | train loss 3.748225 | norm 0.2404 | lr 1.17e-03 | (3824.41 ms | 137090 tok/s) step 2846/76294 | train loss 3.742269 | norm 0.2249 | lr 1.17e-03 | (5368.51 ms | 97660 tok/s) step 2847/76294 | train loss 3.711508 | norm 0.2094 | lr 1.17e-03 | (4074.78 ms | 128667 tok/s) step 2848/76294 | train loss 3.707606 | norm 0.2172 | lr 1.17e-03 | (3787.70 ms | 138419 tok/s) step 2849/76294 | train loss 3.747834 | norm 0.2368 | lr 1.17e-03 | (3823.04 ms | 137139 tok/s) step 2850/76294 | train loss 3.704761 | norm 0.2677 | lr 1.17e-03 | (3798.38 ms | 138029 tok/s) step 2851/76294 | train loss 3.681070 | norm 0.2487 | lr 1.17e-03 | (4043.89 ms | 129649 tok/s) step 2852/76294 | train 
loss 3.701367 | norm 0.2235 | lr 1.17e-03 | (3795.78 ms | 138124 tok/s) step 2853/76294 | train loss 3.727395 | norm 0.2568 | lr 1.17e-03 | (3826.12 ms | 137029 tok/s) step 2854/76294 | train loss 3.698461 | norm 0.2144 | lr 1.17e-03 | (3817.83 ms | 137326 tok/s) step 2855/76294 | train loss 3.743661 | norm 0.2060 | lr 1.17e-03 | (3809.84 ms | 137614 tok/s) step 2856/76294 | train loss 3.741905 | norm 0.1966 | lr 1.17e-03 | (3933.31 ms | 133294 tok/s) step 2857/76294 | train loss 3.732994 | norm 0.1970 | lr 1.17e-03 | (3838.30 ms | 136594 tok/s) step 2858/76294 | train loss 3.699557 | norm 0.2251 | lr 1.17e-03 | (3831.71 ms | 136829 tok/s) step 2859/76294 | train loss 3.810834 | norm 0.2461 | lr 1.17e-03 | (3810.06 ms | 137606 tok/s) step 2860/76294 | train loss 3.789943 | norm 0.2839 | lr 1.17e-03 | (3801.21 ms | 137927 tok/s) step 2861/76294 | train loss 3.722269 | norm 0.2987 | lr 1.17e-03 | (3801.33 ms | 137922 tok/s) step 2862/76294 | train loss 3.692142 | norm 0.2632 | lr 1.17e-03 | (4537.76 ms | 115539 tok/s) step 2863/76294 | train loss 3.733793 | norm 0.2317 | lr 1.17e-03 | (10698.87 ms | 49004 tok/s) step 2864/76294 | train loss 3.754943 | norm 0.2482 | lr 1.17e-03 | (3781.30 ms | 138653 tok/s) step 2865/76294 | train loss 3.696019 | norm 0.2579 | lr 1.17e-03 | (3812.74 ms | 137510 tok/s) step 2866/76294 | train loss 3.761740 | norm 0.2085 | lr 1.17e-03 | (3806.58 ms | 137732 tok/s) step 2867/76294 | train loss 3.682025 | norm 0.2026 | lr 1.17e-03 | (4098.25 ms | 127930 tok/s) step 2868/76294 | train loss 3.759723 | norm 0.2232 | lr 1.17e-03 | (3795.59 ms | 138131 tok/s) step 2869/76294 | train loss 3.691487 | norm 0.2513 | lr 1.17e-03 | (3967.85 ms | 132134 tok/s) step 2870/76294 | train loss 3.803034 | norm 0.2166 | lr 1.17e-03 | (3816.19 ms | 137385 tok/s) step 2871/76294 | train loss 3.662618 | norm 0.2846 | lr 1.17e-03 | (3932.13 ms | 133334 tok/s) step 2872/76294 | train loss 3.762230 | norm 0.3219 | lr 1.17e-03 | (9080.31 ms | 57739 tok/s) step 
2873/76294 | train loss 3.735486 | norm 0.2824 | lr 1.17e-03 | (3882.37 ms | 135043 tok/s) step 2874/76294 | train loss 3.731534 | norm 0.2463 | lr 1.17e-03 | (4092.85 ms | 128098 tok/s) step 2875/76294 | train loss 3.726383 | norm 0.2615 | lr 1.17e-03 | (4400.15 ms | 119152 tok/s) step 2876/76294 | train loss 3.763728 | norm 0.2808 | lr 1.17e-03 | (3789.04 ms | 138369 tok/s) step 2877/76294 | train loss 3.697758 | norm 0.2093 | lr 1.17e-03 | (3869.14 ms | 135505 tok/s) step 2878/76294 | train loss 3.712736 | norm 0.2354 | lr 1.17e-03 | (3790.60 ms | 138313 tok/s) step 2879/76294 | train loss 3.765081 | norm 0.2278 | lr 1.17e-03 | (3924.66 ms | 133588 tok/s) step 2880/76294 | train loss 3.806475 | norm 0.2594 | lr 1.17e-03 | (3793.27 ms | 138215 tok/s) step 2881/76294 | train loss 3.752046 | norm 0.2677 | lr 1.17e-03 | (3897.81 ms | 134508 tok/s) step 2882/76294 | train loss 3.782427 | norm 0.2611 | lr 1.17e-03 | (3806.22 ms | 137745 tok/s) step 2883/76294 | train loss 3.807523 | norm 0.2254 | lr 1.17e-03 | (3831.62 ms | 136832 tok/s) step 2884/76294 | train loss 3.722134 | norm 0.2246 | lr 1.17e-03 | (3921.49 ms | 133696 tok/s) step 2885/76294 | train loss 3.736762 | norm 0.2199 | lr 1.17e-03 | (3950.83 ms | 132703 tok/s) step 2886/76294 | train loss 3.760416 | norm 0.2279 | lr 1.17e-03 | (10412.74 ms | 50351 tok/s) step 2887/76294 | train loss 3.754518 | norm 0.2714 | lr 1.17e-03 | (4666.71 ms | 112346 tok/s) step 2888/76294 | train loss 3.769965 | norm 0.2501 | lr 1.16e-03 | (4229.08 ms | 123972 tok/s) step 2889/76294 | train loss 3.651223 | norm 0.2256 | lr 1.16e-03 | (3821.39 ms | 137198 tok/s) step 2890/76294 | train loss 3.804745 | norm 0.2057 | lr 1.16e-03 | (3824.24 ms | 137096 tok/s) step 2891/76294 | train loss 3.681818 | norm 0.2340 | lr 1.16e-03 | (4367.31 ms | 120048 tok/s) step 2892/76294 | train loss 3.747341 | norm 0.2392 | lr 1.16e-03 | (3932.37 ms | 133326 tok/s) step 2893/76294 | train loss 3.691142 | norm 0.2297 | lr 1.16e-03 | (3874.87 ms | 
135305 tok/s) step 2894/76294 | train loss 3.671949 | norm 0.2296 | lr 1.16e-03 | (3857.40 ms | 135918 tok/s) step 2895/76294 | train loss 3.712092 | norm 0.2169 | lr 1.16e-03 | (3784.01 ms | 138554 tok/s) step 2896/76294 | train loss 3.748335 | norm 0.2104 | lr 1.16e-03 | (4127.14 ms | 127034 tok/s) step 2897/76294 | train loss 3.709273 | norm 0.2796 | lr 1.16e-03 | (3786.65 ms | 138457 tok/s) step 2898/76294 | train loss 3.772148 | norm 0.3011 | lr 1.16e-03 | (3796.36 ms | 138103 tok/s) step 2899/76294 | train loss 3.725659 | norm 0.2132 | lr 1.16e-03 | (3823.62 ms | 137118 tok/s) step 2900/76294 | train loss 3.758870 | norm 0.2894 | lr 1.16e-03 | (3982.17 ms | 131659 tok/s) step 2901/76294 | train loss 3.671932 | norm 0.2667 | lr 1.16e-03 | (3792.82 ms | 138232 tok/s) step 2902/76294 | train loss 3.733673 | norm 0.2261 | lr 1.16e-03 | (3831.08 ms | 136851 tok/s) step 2903/76294 | train loss 3.696143 | norm 0.2426 | lr 1.16e-03 | (3894.37 ms | 134627 tok/s) step 2904/76294 | train loss 3.734044 | norm 0.2206 | lr 1.16e-03 | (3814.34 ms | 137452 tok/s) step 2905/76294 | train loss 3.681129 | norm 0.2361 | lr 1.16e-03 | (3816.93 ms | 137358 tok/s) step 2906/76294 | train loss 3.667567 | norm 0.2330 | lr 1.16e-03 | (3824.71 ms | 137079 tok/s) step 2907/76294 | train loss 3.723670 | norm 0.2176 | lr 1.16e-03 | (3798.20 ms | 138036 tok/s) step 2908/76294 | train loss 3.696505 | norm 0.2081 | lr 1.16e-03 | (3854.35 ms | 136025 tok/s) step 2909/76294 | train loss 3.711994 | norm 0.2378 | lr 1.16e-03 | (3801.66 ms | 137910 tok/s) step 2910/76294 | train loss 3.667157 | norm 0.1993 | lr 1.16e-03 | (3889.31 ms | 134802 tok/s) step 2911/76294 | train loss 3.723591 | norm 0.2609 | lr 1.16e-03 | (3819.94 ms | 137250 tok/s) step 2912/76294 | train loss 3.657584 | norm 0.2559 | lr 1.16e-03 | (3957.05 ms | 132495 tok/s) step 2913/76294 | train loss 3.751944 | norm 0.2600 | lr 1.16e-03 | (4012.58 ms | 130661 tok/s) step 2914/76294 | train loss 3.710840 | norm 0.2668 | lr 1.16e-03 
| (5340.24 ms | 98177 tok/s) step 2915/76294 | train loss 3.690119 | norm 0.2277 | lr 1.16e-03 | (6788.99 ms | 77226 tok/s) step 2916/76294 | train loss 3.696773 | norm 0.2453 | lr 1.16e-03 | (3792.48 ms | 138244 tok/s) step 2917/76294 | train loss 3.711197 | norm 0.2301 | lr 1.16e-03 | (3798.24 ms | 138034 tok/s) step 2918/76294 | train loss 3.677178 | norm 0.2351 | lr 1.16e-03 | (3818.65 ms | 137297 tok/s) step 2919/76294 | train loss 3.677469 | norm 0.1940 | lr 1.16e-03 | (3796.19 ms | 138109 tok/s) step 2920/76294 | train loss 3.691878 | norm 0.2093 | lr 1.16e-03 | (3795.37 ms | 138139 tok/s) step 2921/76294 | train loss 3.700378 | norm 0.2133 | lr 1.16e-03 | (3942.56 ms | 132982 tok/s) step 2922/76294 | train loss 3.681877 | norm 0.1948 | lr 1.16e-03 | (3799.51 ms | 137988 tok/s) step 2923/76294 | train loss 3.715242 | norm 0.1984 | lr 1.16e-03 | (3829.97 ms | 136891 tok/s) step 2924/76294 | train loss 3.726985 | norm 0.2260 | lr 1.16e-03 | (3813.12 ms | 137496 tok/s) step 2925/76294 | train loss 3.699615 | norm 0.3625 | lr 1.16e-03 | (3806.45 ms | 137737 tok/s) step 2926/76294 | train loss 3.672373 | norm 0.1959 | lr 1.16e-03 | (3828.03 ms | 136960 tok/s) step 2927/76294 | train loss 3.732712 | norm 0.2282 | lr 1.16e-03 | (3825.66 ms | 137045 tok/s) step 2928/76294 | train loss 3.666538 | norm 0.2366 | lr 1.16e-03 | (3827.54 ms | 136978 tok/s) step 2929/76294 | train loss 3.823192 | norm 0.2461 | lr 1.16e-03 | (3803.57 ms | 137841 tok/s) step 2930/76294 | train loss 3.647134 | norm 0.2392 | lr 1.16e-03 | (3802.37 ms | 137885 tok/s) step 2931/76294 | train loss 3.679336 | norm 0.1968 | lr 1.16e-03 | (3848.92 ms | 136217 tok/s) step 2932/76294 | train loss 3.678935 | norm 0.2141 | lr 1.16e-03 | (3806.51 ms | 137735 tok/s) step 2933/76294 | train loss 3.749132 | norm 0.2356 | lr 1.16e-03 | (3810.13 ms | 137604 tok/s) step 2934/76294 | train loss 3.660490 | norm 0.2286 | lr 1.16e-03 | (3835.31 ms | 136700 tok/s) step 2935/76294 | train loss 3.744252 | norm 0.2406 
| lr 1.16e-03 | (3808.21 ms | 137673 tok/s) step 2936/76294 | train loss 3.678762 | norm 0.2410 | lr 1.16e-03 | (3927.27 ms | 133499 tok/s) step 2937/76294 | train loss 3.667952 | norm 0.3716 | lr 1.16e-03 | (3809.19 ms | 137638 tok/s) step 2938/76294 | train loss 3.711096 | norm 0.4021 | lr 1.16e-03 | (3808.37 ms | 137667 tok/s) step 2939/76294 | train loss 3.775808 | norm 0.3166 | lr 1.16e-03 | (3852.05 ms | 136106 tok/s) step 2940/76294 | train loss 3.649257 | norm 0.2964 | lr 1.16e-03 | (3807.05 ms | 137715 tok/s) step 2941/76294 | train loss 3.690937 | norm 0.3839 | lr 1.16e-03 | (3816.53 ms | 137373 tok/s) step 2942/76294 | train loss 3.670260 | norm 0.2662 | lr 1.16e-03 | (3808.97 ms | 137646 tok/s) step 2943/76294 | train loss 3.856211 | norm 0.2747 | lr 1.16e-03 | (3814.29 ms | 137454 tok/s) step 2944/76294 | train loss 3.708060 | norm 0.2729 | lr 1.16e-03 | (3810.37 ms | 137595 tok/s) step 2945/76294 | train loss 3.670426 | norm 0.2665 | lr 1.16e-03 | (3851.35 ms | 136131 tok/s) step 2946/76294 | train loss 3.683295 | norm 0.2750 | lr 1.16e-03 | (3810.81 ms | 137579 tok/s) step 2947/76294 | train loss 3.781728 | norm 0.2201 | lr 1.16e-03 | (3837.31 ms | 136629 tok/s) step 2948/76294 | train loss 3.751804 | norm 0.2228 | lr 1.16e-03 | (3833.22 ms | 136775 tok/s) step 2949/76294 | train loss 3.718919 | norm 0.1988 | lr 1.16e-03 | (3809.75 ms | 137617 tok/s) step 2950/76294 | train loss 3.786747 | norm 0.1938 | lr 1.16e-03 | (3849.85 ms | 136184 tok/s) step 2951/76294 | train loss 3.668300 | norm 0.1981 | lr 1.16e-03 | (3838.98 ms | 136570 tok/s) step 2952/76294 | train loss 3.704221 | norm 0.1894 | lr 1.16e-03 | (3840.83 ms | 136504 tok/s) step 2953/76294 | train loss 3.698123 | norm 0.1937 | lr 1.16e-03 | (3807.03 ms | 137716 tok/s) step 2954/76294 | train loss 3.658415 | norm 0.2000 | lr 1.16e-03 | (3840.44 ms | 136518 tok/s) step 2955/76294 | train loss 3.734079 | norm 0.2144 | lr 1.16e-03 | (3802.90 ms | 137865 tok/s) step 2956/76294 | train loss 
3.725632 | norm 0.2024 | lr 1.16e-03 | (3841.02 ms | 136497 tok/s) step 2957/76294 | train loss 3.669242 | norm 0.2014 | lr 1.16e-03 | (3838.77 ms | 136577 tok/s) step 2958/76294 | train loss 3.735374 | norm 0.2137 | lr 1.16e-03 | (3932.62 ms | 133318 tok/s) step 2959/76294 | train loss 3.726410 | norm 0.2342 | lr 1.16e-03 | (3804.35 ms | 137813 tok/s) step 2960/76294 | train loss 3.700052 | norm 0.1889 | lr 1.16e-03 | (3844.97 ms | 136357 tok/s) step 2961/76294 | train loss 3.683409 | norm 0.1977 | lr 1.16e-03 | (3809.13 ms | 137640 tok/s) step 2962/76294 | train loss 3.584778 | norm 0.2322 | lr 1.16e-03 | (4083.63 ms | 128388 tok/s) step 2963/76294 | train loss 3.718580 | norm 0.2423 | lr 1.16e-03 | (3811.02 ms | 137572 tok/s) step 2964/76294 | train loss 3.715415 | norm 0.2815 | lr 1.16e-03 | (3828.32 ms | 136950 tok/s) step 2965/76294 | train loss 3.842643 | norm 0.2466 | lr 1.16e-03 | (3806.31 ms | 137742 tok/s) step 2966/76294 | train loss 3.740099 | norm 0.2396 | lr 1.16e-03 | (3833.10 ms | 136779 tok/s) step 2967/76294 | train loss 3.708829 | norm 0.2515 | lr 1.16e-03 | (3803.68 ms | 137837 tok/s) step 2968/76294 | train loss 3.716561 | norm 0.2318 | lr 1.16e-03 | (3831.36 ms | 136841 tok/s) step 2969/76294 | train loss 3.653098 | norm 0.2216 | lr 1.16e-03 | (3804.02 ms | 137825 tok/s) step 2970/76294 | train loss 3.737661 | norm 0.2234 | lr 1.16e-03 | (3931.22 ms | 133365 tok/s) step 2971/76294 | train loss 3.755861 | norm 0.2193 | lr 1.16e-03 | (3956.77 ms | 132504 tok/s) step 2972/76294 | train loss 3.684454 | norm 0.2097 | lr 1.16e-03 | (3804.30 ms | 137815 tok/s) step 2973/76294 | train loss 3.735114 | norm 0.2208 | lr 1.16e-03 | (3824.54 ms | 137085 tok/s) step 2974/76294 | train loss 3.791976 | norm 0.2356 | lr 1.16e-03 | (3805.59 ms | 137768 tok/s) step 2975/76294 | train loss 3.729317 | norm 0.2757 | lr 1.16e-03 | (3842.42 ms | 136447 tok/s) step 2976/76294 | train loss 3.734612 | norm 0.2468 | lr 1.16e-03 | (3802.53 ms | 137879 tok/s) step 
2977/76294 | train loss 3.777510 | norm 0.2060 | lr 1.16e-03 | (4019.74 ms | 130428 tok/s) step 2978/76294 | train loss 3.683700 | norm 0.2491 | lr 1.16e-03 | (3837.90 ms | 136608 tok/s) step 2979/76294 | train loss 3.737069 | norm 0.2168 | lr 1.16e-03 | (3826.22 ms | 137025 tok/s) step 2980/76294 | train loss 3.683157 | norm 0.1930 | lr 1.16e-03 | (3901.21 ms | 134391 tok/s) step 2981/76294 | train loss 3.689377 | norm 0.2629 | lr 1.16e-03 | (3836.26 ms | 136666 tok/s) step 2982/76294 | train loss 3.694501 | norm 0.2960 | lr 1.16e-03 | (3805.85 ms | 137758 tok/s) step 2983/76294 | train loss 3.657849 | norm 0.2263 | lr 1.16e-03 | (3862.75 ms | 135729 tok/s) step 2984/76294 | train loss 3.681620 | norm 0.2472 | lr 1.16e-03 | (3805.84 ms | 137759 tok/s) step 2985/76294 | train loss 3.709448 | norm 0.2242 | lr 1.16e-03 | (3829.09 ms | 136922 tok/s) step 2986/76294 | train loss 3.728032 | norm 0.2177 | lr 1.16e-03 | (3827.04 ms | 136996 tok/s) step 2987/76294 | train loss 3.700620 | norm 0.2234 | lr 1.16e-03 | (3812.03 ms | 137535 tok/s) step 2988/76294 | train loss 3.770191 | norm 0.2562 | lr 1.16e-03 | (3804.87 ms | 137794 tok/s) step 2989/76294 | train loss 3.714979 | norm 0.2518 | lr 1.16e-03 | (3832.78 ms | 136791 tok/s) step 2990/76294 | train loss 3.773987 | norm 0.2369 | lr 1.16e-03 | (3805.27 ms | 137779 tok/s) step 2991/76294 | train loss 3.600769 | norm 0.2046 | lr 1.16e-03 | (3883.94 ms | 134989 tok/s) step 2992/76294 | train loss 3.699572 | norm 0.2750 | lr 1.16e-03 | (3799.28 ms | 137997 tok/s) step 2993/76294 | train loss 3.692496 | norm 0.2821 | lr 1.16e-03 | (3805.69 ms | 137764 tok/s) step 2994/76294 | train loss 3.695430 | norm 0.2266 | lr 1.16e-03 | (3828.78 ms | 136933 tok/s) step 2995/76294 | train loss 3.673596 | norm 0.2743 | lr 1.16e-03 | (3808.61 ms | 137659 tok/s) step 2996/76294 | train loss 3.741932 | norm 0.2734 | lr 1.16e-03 | (3805.66 ms | 137765 tok/s) step 2997/76294 | train loss 3.685660 | norm 0.2348 | lr 1.16e-03 | (3870.09 ms | 
135472 tok/s) step 2998/76294 | train loss 3.729454 | norm 0.2487 | lr 1.16e-03 | (3809.02 ms | 137644 tok/s) step 2999/76294 | train loss 3.726623 | norm 0.2164 | lr 1.16e-03 | (3868.77 ms | 135518 tok/s) step 3000/76294 | train loss 3.729120 | norm 0.2139 | lr 1.16e-03 | (3830.79 ms | 136862 tok/s) val loss: 3.693403 saving model checkpoint to ./results/gpt2-124M-gqa/step_3000.pth step 3001/76294 | train loss 3.757776 | norm 0.2001 | lr 1.16e-03 | (3836.33 ms | 136664 tok/s) step 3002/76294 | train loss 3.681816 | norm 0.2063 | lr 1.16e-03 | (3803.72 ms | 137836 tok/s) step 3003/76294 | train loss 3.787028 | norm 0.1985 | lr 1.16e-03 | (3829.45 ms | 136909 tok/s) step 3004/76294 | train loss 3.703086 | norm 0.2200 | lr 1.16e-03 | (3806.20 ms | 137746 tok/s) step 3005/76294 | train loss 3.718850 | norm 0.2211 | lr 1.16e-03 | (3806.66 ms | 137729 tok/s) step 3006/76294 | train loss 3.698476 | norm 0.2001 | lr 1.16e-03 | (3835.05 ms | 136709 tok/s) step 3007/76294 | train loss 3.653935 | norm 0.2180 | lr 1.16e-03 | (3806.72 ms | 137727 tok/s) step 3008/76294 | train loss 3.693374 | norm 0.1983 | lr 1.16e-03 | (3799.53 ms | 137988 tok/s) step 3009/76294 | train loss 3.705465 | norm 0.2109 | lr 1.16e-03 | (3908.55 ms | 134139 tok/s) step 3010/76294 | train loss 3.711279 | norm 0.2235 | lr 1.16e-03 | (3800.95 ms | 137936 tok/s) step 3011/76294 | train loss 3.739921 | norm 0.2150 | lr 1.16e-03 | (3921.22 ms | 133705 tok/s) step 3012/76294 | train loss 3.693834 | norm 0.2062 | lr 1.16e-03 | (3804.24 ms | 137817 tok/s) step 3013/76294 | train loss 3.631597 | norm 0.2003 | lr 1.16e-03 | (3873.32 ms | 135359 tok/s) step 3014/76294 | train loss 3.660321 | norm 0.2306 | lr 1.16e-03 | (3803.50 ms | 137843 tok/s) step 3015/76294 | train loss 3.702086 | norm 0.2204 | lr 1.16e-03 | (4625.81 ms | 113340 tok/s) step 3016/76294 | train loss 3.737607 | norm 0.2969 | lr 1.16e-03 | (3800.82 ms | 137941 tok/s) step 3017/76294 | train loss 3.734775 | norm 0.3919 | lr 1.16e-03 | (3827.26 
ms | 136988 tok/s) step 3018/76294 | train loss 3.725575 | norm 0.3179 | lr 1.16e-03 | (3833.28 ms | 136773 tok/s) step 3019/76294 | train loss 3.667282 | norm 0.2422 | lr 1.16e-03 | (3867.53 ms | 135561 tok/s) step 3020/76294 | train loss 3.723564 | norm 0.2689 | lr 1.16e-03 | (3803.14 ms | 137856 tok/s) step 3021/76294 | train loss 3.692750 | norm 0.2202 | lr 1.16e-03 | (3860.95 ms | 135792 tok/s) step 3022/76294 | train loss 3.726625 | norm 0.2290 | lr 1.16e-03 | (3803.36 ms | 137849 tok/s) step 3023/76294 | train loss 3.721297 | norm 0.2046 | lr 1.16e-03 | (4547.12 ms | 115301 tok/s) step 3024/76294 | train loss 3.690316 | norm 0.2080 | lr 1.16e-03 | (3870.55 ms | 135456 tok/s) step 3025/76294 | train loss 3.725783 | norm 0.2020 | lr 1.16e-03 | (3845.69 ms | 136331 tok/s) step 3026/76294 | train loss 3.672294 | norm 0.1868 | lr 1.16e-03 | (3802.07 ms | 137895 tok/s) step 3027/76294 | train loss 3.724168 | norm 0.1857 | lr 1.16e-03 | (3803.61 ms | 137840 tok/s) step 3028/76294 | train loss 3.646918 | norm 0.2105 | lr 1.16e-03 | (3823.63 ms | 137118 tok/s) step 3029/76294 | train loss 3.686044 | norm 0.2112 | lr 1.16e-03 | (3895.15 ms | 134600 tok/s) step 3030/76294 | train loss 3.670805 | norm 0.2202 | lr 1.16e-03 | (3835.26 ms | 136702 tok/s) step 3031/76294 | train loss 3.664040 | norm 0.2105 | lr 1.16e-03 | (3929.59 ms | 133421 tok/s) step 3032/76294 | train loss 3.670852 | norm 0.2076 | lr 1.16e-03 | (3804.87 ms | 137794 tok/s) step 3033/76294 | train loss 3.734834 | norm 0.2280 | lr 1.16e-03 | (3842.90 ms | 136430 tok/s) step 3034/76294 | train loss 3.685249 | norm 0.2380 | lr 1.16e-03 | (3824.90 ms | 137072 tok/s) step 3035/76294 | train loss 3.725027 | norm 0.2078 | lr 1.16e-03 | (3810.39 ms | 137594 tok/s) step 3036/76294 | train loss 3.704956 | norm 0.1906 | lr 1.16e-03 | (3806.62 ms | 137731 tok/s) step 3037/76294 | train loss 3.681391 | norm 0.2014 | lr 1.16e-03 | (3807.52 ms | 137698 tok/s) step 3038/76294 | train loss 3.699171 | norm 0.2001 | lr 
1.16e-03 | (3902.75 ms | 134338 tok/s) step 3039/76294 | train loss 3.707034 | norm 0.2052 | lr 1.16e-03 | (3864.34 ms | 135674 tok/s) step 3040/76294 | train loss 3.702460 | norm 0.2103 | lr 1.16e-03 | (4004.96 ms | 130910 tok/s) step 3041/76294 | train loss 3.689344 | norm 0.2375 | lr 1.16e-03 | (3812.83 ms | 137506 tok/s) step 3042/76294 | train loss 3.679467 | norm 0.2971 | lr 1.16e-03 | (3822.78 ms | 137148 tok/s) step 3043/76294 | train loss 3.702969 | norm 0.3433 | lr 1.16e-03 | (3919.10 ms | 133778 tok/s) step 3044/76294 | train loss 3.660648 | norm 0.3326 | lr 1.16e-03 | (3802.15 ms | 137893 tok/s) step 3045/76294 | train loss 3.712622 | norm 0.2583 | lr 1.16e-03 | (3861.83 ms | 135761 tok/s) step 3046/76294 | train loss 3.670474 | norm 0.2747 | lr 1.16e-03 | (3803.78 ms | 137833 tok/s) step 3047/76294 | train loss 3.687415 | norm 0.2228 | lr 1.16e-03 | (3818.23 ms | 137312 tok/s) step 3048/76294 | train loss 3.718164 | norm 0.2665 | lr 1.16e-03 | (3828.79 ms | 136933 tok/s) step 3049/76294 | train loss 3.713243 | norm 0.2542 | lr 1.16e-03 | (3809.63 ms | 137622 tok/s) step 3050/76294 | train loss 3.653310 | norm 0.2707 | lr 1.16e-03 | (3810.40 ms | 137594 tok/s) step 3051/76294 | train loss 3.678296 | norm 0.2560 | lr 1.16e-03 | (3972.32 ms | 131985 tok/s) step 3052/76294 | train loss 3.773941 | norm 0.2614 | lr 1.16e-03 | (4938.29 ms | 106168 tok/s) step 3053/76294 | train loss 3.660010 | norm 0.3301 | lr 1.16e-03 | (3822.87 ms | 137145 tok/s) step 3054/76294 | train loss 3.758100 | norm 0.3215 | lr 1.16e-03 | (3810.57 ms | 137588 tok/s) step 3055/76294 | train loss 3.922722 | norm 0.3237 | lr 1.16e-03 | (3802.44 ms | 137882 tok/s) step 3056/76294 | train loss 3.786023 | norm 0.2817 | lr 1.16e-03 | (3835.95 ms | 136677 tok/s) step 3057/76294 | train loss 3.684278 | norm 0.2456 | lr 1.16e-03 | (3805.52 ms | 137770 tok/s) step 3058/76294 | train loss 3.696211 | norm 0.2630 | lr 1.16e-03 | (3826.92 ms | 137000 tok/s) step 3059/76294 | train loss 3.740313 | 
norm 0.2361 | lr 1.16e-03 | (3826.61 ms | 137011 tok/s) step 3060/76294 | train loss 3.653611 | norm 0.2446 | lr 1.16e-03 | (3807.71 ms | 137691 tok/s) step 3061/76294 | train loss 3.763386 | norm 0.2178 | lr 1.16e-03 | (3812.60 ms | 137515 tok/s) step 3062/76294 | train loss 3.706365 | norm 0.2369 | lr 1.16e-03 | (3831.81 ms | 136825 tok/s) step 3063/76294 | train loss 3.704201 | norm 0.2145 | lr 1.16e-03 | (3805.73 ms | 137763 tok/s) step 3064/76294 | train loss 3.691073 | norm 0.1963 | lr 1.16e-03 | (3809.87 ms | 137613 tok/s) step 3065/76294 | train loss 3.704008 | norm 0.2142 | lr 1.16e-03 | (3825.85 ms | 137038 tok/s) step 3066/76294 | train loss 3.704724 | norm 0.2017 | lr 1.16e-03 | (3821.73 ms | 137186 tok/s) step 3067/76294 | train loss 3.672466 | norm 0.1806 | lr 1.16e-03 | (3900.39 ms | 134420 tok/s) step 3068/76294 | train loss 3.718808 | norm 0.1938 | lr 1.16e-03 | (3807.77 ms | 137689 tok/s) step 3069/76294 | train loss 3.676186 | norm 0.1885 | lr 1.16e-03 | (3807.49 ms | 137699 tok/s) step 3070/76294 | train loss 3.717577 | norm 0.2090 | lr 1.16e-03 | (3806.56 ms | 137733 tok/s) step 3071/76294 | train loss 3.732696 | norm 0.2010 | lr 1.16e-03 | (3807.97 ms | 137682 tok/s) step 3072/76294 | train loss 3.700438 | norm 0.2182 | lr 1.16e-03 | (3835.94 ms | 136678 tok/s) step 3073/76294 | train loss 3.730181 | norm 0.1990 | lr 1.16e-03 | (3825.08 ms | 137066 tok/s) step 3074/76294 | train loss 3.635452 | norm 0.2347 | lr 1.16e-03 | (3813.05 ms | 137498 tok/s) step 3075/76294 | train loss 3.731623 | norm 0.2132 | lr 1.16e-03 | (3841.94 ms | 136464 tok/s) step 3076/76294 | train loss 3.725400 | norm 0.2768 | lr 1.16e-03 | (3812.34 ms | 137524 tok/s) step 3077/76294 | train loss 3.701150 | norm 0.2193 | lr 1.16e-03 | (3815.82 ms | 137398 tok/s) step 3078/76294 | train loss 3.672061 | norm 0.2310 | lr 1.16e-03 | (3834.05 ms | 136745 tok/s) step 3079/76294 | train loss 3.697440 | norm 0.2439 | lr 1.16e-03 | (3921.28 ms | 133703 tok/s) step 3080/76294 | train 
loss 3.674613 | norm 0.2295 | lr 1.16e-03 | (3892.95 ms | 134676 tok/s) step 3081/76294 | train loss 3.686018 | norm 0.2600 | lr 1.16e-03 | (3807.32 ms | 137705 tok/s) step 3082/76294 | train loss 3.819988 | norm 0.2589 | lr 1.16e-03 | (3812.76 ms | 137509 tok/s) step 3083/76294 | train loss 3.734044 | norm 0.3210 | lr 1.16e-03 | (3830.82 ms | 136861 tok/s) step 3084/76294 | train loss 3.734237 | norm 0.2807 | lr 1.16e-03 | (3821.35 ms | 137200 tok/s) step 3085/76294 | train loss 3.668518 | norm 0.2425 | lr 1.16e-03 | (3805.82 ms | 137760 tok/s) step 3086/76294 | train loss 3.701435 | norm 0.2343 | lr 1.16e-03 | (3856.69 ms | 135942 tok/s) step 3087/76294 | train loss 3.694481 | norm 0.2385 | lr 1.16e-03 | (3810.57 ms | 137588 tok/s) step 3088/76294 | train loss 3.656951 | norm 0.2279 | lr 1.16e-03 | (3833.77 ms | 136755 tok/s) step 3089/76294 | train loss 3.690376 | norm 0.2086 | lr 1.16e-03 | (3805.86 ms | 137758 tok/s) step 3090/76294 | train loss 3.681304 | norm 0.2259 | lr 1.16e-03 | (3808.94 ms | 137647 tok/s) step 3091/76294 | train loss 3.703972 | norm 0.1874 | lr 1.16e-03 | (3838.31 ms | 136593 tok/s) step 3092/76294 | train loss 3.804710 | norm 0.3208 | lr 1.16e-03 | (3945.67 ms | 132877 tok/s) step 3093/76294 | train loss 3.668930 | norm 0.3171 | lr 1.16e-03 | (3809.75 ms | 137617 tok/s) step 3094/76294 | train loss 3.690762 | norm 0.2540 | lr 1.16e-03 | (3939.26 ms | 133093 tok/s) step 3095/76294 | train loss 3.675739 | norm 0.2098 | lr 1.16e-03 | (3804.54 ms | 137806 tok/s) step 3096/76294 | train loss 3.762775 | norm 0.2473 | lr 1.16e-03 | (3832.34 ms | 136806 tok/s) step 3097/76294 | train loss 3.699661 | norm 0.2418 | lr 1.16e-03 | (3831.32 ms | 136842 tok/s) step 3098/76294 | train loss 3.663922 | norm 0.2020 | lr 1.16e-03 | (3814.47 ms | 137447 tok/s) step 3099/76294 | train loss 3.711192 | norm 0.9153 | lr 1.16e-03 | (3819.41 ms | 137269 tok/s) step 3100/76294 | train loss 3.743270 | norm 0.2038 | lr 1.16e-03 | (4027.57 ms | 130175 tok/s) step 
3101/76294 | train loss 3.740848 | norm 0.2069 | lr 1.16e-03 | (3796.89 ms | 138084 tok/s) step 3102/76294 | train loss 3.705792 | norm 0.2130 | lr 1.16e-03 | (3810.39 ms | 137594 tok/s) step 3103/76294 | train loss 3.694398 | norm 0.2409 | lr 1.16e-03 | (4189.01 ms | 125158 tok/s) step 3104/76294 | train loss 3.734803 | norm 0.2573 | lr 1.16e-03 | (4229.61 ms | 123957 tok/s) step 3105/76294 | train loss 3.675982 | norm 0.4010 | lr 1.16e-03 | (3792.23 ms | 138253 tok/s) step 3106/76294 | train loss 3.684797 | norm 0.3769 | lr 1.16e-03 | (3833.81 ms | 136754 tok/s) step 3107/76294 | train loss 3.747337 | norm 0.2899 | lr 1.16e-03 | (3825.87 ms | 137037 tok/s) step 3108/76294 | train loss 3.671219 | norm 0.2640 | lr 1.16e-03 | (3799.96 ms | 137972 tok/s) step 3109/76294 | train loss 3.690634 | norm 0.2313 | lr 1.16e-03 | (3801.55 ms | 137914 tok/s) step 3110/76294 | train loss 3.740391 | norm 0.2100 | lr 1.16e-03 | (3896.86 ms | 134541 tok/s) step 3111/76294 | train loss 3.685741 | norm 0.2340 | lr 1.16e-03 | (3797.87 ms | 138048 tok/s) step 3112/76294 | train loss 3.685662 | norm 0.2031 | lr 1.16e-03 | (3803.10 ms | 137858 tok/s) step 3113/76294 | train loss 3.660373 | norm 0.1992 | lr 1.16e-03 | (3822.03 ms | 137175 tok/s) step 3114/76294 | train loss 3.682544 | norm 0.2101 | lr 1.16e-03 | (3817.86 ms | 137325 tok/s) step 3115/76294 | train loss 3.619463 | norm 0.1800 | lr 1.16e-03 | (3790.91 ms | 138301 tok/s) step 3116/76294 | train loss 3.645455 | norm 0.1818 | lr 1.16e-03 | (3821.48 ms | 137195 tok/s) step 3117/76294 | train loss 3.704270 | norm 0.1950 | lr 1.16e-03 | (3798.45 ms | 138027 tok/s) step 3118/76294 | train loss 3.752925 | norm 0.3839 | lr 1.16e-03 | (3836.45 ms | 136660 tok/s) step 3119/76294 | train loss 3.737804 | norm 0.1748 | lr 1.16e-03 | (3820.65 ms | 137225 tok/s) step 3120/76294 | train loss 3.667640 | norm 0.2018 | lr 1.16e-03 | (3796.47 ms | 138099 tok/s) step 3121/76294 | train loss 3.647526 | norm 0.2232 | lr 1.16e-03 | (3804.02 ms | 
137825 tok/s) step 3122/76294 | train loss 3.725654 | norm 0.2060 | lr 1.16e-03 | (3802.28 ms | 137888 tok/s) step 3123/76294 | train loss 3.729733 | norm 0.2026 | lr 1.16e-03 | (3795.04 ms | 138151 tok/s) step 3124/76294 | train loss 3.860235 | norm 0.1997 | lr 1.16e-03 | (3823.99 ms | 137105 tok/s) step 3125/76294 | train loss 3.669858 | norm 0.2275 | lr 1.16e-03 | (3821.15 ms | 137207 tok/s) step 3126/76294 | train loss 3.698833 | norm 0.2382 | lr 1.16e-03 | (3866.68 ms | 135591 tok/s) step 3127/76294 | train loss 3.691727 | norm 0.2766 | lr 1.16e-03 | (3797.82 ms | 138050 tok/s) step 3128/76294 | train loss 3.702415 | norm 0.2199 | lr 1.16e-03 | (3802.13 ms | 137893 tok/s) step 3129/76294 | train loss 3.797461 | norm 0.2408 | lr 1.16e-03 | (3800.03 ms | 137970 tok/s) step 3130/76294 | train loss 3.657011 | norm 0.2124 | lr 1.16e-03 | (3902.10 ms | 134360 tok/s) step 3131/76294 | train loss 3.634747 | norm 0.2048 | lr 1.16e-03 | (3798.96 ms | 138008 tok/s) step 3132/76294 | train loss 3.775483 | norm 0.2388 | lr 1.16e-03 | (3803.30 ms | 137851 tok/s) step 3133/76294 | train loss 3.786935 | norm 0.2422 | lr 1.16e-03 | (3841.14 ms | 136493 tok/s) step 3134/76294 | train loss 3.672543 | norm 0.2186 | lr 1.16e-03 | (3807.96 ms | 137682 tok/s) step 3135/76294 | train loss 3.717955 | norm 0.2013 | lr 1.16e-03 | (3803.25 ms | 137853 tok/s) step 3136/76294 | train loss 3.724396 | norm 0.2400 | lr 1.16e-03 | (3836.57 ms | 136655 tok/s) step 3137/76294 | train loss 3.665136 | norm 0.2220 | lr 1.16e-03 | (3805.92 ms | 137756 tok/s) step 3138/76294 | train loss 3.644751 | norm 0.2132 | lr 1.16e-03 | (3849.13 ms | 136209 tok/s) step 3139/76294 | train loss 3.727702 | norm 0.1998 | lr 1.16e-03 | (3820.03 ms | 137247 tok/s) step 3140/76294 | train loss 3.696452 | norm 0.1939 | lr 1.16e-03 | (3831.43 ms | 136839 tok/s) step 3141/76294 | train loss 3.540892 | norm 0.2006 | lr 1.16e-03 | (3812.69 ms | 137511 tok/s) step 3142/76294 | train loss 3.616975 | norm 0.2102 | lr 1.16e-03 
| (3810.61 ms | 137586 tok/s) step 3143/76294 | train loss 3.733354 | norm 0.2262 | lr 1.16e-03 | (3831.20 ms | 136847 tok/s) step 3144/76294 | train loss 3.633077 | norm 0.2335 | lr 1.16e-03 | (3808.82 ms | 137651 tok/s) step 3145/76294 | train loss 3.706691 | norm 0.2596 | lr 1.16e-03 | (3830.03 ms | 136889 tok/s) step 3146/76294 | train loss 3.680486 | norm 0.2330 | lr 1.16e-03 | (3808.37 ms | 137667 tok/s) step 3147/76294 | train loss 3.662158 | norm 0.2236 | lr 1.16e-03 | (3857.99 ms | 135897 tok/s) step 3148/76294 | train loss 3.637478 | norm 0.2009 | lr 1.16e-03 | (3808.05 ms | 137679 tok/s) step 3149/76294 | train loss 3.638491 | norm 0.2220 | lr 1.16e-03 | (3812.79 ms | 137508 tok/s) step 3150/76294 | train loss 3.701830 | norm 0.2016 | lr 1.16e-03 | (3900.85 ms | 134404 tok/s) step 3151/76294 | train loss 3.692683 | norm 0.2335 | lr 1.16e-03 | (3940.94 ms | 133036 tok/s) step 3152/76294 | train loss 3.705314 | norm 0.2335 | lr 1.16e-03 | (3799.81 ms | 137978 tok/s) step 3153/76294 | train loss 3.654941 | norm 0.2342 | lr 1.16e-03 | (3808.13 ms | 137676 tok/s) step 3154/76294 | train loss 3.803844 | norm 0.2758 | lr 1.16e-03 | (3825.04 ms | 137067 tok/s) step 3155/76294 | train loss 3.804710 | norm 0.2422 | lr 1.16e-03 | (3827.66 ms | 136974 tok/s) step 3156/76294 | train loss 3.632010 | norm 0.2116 | lr 1.16e-03 | (3813.72 ms | 137474 tok/s) step 3157/76294 | train loss 3.689664 | norm 0.2378 | lr 1.16e-03 | (3870.45 ms | 135459 tok/s) step 3158/76294 | train loss 3.739634 | norm 0.2190 | lr 1.16e-03 | (3799.83 ms | 137977 tok/s) step 3159/76294 | train loss 3.742794 | norm 0.2385 | lr 1.16e-03 | (3804.64 ms | 137802 tok/s) step 3160/76294 | train loss 3.728088 | norm 0.2933 | lr 1.16e-03 | (3820.79 ms | 137220 tok/s) step 3161/76294 | train loss 3.645515 | norm 0.2311 | lr 1.16e-03 | (3798.73 ms | 138017 tok/s) step 3162/76294 | train loss 3.730590 | norm 0.2323 | lr 1.16e-03 | (3852.48 ms | 136091 tok/s) step 3163/76294 | train loss 3.724046 | norm 
0.2192 | lr 1.16e-03 | (3795.64 ms | 138129 tok/s) step 3164/76294 | train loss 3.837972 | norm 0.2612 | lr 1.16e-03 | (3834.07 ms | 136744 tok/s) step 3165/76294 | train loss 3.693411 | norm 0.2466 | lr 1.16e-03 | (3802.23 ms | 137890 tok/s) step 3166/76294 | train loss 3.679184 | norm 0.2559 | lr 1.16e-03 | (3803.59 ms | 137840 tok/s) step 3167/76294 | train loss 3.667527 | norm 0.2326 | lr 1.16e-03 | (3862.70 ms | 135731 tok/s) step 3168/76294 | train loss 3.675048 | norm 0.2026 | lr 1.16e-03 | (3855.78 ms | 135975 tok/s) step 3169/76294 | train loss 3.643303 | norm 0.2432 | lr 1.16e-03 | (7381.25 ms | 71030 tok/s) step 3170/76294 | train loss 3.760910 | norm 0.2354 | lr 1.16e-03 | (5267.95 ms | 99524 tok/s) step 3171/76294 | train loss 3.686829 | norm 0.2274 | lr 1.16e-03 | (3797.56 ms | 138059 tok/s) step 3172/76294 | train loss 3.590945 | norm 0.2614 | lr 1.16e-03 | (3809.39 ms | 137630 tok/s) step 3173/76294 | train loss 3.849930 | norm 0.2738 | lr 1.16e-03 | (3816.11 ms | 137388 tok/s) step 3174/76294 | train loss 3.681389 | norm 0.2280 | lr 1.16e-03 | (3798.21 ms | 138035 tok/s) step 3175/76294 | train loss 3.687874 | norm 0.2644 | lr 1.16e-03 | (3794.16 ms | 138183 tok/s) step 3176/76294 | train loss 3.694437 | norm 0.2184 | lr 1.16e-03 | (3827.33 ms | 136985 tok/s) step 3177/76294 | train loss 3.657540 | norm 0.2454 | lr 1.16e-03 | (3796.77 ms | 138088 tok/s) step 3178/76294 | train loss 3.701183 | norm 0.2263 | lr 1.16e-03 | (3804.14 ms | 137820 tok/s) step 3179/76294 | train loss 3.730112 | norm 0.2236 | lr 1.16e-03 | (3820.31 ms | 137237 tok/s) step 3180/76294 | train loss 3.678442 | norm 0.2432 | lr 1.16e-03 | (3824.65 ms | 137081 tok/s) step 3181/76294 | train loss 3.697117 | norm 0.2697 | lr 1.16e-03 | (3802.01 ms | 137898 tok/s) step 3182/76294 | train loss 3.723710 | norm 0.2331 | lr 1.16e-03 | (3866.87 ms | 135585 tok/s) step 3183/76294 | train loss 3.716152 | norm 0.2293 | lr 1.15e-03 | (3799.27 ms | 137997 tok/s) step 3184/76294 | train loss 
3.693083 | norm 0.3115 | lr 1.15e-03 | (3805.50 ms | 137771 tok/s) step 3185/76294 | train loss 3.675797 | norm 0.3052 | lr 1.15e-03 | (3821.38 ms | 137199 tok/s) step 3186/76294 | train loss 3.792388 | norm 0.2249 | lr 1.15e-03 | (3802.78 ms | 137870 tok/s) step 3187/76294 | train loss 3.682415 | norm 0.2474 | lr 1.15e-03 | (3799.59 ms | 137985 tok/s) step 3188/76294 | train loss 3.686647 | norm 0.2204 | lr 1.15e-03 | (3830.30 ms | 136879 tok/s) step 3189/76294 | train loss 3.706858 | norm 0.2465 | lr 1.15e-03 | (3800.06 ms | 137968 tok/s) step 3190/76294 | train loss 3.706079 | norm 0.2558 | lr 1.15e-03 | (3890.64 ms | 134756 tok/s) step 3191/76294 | train loss 3.646434 | norm 0.2782 | lr 1.15e-03 | (3801.60 ms | 137913 tok/s) step 3192/76294 | train loss 3.714894 | norm 0.2041 | lr 1.15e-03 | (3802.96 ms | 137863 tok/s) step 3193/76294 | train loss 3.651921 | norm 0.1997 | lr 1.15e-03 | (3838.74 ms | 136578 tok/s) step 3194/76294 | train loss 3.682106 | norm 0.2161 | lr 1.15e-03 | (3803.72 ms | 137836 tok/s) step 3195/76294 | train loss 3.676214 | norm 0.2213 | lr 1.15e-03 | (3802.27 ms | 137888 tok/s) step 3196/76294 | train loss 3.713397 | norm 0.2070 | lr 1.15e-03 | (3830.60 ms | 136868 tok/s) step 3197/76294 | train loss 3.710368 | norm 0.2195 | lr 1.15e-03 | (3800.73 ms | 137944 tok/s) step 3198/76294 | train loss 3.702713 | norm 0.2219 | lr 1.15e-03 | (3809.55 ms | 137625 tok/s) step 3199/76294 | train loss 3.685176 | norm 0.2006 | lr 1.15e-03 | (3829.35 ms | 136913 tok/s) step 3200/76294 | train loss 3.782333 | norm 0.2094 | lr 1.15e-03 | (3807.52 ms | 137698 tok/s) step 3201/76294 | train loss 3.673685 | norm 0.1981 | lr 1.15e-03 | (3822.68 ms | 137152 tok/s) step 3202/76294 | train loss 3.656413 | norm 0.1985 | lr 1.15e-03 | (3889.25 ms | 134804 tok/s) step 3203/76294 | train loss 3.678624 | norm 0.2442 | lr 1.15e-03 | (3871.39 ms | 135426 tok/s) step 3204/76294 | train loss 3.681156 | norm 0.2283 | lr 1.15e-03 | (3807.63 ms | 137694 tok/s) step 
3205/76294 | train loss 3.604525 | norm 0.2196 | lr 1.15e-03 | (4227.83 ms | 124009 tok/s) step 3206/76294 | train loss 3.657448 | norm 0.2419 | lr 1.15e-03 | (3841.95 ms | 136464 tok/s) step 3207/76294 | train loss 3.688978 | norm 0.2786 | lr 1.15e-03 | (3830.86 ms | 136859 tok/s) step 3208/76294 | train loss 3.721490 | norm 0.2072 | lr 1.15e-03 | (3799.70 ms | 137982 tok/s) step 3209/76294 | train loss 3.670846 | norm 0.2237 | lr 1.15e-03 | (3810.39 ms | 137594 tok/s) step 3210/76294 | train loss 3.658123 | norm 0.2074 | lr 1.15e-03 | (3959.80 ms | 132403 tok/s) step 3211/76294 | train loss 3.678856 | norm 0.2089 | lr 1.15e-03 | (3801.31 ms | 137923 tok/s) step 3212/76294 | train loss 3.801325 | norm 0.1988 | lr 1.15e-03 | (3810.20 ms | 137601 tok/s) step 3213/76294 | train loss 3.648046 | norm 0.1821 | lr 1.15e-03 | (3824.88 ms | 137073 tok/s) step 3214/76294 | train loss 3.674970 | norm 0.1962 | lr 1.15e-03 | (3809.69 ms | 137620 tok/s) step 3215/76294 | train loss 3.756295 | norm 0.2253 | lr 1.15e-03 | (3822.16 ms | 137171 tok/s) step 3216/76294 | train loss 3.734181 | norm 0.2606 | lr 1.15e-03 | (3825.31 ms | 137058 tok/s) step 3217/76294 | train loss 3.645391 | norm 0.2888 | lr 1.15e-03 | (3922.15 ms | 133674 tok/s) step 3218/76294 | train loss 3.662471 | norm 0.2154 | lr 1.15e-03 | (3806.21 ms | 137745 tok/s) step 3219/76294 | train loss 3.698005 | norm 0.2601 | lr 1.15e-03 | (3802.16 ms | 137892 tok/s) step 3220/76294 | train loss 3.703387 | norm 0.2799 | lr 1.15e-03 | (3824.99 ms | 137069 tok/s) step 3221/76294 | train loss 3.647179 | norm 0.2158 | lr 1.15e-03 | (3812.26 ms | 137527 tok/s) step 3222/76294 | train loss 3.712939 | norm 0.2486 | lr 1.15e-03 | (3805.25 ms | 137780 tok/s) step 3223/76294 | train loss 3.641174 | norm 0.2228 | lr 1.15e-03 | (3833.45 ms | 136767 tok/s) step 3224/76294 | train loss 3.640657 | norm 0.2109 | lr 1.15e-03 | (3801.51 ms | 137916 tok/s) step 3225/76294 | train loss 3.695841 | norm 0.2118 | lr 1.15e-03 | (3810.83 ms | 
137578 tok/s) step 3226/76294 | train loss 3.690971 | norm 0.2190 | lr 1.15e-03 | (3854.82 ms | 136008 tok/s) step 3227/76294 | train loss 3.714758 | norm 0.2287 | lr 1.15e-03 | (3808.72 ms | 137655 tok/s) step 3228/76294 | train loss 3.663980 | norm 0.2070 | lr 1.15e-03 | (3822.50 ms | 137158 tok/s) step 3229/76294 | train loss 3.748891 | norm 0.2371 | lr 1.15e-03 | (3810.32 ms | 137597 tok/s) step 3230/76294 | train loss 3.665936 | norm 0.2619 | lr 1.15e-03 | (3801.48 ms | 137917 tok/s) step 3231/76294 | train loss 3.723544 | norm 0.2320 | lr 1.15e-03 | (3838.97 ms | 136570 tok/s) step 3232/76294 | train loss 3.706256 | norm 0.2570 | lr 1.15e-03 | (3805.27 ms | 137780 tok/s) step 3233/76294 | train loss 3.658097 | norm 0.2534 | lr 1.15e-03 | (3810.24 ms | 137600 tok/s) step 3234/76294 | train loss 3.645110 | norm 0.2354 | lr 1.15e-03 | (3825.12 ms | 137064 tok/s) step 3235/76294 | train loss 3.671993 | norm 0.2180 | lr 1.15e-03 | (3809.03 ms | 137644 tok/s) step 3236/76294 | train loss 3.699599 | norm 0.2583 | lr 1.15e-03 | (4334.78 ms | 120949 tok/s) step 3237/76294 | train loss 3.721260 | norm 0.3232 | lr 1.15e-03 | (3876.72 ms | 135240 tok/s) step 3238/76294 | train loss 3.723486 | norm 0.3202 | lr 1.15e-03 | (3815.54 ms | 137409 tok/s) step 3239/76294 | train loss 3.640654 | norm 0.2982 | lr 1.15e-03 | (3808.76 ms | 137653 tok/s) step 3240/76294 | train loss 3.696076 | norm 0.2548 | lr 1.15e-03 | (3805.74 ms | 137763 tok/s) step 3241/76294 | train loss 3.701241 | norm 0.2355 | lr 1.15e-03 | (3830.28 ms | 136880 tok/s) step 3242/76294 | train loss 3.676243 | norm 0.2626 | lr 1.15e-03 | (3808.19 ms | 137674 tok/s) step 3243/76294 | train loss 3.781201 | norm 0.2058 | lr 1.15e-03 | (4257.52 ms | 123144 tok/s) step 3244/76294 | train loss 3.640386 | norm 0.2293 | lr 1.15e-03 | (3832.65 ms | 136795 tok/s) step 3245/76294 | train loss 3.668031 | norm 0.2350 | lr 1.15e-03 | (3806.16 ms | 137747 tok/s) step 3246/76294 | train loss 3.663974 | norm 0.2118 | lr 1.15e-03 
| (3808.20 ms | 137673 tok/s) step 3247/76294 | train loss 3.648703 | norm 0.2086 | lr 1.15e-03 | (3840.65 ms | 136510 tok/s) step 3248/76294 | train loss 3.692658 | norm 0.2207 | lr 1.15e-03 | (3805.95 ms | 137755 tok/s) step 3249/76294 | train loss 3.611897 | norm 0.1927 | lr 1.15e-03 | (3811.40 ms | 137558 tok/s) step 3250/76294 | train loss 3.673814 | norm 0.2311 | lr 1.15e-03 | (4219.19 ms | 124263 tok/s) val loss: 3.672625 saving model checkpoint to ./results/gpt2-124M-gqa/step_3250.pth step 3251/76294 | train loss 3.596535 | norm 0.2077 | lr 1.15e-03 | (3909.52 ms | 134105 tok/s) step 3252/76294 | train loss 3.655994 | norm 0.1948 | lr 1.15e-03 | (3802.73 ms | 137871 tok/s) step 3253/76294 | train loss 3.661616 | norm 0.1759 | lr 1.15e-03 | (3833.34 ms | 136771 tok/s) step 3254/76294 | train loss 3.666044 | norm 0.2233 | lr 1.15e-03 | (3802.19 ms | 137891 tok/s) step 3255/76294 | train loss 3.661933 | norm 0.2717 | lr 1.15e-03 | (3807.00 ms | 137717 tok/s) step 3256/76294 | train loss 3.635191 | norm 0.2451 | lr 1.15e-03 | (3823.74 ms | 137114 tok/s) step 3257/76294 | train loss 3.726933 | norm 0.1948 | lr 1.15e-03 | (3812.39 ms | 137522 tok/s) step 3258/76294 | train loss 3.603078 | norm 0.2720 | lr 1.15e-03 | (3830.72 ms | 136864 tok/s) step 3259/76294 | train loss 3.676882 | norm 0.2802 | lr 1.15e-03 | (3798.95 ms | 138009 tok/s) step 3260/76294 | train loss 3.677357 | norm 0.3022 | lr 1.15e-03 | (3958.05 ms | 132461 tok/s) step 3261/76294 | train loss 3.702056 | norm 0.2926 | lr 1.15e-03 | (3806.17 ms | 137747 tok/s) step 3262/76294 | train loss 3.562474 | norm 0.2318 | lr 1.15e-03 | (3843.00 ms | 136427 tok/s) step 3263/76294 | train loss 3.686308 | norm 0.2783 | lr 1.15e-03 | (3802.95 ms | 137863 tok/s) step 3264/76294 | train loss 3.673132 | norm 0.2340 | lr 1.15e-03 | (3863.96 ms | 135687 tok/s) step 3265/76294 | train loss 3.703356 | norm 0.2372 | lr 1.15e-03 | (3835.58 ms | 136691 tok/s) step 3266/76294 | train loss 3.640207 | norm 0.2517 | lr 
1.15e-03 | (3826.65 ms | 137010 tok/s) step 3267/76294 | train loss 3.620535 | norm 0.2184 | lr 1.15e-03 | (3809.92 ms | 137611 tok/s) step 3268/76294 | train loss 3.671219 | norm 0.2221 | lr 1.15e-03 | (3825.90 ms | 137037 tok/s) step 3269/76294 | train loss 3.635349 | norm 0.1986 | lr 1.15e-03 | (3964.33 ms | 132251 tok/s) step 3270/76294 | train loss 3.667747 | norm 0.2381 | lr 1.15e-03 | (3877.62 ms | 135209 tok/s) step 3271/76294 | train loss 3.626878 | norm 0.2083 | lr 1.15e-03 | (3804.08 ms | 137822 tok/s) step 3272/76294 | train loss 3.650591 | norm 0.2267 | lr 1.15e-03 | (3821.52 ms | 137193 tok/s) step 3273/76294 | train loss 3.625499 | norm 0.2040 | lr 1.15e-03 | (3805.34 ms | 137777 tok/s) step 3274/76294 | train loss 3.736778 | norm 0.5810 | lr 1.15e-03 | (3853.85 ms | 136043 tok/s) step 3275/76294 | train loss 3.630354 | norm 0.5286 | lr 1.15e-03 | (3804.83 ms | 137795 tok/s) step 3276/76294 | train loss 3.715781 | norm 0.5523 | lr 1.15e-03 | (3854.00 ms | 136037 tok/s) step 3277/76294 | train loss 3.565493 | norm 0.3849 | lr 1.15e-03 | (3802.01 ms | 137897 tok/s) step 3278/76294 | train loss 3.650574 | norm 0.2877 | lr 1.15e-03 | (3808.08 ms | 137678 tok/s) step 3279/76294 | train loss 3.608420 | norm 0.2681 | lr 1.15e-03 | (3830.60 ms | 136868 tok/s) step 3280/76294 | train loss 3.751833 | norm 0.2327 | lr 1.15e-03 | (3808.49 ms | 137663 tok/s) step 3281/76294 | train loss 3.683774 | norm 0.2230 | lr 1.15e-03 | (3823.59 ms | 137119 tok/s) step 3282/76294 | train loss 3.729767 | norm 0.2296 | lr 1.15e-03 | (3805.11 ms | 137785 tok/s) step 3283/76294 | train loss 3.733009 | norm 0.2232 | lr 1.15e-03 | (3803.34 ms | 137849 tok/s) step 3284/76294 | train loss 3.709608 | norm 0.2479 | lr 1.15e-03 | (3837.16 ms | 136635 tok/s) step 3285/76294 | train loss 3.685335 | norm 0.2123 | lr 1.15e-03 | (3805.73 ms | 137763 tok/s) step 3286/76294 | train loss 3.580170 | norm 0.1986 | lr 1.15e-03 | (3810.78 ms | 137580 tok/s) step 3287/76294 | train loss 3.674965 | 
norm 0.2381 | lr 1.15e-03 | (3831.60 ms | 136833 tok/s) step 3288/76294 | train loss 3.656117 | norm 0.2107 | lr 1.15e-03 | (3812.42 ms | 137521 tok/s) step 3289/76294 | train loss 3.728817 | norm 0.1994 | lr 1.15e-03 | (3806.26 ms | 137743 tok/s) step 3290/76294 | train loss 3.674211 | norm 0.2128 | lr 1.15e-03 | (3916.03 ms | 133882 tok/s) step 3291/76294 | train loss 3.743076 | norm 0.2128 | lr 1.15e-03 | (3808.65 ms | 137657 tok/s) step 3292/76294 | train loss 3.576879 | norm 0.2296 | lr 1.15e-03 | (3833.22 ms | 136775 tok/s) step 3293/76294 | train loss 3.685452 | norm 0.2085 | lr 1.15e-03 | (3828.21 ms | 136954 tok/s) step 3294/76294 | train loss 3.672543 | norm 0.2217 | lr 1.15e-03 | (3835.24 ms | 136703 tok/s) step 3295/76294 | train loss 3.637916 | norm 0.2037 | lr 1.15e-03 | (3813.91 ms | 137467 tok/s) step 3296/76294 | train loss 3.658109 | norm 0.2301 | lr 1.15e-03 | (3807.68 ms | 137692 tok/s) step 3297/76294 | train loss 3.619359 | norm 0.2539 | lr 1.15e-03 | (3829.33 ms | 136914 tok/s) step 3298/76294 | train loss 3.677597 | norm 0.2578 | lr 1.15e-03 | (3806.42 ms | 137738 tok/s) step 3299/76294 | train loss 3.600454 | norm 0.1983 | lr 1.15e-03 | (3863.42 ms | 135706 tok/s) step 3300/76294 | train loss 3.715368 | norm 0.2811 | lr 1.15e-03 | (3826.11 ms | 137029 tok/s) step 3301/76294 | train loss 3.627283 | norm 0.2389 | lr 1.15e-03 | (3814.42 ms | 137449 tok/s) step 3302/76294 | train loss 3.645936 | norm 0.2182 | lr 1.15e-03 | (3807.13 ms | 137712 tok/s) step 3303/76294 | train loss 3.649838 | norm 0.2379 | lr 1.15e-03 | (3909.59 ms | 134103 tok/s) step 3304/76294 | train loss 3.653098 | norm 0.1963 | lr 1.15e-03 | (3807.57 ms | 137696 tok/s) step 3305/76294 | train loss 3.740316 | norm 0.2560 | lr 1.15e-03 | (3840.10 ms | 136530 tok/s) step 3306/76294 | train loss 3.605143 | norm 0.3100 | lr 1.15e-03 | (3806.42 ms | 137738 tok/s) step 3307/76294 | train loss 3.695440 | norm 0.2634 | lr 1.15e-03 | (3829.31 ms | 136915 tok/s) step 3308/76294 | train 
loss 3.641081 | norm 0.2425 | lr 1.15e-03 | (3806.95 ms | 137719 tok/s) step 3309/76294 | train loss 3.703305 | norm 0.2597 | lr 1.15e-03 | (3852.96 ms | 136074 tok/s) step 3310/76294 | train loss 3.648742 | norm 0.2320 | lr 1.15e-03 | (4029.71 ms | 130106 tok/s) step 3311/76294 | train loss 3.625206 | norm 0.2042 | lr 1.15e-03 | (3804.28 ms | 137815 tok/s) step 3312/76294 | train loss 3.724596 | norm 0.2418 | lr 1.15e-03 | (6175.46 ms | 84899 tok/s) step 3313/76294 | train loss 3.654071 | norm 0.2319 | lr 1.15e-03 | (3911.84 ms | 134026 tok/s) step 3314/76294 | train loss 3.743643 | norm 0.2545 | lr 1.15e-03 | (3918.66 ms | 133793 tok/s) step 3315/76294 | train loss 3.581919 | norm 0.2354 | lr 1.15e-03 | (3802.67 ms | 137874 tok/s) step 3316/76294 | train loss 3.724852 | norm 0.2511 | lr 1.15e-03 | (4377.07 ms | 119781 tok/s) step 3317/76294 | train loss 3.743744 | norm 0.3913 | lr 1.15e-03 | (3934.76 ms | 133245 tok/s) step 3318/76294 | train loss 3.664684 | norm 0.3318 | lr 1.15e-03 | (3794.19 ms | 138182 tok/s) step 3319/76294 | train loss 3.644956 | norm 0.3083 | lr 1.15e-03 | (3859.48 ms | 135844 tok/s) step 3320/76294 | train loss 3.631026 | norm 0.2863 | lr 1.15e-03 | (3746.53 ms | 139939 tok/s) step 3321/76294 | train loss 3.621180 | norm 0.2669 | lr 1.15e-03 | (3809.12 ms | 137640 tok/s) step 3322/76294 | train loss 3.692639 | norm 0.2788 | lr 1.15e-03 | (3756.62 ms | 139564 tok/s) step 3323/76294 | train loss 3.635885 | norm 0.2640 | lr 1.15e-03 | (3786.88 ms | 138448 tok/s) step 3324/76294 | train loss 3.660850 | norm 0.2792 | lr 1.15e-03 | (3764.90 ms | 139257 tok/s) step 3325/76294 | train loss 3.617334 | norm 0.2945 | lr 1.15e-03 | (3843.67 ms | 136403 tok/s) step 3326/76294 | train loss 3.710743 | norm 0.2100 | lr 1.15e-03 | (3770.53 ms | 139049 tok/s) step 3327/76294 | train loss 3.687731 | norm 0.2405 | lr 1.15e-03 | (3781.54 ms | 138644 tok/s) step 3328/76294 | train loss 3.658661 | norm 0.2521 | lr 1.15e-03 | (3795.98 ms | 138117 tok/s) step 
3329/76294 | train loss 3.680262 | norm 0.2599 | lr 1.15e-03 | (3779.68 ms | 138712 tok/s) step 3330/76294 | train loss 3.617743 | norm 0.2479 | lr 1.15e-03 | (3778.46 ms | 138757 tok/s) step 3331/76294 | train loss 3.689188 | norm 0.3012 | lr 1.15e-03 | (3813.57 ms | 137479 tok/s) step 3332/76294 | train loss 3.610305 | norm 0.3142 | lr 1.15e-03 | (3785.83 ms | 138487 tok/s) step 3333/76294 | train loss 3.716978 | norm 0.2418 | lr 1.15e-03 | (3812.25 ms | 137527 tok/s) step 3334/76294 | train loss 3.788323 | norm 0.2907 | lr 1.15e-03 | (3859.83 ms | 135832 tok/s) step 3335/76294 | train loss 3.956518 | norm 0.3845 | lr 1.15e-03 | (3790.57 ms | 138314 tok/s) step 3336/76294 | train loss 3.615527 | norm 0.2733 | lr 1.15e-03 | (3787.21 ms | 138437 tok/s) step 3337/76294 | train loss 3.704263 | norm 0.2470 | lr 1.15e-03 | (3825.82 ms | 137040 tok/s) step 3338/76294 | train loss 3.675769 | norm 0.2381 | lr 1.15e-03 | (3790.93 ms | 138301 tok/s) step 3339/76294 | train loss 3.635633 | norm 0.2203 | lr 1.15e-03 | (3798.36 ms | 138030 tok/s) step 3340/76294 | train loss 3.634115 | norm 0.2224 | lr 1.15e-03 | (3819.26 ms | 137275 tok/s) step 3341/76294 | train loss 3.677530 | norm 0.2024 | lr 1.15e-03 | (3798.92 ms | 138010 tok/s) step 3342/76294 | train loss 3.694917 | norm 0.2054 | lr 1.15e-03 | (3794.24 ms | 138180 tok/s) step 3343/76294 | train loss 3.611006 | norm 0.2097 | lr 1.15e-03 | (3849.66 ms | 136191 tok/s) step 3344/76294 | train loss 3.701152 | norm 0.2164 | lr 1.15e-03 | (3795.21 ms | 138144 tok/s) step 3345/76294 | train loss 3.605295 | norm 0.1999 | lr 1.15e-03 | (3951.11 ms | 132694 tok/s) step 3346/76294 | train loss 3.629283 | norm 0.2010 | lr 1.15e-03 | (3799.48 ms | 137989 tok/s) step 3347/76294 | train loss 3.637689 | norm 0.1902 | lr 1.15e-03 | (3832.60 ms | 136797 tok/s) step 3348/76294 | train loss 3.614155 | norm 0.2339 | lr 1.15e-03 | (3795.51 ms | 138134 tok/s) step 3349/76294 | train loss 3.621505 | norm 0.2593 | lr 1.15e-03 | (3829.95 ms | 
136892 tok/s) step 3350/76294 | train loss 3.666830 | norm 0.2383 | lr 1.15e-03 | (3794.25 ms | 138180 tok/s) step 3351/76294 | train loss 3.637568 | norm 0.2039 | lr 1.15e-03 | (3799.08 ms | 138004 tok/s) step 3352/76294 | train loss 3.587102 | norm 0.2141 | lr 1.15e-03 | (3821.93 ms | 137179 tok/s) step 3353/76294 | train loss 3.943161 | norm 0.2568 | lr 1.15e-03 | (3797.92 ms | 138046 tok/s) step 3354/76294 | train loss 3.583479 | norm 0.1896 | lr 1.15e-03 | (3797.02 ms | 138079 tok/s) step 3355/76294 | train loss 3.819514 | norm 0.2977 | lr 1.15e-03 | (3828.09 ms | 136958 tok/s) step 3356/76294 | train loss 3.721448 | norm 0.3748 | lr 1.15e-03 | (3801.39 ms | 137920 tok/s) step 3357/76294 | train loss 3.664730 | norm 0.3377 | lr 1.15e-03 | (3805.85 ms | 137758 tok/s) step 3358/76294 | train loss 3.762342 | norm 0.2724 | lr 1.15e-03 | (3823.81 ms | 137112 tok/s) step 3359/76294 | train loss 3.705156 | norm 0.3295 | lr 1.15e-03 | (3804.54 ms | 137806 tok/s) step 3360/76294 | train loss 3.681217 | norm 0.2273 | lr 1.15e-03 | (3804.02 ms | 137825 tok/s) step 3361/76294 | train loss 3.704959 | norm 0.2679 | lr 1.15e-03 | (3808.54 ms | 137661 tok/s) step 3362/76294 | train loss 3.674371 | norm 0.2837 | lr 1.15e-03 | (3822.95 ms | 137142 tok/s) step 3363/76294 | train loss 3.645750 | norm 0.2259 | lr 1.15e-03 | (3844.58 ms | 136371 tok/s) step 3364/76294 | train loss 3.693563 | norm 0.2212 | lr 1.15e-03 | (3809.56 ms | 137624 tok/s) step 3365/76294 | train loss 3.652174 | norm 0.2194 | lr 1.15e-03 | (3843.46 ms | 136411 tok/s) step 3366/76294 | train loss 3.686123 | norm 0.2248 | lr 1.15e-03 | (3874.21 ms | 135328 tok/s) step 3367/76294 | train loss 3.645604 | norm 0.2065 | lr 1.15e-03 | (3803.49 ms | 137844 tok/s) step 3368/76294 | train loss 3.685698 | norm 0.1967 | lr 1.15e-03 | (3829.74 ms | 136899 tok/s) step 3369/76294 | train loss 3.635360 | norm 0.2063 | lr 1.15e-03 | (3804.40 ms | 137811 tok/s) step 3370/76294 | train loss 3.610070 | norm 0.1825 | lr 1.15e-03 
| (3807.71 ms | 137691 tok/s) step 3371/76294 | train loss 3.745395 | norm 0.2040 | lr 1.15e-03 | (3829.55 ms | 136906 tok/s) step 3372/76294 | train loss 3.639902 | norm 0.1955 | lr 1.15e-03 | (3803.24 ms | 137853 tok/s) step 3373/76294 | train loss 3.674678 | norm 0.2074 | lr 1.15e-03 | (3833.35 ms | 136770 tok/s) step 3374/76294 | train loss 3.640395 | norm 0.2204 | lr 1.15e-03 | (3804.98 ms | 137790 tok/s) step 3375/76294 | train loss 3.698179 | norm 0.1875 | lr 1.15e-03 | (3805.52 ms | 137770 tok/s) step 3376/76294 | train loss 3.712526 | norm 0.1983 | lr 1.15e-03 | (3840.95 ms | 136500 tok/s) step 3377/76294 | train loss 3.581674 | norm 0.1896 | lr 1.15e-03 | (3804.75 ms | 137798 tok/s) step 3378/76294 | train loss 3.665225 | norm 0.1859 | lr 1.15e-03 | (3810.43 ms | 137593 tok/s) step 3379/76294 | train loss 3.624101 | norm 0.2140 | lr 1.15e-03 | (3829.38 ms | 136912 tok/s) step 3380/76294 | train loss 3.679231 | norm 0.2085 | lr 1.15e-03 | (3810.18 ms | 137602 tok/s) step 3381/76294 | train loss 3.611217 | norm 0.2151 | lr 1.15e-03 | (3805.19 ms | 137782 tok/s) step 3382/76294 | train loss 3.674733 | norm 0.1974 | lr 1.15e-03 | (3840.70 ms | 136509 tok/s) step 3383/76294 | train loss 3.611947 | norm 0.2237 | lr 1.15e-03 | (3806.32 ms | 137741 tok/s) step 3384/76294 | train loss 3.633105 | norm 0.2146 | lr 1.15e-03 | (3805.15 ms | 137784 tok/s) step 3385/76294 | train loss 3.711430 | norm 0.2086 | lr 1.15e-03 | (3826.29 ms | 137023 tok/s) step 3386/76294 | train loss 3.623492 | norm 0.1993 | lr 1.15e-03 | (3868.33 ms | 135533 tok/s) step 3387/76294 | train loss 3.703368 | norm 0.2335 | lr 1.15e-03 | (3802.65 ms | 137874 tok/s) step 3388/76294 | train loss 3.616721 | norm 0.2047 | lr 1.15e-03 | (3834.34 ms | 136735 tok/s) step 3389/76294 | train loss 3.680976 | norm 0.2152 | lr 1.15e-03 | (3896.23 ms | 134563 tok/s) step 3390/76294 | train loss 3.687401 | norm 0.1998 | lr 1.15e-03 | (3805.70 ms | 137764 tok/s) step 3391/76294 | train loss 3.621080 | norm 
0.1768 | lr 1.15e-03 | (3808.25 ms | 137672 tok/s) step 3392/76294 | train loss 3.723626 | norm 0.2078 | lr 1.15e-03 | (3833.99 ms | 136747 tok/s) step 3393/76294 | train loss 3.541026 | norm 0.2358 | lr 1.15e-03 | (3806.83 ms | 137723 tok/s) step 3394/76294 | train loss 3.710703 | norm 0.2544 | lr 1.15e-03 | (3933.02 ms | 133304 tok/s) step 3395/76294 | train loss 3.633258 | norm 0.2425 | lr 1.15e-03 | (3826.77 ms | 137005 tok/s) step 3396/76294 | train loss 3.638286 | norm 0.2107 | lr 1.15e-03 | (3808.77 ms | 137653 tok/s) step 3397/76294 | train loss 3.695706 | norm 0.2343 | lr 1.15e-03 | (3812.30 ms | 137526 tok/s) step 3398/76294 | train loss 3.671503 | norm 0.2339 | lr 1.15e-03 | (3810.95 ms | 137574 tok/s) step 3399/76294 | train loss 3.723023 | norm 0.2178 | lr 1.15e-03 | (3816.16 ms | 137386 tok/s) step 3400/76294 | train loss 3.558642 | norm 0.2220 | lr 1.15e-03 | (3810.91 ms | 137575 tok/s) step 3401/76294 | train loss 3.692435 | norm 0.2486 | lr 1.15e-03 | (3804.88 ms | 137793 tok/s) step 3402/76294 | train loss 3.644638 | norm 0.2330 | lr 1.15e-03 | (3849.49 ms | 136197 tok/s) step 3403/76294 | train loss 3.692038 | norm 0.1908 | lr 1.15e-03 | (3806.51 ms | 137735 tok/s) step 3404/76294 | train loss 3.618841 | norm 0.2616 | lr 1.15e-03 | (3849.21 ms | 136207 tok/s) step 3405/76294 | train loss 3.680876 | norm 0.2302 | lr 1.15e-03 | (3809.16 ms | 137639 tok/s) step 3406/76294 | train loss 3.619503 | norm 0.2968 | lr 1.15e-03 | (3833.36 ms | 136770 tok/s) step 3407/76294 | train loss 3.666456 | norm 0.3020 | lr 1.15e-03 | (3891.49 ms | 134727 tok/s) step 3408/76294 | train loss 3.717199 | norm 0.3057 | lr 1.15e-03 | (3804.83 ms | 137795 tok/s) step 3409/76294 | train loss 3.635083 | norm 0.2969 | lr 1.15e-03 | (3810.20 ms | 137601 tok/s) step 3410/76294 | train loss 3.702075 | norm 0.2325 | lr 1.15e-03 | (3829.30 ms | 136915 tok/s) step 3411/76294 | train loss 3.649261 | norm 0.2614 | lr 1.15e-03 | (3840.81 ms | 136504 tok/s) step 3412/76294 | train loss 
3.780157 | norm 0.2541 | lr 1.15e-03 | (3804.82 ms | 137796 tok/s) step 3413/76294 | train loss 3.630721 | norm 0.2555 | lr 1.15e-03 | (3832.03 ms | 136817 tok/s) step 3414/76294 | train loss 3.643709 | norm 0.2240 | lr 1.15e-03 | (3806.04 ms | 137751 tok/s) step 3415/76294 | train loss 3.773939 | norm 0.2417 | lr 1.15e-03 | (3835.86 ms | 136681 tok/s) step 3416/76294 | train loss 3.584590 | norm 0.2304 | lr 1.15e-03 | (3826.59 ms | 137012 tok/s) step 3417/76294 | train loss 3.659751 | norm 0.2327 | lr 1.15e-03 | (3832.25 ms | 136809 tok/s) step 3418/76294 | train loss 3.628134 | norm 0.2236 | lr 1.15e-03 | (3830.12 ms | 136885 tok/s) step 3419/76294 | train loss 3.701409 | norm 0.1928 | lr 1.15e-03 | (3811.13 ms | 137568 tok/s) step 3420/76294 | train loss 3.660590 | norm 0.2261 | lr 1.15e-03 | (3808.58 ms | 137660 tok/s) step 3421/76294 | train loss 3.620912 | norm 0.2234 | lr 1.15e-03 | (3867.52 ms | 135562 tok/s) step 3422/76294 | train loss 3.705855 | norm 0.2560 | lr 1.15e-03 | (3803.05 ms | 137860 tok/s) step 3423/76294 | train loss 3.639285 | norm 0.2482 | lr 1.15e-03 | (3811.55 ms | 137553 tok/s) step 3424/76294 | train loss 3.642890 | norm 0.2142 | lr 1.15e-03 | (3875.77 ms | 135273 tok/s) step 3425/76294 | train loss 3.602097 | norm 0.2333 | lr 1.15e-03 | (3811.13 ms | 137568 tok/s) step 3426/76294 | train loss 3.673393 | norm 0.2193 | lr 1.15e-03 | (3806.57 ms | 137732 tok/s) step 3427/76294 | train loss 3.679610 | norm 0.2368 | lr 1.15e-03 | (4054.07 ms | 129324 tok/s) step 3428/76294 | train loss 3.698499 | norm 0.2204 | lr 1.15e-03 | (3888.55 ms | 134829 tok/s) step 3429/76294 | train loss 3.656272 | norm 0.2375 | lr 1.15e-03 | (3809.36 ms | 137631 tok/s) step 3430/76294 | train loss 3.583971 | norm 0.2603 | lr 1.15e-03 | (3839.46 ms | 136553 tok/s) step 3431/76294 | train loss 3.666014 | norm 0.2223 | lr 1.15e-03 | (3808.77 ms | 137653 tok/s) step 3432/76294 | train loss 3.651738 | norm 0.2387 | lr 1.15e-03 | (3832.67 ms | 136794 tok/s) step 
3433/76294 | train loss 3.662853 | norm 0.2216 | lr 1.15e-03 | (3804.85 ms | 137795 tok/s) step 3434/76294 | train loss 3.649514 | norm 0.2144 | lr 1.15e-03 | (4386.01 ms | 119536 tok/s) step 3435/76294 | train loss 3.630127 | norm 0.2217 | lr 1.15e-03 | (3802.06 ms | 137896 tok/s) step 3436/76294 | train loss 3.657117 | norm 0.2231 | lr 1.15e-03 | (3867.13 ms | 135575 tok/s) step 3437/76294 | train loss 3.642009 | norm 0.2041 | lr 1.15e-03 | (3811.31 ms | 137561 tok/s) step 3438/76294 | train loss 3.574715 | norm 0.2071 | lr 1.15e-03 | (3871.26 ms | 135431 tok/s) step 3439/76294 | train loss 3.686738 | norm 0.2445 | lr 1.15e-03 | (3820.68 ms | 137224 tok/s) step 3440/76294 | train loss 3.630665 | norm 0.2408 | lr 1.15e-03 | (3828.71 ms | 136936 tok/s) step 3441/76294 | train loss 3.715423 | norm 0.2596 | lr 1.15e-03 | (3828.92 ms | 136928 tok/s) step 3442/76294 | train loss 3.652353 | norm 0.2259 | lr 1.15e-03 | (3804.54 ms | 137806 tok/s) step 3443/76294 | train loss 3.647159 | norm 0.2311 | lr 1.15e-03 | (3833.91 ms | 136750 tok/s) step 3444/76294 | train loss 3.679809 | norm 0.2287 | lr 1.15e-03 | (3812.15 ms | 137531 tok/s) step 3445/76294 | train loss 3.707042 | norm 0.2219 | lr 1.15e-03 | (3825.94 ms | 137035 tok/s) step 3446/76294 | train loss 3.650244 | norm 0.1882 | lr 1.15e-03 | (3808.07 ms | 137678 tok/s) step 3447/76294 | train loss 3.692677 | norm 0.1997 | lr 1.14e-03 | (3853.86 ms | 136042 tok/s) step 3448/76294 | train loss 3.690752 | norm 0.2038 | lr 1.14e-03 | (3914.86 ms | 133922 tok/s) step 3449/76294 | train loss 3.568630 | norm 0.2024 | lr 1.14e-03 | (3805.91 ms | 137756 tok/s) step 3450/76294 | train loss 3.657190 | norm 0.2022 | lr 1.14e-03 | (4992.85 ms | 105008 tok/s) step 3451/76294 | train loss 3.705944 | norm 0.2144 | lr 1.14e-03 | (3817.64 ms | 137333 tok/s) step 3452/76294 | train loss 3.680994 | norm 0.2066 | lr 1.14e-03 | (3827.10 ms | 136993 tok/s) step 3453/76294 | train loss 3.668072 | norm 0.2036 | lr 1.14e-03 | (3810.46 ms | 
137592 tok/s) step 3454/76294 | train loss 3.631099 | norm 0.1974 | lr 1.14e-03 | (3835.01 ms | 136711 tok/s) step 3455/76294 | train loss 3.701925 | norm 0.2182 | lr 1.14e-03 | (3800.92 ms | 137937 tok/s) step 3456/76294 | train loss 3.643999 | norm 0.2020 | lr 1.14e-03 | (3871.67 ms | 135416 tok/s) step 3457/76294 | train loss 3.627127 | norm 0.2402 | lr 1.14e-03 | (3806.34 ms | 137741 tok/s) step 3458/76294 | train loss 3.660971 | norm 0.2376 | lr 1.14e-03 | (3836.39 ms | 136662 tok/s) step 3459/76294 | train loss 3.697637 | norm 0.2504 | lr 1.14e-03 | (3831.56 ms | 136834 tok/s) step 3460/76294 | train loss 3.669412 | norm 0.2039 | lr 1.14e-03 | (3826.54 ms | 137014 tok/s) step 3461/76294 | train loss 3.654263 | norm 0.2169 | lr 1.14e-03 | (3804.31 ms | 137814 tok/s) step 3462/76294 | train loss 3.652081 | norm 0.3134 | lr 1.14e-03 | (3814.61 ms | 137442 tok/s) step 3463/76294 | train loss 3.601218 | norm 0.3315 | lr 1.14e-03 | (3809.83 ms | 137614 tok/s) step 3464/76294 | train loss 3.676701 | norm 0.2132 | lr 1.14e-03 | (3826.22 ms | 137025 tok/s) step 3465/76294 | train loss 3.629140 | norm 0.2795 | lr 1.14e-03 | (3836.85 ms | 136645 tok/s) step 3466/76294 | train loss 3.865367 | norm 0.2571 | lr 1.14e-03 | (3831.80 ms | 136826 tok/s) step 3467/76294 | train loss 3.703110 | norm 0.2925 | lr 1.14e-03 | (3828.66 ms | 136938 tok/s) step 3468/76294 | train loss 3.625084 | norm 0.2747 | lr 1.14e-03 | (4117.82 ms | 127322 tok/s) step 3469/76294 | train loss 3.675476 | norm 0.2203 | lr 1.14e-03 | (3800.59 ms | 137949 tok/s) step 3470/76294 | train loss 3.632530 | norm 0.2326 | lr 1.14e-03 | (3834.18 ms | 136740 tok/s) step 3471/76294 | train loss 3.606149 | norm 0.2125 | lr 1.14e-03 | (3846.46 ms | 136304 tok/s) step 3472/76294 | train loss 3.606694 | norm 0.2769 | lr 1.14e-03 | (3974.01 ms | 131929 tok/s) step 3473/76294 | train loss 3.605778 | norm 0.2438 | lr 1.14e-03 | (3801.54 ms | 137914 tok/s) step 3474/76294 | train loss 3.638240 | norm 0.2277 | lr 1.14e-03 
| (3841.05 ms | 136496 tok/s) step 3475/76294 | train loss 3.646597 | norm 0.2835 | lr 1.14e-03 | (3800.95 ms | 137936 tok/s) step 3476/76294 | train loss 3.658037 | norm 0.2833 | lr 1.14e-03 | (3861.54 ms | 135772 tok/s) step 3477/76294 | train loss 3.671106 | norm 0.2291 | lr 1.14e-03 | (3800.97 ms | 137935 tok/s) step 3478/76294 | train loss 3.803830 | norm 0.2209 | lr 1.14e-03 | (3854.71 ms | 136012 tok/s) step 3479/76294 | train loss 3.617490 | norm 0.2040 | lr 1.14e-03 | (3799.90 ms | 137974 tok/s) step 3480/76294 | train loss 3.686838 | norm 0.2154 | lr 1.14e-03 | (3832.34 ms | 136806 tok/s) step 3481/76294 | train loss 3.701050 | norm 0.2101 | lr 1.14e-03 | (3988.33 ms | 131455 tok/s) step 3482/76294 | train loss 3.826617 | norm 0.2710 | lr 1.14e-03 | (3803.92 ms | 137828 tok/s) step 3483/76294 | train loss 3.646367 | norm 0.3002 | lr 1.14e-03 | (3829.17 ms | 136919 tok/s) step 3484/76294 | train loss 3.811558 | norm 0.3100 | lr 1.14e-03 | (3807.78 ms | 137689 tok/s) step 3485/76294 | train loss 3.633955 | norm 0.2477 | lr 1.14e-03 | (3814.94 ms | 137430 tok/s) step 3486/76294 | train loss 3.647463 | norm 0.2409 | lr 1.14e-03 | (3805.82 ms | 137760 tok/s) step 3487/76294 | train loss 3.674514 | norm 0.2263 | lr 1.14e-03 | (3808.69 ms | 137656 tok/s) step 3488/76294 | train loss 3.716057 | norm 0.2366 | lr 1.14e-03 | (3845.58 ms | 136335 tok/s) step 3489/76294 | train loss 3.682025 | norm 0.2099 | lr 1.14e-03 | (3873.96 ms | 135337 tok/s) step 3490/76294 | train loss 3.623685 | norm 0.2514 | lr 1.14e-03 | (3813.11 ms | 137496 tok/s) step 3491/76294 | train loss 3.649241 | norm 0.2160 | lr 1.14e-03 | (3808.33 ms | 137669 tok/s) step 3492/76294 | train loss 3.667916 | norm 0.1966 | lr 1.14e-03 | (3838.33 ms | 136593 tok/s) step 3493/76294 | train loss 3.664053 | norm 0.1950 | lr 1.14e-03 | (3807.92 ms | 137684 tok/s) step 3494/76294 | train loss 3.628390 | norm 0.1981 | lr 1.14e-03 | (3828.42 ms | 136946 tok/s) step 3495/76294 | train loss 3.663167 | norm 
0.1927 | lr 1.14e-03 | (3813.02 ms | 137499 tok/s) step 3496/76294 | train loss 3.633553 | norm 0.1939 | lr 1.14e-03 | (4408.60 ms | 118924 tok/s) step 3497/76294 | train loss 3.702391 | norm 0.2104 | lr 1.14e-03 | (3925.45 ms | 133561 tok/s) step 3498/76294 | train loss 3.612051 | norm 0.2638 | lr 1.14e-03 | (3807.13 ms | 137712 tok/s) step 3499/76294 | train loss 3.603382 | norm 0.2325 | lr 1.14e-03 | (3825.53 ms | 137050 tok/s) step 3500/76294 | train loss 3.626711 | norm 0.1925 | lr 1.14e-03 | (3834.75 ms | 136720 tok/s) val loss: 3.656496 saving model checkpoint to ./results/gpt2-124M-gqa/step_3500.pth step 3501/76294 | train loss 3.662508 | norm 0.2211 | lr 1.14e-03 | (3915.87 ms | 133888 tok/s) step 3502/76294 | train loss 3.614228 | norm 0.2020 | lr 1.14e-03 | (3799.95 ms | 137972 tok/s) step 3503/76294 | train loss 3.629936 | norm 0.1927 | lr 1.14e-03 | (3860.97 ms | 135792 tok/s) step 3504/76294 | train loss 3.670937 | norm 0.2137 | lr 1.14e-03 | (3800.09 ms | 137967 tok/s) step 3505/76294 | train loss 3.635727 | norm 0.1936 | lr 1.14e-03 | (3837.05 ms | 136638 tok/s) step 3506/76294 | train loss 3.674618 | norm 0.2006 | lr 1.14e-03 | (3799.69 ms | 137982 tok/s) step 3507/76294 | train loss 3.671779 | norm 0.2068 | lr 1.14e-03 | (3811.93 ms | 137539 tok/s) step 3508/76294 | train loss 3.631436 | norm 0.1982 | lr 1.14e-03 | (4290.55 ms | 122196 tok/s) step 3509/76294 | train loss 3.670855 | norm 0.2026 | lr 1.14e-03 | (3796.99 ms | 138080 tok/s) step 3510/76294 | train loss 3.638253 | norm 0.2096 | lr 1.14e-03 | (3858.67 ms | 135873 tok/s) step 3511/76294 | train loss 3.682693 | norm 0.2737 | lr 1.14e-03 | (3798.51 ms | 138025 tok/s) step 3512/76294 | train loss 3.678067 | norm 0.2467 | lr 1.14e-03 | (3877.17 ms | 135224 tok/s) step 3513/76294 | train loss 3.711482 | norm 0.2590 | lr 1.14e-03 | (3813.55 ms | 137480 tok/s) step 3514/76294 | train loss 3.650863 | norm 0.3007 | lr 1.14e-03 | (3864.34 ms | 135674 tok/s) step 3515/76294 | train loss 3.651896 | 
norm 0.2808 | lr 1.14e-03 | (3804.36 ms | 137812 tok/s) step 3516/76294 | train loss 3.611156 | norm 0.2277 | lr 1.14e-03 | (3872.53 ms | 135386 tok/s) step 3517/76294 | train loss 3.615920 | norm 0.2285 | lr 1.14e-03 | (3804.28 ms | 137815 tok/s) step 3518/76294 | train loss 3.771612 | norm 0.2470 | lr 1.14e-03 | (3837.06 ms | 136638 tok/s) step 3519/76294 | train loss 3.621310 | norm 0.2367 | lr 1.14e-03 | (3806.34 ms | 137741 tok/s) step 3520/76294 | train loss 3.706903 | norm 0.2055 | lr 1.14e-03 | (3846.14 ms | 136315 tok/s) step 3521/76294 | train loss 3.602619 | norm 0.2446 | lr 1.14e-03 | (3801.35 ms | 137922 tok/s) step 3522/76294 | train loss 3.648293 | norm 0.1801 | lr 1.14e-03 | (3877.97 ms | 135196 tok/s) step 3523/76294 | train loss 3.701174 | norm 0.2283 | lr 1.14e-03 | (3807.34 ms | 137705 tok/s) step 3524/76294 | train loss 3.716223 | norm 0.1908 | lr 1.14e-03 | (3848.58 ms | 136229 tok/s) step 3525/76294 | train loss 3.627778 | norm 0.1989 | lr 1.14e-03 | (3801.95 ms | 137900 tok/s) step 3526/76294 | train loss 3.656795 | norm 0.1866 | lr 1.14e-03 | (3804.71 ms | 137800 tok/s) step 3527/76294 | train loss 3.704181 | norm 0.2148 | lr 1.14e-03 | (8854.45 ms | 59212 tok/s) step 3528/76294 | train loss 3.616297 | norm 0.2440 | lr 1.14e-03 | (3982.70 ms | 131641 tok/s) step 3529/76294 | train loss 3.694317 | norm 0.2586 | lr 1.14e-03 | (4257.62 ms | 123141 tok/s) step 3530/76294 | train loss 3.728075 | norm 0.2661 | lr 1.14e-03 | (3801.26 ms | 137925 tok/s) step 3531/76294 | train loss 3.575213 | norm 0.2496 | lr 1.14e-03 | (3856.34 ms | 135955 tok/s) step 3532/76294 | train loss 3.814944 | norm 0.2657 | lr 1.14e-03 | (3733.16 ms | 140441 tok/s) step 3533/76294 | train loss 3.710734 | norm 0.2487 | lr 1.14e-03 | (3915.52 ms | 133900 tok/s) step 3534/76294 | train loss 3.593172 | norm 0.2304 | lr 1.14e-03 | (3740.87 ms | 140151 tok/s) step 3535/76294 | train loss 3.665168 | norm 0.2472 | lr 1.14e-03 | (3805.51 ms | 137771 tok/s) step 3536/76294 | train 
loss 3.665932 | norm 0.2049 | lr 1.14e-03 | (4313.11 ms | 121557 tok/s) step 3537/76294 | train loss 3.700383 | norm 0.1955 | lr 1.14e-03 | (3822.05 ms | 137174 tok/s) step 3538/76294 | train loss 3.646818 | norm 0.2251 | lr 1.14e-03 | (3783.31 ms | 138579 tok/s) step 3539/76294 | train loss 3.686061 | norm 0.2050 | lr 1.14e-03 | (3784.86 ms | 138522 tok/s) step 3540/76294 | train loss 3.616730 | norm 0.2110 | lr 1.14e-03 | (3892.30 ms | 134699 tok/s) step 3541/76294 | train loss 3.636286 | norm 0.1794 | lr 1.14e-03 | (3787.27 ms | 138434 tok/s) step 3542/76294 | train loss 3.665119 | norm 0.2015 | lr 1.14e-03 | (3778.90 ms | 138741 tok/s) step 3543/76294 | train loss 3.616186 | norm 0.1833 | lr 1.14e-03 | (3891.19 ms | 134737 tok/s) step 3544/76294 | train loss 3.632281 | norm 0.2080 | lr 1.14e-03 | (3774.09 ms | 138918 tok/s) step 3545/76294 | train loss 3.619451 | norm 0.1852 | lr 1.14e-03 | (3785.22 ms | 138509 tok/s) step 3546/76294 | train loss 3.642061 | norm 0.1878 | lr 1.14e-03 | (3803.60 ms | 137840 tok/s) step 3547/76294 | train loss 3.669531 | norm 0.2200 | lr 1.14e-03 | (3814.57 ms | 137443 tok/s) step 3548/76294 | train loss 3.654258 | norm 0.1982 | lr 1.14e-03 | (3789.40 ms | 138356 tok/s) step 3549/76294 | train loss 3.662784 | norm 0.2082 | lr 1.14e-03 | (3830.18 ms | 136883 tok/s) step 3550/76294 | train loss 3.647466 | norm 0.2114 | lr 1.14e-03 | (3824.77 ms | 137077 tok/s) step 3551/76294 | train loss 3.676850 | norm 0.2024 | lr 1.14e-03 | (4276.25 ms | 122605 tok/s) step 3552/76294 | train loss 3.721014 | norm 0.2135 | lr 1.14e-03 | (3796.59 ms | 138094 tok/s) step 3553/76294 | train loss 3.657989 | norm 0.2162 | lr 1.14e-03 | (3805.36 ms | 137776 tok/s) step 3554/76294 | train loss 3.650970 | norm 0.2352 | lr 1.14e-03 | (3819.69 ms | 137259 tok/s) step 3555/76294 | train loss 3.620367 | norm 0.2267 | lr 1.14e-03 | (3832.45 ms | 136802 tok/s) step 3556/76294 | train loss 3.619110 | norm 0.2872 | lr 1.14e-03 | (3886.67 ms | 134894 tok/s) step 
3557/76294 | train loss 3.673754 | norm 0.3003 | lr 1.14e-03 | (3833.92 ms | 136750 tok/s) step 3558/76294 | train loss 3.618393 | norm 0.3150 | lr 1.14e-03 | (3803.40 ms | 137847 tok/s) step 3559/76294 | train loss 3.645381 | norm 0.2749 | lr 1.14e-03 | (3862.79 ms | 135728 tok/s) step 3560/76294 | train loss 3.618458 | norm 0.1898 | lr 1.14e-03 | (3805.93 ms | 137756 tok/s) step 3561/76294 | train loss 3.664160 | norm 0.2454 | lr 1.14e-03 | (3811.74 ms | 137546 tok/s) step 3562/76294 | train loss 3.620362 | norm 0.1923 | lr 1.14e-03 | (3828.07 ms | 136959 tok/s) step 3563/76294 | train loss 3.648581 | norm 0.1856 | lr 1.14e-03 | (3811.94 ms | 137538 tok/s) step 3564/76294 | train loss 3.691180 | norm 0.2105 | lr 1.14e-03 | (3837.43 ms | 136625 tok/s) step 3565/76294 | train loss 3.776512 | norm 0.2037 | lr 1.14e-03 | (3811.02 ms | 137572 tok/s) step 3566/76294 | train loss 3.668900 | norm 0.1975 | lr 1.14e-03 | (3840.31 ms | 136522 tok/s) step 3567/76294 | train loss 3.684525 | norm 0.2022 | lr 1.14e-03 | (3813.75 ms | 137473 tok/s) step 3568/76294 | train loss 3.625501 | norm 0.2106 | lr 1.14e-03 | (3810.15 ms | 137603 tok/s) step 3569/76294 | train loss 3.651778 | norm 0.2059 | lr 1.14e-03 | (3884.01 ms | 134986 tok/s) step 3570/76294 | train loss 3.618299 | norm 0.1909 | lr 1.14e-03 | (3814.15 ms | 137459 tok/s) step 3571/76294 | train loss 3.645108 | norm 0.2055 | lr 1.14e-03 | (3846.35 ms | 136308 tok/s) step 3572/76294 | train loss 3.737354 | norm 0.1964 | lr 1.14e-03 | (3814.92 ms | 137431 tok/s) step 3573/76294 | train loss 3.689329 | norm 0.2316 | lr 1.14e-03 | (3868.21 ms | 135538 tok/s) step 3574/76294 | train loss 3.597885 | norm 0.2513 | lr 1.14e-03 | (4034.01 ms | 129967 tok/s) step 3575/76294 | train loss 3.631027 | norm 0.1982 | lr 1.14e-03 | (3817.33 ms | 137344 tok/s) step 3576/76294 | train loss 3.598732 | norm 0.2007 | lr 1.14e-03 | (3971.19 ms | 132023 tok/s) step 3577/76294 | train loss 3.610351 | norm 0.2331 | lr 1.14e-03 | (3813.23 ms | 
137492 tok/s) step 3578/76294 | train loss 3.648253 | norm 0.2017 | lr 1.14e-03 | (4026.35 ms | 130214 tok/s) step 3579/76294 | train loss 3.594650 | norm 0.2292 | lr 1.14e-03 | (3812.85 ms | 137506 tok/s) step 3580/76294 | train loss 3.647102 | norm 0.2104 | lr 1.14e-03 | (3821.53 ms | 137193 tok/s) step 3581/76294 | train loss 3.614765 | norm 0.2681 | lr 1.14e-03 | (3812.60 ms | 137514 tok/s) step 3582/76294 | train loss 3.616909 | norm 0.2309 | lr 1.14e-03 | (3819.88 ms | 137253 tok/s) step 3583/76294 | train loss 3.615904 | norm 0.2241 | lr 1.14e-03 | (3860.44 ms | 135810 tok/s) step 3584/76294 | train loss 3.712479 | norm 0.3590 | lr 1.14e-03 | (3813.47 ms | 137483 tok/s) step 3585/76294 | train loss 3.618003 | norm 0.3857 | lr 1.14e-03 | (3844.80 ms | 136363 tok/s) step 3586/76294 | train loss 3.641573 | norm 0.2844 | lr 1.14e-03 | (3814.19 ms | 137457 tok/s) step 3587/76294 | train loss 3.670815 | norm 0.2731 | lr 1.14e-03 | (3834.07 ms | 136744 tok/s) step 3588/76294 | train loss 3.643834 | norm 0.2464 | lr 1.14e-03 | (3852.46 ms | 136092 tok/s) step 3589/76294 | train loss 3.714809 | norm 0.2760 | lr 1.14e-03 | (3839.23 ms | 136561 tok/s) step 3590/76294 | train loss 3.673838 | norm 0.1960 | lr 1.14e-03 | (3844.43 ms | 136376 tok/s) step 3591/76294 | train loss 3.620146 | norm 0.2506 | lr 1.14e-03 | (3812.17 ms | 137530 tok/s) step 3592/76294 | train loss 3.615840 | norm 0.1986 | lr 1.14e-03 | (3808.77 ms | 137653 tok/s) step 3593/76294 | train loss 3.630459 | norm 0.2284 | lr 1.14e-03 | (3850.34 ms | 136167 tok/s) step 3594/76294 | train loss 3.646116 | norm 0.1952 | lr 1.14e-03 | (3958.83 ms | 132435 tok/s) step 3595/76294 | train loss 3.597900 | norm 0.2216 | lr 1.14e-03 | (3816.20 ms | 137385 tok/s) step 3596/76294 | train loss 3.694828 | norm 0.1899 | lr 1.14e-03 | (3903.18 ms | 134323 tok/s) step 3597/76294 | train loss 3.609232 | norm 0.2264 | lr 1.14e-03 | (3854.11 ms | 136034 tok/s) step 3598/76294 | train loss 3.643642 | norm 0.2173 | lr 1.14e-03 
| (3813.90 ms | 137468 tok/s) step 3599/76294 | train loss 3.640759 | norm 0.2779 | lr 1.14e-03 | (3820.27 ms | 137239 tok/s) step 3600/76294 | train loss 3.671744 | norm 0.2844 | lr 1.14e-03 | (3814.31 ms | 137453 tok/s) step 3601/76294 | train loss 3.645205 | norm 0.2730 | lr 1.14e-03 | (3806.33 ms | 137741 tok/s) step 3602/76294 | train loss 3.644284 | norm 0.2315 | lr 1.14e-03 | (3833.41 ms | 136768 tok/s) step 3603/76294 | train loss 3.698543 | norm 0.2232 | lr 1.14e-03 | (3809.76 ms | 137617 tok/s) step 3604/76294 | train loss 3.624776 | norm 0.2318 | lr 1.14e-03 | (3840.01 ms | 136533 tok/s) step 3605/76294 | train loss 3.677485 | norm 0.2381 | lr 1.14e-03 | (3806.85 ms | 137722 tok/s) step 3606/76294 | train loss 3.608957 | norm 0.2156 | lr 1.14e-03 | (3810.12 ms | 137604 tok/s) step 3607/76294 | train loss 3.638061 | norm 0.2650 | lr 1.14e-03 | (3836.43 ms | 136661 tok/s) step 3608/76294 | train loss 3.621268 | norm 0.2372 | lr 1.14e-03 | (3829.79 ms | 136897 tok/s) step 3609/76294 | train loss 3.597480 | norm 0.2109 | lr 1.14e-03 | (3830.94 ms | 136856 tok/s) step 3610/76294 | train loss 3.658731 | norm 0.2150 | lr 1.14e-03 | (3807.83 ms | 137687 tok/s) step 3611/76294 | train loss 3.581762 | norm 0.1882 | lr 1.14e-03 | (3805.28 ms | 137779 tok/s) step 3612/76294 | train loss 3.601395 | norm 0.2207 | lr 1.14e-03 | (3864.76 ms | 135658 tok/s) step 3613/76294 | train loss 3.634476 | norm 0.2757 | lr 1.14e-03 | (3809.96 ms | 137610 tok/s) step 3614/76294 | train loss 3.643441 | norm 0.2918 | lr 1.14e-03 | (3876.52 ms | 135247 tok/s) step 3615/76294 | train loss 3.655202 | norm 0.2154 | lr 1.14e-03 | (3814.98 ms | 137429 tok/s) step 3616/76294 | train loss 3.610600 | norm 0.2883 | lr 1.14e-03 | (3880.55 ms | 135107 tok/s) step 3617/76294 | train loss 3.611752 | norm 0.2241 | lr 1.14e-03 | (3817.38 ms | 137342 tok/s) step 3618/76294 | train loss 3.666214 | norm 0.2186 | lr 1.14e-03 | (3829.06 ms | 136923 tok/s) step 3619/76294 | train loss 3.678986 | norm 
0.1992 | lr 1.14e-03 | (3808.12 ms | 137676 tok/s) step 3620/76294 | train loss 3.615695 | norm 0.2063 | lr 1.14e-03 | (3954.13 ms | 132592 tok/s) step 3621/76294 | train loss 3.598409 | norm 0.1933 | lr 1.14e-03 | (3842.85 ms | 136432 tok/s) step 3622/76294 | train loss 3.628182 | norm 0.2176 | lr 1.14e-03 | (3802.57 ms | 137877 tok/s) step 3623/76294 | train loss 3.653699 | norm 0.2083 | lr 1.14e-03 | (3807.54 ms | 137697 tok/s) step 3624/76294 | train loss 3.700251 | norm 0.1971 | lr 1.14e-03 | (4150.89 ms | 126307 tok/s) step 3625/76294 | train loss 3.720708 | norm 0.2421 | lr 1.14e-03 | (3838.96 ms | 136570 tok/s) step 3626/76294 | train loss 3.656234 | norm 0.2419 | lr 1.14e-03 | (3810.62 ms | 137586 tok/s) step 3627/76294 | train loss 3.723718 | norm 0.2048 | lr 1.14e-03 | (3811.67 ms | 137548 tok/s) step 3628/76294 | train loss 3.568016 | norm 0.2508 | lr 1.14e-03 | (3807.06 ms | 137715 tok/s) step 3629/76294 | train loss 3.707801 | norm 0.2500 | lr 1.14e-03 | (3848.85 ms | 136219 tok/s) step 3630/76294 | train loss 3.667297 | norm 0.2393 | lr 1.14e-03 | (3805.54 ms | 137770 tok/s) step 3631/76294 | train loss 3.674068 | norm 0.2413 | lr 1.14e-03 | (3809.22 ms | 137637 tok/s) step 3632/76294 | train loss 3.625734 | norm 0.2160 | lr 1.14e-03 | (3826.78 ms | 137005 tok/s) step 3633/76294 | train loss 3.615286 | norm 0.1833 | lr 1.14e-03 | (3850.69 ms | 136154 tok/s) step 3634/76294 | train loss 3.663137 | norm 0.2166 | lr 1.14e-03 | (3812.51 ms | 137518 tok/s) step 3635/76294 | train loss 3.738490 | norm 0.2090 | lr 1.14e-03 | (3962.46 ms | 132314 tok/s) step 3636/76294 | train loss 3.684011 | norm 0.2164 | lr 1.14e-03 | (3890.27 ms | 134769 tok/s) step 3637/76294 | train loss 3.726362 | norm 0.1995 | lr 1.14e-03 | (3804.11 ms | 137821 tok/s) step 3638/76294 | train loss 3.677715 | norm 0.2221 | lr 1.14e-03 | (3833.86 ms | 136752 tok/s) step 3639/76294 | train loss 3.682214 | norm 0.2236 | lr 1.14e-03 | (3806.13 ms | 137748 tok/s) step 3640/76294 | train loss 
3.658936 | norm 0.2455 | lr 1.14e-03 | (3932.21 ms | 133332 tok/s) step 3641/76294 | train loss 3.585257 | norm 0.2431 | lr 1.14e-03 | (3805.75 ms | 137762 tok/s) step 3642/76294 | train loss 3.578795 | norm 0.4367 | lr 1.14e-03 | (3810.89 ms | 137576 tok/s) step 3643/76294 | train loss 3.688281 | norm 0.2922 | lr 1.14e-03 | (3829.42 ms | 136911 tok/s) step 3644/76294 | train loss 3.675759 | norm 0.3085 | lr 1.14e-03 | (3806.68 ms | 137728 tok/s) step 3645/76294 | train loss 3.661043 | norm 0.2738 | lr 1.14e-03 | (3807.28 ms | 137707 tok/s) step 3646/76294 | train loss 3.633308 | norm 0.2739 | lr 1.14e-03 | (3830.67 ms | 136866 tok/s) step 3647/76294 | train loss 3.715868 | norm 0.2460 | lr 1.14e-03 | (3813.33 ms | 137488 tok/s) step 3648/76294 | train loss 3.686257 | norm 0.2343 | lr 1.14e-03 | (3813.17 ms | 137494 tok/s) step 3649/76294 | train loss 3.688358 | norm 0.2278 | lr 1.14e-03 | (3826.59 ms | 137012 tok/s) step 3650/76294 | train loss 3.643174 | norm 0.2100 | lr 1.14e-03 | (3807.40 ms | 137702 tok/s) step 3651/76294 | train loss 3.676494 | norm 0.2119 | lr 1.14e-03 | (3814.06 ms | 137462 tok/s) step 3652/76294 | train loss 3.657541 | norm 0.2193 | lr 1.14e-03 | (3807.45 ms | 137700 tok/s) step 3653/76294 | train loss 3.687557 | norm 0.2616 | lr 1.14e-03 | (3829.07 ms | 136923 tok/s) step 3654/76294 | train loss 3.642296 | norm 0.1880 | lr 1.14e-03 | (3814.59 ms | 137443 tok/s) step 3655/76294 | train loss 3.644081 | norm 0.2313 | lr 1.14e-03 | (3806.32 ms | 137741 tok/s) step 3656/76294 | train loss 3.630716 | norm 0.2245 | lr 1.14e-03 | (3830.44 ms | 136874 tok/s) step 3657/76294 | train loss 3.789466 | norm 0.2245 | lr 1.14e-03 | (3847.40 ms | 136271 tok/s) step 3658/76294 | train loss 3.640704 | norm 0.1732 | lr 1.14e-03 | (3807.49 ms | 137699 tok/s) step 3659/76294 | train loss 3.698864 | norm 0.2076 | lr 1.14e-03 | (3826.21 ms | 137025 tok/s) step 3660/76294 | train loss 3.649737 | norm 0.1930 | lr 1.14e-03 | (3804.79 ms | 137797 tok/s) step 
3661/76294 | train loss 3.702468 | norm 0.2131 | lr 1.14e-03 | (3806.00 ms | 137753 tok/s) step 3662/76294 | train loss 3.603699 | norm 0.2145 | lr 1.14e-03 | (3840.82 ms | 136504 tok/s) step 3663/76294 | train loss 3.708861 | norm 0.2173 | lr 1.14e-03 | (3807.14 ms | 137712 tok/s) step 3664/76294 | train loss 3.588620 | norm 0.2391 | lr 1.14e-03 | (3837.10 ms | 136637 tok/s) step 3665/76294 | train loss 3.668138 | norm 0.2152 | lr 1.14e-03 | (3808.02 ms | 137680 tok/s) step 3666/76294 | train loss 3.632278 | norm 0.2254 | lr 1.14e-03 | (3812.92 ms | 137503 tok/s) step 3667/76294 | train loss 3.648678 | norm 0.2204 | lr 1.14e-03 | (3834.41 ms | 136732 tok/s) step 3668/76294 | train loss 3.628588 | norm 0.2139 | lr 1.14e-03 | (3812.91 ms | 137503 tok/s) step 3669/76294 | train loss 3.626311 | norm 0.2147 | lr 1.14e-03 | (3804.97 ms | 137790 tok/s) step 3670/76294 | train loss 3.611755 | norm 0.2249 | lr 1.14e-03 | (3838.29 ms | 136594 tok/s) step 3671/76294 | train loss 3.690086 | norm 0.2321 | lr 1.14e-03 | (3810.10 ms | 137605 tok/s) step 3672/76294 | train loss 3.600686 | norm 0.2437 | lr 1.14e-03 | (3852.48 ms | 136091 tok/s) step 3673/76294 | train loss 3.679468 | norm 0.2702 | lr 1.14e-03 | (3802.89 ms | 137866 tok/s) step 3674/76294 | train loss 3.575841 | norm 0.3577 | lr 1.14e-03 | (3837.67 ms | 136616 tok/s) step 3675/76294 | train loss 3.683914 | norm 0.2593 | lr 1.14e-03 | (3803.80 ms | 137833 tok/s) step 3676/76294 | train loss 3.655738 | norm 0.2322 | lr 1.14e-03 | (3806.98 ms | 137718 tok/s) step 3677/76294 | train loss 3.686697 | norm 0.2240 | lr 1.14e-03 | (3906.64 ms | 134204 tok/s) step 3678/76294 | train loss 3.614996 | norm 0.2059 | lr 1.14e-03 | (3805.07 ms | 137787 tok/s) step 3679/76294 | train loss 3.680806 | norm 0.2056 | lr 1.14e-03 | (3810.78 ms | 137580 tok/s) step 3680/76294 | train loss 3.603685 | norm 0.2025 | lr 1.14e-03 | (3831.87 ms | 136823 tok/s) step 3681/76294 | train loss 3.639546 | norm 0.2159 | lr 1.14e-03 | (3825.93 ms | 
137036 tok/s) step 3682/76294 | train loss 3.727323 | norm 0.2300 | lr 1.14e-03 | (3844.63 ms | 136369 tok/s) step 3683/76294 | train loss 3.673887 | norm 0.2578 | lr 1.14e-03 | (3860.82 ms | 135797 tok/s) step 3684/76294 | train loss 3.574559 | norm 0.2399 | lr 1.14e-03 | (3803.46 ms | 137845 tok/s) step 3685/76294 | train loss 3.658492 | norm 0.2618 | lr 1.14e-03 | (3806.53 ms | 137734 tok/s) step 3686/76294 | train loss 3.666068 | norm 0.3211 | lr 1.14e-03 | (3828.74 ms | 136935 tok/s) step 3687/76294 | train loss 3.634286 | norm 0.3560 | lr 1.14e-03 | (3803.50 ms | 137844 tok/s) step 3688/76294 | train loss 3.713503 | norm 0.3674 | lr 1.14e-03 | (3801.74 ms | 137907 tok/s) step 3689/76294 | train loss 3.630259 | norm 0.3487 | lr 1.14e-03 | (3829.46 ms | 136909 tok/s) step 3690/76294 | train loss 3.661278 | norm 0.2763 | lr 1.13e-03 | (3798.24 ms | 138035 tok/s) step 3691/76294 | train loss 3.660432 | norm 0.2773 | lr 1.13e-03 | (3830.04 ms | 136888 tok/s) step 3692/76294 | train loss 3.633361 | norm 0.2715 | lr 1.13e-03 | (3801.89 ms | 137902 tok/s) step 3693/76294 | train loss 3.640455 | norm 0.2627 | lr 1.13e-03 | (3834.73 ms | 136721 tok/s) step 3694/76294 | train loss 3.661242 | norm 0.2558 | lr 1.13e-03 | (3801.16 ms | 137928 tok/s) step 3695/76294 | train loss 3.609746 | norm 0.2332 | lr 1.13e-03 | (3850.28 ms | 136169 tok/s) step 3696/76294 | train loss 3.668974 | norm 0.2474 | lr 1.13e-03 | (3796.11 ms | 138112 tok/s) step 3697/76294 | train loss 3.655651 | norm 0.2113 | lr 1.13e-03 | (3899.59 ms | 134447 tok/s) step 3698/76294 | train loss 3.727441 | norm 0.2857 | lr 1.13e-03 | (3796.84 ms | 138085 tok/s) step 3699/76294 | train loss 3.622306 | norm 0.2611 | lr 1.13e-03 | (3824.84 ms | 137075 tok/s) step 3700/76294 | train loss 3.652315 | norm 0.2336 | lr 1.13e-03 | (3809.76 ms | 137617 tok/s) step 3701/76294 | train loss 3.688387 | norm 0.2572 | lr 1.13e-03 | (3802.86 ms | 137867 tok/s) step 3702/76294 | train loss 3.628389 | norm 0.2497 | lr 1.13e-03 
| (3971.84 ms | 132001 tok/s) step 3703/76294 | train loss 3.766068 | norm 0.2185 | lr 1.13e-03 | (3888.68 ms | 134824 tok/s) step 3704/76294 | train loss 3.649207 | norm 0.2489 | lr 1.13e-03 | (3792.58 ms | 138241 tok/s) step 3705/76294 | train loss 3.683744 | norm 0.2460 | lr 1.13e-03 | (3857.88 ms | 135900 tok/s) step 3706/76294 | train loss 3.753597 | norm 0.2685 | lr 1.13e-03 | (3800.20 ms | 137963 tok/s) step 3707/76294 | train loss 3.693297 | norm 0.2247 | lr 1.13e-03 | (3844.76 ms | 136364 tok/s) step 3708/76294 | train loss 3.681430 | norm 0.2632 | lr 1.13e-03 | (3793.71 ms | 138199 tok/s) step 3709/76294 | train loss 3.705693 | norm 0.2393 | lr 1.13e-03 | (3824.17 ms | 137098 tok/s) step 3710/76294 | train loss 3.634683 | norm 0.2177 | lr 1.13e-03 | (3797.14 ms | 138075 tok/s) step 3711/76294 | train loss 3.664973 | norm 0.2091 | lr 1.13e-03 | (3846.68 ms | 136296 tok/s) step 3712/76294 | train loss 3.667672 | norm 0.1983 | lr 1.13e-03 | (3844.28 ms | 136381 tok/s) step 3713/76294 | train loss 3.640436 | norm 0.2105 | lr 1.13e-03 | (3802.42 ms | 137883 tok/s) step 3714/76294 | train loss 3.675196 | norm 0.1631 | lr 1.13e-03 | (3819.09 ms | 137281 tok/s) step 3715/76294 | train loss 3.708188 | norm 0.2102 | lr 1.13e-03 | (3801.84 ms | 137904 tok/s) step 3716/76294 | train loss 3.707004 | norm 0.2209 | lr 1.13e-03 | (3940.90 ms | 133038 tok/s) step 3717/76294 | train loss 3.607144 | norm 0.2249 | lr 1.13e-03 | (3800.99 ms | 137934 tok/s) step 3718/76294 | train loss 3.639145 | norm 0.1852 | lr 1.13e-03 | (3807.15 ms | 137711 tok/s) step 3719/76294 | train loss 3.625679 | norm 0.2188 | lr 1.13e-03 | (3911.15 ms | 134050 tok/s) step 3720/76294 | train loss 3.663925 | norm 0.2134 | lr 1.13e-03 | (3824.58 ms | 137084 tok/s) step 3721/76294 | train loss 3.608099 | norm 0.2199 | lr 1.13e-03 | (3799.55 ms | 137987 tok/s) step 3722/76294 | train loss 3.674362 | norm 0.1911 | lr 1.13e-03 | (3822.85 ms | 137146 tok/s) step 3723/76294 | train loss 3.660756 | norm 
0.2342 | lr 1.13e-03 | (3798.61 ms | 138021 tok/s) step 3724/76294 | train loss 3.686666 | norm 0.2338 | lr 1.13e-03 | (3802.89 ms | 137866 tok/s) step 3725/76294 | train loss 3.626487 | norm 0.2320 | lr 1.13e-03 | (3818.73 ms | 137294 tok/s) step 3726/76294 | train loss 3.624383 | norm 0.2223 | lr 1.13e-03 | (3805.74 ms | 137762 tok/s) step 3727/76294 | train loss 3.672124 | norm 0.2768 | lr 1.13e-03 | (3801.29 ms | 137924 tok/s) step 3728/76294 | train loss 3.590090 | norm 0.2837 | lr 1.13e-03 | (3964.19 ms | 132256 tok/s) step 3729/76294 | train loss 3.658678 | norm 0.2707 | lr 1.13e-03 | (3809.82 ms | 137615 tok/s) step 3730/76294 | train loss 3.613859 | norm 0.2055 | lr 1.13e-03 | (4016.79 ms | 130524 tok/s) step 3731/76294 | train loss 3.700178 | norm 0.2913 | lr 1.13e-03 | (3808.61 ms | 137658 tok/s) step 3732/76294 | train loss 3.599293 | norm 0.1887 | lr 1.13e-03 | (3922.37 ms | 133666 tok/s) step 3733/76294 | train loss 3.666994 | norm 0.2859 | lr 1.13e-03 | (3825.21 ms | 137061 tok/s) step 3734/76294 | train loss 3.700608 | norm 0.2387 | lr 1.13e-03 | (3839.71 ms | 136544 tok/s) step 3735/76294 | train loss 3.681545 | norm 0.1930 | lr 1.13e-03 | (3828.43 ms | 136946 tok/s) step 3736/76294 | train loss 3.793623 | norm 0.2478 | lr 1.13e-03 | (3927.84 ms | 133480 tok/s) step 3737/76294 | train loss 3.623984 | norm 0.2013 | lr 1.13e-03 | (3797.22 ms | 138071 tok/s) step 3738/76294 | train loss 3.666738 | norm 0.2236 | lr 1.13e-03 | (3829.97 ms | 136891 tok/s) step 3739/76294 | train loss 3.670277 | norm 0.2273 | lr 1.13e-03 | (3805.65 ms | 137766 tok/s) step 3740/76294 | train loss 3.699115 | norm 0.2213 | lr 1.13e-03 | (3856.50 ms | 135949 tok/s) step 3741/76294 | train loss 3.612939 | norm 0.2115 | lr 1.13e-03 | (4500.78 ms | 116488 tok/s) step 3742/76294 | train loss 3.657882 | norm 0.2354 | lr 1.13e-03 | (3868.33 ms | 135533 tok/s) step 3743/76294 | train loss 3.625735 | norm 0.2236 | lr 1.13e-03 | (3802.45 ms | 137881 tok/s) step 3744/76294 | train loss 
3.670825 | norm 0.2054 | lr 1.13e-03 | (3972.94 ms | 131965 tok/s) step 3745/76294 | train loss 3.613096 | norm 0.2224 | lr 1.13e-03 | (3802.57 ms | 137877 tok/s) step 3746/76294 | train loss 3.661395 | norm 0.2140 | lr 1.13e-03 | (3834.15 ms | 136742 tok/s) step 3747/76294 | train loss 3.596608 | norm 0.1983 | lr 1.13e-03 | (3807.73 ms | 137690 tok/s) step 3748/76294 | train loss 3.648763 | norm 0.2461 | lr 1.13e-03 | (3810.81 ms | 137579 tok/s) step 3749/76294 | train loss 3.666225 | norm 0.3252 | lr 1.13e-03 | (3833.63 ms | 136760 tok/s) step 3750/76294 | train loss 3.668120 | norm 0.3366 | lr 1.13e-03 | (3813.09 ms | 137497 tok/s) val loss: 3.637556 saving model checkpoint to ./results/gpt2-124M-gqa/step_3750.pth step 3751/76294 | train loss 3.686014 | norm 0.2120 | lr 1.13e-03 | (4054.60 ms | 129307 tok/s) step 3752/76294 | train loss 3.680604 | norm 0.3404 | lr 1.13e-03 | (3769.57 ms | 139084 tok/s) step 3753/76294 | train loss 3.648802 | norm 0.2750 | lr 1.13e-03 | (3772.86 ms | 138963 tok/s) step 3754/76294 | train loss 3.777783 | norm 0.3879 | lr 1.13e-03 | (3801.28 ms | 137924 tok/s) step 3755/76294 | train loss 3.649254 | norm 0.3024 | lr 1.13e-03 | (3820.49 ms | 137231 tok/s) step 3756/76294 | train loss 3.686218 | norm 0.2894 | lr 1.13e-03 | (3780.38 ms | 138687 tok/s) step 3757/76294 | train loss 3.716052 | norm 0.2415 | lr 1.13e-03 | (3790.69 ms | 138309 tok/s) step 3758/76294 | train loss 3.664260 | norm 0.2753 | lr 1.13e-03 | (3812.18 ms | 137530 tok/s) step 3759/76294 | train loss 3.638549 | norm 0.2495 | lr 1.13e-03 | (4979.82 ms | 105283 tok/s) step 3760/76294 | train loss 3.675406 | norm 0.2276 | lr 1.13e-03 | (3815.13 ms | 137423 tok/s) step 3761/76294 | train loss 3.644651 | norm 0.2456 | lr 1.13e-03 | (3800.94 ms | 137937 tok/s) step 3762/76294 | train loss 3.665289 | norm 0.2226 | lr 1.13e-03 | (3789.52 ms | 138352 tok/s) step 3763/76294 | train loss 3.695076 | norm 0.1999 | lr 1.13e-03 | (3843.09 ms | 136424 tok/s) step 3764/76294 | train 
loss 3.646946 | norm 0.2196 | lr 1.13e-03 | (3800.28 ms | 137961 tok/s) step 3765/76294 | train loss 3.685771 | norm 0.2073 | lr 1.13e-03 | (3827.97 ms | 136962 tok/s) step 3766/76294 | train loss 3.672085 | norm 0.1973 | lr 1.13e-03 | (3794.49 ms | 138171 tok/s) step 3767/76294 | train loss 3.629554 | norm 0.2190 | lr 1.13e-03 | (3832.36 ms | 136806 tok/s) step 3768/76294 | train loss 3.657064 | norm 0.2251 | lr 1.13e-03 | (3797.42 ms | 138064 tok/s) step 3769/76294 | train loss 3.661725 | norm 0.2502 | lr 1.13e-03 | (3923.64 ms | 133623 tok/s) step 3770/76294 | train loss 3.667535 | norm 0.2118 | lr 1.13e-03 | (3801.41 ms | 137919 tok/s) step 3771/76294 | train loss 3.670890 | norm 0.2380 | lr 1.13e-03 | (3946.37 ms | 132853 tok/s) step 3772/76294 | train loss 3.853395 | norm 0.3268 | lr 1.13e-03 | (3798.61 ms | 138021 tok/s) step 3773/76294 | train loss 3.585283 | norm 0.4945 | lr 1.13e-03 | (3844.12 ms | 136387 tok/s) step 3774/76294 | train loss 3.612674 | norm 0.3414 | lr 1.13e-03 | (3800.96 ms | 137936 tok/s) step 3775/76294 | train loss 3.660523 | norm 0.2854 | lr 1.13e-03 | (4076.20 ms | 128622 tok/s) step 3776/76294 | train loss 3.641699 | norm 0.3081 | lr 1.13e-03 | (3798.11 ms | 138039 tok/s) step 3777/76294 | train loss 3.637162 | norm 0.2606 | lr 1.13e-03 | (3879.43 ms | 135146 tok/s) step 3778/76294 | train loss 3.676070 | norm 0.2210 | lr 1.13e-03 | (3826.76 ms | 137006 tok/s) step 3779/76294 | train loss 3.622159 | norm 0.2563 | lr 1.13e-03 | (3978.07 ms | 131795 tok/s) step 3780/76294 | train loss 3.655157 | norm 0.2428 | lr 1.13e-03 | (3797.43 ms | 138064 tok/s) step 3781/76294 | train loss 3.645664 | norm 0.2155 | lr 1.13e-03 | (3847.38 ms | 136271 tok/s) step 3782/76294 | train loss 3.668494 | norm 0.2033 | lr 1.13e-03 | (3799.23 ms | 137999 tok/s) step 3783/76294 | train loss 3.605925 | norm 0.1997 | lr 1.13e-03 | (3880.01 ms | 135125 tok/s) step 3784/76294 | train loss 3.643965 | norm 0.1973 | lr 1.13e-03 | (3803.94 ms | 137828 tok/s) step 
3785/76294 | train loss 3.785679 | norm 0.2005 | lr 1.13e-03 | (3897.04 ms | 134535 tok/s) step 3786/76294 | train loss 3.715307 | norm 0.2070 | lr 1.13e-03 | (3834.78 ms | 136719 tok/s) step 3787/76294 | train loss 3.795233 | norm 0.2319 | lr 1.13e-03 | (3802.45 ms | 137882 tok/s) step 3788/76294 | train loss 3.659158 | norm 0.2345 | lr 1.13e-03 | (3938.97 ms | 133103 tok/s) step 3789/76294 | train loss 3.658082 | norm 0.2399 | lr 1.13e-03 | (3803.50 ms | 137844 tok/s) step 3790/76294 | train loss 3.671517 | norm 0.2020 | lr 1.13e-03 | (3928.02 ms | 133474 tok/s) step 3791/76294 | train loss 3.601450 | norm 0.2323 | lr 1.13e-03 | (3806.47 ms | 137736 tok/s) step 3792/76294 | train loss 3.591098 | norm 0.2056 | lr 1.13e-03 | (3946.32 ms | 132855 tok/s) step 3793/76294 | train loss 3.629085 | norm 0.2044 | lr 1.13e-03 | (3798.81 ms | 138014 tok/s) step 3794/76294 | train loss 3.644530 | norm 0.1952 | lr 1.13e-03 | (3829.65 ms | 136902 tok/s) step 3795/76294 | train loss 3.644844 | norm 0.2691 | lr 1.13e-03 | (3801.49 ms | 137916 tok/s) step 3796/76294 | train loss 3.602581 | norm 0.2690 | lr 1.13e-03 | (3856.96 ms | 135933 tok/s) step 3797/76294 | train loss 3.639292 | norm 0.2132 | lr 1.13e-03 | (3797.23 ms | 138071 tok/s) step 3798/76294 | train loss 3.611959 | norm 0.2221 | lr 1.13e-03 | (3824.69 ms | 137080 tok/s) step 3799/76294 | train loss 3.626491 | norm 0.1947 | lr 1.13e-03 | (3804.56 ms | 137805 tok/s) step 3800/76294 | train loss 3.681865 | norm 0.2653 | lr 1.13e-03 | (3884.28 ms | 134977 tok/s) step 3801/76294 | train loss 3.708885 | norm 0.2061 | lr 1.13e-03 | (3795.97 ms | 138117 tok/s) step 3802/76294 | train loss 3.654304 | norm 0.1961 | lr 1.13e-03 | (3848.47 ms | 136233 tok/s) step 3803/76294 | train loss 3.629697 | norm 0.2152 | lr 1.13e-03 | (3793.75 ms | 138198 tok/s) step 3804/76294 | train loss 3.668132 | norm 0.2162 | lr 1.13e-03 | (3824.59 ms | 137084 tok/s) step 3805/76294 | train loss 3.695487 | norm 0.1795 | lr 1.13e-03 | (3801.25 ms | 
137925 tok/s) step 3806/76294 | train loss 3.672210 | norm 0.2032 | lr 1.13e-03 | (3861.56 ms | 135771 tok/s) step 3807/76294 | train loss 3.617945 | norm 0.1974 | lr 1.13e-03 | (3842.05 ms | 136460 tok/s) step 3808/76294 | train loss 3.607155 | norm 0.2067 | lr 1.13e-03 | (3919.06 ms | 133779 tok/s) step 3809/76294 | train loss 3.588118 | norm 0.2234 | lr 1.13e-03 | (3804.50 ms | 137807 tok/s) step 3810/76294 | train loss 3.664613 | norm 0.2246 | lr 1.13e-03 | (3815.72 ms | 137402 tok/s) step 3811/76294 | train loss 3.595344 | norm 0.1837 | lr 1.13e-03 | (3825.86 ms | 137038 tok/s) step 3812/76294 | train loss 3.639060 | norm 0.2355 | lr 1.13e-03 | (3833.72 ms | 136757 tok/s) step 3813/76294 | train loss 3.576610 | norm 0.2600 | lr 1.13e-03 | (3829.56 ms | 136905 tok/s) step 3814/76294 | train loss 3.729444 | norm 0.2363 | lr 1.13e-03 | (3834.89 ms | 136715 tok/s) step 3815/76294 | train loss 3.611162 | norm 0.2783 | lr 1.13e-03 | (4449.22 ms | 117838 tok/s) step 3816/76294 | train loss 3.620929 | norm 0.2473 | lr 1.13e-03 | (3802.19 ms | 137891 tok/s) step 3817/76294 | train loss 3.664105 | norm 0.2258 | lr 1.13e-03 | (3837.08 ms | 136637 tok/s) step 3818/76294 | train loss 3.654743 | norm 0.2517 | lr 1.13e-03 | (3808.20 ms | 137673 tok/s) step 3819/76294 | train loss 3.657309 | norm 0.2033 | lr 1.13e-03 | (3861.64 ms | 135768 tok/s) step 3820/76294 | train loss 3.650251 | norm 0.2036 | lr 1.13e-03 | (3921.60 ms | 133692 tok/s) step 3821/76294 | train loss 3.620853 | norm 0.2005 | lr 1.13e-03 | (3805.88 ms | 137757 tok/s) step 3822/76294 | train loss 3.583103 | norm 0.2086 | lr 1.13e-03 | (3839.27 ms | 136559 tok/s) step 3823/76294 | train loss 3.620537 | norm 0.1930 | lr 1.13e-03 | (3897.68 ms | 134513 tok/s) step 3824/76294 | train loss 3.552500 | norm 0.2003 | lr 1.13e-03 | (3801.50 ms | 137916 tok/s) step 3825/76294 | train loss 3.650741 | norm 0.2073 | lr 1.13e-03 | (3917.36 ms | 133837 tok/s) step 3826/76294 | train loss 3.625492 | norm 0.1850 | lr 1.13e-03 
| (3876.53 ms | 135247 tok/s) step 3827/76294 | train loss 3.601898 | norm 0.1824 | lr 1.13e-03 | (3798.91 ms | 138010 tok/s) step 3828/76294 | train loss 3.669121 | norm 0.1995 | lr 1.13e-03 | (3899.43 ms | 134452 tok/s) step 3829/76294 | train loss 3.628665 | norm 0.1808 | lr 1.13e-03 | (3805.60 ms | 137767 tok/s) step 3830/76294 | train loss 3.641780 | norm 0.1866 | lr 1.13e-03 | (3906.35 ms | 134214 tok/s) step 3831/76294 | train loss 3.694687 | norm 0.2153 | lr 1.13e-03 | (3873.62 ms | 135348 tok/s) step 3832/76294 | train loss 3.604124 | norm 0.1921 | lr 1.13e-03 | (3794.77 ms | 138161 tok/s) step 3833/76294 | train loss 3.608301 | norm 0.2060 | lr 1.13e-03 | (3838.81 ms | 136576 tok/s) step 3834/76294 | train loss 3.647069 | norm 0.2133 | lr 1.13e-03 | (3796.62 ms | 138093 tok/s) step 3835/76294 | train loss 3.740837 | norm 0.1882 | lr 1.13e-03 | (3830.29 ms | 136880 tok/s) step 3836/76294 | train loss 3.596533 | norm 0.2338 | lr 1.13e-03 | (3798.05 ms | 138041 tok/s) step 3837/76294 | train loss 3.645588 | norm 0.2334 | lr 1.13e-03 | (3909.80 ms | 134096 tok/s) step 3838/76294 | train loss 3.634494 | norm 0.2346 | lr 1.13e-03 | (3839.95 ms | 136535 tok/s) step 3839/76294 | train loss 3.678487 | norm 0.2548 | lr 1.13e-03 | (3792.53 ms | 138242 tok/s) step 3840/76294 | train loss 3.741729 | norm 0.2344 | lr 1.13e-03 | (3825.13 ms | 137064 tok/s) step 3841/76294 | train loss 3.694992 | norm 0.2367 | lr 1.13e-03 | (3794.69 ms | 138164 tok/s) step 3842/76294 | train loss 3.732940 | norm 0.2643 | lr 1.13e-03 | (3824.88 ms | 137073 tok/s) step 3843/76294 | train loss 3.637821 | norm 0.2516 | lr 1.13e-03 | (3847.51 ms | 136267 tok/s) step 3844/76294 | train loss 3.583664 | norm 0.2136 | lr 1.13e-03 | (3811.93 ms | 137539 tok/s) step 3845/76294 | train loss 3.779088 | norm 0.2342 | lr 1.13e-03 | (3807.10 ms | 137713 tok/s) step 3846/76294 | train loss 3.584744 | norm 0.2223 | lr 1.13e-03 | (3842.97 ms | 136428 tok/s) step 3847/76294 | train loss 3.627008 | norm 
0.2211 | lr 1.13e-03 | (3874.41 ms | 135321 tok/s) step 3848/76294 | train loss 3.577479 | norm 0.2401 | lr 1.13e-03 | (3811.46 ms | 137556 tok/s) step 3849/76294 | train loss 3.670486 | norm 0.2324 | lr 1.13e-03 | (3847.49 ms | 136268 tok/s) step 3850/76294 | train loss 3.595395 | norm 0.2309 | lr 1.13e-03 | (3835.90 ms | 136679 tok/s) step 3851/76294 | train loss 3.683073 | norm 0.2250 | lr 1.13e-03 | (3809.33 ms | 137633 tok/s) step 3852/76294 | train loss 3.679128 | norm 0.1962 | lr 1.13e-03 | (4310.98 ms | 121617 tok/s) step 3853/76294 | train loss 3.692297 | norm 0.1976 | lr 1.13e-03 | (6493.23 ms | 80744 tok/s) step 3854/76294 | train loss 3.686824 | norm 0.2034 | lr 1.13e-03 | (6385.74 ms | 82103 tok/s) step 3855/76294 | train loss 3.619906 | norm 0.2318 | lr 1.13e-03 | (3823.07 ms | 137138 tok/s) step 3856/76294 | train loss 3.709173 | norm 0.3736 | lr 1.13e-03 | (4140.18 ms | 126634 tok/s) step 3857/76294 | train loss 3.576361 | norm 0.3825 | lr 1.13e-03 | (3855.00 ms | 136002 tok/s) step 3858/76294 | train loss 3.739680 | norm 0.2356 | lr 1.13e-03 | (3826.57 ms | 137013 tok/s) step 3859/76294 | train loss 3.634645 | norm 0.2723 | lr 1.13e-03 | (3856.29 ms | 135957 tok/s) step 3860/76294 | train loss 3.682252 | norm 0.2315 | lr 1.13e-03 | (3823.79 ms | 137112 tok/s) step 3861/76294 | train loss 3.739156 | norm 0.2385 | lr 1.13e-03 | (3832.63 ms | 136796 tok/s) step 3862/76294 | train loss 3.632700 | norm 0.2129 | lr 1.13e-03 | (3835.60 ms | 136690 tok/s) step 3863/76294 | train loss 3.678328 | norm 0.2193 | lr 1.13e-03 | (3844.26 ms | 136382 tok/s) step 3864/76294 | train loss 3.606959 | norm 0.1847 | lr 1.13e-03 | (3813.74 ms | 137474 tok/s) step 3865/76294 | train loss 3.667481 | norm 0.1985 | lr 1.13e-03 | (3829.27 ms | 136916 tok/s) step 3866/76294 | train loss 3.606256 | norm 0.2081 | lr 1.13e-03 | (4198.16 ms | 124885 tok/s) step 3867/76294 | train loss 3.610118 | norm 0.2218 | lr 1.13e-03 | (3824.68 ms | 137080 tok/s) step 3868/76294 | train loss 
3.657901 | norm 0.2058 | lr 1.13e-03 | (3809.65 ms | 137621 tok/s) step 3869/76294 | train loss 3.641656 | norm 0.2076 | lr 1.13e-03 | (3811.91 ms | 137540 tok/s) step 3870/76294 | train loss 3.598327 | norm 0.1981 | lr 1.13e-03 | (3852.39 ms | 136094 tok/s) step 3871/76294 | train loss 3.629185 | norm 0.2057 | lr 1.13e-03 | (3819.55 ms | 137264 tok/s) step 3872/76294 | train loss 3.784414 | norm 0.2146 | lr 1.13e-03 | (3812.77 ms | 137508 tok/s) step 3873/76294 | train loss 3.697457 | norm 0.2567 | lr 1.13e-03 | (3836.39 ms | 136662 tok/s) step 3874/76294 | train loss 3.624763 | norm 0.2429 | lr 1.13e-03 | (3820.92 ms | 137215 tok/s) step 3875/76294 | train loss 3.646244 | norm 0.1994 | lr 1.13e-03 | (3820.18 ms | 137242 tok/s) step 3876/76294 | train loss 3.653333 | norm 0.2238 | lr 1.13e-03 | (3814.36 ms | 137451 tok/s) step 3877/76294 | train loss 3.765018 | norm 0.2566 | lr 1.13e-03 | (3816.95 ms | 137358 tok/s) step 3878/76294 | train loss 3.614642 | norm 0.2803 | lr 1.13e-03 | (3834.19 ms | 136740 tok/s) step 3879/76294 | train loss 3.601866 | norm 0.2315 | lr 1.13e-03 | (3839.89 ms | 136537 tok/s) step 3880/76294 | train loss 3.702651 | norm 0.2271 | lr 1.13e-03 | (3831.07 ms | 136851 tok/s) step 3881/76294 | train loss 3.585090 | norm 0.2892 | lr 1.13e-03 | (3844.46 ms | 136375 tok/s) step 3882/76294 | train loss 3.548820 | norm 0.2174 | lr 1.13e-03 | (3901.81 ms | 134371 tok/s) step 3883/76294 | train loss 3.653159 | norm 0.2490 | lr 1.13e-03 | (3824.17 ms | 137099 tok/s) step 3884/76294 | train loss 3.616390 | norm 0.2410 | lr 1.13e-03 | (3883.84 ms | 134992 tok/s) step 3885/76294 | train loss 3.621614 | norm 0.2268 | lr 1.13e-03 | (3826.18 ms | 137026 tok/s) step 3886/76294 | train loss 3.632615 | norm 0.2166 | lr 1.13e-03 | (3845.81 ms | 136327 tok/s) step 3887/76294 | train loss 3.542766 | norm 0.1850 | lr 1.13e-03 | (3810.90 ms | 137576 tok/s) step 3888/76294 | train loss 3.703790 | norm 0.1925 | lr 1.13e-03 | (3834.00 ms | 136747 tok/s) step 
3889/76294 | train loss 3.667194 | norm 0.1752 | lr 1.13e-03 | (3810.61 ms | 137587 tok/s) step 3890/76294 | train loss 3.629515 | norm 0.2221 | lr 1.13e-03 | (3841.04 ms | 136496 tok/s) step 3891/76294 | train loss 3.628944 | norm 0.2710 | lr 1.13e-03 | (3832.01 ms | 136818 tok/s) step 3892/76294 | train loss 3.621429 | norm 0.2382 | lr 1.13e-03 | (3815.01 ms | 137428 tok/s) step 3893/76294 | train loss 3.649963 | norm 0.2325 | lr 1.13e-03 | (3813.71 ms | 137475 tok/s) step 3894/76294 | train loss 3.598289 | norm 0.2437 | lr 1.13e-03 | (3950.63 ms | 132710 tok/s) step 3895/76294 | train loss 3.632532 | norm 0.2543 | lr 1.13e-03 | (3810.10 ms | 137605 tok/s) step 3896/76294 | train loss 3.631245 | norm 0.2323 | lr 1.13e-03 | (3923.52 ms | 133627 tok/s) step 3897/76294 | train loss 3.633394 | norm 0.2402 | lr 1.13e-03 | (4029.74 ms | 130105 tok/s) step 3898/76294 | train loss 3.646412 | norm 0.2080 | lr 1.13e-03 | (3789.56 ms | 138351 tok/s) step 3899/76294 | train loss 3.628605 | norm 0.2204 | lr 1.13e-03 | (3818.98 ms | 137285 tok/s) step 3900/76294 | train loss 3.540670 | norm 0.2445 | lr 1.13e-03 | (3945.90 ms | 132869 tok/s) step 3901/76294 | train loss 3.663457 | norm 0.2184 | lr 1.13e-03 | (3883.69 ms | 134997 tok/s) step 3902/76294 | train loss 3.625285 | norm 0.2004 | lr 1.13e-03 | (4462.72 ms | 117482 tok/s) step 3903/76294 | train loss 3.593518 | norm 0.1987 | lr 1.13e-03 | (3793.02 ms | 138224 tok/s) step 3904/76294 | train loss 3.674783 | norm 0.2417 | lr 1.13e-03 | (3929.48 ms | 133424 tok/s) step 3905/76294 | train loss 3.641834 | norm 0.2151 | lr 1.13e-03 | (3824.33 ms | 137093 tok/s) step 3906/76294 | train loss 3.685167 | norm 0.2131 | lr 1.13e-03 | (3793.96 ms | 138190 tok/s) step 3907/76294 | train loss 3.658792 | norm 0.2162 | lr 1.13e-03 | (3910.22 ms | 134081 tok/s) step 3908/76294 | train loss 3.643137 | norm 0.2077 | lr 1.13e-03 | (3784.34 ms | 138542 tok/s) step 3909/76294 | train loss 3.680058 | norm 0.2217 | lr 1.13e-03 | (3997.54 ms | 
131153 tok/s) step 3910/76294 | train loss 3.673034 | norm 0.2333 | lr 1.13e-03 | (3828.83 ms | 136932 tok/s) step 3911/76294 | train loss 3.595639 | norm 0.2304 | lr 1.13e-03 | (3791.80 ms | 138269 tok/s) step 3912/76294 | train loss 3.695847 | norm 0.1823 | lr 1.13e-03 | (3826.74 ms | 137007 tok/s) step 3913/76294 | train loss 3.646459 | norm 0.1846 | lr 1.13e-03 | (3799.03 ms | 138006 tok/s) step 3914/76294 | train loss 3.659002 | norm 0.2221 | lr 1.13e-03 | (3817.27 ms | 137346 tok/s) step 3915/76294 | train loss 3.676429 | norm 0.2065 | lr 1.13e-03 | (3801.69 ms | 137909 tok/s) step 3916/76294 | train loss 3.655592 | norm 0.1992 | lr 1.12e-03 | (3801.57 ms | 137914 tok/s) step 3917/76294 | train loss 3.586663 | norm 0.2433 | lr 1.12e-03 | (3799.71 ms | 137981 tok/s) step 3918/76294 | train loss 3.740343 | norm 0.2338 | lr 1.12e-03 | (3941.47 ms | 133019 tok/s) step 3919/76294 | train loss 3.649760 | norm 0.2434 | lr 1.12e-03 | (3806.57 ms | 137732 tok/s) step 3920/76294 | train loss 3.594881 | norm 0.2306 | lr 1.12e-03 | (3998.54 ms | 131120 tok/s) step 3921/76294 | train loss 3.637373 | norm 0.2255 | lr 1.12e-03 | (3798.69 ms | 138018 tok/s) step 3922/76294 | train loss 3.674653 | norm 0.2134 | lr 1.12e-03 | (3952.50 ms | 132647 tok/s) step 3923/76294 | train loss 3.680692 | norm 0.2039 | lr 1.12e-03 | (3881.90 ms | 135060 tok/s) step 3924/76294 | train loss 3.687510 | norm 0.2029 | lr 1.12e-03 | (4867.19 ms | 107719 tok/s) step 3925/76294 | train loss 3.577698 | norm 0.2192 | lr 1.12e-03 | (3812.28 ms | 137526 tok/s) step 3926/76294 | train loss 3.633347 | norm 0.1946 | lr 1.12e-03 | (4472.69 ms | 117220 tok/s) step 3927/76294 | train loss 3.672984 | norm 0.2016 | lr 1.12e-03 | (3803.73 ms | 137835 tok/s) step 3928/76294 | train loss 3.665713 | norm 0.2276 | lr 1.12e-03 | (3842.62 ms | 136440 tok/s) step 3929/76294 | train loss 3.791120 | norm 0.2692 | lr 1.12e-03 | (3790.38 ms | 138321 tok/s) step 3930/76294 | train loss 3.616857 | norm 0.3508 | lr 1.12e-03 
| (3997.88 ms | 131141 tok/s) step 3931/76294 | train loss 3.688238 | norm 0.3235 | lr 1.12e-03 | (3792.08 ms | 138259 tok/s) step 3932/76294 | train loss 3.640318 | norm 0.2657 | lr 1.12e-03 | (3825.34 ms | 137057 tok/s) step 3933/76294 | train loss 3.615488 | norm 0.2620 | lr 1.12e-03 | (3856.13 ms | 135962 tok/s) step 3934/76294 | train loss 3.656611 | norm 0.2988 | lr 1.12e-03 | (3854.37 ms | 136024 tok/s) step 3935/76294 | train loss 3.652604 | norm 0.2383 | lr 1.12e-03 | (3860.69 ms | 135802 tok/s) step 3936/76294 | train loss 3.625893 | norm 0.2418 | lr 1.12e-03 | (3803.43 ms | 137846 tok/s) step 3937/76294 | train loss 3.642043 | norm 0.2325 | lr 1.12e-03 | (3824.01 ms | 137104 tok/s) step 3938/76294 | train loss 3.627379 | norm 0.2134 | lr 1.12e-03 | (3824.99 ms | 137069 tok/s) step 3939/76294 | train loss 3.695721 | norm 0.1974 | lr 1.12e-03 | (3873.88 ms | 135339 tok/s) step 3940/76294 | train loss 3.612350 | norm 0.1856 | lr 1.12e-03 | (3795.30 ms | 138141 tok/s) step 3941/76294 | train loss 3.642115 | norm 0.2143 | lr 1.12e-03 | (3832.25 ms | 136810 tok/s) step 3942/76294 | train loss 3.589360 | norm 0.2028 | lr 1.12e-03 | (3797.50 ms | 138061 tok/s) step 3943/76294 | train loss 3.633684 | norm 0.2541 | lr 1.12e-03 | (3827.30 ms | 136986 tok/s) step 3944/76294 | train loss 3.640220 | norm 0.2208 | lr 1.12e-03 | (3804.28 ms | 137815 tok/s) step 3945/76294 | train loss 3.646971 | norm 0.2023 | lr 1.12e-03 | (3815.91 ms | 137395 tok/s) step 3946/76294 | train loss 3.590177 | norm 0.2219 | lr 1.12e-03 | (3802.88 ms | 137866 tok/s) step 3947/76294 | train loss 3.625171 | norm 0.2218 | lr 1.12e-03 | (3835.74 ms | 136685 tok/s) step 3948/76294 | train loss 3.609678 | norm 0.2594 | lr 1.12e-03 | (3804.55 ms | 137806 tok/s) step 3949/76294 | train loss 3.635875 | norm 0.2204 | lr 1.12e-03 | (3875.14 ms | 135295 tok/s) step 3950/76294 | train loss 3.629208 | norm 0.2028 | lr 1.12e-03 | (3806.54 ms | 137734 tok/s) step 3951/76294 | train loss 3.684949 | norm 
0.2725 | lr 1.12e-03 | (3833.64 ms | 136760 tok/s) step 3952/76294 | train loss 3.607384 | norm 0.2529 | lr 1.12e-03 | (3826.27 ms | 137023 tok/s) step 3953/76294 | train loss 3.619568 | norm 0.2343 | lr 1.12e-03 | (3910.06 ms | 134087 tok/s) step 3954/76294 | train loss 3.645278 | norm 0.2399 | lr 1.12e-03 | (3806.39 ms | 137739 tok/s) step 3955/76294 | train loss 3.582178 | norm 0.2169 | lr 1.12e-03 | (3991.12 ms | 131364 tok/s) step 3956/76294 | train loss 3.623636 | norm 0.2034 | lr 1.12e-03 | (3791.79 ms | 138269 tok/s) step 3957/76294 | train loss 3.670218 | norm 0.2303 | lr 1.12e-03 | (3799.75 ms | 137979 tok/s) step 3958/76294 | train loss 3.664973 | norm 0.2033 | lr 1.12e-03 | (4078.39 ms | 128553 tok/s) step 3959/76294 | train loss 3.621123 | norm 0.1946 | lr 1.12e-03 | (3797.64 ms | 138056 tok/s) step 3960/76294 | train loss 3.612842 | norm 0.1945 | lr 1.12e-03 | (4195.30 ms | 124970 tok/s) step 3961/76294 | train loss 3.652225 | norm 0.2042 | lr 1.12e-03 | (3807.99 ms | 137681 tok/s) step 3962/76294 | train loss 3.581864 | norm 0.1950 | lr 1.12e-03 | (3808.40 ms | 137666 tok/s) step 3963/76294 | train loss 3.674061 | norm 0.2099 | lr 1.12e-03 | (3828.92 ms | 136929 tok/s) step 3964/76294 | train loss 3.734263 | norm 0.2327 | lr 1.12e-03 | (3844.37 ms | 136378 tok/s) step 3965/76294 | train loss 3.618573 | norm 0.2503 | lr 1.12e-03 | (3835.09 ms | 136708 tok/s) step 3966/76294 | train loss 3.607393 | norm 0.2781 | lr 1.12e-03 | (3829.60 ms | 136904 tok/s) step 3967/76294 | train loss 3.586524 | norm 0.2507 | lr 1.12e-03 | (3825.14 ms | 137064 tok/s) step 3968/76294 | train loss 3.669666 | norm 0.2201 | lr 1.12e-03 | (3988.45 ms | 131452 tok/s) step 3969/76294 | train loss 3.638484 | norm 0.2168 | lr 1.12e-03 | (3816.31 ms | 137381 tok/s) step 3970/76294 | train loss 3.638472 | norm 0.2289 | lr 1.12e-03 | (3841.81 ms | 136469 tok/s) step 3971/76294 | train loss 3.655681 | norm 0.2151 | lr 1.12e-03 | (3809.58 ms | 137624 tok/s) step 3972/76294 | train loss 
3.659477 | norm 0.2223 | lr 1.12e-03 | (4569.68 ms | 114732 tok/s) step 3973/76294 | train loss 3.643040 | norm 0.1915 | lr 1.12e-03 | (3821.86 ms | 137181 tok/s) step 3974/76294 | train loss 3.582117 | norm 0.2178 | lr 1.12e-03 | (3826.50 ms | 137015 tok/s) step 3975/76294 | train loss 3.842770 | norm 0.2358 | lr 1.12e-03 | (3838.97 ms | 136570 tok/s) step 3976/76294 | train loss 3.702034 | norm 0.2427 | lr 1.12e-03 | (3812.82 ms | 137507 tok/s) step 3977/76294 | train loss 3.611844 | norm 0.2273 | lr 1.12e-03 | (3833.54 ms | 136763 tok/s) step 3978/76294 | train loss 3.564608 | norm 0.2123 | lr 1.12e-03 | (3824.77 ms | 137077 tok/s) step 3979/76294 | train loss 3.645630 | norm 0.2149 | lr 1.12e-03 | (3940.09 ms | 133065 tok/s) step 3980/76294 | train loss 3.639844 | norm 0.2217 | lr 1.12e-03 | (3811.74 ms | 137546 tok/s) step 3981/76294 | train loss 3.612437 | norm 0.2031 | lr 1.12e-03 | (3815.67 ms | 137404 tok/s) step 3982/76294 | train loss 3.598935 | norm 0.2055 | lr 1.12e-03 | (3844.99 ms | 136356 tok/s) step 3983/76294 | train loss 3.597835 | norm 0.1832 | lr 1.12e-03 | (3837.90 ms | 136608 tok/s) step 3984/76294 | train loss 3.600606 | norm 0.1962 | lr 1.12e-03 | (3841.94 ms | 136464 tok/s) step 3985/76294 | train loss 3.617584 | norm 0.1914 | lr 1.12e-03 | (3816.65 ms | 137369 tok/s) step 3986/76294 | train loss 3.638903 | norm 0.2007 | lr 1.12e-03 | (3833.25 ms | 136774 tok/s) step 3987/76294 | train loss 3.571125 | norm 0.2015 | lr 1.12e-03 | (3838.45 ms | 136588 tok/s) step 3988/76294 | train loss 3.655138 | norm 0.1968 | lr 1.12e-03 | (3835.22 ms | 136704 tok/s) step 3989/76294 | train loss 3.710496 | norm 0.2360 | lr 1.12e-03 | (3813.75 ms | 137473 tok/s) step 3990/76294 | train loss 3.691324 | norm 0.2482 | lr 1.12e-03 | (3832.83 ms | 136789 tok/s) step 3991/76294 | train loss 3.586487 | norm 0.2282 | lr 1.12e-03 | (3816.98 ms | 137357 tok/s) step 3992/76294 | train loss 3.565469 | norm 0.2026 | lr 1.12e-03 | (3843.83 ms | 136397 tok/s) step 
3993/76294 | train loss 3.541808 | norm 0.1958 | lr 1.12e-03 | (3811.99 ms | 137537 tok/s) step 3994/76294 | train loss 3.611099 | norm 0.2171 | lr 1.12e-03 | (3834.55 ms | 136727 tok/s) step 3995/76294 | train loss 3.606989 | norm 0.1993 | lr 1.12e-03 | (3822.14 ms | 137171 tok/s) step 3996/76294 | train loss 3.598656 | norm 0.1922 | lr 1.12e-03 | (3816.41 ms | 137377 tok/s) step 3997/76294 | train loss 3.641505 | norm 0.2179 | lr 1.12e-03 | (3827.44 ms | 136981 tok/s) step 3998/76294 | train loss 3.625543 | norm 0.2285 | lr 1.12e-03 | (3837.94 ms | 136607 tok/s) step 3999/76294 | train loss 3.681000 | norm 0.3228 | lr 1.12e-03 | (3900.66 ms | 134410 tok/s) step 4000/76294 | train loss 3.681431 | norm 0.2687 | lr 1.12e-03 | (3808.82 ms | 137651 tok/s) val loss: 3.624115 saving model checkpoint to ./results/gpt2-124M-gqa/step_4000.pth step 4001/76294 | train loss 3.616091 | norm 0.2019 | lr 1.12e-03 | (3854.40 ms | 136023 tok/s) step 4002/76294 | train loss 3.631910 | norm 0.2253 | lr 1.12e-03 | (3794.51 ms | 138170 tok/s) step 4003/76294 | train loss 3.607406 | norm 0.2425 | lr 1.12e-03 | (3860.41 ms | 135812 tok/s) step 4004/76294 | train loss 3.673304 | norm 0.2002 | lr 1.12e-03 | (3798.56 ms | 138023 tok/s) step 4005/76294 | train loss 3.598482 | norm 0.2284 | lr 1.12e-03 | (3807.21 ms | 137709 tok/s) step 4006/76294 | train loss 3.582120 | norm 0.2293 | lr 1.12e-03 | (4823.25 ms | 108700 tok/s) step 4007/76294 | train loss 3.656488 | norm 0.1928 | lr 1.12e-03 | (3883.58 ms | 135001 tok/s) step 4008/76294 | train loss 3.701936 | norm 0.2546 | lr 1.12e-03 | (3795.14 ms | 138147 tok/s) step 4009/76294 | train loss 3.627079 | norm 0.2088 | lr 1.12e-03 | (4096.54 ms | 127983 tok/s) step 4010/76294 | train loss 3.693626 | norm 0.2148 | lr 1.12e-03 | (3794.60 ms | 138167 tok/s) step 4011/76294 | train loss 3.642788 | norm 0.2103 | lr 1.12e-03 | (3821.11 ms | 137208 tok/s) step 4012/76294 | train loss 3.625241 | norm 0.2168 | lr 1.12e-03 | (3795.27 ms | 138143 tok/s) 
step 4013/76294 | train loss 3.651828 | norm 0.2056 | lr 1.12e-03 | (3797.53 ms | 138060 tok/s) step 4014/76294 | train loss 3.643559 | norm 0.2201 | lr 1.12e-03 | (3818.30 ms | 137309 tok/s) step 4015/76294 | train loss 3.673003 | norm 0.1995 | lr 1.12e-03 | (3798.61 ms | 138021 tok/s) step 4016/76294 | train loss 3.683752 | norm 0.2322 | lr 1.12e-03 | (3820.55 ms | 137228 tok/s) step 4017/76294 | train loss 3.591030 | norm 0.1996 | lr 1.12e-03 | (5173.40 ms | 101343 tok/s) step 4018/76294 | train loss 3.609664 | norm 0.2177 | lr 1.12e-03 | (3791.00 ms | 138298 tok/s) step 4019/76294 | train loss 3.702983 | norm 0.2333 | lr 1.12e-03 | (4076.76 ms | 128604 tok/s) step 4020/76294 | train loss 3.669041 | norm 0.2517 | lr 1.12e-03 | (3817.64 ms | 137333 tok/s) step 4021/76294 | train loss 3.626900 | norm 0.2114 | lr 1.12e-03 | (3798.27 ms | 138033 tok/s) step 4022/76294 | train loss 3.677659 | norm 0.2027 | lr 1.12e-03 | (3820.34 ms | 137236 tok/s) step 4023/76294 | train loss 3.622225 | norm 0.1933 | lr 1.12e-03 | (3802.49 ms | 137880 tok/s) step 4024/76294 | train loss 3.644252 | norm 0.2294 | lr 1.12e-03 | (3795.40 ms | 138138 tok/s) step 4025/76294 | train loss 3.667990 | norm 0.1930 | lr 1.12e-03 | (3829.85 ms | 136895 tok/s) step 4026/76294 | train loss 3.625181 | norm 0.2039 | lr 1.12e-03 | (3799.17 ms | 138001 tok/s) step 4027/76294 | train loss 3.714179 | norm 0.1862 | lr 1.12e-03 | (3824.45 ms | 137088 tok/s) step 4028/76294 | train loss 3.667169 | norm 0.2239 | lr 1.12e-03 | (3796.38 ms | 138102 tok/s) step 4029/76294 | train loss 3.610067 | norm 0.2138 | lr 1.12e-03 | (3829.11 ms | 136922 tok/s) step 4030/76294 | train loss 3.622817 | norm 0.2136 | lr 1.12e-03 | (3824.33 ms | 137093 tok/s) step 4031/76294 | train loss 3.630764 | norm 0.2351 | lr 1.12e-03 | (3799.01 ms | 138007 tok/s) step 4032/76294 | train loss 3.616707 | norm 0.1994 | lr 1.12e-03 | (3830.08 ms | 136887 tok/s) step 4033/76294 | train loss 3.613451 | norm 0.2195 | lr 1.12e-03 | (3836.70 ms 
| 136651 tok/s) step 4034/76294 | train loss 3.700845 | norm 0.2576 | lr 1.12e-03 | (3799.03 ms | 138006 tok/s) step 4035/76294 | train loss 3.646697 | norm 0.2854 | lr 1.12e-03 | (3825.66 ms | 137045 tok/s) step 4036/76294 | train loss 3.654112 | norm 0.2538 | lr 1.12e-03 | (3802.28 ms | 137888 tok/s) step 4037/76294 | train loss 3.640436 | norm 0.2143 | lr 1.12e-03 | (3876.13 ms | 135261 tok/s) step 4038/76294 | train loss 3.655010 | norm 0.2525 | lr 1.12e-03 | (3905.65 ms | 134238 tok/s) step 4039/76294 | train loss 3.676379 | norm 0.2711 | lr 1.12e-03 | (3852.83 ms | 136079 tok/s) step 4040/76294 | train loss 3.669377 | norm 0.3141 | lr 1.12e-03 | (3795.47 ms | 138135 tok/s) step 4041/76294 | train loss 3.714670 | norm 0.2634 | lr 1.12e-03 | (3842.42 ms | 136447 tok/s) step 4042/76294 | train loss 3.714353 | norm 0.3004 | lr 1.12e-03 | (3798.66 ms | 138019 tok/s) step 4043/76294 | train loss 3.673038 | norm 0.2726 | lr 1.12e-03 | (4035.93 ms | 129905 tok/s) step 4044/76294 | train loss 3.672074 | norm 0.2448 | lr 1.12e-03 | (3800.70 ms | 137945 tok/s) step 4045/76294 | train loss 3.611144 | norm 0.2543 | lr 1.12e-03 | (3826.31 ms | 137022 tok/s) step 4046/76294 | train loss 3.611842 | norm 0.2399 | lr 1.12e-03 | (3829.95 ms | 136892 tok/s) step 4047/76294 | train loss 3.631067 | norm 0.2333 | lr 1.12e-03 | (3800.36 ms | 137957 tok/s) step 4048/76294 | train loss 3.662270 | norm 0.2090 | lr 1.12e-03 | (3842.53 ms | 136444 tok/s) step 4049/76294 | train loss 3.626690 | norm 0.2113 | lr 1.12e-03 | (3803.10 ms | 137858 tok/s) step 4050/76294 | train loss 3.650202 | norm 0.2172 | lr 1.12e-03 | (3887.22 ms | 134875 tok/s) step 4051/76294 | train loss 3.625514 | norm 0.1955 | lr 1.12e-03 | (3796.78 ms | 138088 tok/s) step 4052/76294 | train loss 3.565532 | norm 0.1943 | lr 1.12e-03 | (3812.26 ms | 137527 tok/s) step 4053/76294 | train loss 3.697884 | norm 0.2216 | lr 1.12e-03 | (3796.23 ms | 138108 tok/s) step 4054/76294 | train loss 3.658347 | norm 0.2028 | lr 
1.12e-03 | (3875.44 ms | 135285 tok/s) step 4055/76294 | train loss 3.614084 | norm 0.2095 | lr 1.12e-03 | (3793.53 ms | 138206 tok/s) step 4056/76294 | train loss 3.667870 | norm 0.2122 | lr 1.12e-03 | (3904.80 ms | 134268 tok/s) step 4057/76294 | train loss 3.646541 | norm 0.1872 | lr 1.12e-03 | (3813.15 ms | 137495 tok/s) step 4058/76294 | train loss 3.624834 | norm 0.2105 | lr 1.12e-03 | (3887.88 ms | 134852 tok/s) step 4059/76294 | train loss 3.665561 | norm 0.1841 | lr 1.12e-03 | (3854.28 ms | 136027 tok/s) step 4060/76294 | train loss 3.711235 | norm 0.2173 | lr 1.12e-03 | (3801.67 ms | 137910 tok/s) step 4061/76294 | train loss 3.713498 | norm 0.2033 | lr 1.12e-03 | (3818.48 ms | 137303 tok/s) step 4062/76294 | train loss 3.627498 | norm 0.1839 | lr 1.12e-03 | (3801.78 ms | 137906 tok/s) step 4063/76294 | train loss 3.634711 | norm 0.2027 | lr 1.12e-03 | (3801.23 ms | 137926 tok/s) step 4064/76294 | train loss 3.693358 | norm 0.2065 | lr 1.12e-03 | (3802.58 ms | 137877 tok/s) step 4065/76294 | train loss 3.672349 | norm 0.2488 | lr 1.12e-03 | (3870.76 ms | 135449 tok/s) step 4066/76294 | train loss 3.680233 | norm 0.2474 | lr 1.12e-03 | (3924.14 ms | 133606 tok/s) step 4067/76294 | train loss 3.600026 | norm 0.2206 | lr 1.12e-03 | (3826.49 ms | 137015 tok/s) step 4068/76294 | train loss 3.632998 | norm 0.2353 | lr 1.12e-03 | (4629.07 ms | 113260 tok/s) step 4069/76294 | train loss 3.623293 | norm 0.1959 | lr 1.12e-03 | (3847.37 ms | 136272 tok/s) step 4070/76294 | train loss 3.623608 | norm 0.1863 | lr 1.12e-03 | (3807.36 ms | 137704 tok/s) step 4071/76294 | train loss 3.629119 | norm 0.2024 | lr 1.12e-03 | (3798.46 ms | 138026 tok/s) step 4072/76294 | train loss 3.684265 | norm 0.2030 | lr 1.12e-03 | (3876.44 ms | 135250 tok/s) step 4073/76294 | train loss 3.638816 | norm 0.2221 | lr 1.12e-03 | (3800.38 ms | 137957 tok/s) step 4074/76294 | train loss 3.662469 | norm 0.2203 | lr 1.12e-03 | (3814.25 ms | 137455 tok/s) step 4075/76294 | train loss 3.671177 | 
norm 0.2082 | lr 1.12e-03 | (4103.85 ms | 127755 tok/s) step 4076/76294 | train loss 3.598080 | norm 0.2032 | lr 1.12e-03 | (3793.89 ms | 138193 tok/s) step 4077/76294 | train loss 3.594048 | norm 0.2206 | lr 1.12e-03 | (3821.00 ms | 137212 tok/s) step 4078/76294 | train loss 3.740538 | norm 0.2280 | lr 1.12e-03 | (3830.38 ms | 136876 tok/s) step 4079/76294 | train loss 3.679370 | norm 0.2403 | lr 1.12e-03 | (3810.60 ms | 137587 tok/s) step 4080/76294 | train loss 3.653166 | norm 0.2130 | lr 1.12e-03 | (3799.70 ms | 137981 tok/s) step 4081/76294 | train loss 3.637593 | norm 0.2637 | lr 1.12e-03 | (3868.86 ms | 135515 tok/s) step 4082/76294 | train loss 3.650548 | norm 0.2957 | lr 1.12e-03 | (3799.41 ms | 137992 tok/s) step 4083/76294 | train loss 3.691654 | norm 0.2480 | lr 1.12e-03 | (3803.42 ms | 137847 tok/s) step 4084/76294 | train loss 3.702571 | norm 0.1984 | lr 1.12e-03 | (3826.47 ms | 137016 tok/s) step 4085/76294 | train loss 3.604600 | norm 0.2149 | lr 1.12e-03 | (3804.93 ms | 137792 tok/s) step 4086/76294 | train loss 3.622112 | norm 0.2066 | lr 1.12e-03 | (3810.85 ms | 137578 tok/s) step 4087/76294 | train loss 3.670508 | norm 0.2038 | lr 1.12e-03 | (3842.89 ms | 136430 tok/s) step 4088/76294 | train loss 3.656501 | norm 0.2224 | lr 1.12e-03 | (3804.22 ms | 137818 tok/s) step 4089/76294 | train loss 3.674908 | norm 0.2457 | lr 1.12e-03 | (3813.81 ms | 137471 tok/s) step 4090/76294 | train loss 3.669901 | norm 0.2247 | lr 1.12e-03 | (3827.20 ms | 136990 tok/s) step 4091/76294 | train loss 3.622338 | norm 0.2205 | lr 1.12e-03 | (3839.01 ms | 136568 tok/s) step 4092/76294 | train loss 3.642212 | norm 0.4438 | lr 1.12e-03 | (3805.08 ms | 137786 tok/s) step 4093/76294 | train loss 3.672554 | norm 0.2529 | lr 1.12e-03 | (6094.12 ms | 86032 tok/s) step 4094/76294 | train loss 3.689872 | norm 0.2722 | lr 1.12e-03 | (3795.31 ms | 138141 tok/s) step 4095/76294 | train loss 3.701792 | norm 0.2119 | lr 1.12e-03 | (3912.41 ms | 134006 tok/s) step 4096/76294 | train 
loss 3.683954 | norm 0.2482 | lr 1.12e-03 | (3798.40 ms | 138029 tok/s) step 4097/76294 | train loss 3.684795 | norm 0.2382 | lr 1.12e-03 | (3841.42 ms | 136483 tok/s) step 4098/76294 | train loss 3.640163 | norm 0.2775 | lr 1.12e-03 | (3800.44 ms | 137954 tok/s) step 4099/76294 | train loss 3.686959 | norm 0.2576 | lr 1.12e-03 | (3805.10 ms | 137785 tok/s) step 4100/76294 | train loss 3.660095 | norm 0.1988 | lr 1.12e-03 | (3832.27 ms | 136809 tok/s) step 4101/76294 | train loss 3.658571 | norm 0.2450 | lr 1.12e-03 | (3804.99 ms | 137789 tok/s) step 4102/76294 | train loss 3.556253 | norm 0.2495 | lr 1.12e-03 | (3831.24 ms | 136846 tok/s) step 4103/76294 | train loss 3.599556 | norm 0.2141 | lr 1.12e-03 | (3820.32 ms | 137237 tok/s) step 4104/76294 | train loss 3.709184 | norm 0.2104 | lr 1.12e-03 | (3836.99 ms | 136641 tok/s) step 4105/76294 | train loss 3.622810 | norm 0.1969 | lr 1.12e-03 | (3804.82 ms | 137796 tok/s) step 4106/76294 | train loss 3.594073 | norm 0.2478 | lr 1.12e-03 | (3803.40 ms | 137847 tok/s) step 4107/76294 | train loss 3.569604 | norm 0.2248 | lr 1.12e-03 | (3824.02 ms | 137104 tok/s) step 4108/76294 | train loss 3.577810 | norm 0.2131 | lr 1.12e-03 | (3804.39 ms | 137811 tok/s) step 4109/76294 | train loss 3.673025 | norm 0.2163 | lr 1.12e-03 | (3798.05 ms | 138041 tok/s) step 4110/76294 | train loss 3.582791 | norm 0.2257 | lr 1.12e-03 | (3802.53 ms | 137879 tok/s) step 4111/76294 | train loss 3.649463 | norm 0.2248 | lr 1.12e-03 | (3800.44 ms | 137954 tok/s) step 4112/76294 | train loss 3.572124 | norm 0.2033 | lr 1.12e-03 | (3832.86 ms | 136788 tok/s) step 4113/76294 | train loss 3.596984 | norm 0.2613 | lr 1.12e-03 | (3800.27 ms | 137961 tok/s) step 4114/76294 | train loss 3.691364 | norm 0.2311 | lr 1.12e-03 | (3826.21 ms | 137026 tok/s) step 4115/76294 | train loss 3.619277 | norm 0.2169 | lr 1.12e-03 | (3953.63 ms | 132609 tok/s) step 4116/76294 | train loss 3.641604 | norm 0.2036 | lr 1.12e-03 | (3861.48 ms | 135774 tok/s) step 
4117/76294 | train loss 3.652763 | norm 0.2229 | lr 1.12e-03 | (3827.93 ms | 136964 tok/s) step 4118/76294 | train loss 3.611185 | norm 0.2670 | lr 1.12e-03 | (3808.59 ms | 137659 tok/s) step 4119/76294 | train loss 3.678746 | norm 0.2217 | lr 1.12e-03 | (3819.83 ms | 137254 tok/s) step 4120/76294 | train loss 3.609497 | norm 0.2192 | lr 1.12e-03 | (3798.05 ms | 138041 tok/s) step 4121/76294 | train loss 3.661408 | norm 0.2117 | lr 1.12e-03 | (3843.65 ms | 136404 tok/s) step 4122/76294 | train loss 3.573402 | norm 0.2136 | lr 1.12e-03 | (3822.54 ms | 137157 tok/s) step 4123/76294 | train loss 3.628732 | norm 0.2537 | lr 1.12e-03 | (3807.96 ms | 137682 tok/s) step 4124/76294 | train loss 3.667647 | norm 0.2130 | lr 1.12e-03 | (3822.81 ms | 137147 tok/s) step 4125/76294 | train loss 3.687757 | norm 0.2060 | lr 1.12e-03 | (3805.98 ms | 137754 tok/s) step 4126/76294 | train loss 3.653877 | norm 0.2175 | lr 1.12e-03 | (3824.23 ms | 137097 tok/s) step 4127/76294 | train loss 3.572526 | norm 0.2220 | lr 1.12e-03 | (3807.33 ms | 137705 tok/s) step 4128/76294 | train loss 3.608683 | norm 0.2277 | lr 1.11e-03 | (3827.12 ms | 136993 tok/s) step 4129/76294 | train loss 3.640358 | norm 0.2151 | lr 1.11e-03 | (3806.61 ms | 137731 tok/s) step 4130/76294 | train loss 3.635143 | norm 0.2611 | lr 1.11e-03 | (3802.97 ms | 137863 tok/s) step 4131/76294 | train loss 3.621469 | norm 0.2766 | lr 1.11e-03 | (3842.16 ms | 136457 tok/s) step 4132/76294 | train loss 3.585368 | norm 0.2328 | lr 1.11e-03 | (3802.04 ms | 137896 tok/s) step 4133/76294 | train loss 3.677235 | norm 0.2995 | lr 1.11e-03 | (3894.23 ms | 134632 tok/s) step 4134/76294 | train loss 3.630114 | norm 0.2972 | lr 1.11e-03 | (3808.67 ms | 137656 tok/s) step 4135/76294 | train loss 3.612280 | norm 0.2871 | lr 1.11e-03 | (5850.05 ms | 89621 tok/s) step 4136/76294 | train loss 3.605838 | norm 0.3042 | lr 1.11e-03 | (3801.68 ms | 137910 tok/s) step 4137/76294 | train loss 3.593774 | norm 0.2674 | lr 1.11e-03 | (3848.69 ms | 
136225 tok/s) step 4138/76294 | train loss 3.652027 | norm 0.2398 | lr 1.11e-03 | (3801.56 ms | 137914 tok/s) step 4139/76294 | train loss 3.599329 | norm 0.2442 | lr 1.11e-03 | (3824.38 ms | 137091 tok/s) step 4140/76294 | train loss 3.608213 | norm 0.2572 | lr 1.11e-03 | (3797.33 ms | 138068 tok/s) step 4141/76294 | train loss 3.676383 | norm 0.2221 | lr 1.11e-03 | (3931.36 ms | 133360 tok/s) step 4142/76294 | train loss 3.581842 | norm 0.2332 | lr 1.11e-03 | (3831.33 ms | 136842 tok/s) step 4143/76294 | train loss 3.694206 | norm 0.2009 | lr 1.11e-03 | (3800.44 ms | 137955 tok/s) step 4144/76294 | train loss 3.561561 | norm 0.2014 | lr 1.11e-03 | (3873.11 ms | 135366 tok/s) step 4145/76294 | train loss 3.659945 | norm 0.2145 | lr 1.11e-03 | (3832.98 ms | 136784 tok/s) step 4146/76294 | train loss 3.646621 | norm 0.2006 | lr 1.11e-03 | (3802.15 ms | 137893 tok/s) step 4147/76294 | train loss 3.561371 | norm 0.2132 | lr 1.11e-03 | (4012.30 ms | 130670 tok/s) step 4148/76294 | train loss 3.644175 | norm 0.2055 | lr 1.11e-03 | (3847.62 ms | 136263 tok/s) step 4149/76294 | train loss 3.599180 | norm 0.2125 | lr 1.11e-03 | (3857.31 ms | 135921 tok/s) step 4150/76294 | train loss 3.672455 | norm 0.2220 | lr 1.11e-03 | (3797.35 ms | 138067 tok/s) step 4151/76294 | train loss 3.639988 | norm 0.2225 | lr 1.11e-03 | (3841.77 ms | 136470 tok/s) step 4152/76294 | train loss 3.676483 | norm 0.2549 | lr 1.11e-03 | (3826.45 ms | 137017 tok/s) step 4153/76294 | train loss 3.597610 | norm 0.1936 | lr 1.11e-03 | (3834.66 ms | 136724 tok/s) step 4154/76294 | train loss 3.626457 | norm 0.2254 | lr 1.11e-03 | (3797.48 ms | 138062 tok/s) step 4155/76294 | train loss 3.587968 | norm 0.2709 | lr 1.11e-03 | (4255.43 ms | 123205 tok/s) step 4156/76294 | train loss 3.615124 | norm 0.2537 | lr 1.11e-03 | (3795.19 ms | 138145 tok/s) step 4157/76294 | train loss 3.616103 | norm 0.2192 | lr 1.11e-03 | (3859.69 ms | 135837 tok/s) step 4158/76294 | train loss 3.647594 | norm 0.2923 | lr 1.11e-03 
| (3796.17 ms | 138110 tok/s) step 4159/76294 | train loss 3.669995 | norm 0.2679 | lr 1.11e-03 | (3834.89 ms | 136715 tok/s) step 4160/76294 | train loss 3.567180 | norm 0.2284 | lr 1.11e-03 | (3798.63 ms | 138020 tok/s) step 4161/76294 | train loss 3.599855 | norm 0.2264 | lr 1.11e-03 | (3827.49 ms | 136979 tok/s) step 4162/76294 | train loss 3.574805 | norm 0.2196 | lr 1.11e-03 | (3801.41 ms | 137919 tok/s) step 4163/76294 | train loss 3.509312 | norm 0.2537 | lr 1.11e-03 | (3868.18 ms | 135539 tok/s) step 4164/76294 | train loss 3.653712 | norm 0.3113 | lr 1.11e-03 | (3800.30 ms | 137960 tok/s) step 4165/76294 | train loss 3.606904 | norm 0.2603 | lr 1.11e-03 | (3851.38 ms | 136130 tok/s) step 4166/76294 | train loss 3.632860 | norm 0.2201 | lr 1.11e-03 | (3802.17 ms | 137892 tok/s) step 4167/76294 | train loss 3.538399 | norm 0.2583 | lr 1.11e-03 | (3806.64 ms | 137730 tok/s) step 4168/76294 | train loss 3.662100 | norm 0.2173 | lr 1.11e-03 | (3829.32 ms | 136914 tok/s) step 4169/76294 | train loss 3.554578 | norm 0.2044 | lr 1.11e-03 | (3809.58 ms | 137624 tok/s) step 4170/76294 | train loss 3.651216 | norm 0.1995 | lr 1.11e-03 | (3839.38 ms | 136555 tok/s) step 4171/76294 | train loss 3.626190 | norm 0.1902 | lr 1.11e-03 | (3805.49 ms | 137771 tok/s) step 4172/76294 | train loss 3.604066 | norm 0.2286 | lr 1.11e-03 | (3826.87 ms | 137002 tok/s) step 4173/76294 | train loss 3.585393 | norm 0.1982 | lr 1.11e-03 | (3807.21 ms | 137709 tok/s) step 4174/76294 | train loss 3.622584 | norm 0.1990 | lr 1.11e-03 | (3904.13 ms | 134291 tok/s) step 4175/76294 | train loss 3.643630 | norm 0.2049 | lr 1.11e-03 | (3798.91 ms | 138010 tok/s) step 4176/76294 | train loss 3.688326 | norm 0.2084 | lr 1.11e-03 | (3830.06 ms | 136888 tok/s) step 4177/76294 | train loss 3.636101 | norm 0.1890 | lr 1.11e-03 | (3810.63 ms | 137586 tok/s) step 4178/76294 | train loss 3.549184 | norm 0.1995 | lr 1.11e-03 | (3881.93 ms | 135059 tok/s) step 4179/76294 | train loss 3.603061 | norm 
0.1905 | lr 1.11e-03 | (3801.95 ms | 137900 tok/s) step 4180/76294 | train loss 3.569467 | norm 0.1843 | lr 1.11e-03 | (3808.21 ms | 137673 tok/s) step 4181/76294 | train loss 3.658895 | norm 0.2255 | lr 1.11e-03 | (3829.12 ms | 136921 tok/s) step 4182/76294 | train loss 3.599154 | norm 0.2125 | lr 1.11e-03 | (3806.25 ms | 137744 tok/s) step 4183/76294 | train loss 3.686402 | norm 0.2316 | lr 1.11e-03 | (4074.52 ms | 128675 tok/s) step 4184/76294 | train loss 3.562772 | norm 0.2054 | lr 1.11e-03 | (3926.37 ms | 133530 tok/s) step 4185/76294 | train loss 3.611863 | norm 0.1818 | lr 1.11e-03 | (3799.21 ms | 137999 tok/s) step 4186/76294 | train loss 3.675481 | norm 0.2320 | lr 1.11e-03 | (3868.99 ms | 135510 tok/s) step 4187/76294 | train loss 3.609117 | norm 0.2186 | lr 1.11e-03 | (3798.76 ms | 138016 tok/s) step 4188/76294 | train loss 3.626889 | norm 0.2116 | lr 1.11e-03 | (3838.77 ms | 136577 tok/s) step 4189/76294 | train loss 3.625137 | norm 0.2135 | lr 1.11e-03 | (3802.83 ms | 137868 tok/s) step 4190/76294 | train loss 3.631470 | norm 0.2122 | lr 1.11e-03 | (3812.12 ms | 137532 tok/s) step 4191/76294 | train loss 3.585626 | norm 0.2102 | lr 1.11e-03 | (3823.96 ms | 137106 tok/s) step 4192/76294 | train loss 3.603777 | norm 0.2926 | lr 1.11e-03 | (4805.46 ms | 109102 tok/s) step 4193/76294 | train loss 3.639783 | norm 0.2758 | lr 1.11e-03 | (3899.48 ms | 134451 tok/s) step 4194/76294 | train loss 3.668508 | norm 0.2011 | lr 1.11e-03 | (3885.80 ms | 134924 tok/s) step 4195/76294 | train loss 3.578444 | norm 0.2680 | lr 1.11e-03 | (3806.34 ms | 137741 tok/s) step 4196/76294 | train loss 3.602211 | norm 0.2366 | lr 1.11e-03 | (3832.67 ms | 136794 tok/s) step 4197/76294 | train loss 3.595567 | norm 0.2195 | lr 1.11e-03 | (4416.10 ms | 118722 tok/s) step 4198/76294 | train loss 3.607496 | norm 0.2885 | lr 1.11e-03 | (3798.49 ms | 138025 tok/s) step 4199/76294 | train loss 3.626240 | norm 0.2255 | lr 1.11e-03 | (3824.43 ms | 137089 tok/s) step 4200/76294 | train loss 
3.574358 | norm 0.2059 | lr 1.11e-03 | (3798.95 ms | 138009 tok/s) step 4201/76294 | train loss 3.762332 | norm 0.2735 | lr 1.11e-03 | (3807.51 ms | 137699 tok/s) step 4202/76294 | train loss 3.589160 | norm 0.2562 | lr 1.11e-03 | (3844.44 ms | 136376 tok/s) step 4203/76294 | train loss 3.627061 | norm 0.2232 | lr 1.11e-03 | (3805.18 ms | 137783 tok/s) step 4204/76294 | train loss 3.644523 | norm 0.2470 | lr 1.11e-03 | (3834.12 ms | 136743 tok/s) step 4205/76294 | train loss 3.605683 | norm 0.2044 | lr 1.11e-03 | (3806.60 ms | 137731 tok/s) step 4206/76294 | train loss 3.640840 | norm 0.2335 | lr 1.11e-03 | (3803.78 ms | 137833 tok/s) step 4207/76294 | train loss 3.585563 | norm 0.2423 | lr 1.11e-03 | (3855.20 ms | 135995 tok/s) step 4208/76294 | train loss 3.600064 | norm 0.2421 | lr 1.11e-03 | (3967.30 ms | 132152 tok/s) step 4209/76294 | train loss 3.584952 | norm 0.1944 | lr 1.11e-03 | (3883.49 ms | 135004 tok/s) step 4210/76294 | train loss 3.679675 | norm 0.2020 | lr 1.11e-03 | (3828.17 ms | 136955 tok/s) step 4211/76294 | train loss 3.616949 | norm 0.2099 | lr 1.11e-03 | (3827.00 ms | 136997 tok/s) step 4212/76294 | train loss 3.616390 | norm 0.2210 | lr 1.11e-03 | (3838.47 ms | 136588 tok/s) step 4213/76294 | train loss 3.604268 | norm 0.1951 | lr 1.11e-03 | (3831.14 ms | 136849 tok/s) step 4214/76294 | train loss 3.572152 | norm 0.1911 | lr 1.11e-03 | (3970.39 ms | 132049 tok/s) step 4215/76294 | train loss 3.690247 | norm 0.1863 | lr 1.11e-03 | (3804.84 ms | 137795 tok/s) step 4216/76294 | train loss 3.600476 | norm 0.2056 | lr 1.11e-03 | (3833.53 ms | 136764 tok/s) step 4217/76294 | train loss 3.600156 | norm 0.1862 | lr 1.11e-03 | (3810.10 ms | 137605 tok/s) step 4218/76294 | train loss 3.614570 | norm 0.1812 | lr 1.11e-03 | (3874.06 ms | 135333 tok/s) step 4219/76294 | train loss 3.592530 | norm 0.2120 | lr 1.11e-03 | (3808.76 ms | 137653 tok/s) step 4220/76294 | train loss 3.608548 | norm 0.2278 | lr 1.11e-03 | (3830.86 ms | 136859 tok/s) step 
4221/76294 | train loss 3.585439 | norm 0.2113 | lr 1.11e-03 | (3812.60 ms | 137515 tok/s) step 4222/76294 | train loss 3.612610 | norm 0.2086 | lr 1.11e-03 | (3810.84 ms | 137578 tok/s) step 4223/76294 | train loss 3.630784 | norm 0.1951 | lr 1.11e-03 | (3974.57 ms | 131911 tok/s) step 4224/76294 | train loss 3.639285 | norm 0.2138 | lr 1.11e-03 | (3811.24 ms | 137563 tok/s) step 4225/76294 | train loss 3.582930 | norm 0.1949 | lr 1.11e-03 | (3828.20 ms | 136954 tok/s) step 4226/76294 | train loss 3.616419 | norm 0.2245 | lr 1.11e-03 | (3811.23 ms | 137564 tok/s) step 4227/76294 | train loss 3.565567 | norm 0.2245 | lr 1.11e-03 | (3837.72 ms | 136615 tok/s) step 4228/76294 | train loss 3.641950 | norm 0.2208 | lr 1.11e-03 | (3807.18 ms | 137710 tok/s) step 4229/76294 | train loss 3.594979 | norm 0.2047 | lr 1.11e-03 | (3817.53 ms | 137337 tok/s) step 4230/76294 | train loss 3.637763 | norm 0.2066 | lr 1.11e-03 | (3808.97 ms | 137646 tok/s) step 4231/76294 | train loss 3.665388 | norm 0.2295 | lr 1.11e-03 | (3873.53 ms | 135352 tok/s) step 4232/76294 | train loss 3.569726 | norm 0.2123 | lr 1.11e-03 | (3806.55 ms | 137733 tok/s) step 4233/76294 | train loss 3.648404 | norm 0.2591 | lr 1.11e-03 | (3812.46 ms | 137520 tok/s) step 4234/76294 | train loss 3.591039 | norm 0.1861 | lr 1.11e-03 | (3913.59 ms | 133966 tok/s) step 4235/76294 | train loss 3.559094 | norm 0.2215 | lr 1.11e-03 | (3839.33 ms | 136557 tok/s) step 4236/76294 | train loss 3.564957 | norm 0.2143 | lr 1.11e-03 | (3836.05 ms | 136674 tok/s) step 4237/76294 | train loss 3.641382 | norm 0.1756 | lr 1.11e-03 | (3816.64 ms | 137369 tok/s) step 4238/76294 | train loss 3.605339 | norm 0.2127 | lr 1.11e-03 | (3814.81 ms | 137435 tok/s) step 4239/76294 | train loss 3.614854 | norm 0.2322 | lr 1.11e-03 | (3818.75 ms | 137293 tok/s) step 4240/76294 | train loss 3.650970 | norm 0.2988 | lr 1.11e-03 | (3838.23 ms | 136596 tok/s) step 4241/76294 | train loss 3.617457 | norm 0.2532 | lr 1.11e-03 | (3808.87 ms | 
137649 tok/s) step 4242/76294 | train loss 3.591812 | norm 0.2064 | lr 1.11e-03 | (3851.46 ms | 136127 tok/s) step 4243/76294 | train loss 3.641306 | norm 0.2303 | lr 1.11e-03 | (3864.18 ms | 135679 tok/s) step 4244/76294 | train loss 3.627638 | norm 0.2486 | lr 1.11e-03 | (3804.14 ms | 137820 tok/s) step 4245/76294 | train loss 3.551497 | norm 0.2663 | lr 1.11e-03 | (3832.94 ms | 136785 tok/s) step 4246/76294 | train loss 3.628405 | norm 0.2446 | lr 1.11e-03 | (3799.14 ms | 138002 tok/s) step 4247/76294 | train loss 3.621734 | norm 0.2076 | lr 1.11e-03 | (3860.28 ms | 135816 tok/s) step 4248/76294 | train loss 3.620414 | norm 0.2209 | lr 1.11e-03 | (3802.28 ms | 137888 tok/s) step 4249/76294 | train loss 3.610583 | norm 0.2525 | lr 1.11e-03 | (3822.99 ms | 137141 tok/s) step 4250/76294 | train loss 3.661225 | norm 0.2084 | lr 1.11e-03 | (3802.13 ms | 137893 tok/s) val loss: 3.606604 saving model checkpoint to ./results/gpt2-124M-gqa/step_4250.pth step 4251/76294 | train loss 3.762661 | norm 0.2342 | lr 1.11e-03 | (3861.49 ms | 135773 tok/s) step 4252/76294 | train loss 3.668341 | norm 0.2654 | lr 1.11e-03 | (3800.47 ms | 137953 tok/s) step 4253/76294 | train loss 3.591372 | norm 0.2276 | lr 1.11e-03 | (3835.71 ms | 136686 tok/s) step 4254/76294 | train loss 3.593302 | norm 0.2068 | lr 1.11e-03 | (3898.17 ms | 134496 tok/s) step 4255/76294 | train loss 3.618615 | norm 0.2087 | lr 1.11e-03 | (3796.59 ms | 138094 tok/s) step 4256/76294 | train loss 3.604648 | norm 0.2195 | lr 1.11e-03 | (3802.10 ms | 137894 tok/s) step 4257/76294 | train loss 3.636917 | norm 0.2083 | lr 1.11e-03 | (3841.21 ms | 136490 tok/s) step 4258/76294 | train loss 3.606872 | norm 0.2249 | lr 1.11e-03 | (3896.22 ms | 134563 tok/s) step 4259/76294 | train loss 3.752029 | norm 0.2124 | lr 1.11e-03 | (3802.99 ms | 137862 tok/s) step 4260/76294 | train loss 3.546556 | norm 0.2262 | lr 1.11e-03 | (3832.20 ms | 136811 tok/s) step 4261/76294 | train loss 3.645765 | norm 0.1848 | lr 1.11e-03 | (3804.39 
ms | 137811 tok/s) step 4262/76294 | train loss 3.573364 | norm 0.2245 | lr 1.11e-03 | (3828.66 ms | 136938 tok/s) step 4263/76294 | train loss 3.634310 | norm 0.1925 | lr 1.11e-03 | (3799.76 ms | 137979 tok/s) step 4264/76294 | train loss 3.577282 | norm 0.2265 | lr 1.11e-03 | (3810.41 ms | 137594 tok/s) step 4265/76294 | train loss 3.633804 | norm 0.2195 | lr 1.11e-03 | (3821.95 ms | 137178 tok/s) step 4266/76294 | train loss 3.602956 | norm 0.2192 | lr 1.11e-03 | (3806.42 ms | 137738 tok/s) step 4267/76294 | train loss 3.635507 | norm 0.2853 | lr 1.11e-03 | (3826.03 ms | 137032 tok/s) step 4268/76294 | train loss 3.612783 | norm 0.2592 | lr 1.11e-03 | (3802.60 ms | 137876 tok/s) step 4269/76294 | train loss 3.582223 | norm 0.2012 | lr 1.11e-03 | (3821.76 ms | 137185 tok/s) step 4270/76294 | train loss 3.704929 | norm 0.2432 | lr 1.11e-03 | (3806.56 ms | 137733 tok/s) step 4271/76294 | train loss 3.580467 | norm 0.2688 | lr 1.11e-03 | (3802.07 ms | 137895 tok/s) step 4272/76294 | train loss 3.659604 | norm 0.2478 | lr 1.11e-03 | (3848.90 ms | 136218 tok/s) step 4273/76294 | train loss 3.617187 | norm 0.2539 | lr 1.11e-03 | (3806.81 ms | 137724 tok/s) step 4274/76294 | train loss 3.619928 | norm 0.2174 | lr 1.11e-03 | (3880.75 ms | 135100 tok/s) step 4275/76294 | train loss 3.596364 | norm 0.2268 | lr 1.11e-03 | (3806.82 ms | 137723 tok/s) step 4276/76294 | train loss 3.555808 | norm 0.2426 | lr 1.11e-03 | (3817.52 ms | 137337 tok/s) step 4277/76294 | train loss 3.607526 | norm 0.2164 | lr 1.11e-03 | (3834.22 ms | 136739 tok/s) step 4278/76294 | train loss 3.557925 | norm 0.2147 | lr 1.11e-03 | (3810.75 ms | 137581 tok/s) step 4279/76294 | train loss 3.600075 | norm 0.2197 | lr 1.11e-03 | (3845.33 ms | 136344 tok/s) step 4280/76294 | train loss 3.527216 | norm 0.2298 | lr 1.11e-03 | (3813.84 ms | 137470 tok/s) step 4281/76294 | train loss 3.592403 | norm 0.2335 | lr 1.11e-03 | (3828.55 ms | 136942 tok/s) step 4282/76294 | train loss 3.641088 | norm 0.2421 | lr 
1.11e-03 | (3809.12 ms | 137640 tok/s) step 4283/76294 | train loss 3.519953 | norm 0.2364 | lr 1.11e-03 | (3985.04 ms | 131564 tok/s) step 4284/76294 | train loss 3.612118 | norm 0.2637 | lr 1.11e-03 | (3798.44 ms | 138027 tok/s) step 4285/76294 | train loss 3.678198 | norm 0.3231 | lr 1.11e-03 | (5524.22 ms | 94907 tok/s) step 4286/76294 | train loss 3.583553 | norm 0.2472 | lr 1.11e-03 | (3811.06 ms | 137570 tok/s) step 4287/76294 | train loss 3.573008 | norm 0.2379 | lr 1.11e-03 | (3885.89 ms | 134921 tok/s) step 4288/76294 | train loss 3.759536 | norm 0.2569 | lr 1.11e-03 | (3793.67 ms | 138201 tok/s) step 4289/76294 | train loss 3.591070 | norm 0.2155 | lr 1.11e-03 | (3806.11 ms | 137749 tok/s) step 4290/76294 | train loss 3.641405 | norm 0.2460 | lr 1.11e-03 | (3818.30 ms | 137309 tok/s) step 4291/76294 | train loss 3.649053 | norm 0.2235 | lr 1.11e-03 | (3806.05 ms | 137751 tok/s) step 4292/76294 | train loss 3.601717 | norm 0.2150 | lr 1.11e-03 | (3935.18 ms | 133231 tok/s) step 4293/76294 | train loss 3.609100 | norm 0.2196 | lr 1.11e-03 | (3796.60 ms | 138094 tok/s) step 4294/76294 | train loss 3.613882 | norm 0.2115 | lr 1.11e-03 | (3882.66 ms | 135033 tok/s) step 4295/76294 | train loss 3.651860 | norm 0.2460 | lr 1.11e-03 | (3794.23 ms | 138180 tok/s) step 4296/76294 | train loss 3.614032 | norm 0.2281 | lr 1.11e-03 | (3831.13 ms | 136850 tok/s) step 4297/76294 | train loss 3.591965 | norm 0.1862 | lr 1.11e-03 | (3882.82 ms | 135028 tok/s) step 4298/76294 | train loss 3.593147 | norm 0.2318 | lr 1.11e-03 | (3868.35 ms | 135533 tok/s) step 4299/76294 | train loss 3.598393 | norm 0.1899 | lr 1.11e-03 | (3807.66 ms | 137693 tok/s) step 4300/76294 | train loss 3.643331 | norm 0.1904 | lr 1.11e-03 | (3866.49 ms | 135598 tok/s) step 4301/76294 | train loss 3.625278 | norm 0.2063 | lr 1.11e-03 | (3806.61 ms | 137731 tok/s) step 4302/76294 | train loss 3.603498 | norm 0.2102 | lr 1.11e-03 | (3956.58 ms | 132511 tok/s) step 4303/76294 | train loss 3.592335 | 
norm 0.1934 | lr 1.11e-03 | (3840.67 ms | 136510 tok/s) step 4304/76294 | train loss 3.544651 | norm 0.2217 | lr 1.11e-03 | (3825.08 ms | 137066 tok/s) step 4305/76294 | train loss 3.602540 | norm 0.1652 | lr 1.11e-03 | (3882.70 ms | 135032 tok/s) step 4306/76294 | train loss 3.600689 | norm 0.1996 | lr 1.11e-03 | (3810.04 ms | 137607 tok/s) step 4307/76294 | train loss 3.633987 | norm 0.1895 | lr 1.11e-03 | (3848.14 ms | 136245 tok/s) step 4308/76294 | train loss 3.550175 | norm 0.1582 | lr 1.11e-03 | (3813.24 ms | 137492 tok/s) step 4309/76294 | train loss 3.600924 | norm 0.1979 | lr 1.11e-03 | (3880.53 ms | 135107 tok/s) step 4310/76294 | train loss 3.663862 | norm 0.1952 | lr 1.11e-03 | (3810.81 ms | 137579 tok/s) step 4311/76294 | train loss 3.650212 | norm 0.2196 | lr 1.11e-03 | (3986.82 ms | 131505 tok/s) step 4312/76294 | train loss 3.606143 | norm 0.2502 | lr 1.11e-03 | (3803.47 ms | 137845 tok/s) step 4313/76294 | train loss 3.605212 | norm 0.2102 | lr 1.11e-03 | (3839.23 ms | 136561 tok/s) step 4314/76294 | train loss 3.665488 | norm 0.2120 | lr 1.11e-03 | (3831.74 ms | 136828 tok/s) step 4315/76294 | train loss 3.586775 | norm 0.2250 | lr 1.11e-03 | (3809.85 ms | 137614 tok/s) step 4316/76294 | train loss 3.618382 | norm 0.2224 | lr 1.11e-03 | (3876.22 ms | 135258 tok/s) step 4317/76294 | train loss 3.575846 | norm 0.2406 | lr 1.11e-03 | (3809.14 ms | 137639 tok/s) step 4318/76294 | train loss 3.535158 | norm 0.2195 | lr 1.11e-03 | (3860.69 ms | 135802 tok/s) step 4319/76294 | train loss 3.593664 | norm 0.2342 | lr 1.11e-03 | (3803.18 ms | 137855 tok/s) step 4320/76294 | train loss 3.557598 | norm 0.2113 | lr 1.11e-03 | (3890.32 ms | 134767 tok/s) step 4321/76294 | train loss 3.585291 | norm 0.2651 | lr 1.11e-03 | (3803.46 ms | 137845 tok/s) step 4322/76294 | train loss 3.662080 | norm 0.2116 | lr 1.11e-03 | (3865.08 ms | 135647 tok/s) step 4323/76294 | train loss 3.620059 | norm 0.2277 | lr 1.11e-03 | (3895.78 ms | 134579 tok/s) step 4324/76294 | train 
loss 3.591820 | norm 0.2684 | lr 1.11e-03 | (3809.88 ms | 137613 tok/s) step 4325/76294 | train loss 3.629093 | norm 0.2530 | lr 1.11e-03 | (3830.74 ms | 136863 tok/s) step 4326/76294 | train loss 3.574010 | norm 0.2174 | lr 1.11e-03 | (3818.63 ms | 137297 tok/s) step 4327/76294 | train loss 3.605105 | norm 0.2711 | lr 1.11e-03 | (3833.53 ms | 136764 tok/s) step 4328/76294 | train loss 3.581733 | norm 0.3013 | lr 1.11e-03 | (3805.77 ms | 137761 tok/s) step 4329/76294 | train loss 3.607075 | norm 0.2962 | lr 1.10e-03 | (3811.12 ms | 137568 tok/s) step 4330/76294 | train loss 3.690929 | norm 0.2561 | lr 1.10e-03 | (3906.55 ms | 134208 tok/s) step 4331/76294 | train loss 3.565409 | norm 0.2212 | lr 1.10e-03 | (3816.75 ms | 137365 tok/s) step 4332/76294 | train loss 3.605259 | norm 0.2914 | lr 1.10e-03 | (3857.60 ms | 135911 tok/s) step 4333/76294 | train loss 3.629528 | norm 0.2279 | lr 1.10e-03 | (3818.43 ms | 137305 tok/s) step 4334/76294 | train loss 3.627127 | norm 0.2846 | lr 1.10e-03 | (3824.32 ms | 137093 tok/s) step 4335/76294 | train loss 3.578023 | norm 0.2449 | lr 1.10e-03 | (3841.90 ms | 136466 tok/s) step 4336/76294 | train loss 3.718493 | norm 0.2315 | lr 1.10e-03 | (3841.87 ms | 136467 tok/s) step 4337/76294 | train loss 3.635369 | norm 0.2256 | lr 1.10e-03 | (3834.14 ms | 136742 tok/s) step 4338/76294 | train loss 3.699889 | norm 0.2188 | lr 1.10e-03 | (3813.32 ms | 137488 tok/s) step 4339/76294 | train loss 3.564463 | norm 0.2243 | lr 1.10e-03 | (3827.98 ms | 136962 tok/s) step 4340/76294 | train loss 3.577738 | norm 0.2150 | lr 1.10e-03 | (3819.04 ms | 137283 tok/s) step 4341/76294 | train loss 3.602427 | norm 0.2032 | lr 1.10e-03 | (3987.62 ms | 131479 tok/s) step 4342/76294 | train loss 3.577372 | norm 0.1808 | lr 1.10e-03 | (3806.92 ms | 137720 tok/s) step 4343/76294 | train loss 3.635489 | norm 0.2180 | lr 1.10e-03 | (3836.32 ms | 136664 tok/s) step 4344/76294 | train loss 3.555465 | norm 0.2295 | lr 1.10e-03 | (3807.52 ms | 137698 tok/s) step 
4345/76294 | train loss 3.634376 | norm 0.1980 | lr 1.10e-03 | (3835.55 ms | 136692 tok/s) step 4346/76294 | train loss 3.567348 | norm 0.2376 | lr 1.10e-03 | (3831.64 ms | 136831 tok/s) step 4347/76294 | train loss 3.590717 | norm 0.1950 | lr 1.10e-03 | (3809.40 ms | 137630 tok/s) step 4348/76294 | train loss 3.651178 | norm 0.1967 | lr 1.10e-03 | (3808.15 ms | 137675 tok/s) step 4349/76294 | train loss 3.620517 | norm 0.1762 | lr 1.10e-03 | (3835.34 ms | 136699 tok/s) step 4350/76294 | train loss 3.660627 | norm 0.2403 | lr 1.10e-03 | (3910.84 ms | 134060 tok/s) step 4351/76294 | train loss 3.642546 | norm 0.2592 | lr 1.10e-03 | (3805.17 ms | 137783 tok/s) step 4352/76294 | train loss 3.635250 | norm 0.2213 | lr 1.10e-03 | (3871.13 ms | 135435 tok/s) step 4353/76294 | train loss 3.596405 | norm 0.1953 | lr 1.10e-03 | (3806.65 ms | 137730 tok/s) step 4354/76294 | train loss 3.584526 | norm 0.2231 | lr 1.10e-03 | (3833.24 ms | 136774 tok/s) step 4355/76294 | train loss 3.614726 | norm 0.2019 | lr 1.10e-03 | (3804.65 ms | 137802 tok/s) step 4356/76294 | train loss 3.554825 | norm 0.2009 | lr 1.10e-03 | (3880.71 ms | 135101 tok/s) step 4357/76294 | train loss 3.642508 | norm 0.2154 | lr 1.10e-03 | (3807.13 ms | 137712 tok/s) step 4358/76294 | train loss 3.558942 | norm 0.1839 | lr 1.10e-03 | (3811.92 ms | 137539 tok/s) step 4359/76294 | train loss 3.591142 | norm 0.1939 | lr 1.10e-03 | (3830.61 ms | 136868 tok/s) step 4360/76294 | train loss 3.623789 | norm 0.1909 | lr 1.10e-03 | (3807.95 ms | 137683 tok/s) step 4361/76294 | train loss 3.655725 | norm 0.1902 | lr 1.10e-03 | (3822.74 ms | 137150 tok/s) step 4362/76294 | train loss 3.602469 | norm 0.1748 | lr 1.10e-03 | (3809.20 ms | 137637 tok/s) step 4363/76294 | train loss 3.632277 | norm 0.1750 | lr 1.10e-03 | (3831.14 ms | 136849 tok/s) step 4364/76294 | train loss 3.602000 | norm 0.1721 | lr 1.10e-03 | (3834.88 ms | 136716 tok/s) step 4365/76294 | train loss 3.634148 | norm 0.2080 | lr 1.10e-03 | (3829.68 ms | 
136901 tok/s) step 4366/76294 | train loss 3.615568 | norm 0.1965 | lr 1.10e-03 | (3839.16 ms | 136563 tok/s) step 4367/76294 | train loss 3.667305 | norm 0.1903 | lr 1.10e-03 | (3805.59 ms | 137768 tok/s) step 4368/76294 | train loss 3.522917 | norm 0.2139 | lr 1.10e-03 | (3860.34 ms | 135814 tok/s) step 4369/76294 | train loss 3.555851 | norm 0.2491 | lr 1.10e-03 | (3850.93 ms | 136146 tok/s) step 4370/76294 | train loss 3.616126 | norm 0.3012 | lr 1.10e-03 | (3895.20 ms | 134598 tok/s) step 4371/76294 | train loss 3.616403 | norm 0.2541 | lr 1.10e-03 | (3802.77 ms | 137870 tok/s) step 4372/76294 | train loss 3.637829 | norm 0.2101 | lr 1.10e-03 | (3855.85 ms | 135972 tok/s) step 4373/76294 | train loss 3.620105 | norm 0.2533 | lr 1.10e-03 | (3802.90 ms | 137865 tok/s) step 4374/76294 | train loss 3.700395 | norm 0.2056 | lr 1.10e-03 | (3829.89 ms | 136894 tok/s) step 4375/76294 | train loss 3.563666 | norm 0.2288 | lr 1.10e-03 | (3799.58 ms | 137986 tok/s) step 4376/76294 | train loss 3.621922 | norm 0.2512 | lr 1.10e-03 | (3855.21 ms | 135995 tok/s) step 4377/76294 | train loss 3.609949 | norm 0.2336 | lr 1.10e-03 | (3802.21 ms | 137890 tok/s) step 4378/76294 | train loss 3.620075 | norm 0.2515 | lr 1.10e-03 | (3855.57 ms | 135982 tok/s) step 4379/76294 | train loss 3.546390 | norm 0.2343 | lr 1.10e-03 | (3802.07 ms | 137896 tok/s) step 4380/76294 | train loss 3.613572 | norm 0.2098 | lr 1.10e-03 | (3925.23 ms | 133569 tok/s) step 4381/76294 | train loss 3.694736 | norm 0.2719 | lr 1.10e-03 | (3795.91 ms | 138119 tok/s) step 4382/76294 | train loss 3.605510 | norm 0.2133 | lr 1.10e-03 | (3866.67 ms | 135592 tok/s) step 4383/76294 | train loss 3.600063 | norm 0.2531 | lr 1.10e-03 | (3799.12 ms | 138003 tok/s) step 4384/76294 | train loss 3.593854 | norm 0.2134 | lr 1.10e-03 | (3834.32 ms | 136736 tok/s) step 4385/76294 | train loss 3.602361 | norm 0.2095 | lr 1.10e-03 | (3799.48 ms | 137989 tok/s) step 4386/76294 | train loss 3.624114 | norm 0.2606 | lr 1.10e-03 
| (3853.20 ms | 136066 tok/s) step 4387/76294 | train loss 3.636575 | norm 0.2059 | lr 1.10e-03 | (4157.88 ms | 126095 tok/s) step 4388/76294 | train loss 3.642777 | norm 0.2303 | lr 1.10e-03 | (3798.38 ms | 138029 tok/s) step 4389/76294 | train loss 3.618690 | norm 0.2087 | lr 1.10e-03 | (3960.07 ms | 132393 tok/s) step 4390/76294 | train loss 3.618988 | norm 0.2147 | lr 1.10e-03 | (3800.44 ms | 137955 tok/s) step 4391/76294 | train loss 3.589428 | norm 0.2034 | lr 1.10e-03 | (3800.44 ms | 137954 tok/s) step 4392/76294 | train loss 3.622851 | norm 0.1814 | lr 1.10e-03 | (3825.39 ms | 137055 tok/s) step 4393/76294 | train loss 3.623804 | norm 0.2133 | lr 1.10e-03 | (3862.21 ms | 135748 tok/s) step 4394/76294 | train loss 3.627068 | norm 0.2779 | lr 1.10e-03 | (3801.80 ms | 137905 tok/s) step 4395/76294 | train loss 3.585335 | norm 0.2487 | lr 1.10e-03 | (3868.04 ms | 135544 tok/s) step 4396/76294 | train loss 3.634963 | norm 0.1869 | lr 1.10e-03 | (3828.92 ms | 136928 tok/s) step 4397/76294 | train loss 3.543426 | norm 0.2526 | lr 1.10e-03 | (3804.79 ms | 137797 tok/s) step 4398/76294 | train loss 3.591841 | norm 0.2265 | lr 1.10e-03 | (3836.58 ms | 136655 tok/s) step 4399/76294 | train loss 3.702049 | norm 0.2334 | lr 1.10e-03 | (3808.32 ms | 137669 tok/s) step 4400/76294 | train loss 3.576466 | norm 0.2335 | lr 1.10e-03 | (3900.45 ms | 134417 tok/s) step 4401/76294 | train loss 3.618629 | norm 0.2179 | lr 1.10e-03 | (3799.72 ms | 137981 tok/s) step 4402/76294 | train loss 3.643424 | norm 0.2236 | lr 1.10e-03 | (3819.42 ms | 137269 tok/s) step 4403/76294 | train loss 3.637166 | norm 0.2428 | lr 1.10e-03 | (3834.31 ms | 136736 tok/s) step 4404/76294 | train loss 3.626836 | norm 0.2279 | lr 1.10e-03 | (3978.31 ms | 131787 tok/s) step 4405/76294 | train loss 3.589684 | norm 0.2395 | lr 1.10e-03 | (3794.70 ms | 138163 tok/s) step 4406/76294 | train loss 3.599571 | norm 0.2093 | lr 1.10e-03 | (3934.83 ms | 133243 tok/s) step 4407/76294 | train loss 3.565166 | norm 
0.2127 | lr 1.10e-03 | (3799.12 ms | 138002 tok/s) step 4408/76294 | train loss 3.591097 | norm 0.2203 | lr 1.10e-03 | (5640.34 ms | 92953 tok/s) step 4409/76294 | train loss 3.583326 | norm 0.2356 | lr 1.10e-03 | (3930.93 ms | 133375 tok/s) step 4410/76294 | train loss 3.600907 | norm 0.1751 | lr 1.10e-03 | (3793.41 ms | 138210 tok/s) step 4411/76294 | train loss 3.620983 | norm 0.2229 | lr 1.10e-03 | (3796.40 ms | 138101 tok/s) step 4412/76294 | train loss 3.601877 | norm 0.2152 | lr 1.10e-03 | (3833.70 ms | 136758 tok/s) step 4413/76294 | train loss 3.568057 | norm 0.2127 | lr 1.10e-03 | (3826.28 ms | 137023 tok/s) step 4414/76294 | train loss 3.643848 | norm 0.2548 | lr 1.10e-03 | (3797.40 ms | 138065 tok/s) step 4415/76294 | train loss 3.579507 | norm 0.2708 | lr 1.10e-03 | (3812.04 ms | 137535 tok/s) step 4416/76294 | train loss 3.619565 | norm 0.2960 | lr 1.10e-03 | (4273.00 ms | 122698 tok/s) step 4417/76294 | train loss 3.622442 | norm 0.2437 | lr 1.10e-03 | (3804.72 ms | 137799 tok/s) step 4418/76294 | train loss 3.561802 | norm 0.2037 | lr 1.10e-03 | (3910.99 ms | 134055 tok/s) step 4419/76294 | train loss 3.595191 | norm 0.2602 | lr 1.10e-03 | (3803.10 ms | 137858 tok/s) step 4420/76294 | train loss 3.569484 | norm 0.2244 | lr 1.10e-03 | (3875.30 ms | 135290 tok/s) step 4421/76294 | train loss 3.554521 | norm 0.2300 | lr 1.10e-03 | (3797.85 ms | 138049 tok/s) step 4422/76294 | train loss 3.635416 | norm 0.2399 | lr 1.10e-03 | (3921.68 ms | 133690 tok/s) step 4423/76294 | train loss 3.644192 | norm 0.2389 | lr 1.10e-03 | (3795.98 ms | 138117 tok/s) step 4424/76294 | train loss 3.626852 | norm 0.2047 | lr 1.10e-03 | (3865.29 ms | 135640 tok/s) step 4425/76294 | train loss 3.597776 | norm 0.2254 | lr 1.10e-03 | (3801.50 ms | 137916 tok/s) step 4426/76294 | train loss 3.545833 | norm 0.2114 | lr 1.10e-03 | (4141.69 ms | 126588 tok/s) step 4427/76294 | train loss 3.713745 | norm 0.1987 | lr 1.10e-03 | (3843.87 ms | 136396 tok/s) step 4428/76294 | train loss 
3.634717 | norm 0.2152 | lr 1.10e-03 | (3798.43 ms | 138028 tok/s) step 4429/76294 | train loss 3.637772 | norm 0.2085 | lr 1.10e-03 | (3832.37 ms | 136805 tok/s) step 4430/76294 | train loss 3.619387 | norm 0.2041 | lr 1.10e-03 | (3808.05 ms | 137679 tok/s) step 4431/76294 | train loss 3.630625 | norm 0.1968 | lr 1.10e-03 | (3840.88 ms | 136502 tok/s) step 4432/76294 | train loss 3.636670 | norm 0.1993 | lr 1.10e-03 | (3836.60 ms | 136655 tok/s) step 4433/76294 | train loss 3.638180 | norm 0.2065 | lr 1.10e-03 | (3834.58 ms | 136726 tok/s) step 4434/76294 | train loss 3.637653 | norm 0.2429 | lr 1.10e-03 | (3799.92 ms | 137974 tok/s) step 4435/76294 | train loss 3.676901 | norm 0.2151 | lr 1.10e-03 | (3870.70 ms | 135450 tok/s) step 4436/76294 | train loss 3.544007 | norm 0.2154 | lr 1.10e-03 | (3806.83 ms | 137723 tok/s) step 4437/76294 | train loss 3.588933 | norm 0.1887 | lr 1.10e-03 | (3817.38 ms | 137342 tok/s) step 4438/76294 | train loss 3.551542 | norm 0.2048 | lr 1.10e-03 | (3811.89 ms | 137540 tok/s) step 4439/76294 | train loss 3.573745 | norm 0.1977 | lr 1.10e-03 | (3814.52 ms | 137445 tok/s) step 4440/76294 | train loss 3.612035 | norm 0.2440 | lr 1.10e-03 | (3831.63 ms | 136832 tok/s) step 4441/76294 | train loss 3.606776 | norm 0.2691 | lr 1.10e-03 | (3816.86 ms | 137361 tok/s) step 4442/76294 | train loss 3.586742 | norm 0.2064 | lr 1.10e-03 | (3811.09 ms | 137569 tok/s) step 4443/76294 | train loss 3.561445 | norm 0.1998 | lr 1.10e-03 | (3837.22 ms | 136632 tok/s) step 4444/76294 | train loss 3.701096 | norm 0.2202 | lr 1.10e-03 | (3809.26 ms | 137635 tok/s) step 4445/76294 | train loss 3.586649 | norm 0.1989 | lr 1.10e-03 | (3835.77 ms | 136684 tok/s) step 4446/76294 | train loss 3.549202 | norm 0.2073 | lr 1.10e-03 | (3827.18 ms | 136991 tok/s) step 4447/76294 | train loss 3.587815 | norm 0.2041 | lr 1.10e-03 | (4058.77 ms | 129174 tok/s) step 4448/76294 | train loss 3.638057 | norm 0.1795 | lr 1.10e-03 | (3808.30 ms | 137670 tok/s) step 
4449/76294 | train loss 3.654896 | norm 0.2153 | lr 1.10e-03 | (3859.41 ms | 135847 tok/s) step 4450/76294 | train loss 3.591691 | norm 0.2141 | lr 1.10e-03 | (3807.73 ms | 137690 tok/s) step 4451/76294 | train loss 3.573149 | norm 0.1948 | lr 1.10e-03 | (3819.58 ms | 137263 tok/s) step 4452/76294 | train loss 3.566887 | norm 0.2187 | lr 1.10e-03 | (3807.10 ms | 137713 tok/s) step 4453/76294 | train loss 3.586493 | norm 0.1920 | lr 1.10e-03 | (3833.26 ms | 136773 tok/s) step 4454/76294 | train loss 3.610698 | norm 0.2220 | lr 1.10e-03 | (3835.85 ms | 136681 tok/s) step 4455/76294 | train loss 3.638645 | norm 0.1787 | lr 1.10e-03 | (3814.66 ms | 137440 tok/s) step 4456/76294 | train loss 3.554832 | norm 0.1999 | lr 1.10e-03 | (4050.95 ms | 129424 tok/s) step 4457/76294 | train loss 3.498554 | norm 0.1830 | lr 1.10e-03 | (3876.88 ms | 135235 tok/s) step 4458/76294 | train loss 3.755106 | norm 0.2622 | lr 1.10e-03 | (3807.92 ms | 137684 tok/s) step 4459/76294 | train loss 3.580809 | norm 0.2545 | lr 1.10e-03 | (3840.80 ms | 136505 tok/s) step 4460/76294 | train loss 3.537359 | norm 0.2243 | lr 1.10e-03 | (3808.09 ms | 137678 tok/s) step 4461/76294 | train loss 3.635018 | norm 0.2412 | lr 1.10e-03 | (3824.71 ms | 137079 tok/s) step 4462/76294 | train loss 3.593509 | norm 0.2157 | lr 1.10e-03 | (3830.49 ms | 136872 tok/s) step 4463/76294 | train loss 3.654512 | norm 0.2485 | lr 1.10e-03 | (3809.86 ms | 137614 tok/s) step 4464/76294 | train loss 3.547712 | norm 0.2385 | lr 1.10e-03 | (3805.57 ms | 137769 tok/s) step 4465/76294 | train loss 3.592331 | norm 0.2497 | lr 1.10e-03 | (3831.82 ms | 136825 tok/s) step 4466/76294 | train loss 3.600944 | norm 0.2567 | lr 1.10e-03 | (3811.04 ms | 137571 tok/s) step 4467/76294 | train loss 3.597122 | norm 0.2337 | lr 1.10e-03 | (3930.86 ms | 133377 tok/s) step 4468/76294 | train loss 3.594860 | norm 0.2249 | lr 1.10e-03 | (3807.56 ms | 137697 tok/s) step 4469/76294 | train loss 3.638140 | norm 0.2298 | lr 1.10e-03 | (3822.83 ms | 
137146 tok/s) step 4470/76294 | train loss 3.621948 | norm 0.2289 | lr 1.10e-03 | (3811.25 ms | 137563 tok/s) step 4471/76294 | train loss 3.571342 | norm 0.2218 | lr 1.10e-03 | (3827.70 ms | 136972 tok/s) step 4472/76294 | train loss 3.572684 | norm 0.2255 | lr 1.10e-03 | (3811.07 ms | 137570 tok/s) step 4473/76294 | train loss 3.605803 | norm 0.3078 | lr 1.10e-03 | (3830.31 ms | 136879 tok/s) step 4474/76294 | train loss 3.647984 | norm 0.3176 | lr 1.10e-03 | (3812.99 ms | 137500 tok/s) step 4475/76294 | train loss 3.615502 | norm 0.2141 | lr 1.10e-03 | (3839.87 ms | 136538 tok/s) step 4476/76294 | train loss 3.580509 | norm 0.2658 | lr 1.10e-03 | (3834.25 ms | 136738 tok/s) step 4477/76294 | train loss 3.620548 | norm 0.2471 | lr 1.10e-03 | (3805.98 ms | 137754 tok/s) step 4478/76294 | train loss 3.802500 | norm 0.2092 | lr 1.10e-03 | (3869.65 ms | 135487 tok/s) step 4479/76294 | train loss 3.611662 | norm 0.2575 | lr 1.10e-03 | (3851.75 ms | 136117 tok/s) step 4480/76294 | train loss 3.594521 | norm 0.2145 | lr 1.10e-03 | (3813.25 ms | 137491 tok/s) step 4481/76294 | train loss 3.645403 | norm 0.2096 | lr 1.10e-03 | (3831.79 ms | 136826 tok/s) step 4482/76294 | train loss 3.619261 | norm 0.2326 | lr 1.10e-03 | (3812.57 ms | 137516 tok/s) step 4483/76294 | train loss 3.576027 | norm 0.2270 | lr 1.10e-03 | (3829.03 ms | 136925 tok/s) step 4484/76294 | train loss 3.606580 | norm 0.2192 | lr 1.10e-03 | (3831.30 ms | 136843 tok/s) step 4485/76294 | train loss 3.616954 | norm 0.2067 | lr 1.10e-03 | (3859.87 ms | 135831 tok/s) step 4486/76294 | train loss 3.663967 | norm 0.2040 | lr 1.10e-03 | (3814.79 ms | 137436 tok/s) step 4487/76294 | train loss 3.547372 | norm 0.2157 | lr 1.10e-03 | (3833.95 ms | 136749 tok/s) step 4488/76294 | train loss 3.568857 | norm 0.1953 | lr 1.10e-03 | (3811.66 ms | 137549 tok/s) step 4489/76294 | train loss 3.595301 | norm 0.2028 | lr 1.10e-03 | (3844.27 ms | 136382 tok/s) step 4490/76294 | train loss 3.649277 | norm 0.1959 | lr 1.10e-03 
| (3817.20 ms | 137349 tok/s) step 4491/76294 | train loss 3.584610 | norm 0.2108 | lr 1.10e-03 | (3814.15 ms | 137459 tok/s) step 4492/76294 | train loss 3.684956 | norm 0.2145 | lr 1.10e-03 | (3841.59 ms | 136477 tok/s) step 4493/76294 | train loss 3.577401 | norm 0.2264 | lr 1.10e-03 | (3807.20 ms | 137710 tok/s) step 4494/76294 | train loss 3.613720 | norm 0.2000 | lr 1.10e-03 | (3807.73 ms | 137690 tok/s) step 4495/76294 | train loss 3.638342 | norm 0.1921 | lr 1.10e-03 | (3836.15 ms | 136670 tok/s) step 4496/76294 | train loss 3.536600 | norm 0.2199 | lr 1.10e-03 | (3812.57 ms | 137516 tok/s) step 4497/76294 | train loss 3.726539 | norm 0.2158 | lr 1.10e-03 | (3834.43 ms | 136732 tok/s) step 4498/76294 | train loss 3.731435 | norm 0.2382 | lr 1.10e-03 | (3804.99 ms | 137790 tok/s) step 4499/76294 | train loss 3.640236 | norm 0.2549 | lr 1.10e-03 | (3851.51 ms | 136125 tok/s) step 4500/76294 | train loss 3.597378 | norm 0.2395 | lr 1.10e-03 | (3807.38 ms | 137703 tok/s) val loss: 3.590310 saving model checkpoint to ./results/gpt2-124M-gqa/step_4500.pth step 4501/76294 | train loss 3.599939 | norm 0.2427 | lr 1.10e-03 | (3815.41 ms | 137413 tok/s) step 4502/76294 | train loss 3.565015 | norm 0.2058 | lr 1.10e-03 | (3807.50 ms | 137699 tok/s) step 4503/76294 | train loss 3.562494 | norm 0.2386 | lr 1.10e-03 | (3954.09 ms | 132594 tok/s) step 4504/76294 | train loss 3.589771 | norm 0.2246 | lr 1.10e-03 | (3843.52 ms | 136408 tok/s) step 4505/76294 | train loss 3.647512 | norm 0.3224 | lr 1.10e-03 | (3832.57 ms | 136798 tok/s) step 4506/76294 | train loss 3.617086 | norm 0.2818 | lr 1.10e-03 | (4063.13 ms | 129035 tok/s) step 4507/76294 | train loss 3.592329 | norm 0.2204 | lr 1.10e-03 | (4336.81 ms | 120892 tok/s) step 4508/76294 | train loss 3.642679 | norm 0.2172 | lr 1.10e-03 | (3793.82 ms | 138195 tok/s) step 4509/76294 | train loss 3.707998 | norm 0.2543 | lr 1.10e-03 | (3829.02 ms | 136925 tok/s) step 4510/76294 | train loss 3.597810 | norm 0.2340 | lr 
1.10e-03 | (3807.42 ms | 137702 tok/s) step 4511/76294 | train loss 3.623795 | norm 0.2451 | lr 1.10e-03 | (3830.52 ms | 136871 tok/s) step 4512/76294 | train loss 3.606252 | norm 0.2075 | lr 1.10e-03 | (3851.39 ms | 136130 tok/s) step 4513/76294 | train loss 3.577841 | norm 0.2110 | lr 1.10e-03 | (3812.29 ms | 137526 tok/s) step 4514/76294 | train loss 3.610734 | norm 0.2383 | lr 1.10e-03 | (3847.34 ms | 136273 tok/s) step 4515/76294 | train loss 3.678161 | norm 0.2176 | lr 1.10e-03 | (3803.96 ms | 137827 tok/s) step 4516/76294 | train loss 3.688639 | norm 0.2771 | lr 1.10e-03 | (3847.56 ms | 136265 tok/s) step 4517/76294 | train loss 3.548018 | norm 0.3726 | lr 1.10e-03 | (3877.05 ms | 135229 tok/s) step 4518/76294 | train loss 3.599863 | norm 0.2578 | lr 1.10e-03 | (4558.94 ms | 115002 tok/s) step 4519/76294 | train loss 3.614452 | norm 0.2569 | lr 1.10e-03 | (3799.51 ms | 137988 tok/s) step 4520/76294 | train loss 3.574072 | norm 0.2230 | lr 1.10e-03 | (3869.36 ms | 135497 tok/s) step 4521/76294 | train loss 3.747887 | norm 0.2399 | lr 1.09e-03 | (3800.88 ms | 137939 tok/s) step 4522/76294 | train loss 3.652697 | norm 0.2633 | lr 1.09e-03 | (3881.11 ms | 135087 tok/s) step 4523/76294 | train loss 3.611918 | norm 0.2186 | lr 1.09e-03 | (3808.93 ms | 137647 tok/s) step 4524/76294 | train loss 3.602436 | norm 0.2437 | lr 1.09e-03 | (3814.22 ms | 137456 tok/s) step 4525/76294 | train loss 3.655932 | norm 0.2296 | lr 1.09e-03 | (3838.49 ms | 136587 tok/s) step 4526/76294 | train loss 3.681919 | norm 0.2246 | lr 1.09e-03 | (3878.51 ms | 135178 tok/s) step 4527/76294 | train loss 3.544436 | norm 0.2075 | lr 1.09e-03 | (4100.68 ms | 127854 tok/s) step 4528/76294 | train loss 3.619298 | norm 0.2469 | lr 1.09e-03 | (3849.06 ms | 136212 tok/s) step 4529/76294 | train loss 3.648682 | norm 0.2219 | lr 1.09e-03 | (3832.81 ms | 136790 tok/s) step 4530/76294 | train loss 3.636784 | norm 0.1911 | lr 1.09e-03 | (3842.21 ms | 136455 tok/s) step 4531/76294 | train loss 3.581590 | 
norm 0.2340 | lr 1.09e-03 | (3839.99 ms | 136534 tok/s) step 4532/76294 | train loss 3.569612 | norm 0.2104 | lr 1.09e-03 | (4108.67 ms | 127605 tok/s) step 4533/76294 | train loss 3.618548 | norm 0.1867 | lr 1.09e-03 | (3805.14 ms | 137784 tok/s) step 4534/76294 | train loss 3.619214 | norm 0.2113 | lr 1.09e-03 | (3838.70 ms | 136579 tok/s) step 4535/76294 | train loss 3.610084 | norm 0.1860 | lr 1.09e-03 | (3825.82 ms | 137039 tok/s) step 4536/76294 | train loss 3.616461 | norm 0.1984 | lr 1.09e-03 | (3855.20 ms | 135995 tok/s) step 4537/76294 | train loss 3.545173 | norm 0.2061 | lr 1.09e-03 | (3828.49 ms | 136944 tok/s) step 4538/76294 | train loss 3.651841 | norm 0.2128 | lr 1.09e-03 | (3834.54 ms | 136728 tok/s) step 4539/76294 | train loss 3.561348 | norm 0.2102 | lr 1.09e-03 | (3820.86 ms | 137217 tok/s) step 4540/76294 | train loss 3.591190 | norm 0.1899 | lr 1.09e-03 | (3855.11 ms | 135998 tok/s) step 4541/76294 | train loss 3.631342 | norm 0.2249 | lr 1.09e-03 | (3816.49 ms | 137374 tok/s) step 4542/76294 | train loss 3.630007 | norm 0.2463 | lr 1.09e-03 | (4067.59 ms | 128894 tok/s) step 4543/76294 | train loss 3.604051 | norm 0.1983 | lr 1.09e-03 | (3828.97 ms | 136927 tok/s) step 4544/76294 | train loss 3.605313 | norm 0.1973 | lr 1.09e-03 | (3802.82 ms | 137868 tok/s) step 4545/76294 | train loss 3.580472 | norm 0.2821 | lr 1.09e-03 | (3887.36 ms | 134870 tok/s) step 4546/76294 | train loss 3.770540 | norm 0.3513 | lr 1.09e-03 | (3807.22 ms | 137709 tok/s) step 4547/76294 | train loss 3.604708 | norm 0.2434 | lr 1.09e-03 | (3827.75 ms | 136970 tok/s) step 4548/76294 | train loss 3.555242 | norm 0.2693 | lr 1.09e-03 | (4017.83 ms | 130490 tok/s) step 4549/76294 | train loss 3.627018 | norm 0.2991 | lr 1.09e-03 | (3810.51 ms | 137590 tok/s) step 4550/76294 | train loss 3.598718 | norm 0.2664 | lr 1.09e-03 | (3830.53 ms | 136871 tok/s) step 4551/76294 | train loss 3.621530 | norm 0.2896 | lr 1.09e-03 | (3804.15 ms | 137820 tok/s) step 4552/76294 | train 
loss 3.765251 | norm 0.2687 | lr 1.09e-03 | (3873.78 ms | 135343 tok/s) step 4553/76294 | train loss 3.592244 | norm 0.2648 | lr 1.09e-03 | (3818.88 ms | 137288 tok/s) step 4554/76294 | train loss 3.639924 | norm 0.1923 | lr 1.09e-03 | (3873.27 ms | 135361 tok/s) step 4555/76294 | train loss 3.656850 | norm 0.2454 | lr 1.09e-03 | (3810.22 ms | 137600 tok/s) step 4556/76294 | train loss 3.560528 | norm 0.2049 | lr 1.09e-03 | (3825.03 ms | 137068 tok/s) step 4557/76294 | train loss 3.612274 | norm 0.2304 | lr 1.09e-03 | (3944.85 ms | 132904 tok/s) step 4558/76294 | train loss 3.542460 | norm 0.1870 | lr 1.09e-03 | (3832.54 ms | 136799 tok/s) step 4559/76294 | train loss 3.618793 | norm 0.2088 | lr 1.09e-03 | (3840.61 ms | 136512 tok/s) step 4560/76294 | train loss 3.615849 | norm 0.2473 | lr 1.09e-03 | (3900.21 ms | 134426 tok/s) step 4561/76294 | train loss 3.579229 | norm 0.2205 | lr 1.09e-03 | (3822.68 ms | 137152 tok/s) step 4562/76294 | train loss 3.535861 | norm 0.1857 | lr 1.09e-03 | (3828.47 ms | 136944 tok/s) step 4563/76294 | train loss 3.636007 | norm 0.2352 | lr 1.09e-03 | (3813.61 ms | 137478 tok/s) step 4564/76294 | train loss 3.556204 | norm 0.2003 | lr 1.09e-03 | (3864.53 ms | 135667 tok/s) step 4565/76294 | train loss 3.860509 | norm 0.2319 | lr 1.09e-03 | (3810.22 ms | 137600 tok/s) step 4566/76294 | train loss 3.632003 | norm 0.2465 | lr 1.09e-03 | (3818.79 ms | 137292 tok/s) step 4567/76294 | train loss 3.657963 | norm 0.2246 | lr 1.09e-03 | (3876.11 ms | 135261 tok/s) step 4568/76294 | train loss 3.606653 | norm 0.2079 | lr 1.09e-03 | (4007.27 ms | 130834 tok/s) step 4569/76294 | train loss 3.610319 | norm 0.1841 | lr 1.09e-03 | (3807.60 ms | 137695 tok/s) step 4570/76294 | train loss 3.605448 | norm 0.2926 | lr 1.09e-03 | (3815.82 ms | 137399 tok/s) step 4571/76294 | train loss 3.605515 | norm 0.2021 | lr 1.09e-03 | (3863.67 ms | 135697 tok/s) step 4572/76294 | train loss 3.567124 | norm 0.2024 | lr 1.09e-03 | (3826.48 ms | 137016 tok/s) step 
4573/76294 | train loss 3.565414 | norm 0.1894 | lr 1.09e-03 | (3809.04 ms | 137643 tok/s) step 4574/76294 | train loss 3.591233 | norm 0.1800 | lr 1.09e-03 | (3813.17 ms | 137494 tok/s) step 4575/76294 | train loss 3.631627 | norm 0.1918 | lr 1.09e-03 | (3806.27 ms | 137743 tok/s) step 4576/76294 | train loss 3.589970 | norm 0.1962 | lr 1.09e-03 | (3865.95 ms | 135617 tok/s) step 4577/76294 | train loss 3.709024 | norm 0.2104 | lr 1.09e-03 | (3806.11 ms | 137749 tok/s) step 4578/76294 | train loss 3.631920 | norm 0.2330 | lr 1.09e-03 | (4127.52 ms | 127022 tok/s) step 4579/76294 | train loss 3.583050 | norm 0.1964 | lr 1.09e-03 | (3833.65 ms | 136759 tok/s) step 4580/76294 | train loss 3.573860 | norm 0.2337 | lr 1.09e-03 | (3813.80 ms | 137471 tok/s) step 4581/76294 | train loss 3.495822 | norm 0.2009 | lr 1.09e-03 | (3845.21 ms | 136348 tok/s) step 4582/76294 | train loss 3.626609 | norm 0.2628 | lr 1.09e-03 | (3824.39 ms | 137091 tok/s) step 4583/76294 | train loss 3.566082 | norm 0.2568 | lr 1.09e-03 | (3809.89 ms | 137612 tok/s) step 4584/76294 | train loss 3.699149 | norm 0.3267 | lr 1.09e-03 | (3874.86 ms | 135305 tok/s) step 4585/76294 | train loss 3.581741 | norm 0.3277 | lr 1.09e-03 | (3806.92 ms | 137720 tok/s) step 4586/76294 | train loss 3.603066 | norm 0.2439 | lr 1.09e-03 | (3851.01 ms | 136143 tok/s) step 4587/76294 | train loss 3.572182 | norm 0.2500 | lr 1.09e-03 | (3808.40 ms | 137666 tok/s) step 4588/76294 | train loss 3.570823 | norm 0.2367 | lr 1.09e-03 | (3961.75 ms | 132337 tok/s) step 4589/76294 | train loss 3.563698 | norm 0.2263 | lr 1.09e-03 | (3802.10 ms | 137894 tok/s) step 4590/76294 | train loss 3.584133 | norm 0.2824 | lr 1.09e-03 | (3816.95 ms | 137358 tok/s) step 4591/76294 | train loss 3.538993 | norm 0.2620 | lr 1.09e-03 | (3837.48 ms | 136623 tok/s) step 4592/76294 | train loss 3.612163 | norm 0.2899 | lr 1.09e-03 | (3833.25 ms | 136774 tok/s) step 4593/76294 | train loss 3.525234 | norm 0.2169 | lr 1.09e-03 | (3840.95 ms | 
136499 tok/s) step 4594/76294 | train loss 3.584423 | norm 0.2017 | lr 1.09e-03 | (3812.21 ms | 137529 tok/s) step 4595/76294 | train loss 3.604904 | norm 0.2320 | lr 1.09e-03 | (3884.72 ms | 134961 tok/s) step 4596/76294 | train loss 3.626501 | norm 0.2697 | lr 1.09e-03 | (3805.12 ms | 137785 tok/s) step 4597/76294 | train loss 3.544775 | norm 0.2402 | lr 1.09e-03 | (3881.31 ms | 135080 tok/s) step 4598/76294 | train loss 3.546555 | norm 0.1947 | lr 1.09e-03 | (3809.29 ms | 137634 tok/s) step 4599/76294 | train loss 3.601209 | norm 0.2201 | lr 1.09e-03 | (3837.37 ms | 136627 tok/s) step 4600/76294 | train loss 3.664699 | norm 0.2096 | lr 1.09e-03 | (3883.41 ms | 135007 tok/s) step 4601/76294 | train loss 3.606237 | norm 0.2369 | lr 1.09e-03 | (3890.99 ms | 134744 tok/s) step 4602/76294 | train loss 3.683841 | norm 0.2688 | lr 1.09e-03 | (3808.39 ms | 137666 tok/s) step 4603/76294 | train loss 3.525666 | norm 0.2060 | lr 1.09e-03 | (3851.15 ms | 136138 tok/s) step 4604/76294 | train loss 3.588868 | norm 0.2100 | lr 1.09e-03 | (3816.30 ms | 137381 tok/s) step 4605/76294 | train loss 3.559857 | norm 0.1974 | lr 1.09e-03 | (3828.73 ms | 136935 tok/s) step 4606/76294 | train loss 3.588810 | norm 0.2305 | lr 1.09e-03 | (3851.64 ms | 136121 tok/s) step 4607/76294 | train loss 3.612338 | norm 0.2919 | lr 1.09e-03 | (3924.91 ms | 133580 tok/s) step 4608/76294 | train loss 3.652081 | norm 0.2975 | lr 1.09e-03 | (3807.96 ms | 137682 tok/s) step 4609/76294 | train loss 3.593004 | norm 0.2045 | lr 1.09e-03 | (3967.33 ms | 132151 tok/s) step 4610/76294 | train loss 3.557500 | norm 0.2884 | lr 1.09e-03 | (3804.97 ms | 137790 tok/s) step 4611/76294 | train loss 3.560840 | norm 0.2772 | lr 1.09e-03 | (3818.03 ms | 137319 tok/s) step 4612/76294 | train loss 3.645030 | norm 0.2631 | lr 1.09e-03 | (3863.31 ms | 135710 tok/s) step 4613/76294 | train loss 3.600422 | norm 0.3016 | lr 1.09e-03 | (3829.26 ms | 136916 tok/s) step 4614/76294 | train loss 3.671840 | norm 0.2678 | lr 1.09e-03 
| (3813.17 ms | 137494 tok/s) step 4615/76294 | train loss 3.606744 | norm 0.2240 | lr 1.09e-03 | (3875.29 ms | 135290 tok/s) step 4616/76294 | train loss 3.532750 | norm 0.2316 | lr 1.09e-03 | (3804.33 ms | 137813 tok/s) step 4617/76294 | train loss 3.665414 | norm 0.2266 | lr 1.09e-03 | (3856.13 ms | 135962 tok/s) step 4618/76294 | train loss 3.574065 | norm 0.2260 | lr 1.09e-03 | (3807.70 ms | 137692 tok/s) step 4619/76294 | train loss 3.567608 | norm 0.1988 | lr 1.09e-03 | (3844.98 ms | 136356 tok/s) step 4620/76294 | train loss 3.566701 | norm 0.2197 | lr 1.09e-03 | (3930.33 ms | 133395 tok/s) step 4621/76294 | train loss 3.575379 | norm 0.2061 | lr 1.09e-03 | (3811.54 ms | 137553 tok/s) step 4622/76294 | train loss 3.594759 | norm 0.2208 | lr 1.09e-03 | (3851.24 ms | 136135 tok/s) step 4623/76294 | train loss 3.520493 | norm 0.2320 | lr 1.09e-03 | (3809.79 ms | 137616 tok/s) step 4624/76294 | train loss 3.564040 | norm 0.2361 | lr 1.09e-03 | (3910.91 ms | 134058 tok/s) step 4625/76294 | train loss 3.612358 | norm 0.1892 | lr 1.09e-03 | (3806.07 ms | 137751 tok/s) step 4626/76294 | train loss 3.613997 | norm 0.2225 | lr 1.09e-03 | (7242.28 ms | 72393 tok/s) step 4627/76294 | train loss 3.557915 | norm 0.2015 | lr 1.09e-03 | (4033.28 ms | 129991 tok/s) step 4628/76294 | train loss 3.554118 | norm 0.2119 | lr 1.09e-03 | (3962.92 ms | 132299 tok/s) step 4629/76294 | train loss 3.540813 | norm 0.2121 | lr 1.09e-03 | (3795.76 ms | 138125 tok/s) step 4630/76294 | train loss 3.599623 | norm 0.2019 | lr 1.09e-03 | (3851.77 ms | 136116 tok/s) step 4631/76294 | train loss 3.556242 | norm 0.2063 | lr 1.09e-03 | (3811.81 ms | 137543 tok/s) step 4632/76294 | train loss 3.573482 | norm 0.2162 | lr 1.09e-03 | (4024.67 ms | 130268 tok/s) step 4633/76294 | train loss 3.594573 | norm 0.1950 | lr 1.09e-03 | (3806.37 ms | 137740 tok/s) step 4634/76294 | train loss 3.523450 | norm 0.1793 | lr 1.09e-03 | (3803.06 ms | 137859 tok/s) step 4635/76294 | train loss 3.603317 | norm 
0.1820 | lr 1.09e-03 | (3903.74 ms | 134304 tok/s) step 4636/76294 | train loss 3.596554 | norm 0.1959 | lr 1.09e-03 | (3855.18 ms | 135996 tok/s) step 4637/76294 | train loss 3.548111 | norm 0.1807 | lr 1.09e-03 | (3800.36 ms | 137957 tok/s) step 4638/76294 | train loss 3.625684 | norm 0.2225 | lr 1.09e-03 | (3871.24 ms | 135432 tok/s) step 4639/76294 | train loss 3.548012 | norm 0.2592 | lr 1.09e-03 | (3796.58 ms | 138095 tok/s) step 4640/76294 | train loss 3.646517 | norm 0.2762 | lr 1.09e-03 | (3988.25 ms | 131458 tok/s) step 4641/76294 | train loss 3.510396 | norm 0.2238 | lr 1.09e-03 | (3794.41 ms | 138174 tok/s) step 4642/76294 | train loss 3.587715 | norm 0.1960 | lr 1.09e-03 | (3805.70 ms | 137764 tok/s) step 4643/76294 | train loss 3.637841 | norm 0.2123 | lr 1.09e-03 | (3830.67 ms | 136866 tok/s) step 4644/76294 | train loss 3.676700 | norm 0.2184 | lr 1.09e-03 | (3808.35 ms | 137668 tok/s) step 4645/76294 | train loss 3.669461 | norm 0.1941 | lr 1.09e-03 | (3795.21 ms | 138145 tok/s) step 4646/76294 | train loss 3.618164 | norm 0.1923 | lr 1.09e-03 | (3854.02 ms | 136037 tok/s) step 4647/76294 | train loss 3.580278 | norm 0.2170 | lr 1.09e-03 | (3796.72 ms | 138090 tok/s) step 4648/76294 | train loss 3.562121 | norm 0.1780 | lr 1.09e-03 | (3871.61 ms | 135419 tok/s) step 4649/76294 | train loss 3.650709 | norm 0.2072 | lr 1.09e-03 | (3821.35 ms | 137200 tok/s) step 4650/76294 | train loss 3.596652 | norm 0.2400 | lr 1.09e-03 | (3872.08 ms | 135402 tok/s) step 4651/76294 | train loss 3.570178 | norm 0.2404 | lr 1.09e-03 | (3797.46 ms | 138063 tok/s) step 4652/76294 | train loss 3.606265 | norm 0.2084 | lr 1.09e-03 | (3804.09 ms | 137822 tok/s) step 4653/76294 | train loss 3.610457 | norm 0.2247 | lr 1.09e-03 | (3839.87 ms | 136538 tok/s) step 4654/76294 | train loss 3.584283 | norm 0.2797 | lr 1.09e-03 | (3808.93 ms | 137647 tok/s) step 4655/76294 | train loss 3.589532 | norm 0.2204 | lr 1.09e-03 | (3828.89 ms | 136929 tok/s) step 4656/76294 | train loss 
3.788571 | norm 0.2233 | lr 1.09e-03 | (3850.74 ms | 136153 tok/s) step 4657/76294 | train loss 3.633534 | norm 0.2592 | lr 1.09e-03 | (3800.86 ms | 137939 tok/s) step 4658/76294 | train loss 3.523003 | norm 0.2115 | lr 1.09e-03 | (3883.35 ms | 135009 tok/s) step 4659/76294 | train loss 3.596855 | norm 0.2410 | lr 1.09e-03 | (3797.73 ms | 138053 tok/s) step 4660/76294 | train loss 3.590217 | norm 0.2411 | lr 1.09e-03 | (3835.50 ms | 136693 tok/s) step 4661/76294 | train loss 3.568801 | norm 0.2771 | lr 1.09e-03 | (3803.41 ms | 137847 tok/s) step 4662/76294 | train loss 3.711391 | norm 0.2520 | lr 1.09e-03 | (3948.61 ms | 132778 tok/s) step 4663/76294 | train loss 3.624119 | norm 0.2497 | lr 1.09e-03 | (3795.70 ms | 138127 tok/s) step 4664/76294 | train loss 3.630028 | norm 0.2530 | lr 1.09e-03 | (3799.53 ms | 137988 tok/s) step 4665/76294 | train loss 3.561224 | norm 0.2405 | lr 1.09e-03 | (3828.06 ms | 136959 tok/s) step 4666/76294 | train loss 3.579165 | norm 0.2118 | lr 1.09e-03 | (3801.26 ms | 137925 tok/s) step 4667/76294 | train loss 3.595618 | norm 0.1866 | lr 1.09e-03 | (3855.51 ms | 135984 tok/s) step 4668/76294 | train loss 3.584469 | norm 0.4122 | lr 1.09e-03 | (3802.78 ms | 137870 tok/s) step 4669/76294 | train loss 3.647910 | norm 0.2410 | lr 1.09e-03 | (3836.47 ms | 136659 tok/s) step 4670/76294 | train loss 3.570858 | norm 0.2390 | lr 1.09e-03 | (3804.67 ms | 137801 tok/s) step 4671/76294 | train loss 3.710038 | norm 0.2714 | lr 1.09e-03 | (3835.05 ms | 136709 tok/s) step 4672/76294 | train loss 3.599169 | norm 0.2758 | lr 1.09e-03 | (3804.21 ms | 137818 tok/s) step 4673/76294 | train loss 3.602240 | norm 0.2731 | lr 1.09e-03 | (3826.95 ms | 136999 tok/s) step 4674/76294 | train loss 3.579141 | norm 0.2390 | lr 1.09e-03 | (3802.52 ms | 137879 tok/s) step 4675/76294 | train loss 3.594053 | norm 0.2809 | lr 1.09e-03 | (4096.97 ms | 127970 tok/s) step 4676/76294 | train loss 3.570612 | norm 0.2310 | lr 1.09e-03 | (3798.12 ms | 138039 tok/s) step 
4677/76294 | train loss 3.635289 | norm 0.2576 | lr 1.09e-03 | (3938.31 ms | 133125 tok/s) step 4678/76294 | train loss 3.596166 | norm 0.2383 | lr 1.09e-03 | (3804.78 ms | 137797 tok/s) step 4679/76294 | train loss 3.625809 | norm 0.2325 | lr 1.09e-03 | (3893.30 ms | 134664 tok/s) step 4680/76294 | train loss 3.563387 | norm 0.2218 | lr 1.09e-03 | (3793.91 ms | 138192 tok/s) step 4681/76294 | train loss 3.644157 | norm 0.2695 | lr 1.09e-03 | (3832.44 ms | 136803 tok/s) step 4682/76294 | train loss 3.589149 | norm 0.2753 | lr 1.09e-03 | (3794.83 ms | 138158 tok/s) step 4683/76294 | train loss 3.603257 | norm 0.2987 | lr 1.09e-03 | (3833.30 ms | 136772 tok/s) step 4684/76294 | train loss 3.597293 | norm 0.1995 | lr 1.09e-03 | (3798.81 ms | 138014 tok/s) step 4685/76294 | train loss 3.581994 | norm 0.2714 | lr 1.09e-03 | (3834.18 ms | 136741 tok/s) step 4686/76294 | train loss 3.579630 | norm 0.2323 | lr 1.09e-03 | (3798.79 ms | 138014 tok/s) step 4687/76294 | train loss 3.640336 | norm 0.2437 | lr 1.09e-03 | (3836.52 ms | 136657 tok/s) step 4688/76294 | train loss 3.553458 | norm 0.2143 | lr 1.09e-03 | (3801.75 ms | 137907 tok/s) step 4689/76294 | train loss 3.850578 | norm 0.2693 | lr 1.09e-03 | (3845.63 ms | 136333 tok/s) step 4690/76294 | train loss 3.757396 | norm 0.2895 | lr 1.09e-03 | (3801.89 ms | 137902 tok/s) step 4691/76294 | train loss 3.628820 | norm 0.3409 | lr 1.09e-03 | (3823.19 ms | 137134 tok/s) step 4692/76294 | train loss 3.571495 | norm 0.3695 | lr 1.09e-03 | (3842.39 ms | 136448 tok/s) step 4693/76294 | train loss 3.583736 | norm 0.2608 | lr 1.09e-03 | (3806.84 ms | 137723 tok/s) step 4694/76294 | train loss 3.593230 | norm 0.3209 | lr 1.09e-03 | (3824.28 ms | 137094 tok/s) step 4695/76294 | train loss 3.574583 | norm 0.2730 | lr 1.09e-03 | (3970.17 ms | 132057 tok/s) step 4696/76294 | train loss 3.627924 | norm 0.2382 | lr 1.09e-03 | (3802.37 ms | 137885 tok/s) step 4697/76294 | train loss 3.531172 | norm 0.2243 | lr 1.09e-03 | (3831.10 ms | 
136850 tok/s) step 4698/76294 | train loss 3.557587 | norm 0.2112 | lr 1.09e-03 | (3858.17 ms | 135890 tok/s) step 4699/76294 | train loss 3.512245 | norm 0.1990 | lr 1.09e-03 | (3906.65 ms | 134204 tok/s) step 4700/76294 | train loss 3.608610 | norm 0.2096 | lr 1.09e-03 | (3835.05 ms | 136710 tok/s) step 4701/76294 | train loss 3.607042 | norm 0.1880 | lr 1.09e-03 | (3797.75 ms | 138052 tok/s) step 4702/76294 | train loss 3.580674 | norm 0.1927 | lr 1.09e-03 | (3807.24 ms | 137708 tok/s) step 4703/76294 | train loss 3.641955 | norm 0.1759 | lr 1.09e-03 | (3836.03 ms | 136675 tok/s) step 4704/76294 | train loss 3.551036 | norm 0.1704 | lr 1.08e-03 | (3836.71 ms | 136650 tok/s) step 4705/76294 | train loss 3.658209 | norm 0.2099 | lr 1.08e-03 | (3929.21 ms | 133434 tok/s) step 4706/76294 | train loss 3.500829 | norm 0.1941 | lr 1.08e-03 | (3798.50 ms | 138025 tok/s) step 4707/76294 | train loss 3.573690 | norm 0.1982 | lr 1.08e-03 | (3866.10 ms | 135612 tok/s) step 4708/76294 | train loss 3.554676 | norm 0.1929 | lr 1.08e-03 | (3797.42 ms | 138064 tok/s) step 4709/76294 | train loss 3.560875 | norm 0.1893 | lr 1.08e-03 | (3852.46 ms | 136092 tok/s) step 4710/76294 | train loss 3.512357 | norm 0.1915 | lr 1.08e-03 | (3800.94 ms | 137936 tok/s) step 4711/76294 | train loss 3.560536 | norm 0.1841 | lr 1.08e-03 | (3823.48 ms | 137123 tok/s) step 4712/76294 | train loss 3.556131 | norm 0.1909 | lr 1.08e-03 | (3836.99 ms | 136640 tok/s) step 4713/76294 | train loss 3.637496 | norm 0.2230 | lr 1.08e-03 | (3990.48 ms | 131385 tok/s) step 4714/76294 | train loss 3.586354 | norm 0.2048 | lr 1.08e-03 | (3918.60 ms | 133795 tok/s) step 4715/76294 | train loss 3.599943 | norm 0.2142 | lr 1.08e-03 | (3840.83 ms | 136504 tok/s) step 4716/76294 | train loss 3.584264 | norm 0.1989 | lr 1.08e-03 | (4422.87 ms | 118540 tok/s) step 4717/76294 | train loss 3.668403 | norm 0.2321 | lr 1.08e-03 | (3793.25 ms | 138216 tok/s) step 4718/76294 | train loss 3.559587 | norm 0.2360 | lr 1.08e-03 
| (3866.51 ms | 135597 tok/s) step 4719/76294 | train loss 3.651153 | norm 0.2162 | lr 1.08e-03 | (3794.81 ms | 138159 tok/s) step 4720/76294 | train loss 3.638298 | norm 0.2471 | lr 1.08e-03 | (3825.03 ms | 137068 tok/s) step 4721/76294 | train loss 3.648789 | norm 0.2601 | lr 1.08e-03 | (3799.02 ms | 138006 tok/s) step 4722/76294 | train loss 3.567243 | norm 0.2134 | lr 1.08e-03 | (3904.93 ms | 134263 tok/s) step 4723/76294 | train loss 3.614053 | norm 0.2263 | lr 1.08e-03 | (3800.86 ms | 137939 tok/s) step 4724/76294 | train loss 3.659336 | norm 0.1977 | lr 1.08e-03 | (3840.40 ms | 136519 tok/s) step 4725/76294 | train loss 3.585294 | norm 0.2206 | lr 1.08e-03 | (3830.29 ms | 136879 tok/s) step 4726/76294 | train loss 3.616484 | norm 0.2336 | lr 1.08e-03 | (3920.52 ms | 133729 tok/s) step 4727/76294 | train loss 3.619472 | norm 0.2314 | lr 1.08e-03 | (3810.09 ms | 137605 tok/s) step 4728/76294 | train loss 3.597612 | norm 0.2024 | lr 1.08e-03 | (3800.12 ms | 137966 tok/s) step 4729/76294 | train loss 3.531157 | norm 0.2318 | lr 1.08e-03 | (3826.31 ms | 137022 tok/s) step 4730/76294 | train loss 3.649986 | norm 0.1879 | lr 1.08e-03 | (3803.94 ms | 137828 tok/s) step 4731/76294 | train loss 3.522979 | norm 0.2105 | lr 1.08e-03 | (3800.96 ms | 137936 tok/s) step 4732/76294 | train loss 3.533073 | norm 0.1960 | lr 1.08e-03 | (3853.71 ms | 136048 tok/s) step 4733/76294 | train loss 3.558004 | norm 0.1581 | lr 1.08e-03 | (3805.22 ms | 137781 tok/s) step 4734/76294 | train loss 3.530402 | norm 0.2144 | lr 1.08e-03 | (3988.89 ms | 131437 tok/s) step 4735/76294 | train loss 3.565078 | norm 0.2302 | lr 1.08e-03 | (3799.08 ms | 138004 tok/s) step 4736/76294 | train loss 3.562190 | norm 0.2243 | lr 1.08e-03 | (3882.58 ms | 135036 tok/s) step 4737/76294 | train loss 3.560986 | norm 0.2085 | lr 1.08e-03 | (3800.99 ms | 137935 tok/s) step 4738/76294 | train loss 3.579467 | norm 0.1713 | lr 1.08e-03 | (4022.47 ms | 130340 tok/s) step 4739/76294 | train loss 3.580554 | norm 
0.2129 | lr 1.08e-03 | (3794.66 ms | 138165 tok/s) step 4740/76294 | train loss 3.555918 | norm 0.2018 | lr 1.08e-03 | (3841.14 ms | 136493 tok/s) step 4741/76294 | train loss 3.642822 | norm 0.2206 | lr 1.08e-03 | (3845.37 ms | 136343 tok/s) step 4742/76294 | train loss 3.551932 | norm 0.2066 | lr 1.08e-03 | (3805.23 ms | 137781 tok/s) step 4743/76294 | train loss 3.590671 | norm 0.1916 | lr 1.08e-03 | (3807.32 ms | 137705 tok/s) step 4744/76294 | train loss 3.640497 | norm 0.2143 | lr 1.08e-03 | (3807.78 ms | 137688 tok/s) step 4745/76294 | train loss 3.572820 | norm 0.2245 | lr 1.08e-03 | (3877.67 ms | 135207 tok/s) step 4746/76294 | train loss 3.588519 | norm 0.2011 | lr 1.08e-03 | (3794.86 ms | 138157 tok/s) step 4747/76294 | train loss 3.610327 | norm 0.2480 | lr 1.08e-03 | (3824.85 ms | 137074 tok/s) step 4748/76294 | train loss 3.545013 | norm 0.2348 | lr 1.08e-03 | (3797.36 ms | 138067 tok/s) step 4749/76294 | train loss 3.557231 | norm 0.2481 | lr 1.08e-03 | (3835.49 ms | 136694 tok/s) step 4750/76294 | train loss 3.577241 | norm 0.2208 | lr 1.08e-03 | (3833.59 ms | 136762 tok/s) val loss: 3.578418 saving model checkpoint to ./results/gpt2-124M-gqa/step_4750.pth step 4751/76294 | train loss 3.516562 | norm 0.2759 | lr 1.08e-03 | (3816.60 ms | 137371 tok/s) step 4752/76294 | train loss 3.565721 | norm 0.2353 | lr 1.08e-03 | (3829.64 ms | 136903 tok/s) step 4753/76294 | train loss 3.658255 | norm 0.2328 | lr 1.08e-03 | (3833.83 ms | 136753 tok/s) step 4754/76294 | train loss 3.642864 | norm 0.2401 | lr 1.08e-03 | (3826.66 ms | 137009 tok/s) step 4755/76294 | train loss 3.533813 | norm 0.2173 | lr 1.08e-03 | (3932.64 ms | 133317 tok/s) step 4756/76294 | train loss 3.634252 | norm 0.2250 | lr 1.08e-03 | (3784.77 ms | 138526 tok/s) step 4757/76294 | train loss 3.554036 | norm 0.2100 | lr 1.08e-03 | (4360.91 ms | 120225 tok/s) step 4758/76294 | train loss 3.588452 | norm 0.2470 | lr 1.08e-03 | (3790.23 ms | 138326 tok/s) step 4759/76294 | train loss 3.519404 | 
norm 0.2555 | lr 1.08e-03 | (3896.14 ms | 134566 tok/s) step 4760/76294 | train loss 3.629384 | norm 0.2163 | lr 1.08e-03 | (3809.84 ms | 137614 tok/s) step 4761/76294 | train loss 3.594651 | norm 0.2737 | lr 1.08e-03 | (3839.34 ms | 136557 tok/s) step 4762/76294 | train loss 3.670027 | norm 0.2673 | lr 1.08e-03 | (3790.56 ms | 138314 tok/s) step 4763/76294 | train loss 3.571933 | norm 0.2451 | lr 1.08e-03 | (3863.68 ms | 135697 tok/s) step 4764/76294 | train loss 3.658929 | norm 0.3039 | lr 1.08e-03 | (3791.28 ms | 138288 tok/s) step 4765/76294 | train loss 3.610795 | norm 0.2424 | lr 1.08e-03 | (3927.15 ms | 133503 tok/s) step 4766/76294 | train loss 3.684978 | norm 0.2822 | lr 1.08e-03 | (3795.17 ms | 138146 tok/s) step 4767/76294 | train loss 3.627615 | norm 0.2485 | lr 1.08e-03 | (3872.43 ms | 135390 tok/s) step 4768/76294 | train loss 3.616344 | norm 0.2365 | lr 1.08e-03 | (3793.80 ms | 138196 tok/s) step 4769/76294 | train loss 3.605766 | norm 0.2237 | lr 1.08e-03 | (4225.50 ms | 124077 tok/s) step 4770/76294 | train loss 3.642074 | norm 0.1987 | lr 1.08e-03 | (4011.85 ms | 130685 tok/s) step 4771/76294 | train loss 3.594981 | norm 0.2031 | lr 1.08e-03 | (3815.65 ms | 137405 tok/s) step 4772/76294 | train loss 3.573698 | norm 0.2017 | lr 1.08e-03 | (3833.89 ms | 136751 tok/s) step 4773/76294 | train loss 3.605178 | norm 0.2077 | lr 1.08e-03 | (3807.83 ms | 137687 tok/s) step 4774/76294 | train loss 3.585570 | norm 0.2136 | lr 1.08e-03 | (3847.74 ms | 136259 tok/s) step 4775/76294 | train loss 3.575735 | norm 0.2037 | lr 1.08e-03 | (3838.69 ms | 136580 tok/s) step 4776/76294 | train loss 3.688122 | norm 0.2359 | lr 1.08e-03 | (3936.60 ms | 133183 tok/s) step 4777/76294 | train loss 3.621384 | norm 0.2175 | lr 1.08e-03 | (3804.12 ms | 137821 tok/s) step 4778/76294 | train loss 3.588449 | norm 0.2150 | lr 1.08e-03 | (3797.08 ms | 138077 tok/s) step 4779/76294 | train loss 3.555027 | norm 0.2320 | lr 1.08e-03 | (3887.46 ms | 134866 tok/s) step 4780/76294 | train 
loss 3.627494 | norm 0.2118 | lr 1.08e-03 | (3905.39 ms | 134247 tok/s) step 4781/76294 | train loss 3.590056 | norm 0.2093 | lr 1.08e-03 | (3822.55 ms | 137157 tok/s) step 4782/76294 | train loss 3.639196 | norm 0.2011 | lr 1.08e-03 | (3841.84 ms | 136468 tok/s) step 4783/76294 | train loss 3.675456 | norm 0.2023 | lr 1.08e-03 | (3907.14 ms | 134187 tok/s) step 4784/76294 | train loss 3.677872 | norm 0.1958 | lr 1.08e-03 | (3794.79 ms | 138160 tok/s) step 4785/76294 | train loss 3.624192 | norm 0.1917 | lr 1.08e-03 | (3901.66 ms | 134376 tok/s) step 4786/76294 | train loss 3.663217 | norm 0.2358 | lr 1.08e-03 | (3807.77 ms | 137689 tok/s) step 4787/76294 | train loss 3.657016 | norm 0.2597 | lr 1.08e-03 | (3799.32 ms | 137995 tok/s) step 4788/76294 | train loss 3.590319 | norm 0.2428 | lr 1.08e-03 | (4088.71 ms | 128228 tok/s) step 4789/76294 | train loss 3.756878 | norm 0.3156 | lr 1.08e-03 | (4058.90 ms | 129170 tok/s) step 4790/76294 | train loss 3.616517 | norm 0.2698 | lr 1.08e-03 | (3794.54 ms | 138169 tok/s) step 4791/76294 | train loss 3.590656 | norm 0.2555 | lr 1.08e-03 | (3890.03 ms | 134777 tok/s) step 4792/76294 | train loss 3.730403 | norm 0.2153 | lr 1.08e-03 | (3828.77 ms | 136934 tok/s) step 4793/76294 | train loss 3.609035 | norm 0.2476 | lr 1.08e-03 | (3906.27 ms | 134217 tok/s) step 4794/76294 | train loss 3.647753 | norm 0.2439 | lr 1.08e-03 | (3802.02 ms | 137897 tok/s) step 4795/76294 | train loss 3.558848 | norm 0.2388 | lr 1.08e-03 | (3918.64 ms | 133793 tok/s) step 4796/76294 | train loss 3.565798 | norm 0.2191 | lr 1.08e-03 | (3795.63 ms | 138129 tok/s) step 4797/76294 | train loss 3.577448 | norm 0.2089 | lr 1.08e-03 | (3881.60 ms | 135070 tok/s) step 4798/76294 | train loss 3.629624 | norm 0.2213 | lr 1.08e-03 | (3812.03 ms | 137535 tok/s) step 4799/76294 | train loss 3.568482 | norm 0.2488 | lr 1.08e-03 | (3853.67 ms | 136049 tok/s) step 4800/76294 | train loss 3.470090 | norm 0.2957 | lr 1.08e-03 | (3803.78 ms | 137834 tok/s) step 
4801/76294 | train loss 3.595522 | norm 0.2496 | lr 1.08e-03 | (3954.93 ms | 132566 tok/s) step 4802/76294 | train loss 3.590156 | norm 0.1957 | lr 1.08e-03 | (3800.62 ms | 137948 tok/s) step 4803/76294 | train loss 3.609792 | norm 0.2400 | lr 1.08e-03 | (3903.52 ms | 134312 tok/s) step 4804/76294 | train loss 3.602802 | norm 0.2158 | lr 1.08e-03 | (3802.09 ms | 137895 tok/s) step 4805/76294 | train loss 3.599877 | norm 0.2205 | lr 1.08e-03 | (3880.28 ms | 135116 tok/s) step 4806/76294 | train loss 3.610630 | norm 0.2223 | lr 1.08e-03 | (5163.15 ms | 101544 tok/s) step 4807/76294 | train loss 3.558795 | norm 0.2176 | lr 1.08e-03 | (6394.71 ms | 81988 tok/s) step 4808/76294 | train loss 3.580600 | norm 0.1973 | lr 1.08e-03 | (3879.13 ms | 135156 tok/s) step 4809/76294 | train loss 3.597498 | norm 0.1979 | lr 1.08e-03 | (3793.80 ms | 138196 tok/s) step 4810/76294 | train loss 3.575480 | norm 0.2157 | lr 1.08e-03 | (3878.83 ms | 135167 tok/s) step 4811/76294 | train loss 3.597195 | norm 0.1982 | lr 1.08e-03 | (3801.97 ms | 137899 tok/s) step 4812/76294 | train loss 3.620901 | norm 0.1936 | lr 1.08e-03 | (3861.61 ms | 135769 tok/s) step 4813/76294 | train loss 3.749251 | norm 0.1942 | lr 1.08e-03 | (3808.32 ms | 137669 tok/s) step 4814/76294 | train loss 3.622865 | norm 0.2608 | lr 1.08e-03 | (3845.64 ms | 136333 tok/s) step 4815/76294 | train loss 3.585240 | norm 0.3413 | lr 1.08e-03 | (3816.42 ms | 137377 tok/s) step 4816/76294 | train loss 3.601039 | norm 0.2766 | lr 1.08e-03 | (3842.02 ms | 136462 tok/s) step 4817/76294 | train loss 3.589238 | norm 0.1961 | lr 1.08e-03 | (3839.77 ms | 136541 tok/s) step 4818/76294 | train loss 3.610819 | norm 0.2098 | lr 1.08e-03 | (3843.95 ms | 136393 tok/s) step 4819/76294 | train loss 3.576898 | norm 0.1972 | lr 1.08e-03 | (3813.89 ms | 137468 tok/s) step 4820/76294 | train loss 3.627205 | norm 0.2551 | lr 1.08e-03 | (3836.17 ms | 136670 tok/s) step 4821/76294 | train loss 3.650673 | norm 0.2435 | lr 1.08e-03 | (3805.90 ms | 
137757 tok/s) step 4822/76294 | train loss 3.575976 | norm 0.2287 | lr 1.08e-03 | (3837.82 ms | 136611 tok/s) step 4823/76294 | train loss 3.572923 | norm 0.1867 | lr 1.08e-03 | (3813.24 ms | 137492 tok/s) step 4824/76294 | train loss 3.602026 | norm 0.2382 | lr 1.08e-03 | (3803.08 ms | 137859 tok/s) step 4825/76294 | train loss 3.546136 | norm 0.2311 | lr 1.08e-03 | (3853.22 ms | 136065 tok/s) step 4826/76294 | train loss 3.586782 | norm 0.2135 | lr 1.08e-03 | (3805.68 ms | 137765 tok/s) step 4827/76294 | train loss 3.638259 | norm 0.2103 | lr 1.08e-03 | (3817.29 ms | 137345 tok/s) step 4828/76294 | train loss 3.631363 | norm 0.2140 | lr 1.08e-03 | (3803.13 ms | 137857 tok/s) step 4829/76294 | train loss 3.561482 | norm 0.1943 | lr 1.08e-03 | (3925.29 ms | 133567 tok/s) step 4830/76294 | train loss 3.585995 | norm 0.2180 | lr 1.08e-03 | (3809.32 ms | 137633 tok/s) step 4831/76294 | train loss 3.570500 | norm 0.2038 | lr 1.08e-03 | (3846.12 ms | 136316 tok/s) step 4832/76294 | train loss 3.616376 | norm 0.2287 | lr 1.08e-03 | (3802.46 ms | 137881 tok/s) step 4833/76294 | train loss 3.674451 | norm 0.1946 | lr 1.08e-03 | (3887.71 ms | 134858 tok/s) step 4834/76294 | train loss 3.537995 | norm 0.1998 | lr 1.08e-03 | (3813.04 ms | 137499 tok/s) step 4835/76294 | train loss 3.582569 | norm 0.1970 | lr 1.08e-03 | (3936.62 ms | 133182 tok/s) step 4836/76294 | train loss 3.583893 | norm 0.2146 | lr 1.08e-03 | (3824.13 ms | 137100 tok/s) step 4837/76294 | train loss 3.512222 | norm 0.2093 | lr 1.08e-03 | (3857.14 ms | 135926 tok/s) step 4838/76294 | train loss 3.674909 | norm 0.2416 | lr 1.08e-03 | (3808.44 ms | 137665 tok/s) step 4839/76294 | train loss 3.740375 | norm 0.1982 | lr 1.08e-03 | (3811.98 ms | 137537 tok/s) step 4840/76294 | train loss 3.597920 | norm 0.2141 | lr 1.08e-03 | (3842.98 ms | 136427 tok/s) step 4841/76294 | train loss 3.591198 | norm 0.2175 | lr 1.08e-03 | (3825.24 ms | 137060 tok/s) step 4842/76294 | train loss 3.601919 | norm 0.2310 | lr 1.08e-03 
| (3848.85 ms | 136219 tok/s) step 4843/76294 | train loss 3.638668 | norm 0.1894 | lr 1.08e-03 | (3841.14 ms | 136493 tok/s) step 4844/76294 | train loss 3.476621 | norm 0.2242 | lr 1.08e-03 | (3836.64 ms | 136653 tok/s) step 4845/76294 | train loss 3.599565 | norm 0.2241 | lr 1.08e-03 | (3812.90 ms | 137504 tok/s) step 4846/76294 | train loss 3.530304 | norm 0.2208 | lr 1.08e-03 | (3844.37 ms | 136378 tok/s) step 4847/76294 | train loss 3.581807 | norm 0.1953 | lr 1.08e-03 | (3840.74 ms | 136507 tok/s) step 4848/76294 | train loss 3.639608 | norm 0.1959 | lr 1.08e-03 | (3835.73 ms | 136685 tok/s) step 4849/76294 | train loss 3.582064 | norm 0.2022 | lr 1.08e-03 | (3848.79 ms | 136222 tok/s) step 4850/76294 | train loss 3.623126 | norm 0.1910 | lr 1.08e-03 | (4025.21 ms | 130251 tok/s) step 4851/76294 | train loss 3.570703 | norm 0.1694 | lr 1.08e-03 | (3806.72 ms | 137727 tok/s) step 4852/76294 | train loss 3.591945 | norm 0.2316 | lr 1.08e-03 | (3865.74 ms | 135624 tok/s) step 4853/76294 | train loss 3.585396 | norm 0.1972 | lr 1.08e-03 | (3803.19 ms | 137855 tok/s) step 4854/76294 | train loss 3.591251 | norm 0.2065 | lr 1.08e-03 | (3851.59 ms | 136123 tok/s) step 4855/76294 | train loss 3.566537 | norm 0.2084 | lr 1.08e-03 | (3940.14 ms | 133063 tok/s) step 4856/76294 | train loss 3.549328 | norm 0.2266 | lr 1.08e-03 | (4411.56 ms | 118844 tok/s) step 4857/76294 | train loss 3.620196 | norm 0.2080 | lr 1.08e-03 | (3798.35 ms | 138031 tok/s) step 4858/76294 | train loss 3.624257 | norm 0.2482 | lr 1.08e-03 | (3830.39 ms | 136876 tok/s) step 4859/76294 | train loss 3.554289 | norm 0.2371 | lr 1.08e-03 | (3796.70 ms | 138090 tok/s) step 4860/76294 | train loss 3.582115 | norm 0.2185 | lr 1.08e-03 | (7013.97 ms | 74749 tok/s) step 4861/76294 | train loss 3.705317 | norm 0.2146 | lr 1.08e-03 | (3867.51 ms | 135562 tok/s) step 4862/76294 | train loss 3.581890 | norm 0.2229 | lr 1.08e-03 | (3851.66 ms | 136120 tok/s) step 4863/76294 | train loss 3.628984 | norm 
0.2140 | lr 1.08e-03 | (3833.79 ms | 136755 tok/s) step 4864/76294 | train loss 3.575066 | norm 0.2162 | lr 1.08e-03 | (3797.51 ms | 138061 tok/s) step 4865/76294 | train loss 3.571729 | norm 0.2138 | lr 1.08e-03 | (3825.37 ms | 137056 tok/s) step 4866/76294 | train loss 3.562930 | norm 0.2135 | lr 1.08e-03 | (3803.68 ms | 137837 tok/s) step 4867/76294 | train loss 3.686470 | norm 0.1946 | lr 1.08e-03 | (3799.51 ms | 137988 tok/s) step 4868/76294 | train loss 3.677986 | norm 0.2617 | lr 1.08e-03 | (3805.97 ms | 137754 tok/s) step 4869/76294 | train loss 3.551468 | norm 0.2206 | lr 1.08e-03 | (4001.49 ms | 131023 tok/s) step 4870/76294 | train loss 3.561782 | norm 0.2368 | lr 1.08e-03 | (3796.18 ms | 138109 tok/s) step 4871/76294 | train loss 3.615173 | norm 0.1983 | lr 1.08e-03 | (3815.80 ms | 137399 tok/s) step 4872/76294 | train loss 3.626478 | norm 0.1880 | lr 1.08e-03 | (3844.00 ms | 136391 tok/s) step 4873/76294 | train loss 3.655135 | norm 0.2329 | lr 1.08e-03 | (3852.62 ms | 136086 tok/s) step 4874/76294 | train loss 3.675277 | norm 0.2189 | lr 1.08e-03 | (3814.08 ms | 137461 tok/s) step 4875/76294 | train loss 3.583302 | norm 0.2619 | lr 1.08e-03 | (3838.22 ms | 136597 tok/s) step 4876/76294 | train loss 3.523319 | norm 0.2471 | lr 1.08e-03 | (3795.90 ms | 138120 tok/s) step 4877/76294 | train loss 3.595172 | norm 0.2514 | lr 1.08e-03 | (3944.34 ms | 132922 tok/s) step 4878/76294 | train loss 3.691462 | norm 0.2616 | lr 1.08e-03 | (3808.56 ms | 137660 tok/s) step 4879/76294 | train loss 3.644311 | norm 0.2759 | lr 1.08e-03 | (3902.31 ms | 134353 tok/s) step 4880/76294 | train loss 3.584118 | norm 0.2584 | lr 1.08e-03 | (3796.40 ms | 138101 tok/s) step 4881/76294 | train loss 3.546182 | norm 0.2131 | lr 1.07e-03 | (3852.06 ms | 136106 tok/s) step 4882/76294 | train loss 3.608123 | norm 0.2307 | lr 1.07e-03 | (3798.90 ms | 138011 tok/s) step 4883/76294 | train loss 3.580944 | norm 0.2267 | lr 1.07e-03 | (3832.53 ms | 136799 tok/s) step 4884/76294 | train loss 
3.733148 | norm 0.2314 | lr 1.07e-03 | (3798.11 ms | 138039 tok/s) step 4885/76294 | train loss 3.582630 | norm 0.2007 | lr 1.07e-03 | (3851.03 ms | 136142 tok/s) step 4886/76294 | train loss 3.472731 | norm 0.2606 | lr 1.07e-03 | (3804.51 ms | 137807 tok/s) step 4887/76294 | train loss 3.682193 | norm 0.2539 | lr 1.07e-03 | (4005.47 ms | 130893 tok/s) step 4888/76294 | train loss 3.614212 | norm 0.2035 | lr 1.07e-03 | (3816.71 ms | 137367 tok/s) step 4889/76294 | train loss 3.597735 | norm 0.2133 | lr 1.07e-03 | (3841.48 ms | 136481 tok/s) step 4890/76294 | train loss 3.589985 | norm 0.3473 | lr 1.07e-03 | (3800.06 ms | 137968 tok/s) step 4891/76294 | train loss 3.570079 | norm 0.3426 | lr 1.07e-03 | (4699.16 ms | 111571 tok/s) step 4892/76294 | train loss 3.560872 | norm 0.2909 | lr 1.07e-03 | (3788.34 ms | 138395 tok/s) step 4893/76294 | train loss 3.599123 | norm 0.2174 | lr 1.07e-03 | (3971.47 ms | 132014 tok/s) step 4894/76294 | train loss 3.638949 | norm 0.2394 | lr 1.07e-03 | (3796.87 ms | 138084 tok/s) step 4895/76294 | train loss 3.568169 | norm 0.2142 | lr 1.07e-03 | (3823.07 ms | 137138 tok/s) step 4896/76294 | train loss 3.646309 | norm 0.2021 | lr 1.07e-03 | (3830.95 ms | 136856 tok/s) step 4897/76294 | train loss 3.723914 | norm 0.2295 | lr 1.07e-03 | (3877.66 ms | 135207 tok/s) step 4898/76294 | train loss 3.629230 | norm 0.2041 | lr 1.07e-03 | (3886.86 ms | 134887 tok/s) step 4899/76294 | train loss 3.566822 | norm 0.2042 | lr 1.07e-03 | (3841.47 ms | 136481 tok/s) step 4900/76294 | train loss 3.563662 | norm 0.1824 | lr 1.07e-03 | (3818.27 ms | 137310 tok/s) step 4901/76294 | train loss 3.612926 | norm 0.1928 | lr 1.07e-03 | (3807.98 ms | 137682 tok/s) step 4902/76294 | train loss 3.775103 | norm 0.2023 | lr 1.07e-03 | (3839.28 ms | 136559 tok/s) step 4903/76294 | train loss 3.564793 | norm 0.2676 | lr 1.07e-03 | (3816.22 ms | 137384 tok/s) step 4904/76294 | train loss 3.595586 | norm 0.2797 | lr 1.07e-03 | (3808.43 ms | 137665 tok/s) step 
4905/76294 | train loss 3.532674 | norm 0.1940 | lr 1.07e-03 | (3929.45 ms | 133425 tok/s) step 4906/76294 | train loss 3.537198 | norm 0.2647 | lr 1.07e-03 | (3823.06 ms | 137138 tok/s) step 4907/76294 | train loss 3.597628 | norm 0.2360 | lr 1.07e-03 | (3886.12 ms | 134913 tok/s) step 4908/76294 | train loss 3.562045 | norm 0.2103 | lr 1.07e-03 | (3803.97 ms | 137827 tok/s) step 4909/76294 | train loss 3.601109 | norm 0.2302 | lr 1.07e-03 | (3866.20 ms | 135608 tok/s) step 4910/76294 | train loss 3.545913 | norm 0.2409 | lr 1.07e-03 | (3814.56 ms | 137444 tok/s) step 4911/76294 | train loss 3.588469 | norm 0.2250 | lr 1.07e-03 | (3944.59 ms | 132913 tok/s) step 4912/76294 | train loss 3.553871 | norm 0.2635 | lr 1.07e-03 | (3828.16 ms | 136956 tok/s) step 4913/76294 | train loss 3.607345 | norm 0.1978 | lr 1.07e-03 | (3922.18 ms | 133673 tok/s) step 4914/76294 | train loss 3.582513 | norm 0.2296 | lr 1.07e-03 | (3812.04 ms | 137535 tok/s) step 4915/76294 | train loss 3.819912 | norm 0.2392 | lr 1.07e-03 | (3945.70 ms | 132876 tok/s) step 4916/76294 | train loss 3.643885 | norm 0.2402 | lr 1.07e-03 | (4280.15 ms | 122493 tok/s) step 4917/76294 | train loss 3.610707 | norm 0.2132 | lr 1.07e-03 | (3837.95 ms | 136606 tok/s) step 4918/76294 | train loss 3.580316 | norm 0.2118 | lr 1.07e-03 | (3875.80 ms | 135272 tok/s) step 4919/76294 | train loss 3.629737 | norm 0.1912 | lr 1.07e-03 | (3976.41 ms | 131850 tok/s) step 4920/76294 | train loss 3.550635 | norm 0.2122 | lr 1.07e-03 | (3969.06 ms | 132094 tok/s) step 4921/76294 | train loss 3.586180 | norm 0.1851 | lr 1.07e-03 | (3807.72 ms | 137691 tok/s) step 4922/76294 | train loss 3.552135 | norm 0.1911 | lr 1.07e-03 | (3966.33 ms | 132185 tok/s) step 4923/76294 | train loss 3.576160 | norm 0.1978 | lr 1.07e-03 | (13494.49 ms | 38852 tok/s) step 4924/76294 | train loss 3.569918 | norm 0.1887 | lr 1.07e-03 | (5313.48 ms | 98671 tok/s) step 4925/76294 | train loss 3.509017 | norm 0.2143 | lr 1.07e-03 | (5574.46 ms | 
94052 tok/s) step 4926/76294 | train loss 3.630406 | norm 0.2132 | lr 1.07e-03 | (3774.84 ms | 138890 tok/s) step 4927/76294 | train loss 3.589260 | norm 0.2246 | lr 1.07e-03 | (3875.56 ms | 135281 tok/s) step 4928/76294 | train loss 3.527905 | norm 0.2151 | lr 1.07e-03 | (4131.27 ms | 126907 tok/s) step 4929/76294 | train loss 3.595149 | norm 0.1847 | lr 1.07e-03 | (3794.42 ms | 138173 tok/s) step 4930/76294 | train loss 3.623747 | norm 0.1903 | lr 1.07e-03 | (3875.51 ms | 135282 tok/s) step 4931/76294 | train loss 3.569087 | norm 0.2069 | lr 1.07e-03 | (3765.48 ms | 139236 tok/s) step 4932/76294 | train loss 3.613761 | norm 0.2084 | lr 1.07e-03 | (11045.16 ms | 47468 tok/s) step 4933/76294 | train loss 3.556112 | norm 0.1991 | lr 1.07e-03 | (3801.89 ms | 137902 tok/s) step 4934/76294 | train loss 3.594292 | norm 0.2072 | lr 1.07e-03 | (30450.09 ms | 17218 tok/s) step 4935/76294 | train loss 3.460333 | norm 0.2205 | lr 1.07e-03 | (76746.50 ms | 6831 tok/s) step 4936/76294 | train loss 3.584463 | norm 0.2986 | lr 1.07e-03 | (4810.88 ms | 108980 tok/s) step 4937/76294 | train loss 3.629457 | norm 0.2671 | lr 1.07e-03 | (3779.93 ms | 138703 tok/s) step 4938/76294 | train loss 3.579556 | norm 0.2146 | lr 1.07e-03 | (3710.34 ms | 141304 tok/s) step 4939/76294 | train loss 3.600652 | norm 0.2958 | lr 1.07e-03 | (4726.12 ms | 110934 tok/s) step 4940/76294 | train loss 3.612179 | norm 0.2355 | lr 1.07e-03 | (3735.54 ms | 140352 tok/s) step 4941/76294 | train loss 3.582884 | norm 0.2373 | lr 1.07e-03 | (3715.79 ms | 141097 tok/s) step 4942/76294 | train loss 3.599906 | norm 0.2590 | lr 1.07e-03 | (3717.71 ms | 141024 tok/s) step 4943/76294 | train loss 3.542971 | norm 0.2156 | lr 1.07e-03 | (3753.05 ms | 139697 tok/s) step 4944/76294 | train loss 3.630148 | norm 0.2306 | lr 1.07e-03 | (3729.78 ms | 140568 tok/s) step 4945/76294 | train loss 3.538750 | norm 0.2757 | lr 1.07e-03 | (3796.88 ms | 138084 tok/s) step 4946/76294 | train loss 3.550935 | norm 0.2387 | lr 1.07e-03 | 
(3742.07 ms | 140106 tok/s) step 4947/76294 | train loss 3.569174 | norm 0.2285 | lr 1.07e-03 | (3768.02 ms | 139141 tok/s) step 4948/76294 | train loss 3.557694 | norm 0.2171 | lr 1.07e-03 | (4628.39 ms | 113277 tok/s) step 4949/76294 | train loss 3.599047 | norm 0.2213 | lr 1.07e-03 | (3747.35 ms | 139909 tok/s) step 4950/76294 | train loss 3.600522 | norm 0.2155 | lr 1.07e-03 | (3970.09 ms | 132059 tok/s) step 4951/76294 | train loss 3.633411 | norm 0.2235 | lr 1.07e-03 | (3749.31 ms | 139836 tok/s) step 4952/76294 | train loss 3.618678 | norm 0.2313 | lr 1.07e-03 | (3911.58 ms | 134035 tok/s) step 4953/76294 | train loss 3.620687 | norm 0.1907 | lr 1.07e-03 | (3766.10 ms | 139212 tok/s) step 4954/76294 | train loss 3.608892 | norm 0.2282 | lr 1.07e-03 | (3916.93 ms | 133852 tok/s) step 4955/76294 | train loss 3.616735 | norm 0.2039 | lr 1.07e-03 | (3782.63 ms | 138604 tok/s) step 4956/76294 | train loss 3.585431 | norm 0.2073 | lr 1.07e-03 | (3811.94 ms | 137538 tok/s) step 4957/76294 | train loss 3.591631 | norm 0.2240 | lr 1.07e-03 | (3772.57 ms | 138974 tok/s) step 4958/76294 | train loss 3.583611 | norm 0.2327 | lr 1.07e-03 | (3973.77 ms | 131937 tok/s) step 4959/76294 | train loss 3.566005 | norm 0.2018 | lr 1.07e-03 | (3887.96 ms | 134849 tok/s) step 4960/76294 | train loss 3.571440 | norm 0.2051 | lr 1.07e-03 | (4950.12 ms | 105914 tok/s) step 4961/76294 | train loss 3.596817 | norm 0.2233 | lr 1.07e-03 | (3776.99 ms | 138811 tok/s) step 4962/76294 | train loss 3.639213 | norm 0.2245 | lr 1.07e-03 | (3805.91 ms | 137756 tok/s) step 4963/76294 | train loss 3.580595 | norm 0.1988 | lr 1.07e-03 | (3801.83 ms | 137904 tok/s) step 4964/76294 | train loss 3.513628 | norm 0.2268 | lr 1.07e-03 | (3780.91 ms | 138667 tok/s) step 4965/76294 | train loss 3.571575 | norm 0.2201 | lr 1.07e-03 | (3808.39 ms | 137667 tok/s) step 4966/76294 | train loss 3.591810 | norm 0.2214 | lr 1.07e-03 | (3791.10 ms | 138294 tok/s) step 4967/76294 | train loss 3.556742 | norm 0.1915 
| lr 1.07e-03 | (3788.00 ms | 138408 tok/s) step 4968/76294 | train loss 3.597601 | norm 0.2251 | lr 1.07e-03 | (3868.35 ms | 135533 tok/s) step 4969/76294 | train loss 3.608843 | norm 0.2169 | lr 1.07e-03 | (3788.96 ms | 138373 tok/s) step 4970/76294 | train loss 3.536243 | norm 0.1899 | lr 1.07e-03 | (3900.46 ms | 134417 tok/s) step 4971/76294 | train loss 3.637068 | norm 0.2569 | lr 1.07e-03 | (3791.28 ms | 138288 tok/s) step 4972/76294 | train loss 3.516578 | norm 0.2055 | lr 1.07e-03 | (3808.18 ms | 137674 tok/s) step 4973/76294 | train loss 3.563511 | norm 0.2301 | lr 1.07e-03 | (3839.13 ms | 136564 tok/s) step 4974/76294 | train loss 3.505188 | norm 0.2210 | lr 1.07e-03 | (3798.95 ms | 138009 tok/s) step 4975/76294 | train loss 3.552098 | norm 0.2254 | lr 1.07e-03 | (3826.70 ms | 137008 tok/s) step 4976/76294 | train loss 3.627743 | norm 0.2069 | lr 1.07e-03 | (3816.84 ms | 137362 tok/s) step 4977/76294 | train loss 3.528229 | norm 0.2269 | lr 1.07e-03 | (3810.49 ms | 137591 tok/s) step 4978/76294 | train loss 3.571935 | norm 0.2007 | lr 1.07e-03 | (3818.45 ms | 137304 tok/s) step 4979/76294 | train loss 3.562389 | norm 0.2055 | lr 1.07e-03 | (3804.22 ms | 137818 tok/s) step 4980/76294 | train loss 3.496362 | norm 0.2097 | lr 1.07e-03 | (3824.43 ms | 137089 tok/s) step 4981/76294 | train loss 3.629735 | norm 0.1824 | lr 1.07e-03 | (3803.36 ms | 137849 tok/s) step 4982/76294 | train loss 3.604451 | norm 0.2417 | lr 1.07e-03 | (3800.10 ms | 137967 tok/s) step 4983/76294 | train loss 3.601909 | norm 0.2162 | lr 1.07e-03 | (4049.56 ms | 129468 tok/s) step 4984/76294 | train loss 3.612764 | norm 0.2107 | lr 1.07e-03 | (3801.10 ms | 137931 tok/s) step 4985/76294 | train loss 3.683574 | norm 0.2329 | lr 1.07e-03 | (5198.82 ms | 100848 tok/s) step 4986/76294 | train loss 3.555234 | norm 0.2503 | lr 1.07e-03 | (3835.54 ms | 136692 tok/s) step 4987/76294 | train loss 3.616878 | norm 0.3585 | lr 1.07e-03 | (3815.06 ms | 137426 tok/s) step 4988/76294 | train loss 
3.568895 | norm 0.3360 | lr 1.07e-03 | (3803.89 ms | 137829 tok/s) step 4989/76294 | train loss 3.597963 | norm 0.2568 | lr 1.07e-03 | (3838.49 ms | 136587 tok/s) step 4990/76294 | train loss 3.638753 | norm 0.2590 | lr 1.07e-03 | (3802.87 ms | 137866 tok/s) step 4991/76294 | train loss 3.593766 | norm 0.2117 | lr 1.07e-03 | (3856.19 ms | 135960 tok/s) step 4992/76294 | train loss 3.590682 | norm 0.2437 | lr 1.07e-03 | (3802.40 ms | 137884 tok/s) step 4993/76294 | train loss 3.601580 | norm 0.2026 | lr 1.07e-03 | (3877.73 ms | 135205 tok/s) step 4994/76294 | train loss 3.541406 | norm 0.2532 | lr 1.07e-03 | (3825.07 ms | 137066 tok/s) step 4995/76294 | train loss 3.591931 | norm 0.2535 | lr 1.07e-03 | (3839.83 ms | 136540 tok/s) step 4996/76294 | train loss 3.576810 | norm 0.2160 | lr 1.07e-03 | (3807.13 ms | 137712 tok/s) step 4997/76294 | train loss 3.545495 | norm 0.2367 | lr 1.07e-03 | (3831.24 ms | 136845 tok/s) step 4998/76294 | train loss 3.564623 | norm 0.2262 | lr 1.07e-03 | (3808.14 ms | 137676 tok/s) step 4999/76294 | train loss 3.595769 | norm 0.2260 | lr 1.07e-03 | (3813.12 ms | 137496 tok/s) step 5000/76294 | train loss 3.581598 | norm 0.1896 | lr 1.07e-03 | (3841.06 ms | 136496 tok/s) val loss: 3.566668 saving model checkpoint to ./results/gpt2-124M-gqa/step_5000.pth step 5001/76294 | train loss 3.570714 | norm 0.2013 | lr 1.07e-03 | (3942.15 ms | 132995 tok/s) step 5002/76294 | train loss 3.598631 | norm 0.1780 | lr 1.07e-03 | (3800.72 ms | 137944 tok/s) step 5003/76294 | train loss 3.568246 | norm 0.1935 | lr 1.07e-03 | (3857.23 ms | 135923 tok/s) step 5004/76294 | train loss 3.575989 | norm 0.2011 | lr 1.07e-03 | (3799.04 ms | 138005 tok/s) step 5005/76294 | train loss 3.614028 | norm 0.2632 | lr 1.07e-03 | (3826.23 ms | 137025 tok/s) step 5006/76294 | train loss 3.623032 | norm 0.1848 | lr 1.07e-03 | (3796.06 ms | 138114 tok/s) step 5007/76294 | train loss 3.578151 | norm 0.1807 | lr 1.07e-03 | (3816.17 ms | 137386 tok/s) step 5008/76294 | train 
loss 3.628789 | norm 0.1890 | lr 1.07e-03 | (3801.45 ms | 137918 tok/s) step 5009/76294 | train loss 3.607125 | norm 0.1831 | lr 1.07e-03 | (3825.31 ms | 137058 tok/s) step 5010/76294 | train loss 3.657299 | norm 0.1803 | lr 1.07e-03 | (3797.39 ms | 138065 tok/s) step 5011/76294 | train loss 3.642596 | norm 0.1853 | lr 1.07e-03 | (3803.23 ms | 137854 tok/s) step 5012/76294 | train loss 3.586447 | norm 0.1745 | lr 1.07e-03 | (3826.57 ms | 137012 tok/s) step 5013/76294 | train loss 3.589598 | norm 0.1772 | lr 1.07e-03 | (3804.76 ms | 137798 tok/s) step 5014/76294 | train loss 3.552913 | norm 0.1976 | lr 1.07e-03 | (3849.95 ms | 136180 tok/s) step 5015/76294 | train loss 3.601104 | norm 0.1913 | lr 1.07e-03 | (3803.45 ms | 137846 tok/s) step 5016/76294 | train loss 3.570891 | norm 0.2071 | lr 1.07e-03 | (3802.00 ms | 137898 tok/s) step 5017/76294 | train loss 3.567030 | norm 0.2313 | lr 1.07e-03 | (3840.96 ms | 136499 tok/s) step 5018/76294 | train loss 3.674072 | norm 0.2511 | lr 1.07e-03 | (3799.54 ms | 137987 tok/s) step 5019/76294 | train loss 3.592624 | norm 0.2445 | lr 1.07e-03 | (3805.21 ms | 137782 tok/s) step 5020/76294 | train loss 3.572222 | norm 0.2189 | lr 1.07e-03 | (3831.98 ms | 136819 tok/s) step 5021/76294 | train loss 3.568194 | norm 0.1900 | lr 1.07e-03 | (3807.09 ms | 137714 tok/s) step 5022/76294 | train loss 3.601130 | norm 0.2013 | lr 1.07e-03 | (3800.60 ms | 137949 tok/s) step 5023/76294 | train loss 3.574982 | norm 0.1835 | lr 1.07e-03 | (3832.02 ms | 136818 tok/s) step 5024/76294 | train loss 3.627001 | norm 0.1948 | lr 1.07e-03 | (3803.17 ms | 137856 tok/s) step 5025/76294 | train loss 3.626908 | norm 0.1899 | lr 1.07e-03 | (3806.38 ms | 137739 tok/s) step 5026/76294 | train loss 3.499282 | norm 0.1636 | lr 1.07e-03 | (3863.02 ms | 135720 tok/s) step 5027/76294 | train loss 3.526662 | norm 0.2226 | lr 1.07e-03 | (3804.09 ms | 137822 tok/s) step 5028/76294 | train loss 3.643423 | norm 0.2446 | lr 1.07e-03 | (3846.37 ms | 136307 tok/s) step 
5029/76294 | train loss 3.563653 | norm 0.2625 | lr 1.07e-03 | (3815.70 ms | 137403 tok/s) step 5030/76294 | train loss 3.576330 | norm 0.2492 | lr 1.07e-03 | (3853.27 ms | 136063 tok/s) step 5031/76294 | train loss 3.680537 | norm 0.2196 | lr 1.07e-03 | (3807.11 ms | 137713 tok/s) step 5032/76294 | train loss 3.485901 | norm 0.2752 | lr 1.07e-03 | (3808.18 ms | 137674 tok/s) step 5033/76294 | train loss 3.672019 | norm 0.2844 | lr 1.07e-03 | (3828.51 ms | 136943 tok/s) step 5034/76294 | train loss 3.669810 | norm 0.2432 | lr 1.07e-03 | (3845.13 ms | 136351 tok/s) step 5035/76294 | train loss 3.601864 | norm 0.2351 | lr 1.07e-03 | (3804.31 ms | 137814 tok/s) step 5036/76294 | train loss 3.588635 | norm 0.2374 | lr 1.07e-03 | (3849.24 ms | 136206 tok/s) step 5037/76294 | train loss 3.545007 | norm 0.2355 | lr 1.07e-03 | (3806.33 ms | 137741 tok/s) step 5038/76294 | train loss 3.568325 | norm 0.2281 | lr 1.07e-03 | (3835.08 ms | 136708 tok/s) step 5039/76294 | train loss 3.572386 | norm 0.2278 | lr 1.07e-03 | (3804.25 ms | 137816 tok/s) step 5040/76294 | train loss 3.761735 | norm 0.2008 | lr 1.07e-03 | (3862.75 ms | 135729 tok/s) step 5041/76294 | train loss 3.591653 | norm 0.2001 | lr 1.07e-03 | (3804.91 ms | 137793 tok/s) step 5042/76294 | train loss 3.586247 | norm 0.2208 | lr 1.07e-03 | (4218.13 ms | 124294 tok/s) step 5043/76294 | train loss 3.606550 | norm 0.2204 | lr 1.07e-03 | (3927.86 ms | 133479 tok/s) step 5044/76294 | train loss 3.560617 | norm 0.2076 | lr 1.07e-03 | (3807.39 ms | 137703 tok/s) step 5045/76294 | train loss 3.602190 | norm 0.2164 | lr 1.07e-03 | (3842.32 ms | 136451 tok/s) step 5046/76294 | train loss 3.597853 | norm 0.1825 | lr 1.07e-03 | (3967.88 ms | 132133 tok/s) step 5047/76294 | train loss 3.588202 | norm 0.2143 | lr 1.07e-03 | (3800.04 ms | 137969 tok/s) step 5048/76294 | train loss 3.579141 | norm 0.1812 | lr 1.07e-03 | (3840.51 ms | 136515 tok/s) step 5049/76294 | train loss 3.571299 | norm 0.2198 | lr 1.07e-03 | (3826.55 ms | 
137013 tok/s) step 5050/76294 | train loss 3.584812 | norm 0.2175 | lr 1.07e-03 | (3859.45 ms | 135845 tok/s) step 5051/76294 | train loss 3.560708 | norm 0.2110 | lr 1.07e-03 | (3818.60 ms | 137298 tok/s) step 5052/76294 | train loss 3.573067 | norm 0.2136 | lr 1.06e-03 | (3809.70 ms | 137619 tok/s) step 5053/76294 | train loss 3.585821 | norm 0.2085 | lr 1.06e-03 | (3824.67 ms | 137080 tok/s) step 5054/76294 | train loss 3.545411 | norm 0.2170 | lr 1.06e-03 | (3827.95 ms | 136963 tok/s) step 5055/76294 | train loss 3.596176 | norm 0.1992 | lr 1.06e-03 | (3825.22 ms | 137061 tok/s) step 5056/76294 | train loss 3.579850 | norm 0.2328 | lr 1.06e-03 | (3805.50 ms | 137771 tok/s) step 5057/76294 | train loss 3.572136 | norm 0.2178 | lr 1.06e-03 | (3808.15 ms | 137675 tok/s) step 5058/76294 | train loss 3.724308 | norm 0.2787 | lr 1.06e-03 | (3804.19 ms | 137819 tok/s) step 5059/76294 | train loss 3.615528 | norm 0.2561 | lr 1.06e-03 | (3833.65 ms | 136759 tok/s) step 5060/76294 | train loss 3.612916 | norm 0.2113 | lr 1.06e-03 | (3840.90 ms | 136501 tok/s) step 5061/76294 | train loss 3.547160 | norm 0.2549 | lr 1.06e-03 | (3830.59 ms | 136869 tok/s) step 5062/76294 | train loss 3.605209 | norm 0.2612 | lr 1.06e-03 | (3809.86 ms | 137613 tok/s) step 5063/76294 | train loss 3.552550 | norm 0.2385 | lr 1.06e-03 | (3837.76 ms | 136613 tok/s) step 5064/76294 | train loss 3.592816 | norm 0.1922 | lr 1.06e-03 | (3811.55 ms | 137552 tok/s) step 5065/76294 | train loss 3.604095 | norm 0.2363 | lr 1.06e-03 | (3805.79 ms | 137761 tok/s) step 5066/76294 | train loss 3.610107 | norm 0.2007 | lr 1.06e-03 | (3955.37 ms | 132551 tok/s) step 5067/76294 | train loss 3.573842 | norm 0.2283 | lr 1.06e-03 | (5674.42 ms | 92395 tok/s) step 5068/76294 | train loss 3.635165 | norm 0.2180 | lr 1.06e-03 | (3842.39 ms | 136449 tok/s) step 5069/76294 | train loss 3.553679 | norm 0.1974 | lr 1.06e-03 | (3806.67 ms | 137729 tok/s) step 5070/76294 | train loss 3.523700 | norm 0.2109 | lr 1.06e-03 
| (3810.50 ms | 137590 tok/s) step 5071/76294 | train loss 3.546615 | norm 0.1985 | lr 1.06e-03 | (3833.62 ms | 136761 tok/s) step 5072/76294 | train loss 3.488770 | norm 0.2640 | lr 1.06e-03 | (3811.52 ms | 137553 tok/s) step 5073/76294 | train loss 3.601149 | norm 0.2904 | lr 1.06e-03 | (3876.07 ms | 135263 tok/s) step 5074/76294 | train loss 3.660196 | norm 0.2268 | lr 1.06e-03 | (3834.55 ms | 136728 tok/s) step 5075/76294 | train loss 3.588205 | norm 0.2414 | lr 1.06e-03 | (3840.85 ms | 136503 tok/s) step 5076/76294 | train loss 3.543730 | norm 0.2420 | lr 1.06e-03 | (3806.28 ms | 137743 tok/s) step 5077/76294 | train loss 3.575830 | norm 0.1956 | lr 1.06e-03 | (3852.58 ms | 136087 tok/s) step 5078/76294 | train loss 3.643858 | norm 0.2684 | lr 1.06e-03 | (3809.79 ms | 137616 tok/s) step 5079/76294 | train loss 3.651224 | norm 0.2573 | lr 1.06e-03 | (3876.20 ms | 135258 tok/s) step 5080/76294 | train loss 3.611219 | norm 0.2385 | lr 1.06e-03 | (3811.35 ms | 137559 tok/s) step 5081/76294 | train loss 3.608678 | norm 0.2167 | lr 1.06e-03 | (3821.94 ms | 137179 tok/s) step 5082/76294 | train loss 3.609022 | norm 0.2274 | lr 1.06e-03 | (3815.49 ms | 137410 tok/s) step 5083/76294 | train loss 3.574464 | norm 0.1996 | lr 1.06e-03 | (3865.44 ms | 135635 tok/s) step 5084/76294 | train loss 3.600959 | norm 0.2148 | lr 1.06e-03 | (3836.95 ms | 136642 tok/s) step 5085/76294 | train loss 3.684691 | norm 0.2175 | lr 1.06e-03 | (3815.59 ms | 137407 tok/s) step 5086/76294 | train loss 3.519355 | norm 0.1933 | lr 1.06e-03 | (3811.89 ms | 137540 tok/s) step 5087/76294 | train loss 3.598858 | norm 0.1900 | lr 1.06e-03 | (3849.39 ms | 136200 tok/s) step 5088/76294 | train loss 3.642007 | norm 0.2080 | lr 1.06e-03 | (3808.89 ms | 137648 tok/s) step 5089/76294 | train loss 3.559113 | norm 0.1813 | lr 1.06e-03 | (3845.82 ms | 136327 tok/s) step 5090/76294 | train loss 3.601306 | norm 0.2011 | lr 1.06e-03 | (3812.96 ms | 137502 tok/s) step 5091/76294 | train loss 3.586566 | norm 
0.2083 | lr 1.06e-03 | (3820.90 ms | 137216 tok/s) step 5092/76294 | train loss 3.566579 | norm 0.2334 | lr 1.06e-03 | (3844.10 ms | 136388 tok/s) step 5093/76294 | train loss 3.536756 | norm 0.2634 | lr 1.06e-03 | (3976.34 ms | 131852 tok/s) step 5094/76294 | train loss 3.618083 | norm 0.2306 | lr 1.06e-03 | (3816.76 ms | 137365 tok/s) step 5095/76294 | train loss 3.550356 | norm 0.2377 | lr 1.06e-03 | (3871.50 ms | 135423 tok/s) step 5096/76294 | train loss 3.562261 | norm 0.2533 | lr 1.06e-03 | (3815.45 ms | 137412 tok/s) step 5097/76294 | train loss 3.555459 | norm 0.2026 | lr 1.06e-03 | (3814.75 ms | 137437 tok/s) step 5098/76294 | train loss 3.566799 | norm 0.2232 | lr 1.06e-03 | (3840.02 ms | 136533 tok/s) step 5099/76294 | train loss 3.594425 | norm 0.2971 | lr 1.06e-03 | (3814.14 ms | 137459 tok/s) step 5100/76294 | train loss 3.564116 | norm 0.2870 | lr 1.06e-03 | (3872.64 ms | 135383 tok/s) step 5101/76294 | train loss 3.549005 | norm 0.1944 | lr 1.06e-03 | (5007.30 ms | 104705 tok/s) step 5102/76294 | train loss 3.561321 | norm 0.2362 | lr 1.06e-03 | (4022.95 ms | 130324 tok/s) step 5103/76294 | train loss 3.579807 | norm 0.2029 | lr 1.06e-03 | (3794.91 ms | 138156 tok/s) step 5104/76294 | train loss 3.604664 | norm 0.2315 | lr 1.06e-03 | (3805.74 ms | 137762 tok/s) step 5105/76294 | train loss 3.636921 | norm 0.2057 | lr 1.06e-03 | (3795.95 ms | 138118 tok/s) step 5106/76294 | train loss 3.609226 | norm 0.2268 | lr 1.06e-03 | (3824.57 ms | 137084 tok/s) step 5107/76294 | train loss 3.616687 | norm 0.2327 | lr 1.06e-03 | (3792.69 ms | 138237 tok/s) step 5108/76294 | train loss 3.558567 | norm 0.2202 | lr 1.06e-03 | (3803.47 ms | 137845 tok/s) step 5109/76294 | train loss 3.652903 | norm 0.2231 | lr 1.06e-03 | (3822.45 ms | 137160 tok/s) step 5110/76294 | train loss 3.621981 | norm 0.2398 | lr 1.06e-03 | (3808.06 ms | 137679 tok/s) step 5111/76294 | train loss 3.622077 | norm 0.1844 | lr 1.06e-03 | (3815.80 ms | 137399 tok/s) step 5112/76294 | train loss 
3.532233 | norm 0.2432 | lr 1.06e-03 | (3798.77 ms | 138015 tok/s) step 5113/76294 | train loss 3.559429 | norm 0.1971 | lr 1.06e-03 | (3824.43 ms | 137089 tok/s) step 5114/76294 | train loss 3.593693 | norm 0.1989 | lr 1.06e-03 | (3828.86 ms | 136930 tok/s) step 5115/76294 | train loss 3.560061 | norm 0.1837 | lr 1.06e-03 | (3885.32 ms | 134941 tok/s) step 5116/76294 | train loss 3.549548 | norm 0.2196 | lr 1.06e-03 | (3800.95 ms | 137936 tok/s) step 5117/76294 | train loss 3.597184 | norm 0.1962 | lr 1.06e-03 | (3807.71 ms | 137691 tok/s) step 5118/76294 | train loss 3.525995 | norm 0.1986 | lr 1.06e-03 | (3862.11 ms | 135752 tok/s) step 5119/76294 | train loss 3.568260 | norm 0.1910 | lr 1.06e-03 | (3800.08 ms | 137968 tok/s) step 5120/76294 | train loss 3.528733 | norm 0.2123 | lr 1.06e-03 | (3885.48 ms | 134935 tok/s) step 5121/76294 | train loss 3.567023 | norm 0.1832 | lr 1.06e-03 | (3830.97 ms | 136855 tok/s) step 5122/76294 | train loss 3.562545 | norm 0.2136 | lr 1.06e-03 | (3800.32 ms | 137959 tok/s) step 5123/76294 | train loss 3.622671 | norm 0.1857 | lr 1.06e-03 | (3803.62 ms | 137839 tok/s) step 5124/76294 | train loss 3.625814 | norm 0.2241 | lr 1.06e-03 | (3827.52 ms | 136979 tok/s) step 5125/76294 | train loss 3.557229 | norm 0.2290 | lr 1.06e-03 | (3801.39 ms | 137920 tok/s) step 5126/76294 | train loss 3.754613 | norm 0.2801 | lr 1.06e-03 | (3835.29 ms | 136701 tok/s) step 5127/76294 | train loss 3.622785 | norm 0.2520 | lr 1.06e-03 | (3842.91 ms | 136430 tok/s) step 5128/76294 | train loss 3.582785 | norm 0.2073 | lr 1.06e-03 | (3796.87 ms | 138084 tok/s) step 5129/76294 | train loss 3.595321 | norm 0.2484 | lr 1.06e-03 | (3836.19 ms | 136669 tok/s) step 5130/76294 | train loss 3.641134 | norm 0.2053 | lr 1.06e-03 | (3802.68 ms | 137873 tok/s) step 5131/76294 | train loss 3.609010 | norm 0.2172 | lr 1.06e-03 | (3819.78 ms | 137256 tok/s) step 5132/76294 | train loss 3.567202 | norm 0.2122 | lr 1.06e-03 | (3806.21 ms | 137745 tok/s) step 
5133/76294 | train loss 3.599073 | norm 0.2117 | lr 1.06e-03 | (3809.27 ms | 137635 tok/s) step 5134/76294 | train loss 3.595387 | norm 0.1945 | lr 1.06e-03 | (3825.57 ms | 137048 tok/s) step 5135/76294 | train loss 3.506784 | norm 0.1855 | lr 1.06e-03 | (3803.31 ms | 137851 tok/s) step 5136/76294 | train loss 3.517502 | norm 0.2211 | lr 1.06e-03 | (3828.52 ms | 136943 tok/s) step 5137/76294 | train loss 3.611795 | norm 0.2309 | lr 1.06e-03 | (3807.86 ms | 137686 tok/s) step 5138/76294 | train loss 3.490605 | norm 0.2051 | lr 1.06e-03 | (3805.14 ms | 137784 tok/s) step 5139/76294 | train loss 3.536776 | norm 0.1819 | lr 1.06e-03 | (3813.12 ms | 137496 tok/s) step 5140/76294 | train loss 3.594936 | norm 0.1875 | lr 1.06e-03 | (3834.56 ms | 136727 tok/s) step 5141/76294 | train loss 3.632035 | norm 0.2008 | lr 1.06e-03 | (3807.83 ms | 137687 tok/s) step 5142/76294 | train loss 3.648857 | norm 0.2358 | lr 1.06e-03 | (3823.14 ms | 137136 tok/s) step 5143/76294 | train loss 3.599748 | norm 0.2051 | lr 1.06e-03 | (3944.55 ms | 132915 tok/s) step 5144/76294 | train loss 3.613844 | norm 0.2105 | lr 1.06e-03 | (3820.85 ms | 137218 tok/s) step 5145/76294 | train loss 3.587679 | norm 0.2075 | lr 1.06e-03 | (3851.81 ms | 136115 tok/s) step 5146/76294 | train loss 3.567320 | norm 0.1791 | lr 1.06e-03 | (3806.70 ms | 137728 tok/s) step 5147/76294 | train loss 3.580099 | norm 0.2312 | lr 1.06e-03 | (3923.18 ms | 133638 tok/s) step 5148/76294 | train loss 3.601886 | norm 0.2578 | lr 1.06e-03 | (3803.48 ms | 137844 tok/s) step 5149/76294 | train loss 3.544388 | norm 0.1931 | lr 1.06e-03 | (3873.42 ms | 135355 tok/s) step 5150/76294 | train loss 3.554517 | norm 0.2023 | lr 1.06e-03 | (4197.36 ms | 124909 tok/s) step 5151/76294 | train loss 3.536048 | norm 0.1884 | lr 1.06e-03 | (3806.61 ms | 137731 tok/s) step 5152/76294 | train loss 3.553616 | norm 0.2135 | lr 1.06e-03 | (3805.37 ms | 137776 tok/s) step 5153/76294 | train loss 3.659241 | norm 0.2028 | lr 1.06e-03 | (3802.54 ms | 
137878 tok/s) step 5154/76294 | train loss 3.524243 | norm 0.2233 | lr 1.06e-03 | (3801.36 ms | 137921 tok/s) step 5155/76294 | train loss 3.544139 | norm 0.2182 | lr 1.06e-03 | (3839.29 ms | 136558 tok/s) step 5156/76294 | train loss 3.601625 | norm 0.2149 | lr 1.06e-03 | (3805.53 ms | 137770 tok/s) step 5157/76294 | train loss 3.555076 | norm 0.2097 | lr 1.06e-03 | (3826.27 ms | 137023 tok/s) step 5158/76294 | train loss 3.563832 | norm 0.2289 | lr 1.06e-03 | (3843.52 ms | 136408 tok/s) step 5159/76294 | train loss 3.533689 | norm 0.2247 | lr 1.06e-03 | (3808.45 ms | 137664 tok/s) step 5160/76294 | train loss 3.604598 | norm 0.1999 | lr 1.06e-03 | (3801.02 ms | 137933 tok/s) step 5161/76294 | train loss 3.557557 | norm 0.2524 | lr 1.06e-03 | (3839.86 ms | 136538 tok/s) step 5162/76294 | train loss 3.482777 | norm 0.2069 | lr 1.06e-03 | (3803.77 ms | 137834 tok/s) step 5163/76294 | train loss 3.626091 | norm 0.2073 | lr 1.06e-03 | (3805.24 ms | 137780 tok/s) step 5164/76294 | train loss 3.739564 | norm 0.2054 | lr 1.06e-03 | (3846.46 ms | 136304 tok/s) step 5165/76294 | train loss 3.523841 | norm 0.2260 | lr 1.06e-03 | (3802.45 ms | 137882 tok/s) step 5166/76294 | train loss 3.532704 | norm 0.2283 | lr 1.06e-03 | (3810.10 ms | 137605 tok/s) step 5167/76294 | train loss 3.590890 | norm 0.2011 | lr 1.06e-03 | (3911.05 ms | 134053 tok/s) step 5168/76294 | train loss 3.590623 | norm 0.2403 | lr 1.06e-03 | (3806.64 ms | 137730 tok/s) step 5169/76294 | train loss 3.587498 | norm 0.2356 | lr 1.06e-03 | (3812.00 ms | 137536 tok/s) step 5170/76294 | train loss 3.525171 | norm 0.2154 | lr 1.06e-03 | (3828.95 ms | 136927 tok/s) step 5171/76294 | train loss 3.606994 | norm 0.2216 | lr 1.06e-03 | (3807.03 ms | 137716 tok/s) step 5172/76294 | train loss 3.515922 | norm 0.1866 | lr 1.06e-03 | (3827.95 ms | 136963 tok/s) step 5173/76294 | train loss 3.583161 | norm 0.2242 | lr 1.06e-03 | (3811.18 ms | 137566 tok/s) step 5174/76294 | train loss 3.571441 | norm 0.2030 | lr 1.06e-03 
| (3805.45 ms | 137773 tok/s) step 5175/76294 | train loss 3.549943 | norm 0.2500 | lr 1.06e-03 | (3829.76 ms | 136898 tok/s) step 5176/76294 | train loss 3.543493 | norm 0.2165 | lr 1.06e-03 | (3829.55 ms | 136906 tok/s) step 5177/76294 | train loss 3.584049 | norm 0.2083 | lr 1.06e-03 | (3830.08 ms | 136887 tok/s) step 5178/76294 | train loss 3.645512 | norm 0.2188 | lr 1.06e-03 | (3811.33 ms | 137560 tok/s) step 5179/76294 | train loss 3.606722 | norm 0.2603 | lr 1.06e-03 | (3810.30 ms | 137598 tok/s) step 5180/76294 | train loss 3.565813 | norm 0.2444 | lr 1.06e-03 | (3914.00 ms | 133952 tok/s) step 5181/76294 | train loss 3.503941 | norm 0.2008 | lr 1.06e-03 | (3805.26 ms | 137780 tok/s) step 5182/76294 | train loss 3.892495 | norm 0.2557 | lr 1.06e-03 | (3815.79 ms | 137399 tok/s) step 5183/76294 | train loss 3.614373 | norm 0.2586 | lr 1.06e-03 | (3803.96 ms | 137827 tok/s) step 5184/76294 | train loss 3.583002 | norm 0.2251 | lr 1.06e-03 | (3836.20 ms | 136669 tok/s) step 5185/76294 | train loss 3.610478 | norm 0.2322 | lr 1.06e-03 | (3807.52 ms | 137698 tok/s) step 5186/76294 | train loss 3.590735 | norm 0.2689 | lr 1.06e-03 | (3815.47 ms | 137411 tok/s) step 5187/76294 | train loss 3.595584 | norm 0.2308 | lr 1.06e-03 | (3902.99 ms | 134330 tok/s) step 5188/76294 | train loss 3.547037 | norm 0.2586 | lr 1.06e-03 | (3822.25 ms | 137167 tok/s) step 5189/76294 | train loss 3.544869 | norm 0.2130 | lr 1.06e-03 | (3807.35 ms | 137704 tok/s) step 5190/76294 | train loss 3.579208 | norm 0.2077 | lr 1.06e-03 | (3844.42 ms | 136376 tok/s) step 5191/76294 | train loss 3.588146 | norm 0.2161 | lr 1.06e-03 | (3806.47 ms | 137736 tok/s) step 5192/76294 | train loss 3.504167 | norm 0.2081 | lr 1.06e-03 | (3813.29 ms | 137490 tok/s) step 5193/76294 | train loss 3.614643 | norm 0.2091 | lr 1.06e-03 | (3823.95 ms | 137106 tok/s) step 5194/76294 | train loss 3.557696 | norm 0.2273 | lr 1.06e-03 | (3807.96 ms | 137682 tok/s) step 5195/76294 | train loss 3.576868 | norm 
0.2230 | lr 1.06e-03 | (3826.79 ms | 137005 tok/s) step 5196/76294 | train loss 3.601821 | norm 0.1972 | lr 1.06e-03 | (3834.61 ms | 136725 tok/s) step 5197/76294 | train loss 3.531385 | norm 0.2108 | lr 1.06e-03 | (3803.41 ms | 137847 tok/s) step 5198/76294 | train loss 3.565233 | norm 0.2341 | lr 1.06e-03 | (3842.87 ms | 136432 tok/s) step 5199/76294 | train loss 3.530998 | norm 0.2669 | lr 1.06e-03 | (3803.99 ms | 137826 tok/s) step 5200/76294 | train loss 3.616867 | norm 0.2462 | lr 1.06e-03 | (3974.22 ms | 131922 tok/s) step 5201/76294 | train loss 3.621930 | norm 0.2532 | lr 1.06e-03 | (3884.35 ms | 134975 tok/s) step 5202/76294 | train loss 3.648145 | norm 0.2625 | lr 1.06e-03 | (3866.38 ms | 135602 tok/s) step 5203/76294 | train loss 3.533429 | norm 0.2568 | lr 1.06e-03 | (3805.10 ms | 137786 tok/s) step 5204/76294 | train loss 3.549348 | norm 0.2100 | lr 1.06e-03 | (3857.09 ms | 135928 tok/s) step 5205/76294 | train loss 3.600535 | norm 0.2319 | lr 1.06e-03 | (3804.14 ms | 137820 tok/s) step 5206/76294 | train loss 3.579870 | norm 0.2084 | lr 1.06e-03 | (3835.76 ms | 136684 tok/s) step 5207/76294 | train loss 3.564380 | norm 0.2002 | lr 1.06e-03 | (3804.96 ms | 137791 tok/s) step 5208/76294 | train loss 3.598779 | norm 0.1729 | lr 1.06e-03 | (3877.43 ms | 135215 tok/s) step 5209/76294 | train loss 3.633353 | norm 0.2196 | lr 1.06e-03 | (3804.87 ms | 137794 tok/s) step 5210/76294 | train loss 3.544688 | norm 0.2290 | lr 1.06e-03 | (3863.91 ms | 135688 tok/s) step 5211/76294 | train loss 3.600208 | norm 0.2255 | lr 1.06e-03 | (3804.51 ms | 137807 tok/s) step 5212/76294 | train loss 3.535357 | norm 0.2247 | lr 1.06e-03 | (3829.83 ms | 136896 tok/s) step 5213/76294 | train loss 3.549144 | norm 0.1937 | lr 1.06e-03 | (3806.39 ms | 137739 tok/s) step 5214/76294 | train loss 3.628821 | norm 0.2116 | lr 1.06e-03 | (3885.69 ms | 134928 tok/s) step 5215/76294 | train loss 3.513100 | norm 0.1941 | lr 1.06e-03 | (3799.04 ms | 138005 tok/s) step 5216/76294 | train loss 
3.622061 | norm 0.2299 | lr 1.06e-03 | (3826.36 ms | 137020 tok/s) step 5217/76294 | train loss 3.533107 | norm 0.2327 | lr 1.05e-03 | (3802.39 ms | 137884 tok/s) step 5218/76294 | train loss 3.569987 | norm 0.2193 | lr 1.05e-03 | (3806.94 ms | 137719 tok/s) step 5219/76294 | train loss 3.529013 | norm 0.2262 | lr 1.05e-03 | (3835.24 ms | 136703 tok/s) step 5220/76294 | train loss 3.557857 | norm 0.1952 | lr 1.05e-03 | (3887.93 ms | 134850 tok/s) step 5221/76294 | train loss 3.599193 | norm 0.2108 | lr 1.05e-03 | (3844.04 ms | 136390 tok/s) step 5222/76294 | train loss 3.539757 | norm 0.1866 | lr 1.05e-03 | (3834.63 ms | 136725 tok/s) step 5223/76294 | train loss 3.526618 | norm 0.2031 | lr 1.05e-03 | (3806.45 ms | 137737 tok/s) step 5224/76294 | train loss 3.562544 | norm 0.1998 | lr 1.05e-03 | (3832.31 ms | 136807 tok/s) step 5225/76294 | train loss 3.536239 | norm 0.1958 | lr 1.05e-03 | (3809.30 ms | 137634 tok/s) step 5226/76294 | train loss 3.622744 | norm 0.2050 | lr 1.05e-03 | (3813.66 ms | 137476 tok/s) step 5227/76294 | train loss 3.566131 | norm 0.2002 | lr 1.05e-03 | (3804.58 ms | 137804 tok/s) step 5228/76294 | train loss 3.604958 | norm 0.1917 | lr 1.05e-03 | (3809.26 ms | 137635 tok/s) step 5229/76294 | train loss 3.597926 | norm 0.2103 | lr 1.05e-03 | (3824.64 ms | 137082 tok/s) step 5230/76294 | train loss 3.580572 | norm 0.2204 | lr 1.05e-03 | (3803.65 ms | 137838 tok/s) step 5231/76294 | train loss 3.545875 | norm 0.2540 | lr 1.05e-03 | (3805.64 ms | 137766 tok/s) step 5232/76294 | train loss 3.570352 | norm 0.2975 | lr 1.05e-03 | (3835.94 ms | 136678 tok/s) step 5233/76294 | train loss 3.706603 | norm 0.2147 | lr 1.05e-03 | (3830.21 ms | 136882 tok/s) step 5234/76294 | train loss 3.531199 | norm 0.2448 | lr 1.05e-03 | (3879.92 ms | 135129 tok/s) step 5235/76294 | train loss 3.544301 | norm 0.2350 | lr 1.05e-03 | (3827.83 ms | 136968 tok/s) step 5236/76294 | train loss 3.534656 | norm 0.1956 | lr 1.05e-03 | (3852.53 ms | 136089 tok/s) step 
5237/76294 | train loss 3.582346 | norm 0.2519 | lr 1.05e-03 | (3803.57 ms | 137841 tok/s) step 5238/76294 | train loss 3.572547 | norm 0.2007 | lr 1.05e-03 | (3844.02 ms | 136391 tok/s) step 5239/76294 | train loss 3.508445 | norm 0.2137 | lr 1.05e-03 | (3803.80 ms | 137833 tok/s) step 5240/76294 | train loss 3.594957 | norm 0.2185 | lr 1.05e-03 | (3851.31 ms | 136132 tok/s) step 5241/76294 | train loss 3.582992 | norm 0.2160 | lr 1.05e-03 | (3808.44 ms | 137665 tok/s) step 5242/76294 | train loss 3.526890 | norm 0.1964 | lr 1.05e-03 | (3806.63 ms | 137730 tok/s) step 5243/76294 | train loss 3.638974 | norm 0.1927 | lr 1.05e-03 | (3832.27 ms | 136809 tok/s) step 5244/76294 | train loss 3.598808 | norm 0.2539 | lr 1.05e-03 | (3803.30 ms | 137851 tok/s) step 5245/76294 | train loss 3.548946 | norm 0.2207 | lr 1.05e-03 | (3805.68 ms | 137765 tok/s) step 5246/76294 | train loss 3.629936 | norm 0.2292 | lr 1.05e-03 | (3812.16 ms | 137530 tok/s) step 5247/76294 | train loss 3.691023 | norm 0.1911 | lr 1.05e-03 | (3813.60 ms | 137479 tok/s) step 5248/76294 | train loss 3.552195 | norm 0.2092 | lr 1.05e-03 | (3806.10 ms | 137750 tok/s) step 5249/76294 | train loss 3.650689 | norm 0.2050 | lr 1.05e-03 | (3823.56 ms | 137120 tok/s) step 5250/76294 | train loss 3.534481 | norm 0.1941 | lr 1.05e-03 | (3854.62 ms | 136015 tok/s) val loss: 3.552476 saving model checkpoint to ./results/gpt2-124M-gqa/step_5250.pth step 5251/76294 | train loss 3.608704 | norm 0.1875 | lr 1.05e-03 | (3834.08 ms | 136744 tok/s) step 5252/76294 | train loss 3.562224 | norm 0.2073 | lr 1.05e-03 | (3761.29 ms | 139390 tok/s) step 5253/76294 | train loss 3.555142 | norm 0.2224 | lr 1.05e-03 | (3812.64 ms | 137513 tok/s) step 5254/76294 | train loss 3.684062 | norm 0.2054 | lr 1.05e-03 | (3765.87 ms | 139221 tok/s) step 5255/76294 | train loss 3.547483 | norm 0.1841 | lr 1.05e-03 | (3774.19 ms | 138914 tok/s) step 5256/76294 | train loss 3.573903 | norm 0.2045 | lr 1.05e-03 | (3923.97 ms | 133612 tok/s) 
step 5257/76294 | train loss 3.579608 | norm 0.2021 | lr 1.05e-03 | (3772.93 ms | 138961 tok/s) step 5258/76294 | train loss 3.490602 | norm 0.2340 | lr 1.05e-03 | (3781.06 ms | 138662 tok/s) step 5259/76294 | train loss 3.585360 | norm 0.2058 | lr 1.05e-03 | (3808.00 ms | 137681 tok/s) step 5260/76294 | train loss 3.589197 | norm 0.2005 | lr 1.05e-03 | (3782.65 ms | 138603 tok/s) step 5261/76294 | train loss 3.570868 | norm 0.2527 | lr 1.05e-03 | (3786.39 ms | 138466 tok/s) step 5262/76294 | train loss 3.550129 | norm 0.2344 | lr 1.05e-03 | (3795.26 ms | 138143 tok/s) step 5263/76294 | train loss 3.578154 | norm 0.2221 | lr 1.05e-03 | (3792.99 ms | 138226 tok/s) step 5264/76294 | train loss 3.569883 | norm 0.2032 | lr 1.05e-03 | (3824.40 ms | 137090 tok/s) step 5265/76294 | train loss 3.542667 | norm 0.2029 | lr 1.05e-03 | (4058.58 ms | 129180 tok/s) step 5266/76294 | train loss 3.580484 | norm 0.2064 | lr 1.05e-03 | (3801.77 ms | 137906 tok/s) step 5267/76294 | train loss 3.578341 | norm 0.2427 | lr 1.05e-03 | (3801.83 ms | 137904 tok/s) step 5268/76294 | train loss 3.644186 | norm 0.2153 | lr 1.05e-03 | (3797.23 ms | 138071 tok/s) step 5269/76294 | train loss 3.539089 | norm 0.1988 | lr 1.05e-03 | (3818.86 ms | 137289 tok/s) step 5270/76294 | train loss 3.528819 | norm 0.1989 | lr 1.05e-03 | (3798.87 ms | 138011 tok/s) step 5271/76294 | train loss 3.574016 | norm 0.1916 | lr 1.05e-03 | (3801.26 ms | 137925 tok/s) step 5272/76294 | train loss 3.497515 | norm 0.2389 | lr 1.05e-03 | (3825.15 ms | 137063 tok/s) step 5273/76294 | train loss 3.518805 | norm 0.2207 | lr 1.05e-03 | (3796.47 ms | 138099 tok/s) step 5274/76294 | train loss 3.546126 | norm 0.1996 | lr 1.05e-03 | (3829.85 ms | 136895 tok/s) step 5275/76294 | train loss 3.482278 | norm 0.2280 | lr 1.05e-03 | (3801.21 ms | 137927 tok/s) step 5276/76294 | train loss 3.549878 | norm 0.2365 | lr 1.05e-03 | (3804.21 ms | 137818 tok/s) step 5277/76294 | train loss 3.621412 | norm 0.2343 | lr 1.05e-03 | (3821.87 ms 
| 137181 tok/s) step 5278/76294 | train loss 3.468299 | norm 0.2212 | lr 1.05e-03 | (3805.36 ms | 137776 tok/s) step 5279/76294 | train loss 3.607023 | norm 0.1951 | lr 1.05e-03 | (3897.93 ms | 134504 tok/s) step 5280/76294 | train loss 3.552825 | norm 0.1928 | lr 1.05e-03 | (3849.45 ms | 136198 tok/s) step 5281/76294 | train loss 3.568363 | norm 0.1963 | lr 1.05e-03 | (3802.81 ms | 137869 tok/s) step 5282/76294 | train loss 3.552132 | norm 0.1932 | lr 1.05e-03 | (3909.03 ms | 134122 tok/s) step 5283/76294 | train loss 3.554832 | norm 0.1980 | lr 1.05e-03 | (3826.88 ms | 137001 tok/s) step 5284/76294 | train loss 3.579231 | norm 0.2138 | lr 1.05e-03 | (3809.29 ms | 137634 tok/s) step 5285/76294 | train loss 3.579995 | norm 0.2537 | lr 1.05e-03 | (3805.11 ms | 137785 tok/s) step 5286/76294 | train loss 3.502381 | norm 0.2295 | lr 1.05e-03 | (3837.48 ms | 136623 tok/s) step 5287/76294 | train loss 3.671870 | norm 0.2069 | lr 1.05e-03 | (3804.82 ms | 137796 tok/s) step 5288/76294 | train loss 3.529930 | norm 0.2567 | lr 1.05e-03 | (3809.33 ms | 137633 tok/s) step 5289/76294 | train loss 3.560902 | norm 0.2764 | lr 1.05e-03 | (3825.38 ms | 137055 tok/s) step 5290/76294 | train loss 3.554757 | norm 0.2090 | lr 1.05e-03 | (3811.26 ms | 137563 tok/s) step 5291/76294 | train loss 3.547637 | norm 0.2067 | lr 1.05e-03 | (3844.87 ms | 136360 tok/s) step 5292/76294 | train loss 3.575683 | norm 0.2587 | lr 1.05e-03 | (3817.04 ms | 137355 tok/s) step 5293/76294 | train loss 3.537321 | norm 0.1940 | lr 1.05e-03 | (3877.81 ms | 135202 tok/s) step 5294/76294 | train loss 3.567544 | norm 0.1968 | lr 1.05e-03 | (3807.11 ms | 137713 tok/s) step 5295/76294 | train loss 3.513494 | norm 0.2402 | lr 1.05e-03 | (3836.87 ms | 136645 tok/s) step 5296/76294 | train loss 3.581178 | norm 0.1807 | lr 1.05e-03 | (3815.16 ms | 137422 tok/s) step 5297/76294 | train loss 3.579842 | norm 0.2245 | lr 1.05e-03 | (3844.97 ms | 136357 tok/s) step 5298/76294 | train loss 3.565198 | norm 0.2271 | lr 
1.05e-03 | (3808.12 ms | 137676 tok/s) step 5299/76294 | train loss 3.533728 | norm 0.2289 | lr 1.05e-03 | (3835.19 ms | 136705 tok/s) step 5300/76294 | train loss 3.597440 | norm 0.2173 | lr 1.05e-03 | (3805.62 ms | 137767 tok/s) step 5301/76294 | train loss 3.600112 | norm 0.2031 | lr 1.05e-03 | (3815.41 ms | 137413 tok/s) step 5302/76294 | train loss 3.573307 | norm 0.2110 | lr 1.05e-03 | (3832.83 ms | 136789 tok/s) step 5303/76294 | train loss 3.620386 | norm 0.2164 | lr 1.05e-03 | (3816.80 ms | 137363 tok/s) step 5304/76294 | train loss 3.560725 | norm 0.2183 | lr 1.05e-03 | (3833.13 ms | 136778 tok/s) step 5305/76294 | train loss 3.574610 | norm 0.2270 | lr 1.05e-03 | (3810.98 ms | 137573 tok/s) step 5306/76294 | train loss 3.579557 | norm 0.2482 | lr 1.05e-03 | (3808.68 ms | 137656 tok/s) step 5307/76294 | train loss 3.479910 | norm 0.2391 | lr 1.05e-03 | (4199.60 ms | 124842 tok/s) step 5308/76294 | train loss 3.586496 | norm 0.1796 | lr 1.05e-03 | (3805.78 ms | 137761 tok/s) step 5309/76294 | train loss 3.602987 | norm 0.2253 | lr 1.05e-03 | (3830.60 ms | 136868 tok/s) step 5310/76294 | train loss 3.464916 | norm 0.2210 | lr 1.05e-03 | (3803.73 ms | 137835 tok/s) step 5311/76294 | train loss 3.498096 | norm 0.1827 | lr 1.05e-03 | (4016.31 ms | 130540 tok/s) step 5312/76294 | train loss 3.626641 | norm 0.2288 | lr 1.05e-03 | (3802.82 ms | 137868 tok/s) step 5313/76294 | train loss 3.551748 | norm 0.2606 | lr 1.05e-03 | (3809.23 ms | 137636 tok/s) step 5314/76294 | train loss 3.568222 | norm 0.1972 | lr 1.05e-03 | (3826.46 ms | 137016 tok/s) step 5315/76294 | train loss 3.573759 | norm 0.3107 | lr 1.05e-03 | (3810.79 ms | 137580 tok/s) step 5316/76294 | train loss 3.541150 | norm 0.2401 | lr 1.05e-03 | (3804.27 ms | 137816 tok/s) step 5317/76294 | train loss 3.584686 | norm 0.2258 | lr 1.05e-03 | (3922.60 ms | 133658 tok/s) step 5318/76294 | train loss 3.552608 | norm 0.2414 | lr 1.05e-03 | (3812.43 ms | 137521 tok/s) step 5319/76294 | train loss 3.655175 | 
norm 0.2197 | lr 1.05e-03 | (3865.99 ms | 135615 tok/s) step 5320/76294 | train loss 3.555468 | norm 0.2362 | lr 1.05e-03 | (3810.03 ms | 137607 tok/s) step 5321/76294 | train loss 3.606434 | norm 0.2126 | lr 1.05e-03 | (3811.82 ms | 137543 tok/s) step 5322/76294 | train loss 3.570738 | norm 0.2111 | lr 1.05e-03 | (3824.31 ms | 137094 tok/s) step 5323/76294 | train loss 3.689210 | norm 0.1979 | lr 1.05e-03 | (3812.62 ms | 137514 tok/s) step 5324/76294 | train loss 3.536348 | norm 0.2034 | lr 1.05e-03 | (3820.73 ms | 137222 tok/s) step 5325/76294 | train loss 3.623145 | norm 0.2007 | lr 1.05e-03 | (3807.62 ms | 137694 tok/s) step 5326/76294 | train loss 3.565919 | norm 0.1799 | lr 1.05e-03 | (3809.91 ms | 137612 tok/s) step 5327/76294 | train loss 3.567731 | norm 0.2024 | lr 1.05e-03 | (3834.10 ms | 136743 tok/s) step 5328/76294 | train loss 3.517990 | norm 0.1938 | lr 1.05e-03 | (3804.86 ms | 137794 tok/s) step 5329/76294 | train loss 3.591125 | norm 0.2119 | lr 1.05e-03 | (3812.98 ms | 137501 tok/s) step 5330/76294 | train loss 3.559165 | norm 0.2195 | lr 1.05e-03 | (3826.95 ms | 136999 tok/s) step 5331/76294 | train loss 3.626029 | norm 0.2194 | lr 1.05e-03 | (3815.07 ms | 137425 tok/s) step 5332/76294 | train loss 3.499181 | norm 0.2158 | lr 1.05e-03 | (3963.72 ms | 132272 tok/s) step 5333/76294 | train loss 3.517280 | norm 0.2046 | lr 1.05e-03 | (3802.91 ms | 137865 tok/s) step 5334/76294 | train loss 3.556312 | norm 0.2006 | lr 1.05e-03 | (3809.88 ms | 137613 tok/s) step 5335/76294 | train loss 3.574561 | norm 0.1889 | lr 1.05e-03 | (3830.06 ms | 136888 tok/s) step 5336/76294 | train loss 3.552016 | norm 0.1842 | lr 1.05e-03 | (4131.55 ms | 126899 tok/s) step 5337/76294 | train loss 3.554781 | norm 0.1955 | lr 1.05e-03 | (3928.60 ms | 133454 tok/s) step 5338/76294 | train loss 3.581463 | norm 0.1981 | lr 1.05e-03 | (3809.88 ms | 137613 tok/s) step 5339/76294 | train loss 3.528596 | norm 0.1936 | lr 1.05e-03 | (3807.70 ms | 137691 tok/s) step 5340/76294 | train 
loss 3.515349 | norm 0.1714 | lr 1.05e-03 | (3826.09 ms | 137030 tok/s) step 5341/76294 | train loss 3.520859 | norm 0.1895 | lr 1.05e-03 | (4248.13 ms | 123416 tok/s) step 5342/76294 | train loss 3.462740 | norm 0.1908 | lr 1.05e-03 | (3800.59 ms | 137949 tok/s) step 5343/76294 | train loss 3.506149 | norm 0.1733 | lr 1.05e-03 | (3825.91 ms | 137036 tok/s) step 5344/76294 | train loss 3.599774 | norm 0.1852 | lr 1.05e-03 | (3805.90 ms | 137757 tok/s) step 5345/76294 | train loss 3.493340 | norm 0.2023 | lr 1.05e-03 | (3802.46 ms | 137881 tok/s) step 5346/76294 | train loss 3.545639 | norm 0.2181 | lr 1.05e-03 | (3833.62 ms | 136760 tok/s) step 5347/76294 | train loss 3.652326 | norm 0.2404 | lr 1.05e-03 | (3804.65 ms | 137802 tok/s) step 5348/76294 | train loss 3.546327 | norm 0.2128 | lr 1.05e-03 | (3836.00 ms | 136676 tok/s) step 5349/76294 | train loss 3.510529 | norm 0.1968 | lr 1.05e-03 | (3805.19 ms | 137782 tok/s) step 5350/76294 | train loss 3.494772 | norm 0.2721 | lr 1.05e-03 | (3811.00 ms | 137572 tok/s) step 5351/76294 | train loss 3.524347 | norm 0.2578 | lr 1.05e-03 | (3828.41 ms | 136947 tok/s) step 5352/76294 | train loss 3.675009 | norm 0.2355 | lr 1.05e-03 | (3826.90 ms | 137001 tok/s) step 5353/76294 | train loss 3.517787 | norm 0.3352 | lr 1.05e-03 | (3820.59 ms | 137227 tok/s) step 5354/76294 | train loss 3.576703 | norm 0.2565 | lr 1.05e-03 | (3834.64 ms | 136724 tok/s) step 5355/76294 | train loss 3.554049 | norm 0.2357 | lr 1.05e-03 | (3806.83 ms | 137723 tok/s) step 5356/76294 | train loss 3.520481 | norm 0.2435 | lr 1.05e-03 | (3832.08 ms | 136816 tok/s) step 5357/76294 | train loss 3.551776 | norm 0.2529 | lr 1.05e-03 | (3805.18 ms | 137783 tok/s) step 5358/76294 | train loss 3.538450 | norm 0.2509 | lr 1.05e-03 | (3828.67 ms | 136938 tok/s) step 5359/76294 | train loss 3.617440 | norm 0.2076 | lr 1.05e-03 | (3808.92 ms | 137647 tok/s) step 5360/76294 | train loss 3.545740 | norm 0.2353 | lr 1.05e-03 | (3809.13 ms | 137640 tok/s) step 
5361/76294 | train loss 3.508417 | norm 0.2058 | lr 1.05e-03 | (3828.01 ms | 136961 tok/s) step 5362/76294 | train loss 3.480459 | norm 0.2030 | lr 1.05e-03 | (3909.79 ms | 134096 tok/s) step 5363/76294 | train loss 3.600162 | norm 0.2013 | lr 1.05e-03 | (3811.17 ms | 137566 tok/s) step 5364/76294 | train loss 3.502824 | norm 0.2110 | lr 1.05e-03 | (3838.72 ms | 136579 tok/s) step 5365/76294 | train loss 3.469994 | norm 0.2016 | lr 1.05e-03 | (3842.98 ms | 136428 tok/s) step 5366/76294 | train loss 3.543329 | norm 0.2014 | lr 1.05e-03 | (3816.54 ms | 137373 tok/s) step 5367/76294 | train loss 3.494939 | norm 0.1899 | lr 1.05e-03 | (3811.91 ms | 137539 tok/s) step 5368/76294 | train loss 3.543725 | norm 0.1951 | lr 1.05e-03 | (3864.11 ms | 135681 tok/s) step 5369/76294 | train loss 3.547427 | norm 0.2019 | lr 1.05e-03 | (3855.53 ms | 135983 tok/s) step 5370/76294 | train loss 3.500563 | norm 0.1935 | lr 1.05e-03 | (3813.43 ms | 137485 tok/s) step 5371/76294 | train loss 3.610774 | norm 0.1858 | lr 1.05e-03 | (3809.41 ms | 137630 tok/s) step 5372/76294 | train loss 3.483644 | norm 0.1979 | lr 1.05e-03 | (3844.26 ms | 136382 tok/s) step 5373/76294 | train loss 3.588166 | norm 0.2416 | lr 1.05e-03 | (3815.98 ms | 137393 tok/s) step 5374/76294 | train loss 3.550853 | norm 0.2436 | lr 1.05e-03 | (3840.68 ms | 136509 tok/s) step 5375/76294 | train loss 3.470037 | norm 0.1965 | lr 1.05e-03 | (3812.56 ms | 137516 tok/s) step 5376/76294 | train loss 3.520081 | norm 0.2108 | lr 1.05e-03 | (3874.71 ms | 135310 tok/s) step 5377/76294 | train loss 3.531024 | norm 0.1953 | lr 1.05e-03 | (3805.81 ms | 137760 tok/s) step 5378/76294 | train loss 3.487993 | norm 0.2013 | lr 1.04e-03 | (3809.19 ms | 137638 tok/s) step 5379/76294 | train loss 3.609529 | norm 0.2261 | lr 1.04e-03 | (3825.83 ms | 137039 tok/s) step 5380/76294 | train loss 3.530938 | norm 0.2155 | lr 1.04e-03 | (3807.51 ms | 137698 tok/s) step 5381/76294 | train loss 3.474391 | norm 0.2338 | lr 1.04e-03 | (3805.06 ms | 
137787 tok/s) step 5382/76294 | train loss 3.647913 | norm 0.2063 | lr 1.04e-03 | (3936.17 ms | 133198 tok/s) step 5383/76294 | train loss 3.527952 | norm 0.2284 | lr 1.04e-03 | (3823.08 ms | 137138 tok/s) step 5384/76294 | train loss 3.531836 | norm 0.1931 | lr 1.04e-03 | (3828.75 ms | 136934 tok/s) step 5385/76294 | train loss 3.578697 | norm 0.2349 | lr 1.04e-03 | (3804.23 ms | 137817 tok/s) step 5386/76294 | train loss 3.500975 | norm 0.1776 | lr 1.04e-03 | (3850.38 ms | 136165 tok/s) step 5387/76294 | train loss 3.610046 | norm 0.2226 | lr 1.04e-03 | (3812.07 ms | 137534 tok/s) step 5388/76294 | train loss 3.552236 | norm 0.2472 | lr 1.04e-03 | (3809.42 ms | 137629 tok/s) step 5389/76294 | train loss 3.571301 | norm 0.2318 | lr 1.04e-03 | (3866.36 ms | 135603 tok/s) step 5390/76294 | train loss 3.547200 | norm 0.1882 | lr 1.04e-03 | (3805.55 ms | 137769 tok/s) step 5391/76294 | train loss 3.521345 | norm 0.2106 | lr 1.04e-03 | (3827.84 ms | 136967 tok/s) step 5392/76294 | train loss 3.553797 | norm 0.2595 | lr 1.04e-03 | (3824.74 ms | 137078 tok/s) step 5393/76294 | train loss 3.571951 | norm 0.2250 | lr 1.04e-03 | (3805.64 ms | 137766 tok/s) step 5394/76294 | train loss 3.567807 | norm 0.2093 | lr 1.04e-03 | (3803.77 ms | 137834 tok/s) step 5395/76294 | train loss 3.537764 | norm 0.2019 | lr 1.04e-03 | (3835.63 ms | 136689 tok/s) step 5396/76294 | train loss 3.524045 | norm 0.2071 | lr 1.04e-03 | (3814.46 ms | 137448 tok/s) step 5397/76294 | train loss 3.496750 | norm 0.2079 | lr 1.04e-03 | (3904.33 ms | 134284 tok/s) step 5398/76294 | train loss 3.493277 | norm 0.2064 | lr 1.04e-03 | (3803.00 ms | 137862 tok/s) step 5399/76294 | train loss 3.523920 | norm 0.2000 | lr 1.04e-03 | (4388.40 ms | 119471 tok/s) step 5400/76294 | train loss 3.493473 | norm 0.2135 | lr 1.04e-03 | (3844.65 ms | 136368 tok/s) step 5401/76294 | train loss 3.511408 | norm 0.2142 | lr 1.04e-03 | (3806.56 ms | 137733 tok/s) step 5402/76294 | train loss 3.650622 | norm 0.2478 | lr 1.04e-03 
| (3800.75 ms | 137943 tok/s) step 5403/76294 | train loss 3.587909 | norm 0.3328 | lr 1.04e-03 | (3834.71 ms | 136722 tok/s) step 5404/76294 | train loss 3.547674 | norm 0.3160 | lr 1.04e-03 | (3802.16 ms | 137892 tok/s) step 5405/76294 | train loss 3.478662 | norm 0.2769 | lr 1.04e-03 | (3855.14 ms | 135997 tok/s) step 5406/76294 | train loss 3.533083 | norm 0.3325 | lr 1.04e-03 | (3804.37 ms | 137812 tok/s) step 5407/76294 | train loss 3.576994 | norm 0.2908 | lr 1.04e-03 | (3847.22 ms | 136277 tok/s) step 5408/76294 | train loss 3.475653 | norm 0.2498 | lr 1.04e-03 | (3823.24 ms | 137132 tok/s) step 5409/76294 | train loss 3.557536 | norm 0.2729 | lr 1.04e-03 | (3825.00 ms | 137069 tok/s) step 5410/76294 | train loss 3.584333 | norm 0.2564 | lr 1.04e-03 | (3801.68 ms | 137910 tok/s) step 5411/76294 | train loss 3.500465 | norm 0.2720 | lr 1.04e-03 | (3838.82 ms | 136575 tok/s) step 5412/76294 | train loss 3.567328 | norm 0.2364 | lr 1.04e-03 | (3804.65 ms | 137802 tok/s) step 5413/76294 | train loss 3.533583 | norm 0.2316 | lr 1.04e-03 | (3859.80 ms | 135833 tok/s) step 5414/76294 | train loss 3.537851 | norm 0.2355 | lr 1.04e-03 | (3804.13 ms | 137821 tok/s) step 5415/76294 | train loss 3.469748 | norm 0.2703 | lr 1.04e-03 | (3811.41 ms | 137558 tok/s) step 5416/76294 | train loss 3.555942 | norm 0.2642 | lr 1.04e-03 | (3945.99 ms | 132866 tok/s) step 5417/76294 | train loss 3.596750 | norm 0.2464 | lr 1.04e-03 | (3807.67 ms | 137692 tok/s) step 5418/76294 | train loss 3.525074 | norm 0.1875 | lr 1.04e-03 | (3919.50 ms | 133764 tok/s) step 5419/76294 | train loss 3.568922 | norm 0.2215 | lr 1.04e-03 | (3906.57 ms | 134207 tok/s) step 5420/76294 | train loss 3.591353 | norm 0.2211 | lr 1.04e-03 | (3801.70 ms | 137909 tok/s) step 5421/76294 | train loss 3.502733 | norm 0.2416 | lr 1.04e-03 | (3830.31 ms | 136879 tok/s) step 5422/76294 | train loss 3.587303 | norm 0.2510 | lr 1.04e-03 | (3827.17 ms | 136991 tok/s) step 5423/76294 | train loss 3.558263 | norm 
0.2611 | lr 1.04e-03 | (3810.26 ms | 137599 tok/s) step 5424/76294 | train loss 3.522663 | norm 0.2464 | lr 1.04e-03 | (3808.92 ms | 137647 tok/s) step 5425/76294 | train loss 3.542594 | norm 0.2414 | lr 1.04e-03 | (3801.66 ms | 137910 tok/s) step 5426/76294 | train loss 3.497576 | norm 0.2447 | lr 1.04e-03 | (3813.12 ms | 137496 tok/s) step 5427/76294 | train loss 3.582625 | norm 0.1993 | lr 1.04e-03 | (3833.20 ms | 136775 tok/s) step 5428/76294 | train loss 3.502066 | norm 0.1971 | lr 1.04e-03 | (3858.99 ms | 135861 tok/s) step 5429/76294 | train loss 3.474836 | norm 0.2332 | lr 1.04e-03 | (3810.02 ms | 137608 tok/s) step 5430/76294 | train loss 3.527053 | norm 0.2147 | lr 1.04e-03 | (3807.42 ms | 137702 tok/s) step 5431/76294 | train loss 3.531429 | norm 0.1785 | lr 1.04e-03 | (3811.20 ms | 137565 tok/s) step 5432/76294 | train loss 3.594306 | norm 0.2054 | lr 1.04e-03 | (3825.14 ms | 137064 tok/s) step 5433/76294 | train loss 3.555210 | norm 0.2167 | lr 1.04e-03 | (3808.47 ms | 137664 tok/s) step 5434/76294 | train loss 3.597713 | norm 0.2341 | lr 1.04e-03 | (3802.15 ms | 137892 tok/s) step 5435/76294 | train loss 3.589953 | norm 0.1927 | lr 1.04e-03 | (3882.57 ms | 135037 tok/s) step 5436/76294 | train loss 3.461446 | norm 0.1943 | lr 1.04e-03 | (3806.94 ms | 137719 tok/s) step 5437/76294 | train loss 3.543815 | norm 0.2166 | lr 1.04e-03 | (3818.99 ms | 137284 tok/s) step 5438/76294 | train loss 3.572773 | norm 0.1948 | lr 1.04e-03 | (3802.72 ms | 137872 tok/s) step 5439/76294 | train loss 3.563106 | norm 0.2372 | lr 1.04e-03 | (3814.26 ms | 137455 tok/s) step 5440/76294 | train loss 3.437520 | norm 0.2547 | lr 1.04e-03 | (3875.20 ms | 135293 tok/s) step 5441/76294 | train loss 3.513060 | norm 0.2094 | lr 1.04e-03 | (3807.77 ms | 137689 tok/s) step 5442/76294 | train loss 3.517046 | norm 0.2317 | lr 1.04e-03 | (3803.14 ms | 137857 tok/s) step 5443/76294 | train loss 3.502629 | norm 0.2790 | lr 1.04e-03 | (3839.50 ms | 136551 tok/s) step 5444/76294 | train loss 
3.553807 | norm 0.2023 | lr 1.04e-03 | (3806.39 ms | 137739 tok/s) step 5445/76294 | train loss 3.552003 | norm 0.2772 | lr 1.04e-03 | (3919.19 ms | 133775 tok/s) step 5446/76294 | train loss 3.566535 | norm 0.2235 | lr 1.04e-03 | (3804.07 ms | 137823 tok/s) step 5447/76294 | train loss 3.517007 | norm 0.2130 | lr 1.04e-03 | (3809.97 ms | 137609 tok/s) step 5448/76294 | train loss 3.523195 | norm 0.2317 | lr 1.04e-03 | (3828.06 ms | 136959 tok/s) step 5449/76294 | train loss 3.600461 | norm 0.1948 | lr 1.04e-03 | (3950.98 ms | 132698 tok/s) step 5450/76294 | train loss 3.547523 | norm 0.1853 | lr 1.04e-03 | (3803.34 ms | 137849 tok/s) step 5451/76294 | train loss 3.534627 | norm 0.2265 | lr 1.04e-03 | (3822.68 ms | 137152 tok/s) step 5452/76294 | train loss 3.701251 | norm 0.2703 | lr 1.04e-03 | (3803.56 ms | 137842 tok/s) step 5453/76294 | train loss 3.552230 | norm 0.3057 | lr 1.04e-03 | (3806.65 ms | 137730 tok/s) step 5454/76294 | train loss 3.576540 | norm 0.2187 | lr 1.04e-03 | (3829.14 ms | 136920 tok/s) step 5455/76294 | train loss 3.594064 | norm 0.2590 | lr 1.04e-03 | (3820.63 ms | 137226 tok/s) step 5456/76294 | train loss 3.465902 | norm 0.2044 | lr 1.04e-03 | (3805.50 ms | 137771 tok/s) step 5457/76294 | train loss 3.634812 | norm 0.2437 | lr 1.04e-03 | (3837.54 ms | 136621 tok/s) step 5458/76294 | train loss 3.460307 | norm 0.2457 | lr 1.04e-03 | (3805.16 ms | 137784 tok/s) step 5459/76294 | train loss 3.575511 | norm 0.2206 | lr 1.04e-03 | (3835.34 ms | 136699 tok/s) step 5460/76294 | train loss 3.474773 | norm 0.2182 | lr 1.04e-03 | (3813.47 ms | 137483 tok/s) step 5461/76294 | train loss 3.579575 | norm 0.1856 | lr 1.04e-03 | (3807.55 ms | 137697 tok/s) step 5462/76294 | train loss 3.528381 | norm 0.2386 | lr 1.04e-03 | (3828.44 ms | 136946 tok/s) step 5463/76294 | train loss 3.624384 | norm 0.2021 | lr 1.04e-03 | (3930.39 ms | 133393 tok/s) step 5464/76294 | train loss 3.550470 | norm 0.1995 | lr 1.04e-03 | (3850.89 ms | 136147 tok/s) step 
5465/76294 | train loss 3.516115 | norm 0.1880 | lr 1.04e-03 | (3805.05 ms | 137787 tok/s) step 5466/76294 | train loss 3.582886 | norm 0.1748 | lr 1.04e-03 | (3807.79 ms | 137688 tok/s) step 5467/76294 | train loss 3.512988 | norm 0.1816 | lr 1.04e-03 | (3845.78 ms | 136328 tok/s) step 5468/76294 | train loss 3.552162 | norm 0.2013 | lr 1.04e-03 | (3804.90 ms | 137793 tok/s) step 5469/76294 | train loss 3.534702 | norm 0.2528 | lr 1.04e-03 | (3890.64 ms | 134756 tok/s) step 5470/76294 | train loss 3.533986 | norm 0.2350 | lr 1.04e-03 | (3799.86 ms | 137976 tok/s) step 5471/76294 | train loss 3.533418 | norm 0.2022 | lr 1.04e-03 | (3810.07 ms | 137606 tok/s) step 5472/76294 | train loss 3.581792 | norm 0.3140 | lr 1.04e-03 | (3831.65 ms | 136831 tok/s) step 5473/76294 | train loss 3.475020 | norm 0.2771 | lr 1.04e-03 | (3806.04 ms | 137752 tok/s) step 5474/76294 | train loss 3.576550 | norm 0.2061 | lr 1.04e-03 | (3803.21 ms | 137854 tok/s) step 5475/76294 | train loss 3.523586 | norm 0.2357 | lr 1.04e-03 | (3881.08 ms | 135088 tok/s) step 5476/76294 | train loss 3.485224 | norm 0.2078 | lr 1.04e-03 | (3855.02 ms | 136001 tok/s) step 5477/76294 | train loss 3.593620 | norm 0.2415 | lr 1.04e-03 | (3826.55 ms | 137013 tok/s) step 5478/76294 | train loss 3.521607 | norm 0.1889 | lr 1.04e-03 | (3808.16 ms | 137675 tok/s) step 5479/76294 | train loss 3.538202 | norm 0.2253 | lr 1.04e-03 | (3803.20 ms | 137854 tok/s) step 5480/76294 | train loss 3.536829 | norm 0.2173 | lr 1.04e-03 | (3851.78 ms | 136116 tok/s) step 5481/76294 | train loss 3.471353 | norm 0.1928 | lr 1.04e-03 | (3801.67 ms | 137910 tok/s) step 5482/76294 | train loss 3.536631 | norm 0.1965 | lr 1.04e-03 | (3824.55 ms | 137085 tok/s) step 5483/76294 | train loss 3.604944 | norm 0.2302 | lr 1.04e-03 | (3824.42 ms | 137089 tok/s) step 5484/76294 | train loss 3.575051 | norm 0.2253 | lr 1.04e-03 | (3805.73 ms | 137763 tok/s) step 5485/76294 | train loss 3.570575 | norm 0.2322 | lr 1.04e-03 | (3801.03 ms | 
137933 tok/s) step 5486/76294 | train loss 3.510593 | norm 0.2458 | lr 1.04e-03 | (3858.95 ms | 135863 tok/s) step 5487/76294 | train loss 3.572064 | norm 0.2591 | lr 1.04e-03 | (3802.74 ms | 137871 tok/s) step 5488/76294 | train loss 3.497074 | norm 0.2317 | lr 1.04e-03 | (3835.29 ms | 136701 tok/s) step 5489/76294 | train loss 3.489170 | norm 0.1958 | lr 1.04e-03 | (3803.69 ms | 137837 tok/s) step 5490/76294 | train loss 3.457850 | norm 0.2251 | lr 1.04e-03 | (3824.30 ms | 137094 tok/s) step 5491/76294 | train loss 3.567722 | norm 0.2105 | lr 1.04e-03 | (3832.84 ms | 136788 tok/s) step 5492/76294 | train loss 3.514557 | norm 0.2062 | lr 1.04e-03 | (3811.40 ms | 137558 tok/s) step 5493/76294 | train loss 3.574034 | norm 0.2460 | lr 1.04e-03 | (3805.92 ms | 137756 tok/s) step 5494/76294 | train loss 3.463979 | norm 0.2057 | lr 1.04e-03 | (3831.18 ms | 136848 tok/s) step 5495/76294 | train loss 3.502221 | norm 0.2160 | lr 1.04e-03 | (3806.05 ms | 137751 tok/s) step 5496/76294 | train loss 3.617417 | norm 0.2291 | lr 1.04e-03 | (3817.85 ms | 137325 tok/s) step 5497/76294 | train loss 3.523458 | norm 0.2140 | lr 1.04e-03 | (3829.25 ms | 136917 tok/s) step 5498/76294 | train loss 3.643563 | norm 0.2151 | lr 1.04e-03 | (3808.70 ms | 137655 tok/s) step 5499/76294 | train loss 3.545946 | norm 0.1858 | lr 1.04e-03 | (3803.09 ms | 137858 tok/s) step 5500/76294 | train loss 3.521732 | norm 0.2034 | lr 1.04e-03 | (3831.91 ms | 136822 tok/s) val loss: 3.546535 saving model checkpoint to ./results/gpt2-124M-gqa/step_5500.pth step 5501/76294 | train loss 3.541370 | norm 0.2154 | lr 1.04e-03 | (3886.82 ms | 134889 tok/s) step 5502/76294 | train loss 3.552032 | norm 0.1696 | lr 1.04e-03 | (3773.29 ms | 138947 tok/s) step 5503/76294 | train loss 3.579228 | norm 0.2209 | lr 1.04e-03 | (3786.34 ms | 138468 tok/s) step 5504/76294 | train loss 3.483328 | norm 0.1884 | lr 1.04e-03 | (3809.62 ms | 137622 tok/s) step 5505/76294 | train loss 3.547220 | norm 0.1928 | lr 1.04e-03 | (3787.72 
ms | 138418 tok/s) step 5506/76294 | train loss 3.498197 | norm 0.1868 | lr 1.04e-03 | (3790.66 ms | 138311 tok/s) step 5507/76294 | train loss 3.555159 | norm 0.2105 | lr 1.04e-03 | (3843.59 ms | 136406 tok/s) step 5508/76294 | train loss 3.533332 | norm 0.2132 | lr 1.04e-03 | (3794.62 ms | 138166 tok/s) step 5509/76294 | train loss 3.496672 | norm 0.1865 | lr 1.04e-03 | (3821.57 ms | 137192 tok/s) step 5510/76294 | train loss 3.534671 | norm 0.1967 | lr 1.04e-03 | (3809.84 ms | 137614 tok/s) step 5511/76294 | train loss 3.480254 | norm 0.1902 | lr 1.04e-03 | (3800.02 ms | 137970 tok/s) step 5512/76294 | train loss 3.520475 | norm 0.2199 | lr 1.04e-03 | (3824.58 ms | 137084 tok/s) step 5513/76294 | train loss 3.564476 | norm 0.2069 | lr 1.04e-03 | (3804.41 ms | 137811 tok/s) step 5514/76294 | train loss 3.518724 | norm 0.2244 | lr 1.04e-03 | (3816.72 ms | 137366 tok/s) step 5515/76294 | train loss 3.529825 | norm 0.2228 | lr 1.04e-03 | (3813.69 ms | 137475 tok/s) step 5516/76294 | train loss 3.576755 | norm 0.2466 | lr 1.04e-03 | (3806.95 ms | 137719 tok/s) step 5517/76294 | train loss 3.529313 | norm 0.2070 | lr 1.04e-03 | (3843.95 ms | 136393 tok/s) step 5518/76294 | train loss 3.530717 | norm 0.1997 | lr 1.04e-03 | (3808.77 ms | 137653 tok/s) step 5519/76294 | train loss 3.605912 | norm 0.2219 | lr 1.04e-03 | (3814.86 ms | 137433 tok/s) step 5520/76294 | train loss 3.646317 | norm 0.1935 | lr 1.04e-03 | (3915.66 ms | 133895 tok/s) step 5521/76294 | train loss 3.487989 | norm 0.2024 | lr 1.04e-03 | (3812.84 ms | 137506 tok/s) step 5522/76294 | train loss 3.567383 | norm 0.2134 | lr 1.04e-03 | (3844.19 ms | 136385 tok/s) step 5523/76294 | train loss 3.604504 | norm 0.1929 | lr 1.04e-03 | (3817.17 ms | 137350 tok/s) step 5524/76294 | train loss 3.798605 | norm 0.2275 | lr 1.04e-03 | (3830.65 ms | 136867 tok/s) step 5525/76294 | train loss 3.510408 | norm 0.2371 | lr 1.04e-03 | (3837.52 ms | 136621 tok/s) step 5526/76294 | train loss 3.587159 | norm 0.2568 | lr 
1.04e-03 | (3814.42 ms | 137449 tok/s) step 5527/76294 | train loss 3.555804 | norm 0.2806 | lr 1.04e-03 | (3816.89 ms | 137360 tok/s) step 5528/76294 | train loss 3.521976 | norm 0.2134 | lr 1.04e-03 | (3847.99 ms | 136250 tok/s) step 5529/76294 | train loss 3.480826 | norm 0.2387 | lr 1.04e-03 | (3816.63 ms | 137369 tok/s) step 5530/76294 | train loss 3.592506 | norm 0.2350 | lr 1.04e-03 | (3827.44 ms | 136981 tok/s) step 5531/76294 | train loss 3.514625 | norm 0.2465 | lr 1.04e-03 | (3833.82 ms | 136753 tok/s) step 5532/76294 | train loss 3.552163 | norm 0.2424 | lr 1.04e-03 | (4289.83 ms | 122217 tok/s) step 5533/76294 | train loss 3.591681 | norm 0.2327 | lr 1.04e-03 | (3837.69 ms | 136615 tok/s) step 5534/76294 | train loss 3.536257 | norm 0.2387 | lr 1.03e-03 | (3817.32 ms | 137344 tok/s) step 5535/76294 | train loss 3.561605 | norm 0.2184 | lr 1.03e-03 | (3810.99 ms | 137573 tok/s) step 5536/76294 | train loss 3.600010 | norm 0.2706 | lr 1.03e-03 | (3845.49 ms | 136338 tok/s) step 5537/76294 | train loss 3.571520 | norm 0.2319 | lr 1.03e-03 | (3810.30 ms | 137597 tok/s) step 5538/76294 | train loss 3.590075 | norm 0.3473 | lr 1.03e-03 | (3830.69 ms | 136865 tok/s) step 5539/76294 | train loss 3.539720 | norm 0.3176 | lr 1.03e-03 | (3831.26 ms | 136845 tok/s) step 5540/76294 | train loss 3.525046 | norm 0.1931 | lr 1.03e-03 | (3816.74 ms | 137365 tok/s) step 5541/76294 | train loss 3.604389 | norm 0.2675 | lr 1.03e-03 | (3889.40 ms | 134799 tok/s) step 5542/76294 | train loss 3.574367 | norm 0.2310 | lr 1.03e-03 | (3810.19 ms | 137602 tok/s) step 5543/76294 | train loss 3.559348 | norm 0.2586 | lr 1.03e-03 | (3906.94 ms | 134194 tok/s) step 5544/76294 | train loss 3.523705 | norm 0.2584 | lr 1.03e-03 | (3812.92 ms | 137503 tok/s) step 5545/76294 | train loss 3.570056 | norm 0.1948 | lr 1.03e-03 | (4103.05 ms | 127780 tok/s) step 5546/76294 | train loss 3.559501 | norm 0.2018 | lr 1.03e-03 | (3902.19 ms | 134357 tok/s) step 5547/76294 | train loss 3.562208 | 
norm 0.2078 | lr 1.03e-03 | (3803.22 ms | 137854 tok/s) step 5548/76294 | train loss 3.568332 | norm 0.2120 | lr 1.03e-03 | (3823.48 ms | 137123 tok/s) step 5549/76294 | train loss 3.584124 | norm 0.2177 | lr 1.03e-03 | (3804.70 ms | 137800 tok/s) step 5550/76294 | train loss 3.550306 | norm 0.2445 | lr 1.03e-03 | (3808.63 ms | 137658 tok/s) step 5551/76294 | train loss 3.535278 | norm 0.2144 | lr 1.03e-03 | (3827.68 ms | 136973 tok/s) step 5552/76294 | train loss 3.517642 | norm 0.2355 | lr 1.03e-03 | (3806.10 ms | 137749 tok/s) step 5553/76294 | train loss 3.543793 | norm 0.2571 | lr 1.03e-03 | (3800.68 ms | 137946 tok/s) step 5554/76294 | train loss 3.486715 | norm 0.2272 | lr 1.03e-03 | (3835.42 ms | 136696 tok/s) step 5555/76294 | train loss 3.560712 | norm 0.2358 | lr 1.03e-03 | (3811.09 ms | 137569 tok/s) step 5556/76294 | train loss 3.582436 | norm 0.2239 | lr 1.03e-03 | (3809.35 ms | 137632 tok/s) step 5557/76294 | train loss 3.560806 | norm 0.2115 | lr 1.03e-03 | (3827.99 ms | 136962 tok/s) step 5558/76294 | train loss 3.577001 | norm 0.2255 | lr 1.03e-03 | (3810.47 ms | 137592 tok/s) step 5559/76294 | train loss 3.626279 | norm 0.2083 | lr 1.03e-03 | (3804.24 ms | 137817 tok/s) step 5560/76294 | train loss 3.499659 | norm 0.2293 | lr 1.03e-03 | (3840.20 ms | 136526 tok/s) step 5561/76294 | train loss 3.541768 | norm 0.1905 | lr 1.03e-03 | (3803.28 ms | 137852 tok/s) step 5562/76294 | train loss 3.517374 | norm 0.2107 | lr 1.03e-03 | (3806.69 ms | 137728 tok/s) step 5563/76294 | train loss 3.530393 | norm 0.1975 | lr 1.03e-03 | (3886.26 ms | 134908 tok/s) step 5564/76294 | train loss 3.535304 | norm 0.2425 | lr 1.03e-03 | (3801.33 ms | 137922 tok/s) step 5565/76294 | train loss 3.549689 | norm 0.2235 | lr 1.03e-03 | (3851.99 ms | 136108 tok/s) step 5566/76294 | train loss 3.545777 | norm 0.2146 | lr 1.03e-03 | (3802.85 ms | 137867 tok/s) step 5567/76294 | train loss 3.529311 | norm 0.2213 | lr 1.03e-03 | (3803.56 ms | 137842 tok/s) step 5568/76294 | train 
loss 3.594073 | norm 0.2551 | lr 1.03e-03 | (3826.60 ms | 137012 tok/s) step 5569/76294 | train loss 3.550944 | norm 0.2201 | lr 1.03e-03 | (3888.23 ms | 134840 tok/s) step 5570/76294 | train loss 3.539831 | norm 0.2404 | lr 1.03e-03 | (3804.79 ms | 137797 tok/s) step 5571/76294 | train loss 3.580388 | norm 0.1982 | lr 1.03e-03 | (3907.39 ms | 134179 tok/s) step 5572/76294 | train loss 3.574595 | norm 0.2413 | lr 1.03e-03 | (3801.43 ms | 137918 tok/s) step 5573/76294 | train loss 3.543157 | norm 0.2054 | lr 1.03e-03 | (3806.28 ms | 137743 tok/s) step 5574/76294 | train loss 3.530093 | norm 0.2582 | lr 1.03e-03 | (3833.72 ms | 136757 tok/s) step 5575/76294 | train loss 3.566017 | norm 0.2381 | lr 1.03e-03 | (3803.40 ms | 137847 tok/s) step 5576/76294 | train loss 3.563090 | norm 0.1929 | lr 1.03e-03 | (3801.47 ms | 137917 tok/s) step 5577/76294 | train loss 3.537199 | norm 0.1960 | lr 1.03e-03 | (3898.87 ms | 134472 tok/s) step 5578/76294 | train loss 3.504828 | norm 0.2015 | lr 1.03e-03 | (3801.56 ms | 137914 tok/s) step 5579/76294 | train loss 3.552478 | norm 0.1776 | lr 1.03e-03 | (3806.13 ms | 137748 tok/s) step 5580/76294 | train loss 3.554313 | norm 0.1697 | lr 1.03e-03 | (3827.09 ms | 136994 tok/s) step 5581/76294 | train loss 3.618665 | norm 0.1959 | lr 1.03e-03 | (3802.73 ms | 137872 tok/s) step 5582/76294 | train loss 3.651118 | norm 0.2076 | lr 1.03e-03 | (3805.82 ms | 137760 tok/s) step 5583/76294 | train loss 3.514321 | norm 0.2274 | lr 1.03e-03 | (3801.38 ms | 137920 tok/s) step 5584/76294 | train loss 3.523889 | norm 0.2016 | lr 1.03e-03 | (3828.56 ms | 136941 tok/s) step 5585/76294 | train loss 3.521681 | norm 0.1963 | lr 1.03e-03 | (3825.77 ms | 137041 tok/s) step 5586/76294 | train loss 3.570818 | norm 0.1999 | lr 1.03e-03 | (3807.97 ms | 137682 tok/s) step 5587/76294 | train loss 3.558995 | norm 0.2181 | lr 1.03e-03 | (3797.06 ms | 138077 tok/s) step 5588/76294 | train loss 3.661953 | norm 0.2686 | lr 1.03e-03 | (3852.82 ms | 136079 tok/s) step 
5589/76294 | train loss 3.513567 | norm 0.2938 | lr 1.03e-03 | (3866.66 ms | 135592 tok/s) step 5590/76294 | train loss 3.571661 | norm 0.2034 | lr 1.03e-03 | (3797.67 ms | 138055 tok/s) step 5591/76294 | train loss 3.571088 | norm 0.2891 | lr 1.03e-03 | (3811.94 ms | 137538 tok/s) step 5592/76294 | train loss 3.565817 | norm 0.2619 | lr 1.03e-03 | (3886.25 ms | 134908 tok/s) step 5593/76294 | train loss 3.683120 | norm 0.2468 | lr 1.03e-03 | (3801.70 ms | 137909 tok/s) step 5594/76294 | train loss 3.573527 | norm 0.2535 | lr 1.03e-03 | (3814.66 ms | 137440 tok/s) step 5595/76294 | train loss 3.600629 | norm 0.2447 | lr 1.03e-03 | (3869.07 ms | 135507 tok/s) step 5596/76294 | train loss 3.580091 | norm 0.2541 | lr 1.03e-03 | (3813.13 ms | 137495 tok/s) step 5597/76294 | train loss 3.529149 | norm 0.2429 | lr 1.03e-03 | (3800.18 ms | 137964 tok/s) step 5598/76294 | train loss 3.532905 | norm 0.2119 | lr 1.03e-03 | (3832.84 ms | 136788 tok/s) step 5599/76294 | train loss 3.520151 | norm 0.2093 | lr 1.03e-03 | (3802.69 ms | 137873 tok/s) step 5600/76294 | train loss 3.572828 | norm 0.1917 | lr 1.03e-03 | (3870.21 ms | 135468 tok/s) step 5601/76294 | train loss 3.636592 | norm 0.2129 | lr 1.03e-03 | (3821.37 ms | 137199 tok/s) step 5602/76294 | train loss 3.529871 | norm 0.2054 | lr 1.03e-03 | (3805.46 ms | 137773 tok/s) step 5603/76294 | train loss 3.543747 | norm 0.2110 | lr 1.03e-03 | (3844.38 ms | 136378 tok/s) step 5604/76294 | train loss 3.603228 | norm 0.2371 | lr 1.03e-03 | (3807.42 ms | 137702 tok/s) step 5605/76294 | train loss 3.581285 | norm 0.2738 | lr 1.03e-03 | (3806.58 ms | 137732 tok/s) step 5606/76294 | train loss 3.608324 | norm 0.3374 | lr 1.03e-03 | (3841.40 ms | 136484 tok/s) step 5607/76294 | train loss 3.584830 | norm 0.2924 | lr 1.03e-03 | (3799.66 ms | 137983 tok/s) step 5608/76294 | train loss 3.536206 | norm 0.2062 | lr 1.03e-03 | (3807.39 ms | 137703 tok/s) step 5609/76294 | train loss 3.577367 | norm 0.2477 | lr 1.03e-03 | (3825.61 ms | 
137047 tok/s) step 5610/76294 | train loss 3.610397 | norm 0.2009 | lr 1.03e-03 | (3802.06 ms | 137896 tok/s) step 5611/76294 | train loss 3.565530 | norm 0.2045 | lr 1.03e-03 | (3801.90 ms | 137902 tok/s) step 5612/76294 | train loss 3.601789 | norm 0.2035 | lr 1.03e-03 | (3836.82 ms | 136646 tok/s) step 5613/76294 | train loss 3.526072 | norm 0.1915 | lr 1.03e-03 | (3816.80 ms | 137363 tok/s) step 5614/76294 | train loss 3.629941 | norm 0.2174 | lr 1.03e-03 | (3893.97 ms | 134641 tok/s) step 5615/76294 | train loss 3.569409 | norm 0.2463 | lr 1.03e-03 | (3800.78 ms | 137942 tok/s) step 5616/76294 | train loss 3.532010 | norm 0.1986 | lr 1.03e-03 | (3805.43 ms | 137774 tok/s) step 5617/76294 | train loss 3.553841 | norm 0.1974 | lr 1.03e-03 | (6514.38 ms | 80482 tok/s) step 5618/76294 | train loss 3.611510 | norm 0.2232 | lr 1.03e-03 | (4090.82 ms | 128162 tok/s) step 5619/76294 | train loss 3.456843 | norm 0.1936 | lr 1.03e-03 | (3822.03 ms | 137175 tok/s) step 5620/76294 | train loss 3.538775 | norm 0.2412 | lr 1.03e-03 | (3796.59 ms | 138094 tok/s) step 5621/76294 | train loss 3.581069 | norm 0.2113 | lr 1.03e-03 | (3799.06 ms | 138005 tok/s) step 5622/76294 | train loss 3.544385 | norm 0.2088 | lr 1.03e-03 | (3829.75 ms | 136899 tok/s) step 5623/76294 | train loss 3.574650 | norm 0.2341 | lr 1.03e-03 | (3799.49 ms | 137989 tok/s) step 5624/76294 | train loss 3.503522 | norm 0.2367 | lr 1.03e-03 | (3807.73 ms | 137691 tok/s) step 5625/76294 | train loss 3.555484 | norm 0.2245 | lr 1.03e-03 | (3798.53 ms | 138024 tok/s) step 5626/76294 | train loss 3.604997 | norm 0.2150 | lr 1.03e-03 | (3806.12 ms | 137749 tok/s) step 5627/76294 | train loss 3.470452 | norm 0.1966 | lr 1.03e-03 | (3821.29 ms | 137202 tok/s) step 5628/76294 | train loss 3.525613 | norm 0.2082 | lr 1.03e-03 | (3805.27 ms | 137779 tok/s) step 5629/76294 | train loss 3.511747 | norm 0.2132 | lr 1.03e-03 | (3800.36 ms | 137958 tok/s) step 5630/76294 | train loss 3.527960 | norm 0.2365 | lr 1.03e-03 
| (3846.01 ms | 136320 tok/s) step 5631/76294 | train loss 3.537475 | norm 0.2470 | lr 1.03e-03 | (3799.97 ms | 137972 tok/s) step 5632/76294 | train loss 3.505565 | norm 0.2109 | lr 1.03e-03 | (3854.44 ms | 136022 tok/s) step 5633/76294 | train loss 3.576763 | norm 0.1923 | lr 1.03e-03 | (3797.64 ms | 138056 tok/s) step 5634/76294 | train loss 3.572062 | norm 0.2271 | lr 1.03e-03 | (3908.31 ms | 134147 tok/s) step 5635/76294 | train loss 3.553865 | norm 0.2012 | lr 1.03e-03 | (3803.30 ms | 137851 tok/s) step 5636/76294 | train loss 3.552864 | norm 0.1807 | lr 1.03e-03 | (3826.34 ms | 137021 tok/s) step 5637/76294 | train loss 3.483453 | norm 0.2005 | lr 1.03e-03 | (3828.31 ms | 136950 tok/s) step 5638/76294 | train loss 3.578490 | norm 0.1885 | lr 1.03e-03 | (3813.08 ms | 137497 tok/s) step 5639/76294 | train loss 3.536973 | norm 0.1880 | lr 1.03e-03 | (3801.14 ms | 137929 tok/s) step 5640/76294 | train loss 3.550092 | norm 0.1898 | lr 1.03e-03 | (3846.98 ms | 136286 tok/s) step 5641/76294 | train loss 3.540981 | norm 0.2068 | lr 1.03e-03 | (3802.45 ms | 137882 tok/s) step 5642/76294 | train loss 3.514851 | norm 0.2141 | lr 1.03e-03 | (3834.76 ms | 136720 tok/s) step 5643/76294 | train loss 3.568195 | norm 0.1989 | lr 1.03e-03 | (3963.89 ms | 132266 tok/s) step 5644/76294 | train loss 3.788157 | norm 0.2385 | lr 1.03e-03 | (3817.35 ms | 137344 tok/s) step 5645/76294 | train loss 3.550245 | norm 0.2267 | lr 1.03e-03 | (3824.00 ms | 137105 tok/s) step 5646/76294 | train loss 3.571464 | norm 0.2407 | lr 1.03e-03 | (3805.67 ms | 137765 tok/s) step 5647/76294 | train loss 3.551786 | norm 0.1779 | lr 1.03e-03 | (3807.93 ms | 137683 tok/s) step 5648/76294 | train loss 3.632946 | norm 0.2212 | lr 1.03e-03 | (3829.57 ms | 136905 tok/s) step 5649/76294 | train loss 3.579611 | norm 0.1889 | lr 1.03e-03 | (3811.33 ms | 137560 tok/s) step 5650/76294 | train loss 3.546952 | norm 0.2231 | lr 1.03e-03 | (3804.66 ms | 137802 tok/s) step 5651/76294 | train loss 3.502048 | norm 
0.2639 | lr 1.03e-03 | (3840.34 ms | 136521 tok/s) step 5652/76294 | train loss 3.514514 | norm 0.1976 | lr 1.03e-03 | (3806.44 ms | 137737 tok/s) step 5653/76294 | train loss 3.575405 | norm 0.1987 | lr 1.03e-03 | (3808.25 ms | 137672 tok/s) step 5654/76294 | train loss 3.542484 | norm 0.2128 | lr 1.03e-03 | (3834.51 ms | 136729 tok/s) step 5655/76294 | train loss 3.519700 | norm 0.1888 | lr 1.03e-03 | (3805.52 ms | 137770 tok/s) step 5656/76294 | train loss 3.582911 | norm 0.2287 | lr 1.03e-03 | (3826.85 ms | 137003 tok/s) step 5657/76294 | train loss 3.584467 | norm 0.2542 | lr 1.03e-03 | (3806.37 ms | 137740 tok/s) step 5658/76294 | train loss 3.562690 | norm 0.2381 | lr 1.03e-03 | (3823.27 ms | 137131 tok/s) step 5659/76294 | train loss 3.554216 | norm 0.2441 | lr 1.03e-03 | (3814.82 ms | 137435 tok/s) step 5660/76294 | train loss 3.505986 | norm 0.2145 | lr 1.03e-03 | (3816.51 ms | 137374 tok/s) step 5661/76294 | train loss 3.600215 | norm 0.2083 | lr 1.03e-03 | (3807.46 ms | 137700 tok/s) step 5662/76294 | train loss 3.552866 | norm 0.2409 | lr 1.03e-03 | (3809.24 ms | 137636 tok/s) step 5663/76294 | train loss 3.595725 | norm 0.2126 | lr 1.03e-03 | (3917.07 ms | 133847 tok/s) step 5664/76294 | train loss 3.581277 | norm 0.2069 | lr 1.03e-03 | (3801.62 ms | 137912 tok/s) step 5665/76294 | train loss 3.587023 | norm 0.2231 | lr 1.03e-03 | (4014.86 ms | 130587 tok/s) step 5666/76294 | train loss 3.540162 | norm 0.2049 | lr 1.03e-03 | (3828.11 ms | 136957 tok/s) step 5667/76294 | train loss 3.577393 | norm 0.1999 | lr 1.03e-03 | (3809.60 ms | 137623 tok/s) step 5668/76294 | train loss 3.615714 | norm 0.2034 | lr 1.03e-03 | (4049.40 ms | 129473 tok/s) step 5669/76294 | train loss 3.491329 | norm 0.2460 | lr 1.03e-03 | (3802.91 ms | 137865 tok/s) step 5670/76294 | train loss 3.586509 | norm 0.2296 | lr 1.03e-03 | (3810.16 ms | 137603 tok/s) step 5671/76294 | train loss 3.523825 | norm 0.1944 | lr 1.03e-03 | (3833.46 ms | 136766 tok/s) step 5672/76294 | train loss 
3.640178 | norm 0.1926 | lr 1.03e-03 | (3826.69 ms | 137008 tok/s) step 5673/76294 | train loss 3.481881 | norm 0.2183 | lr 1.03e-03 | (3804.14 ms | 137821 tok/s) step 5674/76294 | train loss 3.546756 | norm 0.2401 | lr 1.03e-03 | (3837.03 ms | 136639 tok/s) step 5675/76294 | train loss 3.550547 | norm 0.2151 | lr 1.03e-03 | (3804.22 ms | 137817 tok/s) step 5676/76294 | train loss 3.487263 | norm 0.2190 | lr 1.03e-03 | (3808.93 ms | 137647 tok/s) step 5677/76294 | train loss 3.671453 | norm 0.2114 | lr 1.03e-03 | (3832.89 ms | 136787 tok/s) step 5678/76294 | train loss 3.567927 | norm 0.1909 | lr 1.03e-03 | (3809.41 ms | 137630 tok/s) step 5679/76294 | train loss 3.544803 | norm 0.1979 | lr 1.03e-03 | (3805.30 ms | 137778 tok/s) step 5680/76294 | train loss 3.539901 | norm 0.1881 | lr 1.03e-03 | (3886.51 ms | 134899 tok/s) step 5681/76294 | train loss 3.463786 | norm 0.1798 | lr 1.03e-03 | (3804.67 ms | 137801 tok/s) step 5682/76294 | train loss 3.536779 | norm 0.1920 | lr 1.03e-03 | (3848.02 ms | 136249 tok/s) step 5683/76294 | train loss 3.547426 | norm 0.1986 | lr 1.03e-03 | (3808.41 ms | 137666 tok/s) step 5684/76294 | train loss 3.544668 | norm 0.2026 | lr 1.03e-03 | (3826.64 ms | 137010 tok/s) step 5685/76294 | train loss 3.549193 | norm 0.2097 | lr 1.03e-03 | (3828.99 ms | 136926 tok/s) step 5686/76294 | train loss 3.488147 | norm 0.2157 | lr 1.03e-03 | (3810.51 ms | 137590 tok/s) step 5687/76294 | train loss 3.554723 | norm 0.2362 | lr 1.02e-03 | (3829.30 ms | 136915 tok/s) step 5688/76294 | train loss 3.529590 | norm 0.2532 | lr 1.02e-03 | (3914.68 ms | 133929 tok/s) step 5689/76294 | train loss 3.555314 | norm 0.2339 | lr 1.02e-03 | (3805.83 ms | 137759 tok/s) step 5690/76294 | train loss 3.522979 | norm 0.2413 | lr 1.02e-03 | (3811.96 ms | 137537 tok/s) step 5691/76294 | train loss 3.534622 | norm 0.2212 | lr 1.02e-03 | (3832.14 ms | 136814 tok/s) step 5692/76294 | train loss 3.536551 | norm 0.2086 | lr 1.02e-03 | (3807.01 ms | 137716 tok/s) step 
5693/76294 | train loss 3.567437 | norm 0.2083 | lr 1.02e-03 | (3801.53 ms | 137915 tok/s) step 5694/76294 | train loss 3.516526 | norm 0.2114 | lr 1.02e-03 | (3845.47 ms | 136339 tok/s) step 5695/76294 | train loss 3.506366 | norm 0.2123 | lr 1.02e-03 | (3807.11 ms | 137713 tok/s) step 5696/76294 | train loss 3.636434 | norm 0.2049 | lr 1.02e-03 | (3810.82 ms | 137579 tok/s) step 5697/76294 | train loss 3.566862 | norm 0.2239 | lr 1.02e-03 | (3830.49 ms | 136872 tok/s) step 5698/76294 | train loss 3.708835 | norm 0.2146 | lr 1.02e-03 | (3810.16 ms | 137603 tok/s) step 5699/76294 | train loss 3.557080 | norm 0.1935 | lr 1.02e-03 | (3807.37 ms | 137704 tok/s) step 5700/76294 | train loss 3.520531 | norm 0.2327 | lr 1.02e-03 | (3838.90 ms | 136572 tok/s) step 5701/76294 | train loss 3.537369 | norm 0.2089 | lr 1.02e-03 | (3807.23 ms | 137709 tok/s) step 5702/76294 | train loss 3.554680 | norm 0.2240 | lr 1.02e-03 | (3847.54 ms | 136266 tok/s) step 5703/76294 | train loss 3.533478 | norm 0.2274 | lr 1.02e-03 | (3806.73 ms | 137727 tok/s) step 5704/76294 | train loss 3.523688 | norm 0.1971 | lr 1.02e-03 | (3925.69 ms | 133553 tok/s) step 5705/76294 | train loss 3.523988 | norm 0.2043 | lr 1.02e-03 | (3832.95 ms | 136785 tok/s) step 5706/76294 | train loss 3.543461 | norm 0.1812 | lr 1.02e-03 | (3808.03 ms | 137680 tok/s) step 5707/76294 | train loss 3.550199 | norm 0.1996 | lr 1.02e-03 | (3801.93 ms | 137901 tok/s) step 5708/76294 | train loss 3.525285 | norm 0.1823 | lr 1.02e-03 | (3856.29 ms | 135956 tok/s) step 5709/76294 | train loss 3.505594 | norm 0.1798 | lr 1.02e-03 | (3811.12 ms | 137568 tok/s) step 5710/76294 | train loss 3.486004 | norm 0.2248 | lr 1.02e-03 | (3829.76 ms | 136898 tok/s) step 5711/76294 | train loss 3.573796 | norm 0.2065 | lr 1.02e-03 | (3806.42 ms | 137738 tok/s) step 5712/76294 | train loss 3.575342 | norm 0.2018 | lr 1.02e-03 | (3837.27 ms | 136631 tok/s) step 5713/76294 | train loss 3.535873 | norm 0.2030 | lr 1.02e-03 | (3929.93 ms | 
133409 tok/s) step 5714/76294 | train loss 3.468148 | norm 0.1838 | lr 1.02e-03 | (5059.60 ms | 103622 tok/s) step 5715/76294 | train loss 3.515716 | norm 0.1980 | lr 1.02e-03 | (5633.24 ms | 93070 tok/s) step 5716/76294 | train loss 3.584828 | norm 0.1762 | lr 1.02e-03 | (3794.42 ms | 138174 tok/s) step 5717/76294 | train loss 3.484544 | norm 0.1870 | lr 1.02e-03 | (3803.00 ms | 137862 tok/s) step 5718/76294 | train loss 3.502969 | norm 0.1959 | lr 1.02e-03 | (3853.89 ms | 136041 tok/s) step 5719/76294 | train loss 3.558271 | norm 0.1998 | lr 1.02e-03 | (3804.94 ms | 137791 tok/s) step 5720/76294 | train loss 3.557039 | norm 0.2356 | lr 1.02e-03 | (3798.34 ms | 138031 tok/s) step 5721/76294 | train loss 3.518759 | norm 0.2536 | lr 1.02e-03 | (3834.02 ms | 136746 tok/s) step 5722/76294 | train loss 3.565317 | norm 0.2585 | lr 1.02e-03 | (3803.79 ms | 137833 tok/s) step 5723/76294 | train loss 3.569057 | norm 0.2081 | lr 1.02e-03 | (4652.60 ms | 112687 tok/s) step 5724/76294 | train loss 3.548884 | norm 0.1972 | lr 1.02e-03 | (3848.65 ms | 136227 tok/s) step 5725/76294 | train loss 3.562060 | norm 0.2372 | lr 1.02e-03 | (3803.13 ms | 137857 tok/s) step 5726/76294 | train loss 3.513350 | norm 0.2170 | lr 1.02e-03 | (3825.11 ms | 137065 tok/s) step 5727/76294 | train loss 3.504362 | norm 0.2217 | lr 1.02e-03 | (3825.90 ms | 137037 tok/s) step 5728/76294 | train loss 3.532215 | norm 0.2014 | lr 1.02e-03 | (3804.26 ms | 137816 tok/s) step 5729/76294 | train loss 3.512714 | norm 0.2118 | lr 1.02e-03 | (3808.03 ms | 137680 tok/s) step 5730/76294 | train loss 3.533009 | norm 0.2075 | lr 1.02e-03 | (3830.91 ms | 136857 tok/s) step 5731/76294 | train loss 3.523257 | norm 0.1949 | lr 1.02e-03 | (3838.47 ms | 136588 tok/s) step 5732/76294 | train loss 3.595257 | norm 0.2206 | lr 1.02e-03 | (3857.60 ms | 135910 tok/s) step 5733/76294 | train loss 3.494105 | norm 0.1817 | lr 1.02e-03 | (4028.78 ms | 130136 tok/s) step 5734/76294 | train loss 3.582323 | norm 0.2207 | lr 1.02e-03 
| (3805.59 ms | 137768 tok/s) step 5735/76294 | train loss 3.483429 | norm 0.1905 | lr 1.02e-03 | (3998.06 ms | 131136 tok/s) step 5736/76294 | train loss 3.575600 | norm 0.2265 | lr 1.02e-03 | (3825.82 ms | 137039 tok/s) step 5737/76294 | train loss 3.607892 | norm 0.3403 | lr 1.02e-03 | (3808.68 ms | 137656 tok/s) step 5738/76294 | train loss 3.639942 | norm 0.3855 | lr 1.02e-03 | (3802.60 ms | 137876 tok/s) step 5739/76294 | train loss 3.612705 | norm 0.2638 | lr 1.02e-03 | (3836.15 ms | 136670 tok/s) step 5740/76294 | train loss 3.509273 | norm 0.2336 | lr 1.02e-03 | (3806.53 ms | 137734 tok/s) step 5741/76294 | train loss 3.535861 | norm 0.2662 | lr 1.02e-03 | (3809.62 ms | 137622 tok/s) step 5742/76294 | train loss 3.533696 | norm 0.2242 | lr 1.02e-03 | (3829.77 ms | 136898 tok/s) step 5743/76294 | train loss 3.600574 | norm 0.2341 | lr 1.02e-03 | (3814.64 ms | 137441 tok/s) step 5744/76294 | train loss 3.550626 | norm 0.1958 | lr 1.02e-03 | (3803.97 ms | 137827 tok/s) step 5745/76294 | train loss 3.523852 | norm 0.2055 | lr 1.02e-03 | (3835.12 ms | 136707 tok/s) step 5746/76294 | train loss 3.515994 | norm 0.1857 | lr 1.02e-03 | (3807.12 ms | 137713 tok/s) step 5747/76294 | train loss 3.577559 | norm 0.2146 | lr 1.02e-03 | (3813.51 ms | 137482 tok/s) step 5748/76294 | train loss 3.522324 | norm 0.2098 | lr 1.02e-03 | (3829.42 ms | 136911 tok/s) step 5749/76294 | train loss 3.572529 | norm 0.1940 | lr 1.02e-03 | (3811.60 ms | 137550 tok/s) step 5750/76294 | train loss 3.501004 | norm 0.1730 | lr 1.02e-03 | (3803.92 ms | 137828 tok/s) val loss: 3.532492 saving model checkpoint to ./results/gpt2-124M-gqa/step_5750.pth step 5751/76294 | train loss 3.558097 | norm 0.1961 | lr 1.02e-03 | (3856.81 ms | 135938 tok/s) step 5752/76294 | train loss 3.489658 | norm 0.1908 | lr 1.02e-03 | (3767.50 ms | 139161 tok/s) step 5753/76294 | train loss 3.491303 | norm 0.1797 | lr 1.02e-03 | (3790.33 ms | 138322 tok/s) step 5754/76294 | train loss 3.463639 | norm 0.1880 | lr 
1.02e-03 | (3807.74 ms | 137690 tok/s) step 5755/76294 | train loss 3.586570 | norm 0.2273 | lr 1.02e-03 | (3779.32 ms | 138725 tok/s) step 5756/76294 | train loss 3.522912 | norm 0.2842 | lr 1.02e-03 | (3787.97 ms | 138409 tok/s) step 5757/76294 | train loss 3.522519 | norm 0.2494 | lr 1.02e-03 | (3823.23 ms | 137132 tok/s) step 5758/76294 | train loss 3.450656 | norm 0.1973 | lr 1.02e-03 | (3790.24 ms | 138326 tok/s) step 5759/76294 | train loss 3.573120 | norm 0.3297 | lr 1.02e-03 | (3816.02 ms | 137391 tok/s) step 5760/76294 | train loss 3.655418 | norm 0.3027 | lr 1.02e-03 | (3795.65 ms | 138129 tok/s) step 5761/76294 | train loss 3.557285 | norm 0.2214 | lr 1.02e-03 | (3802.68 ms | 137873 tok/s) step 5762/76294 | train loss 3.561206 | norm 0.2298 | lr 1.02e-03 | (3827.63 ms | 136975 tok/s) step 5763/76294 | train loss 3.504389 | norm 0.1957 | lr 1.02e-03 | (3802.29 ms | 137887 tok/s) step 5764/76294 | train loss 3.554314 | norm 0.2082 | lr 1.02e-03 | (3800.71 ms | 137945 tok/s) step 5765/76294 | train loss 3.570339 | norm 0.2143 | lr 1.02e-03 | (3837.02 ms | 136640 tok/s) step 5766/76294 | train loss 3.509813 | norm 0.2123 | lr 1.02e-03 | (4142.91 ms | 126551 tok/s) step 5767/76294 | train loss 3.546175 | norm 0.2021 | lr 1.02e-03 | (3888.48 ms | 134831 tok/s) step 5768/76294 | train loss 3.628610 | norm 0.2135 | lr 1.02e-03 | (3802.69 ms | 137873 tok/s) step 5769/76294 | train loss 3.538341 | norm 0.2116 | lr 1.02e-03 | (3809.62 ms | 137622 tok/s) step 5770/76294 | train loss 3.557923 | norm 0.2209 | lr 1.02e-03 | (3940.51 ms | 133051 tok/s) step 5771/76294 | train loss 3.527559 | norm 0.1972 | lr 1.02e-03 | (3805.18 ms | 137783 tok/s) step 5772/76294 | train loss 3.508366 | norm 0.2083 | lr 1.02e-03 | (3977.85 ms | 131802 tok/s) step 5773/76294 | train loss 3.546306 | norm 0.1733 | lr 1.02e-03 | (3796.62 ms | 138093 tok/s) step 5774/76294 | train loss 3.534404 | norm 0.2073 | lr 1.02e-03 | (3801.89 ms | 137902 tok/s) step 5775/76294 | train loss 3.588680 | 
norm 0.1855 | lr 1.02e-03 | (3824.37 ms | 137091 tok/s) step 5776/76294 | train loss 3.589664 | norm 0.3180 | lr 1.02e-03 | (3976.64 ms | 131842 tok/s) step 5777/76294 | train loss 3.544703 | norm 0.2815 | lr 1.02e-03 | (3802.14 ms | 137893 tok/s) step 5778/76294 | train loss 3.530134 | norm 0.2454 | lr 1.02e-03 | (3843.40 ms | 136413 tok/s) step 5779/76294 | train loss 3.528100 | norm 0.2345 | lr 1.02e-03 | (3798.13 ms | 138039 tok/s) step 5780/76294 | train loss 3.626403 | norm 0.2293 | lr 1.02e-03 | (3804.92 ms | 137792 tok/s) step 5781/76294 | train loss 3.572063 | norm 0.2113 | lr 1.02e-03 | (3826.87 ms | 137002 tok/s) step 5782/76294 | train loss 3.554187 | norm 0.1879 | lr 1.02e-03 | (3807.86 ms | 137686 tok/s) step 5783/76294 | train loss 3.503211 | norm 0.1914 | lr 1.02e-03 | (3801.69 ms | 137909 tok/s) step 5784/76294 | train loss 3.521018 | norm 0.1903 | lr 1.02e-03 | (3843.07 ms | 136424 tok/s) step 5785/76294 | train loss 3.453568 | norm 0.2126 | lr 1.02e-03 | (3830.80 ms | 136861 tok/s) step 5786/76294 | train loss 3.583426 | norm 0.2183 | lr 1.02e-03 | (3810.09 ms | 137605 tok/s) step 5787/76294 | train loss 3.589361 | norm 0.1862 | lr 1.02e-03 | (3848.23 ms | 136241 tok/s) step 5788/76294 | train loss 3.538450 | norm 0.2362 | lr 1.02e-03 | (3813.66 ms | 137476 tok/s) step 5789/76294 | train loss 3.550620 | norm 0.2006 | lr 1.02e-03 | (3817.62 ms | 137334 tok/s) step 5790/76294 | train loss 3.666133 | norm 0.2071 | lr 1.02e-03 | (3838.32 ms | 136593 tok/s) step 5791/76294 | train loss 3.548722 | norm 0.2168 | lr 1.02e-03 | (3814.25 ms | 137455 tok/s) step 5792/76294 | train loss 3.538380 | norm 0.1804 | lr 1.02e-03 | (3821.05 ms | 137210 tok/s) step 5793/76294 | train loss 3.528529 | norm 0.2248 | lr 1.02e-03 | (3847.27 ms | 136275 tok/s) step 5794/76294 | train loss 3.507085 | norm 0.2193 | lr 1.02e-03 | (3816.56 ms | 137372 tok/s) step 5795/76294 | train loss 3.482755 | norm 0.1836 | lr 1.02e-03 | (3816.87 ms | 137361 tok/s) step 5796/76294 | train 
loss 3.648396 | norm 0.1836 | lr 1.02e-03 | (3921.93 ms | 133681 tok/s) step 5797/76294 | train loss 3.623603 | norm 0.2749 | lr 1.02e-03 | (3810.82 ms | 137579 tok/s) step 5798/76294 | train loss 3.583988 | norm 0.3501 | lr 1.02e-03 | (3821.82 ms | 137183 tok/s) step 5799/76294 | train loss 3.514184 | norm 0.2554 | lr 1.02e-03 | (3835.35 ms | 136699 tok/s) step 5800/76294 | train loss 3.547328 | norm 0.2364 | lr 1.02e-03 | (3815.82 ms | 137399 tok/s) step 5801/76294 | train loss 3.567470 | norm 0.2396 | lr 1.02e-03 | (3812.68 ms | 137512 tok/s) step 5802/76294 | train loss 3.528837 | norm 0.2338 | lr 1.02e-03 | (3877.20 ms | 135223 tok/s) step 5803/76294 | train loss 3.453591 | norm 0.2425 | lr 1.02e-03 | (3809.94 ms | 137611 tok/s) step 5804/76294 | train loss 3.483662 | norm 0.2471 | lr 1.02e-03 | (3812.80 ms | 137507 tok/s) step 5805/76294 | train loss 3.505673 | norm 0.2096 | lr 1.02e-03 | (3831.68 ms | 136830 tok/s) step 5806/76294 | train loss 3.564217 | norm 0.2205 | lr 1.02e-03 | (3819.19 ms | 137277 tok/s) step 5807/76294 | train loss 3.536177 | norm 0.2215 | lr 1.02e-03 | (3837.79 ms | 136612 tok/s) step 5808/76294 | train loss 3.557725 | norm 0.1934 | lr 1.02e-03 | (3815.86 ms | 137397 tok/s) step 5809/76294 | train loss 3.545624 | norm 0.2127 | lr 1.02e-03 | (3816.13 ms | 137387 tok/s) step 5810/76294 | train loss 3.534565 | norm 0.2217 | lr 1.02e-03 | (3850.30 ms | 136168 tok/s) step 5811/76294 | train loss 3.517496 | norm 0.2149 | lr 1.02e-03 | (3813.55 ms | 137480 tok/s) step 5812/76294 | train loss 3.569558 | norm 0.2032 | lr 1.02e-03 | (3856.94 ms | 135934 tok/s) step 5813/76294 | train loss 3.599107 | norm 0.2318 | lr 1.02e-03 | (3819.62 ms | 137262 tok/s) step 5814/76294 | train loss 3.533648 | norm 0.2004 | lr 1.02e-03 | (3810.47 ms | 137591 tok/s) step 5815/76294 | train loss 3.502118 | norm 0.2131 | lr 1.02e-03 | (3807.63 ms | 137694 tok/s) step 5816/76294 | train loss 3.557345 | norm 0.1913 | lr 1.02e-03 | (3887.30 ms | 134872 tok/s) step 
5817/76294 | train loss 3.536074 | norm 0.1926 | lr 1.02e-03 | (3800.32 ms | 137959 tok/s) step 5818/76294 | train loss 3.463729 | norm 0.1967 | lr 1.02e-03 | (3810.99 ms | 137573 tok/s) step 5819/76294 | train loss 3.507883 | norm 0.1955 | lr 1.02e-03 | (3825.80 ms | 137040 tok/s) step 5820/76294 | train loss 3.494590 | norm 0.2141 | lr 1.02e-03 | (3807.30 ms | 137706 tok/s) step 5821/76294 | train loss 3.508141 | norm 0.2160 | lr 1.02e-03 | (3818.32 ms | 137309 tok/s) step 5822/76294 | train loss 3.544274 | norm 0.1783 | lr 1.02e-03 | (3810.45 ms | 137592 tok/s) step 5823/76294 | train loss 3.525552 | norm 0.2246 | lr 1.02e-03 | (3807.15 ms | 137711 tok/s) step 5824/76294 | train loss 3.536073 | norm 0.1878 | lr 1.02e-03 | (3829.20 ms | 136918 tok/s) step 5825/76294 | train loss 3.580052 | norm 0.2165 | lr 1.02e-03 | (3805.64 ms | 137766 tok/s) step 5826/76294 | train loss 3.576253 | norm 0.1872 | lr 1.02e-03 | (3815.55 ms | 137408 tok/s) step 5827/76294 | train loss 3.532590 | norm 0.2055 | lr 1.02e-03 | (3828.25 ms | 136952 tok/s) step 5828/76294 | train loss 3.558646 | norm 0.2605 | lr 1.02e-03 | (3810.37 ms | 137595 tok/s) step 5829/76294 | train loss 3.569423 | norm 0.2366 | lr 1.02e-03 | (3807.61 ms | 137695 tok/s) step 5830/76294 | train loss 3.506197 | norm 0.2757 | lr 1.02e-03 | (3843.05 ms | 136425 tok/s) step 5831/76294 | train loss 3.553151 | norm 0.2494 | lr 1.02e-03 | (3803.54 ms | 137842 tok/s) step 5832/76294 | train loss 3.454800 | norm 0.3233 | lr 1.02e-03 | (3809.70 ms | 137619 tok/s) step 5833/76294 | train loss 3.618323 | norm 0.2333 | lr 1.02e-03 | (3911.22 ms | 134047 tok/s) step 5834/76294 | train loss 3.489505 | norm 0.2422 | lr 1.02e-03 | (3858.56 ms | 135877 tok/s) step 5835/76294 | train loss 3.528180 | norm 0.2555 | lr 1.02e-03 | (3807.53 ms | 137698 tok/s) step 5836/76294 | train loss 3.492161 | norm 0.2246 | lr 1.01e-03 | (3835.03 ms | 136710 tok/s) step 5837/76294 | train loss 3.568370 | norm 0.2027 | lr 1.01e-03 | (3803.83 ms | 
137832 tok/s) step 5838/76294 | train loss 3.555213 | norm 0.2310 | lr 1.01e-03 | (3808.29 ms | 137670 tok/s) step 5839/76294 | train loss 3.509323 | norm 0.2984 | lr 1.01e-03 | (3827.54 ms | 136978 tok/s) step 5840/76294 | train loss 3.532250 | norm 0.2779 | lr 1.01e-03 | (3875.61 ms | 135279 tok/s) step 5841/76294 | train loss 3.573319 | norm 0.2706 | lr 1.01e-03 | (3804.53 ms | 137806 tok/s) step 5842/76294 | train loss 3.502649 | norm 0.2498 | lr 1.01e-03 | (3806.99 ms | 137717 tok/s) step 5843/76294 | train loss 3.563904 | norm 0.1990 | lr 1.01e-03 | (3828.05 ms | 136959 tok/s) step 5844/76294 | train loss 3.555535 | norm 0.2512 | lr 1.01e-03 | (3812.39 ms | 137522 tok/s) step 5845/76294 | train loss 3.554789 | norm 0.2205 | lr 1.01e-03 | (3842.63 ms | 136440 tok/s) step 5846/76294 | train loss 3.530852 | norm 0.2184 | lr 1.01e-03 | (3832.67 ms | 136794 tok/s) step 5847/76294 | train loss 3.547272 | norm 0.2016 | lr 1.01e-03 | (3811.86 ms | 137541 tok/s) step 5848/76294 | train loss 3.520198 | norm 0.2060 | lr 1.01e-03 | (3801.85 ms | 137903 tok/s) step 5849/76294 | train loss 3.521572 | norm 0.1953 | lr 1.01e-03 | (3845.69 ms | 136331 tok/s) step 5850/76294 | train loss 3.576475 | norm 0.2064 | lr 1.01e-03 | (3807.99 ms | 137681 tok/s) step 5851/76294 | train loss 3.546240 | norm 0.2463 | lr 1.01e-03 | (3809.92 ms | 137611 tok/s) step 5852/76294 | train loss 3.518682 | norm 0.2133 | lr 1.01e-03 | (3803.92 ms | 137828 tok/s) step 5853/76294 | train loss 3.514432 | norm 0.2577 | lr 1.01e-03 | (3811.50 ms | 137554 tok/s) step 5854/76294 | train loss 3.530838 | norm 0.1981 | lr 1.01e-03 | (3826.81 ms | 137004 tok/s) step 5855/76294 | train loss 3.518596 | norm 0.2616 | lr 1.01e-03 | (3809.57 ms | 137624 tok/s) step 5856/76294 | train loss 3.458749 | norm 0.2862 | lr 1.01e-03 | (3805.31 ms | 137778 tok/s) step 5857/76294 | train loss 3.549650 | norm 0.2239 | lr 1.01e-03 | (3827.81 ms | 136968 tok/s) step 5858/76294 | train loss 3.535112 | norm 0.2369 | lr 1.01e-03 
| (3804.00 ms | 137826 tok/s) step 5859/76294 | train loss 3.534861 | norm 0.2391 | lr 1.01e-03 | (3820.02 ms | 137247 tok/s) step 5860/76294 | train loss 3.543354 | norm 0.2088 | lr 1.01e-03 | (3807.84 ms | 137686 tok/s) step 5861/76294 | train loss 3.530178 | norm 0.2095 | lr 1.01e-03 | (3826.85 ms | 137003 tok/s) step 5862/76294 | train loss 3.518044 | norm 0.1991 | lr 1.01e-03 | (3813.35 ms | 137488 tok/s) step 5863/76294 | train loss 3.506977 | norm 0.2020 | lr 1.01e-03 | (3808.48 ms | 137663 tok/s) step 5864/76294 | train loss 3.514170 | norm 0.1944 | lr 1.01e-03 | (3827.55 ms | 136977 tok/s) step 5865/76294 | train loss 3.633397 | norm 0.2493 | lr 1.01e-03 | (3880.56 ms | 135106 tok/s) step 5866/76294 | train loss 3.559375 | norm 0.2713 | lr 1.01e-03 | (3802.31 ms | 137887 tok/s) step 5867/76294 | train loss 3.537334 | norm 0.2349 | lr 1.01e-03 | (3832.48 ms | 136801 tok/s) step 5868/76294 | train loss 3.521796 | norm 0.2172 | lr 1.01e-03 | (3805.89 ms | 137757 tok/s) step 5869/76294 | train loss 3.505650 | norm 0.2038 | lr 1.01e-03 | (3812.83 ms | 137506 tok/s) step 5870/76294 | train loss 3.511184 | norm 0.2064 | lr 1.01e-03 | (3825.72 ms | 137043 tok/s) step 5871/76294 | train loss 3.545532 | norm 0.2480 | lr 1.01e-03 | (3808.80 ms | 137652 tok/s) step 5872/76294 | train loss 3.574218 | norm 0.2578 | lr 1.01e-03 | (3858.91 ms | 135864 tok/s) step 5873/76294 | train loss 3.558239 | norm 0.2378 | lr 1.01e-03 | (3806.04 ms | 137751 tok/s) step 5874/76294 | train loss 3.542755 | norm 0.1956 | lr 1.01e-03 | (3812.79 ms | 137508 tok/s) step 5875/76294 | train loss 3.521714 | norm 0.2708 | lr 1.01e-03 | (3821.35 ms | 137200 tok/s) step 5876/76294 | train loss 3.557004 | norm 0.2465 | lr 1.01e-03 | (3809.16 ms | 137639 tok/s) step 5877/76294 | train loss 3.445623 | norm 0.2197 | lr 1.01e-03 | (3806.31 ms | 137742 tok/s) step 5878/76294 | train loss 3.543511 | norm 0.2663 | lr 1.01e-03 | (3844.53 ms | 136373 tok/s) step 5879/76294 | train loss 3.514390 | norm 
0.2520 | lr 1.01e-03 | (3806.63 ms | 137730 tok/s) step 5880/76294 | train loss 3.518955 | norm 0.2215 | lr 1.01e-03 | (3814.96 ms | 137429 tok/s) step 5881/76294 | train loss 3.524794 | norm 0.2128 | lr 1.01e-03 | (3827.62 ms | 136975 tok/s) step 5882/76294 | train loss 3.527357 | norm 0.2305 | lr 1.01e-03 | (3807.09 ms | 137714 tok/s) step 5883/76294 | train loss 3.619898 | norm 0.2209 | lr 1.01e-03 | (3853.03 ms | 136072 tok/s) step 5884/76294 | train loss 3.552116 | norm 0.2224 | lr 1.01e-03 | (3803.02 ms | 137861 tok/s) step 5885/76294 | train loss 3.511280 | norm 0.2096 | lr 1.01e-03 | (3804.89 ms | 137793 tok/s) step 5886/76294 | train loss 3.581771 | norm 0.2047 | lr 1.01e-03 | (3849.11 ms | 136210 tok/s) step 5887/76294 | train loss 3.503586 | norm 0.2536 | lr 1.01e-03 | (3800.89 ms | 137938 tok/s) step 5888/76294 | train loss 3.636228 | norm 0.2548 | lr 1.01e-03 | (3811.55 ms | 137552 tok/s) step 5889/76294 | train loss 3.530700 | norm 0.1964 | lr 1.01e-03 | (3829.90 ms | 136894 tok/s) step 5890/76294 | train loss 3.530562 | norm 0.2373 | lr 1.01e-03 | (3809.35 ms | 137632 tok/s) step 5891/76294 | train loss 3.549411 | norm 0.2034 | lr 1.01e-03 | (3799.41 ms | 137992 tok/s) step 5892/76294 | train loss 3.526684 | norm 0.2041 | lr 1.01e-03 | (3906.87 ms | 134196 tok/s) step 5893/76294 | train loss 3.521045 | norm 0.2170 | lr 1.01e-03 | (3802.19 ms | 137891 tok/s) step 5894/76294 | train loss 3.540590 | norm 0.1897 | lr 1.01e-03 | (3808.81 ms | 137651 tok/s) step 5895/76294 | train loss 3.518208 | norm 0.2091 | lr 1.01e-03 | (3833.33 ms | 136771 tok/s) step 5896/76294 | train loss 3.511512 | norm 0.1747 | lr 1.01e-03 | (3809.08 ms | 137641 tok/s) step 5897/76294 | train loss 3.558012 | norm 0.2018 | lr 1.01e-03 | (3803.81 ms | 137832 tok/s) step 5898/76294 | train loss 3.503561 | norm 0.2160 | lr 1.01e-03 | (3897.71 ms | 134512 tok/s) step 5899/76294 | train loss 3.683152 | norm 0.1826 | lr 1.01e-03 | (3802.73 ms | 137871 tok/s) step 5900/76294 | train loss 
3.585776 | norm 0.1912 | lr 1.01e-03 | (3862.05 ms | 135754 tok/s) step 5901/76294 | train loss 3.468864 | norm 0.2131 | lr 1.01e-03 | (3804.71 ms | 137800 tok/s) step 5902/76294 | train loss 3.486448 | norm 0.1720 | lr 1.01e-03 | (3844.84 ms | 136361 tok/s) step 5903/76294 | train loss 3.530823 | norm 0.1874 | lr 1.01e-03 | (3809.87 ms | 137613 tok/s) step 5904/76294 | train loss 3.529406 | norm 0.1839 | lr 1.01e-03 | (3823.11 ms | 137137 tok/s) step 5905/76294 | train loss 3.536054 | norm 0.1883 | lr 1.01e-03 | (3804.44 ms | 137810 tok/s) step 5906/76294 | train loss 3.571383 | norm 0.1782 | lr 1.01e-03 | (3808.67 ms | 137657 tok/s) step 5907/76294 | train loss 3.551581 | norm 0.2097 | lr 1.01e-03 | (3829.20 ms | 136918 tok/s) step 5908/76294 | train loss 3.544129 | norm 0.2389 | lr 1.01e-03 | (3812.90 ms | 137504 tok/s) step 5909/76294 | train loss 3.482012 | norm 0.2673 | lr 1.01e-03 | (3810.08 ms | 137605 tok/s) step 5910/76294 | train loss 3.583503 | norm 0.2434 | lr 1.01e-03 | (3803.91 ms | 137829 tok/s) step 5911/76294 | train loss 3.552700 | norm 0.2661 | lr 1.01e-03 | (3802.14 ms | 137893 tok/s) step 5912/76294 | train loss 3.557076 | norm 0.2019 | lr 1.01e-03 | (3835.88 ms | 136680 tok/s) step 5913/76294 | train loss 3.510953 | norm 0.2772 | lr 1.01e-03 | (4064.34 ms | 128997 tok/s) step 5914/76294 | train loss 3.508498 | norm 0.2293 | lr 1.01e-03 | (3856.55 ms | 135947 tok/s) step 5915/76294 | train loss 3.604466 | norm 0.2134 | lr 1.01e-03 | (3805.24 ms | 137781 tok/s) step 5916/76294 | train loss 3.445425 | norm 0.2200 | lr 1.01e-03 | (3811.66 ms | 137548 tok/s) step 5917/76294 | train loss 3.606043 | norm 0.2639 | lr 1.01e-03 | (4579.55 ms | 114485 tok/s) step 5918/76294 | train loss 3.463490 | norm 0.2843 | lr 1.01e-03 | (3952.13 ms | 132660 tok/s) step 5919/76294 | train loss 3.553678 | norm 0.2341 | lr 1.01e-03 | (3821.33 ms | 137200 tok/s) step 5920/76294 | train loss 3.460229 | norm 0.2638 | lr 1.01e-03 | (3831.01 ms | 136854 tok/s) step 
5921/76294 | train loss 3.466274 | norm 0.2201 | lr 1.01e-03 | (3829.71 ms | 136900 tok/s) step 5922/76294 | train loss 3.575732 | norm 0.2356 | lr 1.01e-03 | (3804.68 ms | 137801 tok/s) step 5923/76294 | train loss 3.456669 | norm 0.2101 | lr 1.01e-03 | (3833.26 ms | 136773 tok/s) step 5924/76294 | train loss 3.534844 | norm 0.2184 | lr 1.01e-03 | (3802.24 ms | 137889 tok/s) step 5925/76294 | train loss 3.495184 | norm 0.1938 | lr 1.01e-03 | (3806.21 ms | 137745 tok/s) step 5926/76294 | train loss 3.590015 | norm 0.2052 | lr 1.01e-03 | (3838.63 ms | 136582 tok/s) step 5927/76294 | train loss 3.495998 | norm 0.2027 | lr 1.01e-03 | (3803.91 ms | 137829 tok/s) step 5928/76294 | train loss 3.493929 | norm 0.2685 | lr 1.01e-03 | (3872.78 ms | 135378 tok/s) step 5929/76294 | train loss 3.535257 | norm 0.2598 | lr 1.01e-03 | (3805.68 ms | 137765 tok/s) step 5930/76294 | train loss 3.498107 | norm 0.2250 | lr 1.01e-03 | (3808.34 ms | 137668 tok/s) step 5931/76294 | train loss 3.569738 | norm 0.1862 | lr 1.01e-03 | (3827.88 ms | 136966 tok/s) step 5932/76294 | train loss 3.483708 | norm 0.2331 | lr 1.01e-03 | (3807.97 ms | 137682 tok/s) step 5933/76294 | train loss 3.583194 | norm 0.2273 | lr 1.01e-03 | (3803.88 ms | 137830 tok/s) step 5934/76294 | train loss 3.520015 | norm 0.1950 | lr 1.01e-03 | (3830.10 ms | 136886 tok/s) step 5935/76294 | train loss 3.568429 | norm 0.2849 | lr 1.01e-03 | (3804.61 ms | 137803 tok/s) step 5936/76294 | train loss 3.539930 | norm 0.3033 | lr 1.01e-03 | (3810.85 ms | 137578 tok/s) step 5937/76294 | train loss 3.507124 | norm 0.2378 | lr 1.01e-03 | (3841.72 ms | 136472 tok/s) step 5938/76294 | train loss 3.516293 | norm 0.2352 | lr 1.01e-03 | (3809.45 ms | 137628 tok/s) step 5939/76294 | train loss 3.525792 | norm 0.2310 | lr 1.01e-03 | (3883.15 ms | 135016 tok/s) step 5940/76294 | train loss 3.834974 | norm 0.2498 | lr 1.01e-03 | (3802.14 ms | 137893 tok/s) step 5941/76294 | train loss 3.512516 | norm 0.3142 | lr 1.01e-03 | (3812.12 ms | 
137532 tok/s) step 5942/76294 | train loss 3.535050 | norm 0.2422 | lr 1.01e-03 | (3943.49 ms | 132950 tok/s) step 5943/76294 | train loss 3.565691 | norm 0.2531 | lr 1.01e-03 | (3809.23 ms | 137636 tok/s) step 5944/76294 | train loss 3.483155 | norm 0.2245 | lr 1.01e-03 | (3827.06 ms | 136995 tok/s) step 5945/76294 | train loss 3.537032 | norm 0.2188 | lr 1.01e-03 | (3811.34 ms | 137560 tok/s) step 5946/76294 | train loss 3.550291 | norm 0.2364 | lr 1.01e-03 | (3825.65 ms | 137045 tok/s) step 5947/76294 | train loss 3.549472 | norm 0.2053 | lr 1.01e-03 | (3807.12 ms | 137713 tok/s) step 5948/76294 | train loss 3.500964 | norm 0.2305 | lr 1.01e-03 | (3804.54 ms | 137806 tok/s) step 5949/76294 | train loss 3.522428 | norm 0.1982 | lr 1.01e-03 | (3855.85 ms | 135972 tok/s) step 5950/76294 | train loss 3.536013 | norm 0.1991 | lr 1.01e-03 | (3908.62 ms | 134136 tok/s) step 5951/76294 | train loss 3.533273 | norm 0.1923 | lr 1.01e-03 | (3833.49 ms | 136765 tok/s) step 5952/76294 | train loss 3.516054 | norm 0.2002 | lr 1.01e-03 | (3826.81 ms | 137004 tok/s) step 5953/76294 | train loss 3.478882 | norm 0.1930 | lr 1.01e-03 | (3855.78 ms | 135975 tok/s) step 5954/76294 | train loss 3.584294 | norm 0.2577 | lr 1.01e-03 | (3804.08 ms | 137823 tok/s) step 5955/76294 | train loss 3.497705 | norm 1.1716 | lr 1.01e-03 | (3832.55 ms | 136799 tok/s) step 5956/76294 | train loss 3.559402 | norm 0.2329 | lr 1.01e-03 | (3806.06 ms | 137751 tok/s) step 5957/76294 | train loss 3.479327 | norm 0.2052 | lr 1.01e-03 | (3804.87 ms | 137794 tok/s) step 5958/76294 | train loss 3.486023 | norm 0.2199 | lr 1.01e-03 | (3827.57 ms | 136977 tok/s) step 5959/76294 | train loss 3.536388 | norm 0.1793 | lr 1.01e-03 | (3806.64 ms | 137730 tok/s) step 5960/76294 | train loss 3.443545 | norm 0.2373 | lr 1.01e-03 | (3824.96 ms | 137070 tok/s) step 5961/76294 | train loss 3.597102 | norm 0.2248 | lr 1.01e-03 | (3838.93 ms | 136572 tok/s) step 5962/76294 | train loss 3.623052 | norm 0.1978 | lr 1.01e-03 
| (3800.69 ms | 137945 tok/s) step 5963/76294 | train loss 3.517997 | norm 0.2073 | lr 1.01e-03 | (3838.80 ms | 136576 tok/s) step 5964/76294 | train loss 3.540011 | norm 0.2295 | lr 1.01e-03 | (4705.95 ms | 111410 tok/s) step 5965/76294 | train loss 3.488832 | norm 0.2332 | lr 1.01e-03 | (3889.31 ms | 134802 tok/s) step 5966/76294 | train loss 3.483175 | norm 0.1992 | lr 1.01e-03 | (3804.83 ms | 137795 tok/s) step 5967/76294 | train loss 3.532003 | norm 0.2085 | lr 1.01e-03 | (3811.17 ms | 137566 tok/s) step 5968/76294 | train loss 3.572898 | norm 0.2000 | lr 1.01e-03 | (3826.89 ms | 137001 tok/s) step 5969/76294 | train loss 3.434112 | norm 0.1861 | lr 1.01e-03 | (3805.81 ms | 137760 tok/s) step 5970/76294 | train loss 3.529484 | norm 0.1997 | lr 1.01e-03 | (3804.79 ms | 137797 tok/s) step 5971/76294 | train loss 3.471958 | norm 0.2251 | lr 1.01e-03 | (3830.54 ms | 136871 tok/s) step 5972/76294 | train loss 3.531870 | norm 0.2640 | lr 1.01e-03 | (3804.50 ms | 137807 tok/s) step 5973/76294 | train loss 3.535462 | norm 0.2559 | lr 1.01e-03 | (3858.16 ms | 135891 tok/s) step 5974/76294 | train loss 3.512939 | norm 0.1946 | lr 1.01e-03 | (3819.80 ms | 137255 tok/s) step 5975/76294 | train loss 3.565466 | norm 0.2299 | lr 1.01e-03 | (3809.79 ms | 137616 tok/s) step 5976/76294 | train loss 3.497629 | norm 0.2244 | lr 1.01e-03 | (3860.78 ms | 135798 tok/s) step 5977/76294 | train loss 3.573876 | norm 0.2036 | lr 1.01e-03 | (3802.78 ms | 137870 tok/s) step 5978/76294 | train loss 3.537610 | norm 0.2259 | lr 1.01e-03 | (3801.29 ms | 137924 tok/s) step 5979/76294 | train loss 3.525538 | norm 0.2276 | lr 1.01e-03 | (3828.27 ms | 136952 tok/s) step 5980/76294 | train loss 3.464657 | norm 0.2246 | lr 1.01e-03 | (3804.27 ms | 137816 tok/s) step 5981/76294 | train loss 3.533114 | norm 0.2147 | lr 1.01e-03 | (3839.92 ms | 136536 tok/s) step 5982/76294 | train loss 3.519281 | norm 0.2050 | lr 1.00e-03 | (3826.15 ms | 137028 tok/s) step 5983/76294 | train loss 3.497492 | norm 
0.2032 | lr 1.00e-03 | (3812.62 ms | 137514 tok/s) step 5984/76294 | train loss 3.562051 | norm 0.1896 | lr 1.00e-03 | (3802.27 ms | 137888 tok/s) step 5985/76294 | train loss 3.585242 | norm 0.1738 | lr 1.00e-03 | (4166.25 ms | 125842 tok/s) step 5986/76294 | train loss 3.538969 | norm 0.2079 | lr 1.00e-03 | (3828.50 ms | 136943 tok/s) step 5987/76294 | train loss 3.423434 | norm 0.1805 | lr 1.00e-03 | (3806.12 ms | 137749 tok/s) step 5988/76294 | train loss 3.504570 | norm 0.1931 | lr 1.00e-03 | (3826.71 ms | 137007 tok/s) step 5989/76294 | train loss 3.510511 | norm 0.1978 | lr 1.00e-03 | (3804.11 ms | 137821 tok/s) step 5990/76294 | train loss 3.507492 | norm 0.2389 | lr 1.00e-03 | (3803.88 ms | 137830 tok/s) step 5991/76294 | train loss 3.431945 | norm 0.2750 | lr 1.00e-03 | (3800.42 ms | 137955 tok/s) step 5992/76294 | train loss 3.549305 | norm 0.2365 | lr 1.00e-03 | (3825.95 ms | 137035 tok/s) step 5993/76294 | train loss 3.547739 | norm 0.2421 | lr 1.00e-03 | (3805.58 ms | 137768 tok/s) step 5994/76294 | train loss 3.594171 | norm 0.2599 | lr 1.00e-03 | (3842.41 ms | 136448 tok/s) step 5995/76294 | train loss 3.495581 | norm 0.2354 | lr 1.00e-03 | (3831.05 ms | 136852 tok/s) step 5996/76294 | train loss 3.480378 | norm 0.2242 | lr 1.00e-03 | (3891.11 ms | 134740 tok/s) step 5997/76294 | train loss 3.576632 | norm 0.1881 | lr 1.00e-03 | (3803.51 ms | 137843 tok/s) step 5998/76294 | train loss 3.477914 | norm 0.2285 | lr 1.00e-03 | (5654.46 ms | 92721 tok/s) step 5999/76294 | train loss 3.555708 | norm 0.1762 | lr 1.00e-03 | (4138.66 ms | 126681 tok/s) step 6000/76294 | train loss 3.457760 | norm 0.1920 | lr 1.00e-03 | (3796.96 ms | 138081 tok/s) val loss: 3.524632 saving model checkpoint to ./results/gpt2-124M-gqa/step_6000.pth step 6001/76294 | train loss 3.568017 | norm 0.1901 | lr 1.00e-03 | (3862.74 ms | 135730 tok/s) step 6002/76294 | train loss 3.544830 | norm 0.2262 | lr 1.00e-03 | (3753.71 ms | 139672 tok/s) step 6003/76294 | train loss 3.495165 | 
norm 0.2327 | lr 1.00e-03 | (3791.94 ms | 138264 tok/s) step 6004/76294 | train loss 3.484313 | norm 0.1796 | lr 1.00e-03 | (3767.85 ms | 139148 tok/s) step 6005/76294 | train loss 3.509860 | norm 0.1934 | lr 1.00e-03 | (3769.69 ms | 139080 tok/s) step 6006/76294 | train loss 3.546514 | norm 0.2113 | lr 1.00e-03 | (3791.65 ms | 138274 tok/s) step 6007/76294 | train loss 3.469071 | norm 0.2136 | lr 1.00e-03 | (3875.79 ms | 135272 tok/s) step 6008/76294 | train loss 3.557198 | norm 0.2173 | lr 1.00e-03 | (3774.66 ms | 138897 tok/s) step 6009/76294 | train loss 3.534136 | norm 0.2551 | lr 1.00e-03 | (3781.05 ms | 138662 tok/s) step 6010/76294 | train loss 3.539104 | norm 0.2615 | lr 1.00e-03 | (3835.56 ms | 136691 tok/s) step 6011/76294 | train loss 3.676907 | norm 0.2242 | lr 1.00e-03 | (3780.19 ms | 138693 tok/s) step 6012/76294 | train loss 3.537457 | norm 0.2062 | lr 1.00e-03 | (3787.07 ms | 138441 tok/s) step 6013/76294 | train loss 3.494357 | norm 0.2169 | lr 1.00e-03 | (3812.41 ms | 137521 tok/s) step 6014/76294 | train loss 3.551353 | norm 0.1750 | lr 1.00e-03 | (3790.66 ms | 138310 tok/s) step 6015/76294 | train loss 3.482885 | norm 0.2309 | lr 1.00e-03 | (3791.51 ms | 138280 tok/s) step 6016/76294 | train loss 3.594332 | norm 0.2683 | lr 1.00e-03 | (3842.91 ms | 136430 tok/s) step 6017/76294 | train loss 3.519523 | norm 0.2404 | lr 1.00e-03 | (3790.70 ms | 138309 tok/s) step 6018/76294 | train loss 3.515394 | norm 0.2031 | lr 1.00e-03 | (3844.62 ms | 136369 tok/s) step 6019/76294 | train loss 3.595814 | norm 0.2271 | lr 1.00e-03 | (3830.51 ms | 136871 tok/s) step 6020/76294 | train loss 3.489313 | norm 0.2486 | lr 1.00e-03 | (3802.38 ms | 137884 tok/s) step 6021/76294 | train loss 3.541209 | norm 0.2243 | lr 1.00e-03 | (3791.96 ms | 138263 tok/s) step 6022/76294 | train loss 3.497039 | norm 0.2041 | lr 1.00e-03 | (3831.28 ms | 136844 tok/s) step 6023/76294 | train loss 3.575246 | norm 0.2212 | lr 1.00e-03 | (3794.27 ms | 138179 tok/s) step 6024/76294 | train 
loss 3.484904 | norm 0.1937 | lr 1.00e-03 | (3807.94 ms | 137683 tok/s) step 6025/76294 | train loss 3.584146 | norm 0.2173 | lr 1.00e-03 | (3818.43 ms | 137305 tok/s) step 6026/76294 | train loss 3.563527 | norm 0.2073 | lr 1.00e-03 | (3803.07 ms | 137859 tok/s) step 6027/76294 | train loss 3.507561 | norm 0.1908 | lr 1.00e-03 | (3797.52 ms | 138061 tok/s) step 6028/76294 | train loss 3.526120 | norm 0.2025 | lr 1.00e-03 | (3829.38 ms | 136912 tok/s) step 6029/76294 | train loss 3.529655 | norm 0.1965 | lr 1.00e-03 | (3874.69 ms | 135311 tok/s) step 6030/76294 | train loss 3.601510 | norm 0.2062 | lr 1.00e-03 | (3799.34 ms | 137995 tok/s) step 6031/76294 | train loss 3.478168 | norm 0.2302 | lr 1.00e-03 | (3804.19 ms | 137819 tok/s) step 6032/76294 | train loss 3.523808 | norm 0.1872 | lr 1.00e-03 | (3821.81 ms | 137183 tok/s) step 6033/76294 | train loss 3.466640 | norm 0.2329 | lr 1.00e-03 | (3808.34 ms | 137668 tok/s) step 6034/76294 | train loss 3.508864 | norm 0.1814 | lr 1.00e-03 | (3823.60 ms | 137119 tok/s) step 6035/76294 | train loss 3.520962 | norm 0.1964 | lr 1.00e-03 | (3809.20 ms | 137637 tok/s) step 6036/76294 | train loss 3.523548 | norm 0.2071 | lr 1.00e-03 | (3799.60 ms | 137985 tok/s) step 6037/76294 | train loss 3.440686 | norm 0.2205 | lr 1.00e-03 | (3833.80 ms | 136754 tok/s) step 6038/76294 | train loss 3.535380 | norm 0.2154 | lr 1.00e-03 | (3805.39 ms | 137775 tok/s) step 6039/76294 | train loss 3.531875 | norm 0.1796 | lr 1.00e-03 | (3825.36 ms | 137056 tok/s) step 6040/76294 | train loss 3.497551 | norm 0.2147 | lr 1.00e-03 | (3802.09 ms | 137895 tok/s) step 6041/76294 | train loss 3.507072 | norm 0.2176 | lr 1.00e-03 | (3816.91 ms | 137359 tok/s) step 6042/76294 | train loss 3.523730 | norm 0.2326 | lr 1.00e-03 | (3887.15 ms | 134877 tok/s) step 6043/76294 | train loss 3.512641 | norm 0.2061 | lr 1.00e-03 | (3801.59 ms | 137913 tok/s) step 6044/76294 | train loss 3.534441 | norm 0.2083 | lr 1.00e-03 | (3857.89 ms | 135900 tok/s) step 
6045/76294 | train loss 3.521019 | norm 0.2236 | lr 1.00e-03 | (3806.73 ms | 137727 tok/s) step 6046/76294 | train loss 3.551837 | norm 0.2330 | lr 1.00e-03 | (3832.67 ms | 136794 tok/s) step 6047/76294 | train loss 3.541799 | norm 0.2570 | lr 1.00e-03 | (3808.07 ms | 137678 tok/s) step 6048/76294 | train loss 3.527227 | norm 0.1913 | lr 1.00e-03 | (3808.66 ms | 137657 tok/s) step 6049/76294 | train loss 3.514355 | norm 0.2298 | lr 1.00e-03 | (3828.08 ms | 136958 tok/s) step 6050/76294 | train loss 3.501020 | norm 0.2435 | lr 1.00e-03 | (3805.56 ms | 137769 tok/s) step 6051/76294 | train loss 3.618965 | norm 0.1999 | lr 1.00e-03 | (3828.23 ms | 136953 tok/s) step 6052/76294 | train loss 3.526481 | norm 0.2551 | lr 1.00e-03 | (3808.84 ms | 137650 tok/s) step 6053/76294 | train loss 3.525901 | norm 0.2415 | lr 1.00e-03 | (3806.90 ms | 137720 tok/s) step 6054/76294 | train loss 3.459614 | norm 0.2020 | lr 1.00e-03 | (3816.92 ms | 137359 tok/s) step 6055/76294 | train loss 3.561208 | norm 0.2572 | lr 1.00e-03 | (3828.19 ms | 136955 tok/s) step 6056/76294 | train loss 3.489485 | norm 0.2209 | lr 1.00e-03 | (3810.70 ms | 137583 tok/s) step 6057/76294 | train loss 3.558366 | norm 0.2227 | lr 1.00e-03 | (3805.01 ms | 137789 tok/s) step 6058/76294 | train loss 3.550656 | norm 0.2577 | lr 1.00e-03 | (3838.35 ms | 136592 tok/s) step 6059/76294 | train loss 3.500149 | norm 0.2034 | lr 1.00e-03 | (3805.61 ms | 137767 tok/s) step 6060/76294 | train loss 3.462405 | norm 0.2582 | lr 1.00e-03 | (3843.01 ms | 136426 tok/s) step 6061/76294 | train loss 3.483578 | norm 0.1956 | lr 9.99e-04 | (3806.22 ms | 137745 tok/s) step 6062/76294 | train loss 3.606268 | norm 0.2521 | lr 9.99e-04 | (3824.57 ms | 137084 tok/s) step 6063/76294 | train loss 3.472842 | norm 0.1868 | lr 9.99e-04 | (3823.79 ms | 137112 tok/s) step 6064/76294 | train loss 3.589561 | norm 0.2233 | lr 9.99e-04 | (3813.25 ms | 137491 tok/s) step 6065/76294 | train loss 3.487425 | norm 0.1956 | lr 9.99e-04 | (3805.04 ms | 
137788 tok/s) step 6066/76294 | train loss 3.505126 | norm 0.2403 | lr 9.99e-04 | (3836.79 ms | 136648 tok/s) step 6067/76294 | train loss 3.476603 | norm 0.1922 | lr 9.99e-04 | (3805.30 ms | 137778 tok/s) step 6068/76294 | train loss 3.516715 | norm 0.1981 | lr 9.99e-04 | (6205.50 ms | 84488 tok/s) step 6069/76294 | train loss 3.510287 | norm 0.2083 | lr 9.99e-04 | (3806.70 ms | 137728 tok/s) step 6070/76294 | train loss 3.573452 | norm 0.2614 | lr 9.99e-04 | (3804.61 ms | 137803 tok/s) step 6071/76294 | train loss 3.544353 | norm 0.2037 | lr 9.99e-04 | (3828.12 ms | 136957 tok/s) step 6072/76294 | train loss 3.513520 | norm 0.2044 | lr 9.99e-04 | (3806.38 ms | 137739 tok/s) step 6073/76294 | train loss 3.532206 | norm 0.1921 | lr 9.99e-04 | (3803.88 ms | 137830 tok/s) step 6074/76294 | train loss 3.480315 | norm 0.2041 | lr 9.99e-04 | (3867.36 ms | 135567 tok/s) step 6075/76294 | train loss 3.561019 | norm 0.2043 | lr 9.98e-04 | (3801.82 ms | 137904 tok/s) step 6076/76294 | train loss 3.478500 | norm 0.2231 | lr 9.98e-04 | (3809.16 ms | 137639 tok/s) step 6077/76294 | train loss 3.571701 | norm 0.2266 | lr 9.98e-04 | (3824.06 ms | 137103 tok/s) step 6078/76294 | train loss 3.495954 | norm 0.2099 | lr 9.98e-04 | (3807.76 ms | 137689 tok/s) step 6079/76294 | train loss 3.530559 | norm 0.2820 | lr 9.98e-04 | (3803.87 ms | 137830 tok/s) step 6080/76294 | train loss 3.501894 | norm 0.2835 | lr 9.98e-04 | (3902.59 ms | 134344 tok/s) step 6081/76294 | train loss 3.575373 | norm 0.1972 | lr 9.98e-04 | (3808.15 ms | 137675 tok/s) step 6082/76294 | train loss 3.554003 | norm 0.3306 | lr 9.98e-04 | (3811.78 ms | 137544 tok/s) step 6083/76294 | train loss 3.602649 | norm 0.2336 | lr 9.98e-04 | (3804.04 ms | 137824 tok/s) step 6084/76294 | train loss 3.535553 | norm 0.2289 | lr 9.98e-04 | (3812.62 ms | 137514 tok/s) step 6085/76294 | train loss 3.511690 | norm 0.2407 | lr 9.98e-04 | (3829.29 ms | 136915 tok/s) step 6086/76294 | train loss 3.586322 | norm 0.2102 | lr 9.98e-04 
| (3807.86 ms | 137686 tok/s) step 6087/76294 | train loss 3.487467 | norm 0.1978 | lr 9.98e-04 | (3845.48 ms | 136339 tok/s) step 6088/76294 | train loss 3.571573 | norm 0.2152 | lr 9.98e-04 | (3799.82 ms | 137977 tok/s) step 6089/76294 | train loss 3.476611 | norm 0.2276 | lr 9.97e-04 | (3811.99 ms | 137537 tok/s) step 6090/76294 | train loss 3.499086 | norm 0.2262 | lr 9.97e-04 | (3824.94 ms | 137071 tok/s) step 6091/76294 | train loss 3.488878 | norm 0.1968 | lr 9.97e-04 | (3809.70 ms | 137619 tok/s) step 6092/76294 | train loss 3.524271 | norm 0.2468 | lr 9.97e-04 | (3822.00 ms | 137176 tok/s) step 6093/76294 | train loss 3.556051 | norm 0.2300 | lr 9.97e-04 | (3809.17 ms | 137639 tok/s) step 6094/76294 | train loss 3.493721 | norm 0.1890 | lr 9.97e-04 | (3873.59 ms | 135349 tok/s) step 6095/76294 | train loss 3.562449 | norm 0.2033 | lr 9.97e-04 | (3811.13 ms | 137567 tok/s) step 6096/76294 | train loss 3.474347 | norm 0.2014 | lr 9.97e-04 | (3842.99 ms | 136427 tok/s) step 6097/76294 | train loss 3.486943 | norm 0.2158 | lr 9.97e-04 | (3809.19 ms | 137638 tok/s) step 6098/76294 | train loss 3.522475 | norm 0.1967 | lr 9.97e-04 | (3809.08 ms | 137642 tok/s) step 6099/76294 | train loss 3.560844 | norm 0.2155 | lr 9.97e-04 | (3840.40 ms | 136519 tok/s) step 6100/76294 | train loss 3.497951 | norm 0.2029 | lr 9.97e-04 | (3804.98 ms | 137790 tok/s) step 6101/76294 | train loss 3.512196 | norm 0.2149 | lr 9.97e-04 | (3812.60 ms | 137515 tok/s) step 6102/76294 | train loss 3.495330 | norm 0.2168 | lr 9.97e-04 | (3827.42 ms | 136982 tok/s) step 6103/76294 | train loss 3.569458 | norm 0.2106 | lr 9.96e-04 | (3807.96 ms | 137682 tok/s) step 6104/76294 | train loss 3.556159 | norm 0.1898 | lr 9.96e-04 | (4088.73 ms | 128228 tok/s) step 6105/76294 | train loss 3.561665 | norm 0.2160 | lr 9.96e-04 | (3825.80 ms | 137040 tok/s) step 6106/76294 | train loss 3.553633 | norm 0.2006 | lr 9.96e-04 | (3805.01 ms | 137789 tok/s) step 6107/76294 | train loss 3.568951 | norm 
0.1938 | lr 9.96e-04 | (3807.70 ms | 137692 tok/s) step 6108/76294 | train loss 3.617369 | norm 0.1985 | lr 9.96e-04 | (3837.63 ms | 136617 tok/s) step 6109/76294 | train loss 3.569246 | norm 0.1983 | lr 9.96e-04 | (3805.06 ms | 137787 tok/s) step 6110/76294 | train loss 3.533786 | norm 0.2075 | lr 9.96e-04 | (3806.89 ms | 137721 tok/s) step 6111/76294 | train loss 3.589283 | norm 0.2653 | lr 9.96e-04 | (3855.33 ms | 135990 tok/s) step 6112/76294 | train loss 3.510573 | norm 0.2981 | lr 9.96e-04 | (3803.32 ms | 137850 tok/s) step 6113/76294 | train loss 3.612717 | norm 0.3157 | lr 9.96e-04 | (3811.91 ms | 137540 tok/s) step 6114/76294 | train loss 3.580577 | norm 0.2695 | lr 9.96e-04 | (3873.11 ms | 135366 tok/s) step 6115/76294 | train loss 3.586093 | norm 0.2270 | lr 9.96e-04 | (3803.88 ms | 137830 tok/s) step 6116/76294 | train loss 3.593242 | norm 0.2403 | lr 9.96e-04 | (3812.07 ms | 137534 tok/s) step 6117/76294 | train loss 3.525524 | norm 0.1983 | lr 9.96e-04 | (3827.59 ms | 136976 tok/s) step 6118/76294 | train loss 3.569107 | norm 0.2165 | lr 9.95e-04 | (3807.57 ms | 137696 tok/s) step 6119/76294 | train loss 3.621199 | norm 0.1900 | lr 9.95e-04 | (3804.31 ms | 137814 tok/s) step 6120/76294 | train loss 3.522272 | norm 0.2169 | lr 9.95e-04 | (3844.53 ms | 136373 tok/s) step 6121/76294 | train loss 3.567577 | norm 0.1981 | lr 9.95e-04 | (3803.49 ms | 137844 tok/s) step 6122/76294 | train loss 3.613731 | norm 0.1896 | lr 9.95e-04 | (3810.27 ms | 137598 tok/s) step 6123/76294 | train loss 3.516526 | norm 0.1921 | lr 9.95e-04 | (3860.39 ms | 135812 tok/s) step 6124/76294 | train loss 3.550426 | norm 0.1855 | lr 9.95e-04 | (3807.24 ms | 137708 tok/s) step 6125/76294 | train loss 3.503185 | norm 0.1815 | lr 9.95e-04 | (3828.67 ms | 136937 tok/s) step 6126/76294 | train loss 3.617499 | norm 0.1781 | lr 9.95e-04 | (3802.71 ms | 137872 tok/s) step 6127/76294 | train loss 3.503902 | norm 0.1805 | lr 9.95e-04 | (3809.19 ms | 137638 tok/s) step 6128/76294 | train loss 
3.516361 | norm 0.2035 | lr 9.95e-04 | (3893.53 ms | 134656 tok/s) step 6129/76294 | train loss 3.571949 | norm 0.1973 | lr 9.95e-04 | (3809.06 ms | 137642 tok/s) step 6130/76294 | train loss 3.519507 | norm 0.1833 | lr 9.95e-04 | (3811.53 ms | 137553 tok/s) step 6131/76294 | train loss 3.548422 | norm 0.1715 | lr 9.95e-04 | (3807.47 ms | 137700 tok/s) step 6132/76294 | train loss 3.493019 | norm 0.1753 | lr 9.94e-04 | (3826.90 ms | 137001 tok/s) step 6133/76294 | train loss 3.612239 | norm 0.2103 | lr 9.94e-04 | (3814.18 ms | 137458 tok/s) step 6134/76294 | train loss 3.612987 | norm 0.2150 | lr 9.94e-04 | (3843.59 ms | 136406 tok/s) step 6135/76294 | train loss 3.594512 | norm 0.2577 | lr 9.94e-04 | (3833.53 ms | 136764 tok/s) step 6136/76294 | train loss 3.556785 | norm 0.2176 | lr 9.94e-04 | (3811.75 ms | 137545 tok/s) step 6137/76294 | train loss 3.582161 | norm 0.3092 | lr 9.94e-04 | (3856.01 ms | 135966 tok/s) step 6138/76294 | train loss 3.597131 | norm 0.2392 | lr 9.94e-04 | (3805.42 ms | 137774 tok/s) step 6139/76294 | train loss 3.553717 | norm 0.2540 | lr 9.94e-04 | (3880.11 ms | 135122 tok/s) step 6140/76294 | train loss 3.501108 | norm 0.2287 | lr 9.94e-04 | (3806.68 ms | 137728 tok/s) step 6141/76294 | train loss 3.532362 | norm 0.2048 | lr 9.94e-04 | (3829.14 ms | 136921 tok/s) step 6142/76294 | train loss 3.562724 | norm 0.2170 | lr 9.94e-04 | (3833.50 ms | 136765 tok/s) step 6143/76294 | train loss 3.554168 | norm 0.2115 | lr 9.94e-04 | (3804.06 ms | 137823 tok/s) step 6144/76294 | train loss 3.617215 | norm 0.2010 | lr 9.94e-04 | (3802.69 ms | 137873 tok/s) step 6145/76294 | train loss 3.617687 | norm 0.2201 | lr 9.94e-04 | (3851.89 ms | 136112 tok/s) step 6146/76294 | train loss 3.616044 | norm 0.2157 | lr 9.93e-04 | (3804.40 ms | 137811 tok/s) step 6147/76294 | train loss 3.663893 | norm 0.2380 | lr 9.93e-04 | (3833.26 ms | 136773 tok/s) step 6148/76294 | train loss 3.548275 | norm 0.1983 | lr 9.93e-04 | (3801.72 ms | 137908 tok/s) step 
6149/76294 | train loss 3.591948 | norm 0.2229 | lr 9.93e-04 | (3808.55 ms | 137661 tok/s) step 6150/76294 | train loss 3.530395 | norm 0.2093 | lr 9.93e-04 | (3828.61 ms | 136939 tok/s) step 6151/76294 | train loss 3.574721 | norm 0.2054 | lr 9.93e-04 | (3845.10 ms | 136352 tok/s) step 6152/76294 | train loss 3.552175 | norm 0.2382 | lr 9.93e-04 | (3802.08 ms | 137895 tok/s) step 6153/76294 | train loss 3.495013 | norm 0.2237 | lr 9.93e-04 | (3854.04 ms | 136036 tok/s) step 6154/76294 | train loss 3.547889 | norm 0.2138 | lr 9.93e-04 | (3805.43 ms | 137774 tok/s) step 6155/76294 | train loss 3.470283 | norm 0.2594 | lr 9.93e-04 | (3807.28 ms | 137707 tok/s) step 6156/76294 | train loss 3.579986 | norm 0.2036 | lr 9.93e-04 | (3825.81 ms | 137040 tok/s) step 6157/76294 | train loss 3.579085 | norm 0.2137 | lr 9.93e-04 | (3806.26 ms | 137744 tok/s) step 6158/76294 | train loss 3.615579 | norm 0.2044 | lr 9.93e-04 | (3803.72 ms | 137836 tok/s) step 6159/76294 | train loss 3.528579 | norm 0.2307 | lr 9.93e-04 | (3838.19 ms | 136598 tok/s) step 6160/76294 | train loss 3.615405 | norm 0.2009 | lr 9.92e-04 | (3800.50 ms | 137952 tok/s) step 6161/76294 | train loss 3.554778 | norm 0.1937 | lr 9.92e-04 | (3813.29 ms | 137490 tok/s) step 6162/76294 | train loss 3.516332 | norm 0.1861 | lr 9.92e-04 | (3829.90 ms | 136893 tok/s) step 6163/76294 | train loss 3.528033 | norm 0.1819 | lr 9.92e-04 | (3804.01 ms | 137825 tok/s) step 6164/76294 | train loss 3.549033 | norm 0.1968 | lr 9.92e-04 | (3808.91 ms | 137648 tok/s) step 6165/76294 | train loss 3.511815 | norm 0.1738 | lr 9.92e-04 | (3833.73 ms | 136757 tok/s) step 6166/76294 | train loss 3.495427 | norm 0.1847 | lr 9.92e-04 | (3802.26 ms | 137889 tok/s) step 6167/76294 | train loss 3.593641 | norm 0.1873 | lr 9.92e-04 | (3811.71 ms | 137547 tok/s) step 6168/76294 | train loss 3.656544 | norm 0.2530 | lr 9.92e-04 | (3830.92 ms | 136857 tok/s) step 6169/76294 | train loss 3.643070 | norm 0.3216 | lr 9.92e-04 | (3806.96 ms | 
137718 tok/s) step 6170/76294 | train loss 3.479934 | norm 0.2372 | lr 9.92e-04 | (3809.45 ms | 137628 tok/s) step 6171/76294 | train loss 3.591337 | norm 0.2050 | lr 9.92e-04 | (3836.46 ms | 136659 tok/s) step 6172/76294 | train loss 3.525303 | norm 0.2353 | lr 9.92e-04 | (3803.90 ms | 137829 tok/s) step 6173/76294 | train loss 3.519915 | norm 0.2250 | lr 9.92e-04 | (3809.79 ms | 137616 tok/s) step 6174/76294 | train loss 3.531474 | norm 0.1992 | lr 9.91e-04 | (3823.64 ms | 137117 tok/s) step 6175/76294 | train loss 3.479853 | norm 0.1929 | lr 9.91e-04 | (3807.47 ms | 137700 tok/s) step 6176/76294 | train loss 3.552719 | norm 0.2013 | lr 9.91e-04 | (3807.01 ms | 137717 tok/s) step 6177/76294 | train loss 3.517361 | norm 0.2027 | lr 9.91e-04 | (3853.16 ms | 136067 tok/s) step 6178/76294 | train loss 3.500886 | norm 0.1788 | lr 9.91e-04 | (3804.85 ms | 137795 tok/s) step 6179/76294 | train loss 3.542115 | norm 0.1950 | lr 9.91e-04 | (3818.02 ms | 137319 tok/s) step 6180/76294 | train loss 3.481991 | norm 0.1949 | lr 9.91e-04 | (3826.46 ms | 137017 tok/s) step 6181/76294 | train loss 3.575918 | norm 0.1964 | lr 9.91e-04 | (3812.98 ms | 137501 tok/s) step 6182/76294 | train loss 3.565369 | norm 0.1944 | lr 9.91e-04 | (3805.61 ms | 137767 tok/s) step 6183/76294 | train loss 3.573312 | norm 0.2057 | lr 9.91e-04 | (3838.94 ms | 136571 tok/s) step 6184/76294 | train loss 3.535581 | norm 0.1804 | lr 9.91e-04 | (3806.70 ms | 137728 tok/s) step 6185/76294 | train loss 3.582417 | norm 0.2000 | lr 9.91e-04 | (3812.08 ms | 137533 tok/s) step 6186/76294 | train loss 3.672164 | norm 0.2137 | lr 9.91e-04 | (3828.06 ms | 136959 tok/s) step 6187/76294 | train loss 3.696209 | norm 0.2286 | lr 9.91e-04 | (3809.03 ms | 137643 tok/s) step 6188/76294 | train loss 3.522053 | norm 0.2234 | lr 9.90e-04 | (3806.95 ms | 137719 tok/s) step 6189/76294 | train loss 3.511510 | norm 0.1991 | lr 9.90e-04 | (3839.23 ms | 136561 tok/s) step 6190/76294 | train loss 3.582371 | norm 0.1737 | lr 9.90e-04 
| (3802.76 ms | 137870 tok/s) step 6191/76294 | train loss 3.551964 | norm 0.2373 | lr 9.90e-04 | (3831.96 ms | 136820 tok/s) step 6192/76294 | train loss 3.614122 | norm 0.2126 | lr 9.90e-04 | (3824.99 ms | 137069 tok/s) step 6193/76294 | train loss 3.559703 | norm 0.2006 | lr 9.90e-04 | (3827.73 ms | 136971 tok/s) step 6194/76294 | train loss 3.551580 | norm 0.2028 | lr 9.90e-04 | (3817.45 ms | 137340 tok/s) step 6195/76294 | train loss 3.499614 | norm 0.2174 | lr 9.90e-04 | (3806.38 ms | 137739 tok/s) step 6196/76294 | train loss 3.537712 | norm 0.2218 | lr 9.90e-04 | (3802.14 ms | 137893 tok/s) step 6197/76294 | train loss 3.524765 | norm 0.2019 | lr 9.90e-04 | (3832.27 ms | 136809 tok/s) step 6198/76294 | train loss 3.578422 | norm 0.2080 | lr 9.90e-04 | (3803.52 ms | 137843 tok/s) step 6199/76294 | train loss 3.613962 | norm 0.2075 | lr 9.90e-04 | (3810.01 ms | 137608 tok/s) step 6200/76294 | train loss 3.632259 | norm 0.2230 | lr 9.90e-04 | (3903.19 ms | 134323 tok/s) step 6201/76294 | train loss 3.606284 | norm 0.2112 | lr 9.90e-04 | (3804.10 ms | 137822 tok/s) step 6202/76294 | train loss 3.710908 | norm 0.2385 | lr 9.89e-04 | (3820.10 ms | 137244 tok/s) step 6203/76294 | train loss 3.567039 | norm 0.3054 | lr 9.89e-04 | (3835.22 ms | 136704 tok/s) step 6204/76294 | train loss 3.666294 | norm 0.3466 | lr 9.89e-04 | (3811.24 ms | 137564 tok/s) step 6205/76294 | train loss 3.570345 | norm 0.2390 | lr 9.89e-04 | (3802.71 ms | 137872 tok/s) step 6206/76294 | train loss 3.525864 | norm 0.2290 | lr 9.89e-04 | (3831.63 ms | 136832 tok/s) step 6207/76294 | train loss 3.638859 | norm 0.2900 | lr 9.89e-04 | (3804.30 ms | 137815 tok/s) step 6208/76294 | train loss 3.546682 | norm 0.2968 | lr 9.89e-04 | (3811.14 ms | 137567 tok/s) step 6209/76294 | train loss 3.563298 | norm 0.2600 | lr 9.89e-04 | (3890.40 ms | 134764 tok/s) step 6210/76294 | train loss 3.578893 | norm 0.2116 | lr 9.89e-04 | (3807.97 ms | 137682 tok/s) step 6211/76294 | train loss 3.551898 | norm 
0.2410 | lr 9.89e-04 | (3828.28 ms | 136951 tok/s) step 6212/76294 | train loss 3.549047 | norm 0.2001 | lr 9.89e-04 | (3812.21 ms | 137529 tok/s) step 6213/76294 | train loss 3.544240 | norm 0.1996 | lr 9.89e-04 | (3806.38 ms | 137739 tok/s) step 6214/76294 | train loss 3.552501 | norm 0.1909 | lr 9.89e-04 | (3834.00 ms | 136747 tok/s) step 6215/76294 | train loss 3.539268 | norm 0.1969 | lr 9.89e-04 | (3804.16 ms | 137820 tok/s) step 6216/76294 | train loss 3.561477 | norm 0.2007 | lr 9.88e-04 | (3809.93 ms | 137611 tok/s) step 6217/76294 | train loss 3.542509 | norm 0.2085 | lr 9.88e-04 | (3827.29 ms | 136987 tok/s) step 6218/76294 | train loss 3.609550 | norm 0.1933 | lr 9.88e-04 | (3806.89 ms | 137721 tok/s) step 6219/76294 | train loss 3.524576 | norm 0.2188 | lr 9.88e-04 | (3803.55 ms | 137842 tok/s) step 6220/76294 | train loss 3.528966 | norm 0.2242 | lr 9.88e-04 | (3836.96 ms | 136642 tok/s) step 6221/76294 | train loss 3.503253 | norm 0.1786 | lr 9.88e-04 | (3804.62 ms | 137803 tok/s) step 6222/76294 | train loss 3.595446 | norm 0.2044 | lr 9.88e-04 | (3809.52 ms | 137626 tok/s) step 6223/76294 | train loss 3.551549 | norm 0.1898 | lr 9.88e-04 | (3827.29 ms | 136987 tok/s) step 6224/76294 | train loss 3.593321 | norm 0.2093 | lr 9.88e-04 | (3809.01 ms | 137644 tok/s) step 6225/76294 | train loss 3.524432 | norm 0.1980 | lr 9.88e-04 | (3803.03 ms | 137861 tok/s) step 6226/76294 | train loss 3.582326 | norm 0.1961 | lr 9.88e-04 | (3833.87 ms | 136752 tok/s) step 6227/76294 | train loss 3.488534 | norm 0.1832 | lr 9.88e-04 | (3955.14 ms | 132559 tok/s) step 6228/76294 | train loss 3.601973 | norm 0.1962 | lr 9.88e-04 | (3901.05 ms | 134397 tok/s) step 6229/76294 | train loss 3.544205 | norm 0.1732 | lr 9.88e-04 | (4009.26 ms | 130769 tok/s) step 6230/76294 | train loss 3.542447 | norm 0.2157 | lr 9.87e-04 | (3824.03 ms | 137103 tok/s) step 6231/76294 | train loss 3.579674 | norm 0.1852 | lr 9.87e-04 | (3895.48 ms | 134589 tok/s) step 6232/76294 | train loss 
3.560969 | norm 0.2481 | lr 9.87e-04 | (3790.98 ms | 138299 tok/s) step 6233/76294 | train loss 3.552641 | norm 0.1977 | lr 9.87e-04 | (3798.48 ms | 138026 tok/s) step 6234/76294 | train loss 3.549229 | norm 0.1951 | lr 9.87e-04 | (3815.84 ms | 137398 tok/s) step 6235/76294 | train loss 3.688284 | norm 0.1967 | lr 9.87e-04 | (3796.99 ms | 138080 tok/s) step 6236/76294 | train loss 3.535971 | norm 0.2177 | lr 9.87e-04 | (3792.43 ms | 138246 tok/s) step 6237/76294 | train loss 3.570241 | norm 0.2635 | lr 9.87e-04 | (3842.81 ms | 136434 tok/s) step 6238/76294 | train loss 3.540416 | norm 0.2525 | lr 9.87e-04 | (3793.42 ms | 138210 tok/s) step 6239/76294 | train loss 3.593662 | norm 0.2081 | lr 9.87e-04 | (3821.24 ms | 137204 tok/s) step 6240/76294 | train loss 3.509061 | norm 0.1837 | lr 9.87e-04 | (3832.29 ms | 136808 tok/s) step 6241/76294 | train loss 3.516132 | norm 0.1973 | lr 9.87e-04 | (3796.59 ms | 138095 tok/s) step 6242/76294 | train loss 3.524920 | norm 0.1659 | lr 9.87e-04 | (3798.11 ms | 138039 tok/s) step 6243/76294 | train loss 3.516641 | norm 0.1943 | lr 9.87e-04 | (3857.71 ms | 135907 tok/s) step 6244/76294 | train loss 3.578171 | norm 0.2239 | lr 9.86e-04 | (3800.01 ms | 137970 tok/s) step 6245/76294 | train loss 3.657849 | norm 0.1992 | lr 9.86e-04 | (3813.08 ms | 137497 tok/s) step 6246/76294 | train loss 3.478201 | norm 0.1866 | lr 9.86e-04 | (3797.48 ms | 138062 tok/s) step 6247/76294 | train loss 3.589575 | norm 0.2132 | lr 9.86e-04 | (3800.84 ms | 137940 tok/s) step 6248/76294 | train loss 3.544965 | norm 0.1649 | lr 9.86e-04 | (3901.34 ms | 134387 tok/s) step 6249/76294 | train loss 3.501310 | norm 0.2022 | lr 9.86e-04 | (3796.27 ms | 138106 tok/s) step 6250/76294 | train loss 3.544991 | norm 0.1838 | lr 9.86e-04 | (3809.38 ms | 137631 tok/s) val loss: 3.516447 saving model checkpoint to ./results/gpt2-124M-gqa/step_6250.pth step 6251/76294 | train loss 3.545730 | norm 0.1863 | lr 9.86e-04 | (3868.61 ms | 135524 tok/s) step 6252/76294 | train 
loss 3.466289 | norm 0.1699 | lr 9.86e-04 | (3777.86 ms | 138779 tok/s) step 6253/76294 | train loss 3.570285 | norm 0.1939 | lr 9.86e-04 | (3807.32 ms | 137705 tok/s) step 6254/76294 | train loss 3.594211 | norm 0.1888 | lr 9.86e-04 | (3783.22 ms | 138583 tok/s) step 6255/76294 | train loss 3.572103 | norm 0.2284 | lr 9.86e-04 | (3814.28 ms | 137454 tok/s) step 6256/76294 | train loss 3.549855 | norm 0.1962 | lr 9.86e-04 | (3812.09 ms | 137533 tok/s) step 6257/76294 | train loss 3.566418 | norm 0.2020 | lr 9.86e-04 | (3788.55 ms | 138388 tok/s) step 6258/76294 | train loss 3.505802 | norm 0.1708 | lr 9.85e-04 | (3788.99 ms | 138372 tok/s) step 6259/76294 | train loss 3.520542 | norm 0.1975 | lr 9.85e-04 | (3836.90 ms | 136644 tok/s) step 6260/76294 | train loss 3.555275 | norm 0.2006 | lr 9.85e-04 | (3792.52 ms | 138243 tok/s) step 6261/76294 | train loss 3.546545 | norm 0.1840 | lr 9.85e-04 | (3795.05 ms | 138151 tok/s) step 6262/76294 | train loss 3.477585 | norm 0.2008 | lr 9.85e-04 | (3834.70 ms | 136722 tok/s) step 6263/76294 | train loss 3.584890 | norm 0.2302 | lr 9.85e-04 | (3794.90 ms | 138156 tok/s) step 6264/76294 | train loss 3.490985 | norm 0.2181 | lr 9.85e-04 | (3802.24 ms | 137889 tok/s) step 6265/76294 | train loss 3.510412 | norm 0.1965 | lr 9.85e-04 | (3802.03 ms | 137897 tok/s) step 6266/76294 | train loss 3.551981 | norm 0.2177 | lr 9.85e-04 | (3829.09 ms | 136922 tok/s) step 6267/76294 | train loss 3.585582 | norm 0.2389 | lr 9.85e-04 | (3801.86 ms | 137903 tok/s) step 6268/76294 | train loss 3.531799 | norm 0.2579 | lr 9.85e-04 | (3811.37 ms | 137559 tok/s) step 6269/76294 | train loss 3.503144 | norm 0.1855 | lr 9.85e-04 | (3938.38 ms | 133123 tok/s) step 6270/76294 | train loss 3.563026 | norm 0.2228 | lr 9.85e-04 | (3850.55 ms | 136159 tok/s) step 6271/76294 | train loss 3.492877 | norm 0.2771 | lr 9.85e-04 | (3851.79 ms | 136115 tok/s) step 6272/76294 | train loss 3.566440 | norm 0.2751 | lr 9.84e-04 | (3809.05 ms | 137643 tok/s) step 
6273/76294 | train loss 3.550092 | norm 0.2267 | lr 9.84e-04 | (3805.53 ms | 137770 tok/s) step 6274/76294 | train loss 3.562602 | norm 0.2190 | lr 9.84e-04 | (3836.21 ms | 136668 tok/s) step 6275/76294 | train loss 3.557017 | norm 0.2511 | lr 9.84e-04 | (3805.62 ms | 137767 tok/s) step 6276/76294 | train loss 3.539069 | norm 0.2033 | lr 9.84e-04 | (3835.11 ms | 136707 tok/s) step 6277/76294 | train loss 3.526812 | norm 0.2187 | lr 9.84e-04 | (3851.34 ms | 136131 tok/s) step 6278/76294 | train loss 3.515221 | norm 0.2162 | lr 9.84e-04 | (3815.24 ms | 137419 tok/s) step 6279/76294 | train loss 3.516499 | norm 0.2357 | lr 9.84e-04 | (3833.08 ms | 136780 tok/s) step 6280/76294 | train loss 3.480044 | norm 0.2205 | lr 9.84e-04 | (3819.86 ms | 137253 tok/s) step 6281/76294 | train loss 3.580878 | norm 0.2095 | lr 9.84e-04 | (3817.42 ms | 137341 tok/s) step 6282/76294 | train loss 3.542472 | norm 0.2324 | lr 9.84e-04 | (3840.09 ms | 136530 tok/s) step 6283/76294 | train loss 3.653392 | norm 0.2266 | lr 9.84e-04 | (3814.98 ms | 137429 tok/s) step 6284/76294 | train loss 3.600480 | norm 0.2472 | lr 9.84e-04 | (3819.05 ms | 137282 tok/s) step 6285/76294 | train loss 3.559039 | norm 0.2165 | lr 9.84e-04 | (3831.45 ms | 136838 tok/s) step 6286/76294 | train loss 3.516732 | norm 0.3065 | lr 9.83e-04 | (3839.28 ms | 136559 tok/s) step 6287/76294 | train loss 3.561740 | norm 0.2533 | lr 9.83e-04 | (3808.79 ms | 137652 tok/s) step 6288/76294 | train loss 3.524397 | norm 0.2218 | lr 9.83e-04 | (3839.98 ms | 136534 tok/s) step 6289/76294 | train loss 3.616094 | norm 0.2295 | lr 9.83e-04 | (3807.95 ms | 137683 tok/s) step 6290/76294 | train loss 3.526043 | norm 0.1979 | lr 9.83e-04 | (3848.20 ms | 136243 tok/s) step 6291/76294 | train loss 3.570174 | norm 0.2071 | lr 9.83e-04 | (3869.55 ms | 135491 tok/s) step 6292/76294 | train loss 3.485646 | norm 0.1924 | lr 9.83e-04 | (3807.83 ms | 137687 tok/s) step 6293/76294 | train loss 3.533021 | norm 0.2123 | lr 9.83e-04 | (3813.73 ms | 
137474 tok/s) step 6294/76294 | train loss 3.531868 | norm 0.2091 | lr 9.83e-04 | (3900.85 ms | 134404 tok/s) step 6295/76294 | train loss 3.573326 | norm 0.2257 | lr 9.83e-04 | (4265.63 ms | 122910 tok/s) step 6296/76294 | train loss 3.492433 | norm 0.2554 | lr 9.83e-04 | (3802.22 ms | 137890 tok/s) step 6297/76294 | train loss 3.594168 | norm 0.2316 | lr 9.83e-04 | (3810.18 ms | 137602 tok/s) step 6298/76294 | train loss 3.479013 | norm 0.1916 | lr 9.83e-04 | (3907.37 ms | 134179 tok/s) step 6299/76294 | train loss 3.552319 | norm 0.2117 | lr 9.83e-04 | (3807.22 ms | 137709 tok/s) step 6300/76294 | train loss 3.475631 | norm 0.2343 | lr 9.82e-04 | (3808.99 ms | 137645 tok/s) step 6301/76294 | train loss 3.583102 | norm 0.1847 | lr 9.82e-04 | (3833.00 ms | 136783 tok/s) step 6302/76294 | train loss 3.558061 | norm 0.2366 | lr 9.82e-04 | (3805.24 ms | 137781 tok/s) step 6303/76294 | train loss 3.472490 | norm 0.2100 | lr 9.82e-04 | (3815.51 ms | 137410 tok/s) step 6304/76294 | train loss 3.594726 | norm 0.2432 | lr 9.82e-04 | (3830.45 ms | 136874 tok/s) step 6305/76294 | train loss 3.468868 | norm 0.2009 | lr 9.82e-04 | (3808.09 ms | 137677 tok/s) step 6306/76294 | train loss 3.487103 | norm 0.2091 | lr 9.82e-04 | (3826.86 ms | 137002 tok/s) step 6307/76294 | train loss 3.488724 | norm 0.3485 | lr 9.82e-04 | (3807.04 ms | 137715 tok/s) step 6308/76294 | train loss 3.463499 | norm 0.2561 | lr 9.82e-04 | (3843.20 ms | 136420 tok/s) step 6309/76294 | train loss 3.491718 | norm 0.2228 | lr 9.82e-04 | (3805.56 ms | 137769 tok/s) step 6310/76294 | train loss 3.516104 | norm 0.2936 | lr 9.82e-04 | (3832.52 ms | 136800 tok/s) step 6311/76294 | train loss 3.602233 | norm 0.2754 | lr 9.82e-04 | (3877.62 ms | 135209 tok/s) step 6312/76294 | train loss 3.483657 | norm 0.2681 | lr 9.82e-04 | (3829.70 ms | 136901 tok/s) step 6313/76294 | train loss 3.506301 | norm 0.2552 | lr 9.82e-04 | (3811.08 ms | 137569 tok/s) step 6314/76294 | train loss 3.527529 | norm 0.2451 | lr 9.81e-04 
| (3814.26 ms | 137455 tok/s) step 6315/76294 | train loss 3.444170 | norm 0.2235 | lr 9.81e-04 | (3845.91 ms | 136323 tok/s) step 6316/76294 | train loss 3.595151 | norm 0.2123 | lr 9.81e-04 | (3801.52 ms | 137915 tok/s) step 6317/76294 | train loss 3.495647 | norm 0.2525 | lr 9.81e-04 | (3811.50 ms | 137554 tok/s) step 6318/76294 | train loss 3.498976 | norm 0.1765 | lr 9.81e-04 | (3901.72 ms | 134374 tok/s) step 6319/76294 | train loss 3.532286 | norm 0.1939 | lr 9.81e-04 | (3801.34 ms | 137922 tok/s) step 6320/76294 | train loss 3.523131 | norm 0.1862 | lr 9.81e-04 | (3840.81 ms | 136504 tok/s) step 6321/76294 | train loss 3.516189 | norm 0.2002 | lr 9.81e-04 | (3808.10 ms | 137677 tok/s) step 6322/76294 | train loss 3.468012 | norm 0.1783 | lr 9.81e-04 | (3813.21 ms | 137493 tok/s) step 6323/76294 | train loss 3.542816 | norm 0.2099 | lr 9.81e-04 | (3829.10 ms | 136922 tok/s) step 6324/76294 | train loss 3.433001 | norm 0.2207 | lr 9.81e-04 | (3810.92 ms | 137575 tok/s) step 6325/76294 | train loss 3.548214 | norm 0.2562 | lr 9.81e-04 | (3828.51 ms | 136943 tok/s) step 6326/76294 | train loss 3.489152 | norm 0.2057 | lr 9.81e-04 | (3810.23 ms | 137600 tok/s) step 6327/76294 | train loss 3.482238 | norm 0.1999 | lr 9.80e-04 | (3856.30 ms | 135956 tok/s) step 6328/76294 | train loss 3.559932 | norm 0.2152 | lr 9.80e-04 | (3805.18 ms | 137783 tok/s) step 6329/76294 | train loss 3.483778 | norm 0.2015 | lr 9.80e-04 | (3866.83 ms | 135586 tok/s) step 6330/76294 | train loss 3.493189 | norm 0.1895 | lr 9.80e-04 | (3806.04 ms | 137751 tok/s) step 6331/76294 | train loss 3.532549 | norm 0.1798 | lr 9.80e-04 | (3811.65 ms | 137549 tok/s) step 6332/76294 | train loss 3.487912 | norm 0.1733 | lr 9.80e-04 | (3805.01 ms | 137789 tok/s) step 6333/76294 | train loss 3.566211 | norm 0.1903 | lr 9.80e-04 | (3811.73 ms | 137546 tok/s) step 6334/76294 | train loss 3.492264 | norm 0.2081 | lr 9.80e-04 | (3832.24 ms | 136810 tok/s) step 6335/76294 | train loss 3.537559 | norm 
0.2165 | lr 9.80e-04 | (3809.70 ms | 137619 tok/s) step 6336/76294 | train loss 3.488142 | norm 0.2091 | lr 9.80e-04 | (3804.03 ms | 137824 tok/s) step 6337/76294 | train loss 3.489825 | norm 0.2322 | lr 9.80e-04 | (3943.44 ms | 132952 tok/s) step 6338/76294 | train loss 3.504323 | norm 0.2129 | lr 9.80e-04 | (3805.55 ms | 137769 tok/s) step 6339/76294 | train loss 3.588129 | norm 0.1989 | lr 9.80e-04 | (3808.39 ms | 137667 tok/s) step 6340/76294 | train loss 3.609355 | norm 0.2611 | lr 9.80e-04 | (3831.27 ms | 136844 tok/s) step 6341/76294 | train loss 3.525624 | norm 0.2976 | lr 9.79e-04 | (3808.06 ms | 137678 tok/s) step 6342/76294 | train loss 3.502551 | norm 0.1958 | lr 9.79e-04 | (3807.66 ms | 137693 tok/s) step 6343/76294 | train loss 3.459423 | norm 0.3356 | lr 9.79e-04 | (3837.61 ms | 136618 tok/s) step 6344/76294 | train loss 3.511130 | norm 0.2252 | lr 9.79e-04 | (3805.10 ms | 137786 tok/s) step 6345/76294 | train loss 3.540017 | norm 0.2424 | lr 9.79e-04 | (3811.31 ms | 137561 tok/s) step 6346/76294 | train loss 3.514843 | norm 0.2465 | lr 9.79e-04 | (3904.42 ms | 134281 tok/s) step 6347/76294 | train loss 3.573943 | norm 0.2312 | lr 9.79e-04 | (3873.33 ms | 135358 tok/s) step 6348/76294 | train loss 3.480448 | norm 0.2134 | lr 9.79e-04 | (3804.60 ms | 137804 tok/s) step 6349/76294 | train loss 3.546518 | norm 0.2199 | lr 9.79e-04 | (3807.95 ms | 137682 tok/s) step 6350/76294 | train loss 3.501510 | norm 0.2140 | lr 9.79e-04 | (3829.59 ms | 136904 tok/s) step 6351/76294 | train loss 3.527159 | norm 0.2234 | lr 9.79e-04 | (3811.56 ms | 137552 tok/s) step 6352/76294 | train loss 3.629319 | norm 0.2302 | lr 9.79e-04 | (3806.51 ms | 137734 tok/s) step 6353/76294 | train loss 3.466209 | norm 0.2299 | lr 9.79e-04 | (3843.59 ms | 136406 tok/s) step 6354/76294 | train loss 3.540136 | norm 0.1835 | lr 9.79e-04 | (3811.19 ms | 137565 tok/s) step 6355/76294 | train loss 3.465037 | norm 0.2069 | lr 9.78e-04 | (3814.49 ms | 137446 tok/s) step 6356/76294 | train loss 
3.474445 | norm 0.1986 | lr 9.78e-04 | (3838.02 ms | 136604 tok/s) step 6357/76294 | train loss 3.498575 | norm 0.1943 | lr 9.78e-04 | (3820.42 ms | 137233 tok/s) step 6358/76294 | train loss 3.481607 | norm 0.2016 | lr 9.78e-04 | (3810.59 ms | 137587 tok/s) step 6359/76294 | train loss 3.508771 | norm 0.1990 | lr 9.78e-04 | (3856.76 ms | 135940 tok/s) step 6360/76294 | train loss 3.471881 | norm 0.1926 | lr 9.78e-04 | (3816.13 ms | 137387 tok/s) step 6361/76294 | train loss 3.556760 | norm 0.1838 | lr 9.78e-04 | (3818.29 ms | 137310 tok/s) step 6362/76294 | train loss 3.520173 | norm 0.1967 | lr 9.78e-04 | (3839.92 ms | 136536 tok/s) step 6363/76294 | train loss 3.531459 | norm 0.1823 | lr 9.78e-04 | (3814.47 ms | 137447 tok/s) step 6364/76294 | train loss 3.466755 | norm 0.1949 | lr 9.78e-04 | (3810.41 ms | 137594 tok/s) step 6365/76294 | train loss 3.475914 | norm 0.2071 | lr 9.78e-04 | (3841.62 ms | 136476 tok/s) step 6366/76294 | train loss 3.517413 | norm 0.2046 | lr 9.78e-04 | (3817.15 ms | 137351 tok/s) step 6367/76294 | train loss 3.435097 | norm 0.2053 | lr 9.78e-04 | (3813.52 ms | 137481 tok/s) step 6368/76294 | train loss 3.471683 | norm 0.2173 | lr 9.78e-04 | (3837.72 ms | 136615 tok/s) step 6369/76294 | train loss 3.491382 | norm 0.2052 | lr 9.77e-04 | (3823.97 ms | 137106 tok/s) step 6370/76294 | train loss 3.470968 | norm 0.2064 | lr 9.77e-04 | (3809.34 ms | 137632 tok/s) step 6371/76294 | train loss 3.586749 | norm 0.2124 | lr 9.77e-04 | (3854.50 ms | 136020 tok/s) step 6372/76294 | train loss 3.510985 | norm 0.2563 | lr 9.77e-04 | (3813.25 ms | 137491 tok/s) step 6373/76294 | train loss 3.466161 | norm 0.2799 | lr 9.77e-04 | (3868.84 ms | 135516 tok/s) step 6374/76294 | train loss 3.589543 | norm 0.2427 | lr 9.77e-04 | (3813.53 ms | 137481 tok/s) step 6375/76294 | train loss 3.492099 | norm 0.2842 | lr 9.77e-04 | (3808.13 ms | 137676 tok/s) step 6376/76294 | train loss 3.515735 | norm 0.2094 | lr 9.77e-04 | (3901.38 ms | 134385 tok/s) step 
6377/76294 | train loss 3.495973 | norm 0.2334 | lr 9.77e-04 | (3806.06 ms | 137751 tok/s) step 6378/76294 | train loss 3.465400 | norm 0.2019 | lr 9.77e-04 | (3808.01 ms | 137680 tok/s) step 6379/76294 | train loss 3.453774 | norm 0.3139 | lr 9.77e-04 | (3825.49 ms | 137051 tok/s) step 6380/76294 | train loss 3.480383 | norm 0.2649 | lr 9.77e-04 | (3805.50 ms | 137771 tok/s) step 6381/76294 | train loss 3.554111 | norm 0.2196 | lr 9.77e-04 | (3806.02 ms | 137752 tok/s) step 6382/76294 | train loss 3.476362 | norm 0.2113 | lr 9.76e-04 | (3807.03 ms | 137716 tok/s) step 6383/76294 | train loss 3.535937 | norm 0.2202 | lr 9.76e-04 | (3805.93 ms | 137756 tok/s) step 6384/76294 | train loss 3.414752 | norm 0.1885 | lr 9.76e-04 | (3840.36 ms | 136521 tok/s) step 6385/76294 | train loss 3.542908 | norm 0.3107 | lr 9.76e-04 | (3882.36 ms | 135044 tok/s) step 6386/76294 | train loss 3.502942 | norm 0.2486 | lr 9.76e-04 | (3814.43 ms | 137449 tok/s) step 6387/76294 | train loss 3.484838 | norm 0.2050 | lr 9.76e-04 | (3802.98 ms | 137862 tok/s) step 6388/76294 | train loss 3.514240 | norm 0.2612 | lr 9.76e-04 | (4236.08 ms | 123767 tok/s) step 6389/76294 | train loss 3.529368 | norm 0.2122 | lr 9.76e-04 | (3809.54 ms | 137625 tok/s) step 6390/76294 | train loss 3.522941 | norm 0.1850 | lr 9.76e-04 | (3810.71 ms | 137583 tok/s) step 6391/76294 | train loss 3.487985 | norm 0.2212 | lr 9.76e-04 | (3826.64 ms | 137010 tok/s) step 6392/76294 | train loss 3.497813 | norm 0.1764 | lr 9.76e-04 | (3806.36 ms | 137740 tok/s) step 6393/76294 | train loss 3.497470 | norm 0.2097 | lr 9.76e-04 | (3804.84 ms | 137795 tok/s) step 6394/76294 | train loss 3.531520 | norm 0.1752 | lr 9.76e-04 | (3833.69 ms | 136758 tok/s) step 6395/76294 | train loss 3.516416 | norm 0.2025 | lr 9.76e-04 | (3801.00 ms | 137934 tok/s) step 6396/76294 | train loss 3.661450 | norm 0.1952 | lr 9.75e-04 | (3811.37 ms | 137559 tok/s) step 6397/76294 | train loss 3.489383 | norm 0.1924 | lr 9.75e-04 | (3827.13 ms | 
136992 tok/s) step 6398/76294 | train loss 3.562779 | norm 0.1987 | lr 9.75e-04 | (3804.81 ms | 137796 tok/s) step 6399/76294 | train loss 3.429527 | norm 0.2040 | lr 9.75e-04 | (3804.49 ms | 137808 tok/s) step 6400/76294 | train loss 3.512476 | norm 0.2223 | lr 9.75e-04 | (3841.40 ms | 136484 tok/s) step 6401/76294 | train loss 3.441004 | norm 0.2139 | lr 9.75e-04 | (3806.87 ms | 137722 tok/s) step 6402/76294 | train loss 3.477561 | norm 0.2066 | lr 9.75e-04 | (3813.92 ms | 137467 tok/s) step 6403/76294 | train loss 3.464837 | norm 0.1956 | lr 9.75e-04 | (4068.62 ms | 128861 tok/s) step 6404/76294 | train loss 3.426296 | norm 0.2122 | lr 9.75e-04 | (3805.41 ms | 137774 tok/s) step 6405/76294 | train loss 3.534218 | norm 0.2132 | lr 9.75e-04 | (3888.86 ms | 134818 tok/s) step 6406/76294 | train loss 3.508075 | norm 0.1941 | lr 9.75e-04 | (3803.29 ms | 137851 tok/s) step 6407/76294 | train loss 3.513171 | norm 0.2304 | lr 9.75e-04 | (3809.31 ms | 137633 tok/s) step 6408/76294 | train loss 3.485731 | norm 0.1990 | lr 9.75e-04 | (3832.38 ms | 136805 tok/s) step 6409/76294 | train loss 3.455804 | norm 0.2137 | lr 9.75e-04 | (3807.22 ms | 137709 tok/s) step 6410/76294 | train loss 3.508453 | norm 0.2001 | lr 9.74e-04 | (3803.48 ms | 137844 tok/s) step 6411/76294 | train loss 3.486281 | norm 0.2139 | lr 9.74e-04 | (3863.82 ms | 135691 tok/s) step 6412/76294 | train loss 3.526626 | norm 0.2490 | lr 9.74e-04 | (3804.70 ms | 137800 tok/s) step 6413/76294 | train loss 3.447672 | norm 0.2256 | lr 9.74e-04 | (3851.52 ms | 136125 tok/s) step 6414/76294 | train loss 3.481201 | norm 0.2129 | lr 9.74e-04 | (3806.78 ms | 137725 tok/s) step 6415/76294 | train loss 3.471174 | norm 0.2221 | lr 9.74e-04 | (3807.63 ms | 137694 tok/s) step 6416/76294 | train loss 3.455798 | norm 0.2368 | lr 9.74e-04 | (3826.86 ms | 137002 tok/s) step 6417/76294 | train loss 3.501089 | norm 0.1998 | lr 9.74e-04 | (3808.54 ms | 137661 tok/s) step 6418/76294 | train loss 3.482073 | norm 0.1863 | lr 9.74e-04 
| (3822.80 ms | 137148 tok/s) step 6419/76294 | train loss 3.458260 | norm 0.2212 | lr 9.74e-04 | (3808.63 ms | 137658 tok/s) step 6420/76294 | train loss 3.562605 | norm 0.1936 | lr 9.74e-04 | (3830.49 ms | 136872 tok/s) step 6421/76294 | train loss 3.483050 | norm 0.2182 | lr 9.74e-04 | (3809.41 ms | 137630 tok/s) step 6422/76294 | train loss 3.522038 | norm 0.1983 | lr 9.74e-04 | (3802.00 ms | 137898 tok/s) step 6423/76294 | train loss 3.495534 | norm 0.1903 | lr 9.73e-04 | (3836.46 ms | 136659 tok/s) step 6424/76294 | train loss 3.548888 | norm 0.1998 | lr 9.73e-04 | (3803.78 ms | 137834 tok/s) step 6425/76294 | train loss 3.471501 | norm 0.1934 | lr 9.73e-04 | (3974.11 ms | 131926 tok/s) step 6426/76294 | train loss 3.522740 | norm 0.1942 | lr 9.73e-04 | (3798.47 ms | 138026 tok/s) step 6427/76294 | train loss 3.575114 | norm 0.2141 | lr 9.73e-04 | (3808.81 ms | 137651 tok/s) step 6428/76294 | train loss 3.443900 | norm 0.1814 | lr 9.73e-04 | (3823.25 ms | 137131 tok/s) step 6429/76294 | train loss 3.480252 | norm 0.2020 | lr 9.73e-04 | (3813.82 ms | 137471 tok/s) step 6430/76294 | train loss 3.519355 | norm 0.1928 | lr 9.73e-04 | (3828.16 ms | 136955 tok/s) step 6431/76294 | train loss 3.505560 | norm 0.2030 | lr 9.73e-04 | (3804.70 ms | 137800 tok/s) step 6432/76294 | train loss 3.537644 | norm 0.1884 | lr 9.73e-04 | (3805.32 ms | 137778 tok/s) step 6433/76294 | train loss 3.514038 | norm 0.1976 | lr 9.73e-04 | (3836.57 ms | 136656 tok/s) step 6434/76294 | train loss 3.511666 | norm 0.2530 | lr 9.73e-04 | (3803.13 ms | 137857 tok/s) step 6435/76294 | train loss 3.500280 | norm 0.2751 | lr 9.73e-04 | (3810.35 ms | 137596 tok/s) step 6436/76294 | train loss 3.465549 | norm 0.2537 | lr 9.73e-04 | (3831.66 ms | 136830 tok/s) step 6437/76294 | train loss 3.569805 | norm 0.1875 | lr 9.72e-04 | (3807.69 ms | 137692 tok/s) step 6438/76294 | train loss 3.494233 | norm 0.2685 | lr 9.72e-04 | (3831.59 ms | 136833 tok/s) step 6439/76294 | train loss 3.556530 | norm 
0.2200 | lr 9.72e-04 | (3824.95 ms | 137070 tok/s) step 6440/76294 | train loss 3.473886 | norm 0.2111 | lr 9.72e-04 | (3804.85 ms | 137795 tok/s) step 6441/76294 | train loss 3.485200 | norm 0.2215 | lr 9.72e-04 | (3840.37 ms | 136520 tok/s) step 6442/76294 | train loss 3.553217 | norm 0.2186 | lr 9.72e-04 | (3802.74 ms | 137871 tok/s) step 6443/76294 | train loss 3.502964 | norm 0.2017 | lr 9.72e-04 | (3814.25 ms | 137455 tok/s) step 6444/76294 | train loss 3.551788 | norm 0.2329 | lr 9.72e-04 | (3803.84 ms | 137831 tok/s) step 6445/76294 | train loss 3.458626 | norm 0.1987 | lr 9.72e-04 | (3810.52 ms | 137590 tok/s) step 6446/76294 | train loss 3.554222 | norm 0.1965 | lr 9.72e-04 | (5013.64 ms | 104572 tok/s) step 6447/76294 | train loss 3.477689 | norm 0.2068 | lr 9.72e-04 | (3835.83 ms | 136682 tok/s) step 6448/76294 | train loss 3.468053 | norm 0.1962 | lr 9.72e-04 | (3834.67 ms | 136723 tok/s) step 6449/76294 | train loss 3.536295 | norm 0.1848 | lr 9.72e-04 | (3802.79 ms | 137869 tok/s) step 6450/76294 | train loss 3.425763 | norm 0.1872 | lr 9.72e-04 | (3806.54 ms | 137733 tok/s) step 6451/76294 | train loss 3.521257 | norm 0.2116 | lr 9.71e-04 | (3822.58 ms | 137156 tok/s) step 6452/76294 | train loss 3.444488 | norm 0.1830 | lr 9.71e-04 | (3901.95 ms | 134366 tok/s) step 6453/76294 | train loss 3.497082 | norm 0.1884 | lr 9.71e-04 | (3792.15 ms | 138256 tok/s) step 6454/76294 | train loss 3.459759 | norm 0.1919 | lr 9.71e-04 | (3818.06 ms | 137318 tok/s) step 6455/76294 | train loss 3.496439 | norm 0.1884 | lr 9.71e-04 | (3797.78 ms | 138051 tok/s) step 6456/76294 | train loss 3.564506 | norm 0.2197 | lr 9.71e-04 | (3801.03 ms | 137933 tok/s) step 6457/76294 | train loss 3.487396 | norm 0.3507 | lr 9.71e-04 | (3817.32 ms | 137345 tok/s) step 6458/76294 | train loss 3.478278 | norm 0.2667 | lr 9.71e-04 | (3799.09 ms | 138003 tok/s) step 6459/76294 | train loss 3.466001 | norm 0.2401 | lr 9.71e-04 | (3848.28 ms | 136240 tok/s) step 6460/76294 | train loss 
3.555165 | norm 0.2207 | lr 9.71e-04 | (3800.50 ms | 137952 tok/s) step 6461/76294 | train loss 3.491944 | norm 0.2297 | lr 9.71e-04 | (3795.62 ms | 138130 tok/s) step 6462/76294 | train loss 3.523893 | norm 0.2802 | lr 9.71e-04 | (3822.08 ms | 137173 tok/s) step 6463/76294 | train loss 3.515612 | norm 0.2613 | lr 9.71e-04 | (3796.65 ms | 138092 tok/s) step 6464/76294 | train loss 3.540985 | norm 0.2210 | lr 9.70e-04 | (3952.04 ms | 132663 tok/s) step 6465/76294 | train loss 3.557215 | norm 0.2197 | lr 9.70e-04 | (3875.86 ms | 135270 tok/s) step 6466/76294 | train loss 3.664896 | norm 0.2469 | lr 9.70e-04 | (3804.42 ms | 137810 tok/s) step 6467/76294 | train loss 3.489869 | norm 0.2035 | lr 9.70e-04 | (3801.49 ms | 137916 tok/s) step 6468/76294 | train loss 3.496335 | norm 0.1914 | lr 9.70e-04 | (3820.50 ms | 137230 tok/s) step 6469/76294 | train loss 3.537351 | norm 0.2813 | lr 9.70e-04 | (3806.56 ms | 137733 tok/s) step 6470/76294 | train loss 3.558274 | norm 0.2858 | lr 9.70e-04 | (3820.17 ms | 137242 tok/s) step 6471/76294 | train loss 3.468962 | norm 0.1833 | lr 9.70e-04 | (3807.43 ms | 137701 tok/s) step 6472/76294 | train loss 3.528157 | norm 0.2040 | lr 9.70e-04 | (3797.28 ms | 138070 tok/s) step 6473/76294 | train loss 3.553090 | norm 0.2114 | lr 9.70e-04 | (3830.77 ms | 136862 tok/s) step 6474/76294 | train loss 3.529871 | norm 0.2022 | lr 9.70e-04 | (3800.99 ms | 137935 tok/s) step 6475/76294 | train loss 3.466414 | norm 0.2571 | lr 9.70e-04 | (3803.30 ms | 137851 tok/s) step 6476/76294 | train loss 3.450632 | norm 0.2604 | lr 9.70e-04 | (3825.57 ms | 137048 tok/s) step 6477/76294 | train loss 3.551837 | norm 0.3039 | lr 9.70e-04 | (3801.30 ms | 137923 tok/s) step 6478/76294 | train loss 3.448128 | norm 0.2596 | lr 9.69e-04 | (3799.20 ms | 137999 tok/s) step 6479/76294 | train loss 3.513861 | norm 0.3109 | lr 9.69e-04 | (3860.61 ms | 135804 tok/s) step 6480/76294 | train loss 3.514946 | norm 0.2412 | lr 9.69e-04 | (3807.89 ms | 137685 tok/s) step 
6481/76294 | train loss 3.408716 | norm 0.2603 | lr 9.69e-04 | (3813.89 ms | 137468 tok/s) step 6482/76294 | train loss 3.573354 | norm 0.2761 | lr 9.69e-04 | (3808.94 ms | 137647 tok/s) step 6483/76294 | train loss 3.462462 | norm 0.2269 | lr 9.69e-04 | (3808.28 ms | 137671 tok/s) step 6484/76294 | train loss 3.544129 | norm 0.1961 | lr 9.69e-04 | (3830.67 ms | 136866 tok/s) step 6485/76294 | train loss 3.522803 | norm 0.2078 | lr 9.69e-04 | (4190.27 ms | 125120 tok/s) step 6486/76294 | train loss 3.560516 | norm 0.2067 | lr 9.69e-04 | (3802.83 ms | 137868 tok/s) step 6487/76294 | train loss 3.560913 | norm 0.1700 | lr 9.69e-04 | (3813.15 ms | 137495 tok/s) step 6488/76294 | train loss 3.599043 | norm 0.1888 | lr 9.69e-04 | (3832.40 ms | 136804 tok/s) step 6489/76294 | train loss 3.622046 | norm 0.1807 | lr 9.69e-04 | (3815.55 ms | 137408 tok/s) step 6490/76294 | train loss 3.563603 | norm 0.1991 | lr 9.69e-04 | (3809.51 ms | 137626 tok/s) step 6491/76294 | train loss 3.530709 | norm 0.1991 | lr 9.68e-04 | (3830.59 ms | 136869 tok/s) step 6492/76294 | train loss 3.569037 | norm 0.1942 | lr 9.68e-04 | (3800.40 ms | 137956 tok/s) step 6493/76294 | train loss 3.468254 | norm 0.2031 | lr 9.68e-04 | (3998.13 ms | 131133 tok/s) step 6494/76294 | train loss 3.567545 | norm 0.2287 | lr 9.68e-04 | (3818.25 ms | 137311 tok/s) step 6495/76294 | train loss 3.545075 | norm 0.2137 | lr 9.68e-04 | (3801.65 ms | 137911 tok/s) step 6496/76294 | train loss 3.576096 | norm 0.1970 | lr 9.68e-04 | (3844.21 ms | 136384 tok/s) step 6497/76294 | train loss 3.519983 | norm 0.2347 | lr 9.68e-04 | (3801.62 ms | 137912 tok/s) step 6498/76294 | train loss 3.552450 | norm 0.1838 | lr 9.68e-04 | (3797.66 ms | 138056 tok/s) step 6499/76294 | train loss 3.538593 | norm 0.2004 | lr 9.68e-04 | (3834.47 ms | 136730 tok/s) step 6500/76294 | train loss 3.621271 | norm 0.2125 | lr 9.68e-04 | (3798.84 ms | 138013 tok/s) val loss: 3.513034 saving model checkpoint to ./results/gpt2-124M-gqa/step_6500.pth 
step 6501/76294 | train loss 3.578570 | norm 0.2334 | lr 9.68e-04 | (3812.93 ms | 137503 tok/s) step 6502/76294 | train loss 3.495514 | norm 0.1988 | lr 9.68e-04 | (3824.27 ms | 137095 tok/s) step 6503/76294 | train loss 3.524756 | norm 0.2565 | lr 9.68e-04 | (3798.80 ms | 138014 tok/s) step 6504/76294 | train loss 3.493476 | norm 0.2650 | lr 9.68e-04 | (3797.13 ms | 138075 tok/s) step 6505/76294 | train loss 3.485085 | norm 0.2176 | lr 9.67e-04 | (3901.54 ms | 134380 tok/s) step 6506/76294 | train loss 3.615788 | norm 0.2745 | lr 9.67e-04 | (3797.96 ms | 138045 tok/s) step 6507/76294 | train loss 3.566860 | norm 0.2535 | lr 9.67e-04 | (3850.76 ms | 136152 tok/s) step 6508/76294 | train loss 3.481139 | norm 0.1944 | lr 9.67e-04 | (3796.65 ms | 138092 tok/s) step 6509/76294 | train loss 3.507818 | norm 0.2327 | lr 9.67e-04 | (3850.69 ms | 136154 tok/s) step 6510/76294 | train loss 3.581888 | norm 0.2180 | lr 9.67e-04 | (3800.12 ms | 137966 tok/s) step 6511/76294 | train loss 3.516731 | norm 0.1810 | lr 9.67e-04 | (3807.19 ms | 137710 tok/s) step 6512/76294 | train loss 3.448706 | norm 0.1909 | lr 9.67e-04 | (3802.82 ms | 137868 tok/s) step 6513/76294 | train loss 3.600194 | norm 0.1788 | lr 9.67e-04 | (3805.70 ms | 137764 tok/s) step 6514/76294 | train loss 3.509130 | norm 0.1829 | lr 9.67e-04 | (3818.12 ms | 137316 tok/s) step 6515/76294 | train loss 3.478487 | norm 0.1948 | lr 9.67e-04 | (3827.62 ms | 136975 tok/s) step 6516/76294 | train loss 3.555707 | norm 0.2012 | lr 9.67e-04 | (3801.13 ms | 137930 tok/s) step 6517/76294 | train loss 3.626603 | norm 0.1991 | lr 9.67e-04 | (3839.38 ms | 136555 tok/s) step 6518/76294 | train loss 3.546557 | norm 0.1875 | lr 9.66e-04 | (3802.17 ms | 137892 tok/s) step 6519/76294 | train loss 3.523615 | norm 0.1906 | lr 9.66e-04 | (3804.12 ms | 137821 tok/s) step 6520/76294 | train loss 3.518501 | norm 0.1816 | lr 9.66e-04 | (3824.83 ms | 137075 tok/s) step 6521/76294 | train loss 3.601832 | norm 0.1769 | lr 9.66e-04 | (3808.20 ms 
| 137674 tok/s) step 6522/76294 | train loss 3.536206 | norm 0.2358 | lr 9.66e-04 | (3801.74 ms | 137907 tok/s) step 6523/76294 | train loss 3.520455 | norm 0.2108 | lr 9.66e-04 | (3845.69 ms | 136331 tok/s) step 6524/76294 | train loss 3.602182 | norm 0.2069 | lr 9.66e-04 | (3799.93 ms | 137973 tok/s) step 6525/76294 | train loss 3.677790 | norm 0.2556 | lr 9.66e-04 | (3854.96 ms | 136003 tok/s) step 6526/76294 | train loss 3.665444 | norm 0.2387 | lr 9.66e-04 | (3882.14 ms | 135051 tok/s) step 6527/76294 | train loss 3.523237 | norm 0.2441 | lr 9.66e-04 | (3803.07 ms | 137859 tok/s) step 6528/76294 | train loss 3.522091 | norm 0.2029 | lr 9.66e-04 | (3831.03 ms | 136853 tok/s) step 6529/76294 | train loss 3.497710 | norm 0.1980 | lr 9.66e-04 | (3802.13 ms | 137893 tok/s) step 6530/76294 | train loss 3.547519 | norm 0.2156 | lr 9.66e-04 | (3807.06 ms | 137715 tok/s) step 6531/76294 | train loss 3.470261 | norm 0.1950 | lr 9.66e-04 | (3826.03 ms | 137032 tok/s) step 6532/76294 | train loss 3.535133 | norm 0.2109 | lr 9.65e-04 | (3808.46 ms | 137664 tok/s) step 6533/76294 | train loss 3.503350 | norm 0.2226 | lr 9.65e-04 | (3824.59 ms | 137083 tok/s) step 6534/76294 | train loss 3.669140 | norm 0.1962 | lr 9.65e-04 | (3805.88 ms | 137758 tok/s) step 6535/76294 | train loss 3.494696 | norm 0.2554 | lr 9.65e-04 | (3820.99 ms | 137213 tok/s) step 6536/76294 | train loss 3.521816 | norm 0.2067 | lr 9.65e-04 | (3806.84 ms | 137723 tok/s) step 6537/76294 | train loss 3.455975 | norm 0.2387 | lr 9.65e-04 | (3827.31 ms | 136986 tok/s) step 6538/76294 | train loss 3.464909 | norm 0.3726 | lr 9.65e-04 | (3804.85 ms | 137795 tok/s) step 6539/76294 | train loss 3.568001 | norm 0.5220 | lr 9.65e-04 | (3801.70 ms | 137909 tok/s) step 6540/76294 | train loss 3.690329 | norm 0.3260 | lr 9.65e-04 | (3835.72 ms | 136686 tok/s) step 6541/76294 | train loss 3.590825 | norm 0.3492 | lr 9.65e-04 | (3803.57 ms | 137841 tok/s) step 6542/76294 | train loss 3.548169 | norm 0.3006 | lr 
9.65e-04 | (3807.10 ms | 137713 tok/s) step 6543/76294 | train loss 3.497357 | norm 0.2713 | lr 9.65e-04 | (3834.45 ms | 136731 tok/s) step 6544/76294 | train loss 3.550110 | norm 0.2370 | lr 9.65e-04 | (3813.26 ms | 137491 tok/s) step 6545/76294 | train loss 3.571904 | norm 0.2416 | lr 9.64e-04 | (3808.92 ms | 137647 tok/s) step 6546/76294 | train loss 3.526040 | norm 0.2247 | lr 9.64e-04 | (3852.85 ms | 136078 tok/s) step 6547/76294 | train loss 3.511813 | norm 0.2090 | lr 9.64e-04 | (3894.20 ms | 134633 tok/s) step 6548/76294 | train loss 3.454726 | norm 0.2292 | lr 9.64e-04 | (3809.55 ms | 137625 tok/s) step 6549/76294 | train loss 3.631722 | norm 0.2160 | lr 9.64e-04 | (3816.90 ms | 137360 tok/s) step 6550/76294 | train loss 3.587308 | norm 0.2452 | lr 9.64e-04 | (3808.51 ms | 137662 tok/s) step 6551/76294 | train loss 3.482908 | norm 0.2196 | lr 9.64e-04 | (3817.28 ms | 137346 tok/s) step 6552/76294 | train loss 3.560765 | norm 0.2307 | lr 9.64e-04 | (3809.35 ms | 137632 tok/s) step 6553/76294 | train loss 3.556686 | norm 0.2121 | lr 9.64e-04 | (3812.33 ms | 137524 tok/s) step 6554/76294 | train loss 3.522202 | norm 0.2010 | lr 9.64e-04 | (3889.44 ms | 134798 tok/s) step 6555/76294 | train loss 3.556565 | norm 0.2003 | lr 9.64e-04 | (3806.13 ms | 137748 tok/s) step 6556/76294 | train loss 3.482796 | norm 0.2151 | lr 9.64e-04 | (3804.02 ms | 137825 tok/s) step 6557/76294 | train loss 3.605128 | norm 0.2455 | lr 9.64e-04 | (3828.59 ms | 136940 tok/s) step 6558/76294 | train loss 3.489930 | norm 0.1896 | lr 9.64e-04 | (3801.93 ms | 137900 tok/s) step 6559/76294 | train loss 3.498205 | norm 0.1876 | lr 9.63e-04 | (3835.33 ms | 136700 tok/s) step 6560/76294 | train loss 3.543169 | norm 0.1900 | lr 9.63e-04 | (3835.42 ms | 136697 tok/s) step 6561/76294 | train loss 3.489697 | norm 0.1925 | lr 9.63e-04 | (3812.04 ms | 137535 tok/s) step 6562/76294 | train loss 3.574889 | norm 0.1979 | lr 9.63e-04 | (3824.80 ms | 137076 tok/s) step 6563/76294 | train loss 3.508997 | 
norm 0.2015 | lr 9.63e-04 | (3804.89 ms | 137793 tok/s) step 6564/76294 | train loss 3.512939 | norm 0.1691 | lr 9.63e-04 | (3807.43 ms | 137701 tok/s) step 6565/76294 | train loss 3.512857 | norm 0.1908 | lr 9.63e-04 | (3810.34 ms | 137596 tok/s) step 6566/76294 | train loss 3.514756 | norm 0.1739 | lr 9.63e-04 | (3798.29 ms | 138033 tok/s) step 6567/76294 | train loss 3.488888 | norm 0.1868 | lr 9.63e-04 | (3846.81 ms | 136292 tok/s) step 6568/76294 | train loss 3.512960 | norm 0.1862 | lr 9.63e-04 | (3835.72 ms | 136686 tok/s) step 6569/76294 | train loss 3.496900 | norm 0.1922 | lr 9.63e-04 | (3853.56 ms | 136053 tok/s) step 6570/76294 | train loss 3.507835 | norm 0.1856 | lr 9.63e-04 | (3806.20 ms | 137746 tok/s) step 6571/76294 | train loss 3.564646 | norm 0.2083 | lr 9.63e-04 | (3824.28 ms | 137094 tok/s) step 6572/76294 | train loss 3.545976 | norm 0.2002 | lr 9.62e-04 | (3835.94 ms | 136678 tok/s) step 6573/76294 | train loss 3.481418 | norm 0.1773 | lr 9.62e-04 | (3800.70 ms | 137945 tok/s) step 6574/76294 | train loss 3.462429 | norm 0.1851 | lr 9.62e-04 | (3840.60 ms | 136512 tok/s) step 6575/76294 | train loss 3.565252 | norm 0.1720 | lr 9.62e-04 | (3803.91 ms | 137829 tok/s) step 6576/76294 | train loss 3.426405 | norm 0.2241 | lr 9.62e-04 | (3829.18 ms | 136919 tok/s) step 6577/76294 | train loss 3.428483 | norm 0.2110 | lr 9.62e-04 | (3825.56 ms | 137049 tok/s) step 6578/76294 | train loss 3.470515 | norm 0.2161 | lr 9.62e-04 | (3842.64 ms | 136439 tok/s) step 6579/76294 | train loss 3.562192 | norm 0.2105 | lr 9.62e-04 | (3804.35 ms | 137813 tok/s) step 6580/76294 | train loss 3.495949 | norm 0.2259 | lr 9.62e-04 | (3832.16 ms | 136813 tok/s) step 6581/76294 | train loss 3.544998 | norm 0.1921 | lr 9.62e-04 | (3806.60 ms | 137731 tok/s) step 6582/76294 | train loss 3.561573 | norm 0.2141 | lr 9.62e-04 | (3806.98 ms | 137718 tok/s) step 6583/76294 | train loss 3.534183 | norm 0.1886 | lr 9.62e-04 | (3825.03 ms | 137068 tok/s) step 6584/76294 | train 
loss 3.526643 | norm 0.1850 | lr 9.62e-04 | (3809.99 ms | 137609 tok/s) step 6585/76294 | train loss 3.557000 | norm 0.1990 | lr 9.62e-04 | (3803.46 ms | 137845 tok/s) step 6586/76294 | train loss 3.501809 | norm 0.1889 | lr 9.61e-04 | (3857.34 ms | 135920 tok/s) step 6587/76294 | train loss 3.555721 | norm 0.2454 | lr 9.61e-04 | (3873.96 ms | 135336 tok/s) step 6588/76294 | train loss 3.521342 | norm 0.2621 | lr 9.61e-04 | (3839.54 ms | 136550 tok/s) step 6589/76294 | train loss 3.519078 | norm 0.2177 | lr 9.61e-04 | (3829.50 ms | 136908 tok/s) step 6590/76294 | train loss 3.544798 | norm 0.2678 | lr 9.61e-04 | (3800.96 ms | 137936 tok/s) step 6591/76294 | train loss 3.571825 | norm 0.2572 | lr 9.61e-04 | (3800.19 ms | 137964 tok/s) step 6592/76294 | train loss 3.472918 | norm 0.2301 | lr 9.61e-04 | (3837.24 ms | 136632 tok/s) step 6593/76294 | train loss 3.494118 | norm 0.2440 | lr 9.61e-04 | (3805.48 ms | 137772 tok/s) step 6594/76294 | train loss 3.533950 | norm 0.2163 | lr 9.61e-04 | (3823.13 ms | 137136 tok/s) step 6595/76294 | train loss 3.471685 | norm 0.2236 | lr 9.61e-04 | (3830.33 ms | 136878 tok/s) step 6596/76294 | train loss 3.548421 | norm 0.2123 | lr 9.61e-04 | (3808.83 ms | 137651 tok/s) step 6597/76294 | train loss 3.540249 | norm 0.3038 | lr 9.61e-04 | (3805.64 ms | 137766 tok/s) step 6598/76294 | train loss 3.490771 | norm 0.2267 | lr 9.61e-04 | (3811.85 ms | 137541 tok/s) step 6599/76294 | train loss 3.518270 | norm 0.2752 | lr 9.60e-04 | (3828.11 ms | 136957 tok/s) step 6600/76294 | train loss 3.514455 | norm 0.2934 | lr 9.60e-04 | (3803.77 ms | 137834 tok/s) step 6601/76294 | train loss 3.543148 | norm 0.2238 | lr 9.60e-04 | (3804.81 ms | 137796 tok/s) step 6602/76294 | train loss 3.556542 | norm 0.2404 | lr 9.60e-04 | (3828.28 ms | 136951 tok/s) step 6603/76294 | train loss 3.541218 | norm 0.2081 | lr 9.60e-04 | (3801.19 ms | 137927 tok/s) step 6604/76294 | train loss 3.467682 | norm 0.2238 | lr 9.60e-04 | (3809.29 ms | 137634 tok/s) step 
6605/76294 | train loss 3.486928 | norm 0.2143 | lr 9.60e-04 | (3921.03 ms | 133712 tok/s) step 6606/76294 | train loss 3.482798 | norm 0.2192 | lr 9.60e-04 | (3812.47 ms | 137519 tok/s) step 6607/76294 | train loss 3.546469 | norm 0.2106 | lr 9.60e-04 | (3805.22 ms | 137781 tok/s) step 6608/76294 | train loss 3.569564 | norm 0.3173 | lr 9.60e-04 | (3894.40 ms | 134626 tok/s) step 6609/76294 | train loss 3.496639 | norm 0.2492 | lr 9.60e-04 | (3806.71 ms | 137727 tok/s) step 6610/76294 | train loss 3.660323 | norm 0.2475 | lr 9.60e-04 | (3809.84 ms | 137614 tok/s) step 6611/76294 | train loss 3.566857 | norm 0.2247 | lr 9.60e-04 | (3830.32 ms | 136878 tok/s) step 6612/76294 | train loss 3.527041 | norm 0.2902 | lr 9.59e-04 | (3807.27 ms | 137707 tok/s) step 6613/76294 | train loss 3.580114 | norm 0.2238 | lr 9.59e-04 | (3810.36 ms | 137595 tok/s) step 6614/76294 | train loss 3.519207 | norm 0.2261 | lr 9.59e-04 | (3818.06 ms | 137318 tok/s) step 6615/76294 | train loss 3.502239 | norm 0.2050 | lr 9.59e-04 | (3825.27 ms | 137059 tok/s) step 6616/76294 | train loss 3.559162 | norm 0.2140 | lr 9.59e-04 | (3808.25 ms | 137672 tok/s) step 6617/76294 | train loss 3.522853 | norm 0.1968 | lr 9.59e-04 | (3803.71 ms | 137836 tok/s) step 6618/76294 | train loss 3.526017 | norm 0.2191 | lr 9.59e-04 | (3834.67 ms | 136723 tok/s) step 6619/76294 | train loss 3.535794 | norm 0.1968 | lr 9.59e-04 | (3802.86 ms | 137867 tok/s) step 6620/76294 | train loss 3.509960 | norm 0.1940 | lr 9.59e-04 | (3841.35 ms | 136485 tok/s) step 6621/76294 | train loss 3.475333 | norm 0.2037 | lr 9.59e-04 | (3805.17 ms | 137783 tok/s) step 6622/76294 | train loss 3.518764 | norm 0.1787 | lr 9.59e-04 | (3804.25 ms | 137816 tok/s) step 6623/76294 | train loss 3.558300 | norm 0.2117 | lr 9.59e-04 | (3834.13 ms | 136742 tok/s) step 6624/76294 | train loss 3.503883 | norm 0.1894 | lr 9.59e-04 | (3805.06 ms | 137787 tok/s) step 6625/76294 | train loss 3.521515 | norm 0.2049 | lr 9.59e-04 | (3825.58 ms | 
137048 tok/s) step 6626/76294 | train loss 3.525758 | norm 0.1929 | lr 9.58e-04 | (3801.19 ms | 137927 tok/s) step 6627/76294 | train loss 3.508914 | norm 0.2073 | lr 9.58e-04 | (3858.49 ms | 135879 tok/s) step 6628/76294 | train loss 3.528416 | norm 0.2214 | lr 9.58e-04 | (3805.78 ms | 137761 tok/s) step 6629/76294 | train loss 3.420253 | norm 0.2010 | lr 9.58e-04 | (3859.80 ms | 135833 tok/s) step 6630/76294 | train loss 3.509432 | norm 0.2564 | lr 9.58e-04 | (3807.76 ms | 137689 tok/s) step 6631/76294 | train loss 3.513895 | norm 0.1941 | lr 9.58e-04 | (3823.57 ms | 137120 tok/s) step 6632/76294 | train loss 3.404065 | norm 0.2415 | lr 9.58e-04 | (3819.56 ms | 137264 tok/s) step 6633/76294 | train loss 3.505003 | norm 0.2290 | lr 9.58e-04 | (3810.60 ms | 137587 tok/s) step 6634/76294 | train loss 3.560623 | norm 0.2417 | lr 9.58e-04 | (3829.03 ms | 136924 tok/s) step 6635/76294 | train loss 3.442117 | norm 0.2033 | lr 9.58e-04 | (3814.15 ms | 137459 tok/s) step 6636/76294 | train loss 3.500374 | norm 0.2436 | lr 9.58e-04 | (3813.37 ms | 137487 tok/s) step 6637/76294 | train loss 3.484747 | norm 0.2397 | lr 9.58e-04 | (3806.73 ms | 137727 tok/s) step 6638/76294 | train loss 3.473550 | norm 0.1980 | lr 9.58e-04 | (3804.09 ms | 137822 tok/s) step 6639/76294 | train loss 3.682903 | norm 0.5516 | lr 9.57e-04 | (3833.85 ms | 136752 tok/s) step 6640/76294 | train loss 3.503649 | norm 0.3817 | lr 9.57e-04 | (3804.81 ms | 137796 tok/s) step 6641/76294 | train loss 3.509368 | norm 0.4568 | lr 9.57e-04 | (3830.36 ms | 136877 tok/s) step 6642/76294 | train loss 3.491624 | norm 0.3488 | lr 9.57e-04 | (3809.09 ms | 137641 tok/s) step 6643/76294 | train loss 3.558125 | norm 0.3010 | lr 9.57e-04 | (3823.57 ms | 137120 tok/s) step 6644/76294 | train loss 3.526901 | norm 0.2927 | lr 9.57e-04 | (3802.93 ms | 137864 tok/s) step 6645/76294 | train loss 3.524776 | norm 0.2498 | lr 9.57e-04 | (3810.82 ms | 137579 tok/s) step 6646/76294 | train loss 3.504231 | norm 0.2710 | lr 9.57e-04 
| (3825.21 ms | 137061 tok/s) step 6647/76294 | train loss 3.496885 | norm 0.2506 | lr 9.57e-04 | (3807.76 ms | 137689 tok/s) step 6648/76294 | train loss 3.599108 | norm 0.2797 | lr 9.57e-04 | (3809.46 ms | 137628 tok/s) step 6649/76294 | train loss 3.513882 | norm 0.2426 | lr 9.57e-04 | (4063.99 ms | 129008 tok/s) step 6650/76294 | train loss 3.521171 | norm 0.2583 | lr 9.57e-04 | (3803.19 ms | 137855 tok/s) step 6651/76294 | train loss 3.524845 | norm 0.2553 | lr 9.57e-04 | (3814.14 ms | 137459 tok/s) step 6652/76294 | train loss 3.512463 | norm 0.2609 | lr 9.56e-04 | (3829.87 ms | 136895 tok/s) step 6653/76294 | train loss 3.516362 | norm 0.2313 | lr 9.56e-04 | (3875.18 ms | 135294 tok/s) step 6654/76294 | train loss 3.574438 | norm 0.2270 | lr 9.56e-04 | (3804.69 ms | 137801 tok/s) step 6655/76294 | train loss 3.450016 | norm 0.2512 | lr 9.56e-04 | (3807.92 ms | 137684 tok/s) step 6656/76294 | train loss 3.518248 | norm 0.2322 | lr 9.56e-04 | (3830.49 ms | 136872 tok/s) step 6657/76294 | train loss 3.644578 | norm 0.2429 | lr 9.56e-04 | (3810.02 ms | 137608 tok/s) step 6658/76294 | train loss 3.471140 | norm 0.2141 | lr 9.56e-04 | (3835.49 ms | 136694 tok/s) step 6659/76294 | train loss 3.534868 | norm 0.2495 | lr 9.56e-04 | (3827.04 ms | 136996 tok/s) step 6660/76294 | train loss 3.508480 | norm 0.2251 | lr 9.56e-04 | (3811.77 ms | 137545 tok/s) step 6661/76294 | train loss 3.428739 | norm 0.2268 | lr 9.56e-04 | (3831.01 ms | 136854 tok/s) step 6662/76294 | train loss 3.494686 | norm 0.2468 | lr 9.56e-04 | (3804.52 ms | 137807 tok/s) step 6663/76294 | train loss 3.498901 | norm 0.2234 | lr 9.56e-04 | (3808.83 ms | 137651 tok/s) step 6664/76294 | train loss 3.416774 | norm 0.2024 | lr 9.56e-04 | (3810.20 ms | 137601 tok/s) step 6665/76294 | train loss 3.494421 | norm 0.2257 | lr 9.56e-04 | (3830.00 ms | 136890 tok/s) step 6666/76294 | train loss 3.491409 | norm 0.2121 | lr 9.55e-04 | (3807.64 ms | 137694 tok/s) step 6667/76294 | train loss 3.504670 | norm 
0.2109 | lr 9.55e-04 | (3826.06 ms | 137031 tok/s) step 6668/76294 | train loss 3.491665 | norm 0.1995 | lr 9.55e-04 | (3805.88 ms | 137758 tok/s) step 6669/76294 | train loss 3.605728 | norm 0.1857 | lr 9.55e-04 | (3836.94 ms | 136642 tok/s) step 6670/76294 | train loss 3.422183 | norm 0.2012 | lr 9.55e-04 | (3925.99 ms | 133543 tok/s) step 6671/76294 | train loss 3.498940 | norm 0.2059 | lr 9.55e-04 | (3805.34 ms | 137777 tok/s) step 6672/76294 | train loss 3.457906 | norm 0.2092 | lr 9.55e-04 | (3885.89 ms | 134921 tok/s) step 6673/76294 | train loss 3.550961 | norm 0.1913 | lr 9.55e-04 | (3809.28 ms | 137635 tok/s) step 6674/76294 | train loss 3.465887 | norm 0.2179 | lr 9.55e-04 | (3813.95 ms | 137466 tok/s) step 6675/76294 | train loss 3.517233 | norm 0.1991 | lr 9.55e-04 | (3806.10 ms | 137749 tok/s) step 6676/76294 | train loss 3.433960 | norm 0.1910 | lr 9.55e-04 | (4092.31 ms | 128115 tok/s) step 6677/76294 | train loss 3.423226 | norm 0.1952 | lr 9.55e-04 | (3832.84 ms | 136789 tok/s) step 6678/76294 | train loss 3.577629 | norm 0.2479 | lr 9.55e-04 | (3812.49 ms | 137518 tok/s) step 6679/76294 | train loss 3.553336 | norm 0.2152 | lr 9.54e-04 | (8973.38 ms | 58427 tok/s) step 6680/76294 | train loss 3.554481 | norm 0.2018 | lr 9.54e-04 | (3869.85 ms | 135480 tok/s) step 6681/76294 | train loss 3.524289 | norm 0.1893 | lr 9.54e-04 | (3820.61 ms | 137226 tok/s) step 6682/76294 | train loss 3.522521 | norm 0.2071 | lr 9.54e-04 | (3801.21 ms | 137926 tok/s) step 6683/76294 | train loss 3.520016 | norm 0.1933 | lr 9.54e-04 | (3798.29 ms | 138033 tok/s) step 6684/76294 | train loss 3.505148 | norm 0.2103 | lr 9.54e-04 | (3800.05 ms | 137969 tok/s) step 6685/76294 | train loss 3.509128 | norm 0.2133 | lr 9.54e-04 | (3794.69 ms | 138164 tok/s) step 6686/76294 | train loss 3.647977 | norm 0.1886 | lr 9.54e-04 | (3873.18 ms | 135364 tok/s) step 6687/76294 | train loss 3.539476 | norm 0.1827 | lr 9.54e-04 | (3798.44 ms | 138027 tok/s) step 6688/76294 | train loss 
3.552510 | norm 0.1944 | lr 9.54e-04 | (3825.56 ms | 137049 tok/s) step 6689/76294 | train loss 3.553231 | norm 0.2060 | lr 9.54e-04 | (3882.86 ms | 135026 tok/s) step 6690/76294 | train loss 3.547040 | norm 0.1861 | lr 9.54e-04 | (3798.09 ms | 138040 tok/s) step 6691/76294 | train loss 3.447299 | norm 0.2613 | lr 9.54e-04 | (3807.39 ms | 137703 tok/s) step 6692/76294 | train loss 3.503894 | norm 0.3211 | lr 9.53e-04 | (3822.83 ms | 137147 tok/s) step 6693/76294 | train loss 3.596408 | norm 0.1964 | lr 9.53e-04 | (3800.04 ms | 137969 tok/s) step 6694/76294 | train loss 3.521933 | norm 0.2753 | lr 9.53e-04 | (3797.54 ms | 138060 tok/s) step 6695/76294 | train loss 3.431987 | norm 0.2446 | lr 9.53e-04 | (3828.38 ms | 136948 tok/s) step 6696/76294 | train loss 3.525154 | norm 0.2497 | lr 9.53e-04 | (3796.65 ms | 138092 tok/s) step 6697/76294 | train loss 3.505580 | norm 0.3346 | lr 9.53e-04 | (5387.03 ms | 97324 tok/s) step 6698/76294 | train loss 3.536574 | norm 0.2301 | lr 9.53e-04 | (6361.62 ms | 82414 tok/s) step 6699/76294 | train loss 3.483095 | norm 0.2359 | lr 9.53e-04 | (3873.97 ms | 135336 tok/s) step 6700/76294 | train loss 3.528617 | norm 0.1999 | lr 9.53e-04 | (3791.02 ms | 138297 tok/s) step 6701/76294 | train loss 3.471432 | norm 0.2341 | lr 9.53e-04 | (3795.69 ms | 138127 tok/s) step 6702/76294 | train loss 3.510927 | norm 0.1956 | lr 9.53e-04 | (3823.93 ms | 137107 tok/s) step 6703/76294 | train loss 3.570557 | norm 0.2061 | lr 9.53e-04 | (3805.57 ms | 137768 tok/s) step 6704/76294 | train loss 3.486178 | norm 0.1930 | lr 9.53e-04 | (3797.45 ms | 138063 tok/s) step 6705/76294 | train loss 3.516052 | norm 0.1908 | lr 9.52e-04 | (3834.72 ms | 136721 tok/s) step 6706/76294 | train loss 3.485266 | norm 0.2012 | lr 9.52e-04 | (3798.32 ms | 138031 tok/s) step 6707/76294 | train loss 3.474149 | norm 0.1879 | lr 9.52e-04 | (3805.11 ms | 137785 tok/s) step 6708/76294 | train loss 3.514137 | norm 0.2138 | lr 9.52e-04 | (3818.57 ms | 137300 tok/s) step 
6709/76294 | train loss 3.583405 | norm 0.1937 | lr 9.52e-04 | (3882.53 ms | 135038 tok/s) step 6710/76294 | train loss 3.528478 | norm 0.2023 | lr 9.52e-04 | (3797.46 ms | 138063 tok/s) step 6711/76294 | train loss 3.463372 | norm 0.1978 | lr 9.52e-04 | (3817.64 ms | 137333 tok/s) step 6712/76294 | train loss 3.491731 | norm 0.1918 | lr 9.52e-04 | (3799.23 ms | 137998 tok/s) step 6713/76294 | train loss 3.517500 | norm 0.1909 | lr 9.52e-04 | (3826.57 ms | 137012 tok/s) step 6714/76294 | train loss 3.537125 | norm 0.2035 | lr 9.52e-04 | (3797.47 ms | 138062 tok/s) step 6715/76294 | train loss 3.521593 | norm 0.2158 | lr 9.52e-04 | (3821.52 ms | 137193 tok/s) step 6716/76294 | train loss 3.452196 | norm 0.2085 | lr 9.52e-04 | (3799.58 ms | 137986 tok/s) step 6717/76294 | train loss 3.521581 | norm 0.2054 | lr 9.52e-04 | (3799.32 ms | 137995 tok/s) step 6718/76294 | train loss 3.484864 | norm 0.2065 | lr 9.52e-04 | (3852.32 ms | 136097 tok/s) step 6719/76294 | train loss 3.528231 | norm 0.1906 | lr 9.51e-04 | (3798.69 ms | 138018 tok/s) step 6720/76294 | train loss 3.521352 | norm 0.1919 | lr 9.51e-04 | (3798.07 ms | 138041 tok/s) step 6721/76294 | train loss 3.580350 | norm 0.2004 | lr 9.51e-04 | (3821.01 ms | 137212 tok/s) step 6722/76294 | train loss 3.546165 | norm 0.1839 | lr 9.51e-04 | (3799.67 ms | 137983 tok/s) step 6723/76294 | train loss 3.615629 | norm 0.1918 | lr 9.51e-04 | (3818.43 ms | 137304 tok/s) step 6724/76294 | train loss 3.520573 | norm 0.2652 | lr 9.51e-04 | (3823.81 ms | 137111 tok/s) step 6725/76294 | train loss 3.465961 | norm 0.2134 | lr 9.51e-04 | (3801.03 ms | 137933 tok/s) step 6726/76294 | train loss 3.516427 | norm 0.1981 | lr 9.51e-04 | (3799.01 ms | 138007 tok/s) step 6727/76294 | train loss 3.642171 | norm 0.2893 | lr 9.51e-04 | (3846.86 ms | 136290 tok/s) step 6728/76294 | train loss 3.604240 | norm 0.2993 | lr 9.51e-04 | (3797.69 ms | 138055 tok/s) step 6729/76294 | train loss 3.527155 | norm 0.1881 | lr 9.51e-04 | (3813.47 ms | 
137483 tok/s) step 6730/76294 | train loss 3.530596 | norm 0.2387 | lr 9.51e-04 | (3830.31 ms | 136879 tok/s) step 6731/76294 | train loss 3.470894 | norm 0.2168 | lr 9.51e-04 | (3805.56 ms | 137769 tok/s) step 6732/76294 | train loss 3.504021 | norm 0.1896 | lr 9.50e-04 | (3873.66 ms | 135347 tok/s) step 6733/76294 | train loss 3.471057 | norm 0.2354 | lr 9.50e-04 | (3800.83 ms | 137940 tok/s) step 6734/76294 | train loss 3.468774 | norm 0.2160 | lr 9.50e-04 | (3818.67 ms | 137296 tok/s) step 6735/76294 | train loss 3.543471 | norm 0.2035 | lr 9.50e-04 | (3818.97 ms | 137285 tok/s) step 6736/76294 | train loss 3.464970 | norm 0.1930 | lr 9.50e-04 | (3815.63 ms | 137405 tok/s) step 6737/76294 | train loss 3.580869 | norm 0.2220 | lr 9.50e-04 | (3805.16 ms | 137783 tok/s) step 6738/76294 | train loss 3.499996 | norm 0.1840 | lr 9.50e-04 | (3823.23 ms | 137132 tok/s) step 6739/76294 | train loss 3.523595 | norm 0.2186 | lr 9.50e-04 | (3804.59 ms | 137804 tok/s) step 6740/76294 | train loss 3.513877 | norm 0.1759 | lr 9.50e-04 | (3798.78 ms | 138015 tok/s) step 6741/76294 | train loss 3.545807 | norm 0.2241 | lr 9.50e-04 | (3892.38 ms | 134696 tok/s) step 6742/76294 | train loss 3.478660 | norm 0.1761 | lr 9.50e-04 | (3861.03 ms | 135790 tok/s) step 6743/76294 | train loss 3.556024 | norm 0.1969 | lr 9.50e-04 | (3799.70 ms | 137981 tok/s) step 6744/76294 | train loss 3.564794 | norm 0.1891 | lr 9.50e-04 | (3841.62 ms | 136476 tok/s) step 6745/76294 | train loss 3.500329 | norm 0.2289 | lr 9.49e-04 | (3808.55 ms | 137661 tok/s) step 6746/76294 | train loss 3.509538 | norm 0.2018 | lr 9.49e-04 | (3835.86 ms | 136681 tok/s) step 6747/76294 | train loss 3.455461 | norm 0.2564 | lr 9.49e-04 | (3803.79 ms | 137833 tok/s) step 6748/76294 | train loss 3.547516 | norm 0.2739 | lr 9.49e-04 | (3805.21 ms | 137782 tok/s) step 6749/76294 | train loss 3.514359 | norm 0.2239 | lr 9.49e-04 | (3821.18 ms | 137206 tok/s) step 6750/76294 | train loss 3.562947 | norm 0.2461 | lr 9.49e-04 
| (3841.93 ms | 136465 tok/s) val loss: 3.502707 saving model checkpoint to ./results/gpt2-124M-gqa/step_6750.pth step 6751/76294 | train loss 3.467746 | norm 0.2251 | lr 9.49e-04 | (3890.77 ms | 134752 tok/s) step 6752/76294 | train loss 3.527720 | norm 0.2368 | lr 9.49e-04 | (3822.25 ms | 137167 tok/s) step 6753/76294 | train loss 3.496808 | norm 0.2428 | lr 9.49e-04 | (3808.31 ms | 137670 tok/s) step 6754/76294 | train loss 3.463120 | norm 0.2137 | lr 9.49e-04 | (3824.99 ms | 137069 tok/s) step 6755/76294 | train loss 3.522147 | norm 0.1978 | lr 9.49e-04 | (3826.15 ms | 137028 tok/s) step 6756/76294 | train loss 3.510769 | norm 0.2363 | lr 9.49e-04 | (3801.64 ms | 137911 tok/s) step 6757/76294 | train loss 3.587206 | norm 0.2074 | lr 9.49e-04 | (3836.35 ms | 136663 tok/s) step 6758/76294 | train loss 3.479683 | norm 0.2004 | lr 9.48e-04 | (3802.06 ms | 137896 tok/s) step 6759/76294 | train loss 3.547706 | norm 0.1950 | lr 9.48e-04 | (3830.28 ms | 136880 tok/s) step 6760/76294 | train loss 3.489693 | norm 0.1797 | lr 9.48e-04 | (3806.77 ms | 137725 tok/s) step 6761/76294 | train loss 3.562203 | norm 0.1876 | lr 9.48e-04 | (3804.58 ms | 137804 tok/s) step 6762/76294 | train loss 3.535821 | norm 0.2379 | lr 9.48e-04 | (3823.68 ms | 137116 tok/s) step 6763/76294 | train loss 3.486362 | norm 0.2556 | lr 9.48e-04 | (3804.00 ms | 137825 tok/s) step 6764/76294 | train loss 3.519722 | norm 0.2123 | lr 9.48e-04 | (3822.74 ms | 137150 tok/s) step 6765/76294 | train loss 3.511153 | norm 0.2505 | lr 9.48e-04 | (3805.64 ms | 137766 tok/s) step 6766/76294 | train loss 3.592987 | norm 0.2146 | lr 9.48e-04 | (3805.09 ms | 137786 tok/s) step 6767/76294 | train loss 3.541756 | norm 0.3005 | lr 9.48e-04 | (3833.49 ms | 136765 tok/s) step 6768/76294 | train loss 3.480659 | norm 0.2982 | lr 9.48e-04 | (3802.69 ms | 137873 tok/s) step 6769/76294 | train loss 3.544137 | norm 0.2293 | lr 9.48e-04 | (3814.39 ms | 137450 tok/s) step 6770/76294 | train loss 3.489371 | norm 0.2871 | lr 
9.48e-04 | (3926.57 ms | 133523 tok/s) step 6771/76294 | train loss 3.585746 | norm 0.2940 | lr 9.47e-04 | (3806.84 ms | 137723 tok/s) step 6772/76294 | train loss 3.490510 | norm 0.2396 | lr 9.47e-04 | (3807.22 ms | 137709 tok/s) step 6773/76294 | train loss 3.524316 | norm 0.2775 | lr 9.47e-04 | (3828.09 ms | 136958 tok/s) step 6774/76294 | train loss 3.497467 | norm 0.2225 | lr 9.47e-04 | (3803.38 ms | 137848 tok/s) step 6775/76294 | train loss 3.510282 | norm 0.2240 | lr 9.47e-04 | (3827.53 ms | 136978 tok/s) step 6776/76294 | train loss 3.508564 | norm 0.2274 | lr 9.47e-04 | (3845.54 ms | 136337 tok/s) step 6777/76294 | train loss 3.501681 | norm 0.2038 | lr 9.47e-04 | (3807.46 ms | 137700 tok/s) step 6778/76294 | train loss 3.546961 | norm 0.2610 | lr 9.47e-04 | (3809.88 ms | 137613 tok/s) step 6779/76294 | train loss 3.475588 | norm 0.1744 | lr 9.47e-04 | (3822.02 ms | 137176 tok/s) step 6780/76294 | train loss 3.665487 | norm 0.2381 | lr 9.47e-04 | (3807.31 ms | 137706 tok/s) step 6781/76294 | train loss 3.511420 | norm 0.1976 | lr 9.47e-04 | (3807.43 ms | 137701 tok/s) step 6782/76294 | train loss 3.504493 | norm 0.2187 | lr 9.47e-04 | (3828.93 ms | 136928 tok/s) step 6783/76294 | train loss 3.512291 | norm 0.2129 | lr 9.47e-04 | (3806.52 ms | 137734 tok/s) step 6784/76294 | train loss 3.480534 | norm 0.1986 | lr 9.46e-04 | (3831.47 ms | 136837 tok/s) step 6785/76294 | train loss 3.495681 | norm 0.2520 | lr 9.46e-04 | (3873.14 ms | 135365 tok/s) step 6786/76294 | train loss 3.517943 | norm 0.1967 | lr 9.46e-04 | (3832.24 ms | 136810 tok/s) step 6787/76294 | train loss 3.544882 | norm 0.2015 | lr 9.46e-04 | (3845.04 ms | 136354 tok/s) step 6788/76294 | train loss 3.492296 | norm 0.1817 | lr 9.46e-04 | (3823.21 ms | 137133 tok/s) step 6789/76294 | train loss 3.510437 | norm 0.1874 | lr 9.46e-04 | (3828.50 ms | 136943 tok/s) step 6790/76294 | train loss 3.556628 | norm 0.1778 | lr 9.46e-04 | (3883.92 ms | 134989 tok/s) step 6791/76294 | train loss 3.515292 | 
norm 0.1912 | lr 9.46e-04 | (3801.37 ms | 137921 tok/s) step 6792/76294 | train loss 3.521425 | norm 0.1785 | lr 9.46e-04 | (3839.29 ms | 136559 tok/s) step 6793/76294 | train loss 3.525342 | norm 0.1800 | lr 9.46e-04 | (3807.34 ms | 137705 tok/s) step 6794/76294 | train loss 3.474341 | norm 0.1829 | lr 9.46e-04 | (3808.90 ms | 137648 tok/s) step 6795/76294 | train loss 3.578351 | norm 0.1746 | lr 9.46e-04 | (3828.87 ms | 136930 tok/s) step 6796/76294 | train loss 3.534255 | norm 0.1784 | lr 9.46e-04 | (3805.90 ms | 137757 tok/s) step 6797/76294 | train loss 3.564598 | norm 0.1856 | lr 9.45e-04 | (3832.91 ms | 136786 tok/s) step 6798/76294 | train loss 3.487776 | norm 0.1955 | lr 9.45e-04 | (3813.02 ms | 137500 tok/s) step 6799/76294 | train loss 3.554990 | norm 0.1823 | lr 9.45e-04 | (3833.26 ms | 136773 tok/s) step 6800/76294 | train loss 3.480376 | norm 0.1878 | lr 9.45e-04 | (3808.24 ms | 137672 tok/s) step 6801/76294 | train loss 3.448890 | norm 0.1962 | lr 9.45e-04 | (3831.87 ms | 136823 tok/s) step 6802/76294 | train loss 3.500142 | norm 0.1949 | lr 9.45e-04 | (3806.57 ms | 137732 tok/s) step 6803/76294 | train loss 3.495468 | norm 0.1937 | lr 9.45e-04 | (3807.05 ms | 137715 tok/s) step 6804/76294 | train loss 3.486387 | norm 0.2227 | lr 9.45e-04 | (3835.02 ms | 136711 tok/s) step 6805/76294 | train loss 3.595002 | norm 0.2662 | lr 9.45e-04 | (3805.00 ms | 137789 tok/s) step 6806/76294 | train loss 3.493434 | norm 0.2271 | lr 9.45e-04 | (3834.17 ms | 136741 tok/s) step 6807/76294 | train loss 3.551250 | norm 0.2146 | lr 9.45e-04 | (3869.13 ms | 135505 tok/s) step 6808/76294 | train loss 3.542165 | norm 0.2650 | lr 9.45e-04 | (3807.99 ms | 137681 tok/s) step 6809/76294 | train loss 3.531580 | norm 0.2069 | lr 9.45e-04 | (3827.54 ms | 136978 tok/s) step 6810/76294 | train loss 3.500207 | norm 0.2268 | lr 9.44e-04 | (3889.28 ms | 134803 tok/s) step 6811/76294 | train loss 3.548384 | norm 0.1999 | lr 9.44e-04 | (3866.75 ms | 135589 tok/s) step 6812/76294 | train 
loss 3.573598 | norm 0.1893 | lr 9.44e-04 | (3801.34 ms | 137922 tok/s) step 6813/76294 | train loss 3.632677 | norm 0.2731 | lr 9.44e-04 | (3808.38 ms | 137667 tok/s) step 6814/76294 | train loss 3.519851 | norm 0.2710 | lr 9.44e-04 | (3835.41 ms | 136697 tok/s) step 6815/76294 | train loss 3.561151 | norm 0.2463 | lr 9.44e-04 | (3815.04 ms | 137427 tok/s) step 6816/76294 | train loss 3.470976 | norm 0.2289 | lr 9.44e-04 | (4343.78 ms | 120699 tok/s) step 6817/76294 | train loss 3.533191 | norm 0.1966 | lr 9.44e-04 | (3804.91 ms | 137793 tok/s) step 6818/76294 | train loss 3.496338 | norm 0.2253 | lr 9.44e-04 | (3804.16 ms | 137820 tok/s) step 6819/76294 | train loss 3.459619 | norm 0.1927 | lr 9.44e-04 | (3829.44 ms | 136910 tok/s) step 6820/76294 | train loss 3.518656 | norm 0.2022 | lr 9.44e-04 | (3828.88 ms | 136930 tok/s) step 6821/76294 | train loss 3.613404 | norm 0.2384 | lr 9.44e-04 | (3804.78 ms | 137797 tok/s) step 6822/76294 | train loss 3.729689 | norm 0.2349 | lr 9.44e-04 | (3942.98 ms | 132968 tok/s) step 6823/76294 | train loss 3.622367 | norm 0.2349 | lr 9.44e-04 | (3803.14 ms | 137857 tok/s) step 6824/76294 | train loss 3.508362 | norm 0.2332 | lr 9.43e-04 | (3825.51 ms | 137051 tok/s) step 6825/76294 | train loss 3.494587 | norm 0.2440 | lr 9.43e-04 | (3800.46 ms | 137954 tok/s) step 6826/76294 | train loss 3.507767 | norm 0.2329 | lr 9.43e-04 | (3802.73 ms | 137872 tok/s) step 6827/76294 | train loss 3.498088 | norm 0.2312 | lr 9.43e-04 | (3826.47 ms | 137016 tok/s) step 6828/76294 | train loss 3.550898 | norm 0.2004 | lr 9.43e-04 | (3801.79 ms | 137905 tok/s) step 6829/76294 | train loss 3.566922 | norm 0.2224 | lr 9.43e-04 | (3804.97 ms | 137790 tok/s) step 6830/76294 | train loss 3.481891 | norm 0.2139 | lr 9.43e-04 | (3839.07 ms | 136566 tok/s) step 6831/76294 | train loss 3.458447 | norm 0.3626 | lr 9.43e-04 | (3876.39 ms | 135252 tok/s) step 6832/76294 | train loss 3.536423 | norm 0.2981 | lr 9.43e-04 | (3800.53 ms | 137951 tok/s) step 
6833/76294 | train loss 3.520417 | norm 0.2172 | lr 9.43e-04 | (3825.82 ms | 137039 tok/s) step 6834/76294 | train loss 3.507257 | norm 0.2981 | lr 9.43e-04 | (3801.93 ms | 137900 tok/s) step 6835/76294 | train loss 3.448694 | norm 0.2587 | lr 9.43e-04 | (3809.87 ms | 137613 tok/s) step 6836/76294 | train loss 3.486519 | norm 0.2152 | lr 9.43e-04 | (3828.03 ms | 136960 tok/s) step 6837/76294 | train loss 3.523804 | norm 0.2394 | lr 9.42e-04 | (3894.78 ms | 134613 tok/s) step 6838/76294 | train loss 3.473493 | norm 0.2307 | lr 9.42e-04 | (3802.87 ms | 137866 tok/s) step 6839/76294 | train loss 3.527231 | norm 0.2676 | lr 9.42e-04 | (3838.75 ms | 136578 tok/s) step 6840/76294 | train loss 3.442284 | norm 0.2569 | lr 9.42e-04 | (3804.90 ms | 137793 tok/s) step 6841/76294 | train loss 3.514204 | norm 0.2298 | lr 9.42e-04 | (3810.20 ms | 137601 tok/s) step 6842/76294 | train loss 3.457860 | norm 0.2142 | lr 9.42e-04 | (3827.20 ms | 136990 tok/s) step 6843/76294 | train loss 3.562764 | norm 0.2187 | lr 9.42e-04 | (3808.22 ms | 137673 tok/s) step 6844/76294 | train loss 3.522777 | norm 0.2301 | lr 9.42e-04 | (3801.76 ms | 137907 tok/s) step 6845/76294 | train loss 3.405216 | norm 0.2230 | lr 9.42e-04 | (3842.79 ms | 136434 tok/s) step 6846/76294 | train loss 3.511919 | norm 0.2343 | lr 9.42e-04 | (3805.76 ms | 137762 tok/s) step 6847/76294 | train loss 3.550843 | norm 0.1972 | lr 9.42e-04 | (3810.17 ms | 137602 tok/s) step 6848/76294 | train loss 3.531438 | norm 0.2473 | lr 9.42e-04 | (3853.76 ms | 136046 tok/s) step 6849/76294 | train loss 3.429896 | norm 0.1984 | lr 9.42e-04 | (3807.62 ms | 137694 tok/s) step 6850/76294 | train loss 3.554074 | norm 0.3206 | lr 9.41e-04 | (3805.65 ms | 137766 tok/s) step 6851/76294 | train loss 3.529893 | norm 0.2096 | lr 9.41e-04 | (3839.14 ms | 136564 tok/s) step 6852/76294 | train loss 3.484700 | norm 0.2160 | lr 9.41e-04 | (3851.20 ms | 136136 tok/s) step 6853/76294 | train loss 3.545656 | norm 0.2374 | lr 9.41e-04 | (3841.33 ms | 
136486 tok/s) step 6854/76294 | train loss 3.513192 | norm 0.1769 | lr 9.41e-04 | (3838.62 ms | 136582 tok/s) step 6855/76294 | train loss 3.514778 | norm 0.2051 | lr 9.41e-04 | (3838.87 ms | 136574 tok/s) step 6856/76294 | train loss 3.494117 | norm 0.2571 | lr 9.41e-04 | (3806.83 ms | 137723 tok/s) step 6857/76294 | train loss 3.506303 | norm 0.2515 | lr 9.41e-04 | (3838.34 ms | 136592 tok/s) step 6858/76294 | train loss 3.554871 | norm 0.1956 | lr 9.41e-04 | (3806.23 ms | 137745 tok/s) step 6859/76294 | train loss 3.506080 | norm 0.2392 | lr 9.41e-04 | (3818.79 ms | 137291 tok/s) step 6860/76294 | train loss 3.564187 | norm 0.1950 | lr 9.41e-04 | (3806.24 ms | 137744 tok/s) step 6861/76294 | train loss 3.558614 | norm 0.1956 | lr 9.41e-04 | (3811.61 ms | 137550 tok/s) step 6862/76294 | train loss 3.563123 | norm 0.2121 | lr 9.41e-04 | (3832.73 ms | 136792 tok/s) step 6863/76294 | train loss 3.501146 | norm 0.2371 | lr 9.40e-04 | (3834.59 ms | 136726 tok/s) step 6864/76294 | train loss 3.522562 | norm 0.2163 | lr 9.40e-04 | (3804.77 ms | 137798 tok/s) step 6865/76294 | train loss 3.518299 | norm 0.2161 | lr 9.40e-04 | (3838.54 ms | 136585 tok/s) step 6866/76294 | train loss 3.451212 | norm 0.1929 | lr 9.40e-04 | (3803.19 ms | 137855 tok/s) step 6867/76294 | train loss 3.535230 | norm 0.2167 | lr 9.40e-04 | (4554.27 ms | 115120 tok/s) step 6868/76294 | train loss 3.505037 | norm 0.1908 | lr 9.40e-04 | (3805.72 ms | 137763 tok/s) step 6869/76294 | train loss 3.500813 | norm 0.1987 | lr 9.40e-04 | (3807.45 ms | 137701 tok/s) step 6870/76294 | train loss 3.437357 | norm 0.1848 | lr 9.40e-04 | (3831.72 ms | 136828 tok/s) step 6871/76294 | train loss 3.532031 | norm 0.1836 | lr 9.40e-04 | (3816.64 ms | 137369 tok/s) step 6872/76294 | train loss 3.469494 | norm 0.1962 | lr 9.40e-04 | (3956.58 ms | 132510 tok/s) step 6873/76294 | train loss 3.549145 | norm 0.2131 | lr 9.40e-04 | (3807.37 ms | 137703 tok/s) step 6874/76294 | train loss 3.467431 | norm 0.2155 | lr 9.40e-04 
| (3804.10 ms | 137822 tok/s) step 6875/76294 | train loss 3.487076 | norm 0.2009 | lr 9.40e-04 | (3838.67 ms | 136581 tok/s) step 6876/76294 | train loss 3.558561 | norm 0.2090 | lr 9.39e-04 | (3803.99 ms | 137826 tok/s) step 6877/76294 | train loss 3.454031 | norm 0.2648 | lr 9.39e-04 | (3822.06 ms | 137174 tok/s) step 6878/76294 | train loss 3.584960 | norm 0.1924 | lr 9.39e-04 | (3803.41 ms | 137847 tok/s) step 6879/76294 | train loss 3.482126 | norm 0.2462 | lr 9.39e-04 | (3828.38 ms | 136948 tok/s) step 6880/76294 | train loss 3.511697 | norm 0.2012 | lr 9.39e-04 | (3806.61 ms | 137731 tok/s) step 6881/76294 | train loss 3.478403 | norm 0.2113 | lr 9.39e-04 | (3835.32 ms | 136700 tok/s) step 6882/76294 | train loss 3.504865 | norm 0.1982 | lr 9.39e-04 | (3825.78 ms | 137041 tok/s) step 6883/76294 | train loss 3.528421 | norm 0.1832 | lr 9.39e-04 | (3832.57 ms | 136798 tok/s) step 6884/76294 | train loss 3.477719 | norm 0.1948 | lr 9.39e-04 | (3803.07 ms | 137859 tok/s) step 6885/76294 | train loss 3.474831 | norm 0.1720 | lr 9.39e-04 | (3848.92 ms | 136217 tok/s) step 6886/76294 | train loss 3.480455 | norm 0.2110 | lr 9.39e-04 | (3803.64 ms | 137838 tok/s) step 6887/76294 | train loss 3.507263 | norm 0.1886 | lr 9.39e-04 | (3812.88 ms | 137504 tok/s) step 6888/76294 | train loss 3.477113 | norm 0.1990 | lr 9.39e-04 | (3850.12 ms | 136175 tok/s) step 6889/76294 | train loss 3.516772 | norm 0.2275 | lr 9.38e-04 | (3808.56 ms | 137661 tok/s) step 6890/76294 | train loss 3.490089 | norm 0.2354 | lr 9.38e-04 | (3824.53 ms | 137086 tok/s) step 6891/76294 | train loss 3.547342 | norm 0.1908 | lr 9.38e-04 | (3832.67 ms | 136794 tok/s) step 6892/76294 | train loss 3.515583 | norm 0.1911 | lr 9.38e-04 | (3848.33 ms | 136238 tok/s) step 6893/76294 | train loss 3.499114 | norm 0.2148 | lr 9.38e-04 | (3918.08 ms | 133812 tok/s) step 6894/76294 | train loss 3.544119 | norm 0.1749 | lr 9.38e-04 | (3807.23 ms | 137709 tok/s) step 6895/76294 | train loss 3.430495 | norm 
0.2215 | lr 9.38e-04 | (3834.66 ms | 136723 tok/s) step 6896/76294 | train loss 3.546534 | norm 0.1877 | lr 9.38e-04 | (3802.92 ms | 137865 tok/s) step 6897/76294 | train loss 3.465400 | norm 0.2023 | lr 9.38e-04 | (3837.41 ms | 136625 tok/s) step 6898/76294 | train loss 3.464524 | norm 0.2566 | lr 9.38e-04 | (3805.78 ms | 137761 tok/s) step 6899/76294 | train loss 3.460024 | norm 0.2659 | lr 9.38e-04 | (3826.29 ms | 137023 tok/s) step 6900/76294 | train loss 3.559834 | norm 0.2257 | lr 9.38e-04 | (3804.52 ms | 137806 tok/s) step 6901/76294 | train loss 3.429386 | norm 0.1962 | lr 9.38e-04 | (3830.32 ms | 136878 tok/s) step 6902/76294 | train loss 3.515459 | norm 0.2149 | lr 9.37e-04 | (3801.09 ms | 137931 tok/s) step 6903/76294 | train loss 3.476714 | norm 0.1865 | lr 9.37e-04 | (3811.01 ms | 137572 tok/s) step 6904/76294 | train loss 3.453595 | norm 0.2596 | lr 9.37e-04 | (3831.40 ms | 136840 tok/s) step 6905/76294 | train loss 3.542613 | norm 0.2894 | lr 9.37e-04 | (3801.66 ms | 137910 tok/s) step 6906/76294 | train loss 3.521885 | norm 0.1909 | lr 9.37e-04 | (3823.78 ms | 137113 tok/s) step 6907/76294 | train loss 3.518331 | norm 0.3444 | lr 9.37e-04 | (3805.31 ms | 137778 tok/s) step 6908/76294 | train loss 3.482649 | norm 0.2779 | lr 9.37e-04 | (3801.87 ms | 137903 tok/s) step 6909/76294 | train loss 3.561941 | norm 0.2153 | lr 9.37e-04 | (4193.08 ms | 125036 tok/s) step 6910/76294 | train loss 3.554345 | norm 0.2824 | lr 9.37e-04 | (3815.98 ms | 137393 tok/s) step 6911/76294 | train loss 3.557554 | norm 0.2175 | lr 9.37e-04 | (3827.14 ms | 136992 tok/s) step 6912/76294 | train loss 3.474317 | norm 0.2443 | lr 9.37e-04 | (3809.86 ms | 137613 tok/s) step 6913/76294 | train loss 3.523205 | norm 0.2152 | lr 9.37e-04 | (3896.98 ms | 134537 tok/s) step 6914/76294 | train loss 3.539927 | norm 0.2106 | lr 9.36e-04 | (3803.77 ms | 137834 tok/s) step 6915/76294 | train loss 3.462276 | norm 0.2150 | lr 9.36e-04 | (3800.73 ms | 137944 tok/s) step 6916/76294 | train loss 
3.528661 | norm 0.2094 | lr 9.36e-04 | (3831.86 ms | 136823 tok/s) step 6917/76294 | train loss 3.494788 | norm 0.2105 | lr 9.36e-04 | (3807.20 ms | 137709 tok/s) step 6918/76294 | train loss 3.547767 | norm 0.1956 | lr 9.36e-04 | (3820.73 ms | 137222 tok/s) step 6919/76294 | train loss 3.389781 | norm 0.2266 | lr 9.36e-04 | (3803.23 ms | 137853 tok/s) step 6920/76294 | train loss 3.550488 | norm 0.2321 | lr 9.36e-04 | (3827.86 ms | 136966 tok/s) step 6921/76294 | train loss 3.558243 | norm 0.2259 | lr 9.36e-04 | (3800.15 ms | 137965 tok/s) step 6922/76294 | train loss 3.463801 | norm 0.1955 | lr 9.36e-04 | (3817.70 ms | 137331 tok/s) step 6923/76294 | train loss 3.550357 | norm 0.1968 | lr 9.36e-04 | (3805.07 ms | 137787 tok/s) step 6924/76294 | train loss 3.503555 | norm 0.1788 | lr 9.36e-04 | (3799.62 ms | 137984 tok/s) step 6925/76294 | train loss 3.514184 | norm 0.1811 | lr 9.36e-04 | (3874.25 ms | 135326 tok/s) step 6926/76294 | train loss 3.562449 | norm 0.2563 | lr 9.36e-04 | (3803.59 ms | 137840 tok/s) step 6927/76294 | train loss 3.493860 | norm 0.2400 | lr 9.35e-04 | (3829.41 ms | 136911 tok/s) step 6928/76294 | train loss 3.531790 | norm 0.3871 | lr 9.35e-04 | (3799.23 ms | 137999 tok/s) step 6929/76294 | train loss 3.571976 | norm 0.4565 | lr 9.35e-04 | (3831.40 ms | 136840 tok/s) step 6930/76294 | train loss 3.467804 | norm 0.3289 | lr 9.35e-04 | (3800.29 ms | 137960 tok/s) step 6931/76294 | train loss 3.579448 | norm 0.2442 | lr 9.35e-04 | (3823.91 ms | 137108 tok/s) step 6932/76294 | train loss 3.518465 | norm 0.3216 | lr 9.35e-04 | (3803.50 ms | 137844 tok/s) step 6933/76294 | train loss 3.431204 | norm 0.2873 | lr 9.35e-04 | (3825.31 ms | 137058 tok/s) step 6934/76294 | train loss 3.535033 | norm 0.2607 | lr 9.35e-04 | (3800.16 ms | 137965 tok/s) step 6935/76294 | train loss 3.439416 | norm 0.1998 | lr 9.35e-04 | (3831.97 ms | 136819 tok/s) step 6936/76294 | train loss 3.536416 | norm 0.2403 | lr 9.35e-04 | (3870.50 ms | 135457 tok/s) step 
6937/76294 | train loss 3.499295 | norm 0.2039 | lr 9.35e-04 | (3830.55 ms | 136870 tok/s) step 6938/76294 | train loss 3.480690 | norm 0.2140 | lr 9.35e-04 | (3802.76 ms | 137870 tok/s) step 6939/76294 | train loss 3.575351 | norm 0.2002 | lr 9.35e-04 | (3800.41 ms | 137956 tok/s) step 6940/76294 | train loss 3.431462 | norm 0.2237 | lr 9.34e-04 | (3840.34 ms | 136521 tok/s) step 6941/76294 | train loss 3.556425 | norm 0.2076 | lr 9.34e-04 | (3799.39 ms | 137993 tok/s) step 6942/76294 | train loss 3.460519 | norm 0.1983 | lr 9.34e-04 | (3823.02 ms | 137140 tok/s) step 6943/76294 | train loss 3.492211 | norm 0.1780 | lr 9.34e-04 | (3822.57 ms | 137156 tok/s) step 6944/76294 | train loss 3.427144 | norm 0.2017 | lr 9.34e-04 | (3856.59 ms | 135946 tok/s) step 6945/76294 | train loss 3.503386 | norm 0.1797 | lr 9.34e-04 | (3808.98 ms | 137645 tok/s) step 6946/76294 | train loss 3.499698 | norm 0.2227 | lr 9.34e-04 | (3833.78 ms | 136755 tok/s) step 6947/76294 | train loss 3.494444 | norm 0.2334 | lr 9.34e-04 | (3804.97 ms | 137790 tok/s) step 6948/76294 | train loss 3.509757 | norm 0.2006 | lr 9.34e-04 | (3846.46 ms | 136304 tok/s) step 6949/76294 | train loss 3.507216 | norm 0.1987 | lr 9.34e-04 | (3803.48 ms | 137844 tok/s) step 6950/76294 | train loss 3.485659 | norm 0.1854 | lr 9.34e-04 | (3826.12 ms | 137029 tok/s) step 6951/76294 | train loss 3.465906 | norm 0.1908 | lr 9.34e-04 | (3803.69 ms | 137837 tok/s) step 6952/76294 | train loss 3.483543 | norm 0.1970 | lr 9.34e-04 | (3825.27 ms | 137059 tok/s) step 6953/76294 | train loss 3.540235 | norm 0.1931 | lr 9.33e-04 | (3799.45 ms | 137991 tok/s) step 6954/76294 | train loss 3.478463 | norm 0.1985 | lr 9.33e-04 | (4687.64 ms | 111845 tok/s) step 6955/76294 | train loss 3.538809 | norm 0.1902 | lr 9.33e-04 | (3832.96 ms | 136784 tok/s) step 6956/76294 | train loss 3.506069 | norm 0.1895 | lr 9.33e-04 | (3808.98 ms | 137645 tok/s) step 6957/76294 | train loss 3.492276 | norm 0.1712 | lr 9.33e-04 | (3868.19 ms | 
135538 tok/s) step 6958/76294 | train loss 3.461415 | norm 0.1893 | lr 9.33e-04 | (3798.31 ms | 138032 tok/s) step 6959/76294 | train loss 3.567482 | norm 0.2039 | lr 9.33e-04 | (3824.97 ms | 137070 tok/s) step 6960/76294 | train loss 3.435303 | norm 0.1859 | lr 9.33e-04 | (3798.30 ms | 138032 tok/s) step 6961/76294 | train loss 3.446424 | norm 0.2013 | lr 9.33e-04 | (3803.07 ms | 137859 tok/s) step 6962/76294 | train loss 3.528007 | norm 0.2481 | lr 9.33e-04 | (3822.55 ms | 137157 tok/s) step 6963/76294 | train loss 3.457914 | norm 0.2842 | lr 9.33e-04 | (3799.50 ms | 137989 tok/s) step 6964/76294 | train loss 3.528944 | norm 0.2194 | lr 9.33e-04 | (3832.10 ms | 136815 tok/s) step 6965/76294 | train loss 3.403509 | norm 0.2697 | lr 9.33e-04 | (3799.33 ms | 137995 tok/s) step 6966/76294 | train loss 3.495479 | norm 0.2015 | lr 9.32e-04 | (3824.03 ms | 137103 tok/s) step 6967/76294 | train loss 3.488066 | norm 0.1989 | lr 9.32e-04 | (3799.02 ms | 138006 tok/s) step 6968/76294 | train loss 3.474893 | norm 0.1824 | lr 9.32e-04 | (3804.40 ms | 137811 tok/s) step 6969/76294 | train loss 3.515669 | norm 0.2619 | lr 9.32e-04 | (3821.80 ms | 137184 tok/s) step 6970/76294 | train loss 3.483983 | norm 0.1856 | lr 9.32e-04 | (3807.86 ms | 137686 tok/s) step 6971/76294 | train loss 3.525752 | norm 0.2473 | lr 9.32e-04 | (3821.29 ms | 137202 tok/s) step 6972/76294 | train loss 3.426152 | norm 0.2500 | lr 9.32e-04 | (3804.71 ms | 137800 tok/s) step 6973/76294 | train loss 3.444155 | norm 0.2126 | lr 9.32e-04 | (3818.65 ms | 137297 tok/s) step 6974/76294 | train loss 3.513394 | norm 0.2044 | lr 9.32e-04 | (3827.43 ms | 136982 tok/s) step 6975/76294 | train loss 3.436994 | norm 0.2494 | lr 9.32e-04 | (4045.41 ms | 129601 tok/s) step 6976/76294 | train loss 3.499842 | norm 0.1910 | lr 9.32e-04 | (3842.82 ms | 136433 tok/s) step 6977/76294 | train loss 3.502925 | norm 0.1872 | lr 9.32e-04 | (3798.12 ms | 138039 tok/s) step 6978/76294 | train loss 3.521636 | norm 0.1875 | lr 9.32e-04 
| (3820.86 ms | 137217 tok/s) step 6979/76294 | train loss 3.491011 | norm 0.2287 | lr 9.31e-04 | (3817.51 ms | 137338 tok/s) step 6980/76294 | train loss 3.491348 | norm 0.2385 | lr 9.31e-04 | (3817.88 ms | 137325 tok/s) step 6981/76294 | train loss 3.519959 | norm 0.1808 | lr 9.31e-04 | (3808.18 ms | 137674 tok/s) step 6982/76294 | train loss 3.518954 | norm 0.2250 | lr 9.31e-04 | (3847.12 ms | 136281 tok/s) step 6983/76294 | train loss 3.582097 | norm 0.2272 | lr 9.31e-04 | (3799.64 ms | 137984 tok/s) step 6984/76294 | train loss 3.465285 | norm 0.1987 | lr 9.31e-04 | (3825.03 ms | 137068 tok/s) step 6985/76294 | train loss 3.474448 | norm 0.2173 | lr 9.31e-04 | (3797.93 ms | 138046 tok/s) step 6986/76294 | train loss 3.540928 | norm 0.2135 | lr 9.31e-04 | (3825.54 ms | 137049 tok/s) step 6987/76294 | train loss 3.428945 | norm 0.2008 | lr 9.31e-04 | (3795.53 ms | 138133 tok/s) step 6988/76294 | train loss 3.470661 | norm 0.2438 | lr 9.31e-04 | (3824.21 ms | 137097 tok/s) step 6989/76294 | train loss 3.482185 | norm 0.1852 | lr 9.31e-04 | (3802.68 ms | 137873 tok/s) step 6990/76294 | train loss 3.458228 | norm 0.2255 | lr 9.31e-04 | (3851.40 ms | 136129 tok/s) step 6991/76294 | train loss 3.438842 | norm 0.1936 | lr 9.31e-04 | (3797.05 ms | 138078 tok/s) step 6992/76294 | train loss 3.519222 | norm 0.2070 | lr 9.30e-04 | (3854.15 ms | 136032 tok/s) step 6993/76294 | train loss 3.509350 | norm 0.2272 | lr 9.30e-04 | (3800.99 ms | 137934 tok/s) step 6994/76294 | train loss 3.440567 | norm 0.2221 | lr 9.30e-04 | (3803.18 ms | 137855 tok/s) step 6995/76294 | train loss 3.486722 | norm 0.2109 | lr 9.30e-04 | (3825.01 ms | 137068 tok/s) step 6996/76294 | train loss 3.498436 | norm 0.1989 | lr 9.30e-04 | (3802.68 ms | 137873 tok/s) step 6997/76294 | train loss 3.512226 | norm 0.2160 | lr 9.30e-04 | (3800.56 ms | 137950 tok/s) step 6998/76294 | train loss 3.470713 | norm 0.1964 | lr 9.30e-04 | (3976.96 ms | 131831 tok/s) step 6999/76294 | train loss 3.589573 | norm 
0.2031 | lr 9.30e-04 | (3874.53 ms | 135317 tok/s) step 7000/76294 | train loss 3.481366 | norm 0.2021 | lr 9.30e-04 | (3827.56 ms | 136977 tok/s) val loss: 3.494211 saving model checkpoint to ./results/gpt2-124M-gqa/step_7000.pth step 7001/76294 | train loss 3.466556 | norm 0.1886 | lr 9.30e-04 | (3884.59 ms | 134966 tok/s) step 7002/76294 | train loss 3.451873 | norm 0.2003 | lr 9.30e-04 | (3798.17 ms | 138037 tok/s) step 7003/76294 | train loss 3.489428 | norm 0.2498 | lr 9.30e-04 | (3831.36 ms | 136841 tok/s) step 7004/76294 | train loss 3.543751 | norm 0.2279 | lr 9.30e-04 | (3798.86 ms | 138012 tok/s) step 7005/76294 | train loss 3.487602 | norm 0.1971 | lr 9.29e-04 | (3827.09 ms | 136994 tok/s) step 7006/76294 | train loss 3.564149 | norm 0.2241 | lr 9.29e-04 | (3803.89 ms | 137830 tok/s) step 7007/76294 | train loss 3.522887 | norm 0.2712 | lr 9.29e-04 | (3856.20 ms | 135960 tok/s) step 7008/76294 | train loss 3.547670 | norm 0.2531 | lr 9.29e-04 | (3799.36 ms | 137994 tok/s) step 7009/76294 | train loss 3.454624 | norm 0.1923 | lr 9.29e-04 | (3825.93 ms | 137036 tok/s) step 7010/76294 | train loss 3.489126 | norm 0.2649 | lr 9.29e-04 | (3801.60 ms | 137912 tok/s) step 7011/76294 | train loss 3.489592 | norm 0.2622 | lr 9.29e-04 | (3849.63 ms | 136192 tok/s) step 7012/76294 | train loss 3.496064 | norm 0.2186 | lr 9.29e-04 | (3801.55 ms | 137914 tok/s) step 7013/76294 | train loss 3.541655 | norm 0.3652 | lr 9.29e-04 | (3809.78 ms | 137616 tok/s) step 7014/76294 | train loss 3.450563 | norm 0.4146 | lr 9.29e-04 | (3824.83 ms | 137075 tok/s) step 7015/76294 | train loss 3.550774 | norm 0.2359 | lr 9.29e-04 | (3799.60 ms | 137985 tok/s) step 7016/76294 | train loss 3.503731 | norm 0.2824 | lr 9.29e-04 | (3823.17 ms | 137134 tok/s) step 7017/76294 | train loss 3.605453 | norm 0.2610 | lr 9.28e-04 | (3828.19 ms | 136954 tok/s) step 7018/76294 | train loss 3.511292 | norm 0.2233 | lr 9.28e-04 | (3875.75 ms | 135274 tok/s) step 7019/76294 | train loss 3.433993 | 
norm 0.2176 | lr 9.28e-04 | (3799.87 ms | 137975 tok/s) step 7020/76294 | train loss 3.550362 | norm 0.2296 | lr 9.28e-04 | (3807.28 ms | 137707 tok/s) step 7021/76294 | train loss 3.495516 | norm 0.2078 | lr 9.28e-04 | (3839.38 ms | 136555 tok/s) step 7022/76294 | train loss 3.463814 | norm 0.2569 | lr 9.28e-04 | (3809.36 ms | 137632 tok/s) step 7023/76294 | train loss 3.504115 | norm 0.2119 | lr 9.28e-04 | (3804.20 ms | 137818 tok/s) step 7024/76294 | train loss 3.531572 | norm 0.1958 | lr 9.28e-04 | (3838.11 ms | 136601 tok/s) step 7025/76294 | train loss 3.573357 | norm 0.2025 | lr 9.28e-04 | (3800.50 ms | 137952 tok/s) step 7026/76294 | train loss 3.449624 | norm 0.2100 | lr 9.28e-04 | (3831.03 ms | 136853 tok/s) step 7027/76294 | train loss 3.384212 | norm 0.1884 | lr 9.28e-04 | (3804.46 ms | 137809 tok/s) step 7028/76294 | train loss 3.533615 | norm 0.2008 | lr 9.28e-04 | (3832.94 ms | 136785 tok/s) step 7029/76294 | train loss 3.427574 | norm 0.1889 | lr 9.28e-04 | (3806.68 ms | 137728 tok/s) step 7030/76294 | train loss 3.518991 | norm 0.1815 | lr 9.27e-04 | (3805.69 ms | 137764 tok/s) step 7031/76294 | train loss 3.457350 | norm 0.1948 | lr 9.27e-04 | (3828.92 ms | 136928 tok/s) step 7032/76294 | train loss 3.582866 | norm 0.2087 | lr 9.27e-04 | (3808.23 ms | 137672 tok/s) step 7033/76294 | train loss 3.479996 | norm 0.1983 | lr 9.27e-04 | (3855.11 ms | 135998 tok/s) step 7034/76294 | train loss 3.463096 | norm 0.1904 | lr 9.27e-04 | (3803.34 ms | 137849 tok/s) step 7035/76294 | train loss 3.465297 | norm 0.1912 | lr 9.27e-04 | (3842.03 ms | 136461 tok/s) step 7036/76294 | train loss 3.530890 | norm 0.1921 | lr 9.27e-04 | (3800.89 ms | 137938 tok/s) step 7037/76294 | train loss 3.459775 | norm 0.2203 | lr 9.27e-04 | (3807.53 ms | 137698 tok/s) step 7038/76294 | train loss 3.520121 | norm 0.2144 | lr 9.27e-04 | (3833.59 ms | 136762 tok/s) step 7039/76294 | train loss 3.490656 | norm 0.2353 | lr 9.27e-04 | (3973.33 ms | 131952 tok/s) step 7040/76294 | train 
loss 3.468785 | norm 0.1970 | lr 9.27e-04 | (3890.60 ms | 134758 tok/s) step 7041/76294 | train loss 3.480816 | norm 0.2273 | lr 9.27e-04 | (3807.10 ms | 137713 tok/s) step 7042/76294 | train loss 3.517422 | norm 0.2208 | lr 9.27e-04 | (3809.32 ms | 137633 tok/s) step 7043/76294 | train loss 3.584581 | norm 0.2318 | lr 9.26e-04 | (3824.35 ms | 137092 tok/s) step 7044/76294 | train loss 3.488232 | norm 0.2490 | lr 9.26e-04 | (3832.42 ms | 136803 tok/s) step 7045/76294 | train loss 3.473516 | norm 0.2517 | lr 9.26e-04 | (3805.74 ms | 137763 tok/s) step 7046/76294 | train loss 3.523394 | norm 0.3738 | lr 9.26e-04 | (3898.38 ms | 134489 tok/s) step 7047/76294 | train loss 3.455092 | norm 0.2758 | lr 9.26e-04 | (3800.74 ms | 137944 tok/s) step 7048/76294 | train loss 3.650555 | norm 0.2104 | lr 9.26e-04 | (3830.04 ms | 136888 tok/s) step 7049/76294 | train loss 3.482440 | norm 0.2539 | lr 9.26e-04 | (3804.54 ms | 137806 tok/s) step 7050/76294 | train loss 3.540985 | norm 0.1966 | lr 9.26e-04 | (3853.81 ms | 136044 tok/s) step 7051/76294 | train loss 3.538994 | norm 0.2528 | lr 9.26e-04 | (3833.15 ms | 136777 tok/s) step 7052/76294 | train loss 3.417456 | norm 0.1891 | lr 9.26e-04 | (3808.07 ms | 137678 tok/s) step 7053/76294 | train loss 3.609751 | norm 0.2004 | lr 9.26e-04 | (3827.77 ms | 136970 tok/s) step 7054/76294 | train loss 3.482202 | norm 0.2124 | lr 9.26e-04 | (3813.60 ms | 137479 tok/s) step 7055/76294 | train loss 3.457251 | norm 0.1808 | lr 9.26e-04 | (3825.64 ms | 137046 tok/s) step 7056/76294 | train loss 3.503749 | norm 0.2393 | lr 9.25e-04 | (3809.60 ms | 137623 tok/s) step 7057/76294 | train loss 3.503385 | norm 0.1943 | lr 9.25e-04 | (3808.51 ms | 137662 tok/s) step 7058/76294 | train loss 3.473245 | norm 0.2232 | lr 9.25e-04 | (4243.06 ms | 123564 tok/s) step 7059/76294 | train loss 3.522865 | norm 0.2521 | lr 9.25e-04 | (3824.98 ms | 137069 tok/s) step 7060/76294 | train loss 3.468763 | norm 0.2261 | lr 9.25e-04 | (3878.65 ms | 135173 tok/s) step 
7061/76294 | train loss 3.545446 | norm 0.2409 | lr 9.25e-04 | (3804.80 ms | 137797 tok/s) step 7062/76294 | train loss 3.566520 | norm 0.2096 | lr 9.25e-04 | (3863.78 ms | 135693 tok/s) step 7063/76294 | train loss 3.513598 | norm 0.2149 | lr 9.25e-04 | (3798.95 ms | 138009 tok/s) step 7064/76294 | train loss 3.565225 | norm 0.2312 | lr 9.25e-04 | (3825.30 ms | 137058 tok/s) step 7065/76294 | train loss 3.543830 | norm 0.2452 | lr 9.25e-04 | (3803.44 ms | 137846 tok/s) step 7066/76294 | train loss 3.610986 | norm 0.1974 | lr 9.25e-04 | (3833.20 ms | 136776 tok/s) step 7067/76294 | train loss 3.496493 | norm 0.2225 | lr 9.25e-04 | (3798.94 ms | 138009 tok/s) step 7068/76294 | train loss 3.526094 | norm 0.2105 | lr 9.24e-04 | (3804.35 ms | 137813 tok/s) step 7069/76294 | train loss 3.531285 | norm 0.1928 | lr 9.24e-04 | (3824.16 ms | 137099 tok/s) step 7070/76294 | train loss 3.586546 | norm 0.2095 | lr 9.24e-04 | (3805.03 ms | 137788 tok/s) step 7071/76294 | train loss 3.500998 | norm 0.2278 | lr 9.24e-04 | (3826.95 ms | 136999 tok/s) step 7072/76294 | train loss 3.542390 | norm 0.2255 | lr 9.24e-04 | (3803.31 ms | 137851 tok/s) step 7073/76294 | train loss 3.531149 | norm 0.2345 | lr 9.24e-04 | (3829.84 ms | 136896 tok/s) step 7074/76294 | train loss 3.508909 | norm 0.2329 | lr 9.24e-04 | (3818.71 ms | 137294 tok/s) step 7075/76294 | train loss 3.546443 | norm 0.2071 | lr 9.24e-04 | (3811.01 ms | 137572 tok/s) step 7076/76294 | train loss 3.603445 | norm 0.2508 | lr 9.24e-04 | (3826.91 ms | 137000 tok/s) step 7077/76294 | train loss 3.567589 | norm 0.1969 | lr 9.24e-04 | (3803.60 ms | 137840 tok/s) step 7078/76294 | train loss 3.485275 | norm 0.2445 | lr 9.24e-04 | (3814.42 ms | 137449 tok/s) step 7079/76294 | train loss 3.535526 | norm 0.2005 | lr 9.24e-04 | (3806.20 ms | 137746 tok/s) step 7080/76294 | train loss 3.498081 | norm 0.2166 | lr 9.24e-04 | (3807.94 ms | 137683 tok/s) step 7081/76294 | train loss 3.504860 | norm 0.1959 | lr 9.23e-04 | (3903.19 ms | 
134323 tok/s) step 7082/76294 | train loss 3.603520 | norm 0.1893 | lr 9.23e-04 | (3805.85 ms | 137758 tok/s) step 7083/76294 | train loss 3.561973 | norm 0.2650 | lr 9.23e-04 | (3817.64 ms | 137333 tok/s) step 7084/76294 | train loss 3.579422 | norm 0.3094 | lr 9.23e-04 | (3804.83 ms | 137795 tok/s) step 7085/76294 | train loss 3.522311 | norm 0.1948 | lr 9.23e-04 | (3812.36 ms | 137523 tok/s) step 7086/76294 | train loss 3.572901 | norm 0.2650 | lr 9.23e-04 | (3822.27 ms | 137167 tok/s) step 7087/76294 | train loss 3.509229 | norm 0.2250 | lr 9.23e-04 | (3829.42 ms | 136911 tok/s) step 7088/76294 | train loss 3.602164 | norm 0.2230 | lr 9.23e-04 | (3801.27 ms | 137924 tok/s) step 7089/76294 | train loss 3.517951 | norm 0.2289 | lr 9.23e-04 | (3835.82 ms | 136682 tok/s) step 7090/76294 | train loss 3.534496 | norm 0.3711 | lr 9.23e-04 | (3801.24 ms | 137925 tok/s) step 7091/76294 | train loss 3.515454 | norm 0.3230 | lr 9.23e-04 | (3805.15 ms | 137784 tok/s) step 7092/76294 | train loss 3.498360 | norm 0.2642 | lr 9.23e-04 | (3826.89 ms | 137001 tok/s) step 7093/76294 | train loss 3.517308 | norm 0.4876 | lr 9.23e-04 | (3806.59 ms | 137732 tok/s) step 7094/76294 | train loss 3.483332 | norm 0.2503 | lr 9.22e-04 | (3825.74 ms | 137042 tok/s) step 7095/76294 | train loss 3.552262 | norm 0.2200 | lr 9.22e-04 | (3804.43 ms | 137810 tok/s) step 7096/76294 | train loss 3.476032 | norm 0.2208 | lr 9.22e-04 | (3799.12 ms | 138002 tok/s) step 7097/76294 | train loss 3.518653 | norm 0.2865 | lr 9.22e-04 | (3858.62 ms | 135875 tok/s) step 7098/76294 | train loss 3.494664 | norm 0.2382 | lr 9.22e-04 | (3804.25 ms | 137816 tok/s) step 7099/76294 | train loss 3.538612 | norm 0.2202 | lr 9.22e-04 | (3832.96 ms | 136784 tok/s) step 7100/76294 | train loss 3.486958 | norm 0.2467 | lr 9.22e-04 | (5808.03 ms | 90270 tok/s) step 7101/76294 | train loss 3.521437 | norm 0.2295 | lr 9.22e-04 | (3959.83 ms | 132402 tok/s) step 7102/76294 | train loss 3.536812 | norm 0.2455 | lr 9.22e-04 
| (3817.11 ms | 137352 tok/s) step 7103/76294 | train loss 3.511007 | norm 0.2910 | lr 9.22e-04 | (3819.26 ms | 137275 tok/s) step 7104/76294 | train loss 3.481770 | norm 0.1924 | lr 9.22e-04 | (3827.25 ms | 136988 tok/s) step 7105/76294 | train loss 3.757200 | norm 0.3181 | lr 9.22e-04 | (3796.53 ms | 138097 tok/s) step 7106/76294 | train loss 3.513139 | norm 0.1874 | lr 9.22e-04 | (3827.91 ms | 136964 tok/s) step 7107/76294 | train loss 3.541752 | norm 0.2350 | lr 9.21e-04 | (3796.98 ms | 138080 tok/s) step 7108/76294 | train loss 3.559124 | norm 0.2315 | lr 9.21e-04 | (3805.48 ms | 137772 tok/s) step 7109/76294 | train loss 3.517552 | norm 0.2341 | lr 9.21e-04 | (3830.65 ms | 136867 tok/s) step 7110/76294 | train loss 3.544395 | norm 0.1863 | lr 9.21e-04 | (3807.85 ms | 137686 tok/s) step 7111/76294 | train loss 3.468712 | norm 0.2182 | lr 9.21e-04 | (3824.76 ms | 137077 tok/s) step 7112/76294 | train loss 3.579727 | norm 0.2141 | lr 9.21e-04 | (3810.06 ms | 137606 tok/s) step 7113/76294 | train loss 3.535526 | norm 0.2384 | lr 9.21e-04 | (3800.83 ms | 137940 tok/s) step 7114/76294 | train loss 3.506466 | norm 0.1981 | lr 9.21e-04 | (3837.23 ms | 136632 tok/s) step 7115/76294 | train loss 3.567369 | norm 0.2301 | lr 9.21e-04 | (3801.54 ms | 137915 tok/s) step 7116/76294 | train loss 3.490252 | norm 0.2036 | lr 9.21e-04 | (3804.30 ms | 137815 tok/s) step 7117/76294 | train loss 3.524827 | norm 0.2704 | lr 9.21e-04 | (3823.34 ms | 137128 tok/s) step 7118/76294 | train loss 3.492536 | norm 0.2240 | lr 9.21e-04 | (3802.57 ms | 137877 tok/s) step 7119/76294 | train loss 3.523779 | norm 0.1834 | lr 9.20e-04 | (3799.35 ms | 137994 tok/s) step 7120/76294 | train loss 3.544784 | norm 0.2101 | lr 9.20e-04 | (3838.48 ms | 136587 tok/s) step 7121/76294 | train loss 3.518139 | norm 0.2300 | lr 9.20e-04 | (3822.04 ms | 137175 tok/s) step 7122/76294 | train loss 3.519409 | norm 0.1893 | lr 9.20e-04 | (3892.47 ms | 134693 tok/s) step 7123/76294 | train loss 3.551872 | norm 
0.2334 | lr 9.20e-04 | (3804.03 ms | 137824 tok/s) step 7124/76294 | train loss 3.561924 | norm 0.2146 | lr 9.20e-04 | (7952.55 ms | 65927 tok/s) step 7125/76294 | train loss 3.533083 | norm 0.2413 | lr 9.20e-04 | (3903.27 ms | 134320 tok/s) step 7126/76294 | train loss 3.549960 | norm 0.2265 | lr 9.20e-04 | (3792.36 ms | 138248 tok/s) step 7127/76294 | train loss 3.556657 | norm 0.2487 | lr 9.20e-04 | (3826.48 ms | 137016 tok/s) step 7128/76294 | train loss 3.506838 | norm 0.2944 | lr 9.20e-04 | (3799.05 ms | 138005 tok/s) step 7129/76294 | train loss 3.513988 | norm 0.2281 | lr 9.20e-04 | (3828.00 ms | 136961 tok/s) step 7130/76294 | train loss 3.597315 | norm 0.3233 | lr 9.20e-04 | (3800.74 ms | 137944 tok/s) step 7131/76294 | train loss 3.538388 | norm 0.2640 | lr 9.20e-04 | (8733.12 ms | 60034 tok/s) step 7132/76294 | train loss 3.530720 | norm 0.2414 | lr 9.19e-04 | (3931.82 ms | 133345 tok/s) step 7133/76294 | train loss 3.469242 | norm 0.2421 | lr 9.19e-04 | (3801.49 ms | 137916 tok/s) step 7134/76294 | train loss 3.518400 | norm 0.2562 | lr 9.19e-04 | (3822.22 ms | 137169 tok/s) step 7135/76294 | train loss 3.529892 | norm 0.2192 | lr 9.19e-04 | (3790.83 ms | 138304 tok/s) step 7136/76294 | train loss 3.535064 | norm 0.2187 | lr 9.19e-04 | (3788.41 ms | 138393 tok/s) step 7137/76294 | train loss 3.554888 | norm 0.1991 | lr 9.19e-04 | (3823.03 ms | 137139 tok/s) step 7138/76294 | train loss 3.457612 | norm 0.2042 | lr 9.19e-04 | (3789.92 ms | 138338 tok/s) step 7139/76294 | train loss 3.574810 | norm 0.2173 | lr 9.19e-04 | (3805.36 ms | 137776 tok/s) step 7140/76294 | train loss 3.482476 | norm 0.2378 | lr 9.19e-04 | (3792.17 ms | 138255 tok/s) step 7141/76294 | train loss 3.496603 | norm 0.1896 | lr 9.19e-04 | (3849.19 ms | 136207 tok/s) step 7142/76294 | train loss 3.486907 | norm 0.2217 | lr 9.19e-04 | (3788.61 ms | 138385 tok/s) step 7143/76294 | train loss 3.486888 | norm 0.1769 | lr 9.19e-04 | (3820.20 ms | 137241 tok/s) step 7144/76294 | train loss 
3.508377 | norm 0.2063 | lr 9.19e-04 | (3794.11 ms | 138185 tok/s) step 7145/76294 | train loss 3.526361 | norm 0.2720 | lr 9.18e-04 | (3838.88 ms | 136573 tok/s) step 7146/76294 | train loss 3.489632 | norm 0.2717 | lr 9.18e-04 | (3832.10 ms | 136815 tok/s) step 7147/76294 | train loss 3.552627 | norm 0.2187 | lr 9.18e-04 | (3794.65 ms | 138165 tok/s) step 7148/76294 | train loss 3.491602 | norm 0.2870 | lr 9.18e-04 | (3790.95 ms | 138300 tok/s) step 7149/76294 | train loss 3.476461 | norm 0.2163 | lr 9.18e-04 | (3826.17 ms | 137027 tok/s) step 7150/76294 | train loss 3.513699 | norm 0.2311 | lr 9.18e-04 | (3831.67 ms | 136830 tok/s) step 7151/76294 | train loss 3.494539 | norm 0.2391 | lr 9.18e-04 | (3803.75 ms | 137834 tok/s) step 7152/76294 | train loss 3.629317 | norm 0.1985 | lr 9.18e-04 | (3794.24 ms | 138180 tok/s) step 7153/76294 | train loss 3.548697 | norm 0.1887 | lr 9.18e-04 | (3801.30 ms | 137923 tok/s) step 7154/76294 | train loss 3.470140 | norm 0.2038 | lr 9.18e-04 | (3822.05 ms | 137175 tok/s) step 7155/76294 | train loss 3.531856 | norm 0.2066 | lr 9.18e-04 | (3833.56 ms | 136763 tok/s) step 7156/76294 | train loss 3.486903 | norm 0.2121 | lr 9.18e-04 | (3875.14 ms | 135295 tok/s) step 7157/76294 | train loss 3.550234 | norm 0.1952 | lr 9.17e-04 | (3923.86 ms | 133616 tok/s) step 7158/76294 | train loss 3.542577 | norm 0.2768 | lr 9.17e-04 | (3789.33 ms | 138359 tok/s) step 7159/76294 | train loss 3.577225 | norm 0.1951 | lr 9.17e-04 | (3825.00 ms | 137069 tok/s) step 7160/76294 | train loss 3.517692 | norm 0.2192 | lr 9.17e-04 | (3920.52 ms | 133729 tok/s) step 7161/76294 | train loss 3.525435 | norm 0.1904 | lr 9.17e-04 | (3788.86 ms | 138376 tok/s) step 7162/76294 | train loss 3.502056 | norm 0.2099 | lr 9.17e-04 | (3823.12 ms | 137136 tok/s) step 7163/76294 | train loss 3.621910 | norm 0.1846 | lr 9.17e-04 | (3813.88 ms | 137468 tok/s) step 7164/76294 | train loss 3.527479 | norm 0.2480 | lr 9.17e-04 | (3791.41 ms | 138283 tok/s) step 
7165/76294 | train loss 3.482424 | norm 0.1673 | lr 9.17e-04 | (3788.83 ms | 138377 tok/s) step 7166/76294 | train loss 3.571276 | norm 0.2582 | lr 9.17e-04 | (3852.21 ms | 136101 tok/s) step 7167/76294 | train loss 3.479700 | norm 0.1996 | lr 9.17e-04 | (3789.29 ms | 138360 tok/s) step 7168/76294 | train loss 3.573080 | norm 0.2531 | lr 9.17e-04 | (3821.00 ms | 137212 tok/s) step 7169/76294 | train loss 3.540089 | norm 0.2117 | lr 9.17e-04 | (3789.90 ms | 138338 tok/s) step 7170/76294 | train loss 3.539698 | norm 0.2064 | lr 9.16e-04 | (3796.06 ms | 138114 tok/s) step 7171/76294 | train loss 3.515664 | norm 0.2381 | lr 9.16e-04 | (3814.38 ms | 137450 tok/s) step 7172/76294 | train loss 3.508915 | norm 0.2459 | lr 9.16e-04 | (4110.81 ms | 127539 tok/s) step 7173/76294 | train loss 3.539948 | norm 0.2319 | lr 9.16e-04 | (3794.57 ms | 138168 tok/s) step 7174/76294 | train loss 3.612670 | norm 0.2394 | lr 9.16e-04 | (3796.53 ms | 138097 tok/s) step 7175/76294 | train loss 3.494817 | norm 0.2130 | lr 9.16e-04 | (3791.99 ms | 138262 tok/s) step 7176/76294 | train loss 3.496083 | norm 0.1964 | lr 9.16e-04 | (3890.28 ms | 134769 tok/s) step 7177/76294 | train loss 3.565523 | norm 0.2093 | lr 9.16e-04 | (3793.12 ms | 138221 tok/s) step 7178/76294 | train loss 3.473303 | norm 0.2113 | lr 9.16e-04 | (3819.82 ms | 137255 tok/s) step 7179/76294 | train loss 3.469150 | norm 0.2244 | lr 9.16e-04 | (3794.40 ms | 138174 tok/s) step 7180/76294 | train loss 3.517054 | norm 0.2216 | lr 9.16e-04 | (3799.89 ms | 137975 tok/s) step 7181/76294 | train loss 3.541834 | norm 0.1977 | lr 9.16e-04 | (3877.05 ms | 135229 tok/s) step 7182/76294 | train loss 3.539693 | norm 0.1874 | lr 9.16e-04 | (3789.72 ms | 138345 tok/s) step 7183/76294 | train loss 3.523011 | norm 0.2239 | lr 9.15e-04 | (3833.80 ms | 136754 tok/s) step 7184/76294 | train loss 3.430103 | norm 0.2090 | lr 9.15e-04 | (3793.82 ms | 138195 tok/s) step 7185/76294 | train loss 3.651414 | norm 0.2102 | lr 9.15e-04 | (3836.96 ms | 
136641 tok/s) step 7186/76294 | train loss 3.612350 | norm 0.2195 | lr 9.15e-04 | (3799.40 ms | 137992 tok/s) step 7187/76294 | train loss 3.487620 | norm 0.3641 | lr 9.15e-04 | (3800.22 ms | 137963 tok/s) step 7188/76294 | train loss 3.522166 | norm 0.2627 | lr 9.15e-04 | (3817.93 ms | 137323 tok/s) step 7189/76294 | train loss 3.491378 | norm 0.2251 | lr 9.15e-04 | (3796.22 ms | 138108 tok/s) step 7190/76294 | train loss 3.554796 | norm 0.3565 | lr 9.15e-04 | (3813.60 ms | 137479 tok/s) step 7191/76294 | train loss 3.488484 | norm 0.2308 | lr 9.15e-04 | (3826.54 ms | 137013 tok/s) step 7192/76294 | train loss 3.512121 | norm 0.2352 | lr 9.15e-04 | (3790.44 ms | 138318 tok/s) step 7193/76294 | train loss 3.495321 | norm 0.1970 | lr 9.15e-04 | (3819.65 ms | 137261 tok/s) step 7194/76294 | train loss 3.485625 | norm 0.2067 | lr 9.15e-04 | (3793.07 ms | 138223 tok/s) step 7195/76294 | train loss 3.504325 | norm 0.1819 | lr 9.14e-04 | (3817.59 ms | 137335 tok/s) step 7196/76294 | train loss 3.524521 | norm 0.2737 | lr 9.14e-04 | (3819.16 ms | 137278 tok/s) step 7197/76294 | train loss 3.485206 | norm 0.5060 | lr 9.14e-04 | (3791.37 ms | 138285 tok/s) step 7198/76294 | train loss 3.544804 | norm 0.3414 | lr 9.14e-04 | (3891.75 ms | 134718 tok/s) step 7199/76294 | train loss 3.507063 | norm 0.3174 | lr 9.14e-04 | (3790.45 ms | 138318 tok/s) step 7200/76294 | train loss 3.552584 | norm 0.2563 | lr 9.14e-04 | (3797.26 ms | 138070 tok/s) step 7201/76294 | train loss 3.584850 | norm 0.2582 | lr 9.14e-04 | (3812.11 ms | 137532 tok/s) step 7202/76294 | train loss 3.540466 | norm 0.2591 | lr 9.14e-04 | (3871.63 ms | 135418 tok/s) step 7203/76294 | train loss 3.489241 | norm 0.2291 | lr 9.14e-04 | (3793.22 ms | 138217 tok/s) step 7204/76294 | train loss 3.429622 | norm 0.2094 | lr 9.14e-04 | (3815.60 ms | 137407 tok/s) step 7205/76294 | train loss 3.493885 | norm 0.2154 | lr 9.14e-04 | (3788.43 ms | 138392 tok/s) step 7206/76294 | train loss 3.504810 | norm 0.2368 | lr 9.14e-04 
| (3793.79 ms | 138196 tok/s) step 7207/76294 | train loss 3.558915 | norm 0.2107 | lr 9.14e-04 | (3816.82 ms | 137363 tok/s) step 7208/76294 | train loss 3.501260 | norm 0.2514 | lr 9.13e-04 | (3799.45 ms | 137990 tok/s) step 7209/76294 | train loss 3.540437 | norm 0.2129 | lr 9.13e-04 | (3819.98 ms | 137249 tok/s) step 7210/76294 | train loss 3.437802 | norm 0.2134 | lr 9.13e-04 | (3816.81 ms | 137363 tok/s) step 7211/76294 | train loss 3.497002 | norm 0.1989 | lr 9.13e-04 | (3791.93 ms | 138264 tok/s) step 7212/76294 | train loss 3.486938 | norm 0.1958 | lr 9.13e-04 | (3820.78 ms | 137220 tok/s) step 7213/76294 | train loss 3.588646 | norm 0.1896 | lr 9.13e-04 | (3798.08 ms | 138040 tok/s) step 7214/76294 | train loss 3.589988 | norm 0.2061 | lr 9.13e-04 | (3798.91 ms | 138010 tok/s) step 7215/76294 | train loss 3.561332 | norm 0.1787 | lr 9.13e-04 | (3814.49 ms | 137446 tok/s) step 7216/76294 | train loss 3.522740 | norm 0.2010 | lr 9.13e-04 | (3795.87 ms | 138121 tok/s) step 7217/76294 | train loss 3.640875 | norm 0.2140 | lr 9.13e-04 | (3797.07 ms | 138077 tok/s) step 7218/76294 | train loss 3.526015 | norm 0.1962 | lr 9.13e-04 | (3819.79 ms | 137256 tok/s) step 7219/76294 | train loss 3.501876 | norm 0.2035 | lr 9.13e-04 | (3795.92 ms | 138119 tok/s) step 7220/76294 | train loss 3.537287 | norm 0.2119 | lr 9.12e-04 | (3800.91 ms | 137937 tok/s) step 7221/76294 | train loss 3.486833 | norm 0.2234 | lr 9.12e-04 | (3812.41 ms | 137521 tok/s) step 7222/76294 | train loss 3.551147 | norm 0.2464 | lr 9.12e-04 | (3796.99 ms | 138080 tok/s) step 7223/76294 | train loss 3.555182 | norm 0.2828 | lr 9.12e-04 | (3802.11 ms | 137894 tok/s) step 7224/76294 | train loss 3.556405 | norm 0.2774 | lr 9.12e-04 | (3807.58 ms | 137696 tok/s) step 7225/76294 | train loss 3.587240 | norm 0.2615 | lr 9.12e-04 | (3850.82 ms | 136150 tok/s) step 7226/76294 | train loss 3.604388 | norm 0.3217 | lr 9.12e-04 | (3802.25 ms | 137889 tok/s) step 7227/76294 | train loss 3.534481 | norm 
0.1996 | lr 9.12e-04 | (3823.28 ms | 137130 tok/s) step 7228/76294 | train loss 3.586196 | norm 0.2093 | lr 9.12e-04 | (3808.70 ms | 137656 tok/s) step 7229/76294 | train loss 3.500059 | norm 0.2243 | lr 9.12e-04 | (3800.39 ms | 137956 tok/s) step 7230/76294 | train loss 3.517310 | norm 0.2072 | lr 9.12e-04 | (3831.58 ms | 136833 tok/s) step 7231/76294 | train loss 3.521043 | norm 0.2392 | lr 9.12e-04 | (3800.38 ms | 137957 tok/s) step 7232/76294 | train loss 3.553241 | norm 0.2013 | lr 9.12e-04 | (3806.55 ms | 137733 tok/s) step 7233/76294 | train loss 3.519363 | norm 0.2330 | lr 9.11e-04 | (3823.76 ms | 137113 tok/s) step 7234/76294 | train loss 3.466597 | norm 0.1821 | lr 9.11e-04 | (3808.31 ms | 137669 tok/s) step 7235/76294 | train loss 3.516643 | norm 0.1940 | lr 9.11e-04 | (3804.21 ms | 137818 tok/s) step 7236/76294 | train loss 3.553872 | norm 0.2064 | lr 9.11e-04 | (3862.34 ms | 135744 tok/s) step 7237/76294 | train loss 3.509719 | norm 0.1983 | lr 9.11e-04 | (3802.35 ms | 137885 tok/s) step 7238/76294 | train loss 3.602618 | norm 0.1915 | lr 9.11e-04 | (3803.40 ms | 137847 tok/s) step 7239/76294 | train loss 3.502665 | norm 0.1948 | lr 9.11e-04 | (3830.83 ms | 136860 tok/s) step 7240/76294 | train loss 3.497425 | norm 0.1910 | lr 9.11e-04 | (3803.12 ms | 137857 tok/s) step 7241/76294 | train loss 3.492484 | norm 0.1761 | lr 9.11e-04 | (3800.28 ms | 137960 tok/s) step 7242/76294 | train loss 3.452974 | norm 0.2042 | lr 9.11e-04 | (3837.55 ms | 136621 tok/s) step 7243/76294 | train loss 3.547777 | norm 0.1971 | lr 9.11e-04 | (3801.84 ms | 137904 tok/s) step 7244/76294 | train loss 3.641629 | norm 0.2410 | lr 9.11e-04 | (3816.67 ms | 137368 tok/s) step 7245/76294 | train loss 3.560207 | norm 0.2352 | lr 9.10e-04 | (3850.93 ms | 136146 tok/s) step 7246/76294 | train loss 3.520051 | norm 0.2001 | lr 9.10e-04 | (3801.30 ms | 137923 tok/s) step 7247/76294 | train loss 3.517368 | norm 0.2051 | lr 9.10e-04 | (3806.75 ms | 137726 tok/s) step 7248/76294 | train loss 
3.488945 | norm 0.2324 | lr 9.10e-04 | (4348.56 ms | 120566 tok/s) step 7249/76294 | train loss 3.517779 | norm 0.2091 | lr 9.10e-04 | (3800.22 ms | 137963 tok/s) step 7250/76294 | train loss 3.475510 | norm 0.2180 | lr 9.10e-04 | (3830.69 ms | 136865 tok/s) val loss: 3.488527 saving model checkpoint to ./results/gpt2-124M-gqa/step_7250.pth step 7251/76294 | train loss 3.475494 | norm 0.1862 | lr 9.10e-04 | (3877.64 ms | 135208 tok/s) step 7252/76294 | train loss 3.495062 | norm 0.2726 | lr 9.10e-04 | (3794.07 ms | 138186 tok/s) step 7253/76294 | train loss 3.417578 | norm 0.5218 | lr 9.10e-04 | (3838.95 ms | 136571 tok/s) step 7254/76294 | train loss 3.493112 | norm 0.3270 | lr 9.10e-04 | (3822.50 ms | 137158 tok/s) step 7255/76294 | train loss 3.518996 | norm 0.3268 | lr 9.10e-04 | (3798.60 ms | 138022 tok/s) step 7256/76294 | train loss 3.460606 | norm 0.2581 | lr 9.10e-04 | (3894.29 ms | 134630 tok/s) step 7257/76294 | train loss 3.520541 | norm 0.3018 | lr 9.10e-04 | (3784.56 ms | 138533 tok/s) step 7258/76294 | train loss 3.461069 | norm 0.2219 | lr 9.09e-04 | (3795.82 ms | 138123 tok/s) step 7259/76294 | train loss 3.504620 | norm 0.2796 | lr 9.09e-04 | (3814.16 ms | 137458 tok/s) step 7260/76294 | train loss 3.499038 | norm 0.2186 | lr 9.09e-04 | (3823.42 ms | 137125 tok/s) step 7261/76294 | train loss 3.447783 | norm 0.2658 | lr 9.09e-04 | (3809.08 ms | 137641 tok/s) step 7262/76294 | train loss 3.520583 | norm 0.2066 | lr 9.09e-04 | (3790.82 ms | 138304 tok/s) step 7263/76294 | train loss 3.446316 | norm 0.2475 | lr 9.09e-04 | (3797.42 ms | 138064 tok/s) step 7264/76294 | train loss 3.596400 | norm 0.2911 | lr 9.09e-04 | (3795.29 ms | 138142 tok/s) step 7265/76294 | train loss 3.469138 | norm 0.2118 | lr 9.09e-04 | (3877.09 ms | 135227 tok/s) step 7266/76294 | train loss 3.436100 | norm 0.2456 | lr 9.09e-04 | (3795.94 ms | 138118 tok/s) step 7267/76294 | train loss 3.540679 | norm 0.2106 | lr 9.09e-04 | (3819.84 ms | 137254 tok/s) step 7268/76294 | train 
loss 3.464383 | norm 0.1945 | lr 9.09e-04 | (3792.49 ms | 138244 tok/s) step 7269/76294 | train loss 3.488431 | norm 0.2085 | lr 9.09e-04 | (3822.77 ms | 137149 tok/s) step 7270/76294 | train loss 3.424638 | norm 0.1985 | lr 9.08e-04 | (3796.68 ms | 138091 tok/s) step 7271/76294 | train loss 3.531848 | norm 0.1975 | lr 9.08e-04 | (3822.98 ms | 137141 tok/s) step 7272/76294 | train loss 3.472868 | norm 0.2242 | lr 9.08e-04 | (3797.32 ms | 138068 tok/s) step 7273/76294 | train loss 3.494891 | norm 0.2187 | lr 9.08e-04 | (3799.37 ms | 137994 tok/s) step 7274/76294 | train loss 3.554075 | norm 0.2053 | lr 9.08e-04 | (3817.46 ms | 137339 tok/s) step 7275/76294 | train loss 3.501265 | norm 0.2219 | lr 9.08e-04 | (3826.63 ms | 137011 tok/s) step 7276/76294 | train loss 3.534930 | norm 0.2319 | lr 9.08e-04 | (3799.09 ms | 138004 tok/s) step 7277/76294 | train loss 3.524849 | norm 0.2055 | lr 9.08e-04 | (3834.02 ms | 136746 tok/s) step 7278/76294 | train loss 3.513345 | norm 0.2164 | lr 9.08e-04 | (3793.72 ms | 138199 tok/s) step 7279/76294 | train loss 3.503451 | norm 0.2138 | lr 9.08e-04 | (3953.77 ms | 132605 tok/s) step 7280/76294 | train loss 3.468763 | norm 0.2631 | lr 9.08e-04 | (3862.40 ms | 135742 tok/s) step 7281/76294 | train loss 3.498360 | norm 0.2199 | lr 9.08e-04 | (3803.51 ms | 137843 tok/s) step 7282/76294 | train loss 3.477660 | norm 0.1892 | lr 9.08e-04 | (3820.79 ms | 137220 tok/s) step 7283/76294 | train loss 3.472871 | norm 0.2329 | lr 9.07e-04 | (3833.43 ms | 136767 tok/s) step 7284/76294 | train loss 3.541470 | norm 0.2022 | lr 9.07e-04 | (3820.12 ms | 137244 tok/s) step 7285/76294 | train loss 3.508791 | norm 0.2638 | lr 9.07e-04 | (3803.62 ms | 137839 tok/s) step 7286/76294 | train loss 3.460032 | norm 0.1783 | lr 9.07e-04 | (3953.71 ms | 132607 tok/s) step 7287/76294 | train loss 3.533878 | norm 0.2696 | lr 9.07e-04 | (3797.36 ms | 138066 tok/s) step 7288/76294 | train loss 3.461353 | norm 0.2310 | lr 9.07e-04 | (3853.92 ms | 136040 tok/s) step 
7289/76294 | train loss 3.489266 | norm 0.2248 | lr 9.07e-04 | (3816.43 ms | 137376 tok/s) step 7290/76294 | train loss 3.435031 | norm 0.2939 | lr 9.07e-04 | (3821.62 ms | 137190 tok/s) step 7291/76294 | train loss 3.490839 | norm 0.2544 | lr 9.07e-04 | (3819.30 ms | 137273 tok/s) step 7292/76294 | train loss 3.536809 | norm 0.2475 | lr 9.07e-04 | (3824.04 ms | 137103 tok/s) step 7293/76294 | train loss 3.504203 | norm 0.2333 | lr 9.07e-04 | (3797.25 ms | 138070 tok/s) step 7294/76294 | train loss 3.438646 | norm 0.2456 | lr 9.07e-04 | (3833.41 ms | 136768 tok/s) step 7295/76294 | train loss 3.564958 | norm 0.2074 | lr 9.06e-04 | (3800.53 ms | 137951 tok/s) step 7296/76294 | train loss 3.443271 | norm 0.2131 | lr 9.06e-04 | (3827.33 ms | 136985 tok/s) step 7297/76294 | train loss 3.472444 | norm 0.1985 | lr 9.06e-04 | (3820.93 ms | 137215 tok/s) step 7298/76294 | train loss 3.476921 | norm 0.1880 | lr 9.06e-04 | (3811.19 ms | 137565 tok/s) step 7299/76294 | train loss 3.560223 | norm 0.2214 | lr 9.06e-04 | (3838.61 ms | 136583 tok/s) step 7300/76294 | train loss 3.427169 | norm 0.2164 | lr 9.06e-04 | (3819.13 ms | 137280 tok/s) step 7301/76294 | train loss 3.483007 | norm 0.2388 | lr 9.06e-04 | (3827.27 ms | 136987 tok/s) step 7302/76294 | train loss 3.473293 | norm 0.2068 | lr 9.06e-04 | (3800.42 ms | 137955 tok/s) step 7303/76294 | train loss 3.489900 | norm 0.2226 | lr 9.06e-04 | (3832.14 ms | 136813 tok/s) step 7304/76294 | train loss 3.500710 | norm 0.2154 | lr 9.06e-04 | (3801.34 ms | 137922 tok/s) step 7305/76294 | train loss 3.500794 | norm 0.4233 | lr 9.06e-04 | (3803.93 ms | 137828 tok/s) step 7306/76294 | train loss 3.525939 | norm 0.2949 | lr 9.06e-04 | (3818.27 ms | 137310 tok/s) step 7307/76294 | train loss 3.495966 | norm 0.2826 | lr 9.06e-04 | (3847.91 ms | 136253 tok/s) step 7308/76294 | train loss 3.483333 | norm 0.2187 | lr 9.05e-04 | (3800.71 ms | 137945 tok/s) step 7309/76294 | train loss 3.459327 | norm 0.2338 | lr 9.05e-04 | (3807.28 ms | 
137707 tok/s) step 7310/76294 | train loss 3.541397 | norm 0.2233 | lr 9.05e-04 | (3798.93 ms | 138009 tok/s) step 7311/76294 | train loss 3.436961 | norm 0.2122 | lr 9.05e-04 | (3829.44 ms | 136910 tok/s) step 7312/76294 | train loss 3.457620 | norm 0.1981 | lr 9.05e-04 | (3822.88 ms | 137145 tok/s) step 7313/76294 | train loss 3.502560 | norm 0.2287 | lr 9.05e-04 | (3831.74 ms | 136828 tok/s) step 7314/76294 | train loss 3.415432 | norm 0.1908 | lr 9.05e-04 | (3802.30 ms | 137887 tok/s) step 7315/76294 | train loss 3.457385 | norm 0.2150 | lr 9.05e-04 | (3829.03 ms | 136924 tok/s) step 7316/76294 | train loss 3.536182 | norm 0.3681 | lr 9.05e-04 | (3803.09 ms | 137858 tok/s) step 7317/76294 | train loss 3.449840 | norm 0.2345 | lr 9.05e-04 | (3804.90 ms | 137793 tok/s) step 7318/76294 | train loss 3.485530 | norm 0.2582 | lr 9.05e-04 | (3824.84 ms | 137074 tok/s) step 7319/76294 | train loss 3.533902 | norm 0.2017 | lr 9.05e-04 | (3806.77 ms | 137725 tok/s) step 7320/76294 | train loss 3.545575 | norm 0.2298 | lr 9.04e-04 | (3831.10 ms | 136851 tok/s) step 7321/76294 | train loss 3.477669 | norm 0.2096 | lr 9.04e-04 | (3804.14 ms | 137820 tok/s) step 7322/76294 | train loss 3.442694 | norm 0.2185 | lr 9.04e-04 | (3802.95 ms | 137863 tok/s) step 7323/76294 | train loss 3.451292 | norm 0.2567 | lr 9.04e-04 | (3805.78 ms | 137761 tok/s) step 7324/76294 | train loss 3.524970 | norm 0.2460 | lr 9.04e-04 | (3801.98 ms | 137899 tok/s) step 7325/76294 | train loss 3.526812 | norm 0.2141 | lr 9.04e-04 | (3829.20 ms | 136919 tok/s) step 7326/76294 | train loss 3.453918 | norm 0.2253 | lr 9.04e-04 | (3803.43 ms | 137846 tok/s) step 7327/76294 | train loss 3.580759 | norm 0.2471 | lr 9.04e-04 | (3832.85 ms | 136788 tok/s) step 7328/76294 | train loss 3.442904 | norm 0.2173 | lr 9.04e-04 | (3873.35 ms | 135358 tok/s) step 7329/76294 | train loss 3.537258 | norm 0.2353 | lr 9.04e-04 | (3800.53 ms | 137951 tok/s) step 7330/76294 | train loss 3.540382 | norm 0.2493 | lr 9.04e-04 
| (3812.48 ms | 137519 tok/s) step 7331/76294 | train loss 3.489757 | norm 0.2419 | lr 9.04e-04 | (3832.74 ms | 136792 tok/s) step 7332/76294 | train loss 3.493135 | norm 0.1886 | lr 9.04e-04 | (3806.65 ms | 137730 tok/s) step 7333/76294 | train loss 3.522475 | norm 0.2000 | lr 9.03e-04 | (3801.83 ms | 137904 tok/s) step 7334/76294 | train loss 3.552843 | norm 0.2181 | lr 9.03e-04 | (3833.18 ms | 136776 tok/s) step 7335/76294 | train loss 3.502055 | norm 0.1796 | lr 9.03e-04 | (3799.89 ms | 137975 tok/s) step 7336/76294 | train loss 3.517106 | norm 0.2510 | lr 9.03e-04 | (3839.50 ms | 136551 tok/s) step 7337/76294 | train loss 3.504125 | norm 0.1817 | lr 9.03e-04 | (3824.27 ms | 137095 tok/s) step 7338/76294 | train loss 3.487147 | norm 0.2354 | lr 9.03e-04 | (3807.84 ms | 137686 tok/s) step 7339/76294 | train loss 3.427669 | norm 0.2141 | lr 9.03e-04 | (3834.70 ms | 136722 tok/s) step 7340/76294 | train loss 3.426273 | norm 0.2233 | lr 9.03e-04 | (3811.94 ms | 137538 tok/s) step 7341/76294 | train loss 3.517199 | norm 0.2367 | lr 9.03e-04 | (3822.62 ms | 137154 tok/s) step 7342/76294 | train loss 3.525115 | norm 0.2112 | lr 9.03e-04 | (3830.51 ms | 136872 tok/s) step 7343/76294 | train loss 3.511795 | norm 0.2496 | lr 9.03e-04 | (4000.65 ms | 131051 tok/s) step 7344/76294 | train loss 3.506858 | norm 0.2046 | lr 9.03e-04 | (3949.20 ms | 132758 tok/s) step 7345/76294 | train loss 3.439557 | norm 0.2008 | lr 9.02e-04 | (3790.53 ms | 138315 tok/s) step 7346/76294 | train loss 3.498299 | norm 0.2155 | lr 9.02e-04 | (3801.94 ms | 137900 tok/s) step 7347/76294 | train loss 3.496580 | norm 0.2319 | lr 9.02e-04 | (3818.80 ms | 137291 tok/s) step 7348/76294 | train loss 3.496326 | norm 0.1891 | lr 9.02e-04 | (6123.49 ms | 85619 tok/s) step 7349/76294 | train loss 3.495086 | norm 0.2486 | lr 9.02e-04 | (3829.49 ms | 136908 tok/s) step 7350/76294 | train loss 3.501456 | norm 0.1811 | lr 9.02e-04 | (3795.60 ms | 138130 tok/s) step 7351/76294 | train loss 3.591578 | norm 
0.2246 | lr 9.02e-04 | (3866.09 ms | 135612 tok/s) step 7352/76294 | train loss 3.573572 | norm 0.2849 | lr 9.02e-04 | (3796.91 ms | 138083 tok/s) step 7353/76294 | train loss 3.474385 | norm 0.1998 | lr 9.02e-04 | (3800.69 ms | 137945 tok/s) step 7354/76294 | train loss 3.501187 | norm 0.2626 | lr 9.02e-04 | (3817.99 ms | 137320 tok/s) step 7355/76294 | train loss 3.531003 | norm 0.2829 | lr 9.02e-04 | (3829.42 ms | 136911 tok/s) step 7356/76294 | train loss 3.505880 | norm 0.2080 | lr 9.02e-04 | (3798.73 ms | 138017 tok/s) step 7357/76294 | train loss 3.418469 | norm 0.2468 | lr 9.02e-04 | (3863.96 ms | 135687 tok/s) step 7358/76294 | train loss 3.483416 | norm 0.2104 | lr 9.01e-04 | (3796.42 ms | 138101 tok/s) step 7359/76294 | train loss 3.420216 | norm 0.3245 | lr 9.01e-04 | (3824.01 ms | 137104 tok/s) step 7360/76294 | train loss 3.395137 | norm 0.2649 | lr 9.01e-04 | (3802.00 ms | 137898 tok/s) step 7361/76294 | train loss 3.477485 | norm 0.2533 | lr 9.01e-04 | (3805.14 ms | 137784 tok/s) step 7362/76294 | train loss 3.435193 | norm 0.2772 | lr 9.01e-04 | (3823.35 ms | 137128 tok/s) step 7363/76294 | train loss 3.475281 | norm 0.2214 | lr 9.01e-04 | (3809.95 ms | 137610 tok/s) step 7364/76294 | train loss 3.511108 | norm 0.2102 | lr 9.01e-04 | (3950.31 ms | 132721 tok/s) step 7365/76294 | train loss 3.499933 | norm 0.2144 | lr 9.01e-04 | (3808.81 ms | 137651 tok/s) step 7366/76294 | train loss 3.489961 | norm 0.2318 | lr 9.01e-04 | (3813.89 ms | 137468 tok/s) step 7367/76294 | train loss 3.527969 | norm 0.2158 | lr 9.01e-04 | (3824.42 ms | 137090 tok/s) step 7368/76294 | train loss 3.482563 | norm 0.2096 | lr 9.01e-04 | (3804.34 ms | 137813 tok/s) step 7369/76294 | train loss 3.561803 | norm 0.1936 | lr 9.01e-04 | (3827.57 ms | 136977 tok/s) step 7370/76294 | train loss 3.495518 | norm 0.2132 | lr 9.00e-04 | (3810.59 ms | 137587 tok/s) step 7371/76294 | train loss 3.415274 | norm 0.2600 | lr 9.00e-04 | (3810.23 ms | 137600 tok/s) step 7372/76294 | train loss 
3.480389 | norm 0.2692 | lr 9.00e-04 | (3898.53 ms | 134483 tok/s) step 7373/76294 | train loss 3.499052 | norm 0.2251 | lr 9.00e-04 | (3807.99 ms | 137681 tok/s) step 7374/76294 | train loss 3.506471 | norm 0.2302 | lr 9.00e-04 | (3812.06 ms | 137534 tok/s) step 7375/76294 | train loss 3.446324 | norm 0.2985 | lr 9.00e-04 | (3806.98 ms | 137718 tok/s) step 7376/76294 | train loss 3.510818 | norm 0.2068 | lr 9.00e-04 | (3854.23 ms | 136029 tok/s) step 7377/76294 | train loss 3.533415 | norm 0.2647 | lr 9.00e-04 | (3803.70 ms | 137836 tok/s) step 7378/76294 | train loss 3.529395 | norm 0.2065 | lr 9.00e-04 | (3810.52 ms | 137590 tok/s) step 7379/76294 | train loss 3.477879 | norm 0.2359 | lr 9.00e-04 | (3826.96 ms | 136999 tok/s) step 7380/76294 | train loss 3.443175 | norm 0.2050 | lr 9.00e-04 | (3828.79 ms | 136933 tok/s) step 7381/76294 | train loss 3.479566 | norm 0.1972 | lr 9.00e-04 | (3807.75 ms | 137690 tok/s) step 7382/76294 | train loss 3.489911 | norm 0.2436 | lr 9.00e-04 | (3834.09 ms | 136744 tok/s) step 7383/76294 | train loss 3.491516 | norm 0.1979 | lr 8.99e-04 | (3805.60 ms | 137767 tok/s) step 7384/76294 | train loss 3.434730 | norm 0.2228 | lr 8.99e-04 | (3865.74 ms | 135624 tok/s) step 7385/76294 | train loss 3.475462 | norm 0.2104 | lr 8.99e-04 | (3808.43 ms | 137665 tok/s) step 7386/76294 | train loss 3.464737 | norm 0.2161 | lr 8.99e-04 | (3837.14 ms | 136635 tok/s) step 7387/76294 | train loss 3.559891 | norm 0.2135 | lr 8.99e-04 | (3808.06 ms | 137678 tok/s) step 7388/76294 | train loss 3.476437 | norm 0.2090 | lr 8.99e-04 | (4051.66 ms | 129401 tok/s) step 7389/76294 | train loss 3.481913 | norm 0.2652 | lr 8.99e-04 | (3803.85 ms | 137831 tok/s) step 7390/76294 | train loss 3.521382 | norm 0.2065 | lr 8.99e-04 | (3812.95 ms | 137502 tok/s) step 7391/76294 | train loss 3.491225 | norm 0.4372 | lr 8.99e-04 | (3830.01 ms | 136889 tok/s) step 7392/76294 | train loss 3.456918 | norm 0.2593 | lr 8.99e-04 | (3842.30 ms | 136451 tok/s) step 
7393/76294 | train loss 3.618357 | norm 0.2625 | lr 8.99e-04 | (3868.12 ms | 135541 tok/s) step 7394/76294 | train loss 3.411967 | norm 0.2345 | lr 8.99e-04 | (3807.06 ms | 137715 tok/s) step 7395/76294 | train loss 3.555061 | norm 0.2138 | lr 8.98e-04 | (3840.25 ms | 136525 tok/s) step 7396/76294 | train loss 3.492871 | norm 0.2039 | lr 8.98e-04 | (3811.38 ms | 137559 tok/s) step 7397/76294 | train loss 3.489982 | norm 0.2328 | lr 8.98e-04 | (3871.83 ms | 135411 tok/s) step 7398/76294 | train loss 3.505505 | norm 0.1901 | lr 8.98e-04 | (3807.31 ms | 137706 tok/s) step 7399/76294 | train loss 3.509323 | norm 0.2432 | lr 8.98e-04 | (3809.07 ms | 137642 tok/s) step 7400/76294 | train loss 3.429784 | norm 0.2082 | lr 8.98e-04 | (3856.57 ms | 135947 tok/s) step 7401/76294 | train loss 3.409181 | norm 0.1908 | lr 8.98e-04 | (3914.10 ms | 133949 tok/s) step 7402/76294 | train loss 3.513375 | norm 0.2017 | lr 8.98e-04 | (3806.17 ms | 137747 tok/s) step 7403/76294 | train loss 3.475084 | norm 0.1879 | lr 8.98e-04 | (3899.63 ms | 134446 tok/s) step 7404/76294 | train loss 3.494636 | norm 0.2027 | lr 8.98e-04 | (3802.63 ms | 137875 tok/s) step 7405/76294 | train loss 3.526292 | norm 0.2005 | lr 8.98e-04 | (3826.12 ms | 137029 tok/s) step 7406/76294 | train loss 3.428991 | norm 0.2224 | lr 8.98e-04 | (3801.08 ms | 137931 tok/s) step 7407/76294 | train loss 3.462519 | norm 0.2396 | lr 8.97e-04 | (3806.21 ms | 137745 tok/s) step 7408/76294 | train loss 3.496334 | norm 0.2019 | lr 8.97e-04 | (3822.22 ms | 137168 tok/s) step 7409/76294 | train loss 3.467387 | norm 0.2736 | lr 8.97e-04 | (3806.38 ms | 137739 tok/s) step 7410/76294 | train loss 3.480365 | norm 0.2743 | lr 8.97e-04 | (3800.52 ms | 137952 tok/s) step 7411/76294 | train loss 3.444019 | norm 0.2213 | lr 8.97e-04 | (3833.72 ms | 136757 tok/s) step 7412/76294 | train loss 3.445474 | norm 0.2531 | lr 8.97e-04 | (3807.18 ms | 137710 tok/s) step 7413/76294 | train loss 3.448213 | norm 0.2302 | lr 8.97e-04 | (3888.69 ms | 
134824 tok/s) step 7414/76294 | train loss 3.556041 | norm 0.2510 | lr 8.97e-04 | (3806.52 ms | 137734 tok/s) step 7415/76294 | train loss 3.396794 | norm 0.2932 | lr 8.97e-04 | (3808.39 ms | 137667 tok/s) step 7416/76294 | train loss 3.497272 | norm 0.2067 | lr 8.97e-04 | (3827.73 ms | 136971 tok/s) step 7417/76294 | train loss 3.541915 | norm 0.1908 | lr 8.97e-04 | (3805.98 ms | 137754 tok/s) step 7418/76294 | train loss 3.408147 | norm 0.2147 | lr 8.97e-04 | (3805.11 ms | 137785 tok/s) step 7419/76294 | train loss 3.566261 | norm 0.2747 | lr 8.97e-04 | (3842.01 ms | 136462 tok/s) step 7420/76294 | train loss 3.439204 | norm 0.2961 | lr 8.96e-04 | (3800.56 ms | 137950 tok/s) step 7421/76294 | train loss 3.466501 | norm 0.1963 | lr 8.96e-04 | (3809.21 ms | 137637 tok/s) step 7422/76294 | train loss 3.537542 | norm 0.2839 | lr 8.96e-04 | (3826.70 ms | 137008 tok/s) step 7423/76294 | train loss 3.485328 | norm 0.2067 | lr 8.96e-04 | (3806.12 ms | 137749 tok/s) step 7424/76294 | train loss 3.442247 | norm 0.1851 | lr 8.96e-04 | (3807.06 ms | 137715 tok/s) step 7425/76294 | train loss 3.490133 | norm 0.2124 | lr 8.96e-04 | (3829.41 ms | 136911 tok/s) step 7426/76294 | train loss 3.514425 | norm 0.2342 | lr 8.96e-04 | (3807.76 ms | 137689 tok/s) step 7427/76294 | train loss 3.422970 | norm 0.1830 | lr 8.96e-04 | (3803.08 ms | 137859 tok/s) step 7428/76294 | train loss 3.563629 | norm 0.2048 | lr 8.96e-04 | (3832.03 ms | 136817 tok/s) step 7429/76294 | train loss 3.444771 | norm 0.1805 | lr 8.96e-04 | (3804.20 ms | 137818 tok/s) step 7430/76294 | train loss 3.428814 | norm 0.2394 | lr 8.96e-04 | (3859.34 ms | 135849 tok/s) step 7431/76294 | train loss 3.540959 | norm 0.2392 | lr 8.96e-04 | (3848.50 ms | 136232 tok/s) step 7432/76294 | train loss 3.483638 | norm 0.2056 | lr 8.95e-04 | (3810.07 ms | 137606 tok/s) step 7433/76294 | train loss 3.504723 | norm 0.2148 | lr 8.95e-04 | (3833.52 ms | 136764 tok/s) step 7434/76294 | train loss 3.479024 | norm 0.1903 | lr 8.95e-04 
| (3868.84 ms | 135515 tok/s) step 7435/76294 | train loss 3.458085 | norm 0.2074 | lr 8.95e-04 | (3807.02 ms | 137716 tok/s) step 7436/76294 | train loss 3.435059 | norm 0.2355 | lr 8.95e-04 | (3838.72 ms | 136579 tok/s) step 7437/76294 | train loss 3.514276 | norm 0.2376 | lr 8.95e-04 | (3805.07 ms | 137787 tok/s) step 7438/76294 | train loss 3.482373 | norm 0.2460 | lr 8.95e-04 | (3811.02 ms | 137572 tok/s) step 7439/76294 | train loss 3.471211 | norm 0.2263 | lr 8.95e-04 | (4297.21 ms | 122007 tok/s) step 7440/76294 | train loss 3.606812 | norm 0.2203 | lr 8.95e-04 | (3833.22 ms | 136775 tok/s) step 7441/76294 | train loss 3.468624 | norm 0.2156 | lr 8.95e-04 | (3829.44 ms | 136910 tok/s) step 7442/76294 | train loss 3.409941 | norm 0.2239 | lr 8.95e-04 | (3809.28 ms | 137634 tok/s) step 7443/76294 | train loss 3.435219 | norm 0.1830 | lr 8.95e-04 | (3801.78 ms | 137906 tok/s) step 7444/76294 | train loss 3.516171 | norm 0.2065 | lr 8.94e-04 | (3843.82 ms | 136398 tok/s) step 7445/76294 | train loss 3.435527 | norm 0.1912 | lr 8.94e-04 | (3808.33 ms | 137669 tok/s) step 7446/76294 | train loss 3.449406 | norm 0.1913 | lr 8.94e-04 | (3805.60 ms | 137768 tok/s) step 7447/76294 | train loss 3.615465 | norm 0.2211 | lr 8.94e-04 | (3824.51 ms | 137086 tok/s) step 7448/76294 | train loss 3.495220 | norm 0.1849 | lr 8.94e-04 | (3814.18 ms | 137458 tok/s) step 7449/76294 | train loss 3.508884 | norm 0.2154 | lr 8.94e-04 | (3824.71 ms | 137079 tok/s) step 7450/76294 | train loss 3.564975 | norm 0.1852 | lr 8.94e-04 | (3833.06 ms | 136780 tok/s) step 7451/76294 | train loss 3.476051 | norm 0.1867 | lr 8.94e-04 | (3819.80 ms | 137255 tok/s) step 7452/76294 | train loss 3.531348 | norm 0.2490 | lr 8.94e-04 | (3807.82 ms | 137687 tok/s) step 7453/76294 | train loss 3.543633 | norm 0.2946 | lr 8.94e-04 | (3830.82 ms | 136860 tok/s) step 7454/76294 | train loss 3.472106 | norm 0.2609 | lr 8.94e-04 | (3968.07 ms | 132127 tok/s) step 7455/76294 | train loss 3.467931 | norm 
0.2333 | lr 8.94e-04 | (3801.81 ms | 137905 tok/s) step 7456/76294 | train loss 3.472138 | norm 0.2856 | lr 8.94e-04 | (3834.72 ms | 136721 tok/s) step 7457/76294 | train loss 3.521383 | norm 0.1926 | lr 8.93e-04 | (3803.33 ms | 137850 tok/s) step 7458/76294 | train loss 3.486377 | norm 0.2381 | lr 8.93e-04 | (3832.65 ms | 136795 tok/s) step 7459/76294 | train loss 3.515167 | norm 0.2089 | lr 8.93e-04 | (3803.47 ms | 137844 tok/s) step 7460/76294 | train loss 3.584934 | norm 0.4940 | lr 8.93e-04 | (3834.79 ms | 136719 tok/s) step 7461/76294 | train loss 3.466645 | norm 0.3509 | lr 8.93e-04 | (3805.01 ms | 137789 tok/s) step 7462/76294 | train loss 3.474063 | norm 0.2839 | lr 8.93e-04 | (3972.13 ms | 131992 tok/s) step 7463/76294 | train loss 3.570947 | norm 0.2361 | lr 8.93e-04 | (3802.08 ms | 137895 tok/s) step 7464/76294 | train loss 3.569281 | norm 0.2969 | lr 8.93e-04 | (3805.66 ms | 137765 tok/s) step 7465/76294 | train loss 3.472728 | norm 0.2334 | lr 8.93e-04 | (3830.70 ms | 136865 tok/s) step 7466/76294 | train loss 3.487192 | norm 0.2006 | lr 8.93e-04 | (3809.67 ms | 137620 tok/s) step 7467/76294 | train loss 3.510146 | norm 0.2114 | lr 8.93e-04 | (3814.48 ms | 137447 tok/s) step 7468/76294 | train loss 3.466909 | norm 0.1994 | lr 8.93e-04 | (3836.46 ms | 136659 tok/s) step 7469/76294 | train loss 3.543486 | norm 0.2376 | lr 8.92e-04 | (3805.27 ms | 137779 tok/s) step 7470/76294 | train loss 3.485427 | norm 0.2229 | lr 8.92e-04 | (3811.61 ms | 137550 tok/s) step 7471/76294 | train loss 3.452025 | norm 0.2057 | lr 8.92e-04 | (3806.59 ms | 137731 tok/s) step 7472/76294 | train loss 3.507807 | norm 0.2622 | lr 8.92e-04 | (3837.60 ms | 136619 tok/s) step 7473/76294 | train loss 3.565127 | norm 0.2072 | lr 8.92e-04 | (3864.84 ms | 135656 tok/s) step 7474/76294 | train loss 3.523748 | norm 0.2269 | lr 8.92e-04 | (3807.58 ms | 137696 tok/s) step 7475/76294 | train loss 3.423262 | norm 0.2079 | lr 8.92e-04 | (3901.20 ms | 134391 tok/s) step 7476/76294 | train loss 
3.491565 | norm 0.1918 | lr 8.92e-04 | (3808.26 ms | 137671 tok/s) step 7477/76294 | train loss 3.507746 | norm 0.2005 | lr 8.92e-04 | (3832.01 ms | 136818 tok/s) step 7478/76294 | train loss 3.446743 | norm 0.1831 | lr 8.92e-04 | (3802.84 ms | 137867 tok/s) step 7479/76294 | train loss 3.436892 | norm 0.1969 | lr 8.92e-04 | (3838.49 ms | 136587 tok/s) step 7480/76294 | train loss 3.590246 | norm 0.2115 | lr 8.92e-04 | (3811.55 ms | 137552 tok/s) step 7481/76294 | train loss 3.374391 | norm 0.2170 | lr 8.91e-04 | (3832.51 ms | 136800 tok/s) step 7482/76294 | train loss 3.479692 | norm 0.2625 | lr 8.91e-04 | (3806.26 ms | 137744 tok/s) step 7483/76294 | train loss 3.508564 | norm 0.2266 | lr 8.91e-04 | (3834.85 ms | 136717 tok/s) step 7484/76294 | train loss 3.480623 | norm 0.2731 | lr 8.91e-04 | (3806.49 ms | 137735 tok/s) step 7485/76294 | train loss 3.478530 | norm 0.1945 | lr 8.91e-04 | (3838.31 ms | 136593 tok/s) step 7486/76294 | train loss 3.520564 | norm 0.2285 | lr 8.91e-04 | (3829.93 ms | 136892 tok/s) step 7487/76294 | train loss 3.531993 | norm 0.2247 | lr 8.91e-04 | (3807.61 ms | 137695 tok/s) step 7488/76294 | train loss 3.575359 | norm 0.2353 | lr 8.91e-04 | (3803.28 ms | 137851 tok/s) step 7489/76294 | train loss 3.442347 | norm 0.1801 | lr 8.91e-04 | (3839.45 ms | 136553 tok/s) step 7490/76294 | train loss 3.476682 | norm 0.1892 | lr 8.91e-04 | (3804.11 ms | 137822 tok/s) step 7491/76294 | train loss 3.503687 | norm 0.1949 | lr 8.91e-04 | (3830.51 ms | 136872 tok/s) step 7492/76294 | train loss 3.528049 | norm 0.2145 | lr 8.91e-04 | (3801.07 ms | 137932 tok/s) step 7493/76294 | train loss 3.481235 | norm 0.1820 | lr 8.91e-04 | (3807.14 ms | 137712 tok/s) step 7494/76294 | train loss 3.529411 | norm 0.2111 | lr 8.90e-04 | (3833.63 ms | 136760 tok/s) step 7495/76294 | train loss 3.551067 | norm 0.1966 | lr 8.90e-04 | (3815.62 ms | 137406 tok/s) step 7496/76294 | train loss 3.466208 | norm 0.2058 | lr 8.90e-04 | (3800.17 ms | 137964 tok/s) step 
7497/76294 | train loss 3.522644 | norm 0.2072 | lr 8.90e-04 | (3863.65 ms | 135697 tok/s) step 7498/76294 | train loss 3.503054 | norm 0.2237 | lr 8.90e-04 | (3803.20 ms | 137854 tok/s) step 7499/76294 | train loss 3.472116 | norm 0.2218 | lr 8.90e-04 | (3835.64 ms | 136688 tok/s) step 7500/76294 | train loss 3.531670 | norm 0.2391 | lr 8.90e-04 | (3834.47 ms | 136730 tok/s) val loss: 3.486985 saving model checkpoint to ./results/gpt2-124M-gqa/step_7500.pth step 7501/76294 | train loss 3.581090 | norm 0.2667 | lr 8.90e-04 | (3817.63 ms | 137333 tok/s) step 7502/76294 | train loss 3.431817 | norm 0.2471 | lr 8.90e-04 | (3829.59 ms | 136904 tok/s) step 7503/76294 | train loss 3.510302 | norm 0.2271 | lr 8.90e-04 | (3835.63 ms | 136689 tok/s) step 7504/76294 | train loss 3.471146 | norm 0.2247 | lr 8.90e-04 | (3827.05 ms | 136995 tok/s) step 7505/76294 | train loss 3.511778 | norm 0.2090 | lr 8.90e-04 | (3832.32 ms | 136807 tok/s) step 7506/76294 | train loss 3.434259 | norm 0.2269 | lr 8.89e-04 | (3796.46 ms | 138099 tok/s) step 7507/76294 | train loss 3.576557 | norm 0.2044 | lr 8.89e-04 | (3842.02 ms | 136462 tok/s) step 7508/76294 | train loss 3.539412 | norm 0.2068 | lr 8.89e-04 | (3809.69 ms | 137619 tok/s) step 7509/76294 | train loss 3.380920 | norm 0.2720 | lr 8.89e-04 | (3815.73 ms | 137402 tok/s) step 7510/76294 | train loss 3.503171 | norm 0.2641 | lr 8.89e-04 | (3806.95 ms | 137718 tok/s) step 7511/76294 | train loss 3.576291 | norm 0.2157 | lr 8.89e-04 | (3811.15 ms | 137567 tok/s) step 7512/76294 | train loss 3.477492 | norm 0.3183 | lr 8.89e-04 | (3831.51 ms | 136836 tok/s) step 7513/76294 | train loss 3.490942 | norm 0.2540 | lr 8.89e-04 | (3809.83 ms | 137615 tok/s) step 7514/76294 | train loss 3.470333 | norm 0.2159 | lr 8.89e-04 | (3828.87 ms | 136930 tok/s) step 7515/76294 | train loss 3.593545 | norm 0.2358 | lr 8.89e-04 | (3899.61 ms | 134446 tok/s) step 7516/76294 | train loss 3.443609 | norm 0.2244 | lr 8.89e-04 | (3905.84 ms | 134232 tok/s) 
step 7517/76294 | train loss 3.485021 | norm 0.2105 | lr 8.89e-04 | (3800.97 ms | 137935 tok/s) step 7518/76294 | train loss 3.544317 | norm 0.2074 | lr 8.88e-04 | (3815.16 ms | 137422 tok/s) step 7519/76294 | train loss 3.480452 | norm 0.1856 | lr 8.88e-04 | (3839.92 ms | 136536 tok/s) step 7520/76294 | train loss 3.476039 | norm 0.2187 | lr 8.88e-04 | (3814.41 ms | 137449 tok/s) step 7521/76294 | train loss 3.704707 | norm 0.1800 | lr 8.88e-04 | (3843.49 ms | 136409 tok/s) step 7522/76294 | train loss 3.453237 | norm 0.2217 | lr 8.88e-04 | (3827.17 ms | 136991 tok/s) step 7523/76294 | train loss 3.400531 | norm 0.2340 | lr 8.88e-04 | (3834.72 ms | 136721 tok/s) step 7524/76294 | train loss 3.507632 | norm 0.2442 | lr 8.88e-04 | (4146.37 ms | 126445 tok/s) step 7525/76294 | train loss 3.560368 | norm 0.2518 | lr 8.88e-04 | (3800.76 ms | 137943 tok/s) step 7526/76294 | train loss 3.444634 | norm 0.2297 | lr 8.88e-04 | (3864.17 ms | 135679 tok/s) step 7527/76294 | train loss 3.486725 | norm 0.2118 | lr 8.88e-04 | (3804.91 ms | 137793 tok/s) step 7528/76294 | train loss 3.573147 | norm 0.1978 | lr 8.88e-04 | (3821.05 ms | 137211 tok/s) step 7529/76294 | train loss 3.485000 | norm 0.2240 | lr 8.88e-04 | (3804.71 ms | 137800 tok/s) step 7530/76294 | train loss 3.505620 | norm 0.1765 | lr 8.87e-04 | (3809.36 ms | 137632 tok/s) step 7531/76294 | train loss 3.543517 | norm 0.2329 | lr 8.87e-04 | (3824.17 ms | 137099 tok/s) step 7532/76294 | train loss 3.546997 | norm 0.1934 | lr 8.87e-04 | (3808.50 ms | 137663 tok/s) step 7533/76294 | train loss 3.529111 | norm 0.2046 | lr 8.87e-04 | (3805.68 ms | 137765 tok/s) step 7534/76294 | train loss 3.499138 | norm 0.2263 | lr 8.87e-04 | (3835.58 ms | 136691 tok/s) step 7535/76294 | train loss 3.477194 | norm 0.2388 | lr 8.87e-04 | (3805.49 ms | 137772 tok/s) step 7536/76294 | train loss 3.518051 | norm 0.1932 | lr 8.87e-04 | (3859.10 ms | 135858 tok/s) step 7537/76294 | train loss 3.435707 | norm 0.2364 | lr 8.87e-04 | (3861.61 ms 
| 135769 tok/s) step 7538/76294 | train loss 3.538870 | norm 0.1861 | lr 8.87e-04 | (3869.00 ms | 135510 tok/s) step 7539/76294 | train loss 3.570733 | norm 0.2287 | lr 8.87e-04 | (3805.10 ms | 137785 tok/s) step 7540/76294 | train loss 3.476878 | norm 0.2306 | lr 8.87e-04 | (3818.62 ms | 137298 tok/s) step 7541/76294 | train loss 3.443431 | norm 0.2284 | lr 8.87e-04 | (3826.74 ms | 137006 tok/s) step 7542/76294 | train loss 3.543452 | norm 0.2671 | lr 8.87e-04 | (3813.66 ms | 137476 tok/s) step 7543/76294 | train loss 3.493813 | norm 0.2736 | lr 8.86e-04 | (3807.80 ms | 137688 tok/s) step 7544/76294 | train loss 3.433050 | norm 0.2686 | lr 8.86e-04 | (3830.20 ms | 136883 tok/s) step 7545/76294 | train loss 3.505521 | norm 0.3268 | lr 8.86e-04 | (3809.39 ms | 137630 tok/s) step 7546/76294 | train loss 3.499007 | norm 0.2758 | lr 8.86e-04 | (3869.06 ms | 135508 tok/s) step 7547/76294 | train loss 3.533065 | norm 0.2727 | lr 8.86e-04 | (3810.81 ms | 137579 tok/s) step 7548/76294 | train loss 3.470334 | norm 0.2991 | lr 8.86e-04 | (3847.50 ms | 136267 tok/s) step 7549/76294 | train loss 3.466745 | norm 0.2175 | lr 8.86e-04 | (3811.77 ms | 137544 tok/s) step 7550/76294 | train loss 3.550277 | norm 0.2318 | lr 8.86e-04 | (3819.87 ms | 137253 tok/s) step 7551/76294 | train loss 3.512856 | norm 0.2135 | lr 8.86e-04 | (3832.56 ms | 136799 tok/s) step 7552/76294 | train loss 3.517827 | norm 0.2542 | lr 8.86e-04 | (3820.10 ms | 137245 tok/s) step 7553/76294 | train loss 3.476037 | norm 0.1995 | lr 8.86e-04 | (3934.95 ms | 133239 tok/s) step 7554/76294 | train loss 3.461780 | norm 0.2664 | lr 8.86e-04 | (3811.84 ms | 137542 tok/s) step 7555/76294 | train loss 3.487278 | norm 0.2300 | lr 8.85e-04 | (3813.46 ms | 137484 tok/s) step 7556/76294 | train loss 3.452929 | norm 0.1877 | lr 8.85e-04 | (3899.61 ms | 134446 tok/s) step 7557/76294 | train loss 3.547441 | norm 0.2118 | lr 8.85e-04 | (3807.55 ms | 137697 tok/s) step 7558/76294 | train loss 3.550171 | norm 0.2351 | lr 
8.85e-04 | (3839.85 ms | 136539 tok/s) step 7559/76294 | train loss 3.437232 | norm 0.2136 | lr 8.85e-04 | (3842.47 ms | 136446 tok/s) step 7560/76294 | train loss 3.437331 | norm 0.2072 | lr 8.85e-04 | (3807.10 ms | 137713 tok/s) step 7561/76294 | train loss 3.491384 | norm 0.2170 | lr 8.85e-04 | (3810.08 ms | 137605 tok/s) step 7562/76294 | train loss 3.444789 | norm 0.1882 | lr 8.85e-04 | (4385.34 ms | 119555 tok/s) step 7563/76294 | train loss 3.520316 | norm 0.2095 | lr 8.85e-04 | (3989.02 ms | 131433 tok/s) step 7564/76294 | train loss 3.647521 | norm 0.2047 | lr 8.85e-04 | (3847.05 ms | 136283 tok/s) step 7565/76294 | train loss 3.424875 | norm 0.2236 | lr 8.85e-04 | (3835.35 ms | 136699 tok/s) step 7566/76294 | train loss 3.431826 | norm 0.2430 | lr 8.85e-04 | (3831.84 ms | 136824 tok/s) step 7567/76294 | train loss 3.477286 | norm 0.2708 | lr 8.84e-04 | (3869.40 ms | 135496 tok/s) step 7568/76294 | train loss 3.480276 | norm 0.2005 | lr 8.84e-04 | (3844.97 ms | 136357 tok/s) step 7569/76294 | train loss 3.555024 | norm 0.2925 | lr 8.84e-04 | (3831.65 ms | 136831 tok/s) step 7570/76294 | train loss 3.435083 | norm 0.2359 | lr 8.84e-04 | (3801.57 ms | 137913 tok/s) step 7571/76294 | train loss 3.427143 | norm 0.2019 | lr 8.84e-04 | (3863.86 ms | 135690 tok/s) step 7572/76294 | train loss 3.549945 | norm 0.2217 | lr 8.84e-04 | (3802.19 ms | 137891 tok/s) step 7573/76294 | train loss 3.571495 | norm 0.2075 | lr 8.84e-04 | (3853.60 ms | 136052 tok/s) step 7574/76294 | train loss 3.530938 | norm 0.2147 | lr 8.84e-04 | (3802.04 ms | 137897 tok/s) step 7575/76294 | train loss 3.567708 | norm 0.2070 | lr 8.84e-04 | (3806.90 ms | 137720 tok/s) step 7576/76294 | train loss 3.581041 | norm 0.2276 | lr 8.84e-04 | (3826.49 ms | 137015 tok/s) step 7577/76294 | train loss 3.450086 | norm 0.2005 | lr 8.84e-04 | (3803.80 ms | 137833 tok/s) step 7578/76294 | train loss 3.528753 | norm 0.2469 | lr 8.84e-04 | (3803.64 ms | 137839 tok/s) step 7579/76294 | train loss 3.483207 | 
norm 0.1877 | lr 8.83e-04 | (3838.02 ms | 136604 tok/s) step 7580/76294 | train loss 3.481247 | norm 0.2569 | lr 8.83e-04 | (3802.97 ms | 137863 tok/s) step 7581/76294 | train loss 3.446215 | norm 0.2095 | lr 8.83e-04 | (3812.91 ms | 137503 tok/s) step 7582/76294 | train loss 3.490882 | norm 0.1942 | lr 8.83e-04 | (3883.55 ms | 135002 tok/s) step 7583/76294 | train loss 3.480076 | norm 0.2057 | lr 8.83e-04 | (3798.45 ms | 138027 tok/s) step 7584/76294 | train loss 3.492382 | norm 0.1892 | lr 8.83e-04 | (3847.79 ms | 136257 tok/s) step 7585/76294 | train loss 3.496389 | norm 0.2144 | lr 8.83e-04 | (3805.52 ms | 137771 tok/s) step 7586/76294 | train loss 3.473507 | norm 0.1962 | lr 8.83e-04 | (3807.75 ms | 137690 tok/s) step 7587/76294 | train loss 3.544430 | norm 0.2242 | lr 8.83e-04 | (3824.53 ms | 137085 tok/s) step 7588/76294 | train loss 3.479915 | norm 0.2203 | lr 8.83e-04 | (3816.70 ms | 137367 tok/s) step 7589/76294 | train loss 3.459215 | norm 0.2211 | lr 8.83e-04 | (3802.05 ms | 137896 tok/s) step 7590/76294 | train loss 3.494530 | norm 0.2648 | lr 8.83e-04 | (3833.67 ms | 136759 tok/s) step 7591/76294 | train loss 3.533907 | norm 0.2098 | lr 8.82e-04 | (3802.56 ms | 137878 tok/s) step 7592/76294 | train loss 3.456587 | norm 0.1987 | lr 8.82e-04 | (3808.12 ms | 137676 tok/s) step 7593/76294 | train loss 3.563624 | norm 0.2349 | lr 8.82e-04 | (3828.84 ms | 136931 tok/s) step 7594/76294 | train loss 3.652428 | norm 0.2037 | lr 8.82e-04 | (3806.59 ms | 137732 tok/s) step 7595/76294 | train loss 3.475145 | norm 0.2508 | lr 8.82e-04 | (3802.26 ms | 137888 tok/s) step 7596/76294 | train loss 3.477059 | norm 0.2561 | lr 8.82e-04 | (3844.87 ms | 136360 tok/s) step 7597/76294 | train loss 3.454437 | norm 0.2777 | lr 8.82e-04 | (3805.66 ms | 137765 tok/s) step 7598/76294 | train loss 3.427839 | norm 0.3440 | lr 8.82e-04 | (3810.14 ms | 137603 tok/s) step 7599/76294 | train loss 3.436758 | norm 0.2847 | lr 8.82e-04 | (3827.41 ms | 136983 tok/s) step 7600/76294 | train 
loss 3.594323 | norm 0.2185 | lr 8.82e-04 | (3808.13 ms | 137676 tok/s) step 7601/76294 | train loss 3.548799 | norm 0.2624 | lr 8.82e-04 | (3846.65 ms | 136297 tok/s) step 7602/76294 | train loss 3.506368 | norm 0.2079 | lr 8.82e-04 | (3814.82 ms | 137434 tok/s) step 7603/76294 | train loss 3.559168 | norm 0.2304 | lr 8.82e-04 | (3888.64 ms | 134826 tok/s) step 7604/76294 | train loss 3.505531 | norm 0.1977 | lr 8.81e-04 | (3809.74 ms | 137618 tok/s) step 7605/76294 | train loss 3.503553 | norm 0.2243 | lr 8.81e-04 | (3855.31 ms | 135991 tok/s) step 7606/76294 | train loss 3.497881 | norm 0.1826 | lr 8.81e-04 | (3808.79 ms | 137652 tok/s) step 7607/76294 | train loss 3.524005 | norm 0.2264 | lr 8.81e-04 | (3818.83 ms | 137290 tok/s) step 7608/76294 | train loss 3.501017 | norm 0.1757 | lr 8.81e-04 | (3839.21 ms | 136562 tok/s) step 7609/76294 | train loss 3.530287 | norm 0.2085 | lr 8.81e-04 | (3821.53 ms | 137193 tok/s) step 7610/76294 | train loss 3.465156 | norm 0.1831 | lr 8.81e-04 | (3808.53 ms | 137661 tok/s) step 7611/76294 | train loss 3.496017 | norm 0.1762 | lr 8.81e-04 | (3836.54 ms | 136657 tok/s) step 7612/76294 | train loss 3.505898 | norm 0.2057 | lr 8.81e-04 | (3803.30 ms | 137851 tok/s) step 7613/76294 | train loss 3.486872 | norm 0.1984 | lr 8.81e-04 | (3810.70 ms | 137583 tok/s) step 7614/76294 | train loss 3.468312 | norm 0.1768 | lr 8.81e-04 | (3832.10 ms | 136815 tok/s) step 7615/76294 | train loss 3.509733 | norm 0.2394 | lr 8.81e-04 | (3812.38 ms | 137523 tok/s) step 7616/76294 | train loss 3.473759 | norm 0.1892 | lr 8.80e-04 | (3830.38 ms | 136876 tok/s) step 7617/76294 | train loss 3.461637 | norm 0.3175 | lr 8.80e-04 | (3807.34 ms | 137705 tok/s) step 7618/76294 | train loss 3.504002 | norm 0.1899 | lr 8.80e-04 | (3804.66 ms | 137802 tok/s) step 7619/76294 | train loss 3.539474 | norm 0.2475 | lr 8.80e-04 | (3833.66 ms | 136759 tok/s) step 7620/76294 | train loss 3.488909 | norm 0.2241 | lr 8.80e-04 | (3807.12 ms | 137713 tok/s) step 
7621/76294 | train loss 3.465578 | norm 0.2296 | lr 8.80e-04 | (3810.72 ms | 137582 tok/s) step 7622/76294 | train loss 3.564970 | norm 0.1957 | lr 8.80e-04 | (3830.61 ms | 136868 tok/s) step 7623/76294 | train loss 3.470306 | norm 0.2097 | lr 8.80e-04 | (3809.53 ms | 137625 tok/s) step 7624/76294 | train loss 3.494541 | norm 0.2501 | lr 8.80e-04 | (3880.22 ms | 135118 tok/s) step 7625/76294 | train loss 3.477974 | norm 0.2314 | lr 8.80e-04 | (3810.79 ms | 137580 tok/s) step 7626/76294 | train loss 3.497878 | norm 0.2930 | lr 8.80e-04 | (3829.44 ms | 136910 tok/s) step 7627/76294 | train loss 3.492315 | norm 0.2120 | lr 8.80e-04 | (3823.55 ms | 137121 tok/s) step 7628/76294 | train loss 3.483981 | norm 0.2312 | lr 8.79e-04 | (3810.87 ms | 137577 tok/s) step 7629/76294 | train loss 3.497784 | norm 0.2145 | lr 8.79e-04 | (3925.37 ms | 133564 tok/s) step 7630/76294 | train loss 3.458128 | norm 0.2505 | lr 8.79e-04 | (4222.69 ms | 124160 tok/s) step 7631/76294 | train loss 3.536021 | norm 0.2029 | lr 8.79e-04 | (3836.64 ms | 136653 tok/s) step 7632/76294 | train loss 3.527678 | norm 0.2430 | lr 8.79e-04 | (3808.72 ms | 137655 tok/s) step 7633/76294 | train loss 3.453347 | norm 0.2327 | lr 8.79e-04 | (3804.27 ms | 137816 tok/s) step 7634/76294 | train loss 3.503426 | norm 0.2347 | lr 8.79e-04 | (3867.86 ms | 135550 tok/s) step 7635/76294 | train loss 3.516007 | norm 0.2004 | lr 8.79e-04 | (3805.86 ms | 137758 tok/s) step 7636/76294 | train loss 3.535815 | norm 0.2106 | lr 8.79e-04 | (3830.72 ms | 136864 tok/s) step 7637/76294 | train loss 3.444542 | norm 0.1959 | lr 8.79e-04 | (3831.04 ms | 136853 tok/s) step 7638/76294 | train loss 3.506985 | norm 0.2032 | lr 8.79e-04 | (3810.05 ms | 137606 tok/s) step 7639/76294 | train loss 3.532557 | norm 0.2346 | lr 8.79e-04 | (3804.13 ms | 137821 tok/s) step 7640/76294 | train loss 3.520696 | norm 0.2371 | lr 8.78e-04 | (3844.89 ms | 136360 tok/s) step 7641/76294 | train loss 3.488926 | norm 0.2367 | lr 8.78e-04 | (3804.28 ms | 
137815 tok/s) step 7642/76294 | train loss 3.528256 | norm 0.2914 | lr 8.78e-04 | (3809.75 ms | 137617 tok/s) step 7643/76294 | train loss 3.533229 | norm 0.2215 | lr 8.78e-04 | (3835.23 ms | 136703 tok/s) step 7644/76294 | train loss 3.565720 | norm 0.2121 | lr 8.78e-04 | (3890.07 ms | 134776 tok/s) step 7645/76294 | train loss 3.605260 | norm 0.2200 | lr 8.78e-04 | (3803.96 ms | 137827 tok/s) step 7646/76294 | train loss 3.486162 | norm 0.2468 | lr 8.78e-04 | (3814.29 ms | 137454 tok/s) step 7647/76294 | train loss 3.484839 | norm 0.2607 | lr 8.78e-04 | (3813.11 ms | 137496 tok/s) step 7648/76294 | train loss 3.467287 | norm 0.2056 | lr 8.78e-04 | (3831.45 ms | 136838 tok/s) step 7649/76294 | train loss 3.503555 | norm 0.2449 | lr 8.78e-04 | (3807.53 ms | 137698 tok/s) step 7650/76294 | train loss 3.456737 | norm 0.2085 | lr 8.78e-04 | (3802.21 ms | 137890 tok/s) step 7651/76294 | train loss 3.500781 | norm 0.1917 | lr 8.78e-04 | (3847.75 ms | 136258 tok/s) step 7652/76294 | train loss 3.453511 | norm 0.3259 | lr 8.77e-04 | (3832.06 ms | 136816 tok/s) step 7653/76294 | train loss 3.507417 | norm 0.2530 | lr 8.77e-04 | (3818.13 ms | 137316 tok/s) step 7654/76294 | train loss 3.601213 | norm 0.4563 | lr 8.77e-04 | (3827.79 ms | 136969 tok/s) step 7655/76294 | train loss 3.545229 | norm 0.3094 | lr 8.77e-04 | (3811.67 ms | 137548 tok/s) step 7656/76294 | train loss 3.495043 | norm 0.2200 | lr 8.77e-04 | (3803.59 ms | 137840 tok/s) step 7657/76294 | train loss 3.554148 | norm 0.2458 | lr 8.77e-04 | (3843.00 ms | 136427 tok/s) step 7658/76294 | train loss 3.492199 | norm 0.1908 | lr 8.77e-04 | (3806.03 ms | 137752 tok/s) step 7659/76294 | train loss 3.519881 | norm 0.2196 | lr 8.77e-04 | (3809.09 ms | 137641 tok/s) step 7660/76294 | train loss 3.496348 | norm 0.1968 | lr 8.77e-04 | (3839.73 ms | 136543 tok/s) step 7661/76294 | train loss 3.521231 | norm 0.2327 | lr 8.77e-04 | (3811.91 ms | 137540 tok/s) step 7662/76294 | train loss 3.490460 | norm 0.2428 | lr 8.77e-04 
| (3811.82 ms | 137543 tok/s) step 7663/76294 | train loss 3.505006 | norm 0.2397 | lr 8.77e-04 | (3807.95 ms | 137683 tok/s) step 7664/76294 | train loss 3.506176 | norm 0.2827 | lr 8.76e-04 | (3848.52 ms | 136231 tok/s) step 7665/76294 | train loss 3.448885 | norm 0.2414 | lr 8.76e-04 | (3908.38 ms | 134145 tok/s) step 7666/76294 | train loss 3.470859 | norm 0.2415 | lr 8.76e-04 | (3801.21 ms | 137927 tok/s) step 7667/76294 | train loss 3.551636 | norm 0.2349 | lr 8.76e-04 | (3807.14 ms | 137712 tok/s) step 7668/76294 | train loss 3.353583 | norm 0.2414 | lr 8.76e-04 | (3823.36 ms | 137127 tok/s) step 7669/76294 | train loss 3.534674 | norm 0.2165 | lr 8.76e-04 | (3808.48 ms | 137663 tok/s) step 7670/76294 | train loss 3.584216 | norm 0.2167 | lr 8.76e-04 | (3806.01 ms | 137753 tok/s) step 7671/76294 | train loss 3.544708 | norm 0.1845 | lr 8.76e-04 | (3839.04 ms | 136568 tok/s) step 7672/76294 | train loss 3.520161 | norm 0.2134 | lr 8.76e-04 | (3805.71 ms | 137763 tok/s) step 7673/76294 | train loss 3.575759 | norm 0.2134 | lr 8.76e-04 | (3815.17 ms | 137422 tok/s) step 7674/76294 | train loss 3.556648 | norm 0.1901 | lr 8.76e-04 | (3830.97 ms | 136855 tok/s) step 7675/76294 | train loss 3.531054 | norm 0.2214 | lr 8.76e-04 | (3808.91 ms | 137648 tok/s) step 7676/76294 | train loss 3.435720 | norm 0.2523 | lr 8.76e-04 | (3807.65 ms | 137693 tok/s) step 7677/76294 | train loss 3.494180 | norm 0.2997 | lr 8.75e-04 | (3843.47 ms | 136410 tok/s) step 7678/76294 | train loss 3.537463 | norm 0.2187 | lr 8.75e-04 | (3804.68 ms | 137801 tok/s) step 7679/76294 | train loss 3.567894 | norm 0.2560 | lr 8.75e-04 | (3809.81 ms | 137615 tok/s) step 7680/76294 | train loss 3.583205 | norm 0.2043 | lr 8.75e-04 | (3837.77 ms | 136613 tok/s) step 7681/76294 | train loss 3.569407 | norm 0.2492 | lr 8.75e-04 | (3804.78 ms | 137797 tok/s) step 7682/76294 | train loss 3.516505 | norm 0.2456 | lr 8.75e-04 | (3809.39 ms | 137630 tok/s) step 7683/76294 | train loss 3.524178 | norm 
0.2018 | lr 8.75e-04 | (3808.57 ms | 137660 tok/s) step 7684/76294 | train loss 3.487419 | norm 0.2308 | lr 8.75e-04 | (3805.88 ms | 137757 tok/s) step 7685/76294 | train loss 3.468642 | norm 0.2210 | lr 8.75e-04 | (3916.65 ms | 133861 tok/s) step 7686/76294 | train loss 3.585758 | norm 0.2221 | lr 8.75e-04 | (3836.84 ms | 136646 tok/s) step 7687/76294 | train loss 3.505608 | norm 0.2603 | lr 8.75e-04 | (3806.27 ms | 137743 tok/s) step 7688/76294 | train loss 3.482753 | norm 0.2116 | lr 8.75e-04 | (5810.42 ms | 90232 tok/s) step 7689/76294 | train loss 3.517383 | norm 0.2592 | lr 8.74e-04 | (6321.69 ms | 82935 tok/s) step 7690/76294 | train loss 3.452591 | norm 0.2497 | lr 8.74e-04 | (3818.34 ms | 137308 tok/s) step 7691/76294 | train loss 3.476141 | norm 0.2266 | lr 8.74e-04 | (3817.40 ms | 137342 tok/s) step 7692/76294 | train loss 3.537853 | norm 0.2685 | lr 8.74e-04 | (3798.89 ms | 138011 tok/s) step 7693/76294 | train loss 3.415117 | norm 0.2042 | lr 8.74e-04 | (3818.60 ms | 137298 tok/s) step 7694/76294 | train loss 3.479044 | norm 0.2110 | lr 8.74e-04 | (3801.54 ms | 137915 tok/s) step 7695/76294 | train loss 3.523052 | norm 0.2217 | lr 8.74e-04 | (3823.54 ms | 137121 tok/s) step 7696/76294 | train loss 3.521836 | norm 0.2198 | lr 8.74e-04 | (3802.23 ms | 137890 tok/s) step 7697/76294 | train loss 3.628655 | norm 0.2390 | lr 8.74e-04 | (3806.14 ms | 137748 tok/s) step 7698/76294 | train loss 3.635651 | norm 0.3534 | lr 8.74e-04 | (3807.97 ms | 137682 tok/s) step 7699/76294 | train loss 3.542894 | norm 0.2811 | lr 8.74e-04 | (3803.35 ms | 137849 tok/s) step 7700/76294 | train loss 3.533522 | norm 0.2385 | lr 8.74e-04 | (3828.66 ms | 136938 tok/s) step 7701/76294 | train loss 3.535460 | norm 0.2537 | lr 8.73e-04 | (3802.33 ms | 137886 tok/s) step 7702/76294 | train loss 3.517634 | norm 0.2626 | lr 8.73e-04 | (3803.10 ms | 137858 tok/s) step 7703/76294 | train loss 3.469939 | norm 0.2968 | lr 8.73e-04 | (3831.15 ms | 136849 tok/s) step 7704/76294 | train loss 
3.560181 | norm 0.2098 | lr 8.73e-04 | (3810.00 ms | 137609 tok/s) step 7705/76294 | train loss 3.560623 | norm 0.2470 | lr 8.73e-04 | (3800.40 ms | 137956 tok/s) step 7706/76294 | train loss 3.470861 | norm 0.1812 | lr 8.73e-04 | (3829.62 ms | 136903 tok/s) step 7707/76294 | train loss 3.495766 | norm 0.2213 | lr 8.73e-04 | (3804.08 ms | 137822 tok/s) step 7708/76294 | train loss 3.486362 | norm 0.1929 | lr 8.73e-04 | (3805.49 ms | 137771 tok/s) step 7709/76294 | train loss 3.484344 | norm 0.1941 | lr 8.73e-04 | (3825.15 ms | 137063 tok/s) step 7710/76294 | train loss 3.483979 | norm 0.1884 | lr 8.73e-04 | (3800.81 ms | 137941 tok/s) step 7711/76294 | train loss 3.466513 | norm 0.1833 | lr 8.73e-04 | (3824.00 ms | 137105 tok/s) step 7712/76294 | train loss 3.562372 | norm 0.1988 | lr 8.73e-04 | (3825.08 ms | 137066 tok/s) step 7713/76294 | train loss 3.496274 | norm 0.2042 | lr 8.72e-04 | (3832.39 ms | 136805 tok/s) step 7714/76294 | train loss 3.470822 | norm 0.2209 | lr 8.72e-04 | (3798.54 ms | 138024 tok/s) step 7715/76294 | train loss 3.545155 | norm 0.2598 | lr 8.72e-04 | (3852.74 ms | 136082 tok/s) step 7716/76294 | train loss 3.540504 | norm 0.2003 | lr 8.72e-04 | (3800.48 ms | 137953 tok/s) step 7717/76294 | train loss 3.591308 | norm 0.2419 | lr 8.72e-04 | (3831.10 ms | 136850 tok/s) step 7718/76294 | train loss 3.490401 | norm 0.2147 | lr 8.72e-04 | (3826.85 ms | 137002 tok/s) step 7719/76294 | train loss 3.508945 | norm 0.2154 | lr 8.72e-04 | (3802.77 ms | 137870 tok/s) step 7720/76294 | train loss 3.476300 | norm 0.2124 | lr 8.72e-04 | (3802.40 ms | 137883 tok/s) step 7721/76294 | train loss 3.568842 | norm 0.1982 | lr 8.72e-04 | (3833.39 ms | 136769 tok/s) step 7722/76294 | train loss 3.565594 | norm 0.3207 | lr 8.72e-04 | (3803.39 ms | 137847 tok/s) step 7723/76294 | train loss 3.500958 | norm 0.2538 | lr 8.72e-04 | (3831.31 ms | 136843 tok/s) step 7724/76294 | train loss 3.500474 | norm 0.2531 | lr 8.72e-04 | (3831.49 ms | 136836 tok/s) step 
7725/76294 | train loss 3.514838 | norm 0.2103 | lr 8.71e-04 | (3827.74 ms | 136971 tok/s) step 7726/76294 | train loss 3.458495 | norm 0.2629 | lr 8.71e-04 | (3968.21 ms | 132122 tok/s) step 7727/76294 | train loss 3.500083 | norm 0.2245 | lr 8.71e-04 | (3804.72 ms | 137799 tok/s) step 7728/76294 | train loss 3.556830 | norm 0.1942 | lr 8.71e-04 | (3808.81 ms | 137651 tok/s) step 7729/76294 | train loss 3.456768 | norm 0.2469 | lr 8.71e-04 | (3823.85 ms | 137110 tok/s) step 7730/76294 | train loss 3.448611 | norm 0.1924 | lr 8.71e-04 | (3907.80 ms | 134164 tok/s) step 7731/76294 | train loss 3.490988 | norm 0.2046 | lr 8.71e-04 | (3799.19 ms | 138000 tok/s) step 7732/76294 | train loss 3.528739 | norm 0.1860 | lr 8.71e-04 | (3830.43 ms | 136874 tok/s) step 7733/76294 | train loss 3.466277 | norm 0.1909 | lr 8.71e-04 | (3821.22 ms | 137204 tok/s) step 7734/76294 | train loss 3.503148 | norm 0.1879 | lr 8.71e-04 | (3798.14 ms | 138038 tok/s) step 7735/76294 | train loss 3.566367 | norm 0.1924 | lr 8.71e-04 | (3800.24 ms | 137962 tok/s) step 7736/76294 | train loss 3.494774 | norm 0.1679 | lr 8.71e-04 | (3843.78 ms | 136399 tok/s) step 7737/76294 | train loss 3.533830 | norm 0.2042 | lr 8.70e-04 | (3799.22 ms | 137999 tok/s) step 7738/76294 | train loss 3.465495 | norm 0.1950 | lr 8.70e-04 | (3810.35 ms | 137596 tok/s) step 7739/76294 | train loss 3.452292 | norm 0.2168 | lr 8.70e-04 | (3831.39 ms | 136840 tok/s) step 7740/76294 | train loss 3.514156 | norm 0.2171 | lr 8.70e-04 | (3803.77 ms | 137834 tok/s) step 7741/76294 | train loss 3.506096 | norm 0.1801 | lr 8.70e-04 | (3804.90 ms | 137793 tok/s) step 7742/76294 | train loss 3.489912 | norm 0.2480 | lr 8.70e-04 | (3828.65 ms | 136938 tok/s) step 7743/76294 | train loss 3.471375 | norm 0.2276 | lr 8.70e-04 | (3802.78 ms | 137870 tok/s) step 7744/76294 | train loss 3.524013 | norm 0.2083 | lr 8.70e-04 | (3878.41 ms | 135181 tok/s) step 7745/76294 | train loss 3.459316 | norm 0.1821 | lr 8.70e-04 | (3799.92 ms | 
137973 tok/s) step 7746/76294 | train loss 3.500300 | norm 0.1971 | lr 8.70e-04 | (3881.72 ms | 135066 tok/s) step 7747/76294 | train loss 3.552678 | norm 0.1973 | lr 8.70e-04 | (3803.28 ms | 137851 tok/s) step 7748/76294 | train loss 3.512185 | norm 0.1827 | lr 8.70e-04 | (3815.44 ms | 137412 tok/s) step 7749/76294 | train loss 3.494856 | norm 0.1828 | lr 8.69e-04 | (3827.74 ms | 136971 tok/s) step 7750/76294 | train loss 3.460568 | norm 0.1889 | lr 8.69e-04 | (3805.35 ms | 137777 tok/s) val loss: 3.475483 saving model checkpoint to ./results/gpt2-124M-gqa/step_7750.pth step 7751/76294 | train loss 3.451145 | norm 0.1841 | lr 8.69e-04 | (3825.27 ms | 137059 tok/s) step 7752/76294 | train loss 3.499653 | norm 0.1813 | lr 8.69e-04 | (3796.52 ms | 138097 tok/s) step 7753/76294 | train loss 3.532181 | norm 0.1881 | lr 8.69e-04 | (3814.60 ms | 137442 tok/s) step 7754/76294 | train loss 3.468264 | norm 0.1824 | lr 8.69e-04 | (3801.49 ms | 137916 tok/s) step 7755/76294 | train loss 3.564389 | norm 0.2032 | lr 8.69e-04 | (3829.83 ms | 136896 tok/s) step 7756/76294 | train loss 3.507815 | norm 0.2295 | lr 8.69e-04 | (3825.94 ms | 137035 tok/s) step 7757/76294 | train loss 3.477288 | norm 0.1821 | lr 8.69e-04 | (3833.50 ms | 136765 tok/s) step 7758/76294 | train loss 3.484743 | norm 0.2129 | lr 8.69e-04 | (3805.78 ms | 137761 tok/s) step 7759/76294 | train loss 3.506141 | norm 0.2805 | lr 8.69e-04 | (3847.99 ms | 136250 tok/s) step 7760/76294 | train loss 3.434865 | norm 0.3194 | lr 8.69e-04 | (3810.88 ms | 137577 tok/s) step 7761/76294 | train loss 3.492251 | norm 0.2017 | lr 8.68e-04 | (3833.23 ms | 136774 tok/s) step 7762/76294 | train loss 3.603458 | norm 0.2476 | lr 8.68e-04 | (3806.09 ms | 137750 tok/s) step 7763/76294 | train loss 3.529693 | norm 0.2276 | lr 8.68e-04 | (3832.02 ms | 136818 tok/s) step 7764/76294 | train loss 3.470207 | norm 0.2668 | lr 8.68e-04 | (3805.71 ms | 137764 tok/s) step 7765/76294 | train loss 3.464937 | norm 0.2999 | lr 8.68e-04 | (3808.87 
ms | 137649 tok/s) step 7766/76294 | train loss 3.517467 | norm 0.2079 | lr 8.68e-04 | (3849.29 ms | 136204 tok/s) step 7767/76294 | train loss 3.553560 | norm 0.3045 | lr 8.68e-04 | (3832.30 ms | 136808 tok/s) step 7768/76294 | train loss 3.544197 | norm 0.1904 | lr 8.68e-04 | (3914.91 ms | 133921 tok/s) step 7769/76294 | train loss 3.475902 | norm 0.2214 | lr 8.68e-04 | (3804.49 ms | 137808 tok/s) step 7770/76294 | train loss 3.504172 | norm 0.2055 | lr 8.68e-04 | (3833.50 ms | 136765 tok/s) step 7771/76294 | train loss 3.552566 | norm 0.2005 | lr 8.68e-04 | (3802.43 ms | 137882 tok/s) step 7772/76294 | train loss 3.469144 | norm 0.1987 | lr 8.68e-04 | (12438.47 ms | 42151 tok/s) step 7773/76294 | train loss 3.618586 | norm 0.2475 | lr 8.67e-04 | (3881.38 ms | 135078 tok/s) step 7774/76294 | train loss 3.527570 | norm 0.3054 | lr 8.67e-04 | (3785.00 ms | 138517 tok/s) step 7775/76294 | train loss 3.539601 | norm 0.2200 | lr 8.67e-04 | (3820.14 ms | 137243 tok/s) step 7776/76294 | train loss 3.497873 | norm 0.2651 | lr 8.67e-04 | (3789.46 ms | 138354 tok/s) step 7777/76294 | train loss 3.512630 | norm 0.2275 | lr 8.67e-04 | (3936.20 ms | 133196 tok/s) step 7778/76294 | train loss 3.585334 | norm 0.2351 | lr 8.67e-04 | (3793.52 ms | 138206 tok/s) step 7779/76294 | train loss 3.568906 | norm 0.2628 | lr 8.67e-04 | (3799.69 ms | 137982 tok/s) step 7780/76294 | train loss 3.525799 | norm 0.2202 | lr 8.67e-04 | (3810.29 ms | 137598 tok/s) step 7781/76294 | train loss 3.515284 | norm 0.2784 | lr 8.67e-04 | (3798.00 ms | 138043 tok/s) step 7782/76294 | train loss 3.606121 | norm 0.2656 | lr 8.67e-04 | (3791.59 ms | 138277 tok/s) step 7783/76294 | train loss 3.472624 | norm 0.2529 | lr 8.67e-04 | (3824.05 ms | 137103 tok/s) step 7784/76294 | train loss 3.520427 | norm 0.2080 | lr 8.67e-04 | (3795.26 ms | 138143 tok/s) step 7785/76294 | train loss 3.559651 | norm 0.3321 | lr 8.66e-04 | (3797.89 ms | 138047 tok/s) step 7786/76294 | train loss 3.402602 | norm 0.2460 | lr 
8.66e-04 | (3824.97 ms | 137070 tok/s) step 7787/76294 | train loss 3.521146 | norm 0.2791 | lr 8.66e-04 | (3802.70 ms | 137873 tok/s) step 7788/76294 | train loss 3.503256 | norm 0.2307 | lr 8.66e-04 | (3821.23 ms | 137204 tok/s) step 7789/76294 | train loss 3.508741 | norm 0.2669 | lr 8.66e-04 | (3833.41 ms | 136768 tok/s) step 7790/76294 | train loss 3.482766 | norm 0.2373 | lr 8.66e-04 | (3796.12 ms | 138112 tok/s) step 7791/76294 | train loss 3.505640 | norm 0.2071 | lr 8.66e-04 | (3855.35 ms | 135990 tok/s) step 7792/76294 | train loss 3.507021 | norm 0.2075 | lr 8.66e-04 | (3798.61 ms | 138021 tok/s) step 7793/76294 | train loss 3.599594 | norm 0.2612 | lr 8.66e-04 | (3824.96 ms | 137070 tok/s) step 7794/76294 | train loss 3.491369 | norm 0.2592 | lr 8.66e-04 | (3824.56 ms | 137084 tok/s) step 7795/76294 | train loss 3.479685 | norm 0.2352 | lr 8.66e-04 | (3804.48 ms | 137808 tok/s) step 7796/76294 | train loss 3.529695 | norm 0.3345 | lr 8.66e-04 | (3820.81 ms | 137219 tok/s) step 7797/76294 | train loss 3.510499 | norm 0.2868 | lr 8.65e-04 | (3805.56 ms | 137769 tok/s) step 7798/76294 | train loss 3.519592 | norm 0.2403 | lr 8.65e-04 | (3800.85 ms | 137940 tok/s) step 7799/76294 | train loss 3.562029 | norm 0.2655 | lr 8.65e-04 | (3832.72 ms | 136793 tok/s) step 7800/76294 | train loss 3.456379 | norm 0.1953 | lr 8.65e-04 | (3799.86 ms | 137976 tok/s) step 7801/76294 | train loss 3.522462 | norm 0.2030 | lr 8.65e-04 | (3809.40 ms | 137630 tok/s) step 7802/76294 | train loss 3.597619 | norm 0.2109 | lr 8.65e-04 | (3825.90 ms | 137036 tok/s) step 7803/76294 | train loss 3.499242 | norm 0.2040 | lr 8.65e-04 | (3871.48 ms | 135423 tok/s) step 7804/76294 | train loss 3.406933 | norm 0.1910 | lr 8.65e-04 | (3810.19 ms | 137602 tok/s) step 7805/76294 | train loss 3.518844 | norm 0.2139 | lr 8.65e-04 | (3857.30 ms | 135921 tok/s) step 7806/76294 | train loss 3.451405 | norm 0.2235 | lr 8.65e-04 | (3802.10 ms | 137894 tok/s) step 7807/76294 | train loss 3.495151 | 
norm 0.2201 | lr 8.65e-04 | (3834.57 ms | 136727 tok/s) step 7808/76294 | train loss 3.578496 | norm 0.2236 | lr 8.65e-04 | (3806.50 ms | 137735 tok/s) step 7809/76294 | train loss 3.512398 | norm 0.2076 | lr 8.64e-04 | (3828.27 ms | 136952 tok/s) step 7810/76294 | train loss 3.510826 | norm 0.2053 | lr 8.64e-04 | (3831.36 ms | 136841 tok/s) step 7811/76294 | train loss 3.528138 | norm 0.2215 | lr 8.64e-04 | (3832.97 ms | 136784 tok/s) step 7812/76294 | train loss 3.492610 | norm 0.1939 | lr 8.64e-04 | (3802.67 ms | 137874 tok/s) step 7813/76294 | train loss 3.502892 | norm 0.1918 | lr 8.64e-04 | (3869.03 ms | 135509 tok/s) step 7814/76294 | train loss 3.540268 | norm 0.2866 | lr 8.64e-04 | (3805.31 ms | 137778 tok/s) step 7815/76294 | train loss 3.497566 | norm 0.2760 | lr 8.64e-04 | (3812.73 ms | 137510 tok/s) step 7816/76294 | train loss 3.531259 | norm 0.1958 | lr 8.64e-04 | (3826.14 ms | 137028 tok/s) step 7817/76294 | train loss 3.499311 | norm 0.3023 | lr 8.64e-04 | (3810.10 ms | 137605 tok/s) step 7818/76294 | train loss 3.510034 | norm 0.1763 | lr 8.64e-04 | (3802.69 ms | 137873 tok/s) step 7819/76294 | train loss 3.525517 | norm 0.2538 | lr 8.64e-04 | (3828.52 ms | 136943 tok/s) step 7820/76294 | train loss 3.497628 | norm 0.1934 | lr 8.64e-04 | (3805.86 ms | 137758 tok/s) step 7821/76294 | train loss 3.502702 | norm 0.2287 | lr 8.63e-04 | (4756.72 ms | 110221 tok/s) step 7822/76294 | train loss 3.450160 | norm 0.2002 | lr 8.63e-04 | (3824.86 ms | 137074 tok/s) step 7823/76294 | train loss 3.457813 | norm 0.1995 | lr 8.63e-04 | (3876.62 ms | 135244 tok/s) step 7824/76294 | train loss 3.463954 | norm 0.2081 | lr 8.63e-04 | (3806.70 ms | 137728 tok/s) step 7825/76294 | train loss 3.480930 | norm 0.2202 | lr 8.63e-04 | (3815.94 ms | 137394 tok/s) step 7826/76294 | train loss 3.553191 | norm 0.2280 | lr 8.63e-04 | (3827.87 ms | 136966 tok/s) step 7827/76294 | train loss 3.565216 | norm 0.2335 | lr 8.63e-04 | (3811.28 ms | 137562 tok/s) step 7828/76294 | train 
loss 3.483645 | norm 0.2228 | lr 8.63e-04 | (3805.67 ms | 137765 tok/s) step 7829/76294 | train loss 3.472496 | norm 0.2077 | lr 8.63e-04 | (3827.00 ms | 136997 tok/s) step 7830/76294 | train loss 3.479940 | norm 0.2056 | lr 8.63e-04 | (3815.22 ms | 137420 tok/s) step 7831/76294 | train loss 3.379614 | norm 0.1994 | lr 8.63e-04 | (3827.21 ms | 136990 tok/s) step 7832/76294 | train loss 3.453407 | norm 0.2012 | lr 8.63e-04 | (3807.11 ms | 137713 tok/s) step 7833/76294 | train loss 3.509891 | norm 0.1903 | lr 8.62e-04 | (3803.37 ms | 137848 tok/s) step 7834/76294 | train loss 3.447098 | norm 0.2141 | lr 8.62e-04 | (3833.64 ms | 136760 tok/s) step 7835/76294 | train loss 3.572127 | norm 0.3048 | lr 8.62e-04 | (3803.64 ms | 137838 tok/s) step 7836/76294 | train loss 3.450831 | norm 0.3493 | lr 8.62e-04 | (3825.03 ms | 137068 tok/s) step 7837/76294 | train loss 3.474450 | norm 0.2064 | lr 8.62e-04 | (3834.50 ms | 136729 tok/s) step 7838/76294 | train loss 3.510656 | norm 0.2814 | lr 8.62e-04 | (3808.89 ms | 137649 tok/s) step 7839/76294 | train loss 3.509819 | norm 0.2007 | lr 8.62e-04 | (3806.98 ms | 137718 tok/s) step 7840/76294 | train loss 3.437769 | norm 0.2138 | lr 8.62e-04 | (3833.73 ms | 136757 tok/s) step 7841/76294 | train loss 3.464323 | norm 0.2212 | lr 8.62e-04 | (3813.79 ms | 137472 tok/s) step 7842/76294 | train loss 3.541058 | norm 0.2311 | lr 8.62e-04 | (3810.26 ms | 137599 tok/s) step 7843/76294 | train loss 3.443635 | norm 0.2287 | lr 8.62e-04 | (3826.65 ms | 137010 tok/s) step 7844/76294 | train loss 3.492874 | norm 0.2022 | lr 8.62e-04 | (3879.40 ms | 135147 tok/s) step 7845/76294 | train loss 3.533213 | norm 0.2029 | lr 8.61e-04 | (3807.83 ms | 137687 tok/s) step 7846/76294 | train loss 3.487060 | norm 0.1920 | lr 8.61e-04 | (3851.91 ms | 136111 tok/s) step 7847/76294 | train loss 3.516173 | norm 0.2054 | lr 8.61e-04 | (3805.75 ms | 137762 tok/s) step 7848/76294 | train loss 3.456725 | norm 0.1960 | lr 8.61e-04 | (3808.17 ms | 137674 tok/s) step 
7849/76294 | train loss 3.453459 | norm 0.2103 | lr 8.61e-04 | (3824.77 ms | 137077 tok/s) step 7850/76294 | train loss 3.475851 | norm 0.2100 | lr 8.61e-04 | (3808.93 ms | 137647 tok/s) step 7851/76294 | train loss 3.456909 | norm 0.1820 | lr 8.61e-04 | (3801.86 ms | 137903 tok/s) step 7852/76294 | train loss 3.522721 | norm 0.2106 | lr 8.61e-04 | (3835.23 ms | 136703 tok/s) step 7853/76294 | train loss 3.483284 | norm 0.1919 | lr 8.61e-04 | (3802.33 ms | 137886 tok/s) step 7854/76294 | train loss 3.453934 | norm 0.2440 | lr 8.61e-04 | (3808.71 ms | 137655 tok/s) step 7855/76294 | train loss 3.501911 | norm 0.1809 | lr 8.61e-04 | (3827.14 ms | 136992 tok/s) step 7856/76294 | train loss 3.474407 | norm 0.3000 | lr 8.61e-04 | (3809.44 ms | 137628 tok/s) step 7857/76294 | train loss 3.448262 | norm 0.2498 | lr 8.60e-04 | (3804.71 ms | 137800 tok/s) step 7858/76294 | train loss 3.465447 | norm 0.2234 | lr 8.60e-04 | (3835.20 ms | 136704 tok/s) step 7859/76294 | train loss 3.422757 | norm 0.2223 | lr 8.60e-04 | (3804.71 ms | 137800 tok/s) step 7860/76294 | train loss 3.497241 | norm 0.2161 | lr 8.60e-04 | (3854.17 ms | 136031 tok/s) step 7861/76294 | train loss 3.488718 | norm 0.3101 | lr 8.60e-04 | (3811.69 ms | 137548 tok/s) step 7862/76294 | train loss 3.471815 | norm 0.2034 | lr 8.60e-04 | (3820.10 ms | 137244 tok/s) step 7863/76294 | train loss 3.467629 | norm 0.2074 | lr 8.60e-04 | (3804.37 ms | 137812 tok/s) step 7864/76294 | train loss 3.818199 | norm 0.2229 | lr 8.60e-04 | (3910.45 ms | 134074 tok/s) step 7865/76294 | train loss 3.433014 | norm 0.2271 | lr 8.60e-04 | (3882.15 ms | 135051 tok/s) step 7866/76294 | train loss 3.451636 | norm 0.1919 | lr 8.60e-04 | (3810.79 ms | 137580 tok/s) step 7867/76294 | train loss 3.502710 | norm 0.2032 | lr 8.60e-04 | (3818.59 ms | 137299 tok/s) step 7868/76294 | train loss 3.497722 | norm 0.1788 | lr 8.60e-04 | (3824.51 ms | 137086 tok/s) step 7869/76294 | train loss 3.485249 | norm 0.2023 | lr 8.59e-04 | (3808.23 ms | 
137672 tok/s) step 7870/76294 | train loss 3.500185 | norm 0.2067 | lr 8.59e-04 | (3803.04 ms | 137860 tok/s) step 7871/76294 | train loss 3.513122 | norm 0.2356 | lr 8.59e-04 | (3830.73 ms | 136864 tok/s) step 7872/76294 | train loss 3.529566 | norm 0.2117 | lr 8.59e-04 | (3802.15 ms | 137893 tok/s) step 7873/76294 | train loss 3.553971 | norm 0.3301 | lr 8.59e-04 | (3808.14 ms | 137676 tok/s) step 7874/76294 | train loss 3.463497 | norm 0.2095 | lr 8.59e-04 | (3822.18 ms | 137170 tok/s) step 7875/76294 | train loss 3.515148 | norm 0.2407 | lr 8.59e-04 | (3828.06 ms | 136959 tok/s) step 7876/76294 | train loss 3.590385 | norm 0.2105 | lr 8.59e-04 | (3803.53 ms | 137843 tok/s) step 7877/76294 | train loss 3.587834 | norm 0.2144 | lr 8.59e-04 | (3829.75 ms | 136899 tok/s) step 7878/76294 | train loss 3.548558 | norm 0.2147 | lr 8.59e-04 | (3805.64 ms | 137766 tok/s) step 7879/76294 | train loss 3.483668 | norm 0.2123 | lr 8.59e-04 | (3813.02 ms | 137499 tok/s) step 7880/76294 | train loss 3.455902 | norm 0.2010 | lr 8.59e-04 | (3828.69 ms | 136937 tok/s) step 7881/76294 | train loss 3.593014 | norm 0.2056 | lr 8.58e-04 | (3807.10 ms | 137713 tok/s) step 7882/76294 | train loss 3.443036 | norm 0.1923 | lr 8.58e-04 | (3805.27 ms | 137779 tok/s) step 7883/76294 | train loss 3.538313 | norm 0.2146 | lr 8.58e-04 | (3842.36 ms | 136449 tok/s) step 7884/76294 | train loss 3.523596 | norm 0.2422 | lr 8.58e-04 | (3804.46 ms | 137809 tok/s) step 7885/76294 | train loss 3.430813 | norm 0.1868 | lr 8.58e-04 | (3884.49 ms | 134970 tok/s) step 7886/76294 | train loss 3.444311 | norm 0.1723 | lr 8.58e-04 | (3802.83 ms | 137868 tok/s) step 7887/76294 | train loss 3.466332 | norm 0.2050 | lr 8.58e-04 | (3858.37 ms | 135883 tok/s) step 7888/76294 | train loss 3.455217 | norm 0.1762 | lr 8.58e-04 | (3803.15 ms | 137856 tok/s) step 7889/76294 | train loss 3.458177 | norm 0.2398 | lr 8.58e-04 | (3827.14 ms | 136992 tok/s) step 7890/76294 | train loss 3.472492 | norm 0.1823 | lr 8.58e-04 
| (3827.94 ms | 136963 tok/s) step 7891/76294 | train loss 3.410692 | norm 0.2621 | lr 8.58e-04 | (3809.62 ms | 137622 tok/s) step 7892/76294 | train loss 3.503641 | norm 0.1919 | lr 8.58e-04 | (3805.37 ms | 137776 tok/s) step 7893/76294 | train loss 3.489610 | norm 0.2476 | lr 8.57e-04 | (3831.90 ms | 136822 tok/s) step 7894/76294 | train loss 3.451776 | norm 0.2239 | lr 8.57e-04 | (3803.60 ms | 137840 tok/s) step 7895/76294 | train loss 3.499478 | norm 0.2498 | lr 8.57e-04 | (3808.75 ms | 137654 tok/s) step 7896/76294 | train loss 3.817761 | norm 0.2242 | lr 8.57e-04 | (3828.85 ms | 136931 tok/s) step 7897/76294 | train loss 3.417282 | norm 0.2342 | lr 8.57e-04 | (3807.58 ms | 137696 tok/s) step 7898/76294 | train loss 3.497421 | norm 0.2034 | lr 8.57e-04 | (3826.99 ms | 136998 tok/s) step 7899/76294 | train loss 3.411685 | norm 0.2293 | lr 8.57e-04 | (3805.03 ms | 137788 tok/s) step 7900/76294 | train loss 3.483815 | norm 0.2177 | lr 8.57e-04 | (3853.79 ms | 136045 tok/s) step 7901/76294 | train loss 3.536590 | norm 0.2379 | lr 8.57e-04 | (3809.99 ms | 137609 tok/s) step 7902/76294 | train loss 3.398158 | norm 0.2039 | lr 8.57e-04 | (3803.79 ms | 137833 tok/s) step 7903/76294 | train loss 3.524164 | norm 0.2245 | lr 8.57e-04 | (3843.80 ms | 136399 tok/s) step 7904/76294 | train loss 3.471770 | norm 0.2572 | lr 8.57e-04 | (3801.03 ms | 137933 tok/s) step 7905/76294 | train loss 3.471172 | norm 0.2141 | lr 8.56e-04 | (3805.08 ms | 137786 tok/s) step 7906/76294 | train loss 3.547699 | norm 0.1940 | lr 8.56e-04 | (3923.30 ms | 133634 tok/s) step 7907/76294 | train loss 3.496945 | norm 0.2071 | lr 8.56e-04 | (3800.43 ms | 137955 tok/s) step 7908/76294 | train loss 3.468779 | norm 0.1937 | lr 8.56e-04 | (3830.93 ms | 136857 tok/s) step 7909/76294 | train loss 3.501248 | norm 0.1815 | lr 8.56e-04 | (3824.34 ms | 137092 tok/s) step 7910/76294 | train loss 3.419250 | norm 0.1832 | lr 8.56e-04 | (3820.78 ms | 137220 tok/s) step 7911/76294 | train loss 3.418926 | norm 
0.1813 | lr 8.56e-04 | (3828.39 ms | 136947 tok/s) step 7912/76294 | train loss 3.518175 | norm 0.2061 | lr 8.56e-04 | (3808.30 ms | 137670 tok/s) step 7913/76294 | train loss 3.520762 | norm 0.2124 | lr 8.56e-04 | (3808.45 ms | 137665 tok/s) step 7914/76294 | train loss 3.437156 | norm 0.1857 | lr 8.56e-04 | (3806.76 ms | 137726 tok/s) step 7915/76294 | train loss 3.534092 | norm 0.1960 | lr 8.56e-04 | (3802.45 ms | 137882 tok/s) step 7916/76294 | train loss 3.494805 | norm 0.2146 | lr 8.56e-04 | (3854.09 ms | 136034 tok/s) step 7917/76294 | train loss 3.465326 | norm 0.2447 | lr 8.55e-04 | (3814.18 ms | 137458 tok/s) step 7918/76294 | train loss 3.501962 | norm 0.1985 | lr 8.55e-04 | (3810.23 ms | 137600 tok/s) step 7919/76294 | train loss 3.483184 | norm 0.2100 | lr 8.55e-04 | (3868.53 ms | 135526 tok/s) step 7920/76294 | train loss 3.565924 | norm 0.2400 | lr 8.55e-04 | (3808.99 ms | 137645 tok/s) step 7921/76294 | train loss 3.565793 | norm 0.2037 | lr 8.55e-04 | (3837.85 ms | 136610 tok/s) step 7922/76294 | train loss 3.568333 | norm 0.2585 | lr 8.55e-04 | (3804.04 ms | 137824 tok/s) step 7923/76294 | train loss 3.482471 | norm 0.2430 | lr 8.55e-04 | (3809.87 ms | 137613 tok/s) step 7924/76294 | train loss 3.506507 | norm 0.2104 | lr 8.55e-04 | (3817.73 ms | 137330 tok/s) step 7925/76294 | train loss 3.507009 | norm 0.2581 | lr 8.55e-04 | (3804.12 ms | 137821 tok/s) step 7926/76294 | train loss 3.515781 | norm 0.2541 | lr 8.55e-04 | (3923.50 ms | 133627 tok/s) step 7927/76294 | train loss 3.468528 | norm 0.1958 | lr 8.55e-04 | (3803.63 ms | 137839 tok/s) step 7928/76294 | train loss 3.437568 | norm 0.3341 | lr 8.55e-04 | (3846.72 ms | 136295 tok/s) step 7929/76294 | train loss 3.495983 | norm 0.2459 | lr 8.54e-04 | (3801.73 ms | 137908 tok/s) step 7930/76294 | train loss 3.436618 | norm 0.2618 | lr 8.54e-04 | (3831.99 ms | 136819 tok/s) step 7931/76294 | train loss 3.451452 | norm 0.2211 | lr 8.54e-04 | (3805.06 ms | 137787 tok/s) step 7932/76294 | train loss 
3.510369 | norm 0.2228 | lr 8.54e-04 | (3834.67 ms | 136723 tok/s) step 7933/76294 | train loss 3.488227 | norm 0.2074 | lr 8.54e-04 | (3841.75 ms | 136471 tok/s) step 7934/76294 | train loss 3.448476 | norm 0.2116 | lr 8.54e-04 | (3806.56 ms | 137733 tok/s) step 7935/76294 | train loss 3.541276 | norm 0.2170 | lr 8.54e-04 | (3831.92 ms | 136821 tok/s) step 7936/76294 | train loss 3.541290 | norm 0.2388 | lr 8.54e-04 | (3802.24 ms | 137889 tok/s) step 7937/76294 | train loss 3.502939 | norm 0.2838 | lr 8.54e-04 | (3809.61 ms | 137622 tok/s) step 7938/76294 | train loss 3.445686 | norm 0.1953 | lr 8.54e-04 | (3835.68 ms | 136687 tok/s) step 7939/76294 | train loss 3.545778 | norm 0.2594 | lr 8.54e-04 | (3811.66 ms | 137548 tok/s) step 7940/76294 | train loss 3.443238 | norm 0.2065 | lr 8.54e-04 | (3806.14 ms | 137748 tok/s) step 7941/76294 | train loss 3.528513 | norm 0.2428 | lr 8.53e-04 | (4660.70 ms | 112491 tok/s) step 7942/76294 | train loss 3.508616 | norm 0.2182 | lr 8.53e-04 | (3800.55 ms | 137951 tok/s) step 7943/76294 | train loss 3.510621 | norm 0.2061 | lr 8.53e-04 | (3833.84 ms | 136753 tok/s) step 7944/76294 | train loss 3.576985 | norm 0.1874 | lr 8.53e-04 | (3802.32 ms | 137886 tok/s) step 7945/76294 | train loss 3.534345 | norm 0.2542 | lr 8.53e-04 | (3807.45 ms | 137701 tok/s) step 7946/76294 | train loss 3.517688 | norm 0.2118 | lr 8.53e-04 | (3903.60 ms | 134309 tok/s) step 7947/76294 | train loss 3.430431 | norm 0.2072 | lr 8.53e-04 | (3818.52 ms | 137301 tok/s) step 7948/76294 | train loss 3.467481 | norm 0.2991 | lr 8.53e-04 | (3819.44 ms | 137268 tok/s) step 7949/76294 | train loss 3.521370 | norm 0.2455 | lr 8.53e-04 | (3811.16 ms | 137566 tok/s) step 7950/76294 | train loss 3.511369 | norm 0.2430 | lr 8.53e-04 | (3840.77 ms | 136506 tok/s) step 7951/76294 | train loss 3.515544 | norm 0.2486 | lr 8.53e-04 | (3808.95 ms | 137646 tok/s) step 7952/76294 | train loss 3.456563 | norm 0.2034 | lr 8.53e-04 | (3842.64 ms | 136439 tok/s) step 
7953/76294 | train loss 3.578204 | norm 0.2396 | lr 8.52e-04 | (3835.40 ms | 136697 tok/s) step 7954/76294 | train loss 3.467570 | norm 0.1928 | lr 8.52e-04 | (3813.09 ms | 137497 tok/s) step 7955/76294 | train loss 3.624609 | norm 0.2443 | lr 8.52e-04 | (3836.84 ms | 136646 tok/s) step 7956/76294 | train loss 3.435869 | norm 0.1877 | lr 8.52e-04 | (3814.35 ms | 137451 tok/s) step 7957/76294 | train loss 3.523584 | norm 0.2258 | lr 8.52e-04 | (3802.83 ms | 137868 tok/s) step 7958/76294 | train loss 3.464578 | norm 0.1969 | lr 8.52e-04 | (3922.79 ms | 133652 tok/s) step 7959/76294 | train loss 3.490638 | norm 0.1996 | lr 8.52e-04 | (3801.12 ms | 137930 tok/s) step 7960/76294 | train loss 3.438442 | norm 0.1896 | lr 8.52e-04 | (3803.43 ms | 137846 tok/s) step 7961/76294 | train loss 3.534789 | norm 0.1796 | lr 8.52e-04 | (3828.08 ms | 136959 tok/s) step 7962/76294 | train loss 3.502384 | norm 0.2453 | lr 8.52e-04 | (3807.75 ms | 137690 tok/s) step 7963/76294 | train loss 3.462532 | norm 0.2151 | lr 8.52e-04 | (3826.11 ms | 137029 tok/s) step 7964/76294 | train loss 3.507019 | norm 0.2006 | lr 8.51e-04 | (3805.47 ms | 137772 tok/s) step 7965/76294 | train loss 3.507269 | norm 0.2214 | lr 8.51e-04 | (3821.99 ms | 137177 tok/s) step 7966/76294 | train loss 3.464449 | norm 0.2028 | lr 8.51e-04 | (3955.48 ms | 132547 tok/s) step 7967/76294 | train loss 3.466529 | norm 0.2243 | lr 8.51e-04 | (3807.93 ms | 137683 tok/s) step 7968/76294 | train loss 3.462049 | norm 0.2380 | lr 8.51e-04 | (3812.10 ms | 137532 tok/s) step 7969/76294 | train loss 3.413520 | norm 0.2072 | lr 8.51e-04 | (3836.98 ms | 136641 tok/s) step 7970/76294 | train loss 3.415885 | norm 0.2021 | lr 8.51e-04 | (3808.67 ms | 137656 tok/s) step 7971/76294 | train loss 3.534818 | norm 0.2196 | lr 8.51e-04 | (3799.32 ms | 137995 tok/s) step 7972/76294 | train loss 3.452997 | norm 0.2039 | lr 8.51e-04 | (3828.79 ms | 136933 tok/s) step 7973/76294 | train loss 3.619437 | norm 0.2160 | lr 8.51e-04 | (3803.40 ms | 
137847 tok/s) step 7974/76294 | train loss 3.476447 | norm 0.2200 | lr 8.51e-04 | (3809.84 ms | 137614 tok/s) step 7975/76294 | train loss 3.569359 | norm 0.2327 | lr 8.51e-04 | (3827.35 ms | 136984 tok/s) step 7976/76294 | train loss 3.472775 | norm 0.2289 | lr 8.50e-04 | (3837.31 ms | 136629 tok/s) step 7977/76294 | train loss 3.503640 | norm 0.2633 | lr 8.50e-04 | (3803.95 ms | 137827 tok/s) step 7978/76294 | train loss 3.525553 | norm 0.2390 | lr 8.50e-04 | (3838.82 ms | 136575 tok/s) step 7979/76294 | train loss 3.496554 | norm 0.2148 | lr 8.50e-04 | (3804.08 ms | 137823 tok/s) step 7980/76294 | train loss 3.489689 | norm 0.2235 | lr 8.50e-04 | (3809.56 ms | 137624 tok/s) step 7981/76294 | train loss 3.471980 | norm 0.1953 | lr 8.50e-04 | (3829.27 ms | 136916 tok/s) step 7982/76294 | train loss 3.506878 | norm 0.2210 | lr 8.50e-04 | (3807.12 ms | 137713 tok/s) step 7983/76294 | train loss 3.452692 | norm 0.2360 | lr 8.50e-04 | (3805.36 ms | 137776 tok/s) step 7984/76294 | train loss 3.465890 | norm 0.2151 | lr 8.50e-04 | (4988.24 ms | 105105 tok/s) step 7985/76294 | train loss 3.539225 | norm 0.2451 | lr 8.50e-04 | (3773.07 ms | 138955 tok/s) step 7986/76294 | train loss 3.507511 | norm 0.2491 | lr 8.50e-04 | (3838.69 ms | 136580 tok/s) step 7987/76294 | train loss 3.484560 | norm 0.2151 | lr 8.50e-04 | (3784.78 ms | 138525 tok/s) step 7988/76294 | train loss 3.527353 | norm 0.1977 | lr 8.49e-04 | (4160.09 ms | 126028 tok/s) step 7989/76294 | train loss 3.558896 | norm 0.2168 | lr 8.49e-04 | (3837.11 ms | 136636 tok/s) step 7990/76294 | train loss 3.493371 | norm 0.2320 | lr 8.49e-04 | (3810.07 ms | 137606 tok/s) step 7991/76294 | train loss 3.523572 | norm 0.2326 | lr 8.49e-04 | (3789.66 ms | 138347 tok/s) step 7992/76294 | train loss 3.484062 | norm 0.2093 | lr 8.49e-04 | (3880.98 ms | 135092 tok/s) step 7993/76294 | train loss 3.536427 | norm 0.2489 | lr 8.49e-04 | (3797.65 ms | 138056 tok/s) step 7994/76294 | train loss 3.474118 | norm 0.3025 | lr 8.49e-04 
| (3850.26 ms | 136170 tok/s) step 7995/76294 | train loss 3.454129 | norm 0.2272 | lr 8.49e-04 | (3794.89 ms | 138156 tok/s) step 7996/76294 | train loss 3.471753 | norm 0.2182 | lr 8.49e-04 | (3820.12 ms | 137244 tok/s) step 7997/76294 | train loss 3.419456 | norm 0.2393 | lr 8.49e-04 | (3814.90 ms | 137432 tok/s) step 7998/76294 | train loss 3.492836 | norm 0.1844 | lr 8.49e-04 | (3809.24 ms | 137636 tok/s) step 7999/76294 | train loss 3.496180 | norm 0.2319 | lr 8.49e-04 | (3798.48 ms | 138026 tok/s) step 8000/76294 | train loss 3.521180 | norm 0.2018 | lr 8.48e-04 | (3826.33 ms | 137021 tok/s) val loss: 3.472013 saving model checkpoint to ./results/gpt2-124M-gqa/step_8000.pth step 8001/76294 | train loss 3.494243 | norm 0.2168 | lr 8.48e-04 | (3871.14 ms | 135435 tok/s) step 8002/76294 | train loss 3.531345 | norm 0.1929 | lr 8.48e-04 | (3795.85 ms | 138121 tok/s) step 8003/76294 | train loss 3.476071 | norm 0.2198 | lr 8.48e-04 | (3857.90 ms | 135900 tok/s) step 8004/76294 | train loss 3.478903 | norm 0.2039 | lr 8.48e-04 | (3796.25 ms | 138107 tok/s) step 8005/76294 | train loss 3.405526 | norm 0.2532 | lr 8.48e-04 | (3808.52 ms | 137662 tok/s) step 8006/76294 | train loss 3.644177 | norm 0.2187 | lr 8.48e-04 | (3822.36 ms | 137163 tok/s) step 8007/76294 | train loss 3.706131 | norm 0.2793 | lr 8.48e-04 | (3799.51 ms | 137988 tok/s) step 8008/76294 | train loss 3.551598 | norm 0.2296 | lr 8.48e-04 | (3835.98 ms | 136676 tok/s) step 8009/76294 | train loss 3.510076 | norm 0.2886 | lr 8.48e-04 | (3807.22 ms | 137709 tok/s) step 8010/76294 | train loss 3.522261 | norm 0.2521 | lr 8.48e-04 | (3820.60 ms | 137227 tok/s) step 8011/76294 | train loss 3.550869 | norm 0.2187 | lr 8.48e-04 | (4219.36 ms | 124258 tok/s) step 8012/76294 | train loss 3.442931 | norm 0.3039 | lr 8.47e-04 | (3836.93 ms | 136643 tok/s) step 8013/76294 | train loss 3.460613 | norm 0.2116 | lr 8.47e-04 | (3835.10 ms | 136708 tok/s) step 8014/76294 | train loss 3.471282 | norm 0.2526 | lr 
8.47e-04 | (3835.65 ms | 136688 tok/s) step 8015/76294 | train loss 3.500086 | norm 0.1959 | lr 8.47e-04 | (3805.77 ms | 137761 tok/s) step 8016/76294 | train loss 3.435840 | norm 0.2559 | lr 8.47e-04 | (3840.77 ms | 136506 tok/s) step 8017/76294 | train loss 3.402650 | norm 0.1921 | lr 8.47e-04 | (3809.36 ms | 137631 tok/s) step 8018/76294 | train loss 3.506594 | norm 0.2088 | lr 8.47e-04 | (3812.73 ms | 137510 tok/s) step 8019/76294 | train loss 3.415230 | norm 0.1930 | lr 8.47e-04 | (3848.89 ms | 136218 tok/s) step 8020/76294 | train loss 3.508479 | norm 0.2047 | lr 8.47e-04 | (3809.08 ms | 137642 tok/s) step 8021/76294 | train loss 3.473218 | norm 0.2198 | lr 8.47e-04 | (3806.30 ms | 137742 tok/s) step 8022/76294 | train loss 3.445609 | norm 0.1909 | lr 8.47e-04 | (3873.61 ms | 135349 tok/s) step 8023/76294 | train loss 3.590609 | norm 0.2242 | lr 8.47e-04 | (3806.73 ms | 137727 tok/s) step 8024/76294 | train loss 3.460578 | norm 0.2371 | lr 8.46e-04 | (3838.57 ms | 136584 tok/s) step 8025/76294 | train loss 3.466964 | norm 0.1781 | lr 8.46e-04 | (3833.79 ms | 136754 tok/s) step 8026/76294 | train loss 3.435865 | norm 0.2197 | lr 8.46e-04 | (3820.98 ms | 137213 tok/s) step 8027/76294 | train loss 3.457567 | norm 0.2214 | lr 8.46e-04 | (3810.35 ms | 137596 tok/s) step 8028/76294 | train loss 3.446999 | norm 0.2175 | lr 8.46e-04 | (3856.62 ms | 135945 tok/s) step 8029/76294 | train loss 3.482411 | norm 0.2250 | lr 8.46e-04 | (3811.60 ms | 137551 tok/s) step 8030/76294 | train loss 3.502373 | norm 0.2188 | lr 8.46e-04 | (3831.12 ms | 136850 tok/s) step 8031/76294 | train loss 3.458972 | norm 0.2054 | lr 8.46e-04 | (3804.21 ms | 137818 tok/s) step 8032/76294 | train loss 3.443286 | norm 0.1941 | lr 8.46e-04 | (3927.47 ms | 133492 tok/s) step 8033/76294 | train loss 3.463572 | norm 0.2114 | lr 8.46e-04 | (3807.81 ms | 137688 tok/s) step 8034/76294 | train loss 3.419939 | norm 0.1806 | lr 8.46e-04 | (3816.34 ms | 137380 tok/s) step 8035/76294 | train loss 3.513543 | 
norm 0.2110 | lr 8.46e-04 | (3824.15 ms | 137099 tok/s) step 8036/76294 | train loss 3.437293 | norm 0.1778 | lr 8.45e-04 | (3812.53 ms | 137517 tok/s) step 8037/76294 | train loss 3.581357 | norm 0.2098 | lr 8.45e-04 | (3806.41 ms | 137738 tok/s) step 8038/76294 | train loss 3.505195 | norm 0.2031 | lr 8.45e-04 | (3847.02 ms | 136284 tok/s) step 8039/76294 | train loss 3.476706 | norm 0.2654 | lr 8.45e-04 | (3804.76 ms | 137798 tok/s) step 8040/76294 | train loss 3.487372 | norm 0.2184 | lr 8.45e-04 | (3812.18 ms | 137530 tok/s) step 8041/76294 | train loss 3.468636 | norm 0.2141 | lr 8.45e-04 | (3833.38 ms | 136769 tok/s) step 8042/76294 | train loss 3.462327 | norm 0.2467 | lr 8.45e-04 | (3816.95 ms | 137358 tok/s) step 8043/76294 | train loss 3.443902 | norm 0.2611 | lr 8.45e-04 | (3826.89 ms | 137001 tok/s) step 8044/76294 | train loss 3.649821 | norm 0.1833 | lr 8.45e-04 | (3811.04 ms | 137571 tok/s) step 8045/76294 | train loss 3.443275 | norm 0.2454 | lr 8.45e-04 | (3834.26 ms | 136738 tok/s) step 8046/76294 | train loss 3.474357 | norm 0.1848 | lr 8.45e-04 | (3809.96 ms | 137610 tok/s) step 8047/76294 | train loss 3.440871 | norm 0.1856 | lr 8.44e-04 | (3810.05 ms | 137607 tok/s) step 8048/76294 | train loss 3.484752 | norm 0.2140 | lr 8.44e-04 | (3832.41 ms | 136804 tok/s) step 8049/76294 | train loss 3.425424 | norm 0.2786 | lr 8.44e-04 | (3804.88 ms | 137793 tok/s) step 8050/76294 | train loss 3.455395 | norm 0.2007 | lr 8.44e-04 | (3809.27 ms | 137635 tok/s) step 8051/76294 | train loss 3.484335 | norm 0.2824 | lr 8.44e-04 | (3826.75 ms | 137006 tok/s) step 8052/76294 | train loss 3.448944 | norm 0.2051 | lr 8.44e-04 | (3810.20 ms | 137601 tok/s) step 8053/76294 | train loss 3.489420 | norm 0.2514 | lr 8.44e-04 | (3825.43 ms | 137053 tok/s) step 8054/76294 | train loss 3.386774 | norm 0.2327 | lr 8.44e-04 | (3827.08 ms | 136994 tok/s) step 8055/76294 | train loss 3.512410 | norm 0.2385 | lr 8.44e-04 | (3803.09 ms | 137858 tok/s) step 8056/76294 | train 
loss 3.418142 | norm 0.2246 | lr 8.44e-04 | (3828.24 ms | 136953 tok/s) step 8057/76294 | train loss 3.504293 | norm 0.2300 | lr 8.44e-04 | (3916.39 ms | 133870 tok/s) step 8058/76294 | train loss 3.489674 | norm 0.2961 | lr 8.44e-04 | (3824.51 ms | 137086 tok/s) step 8059/76294 | train loss 3.411275 | norm 0.2313 | lr 8.43e-04 | (3812.45 ms | 137520 tok/s) step 8060/76294 | train loss 3.472350 | norm 0.1936 | lr 8.43e-04 | (3802.83 ms | 137868 tok/s) step 8061/76294 | train loss 3.475346 | norm 0.2178 | lr 8.43e-04 | (3829.42 ms | 136910 tok/s) step 8062/76294 | train loss 3.481426 | norm 0.2021 | lr 8.43e-04 | (3806.50 ms | 137735 tok/s) step 8063/76294 | train loss 3.558465 | norm 0.2051 | lr 8.43e-04 | (3818.31 ms | 137309 tok/s) step 8064/76294 | train loss 3.511803 | norm 0.2020 | lr 8.43e-04 | (3831.37 ms | 136841 tok/s) step 8065/76294 | train loss 3.519090 | norm 0.1979 | lr 8.43e-04 | (3807.28 ms | 137707 tok/s) step 8066/76294 | train loss 3.422795 | norm 0.2238 | lr 8.43e-04 | (3801.45 ms | 137918 tok/s) step 8067/76294 | train loss 3.463549 | norm 0.1942 | lr 8.43e-04 | (3846.60 ms | 136299 tok/s) step 8068/76294 | train loss 3.499259 | norm 0.2172 | lr 8.43e-04 | (3804.99 ms | 137790 tok/s) step 8069/76294 | train loss 3.368566 | norm 0.2195 | lr 8.43e-04 | (3806.29 ms | 137743 tok/s) step 8070/76294 | train loss 3.499189 | norm 0.2301 | lr 8.43e-04 | (3829.58 ms | 136905 tok/s) step 8071/76294 | train loss 3.489875 | norm 0.2106 | lr 8.42e-04 | (3813.53 ms | 137481 tok/s) step 8072/76294 | train loss 3.504229 | norm 0.3756 | lr 8.42e-04 | (3824.73 ms | 137078 tok/s) step 8073/76294 | train loss 3.503998 | norm 0.2641 | lr 8.42e-04 | (4051.16 ms | 129417 tok/s) step 8074/76294 | train loss 3.416831 | norm 0.2467 | lr 8.42e-04 | (3805.31 ms | 137778 tok/s) step 8075/76294 | train loss 3.485633 | norm 0.2332 | lr 8.42e-04 | (3824.23 ms | 137096 tok/s) step 8076/76294 | train loss 3.530103 | norm 0.2649 | lr 8.42e-04 | (3800.61 ms | 137948 tok/s) step 
8077/76294 | train loss 3.508374 | norm 0.2468 | lr 8.42e-04 | (3835.82 ms | 136682 tok/s) step 8078/76294 | train loss 3.499723 | norm 0.2187 | lr 8.42e-04 | (3803.96 ms | 137827 tok/s) step 8079/76294 | train loss 3.452133 | norm 0.2414 | lr 8.42e-04 | (3856.61 ms | 135945 tok/s) step 8080/76294 | train loss 3.464283 | norm 0.2088 | lr 8.42e-04 | (3801.81 ms | 137905 tok/s) step 8081/76294 | train loss 3.468970 | norm 0.2380 | lr 8.42e-04 | (3826.20 ms | 137026 tok/s) step 8082/76294 | train loss 3.415644 | norm 0.2240 | lr 8.42e-04 | (3801.57 ms | 137913 tok/s) step 8083/76294 | train loss 3.484370 | norm 0.2099 | lr 8.41e-04 | (3806.80 ms | 137724 tok/s) step 8084/76294 | train loss 3.402071 | norm 0.2639 | lr 8.41e-04 | (3822.99 ms | 137141 tok/s) step 8085/76294 | train loss 3.545154 | norm 0.1854 | lr 8.41e-04 | (3804.12 ms | 137821 tok/s) step 8086/76294 | train loss 3.455174 | norm 0.2053 | lr 8.41e-04 | (3830.70 ms | 136865 tok/s) step 8087/76294 | train loss 3.476011 | norm 0.1952 | lr 8.41e-04 | (5831.62 ms | 89904 tok/s) step 8088/76294 | train loss 3.571873 | norm 0.2446 | lr 8.41e-04 | (3811.61 ms | 137550 tok/s) step 8089/76294 | train loss 3.494865 | norm 0.2665 | lr 8.41e-04 | (3802.12 ms | 137894 tok/s) step 8090/76294 | train loss 3.522574 | norm 0.2156 | lr 8.41e-04 | (3796.77 ms | 138088 tok/s) step 8091/76294 | train loss 3.414043 | norm 0.2727 | lr 8.41e-04 | (3827.80 ms | 136968 tok/s) step 8092/76294 | train loss 3.405099 | norm 0.2512 | lr 8.41e-04 | (3800.90 ms | 137938 tok/s) step 8093/76294 | train loss 3.468639 | norm 0.2474 | lr 8.41e-04 | (3925.83 ms | 133548 tok/s) step 8094/76294 | train loss 3.462005 | norm 0.2826 | lr 8.41e-04 | (3797.37 ms | 138066 tok/s) step 8095/76294 | train loss 3.436202 | norm 0.2122 | lr 8.40e-04 | (4340.15 ms | 120800 tok/s) step 8096/76294 | train loss 3.513401 | norm 0.2134 | lr 8.40e-04 | (3797.15 ms | 138074 tok/s) step 8097/76294 | train loss 3.462450 | norm 0.2566 | lr 8.40e-04 | (3829.23 ms | 
136917 tok/s) step 8098/76294 | train loss 3.499079 | norm 0.1893 | lr 8.40e-04 | (3868.00 ms | 135545 tok/s) step 8099/76294 | train loss 3.465357 | norm 0.2418 | lr 8.40e-04 | (3802.86 ms | 137867 tok/s) step 8100/76294 | train loss 3.418323 | norm 0.2580 | lr 8.40e-04 | (3799.45 ms | 137990 tok/s) step 8101/76294 | train loss 3.473354 | norm 0.2521 | lr 8.40e-04 | (3820.22 ms | 137240 tok/s) step 8102/76294 | train loss 3.433604 | norm 0.2337 | lr 8.40e-04 | (3801.32 ms | 137923 tok/s) step 8103/76294 | train loss 3.486206 | norm 0.2222 | lr 8.40e-04 | (3806.11 ms | 137749 tok/s) step 8104/76294 | train loss 3.436147 | norm 0.2377 | lr 8.40e-04 | (3845.60 ms | 136335 tok/s) step 8105/76294 | train loss 3.382723 | norm 0.2428 | lr 8.40e-04 | (3818.05 ms | 137318 tok/s) step 8106/76294 | train loss 3.529219 | norm 0.2418 | lr 8.39e-04 | (3804.00 ms | 137825 tok/s) step 8107/76294 | train loss 3.383118 | norm 0.2503 | lr 8.39e-04 | (3816.77 ms | 137364 tok/s) step 8108/76294 | train loss 3.493200 | norm 0.2401 | lr 8.39e-04 | (3804.88 ms | 137793 tok/s) step 8109/76294 | train loss 3.477199 | norm 0.2116 | lr 8.39e-04 | (3826.76 ms | 137006 tok/s) step 8110/76294 | train loss 3.458831 | norm 0.2172 | lr 8.39e-04 | (3802.99 ms | 137862 tok/s) step 8111/76294 | train loss 3.477079 | norm 0.2254 | lr 8.39e-04 | (3804.19 ms | 137819 tok/s) step 8112/76294 | train loss 3.443339 | norm 0.2196 | lr 8.39e-04 | (3836.58 ms | 136655 tok/s) step 8113/76294 | train loss 3.487395 | norm 0.2418 | lr 8.39e-04 | (3802.23 ms | 137889 tok/s) step 8114/76294 | train loss 3.542205 | norm 0.2145 | lr 8.39e-04 | (3881.22 ms | 135083 tok/s) step 8115/76294 | train loss 3.406798 | norm 0.2410 | lr 8.39e-04 | (3822.58 ms | 137155 tok/s) step 8116/76294 | train loss 3.550241 | norm 0.2169 | lr 8.39e-04 | (3807.03 ms | 137716 tok/s) step 8117/76294 | train loss 3.441873 | norm 0.2036 | lr 8.39e-04 | (3829.35 ms | 136913 tok/s) step 8118/76294 | train loss 3.475517 | norm 0.2489 | lr 8.38e-04 
| (3802.07 ms | 137895 tok/s) step 8119/76294 | train loss 3.469016 | norm 0.2544 | lr 8.38e-04 | (3830.92 ms | 136857 tok/s) step 8120/76294 | train loss 3.454502 | norm 0.2312 | lr 8.38e-04 | (3880.33 ms | 135114 tok/s) step 8121/76294 | train loss 3.400809 | norm 0.1837 | lr 8.38e-04 | (3798.89 ms | 138011 tok/s) step 8122/76294 | train loss 3.400291 | norm 0.2150 | lr 8.38e-04 | (3802.56 ms | 137878 tok/s) step 8123/76294 | train loss 3.765676 | norm 0.2955 | lr 8.38e-04 | (3815.97 ms | 137393 tok/s) step 8124/76294 | train loss 3.403098 | norm 0.2561 | lr 8.38e-04 | (3805.72 ms | 137763 tok/s) step 8125/76294 | train loss 3.534525 | norm 0.2117 | lr 8.38e-04 | (3843.31 ms | 136416 tok/s) step 8126/76294 | train loss 3.504085 | norm 0.3477 | lr 8.38e-04 | (3800.62 ms | 137948 tok/s) step 8127/76294 | train loss 3.435341 | norm 0.2294 | lr 8.38e-04 | (3836.01 ms | 136675 tok/s) step 8128/76294 | train loss 3.492396 | norm 0.2326 | lr 8.38e-04 | (3797.86 ms | 138048 tok/s) step 8129/76294 | train loss 3.467549 | norm 0.2065 | lr 8.38e-04 | (4017.15 ms | 130513 tok/s) step 8130/76294 | train loss 3.380260 | norm 0.2073 | lr 8.37e-04 | (3798.71 ms | 138017 tok/s) step 8131/76294 | train loss 3.476200 | norm 0.1792 | lr 8.37e-04 | (3828.02 ms | 136961 tok/s) step 8132/76294 | train loss 3.472389 | norm 0.1901 | lr 8.37e-04 | (3800.18 ms | 137964 tok/s) step 8133/76294 | train loss 3.461362 | norm 0.1849 | lr 8.37e-04 | (3837.82 ms | 136611 tok/s) step 8134/76294 | train loss 3.559272 | norm 0.2264 | lr 8.37e-04 | (3883.97 ms | 134988 tok/s) step 8135/76294 | train loss 3.478324 | norm 0.1842 | lr 8.37e-04 | (3802.91 ms | 137865 tok/s) step 8136/76294 | train loss 3.532903 | norm 0.2034 | lr 8.37e-04 | (3833.69 ms | 136758 tok/s) step 8137/76294 | train loss 3.511224 | norm 0.2211 | lr 8.37e-04 | (3805.36 ms | 137776 tok/s) step 8138/76294 | train loss 3.435185 | norm 0.1813 | lr 8.37e-04 | (3822.48 ms | 137159 tok/s) step 8139/76294 | train loss 3.561989 | norm 
0.2149 | lr 8.37e-04 | (3798.93 ms | 138010 tok/s) step 8140/76294 | train loss 3.454234 | norm 0.2008 | lr 8.37e-04 | (3823.54 ms | 137121 tok/s) step 8141/76294 | train loss 3.476653 | norm 0.2251 | lr 8.37e-04 | (3821.78 ms | 137184 tok/s) step 8142/76294 | train loss 3.475221 | norm 0.2058 | lr 8.36e-04 | (3832.63 ms | 136796 tok/s) step 8143/76294 | train loss 3.460387 | norm 0.1917 | lr 8.36e-04 | (3832.15 ms | 136813 tok/s) step 8144/76294 | train loss 3.465031 | norm 0.2105 | lr 8.36e-04 | (3802.35 ms | 137885 tok/s) step 8145/76294 | train loss 3.457815 | norm 0.1867 | lr 8.36e-04 | (3799.64 ms | 137983 tok/s) step 8146/76294 | train loss 3.372944 | norm 0.2406 | lr 8.36e-04 | (3850.38 ms | 136165 tok/s) step 8147/76294 | train loss 3.499689 | norm 0.2709 | lr 8.36e-04 | (3803.45 ms | 137845 tok/s) step 8148/76294 | train loss 3.457840 | norm 0.2269 | lr 8.36e-04 | (3851.09 ms | 136140 tok/s) step 8149/76294 | train loss 3.450502 | norm 0.2280 | lr 8.36e-04 | (3803.44 ms | 137846 tok/s) step 8150/76294 | train loss 3.481351 | norm 0.2148 | lr 8.36e-04 | (3807.53 ms | 137698 tok/s) step 8151/76294 | train loss 3.382915 | norm 0.2178 | lr 8.36e-04 | (3830.00 ms | 136890 tok/s) step 8152/76294 | train loss 3.554816 | norm 0.3126 | lr 8.36e-04 | (3804.44 ms | 137809 tok/s) step 8153/76294 | train loss 3.437170 | norm 0.2941 | lr 8.35e-04 | (3826.62 ms | 137011 tok/s) step 8154/76294 | train loss 3.407753 | norm 0.2502 | lr 8.35e-04 | (3862.13 ms | 135751 tok/s) step 8155/76294 | train loss 3.452298 | norm 0.2251 | lr 8.35e-04 | (3875.16 ms | 135295 tok/s) step 8156/76294 | train loss 3.465182 | norm 0.2354 | lr 8.35e-04 | (3801.88 ms | 137902 tok/s) step 8157/76294 | train loss 3.452794 | norm 0.2049 | lr 8.35e-04 | (3880.73 ms | 135101 tok/s) step 8158/76294 | train loss 3.448115 | norm 0.2357 | lr 8.35e-04 | (3802.63 ms | 137875 tok/s) step 8159/76294 | train loss 3.475154 | norm 0.2704 | lr 8.35e-04 | (3822.89 ms | 137144 tok/s) step 8160/76294 | train loss 
3.667767 | norm 0.2929 | lr 8.35e-04 | (3803.47 ms | 137845 tok/s) step 8161/76294 | train loss 3.455255 | norm 0.2536 | lr 8.35e-04 | (3804.28 ms | 137815 tok/s) step 8162/76294 | train loss 3.500956 | norm 0.2207 | lr 8.35e-04 | (3823.43 ms | 137125 tok/s) step 8163/76294 | train loss 3.471242 | norm 0.2293 | lr 8.35e-04 | (3832.90 ms | 136786 tok/s) step 8164/76294 | train loss 3.506828 | norm 0.2307 | lr 8.35e-04 | (3802.46 ms | 137881 tok/s) step 8165/76294 | train loss 3.466492 | norm 0.2119 | lr 8.34e-04 | (3832.55 ms | 136799 tok/s) step 8166/76294 | train loss 3.460247 | norm 0.2032 | lr 8.34e-04 | (3807.64 ms | 137694 tok/s) step 8167/76294 | train loss 3.455017 | norm 0.1973 | lr 8.34e-04 | (3810.70 ms | 137583 tok/s) step 8168/76294 | train loss 3.460993 | norm 0.2013 | lr 8.34e-04 | (3826.95 ms | 136999 tok/s) step 8169/76294 | train loss 3.486663 | norm 0.1844 | lr 8.34e-04 | (3814.75 ms | 137437 tok/s) step 8170/76294 | train loss 3.452149 | norm 0.1855 | lr 8.34e-04 | (3825.04 ms | 137067 tok/s) step 8171/76294 | train loss 3.493733 | norm 0.1943 | lr 8.34e-04 | (3809.01 ms | 137644 tok/s) step 8172/76294 | train loss 3.490844 | norm 0.1876 | lr 8.34e-04 | (3803.65 ms | 137838 tok/s) step 8173/76294 | train loss 3.365689 | norm 0.2638 | lr 8.34e-04 | (3858.40 ms | 135882 tok/s) step 8174/76294 | train loss 3.476878 | norm 0.2236 | lr 8.34e-04 | (3804.30 ms | 137814 tok/s) step 8175/76294 | train loss 3.493923 | norm 0.2116 | lr 8.34e-04 | (3892.05 ms | 134708 tok/s) step 8176/76294 | train loss 3.434863 | norm 0.2176 | lr 8.34e-04 | (3801.59 ms | 137913 tok/s) step 8177/76294 | train loss 3.491034 | norm 0.1979 | lr 8.33e-04 | (3839.33 ms | 136557 tok/s) step 8178/76294 | train loss 3.465663 | norm 0.2230 | lr 8.33e-04 | (3804.19 ms | 137819 tok/s) step 8179/76294 | train loss 3.430223 | norm 0.2079 | lr 8.33e-04 | (3832.07 ms | 136816 tok/s) step 8180/76294 | train loss 3.449696 | norm 0.2160 | lr 8.33e-04 | (3822.32 ms | 137165 tok/s) step 
8181/76294 | train loss 3.588670 | norm 0.2349 | lr 8.33e-04 | (3837.05 ms | 136638 tok/s) step 8182/76294 | train loss 3.487437 | norm 0.2061 | lr 8.33e-04 | (3804.13 ms | 137821 tok/s) step 8183/76294 | train loss 3.568756 | norm 0.2085 | lr 8.33e-04 | (3840.04 ms | 136532 tok/s) step 8184/76294 | train loss 3.464188 | norm 0.2471 | lr 8.33e-04 | (3803.95 ms | 137827 tok/s) step 8185/76294 | train loss 3.454123 | norm 0.2039 | lr 8.33e-04 | (3810.00 ms | 137609 tok/s) step 8186/76294 | train loss 3.428950 | norm 0.1910 | lr 8.33e-04 | (3855.07 ms | 136000 tok/s) step 8187/76294 | train loss 3.507089 | norm 0.2121 | lr 8.33e-04 | (3805.24 ms | 137780 tok/s) step 8188/76294 | train loss 3.533613 | norm 0.2232 | lr 8.33e-04 | (3842.78 ms | 136434 tok/s) step 8189/76294 | train loss 3.395313 | norm 0.3825 | lr 8.32e-04 | (3803.92 ms | 137828 tok/s) step 8190/76294 | train loss 3.483411 | norm 2.2323 | lr 8.32e-04 | (3809.45 ms | 137628 tok/s) step 8191/76294 | train loss 3.397542 | norm 0.3924 | lr 8.32e-04 | (3801.51 ms | 137916 tok/s) step 8192/76294 | train loss 3.411009 | norm 0.3587 | lr 8.32e-04 | (3809.64 ms | 137621 tok/s) step 8193/76294 | train loss 3.489887 | norm 0.3105 | lr 8.32e-04 | (3826.02 ms | 137032 tok/s) step 8194/76294 | train loss 3.575144 | norm 0.2954 | lr 8.32e-04 | (3830.63 ms | 136867 tok/s) step 8195/76294 | train loss 3.455364 | norm 0.2616 | lr 8.32e-04 | (3839.74 ms | 136543 tok/s) step 8196/76294 | train loss 3.427639 | norm 0.2231 | lr 8.32e-04 | (5636.88 ms | 93010 tok/s) step 8197/76294 | train loss 3.501045 | norm 0.2484 | lr 8.32e-04 | (3811.60 ms | 137550 tok/s) step 8198/76294 | train loss 3.504886 | norm 0.1977 | lr 8.32e-04 | (3825.92 ms | 137036 tok/s) step 8199/76294 | train loss 3.460777 | norm 0.2110 | lr 8.32e-04 | (3808.83 ms | 137651 tok/s) step 8200/76294 | train loss 3.390740 | norm 0.2143 | lr 8.31e-04 | (3802.49 ms | 137880 tok/s) step 8201/76294 | train loss 3.486369 | norm 0.1969 | lr 8.31e-04 | (6304.51 ms | 
83161 tok/s) step 8202/76294 | train loss 3.500459 | norm 0.1954 | lr 8.31e-04 | (7464.17 ms | 70241 tok/s) step 8203/76294 | train loss 3.492409 | norm 0.2011 | lr 8.31e-04 | (3808.21 ms | 137673 tok/s) step 8204/76294 | train loss 3.457314 | norm 0.1874 | lr 8.31e-04 | (3818.15 ms | 137315 tok/s) step 8205/76294 | train loss 3.513943 | norm 0.1892 | lr 8.31e-04 | (3798.39 ms | 138029 tok/s) step 8206/76294 | train loss 3.419912 | norm 0.1768 | lr 8.31e-04 | (3822.80 ms | 137148 tok/s) step 8207/76294 | train loss 3.517230 | norm 0.1717 | lr 8.31e-04 | (3796.11 ms | 138112 tok/s) step 8208/76294 | train loss 3.496929 | norm 0.1966 | lr 8.31e-04 | (3828.21 ms | 136954 tok/s) step 8209/76294 | train loss 3.508931 | norm 0.2175 | lr 8.31e-04 | (3800.96 ms | 137936 tok/s) step 8210/76294 | train loss 3.499321 | norm 0.1965 | lr 8.31e-04 | (3818.57 ms | 137300 tok/s) step 8211/76294 | train loss 3.472929 | norm 0.1901 | lr 8.31e-04 | (3822.05 ms | 137174 tok/s) step 8212/76294 | train loss 3.460386 | norm 0.2026 | lr 8.30e-04 | (3801.84 ms | 137904 tok/s) step 8213/76294 | train loss 3.448364 | norm 0.1905 | lr 8.30e-04 | (3801.97 ms | 137899 tok/s) step 8214/76294 | train loss 3.426070 | norm 0.2151 | lr 8.30e-04 | (3832.12 ms | 136814 tok/s) step 8215/76294 | train loss 3.513497 | norm 0.1935 | lr 8.30e-04 | (3801.49 ms | 137917 tok/s) step 8216/76294 | train loss 3.493363 | norm 0.1878 | lr 8.30e-04 | (3829.80 ms | 136897 tok/s) step 8217/76294 | train loss 3.494298 | norm 0.1896 | lr 8.30e-04 | (3821.66 ms | 137189 tok/s) step 8218/76294 | train loss 3.397671 | norm 0.2441 | lr 8.30e-04 | (3810.92 ms | 137575 tok/s) step 8219/76294 | train loss 3.470574 | norm 0.2086 | lr 8.30e-04 | (3822.47 ms | 137160 tok/s) step 8220/76294 | train loss 3.528297 | norm 0.2484 | lr 8.30e-04 | (3809.97 ms | 137609 tok/s) step 8221/76294 | train loss 3.526556 | norm 0.2148 | lr 8.30e-04 | (3828.74 ms | 136935 tok/s) step 8222/76294 | train loss 3.434083 | norm 0.2563 | lr 8.30e-04 | 
(3888.43 ms | 134833 tok/s) step 8223/76294 | train loss 3.524157 | norm 0.2271 | lr 8.30e-04 | (3807.44 ms | 137701 tok/s) step 8224/76294 | train loss 3.489684 | norm 0.2203 | lr 8.29e-04 | (3831.81 ms | 136825 tok/s) step 8225/76294 | train loss 3.527065 | norm 0.2020 | lr 8.29e-04 | (3808.52 ms | 137662 tok/s) step 8226/76294 | train loss 3.570041 | norm 0.1980 | lr 8.29e-04 | (3834.48 ms | 136730 tok/s) step 8227/76294 | train loss 3.471842 | norm 0.2096 | lr 8.29e-04 | (3810.59 ms | 137587 tok/s) step 8228/76294 | train loss 3.499449 | norm 0.2293 | lr 8.29e-04 | (3858.05 ms | 135895 tok/s) step 8229/76294 | train loss 3.490620 | norm 0.1894 | lr 8.29e-04 | (3870.49 ms | 135458 tok/s) step 8230/76294 | train loss 3.488734 | norm 0.2245 | lr 8.29e-04 | (3804.32 ms | 137814 tok/s) step 8231/76294 | train loss 3.506681 | norm 0.2003 | lr 8.29e-04 | (3856.66 ms | 135943 tok/s) step 8232/76294 | train loss 3.511837 | norm 0.1720 | lr 8.29e-04 | (3808.92 ms | 137647 tok/s) step 8233/76294 | train loss 3.484746 | norm 0.2048 | lr 8.29e-04 | (4008.24 ms | 130803 tok/s) step 8234/76294 | train loss 3.555354 | norm 0.1788 | lr 8.29e-04 | (3817.29 ms | 137346 tok/s) step 8235/76294 | train loss 3.488807 | norm 0.1752 | lr 8.28e-04 | (3836.30 ms | 136665 tok/s) step 8236/76294 | train loss 3.432667 | norm 0.2090 | lr 8.28e-04 | (3804.68 ms | 137801 tok/s) step 8237/76294 | train loss 3.555364 | norm 0.4193 | lr 8.28e-04 | (3831.23 ms | 136846 tok/s) step 8238/76294 | train loss 3.499585 | norm 0.3349 | lr 8.28e-04 | (3805.14 ms | 137784 tok/s) step 8239/76294 | train loss 3.437318 | norm 0.4113 | lr 8.28e-04 | (3815.22 ms | 137420 tok/s) step 8240/76294 | train loss 3.425446 | norm 0.2666 | lr 8.28e-04 | (3826.89 ms | 137001 tok/s) step 8241/76294 | train loss 3.508355 | norm 0.3262 | lr 8.28e-04 | (3807.66 ms | 137693 tok/s) step 8242/76294 | train loss 3.487496 | norm 0.2140 | lr 8.28e-04 | (3881.10 ms | 135087 tok/s) step 8243/76294 | train loss 3.497592 | norm 0.2967 
| lr 8.28e-04 | (3807.31 ms | 137706 tok/s) step 8244/76294 | train loss 3.455475 | norm 0.2081 | lr 8.28e-04 | (3858.78 ms | 135869 tok/s) step 8245/76294 | train loss 3.540494 | norm 0.2290 | lr 8.28e-04 | (3804.23 ms | 137817 tok/s) step 8246/76294 | train loss 3.510876 | norm 0.1934 | lr 8.28e-04 | (3810.56 ms | 137588 tok/s) step 8247/76294 | train loss 3.474241 | norm 0.2650 | lr 8.27e-04 | (3827.45 ms | 136981 tok/s) step 8248/76294 | train loss 3.467127 | norm 0.2030 | lr 8.27e-04 | (3813.13 ms | 137495 tok/s) step 8249/76294 | train loss 3.532314 | norm 0.2149 | lr 8.27e-04 | (3827.17 ms | 136991 tok/s) step 8250/76294 | train loss 3.481240 | norm 0.2645 | lr 8.27e-04 | (3806.92 ms | 137720 tok/s) val loss: 3.466353 saving model checkpoint to ./results/gpt2-124M-gqa/step_8250.pth step 8251/76294 | train loss 3.505727 | norm 0.1729 | lr 8.27e-04 | (3805.27 ms | 137779 tok/s) step 8252/76294 | train loss 3.481697 | norm 0.2193 | lr 8.27e-04 | (3824.92 ms | 137072 tok/s) step 8253/76294 | train loss 3.456118 | norm 0.1920 | lr 8.27e-04 | (3804.25 ms | 137816 tok/s) step 8254/76294 | train loss 3.538196 | norm 0.2078 | lr 8.27e-04 | (3802.25 ms | 137889 tok/s) step 8255/76294 | train loss 3.462829 | norm 0.2292 | lr 8.27e-04 | (3889.42 ms | 134798 tok/s) step 8256/76294 | train loss 3.470131 | norm 0.1969 | lr 8.27e-04 | (3803.35 ms | 137849 tok/s) step 8257/76294 | train loss 3.461179 | norm 0.1989 | lr 8.27e-04 | (3858.16 ms | 135891 tok/s) step 8258/76294 | train loss 3.475230 | norm 0.2383 | lr 8.27e-04 | (3801.57 ms | 137914 tok/s) step 8259/76294 | train loss 3.461381 | norm 0.1918 | lr 8.26e-04 | (3807.44 ms | 137701 tok/s) step 8260/76294 | train loss 3.510575 | norm 0.2142 | lr 8.26e-04 | (3821.68 ms | 137188 tok/s) step 8261/76294 | train loss 3.445102 | norm 0.1748 | lr 8.26e-04 | (3803.59 ms | 137840 tok/s) step 8262/76294 | train loss 3.573765 | norm 0.2160 | lr 8.26e-04 | (3979.19 ms | 131758 tok/s) step 8263/76294 | train loss 3.579061 | norm 
0.1854 | lr 8.26e-04 | (3798.24 ms | 138034 tok/s) step 8264/76294 | train loss 3.504426 | norm 0.2443 | lr 8.26e-04 | (3861.58 ms | 135770 tok/s) step 8265/76294 | train loss 3.488379 | norm 0.1832 | lr 8.26e-04 | (3802.55 ms | 137878 tok/s) step 8266/76294 | train loss 3.473393 | norm 0.1934 | lr 8.26e-04 | (3987.02 ms | 131499 tok/s) step 8267/76294 | train loss 3.502277 | norm 0.2052 | lr 8.26e-04 | (3828.97 ms | 136927 tok/s) step 8268/76294 | train loss 3.514786 | norm 0.2103 | lr 8.26e-04 | (3831.61 ms | 136832 tok/s) step 8269/76294 | train loss 3.449081 | norm 0.2238 | lr 8.26e-04 | (3809.11 ms | 137641 tok/s) step 8270/76294 | train loss 3.471686 | norm 0.1946 | lr 8.25e-04 | (3839.81 ms | 136540 tok/s) step 8271/76294 | train loss 3.555498 | norm 0.2064 | lr 8.25e-04 | (3808.20 ms | 137674 tok/s) step 8272/76294 | train loss 3.465369 | norm 0.1906 | lr 8.25e-04 | (3827.94 ms | 136963 tok/s) step 8273/76294 | train loss 3.467139 | norm 0.2721 | lr 8.25e-04 | (3807.22 ms | 137709 tok/s) step 8274/76294 | train loss 3.454678 | norm 0.3139 | lr 8.25e-04 | (3857.54 ms | 135912 tok/s) step 8275/76294 | train loss 3.398510 | norm 0.2601 | lr 8.25e-04 | (3807.09 ms | 137713 tok/s) step 8276/76294 | train loss 3.486777 | norm 0.2786 | lr 8.25e-04 | (3823.71 ms | 137115 tok/s) step 8277/76294 | train loss 3.499866 | norm 0.2406 | lr 8.25e-04 | (3805.86 ms | 137758 tok/s) step 8278/76294 | train loss 3.470572 | norm 0.2632 | lr 8.25e-04 | (3808.25 ms | 137672 tok/s) step 8279/76294 | train loss 3.475564 | norm 0.1757 | lr 8.25e-04 | (3825.61 ms | 137047 tok/s) step 8280/76294 | train loss 3.462108 | norm 0.3142 | lr 8.25e-04 | (3807.56 ms | 137697 tok/s) step 8281/76294 | train loss 3.427655 | norm 0.2264 | lr 8.25e-04 | (3826.98 ms | 136998 tok/s) step 8282/76294 | train loss 3.499294 | norm 0.2448 | lr 8.24e-04 | (3807.16 ms | 137711 tok/s) step 8283/76294 | train loss 3.510367 | norm 0.1707 | lr 8.24e-04 | (3925.58 ms | 133557 tok/s) step 8284/76294 | train loss 
3.521438 | norm 0.2122 | lr 8.24e-04 | (3806.63 ms | 137730 tok/s) step 8285/76294 | train loss 3.505278 | norm 0.1840 | lr 8.24e-04 | (3919.07 ms | 133779 tok/s) step 8286/76294 | train loss 3.495032 | norm 0.2250 | lr 8.24e-04 | (3809.69 ms | 137620 tok/s) step 8287/76294 | train loss 3.458691 | norm 0.2297 | lr 8.24e-04 | (3810.79 ms | 137580 tok/s) step 8288/76294 | train loss 3.504348 | norm 0.1821 | lr 8.24e-04 | (3826.37 ms | 137020 tok/s) step 8289/76294 | train loss 3.476230 | norm 0.2327 | lr 8.24e-04 | (3805.46 ms | 137772 tok/s) step 8290/76294 | train loss 3.473667 | norm 0.1947 | lr 8.24e-04 | (3812.78 ms | 137508 tok/s) step 8291/76294 | train loss 3.502878 | norm 0.2009 | lr 8.24e-04 | (3808.65 ms | 137657 tok/s) step 8292/76294 | train loss 3.508092 | norm 0.2178 | lr 8.24e-04 | (3806.30 ms | 137742 tok/s) step 8293/76294 | train loss 3.425644 | norm 0.2400 | lr 8.24e-04 | (3921.62 ms | 133692 tok/s) step 8294/76294 | train loss 3.431525 | norm 0.1712 | lr 8.23e-04 | (3802.16 ms | 137892 tok/s) step 8295/76294 | train loss 3.449700 | norm 0.2133 | lr 8.23e-04 | (3805.95 ms | 137755 tok/s) step 8296/76294 | train loss 3.470747 | norm 0.2148 | lr 8.23e-04 | (3830.35 ms | 136877 tok/s) step 8297/76294 | train loss 3.410378 | norm 0.1927 | lr 8.23e-04 | (3803.08 ms | 137859 tok/s) step 8298/76294 | train loss 3.480752 | norm 0.2023 | lr 8.23e-04 | (3803.49 ms | 137844 tok/s) step 8299/76294 | train loss 3.542719 | norm 0.2101 | lr 8.23e-04 | (3832.77 ms | 136791 tok/s) step 8300/76294 | train loss 3.434855 | norm 0.2581 | lr 8.23e-04 | (3811.90 ms | 137540 tok/s) step 8301/76294 | train loss 3.492625 | norm 0.2594 | lr 8.23e-04 | (3837.76 ms | 136613 tok/s) step 8302/76294 | train loss 3.457280 | norm 0.2088 | lr 8.23e-04 | (3856.71 ms | 135942 tok/s) step 8303/76294 | train loss 3.484956 | norm 0.2182 | lr 8.23e-04 | (3906.70 ms | 134202 tok/s) step 8304/76294 | train loss 3.492044 | norm 0.1876 | lr 8.23e-04 | (3803.45 ms | 137845 tok/s) step 
8305/76294 | train loss 3.477685 | norm 0.1979 | lr 8.22e-04 | (3837.81 ms | 136611 tok/s) step 8306/76294 | train loss 3.471874 | norm 0.1991 | lr 8.22e-04 | (3849.58 ms | 136193 tok/s) step 8307/76294 | train loss 3.462621 | norm 0.2006 | lr 8.22e-04 | (3807.34 ms | 137704 tok/s) step 8308/76294 | train loss 3.472653 | norm 0.2219 | lr 8.22e-04 | (3831.89 ms | 136822 tok/s) step 8309/76294 | train loss 3.433064 | norm 0.1691 | lr 8.22e-04 | (3815.44 ms | 137412 tok/s) step 8310/76294 | train loss 3.480175 | norm 0.2524 | lr 8.22e-04 | (3804.63 ms | 137802 tok/s) step 8311/76294 | train loss 3.464738 | norm 0.1787 | lr 8.22e-04 | (3836.47 ms | 136659 tok/s) step 8312/76294 | train loss 3.520857 | norm 0.3255 | lr 8.22e-04 | (3803.03 ms | 137860 tok/s) step 8313/76294 | train loss 3.470807 | norm 0.2261 | lr 8.22e-04 | (3829.48 ms | 136909 tok/s) step 8314/76294 | train loss 3.451072 | norm 0.2378 | lr 8.22e-04 | (3800.78 ms | 137942 tok/s) step 8315/76294 | train loss 3.461197 | norm 0.1989 | lr 8.22e-04 | (3851.74 ms | 136117 tok/s) step 8316/76294 | train loss 3.479768 | norm 0.2172 | lr 8.22e-04 | (3801.80 ms | 137905 tok/s) step 8317/76294 | train loss 3.501636 | norm 0.1946 | lr 8.21e-04 | (3863.21 ms | 135713 tok/s) step 8318/76294 | train loss 3.560422 | norm 0.2180 | lr 8.21e-04 | (3799.29 ms | 137996 tok/s) step 8319/76294 | train loss 3.520746 | norm 0.2176 | lr 8.21e-04 | (3850.07 ms | 136176 tok/s) step 8320/76294 | train loss 3.489461 | norm 0.2239 | lr 8.21e-04 | (3829.98 ms | 136890 tok/s) step 8321/76294 | train loss 3.484433 | norm 0.2616 | lr 8.21e-04 | (3833.21 ms | 136775 tok/s) step 8322/76294 | train loss 3.473824 | norm 0.1789 | lr 8.21e-04 | (3825.36 ms | 137056 tok/s) step 8323/76294 | train loss 3.430369 | norm 0.2129 | lr 8.21e-04 | (3839.44 ms | 136553 tok/s) step 8324/76294 | train loss 3.510092 | norm 0.2233 | lr 8.21e-04 | (3979.57 ms | 131745 tok/s) step 8325/76294 | train loss 3.505791 | norm 0.2376 | lr 8.21e-04 | (3799.94 ms | 
137973 tok/s) step 8326/76294 | train loss 3.460800 | norm 0.2391 | lr 8.21e-04 | (3809.30 ms | 137634 tok/s) step 8327/76294 | train loss 3.481545 | norm 0.2061 | lr 8.21e-04 | (3820.37 ms | 137235 tok/s) step 8328/76294 | train loss 3.494819 | norm 0.2051 | lr 8.21e-04 | (3802.74 ms | 137871 tok/s) step 8329/76294 | train loss 3.489711 | norm 0.1990 | lr 8.20e-04 | (3803.01 ms | 137861 tok/s) step 8330/76294 | train loss 3.541567 | norm 0.2047 | lr 8.20e-04 | (3857.52 ms | 135913 tok/s) step 8331/76294 | train loss 3.504207 | norm 0.2185 | lr 8.20e-04 | (3800.61 ms | 137948 tok/s) step 8332/76294 | train loss 3.434734 | norm 0.1922 | lr 8.20e-04 | (3812.71 ms | 137510 tok/s) step 8333/76294 | train loss 3.464665 | norm 0.3290 | lr 8.20e-04 | (3829.35 ms | 136913 tok/s) step 8334/76294 | train loss 3.522442 | norm 0.2419 | lr 8.20e-04 | (3806.54 ms | 137734 tok/s) step 8335/76294 | train loss 3.529194 | norm 0.2569 | lr 8.20e-04 | (3807.75 ms | 137690 tok/s) step 8336/76294 | train loss 3.515973 | norm 0.2430 | lr 8.20e-04 | (3845.65 ms | 136333 tok/s) step 8337/76294 | train loss 3.435862 | norm 0.1879 | lr 8.20e-04 | (3804.27 ms | 137816 tok/s) step 8338/76294 | train loss 3.481664 | norm 0.2120 | lr 8.20e-04 | (3862.91 ms | 135724 tok/s) step 8339/76294 | train loss 3.475965 | norm 0.2026 | lr 8.20e-04 | (3803.82 ms | 137832 tok/s) step 8340/76294 | train loss 3.511880 | norm 0.1778 | lr 8.19e-04 | (3842.48 ms | 136445 tok/s) step 8341/76294 | train loss 3.446742 | norm 0.1837 | lr 8.19e-04 | (3798.57 ms | 138022 tok/s) step 8342/76294 | train loss 3.531080 | norm 0.2423 | lr 8.19e-04 | (3825.16 ms | 137063 tok/s) step 8343/76294 | train loss 3.496553 | norm 0.2441 | lr 8.19e-04 | (3805.19 ms | 137782 tok/s) step 8344/76294 | train loss 3.501288 | norm 0.1989 | lr 8.19e-04 | (3804.86 ms | 137794 tok/s) step 8345/76294 | train loss 3.545145 | norm 0.2460 | lr 8.19e-04 | (3952.86 ms | 132635 tok/s) step 8346/76294 | train loss 3.505194 | norm 0.2051 | lr 8.19e-04 
| (3800.76 ms | 137943 tok/s) step 8347/76294 | train loss 3.508444 | norm 0.2371 | lr 8.19e-04 | (3827.48 ms | 136980 tok/s) step 8348/76294 | train loss 3.518288 | norm 0.1714 | lr 8.19e-04 | (3808.74 ms | 137654 tok/s) step 8349/76294 | train loss 3.512138 | norm 0.2155 | lr 8.19e-04 | (3836.84 ms | 136646 tok/s) step 8350/76294 | train loss 3.449275 | norm 0.1719 | lr 8.19e-04 | (3800.62 ms | 137948 tok/s) step 8351/76294 | train loss 3.565198 | norm 0.1836 | lr 8.19e-04 | (3804.70 ms | 137800 tok/s) step 8352/76294 | train loss 3.509790 | norm 0.2208 | lr 8.18e-04 | (3829.79 ms | 136897 tok/s) step 8353/76294 | train loss 3.462686 | norm 0.1913 | lr 8.18e-04 | (3811.53 ms | 137553 tok/s) step 8354/76294 | train loss 3.449411 | norm 0.1885 | lr 8.18e-04 | (3830.46 ms | 136873 tok/s) step 8355/76294 | train loss 3.507606 | norm 0.2058 | lr 8.18e-04 | (3811.44 ms | 137557 tok/s) step 8356/76294 | train loss 3.452376 | norm 0.2022 | lr 8.18e-04 | (3845.79 ms | 136328 tok/s) step 8357/76294 | train loss 3.365948 | norm 0.2617 | lr 8.18e-04 | (3808.95 ms | 137646 tok/s) step 8358/76294 | train loss 3.488707 | norm 0.2559 | lr 8.18e-04 | (3823.52 ms | 137122 tok/s) step 8359/76294 | train loss 3.495672 | norm 0.2230 | lr 8.18e-04 | (3804.00 ms | 137826 tok/s) step 8360/76294 | train loss 3.515192 | norm 0.2585 | lr 8.18e-04 | (3814.32 ms | 137453 tok/s) step 8361/76294 | train loss 3.514662 | norm 0.2133 | lr 8.18e-04 | (3830.67 ms | 136866 tok/s) step 8362/76294 | train loss 3.494678 | norm 0.2199 | lr 8.18e-04 | (3803.24 ms | 137853 tok/s) step 8363/76294 | train loss 3.532441 | norm 0.2009 | lr 8.18e-04 | (3853.87 ms | 136042 tok/s) step 8364/76294 | train loss 3.464260 | norm 0.1989 | lr 8.17e-04 | (3803.03 ms | 137860 tok/s) step 8365/76294 | train loss 3.436532 | norm 0.2139 | lr 8.17e-04 | (3892.10 ms | 134706 tok/s) step 8366/76294 | train loss 3.505308 | norm 0.2931 | lr 8.17e-04 | (3838.11 ms | 136601 tok/s) step 8367/76294 | train loss 3.451680 | norm 
0.2419 | lr 8.17e-04 | (3806.30 ms | 137742 tok/s) step 8368/76294 | train loss 3.501616 | norm 0.3053 | lr 8.17e-04 | (3826.82 ms | 137003 tok/s) step 8369/76294 | train loss 3.474540 | norm 0.2823 | lr 8.17e-04 | (3806.30 ms | 137742 tok/s) step 8370/76294 | train loss 3.488548 | norm 0.2439 | lr 8.17e-04 | (3806.66 ms | 137729 tok/s) step 8371/76294 | train loss 3.435654 | norm 0.2369 | lr 8.17e-04 | (3832.42 ms | 136804 tok/s) step 8372/76294 | train loss 3.538851 | norm 0.2387 | lr 8.17e-04 | (3802.94 ms | 137864 tok/s) step 8373/76294 | train loss 3.491984 | norm 0.2645 | lr 8.17e-04 | (3888.93 ms | 134815 tok/s) step 8374/76294 | train loss 3.564263 | norm 0.2501 | lr 8.17e-04 | (3824.29 ms | 137094 tok/s) step 8375/76294 | train loss 3.505831 | norm 0.2027 | lr 8.16e-04 | (3829.34 ms | 136913 tok/s) step 8376/76294 | train loss 3.517951 | norm 0.2860 | lr 8.16e-04 | (3801.67 ms | 137910 tok/s) step 8377/76294 | train loss 3.534414 | norm 0.3083 | lr 8.16e-04 | (3801.04 ms | 137933 tok/s) step 8378/76294 | train loss 3.484690 | norm 0.1929 | lr 8.16e-04 | (4049.59 ms | 129467 tok/s) step 8379/76294 | train loss 3.434370 | norm 0.3676 | lr 8.16e-04 | (3799.95 ms | 137972 tok/s) step 8380/76294 | train loss 3.531898 | norm 0.2327 | lr 8.16e-04 | (3828.20 ms | 136954 tok/s) step 8381/76294 | train loss 3.485058 | norm 0.3242 | lr 8.16e-04 | (3801.75 ms | 137907 tok/s) step 8382/76294 | train loss 3.491347 | norm 0.2088 | lr 8.16e-04 | (3810.14 ms | 137603 tok/s) step 8383/76294 | train loss 3.510209 | norm 0.2629 | lr 8.16e-04 | (3823.28 ms | 137130 tok/s) step 8384/76294 | train loss 3.490248 | norm 0.2088 | lr 8.16e-04 | (3808.12 ms | 137676 tok/s) step 8385/76294 | train loss 3.596619 | norm 0.2491 | lr 8.16e-04 | (3809.41 ms | 137630 tok/s) step 8386/76294 | train loss 3.429726 | norm 0.2091 | lr 8.16e-04 | (3867.02 ms | 135579 tok/s) step 8387/76294 | train loss 3.470264 | norm 0.2745 | lr 8.15e-04 | (3844.86 ms | 136361 tok/s) step 8388/76294 | train loss 
3.492972 | norm 0.2628 | lr 8.15e-04 | (3807.37 ms | 137704 tok/s) step 8389/76294 | train loss 3.436573 | norm 0.2746 | lr 8.15e-04 | (3812.42 ms | 137521 tok/s) step 8390/76294 | train loss 3.512985 | norm 0.2130 | lr 8.15e-04 | (3805.21 ms | 137782 tok/s) step 8391/76294 | train loss 3.498101 | norm 0.2900 | lr 8.15e-04 | (3808.50 ms | 137662 tok/s) step 8392/76294 | train loss 3.464812 | norm 0.2459 | lr 8.15e-04 | (3839.28 ms | 136559 tok/s) step 8393/76294 | train loss 3.410492 | norm 0.3108 | lr 8.15e-04 | (4182.47 ms | 125354 tok/s) step 8394/76294 | train loss 3.449538 | norm 0.1942 | lr 8.15e-04 | (3826.86 ms | 137002 tok/s) step 8395/76294 | train loss 3.440706 | norm 0.2466 | lr 8.15e-04 | (3833.45 ms | 136767 tok/s) step 8396/76294 | train loss 3.580470 | norm 0.2361 | lr 8.15e-04 | (3798.97 ms | 138008 tok/s) step 8397/76294 | train loss 3.459381 | norm 0.2014 | lr 8.15e-04 | (3836.01 ms | 136675 tok/s) step 8398/76294 | train loss 3.386295 | norm 0.2405 | lr 8.14e-04 | (3801.37 ms | 137921 tok/s) step 8399/76294 | train loss 3.475305 | norm 0.2093 | lr 8.14e-04 | (3848.72 ms | 136224 tok/s) step 8400/76294 | train loss 3.452199 | norm 0.1696 | lr 8.14e-04 | (3814.45 ms | 137448 tok/s) step 8401/76294 | train loss 3.462930 | norm 0.1817 | lr 8.14e-04 | (3804.17 ms | 137819 tok/s) step 8402/76294 | train loss 3.478488 | norm 0.1915 | lr 8.14e-04 | (3853.54 ms | 136054 tok/s) step 8403/76294 | train loss 3.432129 | norm 0.1907 | lr 8.14e-04 | (3804.26 ms | 137816 tok/s) step 8404/76294 | train loss 3.483782 | norm 0.2217 | lr 8.14e-04 | (3804.12 ms | 137821 tok/s) step 8405/76294 | train loss 3.428182 | norm 0.2568 | lr 8.14e-04 | (3853.67 ms | 136049 tok/s) step 8406/76294 | train loss 3.476567 | norm 0.2324 | lr 8.14e-04 | (3804.50 ms | 137807 tok/s) step 8407/76294 | train loss 3.446472 | norm 0.2745 | lr 8.14e-04 | (3950.15 ms | 132726 tok/s) step 8408/76294 | train loss 3.570833 | norm 0.1890 | lr 8.14e-04 | (3796.99 ms | 138080 tok/s) step 
8409/76294 | train loss 3.431773 | norm 0.1843 | lr 8.14e-04 | (3851.72 ms | 136118 tok/s) step 8410/76294 | train loss 3.465700 | norm 0.1760 | lr 8.13e-04 | (3848.63 ms | 136227 tok/s) step 8411/76294 | train loss 3.456191 | norm 0.1861 | lr 8.13e-04 | (3962.31 ms | 132319 tok/s) step 8412/76294 | train loss 3.451932 | norm 0.1956 | lr 8.13e-04 | (3794.81 ms | 138159 tok/s) step 8413/76294 | train loss 3.489889 | norm 0.2437 | lr 8.13e-04 | (3835.94 ms | 136678 tok/s) step 8414/76294 | train loss 3.462195 | norm 0.1912 | lr 8.13e-04 | (3830.86 ms | 136859 tok/s) step 8415/76294 | train loss 3.465368 | norm 0.2691 | lr 8.13e-04 | (3802.38 ms | 137884 tok/s) step 8416/76294 | train loss 3.465526 | norm 0.1993 | lr 8.13e-04 | (3798.96 ms | 138008 tok/s) step 8417/76294 | train loss 3.523628 | norm 0.2837 | lr 8.13e-04 | (3835.87 ms | 136680 tok/s) step 8418/76294 | train loss 3.527286 | norm 0.2200 | lr 8.13e-04 | (3802.15 ms | 137892 tok/s) step 8419/76294 | train loss 3.444552 | norm 0.2159 | lr 8.13e-04 | (3821.37 ms | 137199 tok/s) step 8420/76294 | train loss 3.561488 | norm 0.2156 | lr 8.13e-04 | (3798.03 ms | 138042 tok/s) step 8421/76294 | train loss 3.510063 | norm 0.2237 | lr 8.13e-04 | (3814.06 ms | 137462 tok/s) step 8422/76294 | train loss 3.474896 | norm 0.2455 | lr 8.12e-04 | (3801.40 ms | 137920 tok/s) step 8423/76294 | train loss 3.467732 | norm 0.2554 | lr 8.12e-04 | (3853.51 ms | 136055 tok/s) step 8424/76294 | train loss 3.408094 | norm 0.2101 | lr 8.12e-04 | (3809.41 ms | 137630 tok/s) step 8425/76294 | train loss 3.477842 | norm 0.2677 | lr 8.12e-04 | (3835.58 ms | 136691 tok/s) step 8426/76294 | train loss 3.336030 | norm 0.2356 | lr 8.12e-04 | (3805.00 ms | 137789 tok/s) step 8427/76294 | train loss 3.439899 | norm 0.2217 | lr 8.12e-04 | (3891.54 ms | 134725 tok/s) step 8428/76294 | train loss 3.448344 | norm 0.2265 | lr 8.12e-04 | (3802.86 ms | 137867 tok/s) step 8429/76294 | train loss 3.486569 | norm 0.3496 | lr 8.12e-04 | (3825.71 ms | 
137043 tok/s) step 8430/76294 | train loss 3.474992 | norm 0.2835 | lr 8.12e-04 | (3865.47 ms | 135634 tok/s) step 8431/76294 | train loss 3.476512 | norm 0.2309 | lr 8.12e-04 | (3800.83 ms | 137940 tok/s) step 8432/76294 | train loss 3.535205 | norm 0.3124 | lr 8.12e-04 | (3840.10 ms | 136530 tok/s) step 8433/76294 | train loss 3.442749 | norm 0.3001 | lr 8.11e-04 | (3811.91 ms | 137539 tok/s) step 8434/76294 | train loss 3.484139 | norm 0.2655 | lr 8.11e-04 | (3859.95 ms | 135828 tok/s) step 8435/76294 | train loss 3.487158 | norm 0.3342 | lr 8.11e-04 | (3810.31 ms | 137597 tok/s) step 8436/76294 | train loss 3.510940 | norm 0.2689 | lr 8.11e-04 | (3816.47 ms | 137375 tok/s) step 8437/76294 | train loss 3.379442 | norm 0.2683 | lr 8.11e-04 | (3835.32 ms | 136700 tok/s) step 8438/76294 | train loss 3.494096 | norm 0.2767 | lr 8.11e-04 | (3814.89 ms | 137432 tok/s) step 8439/76294 | train loss 3.424992 | norm 0.2529 | lr 8.11e-04 | (3811.67 ms | 137548 tok/s) step 8440/76294 | train loss 3.487032 | norm 0.2457 | lr 8.11e-04 | (3842.78 ms | 136435 tok/s) step 8441/76294 | train loss 3.437655 | norm 0.2206 | lr 8.11e-04 | (3813.14 ms | 137495 tok/s) step 8442/76294 | train loss 3.478384 | norm 0.2346 | lr 8.11e-04 | (3815.83 ms | 137398 tok/s) step 8443/76294 | train loss 3.438221 | norm 0.2088 | lr 8.11e-04 | (3834.63 ms | 136724 tok/s) step 8444/76294 | train loss 3.520334 | norm 0.2246 | lr 8.11e-04 | (3825.16 ms | 137063 tok/s) step 8445/76294 | train loss 3.459834 | norm 0.2056 | lr 8.10e-04 | (3834.08 ms | 136744 tok/s) step 8446/76294 | train loss 3.412402 | norm 0.2463 | lr 8.10e-04 | (3839.67 ms | 136545 tok/s) step 8447/76294 | train loss 3.444947 | norm 0.2624 | lr 8.10e-04 | (3892.59 ms | 134689 tok/s) step 8448/76294 | train loss 3.484637 | norm 0.2176 | lr 8.10e-04 | (3891.31 ms | 134733 tok/s) step 8449/76294 | train loss 3.462936 | norm 0.3742 | lr 8.10e-04 | (3807.42 ms | 137702 tok/s) step 8450/76294 | train loss 3.516793 | norm 0.3712 | lr 8.10e-04 
| (3821.74 ms | 137186 tok/s) step 8451/76294 | train loss 3.483694 | norm 0.2499 | lr 8.10e-04 | (3802.72 ms | 137872 tok/s) step 8452/76294 | train loss 3.452178 | norm 0.3587 | lr 8.10e-04 | (3856.81 ms | 135938 tok/s) step 8453/76294 | train loss 3.414984 | norm 0.2178 | lr 8.10e-04 | (3805.50 ms | 137771 tok/s) step 8454/76294 | train loss 3.408916 | norm 0.2666 | lr 8.10e-04 | (3810.10 ms | 137605 tok/s) step 8455/76294 | train loss 3.440645 | norm 0.2321 | lr 8.10e-04 | (3825.97 ms | 137034 tok/s) step 8456/76294 | train loss 3.533800 | norm 0.2528 | lr 8.09e-04 | (3811.63 ms | 137550 tok/s) step 8457/76294 | train loss 3.457026 | norm 0.2091 | lr 8.09e-04 | (3804.63 ms | 137803 tok/s) step 8458/76294 | train loss 3.469773 | norm 0.2079 | lr 8.09e-04 | (3839.87 ms | 136538 tok/s) step 8459/76294 | train loss 3.428020 | norm 0.1921 | lr 8.09e-04 | (3838.79 ms | 136576 tok/s) step 8460/76294 | train loss 3.493635 | norm 0.2019 | lr 8.09e-04 | (3803.47 ms | 137845 tok/s) step 8461/76294 | train loss 3.402972 | norm 0.1993 | lr 8.09e-04 | (3848.69 ms | 136225 tok/s) step 8462/76294 | train loss 3.456084 | norm 0.2054 | lr 8.09e-04 | (3805.78 ms | 137761 tok/s) step 8463/76294 | train loss 3.422500 | norm 0.1926 | lr 8.09e-04 | (3861.36 ms | 135778 tok/s) step 8464/76294 | train loss 3.485434 | norm 0.1964 | lr 8.09e-04 | (3802.81 ms | 137868 tok/s) step 8465/76294 | train loss 3.474529 | norm 0.2210 | lr 8.09e-04 | (3833.72 ms | 136757 tok/s) step 8466/76294 | train loss 3.512514 | norm 0.1849 | lr 8.09e-04 | (3806.62 ms | 137731 tok/s) step 8467/76294 | train loss 3.452386 | norm 0.2044 | lr 8.09e-04 | (3860.69 ms | 135802 tok/s) step 8468/76294 | train loss 3.599722 | norm 0.2075 | lr 8.08e-04 | (3804.63 ms | 137802 tok/s) step 8469/76294 | train loss 3.355145 | norm 0.2981 | lr 8.08e-04 | (3813.38 ms | 137486 tok/s) step 8470/76294 | train loss 3.466683 | norm 0.2863 | lr 8.08e-04 | (3833.00 ms | 136783 tok/s) step 8471/76294 | train loss 3.477326 | norm 
0.2124 | lr 8.08e-04 | (3804.56 ms | 137805 tok/s) step 8472/76294 | train loss 3.596251 | norm 0.2565 | lr 8.08e-04 | (3802.31 ms | 137887 tok/s) step 8473/76294 | train loss 3.495435 | norm 0.2606 | lr 8.08e-04 | (3830.76 ms | 136863 tok/s) step 8474/76294 | train loss 3.460973 | norm 0.2588 | lr 8.08e-04 | (3804.27 ms | 137816 tok/s) step 8475/76294 | train loss 3.490436 | norm 0.2553 | lr 8.08e-04 | (3854.29 ms | 136027 tok/s) step 8476/76294 | train loss 3.489961 | norm 0.2589 | lr 8.08e-04 | (3803.83 ms | 137831 tok/s) step 8477/76294 | train loss 3.623848 | norm 0.2188 | lr 8.08e-04 | (3844.94 ms | 136358 tok/s) step 8478/76294 | train loss 3.550413 | norm 0.2613 | lr 8.08e-04 | (3833.43 ms | 136767 tok/s) step 8479/76294 | train loss 3.498541 | norm 0.2271 | lr 8.07e-04 | (3808.57 ms | 137660 tok/s) step 8480/76294 | train loss 3.452810 | norm 0.2038 | lr 8.07e-04 | (3802.45 ms | 137882 tok/s) step 8481/76294 | train loss 3.439601 | norm 0.1994 | lr 8.07e-04 | (3837.67 ms | 136616 tok/s) step 8482/76294 | train loss 3.477211 | norm 0.2155 | lr 8.07e-04 | (3797.84 ms | 138049 tok/s) step 8483/76294 | train loss 3.499454 | norm 0.2119 | lr 8.07e-04 | (3867.80 ms | 135552 tok/s) step 8484/76294 | train loss 3.444899 | norm 0.1983 | lr 8.07e-04 | (3830.38 ms | 136876 tok/s) step 8485/76294 | train loss 3.454554 | norm 0.2214 | lr 8.07e-04 | (3805.15 ms | 137784 tok/s) step 8486/76294 | train loss 3.561882 | norm 0.2533 | lr 8.07e-04 | (3803.42 ms | 137846 tok/s) step 8487/76294 | train loss 3.487932 | norm 0.1885 | lr 8.07e-04 | (3837.53 ms | 136621 tok/s) step 8488/76294 | train loss 3.494489 | norm 0.2247 | lr 8.07e-04 | (3802.74 ms | 137871 tok/s) step 8489/76294 | train loss 3.590010 | norm 0.1962 | lr 8.07e-04 | (4209.54 ms | 124548 tok/s) step 8490/76294 | train loss 3.447203 | norm 0.2304 | lr 8.07e-04 | (3819.78 ms | 137256 tok/s) step 8491/76294 | train loss 3.515679 | norm 0.1934 | lr 8.06e-04 | (3863.81 ms | 135692 tok/s) step 8492/76294 | train loss 
3.459337 | norm 0.2179 | lr 8.06e-04 | (3807.01 ms | 137716 tok/s) step 8493/76294 | train loss 3.459737 | norm 0.1944 | lr 8.06e-04 | (3908.12 ms | 134154 tok/s) step 8494/76294 | train loss 3.446960 | norm 0.2571 | lr 8.06e-04 | (3863.76 ms | 135694 tok/s) step 8495/76294 | train loss 3.501662 | norm 0.1852 | lr 8.06e-04 | (3890.85 ms | 134749 tok/s) step 8496/76294 | train loss 3.463571 | norm 0.2455 | lr 8.06e-04 | (3811.99 ms | 137537 tok/s) step 8497/76294 | train loss 3.468856 | norm 0.2169 | lr 8.06e-04 | (3815.36 ms | 137415 tok/s) step 8498/76294 | train loss 3.419121 | norm 0.2315 | lr 8.06e-04 | (3836.10 ms | 136672 tok/s) step 8499/76294 | train loss 3.458340 | norm 0.1928 | lr 8.06e-04 | (3913.23 ms | 133978 tok/s) step 8500/76294 | train loss 3.516808 | norm 0.2680 | lr 8.06e-04 | (3823.63 ms | 137118 tok/s)
val loss: 3.454927
saving model checkpoint to ./results/gpt2-124M-gqa/step_8500.pth
step 8501/76294 | train loss 3.436212 | norm 0.2307 | lr 8.06e-04 | (3907.05 ms | 134190 tok/s) step 8502/76294 | train loss 3.413183 | norm 0.1956 | lr 8.05e-04 | (3795.94 ms | 138118 tok/s) step 8503/76294 | train loss 3.510906 | norm 0.2378 | lr 8.05e-04 | (3861.63 ms | 135769 tok/s) step 8504/76294 | train loss 3.455487 | norm 0.2348 | lr 8.05e-04 | (3805.27 ms | 137779 tok/s) step 8505/76294 | train loss 3.451113 | norm 0.2136 | lr 8.05e-04 | (3829.53 ms | 136907 tok/s) step 8506/76294 | train loss 3.413332 | norm 0.3004 | lr 8.05e-04 | (4011.47 ms | 130697 tok/s) step 8507/76294 | train loss 3.457407 | norm 0.1962 | lr 8.05e-04 | (3818.11 ms | 137316 tok/s) step 8508/76294 | train loss 3.495634 | norm 0.2603 | lr 8.05e-04 | (3841.69 ms | 136473 tok/s) step 8509/76294 | train loss 3.398767 | norm 0.2411 | lr 8.05e-04 | (3984.23 ms | 131591 tok/s) step 8510/76294 | train loss 3.510688 | norm 0.2237 | lr 8.05e-04 | (3797.46 ms | 138063 tok/s) step 8511/76294 | train loss 3.463014 | norm 0.2378 | lr 8.05e-04 | (4270.30 ms | 122775 tok/s) step 8512/76294 | train 
loss 3.420726 | norm 0.2674 | lr 8.05e-04 | (3837.29 ms | 136630 tok/s) step 8513/76294 | train loss 3.448793 | norm 0.2226 | lr 8.05e-04 | (3909.50 ms | 134106 tok/s) step 8514/76294 | train loss 3.408349 | norm 0.2779 | lr 8.04e-04 | (3883.97 ms | 134988 tok/s) step 8515/76294 | train loss 3.482505 | norm 0.2219 | lr 8.04e-04 | (3801.38 ms | 137920 tok/s) step 8516/76294 | train loss 3.437027 | norm 0.2100 | lr 8.04e-04 | (3831.75 ms | 136827 tok/s) step 8517/76294 | train loss 3.398339 | norm 0.1852 | lr 8.04e-04 | (3819.65 ms | 137261 tok/s) step 8518/76294 | train loss 3.536863 | norm 0.1918 | lr 8.04e-04 | (3801.80 ms | 137905 tok/s) step 8519/76294 | train loss 3.435959 | norm 0.2285 | lr 8.04e-04 | (3858.27 ms | 135887 tok/s) step 8520/76294 | train loss 3.393904 | norm 0.1884 | lr 8.04e-04 | (3800.99 ms | 137935 tok/s) step 8521/76294 | train loss 3.433814 | norm 0.1986 | lr 8.04e-04 | (3849.50 ms | 136197 tok/s) step 8522/76294 | train loss 3.398524 | norm 0.2610 | lr 8.04e-04 | (3842.58 ms | 136442 tok/s) step 8523/76294 | train loss 3.598614 | norm 0.2149 | lr 8.04e-04 | (3830.13 ms | 136885 tok/s) step 8524/76294 | train loss 3.451050 | norm 0.2794 | lr 8.04e-04 | (3820.95 ms | 137214 tok/s) step 8525/76294 | train loss 3.458244 | norm 0.2254 | lr 8.04e-04 | (3807.12 ms | 137712 tok/s) step 8526/76294 | train loss 3.467114 | norm 0.1873 | lr 8.03e-04 | (3801.23 ms | 137926 tok/s) step 8527/76294 | train loss 3.455720 | norm 0.2112 | lr 8.03e-04 | (3886.65 ms | 134895 tok/s) step 8528/76294 | train loss 3.460057 | norm 0.2037 | lr 8.03e-04 | (3805.83 ms | 137759 tok/s) step 8529/76294 | train loss 3.432972 | norm 0.2304 | lr 8.03e-04 | (4058.19 ms | 129193 tok/s) step 8530/76294 | train loss 3.450380 | norm 0.2069 | lr 8.03e-04 | (3804.76 ms | 137798 tok/s) step 8531/76294 | train loss 3.398057 | norm 0.2103 | lr 8.03e-04 | (3834.52 ms | 136729 tok/s) step 8532/76294 | train loss 3.555933 | norm 0.2193 | lr 8.03e-04 | (3802.06 ms | 137896 tok/s) step 
8533/76294 | train loss 3.423069 | norm 0.1909 | lr 8.03e-04 | (4075.01 ms | 128659 tok/s) step 8534/76294 | train loss 3.454416 | norm 0.1977 | lr 8.03e-04 | (3807.00 ms | 137717 tok/s) step 8535/76294 | train loss 3.472010 | norm 0.1966 | lr 8.03e-04 | (4012.33 ms | 130669 tok/s) step 8536/76294 | train loss 3.525742 | norm 0.2295 | lr 8.03e-04 | (3807.75 ms | 137690 tok/s) step 8537/76294 | train loss 3.412291 | norm 0.1993 | lr 8.02e-04 | (3828.76 ms | 136934 tok/s) step 8538/76294 | train loss 3.501998 | norm 0.1806 | lr 8.02e-04 | (4278.43 ms | 122542 tok/s) step 8539/76294 | train loss 3.440799 | norm 0.2597 | lr 8.02e-04 | (3805.37 ms | 137776 tok/s) step 8540/76294 | train loss 3.525663 | norm 0.2322 | lr 8.02e-04 | (3836.86 ms | 136645 tok/s) step 8541/76294 | train loss 3.417984 | norm 0.2118 | lr 8.02e-04 | (4139.68 ms | 126649 tok/s) step 8542/76294 | train loss 3.487884 | norm 0.2156 | lr 8.02e-04 | (3788.45 ms | 138391 tok/s) step 8543/76294 | train loss 3.416731 | norm 0.2265 | lr 8.02e-04 | (3851.75 ms | 136117 tok/s) step 8544/76294 | train loss 3.563901 | norm 0.2272 | lr 8.02e-04 | (3796.47 ms | 138099 tok/s) step 8545/76294 | train loss 3.444621 | norm 0.2042 | lr 8.02e-04 | (3961.90 ms | 132333 tok/s) step 8546/76294 | train loss 3.483597 | norm 0.2341 | lr 8.02e-04 | (3825.86 ms | 137038 tok/s) step 8547/76294 | train loss 3.404027 | norm 0.2047 | lr 8.02e-04 | (3798.48 ms | 138026 tok/s) step 8548/76294 | train loss 3.436611 | norm 0.2348 | lr 8.02e-04 | (3923.13 ms | 133640 tok/s) step 8549/76294 | train loss 3.481949 | norm 0.2074 | lr 8.01e-04 | (3799.38 ms | 137993 tok/s) step 8550/76294 | train loss 3.442397 | norm 0.2514 | lr 8.01e-04 | (3814.71 ms | 137438 tok/s) step 8551/76294 | train loss 3.446141 | norm 0.2265 | lr 8.01e-04 | (3833.20 ms | 136775 tok/s) step 8552/76294 | train loss 3.384640 | norm 0.1889 | lr 8.01e-04 | (3886.03 ms | 134916 tok/s) step 8553/76294 | train loss 3.446143 | norm 0.2273 | lr 8.01e-04 | (3796.63 ms | 
138093 tok/s) step 8554/76294 | train loss 3.420345 | norm 0.1962 | lr 8.01e-04 | (3847.89 ms | 136253 tok/s) step 8555/76294 | train loss 3.474997 | norm 0.2055 | lr 8.01e-04 | (3799.57 ms | 137986 tok/s) step 8556/76294 | train loss 3.525571 | norm 0.1765 | lr 8.01e-04 | (3828.24 ms | 136953 tok/s) step 8557/76294 | train loss 3.433341 | norm 0.2249 | lr 8.01e-04 | (3820.35 ms | 137236 tok/s) step 8558/76294 | train loss 3.395676 | norm 0.2245 | lr 8.01e-04 | (3801.92 ms | 137901 tok/s) step 8559/76294 | train loss 3.451308 | norm 0.1766 | lr 8.01e-04 | (3794.57 ms | 138168 tok/s) step 8560/76294 | train loss 3.418310 | norm 0.2397 | lr 8.00e-04 | (3915.02 ms | 133917 tok/s) step 8561/76294 | train loss 3.417483 | norm 0.2108 | lr 8.00e-04 | (3804.64 ms | 137802 tok/s) step 8562/76294 | train loss 3.408369 | norm 0.2356 | lr 8.00e-04 | (3848.86 ms | 136219 tok/s) step 8563/76294 | train loss 3.502273 | norm 0.2127 | lr 8.00e-04 | (3796.14 ms | 138111 tok/s) step 8564/76294 | train loss 3.462635 | norm 0.2420 | lr 8.00e-04 | (3830.83 ms | 136860 tok/s) step 8565/76294 | train loss 3.446236 | norm 0.3492 | lr 8.00e-04 | (3798.12 ms | 138039 tok/s) step 8566/76294 | train loss 3.443340 | norm 0.2489 | lr 8.00e-04 | (3803.33 ms | 137850 tok/s) step 8567/76294 | train loss 3.488436 | norm 0.2657 | lr 8.00e-04 | (3824.11 ms | 137101 tok/s) step 8568/76294 | train loss 3.399823 | norm 0.2592 | lr 8.00e-04 | (3798.64 ms | 138020 tok/s) step 8569/76294 | train loss 3.438892 | norm 0.2144 | lr 8.00e-04 | (3883.11 ms | 135017 tok/s) step 8570/76294 | train loss 3.447722 | norm 0.2338 | lr 8.00e-04 | (3804.46 ms | 137809 tok/s) step 8571/76294 | train loss 3.503065 | norm 0.2042 | lr 8.00e-04 | (3851.29 ms | 136133 tok/s) step 8572/76294 | train loss 3.484318 | norm 0.2146 | lr 7.99e-04 | (3803.49 ms | 137844 tok/s) step 8573/76294 | train loss 3.500736 | norm 0.2049 | lr 7.99e-04 | (3831.86 ms | 136823 tok/s) step 8574/76294 | train loss 3.430140 | norm 0.2120 | lr 7.99e-04 
| (3799.86 ms | 137975 tok/s) step 8575/76294 | train loss 3.536509 | norm 0.2351 | lr 7.99e-04 | (3830.71 ms | 136864 tok/s) step 8576/76294 | train loss 3.479225 | norm 0.2090 | lr 7.99e-04 | (3826.01 ms | 137033 tok/s) step 8577/76294 | train loss 3.402640 | norm 0.2076 | lr 7.99e-04 | (3806.66 ms | 137729 tok/s) step 8578/76294 | train loss 3.493551 | norm 0.2635 | lr 7.99e-04 | (3824.07 ms | 137102 tok/s) step 8579/76294 | train loss 3.488816 | norm 0.2270 | lr 7.99e-04 | (3799.32 ms | 137995 tok/s) step 8580/76294 | train loss 3.481017 | norm 0.2138 | lr 7.99e-04 | (3830.22 ms | 136882 tok/s) step 8581/76294 | train loss 3.511056 | norm 0.2278 | lr 7.99e-04 | (3798.62 ms | 138021 tok/s) step 8582/76294 | train loss 3.484803 | norm 0.1932 | lr 7.99e-04 | (3860.44 ms | 135810 tok/s) step 8583/76294 | train loss 3.502387 | norm 0.2102 | lr 7.98e-04 | (3801.36 ms | 137921 tok/s) step 8584/76294 | train loss 3.486541 | norm 0.2484 | lr 7.98e-04 | (4130.94 ms | 126917 tok/s) step 8585/76294 | train loss 3.480156 | norm 0.2314 | lr 7.98e-04 | (3820.79 ms | 137220 tok/s) step 8586/76294 | train loss 3.402206 | norm 0.1771 | lr 7.98e-04 | (3810.46 ms | 137592 tok/s) step 8587/76294 | train loss 3.565197 | norm 0.2686 | lr 7.98e-04 | (3801.54 ms | 137915 tok/s) step 8588/76294 | train loss 3.448702 | norm 0.1860 | lr 7.98e-04 | (3831.82 ms | 136825 tok/s) step 8589/76294 | train loss 3.422090 | norm 0.1891 | lr 7.98e-04 | (3801.25 ms | 137925 tok/s) step 8590/76294 | train loss 3.415572 | norm 0.2023 | lr 7.98e-04 | (3926.60 ms | 133522 tok/s) step 8591/76294 | train loss 3.451147 | norm 0.2339 | lr 7.98e-04 | (3800.55 ms | 137950 tok/s) step 8592/76294 | train loss 3.409759 | norm 0.2081 | lr 7.98e-04 | (3828.83 ms | 136932 tok/s) step 8593/76294 | train loss 3.490433 | norm 0.1903 | lr 7.98e-04 | (3803.16 ms | 137856 tok/s) step 8594/76294 | train loss 3.504876 | norm 0.2026 | lr 7.98e-04 | (3806.02 ms | 137752 tok/s) step 8595/76294 | train loss 3.457179 | norm 
0.2019 | lr 7.97e-04 | (3827.74 ms | 136971 tok/s) step 8596/76294 | train loss 3.451595 | norm 0.1964 | lr 7.97e-04 | (3799.41 ms | 137992 tok/s) step 8597/76294 | train loss 3.498090 | norm 0.2799 | lr 7.97e-04 | (3804.79 ms | 137797 tok/s) step 8598/76294 | train loss 3.435317 | norm 0.2318 | lr 7.97e-04 | (3825.26 ms | 137060 tok/s) step 8599/76294 | train loss 3.488690 | norm 0.2458 | lr 7.97e-04 | (3803.10 ms | 137858 tok/s) step 8600/76294 | train loss 3.418027 | norm 0.3141 | lr 7.97e-04 | (3798.32 ms | 138032 tok/s) step 8601/76294 | train loss 3.537027 | norm 0.4376 | lr 7.97e-04 | (3832.38 ms | 136805 tok/s) step 8602/76294 | train loss 3.398717 | norm 0.2523 | lr 7.97e-04 | (3803.36 ms | 137849 tok/s) step 8603/76294 | train loss 3.530272 | norm 0.2512 | lr 7.97e-04 | (3825.62 ms | 137047 tok/s) step 8604/76294 | train loss 3.425876 | norm 0.2215 | lr 7.97e-04 | (3799.89 ms | 137974 tok/s) step 8605/76294 | train loss 3.459628 | norm 0.2438 | lr 7.97e-04 | (3809.74 ms | 137618 tok/s) step 8606/76294 | train loss 3.293851 | norm 0.2840 | lr 7.96e-04 | (3802.59 ms | 137877 tok/s) step 8607/76294 | train loss 3.461228 | norm 0.2892 | lr 7.96e-04 | (3828.07 ms | 136959 tok/s) step 8608/76294 | train loss 3.450310 | norm 0.2385 | lr 7.96e-04 | (3800.67 ms | 137946 tok/s) step 8609/76294 | train loss 3.467738 | norm 0.2722 | lr 7.96e-04 | (3856.26 ms | 135958 tok/s) step 8610/76294 | train loss 3.432187 | norm 0.2016 | lr 7.96e-04 | (3821.05 ms | 137210 tok/s) step 8611/76294 | train loss 3.498690 | norm 0.2221 | lr 7.96e-04 | (3846.08 ms | 136318 tok/s) step 8612/76294 | train loss 3.388405 | norm 0.2497 | lr 7.96e-04 | (3827.84 ms | 136967 tok/s) step 8613/76294 | train loss 3.473842 | norm 0.2261 | lr 7.96e-04 | (3801.44 ms | 137918 tok/s) step 8614/76294 | train loss 3.505443 | norm 0.2652 | lr 7.96e-04 | (3828.09 ms | 136958 tok/s) step 8615/76294 | train loss 3.467487 | norm 0.1996 | lr 7.96e-04 | (3801.26 ms | 137925 tok/s) step 8616/76294 | train loss 
3.491636 | norm 0.2550 | lr 7.96e-04 | (3844.45 ms | 136375 tok/s) step 8617/76294 | train loss 3.457262 | norm 0.1882 | lr 7.96e-04 | (3802.76 ms | 137870 tok/s) step 8618/76294 | train loss 3.463678 | norm 0.2018 | lr 7.95e-04 | (3834.77 ms | 136720 tok/s) step 8619/76294 | train loss 3.454253 | norm 0.2046 | lr 7.95e-04 | (3817.04 ms | 137355 tok/s) step 8620/76294 | train loss 3.457329 | norm 0.2151 | lr 7.95e-04 | (3804.51 ms | 137807 tok/s) step 8621/76294 | train loss 3.420888 | norm 0.2233 | lr 7.95e-04 | (3799.09 ms | 138004 tok/s) step 8622/76294 | train loss 3.420629 | norm 0.1944 | lr 7.95e-04 | (3857.97 ms | 135897 tok/s) step 8623/76294 | train loss 3.495053 | norm 0.2133 | lr 7.95e-04 | (3800.27 ms | 137961 tok/s) step 8624/76294 | train loss 3.563722 | norm 0.1981 | lr 7.95e-04 | (3965.55 ms | 132211 tok/s) step 8625/76294 | train loss 3.364972 | norm 0.2367 | lr 7.95e-04 | (3798.19 ms | 138036 tok/s) step 8626/76294 | train loss 3.436319 | norm 0.1700 | lr 7.95e-04 | (3806.91 ms | 137720 tok/s) step 8627/76294 | train loss 3.425258 | norm 0.2407 | lr 7.95e-04 | (3835.53 ms | 136693 tok/s) step 8628/76294 | train loss 3.385913 | norm 0.2031 | lr 7.95e-04 | (3807.50 ms | 137699 tok/s) step 8629/76294 | train loss 3.494106 | norm 0.2271 | lr 7.94e-04 | (3819.53 ms | 137265 tok/s) step 8630/76294 | train loss 3.348656 | norm 0.1850 | lr 7.94e-04 | (3803.68 ms | 137837 tok/s) step 8631/76294 | train loss 3.546761 | norm 0.2428 | lr 7.94e-04 | (3904.36 ms | 134283 tok/s) step 8632/76294 | train loss 3.375532 | norm 0.2017 | lr 7.94e-04 | (3799.48 ms | 137990 tok/s) step 8633/76294 | train loss 3.457185 | norm 0.2090 | lr 7.94e-04 | (3836.22 ms | 136668 tok/s) step 8634/76294 | train loss 3.419931 | norm 0.2933 | lr 7.94e-04 | (3798.92 ms | 138010 tok/s) step 8635/76294 | train loss 3.486540 | norm 0.1982 | lr 7.94e-04 | (3827.13 ms | 136993 tok/s) step 8636/76294 | train loss 3.421862 | norm 0.2292 | lr 7.94e-04 | (3798.28 ms | 138033 tok/s) step 
8637/76294 | train loss 3.375524 | norm 0.1888 | lr 7.94e-04 | (3827.37 ms | 136984 tok/s) step 8638/76294 | train loss 3.411291 | norm 0.1817 | lr 7.94e-04 | (3804.45 ms | 137809 tok/s) step 8639/76294 | train loss 3.464177 | norm 0.2209 | lr 7.94e-04 | (3849.10 ms | 136210 tok/s) step 8640/76294 | train loss 3.402691 | norm 0.2211 | lr 7.93e-04 | (3805.99 ms | 137753 tok/s) step 8641/76294 | train loss 3.432670 | norm 0.2309 | lr 7.93e-04 | (3805.28 ms | 137779 tok/s) step 8642/76294 | train loss 3.436564 | norm 0.2400 | lr 7.93e-04 | (3821.58 ms | 137191 tok/s) step 8643/76294 | train loss 3.473273 | norm 0.2242 | lr 7.93e-04 | (4238.32 ms | 123702 tok/s) step 8644/76294 | train loss 3.518826 | norm 0.1814 | lr 7.93e-04 | (3854.62 ms | 136016 tok/s) step 8645/76294 | train loss 3.354674 | norm 0.2189 | lr 7.93e-04 | (3800.02 ms | 137970 tok/s) step 8646/76294 | train loss 3.484916 | norm 0.2478 | lr 7.93e-04 | (3826.16 ms | 137027 tok/s) step 8647/76294 | train loss 3.390409 | norm 0.2073 | lr 7.93e-04 | (3804.76 ms | 137798 tok/s) step 8648/76294 | train loss 3.459244 | norm 0.2501 | lr 7.93e-04 | (3917.05 ms | 133847 tok/s) step 8649/76294 | train loss 3.389693 | norm 0.1966 | lr 7.93e-04 | (3800.18 ms | 137964 tok/s) step 8650/76294 | train loss 3.466026 | norm 0.3653 | lr 7.93e-04 | (3875.29 ms | 135290 tok/s) step 8651/76294 | train loss 3.428064 | norm 0.3598 | lr 7.93e-04 | (3801.07 ms | 137932 tok/s) step 8652/76294 | train loss 3.381905 | norm 0.2354 | lr 7.92e-04 | (3905.86 ms | 134231 tok/s) step 8653/76294 | train loss 3.398617 | norm 0.4515 | lr 7.92e-04 | (3851.05 ms | 136142 tok/s) step 8654/76294 | train loss 3.452441 | norm 0.2413 | lr 7.92e-04 | (3807.01 ms | 137717 tok/s) step 8655/76294 | train loss 3.524133 | norm 0.2209 | lr 7.92e-04 | (3820.91 ms | 137215 tok/s) step 8656/76294 | train loss 3.470451 | norm 0.1942 | lr 7.92e-04 | (3802.24 ms | 137889 tok/s) step 8657/76294 | train loss 3.455063 | norm 0.2003 | lr 7.92e-04 | (3800.94 ms | 
137936 tok/s) step 8658/76294 | train loss 3.433744 | norm 0.1795 | lr 7.92e-04 | (3826.75 ms | 137006 tok/s) step 8659/76294 | train loss 3.394062 | norm 0.2248 | lr 7.92e-04 | (3802.96 ms | 137863 tok/s) step 8660/76294 | train loss 3.427608 | norm 0.2002 | lr 7.92e-04 | (3805.44 ms | 137773 tok/s) step 8661/76294 | train loss 3.418478 | norm 0.1921 | lr 7.92e-04 | (3830.92 ms | 136857 tok/s) step 8662/76294 | train loss 3.478016 | norm 0.1934 | lr 7.92e-04 | (3830.26 ms | 136881 tok/s) step 8663/76294 | train loss 3.403740 | norm 0.1993 | lr 7.91e-04 | (3807.19 ms | 137710 tok/s) step 8664/76294 | train loss 3.471356 | norm 0.2020 | lr 7.91e-04 | (3862.22 ms | 135748 tok/s) step 8665/76294 | train loss 3.373653 | norm 0.1898 | lr 7.91e-04 | (3802.16 ms | 137892 tok/s) step 8666/76294 | train loss 3.503464 | norm 0.1756 | lr 7.91e-04 | (3805.70 ms | 137764 tok/s) step 8667/76294 | train loss 3.685638 | norm 0.2192 | lr 7.91e-04 | (3832.79 ms | 136790 tok/s) step 8668/76294 | train loss 3.521548 | norm 0.1912 | lr 7.91e-04 | (3805.82 ms | 137759 tok/s) step 8669/76294 | train loss 3.437956 | norm 0.2060 | lr 7.91e-04 | (3825.61 ms | 137047 tok/s) step 8670/76294 | train loss 3.486141 | norm 0.2187 | lr 7.91e-04 | (3806.31 ms | 137742 tok/s) step 8671/76294 | train loss 3.456737 | norm 0.2332 | lr 7.91e-04 | (3896.35 ms | 134559 tok/s) step 8672/76294 | train loss 3.466292 | norm 0.2457 | lr 7.91e-04 | (3810.94 ms | 137575 tok/s) step 8673/76294 | train loss 3.376462 | norm 0.2222 | lr 7.91e-04 | (3812.72 ms | 137510 tok/s) step 8674/76294 | train loss 3.495058 | norm 0.2728 | lr 7.91e-04 | (3831.23 ms | 136846 tok/s) step 8675/76294 | train loss 3.438161 | norm 0.2061 | lr 7.90e-04 | (5386.06 ms | 97342 tok/s) step 8676/76294 | train loss 3.473478 | norm 0.2709 | lr 7.90e-04 | (5424.57 ms | 96651 tok/s) step 8677/76294 | train loss 3.424728 | norm 0.2212 | lr 7.90e-04 | (3814.55 ms | 137444 tok/s) step 8678/76294 | train loss 3.367594 | norm 0.2330 | lr 7.90e-04 | 
(3812.72 ms | 137510 tok/s) step 8679/76294 | train loss 3.465584 | norm 0.2465 | lr 7.90e-04 | (3833.45 ms | 136767 tok/s) step 8680/76294 | train loss 3.389692 | norm 0.2300 | lr 7.90e-04 | (3811.74 ms | 137545 tok/s) step 8681/76294 | train loss 3.460399 | norm 0.2615 | lr 7.90e-04 | (3805.89 ms | 137757 tok/s) step 8682/76294 | train loss 3.437045 | norm 0.1916 | lr 7.90e-04 | (3870.79 ms | 135447 tok/s) step 8683/76294 | train loss 3.443975 | norm 0.2121 | lr 7.90e-04 | (3805.81 ms | 137760 tok/s) step 8684/76294 | train loss 3.423150 | norm 0.1739 | lr 7.90e-04 | (3843.20 ms | 136420 tok/s) step 8685/76294 | train loss 3.410810 | norm 0.2027 | lr 7.90e-04 | (3824.36 ms | 137092 tok/s) step 8686/76294 | train loss 3.418102 | norm 0.2119 | lr 7.89e-04 | (3809.38 ms | 137631 tok/s) step 8687/76294 | train loss 3.449145 | norm 0.2011 | lr 7.89e-04 | (3820.94 ms | 137215 tok/s) step 8688/76294 | train loss 3.395115 | norm 0.1886 | lr 7.89e-04 | (3802.75 ms | 137871 tok/s) step 8689/76294 | train loss 3.454431 | norm 0.2593 | lr 7.89e-04 | (3809.90 ms | 137612 tok/s) step 8690/76294 | train loss 3.429388 | norm 0.2126 | lr 7.89e-04 | (3830.79 ms | 136862 tok/s) step 8691/76294 | train loss 3.493246 | norm 0.2482 | lr 7.89e-04 | (3798.92 ms | 138010 tok/s) step 8692/76294 | train loss 3.497844 | norm 0.2652 | lr 7.89e-04 | (3838.25 ms | 136596 tok/s) step 8693/76294 | train loss 3.434176 | norm 0.2418 | lr 7.89e-04 | (3888.53 ms | 134830 tok/s) step 8694/76294 | train loss 3.488018 | norm 0.2252 | lr 7.89e-04 | (3799.36 ms | 137994 tok/s) step 8695/76294 | train loss 3.472402 | norm 0.2252 | lr 7.89e-04 | (3822.74 ms | 137150 tok/s) step 8696/76294 | train loss 3.498301 | norm 0.2132 | lr 7.89e-04 | (3798.59 ms | 138022 tok/s) step 8697/76294 | train loss 3.341983 | norm 0.2367 | lr 7.89e-04 | (3832.03 ms | 136817 tok/s) step 8698/76294 | train loss 3.442539 | norm 0.2357 | lr 7.88e-04 | (3801.99 ms | 137898 tok/s) step 8699/76294 | train loss 3.532498 | norm 0.2345 
| lr 7.88e-04 | (3848.11 ms | 136246 tok/s) step 8700/76294 | train loss 3.463318 | norm 0.2161 | lr 7.88e-04 | (3801.55 ms | 137914 tok/s) step 8701/76294 | train loss 3.420750 | norm 0.2113 | lr 7.88e-04 | (3835.14 ms | 136706 tok/s) step 8702/76294 | train loss 3.490417 | norm 0.2628 | lr 7.88e-04 | (3827.02 ms | 136996 tok/s) step 8703/76294 | train loss 3.450556 | norm 0.2379 | lr 7.88e-04 | (3811.91 ms | 137539 tok/s) step 8704/76294 | train loss 3.500475 | norm 0.2699 | lr 7.88e-04 | (3829.14 ms | 136921 tok/s) step 8705/76294 | train loss 3.450599 | norm 0.2960 | lr 7.88e-04 | (3807.55 ms | 137697 tok/s) step 8706/76294 | train loss 3.405813 | norm 0.1863 | lr 7.88e-04 | (3829.52 ms | 136907 tok/s) step 8707/76294 | train loss 3.469455 | norm 0.2734 | lr 7.88e-04 | (3804.43 ms | 137810 tok/s) step 8708/76294 | train loss 3.409657 | norm 0.3282 | lr 7.88e-04 | (3800.12 ms | 137966 tok/s) step 8709/76294 | train loss 3.454446 | norm 0.2076 | lr 7.87e-04 | (3844.14 ms | 136386 tok/s) step 8710/76294 | train loss 3.484073 | norm 0.3627 | lr 7.87e-04 | (3805.11 ms | 137785 tok/s) step 8711/76294 | train loss 3.555482 | norm 0.2027 | lr 7.87e-04 | (3809.94 ms | 137611 tok/s) step 8712/76294 | train loss 3.465583 | norm 0.2108 | lr 7.87e-04 | (4010.71 ms | 130722 tok/s) step 8713/76294 | train loss 3.494962 | norm 0.2169 | lr 7.87e-04 | (3801.29 ms | 137924 tok/s) step 8714/76294 | train loss 3.396575 | norm 0.1823 | lr 7.87e-04 | (3829.07 ms | 136923 tok/s) step 8715/76294 | train loss 3.462737 | norm 0.2250 | lr 7.87e-04 | (3804.31 ms | 137814 tok/s) step 8716/76294 | train loss 3.511976 | norm 0.2043 | lr 7.87e-04 | (3810.71 ms | 137583 tok/s) step 8717/76294 | train loss 3.424932 | norm 0.2623 | lr 7.87e-04 | (3831.65 ms | 136831 tok/s) step 8718/76294 | train loss 3.396905 | norm 0.1895 | lr 7.87e-04 | (3810.75 ms | 137581 tok/s) step 8719/76294 | train loss 3.446843 | norm 0.2217 | lr 7.87e-04 | (3807.16 ms | 137711 tok/s) step 8720/76294 | train loss 
3.441200 | norm 0.1985 | lr 7.87e-04 | (3835.55 ms | 136692 tok/s) step 8721/76294 | train loss 3.441747 | norm 0.2014 | lr 7.86e-04 | (3805.33 ms | 137777 tok/s) step 8722/76294 | train loss 3.386084 | norm 0.2148 | lr 7.86e-04 | (3833.24 ms | 136774 tok/s) step 8723/76294 | train loss 3.366581 | norm 0.1826 | lr 7.86e-04 | (3810.01 ms | 137608 tok/s) step 8724/76294 | train loss 3.490613 | norm 0.2540 | lr 7.86e-04 | (3867.46 ms | 135564 tok/s) step 8725/76294 | train loss 3.387598 | norm 0.1936 | lr 7.86e-04 | (3827.49 ms | 136979 tok/s) step 8726/76294 | train loss 3.504988 | norm 0.2708 | lr 7.86e-04 | (3814.26 ms | 137455 tok/s) step 8727/76294 | train loss 3.383620 | norm 0.2079 | lr 7.86e-04 | (3833.97 ms | 136748 tok/s) step 8728/76294 | train loss 3.479960 | norm 0.1784 | lr 7.86e-04 | (3810.73 ms | 137582 tok/s) step 8729/76294 | train loss 3.477769 | norm 0.1952 | lr 7.86e-04 | (3807.95 ms | 137683 tok/s) step 8730/76294 | train loss 3.497481 | norm 0.2190 | lr 7.86e-04 | (3888.18 ms | 134842 tok/s) step 8731/76294 | train loss 3.440464 | norm 0.1924 | lr 7.86e-04 | (3806.03 ms | 137752 tok/s) step 8732/76294 | train loss 3.522599 | norm 0.1832 | lr 7.85e-04 | (3907.70 ms | 134168 tok/s) step 8733/76294 | train loss 3.453209 | norm 0.2202 | lr 7.85e-04 | (3806.74 ms | 137726 tok/s) step 8734/76294 | train loss 3.524233 | norm 0.2379 | lr 7.85e-04 | (3848.02 ms | 136249 tok/s) step 8735/76294 | train loss 3.468470 | norm 0.2293 | lr 7.85e-04 | (3809.33 ms | 137633 tok/s) step 8736/76294 | train loss 3.376285 | norm 0.1970 | lr 7.85e-04 | (3814.93 ms | 137431 tok/s) step 8737/76294 | train loss 3.452978 | norm 0.2270 | lr 7.85e-04 | (3828.60 ms | 136940 tok/s) step 8738/76294 | train loss 3.606033 | norm 0.2334 | lr 7.85e-04 | (3816.74 ms | 137365 tok/s) step 8739/76294 | train loss 3.433455 | norm 0.2469 | lr 7.85e-04 | (3826.86 ms | 137002 tok/s) step 8740/76294 | train loss 3.412248 | norm 0.2486 | lr 7.85e-04 | (3811.41 ms | 137558 tok/s) step 
8741/76294 | train loss 3.467061 | norm 0.2480 | lr 7.85e-04 | (3803.85 ms | 137831 tok/s) step 8742/76294 | train loss 3.439999 | norm 0.1945 | lr 7.85e-04 | (3852.78 ms | 136080 tok/s) step 8743/76294 | train loss 3.478639 | norm 0.2070 | lr 7.84e-04 | (3804.59 ms | 137804 tok/s) step 8744/76294 | train loss 3.452347 | norm 0.2260 | lr 7.84e-04 | (3838.40 ms | 136590 tok/s) step 8745/76294 | train loss 3.567111 | norm 0.2170 | lr 7.84e-04 | (3805.81 ms | 137760 tok/s) step 8746/76294 | train loss 3.448268 | norm 0.2438 | lr 7.84e-04 | (3815.85 ms | 137397 tok/s) step 8747/76294 | train loss 3.384512 | norm 0.2242 | lr 7.84e-04 | (3811.56 ms | 137552 tok/s) step 8748/76294 | train loss 3.445485 | norm 0.2119 | lr 7.84e-04 | (3812.11 ms | 137532 tok/s) step 8749/76294 | train loss 3.435330 | norm 0.2175 | lr 7.84e-04 | (3827.94 ms | 136963 tok/s) step 8750/76294 | train loss 3.457013 | norm 0.2208 | lr 7.84e-04 | (3813.25 ms | 137491 tok/s)
val loss: 3.455015
saving model checkpoint to ./results/gpt2-124M-gqa/step_8750.pth
step 8751/76294 | train loss 3.485190 | norm 0.1995 | lr 7.84e-04 | (3833.33 ms | 136771 tok/s) step 8752/76294 | train loss 3.482118 | norm 0.2095 | lr 7.84e-04 | (3804.10 ms | 137822 tok/s) step 8753/76294 | train loss 3.492556 | norm 0.1768 | lr 7.84e-04 | (3816.38 ms | 137378 tok/s) step 8754/76294 | train loss 3.441736 | norm 0.2212 | lr 7.84e-04 | (3851.52 ms | 136125 tok/s) step 8755/76294 | train loss 3.533710 | norm 0.2087 | lr 7.83e-04 | (3835.08 ms | 136709 tok/s) step 8756/76294 | train loss 3.423144 | norm 0.2524 | lr 7.83e-04 | (3807.72 ms | 137691 tok/s) step 8757/76294 | train loss 3.463590 | norm 0.2980 | lr 7.83e-04 | (3852.18 ms | 136102 tok/s) step 8758/76294 | train loss 3.443972 | norm 0.2066 | lr 7.83e-04 | (3804.78 ms | 137797 tok/s) step 8759/76294 | train loss 3.490275 | norm 0.1828 | lr 7.83e-04 | (3805.70 ms | 137764 tok/s) step 8760/76294 | train loss 3.444236 | norm 0.2356 | lr 7.83e-04 | (3822.40 ms | 137162 tok/s) 
step 8761/76294 | train loss 3.572420 | norm 0.1801 | lr 7.83e-04 | (3817.61 ms | 137334 tok/s) step 8762/76294 | train loss 3.477092 | norm 0.2234 | lr 7.83e-04 | (3802.47 ms | 137881 tok/s) step 8763/76294 | train loss 3.618097 | norm 0.2308 | lr 7.83e-04 | (3828.03 ms | 136960 tok/s) step 8764/76294 | train loss 3.407145 | norm 0.1878 | lr 7.83e-04 | (3806.65 ms | 137730 tok/s) step 8765/76294 | train loss 3.443930 | norm 0.2257 | lr 7.83e-04 | (3827.89 ms | 136965 tok/s) step 8766/76294 | train loss 3.526291 | norm 0.2099 | lr 7.82e-04 | (3830.88 ms | 136858 tok/s) step 8767/76294 | train loss 3.400092 | norm 0.4042 | lr 7.82e-04 | (3809.90 ms | 137612 tok/s) step 8768/76294 | train loss 3.441561 | norm 0.3363 | lr 7.82e-04 | (3807.10 ms | 137713 tok/s) step 8769/76294 | train loss 3.424501 | norm 0.2521 | lr 7.82e-04 | (3832.00 ms | 136818 tok/s) step 8770/76294 | train loss 3.475693 | norm 0.4297 | lr 7.82e-04 | (3804.09 ms | 137822 tok/s) step 8771/76294 | train loss 3.425678 | norm 0.2336 | lr 7.82e-04 | (3808.24 ms | 137672 tok/s) step 8772/76294 | train loss 3.481808 | norm 0.2352 | lr 7.82e-04 | (3830.24 ms | 136881 tok/s) step 8773/76294 | train loss 3.450097 | norm 0.1855 | lr 7.82e-04 | (3905.62 ms | 134239 tok/s) step 8774/76294 | train loss 3.468802 | norm 0.2592 | lr 7.82e-04 | (4784.69 ms | 109576 tok/s) step 8775/76294 | train loss 3.420038 | norm 0.2076 | lr 7.82e-04 | (3831.54 ms | 136835 tok/s) step 8776/76294 | train loss 3.441139 | norm 0.2190 | lr 7.82e-04 | (3811.73 ms | 137546 tok/s) step 8777/76294 | train loss 3.502563 | norm 0.2211 | lr 7.82e-04 | (3811.05 ms | 137570 tok/s) step 8778/76294 | train loss 3.446104 | norm 0.1959 | lr 7.81e-04 | (3809.82 ms | 137615 tok/s) step 8779/76294 | train loss 3.529418 | norm 0.2190 | lr 7.81e-04 | (3802.04 ms | 137897 tok/s) step 8780/76294 | train loss 3.558647 | norm 0.1888 | lr 7.81e-04 | (3904.28 ms | 134285 tok/s) step 8781/76294 | train loss 3.522109 | norm 0.1917 | lr 7.81e-04 | (3804.06 ms 
| 137823 tok/s) step 8782/76294 | train loss 3.484923 | norm 0.1844 | lr 7.81e-04 | (3842.48 ms | 136445 tok/s) step 8783/76294 | train loss 3.511127 | norm 0.1881 | lr 7.81e-04 | (3806.80 ms | 137724 tok/s) step 8784/76294 | train loss 3.399124 | norm 0.2082 | lr 7.81e-04 | (3808.76 ms | 137653 tok/s) step 8785/76294 | train loss 3.553968 | norm 0.2195 | lr 7.81e-04 | (3837.22 ms | 136632 tok/s) step 8786/76294 | train loss 3.439387 | norm 0.1857 | lr 7.81e-04 | (3809.00 ms | 137645 tok/s) step 8787/76294 | train loss 3.447278 | norm 0.2212 | lr 7.81e-04 | (3832.35 ms | 136806 tok/s) step 8788/76294 | train loss 3.465077 | norm 0.2040 | lr 7.81e-04 | (3810.86 ms | 137577 tok/s) step 8789/76294 | train loss 3.820519 | norm 0.2170 | lr 7.80e-04 | (3805.91 ms | 137756 tok/s) step 8790/76294 | train loss 3.539881 | norm 0.2168 | lr 7.80e-04 | (3834.33 ms | 136735 tok/s) step 8791/76294 | train loss 3.536431 | norm 0.2772 | lr 7.80e-04 | (3804.81 ms | 137796 tok/s) step 8792/76294 | train loss 3.440652 | norm 0.2832 | lr 7.80e-04 | (3805.18 ms | 137783 tok/s) step 8793/76294 | train loss 3.472085 | norm 0.2379 | lr 7.80e-04 | (3925.06 ms | 133575 tok/s) step 8794/76294 | train loss 3.453452 | norm 0.1974 | lr 7.80e-04 | (3799.09 ms | 138003 tok/s) step 8795/76294 | train loss 3.484652 | norm 0.2368 | lr 7.80e-04 | (3839.37 ms | 136556 tok/s) step 8796/76294 | train loss 3.439589 | norm 0.2205 | lr 7.80e-04 | (3805.43 ms | 137774 tok/s) step 8797/76294 | train loss 3.507414 | norm 0.2109 | lr 7.80e-04 | (3809.13 ms | 137640 tok/s) step 8798/76294 | train loss 3.520372 | norm 0.2496 | lr 7.80e-04 | (3835.31 ms | 136700 tok/s) step 8799/76294 | train loss 3.383883 | norm 0.2270 | lr 7.80e-04 | (4075.59 ms | 128641 tok/s) step 8800/76294 | train loss 3.566557 | norm 0.2306 | lr 7.79e-04 | (3803.44 ms | 137846 tok/s) step 8801/76294 | train loss 3.496233 | norm 0.1819 | lr 7.79e-04 | (3877.26 ms | 135221 tok/s) step 8802/76294 | train loss 3.439166 | norm 0.2242 | lr 
7.79e-04 | (3803.92 ms | 137828 tok/s) step 8803/76294 | train loss 3.489126 | norm 0.1964 | lr 7.79e-04 | (3829.91 ms | 136893 tok/s) step 8804/76294 | train loss 3.440171 | norm 0.2082 | lr 7.79e-04 | (3854.03 ms | 136036 tok/s) step 8805/76294 | train loss 3.394247 | norm 0.2132 | lr 7.79e-04 | (3808.80 ms | 137652 tok/s) step 8806/76294 | train loss 3.479034 | norm 0.4861 | lr 7.79e-04 | (3828.62 ms | 136939 tok/s) step 8807/76294 | train loss 3.448667 | norm 0.3793 | lr 7.79e-04 | (3822.23 ms | 137168 tok/s) step 8808/76294 | train loss 3.400568 | norm 0.3203 | lr 7.79e-04 | (3803.06 ms | 137859 tok/s) step 8809/76294 | train loss 3.435571 | norm 0.2650 | lr 7.79e-04 | (3843.93 ms | 136394 tok/s) step 8810/76294 | train loss 3.451802 | norm 0.2010 | lr 7.79e-04 | (3807.80 ms | 137688 tok/s) step 8811/76294 | train loss 3.461531 | norm 0.2214 | lr 7.79e-04 | (3827.49 ms | 136979 tok/s) step 8812/76294 | train loss 3.421770 | norm 0.2117 | lr 7.78e-04 | (3833.10 ms | 136779 tok/s) step 8813/76294 | train loss 3.438427 | norm 0.2765 | lr 7.78e-04 | (3807.97 ms | 137682 tok/s) step 8814/76294 | train loss 3.468578 | norm 0.2238 | lr 7.78e-04 | (3886.58 ms | 134897 tok/s) step 8815/76294 | train loss 3.519970 | norm 0.3521 | lr 7.78e-04 | (3806.91 ms | 137720 tok/s) step 8816/76294 | train loss 3.485881 | norm 0.2168 | lr 7.78e-04 | (3867.17 ms | 135574 tok/s) step 8817/76294 | train loss 3.451290 | norm 0.2151 | lr 7.78e-04 | (3804.96 ms | 137791 tok/s) step 8818/76294 | train loss 3.467904 | norm 0.2001 | lr 7.78e-04 | (3813.05 ms | 137498 tok/s) step 8819/76294 | train loss 3.490565 | norm 0.1865 | lr 7.78e-04 | (3805.88 ms | 137757 tok/s) step 8820/76294 | train loss 3.471159 | norm 0.2082 | lr 7.78e-04 | (3887.42 ms | 134868 tok/s) step 8821/76294 | train loss 3.484684 | norm 0.2755 | lr 7.78e-04 | (3804.97 ms | 137790 tok/s) step 8822/76294 | train loss 3.538669 | norm 0.2543 | lr 7.78e-04 | (3835.11 ms | 136707 tok/s) step 8823/76294 | train loss 3.491067 | 
norm 0.2347 | lr 7.77e-04 | (3828.69 ms | 136937 tok/s) step 8824/76294 | train loss 3.630418 | norm 0.2454 | lr 7.77e-04 | (3805.35 ms | 137777 tok/s) step 8825/76294 | train loss 3.449840 | norm 0.2140 | lr 7.77e-04 | (4000.39 ms | 131059 tok/s) step 8826/76294 | train loss 3.571195 | norm 0.2120 | lr 7.77e-04 | (3835.09 ms | 136708 tok/s) step 8827/76294 | train loss 3.463757 | norm 0.1714 | lr 7.77e-04 | (3801.74 ms | 137907 tok/s) step 8828/76294 | train loss 3.503625 | norm 0.1977 | lr 7.77e-04 | (3953.82 ms | 132603 tok/s) step 8829/76294 | train loss 3.504404 | norm 0.2155 | lr 7.77e-04 | (3805.64 ms | 137766 tok/s) step 8830/76294 | train loss 3.462443 | norm 0.2727 | lr 7.77e-04 | (3815.53 ms | 137409 tok/s) step 8831/76294 | train loss 3.419554 | norm 0.2097 | lr 7.77e-04 | (3824.27 ms | 137095 tok/s) step 8832/76294 | train loss 3.432661 | norm 0.2455 | lr 7.77e-04 | (3829.88 ms | 136894 tok/s) step 8833/76294 | train loss 3.513296 | norm 0.2476 | lr 7.77e-04 | (3806.54 ms | 137733 tok/s) step 8834/76294 | train loss 3.436980 | norm 0.2203 | lr 7.77e-04 | (3869.11 ms | 135506 tok/s) step 8835/76294 | train loss 3.450858 | norm 0.2928 | lr 7.76e-04 | (3805.49 ms | 137772 tok/s) step 8836/76294 | train loss 3.450457 | norm 0.2067 | lr 7.76e-04 | (3835.37 ms | 136698 tok/s) step 8837/76294 | train loss 3.427329 | norm 0.1954 | lr 7.76e-04 | (3827.15 ms | 136992 tok/s) step 8838/76294 | train loss 3.440167 | norm 0.2605 | lr 7.76e-04 | (3822.26 ms | 137167 tok/s) step 8839/76294 | train loss 3.516497 | norm 0.2325 | lr 7.76e-04 | (3804.58 ms | 137804 tok/s) step 8840/76294 | train loss 3.446548 | norm 0.2667 | lr 7.76e-04 | (3839.22 ms | 136561 tok/s) step 8841/76294 | train loss 3.623034 | norm 0.2723 | lr 7.76e-04 | (3803.01 ms | 137861 tok/s) step 8842/76294 | train loss 3.442082 | norm 0.2658 | lr 7.76e-04 | (3810.13 ms | 137604 tok/s) step 8843/76294 | train loss 3.438262 | norm 0.2323 | lr 7.76e-04 | (3932.94 ms | 133307 tok/s) step 8844/76294 | train 
loss 3.480618 | norm 0.2436 | lr 7.76e-04 | (3821.45 ms | 137196 tok/s) step 8845/76294 | train loss 3.453711 | norm 0.1963 | lr 7.76e-04 | (3803.81 ms | 137832 tok/s) step 8846/76294 | train loss 3.494802 | norm 0.2006 | lr 7.75e-04 | (3838.54 ms | 136585 tok/s) step 8847/76294 | train loss 3.433323 | norm 0.2100 | lr 7.75e-04 | (3804.59 ms | 137804 tok/s) step 8848/76294 | train loss 3.499239 | norm 0.2243 | lr 7.75e-04 | (3808.74 ms | 137654 tok/s) step 8849/76294 | train loss 3.444573 | norm 0.1963 | lr 7.75e-04 | (3828.82 ms | 136932 tok/s) step 8850/76294 | train loss 3.464221 | norm 0.2251 | lr 7.75e-04 | (3813.86 ms | 137469 tok/s) step 8851/76294 | train loss 3.464670 | norm 0.2032 | lr 7.75e-04 | (3811.36 ms | 137559 tok/s) step 8852/76294 | train loss 3.470997 | norm 0.2309 | lr 7.75e-04 | (3840.30 ms | 136523 tok/s) step 8853/76294 | train loss 3.442440 | norm 0.2472 | lr 7.75e-04 | (3805.23 ms | 137781 tok/s) step 8854/76294 | train loss 3.486991 | norm 0.2356 | lr 7.75e-04 | (10716.92 ms | 48922 tok/s) step 8855/76294 | train loss 3.457400 | norm 0.3405 | lr 7.75e-04 | (3943.65 ms | 132945 tok/s) step 8856/76294 | train loss 3.474272 | norm 0.2860 | lr 7.75e-04 | (3791.11 ms | 138294 tok/s) step 8857/76294 | train loss 3.545556 | norm 0.2346 | lr 7.74e-04 | (3812.53 ms | 137517 tok/s) step 8858/76294 | train loss 3.496432 | norm 0.2943 | lr 7.74e-04 | (4209.61 ms | 124546 tok/s) step 8859/76294 | train loss 3.449042 | norm 0.2054 | lr 7.74e-04 | (3799.00 ms | 138007 tok/s) step 8860/76294 | train loss 3.536352 | norm 0.3007 | lr 7.74e-04 | (3795.34 ms | 138140 tok/s) step 8861/76294 | train loss 3.480283 | norm 0.2042 | lr 7.74e-04 | (3827.42 ms | 136982 tok/s) step 8862/76294 | train loss 3.405387 | norm 0.2415 | lr 7.74e-04 | (3809.35 ms | 137632 tok/s) step 8863/76294 | train loss 3.468764 | norm 0.2163 | lr 7.74e-04 | (3801.54 ms | 137914 tok/s) step 8864/76294 | train loss 3.501683 | norm 0.2297 | lr 7.74e-04 | (3796.33 ms | 138104 tok/s) step 
8865/76294 | train loss 3.483616 | norm 0.2524 | lr 7.74e-04 | (3793.97 ms | 138190 tok/s) step 8866/76294 | train loss 3.445959 | norm 0.3387 | lr 7.74e-04 | (3845.14 ms | 136351 tok/s) step 8867/76294 | train loss 3.489820 | norm 0.2079 | lr 7.74e-04 | (3798.35 ms | 138030 tok/s) step 8868/76294 | train loss 3.459125 | norm 0.3070 | lr 7.74e-04 | (3827.90 ms | 136965 tok/s) step 8869/76294 | train loss 3.493530 | norm 0.1987 | lr 7.73e-04 | (3866.60 ms | 135594 tok/s) step 8870/76294 | train loss 3.758498 | norm 0.2031 | lr 7.73e-04 | (3801.76 ms | 137907 tok/s) step 8871/76294 | train loss 3.478528 | norm 0.2313 | lr 7.73e-04 | (3792.96 ms | 138226 tok/s) step 8872/76294 | train loss 3.517435 | norm 0.2119 | lr 7.73e-04 | (3836.63 ms | 136653 tok/s) step 8873/76294 | train loss 3.316262 | norm 0.2910 | lr 7.73e-04 | (3796.67 ms | 138092 tok/s) step 8874/76294 | train loss 3.472586 | norm 0.2066 | lr 7.73e-04 | (3824.50 ms | 137087 tok/s) step 8875/76294 | train loss 3.458604 | norm 0.2577 | lr 7.73e-04 | (3796.61 ms | 138094 tok/s) step 8876/76294 | train loss 3.415740 | norm 0.3071 | lr 7.73e-04 | (3818.58 ms | 137299 tok/s) step 8877/76294 | train loss 3.499411 | norm 0.1876 | lr 7.73e-04 | (3794.84 ms | 138158 tok/s) step 8878/76294 | train loss 3.495098 | norm 0.2885 | lr 7.73e-04 | (3848.89 ms | 136218 tok/s) step 8879/76294 | train loss 3.431019 | norm 0.1938 | lr 7.73e-04 | (3797.94 ms | 138045 tok/s) step 8880/76294 | train loss 3.506879 | norm 0.2090 | lr 7.72e-04 | (3804.10 ms | 137822 tok/s) step 8881/76294 | train loss 3.448629 | norm 0.2327 | lr 7.72e-04 | (3823.95 ms | 137106 tok/s) step 8882/76294 | train loss 3.552086 | norm 0.2458 | lr 7.72e-04 | (3883.89 ms | 134991 tok/s) step 8883/76294 | train loss 3.483030 | norm 0.2945 | lr 7.72e-04 | (3796.35 ms | 138103 tok/s) step 8884/76294 | train loss 3.538984 | norm 0.2508 | lr 7.72e-04 | (3802.40 ms | 137883 tok/s) step 8885/76294 | train loss 3.370246 | norm 0.4030 | lr 7.72e-04 | (3823.42 ms | 
137126 tok/s) step 8886/76294 | train loss 3.498318 | norm 0.1924 | lr 7.72e-04 | (3827.87 ms | 136966 tok/s) step 8887/76294 | train loss 3.456398 | norm 0.2908 | lr 7.72e-04 | (3837.13 ms | 136635 tok/s) step 8888/76294 | train loss 3.428515 | norm 0.2099 | lr 7.72e-04 | (3933.80 ms | 133278 tok/s) step 8889/76294 | train loss 3.483044 | norm 0.2902 | lr 7.72e-04 | (3819.05 ms | 137282 tok/s) step 8890/76294 | train loss 3.453836 | norm 0.2219 | lr 7.72e-04 | (3798.24 ms | 138035 tok/s) step 8891/76294 | train loss 3.521321 | norm 0.2640 | lr 7.71e-04 | (3795.23 ms | 138144 tok/s) step 8892/76294 | train loss 3.446864 | norm 0.2469 | lr 7.71e-04 | (3831.05 ms | 136852 tok/s) step 8893/76294 | train loss 3.416259 | norm 0.2158 | lr 7.71e-04 | (3796.33 ms | 138104 tok/s) step 8894/76294 | train loss 3.471616 | norm 0.2201 | lr 7.71e-04 | (3818.51 ms | 137302 tok/s) step 8895/76294 | train loss 3.484941 | norm 0.2166 | lr 7.71e-04 | (3830.18 ms | 136883 tok/s) step 8896/76294 | train loss 3.458559 | norm 0.1988 | lr 7.71e-04 | (3802.76 ms | 137870 tok/s) step 8897/76294 | train loss 3.518075 | norm 0.1888 | lr 7.71e-04 | (3818.77 ms | 137293 tok/s) step 8898/76294 | train loss 3.472922 | norm 0.1819 | lr 7.71e-04 | (3803.40 ms | 137847 tok/s) step 8899/76294 | train loss 3.481207 | norm 0.1834 | lr 7.71e-04 | (3805.90 ms | 137757 tok/s) step 8900/76294 | train loss 3.420959 | norm 0.1724 | lr 7.71e-04 | (3835.51 ms | 136693 tok/s) step 8901/76294 | train loss 3.415134 | norm 0.1809 | lr 7.71e-04 | (3849.36 ms | 136201 tok/s) step 8902/76294 | train loss 3.504594 | norm 0.2067 | lr 7.71e-04 | (3808.07 ms | 137678 tok/s) step 8903/76294 | train loss 3.503219 | norm 0.2032 | lr 7.70e-04 | (3888.66 ms | 134825 tok/s) step 8904/76294 | train loss 3.444019 | norm 0.2087 | lr 7.70e-04 | (3802.00 ms | 137898 tok/s) step 8905/76294 | train loss 3.456235 | norm 0.1806 | lr 7.70e-04 | (3843.02 ms | 136426 tok/s) step 8906/76294 | train loss 3.465688 | norm 0.2010 | lr 7.70e-04 
| (3795.07 ms | 138150 tok/s) step 8907/76294 | train loss 3.476770 | norm 0.2406 | lr 7.70e-04 | (3797.78 ms | 138051 tok/s) step 8908/76294 | train loss 3.482291 | norm 0.2242 | lr 7.70e-04 | (3819.44 ms | 137268 tok/s) step 8909/76294 | train loss 3.453346 | norm 0.2027 | lr 7.70e-04 | (3798.64 ms | 138020 tok/s) step 8910/76294 | train loss 3.437701 | norm 0.2001 | lr 7.70e-04 | (3797.67 ms | 138055 tok/s) step 8911/76294 | train loss 3.475479 | norm 0.2289 | lr 7.70e-04 | (3823.53 ms | 137121 tok/s) step 8912/76294 | train loss 3.434211 | norm 0.2173 | lr 7.70e-04 | (3797.52 ms | 138061 tok/s) step 8913/76294 | train loss 3.467582 | norm 0.2381 | lr 7.70e-04 | (3824.34 ms | 137092 tok/s) step 8914/76294 | train loss 3.489169 | norm 0.2561 | lr 7.69e-04 | (3819.86 ms | 137253 tok/s) step 8915/76294 | train loss 3.417493 | norm 0.2133 | lr 7.69e-04 | (3793.42 ms | 138210 tok/s) step 8916/76294 | train loss 3.493445 | norm 0.3714 | lr 7.69e-04 | (3822.71 ms | 137151 tok/s) step 8917/76294 | train loss 3.519530 | norm 0.2448 | lr 7.69e-04 | (3835.08 ms | 136708 tok/s) step 8918/76294 | train loss 3.465371 | norm 0.2853 | lr 7.69e-04 | (3800.92 ms | 137937 tok/s) step 8919/76294 | train loss 3.483952 | norm 0.2801 | lr 7.69e-04 | (3796.56 ms | 138096 tok/s) step 8920/76294 | train loss 3.556216 | norm 0.2187 | lr 7.69e-04 | (3829.69 ms | 136901 tok/s) step 8921/76294 | train loss 3.401689 | norm 0.2053 | lr 7.69e-04 | (3797.99 ms | 138044 tok/s) step 8922/76294 | train loss 3.418469 | norm 0.2214 | lr 7.69e-04 | (3808.95 ms | 137646 tok/s) step 8923/76294 | train loss 3.503539 | norm 0.1634 | lr 7.69e-04 | (3801.76 ms | 137907 tok/s) step 8924/76294 | train loss 3.455147 | norm 0.2679 | lr 7.69e-04 | (3926.38 ms | 133530 tok/s) step 8925/76294 | train loss 3.423043 | norm 0.1745 | lr 7.68e-04 | (3794.86 ms | 138158 tok/s) step 8926/76294 | train loss 3.476058 | norm 0.1926 | lr 7.68e-04 | (3882.10 ms | 135053 tok/s) step 8927/76294 | train loss 3.469698 | norm 
0.1872 | lr 7.68e-04 | (3798.34 ms | 138031 tok/s) step 8928/76294 | train loss 3.436706 | norm 0.2043 | lr 7.68e-04 | (3821.79 ms | 137184 tok/s) step 8929/76294 | train loss 3.471834 | norm 0.2230 | lr 7.68e-04 | (3796.98 ms | 138080 tok/s) step 8930/76294 | train loss 3.408100 | norm 0.1848 | lr 7.68e-04 | (3859.38 ms | 135848 tok/s) step 8931/76294 | train loss 3.477647 | norm 0.2267 | lr 7.68e-04 | (12424.46 ms | 42198 tok/s) step 8932/76294 | train loss 3.438536 | norm 0.1870 | lr 7.68e-04 | (3784.05 ms | 138552 tok/s) step 8933/76294 | train loss 3.465415 | norm 0.2849 | lr 7.68e-04 | (3792.28 ms | 138252 tok/s) step 8934/76294 | train loss 3.491729 | norm 0.1895 | lr 7.68e-04 | (3790.21 ms | 138327 tok/s) step 8935/76294 | train loss 3.455933 | norm 0.3764 | lr 7.68e-04 | (3816.11 ms | 137388 tok/s) step 8936/76294 | train loss 3.422346 | norm 0.1967 | lr 7.68e-04 | (3791.59 ms | 138276 tok/s) step 8937/76294 | train loss 3.458264 | norm 0.2789 | lr 7.67e-04 | (3798.05 ms | 138041 tok/s) step 8938/76294 | train loss 3.465156 | norm 0.2380 | lr 7.67e-04 | (3823.73 ms | 137114 tok/s) step 8939/76294 | train loss 3.451288 | norm 0.2093 | lr 7.67e-04 | (3793.51 ms | 138207 tok/s) step 8940/76294 | train loss 3.492689 | norm 0.2293 | lr 7.67e-04 | (3862.78 ms | 135728 tok/s) step 8941/76294 | train loss 3.465267 | norm 0.2283 | lr 7.67e-04 | (3795.86 ms | 138121 tok/s) step 8942/76294 | train loss 3.433100 | norm 0.1814 | lr 7.67e-04 | (3802.34 ms | 137886 tok/s) step 8943/76294 | train loss 3.513633 | norm 0.2034 | lr 7.67e-04 | (3824.22 ms | 137097 tok/s) step 8944/76294 | train loss 3.460768 | norm 0.2125 | lr 7.67e-04 | (3942.09 ms | 132997 tok/s) step 8945/76294 | train loss 3.453435 | norm 0.2169 | lr 7.67e-04 | (3795.76 ms | 138125 tok/s) step 8946/76294 | train loss 3.417687 | norm 0.1936 | lr 7.67e-04 | (3802.39 ms | 137884 tok/s) step 8947/76294 | train loss 3.516488 | norm 0.2186 | lr 7.67e-04 | (3821.85 ms | 137182 tok/s) step 8948/76294 | train loss 
3.443457 | norm 0.3220 | lr 7.66e-04 | (3801.53 ms | 137915 tok/s) step 8949/76294 | train loss 3.432768 | norm 0.2177 | lr 7.66e-04 | (3800.97 ms | 137935 tok/s) step 8950/76294 | train loss 3.490406 | norm 0.2702 | lr 7.66e-04 | (3833.93 ms | 136749 tok/s) step 8951/76294 | train loss 3.512328 | norm 0.2016 | lr 7.66e-04 | (3797.09 ms | 138076 tok/s) step 8952/76294 | train loss 3.443843 | norm 0.2820 | lr 7.66e-04 | (3819.51 ms | 137266 tok/s) step 8953/76294 | train loss 3.476694 | norm 0.2647 | lr 7.66e-04 | (3825.33 ms | 137057 tok/s) step 8954/76294 | train loss 3.570029 | norm 0.3184 | lr 7.66e-04 | (3808.06 ms | 137679 tok/s) step 8955/76294 | train loss 3.441310 | norm 0.3366 | lr 7.66e-04 | (3826.75 ms | 137006 tok/s) step 8956/76294 | train loss 3.397090 | norm 0.2387 | lr 7.66e-04 | (3803.35 ms | 137849 tok/s) step 8957/76294 | train loss 3.456695 | norm 0.2665 | lr 7.66e-04 | (3822.55 ms | 137157 tok/s) step 8958/76294 | train loss 3.414305 | norm 0.2805 | lr 7.66e-04 | (3818.58 ms | 137299 tok/s) step 8959/76294 | train loss 3.431369 | norm 0.3581 | lr 7.65e-04 | (3801.86 ms | 137903 tok/s) step 8960/76294 | train loss 3.498860 | norm 0.2989 | lr 7.65e-04 | (3815.92 ms | 137395 tok/s) step 8961/76294 | train loss 3.622161 | norm 0.3319 | lr 7.65e-04 | (3809.10 ms | 137641 tok/s) step 8962/76294 | train loss 3.426652 | norm 0.2628 | lr 7.65e-04 | (3813.81 ms | 137471 tok/s) step 8963/76294 | train loss 3.437452 | norm 0.2923 | lr 7.65e-04 | (3796.33 ms | 138104 tok/s) step 8964/76294 | train loss 3.549012 | norm 0.2116 | lr 7.65e-04 | (3794.84 ms | 138158 tok/s) step 8965/76294 | train loss 3.426262 | norm 0.2826 | lr 7.65e-04 | (4231.51 ms | 123901 tok/s) step 8966/76294 | train loss 3.435322 | norm 0.2291 | lr 7.65e-04 | (3792.41 ms | 138247 tok/s) step 8967/76294 | train loss 3.536337 | norm 0.2251 | lr 7.65e-04 | (3826.16 ms | 137027 tok/s) step 8968/76294 | train loss 3.451315 | norm 0.2265 | lr 7.65e-04 | (3798.46 ms | 138026 tok/s) step 
8969/76294 | train loss 3.482227 | norm 0.2666 | lr 7.65e-04 | (3800.00 ms | 137971 tok/s) step 8970/76294 | train loss 3.374418 | norm 0.2604 | lr 7.65e-04 | (3822.14 ms | 137171 tok/s) step 8971/76294 | train loss 3.407360 | norm 0.2160 | lr 7.64e-04 | (3800.81 ms | 137941 tok/s) step 8972/76294 | train loss 3.513553 | norm 0.1952 | lr 7.64e-04 | (3804.58 ms | 137804 tok/s) step 8973/76294 | train loss 3.445130 | norm 0.2169 | lr 7.64e-04 | (3798.98 ms | 138007 tok/s) step 8974/76294 | train loss 3.462866 | norm 0.2145 | lr 7.64e-04 | (3794.73 ms | 138162 tok/s) step 8975/76294 | train loss 3.505457 | norm 0.2597 | lr 7.64e-04 | (3838.74 ms | 136578 tok/s) step 8976/76294 | train loss 3.497256 | norm 0.2061 | lr 7.64e-04 | (3798.01 ms | 138043 tok/s) step 8977/76294 | train loss 3.447696 | norm 0.2562 | lr 7.64e-04 | (3863.28 ms | 135711 tok/s) step 8978/76294 | train loss 3.453988 | norm 0.1991 | lr 7.64e-04 | (3797.69 ms | 138055 tok/s) step 8979/76294 | train loss 3.491403 | norm 0.1845 | lr 7.64e-04 | (3799.60 ms | 137985 tok/s) step 8980/76294 | train loss 3.525710 | norm 0.1941 | lr 7.64e-04 | (3818.65 ms | 137297 tok/s) step 8981/76294 | train loss 3.525003 | norm 0.1944 | lr 7.64e-04 | (3800.42 ms | 137955 tok/s) step 8982/76294 | train loss 3.470002 | norm 0.2291 | lr 7.63e-04 | (3793.69 ms | 138200 tok/s) step 8983/76294 | train loss 3.455009 | norm 0.1841 | lr 7.63e-04 | (3831.97 ms | 136820 tok/s) step 8984/76294 | train loss 3.502085 | norm 0.3423 | lr 7.63e-04 | (3802.17 ms | 137892 tok/s) step 8985/76294 | train loss 3.572352 | norm 0.2435 | lr 7.63e-04 | (3819.38 ms | 137271 tok/s) step 8986/76294 | train loss 3.435809 | norm 0.2115 | lr 7.63e-04 | (3842.03 ms | 136461 tok/s) step 8987/76294 | train loss 3.452696 | norm 0.2199 | lr 7.63e-04 | (3798.56 ms | 138023 tok/s) step 8988/76294 | train loss 3.410956 | norm 0.3124 | lr 7.63e-04 | (3810.16 ms | 137603 tok/s) step 8989/76294 | train loss 3.431617 | norm 0.2116 | lr 7.63e-04 | (3866.84 ms | 
135586 tok/s) step 8990/76294 | train loss 3.437659 | norm 0.3809 | lr 7.63e-04 | (3800.00 ms | 137970 tok/s) step 8991/76294 | train loss 3.482909 | norm 0.1945 | lr 7.63e-04 | (3819.86 ms | 137253 tok/s) step 8992/76294 | train loss 3.413701 | norm 0.2550 | lr 7.63e-04 | (3797.98 ms | 138044 tok/s) step 8993/76294 | train loss 3.485206 | norm 0.1803 | lr 7.62e-04 | (3799.06 ms | 138005 tok/s) step 8994/76294 | train loss 3.363863 | norm 0.1896 | lr 7.62e-04 | (3831.24 ms | 136846 tok/s) step 8995/76294 | train loss 3.435670 | norm 0.1744 | lr 7.62e-04 | (3797.22 ms | 138072 tok/s) step 8996/76294 | train loss 3.422748 | norm 0.1973 | lr 7.62e-04 | (3824.79 ms | 137076 tok/s) step 8997/76294 | train loss 3.451028 | norm 0.1921 | lr 7.62e-04 | (3794.91 ms | 138156 tok/s) step 8998/76294 | train loss 3.457205 | norm 0.2487 | lr 7.62e-04 | (3801.07 ms | 137932 tok/s) step 8999/76294 | train loss 3.501429 | norm 0.2278 | lr 7.62e-04 | (3819.24 ms | 137275 tok/s) step 9000/76294 | train loss 3.495285 | norm 0.4181 | lr 7.62e-04 | (3801.84 ms | 137904 tok/s) val loss: 3.448032 saving model checkpoint to ./results/gpt2-124M-gqa/step_9000.pth step 9001/76294 | train loss 3.453687 | norm 0.3445 | lr 7.62e-04 | (3812.88 ms | 137504 tok/s) step 9002/76294 | train loss 3.460176 | norm 0.2710 | lr 7.62e-04 | (3821.58 ms | 137191 tok/s) step 9003/76294 | train loss 3.393069 | norm 0.1944 | lr 7.62e-04 | (3797.27 ms | 138070 tok/s) step 9004/76294 | train loss 3.668452 | norm 0.2316 | lr 7.62e-04 | (3795.98 ms | 138117 tok/s) step 9005/76294 | train loss 3.486594 | norm 0.1833 | lr 7.61e-04 | (3835.53 ms | 136692 tok/s) step 9006/76294 | train loss 3.430231 | norm 0.2271 | lr 7.61e-04 | (3868.29 ms | 135535 tok/s) step 9007/76294 | train loss 3.417422 | norm 0.1769 | lr 7.61e-04 | (3794.49 ms | 138171 tok/s) step 9008/76294 | train loss 3.449601 | norm 0.1992 | lr 7.61e-04 | (3839.64 ms | 136546 tok/s) step 9009/76294 | train loss 3.425970 | norm 0.2237 | lr 7.61e-04 | (3795.02 
ms | 138152 tok/s) step 9010/76294 | train loss 3.459509 | norm 0.2648 | lr 7.61e-04 | (3826.68 ms | 137009 tok/s) step 9011/76294 | train loss 3.418500 | norm 0.2579 | lr 7.61e-04 | (3801.09 ms | 137931 tok/s) step 9012/76294 | train loss 3.501419 | norm 0.2112 | lr 7.61e-04 | (3799.89 ms | 137975 tok/s) step 9013/76294 | train loss 3.519979 | norm 0.2416 | lr 7.61e-04 | (3814.29 ms | 137453 tok/s) step 9014/76294 | train loss 3.461391 | norm 0.1930 | lr 7.61e-04 | (3794.06 ms | 138187 tok/s) step 9015/76294 | train loss 3.441822 | norm 0.2913 | lr 7.61e-04 | (3800.63 ms | 137947 tok/s) step 9016/76294 | train loss 3.423061 | norm 0.2582 | lr 7.60e-04 | (3797.32 ms | 138068 tok/s) step 9017/76294 | train loss 3.529606 | norm 0.3449 | lr 7.60e-04 | (3793.12 ms | 138221 tok/s) step 9018/76294 | train loss 3.454140 | norm 0.2201 | lr 7.60e-04 | (3821.79 ms | 137184 tok/s) step 9019/76294 | train loss 3.435019 | norm 0.4419 | lr 7.60e-04 | (3796.28 ms | 138106 tok/s) step 9020/76294 | train loss 3.393178 | norm 0.2814 | lr 7.60e-04 | (3820.91 ms | 137216 tok/s) step 9021/76294 | train loss 3.439377 | norm 0.2678 | lr 7.60e-04 | (3824.49 ms | 137087 tok/s) step 9022/76294 | train loss 3.410522 | norm 0.2313 | lr 7.60e-04 | (3802.41 ms | 137883 tok/s) step 9023/76294 | train loss 3.515579 | norm 0.2377 | lr 7.60e-04 | (3801.70 ms | 137909 tok/s) step 9024/76294 | train loss 3.450901 | norm 0.2628 | lr 7.60e-04 | (3800.58 ms | 137950 tok/s) step 9025/76294 | train loss 3.426650 | norm 0.2669 | lr 7.60e-04 | (4528.04 ms | 115787 tok/s) step 9026/76294 | train loss 3.429183 | norm 0.2852 | lr 7.60e-04 | (3803.28 ms | 137852 tok/s) step 9027/76294 | train loss 3.423641 | norm 0.2171 | lr 7.59e-04 | (3958.72 ms | 132439 tok/s) step 9028/76294 | train loss 3.439648 | norm 0.2662 | lr 7.59e-04 | (3792.71 ms | 138236 tok/s) step 9029/76294 | train loss 3.469440 | norm 0.2179 | lr 7.59e-04 | (3800.34 ms | 137958 tok/s) step 9030/76294 | train loss 3.407312 | norm 0.2396 | lr 
7.59e-04 | (3976.39 ms | 131850 tok/s) step 9031/76294 | train loss 3.409706 | norm 0.2087 | lr 7.59e-04 | (3797.59 ms | 138058 tok/s) step 9032/76294 | train loss 3.420200 | norm 0.2784 | lr 7.59e-04 | (3856.57 ms | 135947 tok/s) step 9033/76294 | train loss 3.363777 | norm 0.2062 | lr 7.59e-04 | (3797.77 ms | 138051 tok/s) step 9034/76294 | train loss 3.454674 | norm 0.2241 | lr 7.59e-04 | (3807.44 ms | 137701 tok/s) step 9035/76294 | train loss 3.417182 | norm 0.2486 | lr 7.59e-04 | (3829.02 ms | 136925 tok/s) step 9036/76294 | train loss 3.471666 | norm 0.2157 | lr 7.59e-04 | (3809.84 ms | 137614 tok/s) step 9037/76294 | train loss 3.491345 | norm 0.2357 | lr 7.59e-04 | (3802.87 ms | 137866 tok/s) step 9038/76294 | train loss 3.517069 | norm 0.2765 | lr 7.59e-04 | (3798.67 ms | 138019 tok/s) step 9039/76294 | train loss 3.389206 | norm 0.2562 | lr 7.58e-04 | (3795.57 ms | 138131 tok/s) step 9040/76294 | train loss 3.443923 | norm 0.2200 | lr 7.58e-04 | (3822.27 ms | 137167 tok/s) step 9041/76294 | train loss 3.416103 | norm 0.2422 | lr 7.58e-04 | (3792.41 ms | 138247 tok/s) step 9042/76294 | train loss 3.465215 | norm 0.1958 | lr 7.58e-04 | (3844.66 ms | 136368 tok/s) step 9043/76294 | train loss 3.509029 | norm 0.2063 | lr 7.58e-04 | (3801.63 ms | 137911 tok/s) step 9044/76294 | train loss 3.526309 | norm 0.2670 | lr 7.58e-04 | (3803.47 ms | 137845 tok/s) step 9045/76294 | train loss 3.448847 | norm 0.2146 | lr 7.58e-04 | (3820.32 ms | 137237 tok/s) step 9046/76294 | train loss 3.428457 | norm 0.2170 | lr 7.58e-04 | (3799.17 ms | 138001 tok/s) step 9047/76294 | train loss 3.417176 | norm 0.2058 | lr 7.58e-04 | (3817.63 ms | 137333 tok/s) step 9048/76294 | train loss 3.455881 | norm 0.2061 | lr 7.58e-04 | (3892.01 ms | 134709 tok/s) step 9049/76294 | train loss 3.438166 | norm 0.1979 | lr 7.58e-04 | (3795.41 ms | 138137 tok/s) step 9050/76294 | train loss 3.430191 | norm 0.1879 | lr 7.57e-04 | (3828.71 ms | 136936 tok/s) step 9051/76294 | train loss 3.423834 | 
norm 0.1956 | lr 7.57e-04 | (3797.81 ms | 138050 tok/s) step 9052/76294 | train loss 3.566679 | norm 0.2156 | lr 7.57e-04 | (3799.38 ms | 137993 tok/s) step 9053/76294 | train loss 3.441103 | norm 0.1951 | lr 7.57e-04 | (3817.76 ms | 137329 tok/s) step 9054/76294 | train loss 3.446250 | norm 0.2090 | lr 7.57e-04 | (3798.70 ms | 138018 tok/s) step 9055/76294 | train loss 3.439707 | norm 0.2097 | lr 7.57e-04 | (3797.31 ms | 138068 tok/s) step 9056/76294 | train loss 3.405545 | norm 0.4220 | lr 7.57e-04 | (3826.10 ms | 137029 tok/s) step 9057/76294 | train loss 3.427803 | norm 0.2533 | lr 7.57e-04 | (3797.38 ms | 138066 tok/s) step 9058/76294 | train loss 3.465050 | norm 0.1922 | lr 7.57e-04 | (3801.11 ms | 137930 tok/s) step 9059/76294 | train loss 3.447541 | norm 0.2341 | lr 7.57e-04 | (3818.24 ms | 137311 tok/s) step 9060/76294 | train loss 3.413902 | norm 0.2096 | lr 7.57e-04 | (3799.49 ms | 137989 tok/s) step 9061/76294 | train loss 3.422448 | norm 0.2753 | lr 7.56e-04 | (3795.50 ms | 138134 tok/s) step 9062/76294 | train loss 3.520092 | norm 0.2165 | lr 7.56e-04 | (3819.21 ms | 137277 tok/s) step 9063/76294 | train loss 3.591403 | norm 0.4571 | lr 7.56e-04 | (3795.32 ms | 138141 tok/s) step 9064/76294 | train loss 3.434410 | norm 0.2717 | lr 7.56e-04 | (3805.64 ms | 137766 tok/s) step 9065/76294 | train loss 3.438246 | norm 0.3089 | lr 7.56e-04 | (3845.95 ms | 136322 tok/s) step 9066/76294 | train loss 3.460124 | norm 0.3051 | lr 7.56e-04 | (3804.65 ms | 137802 tok/s) step 9067/76294 | train loss 3.472572 | norm 0.2105 | lr 7.56e-04 | (3795.97 ms | 138117 tok/s) step 9068/76294 | train loss 3.460036 | norm 0.3207 | lr 7.56e-04 | (3819.75 ms | 137257 tok/s) step 9069/76294 | train loss 3.630294 | norm 0.2308 | lr 7.56e-04 | (3924.77 ms | 133585 tok/s) step 9070/76294 | train loss 3.409250 | norm 0.2619 | lr 7.56e-04 | (4496.14 ms | 116609 tok/s) step 9071/76294 | train loss 3.493992 | norm 0.1992 | lr 7.56e-04 | (3917.27 ms | 133840 tok/s) step 9072/76294 | train 
loss 3.453557 | norm 0.2040 | lr 7.56e-04 | (3790.28 ms | 138324 tok/s) step 9073/76294 | train loss 3.467585 | norm 0.1875 | lr 7.55e-04 | (3821.27 ms | 137203 tok/s) step 9074/76294 | train loss 3.468855 | norm 0.1904 | lr 7.55e-04 | (3822.00 ms | 137176 tok/s) step 9075/76294 | train loss 3.486775 | norm 0.2578 | lr 7.55e-04 | (3832.66 ms | 136795 tok/s) step 9076/76294 | train loss 3.419635 | norm 0.1912 | lr 7.55e-04 | (3801.38 ms | 137921 tok/s) step 9077/76294 | train loss 3.503167 | norm 0.2496 | lr 7.55e-04 | (3791.41 ms | 138283 tok/s) step 9078/76294 | train loss 3.498638 | norm 0.1856 | lr 7.55e-04 | (3793.42 ms | 138210 tok/s) step 9079/76294 | train loss 3.549306 | norm 0.2231 | lr 7.55e-04 | (3818.48 ms | 137303 tok/s) step 9080/76294 | train loss 3.421221 | norm 0.3251 | lr 7.55e-04 | (3791.77 ms | 138270 tok/s) step 9081/76294 | train loss 3.379554 | norm 0.2508 | lr 7.55e-04 | (3794.80 ms | 138160 tok/s) step 9082/76294 | train loss 3.435871 | norm 0.2505 | lr 7.55e-04 | (3810.36 ms | 137596 tok/s) step 9083/76294 | train loss 3.456088 | norm 0.2749 | lr 7.55e-04 | (3796.79 ms | 138087 tok/s) step 9084/76294 | train loss 3.444331 | norm 0.2196 | lr 7.54e-04 | (3792.00 ms | 138261 tok/s) step 9085/76294 | train loss 3.486747 | norm 0.3145 | lr 7.54e-04 | (5815.91 ms | 90147 tok/s) step 9086/76294 | train loss 3.496466 | norm 0.1952 | lr 7.54e-04 | (3805.15 ms | 137784 tok/s) step 9087/76294 | train loss 3.419891 | norm 0.2762 | lr 7.54e-04 | (3814.04 ms | 137463 tok/s) step 9088/76294 | train loss 3.418370 | norm 0.2297 | lr 7.54e-04 | (3827.28 ms | 136987 tok/s) step 9089/76294 | train loss 3.467231 | norm 0.2516 | lr 7.54e-04 | (3820.61 ms | 137226 tok/s) step 9090/76294 | train loss 3.427320 | norm 0.2735 | lr 7.54e-04 | (3811.59 ms | 137551 tok/s) step 9091/76294 | train loss 3.502603 | norm 0.2052 | lr 7.54e-04 | (3791.46 ms | 138281 tok/s) step 9092/76294 | train loss 3.465487 | norm 0.3818 | lr 7.54e-04 | (3791.46 ms | 138281 tok/s) step 
9093/76294 | train loss 3.458873 | norm 0.1874 | lr 7.54e-04 | (3815.71 ms | 137402 tok/s) step 9094/76294 | train loss 3.520246 | norm 0.2548 | lr 7.54e-04 | (3791.15 ms | 138293 tok/s) step 9095/76294 | train loss 3.428162 | norm 0.1891 | lr 7.53e-04 | (3789.54 ms | 138351 tok/s) step 9096/76294 | train loss 3.472282 | norm 0.2168 | lr 7.53e-04 | (3820.46 ms | 137232 tok/s) step 9097/76294 | train loss 3.397462 | norm 0.1964 | lr 7.53e-04 | (3793.13 ms | 138220 tok/s) step 9098/76294 | train loss 3.502692 | norm 0.1943 | lr 7.53e-04 | (3796.86 ms | 138085 tok/s) step 9099/76294 | train loss 3.508274 | norm 0.2092 | lr 7.53e-04 | (3812.19 ms | 137529 tok/s) step 9100/76294 | train loss 3.375553 | norm 0.2950 | lr 7.53e-04 | (3794.63 ms | 138166 tok/s) step 9101/76294 | train loss 3.459925 | norm 0.2769 | lr 7.53e-04 | (3791.96 ms | 138263 tok/s) step 9102/76294 | train loss 3.533080 | norm 0.3859 | lr 7.53e-04 | (3853.99 ms | 136038 tok/s) step 9103/76294 | train loss 3.500567 | norm 0.2290 | lr 7.53e-04 | (3792.66 ms | 138238 tok/s) step 9104/76294 | train loss 3.411011 | norm 0.2557 | lr 7.53e-04 | (3824.82 ms | 137075 tok/s) step 9105/76294 | train loss 3.410012 | norm 0.2692 | lr 7.53e-04 | (3794.18 ms | 138182 tok/s) step 9106/76294 | train loss 3.513326 | norm 0.2340 | lr 7.52e-04 | (3791.19 ms | 138291 tok/s) step 9107/76294 | train loss 3.404079 | norm 0.2059 | lr 7.52e-04 | (3817.90 ms | 137324 tok/s) step 9108/76294 | train loss 3.480827 | norm 0.2048 | lr 7.52e-04 | (3792.90 ms | 138229 tok/s) step 9109/76294 | train loss 3.406904 | norm 0.2322 | lr 7.52e-04 | (3797.34 ms | 138067 tok/s) step 9110/76294 | train loss 3.435393 | norm 0.2523 | lr 7.52e-04 | (3904.70 ms | 134271 tok/s) step 9111/76294 | train loss 3.493533 | norm 0.2300 | lr 7.52e-04 | (3795.13 ms | 138148 tok/s) step 9112/76294 | train loss 3.541763 | norm 0.2450 | lr 7.52e-04 | (3944.54 ms | 132915 tok/s) step 9113/76294 | train loss 3.445473 | norm 0.2742 | lr 7.52e-04 | (3854.38 ms | 
136024 tok/s) step 9114/76294 | train loss 3.464484 | norm 0.1873 | lr 7.52e-04 | (3808.04 ms | 137679 tok/s) step 9115/76294 | train loss 3.449673 | norm 0.2296 | lr 7.52e-04 | (3815.94 ms | 137394 tok/s) step 9116/76294 | train loss 3.451730 | norm 0.3010 | lr 7.52e-04 | (3797.06 ms | 138077 tok/s) step 9117/76294 | train loss 3.413555 | norm 0.2404 | lr 7.52e-04 | (3825.62 ms | 137046 tok/s) step 9118/76294 | train loss 3.492131 | norm 0.3361 | lr 7.51e-04 | (3793.23 ms | 138217 tok/s) step 9119/76294 | train loss 3.446006 | norm 0.3019 | lr 7.51e-04 | (3800.45 ms | 137954 tok/s) step 9120/76294 | train loss 3.431576 | norm 0.2933 | lr 7.51e-04 | (4041.55 ms | 129724 tok/s) step 9121/76294 | train loss 3.420624 | norm 0.3161 | lr 7.51e-04 | (3795.99 ms | 138116 tok/s) step 9122/76294 | train loss 3.430148 | norm 0.2049 | lr 7.51e-04 | (3796.26 ms | 138107 tok/s) step 9123/76294 | train loss 3.442193 | norm 0.2944 | lr 7.51e-04 | (3844.84 ms | 136362 tok/s) step 9124/76294 | train loss 3.413075 | norm 0.2081 | lr 7.51e-04 | (3795.95 ms | 138118 tok/s) step 9125/76294 | train loss 3.416488 | norm 0.2963 | lr 7.51e-04 | (3839.50 ms | 136551 tok/s) step 9126/76294 | train loss 3.459913 | norm 0.2004 | lr 7.51e-04 | (3791.65 ms | 138274 tok/s) step 9127/76294 | train loss 3.470492 | norm 0.3801 | lr 7.51e-04 | (3846.66 ms | 136297 tok/s) step 9128/76294 | train loss 3.482689 | norm 0.2330 | lr 7.51e-04 | (3799.00 ms | 138007 tok/s) step 9129/76294 | train loss 3.517459 | norm 0.2861 | lr 7.50e-04 | (3821.35 ms | 137200 tok/s) step 9130/76294 | train loss 3.442825 | norm 0.2014 | lr 7.50e-04 | (3814.98 ms | 137429 tok/s) step 9131/76294 | train loss 3.605811 | norm 0.3792 | lr 7.50e-04 | (3923.59 ms | 133624 tok/s) step 9132/76294 | train loss 3.426763 | norm 0.1990 | lr 7.50e-04 | (3790.75 ms | 138307 tok/s) step 9133/76294 | train loss 3.408715 | norm 0.2319 | lr 7.50e-04 | (3825.08 ms | 137066 tok/s) step 9134/76294 | train loss 3.380979 | norm 0.2202 | lr 7.50e-04 
| (3797.00 ms | 138080 tok/s) step 9135/76294 | train loss 3.419246 | norm 0.2320 | lr 7.50e-04 | (3806.15 ms | 137748 tok/s) step 9136/76294 | train loss 3.494219 | norm 0.2204 | lr 7.50e-04 | (3794.49 ms | 138171 tok/s) step 9137/76294 | train loss 3.488277 | norm 0.2134 | lr 7.50e-04 | (3830.03 ms | 136889 tok/s) step 9138/76294 | train loss 3.434995 | norm 0.2796 | lr 7.50e-04 | (3812.46 ms | 137520 tok/s) step 9139/76294 | train loss 3.463789 | norm 0.2026 | lr 7.50e-04 | (3843.90 ms | 136395 tok/s) step 9140/76294 | train loss 3.370264 | norm 0.2065 | lr 7.49e-04 | (3803.87 ms | 137830 tok/s) step 9141/76294 | train loss 3.463305 | norm 0.2411 | lr 7.49e-04 | (3846.36 ms | 136308 tok/s) step 9142/76294 | train loss 3.478790 | norm 0.2153 | lr 7.49e-04 | (3828.30 ms | 136951 tok/s) step 9143/76294 | train loss 3.321847 | norm 0.2534 | lr 7.49e-04 | (3800.11 ms | 137966 tok/s) step 9144/76294 | train loss 3.536462 | norm 0.2814 | lr 7.49e-04 | (3816.36 ms | 137379 tok/s) step 9145/76294 | train loss 3.391103 | norm 0.2198 | lr 7.49e-04 | (3804.40 ms | 137811 tok/s) step 9146/76294 | train loss 3.403395 | norm 0.2933 | lr 7.49e-04 | (3796.16 ms | 138110 tok/s) step 9147/76294 | train loss 3.406745 | norm 0.2517 | lr 7.49e-04 | (3840.22 ms | 136526 tok/s) step 9148/76294 | train loss 3.441028 | norm 0.2982 | lr 7.49e-04 | (3797.51 ms | 138061 tok/s) step 9149/76294 | train loss 3.392112 | norm 0.2592 | lr 7.49e-04 | (3804.26 ms | 137816 tok/s) step 9150/76294 | train loss 3.436692 | norm 0.2223 | lr 7.49e-04 | (3817.58 ms | 137335 tok/s) step 9151/76294 | train loss 3.466076 | norm 0.2078 | lr 7.48e-04 | (3799.75 ms | 137980 tok/s) step 9152/76294 | train loss 3.418660 | norm 0.1762 | lr 7.48e-04 | (3923.04 ms | 133643 tok/s) step 9153/76294 | train loss 3.608107 | norm 0.2249 | lr 7.48e-04 | (3798.08 ms | 138040 tok/s) step 9154/76294 | train loss 3.494385 | norm 0.2807 | lr 7.48e-04 | (3896.34 ms | 134559 tok/s) step 9155/76294 | train loss 3.482691 | norm 
0.2377 | lr 7.48e-04 | (3814.78 ms | 137436 tok/s) step 9156/76294 | train loss 3.476538 | norm 0.2651 | lr 7.48e-04 | (4300.00 ms | 121927 tok/s) step 9157/76294 | train loss 3.393099 | norm 0.2232 | lr 7.48e-04 | (3817.91 ms | 137323 tok/s) step 9158/76294 | train loss 3.478220 | norm 0.2167 | lr 7.48e-04 | (3844.27 ms | 136382 tok/s) step 9159/76294 | train loss 3.424812 | norm 0.2947 | lr 7.48e-04 | (3870.62 ms | 135453 tok/s) step 9160/76294 | train loss 3.471895 | norm 0.1895 | lr 7.48e-04 | (3790.52 ms | 138316 tok/s) step 9161/76294 | train loss 3.451141 | norm 0.2195 | lr 7.48e-04 | (3866.29 ms | 135605 tok/s) step 9162/76294 | train loss 3.446349 | norm 0.2081 | lr 7.48e-04 | (3826.41 ms | 137018 tok/s) step 9163/76294 | train loss 3.414645 | norm 0.2031 | lr 7.47e-04 | (3797.30 ms | 138069 tok/s) step 9164/76294 | train loss 3.486999 | norm 0.2078 | lr 7.47e-04 | (3793.95 ms | 138190 tok/s) step 9165/76294 | train loss 3.456515 | norm 0.2155 | lr 7.47e-04 | (3837.60 ms | 136619 tok/s) step 9166/76294 | train loss 3.413915 | norm 0.1976 | lr 7.47e-04 | (3796.35 ms | 138103 tok/s) step 9167/76294 | train loss 3.475072 | norm 0.1906 | lr 7.47e-04 | (3802.07 ms | 137896 tok/s) step 9168/76294 | train loss 3.418073 | norm 0.2359 | lr 7.47e-04 | (3818.02 ms | 137319 tok/s) step 9169/76294 | train loss 3.482833 | norm 0.1843 | lr 7.47e-04 | (3797.22 ms | 138071 tok/s) step 9170/76294 | train loss 3.399475 | norm 0.2138 | lr 7.47e-04 | (3796.15 ms | 138111 tok/s) step 9171/76294 | train loss 3.472044 | norm 0.1984 | lr 7.47e-04 | (3836.91 ms | 136643 tok/s) step 9172/76294 | train loss 3.426603 | norm 0.2483 | lr 7.47e-04 | (3794.91 ms | 138155 tok/s) step 9173/76294 | train loss 3.520185 | norm 0.3134 | lr 7.47e-04 | (3862.97 ms | 135721 tok/s) step 9174/76294 | train loss 3.437042 | norm 0.2041 | lr 7.46e-04 | (3795.85 ms | 138121 tok/s) step 9175/76294 | train loss 3.455582 | norm 0.2808 | lr 7.46e-04 | (3825.55 ms | 137049 tok/s) step 9176/76294 | train loss 
3.415140 | norm 0.1871 | lr 7.46e-04 | (3797.60 ms | 138058 tok/s) step 9177/76294 | train loss 3.500321 | norm 0.2691 | lr 7.46e-04 | (3800.37 ms | 137957 tok/s) step 9178/76294 | train loss 3.412875 | norm 0.2066 | lr 7.46e-04 | (3818.98 ms | 137285 tok/s) step 9179/76294 | train loss 3.404126 | norm 0.2269 | lr 7.46e-04 | (3855.66 ms | 135979 tok/s) step 9180/76294 | train loss 3.480668 | norm 0.2660 | lr 7.46e-04 | (3818.80 ms | 137291 tok/s) step 9181/76294 | train loss 3.404367 | norm 0.2061 | lr 7.46e-04 | (3801.94 ms | 137900 tok/s) step 9182/76294 | train loss 3.391940 | norm 0.2333 | lr 7.46e-04 | (3820.83 ms | 137218 tok/s) step 9183/76294 | train loss 3.453838 | norm 0.2197 | lr 7.46e-04 | (3798.31 ms | 138032 tok/s) step 9184/76294 | train loss 3.380976 | norm 0.1811 | lr 7.46e-04 | (3800.54 ms | 137951 tok/s) step 9185/76294 | train loss 3.454819 | norm 0.1960 | lr 7.45e-04 | (3800.60 ms | 137949 tok/s) step 9186/76294 | train loss 3.433535 | norm 0.2087 | lr 7.45e-04 | (3798.50 ms | 138025 tok/s) step 9187/76294 | train loss 3.430167 | norm 0.2979 | lr 7.45e-04 | (3822.36 ms | 137164 tok/s) step 9188/76294 | train loss 3.419699 | norm 0.1624 | lr 7.45e-04 | (3799.66 ms | 137983 tok/s) step 9189/76294 | train loss 3.449694 | norm 0.2571 | lr 7.45e-04 | (3801.72 ms | 137908 tok/s) step 9190/76294 | train loss 3.445863 | norm 0.2035 | lr 7.45e-04 | (3821.99 ms | 137177 tok/s) step 9191/76294 | train loss 3.410530 | norm 0.1974 | lr 7.45e-04 | (3801.62 ms | 137912 tok/s) step 9192/76294 | train loss 3.427649 | norm 0.2154 | lr 7.45e-04 | (3820.98 ms | 137213 tok/s) step 9193/76294 | train loss 3.528796 | norm 0.2196 | lr 7.45e-04 | (3804.38 ms | 137812 tok/s) step 9194/76294 | train loss 3.439504 | norm 0.2291 | lr 7.45e-04 | (3901.27 ms | 134389 tok/s) step 9195/76294 | train loss 3.496730 | norm 0.1919 | lr 7.45e-04 | (3805.89 ms | 137757 tok/s) step 9196/76294 | train loss 3.432191 | norm 0.2698 | lr 7.44e-04 | (3809.84 ms | 137614 tok/s) step 
9197/76294 | train loss 3.444599 | norm 0.2172 | lr 7.44e-04 | (3829.34 ms | 136914 tok/s) step 9198/76294 | train loss 3.509157 | norm 0.2266 | lr 7.44e-04 | (3806.66 ms | 137729 tok/s) step 9199/76294 | train loss 3.409637 | norm 0.2433 | lr 7.44e-04 | (3820.63 ms | 137225 tok/s) step 9200/76294 | train loss 3.473028 | norm 0.3346 | lr 7.44e-04 | (3860.10 ms | 135822 tok/s) step 9201/76294 | train loss 3.541518 | norm 0.2078 | lr 7.44e-04 | (3802.58 ms | 137877 tok/s) step 9202/76294 | train loss 3.450317 | norm 0.3103 | lr 7.44e-04 | (3836.92 ms | 136643 tok/s) step 9203/76294 | train loss 3.412524 | norm 0.2459 | lr 7.44e-04 | (3806.46 ms | 137736 tok/s) step 9204/76294 | train loss 3.520434 | norm 0.2000 | lr 7.44e-04 | (3809.67 ms | 137620 tok/s) step 9205/76294 | train loss 3.497370 | norm 0.1923 | lr 7.44e-04 | (3830.17 ms | 136884 tok/s) step 9206/76294 | train loss 3.434899 | norm 0.1933 | lr 7.44e-04 | (3806.77 ms | 137725 tok/s) step 9207/76294 | train loss 3.404207 | norm 0.2030 | lr 7.44e-04 | (3804.83 ms | 137795 tok/s) step 9208/76294 | train loss 3.433665 | norm 0.1873 | lr 7.43e-04 | (3852.80 ms | 136080 tok/s) step 9209/76294 | train loss 3.389698 | norm 0.2336 | lr 7.43e-04 | (3807.88 ms | 137685 tok/s) step 9210/76294 | train loss 3.459367 | norm 0.2203 | lr 7.43e-04 | (3810.67 ms | 137584 tok/s) step 9211/76294 | train loss 3.467899 | norm 0.2241 | lr 7.43e-04 | (3840.13 ms | 136529 tok/s) step 9212/76294 | train loss 3.441361 | norm 0.2029 | lr 7.43e-04 | (3808.68 ms | 137656 tok/s) step 9213/76294 | train loss 3.433451 | norm 0.2484 | lr 7.43e-04 | (3808.81 ms | 137651 tok/s) step 9214/76294 | train loss 3.494832 | norm 0.2024 | lr 7.43e-04 | (3843.67 ms | 136403 tok/s) step 9215/76294 | train loss 3.420033 | norm 0.2420 | lr 7.43e-04 | (4094.33 ms | 128052 tok/s) step 9216/76294 | train loss 3.476809 | norm 0.2231 | lr 7.43e-04 | (3805.24 ms | 137781 tok/s) step 9217/76294 | train loss 3.248895 | norm 0.1944 | lr 7.43e-04 | (3815.51 ms | 
137410 tok/s) step 9218/76294 | train loss 3.465877 | norm 0.2251 | lr 7.43e-04 | (3829.59 ms | 136904 tok/s) step 9219/76294 | train loss 3.420885 | norm 0.1973 | lr 7.42e-04 | (3810.74 ms | 137582 tok/s) step 9220/76294 | train loss 3.505806 | norm 0.2404 | lr 7.42e-04 | (3806.08 ms | 137750 tok/s) step 9221/76294 | train loss 3.438454 | norm 0.2237 | lr 7.42e-04 | (3838.37 ms | 136591 tok/s) step 9222/76294 | train loss 3.477641 | norm 0.3159 | lr 7.42e-04 | (3806.37 ms | 137739 tok/s) step 9223/76294 | train loss 3.493461 | norm 0.2122 | lr 7.42e-04 | (3815.69 ms | 137403 tok/s) step 9224/76294 | train loss 3.422097 | norm 0.2120 | lr 7.42e-04 | (3832.08 ms | 136815 tok/s) step 9225/76294 | train loss 3.511586 | norm 0.2472 | lr 7.42e-04 | (3811.16 ms | 137566 tok/s) step 9226/76294 | train loss 3.472058 | norm 0.2060 | lr 7.42e-04 | (3831.86 ms | 136823 tok/s) step 9227/76294 | train loss 3.448681 | norm 0.2504 | lr 7.42e-04 | (3810.51 ms | 137590 tok/s) step 9228/76294 | train loss 3.435549 | norm 0.1936 | lr 7.42e-04 | (3807.14 ms | 137712 tok/s) step 9229/76294 | train loss 3.415491 | norm 0.2062 | lr 7.42e-04 | (3892.74 ms | 134684 tok/s) step 9230/76294 | train loss 3.445257 | norm 0.2146 | lr 7.41e-04 | (3807.69 ms | 137692 tok/s) step 9231/76294 | train loss 3.448250 | norm 0.2109 | lr 7.41e-04 | (3834.06 ms | 136745 tok/s) step 9232/76294 | train loss 3.371962 | norm 0.3029 | lr 7.41e-04 | (3802.16 ms | 137892 tok/s) step 9233/76294 | train loss 3.447668 | norm 0.3257 | lr 7.41e-04 | (3804.25 ms | 137816 tok/s) step 9234/76294 | train loss 3.487980 | norm 0.2224 | lr 7.41e-04 | (3825.65 ms | 137046 tok/s) step 9235/76294 | train loss 3.421638 | norm 0.2050 | lr 7.41e-04 | (3803.99 ms | 137826 tok/s) step 9236/76294 | train loss 3.409581 | norm 0.2825 | lr 7.41e-04 | (3894.04 ms | 134639 tok/s) step 9237/76294 | train loss 3.511277 | norm 0.2984 | lr 7.41e-04 | (3798.45 ms | 138027 tok/s) step 9238/76294 | train loss 3.428846 | norm 0.2288 | lr 7.41e-04 
| (3822.16 ms | 137171 tok/s) step 9239/76294 | train loss 3.368402 | norm 0.3827 | lr 7.41e-04 | (3821.99 ms | 137177 tok/s) step 9240/76294 | train loss 3.403187 | norm 0.2433 | lr 7.41e-04 | (3809.52 ms | 137626 tok/s) step 9241/76294 | train loss 3.434795 | norm 0.3642 | lr 7.40e-04 | (3801.15 ms | 137929 tok/s) step 9242/76294 | train loss 3.453899 | norm 0.2939 | lr 7.40e-04 | (3828.63 ms | 136939 tok/s) step 9243/76294 | train loss 3.429066 | norm 0.3641 | lr 7.40e-04 | (3802.17 ms | 137892 tok/s) step 9244/76294 | train loss 3.447605 | norm 0.2383 | lr 7.40e-04 | (3879.02 ms | 135160 tok/s) step 9245/76294 | train loss 3.450125 | norm 0.2296 | lr 7.40e-04 | (3791.92 ms | 138265 tok/s) step 9246/76294 | train loss 3.433360 | norm 0.2450 | lr 7.40e-04 | (3839.80 ms | 136540 tok/s) step 9247/76294 | train loss 3.420837 | norm 0.2008 | lr 7.40e-04 | (3794.04 ms | 138187 tok/s) step 9248/76294 | train loss 3.435104 | norm 0.2649 | lr 7.40e-04 | (3808.02 ms | 137680 tok/s) step 9249/76294 | train loss 3.410201 | norm 0.1839 | lr 7.40e-04 | (3812.45 ms | 137520 tok/s) step 9250/76294 | train loss 3.409728 | norm 0.2226 | lr 7.40e-04 | (3793.28 ms | 138215 tok/s) val loss: 3.442094 saving model checkpoint to ./results/gpt2-124M-gqa/step_9250.pth step 9251/76294 | train loss 3.414642 | norm 0.1803 | lr 7.40e-04 | (3813.34 ms | 137488 tok/s) step 9252/76294 | train loss 3.502413 | norm 0.2310 | lr 7.40e-04 | (3800.13 ms | 137966 tok/s) step 9253/76294 | train loss 3.438661 | norm 0.2063 | lr 7.39e-04 | (3799.41 ms | 137992 tok/s) step 9254/76294 | train loss 3.465770 | norm 0.2399 | lr 7.39e-04 | (3822.43 ms | 137161 tok/s) step 9255/76294 | train loss 3.507710 | norm 0.2415 | lr 7.39e-04 | (3978.88 ms | 131768 tok/s) step 9256/76294 | train loss 3.384728 | norm 0.1988 | lr 7.39e-04 | (3867.09 ms | 135577 tok/s) step 9257/76294 | train loss 3.531613 | norm 0.2720 | lr 7.39e-04 | (3804.34 ms | 137813 tok/s) step 9258/76294 | train loss 3.460907 | norm 0.1841 | lr 
7.39e-04 | (3872.36 ms | 135392 tok/s) step 9259/76294 | train loss 3.434430 | norm 0.1829 | lr 7.39e-04 | (3796.27 ms | 138106 tok/s) step 9260/76294 | train loss 3.443377 | norm 0.1994 | lr 7.39e-04 | (3801.11 ms | 137930 tok/s) step 9261/76294 | train loss 3.418385 | norm 0.2002 | lr 7.39e-04 | (3818.95 ms | 137286 tok/s) step 9262/76294 | train loss 3.536049 | norm 0.2595 | lr 7.39e-04 | (3827.15 ms | 136992 tok/s) step 9263/76294 | train loss 3.444434 | norm 0.2521 | lr 7.39e-04 | (3803.93 ms | 137828 tok/s) step 9264/76294 | train loss 3.455118 | norm 0.3396 | lr 7.38e-04 | (3800.43 ms | 137955 tok/s) step 9265/76294 | train loss 3.387527 | norm 0.2052 | lr 7.38e-04 | (3818.64 ms | 137297 tok/s) step 9266/76294 | train loss 3.448394 | norm 0.2597 | lr 7.38e-04 | (3796.36 ms | 138103 tok/s) step 9267/76294 | train loss 3.479178 | norm 0.2247 | lr 7.38e-04 | (3799.87 ms | 137975 tok/s) step 9268/76294 | train loss 3.513768 | norm 0.2014 | lr 7.38e-04 | (3803.19 ms | 137855 tok/s) step 9269/76294 | train loss 3.470718 | norm 0.2033 | lr 7.38e-04 | (3794.90 ms | 138156 tok/s) step 9270/76294 | train loss 3.489686 | norm 0.1893 | lr 7.38e-04 | (4009.90 ms | 130748 tok/s) step 9271/76294 | train loss 3.430118 | norm 0.2049 | lr 7.38e-04 | (3805.69 ms | 137764 tok/s) step 9272/76294 | train loss 3.457838 | norm 0.2914 | lr 7.38e-04 | (3822.28 ms | 137166 tok/s) step 9273/76294 | train loss 3.422418 | norm 0.2862 | lr 7.38e-04 | (3816.24 ms | 137383 tok/s) step 9274/76294 | train loss 3.396771 | norm 0.2475 | lr 7.38e-04 | (3794.83 ms | 138158 tok/s) step 9275/76294 | train loss 3.359560 | norm 0.1904 | lr 7.37e-04 | (3791.90 ms | 138265 tok/s) step 9276/76294 | train loss 3.445903 | norm 0.2318 | lr 7.37e-04 | (3861.30 ms | 135780 tok/s) step 9277/76294 | train loss 3.421807 | norm 0.1996 | lr 7.37e-04 | (3881.09 ms | 135088 tok/s) step 9278/76294 | train loss 3.387757 | norm 0.1905 | lr 7.37e-04 | (3795.09 ms | 138149 tok/s) step 9279/76294 | train loss 3.428817 | 
norm 0.2361 | lr 7.37e-04 | (3829.64 ms | 136903 tok/s) step 9280/76294 | train loss 3.435637 | norm 0.2117 | lr 7.37e-04 | (3821.60 ms | 137191 tok/s) step 9281/76294 | train loss 3.506510 | norm 0.2023 | lr 7.37e-04 | (3799.39 ms | 137993 tok/s) step 9282/76294 | train loss 3.468193 | norm 0.2017 | lr 7.37e-04 | (3796.37 ms | 138102 tok/s) step 9283/76294 | train loss 3.448462 | norm 0.2666 | lr 7.37e-04 | (3859.45 ms | 135845 tok/s) step 9284/76294 | train loss 3.380996 | norm 0.2229 | lr 7.37e-04 | (3792.72 ms | 138235 tok/s) step 9285/76294 | train loss 3.481189 | norm 0.2671 | lr 7.37e-04 | (3802.82 ms | 137868 tok/s) step 9286/76294 | train loss 3.400708 | norm 0.1885 | lr 7.36e-04 | (6196.03 ms | 84617 tok/s) step 9287/76294 | train loss 3.455628 | norm 0.2047 | lr 7.36e-04 | (3960.04 ms | 132395 tok/s) step 9288/76294 | train loss 3.446792 | norm 0.1929 | lr 7.36e-04 | (3802.51 ms | 137880 tok/s) step 9289/76294 | train loss 3.367289 | norm 0.2020 | lr 7.36e-04 | (3834.86 ms | 136716 tok/s) step 9290/76294 | train loss 3.491437 | norm 0.2851 | lr 7.36e-04 | (3798.26 ms | 138034 tok/s) step 9291/76294 | train loss 3.458106 | norm 0.2038 | lr 7.36e-04 | (7625.31 ms | 68756 tok/s) step 9292/76294 | train loss 3.457178 | norm 0.3415 | lr 7.36e-04 | (3835.60 ms | 136690 tok/s) step 9293/76294 | train loss 3.410954 | norm 0.1892 | lr 7.36e-04 | (3794.35 ms | 138176 tok/s) step 9294/76294 | train loss 3.436928 | norm 0.2362 | lr 7.36e-04 | (3840.37 ms | 136520 tok/s) step 9295/76294 | train loss 3.489722 | norm 0.1924 | lr 7.36e-04 | (3816.36 ms | 137379 tok/s) step 9296/76294 | train loss 3.513042 | norm 0.1922 | lr 7.36e-04 | (3862.02 ms | 135755 tok/s) step 9297/76294 | train loss 3.476428 | norm 0.1957 | lr 7.36e-04 | (3794.78 ms | 138160 tok/s) step 9298/76294 | train loss 3.407055 | norm 0.2177 | lr 7.35e-04 | (3889.51 ms | 134795 tok/s) step 9299/76294 | train loss 3.429933 | norm 0.1994 | lr 7.35e-04 | (3841.13 ms | 136493 tok/s) step 9300/76294 | train 
loss 3.490774 | norm 0.2164 | lr 7.35e-04 | (3789.11 ms | 138367 tok/s) step 9301/76294 | train loss 3.489692 | norm 0.3069 | lr 7.35e-04 | (3815.01 ms | 137428 tok/s) step 9302/76294 | train loss 3.449532 | norm 0.2370 | lr 7.35e-04 | (3793.93 ms | 138191 tok/s) step 9303/76294 | train loss 3.492501 | norm 0.2190 | lr 7.35e-04 | (3797.84 ms | 138049 tok/s) step 9304/76294 | train loss 3.421601 | norm 0.2107 | lr 7.35e-04 | (3814.00 ms | 137464 tok/s) step 9305/76294 | train loss 3.428167 | norm 0.2566 | lr 7.35e-04 | (3825.71 ms | 137043 tok/s) step 9306/76294 | train loss 3.401427 | norm 0.2102 | lr 7.35e-04 | (3810.28 ms | 137598 tok/s) step 9307/76294 | train loss 3.449791 | norm 0.2213 | lr 7.35e-04 | (3797.98 ms | 138044 tok/s) step 9308/76294 | train loss 3.471081 | norm 0.2085 | lr 7.35e-04 | (3808.70 ms | 137655 tok/s) step 9309/76294 | train loss 3.469282 | norm 0.1905 | lr 7.34e-04 | (3832.40 ms | 136804 tok/s) step 9310/76294 | train loss 3.418007 | norm 0.2514 | lr 7.34e-04 | (3803.94 ms | 137828 tok/s) step 9311/76294 | train loss 3.450492 | norm 0.2190 | lr 7.34e-04 | (3821.45 ms | 137196 tok/s) step 9312/76294 | train loss 3.422547 | norm 0.2271 | lr 7.34e-04 | (3791.40 ms | 138284 tok/s) step 9313/76294 | train loss 3.396754 | norm 0.1945 | lr 7.34e-04 | (3807.53 ms | 137698 tok/s) step 9314/76294 | train loss 3.448566 | norm 0.3072 | lr 7.34e-04 | (3814.75 ms | 137437 tok/s) step 9315/76294 | train loss 3.424437 | norm 0.2110 | lr 7.34e-04 | (3819.47 ms | 137267 tok/s) step 9316/76294 | train loss 3.449732 | norm 0.2519 | lr 7.34e-04 | (3796.74 ms | 138089 tok/s) step 9317/76294 | train loss 3.422202 | norm 0.2649 | lr 7.34e-04 | (3823.75 ms | 137114 tok/s) step 9318/76294 | train loss 3.424291 | norm 0.3662 | lr 7.34e-04 | (3915.43 ms | 133903 tok/s) step 9319/76294 | train loss 3.418497 | norm 0.1788 | lr 7.34e-04 | (3789.81 ms | 138342 tok/s) step 9320/76294 | train loss 3.439853 | norm 0.5193 | lr 7.33e-04 | (3798.21 ms | 138035 tok/s) step 
9321/76294 | train loss 3.442292 | norm 0.2069 | lr 7.33e-04 | (3819.52 ms | 137265 tok/s) step 9322/76294 | train loss 3.482858 | norm 0.3299 | lr 7.33e-04 | (3801.92 ms | 137901 tok/s) step 9323/76294 | train loss 3.414328 | norm 0.2064 | lr 7.33e-04 | (3796.56 ms | 138095 tok/s) step 9324/76294 | train loss 3.455638 | norm 0.2481 | lr 7.33e-04 | (3842.85 ms | 136432 tok/s) step 9325/76294 | train loss 3.442029 | norm 0.2316 | lr 7.33e-04 | (3791.40 ms | 138283 tok/s) step 9326/76294 | train loss 3.347329 | norm 0.2497 | lr 7.33e-04 | (3816.80 ms | 137363 tok/s) step 9327/76294 | train loss 3.391331 | norm 0.2403 | lr 7.33e-04 | (3815.73 ms | 137402 tok/s) step 9328/76294 | train loss 3.402097 | norm 0.3138 | lr 7.33e-04 | (3798.93 ms | 138009 tok/s) step 9329/76294 | train loss 3.465464 | norm 0.2580 | lr 7.33e-04 | (3791.27 ms | 138288 tok/s) step 9330/76294 | train loss 3.452936 | norm 0.2817 | lr 7.33e-04 | (3855.13 ms | 135998 tok/s) step 9331/76294 | train loss 3.400123 | norm 0.2903 | lr 7.32e-04 | (3791.50 ms | 138280 tok/s) step 9332/76294 | train loss 3.429740 | norm 0.1865 | lr 7.32e-04 | (3795.10 ms | 138149 tok/s) step 9333/76294 | train loss 3.471117 | norm 0.2215 | lr 7.32e-04 | (3816.16 ms | 137386 tok/s) step 9334/76294 | train loss 3.400010 | norm 0.1874 | lr 7.32e-04 | (3794.42 ms | 138173 tok/s) step 9335/76294 | train loss 3.461566 | norm 0.1983 | lr 7.32e-04 | (3798.31 ms | 138032 tok/s) step 9336/76294 | train loss 3.443463 | norm 0.2568 | lr 7.32e-04 | (3825.89 ms | 137037 tok/s) step 9337/76294 | train loss 3.435149 | norm 0.2446 | lr 7.32e-04 | (3790.84 ms | 138304 tok/s) step 9338/76294 | train loss 3.451961 | norm 0.2819 | lr 7.32e-04 | (3797.24 ms | 138071 tok/s) step 9339/76294 | train loss 3.435913 | norm 0.2801 | lr 7.32e-04 | (3910.68 ms | 134066 tok/s) step 9340/76294 | train loss 3.477730 | norm 0.4036 | lr 7.32e-04 | (3797.23 ms | 138071 tok/s) step 9341/76294 | train loss 3.394915 | norm 0.2624 | lr 7.32e-04 | (3829.80 ms | 
136897 tok/s) step 9342/76294 | train loss 3.515811 | norm 0.3347 | lr 7.31e-04 | (3794.23 ms | 138180 tok/s) step 9343/76294 | train loss 3.379465 | norm 0.2217 | lr 7.31e-04 | (3834.35 ms | 136734 tok/s) step 9344/76294 | train loss 3.517037 | norm 0.2853 | lr 7.31e-04 | (3816.66 ms | 137368 tok/s) step 9345/76294 | train loss 3.393559 | norm 0.2174 | lr 7.31e-04 | (3798.04 ms | 138042 tok/s) step 9346/76294 | train loss 3.463562 | norm 0.2698 | lr 7.31e-04 | (4142.70 ms | 126557 tok/s) step 9347/76294 | train loss 3.477675 | norm 0.2007 | lr 7.31e-04 | (4271.13 ms | 122752 tok/s) step 9348/76294 | train loss 3.532836 | norm 0.2577 | lr 7.31e-04 | (3936.16 ms | 133198 tok/s) step 9349/76294 | train loss 3.424544 | norm 0.2181 | lr 7.31e-04 | (3774.50 ms | 138903 tok/s) step 9350/76294 | train loss 3.451043 | norm 0.1950 | lr 7.31e-04 | (21331.10 ms | 24579 tok/s) step 9351/76294 | train loss 3.382666 | norm 0.2082 | lr 7.31e-04 | (4405.17 ms | 119016 tok/s) step 9352/76294 | train loss 3.430708 | norm 0.2302 | lr 7.31e-04 | (21362.02 ms | 24543 tok/s) step 9353/76294 | train loss 3.476006 | norm 0.2147 | lr 7.31e-04 | (3737.43 ms | 140280 tok/s) step 9354/76294 | train loss 3.471434 | norm 0.2523 | lr 7.30e-04 | (3909.80 ms | 134096 tok/s) step 9355/76294 | train loss 3.424312 | norm 0.2019 | lr 7.30e-04 | (5785.51 ms | 90621 tok/s) step 9356/76294 | train loss 3.463484 | norm 0.1930 | lr 7.30e-04 | (3847.96 ms | 136251 tok/s) step 9357/76294 | train loss 3.410666 | norm 0.1879 | lr 7.30e-04 | (3787.18 ms | 138438 tok/s) step 9358/76294 | train loss 3.447384 | norm 0.2000 | lr 7.30e-04 | (3762.47 ms | 139347 tok/s) step 9359/76294 | train loss 3.515914 | norm 0.1891 | lr 7.30e-04 | (3861.85 ms | 135761 tok/s) step 9360/76294 | train loss 3.401847 | norm 0.1983 | lr 7.30e-04 | (3976.10 ms | 131860 tok/s) step 9361/76294 | train loss 3.482703 | norm 0.2109 | lr 7.30e-04 | (3770.73 ms | 139041 tok/s) step 9362/76294 | train loss 3.393343 | norm 0.1806 | lr 7.30e-04 
| (3790.26 ms | 138325 tok/s) step 9363/76294 | train loss 3.413399 | norm 0.2000 | lr 7.30e-04 | (3770.89 ms | 139036 tok/s) step 9364/76294 | train loss 3.447280 | norm 0.2002 | lr 7.30e-04 | (3862.99 ms | 135721 tok/s) step 9365/76294 | train loss 3.383853 | norm 0.2319 | lr 7.29e-04 | (3953.39 ms | 132617 tok/s) step 9366/76294 | train loss 3.443713 | norm 0.1892 | lr 7.29e-04 | (3774.57 ms | 138900 tok/s) step 9367/76294 | train loss 3.477343 | norm 0.1980 | lr 7.29e-04 | (3835.25 ms | 136702 tok/s) step 9368/76294 | train loss 3.481805 | norm 0.2067 | lr 7.29e-04 | (3780.16 ms | 138695 tok/s) step 9369/76294 | train loss 3.433472 | norm 0.3170 | lr 7.29e-04 | (3816.59 ms | 137371 tok/s) step 9370/76294 | train loss 3.461775 | norm 0.3288 | lr 7.29e-04 | (3822.50 ms | 137158 tok/s) step 9371/76294 | train loss 3.457423 | norm 0.3218 | lr 7.29e-04 | (3796.10 ms | 138112 tok/s) step 9372/76294 | train loss 3.432780 | norm 0.4583 | lr 7.29e-04 | (3791.64 ms | 138275 tok/s) step 9373/76294 | train loss 3.422842 | norm 0.2850 | lr 7.29e-04 | (3845.75 ms | 136329 tok/s) step 9374/76294 | train loss 3.485818 | norm 0.2962 | lr 7.29e-04 | (3792.89 ms | 138229 tok/s) step 9375/76294 | train loss 3.419558 | norm 0.3635 | lr 7.29e-04 | (3833.70 ms | 136758 tok/s) step 9376/76294 | train loss 3.447258 | norm 0.2423 | lr 7.28e-04 | (3797.54 ms | 138060 tok/s) step 9377/76294 | train loss 3.454443 | norm 0.2180 | lr 7.28e-04 | (3831.04 ms | 136853 tok/s) step 9378/76294 | train loss 3.433210 | norm 0.2065 | lr 7.28e-04 | (3828.90 ms | 136929 tok/s) step 9379/76294 | train loss 3.472233 | norm 0.2170 | lr 7.28e-04 | (3849.09 ms | 136211 tok/s) step 9380/76294 | train loss 3.441335 | norm 0.2936 | lr 7.28e-04 | (3838.00 ms | 136605 tok/s) step 9381/76294 | train loss 3.454513 | norm 0.2726 | lr 7.28e-04 | (3816.50 ms | 137374 tok/s) step 9382/76294 | train loss 3.495828 | norm 0.3048 | lr 7.28e-04 | (3836.03 ms | 136675 tok/s) step 9383/76294 | train loss 3.476407 | norm 
0.2139 | lr 7.28e-04 | (3811.27 ms | 137563 tok/s) step 9384/76294 | train loss 3.414142 | norm 0.4967 | lr 7.28e-04 | (3871.40 ms | 135426 tok/s) step 9385/76294 | train loss 3.419270 | norm 0.3088 | lr 7.28e-04 | (3882.11 ms | 135052 tok/s) step 9386/76294 | train loss 3.462362 | norm 0.3549 | lr 7.28e-04 | (3808.04 ms | 137679 tok/s) step 9387/76294 | train loss 3.387090 | norm 0.2068 | lr 7.27e-04 | (3839.09 ms | 136566 tok/s) step 9388/76294 | train loss 3.414497 | norm 0.3155 | lr 7.27e-04 | (3834.30 ms | 136736 tok/s) step 9389/76294 | train loss 3.511312 | norm 0.2038 | lr 7.27e-04 | (3812.01 ms | 137536 tok/s) step 9390/76294 | train loss 3.459794 | norm 0.1926 | lr 7.27e-04 | (3839.66 ms | 136545 tok/s) step 9391/76294 | train loss 3.408889 | norm 0.1753 | lr 7.27e-04 | (3818.26 ms | 137311 tok/s) step 9392/76294 | train loss 3.427598 | norm 0.1683 | lr 7.27e-04 | (3807.68 ms | 137692 tok/s) step 9393/76294 | train loss 3.446681 | norm 0.2399 | lr 7.27e-04 | (3861.81 ms | 135762 tok/s) step 9394/76294 | train loss 3.470492 | norm 0.2070 | lr 7.27e-04 | (3810.67 ms | 137584 tok/s) step 9395/76294 | train loss 3.416050 | norm 0.1987 | lr 7.27e-04 | (3835.04 ms | 136710 tok/s) step 9396/76294 | train loss 3.451047 | norm 0.1875 | lr 7.27e-04 | (3811.21 ms | 137565 tok/s) step 9397/76294 | train loss 3.453550 | norm 0.2057 | lr 7.27e-04 | (3872.36 ms | 135392 tok/s) step 9398/76294 | train loss 3.434502 | norm 0.1862 | lr 7.26e-04 | (3813.18 ms | 137494 tok/s) step 9399/76294 | train loss 3.507834 | norm 0.2053 | lr 7.26e-04 | (3840.83 ms | 136504 tok/s) step 9400/76294 | train loss 3.576442 | norm 0.1932 | lr 7.26e-04 | (3861.66 ms | 135768 tok/s) step 9401/76294 | train loss 3.469984 | norm 0.2319 | lr 7.26e-04 | (3811.77 ms | 137544 tok/s) step 9402/76294 | train loss 3.488147 | norm 0.2423 | lr 7.26e-04 | (3850.64 ms | 136156 tok/s) step 9403/76294 | train loss 3.470682 | norm 0.2042 | lr 7.26e-04 | (3841.92 ms | 136465 tok/s) step 9404/76294 | train loss 
3.455125 | norm 0.2282 | lr 7.26e-04 | (3862.18 ms | 135749 tok/s) step 9405/76294 | train loss 3.408215 | norm 0.2013 | lr 7.26e-04 | (3805.17 ms | 137783 tok/s) step 9406/76294 | train loss 3.427768 | norm 0.1964 | lr 7.26e-04 | (3830.86 ms | 136859 tok/s) step 9407/76294 | train loss 3.415266 | norm 0.2270 | lr 7.26e-04 | (3805.68 ms | 137765 tok/s) step 9408/76294 | train loss 3.437782 | norm 0.2417 | lr 7.26e-04 | (3823.38 ms | 137127 tok/s) step 9409/76294 | train loss 3.518533 | norm 0.3770 | lr 7.26e-04 | (3809.29 ms | 137634 tok/s) step 9410/76294 | train loss 3.376032 | norm 0.2141 | lr 7.25e-04 | (3810.97 ms | 137574 tok/s) step 9411/76294 | train loss 3.372128 | norm 0.2492 | lr 7.25e-04 | (3833.25 ms | 136774 tok/s) step 9412/76294 | train loss 3.460879 | norm 0.2632 | lr 7.25e-04 | (3819.87 ms | 137253 tok/s) step 9413/76294 | train loss 3.418542 | norm 0.2246 | lr 7.25e-04 | (3830.82 ms | 136860 tok/s) step 9414/76294 | train loss 3.465875 | norm 0.3391 | lr 7.25e-04 | (3809.82 ms | 137615 tok/s) step 9415/76294 | train loss 3.503132 | norm 0.2898 | lr 7.25e-04 | (3807.88 ms | 137685 tok/s) step 9416/76294 | train loss 3.491612 | norm 0.2618 | lr 7.25e-04 | (3840.70 ms | 136508 tok/s) step 9417/76294 | train loss 3.440661 | norm 0.2456 | lr 7.25e-04 | (3809.83 ms | 137614 tok/s) step 9418/76294 | train loss 3.456434 | norm 0.2829 | lr 7.25e-04 | (3814.62 ms | 137442 tok/s) step 9419/76294 | train loss 3.444930 | norm 0.2073 | lr 7.25e-04 | (3831.04 ms | 136852 tok/s) step 9420/76294 | train loss 3.421901 | norm 0.2455 | lr 7.25e-04 | (3815.01 ms | 137428 tok/s) step 9421/76294 | train loss 3.418394 | norm 0.1908 | lr 7.24e-04 | (3810.96 ms | 137574 tok/s) step 9422/76294 | train loss 3.425134 | norm 0.2344 | lr 7.24e-04 | (3847.82 ms | 136256 tok/s) step 9423/76294 | train loss 3.369575 | norm 0.2205 | lr 7.24e-04 | (3810.43 ms | 137593 tok/s) step 9424/76294 | train loss 3.501323 | norm 0.2250 | lr 7.24e-04 | (3810.19 ms | 137602 tok/s) step 
9425/76294 | train loss 3.438663 | norm 0.2858 | lr 7.24e-04 | (3827.63 ms | 136975 tok/s) step 9426/76294 | train loss 3.394020 | norm 0.2193 | lr 7.24e-04 | (3817.19 ms | 137349 tok/s) step 9427/76294 | train loss 3.528284 | norm 0.2053 | lr 7.24e-04 | (3895.89 ms | 134575 tok/s) step 9428/76294 | train loss 3.438466 | norm 0.1910 | lr 7.24e-04 | (3808.75 ms | 137654 tok/s) step 9429/76294 | train loss 3.464811 | norm 0.1868 | lr 7.24e-04 | (3869.71 ms | 135485 tok/s) step 9430/76294 | train loss 3.461607 | norm 0.2094 | lr 7.24e-04 | (3809.68 ms | 137620 tok/s) step 9431/76294 | train loss 3.443843 | norm 0.2311 | lr 7.24e-04 | (3827.62 ms | 136975 tok/s) step 9432/76294 | train loss 3.561984 | norm 0.2093 | lr 7.23e-04 | (3805.82 ms | 137759 tok/s) step 9433/76294 | train loss 3.368427 | norm 0.2929 | lr 7.23e-04 | (3839.92 ms | 136536 tok/s) step 9434/76294 | train loss 3.488327 | norm 0.1928 | lr 7.23e-04 | (3804.67 ms | 137801 tok/s) step 9435/76294 | train loss 3.420337 | norm 0.3373 | lr 7.23e-04 | (3971.93 ms | 131998 tok/s) step 9436/76294 | train loss 3.517620 | norm 0.2096 | lr 7.23e-04 | (3804.85 ms | 137795 tok/s) step 9437/76294 | train loss 3.434506 | norm 0.3219 | lr 7.23e-04 | (3811.90 ms | 137540 tok/s) step 9438/76294 | train loss 3.443739 | norm 0.2030 | lr 7.23e-04 | (3844.67 ms | 136368 tok/s) step 9439/76294 | train loss 3.441356 | norm 0.2466 | lr 7.23e-04 | (3807.12 ms | 137713 tok/s) step 9440/76294 | train loss 3.477615 | norm 0.2302 | lr 7.23e-04 | (3807.13 ms | 137712 tok/s) step 9441/76294 | train loss 3.445335 | norm 0.2243 | lr 7.23e-04 | (3839.09 ms | 136566 tok/s) step 9442/76294 | train loss 3.449844 | norm 0.2912 | lr 7.23e-04 | (3803.67 ms | 137837 tok/s) step 9443/76294 | train loss 3.429640 | norm 0.2777 | lr 7.22e-04 | (3834.69 ms | 136723 tok/s) step 9444/76294 | train loss 3.454541 | norm 0.2096 | lr 7.22e-04 | (4045.64 ms | 129593 tok/s) step 9445/76294 | train loss 3.355810 | norm 0.1985 | lr 7.22e-04 | (3818.57 ms | 
137300 tok/s) step 9446/76294 | train loss 3.449481 | norm 0.2040 | lr 7.22e-04 | (3817.03 ms | 137355 tok/s) step 9447/76294 | train loss 3.478687 | norm 0.2063 | lr 7.22e-04 | (3833.54 ms | 136764 tok/s) step 9448/76294 | train loss 3.457253 | norm 0.1965 | lr 7.22e-04 | (3865.83 ms | 135621 tok/s) step 9449/76294 | train loss 3.434625 | norm 0.2236 | lr 7.22e-04 | (3832.58 ms | 136798 tok/s) step 9450/76294 | train loss 3.457554 | norm 0.2272 | lr 7.22e-04 | (3811.61 ms | 137550 tok/s) step 9451/76294 | train loss 3.440754 | norm 0.2064 | lr 7.22e-04 | (3812.28 ms | 137526 tok/s) step 9452/76294 | train loss 3.431367 | norm 0.2500 | lr 7.22e-04 | (3837.11 ms | 136636 tok/s) step 9453/76294 | train loss 3.474605 | norm 0.1997 | lr 7.22e-04 | (3805.17 ms | 137783 tok/s) step 9454/76294 | train loss 3.438382 | norm 0.2954 | lr 7.21e-04 | (3834.94 ms | 136713 tok/s) step 9455/76294 | train loss 3.398490 | norm 0.2142 | lr 7.21e-04 | (3802.34 ms | 137886 tok/s) step 9456/76294 | train loss 3.436292 | norm 0.2173 | lr 7.21e-04 | (3831.72 ms | 136828 tok/s) step 9457/76294 | train loss 3.507481 | norm 0.2465 | lr 7.21e-04 | (3801.57 ms | 137914 tok/s) step 9458/76294 | train loss 3.458970 | norm 0.2495 | lr 7.21e-04 | (3852.37 ms | 136095 tok/s) step 9459/76294 | train loss 3.405594 | norm 0.2330 | lr 7.21e-04 | (3805.50 ms | 137771 tok/s) step 9460/76294 | train loss 3.439528 | norm 0.3268 | lr 7.21e-04 | (3815.54 ms | 137409 tok/s) step 9461/76294 | train loss 3.410177 | norm 0.3354 | lr 7.21e-04 | (3837.32 ms | 136629 tok/s) step 9462/76294 | train loss 3.450612 | norm 0.2142 | lr 7.21e-04 | (3813.78 ms | 137472 tok/s) step 9463/76294 | train loss 3.441094 | norm 0.1931 | lr 7.21e-04 | (3803.80 ms | 137833 tok/s) step 9464/76294 | train loss 3.475310 | norm 0.2128 | lr 7.21e-04 | (3889.96 ms | 134780 tok/s) step 9465/76294 | train loss 3.455213 | norm 0.2014 | lr 7.21e-04 | (3806.34 ms | 137741 tok/s) step 9466/76294 | train loss 3.491369 | norm 0.2289 | lr 7.20e-04 
| (3809.89 ms | 137612 tok/s) step 9467/76294 | train loss 3.415730 | norm 0.2783 | lr 7.20e-04 | (3875.39 ms | 135286 tok/s) step 9468/76294 | train loss 3.544289 | norm 0.2356 | lr 7.20e-04 | (3808.41 ms | 137666 tok/s) step 9469/76294 | train loss 3.470945 | norm 0.2792 | lr 7.20e-04 | (3913.52 ms | 133968 tok/s) step 9470/76294 | train loss 3.417996 | norm 0.2366 | lr 7.20e-04 | (3803.72 ms | 137836 tok/s) step 9471/76294 | train loss 3.474246 | norm 0.5050 | lr 7.20e-04 | (3849.14 ms | 136209 tok/s) step 9472/76294 | train loss 3.478527 | norm 0.3019 | lr 7.20e-04 | (3812.03 ms | 137535 tok/s) step 9473/76294 | train loss 3.385948 | norm 0.3161 | lr 7.20e-04 | (3903.72 ms | 134305 tok/s) step 9474/76294 | train loss 3.461103 | norm 0.2002 | lr 7.20e-04 | (3796.47 ms | 138099 tok/s) step 9475/76294 | train loss 3.474656 | norm 0.3988 | lr 7.20e-04 | (3810.39 ms | 137594 tok/s) step 9476/76294 | train loss 3.459360 | norm 0.1735 | lr 7.20e-04 | (3825.89 ms | 137037 tok/s) step 9477/76294 | train loss 3.465801 | norm 0.2653 | lr 7.19e-04 | (3806.23 ms | 137745 tok/s) step 9478/76294 | train loss 3.428300 | norm 0.1940 | lr 7.19e-04 | (3799.05 ms | 138005 tok/s) step 9479/76294 | train loss 3.488049 | norm 0.3512 | lr 7.19e-04 | (3861.63 ms | 135769 tok/s) step 9480/76294 | train loss 3.595833 | norm 0.2255 | lr 7.19e-04 | (3800.13 ms | 137966 tok/s) step 9481/76294 | train loss 3.494350 | norm 0.2805 | lr 7.19e-04 | (3808.29 ms | 137670 tok/s) step 9482/76294 | train loss 3.571546 | norm 0.1798 | lr 7.19e-04 | (3824.48 ms | 137087 tok/s) step 9483/76294 | train loss 3.451287 | norm 0.2728 | lr 7.19e-04 | (3804.35 ms | 137813 tok/s) step 9484/76294 | train loss 3.435962 | norm 0.2052 | lr 7.19e-04 | (3840.10 ms | 136530 tok/s) step 9485/76294 | train loss 3.474514 | norm 0.2160 | lr 7.19e-04 | (3912.59 ms | 134000 tok/s) step 9486/76294 | train loss 3.465221 | norm 0.2473 | lr 7.19e-04 | (3802.29 ms | 137887 tok/s) step 9487/76294 | train loss 3.489485 | norm 
0.2752 | lr 7.19e-04 | (3829.81 ms | 136897 tok/s) step 9488/76294 | train loss 3.386423 | norm 0.2503 | lr 7.18e-04 | (3810.71 ms | 137583 tok/s) step 9489/76294 | train loss 3.405992 | norm 0.2144 | lr 7.18e-04 | (3811.00 ms | 137572 tok/s) step 9490/76294 | train loss 3.465875 | norm 0.2198 | lr 7.18e-04 | (3833.62 ms | 136761 tok/s) step 9491/76294 | train loss 3.407211 | norm 0.1898 | lr 7.18e-04 | (3811.10 ms | 137569 tok/s) step 9492/76294 | train loss 3.469928 | norm 0.2355 | lr 7.18e-04 | (3810.16 ms | 137603 tok/s) step 9493/76294 | train loss 3.441573 | norm 0.2057 | lr 7.18e-04 | (3848.02 ms | 136249 tok/s) step 9494/76294 | train loss 3.440030 | norm 0.2330 | lr 7.18e-04 | (3803.95 ms | 137827 tok/s) step 9495/76294 | train loss 3.492704 | norm 0.2693 | lr 7.18e-04 | (3838.89 ms | 136573 tok/s) step 9496/76294 | train loss 3.456722 | norm 0.2257 | lr 7.18e-04 | (3828.30 ms | 136951 tok/s) step 9497/76294 | train loss 3.478291 | norm 0.2255 | lr 7.18e-04 | (3803.70 ms | 137836 tok/s) step 9498/76294 | train loss 3.405974 | norm 0.2114 | lr 7.18e-04 | (3808.24 ms | 137672 tok/s) step 9499/76294 | train loss 3.489936 | norm 0.2175 | lr 7.17e-04 | (3831.68 ms | 136830 tok/s) step 9500/76294 | train loss 3.490889 | norm 0.2080 | lr 7.17e-04 | (3803.79 ms | 137833 tok/s) val loss: 3.442328 saving model checkpoint to ./results/gpt2-124M-gqa/step_9500.pth step 9501/76294 | train loss 3.496749 | norm 0.2001 | lr 7.17e-04 | (3820.09 ms | 137245 tok/s) step 9502/76294 | train loss 3.482763 | norm 0.2049 | lr 7.17e-04 | (3824.36 ms | 137092 tok/s) step 9503/76294 | train loss 3.512955 | norm 0.1939 | lr 7.17e-04 | (4314.98 ms | 121504 tok/s) step 9504/76294 | train loss 3.405507 | norm 0.1948 | lr 7.17e-04 | (3797.15 ms | 138074 tok/s) step 9505/76294 | train loss 3.517839 | norm 0.2037 | lr 7.17e-04 | (3807.41 ms | 137702 tok/s) step 9506/76294 | train loss 3.402617 | norm 0.2605 | lr 7.17e-04 | (3825.05 ms | 137067 tok/s) step 9507/76294 | train loss 3.397020 | 
norm 0.1898 | lr 7.17e-04 | (6121.18 ms | 85651 tok/s) step 9508/76294 | train loss 3.443187 | norm 0.2254 | lr 7.17e-04 | (3901.04 ms | 134397 tok/s) step 9509/76294 | train loss 3.471601 | norm 0.2187 | lr 7.17e-04 | (3804.30 ms | 137815 tok/s) step 9510/76294 | train loss 3.442603 | norm 0.2288 | lr 7.16e-04 | (3792.74 ms | 138235 tok/s) step 9511/76294 | train loss 3.430740 | norm 0.3278 | lr 7.16e-04 | (3823.18 ms | 137134 tok/s) step 9512/76294 | train loss 3.635974 | norm 0.3329 | lr 7.16e-04 | (3795.60 ms | 138130 tok/s) step 9513/76294 | train loss 3.429323 | norm 0.2570 | lr 7.16e-04 | (3797.18 ms | 138073 tok/s) step 9514/76294 | train loss 3.444255 | norm 0.2652 | lr 7.16e-04 | (3899.38 ms | 134454 tok/s) step 9515/76294 | train loss 3.513556 | norm 0.4619 | lr 7.16e-04 | (3795.30 ms | 138141 tok/s) step 9516/76294 | train loss 3.449960 | norm 0.2925 | lr 7.16e-04 | (3803.37 ms | 137848 tok/s) step 9517/76294 | train loss 3.453482 | norm 0.3380 | lr 7.16e-04 | (3825.24 ms | 137060 tok/s) step 9518/76294 | train loss 3.375051 | norm 0.2067 | lr 7.16e-04 | (3801.91 ms | 137901 tok/s) step 9519/76294 | train loss 3.439318 | norm 0.2722 | lr 7.16e-04 | (3794.27 ms | 138179 tok/s) step 9520/76294 | train loss 3.457357 | norm 0.2805 | lr 7.16e-04 | (3825.22 ms | 137061 tok/s) step 9521/76294 | train loss 3.468025 | norm 0.2058 | lr 7.15e-04 | (3798.01 ms | 138043 tok/s) step 9522/76294 | train loss 3.497066 | norm 0.2140 | lr 7.15e-04 | (3804.21 ms | 137818 tok/s) step 9523/76294 | train loss 3.347686 | norm 0.2419 | lr 7.15e-04 | (3823.33 ms | 137129 tok/s) step 9524/76294 | train loss 3.377943 | norm 0.2022 | lr 7.15e-04 | (3799.87 ms | 137975 tok/s) step 9525/76294 | train loss 3.449682 | norm 0.2618 | lr 7.15e-04 | (3825.48 ms | 137051 tok/s) step 9526/76294 | train loss 3.449402 | norm 0.1864 | lr 7.15e-04 | (3805.86 ms | 137758 tok/s) step 9527/76294 | train loss 3.440161 | norm 0.2444 | lr 7.15e-04 | (3825.42 ms | 137054 tok/s) step 9528/76294 | train 
loss 3.519407 | norm 0.2347 | lr 7.15e-04 | (3801.56 ms | 137914 tok/s) step 9529/76294 | train loss 3.546392 | norm 0.2320 | lr 7.15e-04 | (3796.77 ms | 138088 tok/s) step 9530/76294 | train loss 3.478366 | norm 0.2284 | lr 7.15e-04 | (3823.74 ms | 137114 tok/s) step 9531/76294 | train loss 3.459139 | norm 0.3559 | lr 7.15e-04 | (3804.28 ms | 137815 tok/s) step 9532/76294 | train loss 3.460031 | norm 0.2476 | lr 7.15e-04 | (3801.66 ms | 137910 tok/s) step 9533/76294 | train loss 3.506954 | norm 0.3389 | lr 7.14e-04 | (3799.69 ms | 137982 tok/s) step 9534/76294 | train loss 3.405244 | norm 0.1683 | lr 7.14e-04 | (3837.53 ms | 136621 tok/s) step 9535/76294 | train loss 3.486543 | norm 0.4580 | lr 7.14e-04 | (3864.73 ms | 135660 tok/s) step 9536/76294 | train loss 3.436020 | norm 0.2260 | lr 7.14e-04 | (3796.68 ms | 138091 tok/s) step 9537/76294 | train loss 3.421381 | norm 0.2946 | lr 7.14e-04 | (4093.19 ms | 128088 tok/s) step 9538/76294 | train loss 3.528010 | norm 0.2289 | lr 7.14e-04 | (3815.38 ms | 137414 tok/s) step 9539/76294 | train loss 3.432474 | norm 0.2181 | lr 7.14e-04 | (3825.89 ms | 137037 tok/s) step 9540/76294 | train loss 3.419504 | norm 0.2052 | lr 7.14e-04 | (3872.74 ms | 135379 tok/s) step 9541/76294 | train loss 3.481099 | norm 0.2064 | lr 7.14e-04 | (3826.82 ms | 137004 tok/s) step 9542/76294 | train loss 3.487287 | norm 0.1754 | lr 7.14e-04 | (3804.89 ms | 137793 tok/s) step 9543/76294 | train loss 3.496207 | norm 0.1985 | lr 7.14e-04 | (3801.99 ms | 137898 tok/s) step 9544/76294 | train loss 3.506784 | norm 0.2613 | lr 7.13e-04 | (3797.91 ms | 138047 tok/s) step 9545/76294 | train loss 3.512282 | norm 0.2135 | lr 7.13e-04 | (3828.25 ms | 136952 tok/s) step 9546/76294 | train loss 3.474149 | norm 0.3527 | lr 7.13e-04 | (3800.23 ms | 137962 tok/s) step 9547/76294 | train loss 3.472363 | norm 0.2023 | lr 7.13e-04 | (3804.54 ms | 137806 tok/s) step 9548/76294 | train loss 3.446250 | norm 0.3342 | lr 7.13e-04 | (3831.28 ms | 136844 tok/s) step 
9549/76294 | train loss 3.436468 | norm 0.2265 | lr 7.13e-04 | (3813.67 ms | 137476 tok/s) step 9550/76294 | train loss 3.433128 | norm 0.2524 | lr 7.13e-04 | (3832.10 ms | 136815 tok/s) step 9551/76294 | train loss 3.457746 | norm 0.2517 | lr 7.13e-04 | (3811.57 ms | 137552 tok/s) step 9552/76294 | train loss 3.491024 | norm 0.2935 | lr 7.13e-04 | (3835.67 ms | 136687 tok/s) step 9553/76294 | train loss 3.451498 | norm 0.2041 | lr 7.13e-04 | (3813.82 ms | 137471 tok/s) step 9554/76294 | train loss 3.479186 | norm 0.3283 | lr 7.13e-04 | (3805.32 ms | 137777 tok/s) step 9555/76294 | train loss 3.478558 | norm 0.2102 | lr 7.12e-04 | (3841.98 ms | 136463 tok/s) step 9556/76294 | train loss 3.410291 | norm 0.1934 | lr 7.12e-04 | (3928.94 ms | 133442 tok/s) step 9557/76294 | train loss 3.471854 | norm 0.2004 | lr 7.12e-04 | (3818.30 ms | 137309 tok/s) step 9558/76294 | train loss 3.425043 | norm 0.2150 | lr 7.12e-04 | (3809.43 ms | 137629 tok/s) step 9559/76294 | train loss 3.497161 | norm 0.2010 | lr 7.12e-04 | (3835.42 ms | 136696 tok/s) step 9560/76294 | train loss 3.451654 | norm 0.1887 | lr 7.12e-04 | (3807.88 ms | 137685 tok/s) step 9561/76294 | train loss 3.483877 | norm 0.1982 | lr 7.12e-04 | (3812.03 ms | 137535 tok/s) step 9562/76294 | train loss 3.492208 | norm 0.2567 | lr 7.12e-04 | (3830.51 ms | 136872 tok/s) step 9563/76294 | train loss 3.447229 | norm 0.1968 | lr 7.12e-04 | (3838.16 ms | 136599 tok/s) step 9564/76294 | train loss 3.439539 | norm 0.2490 | lr 7.12e-04 | (3811.69 ms | 137547 tok/s) step 9565/76294 | train loss 3.454219 | norm 0.2197 | lr 7.12e-04 | (3858.14 ms | 135892 tok/s) step 9566/76294 | train loss 3.442103 | norm 0.2443 | lr 7.11e-04 | (3812.94 ms | 137502 tok/s) step 9567/76294 | train loss 3.481876 | norm 0.1782 | lr 7.11e-04 | (3815.46 ms | 137411 tok/s) step 9568/76294 | train loss 3.473014 | norm 0.1849 | lr 7.11e-04 | (3836.42 ms | 136661 tok/s) step 9569/76294 | train loss 3.466320 | norm 0.1895 | lr 7.11e-04 | (3814.21 ms | 
137457 tok/s) step 9570/76294 | train loss 3.461513 | norm 0.2023 | lr 7.11e-04 | (3810.17 ms | 137602 tok/s) step 9571/76294 | train loss 3.536916 | norm 0.1748 | lr 7.11e-04 | (3838.27 ms | 136595 tok/s) step 9572/76294 | train loss 3.496768 | norm 0.1878 | lr 7.11e-04 | (3811.65 ms | 137549 tok/s) step 9573/76294 | train loss 3.486681 | norm 0.1696 | lr 7.11e-04 | (3824.60 ms | 137083 tok/s) step 9574/76294 | train loss 3.481662 | norm 0.2001 | lr 7.11e-04 | (3802.95 ms | 137863 tok/s) step 9575/76294 | train loss 3.375591 | norm 0.2954 | lr 7.11e-04 | (3834.87 ms | 136716 tok/s) step 9576/76294 | train loss 3.490531 | norm 0.2040 | lr 7.11e-04 | (3804.84 ms | 137795 tok/s) step 9577/76294 | train loss 3.443250 | norm 0.2746 | lr 7.10e-04 | (3936.13 ms | 133199 tok/s) step 9578/76294 | train loss 3.452655 | norm 0.2148 | lr 7.10e-04 | (3813.05 ms | 137498 tok/s) step 9579/76294 | train loss 3.450655 | norm 0.2845 | lr 7.10e-04 | (3829.79 ms | 136897 tok/s) step 9580/76294 | train loss 3.475126 | norm 0.2252 | lr 7.10e-04 | (3801.77 ms | 137906 tok/s) step 9581/76294 | train loss 3.412520 | norm 0.2337 | lr 7.10e-04 | (3827.18 ms | 136991 tok/s) step 9582/76294 | train loss 3.487082 | norm 0.2077 | lr 7.10e-04 | (3802.34 ms | 137886 tok/s) step 9583/76294 | train loss 3.477661 | norm 0.3441 | lr 7.10e-04 | (3805.52 ms | 137770 tok/s) step 9584/76294 | train loss 3.444459 | norm 0.2002 | lr 7.10e-04 | (3826.26 ms | 137024 tok/s) step 9585/76294 | train loss 3.503690 | norm 0.4562 | lr 7.10e-04 | (3800.34 ms | 137958 tok/s) step 9586/76294 | train loss 3.489266 | norm 0.2137 | lr 7.10e-04 | (3825.75 ms | 137042 tok/s) step 9587/76294 | train loss 3.471193 | norm 0.2858 | lr 7.10e-04 | (3807.75 ms | 137690 tok/s) step 9588/76294 | train loss 3.394967 | norm 0.1961 | lr 7.09e-04 | (3824.56 ms | 137085 tok/s) step 9589/76294 | train loss 3.498243 | norm 0.3142 | lr 7.09e-04 | (3802.79 ms | 137869 tok/s) step 9590/76294 | train loss 3.506392 | norm 0.2717 | lr 7.09e-04 
| (3804.72 ms | 137799 tok/s) step 9591/76294 | train loss 3.471259 | norm 0.3684 | lr 7.09e-04 | (3856.03 ms | 135966 tok/s) step 9592/76294 | train loss 3.454537 | norm 0.2713 | lr 7.09e-04 | (3803.08 ms | 137859 tok/s) step 9593/76294 | train loss 3.476027 | norm 0.3292 | lr 7.09e-04 | (3829.65 ms | 136902 tok/s) step 9594/76294 | train loss 3.457040 | norm 0.2119 | lr 7.09e-04 | (3828.55 ms | 136942 tok/s) step 9595/76294 | train loss 3.531676 | norm 0.2602 | lr 7.09e-04 | (3807.70 ms | 137692 tok/s) step 9596/76294 | train loss 3.491626 | norm 0.2896 | lr 7.09e-04 | (3804.28 ms | 137815 tok/s) step 9597/76294 | train loss 3.423360 | norm 0.2256 | lr 7.09e-04 | (3867.63 ms | 135558 tok/s) step 9598/76294 | train loss 3.461263 | norm 0.3484 | lr 7.09e-04 | (3879.46 ms | 135144 tok/s) step 9599/76294 | train loss 3.437534 | norm 0.2207 | lr 7.09e-04 | (3806.18 ms | 137747 tok/s) step 9600/76294 | train loss 3.485206 | norm 0.2486 | lr 7.08e-04 | (3830.41 ms | 136875 tok/s) step 9601/76294 | train loss 3.433716 | norm 0.2075 | lr 7.08e-04 | (3805.05 ms | 137788 tok/s) step 9602/76294 | train loss 3.510618 | norm 0.2676 | lr 7.08e-04 | (3811.18 ms | 137566 tok/s) step 9603/76294 | train loss 3.428184 | norm 0.2506 | lr 7.08e-04 | (3825.56 ms | 137049 tok/s) step 9604/76294 | train loss 3.411386 | norm 0.3114 | lr 7.08e-04 | (3803.85 ms | 137831 tok/s) step 9605/76294 | train loss 3.564129 | norm 0.2530 | lr 7.08e-04 | (3830.14 ms | 136885 tok/s) step 9606/76294 | train loss 3.435572 | norm 0.2414 | lr 7.08e-04 | (3827.86 ms | 136966 tok/s) step 9607/76294 | train loss 3.446178 | norm 0.2359 | lr 7.08e-04 | (3801.99 ms | 137898 tok/s) step 9608/76294 | train loss 3.544396 | norm 0.2040 | lr 7.08e-04 | (3834.10 ms | 136744 tok/s) step 9609/76294 | train loss 3.449682 | norm 0.2056 | lr 7.08e-04 | (3805.24 ms | 137781 tok/s) step 9610/76294 | train loss 3.441955 | norm 0.2268 | lr 7.08e-04 | (3827.73 ms | 136971 tok/s) step 9611/76294 | train loss 3.502624 | norm 
0.2337 | lr 7.07e-04 | (3805.96 ms | 137754 tok/s) step 9612/76294 | train loss 3.449567 | norm 0.2591 | lr 7.07e-04 | (3806.01 ms | 137753 tok/s) step 9613/76294 | train loss 3.425909 | norm 0.2165 | lr 7.07e-04 | (3829.14 ms | 136920 tok/s) step 9614/76294 | train loss 3.503813 | norm 0.2635 | lr 7.07e-04 | (3805.64 ms | 137766 tok/s) step 9615/76294 | train loss 3.445994 | norm 0.2613 | lr 7.07e-04 | (3800.97 ms | 137935 tok/s) step 9616/76294 | train loss 3.541888 | norm 0.2495 | lr 7.07e-04 | (3835.55 ms | 136692 tok/s) step 9617/76294 | train loss 3.453177 | norm 0.4876 | lr 7.07e-04 | (3809.21 ms | 137637 tok/s) step 9618/76294 | train loss 3.442045 | norm 0.2721 | lr 7.07e-04 | (3811.41 ms | 137557 tok/s) step 9619/76294 | train loss 3.472158 | norm 0.3427 | lr 7.07e-04 | (3895.14 ms | 134601 tok/s) step 9620/76294 | train loss 3.451460 | norm 0.2388 | lr 7.07e-04 | (3808.21 ms | 137673 tok/s) step 9621/76294 | train loss 3.447926 | norm 0.5178 | lr 7.07e-04 | (3854.57 ms | 136017 tok/s) step 9622/76294 | train loss 3.476772 | norm 0.2880 | lr 7.06e-04 | (3803.59 ms | 137840 tok/s) step 9623/76294 | train loss 3.688811 | norm 0.3343 | lr 7.06e-04 | (3810.02 ms | 137608 tok/s) step 9624/76294 | train loss 3.465241 | norm 0.2346 | lr 7.06e-04 | (3824.52 ms | 137086 tok/s) step 9625/76294 | train loss 3.439937 | norm 0.3787 | lr 7.06e-04 | (3831.43 ms | 136839 tok/s) step 9626/76294 | train loss 3.484029 | norm 0.2343 | lr 7.06e-04 | (3827.91 ms | 136965 tok/s) step 9627/76294 | train loss 3.444513 | norm 0.2541 | lr 7.06e-04 | (3808.75 ms | 137654 tok/s) step 9628/76294 | train loss 3.467445 | norm 0.2840 | lr 7.06e-04 | (3800.69 ms | 137946 tok/s) step 9629/76294 | train loss 3.474102 | norm 0.2084 | lr 7.06e-04 | (3836.18 ms | 136669 tok/s) step 9630/76294 | train loss 3.470340 | norm 0.4303 | lr 7.06e-04 | (3804.00 ms | 137825 tok/s) step 9631/76294 | train loss 3.432868 | norm 0.2127 | lr 7.06e-04 | (3829.58 ms | 136905 tok/s) step 9632/76294 | train loss 
3.444159 | norm 0.3227 | lr 7.06e-04 | (3813.98 ms | 137465 tok/s) step 9633/76294 | train loss 3.496075 | norm 0.2968 | lr 7.05e-04 | (3834.26 ms | 136738 tok/s) step 9634/76294 | train loss 3.493275 | norm 0.2222 | lr 7.05e-04 | (3806.41 ms | 137738 tok/s) step 9635/76294 | train loss 3.423584 | norm 0.3329 | lr 7.05e-04 | (3830.22 ms | 136882 tok/s) step 9636/76294 | train loss 3.466782 | norm 0.1981 | lr 7.05e-04 | (3803.30 ms | 137851 tok/s) step 9637/76294 | train loss 3.544345 | norm 0.2085 | lr 7.05e-04 | (3807.78 ms | 137689 tok/s) step 9638/76294 | train loss 3.443445 | norm 0.2040 | lr 7.05e-04 | (3833.24 ms | 136774 tok/s) step 9639/76294 | train loss 3.494022 | norm 0.1862 | lr 7.05e-04 | (3807.54 ms | 137697 tok/s) step 9640/76294 | train loss 3.452625 | norm 0.1900 | lr 7.05e-04 | (3884.43 ms | 134972 tok/s) step 9641/76294 | train loss 3.391562 | norm 0.1805 | lr 7.05e-04 | (3803.81 ms | 137832 tok/s) step 9642/76294 | train loss 3.433228 | norm 0.1774 | lr 7.05e-04 | (3860.55 ms | 135806 tok/s) step 9643/76294 | train loss 3.422421 | norm 0.1939 | lr 7.05e-04 | (3803.52 ms | 137843 tok/s) step 9644/76294 | train loss 3.466404 | norm 0.1945 | lr 7.04e-04 | (3811.56 ms | 137552 tok/s) step 9645/76294 | train loss 3.463170 | norm 0.1919 | lr 7.04e-04 | (3821.03 ms | 137211 tok/s) step 9646/76294 | train loss 3.486455 | norm 0.1824 | lr 7.04e-04 | (3807.85 ms | 137686 tok/s) step 9647/76294 | train loss 3.436192 | norm 0.2001 | lr 7.04e-04 | (3802.58 ms | 137877 tok/s) step 9648/76294 | train loss 3.468223 | norm 0.1841 | lr 7.04e-04 | (3831.84 ms | 136824 tok/s) step 9649/76294 | train loss 3.431969 | norm 0.1990 | lr 7.04e-04 | (5580.66 ms | 93947 tok/s) step 9650/76294 | train loss 3.430638 | norm 0.1876 | lr 7.04e-04 | (5729.02 ms | 91514 tok/s) step 9651/76294 | train loss 3.479980 | norm 0.1794 | lr 7.04e-04 | (3857.79 ms | 135904 tok/s) step 9652/76294 | train loss 3.428632 | norm 0.2188 | lr 7.04e-04 | (3797.09 ms | 138076 tok/s) step 
9653/76294 | train loss 3.435160 | norm 0.1896 | lr 7.04e-04 | (3801.21 ms | 137926 tok/s) step 9654/76294 | train loss 3.481367 | norm 0.2243 | lr 7.04e-04 | (3822.67 ms | 137152 tok/s) step 9655/76294 | train loss 3.459485 | norm 0.2186 | lr 7.03e-04 | (3860.39 ms | 135812 tok/s) step 9656/76294 | train loss 3.495134 | norm 0.2299 | lr 7.03e-04 | (3817.10 ms | 137353 tok/s) step 9657/76294 | train loss 3.461692 | norm 0.3070 | lr 7.03e-04 | (3829.17 ms | 136920 tok/s) step 9658/76294 | train loss 3.409241 | norm 0.2214 | lr 7.03e-04 | (3796.78 ms | 138088 tok/s) step 9659/76294 | train loss 3.478957 | norm 0.2609 | lr 7.03e-04 | (3846.56 ms | 136301 tok/s) step 9660/76294 | train loss 3.413815 | norm 0.3044 | lr 7.03e-04 | (3816.06 ms | 137390 tok/s) step 9661/76294 | train loss 3.450626 | norm 0.2217 | lr 7.03e-04 | (3844.59 ms | 136370 tok/s) step 9662/76294 | train loss 3.500412 | norm 0.2504 | lr 7.03e-04 | (3835.48 ms | 136694 tok/s) step 9663/76294 | train loss 3.476920 | norm 0.2414 | lr 7.03e-04 | (3804.03 ms | 137824 tok/s) step 9664/76294 | train loss 3.451837 | norm 0.2218 | lr 7.03e-04 | (3806.57 ms | 137733 tok/s) step 9665/76294 | train loss 3.457520 | norm 0.3241 | lr 7.03e-04 | (3825.98 ms | 137034 tok/s) step 9666/76294 | train loss 3.466310 | norm 0.2339 | lr 7.02e-04 | (3806.15 ms | 137748 tok/s) step 9667/76294 | train loss 3.441452 | norm 0.2066 | lr 7.02e-04 | (3801.07 ms | 137932 tok/s) step 9668/76294 | train loss 3.429887 | norm 0.2506 | lr 7.02e-04 | (3845.67 ms | 136332 tok/s) step 9669/76294 | train loss 3.464168 | norm 0.2101 | lr 7.02e-04 | (3804.25 ms | 137816 tok/s) step 9670/76294 | train loss 3.443853 | norm 0.2215 | lr 7.02e-04 | (3860.32 ms | 135814 tok/s) step 9671/76294 | train loss 3.454369 | norm 0.2223 | lr 7.02e-04 | (3804.31 ms | 137814 tok/s) step 9672/76294 | train loss 3.490135 | norm 0.2362 | lr 7.02e-04 | (3952.33 ms | 132653 tok/s) step 9673/76294 | train loss 3.440339 | norm 0.2308 | lr 7.02e-04 | (3799.79 ms | 
137978 tok/s) step 9674/76294 | train loss 3.419733 | norm 0.2012 | lr 7.02e-04 | (3811.83 ms | 137542 tok/s) step 9675/76294 | train loss 3.436180 | norm 0.2248 | lr 7.02e-04 | (3842.69 ms | 136438 tok/s) step 9676/76294 | train loss 3.470436 | norm 0.2369 | lr 7.02e-04 | (3816.56 ms | 137372 tok/s) step 9677/76294 | train loss 3.412816 | norm 0.1992 | lr 7.02e-04 | (3801.53 ms | 137915 tok/s) step 9678/76294 | train loss 3.486571 | norm 0.2437 | lr 7.01e-04 | (3874.82 ms | 135307 tok/s) step 9679/76294 | train loss 3.461667 | norm 0.2013 | lr 7.01e-04 | (3804.22 ms | 137818 tok/s) step 9680/76294 | train loss 3.442527 | norm 0.3574 | lr 7.01e-04 | (3826.99 ms | 136997 tok/s) step 9681/76294 | train loss 3.456499 | norm 0.2156 | lr 7.01e-04 | (3831.04 ms | 136853 tok/s) step 9682/76294 | train loss 3.376548 | norm 0.2860 | lr 7.01e-04 | (3800.20 ms | 137963 tok/s) step 9683/76294 | train loss 3.463914 | norm 0.1854 | lr 7.01e-04 | (3899.03 ms | 134466 tok/s) step 9684/76294 | train loss 3.529326 | norm 0.1993 | lr 7.01e-04 | (3802.22 ms | 137890 tok/s) step 9685/76294 | train loss 3.496614 | norm 0.1733 | lr 7.01e-04 | (3812.14 ms | 137531 tok/s) step 9686/76294 | train loss 3.472923 | norm 0.1660 | lr 7.01e-04 | (3815.95 ms | 137394 tok/s) step 9687/76294 | train loss 3.456607 | norm 0.1926 | lr 7.01e-04 | (3806.50 ms | 137735 tok/s) step 9688/76294 | train loss 3.472310 | norm 0.2121 | lr 7.01e-04 | (3829.61 ms | 136904 tok/s) step 9689/76294 | train loss 3.489889 | norm 0.2158 | lr 7.00e-04 | (3804.92 ms | 137792 tok/s) step 9690/76294 | train loss 3.465565 | norm 0.2439 | lr 7.00e-04 | (3806.13 ms | 137748 tok/s) step 9691/76294 | train loss 3.584534 | norm 0.2459 | lr 7.00e-04 | (3832.18 ms | 136812 tok/s) step 9692/76294 | train loss 3.471882 | norm 0.4461 | lr 7.00e-04 | (3807.01 ms | 137717 tok/s) step 9693/76294 | train loss 3.560825 | norm 0.3157 | lr 7.00e-04 | (3804.14 ms | 137820 tok/s) step 9694/76294 | train loss 3.455090 | norm 0.2948 | lr 7.00e-04 
| (3804.00 ms | 137825 tok/s) step 9695/76294 | train loss 3.458357 | norm 0.2158 | lr 7.00e-04 | (3827.85 ms | 136967 tok/s) step 9696/76294 | train loss 3.432813 | norm 0.1833 | lr 7.00e-04 | (3806.69 ms | 137728 tok/s) step 9697/76294 | train loss 3.412578 | norm 0.2352 | lr 7.00e-04 | (3805.90 ms | 137757 tok/s) step 9698/76294 | train loss 3.367605 | norm 0.2800 | lr 7.00e-04 | (3822.86 ms | 137145 tok/s) step 9699/76294 | train loss 3.410798 | norm 0.2197 | lr 7.00e-04 | (3838.69 ms | 136580 tok/s) step 9700/76294 | train loss 3.505691 | norm 0.1911 | lr 6.99e-04 | (3803.84 ms | 137831 tok/s) step 9701/76294 | train loss 3.502607 | norm 0.2554 | lr 6.99e-04 | (3859.20 ms | 135854 tok/s) step 9702/76294 | train loss 3.481856 | norm 0.1965 | lr 6.99e-04 | (3804.68 ms | 137801 tok/s) step 9703/76294 | train loss 3.573739 | norm 0.2503 | lr 6.99e-04 | (3804.10 ms | 137822 tok/s) step 9704/76294 | train loss 3.485108 | norm 0.2363 | lr 6.99e-04 | (4127.67 ms | 127018 tok/s) step 9705/76294 | train loss 3.457639 | norm 0.2994 | lr 6.99e-04 | (3873.67 ms | 135347 tok/s) step 9706/76294 | train loss 3.440022 | norm 0.2052 | lr 6.99e-04 | (3803.20 ms | 137854 tok/s) step 9707/76294 | train loss 3.538674 | norm 0.2615 | lr 6.99e-04 | (3804.66 ms | 137802 tok/s) step 9708/76294 | train loss 3.472574 | norm 0.2101 | lr 6.99e-04 | (3828.19 ms | 136955 tok/s) step 9709/76294 | train loss 3.420126 | norm 0.2604 | lr 6.99e-04 | (3805.57 ms | 137768 tok/s) step 9710/76294 | train loss 3.413717 | norm 0.2252 | lr 6.99e-04 | (3809.12 ms | 137640 tok/s) step 9711/76294 | train loss 3.429047 | norm 0.2032 | lr 6.98e-04 | (3828.94 ms | 136928 tok/s) step 9712/76294 | train loss 3.514787 | norm 0.2453 | lr 6.98e-04 | (3805.00 ms | 137789 tok/s) step 9713/76294 | train loss 3.444202 | norm 0.3240 | lr 6.98e-04 | (3802.90 ms | 137865 tok/s) step 9714/76294 | train loss 3.440391 | norm 0.2269 | lr 6.98e-04 | (3891.24 ms | 134735 tok/s) step 9715/76294 | train loss 3.505198 | norm 
0.2173 | lr 6.98e-04 | (3801.48 ms | 137917 tok/s) step 9716/76294 | train loss 3.437241 | norm 0.2130 | lr 6.98e-04 | (3806.06 ms | 137751 tok/s) step 9717/76294 | train loss 3.438785 | norm 0.1896 | lr 6.98e-04 | (3823.17 ms | 137134 tok/s) step 9718/76294 | train loss 3.451170 | norm 0.1993 | lr 6.98e-04 | (3809.53 ms | 137625 tok/s) step 9719/76294 | train loss 3.491596 | norm 0.3199 | lr 6.98e-04 | (3801.74 ms | 137907 tok/s) step 9720/76294 | train loss 3.488689 | norm 0.2872 | lr 6.98e-04 | (3859.79 ms | 135833 tok/s) step 9721/76294 | train loss 3.447263 | norm 0.2817 | lr 6.98e-04 | (3804.08 ms | 137823 tok/s) step 9722/76294 | train loss 3.514717 | norm 0.3105 | lr 6.97e-04 | (3808.54 ms | 137661 tok/s) step 9723/76294 | train loss 3.415569 | norm 0.2782 | lr 6.97e-04 | (3832.03 ms | 136817 tok/s) step 9724/76294 | train loss 3.542976 | norm 0.2096 | lr 6.97e-04 | (3808.22 ms | 137673 tok/s) step 9725/76294 | train loss 3.466713 | norm 0.2177 | lr 6.97e-04 | (3811.33 ms | 137560 tok/s) step 9726/76294 | train loss 3.449018 | norm 0.2145 | lr 6.97e-04 | (3905.46 ms | 134245 tok/s) step 9727/76294 | train loss 3.453023 | norm 0.2050 | lr 6.97e-04 | (3801.60 ms | 137913 tok/s) step 9728/76294 | train loss 3.466739 | norm 0.2564 | lr 6.97e-04 | (5150.19 ms | 101800 tok/s) step 9729/76294 | train loss 3.398231 | norm 0.2427 | lr 6.97e-04 | (3799.76 ms | 137979 tok/s) step 9730/76294 | train loss 3.401495 | norm 0.2047 | lr 6.97e-04 | (3804.98 ms | 137790 tok/s) step 9731/76294 | train loss 3.425797 | norm 0.2431 | lr 6.97e-04 | (3824.91 ms | 137072 tok/s) step 9732/76294 | train loss 3.411314 | norm 0.2054 | lr 6.97e-04 | (3802.87 ms | 137867 tok/s) step 9733/76294 | train loss 3.409097 | norm 0.2992 | lr 6.96e-04 | (3808.89 ms | 137649 tok/s) step 9734/76294 | train loss 3.427077 | norm 0.2600 | lr 6.96e-04 | (4380.75 ms | 119680 tok/s) step 9735/76294 | train loss 3.464746 | norm 0.2688 | lr 6.96e-04 | (3826.72 ms | 137007 tok/s) step 9736/76294 | train loss 
3.395165 | norm 0.2227 | lr 6.96e-04 | (3827.76 ms | 136970 tok/s) step 9737/76294 | train loss 3.425000 | norm 0.2741 | lr 6.96e-04 | (3824.95 ms | 137071 tok/s) step 9738/76294 | train loss 3.386377 | norm 0.2200 | lr 6.96e-04 | (3859.38 ms | 135848 tok/s) step 9739/76294 | train loss 3.419065 | norm 0.2200 | lr 6.96e-04 | (3824.80 ms | 137076 tok/s) step 9740/76294 | train loss 3.492188 | norm 0.1800 | lr 6.96e-04 | (3805.84 ms | 137759 tok/s) step 9741/76294 | train loss 3.413819 | norm 0.3674 | lr 6.96e-04 | (3800.15 ms | 137965 tok/s) step 9742/76294 | train loss 3.439352 | norm 0.2511 | lr 6.96e-04 | (3836.82 ms | 136646 tok/s) step 9743/76294 | train loss 3.408561 | norm 0.2734 | lr 6.96e-04 | (3799.97 ms | 137971 tok/s) step 9744/76294 | train loss 3.423025 | norm 0.2322 | lr 6.95e-04 | (3804.39 ms | 137811 tok/s) step 9745/76294 | train loss 3.366478 | norm 0.2600 | lr 6.95e-04 | (3825.19 ms | 137062 tok/s) step 9746/76294 | train loss 3.463722 | norm 0.2377 | lr 6.95e-04 | (3810.78 ms | 137580 tok/s) step 9747/76294 | train loss 3.407766 | norm 0.2488 | lr 6.95e-04 | (3872.84 ms | 135376 tok/s) step 9748/76294 | train loss 3.446282 | norm 0.2199 | lr 6.95e-04 | (3797.67 ms | 138055 tok/s) step 9749/76294 | train loss 3.389854 | norm 0.3393 | lr 6.95e-04 | (3808.38 ms | 137667 tok/s) step 9750/76294 | train loss 3.486343 | norm 0.2074 | lr 6.95e-04 | (3852.28 ms | 136098 tok/s) val loss: 3.433857 saving model checkpoint to ./results/gpt2-124M-gqa/step_9750.pth step 9751/76294 | train loss 3.412048 | norm 0.2861 | lr 6.95e-04 | (3817.11 ms | 137352 tok/s) step 9752/76294 | train loss 3.389054 | norm 0.2059 | lr 6.95e-04 | (3825.45 ms | 137053 tok/s) step 9753/76294 | train loss 3.371730 | norm 0.3025 | lr 6.95e-04 | (3809.35 ms | 137632 tok/s) step 9754/76294 | train loss 3.396635 | norm 0.1949 | lr 6.95e-04 | (3800.25 ms | 137961 tok/s) step 9755/76294 | train loss 3.437692 | norm 0.2251 | lr 6.94e-04 | (3828.29 ms | 136951 tok/s) step 9756/76294 | train 
loss 3.404500 | norm 0.1802 | lr 6.94e-04 | (3805.29 ms | 137779 tok/s) step 9757/76294 | train loss 3.416986 | norm 0.2374 | lr 6.94e-04 | (3805.44 ms | 137773 tok/s) step 9758/76294 | train loss 3.446984 | norm 0.2326 | lr 6.94e-04 | (3829.23 ms | 136917 tok/s) step 9759/76294 | train loss 3.383310 | norm 0.1922 | lr 6.94e-04 | (3803.60 ms | 137840 tok/s) step 9760/76294 | train loss 3.439593 | norm 0.2320 | lr 6.94e-04 | (3799.69 ms | 137982 tok/s) step 9761/76294 | train loss 3.335911 | norm 0.2187 | lr 6.94e-04 | (3836.90 ms | 136644 tok/s) step 9762/76294 | train loss 3.417717 | norm 0.2191 | lr 6.94e-04 | (3805.06 ms | 137787 tok/s) step 9763/76294 | train loss 3.402133 | norm 0.2229 | lr 6.94e-04 | (3804.94 ms | 137791 tok/s) step 9764/76294 | train loss 3.547641 | norm 0.2027 | lr 6.94e-04 | (3854.38 ms | 136024 tok/s) step 9765/76294 | train loss 3.414882 | norm 0.1861 | lr 6.94e-04 | (3803.95 ms | 137827 tok/s) step 9766/76294 | train loss 3.432370 | norm 0.1801 | lr 6.94e-04 | (3805.88 ms | 137757 tok/s) step 9767/76294 | train loss 3.496439 | norm 0.1776 | lr 6.93e-04 | (3824.07 ms | 137102 tok/s) step 9768/76294 | train loss 3.379849 | norm 0.2344 | lr 6.93e-04 | (3809.83 ms | 137615 tok/s) step 9769/76294 | train loss 3.417234 | norm 0.2267 | lr 6.93e-04 | (3958.58 ms | 132443 tok/s) step 9770/76294 | train loss 3.355695 | norm 0.2044 | lr 6.93e-04 | (3802.31 ms | 137887 tok/s) step 9771/76294 | train loss 3.455152 | norm 0.2738 | lr 6.93e-04 | (3833.12 ms | 136778 tok/s) step 9772/76294 | train loss 3.443173 | norm 0.2417 | lr 6.93e-04 | (3802.55 ms | 137878 tok/s) step 9773/76294 | train loss 3.417692 | norm 0.1964 | lr 6.93e-04 | (3811.97 ms | 137537 tok/s) step 9774/76294 | train loss 3.415528 | norm 0.1935 | lr 6.93e-04 | (3830.77 ms | 136862 tok/s) step 9775/76294 | train loss 3.424522 | norm 0.1779 | lr 6.93e-04 | (3806.89 ms | 137721 tok/s) step 9776/76294 | train loss 3.427950 | norm 0.2014 | lr 6.93e-04 | (3803.92 ms | 137828 tok/s) step 
9777/76294 | train loss 3.397455 | norm 0.1857 | lr 6.93e-04 | (3869.01 ms | 135510 tok/s) step 9778/76294 | train loss 3.449339 | norm 0.2342 | lr 6.92e-04 | (3803.95 ms | 137827 tok/s) step 9779/76294 | train loss 3.418415 | norm 0.1914 | lr 6.92e-04 | (3808.69 ms | 137656 tok/s) step 9780/76294 | train loss 3.452086 | norm 0.2256 | lr 6.92e-04 | (3826.15 ms | 137027 tok/s) step 9781/76294 | train loss 3.401545 | norm 0.2013 | lr 6.92e-04 | (3807.84 ms | 137687 tok/s) step 9782/76294 | train loss 3.427566 | norm 0.2498 | lr 6.92e-04 | (3803.76 ms | 137834 tok/s) step 9783/76294 | train loss 3.416424 | norm 0.2507 | lr 6.92e-04 | (3926.14 ms | 133538 tok/s) step 9784/76294 | train loss 3.451481 | norm 0.2040 | lr 6.92e-04 | (3804.64 ms | 137802 tok/s) step 9785/76294 | train loss 3.387208 | norm 0.1917 | lr 6.92e-04 | (3808.97 ms | 137646 tok/s) step 9786/76294 | train loss 3.407000 | norm 0.2753 | lr 6.92e-04 | (3829.84 ms | 136896 tok/s) step 9787/76294 | train loss 3.428375 | norm 0.2254 | lr 6.92e-04 | (3805.43 ms | 137774 tok/s) step 9788/76294 | train loss 3.370510 | norm 0.4220 | lr 6.92e-04 | (3805.45 ms | 137773 tok/s) step 9789/76294 | train loss 3.479917 | norm 0.5940 | lr 6.91e-04 | (3860.50 ms | 135808 tok/s) step 9790/76294 | train loss 3.477502 | norm 0.2551 | lr 6.91e-04 | (3810.33 ms | 137597 tok/s) step 9791/76294 | train loss 3.512634 | norm 0.3241 | lr 6.91e-04 | (3838.49 ms | 136587 tok/s) step 9792/76294 | train loss 3.472991 | norm 0.2271 | lr 6.91e-04 | (3839.53 ms | 136550 tok/s) step 9793/76294 | train loss 3.487192 | norm 0.3086 | lr 6.91e-04 | (3834.39 ms | 136733 tok/s) step 9794/76294 | train loss 3.390222 | norm 0.1938 | lr 6.91e-04 | (3802.07 ms | 137896 tok/s) step 9795/76294 | train loss 3.472717 | norm 0.2713 | lr 6.91e-04 | (3847.35 ms | 136272 tok/s) step 9796/76294 | train loss 3.364844 | norm 0.2493 | lr 6.91e-04 | (3804.79 ms | 137797 tok/s) step 9797/76294 | train loss 3.368594 | norm 0.2299 | lr 6.91e-04 | (3851.85 ms | 
136113 tok/s) step 9798/76294 | train loss 3.438566 | norm 0.2967 | lr 6.91e-04 | (3804.54 ms | 137806 tok/s) step 9799/76294 | train loss 3.432757 | norm 0.1978 | lr 6.91e-04 | (3809.43 ms | 137629 tok/s) step 9800/76294 | train loss 3.404882 | norm 0.2305 | lr 6.90e-04 | (3826.98 ms | 136998 tok/s) step 9801/76294 | train loss 3.350580 | norm 0.2472 | lr 6.90e-04 | (3809.33 ms | 137633 tok/s) step 9802/76294 | train loss 3.469339 | norm 0.1935 | lr 6.90e-04 | (3804.91 ms | 137792 tok/s) step 9803/76294 | train loss 3.356287 | norm 0.3256 | lr 6.90e-04 | (3831.05 ms | 136852 tok/s) step 9804/76294 | train loss 3.513079 | norm 0.2055 | lr 6.90e-04 | (3802.97 ms | 137863 tok/s) step 9805/76294 | train loss 3.398066 | norm 0.4472 | lr 6.90e-04 | (3807.50 ms | 137699 tok/s) step 9806/76294 | train loss 3.370921 | norm 0.1974 | lr 6.90e-04 | (3823.53 ms | 137122 tok/s) step 9807/76294 | train loss 3.433032 | norm 0.2587 | lr 6.90e-04 | (3810.37 ms | 137595 tok/s) step 9808/76294 | train loss 3.432240 | norm 0.2030 | lr 6.90e-04 | (3804.49 ms | 137808 tok/s) step 9809/76294 | train loss 3.486329 | norm 0.2117 | lr 6.90e-04 | (3837.48 ms | 136623 tok/s) step 9810/76294 | train loss 3.505163 | norm 0.2559 | lr 6.90e-04 | (3805.61 ms | 137767 tok/s) step 9811/76294 | train loss 3.474515 | norm 0.1653 | lr 6.89e-04 | (3812.73 ms | 137510 tok/s) step 9812/76294 | train loss 3.407923 | norm 0.2908 | lr 6.89e-04 | (3876.07 ms | 135263 tok/s) step 9813/76294 | train loss 3.390861 | norm 0.2098 | lr 6.89e-04 | (3806.80 ms | 137724 tok/s) step 9814/76294 | train loss 3.413977 | norm 0.2607 | lr 6.89e-04 | (3808.77 ms | 137653 tok/s) step 9815/76294 | train loss 3.337445 | norm 0.1979 | lr 6.89e-04 | (3830.37 ms | 136877 tok/s) step 9816/76294 | train loss 3.567995 | norm 0.2455 | lr 6.89e-04 | (3808.40 ms | 137666 tok/s) step 9817/76294 | train loss 3.491874 | norm 0.1899 | lr 6.89e-04 | (3823.59 ms | 137119 tok/s) step 9818/76294 | train loss 3.470233 | norm 0.2248 | lr 6.89e-04 
| (3811.56 ms | 137552 tok/s) step 9819/76294 | train loss 3.383368 | norm 0.1756 | lr 6.89e-04 | (3813.19 ms | 137493 tok/s) step 9820/76294 | train loss 3.502591 | norm 0.2022 | lr 6.89e-04 | (3806.97 ms | 137718 tok/s) step 9821/76294 | train loss 3.373711 | norm 0.2500 | lr 6.89e-04 | (3829.00 ms | 136925 tok/s) step 9822/76294 | train loss 3.372599 | norm 0.1648 | lr 6.88e-04 | (3806.16 ms | 137747 tok/s) step 9823/76294 | train loss 3.492184 | norm 0.3143 | lr 6.88e-04 | (3801.32 ms | 137923 tok/s) step 9824/76294 | train loss 3.387508 | norm 0.1912 | lr 6.88e-04 | (3832.27 ms | 136809 tok/s) step 9825/76294 | train loss 3.436445 | norm 0.2418 | lr 6.88e-04 | (3804.82 ms | 137796 tok/s) step 9826/76294 | train loss 3.363440 | norm 0.3029 | lr 6.88e-04 | (3824.90 ms | 137072 tok/s) step 9827/76294 | train loss 3.439122 | norm 0.2022 | lr 6.88e-04 | (3827.28 ms | 136987 tok/s) step 9828/76294 | train loss 3.358088 | norm 0.2474 | lr 6.88e-04 | (3814.17 ms | 137458 tok/s) step 9829/76294 | train loss 3.434481 | norm 0.2011 | lr 6.88e-04 | (3804.53 ms | 137806 tok/s) step 9830/76294 | train loss 3.334891 | norm 0.3141 | lr 6.88e-04 | (3832.49 ms | 136801 tok/s) step 9831/76294 | train loss 3.382147 | norm 0.1963 | lr 6.88e-04 | (3808.32 ms | 137669 tok/s) step 9832/76294 | train loss 3.430438 | norm 0.2487 | lr 6.88e-04 | (3809.19 ms | 137638 tok/s) step 9833/76294 | train loss 3.398476 | norm 0.2127 | lr 6.87e-04 | (3835.67 ms | 136687 tok/s) step 9834/76294 | train loss 3.424301 | norm 0.2221 | lr 6.87e-04 | (3905.85 ms | 134232 tok/s) step 9835/76294 | train loss 3.310049 | norm 0.2166 | lr 6.87e-04 | (3804.35 ms | 137813 tok/s) step 9836/76294 | train loss 3.446767 | norm 0.2566 | lr 6.87e-04 | (3802.39 ms | 137884 tok/s) step 9837/76294 | train loss 3.397580 | norm 0.2106 | lr 6.87e-04 | (3876.81 ms | 135237 tok/s) step 9838/76294 | train loss 3.405304 | norm 0.2312 | lr 6.87e-04 | (3806.95 ms | 137719 tok/s) step 9839/76294 | train loss 3.496804 | norm 
0.2378 | lr 6.87e-04 | (3809.89 ms | 137612 tok/s) step 9840/76294 | train loss 3.377703 | norm 0.1888 | lr 6.87e-04 | (3830.37 ms | 136877 tok/s) step 9841/76294 | train loss 3.451592 | norm 0.3301 | lr 6.87e-04 | (3804.66 ms | 137801 tok/s) step 9842/76294 | train loss 3.364112 | norm 0.2177 | lr 6.87e-04 | (3860.38 ms | 135813 tok/s) step 9843/76294 | train loss 3.376414 | norm 0.2887 | lr 6.87e-04 | (3804.10 ms | 137822 tok/s) step 9844/76294 | train loss 3.437998 | norm 0.2083 | lr 6.87e-04 | (3904.64 ms | 134273 tok/s) step 9845/76294 | train loss 3.397541 | norm 0.2027 | lr 6.86e-04 | (3802.54 ms | 137878 tok/s) step 9846/76294 | train loss 3.425574 | norm 0.2206 | lr 6.86e-04 | (3815.30 ms | 137417 tok/s) step 9847/76294 | train loss 3.313126 | norm 0.2725 | lr 6.86e-04 | (3805.17 ms | 137783 tok/s) step 9848/76294 | train loss 3.434300 | norm 0.2365 | lr 6.86e-04 | (3843.03 ms | 136426 tok/s) step 9849/76294 | train loss 3.431190 | norm 0.2327 | lr 6.86e-04 | (3801.64 ms | 137911 tok/s) step 9850/76294 | train loss 3.363012 | norm 0.2788 | lr 6.86e-04 | (3847.73 ms | 136259 tok/s) step 9851/76294 | train loss 3.436155 | norm 0.2034 | lr 6.86e-04 | (3799.88 ms | 137975 tok/s) step 9852/76294 | train loss 3.380479 | norm 0.1925 | lr 6.86e-04 | (3804.40 ms | 137811 tok/s) step 9853/76294 | train loss 3.557753 | norm 0.2435 | lr 6.86e-04 | (3828.28 ms | 136951 tok/s) step 9854/76294 | train loss 3.341218 | norm 0.2115 | lr 6.86e-04 | (3812.45 ms | 137520 tok/s) step 9855/76294 | train loss 3.409240 | norm 0.2421 | lr 6.86e-04 | (3884.49 ms | 134970 tok/s) step 9856/76294 | train loss 3.347234 | norm 0.1852 | lr 6.85e-04 | (3801.10 ms | 137930 tok/s) step 9857/76294 | train loss 3.402341 | norm 0.1980 | lr 6.85e-04 | (3806.81 ms | 137724 tok/s) step 9858/76294 | train loss 3.428965 | norm 0.2220 | lr 6.85e-04 | (3824.85 ms | 137074 tok/s) step 9859/76294 | train loss 3.353442 | norm 0.1983 | lr 6.85e-04 | (3842.45 ms | 136446 tok/s) step 9860/76294 | train loss 
3.441305 | norm 0.2367 | lr 6.85e-04 | (3801.16 ms | 137929 tok/s) step 9861/76294 | train loss 3.379922 | norm 0.1920 | lr 6.85e-04 | (3828.90 ms | 136929 tok/s) step 9862/76294 | train loss 3.490593 | norm 0.2286 | lr 6.85e-04 | (3803.43 ms | 137846 tok/s) step 9863/76294 | train loss 3.410944 | norm 0.2404 | lr 6.85e-04 | (3806.05 ms | 137751 tok/s) step 9864/76294 | train loss 3.395893 | norm 0.1999 | lr 6.85e-04 | (3958.75 ms | 132438 tok/s) step 9865/76294 | train loss 3.386175 | norm 0.2129 | lr 6.85e-04 | (3806.09 ms | 137750 tok/s) step 9866/76294 | train loss 3.381968 | norm 0.1914 | lr 6.85e-04 | (3801.36 ms | 137921 tok/s) step 9867/76294 | train loss 3.477391 | norm 0.3033 | lr 6.84e-04 | (3834.86 ms | 136716 tok/s) step 9868/76294 | train loss 3.418732 | norm 0.1953 | lr 6.84e-04 | (3806.69 ms | 137728 tok/s) step 9869/76294 | train loss 3.394736 | norm 0.2610 | lr 6.84e-04 | (3895.28 ms | 134596 tok/s) step 9870/76294 | train loss 3.393775 | norm 0.2388 | lr 6.84e-04 | (3798.35 ms | 138030 tok/s) step 9871/76294 | train loss 3.393446 | norm 0.2395 | lr 6.84e-04 | (4087.16 ms | 128277 tok/s) step 9872/76294 | train loss 3.434191 | norm 0.2083 | lr 6.84e-04 | (3787.22 ms | 138436 tok/s) step 9873/76294 | train loss 3.379634 | norm 0.2382 | lr 6.84e-04 | (52295.84 ms | 10025 tok/s) step 9874/76294 | train loss 3.485485 | norm 0.3284 | lr 6.84e-04 | (3730.63 ms | 140536 tok/s) step 9875/76294 | train loss 3.373124 | norm 0.2030 | lr 6.84e-04 | (3758.34 ms | 139500 tok/s) step 9876/76294 | train loss 3.518420 | norm 0.2916 | lr 6.84e-04 | (3755.48 ms | 139606 tok/s) step 9877/76294 | train loss 3.411620 | norm 0.2173 | lr 6.84e-04 | (3766.59 ms | 139194 tok/s) step 9878/76294 | train loss 3.431621 | norm 0.1894 | lr 6.83e-04 | (3754.94 ms | 139626 tok/s) step 9879/76294 | train loss 3.443126 | norm 0.3031 | lr 6.83e-04 | (3810.42 ms | 137593 tok/s) step 9880/76294 | train loss 3.413197 | norm 0.1922 | lr 6.83e-04 | (3761.31 ms | 139390 tok/s) step 
9881/76294 | train loss 3.457377 | norm 0.2419 | lr 6.83e-04 | (3771.05 ms | 139030 tok/s) step 9882/76294 | train loss 3.386958 | norm 0.1983 | lr 6.83e-04 | (3790.26 ms | 138325 tok/s) step 9883/76294 | train loss 3.415015 | norm 0.1857 | lr 6.83e-04 | (3810.29 ms | 137598 tok/s) step 9884/76294 | train loss 3.398915 | norm 0.1763 | lr 6.83e-04 | (3778.50 ms | 138756 tok/s) step 9885/76294 | train loss 3.371779 | norm 0.1859 | lr 6.83e-04 | (3831.94 ms | 136820 tok/s) step 9886/76294 | train loss 3.435639 | norm 0.2052 | lr 6.83e-04 | (3802.97 ms | 137863 tok/s) step 9887/76294 | train loss 3.344355 | norm 0.2099 | lr 6.83e-04 | (3786.16 ms | 138475 tok/s) step 9888/76294 | train loss 3.460431 | norm 0.1828 | lr 6.83e-04 | (3807.04 ms | 137715 tok/s) step 9889/76294 | train loss 3.339802 | norm 0.1938 | lr 6.82e-04 | (3857.00 ms | 135932 tok/s) step 9890/76294 | train loss 3.452693 | norm 0.2036 | lr 6.82e-04 | (3786.55 ms | 138461 tok/s) step 9891/76294 | train loss 3.390905 | norm 0.2899 | lr 6.82e-04 | (3796.14 ms | 138111 tok/s) step 9892/76294 | train loss 3.432178 | norm 0.1929 | lr 6.82e-04 | (3808.47 ms | 137664 tok/s) step 9893/76294 | train loss 3.400012 | norm 0.2611 | lr 6.82e-04 | (3799.50 ms | 137989 tok/s) step 9894/76294 | train loss 3.423921 | norm 0.2554 | lr 6.82e-04 | (5400.30 ms | 97085 tok/s) step 9895/76294 | train loss 3.434936 | norm 0.3657 | lr 6.82e-04 | (3817.69 ms | 137331 tok/s) step 9896/76294 | train loss 3.421662 | norm 0.3838 | lr 6.82e-04 | (3793.36 ms | 138212 tok/s) step 9897/76294 | train loss 3.410780 | norm 0.3702 | lr 6.82e-04 | (3803.48 ms | 137844 tok/s) step 9898/76294 | train loss 3.464571 | norm 0.2751 | lr 6.82e-04 | (3813.12 ms | 137496 tok/s) step 9899/76294 | train loss 3.418473 | norm 0.3127 | lr 6.82e-04 | (3794.71 ms | 138163 tok/s) step 9900/76294 | train loss 3.436199 | norm 0.2133 | lr 6.81e-04 | (3823.70 ms | 137115 tok/s) step 9901/76294 | train loss 3.404877 | norm 0.2382 | lr 6.81e-04 | (3825.70 ms | 
137044 tok/s) step 9902/76294 | train loss 3.462869 | norm 0.2296 | lr 6.81e-04 | (3795.08 ms | 138150 tok/s) step 9903/76294 | train loss 3.545899 | norm 0.1920 | lr 6.81e-04 | (3827.64 ms | 136974 tok/s) step 9904/76294 | train loss 3.411943 | norm 0.1927 | lr 6.81e-04 | (3806.32 ms | 137742 tok/s) step 9905/76294 | train loss 3.301611 | norm 0.2183 | lr 6.81e-04 | (3813.10 ms | 137497 tok/s) step 9906/76294 | train loss 3.440571 | norm 0.2218 | lr 6.81e-04 | (3832.59 ms | 136797 tok/s) step 9907/76294 | train loss 3.397238 | norm 0.2541 | lr 6.81e-04 | (3840.32 ms | 136522 tok/s) step 9908/76294 | train loss 3.415400 | norm 0.1847 | lr 6.81e-04 | (3810.49 ms | 137591 tok/s) step 9909/76294 | train loss 3.467075 | norm 0.2336 | lr 6.81e-04 | (3812.96 ms | 137501 tok/s) step 9910/76294 | train loss 3.368991 | norm 0.2143 | lr 6.81e-04 | (3900.57 ms | 134413 tok/s) step 9911/76294 | train loss 3.441491 | norm 0.2459 | lr 6.80e-04 | (3809.47 ms | 137628 tok/s) step 9912/76294 | train loss 3.427980 | norm 0.2851 | lr 6.80e-04 | (3859.99 ms | 135826 tok/s) step 9913/76294 | train loss 3.360086 | norm 0.1978 | lr 6.80e-04 | (3805.06 ms | 137787 tok/s) step 9914/76294 | train loss 3.384246 | norm 0.2193 | lr 6.80e-04 | (3899.57 ms | 134448 tok/s) step 9915/76294 | train loss 3.373336 | norm 0.2029 | lr 6.80e-04 | (3805.42 ms | 137774 tok/s) step 9916/76294 | train loss 3.403705 | norm 0.1814 | lr 6.80e-04 | (3826.62 ms | 137011 tok/s) step 9917/76294 | train loss 3.359930 | norm 0.2198 | lr 6.80e-04 | (3808.57 ms | 137660 tok/s) step 9918/76294 | train loss 3.320332 | norm 0.1889 | lr 6.80e-04 | (3814.76 ms | 137437 tok/s) step 9919/76294 | train loss 3.505563 | norm 0.1883 | lr 6.80e-04 | (4072.70 ms | 128732 tok/s) step 9920/76294 | train loss 3.406713 | norm 0.2316 | lr 6.80e-04 | (3810.42 ms | 137593 tok/s) step 9921/76294 | train loss 3.447058 | norm 0.2001 | lr 6.80e-04 | (3805.73 ms | 137763 tok/s) step 9922/76294 | train loss 3.503667 | norm 0.2305 | lr 6.79e-04 
| (3831.20 ms | 136847 tok/s) step 9923/76294 | train loss 3.449591 | norm 0.1873 | lr 6.79e-04 | (3807.65 ms | 137693 tok/s) step 9924/76294 | train loss 3.588817 | norm 0.2162 | lr 6.79e-04 | (3809.95 ms | 137610 tok/s) step 9925/76294 | train loss 3.475087 | norm 0.2443 | lr 6.79e-04 | (3831.45 ms | 136838 tok/s) step 9926/76294 | train loss 3.443521 | norm 0.1886 | lr 6.79e-04 | (3811.39 ms | 137558 tok/s) step 9927/76294 | train loss 3.454589 | norm 0.2359 | lr 6.79e-04 | (3833.82 ms | 136754 tok/s) step 9928/76294 | train loss 3.420880 | norm 0.2676 | lr 6.79e-04 | (3807.63 ms | 137694 tok/s) step 9929/76294 | train loss 3.524382 | norm 0.3537 | lr 6.79e-04 | (3879.37 ms | 135148 tok/s) step 9930/76294 | train loss 3.384253 | norm 0.1907 | lr 6.79e-04 | (3804.81 ms | 137796 tok/s) step 9931/76294 | train loss 3.482244 | norm 0.2314 | lr 6.79e-04 | (3845.04 ms | 136354 tok/s) step 9932/76294 | train loss 3.467332 | norm 0.2511 | lr 6.79e-04 | (3848.59 ms | 136229 tok/s) step 9933/76294 | train loss 3.493471 | norm 0.2579 | lr 6.78e-04 | (3825.64 ms | 137046 tok/s) step 9934/76294 | train loss 3.467266 | norm 0.2712 | lr 6.78e-04 | (3834.22 ms | 136739 tok/s) step 9935/76294 | train loss 3.446298 | norm 0.2241 | lr 6.78e-04 | (3806.93 ms | 137719 tok/s) step 9936/76294 | train loss 3.524853 | norm 0.2658 | lr 6.78e-04 | (3823.38 ms | 137127 tok/s) step 9937/76294 | train loss 3.372490 | norm 0.2588 | lr 6.78e-04 | (3837.65 ms | 136617 tok/s) step 9938/76294 | train loss 3.430122 | norm 0.2390 | lr 6.78e-04 | (3904.27 ms | 134286 tok/s) step 9939/76294 | train loss 3.394500 | norm 0.3270 | lr 6.78e-04 | (3802.67 ms | 137874 tok/s) step 9940/76294 | train loss 3.432438 | norm 0.2349 | lr 6.78e-04 | (3829.08 ms | 136923 tok/s) step 9941/76294 | train loss 3.441397 | norm 0.3133 | lr 6.78e-04 | (3829.04 ms | 136924 tok/s) step 9942/76294 | train loss 3.339407 | norm 0.2018 | lr 6.78e-04 | (3806.88 ms | 137721 tok/s) step 9943/76294 | train loss 3.411531 | norm 
0.3463 | lr 6.78e-04 | (3829.26 ms | 136916 tok/s) step 9944/76294 | train loss 3.404156 | norm 0.2075 | lr 6.78e-04 | (3808.93 ms | 137647 tok/s) step 9945/76294 | train loss 3.466193 | norm 0.2814 | lr 6.77e-04 | (3822.02 ms | 137176 tok/s) step 9946/76294 | train loss 3.542077 | norm 0.2308 | lr 6.77e-04 | (3836.62 ms | 136654 tok/s) step 9947/76294 | train loss 3.384543 | norm 0.2008 | lr 6.77e-04 | (3804.81 ms | 137796 tok/s) step 9948/76294 | train loss 3.396278 | norm 0.2064 | lr 6.77e-04 | (3836.17 ms | 136670 tok/s) step 9949/76294 | train loss 3.436999 | norm 0.2448 | lr 6.77e-04 | (3806.89 ms | 137721 tok/s) step 9950/76294 | train loss 3.408890 | norm 0.2538 | lr 6.77e-04 | (3813.55 ms | 137480 tok/s) step 9951/76294 | train loss 3.380552 | norm 0.1951 | lr 6.77e-04 | (3836.24 ms | 136667 tok/s) step 9952/76294 | train loss 3.377044 | norm 0.2421 | lr 6.77e-04 | (3841.80 ms | 136469 tok/s) step 9953/76294 | train loss 3.447452 | norm 0.1976 | lr 6.77e-04 | (3809.24 ms | 137636 tok/s) step 9954/76294 | train loss 3.346835 | norm 0.1989 | lr 6.77e-04 | (3836.20 ms | 136669 tok/s) step 9955/76294 | train loss 3.400368 | norm 0.1842 | lr 6.77e-04 | (4072.62 ms | 128735 tok/s) step 9956/76294 | train loss 3.410731 | norm 0.2050 | lr 6.76e-04 | (6276.30 ms | 83535 tok/s) step 9957/76294 | train loss 3.430962 | norm 0.2109 | lr 6.76e-04 | (4033.45 ms | 129985 tok/s) step 9958/76294 | train loss 3.410558 | norm 0.2905 | lr 6.76e-04 | (3789.75 ms | 138344 tok/s) step 9959/76294 | train loss 3.378309 | norm 0.2048 | lr 6.76e-04 | (3797.00 ms | 138080 tok/s) step 9960/76294 | train loss 3.452619 | norm 0.3377 | lr 6.76e-04 | (3825.31 ms | 137058 tok/s) step 9961/76294 | train loss 3.417147 | norm 0.2179 | lr 6.76e-04 | (3866.35 ms | 135603 tok/s) step 9962/76294 | train loss 3.418906 | norm 0.2062 | lr 6.76e-04 | (3803.32 ms | 137850 tok/s) step 9963/76294 | train loss 3.419258 | norm 0.4128 | lr 6.76e-04 | (3828.27 ms | 136952 tok/s) step 9964/76294 | train loss 
3.377951 | norm 0.2595 | lr 6.76e-04 | (3797.31 ms | 138068 tok/s) step 9965/76294 | train loss 3.434540 | norm 0.3426 | lr 6.76e-04 | (3834.95 ms | 136713 tok/s) step 9966/76294 | train loss 3.453239 | norm 0.2556 | lr 6.76e-04 | (3799.45 ms | 137991 tok/s) step 9967/76294 | train loss 3.441669 | norm 0.2554 | lr 6.75e-04 | (3823.97 ms | 137106 tok/s) step 9968/76294 | train loss 3.389549 | norm 0.3969 | lr 6.75e-04 | (3803.44 ms | 137846 tok/s) step 9969/76294 | train loss 3.490216 | norm 0.2308 | lr 6.75e-04 | (3809.69 ms | 137620 tok/s) step 9970/76294 | train loss 3.396981 | norm 0.2752 | lr 6.75e-04 | (3842.70 ms | 136438 tok/s) step 9971/76294 | train loss 3.450668 | norm 0.2010 | lr 6.75e-04 | (3815.02 ms | 137427 tok/s) step 9972/76294 | train loss 3.389085 | norm 0.2660 | lr 6.75e-04 | (3877.71 ms | 135206 tok/s) step 9973/76294 | train loss 3.382560 | norm 0.1985 | lr 6.75e-04 | (3801.50 ms | 137916 tok/s) step 9974/76294 | train loss 3.355363 | norm 0.2599 | lr 6.75e-04 | (3828.20 ms | 136954 tok/s) step 9975/76294 | train loss 3.424459 | norm 0.1874 | lr 6.75e-04 | (3800.63 ms | 137948 tok/s) step 9976/76294 | train loss 3.484881 | norm 0.4265 | lr 6.75e-04 | (3858.25 ms | 135887 tok/s) step 9977/76294 | train loss 3.396087 | norm 0.2080 | lr 6.75e-04 | (3801.63 ms | 137911 tok/s) step 9978/76294 | train loss 3.403095 | norm 0.2538 | lr 6.74e-04 | (3863.27 ms | 135711 tok/s) step 9979/76294 | train loss 3.417073 | norm 0.2547 | lr 6.74e-04 | (3803.29 ms | 137851 tok/s) step 9980/76294 | train loss 3.431562 | norm 0.2028 | lr 6.74e-04 | (3808.11 ms | 137677 tok/s) step 9981/76294 | train loss 3.485340 | norm 0.2553 | lr 6.74e-04 | (3837.29 ms | 136630 tok/s) step 9982/76294 | train loss 3.386890 | norm 0.1998 | lr 6.74e-04 | (3805.85 ms | 137758 tok/s) step 9983/76294 | train loss 3.418163 | norm 0.1891 | lr 6.74e-04 | (3824.86 ms | 137074 tok/s) step 9984/76294 | train loss 3.370550 | norm 0.2054 | lr 6.74e-04 | (3808.85 ms | 137650 tok/s) step 
9985/76294 | train loss 3.406547 | norm 0.2809 | lr 6.74e-04 | (3806.38 ms | 137739 tok/s) step 9986/76294 | train loss 3.434963 | norm 0.2042 | lr 6.74e-04 | (3857.22 ms | 135924 tok/s) step 9987/76294 | train loss 3.457060 | norm 0.2129 | lr 6.74e-04 | (4622.77 ms | 113414 tok/s) step 9988/76294 | train loss 3.426449 | norm 0.1801 | lr 6.74e-04 | (3859.93 ms | 135828 tok/s) step 9989/76294 | train loss 3.419007 | norm 0.1884 | lr 6.73e-04 | (3767.35 ms | 139166 tok/s) step 9990/76294 | train loss 3.421408 | norm 0.2236 | lr 6.73e-04 | (3827.41 ms | 136982 tok/s) step 9991/76294 | train loss 3.385401 | norm 0.1799 | lr 6.73e-04 | (3776.67 ms | 138823 tok/s) step 9992/76294 | train loss 3.454633 | norm 0.2396 | lr 6.73e-04 | (3850.18 ms | 136172 tok/s) step 9993/76294 | train loss 3.414672 | norm 0.2167 | lr 6.73e-04 | (3783.39 ms | 138576 tok/s) step 9994/76294 | train loss 3.504216 | norm 0.2261 | lr 6.73e-04 | (3805.94 ms | 137755 tok/s) step 9995/76294 | train loss 3.427806 | norm 0.2640 | lr 6.73e-04 | (3812.94 ms | 137502 tok/s) step 9996/76294 | train loss 3.476942 | norm 0.1865 | lr 6.73e-04 | (3839.84 ms | 136539 tok/s) step 9997/76294 | train loss 3.365079 | norm 0.1663 | lr 6.73e-04 | (3791.41 ms | 138283 tok/s) step 9998/76294 | train loss 3.410425 | norm 0.2399 | lr 6.73e-04 | (3798.25 ms | 138034 tok/s) step 9999/76294 | train loss 3.457355 | norm 0.1865 | lr 6.73e-04 | (3826.60 ms | 137011 tok/s) step 10000/76294 | train loss 3.397363 | norm 0.4889 | lr 6.72e-04 | (3793.23 ms | 138217 tok/s) val loss: 3.425695 saving model checkpoint to ./results/gpt2-124M-gqa/step_10000.pth step 10001/76294 | train loss 3.389457 | norm 0.4222 | lr 6.72e-04 | (3809.85 ms | 137614 tok/s) step 10002/76294 | train loss 3.320835 | norm 0.2912 | lr 6.72e-04 | (3824.13 ms | 137100 tok/s) step 10003/76294 | train loss 3.471335 | norm 0.2741 | lr 6.72e-04 | (3796.09 ms | 138113 tok/s) step 10004/76294 | train loss 3.442081 | norm 0.3265 | lr 6.72e-04 | (3796.55 ms | 138096 
tok/s) step 10005/76294 | train loss 3.423664 | norm 0.2529 | lr 6.72e-04 | (3836.57 ms | 136655 tok/s) step 10006/76294 | train loss 3.400892 | norm 0.2985 | lr 6.72e-04 | (3799.84 ms | 137976 tok/s) step 10007/76294 | train loss 3.481746 | norm 0.2371 | lr 6.72e-04 | (3826.15 ms | 137028 tok/s) step 10008/76294 | train loss 3.480067 | norm 0.2152 | lr 6.72e-04 | (4020.37 ms | 130408 tok/s) step 10009/76294 | train loss 3.373563 | norm 0.3651 | lr 6.72e-04 | (3803.59 ms | 137840 tok/s) step 10010/76294 | train loss 3.373996 | norm 0.2308 | lr 6.72e-04 | (3819.75 ms | 137257 tok/s) step 10011/76294 | train loss 3.404907 | norm 0.2580 | lr 6.71e-04 | (3806.08 ms | 137750 tok/s) step 10012/76294 | train loss 3.442618 | norm 0.2262 | lr 6.71e-04 | (3814.06 ms | 137462 tok/s) step 10013/76294 | train loss 3.424240 | norm 0.2019 | lr 6.71e-04 | (3803.89 ms | 137829 tok/s) step 10014/76294 | train loss 3.441304 | norm 0.2385 | lr 6.71e-04 | (4129.01 ms | 126977 tok/s) step 10015/76294 | train loss 3.459321 | norm 0.2127 | lr 6.71e-04 | (3807.84 ms | 137686 tok/s) step 10016/76294 | train loss 3.431610 | norm 0.2435 | lr 6.71e-04 | (3827.55 ms | 136977 tok/s) step 10017/76294 | train loss 3.495832 | norm 0.1989 | lr 6.71e-04 | (3829.38 ms | 136912 tok/s) step 10018/76294 | train loss 3.393616 | norm 0.2426 | lr 6.71e-04 | (3809.82 ms | 137615 tok/s) step 10019/76294 | train loss 3.458852 | norm 0.1866 | lr 6.71e-04 | (3806.13 ms | 137748 tok/s) step 10020/76294 | train loss 3.481978 | norm 0.1949 | lr 6.71e-04 | (3844.78 ms | 136364 tok/s) step 10021/76294 | train loss 3.481473 | norm 0.1982 | lr 6.71e-04 | (3808.77 ms | 137653 tok/s) step 10022/76294 | train loss 3.381304 | norm 0.2663 | lr 6.70e-04 | (3925.24 ms | 133568 tok/s) step 10023/76294 | train loss 3.405316 | norm 0.2221 | lr 6.70e-04 | (3808.14 ms | 137676 tok/s) step 10024/76294 | train loss 3.402927 | norm 0.3419 | lr 6.70e-04 | (3848.27 ms | 136240 tok/s) step 10025/76294 | train loss 3.434139 | norm 0.2393 
| lr 6.70e-04 | (3810.66 ms | 137585 tok/s) step 10026/76294 | train loss 3.474332 | norm 0.4460 | lr 6.70e-04 | (3861.85 ms | 135761 tok/s) step 10027/76294 | train loss 3.375478 | norm 0.3004 | lr 6.70e-04 | (3808.14 ms | 137676 tok/s) step 10028/76294 | train loss 3.434772 | norm 0.3373 | lr 6.70e-04 | (3812.87 ms | 137505 tok/s) step 10029/76294 | train loss 3.388705 | norm 0.2009 | lr 6.70e-04 | (3848.20 ms | 136242 tok/s) step 10030/76294 | train loss 3.450074 | norm 0.3006 | lr 6.70e-04 | (3842.24 ms | 136454 tok/s) step 10031/76294 | train loss 3.418870 | norm 0.2502 | lr 6.70e-04 | (3812.56 ms | 137516 tok/s) step 10032/76294 | train loss 3.480369 | norm 0.2004 | lr 6.70e-04 | (3842.61 ms | 136441 tok/s) step 10033/76294 | train loss 3.440966 | norm 0.2168 | lr 6.69e-04 | (3806.20 ms | 137746 tok/s) step 10034/76294 | train loss 3.404139 | norm 0.2250 | lr 6.69e-04 | (3813.37 ms | 137487 tok/s) step 10035/76294 | train loss 3.413110 | norm 0.2344 | lr 6.69e-04 | (3829.14 ms | 136920 tok/s) step 10036/76294 | train loss 3.466476 | norm 0.3010 | lr 6.69e-04 | (3833.35 ms | 136770 tok/s) step 10037/76294 | train loss 3.485934 | norm 0.3120 | lr 6.69e-04 | (3806.20 ms | 137746 tok/s) step 10038/76294 | train loss 3.435575 | norm 0.2612 | lr 6.69e-04 | (3972.43 ms | 131982 tok/s) step 10039/76294 | train loss 3.421782 | norm 0.2699 | lr 6.69e-04 | (3804.25 ms | 137816 tok/s) step 10040/76294 | train loss 3.406781 | norm 0.2183 | lr 6.69e-04 | (3810.78 ms | 137580 tok/s) step 10041/76294 | train loss 3.514073 | norm 0.2494 | lr 6.69e-04 | (3849.43 ms | 136199 tok/s) step 10042/76294 | train loss 3.431689 | norm 0.2467 | lr 6.69e-04 | (3810.22 ms | 137600 tok/s) step 10043/76294 | train loss 3.458441 | norm 0.2014 | lr 6.69e-04 | (5656.32 ms | 92691 tok/s) step 10044/76294 | train loss 3.401510 | norm 0.2257 | lr 6.69e-04 | (3803.49 ms | 137844 tok/s) step 10045/76294 | train loss 3.411783 | norm 0.2370 | lr 6.68e-04 | (3800.84 ms | 137940 tok/s) step 10046/76294 
| train loss 3.407442 | norm 0.2265 | lr 6.68e-04 | (3830.24 ms | 136881 tok/s) step 10047/76294 | train loss 3.427794 | norm 0.2064 | lr 6.68e-04 | (4301.37 ms | 121889 tok/s) step 10048/76294 | train loss 3.442403 | norm 0.2314 | lr 6.68e-04 | (3857.97 ms | 135898 tok/s) step 10049/76294 | train loss 3.372396 | norm 0.2292 | lr 6.68e-04 | (3802.78 ms | 137870 tok/s) step 10050/76294 | train loss 3.392260 | norm 0.2723 | lr 6.68e-04 | (3886.01 ms | 134917 tok/s) step 10051/76294 | train loss 3.359560 | norm 0.1998 | lr 6.68e-04 | (3797.95 ms | 138045 tok/s) step 10052/76294 | train loss 3.438175 | norm 0.2444 | lr 6.68e-04 | (3855.12 ms | 135998 tok/s) step 10053/76294 | train loss 3.365538 | norm 0.2910 | lr 6.68e-04 | (3798.83 ms | 138013 tok/s) step 10054/76294 | train loss 3.545584 | norm 0.2822 | lr 6.68e-04 | (3809.31 ms | 137633 tok/s) step 10055/76294 | train loss 3.452804 | norm 0.3136 | lr 6.68e-04 | (3804.02 ms | 137825 tok/s) step 10056/76294 | train loss 3.405524 | norm 0.2334 | lr 6.67e-04 | (3819.68 ms | 137260 tok/s) step 10057/76294 | train loss 3.378625 | norm 0.3022 | lr 6.67e-04 | (3824.03 ms | 137104 tok/s) step 10058/76294 | train loss 3.470043 | norm 0.1951 | lr 6.67e-04 | (3803.60 ms | 137840 tok/s) step 10059/76294 | train loss 3.386806 | norm 0.2587 | lr 6.67e-04 | (3821.83 ms | 137182 tok/s) step 10060/76294 | train loss 3.407364 | norm 0.1783 | lr 6.67e-04 | (3806.54 ms | 137734 tok/s) step 10061/76294 | train loss 3.376177 | norm 0.1999 | lr 6.67e-04 | (3813.21 ms | 137493 tok/s) step 10062/76294 | train loss 3.487635 | norm 0.2122 | lr 6.67e-04 | (3807.38 ms | 137703 tok/s) step 10063/76294 | train loss 3.432645 | norm 0.2233 | lr 6.67e-04 | (3800.88 ms | 137938 tok/s) step 10064/76294 | train loss 3.406233 | norm 0.2053 | lr 6.67e-04 | (3843.00 ms | 136427 tok/s) step 10065/76294 | train loss 3.450990 | norm 0.2190 | lr 6.67e-04 | (3800.94 ms | 137937 tok/s) step 10066/76294 | train loss 3.387730 | norm 0.2285 | lr 6.67e-04 | 
(3922.64 ms | 133657 tok/s) step 10067/76294 | train loss 3.423543 | norm 0.2171 | lr 6.66e-04 | (3797.48 ms | 138062 tok/s) step 10068/76294 | train loss 3.424867 | norm 0.2176 | lr 6.66e-04 | (3833.30 ms | 136772 tok/s) step 10069/76294 | train loss 3.399589 | norm 0.2099 | lr 6.66e-04 | (3889.33 ms | 134802 tok/s) step 10070/76294 | train loss 3.348533 | norm 0.2048 | lr 6.66e-04 | (3807.31 ms | 137705 tok/s) step 10071/76294 | train loss 3.442237 | norm 0.1749 | lr 6.66e-04 | (3824.56 ms | 137084 tok/s) step 10072/76294 | train loss 3.431683 | norm 0.1910 | lr 6.66e-04 | (3888.79 ms | 134820 tok/s) step 10073/76294 | train loss 3.455947 | norm 0.1896 | lr 6.66e-04 | (3803.32 ms | 137850 tok/s) step 10074/76294 | train loss 3.401371 | norm 0.2555 | lr 6.66e-04 | (3815.35 ms | 137415 tok/s) step 10075/76294 | train loss 3.429909 | norm 0.1866 | lr 6.66e-04 | (3806.10 ms | 137750 tok/s) step 10076/76294 | train loss 3.380235 | norm 0.3074 | lr 6.66e-04 | (3823.32 ms | 137129 tok/s) step 10077/76294 | train loss 3.437246 | norm 0.2781 | lr 6.66e-04 | (3800.19 ms | 137964 tok/s) step 10078/76294 | train loss 3.404961 | norm 0.4440 | lr 6.65e-04 | (3811.22 ms | 137564 tok/s) step 10079/76294 | train loss 3.463636 | norm 0.1659 | lr 6.65e-04 | (3822.09 ms | 137173 tok/s) step 10080/76294 | train loss 3.430658 | norm 0.2679 | lr 6.65e-04 | (3802.24 ms | 137889 tok/s) step 10081/76294 | train loss 3.439621 | norm 0.2620 | lr 6.65e-04 | (3807.79 ms | 137688 tok/s) step 10082/76294 | train loss 3.455575 | norm 0.2213 | lr 6.65e-04 | (3806.41 ms | 137738 tok/s) step 10083/76294 | train loss 3.477767 | norm 0.2323 | lr 6.65e-04 | (3820.75 ms | 137221 tok/s) step 10084/76294 | train loss 3.410462 | norm 0.2359 | lr 6.65e-04 | (3808.05 ms | 137679 tok/s) step 10085/76294 | train loss 3.400456 | norm 0.2082 | lr 6.65e-04 | (3823.32 ms | 137129 tok/s) step 10086/76294 | train loss 3.386716 | norm 0.3217 | lr 6.65e-04 | (3803.84 ms | 137831 tok/s) step 10087/76294 | train loss 
3.485768 | norm 0.2666 | lr 6.65e-04 | (3818.81 ms | 137291 tok/s) step 10088/76294 | train loss 3.416458 | norm 0.2597 | lr 6.65e-04 | (3819.56 ms | 137264 tok/s) step 10089/76294 | train loss 3.348638 | norm 0.2049 | lr 6.64e-04 | (3804.99 ms | 137789 tok/s) step 10090/76294 | train loss 3.465731 | norm 0.2293 | lr 6.64e-04 | (3804.58 ms | 137804 tok/s) step 10091/76294 | train loss 3.420008 | norm 0.1982 | lr 6.64e-04 | (3848.18 ms | 136243 tok/s) step 10092/76294 | train loss 3.427853 | norm 0.1808 | lr 6.64e-04 | (3799.51 ms | 137988 tok/s) step 10093/76294 | train loss 3.416865 | norm 0.1828 | lr 6.64e-04 | (3816.75 ms | 137365 tok/s) step 10094/76294 | train loss 3.470693 | norm 0.2206 | lr 6.64e-04 | (3821.74 ms | 137186 tok/s) step 10095/76294 | train loss 3.396488 | norm 0.2431 | lr 6.64e-04 | (3822.40 ms | 137162 tok/s) step 10096/76294 | train loss 3.424236 | norm 0.2033 | lr 6.64e-04 | (3813.75 ms | 137473 tok/s) step 10097/76294 | train loss 3.414013 | norm 0.1775 | lr 6.64e-04 | (3858.39 ms | 135883 tok/s) step 10098/76294 | train loss 3.467298 | norm 0.3793 | lr 6.64e-04 | (3800.05 ms | 137969 tok/s) step 10099/76294 | train loss 3.354371 | norm 0.1885 | lr 6.64e-04 | (3809.81 ms | 137615 tok/s) step 10100/76294 | train loss 3.418300 | norm 0.3707 | lr 6.63e-04 | (3824.80 ms | 137076 tok/s) step 10101/76294 | train loss 3.379647 | norm 0.2074 | lr 6.63e-04 | (3805.08 ms | 137786 tok/s) step 10102/76294 | train loss 3.437911 | norm 0.2174 | lr 6.63e-04 | (3802.81 ms | 137869 tok/s) step 10103/76294 | train loss 3.405058 | norm 0.2052 | lr 6.63e-04 | (3848.23 ms | 136241 tok/s) step 10104/76294 | train loss 3.438521 | norm 0.2170 | lr 6.63e-04 | (3833.32 ms | 136771 tok/s) step 10105/76294 | train loss 3.538361 | norm 0.2539 | lr 6.63e-04 | (3801.10 ms | 137931 tok/s) step 10106/76294 | train loss 3.402856 | norm 0.2339 | lr 6.63e-04 | (3832.63 ms | 136796 tok/s) step 10107/76294 | train loss 3.424206 | norm 0.1943 | lr 6.63e-04 | (3802.81 ms | 137869 
tok/s) step 10108/76294 | train loss 3.384757 | norm 0.2255 | lr 6.63e-04 | (3837.30 ms | 136629 tok/s) step 10109/76294 | train loss 3.459484 | norm 0.2428 | lr 6.63e-04 | (4197.63 ms | 124901 tok/s) step 10110/76294 | train loss 3.429753 | norm 0.1906 | lr 6.63e-04 | (3803.80 ms | 137833 tok/s) step 10111/76294 | train loss 3.457609 | norm 0.2147 | lr 6.62e-04 | (3824.24 ms | 137096 tok/s) step 10112/76294 | train loss 3.439003 | norm 0.2633 | lr 6.62e-04 | (3806.30 ms | 137742 tok/s) step 10113/76294 | train loss 3.459990 | norm 0.1843 | lr 6.62e-04 | (3803.82 ms | 137832 tok/s) step 10114/76294 | train loss 3.428108 | norm 0.2243 | lr 6.62e-04 | (3829.91 ms | 136893 tok/s) step 10115/76294 | train loss 3.428536 | norm 0.1960 | lr 6.62e-04 | (3936.27 ms | 133194 tok/s) step 10116/76294 | train loss 3.513299 | norm 0.4758 | lr 6.62e-04 | (3797.15 ms | 138074 tok/s) step 10117/76294 | train loss 3.436167 | norm 0.2869 | lr 6.62e-04 | (3807.17 ms | 137711 tok/s) step 10118/76294 | train loss 3.351871 | norm 0.3199 | lr 6.62e-04 | (3821.52 ms | 137194 tok/s) step 10119/76294 | train loss 3.396535 | norm 0.1949 | lr 6.62e-04 | (3830.94 ms | 136856 tok/s) step 10120/76294 | train loss 3.424125 | norm 0.2244 | lr 6.62e-04 | (3803.74 ms | 137835 tok/s) step 10121/76294 | train loss 3.422823 | norm 0.2309 | lr 6.62e-04 | (3830.08 ms | 136887 tok/s) step 10122/76294 | train loss 3.411227 | norm 0.2170 | lr 6.61e-04 | (3803.00 ms | 137862 tok/s) step 10123/76294 | train loss 3.449690 | norm 0.1713 | lr 6.61e-04 | (3805.82 ms | 137759 tok/s) step 10124/76294 | train loss 3.529230 | norm 0.2067 | lr 6.61e-04 | (3829.68 ms | 136901 tok/s) step 10125/76294 | train loss 3.468165 | norm 0.1814 | lr 6.61e-04 | (3819.08 ms | 137281 tok/s) step 10126/76294 | train loss 3.481791 | norm 0.1988 | lr 6.61e-04 | (3814.00 ms | 137464 tok/s) step 10127/76294 | train loss 3.451080 | norm 0.2248 | lr 6.61e-04 | (3829.95 ms | 136892 tok/s) step 10128/76294 | train loss 3.443019 | norm 0.1924 
| lr 6.61e-04 | (3805.39 ms | 137775 tok/s) step 10129/76294 | train loss 3.382680 | norm 0.2009 | lr 6.61e-04 | (3820.35 ms | 137236 tok/s) step 10130/76294 | train loss 3.437380 | norm 0.1864 | lr 6.61e-04 | (3803.44 ms | 137846 tok/s) step 10131/76294 | train loss 3.398168 | norm 0.2012 | lr 6.61e-04 | (3814.77 ms | 137436 tok/s) step 10132/76294 | train loss 3.455568 | norm 0.2713 | lr 6.61e-04 | (3826.29 ms | 137023 tok/s) step 10133/76294 | train loss 3.442182 | norm 0.1932 | lr 6.60e-04 | (3805.81 ms | 137760 tok/s) step 10134/76294 | train loss 3.462088 | norm 0.2122 | lr 6.60e-04 | (3822.12 ms | 137172 tok/s) step 10135/76294 | train loss 3.388441 | norm 0.1834 | lr 6.60e-04 | (3831.02 ms | 136853 tok/s) step 10136/76294 | train loss 3.416390 | norm 0.1964 | lr 6.60e-04 | (3798.52 ms | 138024 tok/s) step 10137/76294 | train loss 3.491371 | norm 0.2054 | lr 6.60e-04 | (3841.56 ms | 136478 tok/s) step 10138/76294 | train loss 3.456770 | norm 0.2910 | lr 6.60e-04 | (4023.80 ms | 130297 tok/s) step 10139/76294 | train loss 3.383890 | norm 0.2141 | lr 6.60e-04 | (3813.19 ms | 137493 tok/s) step 10140/76294 | train loss 3.430192 | norm 0.2787 | lr 6.60e-04 | (3810.08 ms | 137606 tok/s) step 10141/76294 | train loss 3.422072 | norm 0.1807 | lr 6.60e-04 | (3801.22 ms | 137926 tok/s) step 10142/76294 | train loss 3.510558 | norm 0.2332 | lr 6.60e-04 | (3810.28 ms | 137598 tok/s) step 10143/76294 | train loss 3.428410 | norm 0.1637 | lr 6.60e-04 | (3803.08 ms | 137859 tok/s) step 10144/76294 | train loss 3.407895 | norm 0.2683 | lr 6.60e-04 | (3808.33 ms | 137669 tok/s) step 10145/76294 | train loss 3.434283 | norm 0.2213 | lr 6.59e-04 | (3828.34 ms | 136949 tok/s) step 10146/76294 | train loss 3.455660 | norm 0.3237 | lr 6.59e-04 | (3838.96 ms | 136570 tok/s) step 10147/76294 | train loss 3.487554 | norm 0.2387 | lr 6.59e-04 | (3804.32 ms | 137814 tok/s) step 10148/76294 | train loss 3.422940 | norm 0.2527 | lr 6.59e-04 | (3831.45 ms | 136838 tok/s) step 
10149/76294 | train loss 3.434633 | norm 0.2929 | lr 6.59e-04 | (3802.82 ms | 137868 tok/s) step 10150/76294 | train loss 3.410397 | norm 0.1784 | lr 6.59e-04 | (3805.14 ms | 137784 tok/s) step 10151/76294 | train loss 3.403798 | norm 0.2719 | lr 6.59e-04 | (3826.39 ms | 137019 tok/s) step 10152/76294 | train loss 3.395541 | norm 0.2038 | lr 6.59e-04 | (3805.88 ms | 137757 tok/s) step 10153/76294 | train loss 3.464216 | norm 0.2168 | lr 6.59e-04 | (3804.55 ms | 137806 tok/s) step 10154/76294 | train loss 3.364974 | norm 0.2133 | lr 6.59e-04 | (3831.51 ms | 136836 tok/s) step 10155/76294 | train loss 3.458935 | norm 0.1910 | lr 6.59e-04 | (3802.87 ms | 137867 tok/s) step 10156/76294 | train loss 3.303256 | norm 0.2043 | lr 6.58e-04 | (3811.66 ms | 137548 tok/s) step 10157/76294 | train loss 3.382033 | norm 0.2337 | lr 6.58e-04 | (3835.02 ms | 136711 tok/s) step 10158/76294 | train loss 3.416914 | norm 0.2273 | lr 6.58e-04 | (3809.97 ms | 137609 tok/s) step 10159/76294 | train loss 3.375603 | norm 0.2068 | lr 6.58e-04 | (3873.19 ms | 135363 tok/s) step 10160/76294 | train loss 3.396249 | norm 0.2589 | lr 6.58e-04 | (3809.00 ms | 137644 tok/s) step 10161/76294 | train loss 3.493449 | norm 0.2319 | lr 6.58e-04 | (3926.17 ms | 133537 tok/s) step 10162/76294 | train loss 3.455942 | norm 0.2736 | lr 6.58e-04 | (3805.76 ms | 137762 tok/s) step 10163/76294 | train loss 3.438321 | norm 0.2121 | lr 6.58e-04 | (3816.47 ms | 137375 tok/s) step 10164/76294 | train loss 3.358547 | norm 0.2279 | lr 6.58e-04 | (3909.31 ms | 134113 tok/s) step 10165/76294 | train loss 3.451911 | norm 0.2180 | lr 6.58e-04 | (3762.16 ms | 139358 tok/s) step 10166/76294 | train loss 3.347001 | norm 0.2875 | lr 6.58e-04 | (3771.84 ms | 139001 tok/s) step 10167/76294 | train loss 3.383095 | norm 0.2122 | lr 6.57e-04 | (3794.76 ms | 138161 tok/s) step 10168/76294 | train loss 3.387344 | norm 0.2486 | lr 6.57e-04 | (3775.82 ms | 138854 tok/s) step 10169/76294 | train loss 3.361706 | norm 0.2068 | lr 
6.57e-04 | (3774.64 ms | 138897 tok/s) step 10170/76294 | train loss 3.399804 | norm 0.2094 | lr 6.57e-04 | (3815.10 ms | 137424 tok/s) step 10171/76294 | train loss 3.376902 | norm 0.2138 | lr 6.57e-04 | (3782.38 ms | 138613 tok/s) step 10172/76294 | train loss 3.328712 | norm 0.2268 | lr 6.57e-04 | (3812.05 ms | 137534 tok/s) step 10173/76294 | train loss 3.443038 | norm 0.1898 | lr 6.57e-04 | (3879.72 ms | 135135 tok/s) step 10174/76294 | train loss 3.321567 | norm 0.2251 | lr 6.57e-04 | (3788.31 ms | 138396 tok/s) step 10175/76294 | train loss 3.586194 | norm 0.2123 | lr 6.57e-04 | (3808.24 ms | 137672 tok/s) step 10176/76294 | train loss 3.442926 | norm 0.2301 | lr 6.57e-04 | (3801.72 ms | 137908 tok/s) step 10177/76294 | train loss 3.408443 | norm 0.2163 | lr 6.57e-04 | (3792.48 ms | 138244 tok/s) step 10178/76294 | train loss 3.425099 | norm 0.2423 | lr 6.56e-04 | (3853.21 ms | 136065 tok/s) step 10179/76294 | train loss 3.360082 | norm 0.2567 | lr 6.56e-04 | (3798.50 ms | 138025 tok/s) step 10180/76294 | train loss 3.416593 | norm 0.2066 | lr 6.56e-04 | (3819.37 ms | 137271 tok/s) step 10181/76294 | train loss 3.394077 | norm 0.2286 | lr 6.56e-04 | (3818.91 ms | 137287 tok/s) step 10182/76294 | train loss 3.421934 | norm 0.2952 | lr 6.56e-04 | (5521.16 ms | 94960 tok/s) step 10183/76294 | train loss 3.501407 | norm 0.2102 | lr 6.56e-04 | (3902.44 ms | 134349 tok/s) step 10184/76294 | train loss 3.391255 | norm 0.2546 | lr 6.56e-04 | (3803.90 ms | 137829 tok/s) step 10185/76294 | train loss 3.497801 | norm 0.2092 | lr 6.56e-04 | (3795.24 ms | 138144 tok/s) step 10186/76294 | train loss 3.349623 | norm 0.2288 | lr 6.56e-04 | (3831.04 ms | 136853 tok/s) step 10187/76294 | train loss 3.440668 | norm 0.2274 | lr 6.56e-04 | (3800.65 ms | 137947 tok/s) step 10188/76294 | train loss 3.425365 | norm 0.2737 | lr 6.56e-04 | (3805.68 ms | 137764 tok/s) step 10189/76294 | train loss 3.420351 | norm 0.2716 | lr 6.55e-04 | (3825.68 ms | 137045 tok/s) step 10190/76294 | 
train loss 3.426005 | norm 0.2601 | lr 6.55e-04 | (3808.41 ms | 137666 tok/s) step 10191/76294 | train loss 3.374455 | norm 0.3031 | lr 6.55e-04 | (3825.70 ms | 137044 tok/s) step 10192/76294 | train loss 3.403954 | norm 0.3724 | lr 6.55e-04 | (3809.12 ms | 137640 tok/s) step 10193/76294 | train loss 3.314354 | norm 0.1858 | lr 6.55e-04 | (3835.01 ms | 136711 tok/s) step 10194/76294 | train loss 3.478967 | norm 0.2501 | lr 6.55e-04 | (3810.31 ms | 137597 tok/s) step 10195/76294 | train loss 3.342887 | norm 0.2448 | lr 6.55e-04 | (3820.31 ms | 137237 tok/s) step 10196/76294 | train loss 3.388553 | norm 0.2702 | lr 6.55e-04 | (3815.64 ms | 137405 tok/s) step 10197/76294 | train loss 3.470301 | norm 0.2893 | lr 6.55e-04 | (3819.18 ms | 137278 tok/s) step 10198/76294 | train loss 3.375200 | norm 0.2608 | lr 6.55e-04 | (3839.94 ms | 136536 tok/s) step 10199/76294 | train loss 3.540517 | norm 0.2139 | lr 6.55e-04 | (4079.44 ms | 128520 tok/s) step 10200/76294 | train loss 3.372504 | norm 0.2136 | lr 6.54e-04 | (3812.84 ms | 137506 tok/s) step 10201/76294 | train loss 3.435798 | norm 0.2235 | lr 6.54e-04 | (3842.09 ms | 136459 tok/s) step 10202/76294 | train loss 3.381890 | norm 0.3041 | lr 6.54e-04 | (3815.98 ms | 137393 tok/s) step 10203/76294 | train loss 3.451166 | norm 0.1905 | lr 6.54e-04 | (3873.41 ms | 135356 tok/s) step 10204/76294 | train loss 3.379447 | norm 0.3443 | lr 6.54e-04 | (3813.71 ms | 137475 tok/s) step 10205/76294 | train loss 3.367589 | norm 0.2146 | lr 6.54e-04 | (3822.40 ms | 137162 tok/s) step 10206/76294 | train loss 3.430990 | norm 0.3793 | lr 6.54e-04 | (3838.82 ms | 136575 tok/s) step 10207/76294 | train loss 3.359294 | norm 0.1970 | lr 6.54e-04 | (3842.40 ms | 136448 tok/s) step 10208/76294 | train loss 3.428810 | norm 0.2842 | lr 6.54e-04 | (3821.41 ms | 137198 tok/s) step 10209/76294 | train loss 3.395356 | norm 0.1983 | lr 6.54e-04 | (3825.33 ms | 137057 tok/s) step 10210/76294 | train loss 3.445574 | norm 0.2575 | lr 6.54e-04 | (3815.11 
ms | 137424 tok/s) step 10211/76294 | train loss 3.398252 | norm 0.1768 | lr 6.53e-04 | (3846.20 ms | 136313 tok/s) step 10212/76294 | train loss 3.422030 | norm 0.2592 | lr 6.53e-04 | (3811.80 ms | 137543 tok/s) step 10213/76294 | train loss 3.398364 | norm 0.2051 | lr 6.53e-04 | (3814.68 ms | 137440 tok/s) step 10214/76294 | train loss 3.355487 | norm 0.2738 | lr 6.53e-04 | (3837.27 ms | 136630 tok/s) step 10215/76294 | train loss 3.423071 | norm 0.2150 | lr 6.53e-04 | (3817.52 ms | 137337 tok/s) step 10216/76294 | train loss 3.363861 | norm 0.2590 | lr 6.53e-04 | (3878.89 ms | 135164 tok/s) step 10217/76294 | train loss 3.424929 | norm 0.2349 | lr 6.53e-04 | (3814.21 ms | 137456 tok/s) step 10218/76294 | train loss 3.389071 | norm 0.2713 | lr 6.53e-04 | (3819.23 ms | 137276 tok/s) step 10219/76294 | train loss 3.426361 | norm 0.2338 | lr 6.53e-04 | (3839.51 ms | 136551 tok/s) step 10220/76294 | train loss 3.428243 | norm 0.3243 | lr 6.53e-04 | (3813.28 ms | 137490 tok/s) step 10221/76294 | train loss 3.391620 | norm 0.3202 | lr 6.53e-04 | (3810.99 ms | 137573 tok/s) step 10222/76294 | train loss 3.428594 | norm 0.3614 | lr 6.52e-04 | (3872.96 ms | 135371 tok/s) step 10223/76294 | train loss 3.434765 | norm 0.2760 | lr 6.52e-04 | (3806.63 ms | 137730 tok/s) step 10224/76294 | train loss 3.495286 | norm 0.2795 | lr 6.52e-04 | (3840.06 ms | 136531 tok/s) step 10225/76294 | train loss 3.370559 | norm 0.2276 | lr 6.52e-04 | (3842.60 ms | 136441 tok/s) step 10226/76294 | train loss 3.379493 | norm 0.2232 | lr 6.52e-04 | (3810.66 ms | 137585 tok/s) step 10227/76294 | train loss 3.385309 | norm 0.1800 | lr 6.52e-04 | (3803.08 ms | 137859 tok/s) step 10228/76294 | train loss 3.342941 | norm 0.2127 | lr 6.52e-04 | (3844.91 ms | 136359 tok/s) step 10229/76294 | train loss 3.444936 | norm 0.2198 | lr 6.52e-04 | (3805.52 ms | 137770 tok/s) step 10230/76294 | train loss 3.372515 | norm 0.1963 | lr 6.52e-04 | (3811.00 ms | 137572 tok/s) step 10231/76294 | train loss 3.397109 | 
norm 0.2237 | lr 6.52e-04 | (3830.57 ms | 136870 tok/s) step 10232/76294 | train loss 3.354807 | norm 0.1981 | lr 6.52e-04 | (3839.93 ms | 136536 tok/s) step 10233/76294 | train loss 3.510591 | norm 0.3119 | lr 6.51e-04 | (3804.75 ms | 137798 tok/s) step 10234/76294 | train loss 3.437579 | norm 0.1896 | lr 6.51e-04 | (3839.05 ms | 136567 tok/s) step 10235/76294 | train loss 3.373122 | norm 0.2846 | lr 6.51e-04 | (3807.95 ms | 137682 tok/s) step 10236/76294 | train loss 3.395605 | norm 0.2303 | lr 6.51e-04 | (3802.15 ms | 137893 tok/s) step 10237/76294 | train loss 3.378304 | norm 0.2011 | lr 6.51e-04 | (3973.81 ms | 131936 tok/s) step 10238/76294 | train loss 3.443798 | norm 0.2287 | lr 6.51e-04 | (3878.51 ms | 135178 tok/s) step 10239/76294 | train loss 3.338555 | norm 0.2036 | lr 6.51e-04 | (3802.77 ms | 137870 tok/s) step 10240/76294 | train loss 3.414715 | norm 0.2053 | lr 6.51e-04 | (3848.86 ms | 136219 tok/s) step 10241/76294 | train loss 3.374861 | norm 0.3221 | lr 6.51e-04 | (3805.19 ms | 137783 tok/s) step 10242/76294 | train loss 3.441479 | norm 0.1832 | lr 6.51e-04 | (3809.56 ms | 137624 tok/s) step 10243/76294 | train loss 3.339052 | norm 0.2618 | lr 6.51e-04 | (3832.89 ms | 136786 tok/s) step 10244/76294 | train loss 3.396396 | norm 0.2358 | lr 6.51e-04 | (3807.76 ms | 137689 tok/s) step 10245/76294 | train loss 3.408493 | norm 0.2077 | lr 6.50e-04 | (3801.59 ms | 137913 tok/s) step 10246/76294 | train loss 3.391154 | norm 0.3128 | lr 6.50e-04 | (3831.08 ms | 136851 tok/s) step 10247/76294 | train loss 3.374490 | norm 0.1781 | lr 6.50e-04 | (3803.75 ms | 137835 tok/s) step 10248/76294 | train loss 3.399264 | norm 0.1949 | lr 6.50e-04 | (3813.80 ms | 137471 tok/s) step 10249/76294 | train loss 3.412487 | norm 0.1778 | lr 6.50e-04 | (3827.04 ms | 136996 tok/s) step 10250/76294 | train loss 3.338293 | norm 0.1915 | lr 6.50e-04 | (3806.49 ms | 137735 tok/s) val loss: 3.420333 saving model checkpoint to ./results/gpt2-124M-gqa/step_10250.pth step 
10251/76294 | train loss 3.376852 | norm 0.2066 | lr 6.50e-04 | (3817.68 ms | 137332 tok/s) step 10252/76294 | train loss 3.345555 | norm 0.2060 | lr 6.50e-04 | (3830.43 ms | 136875 tok/s) step 10253/76294 | train loss 3.425302 | norm 0.1886 | lr 6.50e-04 | (3800.12 ms | 137966 tok/s) step 10254/76294 | train loss 3.347653 | norm 0.3399 | lr 6.50e-04 | (3801.99 ms | 137898 tok/s) step 10255/76294 | train loss 3.422229 | norm 0.2686 | lr 6.50e-04 | (3823.86 ms | 137110 tok/s) step 10256/76294 | train loss 3.427252 | norm 0.2614 | lr 6.49e-04 | (3803.35 ms | 137849 tok/s) step 10257/76294 | train loss 3.440356 | norm 0.2168 | lr 6.49e-04 | (3805.61 ms | 137767 tok/s) step 10258/76294 | train loss 3.432611 | norm 0.2641 | lr 6.49e-04 | (3834.98 ms | 136712 tok/s) step 10259/76294 | train loss 3.521513 | norm 0.2665 | lr 6.49e-04 | (3962.40 ms | 132316 tok/s) step 10260/76294 | train loss 3.385954 | norm 0.3408 | lr 6.49e-04 | (3804.29 ms | 137815 tok/s) step 10261/76294 | train loss 3.327255 | norm 0.2382 | lr 6.49e-04 | (3810.32 ms | 137597 tok/s) step 10262/76294 | train loss 3.393156 | norm 0.2468 | lr 6.49e-04 | (3825.73 ms | 137043 tok/s) step 10263/76294 | train loss 3.366993 | norm 0.2741 | lr 6.49e-04 | (3814.24 ms | 137455 tok/s) step 10264/76294 | train loss 3.459111 | norm 0.1719 | lr 6.49e-04 | (3809.23 ms | 137636 tok/s) step 10265/76294 | train loss 3.398293 | norm 0.2562 | lr 6.49e-04 | (3831.17 ms | 136848 tok/s) step 10266/76294 | train loss 3.407068 | norm 0.2232 | lr 6.49e-04 | (3805.22 ms | 137781 tok/s) step 10267/76294 | train loss 3.415628 | norm 0.2395 | lr 6.48e-04 | (3829.67 ms | 136902 tok/s) step 10268/76294 | train loss 3.471133 | norm 0.2264 | lr 6.48e-04 | (3801.26 ms | 137925 tok/s) step 10269/76294 | train loss 3.410053 | norm 0.1911 | lr 6.48e-04 | (3806.16 ms | 137747 tok/s) step 10270/76294 | train loss 3.368550 | norm 0.2494 | lr 6.48e-04 | (3820.09 ms | 137245 tok/s) step 10271/76294 | train loss 3.419443 | norm 0.1818 | lr 
6.48e-04 | (3829.78 ms | 136898 tok/s) step 10272/76294 | train loss 3.349277 | norm 0.2454 | lr 6.48e-04 | (3815.78 ms | 137400 tok/s) step 10273/76294 | train loss 3.481785 | norm 0.2396 | lr 6.48e-04 | (3851.99 ms | 136108 tok/s) step 10274/76294 | train loss 3.400411 | norm 0.2167 | lr 6.48e-04 | (3800.23 ms | 137962 tok/s) step 10275/76294 | train loss 3.489527 | norm 0.1835 | lr 6.48e-04 | (3830.21 ms | 136882 tok/s) step 10276/76294 | train loss 3.378540 | norm 0.1947 | lr 6.48e-04 | (3801.76 ms | 137907 tok/s) step 10277/76294 | train loss 3.354553 | norm 0.2143 | lr 6.48e-04 | (3807.73 ms | 137691 tok/s) step 10278/76294 | train loss 3.406655 | norm 0.2653 | lr 6.47e-04 | (3821.98 ms | 137177 tok/s) step 10279/76294 | train loss 3.413773 | norm 0.2630 | lr 6.47e-04 | (3811.14 ms | 137567 tok/s) step 10280/76294 | train loss 3.395662 | norm 0.2240 | lr 6.47e-04 | (3801.81 ms | 137905 tok/s) step 10281/76294 | train loss 3.390781 | norm 0.2203 | lr 6.47e-04 | (3913.83 ms | 133958 tok/s) step 10282/76294 | train loss 3.395555 | norm 0.1961 | lr 6.47e-04 | (3803.14 ms | 137857 tok/s) step 10283/76294 | train loss 3.364885 | norm 0.2294 | lr 6.47e-04 | (3807.31 ms | 137706 tok/s) step 10284/76294 | train loss 3.373850 | norm 0.2610 | lr 6.47e-04 | (3825.70 ms | 137044 tok/s) step 10285/76294 | train loss 3.341574 | norm 0.1839 | lr 6.47e-04 | (3801.56 ms | 137914 tok/s) step 10286/76294 | train loss 3.405322 | norm 0.2391 | lr 6.47e-04 | (3807.06 ms | 137715 tok/s) step 10287/76294 | train loss 3.409382 | norm 0.1858 | lr 6.47e-04 | (3801.00 ms | 137934 tok/s) step 10288/76294 | train loss 3.470510 | norm 0.2071 | lr 6.47e-04 | (3825.06 ms | 137067 tok/s) step 10289/76294 | train loss 3.370489 | norm 0.1968 | lr 6.46e-04 | (3808.06 ms | 137678 tok/s) step 10290/76294 | train loss 3.433480 | norm 0.2195 | lr 6.46e-04 | (3800.06 ms | 137968 tok/s) step 10291/76294 | train loss 3.372581 | norm 0.1974 | lr 6.46e-04 | (3841.14 ms | 136493 tok/s) step 10292/76294 | 
train loss 3.468579 | norm 0.1991 | lr 6.46e-04 | (3802.69 ms | 137873 tok/s) step 10293/76294 | train loss 3.366826 | norm 0.1806 | lr 6.46e-04 | (3805.56 ms | 137769 tok/s) step 10294/76294 | train loss 3.441670 | norm 0.1871 | lr 6.46e-04 | (3861.23 ms | 135783 tok/s) step 10295/76294 | train loss 3.436172 | norm 0.1995 | lr 6.46e-04 | (3840.40 ms | 136519 tok/s) step 10296/76294 | train loss 3.454529 | norm 0.2069 | lr 6.46e-04 | (3803.03 ms | 137861 tok/s) step 10297/76294 | train loss 3.373286 | norm 0.2123 | lr 6.46e-04 | (3828.31 ms | 136950 tok/s) step 10298/76294 | train loss 3.489045 | norm 0.2277 | lr 6.46e-04 | (3805.82 ms | 137760 tok/s) step 10299/76294 | train loss 3.377901 | norm 0.2468 | lr 6.46e-04 | (3919.57 ms | 133762 tok/s) step 10300/76294 | train loss 3.418391 | norm 0.2908 | lr 6.45e-04 | (4142.69 ms | 126557 tok/s) step 10301/76294 | train loss 3.522293 | norm 0.2683 | lr 6.45e-04 | (3825.58 ms | 137048 tok/s) step 10302/76294 | train loss 3.482936 | norm 0.2205 | lr 6.45e-04 | (3878.45 ms | 135180 tok/s) step 10303/76294 | train loss 3.480008 | norm 0.3183 | lr 6.45e-04 | (3799.93 ms | 137973 tok/s) step 10304/76294 | train loss 3.524008 | norm 0.2373 | lr 6.45e-04 | (3858.74 ms | 135870 tok/s) step 10305/76294 | train loss 3.427666 | norm 0.2705 | lr 6.45e-04 | (3801.81 ms | 137905 tok/s) step 10306/76294 | train loss 3.506819 | norm 0.2650 | lr 6.45e-04 | (3830.35 ms | 136877 tok/s) step 10307/76294 | train loss 3.475661 | norm 0.2126 | lr 6.45e-04 | (3819.56 ms | 137264 tok/s) step 10308/76294 | train loss 3.453299 | norm 0.1989 | lr 6.45e-04 | (3800.12 ms | 137966 tok/s) step 10309/76294 | train loss 3.392036 | norm 0.2252 | lr 6.45e-04 | (3849.88 ms | 136183 tok/s) step 10310/76294 | train loss 3.524419 | norm 0.3225 | lr 6.45e-04 | (3800.93 ms | 137937 tok/s) step 10311/76294 | train loss 3.508647 | norm 0.2624 | lr 6.44e-04 | (3855.85 ms | 135972 tok/s) step 10312/76294 | train loss 3.505392 | norm 0.2533 | lr 6.44e-04 | (3797.02 
ms | 138079 tok/s) step 10313/76294 | train loss 3.462496 | norm 0.1925 | lr 6.44e-04 | (3861.85 ms | 135761 tok/s) step 10314/76294 | train loss 3.399166 | norm 0.2050 | lr 6.44e-04 | (3819.57 ms | 137264 tok/s) step 10315/76294 | train loss 3.368408 | norm 0.2096 | lr 6.44e-04 | (3816.56 ms | 137372 tok/s) step 10316/76294 | train loss 3.428434 | norm 0.2082 | lr 6.44e-04 | (3793.83 ms | 138195 tok/s) step 10317/76294 | train loss 3.497738 | norm 0.1690 | lr 6.44e-04 | (3822.70 ms | 137151 tok/s) step 10318/76294 | train loss 3.471801 | norm 0.1687 | lr 6.44e-04 | (3796.44 ms | 138100 tok/s) step 10319/76294 | train loss 3.416107 | norm 0.1607 | lr 6.44e-04 | (3823.36 ms | 137128 tok/s) step 10320/76294 | train loss 3.476323 | norm 0.1687 | lr 6.44e-04 | (3815.55 ms | 137408 tok/s) step 10321/76294 | train loss 3.480869 | norm 0.1785 | lr 6.44e-04 | (3831.44 ms | 136838 tok/s) step 10322/76294 | train loss 3.425782 | norm 0.1776 | lr 6.43e-04 | (3794.24 ms | 138180 tok/s) step 10323/76294 | train loss 3.426888 | norm 0.2090 | lr 6.43e-04 | (3902.42 ms | 134350 tok/s) step 10324/76294 | train loss 3.431277 | norm 0.1857 | lr 6.43e-04 | (3800.97 ms | 137935 tok/s) step 10325/76294 | train loss 3.448992 | norm 0.2228 | lr 6.43e-04 | (4026.66 ms | 130204 tok/s) step 10326/76294 | train loss 3.489568 | norm 0.2549 | lr 6.43e-04 | (3794.32 ms | 138177 tok/s) step 10327/76294 | train loss 3.590228 | norm 0.1911 | lr 6.43e-04 | (3807.59 ms | 137695 tok/s) step 10328/76294 | train loss 3.472887 | norm 0.3197 | lr 6.43e-04 | (3794.24 ms | 138180 tok/s) step 10329/76294 | train loss 3.417149 | norm 0.1984 | lr 6.43e-04 | (3803.60 ms | 137840 tok/s) step 10330/76294 | train loss 3.390858 | norm 0.2556 | lr 6.43e-04 | (3816.26 ms | 137383 tok/s) step 10331/76294 | train loss 3.406368 | norm 0.2180 | lr 6.43e-04 | (3799.47 ms | 137990 tok/s) step 10332/76294 | train loss 3.457545 | norm 0.2003 | lr 6.43e-04 | (3903.01 ms | 134329 tok/s) step 10333/76294 | train loss 3.376024 | 
norm 0.2283 | lr 6.42e-04 | (3792.27 ms | 138252 tok/s) step 10334/76294 | train loss 3.404750 | norm 0.2254 | lr 6.42e-04 | (3796.33 ms | 138104 tok/s) step 10335/76294 | train loss 3.465130 | norm 0.2371 | lr 6.42e-04 | (3815.37 ms | 137415 tok/s) step 10336/76294 | train loss 3.447387 | norm 0.2276 | lr 6.42e-04 | (3795.11 ms | 138148 tok/s) step 10337/76294 | train loss 3.475225 | norm 0.1870 | lr 6.42e-04 | (3792.31 ms | 138250 tok/s) step 10338/76294 | train loss 3.475121 | norm 0.2128 | lr 6.42e-04 | (3813.19 ms | 137493 tok/s) step 10339/76294 | train loss 3.498090 | norm 0.2404 | lr 6.42e-04 | (3798.50 ms | 138025 tok/s) step 10340/76294 | train loss 3.481520 | norm 0.2685 | lr 6.42e-04 | (3799.42 ms | 137991 tok/s) step 10341/76294 | train loss 3.455510 | norm 0.2512 | lr 6.42e-04 | (3839.97 ms | 136534 tok/s) step 10342/76294 | train loss 3.498886 | norm 0.2178 | lr 6.42e-04 | (3806.10 ms | 137749 tok/s) step 10343/76294 | train loss 3.463315 | norm 0.3739 | lr 6.42e-04 | (3892.24 ms | 134701 tok/s) step 10344/76294 | train loss 3.454719 | norm 0.2495 | lr 6.42e-04 | (3791.00 ms | 138298 tok/s) step 10345/76294 | train loss 3.427473 | norm 0.3822 | lr 6.41e-04 | (3816.61 ms | 137370 tok/s) step 10346/76294 | train loss 3.490533 | norm 0.1973 | lr 6.41e-04 | (3818.14 ms | 137315 tok/s) step 10347/76294 | train loss 3.446571 | norm 0.2627 | lr 6.41e-04 | (3797.67 ms | 138055 tok/s) step 10348/76294 | train loss 3.419768 | norm 0.2414 | lr 6.41e-04 | (3792.85 ms | 138231 tok/s) step 10349/76294 | train loss 3.386637 | norm 0.2374 | lr 6.41e-04 | (3844.58 ms | 136371 tok/s) step 10350/76294 | train loss 3.506733 | norm 0.2402 | lr 6.41e-04 | (3791.54 ms | 138278 tok/s) step 10351/76294 | train loss 3.399348 | norm 0.2721 | lr 6.41e-04 | (3825.99 ms | 137033 tok/s) step 10352/76294 | train loss 3.462294 | norm 0.1846 | lr 6.41e-04 | (3798.32 ms | 138032 tok/s) step 10353/76294 | train loss 3.438733 | norm 0.3290 | lr 6.41e-04 | (3805.18 ms | 137783 tok/s) 
step 10354/76294 | train loss 3.470359 | norm 0.2543 | lr 6.41e-04 | (3824.84 ms | 137074 tok/s) step 10355/76294 | train loss 3.372946 | norm 0.2319 | lr 6.41e-04 | (3803.89 ms | 137829 tok/s) step 10356/76294 | train loss 3.464306 | norm 0.2512 | lr 6.40e-04 | (3802.26 ms | 137889 tok/s) step 10357/76294 | train loss 3.447585 | norm 0.1957 | lr 6.40e-04 | (3859.35 ms | 135849 tok/s) step 10358/76294 | train loss 3.478147 | norm 0.2346 | lr 6.40e-04 | (3811.83 ms | 137543 tok/s) step 10359/76294 | train loss 3.506125 | norm 0.2484 | lr 6.40e-04 | (3811.80 ms | 137544 tok/s) step 10360/76294 | train loss 3.424319 | norm 0.3070 | lr 6.40e-04 | (3826.32 ms | 137021 tok/s) step 10361/76294 | train loss 3.473868 | norm 0.2426 | lr 6.40e-04 | (3811.07 ms | 137570 tok/s) step 10362/76294 | train loss 3.475570 | norm 0.3268 | lr 6.40e-04 | (3806.58 ms | 137732 tok/s) step 10363/76294 | train loss 3.466154 | norm 0.2019 | lr 6.40e-04 | (3918.10 ms | 133812 tok/s) step 10364/76294 | train loss 3.471003 | norm 0.2308 | lr 6.40e-04 | (3812.43 ms | 137521 tok/s) step 10365/76294 | train loss 3.440281 | norm 0.2175 | lr 6.40e-04 | (3829.51 ms | 136907 tok/s) step 10366/76294 | train loss 3.458866 | norm 0.1848 | lr 6.40e-04 | (3832.86 ms | 136788 tok/s) step 10367/76294 | train loss 3.421546 | norm 0.3023 | lr 6.39e-04 | (3813.09 ms | 137497 tok/s) step 10368/76294 | train loss 3.424003 | norm 0.3293 | lr 6.39e-04 | (3809.59 ms | 137623 tok/s) step 10369/76294 | train loss 3.400192 | norm 0.2077 | lr 6.39e-04 | (3837.21 ms | 136633 tok/s) step 10370/76294 | train loss 3.442518 | norm 0.2115 | lr 6.39e-04 | (4202.13 ms | 124767 tok/s) step 10371/76294 | train loss 3.439114 | norm 0.2085 | lr 6.39e-04 | (3981.71 ms | 131674 tok/s) step 10372/76294 | train loss 3.453046 | norm 0.2350 | lr 6.39e-04 | (3800.10 ms | 137967 tok/s) step 10373/76294 | train loss 3.433925 | norm 0.2421 | lr 6.39e-04 | (3861.99 ms | 135756 tok/s) step 10374/76294 | train loss 3.443567 | norm 0.2062 | lr 
6.39e-04 | (3801.18 ms | 137928 tok/s) step 10375/76294 | train loss 3.427413 | norm 0.2621 | lr 6.39e-04 | (26233.11 ms | 19986 tok/s) step 10376/76294 | train loss 3.400631 | norm 0.1729 | lr 6.39e-04 | (3863.60 ms | 135699 tok/s) step 10377/76294 | train loss 3.478198 | norm 0.2531 | lr 6.39e-04 | (3771.49 ms | 139014 tok/s) step 10378/76294 | train loss 3.468203 | norm 0.1947 | lr 6.38e-04 | (3790.36 ms | 138321 tok/s) step 10379/76294 | train loss 3.464130 | norm 0.2521 | lr 6.38e-04 | (9609.11 ms | 54562 tok/s) step 10380/76294 | train loss 3.469103 | norm 0.1930 | lr 6.38e-04 | (3785.23 ms | 138509 tok/s) step 10381/76294 | train loss 3.439280 | norm 0.3029 | lr 6.38e-04 | (14432.00 ms | 36328 tok/s) step 10382/76294 | train loss 3.478235 | norm 0.2077 | lr 6.38e-04 | (9192.69 ms | 57033 tok/s) step 10383/76294 | train loss 3.434700 | norm 0.2504 | lr 6.38e-04 | (3777.97 ms | 138775 tok/s) step 10384/76294 | train loss 3.436172 | norm 0.2639 | lr 6.38e-04 | (3820.85 ms | 137218 tok/s) step 10385/76294 | train loss 3.405781 | norm 0.2272 | lr 6.38e-04 | (3859.16 ms | 135856 tok/s) step 10386/76294 | train loss 3.411555 | norm 0.2511 | lr 6.38e-04 | (3763.06 ms | 139325 tok/s) step 10387/76294 | train loss 3.492366 | norm 0.2640 | lr 6.38e-04 | (3893.41 ms | 134660 tok/s) step 10388/76294 | train loss 3.451669 | norm 0.3038 | lr 6.38e-04 | (3822.19 ms | 137170 tok/s) step 10389/76294 | train loss 3.513140 | norm 0.2561 | lr 6.37e-04 | (3769.45 ms | 139089 tok/s) step 10390/76294 | train loss 3.474845 | norm 0.2103 | lr 6.37e-04 | (3887.50 ms | 134865 tok/s) step 10391/76294 | train loss 3.409964 | norm 0.2129 | lr 6.37e-04 | (3910.73 ms | 134064 tok/s) step 10392/76294 | train loss 3.430606 | norm 0.2889 | lr 6.37e-04 | (3862.60 ms | 135734 tok/s) step 10393/76294 | train loss 3.386832 | norm 0.2628 | lr 6.37e-04 | (3772.03 ms | 138994 tok/s) step 10394/76294 | train loss 3.437930 | norm 0.3363 | lr 6.37e-04 | (3795.29 ms | 138142 tok/s) step 10395/76294 | 
train loss 3.415318 | norm 0.2272 | lr 6.37e-04 | (3806.11 ms | 137749 tok/s) step 10396/76294 | train loss 3.450002 | norm 0.4301 | lr 6.37e-04 | (3926.33 ms | 133531 tok/s) step 10397/76294 | train loss 3.473075 | norm 0.2093 | lr 6.37e-04 | (3782.78 ms | 138599 tok/s) step 10398/76294 | train loss 3.450606 | norm 0.2871 | lr 6.37e-04 | (3805.36 ms | 137776 tok/s) step 10399/76294 | train loss 3.418212 | norm 0.2279 | lr 6.37e-04 | (3787.82 ms | 138414 tok/s) step 10400/76294 | train loss 3.392187 | norm 0.3289 | lr 6.36e-04 | (3793.01 ms | 138225 tok/s) step 10401/76294 | train loss 3.390131 | norm 0.3153 | lr 6.36e-04 | (3819.22 ms | 137276 tok/s) step 10402/76294 | train loss 3.481480 | norm 0.2283 | lr 6.36e-04 | (3795.83 ms | 138122 tok/s) step 10403/76294 | train loss 3.388220 | norm 0.2717 | lr 6.36e-04 | (3819.24 ms | 137276 tok/s) step 10404/76294 | train loss 3.373542 | norm 0.3177 | lr 6.36e-04 | (3798.85 ms | 138012 tok/s) step 10405/76294 | train loss 3.371782 | norm 0.2682 | lr 6.36e-04 | (3802.76 ms | 137871 tok/s) step 10406/76294 | train loss 3.441394 | norm 0.2269 | lr 6.36e-04 | (3821.59 ms | 137191 tok/s) step 10407/76294 | train loss 3.507923 | norm 0.2073 | lr 6.36e-04 | (3807.33 ms | 137705 tok/s) step 10408/76294 | train loss 3.454947 | norm 0.3676 | lr 6.36e-04 | (3802.61 ms | 137876 tok/s) step 10409/76294 | train loss 3.372309 | norm 0.1924 | lr 6.36e-04 | (3838.43 ms | 136589 tok/s) step 10410/76294 | train loss 3.392006 | norm 0.3080 | lr 6.36e-04 | (3838.30 ms | 136594 tok/s) step 10411/76294 | train loss 3.409186 | norm 0.2199 | lr 6.35e-04 | (3815.32 ms | 137417 tok/s) step 10412/76294 | train loss 3.412164 | norm 0.2910 | lr 6.35e-04 | (3896.05 ms | 134569 tok/s) step 10413/76294 | train loss 3.467742 | norm 0.2658 | lr 6.35e-04 | (3808.19 ms | 137674 tok/s) step 10414/76294 | train loss 3.404639 | norm 0.3601 | lr 6.35e-04 | (3816.09 ms | 137389 tok/s) step 10415/76294 | train loss 3.439293 | norm 0.2209 | lr 6.35e-04 | (3829.49 
ms | 136908 tok/s) step 10416/76294 | train loss 3.448261 | norm 0.2152 | lr 6.35e-04 | (3822.49 ms | 137159 tok/s) step 10417/76294 | train loss 3.368134 | norm 0.2372 | lr 6.35e-04 | (3810.18 ms | 137602 tok/s) step 10418/76294 | train loss 3.457569 | norm 0.2495 | lr 6.35e-04 | (3892.49 ms | 134692 tok/s) step 10419/76294 | train loss 3.450990 | norm 0.1894 | lr 6.35e-04 | (3821.73 ms | 137186 tok/s) step 10420/76294 | train loss 3.408779 | norm 0.2428 | lr 6.35e-04 | (3815.46 ms | 137412 tok/s) step 10421/76294 | train loss 3.356237 | norm 0.1869 | lr 6.35e-04 | (3837.90 ms | 136608 tok/s) step 10422/76294 | train loss 3.422318 | norm 0.2991 | lr 6.34e-04 | (3814.42 ms | 137449 tok/s) step 10423/76294 | train loss 3.386890 | norm 0.1912 | lr 6.34e-04 | (3809.60 ms | 137623 tok/s) step 10424/76294 | train loss 3.506225 | norm 0.3044 | lr 6.34e-04 | (3861.29 ms | 135780 tok/s) step 10425/76294 | train loss 3.416035 | norm 0.2417 | lr 6.34e-04 | (3814.81 ms | 137435 tok/s) step 10426/76294 | train loss 3.378858 | norm 0.6525 | lr 6.34e-04 | (3821.61 ms | 137190 tok/s) step 10427/76294 | train loss 3.510596 | norm 0.3624 | lr 6.34e-04 | (3830.64 ms | 136867 tok/s) step 10428/76294 | train loss 3.448128 | norm 0.4535 | lr 6.34e-04 | (3811.30 ms | 137561 tok/s) step 10429/76294 | train loss 3.417132 | norm 0.2425 | lr 6.34e-04 | (3807.53 ms | 137698 tok/s) step 10430/76294 | train loss 3.419332 | norm 0.3259 | lr 6.34e-04 | (3819.36 ms | 137271 tok/s) step 10431/76294 | train loss 3.392614 | norm 0.2791 | lr 6.34e-04 | (3837.14 ms | 136635 tok/s) step 10432/76294 | train loss 3.408228 | norm 0.2099 | lr 6.34e-04 | (3818.29 ms | 137310 tok/s) step 10433/76294 | train loss 3.481917 | norm 0.2450 | lr 6.33e-04 | (3813.08 ms | 137497 tok/s) step 10434/76294 | train loss 3.424035 | norm 0.2383 | lr 6.33e-04 | (3932.68 ms | 133316 tok/s) step 10435/76294 | train loss 3.438432 | norm 0.2240 | lr 6.33e-04 | (3807.41 ms | 137702 tok/s) step 10436/76294 | train loss 3.414871 | 
norm 0.2168 | lr 6.33e-04 | (3824.59 ms | 137084 tok/s) step 10437/76294 | train loss 3.416234 | norm 0.2064 | lr 6.33e-04 | (3810.36 ms | 137596 tok/s) step 10438/76294 | train loss 3.398152 | norm 0.2603 | lr 6.33e-04 | (3809.37 ms | 137631 tok/s) step 10439/76294 | train loss 3.371191 | norm 0.2220 | lr 6.33e-04 | (3830.31 ms | 136879 tok/s) step 10440/76294 | train loss 3.396614 | norm 0.2297 | lr 6.33e-04 | (3813.05 ms | 137498 tok/s) step 10441/76294 | train loss 3.399067 | norm 0.2273 | lr 6.33e-04 | (3801.13 ms | 137930 tok/s) step 10442/76294 | train loss 3.493682 | norm 0.2077 | lr 6.33e-04 | (3888.28 ms | 134838 tok/s) step 10443/76294 | train loss 3.427278 | norm 0.1999 | lr 6.33e-04 | (3805.39 ms | 137775 tok/s) step 10444/76294 | train loss 3.400477 | norm 0.2004 | lr 6.33e-04 | (3807.09 ms | 137713 tok/s) step 10445/76294 | train loss 3.407699 | norm 0.2249 | lr 6.32e-04 | (3830.42 ms | 136875 tok/s) step 10446/76294 | train loss 3.407066 | norm 0.2344 | lr 6.32e-04 | (3804.10 ms | 137822 tok/s) step 10447/76294 | train loss 3.381490 | norm 0.2119 | lr 6.32e-04 | (3804.62 ms | 137803 tok/s) step 10448/76294 | train loss 3.462522 | norm 0.4411 | lr 6.32e-04 | (3834.21 ms | 136739 tok/s) step 10449/76294 | train loss 3.436629 | norm 0.1710 | lr 6.32e-04 | (3807.96 ms | 137682 tok/s) step 10450/76294 | train loss 3.460011 | norm 0.3412 | lr 6.32e-04 | (3825.03 ms | 137068 tok/s) step 10451/76294 | train loss 3.482497 | norm 0.2842 | lr 6.32e-04 | (3827.04 ms | 136996 tok/s) step 10452/76294 | train loss 3.398096 | norm 0.2206 | lr 6.32e-04 | (3804.70 ms | 137800 tok/s) step 10453/76294 | train loss 3.438932 | norm 0.2460 | lr 6.32e-04 | (3805.47 ms | 137772 tok/s) step 10454/76294 | train loss 3.382862 | norm 0.2074 | lr 6.32e-04 | (3812.41 ms | 137522 tok/s) step 10455/76294 | train loss 3.403956 | norm 0.1820 | lr 6.32e-04 | (3802.08 ms | 137895 tok/s) step 10456/76294 | train loss 3.484764 | norm 0.2161 | lr 6.31e-04 | (3986.73 ms | 131508 tok/s) 
step 10457/76294 | train loss 3.427371 | norm 0.1712 | lr 6.31e-04 | (3800.44 ms | 137954 tok/s) step 10458/76294 | train loss 3.421083 | norm 0.2382 | lr 6.31e-04 | (3808.18 ms | 137674 tok/s) step 10459/76294 | train loss 3.390087 | norm 0.1958 | lr 6.31e-04 | (3826.44 ms | 137017 tok/s) step 10460/76294 | train loss 3.427253 | norm 0.2076 | lr 6.31e-04 | (3815.27 ms | 137418 tok/s) step 10461/76294 | train loss 3.406518 | norm 0.2572 | lr 6.31e-04 | (3802.81 ms | 137869 tok/s) step 10462/76294 | train loss 3.399882 | norm 0.2894 | lr 6.31e-04 | (3842.52 ms | 136444 tok/s) step 10463/76294 | train loss 3.425678 | norm 0.2499 | lr 6.31e-04 | (3801.47 ms | 137917 tok/s) step 10464/76294 | train loss 3.383645 | norm 0.2722 | lr 6.31e-04 | (3809.55 ms | 137624 tok/s) step 10465/76294 | train loss 3.393246 | norm 0.2260 | lr 6.31e-04 | (3828.30 ms | 136951 tok/s) step 10466/76294 | train loss 3.414201 | norm 0.1877 | lr 6.31e-04 | (3854.94 ms | 136004 tok/s) step 10467/76294 | train loss 3.424219 | norm 0.2454 | lr 6.30e-04 | (3800.93 ms | 137937 tok/s) step 10468/76294 | train loss 3.344700 | norm 0.2341 | lr 6.30e-04 | (3810.04 ms | 137607 tok/s) step 10469/76294 | train loss 3.447886 | norm 0.2664 | lr 6.30e-04 | (3836.44 ms | 136660 tok/s) step 10470/76294 | train loss 3.335525 | norm 0.3283 | lr 6.30e-04 | (3811.10 ms | 137569 tok/s) step 10471/76294 | train loss 3.404946 | norm 0.2304 | lr 6.30e-04 | (3803.44 ms | 137846 tok/s) step 10472/76294 | train loss 3.387438 | norm 0.2997 | lr 6.30e-04 | (3835.91 ms | 136679 tok/s) step 10473/76294 | train loss 3.349944 | norm 0.2283 | lr 6.30e-04 | (3804.78 ms | 137797 tok/s) step 10474/76294 | train loss 3.373811 | norm 0.2730 | lr 6.30e-04 | (3806.34 ms | 137741 tok/s) step 10475/76294 | train loss 3.434415 | norm 0.2186 | lr 6.30e-04 | (3828.67 ms | 136937 tok/s) step 10476/76294 | train loss 3.382713 | norm 0.2459 | lr 6.30e-04 | (3805.45 ms | 137773 tok/s) step 10477/76294 | train loss 3.387114 | norm 0.2267 | lr 
6.30e-04 | (3803.02 ms | 137861 tok/s) step 10478/76294 | train loss 3.394702 | norm 0.2549 | lr 6.29e-04 | (3923.31 ms | 133634 tok/s) step 10479/76294 | train loss 3.428034 | norm 0.1898 | lr 6.29e-04 | (3803.74 ms | 137835 tok/s) step 10480/76294 | train loss 3.408049 | norm 0.2409 | lr 6.29e-04 | (3808.16 ms | 137675 tok/s) step 10481/76294 | train loss 3.464771 | norm 0.2072 | lr 6.29e-04 | (3832.72 ms | 136793 tok/s) step 10482/76294 | train loss 3.392604 | norm 0.2075 | lr 6.29e-04 | (3802.70 ms | 137873 tok/s) step 10483/76294 | train loss 3.443019 | norm 0.2207 | lr 6.29e-04 | (3827.43 ms | 136982 tok/s) step 10484/76294 | train loss 3.450048 | norm 0.3439 | lr 6.29e-04 | (3903.85 ms | 134300 tok/s) step 10485/76294 | train loss 3.458126 | norm 0.2024 | lr 6.29e-04 | (3801.05 ms | 137932 tok/s) step 10486/76294 | train loss 3.445984 | norm 0.3330 | lr 6.29e-04 | (3855.66 ms | 135979 tok/s) step 10487/76294 | train loss 3.373665 | norm 0.2239 | lr 6.29e-04 | (3808.84 ms | 137650 tok/s) step 10488/76294 | train loss 3.464229 | norm 0.2768 | lr 6.29e-04 | (3838.08 ms | 136602 tok/s) step 10489/76294 | train loss 3.396929 | norm 0.5520 | lr 6.28e-04 | (3824.12 ms | 137100 tok/s) step 10490/76294 | train loss 3.355219 | norm 0.3447 | lr 6.28e-04 | (3810.11 ms | 137605 tok/s) step 10491/76294 | train loss 3.456867 | norm 0.3960 | lr 6.28e-04 | (4100.37 ms | 127864 tok/s) step 10492/76294 | train loss 3.450719 | norm 0.1928 | lr 6.28e-04 | (3828.50 ms | 136943 tok/s) step 10493/76294 | train loss 3.395869 | norm 0.3205 | lr 6.28e-04 | (3805.82 ms | 137759 tok/s) step 10494/76294 | train loss 3.502069 | norm 0.1759 | lr 6.28e-04 | (3807.43 ms | 137701 tok/s) step 10495/76294 | train loss 3.416466 | norm 0.2774 | lr 6.28e-04 | (3824.33 ms | 137093 tok/s) step 10496/76294 | train loss 3.405981 | norm 0.2006 | lr 6.28e-04 | (3811.61 ms | 137550 tok/s) step 10497/76294 | train loss 3.409089 | norm 0.2757 | lr 6.28e-04 | (3834.09 ms | 136744 tok/s) step 10498/76294 | 
train loss 3.434156 | norm 0.2139 | lr 6.28e-04 | (3806.18 ms | 137747 tok/s) step 10499/76294 | train loss 3.473773 | norm 0.2376 | lr 6.28e-04 | (3802.06 ms | 137896 tok/s) step 10500/76294 | train loss 3.406612 | norm 0.3393 | lr 6.27e-04 | (3907.20 ms | 134185 tok/s) val loss: 3.415765 saving model checkpoint to ./results/gpt2-124M-gqa/step_10500.pth step 10501/76294 | train loss 3.456428 | norm 0.1968 | lr 6.27e-04 | (3819.03 ms | 137283 tok/s) step 10502/76294 | train loss 3.424403 | norm 0.3857 | lr 6.27e-04 | (3823.72 ms | 137114 tok/s) step 10503/76294 | train loss 3.449207 | norm 0.1897 | lr 6.27e-04 | (3802.46 ms | 137881 tok/s) step 10504/76294 | train loss 3.412292 | norm 0.2490 | lr 6.27e-04 | (3801.42 ms | 137919 tok/s) step 10505/76294 | train loss 3.395311 | norm 0.2573 | lr 6.27e-04 | (3835.80 ms | 136683 tok/s) step 10506/76294 | train loss 3.414170 | norm 0.2016 | lr 6.27e-04 | (3799.91 ms | 137974 tok/s) step 10507/76294 | train loss 3.385867 | norm 0.2024 | lr 6.27e-04 | (3838.86 ms | 136574 tok/s) step 10508/76294 | train loss 3.384950 | norm 0.2177 | lr 6.27e-04 | (3799.94 ms | 137973 tok/s) step 10509/76294 | train loss 3.385021 | norm 0.2241 | lr 6.27e-04 | (3807.35 ms | 137704 tok/s) step 10510/76294 | train loss 3.385312 | norm 0.1968 | lr 6.27e-04 | (3828.73 ms | 136935 tok/s) step 10511/76294 | train loss 3.389342 | norm 0.2238 | lr 6.26e-04 | (3815.72 ms | 137402 tok/s) step 10512/76294 | train loss 3.417714 | norm 0.1994 | lr 6.26e-04 | (3801.22 ms | 137926 tok/s) step 10513/76294 | train loss 3.403070 | norm 0.1965 | lr 6.26e-04 | (3834.97 ms | 136712 tok/s) step 10514/76294 | train loss 3.455190 | norm 0.2094 | lr 6.26e-04 | (3800.71 ms | 137945 tok/s) step 10515/76294 | train loss 3.472097 | norm 0.1922 | lr 6.26e-04 | (3825.14 ms | 137064 tok/s) step 10516/76294 | train loss 3.425089 | norm 0.2684 | lr 6.26e-04 | (3807.27 ms | 137707 tok/s) step 10517/76294 | train loss 3.440555 | norm 0.1800 | lr 6.26e-04 | (3825.10 ms | 137065 
tok/s) step 10518/76294 | train loss 3.392746 | norm 0.2575 | lr 6.26e-04 | (3835.27 ms | 136702 tok/s) step 10519/76294 | train loss 3.428071 | norm 0.1822 | lr 6.26e-04 | (3818.36 ms | 137307 tok/s) step 10520/76294 | train loss 3.447315 | norm 0.3150 | lr 6.26e-04 | (3802.53 ms | 137879 tok/s) step 10521/76294 | train loss 3.384956 | norm 0.2052 | lr 6.26e-04 | (3919.98 ms | 133748 tok/s) step 10522/76294 | train loss 3.427432 | norm 0.2674 | lr 6.26e-04 | (3797.26 ms | 138070 tok/s) step 10523/76294 | train loss 3.404908 | norm 0.2536 | lr 6.25e-04 | (3824.71 ms | 137079 tok/s) step 10524/76294 | train loss 3.519333 | norm 0.2179 | lr 6.25e-04 | (3835.69 ms | 136687 tok/s) step 10525/76294 | train loss 3.423532 | norm 0.3415 | lr 6.25e-04 | (3804.71 ms | 137800 tok/s) step 10526/76294 | train loss 3.375978 | norm 0.1902 | lr 6.25e-04 | (3800.83 ms | 137940 tok/s) step 10527/76294 | train loss 3.399321 | norm 0.3419 | lr 6.25e-04 | (3836.77 ms | 136648 tok/s) step 10528/76294 | train loss 3.379097 | norm 0.2337 | lr 6.25e-04 | (3807.33 ms | 137705 tok/s) step 10529/76294 | train loss 3.468534 | norm 0.2986 | lr 6.25e-04 | (3836.03 ms | 136675 tok/s) step 10530/76294 | train loss 3.389982 | norm 0.2274 | lr 6.25e-04 | (3807.11 ms | 137713 tok/s) step 10531/76294 | train loss 3.410064 | norm 0.2420 | lr 6.25e-04 | (3816.19 ms | 137385 tok/s) step 10532/76294 | train loss 3.427970 | norm 0.2109 | lr 6.25e-04 | (3800.30 ms | 137959 tok/s) step 10533/76294 | train loss 3.401194 | norm 0.1938 | lr 6.25e-04 | (3807.43 ms | 137701 tok/s) step 10534/76294 | train loss 3.447814 | norm 0.2105 | lr 6.24e-04 | (3831.04 ms | 136853 tok/s) step 10535/76294 | train loss 3.393237 | norm 0.2172 | lr 6.24e-04 | (3840.13 ms | 136529 tok/s) step 10536/76294 | train loss 3.422863 | norm 0.1684 | lr 6.24e-04 | (3928.43 ms | 133460 tok/s) step 10537/76294 | train loss 3.492598 | norm 0.2023 | lr 6.24e-04 | (3799.93 ms | 137973 tok/s) step 10538/76294 | train loss 3.383867 | norm 0.1652 
| lr 6.24e-04 | (3807.56 ms | 137697 tok/s) step 10539/76294 | train loss 3.408405 | norm 0.1773 | lr 6.24e-04 | (3799.05 ms | 138005 tok/s) step 10540/76294 | train loss 3.447830 | norm 0.1855 | lr 6.24e-04 | (3830.55 ms | 136870 tok/s) step 10541/76294 | train loss 3.408580 | norm 0.1830 | lr 6.24e-04 | (3798.95 ms | 138009 tok/s) step 10542/76294 | train loss 3.415166 | norm 0.2239 | lr 6.24e-04 | (3874.31 ms | 135324 tok/s) step 10543/76294 | train loss 3.445686 | norm 0.2869 | lr 6.24e-04 | (3798.02 ms | 138043 tok/s) step 10544/76294 | train loss 3.510782 | norm 0.2388 | lr 6.24e-04 | (3804.06 ms | 137823 tok/s) step 10545/76294 | train loss 3.440103 | norm 0.3581 | lr 6.23e-04 | (3816.20 ms | 137385 tok/s) step 10546/76294 | train loss 3.452410 | norm 0.2235 | lr 6.23e-04 | (3805.41 ms | 137774 tok/s) step 10547/76294 | train loss 3.466051 | norm 0.2647 | lr 6.23e-04 | (3802.10 ms | 137894 tok/s) step 10548/76294 | train loss 3.433448 | norm 0.1916 | lr 6.23e-04 | (3852.49 ms | 136091 tok/s) step 10549/76294 | train loss 3.419660 | norm 0.2050 | lr 6.23e-04 | (3808.19 ms | 137674 tok/s) step 10550/76294 | train loss 3.441862 | norm 0.2219 | lr 6.23e-04 | (3863.27 ms | 135711 tok/s) step 10551/76294 | train loss 3.485255 | norm 0.2010 | lr 6.23e-04 | (3805.86 ms | 137758 tok/s) step 10552/76294 | train loss 3.420583 | norm 0.2402 | lr 6.23e-04 | (3835.22 ms | 136703 tok/s) step 10553/76294 | train loss 3.445821 | norm 0.3166 | lr 6.23e-04 | (3796.83 ms | 138086 tok/s) step 10554/76294 | train loss 3.462500 | norm 0.2202 | lr 6.23e-04 | (3807.93 ms | 137683 tok/s) step 10555/76294 | train loss 3.443518 | norm 0.3025 | lr 6.23e-04 | (3802.13 ms | 137893 tok/s) step 10556/76294 | train loss 3.474942 | norm 0.2216 | lr 6.22e-04 | (3806.20 ms | 137746 tok/s) step 10557/76294 | train loss 3.409479 | norm 0.2151 | lr 6.22e-04 | (3835.34 ms | 136699 tok/s) step 10558/76294 | train loss 3.393170 | norm 0.2359 | lr 6.22e-04 | (3811.90 ms | 137540 tok/s) step 
10559/76294 | train loss 3.349719 | norm 0.1822 | lr 6.22e-04 | (3799.72 ms | 137981 tok/s) step 10560/76294 | train loss 3.488921 | norm 0.2078 | lr 6.22e-04 | (3828.74 ms | 136935 tok/s) step 10561/76294 | train loss 3.364361 | norm 0.2071 | lr 6.22e-04 | (3799.65 ms | 137983 tok/s) step 10562/76294 | train loss 3.521100 | norm 0.2081 | lr 6.22e-04 | (3808.81 ms | 137651 tok/s) step 10563/76294 | train loss 3.493028 | norm 0.1900 | lr 6.22e-04 | (4037.94 ms | 129841 tok/s) step 10564/76294 | train loss 3.394260 | norm 0.2025 | lr 6.22e-04 | (3799.53 ms | 137988 tok/s) step 10565/76294 | train loss 3.415296 | norm 0.2122 | lr 6.22e-04 | (3807.10 ms | 137713 tok/s) step 10566/76294 | train loss 3.415147 | norm 0.1889 | lr 6.22e-04 | (3822.26 ms | 137167 tok/s) step 10567/76294 | train loss 3.357266 | norm 0.2289 | lr 6.21e-04 | (3819.67 ms | 137260 tok/s) step 10568/76294 | train loss 3.408693 | norm 0.2640 | lr 6.21e-04 | (3805.51 ms | 137771 tok/s) step 10569/76294 | train loss 3.447506 | norm 0.1906 | lr 6.21e-04 | (3803.00 ms | 137862 tok/s) step 10570/76294 | train loss 3.382500 | norm 0.2874 | lr 6.21e-04 | (3808.53 ms | 137662 tok/s) step 10571/76294 | train loss 3.440007 | norm 0.2773 | lr 6.21e-04 | (3828.21 ms | 136954 tok/s) step 10572/76294 | train loss 3.409721 | norm 0.2313 | lr 6.21e-04 | (3799.23 ms | 137998 tok/s) step 10573/76294 | train loss 3.395721 | norm 0.2832 | lr 6.21e-04 | (3827.71 ms | 136972 tok/s) step 10574/76294 | train loss 3.381734 | norm 0.1906 | lr 6.21e-04 | (3800.35 ms | 137958 tok/s) step 10575/76294 | train loss 3.396680 | norm 0.2664 | lr 6.21e-04 | (3802.20 ms | 137891 tok/s) step 10576/76294 | train loss 3.415363 | norm 0.2124 | lr 6.21e-04 | (3818.68 ms | 137296 tok/s) step 10577/76294 | train loss 3.425219 | norm 0.2208 | lr 6.21e-04 | (3805.74 ms | 137763 tok/s) step 10578/76294 | train loss 3.414282 | norm 0.2239 | lr 6.20e-04 | (3800.63 ms | 137948 tok/s) step 10579/76294 | train loss 3.447370 | norm 0.2109 | lr 
6.20e-04 | (3890.31 ms | 134768 tok/s) step 10580/76294 | train loss 3.392614 | norm 0.2268 | lr 6.20e-04 | (3795.07 ms | 138150 tok/s) step 10581/76294 | train loss 3.425592 | norm 0.2247 | lr 6.20e-04 | (3799.93 ms | 137973 tok/s) step 10582/76294 | train loss 3.428312 | norm 0.2075 | lr 6.20e-04 | (3818.40 ms | 137306 tok/s) step 10583/76294 | train loss 3.452669 | norm 0.2615 | lr 6.20e-04 | (3801.07 ms | 137932 tok/s) step 10584/76294 | train loss 3.452908 | norm 0.3212 | lr 6.20e-04 | (3871.47 ms | 135423 tok/s) step 10585/76294 | train loss 3.451083 | norm 0.1963 | lr 6.20e-04 | (3797.78 ms | 138051 tok/s) step 10586/76294 | train loss 3.459608 | norm 0.3778 | lr 6.20e-04 | (3800.67 ms | 137946 tok/s) step 10587/76294 | train loss 3.358925 | norm 0.2139 | lr 6.20e-04 | (3820.65 ms | 137225 tok/s) step 10588/76294 | train loss 3.384342 | norm 0.3096 | lr 6.20e-04 | (3804.37 ms | 137812 tok/s) step 10589/76294 | train loss 3.401029 | norm 0.2688 | lr 6.19e-04 | (3823.12 ms | 137136 tok/s) step 10590/76294 | train loss 3.444513 | norm 0.2173 | lr 6.19e-04 | (3803.03 ms | 137861 tok/s) step 10591/76294 | train loss 3.421768 | norm 0.2507 | lr 6.19e-04 | (3801.54 ms | 137915 tok/s) step 10592/76294 | train loss 3.410441 | norm 0.1940 | lr 6.19e-04 | (3833.16 ms | 136777 tok/s) step 10593/76294 | train loss 3.444628 | norm 0.2219 | lr 6.19e-04 | (3799.29 ms | 137996 tok/s) step 10594/76294 | train loss 3.366160 | norm 0.2681 | lr 6.19e-04 | (3803.27 ms | 137852 tok/s) step 10595/76294 | train loss 3.448789 | norm 0.2745 | lr 6.19e-04 | (3825.52 ms | 137050 tok/s) step 10596/76294 | train loss 3.444060 | norm 0.3466 | lr 6.19e-04 | (3825.01 ms | 137069 tok/s) step 10597/76294 | train loss 3.406269 | norm 0.2770 | lr 6.19e-04 | (3803.29 ms | 137851 tok/s) step 10598/76294 | train loss 3.440686 | norm 0.2772 | lr 6.19e-04 | (4826.65 ms | 108624 tok/s) step 10599/76294 | train loss 3.384259 | norm 0.2509 | lr 6.19e-04 | (5665.42 ms | 92542 tok/s) step 10600/76294 | 
train loss 3.421647 | norm 0.2638 | lr 6.18e-04 | (3924.77 ms | 133584 tok/s) step 10601/76294 | train loss 3.420551 | norm 0.2023 | lr 6.18e-04 | (5490.93 ms | 95483 tok/s) step 10602/76294 | train loss 3.489010 | norm 0.2037 | lr 6.18e-04 | (3871.39 ms | 135426 tok/s) step 10603/76294 | train loss 3.450708 | norm 0.1978 | lr 6.18e-04 | (3797.05 ms | 138078 tok/s) step 10604/76294 | train loss 3.432060 | norm 0.2402 | lr 6.18e-04 | (3800.49 ms | 137953 tok/s) step 10605/76294 | train loss 3.412638 | norm 0.1774 | lr 6.18e-04 | (3888.77 ms | 134821 tok/s) step 10606/76294 | train loss 3.440229 | norm 0.2267 | lr 6.18e-04 | (3795.05 ms | 138150 tok/s) step 10607/76294 | train loss 3.399809 | norm 0.2649 | lr 6.18e-04 | (3801.08 ms | 137931 tok/s) step 10608/76294 | train loss 3.556673 | norm 0.2667 | lr 6.18e-04 | (3849.04 ms | 136213 tok/s) step 10609/76294 | train loss 3.395106 | norm 0.2757 | lr 6.18e-04 | (3802.14 ms | 137893 tok/s) step 10610/76294 | train loss 3.435348 | norm 0.1921 | lr 6.18e-04 | (3806.20 ms | 137746 tok/s) step 10611/76294 | train loss 3.427758 | norm 0.3396 | lr 6.18e-04 | (3828.96 ms | 136927 tok/s) step 10612/76294 | train loss 3.417808 | norm 0.2269 | lr 6.17e-04 | (3828.91 ms | 136929 tok/s) step 10613/76294 | train loss 3.435719 | norm 0.2132 | lr 6.17e-04 | (3804.81 ms | 137796 tok/s) step 10614/76294 | train loss 3.439591 | norm 0.1995 | lr 6.17e-04 | (3834.93 ms | 136714 tok/s) step 10615/76294 | train loss 3.424608 | norm 0.1976 | lr 6.17e-04 | (3803.35 ms | 137849 tok/s) step 10616/76294 | train loss 3.432766 | norm 0.1772 | lr 6.17e-04 | (3819.60 ms | 137263 tok/s) step 10617/76294 | train loss 3.463645 | norm 0.2098 | lr 6.17e-04 | (3823.55 ms | 137121 tok/s) step 10618/76294 | train loss 3.396112 | norm 0.2051 | lr 6.17e-04 | (3855.69 ms | 135978 tok/s) step 10619/76294 | train loss 3.474057 | norm 0.2856 | lr 6.17e-04 | (3807.10 ms | 137713 tok/s) step 10620/76294 | train loss 3.608124 | norm 0.2024 | lr 6.17e-04 | (3810.12 
ms | 137604 tok/s) step 10621/76294 | train loss 3.444138 | norm 0.2818 | lr 6.17e-04 | (3830.40 ms | 136875 tok/s) step 10622/76294 | train loss 3.418822 | norm 0.2534 | lr 6.17e-04 | (3810.61 ms | 137586 tok/s) step 10623/76294 | train loss 3.444414 | norm 0.2670 | lr 6.16e-04 | (3812.35 ms | 137523 tok/s) step 10624/76294 | train loss 3.428801 | norm 0.3281 | lr 6.16e-04 | (3833.41 ms | 136768 tok/s) step 10625/76294 | train loss 3.412047 | norm 0.2737 | lr 6.16e-04 | (3808.28 ms | 137671 tok/s) step 10626/76294 | train loss 3.425043 | norm 0.3119 | lr 6.16e-04 | (3805.41 ms | 137775 tok/s) step 10627/76294 | train loss 3.402487 | norm 0.1907 | lr 6.16e-04 | (3833.07 ms | 136780 tok/s) step 10628/76294 | train loss 3.455184 | norm 0.3399 | lr 6.16e-04 | (3803.76 ms | 137834 tok/s) step 10629/76294 | train loss 3.361117 | norm 0.2145 | lr 6.16e-04 | (3843.79 ms | 136399 tok/s) step 10630/76294 | train loss 3.433913 | norm 0.2260 | lr 6.16e-04 | (3839.62 ms | 136547 tok/s) step 10631/76294 | train loss 3.390516 | norm 0.2057 | lr 6.16e-04 | (3821.04 ms | 137211 tok/s) step 10632/76294 | train loss 3.455253 | norm 0.2499 | lr 6.16e-04 | (3836.87 ms | 136645 tok/s) step 10633/76294 | train loss 3.433207 | norm 0.2293 | lr 6.16e-04 | (3807.00 ms | 137717 tok/s) step 10634/76294 | train loss 3.499239 | norm 0.3256 | lr 6.15e-04 | (3810.78 ms | 137580 tok/s) step 10635/76294 | train loss 3.446708 | norm 0.2803 | lr 6.15e-04 | (3835.23 ms | 136703 tok/s) step 10636/76294 | train loss 3.432265 | norm 0.2411 | lr 6.15e-04 | (3807.12 ms | 137712 tok/s) step 10637/76294 | train loss 3.434390 | norm 0.2488 | lr 6.15e-04 | (3805.75 ms | 137762 tok/s) step 10638/76294 | train loss 3.428515 | norm 0.1677 | lr 6.15e-04 | (3837.39 ms | 136626 tok/s) step 10639/76294 | train loss 3.444311 | norm 0.1967 | lr 6.15e-04 | (3805.81 ms | 137760 tok/s) step 10640/76294 | train loss 3.402876 | norm 0.2055 | lr 6.15e-04 | (3838.06 ms | 136602 tok/s) step 10641/76294 | train loss 3.371750 | 
norm 0.2014 | lr 6.15e-04 | (3822.57 ms | 137156 tok/s) step 10642/76294 | train loss 3.391622 | norm 0.1915 | lr 6.15e-04 | (3810.98 ms | 137573 tok/s) step 10643/76294 | train loss 3.563383 | norm 0.2358 | lr 6.15e-04 | (3829.22 ms | 136918 tok/s) step 10644/76294 | train loss 3.403909 | norm 0.2298 | lr 6.15e-04 | (3806.63 ms | 137730 tok/s) step 10645/76294 | train loss 3.481163 | norm 0.2044 | lr 6.14e-04 | (3824.84 ms | 137075 tok/s) step 10646/76294 | train loss 3.436846 | norm 0.2105 | lr 6.14e-04 | (3809.98 ms | 137609 tok/s) step 10647/76294 | train loss 3.452116 | norm 0.2131 | lr 6.14e-04 | (3803.38 ms | 137848 tok/s) step 10648/76294 | train loss 3.528620 | norm 0.2038 | lr 6.14e-04 | (3831.92 ms | 136821 tok/s) step 10649/76294 | train loss 3.382657 | norm 0.2047 | lr 6.14e-04 | (3805.63 ms | 137766 tok/s) step 10650/76294 | train loss 3.451288 | norm 0.2124 | lr 6.14e-04 | (3807.02 ms | 137716 tok/s) step 10651/76294 | train loss 3.427655 | norm 0.1988 | lr 6.14e-04 | (3916.41 ms | 133869 tok/s) step 10652/76294 | train loss 3.484464 | norm 0.2970 | lr 6.14e-04 | (3804.49 ms | 137808 tok/s) step 10653/76294 | train loss 3.411604 | norm 0.2514 | lr 6.14e-04 | (3852.39 ms | 136094 tok/s) step 10654/76294 | train loss 3.515749 | norm 0.2786 | lr 6.14e-04 | (3808.59 ms | 137659 tok/s) step 10655/76294 | train loss 3.427328 | norm 0.2256 | lr 6.14e-04 | (3843.93 ms | 136394 tok/s) step 10656/76294 | train loss 3.501358 | norm 0.3162 | lr 6.13e-04 | (3802.69 ms | 137873 tok/s) step 10657/76294 | train loss 3.394806 | norm 0.2498 | lr 6.13e-04 | (3806.03 ms | 137752 tok/s) step 10658/76294 | train loss 3.456367 | norm 0.2440 | lr 6.13e-04 | (3833.13 ms | 136778 tok/s) step 10659/76294 | train loss 3.389169 | norm 0.2297 | lr 6.13e-04 | (3809.22 ms | 137636 tok/s) step 10660/76294 | train loss 3.444655 | norm 0.2681 | lr 6.13e-04 | (3804.64 ms | 137802 tok/s) step 10661/76294 | train loss 3.427943 | norm 0.2673 | lr 6.13e-04 | (3832.53 ms | 136800 tok/s) 
step 10662/76294 | train loss 3.394852 | norm 0.2393 | lr 6.13e-04 | (3802.77 ms | 137870 tok/s) step 10663/76294 | train loss 3.415703 | norm 0.1827 | lr 6.13e-04 | (3811.13 ms | 137567 tok/s) step 10664/76294 | train loss 3.400670 | norm 0.2604 | lr 6.13e-04 | (3827.39 ms | 136983 tok/s) step 10665/76294 | train loss 3.401062 | norm 0.2030 | lr 6.13e-04 | (3807.18 ms | 137710 tok/s) step 10666/76294 | train loss 3.387094 | norm 0.2016 | lr 6.13e-04 | (3805.50 ms | 137771 tok/s) step 10667/76294 | train loss 3.553795 | norm 0.2461 | lr 6.12e-04 | (3844.28 ms | 136381 tok/s) step 10668/76294 | train loss 3.368014 | norm 0.1990 | lr 6.12e-04 | (3802.32 ms | 137886 tok/s) step 10669/76294 | train loss 3.454643 | norm 0.3430 | lr 6.12e-04 | (3809.21 ms | 137637 tok/s) step 10670/76294 | train loss 3.459419 | norm 0.2674 | lr 6.12e-04 | (3825.10 ms | 137065 tok/s) step 10671/76294 | train loss 3.417580 | norm 0.3128 | lr 6.12e-04 | (3808.87 ms | 137649 tok/s) step 10672/76294 | train loss 3.466197 | norm 0.2013 | lr 6.12e-04 | (3839.52 ms | 136550 tok/s) step 10673/76294 | train loss 3.413642 | norm 0.2868 | lr 6.12e-04 | (3878.84 ms | 135166 tok/s) step 10674/76294 | train loss 3.456420 | norm 0.2331 | lr 6.12e-04 | (3803.33 ms | 137850 tok/s) step 10675/76294 | train loss 3.399436 | norm 0.2424 | lr 6.12e-04 | (3834.95 ms | 136713 tok/s) step 10676/76294 | train loss 3.402102 | norm 0.2255 | lr 6.12e-04 | (3804.43 ms | 137810 tok/s) step 10677/76294 | train loss 3.420860 | norm 0.2586 | lr 6.12e-04 | (3808.10 ms | 137677 tok/s) step 10678/76294 | train loss 3.392354 | norm 0.1990 | lr 6.11e-04 | (3826.89 ms | 137001 tok/s) step 10679/76294 | train loss 3.442403 | norm 0.2558 | lr 6.11e-04 | (3810.44 ms | 137592 tok/s) step 10680/76294 | train loss 3.423113 | norm 0.2612 | lr 6.11e-04 | (3802.58 ms | 137877 tok/s) step 10681/76294 | train loss 3.417773 | norm 0.2347 | lr 6.11e-04 | (3826.36 ms | 137020 tok/s) step 10682/76294 | train loss 3.463625 | norm 0.2098 | lr 
6.11e-04 | (4598.13 ms | 114022 tok/s) step 10683/76294 | train loss 3.421759 | norm 0.1996 | lr 6.11e-04 | (3831.30 ms | 136843 tok/s) step 10684/76294 | train loss 3.372960 | norm 0.3223 | lr 6.11e-04 | (3812.89 ms | 137504 tok/s) step 10685/76294 | train loss 3.422795 | norm 0.2980 | lr 6.11e-04 | (3830.35 ms | 136877 tok/s) step 10686/76294 | train loss 3.597130 | norm 0.3497 | lr 6.11e-04 | (3821.05 ms | 137210 tok/s) step 10687/76294 | train loss 3.355194 | norm 0.2333 | lr 6.11e-04 | (3810.37 ms | 137595 tok/s) step 10688/76294 | train loss 3.386359 | norm 0.4269 | lr 6.11e-04 | (3800.97 ms | 137935 tok/s) step 10689/76294 | train loss 3.462182 | norm 0.1947 | lr 6.11e-04 | (3830.75 ms | 136863 tok/s) step 10690/76294 | train loss 3.473191 | norm 0.2907 | lr 6.10e-04 | (3801.84 ms | 137904 tok/s) step 10691/76294 | train loss 3.404585 | norm 0.2425 | lr 6.10e-04 | (3811.70 ms | 137547 tok/s) step 10692/76294 | train loss 3.392145 | norm 0.2219 | lr 6.10e-04 | (3835.14 ms | 136706 tok/s) step 10693/76294 | train loss 3.427547 | norm 0.3277 | lr 6.10e-04 | (3812.27 ms | 137527 tok/s) step 10694/76294 | train loss 3.437147 | norm 0.2386 | lr 6.10e-04 | (3841.36 ms | 136485 tok/s) step 10695/76294 | train loss 3.405955 | norm 0.3232 | lr 6.10e-04 | (3872.25 ms | 135396 tok/s) step 10696/76294 | train loss 3.367971 | norm 0.2279 | lr 6.10e-04 | (3801.39 ms | 137920 tok/s) step 10697/76294 | train loss 3.459819 | norm 0.3155 | lr 6.10e-04 | (3811.65 ms | 137549 tok/s) step 10698/76294 | train loss 3.436976 | norm 0.3090 | lr 6.10e-04 | (3805.17 ms | 137783 tok/s) step 10699/76294 | train loss 3.397460 | norm 0.2779 | lr 6.10e-04 | (3806.95 ms | 137719 tok/s) step 10700/76294 | train loss 3.455205 | norm 0.2453 | lr 6.10e-04 | (3820.90 ms | 137216 tok/s) step 10701/76294 | train loss 3.543857 | norm 0.2381 | lr 6.09e-04 | (3829.11 ms | 136921 tok/s) step 10702/76294 | train loss 3.404085 | norm 0.2972 | lr 6.09e-04 | (3823.23 ms | 137132 tok/s) step 10703/76294 | 
train loss 3.446571 | norm 0.1925 | lr 6.09e-04 | (3806.35 ms | 137740 tok/s) step 10704/76294 | train loss 3.502332 | norm 0.2757 | lr 6.09e-04 | (3904.21 ms | 134288 tok/s) step 10705/76294 | train loss 3.394146 | norm 0.1999 | lr 6.09e-04 | (3799.36 ms | 137994 tok/s) step 10706/76294 | train loss 3.476713 | norm 0.2366 | lr 6.09e-04 | (3800.37 ms | 137957 tok/s) step 10707/76294 | train loss 3.316073 | norm 0.2521 | lr 6.09e-04 | (3822.64 ms | 137154 tok/s) step 10708/76294 | train loss 3.364411 | norm 0.2149 | lr 6.09e-04 | (3799.83 ms | 137977 tok/s) step 10709/76294 | train loss 3.416554 | norm 0.2430 | lr 6.09e-04 | (3820.73 ms | 137222 tok/s) step 10710/76294 | train loss 3.355677 | norm 0.2440 | lr 6.09e-04 | (3801.20 ms | 137927 tok/s) step 10711/76294 | train loss 3.435671 | norm 0.1921 | lr 6.09e-04 | (3794.57 ms | 138168 tok/s) step 10712/76294 | train loss 3.434678 | norm 0.1880 | lr 6.08e-04 | (3829.54 ms | 136906 tok/s) step 10713/76294 | train loss 3.504788 | norm 0.1835 | lr 6.08e-04 | (3798.79 ms | 138014 tok/s) step 10714/76294 | train loss 3.453768 | norm 0.1920 | lr 6.08e-04 | (3843.02 ms | 136426 tok/s) step 10715/76294 | train loss 3.482077 | norm 0.2234 | lr 6.08e-04 | (3800.26 ms | 137961 tok/s) step 10716/76294 | train loss 3.431200 | norm 0.1908 | lr 6.08e-04 | (3842.14 ms | 136457 tok/s) step 10717/76294 | train loss 3.404925 | norm 0.1970 | lr 6.08e-04 | (3799.53 ms | 137987 tok/s) step 10718/76294 | train loss 3.428156 | norm 0.1835 | lr 6.08e-04 | (3803.64 ms | 137838 tok/s) step 10719/76294 | train loss 3.400724 | norm 0.2237 | lr 6.08e-04 | (3827.58 ms | 136977 tok/s) step 10720/76294 | train loss 3.371566 | norm 0.1961 | lr 6.08e-04 | (3804.42 ms | 137810 tok/s) step 10721/76294 | train loss 3.417939 | norm 0.2501 | lr 6.08e-04 | (3798.83 ms | 138013 tok/s) step 10722/76294 | train loss 3.491708 | norm 0.2242 | lr 6.08e-04 | (3831.54 ms | 136835 tok/s) step 10723/76294 | train loss 3.354625 | norm 0.3330 | lr 6.07e-04 | (3802.85 
ms | 137867 tok/s) step 10724/76294 | train loss 3.409293 | norm 0.2266 | lr 6.07e-04 | (3808.60 ms | 137659 tok/s) step 10725/76294 | train loss 3.430617 | norm 0.3207 | lr 6.07e-04 | (3801.03 ms | 137933 tok/s) step 10726/76294 | train loss 3.428544 | norm 0.2259 | lr 6.07e-04 | (3803.20 ms | 137854 tok/s) step 10727/76294 | train loss 3.453135 | norm 0.2262 | lr 6.07e-04 | (3827.97 ms | 136962 tok/s) step 10728/76294 | train loss 3.435545 | norm 0.2531 | lr 6.07e-04 | (3805.52 ms | 137770 tok/s) step 10729/76294 | train loss 3.352217 | norm 0.2081 | lr 6.07e-04 | (3802.24 ms | 137889 tok/s) step 10730/76294 | train loss 3.449320 | norm 0.2823 | lr 6.07e-04 | (3955.72 ms | 132539 tok/s) step 10731/76294 | train loss 3.468934 | norm 0.2141 | lr 6.07e-04 | (3884.60 ms | 134966 tok/s) step 10732/76294 | train loss 3.447853 | norm 0.2317 | lr 6.07e-04 | (3918.56 ms | 133796 tok/s) step 10733/76294 | train loss 3.371653 | norm 0.2454 | lr 6.07e-04 | (3791.56 ms | 138278 tok/s) step 10734/76294 | train loss 3.537065 | norm 0.2242 | lr 6.06e-04 | (12066.02 ms | 43452 tok/s) step 10735/76294 | train loss 3.472280 | norm 0.3174 | lr 6.06e-04 | (3836.89 ms | 136644 tok/s) step 10736/76294 | train loss 3.403221 | norm 0.1873 | lr 6.06e-04 | (3813.21 ms | 137493 tok/s) step 10737/76294 | train loss 3.440325 | norm 0.2371 | lr 6.06e-04 | (3799.75 ms | 137980 tok/s) step 10738/76294 | train loss 3.419262 | norm 0.1958 | lr 6.06e-04 | (3777.85 ms | 138779 tok/s) step 10739/76294 | train loss 3.463220 | norm 0.2368 | lr 6.06e-04 | (3780.17 ms | 138694 tok/s) step 10740/76294 | train loss 3.416335 | norm 0.2102 | lr 6.06e-04 | (3814.91 ms | 137431 tok/s) step 10741/76294 | train loss 3.384313 | norm 0.2520 | lr 6.06e-04 | (3805.48 ms | 137772 tok/s) step 10742/76294 | train loss 3.452465 | norm 0.1919 | lr 6.06e-04 | (3793.91 ms | 138192 tok/s) step 10743/76294 | train loss 3.383877 | norm 0.2429 | lr 6.06e-04 | (3787.52 ms | 138425 tok/s) step 10744/76294 | train loss 3.437344 | 
norm 0.2116 | lr 6.06e-04 | (3823.00 ms | 137140 tok/s) step 10745/76294 | train loss 3.390869 | norm 0.1942 | lr 6.05e-04 | (3793.64 ms | 138202 tok/s) step 10746/76294 | train loss 3.390446 | norm 0.2288 | lr 6.05e-04 | (5408.70 ms | 96934 tok/s) step 10747/76294 | train loss 3.369606 | norm 0.2603 | lr 6.05e-04 | (4021.11 ms | 130384 tok/s) step 10748/76294 | train loss 3.409423 | norm 0.2373 | lr 6.05e-04 | (3790.99 ms | 138298 tok/s) step 10749/76294 | train loss 3.391227 | norm 0.1938 | lr 6.05e-04 | (3879.46 ms | 135145 tok/s) step 10750/76294 | train loss 3.462377 | norm 0.1909 | lr 6.05e-04 | (3867.29 ms | 135570 tok/s) val loss: 3.415710 saving model checkpoint to ./results/gpt2-124M-gqa/step_10750.pth step 10751/76294 | train loss 3.463462 | norm 0.2223 | lr 6.05e-04 | (3810.75 ms | 137581 tok/s) step 10752/76294 | train loss 3.384204 | norm 0.1811 | lr 6.05e-04 | (3793.76 ms | 138198 tok/s) step 10753/76294 | train loss 3.408859 | norm 0.2240 | lr 6.05e-04 | (3896.74 ms | 134545 tok/s) step 10754/76294 | train loss 3.412148 | norm 0.3717 | lr 6.05e-04 | (3795.36 ms | 138139 tok/s) step 10755/76294 | train loss 3.372288 | norm 0.2002 | lr 6.05e-04 | (3833.96 ms | 136748 tok/s) step 10756/76294 | train loss 3.447929 | norm 0.3809 | lr 6.05e-04 | (3819.91 ms | 137251 tok/s) step 10757/76294 | train loss 3.398613 | norm 0.2233 | lr 6.04e-04 | (3804.56 ms | 137805 tok/s) step 10758/76294 | train loss 3.392431 | norm 0.3423 | lr 6.04e-04 | (3895.41 ms | 134591 tok/s) step 10759/76294 | train loss 3.428016 | norm 0.2609 | lr 6.04e-04 | (3797.23 ms | 138071 tok/s) step 10760/76294 | train loss 3.454221 | norm 0.2761 | lr 6.04e-04 | (3802.32 ms | 137886 tok/s) step 10761/76294 | train loss 3.512476 | norm 0.2013 | lr 6.04e-04 | (3820.08 ms | 137245 tok/s) step 10762/76294 | train loss 3.381989 | norm 0.2344 | lr 6.04e-04 | (3891.47 ms | 134728 tok/s) step 10763/76294 | train loss 3.418946 | norm 0.2023 | lr 6.04e-04 | (3798.47 ms | 138026 tok/s) step 10764/76294 
| train loss 3.395699 | norm 0.2214 | lr 6.04e-04 | (3807.36 ms | 137704 tok/s) step 10765/76294 | train loss 3.376329 | norm 0.2288 | lr 6.04e-04 | (3827.19 ms | 136990 tok/s) step 10766/76294 | train loss 3.477799 | norm 0.3327 | lr 6.04e-04 | (3861.34 ms | 135779 tok/s) step 10767/76294 | train loss 3.491584 | norm 0.2324 | lr 6.04e-04 | (3824.25 ms | 137096 tok/s) step 10768/76294 | train loss 3.406598 | norm 0.3083 | lr 6.03e-04 | (3804.56 ms | 137805 tok/s) step 10769/76294 | train loss 3.391562 | norm 0.2231 | lr 6.03e-04 | (3828.90 ms | 136929 tok/s) step 10770/76294 | train loss 3.421740 | norm 0.2283 | lr 6.03e-04 | (3832.07 ms | 136816 tok/s) step 10771/76294 | train loss 3.405023 | norm 0.2740 | lr 6.03e-04 | (3807.29 ms | 137706 tok/s) step 10772/76294 | train loss 3.393219 | norm 0.1993 | lr 6.03e-04 | (3838.48 ms | 136587 tok/s) step 10773/76294 | train loss 3.591605 | norm 0.3391 | lr 6.03e-04 | (3864.98 ms | 135651 tok/s) step 10774/76294 | train loss 3.458555 | norm 0.2714 | lr 6.03e-04 | (3808.19 ms | 137674 tok/s) step 10775/76294 | train loss 3.462979 | norm 0.2258 | lr 6.03e-04 | (3835.24 ms | 136703 tok/s) step 10776/76294 | train loss 3.397874 | norm 0.2298 | lr 6.03e-04 | (3809.90 ms | 137612 tok/s) step 10777/76294 | train loss 3.429201 | norm 0.1991 | lr 6.03e-04 | (3898.08 ms | 134499 tok/s) step 10778/76294 | train loss 3.500212 | norm 0.2448 | lr 6.03e-04 | (3801.48 ms | 137917 tok/s) step 10779/76294 | train loss 3.420229 | norm 0.2635 | lr 6.02e-04 | (3828.25 ms | 136952 tok/s) step 10780/76294 | train loss 3.401168 | norm 0.2452 | lr 6.02e-04 | (3830.65 ms | 136867 tok/s) step 10781/76294 | train loss 3.498338 | norm 0.3095 | lr 6.02e-04 | (3813.69 ms | 137475 tok/s) step 10782/76294 | train loss 3.454206 | norm 0.3159 | lr 6.02e-04 | (3812.66 ms | 137512 tok/s) step 10783/76294 | train loss 3.434002 | norm 0.2166 | lr 6.02e-04 | (3848.91 ms | 136217 tok/s) step 10784/76294 | train loss 3.389238 | norm 0.2584 | lr 6.02e-04 | 
(3812.65 ms | 137513 tok/s) step 10785/76294 | train loss 3.394559 | norm 0.1798 | lr 6.02e-04 | (3845.81 ms | 136327 tok/s) step 10786/76294 | train loss 3.436161 | norm 0.1949 | lr 6.02e-04 | (3813.53 ms | 137481 tok/s) step 10787/76294 | train loss 3.416442 | norm 0.2132 | lr 6.02e-04 | (3867.32 ms | 135569 tok/s) step 10788/76294 | train loss 3.398125 | norm 0.2739 | lr 6.02e-04 | (3849.45 ms | 136198 tok/s) step 10789/76294 | train loss 3.416466 | norm 0.2240 | lr 6.02e-04 | (3833.22 ms | 136775 tok/s) step 10790/76294 | train loss 3.393285 | norm 0.2883 | lr 6.01e-04 | (3831.37 ms | 136841 tok/s) step 10791/76294 | train loss 3.462900 | norm 0.1893 | lr 6.01e-04 | (3810.86 ms | 137577 tok/s) step 10792/76294 | train loss 3.470242 | norm 0.3836 | lr 6.01e-04 | (3809.97 ms | 137610 tok/s) step 10793/76294 | train loss 3.397099 | norm 0.1780 | lr 6.01e-04 | (3832.76 ms | 136791 tok/s) step 10794/76294 | train loss 3.390326 | norm 0.2705 | lr 6.01e-04 | (3839.48 ms | 136552 tok/s) step 10795/76294 | train loss 3.444830 | norm 0.2466 | lr 6.01e-04 | (3829.42 ms | 136911 tok/s) step 10796/76294 | train loss 3.434890 | norm 0.2104 | lr 6.01e-04 | (3807.94 ms | 137683 tok/s) step 10797/76294 | train loss 3.438629 | norm 0.2583 | lr 6.01e-04 | (3814.32 ms | 137453 tok/s) step 10798/76294 | train loss 3.444623 | norm 0.2013 | lr 6.01e-04 | (3838.72 ms | 136579 tok/s) step 10799/76294 | train loss 3.378357 | norm 0.1954 | lr 6.01e-04 | (3821.25 ms | 137203 tok/s) step 10800/76294 | train loss 3.326509 | norm 0.2282 | lr 6.01e-04 | (3806.98 ms | 137717 tok/s) step 10801/76294 | train loss 3.729119 | norm 0.2142 | lr 6.00e-04 | (3841.77 ms | 136471 tok/s) step 10802/76294 | train loss 3.471606 | norm 0.2489 | lr 6.00e-04 | (3804.38 ms | 137812 tok/s) step 10803/76294 | train loss 3.368163 | norm 0.2391 | lr 6.00e-04 | (3809.83 ms | 137614 tok/s) step 10804/76294 | train loss 3.361648 | norm 0.2334 | lr 6.00e-04 | (3832.08 ms | 136815 tok/s) step 10805/76294 | train loss 
3.436244 | norm 0.2715 | lr 6.00e-04 | (3808.50 ms | 137662 tok/s) step 10806/76294 | train loss 3.487314 | norm 0.2356 | lr 6.00e-04 | (3807.82 ms | 137687 tok/s) step 10807/76294 | train loss 3.411397 | norm 0.2273 | lr 6.00e-04 | (3809.53 ms | 137625 tok/s) step 10808/76294 | train loss 3.395965 | norm 0.2013 | lr 6.00e-04 | (3827.53 ms | 136978 tok/s) step 10809/76294 | train loss 3.398222 | norm 0.2566 | lr 6.00e-04 | (3823.16 ms | 137135 tok/s) step 10810/76294 | train loss 3.396624 | norm 0.2484 | lr 6.00e-04 | (3807.53 ms | 137698 tok/s) step 10811/76294 | train loss 3.427418 | norm 0.2262 | lr 6.00e-04 | (3803.60 ms | 137840 tok/s) step 10812/76294 | train loss 3.423785 | norm 0.3428 | lr 5.99e-04 | (3831.96 ms | 136820 tok/s) step 10813/76294 | train loss 3.435223 | norm 0.2760 | lr 5.99e-04 | (3809.45 ms | 137628 tok/s) step 10814/76294 | train loss 3.394403 | norm 0.3404 | lr 5.99e-04 | (3807.78 ms | 137689 tok/s) step 10815/76294 | train loss 3.416071 | norm 0.3738 | lr 5.99e-04 | (3922.44 ms | 133664 tok/s) step 10816/76294 | train loss 3.367950 | norm 0.1886 | lr 5.99e-04 | (3842.88 ms | 136431 tok/s) step 10817/76294 | train loss 3.495722 | norm 0.2790 | lr 5.99e-04 | (3811.14 ms | 137567 tok/s) step 10818/76294 | train loss 3.473042 | norm 0.3167 | lr 5.99e-04 | (3807.00 ms | 137717 tok/s) step 10819/76294 | train loss 3.384006 | norm 0.2111 | lr 5.99e-04 | (3833.83 ms | 136753 tok/s) step 10820/76294 | train loss 3.380418 | norm 0.2254 | lr 5.99e-04 | (3811.13 ms | 137568 tok/s) step 10821/76294 | train loss 3.381341 | norm 0.2387 | lr 5.99e-04 | (3809.82 ms | 137615 tok/s) step 10822/76294 | train loss 3.525349 | norm 0.1924 | lr 5.99e-04 | (3829.49 ms | 136908 tok/s) step 10823/76294 | train loss 3.428226 | norm 0.2176 | lr 5.99e-04 | (6533.33 ms | 80248 tok/s) step 10824/76294 | train loss 3.493861 | norm 0.1914 | lr 5.98e-04 | (3957.78 ms | 132470 tok/s) step 10825/76294 | train loss 3.410996 | norm 0.2225 | lr 5.98e-04 | (3793.00 ms | 138225 
tok/s) step 10826/76294 | train loss 3.424632 | norm 0.1730 | lr 5.98e-04 | (3801.50 ms | 137916 tok/s) step 10827/76294 | train loss 3.398274 | norm 0.2078 | lr 5.98e-04 | (3796.21 ms | 138108 tok/s) step 10828/76294 | train loss 3.512080 | norm 0.2095 | lr 5.98e-04 | (3820.88 ms | 137217 tok/s) step 10829/76294 | train loss 3.407180 | norm 0.2994 | lr 5.98e-04 | (3820.28 ms | 137238 tok/s) step 10830/76294 | train loss 3.453218 | norm 0.2163 | lr 5.98e-04 | (3803.58 ms | 137841 tok/s) step 10831/76294 | train loss 3.398932 | norm 0.1983 | lr 5.98e-04 | (3796.06 ms | 138114 tok/s) step 10832/76294 | train loss 3.410288 | norm 0.1872 | lr 5.98e-04 | (3849.35 ms | 136202 tok/s) step 10833/76294 | train loss 3.401076 | norm 0.1981 | lr 5.98e-04 | (3800.74 ms | 137944 tok/s) step 10834/76294 | train loss 3.434508 | norm 0.2635 | lr 5.98e-04 | (3827.24 ms | 136989 tok/s) step 10835/76294 | train loss 3.474342 | norm 0.2604 | lr 5.97e-04 | (3890.29 ms | 134768 tok/s) step 10836/76294 | train loss 3.392002 | norm 0.1888 | lr 5.97e-04 | (3798.43 ms | 138028 tok/s) step 10837/76294 | train loss 3.379817 | norm 0.2169 | lr 5.97e-04 | (3804.04 ms | 137824 tok/s) step 10838/76294 | train loss 3.427344 | norm 0.1757 | lr 5.97e-04 | (3819.73 ms | 137258 tok/s) step 10839/76294 | train loss 3.495377 | norm 0.1888 | lr 5.97e-04 | (3808.86 ms | 137650 tok/s) step 10840/76294 | train loss 3.410793 | norm 0.1756 | lr 5.97e-04 | (3803.79 ms | 137833 tok/s) step 10841/76294 | train loss 3.507773 | norm 0.2196 | lr 5.97e-04 | (3825.07 ms | 137066 tok/s) step 10842/76294 | train loss 3.540639 | norm 0.2217 | lr 5.97e-04 | (3802.51 ms | 137879 tok/s) step 10843/76294 | train loss 3.453210 | norm 0.2461 | lr 5.97e-04 | (3824.60 ms | 137083 tok/s) step 10844/76294 | train loss 3.440113 | norm 0.2476 | lr 5.97e-04 | (3802.13 ms | 137893 tok/s) step 10845/76294 | train loss 3.482342 | norm 0.3465 | lr 5.97e-04 | (3806.96 ms | 137718 tok/s) step 10846/76294 | train loss 3.437358 | norm 0.2134 
| lr 5.96e-04 | (5378.98 ms | 97470 tok/s) step 10847/76294 | train loss 3.465441 | norm 0.4060 | lr 5.96e-04 | (3799.16 ms | 138001 tok/s) step 10848/76294 | train loss 3.498005 | norm 0.2081 | lr 5.96e-04 | (3799.49 ms | 137989 tok/s) step 10849/76294 | train loss 3.618961 | norm 0.3078 | lr 5.96e-04 | (3827.80 ms | 136969 tok/s) step 10850/76294 | train loss 3.445145 | norm 0.2041 | lr 5.96e-04 | (3801.30 ms | 137923 tok/s) step 10851/76294 | train loss 3.452393 | norm 0.2776 | lr 5.96e-04 | (3807.29 ms | 137707 tok/s) step 10852/76294 | train loss 3.476696 | norm 0.2018 | lr 5.96e-04 | (3828.32 ms | 136950 tok/s) step 10853/76294 | train loss 3.546300 | norm 0.2325 | lr 5.96e-04 | (3805.19 ms | 137782 tok/s) step 10854/76294 | train loss 3.441771 | norm 0.2085 | lr 5.96e-04 | (3803.99 ms | 137826 tok/s) step 10855/76294 | train loss 3.421469 | norm 0.2202 | lr 5.96e-04 | (3835.36 ms | 136698 tok/s) step 10856/76294 | train loss 3.443676 | norm 0.2538 | lr 5.96e-04 | (3873.40 ms | 135356 tok/s) step 10857/76294 | train loss 3.440823 | norm 0.2573 | lr 5.95e-04 | (3803.63 ms | 137839 tok/s) step 10858/76294 | train loss 3.341644 | norm 0.2613 | lr 5.95e-04 | (4060.92 ms | 129106 tok/s) step 10859/76294 | train loss 3.415006 | norm 0.2833 | lr 5.95e-04 | (3819.69 ms | 137259 tok/s) step 10860/76294 | train loss 3.465791 | norm 0.2470 | lr 5.95e-04 | (3819.55 ms | 137264 tok/s) step 10861/76294 | train loss 3.484394 | norm 0.2524 | lr 5.95e-04 | (3821.07 ms | 137210 tok/s) step 10862/76294 | train loss 3.373446 | norm 0.3089 | lr 5.95e-04 | (3798.68 ms | 138018 tok/s) step 10863/76294 | train loss 3.412240 | norm 0.2459 | lr 5.95e-04 | (3797.42 ms | 138064 tok/s) step 10864/76294 | train loss 3.426820 | norm 0.3156 | lr 5.95e-04 | (3854.05 ms | 136036 tok/s) step 10865/76294 | train loss 3.429602 | norm 0.3072 | lr 5.95e-04 | (3823.44 ms | 137125 tok/s) step 10866/76294 | train loss 3.427552 | norm 0.1986 | lr 5.95e-04 | (3796.70 ms | 138090 tok/s) step 10867/76294 
| train loss 3.394886 | norm 0.2308 | lr 5.95e-04 | (3834.70 ms | 136722 tok/s) step 10868/76294 | train loss 3.444038 | norm 0.2046 | lr 5.94e-04 | (3799.90 ms | 137974 tok/s) step 10869/76294 | train loss 3.455498 | norm 0.2063 | lr 5.94e-04 | (3807.90 ms | 137684 tok/s) step 10870/76294 | train loss 3.400518 | norm 0.2052 | lr 5.94e-04 | (3825.60 ms | 137047 tok/s) step 10871/76294 | train loss 3.410197 | norm 0.2090 | lr 5.94e-04 | (3807.08 ms | 137714 tok/s) step 10872/76294 | train loss 3.431267 | norm 0.1980 | lr 5.94e-04 | (4186.76 ms | 125225 tok/s) step 10873/76294 | train loss 3.449186 | norm 0.2323 | lr 5.94e-04 | (3796.92 ms | 138082 tok/s) step 10874/76294 | train loss 3.392075 | norm 0.1976 | lr 5.94e-04 | (3802.64 ms | 137875 tok/s) step 10875/76294 | train loss 3.510887 | norm 0.2010 | lr 5.94e-04 | (3822.10 ms | 137173 tok/s) step 10876/76294 | train loss 3.451726 | norm 0.2265 | lr 5.94e-04 | (3892.83 ms | 134680 tok/s) step 10877/76294 | train loss 3.399066 | norm 0.2339 | lr 5.94e-04 | (3800.35 ms | 137958 tok/s) step 10878/76294 | train loss 3.434129 | norm 0.2334 | lr 5.94e-04 | (3809.48 ms | 137627 tok/s) step 10879/76294 | train loss 3.469371 | norm 0.2370 | lr 5.94e-04 | (3801.83 ms | 137904 tok/s) step 10880/76294 | train loss 3.410240 | norm 0.2912 | lr 5.93e-04 | (3923.11 ms | 133641 tok/s) step 10881/76294 | train loss 3.387345 | norm 0.2355 | lr 5.93e-04 | (3794.07 ms | 138186 tok/s) step 10882/76294 | train loss 3.404852 | norm 0.3063 | lr 5.93e-04 | (3799.20 ms | 138000 tok/s) step 10883/76294 | train loss 3.388547 | norm 0.2256 | lr 5.93e-04 | (3817.68 ms | 137332 tok/s) step 10884/76294 | train loss 3.438080 | norm 0.2151 | lr 5.93e-04 | (3802.36 ms | 137885 tok/s) step 10885/76294 | train loss 3.499512 | norm 0.2627 | lr 5.93e-04 | (3797.55 ms | 138059 tok/s) step 10886/76294 | train loss 3.460460 | norm 0.2403 | lr 5.93e-04 | (3832.32 ms | 136807 tok/s) step 10887/76294 | train loss 3.406043 | norm 0.2748 | lr 5.93e-04 | 
(3799.06 ms | 138005 tok/s) step 10888/76294 | train loss 3.473493 | norm 0.2004 | lr 5.93e-04 | (3817.05 ms | 137354 tok/s) step 10889/76294 | train loss 3.416322 | norm 0.3649 | lr 5.93e-04 | (3830.16 ms | 136884 tok/s) step 10890/76294 | train loss 3.373548 | norm 0.2009 | lr 5.93e-04 | (3804.64 ms | 137802 tok/s) step 10891/76294 | train loss 3.467181 | norm 0.2644 | lr 5.92e-04 | (3801.07 ms | 137932 tok/s) step 10892/76294 | train loss 3.420548 | norm 0.1912 | lr 5.92e-04 | (3830.10 ms | 136886 tok/s) step 10893/76294 | train loss 3.442689 | norm 0.2070 | lr 5.92e-04 | (3800.70 ms | 137945 tok/s) step 10894/76294 | train loss 3.426373 | norm 0.1955 | lr 5.92e-04 | (3827.42 ms | 136982 tok/s) step 10895/76294 | train loss 3.461276 | norm 0.1887 | lr 5.92e-04 | (3800.44 ms | 137954 tok/s) step 10896/76294 | train loss 3.375224 | norm 0.2278 | lr 5.92e-04 | (3878.39 ms | 135182 tok/s) step 10897/76294 | train loss 3.415781 | norm 0.2294 | lr 5.92e-04 | (3808.06 ms | 137679 tok/s) step 10898/76294 | train loss 3.585355 | norm 0.2004 | lr 5.92e-04 | (3852.57 ms | 136088 tok/s) step 10899/76294 | train loss 3.423619 | norm 0.2644 | lr 5.92e-04 | (3802.19 ms | 137891 tok/s) step 10900/76294 | train loss 3.411612 | norm 0.1893 | lr 5.92e-04 | (3848.43 ms | 136234 tok/s) step 10901/76294 | train loss 3.345594 | norm 0.2993 | lr 5.92e-04 | (3804.14 ms | 137820 tok/s) step 10902/76294 | train loss 3.484846 | norm 0.2225 | lr 5.91e-04 | (3829.92 ms | 136893 tok/s) step 10903/76294 | train loss 3.418730 | norm 0.2731 | lr 5.91e-04 | (3803.58 ms | 137841 tok/s) step 10904/76294 | train loss 3.459837 | norm 0.2444 | lr 5.91e-04 | (3818.71 ms | 137295 tok/s) step 10905/76294 | train loss 3.430430 | norm 0.2479 | lr 5.91e-04 | (3864.01 ms | 135685 tok/s) step 10906/76294 | train loss 3.420020 | norm 0.2275 | lr 5.91e-04 | (3806.79 ms | 137725 tok/s) step 10907/76294 | train loss 3.373979 | norm 0.1912 | lr 5.91e-04 | (3801.63 ms | 137911 tok/s) step 10908/76294 | train loss 
3.440429 | norm 0.2151 | lr 5.91e-04 | (3836.16 ms | 136670 tok/s) step 10909/76294 | train loss 3.432878 | norm 0.2994 | lr 5.91e-04 | (3805.80 ms | 137760 tok/s) step 10910/76294 | train loss 3.382739 | norm 0.2703 | lr 5.91e-04 | (3837.44 ms | 136624 tok/s) step 10911/76294 | train loss 3.441802 | norm 0.3321 | lr 5.91e-04 | (3825.88 ms | 137037 tok/s) step 10912/76294 | train loss 3.405067 | norm 0.3438 | lr 5.91e-04 | (3811.86 ms | 137541 tok/s) step 10913/76294 | train loss 3.406123 | norm 0.2521 | lr 5.90e-04 | (3811.02 ms | 137572 tok/s) step 10914/76294 | train loss 3.374703 | norm 0.4017 | lr 5.90e-04 | (3803.70 ms | 137836 tok/s) step 10915/76294 | train loss 3.531529 | norm 0.2883 | lr 5.90e-04 | (3809.58 ms | 137624 tok/s) step 10916/76294 | train loss 3.389781 | norm 0.2850 | lr 5.90e-04 | (3803.40 ms | 137847 tok/s) step 10917/76294 | train loss 3.460932 | norm 0.2194 | lr 5.90e-04 | (3811.65 ms | 137549 tok/s) step 10918/76294 | train loss 3.414765 | norm 0.1990 | lr 5.90e-04 | (3860.59 ms | 135805 tok/s) step 10919/76294 | train loss 3.462207 | norm 0.2241 | lr 5.90e-04 | (3812.94 ms | 137502 tok/s) step 10920/76294 | train loss 3.398956 | norm 0.2024 | lr 5.90e-04 | (3830.47 ms | 136873 tok/s) step 10921/76294 | train loss 3.498636 | norm 0.2198 | lr 5.90e-04 | (3804.35 ms | 137813 tok/s) step 10922/76294 | train loss 3.398637 | norm 0.2657 | lr 5.90e-04 | (3821.57 ms | 137192 tok/s) step 10923/76294 | train loss 3.362803 | norm 0.2093 | lr 5.90e-04 | (3833.02 ms | 136782 tok/s) step 10924/76294 | train loss 3.383090 | norm 0.2653 | lr 5.89e-04 | (3808.42 ms | 137665 tok/s) step 10925/76294 | train loss 3.399324 | norm 0.1968 | lr 5.89e-04 | (3895.20 ms | 134599 tok/s) step 10926/76294 | train loss 3.455131 | norm 0.2514 | lr 5.89e-04 | (3793.71 ms | 138199 tok/s) step 10927/76294 | train loss 3.407666 | norm 0.2048 | lr 5.89e-04 | (3820.37 ms | 137235 tok/s) step 10928/76294 | train loss 3.448934 | norm 0.2073 | lr 5.89e-04 | (3851.45 ms | 136127 
tok/s) step 10929/76294 | train loss 3.423517 | norm 0.2238 | lr 5.89e-04 | (3795.41 ms | 138137 tok/s) step 10930/76294 | train loss 3.416206 | norm 0.2052 | lr 5.89e-04 | (3849.77 ms | 136187 tok/s) step 10931/76294 | train loss 3.458321 | norm 0.2103 | lr 5.89e-04 | (3795.99 ms | 138116 tok/s) step 10932/76294 | train loss 3.351773 | norm 0.2493 | lr 5.89e-04 | (3844.80 ms | 136363 tok/s) step 10933/76294 | train loss 3.492383 | norm 0.2779 | lr 5.89e-04 | (3796.06 ms | 138114 tok/s) step 10934/76294 | train loss 3.448228 | norm 0.2646 | lr 5.89e-04 | (3800.95 ms | 137936 tok/s) step 10935/76294 | train loss 3.373981 | norm 0.2333 | lr 5.89e-04 | (3820.19 ms | 137241 tok/s) step 10936/76294 | train loss 3.387205 | norm 0.2059 | lr 5.88e-04 | (3804.79 ms | 137797 tok/s) step 10937/76294 | train loss 3.389629 | norm 0.3096 | lr 5.88e-04 | (3834.68 ms | 136723 tok/s) step 10938/76294 | train loss 3.437503 | norm 0.2146 | lr 5.88e-04 | (3859.63 ms | 135839 tok/s) step 10939/76294 | train loss 3.420662 | norm 0.2204 | lr 5.88e-04 | (3801.38 ms | 137920 tok/s) step 10940/76294 | train loss 3.389127 | norm 0.2518 | lr 5.88e-04 | (3803.40 ms | 137847 tok/s) step 10941/76294 | train loss 3.490874 | norm 0.2232 | lr 5.88e-04 | (3822.77 ms | 137149 tok/s) step 10942/76294 | train loss 3.442625 | norm 0.3831 | lr 5.88e-04 | (3803.49 ms | 137844 tok/s) step 10943/76294 | train loss 3.402721 | norm 0.2960 | lr 5.88e-04 | (3806.09 ms | 137750 tok/s) step 10944/76294 | train loss 3.407250 | norm 0.2729 | lr 5.88e-04 | (3808.36 ms | 137668 tok/s) step 10945/76294 | train loss 3.470628 | norm 0.4451 | lr 5.88e-04 | (4411.05 ms | 118858 tok/s) step 10946/76294 | train loss 3.389097 | norm 0.2517 | lr 5.88e-04 | (3805.48 ms | 137772 tok/s) step 10947/76294 | train loss 3.445862 | norm 0.3500 | lr 5.87e-04 | (3799.72 ms | 137981 tok/s) step 10948/76294 | train loss 3.441695 | norm 0.2489 | lr 5.87e-04 | (3826.51 ms | 137015 tok/s) step 10949/76294 | train loss 3.502096 | norm 0.2864 
| lr 5.87e-04 | (3800.90 ms | 137938 tok/s) step 10950/76294 | train loss 3.452084 | norm 0.2831 | lr 5.87e-04 | (3806.45 ms | 137737 tok/s) step 10951/76294 | train loss 3.439550 | norm 0.2060 | lr 5.87e-04 | (3829.24 ms | 136917 tok/s) step 10952/76294 | train loss 3.469581 | norm 0.1961 | lr 5.87e-04 | (3806.10 ms | 137749 tok/s) step 10953/76294 | train loss 3.379359 | norm 0.2074 | lr 5.87e-04 | (3807.51 ms | 137698 tok/s) step 10954/76294 | train loss 3.464588 | norm 0.1988 | lr 5.87e-04 | (3814.58 ms | 137443 tok/s) step 10955/76294 | train loss 3.458110 | norm 0.1832 | lr 5.87e-04 | (3803.11 ms | 137858 tok/s) step 10956/76294 | train loss 3.436415 | norm 0.1853 | lr 5.87e-04 | (3839.17 ms | 136563 tok/s) step 10957/76294 | train loss 3.412906 | norm 0.2331 | lr 5.87e-04 | (3807.41 ms | 137702 tok/s) step 10958/76294 | train loss 3.376265 | norm 0.1914 | lr 5.86e-04 | (3826.15 ms | 137028 tok/s) step 10959/76294 | train loss 3.429287 | norm 0.2194 | lr 5.86e-04 | (3863.41 ms | 135706 tok/s) step 10960/76294 | train loss 3.423051 | norm 0.2031 | lr 5.86e-04 | (3805.90 ms | 137757 tok/s) step 10961/76294 | train loss 3.386127 | norm 0.2021 | lr 5.86e-04 | (3886.51 ms | 134900 tok/s) step 10962/76294 | train loss 3.483645 | norm 0.2364 | lr 5.86e-04 | (3830.22 ms | 136882 tok/s) step 10963/76294 | train loss 3.435971 | norm 0.1853 | lr 5.86e-04 | (3806.85 ms | 137722 tok/s) step 10964/76294 | train loss 3.337367 | norm 0.2546 | lr 5.86e-04 | (3808.22 ms | 137673 tok/s) step 10965/76294 | train loss 3.481071 | norm 0.2079 | lr 5.86e-04 | (3811.85 ms | 137541 tok/s) step 10966/76294 | train loss 3.521526 | norm 0.2450 | lr 5.86e-04 | (3813.34 ms | 137488 tok/s) step 10967/76294 | train loss 3.439604 | norm 0.1983 | lr 5.86e-04 | (3808.27 ms | 137671 tok/s) step 10968/76294 | train loss 3.433259 | norm 0.2474 | lr 5.86e-04 | (3803.35 ms | 137849 tok/s) step 10969/76294 | train loss 3.400510 | norm 0.2722 | lr 5.85e-04 | (3840.96 ms | 136499 tok/s) step 
10970/76294 | train loss 3.425490 | norm 0.3827 | lr 5.85e-04 | (3806.27 ms | 137743 tok/s) step 10971/76294 | train loss 3.385020 | norm 0.2445 | lr 5.85e-04 | (3809.18 ms | 137638 tok/s) step 10972/76294 | train loss 3.518339 | norm 0.2421 | lr 5.85e-04 | (3847.73 ms | 136259 tok/s) step 10973/76294 | train loss 3.451469 | norm 0.2661 | lr 5.85e-04 | (3806.02 ms | 137752 tok/s) step 10974/76294 | train loss 3.408005 | norm 0.2403 | lr 5.85e-04 | (3811.75 ms | 137545 tok/s) step 10975/76294 | train loss 3.479370 | norm 0.2218 | lr 5.85e-04 | (3806.94 ms | 137719 tok/s) step 10976/76294 | train loss 3.406278 | norm 0.2144 | lr 5.85e-04 | (3821.45 ms | 137196 tok/s) step 10977/76294 | train loss 3.330607 | norm 0.2610 | lr 5.85e-04 | (3810.68 ms | 137584 tok/s) step 10978/76294 | train loss 3.445067 | norm 0.1928 | lr 5.85e-04 | (3815.82 ms | 137399 tok/s) step 10979/76294 | train loss 3.439060 | norm 0.2036 | lr 5.85e-04 | (3805.38 ms | 137776 tok/s) step 10980/76294 | train loss 3.381409 | norm 0.2052 | lr 5.84e-04 | (3811.05 ms | 137571 tok/s) step 10981/76294 | train loss 3.433146 | norm 0.2115 | lr 5.84e-04 | (3966.18 ms | 132190 tok/s) step 10982/76294 | train loss 3.598210 | norm 0.1890 | lr 5.84e-04 | (3803.37 ms | 137848 tok/s) step 10983/76294 | train loss 3.437995 | norm 0.3783 | lr 5.84e-04 | (3814.14 ms | 137459 tok/s) step 10984/76294 | train loss 3.407623 | norm 0.2270 | lr 5.84e-04 | (3802.89 ms | 137866 tok/s) step 10985/76294 | train loss 3.411302 | norm 0.2111 | lr 5.84e-04 | (3853.04 ms | 136071 tok/s) step 10986/76294 | train loss 3.395120 | norm 0.2480 | lr 5.84e-04 | (3805.16 ms | 137783 tok/s) step 10987/76294 | train loss 3.472492 | norm 0.2294 | lr 5.84e-04 | (3819.95 ms | 137250 tok/s) step 10988/76294 | train loss 3.418657 | norm 0.2011 | lr 5.84e-04 | (3801.58 ms | 137913 tok/s) step 10989/76294 | train loss 3.392966 | norm 0.2481 | lr 5.84e-04 | (3827.41 ms | 136983 tok/s) step 10990/76294 | train loss 3.404839 | norm 0.1958 | lr 
5.84e-04 | (3804.09 ms | 137822 tok/s) step 10991/76294 | train loss 3.398985 | norm 0.2172 | lr 5.84e-04 | (3807.81 ms | 137688 tok/s) step 10992/76294 | train loss 3.435706 | norm 0.1988 | lr 5.83e-04 | (3823.05 ms | 137139 tok/s) step 10993/76294 | train loss 3.394702 | norm 0.2022 | lr 5.83e-04 | (3829.42 ms | 136911 tok/s) step 10994/76294 | train loss 3.410900 | norm 0.2762 | lr 5.83e-04 | (3805.57 ms | 137768 tok/s) step 10995/76294 | train loss 3.456947 | norm 0.2433 | lr 5.83e-04 | (3806.51 ms | 137735 tok/s) step 10996/76294 | train loss 3.443960 | norm 0.2433 | lr 5.83e-04 | (3806.85 ms | 137722 tok/s) step 10997/76294 | train loss 3.403815 | norm 0.2383 | lr 5.83e-04 | (3868.41 ms | 135531 tok/s) step 10998/76294 | train loss 3.409086 | norm 0.1996 | lr 5.83e-04 | (3884.50 ms | 134969 tok/s) step 10999/76294 | train loss 3.397262 | norm 0.2211 | lr 5.83e-04 | (3871.20 ms | 135433 tok/s) step 11000/76294 | train loss 3.498579 | norm 0.2028 | lr 5.83e-04 | (3791.16 ms | 138292 tok/s) val loss: 3.411108 saving model checkpoint to ./results/gpt2-124M-gqa/step_11000.pth step 11001/76294 | train loss 3.457350 | norm 0.2436 | lr 5.83e-04 | (3817.44 ms | 137340 tok/s) step 11002/76294 | train loss 3.393847 | norm 0.1967 | lr 5.83e-04 | (3788.64 ms | 138384 tok/s) step 11003/76294 | train loss 3.403677 | norm 0.3404 | lr 5.82e-04 | (3819.49 ms | 137266 tok/s) step 11004/76294 | train loss 3.403361 | norm 0.1938 | lr 5.82e-04 | (3788.49 ms | 138390 tok/s) step 11005/76294 | train loss 3.414763 | norm 0.2385 | lr 5.82e-04 | (3857.15 ms | 135926 tok/s) step 11006/76294 | train loss 3.382535 | norm 0.2215 | lr 5.82e-04 | (3793.77 ms | 138197 tok/s) step 11007/76294 | train loss 3.326561 | norm 0.2049 | lr 5.82e-04 | (6311.10 ms | 83074 tok/s) step 11008/76294 | train loss 3.547695 | norm 0.2908 | lr 5.82e-04 | (3795.45 ms | 138136 tok/s) step 11009/76294 | train loss 3.403841 | norm 0.2505 | lr 5.82e-04 | (3818.60 ms | 137298 tok/s) step 11010/76294 | train loss 
3.405744 | norm 0.3770 | lr 5.82e-04 | (3876.16 ms | 135260 tok/s) step 11011/76294 | train loss 3.437202 | norm 0.2320 | lr 5.82e-04 | (3808.48 ms | 137663 tok/s) step 11012/76294 | train loss 3.467707 | norm 0.3895 | lr 5.82e-04 | (3806.52 ms | 137734 tok/s) step 11013/76294 | train loss 3.440898 | norm 0.2739 | lr 5.82e-04 | (3781.07 ms | 138661 tok/s) step 11014/76294 | train loss 3.505435 | norm 0.3344 | lr 5.81e-04 | (3811.81 ms | 137543 tok/s) step 11015/76294 | train loss 3.395031 | norm 0.3789 | lr 5.81e-04 | (3788.42 ms | 138392 tok/s) step 11016/76294 | train loss 3.357017 | norm 0.2452 | lr 5.81e-04 | (3790.75 ms | 138307 tok/s) step 11017/76294 | train loss 3.366601 | norm 0.3449 | lr 5.81e-04 | (3815.54 ms | 137409 tok/s) step 11018/76294 | train loss 3.367871 | norm 0.2313 | lr 5.81e-04 | (3797.62 ms | 138057 tok/s) step 11019/76294 | train loss 3.504991 | norm 0.2425 | lr 5.81e-04 | (3813.54 ms | 137481 tok/s) step 11020/76294 | train loss 3.336309 | norm 0.2810 | lr 5.81e-04 | (3792.56 ms | 138241 tok/s) step 11021/76294 | train loss 3.465402 | norm 0.2219 | lr 5.81e-04 | (3795.92 ms | 138119 tok/s) step 11022/76294 | train loss 3.464278 | norm 0.2433 | lr 5.81e-04 | (3824.32 ms | 137093 tok/s) step 11023/76294 | train loss 3.381719 | norm 0.2034 | lr 5.81e-04 | (3794.42 ms | 138173 tok/s) step 11024/76294 | train loss 3.355613 | norm 0.3027 | lr 5.81e-04 | (3827.28 ms | 136987 tok/s) step 11025/76294 | train loss 3.452157 | norm 0.2179 | lr 5.80e-04 | (3802.45 ms | 137882 tok/s) step 11026/76294 | train loss 3.430484 | norm 0.3206 | lr 5.80e-04 | (8320.83 ms | 63009 tok/s) step 11027/76294 | train loss 3.422159 | norm 0.2294 | lr 5.80e-04 | (3866.31 ms | 135604 tok/s) step 11028/76294 | train loss 3.468808 | norm 0.2473 | lr 5.80e-04 | (3786.55 ms | 138461 tok/s) step 11029/76294 | train loss 3.470386 | norm 0.3454 | lr 5.80e-04 | (3784.81 ms | 138524 tok/s) step 11030/76294 | train loss 3.428944 | norm 0.1806 | lr 5.80e-04 | (3819.16 ms | 137278 
tok/s) step 11031/76294 | train loss 3.431943 | norm 0.3284 | lr 5.80e-04 | (3783.78 ms | 138562 tok/s) step 11032/76294 | train loss 3.418349 | norm 0.2029 | lr 5.80e-04 | (3791.18 ms | 138291 tok/s) step 11033/76294 | train loss 3.402588 | norm 0.3631 | lr 5.80e-04 | (3808.29 ms | 137670 tok/s) step 11034/76294 | train loss 3.402501 | norm 0.2118 | lr 5.80e-04 | (3795.11 ms | 138148 tok/s) step 11035/76294 | train loss 3.387305 | norm 0.2031 | lr 5.80e-04 | (3795.45 ms | 138136 tok/s) step 11036/76294 | train loss 3.447315 | norm 0.1989 | lr 5.80e-04 | (3793.01 ms | 138225 tok/s) step 11037/76294 | train loss 3.447932 | norm 0.3396 | lr 5.79e-04 | (3797.93 ms | 138046 tok/s) step 11038/76294 | train loss 3.423707 | norm 0.2203 | lr 5.79e-04 | (3798.36 ms | 138030 tok/s) step 11039/76294 | train loss 3.409382 | norm 0.3849 | lr 5.79e-04 | (3792.41 ms | 138246 tok/s) step 11040/76294 | train loss 3.427738 | norm 0.2036 | lr 5.79e-04 | (3820.41 ms | 137234 tok/s) step 11041/76294 | train loss 3.448467 | norm 0.2826 | lr 5.79e-04 | (4005.66 ms | 130887 tok/s) step 11042/76294 | train loss 3.433424 | norm 0.2379 | lr 5.79e-04 | (3791.47 ms | 138281 tok/s) step 11043/76294 | train loss 3.521324 | norm 0.2017 | lr 5.79e-04 | (3815.81 ms | 137399 tok/s) step 11044/76294 | train loss 3.373502 | norm 0.2405 | lr 5.79e-04 | (3793.39 ms | 138211 tok/s) step 11045/76294 | train loss 3.405194 | norm 0.2029 | lr 5.79e-04 | (3824.38 ms | 137091 tok/s) step 11046/76294 | train loss 3.654314 | norm 0.2308 | lr 5.79e-04 | (3811.37 ms | 137559 tok/s) step 11047/76294 | train loss 3.418392 | norm 0.2257 | lr 5.79e-04 | (3796.34 ms | 138104 tok/s) step 11048/76294 | train loss 3.399680 | norm 0.2264 | lr 5.78e-04 | (3791.80 ms | 138269 tok/s) step 11049/76294 | train loss 3.447862 | norm 0.3705 | lr 5.78e-04 | (3813.45 ms | 137484 tok/s) step 11050/76294 | train loss 3.442167 | norm 0.3226 | lr 5.78e-04 | (3886.30 ms | 134907 tok/s) step 11051/76294 | train loss 3.460648 | norm 0.2422 
| lr 5.78e-04 | (3786.95 ms | 138446 tok/s) step 11052/76294 | train loss 3.478025 | norm 0.2641 | lr 5.78e-04 | (3789.81 ms | 138342 tok/s) step 11053/76294 | train loss 3.448167 | norm 0.2509 | lr 5.78e-04 | (3813.86 ms | 137469 tok/s) step 11054/76294 | train loss 3.424671 | norm 0.2390 | lr 5.78e-04 | (3790.64 ms | 138311 tok/s) step 11055/76294 | train loss 3.445018 | norm 0.2278 | lr 5.78e-04 | (3797.11 ms | 138076 tok/s) step 11056/76294 | train loss 3.421093 | norm 0.2292 | lr 5.78e-04 | (3798.29 ms | 138033 tok/s) step 11057/76294 | train loss 3.395589 | norm 0.2480 | lr 5.78e-04 | (3791.14 ms | 138293 tok/s) step 11058/76294 | train loss 3.428233 | norm 0.2266 | lr 5.78e-04 | (3818.00 ms | 137320 tok/s) step 11059/76294 | train loss 3.474281 | norm 0.2839 | lr 5.77e-04 | (3792.26 ms | 138252 tok/s) step 11060/76294 | train loss 3.363827 | norm 0.1906 | lr 5.77e-04 | (3813.22 ms | 137492 tok/s) step 11061/76294 | train loss 3.608678 | norm 0.2873 | lr 5.77e-04 | (3789.67 ms | 138346 tok/s) step 11062/76294 | train loss 3.460745 | norm 0.2183 | lr 5.77e-04 | (3916.78 ms | 133857 tok/s) step 11063/76294 | train loss 3.379146 | norm 0.2146 | lr 5.77e-04 | (4068.31 ms | 128871 tok/s) step 11064/76294 | train loss 3.495265 | norm 0.2683 | lr 5.77e-04 | (3815.44 ms | 137412 tok/s) step 11065/76294 | train loss 3.738697 | norm 0.4671 | lr 5.77e-04 | (3812.23 ms | 137528 tok/s) step 11066/76294 | train loss 3.371810 | norm 0.3835 | lr 5.77e-04 | (3815.90 ms | 137396 tok/s) step 11067/76294 | train loss 3.490423 | norm 0.2710 | lr 5.77e-04 | (3922.87 ms | 133649 tok/s) step 11068/76294 | train loss 3.433776 | norm 0.4461 | lr 5.77e-04 | (3784.76 ms | 138526 tok/s) step 11069/76294 | train loss 3.472060 | norm 0.2773 | lr 5.77e-04 | (3813.62 ms | 137478 tok/s) step 11070/76294 | train loss 3.448168 | norm 0.2890 | lr 5.76e-04 | (3789.00 ms | 138371 tok/s) step 11071/76294 | train loss 3.412803 | norm 0.2432 | lr 5.76e-04 | (3787.37 ms | 138431 tok/s) step 
11072/76294 | train loss 3.448434 | norm 0.2074 | lr 5.76e-04 | (3906.30 ms | 134216 tok/s) step 11073/76294 | train loss 3.409335 | norm 0.1988 | lr 5.76e-04 | (3784.80 ms | 138525 tok/s) step 11074/76294 | train loss 3.410133 | norm 0.3379 | lr 5.76e-04 | (3812.46 ms | 137519 tok/s) step 11075/76294 | train loss 3.462610 | norm 0.2007 | lr 5.76e-04 | (3787.16 ms | 138438 tok/s) step 11076/76294 | train loss 3.428015 | norm 0.2628 | lr 5.76e-04 | (3789.72 ms | 138345 tok/s) step 11077/76294 | train loss 3.438921 | norm 0.2735 | lr 5.76e-04 | (3809.10 ms | 137641 tok/s) step 11078/76294 | train loss 3.439154 | norm 0.2335 | lr 5.76e-04 | (3792.19 ms | 138255 tok/s) step 11079/76294 | train loss 3.302210 | norm 0.4089 | lr 5.76e-04 | (3810.20 ms | 137601 tok/s) step 11080/76294 | train loss 3.444729 | norm 0.2466 | lr 5.76e-04 | (3813.52 ms | 137481 tok/s) step 11081/76294 | train loss 3.423778 | norm 0.2536 | lr 5.76e-04 | (3792.88 ms | 138230 tok/s) step 11082/76294 | train loss 3.445812 | norm 0.2099 | lr 5.75e-04 | (3897.27 ms | 134527 tok/s) step 11083/76294 | train loss 3.469562 | norm 0.2062 | lr 5.75e-04 | (3825.70 ms | 137044 tok/s) step 11084/76294 | train loss 3.425203 | norm 0.2004 | lr 5.75e-04 | (3838.74 ms | 136578 tok/s) step 11085/76294 | train loss 3.425947 | norm 0.1954 | lr 5.75e-04 | (3794.51 ms | 138170 tok/s) step 11086/76294 | train loss 3.387756 | norm 0.1962 | lr 5.75e-04 | (3876.26 ms | 135256 tok/s) step 11087/76294 | train loss 3.404780 | norm 0.1935 | lr 5.75e-04 | (3794.54 ms | 138169 tok/s) step 11088/76294 | train loss 3.422329 | norm 0.2398 | lr 5.75e-04 | (3814.13 ms | 137459 tok/s) step 11089/76294 | train loss 3.386276 | norm 0.2844 | lr 5.75e-04 | (3872.33 ms | 135394 tok/s) step 11090/76294 | train loss 3.415307 | norm 0.2481 | lr 5.75e-04 | (3786.70 ms | 138455 tok/s) step 11091/76294 | train loss 3.381388 | norm 0.2526 | lr 5.75e-04 | (3786.59 ms | 138459 tok/s) step 11092/76294 | train loss 3.344251 | norm 0.1888 | lr 
5.75e-04 | (3883.48 ms | 135005 tok/s) step 11093/76294 | train loss 3.385849 | norm 0.2687 | lr 5.74e-04 | (3781.67 ms | 138639 tok/s) step 11094/76294 | train loss 3.459437 | norm 0.2179 | lr 5.74e-04 | (3797.89 ms | 138047 tok/s) step 11095/76294 | train loss 3.401289 | norm 0.2213 | lr 5.74e-04 | (3852.06 ms | 136106 tok/s) step 11096/76294 | train loss 3.403325 | norm 0.2447 | lr 5.74e-04 | (3932.37 ms | 133326 tok/s) step 11097/76294 | train loss 3.456975 | norm 0.2352 | lr 5.74e-04 | (3758.56 ms | 139492 tok/s) step 11098/76294 | train loss 3.361977 | norm 0.1860 | lr 5.74e-04 | (3852.55 ms | 136089 tok/s) step 11099/76294 | train loss 3.401044 | norm 0.2319 | lr 5.74e-04 | (3817.98 ms | 137321 tok/s) step 11100/76294 | train loss 3.404572 | norm 0.1915 | lr 5.74e-04 | (3763.00 ms | 139327 tok/s) step 11101/76294 | train loss 3.406586 | norm 0.2393 | lr 5.74e-04 | (3896.35 ms | 134559 tok/s) step 11102/76294 | train loss 3.415836 | norm 0.2277 | lr 5.74e-04 | (3762.44 ms | 139348 tok/s) step 11103/76294 | train loss 3.392345 | norm 0.2365 | lr 5.74e-04 | (3866.65 ms | 135592 tok/s) step 11104/76294 | train loss 3.479565 | norm 0.3356 | lr 5.73e-04 | (3853.71 ms | 136048 tok/s) step 11105/76294 | train loss 3.380010 | norm 0.2422 | lr 5.73e-04 | (3757.24 ms | 139541 tok/s) step 11106/76294 | train loss 3.390330 | norm 0.2141 | lr 5.73e-04 | (3893.95 ms | 134642 tok/s) step 11107/76294 | train loss 3.483488 | norm 0.2452 | lr 5.73e-04 | (3759.68 ms | 139450 tok/s) step 11108/76294 | train loss 3.388391 | norm 0.2176 | lr 5.73e-04 | (3826.32 ms | 137021 tok/s) step 11109/76294 | train loss 3.442946 | norm 0.2118 | lr 5.73e-04 | (3767.78 ms | 139150 tok/s) step 11110/76294 | train loss 3.439189 | norm 0.2671 | lr 5.73e-04 | (3855.02 ms | 136001 tok/s) step 11111/76294 | train loss 3.471255 | norm 0.2177 | lr 5.73e-04 | (3774.19 ms | 138914 tok/s) step 11112/76294 | train loss 3.446917 | norm 0.2364 | lr 5.73e-04 | (3786.61 ms | 138458 tok/s) step 11113/76294 | 
train loss 3.432279 | norm 0.2058 | lr 5.73e-04 | (3807.81 ms | 137688 tok/s) step 11114/76294 | train loss 3.471278 | norm 0.2272 | lr 5.73e-04 | (3787.86 ms | 138413 tok/s) step 11115/76294 | train loss 3.406226 | norm 0.2250 | lr 5.72e-04 | (3792.50 ms | 138243 tok/s) step 11116/76294 | train loss 3.358771 | norm 0.3919 | lr 5.72e-04 | (3788.13 ms | 138403 tok/s) step 11117/76294 | train loss 3.485170 | norm 0.2514 | lr 5.72e-04 | (3796.61 ms | 138094 tok/s) step 11118/76294 | train loss 3.406307 | norm 0.3011 | lr 5.72e-04 | (3945.66 ms | 132877 tok/s) step 11119/76294 | train loss 3.394103 | norm 0.3333 | lr 5.72e-04 | (3790.89 ms | 138302 tok/s) step 11120/76294 | train loss 3.443495 | norm 0.2091 | lr 5.72e-04 | (3800.04 ms | 137969 tok/s) step 11121/76294 | train loss 3.421453 | norm 0.3928 | lr 5.72e-04 | (3820.50 ms | 137230 tok/s) step 11122/76294 | train loss 3.449823 | norm 0.2040 | lr 5.72e-04 | (3800.90 ms | 137938 tok/s) step 11123/76294 | train loss 3.418158 | norm 0.3485 | lr 5.72e-04 | (3803.16 ms | 137856 tok/s) step 11124/76294 | train loss 3.417341 | norm 0.2038 | lr 5.72e-04 | (3800.23 ms | 137962 tok/s) step 11125/76294 | train loss 3.364252 | norm 0.2626 | lr 5.72e-04 | (3909.00 ms | 134123 tok/s) step 11126/76294 | train loss 3.428905 | norm 0.2207 | lr 5.72e-04 | (3792.13 ms | 138257 tok/s) step 11127/76294 | train loss 3.438671 | norm 0.3593 | lr 5.71e-04 | (3816.05 ms | 137390 tok/s) step 11128/76294 | train loss 3.386473 | norm 0.2163 | lr 5.71e-04 | (3796.42 ms | 138101 tok/s) step 11129/76294 | train loss 3.402035 | norm 0.3189 | lr 5.71e-04 | (3804.37 ms | 137812 tok/s) step 11130/76294 | train loss 3.465269 | norm 0.5216 | lr 5.71e-04 | (3820.95 ms | 137214 tok/s) step 11131/76294 | train loss 3.386225 | norm 0.2251 | lr 5.71e-04 | (3907.40 ms | 134178 tok/s) step 11132/76294 | train loss 3.423723 | norm 0.2454 | lr 5.71e-04 | (3796.10 ms | 138112 tok/s) step 11133/76294 | train loss 3.409322 | norm 0.2638 | lr 5.71e-04 | (3856.88 
ms | 135936 tok/s) step 11134/76294 | train loss 3.407717 | norm 0.2229 | lr 5.71e-04 | (3802.02 ms | 137897 tok/s) step 11135/76294 | train loss 3.392323 | norm 0.2490 | lr 5.71e-04 | (3821.64 ms | 137189 tok/s) step 11136/76294 | train loss 3.422715 | norm 0.2347 | lr 5.71e-04 | (3800.70 ms | 137945 tok/s) step 11137/76294 | train loss 3.390734 | norm 0.2054 | lr 5.71e-04 | (4786.48 ms | 109535 tok/s) step 11138/76294 | train loss 3.363360 | norm 0.2959 | lr 5.70e-04 | (3892.09 ms | 134706 tok/s) step 11139/76294 | train loss 3.470083 | norm 0.5201 | lr 5.70e-04 | (3804.80 ms | 137797 tok/s) step 11140/76294 | train loss 3.345752 | norm 0.2728 | lr 5.70e-04 | (3824.42 ms | 137090 tok/s) step 11141/76294 | train loss 3.463053 | norm 0.2764 | lr 5.70e-04 | (3804.49 ms | 137808 tok/s) step 11142/76294 | train loss 3.371544 | norm 0.2346 | lr 5.70e-04 | (3820.49 ms | 137230 tok/s) step 11143/76294 | train loss 3.429628 | norm 0.2298 | lr 5.70e-04 | (3812.82 ms | 137506 tok/s) step 11144/76294 | train loss 3.423627 | norm 0.2334 | lr 5.70e-04 | (3805.76 ms | 137762 tok/s) step 11145/76294 | train loss 3.464987 | norm 0.2326 | lr 5.70e-04 | (3843.48 ms | 136410 tok/s) step 11146/76294 | train loss 3.393322 | norm 0.2857 | lr 5.70e-04 | (3807.78 ms | 137689 tok/s) step 11147/76294 | train loss 3.422524 | norm 0.2513 | lr 5.70e-04 | (3809.80 ms | 137616 tok/s) step 11148/76294 | train loss 3.396708 | norm 0.2647 | lr 5.70e-04 | (3833.51 ms | 136765 tok/s) step 11149/76294 | train loss 3.411978 | norm 0.2798 | lr 5.69e-04 | (3831.87 ms | 136823 tok/s) step 11150/76294 | train loss 3.440608 | norm 0.2499 | lr 5.69e-04 | (3805.84 ms | 137759 tok/s) step 11151/76294 | train loss 3.409938 | norm 0.2158 | lr 5.69e-04 | (3840.01 ms | 136533 tok/s) step 11152/76294 | train loss 3.354103 | norm 0.3062 | lr 5.69e-04 | (3806.46 ms | 137736 tok/s) step 11153/76294 | train loss 3.344790 | norm 0.2341 | lr 5.69e-04 | (3816.40 ms | 137378 tok/s) step 11154/76294 | train loss 3.394105 | 
norm 0.2510 | lr 5.69e-04 | (3883.46 ms | 135005 tok/s) step 11155/76294 | train loss 3.470555 | norm 0.3250 | lr 5.69e-04 | (3809.11 ms | 137641 tok/s) step 11156/76294 | train loss 3.438659 | norm 0.2798 | lr 5.69e-04 | (3821.12 ms | 137208 tok/s) step 11157/76294 | train loss 3.447152 | norm 0.3358 | lr 5.69e-04 | (3807.42 ms | 137702 tok/s) step 11158/76294 | train loss 3.416881 | norm 0.2076 | lr 5.69e-04 | (3863.85 ms | 135690 tok/s) step 11159/76294 | train loss 3.388215 | norm 0.4309 | lr 5.69e-04 | (3926.65 ms | 133521 tok/s) step 11160/76294 | train loss 3.431927 | norm 0.2044 | lr 5.68e-04 | (3795.88 ms | 138120 tok/s) step 11161/76294 | train loss 3.381434 | norm 0.2391 | lr 5.68e-04 | (3802.77 ms | 137870 tok/s) step 11162/76294 | train loss 3.408753 | norm 0.2321 | lr 5.68e-04 | (3821.86 ms | 137181 tok/s) step 11163/76294 | train loss 3.356946 | norm 0.2501 | lr 5.68e-04 | (3804.84 ms | 137795 tok/s) step 11164/76294 | train loss 3.415337 | norm 0.2275 | lr 5.68e-04 | (3802.67 ms | 137874 tok/s) step 11165/76294 | train loss 3.435515 | norm 0.2882 | lr 5.68e-04 | (3862.45 ms | 135740 tok/s) step 11166/76294 | train loss 3.360898 | norm 0.1994 | lr 5.68e-04 | (3872.23 ms | 135397 tok/s) step 11167/76294 | train loss 3.394659 | norm 0.3239 | lr 5.68e-04 | (3792.14 ms | 138256 tok/s) step 11168/76294 | train loss 3.404767 | norm 0.2649 | lr 5.68e-04 | (3863.08 ms | 135718 tok/s) step 11169/76294 | train loss 3.292489 | norm 0.2855 | lr 5.68e-04 | (3802.18 ms | 137892 tok/s) step 11170/76294 | train loss 3.382949 | norm 0.3496 | lr 5.68e-04 | (3801.92 ms | 137901 tok/s) step 11171/76294 | train loss 3.465226 | norm 0.2156 | lr 5.68e-04 | (3824.82 ms | 137075 tok/s) step 11172/76294 | train loss 3.324367 | norm 0.4084 | lr 5.67e-04 | (3994.33 ms | 131258 tok/s) step 11173/76294 | train loss 3.421613 | norm 0.2204 | lr 5.67e-04 | (3856.26 ms | 135958 tok/s) step 11174/76294 | train loss 3.389284 | norm 0.3639 | lr 5.67e-04 | (3864.73 ms | 135660 tok/s) 
step 11175/76294 | train loss 3.358841 | norm 0.3388 | lr 5.67e-04 | (3825.36 ms | 137056 tok/s) step 11176/76294 | train loss 3.463887 | norm 0.2868 | lr 5.67e-04 | (3806.62 ms | 137731 tok/s) step 11177/76294 | train loss 3.379588 | norm 0.4866 | lr 5.67e-04 | (3805.36 ms | 137776 tok/s) step 11178/76294 | train loss 3.451196 | norm 0.2047 | lr 5.67e-04 | (3804.55 ms | 137806 tok/s) step 11179/76294 | train loss 3.354899 | norm 0.3617 | lr 5.67e-04 | (3802.28 ms | 137888 tok/s) step 11180/76294 | train loss 3.384860 | norm 0.3282 | lr 5.67e-04 | (3883.77 ms | 134995 tok/s) step 11181/76294 | train loss 3.347921 | norm 0.2119 | lr 5.67e-04 | (3797.74 ms | 138053 tok/s) step 11182/76294 | train loss 3.452869 | norm 0.4338 | lr 5.67e-04 | (3802.01 ms | 137898 tok/s) step 11183/76294 | train loss 3.448182 | norm 0.2578 | lr 5.66e-04 | (3819.82 ms | 137254 tok/s) step 11184/76294 | train loss 3.347998 | norm 0.2888 | lr 5.66e-04 | (3802.21 ms | 137890 tok/s) step 11185/76294 | train loss 3.412649 | norm 0.2755 | lr 5.66e-04 | (3823.77 ms | 137113 tok/s) step 11186/76294 | train loss 3.390647 | norm 0.2334 | lr 5.66e-04 | (3804.18 ms | 137819 tok/s) step 11187/76294 | train loss 3.438050 | norm 0.2137 | lr 5.66e-04 | (3805.86 ms | 137758 tok/s) step 11188/76294 | train loss 3.376474 | norm 0.2852 | lr 5.66e-04 | (3806.54 ms | 137733 tok/s) step 11189/76294 | train loss 3.388735 | norm 0.2143 | lr 5.66e-04 | (3801.83 ms | 137904 tok/s) step 11190/76294 | train loss 3.408365 | norm 0.2625 | lr 5.66e-04 | (3834.60 ms | 136726 tok/s) step 11191/76294 | train loss 3.476665 | norm 0.2008 | lr 5.66e-04 | (3805.37 ms | 137776 tok/s) step 11192/76294 | train loss 3.383684 | norm 0.2546 | lr 5.66e-04 | (3832.54 ms | 136799 tok/s) step 11193/76294 | train loss 3.384151 | norm 0.2529 | lr 5.66e-04 | (3809.47 ms | 137627 tok/s) step 11194/76294 | train loss 3.415150 | norm 0.2462 | lr 5.65e-04 | (3811.37 ms | 137559 tok/s) step 11195/76294 | train loss 3.436637 | norm 0.2910 | lr 
5.65e-04 | (3918.09 ms | 133812 tok/s) step 11196/76294 | train loss 3.365474 | norm 0.1902 | lr 5.65e-04 | (3804.18 ms | 137819 tok/s) step 11197/76294 | train loss 3.370571 | norm 0.3361 | lr 5.65e-04 | (3815.14 ms | 137423 tok/s) step 11198/76294 | train loss 3.460061 | norm 0.1943 | lr 5.65e-04 | (3810.11 ms | 137605 tok/s) step 11199/76294 | train loss 3.389162 | norm 0.2588 | lr 5.65e-04 | (3812.31 ms | 137525 tok/s) step 11200/76294 | train loss 3.378753 | norm 0.2177 | lr 5.65e-04 | (3827.66 ms | 136973 tok/s) step 11201/76294 | train loss 3.390318 | norm 0.2259 | lr 5.65e-04 | (3836.49 ms | 136658 tok/s) step 11202/76294 | train loss 3.419414 | norm 0.2918 | lr 5.65e-04 | (3809.26 ms | 137635 tok/s) step 11203/76294 | train loss 3.503057 | norm 0.2133 | lr 5.65e-04 | (3838.54 ms | 136585 tok/s) step 11204/76294 | train loss 3.408736 | norm 0.3237 | lr 5.65e-04 | (3807.89 ms | 137685 tok/s) step 11205/76294 | train loss 3.375570 | norm 0.2710 | lr 5.64e-04 | (3816.66 ms | 137368 tok/s) step 11206/76294 | train loss 3.419887 | norm 0.2433 | lr 5.64e-04 | (3832.82 ms | 136789 tok/s) step 11207/76294 | train loss 3.404337 | norm 0.2628 | lr 5.64e-04 | (3823.60 ms | 137119 tok/s) step 11208/76294 | train loss 3.382414 | norm 0.1996 | lr 5.64e-04 | (3833.56 ms | 136763 tok/s) step 11209/76294 | train loss 3.388615 | norm 0.2706 | lr 5.64e-04 | (3815.59 ms | 137407 tok/s) step 11210/76294 | train loss 3.416529 | norm 0.2734 | lr 5.64e-04 | (3820.09 ms | 137245 tok/s) step 11211/76294 | train loss 3.385833 | norm 0.2115 | lr 5.64e-04 | (3814.45 ms | 137448 tok/s) step 11212/76294 | train loss 3.391404 | norm 0.2465 | lr 5.64e-04 | (3809.05 ms | 137643 tok/s) step 11213/76294 | train loss 3.415075 | norm 0.2254 | lr 5.64e-04 | (3840.26 ms | 136524 tok/s) step 11214/76294 | train loss 3.411806 | norm 0.2207 | lr 5.64e-04 | (3829.64 ms | 136903 tok/s) step 11215/76294 | train loss 3.411510 | norm 0.2310 | lr 5.64e-04 | (3874.88 ms | 135304 tok/s) step 11216/76294 | 
train loss 3.401082 | norm 0.2447 | lr 5.64e-04 | (3823.04 ms | 137139 tok/s) step 11217/76294 | train loss 3.397851 | norm 0.1947 | lr 5.63e-04 | (3797.86 ms | 138048 tok/s) step 11218/76294 | train loss 3.385282 | norm 0.2248 | lr 5.63e-04 | (9132.33 ms | 57410 tok/s) step 11219/76294 | train loss 3.390388 | norm 0.1964 | lr 5.63e-04 | (3854.06 ms | 136035 tok/s) step 11220/76294 | train loss 3.484365 | norm 0.2299 | lr 5.63e-04 | (3788.65 ms | 138384 tok/s) step 11221/76294 | train loss 3.425288 | norm 0.2112 | lr 5.63e-04 | (3811.08 ms | 137569 tok/s) step 11222/76294 | train loss 3.386270 | norm 0.2030 | lr 5.63e-04 | (3798.72 ms | 138017 tok/s) step 11223/76294 | train loss 3.356506 | norm 0.2166 | lr 5.63e-04 | (3795.97 ms | 138117 tok/s) step 11224/76294 | train loss 3.388867 | norm 0.2884 | lr 5.63e-04 | (3823.59 ms | 137119 tok/s) step 11225/76294 | train loss 3.338309 | norm 0.2316 | lr 5.63e-04 | (3796.97 ms | 138080 tok/s) step 11226/76294 | train loss 3.363748 | norm 0.3004 | lr 5.63e-04 | (3789.24 ms | 138362 tok/s) step 11227/76294 | train loss 3.466302 | norm 0.2931 | lr 5.63e-04 | (3845.93 ms | 136323 tok/s) step 11228/76294 | train loss 3.348032 | norm 0.2917 | lr 5.62e-04 | (3794.15 ms | 138183 tok/s) step 11229/76294 | train loss 3.432971 | norm 0.2574 | lr 5.62e-04 | (3788.19 ms | 138401 tok/s) step 11230/76294 | train loss 3.314220 | norm 0.1928 | lr 5.62e-04 | (3840.86 ms | 136503 tok/s) step 11231/76294 | train loss 3.428072 | norm 0.3075 | lr 5.62e-04 | (3791.50 ms | 138280 tok/s) step 11232/76294 | train loss 3.412145 | norm 0.2186 | lr 5.62e-04 | (3802.70 ms | 137873 tok/s) step 11233/76294 | train loss 3.496234 | norm 0.2417 | lr 5.62e-04 | (3814.63 ms | 137441 tok/s) step 11234/76294 | train loss 3.388145 | norm 0.2934 | lr 5.62e-04 | (3793.56 ms | 138205 tok/s) step 11235/76294 | train loss 3.450153 | norm 0.2486 | lr 5.62e-04 | (3817.73 ms | 137330 tok/s) step 11236/76294 | train loss 3.389639 | norm 0.3175 | lr 5.62e-04 | (3826.84 
ms | 137003 tok/s) step 11237/76294 | train loss 3.396128 | norm 0.2224 | lr 5.62e-04 | (3791.40 ms | 138283 tok/s) step 11238/76294 | train loss 3.364263 | norm 0.2331 | lr 5.62e-04 | (3818.86 ms | 137289 tok/s) step 11239/76294 | train loss 3.402472 | norm 0.1939 | lr 5.61e-04 | (3795.86 ms | 138121 tok/s) step 11240/76294 | train loss 3.420955 | norm 0.2396 | lr 5.61e-04 | (3822.68 ms | 137152 tok/s) step 11241/76294 | train loss 3.395561 | norm 0.2236 | lr 5.61e-04 | (3793.43 ms | 138209 tok/s) step 11242/76294 | train loss 3.398227 | norm 0.2357 | lr 5.61e-04 | (3801.91 ms | 137901 tok/s) step 11243/76294 | train loss 3.355838 | norm 0.2314 | lr 5.61e-04 | (3814.31 ms | 137453 tok/s) step 11244/76294 | train loss 3.522432 | norm 0.2592 | lr 5.61e-04 | (3801.10 ms | 137931 tok/s) step 11245/76294 | train loss 3.399834 | norm 0.2330 | lr 5.61e-04 | (3801.12 ms | 137930 tok/s) step 11246/76294 | train loss 3.438581 | norm 0.3450 | lr 5.61e-04 | (3823.12 ms | 137136 tok/s) step 11247/76294 | train loss 3.561020 | norm 0.1927 | lr 5.61e-04 | (3796.50 ms | 138098 tok/s) step 11248/76294 | train loss 3.383706 | norm 0.3949 | lr 5.61e-04 | (3851.51 ms | 136125 tok/s) step 11249/76294 | train loss 3.391524 | norm 0.2019 | lr 5.61e-04 | (3792.57 ms | 138241 tok/s) step 11250/76294 | train loss 3.436697 | norm 0.3139 | lr 5.61e-04 | (4366.44 ms | 120072 tok/s) val loss: 3.405552 saving model checkpoint to ./results/gpt2-124M-gqa/step_11250.pth step 11251/76294 | train loss 3.324979 | norm 0.1804 | lr 5.60e-04 | (3814.10 ms | 137460 tok/s) step 11252/76294 | train loss 3.422159 | norm 0.2134 | lr 5.60e-04 | (3787.79 ms | 138415 tok/s) step 11253/76294 | train loss 3.407574 | norm 0.1785 | lr 5.60e-04 | (3909.35 ms | 134111 tok/s) step 11254/76294 | train loss 3.470377 | norm 0.3087 | lr 5.60e-04 | (4051.61 ms | 129402 tok/s) step 11255/76294 | train loss 3.408893 | norm 0.2254 | lr 5.60e-04 | (3840.63 ms | 136511 tok/s) step 11256/76294 | train loss 3.392872 | norm 0.3522 
| lr 5.60e-04 | (3786.90 ms | 138448 tok/s) step 11257/76294 | train loss 3.481677 | norm 0.3525 | lr 5.60e-04 | (3807.85 ms | 137686 tok/s) step 11258/76294 | train loss 3.402982 | norm 0.2856 | lr 5.60e-04 | (3884.91 ms | 134955 tok/s) step 11259/76294 | train loss 3.411549 | norm 0.4329 | lr 5.60e-04 | (3792.83 ms | 138231 tok/s) step 11260/76294 | train loss 3.353236 | norm 0.2547 | lr 5.60e-04 | (3800.97 ms | 137935 tok/s) step 11261/76294 | train loss 3.435280 | norm 0.2391 | lr 5.60e-04 | (3786.09 ms | 138477 tok/s) step 11262/76294 | train loss 3.373805 | norm 0.2153 | lr 5.59e-04 | (3895.66 ms | 134583 tok/s) step 11263/76294 | train loss 3.345079 | norm 0.1945 | lr 5.59e-04 | (3770.22 ms | 139060 tok/s) step 11264/76294 | train loss 3.433342 | norm 0.2975 | lr 5.59e-04 | (3781.52 ms | 138645 tok/s) step 11265/76294 | train loss 3.556315 | norm 0.2494 | lr 5.59e-04 | (3803.27 ms | 137852 tok/s) step 11266/76294 | train loss 3.430289 | norm 0.2717 | lr 5.59e-04 | (3786.37 ms | 138467 tok/s) step 11267/76294 | train loss 3.377514 | norm 0.3031 | lr 5.59e-04 | (3835.39 ms | 136697 tok/s) step 11268/76294 | train loss 3.453592 | norm 0.4080 | lr 5.59e-04 | (3785.95 ms | 138482 tok/s) step 11269/76294 | train loss 3.417226 | norm 0.2879 | lr 5.59e-04 | (3812.60 ms | 137515 tok/s) step 11270/76294 | train loss 3.404319 | norm 0.4476 | lr 5.59e-04 | (3810.77 ms | 137580 tok/s) step 11271/76294 | train loss 3.408668 | norm 0.2252 | lr 5.59e-04 | (3787.47 ms | 138427 tok/s) step 11272/76294 | train loss 3.412916 | norm 0.4815 | lr 5.59e-04 | (3819.53 ms | 137265 tok/s) step 11273/76294 | train loss 3.390252 | norm 0.2227 | lr 5.58e-04 | (3791.01 ms | 138298 tok/s) step 11274/76294 | train loss 3.468918 | norm 0.2816 | lr 5.58e-04 | (3799.51 ms | 137988 tok/s) step 11275/76294 | train loss 3.404312 | norm 0.2229 | lr 5.58e-04 | (3817.02 ms | 137355 tok/s) step 11276/76294 | train loss 3.438849 | norm 0.2015 | lr 5.58e-04 | (3800.25 ms | 137961 tok/s) step 
11277/76294 | train loss 3.464801 | norm 0.2083 | lr 5.58e-04 | (3802.79 ms | 137869 tok/s) step 11278/76294 | train loss 3.350101 | norm 0.2344 | lr 5.58e-04 | (3856.23 ms | 135959 tok/s) step 11279/76294 | train loss 3.446996 | norm 0.2123 | lr 5.58e-04 | (3794.69 ms | 138164 tok/s) step 11280/76294 | train loss 3.420397 | norm 0.2038 | lr 5.58e-04 | (3821.29 ms | 137202 tok/s) step 11281/76294 | train loss 3.461410 | norm 0.2284 | lr 5.58e-04 | (3798.44 ms | 138027 tok/s) step 11282/76294 | train loss 3.475387 | norm 0.1983 | lr 5.58e-04 | (3804.12 ms | 137821 tok/s) step 11283/76294 | train loss 3.408352 | norm 0.2352 | lr 5.58e-04 | (3827.12 ms | 136993 tok/s) step 11284/76294 | train loss 3.416829 | norm 0.2466 | lr 5.58e-04 | (3810.90 ms | 137576 tok/s) step 11285/76294 | train loss 3.417327 | norm 0.2551 | lr 5.57e-04 | (3849.61 ms | 136192 tok/s) step 11286/76294 | train loss 3.445951 | norm 0.2940 | lr 5.57e-04 | (3798.70 ms | 138018 tok/s) step 11287/76294 | train loss 3.404452 | norm 0.2339 | lr 5.57e-04 | (3882.74 ms | 135030 tok/s) step 11288/76294 | train loss 3.410985 | norm 0.5227 | lr 5.57e-04 | (3790.42 ms | 138319 tok/s) step 11289/76294 | train loss 3.454011 | norm 0.3860 | lr 5.57e-04 | (3803.52 ms | 137843 tok/s) step 11290/76294 | train loss 3.415999 | norm 0.3236 | lr 5.57e-04 | (3795.55 ms | 138132 tok/s) step 11291/76294 | train loss 3.427593 | norm 0.2481 | lr 5.57e-04 | (3799.53 ms | 137988 tok/s) step 11292/76294 | train loss 3.680657 | norm 0.2985 | lr 5.57e-04 | (3819.32 ms | 137273 tok/s) step 11293/76294 | train loss 3.444363 | norm 0.3766 | lr 5.57e-04 | (3797.35 ms | 138067 tok/s) step 11294/76294 | train loss 3.388724 | norm 0.2570 | lr 5.57e-04 | (3805.56 ms | 137769 tok/s) step 11295/76294 | train loss 3.394019 | norm 0.2162 | lr 5.57e-04 | (3825.89 ms | 137037 tok/s) step 11296/76294 | train loss 3.416103 | norm 0.2480 | lr 5.56e-04 | (3805.27 ms | 137779 tok/s) step 11297/76294 | train loss 3.373446 | norm 0.2443 | lr 
5.56e-04 | (3802.24 ms | 137889 tok/s) step 11298/76294 | train loss 3.321992 | norm 0.1910 | lr 5.56e-04 | (3796.41 ms | 138101 tok/s) step 11299/76294 | train loss 3.391343 | norm 0.2142 | lr 5.56e-04 | (3866.37 ms | 135602 tok/s) step 11300/76294 | train loss 3.427394 | norm 0.2193 | lr 5.56e-04 | (3795.85 ms | 138122 tok/s) step 11301/76294 | train loss 3.429448 | norm 0.2036 | lr 5.56e-04 | (3831.09 ms | 136851 tok/s) step 11302/76294 | train loss 3.458492 | norm 0.2536 | lr 5.56e-04 | (3805.59 ms | 137768 tok/s) step 11303/76294 | train loss 3.419014 | norm 0.1984 | lr 5.56e-04 | (3808.63 ms | 137658 tok/s) step 11304/76294 | train loss 3.387837 | norm 0.2282 | lr 5.56e-04 | (3823.90 ms | 137108 tok/s) step 11305/76294 | train loss 3.470485 | norm 0.2453 | lr 5.56e-04 | (3825.72 ms | 137043 tok/s) step 11306/76294 | train loss 3.408723 | norm 0.3837 | lr 5.56e-04 | (3798.87 ms | 138012 tok/s) step 11307/76294 | train loss 3.464298 | norm 0.2268 | lr 5.55e-04 | (3839.28 ms | 136559 tok/s) step 11308/76294 | train loss 3.476660 | norm 0.3452 | lr 5.55e-04 | (3807.02 ms | 137716 tok/s) step 11309/76294 | train loss 3.533338 | norm 0.2445 | lr 5.55e-04 | (3809.05 ms | 137643 tok/s) step 11310/76294 | train loss 3.445714 | norm 0.3001 | lr 5.55e-04 | (3819.84 ms | 137254 tok/s) step 11311/76294 | train loss 3.396276 | norm 0.2225 | lr 5.55e-04 | (3800.86 ms | 137939 tok/s) step 11312/76294 | train loss 3.448982 | norm 0.2573 | lr 5.55e-04 | (3825.02 ms | 137068 tok/s) step 11313/76294 | train loss 3.483499 | norm 0.2104 | lr 5.55e-04 | (3801.51 ms | 137916 tok/s) step 11314/76294 | train loss 3.438805 | norm 0.2327 | lr 5.55e-04 | (3806.11 ms | 137749 tok/s) step 11315/76294 | train loss 3.444787 | norm 0.1924 | lr 5.55e-04 | (3827.39 ms | 136983 tok/s) step 11316/76294 | train loss 3.460143 | norm 0.2383 | lr 5.55e-04 | (3805.34 ms | 137777 tok/s) step 11317/76294 | train loss 3.478272 | norm 0.3229 | lr 5.55e-04 | (3799.65 ms | 137983 tok/s) step 11318/76294 | 
train loss 3.462257 | norm 0.6994 | lr 5.55e-04 | (3832.43 ms | 136803 tok/s) step 11319/76294 | train loss 3.433354 | norm 0.3420 | lr 5.54e-04 | (3804.99 ms | 137789 tok/s) step 11320/76294 | train loss 3.389090 | norm 0.5606 | lr 5.54e-04 | (3831.31 ms | 136843 tok/s) step 11321/76294 | train loss 3.399833 | norm 0.2752 | lr 5.54e-04 | (3812.61 ms | 137514 tok/s) step 11322/76294 | train loss 3.394655 | norm 0.3375 | lr 5.54e-04 | (3929.73 ms | 133416 tok/s) step 11323/76294 | train loss 3.389652 | norm 0.6047 | lr 5.54e-04 | (3802.60 ms | 137876 tok/s) step 11324/76294 | train loss 3.450649 | norm 0.3124 | lr 5.54e-04 | (3802.27 ms | 137888 tok/s) step 11325/76294 | train loss 3.449711 | norm 0.3877 | lr 5.54e-04 | (3827.58 ms | 136976 tok/s) step 11326/76294 | train loss 3.466731 | norm 0.2957 | lr 5.54e-04 | (3807.64 ms | 137694 tok/s) step 11327/76294 | train loss 3.420026 | norm 0.2394 | lr 5.54e-04 | (3805.08 ms | 137786 tok/s) step 11328/76294 | train loss 3.403820 | norm 0.3296 | lr 5.54e-04 | (3862.63 ms | 135734 tok/s) step 11329/76294 | train loss 3.455558 | norm 0.2129 | lr 5.54e-04 | (3801.47 ms | 137917 tok/s) step 11330/76294 | train loss 3.420775 | norm 0.2737 | lr 5.53e-04 | (3834.05 ms | 136745 tok/s) step 11331/76294 | train loss 3.411925 | norm 0.2240 | lr 5.53e-04 | (3806.92 ms | 137720 tok/s) step 11332/76294 | train loss 3.450461 | norm 0.2731 | lr 5.53e-04 | (3807.26 ms | 137707 tok/s) step 11333/76294 | train loss 3.394736 | norm 0.3031 | lr 5.53e-04 | (3824.38 ms | 137091 tok/s) step 11334/76294 | train loss 3.379976 | norm 0.2793 | lr 5.53e-04 | (3809.21 ms | 137637 tok/s) step 11335/76294 | train loss 3.474450 | norm 0.3160 | lr 5.53e-04 | (3806.35 ms | 137740 tok/s) step 11336/76294 | train loss 3.468382 | norm 0.2462 | lr 5.53e-04 | (3806.33 ms | 137741 tok/s) step 11337/76294 | train loss 3.435559 | norm 0.2177 | lr 5.53e-04 | (3808.47 ms | 137664 tok/s) step 11338/76294 | train loss 3.398365 | norm 0.2479 | lr 5.53e-04 | (3805.74 
ms | 137763 tok/s) step 11339/76294 | train loss 3.460630 | norm 0.2238 | lr 5.53e-04 | (3827.98 ms | 136962 tok/s) step 11340/76294 | train loss 3.426140 | norm 0.2236 | lr 5.53e-04 | (3807.28 ms | 137707 tok/s) step 11341/76294 | train loss 3.407735 | norm 0.1712 | lr 5.52e-04 | (3832.59 ms | 136797 tok/s) step 11342/76294 | train loss 3.491953 | norm 0.2274 | lr 5.52e-04 | (3805.54 ms | 137770 tok/s) step 11343/76294 | train loss 3.440782 | norm 0.1989 | lr 5.52e-04 | (3826.78 ms | 137005 tok/s) step 11344/76294 | train loss 3.453679 | norm 0.2457 | lr 5.52e-04 | (3901.71 ms | 134374 tok/s) step 11345/76294 | train loss 3.450830 | norm 0.2658 | lr 5.52e-04 | (3802.98 ms | 137863 tok/s) step 11346/76294 | train loss 3.412297 | norm 0.1792 | lr 5.52e-04 | (3808.30 ms | 137670 tok/s) step 11347/76294 | train loss 3.452405 | norm 0.2857 | lr 5.52e-04 | (3824.81 ms | 137076 tok/s) step 11348/76294 | train loss 3.371280 | norm 0.2800 | lr 5.52e-04 | (3807.27 ms | 137707 tok/s) step 11349/76294 | train loss 3.443504 | norm 0.2056 | lr 5.52e-04 | (3811.86 ms | 137541 tok/s) step 11350/76294 | train loss 3.484236 | norm 0.2138 | lr 5.52e-04 | (3807.39 ms | 137703 tok/s) step 11351/76294 | train loss 3.390372 | norm 0.2326 | lr 5.52e-04 | (3802.12 ms | 137894 tok/s) step 11352/76294 | train loss 3.464689 | norm 0.2870 | lr 5.52e-04 | (3832.82 ms | 136789 tok/s) step 11353/76294 | train loss 3.470434 | norm 0.2008 | lr 5.51e-04 | (3806.67 ms | 137729 tok/s) step 11354/76294 | train loss 3.357407 | norm 0.2790 | lr 5.51e-04 | (3855.46 ms | 135986 tok/s) step 11355/76294 | train loss 3.359608 | norm 0.1955 | lr 5.51e-04 | (3806.00 ms | 137753 tok/s) step 11356/76294 | train loss 3.407119 | norm 0.2780 | lr 5.51e-04 | (3915.95 ms | 133885 tok/s) step 11357/76294 | train loss 3.449377 | norm 0.2060 | lr 5.51e-04 | (3798.03 ms | 138042 tok/s) step 11358/76294 | train loss 3.407141 | norm 0.2610 | lr 5.51e-04 | (3808.00 ms | 137681 tok/s) step 11359/76294 | train loss 3.458842 | 
norm 0.2299 | lr 5.51e-04 | (3825.67 ms | 137045 tok/s) step 11360/76294 | train loss 3.389706 | norm 0.3219 | lr 5.51e-04 | (3840.39 ms | 136520 tok/s) step 11361/76294 | train loss 3.434441 | norm 0.2832 | lr 5.51e-04 | (3802.06 ms | 137896 tok/s) step 11362/76294 | train loss 3.502547 | norm 0.2835 | lr 5.51e-04 | (3804.12 ms | 137821 tok/s) step 11363/76294 | train loss 3.473189 | norm 0.2849 | lr 5.51e-04 | (3801.40 ms | 137920 tok/s) step 11364/76294 | train loss 3.420736 | norm 0.2435 | lr 5.50e-04 | (3811.98 ms | 137537 tok/s) step 11365/76294 | train loss 3.475711 | norm 0.2260 | lr 5.50e-04 | (3984.55 ms | 131580 tok/s) step 11366/76294 | train loss 3.435193 | norm 0.2157 | lr 5.50e-04 | (3798.72 ms | 138017 tok/s) step 11367/76294 | train loss 3.422672 | norm 0.2182 | lr 5.50e-04 | (3850.39 ms | 136165 tok/s) step 11368/76294 | train loss 3.470708 | norm 0.7432 | lr 5.50e-04 | (3794.71 ms | 138163 tok/s) step 11369/76294 | train loss 3.529340 | norm 0.2599 | lr 5.50e-04 | (3819.82 ms | 137255 tok/s) step 11370/76294 | train loss 3.428927 | norm 0.2263 | lr 5.50e-04 | (3796.64 ms | 138092 tok/s) step 11371/76294 | train loss 3.472612 | norm 0.2663 | lr 5.50e-04 | (3816.78 ms | 137364 tok/s) step 11372/76294 | train loss 3.449921 | norm 0.3622 | lr 5.50e-04 | (3792.86 ms | 138230 tok/s) step 11373/76294 | train loss 3.347508 | norm 0.2240 | lr 5.50e-04 | (3799.25 ms | 137998 tok/s) step 11374/76294 | train loss 3.374703 | norm 0.3098 | lr 5.50e-04 | (3815.36 ms | 137415 tok/s) step 11375/76294 | train loss 3.416402 | norm 0.2631 | lr 5.49e-04 | (3796.38 ms | 138102 tok/s) step 11376/76294 | train loss 3.406684 | norm 0.2351 | lr 5.49e-04 | (3794.29 ms | 138178 tok/s) step 11377/76294 | train loss 3.466224 | norm 0.2335 | lr 5.49e-04 | (3902.93 ms | 134332 tok/s) step 11378/76294 | train loss 3.466725 | norm 0.2195 | lr 5.49e-04 | (3783.68 ms | 138566 tok/s) step 11379/76294 | train loss 3.438225 | norm 0.2426 | lr 5.49e-04 | (3788.57 ms | 138387 tok/s) 
step 11380/76294 | train loss 3.431114 | norm 0.1918 | lr 5.49e-04 | (3807.84 ms | 137687 tok/s) step 11381/76294 | train loss 3.425923 | norm 0.2201 | lr 5.49e-04 | (3789.51 ms | 138353 tok/s) step 11382/76294 | train loss 3.477481 | norm 0.2050 | lr 5.49e-04 | (3795.67 ms | 138128 tok/s) step 11383/76294 | train loss 3.397537 | norm 0.2775 | lr 5.49e-04 | (5319.44 ms | 98561 tok/s) step 11384/76294 | train loss 3.421613 | norm 0.2345 | lr 5.49e-04 | (3809.04 ms | 137643 tok/s) step 11385/76294 | train loss 3.464015 | norm 0.2690 | lr 5.49e-04 | (3956.89 ms | 132500 tok/s) step 11386/76294 | train loss 3.475001 | norm 0.2963 | lr 5.49e-04 | (3784.93 ms | 138520 tok/s) step 11387/76294 | train loss 3.334804 | norm 0.1945 | lr 5.48e-04 | (3804.80 ms | 137796 tok/s) step 11388/76294 | train loss 3.342349 | norm 0.2800 | lr 5.48e-04 | (3809.01 ms | 137644 tok/s) step 11389/76294 | train loss 3.396383 | norm 0.1962 | lr 5.48e-04 | (3797.57 ms | 138059 tok/s) step 11390/76294 | train loss 3.402378 | norm 0.2509 | lr 5.48e-04 | (3796.48 ms | 138098 tok/s) step 11391/76294 | train loss 3.502894 | norm 0.2354 | lr 5.48e-04 | (3880.88 ms | 135095 tok/s) step 11392/76294 | train loss 3.433222 | norm 0.2183 | lr 5.48e-04 | (3787.40 ms | 138430 tok/s) step 11393/76294 | train loss 3.454736 | norm 0.3531 | lr 5.48e-04 | (3796.64 ms | 138092 tok/s) step 11394/76294 | train loss 3.385403 | norm 0.2211 | lr 5.48e-04 | (3812.09 ms | 137533 tok/s) step 11395/76294 | train loss 3.414860 | norm 0.3762 | lr 5.48e-04 | (3792.97 ms | 138226 tok/s) step 11396/76294 | train loss 3.440257 | norm 0.2627 | lr 5.48e-04 | (3796.91 ms | 138083 tok/s) step 11397/76294 | train loss 3.453896 | norm 0.2828 | lr 5.48e-04 | (3798.50 ms | 138025 tok/s) step 11398/76294 | train loss 3.444051 | norm 0.3329 | lr 5.47e-04 | (3798.87 ms | 138012 tok/s) step 11399/76294 | train loss 3.445977 | norm 0.2770 | lr 5.47e-04 | (3807.19 ms | 137710 tok/s) step 11400/76294 | train loss 3.553242 | norm 0.3030 | lr 
5.47e-04 | (3838.13 ms | 136600 tok/s) step 11401/76294 | train loss 3.409555 | norm 0.2388 | lr 5.47e-04 | (3790.55 ms | 138314 tok/s) step 11402/76294 | train loss 3.473345 | norm 0.3134 | lr 5.47e-04 | (3821.18 ms | 137206 tok/s) step 11403/76294 | train loss 3.444239 | norm 0.4899 | lr 5.47e-04 | (3796.00 ms | 138116 tok/s) step 11404/76294 | train loss 3.437264 | norm 0.2546 | lr 5.47e-04 | (3799.11 ms | 138003 tok/s) step 11405/76294 | train loss 3.459745 | norm 0.3191 | lr 5.47e-04 | (3793.33 ms | 138213 tok/s) step 11406/76294 | train loss 3.392500 | norm 0.2939 | lr 5.47e-04 | (3801.48 ms | 137917 tok/s) step 11407/76294 | train loss 3.391360 | norm 0.1988 | lr 5.47e-04 | (3893.04 ms | 134673 tok/s) step 11408/76294 | train loss 3.421717 | norm 0.2460 | lr 5.47e-04 | (3794.16 ms | 138183 tok/s) step 11409/76294 | train loss 3.411199 | norm 0.2123 | lr 5.46e-04 | (3801.41 ms | 137919 tok/s) step 11410/76294 | train loss 3.503926 | norm 0.2334 | lr 5.46e-04 | (3820.69 ms | 137223 tok/s) step 11411/76294 | train loss 3.440524 | norm 0.3376 | lr 5.46e-04 | (3798.14 ms | 138038 tok/s) step 11412/76294 | train loss 3.435450 | norm 0.2094 | lr 5.46e-04 | (3809.46 ms | 137628 tok/s) step 11413/76294 | train loss 3.447485 | norm 0.3620 | lr 5.46e-04 | (3800.94 ms | 137936 tok/s) step 11414/76294 | train loss 3.534056 | norm 0.1995 | lr 5.46e-04 | (3806.15 ms | 137748 tok/s) step 11415/76294 | train loss 3.387541 | norm 0.2867 | lr 5.46e-04 | (3804.73 ms | 137799 tok/s) step 11416/76294 | train loss 3.409562 | norm 0.2696 | lr 5.46e-04 | (3803.18 ms | 137855 tok/s) step 11417/76294 | train loss 3.377024 | norm 0.2571 | lr 5.46e-04 | (3804.74 ms | 137799 tok/s) step 11418/76294 | train loss 3.396534 | norm 0.2210 | lr 5.46e-04 | (3804.33 ms | 137814 tok/s) step 11419/76294 | train loss 3.415422 | norm 0.2615 | lr 5.46e-04 | (3809.85 ms | 137614 tok/s) step 11420/76294 | train loss 3.448373 | norm 0.2361 | lr 5.46e-04 | (3806.43 ms | 137737 tok/s) step 11421/76294 | 
train loss 3.415725 | norm 0.2728 | lr 5.45e-04 | (3806.33 ms | 137741 tok/s) step 11422/76294 | train loss 3.453368 | norm 0.2734 | lr 5.45e-04 | (3818.76 ms | 137293 tok/s) step 11423/76294 | train loss 3.390115 | norm 0.2915 | lr 5.45e-04 | (6651.76 ms | 78819 tok/s) step 11424/76294 | train loss 3.417149 | norm 0.3968 | lr 5.45e-04 | (3894.58 ms | 134620 tok/s) step 11425/76294 | train loss 3.438130 | norm 0.2003 | lr 5.45e-04 | (3795.95 ms | 138118 tok/s) step 11426/76294 | train loss 3.398827 | norm 0.4767 | lr 5.45e-04 | (3800.83 ms | 137941 tok/s) step 11427/76294 | train loss 3.454023 | norm 0.3509 | lr 5.45e-04 | (3794.61 ms | 138167 tok/s) step 11428/76294 | train loss 3.426600 | norm 0.3166 | lr 5.45e-04 | (5561.72 ms | 94267 tok/s) step 11429/76294 | train loss 3.428487 | norm 0.4239 | lr 5.45e-04 | (3792.16 ms | 138256 tok/s) step 11430/76294 | train loss 3.462665 | norm 0.2299 | lr 5.45e-04 | (3799.90 ms | 137974 tok/s) step 11431/76294 | train loss 3.430409 | norm 0.5145 | lr 5.45e-04 | (3819.19 ms | 137277 tok/s) step 11432/76294 | train loss 3.390272 | norm 0.3836 | lr 5.44e-04 | (3806.25 ms | 137744 tok/s) step 11433/76294 | train loss 3.481684 | norm 0.3469 | lr 5.44e-04 | (3793.11 ms | 138221 tok/s) step 11434/76294 | train loss 3.462175 | norm 0.3421 | lr 5.44e-04 | (3867.41 ms | 135566 tok/s) step 11435/76294 | train loss 3.468476 | norm 0.2700 | lr 5.44e-04 | (3799.20 ms | 137999 tok/s) step 11436/76294 | train loss 3.464840 | norm 0.2326 | lr 5.44e-04 | (3802.63 ms | 137875 tok/s) step 11437/76294 | train loss 3.459965 | norm 0.2989 | lr 5.44e-04 | (3825.37 ms | 137055 tok/s) step 11438/76294 | train loss 3.411123 | norm 0.2651 | lr 5.44e-04 | (3804.73 ms | 137799 tok/s) step 11439/76294 | train loss 3.453923 | norm 0.2961 | lr 5.44e-04 | (3818.46 ms | 137304 tok/s) step 11440/76294 | train loss 3.422722 | norm 0.2431 | lr 5.44e-04 | (3802.82 ms | 137868 tok/s) step 11441/76294 | train loss 3.463295 | norm 0.3489 | lr 5.44e-04 | (3879.16 ms 
| 135155 tok/s) step 11442/76294 | train loss 3.413114 | norm 0.3065 | lr 5.44e-04 | (3794.89 ms | 138156 tok/s) step 11443/76294 | train loss 3.488007 | norm 0.3354 | lr 5.43e-04 | (3797.48 ms | 138062 tok/s) step 11444/76294 | train loss 3.403481 | norm 0.3797 | lr 5.43e-04 | (3821.40 ms | 137198 tok/s) step 11445/76294 | train loss 3.430645 | norm 0.2131 | lr 5.43e-04 | (4105.24 ms | 127712 tok/s) step 11446/76294 | train loss 3.391711 | norm 0.2227 | lr 5.43e-04 | (3838.60 ms | 136583 tok/s) step 11447/76294 | train loss 3.591258 | norm 0.3033 | lr 5.43e-04 | (3801.94 ms | 137900 tok/s) step 11448/76294 | train loss 3.453586 | norm 0.2128 | lr 5.43e-04 | (3828.48 ms | 136944 tok/s) step 11449/76294 | train loss 3.396998 | norm 0.2407 | lr 5.43e-04 | (3796.92 ms | 138082 tok/s) step 11450/76294 | train loss 3.415401 | norm 0.2167 | lr 5.43e-04 | (3913.24 ms | 133978 tok/s) step 11451/76294 | train loss 3.410358 | norm 0.2398 | lr 5.43e-04 | (3801.96 ms | 137899 tok/s) step 11452/76294 | train loss 3.423440 | norm 0.2763 | lr 5.43e-04 | (3818.62 ms | 137298 tok/s) step 11453/76294 | train loss 3.443242 | norm 0.2381 | lr 5.43e-04 | (3810.79 ms | 137580 tok/s) step 11454/76294 | train loss 3.385688 | norm 0.2342 | lr 5.43e-04 | (3806.75 ms | 137726 tok/s) step 11455/76294 | train loss 3.467115 | norm 0.2975 | lr 5.42e-04 | (3822.59 ms | 137155 tok/s) step 11456/76294 | train loss 3.385743 | norm 0.2108 | lr 5.42e-04 | (3803.65 ms | 137838 tok/s) step 11457/76294 | train loss 3.413531 | norm 0.2142 | lr 5.42e-04 | (3812.39 ms | 137522 tok/s) step 11458/76294 | train loss 3.485401 | norm 0.1795 | lr 5.42e-04 | (3806.90 ms | 137720 tok/s) step 11459/76294 | train loss 3.476870 | norm 0.1874 | lr 5.42e-04 | (3814.68 ms | 137440 tok/s) step 11460/76294 | train loss 3.413597 | norm 0.1920 | lr 5.42e-04 | (3809.46 ms | 137628 tok/s) step 11461/76294 | train loss 3.461539 | norm 0.1719 | lr 5.42e-04 | (3810.35 ms | 137596 tok/s) step 11462/76294 | train loss 3.416780 | 
norm 0.1914 | lr 5.42e-04 | (3830.45 ms | 136874 tok/s) step 11463/76294 | train loss 3.417980 | norm 0.1882 | lr 5.42e-04 | (3812.21 ms | 137529 tok/s) step 11464/76294 | train loss 3.438413 | norm 0.2149 | lr 5.42e-04 | (3809.91 ms | 137612 tok/s) step 11465/76294 | train loss 3.452392 | norm 0.2097 | lr 5.42e-04 | (3802.65 ms | 137875 tok/s) step 11466/76294 | train loss 3.404923 | norm 0.2777 | lr 5.41e-04 | (3836.02 ms | 136675 tok/s) step 11467/76294 | train loss 3.439805 | norm 0.2487 | lr 5.41e-04 | (3812.19 ms | 137529 tok/s) step 11468/76294 | train loss 3.456599 | norm 0.2422 | lr 5.41e-04 | (3811.71 ms | 137547 tok/s) step 11469/76294 | train loss 3.384456 | norm 0.2596 | lr 5.41e-04 | (3804.74 ms | 137799 tok/s) step 11470/76294 | train loss 3.395801 | norm 0.2471 | lr 5.41e-04 | (4048.49 ms | 129502 tok/s) step 11471/76294 | train loss 3.407002 | norm 0.7329 | lr 5.41e-04 | (3811.38 ms | 137559 tok/s) step 11472/76294 | train loss 3.377614 | norm 0.5506 | lr 5.41e-04 | (3830.00 ms | 136890 tok/s) step 11473/76294 | train loss 3.423903 | norm 0.4681 | lr 5.41e-04 | (3812.61 ms | 137514 tok/s) step 11474/76294 | train loss 3.356878 | norm 0.2343 | lr 5.41e-04 | (3826.64 ms | 137010 tok/s) step 11475/76294 | train loss 3.388952 | norm 0.3710 | lr 5.41e-04 | (3809.81 ms | 137615 tok/s) step 11476/76294 | train loss 3.405195 | norm 0.3895 | lr 5.41e-04 | (3805.30 ms | 137778 tok/s) step 11477/76294 | train loss 3.411884 | norm 0.2135 | lr 5.41e-04 | (3837.19 ms | 136633 tok/s) step 11478/76294 | train loss 3.459425 | norm 0.3657 | lr 5.40e-04 | (3806.60 ms | 137731 tok/s) step 11479/76294 | train loss 3.421922 | norm 0.2257 | lr 5.40e-04 | (3829.85 ms | 136895 tok/s) step 11480/76294 | train loss 3.422858 | norm 0.2292 | lr 5.40e-04 | (3805.43 ms | 137774 tok/s) step 11481/76294 | train loss 3.452344 | norm 0.5261 | lr 5.40e-04 | (3808.10 ms | 137677 tok/s) step 11482/76294 | train loss 3.418085 | norm 0.2303 | lr 5.40e-04 | (3832.81 ms | 136790 tok/s) 
step 11483/76294 | train loss 3.509795 | norm 0.3101 | lr 5.40e-04 | (3807.77 ms | 137689 tok/s) step 11484/76294 | train loss 3.429861 | norm 0.3265 | lr 5.40e-04 | (3815.34 ms | 137416 tok/s) step 11485/76294 | train loss 3.435887 | norm 0.2393 | lr 5.40e-04 | (3810.73 ms | 137582 tok/s) step 11486/76294 | train loss 3.449141 | norm 0.3029 | lr 5.40e-04 | (3812.15 ms | 137531 tok/s) step 11487/76294 | train loss 3.409156 | norm 0.2427 | lr 5.40e-04 | (3809.58 ms | 137623 tok/s) step 11488/76294 | train loss 3.438246 | norm 0.2817 | lr 5.40e-04 | (3824.03 ms | 137104 tok/s) step 11489/76294 | train loss 3.370439 | norm 0.2384 | lr 5.39e-04 | (3805.75 ms | 137762 tok/s) step 11490/76294 | train loss 3.426873 | norm 0.2097 | lr 5.39e-04 | (3819.89 ms | 137252 tok/s) step 11491/76294 | train loss 3.409490 | norm 0.2653 | lr 5.39e-04 | (3835.48 ms | 136694 tok/s) step 11492/76294 | train loss 3.473211 | norm 0.2610 | lr 5.39e-04 | (3828.53 ms | 136942 tok/s) step 11493/76294 | train loss 3.455497 | norm 0.2353 | lr 5.39e-04 | (3803.49 ms | 137844 tok/s) step 11494/76294 | train loss 3.452220 | norm 0.2328 | lr 5.39e-04 | (3807.78 ms | 137689 tok/s) step 11495/76294 | train loss 3.476007 | norm 0.3950 | lr 5.39e-04 | (3864.05 ms | 135683 tok/s) step 11496/76294 | train loss 3.428925 | norm 0.3174 | lr 5.39e-04 | (3884.93 ms | 134954 tok/s) step 11497/76294 | train loss 3.375612 | norm 0.2713 | lr 5.39e-04 | (3806.17 ms | 137747 tok/s) step 11498/76294 | train loss 3.448718 | norm 0.3841 | lr 5.39e-04 | (3809.54 ms | 137625 tok/s) step 11499/76294 | train loss 3.377369 | norm 0.2096 | lr 5.39e-04 | (3828.49 ms | 136944 tok/s) step 11500/76294 | train loss 3.451698 | norm 0.2218 | lr 5.38e-04 | (3816.21 ms | 137385 tok/s) val loss: 3.404216 saving model checkpoint to ./results/gpt2-124M-gqa/step_11500.pth step 11501/76294 | train loss 3.417796 | norm 0.2043 | lr 5.38e-04 | (3820.14 ms | 137243 tok/s) step 11502/76294 | train loss 3.360226 | norm 0.2449 | lr 5.38e-04 | 
(3804.59 ms | 137804 tok/s) step 11503/76294 | train loss 3.446654 | norm 0.2324 | lr 5.38e-04 | (3837.99 ms | 136605 tok/s) step 11504/76294 | train loss 3.447429 | norm 0.2907 | lr 5.38e-04 | (3828.48 ms | 136944 tok/s) step 11505/76294 | train loss 3.436031 | norm 0.1978 | lr 5.38e-04 | (3810.71 ms | 137583 tok/s) step 11506/76294 | train loss 3.392797 | norm 0.2813 | lr 5.38e-04 | (3811.67 ms | 137548 tok/s) step 11507/76294 | train loss 3.430087 | norm 0.2177 | lr 5.38e-04 | (3920.24 ms | 133739 tok/s) step 11508/76294 | train loss 3.415006 | norm 0.2004 | lr 5.38e-04 | (3794.50 ms | 138171 tok/s) step 11509/76294 | train loss 3.431575 | norm 0.2031 | lr 5.38e-04 | (3805.78 ms | 137761 tok/s) step 11510/76294 | train loss 3.405844 | norm 0.2470 | lr 5.38e-04 | (3797.93 ms | 138046 tok/s) step 11511/76294 | train loss 3.358685 | norm 0.2031 | lr 5.38e-04 | (3800.67 ms | 137946 tok/s) step 11512/76294 | train loss 3.387512 | norm 0.1944 | lr 5.37e-04 | (3839.36 ms | 136556 tok/s) step 11513/76294 | train loss 3.407503 | norm 0.2270 | lr 5.37e-04 | (3808.99 ms | 137645 tok/s) step 11514/76294 | train loss 3.416164 | norm 0.1974 | lr 5.37e-04 | (3807.54 ms | 137697 tok/s) step 11515/76294 | train loss 3.436878 | norm 0.2156 | lr 5.37e-04 | (3809.11 ms | 137641 tok/s) step 11516/76294 | train loss 3.566164 | norm 0.2006 | lr 5.37e-04 | (3847.31 ms | 136274 tok/s) step 11517/76294 | train loss 3.614766 | norm 0.2209 | lr 5.37e-04 | (3800.87 ms | 137939 tok/s) step 11518/76294 | train loss 3.411671 | norm 0.2254 | lr 5.37e-04 | (3824.48 ms | 137087 tok/s) step 11519/76294 | train loss 3.390150 | norm 0.1939 | lr 5.37e-04 | (3844.55 ms | 136372 tok/s) step 11520/76294 | train loss 3.579780 | norm 0.2577 | lr 5.37e-04 | (3798.86 ms | 138012 tok/s) step 11521/76294 | train loss 3.403142 | norm 0.2089 | lr 5.37e-04 | (3842.17 ms | 136456 tok/s) step 11522/76294 | train loss 3.375321 | norm 0.1862 | lr 5.37e-04 | (3801.71 ms | 137908 tok/s) step 11523/76294 | train loss 
3.459013 | norm 0.1940 | lr 5.36e-04 | (3875.19 ms | 135294 tok/s) step 11524/76294 | train loss 3.432798 | norm 0.1994 | lr 5.36e-04 | (3925.08 ms | 133574 tok/s) step 11525/76294 | train loss 3.412587 | norm 0.2006 | lr 5.36e-04 | (3794.70 ms | 138163 tok/s) step 11526/76294 | train loss 3.409986 | norm 0.2459 | lr 5.36e-04 | (3833.46 ms | 136766 tok/s) step 11527/76294 | train loss 3.461498 | norm 0.2822 | lr 5.36e-04 | (3795.30 ms | 138141 tok/s) step 11528/76294 | train loss 3.457420 | norm 0.3239 | lr 5.36e-04 | (3823.50 ms | 137122 tok/s) step 11529/76294 | train loss 3.354881 | norm 0.3392 | lr 5.36e-04 | (3798.66 ms | 138019 tok/s) step 11530/76294 | train loss 3.458670 | norm 0.2447 | lr 5.36e-04 | (3799.32 ms | 137995 tok/s) step 11531/76294 | train loss 3.418002 | norm 0.2605 | lr 5.36e-04 | (3819.79 ms | 137256 tok/s) step 11532/76294 | train loss 3.401741 | norm 0.2352 | lr 5.36e-04 | (3809.74 ms | 137618 tok/s) step 11533/76294 | train loss 3.398109 | norm 0.2080 | lr 5.36e-04 | (3797.55 ms | 138060 tok/s) step 11534/76294 | train loss 3.428565 | norm 0.3083 | lr 5.36e-04 | (3827.98 ms | 136962 tok/s) step 11535/76294 | train loss 3.444393 | norm 0.2003 | lr 5.35e-04 | (3799.47 ms | 137990 tok/s) step 11536/76294 | train loss 3.387132 | norm 0.2709 | lr 5.35e-04 | (3824.51 ms | 137086 tok/s) step 11537/76294 | train loss 3.506056 | norm 0.2995 | lr 5.35e-04 | (3802.04 ms | 137897 tok/s) step 11538/76294 | train loss 3.428149 | norm 0.2658 | lr 5.35e-04 | (3805.24 ms | 137781 tok/s) step 11539/76294 | train loss 3.376179 | norm 0.2944 | lr 5.35e-04 | (3826.26 ms | 137024 tok/s) step 11540/76294 | train loss 3.441228 | norm 0.2978 | lr 5.35e-04 | (3873.70 ms | 135345 tok/s) step 11541/76294 | train loss 3.462104 | norm 0.3239 | lr 5.35e-04 | (3801.09 ms | 137931 tok/s) step 11542/76294 | train loss 3.426136 | norm 0.2116 | lr 5.35e-04 | (3808.20 ms | 137673 tok/s) step 11543/76294 | train loss 3.386015 | norm 0.2959 | lr 5.35e-04 | (5324.17 ms | 98473 
tok/s) step 11544/76294 | train loss 3.456121 | norm 0.2412 | lr 5.35e-04 | (3911.56 ms | 134036 tok/s) step 11545/76294 | train loss 3.379613 | norm 0.2006 | lr 5.35e-04 | (5392.89 ms | 97218 tok/s) step 11546/76294 | train loss 3.419441 | norm 0.2563 | lr 5.34e-04 | (3832.86 ms | 136788 tok/s) step 11547/76294 | train loss 3.368513 | norm 0.2096 | lr 5.34e-04 | (3807.64 ms | 137694 tok/s) step 11548/76294 | train loss 3.406125 | norm 0.3196 | lr 5.34e-04 | (3806.38 ms | 137739 tok/s) step 11549/76294 | train loss 3.398714 | norm 0.1969 | lr 5.34e-04 | (3812.40 ms | 137522 tok/s) step 11550/76294 | train loss 3.426355 | norm 0.3175 | lr 5.34e-04 | (3806.80 ms | 137724 tok/s) step 11551/76294 | train loss 3.410128 | norm 0.2171 | lr 5.34e-04 | (3810.80 ms | 137580 tok/s) step 11552/76294 | train loss 3.417307 | norm 0.3096 | lr 5.34e-04 | (3808.13 ms | 137676 tok/s) step 11553/76294 | train loss 3.350128 | norm 0.2497 | lr 5.34e-04 | (3805.21 ms | 137782 tok/s) step 11554/76294 | train loss 3.388026 | norm 0.2505 | lr 5.34e-04 | (3841.11 ms | 136494 tok/s) step 11555/76294 | train loss 3.483709 | norm 0.3521 | lr 5.34e-04 | (3809.85 ms | 137614 tok/s) step 11556/76294 | train loss 3.375981 | norm 0.2200 | lr 5.34e-04 | (3811.20 ms | 137565 tok/s) step 11557/76294 | train loss 3.451655 | norm 0.2640 | lr 5.33e-04 | (3827.15 ms | 136992 tok/s) step 11558/76294 | train loss 3.485398 | norm 0.3200 | lr 5.33e-04 | (3810.25 ms | 137600 tok/s) step 11559/76294 | train loss 3.451808 | norm 0.2114 | lr 5.33e-04 | (3814.10 ms | 137460 tok/s) step 11560/76294 | train loss 3.452780 | norm 0.2782 | lr 5.33e-04 | (3809.31 ms | 137633 tok/s) step 11561/76294 | train loss 3.405111 | norm 0.2171 | lr 5.33e-04 | (3830.08 ms | 136887 tok/s) step 11562/76294 | train loss 3.443172 | norm 0.2575 | lr 5.33e-04 | (3810.08 ms | 137606 tok/s) step 11563/76294 | train loss 3.371063 | norm 0.3037 | lr 5.33e-04 | (3837.56 ms | 136620 tok/s) step 11564/76294 | train loss 3.464467 | norm 0.2219 
| lr 5.33e-04 | (3981.47 ms | 131682 tok/s) step 11565/76294 | train loss 3.465364 | norm 0.2681 | lr 5.33e-04 | (3805.84 ms | 137759 tok/s) step 11566/76294 | train loss 3.490490 | norm 0.2083 | lr 5.33e-04 | (3869.69 ms | 135486 tok/s) step 11567/76294 | train loss 3.423988 | norm 0.2065 | lr 5.33e-04 | (3805.80 ms | 137760 tok/s) step 11568/76294 | train loss 3.419690 | norm 0.2373 | lr 5.33e-04 | (3807.49 ms | 137699 tok/s) step 11569/76294 | train loss 3.466472 | norm 0.2274 | lr 5.32e-04 | (3838.76 ms | 136577 tok/s) step 11570/76294 | train loss 3.375087 | norm 0.2456 | lr 5.32e-04 | (3814.13 ms | 137459 tok/s) step 11571/76294 | train loss 3.487843 | norm 0.1925 | lr 5.32e-04 | (3803.79 ms | 137833 tok/s) step 11572/76294 | train loss 3.446486 | norm 0.2683 | lr 5.32e-04 | (3842.01 ms | 136462 tok/s) step 11573/76294 | train loss 3.373317 | norm 0.1828 | lr 5.32e-04 | (3805.22 ms | 137781 tok/s) step 11574/76294 | train loss 3.438128 | norm 0.2381 | lr 5.32e-04 | (3828.65 ms | 136938 tok/s) step 11575/76294 | train loss 3.373665 | norm 0.3673 | lr 5.32e-04 | (3803.40 ms | 137847 tok/s) step 11576/76294 | train loss 3.427406 | norm 0.2659 | lr 5.32e-04 | (3818.37 ms | 137307 tok/s) step 11577/76294 | train loss 3.446272 | norm 0.2855 | lr 5.32e-04 | (3840.24 ms | 136525 tok/s) step 11578/76294 | train loss 3.415671 | norm 0.2312 | lr 5.32e-04 | (3803.67 ms | 137838 tok/s) step 11579/76294 | train loss 3.410347 | norm 0.2426 | lr 5.32e-04 | (3806.05 ms | 137751 tok/s) step 11580/76294 | train loss 3.400178 | norm 0.2334 | lr 5.31e-04 | (3809.97 ms | 137609 tok/s) step 11581/76294 | train loss 3.367963 | norm 0.1951 | lr 5.31e-04 | (3823.76 ms | 137113 tok/s) step 11582/76294 | train loss 3.447606 | norm 0.2148 | lr 5.31e-04 | (3806.53 ms | 137734 tok/s) step 11583/76294 | train loss 3.437722 | norm 0.2222 | lr 5.31e-04 | (3808.58 ms | 137660 tok/s) step 11584/76294 | train loss 3.405719 | norm 0.2102 | lr 5.31e-04 | (3809.18 ms | 137638 tok/s) step 
11585/76294 | train loss 3.389272 | norm 0.2955 | lr 5.31e-04 | (3816.21 ms | 137385 tok/s) step 11586/76294 | train loss 3.386747 | norm 0.1976 | lr 5.31e-04 | (3806.28 ms | 137743 tok/s) step 11587/76294 | train loss 3.427840 | norm 0.2781 | lr 5.31e-04 | (3917.11 ms | 133845 tok/s) step 11588/76294 | train loss 3.460842 | norm 0.2196 | lr 5.31e-04 | (3803.25 ms | 137853 tok/s) step 11589/76294 | train loss 3.375286 | norm 0.4155 | lr 5.31e-04 | (3822.08 ms | 137173 tok/s) step 11590/76294 | train loss 3.359624 | norm 0.2481 | lr 5.31e-04 | (3830.53 ms | 136871 tok/s) step 11591/76294 | train loss 3.454150 | norm 0.3358 | lr 5.31e-04 | (3806.86 ms | 137722 tok/s) step 11592/76294 | train loss 3.405418 | norm 0.2195 | lr 5.30e-04 | (3827.75 ms | 136970 tok/s) step 11593/76294 | train loss 3.423437 | norm 0.3373 | lr 5.30e-04 | (3831.24 ms | 136846 tok/s) step 11594/76294 | train loss 3.460587 | norm 0.2735 | lr 5.30e-04 | (3805.19 ms | 137782 tok/s) step 11595/76294 | train loss 3.423358 | norm 0.2139 | lr 5.30e-04 | (3829.17 ms | 136920 tok/s) step 11596/76294 | train loss 3.433805 | norm 0.2376 | lr 5.30e-04 | (3831.79 ms | 136826 tok/s) step 11597/76294 | train loss 3.430655 | norm 0.2174 | lr 5.30e-04 | (3803.86 ms | 137830 tok/s) step 11598/76294 | train loss 3.435036 | norm 0.2499 | lr 5.30e-04 | (3807.00 ms | 137717 tok/s) step 11599/76294 | train loss 3.386309 | norm 0.3114 | lr 5.30e-04 | (3807.25 ms | 137708 tok/s) step 11600/76294 | train loss 3.432002 | norm 0.3225 | lr 5.30e-04 | (3807.93 ms | 137683 tok/s) step 11601/76294 | train loss 3.431428 | norm 0.2476 | lr 5.30e-04 | (3807.37 ms | 137703 tok/s) step 11602/76294 | train loss 3.448608 | norm 0.4772 | lr 5.30e-04 | (3812.74 ms | 137509 tok/s) step 11603/76294 | train loss 3.435624 | norm 0.1947 | lr 5.29e-04 | (3806.81 ms | 137724 tok/s) step 11604/76294 | train loss 3.409259 | norm 0.3505 | lr 5.29e-04 | (3811.06 ms | 137570 tok/s) step 11605/76294 | train loss 3.486011 | norm 0.4421 | lr 
5.29e-04 | (3809.49 ms | 137627 tok/s) step 11606/76294 | train loss 3.373594 | norm 0.2989 | lr 5.29e-04 | (3816.84 ms | 137362 tok/s) step 11607/76294 | train loss 3.440742 | norm 0.4113 | lr 5.29e-04 | (3806.02 ms | 137752 tok/s) step 11608/76294 | train loss 3.360944 | norm 0.2248 | lr 5.29e-04 | (3824.86 ms | 137074 tok/s) step 11609/76294 | train loss 3.391157 | norm 0.3121 | lr 5.29e-04 | (3809.27 ms | 137635 tok/s) step 11610/76294 | train loss 3.406076 | norm 0.2576 | lr 5.29e-04 | (3819.04 ms | 137283 tok/s) step 11611/76294 | train loss 3.411848 | norm 0.2441 | lr 5.29e-04 | (3805.10 ms | 137785 tok/s) step 11612/76294 | train loss 3.436182 | norm 0.2721 | lr 5.29e-04 | (3879.60 ms | 135140 tok/s) step 11613/76294 | train loss 3.432858 | norm 0.2234 | lr 5.29e-04 | (3785.21 ms | 138510 tok/s) step 11614/76294 | train loss 3.475984 | norm 0.3209 | lr 5.29e-04 | (3848.16 ms | 136244 tok/s) step 11615/76294 | train loss 3.398148 | norm 0.2107 | lr 5.28e-04 | (3793.16 ms | 138219 tok/s) step 11616/76294 | train loss 3.379681 | norm 0.2077 | lr 5.28e-04 | (3817.46 ms | 137340 tok/s) step 11617/76294 | train loss 3.402094 | norm 0.3353 | lr 5.28e-04 | (3790.40 ms | 138320 tok/s) step 11618/76294 | train loss 3.419915 | norm 0.1800 | lr 5.28e-04 | (3816.32 ms | 137380 tok/s) step 11619/76294 | train loss 3.442796 | norm 0.2620 | lr 5.28e-04 | (3791.42 ms | 138283 tok/s) step 11620/76294 | train loss 3.421238 | norm 0.2363 | lr 5.28e-04 | (4014.89 ms | 130586 tok/s) step 11621/76294 | train loss 3.402452 | norm 0.3336 | lr 5.28e-04 | (3778.48 ms | 138756 tok/s) step 11622/76294 | train loss 3.445836 | norm 0.3286 | lr 5.28e-04 | (3810.63 ms | 137586 tok/s) step 11623/76294 | train loss 3.432652 | norm 0.3209 | lr 5.28e-04 | (3800.48 ms | 137953 tok/s) step 11624/76294 | train loss 3.461314 | norm 0.3825 | lr 5.28e-04 | (3791.05 ms | 138296 tok/s) step 11625/76294 | train loss 3.440948 | norm 0.2054 | lr 5.28e-04 | (3836.94 ms | 136642 tok/s) step 11626/76294 | 
train loss 3.424166 | norm 0.4705 | lr 5.27e-04 | (3789.66 ms | 138347 tok/s) step 11627/76294 | train loss 3.462204 | norm 0.2567 | lr 5.27e-04 | (3844.36 ms | 136378 tok/s) step 11628/76294 | train loss 3.442959 | norm 0.2796 | lr 5.27e-04 | (3789.83 ms | 138341 tok/s) step 11629/76294 | train loss 3.427451 | norm 0.2913 | lr 5.27e-04 | (3795.23 ms | 138144 tok/s) step 11630/76294 | train loss 3.427355 | norm 0.2228 | lr 5.27e-04 | (3853.66 ms | 136049 tok/s) step 11631/76294 | train loss 3.387367 | norm 0.2032 | lr 5.27e-04 | (3789.64 ms | 138348 tok/s) step 11632/76294 | train loss 3.397076 | norm 0.2141 | lr 5.27e-04 | (3852.52 ms | 136090 tok/s) step 11633/76294 | train loss 3.430382 | norm 0.1928 | lr 5.27e-04 | (3792.16 ms | 138256 tok/s) step 11634/76294 | train loss 3.415983 | norm 0.2873 | lr 5.27e-04 | (3819.84 ms | 137254 tok/s) step 11635/76294 | train loss 3.410595 | norm 0.2671 | lr 5.27e-04 | (4541.05 ms | 115455 tok/s) step 11636/76294 | train loss 3.414391 | norm 0.2624 | lr 5.27e-04 | (3819.08 ms | 137281 tok/s) step 11637/76294 | train loss 3.443560 | norm 0.2814 | lr 5.27e-04 | (3790.98 ms | 138299 tok/s) step 11638/76294 | train loss 3.428162 | norm 0.2614 | lr 5.26e-04 | (3818.65 ms | 137297 tok/s) step 11639/76294 | train loss 3.396012 | norm 0.2870 | lr 5.26e-04 | (3818.42 ms | 137305 tok/s) step 11640/76294 | train loss 3.466319 | norm 0.2003 | lr 5.26e-04 | (3793.14 ms | 138220 tok/s) step 11641/76294 | train loss 3.372768 | norm 0.3034 | lr 5.26e-04 | (3839.67 ms | 136545 tok/s) step 11642/76294 | train loss 3.453157 | norm 0.3643 | lr 5.26e-04 | (3820.51 ms | 137230 tok/s) step 11643/76294 | train loss 3.440977 | norm 0.3091 | lr 5.26e-04 | (3793.83 ms | 138195 tok/s) step 11644/76294 | train loss 3.458222 | norm 0.3630 | lr 5.26e-04 | (3826.45 ms | 137017 tok/s) step 11645/76294 | train loss 3.457954 | norm 0.1837 | lr 5.26e-04 | (3794.60 ms | 138167 tok/s) step 11646/76294 | train loss 3.384595 | norm 0.3121 | lr 5.26e-04 | (3801.24 
ms | 137925 tok/s) step 11647/76294 | train loss 3.446717 | norm 0.2803 | lr 5.26e-04 | (3795.19 ms | 138145 tok/s) step 11648/76294 | train loss 3.428646 | norm 0.1762 | lr 5.26e-04 | (3842.41 ms | 136448 tok/s) step 11649/76294 | train loss 3.407486 | norm 0.2254 | lr 5.25e-04 | (3793.81 ms | 138196 tok/s) step 11650/76294 | train loss 3.432707 | norm 0.2298 | lr 5.25e-04 | (3830.96 ms | 136855 tok/s) step 11651/76294 | train loss 3.425044 | norm 0.2152 | lr 5.25e-04 | (3794.77 ms | 138161 tok/s) step 11652/76294 | train loss 3.405067 | norm 0.3344 | lr 5.25e-04 | (3820.76 ms | 137221 tok/s) step 11653/76294 | train loss 3.350674 | norm 0.2457 | lr 5.25e-04 | (3798.08 ms | 138040 tok/s) step 11654/76294 | train loss 3.433955 | norm 0.2975 | lr 5.25e-04 | (3800.59 ms | 137949 tok/s) step 11655/76294 | train loss 3.409835 | norm 0.2264 | lr 5.25e-04 | (3821.07 ms | 137210 tok/s) step 11656/76294 | train loss 3.465248 | norm 0.2960 | lr 5.25e-04 | (3798.23 ms | 138035 tok/s) step 11657/76294 | train loss 3.474091 | norm 0.2777 | lr 5.25e-04 | (3803.37 ms | 137848 tok/s) step 11658/76294 | train loss 3.551201 | norm 0.3100 | lr 5.25e-04 | (3801.88 ms | 137902 tok/s) step 11659/76294 | train loss 3.443302 | norm 0.2366 | lr 5.25e-04 | (3819.94 ms | 137250 tok/s) step 11660/76294 | train loss 3.496195 | norm 0.2367 | lr 5.24e-04 | (3824.83 ms | 137075 tok/s) step 11661/76294 | train loss 3.426901 | norm 0.2340 | lr 5.24e-04 | (3817.37 ms | 137343 tok/s) step 11662/76294 | train loss 3.434097 | norm 0.4530 | lr 5.24e-04 | (3804.07 ms | 137823 tok/s) step 11663/76294 | train loss 3.373017 | norm 0.3433 | lr 5.24e-04 | (3820.19 ms | 137241 tok/s) step 11664/76294 | train loss 3.392536 | norm 0.2975 | lr 5.24e-04 | (3834.57 ms | 136727 tok/s) step 11665/76294 | train loss 3.443065 | norm 0.2796 | lr 5.24e-04 | (3798.08 ms | 138040 tok/s) step 11666/76294 | train loss 3.408673 | norm 0.3195 | lr 5.24e-04 | (3826.02 ms | 137032 tok/s) step 11667/76294 | train loss 3.445063 | 
norm 0.2743 | lr 5.24e-04 | (3806.67 ms | 137729 tok/s) step 11668/76294 | train loss 3.355811 | norm 0.3246 | lr 5.24e-04 | (3867.44 ms | 135565 tok/s) step 11669/76294 | train loss 3.475310 | norm 0.2520 | lr 5.24e-04 | (3798.34 ms | 138031 tok/s) step 11670/76294 | train loss 3.352843 | norm 0.3011 | lr 5.24e-04 | (3846.38 ms | 136307 tok/s) step 11671/76294 | train loss 3.470602 | norm 0.2702 | lr 5.24e-04 | (3794.71 ms | 138163 tok/s) step 11672/76294 | train loss 3.398087 | norm 0.2186 | lr 5.23e-04 | (3825.14 ms | 137064 tok/s) step 11673/76294 | train loss 3.409233 | norm 0.2639 | lr 5.23e-04 | (3797.01 ms | 138079 tok/s) step 11674/76294 | train loss 3.468920 | norm 0.2232 | lr 5.23e-04 | (3800.54 ms | 137951 tok/s) step 11675/76294 | train loss 3.397108 | norm 0.2424 | lr 5.23e-04 | (3817.57 ms | 137335 tok/s) step 11676/76294 | train loss 3.446146 | norm 0.2253 | lr 5.23e-04 | (3843.22 ms | 136419 tok/s) step 11677/76294 | train loss 3.370762 | norm 0.2261 | lr 5.23e-04 | (3845.95 ms | 136322 tok/s) step 11678/76294 | train loss 3.416972 | norm 0.2104 | lr 5.23e-04 | (3800.58 ms | 137949 tok/s) step 11679/76294 | train loss 3.414123 | norm 0.2026 | lr 5.23e-04 | (3826.98 ms | 136998 tok/s) step 11680/76294 | train loss 3.442790 | norm 0.2439 | lr 5.23e-04 | (3821.77 ms | 137184 tok/s) step 11681/76294 | train loss 3.396128 | norm 0.2155 | lr 5.23e-04 | (3795.24 ms | 138144 tok/s) step 11682/76294 | train loss 3.425495 | norm 0.1886 | lr 5.23e-04 | (3798.39 ms | 138029 tok/s) step 11683/76294 | train loss 3.471509 | norm 0.2133 | lr 5.22e-04 | (3801.79 ms | 137906 tok/s) step 11684/76294 | train loss 3.421386 | norm 0.2101 | lr 5.22e-04 | (3803.06 ms | 137859 tok/s) step 11685/76294 | train loss 3.441844 | norm 0.1993 | lr 5.22e-04 | (3801.11 ms | 137930 tok/s) step 11686/76294 | train loss 3.545003 | norm 0.2071 | lr 5.22e-04 | (3800.38 ms | 137957 tok/s) step 11687/76294 | train loss 3.422559 | norm 0.2492 | lr 5.22e-04 | (3823.95 ms | 137106 tok/s) 
step 11688/76294 | train loss 3.363994 | norm 0.1869 | lr 5.22e-04 | (3801.25 ms | 137925 tok/s) step 11689/76294 | train loss 3.394484 | norm 0.2107 | lr 5.22e-04 | (3805.61 ms | 137767 tok/s) step 11690/76294 | train loss 3.507488 | norm 0.2370 | lr 5.22e-04 | (3821.00 ms | 137212 tok/s) step 11691/76294 | train loss 3.445168 | norm 0.2209 | lr 5.22e-04 | (3804.59 ms | 137804 tok/s) step 11692/76294 | train loss 3.394957 | norm 0.2084 | lr 5.22e-04 | (3824.44 ms | 137089 tok/s) step 11693/76294 | train loss 3.428160 | norm 0.2006 | lr 5.22e-04 | (3799.56 ms | 137987 tok/s) step 11694/76294 | train loss 3.459380 | norm 0.2611 | lr 5.22e-04 | (3822.08 ms | 137173 tok/s) step 11695/76294 | train loss 3.440008 | norm 0.2669 | lr 5.21e-04 | (3817.44 ms | 137340 tok/s) step 11696/76294 | train loss 3.499571 | norm 0.3973 | lr 5.21e-04 | (3802.07 ms | 137895 tok/s) step 11697/76294 | train loss 3.418149 | norm 0.3753 | lr 5.21e-04 | (3806.55 ms | 137733 tok/s) step 11698/76294 | train loss 3.446347 | norm 0.3554 | lr 5.21e-04 | (3801.66 ms | 137910 tok/s) step 11699/76294 | train loss 3.416129 | norm 0.4505 | lr 5.21e-04 | (3803.02 ms | 137861 tok/s) step 11700/76294 | train loss 3.464270 | norm 0.2153 | lr 5.21e-04 | (3796.91 ms | 138083 tok/s) step 11701/76294 | train loss 3.457835 | norm 0.3333 | lr 5.21e-04 | (3803.88 ms | 137830 tok/s) step 11702/76294 | train loss 3.390690 | norm 0.2815 | lr 5.21e-04 | (3819.58 ms | 137263 tok/s) step 11703/76294 | train loss 3.417360 | norm 0.2621 | lr 5.21e-04 | (3806.45 ms | 137737 tok/s) step 11704/76294 | train loss 3.483380 | norm 0.3780 | lr 5.21e-04 | (3836.73 ms | 136650 tok/s) step 11705/76294 | train loss 3.462786 | norm 0.2053 | lr 5.21e-04 | (3795.45 ms | 138136 tok/s) step 11706/76294 | train loss 3.485118 | norm 0.3582 | lr 5.20e-04 | (3866.98 ms | 135581 tok/s) step 11707/76294 | train loss 3.427917 | norm 0.2680 | lr 5.20e-04 | (3799.32 ms | 137995 tok/s) step 11708/76294 | train loss 3.423831 | norm 0.2196 | lr 
5.20e-04 | (3857.48 ms | 135915 tok/s) step 11709/76294 | train loss 3.371746 | norm 0.3418 | lr 5.20e-04 | (3822.16 ms | 137171 tok/s) step 11710/76294 | train loss 3.419413 | norm 0.2311 | lr 5.20e-04 | (3803.20 ms | 137854 tok/s) step 11711/76294 | train loss 3.393396 | norm 0.1854 | lr 5.20e-04 | (3804.72 ms | 137799 tok/s) step 11712/76294 | train loss 3.396435 | norm 0.2303 | lr 5.20e-04 | (3852.61 ms | 136086 tok/s) step 11713/76294 | train loss 3.436016 | norm 0.2137 | lr 5.20e-04 | (3805.08 ms | 137786 tok/s) step 11714/76294 | train loss 3.427171 | norm 0.2034 | lr 5.20e-04 | (3798.16 ms | 138037 tok/s) step 11715/76294 | train loss 3.526366 | norm 0.3662 | lr 5.20e-04 | (3799.47 ms | 137990 tok/s) step 11716/76294 | train loss 3.394701 | norm 0.2030 | lr 5.20e-04 | (3802.42 ms | 137883 tok/s) step 11717/76294 | train loss 3.386455 | norm 0.3192 | lr 5.20e-04 | (3804.35 ms | 137813 tok/s) step 11718/76294 | train loss 3.536690 | norm 0.2170 | lr 5.19e-04 | (3802.79 ms | 137869 tok/s) step 11719/76294 | train loss 3.559417 | norm 0.3038 | lr 5.19e-04 | (3804.21 ms | 137818 tok/s) step 11720/76294 | train loss 3.395232 | norm 0.2907 | lr 5.19e-04 | (3802.00 ms | 137898 tok/s) step 11721/76294 | train loss 3.426052 | norm 0.1964 | lr 5.19e-04 | (3807.44 ms | 137701 tok/s) step 11722/76294 | train loss 3.394049 | norm 0.2537 | lr 5.19e-04 | (3800.75 ms | 137943 tok/s) step 11723/76294 | train loss 3.423894 | norm 0.1882 | lr 5.19e-04 | (3798.30 ms | 138032 tok/s) step 11724/76294 | train loss 3.387896 | norm 0.2148 | lr 5.19e-04 | (3927.08 ms | 133506 tok/s) step 11725/76294 | train loss 3.388855 | norm 0.2949 | lr 5.19e-04 | (3799.05 ms | 138005 tok/s) step 11726/76294 | train loss 3.399107 | norm 0.2402 | lr 5.19e-04 | (3856.48 ms | 135950 tok/s) step 11727/76294 | train loss 3.396555 | norm 0.2845 | lr 5.19e-04 | (3834.71 ms | 136722 tok/s) step 11728/76294 | train loss 3.458236 | norm 0.2919 | lr 5.19e-04 | (3800.88 ms | 137939 tok/s) step 11729/76294 | 
train loss 3.369312 | norm 0.2491 | lr 5.18e-04 | (3820.01 ms | 137248 tok/s) step 11730/76294 | train loss 3.398321 | norm 0.2779 | lr 5.18e-04 | (3797.78 ms | 138051 tok/s) step 11731/76294 | train loss 3.426759 | norm 0.2822 | lr 5.18e-04 | (3803.90 ms | 137829 tok/s) step 11732/76294 | train loss 3.404759 | norm 0.2023 | lr 5.18e-04 | (3801.58 ms | 137913 tok/s) step 11733/76294 | train loss 3.445425 | norm 0.3151 | lr 5.18e-04 | (3801.73 ms | 137908 tok/s) step 11734/76294 | train loss 3.362312 | norm 0.3462 | lr 5.18e-04 | (3798.65 ms | 138019 tok/s) step 11735/76294 | train loss 3.558643 | norm 0.4490 | lr 5.18e-04 | (3800.95 ms | 137936 tok/s) step 11736/76294 | train loss 3.418254 | norm 0.2816 | lr 5.18e-04 | (3805.12 ms | 137785 tok/s) step 11737/76294 | train loss 3.401076 | norm 0.2602 | lr 5.18e-04 | (3815.40 ms | 137414 tok/s) step 11738/76294 | train loss 3.413928 | norm 0.3785 | lr 5.18e-04 | (3802.91 ms | 137865 tok/s) step 11739/76294 | train loss 3.426131 | norm 0.2829 | lr 5.18e-04 | (3809.54 ms | 137625 tok/s) step 11740/76294 | train loss 3.442296 | norm 0.2988 | lr 5.18e-04 | (3801.56 ms | 137914 tok/s) step 11741/76294 | train loss 3.350110 | norm 0.1988 | lr 5.17e-04 | (3805.77 ms | 137761 tok/s) step 11742/76294 | train loss 3.401012 | norm 0.2606 | lr 5.17e-04 | (3806.45 ms | 137737 tok/s) step 11743/76294 | train loss 3.399269 | norm 0.2145 | lr 5.17e-04 | (3805.25 ms | 137780 tok/s) step 11744/76294 | train loss 3.468292 | norm 0.2101 | lr 5.17e-04 | (3800.77 ms | 137943 tok/s) step 11745/76294 | train loss 3.408485 | norm 0.2611 | lr 5.17e-04 | (3801.50 ms | 137916 tok/s) step 11746/76294 | train loss 3.412041 | norm 0.1943 | lr 5.17e-04 | (3803.48 ms | 137844 tok/s) step 11747/76294 | train loss 3.331399 | norm 0.6772 | lr 5.17e-04 | (3823.88 ms | 137109 tok/s) step 11748/76294 | train loss 3.377367 | norm 0.4314 | lr 5.17e-04 | (3902.35 ms | 134352 tok/s) step 11749/76294 | train loss 3.417951 | norm 0.4793 | lr 5.17e-04 | (3796.88 
ms | 138084 tok/s) step 11750/76294 | train loss 3.417349 | norm 0.2322 | lr 5.17e-04 | (3833.10 ms | 136779 tok/s) val loss: 3.397523 saving model checkpoint to ./results/gpt2-124M-gqa/step_11750.pth step 11751/76294 | train loss 3.484651 | norm 0.3120 | lr 5.17e-04 | (3847.05 ms | 136283 tok/s) step 11752/76294 | train loss 3.383081 | norm 0.4531 | lr 5.16e-04 | (3770.20 ms | 139061 tok/s) step 11753/76294 | train loss 3.407292 | norm 0.3249 | lr 5.16e-04 | (3913.89 ms | 133956 tok/s) step 11754/76294 | train loss 3.441177 | norm 0.2700 | lr 5.16e-04 | (3842.06 ms | 136460 tok/s) step 11755/76294 | train loss 3.394516 | norm 0.3334 | lr 5.16e-04 | (3879.45 ms | 135145 tok/s) step 11756/76294 | train loss 3.393366 | norm 0.3199 | lr 5.16e-04 | (3834.14 ms | 136742 tok/s) step 11757/76294 | train loss 3.476773 | norm 0.2953 | lr 5.16e-04 | (3776.32 ms | 138836 tok/s) step 11758/76294 | train loss 3.424356 | norm 0.4429 | lr 5.16e-04 | (3798.16 ms | 138037 tok/s) step 11759/76294 | train loss 3.414368 | norm 0.2629 | lr 5.16e-04 | (3771.78 ms | 139003 tok/s) step 11760/76294 | train loss 3.449877 | norm 0.2968 | lr 5.16e-04 | (3778.76 ms | 138746 tok/s) step 11761/76294 | train loss 3.424786 | norm 0.2912 | lr 5.16e-04 | (3801.88 ms | 137902 tok/s) step 11762/76294 | train loss 3.383926 | norm 0.2268 | lr 5.16e-04 | (3782.69 ms | 138602 tok/s) step 11763/76294 | train loss 3.406168 | norm 0.2688 | lr 5.16e-04 | (3790.44 ms | 138319 tok/s) step 11764/76294 | train loss 3.451173 | norm 0.2954 | lr 5.15e-04 | (3788.64 ms | 138384 tok/s) step 11765/76294 | train loss 3.411610 | norm 0.3367 | lr 5.15e-04 | (3893.92 ms | 134643 tok/s) step 11766/76294 | train loss 3.679100 | norm 0.2678 | lr 5.15e-04 | (3788.61 ms | 138385 tok/s) step 11767/76294 | train loss 3.422647 | norm 0.3764 | lr 5.15e-04 | (3812.28 ms | 137526 tok/s) step 11768/76294 | train loss 3.447011 | norm 0.3030 | lr 5.15e-04 | (3787.59 ms | 138423 tok/s) step 11769/76294 | train loss 3.456136 | norm 0.2670 
| lr 5.15e-04 | (3833.11 ms | 136779 tok/s) step 11770/76294 | train loss 3.359341 | norm 0.7476 | lr 5.15e-04 | (3786.43 ms | 138465 tok/s) step 11771/76294 | train loss 3.345178 | norm 0.3231 | lr 5.15e-04 | (3849.93 ms | 136181 tok/s) step 11772/76294 | train loss 3.452445 | norm 0.4262 | lr 5.15e-04 | (3789.20 ms | 138364 tok/s) step 11773/76294 | train loss 3.499822 | norm 0.3328 | lr 5.15e-04 | (3850.19 ms | 136172 tok/s) step 11774/76294 | train loss 3.419283 | norm 0.2051 | lr 5.15e-04 | (3792.69 ms | 138236 tok/s) step 11775/76294 | train loss 3.500499 | norm 0.3285 | lr 5.15e-04 | (3799.54 ms | 137987 tok/s) step 11776/76294 | train loss 3.453534 | norm 0.3334 | lr 5.14e-04 | (3813.91 ms | 137467 tok/s) step 11777/76294 | train loss 3.419780 | norm 0.2642 | lr 5.14e-04 | (3797.83 ms | 138049 tok/s) step 11778/76294 | train loss 3.366765 | norm 0.2228 | lr 5.14e-04 | (3801.56 ms | 137914 tok/s) step 11779/76294 | train loss 3.442048 | norm 0.2923 | lr 5.14e-04 | (3843.90 ms | 136395 tok/s) step 11780/76294 | train loss 3.381407 | norm 0.3217 | lr 5.14e-04 | (3796.43 ms | 138100 tok/s) step 11781/76294 | train loss 3.385200 | norm 0.2142 | lr 5.14e-04 | (3794.48 ms | 138171 tok/s) step 11782/76294 | train loss 3.477701 | norm 0.4344 | lr 5.14e-04 | (3812.43 ms | 137521 tok/s) step 11783/76294 | train loss 3.459583 | norm 0.2923 | lr 5.14e-04 | (3830.54 ms | 136871 tok/s) step 11784/76294 | train loss 3.414797 | norm 0.2730 | lr 5.14e-04 | (3810.74 ms | 137582 tok/s) step 11785/76294 | train loss 3.453531 | norm 0.2565 | lr 5.14e-04 | (3801.76 ms | 137907 tok/s) step 11786/76294 | train loss 3.416802 | norm 0.2544 | lr 5.14e-04 | (3896.47 ms | 134554 tok/s) step 11787/76294 | train loss 3.405179 | norm 0.2895 | lr 5.13e-04 | (3871.00 ms | 135440 tok/s) step 11788/76294 | train loss 3.375087 | norm 0.2803 | lr 5.13e-04 | (32148.04 ms | 16309 tok/s) step 11789/76294 | train loss 3.440580 | norm 0.2188 | lr 5.13e-04 | (3747.93 ms | 139887 tok/s) step 
11790/76294 | train loss 3.412624 | norm 0.2899 | lr 5.13e-04 | (3854.11 ms | 136034 tok/s) step 11791/76294 | train loss 3.618461 | norm 0.2600 | lr 5.13e-04 | (3756.88 ms | 139554 tok/s) step 11792/76294 | train loss 3.427674 | norm 0.2162 | lr 5.13e-04 | (3885.65 ms | 134929 tok/s) step 11793/76294 | train loss 3.390362 | norm 0.5089 | lr 5.13e-04 | (3759.86 ms | 139443 tok/s) step 11794/76294 | train loss 3.439441 | norm 0.2149 | lr 5.13e-04 | (3795.23 ms | 138144 tok/s) step 11795/76294 | train loss 3.402175 | norm 0.2812 | lr 5.13e-04 | (3768.38 ms | 139128 tok/s) step 11796/76294 | train loss 3.399771 | norm 0.3078 | lr 5.13e-04 | (3787.20 ms | 138437 tok/s) step 11797/76294 | train loss 3.386086 | norm 0.2211 | lr 5.13e-04 | (3885.31 ms | 134941 tok/s) step 11798/76294 | train loss 3.423398 | norm 0.2329 | lr 5.13e-04 | (3772.37 ms | 138981 tok/s) step 11799/76294 | train loss 3.372222 | norm 0.2171 | lr 5.12e-04 | (3798.35 ms | 138030 tok/s) step 11800/76294 | train loss 3.526112 | norm 0.2047 | lr 5.12e-04 | (3780.09 ms | 138697 tok/s) step 11801/76294 | train loss 3.431035 | norm 0.2098 | lr 5.12e-04 | (3788.14 ms | 138403 tok/s) step 11802/76294 | train loss 3.390076 | norm 0.1717 | lr 5.12e-04 | (3807.49 ms | 137699 tok/s) step 11803/76294 | train loss 3.432136 | norm 0.2716 | lr 5.12e-04 | (3806.20 ms | 137746 tok/s) step 11804/76294 | train loss 3.393835 | norm 0.2177 | lr 5.12e-04 | (3796.12 ms | 138112 tok/s) step 11805/76294 | train loss 3.377327 | norm 0.2761 | lr 5.12e-04 | (3834.20 ms | 136740 tok/s) step 11806/76294 | train loss 3.407441 | norm 0.2739 | lr 5.12e-04 | (3793.19 ms | 138218 tok/s) step 11807/76294 | train loss 3.427129 | norm 0.2284 | lr 5.12e-04 | (3857.10 ms | 135928 tok/s) step 11808/76294 | train loss 3.436868 | norm 0.2569 | lr 5.12e-04 | (3802.89 ms | 137866 tok/s) step 11809/76294 | train loss 3.517565 | norm 0.4541 | lr 5.12e-04 | (3812.82 ms | 137506 tok/s) step 11810/76294 | train loss 3.460535 | norm 0.2729 | lr 
5.11e-04 | (3805.19 ms | 137782 tok/s) step 11811/76294 | train loss 3.446729 | norm 0.2972 | lr 5.11e-04 | (3809.31 ms | 137633 tok/s) step 11812/76294 | train loss 3.407084 | norm 0.3004 | lr 5.11e-04 | (3894.22 ms | 134632 tok/s) step 11813/76294 | train loss 3.449287 | norm 0.2204 | lr 5.11e-04 | (3837.56 ms | 136620 tok/s) step 11814/76294 | train loss 3.331007 | norm 0.2038 | lr 5.11e-04 | (3805.67 ms | 137765 tok/s) step 11815/76294 | train loss 3.453387 | norm 0.2413 | lr 5.11e-04 | (3821.42 ms | 137197 tok/s) step 11816/76294 | train loss 3.431252 | norm 0.2130 | lr 5.11e-04 | (3805.42 ms | 137774 tok/s) step 11817/76294 | train loss 3.440910 | norm 0.2547 | lr 5.11e-04 | (3798.19 ms | 138036 tok/s) step 11818/76294 | train loss 3.374128 | norm 0.2147 | lr 5.11e-04 | (3828.79 ms | 136933 tok/s) step 11819/76294 | train loss 3.367981 | norm 0.2328 | lr 5.11e-04 | (3800.40 ms | 137956 tok/s) step 11820/76294 | train loss 3.442641 | norm 0.3732 | lr 5.11e-04 | (3806.76 ms | 137726 tok/s) step 11821/76294 | train loss 3.412438 | norm 0.2134 | lr 5.11e-04 | (3804.54 ms | 137806 tok/s) step 11822/76294 | train loss 3.378686 | norm 0.2624 | lr 5.10e-04 | (3827.53 ms | 136978 tok/s) step 11823/76294 | train loss 3.427000 | norm 0.3303 | lr 5.10e-04 | (3803.23 ms | 137854 tok/s) step 11824/76294 | train loss 3.386051 | norm 0.2337 | lr 5.10e-04 | (4801.79 ms | 109186 tok/s) step 11825/76294 | train loss 3.381642 | norm 0.2584 | lr 5.10e-04 | (3841.23 ms | 136490 tok/s) step 11826/76294 | train loss 3.477987 | norm 0.2614 | lr 5.10e-04 | (4285.69 ms | 122335 tok/s) step 11827/76294 | train loss 3.402893 | norm 0.2031 | lr 5.10e-04 | (3803.16 ms | 137856 tok/s) step 11828/76294 | train loss 3.369756 | norm 0.2357 | lr 5.10e-04 | (3813.76 ms | 137473 tok/s) step 11829/76294 | train loss 3.633727 | norm 0.2212 | lr 5.10e-04 | (3835.79 ms | 136683 tok/s) step 11830/76294 | train loss 3.302217 | norm 0.2222 | lr 5.10e-04 | (3809.42 ms | 137629 tok/s) step 11831/76294 | 
train loss 3.383300 | norm 0.1943 | lr 5.10e-04 | (3815.42 ms | 137413 tok/s) step 11832/76294 | train loss 3.340205 | norm 0.2006 | lr 5.10e-04 | (3882.86 ms | 135026 tok/s) step 11833/76294 | train loss 3.384329 | norm 0.3742 | lr 5.09e-04 | (3911.21 ms | 134048 tok/s) step 11834/76294 | train loss 3.416648 | norm 0.2193 | lr 5.09e-04 | (3883.90 ms | 134990 tok/s) step 11835/76294 | train loss 3.386875 | norm 0.2038 | lr 5.09e-04 | (3805.45 ms | 137773 tok/s) step 11836/76294 | train loss 3.399246 | norm 0.2390 | lr 5.09e-04 | (3813.13 ms | 137496 tok/s) step 11837/76294 | train loss 3.416046 | norm 0.2125 | lr 5.09e-04 | (3805.07 ms | 137787 tok/s) step 11838/76294 | train loss 3.436999 | norm 0.4907 | lr 5.09e-04 | (3849.72 ms | 136188 tok/s) step 11839/76294 | train loss 3.426432 | norm 0.2305 | lr 5.09e-04 | (3810.63 ms | 137586 tok/s) step 11840/76294 | train loss 3.430340 | norm 0.2712 | lr 5.09e-04 | (3835.96 ms | 136677 tok/s) step 11841/76294 | train loss 3.384825 | norm 0.2366 | lr 5.09e-04 | (3928.72 ms | 133450 tok/s) step 11842/76294 | train loss 3.430515 | norm 0.2964 | lr 5.09e-04 | (3846.61 ms | 136299 tok/s) step 11843/76294 | train loss 3.389084 | norm 0.2249 | lr 5.09e-04 | (3805.82 ms | 137759 tok/s) step 11844/76294 | train loss 3.416011 | norm 0.2027 | lr 5.09e-04 | (3952.31 ms | 132654 tok/s) step 11845/76294 | train loss 3.397883 | norm 0.2922 | lr 5.08e-04 | (3810.46 ms | 137592 tok/s) step 11846/76294 | train loss 3.471470 | norm 0.2436 | lr 5.08e-04 | (3809.73 ms | 137618 tok/s) step 11847/76294 | train loss 3.342241 | norm 0.2747 | lr 5.08e-04 | (3829.99 ms | 136890 tok/s) step 11848/76294 | train loss 3.437350 | norm 0.2635 | lr 5.08e-04 | (3810.35 ms | 137596 tok/s) step 11849/76294 | train loss 3.357222 | norm 0.2600 | lr 5.08e-04 | (3811.38 ms | 137558 tok/s) step 11850/76294 | train loss 3.443774 | norm 0.2731 | lr 5.08e-04 | (3812.18 ms | 137530 tok/s) step 11851/76294 | train loss 3.417178 | norm 0.2880 | lr 5.08e-04 | (3814.24 
ms | 137456 tok/s) step 11852/76294 | train loss 3.466172 | norm 0.2678 | lr 5.08e-04 | (3803.76 ms | 137834 tok/s) step 11853/76294 | train loss 3.401864 | norm 0.2224 | lr 5.08e-04 | (3808.07 ms | 137678 tok/s) step 11854/76294 | train loss 3.437379 | norm 0.2670 | lr 5.08e-04 | (4112.31 ms | 127492 tok/s) step 11855/76294 | train loss 3.352369 | norm 0.2064 | lr 5.08e-04 | (3821.78 ms | 137184 tok/s) step 11856/76294 | train loss 3.414613 | norm 0.2952 | lr 5.07e-04 | (3806.87 ms | 137721 tok/s) step 11857/76294 | train loss 3.380937 | norm 0.2314 | lr 5.07e-04 | (3809.23 ms | 137636 tok/s) step 11858/76294 | train loss 3.454642 | norm 0.2904 | lr 5.07e-04 | (3828.45 ms | 136945 tok/s) step 11859/76294 | train loss 3.412270 | norm 0.3515 | lr 5.07e-04 | (3806.37 ms | 137740 tok/s) step 11860/76294 | train loss 3.450594 | norm 0.3343 | lr 5.07e-04 | (3812.53 ms | 137517 tok/s) step 11861/76294 | train loss 3.414070 | norm 0.4264 | lr 5.07e-04 | (3804.69 ms | 137800 tok/s) step 11862/76294 | train loss 3.364015 | norm 0.2374 | lr 5.07e-04 | (3805.37 ms | 137776 tok/s) step 11863/76294 | train loss 3.385428 | norm 0.3106 | lr 5.07e-04 | (3870.46 ms | 135459 tok/s) step 11864/76294 | train loss 3.364215 | norm 0.2331 | lr 5.07e-04 | (3805.63 ms | 137767 tok/s) step 11865/76294 | train loss 3.435217 | norm 0.2435 | lr 5.07e-04 | (3807.67 ms | 137693 tok/s) step 11866/76294 | train loss 3.379612 | norm 0.2578 | lr 5.07e-04 | (3828.12 ms | 136957 tok/s) step 11867/76294 | train loss 3.452029 | norm 0.1855 | lr 5.07e-04 | (3887.30 ms | 134872 tok/s) step 11868/76294 | train loss 3.333968 | norm 0.3873 | lr 5.06e-04 | (3803.28 ms | 137852 tok/s) step 11869/76294 | train loss 3.424804 | norm 0.2209 | lr 5.06e-04 | (3831.79 ms | 136826 tok/s) step 11870/76294 | train loss 3.341920 | norm 0.3603 | lr 5.06e-04 | (3801.39 ms | 137920 tok/s) step 11871/76294 | train loss 3.388486 | norm 0.2377 | lr 5.06e-04 | (3851.26 ms | 136134 tok/s) step 11872/76294 | train loss 3.365382 | 
norm 0.2360 | lr 5.06e-04 | (3824.01 ms | 137104 tok/s) step 11873/76294 | train loss 3.406440 | norm 0.1918 | lr 5.06e-04 | (3806.61 ms | 137731 tok/s) step 11874/76294 | train loss 3.392561 | norm 0.1919 | lr 5.06e-04 | (3831.34 ms | 136842 tok/s) step 11875/76294 | train loss 3.413123 | norm 0.1852 | lr 5.06e-04 | (3809.86 ms | 137614 tok/s) step 11876/76294 | train loss 3.424806 | norm 0.1994 | lr 5.06e-04 | (3808.22 ms | 137673 tok/s) step 11877/76294 | train loss 3.466491 | norm 0.2255 | lr 5.06e-04 | (3837.91 ms | 136608 tok/s) step 11878/76294 | train loss 3.443004 | norm 0.1918 | lr 5.06e-04 | (3827.89 ms | 136965 tok/s) step 11879/76294 | train loss 3.430506 | norm 0.2976 | lr 5.06e-04 | (3824.82 ms | 137075 tok/s) step 11880/76294 | train loss 3.344116 | norm 0.2197 | lr 5.05e-04 | (3811.53 ms | 137553 tok/s) step 11881/76294 | train loss 3.410567 | norm 0.2852 | lr 5.05e-04 | (3806.26 ms | 137744 tok/s) step 11882/76294 | train loss 3.365997 | norm 0.2407 | lr 5.05e-04 | (3892.52 ms | 134691 tok/s) step 11883/76294 | train loss 3.395351 | norm 0.4185 | lr 5.05e-04 | (3797.09 ms | 138076 tok/s) step 11884/76294 | train loss 3.368498 | norm 0.3121 | lr 5.05e-04 | (3827.13 ms | 136992 tok/s) step 11885/76294 | train loss 3.448656 | norm 0.3344 | lr 5.05e-04 | (3798.60 ms | 138021 tok/s) step 11886/76294 | train loss 3.423451 | norm 0.3360 | lr 5.05e-04 | (3854.95 ms | 136004 tok/s) step 11887/76294 | train loss 3.315966 | norm 0.2729 | lr 5.05e-04 | (3799.15 ms | 138001 tok/s) step 11888/76294 | train loss 3.389095 | norm 0.3029 | lr 5.05e-04 | (3859.56 ms | 135841 tok/s) step 11889/76294 | train loss 3.440792 | norm 0.4093 | lr 5.05e-04 | (3801.81 ms | 137905 tok/s) step 11890/76294 | train loss 3.404219 | norm 0.2545 | lr 5.05e-04 | (3836.40 ms | 136662 tok/s) step 11891/76294 | train loss 3.418782 | norm 0.3642 | lr 5.04e-04 | (3820.63 ms | 137226 tok/s) step 11892/76294 | train loss 3.401709 | norm 0.5158 | lr 5.04e-04 | (3801.87 ms | 137902 tok/s) 
step 11893/76294 | train loss 3.406918 | norm 0.2621 | lr 5.04e-04 | (3799.42 ms | 137991 tok/s) step 11894/76294 | train loss 3.344069 | norm 0.3977 | lr 5.04e-04 | (3827.80 ms | 136968 tok/s) step 11895/76294 | train loss 3.404603 | norm 0.3561 | lr 5.04e-04 | (3801.48 ms | 137917 tok/s) step 11896/76294 | train loss 3.440906 | norm 0.2710 | lr 5.04e-04 | (3816.70 ms | 137367 tok/s) step 11897/76294 | train loss 3.367348 | norm 0.4828 | lr 5.04e-04 | (3798.12 ms | 138039 tok/s) step 11898/76294 | train loss 3.362706 | norm 0.3790 | lr 5.04e-04 | (3810.26 ms | 137599 tok/s) step 11899/76294 | train loss 3.381981 | norm 0.2766 | lr 5.04e-04 | (3800.92 ms | 137937 tok/s) step 11900/76294 | train loss 3.449496 | norm 0.3848 | lr 5.04e-04 | (3826.88 ms | 137001 tok/s) step 11901/76294 | train loss 3.387830 | norm 0.2921 | lr 5.04e-04 | (3800.97 ms | 137935 tok/s) step 11902/76294 | train loss 3.411817 | norm 0.2326 | lr 5.04e-04 | (3855.67 ms | 135979 tok/s) step 11903/76294 | train loss 3.348514 | norm 0.3548 | lr 5.03e-04 | (3799.76 ms | 137979 tok/s) step 11904/76294 | train loss 3.455164 | norm 0.2219 | lr 5.03e-04 | (3801.98 ms | 137899 tok/s) step 11905/76294 | train loss 3.353970 | norm 0.2170 | lr 5.03e-04 | (3825.25 ms | 137060 tok/s) step 11906/76294 | train loss 3.498450 | norm 0.2021 | lr 5.03e-04 | (3798.73 ms | 138017 tok/s) step 11907/76294 | train loss 3.350932 | norm 0.2315 | lr 5.03e-04 | (3885.13 ms | 134947 tok/s) step 11908/76294 | train loss 3.408095 | norm 0.2338 | lr 5.03e-04 | (3794.20 ms | 138181 tok/s) step 11909/76294 | train loss 3.385431 | norm 0.2191 | lr 5.03e-04 | (3800.09 ms | 137967 tok/s) step 11910/76294 | train loss 3.377052 | norm 0.2197 | lr 5.03e-04 | (3817.03 ms | 137355 tok/s) step 11911/76294 | train loss 3.340686 | norm 0.2358 | lr 5.03e-04 | (3803.72 ms | 137835 tok/s) step 11912/76294 | train loss 3.384935 | norm 0.2166 | lr 5.03e-04 | (3892.81 ms | 134681 tok/s) step 11913/76294 | train loss 3.405210 | norm 0.2784 | lr 
5.03e-04 | (3794.71 ms | 138163 tok/s) step 11914/76294 | train loss 3.425164 | norm 0.2672 | lr 5.02e-04 | (3878.27 ms | 135186 tok/s) step 11915/76294 | train loss 3.354499 | norm 0.2108 | lr 5.02e-04 | (3855.67 ms | 135978 tok/s) step 11916/76294 | train loss 3.381800 | norm 0.3036 | lr 5.02e-04 | (3795.35 ms | 138140 tok/s) step 11917/76294 | train loss 3.312822 | norm 0.2086 | lr 5.02e-04 | (3838.34 ms | 136592 tok/s) step 11918/76294 | train loss 3.408664 | norm 0.2269 | lr 5.02e-04 | (3821.70 ms | 137187 tok/s) step 11919/76294 | train loss 3.358448 | norm 0.2172 | lr 5.02e-04 | (3806.76 ms | 137726 tok/s) step 11920/76294 | train loss 3.412858 | norm 0.4182 | lr 5.02e-04 | (3806.20 ms | 137746 tok/s) step 11921/76294 | train loss 3.378606 | norm 0.2064 | lr 5.02e-04 | (3800.60 ms | 137949 tok/s) step 11922/76294 | train loss 3.406907 | norm 0.3423 | lr 5.02e-04 | (3818.01 ms | 137320 tok/s) step 11923/76294 | train loss 3.431690 | norm 0.2556 | lr 5.02e-04 | (3802.59 ms | 137876 tok/s) step 11924/76294 | train loss 3.355428 | norm 0.4233 | lr 5.02e-04 | (3801.50 ms | 137916 tok/s) step 11925/76294 | train loss 3.391353 | norm 0.5942 | lr 5.02e-04 | (3802.18 ms | 137891 tok/s) step 11926/76294 | train loss 3.340345 | norm 0.3080 | lr 5.01e-04 | (3799.19 ms | 138000 tok/s) step 11927/76294 | train loss 3.348592 | norm 0.2016 | lr 5.01e-04 | (3801.08 ms | 137931 tok/s) step 11928/76294 | train loss 3.447585 | norm 0.7131 | lr 5.01e-04 | (3823.59 ms | 137119 tok/s) step 11929/76294 | train loss 3.387720 | norm 0.2564 | lr 5.01e-04 | (3805.49 ms | 137772 tok/s) step 11930/76294 | train loss 3.393846 | norm 0.2338 | lr 5.01e-04 | (3824.73 ms | 137078 tok/s) step 11931/76294 | train loss 3.405560 | norm 0.2637 | lr 5.01e-04 | (3798.42 ms | 138028 tok/s) step 11932/76294 | train loss 3.361565 | norm 0.2786 | lr 5.01e-04 | (3803.21 ms | 137854 tok/s) step 11933/76294 | train loss 3.390979 | norm 0.2943 | lr 5.01e-04 | (3852.56 ms | 136088 tok/s) step 11934/76294 | 
train loss 3.350191 | norm 0.2355 | lr 5.01e-04 | (3799.52 ms | 137988 tok/s) step 11935/76294 | train loss 3.347645 | norm 0.2079 | lr 5.01e-04 | (3885.18 ms | 134946 tok/s) step 11936/76294 | train loss 3.396512 | norm 0.3060 | lr 5.01e-04 | (3808.94 ms | 137647 tok/s) step 11937/76294 | train loss 3.364169 | norm 0.4463 | lr 5.01e-04 | (3822.54 ms | 137157 tok/s) step 11938/76294 | train loss 3.397062 | norm 0.3208 | lr 5.00e-04 | (3799.87 ms | 137975 tok/s) step 11939/76294 | train loss 3.492814 | norm 0.3478 | lr 5.00e-04 | (3813.05 ms | 137498 tok/s) step 11940/76294 | train loss 3.399449 | norm 0.3787 | lr 5.00e-04 | (3798.34 ms | 138031 tok/s) step 11941/76294 | train loss 3.448295 | norm 0.2568 | lr 5.00e-04 | (3821.32 ms | 137201 tok/s) step 11942/76294 | train loss 3.367040 | norm 0.3044 | lr 5.00e-04 | (3800.75 ms | 137943 tok/s) step 11943/76294 | train loss 3.415396 | norm 0.2758 | lr 5.00e-04 | (3808.15 ms | 137675 tok/s) step 11944/76294 | train loss 3.379383 | norm 0.3785 | lr 5.00e-04 | (3822.74 ms | 137150 tok/s) step 11945/76294 | train loss 3.409856 | norm 0.2977 | lr 5.00e-04 | (3800.91 ms | 137937 tok/s) step 11946/76294 | train loss 3.381911 | norm 0.3025 | lr 5.00e-04 | (3823.09 ms | 137137 tok/s) step 11947/76294 | train loss 3.392966 | norm 0.2365 | lr 5.00e-04 | (3798.18 ms | 138037 tok/s) step 11948/76294 | train loss 3.381825 | norm 0.2152 | lr 5.00e-04 | (3794.84 ms | 138158 tok/s) step 11949/76294 | train loss 3.455374 | norm 0.2238 | lr 4.99e-04 | (3845.59 ms | 136335 tok/s) step 11950/76294 | train loss 3.388648 | norm 0.2705 | lr 4.99e-04 | (3800.71 ms | 137945 tok/s) step 11951/76294 | train loss 3.419067 | norm 0.1948 | lr 4.99e-04 | (3968.35 ms | 132117 tok/s) step 11952/76294 | train loss 3.347141 | norm 0.3344 | lr 4.99e-04 | (3842.93 ms | 136429 tok/s) step 11953/76294 | train loss 3.444471 | norm 0.2622 | lr 4.99e-04 | (3799.49 ms | 137989 tok/s) step 11954/76294 | train loss 3.459313 | norm 0.2650 | lr 4.99e-04 | (3802.62 
ms | 137876 tok/s) step 11955/76294 | train loss 3.436560 | norm 0.2260 | lr 4.99e-04 | (3802.91 ms | 137865 tok/s) step 11956/76294 | train loss 3.285490 | norm 0.2427 | lr 4.99e-04 | (3810.36 ms | 137595 tok/s) step 11957/76294 | train loss 3.409453 | norm 0.2106 | lr 4.99e-04 | (3801.66 ms | 137910 tok/s) step 11958/76294 | train loss 3.390067 | norm 0.2148 | lr 4.99e-04 | (3834.68 ms | 136723 tok/s) step 11959/76294 | train loss 3.405936 | norm 0.2115 | lr 4.99e-04 | (3801.96 ms | 137899 tok/s) step 11960/76294 | train loss 3.291605 | norm 0.2345 | lr 4.99e-04 | (3824.19 ms | 137098 tok/s) step 11961/76294 | train loss 3.394529 | norm 0.6008 | lr 4.98e-04 | (3855.18 ms | 135996 tok/s) step 11962/76294 | train loss 3.383403 | norm 0.3533 | lr 4.98e-04 | (3873.09 ms | 135367 tok/s) step 11963/76294 | train loss 3.379748 | norm 0.3980 | lr 4.98e-04 | (3796.28 ms | 138106 tok/s) step 11964/76294 | train loss 3.364283 | norm 0.2482 | lr 4.98e-04 | (3821.45 ms | 137196 tok/s) step 11965/76294 | train loss 3.411036 | norm 0.2369 | lr 4.98e-04 | (3799.13 ms | 138002 tok/s) step 11966/76294 | train loss 3.352225 | norm 0.3165 | lr 4.98e-04 | (3799.82 ms | 137977 tok/s) step 11967/76294 | train loss 3.465137 | norm 0.3334 | lr 4.98e-04 | (3820.33 ms | 137236 tok/s) step 11968/76294 | train loss 3.357604 | norm 0.2110 | lr 4.98e-04 | (3829.83 ms | 136896 tok/s) step 11969/76294 | train loss 3.381611 | norm 0.3493 | lr 4.98e-04 | (3807.75 ms | 137690 tok/s) step 11970/76294 | train loss 3.344053 | norm 0.3061 | lr 4.98e-04 | (3826.25 ms | 137024 tok/s) step 11971/76294 | train loss 3.408907 | norm 0.2261 | lr 4.98e-04 | (3798.87 ms | 138012 tok/s) step 11972/76294 | train loss 3.323487 | norm 0.3581 | lr 4.98e-04 | (3877.31 ms | 135220 tok/s) step 11973/76294 | train loss 3.416778 | norm 0.2991 | lr 4.97e-04 | (3835.10 ms | 136708 tok/s) step 11974/76294 | train loss 3.379878 | norm 0.2888 | lr 4.97e-04 | (3791.22 ms | 138290 tok/s) step 11975/76294 | train loss 3.381528 | 
norm 0.2535 | lr 4.97e-04 | (3795.68 ms | 138128 tok/s) step 11976/76294 | train loss 3.359662 | norm 0.2424 | lr 4.97e-04 | (3817.61 ms | 137334 tok/s) step 11977/76294 | train loss 3.435795 | norm 0.2477 | lr 4.97e-04 | (3812.43 ms | 137521 tok/s) step 11978/76294 | train loss 3.355327 | norm 0.3574 | lr 4.97e-04 | (3817.59 ms | 137335 tok/s) step 11979/76294 | train loss 3.390246 | norm 0.3680 | lr 4.97e-04 | (3823.03 ms | 137140 tok/s) step 11980/76294 | train loss 3.396875 | norm 0.2775 | lr 4.97e-04 | (3819.20 ms | 137277 tok/s) step 11981/76294 | train loss 3.453068 | norm 0.3551 | lr 4.97e-04 | (5913.31 ms | 88662 tok/s) step 11982/76294 | train loss 3.355914 | norm 0.2684 | lr 4.97e-04 | (3832.61 ms | 136797 tok/s) step 11983/76294 | train loss 3.464454 | norm 0.2318 | lr 4.97e-04 | (3797.88 ms | 138047 tok/s) step 11984/76294 | train loss 3.359521 | norm 0.3302 | lr 4.96e-04 | (3800.52 ms | 137951 tok/s) step 11985/76294 | train loss 3.418059 | norm 0.4252 | lr 4.96e-04 | (3833.22 ms | 136775 tok/s) step 11986/76294 | train loss 3.439336 | norm 0.3319 | lr 4.96e-04 | (3807.15 ms | 137711 tok/s) step 11987/76294 | train loss 3.411827 | norm 0.4307 | lr 4.96e-04 | (3796.95 ms | 138081 tok/s) step 11988/76294 | train loss 3.350281 | norm 0.2777 | lr 4.96e-04 | (3820.57 ms | 137228 tok/s) step 11989/76294 | train loss 3.386603 | norm 0.2546 | lr 4.96e-04 | (3796.30 ms | 138105 tok/s) step 11990/76294 | train loss 3.333535 | norm 0.4108 | lr 4.96e-04 | (3822.89 ms | 137144 tok/s) step 11991/76294 | train loss 3.422614 | norm 0.2326 | lr 4.96e-04 | (3802.64 ms | 137875 tok/s) step 11992/76294 | train loss 3.337487 | norm 0.2245 | lr 4.96e-04 | (3803.61 ms | 137840 tok/s) step 11993/76294 | train loss 3.474698 | norm 0.2263 | lr 4.96e-04 | (3864.75 ms | 135659 tok/s) step 11994/76294 | train loss 3.391595 | norm 0.2321 | lr 4.96e-04 | (3805.34 ms | 137777 tok/s) step 11995/76294 | train loss 3.412179 | norm 0.2412 | lr 4.96e-04 | (3825.48 ms | 137051 tok/s) step 
11996/76294 | train loss 3.361612 | norm 0.2078 | lr 4.95e-04 | (3806.22 ms | 137745 tok/s) step 11997/76294 | train loss 3.402640 | norm 0.2508 | lr 4.95e-04 | (3807.86 ms | 137686 tok/s) step 11998/76294 | train loss 3.333973 | norm 0.2393 | lr 4.95e-04 | (3803.61 ms | 137840 tok/s) step 11999/76294 | train loss 3.376792 | norm 0.1859 | lr 4.95e-04 | (3804.43 ms | 137810 tok/s) step 12000/76294 | train loss 3.485168 | norm 0.4965 | lr 4.95e-04 | (3806.42 ms | 137738 tok/s) val loss: 3.394629 saving model checkpoint to ./results/gpt2-124M-gqa/step_12000.pth step 12001/76294 | train loss 3.414151 | norm 0.3306 | lr 4.95e-04 | (3885.83 ms | 134923 tok/s) step 12002/76294 | train loss 3.373466 | norm 0.3201 | lr 4.95e-04 | (3852.79 ms | 136080 tok/s) step 12003/76294 | train loss 3.383317 | norm 0.2380 | lr 4.95e-04 | (3874.03 ms | 135334 tok/s) step 12004/76294 | train loss 3.347483 | norm 0.3010 | lr 4.95e-04 | (3816.67 ms | 137368 tok/s) step 12005/76294 | train loss 3.368669 | norm 0.2493 | lr 4.95e-04 | (3824.54 ms | 137085 tok/s) step 12006/76294 | train loss 3.278422 | norm 0.2851 | lr 4.95e-04 | (3803.08 ms | 137859 tok/s) step 12007/76294 | train loss 3.394122 | norm 0.3082 | lr 4.95e-04 | (3810.24 ms | 137600 tok/s) step 12008/76294 | train loss 3.369307 | norm 0.4177 | lr 4.94e-04 | (3807.45 ms | 137701 tok/s) step 12009/76294 | train loss 3.423216 | norm 0.4692 | lr 4.94e-04 | (3826.03 ms | 137032 tok/s) step 12010/76294 | train loss 3.385213 | norm 0.3073 | lr 4.94e-04 | (3808.28 ms | 137670 tok/s) step 12011/76294 | train loss 3.389241 | norm 0.4066 | lr 4.94e-04 | (3803.48 ms | 137844 tok/s) step 12012/76294 | train loss 3.320080 | norm 0.2321 | lr 4.94e-04 | (3876.73 ms | 135240 tok/s) step 12013/76294 | train loss 3.391180 | norm 0.2663 | lr 4.94e-04 | (3804.44 ms | 137809 tok/s) step 12014/76294 | train loss 3.412279 | norm 0.2422 | lr 4.94e-04 | (3807.16 ms | 137711 tok/s) step 12015/76294 | train loss 3.402148 | norm 0.3460 | lr 4.94e-04 | 
(3828.27 ms | 136952 tok/s) step 12016/76294 | train loss 3.402031 | norm 0.2834 | lr 4.94e-04 | (3807.28 ms | 137707 tok/s) step 12017/76294 | train loss 3.423307 | norm 0.2462 | lr 4.94e-04 | (4091.91 ms | 128128 tok/s) step 12018/76294 | train loss 3.297849 | norm 0.4131 | lr 4.94e-04 | (3856.00 ms | 135967 tok/s) step 12019/76294 | train loss 3.416492 | norm 0.2090 | lr 4.93e-04 | (3814.94 ms | 137430 tok/s) step 12020/76294 | train loss 3.404825 | norm 0.2697 | lr 4.93e-04 | (3819.32 ms | 137273 tok/s) step 12021/76294 | train loss 3.361106 | norm 0.2815 | lr 4.93e-04 | (3829.83 ms | 136896 tok/s) step 12022/76294 | train loss 3.433891 | norm 0.2084 | lr 4.93e-04 | (3812.82 ms | 137507 tok/s) step 12023/76294 | train loss 3.367005 | norm 0.3099 | lr 4.93e-04 | (3819.21 ms | 137277 tok/s) step 12024/76294 | train loss 3.377088 | norm 0.2762 | lr 4.93e-04 | (3816.22 ms | 137384 tok/s) step 12025/76294 | train loss 3.405211 | norm 0.3256 | lr 4.93e-04 | (3838.84 ms | 136575 tok/s) step 12026/76294 | train loss 3.411839 | norm 0.2963 | lr 4.93e-04 | (3894.75 ms | 134614 tok/s) step 12027/76294 | train loss 3.540075 | norm 0.3031 | lr 4.93e-04 | (3809.78 ms | 137616 tok/s) step 12028/76294 | train loss 3.404817 | norm 0.3591 | lr 4.93e-04 | (3969.82 ms | 132068 tok/s) step 12029/76294 | train loss 3.324035 | norm 0.2485 | lr 4.93e-04 | (3808.89 ms | 137648 tok/s) step 12030/76294 | train loss 3.392155 | norm 0.3475 | lr 4.93e-04 | (3810.19 ms | 137602 tok/s) step 12031/76294 | train loss 3.392457 | norm 0.2328 | lr 4.92e-04 | (3834.15 ms | 136741 tok/s) step 12032/76294 | train loss 3.507402 | norm 0.2081 | lr 4.92e-04 | (3805.87 ms | 137758 tok/s) step 12033/76294 | train loss 3.394427 | norm 0.2247 | lr 4.92e-04 | (3859.92 ms | 135829 tok/s) step 12034/76294 | train loss 3.377608 | norm 0.2308 | lr 4.92e-04 | (3802.91 ms | 137865 tok/s) step 12035/76294 | train loss 3.375570 | norm 0.2201 | lr 4.92e-04 | (3864.70 ms | 135661 tok/s) step 12036/76294 | train loss 
3.448985 | norm 0.2857 | lr 4.92e-04 | (3803.49 ms | 137844 tok/s) step 12037/76294 | train loss 3.379423 | norm 0.2033 | lr 4.92e-04 | (3928.82 ms | 133447 tok/s) step 12038/76294 | train loss 3.392529 | norm 0.3286 | lr 4.92e-04 | (3801.89 ms | 137902 tok/s) step 12039/76294 | train loss 3.349997 | norm 0.2631 | lr 4.92e-04 | (3803.72 ms | 137836 tok/s) step 12040/76294 | train loss 3.389030 | norm 0.2746 | lr 4.92e-04 | (4337.01 ms | 120887 tok/s) step 12041/76294 | train loss 3.388460 | norm 0.2252 | lr 4.92e-04 | (3934.29 ms | 133261 tok/s) step 12042/76294 | train loss 3.337526 | norm 0.1850 | lr 4.92e-04 | (3789.20 ms | 138364 tok/s) step 12043/76294 | train loss 3.331007 | norm 0.2358 | lr 4.91e-04 | (3824.20 ms | 137097 tok/s) step 12044/76294 | train loss 3.479544 | norm 0.2368 | lr 4.91e-04 | (3798.89 ms | 138011 tok/s) step 12045/76294 | train loss 3.336520 | norm 0.1849 | lr 4.91e-04 | (5454.56 ms | 96119 tok/s) step 12046/76294 | train loss 3.389214 | norm 0.2753 | lr 4.91e-04 | (3881.12 ms | 135087 tok/s) step 12047/76294 | train loss 3.372016 | norm 0.2916 | lr 4.91e-04 | (3794.72 ms | 138163 tok/s) step 12048/76294 | train loss 3.342502 | norm 0.1980 | lr 4.91e-04 | (3803.44 ms | 137846 tok/s) step 12049/76294 | train loss 3.352972 | norm 0.3056 | lr 4.91e-04 | (3828.62 ms | 136939 tok/s) step 12050/76294 | train loss 3.400206 | norm 0.3120 | lr 4.91e-04 | (3808.45 ms | 137664 tok/s) step 12051/76294 | train loss 3.413621 | norm 0.3620 | lr 4.91e-04 | (3815.69 ms | 137403 tok/s) step 12052/76294 | train loss 3.438064 | norm 0.2526 | lr 4.91e-04 | (3833.51 ms | 136764 tok/s) step 12053/76294 | train loss 3.417506 | norm 0.2757 | lr 4.91e-04 | (3800.54 ms | 137951 tok/s) step 12054/76294 | train loss 3.365239 | norm 0.2629 | lr 4.90e-04 | (3801.46 ms | 137918 tok/s) step 12055/76294 | train loss 3.381361 | norm 0.2888 | lr 4.90e-04 | (3815.39 ms | 137414 tok/s) step 12056/76294 | train loss 3.390764 | norm 0.3017 | lr 4.90e-04 | (3797.31 ms | 138068 
tok/s) step 12057/76294 | train loss 3.445633 | norm 0.2581 | lr 4.90e-04 | (3801.71 ms | 137908 tok/s) step 12058/76294 | train loss 3.430680 | norm 0.5180 | lr 4.90e-04 | (3792.87 ms | 138230 tok/s) step 12059/76294 | train loss 3.377371 | norm 0.2942 | lr 4.90e-04 | (3806.24 ms | 137744 tok/s) step 12060/76294 | train loss 3.399327 | norm 0.5679 | lr 4.90e-04 | (3884.57 ms | 134967 tok/s) step 12061/76294 | train loss 3.475145 | norm 0.4091 | lr 4.90e-04 | (3868.63 ms | 135523 tok/s) step 12062/76294 | train loss 3.403890 | norm 0.2336 | lr 4.90e-04 | (3784.66 ms | 138530 tok/s) step 12063/76294 | train loss 3.360032 | norm 0.3043 | lr 4.90e-04 | (3796.37 ms | 138103 tok/s) step 12064/76294 | train loss 3.432007 | norm 0.2283 | lr 4.90e-04 | (3787.46 ms | 138427 tok/s) step 12065/76294 | train loss 3.379935 | norm 0.1998 | lr 4.90e-04 | (3795.30 ms | 138141 tok/s) step 12066/76294 | train loss 3.404614 | norm 0.2479 | lr 4.89e-04 | (3814.72 ms | 137438 tok/s) step 12067/76294 | train loss 3.400127 | norm 0.3319 | lr 4.89e-04 | (3794.88 ms | 138157 tok/s) step 12068/76294 | train loss 3.385876 | norm 0.3988 | lr 4.89e-04 | (3798.47 ms | 138026 tok/s) step 12069/76294 | train loss 3.387632 | norm 0.4625 | lr 4.89e-04 | (3818.50 ms | 137302 tok/s) step 12070/76294 | train loss 3.392512 | norm 0.2036 | lr 4.89e-04 | (3800.93 ms | 137937 tok/s) step 12071/76294 | train loss 3.444261 | norm 0.5789 | lr 4.89e-04 | (3825.54 ms | 137050 tok/s) step 12072/76294 | train loss 3.332284 | norm 0.8671 | lr 4.89e-04 | (3792.26 ms | 138252 tok/s) step 12073/76294 | train loss 3.453243 | norm 0.3324 | lr 4.89e-04 | (3893.44 ms | 134659 tok/s) step 12074/76294 | train loss 3.341534 | norm 0.5893 | lr 4.89e-04 | (3795.70 ms | 138127 tok/s) step 12075/76294 | train loss 3.419660 | norm 0.5478 | lr 4.89e-04 | (3805.11 ms | 137785 tok/s) step 12076/76294 | train loss 3.439286 | norm 0.3210 | lr 4.89e-04 | (3810.84 ms | 137578 tok/s) step 12077/76294 | train loss 3.380520 | norm 0.2264 
| lr 4.89e-04 | (3818.32 ms | 137309 tok/s) step 12078/76294 | train loss 3.500369 | norm 0.4459 | lr 4.88e-04 | (3892.72 ms | 134684 tok/s) step 12079/76294 | train loss 3.379560 | norm 0.4360 | lr 4.88e-04 | (3787.25 ms | 138435 tok/s) step 12080/76294 | train loss 3.401894 | norm 0.2883 | lr 4.88e-04 | (3871.28 ms | 135430 tok/s) step 12081/76294 | train loss 3.442812 | norm 0.2467 | lr 4.88e-04 | (3783.21 ms | 138583 tok/s) step 12082/76294 | train loss 3.419580 | norm 0.2971 | lr 4.88e-04 | (3792.88 ms | 138229 tok/s) step 12083/76294 | train loss 3.350647 | norm 0.2634 | lr 4.88e-04 | (3786.48 ms | 138463 tok/s) step 12084/76294 | train loss 3.355683 | norm 0.2816 | lr 4.88e-04 | (3812.35 ms | 137524 tok/s) step 12085/76294 | train loss 3.350089 | norm 0.2224 | lr 4.88e-04 | (3791.09 ms | 138295 tok/s) step 12086/76294 | train loss 3.499331 | norm 0.2147 | lr 4.88e-04 | (3818.78 ms | 137292 tok/s) step 12087/76294 | train loss 3.426631 | norm 0.2586 | lr 4.88e-04 | (3813.64 ms | 137477 tok/s) step 12088/76294 | train loss 3.363214 | norm 0.2596 | lr 4.88e-04 | (3801.36 ms | 137921 tok/s) step 12089/76294 | train loss 3.399911 | norm 0.4652 | lr 4.87e-04 | (3803.82 ms | 137832 tok/s) step 12090/76294 | train loss 3.317724 | norm 0.2388 | lr 4.87e-04 | (3792.92 ms | 138228 tok/s) step 12091/76294 | train loss 3.343578 | norm 0.2787 | lr 4.87e-04 | (3820.75 ms | 137221 tok/s) step 12092/76294 | train loss 3.377083 | norm 0.3397 | lr 4.87e-04 | (3791.89 ms | 138265 tok/s) step 12093/76294 | train loss 3.429591 | norm 0.2867 | lr 4.87e-04 | (3802.45 ms | 137882 tok/s) step 12094/76294 | train loss 3.350212 | norm 0.2172 | lr 4.87e-04 | (3794.78 ms | 138160 tok/s) step 12095/76294 | train loss 3.366451 | norm 0.3494 | lr 4.87e-04 | (3866.36 ms | 135602 tok/s) step 12096/76294 | train loss 3.367073 | norm 0.2216 | lr 4.87e-04 | (3807.63 ms | 137694 tok/s) step 12097/76294 | train loss 3.487171 | norm 0.2815 | lr 4.87e-04 | (3875.10 ms | 135297 tok/s) step 
12098/76294 | train loss 3.430782 | norm 0.3980 | lr 4.87e-04 | (3786.44 ms | 138465 tok/s) step 12099/76294 | train loss 3.348680 | norm 0.2314 | lr 4.87e-04 | (3792.98 ms | 138226 tok/s) step 12100/76294 | train loss 3.399658 | norm 0.3870 | lr 4.87e-04 | (3814.92 ms | 137431 tok/s) step 12101/76294 | train loss 3.332834 | norm 0.3455 | lr 4.86e-04 | (3791.85 ms | 138267 tok/s) step 12102/76294 | train loss 3.399920 | norm 0.2466 | lr 4.86e-04 | (3809.73 ms | 137618 tok/s) step 12103/76294 | train loss 3.371016 | norm 0.4409 | lr 4.86e-04 | (3797.26 ms | 138070 tok/s) step 12104/76294 | train loss 3.413496 | norm 0.3638 | lr 4.86e-04 | (3816.63 ms | 137369 tok/s) step 12105/76294 | train loss 3.432747 | norm 0.2592 | lr 4.86e-04 | (3802.69 ms | 137873 tok/s) step 12106/76294 | train loss 3.437963 | norm 0.3075 | lr 4.86e-04 | (3815.42 ms | 137413 tok/s) step 12107/76294 | train loss 3.420783 | norm 0.3354 | lr 4.86e-04 | (3796.81 ms | 138087 tok/s) step 12108/76294 | train loss 3.406136 | norm 0.2082 | lr 4.86e-04 | (3816.17 ms | 137386 tok/s) step 12109/76294 | train loss 3.364789 | norm 0.2356 | lr 4.86e-04 | (3801.72 ms | 137908 tok/s) step 12110/76294 | train loss 2.760597 | norm 0.9030 | lr 4.86e-04 | (3816.11 ms | 137388 tok/s) step 12111/76294 | train loss 3.344331 | norm 0.7887 | lr 4.86e-04 | (3803.17 ms | 137855 tok/s) step 12112/76294 | train loss 3.366910 | norm 0.3065 | lr 4.86e-04 | (3801.85 ms | 137903 tok/s) step 12113/76294 | train loss 3.394906 | norm 0.2876 | lr 4.85e-04 | (3799.44 ms | 137991 tok/s) step 12114/76294 | train loss 3.365877 | norm 0.2765 | lr 4.85e-04 | (3804.08 ms | 137822 tok/s) step 12115/76294 | train loss 3.368851 | norm 0.2253 | lr 4.85e-04 | (3797.90 ms | 138047 tok/s) step 12116/76294 | train loss 3.389120 | norm 0.2556 | lr 4.85e-04 | (3801.70 ms | 137909 tok/s) step 12117/76294 | train loss 3.396867 | norm 0.3017 | lr 4.85e-04 | (3919.61 ms | 133760 tok/s) step 12118/76294 | train loss 3.325443 | norm 0.2373 | lr 
4.85e-04 | (3876.02 ms | 135265 tok/s) step 12119/76294 | train loss 3.386924 | norm 0.2399 | lr 4.85e-04 | (3799.17 ms | 138001 tok/s) step 12120/76294 | train loss 3.347558 | norm 0.2080 | lr 4.85e-04 | (3819.01 ms | 137284 tok/s) step 12121/76294 | train loss 3.393484 | norm 0.2108 | lr 4.85e-04 | (3805.58 ms | 137768 tok/s) step 12122/76294 | train loss 3.377931 | norm 0.2457 | lr 4.85e-04 | (3832.19 ms | 136812 tok/s) step 12123/76294 | train loss 3.367086 | norm 0.2343 | lr 4.85e-04 | (3804.23 ms | 137817 tok/s) step 12124/76294 | train loss 3.399740 | norm 0.2439 | lr 4.85e-04 | (3894.80 ms | 134612 tok/s) step 12125/76294 | train loss 3.414435 | norm 0.2541 | lr 4.84e-04 | (3798.63 ms | 138020 tok/s) step 12126/76294 | train loss 3.340327 | norm 0.3067 | lr 4.84e-04 | (3822.29 ms | 137166 tok/s) step 12127/76294 | train loss 3.425770 | norm 0.2411 | lr 4.84e-04 | (3802.99 ms | 137862 tok/s) step 12128/76294 | train loss 3.393013 | norm 0.2719 | lr 4.84e-04 | (3825.65 ms | 137045 tok/s) step 12129/76294 | train loss 3.389833 | norm 0.3550 | lr 4.84e-04 | (3795.86 ms | 138121 tok/s) step 12130/76294 | train loss 3.397220 | norm 0.2378 | lr 4.84e-04 | (3828.49 ms | 136944 tok/s) step 12131/76294 | train loss 3.439720 | norm 0.3343 | lr 4.84e-04 | (3851.97 ms | 136109 tok/s) step 12132/76294 | train loss 3.335255 | norm 0.2952 | lr 4.84e-04 | (3801.02 ms | 137933 tok/s) step 12133/76294 | train loss 3.484824 | norm 0.2374 | lr 4.84e-04 | (3827.25 ms | 136988 tok/s) step 12134/76294 | train loss 3.343367 | norm 0.2474 | lr 4.84e-04 | (3833.88 ms | 136751 tok/s) step 12135/76294 | train loss 3.367315 | norm 0.2175 | lr 4.84e-04 | (3803.32 ms | 137850 tok/s) step 12136/76294 | train loss 3.496696 | norm 0.2448 | lr 4.83e-04 | (3825.39 ms | 137055 tok/s) step 12137/76294 | train loss 3.374918 | norm 0.3855 | lr 4.83e-04 | (3809.40 ms | 137630 tok/s) step 12138/76294 | train loss 3.411335 | norm 0.4363 | lr 4.83e-04 | (3833.49 ms | 136765 tok/s) step 12139/76294 | 
train loss 3.429046 | norm 0.3360 | lr 4.83e-04 | (3907.80 ms | 134165 tok/s) step 12140/76294 | train loss 3.331120 | norm 0.3163 | lr 4.83e-04 | (3781.66 ms | 138640 tok/s) step 12141/76294 | train loss 3.407510 | norm 0.2186 | lr 4.83e-04 | (3839.32 ms | 136557 tok/s) step 12142/76294 | train loss 3.429302 | norm 0.2068 | lr 4.83e-04 | (3855.02 ms | 136001 tok/s) step 12143/76294 | train loss 3.356335 | norm 0.2464 | lr 4.83e-04 | (3806.24 ms | 137744 tok/s) step 12144/76294 | train loss 3.346599 | norm 0.2363 | lr 4.83e-04 | (3788.89 ms | 138375 tok/s) step 12145/76294 | train loss 3.381925 | norm 0.2558 | lr 4.83e-04 | (3792.88 ms | 138230 tok/s) step 12146/76294 | train loss 3.352830 | norm 0.2189 | lr 4.83e-04 | (3819.69 ms | 137259 tok/s) step 12147/76294 | train loss 3.442641 | norm 0.4367 | lr 4.83e-04 | (3789.93 ms | 138337 tok/s) step 12148/76294 | train loss 3.471237 | norm 0.2751 | lr 4.82e-04 | (3797.31 ms | 138068 tok/s) step 12149/76294 | train loss 3.341276 | norm 0.2213 | lr 4.82e-04 | (3791.28 ms | 138288 tok/s) step 12150/76294 | train loss 3.365910 | norm 0.2777 | lr 4.82e-04 | (3815.88 ms | 137396 tok/s) step 12151/76294 | train loss 3.381580 | norm 0.2398 | lr 4.82e-04 | (3791.97 ms | 138263 tok/s) step 12152/76294 | train loss 3.413043 | norm 0.2426 | lr 4.82e-04 | (3836.06 ms | 136674 tok/s) step 12153/76294 | train loss 3.336156 | norm 0.2945 | lr 4.82e-04 | (3795.01 ms | 138152 tok/s) step 12154/76294 | train loss 3.346690 | norm 0.2462 | lr 4.82e-04 | (3800.97 ms | 137935 tok/s) step 12155/76294 | train loss 3.379372 | norm 0.4244 | lr 4.82e-04 | (3796.52 ms | 138097 tok/s) step 12156/76294 | train loss 3.316509 | norm 0.3590 | lr 4.82e-04 | (3797.11 ms | 138076 tok/s) step 12157/76294 | train loss 3.430500 | norm 0.2438 | lr 4.82e-04 | (3800.62 ms | 137948 tok/s) step 12158/76294 | train loss 3.377055 | norm 0.5850 | lr 4.82e-04 | (3800.83 ms | 137940 tok/s) step 12159/76294 | train loss 3.372147 | norm 0.1938 | lr 4.82e-04 | (3795.45 
ms | 138136 tok/s) step 12160/76294 | train loss 3.394969 | norm 0.3302 | lr 4.81e-04 | (3812.63 ms | 137514 tok/s) step 12161/76294 | train loss 3.467284 | norm 0.3055 | lr 4.81e-04 | (3858.09 ms | 135893 tok/s) step 12162/76294 | train loss 3.380609 | norm 0.2447 | lr 4.81e-04 | (3796.55 ms | 138096 tok/s) step 12163/76294 | train loss 3.421690 | norm 0.3052 | lr 4.81e-04 | (3803.91 ms | 137829 tok/s) step 12164/76294 | train loss 3.372866 | norm 0.2168 | lr 4.81e-04 | (3795.85 ms | 138122 tok/s) step 12165/76294 | train loss 3.335456 | norm 0.2802 | lr 4.81e-04 | (3804.11 ms | 137822 tok/s) step 12166/76294 | train loss 3.455436 | norm 0.1919 | lr 4.81e-04 | (3816.54 ms | 137372 tok/s) step 12167/76294 | train loss 3.336593 | norm 0.2057 | lr 4.81e-04 | (3805.69 ms | 137764 tok/s) step 12168/76294 | train loss 3.395374 | norm 0.1900 | lr 4.81e-04 | (3894.04 ms | 134639 tok/s) step 12169/76294 | train loss 3.365171 | norm 0.1880 | lr 4.81e-04 | (3787.30 ms | 138433 tok/s) step 12170/76294 | train loss 3.367802 | norm 0.2259 | lr 4.81e-04 | (3836.97 ms | 136641 tok/s) step 12171/76294 | train loss 3.370024 | norm 0.2499 | lr 4.81e-04 | (3789.76 ms | 138343 tok/s) step 12172/76294 | train loss 3.429670 | norm 0.4011 | lr 4.80e-04 | (3792.93 ms | 138228 tok/s) step 12173/76294 | train loss 3.331203 | norm 0.2088 | lr 4.80e-04 | (3810.46 ms | 137592 tok/s) step 12174/76294 | train loss 3.353565 | norm 0.5001 | lr 4.80e-04 | (3795.20 ms | 138145 tok/s) step 12175/76294 | train loss 3.374996 | norm 0.3483 | lr 4.80e-04 | (3789.72 ms | 138345 tok/s) step 12176/76294 | train loss 3.411460 | norm 0.4423 | lr 4.80e-04 | (3817.84 ms | 137326 tok/s) step 12177/76294 | train loss 3.367163 | norm 0.1770 | lr 4.80e-04 | (3793.32 ms | 138214 tok/s) step 12178/76294 | train loss 3.373359 | norm 0.2279 | lr 4.80e-04 | (3801.31 ms | 137923 tok/s) step 12179/76294 | train loss 3.353227 | norm 0.3037 | lr 4.80e-04 | (3823.27 ms | 137131 tok/s) step 12180/76294 | train loss 3.427994 | 
norm 0.2074 | lr 4.80e-04 | (3825.32 ms | 137057 tok/s) step 12181/76294 | train loss 3.363165 | norm 0.2666 | lr 4.80e-04 | (3799.46 ms | 137990 tok/s) step 12182/76294 | train loss 3.388348 | norm 0.1892 | lr 4.80e-04 | (3801.16 ms | 137928 tok/s) step 12183/76294 | train loss 3.431657 | norm 0.2847 | lr 4.79e-04 | (3889.09 ms | 134810 tok/s) step 12184/76294 | train loss 3.397297 | norm 0.2270 | lr 4.79e-04 | (3798.50 ms | 138025 tok/s) step 12185/76294 | train loss 3.364663 | norm 0.2251 | lr 4.79e-04 | (3900.65 ms | 134410 tok/s) step 12186/76294 | train loss 3.374216 | norm 0.2984 | lr 4.79e-04 | (3794.68 ms | 138164 tok/s) step 12187/76294 | train loss 3.385863 | norm 0.5179 | lr 4.79e-04 | (3892.55 ms | 134690 tok/s) step 12188/76294 | train loss 3.375142 | norm 0.8007 | lr 4.79e-04 | (3793.70 ms | 138200 tok/s) step 12189/76294 | train loss 3.383862 | norm 1.1086 | lr 4.79e-04 | (3797.23 ms | 138071 tok/s) step 12190/76294 | train loss 3.442370 | norm 0.8381 | lr 4.79e-04 | (3818.09 ms | 137317 tok/s) step 12191/76294 | train loss 3.420267 | norm 0.3085 | lr 4.79e-04 | (3986.74 ms | 131508 tok/s) step 12192/76294 | train loss 3.353846 | norm 0.2807 | lr 4.79e-04 | (3808.12 ms | 137676 tok/s) step 12193/76294 | train loss 3.444488 | norm 0.8221 | lr 4.79e-04 | (3800.52 ms | 137952 tok/s) step 12194/76294 | train loss 3.462844 | norm 0.7338 | lr 4.79e-04 | (3799.92 ms | 137974 tok/s) step 12195/76294 | train loss 3.399176 | norm 0.2396 | lr 4.78e-04 | (3803.91 ms | 137829 tok/s) step 12196/76294 | train loss 3.367219 | norm 0.3199 | lr 4.78e-04 | (3841.06 ms | 136496 tok/s) step 12197/76294 | train loss 3.418331 | norm 0.4330 | lr 4.78e-04 | (3798.92 ms | 138010 tok/s) step 12198/76294 | train loss 3.328482 | norm 0.3008 | lr 4.78e-04 | (3855.06 ms | 136000 tok/s) step 12199/76294 | train loss 3.444648 | norm 0.2704 | lr 4.78e-04 | (3801.59 ms | 137913 tok/s) step 12200/76294 | train loss 3.411448 | norm 0.3047 | lr 4.78e-04 | (3828.10 ms | 136958 tok/s) 
step 12201/76294 | train loss 3.432690 | norm 0.3038 | lr 4.78e-04 | (3801.48 ms | 137917 tok/s) step 12202/76294 | train loss 3.439444 | norm 0.2192 | lr 4.78e-04 | (3828.02 ms | 136961 tok/s) step 12203/76294 | train loss 3.345900 | norm 0.2560 | lr 4.78e-04 | (3800.72 ms | 137944 tok/s) step 12204/76294 | train loss 3.452212 | norm 0.2357 | lr 4.78e-04 | (3809.22 ms | 137637 tok/s) step 12205/76294 | train loss 3.435783 | norm 0.2738 | lr 4.78e-04 | (3894.21 ms | 134633 tok/s) step 12206/76294 | train loss 3.412234 | norm 0.3591 | lr 4.78e-04 | (3804.93 ms | 137792 tok/s) step 12207/76294 | train loss 3.582489 | norm 0.3917 | lr 4.77e-04 | (4187.38 ms | 125207 tok/s) step 12208/76294 | train loss 3.346919 | norm 0.4306 | lr 4.77e-04 | (3829.72 ms | 136900 tok/s) step 12209/76294 | train loss 3.408401 | norm 0.1991 | lr 4.77e-04 | (3804.35 ms | 137813 tok/s) step 12210/76294 | train loss 3.484021 | norm 0.3654 | lr 4.77e-04 | (3836.24 ms | 136667 tok/s) step 12211/76294 | train loss 3.391568 | norm 0.2669 | lr 4.77e-04 | (3804.11 ms | 137821 tok/s) step 12212/76294 | train loss 3.437192 | norm 0.4308 | lr 4.77e-04 | (3832.83 ms | 136789 tok/s) step 12213/76294 | train loss 3.407388 | norm 0.3267 | lr 4.77e-04 | (3807.78 ms | 137689 tok/s) step 12214/76294 | train loss 3.423454 | norm 0.2877 | lr 4.77e-04 | (3839.07 ms | 136566 tok/s) step 12215/76294 | train loss 3.480085 | norm 0.2915 | lr 4.77e-04 | (3858.39 ms | 135883 tok/s) step 12216/76294 | train loss 3.359193 | norm 0.2619 | lr 4.77e-04 | (3804.49 ms | 137808 tok/s) step 12217/76294 | train loss 3.443967 | norm 0.2477 | lr 4.77e-04 | (3852.48 ms | 136091 tok/s) step 12218/76294 | train loss 3.439724 | norm 0.3855 | lr 4.77e-04 | (3804.83 ms | 137795 tok/s) step 12219/76294 | train loss 3.469552 | norm 0.2433 | lr 4.76e-04 | (3807.46 ms | 137700 tok/s) step 12220/76294 | train loss 3.412336 | norm 0.3425 | lr 4.76e-04 | (3823.49 ms | 137123 tok/s) step 12221/76294 | train loss 3.370786 | norm 0.2414 | lr 
4.76e-04 | (3810.71 ms | 137583 tok/s) step 12222/76294 | train loss 3.412276 | norm 0.2700 | lr 4.76e-04 | (3824.82 ms | 137075 tok/s) step 12223/76294 | train loss 3.515222 | norm 0.2525 | lr 4.76e-04 | (3807.13 ms | 137712 tok/s) step 12224/76294 | train loss 3.382127 | norm 0.2980 | lr 4.76e-04 | (3812.41 ms | 137521 tok/s) step 12225/76294 | train loss 3.446397 | norm 0.2878 | lr 4.76e-04 | (3810.38 ms | 137595 tok/s) step 12226/76294 | train loss 3.501754 | norm 0.2380 | lr 4.76e-04 | (3841.79 ms | 136470 tok/s) step 12227/76294 | train loss 3.406405 | norm 0.2099 | lr 4.76e-04 | (3807.88 ms | 137685 tok/s) step 12228/76294 | train loss 3.418483 | norm 0.2824 | lr 4.76e-04 | (3906.95 ms | 134194 tok/s) step 12229/76294 | train loss 3.387650 | norm 0.2719 | lr 4.76e-04 | (3811.06 ms | 137570 tok/s) step 12230/76294 | train loss 3.346809 | norm 0.4443 | lr 4.76e-04 | (3829.13 ms | 136921 tok/s) step 12231/76294 | train loss 3.483760 | norm 0.2363 | lr 4.75e-04 | (3806.07 ms | 137750 tok/s) step 12232/76294 | train loss 3.431255 | norm 0.3256 | lr 4.75e-04 | (3824.10 ms | 137101 tok/s) step 12233/76294 | train loss 3.396294 | norm 0.3085 | lr 4.75e-04 | (3810.04 ms | 137607 tok/s) step 12234/76294 | train loss 3.414893 | norm 0.2438 | lr 4.75e-04 | (3843.15 ms | 136421 tok/s) step 12235/76294 | train loss 3.441323 | norm 0.4104 | lr 4.75e-04 | (3808.84 ms | 137650 tok/s) step 12236/76294 | train loss 3.391210 | norm 0.2610 | lr 4.75e-04 | (3819.54 ms | 137265 tok/s) step 12237/76294 | train loss 3.421604 | norm 0.2920 | lr 4.75e-04 | (3835.15 ms | 136706 tok/s) step 12238/76294 | train loss 3.415781 | norm 0.3503 | lr 4.75e-04 | (3831.59 ms | 136833 tok/s) step 12239/76294 | train loss 3.417780 | norm 0.2078 | lr 4.75e-04 | (3816.08 ms | 137389 tok/s) step 12240/76294 | train loss 3.455596 | norm 0.3782 | lr 4.75e-04 | (3816.26 ms | 137383 tok/s) step 12241/76294 | train loss 3.452784 | norm 0.4102 | lr 4.75e-04 | (3830.07 ms | 136887 tok/s) step 12242/76294 | 
train loss 3.410587 | norm 0.2789 | lr 4.74e-04 | (3809.14 ms | 137639 tok/s) step 12243/76294 | train loss 3.424448 | norm 0.3610 | lr 4.74e-04 | (3804.99 ms | 137789 tok/s) step 12244/76294 | train loss 3.406669 | norm 0.3307 | lr 4.74e-04 | (3853.55 ms | 136053 tok/s) step 12245/76294 | train loss 3.406800 | norm 0.3579 | lr 4.74e-04 | (3806.47 ms | 137736 tok/s) step 12246/76294 | train loss 3.420236 | norm 0.3512 | lr 4.74e-04 | (3860.47 ms | 135809 tok/s) step 12247/76294 | train loss 3.400602 | norm 0.2211 | lr 4.74e-04 | (3807.74 ms | 137690 tok/s) step 12248/76294 | train loss 3.409138 | norm 0.2566 | lr 4.74e-04 | (3810.98 ms | 137573 tok/s) step 12249/76294 | train loss 3.360148 | norm 0.2507 | lr 4.74e-04 | (3826.85 ms | 137003 tok/s) step 12250/76294 | train loss 3.397374 | norm 0.3166 | lr 4.74e-04 | (3810.00 ms | 137608 tok/s) val loss: 3.391849 saving model checkpoint to ./results/gpt2-124M-gqa/step_12250.pth step 12251/76294 | train loss 3.419006 | norm 0.3332 | lr 4.74e-04 | (3881.01 ms | 135091 tok/s) step 12252/76294 | train loss 3.397425 | norm 0.2258 | lr 4.74e-04 | (3883.25 ms | 135013 tok/s) step 12253/76294 | train loss 3.414748 | norm 0.2824 | lr 4.74e-04 | (3803.18 ms | 137855 tok/s) step 12254/76294 | train loss 3.392195 | norm 0.2475 | lr 4.73e-04 | (4437.95 ms | 118137 tok/s) step 12255/76294 | train loss 3.431140 | norm 0.2303 | lr 4.73e-04 | (6555.19 ms | 79981 tok/s) step 12256/76294 | train loss 3.447522 | norm 0.3609 | lr 4.73e-04 | (3796.11 ms | 138112 tok/s) step 12257/76294 | train loss 3.394459 | norm 0.2649 | lr 4.73e-04 | (3799.90 ms | 137974 tok/s) step 12258/76294 | train loss 3.499000 | norm 0.5464 | lr 4.73e-04 | (3816.35 ms | 137379 tok/s) step 12259/76294 | train loss 3.428401 | norm 0.5625 | lr 4.73e-04 | (6825.99 ms | 76808 tok/s) step 12260/76294 | train loss 3.409720 | norm 0.1880 | lr 4.73e-04 | (3863.39 ms | 135707 tok/s) step 12261/76294 | train loss 3.445487 | norm 0.5558 | lr 4.73e-04 | (3856.70 ms | 135942 
tok/s) step 12262/76294 | train loss 3.415422 | norm 0.2892 | lr 4.73e-04 | (3797.33 ms | 138068 tok/s) step 12263/76294 | train loss 3.443394 | norm 0.2091 | lr 4.73e-04 | (3795.60 ms | 138130 tok/s) step 12264/76294 | train loss 3.429556 | norm 0.2931 | lr 4.73e-04 | (3798.07 ms | 138041 tok/s) step 12265/76294 | train loss 3.377168 | norm 0.2425 | lr 4.73e-04 | (3805.01 ms | 137789 tok/s) step 12266/76294 | train loss 3.391637 | norm 0.4538 | lr 4.72e-04 | (3802.74 ms | 137871 tok/s) step 12267/76294 | train loss 3.388469 | norm 0.2502 | lr 4.72e-04 | (3800.80 ms | 137942 tok/s) step 12268/76294 | train loss 3.407890 | norm 0.2975 | lr 4.72e-04 | (3798.91 ms | 138010 tok/s) step 12269/76294 | train loss 3.409932 | norm 0.3621 | lr 4.72e-04 | (3827.59 ms | 136976 tok/s) step 12270/76294 | train loss 3.420455 | norm 0.4809 | lr 4.72e-04 | (3802.60 ms | 137876 tok/s) step 12271/76294 | train loss 3.364140 | norm 0.2452 | lr 4.72e-04 | (3822.98 ms | 137141 tok/s) step 12272/76294 | train loss 3.457658 | norm 0.2366 | lr 4.72e-04 | (3823.72 ms | 137115 tok/s) step 12273/76294 | train loss 3.367610 | norm 0.2613 | lr 4.72e-04 | (3881.02 ms | 135090 tok/s) step 12274/76294 | train loss 3.344109 | norm 0.3717 | lr 4.72e-04 | (3802.30 ms | 137887 tok/s) step 12275/76294 | train loss 3.445900 | norm 0.2812 | lr 4.72e-04 | (3853.51 ms | 136055 tok/s) step 12276/76294 | train loss 3.395009 | norm 0.2396 | lr 4.72e-04 | (3793.75 ms | 138198 tok/s) step 12277/76294 | train loss 3.414408 | norm 0.4591 | lr 4.72e-04 | (3844.09 ms | 136388 tok/s) step 12278/76294 | train loss 3.482018 | norm 0.2314 | lr 4.71e-04 | (3888.46 ms | 134832 tok/s) step 12279/76294 | train loss 3.356829 | norm 0.3108 | lr 4.71e-04 | (3791.87 ms | 138266 tok/s) step 12280/76294 | train loss 3.432934 | norm 0.3654 | lr 4.71e-04 | (3846.17 ms | 136314 tok/s) step 12281/76294 | train loss 3.389355 | norm 0.3247 | lr 4.71e-04 | (3793.79 ms | 138197 tok/s) step 12282/76294 | train loss 3.437185 | norm 0.3561 
| lr 4.71e-04 | (3803.73 ms | 137835 tok/s) step 12283/76294 | train loss 3.435997 | norm 0.8925 | lr 4.71e-04 | (3823.46 ms | 137124 tok/s) step 12284/76294 | train loss 3.514353 | norm 0.3452 | lr 4.71e-04 | (3801.39 ms | 137920 tok/s) step 12285/76294 | train loss 3.437758 | norm 0.5714 | lr 4.71e-04 | (3839.11 ms | 136565 tok/s) step 12286/76294 | train loss 3.398353 | norm 0.5609 | lr 4.71e-04 | (3801.46 ms | 137918 tok/s) step 12287/76294 | train loss 3.400861 | norm 0.2725 | lr 4.71e-04 | (3808.51 ms | 137662 tok/s) step 12288/76294 | train loss 3.359494 | norm 0.6971 | lr 4.71e-04 | (3826.90 ms | 137001 tok/s) step 12289/76294 | train loss 3.371102 | norm 0.6320 | lr 4.71e-04 | (3810.84 ms | 137578 tok/s) step 12290/76294 | train loss 3.384542 | norm 0.2336 | lr 4.70e-04 | (3811.04 ms | 137571 tok/s) step 12291/76294 | train loss 3.431634 | norm 0.4179 | lr 4.70e-04 | (3838.13 ms | 136600 tok/s) step 12292/76294 | train loss 3.403978 | norm 0.8072 | lr 4.70e-04 | (3830.63 ms | 136867 tok/s) step 12293/76294 | train loss 3.461218 | norm 0.2558 | lr 4.70e-04 | (3973.39 ms | 131950 tok/s) step 12294/76294 | train loss 3.357839 | norm 0.4440 | lr 4.70e-04 | (3846.86 ms | 136290 tok/s) step 12295/76294 | train loss 3.477134 | norm 0.2493 | lr 4.70e-04 | (3797.39 ms | 138065 tok/s) step 12296/76294 | train loss 3.437644 | norm 0.2831 | lr 4.70e-04 | (3827.55 ms | 136978 tok/s) step 12297/76294 | train loss 3.473629 | norm 0.2899 | lr 4.70e-04 | (3909.84 ms | 134094 tok/s) step 12298/76294 | train loss 3.469561 | norm 0.9766 | lr 4.70e-04 | (3796.64 ms | 138093 tok/s) step 12299/76294 | train loss 3.472419 | norm 0.3306 | lr 4.70e-04 | (3914.43 ms | 133937 tok/s) step 12300/76294 | train loss 3.381662 | norm 0.5807 | lr 4.70e-04 | (3791.95 ms | 138263 tok/s) step 12301/76294 | train loss 3.363479 | norm 0.6074 | lr 4.70e-04 | (3801.53 ms | 137915 tok/s) step 12302/76294 | train loss 3.395497 | norm 0.2503 | lr 4.69e-04 | (3817.68 ms | 137331 tok/s) step 
12303/76294 | train loss 3.448895 | norm 0.4554 | lr 4.69e-04 | (3823.52 ms | 137122 tok/s) step 12304/76294 | train loss 3.458270 | norm 0.2959 | lr 4.69e-04 | (3799.49 ms | 137989 tok/s) step 12305/76294 | train loss 3.438902 | norm 0.2847 | lr 4.69e-04 | (3802.78 ms | 137870 tok/s) step 12306/76294 | train loss 3.436992 | norm 0.1969 | lr 4.69e-04 | (3855.63 ms | 135980 tok/s) step 12307/76294 | train loss 3.415855 | norm 0.2721 | lr 4.69e-04 | (3801.07 ms | 137932 tok/s) step 12308/76294 | train loss 3.404422 | norm 0.2432 | lr 4.69e-04 | (3900.05 ms | 134431 tok/s) step 12309/76294 | train loss 3.462265 | norm 0.3014 | lr 4.69e-04 | (3798.56 ms | 138023 tok/s) step 12310/76294 | train loss 3.408484 | norm 0.2327 | lr 4.69e-04 | (3802.12 ms | 137894 tok/s) step 12311/76294 | train loss 3.410522 | norm 0.4024 | lr 4.69e-04 | (3817.65 ms | 137333 tok/s) step 12312/76294 | train loss 3.463918 | norm 0.3483 | lr 4.69e-04 | (3832.00 ms | 136818 tok/s) step 12313/76294 | train loss 3.393582 | norm 0.2522 | lr 4.69e-04 | (3879.89 ms | 135130 tok/s) step 12314/76294 | train loss 3.387439 | norm 0.2726 | lr 4.68e-04 | (3804.45 ms | 137809 tok/s) step 12315/76294 | train loss 3.401973 | norm 0.3191 | lr 4.68e-04 | (3916.16 ms | 133878 tok/s) step 12316/76294 | train loss 3.422333 | norm 0.2653 | lr 4.68e-04 | (3902.99 ms | 134330 tok/s) step 12317/76294 | train loss 3.412101 | norm 0.3237 | lr 4.68e-04 | (3838.42 ms | 136589 tok/s) step 12318/76294 | train loss 3.345119 | norm 0.4276 | lr 4.68e-04 | (3898.85 ms | 134472 tok/s) step 12319/76294 | train loss 3.433552 | norm 0.3252 | lr 4.68e-04 | (3903.20 ms | 134323 tok/s) step 12320/76294 | train loss 3.480955 | norm 0.4189 | lr 4.68e-04 | (3783.63 ms | 138568 tok/s) step 12321/76294 | train loss 3.423030 | norm 0.4318 | lr 4.68e-04 | (3903.18 ms | 134323 tok/s) step 12322/76294 | train loss 3.454120 | norm 0.2317 | lr 4.68e-04 | (3866.39 ms | 135601 tok/s) step 12323/76294 | train loss 3.454914 | norm 0.7977 | lr 
4.68e-04 | (3782.12 ms | 138623 tok/s) step 12324/76294 | train loss 3.392706 | norm 0.2212 | lr 4.68e-04 | (3795.66 ms | 138128 tok/s) step 12325/76294 | train loss 3.353235 | norm 0.2774 | lr 4.67e-04 | (3841.41 ms | 136483 tok/s) step 12326/76294 | train loss 3.387046 | norm 0.4283 | lr 4.67e-04 | (3791.79 ms | 138269 tok/s) step 12327/76294 | train loss 3.411504 | norm 0.2336 | lr 4.67e-04 | (3798.99 ms | 138007 tok/s) step 12328/76294 | train loss 3.419237 | norm 0.6831 | lr 4.67e-04 | (3822.54 ms | 137157 tok/s) step 12329/76294 | train loss 3.386943 | norm 0.3420 | lr 4.67e-04 | (3826.40 ms | 137019 tok/s) step 12330/76294 | train loss 3.432927 | norm 0.5393 | lr 4.67e-04 | (3843.88 ms | 136395 tok/s) step 12331/76294 | train loss 3.390935 | norm 0.3829 | lr 4.67e-04 | (3803.43 ms | 137846 tok/s) step 12332/76294 | train loss 3.361591 | norm 0.2814 | lr 4.67e-04 | (3962.39 ms | 132316 tok/s) step 12333/76294 | train loss 3.472710 | norm 0.5424 | lr 4.67e-04 | (3822.01 ms | 137176 tok/s) step 12334/76294 | train loss 3.366842 | norm 0.5619 | lr 4.67e-04 | (3857.54 ms | 135912 tok/s) step 12335/76294 | train loss 3.411414 | norm 0.3738 | lr 4.67e-04 | (3810.50 ms | 137590 tok/s) step 12336/76294 | train loss 3.437763 | norm 0.6221 | lr 4.67e-04 | (3815.10 ms | 137425 tok/s) step 12337/76294 | train loss 3.378031 | norm 0.3206 | lr 4.66e-04 | (3846.50 ms | 136303 tok/s) step 12338/76294 | train loss 3.394658 | norm 0.4038 | lr 4.66e-04 | (3814.15 ms | 137459 tok/s) step 12339/76294 | train loss 3.386595 | norm 0.5127 | lr 4.66e-04 | (3836.71 ms | 136650 tok/s) step 12340/76294 | train loss 3.332129 | norm 0.3484 | lr 4.66e-04 | (3885.42 ms | 134937 tok/s) step 12341/76294 | train loss 3.395874 | norm 0.2366 | lr 4.66e-04 | (3818.51 ms | 137302 tok/s) step 12342/76294 | train loss 3.334003 | norm 0.3592 | lr 4.66e-04 | (3818.92 ms | 137287 tok/s) step 12343/76294 | train loss 3.361781 | norm 0.8786 | lr 4.66e-04 | (5865.25 ms | 89389 tok/s) step 12344/76294 | 
train loss 3.419935 | norm 0.2980 | lr 4.66e-04 | (3810.67 ms | 137584 tok/s) step 12345/76294 | train loss 3.409232 | norm 0.6479 | lr 4.66e-04 | (3819.55 ms | 137264 tok/s) step 12346/76294 | train loss 3.413840 | norm 0.4333 | lr 4.66e-04 | (3815.94 ms | 137394 tok/s) step 12347/76294 | train loss 3.376928 | norm 0.2584 | lr 4.66e-04 | (3811.86 ms | 137541 tok/s) step 12348/76294 | train loss 3.399786 | norm 0.8675 | lr 4.66e-04 | (3848.34 ms | 136237 tok/s) step 12349/76294 | train loss 3.350073 | norm 0.3155 | lr 4.65e-04 | (3812.79 ms | 137508 tok/s) step 12350/76294 | train loss 3.447167 | norm 0.2322 | lr 4.65e-04 | (3862.43 ms | 135740 tok/s) step 12351/76294 | train loss 3.322747 | norm 0.3374 | lr 4.65e-04 | (3811.24 ms | 137564 tok/s) step 12352/76294 | train loss 3.491052 | norm 0.2738 | lr 4.65e-04 | (3885.37 ms | 134939 tok/s) step 12353/76294 | train loss 3.402094 | norm 0.2719 | lr 4.65e-04 | (3838.08 ms | 136601 tok/s) step 12354/76294 | train loss 3.454120 | norm 0.2925 | lr 4.65e-04 | (3915.18 ms | 133912 tok/s) step 12355/76294 | train loss 3.420838 | norm 0.3346 | lr 4.65e-04 | (3791.37 ms | 138285 tok/s) step 12356/76294 | train loss 3.451732 | norm 0.6766 | lr 4.65e-04 | (3799.93 ms | 137973 tok/s) step 12357/76294 | train loss 3.453545 | norm 0.2131 | lr 4.65e-04 | (3824.48 ms | 137087 tok/s) step 12358/76294 | train loss 3.417675 | norm 1.2452 | lr 4.65e-04 | (3800.40 ms | 137956 tok/s) step 12359/76294 | train loss 3.415082 | norm 0.2927 | lr 4.65e-04 | (3797.85 ms | 138049 tok/s) step 12360/76294 | train loss 3.450479 | norm 0.4976 | lr 4.65e-04 | (3833.02 ms | 136782 tok/s) step 12361/76294 | train loss 3.367465 | norm 0.4999 | lr 4.64e-04 | (3800.09 ms | 137967 tok/s) step 12362/76294 | train loss 3.387678 | norm 0.2155 | lr 4.64e-04 | (3828.14 ms | 136956 tok/s) step 12363/76294 | train loss 3.408659 | norm 0.7084 | lr 4.64e-04 | (3802.05 ms | 137896 tok/s) step 12364/76294 | train loss 3.382280 | norm 0.3076 | lr 4.64e-04 | (3826.53 
ms | 137014 tok/s) step 12365/76294 | train loss 3.417153 | norm 0.3708 | lr 4.64e-04 | (3801.08 ms | 137931 tok/s) step 12366/76294 | train loss 3.443778 | norm 0.4359 | lr 4.64e-04 | (3870.09 ms | 135472 tok/s) step 12367/76294 | train loss 3.476072 | norm 0.2479 | lr 4.64e-04 | (3799.53 ms | 137988 tok/s) step 12368/76294 | train loss 3.433577 | norm 0.4143 | lr 4.64e-04 | (3899.29 ms | 134457 tok/s) step 12369/76294 | train loss 3.422671 | norm 0.2881 | lr 4.64e-04 | (3796.08 ms | 138113 tok/s) step 12370/76294 | train loss 3.444539 | norm 0.2871 | lr 4.64e-04 | (3842.29 ms | 136452 tok/s) step 12371/76294 | train loss 3.499352 | norm 0.2531 | lr 4.64e-04 | (3800.67 ms | 137946 tok/s) step 12372/76294 | train loss 3.405882 | norm 0.2679 | lr 4.64e-04 | (3803.87 ms | 137830 tok/s) step 12373/76294 | train loss 3.430009 | norm 0.3751 | lr 4.63e-04 | (3827.43 ms | 136982 tok/s) step 12374/76294 | train loss 3.455065 | norm 0.3169 | lr 4.63e-04 | (3811.94 ms | 137538 tok/s) step 12375/76294 | train loss 3.409164 | norm 0.2737 | lr 4.63e-04 | (3806.04 ms | 137751 tok/s) step 12376/76294 | train loss 3.426511 | norm 0.5290 | lr 4.63e-04 | (3835.13 ms | 136707 tok/s) step 12377/76294 | train loss 3.516623 | norm 0.3495 | lr 4.63e-04 | (3872.77 ms | 135378 tok/s) step 12378/76294 | train loss 3.392015 | norm 0.4984 | lr 4.63e-04 | (3792.46 ms | 138245 tok/s) step 12379/76294 | train loss 3.386729 | norm 0.2169 | lr 4.63e-04 | (3810.78 ms | 137580 tok/s) step 12380/76294 | train loss 3.420451 | norm 0.2898 | lr 4.63e-04 | (3800.23 ms | 137962 tok/s) step 12381/76294 | train loss 3.342018 | norm 0.2070 | lr 4.63e-04 | (3828.23 ms | 136953 tok/s) step 12382/76294 | train loss 3.396445 | norm 0.2508 | lr 4.63e-04 | (3799.61 ms | 137985 tok/s) step 12383/76294 | train loss 3.428999 | norm 0.3631 | lr 4.63e-04 | (3807.04 ms | 137715 tok/s) step 12384/76294 | train loss 3.397523 | norm 0.4711 | lr 4.63e-04 | (3821.62 ms | 137190 tok/s) step 12385/76294 | train loss 3.435353 | 
norm 0.5108 | lr 4.62e-04 | (3837.92 ms | 136607 tok/s) step 12386/76294 | train loss 3.367113 | norm 0.4934 | lr 4.62e-04 | (3812.39 ms | 137522 tok/s) step 12387/76294 | train loss 3.456692 | norm 0.2895 | lr 4.62e-04 | (3795.51 ms | 138134 tok/s) step 12388/76294 | train loss 3.375411 | norm 0.9167 | lr 4.62e-04 | (3797.65 ms | 138056 tok/s) step 12389/76294 | train loss 3.420369 | norm 0.8055 | lr 4.62e-04 | (3821.59 ms | 137191 tok/s) step 12390/76294 | train loss 3.415816 | norm 0.5935 | lr 4.62e-04 | (3797.44 ms | 138064 tok/s) step 12391/76294 | train loss 3.493546 | norm 0.3404 | lr 4.62e-04 | (3825.66 ms | 137045 tok/s) step 12392/76294 | train loss 3.449469 | norm 0.4577 | lr 4.62e-04 | (3797.42 ms | 138064 tok/s) step 12393/76294 | train loss 3.500567 | norm 0.5855 | lr 4.62e-04 | (3831.02 ms | 136853 tok/s) step 12394/76294 | train loss 3.411107 | norm 0.2723 | lr 4.62e-04 | (3800.07 ms | 137968 tok/s) step 12395/76294 | train loss 3.427791 | norm 0.2797 | lr 4.62e-04 | (3800.95 ms | 137936 tok/s) step 12396/76294 | train loss 3.384495 | norm 0.2421 | lr 4.62e-04 | (3876.71 ms | 135240 tok/s) step 12397/76294 | train loss 3.424968 | norm 0.2829 | lr 4.61e-04 | (3838.78 ms | 136577 tok/s) step 12398/76294 | train loss 3.419449 | norm 0.2477 | lr 4.61e-04 | (4157.79 ms | 126098 tok/s) step 12399/76294 | train loss 3.536216 | norm 0.3027 | lr 4.61e-04 | (3799.26 ms | 137997 tok/s) step 12400/76294 | train loss 3.399081 | norm 0.3579 | lr 4.61e-04 | (3800.80 ms | 137941 tok/s) step 12401/76294 | train loss 3.421549 | norm 0.3295 | lr 4.61e-04 | (3837.27 ms | 136630 tok/s) step 12402/76294 | train loss 3.476550 | norm 0.3670 | lr 4.61e-04 | (3804.36 ms | 137812 tok/s) step 12403/76294 | train loss 3.384184 | norm 0.4800 | lr 4.61e-04 | (3804.63 ms | 137803 tok/s) step 12404/76294 | train loss 3.432936 | norm 0.2516 | lr 4.61e-04 | (3842.11 ms | 136458 tok/s) step 12405/76294 | train loss 3.448454 | norm 0.2157 | lr 4.61e-04 | (3805.66 ms | 137765 tok/s) 
step 12406/76294 | train loss 3.378891 | norm 0.2187 | lr 4.61e-04 | (3810.20 ms | 137601 tok/s) step 12407/76294 | train loss 3.425598 | norm 0.3107 | lr 4.61e-04 | (3829.84 ms | 136895 tok/s) step 12408/76294 | train loss 3.496346 | norm 0.2696 | lr 4.61e-04 | (3810.54 ms | 137589 tok/s) step 12409/76294 | train loss 3.404720 | norm 0.2040 | lr 4.60e-04 | (3825.89 ms | 137037 tok/s) step 12410/76294 | train loss 3.381490 | norm 0.2982 | lr 4.60e-04 | (3807.54 ms | 137697 tok/s) step 12411/76294 | train loss 3.430424 | norm 0.2276 | lr 4.60e-04 | (3816.42 ms | 137377 tok/s) step 12412/76294 | train loss 3.421346 | norm 0.2955 | lr 4.60e-04 | (3810.96 ms | 137574 tok/s) step 12413/76294 | train loss 3.403706 | norm 0.2424 | lr 4.60e-04 | (3809.82 ms | 137615 tok/s) step 12414/76294 | train loss 3.456476 | norm 0.3261 | lr 4.60e-04 | (3811.39 ms | 137558 tok/s) step 12415/76294 | train loss 3.422030 | norm 0.2403 | lr 4.60e-04 | (3855.71 ms | 135977 tok/s) step 12416/76294 | train loss 3.408926 | norm 0.3897 | lr 4.60e-04 | (3801.95 ms | 137900 tok/s) step 12417/76294 | train loss 3.436033 | norm 0.4837 | lr 4.60e-04 | (3906.39 ms | 134213 tok/s) step 12418/76294 | train loss 3.452662 | norm 0.2524 | lr 4.60e-04 | (3801.24 ms | 137926 tok/s) step 12419/76294 | train loss 3.414657 | norm 0.2799 | lr 4.60e-04 | (3858.04 ms | 135895 tok/s) step 12420/76294 | train loss 3.411150 | norm 0.3294 | lr 4.60e-04 | (3811.84 ms | 137542 tok/s) step 12421/76294 | train loss 3.466494 | norm 0.4732 | lr 4.59e-04 | (3818.54 ms | 137301 tok/s) step 12422/76294 | train loss 3.445155 | norm 0.4228 | lr 4.59e-04 | (3836.33 ms | 136664 tok/s) step 12423/76294 | train loss 3.407020 | norm 0.3289 | lr 4.59e-04 | (3814.91 ms | 137431 tok/s) step 12424/76294 | train loss 3.411726 | norm 0.3106 | lr 4.59e-04 | (3929.92 ms | 133409 tok/s) step 12425/76294 | train loss 3.452846 | norm 0.5926 | lr 4.59e-04 | (3808.71 ms | 137655 tok/s) step 12426/76294 | train loss 3.468541 | norm 0.2567 | lr 
4.59e-04 | (3880.84 ms | 135097 tok/s) step 12427/76294 | train loss 3.423934 | norm 0.4721 | lr 4.59e-04 | (3801.70 ms | 137909 tok/s) step 12428/76294 | train loss 3.451707 | norm 0.3098 | lr 4.59e-04 | (3854.79 ms | 136010 tok/s) step 12429/76294 | train loss 3.428222 | norm 0.3748 | lr 4.59e-04 | (3852.23 ms | 136100 tok/s) step 12430/76294 | train loss 3.409421 | norm 0.5007 | lr 4.59e-04 | (3826.66 ms | 137009 tok/s) step 12431/76294 | train loss 3.408959 | norm 0.2548 | lr 4.59e-04 | (3799.55 ms | 137987 tok/s) step 12432/76294 | train loss 3.451354 | norm 0.6861 | lr 4.59e-04 | (3830.34 ms | 136878 tok/s) step 12433/76294 | train loss 3.395875 | norm 0.5382 | lr 4.58e-04 | (3801.84 ms | 137904 tok/s) step 12434/76294 | train loss 3.430919 | norm 0.2353 | lr 4.58e-04 | (3835.20 ms | 136704 tok/s) step 12435/76294 | train loss 3.418080 | norm 0.2584 | lr 4.58e-04 | (3805.98 ms | 137754 tok/s) step 12436/76294 | train loss 3.496665 | norm 0.3266 | lr 4.58e-04 | (3810.30 ms | 137598 tok/s) step 12437/76294 | train loss 3.428217 | norm 0.4113 | lr 4.58e-04 | (3829.93 ms | 136892 tok/s) step 12438/76294 | train loss 3.393799 | norm 0.4109 | lr 4.58e-04 | (3888.87 ms | 134817 tok/s) step 12439/76294 | train loss 3.448497 | norm 0.5495 | lr 4.58e-04 | (3803.33 ms | 137850 tok/s) step 12440/76294 | train loss 3.365153 | norm 0.4147 | lr 4.58e-04 | (3825.40 ms | 137054 tok/s) step 12441/76294 | train loss 3.490260 | norm 0.5610 | lr 4.58e-04 | (3829.04 ms | 136924 tok/s) step 12442/76294 | train loss 3.462286 | norm 0.9930 | lr 4.58e-04 | (3859.57 ms | 135841 tok/s) step 12443/76294 | train loss 3.426815 | norm 0.3208 | lr 4.58e-04 | (3911.93 ms | 134023 tok/s) step 12444/76294 | train loss 3.406256 | norm 0.6075 | lr 4.58e-04 | (3789.99 ms | 138335 tok/s) step 12445/76294 | train loss 3.497720 | norm 0.2946 | lr 4.57e-04 | (3795.72 ms | 138126 tok/s) step 12446/76294 | train loss 3.396298 | norm 0.3372 | lr 4.57e-04 | (3827.01 ms | 136997 tok/s) step 12447/76294 | 
train loss 3.436467 | norm 0.3165 | lr 4.57e-04 | (3793.09 ms | 138222 tok/s) step 12448/76294 | train loss 3.469765 | norm 0.2066 | lr 4.57e-04 | (3809.28 ms | 137634 tok/s) step 12449/76294 | train loss 3.389343 | norm 0.2877 | lr 4.57e-04 | (3793.82 ms | 138195 tok/s) step 12450/76294 | train loss 3.388559 | norm 0.2185 | lr 4.57e-04 | (3799.17 ms | 138001 tok/s) step 12451/76294 | train loss 3.444337 | norm 0.2881 | lr 4.57e-04 | (3813.01 ms | 137500 tok/s) step 12452/76294 | train loss 3.410655 | norm 0.2429 | lr 4.57e-04 | (3819.29 ms | 137274 tok/s) step 12453/76294 | train loss 3.407988 | norm 0.4610 | lr 4.57e-04 | (3797.09 ms | 138076 tok/s) step 12454/76294 | train loss 3.377723 | norm 0.7747 | lr 4.57e-04 | (4053.13 ms | 129354 tok/s) step 12455/76294 | train loss 3.387934 | norm 0.3726 | lr 4.57e-04 | (4626.88 ms | 113314 tok/s) step 12456/76294 | train loss 3.454151 | norm 0.5417 | lr 4.57e-04 | (3769.04 ms | 139104 tok/s) step 12457/76294 | train loss 3.434062 | norm 0.4434 | lr 4.56e-04 | (3803.74 ms | 137835 tok/s) step 12458/76294 | train loss 3.443492 | norm 0.3579 | lr 4.56e-04 | (3774.32 ms | 138909 tok/s) step 12459/76294 | train loss 3.415680 | norm 0.4179 | lr 4.56e-04 | (3804.40 ms | 137811 tok/s) step 12460/76294 | train loss 3.382063 | norm 0.2656 | lr 4.56e-04 | (3779.92 ms | 138703 tok/s) step 12461/76294 | train loss 3.388119 | norm 0.3788 | lr 4.56e-04 | (3970.85 ms | 132034 tok/s) step 12462/76294 | train loss 3.404175 | norm 0.3205 | lr 4.56e-04 | (3783.45 ms | 138574 tok/s) step 12463/76294 | train loss 3.425813 | norm 0.9007 | lr 4.56e-04 | (3787.53 ms | 138425 tok/s) step 12464/76294 | train loss 3.469432 | norm 0.5772 | lr 4.56e-04 | (3812.35 ms | 137524 tok/s) step 12465/76294 | train loss 3.376260 | norm 0.3226 | lr 4.56e-04 | (3842.93 ms | 136429 tok/s) step 12466/76294 | train loss 3.431539 | norm 0.7104 | lr 4.56e-04 | (3784.29 ms | 138543 tok/s) step 12467/76294 | train loss 3.399838 | norm 0.3082 | lr 4.56e-04 | (3794.92 
ms | 138155 tok/s) step 12468/76294 | train loss 3.377520 | norm 0.2634 | lr 4.56e-04 | (3817.91 ms | 137323 tok/s) step 12469/76294 | train loss 3.396514 | norm 0.2592 | lr 4.55e-04 | (3797.03 ms | 138079 tok/s) step 12470/76294 | train loss 3.418569 | norm 0.2514 | lr 4.55e-04 | (3799.26 ms | 137997 tok/s) step 12471/76294 | train loss 3.555497 | norm 0.3765 | lr 4.55e-04 | (3799.65 ms | 137983 tok/s) step 12472/76294 | train loss 3.411319 | norm 0.5419 | lr 4.55e-04 | (3804.35 ms | 137813 tok/s) step 12473/76294 | train loss 3.393821 | norm 0.4447 | lr 4.55e-04 | (3793.85 ms | 138194 tok/s) step 12474/76294 | train loss 3.390801 | norm 0.4401 | lr 4.55e-04 | (3825.76 ms | 137041 tok/s) step 12475/76294 | train loss 3.442015 | norm 0.5834 | lr 4.55e-04 | (5242.51 ms | 100007 tok/s) step 12476/76294 | train loss 3.403376 | norm 0.2812 | lr 4.55e-04 | (4711.42 ms | 111280 tok/s) step 12477/76294 | train loss 3.581321 | norm 0.3972 | lr 4.55e-04 | (4642.57 ms | 112930 tok/s) step 12478/76294 | train loss 3.435223 | norm 0.7091 | lr 4.55e-04 | (3847.86 ms | 136255 tok/s) step 12479/76294 | train loss 3.480998 | norm 0.3388 | lr 4.55e-04 | (3799.40 ms | 137992 tok/s) step 12480/76294 | train loss 3.427813 | norm 0.5404 | lr 4.55e-04 | (3812.17 ms | 137530 tok/s) step 12481/76294 | train loss 3.415739 | norm 0.3882 | lr 4.54e-04 | (3795.37 ms | 138139 tok/s) step 12482/76294 | train loss 3.446118 | norm 0.2718 | lr 4.54e-04 | (3853.80 ms | 136044 tok/s) step 12483/76294 | train loss 3.381783 | norm 0.4918 | lr 4.54e-04 | (3793.56 ms | 138205 tok/s) step 12484/76294 | train loss 3.421099 | norm 0.2222 | lr 4.54e-04 | (4055.65 ms | 129274 tok/s) step 12485/76294 | train loss 3.376228 | norm 0.3324 | lr 4.54e-04 | (3802.23 ms | 137890 tok/s) step 12486/76294 | train loss 3.402009 | norm 0.3213 | lr 4.54e-04 | (3822.54 ms | 137157 tok/s) step 12487/76294 | train loss 3.423985 | norm 0.4506 | lr 4.54e-04 | (3793.82 ms | 138195 tok/s) step 12488/76294 | train loss 3.385915 | 
norm 0.3807 | lr 4.54e-04 | (3814.96 ms | 137430 tok/s) step 12489/76294 | train loss 3.478922 | norm 0.2213 | lr 4.54e-04 | (3795.85 ms | 138121 tok/s) step 12490/76294 | train loss 3.429188 | norm 0.3789 | lr 4.54e-04 | (3801.97 ms | 137899 tok/s) step 12491/76294 | train loss 3.422915 | norm 0.2741 | lr 4.54e-04 | (3817.53 ms | 137337 tok/s) step 12492/76294 | train loss 3.452221 | norm 0.5485 | lr 4.54e-04 | (3799.86 ms | 137976 tok/s) step 12493/76294 | train loss 3.488931 | norm 0.2634 | lr 4.53e-04 | (3799.80 ms | 137978 tok/s) step 12494/76294 | train loss 3.403555 | norm 0.5189 | lr 4.53e-04 | (3846.29 ms | 136310 tok/s) step 12495/76294 | train loss 3.417755 | norm 0.2355 | lr 4.53e-04 | (3800.86 ms | 137939 tok/s) step 12496/76294 | train loss 3.471363 | norm 0.3133 | lr 4.53e-04 | (3805.50 ms | 137771 tok/s) step 12497/76294 | train loss 3.367648 | norm 0.3292 | lr 4.53e-04 | (3834.31 ms | 136736 tok/s) step 12498/76294 | train loss 3.461397 | norm 0.4866 | lr 4.53e-04 | (3803.92 ms | 137828 tok/s) step 12499/76294 | train loss 3.409486 | norm 0.2736 | lr 4.53e-04 | (3871.31 ms | 135429 tok/s) step 12500/76294 | train loss 3.392286 | norm 0.2036 | lr 4.53e-04 | (3801.20 ms | 137927 tok/s) val loss: 3.392208 saving model checkpoint to ./results/gpt2-124M-gqa/step_12500.pth step 12501/76294 | train loss 3.405477 | norm 0.3735 | lr 4.53e-04 | (3868.05 ms | 135543 tok/s) step 12502/76294 | train loss 3.430018 | norm 0.3281 | lr 4.53e-04 | (3788.33 ms | 138396 tok/s) step 12503/76294 | train loss 3.487776 | norm 0.2369 | lr 4.53e-04 | (3859.39 ms | 135848 tok/s) step 12504/76294 | train loss 3.426402 | norm 0.3198 | lr 4.53e-04 | (4746.55 ms | 110457 tok/s) step 12505/76294 | train loss 3.417778 | norm 0.3464 | lr 4.52e-04 | (3825.29 ms | 137058 tok/s) step 12506/76294 | train loss 3.371068 | norm 0.3159 | lr 4.52e-04 | (3887.01 ms | 134882 tok/s) step 12507/76294 | train loss 3.398660 | norm 0.3169 | lr 4.52e-04 | (3789.67 ms | 138347 tok/s) step 
12508/76294 | train loss 3.367460 | norm 0.3601 | lr 4.52e-04 | (3800.07 ms | 137968 tok/s) step 12509/76294 | train loss 3.440321 | norm 0.4235 | lr 4.52e-04 | (3820.24 ms | 137239 tok/s) step 12510/76294 | train loss 3.423819 | norm 0.3488 | lr 4.52e-04 | (3796.11 ms | 138112 tok/s) step 12511/76294 | train loss 3.422849 | norm 0.3437 | lr 4.52e-04 | (3795.79 ms | 138124 tok/s) step 12512/76294 | train loss 3.434397 | norm 0.6028 | lr 4.52e-04 | (3832.12 ms | 136814 tok/s) step 12513/76294 | train loss 3.415890 | norm 0.3567 | lr 4.52e-04 | (3802.85 ms | 137867 tok/s) step 12514/76294 | train loss 3.411463 | norm 0.2596 | lr 4.52e-04 | (3809.75 ms | 137618 tok/s) step 12515/76294 | train loss 3.342547 | norm 0.3490 | lr 4.52e-04 | (3827.19 ms | 136990 tok/s) step 12516/76294 | train loss 3.443488 | norm 0.2430 | lr 4.52e-04 | (3810.38 ms | 137595 tok/s) step 12517/76294 | train loss 3.405441 | norm 0.4662 | lr 4.51e-04 | (3812.87 ms | 137505 tok/s) step 12518/76294 | train loss 3.405899 | norm 0.4714 | lr 4.51e-04 | (3844.29 ms | 136381 tok/s) step 12519/76294 | train loss 3.527209 | norm 0.6589 | lr 4.51e-04 | (3801.60 ms | 137913 tok/s) step 12520/76294 | train loss 3.336590 | norm 0.4285 | lr 4.51e-04 | (3830.51 ms | 136872 tok/s) step 12521/76294 | train loss 3.365904 | norm 0.3172 | lr 4.51e-04 | (3805.83 ms | 137759 tok/s) step 12522/76294 | train loss 3.418582 | norm 0.3671 | lr 4.51e-04 | (3810.76 ms | 137581 tok/s) step 12523/76294 | train loss 3.368130 | norm 0.3084 | lr 4.51e-04 | (3827.14 ms | 136992 tok/s) step 12524/76294 | train loss 3.484497 | norm 0.2666 | lr 4.51e-04 | (3807.31 ms | 137706 tok/s) step 12525/76294 | train loss 3.402992 | norm 0.2768 | lr 4.51e-04 | (3805.34 ms | 137777 tok/s) step 12526/76294 | train loss 3.421099 | norm 0.2664 | lr 4.51e-04 | (3829.74 ms | 136899 tok/s) step 12527/76294 | train loss 3.455009 | norm 0.2504 | lr 4.51e-04 | (3814.77 ms | 137436 tok/s) step 12528/76294 | train loss 3.383437 | norm 0.2551 | lr 
4.51e-04 | (3805.26 ms | 137780 tok/s) step 12529/76294 | train loss 3.424874 | norm 0.5273 | lr 4.50e-04 | (3830.95 ms | 136856 tok/s) step 12530/76294 | train loss 3.474987 | norm 0.2431 | lr 4.50e-04 | (3804.96 ms | 137791 tok/s) step 12531/76294 | train loss 3.480051 | norm 0.2256 | lr 4.50e-04 | (3810.17 ms | 137602 tok/s) step 12532/76294 | train loss 3.350255 | norm 0.4424 | lr 4.50e-04 | (3827.74 ms | 136971 tok/s) step 12533/76294 | train loss 3.415226 | norm 0.2240 | lr 4.50e-04 | (3807.74 ms | 137690 tok/s) step 12534/76294 | train loss 3.423818 | norm 0.2626 | lr 4.50e-04 | (3829.05 ms | 136924 tok/s) step 12535/76294 | train loss 3.411085 | norm 0.2466 | lr 4.50e-04 | (3808.51 ms | 137662 tok/s) step 12536/76294 | train loss 3.345996 | norm 0.4151 | lr 4.50e-04 | (3849.63 ms | 136192 tok/s) step 12537/76294 | train loss 3.414016 | norm 0.3439 | lr 4.50e-04 | (3808.75 ms | 137654 tok/s) step 12538/76294 | train loss 3.426938 | norm 0.5866 | lr 4.50e-04 | (4386.03 ms | 119536 tok/s) step 12539/76294 | train loss 3.397161 | norm 0.3078 | lr 4.50e-04 | (3810.55 ms | 137589 tok/s) step 12540/76294 | train loss 3.386606 | norm 0.3730 | lr 4.50e-04 | (3844.42 ms | 136376 tok/s) step 12541/76294 | train loss 3.389379 | norm 0.6220 | lr 4.49e-04 | (3846.62 ms | 136298 tok/s) step 12542/76294 | train loss 3.406452 | norm 0.2609 | lr 4.49e-04 | (3806.27 ms | 137743 tok/s) step 12543/76294 | train loss 3.376595 | norm 0.2999 | lr 4.49e-04 | (3833.34 ms | 136770 tok/s) step 12544/76294 | train loss 3.392118 | norm 0.3091 | lr 4.49e-04 | (3804.85 ms | 137795 tok/s) step 12545/76294 | train loss 3.381685 | norm 0.4540 | lr 4.49e-04 | (3814.90 ms | 137432 tok/s) step 12546/76294 | train loss 3.436543 | norm 0.4811 | lr 4.49e-04 | (3802.78 ms | 137870 tok/s) step 12547/76294 | train loss 3.406426 | norm 0.2431 | lr 4.49e-04 | (3815.36 ms | 137415 tok/s) step 12548/76294 | train loss 3.372057 | norm 0.8141 | lr 4.49e-04 | (3808.59 ms | 137659 tok/s) step 12549/76294 | 
train loss 3.402904 | norm 0.3806 | lr 4.49e-04 | (3838.98 ms | 136569 tok/s) step 12550/76294 | train loss 3.455127 | norm 0.4537 | lr 4.49e-04 | (3845.04 ms | 136354 tok/s) step 12551/76294 | train loss 3.389393 | norm 0.4337 | lr 4.49e-04 | (3803.95 ms | 137827 tok/s) step 12552/76294 | train loss 3.423295 | norm 0.3251 | lr 4.49e-04 | (3811.76 ms | 137545 tok/s) step 12553/76294 | train loss 3.534971 | norm 0.2698 | lr 4.48e-04 | (3804.18 ms | 137819 tok/s) step 12554/76294 | train loss 3.348873 | norm 0.2659 | lr 4.48e-04 | (3928.12 ms | 133471 tok/s) step 12555/76294 | train loss 3.399030 | norm 0.5435 | lr 4.48e-04 | (3790.43 ms | 138319 tok/s) step 12556/76294 | train loss 3.399183 | norm 0.2661 | lr 4.48e-04 | (3800.29 ms | 137960 tok/s) step 12557/76294 | train loss 3.462973 | norm 0.2912 | lr 4.48e-04 | (3812.69 ms | 137511 tok/s) step 12558/76294 | train loss 3.350276 | norm 0.2855 | lr 4.48e-04 | (3799.86 ms | 137976 tok/s) step 12559/76294 | train loss 3.361715 | norm 0.1884 | lr 4.48e-04 | (3801.78 ms | 137906 tok/s) step 12560/76294 | train loss 3.380171 | norm 0.4297 | lr 4.48e-04 | (3800.17 ms | 137964 tok/s) step 12561/76294 | train loss 3.367639 | norm 0.2267 | lr 4.48e-04 | (3813.97 ms | 137465 tok/s) step 12562/76294 | train loss 3.388431 | norm 0.2760 | lr 4.48e-04 | (3802.85 ms | 137867 tok/s) step 12563/76294 | train loss 3.456890 | norm 0.2232 | lr 4.48e-04 | (3826.81 ms | 137004 tok/s) step 12564/76294 | train loss 3.314428 | norm 0.1968 | lr 4.48e-04 | (3863.58 ms | 135700 tok/s) step 12565/76294 | train loss 3.399184 | norm 0.1791 | lr 4.47e-04 | (3796.94 ms | 138082 tok/s) step 12566/76294 | train loss 3.427449 | norm 0.1889 | lr 4.47e-04 | (3908.74 ms | 134132 tok/s) step 12567/76294 | train loss 3.430989 | norm 0.2174 | lr 4.47e-04 | (3792.94 ms | 138227 tok/s) step 12568/76294 | train loss 3.369116 | norm 0.2788 | lr 4.47e-04 | (3800.21 ms | 137963 tok/s) step 12569/76294 | train loss 3.476302 | norm 0.9538 | lr 4.47e-04 | (3913.69 
ms | 133963 tok/s) step 12570/76294 | train loss 3.391325 | norm 0.8345 | lr 4.47e-04 | (3780.91 ms | 138667 tok/s) step 12571/76294 | train loss 3.405077 | norm 0.6789 | lr 4.47e-04 | (3841.77 ms | 136470 tok/s) step 12572/76294 | train loss 3.427503 | norm 0.5891 | lr 4.47e-04 | (3789.64 ms | 138348 tok/s) step 12573/76294 | train loss 3.409951 | norm 0.2786 | lr 4.47e-04 | (3787.74 ms | 138417 tok/s) step 12574/76294 | train loss 3.409953 | norm 0.6283 | lr 4.47e-04 | (3810.46 ms | 137592 tok/s) step 12575/76294 | train loss 3.426245 | norm 0.3601 | lr 4.47e-04 | (3818.55 ms | 137300 tok/s) step 12576/76294 | train loss 3.379880 | norm 0.5074 | lr 4.47e-04 | (3815.91 ms | 137395 tok/s) step 12577/76294 | train loss 3.470627 | norm 1.3277 | lr 4.46e-04 | (3794.09 ms | 138185 tok/s) step 12578/76294 | train loss 3.482392 | norm 0.2877 | lr 4.46e-04 | (3792.27 ms | 138252 tok/s) step 12579/76294 | train loss 3.397718 | norm 0.7065 | lr 4.46e-04 | (3820.81 ms | 137219 tok/s) step 12580/76294 | train loss 3.406416 | norm 0.6439 | lr 4.46e-04 | (3791.70 ms | 138273 tok/s) step 12581/76294 | train loss 3.400975 | norm 0.3312 | lr 4.46e-04 | (3875.57 ms | 135280 tok/s) step 12582/76294 | train loss 3.355582 | norm 0.3460 | lr 4.46e-04 | (3792.97 ms | 138226 tok/s) step 12583/76294 | train loss 3.464762 | norm 0.6614 | lr 4.46e-04 | (3819.14 ms | 137279 tok/s) step 12584/76294 | train loss 3.407809 | norm 0.5666 | lr 4.46e-04 | (3793.02 ms | 138224 tok/s) step 12585/76294 | train loss 3.465935 | norm 0.2442 | lr 4.46e-04 | (3900.90 ms | 134402 tok/s) step 12586/76294 | train loss 3.339628 | norm 0.3402 | lr 4.46e-04 | (3810.61 ms | 137587 tok/s) step 12587/76294 | train loss 3.427234 | norm 0.4162 | lr 4.46e-04 | (3841.20 ms | 136491 tok/s) step 12588/76294 | train loss 3.416158 | norm 0.3027 | lr 4.46e-04 | (3790.43 ms | 138319 tok/s) step 12589/76294 | train loss 3.359832 | norm 0.3870 | lr 4.45e-04 | (3819.01 ms | 137284 tok/s) step 12590/76294 | train loss 3.443150 | 
norm 0.2201 | lr 4.45e-04 | (3793.67 ms | 138201 tok/s) step 12591/76294 | train loss 3.412018 | norm 0.3588 | lr 4.45e-04 | (3849.52 ms | 136196 tok/s) step 12592/76294 | train loss 3.323190 | norm 0.2105 | lr 4.45e-04 | (3827.20 ms | 136990 tok/s) step 12593/76294 | train loss 3.431351 | norm 0.5925 | lr 4.45e-04 | (3795.94 ms | 138118 tok/s) step 12594/76294 | train loss 3.450300 | norm 0.3658 | lr 4.45e-04 | (3815.58 ms | 137407 tok/s) step 12595/76294 | train loss 3.400394 | norm 0.4346 | lr 4.45e-04 | (3798.55 ms | 138023 tok/s) step 12596/76294 | train loss 3.374984 | norm 0.5379 | lr 4.45e-04 | (3817.22 ms | 137348 tok/s) step 12597/76294 | train loss 3.494740 | norm 0.2224 | lr 4.45e-04 | (3817.19 ms | 137349 tok/s) step 12598/76294 | train loss 3.377423 | norm 0.4528 | lr 4.45e-04 | (3832.68 ms | 136794 tok/s) step 12599/76294 | train loss 3.388483 | norm 0.3998 | lr 4.45e-04 | (3889.03 ms | 134812 tok/s) step 12600/76294 | train loss 3.403187 | norm 0.2834 | lr 4.45e-04 | (3900.89 ms | 134402 tok/s) step 12601/76294 | train loss 3.386123 | norm 0.3250 | lr 4.44e-04 | (3898.77 ms | 134475 tok/s) step 12602/76294 | train loss 3.372193 | norm 0.4958 | lr 4.44e-04 | (3787.52 ms | 138425 tok/s) step 12603/76294 | train loss 3.534334 | norm 0.2248 | lr 4.44e-04 | (3794.48 ms | 138171 tok/s) step 12604/76294 | train loss 3.372631 | norm 0.3632 | lr 4.44e-04 | (3928.19 ms | 133468 tok/s) step 12605/76294 | train loss 3.408208 | norm 0.2735 | lr 4.44e-04 | (3782.89 ms | 138595 tok/s) step 12606/76294 | train loss 3.399621 | norm 0.2665 | lr 4.44e-04 | (3811.44 ms | 137556 tok/s) step 12607/76294 | train loss 3.418296 | norm 0.4696 | lr 4.44e-04 | (3791.49 ms | 138280 tok/s) step 12608/76294 | train loss 3.469541 | norm 0.2692 | lr 4.44e-04 | (3865.81 ms | 135622 tok/s) step 12609/76294 | train loss 3.440182 | norm 0.3246 | lr 4.44e-04 | (3784.22 ms | 138546 tok/s) step 12610/76294 | train loss 3.381877 | norm 0.3654 | lr 4.44e-04 | (3812.68 ms | 137512 tok/s) 
step 12611/76294 | train loss 3.358058 | norm 0.4165 | lr 4.44e-04 | (3785.59 ms | 138496 tok/s) step 12612/76294 | train loss 3.398247 | norm 0.2243 | lr 4.44e-04 | (3798.27 ms | 138033 tok/s) step 12613/76294 | train loss 3.406692 | norm 0.5040 | lr 4.44e-04 | (3816.32 ms | 137381 tok/s) step 12614/76294 | train loss 3.455676 | norm 0.5112 | lr 4.43e-04 | (3908.78 ms | 134131 tok/s) step 12615/76294 | train loss 3.350111 | norm 0.3316 | lr 4.43e-04 | (3896.79 ms | 134544 tok/s) step 12616/76294 | train loss 3.401455 | norm 0.4605 | lr 4.43e-04 | (3782.22 ms | 138619 tok/s) step 12617/76294 | train loss 3.385047 | norm 0.5051 | lr 4.43e-04 | (3806.40 ms | 137739 tok/s) step 12618/76294 | train loss 3.354064 | norm 0.2798 | lr 4.43e-04 | (3782.85 ms | 138596 tok/s) step 12619/76294 | train loss 3.387112 | norm 0.6164 | lr 4.43e-04 | (3817.16 ms | 137350 tok/s) step 12620/76294 | train loss 3.489030 | norm 0.5992 | lr 4.43e-04 | (3875.68 ms | 135276 tok/s) step 12621/76294 | train loss 3.371174 | norm 0.2339 | lr 4.43e-04 | (3790.10 ms | 138331 tok/s) step 12622/76294 | train loss 3.437655 | norm 0.3066 | lr 4.43e-04 | (3825.47 ms | 137052 tok/s) step 12623/76294 | train loss 3.382818 | norm 0.2464 | lr 4.43e-04 | (3813.14 ms | 137495 tok/s) step 12624/76294 | train loss 3.494763 | norm 0.2020 | lr 4.43e-04 | (3794.17 ms | 138182 tok/s) step 12625/76294 | train loss 3.362142 | norm 0.2673 | lr 4.43e-04 | (3797.43 ms | 138064 tok/s) step 12626/76294 | train loss 3.375985 | norm 0.2175 | lr 4.42e-04 | (3794.86 ms | 138158 tok/s) step 12627/76294 | train loss 3.366924 | norm 0.2639 | lr 4.42e-04 | (3800.62 ms | 137948 tok/s) step 12628/76294 | train loss 3.434271 | norm 0.2629 | lr 4.42e-04 | (3798.45 ms | 138027 tok/s) step 12629/76294 | train loss 3.478431 | norm 0.4921 | lr 4.42e-04 | (3828.42 ms | 136946 tok/s) step 12630/76294 | train loss 3.413218 | norm 0.3450 | lr 4.42e-04 | (3797.35 ms | 138067 tok/s) step 12631/76294 | train loss 3.463292 | norm 0.3255 | lr 
4.42e-04 | (3891.84 ms | 134715 tok/s) step 12632/76294 | train loss 3.442643 | norm 0.4703 | lr 4.42e-04 | (3903.59 ms | 134309 tok/s) step 12633/76294 | train loss 3.361915 | norm 0.2206 | lr 4.42e-04 | (3777.31 ms | 138799 tok/s) step 12634/76294 | train loss 3.398932 | norm 0.2815 | lr 4.42e-04 | (3813.20 ms | 137493 tok/s) step 12635/76294 | train loss 3.400582 | norm 0.3829 | lr 4.42e-04 | (3784.74 ms | 138527 tok/s) step 12636/76294 | train loss 3.424670 | norm 0.2784 | lr 4.42e-04 | (4417.49 ms | 118685 tok/s) step 12637/76294 | train loss 3.451351 | norm 0.4916 | lr 4.42e-04 | (3838.88 ms | 136573 tok/s) step 12638/76294 | train loss 3.427493 | norm 0.3746 | lr 4.41e-04 | (3842.07 ms | 136460 tok/s) step 12639/76294 | train loss 3.431249 | norm 0.2794 | lr 4.41e-04 | (3818.00 ms | 137320 tok/s) step 12640/76294 | train loss 3.402194 | norm 0.7183 | lr 4.41e-04 | (3827.89 ms | 136965 tok/s) step 12641/76294 | train loss 3.323617 | norm 0.3137 | lr 4.41e-04 | (3820.15 ms | 137243 tok/s) step 12642/76294 | train loss 3.341063 | norm 0.6110 | lr 4.41e-04 | (3802.58 ms | 137877 tok/s) step 12643/76294 | train loss 3.358840 | norm 0.4263 | lr 4.41e-04 | (3899.83 ms | 134439 tok/s) step 12644/76294 | train loss 3.377029 | norm 0.3272 | lr 4.41e-04 | (3790.40 ms | 138320 tok/s) step 12645/76294 | train loss 3.425574 | norm 0.2185 | lr 4.41e-04 | (4023.42 ms | 130309 tok/s) step 12646/76294 | train loss 3.553056 | norm 0.3068 | lr 4.41e-04 | (3797.42 ms | 138064 tok/s) step 12647/76294 | train loss 3.470889 | norm 0.6660 | lr 4.41e-04 | (3791.65 ms | 138274 tok/s) step 12648/76294 | train loss 3.462709 | norm 0.3962 | lr 4.41e-04 | (3836.05 ms | 136674 tok/s) step 12649/76294 | train loss 3.460958 | norm 0.5643 | lr 4.41e-04 | (3789.41 ms | 138356 tok/s) step 12650/76294 | train loss 3.466567 | norm 0.6362 | lr 4.40e-04 | (3816.02 ms | 137391 tok/s) step 12651/76294 | train loss 3.376205 | norm 0.3141 | lr 4.40e-04 | (3791.29 ms | 138288 tok/s) step 12652/76294 | 
train loss 3.357071 | norm 0.3082 | lr 4.40e-04 | (3811.33 ms | 137560 tok/s) step 12653/76294 | train loss 3.405606 | norm 0.7755 | lr 4.40e-04 | (3846.41 ms | 136306 tok/s) step 12654/76294 | train loss 3.419228 | norm 0.4064 | lr 4.40e-04 | (3797.04 ms | 138078 tok/s) step 12655/76294 | train loss 3.398455 | norm 0.2983 | lr 4.40e-04 | (3823.44 ms | 137125 tok/s) step 12656/76294 | train loss 3.374450 | norm 0.4485 | lr 4.40e-04 | (3905.83 ms | 134232 tok/s) step 12657/76294 | train loss 3.429494 | norm 0.2641 | lr 4.40e-04 | (3874.35 ms | 135323 tok/s) step 12658/76294 | train loss 3.382069 | norm 0.2726 | lr 4.40e-04 | (3832.73 ms | 136792 tok/s) step 12659/76294 | train loss 3.402805 | norm 0.2519 | lr 4.40e-04 | (3804.68 ms | 137801 tok/s) step 12660/76294 | train loss 3.433306 | norm 0.2540 | lr 4.40e-04 | (3818.11 ms | 137316 tok/s) step 12661/76294 | train loss 3.403800 | norm 0.2145 | lr 4.40e-04 | (3823.86 ms | 137110 tok/s) step 12662/76294 | train loss 3.401426 | norm 0.2288 | lr 4.39e-04 | (3834.42 ms | 136732 tok/s) step 12663/76294 | train loss 3.400774 | norm 0.2544 | lr 4.39e-04 | (3795.49 ms | 138134 tok/s) step 12664/76294 | train loss 3.458399 | norm 0.3471 | lr 4.39e-04 | (3893.14 ms | 134670 tok/s) step 12665/76294 | train loss 3.409513 | norm 0.3067 | lr 4.39e-04 | (3792.29 ms | 138251 tok/s) step 12666/76294 | train loss 3.433687 | norm 0.2222 | lr 4.39e-04 | (3796.81 ms | 138086 tok/s) step 12667/76294 | train loss 3.384546 | norm 0.2878 | lr 4.39e-04 | (3818.97 ms | 137285 tok/s) step 12668/76294 | train loss 3.394791 | norm 0.3002 | lr 4.39e-04 | (3801.58 ms | 137913 tok/s) step 12669/76294 | train loss 3.429254 | norm 0.2497 | lr 4.39e-04 | (3813.96 ms | 137465 tok/s) step 12670/76294 | train loss 3.383444 | norm 0.3183 | lr 4.39e-04 | (3802.83 ms | 137868 tok/s) step 12671/76294 | train loss 3.394860 | norm 0.3774 | lr 4.39e-04 | (3857.14 ms | 135927 tok/s) step 12672/76294 | train loss 3.407391 | norm 0.2788 | lr 4.39e-04 | (3863.53 
ms | 135702 tok/s) step 12673/76294 | train loss 3.412408 | norm 0.2792 | lr 4.39e-04 | (3898.13 ms | 134497 tok/s) step 12674/76294 | train loss 3.406717 | norm 0.2569 | lr 4.38e-04 | (3773.08 ms | 138955 tok/s) step 12675/76294 | train loss 3.371156 | norm 0.3781 | lr 4.38e-04 | (4497.35 ms | 116577 tok/s) step 12676/76294 | train loss 3.398206 | norm 0.2370 | lr 4.38e-04 | (3772.22 ms | 138987 tok/s) step 12677/76294 | train loss 3.338912 | norm 0.3166 | lr 4.38e-04 | (3864.64 ms | 135663 tok/s) step 12678/76294 | train loss 3.426149 | norm 0.2931 | lr 4.38e-04 | (3768.54 ms | 139122 tok/s) step 12679/76294 | train loss 3.409737 | norm 0.2955 | lr 4.38e-04 | (3807.41 ms | 137702 tok/s) step 12680/76294 | train loss 3.475075 | norm 0.4468 | lr 4.38e-04 | (3778.22 ms | 138766 tok/s) step 12681/76294 | train loss 3.402997 | norm 0.2662 | lr 4.38e-04 | (3861.05 ms | 135789 tok/s) step 12682/76294 | train loss 3.524102 | norm 0.2189 | lr 4.38e-04 | (3779.64 ms | 138714 tok/s) step 12683/76294 | train loss 3.565011 | norm 0.2445 | lr 4.38e-04 | (3784.17 ms | 138548 tok/s) step 12684/76294 | train loss 3.371023 | norm 0.2259 | lr 4.38e-04 | (3811.71 ms | 137547 tok/s) step 12685/76294 | train loss 3.486763 | norm 0.2935 | lr 4.38e-04 | (3791.99 ms | 138262 tok/s) step 12686/76294 | train loss 3.364513 | norm 0.3682 | lr 4.38e-04 | (3790.44 ms | 138319 tok/s) step 12687/76294 | train loss 3.375753 | norm 0.2032 | lr 4.37e-04 | (3815.19 ms | 137421 tok/s) step 12688/76294 | train loss 3.458321 | norm 0.3373 | lr 4.37e-04 | (3793.91 ms | 138192 tok/s) step 12689/76294 | train loss 3.349464 | norm 0.5721 | lr 4.37e-04 | (3811.98 ms | 137537 tok/s) step 12690/76294 | train loss 3.427244 | norm 0.2107 | lr 4.37e-04 | (3871.77 ms | 135413 tok/s) step 12691/76294 | train loss 3.415392 | norm 0.7126 | lr 4.37e-04 | (3873.96 ms | 135337 tok/s) step 12692/76294 | train loss 3.389297 | norm 0.2530 | lr 4.37e-04 | (3797.05 ms | 138078 tok/s) step 12693/76294 | train loss 3.374582 | 
norm 0.3675 | lr 4.37e-04 | (3823.69 ms | 137116 tok/s) step 12694/76294 | train loss 3.437501 | norm 0.3786 | lr 4.37e-04 | (4075.51 ms | 128644 tok/s) step 12695/76294 | train loss 3.367115 | norm 0.2878 | lr 4.37e-04 | (3800.04 ms | 137969 tok/s) step 12696/76294 | train loss 3.372305 | norm 0.2492 | lr 4.37e-04 | (3827.53 ms | 136978 tok/s) step 12697/76294 | train loss 3.404922 | norm 0.2396 | lr 4.37e-04 | (3809.13 ms | 137640 tok/s) step 12698/76294 | train loss 3.430588 | norm 0.2197 | lr 4.37e-04 | (3806.46 ms | 137736 tok/s) step 12699/76294 | train loss 3.437613 | norm 0.3602 | lr 4.36e-04 | (3835.49 ms | 136694 tok/s) step 12700/76294 | train loss 3.427234 | norm 0.3951 | lr 4.36e-04 | (3806.27 ms | 137743 tok/s) step 12701/76294 | train loss 3.350244 | norm 0.2596 | lr 4.36e-04 | (3826.72 ms | 137007 tok/s) step 12702/76294 | train loss 3.445732 | norm 0.3752 | lr 4.36e-04 | (3809.62 ms | 137622 tok/s) step 12703/76294 | train loss 3.373644 | norm 0.2812 | lr 4.36e-04 | (3809.02 ms | 137644 tok/s) step 12704/76294 | train loss 3.383600 | norm 0.2647 | lr 4.36e-04 | (3830.92 ms | 136857 tok/s) step 12705/76294 | train loss 3.388144 | norm 0.4953 | lr 4.36e-04 | (3809.62 ms | 137622 tok/s) step 12706/76294 | train loss 3.428493 | norm 0.2415 | lr 4.36e-04 | (4447.55 ms | 117882 tok/s) step 12707/76294 | train loss 3.422181 | norm 0.3633 | lr 4.36e-04 | (3810.06 ms | 137606 tok/s) step 12708/76294 | train loss 3.383426 | norm 0.4109 | lr 4.36e-04 | (3814.10 ms | 137460 tok/s) step 12709/76294 | train loss 3.405600 | norm 0.4226 | lr 4.36e-04 | (3874.85 ms | 135305 tok/s) step 12710/76294 | train loss 3.360076 | norm 0.9049 | lr 4.36e-04 | (3804.85 ms | 137795 tok/s) step 12711/76294 | train loss 3.443380 | norm 0.6945 | lr 4.35e-04 | (3832.31 ms | 136807 tok/s) step 12712/76294 | train loss 3.398682 | norm 0.7879 | lr 4.35e-04 | (3882.95 ms | 135023 tok/s) step 12713/76294 | train loss 3.498955 | norm 0.5059 | lr 4.35e-04 | (3897.49 ms | 134519 tok/s) 
step 12714/76294 | train loss 3.352577 | norm 0.2469 | lr 4.35e-04 | (3802.40 ms | 137883 tok/s) step 12715/76294 | train loss 3.361896 | norm 0.6570 | lr 4.35e-04 | (3832.46 ms | 136802 tok/s) step 12716/76294 | train loss 3.379240 | norm 0.4541 | lr 4.35e-04 | (3840.56 ms | 136513 tok/s) step 12717/76294 | train loss 3.394985 | norm 0.2698 | lr 4.35e-04 | (3804.74 ms | 137798 tok/s) step 12718/76294 | train loss 3.363036 | norm 0.2860 | lr 4.35e-04 | (3837.97 ms | 136605 tok/s) step 12719/76294 | train loss 3.404619 | norm 0.3129 | lr 4.35e-04 | (3808.67 ms | 137656 tok/s) step 12720/76294 | train loss 3.407157 | norm 0.2844 | lr 4.35e-04 | (3836.36 ms | 136663 tok/s) step 12721/76294 | train loss 3.399456 | norm 0.4233 | lr 4.35e-04 | (3805.95 ms | 137755 tok/s) step 12722/76294 | train loss 3.412428 | norm 0.3950 | lr 4.35e-04 | (3892.03 ms | 134708 tok/s) step 12723/76294 | train loss 3.414210 | norm 0.4299 | lr 4.34e-04 | (3801.43 ms | 137919 tok/s) step 12724/76294 | train loss 3.413103 | norm 0.2686 | lr 4.34e-04 | (3896.01 ms | 134570 tok/s) step 12725/76294 | train loss 3.350676 | norm 0.2532 | lr 4.34e-04 | (3889.92 ms | 134781 tok/s) step 12726/76294 | train loss 3.420339 | norm 0.2188 | lr 4.34e-04 | (3817.35 ms | 137343 tok/s) step 12727/76294 | train loss 3.406910 | norm 0.2649 | lr 4.34e-04 | (3819.47 ms | 137267 tok/s) step 12728/76294 | train loss 3.396406 | norm 0.2549 | lr 4.34e-04 | (3805.93 ms | 137755 tok/s) step 12729/76294 | train loss 3.397589 | norm 0.2848 | lr 4.34e-04 | (3800.58 ms | 137949 tok/s) step 12730/76294 | train loss 3.424186 | norm 0.3107 | lr 4.34e-04 | (3853.98 ms | 136038 tok/s) step 12731/76294 | train loss 3.385988 | norm 0.4102 | lr 4.34e-04 | (3796.51 ms | 138097 tok/s) step 12732/76294 | train loss 3.275515 | norm 0.2441 | lr 4.34e-04 | (3899.21 ms | 134460 tok/s) step 12733/76294 | train loss 3.391548 | norm 0.4682 | lr 4.34e-04 | (3801.33 ms | 137922 tok/s) step 12734/76294 | train loss 3.379920 | norm 0.4350 | lr 
4.34e-04 | (3815.40 ms | 137414 tok/s) step 12735/76294 | train loss 3.468513 | norm 0.3741 | lr 4.33e-04 | (3796.64 ms | 138093 tok/s) step 12736/76294 | train loss 3.445003 | norm 0.2300 | lr 4.33e-04 | (3803.10 ms | 137858 tok/s) step 12737/76294 | train loss 3.394947 | norm 0.4356 | lr 4.33e-04 | (3822.61 ms | 137155 tok/s) step 12738/76294 | train loss 3.318342 | norm 0.2061 | lr 4.33e-04 | (3849.74 ms | 136188 tok/s) step 12739/76294 | train loss 3.375902 | norm 0.3862 | lr 4.33e-04 | (3798.11 ms | 138039 tok/s) step 12740/76294 | train loss 3.377997 | norm 0.2478 | lr 4.33e-04 | (3823.39 ms | 137126 tok/s) step 12741/76294 | train loss 3.360662 | norm 0.3968 | lr 4.33e-04 | (3805.81 ms | 137760 tok/s) step 12742/76294 | train loss 3.397342 | norm 0.5916 | lr 4.33e-04 | (3826.27 ms | 137023 tok/s) step 12743/76294 | train loss 3.424787 | norm 0.3023 | lr 4.33e-04 | (3800.37 ms | 137957 tok/s) step 12744/76294 | train loss 3.350616 | norm 0.5158 | lr 4.33e-04 | (3822.87 ms | 137145 tok/s) step 12745/76294 | train loss 3.352398 | norm 0.3901 | lr 4.33e-04 | (3796.29 ms | 138105 tok/s) step 12746/76294 | train loss 3.371379 | norm 0.3245 | lr 4.33e-04 | (3803.94 ms | 137828 tok/s) step 12747/76294 | train loss 3.397965 | norm 0.4915 | lr 4.33e-04 | (3826.48 ms | 137016 tok/s) step 12748/76294 | train loss 3.370119 | norm 0.2728 | lr 4.32e-04 | (3802.92 ms | 137865 tok/s) step 12749/76294 | train loss 3.432745 | norm 0.3484 | lr 4.32e-04 | (3804.12 ms | 137821 tok/s) step 12750/76294 | train loss 3.453189 | norm 0.3851 | lr 4.32e-04 | (3804.80 ms | 137797 tok/s) val loss: 3.382409 saving model checkpoint to ./results/gpt2-124M-gqa/step_12750.pth step 12751/76294 | train loss 3.421241 | norm 0.2133 | lr 4.32e-04 | (3904.63 ms | 134273 tok/s) step 12752/76294 | train loss 3.409089 | norm 0.2512 | lr 4.32e-04 | (3816.24 ms | 137383 tok/s) step 12753/76294 | train loss 3.346188 | norm 0.2700 | lr 4.32e-04 | (3885.36 ms | 134939 tok/s) step 12754/76294 | train loss 
3.303831 | norm 0.3951 | lr 4.32e-04 | (3803.30 ms | 137851 tok/s) step 12755/76294 | train loss 3.462157 | norm 0.3427 | lr 4.32e-04 | (3856.09 ms | 135964 tok/s) step 12756/76294 | train loss 3.348251 | norm 0.2829 | lr 4.32e-04 | (3808.92 ms | 137647 tok/s) step 12757/76294 | train loss 3.397135 | norm 0.2838 | lr 4.32e-04 | (3803.07 ms | 137859 tok/s) step 12758/76294 | train loss 3.372856 | norm 0.2682 | lr 4.32e-04 | (3826.74 ms | 137006 tok/s) step 12759/76294 | train loss 3.396214 | norm 0.5332 | lr 4.32e-04 | (3802.64 ms | 137875 tok/s) step 12760/76294 | train loss 3.347643 | norm 0.6187 | lr 4.31e-04 | (3888.47 ms | 134832 tok/s) step 12761/76294 | train loss 3.363564 | norm 0.5594 | lr 4.31e-04 | (3803.97 ms | 137827 tok/s) step 12762/76294 | train loss 3.359599 | norm 0.3885 | lr 4.31e-04 | (3825.96 ms | 137034 tok/s) step 12763/76294 | train loss 3.372906 | norm 0.3949 | lr 4.31e-04 | (3804.13 ms | 137821 tok/s) step 12764/76294 | train loss 3.385840 | norm 0.5215 | lr 4.31e-04 | (3802.71 ms | 137872 tok/s) step 12765/76294 | train loss 3.378867 | norm 0.3137 | lr 4.31e-04 | (3844.02 ms | 136390 tok/s) step 12766/76294 | train loss 3.344388 | norm 0.4393 | lr 4.31e-04 | (3802.42 ms | 137883 tok/s) step 12767/76294 | train loss 3.371434 | norm 0.2907 | lr 4.31e-04 | (3853.74 ms | 136047 tok/s) step 12768/76294 | train loss 3.413288 | norm 0.3016 | lr 4.31e-04 | (3805.80 ms | 137760 tok/s) step 12769/76294 | train loss 3.378740 | norm 0.2506 | lr 4.31e-04 | (3924.96 ms | 133578 tok/s) step 12770/76294 | train loss 3.319810 | norm 0.4408 | lr 4.31e-04 | (3803.77 ms | 137834 tok/s) step 12771/76294 | train loss 3.356985 | norm 0.6009 | lr 4.31e-04 | (3829.74 ms | 136899 tok/s) step 12772/76294 | train loss 3.512203 | norm 0.4909 | lr 4.30e-04 | (3806.25 ms | 137744 tok/s) step 12773/76294 | train loss 3.378385 | norm 0.2018 | lr 4.30e-04 | (3861.36 ms | 135778 tok/s) step 12774/76294 | train loss 3.381265 | norm 0.4731 | lr 4.30e-04 | (3881.34 ms | 135079 
tok/s) step 12775/76294 | train loss 3.416573 | norm 0.4474 | lr 4.30e-04 | (3798.09 ms | 138040 tok/s) step 12776/76294 | train loss 3.420677 | norm 0.3614 | lr 4.30e-04 | (3874.56 ms | 135315 tok/s) step 12777/76294 | train loss 3.374655 | norm 0.2361 | lr 4.30e-04 | (3806.61 ms | 137731 tok/s) step 12778/76294 | train loss 3.334280 | norm 0.3010 | lr 4.30e-04 | (3861.19 ms | 135784 tok/s) step 12779/76294 | train loss 3.417092 | norm 0.4772 | lr 4.30e-04 | (3811.14 ms | 137567 tok/s) step 12780/76294 | train loss 3.446978 | norm 0.3042 | lr 4.30e-04 | (3808.03 ms | 137680 tok/s) step 12781/76294 | train loss 3.332386 | norm 0.4466 | lr 4.30e-04 | (3832.13 ms | 136814 tok/s) step 12782/76294 | train loss 3.332763 | norm 0.2855 | lr 4.30e-04 | (3894.86 ms | 134610 tok/s) step 12783/76294 | train loss 3.386096 | norm 0.2195 | lr 4.30e-04 | (3807.26 ms | 137707 tok/s) step 12784/76294 | train loss 3.419542 | norm 0.4072 | lr 4.29e-04 | (3839.88 ms | 136538 tok/s) step 12785/76294 | train loss 3.428834 | norm 0.3192 | lr 4.29e-04 | (3812.05 ms | 137534 tok/s) step 12786/76294 | train loss 3.390056 | norm 0.3438 | lr 4.29e-04 | (3835.71 ms | 136686 tok/s) step 12787/76294 | train loss 3.377861 | norm 0.3678 | lr 4.29e-04 | (3804.71 ms | 137800 tok/s) step 12788/76294 | train loss 3.503880 | norm 0.3590 | lr 4.29e-04 | (3811.48 ms | 137555 tok/s) step 12789/76294 | train loss 3.437183 | norm 0.4704 | lr 4.29e-04 | (3891.40 ms | 134730 tok/s) step 12790/76294 | train loss 3.410031 | norm 0.2563 | lr 4.29e-04 | (3803.87 ms | 137830 tok/s) step 12791/76294 | train loss 3.345536 | norm 0.3467 | lr 4.29e-04 | (4358.76 ms | 120284 tok/s) step 12792/76294 | train loss 3.404090 | norm 0.6673 | lr 4.29e-04 | (3796.62 ms | 138093 tok/s) step 12793/76294 | train loss 3.318247 | norm 0.2833 | lr 4.29e-04 | (3831.39 ms | 136840 tok/s) step 12794/76294 | train loss 3.464441 | norm 0.6061 | lr 4.29e-04 | (3800.67 ms | 137946 tok/s) step 12795/76294 | train loss 3.364069 | norm 0.3227 
| lr 4.29e-04 | (3930.69 ms | 133383 tok/s) step 12796/76294 | train loss 3.366396 | norm 0.3469 | lr 4.29e-04 | (3801.44 ms | 137918 tok/s) step 12797/76294 | train loss 3.332894 | norm 0.3810 | lr 4.28e-04 | (3856.93 ms | 135934 tok/s) step 12798/76294 | train loss 3.393649 | norm 0.4159 | lr 4.28e-04 | (3797.06 ms | 138077 tok/s) step 12799/76294 | train loss 3.396406 | norm 0.2763 | lr 4.28e-04 | (3808.02 ms | 137680 tok/s) step 12800/76294 | train loss 3.394097 | norm 0.2947 | lr 4.28e-04 | (3840.34 ms | 136521 tok/s) step 12801/76294 | train loss 3.394334 | norm 0.2774 | lr 4.28e-04 | (3799.09 ms | 138004 tok/s) step 12802/76294 | train loss 3.331662 | norm 0.2573 | lr 4.28e-04 | (3857.87 ms | 135901 tok/s) step 12803/76294 | train loss 3.419224 | norm 0.3522 | lr 4.28e-04 | (3875.22 ms | 135292 tok/s) step 12804/76294 | train loss 3.407457 | norm 0.3658 | lr 4.28e-04 | (3794.83 ms | 138159 tok/s) step 12805/76294 | train loss 3.324853 | norm 0.3275 | lr 4.28e-04 | (3816.71 ms | 137367 tok/s) step 12806/76294 | train loss 3.376911 | norm 0.6362 | lr 4.28e-04 | (3795.82 ms | 138123 tok/s) step 12807/76294 | train loss 3.438971 | norm 0.5136 | lr 4.28e-04 | (3825.08 ms | 137066 tok/s) step 12808/76294 | train loss 3.420377 | norm 0.4127 | lr 4.28e-04 | (3800.35 ms | 137958 tok/s) step 12809/76294 | train loss 3.391732 | norm 0.6360 | lr 4.27e-04 | (3911.74 ms | 134029 tok/s) step 12810/76294 | train loss 3.392690 | norm 0.4741 | lr 4.27e-04 | (3792.86 ms | 138230 tok/s) step 12811/76294 | train loss 3.462955 | norm 0.3869 | lr 4.27e-04 | (3806.77 ms | 137725 tok/s) step 12812/76294 | train loss 3.381255 | norm 0.4585 | lr 4.27e-04 | (3822.84 ms | 137146 tok/s) step 12813/76294 | train loss 3.389815 | norm 0.2535 | lr 4.27e-04 | (3840.63 ms | 136511 tok/s) step 12814/76294 | train loss 3.370424 | norm 0.2797 | lr 4.27e-04 | (3871.45 ms | 135424 tok/s) step 12815/76294 | train loss 3.376763 | norm 0.2546 | lr 4.27e-04 | (3800.66 ms | 137947 tok/s) step 
12816/76294 | train loss 3.414868 | norm 0.2479 | lr 4.27e-04 | (3929.19 ms | 133434 tok/s) step 12817/76294 | train loss 3.380358 | norm 0.2140 | lr 4.27e-04 | (3795.30 ms | 138142 tok/s) step 12818/76294 | train loss 3.371498 | norm 0.2518 | lr 4.27e-04 | (3866.74 ms | 135589 tok/s) step 12819/76294 | train loss 3.365632 | norm 0.2899 | lr 4.27e-04 | (3805.77 ms | 137761 tok/s) step 12820/76294 | train loss 3.346956 | norm 0.3406 | lr 4.27e-04 | (3909.44 ms | 134108 tok/s) step 12821/76294 | train loss 3.421488 | norm 0.4486 | lr 4.26e-04 | (3787.73 ms | 138418 tok/s) step 12822/76294 | train loss 3.422239 | norm 0.4205 | lr 4.26e-04 | (3795.11 ms | 138148 tok/s) step 12823/76294 | train loss 3.416407 | norm 0.5088 | lr 4.26e-04 | (3791.44 ms | 138282 tok/s) step 12824/76294 | train loss 3.397288 | norm 0.3291 | lr 4.26e-04 | (3829.09 ms | 136922 tok/s) step 12825/76294 | train loss 3.391629 | norm 0.5072 | lr 4.26e-04 | (3819.39 ms | 137270 tok/s) step 12826/76294 | train loss 3.360649 | norm 0.2724 | lr 4.26e-04 | (3853.94 ms | 136040 tok/s) step 12827/76294 | train loss 3.369659 | norm 0.3351 | lr 4.26e-04 | (3818.31 ms | 137309 tok/s) step 12828/76294 | train loss 3.391728 | norm 0.4679 | lr 4.26e-04 | (3908.81 ms | 134130 tok/s) step 12829/76294 | train loss 3.387370 | norm 0.3022 | lr 4.26e-04 | (3791.38 ms | 138284 tok/s) step 12830/76294 | train loss 3.446622 | norm 0.2300 | lr 4.26e-04 | (3805.88 ms | 137757 tok/s) step 12831/76294 | train loss 3.380433 | norm 0.2406 | lr 4.26e-04 | (3815.63 ms | 137405 tok/s) step 12832/76294 | train loss 3.378530 | norm 0.3179 | lr 4.26e-04 | (3806.86 ms | 137722 tok/s) step 12833/76294 | train loss 3.392093 | norm 0.3163 | lr 4.26e-04 | (3797.28 ms | 138069 tok/s) step 12834/76294 | train loss 3.471398 | norm 0.4614 | lr 4.25e-04 | (3847.46 ms | 136269 tok/s) step 12835/76294 | train loss 3.370681 | norm 0.2143 | lr 4.25e-04 | (3793.42 ms | 138210 tok/s) step 12836/76294 | train loss 3.404458 | norm 0.3788 | lr 
4.25e-04 | (3910.25 ms | 134080 tok/s) step 12837/76294 | train loss 3.435214 | norm 0.4947 | lr 4.25e-04 | (3792.91 ms | 138229 tok/s) step 12838/76294 | train loss 3.367496 | norm 0.3935 | lr 4.25e-04 | (22684.53 ms | 23112 tok/s) step 12839/76294 | train loss 3.392397 | norm 0.4289 | lr 4.25e-04 | (3902.83 ms | 134335 tok/s) step 12840/76294 | train loss 3.404398 | norm 0.2591 | lr 4.25e-04 | (3825.44 ms | 137053 tok/s) step 12841/76294 | train loss 3.420905 | norm 0.3303 | lr 4.25e-04 | (3767.75 ms | 139152 tok/s) step 12842/76294 | train loss 3.360259 | norm 0.2005 | lr 4.25e-04 | (3777.28 ms | 138800 tok/s) step 12843/76294 | train loss 3.431752 | norm 0.2360 | lr 4.25e-04 | (3802.88 ms | 137866 tok/s) step 12844/76294 | train loss 3.325150 | norm 0.2196 | lr 4.25e-04 | (3781.64 ms | 138640 tok/s) step 12845/76294 | train loss 3.715159 | norm 0.2438 | lr 4.25e-04 | (3801.11 ms | 137930 tok/s) step 12846/76294 | train loss 3.377269 | norm 0.3201 | lr 4.24e-04 | (4641.65 ms | 112953 tok/s) step 12847/76294 | train loss 3.374822 | norm 0.2084 | lr 4.24e-04 | (3783.79 ms | 138561 tok/s) step 12848/76294 | train loss 3.424229 | norm 0.2695 | lr 4.24e-04 | (3821.12 ms | 137208 tok/s) step 12849/76294 | train loss 3.385609 | norm 0.2919 | lr 4.24e-04 | (3805.59 ms | 137768 tok/s) step 12850/76294 | train loss 3.365533 | norm 0.4450 | lr 4.24e-04 | (3798.01 ms | 138043 tok/s) step 12851/76294 | train loss 3.477774 | norm 0.3351 | lr 4.24e-04 | (3802.77 ms | 137870 tok/s) step 12852/76294 | train loss 3.380962 | norm 0.1998 | lr 4.24e-04 | (3797.68 ms | 138055 tok/s) step 12853/76294 | train loss 3.322442 | norm 0.4513 | lr 4.24e-04 | (3807.82 ms | 137687 tok/s) step 12854/76294 | train loss 3.377479 | norm 0.3066 | lr 4.24e-04 | (3803.72 ms | 137835 tok/s) step 12855/76294 | train loss 3.371603 | norm 0.2204 | lr 4.24e-04 | (3800.67 ms | 137946 tok/s) step 12856/76294 | train loss 3.392326 | norm 0.3322 | lr 4.24e-04 | (3836.93 ms | 136642 tok/s) step 12857/76294 | 
train loss 3.351106 | norm 0.3434 | lr 4.24e-04 | (3802.91 ms | 137865 tok/s) step 12858/76294 | train loss 3.392912 | norm 0.4224 | lr 4.23e-04 | (3926.94 ms | 133511 tok/s) step 12859/76294 | train loss 3.401457 | norm 0.5168 | lr 4.23e-04 | (3873.99 ms | 135335 tok/s) step 12860/76294 | train loss 3.360895 | norm 0.4930 | lr 4.23e-04 | (3804.54 ms | 137806 tok/s) step 12861/76294 | train loss 3.385345 | norm 0.4920 | lr 4.23e-04 | (3865.55 ms | 135631 tok/s) step 12862/76294 | train loss 3.396984 | norm 0.3508 | lr 4.23e-04 | (3799.02 ms | 138006 tok/s) step 12863/76294 | train loss 3.349250 | norm 0.2969 | lr 4.23e-04 | (3806.18 ms | 137747 tok/s) step 12864/76294 | train loss 3.346833 | norm 0.6070 | lr 4.23e-04 | (3824.98 ms | 137069 tok/s) step 12865/76294 | train loss 3.332977 | norm 0.5574 | lr 4.23e-04 | (3837.54 ms | 136621 tok/s) step 12866/76294 | train loss 3.417938 | norm 0.3355 | lr 4.23e-04 | (3816.74 ms | 137365 tok/s) step 12867/76294 | train loss 3.357311 | norm 0.3430 | lr 4.23e-04 | (3812.36 ms | 137523 tok/s) step 12868/76294 | train loss 3.397639 | norm 0.5395 | lr 4.23e-04 | (3898.98 ms | 134468 tok/s) step 12869/76294 | train loss 3.419992 | norm 0.4160 | lr 4.23e-04 | (3799.82 ms | 137977 tok/s) step 12870/76294 | train loss 3.395434 | norm 0.5196 | lr 4.23e-04 | (4083.53 ms | 128391 tok/s) step 12871/76294 | train loss 3.391903 | norm 0.2809 | lr 4.22e-04 | (3813.61 ms | 137478 tok/s) step 12872/76294 | train loss 3.332189 | norm 0.2811 | lr 4.22e-04 | (4035.71 ms | 129912 tok/s) step 12873/76294 | train loss 3.390530 | norm 0.2508 | lr 4.22e-04 | (3804.68 ms | 137801 tok/s) step 12874/76294 | train loss 3.438400 | norm 0.4284 | lr 4.22e-04 | (3936.66 ms | 133181 tok/s) step 12875/76294 | train loss 3.422388 | norm 0.3100 | lr 4.22e-04 | (3832.26 ms | 136809 tok/s) step 12876/76294 | train loss 3.425786 | norm 0.3669 | lr 4.22e-04 | (3803.08 ms | 137859 tok/s) step 12877/76294 | train loss 3.364504 | norm 0.2541 | lr 4.22e-04 | (5187.14 
ms | 101075 tok/s) step 12878/76294 | train loss 3.423503 | norm 0.3065 | lr 4.22e-04 | (3800.35 ms | 137958 tok/s) step 12879/76294 | train loss 3.367616 | norm 0.1991 | lr 4.22e-04 | (3831.97 ms | 136819 tok/s) step 12880/76294 | train loss 3.360778 | norm 0.3663 | lr 4.22e-04 | (3800.68 ms | 137946 tok/s) step 12881/76294 | train loss 3.386259 | norm 0.3609 | lr 4.22e-04 | (3823.80 ms | 137112 tok/s) step 12882/76294 | train loss 3.358415 | norm 0.3641 | lr 4.22e-04 | (3815.68 ms | 137403 tok/s) step 12883/76294 | train loss 3.616326 | norm 0.3543 | lr 4.21e-04 | (3813.95 ms | 137466 tok/s) step 12884/76294 | train loss 3.472409 | norm 0.3569 | lr 4.21e-04 | (3805.77 ms | 137761 tok/s) step 12885/76294 | train loss 3.398329 | norm 0.3142 | lr 4.21e-04 | (4200.62 ms | 124812 tok/s) step 12886/76294 | train loss 3.391917 | norm 0.3212 | lr 4.21e-04 | (3807.57 ms | 137696 tok/s) step 12887/76294 | train loss 3.339027 | norm 0.4757 | lr 4.21e-04 | (3859.25 ms | 135852 tok/s) step 12888/76294 | train loss 3.461232 | norm 0.6306 | lr 4.21e-04 | (3802.55 ms | 137878 tok/s) step 12889/76294 | train loss 3.382780 | norm 0.2973 | lr 4.21e-04 | (3822.90 ms | 137144 tok/s) step 12890/76294 | train loss 3.397951 | norm 0.3577 | lr 4.21e-04 | (3831.50 ms | 136836 tok/s) step 12891/76294 | train loss 3.356167 | norm 0.2732 | lr 4.21e-04 | (3807.73 ms | 137690 tok/s) step 12892/76294 | train loss 3.437167 | norm 0.3987 | lr 4.21e-04 | (3840.42 ms | 136518 tok/s) step 12893/76294 | train loss 3.255811 | norm 0.4800 | lr 4.21e-04 | (3840.86 ms | 136503 tok/s) step 12894/76294 | train loss 3.357779 | norm 0.3488 | lr 4.21e-04 | (3808.18 ms | 137674 tok/s) step 12895/76294 | train loss 3.393681 | norm 0.9623 | lr 4.20e-04 | (3841.24 ms | 136489 tok/s) step 12896/76294 | train loss 3.332721 | norm 1.2151 | lr 4.20e-04 | (3807.35 ms | 137704 tok/s) step 12897/76294 | train loss 3.385399 | norm 0.5913 | lr 4.20e-04 | (3832.40 ms | 136804 tok/s) step 12898/76294 | train loss 3.427585 | 
norm 0.7747 | lr 4.20e-04 | (3831.47 ms | 136837 tok/s) step 12899/76294 | train loss 3.363105 | norm 0.3050 | lr 4.20e-04 | (3856.95 ms | 135933 tok/s) step 12900/76294 | train loss 3.401021 | norm 0.2669 | lr 4.20e-04 | (3805.09 ms | 137786 tok/s) step 12901/76294 | train loss 3.533089 | norm 0.3640 | lr 4.20e-04 | (3837.32 ms | 136629 tok/s) step 12902/76294 | train loss 3.321473 | norm 0.3000 | lr 4.20e-04 | (3830.35 ms | 136877 tok/s) step 12903/76294 | train loss 3.349797 | norm 0.2903 | lr 4.20e-04 | (3907.87 ms | 134162 tok/s) step 12904/76294 | train loss 3.396633 | norm 0.3936 | lr 4.20e-04 | (3810.68 ms | 137584 tok/s) step 12905/76294 | train loss 3.385353 | norm 0.4606 | lr 4.20e-04 | (3880.72 ms | 135101 tok/s) step 12906/76294 | train loss 3.424185 | norm 0.3884 | lr 4.20e-04 | (3799.39 ms | 137993 tok/s) step 12907/76294 | train loss 3.444854 | norm 0.8577 | lr 4.20e-04 | (3895.81 ms | 134577 tok/s) step 12908/76294 | train loss 3.420964 | norm 0.7236 | lr 4.19e-04 | (3876.48 ms | 135249 tok/s) step 12909/76294 | train loss 3.424971 | norm 0.3654 | lr 4.19e-04 | (5977.05 ms | 87717 tok/s) step 12910/76294 | train loss 3.471790 | norm 0.5110 | lr 4.19e-04 | (3791.43 ms | 138282 tok/s) step 12911/76294 | train loss 3.365997 | norm 0.2671 | lr 4.19e-04 | (3818.33 ms | 137308 tok/s) step 12912/76294 | train loss 3.383891 | norm 0.3595 | lr 4.19e-04 | (3801.27 ms | 137924 tok/s) step 12913/76294 | train loss 3.400642 | norm 0.3366 | lr 4.19e-04 | (3800.85 ms | 137940 tok/s) step 12914/76294 | train loss 3.366903 | norm 0.2632 | lr 4.19e-04 | (3804.48 ms | 137808 tok/s) step 12915/76294 | train loss 3.475636 | norm 0.3944 | lr 4.19e-04 | (3820.95 ms | 137214 tok/s) step 12916/76294 | train loss 3.366219 | norm 0.3990 | lr 4.19e-04 | (3803.33 ms | 137850 tok/s) step 12917/76294 | train loss 3.351728 | norm 0.9380 | lr 4.19e-04 | (3806.81 ms | 137724 tok/s) step 12918/76294 | train loss 3.456664 | norm 1.1156 | lr 4.19e-04 | (3801.53 ms | 137915 tok/s) step 
12919/76294 | train loss 3.388891 | norm 0.3168 | lr 4.19e-04 | (3803.68 ms | 137837 tok/s) step 12920/76294 | train loss 3.388912 | norm 0.5004 | lr 4.18e-04 | (3831.47 ms | 136837 tok/s) step 12921/76294 | train loss 3.442745 | norm 0.5462 | lr 4.18e-04 | (3824.47 ms | 137088 tok/s) step 12922/76294 | train loss 3.359985 | norm 0.6322 | lr 4.18e-04 | (3834.59 ms | 136726 tok/s) step 12923/76294 | train loss 3.339441 | norm 0.2618 | lr 4.18e-04 | (3981.13 ms | 131693 tok/s) step 12924/76294 | train loss 3.361405 | norm 0.8012 | lr 4.18e-04 | (3796.15 ms | 138110 tok/s) step 12925/76294 | train loss 3.443113 | norm 0.8751 | lr 4.18e-04 | (3819.10 ms | 137280 tok/s) step 12926/76294 | train loss 3.295461 | norm 0.2384 | lr 4.18e-04 | (3805.04 ms | 137788 tok/s) step 12927/76294 | train loss 3.421190 | norm 0.4854 | lr 4.18e-04 | (3816.06 ms | 137390 tok/s) step 12928/76294 | train loss 3.343100 | norm 0.4133 | lr 4.18e-04 | (3802.27 ms | 137888 tok/s) step 12929/76294 | train loss 3.297869 | norm 0.2665 | lr 4.18e-04 | (3827.66 ms | 136974 tok/s) step 12930/76294 | train loss 3.401006 | norm 0.2113 | lr 4.18e-04 | (3803.89 ms | 137830 tok/s) step 12931/76294 | train loss 3.355112 | norm 0.2985 | lr 4.18e-04 | (3823.77 ms | 137113 tok/s) step 12932/76294 | train loss 3.368882 | norm 0.3698 | lr 4.18e-04 | (3808.90 ms | 137648 tok/s) step 12933/76294 | train loss 3.398187 | norm 0.4308 | lr 4.17e-04 | (3812.54 ms | 137517 tok/s) step 12934/76294 | train loss 3.356560 | norm 0.2609 | lr 4.17e-04 | (3806.13 ms | 137748 tok/s) step 12935/76294 | train loss 3.353746 | norm 0.4391 | lr 4.17e-04 | (3814.35 ms | 137452 tok/s) step 12936/76294 | train loss 3.385113 | norm 0.4350 | lr 4.17e-04 | (3830.60 ms | 136868 tok/s) step 12937/76294 | train loss 3.393318 | norm 0.2752 | lr 4.17e-04 | (3811.49 ms | 137555 tok/s) step 12938/76294 | train loss 3.384979 | norm 0.3504 | lr 4.17e-04 | (3856.02 ms | 135966 tok/s) step 12939/76294 | train loss 3.465077 | norm 0.4015 | lr 
4.17e-04 | (3811.91 ms | 137539 tok/s) step 12940/76294 | train loss 3.385549 | norm 0.2898 | lr 4.17e-04 | (3809.08 ms | 137642 tok/s) step 12941/76294 | train loss 3.411491 | norm 0.2909 | lr 4.17e-04 | (3806.59 ms | 137732 tok/s) step 12942/76294 | train loss 3.402877 | norm 0.2388 | lr 4.17e-04 | (3838.35 ms | 136592 tok/s) step 12943/76294 | train loss 3.406607 | norm 0.2442 | lr 4.17e-04 | (3807.75 ms | 137690 tok/s) step 12944/76294 | train loss 3.347047 | norm 0.2283 | lr 4.17e-04 | (3831.57 ms | 136834 tok/s) step 12945/76294 | train loss 3.422709 | norm 0.2761 | lr 4.16e-04 | (3828.80 ms | 136933 tok/s) step 12946/76294 | train loss 3.362687 | norm 0.3331 | lr 4.16e-04 | (3896.68 ms | 134547 tok/s) step 12947/76294 | train loss 3.399454 | norm 0.3398 | lr 4.16e-04 | (3801.92 ms | 137901 tok/s) step 12948/76294 | train loss 3.342849 | norm 0.8942 | lr 4.16e-04 | (3809.79 ms | 137616 tok/s) step 12949/76294 | train loss 3.368050 | norm 0.3257 | lr 4.16e-04 | (3830.43 ms | 136875 tok/s) step 12950/76294 | train loss 3.367256 | norm 0.5667 | lr 4.16e-04 | (3835.65 ms | 136688 tok/s) step 12951/76294 | train loss 3.412192 | norm 0.5540 | lr 4.16e-04 | (3831.23 ms | 136846 tok/s) step 12952/76294 | train loss 3.344282 | norm 0.2632 | lr 4.16e-04 | (3807.31 ms | 137706 tok/s) step 12953/76294 | train loss 3.476080 | norm 0.3144 | lr 4.16e-04 | (3810.25 ms | 137599 tok/s) step 12954/76294 | train loss 3.428995 | norm 0.3047 | lr 4.16e-04 | (3807.03 ms | 137716 tok/s) step 12955/76294 | train loss 3.409238 | norm 0.2886 | lr 4.16e-04 | (3800.90 ms | 137938 tok/s) step 12956/76294 | train loss 3.523066 | norm 0.2555 | lr 4.16e-04 | (3864.08 ms | 135683 tok/s) step 12957/76294 | train loss 3.449953 | norm 0.3166 | lr 4.16e-04 | (3807.45 ms | 137701 tok/s) step 12958/76294 | train loss 3.405130 | norm 0.2522 | lr 4.15e-04 | (3846.15 ms | 136315 tok/s) step 12959/76294 | train loss 3.369782 | norm 0.2617 | lr 4.15e-04 | (3806.42 ms | 137738 tok/s) step 12960/76294 | 
train loss 3.363616 | norm 0.2049 | lr 4.15e-04 | (3810.72 ms | 137583 tok/s) step 12961/76294 | train loss 3.373364 | norm 0.3910 | lr 4.15e-04 | (3835.83 ms | 136682 tok/s) step 12962/76294 | train loss 3.321428 | norm 0.2695 | lr 4.15e-04 | (3809.70 ms | 137619 tok/s) step 12963/76294 | train loss 3.402159 | norm 0.2657 | lr 4.15e-04 | (3835.00 ms | 136711 tok/s) step 12964/76294 | train loss 3.359368 | norm 0.3302 | lr 4.15e-04 | (3816.41 ms | 137377 tok/s) step 12965/76294 | train loss 3.365850 | norm 0.2809 | lr 4.15e-04 | (3814.57 ms | 137443 tok/s) step 12966/76294 | train loss 3.335655 | norm 0.3632 | lr 4.15e-04 | (3803.70 ms | 137836 tok/s) step 12967/76294 | train loss 3.406790 | norm 0.3989 | lr 4.15e-04 | (3802.84 ms | 137867 tok/s) step 12968/76294 | train loss 3.401022 | norm 0.4720 | lr 4.15e-04 | (3910.81 ms | 134061 tok/s) step 12969/76294 | train loss 3.422138 | norm 0.5777 | lr 4.15e-04 | (3803.96 ms | 137827 tok/s) step 12970/76294 | train loss 3.320586 | norm 0.3811 | lr 4.14e-04 | (3821.71 ms | 137187 tok/s) step 12971/76294 | train loss 3.393972 | norm 0.2868 | lr 4.14e-04 | (3801.86 ms | 137903 tok/s) step 12972/76294 | train loss 3.375622 | norm 0.2549 | lr 4.14e-04 | (3833.34 ms | 136771 tok/s) step 12973/76294 | train loss 3.362145 | norm 0.2330 | lr 4.14e-04 | (3806.03 ms | 137752 tok/s) step 12974/76294 | train loss 3.319045 | norm 0.3007 | lr 4.14e-04 | (3808.34 ms | 137668 tok/s) step 12975/76294 | train loss 3.392657 | norm 0.4817 | lr 4.14e-04 | (3837.44 ms | 136624 tok/s) step 12976/76294 | train loss 3.433814 | norm 0.2780 | lr 4.14e-04 | (3864.82 ms | 135656 tok/s) step 12977/76294 | train loss 3.348437 | norm 0.3195 | lr 4.14e-04 | (3868.83 ms | 135516 tok/s) step 12978/76294 | train loss 3.331152 | norm 0.3432 | lr 4.14e-04 | (3817.75 ms | 137329 tok/s) step 12979/76294 | train loss 3.411840 | norm 0.2638 | lr 4.14e-04 | (3817.20 ms | 137349 tok/s) step 12980/76294 | train loss 3.387550 | norm 0.2389 | lr 4.14e-04 | (3826.26 
ms | 137024 tok/s) step 12981/76294 | train loss 3.336072 | norm 0.2604 | lr 4.14e-04 | (3806.32 ms | 137741 tok/s) step 12982/76294 | train loss 3.413084 | norm 0.5653 | lr 4.14e-04 | (3820.74 ms | 137222 tok/s) step 12983/76294 | train loss 3.438538 | norm 0.4258 | lr 4.13e-04 | (3807.90 ms | 137684 tok/s) step 12984/76294 | train loss 3.408537 | norm 0.5127 | lr 4.13e-04 | (3812.30 ms | 137525 tok/s) step 12985/76294 | train loss 3.451904 | norm 0.5648 | lr 4.13e-04 | (3845.42 ms | 136341 tok/s) step 12986/76294 | train loss 3.357322 | norm 0.4072 | lr 4.13e-04 | (3810.05 ms | 137607 tok/s) step 12987/76294 | train loss 3.384029 | norm 0.3927 | lr 4.13e-04 | (3806.29 ms | 137743 tok/s) step 12988/76294 | train loss 3.421258 | norm 0.3311 | lr 4.13e-04 | (3815.16 ms | 137422 tok/s) step 12989/76294 | train loss 3.358952 | norm 0.2979 | lr 4.13e-04 | (3802.33 ms | 137886 tok/s) step 12990/76294 | train loss 3.367201 | norm 0.2567 | lr 4.13e-04 | (4004.12 ms | 130937 tok/s) step 12991/76294 | train loss 3.428140 | norm 0.5083 | lr 4.13e-04 | (3845.46 ms | 136340 tok/s) step 12992/76294 | train loss 3.355639 | norm 0.4011 | lr 4.13e-04 | (3804.35 ms | 137813 tok/s) step 12993/76294 | train loss 3.383466 | norm 0.5273 | lr 4.13e-04 | (3813.73 ms | 137474 tok/s) step 12994/76294 | train loss 3.366286 | norm 0.8662 | lr 4.13e-04 | (3800.76 ms | 137943 tok/s) step 12995/76294 | train loss 3.402395 | norm 0.4869 | lr 4.12e-04 | (3834.03 ms | 136746 tok/s) step 12996/76294 | train loss 3.400136 | norm 0.9908 | lr 4.12e-04 | (3804.97 ms | 137790 tok/s) step 12997/76294 | train loss 3.432468 | norm 0.3925 | lr 4.12e-04 | (3820.75 ms | 137221 tok/s) step 12998/76294 | train loss 3.357785 | norm 0.3740 | lr 4.12e-04 | (3805.16 ms | 137784 tok/s) step 12999/76294 | train loss 3.426093 | norm 0.2766 | lr 4.12e-04 | (3807.80 ms | 137688 tok/s) step 13000/76294 | train loss 3.355124 | norm 0.3352 | lr 4.12e-04 | (3826.39 ms | 137019 tok/s) val loss: 3.383636 saving model 
checkpoint to ./results/gpt2-124M-gqa/step_13000.pth step 13001/76294 | train loss 3.373094 | norm 0.2747 | lr 4.12e-04 | (3926.11 ms | 133539 tok/s) step 13002/76294 | train loss 3.314523 | norm 0.4347 | lr 4.12e-04 | (3808.82 ms | 137651 tok/s) step 13003/76294 | train loss 3.394994 | norm 0.5853 | lr 4.12e-04 | (3790.18 ms | 138328 tok/s) step 13004/76294 | train loss 3.350930 | norm 0.4064 | lr 4.12e-04 | (3791.43 ms | 138282 tok/s) step 13005/76294 | train loss 3.386227 | norm 0.5527 | lr 4.12e-04 | (3803.72 ms | 137835 tok/s) step 13006/76294 | train loss 3.450859 | norm 0.4379 | lr 4.12e-04 | (3803.13 ms | 137857 tok/s) step 13007/76294 | train loss 3.344928 | norm 0.2255 | lr 4.12e-04 | (3804.70 ms | 137800 tok/s) step 13008/76294 | train loss 3.339447 | norm 0.2569 | lr 4.11e-04 | (3806.37 ms | 137740 tok/s) step 13009/76294 | train loss 3.402581 | norm 0.2969 | lr 4.11e-04 | (3898.30 ms | 134491 tok/s) step 13010/76294 | train loss 3.405570 | norm 0.3152 | lr 4.11e-04 | (3792.06 ms | 138259 tok/s) step 13011/76294 | train loss 3.365550 | norm 0.4309 | lr 4.11e-04 | (3884.70 ms | 134962 tok/s) step 13012/76294 | train loss 3.396596 | norm 0.3357 | lr 4.11e-04 | (3796.84 ms | 138085 tok/s) step 13013/76294 | train loss 3.350060 | norm 0.3493 | lr 4.11e-04 | (3823.38 ms | 137127 tok/s) step 13014/76294 | train loss 3.348695 | norm 0.5671 | lr 4.11e-04 | (3803.16 ms | 137856 tok/s) step 13015/76294 | train loss 3.369071 | norm 0.3985 | lr 4.11e-04 | (3823.79 ms | 137112 tok/s) step 13016/76294 | train loss 3.451183 | norm 0.3814 | lr 4.11e-04 | (3798.64 ms | 138020 tok/s) step 13017/76294 | train loss 3.372675 | norm 0.5192 | lr 4.11e-04 | (6307.78 ms | 83118 tok/s) step 13018/76294 | train loss 3.420365 | norm 0.5378 | lr 4.11e-04 | (3796.14 ms | 138111 tok/s) step 13019/76294 | train loss 3.439933 | norm 0.2835 | lr 4.11e-04 | (3837.89 ms | 136609 tok/s) step 13020/76294 | train loss 3.330409 | norm 0.3415 | lr 4.10e-04 | (3794.87 ms | 138157 tok/s) step 
13021/76294 | train loss 3.412502 | norm 0.3286 | lr 4.10e-04 | (3815.69 ms | 137403 tok/s) step 13022/76294 | train loss 3.361926 | norm 0.2710 | lr 4.10e-04 | (3815.36 ms | 137415 tok/s) step 13023/76294 | train loss 3.378083 | norm 0.4408 | lr 4.10e-04 | (3796.06 ms | 138114 tok/s) step 13024/76294 | train loss 3.327583 | norm 0.2691 | lr 4.10e-04 | (3802.29 ms | 137887 tok/s) step 13025/76294 | train loss 3.387128 | norm 0.3332 | lr 4.10e-04 | (3795.96 ms | 138117 tok/s) step 13026/76294 | train loss 3.441809 | norm 0.3190 | lr 4.10e-04 | (3795.79 ms | 138124 tok/s) step 13027/76294 | train loss 3.352122 | norm 0.2729 | lr 4.10e-04 | (3833.30 ms | 136772 tok/s) step 13028/76294 | train loss 3.440882 | norm 0.2233 | lr 4.10e-04 | (3793.80 ms | 138196 tok/s) step 13029/76294 | train loss 3.377625 | norm 0.2507 | lr 4.10e-04 | (3800.12 ms | 137966 tok/s) step 13030/76294 | train loss 3.299464 | norm 0.3133 | lr 4.10e-04 | (3824.17 ms | 137099 tok/s) step 13031/76294 | train loss 3.433288 | norm 0.2209 | lr 4.10e-04 | (3804.28 ms | 137815 tok/s) step 13032/76294 | train loss 3.358380 | norm 0.3007 | lr 4.10e-04 | (3893.57 ms | 134655 tok/s) step 13033/76294 | train loss 3.377571 | norm 0.2703 | lr 4.09e-04 | (3902.12 ms | 134360 tok/s) step 13034/76294 | train loss 3.472251 | norm 0.2246 | lr 4.09e-04 | (3793.94 ms | 138191 tok/s) step 13035/76294 | train loss 3.734785 | norm 0.5745 | lr 4.09e-04 | (3843.86 ms | 136396 tok/s) step 13036/76294 | train loss 3.351746 | norm 0.3075 | lr 4.09e-04 | (3900.61 ms | 134412 tok/s) step 13037/76294 | train loss 3.387232 | norm 0.2420 | lr 4.09e-04 | (3805.56 ms | 137769 tok/s) step 13038/76294 | train loss 3.393638 | norm 0.3206 | lr 4.09e-04 | (3804.41 ms | 137811 tok/s) step 13039/76294 | train loss 3.387809 | norm 0.2997 | lr 4.09e-04 | (3820.91 ms | 137216 tok/s) step 13040/76294 | train loss 3.411051 | norm 0.2356 | lr 4.09e-04 | (3802.17 ms | 137892 tok/s) step 13041/76294 | train loss 3.366050 | norm 0.3771 | lr 
4.09e-04 | (3800.33 ms | 137959 tok/s) step 13042/76294 | train loss 3.330620 | norm 0.4562 | lr 4.09e-04 | (3826.29 ms | 137023 tok/s) step 13043/76294 | train loss 3.348444 | norm 0.3762 | lr 4.09e-04 | (3916.45 ms | 133868 tok/s) step 13044/76294 | train loss 3.360388 | norm 0.3658 | lr 4.09e-04 | (3797.71 ms | 138054 tok/s) step 13045/76294 | train loss 3.358422 | norm 0.4092 | lr 4.08e-04 | (3801.56 ms | 137914 tok/s) step 13046/76294 | train loss 3.417050 | norm 0.5008 | lr 4.08e-04 | (3822.24 ms | 137168 tok/s) step 13047/76294 | train loss 3.443339 | norm 1.0907 | lr 4.08e-04 | (3798.71 ms | 138017 tok/s) step 13048/76294 | train loss 3.355900 | norm 0.4764 | lr 4.08e-04 | (3804.65 ms | 137802 tok/s) step 13049/76294 | train loss 3.411387 | norm 0.3163 | lr 4.08e-04 | (3802.00 ms | 137898 tok/s) step 13050/76294 | train loss 3.347315 | norm 0.2347 | lr 4.08e-04 | (3803.05 ms | 137860 tok/s) step 13051/76294 | train loss 3.335873 | norm 0.3713 | lr 4.08e-04 | (3803.20 ms | 137855 tok/s) step 13052/76294 | train loss 3.430250 | norm 0.7380 | lr 4.08e-04 | (3808.51 ms | 137662 tok/s) step 13053/76294 | train loss 3.316999 | norm 0.5768 | lr 4.08e-04 | (3804.23 ms | 137817 tok/s) step 13054/76294 | train loss 3.327884 | norm 0.2906 | lr 4.08e-04 | (3822.46 ms | 137160 tok/s) step 13055/76294 | train loss 3.388129 | norm 0.3402 | lr 4.08e-04 | (3895.00 ms | 134605 tok/s) step 13056/76294 | train loss 3.324379 | norm 0.3295 | lr 4.08e-04 | (3797.42 ms | 138064 tok/s) step 13057/76294 | train loss 3.338314 | norm 0.2259 | lr 4.08e-04 | (3901.38 ms | 134385 tok/s) step 13058/76294 | train loss 3.401628 | norm 0.4186 | lr 4.07e-04 | (3794.55 ms | 138169 tok/s) step 13059/76294 | train loss 3.357559 | norm 0.7460 | lr 4.07e-04 | (3849.63 ms | 136192 tok/s) step 13060/76294 | train loss 3.323777 | norm 0.4582 | lr 4.07e-04 | (3796.31 ms | 138105 tok/s) step 13061/76294 | train loss 3.371425 | norm 0.6121 | lr 4.07e-04 | (3798.85 ms | 138012 tok/s) step 13062/76294 | 
train loss 3.326872 | norm 0.3597 | lr 4.07e-04 | (3822.74 ms | 137150 tok/s) step 13063/76294 | train loss 3.365519 | norm 0.2551 | lr 4.07e-04 | (3803.81 ms | 137832 tok/s) step 13064/76294 | train loss 3.363292 | norm 0.2879 | lr 4.07e-04 | (3798.98 ms | 138007 tok/s) step 13065/76294 | train loss 3.436623 | norm 0.5030 | lr 4.07e-04 | (3830.36 ms | 136877 tok/s) step 13066/76294 | train loss 3.347757 | norm 0.2845 | lr 4.07e-04 | (3806.02 ms | 137752 tok/s) step 13067/76294 | train loss 3.302646 | norm 0.2180 | lr 4.07e-04 | (3804.22 ms | 137818 tok/s) step 13068/76294 | train loss 3.590111 | norm 0.2955 | lr 4.07e-04 | (3825.12 ms | 137064 tok/s) step 13069/76294 | train loss 3.329759 | norm 0.2694 | lr 4.07e-04 | (3808.49 ms | 137663 tok/s) step 13070/76294 | train loss 3.280217 | norm 0.3073 | lr 4.06e-04 | (3807.84 ms | 137686 tok/s) step 13071/76294 | train loss 3.492350 | norm 0.3268 | lr 4.06e-04 | (3803.86 ms | 137830 tok/s) step 13072/76294 | train loss 3.438590 | norm 0.5145 | lr 4.06e-04 | (3811.08 ms | 137569 tok/s) step 13073/76294 | train loss 3.452024 | norm 0.2869 | lr 4.06e-04 | (3810.17 ms | 137602 tok/s) step 13074/76294 | train loss 3.380851 | norm 0.3900 | lr 4.06e-04 | (3809.67 ms | 137620 tok/s) step 13075/76294 | train loss 3.347334 | norm 0.2073 | lr 4.06e-04 | (3805.44 ms | 137773 tok/s) step 13076/76294 | train loss 3.394677 | norm 0.3513 | lr 4.06e-04 | (4114.60 ms | 127421 tok/s) step 13077/76294 | train loss 3.406974 | norm 0.4556 | lr 4.06e-04 | (3882.43 ms | 135041 tok/s) step 13078/76294 | train loss 3.381098 | norm 0.3545 | lr 4.06e-04 | (3806.65 ms | 137729 tok/s) step 13079/76294 | train loss 3.512123 | norm 0.9416 | lr 4.06e-04 | (3834.31 ms | 136736 tok/s) step 13080/76294 | train loss 3.388540 | norm 0.2247 | lr 4.06e-04 | (3807.09 ms | 137714 tok/s) step 13081/76294 | train loss 3.499721 | norm 0.5095 | lr 4.06e-04 | (3804.00 ms | 137825 tok/s) step 13082/76294 | train loss 3.367698 | norm 0.4525 | lr 4.06e-04 | (3881.36 
ms | 135078 tok/s) step 13083/76294 | train loss 3.412221 | norm 0.2802 | lr 4.05e-04 | (3835.26 ms | 136702 tok/s) step 13084/76294 | train loss 3.340810 | norm 0.3465 | lr 4.05e-04 | (3800.60 ms | 137949 tok/s) step 13085/76294 | train loss 3.396868 | norm 0.4449 | lr 4.05e-04 | (3824.42 ms | 137089 tok/s) step 13086/76294 | train loss 3.353895 | norm 0.4243 | lr 4.05e-04 | (4984.01 ms | 105194 tok/s) step 13087/76294 | train loss 3.398527 | norm 0.3064 | lr 4.05e-04 | (3989.25 ms | 131425 tok/s) step 13088/76294 | train loss 3.364120 | norm 0.5076 | lr 4.05e-04 | (3792.40 ms | 138247 tok/s) step 13089/76294 | train loss 3.409878 | norm 0.2695 | lr 4.05e-04 | (3884.67 ms | 134963 tok/s) step 13090/76294 | train loss 3.352214 | norm 0.8197 | lr 4.05e-04 | (3781.01 ms | 138663 tok/s) step 13091/76294 | train loss 3.396778 | norm 0.4343 | lr 4.05e-04 | (3789.99 ms | 138335 tok/s) step 13092/76294 | train loss 3.395561 | norm 0.7878 | lr 4.05e-04 | (3810.49 ms | 137591 tok/s) step 13093/76294 | train loss 3.430655 | norm 0.3535 | lr 4.05e-04 | (3791.96 ms | 138263 tok/s) step 13094/76294 | train loss 3.412181 | norm 0.6179 | lr 4.05e-04 | (3799.61 ms | 137985 tok/s) step 13095/76294 | train loss 3.367325 | norm 0.2492 | lr 4.04e-04 | (3796.75 ms | 138089 tok/s) step 13096/76294 | train loss 3.647251 | norm 0.2487 | lr 4.04e-04 | (3814.75 ms | 137437 tok/s) step 13097/76294 | train loss 3.391148 | norm 0.2577 | lr 4.04e-04 | (3829.54 ms | 136906 tok/s) step 13098/76294 | train loss 3.398115 | norm 0.3217 | lr 4.04e-04 | (3792.49 ms | 138244 tok/s) step 13099/76294 | train loss 3.351114 | norm 0.2742 | lr 4.04e-04 | (3803.60 ms | 137840 tok/s) step 13100/76294 | train loss 3.400273 | norm 0.3064 | lr 4.04e-04 | (3794.00 ms | 138189 tok/s) step 13101/76294 | train loss 3.354573 | norm 0.3777 | lr 4.04e-04 | (3826.39 ms | 137019 tok/s) step 13102/76294 | train loss 3.407086 | norm 0.3156 | lr 4.04e-04 | (3796.61 ms | 138094 tok/s) step 13103/76294 | train loss 3.457496 | 
norm 0.4377 | lr 4.04e-04 | (3859.96 ms | 135827 tok/s) step 13104/76294 | train loss 3.484673 | norm 0.6807 | lr 4.04e-04 | (3798.51 ms | 138025 tok/s) step 13105/76294 | train loss 3.413049 | norm 0.4875 | lr 4.04e-04 | (3801.05 ms | 137932 tok/s) step 13106/76294 | train loss 3.390837 | norm 1.4915 | lr 4.04e-04 | (3816.08 ms | 137389 tok/s) step 13107/76294 | train loss 3.448565 | norm 0.6107 | lr 4.04e-04 | (3796.84 ms | 138085 tok/s) step 13108/76294 | train loss 3.387703 | norm 0.2906 | lr 4.03e-04 | (3800.73 ms | 137944 tok/s) step 13109/76294 | train loss 3.362763 | norm 0.3769 | lr 4.03e-04 | (3805.40 ms | 137775 tok/s) step 13110/76294 | train loss 3.348135 | norm 0.4235 | lr 4.03e-04 | (3802.26 ms | 137888 tok/s) step 13111/76294 | train loss 3.457136 | norm 0.2981 | lr 4.03e-04 | (3858.94 ms | 135863 tok/s) step 13112/76294 | train loss 3.395439 | norm 0.3556 | lr 4.03e-04 | (3845.47 ms | 136339 tok/s) step 13113/76294 | train loss 3.365933 | norm 0.4219 | lr 4.03e-04 | (3795.08 ms | 138149 tok/s) step 13114/76294 | train loss 3.288886 | norm 0.6085 | lr 4.03e-04 | (3851.29 ms | 136133 tok/s) step 13115/76294 | train loss 3.408588 | norm 0.8290 | lr 4.03e-04 | (3803.93 ms | 137828 tok/s) step 13116/76294 | train loss 3.399930 | norm 0.7118 | lr 4.03e-04 | (3825.00 ms | 137069 tok/s) step 13117/76294 | train loss 3.504167 | norm 0.7610 | lr 4.03e-04 | (3797.60 ms | 138058 tok/s) step 13118/76294 | train loss 3.423475 | norm 0.4318 | lr 4.03e-04 | (3897.58 ms | 134516 tok/s) step 13119/76294 | train loss 3.394093 | norm 0.8802 | lr 4.03e-04 | (3899.85 ms | 134438 tok/s) step 13120/76294 | train loss 3.504285 | norm 0.9060 | lr 4.03e-04 | (3801.71 ms | 137909 tok/s) step 13121/76294 | train loss 3.399136 | norm 0.3848 | lr 4.02e-04 | (3804.88 ms | 137794 tok/s) step 13122/76294 | train loss 3.375940 | norm 1.2213 | lr 4.02e-04 | (3827.02 ms | 136996 tok/s) step 13123/76294 | train loss 3.398683 | norm 0.4011 | lr 4.02e-04 | (3803.43 ms | 137846 tok/s) 
step 13124/76294 | train loss 3.385008 | norm 0.5874 | lr 4.02e-04 | (3806.71 ms | 137727 tok/s) step 13125/76294 | train loss 3.398534 | norm 0.5287 | lr 4.02e-04 | (3806.94 ms | 137719 tok/s) step 13126/76294 | train loss 3.353530 | norm 0.3129 | lr 4.02e-04 | (3827.75 ms | 136970 tok/s) step 13127/76294 | train loss 3.358589 | norm 0.4033 | lr 4.02e-04 | (3804.92 ms | 137792 tok/s) step 13128/76294 | train loss 3.366885 | norm 0.2552 | lr 4.02e-04 | (3811.77 ms | 137544 tok/s) step 13129/76294 | train loss 3.493914 | norm 0.4237 | lr 4.02e-04 | (3817.82 ms | 137327 tok/s) step 13130/76294 | train loss 3.381300 | norm 0.3689 | lr 4.02e-04 | (3825.99 ms | 137033 tok/s) step 13131/76294 | train loss 3.413500 | norm 0.3323 | lr 4.02e-04 | (3804.52 ms | 137807 tok/s) step 13132/76294 | train loss 3.420031 | norm 0.2771 | lr 4.02e-04 | (3812.83 ms | 137506 tok/s) step 13133/76294 | train loss 3.473700 | norm 0.3057 | lr 4.01e-04 | (3808.20 ms | 137673 tok/s) step 13134/76294 | train loss 3.414507 | norm 0.3183 | lr 4.01e-04 | (3807.64 ms | 137694 tok/s) step 13135/76294 | train loss 3.340445 | norm 0.3000 | lr 4.01e-04 | (3855.37 ms | 135989 tok/s) step 13136/76294 | train loss 3.385447 | norm 0.2852 | lr 4.01e-04 | (3811.23 ms | 137564 tok/s) step 13137/76294 | train loss 3.402749 | norm 0.5412 | lr 4.01e-04 | (3834.14 ms | 136742 tok/s) step 13138/76294 | train loss 3.383151 | norm 0.3167 | lr 4.01e-04 | (3812.61 ms | 137514 tok/s) step 13139/76294 | train loss 3.475749 | norm 0.4910 | lr 4.01e-04 | (3806.31 ms | 137742 tok/s) step 13140/76294 | train loss 3.336521 | norm 0.5144 | lr 4.01e-04 | (3830.94 ms | 136856 tok/s) step 13141/76294 | train loss 3.370380 | norm 0.2580 | lr 4.01e-04 | (3840.94 ms | 136500 tok/s) step 13142/76294 | train loss 3.384824 | norm 0.5250 | lr 4.01e-04 | (3805.30 ms | 137778 tok/s) step 13143/76294 | train loss 3.476207 | norm 0.5959 | lr 4.01e-04 | (3834.15 ms | 136742 tok/s) step 13144/76294 | train loss 3.563664 | norm 0.4176 | lr 
4.01e-04 | (3806.12 ms | 137749 tok/s) step 13145/76294 | train loss 3.467339 | norm 0.3913 | lr 4.01e-04 | (3844.73 ms | 136365 tok/s) step 13146/76294 | train loss 3.405007 | norm 0.4884 | lr 4.00e-04 | (3826.32 ms | 137022 tok/s) step 13147/76294 | train loss 3.381740 | norm 0.2734 | lr 4.00e-04 | (3828.20 ms | 136954 tok/s) step 13148/76294 | train loss 3.439336 | norm 0.2491 | lr 4.00e-04 | (3824.80 ms | 137076 tok/s) step 13149/76294 | train loss 3.350949 | norm 0.2326 | lr 4.00e-04 | (3806.46 ms | 137736 tok/s) step 13150/76294 | train loss 3.466329 | norm 0.2480 | lr 4.00e-04 | (3826.08 ms | 137030 tok/s) step 13151/76294 | train loss 3.339793 | norm 0.3202 | lr 4.00e-04 | (3805.70 ms | 137764 tok/s) step 13152/76294 | train loss 3.420147 | norm 0.2200 | lr 4.00e-04 | (3805.28 ms | 137779 tok/s) step 13153/76294 | train loss 3.338487 | norm 0.2770 | lr 4.00e-04 | (3805.50 ms | 137771 tok/s) step 13154/76294 | train loss 3.438030 | norm 0.3486 | lr 4.00e-04 | (3834.51 ms | 136729 tok/s) step 13155/76294 | train loss 3.367712 | norm 0.4232 | lr 4.00e-04 | (3806.81 ms | 137724 tok/s) step 13156/76294 | train loss 3.341170 | norm 0.3178 | lr 4.00e-04 | (3814.70 ms | 137439 tok/s) step 13157/76294 | train loss 3.410754 | norm 1.0260 | lr 4.00e-04 | (3810.30 ms | 137598 tok/s) step 13158/76294 | train loss 3.423903 | norm 0.4111 | lr 4.00e-04 | (3823.50 ms | 137122 tok/s) step 13159/76294 | train loss 3.374522 | norm 0.4892 | lr 3.99e-04 | (3809.90 ms | 137612 tok/s) step 13160/76294 | train loss 3.373485 | norm 0.2752 | lr 3.99e-04 | (3824.74 ms | 137078 tok/s) step 13161/76294 | train loss 3.427218 | norm 0.4143 | lr 3.99e-04 | (3800.91 ms | 137938 tok/s) step 13162/76294 | train loss 3.395909 | norm 0.4051 | lr 3.99e-04 | (3807.77 ms | 137689 tok/s) step 13163/76294 | train loss 3.434581 | norm 0.3075 | lr 3.99e-04 | (3806.14 ms | 137748 tok/s) step 13164/76294 | train loss 3.357156 | norm 0.5546 | lr 3.99e-04 | (3894.44 ms | 134625 tok/s) step 13165/76294 | 
train loss 3.462261 | norm 0.3547 | lr 3.99e-04 | (3807.16 ms | 137711 tok/s) step 13166/76294 | train loss 3.417302 | norm 0.3241 | lr 3.99e-04 | (3809.43 ms | 137629 tok/s) step 13167/76294 | train loss 3.418211 | norm 0.4363 | lr 3.99e-04 | (3802.38 ms | 137884 tok/s) step 13168/76294 | train loss 3.391615 | norm 0.4706 | lr 3.99e-04 | (3809.36 ms | 137631 tok/s) step 13169/76294 | train loss 3.410801 | norm 0.5899 | lr 3.99e-04 | (3828.56 ms | 136941 tok/s) step 13170/76294 | train loss 3.432341 | norm 0.4547 | lr 3.99e-04 | (3804.96 ms | 137791 tok/s) step 13171/76294 | train loss 3.355433 | norm 0.2969 | lr 3.98e-04 | (3841.46 ms | 136481 tok/s) step 13172/76294 | train loss 3.421330 | norm 0.5931 | lr 3.98e-04 | (3832.48 ms | 136801 tok/s) step 13173/76294 | train loss 3.353873 | norm 0.2505 | lr 3.98e-04 | (3801.98 ms | 137899 tok/s) step 13174/76294 | train loss 3.387698 | norm 0.3491 | lr 3.98e-04 | (3872.87 ms | 135374 tok/s) step 13175/76294 | train loss 3.327289 | norm 0.2784 | lr 3.98e-04 | (3798.85 ms | 138012 tok/s) step 13176/76294 | train loss 3.347830 | norm 0.2624 | lr 3.98e-04 | (3805.76 ms | 137762 tok/s) step 13177/76294 | train loss 3.392194 | norm 0.1881 | lr 3.98e-04 | (3799.43 ms | 137991 tok/s) step 13178/76294 | train loss 3.419824 | norm 0.2784 | lr 3.98e-04 | (3809.87 ms | 137613 tok/s) step 13179/76294 | train loss 3.354731 | norm 0.5996 | lr 3.98e-04 | (3800.41 ms | 137956 tok/s) step 13180/76294 | train loss 3.369520 | norm 0.2161 | lr 3.98e-04 | (3823.44 ms | 137125 tok/s) step 13181/76294 | train loss 3.459318 | norm 0.4143 | lr 3.98e-04 | (3801.91 ms | 137901 tok/s) step 13182/76294 | train loss 3.341438 | norm 0.6310 | lr 3.98e-04 | (3808.66 ms | 137657 tok/s) step 13183/76294 | train loss 3.506472 | norm 0.2465 | lr 3.98e-04 | (3820.81 ms | 137219 tok/s) step 13184/76294 | train loss 3.344669 | norm 0.4019 | lr 3.97e-04 | (3818.85 ms | 137290 tok/s) step 13185/76294 | train loss 3.406425 | norm 0.4130 | lr 3.97e-04 | (3813.94 
ms | 137466 tok/s) step 13186/76294 | train loss 3.460334 | norm 0.3839 | lr 3.97e-04 | (3843.96 ms | 136393 tok/s) step 13187/76294 | train loss 3.396323 | norm 0.3574 | lr 3.97e-04 | (3918.52 ms | 133798 tok/s) step 13188/76294 | train loss 3.376436 | norm 0.2851 | lr 3.97e-04 | (3814.90 ms | 137432 tok/s) step 13189/76294 | train loss 3.372681 | norm 0.5723 | lr 3.97e-04 | (3833.27 ms | 136773 tok/s) step 13190/76294 | train loss 3.398414 | norm 0.3115 | lr 3.97e-04 | (3807.78 ms | 137689 tok/s) step 13191/76294 | train loss 3.392373 | norm 0.6706 | lr 3.97e-04 | (3819.10 ms | 137281 tok/s) step 13192/76294 | train loss 3.386321 | norm 0.5439 | lr 3.97e-04 | (3834.15 ms | 136742 tok/s) step 13193/76294 | train loss 3.403837 | norm 0.2808 | lr 3.97e-04 | (3816.17 ms | 137386 tok/s) step 13194/76294 | train loss 3.426854 | norm 1.1604 | lr 3.97e-04 | (3810.91 ms | 137575 tok/s) step 13195/76294 | train loss 3.420733 | norm 0.7490 | lr 3.97e-04 | (3815.23 ms | 137420 tok/s) step 13196/76294 | train loss 3.463352 | norm 0.7581 | lr 3.97e-04 | (3838.08 ms | 136602 tok/s) step 13197/76294 | train loss 3.339581 | norm 0.9188 | lr 3.96e-04 | (3816.85 ms | 137361 tok/s) step 13198/76294 | train loss 3.340670 | norm 0.5146 | lr 3.96e-04 | (3814.41 ms | 137449 tok/s) step 13199/76294 | train loss 3.422281 | norm 0.2572 | lr 3.96e-04 | (3853.76 ms | 136046 tok/s) step 13200/76294 | train loss 3.445882 | norm 0.3571 | lr 3.96e-04 | (3812.93 ms | 137503 tok/s) step 13201/76294 | train loss 3.455201 | norm 0.7559 | lr 3.96e-04 | (3857.47 ms | 135915 tok/s) step 13202/76294 | train loss 3.368869 | norm 0.7767 | lr 3.96e-04 | (3808.18 ms | 137674 tok/s) step 13203/76294 | train loss 3.432827 | norm 0.2873 | lr 3.96e-04 | (3810.97 ms | 137574 tok/s) step 13204/76294 | train loss 3.349056 | norm 0.3027 | lr 3.96e-04 | (3836.82 ms | 136646 tok/s) step 13205/76294 | train loss 3.539787 | norm 0.4680 | lr 3.96e-04 | (3814.75 ms | 137437 tok/s) step 13206/76294 | train loss 3.353013 | 
norm 0.3723 | lr 3.96e-04 | (3816.42 ms | 137377 tok/s) step 13207/76294 | train loss 3.392093 | norm 0.5443 | lr 3.96e-04 | (3804.79 ms | 137797 tok/s) step 13208/76294 | train loss 3.405645 | norm 0.5606 | lr 3.96e-04 | (3810.84 ms | 137578 tok/s) step 13209/76294 | train loss 3.383213 | norm 0.3446 | lr 3.96e-04 | (3805.06 ms | 137787 tok/s) step 13210/76294 | train loss 3.411209 | norm 0.2369 | lr 3.95e-04 | (3811.62 ms | 137550 tok/s) step 13211/76294 | train loss 3.382036 | norm 0.2466 | lr 3.95e-04 | (3806.54 ms | 137734 tok/s) step 13212/76294 | train loss 3.466065 | norm 1.0503 | lr 3.95e-04 | (3880.97 ms | 135092 tok/s) step 13213/76294 | train loss 3.392742 | norm 0.4867 | lr 3.95e-04 | (3804.59 ms | 137804 tok/s) step 13214/76294 | train loss 3.369094 | norm 0.2695 | lr 3.95e-04 | (3851.49 ms | 136126 tok/s) step 13215/76294 | train loss 3.395646 | norm 0.2557 | lr 3.95e-04 | (3801.14 ms | 137929 tok/s) step 13216/76294 | train loss 3.362561 | norm 0.2938 | lr 3.95e-04 | (3845.38 ms | 136342 tok/s) step 13217/76294 | train loss 3.406678 | norm 0.2215 | lr 3.95e-04 | (3801.10 ms | 137931 tok/s) step 13218/76294 | train loss 3.376662 | norm 0.2435 | lr 3.95e-04 | (3812.23 ms | 137528 tok/s) step 13219/76294 | train loss 3.468613 | norm 0.2316 | lr 3.95e-04 | (3829.22 ms | 136918 tok/s) step 13220/76294 | train loss 3.364777 | norm 0.3817 | lr 3.95e-04 | (3805.69 ms | 137764 tok/s) step 13221/76294 | train loss 3.390218 | norm 0.3257 | lr 3.95e-04 | (3806.27 ms | 137743 tok/s) step 13222/76294 | train loss 3.371550 | norm 0.2906 | lr 3.94e-04 | (3893.80 ms | 134647 tok/s) step 13223/76294 | train loss 3.387839 | norm 0.2108 | lr 3.94e-04 | (3802.13 ms | 137893 tok/s) step 13224/76294 | train loss 3.423057 | norm 0.2165 | lr 3.94e-04 | (3809.59 ms | 137623 tok/s) step 13225/76294 | train loss 3.405002 | norm 0.2141 | lr 3.94e-04 | (3838.44 ms | 136589 tok/s) step 13226/76294 | train loss 3.387151 | norm 0.2558 | lr 3.94e-04 | (3812.72 ms | 137510 tok/s) 
step 13227/76294 | train loss 3.421906 | norm 0.3493 | lr 3.94e-04 | (3811.49 ms | 137554 tok/s) step 13228/76294 | train loss 3.395914 | norm 0.4865 | lr 3.94e-04 | (3840.03 ms | 136532 tok/s) step 13229/76294 | train loss 3.419365 | norm 2.1394 | lr 3.94e-04 | (3811.29 ms | 137562 tok/s) step 13230/76294 | train loss 3.426378 | norm 0.5980 | lr 3.94e-04 | (3810.97 ms | 137574 tok/s) step 13231/76294 | train loss 3.382851 | norm 1.0352 | lr 3.94e-04 | (3828.23 ms | 136953 tok/s) step 13232/76294 | train loss 3.409096 | norm 0.5582 | lr 3.94e-04 | (3805.58 ms | 137768 tok/s) step 13233/76294 | train loss 3.389010 | norm 0.4482 | lr 3.94e-04 | (3804.90 ms | 137793 tok/s) step 13234/76294 | train loss 3.397459 | norm 1.0905 | lr 3.94e-04 | (3833.30 ms | 136772 tok/s) step 13235/76294 | train loss 3.390209 | norm 1.2227 | lr 3.93e-04 | (3801.69 ms | 137909 tok/s) step 13236/76294 | train loss 3.355125 | norm 0.3929 | lr 3.93e-04 | (3811.40 ms | 137558 tok/s) step 13237/76294 | train loss 3.420821 | norm 0.4215 | lr 3.93e-04 | (3897.55 ms | 134517 tok/s) step 13238/76294 | train loss 3.379697 | norm 0.5909 | lr 3.93e-04 | (3805.25 ms | 137780 tok/s) step 13239/76294 | train loss 3.455015 | norm 0.3395 | lr 3.93e-04 | (3859.02 ms | 135860 tok/s) step 13240/76294 | train loss 3.350976 | norm 0.3545 | lr 3.93e-04 | (3802.51 ms | 137880 tok/s) step 13241/76294 | train loss 3.341664 | norm 0.3586 | lr 3.93e-04 | (3809.70 ms | 137619 tok/s) step 13242/76294 | train loss 3.392258 | norm 0.2749 | lr 3.93e-04 | (3829.61 ms | 136904 tok/s) step 13243/76294 | train loss 3.369240 | norm 0.2747 | lr 3.93e-04 | (3805.13 ms | 137784 tok/s) step 13244/76294 | train loss 3.403235 | norm 0.2467 | lr 3.93e-04 | (3800.04 ms | 137969 tok/s) step 13245/76294 | train loss 3.410187 | norm 0.3607 | lr 3.93e-04 | (3849.68 ms | 136190 tok/s) step 13246/76294 | train loss 3.391692 | norm 0.2433 | lr 3.93e-04 | (3803.14 ms | 137857 tok/s) step 13247/76294 | train loss 3.467672 | norm 0.4263 | lr 
3.93e-04 | (3834.60 ms | 136726 tok/s) step 13248/76294 | train loss 3.376078 | norm 0.2223 | lr 3.92e-04 | (3802.83 ms | 137868 tok/s) step 13249/76294 | train loss 3.377027 | norm 0.2933 | lr 3.92e-04 | (3803.36 ms | 137849 tok/s) step 13250/76294 | train loss 3.386523 | norm 0.2182 | lr 3.92e-04 | (3833.50 ms | 136765 tok/s) val loss: 3.377980 saving model checkpoint to ./results/gpt2-124M-gqa/step_13250.pth step 13251/76294 | train loss 3.397355 | norm 0.2280 | lr 3.92e-04 | (3840.15 ms | 136528 tok/s) step 13252/76294 | train loss 3.375372 | norm 0.3330 | lr 3.92e-04 | (3799.01 ms | 138006 tok/s) step 13253/76294 | train loss 3.395702 | norm 0.2612 | lr 3.92e-04 | (3873.34 ms | 135358 tok/s) step 13254/76294 | train loss 3.358621 | norm 0.2584 | lr 3.92e-04 | (3807.67 ms | 137693 tok/s) step 13255/76294 | train loss 3.351317 | norm 0.1903 | lr 3.92e-04 | (3846.95 ms | 136287 tok/s) step 13256/76294 | train loss 3.356730 | norm 0.2249 | lr 3.92e-04 | (7412.32 ms | 70732 tok/s) step 13257/76294 | train loss 3.399187 | norm 0.2890 | lr 3.92e-04 | (3797.27 ms | 138070 tok/s) step 13258/76294 | train loss 3.395889 | norm 0.5641 | lr 3.92e-04 | (3817.12 ms | 137352 tok/s) step 13259/76294 | train loss 3.410372 | norm 0.2624 | lr 3.92e-04 | (3891.17 ms | 134738 tok/s) step 13260/76294 | train loss 3.317401 | norm 0.3036 | lr 3.92e-04 | (3837.03 ms | 136639 tok/s) step 13261/76294 | train loss 3.424114 | norm 0.3875 | lr 3.91e-04 | (3798.38 ms | 138029 tok/s) step 13262/76294 | train loss 3.385067 | norm 0.2957 | lr 3.91e-04 | (3799.63 ms | 137984 tok/s) step 13263/76294 | train loss 3.396547 | norm 0.2710 | lr 3.91e-04 | (3823.22 ms | 137133 tok/s) step 13264/76294 | train loss 3.448198 | norm 0.5874 | lr 3.91e-04 | (3802.10 ms | 137894 tok/s) step 13265/76294 | train loss 3.352745 | norm 0.3849 | lr 3.91e-04 | (3807.60 ms | 137695 tok/s) step 13266/76294 | train loss 3.407613 | norm 0.3374 | lr 3.91e-04 | (3848.71 ms | 136224 tok/s) step 13267/76294 | train loss 
3.406870 | norm 0.5672 | lr 3.91e-04 | (4174.75 ms | 125585 tok/s) step 13268/76294 | train loss 3.378765 | norm 0.3160 | lr 3.91e-04 | (3793.51 ms | 138207 tok/s) step 13269/76294 | train loss 3.325962 | norm 0.4874 | lr 3.91e-04 | (3824.34 ms | 137092 tok/s) step 13270/76294 | train loss 3.421301 | norm 0.7652 | lr 3.91e-04 | (3794.04 ms | 138187 tok/s) step 13271/76294 | train loss 3.331343 | norm 0.5387 | lr 3.91e-04 | (3817.84 ms | 137326 tok/s) step 13272/76294 | train loss 3.343265 | norm 0.3904 | lr 3.91e-04 | (3796.52 ms | 138097 tok/s) step 13273/76294 | train loss 3.384816 | norm 0.5310 | lr 3.90e-04 | (3809.80 ms | 137615 tok/s) step 13274/76294 | train loss 3.354615 | norm 0.2503 | lr 3.90e-04 | (3794.05 ms | 138187 tok/s) step 13275/76294 | train loss 3.380522 | norm 0.3078 | lr 3.90e-04 | (3800.61 ms | 137948 tok/s) step 13276/76294 | train loss 3.387007 | norm 0.2909 | lr 3.90e-04 | (3817.65 ms | 137333 tok/s) step 13277/76294 | train loss 3.432675 | norm 0.2858 | lr 3.90e-04 | (3804.27 ms | 137816 tok/s) step 13278/76294 | train loss 3.406948 | norm 0.2239 | lr 3.90e-04 | (3805.01 ms | 137789 tok/s) step 13279/76294 | train loss 3.352101 | norm 0.4179 | lr 3.90e-04 | (3802.83 ms | 137868 tok/s) step 13280/76294 | train loss 3.353927 | norm 0.2728 | lr 3.90e-04 | (3804.83 ms | 137795 tok/s) step 13281/76294 | train loss 3.413310 | norm 0.5148 | lr 3.90e-04 | (3804.23 ms | 137817 tok/s) step 13282/76294 | train loss 3.343992 | norm 0.2219 | lr 3.90e-04 | (3809.04 ms | 137643 tok/s) step 13283/76294 | train loss 3.403173 | norm 0.3624 | lr 3.90e-04 | (3831.05 ms | 136852 tok/s) step 13284/76294 | train loss 3.375573 | norm 0.4127 | lr 3.90e-04 | (3829.98 ms | 136891 tok/s) step 13285/76294 | train loss 3.307896 | norm 0.2336 | lr 3.90e-04 | (3804.32 ms | 137814 tok/s) step 13286/76294 | train loss 3.409070 | norm 0.2877 | lr 3.89e-04 | (3825.64 ms | 137046 tok/s) step 13287/76294 | train loss 3.399497 | norm 0.2321 | lr 3.89e-04 | (3803.85 ms | 137831 
tok/s) step 13288/76294 | train loss 3.402732 | norm 0.2490 | lr 3.89e-04 | (3807.98 ms | 137681 tok/s) step 13289/76294 | train loss 3.395672 | norm 0.2747 | lr 3.89e-04 | (3828.17 ms | 136955 tok/s) step 13290/76294 | train loss 3.355111 | norm 0.4510 | lr 3.89e-04 | (3805.66 ms | 137765 tok/s) step 13291/76294 | train loss 3.360159 | norm 0.3649 | lr 3.89e-04 | (3822.89 ms | 137144 tok/s) step 13292/76294 | train loss 3.444847 | norm 0.3058 | lr 3.89e-04 | (3806.76 ms | 137726 tok/s) step 13293/76294 | train loss 3.400922 | norm 0.3956 | lr 3.89e-04 | (3802.18 ms | 137892 tok/s) step 13294/76294 | train loss 3.404435 | norm 0.7290 | lr 3.89e-04 | (3806.47 ms | 137736 tok/s) step 13295/76294 | train loss 3.384591 | norm 0.2456 | lr 3.89e-04 | (3804.71 ms | 137800 tok/s) step 13296/76294 | train loss 3.671298 | norm 0.3055 | lr 3.89e-04 | (3808.72 ms | 137655 tok/s) step 13297/76294 | train loss 3.418413 | norm 0.2891 | lr 3.89e-04 | (3812.03 ms | 137535 tok/s) step 13298/76294 | train loss 3.337016 | norm 0.3139 | lr 3.89e-04 | (3821.51 ms | 137194 tok/s) step 13299/76294 | train loss 3.382730 | norm 0.2006 | lr 3.88e-04 | (3836.04 ms | 136674 tok/s) step 13300/76294 | train loss 3.358824 | norm 0.2999 | lr 3.88e-04 | (3813.13 ms | 137495 tok/s) step 13301/76294 | train loss 3.373568 | norm 0.4828 | lr 3.88e-04 | (3804.29 ms | 137815 tok/s) step 13302/76294 | train loss 3.360433 | norm 0.2694 | lr 3.88e-04 | (3807.89 ms | 137685 tok/s) step 13303/76294 | train loss 3.346217 | norm 0.3423 | lr 3.88e-04 | (3817.83 ms | 137326 tok/s) step 13304/76294 | train loss 3.365777 | norm 0.3615 | lr 3.88e-04 | (3828.74 ms | 136935 tok/s) step 13305/76294 | train loss 3.417403 | norm 0.3308 | lr 3.88e-04 | (4057.79 ms | 129205 tok/s) step 13306/76294 | train loss 3.362974 | norm 0.3598 | lr 3.88e-04 | (3795.22 ms | 138144 tok/s) step 13307/76294 | train loss 3.386142 | norm 0.6159 | lr 3.88e-04 | (3824.20 ms | 137098 tok/s) step 13308/76294 | train loss 3.317027 | norm 0.4078 
| lr 3.88e-04 | (3798.73 ms | 138017 tok/s) step 13309/76294 | train loss 3.384485 | norm 0.2130 | lr 3.88e-04 | (3893.99 ms | 134640 tok/s) step 13310/76294 | train loss 3.372330 | norm 0.2560 | lr 3.88e-04 | (3858.83 ms | 135867 tok/s) step 13311/76294 | train loss 3.361417 | norm 0.2216 | lr 3.88e-04 | (3846.26 ms | 136311 tok/s) step 13312/76294 | train loss 3.416764 | norm 0.3770 | lr 3.87e-04 | (3800.15 ms | 137965 tok/s) step 13313/76294 | train loss 3.396362 | norm 0.5832 | lr 3.87e-04 | (3871.93 ms | 135407 tok/s) step 13314/76294 | train loss 3.355953 | norm 0.5121 | lr 3.87e-04 | (3800.33 ms | 137959 tok/s) step 13315/76294 | train loss 3.339413 | norm 0.6173 | lr 3.87e-04 | (3848.28 ms | 136240 tok/s) step 13316/76294 | train loss 3.322889 | norm 0.3338 | lr 3.87e-04 | (3845.21 ms | 136348 tok/s) step 13317/76294 | train loss 3.361123 | norm 0.2274 | lr 3.87e-04 | (3801.89 ms | 137902 tok/s) step 13318/76294 | train loss 3.356593 | norm 0.2339 | lr 3.87e-04 | (3812.45 ms | 137520 tok/s) step 13319/76294 | train loss 3.302773 | norm 0.4792 | lr 3.87e-04 | (3801.17 ms | 137928 tok/s) step 13320/76294 | train loss 3.421175 | norm 0.6149 | lr 3.87e-04 | (3832.25 ms | 136809 tok/s) step 13321/76294 | train loss 3.347411 | norm 0.2926 | lr 3.87e-04 | (3936.22 ms | 133196 tok/s) step 13322/76294 | train loss 3.392523 | norm 0.3469 | lr 3.87e-04 | (3790.71 ms | 138309 tok/s) step 13323/76294 | train loss 3.370191 | norm 0.2588 | lr 3.87e-04 | (3798.48 ms | 138026 tok/s) step 13324/76294 | train loss 3.393512 | norm 0.2330 | lr 3.87e-04 | (3839.26 ms | 136560 tok/s) step 13325/76294 | train loss 3.341666 | norm 0.2594 | lr 3.86e-04 | (3804.18 ms | 137819 tok/s) step 13326/76294 | train loss 3.419729 | norm 0.2299 | lr 3.86e-04 | (3854.85 ms | 136007 tok/s) step 13327/76294 | train loss 3.354904 | norm 0.2641 | lr 3.86e-04 | (3794.51 ms | 138170 tok/s) step 13328/76294 | train loss 3.376265 | norm 0.2446 | lr 3.86e-04 | (3795.75 ms | 138125 tok/s) step 
13329/76294 | train loss 3.341365 | norm 0.3452 | lr 3.86e-04 | (4279.14 ms | 122522 tok/s) step 13330/76294 | train loss 3.388366 | norm 0.2877 | lr 3.86e-04 | (19935.97 ms | 26299 tok/s) step 13331/76294 | train loss 3.374752 | norm 0.8989 | lr 3.86e-04 | (3804.33 ms | 137813 tok/s) step 13332/76294 | train loss 3.409424 | norm 0.3452 | lr 3.86e-04 | (3832.12 ms | 136814 tok/s) step 13333/76294 | train loss 3.376806 | norm 0.5139 | lr 3.86e-04 | (4972.63 ms | 105435 tok/s) step 13334/76294 | train loss 3.376927 | norm 0.8796 | lr 3.86e-04 | (3838.30 ms | 136594 tok/s) step 13335/76294 | train loss 3.391581 | norm 0.3871 | lr 3.86e-04 | (3792.49 ms | 138244 tok/s) step 13336/76294 | train loss 3.391138 | norm 0.3266 | lr 3.86e-04 | (3779.23 ms | 138729 tok/s) step 13337/76294 | train loss 3.296533 | norm 0.5937 | lr 3.86e-04 | (3853.66 ms | 136050 tok/s) step 13338/76294 | train loss 3.361104 | norm 0.6280 | lr 3.85e-04 | (3787.01 ms | 138444 tok/s) step 13339/76294 | train loss 3.398523 | norm 0.3135 | lr 3.85e-04 | (3813.39 ms | 137486 tok/s) step 13340/76294 | train loss 3.497119 | norm 0.5054 | lr 3.85e-04 | (3789.76 ms | 138343 tok/s) step 13341/76294 | train loss 3.391277 | norm 0.4873 | lr 3.85e-04 | (3837.16 ms | 136634 tok/s) step 13342/76294 | train loss 3.388491 | norm 0.4341 | lr 3.85e-04 | (3875.85 ms | 135270 tok/s) step 13343/76294 | train loss 3.354494 | norm 0.3424 | lr 3.85e-04 | (3809.12 ms | 137640 tok/s) step 13344/76294 | train loss 3.344432 | norm 0.9354 | lr 3.85e-04 | (3849.00 ms | 136214 tok/s) step 13345/76294 | train loss 3.343059 | norm 0.3768 | lr 3.85e-04 | (3795.79 ms | 138124 tok/s) step 13346/76294 | train loss 3.381075 | norm 0.2785 | lr 3.85e-04 | (3884.21 ms | 134979 tok/s) step 13347/76294 | train loss 3.409643 | norm 0.3277 | lr 3.85e-04 | (3796.58 ms | 138095 tok/s) step 13348/76294 | train loss 3.368454 | norm 0.3883 | lr 3.85e-04 | (3877.25 ms | 135222 tok/s) step 13349/76294 | train loss 3.359318 | norm 0.3348 | lr 
3.85e-04 | (3871.63 ms | 135418 tok/s) step 13350/76294 | train loss 3.365961 | norm 0.2284 | lr 3.85e-04 | (3833.37 ms | 136770 tok/s) step 13351/76294 | train loss 3.292277 | norm 0.2873 | lr 3.84e-04 | (3908.32 ms | 134146 tok/s) step 13352/76294 | train loss 3.403545 | norm 0.2146 | lr 3.84e-04 | (3882.31 ms | 135045 tok/s) step 13353/76294 | train loss 3.370049 | norm 0.3019 | lr 3.84e-04 | (3793.41 ms | 138210 tok/s) step 13354/76294 | train loss 3.417610 | norm 0.2288 | lr 3.84e-04 | (3864.54 ms | 135666 tok/s) step 13355/76294 | train loss 3.305112 | norm 0.4567 | lr 3.84e-04 | (3879.48 ms | 135144 tok/s) step 13356/76294 | train loss 3.352028 | norm 0.3298 | lr 3.84e-04 | (3944.17 ms | 132927 tok/s) step 13357/76294 | train loss 3.315616 | norm 0.5576 | lr 3.84e-04 | (3913.99 ms | 133952 tok/s) step 13358/76294 | train loss 3.378645 | norm 0.2375 | lr 3.84e-04 | (3790.84 ms | 138304 tok/s) step 13359/76294 | train loss 3.386760 | norm 0.2829 | lr 3.84e-04 | (3841.37 ms | 136485 tok/s) step 13360/76294 | train loss 3.383562 | norm 0.3704 | lr 3.84e-04 | (3792.73 ms | 138235 tok/s) step 13361/76294 | train loss 3.322568 | norm 0.3050 | lr 3.84e-04 | (3843.63 ms | 136404 tok/s) step 13362/76294 | train loss 3.383556 | norm 0.4203 | lr 3.84e-04 | (3801.29 ms | 137924 tok/s) step 13363/76294 | train loss 3.385949 | norm 0.3245 | lr 3.84e-04 | (3801.85 ms | 137903 tok/s) step 13364/76294 | train loss 3.463239 | norm 0.3099 | lr 3.83e-04 | (3824.28 ms | 137094 tok/s) step 13365/76294 | train loss 3.423678 | norm 0.4476 | lr 3.83e-04 | (3806.17 ms | 137747 tok/s) step 13366/76294 | train loss 3.411306 | norm 0.3970 | lr 3.83e-04 | (3802.66 ms | 137874 tok/s) step 13367/76294 | train loss 3.355277 | norm 0.3158 | lr 3.83e-04 | (3837.49 ms | 136623 tok/s) step 13368/76294 | train loss 3.404840 | norm 0.2489 | lr 3.83e-04 | (3807.16 ms | 137711 tok/s) step 13369/76294 | train loss 3.337108 | norm 0.2691 | lr 3.83e-04 | (3871.98 ms | 135406 tok/s) step 13370/76294 | 
train loss 3.357837 | norm 0.3294 | lr 3.83e-04 | (3812.29 ms | 137526 tok/s) step 13371/76294 | train loss 3.311146 | norm 0.3909 | lr 3.83e-04 | (3815.95 ms | 137394 tok/s) step 13372/76294 | train loss 3.377522 | norm 0.2585 | lr 3.83e-04 | (3830.21 ms | 136882 tok/s) step 13373/76294 | train loss 3.353298 | norm 0.4301 | lr 3.83e-04 | (3818.61 ms | 137298 tok/s) step 13374/76294 | train loss 3.361828 | norm 0.3063 | lr 3.83e-04 | (3820.69 ms | 137223 tok/s) step 13375/76294 | train loss 3.340874 | norm 0.3672 | lr 3.83e-04 | (3819.24 ms | 137275 tok/s) step 13376/76294 | train loss 3.460217 | norm 0.3221 | lr 3.82e-04 | (3930.81 ms | 133379 tok/s) step 13377/76294 | train loss 3.305389 | norm 0.2882 | lr 3.82e-04 | (3809.94 ms | 137611 tok/s) step 13378/76294 | train loss 3.365663 | norm 0.6951 | lr 3.82e-04 | (3831.94 ms | 136821 tok/s) step 13379/76294 | train loss 3.355796 | norm 0.4052 | lr 3.82e-04 | (3838.79 ms | 136576 tok/s) step 13380/76294 | train loss 3.384636 | norm 0.3490 | lr 3.82e-04 | (3826.56 ms | 137013 tok/s) step 13381/76294 | train loss 3.327672 | norm 0.5444 | lr 3.82e-04 | (3823.02 ms | 137140 tok/s) step 13382/76294 | train loss 3.438525 | norm 0.3362 | lr 3.82e-04 | (3819.37 ms | 137271 tok/s) step 13383/76294 | train loss 3.296139 | norm 1.0487 | lr 3.82e-04 | (3816.90 ms | 137360 tok/s) step 13384/76294 | train loss 3.405398 | norm 1.6713 | lr 3.82e-04 | (3919.07 ms | 133779 tok/s) step 13385/76294 | train loss 3.322699 | norm 0.6376 | lr 3.82e-04 | (3798.59 ms | 138022 tok/s) step 13386/76294 | train loss 3.393056 | norm 0.7666 | lr 3.82e-04 | (3805.27 ms | 137779 tok/s) step 13387/76294 | train loss 3.347356 | norm 0.5111 | lr 3.82e-04 | (3823.46 ms | 137124 tok/s) step 13388/76294 | train loss 3.429222 | norm 0.5329 | lr 3.82e-04 | (3806.56 ms | 137733 tok/s) step 13389/76294 | train loss 3.341384 | norm 0.5794 | lr 3.81e-04 | (3806.99 ms | 137717 tok/s) step 13390/76294 | train loss 3.397856 | norm 0.3984 | lr 3.81e-04 | (3846.51 
ms | 136302 tok/s) step 13391/76294 | train loss 3.351558 | norm 0.8267 | lr 3.81e-04 | (3843.28 ms | 136417 tok/s) step 13392/76294 | train loss 3.364897 | norm 0.4420 | lr 3.81e-04 | (3806.58 ms | 137732 tok/s) step 13393/76294 | train loss 3.335259 | norm 0.5239 | lr 3.81e-04 | (3818.98 ms | 137285 tok/s) step 13394/76294 | train loss 3.394698 | norm 0.2999 | lr 3.81e-04 | (3812.34 ms | 137524 tok/s) step 13395/76294 | train loss 3.326767 | norm 0.3795 | lr 3.81e-04 | (3809.87 ms | 137613 tok/s) step 13396/76294 | train loss 3.398448 | norm 0.6558 | lr 3.81e-04 | (3807.88 ms | 137685 tok/s) step 13397/76294 | train loss 3.333040 | norm 0.3270 | lr 3.81e-04 | (3804.83 ms | 137795 tok/s) step 13398/76294 | train loss 3.357823 | norm 0.3024 | lr 3.81e-04 | (3834.39 ms | 136733 tok/s) step 13399/76294 | train loss 3.352535 | norm 0.2903 | lr 3.81e-04 | (5580.13 ms | 93956 tok/s) step 13400/76294 | train loss 3.373062 | norm 0.2902 | lr 3.81e-04 | (3903.22 ms | 134322 tok/s) step 13401/76294 | train loss 3.397261 | norm 0.5895 | lr 3.81e-04 | (5659.19 ms | 92644 tok/s) step 13402/76294 | train loss 3.351215 | norm 1.0343 | lr 3.80e-04 | (3848.76 ms | 136223 tok/s) step 13403/76294 | train loss 3.370287 | norm 0.4866 | lr 3.80e-04 | (3815.39 ms | 137414 tok/s) step 13404/76294 | train loss 3.304211 | norm 0.2641 | lr 3.80e-04 | (3798.88 ms | 138011 tok/s) step 13405/76294 | train loss 3.413298 | norm 0.6706 | lr 3.80e-04 | (3794.56 ms | 138168 tok/s) step 13406/76294 | train loss 3.374012 | norm 0.9723 | lr 3.80e-04 | (3815.29 ms | 137418 tok/s) step 13407/76294 | train loss 3.426894 | norm 0.3788 | lr 3.80e-04 | (3819.44 ms | 137268 tok/s) step 13408/76294 | train loss 3.321057 | norm 0.4289 | lr 3.80e-04 | (3807.25 ms | 137708 tok/s) step 13409/76294 | train loss 3.428215 | norm 0.9555 | lr 3.80e-04 | (3794.67 ms | 138164 tok/s) step 13410/76294 | train loss 3.332770 | norm 0.2775 | lr 3.80e-04 | (3801.38 ms | 137921 tok/s) step 13411/76294 | train loss 3.394865 | 
norm 0.5982 | lr 3.80e-04 | (3826.25 ms | 137024 tok/s) step 13412/76294 | train loss 3.274773 | norm 0.2500 | lr 3.80e-04 | (3796.13 ms | 138111 tok/s) step 13413/76294 | train loss 3.521148 | norm 0.3357 | lr 3.80e-04 | (3907.62 ms | 134171 tok/s) step 13414/76294 | train loss 3.387943 | norm 0.3225 | lr 3.80e-04 | (3790.26 ms | 138325 tok/s) step 13415/76294 | train loss 3.398819 | norm 0.2958 | lr 3.79e-04 | (3828.02 ms | 136961 tok/s) step 13416/76294 | train loss 3.355812 | norm 0.2785 | lr 3.79e-04 | (3797.89 ms | 138047 tok/s) step 13417/76294 | train loss 3.319352 | norm 0.3329 | lr 3.79e-04 | (3837.29 ms | 136630 tok/s) step 13418/76294 | train loss 3.332691 | norm 0.8261 | lr 3.79e-04 | (3876.49 ms | 135248 tok/s) step 13419/76294 | train loss 3.316470 | norm 0.3366 | lr 3.79e-04 | (3790.72 ms | 138308 tok/s) step 13420/76294 | train loss 3.348552 | norm 0.4676 | lr 3.79e-04 | (3798.13 ms | 138038 tok/s) step 13421/76294 | train loss 3.342703 | norm 0.2884 | lr 3.79e-04 | (3819.04 ms | 137283 tok/s) step 13422/76294 | train loss 3.447348 | norm 0.3105 | lr 3.79e-04 | (3799.64 ms | 137983 tok/s) step 13423/76294 | train loss 3.338974 | norm 0.2254 | lr 3.79e-04 | (3805.13 ms | 137785 tok/s) step 13424/76294 | train loss 3.417232 | norm 0.3504 | lr 3.79e-04 | (3798.97 ms | 138008 tok/s) step 13425/76294 | train loss 3.357177 | norm 0.2649 | lr 3.79e-04 | (3796.06 ms | 138114 tok/s) step 13426/76294 | train loss 3.387858 | norm 0.3676 | lr 3.79e-04 | (3831.53 ms | 136835 tok/s) step 13427/76294 | train loss 3.332621 | norm 0.2260 | lr 3.79e-04 | (3796.32 ms | 138104 tok/s) step 13428/76294 | train loss 3.393728 | norm 0.3352 | lr 3.78e-04 | (3960.00 ms | 132396 tok/s) step 13429/76294 | train loss 3.332072 | norm 0.8845 | lr 3.78e-04 | (3801.40 ms | 137920 tok/s) step 13430/76294 | train loss 3.389333 | norm 0.3031 | lr 3.78e-04 | (3806.84 ms | 137722 tok/s) step 13431/76294 | train loss 3.384630 | norm 0.3387 | lr 3.78e-04 | (3815.61 ms | 137406 tok/s) 
step 13432/76294 | train loss 3.263542 | norm 0.3462 | lr 3.78e-04 | (3800.98 ms | 137935 tok/s) step 13433/76294 | train loss 3.455150 | norm 0.3400 | lr 3.78e-04 | (3819.98 ms | 137249 tok/s) step 13434/76294 | train loss 3.322913 | norm 0.2424 | lr 3.78e-04 | (3798.12 ms | 138039 tok/s) step 13435/76294 | train loss 3.430024 | norm 0.2369 | lr 3.78e-04 | (3799.03 ms | 138006 tok/s) step 13436/76294 | train loss 3.370853 | norm 0.6021 | lr 3.78e-04 | (3829.11 ms | 136922 tok/s) step 13437/76294 | train loss 3.495062 | norm 0.2456 | lr 3.78e-04 | (3797.05 ms | 138078 tok/s) step 13438/76294 | train loss 3.364813 | norm 0.3939 | lr 3.78e-04 | (3840.85 ms | 136503 tok/s) step 13439/76294 | train loss 3.333725 | norm 0.3925 | lr 3.78e-04 | (3819.71 ms | 137258 tok/s) step 13440/76294 | train loss 3.278768 | norm 0.4342 | lr 3.78e-04 | (3827.40 ms | 136983 tok/s) step 13441/76294 | train loss 3.479458 | norm 0.2561 | lr 3.77e-04 | (3815.29 ms | 137417 tok/s) step 13442/76294 | train loss 3.332222 | norm 0.2827 | lr 3.77e-04 | (3880.55 ms | 135107 tok/s) step 13443/76294 | train loss 3.358216 | norm 0.2895 | lr 3.77e-04 | (3809.84 ms | 137614 tok/s) step 13444/76294 | train loss 3.342555 | norm 0.7019 | lr 3.77e-04 | (3815.24 ms | 137419 tok/s) step 13445/76294 | train loss 3.349064 | norm 0.2480 | lr 3.77e-04 | (3816.55 ms | 137372 tok/s) step 13446/76294 | train loss 3.325118 | norm 0.4635 | lr 3.77e-04 | (3801.92 ms | 137901 tok/s) step 13447/76294 | train loss 3.410788 | norm 0.3952 | lr 3.77e-04 | (3823.62 ms | 137118 tok/s) step 13448/76294 | train loss 3.332626 | norm 0.2866 | lr 3.77e-04 | (3804.33 ms | 137814 tok/s) step 13449/76294 | train loss 3.480454 | norm 0.6831 | lr 3.77e-04 | (3804.34 ms | 137813 tok/s) step 13450/76294 | train loss 3.406746 | norm 1.4625 | lr 3.77e-04 | (3798.97 ms | 138008 tok/s) step 13451/76294 | train loss 3.331714 | norm 0.4620 | lr 3.77e-04 | (3809.68 ms | 137620 tok/s) step 13452/76294 | train loss 3.411579 | norm 2.3046 | lr 
3.77e-04 | (3820.95 ms | 137214 tok/s) step 13453/76294 | train loss 3.327261 | norm 0.6547 | lr 3.77e-04 | (3808.66 ms | 137657 tok/s) step 13454/76294 | train loss 3.536988 | norm 0.7011 | lr 3.76e-04 | (3802.91 ms | 137865 tok/s) step 13455/76294 | train loss 3.368962 | norm 0.5150 | lr 3.76e-04 | (3852.15 ms | 136103 tok/s) step 13456/76294 | train loss 3.391079 | norm 0.3432 | lr 3.76e-04 | (3800.91 ms | 137938 tok/s) step 13457/76294 | train loss 3.348550 | norm 1.9350 | lr 3.76e-04 | (4816.09 ms | 108862 tok/s) step 13458/76294 | train loss 3.382465 | norm 0.8813 | lr 3.76e-04 | (3797.65 ms | 138056 tok/s) step 13459/76294 | train loss 3.329556 | norm 0.8807 | lr 3.76e-04 | (3818.69 ms | 137295 tok/s) step 13460/76294 | train loss 3.425963 | norm 0.6733 | lr 3.76e-04 | (3800.45 ms | 137954 tok/s) step 13461/76294 | train loss 3.435023 | norm 0.3009 | lr 3.76e-04 | (3802.77 ms | 137870 tok/s) step 13462/76294 | train loss 3.448004 | norm 0.3488 | lr 3.76e-04 | (3824.08 ms | 137102 tok/s) step 13463/76294 | train loss 3.416258 | norm 0.4780 | lr 3.76e-04 | (3812.47 ms | 137519 tok/s) step 13464/76294 | train loss 3.352353 | norm 0.4813 | lr 3.76e-04 | (3798.77 ms | 138015 tok/s) step 13465/76294 | train loss 3.431452 | norm 0.6347 | lr 3.76e-04 | (3875.77 ms | 135273 tok/s) step 13466/76294 | train loss 3.365342 | norm 0.5027 | lr 3.76e-04 | (3799.14 ms | 138002 tok/s) step 13467/76294 | train loss 3.455316 | norm 0.3095 | lr 3.76e-04 | (3843.37 ms | 136414 tok/s) step 13468/76294 | train loss 3.408382 | norm 1.8759 | lr 3.75e-04 | (3798.24 ms | 138034 tok/s) step 13469/76294 | train loss 3.346061 | norm 0.6010 | lr 3.75e-04 | (3831.81 ms | 136825 tok/s) step 13470/76294 | train loss 3.367216 | norm 0.3053 | lr 3.75e-04 | (3799.74 ms | 137980 tok/s) step 13471/76294 | train loss 3.376703 | norm 0.4079 | lr 3.75e-04 | (3809.59 ms | 137623 tok/s) step 13472/76294 | train loss 3.402545 | norm 0.2567 | lr 3.75e-04 | (3801.47 ms | 137917 tok/s) step 13473/76294 | 
train loss 3.371069 | norm 0.6360 | lr 3.75e-04 | (3820.45 ms | 137232 tok/s) step 13474/76294 | train loss 3.343701 | norm 0.4618 | lr 3.75e-04 | (3819.37 ms | 137271 tok/s) step 13475/76294 | train loss 3.425487 | norm 0.3087 | lr 3.75e-04 | (3806.40 ms | 137739 tok/s) step 13476/76294 | train loss 3.380367 | norm 0.3806 | lr 3.75e-04 | (3805.25 ms | 137780 tok/s) step 13477/76294 | train loss 3.413278 | norm 0.3496 | lr 3.75e-04 | (3801.68 ms | 137910 tok/s) step 13478/76294 | train loss 3.382089 | norm 0.4680 | lr 3.75e-04 | (3811.90 ms | 137540 tok/s) step 13479/76294 | train loss 3.424226 | norm 0.3053 | lr 3.75e-04 | (3846.34 ms | 136308 tok/s) step 13480/76294 | train loss 3.461625 | norm 0.3502 | lr 3.75e-04 | (3858.53 ms | 135878 tok/s) step 13481/76294 | train loss 3.405758 | norm 0.2432 | lr 3.74e-04 | (3805.48 ms | 137772 tok/s) step 13482/76294 | train loss 3.483677 | norm 0.2434 | lr 3.74e-04 | (3850.26 ms | 136169 tok/s) step 13483/76294 | train loss 3.392767 | norm 0.2626 | lr 3.74e-04 | (3798.78 ms | 138015 tok/s) step 13484/76294 | train loss 3.374177 | norm 0.4061 | lr 3.74e-04 | (3806.16 ms | 137747 tok/s) step 13485/76294 | train loss 3.453377 | norm 0.2878 | lr 3.74e-04 | (3801.77 ms | 137906 tok/s) step 13486/76294 | train loss 3.381887 | norm 0.4431 | lr 3.74e-04 | (3803.65 ms | 137838 tok/s) step 13487/76294 | train loss 3.364297 | norm 0.4778 | lr 3.74e-04 | (3828.81 ms | 136932 tok/s) step 13488/76294 | train loss 3.437504 | norm 0.3887 | lr 3.74e-04 | (3853.69 ms | 136048 tok/s) step 13489/76294 | train loss 3.459540 | norm 0.2735 | lr 3.74e-04 | (3797.81 ms | 138050 tok/s) step 13490/76294 | train loss 3.337611 | norm 0.5658 | lr 3.74e-04 | (3828.21 ms | 136954 tok/s) step 13491/76294 | train loss 3.388441 | norm 0.3885 | lr 3.74e-04 | (3797.41 ms | 138065 tok/s) step 13492/76294 | train loss 3.349827 | norm 0.3861 | lr 3.74e-04 | (3837.11 ms | 136636 tok/s) step 13493/76294 | train loss 3.555703 | norm 0.2825 | lr 3.74e-04 | (3800.28 
ms | 137960 tok/s) step 13494/76294 | train loss 3.393645 | norm 0.5042 | lr 3.73e-04 | (3856.07 ms | 135964 tok/s) step 13495/76294 | train loss 3.373760 | norm 0.3511 | lr 3.73e-04 | (3799.61 ms | 137985 tok/s) step 13496/76294 | train loss 3.359523 | norm 0.3825 | lr 3.73e-04 | (3807.04 ms | 137715 tok/s) step 13497/76294 | train loss 3.398938 | norm 0.2382 | lr 3.73e-04 | (3863.35 ms | 135708 tok/s) step 13498/76294 | train loss 3.382166 | norm 0.3604 | lr 3.73e-04 | (3802.68 ms | 137873 tok/s) step 13499/76294 | train loss 3.368546 | norm 0.3018 | lr 3.73e-04 | (3802.10 ms | 137894 tok/s) step 13500/76294 | train loss 3.362641 | norm 0.2767 | lr 3.73e-04 | (3878.02 ms | 135195 tok/s) val loss: 3.377223 saving model checkpoint to ./results/gpt2-124M-gqa/step_13500.pth step 13501/76294 | train loss 3.397082 | norm 0.2357 | lr 3.73e-04 | (3812.16 ms | 137530 tok/s) step 13502/76294 | train loss 3.302026 | norm 0.3330 | lr 3.73e-04 | (3822.70 ms | 137151 tok/s) step 13503/76294 | train loss 3.394027 | norm 0.2611 | lr 3.73e-04 | (3807.40 ms | 137702 tok/s) step 13504/76294 | train loss 3.398896 | norm 0.3015 | lr 3.73e-04 | (3823.73 ms | 137114 tok/s) step 13505/76294 | train loss 3.371965 | norm 0.2376 | lr 3.73e-04 | (3802.00 ms | 137898 tok/s) step 13506/76294 | train loss 3.365106 | norm 0.3301 | lr 3.73e-04 | (3798.62 ms | 138021 tok/s) step 13507/76294 | train loss 3.352670 | norm 0.3966 | lr 3.72e-04 | (3836.81 ms | 136647 tok/s) step 13508/76294 | train loss 3.412850 | norm 0.2072 | lr 3.72e-04 | (3799.75 ms | 137980 tok/s) step 13509/76294 | train loss 3.373158 | norm 0.3170 | lr 3.72e-04 | (3804.16 ms | 137820 tok/s) step 13510/76294 | train loss 3.461445 | norm 0.5762 | lr 3.72e-04 | (3909.30 ms | 134113 tok/s) step 13511/76294 | train loss 3.330735 | norm 0.2910 | lr 3.72e-04 | (3803.66 ms | 137838 tok/s) step 13512/76294 | train loss 3.418104 | norm 0.2154 | lr 3.72e-04 | (3819.82 ms | 137255 tok/s) step 13513/76294 | train loss 3.386816 | norm 0.3492 
| lr 3.72e-04 | (3824.67 ms | 137081 tok/s) step 13514/76294 | train loss 3.385877 | norm 0.2469 | lr 3.72e-04 | (3869.23 ms | 135502 tok/s) step 13515/76294 | train loss 3.377869 | norm 0.2318 | lr 3.72e-04 | (3809.57 ms | 137624 tok/s) step 13516/76294 | train loss 3.394259 | norm 0.3093 | lr 3.72e-04 | (3804.30 ms | 137815 tok/s) step 13517/76294 | train loss 3.342522 | norm 0.2805 | lr 3.72e-04 | (3966.18 ms | 132190 tok/s) step 13518/76294 | train loss 3.347280 | norm 0.4454 | lr 3.72e-04 | (3782.33 ms | 138615 tok/s) step 13519/76294 | train loss 3.382755 | norm 0.4046 | lr 3.72e-04 | (3789.58 ms | 138350 tok/s) step 13520/76294 | train loss 3.411433 | norm 0.2542 | lr 3.71e-04 | (3788.69 ms | 138382 tok/s) step 13521/76294 | train loss 3.390887 | norm 0.6187 | lr 3.71e-04 | (3817.40 ms | 137342 tok/s) step 13522/76294 | train loss 3.353749 | norm 0.4414 | lr 3.71e-04 | (3810.79 ms | 137580 tok/s) step 13523/76294 | train loss 3.378457 | norm 0.2763 | lr 3.71e-04 | (3824.63 ms | 137082 tok/s) step 13524/76294 | train loss 3.389200 | norm 0.4087 | lr 3.71e-04 | (3814.26 ms | 137455 tok/s) step 13525/76294 | train loss 3.427461 | norm 0.4317 | lr 3.71e-04 | (3955.15 ms | 132558 tok/s) step 13526/76294 | train loss 3.353812 | norm 0.7957 | lr 3.71e-04 | (3802.36 ms | 137885 tok/s) step 13527/76294 | train loss 3.360560 | norm 0.5071 | lr 3.71e-04 | (3799.52 ms | 137988 tok/s) step 13528/76294 | train loss 3.406408 | norm 0.3535 | lr 3.71e-04 | (3801.25 ms | 137925 tok/s) step 13529/76294 | train loss 3.422863 | norm 0.6025 | lr 3.71e-04 | (3797.51 ms | 138061 tok/s) step 13530/76294 | train loss 3.399248 | norm 0.2884 | lr 3.71e-04 | (3801.36 ms | 137921 tok/s) step 13531/76294 | train loss 3.367979 | norm 0.3668 | lr 3.71e-04 | (3814.98 ms | 137429 tok/s) step 13532/76294 | train loss 3.507957 | norm 0.3643 | lr 3.71e-04 | (3798.58 ms | 138022 tok/s) step 13533/76294 | train loss 3.375826 | norm 0.4756 | lr 3.70e-04 | (3802.02 ms | 137897 tok/s) step 
13534/76294 | train loss 3.384722 | norm 0.3240 | lr 3.70e-04 | (3821.42 ms | 137197 tok/s) step 13535/76294 | train loss 3.392210 | norm 0.4050 | lr 3.70e-04 | (3802.26 ms | 137888 tok/s) step 13536/76294 | train loss 3.391409 | norm 0.3488 | lr 3.70e-04 | (3808.29 ms | 137670 tok/s) step 13537/76294 | train loss 3.361944 | norm 0.3256 | lr 3.70e-04 | (3810.27 ms | 137599 tok/s) step 13538/76294 | train loss 3.478902 | norm 0.6104 | lr 3.70e-04 | (3840.72 ms | 136508 tok/s) step 13539/76294 | train loss 3.405917 | norm 0.3745 | lr 3.70e-04 | (3804.37 ms | 137812 tok/s) step 13540/76294 | train loss 3.342055 | norm 0.3058 | lr 3.70e-04 | (3876.56 ms | 135246 tok/s) step 13541/76294 | train loss 3.373756 | norm 0.2967 | lr 3.70e-04 | (3801.71 ms | 137908 tok/s) step 13542/76294 | train loss 3.351649 | norm 0.3631 | lr 3.70e-04 | (3807.30 ms | 137706 tok/s) step 13543/76294 | train loss 3.425186 | norm 0.4338 | lr 3.70e-04 | (3826.22 ms | 137025 tok/s) step 13544/76294 | train loss 3.401065 | norm 0.3299 | lr 3.70e-04 | (3805.73 ms | 137763 tok/s) step 13545/76294 | train loss 3.393328 | norm 0.3039 | lr 3.70e-04 | (3849.74 ms | 136188 tok/s) step 13546/76294 | train loss 3.446046 | norm 0.2557 | lr 3.69e-04 | (3804.92 ms | 137792 tok/s) step 13547/76294 | train loss 3.365573 | norm 0.2591 | lr 3.69e-04 | (3830.85 ms | 136859 tok/s) step 13548/76294 | train loss 3.345970 | norm 0.2937 | lr 3.69e-04 | (3805.68 ms | 137764 tok/s) step 13549/76294 | train loss 3.423535 | norm 0.6459 | lr 3.69e-04 | (3809.33 ms | 137632 tok/s) step 13550/76294 | train loss 3.383574 | norm 0.2845 | lr 3.69e-04 | (3821.96 ms | 137178 tok/s) step 13551/76294 | train loss 3.375586 | norm 0.4088 | lr 3.69e-04 | (3808.11 ms | 137677 tok/s) step 13552/76294 | train loss 3.398678 | norm 0.2567 | lr 3.69e-04 | (3920.87 ms | 133717 tok/s) step 13553/76294 | train loss 3.455117 | norm 0.3654 | lr 3.69e-04 | (3804.77 ms | 137798 tok/s) step 13554/76294 | train loss 3.441894 | norm 0.3612 | lr 
3.69e-04 | (3807.60 ms | 137695 tok/s) step 13555/76294 | train loss 3.387516 | norm 0.2396 | lr 3.69e-04 | (3810.90 ms | 137576 tok/s) step 13556/76294 | train loss 3.385487 | norm 0.4964 | lr 3.69e-04 | (3901.13 ms | 134394 tok/s) step 13557/76294 | train loss 3.399586 | norm 0.5209 | lr 3.69e-04 | (3804.49 ms | 137808 tok/s) step 13558/76294 | train loss 3.391233 | norm 0.2837 | lr 3.69e-04 | (3801.12 ms | 137930 tok/s) step 13559/76294 | train loss 3.414135 | norm 0.3045 | lr 3.68e-04 | (3825.48 ms | 137052 tok/s) step 13560/76294 | train loss 3.362431 | norm 0.5074 | lr 3.68e-04 | (3810.05 ms | 137607 tok/s) step 13561/76294 | train loss 3.324327 | norm 0.3994 | lr 3.68e-04 | (3814.27 ms | 137454 tok/s) step 13562/76294 | train loss 3.387543 | norm 0.3156 | lr 3.68e-04 | (3804.38 ms | 137812 tok/s) step 13563/76294 | train loss 3.403522 | norm 0.3065 | lr 3.68e-04 | (3811.06 ms | 137570 tok/s) step 13564/76294 | train loss 3.424511 | norm 0.5624 | lr 3.68e-04 | (3834.65 ms | 136724 tok/s) step 13565/76294 | train loss 3.407156 | norm 0.3821 | lr 3.68e-04 | (3811.44 ms | 137556 tok/s) step 13566/76294 | train loss 3.362949 | norm 0.4243 | lr 3.68e-04 | (3811.13 ms | 137568 tok/s) step 13567/76294 | train loss 3.455049 | norm 0.7547 | lr 3.68e-04 | (3804.84 ms | 137795 tok/s) step 13568/76294 | train loss 3.369016 | norm 0.6306 | lr 3.68e-04 | (3834.11 ms | 136743 tok/s) step 13569/76294 | train loss 3.497760 | norm 1.3394 | lr 3.68e-04 | (3811.26 ms | 137563 tok/s) step 13570/76294 | train loss 3.388041 | norm 0.6446 | lr 3.68e-04 | (3814.75 ms | 137437 tok/s) step 13571/76294 | train loss 3.406690 | norm 0.4544 | lr 3.68e-04 | (3803.59 ms | 137840 tok/s) step 13572/76294 | train loss 3.398216 | norm 0.4617 | lr 3.68e-04 | (3836.04 ms | 136674 tok/s) step 13573/76294 | train loss 3.350534 | norm 0.2972 | lr 3.67e-04 | (3804.04 ms | 137824 tok/s) step 13574/76294 | train loss 3.432328 | norm 0.4466 | lr 3.67e-04 | (3910.81 ms | 134061 tok/s) step 13575/76294 | 
train loss 3.326669 | norm 0.4405 | lr 3.67e-04 | (3848.86 ms | 136219 tok/s) step 13576/76294 | train loss 3.378814 | norm 0.3525 | lr 3.67e-04 | (3812.63 ms | 137514 tok/s) step 13577/76294 | train loss 3.437574 | norm 0.3715 | lr 3.67e-04 | (3974.15 ms | 131925 tok/s) step 13578/76294 | train loss 3.402614 | norm 0.3376 | lr 3.67e-04 | (3865.97 ms | 135616 tok/s) step 13579/76294 | train loss 3.471999 | norm 0.2986 | lr 3.67e-04 | (3828.91 ms | 136929 tok/s) step 13580/76294 | train loss 3.401204 | norm 0.3265 | lr 3.67e-04 | (3811.00 ms | 137572 tok/s) step 13581/76294 | train loss 3.396401 | norm 0.7778 | lr 3.67e-04 | (3804.27 ms | 137816 tok/s) step 13582/76294 | train loss 3.372224 | norm 0.2936 | lr 3.67e-04 | (3833.07 ms | 136780 tok/s) step 13583/76294 | train loss 3.405149 | norm 0.4043 | lr 3.67e-04 | (3804.15 ms | 137820 tok/s) step 13584/76294 | train loss 3.338308 | norm 0.3271 | lr 3.67e-04 | (3808.64 ms | 137658 tok/s) step 13585/76294 | train loss 3.427423 | norm 0.4877 | lr 3.67e-04 | (3829.75 ms | 136899 tok/s) step 13586/76294 | train loss 3.490033 | norm 0.3443 | lr 3.66e-04 | (3820.78 ms | 137220 tok/s) step 13587/76294 | train loss 3.392713 | norm 0.2616 | lr 3.66e-04 | (3804.08 ms | 137822 tok/s) step 13588/76294 | train loss 3.384093 | norm 0.3561 | lr 3.66e-04 | (3863.03 ms | 135719 tok/s) step 13589/76294 | train loss 3.389578 | norm 0.4028 | lr 3.66e-04 | (3808.02 ms | 137680 tok/s) step 13590/76294 | train loss 3.365155 | norm 0.4741 | lr 3.66e-04 | (3930.67 ms | 133384 tok/s) step 13591/76294 | train loss 3.415002 | norm 0.3813 | lr 3.66e-04 | (3861.47 ms | 135774 tok/s) step 13592/76294 | train loss 3.392645 | norm 0.3726 | lr 3.66e-04 | (3799.29 ms | 137996 tok/s) step 13593/76294 | train loss 3.430212 | norm 0.4520 | lr 3.66e-04 | (3829.55 ms | 136906 tok/s) step 13594/76294 | train loss 3.418766 | norm 0.5191 | lr 3.66e-04 | (3827.05 ms | 136995 tok/s) step 13595/76294 | train loss 3.423285 | norm 0.2837 | lr 3.66e-04 | (3887.66 
ms | 134860 tok/s) step 13596/76294 | train loss 3.457432 | norm 0.6829 | lr 3.66e-04 | (3804.28 ms | 137815 tok/s) step 13597/76294 | train loss 3.439539 | norm 0.2518 | lr 3.66e-04 | (3813.73 ms | 137474 tok/s) step 13598/76294 | train loss 3.395676 | norm 0.4150 | lr 3.66e-04 | (3832.00 ms | 136818 tok/s) step 13599/76294 | train loss 3.385396 | norm 0.3587 | lr 3.65e-04 | (3815.38 ms | 137414 tok/s) step 13600/76294 | train loss 3.405557 | norm 0.3475 | lr 3.65e-04 | (3855.87 ms | 135971 tok/s) step 13601/76294 | train loss 3.419226 | norm 0.3756 | lr 3.65e-04 | (3796.36 ms | 138103 tok/s) step 13602/76294 | train loss 3.439832 | norm 0.3289 | lr 3.65e-04 | (3820.04 ms | 137247 tok/s) step 13603/76294 | train loss 3.404169 | norm 0.3039 | lr 3.65e-04 | (3798.29 ms | 138033 tok/s) step 13604/76294 | train loss 3.322337 | norm 0.2722 | lr 3.65e-04 | (3796.39 ms | 138102 tok/s) step 13605/76294 | train loss 3.351176 | norm 0.4767 | lr 3.65e-04 | (3858.58 ms | 135876 tok/s) step 13606/76294 | train loss 3.381202 | norm 0.2433 | lr 3.65e-04 | (3800.14 ms | 137966 tok/s) step 13607/76294 | train loss 3.321617 | norm 0.3173 | lr 3.65e-04 | (3800.97 ms | 137935 tok/s) step 13608/76294 | train loss 3.365745 | norm 0.2634 | lr 3.65e-04 | (3835.65 ms | 136688 tok/s) step 13609/76294 | train loss 3.412321 | norm 0.3154 | lr 3.65e-04 | (3914.41 ms | 133938 tok/s) step 13610/76294 | train loss 3.383597 | norm 0.2278 | lr 3.65e-04 | (3804.36 ms | 137812 tok/s) step 13611/76294 | train loss 3.418501 | norm 0.2005 | lr 3.65e-04 | (3872.82 ms | 135376 tok/s) step 13612/76294 | train loss 3.402559 | norm 0.2154 | lr 3.64e-04 | (3884.28 ms | 134977 tok/s) step 13613/76294 | train loss 3.376368 | norm 0.2728 | lr 3.64e-04 | (3787.87 ms | 138412 tok/s) step 13614/76294 | train loss 3.433331 | norm 0.3938 | lr 3.64e-04 | (3791.46 ms | 138281 tok/s) step 13615/76294 | train loss 3.313345 | norm 0.2856 | lr 3.64e-04 | (3891.87 ms | 134714 tok/s) step 13616/76294 | train loss 3.436445 | 
norm 0.2561 | lr 3.64e-04 | (3779.65 ms | 138713 tok/s) step 13617/76294 | train loss 3.398188 | norm 0.2129 | lr 3.64e-04 | (3834.30 ms | 136736 tok/s) step 13618/76294 | train loss 3.393933 | norm 0.2203 | lr 3.64e-04 | (3783.00 ms | 138591 tok/s) step 13619/76294 | train loss 3.447230 | norm 0.4051 | lr 3.64e-04 | (3807.59 ms | 137695 tok/s) step 13620/76294 | train loss 3.340271 | norm 0.1984 | lr 3.64e-04 | (3806.88 ms | 137721 tok/s) step 13621/76294 | train loss 3.411204 | norm 0.2395 | lr 3.64e-04 | (3783.02 ms | 138590 tok/s) step 13622/76294 | train loss 3.400252 | norm 0.2899 | lr 3.64e-04 | (3788.35 ms | 138395 tok/s) step 13623/76294 | train loss 3.388247 | norm 0.2496 | lr 3.64e-04 | (3785.74 ms | 138490 tok/s) step 13624/76294 | train loss 3.583135 | norm 0.2807 | lr 3.64e-04 | (3800.20 ms | 137963 tok/s) step 13625/76294 | train loss 3.439396 | norm 0.2327 | lr 3.64e-04 | (3789.56 ms | 138351 tok/s) step 13626/76294 | train loss 3.391414 | norm 0.2129 | lr 3.63e-04 | (3795.52 ms | 138134 tok/s) step 13627/76294 | train loss 3.385232 | norm 0.2620 | lr 3.63e-04 | (3790.39 ms | 138320 tok/s) step 13628/76294 | train loss 3.421432 | norm 0.2738 | lr 3.63e-04 | (3918.33 ms | 133804 tok/s) step 13629/76294 | train loss 3.415705 | norm 0.2847 | lr 3.63e-04 | (3779.65 ms | 138713 tok/s) step 13630/76294 | train loss 3.403846 | norm 0.2726 | lr 3.63e-04 | (3785.97 ms | 138482 tok/s) step 13631/76294 | train loss 3.454944 | norm 0.1911 | lr 3.63e-04 | (3806.77 ms | 137725 tok/s) step 13632/76294 | train loss 3.382518 | norm 0.2901 | lr 3.63e-04 | (3787.80 ms | 138415 tok/s) step 13633/76294 | train loss 3.422077 | norm 0.2098 | lr 3.63e-04 | (3793.15 ms | 138220 tok/s) step 13634/76294 | train loss 3.414517 | norm 0.2601 | lr 3.63e-04 | (3791.31 ms | 138287 tok/s) step 13635/76294 | train loss 3.626229 | norm 0.2652 | lr 3.63e-04 | (3838.03 ms | 136603 tok/s) step 13636/76294 | train loss 3.301011 | norm 0.2709 | lr 3.63e-04 | (3787.13 ms | 138439 tok/s) 
step 13637/76294 | train loss 3.363425 | norm 0.4650 | lr 3.63e-04 | (3792.15 ms | 138256 tok/s) step 13638/76294 | train loss 3.453008 | norm 0.2256 | lr 3.63e-04 | (3812.56 ms | 137516 tok/s) step 13639/76294 | train loss 3.408464 | norm 0.2673 | lr 3.62e-04 | (3786.94 ms | 138446 tok/s) step 13640/76294 | train loss 3.417793 | norm 0.4492 | lr 3.62e-04 | (3792.00 ms | 138262 tok/s) step 13641/76294 | train loss 3.376262 | norm 0.4038 | lr 3.62e-04 | (3814.56 ms | 137444 tok/s) step 13642/76294 | train loss 3.365075 | norm 0.3155 | lr 3.62e-04 | (3790.87 ms | 138303 tok/s) step 13643/76294 | train loss 3.458875 | norm 0.3631 | lr 3.62e-04 | (3820.21 ms | 137241 tok/s) step 13644/76294 | train loss 3.396115 | norm 0.3507 | lr 3.62e-04 | (3793.08 ms | 138222 tok/s) step 13645/76294 | train loss 3.421809 | norm 0.5067 | lr 3.62e-04 | (3804.03 ms | 137824 tok/s) step 13646/76294 | train loss 3.376527 | norm 0.3851 | lr 3.62e-04 | (3814.78 ms | 137436 tok/s) step 13647/76294 | train loss 3.395969 | norm 0.3211 | lr 3.62e-04 | (3838.20 ms | 136597 tok/s) step 13648/76294 | train loss 3.375420 | norm 0.2926 | lr 3.62e-04 | (4121.80 ms | 127199 tok/s) step 13649/76294 | train loss 3.411756 | norm 0.3157 | lr 3.62e-04 | (3816.97 ms | 137357 tok/s) step 13650/76294 | train loss 3.369813 | norm 0.2406 | lr 3.62e-04 | (3811.71 ms | 137547 tok/s) step 13651/76294 | train loss 3.410967 | norm 0.2987 | lr 3.62e-04 | (5002.94 ms | 104796 tok/s) step 13652/76294 | train loss 3.398405 | norm 0.5345 | lr 3.61e-04 | (3793.29 ms | 138215 tok/s) step 13653/76294 | train loss 3.361320 | norm 1.3376 | lr 3.61e-04 | (3796.23 ms | 138107 tok/s) step 13654/76294 | train loss 3.382568 | norm 0.4553 | lr 3.61e-04 | (3795.90 ms | 138120 tok/s) step 13655/76294 | train loss 3.389009 | norm 0.6124 | lr 3.61e-04 | (3798.21 ms | 138036 tok/s) step 13656/76294 | train loss 3.420413 | norm 0.6605 | lr 3.61e-04 | (3795.50 ms | 138134 tok/s) step 13657/76294 | train loss 3.383553 | norm 0.8136 | lr 
3.61e-04 | (3876.61 ms | 135244 tok/s) step 13658/76294 | train loss 3.384921 | norm 0.5666 | lr 3.61e-04 | (3788.70 ms | 138382 tok/s) step 13659/76294 | train loss 3.416532 | norm 0.3972 | lr 3.61e-04 | (3820.55 ms | 137228 tok/s) step 13660/76294 | train loss 3.373974 | norm 0.5140 | lr 3.61e-04 | (3815.54 ms | 137408 tok/s) step 13661/76294 | train loss 3.414462 | norm 0.5832 | lr 3.61e-04 | (3795.38 ms | 138138 tok/s) step 13662/76294 | train loss 3.399640 | norm 0.5276 | lr 3.61e-04 | (3798.51 ms | 138025 tok/s) step 13663/76294 | train loss 3.469415 | norm 0.2607 | lr 3.61e-04 | (3800.56 ms | 137950 tok/s) step 13664/76294 | train loss 3.409601 | norm 0.8676 | lr 3.61e-04 | (3793.55 ms | 138205 tok/s) step 13665/76294 | train loss 3.322682 | norm 0.3792 | lr 3.61e-04 | (3834.72 ms | 136721 tok/s) step 13666/76294 | train loss 3.408133 | norm 0.3685 | lr 3.60e-04 | (3793.52 ms | 138206 tok/s) step 13667/76294 | train loss 3.347189 | norm 0.2773 | lr 3.60e-04 | (3800.08 ms | 137968 tok/s) step 13668/76294 | train loss 3.392918 | norm 0.8237 | lr 3.60e-04 | (3814.77 ms | 137436 tok/s) step 13669/76294 | train loss 3.325929 | norm 0.2669 | lr 3.60e-04 | (3795.03 ms | 138151 tok/s) step 13670/76294 | train loss 3.428351 | norm 0.4184 | lr 3.60e-04 | (3791.12 ms | 138294 tok/s) step 13671/76294 | train loss 3.348605 | norm 0.4813 | lr 3.60e-04 | (3835.24 ms | 136703 tok/s) step 13672/76294 | train loss 3.411653 | norm 0.2752 | lr 3.60e-04 | (3794.07 ms | 138186 tok/s) step 13673/76294 | train loss 3.543804 | norm 0.8730 | lr 3.60e-04 | (3879.62 ms | 135139 tok/s) step 13674/76294 | train loss 3.313987 | norm 0.2746 | lr 3.60e-04 | (3791.70 ms | 138273 tok/s) step 13675/76294 | train loss 3.463029 | norm 0.6297 | lr 3.60e-04 | (3796.81 ms | 138086 tok/s) step 13676/76294 | train loss 3.390354 | norm 0.3436 | lr 3.60e-04 | (3819.80 ms | 137255 tok/s) step 13677/76294 | train loss 3.364357 | norm 0.4742 | lr 3.60e-04 | (3799.77 ms | 137979 tok/s) step 13678/76294 | 
train loss 3.402612 | norm 0.6242 | lr 3.60e-04 | (3804.24 ms | 137817 tok/s) step 13679/76294 | train loss 3.457563 | norm 0.4246 | lr 3.59e-04 | (3925.41 ms | 133563 tok/s) step 13680/76294 | train loss 3.406109 | norm 0.3636 | lr 3.59e-04 | (3983.17 ms | 131626 tok/s) step 13681/76294 | train loss 3.397418 | norm 0.5171 | lr 3.59e-04 | (3796.27 ms | 138106 tok/s) step 13682/76294 | train loss 3.379620 | norm 0.2156 | lr 3.59e-04 | (3854.88 ms | 136006 tok/s) step 13683/76294 | train loss 3.401320 | norm 0.5926 | lr 3.59e-04 | (3794.27 ms | 138179 tok/s) step 13684/76294 | train loss 3.372365 | norm 0.3276 | lr 3.59e-04 | (3798.67 ms | 138019 tok/s) step 13685/76294 | train loss 3.334546 | norm 0.2638 | lr 3.59e-04 | (3829.64 ms | 136903 tok/s) step 13686/76294 | train loss 3.453892 | norm 0.3597 | lr 3.59e-04 | (3796.37 ms | 138103 tok/s) step 13687/76294 | train loss 3.353475 | norm 0.2629 | lr 3.59e-04 | (3797.10 ms | 138076 tok/s) step 13688/76294 | train loss 3.343082 | norm 0.2797 | lr 3.59e-04 | (3794.75 ms | 138162 tok/s) step 13689/76294 | train loss 3.387703 | norm 0.2798 | lr 3.59e-04 | (3801.10 ms | 137931 tok/s) step 13690/76294 | train loss 3.314884 | norm 0.2329 | lr 3.59e-04 | (3797.63 ms | 138057 tok/s) step 13691/76294 | train loss 3.468489 | norm 0.3429 | lr 3.59e-04 | (3812.43 ms | 137521 tok/s) step 13692/76294 | train loss 3.334135 | norm 0.2294 | lr 3.58e-04 | (4052.09 ms | 129387 tok/s) step 13693/76294 | train loss 3.445804 | norm 0.3661 | lr 3.58e-04 | (3797.58 ms | 138059 tok/s) step 13694/76294 | train loss 3.377185 | norm 0.3194 | lr 3.58e-04 | (3835.16 ms | 136706 tok/s) step 13695/76294 | train loss 3.324559 | norm 0.4725 | lr 3.58e-04 | (3793.43 ms | 138209 tok/s) step 13696/76294 | train loss 3.436473 | norm 0.4357 | lr 3.58e-04 | (3820.47 ms | 137231 tok/s) step 13697/76294 | train loss 3.368397 | norm 0.4675 | lr 3.58e-04 | (3807.92 ms | 137684 tok/s) step 13698/76294 | train loss 3.391018 | norm 0.4014 | lr 3.58e-04 | (3835.06 
ms | 136709 tok/s) step 13699/76294 | train loss 3.376826 | norm 0.2632 | lr 3.58e-04 | (3837.65 ms | 136617 tok/s) step 13700/76294 | train loss 3.391415 | norm 0.3457 | lr 3.58e-04 | (3807.57 ms | 137696 tok/s) step 13701/76294 | train loss 3.443203 | norm 0.5683 | lr 3.58e-04 | (3795.05 ms | 138150 tok/s) step 13702/76294 | train loss 3.420936 | norm 0.6047 | lr 3.58e-04 | (3880.75 ms | 135100 tok/s) step 13703/76294 | train loss 3.399832 | norm 0.4311 | lr 3.58e-04 | (3794.31 ms | 138178 tok/s) step 13704/76294 | train loss 3.348479 | norm 0.5359 | lr 3.58e-04 | (3801.84 ms | 137904 tok/s) step 13705/76294 | train loss 3.395584 | norm 0.3216 | lr 3.58e-04 | (3813.16 ms | 137494 tok/s) step 13706/76294 | train loss 3.381543 | norm 0.4757 | lr 3.57e-04 | (3803.21 ms | 137854 tok/s) step 13707/76294 | train loss 3.350106 | norm 0.2840 | lr 3.57e-04 | (3807.53 ms | 137697 tok/s) step 13708/76294 | train loss 3.429332 | norm 0.3649 | lr 3.57e-04 | (3798.73 ms | 138017 tok/s) step 13709/76294 | train loss 3.362969 | norm 0.2671 | lr 3.57e-04 | (3812.88 ms | 137504 tok/s) step 13710/76294 | train loss 3.409630 | norm 0.4060 | lr 3.57e-04 | (3841.19 ms | 136491 tok/s) step 13711/76294 | train loss 3.382221 | norm 0.2402 | lr 3.57e-04 | (3805.40 ms | 137775 tok/s) step 13712/76294 | train loss 3.292058 | norm 0.3966 | lr 3.57e-04 | (3926.80 ms | 133515 tok/s) step 13713/76294 | train loss 3.426324 | norm 0.5401 | lr 3.57e-04 | (3852.73 ms | 136082 tok/s) step 13714/76294 | train loss 3.352221 | norm 1.5568 | lr 3.57e-04 | (3806.48 ms | 137736 tok/s) step 13715/76294 | train loss 3.385046 | norm 0.5778 | lr 3.57e-04 | (3793.87 ms | 138193 tok/s) step 13716/76294 | train loss 3.361441 | norm 0.5847 | lr 3.57e-04 | (3830.70 ms | 136865 tok/s) step 13717/76294 | train loss 3.374467 | norm 0.3979 | lr 3.57e-04 | (3795.67 ms | 138128 tok/s) step 13718/76294 | train loss 3.373712 | norm 0.6309 | lr 3.57e-04 | (3798.34 ms | 138031 tok/s) step 13719/76294 | train loss 3.392121 | 
norm 0.4684 | lr 3.56e-04 | (3820.63 ms | 137226 tok/s) step 13720/76294 | train loss 3.394387 | norm 0.8888 | lr 3.56e-04 | (3801.13 ms | 137930 tok/s) step 13721/76294 | train loss 3.351359 | norm 0.5203 | lr 3.56e-04 | (3795.55 ms | 138132 tok/s) step 13722/76294 | train loss 3.412934 | norm 0.4412 | lr 3.56e-04 | (3850.69 ms | 136154 tok/s) step 13723/76294 | train loss 3.352847 | norm 0.4800 | lr 3.56e-04 | (3798.66 ms | 138019 tok/s) step 13724/76294 | train loss 3.399566 | norm 0.3914 | lr 3.56e-04 | (3898.07 ms | 134499 tok/s) step 13725/76294 | train loss 3.440459 | norm 0.4108 | lr 3.56e-04 | (3785.40 ms | 138503 tok/s) step 13726/76294 | train loss 3.426885 | norm 0.2899 | lr 3.56e-04 | (3810.03 ms | 137607 tok/s) step 13727/76294 | train loss 3.410773 | norm 0.2412 | lr 3.56e-04 | (3810.83 ms | 137578 tok/s) step 13728/76294 | train loss 3.299248 | norm 0.3335 | lr 3.56e-04 | (3792.46 ms | 138245 tok/s) step 13729/76294 | train loss 3.400151 | norm 0.2906 | lr 3.56e-04 | (3810.05 ms | 137606 tok/s) step 13730/76294 | train loss 3.374251 | norm 0.3427 | lr 3.56e-04 | (3791.63 ms | 138275 tok/s) step 13731/76294 | train loss 3.357908 | norm 0.3558 | lr 3.56e-04 | (3828.59 ms | 136940 tok/s) step 13732/76294 | train loss 3.417687 | norm 0.3495 | lr 3.56e-04 | (3817.20 ms | 137349 tok/s) step 13733/76294 | train loss 3.412690 | norm 0.4342 | lr 3.55e-04 | (3801.33 ms | 137922 tok/s) step 13734/76294 | train loss 3.475947 | norm 0.4478 | lr 3.55e-04 | (3843.34 ms | 136415 tok/s) step 13735/76294 | train loss 3.342511 | norm 0.5878 | lr 3.55e-04 | (3798.70 ms | 138018 tok/s) step 13736/76294 | train loss 3.437288 | norm 0.7994 | lr 3.55e-04 | (3799.78 ms | 137979 tok/s) step 13737/76294 | train loss 3.361834 | norm 0.4459 | lr 3.55e-04 | (3796.72 ms | 138090 tok/s) step 13738/76294 | train loss 3.364120 | norm 0.3725 | lr 3.55e-04 | (3799.09 ms | 138003 tok/s) step 13739/76294 | train loss 3.356405 | norm 0.4119 | lr 3.55e-04 | (3798.17 ms | 138037 tok/s) 
step 13740/76294 | train loss 3.333768 | norm 0.3733 | lr 3.55e-04 | (3824.13 ms | 137100 tok/s) step 13741/76294 | train loss 3.411203 | norm 0.2934 | lr 3.55e-04 | (3798.68 ms | 138018 tok/s) step 13742/76294 | train loss 3.400055 | norm 0.6059 | lr 3.55e-04 | (3792.28 ms | 138251 tok/s) step 13743/76294 | train loss 3.302882 | norm 0.3344 | lr 3.55e-04 | (3824.04 ms | 137103 tok/s) step 13744/76294 | train loss 3.377130 | norm 0.4365 | lr 3.55e-04 | (3880.77 ms | 135099 tok/s) step 13745/76294 | train loss 3.406835 | norm 0.3814 | lr 3.55e-04 | (3865.53 ms | 135631 tok/s) step 13746/76294 | train loss 3.513481 | norm 0.3271 | lr 3.54e-04 | (3795.90 ms | 138120 tok/s) step 13747/76294 | train loss 3.374649 | norm 0.2902 | lr 3.54e-04 | (3822.99 ms | 137141 tok/s) step 13748/76294 | train loss 3.409213 | norm 0.2963 | lr 3.54e-04 | (3793.64 ms | 138202 tok/s) step 13749/76294 | train loss 3.384871 | norm 0.4077 | lr 3.54e-04 | (3800.02 ms | 137970 tok/s) step 13750/76294 | train loss 3.390553 | norm 0.2934 | lr 3.54e-04 | (3817.15 ms | 137351 tok/s) val loss: 3.374658 saving model checkpoint to ./results/gpt2-124M-gqa/step_13750.pth step 13751/76294 | train loss 3.431381 | norm 0.2778 | lr 3.54e-04 | (3832.09 ms | 136815 tok/s) step 13752/76294 | train loss 3.380283 | norm 0.2911 | lr 3.54e-04 | (3792.05 ms | 138260 tok/s) step 13753/76294 | train loss 3.415977 | norm 0.3449 | lr 3.54e-04 | (3839.11 ms | 136565 tok/s) step 13754/76294 | train loss 3.389924 | norm 0.2829 | lr 3.54e-04 | (3791.65 ms | 138274 tok/s) step 13755/76294 | train loss 3.449028 | norm 0.3286 | lr 3.54e-04 | (3913.05 ms | 133985 tok/s) step 13756/76294 | train loss 3.405318 | norm 0.2331 | lr 3.54e-04 | (3786.84 ms | 138450 tok/s) step 13757/76294 | train loss 3.340160 | norm 0.2774 | lr 3.54e-04 | (3799.72 ms | 137981 tok/s) step 13758/76294 | train loss 3.314530 | norm 0.2518 | lr 3.54e-04 | (3791.16 ms | 138292 tok/s) step 13759/76294 | train loss 3.392600 | norm 0.2426 | lr 3.54e-04 | 
(3817.54 ms | 137337 tok/s) step 13760/76294 | train loss 3.370870 | norm 0.2555 | lr 3.53e-04 | (3796.57 ms | 138095 tok/s) step 13761/76294 | train loss 3.459204 | norm 0.3356 | lr 3.53e-04 | (3797.71 ms | 138054 tok/s) step 13762/76294 | train loss 3.402656 | norm 0.3027 | lr 3.53e-04 | (3902.76 ms | 134338 tok/s) step 13763/76294 | train loss 3.392673 | norm 0.2540 | lr 3.53e-04 | (3783.94 ms | 138556 tok/s) step 13764/76294 | train loss 3.441993 | norm 0.2826 | lr 3.53e-04 | (3841.43 ms | 136483 tok/s) step 13765/76294 | train loss 3.387344 | norm 0.3302 | lr 3.53e-04 | (3899.44 ms | 134452 tok/s) step 13766/76294 | train loss 3.386781 | norm 0.3186 | lr 3.53e-04 | (3791.47 ms | 138281 tok/s) step 13767/76294 | train loss 3.368930 | norm 0.2779 | lr 3.53e-04 | (3844.09 ms | 136388 tok/s) step 13768/76294 | train loss 3.355359 | norm 0.1968 | lr 3.53e-04 | (3790.06 ms | 138333 tok/s) step 13769/76294 | train loss 3.440578 | norm 0.1980 | lr 3.53e-04 | (4041.90 ms | 129713 tok/s) step 13770/76294 | train loss 3.381928 | norm 0.3099 | lr 3.53e-04 | (3775.68 ms | 138859 tok/s) step 13771/76294 | train loss 3.399322 | norm 0.2032 | lr 3.53e-04 | (3827.38 ms | 136984 tok/s) step 13772/76294 | train loss 3.424601 | norm 0.2146 | lr 3.53e-04 | (3779.28 ms | 138727 tok/s) step 13773/76294 | train loss 3.338667 | norm 0.2647 | lr 3.52e-04 | (3788.26 ms | 138398 tok/s) step 13774/76294 | train loss 3.437581 | norm 0.3417 | lr 3.52e-04 | (3832.58 ms | 136798 tok/s) step 13775/76294 | train loss 3.370376 | norm 0.2651 | lr 3.52e-04 | (3790.33 ms | 138323 tok/s) step 13776/76294 | train loss 3.377020 | norm 0.4217 | lr 3.52e-04 | (3796.13 ms | 138111 tok/s) step 13777/76294 | train loss 3.359094 | norm 0.3147 | lr 3.52e-04 | (3788.62 ms | 138385 tok/s) step 13778/76294 | train loss 3.378861 | norm 0.7442 | lr 3.52e-04 | (3833.08 ms | 136780 tok/s) step 13779/76294 | train loss 3.404583 | norm 1.0847 | lr 3.52e-04 | (3789.40 ms | 138356 tok/s) step 13780/76294 | train loss 
3.358561 | norm 0.3053 | lr 3.52e-04 | (3792.18 ms | 138255 tok/s) step 13781/76294 | train loss 3.393362 | norm 0.4578 | lr 3.52e-04 | (3795.76 ms | 138125 tok/s) step 13782/76294 | train loss 3.367923 | norm 0.2669 | lr 3.52e-04 | (3787.39 ms | 138430 tok/s) step 13783/76294 | train loss 3.381008 | norm 0.3492 | lr 3.52e-04 | (3820.71 ms | 137223 tok/s) step 13784/76294 | train loss 3.373997 | norm 0.5152 | lr 3.52e-04 | (3787.41 ms | 138429 tok/s) step 13785/76294 | train loss 3.451069 | norm 0.5331 | lr 3.52e-04 | (3812.38 ms | 137523 tok/s) step 13786/76294 | train loss 3.374059 | norm 0.3255 | lr 3.52e-04 | (3825.14 ms | 137064 tok/s) step 13787/76294 | train loss 3.364641 | norm 0.3196 | lr 3.51e-04 | (3794.60 ms | 138167 tok/s) step 13788/76294 | train loss 3.438695 | norm 0.3777 | lr 3.51e-04 | (3813.38 ms | 137486 tok/s) step 13789/76294 | train loss 3.360547 | norm 0.2950 | lr 3.51e-04 | (3808.89 ms | 137648 tok/s) step 13790/76294 | train loss 3.360405 | norm 0.3223 | lr 3.51e-04 | (3796.63 ms | 138093 tok/s) step 13791/76294 | train loss 3.430530 | norm 0.4649 | lr 3.51e-04 | (3808.59 ms | 137659 tok/s) step 13792/76294 | train loss 3.398604 | norm 0.2772 | lr 3.51e-04 | (3791.20 ms | 138291 tok/s) step 13793/76294 | train loss 3.430350 | norm 0.3963 | lr 3.51e-04 | (3795.33 ms | 138140 tok/s) step 13794/76294 | train loss 3.376549 | norm 0.2987 | lr 3.51e-04 | (3842.45 ms | 136446 tok/s) step 13795/76294 | train loss 3.381211 | norm 0.3164 | lr 3.51e-04 | (3791.97 ms | 138263 tok/s) step 13796/76294 | train loss 3.375245 | norm 0.5766 | lr 3.51e-04 | (3851.01 ms | 136143 tok/s) step 13797/76294 | train loss 3.400135 | norm 0.3727 | lr 3.51e-04 | (3790.17 ms | 138329 tok/s) step 13798/76294 | train loss 3.386184 | norm 0.5081 | lr 3.51e-04 | (3794.79 ms | 138160 tok/s) step 13799/76294 | train loss 3.374532 | norm 0.4839 | lr 3.51e-04 | (3810.62 ms | 137586 tok/s) step 13800/76294 | train loss 3.565622 | norm 0.3808 | lr 3.50e-04 | (3798.66 ms | 138019 
tok/s) step 13801/76294 | train loss 3.444485 | norm 0.5961 | lr 3.50e-04 | (3792.02 ms | 138261 tok/s) step 13802/76294 | train loss 3.378737 | norm 0.3117 | lr 3.50e-04 | (3845.67 ms | 136332 tok/s) step 13803/76294 | train loss 3.391967 | norm 0.3410 | lr 3.50e-04 | (3792.31 ms | 138250 tok/s) step 13804/76294 | train loss 3.428809 | norm 0.2487 | lr 3.50e-04 | (3795.65 ms | 138129 tok/s) step 13805/76294 | train loss 3.377016 | norm 0.2950 | lr 3.50e-04 | (3817.96 ms | 137322 tok/s) step 13806/76294 | train loss 3.369687 | norm 0.2248 | lr 3.50e-04 | (3802.10 ms | 137894 tok/s) step 13807/76294 | train loss 3.417703 | norm 0.2753 | lr 3.50e-04 | (3844.92 ms | 136359 tok/s) step 13808/76294 | train loss 3.345457 | norm 0.2432 | lr 3.50e-04 | (3795.78 ms | 138124 tok/s) step 13809/76294 | train loss 3.431975 | norm 0.2832 | lr 3.50e-04 | (3851.33 ms | 136132 tok/s) step 13810/76294 | train loss 3.344441 | norm 0.6318 | lr 3.50e-04 | (3794.16 ms | 138183 tok/s) step 13811/76294 | train loss 3.405444 | norm 0.3745 | lr 3.50e-04 | (3850.97 ms | 136144 tok/s) step 13812/76294 | train loss 3.389643 | norm 0.7678 | lr 3.50e-04 | (3795.22 ms | 138144 tok/s) step 13813/76294 | train loss 3.454430 | norm 1.8823 | lr 3.50e-04 | (3846.91 ms | 136288 tok/s) step 13814/76294 | train loss 3.434953 | norm 1.4539 | lr 3.49e-04 | (3795.50 ms | 138134 tok/s) step 13815/76294 | train loss 3.371316 | norm 0.8651 | lr 3.49e-04 | (3799.97 ms | 137972 tok/s) step 13816/76294 | train loss 3.458416 | norm 0.4242 | lr 3.49e-04 | (3791.89 ms | 138266 tok/s) step 13817/76294 | train loss 3.396430 | norm 0.4053 | lr 3.49e-04 | (3925.41 ms | 133563 tok/s) step 13818/76294 | train loss 3.343017 | norm 0.9609 | lr 3.49e-04 | (3812.25 ms | 137527 tok/s) step 13819/76294 | train loss 3.410113 | norm 0.5156 | lr 3.49e-04 | (3805.69 ms | 137764 tok/s) step 13820/76294 | train loss 3.344512 | norm 0.6623 | lr 3.49e-04 | (3827.98 ms | 136962 tok/s) step 13821/76294 | train loss 3.399963 | norm 0.3605 
| lr 3.49e-04 | (3832.59 ms | 136797 tok/s) step 13822/76294 | train loss 3.403862 | norm 0.7232 | lr 3.49e-04 | (3801.34 ms | 137922 tok/s) step 13823/76294 | train loss 3.276257 | norm 0.5134 | lr 3.49e-04 | (3847.34 ms | 136273 tok/s) step 13824/76294 | train loss 3.443527 | norm 0.6564 | lr 3.49e-04 | (3802.19 ms | 137891 tok/s) step 13825/76294 | train loss 3.304403 | norm 0.3290 | lr 3.49e-04 | (3801.50 ms | 137916 tok/s) step 13826/76294 | train loss 3.310131 | norm 1.3987 | lr 3.49e-04 | (3818.96 ms | 137286 tok/s) step 13827/76294 | train loss 3.328613 | norm 0.6388 | lr 3.48e-04 | (3822.16 ms | 137171 tok/s) step 13828/76294 | train loss 3.401636 | norm 0.8042 | lr 3.48e-04 | (3804.10 ms | 137822 tok/s) step 13829/76294 | train loss 3.408816 | norm 1.1347 | lr 3.48e-04 | (3989.02 ms | 131433 tok/s) step 13830/76294 | train loss 3.361247 | norm 2.3531 | lr 3.48e-04 | (3889.23 ms | 134805 tok/s) step 13831/76294 | train loss 3.470997 | norm 1.0546 | lr 3.48e-04 | (3828.68 ms | 136937 tok/s) step 13832/76294 | train loss 3.393336 | norm 4.9872 | lr 3.48e-04 | (3803.86 ms | 137831 tok/s) step 13833/76294 | train loss 3.421035 | norm 3.8184 | lr 3.48e-04 | (3790.70 ms | 138309 tok/s) step 13834/76294 | train loss 3.405446 | norm 0.9849 | lr 3.48e-04 | (3823.71 ms | 137115 tok/s) step 13835/76294 | train loss 3.366233 | norm 0.5813 | lr 3.48e-04 | (3792.84 ms | 138231 tok/s) step 13836/76294 | train loss 3.381846 | norm 1.3570 | lr 3.48e-04 | (3857.06 ms | 135929 tok/s) step 13837/76294 | train loss 3.333542 | norm 1.0608 | lr 3.48e-04 | (3825.34 ms | 137056 tok/s) step 13838/76294 | train loss 3.355268 | norm 0.7291 | lr 3.48e-04 | (17825.33 ms | 29413 tok/s) step 13839/76294 | train loss 3.375756 | norm 0.5558 | lr 3.48e-04 | (4408.97 ms | 118914 tok/s) step 13840/76294 | train loss 3.363975 | norm 0.4736 | lr 3.48e-04 | (3769.49 ms | 139087 tok/s) step 13841/76294 | train loss 3.445695 | norm 0.3155 | lr 3.47e-04 | (3907.17 ms | 134186 tok/s) step 
13842/76294 | train loss 3.378419 | norm 0.4789 | lr 3.47e-04 | (3769.87 ms | 139073 tok/s) step 13843/76294 | train loss 3.377931 | norm 0.5006 | lr 3.47e-04 | (3778.93 ms | 138740 tok/s) step 13844/76294 | train loss 3.435675 | norm 0.3537 | lr 3.47e-04 | (3798.25 ms | 138034 tok/s) step 13845/76294 | train loss 3.435389 | norm 0.3074 | lr 3.47e-04 | (3777.48 ms | 138793 tok/s) step 13846/76294 | train loss 3.355505 | norm 0.2595 | lr 3.47e-04 | (3778.62 ms | 138751 tok/s) step 13847/76294 | train loss 3.411160 | norm 0.3679 | lr 3.47e-04 | (4021.76 ms | 130363 tok/s) step 13848/76294 | train loss 3.370639 | norm 0.4490 | lr 3.47e-04 | (3777.41 ms | 138796 tok/s) step 13849/76294 | train loss 3.405912 | norm 0.4050 | lr 3.47e-04 | (3831.40 ms | 136840 tok/s) step 13850/76294 | train loss 3.385826 | norm 0.2694 | lr 3.47e-04 | (3785.55 ms | 138497 tok/s) step 13851/76294 | train loss 3.475973 | norm 0.3755 | lr 3.47e-04 | (3842.46 ms | 136446 tok/s) step 13852/76294 | train loss 3.406217 | norm 0.2880 | lr 3.47e-04 | (3784.80 ms | 138525 tok/s) step 13853/76294 | train loss 3.373541 | norm 0.2849 | lr 3.47e-04 | (3813.11 ms | 137496 tok/s) step 13854/76294 | train loss 3.435675 | norm 0.3413 | lr 3.47e-04 | (3785.70 ms | 138492 tok/s) step 13855/76294 | train loss 3.354190 | norm 0.4637 | lr 3.46e-04 | (3905.89 ms | 134230 tok/s) step 13856/76294 | train loss 3.448068 | norm 0.2804 | lr 3.46e-04 | (3809.43 ms | 137629 tok/s) step 13857/76294 | train loss 3.384166 | norm 0.2823 | lr 3.46e-04 | (3799.35 ms | 137994 tok/s) step 13858/76294 | train loss 3.393540 | norm 0.2785 | lr 3.46e-04 | (3795.20 ms | 138145 tok/s) step 13859/76294 | train loss 3.417864 | norm 0.2320 | lr 3.46e-04 | (3871.85 ms | 135410 tok/s) step 13860/76294 | train loss 3.391205 | norm 0.3182 | lr 3.46e-04 | (3794.30 ms | 138178 tok/s) step 13861/76294 | train loss 3.412350 | norm 0.2316 | lr 3.46e-04 | (3799.63 ms | 137984 tok/s) step 13862/76294 | train loss 3.424906 | norm 0.3183 | lr 
3.46e-04 | (3836.66 ms | 136652 tok/s) step 13863/76294 | train loss 3.331659 | norm 0.2910 | lr 3.46e-04 | (3845.30 ms | 136345 tok/s) step 13864/76294 | train loss 3.412929 | norm 0.2222 | lr 3.46e-04 | (3793.47 ms | 138208 tok/s) step 13865/76294 | train loss 3.490639 | norm 0.3613 | lr 3.46e-04 | (3805.40 ms | 137775 tok/s) step 13866/76294 | train loss 3.430926 | norm 0.3266 | lr 3.46e-04 | (3826.82 ms | 137004 tok/s) step 13867/76294 | train loss 3.391106 | norm 0.2757 | lr 3.46e-04 | (3822.88 ms | 137145 tok/s) step 13868/76294 | train loss 3.384786 | norm 0.4306 | lr 3.45e-04 | (3826.47 ms | 137016 tok/s) step 13869/76294 | train loss 3.723386 | norm 0.4180 | lr 3.45e-04 | (3935.60 ms | 133217 tok/s) step 13870/76294 | train loss 3.450269 | norm 0.3769 | lr 3.45e-04 | (3796.72 ms | 138090 tok/s) step 13871/76294 | train loss 3.383551 | norm 0.4704 | lr 3.45e-04 | (3846.97 ms | 136286 tok/s) step 13872/76294 | train loss 3.397484 | norm 0.2420 | lr 3.45e-04 | (3824.02 ms | 137104 tok/s) step 13873/76294 | train loss 3.395502 | norm 0.3446 | lr 3.45e-04 | (3849.00 ms | 136214 tok/s) step 13874/76294 | train loss 3.441727 | norm 0.4280 | lr 3.45e-04 | (3789.59 ms | 138349 tok/s) step 13875/76294 | train loss 3.402930 | norm 0.9508 | lr 3.45e-04 | (3808.33 ms | 137669 tok/s) step 13876/76294 | train loss 3.416066 | norm 0.2725 | lr 3.45e-04 | (3792.56 ms | 138241 tok/s) step 13877/76294 | train loss 3.535702 | norm 1.3270 | lr 3.45e-04 | (3797.67 ms | 138055 tok/s) step 13878/76294 | train loss 3.570455 | norm 0.2949 | lr 3.45e-04 | (3816.97 ms | 137357 tok/s) step 13879/76294 | train loss 3.344018 | norm 0.5389 | lr 3.45e-04 | (3797.40 ms | 138065 tok/s) step 13880/76294 | train loss 3.416133 | norm 0.4228 | lr 3.45e-04 | (3799.21 ms | 137999 tok/s) step 13881/76294 | train loss 3.436977 | norm 0.6253 | lr 3.45e-04 | (3800.56 ms | 137950 tok/s) step 13882/76294 | train loss 3.413397 | norm 0.3339 | lr 3.44e-04 | (3791.44 ms | 138282 tok/s) step 13883/76294 | 
train loss 3.397946 | norm 0.6328 | lr 3.44e-04 | (3961.89 ms | 132333 tok/s) step 13884/76294 | train loss 3.459070 | norm 0.3532 | lr 3.44e-04 | (3793.16 ms | 138219 tok/s) step 13885/76294 | train loss 3.372583 | norm 0.6952 | lr 3.44e-04 | (5782.57 ms | 90667 tok/s) step 13886/76294 | train loss 3.354607 | norm 0.4120 | lr 3.44e-04 | (3787.50 ms | 138426 tok/s) step 13887/76294 | train loss 3.432309 | norm 1.0221 | lr 3.44e-04 | (3811.21 ms | 137565 tok/s) step 13888/76294 | train loss 3.385405 | norm 0.3546 | lr 3.44e-04 | (3791.39 ms | 138284 tok/s) step 13889/76294 | train loss 3.425019 | norm 0.5342 | lr 3.44e-04 | (3855.34 ms | 135990 tok/s) step 13890/76294 | train loss 3.374277 | norm 0.6165 | lr 3.44e-04 | (3790.63 ms | 138311 tok/s) step 13891/76294 | train loss 3.402223 | norm 0.3402 | lr 3.44e-04 | (3818.11 ms | 137316 tok/s) step 13892/76294 | train loss 3.406327 | norm 0.7642 | lr 3.44e-04 | (3791.42 ms | 138283 tok/s) step 13893/76294 | train loss 3.445498 | norm 0.4224 | lr 3.44e-04 | (3795.96 ms | 138117 tok/s) step 13894/76294 | train loss 3.384183 | norm 0.6085 | lr 3.44e-04 | (3814.89 ms | 137432 tok/s) step 13895/76294 | train loss 3.454816 | norm 0.6716 | lr 3.44e-04 | (3909.79 ms | 134096 tok/s) step 13896/76294 | train loss 3.436026 | norm 0.6656 | lr 3.43e-04 | (3818.44 ms | 137304 tok/s) step 13897/76294 | train loss 3.454371 | norm 0.3596 | lr 3.43e-04 | (3842.77 ms | 136435 tok/s) step 13898/76294 | train loss 3.401464 | norm 0.8249 | lr 3.43e-04 | (3814.00 ms | 137464 tok/s) step 13899/76294 | train loss 3.358266 | norm 0.2992 | lr 3.43e-04 | (3813.20 ms | 137493 tok/s) step 13900/76294 | train loss 3.373328 | norm 0.4699 | lr 3.43e-04 | (3798.47 ms | 138026 tok/s) step 13901/76294 | train loss 3.470795 | norm 0.2672 | lr 3.43e-04 | (3798.88 ms | 138011 tok/s) step 13902/76294 | train loss 3.401663 | norm 0.2981 | lr 3.43e-04 | (3838.86 ms | 136574 tok/s) step 13903/76294 | train loss 3.437341 | norm 0.2914 | lr 3.43e-04 | (3903.28 
ms | 134320 tok/s) step 13904/76294 | train loss 3.458653 | norm 0.3211 | lr 3.43e-04 | (3816.49 ms | 137374 tok/s) step 13905/76294 | train loss 3.369281 | norm 0.3993 | lr 3.43e-04 | (3833.50 ms | 136765 tok/s) step 13906/76294 | train loss 3.380987 | norm 0.3034 | lr 3.43e-04 | (3810.85 ms | 137578 tok/s) step 13907/76294 | train loss 3.411589 | norm 0.8258 | lr 3.43e-04 | (3804.32 ms | 137814 tok/s) step 13908/76294 | train loss 3.447401 | norm 2.4532 | lr 3.43e-04 | (3823.86 ms | 137110 tok/s) step 13909/76294 | train loss 3.329055 | norm 0.8455 | lr 3.42e-04 | (3802.68 ms | 137873 tok/s) step 13910/76294 | train loss 3.391557 | norm 0.6255 | lr 3.42e-04 | (3886.94 ms | 134885 tok/s) step 13911/76294 | train loss 3.339656 | norm 0.7642 | lr 3.42e-04 | (3802.04 ms | 137896 tok/s) step 13912/76294 | train loss 3.335612 | norm 0.3194 | lr 3.42e-04 | (3859.23 ms | 135853 tok/s) step 13913/76294 | train loss 3.445912 | norm 0.4932 | lr 3.42e-04 | (3803.58 ms | 137841 tok/s) step 13914/76294 | train loss 3.372099 | norm 0.4287 | lr 3.42e-04 | (3858.83 ms | 135867 tok/s) step 13915/76294 | train loss 3.385711 | norm 0.8934 | lr 3.42e-04 | (3804.91 ms | 137792 tok/s) step 13916/76294 | train loss 3.440043 | norm 1.3482 | lr 3.42e-04 | (3860.58 ms | 135805 tok/s) step 13917/76294 | train loss 3.367516 | norm 0.4666 | lr 3.42e-04 | (3794.78 ms | 138160 tok/s) step 13918/76294 | train loss 3.428850 | norm 1.9497 | lr 3.42e-04 | (3896.63 ms | 134549 tok/s) step 13919/76294 | train loss 3.330987 | norm 0.8332 | lr 3.42e-04 | (3800.68 ms | 137946 tok/s) step 13920/76294 | train loss 3.421561 | norm 0.8836 | lr 3.42e-04 | (3802.93 ms | 137864 tok/s) step 13921/76294 | train loss 3.402291 | norm 0.7157 | lr 3.42e-04 | (3796.05 ms | 138114 tok/s) step 13922/76294 | train loss 3.401610 | norm 0.4066 | lr 3.42e-04 | (3853.20 ms | 136066 tok/s) step 13923/76294 | train loss 3.445700 | norm 1.0928 | lr 3.41e-04 | (3904.05 ms | 134293 tok/s) step 13924/76294 | train loss 3.378269 | 
norm 0.5778 | lr 3.41e-04 | (3786.16 ms | 138475 tok/s) step 13925/76294 | train loss 3.449311 | norm 0.4026 | lr 3.41e-04 | (3792.17 ms | 138255 tok/s) step 13926/76294 | train loss 3.390265 | norm 1.0390 | lr 3.41e-04 | (3813.60 ms | 137478 tok/s) step 13927/76294 | train loss 3.392801 | norm 0.4187 | lr 3.41e-04 | (3791.90 ms | 138265 tok/s) step 13928/76294 | train loss 3.411906 | norm 0.3262 | lr 3.41e-04 | (3791.82 ms | 138268 tok/s) step 13929/76294 | train loss 3.415492 | norm 0.5239 | lr 3.41e-04 | (3831.03 ms | 136853 tok/s) step 13930/76294 | train loss 3.428509 | norm 0.3883 | lr 3.41e-04 | (3792.93 ms | 138228 tok/s) step 13931/76294 | train loss 3.417381 | norm 0.4996 | lr 3.41e-04 | (3857.67 ms | 135908 tok/s) step 13932/76294 | train loss 3.499625 | norm 0.9513 | lr 3.41e-04 | (3788.60 ms | 138386 tok/s) step 13933/76294 | train loss 3.400973 | norm 0.3247 | lr 3.41e-04 | (3852.26 ms | 136099 tok/s) step 13934/76294 | train loss 3.719302 | norm 0.5723 | lr 3.41e-04 | (3789.82 ms | 138341 tok/s) step 13935/76294 | train loss 3.377051 | norm 0.3434 | lr 3.41e-04 | (3824.25 ms | 137096 tok/s) step 13936/76294 | train loss 3.422638 | norm 0.4574 | lr 3.41e-04 | (3791.73 ms | 138271 tok/s) step 13937/76294 | train loss 3.405112 | norm 0.3562 | lr 3.40e-04 | (4037.42 ms | 129857 tok/s) step 13938/76294 | train loss 3.377222 | norm 0.3907 | lr 3.40e-04 | (3777.81 ms | 138781 tok/s) step 13939/76294 | train loss 3.366786 | norm 0.3770 | lr 3.40e-04 | (3806.15 ms | 137747 tok/s) step 13940/76294 | train loss 3.415149 | norm 0.3437 | lr 3.40e-04 | (3786.79 ms | 138452 tok/s) step 13941/76294 | train loss 3.404366 | norm 0.3047 | lr 3.40e-04 | (3792.19 ms | 138255 tok/s) step 13942/76294 | train loss 3.366928 | norm 0.2863 | lr 3.40e-04 | (3811.99 ms | 137537 tok/s) step 13943/76294 | train loss 3.428469 | norm 0.3126 | lr 3.40e-04 | (3801.25 ms | 137925 tok/s) step 13944/76294 | train loss 3.362965 | norm 0.2466 | lr 3.40e-04 | (3819.47 ms | 137267 tok/s) 
step 13945/76294 | train loss 3.412187 | norm 0.4654 | lr 3.40e-04 | (5726.90 ms | 91548 tok/s) step 13946/76294 | train loss 3.441934 | norm 0.4085 | lr 3.40e-04 | (3875.41 ms | 135286 tok/s) step 13947/76294 | train loss 3.359990 | norm 0.4351 | lr 3.40e-04 | (3848.06 ms | 136247 tok/s) step 13948/76294 | train loss 3.390737 | norm 0.2971 | lr 3.40e-04 | (3792.23 ms | 138253 tok/s) step 13949/76294 | train loss 3.410560 | norm 0.3447 | lr 3.40e-04 | (3918.84 ms | 133787 tok/s) step 13950/76294 | train loss 3.431976 | norm 0.2656 | lr 3.40e-04 | (3792.86 ms | 138230 tok/s) step 13951/76294 | train loss 3.543425 | norm 0.2943 | lr 3.39e-04 | (3848.09 ms | 136246 tok/s) step 13952/76294 | train loss 3.555692 | norm 0.3441 | lr 3.39e-04 | (3793.67 ms | 138201 tok/s) step 13953/76294 | train loss 3.343090 | norm 0.2875 | lr 3.39e-04 | (3800.82 ms | 137941 tok/s) step 13954/76294 | train loss 3.442049 | norm 0.3706 | lr 3.39e-04 | (3815.72 ms | 137402 tok/s) step 13955/76294 | train loss 3.514222 | norm 0.6299 | lr 3.39e-04 | (3805.28 ms | 137779 tok/s) step 13956/76294 | train loss 3.388790 | norm 0.3202 | lr 3.39e-04 | (3806.58 ms | 137732 tok/s) step 13957/76294 | train loss 3.471261 | norm 0.5326 | lr 3.39e-04 | (3803.03 ms | 137861 tok/s) step 13958/76294 | train loss 3.448031 | norm 0.6363 | lr 3.39e-04 | (3810.31 ms | 137597 tok/s) step 13959/76294 | train loss 3.518892 | norm 0.6104 | lr 3.39e-04 | (3877.07 ms | 135228 tok/s) step 13960/76294 | train loss 3.399968 | norm 0.3479 | lr 3.39e-04 | (3804.91 ms | 137792 tok/s) step 13961/76294 | train loss 3.381993 | norm 0.2996 | lr 3.39e-04 | (3834.10 ms | 136743 tok/s) step 13962/76294 | train loss 3.444102 | norm 0.2629 | lr 3.39e-04 | (3802.57 ms | 137877 tok/s) step 13963/76294 | train loss 3.377690 | norm 0.3376 | lr 3.39e-04 | (3809.85 ms | 137614 tok/s) step 13964/76294 | train loss 3.469408 | norm 0.2853 | lr 3.38e-04 | (3825.19 ms | 137062 tok/s) step 13965/76294 | train loss 3.545761 | norm 0.3520 | lr 
3.38e-04 | (3805.49 ms | 137772 tok/s) step 13966/76294 | train loss 3.400892 | norm 0.2903 | lr 3.38e-04 | (3805.29 ms | 137779 tok/s) step 13967/76294 | train loss 3.334391 | norm 0.2377 | lr 3.38e-04 | (3831.89 ms | 136822 tok/s) step 13968/76294 | train loss 3.396886 | norm 0.3105 | lr 3.38e-04 | (3804.43 ms | 137810 tok/s) step 13969/76294 | train loss 3.397628 | norm 0.3197 | lr 3.38e-04 | (3824.98 ms | 137069 tok/s) step 13970/76294 | train loss 3.354842 | norm 0.4078 | lr 3.38e-04 | (3918.70 ms | 133791 tok/s) step 13971/76294 | train loss 3.412834 | norm 0.6062 | lr 3.38e-04 | (3803.42 ms | 137846 tok/s) step 13972/76294 | train loss 3.430951 | norm 0.5792 | lr 3.38e-04 | (3834.27 ms | 136737 tok/s) step 13973/76294 | train loss 3.367352 | norm 0.3377 | lr 3.38e-04 | (3802.83 ms | 137868 tok/s) step 13974/76294 | train loss 3.366564 | norm 0.5280 | lr 3.38e-04 | (3816.42 ms | 137377 tok/s) step 13975/76294 | train loss 3.405862 | norm 0.3820 | lr 3.38e-04 | (3847.05 ms | 136283 tok/s) step 13976/76294 | train loss 3.426075 | norm 0.4349 | lr 3.38e-04 | (3813.66 ms | 137476 tok/s) step 13977/76294 | train loss 3.449965 | norm 0.7792 | lr 3.38e-04 | (3812.41 ms | 137522 tok/s) step 13978/76294 | train loss 3.427269 | norm 1.4193 | lr 3.37e-04 | (3812.27 ms | 137526 tok/s) step 13979/76294 | train loss 3.435707 | norm 0.3324 | lr 3.37e-04 | (3815.63 ms | 137405 tok/s) step 13980/76294 | train loss 3.410588 | norm 2.0122 | lr 3.37e-04 | (3809.18 ms | 137638 tok/s) step 13981/76294 | train loss 3.400800 | norm 0.4557 | lr 3.37e-04 | (3993.58 ms | 131283 tok/s) step 13982/76294 | train loss 3.445360 | norm 0.6428 | lr 3.37e-04 | (3840.47 ms | 136517 tok/s) step 13983/76294 | train loss 3.445462 | norm 0.4893 | lr 3.37e-04 | (3804.82 ms | 137796 tok/s) step 13984/76294 | train loss 3.325714 | norm 0.7761 | lr 3.37e-04 | (3813.14 ms | 137495 tok/s) step 13985/76294 | train loss 3.395514 | norm 0.3493 | lr 3.37e-04 | (3830.57 ms | 136870 tok/s) step 13986/76294 | 
train loss 3.498451 | norm 0.3608 | lr 3.37e-04 | (3806.63 ms | 137730 tok/s) step 13987/76294 | train loss 3.426616 | norm 0.2892 | lr 3.37e-04 | (3827.77 ms | 136970 tok/s) step 13988/76294 | train loss 3.396661 | norm 0.4570 | lr 3.37e-04 | (3810.88 ms | 137577 tok/s) step 13989/76294 | train loss 3.412290 | norm 0.2890 | lr 3.37e-04 | (3810.98 ms | 137573 tok/s) step 13990/76294 | train loss 3.359557 | norm 0.3692 | lr 3.37e-04 | (3808.50 ms | 137663 tok/s) step 13991/76294 | train loss 3.434434 | norm 0.6991 | lr 3.37e-04 | (3808.86 ms | 137650 tok/s) step 13992/76294 | train loss 3.407140 | norm 0.9256 | lr 3.36e-04 | (3806.26 ms | 137744 tok/s) step 13993/76294 | train loss 3.379265 | norm 0.6147 | lr 3.36e-04 | (3831.14 ms | 136849 tok/s) step 13994/76294 | train loss 3.408623 | norm 0.3044 | lr 3.36e-04 | (3876.49 ms | 135248 tok/s) step 13995/76294 | train loss 3.415493 | norm 0.9853 | lr 3.36e-04 | (3839.86 ms | 136538 tok/s) step 13996/76294 | train loss 3.367180 | norm 0.4620 | lr 3.36e-04 | (3811.95 ms | 137538 tok/s) step 13997/76294 | train loss 3.353827 | norm 0.5535 | lr 3.36e-04 | (3847.22 ms | 136277 tok/s) step 13998/76294 | train loss 3.483707 | norm 0.3252 | lr 3.36e-04 | (3800.35 ms | 137958 tok/s) step 13999/76294 | train loss 3.392624 | norm 0.9088 | lr 3.36e-04 | (3823.15 ms | 137135 tok/s) step 14000/76294 | train loss 3.375394 | norm 0.5239 | lr 3.36e-04 | (3802.64 ms | 137875 tok/s) val loss: 3.373578 saving model checkpoint to ./results/gpt2-124M-gqa/step_14000.pth step 14001/76294 | train loss 3.398064 | norm 0.6199 | lr 3.36e-04 | (3821.68 ms | 137188 tok/s) step 14002/76294 | train loss 3.418437 | norm 0.2995 | lr 3.36e-04 | (3797.49 ms | 138062 tok/s) step 14003/76294 | train loss 3.387818 | norm 0.7822 | lr 3.36e-04 | (3801.71 ms | 137908 tok/s) step 14004/76294 | train loss 3.373759 | norm 0.2570 | lr 3.36e-04 | (3820.46 ms | 137232 tok/s) step 14005/76294 | train loss 3.399609 | norm 0.3736 | lr 3.36e-04 | (3803.54 ms | 137842 
tok/s) step 14006/76294 | train loss 3.379471 | norm 0.5027 | lr 3.35e-04 | (3807.00 ms | 137717 tok/s) step 14007/76294 | train loss 3.364501 | norm 0.5177 | lr 3.35e-04 | (3825.03 ms | 137068 tok/s) step 14008/76294 | train loss 3.354977 | norm 0.2427 | lr 3.35e-04 | (3808.52 ms | 137662 tok/s) step 14009/76294 | train loss 3.404963 | norm 0.5299 | lr 3.35e-04 | (3804.35 ms | 137813 tok/s) step 14010/76294 | train loss 3.430395 | norm 0.2606 | lr 3.35e-04 | (3806.82 ms | 137723 tok/s) step 14011/76294 | train loss 3.385538 | norm 0.5746 | lr 3.35e-04 | (3806.52 ms | 137734 tok/s) step 14012/76294 | train loss 3.426757 | norm 0.3995 | lr 3.35e-04 | (3885.01 ms | 134952 tok/s) step 14013/76294 | train loss 3.357349 | norm 1.0415 | lr 3.35e-04 | (3841.57 ms | 136478 tok/s) step 14014/76294 | train loss 3.364591 | norm 0.3878 | lr 3.35e-04 | (3865.48 ms | 135633 tok/s) step 14015/76294 | train loss 3.513149 | norm 0.3814 | lr 3.35e-04 | (3902.93 ms | 134332 tok/s) step 14016/76294 | train loss 3.350976 | norm 0.5249 | lr 3.35e-04 | (3791.48 ms | 138280 tok/s) step 14017/76294 | train loss 3.372828 | norm 0.6255 | lr 3.35e-04 | (3827.20 ms | 136990 tok/s) step 14018/76294 | train loss 3.347921 | norm 0.3434 | lr 3.35e-04 | (3797.50 ms | 138061 tok/s) step 14019/76294 | train loss 3.426149 | norm 0.4821 | lr 3.35e-04 | (3801.53 ms | 137915 tok/s) step 14020/76294 | train loss 3.413338 | norm 0.4922 | lr 3.34e-04 | (3823.07 ms | 137138 tok/s) step 14021/76294 | train loss 3.428857 | norm 0.3422 | lr 3.34e-04 | (3803.33 ms | 137850 tok/s) step 14022/76294 | train loss 3.384650 | norm 0.3127 | lr 3.34e-04 | (3805.97 ms | 137754 tok/s) step 14023/76294 | train loss 3.360655 | norm 0.6229 | lr 3.34e-04 | (3804.26 ms | 137816 tok/s) step 14024/76294 | train loss 3.383284 | norm 0.5038 | lr 3.34e-04 | (3801.13 ms | 137930 tok/s) step 14025/76294 | train loss 3.426080 | norm 0.3849 | lr 3.34e-04 | (3830.55 ms | 136870 tok/s) step 14026/76294 | train loss 3.367109 | norm 0.5652 
| lr 3.34e-04 | (3798.18 ms | 138037 tok/s) step 14027/76294 | train loss 3.362984 | norm 0.2780 | lr 3.34e-04 | (3850.91 ms | 136147 tok/s) step 14028/76294 | train loss 3.388244 | norm 0.5445 | lr 3.34e-04 | (3819.31 ms | 137273 tok/s) step 14029/76294 | train loss 3.342631 | norm 1.0054 | lr 3.34e-04 | (3801.60 ms | 137913 tok/s) step 14030/76294 | train loss 3.444952 | norm 2.2916 | lr 3.34e-04 | (4269.83 ms | 122789 tok/s) step 14031/76294 | train loss 3.388019 | norm 0.5634 | lr 3.34e-04 | (3889.59 ms | 134793 tok/s) step 14032/76294 | train loss 3.320128 | norm 1.2222 | lr 3.34e-04 | (3796.70 ms | 138091 tok/s) step 14033/76294 | train loss 3.363348 | norm 0.2650 | lr 3.34e-04 | (3820.17 ms | 137242 tok/s) step 14034/76294 | train loss 3.351057 | norm 0.9465 | lr 3.33e-04 | (3819.42 ms | 137269 tok/s) step 14035/76294 | train loss 3.421199 | norm 0.2755 | lr 3.33e-04 | (3795.25 ms | 138143 tok/s) step 14036/76294 | train loss 3.387751 | norm 1.1875 | lr 3.33e-04 | (3813.57 ms | 137479 tok/s) step 14037/76294 | train loss 3.444623 | norm 0.3125 | lr 3.33e-04 | (3825.77 ms | 137041 tok/s) step 14038/76294 | train loss 3.406036 | norm 0.5437 | lr 3.33e-04 | (3797.31 ms | 138068 tok/s) step 14039/76294 | train loss 3.409157 | norm 0.3520 | lr 3.33e-04 | (3850.34 ms | 136167 tok/s) step 14040/76294 | train loss 3.373203 | norm 0.5013 | lr 3.33e-04 | (3791.80 ms | 138269 tok/s) step 14041/76294 | train loss 3.432668 | norm 0.4430 | lr 3.33e-04 | (3895.94 ms | 134573 tok/s) step 14042/76294 | train loss 3.451540 | norm 0.3148 | lr 3.33e-04 | (3819.47 ms | 137267 tok/s) step 14043/76294 | train loss 3.390187 | norm 0.3733 | lr 3.33e-04 | (3810.06 ms | 137606 tok/s) step 14044/76294 | train loss 3.382563 | norm 0.4008 | lr 3.33e-04 | (3813.95 ms | 137466 tok/s) step 14045/76294 | train loss 3.365376 | norm 0.3657 | lr 3.33e-04 | (3792.97 ms | 138226 tok/s) step 14046/76294 | train loss 3.337857 | norm 0.4958 | lr 3.33e-04 | (3800.01 ms | 137970 tok/s) step 
14047/76294 | train loss 3.396675 | norm 1.3408 | lr 3.33e-04 | (3819.93 ms | 137251 tok/s) step 14048/76294 | train loss 3.420341 | norm 0.4718 | lr 3.32e-04 | (3801.92 ms | 137901 tok/s) step 14049/76294 | train loss 3.351152 | norm 0.3858 | lr 3.32e-04 | (3802.60 ms | 137876 tok/s) step 14050/76294 | train loss 3.414600 | norm 0.4850 | lr 3.32e-04 | (3802.11 ms | 137894 tok/s) step 14051/76294 | train loss 3.394512 | norm 0.3713 | lr 3.32e-04 | (3804.06 ms | 137823 tok/s) step 14052/76294 | train loss 3.435651 | norm 0.3643 | lr 3.32e-04 | (3848.01 ms | 136249 tok/s) step 14053/76294 | train loss 3.369304 | norm 0.3746 | lr 3.32e-04 | (3796.21 ms | 138108 tok/s) step 14054/76294 | train loss 3.315526 | norm 0.3492 | lr 3.32e-04 | (3832.11 ms | 136815 tok/s) step 14055/76294 | train loss 3.373810 | norm 0.3471 | lr 3.32e-04 | (3796.80 ms | 138087 tok/s) step 14056/76294 | train loss 3.383989 | norm 0.3309 | lr 3.32e-04 | (3847.25 ms | 136276 tok/s) step 14057/76294 | train loss 3.478445 | norm 0.4600 | lr 3.32e-04 | (3797.22 ms | 138072 tok/s) step 14058/76294 | train loss 3.395583 | norm 0.6459 | lr 3.32e-04 | (3855.20 ms | 135995 tok/s) step 14059/76294 | train loss 3.337791 | norm 0.6054 | lr 3.32e-04 | (3798.19 ms | 138036 tok/s) step 14060/76294 | train loss 3.385657 | norm 0.7671 | lr 3.32e-04 | (3798.67 ms | 138019 tok/s) step 14061/76294 | train loss 3.429180 | norm 0.4139 | lr 3.32e-04 | (3821.73 ms | 137186 tok/s) step 14062/76294 | train loss 3.407470 | norm 0.6434 | lr 3.31e-04 | (3805.54 ms | 137770 tok/s) step 14063/76294 | train loss 3.368417 | norm 0.5922 | lr 3.31e-04 | (3827.98 ms | 136962 tok/s) step 14064/76294 | train loss 3.373338 | norm 0.7403 | lr 3.31e-04 | (3805.75 ms | 137762 tok/s) step 14065/76294 | train loss 3.393011 | norm 0.8947 | lr 3.31e-04 | (3856.55 ms | 135947 tok/s) step 14066/76294 | train loss 3.399824 | norm 0.6033 | lr 3.31e-04 | (3798.36 ms | 138030 tok/s) step 14067/76294 | train loss 3.377279 | norm 1.3556 | lr 
3.31e-04 | (3810.85 ms | 137578 tok/s) step 14068/76294 | train loss 3.358665 | norm 0.4859 | lr 3.31e-04 | (3801.63 ms | 137911 tok/s) step 14069/76294 | train loss 3.365844 | norm 2.8717 | lr 3.31e-04 | (3860.13 ms | 135821 tok/s) step 14070/76294 | train loss 3.385058 | norm 0.7285 | lr 3.31e-04 | (3820.39 ms | 137234 tok/s) step 14071/76294 | train loss 3.369314 | norm 1.2115 | lr 3.31e-04 | (3823.90 ms | 137108 tok/s) step 14072/76294 | train loss 3.475627 | norm 1.0148 | lr 3.31e-04 | (3811.08 ms | 137569 tok/s) step 14073/76294 | train loss 3.385950 | norm 0.9872 | lr 3.31e-04 | (3926.98 ms | 133509 tok/s) step 14074/76294 | train loss 3.394105 | norm 1.1744 | lr 3.31e-04 | (3793.58 ms | 138204 tok/s) step 14075/76294 | train loss 3.373629 | norm 0.5978 | lr 3.31e-04 | (3817.79 ms | 137328 tok/s) step 14076/76294 | train loss 3.393447 | norm 1.1830 | lr 3.30e-04 | (3812.48 ms | 137519 tok/s) step 14077/76294 | train loss 3.400438 | norm 0.5563 | lr 3.30e-04 | (3829.50 ms | 136908 tok/s) step 14078/76294 | train loss 3.420704 | norm 0.8788 | lr 3.30e-04 | (3800.49 ms | 137953 tok/s) step 14079/76294 | train loss 3.390802 | norm 0.3859 | lr 3.30e-04 | (3835.74 ms | 136685 tok/s) step 14080/76294 | train loss 3.275679 | norm 0.3996 | lr 3.30e-04 | (3820.30 ms | 137238 tok/s) step 14081/76294 | train loss 3.387873 | norm 0.4463 | lr 3.30e-04 | (3801.48 ms | 137917 tok/s) step 14082/76294 | train loss 3.340888 | norm 0.3554 | lr 3.30e-04 | (3799.65 ms | 137983 tok/s) step 14083/76294 | train loss 3.417935 | norm 0.3994 | lr 3.30e-04 | (3832.79 ms | 136790 tok/s) step 14084/76294 | train loss 3.353460 | norm 0.2729 | lr 3.30e-04 | (3800.45 ms | 137954 tok/s) step 14085/76294 | train loss 3.375738 | norm 0.2565 | lr 3.30e-04 | (4122.90 ms | 127165 tok/s) step 14086/76294 | train loss 3.406975 | norm 0.2997 | lr 3.30e-04 | (3803.38 ms | 137848 tok/s) step 14087/76294 | train loss 3.319470 | norm 0.4219 | lr 3.30e-04 | (4003.35 ms | 130962 tok/s) step 14088/76294 | 
train loss 3.383217 | norm 0.3967 | lr 3.30e-04 | (3880.40 ms | 135112 tok/s) step 14089/76294 | train loss 3.340481 | norm 0.2606 | lr 3.30e-04 | (3865.08 ms | 135647 tok/s) step 14090/76294 | train loss 3.374794 | norm 0.5012 | lr 3.29e-04 | (3788.01 ms | 138407 tok/s) step 14091/76294 | train loss 3.365833 | norm 0.3611 | lr 3.29e-04 | (3827.13 ms | 136992 tok/s) step 14092/76294 | train loss 3.383409 | norm 0.6462 | lr 3.29e-04 | (3895.69 ms | 134582 tok/s) step 14093/76294 | train loss 3.299508 | norm 0.3938 | lr 3.29e-04 | (3794.12 ms | 138184 tok/s) step 14094/76294 | train loss 3.383845 | norm 1.1390 | lr 3.29e-04 | (3824.21 ms | 137097 tok/s) step 14095/76294 | train loss 3.431718 | norm 0.3114 | lr 3.29e-04 | (3796.97 ms | 138080 tok/s) step 14096/76294 | train loss 3.363150 | norm 0.3721 | lr 3.29e-04 | (3800.96 ms | 137936 tok/s) step 14097/76294 | train loss 3.487008 | norm 0.3993 | lr 3.29e-04 | (3825.05 ms | 137067 tok/s) step 14098/76294 | train loss 3.419480 | norm 0.3159 | lr 3.29e-04 | (3910.58 ms | 134069 tok/s) step 14099/76294 | train loss 3.348147 | norm 0.9046 | lr 3.29e-04 | (3953.04 ms | 132629 tok/s) step 14100/76294 | train loss 3.389819 | norm 0.2834 | lr 3.29e-04 | (3784.93 ms | 138520 tok/s) step 14101/76294 | train loss 3.394621 | norm 0.4730 | lr 3.29e-04 | (3790.96 ms | 138300 tok/s) step 14102/76294 | train loss 3.365256 | norm 0.3490 | lr 3.29e-04 | (3810.98 ms | 137573 tok/s) step 14103/76294 | train loss 3.405873 | norm 0.3923 | lr 3.29e-04 | (3791.02 ms | 138297 tok/s) step 14104/76294 | train loss 3.417353 | norm 1.4867 | lr 3.28e-04 | (3807.86 ms | 137686 tok/s) step 14105/76294 | train loss 3.452921 | norm 2.8588 | lr 3.28e-04 | (3793.05 ms | 138223 tok/s) step 14106/76294 | train loss 3.537085 | norm 0.4918 | lr 3.28e-04 | (3841.50 ms | 136480 tok/s) step 14107/76294 | train loss 3.359984 | norm 2.1427 | lr 3.28e-04 | (3788.05 ms | 138406 tok/s) step 14108/76294 | train loss 3.409424 | norm 0.6112 | lr 3.28e-04 | (3819.62 
ms | 137262 tok/s) step 14109/76294 | train loss 3.429043 | norm 2.1803 | lr 3.28e-04 | (3788.89 ms | 138375 tok/s) step 14110/76294 | train loss 3.351135 | norm 0.7850 | lr 3.28e-04 | (3796.26 ms | 138107 tok/s) step 14111/76294 | train loss 3.505653 | norm 1.0021 | lr 3.28e-04 | (3834.78 ms | 136719 tok/s) step 14112/76294 | train loss 3.370127 | norm 0.6529 | lr 3.28e-04 | (3791.43 ms | 138283 tok/s) step 14113/76294 | train loss 3.361235 | norm 0.5629 | lr 3.28e-04 | (3837.71 ms | 136615 tok/s) step 14114/76294 | train loss 3.360343 | norm 0.5830 | lr 3.28e-04 | (3793.84 ms | 138195 tok/s) step 14115/76294 | train loss 3.454348 | norm 0.5849 | lr 3.28e-04 | (3798.66 ms | 138019 tok/s) step 14116/76294 | train loss 3.395748 | norm 0.4636 | lr 3.28e-04 | (3812.40 ms | 137522 tok/s) step 14117/76294 | train loss 3.410562 | norm 0.3798 | lr 3.28e-04 | (3795.35 ms | 138139 tok/s) step 14118/76294 | train loss 3.387855 | norm 0.3387 | lr 3.27e-04 | (3816.32 ms | 137380 tok/s) step 14119/76294 | train loss 3.407974 | norm 0.5295 | lr 3.27e-04 | (3796.15 ms | 138110 tok/s) step 14120/76294 | train loss 3.390876 | norm 0.3871 | lr 3.27e-04 | (3817.06 ms | 137354 tok/s) step 14121/76294 | train loss 3.379182 | norm 0.3983 | lr 3.27e-04 | (3796.17 ms | 138110 tok/s) step 14122/76294 | train loss 3.452360 | norm 0.3310 | lr 3.27e-04 | (3798.94 ms | 138009 tok/s) step 14123/76294 | train loss 3.344786 | norm 0.2751 | lr 3.27e-04 | (3817.64 ms | 137333 tok/s) step 14124/76294 | train loss 3.461193 | norm 0.3409 | lr 3.27e-04 | (3798.72 ms | 138017 tok/s) step 14125/76294 | train loss 3.482496 | norm 0.2788 | lr 3.27e-04 | (3794.48 ms | 138171 tok/s) step 14126/76294 | train loss 3.415537 | norm 0.4561 | lr 3.27e-04 | (3829.04 ms | 136924 tok/s) step 14127/76294 | train loss 3.377614 | norm 0.3945 | lr 3.27e-04 | (3798.75 ms | 138016 tok/s) step 14128/76294 | train loss 3.367286 | norm 0.2501 | lr 3.27e-04 | (3801.32 ms | 137922 tok/s) step 14129/76294 | train loss 3.435496 | 
norm 0.3160 | lr 3.27e-04 | (3827.91 ms | 136965 tok/s) step 14130/76294 | train loss 3.409032 | norm 0.3716 | lr 3.27e-04 | (3796.17 ms | 138110 tok/s) step 14131/76294 | train loss 3.412352 | norm 0.2596 | lr 3.27e-04 | (3814.73 ms | 137438 tok/s) step 14132/76294 | train loss 3.382226 | norm 0.3860 | lr 3.26e-04 | (3907.17 ms | 134186 tok/s) step 14133/76294 | train loss 3.374697 | norm 0.3251 | lr 3.26e-04 | (3801.07 ms | 137932 tok/s) step 14134/76294 | train loss 3.321107 | norm 0.2508 | lr 3.26e-04 | (3820.27 ms | 137238 tok/s) step 14135/76294 | train loss 3.399970 | norm 0.2573 | lr 3.26e-04 | (3823.81 ms | 137111 tok/s) step 14136/76294 | train loss 3.321544 | norm 0.2807 | lr 3.26e-04 | (3803.13 ms | 137857 tok/s) step 14137/76294 | train loss 3.374279 | norm 0.3360 | lr 3.26e-04 | (3805.29 ms | 137779 tok/s) step 14138/76294 | train loss 3.379028 | norm 0.2871 | lr 3.26e-04 | (3820.61 ms | 137226 tok/s) step 14139/76294 | train loss 3.318182 | norm 0.3209 | lr 3.26e-04 | (3817.01 ms | 137356 tok/s) step 14140/76294 | train loss 3.337445 | norm 0.3036 | lr 3.26e-04 | (3848.30 ms | 136239 tok/s) step 14141/76294 | train loss 3.350949 | norm 0.2883 | lr 3.26e-04 | (3801.01 ms | 137934 tok/s) step 14142/76294 | train loss 3.325912 | norm 0.4354 | lr 3.26e-04 | (3831.38 ms | 136840 tok/s) step 14143/76294 | train loss 3.394940 | norm 0.2871 | lr 3.26e-04 | (3804.94 ms | 137791 tok/s) step 14144/76294 | train loss 3.417897 | norm 0.2771 | lr 3.26e-04 | (3827.39 ms | 136983 tok/s) step 14145/76294 | train loss 3.362572 | norm 0.3322 | lr 3.26e-04 | (3802.89 ms | 137866 tok/s) step 14146/76294 | train loss 3.400836 | norm 0.3267 | lr 3.25e-04 | (3848.08 ms | 136247 tok/s) step 14147/76294 | train loss 3.341340 | norm 0.2785 | lr 3.25e-04 | (3828.49 ms | 136944 tok/s) step 14148/76294 | train loss 3.397264 | norm 0.3663 | lr 3.25e-04 | (4009.36 ms | 130766 tok/s) step 14149/76294 | train loss 3.338799 | norm 1.1447 | lr 3.25e-04 | (3912.06 ms | 134018 tok/s) 
step 14150/76294 | train loss 3.408118 | norm 5.7104 | lr 3.25e-04 | (3811.37 ms | 137559 tok/s) step 14151/76294 | train loss 3.335133 | norm 0.5362 | lr 3.25e-04 | (3830.68 ms | 136866 tok/s) step 14152/76294 | train loss 3.372178 | norm 3.7677 | lr 3.25e-04 | (3831.48 ms | 136837 tok/s) step 14153/76294 | train loss 3.382893 | norm 0.7706 | lr 3.25e-04 | (3796.22 ms | 138108 tok/s) step 14154/76294 | train loss 3.421369 | norm 0.8194 | lr 3.25e-04 | (3861.90 ms | 135759 tok/s) step 14155/76294 | train loss 3.412483 | norm 1.2055 | lr 3.25e-04 | (3804.61 ms | 137803 tok/s) step 14156/76294 | train loss 3.357466 | norm 0.3537 | lr 3.25e-04 | (3801.50 ms | 137916 tok/s) step 14157/76294 | train loss 3.347475 | norm 0.5134 | lr 3.25e-04 | (3824.13 ms | 137100 tok/s) step 14158/76294 | train loss 3.402165 | norm 1.0972 | lr 3.25e-04 | (3814.90 ms | 137432 tok/s) step 14159/76294 | train loss 3.405117 | norm 0.3166 | lr 3.25e-04 | (3801.48 ms | 137917 tok/s) step 14160/76294 | train loss 3.396221 | norm 0.5865 | lr 3.24e-04 | (3854.00 ms | 136037 tok/s) step 14161/76294 | train loss 3.552925 | norm 0.3091 | lr 3.24e-04 | (3864.18 ms | 135679 tok/s) step 14162/76294 | train loss 3.385774 | norm 0.4350 | lr 3.24e-04 | (3826.78 ms | 137005 tok/s) step 14163/76294 | train loss 3.310498 | norm 0.3541 | lr 3.24e-04 | (3803.00 ms | 137862 tok/s) step 14164/76294 | train loss 3.393221 | norm 0.4289 | lr 3.24e-04 | (3852.45 ms | 136092 tok/s) step 14165/76294 | train loss 3.465670 | norm 0.4445 | lr 3.24e-04 | (3804.70 ms | 137800 tok/s) step 14166/76294 | train loss 3.379095 | norm 0.5812 | lr 3.24e-04 | (3807.85 ms | 137686 tok/s) step 14167/76294 | train loss 3.411394 | norm 0.7674 | lr 3.24e-04 | (3824.72 ms | 137079 tok/s) step 14168/76294 | train loss 3.353891 | norm 0.3872 | lr 3.24e-04 | (3805.13 ms | 137785 tok/s) step 14169/76294 | train loss 3.359885 | norm 0.4107 | lr 3.24e-04 | (3800.27 ms | 137961 tok/s) step 14170/76294 | train loss 3.489439 | norm 0.2676 | lr 
3.24e-04 | (3829.51 ms | 136907 tok/s) step 14171/76294 | train loss 3.372224 | norm 0.3706 | lr 3.24e-04 | (3805.67 ms | 137765 tok/s) step 14172/76294 | train loss 3.471974 | norm 0.3282 | lr 3.24e-04 | (3830.62 ms | 136868 tok/s) step 14173/76294 | train loss 3.393081 | norm 0.4038 | lr 3.24e-04 | (3904.30 ms | 134285 tok/s) step 14174/76294 | train loss 3.333571 | norm 0.3092 | lr 3.24e-04 | (3801.45 ms | 137918 tok/s) step 14175/76294 | train loss 3.383496 | norm 0.2985 | lr 3.23e-04 | (3807.75 ms | 137690 tok/s) step 14176/76294 | train loss 3.392917 | norm 0.4571 | lr 3.23e-04 | (3864.15 ms | 135680 tok/s) step 14177/76294 | train loss 3.360913 | norm 0.4153 | lr 3.23e-04 | (3876.86 ms | 135235 tok/s) step 14178/76294 | train loss 3.392778 | norm 0.6869 | lr 3.23e-04 | (3806.46 ms | 137736 tok/s) step 14179/76294 | train loss 3.401376 | norm 0.6648 | lr 3.23e-04 | (3836.04 ms | 136674 tok/s) step 14180/76294 | train loss 3.339717 | norm 0.4276 | lr 3.23e-04 | (3803.36 ms | 137849 tok/s) step 14181/76294 | train loss 3.452378 | norm 0.6780 | lr 3.23e-04 | (3809.76 ms | 137617 tok/s) step 14182/76294 | train loss 3.429860 | norm 0.6336 | lr 3.23e-04 | (3825.32 ms | 137057 tok/s) step 14183/76294 | train loss 3.424188 | norm 0.5567 | lr 3.23e-04 | (3806.66 ms | 137729 tok/s) step 14184/76294 | train loss 3.381726 | norm 0.5868 | lr 3.23e-04 | (3806.56 ms | 137733 tok/s) step 14185/76294 | train loss 3.484377 | norm 1.2874 | lr 3.23e-04 | (3879.23 ms | 135153 tok/s) step 14186/76294 | train loss 3.319934 | norm 3.5428 | lr 3.23e-04 | (3800.46 ms | 137954 tok/s) step 14187/76294 | train loss 3.403492 | norm 0.6686 | lr 3.23e-04 | (3821.90 ms | 137180 tok/s) step 14188/76294 | train loss 3.454796 | norm 15.8413 | lr 3.23e-04 | (3813.63 ms | 137478 tok/s) step 14189/76294 | train loss 3.407598 | norm 9.8380 | lr 3.22e-04 | (3808.07 ms | 137678 tok/s) step 14190/76294 | train loss 3.389709 | norm 1.8748 | lr 3.22e-04 | (3810.97 ms | 137574 tok/s) step 14191/76294 | 
train loss 3.373389 | norm 0.9900 | lr 3.22e-04 | (3808.95 ms | 137646 tok/s) step 14192/76294 | train loss 3.376638 | norm 4.1863 | lr 3.22e-04 | (3805.08 ms | 137786 tok/s) step 14193/76294 | train loss 3.388228 | norm 3.1202 | lr 3.22e-04 | (3808.46 ms | 137664 tok/s) step 14194/76294 | train loss 3.365205 | norm 0.9229 | lr 3.22e-04 | (3981.04 ms | 131696 tok/s) step 14195/76294 | train loss 3.312880 | norm 1.5310 | lr 3.22e-04 | (3799.57 ms | 137986 tok/s) step 14196/76294 | train loss 3.358932 | norm 0.8520 | lr 3.22e-04 | (3833.12 ms | 136779 tok/s) step 14197/76294 | train loss 3.426635 | norm 0.7018 | lr 3.22e-04 | (3805.74 ms | 137762 tok/s) step 14198/76294 | train loss 3.343029 | norm 0.5875 | lr 3.22e-04 | (3807.03 ms | 137716 tok/s) step 14199/76294 | train loss 3.349365 | norm 0.5094 | lr 3.22e-04 | (3824.32 ms | 137093 tok/s) step 14200/76294 | train loss 3.425011 | norm 0.5200 | lr 3.22e-04 | (3807.38 ms | 137703 tok/s) step 14201/76294 | train loss 3.401543 | norm 0.7303 | lr 3.22e-04 | (3809.05 ms | 137643 tok/s) step 14202/76294 | train loss 3.389560 | norm 0.6114 | lr 3.22e-04 | (3808.25 ms | 137671 tok/s) step 14203/76294 | train loss 3.385529 | norm 0.3844 | lr 3.21e-04 | (3820.55 ms | 137228 tok/s) step 14204/76294 | train loss 3.599237 | norm 0.8215 | lr 3.21e-04 | (3801.89 ms | 137902 tok/s) step 14205/76294 | train loss 3.427353 | norm 0.9056 | lr 3.21e-04 | (3813.89 ms | 137468 tok/s) step 14206/76294 | train loss 3.340722 | norm 0.3430 | lr 3.21e-04 | (3819.63 ms | 137261 tok/s) step 14207/76294 | train loss 3.399122 | norm 1.0333 | lr 3.21e-04 | (3825.11 ms | 137065 tok/s) step 14208/76294 | train loss 3.361973 | norm 0.4592 | lr 3.21e-04 | (3809.43 ms | 137629 tok/s) step 14209/76294 | train loss 3.345986 | norm 1.2818 | lr 3.21e-04 | (3834.81 ms | 136718 tok/s) step 14210/76294 | train loss 3.426494 | norm 0.3673 | lr 3.21e-04 | (3803.63 ms | 137839 tok/s) step 14211/76294 | train loss 3.314941 | norm 1.9061 | lr 3.21e-04 | (3849.64 
ms | 136191 tok/s) step 14212/76294 | train loss 3.412248 | norm 0.3411 | lr 3.21e-04 | (3803.94 ms | 137827 tok/s) step 14213/76294 | train loss 3.389217 | norm 0.7932 | lr 3.21e-04 | (3825.27 ms | 137059 tok/s) step 14214/76294 | train loss 3.368925 | norm 0.3355 | lr 3.21e-04 | (3829.16 ms | 136920 tok/s) step 14215/76294 | train loss 3.374021 | norm 1.3610 | lr 3.21e-04 | (3880.41 ms | 135112 tok/s) step 14216/76294 | train loss 3.345889 | norm 0.3588 | lr 3.21e-04 | (3801.28 ms | 137924 tok/s) step 14217/76294 | train loss 3.394578 | norm 0.9304 | lr 3.20e-04 | (3854.53 ms | 136019 tok/s) step 14218/76294 | train loss 3.349597 | norm 0.6642 | lr 3.20e-04 | (3802.92 ms | 137865 tok/s) step 14219/76294 | train loss 3.356395 | norm 0.4555 | lr 3.20e-04 | (3808.15 ms | 137675 tok/s) step 14220/76294 | train loss 3.388406 | norm 0.8515 | lr 3.20e-04 | (4121.79 ms | 127199 tok/s) step 14221/76294 | train loss 3.441659 | norm 0.3416 | lr 3.20e-04 | (3800.46 ms | 137954 tok/s) step 14222/76294 | train loss 3.346299 | norm 0.7858 | lr 3.20e-04 | (3805.34 ms | 137777 tok/s) step 14223/76294 | train loss 3.395641 | norm 0.3055 | lr 3.20e-04 | (3819.48 ms | 137267 tok/s) step 14224/76294 | train loss 3.364748 | norm 0.8085 | lr 3.20e-04 | (3808.72 ms | 137655 tok/s) step 14225/76294 | train loss 3.424009 | norm 0.3909 | lr 3.20e-04 | (3808.32 ms | 137669 tok/s) step 14226/76294 | train loss 3.423852 | norm 0.8947 | lr 3.20e-04 | (3841.72 ms | 136472 tok/s) step 14227/76294 | train loss 3.381596 | norm 0.4487 | lr 3.20e-04 | (3824.97 ms | 137070 tok/s) step 14228/76294 | train loss 3.407107 | norm 0.4494 | lr 3.20e-04 | (3803.25 ms | 137853 tok/s) step 14229/76294 | train loss 3.374486 | norm 0.2500 | lr 3.20e-04 | (3805.12 ms | 137785 tok/s) step 14230/76294 | train loss 3.406277 | norm 0.8633 | lr 3.20e-04 | (3837.56 ms | 136620 tok/s) step 14231/76294 | train loss 3.354642 | norm 0.4843 | lr 3.20e-04 | (3802.54 ms | 137878 tok/s) step 14232/76294 | train loss 3.368203 | 
norm 0.6224 | lr 3.19e-04 | (3812.40 ms | 137522 tok/s) step 14233/76294 | train loss 3.440444 | norm 0.2660 | lr 3.19e-04 | (3802.37 ms | 137885 tok/s) step 14234/76294 | train loss 3.384952 | norm 1.1111 | lr 3.19e-04 | (3829.59 ms | 136904 tok/s) step 14235/76294 | train loss 3.362744 | norm 0.5796 | lr 3.19e-04 | (3804.74 ms | 137799 tok/s) step 14236/76294 | train loss 3.259131 | norm 0.3381 | lr 3.19e-04 | (3910.60 ms | 134069 tok/s) step 14237/76294 | train loss 3.445817 | norm 0.3461 | lr 3.19e-04 | (3795.20 ms | 138145 tok/s) step 14238/76294 | train loss 3.374618 | norm 0.2905 | lr 3.19e-04 | (4220.90 ms | 124213 tok/s) step 14239/76294 | train loss 3.341605 | norm 0.3400 | lr 3.19e-04 | (3827.69 ms | 136972 tok/s) step 14240/76294 | train loss 3.431163 | norm 0.4322 | lr 3.19e-04 | (3795.38 ms | 138138 tok/s) step 14241/76294 | train loss 3.367628 | norm 0.3825 | lr 3.19e-04 | (3801.33 ms | 137922 tok/s) step 14242/76294 | train loss 3.398719 | norm 0.6801 | lr 3.19e-04 | (3795.35 ms | 138140 tok/s) step 14243/76294 | train loss 3.366374 | norm 2.3387 | lr 3.19e-04 | (3798.84 ms | 138013 tok/s) step 14244/76294 | train loss 3.388122 | norm 4.5714 | lr 3.19e-04 | (3794.49 ms | 138171 tok/s) step 14245/76294 | train loss 3.484336 | norm 1.1664 | lr 3.19e-04 | (3952.78 ms | 132638 tok/s) step 14246/76294 | train loss 3.375462 | norm 1.4310 | lr 3.18e-04 | (3792.34 ms | 138249 tok/s) step 14247/76294 | train loss 3.510880 | norm 0.3462 | lr 3.18e-04 | (3818.89 ms | 137288 tok/s) step 14248/76294 | train loss 3.423667 | norm 0.6417 | lr 3.18e-04 | (3793.73 ms | 138199 tok/s) step 14249/76294 | train loss 3.362607 | norm 0.3267 | lr 3.18e-04 | (3812.02 ms | 137536 tok/s) step 14250/76294 | train loss 3.386098 | norm 0.6025 | lr 3.18e-04 | (3792.37 ms | 138248 tok/s) val loss: 3.373324 saving model checkpoint to ./results/gpt2-124M-gqa/step_14250.pth step 14251/76294 | train loss 3.430463 | norm 0.5478 | lr 3.18e-04 | (3806.41 ms | 137738 tok/s) step 
14252/76294 | train loss 3.385666 | norm 0.3909 | lr 3.18e-04 | (3813.52 ms | 137481 tok/s) step 14253/76294 | train loss 3.424958 | norm 0.5960 | lr 3.18e-04 | (3792.08 ms | 138259 tok/s) step 14254/76294 | train loss 3.394320 | norm 0.3292 | lr 3.18e-04 | (3813.53 ms | 137481 tok/s) step 14255/76294 | train loss 3.454060 | norm 0.4022 | lr 3.18e-04 | (3870.77 ms | 135448 tok/s) step 14256/76294 | train loss 3.408473 | norm 0.5140 | lr 3.18e-04 | (3794.43 ms | 138173 tok/s) step 14257/76294 | train loss 3.362754 | norm 0.2917 | lr 3.18e-04 | (3810.02 ms | 137608 tok/s) step 14258/76294 | train loss 3.407336 | norm 0.9933 | lr 3.18e-04 | (3795.66 ms | 138128 tok/s) step 14259/76294 | train loss 3.363794 | norm 0.7957 | lr 3.18e-04 | (3803.13 ms | 137857 tok/s) step 14260/76294 | train loss 3.361664 | norm 0.2859 | lr 3.17e-04 | (3792.60 ms | 138240 tok/s) step 14261/76294 | train loss 3.398617 | norm 0.9836 | lr 3.17e-04 | (3820.07 ms | 137246 tok/s) step 14262/76294 | train loss 3.377491 | norm 1.2435 | lr 3.17e-04 | (3796.06 ms | 138114 tok/s) step 14263/76294 | train loss 3.355911 | norm 0.2668 | lr 3.17e-04 | (3824.65 ms | 137081 tok/s) step 14264/76294 | train loss 3.435167 | norm 0.7284 | lr 3.17e-04 | (3793.80 ms | 138196 tok/s) step 14265/76294 | train loss 3.389818 | norm 0.2672 | lr 3.17e-04 | (3816.05 ms | 137390 tok/s) step 14266/76294 | train loss 3.376289 | norm 1.5853 | lr 3.17e-04 | (3824.44 ms | 137089 tok/s) step 14267/76294 | train loss 3.402441 | norm 0.4896 | lr 3.17e-04 | (3803.19 ms | 137855 tok/s) step 14268/76294 | train loss 3.344285 | norm 1.1090 | lr 3.17e-04 | (3806.08 ms | 137750 tok/s) step 14269/76294 | train loss 3.332343 | norm 1.4638 | lr 3.17e-04 | (3801.24 ms | 137926 tok/s) step 14270/76294 | train loss 3.353812 | norm 0.5091 | lr 3.17e-04 | (3802.75 ms | 137871 tok/s) step 14271/76294 | train loss 3.455067 | norm 1.2271 | lr 3.17e-04 | (3799.98 ms | 137971 tok/s) step 14272/76294 | train loss 3.385121 | norm 0.3109 | lr 
3.17e-04 | (3802.59 ms | 137876 tok/s) step 14273/76294 | train loss 3.433711 | norm 0.8406 | lr 3.17e-04 | (3831.22 ms | 136846 tok/s) step 14274/76294 | train loss 3.388331 | norm 1.2377 | lr 3.17e-04 | (3879.70 ms | 135136 tok/s) step 14275/76294 | train loss 3.414615 | norm 0.9113 | lr 3.16e-04 | (3797.84 ms | 138049 tok/s) step 14276/76294 | train loss 3.424616 | norm 0.5306 | lr 3.16e-04 | (3804.07 ms | 137823 tok/s) step 14277/76294 | train loss 3.400341 | norm 0.3767 | lr 3.16e-04 | (3789.65 ms | 138348 tok/s) step 14278/76294 | train loss 3.427007 | norm 0.7101 | lr 3.16e-04 | (3809.17 ms | 137638 tok/s) step 14279/76294 | train loss 3.357708 | norm 2.2422 | lr 3.16e-04 | (3822.90 ms | 137144 tok/s) step 14280/76294 | train loss 3.383318 | norm 0.3466 | lr 3.16e-04 | (3802.25 ms | 137889 tok/s) step 14281/76294 | train loss 3.329629 | norm 1.4111 | lr 3.16e-04 | (3808.68 ms | 137656 tok/s) step 14282/76294 | train loss 3.378286 | norm 0.3103 | lr 3.16e-04 | (3800.62 ms | 137948 tok/s) step 14283/76294 | train loss 3.391942 | norm 0.8485 | lr 3.16e-04 | (3800.43 ms | 137955 tok/s) step 14284/76294 | train loss 3.415659 | norm 0.7961 | lr 3.16e-04 | (4115.76 ms | 127386 tok/s) step 14285/76294 | train loss 3.404066 | norm 0.2890 | lr 3.16e-04 | (3790.19 ms | 138328 tok/s) step 14286/76294 | train loss 3.364372 | norm 1.5801 | lr 3.16e-04 | (3846.33 ms | 136309 tok/s) step 14287/76294 | train loss 3.383340 | norm 4.4685 | lr 3.16e-04 | (3793.35 ms | 138213 tok/s) step 14288/76294 | train loss 3.417700 | norm 1.0290 | lr 3.16e-04 | (3846.67 ms | 136297 tok/s) step 14289/76294 | train loss 3.419955 | norm 6.3440 | lr 3.15e-04 | (3793.26 ms | 138216 tok/s) step 14290/76294 | train loss 3.394918 | norm 4.9773 | lr 3.15e-04 | (3841.28 ms | 136488 tok/s) step 14291/76294 | train loss 3.543264 | norm 1.0173 | lr 3.15e-04 | (3792.50 ms | 138243 tok/s) step 14292/76294 | train loss 3.434230 | norm 0.8377 | lr 3.15e-04 | (3797.56 ms | 138059 tok/s) step 14293/76294 | 
train loss 3.460726 | norm 1.6737 | lr 3.15e-04 | (3794.14 ms | 138184 tok/s) step 14294/76294 | train loss 3.339477 | norm 0.4439 | lr 3.15e-04 | (3849.35 ms | 136202 tok/s) step 14295/76294 | train loss 3.363147 | norm 1.3698 | lr 3.15e-04 | (3795.64 ms | 138129 tok/s) step 14296/76294 | train loss 3.389978 | norm 0.8436 | lr 3.15e-04 | (3915.34 ms | 133906 tok/s) step 14297/76294 | train loss 3.387408 | norm 0.3111 | lr 3.15e-04 | (3795.11 ms | 138148 tok/s) step 14298/76294 | train loss 3.383652 | norm 0.6686 | lr 3.15e-04 | (3799.63 ms | 137984 tok/s) step 14299/76294 | train loss 3.402613 | norm 0.8531 | lr 3.15e-04 | (3821.96 ms | 137178 tok/s) step 14300/76294 | train loss 3.450888 | norm 0.9385 | lr 3.15e-04 | (3804.07 ms | 137823 tok/s) step 14301/76294 | train loss 3.416916 | norm 1.1775 | lr 3.15e-04 | (3797.49 ms | 138062 tok/s) step 14302/76294 | train loss 3.329852 | norm 1.3584 | lr 3.15e-04 | (3838.91 ms | 136572 tok/s) step 14303/76294 | train loss 3.372679 | norm 0.6118 | lr 3.15e-04 | (3801.55 ms | 137914 tok/s) step 14304/76294 | train loss 3.412905 | norm 0.3092 | lr 3.14e-04 | (3852.85 ms | 136078 tok/s) step 14305/76294 | train loss 3.412245 | norm 0.3955 | lr 3.14e-04 | (3800.88 ms | 137939 tok/s) step 14306/76294 | train loss 3.411149 | norm 0.2955 | lr 3.14e-04 | (3805.80 ms | 137760 tok/s) step 14307/76294 | train loss 3.370713 | norm 0.5814 | lr 3.14e-04 | (3799.43 ms | 137991 tok/s) step 14308/76294 | train loss 3.416763 | norm 0.4564 | lr 3.14e-04 | (3804.80 ms | 137796 tok/s) step 14309/76294 | train loss 3.392715 | norm 0.2791 | lr 3.14e-04 | (3821.86 ms | 137181 tok/s) step 14310/76294 | train loss 3.337468 | norm 0.5318 | lr 3.14e-04 | (3804.83 ms | 137795 tok/s) step 14311/76294 | train loss 3.426750 | norm 0.6085 | lr 3.14e-04 | (3799.51 ms | 137988 tok/s) step 14312/76294 | train loss 3.357448 | norm 0.4019 | lr 3.14e-04 | (3850.46 ms | 136162 tok/s) step 14313/76294 | train loss 3.369018 | norm 0.3454 | lr 3.14e-04 | (3801.88 
ms | 137902 tok/s) step 14314/76294 | train loss 3.445580 | norm 0.3825 | lr 3.14e-04 | (3888.59 ms | 134827 tok/s) step 14315/76294 | train loss 3.415805 | norm 0.5627 | lr 3.14e-04 | (3794.37 ms | 138175 tok/s) step 14316/76294 | train loss 3.405332 | norm 0.7320 | lr 3.14e-04 | (3803.35 ms | 137849 tok/s) step 14317/76294 | train loss 3.395162 | norm 0.6951 | lr 3.14e-04 | (3893.62 ms | 134653 tok/s) step 14318/76294 | train loss 3.411121 | norm 0.3639 | lr 3.13e-04 | (3798.75 ms | 138016 tok/s) step 14319/76294 | train loss 3.389342 | norm 0.3753 | lr 3.13e-04 | (3818.98 ms | 137285 tok/s) step 14320/76294 | train loss 3.439885 | norm 0.8027 | lr 3.13e-04 | (3811.16 ms | 137566 tok/s) step 14321/76294 | train loss 3.362818 | norm 2.0067 | lr 3.13e-04 | (3802.61 ms | 137876 tok/s) step 14322/76294 | train loss 3.343873 | norm 1.1433 | lr 3.13e-04 | (3831.25 ms | 136845 tok/s) step 14323/76294 | train loss 3.406952 | norm 1.1654 | lr 3.13e-04 | (3883.47 ms | 135005 tok/s) step 14324/76294 | train loss 3.394053 | norm 1.8377 | lr 3.13e-04 | (3791.37 ms | 138285 tok/s) step 14325/76294 | train loss 3.277178 | norm 0.3962 | lr 3.13e-04 | (3936.18 ms | 133197 tok/s) step 14326/76294 | train loss 3.417085 | norm 0.8776 | lr 3.13e-04 | (3900.18 ms | 134426 tok/s) step 14327/76294 | train loss 3.404518 | norm 1.4053 | lr 3.13e-04 | (3823.36 ms | 137128 tok/s) step 14328/76294 | train loss 3.359505 | norm 0.2757 | lr 3.13e-04 | (3859.42 ms | 135846 tok/s) step 14329/76294 | train loss 3.378648 | norm 0.5883 | lr 3.13e-04 | (3817.61 ms | 137334 tok/s) step 14330/76294 | train loss 3.353585 | norm 0.7544 | lr 3.13e-04 | (3768.78 ms | 139113 tok/s) step 14331/76294 | train loss 3.399238 | norm 0.4240 | lr 3.13e-04 | (3943.33 ms | 132956 tok/s) step 14332/76294 | train loss 3.404197 | norm 0.8803 | lr 3.13e-04 | (3761.11 ms | 139397 tok/s) step 14333/76294 | train loss 3.397670 | norm 4.7975 | lr 3.12e-04 | (3773.11 ms | 138954 tok/s) step 14334/76294 | train loss 3.374567 | 
norm 2.5366 | lr 3.12e-04 | (3796.15 ms | 138110 tok/s) step 14335/76294 | train loss 3.486765 | norm 35.2770 | lr 3.12e-04 | (4034.82 ms | 129941 tok/s) step 14336/76294 | train loss 3.470429 | norm 22.7231 | lr 3.12e-04 | (3805.23 ms | 137781 tok/s) step 14337/76294 | train loss 3.442041 | norm 8.1581 | lr 3.12e-04 | (3800.06 ms | 137968 tok/s) step 14338/76294 | train loss 3.406291 | norm 8.5553 | lr 3.12e-04 | (3784.97 ms | 138519 tok/s) step 14339/76294 | train loss 3.466156 | norm 4.7803 | lr 3.12e-04 | (3791.76 ms | 138270 tok/s) step 14340/76294 | train loss 3.421459 | norm 2.1379 | lr 3.12e-04 | (3794.66 ms | 138165 tok/s) step 14341/76294 | train loss 3.438746 | norm 2.4848 | lr 3.12e-04 | (3793.36 ms | 138212 tok/s) step 14342/76294 | train loss 3.442905 | norm 3.3743 | lr 3.12e-04 | (3791.06 ms | 138296 tok/s) step 14343/76294 | train loss 3.342016 | norm 0.5801 | lr 3.12e-04 | (3820.01 ms | 137248 tok/s) step 14344/76294 | train loss 3.456588 | norm 2.2296 | lr 3.12e-04 | (3791.80 ms | 138269 tok/s) step 14345/76294 | train loss 3.369823 | norm 1.5041 | lr 3.12e-04 | (3797.71 ms | 138054 tok/s) step 14346/76294 | train loss 3.356636 | norm 1.3245 | lr 3.12e-04 | (3818.96 ms | 137285 tok/s) step 14347/76294 | train loss 3.480810 | norm 0.8369 | lr 3.11e-04 | (4249.55 ms | 123375 tok/s) step 14348/76294 | train loss 3.351538 | norm 1.2840 | lr 3.11e-04 | (3795.21 ms | 138144 tok/s) step 14349/76294 | train loss 3.387325 | norm 1.0309 | lr 3.11e-04 | (3799.95 ms | 137972 tok/s) step 14350/76294 | train loss 3.382058 | norm 1.2649 | lr 3.11e-04 | (3824.67 ms | 137081 tok/s) step 14351/76294 | train loss 3.345419 | norm 0.8927 | lr 3.11e-04 | (3944.28 ms | 132924 tok/s) step 14352/76294 | train loss 3.389403 | norm 1.1356 | lr 3.11e-04 | (5067.36 ms | 103464 tok/s) step 14353/76294 | train loss 3.413949 | norm 0.7324 | lr 3.11e-04 | (3919.08 ms | 133778 tok/s) step 14354/76294 | train loss 3.478165 | norm 2.1896 | lr 3.11e-04 | (3793.40 ms | 138211 tok/s) 
step 14355/76294 | train loss 3.399382 | norm 0.6580 | lr 3.11e-04 | (3819.38 ms | 137271 tok/s) step 14356/76294 | train loss 3.400755 | norm 1.0011 | lr 3.11e-04 | (3839.88 ms | 136537 tok/s) step 14357/76294 | train loss 3.411451 | norm 0.3410 | lr 3.11e-04 | (3797.52 ms | 138061 tok/s) step 14358/76294 | train loss 3.341697 | norm 0.6304 | lr 3.11e-04 | (3814.20 ms | 137457 tok/s) step 14359/76294 | train loss 3.450692 | norm 0.3095 | lr 3.11e-04 | (3830.84 ms | 136860 tok/s) step 14360/76294 | train loss 3.367104 | norm 0.8002 | lr 3.11e-04 | (5374.91 ms | 97544 tok/s) step 14361/76294 | train loss 3.356063 | norm 0.3307 | lr 3.11e-04 | (4109.90 ms | 127567 tok/s) step 14362/76294 | train loss 3.483499 | norm 0.9230 | lr 3.10e-04 | (4589.85 ms | 114228 tok/s) step 14363/76294 | train loss 3.322519 | norm 0.9810 | lr 3.10e-04 | (3814.95 ms | 137430 tok/s) step 14364/76294 | train loss 3.350739 | norm 0.2871 | lr 3.10e-04 | (3889.65 ms | 134790 tok/s) step 14365/76294 | train loss 3.341368 | norm 0.8185 | lr 3.10e-04 | (3786.61 ms | 138458 tok/s) step 14366/76294 | train loss 3.444502 | norm 0.5002 | lr 3.10e-04 | (3798.19 ms | 138036 tok/s) step 14367/76294 | train loss 3.359106 | norm 0.4882 | lr 3.10e-04 | (3812.40 ms | 137522 tok/s) step 14368/76294 | train loss 3.408437 | norm 0.9754 | lr 3.10e-04 | (3798.87 ms | 138012 tok/s) step 14369/76294 | train loss 3.409801 | norm 1.0495 | lr 3.10e-04 | (3807.80 ms | 137688 tok/s) step 14370/76294 | train loss 3.408777 | norm 0.3658 | lr 3.10e-04 | (3868.37 ms | 135532 tok/s) step 14371/76294 | train loss 3.325783 | norm 0.9391 | lr 3.10e-04 | (3795.12 ms | 138148 tok/s) step 14372/76294 | train loss 3.421525 | norm 1.7801 | lr 3.10e-04 | (3808.96 ms | 137646 tok/s) step 14373/76294 | train loss 3.384827 | norm 0.3987 | lr 3.10e-04 | (3820.76 ms | 137221 tok/s) step 14374/76294 | train loss 3.397161 | norm 3.5399 | lr 3.10e-04 | (3801.29 ms | 137924 tok/s) step 14375/76294 | train loss 3.529577 | norm 1.0799 | lr 
3.10e-04 | (3807.52 ms | 137698 tok/s) step 14376/76294 | train loss 3.347252 | norm 1.4809 | lr 3.09e-04 | (3868.43 ms | 135530 tok/s) step 14377/76294 | train loss 3.324982 | norm 1.0652 | lr 3.09e-04 | (3803.29 ms | 137851 tok/s) step 14378/76294 | train loss 3.418573 | norm 0.5275 | lr 3.09e-04 | (3813.30 ms | 137489 tok/s) step 14379/76294 | train loss 3.494820 | norm 0.6330 | lr 3.09e-04 | (3811.94 ms | 137538 tok/s) step 14380/76294 | train loss 3.362374 | norm 1.3220 | lr 3.09e-04 | (3806.31 ms | 137742 tok/s) step 14381/76294 | train loss 3.398008 | norm 0.3363 | lr 3.09e-04 | (3838.46 ms | 136588 tok/s) step 14382/76294 | train loss 3.343034 | norm 1.0873 | lr 3.09e-04 | (3805.64 ms | 137766 tok/s) step 14383/76294 | train loss 3.369369 | norm 1.1734 | lr 3.09e-04 | (3811.19 ms | 137566 tok/s) step 14384/76294 | train loss 3.394014 | norm 0.3200 | lr 3.09e-04 | (3831.64 ms | 136831 tok/s) step 14385/76294 | train loss 3.367590 | norm 0.8618 | lr 3.09e-04 | (3808.24 ms | 137672 tok/s) step 14386/76294 | train loss 3.352366 | norm 1.7780 | lr 3.09e-04 | (3817.20 ms | 137349 tok/s) step 14387/76294 | train loss 3.386057 | norm 0.3097 | lr 3.09e-04 | (3809.79 ms | 137616 tok/s) step 14388/76294 | train loss 3.334518 | norm 0.7911 | lr 3.09e-04 | (3806.68 ms | 137728 tok/s) step 14389/76294 | train loss 3.343967 | norm 0.5043 | lr 3.09e-04 | (3858.55 ms | 135877 tok/s) step 14390/76294 | train loss 3.399592 | norm 0.6065 | lr 3.09e-04 | (3885.40 ms | 134938 tok/s) step 14391/76294 | train loss 3.390606 | norm 0.8666 | lr 3.08e-04 | (3851.50 ms | 136126 tok/s) step 14392/76294 | train loss 3.475505 | norm 0.4968 | lr 3.08e-04 | (3826.54 ms | 137014 tok/s) step 14393/76294 | train loss 3.372366 | norm 1.1071 | lr 3.08e-04 | (3808.07 ms | 137678 tok/s) step 14394/76294 | train loss 3.361474 | norm 0.3260 | lr 3.08e-04 | (3808.35 ms | 137668 tok/s) step 14395/76294 | train loss 3.434509 | norm 3.2363 | lr 3.08e-04 | (3832.86 ms | 136788 tok/s) step 14396/76294 | 
train loss 3.327062 | norm 0.6350 | lr 3.08e-04 | (3807.36 ms | 137704 tok/s) step 14397/76294 | train loss 3.376645 | norm 2.3745 | lr 3.08e-04 | (3889.27 ms | 134804 tok/s) step 14398/76294 | train loss 3.363794 | norm 1.1125 | lr 3.08e-04 | (3901.56 ms | 134379 tok/s) step 14399/76294 | train loss 3.463298 | norm 9.8259 | lr 3.08e-04 | (3788.81 ms | 138378 tok/s) step 14400/76294 | train loss 3.404186 | norm 6.9483 | lr 3.08e-04 | (3795.21 ms | 138145 tok/s) step 14401/76294 | train loss 3.365496 | norm 1.0358 | lr 3.08e-04 | (3825.09 ms | 137066 tok/s) step 14402/76294 | train loss 3.387656 | norm 0.4468 | lr 3.08e-04 | (3800.13 ms | 137966 tok/s) step 14403/76294 | train loss 3.351146 | norm 0.9521 | lr 3.08e-04 | (3802.62 ms | 137875 tok/s) step 14404/76294 | train loss 3.420615 | norm 0.5043 | lr 3.08e-04 | (3830.25 ms | 136881 tok/s) step 14405/76294 | train loss 3.365915 | norm 0.2861 | lr 3.07e-04 | (3805.23 ms | 137781 tok/s) step 14406/76294 | train loss 3.408664 | norm 0.4264 | lr 3.07e-04 | (3800.81 ms | 137941 tok/s) step 14407/76294 | train loss 3.381802 | norm 0.3506 | lr 3.07e-04 | (3827.87 ms | 136966 tok/s) step 14408/76294 | train loss 3.466643 | norm 0.4448 | lr 3.07e-04 | (3799.78 ms | 137978 tok/s) step 14409/76294 | train loss 3.392989 | norm 0.4694 | lr 3.07e-04 | (3806.09 ms | 137750 tok/s) step 14410/76294 | train loss 3.350098 | norm 0.3651 | lr 3.07e-04 | (3909.78 ms | 134096 tok/s) step 14411/76294 | train loss 3.404085 | norm 0.4386 | lr 3.07e-04 | (4597.38 ms | 114041 tok/s) step 14412/76294 | train loss 3.409723 | norm 0.7745 | lr 3.07e-04 | (3875.41 ms | 135286 tok/s) step 14413/76294 | train loss 3.503001 | norm 1.3766 | lr 3.07e-04 | (3806.24 ms | 137744 tok/s) step 14414/76294 | train loss 3.391470 | norm 2.7901 | lr 3.07e-04 | (3809.07 ms | 137642 tok/s) step 14415/76294 | train loss 3.379874 | norm 0.6899 | lr 3.07e-04 | (3829.73 ms | 136899 tok/s) step 14416/76294 | train loss 3.435268 | norm 0.6976 | lr 3.07e-04 | (3820.64 
ms | 137225 tok/s) step 14417/76294 | train loss 3.317633 | norm 4.2656 | lr 3.07e-04 | (3807.30 ms | 137706 tok/s) step 14418/76294 | train loss 3.420054 | norm 1.0266 | lr 3.07e-04 | (3871.48 ms | 135423 tok/s) step 14419/76294 | train loss 3.381959 | norm 2.1247 | lr 3.07e-04 | (3809.39 ms | 137630 tok/s) step 14420/76294 | train loss 3.417998 | norm 1.4694 | lr 3.06e-04 | (3907.94 ms | 134160 tok/s) step 14421/76294 | train loss 3.419208 | norm 0.4094 | lr 3.06e-04 | (3797.34 ms | 138067 tok/s) step 14422/76294 | train loss 3.357311 | norm 1.5629 | lr 3.06e-04 | (3825.90 ms | 137037 tok/s) step 14423/76294 | train loss 3.385209 | norm 0.5962 | lr 3.06e-04 | (3799.22 ms | 137999 tok/s) step 14424/76294 | train loss 3.392774 | norm 1.3718 | lr 3.06e-04 | (3849.85 ms | 136184 tok/s) step 14425/76294 | train loss 3.321980 | norm 1.0846 | lr 3.06e-04 | (3798.67 ms | 138019 tok/s) step 14426/76294 | train loss 3.710291 | norm 1.6946 | lr 3.06e-04 | (3807.75 ms | 137690 tok/s) step 14427/76294 | train loss 3.340810 | norm 1.6588 | lr 3.06e-04 | (3821.71 ms | 137187 tok/s) step 14428/76294 | train loss 3.499079 | norm 0.6885 | lr 3.06e-04 | (3804.35 ms | 137813 tok/s) step 14429/76294 | train loss 3.408146 | norm 1.0219 | lr 3.06e-04 | (3800.20 ms | 137963 tok/s) step 14430/76294 | train loss 3.461276 | norm 0.3492 | lr 3.06e-04 | (3833.32 ms | 136771 tok/s) step 14431/76294 | train loss 3.356702 | norm 0.9128 | lr 3.06e-04 | (3822.95 ms | 137142 tok/s) step 14432/76294 | train loss 3.380594 | norm 0.3040 | lr 3.06e-04 | (3805.98 ms | 137754 tok/s) step 14433/76294 | train loss 3.364682 | norm 0.5795 | lr 3.06e-04 | (4030.43 ms | 130082 tok/s) step 14434/76294 | train loss 3.346986 | norm 0.2881 | lr 3.06e-04 | (3805.01 ms | 137789 tok/s) step 14435/76294 | train loss 3.395319 | norm 0.6205 | lr 3.05e-04 | (3810.36 ms | 137595 tok/s) step 14436/76294 | train loss 3.361707 | norm 0.3357 | lr 3.05e-04 | (3806.13 ms | 137748 tok/s) step 14437/76294 | train loss 3.380735 | 
norm 0.6342 | lr 3.05e-04 | (3831.58 ms | 136833 tok/s) step 14438/76294 | train loss 3.397473 | norm 1.8827 | lr 3.05e-04 | (3805.71 ms | 137763 tok/s) step 14439/76294 | train loss 3.317956 | norm 1.0807 | lr 3.05e-04 | (3812.21 ms | 137529 tok/s) step 14440/76294 | train loss 3.408504 | norm 0.5755 | lr 3.05e-04 | (3807.29 ms | 137706 tok/s) step 14441/76294 | train loss 3.331040 | norm 0.5042 | lr 3.05e-04 | (3807.32 ms | 137705 tok/s) step 14442/76294 | train loss 3.418826 | norm 0.9398 | lr 3.05e-04 | (3812.50 ms | 137518 tok/s) step 14443/76294 | train loss 3.396933 | norm 2.1092 | lr 3.05e-04 | (3808.76 ms | 137653 tok/s) step 14444/76294 | train loss 3.345417 | norm 0.2969 | lr 3.05e-04 | (3859.79 ms | 135833 tok/s) step 14445/76294 | train loss 3.347946 | norm 3.3293 | lr 3.05e-04 | (3801.91 ms | 137901 tok/s) step 14446/76294 | train loss 3.425983 | norm 0.4095 | lr 3.05e-04 | (3810.33 ms | 137597 tok/s) step 14447/76294 | train loss 3.414495 | norm 0.8921 | lr 3.05e-04 | (3823.58 ms | 137120 tok/s) step 14448/76294 | train loss 3.360867 | norm 1.9352 | lr 3.05e-04 | (3877.99 ms | 135196 tok/s) step 14449/76294 | train loss 3.447171 | norm 0.4859 | lr 3.05e-04 | (3808.98 ms | 137645 tok/s) step 14450/76294 | train loss 3.439145 | norm 1.8066 | lr 3.04e-04 | (3855.87 ms | 135971 tok/s) step 14451/76294 | train loss 3.402287 | norm 0.8453 | lr 3.04e-04 | (3814.29 ms | 137454 tok/s) step 14452/76294 | train loss 3.384977 | norm 5.1973 | lr 3.04e-04 | (3828.12 ms | 136957 tok/s) step 14453/76294 | train loss 3.330274 | norm 4.4859 | lr 3.04e-04 | (3827.41 ms | 136983 tok/s) step 14454/76294 | train loss 3.372344 | norm 0.8372 | lr 3.04e-04 | (3803.31 ms | 137850 tok/s) step 14455/76294 | train loss 3.410903 | norm 0.8882 | lr 3.04e-04 | (3821.98 ms | 137177 tok/s) step 14456/76294 | train loss 3.355105 | norm 1.1592 | lr 3.04e-04 | (3808.81 ms | 137652 tok/s) step 14457/76294 | train loss 3.368256 | norm 0.5005 | lr 3.04e-04 | (3832.05 ms | 136817 tok/s) 
step 14458/76294 | train loss 3.311369 | norm 1.2278 | lr 3.04e-04 | (3807.88 ms | 137685 tok/s) step 14459/76294 | train loss 3.378397 | norm 0.3853 | lr 3.04e-04 | (3822.63 ms | 137154 tok/s) step 14460/76294 | train loss 3.383640 | norm 0.6270 | lr 3.04e-04 | (3807.97 ms | 137682 tok/s) step 14461/76294 | train loss 3.427281 | norm 0.3598 | lr 3.04e-04 | (3823.22 ms | 137133 tok/s) step 14462/76294 | train loss 3.394907 | norm 0.8167 | lr 3.04e-04 | (3806.30 ms | 137742 tok/s) step 14463/76294 | train loss 3.309942 | norm 0.4976 | lr 3.04e-04 | (3812.58 ms | 137515 tok/s) step 14464/76294 | train loss 3.381924 | norm 0.5742 | lr 3.03e-04 | (3806.30 ms | 137742 tok/s) step 14465/76294 | train loss 3.335088 | norm 0.4300 | lr 3.03e-04 | (3807.44 ms | 137701 tok/s) step 14466/76294 | train loss 3.602613 | norm 0.5884 | lr 3.03e-04 | (3800.94 ms | 137936 tok/s) step 14467/76294 | train loss 3.397773 | norm 0.8784 | lr 3.03e-04 | (3811.01 ms | 137572 tok/s) step 14468/76294 | train loss 3.386855 | norm 0.5497 | lr 3.03e-04 | (3808.83 ms | 137651 tok/s) step 14469/76294 | train loss 3.404932 | norm 0.3324 | lr 3.03e-04 | (3827.15 ms | 136992 tok/s) step 14470/76294 | train loss 3.295770 | norm 1.0041 | lr 3.03e-04 | (3808.81 ms | 137651 tok/s) step 14471/76294 | train loss 3.423428 | norm 1.3568 | lr 3.03e-04 | (3807.33 ms | 137705 tok/s) step 14472/76294 | train loss 3.391675 | norm 0.5622 | lr 3.03e-04 | (3912.47 ms | 134004 tok/s) step 14473/76294 | train loss 3.355149 | norm 2.6581 | lr 3.03e-04 | (3805.12 ms | 137785 tok/s) step 14474/76294 | train loss 3.373481 | norm 0.8290 | lr 3.03e-04 | (3809.40 ms | 137630 tok/s) step 14475/76294 | train loss 3.348993 | norm 1.1070 | lr 3.03e-04 | (3846.50 ms | 136303 tok/s) step 14476/76294 | train loss 3.396553 | norm 0.8967 | lr 3.03e-04 | (3809.04 ms | 137643 tok/s) step 14477/76294 | train loss 3.471677 | norm 0.3492 | lr 3.03e-04 | (3809.71 ms | 137619 tok/s) step 14478/76294 | train loss 3.354501 | norm 2.9799 | lr 
3.03e-04 | (3807.24 ms | 137708 tok/s) step 14479/76294 | train loss 3.333963 | norm 0.8999 | lr 3.02e-04 | (3810.28 ms | 137598 tok/s) step 14480/76294 | train loss 3.378544 | norm 1.1432 | lr 3.02e-04 | (3808.59 ms | 137659 tok/s) step 14481/76294 | train loss 3.358709 | norm 0.7318 | lr 3.02e-04 | (3822.63 ms | 137154 tok/s) step 14482/76294 | train loss 3.431449 | norm 0.5859 | lr 3.02e-04 | (3808.03 ms | 137680 tok/s) step 14483/76294 | train loss 3.327727 | norm 0.6592 | lr 3.02e-04 | (3835.03 ms | 136710 tok/s) step 14484/76294 | train loss 3.377898 | norm 1.9579 | lr 3.02e-04 | (3859.36 ms | 135848 tok/s) step 14485/76294 | train loss 3.362655 | norm 0.6698 | lr 3.02e-04 | (3804.22 ms | 137817 tok/s) step 14486/76294 | train loss 3.463224 | norm 0.6094 | lr 3.02e-04 | (3833.23 ms | 136775 tok/s) step 14487/76294 | train loss 3.384879 | norm 0.8114 | lr 3.02e-04 | (3844.36 ms | 136378 tok/s) step 14488/76294 | train loss 3.384487 | norm 0.4730 | lr 3.02e-04 | (3808.27 ms | 137671 tok/s) step 14489/76294 | train loss 3.399173 | norm 0.4354 | lr 3.02e-04 | (3799.68 ms | 137982 tok/s) step 14490/76294 | train loss 3.420617 | norm 1.0775 | lr 3.02e-04 | (3958.06 ms | 132461 tok/s) step 14491/76294 | train loss 3.394210 | norm 1.2940 | lr 3.02e-04 | (3845.54 ms | 136337 tok/s) step 14492/76294 | train loss 3.410856 | norm 0.4548 | lr 3.02e-04 | (3803.22 ms | 137854 tok/s) step 14493/76294 | train loss 3.348112 | norm 2.5474 | lr 3.02e-04 | (3835.55 ms | 136692 tok/s) step 14494/76294 | train loss 3.402381 | norm 0.3142 | lr 3.01e-04 | (3853.55 ms | 136053 tok/s) step 14495/76294 | train loss 3.373728 | norm 0.9749 | lr 3.01e-04 | (3808.01 ms | 137680 tok/s) step 14496/76294 | train loss 3.362683 | norm 0.4080 | lr 3.01e-04 | (3800.84 ms | 137940 tok/s) step 14497/76294 | train loss 3.450640 | norm 0.3878 | lr 3.01e-04 | (3831.84 ms | 136824 tok/s) step 14498/76294 | train loss 3.339676 | norm 0.3415 | lr 3.01e-04 | (3802.13 ms | 137893 tok/s) step 14499/76294 | 
train loss 3.388074 | norm 0.2794 | lr 3.01e-04 | (3803.83 ms | 137832 tok/s) step 14500/76294 | train loss 3.393662 | norm 0.3298 | lr 3.01e-04 | (3828.69 ms | 136937 tok/s) val loss: 3.374515 saving model checkpoint to ./results/gpt2-124M-gqa/step_14500.pth step 14501/76294 | train loss 3.345619 | norm 0.8231 | lr 3.01e-04 | (3839.01 ms | 136569 tok/s) step 14502/76294 | train loss 3.398320 | norm 3.6149 | lr 3.01e-04 | (3798.47 ms | 138026 tok/s) step 14503/76294 | train loss 3.403718 | norm 0.5516 | lr 3.01e-04 | (3816.65 ms | 137369 tok/s) step 14504/76294 | train loss 3.371023 | norm 0.4004 | lr 3.01e-04 | (3801.45 ms | 137918 tok/s) step 14505/76294 | train loss 3.413479 | norm 0.8384 | lr 3.01e-04 | (3851.10 ms | 136140 tok/s) step 14506/76294 | train loss 3.337280 | norm 0.7763 | lr 3.01e-04 | (3801.08 ms | 137931 tok/s) step 14507/76294 | train loss 3.313559 | norm 0.3568 | lr 3.01e-04 | (3805.49 ms | 137772 tok/s) step 14508/76294 | train loss 3.304118 | norm 2.3098 | lr 3.01e-04 | (3831.13 ms | 136849 tok/s) step 14509/76294 | train loss 3.422579 | norm 0.7931 | lr 3.00e-04 | (3810.01 ms | 137608 tok/s) step 14510/76294 | train loss 3.423378 | norm 0.7329 | lr 3.00e-04 | (3807.32 ms | 137705 tok/s) step 14511/76294 | train loss 3.346851 | norm 2.4260 | lr 3.00e-04 | (3807.07 ms | 137714 tok/s) step 14512/76294 | train loss 3.411109 | norm 0.5466 | lr 3.00e-04 | (3809.46 ms | 137628 tok/s) step 14513/76294 | train loss 3.316461 | norm 0.6472 | lr 3.00e-04 | (3809.36 ms | 137631 tok/s) step 14514/76294 | train loss 3.470644 | norm 0.4966 | lr 3.00e-04 | (3801.27 ms | 137924 tok/s) step 14515/76294 | train loss 3.370481 | norm 0.3882 | lr 3.00e-04 | (3832.53 ms | 136800 tok/s) step 14516/76294 | train loss 3.329039 | norm 0.3586 | lr 3.00e-04 | (3810.67 ms | 137584 tok/s) step 14517/76294 | train loss 3.433537 | norm 1.3229 | lr 3.00e-04 | (3805.28 ms | 137779 tok/s) step 14518/76294 | train loss 3.400821 | norm 0.4324 | lr 3.00e-04 | (3805.77 ms | 137762 
tok/s) step 14519/76294 | train loss 3.377998 | norm 0.5826 | lr 3.00e-04 | (3807.09 ms | 137714 tok/s) step 14520/76294 | train loss 3.448231 | norm 0.4017 | lr 3.00e-04 | (3802.35 ms | 137885 tok/s) step 14521/76294 | train loss 3.393790 | norm 0.6308 | lr 3.00e-04 | (3833.31 ms | 136772 tok/s) step 14522/76294 | train loss 3.406986 | norm 0.4248 | lr 3.00e-04 | (3858.19 ms | 135890 tok/s) step 14523/76294 | train loss 3.344043 | norm 0.3719 | lr 3.00e-04 | (3840.86 ms | 136503 tok/s) step 14524/76294 | train loss 3.387604 | norm 0.5155 | lr 2.99e-04 | (3802.31 ms | 137887 tok/s) step 14525/76294 | train loss 3.440883 | norm 0.3498 | lr 2.99e-04 | (3811.48 ms | 137555 tok/s) step 14526/76294 | train loss 3.346834 | norm 0.3563 | lr 2.99e-04 | (3854.43 ms | 136022 tok/s) step 14527/76294 | train loss 3.432671 | norm 0.4536 | lr 2.99e-04 | (3802.38 ms | 137884 tok/s) step 14528/76294 | train loss 3.475785 | norm 0.4253 | lr 2.99e-04 | (3807.87 ms | 137685 tok/s) step 14529/76294 | train loss 3.417309 | norm 0.4192 | lr 2.99e-04 | (3824.87 ms | 137073 tok/s) step 14530/76294 | train loss 3.389804 | norm 1.3122 | lr 2.99e-04 | (3806.93 ms | 137719 tok/s) step 14531/76294 | train loss 3.354278 | norm 0.5127 | lr 2.99e-04 | (3826.43 ms | 137018 tok/s) step 14532/76294 | train loss 3.371262 | norm 0.3532 | lr 2.99e-04 | (3908.25 ms | 134149 tok/s) step 14533/76294 | train loss 3.378637 | norm 0.3382 | lr 2.99e-04 | (3801.48 ms | 137917 tok/s) step 14534/76294 | train loss 3.361653 | norm 0.3132 | lr 2.99e-04 | (3838.86 ms | 136574 tok/s) step 14535/76294 | train loss 3.401872 | norm 0.3642 | lr 2.99e-04 | (3850.62 ms | 136157 tok/s) step 14536/76294 | train loss 3.351521 | norm 0.2888 | lr 2.99e-04 | (3836.39 ms | 136662 tok/s) step 14537/76294 | train loss 3.376366 | norm 0.3155 | lr 2.99e-04 | (3805.16 ms | 137783 tok/s) step 14538/76294 | train loss 3.382779 | norm 0.3771 | lr 2.99e-04 | (3814.63 ms | 137441 tok/s) step 14539/76294 | train loss 3.357347 | norm 0.3165 
| lr 2.98e-04 | (3813.57 ms | 137480 tok/s) step 14540/76294 | train loss 3.372571 | norm 0.3045 | lr 2.98e-04 | (3812.00 ms | 137536 tok/s) step 14541/76294 | train loss 3.329786 | norm 0.2865 | lr 2.98e-04 | (3814.44 ms | 137448 tok/s) step 14542/76294 | train loss 3.404615 | norm 0.3726 | lr 2.98e-04 | (3831.53 ms | 136835 tok/s) step 14543/76294 | train loss 3.406112 | norm 0.3611 | lr 2.98e-04 | (3811.72 ms | 137546 tok/s) step 14544/76294 | train loss 3.365462 | norm 0.2932 | lr 2.98e-04 | (3808.68 ms | 137656 tok/s) step 14545/76294 | train loss 3.388199 | norm 0.2978 | lr 2.98e-04 | (3807.64 ms | 137694 tok/s) step 14546/76294 | train loss 3.332774 | norm 0.2992 | lr 2.98e-04 | (3807.25 ms | 137708 tok/s) step 14547/76294 | train loss 3.418154 | norm 0.2931 | lr 2.98e-04 | (3803.88 ms | 137830 tok/s) step 14548/76294 | train loss 3.347492 | norm 0.4116 | lr 2.98e-04 | (3898.72 ms | 134477 tok/s) step 14549/76294 | train loss 3.664082 | norm 0.5329 | lr 2.98e-04 | (3803.07 ms | 137859 tok/s) step 14550/76294 | train loss 3.346655 | norm 0.4259 | lr 2.98e-04 | (3805.21 ms | 137781 tok/s) step 14551/76294 | train loss 3.371096 | norm 0.4948 | lr 2.98e-04 | (3823.98 ms | 137105 tok/s) step 14552/76294 | train loss 3.409591 | norm 0.3810 | lr 2.98e-04 | (3805.55 ms | 137769 tok/s) step 14553/76294 | train loss 3.366691 | norm 0.4129 | lr 2.98e-04 | (3827.10 ms | 136993 tok/s) step 14554/76294 | train loss 3.381067 | norm 0.4262 | lr 2.97e-04 | (3807.12 ms | 137713 tok/s) step 14555/76294 | train loss 3.374448 | norm 0.3550 | lr 2.97e-04 | (3908.16 ms | 134152 tok/s) step 14556/76294 | train loss 3.375730 | norm 0.3701 | lr 2.97e-04 | (3881.35 ms | 135079 tok/s) step 14557/76294 | train loss 3.432491 | norm 0.3797 | lr 2.97e-04 | (3801.32 ms | 137923 tok/s) step 14558/76294 | train loss 3.326153 | norm 0.4638 | lr 2.97e-04 | (3805.86 ms | 137758 tok/s) step 14559/76294 | train loss 3.392944 | norm 0.4014 | lr 2.97e-04 | (3796.14 ms | 138111 tok/s) step 
14560/76294 | train loss 3.399739 | norm 0.4186 | lr 2.97e-04 | (3830.88 ms | 136858 tok/s) step 14561/76294 | train loss 3.333167 | norm 0.3168 | lr 2.97e-04 | (3823.58 ms | 137120 tok/s) step 14562/76294 | train loss 3.340086 | norm 0.3333 | lr 2.97e-04 | (4164.40 ms | 125898 tok/s) step 14563/76294 | train loss 3.352474 | norm 0.3457 | lr 2.97e-04 | (4447.28 ms | 117889 tok/s) step 14564/76294 | train loss 3.374184 | norm 0.3911 | lr 2.97e-04 | (3875.01 ms | 135300 tok/s) step 14565/76294 | train loss 3.338038 | norm 0.5783 | lr 2.97e-04 | (3806.89 ms | 137721 tok/s) step 14566/76294 | train loss 3.361233 | norm 0.3166 | lr 2.97e-04 | (3910.09 ms | 134086 tok/s) step 14567/76294 | train loss 3.355814 | norm 0.3009 | lr 2.97e-04 | (3782.53 ms | 138608 tok/s) step 14568/76294 | train loss 3.362406 | norm 0.3210 | lr 2.97e-04 | (3852.49 ms | 136091 tok/s) step 14569/76294 | train loss 3.312093 | norm 0.3404 | lr 2.96e-04 | (3786.73 ms | 138454 tok/s) step 14570/76294 | train loss 3.445154 | norm 0.3814 | lr 2.96e-04 | (3817.46 ms | 137340 tok/s) step 14571/76294 | train loss 3.311430 | norm 0.3571 | lr 2.96e-04 | (3789.54 ms | 138351 tok/s) step 14572/76294 | train loss 3.383357 | norm 0.3724 | lr 2.96e-04 | (3802.72 ms | 137872 tok/s) step 14573/76294 | train loss 3.341418 | norm 0.3958 | lr 2.96e-04 | (3795.64 ms | 138129 tok/s) step 14574/76294 | train loss 3.376431 | norm 0.3923 | lr 2.96e-04 | (3797.13 ms | 138075 tok/s) step 14575/76294 | train loss 3.399681 | norm 0.3371 | lr 2.96e-04 | (3865.83 ms | 135621 tok/s) step 14576/76294 | train loss 3.351569 | norm 0.3579 | lr 2.96e-04 | (3800.40 ms | 137956 tok/s) step 14577/76294 | train loss 3.394110 | norm 0.4315 | lr 2.96e-04 | (3799.83 ms | 137977 tok/s) step 14578/76294 | train loss 3.367059 | norm 0.3932 | lr 2.96e-04 | (3797.94 ms | 138045 tok/s) step 14579/76294 | train loss 3.358109 | norm 0.3074 | lr 2.96e-04 | (3857.76 ms | 135905 tok/s) step 14580/76294 | train loss 3.359956 | norm 0.3612 | lr 
2.96e-04 | (3810.22 ms | 137601 tok/s) step 14581/76294 | train loss 3.348396 | norm 0.2899 | lr 2.96e-04 | (3829.78 ms | 136898 tok/s) step 14582/76294 | train loss 3.334514 | norm 0.3465 | lr 2.96e-04 | (3820.44 ms | 137233 tok/s) step 14583/76294 | train loss 3.419761 | norm 0.4102 | lr 2.96e-04 | (3833.01 ms | 136782 tok/s) step 14584/76294 | train loss 3.380843 | norm 0.4777 | lr 2.95e-04 | (3803.25 ms | 137853 tok/s) step 14585/76294 | train loss 3.444383 | norm 0.4505 | lr 2.95e-04 | (3826.40 ms | 137019 tok/s) step 14586/76294 | train loss 3.376308 | norm 0.4073 | lr 2.95e-04 | (3805.30 ms | 137778 tok/s) step 14587/76294 | train loss 3.490371 | norm 0.3757 | lr 2.95e-04 | (3860.40 ms | 135812 tok/s) step 14588/76294 | train loss 3.373449 | norm 0.3662 | lr 2.95e-04 | (3805.78 ms | 137761 tok/s) step 14589/76294 | train loss 3.356033 | norm 0.4075 | lr 2.95e-04 | (3825.36 ms | 137056 tok/s) step 14590/76294 | train loss 3.402684 | norm 0.4448 | lr 2.95e-04 | (3805.96 ms | 137755 tok/s) step 14591/76294 | train loss 3.375167 | norm 0.5518 | lr 2.95e-04 | (3802.74 ms | 137871 tok/s) step 14592/76294 | train loss 3.398446 | norm 0.5305 | lr 2.95e-04 | (3806.22 ms | 137745 tok/s) step 14593/76294 | train loss 3.426719 | norm 0.5321 | lr 2.95e-04 | (3811.92 ms | 137539 tok/s) step 14594/76294 | train loss 3.404772 | norm 0.6231 | lr 2.95e-04 | (3804.80 ms | 137797 tok/s) step 14595/76294 | train loss 3.431585 | norm 0.5589 | lr 2.95e-04 | (3804.48 ms | 137808 tok/s) step 14596/76294 | train loss 3.532691 | norm 0.3911 | lr 2.95e-04 | (3968.45 ms | 132114 tok/s) step 14597/76294 | train loss 3.484412 | norm 0.5677 | lr 2.95e-04 | (3798.89 ms | 138011 tok/s) step 14598/76294 | train loss 3.330275 | norm 0.3632 | lr 2.95e-04 | (3833.65 ms | 136760 tok/s) step 14599/76294 | train loss 3.434227 | norm 0.6016 | lr 2.94e-04 | (3806.88 ms | 137721 tok/s) step 14600/76294 | train loss 3.501072 | norm 0.6279 | lr 2.94e-04 | (3840.37 ms | 136520 tok/s) step 14601/76294 | 
train loss 3.386539 | norm 0.4304 | lr 2.94e-04 | (3825.72 ms | 137043 tok/s) step 14602/76294 | train loss 3.432766 | norm 0.4881 | lr 2.94e-04 | (4143.88 ms | 126521 tok/s) step 14603/76294 | train loss 3.399149 | norm 0.5695 | lr 2.94e-04 | (3826.09 ms | 137030 tok/s) step 14604/76294 | train loss 3.375936 | norm 0.3676 | lr 2.94e-04 | (3832.45 ms | 136802 tok/s) step 14605/76294 | train loss 3.413920 | norm 0.6190 | lr 2.94e-04 | (3817.48 ms | 137339 tok/s) step 14606/76294 | train loss 3.397701 | norm 0.6366 | lr 2.94e-04 | (3855.17 ms | 135996 tok/s) step 14607/76294 | train loss 3.356104 | norm 0.7196 | lr 2.94e-04 | (3843.67 ms | 136403 tok/s) step 14608/76294 | train loss 3.339071 | norm 0.5184 | lr 2.94e-04 | (3805.83 ms | 137759 tok/s) step 14609/76294 | train loss 3.334162 | norm 0.3933 | lr 2.94e-04 | (3805.77 ms | 137761 tok/s) step 14610/76294 | train loss 3.405734 | norm 0.5321 | lr 2.94e-04 | (3864.14 ms | 135680 tok/s) step 14611/76294 | train loss 3.357165 | norm 0.6246 | lr 2.94e-04 | (5674.79 ms | 92389 tok/s) step 14612/76294 | train loss 3.572511 | norm 0.8343 | lr 2.94e-04 | (3809.15 ms | 137639 tok/s) step 14613/76294 | train loss 3.449414 | norm 1.0046 | lr 2.94e-04 | (3917.06 ms | 133847 tok/s) step 14614/76294 | train loss 3.405380 | norm 0.4068 | lr 2.93e-04 | (3843.84 ms | 136397 tok/s) step 14615/76294 | train loss 3.407211 | norm 0.5067 | lr 2.93e-04 | (3878.94 ms | 135163 tok/s) step 14616/76294 | train loss 3.364612 | norm 0.5361 | lr 2.93e-04 | (3866.32 ms | 135604 tok/s) step 14617/76294 | train loss 3.463709 | norm 0.3703 | lr 2.93e-04 | (3794.15 ms | 138183 tok/s) step 14618/76294 | train loss 3.436723 | norm 0.4923 | lr 2.93e-04 | (3807.69 ms | 137692 tok/s) step 14619/76294 | train loss 3.360283 | norm 0.5358 | lr 2.93e-04 | (3793.47 ms | 138208 tok/s) step 14620/76294 | train loss 3.442428 | norm 0.4392 | lr 2.93e-04 | (3816.54 ms | 137373 tok/s) step 14621/76294 | train loss 3.425230 | norm 0.4865 | lr 2.93e-04 | (3796.94 
ms | 138082 tok/s) step 14622/76294 | train loss 3.379678 | norm 0.5465 | lr 2.93e-04 | (3803.48 ms | 137844 tok/s) step 14623/76294 | train loss 3.342299 | norm 1.1828 | lr 2.93e-04 | (3815.22 ms | 137420 tok/s) step 14624/76294 | train loss 3.448615 | norm 0.6287 | lr 2.93e-04 | (3801.36 ms | 137921 tok/s) step 14625/76294 | train loss 3.373652 | norm 0.5636 | lr 2.93e-04 | (3796.34 ms | 138104 tok/s) step 14626/76294 | train loss 3.407559 | norm 0.9487 | lr 2.93e-04 | (3872.03 ms | 135404 tok/s) step 14627/76294 | train loss 3.383222 | norm 0.9653 | lr 2.93e-04 | (3835.86 ms | 136681 tok/s) step 14628/76294 | train loss 3.436205 | norm 0.7402 | lr 2.93e-04 | (3803.22 ms | 137854 tok/s) step 14629/76294 | train loss 3.337919 | norm 0.7138 | lr 2.92e-04 | (3831.20 ms | 136847 tok/s) step 14630/76294 | train loss 3.484489 | norm 0.6691 | lr 2.92e-04 | (3837.86 ms | 136609 tok/s) step 14631/76294 | train loss 3.337880 | norm 0.5467 | lr 2.92e-04 | (3806.96 ms | 137718 tok/s) step 14632/76294 | train loss 3.394445 | norm 0.7389 | lr 2.92e-04 | (3842.46 ms | 136446 tok/s) step 14633/76294 | train loss 3.476764 | norm 0.7042 | lr 2.92e-04 | (3804.47 ms | 137808 tok/s) step 14634/76294 | train loss 3.385945 | norm 0.5612 | lr 2.92e-04 | (3813.09 ms | 137497 tok/s) step 14635/76294 | train loss 3.411295 | norm 0.5600 | lr 2.92e-04 | (3829.42 ms | 136911 tok/s) step 14636/76294 | train loss 3.374766 | norm 0.6163 | lr 2.92e-04 | (3814.35 ms | 137451 tok/s) step 14637/76294 | train loss 3.325794 | norm 0.4812 | lr 2.92e-04 | (3812.62 ms | 137514 tok/s) step 14638/76294 | train loss 3.332021 | norm 0.5561 | lr 2.92e-04 | (3948.73 ms | 132774 tok/s) step 14639/76294 | train loss 3.355580 | norm 0.6278 | lr 2.92e-04 | (3875.07 ms | 135298 tok/s) step 14640/76294 | train loss 3.444456 | norm 0.9083 | lr 2.92e-04 | (3800.54 ms | 137951 tok/s) step 14641/76294 | train loss 3.409219 | norm 0.6230 | lr 2.92e-04 | (3821.27 ms | 137203 tok/s) step 14642/76294 | train loss 3.423351 | 
norm 0.4905 | lr 2.92e-04 | (3802.69 ms | 137873 tok/s) step 14643/76294 | train loss 3.353614 | norm 0.5136 | lr 2.92e-04 | (3828.49 ms | 136944 tok/s) step 14644/76294 | train loss 3.356435 | norm 0.4435 | lr 2.91e-04 | (3803.19 ms | 137855 tok/s) step 14645/76294 | train loss 3.372974 | norm 0.4766 | lr 2.91e-04 | (3813.26 ms | 137491 tok/s) step 14646/76294 | train loss 3.369550 | norm 0.4980 | lr 2.91e-04 | (3823.70 ms | 137115 tok/s) step 14647/76294 | train loss 3.389198 | norm 0.5432 | lr 2.91e-04 | (3805.89 ms | 137757 tok/s) step 14648/76294 | train loss 3.343409 | norm 0.4291 | lr 2.91e-04 | (3814.43 ms | 137449 tok/s) step 14649/76294 | train loss 3.341367 | norm 0.5533 | lr 2.91e-04 | (3810.90 ms | 137576 tok/s) step 14650/76294 | train loss 3.373735 | norm 0.4478 | lr 2.91e-04 | (3811.27 ms | 137563 tok/s) step 14651/76294 | train loss 3.376738 | norm 0.4998 | lr 2.91e-04 | (3805.66 ms | 137765 tok/s) step 14652/76294 | train loss 3.490271 | norm 0.4930 | lr 2.91e-04 | (3814.39 ms | 137450 tok/s) step 14653/76294 | train loss 3.304650 | norm 0.5208 | lr 2.91e-04 | (3808.74 ms | 137654 tok/s) step 14654/76294 | train loss 3.380184 | norm 0.4540 | lr 2.91e-04 | (3809.33 ms | 137633 tok/s) step 14655/76294 | train loss 3.385210 | norm 0.4142 | lr 2.91e-04 | (3809.51 ms | 137626 tok/s) step 14656/76294 | train loss 3.463120 | norm 0.4588 | lr 2.91e-04 | (3831.15 ms | 136849 tok/s) step 14657/76294 | train loss 3.397036 | norm 0.4686 | lr 2.91e-04 | (3826.55 ms | 137013 tok/s) step 14658/76294 | train loss 3.321832 | norm 0.4324 | lr 2.91e-04 | (3804.49 ms | 137808 tok/s) step 14659/76294 | train loss 3.418154 | norm 0.4320 | lr 2.90e-04 | (3833.04 ms | 136781 tok/s) step 14660/76294 | train loss 3.414566 | norm 0.5306 | lr 2.90e-04 | (3898.20 ms | 134495 tok/s) step 14661/76294 | train loss 3.374078 | norm 0.4249 | lr 2.90e-04 | (3830.40 ms | 136875 tok/s) step 14662/76294 | train loss 3.392961 | norm 0.4301 | lr 2.90e-04 | (3803.80 ms | 137833 tok/s) 
step 14663/76294 | train loss 3.377567 | norm 0.5086 | lr 2.90e-04 | (3836.74 ms | 136649 tok/s) step 14664/76294 | train loss 3.409343 | norm 0.4409 | lr 2.90e-04 | (3802.87 ms | 137866 tok/s) step 14665/76294 | train loss 3.381174 | norm 0.4833 | lr 2.90e-04 | (3808.87 ms | 137649 tok/s) step 14666/76294 | train loss 3.351245 | norm 0.4736 | lr 2.90e-04 | (3832.27 ms | 136809 tok/s) step 14667/76294 | train loss 3.390314 | norm 0.4486 | lr 2.90e-04 | (3802.52 ms | 137879 tok/s) step 14668/76294 | train loss 3.407085 | norm 0.4668 | lr 2.90e-04 | (3827.38 ms | 136984 tok/s) step 14669/76294 | train loss 3.437784 | norm 0.5243 | lr 2.90e-04 | (3807.16 ms | 137711 tok/s) step 14670/76294 | train loss 3.425203 | norm 0.4581 | lr 2.90e-04 | (3808.06 ms | 137678 tok/s) step 14671/76294 | train loss 3.403143 | norm 0.6264 | lr 2.90e-04 | (3841.69 ms | 136473 tok/s) step 14672/76294 | train loss 3.398218 | norm 0.4584 | lr 2.90e-04 | (3820.42 ms | 137233 tok/s) step 14673/76294 | train loss 3.396505 | norm 0.4573 | lr 2.90e-04 | (3817.98 ms | 137321 tok/s) step 14674/76294 | train loss 3.392384 | norm 0.4645 | lr 2.90e-04 | (3807.73 ms | 137690 tok/s) step 14675/76294 | train loss 3.405165 | norm 0.6431 | lr 2.89e-04 | (3916.77 ms | 133857 tok/s) step 14676/76294 | train loss 3.360572 | norm 1.0767 | lr 2.89e-04 | (3805.18 ms | 137783 tok/s) step 14677/76294 | train loss 3.345841 | norm 0.6030 | lr 2.89e-04 | (3801.38 ms | 137921 tok/s) step 14678/76294 | train loss 3.465674 | norm 0.5825 | lr 2.89e-04 | (3839.68 ms | 136545 tok/s) step 14679/76294 | train loss 3.379684 | norm 0.5310 | lr 2.89e-04 | (3803.50 ms | 137844 tok/s) step 14680/76294 | train loss 3.388054 | norm 0.3976 | lr 2.89e-04 | (3827.72 ms | 136971 tok/s) step 14681/76294 | train loss 3.430853 | norm 0.4897 | lr 2.89e-04 | (3816.37 ms | 137379 tok/s) step 14682/76294 | train loss 3.293715 | norm 0.7217 | lr 2.89e-04 | (3905.26 ms | 134252 tok/s) step 14683/76294 | train loss 3.396729 | norm 0.7907 | lr 
2.89e-04 | (3797.55 ms | 138060 tok/s) step 14684/76294 | train loss 3.401344 | norm 0.6359 | lr 2.89e-04 | (3867.16 ms | 135574 tok/s) step 14685/76294 | train loss 3.356999 | norm 0.5195 | lr 2.89e-04 | (3805.16 ms | 137783 tok/s) step 14686/76294 | train loss 3.399314 | norm 0.6362 | lr 2.89e-04 | (3847.82 ms | 136256 tok/s) step 14687/76294 | train loss 3.395623 | norm 0.6521 | lr 2.89e-04 | (3827.57 ms | 136977 tok/s) step 14688/76294 | train loss 3.366575 | norm 0.4605 | lr 2.89e-04 | (3804.08 ms | 137822 tok/s) step 14689/76294 | train loss 3.402654 | norm 0.4869 | lr 2.89e-04 | (3829.13 ms | 136921 tok/s) step 14690/76294 | train loss 3.364908 | norm 0.3898 | lr 2.88e-04 | (3810.30 ms | 137598 tok/s) step 14691/76294 | train loss 3.392726 | norm 0.4251 | lr 2.88e-04 | (3828.87 ms | 136930 tok/s) step 14692/76294 | train loss 3.326339 | norm 0.3763 | lr 2.88e-04 | (3804.65 ms | 137802 tok/s) step 14693/76294 | train loss 3.446730 | norm 0.6938 | lr 2.88e-04 | (3815.03 ms | 137427 tok/s) step 14694/76294 | train loss 3.465779 | norm 0.6969 | lr 2.88e-04 | (3798.62 ms | 138020 tok/s) step 14695/76294 | train loss 3.385629 | norm 0.4049 | lr 2.88e-04 | (3852.49 ms | 136091 tok/s) step 14696/76294 | train loss 3.451304 | norm 0.6071 | lr 2.88e-04 | (3802.26 ms | 137889 tok/s) step 14697/76294 | train loss 3.302088 | norm 0.6247 | lr 2.88e-04 | (3809.16 ms | 137639 tok/s) step 14698/76294 | train loss 3.390239 | norm 0.4324 | lr 2.88e-04 | (3829.47 ms | 136909 tok/s) step 14699/76294 | train loss 3.445039 | norm 0.4066 | lr 2.88e-04 | (3804.51 ms | 137807 tok/s) step 14700/76294 | train loss 3.395709 | norm 0.4402 | lr 2.88e-04 | (3809.53 ms | 137625 tok/s) step 14701/76294 | train loss 3.414817 | norm 0.4338 | lr 2.88e-04 | (3805.98 ms | 137754 tok/s) step 14702/76294 | train loss 3.338931 | norm 0.3797 | lr 2.88e-04 | (3807.50 ms | 137699 tok/s) step 14703/76294 | train loss 3.484008 | norm 0.4896 | lr 2.88e-04 | (3802.29 ms | 137887 tok/s) step 14704/76294 | 
train loss 3.379425 | norm 0.6763 | lr 2.88e-04 | (3849.44 ms | 136199 tok/s) step 14705/76294 | train loss 3.377202 | norm 0.5266 | lr 2.87e-04 | (3845.27 ms | 136346 tok/s) step 14706/76294 | train loss 3.394806 | norm 0.3781 | lr 2.87e-04 | (3824.37 ms | 137091 tok/s) step 14707/76294 | train loss 3.368193 | norm 0.4833 | lr 2.87e-04 | (3808.63 ms | 137658 tok/s) step 14708/76294 | train loss 3.427611 | norm 0.4045 | lr 2.87e-04 | (3801.76 ms | 137907 tok/s) step 14709/76294 | train loss 3.394348 | norm 0.3754 | lr 2.87e-04 | (3830.97 ms | 136855 tok/s) step 14710/76294 | train loss 3.435214 | norm 0.3649 | lr 2.87e-04 | (3803.20 ms | 137854 tok/s) step 14711/76294 | train loss 3.376951 | norm 0.2966 | lr 2.87e-04 | (3810.21 ms | 137601 tok/s) step 14712/76294 | train loss 3.394615 | norm 0.4372 | lr 2.87e-04 | (3829.23 ms | 136917 tok/s) step 14713/76294 | train loss 3.356525 | norm 0.4452 | lr 2.87e-04 | (3818.08 ms | 137317 tok/s) step 14714/76294 | train loss 3.432890 | norm 0.3510 | lr 2.87e-04 | (3810.72 ms | 137582 tok/s) step 14715/76294 | train loss 3.387007 | norm 0.4317 | lr 2.87e-04 | (3833.19 ms | 136776 tok/s) step 14716/76294 | train loss 3.354419 | norm 0.4497 | lr 2.87e-04 | (3802.59 ms | 137877 tok/s) step 14717/76294 | train loss 3.429071 | norm 0.5057 | lr 2.87e-04 | (3810.80 ms | 137579 tok/s) step 14718/76294 | train loss 3.369505 | norm 0.5213 | lr 2.87e-04 | (3820.79 ms | 137220 tok/s) step 14719/76294 | train loss 3.414907 | norm 0.3884 | lr 2.87e-04 | (3806.10 ms | 137749 tok/s) step 14720/76294 | train loss 3.344407 | norm 0.6023 | lr 2.87e-04 | (4085.99 ms | 128314 tok/s) step 14721/76294 | train loss 3.380964 | norm 0.4001 | lr 2.86e-04 | (3805.33 ms | 137777 tok/s) step 14722/76294 | train loss 3.425124 | norm 0.7686 | lr 2.86e-04 | (3806.23 ms | 137745 tok/s) step 14723/76294 | train loss 3.433965 | norm 0.5339 | lr 2.86e-04 | (3831.77 ms | 136827 tok/s) step 14724/76294 | train loss 3.335442 | norm 0.5053 | lr 2.86e-04 | (3807.55 
ms | 137697 tok/s) step 14725/76294 | train loss 3.401022 | norm 0.5345 | lr 2.86e-04 | (3802.95 ms | 137864 tok/s) step 14726/76294 | train loss 3.381698 | norm 0.3637 | lr 2.86e-04 | (3965.68 ms | 132206 tok/s) step 14727/76294 | train loss 3.332088 | norm 0.3850 | lr 2.86e-04 | (3810.07 ms | 137606 tok/s) step 14728/76294 | train loss 3.393637 | norm 0.3670 | lr 2.86e-04 | (4014.75 ms | 130590 tok/s) step 14729/76294 | train loss 3.405592 | norm 0.6213 | lr 2.86e-04 | (3836.55 ms | 136656 tok/s) step 14730/76294 | train loss 3.346357 | norm 0.3900 | lr 2.86e-04 | (3828.05 ms | 136959 tok/s) step 14731/76294 | train loss 3.442105 | norm 0.4972 | lr 2.86e-04 | (3797.65 ms | 138056 tok/s) step 14732/76294 | train loss 3.370700 | norm 0.5067 | lr 2.86e-04 | (3805.51 ms | 137771 tok/s) step 14733/76294 | train loss 3.375297 | norm 0.5426 | lr 2.86e-04 | (3821.25 ms | 137203 tok/s) step 14734/76294 | train loss 3.501813 | norm 1.4522 | lr 2.86e-04 | (3838.89 ms | 136573 tok/s) step 14735/76294 | train loss 3.367272 | norm 0.4980 | lr 2.86e-04 | (3799.61 ms | 137985 tok/s) step 14736/76294 | train loss 3.378613 | norm 0.6342 | lr 2.85e-04 | (3805.58 ms | 137768 tok/s) step 14737/76294 | train loss 3.351490 | norm 0.4858 | lr 2.85e-04 | (3828.78 ms | 136933 tok/s) step 14738/76294 | train loss 3.378375 | norm 0.5895 | lr 2.85e-04 | (3809.03 ms | 137643 tok/s) step 14739/76294 | train loss 3.475462 | norm 0.5720 | lr 2.85e-04 | (3825.98 ms | 137034 tok/s) step 14740/76294 | train loss 3.514764 | norm 0.3852 | lr 2.85e-04 | (3801.47 ms | 137917 tok/s) step 14741/76294 | train loss 3.433982 | norm 0.5051 | lr 2.85e-04 | (3810.17 ms | 137602 tok/s) step 14742/76294 | train loss 3.378954 | norm 0.5582 | lr 2.85e-04 | (3804.11 ms | 137821 tok/s) step 14743/76294 | train loss 3.387541 | norm 0.6463 | lr 2.85e-04 | (3805.68 ms | 137765 tok/s) step 14744/76294 | train loss 3.474125 | norm 0.8744 | lr 2.85e-04 | (3817.61 ms | 137334 tok/s) step 14745/76294 | train loss 3.326925 | 
norm 0.6883 | lr 2.85e-04 | (3830.71 ms | 136864 tok/s) step 14746/76294 | train loss 3.463114 | norm 0.6370 | lr 2.85e-04 | (3834.88 ms | 136716 tok/s) step 14747/76294 | train loss 3.393863 | norm 0.5208 | lr 2.85e-04 | (3829.77 ms | 136898 tok/s) step 14748/76294 | train loss 3.387980 | norm 2.1573 | lr 2.85e-04 | (3856.42 ms | 135952 tok/s) step 14749/76294 | train loss 3.415273 | norm 0.9487 | lr 2.85e-04 | (3803.55 ms | 137842 tok/s) step 14750/76294 | train loss 3.397647 | norm 0.4931 | lr 2.85e-04 | (3816.15 ms | 137387 tok/s) val loss: 3.387160 saving model checkpoint to ./results/gpt2-124M-gqa/step_14750.pth step 14751/76294 | train loss 3.388776 | norm 0.5721 | lr 2.84e-04 | (3823.37 ms | 137127 tok/s) step 14752/76294 | train loss 3.384273 | norm 0.5664 | lr 2.84e-04 | (3799.22 ms | 137999 tok/s) step 14753/76294 | train loss 3.399686 | norm 0.6724 | lr 2.84e-04 | (3834.37 ms | 136734 tok/s) step 14754/76294 | train loss 3.358558 | norm 0.4210 | lr 2.84e-04 | (3802.75 ms | 137871 tok/s) step 14755/76294 | train loss 3.385605 | norm 1.0409 | lr 2.84e-04 | (3803.98 ms | 137826 tok/s) step 14756/76294 | train loss 3.422424 | norm 0.6319 | lr 2.84e-04 | (3821.56 ms | 137192 tok/s) step 14757/76294 | train loss 3.419725 | norm 0.4535 | lr 2.84e-04 | (3805.31 ms | 137778 tok/s) step 14758/76294 | train loss 3.388247 | norm 0.3424 | lr 2.84e-04 | (3808.69 ms | 137656 tok/s) step 14759/76294 | train loss 3.317157 | norm 0.4317 | lr 2.84e-04 | (3806.14 ms | 137748 tok/s) step 14760/76294 | train loss 3.413174 | norm 0.4896 | lr 2.84e-04 | (3812.41 ms | 137522 tok/s) step 14761/76294 | train loss 3.438421 | norm 0.4508 | lr 2.84e-04 | (3806.11 ms | 137749 tok/s) step 14762/76294 | train loss 3.386416 | norm 0.4936 | lr 2.84e-04 | (3809.94 ms | 137611 tok/s) step 14763/76294 | train loss 3.408730 | norm 0.3567 | lr 2.84e-04 | (3807.91 ms | 137684 tok/s) step 14764/76294 | train loss 3.435608 | norm 0.4049 | lr 2.84e-04 | (3820.77 ms | 137221 tok/s) step 
14765/76294 | train loss 3.395497 | norm 0.4941 | lr 2.84e-04 | (3806.67 ms | 137729 tok/s) step 14766/76294 | train loss 3.367702 | norm 0.4856 | lr 2.84e-04 | (3814.70 ms | 137439 tok/s) step 14767/76294 | train loss 3.394444 | norm 0.4957 | lr 2.83e-04 | (3812.84 ms | 137506 tok/s) step 14768/76294 | train loss 3.460756 | norm 0.6604 | lr 2.83e-04 | (3806.22 ms | 137745 tok/s) step 14769/76294 | train loss 3.408142 | norm 0.3999 | lr 2.83e-04 | (3812.20 ms | 137529 tok/s) step 14770/76294 | train loss 3.422645 | norm 0.5640 | lr 2.83e-04 | (3880.24 ms | 135117 tok/s) step 14771/76294 | train loss 3.393169 | norm 0.3966 | lr 2.83e-04 | (3801.79 ms | 137906 tok/s) step 14772/76294 | train loss 3.417435 | norm 0.4624 | lr 2.83e-04 | (3809.71 ms | 137619 tok/s) step 14773/76294 | train loss 3.400258 | norm 0.3507 | lr 2.83e-04 | (3823.41 ms | 137126 tok/s) step 14774/76294 | train loss 3.363253 | norm 0.3411 | lr 2.83e-04 | (3814.24 ms | 137455 tok/s) step 14775/76294 | train loss 3.404353 | norm 0.4189 | lr 2.83e-04 | (3825.38 ms | 137055 tok/s) step 14776/76294 | train loss 3.401413 | norm 0.4063 | lr 2.83e-04 | (3844.59 ms | 136370 tok/s) step 14777/76294 | train loss 3.454152 | norm 0.3787 | lr 2.83e-04 | (3839.18 ms | 136563 tok/s) step 14778/76294 | train loss 3.366215 | norm 0.5664 | lr 2.83e-04 | (3820.42 ms | 137233 tok/s) step 14779/76294 | train loss 3.446679 | norm 0.5168 | lr 2.83e-04 | (3816.63 ms | 137369 tok/s) step 14780/76294 | train loss 3.367479 | norm 0.4787 | lr 2.83e-04 | (3812.85 ms | 137506 tok/s) step 14781/76294 | train loss 3.339244 | norm 0.3561 | lr 2.83e-04 | (3803.46 ms | 137845 tok/s) step 14782/76294 | train loss 3.519356 | norm 0.3996 | lr 2.82e-04 | (3879.03 ms | 135159 tok/s) step 14783/76294 | train loss 3.356756 | norm 0.3539 | lr 2.82e-04 | (3811.23 ms | 137564 tok/s) step 14784/76294 | train loss 3.441854 | norm 0.3212 | lr 2.82e-04 | (3909.94 ms | 134091 tok/s) step 14785/76294 | train loss 3.444207 | norm 0.3878 | lr 
2.82e-04 | (3904.70 ms | 134271 tok/s) step 14786/76294 | train loss 3.361830 | norm 0.3597 | lr 2.82e-04 | (3792.28 ms | 138251 tok/s) step 14787/76294 | train loss 3.437687 | norm 0.3592 | lr 2.82e-04 | (3823.37 ms | 137127 tok/s) step 14788/76294 | train loss 3.437972 | norm 0.4034 | lr 2.82e-04 | (3793.78 ms | 138197 tok/s) step 14789/76294 | train loss 3.373510 | norm 0.4214 | lr 2.82e-04 | (3935.54 ms | 133219 tok/s) step 14790/76294 | train loss 3.389576 | norm 0.3494 | lr 2.82e-04 | (3772.63 ms | 138971 tok/s) step 14791/76294 | train loss 3.447163 | norm 0.3790 | lr 2.82e-04 | (3838.53 ms | 136586 tok/s) step 14792/76294 | train loss 3.400769 | norm 0.3608 | lr 2.82e-04 | (3853.42 ms | 136058 tok/s) step 14793/76294 | train loss 3.397236 | norm 0.3156 | lr 2.82e-04 | (4331.77 ms | 121033 tok/s) step 14794/76294 | train loss 3.417792 | norm 0.3743 | lr 2.82e-04 | (3926.36 ms | 133530 tok/s) step 14795/76294 | train loss 3.398819 | norm 0.3264 | lr 2.82e-04 | (3778.87 ms | 138742 tok/s) step 14796/76294 | train loss 3.405146 | norm 0.3557 | lr 2.82e-04 | (3839.34 ms | 136557 tok/s) step 14797/76294 | train loss 3.575589 | norm 0.3884 | lr 2.82e-04 | (3811.54 ms | 137553 tok/s) step 14798/76294 | train loss 3.387386 | norm 0.2935 | lr 2.81e-04 | (3849.31 ms | 136203 tok/s) step 14799/76294 | train loss 3.383672 | norm 0.5014 | lr 2.81e-04 | (3808.99 ms | 137645 tok/s) step 14800/76294 | train loss 3.361081 | norm 0.6580 | lr 2.81e-04 | (3830.40 ms | 136875 tok/s) step 14801/76294 | train loss 3.407940 | norm 0.4443 | lr 2.81e-04 | (3897.29 ms | 134526 tok/s) step 14802/76294 | train loss 3.449499 | norm 0.3430 | lr 2.81e-04 | (3815.13 ms | 137423 tok/s) step 14803/76294 | train loss 3.336340 | norm 0.3066 | lr 2.81e-04 | (3898.59 ms | 134482 tok/s) step 14804/76294 | train loss 3.455235 | norm 0.3188 | lr 2.81e-04 | (3792.90 ms | 138229 tok/s) step 14805/76294 | train loss 3.397854 | norm 0.3449 | lr 2.81e-04 | (3825.00 ms | 137069 tok/s) step 14806/76294 | 
train loss 3.455148 | norm 0.2937 | lr 2.81e-04 | (3821.79 ms | 137184 tok/s) step 14807/76294 | train loss 3.450469 | norm 0.3391 | lr 2.81e-04 | (4739.83 ms | 110613 tok/s) step 14808/76294 | train loss 3.416020 | norm 0.3634 | lr 2.81e-04 | (3797.90 ms | 138047 tok/s) step 14809/76294 | train loss 3.359811 | norm 0.3340 | lr 2.81e-04 | (3806.02 ms | 137752 tok/s) step 14810/76294 | train loss 3.385215 | norm 0.3741 | lr 2.81e-04 | (3892.48 ms | 134693 tok/s) step 14811/76294 | train loss 3.436146 | norm 0.4686 | lr 2.81e-04 | (3799.91 ms | 137974 tok/s) step 14812/76294 | train loss 3.412530 | norm 0.3857 | lr 2.81e-04 | (3932.43 ms | 133324 tok/s) step 14813/76294 | train loss 3.354138 | norm 0.4358 | lr 2.81e-04 | (3798.79 ms | 138014 tok/s) step 14814/76294 | train loss 3.379275 | norm 0.3761 | lr 2.80e-04 | (3879.92 ms | 135128 tok/s) step 14815/76294 | train loss 3.457675 | norm 0.4187 | lr 2.80e-04 | (3806.62 ms | 137731 tok/s) step 14816/76294 | train loss 3.382428 | norm 0.4029 | lr 2.80e-04 | (3810.02 ms | 137608 tok/s) step 14817/76294 | train loss 3.388638 | norm 0.4666 | lr 2.80e-04 | (3831.87 ms | 136823 tok/s) step 14818/76294 | train loss 3.408480 | norm 0.3364 | lr 2.80e-04 | (3816.85 ms | 137361 tok/s) step 14819/76294 | train loss 3.365606 | norm 0.3785 | lr 2.80e-04 | (3811.78 ms | 137544 tok/s) step 14820/76294 | train loss 3.379196 | norm 0.5023 | lr 2.80e-04 | (3862.50 ms | 135738 tok/s) step 14821/76294 | train loss 3.359478 | norm 0.3947 | lr 2.80e-04 | (3813.52 ms | 137481 tok/s) step 14822/76294 | train loss 3.442343 | norm 0.5142 | lr 2.80e-04 | (3834.13 ms | 136742 tok/s) step 14823/76294 | train loss 3.378427 | norm 0.3775 | lr 2.80e-04 | (3841.04 ms | 136496 tok/s) step 14824/76294 | train loss 3.401825 | norm 0.3983 | lr 2.80e-04 | (3819.50 ms | 137266 tok/s) step 14825/76294 | train loss 3.396589 | norm 0.4170 | lr 2.80e-04 | (3813.59 ms | 137479 tok/s) step 14826/76294 | train loss 3.458758 | norm 0.3800 | lr 2.80e-04 | (3872.50 
ms | 135387 tok/s) step 14827/76294 | train loss 3.363892 | norm 0.5393 | lr 2.80e-04 | (3813.30 ms | 137489 tok/s) step 14828/76294 | train loss 3.365169 | norm 0.4406 | lr 2.80e-04 | (3930.64 ms | 133385 tok/s) step 14829/76294 | train loss 3.421656 | norm 0.4523 | lr 2.79e-04 | (3814.47 ms | 137447 tok/s) step 14830/76294 | train loss 3.435118 | norm 0.4260 | lr 2.79e-04 | (3872.18 ms | 135399 tok/s) step 14831/76294 | train loss 3.416974 | norm 0.4033 | lr 2.79e-04 | (3811.11 ms | 137568 tok/s) step 14832/76294 | train loss 3.450598 | norm 0.4320 | lr 2.79e-04 | (3819.16 ms | 137278 tok/s) step 14833/76294 | train loss 3.397392 | norm 0.3843 | lr 2.79e-04 | (3830.78 ms | 136862 tok/s) step 14834/76294 | train loss 3.335079 | norm 0.4407 | lr 2.79e-04 | (3815.18 ms | 137422 tok/s) step 14835/76294 | train loss 3.422320 | norm 0.3675 | lr 2.79e-04 | (3817.82 ms | 137327 tok/s) step 14836/76294 | train loss 3.363731 | norm 0.4293 | lr 2.79e-04 | (3811.43 ms | 137557 tok/s) step 14837/76294 | train loss 3.388451 | norm 0.4067 | lr 2.79e-04 | (3818.12 ms | 137316 tok/s) step 14838/76294 | train loss 3.394733 | norm 0.5182 | lr 2.79e-04 | (3840.52 ms | 136515 tok/s) step 14839/76294 | train loss 3.359535 | norm 0.9544 | lr 2.79e-04 | (3833.09 ms | 136779 tok/s) step 14840/76294 | train loss 3.446890 | norm 0.5421 | lr 2.79e-04 | (3841.29 ms | 136488 tok/s) step 14841/76294 | train loss 3.352929 | norm 0.4646 | lr 2.79e-04 | (3820.79 ms | 137220 tok/s) step 14842/76294 | train loss 3.352217 | norm 0.3957 | lr 2.79e-04 | (3810.50 ms | 137590 tok/s) step 14843/76294 | train loss 3.411671 | norm 0.4867 | lr 2.79e-04 | (3809.00 ms | 137644 tok/s) step 14844/76294 | train loss 3.410357 | norm 0.4887 | lr 2.79e-04 | (3839.03 ms | 136568 tok/s) step 14845/76294 | train loss 3.464830 | norm 0.4749 | lr 2.78e-04 | (3806.00 ms | 137753 tok/s) step 14846/76294 | train loss 3.401441 | norm 0.4191 | lr 2.78e-04 | (3841.26 ms | 136489 tok/s) step 14847/76294 | train loss 3.416135 | 
norm 0.3687 | lr 2.78e-04 | (3807.85 ms | 137686 tok/s) step 14848/76294 | train loss 3.442953 | norm 0.4048 | lr 2.78e-04 | (3961.61 ms | 132342 tok/s) step 14849/76294 | train loss 3.394493 | norm 0.4087 | lr 2.78e-04 | (3815.40 ms | 137414 tok/s) step 14850/76294 | train loss 3.451283 | norm 0.4111 | lr 2.78e-04 | (3883.31 ms | 135011 tok/s) step 14851/76294 | train loss 3.382610 | norm 0.5878 | lr 2.78e-04 | (3803.73 ms | 137835 tok/s) step 14852/76294 | train loss 3.438353 | norm 0.4527 | lr 2.78e-04 | (3815.39 ms | 137414 tok/s) step 14853/76294 | train loss 3.370460 | norm 0.4728 | lr 2.78e-04 | (3831.61 ms | 136832 tok/s) step 14854/76294 | train loss 3.371074 | norm 0.3761 | lr 2.78e-04 | (3807.12 ms | 137712 tok/s) step 14855/76294 | train loss 3.419423 | norm 0.4989 | lr 2.78e-04 | (3803.97 ms | 137826 tok/s) step 14856/76294 | train loss 3.421015 | norm 0.4312 | lr 2.78e-04 | (3835.62 ms | 136689 tok/s) step 14857/76294 | train loss 3.378715 | norm 0.4423 | lr 2.78e-04 | (3803.54 ms | 137842 tok/s) step 14858/76294 | train loss 3.410981 | norm 0.4179 | lr 2.78e-04 | (3857.33 ms | 135920 tok/s) step 14859/76294 | train loss 3.400531 | norm 0.3854 | lr 2.78e-04 | (3801.89 ms | 137902 tok/s) step 14860/76294 | train loss 3.343069 | norm 0.4904 | lr 2.78e-04 | (3808.00 ms | 137681 tok/s) step 14861/76294 | train loss 3.383885 | norm 0.4137 | lr 2.77e-04 | (3823.18 ms | 137134 tok/s) step 14862/76294 | train loss 3.507040 | norm 0.4918 | lr 2.77e-04 | (3811.22 ms | 137564 tok/s) step 14863/76294 | train loss 3.317152 | norm 0.3544 | lr 2.77e-04 | (3825.12 ms | 137064 tok/s) step 14864/76294 | train loss 3.410978 | norm 0.4228 | lr 2.77e-04 | (3825.27 ms | 137059 tok/s) step 14865/76294 | train loss 3.441708 | norm 0.4009 | lr 2.77e-04 | (3805.07 ms | 137787 tok/s) step 14866/76294 | train loss 3.388282 | norm 0.3798 | lr 2.77e-04 | (3829.26 ms | 136916 tok/s) step 14867/76294 | train loss 3.403468 | norm 0.4149 | lr 2.77e-04 | (3803.66 ms | 137838 tok/s) 
step 14868/76294 | train loss 3.386309 | norm 0.3436 | lr 2.77e-04 | (3812.19 ms | 137529 tok/s) step 14869/76294 | train loss 3.438237 | norm 0.3843 | lr 2.77e-04 | (3820.53 ms | 137229 tok/s) step 14870/76294 | train loss 3.351029 | norm 0.4481 | lr 2.77e-04 | (5510.02 ms | 95152 tok/s) step 14871/76294 | train loss 3.374319 | norm 0.3033 | lr 2.77e-04 | (3853.47 ms | 136056 tok/s) step 14872/76294 | train loss 3.399300 | norm 0.3068 | lr 2.77e-04 | (3796.05 ms | 138114 tok/s) step 14873/76294 | train loss 3.430830 | norm 0.3248 | lr 2.77e-04 | (3831.72 ms | 136828 tok/s) step 14874/76294 | train loss 3.387061 | norm 0.3579 | lr 2.77e-04 | (3805.07 ms | 137787 tok/s) step 14875/76294 | train loss 3.433646 | norm 0.4700 | lr 2.77e-04 | (3803.85 ms | 137831 tok/s) step 14876/76294 | train loss 3.409075 | norm 0.2986 | lr 2.76e-04 | (3832.28 ms | 136808 tok/s) step 14877/76294 | train loss 3.300360 | norm 0.3760 | lr 2.76e-04 | (3807.13 ms | 137712 tok/s) step 14878/76294 | train loss 3.460442 | norm 0.2790 | lr 2.76e-04 | (3801.72 ms | 137908 tok/s) step 14879/76294 | train loss 3.439655 | norm 0.3468 | lr 2.76e-04 | (3843.47 ms | 136410 tok/s) step 14880/76294 | train loss 3.336367 | norm 0.3012 | lr 2.76e-04 | (3811.81 ms | 137543 tok/s) step 14881/76294 | train loss 3.356006 | norm 0.3496 | lr 2.76e-04 | (3808.64 ms | 137658 tok/s) step 14882/76294 | train loss 3.394011 | norm 0.3332 | lr 2.76e-04 | (3822.55 ms | 137157 tok/s) step 14883/76294 | train loss 3.384805 | norm 0.3525 | lr 2.76e-04 | (3807.88 ms | 137685 tok/s) step 14884/76294 | train loss 3.393782 | norm 0.3996 | lr 2.76e-04 | (3810.97 ms | 137573 tok/s) step 14885/76294 | train loss 3.389916 | norm 0.4086 | lr 2.76e-04 | (3806.52 ms | 137734 tok/s) step 14886/76294 | train loss 3.409087 | norm 0.3269 | lr 2.76e-04 | (3803.53 ms | 137843 tok/s) step 14887/76294 | train loss 3.400638 | norm 0.5483 | lr 2.76e-04 | (3839.09 ms | 136566 tok/s) step 14888/76294 | train loss 3.347806 | norm 0.4072 | lr 
2.76e-04 | (3807.03 ms | 137716 tok/s) step 14889/76294 | train loss 3.409205 | norm 0.4341 | lr 2.76e-04 | (3810.55 ms | 137588 tok/s) step 14890/76294 | train loss 3.387680 | norm 0.4492 | lr 2.76e-04 | (3805.89 ms | 137757 tok/s) step 14891/76294 | train loss 3.401261 | norm 0.5004 | lr 2.76e-04 | (3807.64 ms | 137694 tok/s) step 14892/76294 | train loss 3.407382 | norm 0.4332 | lr 2.75e-04 | (3826.53 ms | 137014 tok/s) step 14893/76294 | train loss 3.397623 | norm 0.4567 | lr 2.75e-04 | (3910.75 ms | 134063 tok/s) step 14894/76294 | train loss 3.387013 | norm 0.3286 | lr 2.75e-04 | (3838.77 ms | 136577 tok/s) step 14895/76294 | train loss 3.435783 | norm 0.3691 | lr 2.75e-04 | (3804.52 ms | 137807 tok/s) step 14896/76294 | train loss 3.457645 | norm 0.3820 | lr 2.75e-04 | (3812.09 ms | 137533 tok/s) step 14897/76294 | train loss 3.401434 | norm 0.3628 | lr 2.75e-04 | (3829.03 ms | 136924 tok/s) step 14898/76294 | train loss 3.416442 | norm 0.3366 | lr 2.75e-04 | (3804.97 ms | 137790 tok/s) step 14899/76294 | train loss 3.398047 | norm 0.3236 | lr 2.75e-04 | (3813.19 ms | 137493 tok/s) step 14900/76294 | train loss 3.382998 | norm 0.3758 | lr 2.75e-04 | (3811.52 ms | 137553 tok/s) step 14901/76294 | train loss 3.383331 | norm 0.3393 | lr 2.75e-04 | (3817.80 ms | 137327 tok/s) step 14902/76294 | train loss 3.364593 | norm 0.3851 | lr 2.75e-04 | (3821.75 ms | 137185 tok/s) step 14903/76294 | train loss 3.359474 | norm 0.3241 | lr 2.75e-04 | (3809.30 ms | 137634 tok/s) step 14904/76294 | train loss 3.369205 | norm 0.2934 | lr 2.75e-04 | (3806.15 ms | 137747 tok/s) step 14905/76294 | train loss 3.400747 | norm 0.3233 | lr 2.75e-04 | (3809.86 ms | 137614 tok/s) step 14906/76294 | train loss 3.455904 | norm 0.3118 | lr 2.75e-04 | (3806.93 ms | 137719 tok/s) step 14907/76294 | train loss 3.371640 | norm 0.3044 | lr 2.75e-04 | (3806.13 ms | 137748 tok/s) step 14908/76294 | train loss 3.344623 | norm 0.2896 | lr 2.74e-04 | (3848.49 ms | 136232 tok/s) step 14909/76294 | 
train loss 3.404560 | norm 0.3385 | lr 2.74e-04 | (3800.36 ms | 137957 tok/s) step 14910/76294 | train loss 3.361792 | norm 0.3891 | lr 2.74e-04 | (3829.44 ms | 136910 tok/s) step 14911/76294 | train loss 3.421597 | norm 0.3209 | lr 2.74e-04 | (3834.75 ms | 136720 tok/s) step 14912/76294 | train loss 3.400216 | norm 0.2745 | lr 2.74e-04 | (3805.50 ms | 137771 tok/s) step 14913/76294 | train loss 3.361853 | norm 0.4847 | lr 2.74e-04 | (3811.81 ms | 137543 tok/s) step 14914/76294 | train loss 3.381608 | norm 0.3614 | lr 2.74e-04 | (3873.57 ms | 135350 tok/s) step 14915/76294 | train loss 3.424761 | norm 0.3608 | lr 2.74e-04 | (3803.28 ms | 137852 tok/s) step 14916/76294 | train loss 3.391823 | norm 0.3757 | lr 2.74e-04 | (3927.42 ms | 133494 tok/s) step 14917/76294 | train loss 3.359387 | norm 0.3420 | lr 2.74e-04 | (3798.99 ms | 138007 tok/s) step 14918/76294 | train loss 3.327388 | norm 0.4360 | lr 2.74e-04 | (3852.10 ms | 136105 tok/s) step 14919/76294 | train loss 3.441999 | norm 0.5840 | lr 2.74e-04 | (3801.88 ms | 137902 tok/s) step 14920/76294 | train loss 3.434680 | norm 0.3929 | lr 2.74e-04 | (3832.23 ms | 136810 tok/s) step 14921/76294 | train loss 3.397955 | norm 0.3139 | lr 2.74e-04 | (3829.41 ms | 136911 tok/s) step 14922/76294 | train loss 3.375330 | norm 0.3435 | lr 2.74e-04 | (3801.87 ms | 137903 tok/s) step 14923/76294 | train loss 3.336391 | norm 0.2888 | lr 2.74e-04 | (3809.49 ms | 137627 tok/s) step 14924/76294 | train loss 3.384643 | norm 0.3202 | lr 2.73e-04 | (3804.92 ms | 137792 tok/s) step 14925/76294 | train loss 3.428191 | norm 0.3349 | lr 2.73e-04 | (3800.76 ms | 137943 tok/s) step 14926/76294 | train loss 3.452852 | norm 0.3641 | lr 2.73e-04 | (3833.68 ms | 136758 tok/s) step 14927/76294 | train loss 3.399958 | norm 0.3350 | lr 2.73e-04 | (3806.20 ms | 137746 tok/s) step 14928/76294 | train loss 3.366697 | norm 0.3526 | lr 2.73e-04 | (4083.60 ms | 128389 tok/s) step 14929/76294 | train loss 3.386782 | norm 0.3610 | lr 2.73e-04 | (3825.43 
ms | 137053 tok/s) step 14930/76294 | train loss 3.427147 | norm 0.3038 | lr 2.73e-04 | (3805.44 ms | 137773 tok/s) step 14931/76294 | train loss 3.401382 | norm 0.4410 | lr 2.73e-04 | (3811.42 ms | 137557 tok/s) step 14932/76294 | train loss 3.358551 | norm 0.3888 | lr 2.73e-04 | (3803.90 ms | 137829 tok/s) step 14933/76294 | train loss 3.433624 | norm 0.3125 | lr 2.73e-04 | (3809.77 ms | 137617 tok/s) step 14934/76294 | train loss 3.510384 | norm 0.3703 | lr 2.73e-04 | (3812.28 ms | 137526 tok/s) step 14935/76294 | train loss 3.355032 | norm 0.3876 | lr 2.73e-04 | (3809.14 ms | 137639 tok/s) step 14936/76294 | train loss 3.569960 | norm 0.3663 | lr 2.73e-04 | (3801.02 ms | 137933 tok/s) step 14937/76294 | train loss 3.365026 | norm 0.3599 | lr 2.73e-04 | (3806.58 ms | 137732 tok/s) step 14938/76294 | train loss 3.359350 | norm 0.3803 | lr 2.73e-04 | (3891.84 ms | 134715 tok/s) step 14939/76294 | train loss 3.348377 | norm 0.3064 | lr 2.73e-04 | (3800.80 ms | 137941 tok/s) step 14940/76294 | train loss 3.460232 | norm 0.3535 | lr 2.72e-04 | (3803.10 ms | 137858 tok/s) step 14941/76294 | train loss 3.438574 | norm 0.3546 | lr 2.72e-04 | (3821.53 ms | 137193 tok/s) step 14942/76294 | train loss 3.352474 | norm 0.3481 | lr 2.72e-04 | (3829.12 ms | 136921 tok/s) step 14943/76294 | train loss 3.482643 | norm 0.4638 | lr 2.72e-04 | (3809.89 ms | 137612 tok/s) step 14944/76294 | train loss 3.391753 | norm 0.3740 | lr 2.72e-04 | (3803.08 ms | 137859 tok/s) step 14945/76294 | train loss 3.392941 | norm 0.4283 | lr 2.72e-04 | (3801.56 ms | 137914 tok/s) step 14946/76294 | train loss 3.407262 | norm 0.4034 | lr 2.72e-04 | (3835.95 ms | 136677 tok/s) step 14947/76294 | train loss 3.444016 | norm 0.3133 | lr 2.72e-04 | (3800.87 ms | 137939 tok/s) step 14948/76294 | train loss 3.367448 | norm 0.3221 | lr 2.72e-04 | (3835.92 ms | 136679 tok/s) step 14949/76294 | train loss 3.415031 | norm 0.2998 | lr 2.72e-04 | (3803.92 ms | 137828 tok/s) step 14950/76294 | train loss 3.389252 | 
norm 0.3227 | lr 2.72e-04 | (3832.77 ms | 136791 tok/s) step 14951/76294 | train loss 3.388945 | norm 0.3270 | lr 2.72e-04 | (3800.10 ms | 137967 tok/s) step 14952/76294 | train loss 3.394402 | norm 0.3930 | lr 2.72e-04 | (3804.97 ms | 137790 tok/s) step 14953/76294 | train loss 3.445064 | norm 0.3215 | lr 2.72e-04 | (3825.60 ms | 137047 tok/s) step 14954/76294 | train loss 3.350522 | norm 0.3739 | lr 2.72e-04 | (3808.03 ms | 137679 tok/s) step 14955/76294 | train loss 3.404878 | norm 0.3255 | lr 2.72e-04 | (3802.22 ms | 137890 tok/s) step 14956/76294 | train loss 3.514727 | norm 0.3153 | lr 2.71e-04 | (3832.45 ms | 136802 tok/s) step 14957/76294 | train loss 3.407373 | norm 0.3287 | lr 2.71e-04 | (3798.42 ms | 138028 tok/s) step 14958/76294 | train loss 3.409677 | norm 0.3626 | lr 2.71e-04 | (3807.75 ms | 137690 tok/s) step 14959/76294 | train loss 3.373670 | norm 0.3530 | lr 2.71e-04 | (3824.98 ms | 137070 tok/s) step 14960/76294 | train loss 3.356437 | norm 0.3920 | lr 2.71e-04 | (3905.36 ms | 134248 tok/s) step 14961/76294 | train loss 3.400533 | norm 0.2917 | lr 2.71e-04 | (3800.48 ms | 137953 tok/s) step 14962/76294 | train loss 3.384962 | norm 0.3460 | lr 2.71e-04 | (3814.44 ms | 137448 tok/s) step 14963/76294 | train loss 3.414611 | norm 0.3389 | lr 2.71e-04 | (3802.09 ms | 137895 tok/s) step 14964/76294 | train loss 3.376657 | norm 0.3365 | lr 2.71e-04 | (3810.41 ms | 137594 tok/s) step 14965/76294 | train loss 3.421490 | norm 0.3230 | lr 2.71e-04 | (3820.77 ms | 137220 tok/s) step 14966/76294 | train loss 3.369741 | norm 0.3775 | lr 2.71e-04 | (3804.44 ms | 137809 tok/s) step 14967/76294 | train loss 3.457896 | norm 0.4816 | lr 2.71e-04 | (3813.83 ms | 137470 tok/s) step 14968/76294 | train loss 3.394919 | norm 0.4010 | lr 2.71e-04 | (3807.96 ms | 137682 tok/s) step 14969/76294 | train loss 3.330148 | norm 0.4566 | lr 2.71e-04 | (3824.78 ms | 137076 tok/s) step 14970/76294 | train loss 3.417516 | norm 0.3582 | lr 2.71e-04 | (3807.11 ms | 137713 tok/s) 
step 14971/76294 | train loss 3.458207 | norm 0.3867 | lr 2.71e-04 | (3800.96 ms | 137936 tok/s) step 14972/76294 | train loss 3.429818 | norm 0.5512 | lr 2.70e-04 | (3833.62 ms | 136761 tok/s) step 14973/76294 | train loss 3.365110 | norm 0.4637 | lr 2.70e-04 | (3800.35 ms | 137958 tok/s) step 14974/76294 | train loss 3.363695 | norm 0.4287 | lr 2.70e-04 | (3835.46 ms | 136695 tok/s) step 14975/76294 | train loss 3.426031 | norm 0.6040 | lr 2.70e-04 | (3839.46 ms | 136553 tok/s) step 14976/76294 | train loss 3.322620 | norm 0.6150 | lr 2.70e-04 | (3803.56 ms | 137841 tok/s) step 14977/76294 | train loss 3.475857 | norm 0.4968 | lr 2.70e-04 | (3808.15 ms | 137675 tok/s) step 14978/76294 | train loss 3.431180 | norm 0.3983 | lr 2.70e-04 | (3809.35 ms | 137632 tok/s) step 14979/76294 | train loss 3.421207 | norm 0.4468 | lr 2.70e-04 | (3818.04 ms | 137319 tok/s) step 14980/76294 | train loss 3.400670 | norm 0.3898 | lr 2.70e-04 | (3806.21 ms | 137745 tok/s) step 14981/76294 | train loss 3.465852 | norm 0.3850 | lr 2.70e-04 | (3811.28 ms | 137562 tok/s) step 14982/76294 | train loss 3.489095 | norm 0.4068 | lr 2.70e-04 | (3805.09 ms | 137786 tok/s) step 14983/76294 | train loss 3.355036 | norm 0.3192 | lr 2.70e-04 | (4129.23 ms | 126970 tok/s) step 14984/76294 | train loss 3.383799 | norm 0.4407 | lr 2.70e-04 | (3806.23 ms | 137745 tok/s) step 14985/76294 | train loss 3.417354 | norm 0.4122 | lr 2.70e-04 | (3806.15 ms | 137747 tok/s) step 14986/76294 | train loss 3.414169 | norm 0.4005 | lr 2.70e-04 | (3833.97 ms | 136748 tok/s) step 14987/76294 | train loss 3.440233 | norm 0.3593 | lr 2.70e-04 | (3830.10 ms | 136886 tok/s) step 14988/76294 | train loss 3.404829 | norm 0.3875 | lr 2.69e-04 | (3822.73 ms | 137150 tok/s) step 14989/76294 | train loss 3.430568 | norm 0.4003 | lr 2.69e-04 | (3808.70 ms | 137655 tok/s) step 14990/76294 | train loss 3.415373 | norm 0.4350 | lr 2.69e-04 | (3807.13 ms | 137712 tok/s) step 14991/76294 | train loss 3.367818 | norm 0.4451 | lr 
2.69e-04 | (3829.77 ms | 136898 tok/s) step 14992/76294 | train loss 3.455018 | norm 0.3833 | lr 2.69e-04 | (3805.09 ms | 137786 tok/s) step 14993/76294 | train loss 3.353178 | norm 0.2896 | lr 2.69e-04 | (3893.83 ms | 134646 tok/s) step 14994/76294 | train loss 3.377841 | norm 0.4391 | lr 2.69e-04 | (3831.74 ms | 136828 tok/s) step 14995/76294 | train loss 3.366951 | norm 0.3439 | lr 2.69e-04 | (3966.44 ms | 132181 tok/s) step 14996/76294 | train loss 3.440696 | norm 0.4400 | lr 2.69e-04 | (3793.52 ms | 138206 tok/s) step 14997/76294 | train loss 3.483147 | norm 0.3637 | lr 2.69e-04 | (3862.52 ms | 135737 tok/s) step 14998/76294 | train loss 3.352617 | norm 0.4574 | lr 2.69e-04 | (3799.86 ms | 137975 tok/s) step 14999/76294 | train loss 3.368679 | norm 0.5312 | lr 2.69e-04 | (4398.28 ms | 119203 tok/s) step 15000/76294 | train loss 3.410277 | norm 0.4177 | lr 2.69e-04 | (3853.99 ms | 136038 tok/s) val loss: 3.373585 saving model checkpoint to ./results/gpt2-124M-gqa/step_15000.pth step 15001/76294 | train loss 3.381625 | norm 0.3648 | lr 2.69e-04 | (3844.16 ms | 136386 tok/s) step 15002/76294 | train loss 3.430458 | norm 0.3175 | lr 2.69e-04 | (3762.84 ms | 139333 tok/s) step 15003/76294 | train loss 3.365123 | norm 0.4245 | lr 2.69e-04 | (3827.57 ms | 136977 tok/s) step 15004/76294 | train loss 3.339278 | norm 0.3149 | lr 2.68e-04 | (3775.90 ms | 138851 tok/s) step 15005/76294 | train loss 3.379297 | norm 0.3096 | lr 2.68e-04 | (3834.37 ms | 136734 tok/s) step 15006/76294 | train loss 3.354301 | norm 0.2771 | lr 2.68e-04 | (3780.89 ms | 138668 tok/s) step 15007/76294 | train loss 3.404643 | norm 0.3890 | lr 2.68e-04 | (3785.43 ms | 138502 tok/s) step 15008/76294 | train loss 3.367017 | norm 0.2994 | lr 2.68e-04 | (3804.97 ms | 137790 tok/s) step 15009/76294 | train loss 3.415246 | norm 0.4456 | lr 2.68e-04 | (3816.45 ms | 137376 tok/s) step 15010/76294 | train loss 3.435070 | norm 0.4008 | lr 2.68e-04 | (3792.90 ms | 138229 tok/s) step 15011/76294 | train loss 
3.377784 | norm 0.4574 | lr 2.68e-04 | (3796.00 ms | 138116 tok/s) step 15012/76294 | train loss 3.417883 | norm 0.5042 | lr 2.68e-04 | (3813.47 ms | 137483 tok/s) step 15013/76294 | train loss 3.400148 | norm 0.3475 | lr 2.68e-04 | (3794.08 ms | 138186 tok/s) step 15014/76294 | train loss 3.413185 | norm 0.3660 | lr 2.68e-04 | (3818.03 ms | 137319 tok/s) step 15015/76294 | train loss 3.429020 | norm 0.3329 | lr 2.68e-04 | (3805.27 ms | 137780 tok/s) step 15016/76294 | train loss 3.448217 | norm 0.4531 | lr 2.68e-04 | (3797.46 ms | 138063 tok/s) step 15017/76294 | train loss 3.356982 | norm 0.3732 | lr 2.68e-04 | (3849.87 ms | 136183 tok/s) step 15018/76294 | train loss 3.368696 | norm 0.3666 | lr 2.68e-04 | (3803.53 ms | 137842 tok/s) step 15019/76294 | train loss 3.373929 | norm 0.3416 | lr 2.68e-04 | (3803.09 ms | 137858 tok/s) step 15020/76294 | train loss 3.392606 | norm 0.3670 | lr 2.67e-04 | (3825.25 ms | 137060 tok/s) step 15021/76294 | train loss 3.493917 | norm 0.3852 | lr 2.67e-04 | (3810.58 ms | 137588 tok/s) step 15022/76294 | train loss 3.584817 | norm 0.4322 | lr 2.67e-04 | (3945.61 ms | 132879 tok/s) step 15023/76294 | train loss 3.411735 | norm 0.4139 | lr 2.67e-04 | (3800.79 ms | 137942 tok/s) step 15024/76294 | train loss 3.385006 | norm 0.3852 | lr 2.67e-04 | (3808.15 ms | 137675 tok/s) step 15025/76294 | train loss 3.410810 | norm 0.4105 | lr 2.67e-04 | (3824.62 ms | 137082 tok/s) step 15026/76294 | train loss 3.352455 | norm 0.3938 | lr 2.67e-04 | (3812.12 ms | 137532 tok/s) step 15027/76294 | train loss 3.424209 | norm 0.3489 | lr 2.67e-04 | (3809.36 ms | 137632 tok/s) step 15028/76294 | train loss 3.302012 | norm 0.3545 | lr 2.67e-04 | (3811.54 ms | 137553 tok/s) step 15029/76294 | train loss 3.381531 | norm 0.3519 | lr 2.67e-04 | (3819.72 ms | 137258 tok/s) step 15030/76294 | train loss 3.436677 | norm 0.8116 | lr 2.67e-04 | (3806.31 ms | 137742 tok/s) step 15031/76294 | train loss 3.415090 | norm 0.3862 | lr 2.67e-04 | (3812.47 ms | 137519 
tok/s) step 15032/76294 | train loss 3.347449 | norm 0.4193 | lr 2.67e-04 | (3803.84 ms | 137831 tok/s) step 15033/76294 | train loss 3.379831 | norm 0.4013 | lr 2.67e-04 | (3808.94 ms | 137647 tok/s) step 15034/76294 | train loss 3.389383 | norm 0.3249 | lr 2.67e-04 | (3829.58 ms | 136905 tok/s) step 15035/76294 | train loss 3.403841 | norm 0.5446 | lr 2.67e-04 | (3809.55 ms | 137625 tok/s) step 15036/76294 | train loss 3.430427 | norm 0.2946 | lr 2.66e-04 | (3814.47 ms | 137447 tok/s) step 15037/76294 | train loss 3.415040 | norm 0.3563 | lr 2.66e-04 | (3849.81 ms | 136185 tok/s) step 15038/76294 | train loss 3.359697 | norm 0.2975 | lr 2.66e-04 | (3814.88 ms | 137432 tok/s) step 15039/76294 | train loss 3.379919 | norm 0.3643 | lr 2.66e-04 | (3878.83 ms | 135167 tok/s) step 15040/76294 | train loss 3.347451 | norm 0.4102 | lr 2.66e-04 | (3812.67 ms | 137512 tok/s) step 15041/76294 | train loss 3.330370 | norm 0.4165 | lr 2.66e-04 | (3817.84 ms | 137326 tok/s) step 15042/76294 | train loss 3.473474 | norm 0.3548 | lr 2.66e-04 | (3837.45 ms | 136624 tok/s) step 15043/76294 | train loss 3.409594 | norm 0.3059 | lr 2.66e-04 | (3817.86 ms | 137325 tok/s) step 15044/76294 | train loss 3.406593 | norm 0.3283 | lr 2.66e-04 | (3985.57 ms | 131547 tok/s) step 15045/76294 | train loss 3.531740 | norm 0.3903 | lr 2.66e-04 | (3823.38 ms | 137127 tok/s) step 15046/76294 | train loss 3.391897 | norm 0.3824 | lr 2.66e-04 | (3846.49 ms | 136303 tok/s) step 15047/76294 | train loss 3.417520 | norm 0.3584 | lr 2.66e-04 | (3805.50 ms | 137771 tok/s) step 15048/76294 | train loss 3.456067 | norm 0.4009 | lr 2.66e-04 | (3811.02 ms | 137572 tok/s) step 15049/76294 | train loss 3.433222 | norm 0.3736 | lr 2.66e-04 | (3827.37 ms | 136984 tok/s) step 15050/76294 | train loss 3.498134 | norm 0.4578 | lr 2.66e-04 | (3809.86 ms | 137614 tok/s) step 15051/76294 | train loss 3.365755 | norm 0.4499 | lr 2.66e-04 | (3809.96 ms | 137610 tok/s) step 15052/76294 | train loss 3.366694 | norm 0.3664 
| lr 2.66e-04 | (3807.98 ms | 137681 tok/s) step 15053/76294 | train loss 3.396183 | norm 0.4064 | lr 2.65e-04 | (3800.42 ms | 137955 tok/s) step 15054/76294 | train loss 3.378425 | norm 0.4751 | lr 2.65e-04 | (3831.15 ms | 136849 tok/s) step 15055/76294 | train loss 3.390325 | norm 0.3555 | lr 2.65e-04 | (3803.21 ms | 137854 tok/s) step 15056/76294 | train loss 3.463986 | norm 0.4839 | lr 2.65e-04 | (3807.54 ms | 137697 tok/s) step 15057/76294 | train loss 3.433865 | norm 0.6184 | lr 2.65e-04 | (3825.19 ms | 137062 tok/s) step 15058/76294 | train loss 3.405221 | norm 0.4016 | lr 2.65e-04 | (3806.43 ms | 137738 tok/s) step 15059/76294 | train loss 3.367722 | norm 0.3215 | lr 2.65e-04 | (3809.26 ms | 137635 tok/s) step 15060/76294 | train loss 3.421319 | norm 0.4270 | lr 2.65e-04 | (3807.58 ms | 137696 tok/s) step 15061/76294 | train loss 3.407607 | norm 0.3148 | lr 2.65e-04 | (3859.18 ms | 135855 tok/s) step 15062/76294 | train loss 3.338344 | norm 0.3385 | lr 2.65e-04 | (3807.39 ms | 137703 tok/s) step 15063/76294 | train loss 3.445879 | norm 0.2926 | lr 2.65e-04 | (3809.74 ms | 137618 tok/s) step 15064/76294 | train loss 3.383145 | norm 0.3959 | lr 2.65e-04 | (3810.29 ms | 137598 tok/s) step 15065/76294 | train loss 3.407656 | norm 0.3246 | lr 2.65e-04 | (3803.70 ms | 137836 tok/s) step 15066/76294 | train loss 3.403377 | norm 0.4303 | lr 2.65e-04 | (3901.01 ms | 134398 tok/s) step 15067/76294 | train loss 3.366580 | norm 0.3386 | lr 2.65e-04 | (3805.34 ms | 137777 tok/s) step 15068/76294 | train loss 3.428656 | norm 0.3316 | lr 2.65e-04 | (3803.97 ms | 137826 tok/s) step 15069/76294 | train loss 3.435068 | norm 0.4367 | lr 2.64e-04 | (3823.60 ms | 137119 tok/s) step 15070/76294 | train loss 3.446598 | norm 0.2843 | lr 2.64e-04 | (3806.66 ms | 137729 tok/s) step 15071/76294 | train loss 3.442350 | norm 0.3025 | lr 2.64e-04 | (3842.16 ms | 136457 tok/s) step 15072/76294 | train loss 3.350121 | norm 0.3076 | lr 2.64e-04 | (3807.81 ms | 137687 tok/s) step 
15073/76294 | train loss 3.392022 | norm 0.2703 | lr 2.64e-04 | (3808.36 ms | 137668 tok/s) step 15074/76294 | train loss 3.404943 | norm 0.3509 | lr 2.64e-04 | (3807.81 ms | 137687 tok/s) step 15075/76294 | train loss 3.346711 | norm 0.3901 | lr 2.64e-04 | (3814.60 ms | 137442 tok/s) step 15076/76294 | train loss 3.385055 | norm 0.6027 | lr 2.64e-04 | (3809.71 ms | 137619 tok/s) step 15077/76294 | train loss 3.413229 | norm 0.3704 | lr 2.64e-04 | (3813.90 ms | 137468 tok/s) step 15078/76294 | train loss 3.477329 | norm 0.3907 | lr 2.64e-04 | (3806.63 ms | 137730 tok/s) step 15079/76294 | train loss 3.403646 | norm 0.5750 | lr 2.64e-04 | (3810.50 ms | 137590 tok/s) step 15080/76294 | train loss 3.408132 | norm 0.3424 | lr 2.64e-04 | (3809.09 ms | 137641 tok/s) step 15081/76294 | train loss 3.440843 | norm 0.3900 | lr 2.64e-04 | (3804.09 ms | 137822 tok/s) step 15082/76294 | train loss 3.370165 | norm 0.3304 | lr 2.64e-04 | (3853.47 ms | 136056 tok/s) step 15083/76294 | train loss 3.367268 | norm 0.3344 | lr 2.64e-04 | (3806.75 ms | 137726 tok/s) step 15084/76294 | train loss 3.444970 | norm 0.3584 | lr 2.64e-04 | (3813.24 ms | 137492 tok/s) step 15085/76294 | train loss 3.377170 | norm 0.3856 | lr 2.63e-04 | (3825.06 ms | 137067 tok/s) step 15086/76294 | train loss 3.419273 | norm 0.4170 | lr 2.63e-04 | (3805.67 ms | 137765 tok/s) step 15087/76294 | train loss 3.499058 | norm 0.5942 | lr 2.63e-04 | (3809.37 ms | 137631 tok/s) step 15088/76294 | train loss 3.426208 | norm 0.5313 | lr 2.63e-04 | (3804.94 ms | 137792 tok/s) step 15089/76294 | train loss 3.439155 | norm 0.6311 | lr 2.63e-04 | (3896.16 ms | 134565 tok/s) step 15090/76294 | train loss 3.384694 | norm 1.6072 | lr 2.63e-04 | (3802.15 ms | 137893 tok/s) step 15091/76294 | train loss 3.424341 | norm 0.5486 | lr 2.63e-04 | (3817.86 ms | 137325 tok/s) step 15092/76294 | train loss 3.452699 | norm 0.5132 | lr 2.63e-04 | (3804.61 ms | 137803 tok/s) step 15093/76294 | train loss 3.418762 | norm 0.4212 | lr 
2.63e-04 | (3816.21 ms | 137384 tok/s) step 15094/76294 | train loss 3.379636 | norm 0.5876 | lr 2.63e-04 | (3830.03 ms | 136889 tok/s) step 15095/76294 | train loss 3.567760 | norm 0.3489 | lr 2.63e-04 | (3812.46 ms | 137520 tok/s) step 15096/76294 | train loss 3.405509 | norm 0.5026 | lr 2.63e-04 | (3804.63 ms | 137803 tok/s) step 15097/76294 | train loss 3.371893 | norm 0.3789 | lr 2.63e-04 | (3804.51 ms | 137807 tok/s) step 15098/76294 | train loss 3.444906 | norm 0.4768 | lr 2.63e-04 | (3807.14 ms | 137712 tok/s) step 15099/76294 | train loss 3.581896 | norm 0.6621 | lr 2.63e-04 | (3807.32 ms | 137705 tok/s) step 15100/76294 | train loss 3.418289 | norm 0.4397 | lr 2.63e-04 | (3816.36 ms | 137379 tok/s) step 15101/76294 | train loss 3.352242 | norm 0.3781 | lr 2.63e-04 | (3806.81 ms | 137724 tok/s) step 15102/76294 | train loss 3.412535 | norm 0.4569 | lr 2.62e-04 | (3811.28 ms | 137562 tok/s) step 15103/76294 | train loss 3.433403 | norm 0.5240 | lr 2.62e-04 | (3805.08 ms | 137786 tok/s) step 15104/76294 | train loss 3.362824 | norm 0.4464 | lr 2.62e-04 | (3811.66 ms | 137549 tok/s) step 15105/76294 | train loss 3.363778 | norm 0.3261 | lr 2.62e-04 | (3803.04 ms | 137860 tok/s) step 15106/76294 | train loss 3.436304 | norm 0.3131 | lr 2.62e-04 | (3806.06 ms | 137751 tok/s) step 15107/76294 | train loss 3.391790 | norm 0.3480 | lr 2.62e-04 | (3802.15 ms | 137892 tok/s) step 15108/76294 | train loss 3.406010 | norm 0.2894 | lr 2.62e-04 | (3831.05 ms | 136852 tok/s) step 15109/76294 | train loss 3.441314 | norm 0.3569 | lr 2.62e-04 | (3806.49 ms | 137735 tok/s) step 15110/76294 | train loss 3.393290 | norm 0.2992 | lr 2.62e-04 | (3808.46 ms | 137664 tok/s) step 15111/76294 | train loss 3.470609 | norm 0.4906 | lr 2.62e-04 | (3813.40 ms | 137486 tok/s) step 15112/76294 | train loss 3.446025 | norm 0.3126 | lr 2.62e-04 | (3803.43 ms | 137846 tok/s) step 15113/76294 | train loss 3.424311 | norm 0.3896 | lr 2.62e-04 | (3862.08 ms | 135753 tok/s) step 15114/76294 | 
train loss 3.465050 | norm 0.3471 | lr 2.62e-04 | (3801.16 ms | 137928 tok/s) step 15115/76294 | train loss 3.376029 | norm 0.3443 | lr 2.62e-04 | (3806.90 ms | 137720 tok/s) step 15116/76294 | train loss 3.318342 | norm 0.3291 | lr 2.62e-04 | (3825.11 ms | 137065 tok/s) step 15117/76294 | train loss 3.445197 | norm 0.2814 | lr 2.62e-04 | (3810.03 ms | 137607 tok/s) step 15118/76294 | train loss 3.414087 | norm 0.3622 | lr 2.61e-04 | (3798.87 ms | 138011 tok/s) step 15119/76294 | train loss 3.594217 | norm 0.3647 | lr 2.61e-04 | (3833.28 ms | 136773 tok/s) step 15120/76294 | train loss 3.376763 | norm 0.2871 | lr 2.61e-04 | (3802.78 ms | 137870 tok/s) step 15121/76294 | train loss 3.303196 | norm 0.3287 | lr 2.61e-04 | (3830.73 ms | 136864 tok/s) step 15122/76294 | train loss 3.280255 | norm 0.4055 | lr 2.61e-04 | (3805.27 ms | 137779 tok/s) step 15123/76294 | train loss 3.341872 | norm 0.2890 | lr 2.61e-04 | (3809.66 ms | 137621 tok/s) step 15124/76294 | train loss 3.381159 | norm 0.3716 | lr 2.61e-04 | (3822.86 ms | 137146 tok/s) step 15125/76294 | train loss 3.418760 | norm 0.3965 | lr 2.61e-04 | (3806.41 ms | 137738 tok/s) step 15126/76294 | train loss 3.398002 | norm 0.3774 | lr 2.61e-04 | (3802.68 ms | 137873 tok/s) step 15127/76294 | train loss 3.432636 | norm 0.3228 | lr 2.61e-04 | (3831.61 ms | 136832 tok/s) step 15128/76294 | train loss 3.409692 | norm 0.2788 | lr 2.61e-04 | (3803.69 ms | 137837 tok/s) step 15129/76294 | train loss 3.461156 | norm 0.3682 | lr 2.61e-04 | (3819.06 ms | 137282 tok/s) step 15130/76294 | train loss 3.373395 | norm 0.3351 | lr 2.61e-04 | (3801.75 ms | 137907 tok/s) step 15131/76294 | train loss 3.413809 | norm 0.3119 | lr 2.61e-04 | (3805.39 ms | 137775 tok/s) step 15132/76294 | train loss 3.396168 | norm 0.5069 | lr 2.61e-04 | (3823.15 ms | 137135 tok/s) step 15133/76294 | train loss 3.377552 | norm 0.3540 | lr 2.61e-04 | (3805.50 ms | 137771 tok/s) step 15134/76294 | train loss 3.361568 | norm 0.3100 | lr 2.61e-04 | (3979.73 
ms | 131740 tok/s) step 15135/76294 | train loss 3.394453 | norm 0.8985 | lr 2.60e-04 | (3815.04 ms | 137427 tok/s) step 15136/76294 | train loss 3.364329 | norm 0.3855 | lr 2.60e-04 | (3829.34 ms | 136913 tok/s) step 15137/76294 | train loss 3.427052 | norm 0.4397 | lr 2.60e-04 | (3799.07 ms | 138004 tok/s) step 15138/76294 | train loss 3.433123 | norm 0.3609 | lr 2.60e-04 | (3810.28 ms | 137598 tok/s) step 15139/76294 | train loss 3.350978 | norm 0.3270 | lr 2.60e-04 | (3799.20 ms | 138000 tok/s) step 15140/76294 | train loss 3.426964 | norm 0.3921 | lr 2.60e-04 | (3818.69 ms | 137295 tok/s) step 15141/76294 | train loss 3.384820 | norm 0.4477 | lr 2.60e-04 | (3802.75 ms | 137871 tok/s) step 15142/76294 | train loss 3.396499 | norm 0.5018 | lr 2.60e-04 | (3830.20 ms | 136883 tok/s) step 15143/76294 | train loss 3.359454 | norm 0.4650 | lr 2.60e-04 | (3800.04 ms | 137969 tok/s) step 15144/76294 | train loss 3.464769 | norm 0.3905 | lr 2.60e-04 | (3803.70 ms | 137836 tok/s) step 15145/76294 | train loss 3.396128 | norm 0.4533 | lr 2.60e-04 | (3821.85 ms | 137182 tok/s) step 15146/76294 | train loss 3.371789 | norm 0.3908 | lr 2.60e-04 | (3803.59 ms | 137840 tok/s) step 15147/76294 | train loss 3.417994 | norm 0.5041 | lr 2.60e-04 | (3805.13 ms | 137784 tok/s) step 15148/76294 | train loss 3.401025 | norm 0.4389 | lr 2.60e-04 | (3836.69 ms | 136651 tok/s) step 15149/76294 | train loss 3.355267 | norm 0.5456 | lr 2.60e-04 | (3800.64 ms | 137947 tok/s) step 15150/76294 | train loss 3.458880 | norm 0.4470 | lr 2.60e-04 | (3811.30 ms | 137561 tok/s) step 15151/76294 | train loss 3.380562 | norm 0.3464 | lr 2.59e-04 | (3823.82 ms | 137111 tok/s) step 15152/76294 | train loss 3.393344 | norm 0.3278 | lr 2.59e-04 | (3808.73 ms | 137654 tok/s) step 15153/76294 | train loss 3.381749 | norm 0.4346 | lr 2.59e-04 | (3799.16 ms | 138001 tok/s) step 15154/76294 | train loss 3.403256 | norm 0.3336 | lr 2.59e-04 | (3832.63 ms | 136796 tok/s) step 15155/76294 | train loss 3.400266 | 
norm 0.3582 | lr 2.59e-04 | (3803.36 ms | 137849 tok/s) step 15156/76294 | train loss 3.405831 | norm 0.3167 | lr 2.59e-04 | (3822.26 ms | 137167 tok/s) step 15157/76294 | train loss 3.405128 | norm 0.3144 | lr 2.59e-04 | (3921.92 ms | 133681 tok/s) step 15158/76294 | train loss 3.341878 | norm 0.3041 | lr 2.59e-04 | (3800.75 ms | 137943 tok/s) step 15159/76294 | train loss 3.421965 | norm 0.4300 | lr 2.59e-04 | (3810.78 ms | 137580 tok/s) step 15160/76294 | train loss 3.402125 | norm 0.3706 | lr 2.59e-04 | (3820.80 ms | 137219 tok/s) step 15161/76294 | train loss 3.440849 | norm 0.3634 | lr 2.59e-04 | (3811.24 ms | 137564 tok/s) step 15162/76294 | train loss 3.387341 | norm 0.3467 | lr 2.59e-04 | (3800.69 ms | 137945 tok/s) step 15163/76294 | train loss 3.428252 | norm 0.3212 | lr 2.59e-04 | (3832.56 ms | 136798 tok/s) step 15164/76294 | train loss 3.395483 | norm 0.3159 | lr 2.59e-04 | (3804.75 ms | 137798 tok/s) step 15165/76294 | train loss 3.408842 | norm 0.2962 | lr 2.59e-04 | (3846.63 ms | 136298 tok/s) step 15166/76294 | train loss 3.340620 | norm 0.3755 | lr 2.59e-04 | (3803.60 ms | 137840 tok/s) step 15167/76294 | train loss 3.413034 | norm 0.3210 | lr 2.59e-04 | (3803.40 ms | 137847 tok/s) step 15168/76294 | train loss 3.291495 | norm 0.3380 | lr 2.58e-04 | (3833.84 ms | 136753 tok/s) step 15169/76294 | train loss 3.448712 | norm 0.4029 | lr 2.58e-04 | (3813.80 ms | 137471 tok/s) step 15170/76294 | train loss 3.345632 | norm 0.3156 | lr 2.58e-04 | (3810.60 ms | 137587 tok/s) step 15171/76294 | train loss 3.415854 | norm 0.3073 | lr 2.58e-04 | (3842.53 ms | 136444 tok/s) step 15172/76294 | train loss 3.407450 | norm 0.3262 | lr 2.58e-04 | (3808.46 ms | 137664 tok/s) step 15173/76294 | train loss 3.431716 | norm 0.3909 | lr 2.58e-04 | (3821.39 ms | 137198 tok/s) step 15174/76294 | train loss 3.484725 | norm 0.3286 | lr 2.58e-04 | (4194.63 ms | 124990 tok/s) step 15175/76294 | train loss 3.375662 | norm 0.3476 | lr 2.58e-04 | (3827.83 ms | 136967 tok/s) 
step 15176/76294 | train loss 3.487478 | norm 0.6320 | lr 2.58e-04 | (3813.75 ms | 137473 tok/s) step 15177/76294 | train loss 3.484831 | norm 0.3896 | lr 2.58e-04 | (3810.92 ms | 137575 tok/s) step 15178/76294 | train loss 3.423738 | norm 0.3322 | lr 2.58e-04 | (3828.86 ms | 136931 tok/s) step 15179/76294 | train loss 3.381079 | norm 0.3488 | lr 2.58e-04 | (3808.04 ms | 137679 tok/s) step 15180/76294 | train loss 3.412043 | norm 0.5085 | lr 2.58e-04 | (3870.84 ms | 135446 tok/s) step 15181/76294 | train loss 3.481217 | norm 0.7328 | lr 2.58e-04 | (3801.92 ms | 137901 tok/s) step 15182/76294 | train loss 3.454134 | norm 0.4429 | lr 2.58e-04 | (3811.03 ms | 137571 tok/s) step 15183/76294 | train loss 3.399895 | norm 0.5398 | lr 2.58e-04 | (3800.20 ms | 137963 tok/s) step 15184/76294 | train loss 3.447937 | norm 0.3489 | lr 2.57e-04 | (3803.16 ms | 137856 tok/s) step 15185/76294 | train loss 3.363790 | norm 0.4763 | lr 2.57e-04 | (3825.73 ms | 137043 tok/s) step 15186/76294 | train loss 3.451793 | norm 0.3492 | lr 2.57e-04 | (3805.50 ms | 137771 tok/s) step 15187/76294 | train loss 3.356597 | norm 0.4869 | lr 2.57e-04 | (3808.43 ms | 137665 tok/s) step 15188/76294 | train loss 3.437847 | norm 0.5288 | lr 2.57e-04 | (3837.61 ms | 136619 tok/s) step 15189/76294 | train loss 3.421889 | norm 0.4270 | lr 2.57e-04 | (3812.59 ms | 137515 tok/s) step 15190/76294 | train loss 3.466717 | norm 0.5842 | lr 2.57e-04 | (3807.03 ms | 137716 tok/s) step 15191/76294 | train loss 3.373034 | norm 0.4354 | lr 2.57e-04 | (3810.00 ms | 137609 tok/s) step 15192/76294 | train loss 3.605088 | norm 0.3449 | lr 2.57e-04 | (3808.49 ms | 137663 tok/s) step 15193/76294 | train loss 3.461444 | norm 0.3938 | lr 2.57e-04 | (3812.41 ms | 137522 tok/s) step 15194/76294 | train loss 3.402924 | norm 0.3401 | lr 2.57e-04 | (3803.37 ms | 137848 tok/s) step 15195/76294 | train loss 3.388246 | norm 0.2934 | lr 2.57e-04 | (3905.43 ms | 134246 tok/s) step 15196/76294 | train loss 3.406138 | norm 0.3721 | lr 
2.57e-04 | (3832.18 ms | 136812 tok/s) step 15197/76294 | train loss 3.382419 | norm 0.3577 | lr 2.57e-04 | (3801.79 ms | 137905 tok/s) step 15198/76294 | train loss 3.363693 | norm 0.3180 | lr 2.57e-04 | (3826.98 ms | 136998 tok/s) step 15199/76294 | train loss 3.405329 | norm 0.3746 | lr 2.57e-04 | (3804.07 ms | 137823 tok/s) step 15200/76294 | train loss 3.324616 | norm 0.4401 | lr 2.57e-04 | (3826.62 ms | 137011 tok/s) step 15201/76294 | train loss 3.411951 | norm 0.5342 | lr 2.56e-04 | (3800.08 ms | 137968 tok/s) step 15202/76294 | train loss 3.494674 | norm 0.5422 | lr 2.56e-04 | (3874.25 ms | 135326 tok/s) step 15203/76294 | train loss 3.371679 | norm 1.1183 | lr 2.56e-04 | (3875.28 ms | 135290 tok/s) step 15204/76294 | train loss 3.361524 | norm 0.8962 | lr 2.56e-04 | (3802.60 ms | 137876 tok/s) step 15205/76294 | train loss 3.392881 | norm 0.5122 | lr 2.56e-04 | (3806.61 ms | 137731 tok/s) step 15206/76294 | train loss 3.403284 | norm 0.6175 | lr 2.56e-04 | (3825.25 ms | 137060 tok/s) step 15207/76294 | train loss 3.439056 | norm 0.3744 | lr 2.56e-04 | (3813.23 ms | 137492 tok/s) step 15208/76294 | train loss 3.505070 | norm 0.4898 | lr 2.56e-04 | (3835.95 ms | 136678 tok/s) step 15209/76294 | train loss 3.454735 | norm 0.4113 | lr 2.56e-04 | (3802.56 ms | 137878 tok/s) step 15210/76294 | train loss 3.400865 | norm 0.3422 | lr 2.56e-04 | (3825.80 ms | 137040 tok/s) step 15211/76294 | train loss 3.422201 | norm 0.4966 | lr 2.56e-04 | (3802.18 ms | 137891 tok/s) step 15212/76294 | train loss 3.383832 | norm 0.4212 | lr 2.56e-04 | (3807.62 ms | 137694 tok/s) step 15213/76294 | train loss 3.378060 | norm 0.4259 | lr 2.56e-04 | (12897.95 ms | 40649 tok/s) step 15214/76294 | train loss 3.434194 | norm 0.3306 | lr 2.56e-04 | (3774.73 ms | 138894 tok/s) step 15215/76294 | train loss 3.399343 | norm 0.3483 | lr 2.56e-04 | (3830.12 ms | 136885 tok/s) step 15216/76294 | train loss 3.378827 | norm 0.3223 | lr 2.56e-04 | (3806.99 ms | 137717 tok/s) step 15217/76294 | 
train loss 3.685574 | norm 0.3928 | lr 2.56e-04 | (4028.72 ms | 130138 tok/s) step 15218/76294 | train loss 3.428221 | norm 0.3467 | lr 2.55e-04 | (3804.49 ms | 137808 tok/s) step 15219/76294 | train loss 3.625489 | norm 0.3596 | lr 2.55e-04 | (3816.53 ms | 137373 tok/s) step 15220/76294 | train loss 3.406698 | norm 0.3350 | lr 2.55e-04 | (3791.18 ms | 138291 tok/s) step 15221/76294 | train loss 3.418213 | norm 0.3321 | lr 2.55e-04 | (3813.12 ms | 137496 tok/s) step 15222/76294 | train loss 3.442174 | norm 0.3471 | lr 2.55e-04 | (3795.13 ms | 138147 tok/s) step 15223/76294 | train loss 3.366241 | norm 0.3585 | lr 2.55e-04 | (3792.22 ms | 138253 tok/s) step 15224/76294 | train loss 3.448241 | norm 0.2803 | lr 2.55e-04 | (3885.39 ms | 134938 tok/s) step 15225/76294 | train loss 3.389432 | norm 0.3911 | lr 2.55e-04 | (3795.20 ms | 138145 tok/s) step 15226/76294 | train loss 3.518343 | norm 0.3040 | lr 2.55e-04 | (3817.21 ms | 137348 tok/s) step 15227/76294 | train loss 3.422558 | norm 0.3831 | lr 2.55e-04 | (3793.01 ms | 138225 tok/s) step 15228/76294 | train loss 3.485294 | norm 0.3380 | lr 2.55e-04 | (3840.90 ms | 136501 tok/s) step 15229/76294 | train loss 3.410685 | norm 0.3192 | lr 2.55e-04 | (3795.35 ms | 138139 tok/s) step 15230/76294 | train loss 3.444284 | norm 0.5022 | lr 2.55e-04 | (3798.68 ms | 138019 tok/s) step 15231/76294 | train loss 3.375964 | norm 0.2839 | lr 2.55e-04 | (3822.63 ms | 137154 tok/s) step 15232/76294 | train loss 3.533239 | norm 0.3758 | lr 2.55e-04 | (3797.64 ms | 138056 tok/s) step 15233/76294 | train loss 3.385505 | norm 0.3043 | lr 2.55e-04 | (3805.58 ms | 137768 tok/s) step 15234/76294 | train loss 3.388753 | norm 0.3100 | lr 2.55e-04 | (3796.31 ms | 138105 tok/s) step 15235/76294 | train loss 3.422850 | norm 0.2928 | lr 2.54e-04 | (3803.47 ms | 137845 tok/s) step 15236/76294 | train loss 3.435308 | norm 0.3909 | lr 2.54e-04 | (3808.87 ms | 137649 tok/s) step 15237/76294 | train loss 3.381094 | norm 0.4882 | lr 2.54e-04 | (3823.87 
ms | 137109 tok/s) step 15238/76294 | train loss 3.387775 | norm 0.3738 | lr 2.54e-04 | (3796.99 ms | 138080 tok/s) step 15239/76294 | train loss 3.482542 | norm 0.4007 | lr 2.54e-04 | (3804.30 ms | 137815 tok/s) step 15240/76294 | train loss 3.430037 | norm 0.4259 | lr 2.54e-04 | (3825.40 ms | 137054 tok/s) step 15241/76294 | train loss 3.334556 | norm 0.3863 | lr 2.54e-04 | (3802.25 ms | 137889 tok/s) step 15242/76294 | train loss 3.473197 | norm 0.4252 | lr 2.54e-04 | (3806.90 ms | 137721 tok/s) step 15243/76294 | train loss 3.401483 | norm 0.3353 | lr 2.54e-04 | (3855.73 ms | 135976 tok/s) step 15244/76294 | train loss 3.372725 | norm 0.4473 | lr 2.54e-04 | (3796.18 ms | 138109 tok/s) step 15245/76294 | train loss 3.371382 | norm 0.2992 | lr 2.54e-04 | (3802.40 ms | 137884 tok/s) step 15246/76294 | train loss 3.385888 | norm 0.3152 | lr 2.54e-04 | (3820.41 ms | 137234 tok/s) step 15247/76294 | train loss 3.365812 | norm 0.3490 | lr 2.54e-04 | (3800.81 ms | 137941 tok/s) step 15248/76294 | train loss 3.360733 | norm 0.4315 | lr 2.54e-04 | (3882.00 ms | 135056 tok/s) step 15249/76294 | train loss 3.415110 | norm 0.2856 | lr 2.54e-04 | (3793.17 ms | 138219 tok/s) step 15250/76294 | train loss 3.402596 | norm 0.2807 | lr 2.54e-04 | (3822.90 ms | 137144 tok/s)
val loss: 3.371664
saving model checkpoint to ./results/gpt2-124M-gqa/step_15250.pth
step 15251/76294 | train loss 3.415597 | norm 0.3191 | lr 2.53e-04 | (3819.65 ms | 137261 tok/s) step 15252/76294 | train loss 3.359414 | norm 0.4566 | lr 2.53e-04 | (3843.77 ms | 136400 tok/s) step 15253/76294 | train loss 3.384383 | norm 0.2721 | lr 2.53e-04 | (3799.23 ms | 137999 tok/s) step 15254/76294 | train loss 3.434397 | norm 0.4087 | lr 2.53e-04 | (3798.45 ms | 138027 tok/s) step 15255/76294 | train loss 3.314586 | norm 0.3030 | lr 2.53e-04 | (3799.13 ms | 138002 tok/s) step 15256/76294 | train loss 3.385705 | norm 0.3295 | lr 2.53e-04 | (3804.50 ms | 137807 tok/s) step 15257/76294 | train loss 3.358945 | norm 0.3044
| lr 2.53e-04 | (3802.45 ms | 137882 tok/s) step 15258/76294 | train loss 3.408151 | norm 0.3045 | lr 2.53e-04 | (3804.54 ms | 137806 tok/s) step 15259/76294 | train loss 3.401862 | norm 0.4478 | lr 2.53e-04 | (3803.64 ms | 137839 tok/s) step 15260/76294 | train loss 3.387970 | norm 0.3282 | lr 2.53e-04 | (3833.36 ms | 136770 tok/s) step 15261/76294 | train loss 3.375004 | norm 0.3078 | lr 2.53e-04 | (3800.76 ms | 137943 tok/s) step 15262/76294 | train loss 3.388141 | norm 0.3282 | lr 2.53e-04 | (3811.35 ms | 137560 tok/s) step 15263/76294 | train loss 3.398604 | norm 0.2850 | lr 2.53e-04 | (3816.27 ms | 137382 tok/s) step 15264/76294 | train loss 3.392593 | norm 0.3208 | lr 2.53e-04 | (3830.85 ms | 136859 tok/s) step 15265/76294 | train loss 3.386409 | norm 0.3263 | lr 2.53e-04 | (3816.89 ms | 137360 tok/s) step 15266/76294 | train loss 3.401296 | norm 0.3093 | lr 2.53e-04 | (3815.52 ms | 137409 tok/s) step 15267/76294 | train loss 3.456599 | norm 0.3041 | lr 2.53e-04 | (3808.16 ms | 137675 tok/s) step 15268/76294 | train loss 3.405307 | norm 0.3375 | lr 2.52e-04 | (3815.64 ms | 137405 tok/s) step 15269/76294 | train loss 3.411582 | norm 0.3346 | lr 2.52e-04 | (3808.66 ms | 137657 tok/s) step 15270/76294 | train loss 3.406744 | norm 0.3544 | lr 2.52e-04 | (3814.39 ms | 137450 tok/s) step 15271/76294 | train loss 3.374299 | norm 0.2959 | lr 2.52e-04 | (3815.64 ms | 137405 tok/s) step 15272/76294 | train loss 3.426226 | norm 0.3251 | lr 2.52e-04 | (3932.81 ms | 133311 tok/s) step 15273/76294 | train loss 3.433317 | norm 0.3049 | lr 2.52e-04 | (3806.32 ms | 137741 tok/s) step 15274/76294 | train loss 3.478988 | norm 0.3116 | lr 2.52e-04 | (4347.25 ms | 120602 tok/s) step 15275/76294 | train loss 3.438843 | norm 0.2999 | lr 2.52e-04 | (3812.87 ms | 137505 tok/s) step 15276/76294 | train loss 3.467532 | norm 0.3318 | lr 2.52e-04 | (3851.76 ms | 136116 tok/s) step 15277/76294 | train loss 3.426988 | norm 0.4295 | lr 2.52e-04 | (3807.25 ms | 137708 tok/s) step 
15278/76294 | train loss 3.371868 | norm 0.4499 | lr 2.52e-04 | (3827.04 ms | 136996 tok/s) step 15279/76294 | train loss 3.388219 | norm 0.4561 | lr 2.52e-04 | (3796.44 ms | 138100 tok/s) step 15280/76294 | train loss 3.493981 | norm 0.4115 | lr 2.52e-04 | (3809.43 ms | 137629 tok/s) step 15281/76294 | train loss 3.409914 | norm 0.3382 | lr 2.52e-04 | (3828.68 ms | 136937 tok/s) step 15282/76294 | train loss 3.462924 | norm 0.2880 | lr 2.52e-04 | (3799.70 ms | 137981 tok/s) step 15283/76294 | train loss 3.418669 | norm 0.2820 | lr 2.52e-04 | (3810.31 ms | 137597 tok/s) step 15284/76294 | train loss 3.345948 | norm 0.4591 | lr 2.52e-04 | (3801.16 ms | 137928 tok/s) step 15285/76294 | train loss 3.379768 | norm 0.3960 | lr 2.51e-04 | (3809.01 ms | 137644 tok/s) step 15286/76294 | train loss 3.372436 | norm 0.3098 | lr 2.51e-04 | (3825.19 ms | 137062 tok/s) step 15287/76294 | train loss 3.442350 | norm 0.3752 | lr 2.51e-04 | (3802.22 ms | 137890 tok/s) step 15288/76294 | train loss 3.517782 | norm 0.3374 | lr 2.51e-04 | (3808.72 ms | 137654 tok/s) step 15289/76294 | train loss 3.371583 | norm 0.3998 | lr 2.51e-04 | (3806.08 ms | 137750 tok/s) step 15290/76294 | train loss 3.487329 | norm 0.3354 | lr 2.51e-04 | (3803.50 ms | 137843 tok/s) step 15291/76294 | train loss 3.395576 | norm 0.2973 | lr 2.51e-04 | (3801.91 ms | 137901 tok/s) step 15292/76294 | train loss 3.473924 | norm 0.3236 | lr 2.51e-04 | (3808.16 ms | 137675 tok/s) step 15293/76294 | train loss 3.374952 | norm 0.3734 | lr 2.51e-04 | (3800.32 ms | 137959 tok/s) step 15294/76294 | train loss 3.656891 | norm 0.3473 | lr 2.51e-04 | (3809.30 ms | 137634 tok/s) step 15295/76294 | train loss 3.354195 | norm 0.3331 | lr 2.51e-04 | (3800.76 ms | 137943 tok/s) step 15296/76294 | train loss 3.372655 | norm 0.3421 | lr 2.51e-04 | (3877.23 ms | 135222 tok/s) step 15297/76294 | train loss 3.435006 | norm 0.3568 | lr 2.51e-04 | (3803.37 ms | 137848 tok/s) step 15298/76294 | train loss 3.393704 | norm 0.3117 | lr 
2.51e-04 | (3821.87 ms | 137181 tok/s) step 15299/76294 | train loss 3.375313 | norm 0.3875 | lr 2.51e-04 | (3803.10 ms | 137858 tok/s) step 15300/76294 | train loss 3.363175 | norm 0.3013 | lr 2.51e-04 | (3802.84 ms | 137868 tok/s) step 15301/76294 | train loss 3.422654 | norm 0.3313 | lr 2.51e-04 | (3825.78 ms | 137041 tok/s) step 15302/76294 | train loss 3.427298 | norm 0.3407 | lr 2.50e-04 | (3801.21 ms | 137927 tok/s) step 15303/76294 | train loss 3.452779 | norm 0.3741 | lr 2.50e-04 | (3809.88 ms | 137613 tok/s) step 15304/76294 | train loss 3.424193 | norm 0.3895 | lr 2.50e-04 | (3864.67 ms | 135662 tok/s) step 15305/76294 | train loss 3.339111 | norm 0.5185 | lr 2.50e-04 | (3801.58 ms | 137913 tok/s) step 15306/76294 | train loss 3.406919 | norm 0.5779 | lr 2.50e-04 | (3836.15 ms | 136670 tok/s) step 15307/76294 | train loss 3.429659 | norm 0.4881 | lr 2.50e-04 | (3802.18 ms | 137891 tok/s) step 15308/76294 | train loss 3.348852 | norm 0.4481 | lr 2.50e-04 | (3804.29 ms | 137815 tok/s) step 15309/76294 | train loss 3.366745 | norm 0.4569 | lr 2.50e-04 | (3823.69 ms | 137116 tok/s) step 15310/76294 | train loss 3.413484 | norm 0.3094 | lr 2.50e-04 | (3808.14 ms | 137676 tok/s) step 15311/76294 | train loss 3.411877 | norm 0.3692 | lr 2.50e-04 | (3801.59 ms | 137913 tok/s) step 15312/76294 | train loss 3.379315 | norm 0.3141 | lr 2.50e-04 | (3829.68 ms | 136901 tok/s) step 15313/76294 | train loss 3.422046 | norm 0.4167 | lr 2.50e-04 | (3811.76 ms | 137545 tok/s) step 15314/76294 | train loss 3.389544 | norm 0.3574 | lr 2.50e-04 | (3832.35 ms | 136806 tok/s) step 15315/76294 | train loss 3.465058 | norm 0.3187 | lr 2.50e-04 | (3811.84 ms | 137542 tok/s) step 15316/76294 | train loss 3.476688 | norm 0.3281 | lr 2.50e-04 | (3837.22 ms | 136632 tok/s) step 15317/76294 | train loss 3.460515 | norm 0.3599 | lr 2.50e-04 | (3807.91 ms | 137684 tok/s) step 15318/76294 | train loss 3.378389 | norm 0.3665 | lr 2.50e-04 | (3826.77 ms | 137005 tok/s) step 15319/76294 | 
train loss 3.430281 | norm 0.3582 | lr 2.49e-04 | (3811.42 ms | 137557 tok/s) step 15320/76294 | train loss 3.326021 | norm 0.3149 | lr 2.49e-04 | (3854.41 ms | 136023 tok/s) step 15321/76294 | train loss 3.340496 | norm 0.3138 | lr 2.49e-04 | (3809.53 ms | 137625 tok/s) step 15322/76294 | train loss 3.307976 | norm 0.3005 | lr 2.49e-04 | (3818.14 ms | 137315 tok/s) step 15323/76294 | train loss 3.363496 | norm 0.3270 | lr 2.49e-04 | (3828.13 ms | 136957 tok/s) step 15324/76294 | train loss 3.364499 | norm 0.2590 | lr 2.49e-04 | (3810.86 ms | 137577 tok/s) step 15325/76294 | train loss 3.385374 | norm 0.2948 | lr 2.49e-04 | (3816.67 ms | 137368 tok/s) step 15326/76294 | train loss 3.416495 | norm 0.2572 | lr 2.49e-04 | (3807.69 ms | 137692 tok/s) step 15327/76294 | train loss 3.421110 | norm 0.2807 | lr 2.49e-04 | (3835.70 ms | 136686 tok/s) step 15328/76294 | train loss 3.380639 | norm 0.3781 | lr 2.49e-04 | (3814.55 ms | 137444 tok/s) step 15329/76294 | train loss 3.409390 | norm 0.3153 | lr 2.49e-04 | (3813.84 ms | 137470 tok/s) step 15330/76294 | train loss 3.433131 | norm 0.2838 | lr 2.49e-04 | (3814.60 ms | 137442 tok/s) step 15331/76294 | train loss 3.436741 | norm 0.2700 | lr 2.49e-04 | (3817.07 ms | 137354 tok/s) step 15332/76294 | train loss 3.373059 | norm 0.2722 | lr 2.49e-04 | (3831.92 ms | 136821 tok/s) step 15333/76294 | train loss 3.470344 | norm 0.3365 | lr 2.49e-04 | (3813.11 ms | 137496 tok/s) step 15334/76294 | train loss 3.415211 | norm 0.3013 | lr 2.49e-04 | (3846.52 ms | 136302 tok/s) step 15335/76294 | train loss 3.379783 | norm 0.2962 | lr 2.49e-04 | (3820.62 ms | 137226 tok/s) step 15336/76294 | train loss 3.308120 | norm 0.3284 | lr 2.48e-04 | (3810.62 ms | 137586 tok/s) step 15337/76294 | train loss 3.455291 | norm 0.4298 | lr 2.48e-04 | (3812.75 ms | 137509 tok/s) step 15338/76294 | train loss 3.582564 | norm 0.4213 | lr 2.48e-04 | (3817.17 ms | 137350 tok/s) step 15339/76294 | train loss 3.388222 | norm 0.2662 | lr 2.48e-04 | (3808.27 
ms | 137671 tok/s) step 15340/76294 | train loss 3.420259 | norm 0.4304 | lr 2.48e-04 | (3856.55 ms | 135948 tok/s) step 15341/76294 | train loss 3.431120 | norm 0.2891 | lr 2.48e-04 | (5066.29 ms | 103486 tok/s) step 15342/76294 | train loss 3.365496 | norm 0.3325 | lr 2.48e-04 | (3950.55 ms | 132713 tok/s) step 15343/76294 | train loss 3.377753 | norm 0.2612 | lr 2.48e-04 | (4688.45 ms | 111825 tok/s) step 15344/76294 | train loss 3.384354 | norm 0.3297 | lr 2.48e-04 | (3812.07 ms | 137534 tok/s) step 15345/76294 | train loss 3.626659 | norm 0.3155 | lr 2.48e-04 | (3804.67 ms | 137801 tok/s) step 15346/76294 | train loss 3.378941 | norm 0.2778 | lr 2.48e-04 | (3819.11 ms | 137280 tok/s) step 15347/76294 | train loss 3.422219 | norm 0.2721 | lr 2.48e-04 | (3804.16 ms | 137820 tok/s) step 15348/76294 | train loss 3.340697 | norm 0.3971 | lr 2.48e-04 | (3800.43 ms | 137955 tok/s) step 15349/76294 | train loss 3.410624 | norm 0.3818 | lr 2.48e-04 | (3834.45 ms | 136731 tok/s) step 15350/76294 | train loss 3.370301 | norm 0.3437 | lr 2.48e-04 | (3797.09 ms | 138076 tok/s) step 15351/76294 | train loss 3.371043 | norm 0.4594 | lr 2.48e-04 | (3806.57 ms | 137732 tok/s) step 15352/76294 | train loss 3.402522 | norm 0.3908 | lr 2.48e-04 | (3828.69 ms | 136937 tok/s) step 15353/76294 | train loss 3.384751 | norm 0.3051 | lr 2.48e-04 | (3803.97 ms | 137826 tok/s) step 15354/76294 | train loss 3.382078 | norm 0.3555 | lr 2.47e-04 | (3809.24 ms | 137636 tok/s) step 15355/76294 | train loss 3.367847 | norm 0.3583 | lr 2.47e-04 | (3806.50 ms | 137735 tok/s) step 15356/76294 | train loss 3.447332 | norm 0.2687 | lr 2.47e-04 | (3809.87 ms | 137613 tok/s) step 15357/76294 | train loss 3.417408 | norm 0.3448 | lr 2.47e-04 | (3802.89 ms | 137866 tok/s) step 15358/76294 | train loss 3.381599 | norm 0.4152 | lr 2.47e-04 | (3807.32 ms | 137705 tok/s) step 15359/76294 | train loss 3.391634 | norm 0.3427 | lr 2.47e-04 | (3801.21 ms | 137927 tok/s) step 15360/76294 | train loss 3.429238 | 
norm 0.3924 | lr 2.47e-04 | (3802.75 ms | 137871 tok/s) step 15361/76294 | train loss 3.433096 | norm 0.3619 | lr 2.47e-04 | (3850.47 ms | 136162 tok/s) step 15362/76294 | train loss 3.346576 | norm 0.4042 | lr 2.47e-04 | (3808.00 ms | 137681 tok/s) step 15363/76294 | train loss 3.364359 | norm 0.3410 | lr 2.47e-04 | (3809.95 ms | 137610 tok/s) step 15364/76294 | train loss 3.383158 | norm 0.3848 | lr 2.47e-04 | (3821.88 ms | 137181 tok/s) step 15365/76294 | train loss 3.407162 | norm 0.4019 | lr 2.47e-04 | (4604.65 ms | 113861 tok/s) step 15366/76294 | train loss 3.378443 | norm 0.4458 | lr 2.47e-04 | (3802.12 ms | 137894 tok/s) step 15367/76294 | train loss 3.320725 | norm 0.3304 | lr 2.47e-04 | (3835.83 ms | 136682 tok/s) step 15368/76294 | train loss 3.386230 | norm 0.4131 | lr 2.47e-04 | (3795.96 ms | 138117 tok/s) step 15369/76294 | train loss 3.324128 | norm 0.3402 | lr 2.47e-04 | (3805.76 ms | 137762 tok/s) step 15370/76294 | train loss 3.365010 | norm 0.3172 | lr 2.47e-04 | (3824.61 ms | 137083 tok/s) step 15371/76294 | train loss 3.443265 | norm 0.2940 | lr 2.46e-04 | (3801.36 ms | 137921 tok/s) step 15372/76294 | train loss 3.303733 | norm 0.3457 | lr 2.46e-04 | (3805.81 ms | 137760 tok/s) step 15373/76294 | train loss 3.346954 | norm 0.2949 | lr 2.46e-04 | (3810.75 ms | 137581 tok/s) step 15374/76294 | train loss 3.395014 | norm 0.3303 | lr 2.46e-04 | (3799.34 ms | 137994 tok/s) step 15375/76294 | train loss 3.405084 | norm 0.3178 | lr 2.46e-04 | (3836.67 ms | 136652 tok/s) step 15376/76294 | train loss 3.352702 | norm 0.4276 | lr 2.46e-04 | (3803.75 ms | 137834 tok/s) step 15377/76294 | train loss 3.349467 | norm 0.3036 | lr 2.46e-04 | (3805.60 ms | 137768 tok/s) step 15378/76294 | train loss 3.389915 | norm 0.3117 | lr 2.46e-04 | (3823.02 ms | 137140 tok/s) step 15379/76294 | train loss 3.354223 | norm 0.4656 | lr 2.46e-04 | (3803.16 ms | 137856 tok/s) step 15380/76294 | train loss 3.369277 | norm 0.3623 | lr 2.46e-04 | (3806.84 ms | 137723 tok/s) 
step 15381/76294 | train loss 3.366763 | norm 0.6242 | lr 2.46e-04 | (3836.42 ms | 136661 tok/s) step 15382/76294 | train loss 3.325600 | norm 0.5496 | lr 2.46e-04 | (3802.67 ms | 137874 tok/s) step 15383/76294 | train loss 3.398710 | norm 0.3191 | lr 2.46e-04 | (3809.12 ms | 137640 tok/s) step 15384/76294 | train loss 3.325931 | norm 0.3959 | lr 2.46e-04 | (3823.28 ms | 137130 tok/s) step 15385/76294 | train loss 3.468628 | norm 0.3976 | lr 2.46e-04 | (3803.89 ms | 137829 tok/s) step 15386/76294 | train loss 3.375200 | norm 0.3762 | lr 2.46e-04 | (3802.85 ms | 137867 tok/s) step 15387/76294 | train loss 3.364537 | norm 0.3718 | lr 2.46e-04 | (3833.95 ms | 136749 tok/s) step 15388/76294 | train loss 3.362785 | norm 0.3476 | lr 2.45e-04 | (3805.38 ms | 137775 tok/s) step 15389/76294 | train loss 3.375715 | norm 0.3124 | lr 2.45e-04 | (3805.47 ms | 137772 tok/s) step 15390/76294 | train loss 3.323361 | norm 0.3052 | lr 2.45e-04 | (3898.87 ms | 134472 tok/s) step 15391/76294 | train loss 3.317849 | norm 0.4682 | lr 2.45e-04 | (3835.64 ms | 136688 tok/s) step 15392/76294 | train loss 3.365524 | norm 0.3580 | lr 2.45e-04 | (3806.75 ms | 137726 tok/s) step 15393/76294 | train loss 3.435408 | norm 0.4392 | lr 2.45e-04 | (3837.07 ms | 136638 tok/s) step 15394/76294 | train loss 3.345091 | norm 0.3575 | lr 2.45e-04 | (3800.95 ms | 137936 tok/s) step 15395/76294 | train loss 3.365396 | norm 0.3263 | lr 2.45e-04 | (3838.03 ms | 136603 tok/s) step 15396/76294 | train loss 3.407987 | norm 0.4329 | lr 2.45e-04 | (3801.59 ms | 137913 tok/s) step 15397/76294 | train loss 3.332971 | norm 0.3376 | lr 2.45e-04 | (3805.36 ms | 137776 tok/s) step 15398/76294 | train loss 3.357714 | norm 0.3238 | lr 2.45e-04 | (3833.54 ms | 136764 tok/s) step 15399/76294 | train loss 3.410582 | norm 0.3440 | lr 2.45e-04 | (3805.26 ms | 137780 tok/s) step 15400/76294 | train loss 3.470672 | norm 0.3171 | lr 2.45e-04 | (3805.56 ms | 137769 tok/s) step 15401/76294 | train loss 3.417069 | norm 0.3246 | lr 
2.45e-04 | (3804.29 ms | 137815 tok/s) step 15402/76294 | train loss 3.360474 | norm 0.3063 | lr 2.45e-04 | (3811.23 ms | 137564 tok/s) step 15403/76294 | train loss 3.386803 | norm 0.3074 | lr 2.45e-04 | (3873.03 ms | 135369 tok/s) step 15404/76294 | train loss 3.403752 | norm 0.3633 | lr 2.45e-04 | (3812.74 ms | 137509 tok/s) step 15405/76294 | train loss 3.356125 | norm 0.2728 | lr 2.45e-04 | (3803.54 ms | 137842 tok/s) step 15406/76294 | train loss 3.335042 | norm 0.3679 | lr 2.44e-04 | (3815.90 ms | 137396 tok/s) step 15407/76294 | train loss 3.365727 | norm 0.6142 | lr 2.44e-04 | (3829.84 ms | 136896 tok/s) step 15408/76294 | train loss 3.327496 | norm 0.5382 | lr 2.44e-04 | (3813.80 ms | 137471 tok/s) step 15409/76294 | train loss 3.334646 | norm 0.6471 | lr 2.44e-04 | (3809.85 ms | 137614 tok/s) step 15410/76294 | train loss 3.378442 | norm 0.4746 | lr 2.44e-04 | (3811.81 ms | 137543 tok/s) step 15411/76294 | train loss 3.351948 | norm 0.4094 | lr 2.44e-04 | (3806.27 ms | 137743 tok/s) step 15412/76294 | train loss 3.375217 | norm 0.4683 | lr 2.44e-04 | (3810.33 ms | 137597 tok/s) step 15413/76294 | train loss 3.401880 | norm 0.3870 | lr 2.44e-04 | (3805.82 ms | 137760 tok/s) step 15414/76294 | train loss 3.378198 | norm 0.3369 | lr 2.44e-04 | (3808.74 ms | 137654 tok/s) step 15415/76294 | train loss 3.397323 | norm 0.3475 | lr 2.44e-04 | (3825.61 ms | 137047 tok/s) step 15416/76294 | train loss 3.412359 | norm 0.3910 | lr 2.44e-04 | (3805.57 ms | 137769 tok/s) step 15417/76294 | train loss 3.495301 | norm 0.4084 | lr 2.44e-04 | (3805.92 ms | 137756 tok/s) step 15418/76294 | train loss 3.502956 | norm 0.3172 | lr 2.44e-04 | (3808.42 ms | 137666 tok/s) step 15419/76294 | train loss 3.407220 | norm 0.3923 | lr 2.44e-04 | (3808.62 ms | 137658 tok/s) step 15420/76294 | train loss 3.411373 | norm 0.3436 | lr 2.44e-04 | (3804.34 ms | 137813 tok/s) step 15421/76294 | train loss 3.429926 | norm 0.3338 | lr 2.44e-04 | (3802.85 ms | 137867 tok/s) step 15422/76294 | 
train loss 3.317217 | norm 0.2904 | lr 2.44e-04 | (3830.42 ms | 136875 tok/s) step 15423/76294 | train loss 3.341783 | norm 0.3924 | lr 2.43e-04 | (3799.98 ms | 137971 tok/s) step 15424/76294 | train loss 3.365129 | norm 0.3102 | lr 2.43e-04 | (4391.09 ms | 119398 tok/s) step 15425/76294 | train loss 3.390688 | norm 0.3556 | lr 2.43e-04 | (3835.65 ms | 136688 tok/s) step 15426/76294 | train loss 3.381330 | norm 0.3252 | lr 2.43e-04 | (3904.84 ms | 134266 tok/s) step 15427/76294 | train loss 3.407527 | norm 0.3214 | lr 2.43e-04 | (3790.09 ms | 138331 tok/s) step 15428/76294 | train loss 3.367682 | norm 0.3569 | lr 2.43e-04 | (3836.62 ms | 136654 tok/s) step 15429/76294 | train loss 3.344502 | norm 0.3199 | lr 2.43e-04 | (3795.81 ms | 138123 tok/s) step 15430/76294 | train loss 3.431877 | norm 0.3100 | lr 2.43e-04 | (3842.46 ms | 136446 tok/s) step 15431/76294 | train loss 3.357938 | norm 0.3916 | lr 2.43e-04 | (3796.63 ms | 138093 tok/s) step 15432/76294 | train loss 3.408797 | norm 0.3567 | lr 2.43e-04 | (3800.47 ms | 137953 tok/s) step 15433/76294 | train loss 3.476424 | norm 0.4157 | lr 2.43e-04 | (3822.93 ms | 137143 tok/s) step 15434/76294 | train loss 3.428787 | norm 0.3644 | lr 2.43e-04 | (3813.50 ms | 137482 tok/s) step 15435/76294 | train loss 3.502518 | norm 0.3785 | lr 2.43e-04 | (3808.87 ms | 137649 tok/s) step 15436/76294 | train loss 3.459721 | norm 0.3633 | lr 2.43e-04 | (3933.57 ms | 133286 tok/s) step 15437/76294 | train loss 3.368476 | norm 0.2959 | lr 2.43e-04 | (3796.86 ms | 138085 tok/s) step 15438/76294 | train loss 3.396496 | norm 0.3169 | lr 2.43e-04 | (3820.00 ms | 137248 tok/s) step 15439/76294 | train loss 3.366099 | norm 0.3162 | lr 2.43e-04 | (3819.67 ms | 137260 tok/s) step 15440/76294 | train loss 3.413900 | norm 0.2694 | lr 2.42e-04 | (3817.67 ms | 137332 tok/s) step 15441/76294 | train loss 3.353583 | norm 0.3042 | lr 2.42e-04 | (3833.94 ms | 136749 tok/s) step 15442/76294 | train loss 3.377088 | norm 0.3189 | lr 2.42e-04 | (3801.60 
ms | 137913 tok/s) step 15443/76294 | train loss 3.323133 | norm 0.2928 | lr 2.42e-04 | (3827.13 ms | 136992 tok/s) step 15444/76294 | train loss 3.347490 | norm 0.3553 | lr 2.42e-04 | (3800.65 ms | 137947 tok/s) step 15445/76294 | train loss 3.328896 | norm 0.3724 | lr 2.42e-04 | (3819.43 ms | 137269 tok/s) step 15446/76294 | train loss 3.456809 | norm 0.4128 | lr 2.42e-04 | (3829.39 ms | 136912 tok/s) step 15447/76294 | train loss 3.454044 | norm 0.3408 | lr 2.42e-04 | (3824.26 ms | 137095 tok/s) step 15448/76294 | train loss 3.323691 | norm 0.4105 | lr 2.42e-04 | (3804.68 ms | 137801 tok/s) step 15449/76294 | train loss 3.406201 | norm 0.3161 | lr 2.42e-04 | (3803.42 ms | 137846 tok/s) step 15450/76294 | train loss 3.392688 | norm 0.4314 | lr 2.42e-04 | (3800.24 ms | 137962 tok/s) step 15451/76294 | train loss 3.356894 | norm 0.3261 | lr 2.42e-04 | (3824.68 ms | 137080 tok/s) step 15452/76294 | train loss 3.333889 | norm 0.4054 | lr 2.42e-04 | (3803.85 ms | 137831 tok/s) step 15453/76294 | train loss 3.355009 | norm 0.3627 | lr 2.42e-04 | (3821.78 ms | 137184 tok/s) step 15454/76294 | train loss 3.383129 | norm 0.3208 | lr 2.42e-04 | (3801.43 ms | 137919 tok/s) step 15455/76294 | train loss 3.403667 | norm 0.3791 | lr 2.42e-04 | (3806.93 ms | 137719 tok/s) step 15456/76294 | train loss 3.320405 | norm 0.4543 | lr 2.42e-04 | (3803.39 ms | 137848 tok/s) step 15457/76294 | train loss 3.373211 | norm 0.4239 | lr 2.42e-04 | (3799.90 ms | 137974 tok/s) step 15458/76294 | train loss 3.354572 | norm 0.4926 | lr 2.41e-04 | (3894.26 ms | 134631 tok/s) step 15459/76294 | train loss 3.377460 | norm 0.3770 | lr 2.41e-04 | (3946.75 ms | 132840 tok/s) step 15460/76294 | train loss 3.365533 | norm 0.4101 | lr 2.41e-04 | (3798.15 ms | 138038 tok/s) step 15461/76294 | train loss 3.400534 | norm 0.4279 | lr 2.41e-04 | (3819.07 ms | 137282 tok/s) step 15462/76294 | train loss 3.426270 | norm 0.3947 | lr 2.41e-04 | (3798.01 ms | 138043 tok/s) step 15463/76294 | train loss 3.329632 | 
norm 0.5008 | lr 2.41e-04 | (3800.06 ms | 137968 tok/s) step 15464/76294 | train loss 3.445453 | norm 0.4553 | lr 2.41e-04 | (3823.50 ms | 137122 tok/s) step 15465/76294 | train loss 3.367511 | norm 0.4572 | lr 2.41e-04 | (3802.73 ms | 137872 tok/s) step 15466/76294 | train loss 3.388091 | norm 0.3150 | lr 2.41e-04 | (3798.41 ms | 138028 tok/s) step 15467/76294 | train loss 3.476491 | norm 0.4464 | lr 2.41e-04 | (4375.59 ms | 119821 tok/s) step 15468/76294 | train loss 3.304689 | norm 0.3162 | lr 2.41e-04 | (3798.03 ms | 138042 tok/s) step 15469/76294 | train loss 3.331931 | norm 0.3482 | lr 2.41e-04 | (3802.10 ms | 137894 tok/s) step 15470/76294 | train loss 3.342482 | norm 0.3841 | lr 2.41e-04 | (3826.44 ms | 137017 tok/s) step 15471/76294 | train loss 3.376726 | norm 0.3188 | lr 2.41e-04 | (3803.59 ms | 137840 tok/s) step 15472/76294 | train loss 3.471998 | norm 0.4356 | lr 2.41e-04 | (3823.55 ms | 137121 tok/s) step 15473/76294 | train loss 3.361497 | norm 0.3027 | lr 2.41e-04 | (3807.66 ms | 137693 tok/s) step 15474/76294 | train loss 3.421429 | norm 0.5779 | lr 2.41e-04 | (3810.84 ms | 137578 tok/s) step 15475/76294 | train loss 3.336334 | norm 0.5482 | lr 2.41e-04 | (3804.98 ms | 137790 tok/s) step 15476/76294 | train loss 3.403532 | norm 0.5660 | lr 2.40e-04 | (3821.83 ms | 137182 tok/s) step 15477/76294 | train loss 3.527459 | norm 0.6687 | lr 2.40e-04 | (3803.53 ms | 137842 tok/s) step 15478/76294 | train loss 3.437659 | norm 0.4140 | lr 2.40e-04 | (3800.03 ms | 137969 tok/s) step 15479/76294 | train loss 3.365804 | norm 0.3890 | lr 2.40e-04 | (3826.98 ms | 136998 tok/s) step 15480/76294 | train loss 3.386431 | norm 0.5077 | lr 2.40e-04 | (3798.35 ms | 138031 tok/s) step 15481/76294 | train loss 3.379787 | norm 0.3338 | lr 2.40e-04 | (3834.78 ms | 136719 tok/s) step 15482/76294 | train loss 3.337048 | norm 0.5512 | lr 2.40e-04 | (3798.62 ms | 138021 tok/s) step 15483/76294 | train loss 3.390424 | norm 0.5004 | lr 2.40e-04 | (3896.73 ms | 134546 tok/s) 
step 15484/76294 | train loss 3.381539 | norm 0.6384 | lr 2.40e-04 | (3799.53 ms | 137988 tok/s) step 15485/76294 | train loss 3.387147 | norm 0.3553 | lr 2.40e-04 | (3805.72 ms | 137763 tok/s) step 15486/76294 | train loss 3.393274 | norm 0.3933 | lr 2.40e-04 | (3823.07 ms | 137138 tok/s) step 15487/76294 | train loss 3.365635 | norm 0.4167 | lr 2.40e-04 | (3854.99 ms | 136002 tok/s) step 15488/76294 | train loss 3.420918 | norm 0.3475 | lr 2.40e-04 | (3801.15 ms | 137929 tok/s) step 15489/76294 | train loss 3.406581 | norm 0.3669 | lr 2.40e-04 | (3831.93 ms | 136821 tok/s) step 15490/76294 | train loss 3.394575 | norm 0.3987 | lr 2.40e-04 | (3800.17 ms | 137964 tok/s) step 15491/76294 | train loss 3.409883 | norm 0.3555 | lr 2.40e-04 | (3811.13 ms | 137568 tok/s) step 15492/76294 | train loss 3.361313 | norm 0.3482 | lr 2.40e-04 | (3804.67 ms | 137801 tok/s) step 15493/76294 | train loss 3.357120 | norm 0.3418 | lr 2.39e-04 | (3805.85 ms | 137758 tok/s) step 15494/76294 | train loss 3.449943 | norm 0.3089 | lr 2.39e-04 | (3802.45 ms | 137882 tok/s) step 15495/76294 | train loss 3.450778 | norm 0.3356 | lr 2.39e-04 | (3833.26 ms | 136774 tok/s) step 15496/76294 | train loss 3.346978 | norm 0.4259 | lr 2.39e-04 | (3830.26 ms | 136880 tok/s) step 15497/76294 | train loss 3.397079 | norm 0.3473 | lr 2.39e-04 | (3810.70 ms | 137583 tok/s) step 15498/76294 | train loss 3.344930 | norm 0.3803 | lr 2.39e-04 | (3873.83 ms | 135341 tok/s) step 15499/76294 | train loss 3.408598 | norm 0.3559 | lr 2.39e-04 | (3894.58 ms | 134620 tok/s) step 15500/76294 | train loss 3.371321 | norm 0.3776 | lr 2.39e-04 | (3839.28 ms | 136559 tok/s) val loss: 3.368969 saving model checkpoint to ./results/gpt2-124M-gqa/step_15500.pth step 15501/76294 | train loss 3.366692 | norm 0.3598 | lr 2.39e-04 | (3813.93 ms | 137467 tok/s) step 15502/76294 | train loss 3.397276 | norm 0.3195 | lr 2.39e-04 | (3816.83 ms | 137362 tok/s) step 15503/76294 | train loss 3.351586 | norm 0.4581 | lr 2.39e-04 | 
(3802.36 ms | 137885 tok/s) step 15504/76294 | train loss 3.409860 | norm 0.3379 | lr 2.39e-04 | (3806.42 ms | 137738 tok/s) step 15505/76294 | train loss 3.447475 | norm 0.3753 | lr 2.39e-04 | (3800.62 ms | 137948 tok/s) step 15506/76294 | train loss 3.434493 | norm 0.3330 | lr 2.39e-04 | (3842.30 ms | 136452 tok/s) step 15507/76294 | train loss 3.317994 | norm 0.3226 | lr 2.39e-04 | (3805.42 ms | 137774 tok/s) step 15508/76294 | train loss 3.360478 | norm 0.3282 | lr 2.39e-04 | (3826.74 ms | 137006 tok/s) step 15509/76294 | train loss 3.365999 | norm 0.4163 | lr 2.39e-04 | (3807.38 ms | 137703 tok/s) step 15510/76294 | train loss 3.307679 | norm 0.3032 | lr 2.39e-04 | (3811.71 ms | 137547 tok/s) step 15511/76294 | train loss 3.345104 | norm 0.4372 | lr 2.38e-04 | (3807.22 ms | 137709 tok/s) step 15512/76294 | train loss 3.405430 | norm 0.3755 | lr 2.38e-04 | (3804.01 ms | 137825 tok/s) step 15513/76294 | train loss 3.366424 | norm 0.2873 | lr 2.38e-04 | (3835.26 ms | 136702 tok/s) step 15514/76294 | train loss 3.465404 | norm 0.3348 | lr 2.38e-04 | (3804.43 ms | 137810 tok/s) step 15515/76294 | train loss 3.378762 | norm 0.3414 | lr 2.38e-04 | (3806.22 ms | 137745 tok/s) step 15516/76294 | train loss 3.407542 | norm 0.3913 | lr 2.38e-04 | (3827.21 ms | 136990 tok/s) step 15517/76294 | train loss 3.375137 | norm 0.4562 | lr 2.38e-04 | (3803.83 ms | 137832 tok/s) step 15518/76294 | train loss 3.316479 | norm 0.3311 | lr 2.38e-04 | (3814.64 ms | 137441 tok/s) step 15519/76294 | train loss 3.317014 | norm 0.3266 | lr 2.38e-04 | (3801.70 ms | 137909 tok/s) step 15520/76294 | train loss 3.363758 | norm 0.5062 | lr 2.38e-04 | (3811.08 ms | 137570 tok/s) step 15521/76294 | train loss 3.257523 | norm 0.4603 | lr 2.38e-04 | (3827.90 ms | 136965 tok/s) step 15522/76294 | train loss 3.386365 | norm 0.5299 | lr 2.38e-04 | (3804.88 ms | 137794 tok/s) step 15523/76294 | train loss 3.439618 | norm 0.4791 | lr 2.38e-04 | (3830.61 ms | 136868 tok/s) step 15524/76294 | train loss 
3.374796 | norm 0.3936 | lr 2.38e-04 | (3802.20 ms | 137891 tok/s) step 15525/76294 | train loss 3.318333 | norm 0.3105 | lr 2.38e-04 | (3809.06 ms | 137642 tok/s) step 15526/76294 | train loss 3.379072 | norm 0.3506 | lr 2.38e-04 | (3833.89 ms | 136751 tok/s) step 15527/76294 | train loss 3.388066 | norm 0.3634 | lr 2.38e-04 | (3807.51 ms | 137698 tok/s) step 15528/76294 | train loss 3.343924 | norm 0.3266 | lr 2.38e-04 | (3808.05 ms | 137679 tok/s) step 15529/76294 | train loss 3.373334 | norm 0.3875 | lr 2.37e-04 | (3811.14 ms | 137567 tok/s) step 15530/76294 | train loss 3.407740 | norm 0.3653 | lr 2.37e-04 | (3892.28 ms | 134700 tok/s) step 15531/76294 | train loss 3.343917 | norm 0.3242 | lr 2.37e-04 | (3806.10 ms | 137749 tok/s) step 15532/76294 | train loss 3.386797 | norm 0.4233 | lr 2.37e-04 | (3837.59 ms | 136619 tok/s) step 15533/76294 | train loss 3.349756 | norm 0.4203 | lr 2.37e-04 | (3800.71 ms | 137945 tok/s) step 15534/76294 | train loss 3.394078 | norm 0.3341 | lr 2.37e-04 | (3825.80 ms | 137040 tok/s) step 15535/76294 | train loss 3.473882 | norm 0.4596 | lr 2.37e-04 | (3819.74 ms | 137258 tok/s) step 15536/76294 | train loss 3.412042 | norm 0.3677 | lr 2.37e-04 | (3805.24 ms | 137781 tok/s) step 15537/76294 | train loss 3.313041 | norm 0.4232 | lr 2.37e-04 | (3823.07 ms | 137138 tok/s) step 15538/76294 | train loss 3.462407 | norm 0.3922 | lr 2.37e-04 | (3801.85 ms | 137904 tok/s) step 15539/76294 | train loss 3.398090 | norm 0.3767 | lr 2.37e-04 | (3801.33 ms | 137922 tok/s) step 15540/76294 | train loss 3.366628 | norm 0.3635 | lr 2.37e-04 | (3866.59 ms | 135594 tok/s) step 15541/76294 | train loss 3.381407 | norm 0.3659 | lr 2.37e-04 | (3805.58 ms | 137768 tok/s) step 15542/76294 | train loss 3.413540 | norm 0.4612 | lr 2.37e-04 | (3809.94 ms | 137611 tok/s) step 15543/76294 | train loss 3.382761 | norm 0.3571 | lr 2.37e-04 | (3810.02 ms | 137608 tok/s) step 15544/76294 | train loss 3.477416 | norm 0.3475 | lr 2.37e-04 | (3811.34 ms | 137560 
tok/s) step 15545/76294 | train loss 3.458601 | norm 0.4725 | lr 2.37e-04 | (3809.51 ms | 137626 tok/s) step 15546/76294 | train loss 3.341168 | norm 0.4926 | lr 2.37e-04 | (3827.09 ms | 136994 tok/s) step 15547/76294 | train loss 3.523147 | norm 0.4801 | lr 2.36e-04 | (3806.90 ms | 137720 tok/s) step 15548/76294 | train loss 3.437865 | norm 0.4387 | lr 2.36e-04 | (3802.31 ms | 137887 tok/s) step 15549/76294 | train loss 3.307715 | norm 0.6905 | lr 2.36e-04 | (3856.66 ms | 135944 tok/s) step 15550/76294 | train loss 3.496978 | norm 0.3362 | lr 2.36e-04 | (3805.28 ms | 137779 tok/s) step 15551/76294 | train loss 3.413620 | norm 0.3924 | lr 2.36e-04 | (3807.82 ms | 137687 tok/s) step 15552/76294 | train loss 3.413593 | norm 0.4091 | lr 2.36e-04 | (3827.12 ms | 136993 tok/s) step 15553/76294 | train loss 3.343260 | norm 0.3381 | lr 2.36e-04 | (3808.26 ms | 137671 tok/s) step 15554/76294 | train loss 3.504452 | norm 0.3900 | lr 2.36e-04 | (3808.32 ms | 137669 tok/s) step 15555/76294 | train loss 3.305905 | norm 0.4191 | lr 2.36e-04 | (4145.25 ms | 126479 tok/s) step 15556/76294 | train loss 3.582150 | norm 0.4706 | lr 2.36e-04 | (3799.78 ms | 137979 tok/s) step 15557/76294 | train loss 3.610126 | norm 0.5325 | lr 2.36e-04 | (3819.40 ms | 137270 tok/s) step 15558/76294 | train loss 3.496212 | norm 0.4482 | lr 2.36e-04 | (3803.56 ms | 137841 tok/s) step 15559/76294 | train loss 3.374265 | norm 0.4741 | lr 2.36e-04 | (3808.05 ms | 137679 tok/s) step 15560/76294 | train loss 3.429098 | norm 0.5718 | lr 2.36e-04 | (3827.72 ms | 136971 tok/s) step 15561/76294 | train loss 3.381306 | norm 0.3934 | lr 2.36e-04 | (3806.79 ms | 137725 tok/s) step 15562/76294 | train loss 3.386423 | norm 0.4441 | lr 2.36e-04 | (3818.63 ms | 137298 tok/s) step 15563/76294 | train loss 3.380977 | norm 0.4130 | lr 2.36e-04 | (3805.12 ms | 137785 tok/s) step 15564/76294 | train loss 3.344164 | norm 0.4487 | lr 2.36e-04 | (3803.96 ms | 137827 tok/s) step 15565/76294 | train loss 3.353204 | norm 0.4391 
| lr 2.35e-04 | (3833.54 ms | 136763 tok/s) step 15566/76294 | train loss 3.377524 | norm 0.4644 | lr 2.35e-04 | (3803.89 ms | 137830 tok/s) step 15567/76294 | train loss 3.367450 | norm 0.3942 | lr 2.35e-04 | (3808.76 ms | 137653 tok/s) step 15568/76294 | train loss 3.343785 | norm 0.4687 | lr 2.35e-04 | (3827.64 ms | 136974 tok/s) step 15569/76294 | train loss 3.373526 | norm 0.3497 | lr 2.35e-04 | (3804.99 ms | 137790 tok/s) step 15570/76294 | train loss 3.441320 | norm 0.3806 | lr 2.35e-04 | (3826.23 ms | 137025 tok/s) step 15571/76294 | train loss 3.344538 | norm 0.3342 | lr 2.35e-04 | (3815.38 ms | 137414 tok/s) step 15572/76294 | train loss 3.422633 | norm 0.3347 | lr 2.35e-04 | (3803.38 ms | 137848 tok/s) step 15573/76294 | train loss 3.381147 | norm 0.4336 | lr 2.35e-04 | (3840.05 ms | 136532 tok/s) step 15574/76294 | train loss 3.352193 | norm 0.4156 | lr 2.35e-04 | (3803.18 ms | 137855 tok/s) step 15575/76294 | train loss 3.359096 | norm 0.4420 | lr 2.35e-04 | (3828.43 ms | 136946 tok/s) step 15576/76294 | train loss 3.389049 | norm 0.4124 | lr 2.35e-04 | (3827.77 ms | 136970 tok/s) step 15577/76294 | train loss 3.391646 | norm 0.5694 | lr 2.35e-04 | (3803.68 ms | 137837 tok/s) step 15578/76294 | train loss 3.432141 | norm 0.5843 | lr 2.35e-04 | (3806.82 ms | 137723 tok/s) step 15579/76294 | train loss 3.359245 | norm 0.6889 | lr 2.35e-04 | (3875.55 ms | 135281 tok/s) step 15580/76294 | train loss 3.361452 | norm 0.4141 | lr 2.35e-04 | (3804.64 ms | 137802 tok/s) step 15581/76294 | train loss 3.435169 | norm 0.5920 | lr 2.35e-04 | (3810.71 ms | 137583 tok/s) step 15582/76294 | train loss 3.365647 | norm 0.3818 | lr 2.35e-04 | (3829.23 ms | 136918 tok/s) step 15583/76294 | train loss 3.314594 | norm 0.3297 | lr 2.34e-04 | (3837.16 ms | 136635 tok/s) step 15584/76294 | train loss 3.321606 | norm 0.3725 | lr 2.34e-04 | (3810.35 ms | 137596 tok/s) step 15585/76294 | train loss 3.444275 | norm 1.4181 | lr 2.34e-04 | (3805.93 ms | 137756 tok/s) step 
15586/76294 | train loss 3.363955 | norm 0.4332 | lr 2.34e-04 | (3834.49 ms | 136730 tok/s) step 15587/76294 | train loss 3.392847 | norm 0.6077 | lr 2.34e-04 | (3800.36 ms | 137958 tok/s) step 15588/76294 | train loss 3.419719 | norm 0.4589 | lr 2.34e-04 | (3817.70 ms | 137331 tok/s) step 15589/76294 | train loss 3.352606 | norm 0.5141 | lr 2.34e-04 | (3804.90 ms | 137793 tok/s) step 15590/76294 | train loss 3.442126 | norm 0.3664 | lr 2.34e-04 | (3820.13 ms | 137243 tok/s) step 15591/76294 | train loss 3.396078 | norm 0.3734 | lr 2.34e-04 | (3804.15 ms | 137820 tok/s) step 15592/76294 | train loss 3.387955 | norm 0.4950 | lr 2.34e-04 | (3808.15 ms | 137675 tok/s) step 15593/76294 | train loss 3.386256 | norm 0.4785 | lr 2.34e-04 | (3822.39 ms | 137162 tok/s) step 15594/76294 | train loss 3.373536 | norm 0.3882 | lr 2.34e-04 | (3807.60 ms | 137695 tok/s) step 15595/76294 | train loss 3.394080 | norm 0.3980 | lr 2.34e-04 | (3810.06 ms | 137606 tok/s) step 15596/76294 | train loss 3.402564 | norm 0.3652 | lr 2.34e-04 | (3804.40 ms | 137811 tok/s) step 15597/76294 | train loss 3.403793 | norm 0.3452 | lr 2.34e-04 | (3806.15 ms | 137748 tok/s) step 15598/76294 | train loss 3.433080 | norm 0.3289 | lr 2.34e-04 | (3807.97 ms | 137682 tok/s) step 15599/76294 | train loss 3.352816 | norm 0.3715 | lr 2.34e-04 | (3803.84 ms | 137831 tok/s) step 15600/76294 | train loss 3.379340 | norm 0.4341 | lr 2.34e-04 | (3832.05 ms | 136817 tok/s) step 15601/76294 | train loss 3.543206 | norm 0.4458 | lr 2.33e-04 | (5278.34 ms | 99328 tok/s) step 15602/76294 | train loss 3.499978 | norm 0.3405 | lr 2.33e-04 | (3801.15 ms | 137929 tok/s) step 15603/76294 | train loss 3.419593 | norm 0.4586 | lr 2.33e-04 | (3946.86 ms | 132837 tok/s) step 15604/76294 | train loss 3.350560 | norm 0.3242 | lr 2.33e-04 | (3795.75 ms | 138125 tok/s) step 15605/76294 | train loss 3.393010 | norm 0.3495 | lr 2.33e-04 | (3820.68 ms | 137224 tok/s) step 15606/76294 | train loss 3.348437 | norm 0.3989 | lr 
2.33e-04 | (3797.11 ms | 138075 tok/s) step 15607/76294 | train loss 3.440107 | norm 0.3716 | lr 2.33e-04 | (3806.71 ms | 137727 tok/s) step 15608/76294 | train loss 3.300313 | norm 0.4003 | lr 2.33e-04 | (3823.66 ms | 137117 tok/s) step 15609/76294 | train loss 3.351475 | norm 0.4432 | lr 2.33e-04 | (3803.86 ms | 137830 tok/s) step 15610/76294 | train loss 3.382849 | norm 0.4431 | lr 2.33e-04 | (3803.90 ms | 137829 tok/s) step 15611/76294 | train loss 3.402754 | norm 0.5089 | lr 2.33e-04 | (3832.03 ms | 136817 tok/s) step 15612/76294 | train loss 3.410755 | norm 0.5301 | lr 2.33e-04 | (3797.89 ms | 138047 tok/s) step 15613/76294 | train loss 3.339703 | norm 0.3614 | lr 2.33e-04 | (3830.18 ms | 136883 tok/s) step 15614/76294 | train loss 3.353049 | norm 0.3365 | lr 2.33e-04 | (3800.76 ms | 137943 tok/s) step 15615/76294 | train loss 3.376253 | norm 0.4597 | lr 2.33e-04 | (3812.95 ms | 137502 tok/s) step 15616/76294 | train loss 3.358694 | norm 0.5206 | lr 2.33e-04 | (3796.57 ms | 138095 tok/s) step 15617/76294 | train loss 3.461344 | norm 0.5026 | lr 2.33e-04 | (3856.61 ms | 135945 tok/s) step 15618/76294 | train loss 3.357364 | norm 0.4429 | lr 2.33e-04 | (3801.91 ms | 137901 tok/s) step 15619/76294 | train loss 3.402423 | norm 0.3690 | lr 2.32e-04 | (3804.46 ms | 137809 tok/s) step 15620/76294 | train loss 3.452437 | norm 0.4054 | lr 2.32e-04 | (3800.00 ms | 137970 tok/s) step 15621/76294 | train loss 3.353607 | norm 0.5450 | lr 2.32e-04 | (3806.93 ms | 137719 tok/s) step 15622/76294 | train loss 3.407999 | norm 0.3003 | lr 2.32e-04 | (3819.54 ms | 137265 tok/s) step 15623/76294 | train loss 3.370048 | norm 0.4446 | lr 2.32e-04 | (3800.51 ms | 137952 tok/s) step 15624/76294 | train loss 3.374161 | norm 0.3607 | lr 2.32e-04 | (3807.11 ms | 137713 tok/s) step 15625/76294 | train loss 3.389842 | norm 0.3281 | lr 2.32e-04 | (3806.45 ms | 137737 tok/s) step 15626/76294 | train loss 3.358009 | norm 0.3183 | lr 2.32e-04 | (3807.69 ms | 137692 tok/s) step 15627/76294 | 
train loss 3.367905 | norm 0.3221 | lr 2.32e-04 | (3804.50 ms | 137807 tok/s) step 15628/76294 | train loss 3.399207 | norm 0.3883 | lr 2.32e-04 | (3843.93 ms | 136394 tok/s) step 15629/76294 | train loss 3.342100 | norm 0.3449 | lr 2.32e-04 | (3801.07 ms | 137932 tok/s) step 15630/76294 | train loss 3.338870 | norm 0.4018 | lr 2.32e-04 | (3828.03 ms | 136960 tok/s) step 15631/76294 | train loss 3.363822 | norm 0.3118 | lr 2.32e-04 | (3800.97 ms | 137935 tok/s) step 15632/76294 | train loss 3.437527 | norm 0.3324 | lr 2.32e-04 | (3808.78 ms | 137653 tok/s) step 15633/76294 | train loss 3.356375 | norm 0.3426 | lr 2.32e-04 | (3802.60 ms | 137876 tok/s) step 15634/76294 | train loss 3.429872 | norm 0.3495 | lr 2.32e-04 | (3825.62 ms | 137046 tok/s) step 15635/76294 | train loss 3.389033 | norm 0.3370 | lr 2.32e-04 | (3807.57 ms | 137696 tok/s) step 15636/76294 | train loss 3.316633 | norm 0.4693 | lr 2.32e-04 | (3941.17 ms | 133028 tok/s) step 15637/76294 | train loss 3.381473 | norm 0.3708 | lr 2.31e-04 | (3821.10 ms | 137209 tok/s) step 15638/76294 | train loss 3.436496 | norm 0.3793 | lr 2.31e-04 | (3805.16 ms | 137783 tok/s) step 15639/76294 | train loss 3.381320 | norm 0.3764 | lr 2.31e-04 | (3806.49 ms | 137735 tok/s) step 15640/76294 | train loss 3.391678 | norm 0.3310 | lr 2.31e-04 | (3808.61 ms | 137659 tok/s) step 15641/76294 | train loss 3.415929 | norm 0.3856 | lr 2.31e-04 | (3824.05 ms | 137103 tok/s) step 15642/76294 | train loss 3.340146 | norm 0.3241 | lr 2.31e-04 | (4701.18 ms | 111523 tok/s) step 15643/76294 | train loss 3.453325 | norm 0.3332 | lr 2.31e-04 | (3870.51 ms | 135457 tok/s) step 15644/76294 | train loss 3.343143 | norm 0.4058 | lr 2.31e-04 | (3794.57 ms | 138168 tok/s) step 15645/76294 | train loss 3.310873 | norm 0.4444 | lr 2.31e-04 | (3796.72 ms | 138090 tok/s) step 15646/76294 | train loss 3.397875 | norm 0.4207 | lr 2.31e-04 | (3816.59 ms | 137371 tok/s) step 15647/76294 | train loss 3.340131 | norm 0.4692 | lr 2.31e-04 | (6938.25 
ms | 75565 tok/s) step 15648/76294 | train loss 3.359442 | norm 0.6053 | lr 2.31e-04 | (3852.33 ms | 136096 tok/s) step 15649/76294 | train loss 3.391939 | norm 0.3517 | lr 2.31e-04 | (3793.87 ms | 138193 tok/s) step 15650/76294 | train loss 3.378545 | norm 0.6130 | lr 2.31e-04 | (3798.93 ms | 138009 tok/s) step 15651/76294 | train loss 3.362254 | norm 0.5639 | lr 2.31e-04 | (3945.93 ms | 132868 tok/s) step 15652/76294 | train loss 3.338221 | norm 0.3175 | lr 2.31e-04 | (3790.25 ms | 138325 tok/s) step 15653/76294 | train loss 3.446975 | norm 0.3770 | lr 2.31e-04 | (3797.78 ms | 138051 tok/s) step 15654/76294 | train loss 3.311060 | norm 0.4598 | lr 2.31e-04 | (3816.86 ms | 137361 tok/s) step 15655/76294 | train loss 3.372497 | norm 0.3779 | lr 2.30e-04 | (3824.31 ms | 137093 tok/s) step 15656/76294 | train loss 3.411286 | norm 0.4770 | lr 2.30e-04 | (3797.19 ms | 138073 tok/s) step 15657/76294 | train loss 3.397329 | norm 0.6177 | lr 2.30e-04 | (3826.01 ms | 137033 tok/s) step 15658/76294 | train loss 3.407482 | norm 0.4923 | lr 2.30e-04 | (3795.22 ms | 138144 tok/s) step 15659/76294 | train loss 3.368605 | norm 0.3723 | lr 2.30e-04 | (3823.65 ms | 137117 tok/s) step 15660/76294 | train loss 3.341058 | norm 2.6382 | lr 2.30e-04 | (3832.27 ms | 136809 tok/s) step 15661/76294 | train loss 3.431609 | norm 0.6635 | lr 2.30e-04 | (3800.71 ms | 137945 tok/s) step 15662/76294 | train loss 3.381702 | norm 0.9759 | lr 2.30e-04 | (3797.64 ms | 138056 tok/s) step 15663/76294 | train loss 3.409729 | norm 0.7192 | lr 2.30e-04 | (3834.23 ms | 136739 tok/s) step 15664/76294 | train loss 3.437005 | norm 0.9886 | lr 2.30e-04 | (3799.66 ms | 137983 tok/s) step 15665/76294 | train loss 3.311355 | norm 0.6502 | lr 2.30e-04 | (3819.31 ms | 137273 tok/s) step 15666/76294 | train loss 3.343102 | norm 0.5828 | lr 2.30e-04 | (3800.67 ms | 137946 tok/s) step 15667/76294 | train loss 3.377248 | norm 0.6960 | lr 2.30e-04 | (3794.98 ms | 138153 tok/s) step 15668/76294 | train loss 3.293845 | 
norm 0.6232 | lr 2.30e-04 | (3824.08 ms | 137102 tok/s) step 15669/76294 | train loss 3.409541 | norm 0.7551 | lr 2.30e-04 | (3800.54 ms | 137951 tok/s) step 15670/76294 | train loss 3.328459 | norm 0.5449 | lr 2.30e-04 | (3804.53 ms | 137806 tok/s) step 15671/76294 | train loss 3.429679 | norm 0.5493 | lr 2.30e-04 | (3825.04 ms | 137067 tok/s) step 15672/76294 | train loss 3.386060 | norm 0.5369 | lr 2.30e-04 | (3802.64 ms | 137875 tok/s) step 15673/76294 | train loss 3.344005 | norm 0.5675 | lr 2.30e-04 | (3804.60 ms | 137804 tok/s) step 15674/76294 | train loss 3.336260 | norm 0.5441 | lr 2.29e-04 | (3799.00 ms | 138007 tok/s) step 15675/76294 | train loss 3.437593 | norm 0.4835 | lr 2.29e-04 | (3893.16 ms | 134669 tok/s) step 15676/76294 | train loss 3.401626 | norm 0.5045 | lr 2.29e-04 | (3798.09 ms | 138040 tok/s) step 15677/76294 | train loss 3.480350 | norm 0.6140 | lr 2.29e-04 | (3856.80 ms | 135939 tok/s) step 15678/76294 | train loss 3.402204 | norm 0.4549 | lr 2.29e-04 | (3802.36 ms | 137885 tok/s) step 15679/76294 | train loss 3.386027 | norm 0.6248 | lr 2.29e-04 | (3843.21 ms | 136419 tok/s) step 15680/76294 | train loss 3.401603 | norm 0.4540 | lr 2.29e-04 | (3802.80 ms | 137869 tok/s) step 15681/76294 | train loss 3.365563 | norm 0.4142 | lr 2.29e-04 | (3806.95 ms | 137719 tok/s) step 15682/76294 | train loss 3.410366 | norm 0.4188 | lr 2.29e-04 | (3823.29 ms | 137130 tok/s) step 15683/76294 | train loss 3.388491 | norm 0.4868 | lr 2.29e-04 | (3810.72 ms | 137583 tok/s) step 15684/76294 | train loss 3.395125 | norm 0.4476 | lr 2.29e-04 | (3804.30 ms | 137815 tok/s) step 15685/76294 | train loss 3.440656 | norm 0.7900 | lr 2.29e-04 | (3831.69 ms | 136830 tok/s) step 15686/76294 | train loss 3.437212 | norm 0.4543 | lr 2.29e-04 | (3803.55 ms | 137842 tok/s) step 15687/76294 | train loss 3.351413 | norm 0.5218 | lr 2.29e-04 | (3807.74 ms | 137690 tok/s) step 15688/76294 | train loss 3.386770 | norm 0.5540 | lr 2.29e-04 | (3827.28 ms | 136987 tok/s) 
step 15689/76294 | train loss 3.377327 | norm 0.6218 | lr 2.29e-04 | (3808.48 ms | 137663 tok/s) step 15690/76294 | train loss 3.466669 | norm 0.4757 | lr 2.29e-04 | (3803.87 ms | 137830 tok/s) step 15691/76294 | train loss 3.388531 | norm 0.4335 | lr 2.29e-04 | (3837.20 ms | 136633 tok/s) step 15692/76294 | train loss 3.383277 | norm 0.4826 | lr 2.28e-04 | (3802.01 ms | 137898 tok/s) step 15693/76294 | train loss 3.345912 | norm 0.5181 | lr 2.28e-04 | (3829.93 ms | 136892 tok/s) step 15694/76294 | train loss 3.385957 | norm 0.4626 | lr 2.28e-04 | (3804.35 ms | 137813 tok/s) step 15695/76294 | train loss 3.351846 | norm 0.6032 | lr 2.28e-04 | (3806.82 ms | 137723 tok/s) step 15696/76294 | train loss 3.399825 | norm 0.4406 | lr 2.28e-04 | (3827.61 ms | 136975 tok/s) step 15697/76294 | train loss 3.410831 | norm 0.4328 | lr 2.28e-04 | (3805.67 ms | 137765 tok/s) step 15698/76294 | train loss 3.385779 | norm 0.3732 | lr 2.28e-04 | (3919.14 ms | 133776 tok/s) step 15699/76294 | train loss 3.358667 | norm 0.4214 | lr 2.28e-04 | (3812.04 ms | 137535 tok/s) step 15700/76294 | train loss 3.366439 | norm 0.3529 | lr 2.28e-04 | (3864.90 ms | 135654 tok/s) step 15701/76294 | train loss 3.362374 | norm 0.4150 | lr 2.28e-04 | (3801.78 ms | 137906 tok/s) step 15702/76294 | train loss 3.383120 | norm 0.3706 | lr 2.28e-04 | (3804.85 ms | 137795 tok/s) step 15703/76294 | train loss 3.442527 | norm 0.5163 | lr 2.28e-04 | (3827.10 ms | 136993 tok/s) step 15704/76294 | train loss 3.350032 | norm 0.4712 | lr 2.28e-04 | (3805.63 ms | 137766 tok/s) step 15705/76294 | train loss 3.379376 | norm 0.4293 | lr 2.28e-04 | (3812.70 ms | 137511 tok/s) step 15706/76294 | train loss 3.456375 | norm 0.4971 | lr 2.28e-04 | (3808.05 ms | 137679 tok/s) step 15707/76294 | train loss 3.420846 | norm 0.4052 | lr 2.28e-04 | (3811.19 ms | 137565 tok/s) step 15708/76294 | train loss 3.370358 | norm 0.5183 | lr 2.28e-04 | (3805.63 ms | 137766 tok/s) step 15709/76294 | train loss 3.374381 | norm 0.4045 | lr 
2.28e-04 | (3809.23 ms | 137636 tok/s) step 15710/76294 | train loss 3.347370 | norm 0.3502 | lr 2.28e-04 | (3807.32 ms | 137705 tok/s) step 15711/76294 | train loss 3.361099 | norm 0.4832 | lr 2.27e-04 | (3802.30 ms | 137887 tok/s) step 15712/76294 | train loss 3.305461 | norm 0.4970 | lr 2.27e-04 | (3828.89 ms | 136930 tok/s) step 15713/76294 | train loss 3.453260 | norm 0.3363 | lr 2.27e-04 | (3802.36 ms | 137885 tok/s) step 15714/76294 | train loss 3.432941 | norm 0.3641 | lr 2.27e-04 | (3806.94 ms | 137719 tok/s) step 15715/76294 | train loss 3.379684 | norm 0.3793 | lr 2.27e-04 | (3823.21 ms | 137133 tok/s) step 15716/76294 | train loss 3.439986 | norm 0.4371 | lr 2.27e-04 | (3815.44 ms | 137412 tok/s) step 15717/76294 | train loss 3.330897 | norm 0.5038 | lr 2.27e-04 | (3827.77 ms | 136970 tok/s) step 15718/76294 | train loss 3.463928 | norm 0.4636 | lr 2.27e-04 | (3858.42 ms | 135882 tok/s) step 15719/76294 | train loss 3.428038 | norm 0.4016 | lr 2.27e-04 | (3833.52 ms | 136764 tok/s) step 15720/76294 | train loss 3.349827 | norm 0.6212 | lr 2.27e-04 | (3806.39 ms | 137739 tok/s) step 15721/76294 | train loss 3.419362 | norm 0.3673 | lr 2.27e-04 | (3802.74 ms | 137871 tok/s) step 15722/76294 | train loss 3.405065 | norm 0.5268 | lr 2.27e-04 | (3832.02 ms | 136818 tok/s) step 15723/76294 | train loss 3.340029 | norm 0.4156 | lr 2.27e-04 | (3802.16 ms | 137892 tok/s) step 15724/76294 | train loss 3.401501 | norm 0.6355 | lr 2.27e-04 | (3806.19 ms | 137746 tok/s) step 15725/76294 | train loss 3.465834 | norm 0.4925 | lr 2.27e-04 | (3961.47 ms | 132347 tok/s) step 15726/76294 | train loss 3.385878 | norm 0.5366 | lr 2.27e-04 | (3799.19 ms | 138000 tok/s) step 15727/76294 | train loss 3.335585 | norm 0.4576 | lr 2.27e-04 | (3846.72 ms | 136295 tok/s) step 15728/76294 | train loss 3.309203 | norm 0.3797 | lr 2.27e-04 | (3804.68 ms | 137801 tok/s) step 15729/76294 | train loss 3.390261 | norm 0.4561 | lr 2.26e-04 | (3807.72 ms | 137691 tok/s) step 15730/76294 | 
train loss 3.383818 | norm 0.4768 | lr 2.26e-04 | (3819.02 ms | 137283 tok/s) step 15731/76294 | train loss 3.375422 | norm 0.3759 | lr 2.26e-04 | (3803.47 ms | 137845 tok/s) step 15732/76294 | train loss 3.433309 | norm 0.5473 | lr 2.26e-04 | (3824.23 ms | 137096 tok/s) step 15733/76294 | train loss 3.351698 | norm 0.5062 | lr 2.26e-04 | (3804.10 ms | 137822 tok/s) step 15734/76294 | train loss 3.365033 | norm 0.4361 | lr 2.26e-04 | (3821.55 ms | 137193 tok/s) step 15735/76294 | train loss 3.384449 | norm 0.4063 | lr 2.26e-04 | (3822.25 ms | 137168 tok/s) step 15736/76294 | train loss 3.370825 | norm 0.4200 | lr 2.26e-04 | (3800.54 ms | 137951 tok/s) step 15737/76294 | train loss 3.387760 | norm 0.3659 | lr 2.26e-04 | (3833.57 ms | 136762 tok/s) step 15738/76294 | train loss 3.397782 | norm 0.4271 | lr 2.26e-04 | (3803.49 ms | 137844 tok/s) step 15739/76294 | train loss 3.399316 | norm 0.3800 | lr 2.26e-04 | (3827.02 ms | 136997 tok/s) step 15740/76294 | train loss 3.358394 | norm 0.3781 | lr 2.26e-04 | (3800.79 ms | 137942 tok/s) step 15741/76294 | train loss 3.396019 | norm 0.3593 | lr 2.26e-04 | (3804.42 ms | 137810 tok/s) step 15742/76294 | train loss 3.354138 | norm 0.3446 | lr 2.26e-04 | (3828.39 ms | 136947 tok/s) step 15743/76294 | train loss 3.369709 | norm 0.3655 | lr 2.26e-04 | (3839.07 ms | 136566 tok/s) step 15744/76294 | train loss 3.386689 | norm 0.3441 | lr 2.26e-04 | (3805.50 ms | 137771 tok/s) step 15745/76294 | train loss 3.344059 | norm 0.3344 | lr 2.26e-04 | (3814.58 ms | 137443 tok/s) step 15746/76294 | train loss 3.372154 | norm 0.3403 | lr 2.26e-04 | (4250.71 ms | 123341 tok/s) step 15747/76294 | train loss 3.416989 | norm 0.3108 | lr 2.26e-04 | (3804.53 ms | 137806 tok/s) step 15748/76294 | train loss 3.323612 | norm 0.2768 | lr 2.25e-04 | (3821.94 ms | 137179 tok/s) step 15749/76294 | train loss 3.411570 | norm 0.3460 | lr 2.25e-04 | (3877.29 ms | 135220 tok/s) step 15750/76294 | train loss 3.287415 | norm 0.2819 | lr 2.25e-04 | (3801.40 
ms | 137920 tok/s) val loss: 3.366854 saving model checkpoint to ./results/gpt2-124M-gqa/step_15750.pth step 15751/76294 | train loss 3.430092 | norm 0.3027 | lr 2.25e-04 | (3843.75 ms | 136400 tok/s) step 15752/76294 | train loss 3.299394 | norm 0.4102 | lr 2.25e-04 | (3794.76 ms | 138161 tok/s) step 15753/76294 | train loss 3.379114 | norm 0.4212 | lr 2.25e-04 | (3809.02 ms | 137644 tok/s) step 15754/76294 | train loss 3.343727 | norm 0.4346 | lr 2.25e-04 | (3825.43 ms | 137053 tok/s) step 15755/76294 | train loss 3.365844 | norm 0.3348 | lr 2.25e-04 | (3800.27 ms | 137961 tok/s) step 15756/76294 | train loss 3.307056 | norm 0.3309 | lr 2.25e-04 | (3814.76 ms | 137437 tok/s) step 15757/76294 | train loss 3.337069 | norm 0.3236 | lr 2.25e-04 | (3823.79 ms | 137112 tok/s) step 15758/76294 | train loss 3.301577 | norm 0.3935 | lr 2.25e-04 | (3803.68 ms | 137837 tok/s) step 15759/76294 | train loss 3.404993 | norm 0.3319 | lr 2.25e-04 | (3900.97 ms | 134399 tok/s) step 15760/76294 | train loss 3.323415 | norm 0.4385 | lr 2.25e-04 | (3774.47 ms | 138904 tok/s) step 15761/76294 | train loss 3.349099 | norm 0.4743 | lr 2.25e-04 | (3804.68 ms | 137801 tok/s) step 15762/76294 | train loss 3.341315 | norm 0.4012 | lr 2.25e-04 | (3808.17 ms | 137675 tok/s) step 15763/76294 | train loss 3.348802 | norm 0.3500 | lr 2.25e-04 | (3782.97 ms | 138592 tok/s) step 15764/76294 | train loss 3.386613 | norm 0.2977 | lr 2.25e-04 | (3785.46 ms | 138501 tok/s) step 15765/76294 | train loss 3.303491 | norm 0.4156 | lr 2.25e-04 | (3822.19 ms | 137169 tok/s) step 15766/76294 | train loss 3.296808 | norm 0.3337 | lr 2.25e-04 | (3789.78 ms | 138343 tok/s) step 15767/76294 | train loss 3.259838 | norm 0.4328 | lr 2.24e-04 | (3799.87 ms | 137975 tok/s) step 15768/76294 | train loss 3.378649 | norm 0.3648 | lr 2.24e-04 | (3819.06 ms | 137282 tok/s) step 15769/76294 | train loss 3.319294 | norm 0.3614 | lr 2.24e-04 | (3803.07 ms | 137859 tok/s) step 15770/76294 | train loss 3.361189 | norm 0.3425 
| lr 2.24e-04 | (3852.22 ms | 136100 tok/s) step 15771/76294 | train loss 3.364917 | norm 0.4546 | lr 2.24e-04 | (3801.29 ms | 137924 tok/s) step 15772/76294 | train loss 3.374113 | norm 0.4266 | lr 2.24e-04 | (3923.23 ms | 133637 tok/s) step 15773/76294 | train loss 3.337670 | norm 0.4400 | lr 2.24e-04 | (3793.82 ms | 138195 tok/s) step 15774/76294 | train loss 3.397447 | norm 0.3649 | lr 2.24e-04 | (3825.22 ms | 137061 tok/s) step 15775/76294 | train loss 3.350938 | norm 0.3081 | lr 2.24e-04 | (3797.22 ms | 138071 tok/s) step 15776/76294 | train loss 3.375603 | norm 0.3348 | lr 2.24e-04 | (3816.28 ms | 137382 tok/s) step 15777/76294 | train loss 3.345796 | norm 0.3893 | lr 2.24e-04 | (3798.81 ms | 138014 tok/s) step 15778/76294 | train loss 3.307069 | norm 0.3519 | lr 2.24e-04 | (3802.76 ms | 137870 tok/s) step 15779/76294 | train loss 3.331344 | norm 0.3242 | lr 2.24e-04 | (3821.56 ms | 137192 tok/s) step 15780/76294 | train loss 3.372080 | norm 0.6514 | lr 2.24e-04 | (3800.70 ms | 137945 tok/s) step 15781/76294 | train loss 3.377669 | norm 0.4416 | lr 2.24e-04 | (3791.46 ms | 138281 tok/s) step 15782/76294 | train loss 3.339730 | norm 0.6996 | lr 2.24e-04 | (3870.54 ms | 135456 tok/s) step 15783/76294 | train loss 3.345392 | norm 0.3844 | lr 2.24e-04 | (3792.74 ms | 138235 tok/s) step 15784/76294 | train loss 3.355507 | norm 0.5422 | lr 2.24e-04 | (3799.52 ms | 137988 tok/s) step 15785/76294 | train loss 3.371908 | norm 0.7018 | lr 2.24e-04 | (3816.71 ms | 137366 tok/s) step 15786/76294 | train loss 3.326865 | norm 0.5039 | lr 2.23e-04 | (3797.47 ms | 138062 tok/s) step 15787/76294 | train loss 3.375236 | norm 0.5126 | lr 2.23e-04 | (3794.25 ms | 138180 tok/s) step 15788/76294 | train loss 3.360013 | norm 0.5792 | lr 2.23e-04 | (3857.66 ms | 135908 tok/s) step 15789/76294 | train loss 3.365949 | norm 0.4202 | lr 2.23e-04 | (3798.98 ms | 138008 tok/s) step 15790/76294 | train loss 3.292819 | norm 0.3437 | lr 2.23e-04 | (3829.50 ms | 136908 tok/s) step 
15791/76294 | train loss 3.398158 | norm 0.3617 | lr 2.23e-04 | (3824.10 ms | 137101 tok/s) step 15792/76294 | train loss 3.293945 | norm 0.4625 | lr 2.23e-04 | (3798.42 ms | 138028 tok/s) step 15793/76294 | train loss 3.380243 | norm 0.5528 | lr 2.23e-04 | (3808.20 ms | 137674 tok/s) step 15794/76294 | train loss 3.388380 | norm 0.5106 | lr 2.23e-04 | (3798.62 ms | 138021 tok/s) step 15795/76294 | train loss 3.381384 | norm 0.3892 | lr 2.23e-04 | (3837.48 ms | 136623 tok/s) step 15796/76294 | train loss 3.377603 | norm 0.3447 | lr 2.23e-04 | (3818.58 ms | 137299 tok/s) step 15797/76294 | train loss 3.346831 | norm 0.4689 | lr 2.23e-04 | (3857.60 ms | 135911 tok/s) step 15798/76294 | train loss 3.412032 | norm 0.4186 | lr 2.23e-04 | (3794.40 ms | 138174 tok/s) step 15799/76294 | train loss 3.322401 | norm 0.3671 | lr 2.23e-04 | (3804.27 ms | 137816 tok/s) step 15800/76294 | train loss 3.366523 | norm 0.5581 | lr 2.23e-04 | (3801.12 ms | 137930 tok/s) step 15801/76294 | train loss 3.313627 | norm 0.4365 | lr 2.23e-04 | (3801.28 ms | 137924 tok/s) step 15802/76294 | train loss 3.468836 | norm 0.4018 | lr 2.23e-04 | (3818.41 ms | 137305 tok/s) step 15803/76294 | train loss 3.350303 | norm 0.3819 | lr 2.23e-04 | (3817.16 ms | 137350 tok/s) step 15804/76294 | train loss 3.421050 | norm 0.3997 | lr 2.22e-04 | (3800.53 ms | 137951 tok/s) step 15805/76294 | train loss 3.392504 | norm 0.4509 | lr 2.22e-04 | (3820.77 ms | 137220 tok/s) step 15806/76294 | train loss 3.417778 | norm 0.5920 | lr 2.22e-04 | (3797.55 ms | 138060 tok/s) step 15807/76294 | train loss 3.325598 | norm 0.4096 | lr 2.22e-04 | (3808.62 ms | 137658 tok/s) step 15808/76294 | train loss 3.405933 | norm 0.3854 | lr 2.22e-04 | (3799.60 ms | 137985 tok/s) step 15809/76294 | train loss 3.335254 | norm 0.3931 | lr 2.22e-04 | (3842.03 ms | 136461 tok/s) step 15810/76294 | train loss 3.375694 | norm 0.5391 | lr 2.22e-04 | (3798.50 ms | 138025 tok/s) step 15811/76294 | train loss 3.358800 | norm 0.4275 | lr 
2.22e-04 | (3824.53 ms | 137086 tok/s) step 15812/76294 | train loss 3.331990 | norm 0.5094 | lr 2.22e-04 | (3837.63 ms | 136618 tok/s) step 15813/76294 | train loss 3.401795 | norm 0.4537 | lr 2.22e-04 | (3804.60 ms | 137804 tok/s) step 15814/76294 | train loss 3.321671 | norm 0.5012 | lr 2.22e-04 | (3822.54 ms | 137157 tok/s) step 15815/76294 | train loss 3.426671 | norm 0.6466 | lr 2.22e-04 | (3806.56 ms | 137733 tok/s) step 15816/76294 | train loss 3.318399 | norm 0.3718 | lr 2.22e-04 | (3799.42 ms | 137991 tok/s) step 15817/76294 | train loss 3.322622 | norm 0.3880 | lr 2.22e-04 | (3812.86 ms | 137505 tok/s) step 15818/76294 | train loss 3.313353 | norm 0.3791 | lr 2.22e-04 | (3809.19 ms | 137638 tok/s) step 15819/76294 | train loss 3.342279 | norm 0.4204 | lr 2.22e-04 | (3811.94 ms | 137538 tok/s) step 15820/76294 | train loss 3.334090 | norm 0.6447 | lr 2.22e-04 | (3811.01 ms | 137572 tok/s) step 15821/76294 | train loss 3.391747 | norm 0.4089 | lr 2.22e-04 | (3802.46 ms | 137881 tok/s) step 15822/76294 | train loss 3.335127 | norm 0.4892 | lr 2.22e-04 | (3889.71 ms | 134788 tok/s) step 15823/76294 | train loss 3.362151 | norm 0.4261 | lr 2.21e-04 | (3800.97 ms | 137935 tok/s) step 15824/76294 | train loss 3.315735 | norm 0.4034 | lr 2.21e-04 | (4336.68 ms | 120896 tok/s) step 15825/76294 | train loss 3.389511 | norm 0.5305 | lr 2.21e-04 | (3827.30 ms | 136987 tok/s) step 15826/76294 | train loss 3.410746 | norm 0.3988 | lr 2.21e-04 | (3803.85 ms | 137831 tok/s) step 15827/76294 | train loss 3.394255 | norm 0.4829 | lr 2.21e-04 | (3833.11 ms | 136779 tok/s) step 15828/76294 | train loss 3.269329 | norm 0.3850 | lr 2.21e-04 | (3801.57 ms | 137913 tok/s) step 15829/76294 | train loss 3.456431 | norm 0.4555 | lr 2.21e-04 | (3803.12 ms | 137857 tok/s) step 15830/76294 | train loss 3.367350 | norm 0.7261 | lr 2.21e-04 | (3819.28 ms | 137274 tok/s) step 15831/76294 | train loss 3.425967 | norm 0.5736 | lr 2.21e-04 | (3803.16 ms | 137856 tok/s) step 15832/76294 | 
train loss 3.286067 | norm 0.5330 | lr 2.21e-04 | (3821.88 ms | 137181 tok/s) step 15833/76294 | train loss 3.324259 | norm 0.3936 | lr 2.21e-04 | (3802.16 ms | 137892 tok/s) step 15834/76294 | train loss 3.347385 | norm 0.4191 | lr 2.21e-04 | (3808.77 ms | 137653 tok/s) step 15835/76294 | train loss 3.366523 | norm 0.4126 | lr 2.21e-04 | (3804.18 ms | 137819 tok/s) step 15836/76294 | train loss 3.325867 | norm 0.6489 | lr 2.21e-04 | (3811.26 ms | 137563 tok/s) step 15837/76294 | train loss 3.392231 | norm 0.5625 | lr 2.21e-04 | (3816.07 ms | 137390 tok/s) step 15838/76294 | train loss 3.344375 | norm 0.4754 | lr 2.21e-04 | (3827.56 ms | 136977 tok/s) step 15839/76294 | train loss 3.411367 | norm 0.4747 | lr 2.21e-04 | (3806.44 ms | 137737 tok/s) step 15840/76294 | train loss 3.396977 | norm 0.5893 | lr 2.21e-04 | (3800.17 ms | 137964 tok/s) step 15841/76294 | train loss 3.315440 | norm 0.4724 | lr 2.21e-04 | (3830.85 ms | 136859 tok/s) step 15842/76294 | train loss 3.340459 | norm 0.4776 | lr 2.21e-04 | (3801.37 ms | 137921 tok/s) step 15843/76294 | train loss 3.319135 | norm 0.3478 | lr 2.20e-04 | (3806.45 ms | 137737 tok/s) step 15844/76294 | train loss 3.377894 | norm 0.3929 | lr 2.20e-04 | (3824.29 ms | 137094 tok/s) step 15845/76294 | train loss 3.364472 | norm 0.5232 | lr 2.20e-04 | (3807.78 ms | 137689 tok/s) step 15846/76294 | train loss 3.407777 | norm 0.4124 | lr 2.20e-04 | (3802.65 ms | 137875 tok/s) step 15847/76294 | train loss 3.345079 | norm 0.4604 | lr 2.20e-04 | (3826.74 ms | 137006 tok/s) step 15848/76294 | train loss 3.417748 | norm 0.5832 | lr 2.20e-04 | (3803.92 ms | 137828 tok/s) step 15849/76294 | train loss 3.296108 | norm 0.6047 | lr 2.20e-04 | (3820.18 ms | 137242 tok/s) step 15850/76294 | train loss 3.442894 | norm 0.4277 | lr 2.20e-04 | (3803.37 ms | 137848 tok/s) step 15851/76294 | train loss 3.265564 | norm 0.5285 | lr 2.20e-04 | (3884.01 ms | 134986 tok/s) step 15852/76294 | train loss 3.377671 | norm 0.3951 | lr 2.20e-04 | (7120.08 
ms | 73635 tok/s) step 15853/76294 | train loss 3.357756 | norm 0.4290 | lr 2.20e-04 | (3867.89 ms | 135549 tok/s) step 15854/76294 | train loss 3.374241 | norm 0.3895 | lr 2.20e-04 | (3806.46 ms | 137736 tok/s) step 15855/76294 | train loss 3.368562 | norm 0.4242 | lr 2.20e-04 | (3797.35 ms | 138067 tok/s) step 15856/76294 | train loss 3.334010 | norm 0.6042 | lr 2.20e-04 | (3805.44 ms | 137773 tok/s) step 15857/76294 | train loss 3.300206 | norm 0.4627 | lr 2.20e-04 | (3832.63 ms | 136796 tok/s) step 15858/76294 | train loss 3.395130 | norm 0.6071 | lr 2.20e-04 | (3842.08 ms | 136459 tok/s) step 15859/76294 | train loss 3.338414 | norm 0.4055 | lr 2.20e-04 | (3794.55 ms | 138169 tok/s) step 15860/76294 | train loss 3.366342 | norm 0.4616 | lr 2.20e-04 | (3802.13 ms | 137893 tok/s) step 15861/76294 | train loss 3.311498 | norm 0.4081 | lr 2.20e-04 | (3816.12 ms | 137388 tok/s) step 15862/76294 | train loss 3.342929 | norm 0.4167 | lr 2.19e-04 | (3797.63 ms | 138057 tok/s) step 15863/76294 | train loss 3.333421 | norm 0.5445 | lr 2.19e-04 | (3819.52 ms | 137266 tok/s) step 15864/76294 | train loss 3.418790 | norm 0.5299 | lr 2.19e-04 | (3800.12 ms | 137966 tok/s) step 15865/76294 | train loss 3.331117 | norm 0.4740 | lr 2.19e-04 | (3796.99 ms | 138080 tok/s) step 15866/76294 | train loss 3.394616 | norm 0.4980 | lr 2.19e-04 | (3826.38 ms | 137019 tok/s) step 15867/76294 | train loss 3.399474 | norm 0.9097 | lr 2.19e-04 | (3797.64 ms | 138056 tok/s) step 15868/76294 | train loss 3.320745 | norm 1.2562 | lr 2.19e-04 | (3802.95 ms | 137863 tok/s) step 15869/76294 | train loss 3.417056 | norm 0.4273 | lr 2.19e-04 | (3820.14 ms | 137243 tok/s) step 15870/76294 | train loss 3.347473 | norm 0.5507 | lr 2.19e-04 | (3803.37 ms | 137848 tok/s) step 15871/76294 | train loss 3.355660 | norm 0.4566 | lr 2.19e-04 | (3805.11 ms | 137785 tok/s) step 15872/76294 | train loss 3.355523 | norm 0.4814 | lr 2.19e-04 | (3877.52 ms | 135212 tok/s) step 15873/76294 | train loss 3.338796 | 
norm 0.4024 | lr 2.19e-04 | (3801.61 ms | 137912 tok/s) step 15874/76294 | train loss 3.374559 | norm 0.5725 | lr 2.19e-04 | (3875.65 ms | 135277 tok/s) step 15875/76294 | train loss 3.433125 | norm 0.4300 | lr 2.19e-04 | (3800.48 ms | 137953 tok/s) step 15876/76294 | train loss 3.344303 | norm 0.4029 | lr 2.19e-04 | (3807.53 ms | 137698 tok/s) step 15877/76294 | train loss 3.409756 | norm 0.4918 | lr 2.19e-04 | (3817.20 ms | 137349 tok/s) step 15878/76294 | train loss 3.347204 | norm 0.4599 | lr 2.19e-04 | (3824.71 ms | 137079 tok/s) step 15879/76294 | train loss 3.392094 | norm 0.4179 | lr 2.19e-04 | (3806.58 ms | 137732 tok/s) step 15880/76294 | train loss 3.391030 | norm 0.4222 | lr 2.19e-04 | (3805.81 ms | 137760 tok/s) step 15881/76294 | train loss 3.359844 | norm 0.4290 | lr 2.18e-04 | (3820.56 ms | 137228 tok/s) step 15882/76294 | train loss 3.253086 | norm 0.5536 | lr 2.18e-04 | (3812.87 ms | 137505 tok/s) step 15883/76294 | train loss 3.454125 | norm 0.4099 | lr 2.18e-04 | (3799.05 ms | 138005 tok/s) step 15884/76294 | train loss 3.416435 | norm 0.6071 | lr 2.18e-04 | (3852.25 ms | 136099 tok/s) step 15885/76294 | train loss 3.375926 | norm 0.4798 | lr 2.18e-04 | (3806.20 ms | 137746 tok/s) step 15886/76294 | train loss 3.364746 | norm 0.4314 | lr 2.18e-04 | (3804.24 ms | 137817 tok/s) step 15887/76294 | train loss 3.447844 | norm 0.3663 | lr 2.18e-04 | (3821.06 ms | 137210 tok/s) step 15888/76294 | train loss 3.346387 | norm 0.4237 | lr 2.18e-04 | (3805.69 ms | 137764 tok/s) step 15889/76294 | train loss 3.545048 | norm 0.3892 | lr 2.18e-04 | (3830.59 ms | 136869 tok/s) step 15890/76294 | train loss 3.373138 | norm 0.4699 | lr 2.18e-04 | (5612.31 ms | 93418 tok/s) step 15891/76294 | train loss 3.468613 | norm 0.4372 | lr 2.18e-04 | (3810.49 ms | 137591 tok/s) step 15892/76294 | train loss 3.312033 | norm 0.4082 | lr 2.18e-04 | (3825.68 ms | 137044 tok/s) step 15893/76294 | train loss 3.358297 | norm 0.5711 | lr 2.18e-04 | (3802.79 ms | 137869 tok/s) step 
15894/76294 | train loss 3.328341 | norm 0.3947 | lr 2.18e-04 | (3805.75 ms | 137762 tok/s) step 15895/76294 | train loss 3.386593 | norm 0.5074 | lr 2.18e-04 | (3828.02 ms | 136961 tok/s) step 15896/76294 | train loss 3.377942 | norm 0.4628 | lr 2.18e-04 | (3832.28 ms | 136808 tok/s) step 15897/76294 | train loss 3.504875 | norm 0.4255 | lr 2.18e-04 | (3810.21 ms | 137601 tok/s) step 15898/76294 | train loss 3.342712 | norm 0.3635 | lr 2.18e-04 | (3805.65 ms | 137766 tok/s) step 15899/76294 | train loss 3.481914 | norm 0.3847 | lr 2.18e-04 | (3808.58 ms | 137660 tok/s) step 15900/76294 | train loss 3.387209 | norm 0.3731 | lr 2.17e-04 | (3811.88 ms | 137540 tok/s) step 15901/76294 | train loss 3.312952 | norm 0.4597 | lr 2.17e-04 | (3852.16 ms | 136102 tok/s) step 15902/76294 | train loss 3.331652 | norm 0.3578 | lr 2.17e-04 | (3801.48 ms | 137917 tok/s) step 15903/76294 | train loss 3.324053 | norm 0.3757 | lr 2.17e-04 | (3830.76 ms | 136863 tok/s) step 15904/76294 | train loss 3.379140 | norm 0.3432 | lr 2.17e-04 | (3804.59 ms | 137804 tok/s) step 15905/76294 | train loss 3.331889 | norm 0.3574 | lr 2.17e-04 | (3866.60 ms | 135594 tok/s) step 15906/76294 | train loss 3.381485 | norm 0.3472 | lr 2.17e-04 | (3803.38 ms | 137848 tok/s) step 15907/76294 | train loss 3.387623 | norm 0.3311 | lr 2.17e-04 | (3805.56 ms | 137769 tok/s) step 15908/76294 | train loss 3.383137 | norm 0.3862 | lr 2.17e-04 | (3915.57 ms | 133898 tok/s) step 15909/76294 | train loss 3.351791 | norm 0.4526 | lr 2.17e-04 | (3806.69 ms | 137728 tok/s) step 15910/76294 | train loss 3.347247 | norm 0.5613 | lr 2.17e-04 | (3824.01 ms | 137104 tok/s) step 15911/76294 | train loss 3.347318 | norm 0.4123 | lr 2.17e-04 | (3806.52 ms | 137734 tok/s) step 15912/76294 | train loss 3.361885 | norm 0.3439 | lr 2.17e-04 | (3831.36 ms | 136841 tok/s) step 15913/76294 | train loss 3.303076 | norm 0.5689 | lr 2.17e-04 | (3813.30 ms | 137489 tok/s) step 15914/76294 | train loss 3.346335 | norm 0.3694 | lr 
2.17e-04 | (3810.69 ms | 137583 tok/s) step 15915/76294 | train loss 3.405945 | norm 0.3621 | lr 2.17e-04 | (3808.93 ms | 137647 tok/s) step 15916/76294 | train loss 3.389479 | norm 0.3943 | lr 2.17e-04 | (3810.98 ms | 137573 tok/s) step 15917/76294 | train loss 3.328319 | norm 0.4704 | lr 2.17e-04 | (3831.79 ms | 136826 tok/s) step 15918/76294 | train loss 3.443627 | norm 0.4736 | lr 2.17e-04 | (3813.48 ms | 137483 tok/s) step 15919/76294 | train loss 3.274026 | norm 0.8154 | lr 2.17e-04 | (3806.40 ms | 137739 tok/s) step 15920/76294 | train loss 3.418799 | norm 0.5179 | lr 2.16e-04 | (3808.35 ms | 137668 tok/s) step 15921/76294 | train loss 3.386729 | norm 0.6199 | lr 2.16e-04 | (3807.32 ms | 137705 tok/s) step 15922/76294 | train loss 3.413497 | norm 0.3829 | lr 2.16e-04 | (3930.13 ms | 133402 tok/s) step 15923/76294 | train loss 3.332272 | norm 0.4654 | lr 2.16e-04 | (3805.77 ms | 137761 tok/s) step 15924/76294 | train loss 3.386611 | norm 0.3539 | lr 2.16e-04 | (3805.83 ms | 137759 tok/s) step 15925/76294 | train loss 3.339585 | norm 0.5112 | lr 2.16e-04 | (3822.52 ms | 137158 tok/s) step 15926/76294 | train loss 3.345013 | norm 0.3492 | lr 2.16e-04 | (3803.14 ms | 137857 tok/s) step 15927/76294 | train loss 3.300579 | norm 0.5005 | lr 2.16e-04 | (3807.90 ms | 137684 tok/s) step 15928/76294 | train loss 3.325923 | norm 0.6103 | lr 2.16e-04 | (3803.46 ms | 137845 tok/s) step 15929/76294 | train loss 3.359864 | norm 0.4378 | lr 2.16e-04 | (3812.74 ms | 137509 tok/s) step 15930/76294 | train loss 3.376701 | norm 0.4309 | lr 2.16e-04 | (3803.80 ms | 137833 tok/s) step 15931/76294 | train loss 3.259896 | norm 0.4338 | lr 2.16e-04 | (3814.94 ms | 137430 tok/s) step 15932/76294 | train loss 3.431233 | norm 0.4981 | lr 2.16e-04 | (3803.32 ms | 137850 tok/s) step 15933/76294 | train loss 3.309092 | norm 0.5164 | lr 2.16e-04 | (3800.49 ms | 137953 tok/s) step 15934/76294 | train loss 3.379542 | norm 0.4146 | lr 2.16e-04 | (3866.85 ms | 135585 tok/s) step 15935/76294 | 
train loss 3.326843 | norm 0.5247 | lr 2.16e-04 | (3800.13 ms | 137966 tok/s) step 15936/76294 | train loss 3.376500 | norm 0.4043 | lr 2.16e-04 | (3806.60 ms | 137731 tok/s) step 15937/76294 | train loss 3.383172 | norm 0.3841 | lr 2.16e-04 | (4188.37 ms | 125177 tok/s) step 15938/76294 | train loss 3.360800 | norm 0.3328 | lr 2.16e-04 | (3804.84 ms | 137795 tok/s) step 15939/76294 | train loss 3.338839 | norm 0.3491 | lr 2.15e-04 | (3804.07 ms | 137823 tok/s) step 15940/76294 | train loss 3.425247 | norm 0.3275 | lr 2.15e-04 | (3844.24 ms | 136383 tok/s) step 15941/76294 | train loss 3.419816 | norm 0.4322 | lr 2.15e-04 | (3800.35 ms | 137958 tok/s) step 15942/76294 | train loss 3.376087 | norm 0.3657 | lr 2.15e-04 | (3805.80 ms | 137760 tok/s) step 15943/76294 | train loss 3.415438 | norm 0.3535 | lr 2.15e-04 | (3822.80 ms | 137148 tok/s) step 15944/76294 | train loss 3.381316 | norm 0.3678 | lr 2.15e-04 | (3803.32 ms | 137850 tok/s) step 15945/76294 | train loss 3.347481 | norm 0.3929 | lr 2.15e-04 | (3809.03 ms | 137643 tok/s) step 15946/76294 | train loss 3.377687 | norm 0.4530 | lr 2.15e-04 | (3802.80 ms | 137869 tok/s) step 15947/76294 | train loss 3.452914 | norm 0.3339 | lr 2.15e-04 | (3808.83 ms | 137651 tok/s) step 15948/76294 | train loss 3.307108 | norm 0.4936 | lr 2.15e-04 | (3870.05 ms | 135473 tok/s) step 15949/76294 | train loss 3.394336 | norm 0.4622 | lr 2.15e-04 | (3802.14 ms | 137893 tok/s) step 15950/76294 | train loss 3.367068 | norm 0.3463 | lr 2.15e-04 | (3804.55 ms | 137805 tok/s) step 15951/76294 | train loss 3.407681 | norm 0.6302 | lr 2.15e-04 | (3823.84 ms | 137110 tok/s) step 15952/76294 | train loss 3.368296 | norm 0.4266 | lr 2.15e-04 | (3804.23 ms | 137817 tok/s) step 15953/76294 | train loss 3.370655 | norm 0.3923 | lr 2.15e-04 | (3823.16 ms | 137135 tok/s) step 15954/76294 | train loss 3.426420 | norm 0.5795 | lr 2.15e-04 | (3805.38 ms | 137776 tok/s) step 15955/76294 | train loss 3.351636 | norm 0.4613 | lr 2.15e-04 | (3807.53 
ms | 137698 tok/s) step 15956/76294 | train loss 3.411448 | norm 0.3754 | lr 2.15e-04 | (3804.14 ms | 137820 tok/s) step 15957/76294 | train loss 3.410123 | norm 0.3872 | lr 2.15e-04 | (3807.49 ms | 137699 tok/s) step 15958/76294 | train loss 3.367698 | norm 0.6874 | lr 2.15e-04 | (3803.37 ms | 137848 tok/s) step 15959/76294 | train loss 3.422623 | norm 0.6055 | lr 2.14e-04 | (3801.19 ms | 137927 tok/s) step 15960/76294 | train loss 3.446447 | norm 0.7575 | lr 2.14e-04 | (3826.74 ms | 137006 tok/s) step 15961/76294 | train loss 3.371470 | norm 0.4442 | lr 2.14e-04 | (3803.94 ms | 137828 tok/s) step 15962/76294 | train loss 3.399749 | norm 0.4395 | lr 2.14e-04 | (3844.66 ms | 136368 tok/s) step 15963/76294 | train loss 3.422961 | norm 0.4955 | lr 2.14e-04 | (3825.75 ms | 137042 tok/s) step 15964/76294 | train loss 3.367086 | norm 0.3764 | lr 2.14e-04 | (3802.04 ms | 137896 tok/s) step 15965/76294 | train loss 3.415479 | norm 0.3680 | lr 2.14e-04 | (3803.12 ms | 137857 tok/s) step 15966/76294 | train loss 3.405343 | norm 0.3726 | lr 2.14e-04 | (3822.87 ms | 137145 tok/s) step 15967/76294 | train loss 3.409019 | norm 0.3678 | lr 2.14e-04 | (3809.19 ms | 137638 tok/s) step 15968/76294 | train loss 3.406161 | norm 0.3560 | lr 2.14e-04 | (3833.57 ms | 136762 tok/s) step 15969/76294 | train loss 3.400441 | norm 0.4201 | lr 2.14e-04 | (3812.93 ms | 137503 tok/s) step 15970/76294 | train loss 3.349646 | norm 0.5263 | lr 2.14e-04 | (3810.12 ms | 137604 tok/s) step 15971/76294 | train loss 3.381033 | norm 0.3893 | lr 2.14e-04 | (3846.45 ms | 136304 tok/s) step 15972/76294 | train loss 3.401469 | norm 0.4231 | lr 2.14e-04 | (3809.93 ms | 137611 tok/s) step 15973/76294 | train loss 3.365348 | norm 0.3530 | lr 2.14e-04 | (3919.32 ms | 133770 tok/s) step 15974/76294 | train loss 3.442545 | norm 0.4579 | lr 2.14e-04 | (3797.35 ms | 138067 tok/s) step 15975/76294 | train loss 3.382483 | norm 0.4007 | lr 2.14e-04 | (3807.99 ms | 137681 tok/s) step 15976/76294 | train loss 3.346103 | 
norm 0.4355 | lr 2.14e-04 | (3801.03 ms | 137933 tok/s) step 15977/76294 | train loss 3.394690 | norm 0.4629 | lr 2.14e-04 | (3804.03 ms | 137824 tok/s) step 15978/76294 | train loss 3.428030 | norm 0.4094 | lr 2.14e-04 | (3825.11 ms | 137065 tok/s) step 15979/76294 | train loss 3.386521 | norm 0.3170 | lr 2.13e-04 | (3802.74 ms | 137871 tok/s) step 15980/76294 | train loss 3.391248 | norm 0.3880 | lr 2.13e-04 | (3801.42 ms | 137919 tok/s) step 15981/76294 | train loss 3.403225 | norm 0.3862 | lr 2.13e-04 | (3837.48 ms | 136623 tok/s) step 15982/76294 | train loss 3.460110 | norm 0.4152 | lr 2.13e-04 | (3798.15 ms | 138038 tok/s) step 15983/76294 | train loss 3.389591 | norm 0.3535 | lr 2.13e-04 | (3822.48 ms | 137159 tok/s) step 15984/76294 | train loss 3.463218 | norm 0.4105 | lr 2.13e-04 | (3798.73 ms | 138017 tok/s) step 15985/76294 | train loss 3.366572 | norm 0.5756 | lr 2.13e-04 | (3803.26 ms | 137852 tok/s) step 15986/76294 | train loss 3.465561 | norm 0.6142 | lr 2.13e-04 | (3820.97 ms | 137213 tok/s) step 15987/76294 | train loss 3.423006 | norm 0.3343 | lr 2.13e-04 | (3808.67 ms | 137656 tok/s) step 15988/76294 | train loss 3.433989 | norm 0.4239 | lr 2.13e-04 | (3801.18 ms | 137928 tok/s) step 15989/76294 | train loss 3.379306 | norm 0.4748 | lr 2.13e-04 | (3833.43 ms | 136767 tok/s) step 15990/76294 | train loss 3.391453 | norm 0.5482 | lr 2.13e-04 | (3805.83 ms | 137759 tok/s) step 15991/76294 | train loss 3.358093 | norm 0.4739 | lr 2.13e-04 | (3857.33 ms | 135920 tok/s) step 15992/76294 | train loss 3.463165 | norm 0.4001 | lr 2.13e-04 | (3804.79 ms | 137797 tok/s) step 15993/76294 | train loss 3.423369 | norm 0.5128 | lr 2.13e-04 | (3820.26 ms | 137239 tok/s) step 15994/76294 | train loss 3.364771 | norm 0.5824 | lr 2.13e-04 | (3802.76 ms | 137870 tok/s) step 15995/76294 | train loss 3.417921 | norm 0.4735 | lr 2.13e-04 | (3917.29 ms | 133840 tok/s) step 15996/76294 | train loss 3.385282 | norm 0.4912 | lr 2.13e-04 | (3804.63 ms | 137802 tok/s) 
step 15997/76294 | train loss 3.362137 | norm 0.3868 | lr 2.13e-04 | (3806.69 ms | 137728 tok/s) step 15998/76294 | train loss 3.439600 | norm 0.4126 | lr 2.12e-04 | (3823.73 ms | 137114 tok/s) step 15999/76294 | train loss 3.371806 | norm 0.4744 | lr 2.12e-04 | (3919.35 ms | 133769 tok/s) step 16000/76294 | train loss 3.405088 | norm 0.4054 | lr 2.12e-04 | (3797.17 ms | 138073 tok/s) val loss: 3.366668 saving model checkpoint to ./results/gpt2-124M-gqa/step_16000.pth step 16001/76294 | train loss 3.425458 | norm 0.3926 | lr 2.12e-04 | (3844.38 ms | 136378 tok/s) step 16002/76294 | train loss 3.427579 | norm 0.4125 | lr 2.12e-04 | (3799.94 ms | 137973 tok/s) step 16003/76294 | train loss 3.351031 | norm 0.3803 | lr 2.12e-04 | (3828.53 ms | 136942 tok/s) step 16004/76294 | train loss 3.410237 | norm 0.3331 | lr 2.12e-04 | (3797.47 ms | 138062 tok/s) step 16005/76294 | train loss 3.396574 | norm 0.3730 | lr 2.12e-04 | (3815.00 ms | 137428 tok/s) step 16006/76294 | train loss 3.361969 | norm 0.3974 | lr 2.12e-04 | (3796.29 ms | 138105 tok/s) step 16007/76294 | train loss 3.434275 | norm 0.3104 | lr 2.12e-04 | (3804.44 ms | 137809 tok/s) step 16008/76294 | train loss 3.381413 | norm 0.5128 | lr 2.12e-04 | (3825.14 ms | 137064 tok/s) step 16009/76294 | train loss 3.671760 | norm 0.3456 | lr 2.12e-04 | (3803.50 ms | 137844 tok/s) step 16010/76294 | train loss 3.435145 | norm 0.4646 | lr 2.12e-04 | (3806.34 ms | 137741 tok/s) step 16011/76294 | train loss 3.423151 | norm 0.3434 | lr 2.12e-04 | (3801.98 ms | 137899 tok/s) step 16012/76294 | train loss 3.434190 | norm 0.5063 | lr 2.12e-04 | (3796.64 ms | 138093 tok/s) step 16013/76294 | train loss 3.429859 | norm 0.3493 | lr 2.12e-04 | (3834.60 ms | 136725 tok/s) step 16014/76294 | train loss 3.434434 | norm 0.5246 | lr 2.12e-04 | (3803.11 ms | 137858 tok/s) step 16015/76294 | train loss 3.392627 | norm 0.4208 | lr 2.12e-04 | (3825.52 ms | 137050 tok/s) step 16016/76294 | train loss 3.364925 | norm 0.4239 | lr 2.12e-04 | 
(3805.40 ms | 137775 tok/s) step 16017/76294 | train loss 3.565222 | norm 0.4417 | lr 2.12e-04 | (3802.83 ms | 137868 tok/s) step 16018/76294 | train loss 3.354122 | norm 0.4554 | lr 2.11e-04 | (3817.60 ms | 137334 tok/s) step 16019/76294 | train loss 3.405022 | norm 0.4223 | lr 2.11e-04 | (3802.36 ms | 137885 tok/s) step 16020/76294 | train loss 3.408154 | norm 0.3766 | lr 2.11e-04 | (3807.81 ms | 137688 tok/s) step 16021/76294 | train loss 3.405543 | norm 0.3814 | lr 2.11e-04 | (3809.25 ms | 137635 tok/s) step 16022/76294 | train loss 3.399592 | norm 0.3052 | lr 2.11e-04 | (3810.98 ms | 137573 tok/s) step 16023/76294 | train loss 3.406525 | norm 0.4939 | lr 2.11e-04 | (3801.02 ms | 137933 tok/s) step 16024/76294 | train loss 3.368340 | norm 0.4818 | lr 2.11e-04 | (3807.77 ms | 137689 tok/s) step 16025/76294 | train loss 3.364100 | norm 0.4560 | lr 2.11e-04 | (3807.40 ms | 137702 tok/s) step 16026/76294 | train loss 3.395166 | norm 0.4893 | lr 2.11e-04 | (3821.25 ms | 137203 tok/s) step 16027/76294 | train loss 3.378780 | norm 0.5208 | lr 2.11e-04 | (3826.29 ms | 137023 tok/s) step 16028/76294 | train loss 3.355256 | norm 0.4911 | lr 2.11e-04 | (3809.16 ms | 137639 tok/s) step 16029/76294 | train loss 3.352966 | norm 0.5710 | lr 2.11e-04 | (3805.81 ms | 137760 tok/s) step 16030/76294 | train loss 3.382890 | norm 0.3862 | lr 2.11e-04 | (3899.03 ms | 134466 tok/s) step 16031/76294 | train loss 3.388984 | norm 0.6676 | lr 2.11e-04 | (3829.80 ms | 136897 tok/s) step 16032/76294 | train loss 3.336638 | norm 0.4921 | lr 2.11e-04 | (3802.38 ms | 137884 tok/s) step 16033/76294 | train loss 3.457765 | norm 0.5156 | lr 2.11e-04 | (3829.61 ms | 136904 tok/s) step 16034/76294 | train loss 3.420742 | norm 0.4930 | lr 2.11e-04 | (3802.32 ms | 137886 tok/s) step 16035/76294 | train loss 3.447429 | norm 0.5024 | lr 2.11e-04 | (3815.07 ms | 137426 tok/s) step 16036/76294 | train loss 3.467015 | norm 0.4448 | lr 2.11e-04 | (3802.73 ms | 137871 tok/s) step 16037/76294 | train loss 
3.359558 | norm 0.4060 | lr 2.11e-04 | (3805.95 ms | 137755 tok/s) step 16038/76294 | train loss 3.401552 | norm 0.4025 | lr 2.10e-04 | (3825.15 ms | 137063 tok/s) step 16039/76294 | train loss 3.456674 | norm 0.4355 | lr 2.10e-04 | (3810.00 ms | 137609 tok/s) step 16040/76294 | train loss 3.405169 | norm 0.3687 | lr 2.10e-04 | (3802.18 ms | 137892 tok/s) step 16041/76294 | train loss 3.383179 | norm 0.4191 | lr 2.10e-04 | (3834.70 ms | 136722 tok/s) step 16042/76294 | train loss 3.430617 | norm 0.5005 | lr 2.10e-04 | (3802.03 ms | 137897 tok/s) step 16043/76294 | train loss 3.382479 | norm 0.4692 | lr 2.10e-04 | (3851.19 ms | 136137 tok/s) step 16044/76294 | train loss 3.372481 | norm 1.0990 | lr 2.10e-04 | (3803.21 ms | 137854 tok/s) step 16045/76294 | train loss 3.320216 | norm 0.5892 | lr 2.10e-04 | (3807.16 ms | 137711 tok/s) step 16046/76294 | train loss 3.421837 | norm 0.5395 | lr 2.10e-04 | (3838.71 ms | 136579 tok/s) step 16047/76294 | train loss 3.351982 | norm 0.6790 | lr 2.10e-04 | (3806.68 ms | 137728 tok/s) step 16048/76294 | train loss 3.357282 | norm 0.4416 | lr 2.10e-04 | (3810.20 ms | 137601 tok/s) step 16049/76294 | train loss 3.444172 | norm 0.4249 | lr 2.10e-04 | (3808.15 ms | 137675 tok/s) step 16050/76294 | train loss 3.470920 | norm 0.4121 | lr 2.10e-04 | (3921.49 ms | 133696 tok/s) step 16051/76294 | train loss 3.376078 | norm 0.4913 | lr 2.10e-04 | (3850.06 ms | 136176 tok/s) step 16052/76294 | train loss 3.394451 | norm 0.5457 | lr 2.10e-04 | (3849.08 ms | 136211 tok/s) step 16053/76294 | train loss 3.326232 | norm 0.4174 | lr 2.10e-04 | (3799.73 ms | 137980 tok/s) step 16054/76294 | train loss 3.454590 | norm 0.7638 | lr 2.10e-04 | (3847.27 ms | 136275 tok/s) step 16055/76294 | train loss 3.428550 | norm 0.3300 | lr 2.10e-04 | (3801.39 ms | 137920 tok/s) step 16056/76294 | train loss 3.430156 | norm 0.3947 | lr 2.10e-04 | (3798.65 ms | 138020 tok/s) step 16057/76294 | train loss 3.477548 | norm 0.3939 | lr 2.10e-04 | (3818.28 ms | 137310 
tok/s) step 16058/76294 | train loss 3.410242 | norm 1.0330 | lr 2.09e-04 | (3804.75 ms | 137798 tok/s) step 16059/76294 | train loss 3.409646 | norm 0.3972 | lr 2.09e-04 | (3808.28 ms | 137670 tok/s) step 16060/76294 | train loss 3.390767 | norm 0.3661 | lr 2.09e-04 | (3808.36 ms | 137668 tok/s) step 16061/76294 | train loss 3.422366 | norm 0.4362 | lr 2.09e-04 | (3807.42 ms | 137702 tok/s) step 16062/76294 | train loss 3.368037 | norm 0.3734 | lr 2.09e-04 | (3804.63 ms | 137803 tok/s) step 16063/76294 | train loss 3.402397 | norm 0.3740 | lr 2.09e-04 | (3811.87 ms | 137541 tok/s) step 16064/76294 | train loss 3.402075 | norm 0.4061 | lr 2.09e-04 | (3802.68 ms | 137873 tok/s) step 16065/76294 | train loss 3.543335 | norm 1.1467 | lr 2.09e-04 | (3801.02 ms | 137933 tok/s) step 16066/76294 | train loss 3.360328 | norm 0.6105 | lr 2.09e-04 | (3832.33 ms | 136807 tok/s) step 16067/76294 | train loss 3.416188 | norm 0.3543 | lr 2.09e-04 | (3801.90 ms | 137902 tok/s) step 16068/76294 | train loss 3.413785 | norm 0.4874 | lr 2.09e-04 | (3804.82 ms | 137796 tok/s) step 16069/76294 | train loss 3.416508 | norm 0.4147 | lr 2.09e-04 | (3829.46 ms | 136909 tok/s) step 16070/76294 | train loss 3.576847 | norm 0.4159 | lr 2.09e-04 | (3808.26 ms | 137671 tok/s) step 16071/76294 | train loss 3.360618 | norm 0.5081 | lr 2.09e-04 | (3801.33 ms | 137922 tok/s) step 16072/76294 | train loss 3.387084 | norm 0.4154 | lr 2.09e-04 | (3841.74 ms | 136471 tok/s) step 16073/76294 | train loss 3.402362 | norm 0.3567 | lr 2.09e-04 | (3798.40 ms | 138029 tok/s) step 16074/76294 | train loss 3.378649 | norm 0.4513 | lr 2.09e-04 | (3800.21 ms | 137963 tok/s) step 16075/76294 | train loss 3.401710 | norm 0.4749 | lr 2.09e-04 | (3862.96 ms | 135722 tok/s) step 16076/76294 | train loss 3.353682 | norm 0.3919 | lr 2.09e-04 | (3825.58 ms | 137048 tok/s) step 16077/76294 | train loss 3.448135 | norm 0.3532 | lr 2.09e-04 | (9702.44 ms | 54037 tok/s) step 16078/76294 | train loss 3.362720 | norm 0.4266 
| lr 2.08e-04 | (3784.54 ms | 138534 tok/s) step 16079/76294 | train loss 3.467520 | norm 0.5419 | lr 2.08e-04 | (3792.73 ms | 138235 tok/s) step 16080/76294 | train loss 3.431785 | norm 0.4024 | lr 2.08e-04 | (3810.51 ms | 137590 tok/s) step 16081/76294 | train loss 3.361683 | norm 0.5935 | lr 2.08e-04 | (5381.66 ms | 97421 tok/s) step 16082/76294 | train loss 3.416436 | norm 0.4224 | lr 2.08e-04 | (3809.69 ms | 137620 tok/s) step 16083/76294 | train loss 3.424043 | norm 0.4255 | lr 2.08e-04 | (3789.63 ms | 138348 tok/s) step 16084/76294 | train loss 3.453238 | norm 0.4563 | lr 2.08e-04 | (3811.50 ms | 137554 tok/s) step 16085/76294 | train loss 3.431095 | norm 0.4170 | lr 2.08e-04 | (3791.27 ms | 138288 tok/s) step 16086/76294 | train loss 3.386094 | norm 0.4977 | lr 2.08e-04 | (3817.17 ms | 137350 tok/s) step 16087/76294 | train loss 3.395654 | norm 0.6766 | lr 2.08e-04 | (3789.41 ms | 138356 tok/s) step 16088/76294 | train loss 3.404348 | norm 0.3841 | lr 2.08e-04 | (3796.15 ms | 138110 tok/s) step 16089/76294 | train loss 3.397902 | norm 0.4466 | lr 2.08e-04 | (3814.14 ms | 137459 tok/s) step 16090/76294 | train loss 3.405361 | norm 0.4197 | lr 2.08e-04 | (3795.07 ms | 138150 tok/s) step 16091/76294 | train loss 3.453857 | norm 0.4405 | lr 2.08e-04 | (3792.62 ms | 138239 tok/s) step 16092/76294 | train loss 3.409615 | norm 0.4659 | lr 2.08e-04 | (3847.06 ms | 136283 tok/s) step 16093/76294 | train loss 3.363564 | norm 0.5088 | lr 2.08e-04 | (3796.76 ms | 138088 tok/s) step 16094/76294 | train loss 3.377801 | norm 0.5475 | lr 2.08e-04 | (3823.02 ms | 137140 tok/s) step 16095/76294 | train loss 3.400553 | norm 0.5489 | lr 2.08e-04 | (3820.03 ms | 137247 tok/s) step 16096/76294 | train loss 3.421727 | norm 0.5695 | lr 2.08e-04 | (3800.76 ms | 137943 tok/s) step 16097/76294 | train loss 3.510644 | norm 1.0016 | lr 2.08e-04 | (3794.27 ms | 138179 tok/s) step 16098/76294 | train loss 3.425651 | norm 0.7753 | lr 2.08e-04 | (3904.39 ms | 134282 tok/s) step 16099/76294 
| train loss 3.296854 | norm 0.8646 | lr 2.07e-04 | (3800.31 ms | 137959 tok/s) step 16100/76294 | train loss 3.419426 | norm 0.4642 | lr 2.07e-04 | (3839.97 ms | 136535 tok/s) step 16101/76294 | train loss 3.385153 | norm 0.8577 | lr 2.07e-04 | (3798.23 ms | 138035 tok/s) step 16102/76294 | train loss 3.392534 | norm 0.5948 | lr 2.07e-04 | (3827.96 ms | 136963 tok/s) step 16103/76294 | train loss 3.402958 | norm 0.3827 | lr 2.07e-04 | (3826.22 ms | 137025 tok/s) step 16104/76294 | train loss 3.433809 | norm 0.6315 | lr 2.07e-04 | (3795.14 ms | 138147 tok/s) step 16105/76294 | train loss 3.351543 | norm 0.3868 | lr 2.07e-04 | (3804.80 ms | 137796 tok/s) step 16106/76294 | train loss 3.374510 | norm 0.5026 | lr 2.07e-04 | (3796.87 ms | 138084 tok/s) step 16107/76294 | train loss 3.410967 | norm 0.3581 | lr 2.07e-04 | (3849.15 ms | 136209 tok/s) step 16108/76294 | train loss 3.356155 | norm 0.4157 | lr 2.07e-04 | (3793.70 ms | 138200 tok/s) step 16109/76294 | train loss 3.441043 | norm 0.4062 | lr 2.07e-04 | (3802.65 ms | 137874 tok/s) step 16110/76294 | train loss 3.357442 | norm 0.4222 | lr 2.07e-04 | (3830.04 ms | 136888 tok/s) step 16111/76294 | train loss 3.347523 | norm 0.3795 | lr 2.07e-04 | (3802.82 ms | 137868 tok/s) step 16112/76294 | train loss 3.334458 | norm 0.4574 | lr 2.07e-04 | (3796.93 ms | 138082 tok/s) step 16113/76294 | train loss 3.375065 | norm 0.3593 | lr 2.07e-04 | (3844.53 ms | 136372 tok/s) step 16114/76294 | train loss 3.423742 | norm 0.4344 | lr 2.07e-04 | (3794.30 ms | 138178 tok/s) step 16115/76294 | train loss 3.674892 | norm 0.3673 | lr 2.07e-04 | (3801.01 ms | 137934 tok/s) step 16116/76294 | train loss 3.396413 | norm 0.6578 | lr 2.07e-04 | (3850.14 ms | 136174 tok/s) step 16117/76294 | train loss 3.413412 | norm 0.4603 | lr 2.07e-04 | (3802.29 ms | 137887 tok/s) step 16118/76294 | train loss 3.387787 | norm 0.4696 | lr 2.07e-04 | (3801.82 ms | 137905 tok/s) step 16119/76294 | train loss 3.418182 | norm 0.4956 | lr 2.06e-04 | 
(3796.38 ms | 138102 tok/s) step 16120/76294 | train loss 3.382062 | norm 0.5067 | lr 2.06e-04 | (3803.33 ms | 137850 tok/s) step 16121/76294 | train loss 3.415230 | norm 0.6666 | lr 2.06e-04 | (3798.89 ms | 138011 tok/s) step 16122/76294 | train loss 3.511102 | norm 0.7163 | lr 2.06e-04 | (3823.30 ms | 137130 tok/s) step 16123/76294 | train loss 3.491151 | norm 0.9234 | lr 2.06e-04 | (3801.81 ms | 137905 tok/s) step 16124/76294 | train loss 3.403085 | norm 0.7517 | lr 2.06e-04 | (3800.17 ms | 137964 tok/s) step 16125/76294 | train loss 3.372673 | norm 1.0412 | lr 2.06e-04 | (3796.03 ms | 138115 tok/s) step 16126/76294 | train loss 3.396841 | norm 1.2382 | lr 2.06e-04 | (3805.74 ms | 137762 tok/s) step 16127/76294 | train loss 3.352960 | norm 0.6173 | lr 2.06e-04 | (3804.99 ms | 137790 tok/s) step 16128/76294 | train loss 3.365449 | norm 0.6298 | lr 2.06e-04 | (4081.62 ms | 128451 tok/s) step 16129/76294 | train loss 3.395766 | norm 0.3981 | lr 2.06e-04 | (3823.97 ms | 137106 tok/s) step 16130/76294 | train loss 3.373840 | norm 0.6820 | lr 2.06e-04 | (3804.62 ms | 137803 tok/s) step 16131/76294 | train loss 3.333742 | norm 0.4862 | lr 2.06e-04 | (3799.24 ms | 137998 tok/s) step 16132/76294 | train loss 3.355499 | norm 0.4012 | lr 2.06e-04 | (3804.46 ms | 137809 tok/s) step 16133/76294 | train loss 3.309590 | norm 0.5112 | lr 2.06e-04 | (3836.27 ms | 136666 tok/s) step 16134/76294 | train loss 3.348221 | norm 0.6356 | lr 2.06e-04 | (3805.60 ms | 137767 tok/s) step 16135/76294 | train loss 3.407362 | norm 0.3416 | lr 2.06e-04 | (3804.82 ms | 137796 tok/s) step 16136/76294 | train loss 3.383193 | norm 0.4660 | lr 2.06e-04 | (3825.54 ms | 137050 tok/s) step 16137/76294 | train loss 3.340785 | norm 0.4049 | lr 2.06e-04 | (3800.83 ms | 137941 tok/s) step 16138/76294 | train loss 3.355859 | norm 0.4013 | lr 2.06e-04 | (3807.08 ms | 137714 tok/s) step 16139/76294 | train loss 3.389887 | norm 0.5684 | lr 2.06e-04 | (3800.27 ms | 137961 tok/s) step 16140/76294 | train loss 
3.391890 | norm 0.5126 | lr 2.05e-04 | (3804.90 ms | 137793 tok/s) step 16141/76294 | train loss 3.427503 | norm 0.4638 | lr 2.05e-04 | (3799.33 ms | 137995 tok/s) step 16142/76294 | train loss 3.376373 | norm 0.4670 | lr 2.05e-04 | (3798.96 ms | 138008 tok/s) step 16143/76294 | train loss 3.313152 | norm 0.8497 | lr 2.05e-04 | (3801.81 ms | 137905 tok/s) step 16144/76294 | train loss 3.379134 | norm 0.5565 | lr 2.05e-04 | (3802.33 ms | 137886 tok/s) step 16145/76294 | train loss 3.387767 | norm 0.4647 | lr 2.05e-04 | (3800.16 ms | 137965 tok/s) step 16146/76294 | train loss 3.372826 | norm 0.4192 | lr 2.05e-04 | (3805.64 ms | 137766 tok/s) step 16147/76294 | train loss 3.392448 | norm 0.5271 | lr 2.05e-04 | (3800.35 ms | 137958 tok/s) step 16148/76294 | train loss 3.341511 | norm 0.5085 | lr 2.05e-04 | (3797.06 ms | 138077 tok/s) step 16149/76294 | train loss 3.345362 | norm 0.3832 | lr 2.05e-04 | (3826.34 ms | 137021 tok/s) step 16150/76294 | train loss 3.394009 | norm 0.3865 | lr 2.05e-04 | (3798.15 ms | 138038 tok/s) step 16151/76294 | train loss 3.356272 | norm 0.4110 | lr 2.05e-04 | (3819.82 ms | 137255 tok/s) step 16152/76294 | train loss 3.343146 | norm 0.3972 | lr 2.05e-04 | (3814.92 ms | 137431 tok/s) step 16153/76294 | train loss 3.323997 | norm 0.4518 | lr 2.05e-04 | (3855.50 ms | 135985 tok/s) step 16154/76294 | train loss 3.365681 | norm 0.4218 | lr 2.05e-04 | (3882.76 ms | 135030 tok/s) step 16155/76294 | train loss 3.308021 | norm 0.5212 | lr 2.05e-04 | (3797.76 ms | 138052 tok/s) step 16156/76294 | train loss 3.349715 | norm 0.4869 | lr 2.05e-04 | (3851.53 ms | 136124 tok/s) step 16157/76294 | train loss 3.327588 | norm 0.5357 | lr 2.05e-04 | (3819.86 ms | 137253 tok/s) step 16158/76294 | train loss 3.384188 | norm 0.3946 | lr 2.05e-04 | (3809.86 ms | 137614 tok/s) step 16159/76294 | train loss 3.472227 | norm 0.4053 | lr 2.05e-04 | (3817.54 ms | 137337 tok/s) step 16160/76294 | train loss 3.367809 | norm 0.4044 | lr 2.04e-04 | (3829.56 ms | 136906 
tok/s) step 16161/76294 | train loss 3.498594 | norm 0.3883 | lr 2.04e-04 | (3804.54 ms | 137806 tok/s) step 16162/76294 | train loss 3.350406 | norm 0.7796 | lr 2.04e-04 | (3799.57 ms | 137986 tok/s) step 16163/76294 | train loss 3.384445 | norm 0.4642 | lr 2.04e-04 | (3801.12 ms | 137930 tok/s) step 16164/76294 | train loss 3.362653 | norm 0.4248 | lr 2.04e-04 | (3800.55 ms | 137950 tok/s) step 16165/76294 | train loss 3.340424 | norm 0.5253 | lr 2.04e-04 | (3817.61 ms | 137334 tok/s) step 16166/76294 | train loss 3.338613 | norm 0.3807 | lr 2.04e-04 | (3801.63 ms | 137911 tok/s) step 16167/76294 | train loss 3.390393 | norm 0.4803 | lr 2.04e-04 | (3794.96 ms | 138154 tok/s) step 16168/76294 | train loss 3.342205 | norm 0.5082 | lr 2.04e-04 | (3823.76 ms | 137113 tok/s) step 16169/76294 | train loss 3.419721 | norm 0.5313 | lr 2.04e-04 | (3800.83 ms | 137940 tok/s) step 16170/76294 | train loss 3.459803 | norm 0.4284 | lr 2.04e-04 | (3797.59 ms | 138058 tok/s) step 16171/76294 | train loss 3.324939 | norm 0.4029 | lr 2.04e-04 | (3819.97 ms | 137249 tok/s) step 16172/76294 | train loss 3.348015 | norm 0.4682 | lr 2.04e-04 | (3801.87 ms | 137903 tok/s) step 16173/76294 | train loss 3.348607 | norm 0.4269 | lr 2.04e-04 | (3796.03 ms | 138115 tok/s) step 16174/76294 | train loss 3.314730 | norm 0.5038 | lr 2.04e-04 | (3824.87 ms | 137073 tok/s) step 16175/76294 | train loss 3.327811 | norm 0.6301 | lr 2.04e-04 | (3807.81 ms | 137688 tok/s) step 16176/76294 | train loss 3.409243 | norm 0.5301 | lr 2.04e-04 | (3804.35 ms | 137813 tok/s) step 16177/76294 | train loss 3.305825 | norm 0.4549 | lr 2.04e-04 | (3822.97 ms | 137142 tok/s) step 16178/76294 | train loss 3.395698 | norm 0.4284 | lr 2.04e-04 | (3803.86 ms | 137830 tok/s) step 16179/76294 | train loss 3.447067 | norm 0.4243 | lr 2.04e-04 | (3795.84 ms | 138122 tok/s) step 16180/76294 | train loss 3.375276 | norm 0.3593 | lr 2.04e-04 | (3823.87 ms | 137109 tok/s) step 16181/76294 | train loss 3.386463 | norm 0.3905 
| lr 2.03e-04 | (3795.79 ms | 138123 tok/s) step 16182/76294 | train loss 3.355222 | norm 0.5014 | lr 2.03e-04 | (3799.15 ms | 138001 tok/s) step 16183/76294 | train loss 3.391517 | norm 0.6111 | lr 2.03e-04 | (3822.27 ms | 137167 tok/s) step 16184/76294 | train loss 3.345139 | norm 0.4548 | lr 2.03e-04 | (3801.79 ms | 137906 tok/s) step 16185/76294 | train loss 3.463834 | norm 0.9143 | lr 2.03e-04 | (3806.38 ms | 137739 tok/s) step 16186/76294 | train loss 3.330967 | norm 0.7768 | lr 2.03e-04 | (3835.36 ms | 136698 tok/s) step 16187/76294 | train loss 3.367835 | norm 0.4179 | lr 2.03e-04 | (3820.56 ms | 137228 tok/s) step 16188/76294 | train loss 3.482996 | norm 0.4406 | lr 2.03e-04 | (3798.12 ms | 138039 tok/s) step 16189/76294 | train loss 3.357281 | norm 0.3855 | lr 2.03e-04 | (3833.96 ms | 136748 tok/s) step 16190/76294 | train loss 3.296540 | norm 0.4306 | lr 2.03e-04 | (3797.51 ms | 138061 tok/s) step 16191/76294 | train loss 3.591913 | norm 0.4409 | lr 2.03e-04 | (3804.65 ms | 137802 tok/s) step 16192/76294 | train loss 3.329474 | norm 0.4148 | lr 2.03e-04 | (3798.50 ms | 138025 tok/s) step 16193/76294 | train loss 3.234281 | norm 0.5201 | lr 2.03e-04 | (3801.26 ms | 137925 tok/s) step 16194/76294 | train loss 3.361349 | norm 0.4546 | lr 2.03e-04 | (3804.57 ms | 137805 tok/s) step 16195/76294 | train loss 3.308938 | norm 0.3674 | lr 2.03e-04 | (3799.09 ms | 138004 tok/s) step 16196/76294 | train loss 3.307569 | norm 0.4689 | lr 2.03e-04 | (3828.13 ms | 136957 tok/s) step 16197/76294 | train loss 3.373176 | norm 0.3826 | lr 2.03e-04 | (3796.72 ms | 138090 tok/s) step 16198/76294 | train loss 3.350259 | norm 0.4655 | lr 2.03e-04 | (3802.94 ms | 137864 tok/s) step 16199/76294 | train loss 3.335742 | norm 0.5783 | lr 2.03e-04 | (3819.46 ms | 137267 tok/s) step 16200/76294 | train loss 3.312284 | norm 0.4715 | lr 2.03e-04 | (3801.21 ms | 137927 tok/s) step 16201/76294 | train loss 3.407118 | norm 0.7627 | lr 2.03e-04 | (3802.01 ms | 137898 tok/s) step 
16202/76294 | train loss 3.405055 | norm 0.3953 | lr 2.02e-04 | (3801.82 ms | 137905 tok/s) step 16203/76294 | train loss 3.305784 | norm 0.5019 | lr 2.02e-04 | (3802.98 ms | 137862 tok/s) step 16204/76294 | train loss 3.334409 | norm 0.3755 | lr 2.02e-04 | (3803.27 ms | 137852 tok/s) step 16205/76294 | train loss 3.399470 | norm 0.4740 | lr 2.02e-04 | (3856.12 ms | 135963 tok/s) step 16206/76294 | train loss 3.323364 | norm 0.4445 | lr 2.02e-04 | (3796.25 ms | 138107 tok/s) step 16207/76294 | train loss 3.394219 | norm 0.4560 | lr 2.02e-04 | (3835.80 ms | 136683 tok/s) step 16208/76294 | train loss 3.368957 | norm 0.4302 | lr 2.02e-04 | (3804.11 ms | 137821 tok/s) step 16209/76294 | train loss 3.374904 | norm 0.4167 | lr 2.02e-04 | (3852.21 ms | 136100 tok/s) step 16210/76294 | train loss 3.376473 | norm 0.4506 | lr 2.02e-04 | (3818.65 ms | 137297 tok/s) step 16211/76294 | train loss 3.389131 | norm 0.6643 | lr 2.02e-04 | (3855.54 ms | 135983 tok/s) step 16212/76294 | train loss 3.346809 | norm 0.4381 | lr 2.02e-04 | (3804.98 ms | 137790 tok/s) step 16213/76294 | train loss 3.364111 | norm 0.3749 | lr 2.02e-04 | (3809.56 ms | 137624 tok/s) step 16214/76294 | train loss 3.362815 | norm 0.3851 | lr 2.02e-04 | (3832.53 ms | 136800 tok/s) step 16215/76294 | train loss 3.402108 | norm 0.3915 | lr 2.02e-04 | (3807.85 ms | 137686 tok/s) step 16216/76294 | train loss 3.389815 | norm 0.4905 | lr 2.02e-04 | (3811.41 ms | 137557 tok/s) step 16217/76294 | train loss 3.557654 | norm 0.4586 | lr 2.02e-04 | (3798.97 ms | 138008 tok/s) step 16218/76294 | train loss 3.320965 | norm 0.5072 | lr 2.02e-04 | (3793.36 ms | 138212 tok/s) step 16219/76294 | train loss 3.368305 | norm 0.4960 | lr 2.02e-04 | (3826.37 ms | 137020 tok/s) step 16220/76294 | train loss 3.380479 | norm 0.5843 | lr 2.02e-04 | (3794.89 ms | 138156 tok/s) step 16221/76294 | train loss 3.295418 | norm 0.5599 | lr 2.02e-04 | (3818.79 ms | 137292 tok/s) step 16222/76294 | train loss 3.363508 | norm 0.5451 | lr 
2.02e-04 | (3823.48 ms | 137123 tok/s) step 16223/76294 | train loss 3.317727 | norm 0.5439 | lr 2.01e-04 | (3803.38 ms | 137848 tok/s) step 16224/76294 | train loss 3.390120 | norm 0.6273 | lr 2.01e-04 | (3803.67 ms | 137837 tok/s) step 16225/76294 | train loss 3.414299 | norm 1.0990 | lr 2.01e-04 | (3800.72 ms | 137944 tok/s) step 16226/76294 | train loss 3.449970 | norm 0.6915 | lr 2.01e-04 | (3809.80 ms | 137616 tok/s) step 16227/76294 | train loss 3.333269 | norm 0.4866 | lr 2.01e-04 | (3802.73 ms | 137872 tok/s) step 16228/76294 | train loss 3.339217 | norm 0.5011 | lr 2.01e-04 | (3793.49 ms | 138207 tok/s) step 16229/76294 | train loss 3.407080 | norm 0.4040 | lr 2.01e-04 | (3825.58 ms | 137048 tok/s) step 16230/76294 | train loss 3.372579 | norm 0.3995 | lr 2.01e-04 | (3796.61 ms | 138094 tok/s) step 16231/76294 | train loss 3.336919 | norm 0.5198 | lr 2.01e-04 | (3917.98 ms | 133816 tok/s) step 16232/76294 | train loss 3.379115 | norm 0.5125 | lr 2.01e-04 | (3891.12 ms | 134740 tok/s) step 16233/76294 | train loss 3.371245 | norm 0.7104 | lr 2.01e-04 | (3795.62 ms | 138130 tok/s) step 16234/76294 | train loss 3.346722 | norm 0.4821 | lr 2.01e-04 | (3858.10 ms | 135893 tok/s) step 16235/76294 | train loss 3.402670 | norm 0.4890 | lr 2.01e-04 | (3791.29 ms | 138287 tok/s) step 16236/76294 | train loss 3.410933 | norm 0.6673 | lr 2.01e-04 | (3899.24 ms | 134459 tok/s) step 16237/76294 | train loss 3.332111 | norm 0.5374 | lr 2.01e-04 | (3917.34 ms | 133838 tok/s) step 16238/76294 | train loss 3.454387 | norm 0.5402 | lr 2.01e-04 | (3779.20 ms | 138730 tok/s) step 16239/76294 | train loss 3.462879 | norm 0.6542 | lr 2.01e-04 | (5338.43 ms | 98210 tok/s) step 16240/76294 | train loss 3.370493 | norm 0.4011 | lr 2.01e-04 | (3896.17 ms | 134565 tok/s) step 16241/76294 | train loss 3.553839 | norm 0.4200 | lr 2.01e-04 | (3770.81 ms | 139038 tok/s) step 16242/76294 | train loss 3.393800 | norm 0.3911 | lr 2.01e-04 | (3893.51 ms | 134657 tok/s) step 16243/76294 | 
train loss 3.362881 | norm 0.5429 | lr 2.01e-04 | (3771.70 ms | 139006 tok/s) step 16244/76294 | train loss 3.351137 | norm 0.4037 | lr 2.00e-04 | (3895.70 ms | 134581 tok/s) step 16245/76294 | train loss 3.412567 | norm 0.4119 | lr 2.00e-04 | (3779.94 ms | 138703 tok/s) step 16246/76294 | train loss 3.396008 | norm 0.3876 | lr 2.00e-04 | (3846.50 ms | 136303 tok/s) step 16247/76294 | train loss 3.357151 | norm 0.4101 | lr 2.00e-04 | (3811.55 ms | 137553 tok/s) step 16248/76294 | train loss 3.425479 | norm 0.4905 | lr 2.00e-04 | (3839.21 ms | 136561 tok/s) step 16249/76294 | train loss 3.375487 | norm 0.9091 | lr 2.00e-04 | (3815.14 ms | 137423 tok/s) step 16250/76294 | train loss 3.388952 | norm 0.5467 | lr 2.00e-04 | (3790.29 ms | 138324 tok/s) val loss: 3.370022 saving model checkpoint to ./results/gpt2-124M-gqa/step_16250.pth step 16251/76294 | train loss 3.406630 | norm 0.4691 | lr 2.00e-04 | (3796.48 ms | 138098 tok/s) step 16252/76294 | train loss 3.463702 | norm 0.5508 | lr 2.00e-04 | (3786.75 ms | 138453 tok/s) step 16253/76294 | train loss 3.373457 | norm 0.5191 | lr 2.00e-04 | (3821.46 ms | 137196 tok/s) step 16254/76294 | train loss 3.307518 | norm 0.4524 | lr 2.00e-04 | (3790.77 ms | 138306 tok/s) step 16255/76294 | train loss 3.337090 | norm 0.5114 | lr 2.00e-04 | (3816.09 ms | 137389 tok/s) step 16256/76294 | train loss 3.395342 | norm 0.4771 | lr 2.00e-04 | (3795.79 ms | 138123 tok/s) step 16257/76294 | train loss 3.371466 | norm 0.7133 | lr 2.00e-04 | (3837.50 ms | 136622 tok/s) step 16258/76294 | train loss 3.346330 | norm 0.5730 | lr 2.00e-04 | (3903.02 ms | 134329 tok/s) step 16259/76294 | train loss 3.422245 | norm 0.4735 | lr 2.00e-04 | (3792.44 ms | 138246 tok/s) step 16260/76294 | train loss 3.439206 | norm 0.8850 | lr 2.00e-04 | (3890.18 ms | 134772 tok/s) step 16261/76294 | train loss 3.360569 | norm 0.9701 | lr 2.00e-04 | (3790.50 ms | 138316 tok/s) step 16262/76294 | train loss 3.449006 | norm 0.5929 | lr 2.00e-04 | (3809.61 ms | 137622 
tok/s) step 16263/76294 | train loss 3.344336 | norm 0.5258 | lr 2.00e-04 | (3790.84 ms | 138304 tok/s) step 16264/76294 | train loss 3.385291 | norm 0.5472 | lr 2.00e-04 | (3806.23 ms | 137745 tok/s) step 16265/76294 | train loss 3.463481 | norm 0.5275 | lr 1.99e-04 | (3796.87 ms | 138084 tok/s) step 16266/76294 | train loss 3.370114 | norm 0.3961 | lr 1.99e-04 | (3823.84 ms | 137111 tok/s) step 16267/76294 | train loss 3.316305 | norm 0.4212 | lr 1.99e-04 | (3801.15 ms | 137929 tok/s) step 16268/76294 | train loss 3.407250 | norm 0.5382 | lr 1.99e-04 | (3804.72 ms | 137799 tok/s) step 16269/76294 | train loss 3.352211 | norm 0.5768 | lr 1.99e-04 | (3824.28 ms | 137094 tok/s) step 16270/76294 | train loss 3.316740 | norm 0.5175 | lr 1.99e-04 | (3811.91 ms | 137539 tok/s) step 16271/76294 | train loss 3.324014 | norm 0.9100 | lr 1.99e-04 | (3903.19 ms | 134323 tok/s) step 16272/76294 | train loss 3.412944 | norm 0.5171 | lr 1.99e-04 | (3795.05 ms | 138150 tok/s) step 16273/76294 | train loss 3.347287 | norm 0.5347 | lr 1.99e-04 | (13797.26 ms | 37999 tok/s) step 16274/76294 | train loss 3.313147 | norm 0.6814 | lr 1.99e-04 | (3768.95 ms | 139107 tok/s) step 16275/76294 | train loss 3.385986 | norm 0.5358 | lr 1.99e-04 | (3881.85 ms | 135061 tok/s) step 16276/76294 | train loss 3.340166 | norm 0.5995 | lr 1.99e-04 | (3770.41 ms | 139053 tok/s) step 16277/76294 | train loss 3.320003 | norm 0.4715 | lr 1.99e-04 | (3775.75 ms | 138856 tok/s) step 16278/76294 | train loss 3.315270 | norm 0.6162 | lr 1.99e-04 | (3818.82 ms | 137291 tok/s) step 16279/76294 | train loss 3.399778 | norm 0.6365 | lr 1.99e-04 | (3799.71 ms | 137981 tok/s) step 16280/76294 | train loss 3.350794 | norm 0.5092 | lr 1.99e-04 | (3792.73 ms | 138235 tok/s) step 16281/76294 | train loss 3.325029 | norm 0.5560 | lr 1.99e-04 | (3786.70 ms | 138455 tok/s) step 16282/76294 | train loss 3.405563 | norm 0.5215 | lr 1.99e-04 | (3788.71 ms | 138382 tok/s) step 16283/76294 | train loss 3.435685 | norm 0.6363 
| lr 1.99e-04 | (3816.90 ms | 137360 tok/s) step 16284/76294 | train loss 3.396429 | norm 0.8527 | lr 1.99e-04 | (3791.59 ms | 138277 tok/s) step 16285/76294 | train loss 3.435330 | norm 0.6190 | lr 1.99e-04 | (3797.29 ms | 138069 tok/s) step 16286/76294 | train loss 3.385549 | norm 0.5863 | lr 1.99e-04 | (3816.95 ms | 137358 tok/s) step 16287/76294 | train loss 3.293770 | norm 0.6257 | lr 1.98e-04 | (3805.44 ms | 137773 tok/s) step 16288/76294 | train loss 3.387378 | norm 0.7912 | lr 1.98e-04 | (3800.85 ms | 137940 tok/s) step 16289/76294 | train loss 3.439646 | norm 0.4545 | lr 1.98e-04 | (3837.41 ms | 136625 tok/s) step 16290/76294 | train loss 3.349569 | norm 0.5733 | lr 1.98e-04 | (3805.42 ms | 137774 tok/s) step 16291/76294 | train loss 3.402516 | norm 0.6772 | lr 1.98e-04 | (3809.32 ms | 137633 tok/s) step 16292/76294 | train loss 3.338781 | norm 0.5478 | lr 1.98e-04 | (3857.57 ms | 135912 tok/s) step 16293/76294 | train loss 3.414454 | norm 0.4902 | lr 1.98e-04 | (3808.68 ms | 137656 tok/s) step 16294/76294 | train loss 3.418228 | norm 0.5062 | lr 1.98e-04 | (3837.20 ms | 136633 tok/s) step 16295/76294 | train loss 3.270813 | norm 0.5424 | lr 1.98e-04 | (3805.60 ms | 137768 tok/s) step 16296/76294 | train loss 3.364222 | norm 0.5798 | lr 1.98e-04 | (3812.41 ms | 137521 tok/s) step 16297/76294 | train loss 3.333338 | norm 0.7293 | lr 1.98e-04 | (3828.32 ms | 136950 tok/s) step 16298/76294 | train loss 3.336851 | norm 0.5385 | lr 1.98e-04 | (3813.92 ms | 137467 tok/s) step 16299/76294 | train loss 3.371329 | norm 0.3632 | lr 1.98e-04 | (3893.66 ms | 134652 tok/s) step 16300/76294 | train loss 3.409563 | norm 0.7657 | lr 1.98e-04 | (12977.43 ms | 40400 tok/s) step 16301/76294 | train loss 3.393515 | norm 0.5208 | lr 1.98e-04 | (3880.58 ms | 135105 tok/s) step 16302/76294 | train loss 3.352602 | norm 0.6124 | lr 1.98e-04 | (3791.91 ms | 138265 tok/s) step 16303/76294 | train loss 3.353947 | norm 0.5874 | lr 1.98e-04 | (3799.15 ms | 138001 tok/s) step 
16304/76294 | train loss 3.372199 | norm 0.4921 | lr 1.98e-04 | (3802.45 ms | 137882 tok/s) step 16305/76294 | train loss 3.340743 | norm 0.4056 | lr 1.98e-04 | (5151.10 ms | 101782 tok/s) step 16306/76294 | train loss 3.376275 | norm 0.8353 | lr 1.98e-04 | (3800.48 ms | 137953 tok/s) step 16307/76294 | train loss 3.492327 | norm 0.5905 | lr 1.98e-04 | (3935.46 ms | 133222 tok/s) step 16308/76294 | train loss 3.438787 | norm 0.4316 | lr 1.97e-04 | (3782.00 ms | 138627 tok/s) step 16309/76294 | train loss 3.359755 | norm 0.5519 | lr 1.97e-04 | (3791.40 ms | 138283 tok/s) step 16310/76294 | train loss 3.392302 | norm 0.4555 | lr 1.97e-04 | (3810.54 ms | 137589 tok/s) step 16311/76294 | train loss 3.367747 | norm 1.0710 | lr 1.97e-04 | (3796.01 ms | 138116 tok/s) step 16312/76294 | train loss 3.400556 | norm 0.5921 | lr 1.97e-04 | (3796.56 ms | 138096 tok/s) step 16313/76294 | train loss 3.452733 | norm 0.6664 | lr 1.97e-04 | (3798.89 ms | 138011 tok/s) step 16314/76294 | train loss 3.388564 | norm 0.6117 | lr 1.97e-04 | (3799.14 ms | 138002 tok/s) step 16315/76294 | train loss 3.383584 | norm 0.9019 | lr 1.97e-04 | (3801.32 ms | 137923 tok/s) step 16316/76294 | train loss 3.320131 | norm 0.8826 | lr 1.97e-04 | (3797.61 ms | 138057 tok/s) step 16317/76294 | train loss 3.398062 | norm 0.6623 | lr 1.97e-04 | (3825.59 ms | 137047 tok/s) step 16318/76294 | train loss 3.355226 | norm 0.5973 | lr 1.97e-04 | (4398.96 ms | 119185 tok/s) step 16319/76294 | train loss 3.394670 | norm 0.5274 | lr 1.97e-04 | (3801.58 ms | 137913 tok/s) step 16320/76294 | train loss 3.408559 | norm 0.7142 | lr 1.97e-04 | (3859.85 ms | 135831 tok/s) step 16321/76294 | train loss 3.421282 | norm 0.5483 | lr 1.97e-04 | (6767.56 ms | 77471 tok/s) step 16322/76294 | train loss 3.335637 | norm 0.6756 | lr 1.97e-04 | (3885.51 ms | 134934 tok/s) step 16323/76294 | train loss 3.359765 | norm 0.7823 | lr 1.97e-04 | (3795.69 ms | 138127 tok/s) step 16324/76294 | train loss 3.370626 | norm 1.0215 | lr 
1.97e-04 | (3823.35 ms | 137128 tok/s) step 16325/76294 | train loss 3.376074 | norm 0.6125 | lr 1.97e-04 | (3800.27 ms | 137961 tok/s) step 16326/76294 | train loss 3.404834 | norm 0.5125 | lr 1.97e-04 | (3810.86 ms | 137577 tok/s) step 16327/76294 | train loss 3.359940 | norm 0.4735 | lr 1.97e-04 | (3827.87 ms | 136966 tok/s) step 16328/76294 | train loss 3.341404 | norm 0.8128 | lr 1.97e-04 | (3801.38 ms | 137920 tok/s) step 16329/76294 | train loss 3.333552 | norm 0.8169 | lr 1.97e-04 | (3852.07 ms | 136106 tok/s) step 16330/76294 | train loss 3.325694 | norm 0.3922 | lr 1.96e-04 | (3805.03 ms | 137788 tok/s) step 16331/76294 | train loss 3.315845 | norm 0.4816 | lr 1.96e-04 | (3833.83 ms | 136753 tok/s) step 16332/76294 | train loss 3.345231 | norm 0.5240 | lr 1.96e-04 | (3805.14 ms | 137784 tok/s) step 16333/76294 | train loss 3.337392 | norm 0.4849 | lr 1.96e-04 | (3830.96 ms | 136855 tok/s) step 16334/76294 | train loss 3.401457 | norm 0.4617 | lr 1.96e-04 | (3808.26 ms | 137671 tok/s) step 16335/76294 | train loss 3.489044 | norm 0.4941 | lr 1.96e-04 | (3809.14 ms | 137640 tok/s) step 16336/76294 | train loss 3.354430 | norm 0.4825 | lr 1.96e-04 | (3843.17 ms | 136421 tok/s) step 16337/76294 | train loss 3.432672 | norm 0.5263 | lr 1.96e-04 | (3809.15 ms | 137639 tok/s) step 16338/76294 | train loss 3.406877 | norm 0.5360 | lr 1.96e-04 | (3813.37 ms | 137487 tok/s) step 16339/76294 | train loss 3.391502 | norm 0.6000 | lr 1.96e-04 | (3807.37 ms | 137703 tok/s) step 16340/76294 | train loss 3.326854 | norm 0.5467 | lr 1.96e-04 | (3807.59 ms | 137696 tok/s) step 16341/76294 | train loss 3.409572 | norm 0.5006 | lr 1.96e-04 | (3832.87 ms | 136787 tok/s) step 16342/76294 | train loss 3.423549 | norm 0.6859 | lr 1.96e-04 | (3809.10 ms | 137641 tok/s) step 16343/76294 | train loss 3.290836 | norm 0.5670 | lr 1.96e-04 | (3841.12 ms | 136494 tok/s) step 16344/76294 | train loss 3.362445 | norm 0.5626 | lr 1.96e-04 | (3808.43 ms | 137665 tok/s) step 16345/76294 | 
train loss 3.317442 | norm 0.5404 | lr 1.96e-04 | (3834.39 ms | 136733 tok/s) step 16346/76294 | train loss 3.311641 | norm 0.4226 | lr 1.96e-04 | (3899.01 ms | 134467 tok/s) step 16347/76294 | train loss 3.347120 | norm 0.4123 | lr 1.96e-04 | (3839.81 ms | 136540 tok/s) step 16348/76294 | train loss 3.265464 | norm 0.8536 | lr 1.96e-04 | (3830.11 ms | 136886 tok/s) step 16349/76294 | train loss 3.412736 | norm 0.7109 | lr 1.96e-04 | (3845.59 ms | 136335 tok/s) step 16350/76294 | train loss 3.313618 | norm 0.4094 | lr 1.96e-04 | (3830.12 ms | 136886 tok/s) step 16351/76294 | train loss 3.368124 | norm 0.3653 | lr 1.95e-04 | (3806.78 ms | 137725 tok/s) step 16352/76294 | train loss 3.390736 | norm 0.6223 | lr 1.95e-04 | (3843.88 ms | 136395 tok/s) step 16353/76294 | train loss 3.347987 | norm 0.6219 | lr 1.95e-04 | (3829.77 ms | 136898 tok/s) step 16354/76294 | train loss 3.437129 | norm 0.6568 | lr 1.95e-04 | (3807.83 ms | 137687 tok/s) step 16355/76294 | train loss 3.340426 | norm 0.5121 | lr 1.95e-04 | (3820.84 ms | 137218 tok/s) step 16356/76294 | train loss 3.357174 | norm 1.1718 | lr 1.95e-04 | (3808.96 ms | 137646 tok/s) step 16357/76294 | train loss 3.379289 | norm 0.6827 | lr 1.95e-04 | (3845.84 ms | 136326 tok/s) step 16358/76294 | train loss 3.313954 | norm 0.5441 | lr 1.95e-04 | (3803.48 ms | 137844 tok/s) step 16359/76294 | train loss 3.383007 | norm 0.6817 | lr 1.95e-04 | (3828.35 ms | 136949 tok/s) step 16360/76294 | train loss 3.369114 | norm 1.0163 | lr 1.95e-04 | (3819.92 ms | 137251 tok/s) step 16361/76294 | train loss 3.297251 | norm 0.6357 | lr 1.95e-04 | (3808.16 ms | 137675 tok/s) step 16362/76294 | train loss 3.374794 | norm 0.7829 | lr 1.95e-04 | (3809.08 ms | 137642 tok/s) step 16363/76294 | train loss 3.381023 | norm 0.6456 | lr 1.95e-04 | (3803.35 ms | 137849 tok/s) step 16364/76294 | train loss 3.508811 | norm 0.8539 | lr 1.95e-04 | (3807.51 ms | 137699 tok/s) step 16365/76294 | train loss 3.378544 | norm 0.7630 | lr 1.95e-04 | (3810.22 
ms | 137601 tok/s) step 16366/76294 | train loss 3.344559 | norm 0.7055 | lr 1.95e-04 | (3809.98 ms | 137609 tok/s) step 16367/76294 | train loss 3.381730 | norm 0.8479 | lr 1.95e-04 | (3803.31 ms | 137851 tok/s) step 16368/76294 | train loss 3.384675 | norm 1.2538 | lr 1.95e-04 | (3832.56 ms | 136798 tok/s) step 16369/76294 | train loss 3.375964 | norm 1.5105 | lr 1.95e-04 | (3808.25 ms | 137671 tok/s) step 16370/76294 | train loss 3.306916 | norm 0.5036 | lr 1.95e-04 | (3826.57 ms | 137013 tok/s) step 16371/76294 | train loss 3.371016 | norm 0.4934 | lr 1.95e-04 | (3802.56 ms | 137878 tok/s) step 16372/76294 | train loss 3.313718 | norm 0.5287 | lr 1.95e-04 | (3874.47 ms | 135319 tok/s) step 16373/76294 | train loss 3.380902 | norm 0.5032 | lr 1.94e-04 | (3801.40 ms | 137920 tok/s) step 16374/76294 | train loss 3.363084 | norm 0.5615 | lr 1.94e-04 | (3832.07 ms | 136816 tok/s) step 16375/76294 | train loss 3.465414 | norm 0.4416 | lr 1.94e-04 | (3807.67 ms | 137693 tok/s) step 16376/76294 | train loss 3.328313 | norm 0.5989 | lr 1.94e-04 | (3829.90 ms | 136894 tok/s) step 16377/76294 | train loss 3.359021 | norm 0.4878 | lr 1.94e-04 | (3804.73 ms | 137799 tok/s) step 16378/76294 | train loss 3.412738 | norm 0.4609 | lr 1.94e-04 | (3815.71 ms | 137403 tok/s) step 16379/76294 | train loss 3.316975 | norm 0.4763 | lr 1.94e-04 | (3803.76 ms | 137834 tok/s) step 16380/76294 | train loss 3.286246 | norm 0.6464 | lr 1.94e-04 | (3806.85 ms | 137722 tok/s) step 16381/76294 | train loss 3.438572 | norm 0.5257 | lr 1.94e-04 | (3845.28 ms | 136346 tok/s) step 16382/76294 | train loss 3.357730 | norm 0.4510 | lr 1.94e-04 | (3806.68 ms | 137728 tok/s) step 16383/76294 | train loss 3.360751 | norm 0.4235 | lr 1.94e-04 | (3809.18 ms | 137638 tok/s) step 16384/76294 | train loss 3.405397 | norm 0.5469 | lr 1.94e-04 | (3823.65 ms | 137117 tok/s) step 16385/76294 | train loss 3.393332 | norm 0.9174 | lr 1.94e-04 | (3808.97 ms | 137646 tok/s) step 16386/76294 | train loss 3.390711 | 
norm 0.5007 | lr 1.94e-04 | (3809.14 ms | 137640 tok/s) step 16387/76294 | train loss 3.509499 | norm 0.4465 | lr 1.94e-04 | (3841.14 ms | 136493 tok/s) step 16388/76294 | train loss 3.391667 | norm 0.4932 | lr 1.94e-04 | (3812.33 ms | 137524 tok/s) step 16389/76294 | train loss 3.315195 | norm 0.6137 | lr 1.94e-04 | (3807.56 ms | 137697 tok/s) step 16390/76294 | train loss 3.381668 | norm 0.6532 | lr 1.94e-04 | (3803.94 ms | 137828 tok/s) step 16391/76294 | train loss 3.408973 | norm 0.4886 | lr 1.94e-04 | (3827.99 ms | 136962 tok/s) step 16392/76294 | train loss 3.355978 | norm 0.6257 | lr 1.94e-04 | (3800.19 ms | 137963 tok/s) step 16393/76294 | train loss 3.373261 | norm 0.6895 | lr 1.94e-04 | (3822.33 ms | 137164 tok/s) step 16394/76294 | train loss 3.338020 | norm 0.6137 | lr 1.94e-04 | (3820.13 ms | 137243 tok/s) step 16395/76294 | train loss 3.377623 | norm 0.7025 | lr 1.93e-04 | (3802.34 ms | 137886 tok/s) step 16396/76294 | train loss 3.300831 | norm 0.8917 | lr 1.93e-04 | (3810.22 ms | 137600 tok/s) step 16397/76294 | train loss 3.354870 | norm 0.6433 | lr 1.93e-04 | (3804.29 ms | 137815 tok/s) step 16398/76294 | train loss 3.376407 | norm 0.5418 | lr 1.93e-04 | (3885.62 ms | 134930 tok/s) step 16399/76294 | train loss 3.420622 | norm 0.7823 | lr 1.93e-04 | (3805.21 ms | 137782 tok/s) step 16400/76294 | train loss 3.347273 | norm 1.4447 | lr 1.93e-04 | (3808.92 ms | 137647 tok/s) step 16401/76294 | train loss 3.344805 | norm 0.5693 | lr 1.93e-04 | (3803.98 ms | 137826 tok/s) step 16402/76294 | train loss 3.339090 | norm 0.7565 | lr 1.93e-04 | (3804.79 ms | 137797 tok/s) step 16403/76294 | train loss 3.449392 | norm 0.5241 | lr 1.93e-04 | (3825.03 ms | 137068 tok/s) step 16404/76294 | train loss 3.354962 | norm 0.8140 | lr 1.93e-04 | (3805.72 ms | 137763 tok/s) step 16405/76294 | train loss 3.395618 | norm 1.3329 | lr 1.93e-04 | (3823.18 ms | 137134 tok/s) step 16406/76294 | train loss 3.329447 | norm 0.7463 | lr 1.93e-04 | (3801.83 ms | 137904 tok/s) 
step 16407/76294 | train loss 3.360914 | norm 0.5271 | lr 1.93e-04 | (3807.88 ms | 137685 tok/s) step 16408/76294 | train loss 3.462468 | norm 0.6791 | lr 1.93e-04 | (3804.76 ms | 137798 tok/s) step 16409/76294 | train loss 3.333146 | norm 0.6990 | lr 1.93e-04 | (3799.95 ms | 137972 tok/s) step 16410/76294 | train loss 3.400915 | norm 0.7590 | lr 1.93e-04 | (3825.76 ms | 137041 tok/s) step 16411/76294 | train loss 3.315219 | norm 0.3910 | lr 1.93e-04 | (3799.94 ms | 137973 tok/s) step 16412/76294 | train loss 3.362938 | norm 0.6123 | lr 1.93e-04 | (3804.50 ms | 137807 tok/s) step 16413/76294 | train loss 3.393165 | norm 0.8075 | lr 1.93e-04 | (3819.66 ms | 137261 tok/s) step 16414/76294 | train loss 3.269282 | norm 0.5366 | lr 1.93e-04 | (3800.70 ms | 137945 tok/s) step 16415/76294 | train loss 3.429140 | norm 0.6698 | lr 1.93e-04 | (3811.91 ms | 137539 tok/s) step 16416/76294 | train loss 3.340773 | norm 0.4265 | lr 1.93e-04 | (3840.80 ms | 136505 tok/s) step 16417/76294 | train loss 3.399195 | norm 0.4181 | lr 1.92e-04 | (3802.06 ms | 137896 tok/s) step 16418/76294 | train loss 3.370814 | norm 0.3684 | lr 1.92e-04 | (3807.05 ms | 137715 tok/s) step 16419/76294 | train loss 3.338238 | norm 0.4464 | lr 1.92e-04 | (3827.11 ms | 136993 tok/s) step 16420/76294 | train loss 3.456533 | norm 0.4745 | lr 1.92e-04 | (3801.83 ms | 137904 tok/s) step 16421/76294 | train loss 3.273214 | norm 0.4607 | lr 1.92e-04 | (3812.50 ms | 137518 tok/s) step 16422/76294 | train loss 3.390425 | norm 0.4210 | lr 1.92e-04 | (3816.93 ms | 137359 tok/s) step 16423/76294 | train loss 3.347026 | norm 0.6272 | lr 1.92e-04 | (3809.16 ms | 137639 tok/s) step 16424/76294 | train loss 3.403976 | norm 0.6041 | lr 1.92e-04 | (3897.42 ms | 134522 tok/s) step 16425/76294 | train loss 3.355261 | norm 0.8250 | lr 1.92e-04 | (3803.25 ms | 137853 tok/s) step 16426/76294 | train loss 3.326329 | norm 0.5192 | lr 1.92e-04 | (3813.81 ms | 137471 tok/s) step 16427/76294 | train loss 3.360713 | norm 0.6403 | lr 
1.92e-04 | (3800.89 ms | 137938 tok/s) step 16428/76294 | train loss 3.350449 | norm 0.4416 | lr 1.92e-04 | (3817.05 ms | 137354 tok/s) step 16429/76294 | train loss 3.349852 | norm 0.5524 | lr 1.92e-04 | (3801.89 ms | 137902 tok/s) step 16430/76294 | train loss 3.410272 | norm 0.4249 | lr 1.92e-04 | (3828.41 ms | 136947 tok/s) step 16431/76294 | train loss 3.396549 | norm 0.4278 | lr 1.92e-04 | (3802.17 ms | 137892 tok/s) step 16432/76294 | train loss 3.348683 | norm 0.5669 | lr 1.92e-04 | (3803.05 ms | 137860 tok/s) step 16433/76294 | train loss 3.283504 | norm 0.8179 | lr 1.92e-04 | (3822.19 ms | 137169 tok/s) step 16434/76294 | train loss 3.439506 | norm 0.4060 | lr 1.92e-04 | (3946.82 ms | 132838 tok/s) step 16435/76294 | train loss 3.365768 | norm 0.7468 | lr 1.92e-04 | (3796.28 ms | 138106 tok/s) step 16436/76294 | train loss 3.370220 | norm 0.5564 | lr 1.92e-04 | (3836.48 ms | 136658 tok/s) step 16437/76294 | train loss 3.355374 | norm 0.7022 | lr 1.92e-04 | (3807.08 ms | 137714 tok/s) step 16438/76294 | train loss 3.358268 | norm 0.6870 | lr 1.92e-04 | (3825.56 ms | 137049 tok/s) step 16439/76294 | train loss 3.379080 | norm 0.4483 | lr 1.92e-04 | (3802.88 ms | 137866 tok/s) step 16440/76294 | train loss 3.367838 | norm 0.5804 | lr 1.91e-04 | (3805.74 ms | 137763 tok/s) step 16441/76294 | train loss 3.332304 | norm 0.4750 | lr 1.91e-04 | (3822.22 ms | 137168 tok/s) step 16442/76294 | train loss 3.424613 | norm 0.5279 | lr 1.91e-04 | (3804.35 ms | 137813 tok/s) step 16443/76294 | train loss 3.313366 | norm 0.4506 | lr 1.91e-04 | (3817.90 ms | 137323 tok/s) step 16444/76294 | train loss 3.362247 | norm 0.6250 | lr 1.91e-04 | (3809.61 ms | 137622 tok/s) step 16445/76294 | train loss 3.425479 | norm 0.6027 | lr 1.91e-04 | (3807.73 ms | 137690 tok/s) step 16446/76294 | train loss 3.343376 | norm 1.1341 | lr 1.91e-04 | (3802.25 ms | 137889 tok/s) step 16447/76294 | train loss 3.351809 | norm 0.6008 | lr 1.91e-04 | (3805.43 ms | 137774 tok/s) step 16448/76294 | 
train loss 3.355810 | norm 0.5532 | lr 1.91e-04 | (3801.88 ms | 137902 tok/s) step 16449/76294 | train loss 3.359328 | norm 0.8471 | lr 1.91e-04 | (3807.25 ms | 137708 tok/s) step 16450/76294 | train loss 3.394963 | norm 0.4504 | lr 1.91e-04 | (3880.32 ms | 135115 tok/s) step 16451/76294 | train loss 3.339839 | norm 0.4385 | lr 1.91e-04 | (3800.73 ms | 137944 tok/s) step 16452/76294 | train loss 3.358632 | norm 0.5400 | lr 1.91e-04 | (3832.23 ms | 136810 tok/s) step 16453/76294 | train loss 3.401680 | norm 0.6595 | lr 1.91e-04 | (3807.67 ms | 137692 tok/s) step 16454/76294 | train loss 3.282772 | norm 0.5539 | lr 1.91e-04 | (3829.17 ms | 136920 tok/s) step 16455/76294 | train loss 3.465694 | norm 0.6703 | lr 1.91e-04 | (3802.21 ms | 137890 tok/s) step 16456/76294 | train loss 3.324268 | norm 0.4804 | lr 1.91e-04 | (3843.32 ms | 136415 tok/s) step 16457/76294 | train loss 3.356848 | norm 0.4659 | lr 1.91e-04 | (3841.72 ms | 136472 tok/s) step 16458/76294 | train loss 3.360341 | norm 0.6631 | lr 1.91e-04 | (3801.71 ms | 137909 tok/s) step 16459/76294 | train loss 3.287681 | norm 0.9335 | lr 1.91e-04 | (3821.60 ms | 137191 tok/s) step 16460/76294 | train loss 3.360551 | norm 0.5604 | lr 1.91e-04 | (3804.32 ms | 137814 tok/s) step 16461/76294 | train loss 3.294533 | norm 0.7488 | lr 1.91e-04 | (3798.86 ms | 138012 tok/s) step 16462/76294 | train loss 3.307365 | norm 0.5355 | lr 1.90e-04 | (3819.39 ms | 137270 tok/s) step 16463/76294 | train loss 3.388658 | norm 0.5446 | lr 1.90e-04 | (3804.76 ms | 137798 tok/s) step 16464/76294 | train loss 3.368361 | norm 0.7536 | lr 1.90e-04 | (3806.88 ms | 137721 tok/s) step 16465/76294 | train loss 3.377829 | norm 0.4042 | lr 1.90e-04 | (3820.29 ms | 137238 tok/s) step 16466/76294 | train loss 3.396325 | norm 0.5612 | lr 1.90e-04 | (3809.08 ms | 137641 tok/s) step 16467/76294 | train loss 3.381169 | norm 0.4946 | lr 1.90e-04 | (3799.46 ms | 137990 tok/s) step 16468/76294 | train loss 3.357868 | norm 0.4626 | lr 1.90e-04 | (3830.30 
ms | 136879 tok/s) step 16469/76294 | train loss 3.382881 | norm 0.4150 | lr 1.90e-04 | (3801.90 ms | 137902 tok/s) step 16470/76294 | train loss 3.296082 | norm 0.4982 | lr 1.90e-04 | (3809.58 ms | 137624 tok/s) step 16471/76294 | train loss 3.320415 | norm 0.4818 | lr 1.90e-04 | (3821.20 ms | 137205 tok/s) step 16472/76294 | train loss 3.343373 | norm 0.5859 | lr 1.90e-04 | (3805.74 ms | 137762 tok/s) step 16473/76294 | train loss 3.337651 | norm 0.6237 | lr 1.90e-04 | (3818.07 ms | 137318 tok/s) step 16474/76294 | train loss 3.460169 | norm 0.5498 | lr 1.90e-04 | (3822.75 ms | 137149 tok/s) step 16475/76294 | train loss 3.347626 | norm 0.6510 | lr 1.90e-04 | (3802.45 ms | 137881 tok/s) step 16476/76294 | train loss 3.343168 | norm 0.6366 | lr 1.90e-04 | (3869.88 ms | 135479 tok/s) step 16477/76294 | train loss 3.382151 | norm 1.0101 | lr 1.90e-04 | (3812.49 ms | 137518 tok/s) step 16478/76294 | train loss 3.310344 | norm 0.5861 | lr 1.90e-04 | (3814.10 ms | 137461 tok/s) step 16479/76294 | train loss 3.299679 | norm 0.6281 | lr 1.90e-04 | (3838.45 ms | 136588 tok/s) step 16480/76294 | train loss 3.345596 | norm 0.5730 | lr 1.90e-04 | (3802.11 ms | 137894 tok/s) step 16481/76294 | train loss 3.422332 | norm 0.5362 | lr 1.90e-04 | (3799.40 ms | 137992 tok/s) step 16482/76294 | train loss 3.340911 | norm 0.3665 | lr 1.90e-04 | (3834.16 ms | 136741 tok/s) step 16483/76294 | train loss 3.453531 | norm 0.4688 | lr 1.90e-04 | (3804.59 ms | 137804 tok/s) step 16484/76294 | train loss 3.369234 | norm 0.4685 | lr 1.89e-04 | (3805.87 ms | 137758 tok/s) step 16485/76294 | train loss 3.387765 | norm 0.4109 | lr 1.89e-04 | (3823.43 ms | 137125 tok/s) step 16486/76294 | train loss 3.388163 | norm 0.4619 | lr 1.89e-04 | (3804.13 ms | 137821 tok/s) step 16487/76294 | train loss 3.324043 | norm 0.6859 | lr 1.89e-04 | (3820.54 ms | 137229 tok/s) step 16488/76294 | train loss 3.359341 | norm 0.4065 | lr 1.89e-04 | (3803.98 ms | 137826 tok/s) step 16489/76294 | train loss 3.323298 | 
norm 0.5980 | lr 1.89e-04 | (3808.47 ms | 137664 tok/s) step 16490/76294 | train loss 3.296889 | norm 0.4508 | lr 1.89e-04 | (3804.12 ms | 137821 tok/s) step 16491/76294 | train loss 3.403751 | norm 0.4904 | lr 1.89e-04 | (3825.44 ms | 137053 tok/s) step 16492/76294 | train loss 3.378474 | norm 0.9526 | lr 1.89e-04 | (3799.69 ms | 137982 tok/s) step 16493/76294 | train loss 3.310501 | norm 0.5356 | lr 1.89e-04 | (3807.26 ms | 137707 tok/s) step 16494/76294 | train loss 3.374735 | norm 0.4450 | lr 1.89e-04 | (3825.92 ms | 137036 tok/s) step 16495/76294 | train loss 3.385447 | norm 0.7121 | lr 1.89e-04 | (3804.81 ms | 137796 tok/s) step 16496/76294 | train loss 3.339711 | norm 0.9307 | lr 1.89e-04 | (3800.81 ms | 137941 tok/s) step 16497/76294 | train loss 3.414526 | norm 0.9064 | lr 1.89e-04 | (3826.49 ms | 137016 tok/s) step 16498/76294 | train loss 3.276733 | norm 1.1603 | lr 1.89e-04 | (3800.13 ms | 137966 tok/s) step 16499/76294 | train loss 3.394606 | norm 0.5465 | lr 1.89e-04 | (3804.99 ms | 137790 tok/s) step 16500/76294 | train loss 3.370407 | norm 0.7484 | lr 1.89e-04 | (3823.67 ms | 137117 tok/s) val loss: 3.374685 saving model checkpoint to ./results/gpt2-124M-gqa/step_16500.pth step 16501/76294 | train loss 3.305541 | norm 1.0696 | lr 1.89e-04 | (3863.61 ms | 135699 tok/s) step 16502/76294 | train loss 3.477623 | norm 0.6200 | lr 1.89e-04 | (3794.30 ms | 138178 tok/s) step 16503/76294 | train loss 3.346016 | norm 0.8898 | lr 1.89e-04 | (3808.20 ms | 137673 tok/s) step 16504/76294 | train loss 3.346497 | norm 1.3238 | lr 1.89e-04 | (3818.17 ms | 137314 tok/s) step 16505/76294 | train loss 3.397481 | norm 1.1322 | lr 1.89e-04 | (3800.27 ms | 137961 tok/s) step 16506/76294 | train loss 3.382617 | norm 0.6699 | lr 1.89e-04 | (3803.85 ms | 137831 tok/s) step 16507/76294 | train loss 3.269387 | norm 0.6118 | lr 1.88e-04 | (3828.66 ms | 136938 tok/s) step 16508/76294 | train loss 3.332958 | norm 1.0968 | lr 1.88e-04 | (3802.58 ms | 137877 tok/s) step 
16509/76294 | train loss 3.306484 | norm 0.8328 | lr 1.88e-04 | (4523.41 ms | 115905 tok/s) step 16510/76294 | train loss 3.352851 | norm 0.7976 | lr 1.88e-04 | (3797.82 ms | 138050 tok/s) step 16511/76294 | train loss 3.314735 | norm 0.6169 | lr 1.88e-04 | (3827.44 ms | 136982 tok/s) step 16512/76294 | train loss 3.406803 | norm 0.7016 | lr 1.88e-04 | (3820.97 ms | 137213 tok/s) step 16513/76294 | train loss 3.305196 | norm 0.5362 | lr 1.88e-04 | (3810.92 ms | 137575 tok/s) step 16514/76294 | train loss 3.417228 | norm 0.7068 | lr 1.88e-04 | (3821.89 ms | 137180 tok/s) step 16515/76294 | train loss 3.383561 | norm 0.5260 | lr 1.88e-04 | (3802.66 ms | 137874 tok/s) step 16516/76294 | train loss 3.402890 | norm 0.5016 | lr 1.88e-04 | (3822.55 ms | 137157 tok/s) step 16517/76294 | train loss 3.353023 | norm 0.5608 | lr 1.88e-04 | (3801.30 ms | 137923 tok/s) step 16518/76294 | train loss 3.414496 | norm 0.5315 | lr 1.88e-04 | (3810.95 ms | 137574 tok/s) step 16519/76294 | train loss 3.329537 | norm 0.5686 | lr 1.88e-04 | (3827.15 ms | 136992 tok/s) step 16520/76294 | train loss 3.391129 | norm 0.4839 | lr 1.88e-04 | (4050.03 ms | 129453 tok/s) step 16521/76294 | train loss 3.314812 | norm 0.5803 | lr 1.88e-04 | (3863.77 ms | 135694 tok/s) step 16522/76294 | train loss 3.394847 | norm 0.7536 | lr 1.88e-04 | (3800.18 ms | 137964 tok/s) step 16523/76294 | train loss 3.280438 | norm 0.4515 | lr 1.88e-04 | (3803.17 ms | 137855 tok/s) step 16524/76294 | train loss 3.528939 | norm 0.8043 | lr 1.88e-04 | (3823.02 ms | 137140 tok/s) step 16525/76294 | train loss 3.319167 | norm 0.5304 | lr 1.88e-04 | (3808.99 ms | 137645 tok/s) step 16526/76294 | train loss 3.422225 | norm 0.6875 | lr 1.88e-04 | (11201.74 ms | 46804 tok/s) step 16527/76294 | train loss 3.285041 | norm 0.5466 | lr 1.88e-04 | (3777.06 ms | 138808 tok/s) step 16528/76294 | train loss 3.380182 | norm 0.6594 | lr 1.88e-04 | (3792.34 ms | 138249 tok/s) step 16529/76294 | train loss 3.289396 | norm 0.7311 | lr 
1.88e-04 | (3786.14 ms | 138476 tok/s) step 16530/76294 | train loss 3.397276 | norm 1.1770 | lr 1.87e-04 | (3806.94 ms | 137719 tok/s) step 16531/76294 | train loss 3.316741 | norm 0.4852 | lr 1.87e-04 | (3789.58 ms | 138350 tok/s) step 16532/76294 | train loss 3.327395 | norm 0.6537 | lr 1.87e-04 | (3792.34 ms | 138249 tok/s) step 16533/76294 | train loss 3.362212 | norm 0.7881 | lr 1.87e-04 | (3814.75 ms | 137437 tok/s) step 16534/76294 | train loss 3.380931 | norm 0.5135 | lr 1.87e-04 | (3792.69 ms | 138237 tok/s) step 16535/76294 | train loss 3.306219 | norm 0.4914 | lr 1.87e-04 | (3796.91 ms | 138083 tok/s) step 16536/76294 | train loss 3.407231 | norm 0.4268 | lr 1.87e-04 | (3792.25 ms | 138253 tok/s) step 16537/76294 | train loss 3.339128 | norm 0.4244 | lr 1.87e-04 | (3793.14 ms | 138220 tok/s) step 16538/76294 | train loss 3.362419 | norm 0.6144 | lr 1.87e-04 | (3819.80 ms | 137255 tok/s) step 16539/76294 | train loss 3.366919 | norm 0.4383 | lr 1.87e-04 | (3794.47 ms | 138171 tok/s) step 16540/76294 | train loss 3.375134 | norm 0.8610 | lr 1.87e-04 | (3803.59 ms | 137840 tok/s) step 16541/76294 | train loss 3.310053 | norm 0.5323 | lr 1.87e-04 | (3797.80 ms | 138050 tok/s) step 16542/76294 | train loss 3.344979 | norm 0.6612 | lr 1.87e-04 | (3799.77 ms | 137979 tok/s) step 16543/76294 | train loss 3.386391 | norm 0.5530 | lr 1.87e-04 | (3800.14 ms | 137966 tok/s) step 16544/76294 | train loss 3.342561 | norm 0.5312 | lr 1.87e-04 | (3802.40 ms | 137884 tok/s) step 16545/76294 | train loss 3.465415 | norm 0.4970 | lr 1.87e-04 | (3958.66 ms | 132441 tok/s) step 16546/76294 | train loss 3.359201 | norm 0.6232 | lr 1.87e-04 | (3803.65 ms | 137838 tok/s) step 16547/76294 | train loss 3.359708 | norm 0.6964 | lr 1.87e-04 | (3797.08 ms | 138077 tok/s) step 16548/76294 | train loss 3.356276 | norm 0.5696 | lr 1.87e-04 | (3799.09 ms | 138004 tok/s) step 16549/76294 | train loss 3.407576 | norm 0.5732 | lr 1.87e-04 | (3795.76 ms | 138125 tok/s) step 16550/76294 | 
train loss 3.302752 | norm 0.5072 | lr 1.87e-04 | (3798.73 ms | 138017 tok/s) step 16551/76294 | train loss 3.383606 | norm 0.6793 | lr 1.87e-04 | (3851.01 ms | 136143 tok/s) step 16552/76294 | train loss 3.287191 | norm 0.5790 | lr 1.87e-04 | (3795.76 ms | 138125 tok/s) step 16553/76294 | train loss 3.393979 | norm 0.8484 | lr 1.86e-04 | (3818.13 ms | 137315 tok/s) step 16554/76294 | train loss 3.342391 | norm 0.9409 | lr 1.86e-04 | (3883.99 ms | 134987 tok/s) step 16555/76294 | train loss 3.392864 | norm 0.8401 | lr 1.86e-04 | (3793.11 ms | 138221 tok/s) step 16556/76294 | train loss 3.318359 | norm 0.6752 | lr 1.86e-04 | (3805.41 ms | 137774 tok/s) step 16557/76294 | train loss 3.368787 | norm 0.6359 | lr 1.86e-04 | (3793.53 ms | 138206 tok/s) step 16558/76294 | train loss 3.345592 | norm 0.5415 | lr 1.86e-04 | (3797.87 ms | 138048 tok/s) step 16559/76294 | train loss 3.363256 | norm 0.9138 | lr 1.86e-04 | (3815.46 ms | 137411 tok/s) step 16560/76294 | train loss 3.349865 | norm 0.7021 | lr 1.86e-04 | (3796.38 ms | 138102 tok/s) step 16561/76294 | train loss 3.397050 | norm 0.5734 | lr 1.86e-04 | (3817.44 ms | 137340 tok/s) step 16562/76294 | train loss 3.389081 | norm 0.6642 | lr 1.86e-04 | (3795.09 ms | 138149 tok/s) step 16563/76294 | train loss 3.308616 | norm 0.6519 | lr 1.86e-04 | (3820.20 ms | 137241 tok/s) step 16564/76294 | train loss 3.396396 | norm 0.6283 | lr 1.86e-04 | (3798.57 ms | 138023 tok/s) step 16565/76294 | train loss 3.322981 | norm 0.6954 | lr 1.86e-04 | (3800.54 ms | 137951 tok/s) step 16566/76294 | train loss 3.403496 | norm 1.0112 | lr 1.86e-04 | (3861.93 ms | 135758 tok/s) step 16567/76294 | train loss 3.295821 | norm 0.5710 | lr 1.86e-04 | (3801.36 ms | 137921 tok/s) step 16568/76294 | train loss 3.400831 | norm 0.5780 | lr 1.86e-04 | (3807.66 ms | 137693 tok/s) step 16569/76294 | train loss 3.353457 | norm 0.4945 | lr 1.86e-04 | (3800.09 ms | 137967 tok/s) step 16570/76294 | train loss 3.393415 | norm 0.5225 | lr 1.86e-04 | (3820.47 
ms | 137231 tok/s) step 16571/76294 | train loss 3.285777 | norm 0.4316 | lr 1.86e-04 | (3799.15 ms | 138001 tok/s) step 16572/76294 | train loss 3.413589 | norm 0.6827 | lr 1.86e-04 | (3803.49 ms | 137844 tok/s) step 16573/76294 | train loss 3.310509 | norm 1.1273 | lr 1.86e-04 | (3801.29 ms | 137924 tok/s) step 16574/76294 | train loss 3.424392 | norm 0.6957 | lr 1.86e-04 | (3801.93 ms | 137901 tok/s) step 16575/76294 | train loss 3.314498 | norm 0.5554 | lr 1.86e-04 | (3803.88 ms | 137830 tok/s) step 16576/76294 | train loss 3.395349 | norm 0.4674 | lr 1.85e-04 | (3823.95 ms | 137106 tok/s) step 16577/76294 | train loss 3.302845 | norm 0.5361 | lr 1.85e-04 | (3800.31 ms | 137959 tok/s) step 16578/76294 | train loss 3.407275 | norm 0.8228 | lr 1.85e-04 | (4517.51 ms | 116057 tok/s) step 16579/76294 | train loss 3.321388 | norm 0.5363 | lr 1.85e-04 | (3885.13 ms | 134947 tok/s) step 16580/76294 | train loss 3.341084 | norm 0.4609 | lr 1.85e-04 | (3837.30 ms | 136629 tok/s) step 16581/76294 | train loss 3.513589 | norm 0.6929 | lr 1.85e-04 | (3798.45 ms | 138027 tok/s) step 16582/76294 | train loss 3.507055 | norm 0.5895 | lr 1.85e-04 | (3835.76 ms | 136684 tok/s) step 16583/76294 | train loss 3.350364 | norm 0.8482 | lr 1.85e-04 | (3798.79 ms | 138014 tok/s) step 16584/76294 | train loss 3.355953 | norm 0.8406 | lr 1.85e-04 | (3803.02 ms | 137861 tok/s) step 16585/76294 | train loss 3.372020 | norm 1.0500 | lr 1.85e-04 | (3820.18 ms | 137242 tok/s) step 16586/76294 | train loss 3.368426 | norm 0.8131 | lr 1.85e-04 | (3800.49 ms | 137953 tok/s) step 16587/76294 | train loss 3.346779 | norm 0.6547 | lr 1.85e-04 | (3802.84 ms | 137868 tok/s) step 16588/76294 | train loss 3.328492 | norm 0.6115 | lr 1.85e-04 | (3801.56 ms | 137914 tok/s) step 16589/76294 | train loss 3.254121 | norm 0.4959 | lr 1.85e-04 | (3820.14 ms | 137243 tok/s) step 16590/76294 | train loss 3.395061 | norm 0.7500 | lr 1.85e-04 | (3800.60 ms | 137949 tok/s) step 16591/76294 | train loss 3.291075 | 
norm 0.4535 | lr 1.85e-04 | (3797.21 ms | 138072 tok/s) step 16592/76294 | train loss 3.360462 | norm 0.6053 | lr 1.85e-04 | (3820.31 ms | 137237 tok/s) step 16593/76294 | train loss 3.324856 | norm 0.5790 | lr 1.85e-04 | (3808.75 ms | 137654 tok/s) step 16594/76294 | train loss 3.374336 | norm 0.8520 | lr 1.85e-04 | (3808.02 ms | 137680 tok/s) step 16595/76294 | train loss 3.372238 | norm 0.4812 | lr 1.85e-04 | (3824.88 ms | 137073 tok/s) step 16596/76294 | train loss 3.363949 | norm 1.1098 | lr 1.85e-04 | (3815.09 ms | 137425 tok/s) step 16597/76294 | train loss 3.366569 | norm 0.6979 | lr 1.85e-04 | (3798.57 ms | 138022 tok/s) step 16598/76294 | train loss 3.380133 | norm 0.5862 | lr 1.85e-04 | (3800.00 ms | 137970 tok/s) step 16599/76294 | train loss 3.300234 | norm 0.4954 | lr 1.85e-04 | (3796.94 ms | 138082 tok/s) step 16600/76294 | train loss 3.368316 | norm 0.4945 | lr 1.84e-04 | (3849.05 ms | 136212 tok/s) step 16601/76294 | train loss 3.342078 | norm 0.7550 | lr 1.84e-04 | (3796.63 ms | 138093 tok/s) step 16602/76294 | train loss 3.320373 | norm 0.4725 | lr 1.84e-04 | (3827.32 ms | 136986 tok/s) step 16603/76294 | train loss 3.408604 | norm 0.7860 | lr 1.84e-04 | (3800.43 ms | 137955 tok/s) step 16604/76294 | train loss 3.321708 | norm 0.6722 | lr 1.84e-04 | (3827.87 ms | 136966 tok/s) step 16605/76294 | train loss 3.331891 | norm 0.6592 | lr 1.84e-04 | (3870.05 ms | 135473 tok/s) step 16606/76294 | train loss 3.289361 | norm 0.6713 | lr 1.84e-04 | (3797.00 ms | 138080 tok/s) step 16607/76294 | train loss 3.358750 | norm 0.8586 | lr 1.84e-04 | (3823.93 ms | 137107 tok/s) step 16608/76294 | train loss 3.322765 | norm 0.9033 | lr 1.84e-04 | (3793.86 ms | 138194 tok/s) step 16609/76294 | train loss 3.423287 | norm 1.3705 | lr 1.84e-04 | (3802.65 ms | 137875 tok/s) step 16610/76294 | train loss 3.363393 | norm 1.0245 | lr 1.84e-04 | (3822.43 ms | 137161 tok/s) step 16611/76294 | train loss 3.455086 | norm 1.3105 | lr 1.84e-04 | (3800.78 ms | 137942 tok/s) 
step 16612/76294 | train loss 3.351651 | norm 1.2874 | lr 1.84e-04 | (3804.22 ms | 137818 tok/s) step 16613/76294 | train loss 3.387482 | norm 1.4356 | lr 1.84e-04 | (3802.94 ms | 137864 tok/s) step 16614/76294 | train loss 3.359213 | norm 0.8171 | lr 1.84e-04 | (3803.27 ms | 137852 tok/s) step 16615/76294 | train loss 3.320782 | norm 0.8786 | lr 1.84e-04 | (3799.93 ms | 137973 tok/s) step 16616/76294 | train loss 3.348745 | norm 0.6264 | lr 1.84e-04 | (3798.65 ms | 138020 tok/s) step 16617/76294 | train loss 3.271862 | norm 0.7528 | lr 1.84e-04 | (3826.63 ms | 137010 tok/s) step 16618/76294 | train loss 3.403445 | norm 1.0028 | lr 1.84e-04 | (3797.75 ms | 138052 tok/s) step 16619/76294 | train loss 3.344102 | norm 0.6750 | lr 1.84e-04 | (3801.84 ms | 137904 tok/s) step 16620/76294 | train loss 3.451921 | norm 0.9850 | lr 1.84e-04 | (3821.95 ms | 137178 tok/s) step 16621/76294 | train loss 3.309758 | norm 0.7198 | lr 1.84e-04 | (3798.89 ms | 138011 tok/s) step 16622/76294 | train loss 3.417416 | norm 0.7289 | lr 1.84e-04 | (3806.33 ms | 137741 tok/s) step 16623/76294 | train loss 3.360804 | norm 0.9209 | lr 1.83e-04 | (3807.92 ms | 137684 tok/s) step 16624/76294 | train loss 3.407825 | norm 0.8111 | lr 1.83e-04 | (3802.61 ms | 137876 tok/s) step 16625/76294 | train loss 3.349738 | norm 0.6894 | lr 1.83e-04 | (3801.99 ms | 137898 tok/s) step 16626/76294 | train loss 3.441654 | norm 0.7682 | lr 1.83e-04 | (3808.85 ms | 137650 tok/s) step 16627/76294 | train loss 3.315047 | norm 1.0097 | lr 1.83e-04 | (3803.28 ms | 137852 tok/s) step 16628/76294 | train loss 3.401605 | norm 2.0238 | lr 1.83e-04 | (3797.02 ms | 138079 tok/s) step 16629/76294 | train loss 3.304102 | norm 0.9362 | lr 1.83e-04 | (3827.90 ms | 136965 tok/s) step 16630/76294 | train loss 3.348045 | norm 1.1934 | lr 1.83e-04 | (3798.69 ms | 138018 tok/s) step 16631/76294 | train loss 3.351036 | norm 0.9182 | lr 1.83e-04 | (3878.95 ms | 135162 tok/s) step 16632/76294 | train loss 3.418680 | norm 1.5053 | lr 
1.83e-04 | (3800.80 ms | 137942 tok/s) step 16633/76294 | train loss 3.415643 | norm 1.4582 | lr 1.83e-04 | (3900.18 ms | 134426 tok/s) step 16634/76294 | train loss 3.356578 | norm 1.6925 | lr 1.83e-04 | (3799.24 ms | 137998 tok/s) step 16635/76294 | train loss 3.378879 | norm 1.3503 | lr 1.83e-04 | (3805.06 ms | 137787 tok/s) step 16636/76294 | train loss 3.369043 | norm 1.0699 | lr 1.83e-04 | (3819.60 ms | 137262 tok/s) step 16637/76294 | train loss 3.435468 | norm 1.4415 | lr 1.83e-04 | (3810.14 ms | 137603 tok/s) step 16638/76294 | train loss 3.293613 | norm 1.8883 | lr 1.83e-04 | (3802.32 ms | 137886 tok/s) step 16639/76294 | train loss 3.415174 | norm 1.7308 | lr 1.83e-04 | (3826.89 ms | 137001 tok/s) step 16640/76294 | train loss 3.400049 | norm 1.6184 | lr 1.83e-04 | (3803.53 ms | 137843 tok/s) step 16641/76294 | train loss 3.421468 | norm 2.6357 | lr 1.83e-04 | (3801.34 ms | 137922 tok/s) step 16642/76294 | train loss 3.363823 | norm 0.9906 | lr 1.83e-04 | (3870.19 ms | 135468 tok/s) step 16643/76294 | train loss 3.394246 | norm 0.7387 | lr 1.83e-04 | (3797.00 ms | 138079 tok/s) step 16644/76294 | train loss 3.393172 | norm 0.7818 | lr 1.83e-04 | (3805.74 ms | 137763 tok/s) step 16645/76294 | train loss 3.455546 | norm 0.6901 | lr 1.83e-04 | (3825.83 ms | 137039 tok/s) step 16646/76294 | train loss 3.391315 | norm 1.4825 | lr 1.83e-04 | (3803.13 ms | 137857 tok/s) step 16647/76294 | train loss 3.420309 | norm 0.7410 | lr 1.82e-04 | (3804.68 ms | 137801 tok/s) step 16648/76294 | train loss 3.413271 | norm 0.8694 | lr 1.82e-04 | (3833.54 ms | 136763 tok/s) step 16649/76294 | train loss 3.425567 | norm 0.8465 | lr 1.82e-04 | (3822.18 ms | 137170 tok/s) step 16650/76294 | train loss 3.355620 | norm 0.8476 | lr 1.82e-04 | (3802.70 ms | 137872 tok/s) step 16651/76294 | train loss 3.327244 | norm 0.8299 | lr 1.82e-04 | (3821.22 ms | 137204 tok/s) step 16652/76294 | train loss 3.442268 | norm 0.6022 | lr 1.82e-04 | (3803.17 ms | 137855 tok/s) step 16653/76294 | 
train loss 3.361014 | norm 0.8404 | lr 1.82e-04 | (3806.22 ms | 137745 tok/s) step 16654/76294 | train loss 3.507619 | norm 0.9147 | lr 1.82e-04 | (3832.17 ms | 136812 tok/s) step 16655/76294 | train loss 3.357378 | norm 0.6022 | lr 1.82e-04 | (3804.16 ms | 137820 tok/s) step 16656/76294 | train loss 3.383530 | norm 0.6988 | lr 1.82e-04 | (3868.28 ms | 135535 tok/s) step 16657/76294 | train loss 3.308364 | norm 0.5962 | lr 1.82e-04 | (3798.04 ms | 138042 tok/s) step 16658/76294 | train loss 3.400966 | norm 0.8083 | lr 1.82e-04 | (3806.89 ms | 137721 tok/s) step 16659/76294 | train loss 3.361005 | norm 0.6311 | lr 1.82e-04 | (3796.91 ms | 138083 tok/s) step 16660/76294 | train loss 3.397394 | norm 0.6676 | lr 1.82e-04 | (3810.36 ms | 137595 tok/s) step 16661/76294 | train loss 3.333383 | norm 0.7426 | lr 1.82e-04 | (3813.21 ms | 137493 tok/s) step 16662/76294 | train loss 3.404691 | norm 0.7745 | lr 1.82e-04 | (3823.53 ms | 137121 tok/s) step 16663/76294 | train loss 3.332615 | norm 0.7616 | lr 1.82e-04 | (3796.92 ms | 138082 tok/s) step 16664/76294 | train loss 3.403340 | norm 0.8022 | lr 1.82e-04 | (3845.49 ms | 136338 tok/s) step 16665/76294 | train loss 3.480153 | norm 1.3004 | lr 1.82e-04 | (3799.90 ms | 137974 tok/s) step 16666/76294 | train loss 3.413231 | norm 1.1825 | lr 1.82e-04 | (3801.21 ms | 137927 tok/s) step 16667/76294 | train loss 3.311913 | norm 1.3078 | lr 1.82e-04 | (3800.68 ms | 137946 tok/s) step 16668/76294 | train loss 3.421991 | norm 0.9098 | lr 1.82e-04 | (3825.66 ms | 137045 tok/s) step 16669/76294 | train loss 3.289858 | norm 0.8878 | lr 1.82e-04 | (3816.38 ms | 137378 tok/s) step 16670/76294 | train loss 3.384048 | norm 1.1713 | lr 1.82e-04 | (3905.49 ms | 134244 tok/s) step 16671/76294 | train loss 3.443802 | norm 1.1450 | lr 1.81e-04 | (3819.99 ms | 137248 tok/s) step 16672/76294 | train loss 3.365952 | norm 0.8681 | lr 1.81e-04 | (3797.67 ms | 138055 tok/s) step 16673/76294 | train loss 3.319400 | norm 1.6094 | lr 1.81e-04 | (3821.90 
ms | 137180 tok/s) step 16674/76294 | train loss 3.323990 | norm 0.7564 | lr 1.81e-04 | (3801.19 ms | 137927 tok/s) step 16675/76294 | train loss 3.381227 | norm 0.8271 | lr 1.81e-04 | (3803.43 ms | 137846 tok/s) step 16676/76294 | train loss 3.333950 | norm 1.0384 | lr 1.81e-04 | (3835.24 ms | 136703 tok/s) step 16677/76294 | train loss 3.419687 | norm 0.9795 | lr 1.81e-04 | (3803.28 ms | 137852 tok/s) step 16678/76294 | train loss 3.349903 | norm 0.6133 | lr 1.81e-04 | (3815.72 ms | 137402 tok/s) step 16679/76294 | train loss 3.406202 | norm 0.5961 | lr 1.81e-04 | (3833.52 ms | 136764 tok/s) step 16680/76294 | train loss 3.344681 | norm 0.7121 | lr 1.81e-04 | (3803.72 ms | 137836 tok/s) step 16681/76294 | train loss 3.382217 | norm 0.5999 | lr 1.81e-04 | (3869.63 ms | 135488 tok/s) step 16682/76294 | train loss 3.321971 | norm 0.4925 | lr 1.81e-04 | (3798.53 ms | 138024 tok/s) step 16683/76294 | train loss 3.448164 | norm 0.9162 | lr 1.81e-04 | (3845.73 ms | 136330 tok/s) step 16684/76294 | train loss 3.359182 | norm 0.7499 | lr 1.81e-04 | (3812.31 ms | 137525 tok/s) step 16685/76294 | train loss 3.410652 | norm 0.7113 | lr 1.81e-04 | (3825.09 ms | 137066 tok/s) step 16686/76294 | train loss 3.413870 | norm 0.8028 | lr 1.81e-04 | (3809.51 ms | 137626 tok/s) step 16687/76294 | train loss 3.292299 | norm 0.5739 | lr 1.81e-04 | (3809.19 ms | 137638 tok/s) step 16688/76294 | train loss 3.372073 | norm 0.5662 | lr 1.81e-04 | (3809.33 ms | 137633 tok/s) step 16689/76294 | train loss 3.323631 | norm 0.9344 | lr 1.81e-04 | (3806.05 ms | 137751 tok/s) step 16690/76294 | train loss 3.439277 | norm 0.8432 | lr 1.81e-04 | (3815.43 ms | 137413 tok/s) step 16691/76294 | train loss 3.321506 | norm 0.7444 | lr 1.81e-04 | (3978.53 ms | 131779 tok/s) step 16692/76294 | train loss 3.455460 | norm 0.5950 | lr 1.81e-04 | (3802.69 ms | 137873 tok/s) step 16693/76294 | train loss 3.297870 | norm 0.7351 | lr 1.81e-04 | (3830.44 ms | 136874 tok/s) step 16694/76294 | train loss 3.753495 | 
norm 1.4919 | lr 1.81e-04 | (3798.76 ms | 138016 tok/s) step 16695/76294 | train loss 3.312721 | norm 0.6565 | lr 1.80e-04 | (3838.75 ms | 136578 tok/s) step 16696/76294 | train loss 3.416312 | norm 0.6242 | lr 1.80e-04 | (3804.51 ms | 137807 tok/s) step 16697/76294 | train loss 3.274390 | norm 0.6551 | lr 1.80e-04 | (3840.38 ms | 136520 tok/s) step 16698/76294 | train loss 3.405692 | norm 0.5684 | lr 1.80e-04 | (3825.13 ms | 137064 tok/s) step 16699/76294 | train loss 3.295832 | norm 0.5393 | lr 1.80e-04 | (3864.25 ms | 135677 tok/s) step 16700/76294 | train loss 3.377479 | norm 0.5501 | lr 1.80e-04 | (4212.89 ms | 124448 tok/s) step 16701/76294 | train loss 3.363021 | norm 0.5326 | lr 1.80e-04 | (3794.82 ms | 138159 tok/s) step 16702/76294 | train loss 3.387537 | norm 0.6358 | lr 1.80e-04 | (3802.07 ms | 137895 tok/s) step 16703/76294 | train loss 3.304742 | norm 0.7738 | lr 1.80e-04 | (3818.51 ms | 137302 tok/s) step 16704/76294 | train loss 3.393689 | norm 0.5827 | lr 1.80e-04 | (3803.08 ms | 137859 tok/s) step 16705/76294 | train loss 3.304332 | norm 0.6147 | lr 1.80e-04 | (3805.56 ms | 137769 tok/s) step 16706/76294 | train loss 3.487280 | norm 0.7689 | lr 1.80e-04 | (3946.59 ms | 132846 tok/s) step 16707/76294 | train loss 3.325446 | norm 0.7262 | lr 1.80e-04 | (3796.10 ms | 138112 tok/s) step 16708/76294 | train loss 3.358294 | norm 0.7334 | lr 1.80e-04 | (3805.77 ms | 137762 tok/s) step 16709/76294 | train loss 3.384774 | norm 0.7089 | lr 1.80e-04 | (3828.64 ms | 136938 tok/s) step 16710/76294 | train loss 3.226465 | norm 1.2797 | lr 1.80e-04 | (3802.67 ms | 137874 tok/s) step 16711/76294 | train loss 3.373586 | norm 0.7800 | lr 1.80e-04 | (3802.59 ms | 137876 tok/s) step 16712/76294 | train loss 3.325695 | norm 0.5550 | lr 1.80e-04 | (3801.22 ms | 137926 tok/s) step 16713/76294 | train loss 3.404968 | norm 0.6853 | lr 1.80e-04 | (3804.05 ms | 137824 tok/s) step 16714/76294 | train loss 3.240350 | norm 0.6814 | lr 1.80e-04 | (3809.68 ms | 137620 tok/s) 
step 16715/76294 | train loss 3.344816 | norm 0.6479 | lr 1.80e-04 | (3812.58 ms | 137515 tok/s) step 16716/76294 | train loss 3.390268 | norm 0.6161 | lr 1.80e-04 | (3806.64 ms | 137730 tok/s) step 16717/76294 | train loss 3.327547 | norm 0.7024 | lr 1.80e-04 | (3805.37 ms | 137776 tok/s) step 16718/76294 | train loss 3.408800 | norm 0.5934 | lr 1.80e-04 | (3840.66 ms | 136510 tok/s) step 16719/76294 | train loss 3.338388 | norm 0.6325 | lr 1.79e-04 | (3806.97 ms | 137718 tok/s) step 16720/76294 | train loss 3.275692 | norm 0.4760 | lr 1.79e-04 | (3809.53 ms | 137625 tok/s) step 16721/76294 | train loss 3.365979 | norm 0.7538 | lr 1.79e-04 | (3828.54 ms | 136942 tok/s) step 16722/76294 | train loss 3.320945 | norm 0.6360 | lr 1.79e-04 | (3806.51 ms | 137735 tok/s) step 16723/76294 | train loss 3.394880 | norm 0.6862 | lr 1.79e-04 | (3806.58 ms | 137732 tok/s) step 16724/76294 | train loss 3.409969 | norm 0.4713 | lr 1.79e-04 | (3835.45 ms | 136695 tok/s) step 16725/76294 | train loss 3.341359 | norm 0.8160 | lr 1.79e-04 | (3805.29 ms | 137779 tok/s) step 16726/76294 | train loss 3.484930 | norm 0.6092 | lr 1.79e-04 | (3809.35 ms | 137632 tok/s) step 16727/76294 | train loss 3.309708 | norm 1.2378 | lr 1.79e-04 | (3816.56 ms | 137372 tok/s) step 16728/76294 | train loss 3.392228 | norm 1.0228 | lr 1.79e-04 | (3801.57 ms | 137914 tok/s) step 16729/76294 | train loss 3.400789 | norm 0.6800 | lr 1.79e-04 | (3796.74 ms | 138089 tok/s) step 16730/76294 | train loss 3.308023 | norm 0.8090 | lr 1.79e-04 | (3825.47 ms | 137052 tok/s) step 16731/76294 | train loss 3.440487 | norm 0.6130 | lr 1.79e-04 | (3797.06 ms | 138078 tok/s) step 16732/76294 | train loss 3.403513 | norm 0.7040 | lr 1.79e-04 | (3805.54 ms | 137770 tok/s) step 16733/76294 | train loss 3.372875 | norm 0.5775 | lr 1.79e-04 | (3852.54 ms | 136089 tok/s) step 16734/76294 | train loss 3.326625 | norm 0.6744 | lr 1.79e-04 | (3797.36 ms | 138066 tok/s) step 16735/76294 | train loss 3.360281 | norm 0.6804 | lr 
1.79e-04 | (3804.11 ms | 137821 tok/s) step 16736/76294 | train loss 3.436039 | norm 0.5280 | lr 1.79e-04 | (3819.89 ms | 137252 tok/s) step 16737/76294 | train loss 3.373819 | norm 0.7394 | lr 1.79e-04 | (3802.64 ms | 137875 tok/s) step 16738/76294 | train loss 3.460431 | norm 0.4286 | lr 1.79e-04 | (4050.48 ms | 129438 tok/s) step 16739/76294 | train loss 3.392375 | norm 0.6948 | lr 1.79e-04 | (3828.30 ms | 136950 tok/s) step 16740/76294 | train loss 3.400199 | norm 0.6769 | lr 1.79e-04 | (3814.36 ms | 137451 tok/s) step 16741/76294 | train loss 3.357689 | norm 0.5992 | lr 1.79e-04 | (3798.88 ms | 138011 tok/s) step 16742/76294 | train loss 3.335136 | norm 0.9051 | lr 1.79e-04 | (3796.43 ms | 138100 tok/s) step 16743/76294 | train loss 3.358647 | norm 0.5898 | lr 1.78e-04 | (3822.55 ms | 137156 tok/s) step 16744/76294 | train loss 3.410061 | norm 0.6745 | lr 1.78e-04 | (3798.52 ms | 138024 tok/s) step 16745/76294 | train loss 3.360269 | norm 0.8633 | lr 1.78e-04 | (3814.55 ms | 137444 tok/s) step 16746/76294 | train loss 3.363479 | norm 0.8623 | lr 1.78e-04 | (3796.23 ms | 138108 tok/s) step 16747/76294 | train loss 3.376703 | norm 0.6645 | lr 1.78e-04 | (3818.93 ms | 137287 tok/s) step 16748/76294 | train loss 3.378438 | norm 1.0888 | lr 1.78e-04 | (3820.90 ms | 137216 tok/s) step 16749/76294 | train loss 3.485899 | norm 0.7959 | lr 1.78e-04 | (3805.48 ms | 137772 tok/s) step 16750/76294 | train loss 3.365641 | norm 1.0639 | lr 1.78e-04 | (3837.71 ms | 136615 tok/s) val loss: 3.387259 saving model checkpoint to ./results/gpt2-124M-gqa/step_16750.pth step 16751/76294 | train loss 3.376190 | norm 0.7460 | lr 1.78e-04 | (3810.94 ms | 137574 tok/s) step 16752/76294 | train loss 3.345473 | norm 1.1566 | lr 1.78e-04 | (3816.53 ms | 137373 tok/s) step 16753/76294 | train loss 3.341110 | norm 0.6695 | lr 1.78e-04 | (3808.52 ms | 137662 tok/s) step 16754/76294 | train loss 3.413521 | norm 0.7877 | lr 1.78e-04 | (3849.92 ms | 136181 tok/s) step 16755/76294 | train loss 
3.443782 | norm 0.6917 | lr 1.78e-04 | (3805.85 ms | 137758 tok/s) step 16756/76294 | train loss 3.311394 | norm 0.7526 | lr 1.78e-04 | (8228.28 ms | 63718 tok/s) step 16757/76294 | train loss 3.401991 | norm 0.4970 | lr 1.78e-04 | (3843.34 ms | 136415 tok/s) step 16758/76294 | train loss 3.357012 | norm 0.7453 | lr 1.78e-04 | (4010.45 ms | 130731 tok/s) step 16759/76294 | train loss 3.382531 | norm 0.5674 | lr 1.78e-04 | (3808.72 ms | 137655 tok/s) step 16760/76294 | train loss 3.364427 | norm 0.6745 | lr 1.78e-04 | (3788.69 ms | 138382 tok/s) step 16761/76294 | train loss 3.427051 | norm 0.5700 | lr 1.78e-04 | (3812.46 ms | 137520 tok/s) step 16762/76294 | train loss 3.386184 | norm 0.7264 | lr 1.78e-04 | (3791.53 ms | 138279 tok/s) step 16763/76294 | train loss 3.366794 | norm 0.7027 | lr 1.78e-04 | (3798.58 ms | 138022 tok/s) step 16764/76294 | train loss 3.385082 | norm 0.5735 | lr 1.78e-04 | (3790.82 ms | 138305 tok/s) step 16765/76294 | train loss 3.401051 | norm 0.8725 | lr 1.78e-04 | (3798.45 ms | 138027 tok/s) step 16766/76294 | train loss 3.357282 | norm 0.8620 | lr 1.78e-04 | (3818.24 ms | 137311 tok/s) step 16767/76294 | train loss 3.357832 | norm 0.8169 | lr 1.78e-04 | (3796.28 ms | 138106 tok/s) step 16768/76294 | train loss 3.392431 | norm 0.7628 | lr 1.77e-04 | (3794.15 ms | 138183 tok/s) step 16769/76294 | train loss 3.351506 | norm 0.7906 | lr 1.77e-04 | (3826.28 ms | 137023 tok/s) step 16770/76294 | train loss 3.342373 | norm 1.4704 | lr 1.77e-04 | (3794.29 ms | 138178 tok/s) step 16771/76294 | train loss 3.411194 | norm 1.8570 | lr 1.77e-04 | (3876.51 ms | 135248 tok/s) step 16772/76294 | train loss 3.307230 | norm 1.3580 | lr 1.77e-04 | (3797.01 ms | 138079 tok/s) step 16773/76294 | train loss 3.383580 | norm 1.3649 | lr 1.77e-04 | (3800.78 ms | 137942 tok/s) step 16774/76294 | train loss 3.429987 | norm 1.0898 | lr 1.77e-04 | (3820.99 ms | 137213 tok/s) step 16775/76294 | train loss 3.363412 | norm 1.0148 | lr 1.77e-04 | (3801.22 ms | 137926 
tok/s) step 16776/76294 | train loss 3.330224 | norm 1.2435 | lr 1.77e-04 | (3801.20 ms | 137927 tok/s) step 16777/76294 | train loss 3.388254 | norm 1.1330 | lr 1.77e-04 | (3806.24 ms | 137744 tok/s) step 16778/76294 | train loss 3.395415 | norm 0.9740 | lr 1.77e-04 | (3803.57 ms | 137841 tok/s) step 16779/76294 | train loss 3.365919 | norm 0.9917 | lr 1.77e-04 | (3798.79 ms | 138015 tok/s) step 16780/76294 | train loss 3.349049 | norm 0.6765 | lr 1.77e-04 | (3807.97 ms | 137682 tok/s) step 16781/76294 | train loss 3.405538 | norm 0.8253 | lr 1.77e-04 | (3797.67 ms | 138055 tok/s) step 16782/76294 | train loss 3.396076 | norm 1.2435 | lr 1.77e-04 | (3826.96 ms | 136998 tok/s) step 16783/76294 | train loss 3.338454 | norm 0.8391 | lr 1.77e-04 | (3798.06 ms | 138041 tok/s) step 16784/76294 | train loss 3.386001 | norm 0.8370 | lr 1.77e-04 | (3800.52 ms | 137952 tok/s) step 16785/76294 | train loss 3.352080 | norm 0.6091 | lr 1.77e-04 | (3825.74 ms | 137042 tok/s) step 16786/76294 | train loss 3.381954 | norm 0.4897 | lr 1.77e-04 | (3803.65 ms | 137838 tok/s) step 16787/76294 | train loss 3.334345 | norm 0.6751 | lr 1.77e-04 | (3800.27 ms | 137961 tok/s) step 16788/76294 | train loss 3.276992 | norm 0.9181 | lr 1.77e-04 | (3848.53 ms | 136231 tok/s) step 16789/76294 | train loss 3.390506 | norm 0.5013 | lr 1.77e-04 | (3805.82 ms | 137759 tok/s) step 16790/76294 | train loss 3.347634 | norm 0.6134 | lr 1.77e-04 | (3806.92 ms | 137720 tok/s) step 16791/76294 | train loss 3.382153 | norm 0.8195 | lr 1.77e-04 | (3796.91 ms | 138083 tok/s) step 16792/76294 | train loss 3.395728 | norm 0.8403 | lr 1.77e-04 | (3805.42 ms | 137774 tok/s) step 16793/76294 | train loss 3.279026 | norm 0.6163 | lr 1.76e-04 | (3822.23 ms | 137168 tok/s) step 16794/76294 | train loss 3.333609 | norm 0.9952 | lr 1.76e-04 | (3802.96 ms | 137863 tok/s) step 16795/76294 | train loss 3.361874 | norm 1.0257 | lr 1.76e-04 | (3900.41 ms | 134419 tok/s) step 16796/76294 | train loss 3.375368 | norm 0.8501 
| lr 1.76e-04 | (3805.10 ms | 137786 tok/s) step 16797/76294 | train loss 3.383244 | norm 0.7606 | lr 1.76e-04 | (3807.00 ms | 137717 tok/s) step 16798/76294 | train loss 3.344096 | norm 0.4975 | lr 1.76e-04 | (3799.67 ms | 137982 tok/s) step 16799/76294 | train loss 3.383309 | norm 0.7428 | lr 1.76e-04 | (3805.42 ms | 137774 tok/s) step 16800/76294 | train loss 3.345415 | norm 0.8252 | lr 1.76e-04 | (3830.94 ms | 136856 tok/s) step 16801/76294 | train loss 3.405861 | norm 0.5618 | lr 1.76e-04 | (3798.14 ms | 138038 tok/s) step 16802/76294 | train loss 3.366013 | norm 0.6002 | lr 1.76e-04 | (3809.81 ms | 137615 tok/s) step 16803/76294 | train loss 3.325322 | norm 0.5872 | lr 1.76e-04 | (3823.19 ms | 137134 tok/s) step 16804/76294 | train loss 3.410703 | norm 0.6173 | lr 1.76e-04 | (3806.60 ms | 137731 tok/s) step 16805/76294 | train loss 3.359083 | norm 0.7058 | lr 1.76e-04 | (3799.94 ms | 137973 tok/s) step 16806/76294 | train loss 3.428571 | norm 0.5110 | lr 1.76e-04 | (3918.16 ms | 133810 tok/s) step 16807/76294 | train loss 3.279078 | norm 0.5438 | lr 1.76e-04 | (3801.06 ms | 137932 tok/s) step 16808/76294 | train loss 3.400969 | norm 0.5963 | lr 1.76e-04 | (3809.87 ms | 137613 tok/s) step 16809/76294 | train loss 3.321909 | norm 0.9125 | lr 1.76e-04 | (3799.90 ms | 137974 tok/s) step 16810/76294 | train loss 3.341073 | norm 0.6773 | lr 1.76e-04 | (3834.82 ms | 136718 tok/s) step 16811/76294 | train loss 3.367151 | norm 0.7486 | lr 1.76e-04 | (3827.99 ms | 136962 tok/s) step 16812/76294 | train loss 3.282963 | norm 0.9481 | lr 1.76e-04 | (3801.21 ms | 137927 tok/s) step 16813/76294 | train loss 3.381625 | norm 0.6063 | lr 1.76e-04 | (3827.28 ms | 136987 tok/s) step 16814/76294 | train loss 3.346390 | norm 0.8481 | lr 1.76e-04 | (3801.95 ms | 137900 tok/s) step 16815/76294 | train loss 3.368252 | norm 0.5900 | lr 1.76e-04 | (4025.91 ms | 130228 tok/s) step 16816/76294 | train loss 3.390825 | norm 0.7637 | lr 1.76e-04 | (3798.31 ms | 138032 tok/s) step 
16817/76294 | train loss 3.392523 | norm 0.5723 | lr 1.76e-04 | (3808.59 ms | 137659 tok/s) step 16818/76294 | train loss 3.370522 | norm 0.7813 | lr 1.75e-04 | (3807.67 ms | 137693 tok/s) step 16819/76294 | train loss 3.404862 | norm 0.6353 | lr 1.75e-04 | (3812.47 ms | 137519 tok/s) step 16820/76294 | train loss 3.328577 | norm 0.8262 | lr 1.75e-04 | (3804.24 ms | 137817 tok/s) step 16821/76294 | train loss 3.425217 | norm 1.1066 | lr 1.75e-04 | (3809.96 ms | 137610 tok/s) step 16822/76294 | train loss 3.310862 | norm 0.7032 | lr 1.75e-04 | (3804.88 ms | 137794 tok/s) step 16823/76294 | train loss 3.349758 | norm 0.7129 | lr 1.75e-04 | (3810.83 ms | 137578 tok/s) step 16824/76294 | train loss 3.359202 | norm 0.7357 | lr 1.75e-04 | (3808.54 ms | 137661 tok/s) step 16825/76294 | train loss 3.349217 | norm 0.9928 | lr 1.75e-04 | (3815.98 ms | 137393 tok/s) step 16826/76294 | train loss 3.353353 | norm 0.8067 | lr 1.75e-04 | (3804.83 ms | 137795 tok/s) step 16827/76294 | train loss 3.407029 | norm 1.1300 | lr 1.75e-04 | (3811.19 ms | 137565 tok/s) step 16828/76294 | train loss 3.401050 | norm 1.5097 | lr 1.75e-04 | (3800.59 ms | 137949 tok/s) step 16829/76294 | train loss 3.393893 | norm 0.8972 | lr 1.75e-04 | (3805.63 ms | 137767 tok/s) step 16830/76294 | train loss 3.381161 | norm 0.9188 | lr 1.75e-04 | (3826.96 ms | 136999 tok/s) step 16831/76294 | train loss 3.478460 | norm 1.1303 | lr 1.75e-04 | (3883.09 ms | 135018 tok/s) step 16832/76294 | train loss 3.272373 | norm 0.8203 | lr 1.75e-04 | (3839.69 ms | 136544 tok/s) step 16833/76294 | train loss 3.339187 | norm 0.8711 | lr 1.75e-04 | (3799.01 ms | 138006 tok/s) step 16834/76294 | train loss 3.367082 | norm 0.8150 | lr 1.75e-04 | (3809.48 ms | 137627 tok/s) step 16835/76294 | train loss 3.300055 | norm 0.5849 | lr 1.75e-04 | (3802.02 ms | 137897 tok/s) step 16836/76294 | train loss 3.332015 | norm 0.7408 | lr 1.75e-04 | (3805.56 ms | 137769 tok/s) step 16837/76294 | train loss 3.339983 | norm 0.8673 | lr 
1.75e-04 | (3828.71 ms | 136936 tok/s) step 16838/76294 | train loss 3.375079 | norm 0.6077 | lr 1.75e-04 | (3803.81 ms | 137832 tok/s) step 16839/76294 | train loss 3.375628 | norm 0.9570 | lr 1.75e-04 | (3809.11 ms | 137641 tok/s) step 16840/76294 | train loss 3.361353 | norm 0.7529 | lr 1.75e-04 | (3804.70 ms | 137800 tok/s) step 16841/76294 | train loss 3.419580 | norm 0.6804 | lr 1.75e-04 | (3809.28 ms | 137635 tok/s) step 16842/76294 | train loss 3.408403 | norm 0.7676 | lr 1.75e-04 | (3804.68 ms | 137801 tok/s) step 16843/76294 | train loss 3.428238 | norm 0.7788 | lr 1.74e-04 | (3800.26 ms | 137961 tok/s) step 16844/76294 | train loss 3.434965 | norm 0.9365 | lr 1.74e-04 | (3835.41 ms | 136697 tok/s) step 16845/76294 | train loss 3.312049 | norm 0.8931 | lr 1.74e-04 | (3801.31 ms | 137923 tok/s) step 16846/76294 | train loss 3.496898 | norm 1.4700 | lr 1.74e-04 | (3833.88 ms | 136751 tok/s) step 16847/76294 | train loss 3.401400 | norm 1.0183 | lr 1.74e-04 | (3812.51 ms | 137518 tok/s) step 16848/76294 | train loss 3.440659 | norm 1.1519 | lr 1.74e-04 | (3805.04 ms | 137788 tok/s) step 16849/76294 | train loss 3.365471 | norm 1.0314 | lr 1.74e-04 | (3826.79 ms | 137005 tok/s) step 16850/76294 | train loss 3.274951 | norm 0.6315 | lr 1.74e-04 | (3803.77 ms | 137834 tok/s) step 16851/76294 | train loss 3.441431 | norm 0.8830 | lr 1.74e-04 | (3807.00 ms | 137717 tok/s) step 16852/76294 | train loss 3.351123 | norm 1.4079 | lr 1.74e-04 | (3809.17 ms | 137638 tok/s) step 16853/76294 | train loss 3.371928 | norm 1.0535 | lr 1.74e-04 | (4412.93 ms | 118807 tok/s) step 16854/76294 | train loss 3.366958 | norm 0.6386 | lr 1.74e-04 | (3803.95 ms | 137827 tok/s) step 16855/76294 | train loss 3.344461 | norm 0.5728 | lr 1.74e-04 | (3933.20 ms | 133298 tok/s) step 16856/76294 | train loss 3.506025 | norm 0.5538 | lr 1.74e-04 | (3798.87 ms | 138012 tok/s) step 16857/76294 | train loss 3.340858 | norm 0.6973 | lr 1.74e-04 | (3834.86 ms | 136716 tok/s) step 16858/76294 | 
train loss 3.338942 | norm 0.4889 | lr 1.74e-04 | (3802.22 ms | 137890 tok/s) step 16859/76294 | train loss 3.844113 | norm 0.9572 | lr 1.74e-04 | (3815.64 ms | 137405 tok/s) step 16860/76294 | train loss 3.332857 | norm 0.8780 | lr 1.74e-04 | (3801.53 ms | 137915 tok/s) step 16861/76294 | train loss 3.408824 | norm 0.7453 | lr 1.74e-04 | (3830.59 ms | 136869 tok/s) step 16862/76294 | train loss 3.309182 | norm 0.6181 | lr 1.74e-04 | (3848.01 ms | 136249 tok/s) step 16863/76294 | train loss 3.362213 | norm 0.4777 | lr 1.74e-04 | (3802.39 ms | 137884 tok/s) step 16864/76294 | train loss 3.388974 | norm 1.0355 | lr 1.74e-04 | (3829.80 ms | 136897 tok/s) step 16865/76294 | train loss 3.279686 | norm 0.8263 | lr 1.74e-04 | (3803.16 ms | 137856 tok/s) step 16866/76294 | train loss 3.530103 | norm 0.7596 | lr 1.74e-04 | (3826.86 ms | 137002 tok/s) step 16867/76294 | train loss 3.411645 | norm 0.8409 | lr 1.74e-04 | (3806.66 ms | 137729 tok/s) step 16868/76294 | train loss 3.373038 | norm 0.7303 | lr 1.74e-04 | (3807.10 ms | 137713 tok/s) step 16869/76294 | train loss 3.338652 | norm 0.6657 | lr 1.73e-04 | (4077.69 ms | 128575 tok/s) step 16870/76294 | train loss 3.330796 | norm 0.5647 | lr 1.73e-04 | (3854.62 ms | 136016 tok/s) step 16871/76294 | train loss 3.344230 | norm 0.6248 | lr 1.73e-04 | (3802.51 ms | 137879 tok/s) step 16872/76294 | train loss 3.355153 | norm 0.4952 | lr 1.73e-04 | (3830.50 ms | 136872 tok/s) step 16873/76294 | train loss 3.362358 | norm 0.6717 | lr 1.73e-04 | (3800.13 ms | 137966 tok/s) step 16874/76294 | train loss 3.418911 | norm 0.8356 | lr 1.73e-04 | (3802.52 ms | 137879 tok/s) step 16875/76294 | train loss 3.342753 | norm 0.8773 | lr 1.73e-04 | (3822.41 ms | 137162 tok/s) step 16876/76294 | train loss 3.425169 | norm 0.7596 | lr 1.73e-04 | (3809.55 ms | 137625 tok/s) step 16877/76294 | train loss 3.375144 | norm 0.9758 | lr 1.73e-04 | (3815.07 ms | 137425 tok/s) step 16878/76294 | train loss 3.335629 | norm 0.6970 | lr 1.73e-04 | (3806.89 
ms | 137721 tok/s) step 16879/76294 | train loss 3.369425 | norm 0.6948 | lr 1.73e-04 | (3799.76 ms | 137979 tok/s) step 16880/76294 | train loss 3.298556 | norm 0.8573 | lr 1.73e-04 | (3838.10 ms | 136601 tok/s) step 16881/76294 | train loss 3.339260 | norm 1.4631 | lr 1.73e-04 | (3801.20 ms | 137927 tok/s) step 16882/76294 | train loss 3.333087 | norm 1.4124 | lr 1.73e-04 | (3902.34 ms | 134352 tok/s) step 16883/76294 | train loss 3.331782 | norm 1.5272 | lr 1.73e-04 | (3798.54 ms | 138023 tok/s) step 16884/76294 | train loss 3.413823 | norm 2.1681 | lr 1.73e-04 | (3806.70 ms | 137728 tok/s) step 16885/76294 | train loss 3.324403 | norm 1.3635 | lr 1.73e-04 | (3835.99 ms | 136676 tok/s) step 16886/76294 | train loss 3.303353 | norm 0.7260 | lr 1.73e-04 | (3814.08 ms | 137461 tok/s) step 16887/76294 | train loss 3.409528 | norm 0.6479 | lr 1.73e-04 | (3825.23 ms | 137060 tok/s) step 16888/76294 | train loss 3.328907 | norm 0.9123 | lr 1.73e-04 | (3801.41 ms | 137919 tok/s) step 16889/76294 | train loss 3.347327 | norm 0.9371 | lr 1.73e-04 | (3812.81 ms | 137507 tok/s) step 16890/76294 | train loss 3.360959 | norm 0.7031 | lr 1.73e-04 | (3839.81 ms | 136540 tok/s) step 16891/76294 | train loss 3.403208 | norm 0.8512 | lr 1.73e-04 | (4124.34 ms | 127120 tok/s) step 16892/76294 | train loss 3.324840 | norm 0.7413 | lr 1.73e-04 | (3824.89 ms | 137073 tok/s) step 16893/76294 | train loss 3.396193 | norm 1.5035 | lr 1.73e-04 | (3827.99 ms | 136962 tok/s) step 16894/76294 | train loss 3.340475 | norm 1.0750 | lr 1.72e-04 | (3807.02 ms | 137716 tok/s) step 16895/76294 | train loss 3.394749 | norm 1.1680 | lr 1.72e-04 | (3806.46 ms | 137736 tok/s) step 16896/76294 | train loss 3.361151 | norm 0.8430 | lr 1.72e-04 | (3807.83 ms | 137687 tok/s) step 16897/76294 | train loss 3.434965 | norm 0.7961 | lr 1.72e-04 | (3808.04 ms | 137679 tok/s) step 16898/76294 | train loss 3.382816 | norm 0.9511 | lr 1.72e-04 | (3801.24 ms | 137926 tok/s) step 16899/76294 | train loss 3.400951 | 
norm 1.0705 | lr 1.72e-04 | (3911.00 ms | 134055 tok/s) step 16900/76294 | train loss 3.384272 | norm 1.0247 | lr 1.72e-04 | (3840.30 ms | 136523 tok/s) step 16901/76294 | train loss 3.366004 | norm 1.4445 | lr 1.72e-04 | (3802.82 ms | 137868 tok/s) step 16902/76294 | train loss 3.478985 | norm 1.0332 | lr 1.72e-04 | (3831.33 ms | 136842 tok/s) step 16903/76294 | train loss 3.358439 | norm 0.8827 | lr 1.72e-04 | (3806.66 ms | 137729 tok/s) step 16904/76294 | train loss 3.398202 | norm 0.7619 | lr 1.72e-04 | (6230.39 ms | 84150 tok/s) step 16905/76294 | train loss 3.335299 | norm 1.0863 | lr 1.72e-04 | (3867.22 ms | 135572 tok/s) step 16906/76294 | train loss 3.382374 | norm 1.7797 | lr 1.72e-04 | (3819.34 ms | 137272 tok/s) step 16907/76294 | train loss 3.393120 | norm 1.1457 | lr 1.72e-04 | (3795.41 ms | 138137 tok/s) step 16908/76294 | train loss 3.357454 | norm 0.9680 | lr 1.72e-04 | (3801.63 ms | 137911 tok/s) step 16909/76294 | train loss 3.469445 | norm 1.4784 | lr 1.72e-04 | (3821.68 ms | 137188 tok/s) step 16910/76294 | train loss 3.332026 | norm 1.0480 | lr 1.72e-04 | (3799.60 ms | 137985 tok/s) step 16911/76294 | train loss 3.360862 | norm 1.2804 | lr 1.72e-04 | (3807.37 ms | 137703 tok/s) step 16912/76294 | train loss 3.341763 | norm 1.1527 | lr 1.72e-04 | (3802.59 ms | 137877 tok/s) step 16913/76294 | train loss 3.447837 | norm 0.9250 | lr 1.72e-04 | (3807.27 ms | 137707 tok/s) step 16914/76294 | train loss 3.381991 | norm 1.0312 | lr 1.72e-04 | (3800.50 ms | 137953 tok/s) step 16915/76294 | train loss 3.326142 | norm 0.8023 | lr 1.72e-04 | (3804.48 ms | 137808 tok/s) step 16916/76294 | train loss 3.473334 | norm 0.8908 | lr 1.72e-04 | (3803.00 ms | 137862 tok/s) step 16917/76294 | train loss 3.374254 | norm 1.2179 | lr 1.72e-04 | (3798.95 ms | 138009 tok/s) step 16918/76294 | train loss 3.394960 | norm 0.6360 | lr 1.72e-04 | (3848.08 ms | 136247 tok/s) step 16919/76294 | train loss 3.406013 | norm 0.7134 | lr 1.72e-04 | (3798.24 ms | 138034 tok/s) step 
16920/76294 | train loss 3.349161 | norm 0.7642 | lr 1.71e-04 | (3941.39 ms | 133021 tok/s) step 16921/76294 | train loss 3.420509 | norm 0.9864 | lr 1.71e-04 | (3833.97 ms | 136748 tok/s) step 16922/76294 | train loss 3.347727 | norm 0.8431 | lr 1.71e-04 | (3803.83 ms | 137831 tok/s) step 16923/76294 | train loss 3.422441 | norm 0.8625 | lr 1.71e-04 | (3809.47 ms | 137628 tok/s) step 16924/76294 | train loss 3.340400 | norm 0.7104 | lr 1.71e-04 | (3802.33 ms | 137886 tok/s) step 16925/76294 | train loss 3.393408 | norm 0.8115 | lr 1.71e-04 | (3805.45 ms | 137773 tok/s) step 16926/76294 | train loss 3.329110 | norm 1.2193 | lr 1.71e-04 | (3803.00 ms | 137862 tok/s) step 16927/76294 | train loss 3.340317 | norm 0.5623 | lr 1.71e-04 | (3838.42 ms | 136589 tok/s) step 16928/76294 | train loss 3.409175 | norm 0.9253 | lr 1.71e-04 | (3804.29 ms | 137815 tok/s) step 16929/76294 | train loss 3.339783 | norm 0.8863 | lr 1.71e-04 | (3813.03 ms | 137499 tok/s) step 16930/76294 | train loss 3.437661 | norm 1.3274 | lr 1.71e-04 | (3926.76 ms | 133517 tok/s) step 16931/76294 | train loss 3.319033 | norm 0.7335 | lr 1.71e-04 | (3802.52 ms | 137879 tok/s) step 16932/76294 | train loss 3.381863 | norm 0.8999 | lr 1.71e-04 | (3836.64 ms | 136653 tok/s) step 16933/76294 | train loss 3.416486 | norm 1.4650 | lr 1.71e-04 | (3800.51 ms | 137952 tok/s) step 16934/76294 | train loss 3.336730 | norm 0.6531 | lr 1.71e-04 | (3861.26 ms | 135781 tok/s) step 16935/76294 | train loss 3.389765 | norm 0.6512 | lr 1.71e-04 | (3802.39 ms | 137884 tok/s) step 16936/76294 | train loss 3.329728 | norm 0.9912 | lr 1.71e-04 | (3802.00 ms | 137898 tok/s) step 16937/76294 | train loss 3.346539 | norm 0.6308 | lr 1.71e-04 | (3826.40 ms | 137019 tok/s) step 16938/76294 | train loss 3.369766 | norm 0.6539 | lr 1.71e-04 | (3810.54 ms | 137589 tok/s) step 16939/76294 | train loss 3.398422 | norm 0.8008 | lr 1.71e-04 | (3824.15 ms | 137099 tok/s) step 16940/76294 | train loss 3.394956 | norm 1.0086 | lr 
1.71e-04 | (3827.47 ms | 136980 tok/s) step 16941/76294 | train loss 3.316021 | norm 0.9137 | lr 1.71e-04 | (3835.03 ms | 136710 tok/s) step 16942/76294 | train loss 3.451270 | norm 1.5993 | lr 1.71e-04 | (3829.16 ms | 136920 tok/s) step 16943/76294 | train loss 3.324582 | norm 1.1982 | lr 1.71e-04 | (3806.91 ms | 137720 tok/s) step 16944/76294 | train loss 3.466419 | norm 1.0414 | lr 1.71e-04 | (3833.30 ms | 136772 tok/s) step 16945/76294 | train loss 3.349279 | norm 1.9871 | lr 1.71e-04 | (3843.03 ms | 136426 tok/s) step 16946/76294 | train loss 3.301640 | norm 1.2498 | lr 1.71e-04 | (3832.56 ms | 136798 tok/s) step 16947/76294 | train loss 3.439266 | norm 1.0466 | lr 1.70e-04 | (3804.23 ms | 137817 tok/s) step 16948/76294 | train loss 3.336279 | norm 1.7780 | lr 1.70e-04 | (3802.90 ms | 137865 tok/s) step 16949/76294 | train loss 3.385976 | norm 0.7803 | lr 1.70e-04 | (3835.81 ms | 136683 tok/s) step 16950/76294 | train loss 3.441642 | norm 0.8711 | lr 1.70e-04 | (3804.69 ms | 137800 tok/s) step 16951/76294 | train loss 3.349526 | norm 0.6639 | lr 1.70e-04 | (3812.65 ms | 137513 tok/s) step 16952/76294 | train loss 3.351265 | norm 0.7115 | lr 1.70e-04 | (3803.70 ms | 137836 tok/s) step 16953/76294 | train loss 3.340391 | norm 0.6608 | lr 1.70e-04 | (3810.26 ms | 137599 tok/s) step 16954/76294 | train loss 3.391470 | norm 0.6807 | lr 1.70e-04 | (3826.65 ms | 137010 tok/s) step 16955/76294 | train loss 3.360395 | norm 0.7591 | lr 1.70e-04 | (3866.78 ms | 135588 tok/s) step 16956/76294 | train loss 3.342521 | norm 0.8409 | lr 1.70e-04 | (4960.36 ms | 105696 tok/s) step 16957/76294 | train loss 3.342133 | norm 0.9159 | lr 1.70e-04 | (4182.17 ms | 125363 tok/s) step 16958/76294 | train loss 3.298157 | norm 0.6867 | lr 1.70e-04 | (3792.47 ms | 138244 tok/s) step 16959/76294 | train loss 3.441169 | norm 0.9038 | lr 1.70e-04 | (3799.63 ms | 137984 tok/s) step 16960/76294 | train loss 3.343734 | norm 1.1181 | lr 1.70e-04 | (3822.11 ms | 137172 tok/s) step 16961/76294 | 
train loss 3.362297 | norm 0.9295 | lr 1.70e-04 | (3799.10 ms | 138003 tok/s) step 16962/76294 | train loss 3.358704 | norm 0.8457 | lr 1.70e-04 | (3797.94 ms | 138045 tok/s) step 16963/76294 | train loss 3.293355 | norm 0.6736 | lr 1.70e-04 | (3865.68 ms | 135626 tok/s) step 16964/76294 | train loss 3.407848 | norm 1.1396 | lr 1.70e-04 | (3799.16 ms | 138001 tok/s) step 16965/76294 | train loss 3.278854 | norm 0.9698 | lr 1.70e-04 | (3825.85 ms | 137038 tok/s) step 16966/76294 | train loss 3.416563 | norm 1.0333 | lr 1.70e-04 | (3801.10 ms | 137931 tok/s) step 16967/76294 | train loss 3.334462 | norm 0.8757 | lr 1.70e-04 | (3804.62 ms | 137803 tok/s) step 16968/76294 | train loss 3.379173 | norm 0.9109 | lr 1.70e-04 | (3824.31 ms | 137094 tok/s) step 16969/76294 | train loss 3.506760 | norm 0.7619 | lr 1.70e-04 | (3883.58 ms | 135001 tok/s) step 16970/76294 | train loss 3.347208 | norm 1.0595 | lr 1.70e-04 | (3821.54 ms | 137193 tok/s) step 16971/76294 | train loss 3.406281 | norm 0.8493 | lr 1.70e-04 | (3801.32 ms | 137922 tok/s) step 16972/76294 | train loss 3.357219 | norm 0.5645 | lr 1.70e-04 | (3801.17 ms | 137928 tok/s) step 16973/76294 | train loss 3.286595 | norm 0.8366 | lr 1.69e-04 | (3864.14 ms | 135680 tok/s) step 16974/76294 | train loss 3.348220 | norm 0.6589 | lr 1.69e-04 | (3799.77 ms | 137979 tok/s) step 16975/76294 | train loss 3.383946 | norm 0.8548 | lr 1.69e-04 | (3851.60 ms | 136122 tok/s) step 16976/76294 | train loss 3.414349 | norm 0.7534 | lr 1.69e-04 | (3801.73 ms | 137908 tok/s) step 16977/76294 | train loss 3.357535 | norm 1.6799 | lr 1.69e-04 | (3817.91 ms | 137323 tok/s) step 16978/76294 | train loss 3.387175 | norm 0.7483 | lr 1.69e-04 | (3802.37 ms | 137885 tok/s) step 16979/76294 | train loss 3.398234 | norm 1.2959 | lr 1.69e-04 | (3910.23 ms | 134081 tok/s) step 16980/76294 | train loss 3.346547 | norm 0.8239 | lr 1.69e-04 | (3803.23 ms | 137853 tok/s) step 16981/76294 | train loss 3.588916 | norm 1.2381 | lr 1.69e-04 | (3858.75 
ms | 135870 tok/s) step 16982/76294 | train loss 3.302310 | norm 0.9267 | lr 1.69e-04 | (3899.50 ms | 134450 tok/s) step 16983/76294 | train loss 3.422295 | norm 0.6701 | lr 1.69e-04 | (3808.03 ms | 137680 tok/s) step 16984/76294 | train loss 3.385646 | norm 0.6928 | lr 1.69e-04 | (3805.49 ms | 137771 tok/s) step 16985/76294 | train loss 3.328381 | norm 0.6445 | lr 1.69e-04 | (3805.65 ms | 137766 tok/s) step 16986/76294 | train loss 3.403642 | norm 0.8838 | lr 1.69e-04 | (3823.66 ms | 137117 tok/s) step 16987/76294 | train loss 3.366004 | norm 0.8994 | lr 1.69e-04 | (3831.22 ms | 136846 tok/s) step 16988/76294 | train loss 3.401810 | norm 1.1946 | lr 1.69e-04 | (3861.11 ms | 135787 tok/s) step 16989/76294 | train loss 3.324016 | norm 1.3010 | lr 1.69e-04 | (3802.85 ms | 137867 tok/s) step 16990/76294 | train loss 3.333302 | norm 1.0388 | lr 1.69e-04 | (3842.50 ms | 136445 tok/s) step 16991/76294 | train loss 3.428450 | norm 1.1710 | lr 1.69e-04 | (3801.89 ms | 137902 tok/s) step 16992/76294 | train loss 3.374794 | norm 1.2397 | lr 1.69e-04 | (3812.53 ms | 137517 tok/s) step 16993/76294 | train loss 3.395696 | norm 1.0316 | lr 1.69e-04 | (3886.80 ms | 134889 tok/s) step 16994/76294 | train loss 3.275605 | norm 2.2014 | lr 1.69e-04 | (3840.09 ms | 136530 tok/s) step 16995/76294 | train loss 3.302660 | norm 0.9214 | lr 1.69e-04 | (3801.77 ms | 137906 tok/s) step 16996/76294 | train loss 3.486039 | norm 0.7337 | lr 1.69e-04 | (3822.23 ms | 137168 tok/s) step 16997/76294 | train loss 3.373913 | norm 1.0845 | lr 1.69e-04 | (3806.14 ms | 137748 tok/s) step 16998/76294 | train loss 3.390936 | norm 1.2592 | lr 1.69e-04 | (3857.14 ms | 135927 tok/s) step 16999/76294 | train loss 3.323323 | norm 0.6594 | lr 1.69e-04 | (3889.74 ms | 134787 tok/s) step 17000/76294 | train loss 3.285510 | norm 0.9305 | lr 1.68e-04 | (3864.13 ms | 135681 tok/s) val loss: 3.375787 saving model checkpoint to ./results/gpt2-124M-gqa/step_17000.pth step 17001/76294 | train loss 3.418859 | norm 1.2162 
| lr 1.68e-04 | (3816.95 ms | 137358 tok/s) step 17002/76294 | train loss 3.306212 | norm 1.6332 | lr 1.68e-04 | (3849.02 ms | 136213 tok/s) step 17003/76294 | train loss 3.428149 | norm 0.8053 | lr 1.68e-04 | (3849.26 ms | 136205 tok/s) step 17004/76294 | train loss 3.361187 | norm 0.8701 | lr 1.68e-04 | (3863.70 ms | 135696 tok/s) step 17005/76294 | train loss 3.370495 | norm 0.8847 | lr 1.68e-04 | (3796.73 ms | 138089 tok/s) step 17006/76294 | train loss 3.360334 | norm 0.7414 | lr 1.68e-04 | (3812.96 ms | 137502 tok/s) step 17007/76294 | train loss 3.300248 | norm 1.1275 | lr 1.68e-04 | (3799.44 ms | 137991 tok/s) step 17008/76294 | train loss 3.464847 | norm 0.7220 | lr 1.68e-04 | (3803.27 ms | 137852 tok/s) step 17009/76294 | train loss 3.338353 | norm 0.8187 | lr 1.68e-04 | (3825.80 ms | 137040 tok/s) step 17010/76294 | train loss 3.395739 | norm 1.2431 | lr 1.68e-04 | (3802.28 ms | 137888 tok/s) step 17011/76294 | train loss 3.360751 | norm 0.8286 | lr 1.68e-04 | (3802.12 ms | 137894 tok/s) step 17012/76294 | train loss 3.331826 | norm 0.6606 | lr 1.68e-04 | (3861.23 ms | 135783 tok/s) step 17013/76294 | train loss 3.366686 | norm 1.1509 | lr 1.68e-04 | (3805.14 ms | 137784 tok/s) step 17014/76294 | train loss 3.337409 | norm 0.8589 | lr 1.68e-04 | (3824.40 ms | 137090 tok/s) step 17015/76294 | train loss 3.376964 | norm 1.0705 | lr 1.68e-04 | (3827.36 ms | 136984 tok/s) step 17016/76294 | train loss 3.341540 | norm 1.1557 | lr 1.68e-04 | (3808.78 ms | 137652 tok/s) step 17017/76294 | train loss 3.346362 | norm 0.7787 | lr 1.68e-04 | (3808.39 ms | 137667 tok/s) step 17018/76294 | train loss 3.406366 | norm 0.9956 | lr 1.68e-04 | (3812.73 ms | 137510 tok/s) step 17019/76294 | train loss 3.314752 | norm 0.7600 | lr 1.68e-04 | (3820.24 ms | 137239 tok/s) step 17020/76294 | train loss 3.381260 | norm 0.9351 | lr 1.68e-04 | (3814.69 ms | 137439 tok/s) step 17021/76294 | train loss 3.356647 | norm 0.7422 | lr 1.68e-04 | (3818.93 ms | 137287 tok/s) step 
17022/76294 | train loss 3.369148 | norm 1.1014 | lr 1.68e-04 | (3811.25 ms | 137563 tok/s) step 17023/76294 | train loss 3.372647 | norm 1.3088 | lr 1.68e-04 | (3820.39 ms | 137234 tok/s) step 17024/76294 | train loss 3.336483 | norm 0.6551 | lr 1.68e-04 | (3815.17 ms | 137422 tok/s) step 17025/76294 | train loss 3.430532 | norm 0.7308 | lr 1.68e-04 | (3812.28 ms | 137526 tok/s) step 17026/76294 | train loss 3.450143 | norm 1.0661 | lr 1.68e-04 | (3807.21 ms | 137709 tok/s) step 17027/76294 | train loss 3.421441 | norm 0.6379 | lr 1.67e-04 | (3802.89 ms | 137866 tok/s) step 17028/76294 | train loss 3.371748 | norm 0.6478 | lr 1.67e-04 | (3835.88 ms | 136680 tok/s) step 17029/76294 | train loss 3.359542 | norm 1.0801 | lr 1.67e-04 | (3804.17 ms | 137819 tok/s) step 17030/76294 | train loss 3.349413 | norm 1.0354 | lr 1.67e-04 | (3854.56 ms | 136017 tok/s) step 17031/76294 | train loss 3.371014 | norm 0.8045 | lr 1.67e-04 | (3807.33 ms | 137705 tok/s) step 17032/76294 | train loss 3.413636 | norm 0.8009 | lr 1.67e-04 | (3826.89 ms | 137001 tok/s) step 17033/76294 | train loss 3.308755 | norm 0.6963 | lr 1.67e-04 | (3805.07 ms | 137787 tok/s) step 17034/76294 | train loss 3.341776 | norm 0.6011 | lr 1.67e-04 | (3831.08 ms | 136851 tok/s) step 17035/76294 | train loss 3.375607 | norm 1.0925 | lr 1.67e-04 | (3799.50 ms | 137989 tok/s) step 17036/76294 | train loss 3.339257 | norm 0.9327 | lr 1.67e-04 | (3814.35 ms | 137452 tok/s) step 17037/76294 | train loss 3.389045 | norm 0.8313 | lr 1.67e-04 | (3842.28 ms | 136452 tok/s) step 17038/76294 | train loss 3.367443 | norm 1.2137 | lr 1.67e-04 | (3807.41 ms | 137702 tok/s) step 17039/76294 | train loss 3.356841 | norm 0.9693 | lr 1.67e-04 | (3801.03 ms | 137933 tok/s) step 17040/76294 | train loss 3.420116 | norm 0.6875 | lr 1.67e-04 | (3832.00 ms | 136818 tok/s) step 17041/76294 | train loss 3.295574 | norm 1.2367 | lr 1.67e-04 | (3832.55 ms | 136799 tok/s) step 17042/76294 | train loss 3.395765 | norm 0.9769 | lr 
1.67e-04 | (3809.81 ms | 137615 tok/s) step 17043/76294 | train loss 3.346101 | norm 1.0095 | lr 1.67e-04 | (3809.59 ms | 137623 tok/s) step 17044/76294 | train loss 3.425180 | norm 0.9915 | lr 1.67e-04 | (3806.04 ms | 137752 tok/s) step 17045/76294 | train loss 3.439975 | norm 1.4486 | lr 1.67e-04 | (3901.81 ms | 134370 tok/s) step 17046/76294 | train loss 3.315104 | norm 0.9362 | lr 1.67e-04 | (3820.49 ms | 137231 tok/s) step 17047/76294 | train loss 3.373726 | norm 0.8532 | lr 1.67e-04 | (3804.51 ms | 137807 tok/s) step 17048/76294 | train loss 3.336025 | norm 0.6520 | lr 1.67e-04 | (3825.08 ms | 137066 tok/s) step 17049/76294 | train loss 3.393764 | norm 0.9544 | lr 1.67e-04 | (3817.02 ms | 137355 tok/s) step 17050/76294 | train loss 3.307347 | norm 0.7758 | lr 1.67e-04 | (3807.77 ms | 137689 tok/s) step 17051/76294 | train loss 3.500863 | norm 0.5881 | lr 1.67e-04 | (3841.63 ms | 136475 tok/s) step 17052/76294 | train loss 3.356731 | norm 0.8608 | lr 1.67e-04 | (3800.61 ms | 137948 tok/s) step 17053/76294 | train loss 3.330394 | norm 0.8716 | lr 1.67e-04 | (3979.23 ms | 131756 tok/s) step 17054/76294 | train loss 3.399062 | norm 0.9793 | lr 1.66e-04 | (3845.20 ms | 136349 tok/s) step 17055/76294 | train loss 3.340244 | norm 0.7350 | lr 1.66e-04 | (3807.84 ms | 137686 tok/s) step 17056/76294 | train loss 3.401260 | norm 0.5931 | lr 1.66e-04 | (3824.06 ms | 137103 tok/s) step 17057/76294 | train loss 3.372794 | norm 0.7029 | lr 1.66e-04 | (3803.59 ms | 137840 tok/s) step 17058/76294 | train loss 3.371025 | norm 0.7567 | lr 1.66e-04 | (3803.59 ms | 137840 tok/s) step 17059/76294 | train loss 3.404125 | norm 0.8246 | lr 1.66e-04 | (3827.93 ms | 136964 tok/s) step 17060/76294 | train loss 3.381570 | norm 0.7834 | lr 1.66e-04 | (3804.26 ms | 137816 tok/s) step 17061/76294 | train loss 3.335844 | norm 0.5894 | lr 1.66e-04 | (3807.52 ms | 137698 tok/s) step 17062/76294 | train loss 3.407815 | norm 0.8624 | lr 1.66e-04 | (3825.46 ms | 137052 tok/s) step 17063/76294 | 
train loss 3.331584 | norm 0.6447 | lr 1.66e-04 | (3807.04 ms | 137715 tok/s) step 17064/76294 | train loss 3.415526 | norm 0.7246 | lr 1.66e-04 | (3821.11 ms | 137208 tok/s) step 17065/76294 | train loss 3.356790 | norm 0.5072 | lr 1.66e-04 | (3845.27 ms | 136346 tok/s) step 17066/76294 | train loss 3.365700 | norm 0.6342 | lr 1.66e-04 | (3802.99 ms | 137862 tok/s) step 17067/76294 | train loss 3.424910 | norm 0.8061 | lr 1.66e-04 | (3888.59 ms | 134827 tok/s) step 17068/76294 | train loss 3.409567 | norm 0.9564 | lr 1.66e-04 | (3806.80 ms | 137724 tok/s) step 17069/76294 | train loss 3.377039 | norm 0.6969 | lr 1.66e-04 | (3808.91 ms | 137648 tok/s) step 17070/76294 | train loss 3.312986 | norm 1.1582 | lr 1.66e-04 | (3823.61 ms | 137118 tok/s) step 17071/76294 | train loss 3.329389 | norm 0.6430 | lr 1.66e-04 | (3804.79 ms | 137797 tok/s) step 17072/76294 | train loss 3.357127 | norm 0.8032 | lr 1.66e-04 | (3801.40 ms | 137920 tok/s) step 17073/76294 | train loss 3.344155 | norm 0.9400 | lr 1.66e-04 | (3833.00 ms | 136783 tok/s) step 17074/76294 | train loss 3.380201 | norm 0.6776 | lr 1.66e-04 | (3803.13 ms | 137857 tok/s) step 17075/76294 | train loss 3.457346 | norm 0.7241 | lr 1.66e-04 | (3829.44 ms | 136910 tok/s) step 17076/76294 | train loss 3.326814 | norm 0.7216 | lr 1.66e-04 | (3805.74 ms | 137763 tok/s) step 17077/76294 | train loss 3.447666 | norm 0.8700 | lr 1.66e-04 | (3806.35 ms | 137740 tok/s) step 17078/76294 | train loss 3.324067 | norm 0.5874 | lr 1.66e-04 | (3901.72 ms | 134373 tok/s) step 17079/76294 | train loss 3.317841 | norm 0.6730 | lr 1.66e-04 | (3844.10 ms | 136388 tok/s) step 17080/76294 | train loss 3.364614 | norm 1.1094 | lr 1.66e-04 | (3809.60 ms | 137623 tok/s) step 17081/76294 | train loss 3.354351 | norm 1.0299 | lr 1.65e-04 | (4193.60 ms | 125021 tok/s) step 17082/76294 | train loss 3.416466 | norm 1.0656 | lr 1.65e-04 | (3803.83 ms | 137831 tok/s) step 17083/76294 | train loss 3.364975 | norm 0.9764 | lr 1.65e-04 | (3802.64 
ms | 137875 tok/s) step 17084/76294 | train loss 3.375531 | norm 0.9896 | lr 1.65e-04 | (3808.22 ms | 137673 tok/s) step 17085/76294 | train loss 3.451539 | norm 1.1755 | lr 1.65e-04 | (3821.76 ms | 137185 tok/s) step 17086/76294 | train loss 3.372097 | norm 1.0161 | lr 1.65e-04 | (4015.10 ms | 130579 tok/s) step 17087/76294 | train loss 3.395779 | norm 0.9172 | lr 1.65e-04 | (3901.10 ms | 134395 tok/s) step 17088/76294 | train loss 3.429049 | norm 2.6313 | lr 1.65e-04 | (3803.91 ms | 137829 tok/s) step 17089/76294 | train loss 3.440822 | norm 0.9841 | lr 1.65e-04 | (3821.51 ms | 137194 tok/s) step 17090/76294 | train loss 3.355367 | norm 0.7810 | lr 1.65e-04 | (3809.60 ms | 137623 tok/s) step 17091/76294 | train loss 3.398976 | norm 0.6567 | lr 1.65e-04 | (3802.44 ms | 137882 tok/s) step 17092/76294 | train loss 3.377135 | norm 0.8312 | lr 1.65e-04 | (3799.87 ms | 137975 tok/s) step 17093/76294 | train loss 3.381295 | norm 0.7766 | lr 1.65e-04 | (3833.69 ms | 136758 tok/s) step 17094/76294 | train loss 3.358316 | norm 0.6809 | lr 1.65e-04 | (3798.49 ms | 138025 tok/s) step 17095/76294 | train loss 3.412635 | norm 0.8453 | lr 1.65e-04 | (3807.40 ms | 137702 tok/s) step 17096/76294 | train loss 3.394915 | norm 0.6026 | lr 1.65e-04 | (3826.10 ms | 137029 tok/s) step 17097/76294 | train loss 3.431552 | norm 0.7607 | lr 1.65e-04 | (3803.37 ms | 137848 tok/s) step 17098/76294 | train loss 3.313810 | norm 0.6801 | lr 1.65e-04 | (3807.36 ms | 137704 tok/s) step 17099/76294 | train loss 3.415633 | norm 0.6715 | lr 1.65e-04 | (3807.92 ms | 137684 tok/s) step 17100/76294 | train loss 3.451145 | norm 0.7777 | lr 1.65e-04 | (3809.24 ms | 137636 tok/s) step 17101/76294 | train loss 3.360757 | norm 0.8301 | lr 1.65e-04 | (3804.76 ms | 137798 tok/s) step 17102/76294 | train loss 3.422854 | norm 0.6380 | lr 1.65e-04 | (3809.39 ms | 137631 tok/s) step 17103/76294 | train loss 3.333694 | norm 1.1399 | lr 1.65e-04 | (3808.54 ms | 137661 tok/s) step 17104/76294 | train loss 3.365258 | 
norm 0.6859 | lr 1.65e-04 | (3843.12 ms | 136422 tok/s) step 17105/76294 | train loss 3.344893 | norm 0.7359 | lr 1.65e-04 | (3799.16 ms | 138001 tok/s) step 17106/76294 | train loss 3.384726 | norm 0.7663 | lr 1.65e-04 | (3916.77 ms | 133857 tok/s) step 17107/76294 | train loss 3.313987 | norm 0.5163 | lr 1.65e-04 | (3798.41 ms | 138028 tok/s) step 17108/76294 | train loss 3.420165 | norm 0.9462 | lr 1.65e-04 | (3851.17 ms | 136137 tok/s) step 17109/76294 | train loss 3.422263 | norm 0.6124 | lr 1.64e-04 | (3799.91 ms | 137974 tok/s) step 17110/76294 | train loss 3.417128 | norm 0.7597 | lr 1.64e-04 | (3804.67 ms | 137801 tok/s) step 17111/76294 | train loss 3.323467 | norm 0.7405 | lr 1.64e-04 | (3820.02 ms | 137248 tok/s) step 17112/76294 | train loss 3.389575 | norm 1.3470 | lr 1.64e-04 | (3806.47 ms | 137736 tok/s) step 17113/76294 | train loss 3.397485 | norm 1.2269 | lr 1.64e-04 | (3800.97 ms | 137935 tok/s) step 17114/76294 | train loss 3.366603 | norm 1.1237 | lr 1.64e-04 | (3829.00 ms | 136926 tok/s) step 17115/76294 | train loss 3.398895 | norm 0.8501 | lr 1.64e-04 | (3804.02 ms | 137825 tok/s) step 17116/76294 | train loss 3.370856 | norm 0.7862 | lr 1.64e-04 | (3859.56 ms | 135842 tok/s) step 17117/76294 | train loss 3.387904 | norm 0.9582 | lr 1.64e-04 | (3823.57 ms | 137120 tok/s) step 17118/76294 | train loss 3.382586 | norm 1.1694 | lr 1.64e-04 | (3803.32 ms | 137850 tok/s) step 17119/76294 | train loss 3.327582 | norm 0.9794 | lr 1.64e-04 | (3807.80 ms | 137688 tok/s) step 17120/76294 | train loss 3.479098 | norm 1.0558 | lr 1.64e-04 | (3803.04 ms | 137860 tok/s) step 17121/76294 | train loss 3.436333 | norm 0.8684 | lr 1.64e-04 | (3805.91 ms | 137756 tok/s) step 17122/76294 | train loss 3.398601 | norm 0.9981 | lr 1.64e-04 | (3806.70 ms | 137728 tok/s) step 17123/76294 | train loss 3.416059 | norm 0.8750 | lr 1.64e-04 | (3806.19 ms | 137746 tok/s) step 17124/76294 | train loss 3.337484 | norm 1.0201 | lr 1.64e-04 | (3805.11 ms | 137785 tok/s) 
step 17125/76294 | train loss 3.382010 | norm 0.9606 | lr 1.64e-04 | (3875.26 ms | 135291 tok/s) step 17126/76294 | train loss 3.319755 | norm 0.8970 | lr 1.64e-04 | (3800.41 ms | 137956 tok/s) step 17127/76294 | train loss 3.359253 | norm 0.7105 | lr 1.64e-04 | (3875.74 ms | 135274 tok/s) step 17128/76294 | train loss 3.366276 | norm 0.5692 | lr 1.64e-04 | (3802.18 ms | 137891 tok/s) step 17129/76294 | train loss 3.383900 | norm 0.6086 | lr 1.64e-04 | (3878.75 ms | 135169 tok/s) step 17130/76294 | train loss 3.540627 | norm 1.0133 | lr 1.64e-04 | (3802.85 ms | 137867 tok/s) step 17131/76294 | train loss 3.362486 | norm 1.0293 | lr 1.64e-04 | (3809.33 ms | 137633 tok/s) step 17132/76294 | train loss 3.398419 | norm 0.6936 | lr 1.64e-04 | (3802.85 ms | 137867 tok/s) step 17133/76294 | train loss 3.390876 | norm 0.5522 | lr 1.64e-04 | (3829.67 ms | 136902 tok/s) step 17134/76294 | train loss 3.448239 | norm 0.6304 | lr 1.64e-04 | (3802.84 ms | 137867 tok/s) step 17135/76294 | train loss 3.353273 | norm 1.0549 | lr 1.64e-04 | (3807.95 ms | 137683 tok/s) step 17136/76294 | train loss 3.375942 | norm 0.9003 | lr 1.64e-04 | (3835.58 ms | 136691 tok/s) step 17137/76294 | train loss 3.434432 | norm 1.1604 | lr 1.63e-04 | (3828.44 ms | 136945 tok/s) step 17138/76294 | train loss 3.358794 | norm 0.5994 | lr 1.63e-04 | (3805.80 ms | 137760 tok/s) step 17139/76294 | train loss 3.402400 | norm 0.6467 | lr 1.63e-04 | (3806.22 ms | 137745 tok/s) step 17140/76294 | train loss 3.400625 | norm 1.1394 | lr 1.63e-04 | (3801.64 ms | 137911 tok/s) step 17141/76294 | train loss 3.329096 | norm 0.6697 | lr 1.63e-04 | (3834.17 ms | 136741 tok/s) step 17142/76294 | train loss 3.386292 | norm 1.0830 | lr 1.63e-04 | (3801.93 ms | 137900 tok/s) step 17143/76294 | train loss 3.398713 | norm 0.7683 | lr 1.63e-04 | (3807.75 ms | 137690 tok/s) step 17144/76294 | train loss 3.377359 | norm 1.0739 | lr 1.63e-04 | (3822.08 ms | 137173 tok/s) step 17145/76294 | train loss 3.393842 | norm 0.8393 | lr 
1.63e-04 | (4116.59 ms | 127360 tok/s) step 17146/76294 | train loss 3.373049 | norm 0.7369 | lr 1.63e-04 | (3808.83 ms | 137651 tok/s) step 17147/76294 | train loss 3.367086 | norm 0.8430 | lr 1.63e-04 | (3804.25 ms | 137816 tok/s) step 17148/76294 | train loss 3.453561 | norm 1.2778 | lr 1.63e-04 | (3801.78 ms | 137906 tok/s) step 17149/76294 | train loss 3.438376 | norm 1.6339 | lr 1.63e-04 | (3832.70 ms | 136794 tok/s) step 17150/76294 | train loss 3.354493 | norm 1.0715 | lr 1.63e-04 | (3802.63 ms | 137875 tok/s) step 17151/76294 | train loss 3.375397 | norm 0.7934 | lr 1.63e-04 | (3818.27 ms | 137310 tok/s) step 17152/76294 | train loss 3.348593 | norm 1.3074 | lr 1.63e-04 | (3850.67 ms | 136155 tok/s) step 17153/76294 | train loss 3.395016 | norm 0.8908 | lr 1.63e-04 | (3830.27 ms | 136880 tok/s) step 17154/76294 | train loss 3.436868 | norm 0.7372 | lr 1.63e-04 | (3798.85 ms | 138012 tok/s) step 17155/76294 | train loss 3.399036 | norm 1.5600 | lr 1.63e-04 | (3965.65 ms | 132207 tok/s) step 17156/76294 | train loss 3.336648 | norm 0.8686 | lr 1.63e-04 | (3798.56 ms | 138023 tok/s) step 17157/76294 | train loss 3.464824 | norm 1.1692 | lr 1.63e-04 | (3822.90 ms | 137144 tok/s) step 17158/76294 | train loss 3.389460 | norm 0.8215 | lr 1.63e-04 | (3802.88 ms | 137866 tok/s) step 17159/76294 | train loss 3.361160 | norm 1.0301 | lr 1.63e-04 | (3823.84 ms | 137110 tok/s) step 17160/76294 | train loss 3.361806 | norm 0.6834 | lr 1.63e-04 | (3802.49 ms | 137880 tok/s) step 17161/76294 | train loss 3.399101 | norm 0.8338 | lr 1.63e-04 | (3831.81 ms | 136825 tok/s) step 17162/76294 | train loss 3.403102 | norm 0.8494 | lr 1.63e-04 | (3799.43 ms | 137991 tok/s) step 17163/76294 | train loss 3.388132 | norm 1.3546 | lr 1.63e-04 | (3814.36 ms | 137451 tok/s) step 17164/76294 | train loss 3.453177 | norm 1.0250 | lr 1.63e-04 | (3819.59 ms | 137263 tok/s) step 17165/76294 | train loss 3.373577 | norm 0.9577 | lr 1.63e-04 | (3800.65 ms | 137947 tok/s) step 17166/76294 | 
train loss 3.380860 | norm 1.1556 | lr 1.62e-04 | (3814.26 ms | 137455 tok/s) step 17167/76294 | train loss 3.342986 | norm 1.6484 | lr 1.62e-04 | (3806.68 ms | 137728 tok/s) step 17168/76294 | train loss 3.487253 | norm 1.1417 | lr 1.62e-04 | (3802.38 ms | 137884 tok/s) step 17169/76294 | train loss 3.427625 | norm 1.4440 | lr 1.62e-04 | (3806.78 ms | 137725 tok/s) step 17170/76294 | train loss 3.379055 | norm 1.5603 | lr 1.62e-04 | (3798.12 ms | 138039 tok/s) step 17171/76294 | train loss 3.364077 | norm 1.3852 | lr 1.62e-04 | (3895.79 ms | 134578 tok/s) step 17172/76294 | train loss 3.404955 | norm 1.2967 | lr 1.62e-04 | (3801.91 ms | 137901 tok/s) step 17173/76294 | train loss 3.431964 | norm 1.2247 | lr 1.62e-04 | (3803.39 ms | 137848 tok/s) step 17174/76294 | train loss 3.389380 | norm 1.0184 | lr 1.62e-04 | (3820.89 ms | 137216 tok/s) step 17175/76294 | train loss 3.526516 | norm 1.6821 | lr 1.62e-04 | (3803.81 ms | 137832 tok/s) step 17176/76294 | train loss 3.367356 | norm 0.9746 | lr 1.62e-04 | (3870.83 ms | 135446 tok/s) step 17177/76294 | train loss 3.373368 | norm 1.2453 | lr 1.62e-04 | (3799.38 ms | 137993 tok/s) step 17178/76294 | train loss 3.396586 | norm 1.1232 | lr 1.62e-04 | (3907.01 ms | 134192 tok/s) step 17179/76294 | train loss 3.369902 | norm 1.0845 | lr 1.62e-04 | (3794.11 ms | 138185 tok/s) step 17180/76294 | train loss 3.391686 | norm 0.6014 | lr 1.62e-04 | (3973.94 ms | 131931 tok/s) step 17181/76294 | train loss 3.362840 | norm 1.2466 | lr 1.62e-04 | (3820.97 ms | 137213 tok/s) step 17182/76294 | train loss 3.408384 | norm 1.0669 | lr 1.62e-04 | (3802.57 ms | 137877 tok/s) step 17183/76294 | train loss 3.405057 | norm 0.7273 | lr 1.62e-04 | (5999.92 ms | 87383 tok/s) step 17184/76294 | train loss 3.389200 | norm 0.6165 | lr 1.62e-04 | (3789.61 ms | 138349 tok/s) step 17185/76294 | train loss 3.381189 | norm 0.6648 | lr 1.62e-04 | (3835.66 ms | 136688 tok/s) step 17186/76294 | train loss 3.423975 | norm 0.8419 | lr 1.62e-04 | (3806.39 
ms | 137739 tok/s) step 17187/76294 | train loss 3.369879 | norm 0.6432 | lr 1.62e-04 | (3800.04 ms | 137969 tok/s) step 17188/76294 | train loss 3.403765 | norm 1.0105 | lr 1.62e-04 | (3796.92 ms | 138082 tok/s) step 17189/76294 | train loss 3.365064 | norm 0.6574 | lr 1.62e-04 | (3800.96 ms | 137936 tok/s) step 17190/76294 | train loss 3.410721 | norm 1.0596 | lr 1.62e-04 | (3823.93 ms | 137107 tok/s) step 17191/76294 | train loss 3.346713 | norm 0.9203 | lr 1.62e-04 | (3861.67 ms | 135767 tok/s) step 17192/76294 | train loss 3.449809 | norm 1.1583 | lr 1.62e-04 | (3822.60 ms | 137155 tok/s) step 17193/76294 | train loss 3.327336 | norm 1.2164 | lr 1.62e-04 | (3806.44 ms | 137737 tok/s) step 17194/76294 | train loss 3.402628 | norm 0.8065 | lr 1.61e-04 | (3817.00 ms | 137356 tok/s) step 17195/76294 | train loss 3.443698 | norm 0.8128 | lr 1.61e-04 | (3804.68 ms | 137801 tok/s) step 17196/76294 | train loss 3.337642 | norm 1.7921 | lr 1.61e-04 | (3822.90 ms | 137144 tok/s) step 17197/76294 | train loss 3.398511 | norm 0.8147 | lr 1.61e-04 | (3801.55 ms | 137914 tok/s) step 17198/76294 | train loss 3.384046 | norm 0.7081 | lr 1.61e-04 | (3799.25 ms | 137998 tok/s) step 17199/76294 | train loss 3.359303 | norm 0.8023 | lr 1.61e-04 | (3899.44 ms | 134452 tok/s) step 17200/76294 | train loss 3.374139 | norm 0.9957 | lr 1.61e-04 | (3802.68 ms | 137873 tok/s) step 17201/76294 | train loss 3.358358 | norm 0.6789 | lr 1.61e-04 | (3821.99 ms | 137177 tok/s) step 17202/76294 | train loss 3.353450 | norm 0.6058 | lr 1.61e-04 | (3826.99 ms | 136997 tok/s) step 17203/76294 | train loss 3.419131 | norm 0.8070 | lr 1.61e-04 | (3834.07 ms | 136744 tok/s) step 17204/76294 | train loss 3.387110 | norm 0.8432 | lr 1.61e-04 | (3822.51 ms | 137158 tok/s) step 17205/76294 | train loss 3.445944 | norm 0.7867 | lr 1.61e-04 | (3806.00 ms | 137753 tok/s) step 17206/76294 | train loss 3.377946 | norm 0.8190 | lr 1.61e-04 | (3805.08 ms | 137786 tok/s) step 17207/76294 | train loss 3.491241 | 
norm 0.9632 | lr 1.61e-04 | (3828.36 ms | 136949 tok/s) step 17208/76294 | train loss 3.329087 | norm 0.8482 | lr 1.61e-04 | (3806.32 ms | 137742 tok/s) step 17209/76294 | train loss 3.385601 | norm 1.0693 | lr 1.61e-04 | (3801.71 ms | 137909 tok/s) step 17210/76294 | train loss 3.351445 | norm 0.8002 | lr 1.61e-04 | (3799.11 ms | 138003 tok/s) step 17211/76294 | train loss 3.479160 | norm 1.3515 | lr 1.61e-04 | (3899.04 ms | 134466 tok/s) step 17212/76294 | train loss 3.367340 | norm 1.2265 | lr 1.61e-04 | (3788.92 ms | 138374 tok/s) step 17213/76294 | train loss 3.373166 | norm 0.9804 | lr 1.61e-04 | (3819.80 ms | 137255 tok/s) step 17214/76294 | train loss 3.353497 | norm 0.6819 | lr 1.61e-04 | (3817.12 ms | 137352 tok/s) step 17215/76294 | train loss 3.457247 | norm 1.3138 | lr 1.61e-04 | (3795.59 ms | 138131 tok/s) step 17216/76294 | train loss 3.394119 | norm 0.8223 | lr 1.61e-04 | (3792.63 ms | 138239 tok/s) step 17217/76294 | train loss 3.350414 | norm 1.2553 | lr 1.61e-04 | (3831.99 ms | 136819 tok/s) step 17218/76294 | train loss 3.419271 | norm 1.0394 | lr 1.61e-04 | (3793.97 ms | 138190 tok/s) step 17219/76294 | train loss 3.385312 | norm 0.9814 | lr 1.61e-04 | (3800.42 ms | 137955 tok/s) step 17220/76294 | train loss 3.363354 | norm 1.1033 | lr 1.61e-04 | (3818.58 ms | 137299 tok/s) step 17221/76294 | train loss 3.386761 | norm 0.9650 | lr 1.61e-04 | (3798.19 ms | 138036 tok/s) step 17222/76294 | train loss 3.486261 | norm 1.3179 | lr 1.61e-04 | (3805.34 ms | 137777 tok/s) step 17223/76294 | train loss 3.445462 | norm 0.8310 | lr 1.61e-04 | (3818.85 ms | 137290 tok/s) step 17224/76294 | train loss 3.465057 | norm 0.9829 | lr 1.60e-04 | (3802.36 ms | 137885 tok/s) step 17225/76294 | train loss 3.421869 | norm 1.2279 | lr 1.60e-04 | (3807.65 ms | 137693 tok/s) step 17226/76294 | train loss 3.363942 | norm 1.6819 | lr 1.60e-04 | (3807.32 ms | 137705 tok/s) step 17227/76294 | train loss 3.316282 | norm 1.6966 | lr 1.60e-04 | (3819.94 ms | 137250 tok/s) 
step 17228/76294 | train loss 3.475322 | norm 1.5197 | lr 1.60e-04 | (3805.83 ms | 137759 tok/s) step 17229/76294 | train loss 3.376909 | norm 1.7660 | lr 1.60e-04 | (3815.18 ms | 137422 tok/s) step 17230/76294 | train loss 3.365080 | norm 1.1660 | lr 1.60e-04 | (3805.23 ms | 137781 tok/s) step 17231/76294 | train loss 3.426160 | norm 0.7447 | lr 1.60e-04 | (3824.23 ms | 137096 tok/s) step 17232/76294 | train loss 3.474350 | norm 0.9704 | lr 1.60e-04 | (3806.81 ms | 137724 tok/s) step 17233/76294 | train loss 3.392211 | norm 0.9249 | lr 1.60e-04 | (3802.85 ms | 137867 tok/s) step 17234/76294 | train loss 3.394659 | norm 0.7699 | lr 1.60e-04 | (3833.03 ms | 136782 tok/s) step 17235/76294 | train loss 3.404619 | norm 0.8274 | lr 1.60e-04 | (3805.92 ms | 137756 tok/s) step 17236/76294 | train loss 3.356178 | norm 0.9391 | lr 1.60e-04 | (3806.53 ms | 137734 tok/s) step 17237/76294 | train loss 3.396675 | norm 1.7297 | lr 1.60e-04 | (3823.50 ms | 137123 tok/s) step 17238/76294 | train loss 3.392631 | norm 0.8067 | lr 1.60e-04 | (3804.75 ms | 137798 tok/s) step 17239/76294 | train loss 3.384549 | norm 0.5642 | lr 1.60e-04 | (3800.53 ms | 137951 tok/s) step 17240/76294 | train loss 3.364974 | norm 0.8395 | lr 1.60e-04 | (3876.77 ms | 135238 tok/s) step 17241/76294 | train loss 3.367279 | norm 0.6871 | lr 1.60e-04 | (3802.31 ms | 137887 tok/s) step 17242/76294 | train loss 3.365193 | norm 0.7599 | lr 1.60e-04 | (3850.97 ms | 136144 tok/s) step 17243/76294 | train loss 3.412795 | norm 0.8697 | lr 1.60e-04 | (3800.39 ms | 137956 tok/s) step 17244/76294 | train loss 3.398829 | norm 0.7179 | lr 1.60e-04 | (3818.72 ms | 137294 tok/s) step 17245/76294 | train loss 3.348531 | norm 1.2119 | lr 1.60e-04 | (3802.88 ms | 137866 tok/s) step 17246/76294 | train loss 3.428109 | norm 0.9488 | lr 1.60e-04 | (3808.08 ms | 137678 tok/s) step 17247/76294 | train loss 3.386440 | norm 1.0832 | lr 1.60e-04 | (3804.40 ms | 137811 tok/s) step 17248/76294 | train loss 3.395479 | norm 1.8493 | lr 
1.60e-04 | (3825.02 ms | 137068 tok/s) step 17249/76294 | train loss 3.432539 | norm 0.6217 | lr 1.60e-04 | (3804.69 ms | 137800 tok/s) step 17250/76294 | train loss 3.407687 | norm 0.9723 | lr 1.60e-04 | (3878.52 ms | 135178 tok/s) val loss: 3.377636 saving model checkpoint to ./results/gpt2-124M-gqa/step_17250.pth step 17251/76294 | train loss 3.406637 | norm 0.9381 | lr 1.60e-04 | (3792.11 ms | 138258 tok/s) step 17252/76294 | train loss 3.421610 | norm 1.2607 | lr 1.60e-04 | (3827.86 ms | 136966 tok/s) step 17253/76294 | train loss 3.414750 | norm 1.0580 | lr 1.59e-04 | (3801.23 ms | 137926 tok/s) step 17254/76294 | train loss 3.388746 | norm 0.7323 | lr 1.59e-04 | (3805.61 ms | 137767 tok/s) step 17255/76294 | train loss 3.427438 | norm 0.9743 | lr 1.59e-04 | (3843.20 ms | 136420 tok/s) step 17256/76294 | train loss 3.452941 | norm 0.7340 | lr 1.59e-04 | (3805.91 ms | 137756 tok/s) step 17257/76294 | train loss 3.316473 | norm 0.7359 | lr 1.59e-04 | (3810.64 ms | 137585 tok/s) step 17258/76294 | train loss 3.390373 | norm 1.0922 | lr 1.59e-04 | (3825.54 ms | 137050 tok/s) step 17259/76294 | train loss 3.587693 | norm 1.0083 | lr 1.59e-04 | (3804.01 ms | 137825 tok/s) step 17260/76294 | train loss 3.509718 | norm 0.8437 | lr 1.59e-04 | (3831.80 ms | 136826 tok/s) step 17261/76294 | train loss 3.386667 | norm 0.9725 | lr 1.59e-04 | (3803.90 ms | 137829 tok/s) step 17262/76294 | train loss 3.401112 | norm 0.7696 | lr 1.59e-04 | (3808.73 ms | 137654 tok/s) step 17263/76294 | train loss 3.426004 | norm 0.8643 | lr 1.59e-04 | (3803.66 ms | 137838 tok/s) step 17264/76294 | train loss 3.468490 | norm 0.7440 | lr 1.59e-04 | (3825.02 ms | 137068 tok/s) step 17265/76294 | train loss 3.493860 | norm 0.7954 | lr 1.59e-04 | (3800.58 ms | 137949 tok/s) step 17266/76294 | train loss 3.330940 | norm 1.2949 | lr 1.59e-04 | (3814.83 ms | 137434 tok/s) step 17267/76294 | train loss 3.398107 | norm 1.0386 | lr 1.59e-04 | (3819.60 ms | 137263 tok/s) step 17268/76294 | train loss 
3.372814 | norm 0.9052 | lr 1.59e-04 | (3814.47 ms | 137447 tok/s) step 17269/76294 | train loss 3.417246 | norm 0.7893 | lr 1.59e-04 | (3809.52 ms | 137626 tok/s) step 17270/76294 | train loss 3.465991 | norm 1.2786 | lr 1.59e-04 | (3807.57 ms | 137696 tok/s) step 17271/76294 | train loss 3.412093 | norm 0.8382 | lr 1.59e-04 | (3831.32 ms | 136843 tok/s) step 17272/76294 | train loss 3.337355 | norm 1.5694 | lr 1.59e-04 | (4114.51 ms | 127424 tok/s) step 17273/76294 | train loss 3.380257 | norm 1.6753 | lr 1.59e-04 | (3824.16 ms | 137099 tok/s) step 17274/76294 | train loss 3.359416 | norm 0.7344 | lr 1.59e-04 | (3937.32 ms | 133159 tok/s) step 17275/76294 | train loss 3.394708 | norm 1.0759 | lr 1.59e-04 | (3878.60 ms | 135174 tok/s) step 17276/76294 | train loss 3.433488 | norm 1.2635 | lr 1.59e-04 | (3802.61 ms | 137876 tok/s) step 17277/76294 | train loss 3.595758 | norm 1.3662 | lr 1.59e-04 | (3805.13 ms | 137784 tok/s) step 17278/76294 | train loss 3.431704 | norm 1.6803 | lr 1.59e-04 | (3828.85 ms | 136931 tok/s) step 17279/76294 | train loss 3.415322 | norm 1.2986 | lr 1.59e-04 | (3804.98 ms | 137790 tok/s) step 17280/76294 | train loss 3.498019 | norm 2.2580 | lr 1.59e-04 | (3810.81 ms | 137579 tok/s) step 17281/76294 | train loss 3.539926 | norm 2.2251 | lr 1.59e-04 | (3804.97 ms | 137790 tok/s) step 17282/76294 | train loss 3.477958 | norm 1.4690 | lr 1.59e-04 | (3807.68 ms | 137692 tok/s) step 17283/76294 | train loss 3.376465 | norm 1.1499 | lr 1.58e-04 | (3808.16 ms | 137675 tok/s) step 17284/76294 | train loss 3.420263 | norm 1.2738 | lr 1.58e-04 | (3814.73 ms | 137438 tok/s) step 17285/76294 | train loss 3.456077 | norm 1.1490 | lr 1.58e-04 | (3807.07 ms | 137714 tok/s) step 17286/76294 | train loss 3.378045 | norm 1.3156 | lr 1.58e-04 | (3811.98 ms | 137537 tok/s) step 17287/76294 | train loss 3.394792 | norm 1.3573 | lr 1.58e-04 | (3807.55 ms | 137697 tok/s) step 17288/76294 | train loss 3.361873 | norm 0.8627 | lr 1.58e-04 | (3801.61 ms | 137912 
tok/s) step 17289/76294 | train loss 3.398610 | norm 1.0956 | lr 1.58e-04 | (3840.92 ms | 136501 tok/s) step 17290/76294 | train loss 3.423748 | norm 1.6089 | lr 1.58e-04 | (3802.66 ms | 137874 tok/s) step 17291/76294 | train loss 3.413912 | norm 1.2352 | lr 1.58e-04 | (3807.60 ms | 137695 tok/s) step 17292/76294 | train loss 3.402671 | norm 0.7560 | lr 1.58e-04 | (3819.35 ms | 137272 tok/s) step 17293/76294 | train loss 3.389206 | norm 0.8665 | lr 1.58e-04 | (3811.90 ms | 137540 tok/s) step 17294/76294 | train loss 3.384130 | norm 0.7987 | lr 1.58e-04 | (3810.76 ms | 137581 tok/s) step 17295/76294 | train loss 3.388606 | norm 1.0185 | lr 1.58e-04 | (3805.71 ms | 137763 tok/s) step 17296/76294 | train loss 3.389054 | norm 1.0470 | lr 1.58e-04 | (3813.28 ms | 137490 tok/s) step 17297/76294 | train loss 3.343326 | norm 0.8416 | lr 1.58e-04 | (3801.49 ms | 137916 tok/s) step 17298/76294 | train loss 3.421709 | norm 0.8446 | lr 1.58e-04 | (3805.16 ms | 137784 tok/s) step 17299/76294 | train loss 3.402046 | norm 1.5593 | lr 1.58e-04 | (3801.69 ms | 137909 tok/s) step 17300/76294 | train loss 3.372337 | norm 0.8085 | lr 1.58e-04 | (3880.09 ms | 135123 tok/s) step 17301/76294 | train loss 3.454319 | norm 0.9720 | lr 1.58e-04 | (3814.74 ms | 137438 tok/s) step 17302/76294 | train loss 3.507494 | norm 0.8702 | lr 1.58e-04 | (3817.39 ms | 137342 tok/s) step 17303/76294 | train loss 3.443326 | norm 0.8843 | lr 1.58e-04 | (3802.90 ms | 137865 tok/s) step 17304/76294 | train loss 3.390712 | norm 0.8139 | lr 1.58e-04 | (3831.19 ms | 136847 tok/s) step 17305/76294 | train loss 3.411992 | norm 0.6980 | lr 1.58e-04 | (3800.36 ms | 137957 tok/s) step 17306/76294 | train loss 3.739396 | norm 1.0709 | lr 1.58e-04 | (3824.37 ms | 137091 tok/s) step 17307/76294 | train loss 3.444623 | norm 1.1834 | lr 1.58e-04 | (3803.32 ms | 137850 tok/s) step 17308/76294 | train loss 3.375059 | norm 0.9918 | lr 1.58e-04 | (3805.26 ms | 137780 tok/s) step 17309/76294 | train loss 3.414344 | norm 0.9634 
| lr 1.58e-04 | (3823.70 ms | 137115 tok/s) step 17310/76294 | train loss 3.454710 | norm 0.9227 | lr 1.58e-04 | (3805.59 ms | 137768 tok/s) step 17311/76294 | train loss 3.388393 | norm 0.9490 | lr 1.58e-04 | (3808.04 ms | 137679 tok/s) step 17312/76294 | train loss 3.546424 | norm 0.8254 | lr 1.58e-04 | (3807.05 ms | 137715 tok/s) step 17313/76294 | train loss 3.398156 | norm 0.7605 | lr 1.57e-04 | (3810.15 ms | 137603 tok/s) step 17314/76294 | train loss 3.353039 | norm 0.9661 | lr 1.57e-04 | (3805.81 ms | 137760 tok/s) step 17315/76294 | train loss 3.438023 | norm 0.9631 | lr 1.57e-04 | (3804.58 ms | 137804 tok/s) step 17316/76294 | train loss 3.403245 | norm 1.0739 | lr 1.57e-04 | (3810.22 ms | 137600 tok/s) step 17317/76294 | train loss 3.379847 | norm 1.0329 | lr 1.57e-04 | (3800.85 ms | 137940 tok/s) step 17318/76294 | train loss 3.407880 | norm 1.0104 | lr 1.57e-04 | (3868.99 ms | 135510 tok/s) step 17319/76294 | train loss 3.380497 | norm 1.1248 | lr 1.57e-04 | (3829.42 ms | 136910 tok/s) step 17320/76294 | train loss 3.421906 | norm 1.6788 | lr 1.57e-04 | (3827.45 ms | 136981 tok/s) step 17321/76294 | train loss 3.383312 | norm 1.6047 | lr 1.57e-04 | (3802.92 ms | 137865 tok/s) step 17322/76294 | train loss 3.429287 | norm 1.9222 | lr 1.57e-04 | (4098.98 ms | 127907 tok/s) step 17323/76294 | train loss 3.451747 | norm 1.0034 | lr 1.57e-04 | (7497.04 ms | 69933 tok/s) step 17324/76294 | train loss 3.510109 | norm 1.0343 | lr 1.57e-04 | (4011.95 ms | 130682 tok/s) step 17325/76294 | train loss 3.477169 | norm 1.4759 | lr 1.57e-04 | (3790.09 ms | 138331 tok/s) step 17326/76294 | train loss 3.371277 | norm 1.4481 | lr 1.57e-04 | (3806.70 ms | 137728 tok/s) step 17327/76294 | train loss 3.452219 | norm 0.8682 | lr 1.57e-04 | (3798.32 ms | 138032 tok/s) step 17328/76294 | train loss 3.396317 | norm 0.7620 | lr 1.57e-04 | (3799.63 ms | 137984 tok/s) step 17329/76294 | train loss 3.414574 | norm 1.1364 | lr 1.57e-04 | (3821.19 ms | 137206 tok/s) step 17330/76294 
| train loss 3.384302 | norm 1.0142 | lr 1.57e-04 | (3799.01 ms | 138007 tok/s) step 17331/76294 | train loss 3.383918 | norm 0.7090 | lr 1.57e-04 | (3811.97 ms | 137537 tok/s) step 17332/76294 | train loss 3.364821 | norm 1.1898 | lr 1.57e-04 | (3797.09 ms | 138076 tok/s) step 17333/76294 | train loss 3.438989 | norm 0.7235 | lr 1.57e-04 | (3815.42 ms | 137413 tok/s) step 17334/76294 | train loss 3.327950 | norm 1.1705 | lr 1.57e-04 | (3802.76 ms | 137870 tok/s) step 17335/76294 | train loss 3.336397 | norm 1.3652 | lr 1.57e-04 | (3802.76 ms | 137870 tok/s) step 17336/76294 | train loss 3.396708 | norm 0.8329 | lr 1.57e-04 | (3827.73 ms | 136971 tok/s) step 17337/76294 | train loss 3.308419 | norm 1.2102 | lr 1.57e-04 | (3800.25 ms | 137962 tok/s) step 17338/76294 | train loss 3.368454 | norm 1.3231 | lr 1.57e-04 | (3832.18 ms | 136812 tok/s) step 17339/76294 | train loss 3.379991 | norm 1.6934 | lr 1.57e-04 | (3795.59 ms | 138131 tok/s) step 17340/76294 | train loss 3.383785 | norm 1.1966 | lr 1.57e-04 | (3799.90 ms | 137974 tok/s) step 17341/76294 | train loss 3.457830 | norm 1.0826 | lr 1.57e-04 | (3796.05 ms | 138114 tok/s) step 17342/76294 | train loss 3.496897 | norm 2.2185 | lr 1.57e-04 | (3804.24 ms | 137817 tok/s) step 17343/76294 | train loss 3.409336 | norm 1.7679 | lr 1.56e-04 | (3817.87 ms | 137325 tok/s) step 17344/76294 | train loss 3.432891 | norm 1.2194 | lr 1.56e-04 | (3797.50 ms | 138061 tok/s) step 17345/76294 | train loss 3.349649 | norm 1.4169 | lr 1.56e-04 | (3803.66 ms | 137838 tok/s) step 17346/76294 | train loss 3.395194 | norm 1.4226 | lr 1.56e-04 | (3821.89 ms | 137180 tok/s) step 17347/76294 | train loss 3.401809 | norm 1.1726 | lr 1.56e-04 | (3800.79 ms | 137942 tok/s) step 17348/76294 | train loss 3.381999 | norm 0.9514 | lr 1.56e-04 | (3801.75 ms | 137907 tok/s) step 17349/76294 | train loss 3.370027 | norm 0.9774 | lr 1.56e-04 | (3875.03 ms | 135299 tok/s) step 17350/76294 | train loss 3.413548 | norm 1.0593 | lr 1.56e-04 | 
(3803.97 ms | 137827 tok/s) step 17351/76294 | train loss 3.415149 | norm 1.3724 | lr 1.56e-04 | (3822.20 ms | 137169 tok/s) step 17352/76294 | train loss 3.411754 | norm 1.0488 | lr 1.56e-04 | (3797.41 ms | 138065 tok/s) step 17353/76294 | train loss 3.396718 | norm 2.1645 | lr 1.56e-04 | (3806.77 ms | 137725 tok/s) step 17354/76294 | train loss 3.411090 | norm 1.8558 | lr 1.56e-04 | (3819.51 ms | 137266 tok/s) step 17355/76294 | train loss 3.342744 | norm 1.4121 | lr 1.56e-04 | (3799.82 ms | 137977 tok/s) step 17356/76294 | train loss 3.462330 | norm 1.7489 | lr 1.56e-04 | (3809.27 ms | 137635 tok/s) step 17357/76294 | train loss 3.430144 | norm 1.5535 | lr 1.56e-04 | (3797.38 ms | 138066 tok/s) step 17358/76294 | train loss 3.391142 | norm 1.9568 | lr 1.56e-04 | (3821.07 ms | 137210 tok/s) step 17359/76294 | train loss 3.422640 | norm 1.1371 | lr 1.56e-04 | (3826.92 ms | 137000 tok/s) step 17360/76294 | train loss 3.417294 | norm 2.1507 | lr 1.56e-04 | (3818.26 ms | 137311 tok/s) step 17361/76294 | train loss 3.432019 | norm 1.4475 | lr 1.56e-04 | (3801.98 ms | 137899 tok/s) step 17362/76294 | train loss 3.447294 | norm 1.6698 | lr 1.56e-04 | (3801.73 ms | 137908 tok/s) step 17363/76294 | train loss 3.380321 | norm 1.3968 | lr 1.56e-04 | (3801.38 ms | 137920 tok/s) step 17364/76294 | train loss 3.406814 | norm 1.0714 | lr 1.56e-04 | (3797.59 ms | 138058 tok/s) step 17365/76294 | train loss 3.414873 | norm 1.7324 | lr 1.56e-04 | (3833.14 ms | 136778 tok/s) step 17366/76294 | train loss 3.377521 | norm 1.3211 | lr 1.56e-04 | (3801.18 ms | 137928 tok/s) step 17367/76294 | train loss 3.448555 | norm 0.8503 | lr 1.56e-04 | (3805.34 ms | 137777 tok/s) step 17368/76294 | train loss 3.449996 | norm 1.1843 | lr 1.56e-04 | (3821.70 ms | 137187 tok/s) step 17369/76294 | train loss 3.313864 | norm 1.4974 | lr 1.56e-04 | (3799.20 ms | 137999 tok/s) step 17370/76294 | train loss 3.402433 | norm 1.1644 | lr 1.56e-04 | (3801.95 ms | 137900 tok/s) step 17371/76294 | train loss 
3.423140 | norm 1.0815 | lr 1.56e-04 | (3806.07 ms | 137750 tok/s) step 17372/76294 | train loss 3.340395 | norm 0.8507 | lr 1.56e-04 | (3808.10 ms | 137677 tok/s) step 17373/76294 | train loss 3.381080 | norm 1.0650 | lr 1.56e-04 | (3799.06 ms | 138005 tok/s) step 17374/76294 | train loss 3.377786 | norm 0.8287 | lr 1.55e-04 | (3868.99 ms | 135510 tok/s) step 17375/76294 | train loss 3.409363 | norm 0.7270 | lr 1.55e-04 | (3796.72 ms | 138090 tok/s) step 17376/76294 | train loss 3.403578 | norm 1.1896 | lr 1.55e-04 | (3820.13 ms | 137243 tok/s) step 17377/76294 | train loss 3.559399 | norm 1.0461 | lr 1.55e-04 | (3820.33 ms | 137236 tok/s) step 17378/76294 | train loss 3.416745 | norm 1.1159 | lr 1.55e-04 | (3802.31 ms | 137887 tok/s) step 17379/76294 | train loss 3.389365 | norm 3.0542 | lr 1.55e-04 | (3796.49 ms | 138098 tok/s) step 17380/76294 | train loss 3.341273 | norm 1.4192 | lr 1.55e-04 | (3861.62 ms | 135769 tok/s) step 17381/76294 | train loss 3.380756 | norm 0.9170 | lr 1.55e-04 | (3811.08 ms | 137569 tok/s) step 17382/76294 | train loss 3.380461 | norm 1.0872 | lr 1.55e-04 | (3806.48 ms | 137736 tok/s) step 17383/76294 | train loss 3.413730 | norm 1.1847 | lr 1.55e-04 | (3820.80 ms | 137219 tok/s) step 17384/76294 | train loss 3.417696 | norm 0.9135 | lr 1.55e-04 | (3803.79 ms | 137833 tok/s) step 17385/76294 | train loss 3.331410 | norm 1.2309 | lr 1.55e-04 | (3830.84 ms | 136860 tok/s) step 17386/76294 | train loss 3.423377 | norm 1.3269 | lr 1.55e-04 | (3804.71 ms | 137800 tok/s) step 17387/76294 | train loss 3.345837 | norm 1.3654 | lr 1.55e-04 | (3809.99 ms | 137609 tok/s) step 17388/76294 | train loss 3.481520 | norm 1.6423 | lr 1.55e-04 | (3810.05 ms | 137606 tok/s) step 17389/76294 | train loss 3.369011 | norm 0.9493 | lr 1.55e-04 | (3805.80 ms | 137760 tok/s) step 17390/76294 | train loss 3.446960 | norm 0.8862 | lr 1.55e-04 | (3821.91 ms | 137179 tok/s) step 17391/76294 | train loss 3.380471 | norm 0.8956 | lr 1.55e-04 | (3805.10 ms | 137786 
tok/s) step 17392/76294 | train loss 3.358520 | norm 0.9699 | lr 1.55e-04 | (3807.93 ms | 137683 tok/s) step 17393/76294 | train loss 3.411792 | norm 0.6988 | lr 1.55e-04 | (3803.16 ms | 137856 tok/s) step 17394/76294 | train loss 3.430600 | norm 0.9415 | lr 1.55e-04 | (3804.31 ms | 137814 tok/s) step 17395/76294 | train loss 3.363734 | norm 1.1496 | lr 1.55e-04 | (3802.95 ms | 137864 tok/s) step 17396/76294 | train loss 3.380448 | norm 1.0151 | lr 1.55e-04 | (3804.55 ms | 137806 tok/s) step 17397/76294 | train loss 3.382542 | norm 1.2001 | lr 1.55e-04 | (3800.86 ms | 137939 tok/s) step 17398/76294 | train loss 3.522847 | norm 0.8764 | lr 1.55e-04 | (3811.16 ms | 137566 tok/s) step 17399/76294 | train loss 3.413709 | norm 1.2526 | lr 1.55e-04 | (3918.92 ms | 133784 tok/s) step 17400/76294 | train loss 3.371808 | norm 0.8842 | lr 1.55e-04 | (3799.54 ms | 137987 tok/s) step 17401/76294 | train loss 3.323899 | norm 2.0672 | lr 1.55e-04 | (3828.90 ms | 136929 tok/s) step 17402/76294 | train loss 3.361372 | norm 1.7065 | lr 1.55e-04 | (3883.01 ms | 135021 tok/s) step 17403/76294 | train loss 3.466589 | norm 1.0847 | lr 1.55e-04 | (3792.84 ms | 138231 tok/s) step 17404/76294 | train loss 3.354146 | norm 1.4380 | lr 1.55e-04 | (3799.53 ms | 137988 tok/s) step 17405/76294 | train loss 3.371538 | norm 1.4502 | lr 1.55e-04 | (3824.79 ms | 137076 tok/s) step 17406/76294 | train loss 3.420637 | norm 1.1293 | lr 1.54e-04 | (3798.83 ms | 138013 tok/s) step 17407/76294 | train loss 3.378245 | norm 1.2611 | lr 1.54e-04 | (3802.74 ms | 137871 tok/s) step 17408/76294 | train loss 3.423947 | norm 1.3294 | lr 1.54e-04 | (3801.67 ms | 137910 tok/s) step 17409/76294 | train loss 3.404575 | norm 0.8375 | lr 1.54e-04 | (3816.29 ms | 137382 tok/s) step 17410/76294 | train loss 3.415651 | norm 0.9684 | lr 1.54e-04 | (3799.90 ms | 137974 tok/s) step 17411/76294 | train loss 3.378131 | norm 0.8949 | lr 1.54e-04 | (3818.53 ms | 137301 tok/s) step 17412/76294 | train loss 3.390536 | norm 1.2621 
| lr 1.54e-04 | (3800.66 ms | 137947 tok/s) step 17413/76294 | train loss 3.576780 | norm 1.3196 | lr 1.54e-04 | (3981.17 ms | 131692 tok/s) step 17414/76294 | train loss 3.462809 | norm 1.1307 | lr 1.54e-04 | (3829.71 ms | 136900 tok/s) step 17415/76294 | train loss 3.388996 | norm 2.4272 | lr 1.54e-04 | (3817.44 ms | 137340 tok/s) step 17416/76294 | train loss 3.367755 | norm 1.1212 | lr 1.54e-04 | (3831.15 ms | 136849 tok/s) step 17417/76294 | train loss 3.461977 | norm 0.8278 | lr 1.54e-04 | (3809.18 ms | 137638 tok/s) step 17418/76294 | train loss 3.432172 | norm 0.9352 | lr 1.54e-04 | (3800.83 ms | 137940 tok/s) step 17419/76294 | train loss 3.365360 | norm 0.6588 | lr 1.54e-04 | (3807.58 ms | 137696 tok/s) step 17420/76294 | train loss 3.464363 | norm 0.8812 | lr 1.54e-04 | (3805.76 ms | 137762 tok/s) step 17421/76294 | train loss 3.524657 | norm 1.9961 | lr 1.54e-04 | (3802.62 ms | 137876 tok/s) step 17422/76294 | train loss 3.321616 | norm 1.7723 | lr 1.54e-04 | (3802.45 ms | 137882 tok/s) step 17423/76294 | train loss 3.414681 | norm 1.1508 | lr 1.54e-04 | (3803.35 ms | 137849 tok/s) step 17424/76294 | train loss 3.373593 | norm 1.0970 | lr 1.54e-04 | (3798.97 ms | 138008 tok/s) step 17425/76294 | train loss 3.410837 | norm 1.7215 | lr 1.54e-04 | (3851.47 ms | 136127 tok/s) step 17426/76294 | train loss 3.435528 | norm 1.4064 | lr 1.54e-04 | (3802.34 ms | 137886 tok/s) step 17427/76294 | train loss 3.455810 | norm 1.7019 | lr 1.54e-04 | (3808.69 ms | 137656 tok/s) step 17428/76294 | train loss 3.473523 | norm 1.1320 | lr 1.54e-04 | (3818.36 ms | 137307 tok/s) step 17429/76294 | train loss 3.594791 | norm 1.4695 | lr 1.54e-04 | (3802.83 ms | 137868 tok/s) step 17430/76294 | train loss 3.328017 | norm 1.6672 | lr 1.54e-04 | (3802.59 ms | 137877 tok/s) step 17431/76294 | train loss 3.490276 | norm 1.6989 | lr 1.54e-04 | (3804.46 ms | 137809 tok/s) step 17432/76294 | train loss 3.391884 | norm 1.3795 | lr 1.54e-04 | (3822.39 ms | 137162 tok/s) step 
17433/76294 | train loss 3.373763 | norm 2.1111 | lr 1.54e-04 | (3799.86 ms | 137975 tok/s) step 17434/76294 | train loss 3.385022 | norm 1.6253 | lr 1.54e-04 | (3804.63 ms | 137803 tok/s) step 17435/76294 | train loss 3.360692 | norm 1.3753 | lr 1.54e-04 | (3800.17 ms | 137964 tok/s) step 17436/76294 | train loss 3.462506 | norm 1.2863 | lr 1.54e-04 | (3794.59 ms | 138167 tok/s) step 17437/76294 | train loss 3.349003 | norm 1.3298 | lr 1.53e-04 | (4003.37 ms | 130962 tok/s) step 17438/76294 | train loss 3.426974 | norm 1.1000 | lr 1.53e-04 | (3795.11 ms | 138148 tok/s) step 17439/76294 | train loss 3.428879 | norm 0.9027 | lr 1.53e-04 | (3804.75 ms | 137798 tok/s) step 17440/76294 | train loss 3.405071 | norm 1.3874 | lr 1.53e-04 | (3821.48 ms | 137195 tok/s) step 17441/76294 | train loss 3.453580 | norm 1.4287 | lr 1.53e-04 | (3801.58 ms | 137913 tok/s) step 17442/76294 | train loss 3.371721 | norm 1.5395 | lr 1.53e-04 | (3820.23 ms | 137240 tok/s) step 17443/76294 | train loss 3.452536 | norm 0.8379 | lr 1.53e-04 | (3811.19 ms | 137565 tok/s) step 17444/76294 | train loss 3.402426 | norm 1.2222 | lr 1.53e-04 | (3807.68 ms | 137692 tok/s) step 17445/76294 | train loss 3.500602 | norm 2.6546 | lr 1.53e-04 | (4125.68 ms | 127079 tok/s) step 17446/76294 | train loss 3.385122 | norm 1.4152 | lr 1.53e-04 | (3791.84 ms | 138267 tok/s) step 17447/76294 | train loss 3.459645 | norm 1.3984 | lr 1.53e-04 | (3824.22 ms | 137097 tok/s) step 17448/76294 | train loss 3.407875 | norm 1.9220 | lr 1.53e-04 | (3879.65 ms | 135138 tok/s) step 17449/76294 | train loss 3.320962 | norm 1.7122 | lr 1.53e-04 | (3798.71 ms | 138017 tok/s) step 17450/76294 | train loss 3.421843 | norm 0.9719 | lr 1.53e-04 | (3804.83 ms | 137796 tok/s) step 17451/76294 | train loss 3.354328 | norm 2.6511 | lr 1.53e-04 | (3821.03 ms | 137211 tok/s) step 17452/76294 | train loss 3.448351 | norm 1.0019 | lr 1.53e-04 | (3808.17 ms | 137674 tok/s) step 17453/76294 | train loss 3.381341 | norm 1.6997 | lr 
1.53e-04 | (3796.17 ms | 138110 tok/s) step 17454/76294 | train loss 3.478074 | norm 1.1039 | lr 1.53e-04 | (3826.90 ms | 137001 tok/s) step 17455/76294 | train loss 3.378269 | norm 1.1745 | lr 1.53e-04 | (3797.14 ms | 138074 tok/s) step 17456/76294 | train loss 3.453441 | norm 1.1745 | lr 1.53e-04 | (3827.51 ms | 136979 tok/s) step 17457/76294 | train loss 3.584181 | norm 1.8236 | lr 1.53e-04 | (3797.14 ms | 138074 tok/s) step 17458/76294 | train loss 3.447661 | norm 1.3866 | lr 1.53e-04 | (3827.14 ms | 136992 tok/s) step 17459/76294 | train loss 3.390320 | norm 1.2900 | lr 1.53e-04 | (3795.31 ms | 138141 tok/s) step 17460/76294 | train loss 3.363102 | norm 1.8047 | lr 1.53e-04 | (3847.69 ms | 136260 tok/s) step 17461/76294 | train loss 3.409556 | norm 1.7792 | lr 1.53e-04 | (3813.99 ms | 137465 tok/s) step 17462/76294 | train loss 3.464831 | norm 1.7227 | lr 1.53e-04 | (3854.98 ms | 136003 tok/s) step 17463/76294 | train loss 3.301978 | norm 1.8677 | lr 1.53e-04 | (4634.71 ms | 113122 tok/s) step 17464/76294 | train loss 3.387119 | norm 3.1505 | lr 1.53e-04 | (3825.66 ms | 137045 tok/s) step 17465/76294 | train loss 3.354526 | norm 2.5080 | lr 1.53e-04 | (3801.31 ms | 137923 tok/s) step 17466/76294 | train loss 3.367517 | norm 1.9092 | lr 1.53e-04 | (3805.63 ms | 137766 tok/s) step 17467/76294 | train loss 3.385988 | norm 1.5750 | lr 1.53e-04 | (3797.05 ms | 138078 tok/s) step 17468/76294 | train loss 3.397884 | norm 1.3170 | lr 1.53e-04 | (3826.48 ms | 137016 tok/s) step 17469/76294 | train loss 3.341651 | norm 1.4387 | lr 1.53e-04 | (3796.68 ms | 138091 tok/s) step 17470/76294 | train loss 3.396805 | norm 0.9029 | lr 1.52e-04 | (3800.63 ms | 137948 tok/s) step 17471/76294 | train loss 3.388838 | norm 1.0114 | lr 1.52e-04 | (3842.46 ms | 136446 tok/s) step 17472/76294 | train loss 3.359236 | norm 0.7967 | lr 1.52e-04 | (3821.13 ms | 137207 tok/s) step 17473/76294 | train loss 3.327895 | norm 1.0625 | lr 1.52e-04 | (3878.04 ms | 135194 tok/s) step 17474/76294 | 
train loss 3.464689 | norm 1.1193 | lr 1.52e-04 | (3800.06 ms | 137968 tok/s) step 17475/76294 | train loss 3.432191 | norm 1.7008 | lr 1.52e-04 | (3821.43 ms | 137197 tok/s) step 17476/76294 | train loss 3.385057 | norm 2.3422 | lr 1.52e-04 | (3801.70 ms | 137909 tok/s) step 17477/76294 | train loss 3.431177 | norm 1.3864 | lr 1.52e-04 | (3800.26 ms | 137961 tok/s) step 17478/76294 | train loss 3.305817 | norm 1.3552 | lr 1.52e-04 | (3817.16 ms | 137350 tok/s) step 17479/76294 | train loss 3.360996 | norm 2.8816 | lr 1.52e-04 | (3801.73 ms | 137908 tok/s) step 17480/76294 | train loss 3.458277 | norm 1.7916 | lr 1.52e-04 | (3802.47 ms | 137881 tok/s) step 17481/76294 | train loss 3.406954 | norm 2.3577 | lr 1.52e-04 | (3801.73 ms | 137908 tok/s) step 17482/76294 | train loss 3.375375 | norm 1.2652 | lr 1.52e-04 | (3801.69 ms | 137909 tok/s) step 17483/76294 | train loss 3.406669 | norm 1.5097 | lr 1.52e-04 | (3835.24 ms | 136703 tok/s) step 17484/76294 | train loss 3.362385 | norm 1.5927 | lr 1.52e-04 | (3798.97 ms | 138008 tok/s) step 17485/76294 | train loss 3.389742 | norm 1.4649 | lr 1.52e-04 | (3801.70 ms | 137909 tok/s) step 17486/76294 | train loss 3.409138 | norm 1.7259 | lr 1.52e-04 | (3822.49 ms | 137159 tok/s) step 17487/76294 | train loss 3.416734 | norm 1.1105 | lr 1.52e-04 | (3797.75 ms | 138052 tok/s) step 17488/76294 | train loss 3.373972 | norm 2.7619 | lr 1.52e-04 | (3807.38 ms | 137703 tok/s) step 17489/76294 | train loss 3.376110 | norm 1.3304 | lr 1.52e-04 | (3801.42 ms | 137919 tok/s) step 17490/76294 | train loss 3.432478 | norm 1.2610 | lr 1.52e-04 | (3815.64 ms | 137405 tok/s) step 17491/76294 | train loss 3.393795 | norm 0.8756 | lr 1.52e-04 | (3800.07 ms | 137968 tok/s) step 17492/76294 | train loss 3.301551 | norm 1.0506 | lr 1.52e-04 | (3808.73 ms | 137654 tok/s) step 17493/76294 | train loss 3.367338 | norm 1.2463 | lr 1.52e-04 | (3804.48 ms | 137808 tok/s) step 17494/76294 | train loss 3.394753 | norm 1.0284 | lr 1.52e-04 | (3823.74 
ms | 137114 tok/s) step 17495/76294 | train loss 3.315254 | norm 1.0065 | lr 1.52e-04 | (3800.48 ms | 137953 tok/s) step 17496/76294 | train loss 3.314506 | norm 1.0066 | lr 1.52e-04 | (3804.13 ms | 137821 tok/s) step 17497/76294 | train loss 3.335038 | norm 0.8469 | lr 1.52e-04 | (3802.14 ms | 137893 tok/s) step 17498/76294 | train loss 3.317520 | norm 1.2765 | lr 1.52e-04 | (3800.80 ms | 137942 tok/s) step 17499/76294 | train loss 3.343106 | norm 1.6680 | lr 1.52e-04 | (3808.61 ms | 137659 tok/s) step 17500/76294 | train loss 3.371125 | norm 1.1318 | lr 1.52e-04 | (3805.13 ms | 137784 tok/s) val loss: 3.376391 saving model checkpoint to ./results/gpt2-124M-gqa/step_17500.pth step 17501/76294 | train loss 3.401680 | norm 1.2639 | lr 1.52e-04 | (3808.42 ms | 137665 tok/s) step 17502/76294 | train loss 3.455204 | norm 0.8470 | lr 1.52e-04 | (3825.23 ms | 137061 tok/s) step 17503/76294 | train loss 3.367097 | norm 1.2773 | lr 1.51e-04 | (3801.75 ms | 137907 tok/s) step 17504/76294 | train loss 3.568042 | norm 1.1414 | lr 1.51e-04 | (3801.54 ms | 137914 tok/s) step 17505/76294 | train loss 3.385489 | norm 1.6130 | lr 1.51e-04 | (3801.64 ms | 137911 tok/s) step 17506/76294 | train loss 3.320122 | norm 1.1370 | lr 1.51e-04 | (3798.12 ms | 138039 tok/s) step 17507/76294 | train loss 3.414367 | norm 0.9381 | lr 1.51e-04 | (3825.38 ms | 137055 tok/s) step 17508/76294 | train loss 3.381397 | norm 1.1049 | lr 1.51e-04 | (3799.00 ms | 138007 tok/s) step 17509/76294 | train loss 3.303658 | norm 1.7430 | lr 1.51e-04 | (3826.16 ms | 137027 tok/s) step 17510/76294 | train loss 3.475127 | norm 1.2393 | lr 1.51e-04 | (3798.59 ms | 138022 tok/s) step 17511/76294 | train loss 3.309293 | norm 1.2404 | lr 1.51e-04 | (3818.30 ms | 137309 tok/s) step 17512/76294 | train loss 3.356537 | norm 1.0098 | lr 1.51e-04 | (3819.81 ms | 137255 tok/s) step 17513/76294 | train loss 3.355734 | norm 1.9413 | lr 1.51e-04 | (3796.82 ms | 138086 tok/s) step 17514/76294 | train loss 3.329962 | norm 1.0892 
| lr 1.51e-04 | (3820.29 ms | 137238 tok/s) step 17515/76294 | train loss 3.399626 | norm 1.3555 | lr 1.51e-04 | (3800.34 ms | 137958 tok/s) step 17516/76294 | train loss 3.280743 | norm 1.1982 | lr 1.51e-04 | (3795.64 ms | 138129 tok/s) step 17517/76294 | train loss 3.403702 | norm 1.0624 | lr 1.51e-04 | (3831.80 ms | 136825 tok/s) step 17518/76294 | train loss 3.291744 | norm 1.1033 | lr 1.51e-04 | (3801.95 ms | 137900 tok/s) step 17519/76294 | train loss 3.371405 | norm 1.2183 | lr 1.51e-04 | (3805.77 ms | 137761 tok/s) step 17520/76294 | train loss 3.344478 | norm 1.4689 | lr 1.51e-04 | (3844.88 ms | 136360 tok/s) step 17521/76294 | train loss 3.413322 | norm 1.2700 | lr 1.51e-04 | (3809.54 ms | 137625 tok/s) step 17522/76294 | train loss 3.466737 | norm 2.9021 | lr 1.51e-04 | (3844.93 ms | 136358 tok/s) step 17523/76294 | train loss 3.309020 | norm 2.4614 | lr 1.51e-04 | (3805.47 ms | 137772 tok/s) step 17524/76294 | train loss 3.410673 | norm 1.4489 | lr 1.51e-04 | (3827.50 ms | 136979 tok/s) step 17525/76294 | train loss 3.308001 | norm 1.2776 | lr 1.51e-04 | (3802.46 ms | 137881 tok/s) step 17526/76294 | train loss 3.400440 | norm 1.4842 | lr 1.51e-04 | (3826.57 ms | 137012 tok/s) step 17527/76294 | train loss 3.295147 | norm 2.4954 | lr 1.51e-04 | (3851.07 ms | 136141 tok/s) step 17528/76294 | train loss 3.487521 | norm 1.4887 | lr 1.51e-04 | (3805.72 ms | 137763 tok/s) step 17529/76294 | train loss 3.361912 | norm 1.5291 | lr 1.51e-04 | (3808.60 ms | 137659 tok/s) step 17530/76294 | train loss 3.332826 | norm 1.5408 | lr 1.51e-04 | (3802.35 ms | 137885 tok/s) step 17531/76294 | train loss 3.459298 | norm 1.0587 | lr 1.51e-04 | (3815.66 ms | 137404 tok/s) step 17532/76294 | train loss 3.377858 | norm 1.3283 | lr 1.51e-04 | (3806.29 ms | 137742 tok/s) step 17533/76294 | train loss 3.410157 | norm 1.1370 | lr 1.51e-04 | (3799.12 ms | 138002 tok/s) step 17534/76294 | train loss 3.373981 | norm 0.9962 | lr 1.51e-04 | (3831.31 ms | 136843 tok/s) step 
17535/76294 | train loss 3.413976 | norm 1.2054 | lr 1.51e-04 | (3800.63 ms | 137948 tok/s) step 17536/76294 | train loss 3.294725 | norm 1.3546 | lr 1.50e-04 | (3804.76 ms | 137798 tok/s) step 17537/76294 | train loss 3.408834 | norm 1.2873 | lr 1.50e-04 | (3826.42 ms | 137018 tok/s) step 17538/76294 | train loss 3.408276 | norm 1.5250 | lr 1.50e-04 | (3800.47 ms | 137954 tok/s) step 17539/76294 | train loss 3.339411 | norm 1.3012 | lr 1.50e-04 | (3802.33 ms | 137886 tok/s) step 17540/76294 | train loss 3.385878 | norm 1.7387 | lr 1.50e-04 | (3803.75 ms | 137834 tok/s) step 17541/76294 | train loss 3.316643 | norm 1.7110 | lr 1.50e-04 | (3812.67 ms | 137512 tok/s) step 17542/76294 | train loss 3.303238 | norm 1.0039 | lr 1.50e-04 | (3800.46 ms | 137954 tok/s) step 17543/76294 | train loss 3.366344 | norm 0.9450 | lr 1.50e-04 | (3817.17 ms | 137350 tok/s) step 17544/76294 | train loss 3.373987 | norm 1.2380 | lr 1.50e-04 | (3837.92 ms | 136607 tok/s) step 17545/76294 | train loss 3.365483 | norm 0.9747 | lr 1.50e-04 | (3799.63 ms | 137984 tok/s) step 17546/76294 | train loss 3.322685 | norm 1.7380 | lr 1.50e-04 | (4007.53 ms | 130826 tok/s) step 17547/76294 | train loss 3.397783 | norm 1.3432 | lr 1.50e-04 | (3801.92 ms | 137901 tok/s) step 17548/76294 | train loss 3.440607 | norm 1.2211 | lr 1.50e-04 | (3815.70 ms | 137403 tok/s) step 17549/76294 | train loss 3.368976 | norm 1.0402 | lr 1.50e-04 | (3821.05 ms | 137211 tok/s) step 17550/76294 | train loss 3.330845 | norm 1.1153 | lr 1.50e-04 | (3805.01 ms | 137789 tok/s) step 17551/76294 | train loss 3.466932 | norm 1.1845 | lr 1.50e-04 | (3801.45 ms | 137918 tok/s) step 17552/76294 | train loss 3.328273 | norm 0.9039 | lr 1.50e-04 | (3823.00 ms | 137140 tok/s) step 17553/76294 | train loss 3.421377 | norm 1.8758 | lr 1.50e-04 | (3798.74 ms | 138016 tok/s) step 17554/76294 | train loss 3.327590 | norm 1.8825 | lr 1.50e-04 | (3812.06 ms | 137534 tok/s) step 17555/76294 | train loss 3.365787 | norm 1.2696 | lr 
1.50e-04 | (3796.00 ms | 138116 tok/s) step 17556/76294 | train loss 3.375290 | norm 3.0514 | lr 1.50e-04 | (3825.31 ms | 137058 tok/s) step 17557/76294 | train loss 3.382876 | norm 4.1599 | lr 1.50e-04 | (3799.91 ms | 137974 tok/s) step 17558/76294 | train loss 3.341382 | norm 1.7913 | lr 1.50e-04 | (3865.28 ms | 135641 tok/s) step 17559/76294 | train loss 3.437204 | norm 2.1284 | lr 1.50e-04 | (3800.40 ms | 137956 tok/s) step 17560/76294 | train loss 3.336259 | norm 1.1324 | lr 1.50e-04 | (3814.60 ms | 137442 tok/s) step 17561/76294 | train loss 3.313430 | norm 1.8774 | lr 1.50e-04 | (3798.23 ms | 138035 tok/s) step 17562/76294 | train loss 3.404622 | norm 1.1848 | lr 1.50e-04 | (3808.67 ms | 137657 tok/s) step 17563/76294 | train loss 3.286375 | norm 0.8545 | lr 1.50e-04 | (3823.82 ms | 137111 tok/s) step 17564/76294 | train loss 3.368589 | norm 0.8434 | lr 1.50e-04 | (3801.74 ms | 137908 tok/s) step 17565/76294 | train loss 3.282576 | norm 1.4111 | lr 1.50e-04 | (3804.88 ms | 137793 tok/s) step 17566/76294 | train loss 3.361153 | norm 0.9916 | lr 1.50e-04 | (3797.30 ms | 138068 tok/s) step 17567/76294 | train loss 3.360764 | norm 0.8514 | lr 1.50e-04 | (3807.19 ms | 137710 tok/s) step 17568/76294 | train loss 3.370457 | norm 1.4543 | lr 1.50e-04 | (3802.64 ms | 137875 tok/s) step 17569/76294 | train loss 3.234251 | norm 1.1730 | lr 1.50e-04 | (3806.76 ms | 137726 tok/s) step 17570/76294 | train loss 3.409172 | norm 1.2456 | lr 1.49e-04 | (3830.15 ms | 136885 tok/s) step 17571/76294 | train loss 3.364956 | norm 1.3854 | lr 1.49e-04 | (3880.47 ms | 135109 tok/s) step 17572/76294 | train loss 3.361596 | norm 1.1427 | lr 1.49e-04 | (3798.28 ms | 138033 tok/s) step 17573/76294 | train loss 3.414526 | norm 1.8830 | lr 1.49e-04 | (3801.76 ms | 137907 tok/s) step 17574/76294 | train loss 3.361164 | norm 1.7540 | lr 1.49e-04 | (3856.82 ms | 135938 tok/s) step 17575/76294 | train loss 3.390342 | norm 1.4769 | lr 1.49e-04 | (3801.58 ms | 137913 tok/s) step 17576/76294 | 
train loss 3.356254 | norm 1.2781 | lr 1.49e-04 | (3799.10 ms | 138003 tok/s) step 17577/76294 | train loss 3.427049 | norm 1.5841 | lr 1.49e-04 | (3835.19 ms | 136705 tok/s) step 17578/76294 | train loss 3.308789 | norm 1.7064 | lr 1.49e-04 | (3800.17 ms | 137964 tok/s) step 17579/76294 | train loss 3.363012 | norm 1.6688 | lr 1.49e-04 | (3807.17 ms | 137711 tok/s) step 17580/76294 | train loss 3.352945 | norm 1.2059 | lr 1.49e-04 | (3799.93 ms | 137973 tok/s) step 17581/76294 | train loss 3.387992 | norm 1.4369 | lr 1.49e-04 | (3803.63 ms | 137839 tok/s) step 17582/76294 | train loss 3.339240 | norm 1.5866 | lr 1.49e-04 | (4932.35 ms | 106296 tok/s) step 17583/76294 | train loss 3.383259 | norm 1.0984 | lr 1.49e-04 | (3833.20 ms | 136776 tok/s) step 17584/76294 | train loss 3.338303 | norm 0.9832 | lr 1.49e-04 | (3798.99 ms | 138007 tok/s) step 17585/76294 | train loss 3.426463 | norm 1.2653 | lr 1.49e-04 | (3831.63 ms | 136831 tok/s) step 17586/76294 | train loss 3.344345 | norm 1.4873 | lr 1.49e-04 | (3798.90 ms | 138010 tok/s) step 17587/76294 | train loss 3.480683 | norm 1.4782 | lr 1.49e-04 | (3799.86 ms | 137975 tok/s) step 17588/76294 | train loss 3.287508 | norm 1.8382 | lr 1.49e-04 | (3832.69 ms | 136794 tok/s) step 17589/76294 | train loss 3.560199 | norm 1.6603 | lr 1.49e-04 | (3802.96 ms | 137863 tok/s) step 17590/76294 | train loss 3.407029 | norm 1.0555 | lr 1.49e-04 | (3806.80 ms | 137724 tok/s) step 17591/76294 | train loss 3.358684 | norm 1.2504 | lr 1.49e-04 | (3800.53 ms | 137951 tok/s) step 17592/76294 | train loss 3.378135 | norm 1.1596 | lr 1.49e-04 | (3805.61 ms | 137767 tok/s) step 17593/76294 | train loss 3.352521 | norm 1.4917 | lr 1.49e-04 | (3804.70 ms | 137800 tok/s) step 17594/76294 | train loss 3.383819 | norm 2.6672 | lr 1.49e-04 | (3803.77 ms | 137834 tok/s) step 17595/76294 | train loss 3.332034 | norm 1.4696 | lr 1.49e-04 | (3806.29 ms | 137743 tok/s) step 17596/76294 | train loss 3.366803 | norm 0.9238 | lr 1.49e-04 | (3863.83 
ms | 135691 tok/s) step 17597/76294 | train loss 3.315548 | norm 1.7532 | lr 1.49e-04 | (3794.91 ms | 138156 tok/s) step 17598/76294 | train loss 3.419704 | norm 1.8327 | lr 1.49e-04 | (3810.54 ms | 137589 tok/s) step 17599/76294 | train loss 3.368886 | norm 1.2019 | lr 1.49e-04 | (3810.10 ms | 137605 tok/s) step 17600/76294 | train loss 3.380819 | norm 1.5345 | lr 1.49e-04 | (3806.54 ms | 137733 tok/s) step 17601/76294 | train loss 3.277403 | norm 1.6616 | lr 1.49e-04 | (3972.34 ms | 131985 tok/s) step 17602/76294 | train loss 3.462174 | norm 1.4540 | lr 1.49e-04 | (3842.19 ms | 136456 tok/s) step 17603/76294 | train loss 3.348606 | norm 1.2580 | lr 1.49e-04 | (3808.12 ms | 137676 tok/s) step 17604/76294 | train loss 3.357446 | norm 1.3356 | lr 1.48e-04 | (3799.69 ms | 137982 tok/s) step 17605/76294 | train loss 3.427541 | norm 1.6070 | lr 1.48e-04 | (3795.60 ms | 138130 tok/s) step 17606/76294 | train loss 3.326415 | norm 1.2936 | lr 1.48e-04 | (3849.61 ms | 136192 tok/s) step 17607/76294 | train loss 3.333413 | norm 1.4630 | lr 1.48e-04 | (3796.71 ms | 138090 tok/s) step 17608/76294 | train loss 3.379690 | norm 1.5053 | lr 1.48e-04 | (3801.02 ms | 137934 tok/s) step 17609/76294 | train loss 3.378452 | norm 2.8192 | lr 1.48e-04 | (3861.01 ms | 135791 tok/s) step 17610/76294 | train loss 3.291358 | norm 2.3573 | lr 1.48e-04 | (3802.62 ms | 137875 tok/s) step 17611/76294 | train loss 3.365515 | norm 1.2874 | lr 1.48e-04 | (3798.87 ms | 138012 tok/s) step 17612/76294 | train loss 3.343340 | norm 1.8858 | lr 1.48e-04 | (3828.21 ms | 136954 tok/s) step 17613/76294 | train loss 3.445985 | norm 1.5019 | lr 1.48e-04 | (3801.22 ms | 137926 tok/s) step 17614/76294 | train loss 3.320162 | norm 2.4711 | lr 1.48e-04 | (3803.70 ms | 137836 tok/s) step 17615/76294 | train loss 3.398075 | norm 2.3660 | lr 1.48e-04 | (3821.51 ms | 137194 tok/s) step 17616/76294 | train loss 3.338187 | norm 1.5778 | lr 1.48e-04 | (3804.97 ms | 137790 tok/s) step 17617/76294 | train loss 3.390225 | 
norm 1.5685 | lr 1.48e-04 | (3800.00 ms | 137971 tok/s) step 17618/76294 | train loss 3.500859 | norm 1.5266 | lr 1.48e-04 | (3824.18 ms | 137098 tok/s) step 17619/76294 | train loss 3.325834 | norm 1.7569 | lr 1.48e-04 | (3799.42 ms | 137992 tok/s) step 17620/76294 | train loss 3.419699 | norm 1.7505 | lr 1.48e-04 | (3838.21 ms | 136597 tok/s) step 17621/76294 | train loss 3.346668 | norm 2.0970 | lr 1.48e-04 | (3838.99 ms | 136569 tok/s) step 17622/76294 | train loss 3.380198 | norm 1.5855 | lr 1.48e-04 | (3805.41 ms | 137775 tok/s) step 17623/76294 | train loss 3.452377 | norm 1.0135 | lr 1.48e-04 | (4087.86 ms | 128255 tok/s) step 17624/76294 | train loss 3.338675 | norm 1.1580 | lr 1.48e-04 | (3792.33 ms | 138250 tok/s) step 17625/76294 | train loss 3.359668 | norm 1.3123 | lr 1.48e-04 | (3839.30 ms | 136558 tok/s) step 17626/76294 | train loss 3.291965 | norm 1.7020 | lr 1.48e-04 | (3795.73 ms | 138126 tok/s) step 17627/76294 | train loss 3.364131 | norm 1.9768 | lr 1.48e-04 | (3801.38 ms | 137920 tok/s) step 17628/76294 | train loss 3.398591 | norm 3.4849 | lr 1.48e-04 | (3822.90 ms | 137144 tok/s) step 17629/76294 | train loss 3.423505 | norm 1.8750 | lr 1.48e-04 | (4012.34 ms | 130669 tok/s) step 17630/76294 | train loss 3.341842 | norm 1.8975 | lr 1.48e-04 | (3794.45 ms | 138172 tok/s) step 17631/76294 | train loss 3.373530 | norm 1.5637 | lr 1.48e-04 | (3825.69 ms | 137044 tok/s) step 17632/76294 | train loss 3.316169 | norm 1.7391 | lr 1.48e-04 | (3797.60 ms | 138058 tok/s) step 17633/76294 | train loss 3.389619 | norm 1.1597 | lr 1.48e-04 | (3803.62 ms | 137839 tok/s) step 17634/76294 | train loss 3.366725 | norm 1.0148 | lr 1.48e-04 | (3796.02 ms | 138115 tok/s) step 17635/76294 | train loss 3.394513 | norm 1.8895 | lr 1.48e-04 | (3804.43 ms | 137810 tok/s) step 17636/76294 | train loss 3.360490 | norm 1.1742 | lr 1.48e-04 | (3820.91 ms | 137216 tok/s) step 17637/76294 | train loss 3.356910 | norm 2.2429 | lr 1.48e-04 | (3803.08 ms | 137859 tok/s) 
step 17638/76294 | train loss 3.347579 | norm 2.1022 | lr 1.48e-04 | (3823.22 ms | 137133 tok/s) step 17639/76294 | train loss 3.382452 | norm 1.9349 | lr 1.47e-04 | (3803.88 ms | 137830 tok/s) step 17640/76294 | train loss 3.353062 | norm 1.0336 | lr 1.47e-04 | (3808.75 ms | 137654 tok/s) step 17641/76294 | train loss 3.427588 | norm 1.1303 | lr 1.47e-04 | (3804.86 ms | 137794 tok/s) step 17642/76294 | train loss 3.363851 | norm 1.7533 | lr 1.47e-04 | (3820.62 ms | 137226 tok/s) step 17643/76294 | train loss 3.370811 | norm 1.7349 | lr 1.47e-04 | (3803.16 ms | 137856 tok/s) step 17644/76294 | train loss 3.398215 | norm 1.4631 | lr 1.47e-04 | (3806.84 ms | 137723 tok/s) step 17645/76294 | train loss 3.382977 | norm 2.4149 | lr 1.47e-04 | (3836.22 ms | 136668 tok/s) step 17646/76294 | train loss 3.387045 | norm 2.9059 | lr 1.47e-04 | (3806.91 ms | 137720 tok/s) step 17647/76294 | train loss 3.350632 | norm 1.8097 | lr 1.47e-04 | (3806.30 ms | 137742 tok/s) step 17648/76294 | train loss 3.365149 | norm 1.8397 | lr 1.47e-04 | (3802.13 ms | 137893 tok/s) step 17649/76294 | train loss 3.311863 | norm 0.8793 | lr 1.47e-04 | (3834.54 ms | 136728 tok/s) step 17650/76294 | train loss 3.399482 | norm 1.4013 | lr 1.47e-04 | (3819.06 ms | 137282 tok/s) step 17651/76294 | train loss 3.334563 | norm 1.9985 | lr 1.47e-04 | (3860.30 ms | 135815 tok/s) step 17652/76294 | train loss 3.451451 | norm 1.6243 | lr 1.47e-04 | (3801.11 ms | 137930 tok/s) step 17653/76294 | train loss 3.333538 | norm 1.8823 | lr 1.47e-04 | (3801.85 ms | 137903 tok/s) step 17654/76294 | train loss 3.417946 | norm 1.4898 | lr 1.47e-04 | (4364.57 ms | 120124 tok/s) step 17655/76294 | train loss 3.379804 | norm 1.1617 | lr 1.47e-04 | (3806.92 ms | 137720 tok/s) step 17656/76294 | train loss 3.474388 | norm 1.1732 | lr 1.47e-04 | (3806.31 ms | 137742 tok/s) step 17657/76294 | train loss 3.362006 | norm 1.4008 | lr 1.47e-04 | (3808.93 ms | 137647 tok/s) step 17658/76294 | train loss 3.394217 | norm 2.1481 | lr 
1.47e-04 | (3807.27 ms | 137707 tok/s) step 17659/76294 | train loss 3.387070 | norm 1.8971 | lr 1.47e-04 | (3832.67 ms | 136794 tok/s) step 17660/76294 | train loss 3.331489 | norm 2.1739 | lr 1.47e-04 | (3802.78 ms | 137869 tok/s) step 17661/76294 | train loss 3.426452 | norm 1.9760 | lr 1.47e-04 | (3855.51 ms | 135984 tok/s) step 17662/76294 | train loss 3.410340 | norm 1.4593 | lr 1.47e-04 | (3799.74 ms | 137980 tok/s) step 17663/76294 | train loss 3.402915 | norm 2.4351 | lr 1.47e-04 | (3850.42 ms | 136164 tok/s) step 17664/76294 | train loss 3.422284 | norm 2.5792 | lr 1.47e-04 | (3802.41 ms | 137883 tok/s) step 17665/76294 | train loss 3.399422 | norm 1.5687 | lr 1.47e-04 | (3810.50 ms | 137590 tok/s) step 17666/76294 | train loss 3.488501 | norm 1.7186 | lr 1.47e-04 | (3824.27 ms | 137095 tok/s) step 17667/76294 | train loss 3.423907 | norm 2.8047 | lr 1.47e-04 | (3804.87 ms | 137794 tok/s) step 17668/76294 | train loss 3.291193 | norm 2.2911 | lr 1.47e-04 | (3818.97 ms | 137285 tok/s) step 17669/76294 | train loss 3.409544 | norm 1.9632 | lr 1.47e-04 | (3858.81 ms | 135868 tok/s) step 17670/76294 | train loss 3.418934 | norm 1.4356 | lr 1.47e-04 | (3802.16 ms | 137892 tok/s) step 17671/76294 | train loss 3.335158 | norm 1.5037 | lr 1.47e-04 | (3829.84 ms | 136896 tok/s) step 17672/76294 | train loss 3.404304 | norm 1.4457 | lr 1.47e-04 | (3800.70 ms | 137945 tok/s) step 17673/76294 | train loss 3.427209 | norm 3.2064 | lr 1.47e-04 | (3808.36 ms | 137668 tok/s) step 17674/76294 | train loss 3.423638 | norm 3.2526 | lr 1.46e-04 | (3832.50 ms | 136800 tok/s) step 17675/76294 | train loss 3.417581 | norm 2.9565 | lr 1.46e-04 | (3810.89 ms | 137576 tok/s) step 17676/76294 | train loss 3.438369 | norm 2.4398 | lr 1.46e-04 | (3821.98 ms | 137177 tok/s) step 17677/76294 | train loss 3.404775 | norm 2.6999 | lr 1.46e-04 | (3833.57 ms | 136762 tok/s) step 17678/76294 | train loss 3.410030 | norm 1.5128 | lr 1.46e-04 | (3804.58 ms | 137805 tok/s) step 17679/76294 | 
train loss 3.350878 | norm 1.7037 | lr 1.46e-04 | (3802.10 ms | 137894 tok/s) step 17680/76294 | train loss 3.421736 | norm 2.5072 | lr 1.46e-04 | (3802.23 ms | 137890 tok/s) step 17681/76294 | train loss 3.406136 | norm 2.1346 | lr 1.46e-04 | (3821.68 ms | 137188 tok/s) step 17682/76294 | train loss 3.460150 | norm 3.0183 | lr 1.46e-04 | (3814.86 ms | 137433 tok/s) step 17683/76294 | train loss 3.469217 | norm 2.5583 | lr 1.46e-04 | (3811.14 ms | 137567 tok/s) step 17684/76294 | train loss 3.448244 | norm 1.4700 | lr 1.46e-04 | (3835.81 ms | 136682 tok/s) step 17685/76294 | train loss 3.383142 | norm 2.5778 | lr 1.46e-04 | (3811.31 ms | 137561 tok/s) step 17686/76294 | train loss 3.371316 | norm 4.3580 | lr 1.46e-04 | (3815.84 ms | 137398 tok/s) step 17687/76294 | train loss 3.401689 | norm 5.3098 | lr 1.46e-04 | (3812.48 ms | 137519 tok/s) step 17688/76294 | train loss 3.417599 | norm 2.5335 | lr 1.46e-04 | (3836.69 ms | 136651 tok/s) step 17689/76294 | train loss 3.439090 | norm 1.2907 | lr 1.46e-04 | (3811.98 ms | 137537 tok/s) step 17690/76294 | train loss 3.369550 | norm 2.1630 | lr 1.46e-04 | (3837.49 ms | 136623 tok/s) step 17691/76294 | train loss 3.349498 | norm 1.5189 | lr 1.46e-04 | (3836.86 ms | 136645 tok/s) step 17692/76294 | train loss 3.435283 | norm 2.1863 | lr 1.46e-04 | (3797.58 ms | 138058 tok/s) step 17693/76294 | train loss 3.414976 | norm 1.5706 | lr 1.46e-04 | (3839.58 ms | 136548 tok/s) step 17694/76294 | train loss 3.437452 | norm 1.2542 | lr 1.46e-04 | (3957.20 ms | 132490 tok/s) step 17695/76294 | train loss 3.449773 | norm 1.5002 | lr 1.46e-04 | (3791.92 ms | 138265 tok/s) step 17696/76294 | train loss 3.365140 | norm 1.8520 | lr 1.46e-04 | (3908.16 ms | 134152 tok/s) step 17697/76294 | train loss 3.388706 | norm 1.3289 | lr 1.46e-04 | (3848.26 ms | 136240 tok/s) step 17698/76294 | train loss 3.404896 | norm 1.0861 | lr 1.46e-04 | (3814.70 ms | 137439 tok/s) step 17699/76294 | train loss 3.381110 | norm 1.1888 | lr 1.46e-04 | (3909.21 
ms | 134116 tok/s) step 17700/76294 | train loss 3.375129 | norm 2.6465 | lr 1.46e-04 | (3782.91 ms | 138594 tok/s) step 17701/76294 | train loss 3.429441 | norm 1.8938 | lr 1.46e-04 | (3788.99 ms | 138372 tok/s) step 17702/76294 | train loss 3.517512 | norm 1.8406 | lr 1.46e-04 | (3808.22 ms | 137673 tok/s) step 17703/76294 | train loss 3.403558 | norm 3.3344 | lr 1.46e-04 | (3791.38 ms | 138284 tok/s) step 17704/76294 | train loss 3.455534 | norm 1.9413 | lr 1.46e-04 | (3843.96 ms | 136393 tok/s) step 17705/76294 | train loss 3.357265 | norm 1.8046 | lr 1.46e-04 | (3847.52 ms | 136267 tok/s) step 17706/76294 | train loss 3.403789 | norm 1.2697 | lr 1.46e-04 | (3873.64 ms | 135348 tok/s) step 17707/76294 | train loss 3.434005 | norm 1.1151 | lr 1.46e-04 | (3783.88 ms | 138558 tok/s) step 17708/76294 | train loss 3.382751 | norm 1.4313 | lr 1.46e-04 | (3883.13 ms | 135017 tok/s) step 17709/76294 | train loss 3.428125 | norm 1.7641 | lr 1.46e-04 | (3767.87 ms | 139147 tok/s) step 17710/76294 | train loss 3.400547 | norm 2.4918 | lr 1.46e-04 | (3859.45 ms | 135845 tok/s) step 17711/76294 | train loss 3.448531 | norm 1.1570 | lr 1.45e-04 | (3833.85 ms | 136753 tok/s) step 17712/76294 | train loss 3.418999 | norm 1.5499 | lr 1.45e-04 | (3826.87 ms | 137002 tok/s) step 17713/76294 | train loss 3.376535 | norm 1.8513 | lr 1.45e-04 | (3774.36 ms | 138908 tok/s) step 17714/76294 | train loss 3.409242 | norm 1.1165 | lr 1.45e-04 | (3859.21 ms | 135854 tok/s) step 17715/76294 | train loss 3.391784 | norm 1.9847 | lr 1.45e-04 | (3874.44 ms | 135320 tok/s) step 17716/76294 | train loss 3.611194 | norm 2.5318 | lr 1.45e-04 | (3768.08 ms | 139139 tok/s) step 17717/76294 | train loss 3.442812 | norm 2.3490 | lr 1.45e-04 | (3798.94 ms | 138009 tok/s) step 17718/76294 | train loss 3.380624 | norm 3.1395 | lr 1.45e-04 | (3770.35 ms | 139055 tok/s) step 17719/76294 | train loss 3.436939 | norm 2.0107 | lr 1.45e-04 | (3802.24 ms | 137889 tok/s) step 17720/76294 | train loss 3.434319 | 
norm 3.1929 | lr 1.45e-04 | (3777.60 ms | 138789 tok/s) step 17721/76294 | train loss 3.348801 | norm 3.2695 | lr 1.45e-04 | (3786.29 ms | 138470 tok/s) step 17722/76294 | train loss 3.361273 | norm 3.3580 | lr 1.45e-04 | (3805.42 ms | 137774 tok/s) step 17723/76294 | train loss 3.523050 | norm 3.0492 | lr 1.45e-04 | (3787.57 ms | 138423 tok/s) step 17724/76294 | train loss 3.385440 | norm 2.8621 | lr 1.45e-04 | (3799.29 ms | 137996 tok/s) step 17725/76294 | train loss 3.458083 | norm 2.7284 | lr 1.45e-04 | (3793.57 ms | 138204 tok/s) step 17726/76294 | train loss 3.490206 | norm 1.3274 | lr 1.45e-04 | (3814.05 ms | 137462 tok/s) step 17727/76294 | train loss 3.404069 | norm 1.9825 | lr 1.45e-04 | (3802.47 ms | 137881 tok/s) step 17728/76294 | train loss 3.421000 | norm 1.6470 | lr 1.45e-04 | (3822.59 ms | 137155 tok/s) step 17729/76294 | train loss 3.424737 | norm 2.7361 | lr 1.45e-04 | (3834.63 ms | 136725 tok/s) step 17730/76294 | train loss 3.418323 | norm 1.7346 | lr 1.45e-04 | (3811.84 ms | 137542 tok/s) step 17731/76294 | train loss 3.404974 | norm 2.0879 | lr 1.45e-04 | (3805.27 ms | 137779 tok/s) step 17732/76294 | train loss 3.401123 | norm 1.5288 | lr 1.45e-04 | (3823.97 ms | 137106 tok/s) step 17733/76294 | train loss 3.387082 | norm 1.6329 | lr 1.45e-04 | (3863.60 ms | 135699 tok/s) step 17734/76294 | train loss 3.399142 | norm 1.8174 | lr 1.45e-04 | (3802.86 ms | 137867 tok/s) step 17735/76294 | train loss 3.430642 | norm 2.1151 | lr 1.45e-04 | (3900.30 ms | 134422 tok/s) step 17736/76294 | train loss 3.406696 | norm 2.8061 | lr 1.45e-04 | (3798.51 ms | 138025 tok/s) step 17737/76294 | train loss 3.410857 | norm 2.1284 | lr 1.45e-04 | (3910.29 ms | 134079 tok/s) step 17738/76294 | train loss 3.373903 | norm 1.5750 | lr 1.45e-04 | (3804.00 ms | 137825 tok/s) step 17739/76294 | train loss 3.406754 | norm 1.7843 | lr 1.45e-04 | (3864.07 ms | 135683 tok/s) step 17740/76294 | train loss 3.470176 | norm 1.7441 | lr 1.45e-04 | (3800.98 ms | 137935 tok/s) 
step 17741/76294 | train loss 3.406334 | norm 1.4636 | lr 1.45e-04 | (3878.23 ms | 135188 tok/s) step 17742/76294 | train loss 3.355650 | norm 1.5664 | lr 1.45e-04 | (3804.43 ms | 137810 tok/s) step 17743/76294 | train loss 3.422000 | norm 1.5419 | lr 1.45e-04 | (4166.22 ms | 125843 tok/s) step 17744/76294 | train loss 3.418800 | norm 1.1967 | lr 1.45e-04 | (3804.18 ms | 137819 tok/s) step 17745/76294 | train loss 3.436489 | norm 1.5964 | lr 1.45e-04 | (3816.19 ms | 137385 tok/s) step 17746/76294 | train loss 3.444108 | norm 1.4006 | lr 1.45e-04 | (3805.93 ms | 137756 tok/s) step 17747/76294 | train loss 3.461442 | norm 1.6798 | lr 1.45e-04 | (3816.00 ms | 137392 tok/s) step 17748/76294 | train loss 3.406937 | norm 1.2372 | lr 1.44e-04 | (3828.62 ms | 136939 tok/s) step 17749/76294 | train loss 3.413650 | norm 1.8594 | lr 1.44e-04 | (3844.46 ms | 136375 tok/s) step 17750/76294 | train loss 3.400864 | norm 1.4633 | lr 1.44e-04 | (3817.36 ms | 137343 tok/s) val loss: 3.384086 saving model checkpoint to ./results/gpt2-124M-gqa/step_17750.pth step 17751/76294 | train loss 3.377347 | norm 2.3107 | lr 1.44e-04 | (3815.65 ms | 137405 tok/s) step 17752/76294 | train loss 3.399954 | norm 3.0788 | lr 1.44e-04 | (3808.62 ms | 137658 tok/s) step 17753/76294 | train loss 3.417150 | norm 2.0541 | lr 1.44e-04 | (3814.56 ms | 137444 tok/s) step 17754/76294 | train loss 3.434745 | norm 2.3701 | lr 1.44e-04 | (3809.07 ms | 137642 tok/s) step 17755/76294 | train loss 3.397893 | norm 2.4965 | lr 1.44e-04 | (3835.19 ms | 136705 tok/s) step 17756/76294 | train loss 3.451792 | norm 1.7568 | lr 1.44e-04 | (3806.64 ms | 137730 tok/s) step 17757/76294 | train loss 3.415826 | norm 2.2280 | lr 1.44e-04 | (3913.60 ms | 133966 tok/s) step 17758/76294 | train loss 3.407125 | norm 3.6415 | lr 1.44e-04 | (3841.12 ms | 136493 tok/s) step 17759/76294 | train loss 3.408868 | norm 2.3604 | lr 1.44e-04 | (3804.05 ms | 137824 tok/s) step 17760/76294 | train loss 3.430554 | norm 1.9478 | lr 1.44e-04 | 
(3838.98 ms | 136570 tok/s) step 17761/76294 | train loss 3.401343 | norm 1.7943 | lr 1.44e-04 | (3805.97 ms | 137754 tok/s) step 17762/76294 | train loss 3.356078 | norm 2.0211 | lr 1.44e-04 | (3813.07 ms | 137498 tok/s) step 17763/76294 | train loss 3.388496 | norm 1.9285 | lr 1.44e-04 | (3805.73 ms | 137763 tok/s) step 17764/76294 | train loss 3.408937 | norm 1.4729 | lr 1.44e-04 | (3833.63 ms | 136760 tok/s) step 17765/76294 | train loss 3.378717 | norm 1.4664 | lr 1.44e-04 | (3816.60 ms | 137371 tok/s) step 17766/76294 | train loss 3.381945 | norm 2.2563 | lr 1.44e-04 | (3836.68 ms | 136652 tok/s) step 17767/76294 | train loss 3.378795 | norm 3.0303 | lr 1.44e-04 | (3808.26 ms | 137671 tok/s) step 17768/76294 | train loss 3.359182 | norm 2.0608 | lr 1.44e-04 | (3842.51 ms | 136444 tok/s) step 17769/76294 | train loss 3.395840 | norm 2.6665 | lr 1.44e-04 | (3810.86 ms | 137577 tok/s) step 17770/76294 | train loss 3.367754 | norm 1.9334 | lr 1.44e-04 | (3841.45 ms | 136482 tok/s) step 17771/76294 | train loss 3.434301 | norm 1.2092 | lr 1.44e-04 | (3812.79 ms | 137508 tok/s) step 17772/76294 | train loss 3.365269 | norm 2.0187 | lr 1.44e-04 | (3852.26 ms | 136099 tok/s) step 17773/76294 | train loss 3.387271 | norm 1.5197 | lr 1.44e-04 | (3846.78 ms | 136293 tok/s) step 17774/76294 | train loss 3.435762 | norm 2.2517 | lr 1.44e-04 | (3798.11 ms | 138039 tok/s) step 17775/76294 | train loss 3.441375 | norm 1.3509 | lr 1.44e-04 | (3876.76 ms | 135239 tok/s) step 17776/76294 | train loss 3.463358 | norm 1.9720 | lr 1.44e-04 | (3793.80 ms | 138196 tok/s) step 17777/76294 | train loss 3.416511 | norm 2.6307 | lr 1.44e-04 | (3816.86 ms | 137361 tok/s) step 17778/76294 | train loss 3.423254 | norm 2.3130 | lr 1.44e-04 | (3797.00 ms | 138080 tok/s) step 17779/76294 | train loss 3.381663 | norm 2.7878 | lr 1.44e-04 | (3818.55 ms | 137300 tok/s) step 17780/76294 | train loss 3.452309 | norm 4.8112 | lr 1.44e-04 | (3798.04 ms | 138042 tok/s) step 17781/76294 | train loss 
3.370120 | norm 2.7472 | lr 1.44e-04 | (3872.88 ms | 135374 tok/s) step 17782/76294 | train loss 3.447226 | norm 3.0744 | lr 1.44e-04 | (3791.65 ms | 138274 tok/s) step 17783/76294 | train loss 3.425279 | norm 3.9435 | lr 1.44e-04 | (3809.20 ms | 137637 tok/s) step 17784/76294 | train loss 3.408842 | norm 2.2590 | lr 1.44e-04 | (3795.15 ms | 138147 tok/s) step 17785/76294 | train loss 3.419086 | norm 3.3733 | lr 1.43e-04 | (3799.49 ms | 137989 tok/s) step 17786/76294 | train loss 3.481071 | norm 2.3265 | lr 1.43e-04 | (3847.13 ms | 136280 tok/s) step 17787/76294 | train loss 3.392972 | norm 3.6840 | lr 1.43e-04 | (3800.00 ms | 137970 tok/s) step 17788/76294 | train loss 3.350185 | norm 4.4363 | lr 1.43e-04 | (3820.91 ms | 137216 tok/s) step 17789/76294 | train loss 3.462337 | norm 2.4575 | lr 1.43e-04 | (3802.08 ms | 137895 tok/s) step 17790/76294 | train loss 3.424949 | norm 2.1090 | lr 1.43e-04 | (3805.11 ms | 137785 tok/s) step 17791/76294 | train loss 3.419426 | norm 2.5357 | lr 1.43e-04 | (3803.16 ms | 137856 tok/s) step 17792/76294 | train loss 3.372161 | norm 1.3406 | lr 1.43e-04 | (3841.06 ms | 136496 tok/s) step 17793/76294 | train loss 3.410388 | norm 4.7646 | lr 1.43e-04 | (3796.70 ms | 138090 tok/s) step 17794/76294 | train loss 3.465062 | norm 8.4141 | lr 1.43e-04 | (3801.53 ms | 137915 tok/s) step 17795/76294 | train loss 3.419093 | norm 3.2776 | lr 1.43e-04 | (3824.25 ms | 137095 tok/s) step 17796/76294 | train loss 3.486532 | norm 2.8722 | lr 1.43e-04 | (3828.37 ms | 136948 tok/s) step 17797/76294 | train loss 3.413797 | norm 2.9197 | lr 1.43e-04 | (3797.72 ms | 138053 tok/s) step 17798/76294 | train loss 3.382527 | norm 2.7145 | lr 1.43e-04 | (3808.96 ms | 137646 tok/s) step 17799/76294 | train loss 3.384240 | norm 1.6968 | lr 1.43e-04 | (3796.78 ms | 138087 tok/s) step 17800/76294 | train loss 3.370443 | norm 1.6013 | lr 1.43e-04 | (3817.72 ms | 137330 tok/s) step 17801/76294 | train loss 3.408342 | norm 2.1584 | lr 1.43e-04 | (3801.62 ms | 137912 
tok/s) step 17802/76294 | train loss 3.463980 | norm 1.8465 | lr 1.43e-04 | (3885.31 ms | 134941 tok/s) step 17803/76294 | train loss 3.447852 | norm 1.3550 | lr 1.43e-04 | (3794.30 ms | 138178 tok/s) step 17804/76294 | train loss 3.500085 | norm 1.7539 | lr 1.43e-04 | (3819.06 ms | 137282 tok/s) step 17805/76294 | train loss 3.387780 | norm 2.2137 | lr 1.43e-04 | (3874.58 ms | 135315 tok/s) step 17806/76294 | train loss 3.448187 | norm 3.0179 | lr 1.43e-04 | (3800.63 ms | 137948 tok/s) step 17807/76294 | train loss 3.425370 | norm 3.3826 | lr 1.43e-04 | (3803.62 ms | 137839 tok/s) step 17808/76294 | train loss 3.420051 | norm 2.6399 | lr 1.43e-04 | (3801.38 ms | 137921 tok/s) step 17809/76294 | train loss 3.359445 | norm 2.1140 | lr 1.43e-04 | (3822.26 ms | 137167 tok/s) step 17810/76294 | train loss 3.424669 | norm 2.0667 | lr 1.43e-04 | (3800.20 ms | 137963 tok/s) step 17811/76294 | train loss 3.361961 | norm 1.5638 | lr 1.43e-04 | (3824.15 ms | 137099 tok/s) step 17812/76294 | train loss 3.449739 | norm 1.8767 | lr 1.43e-04 | (3837.84 ms | 136610 tok/s) step 17813/76294 | train loss 3.393022 | norm 1.7450 | lr 1.43e-04 | (3799.04 ms | 138005 tok/s) step 17814/76294 | train loss 3.431640 | norm 1.7837 | lr 1.43e-04 | (3951.34 ms | 132686 tok/s) step 17815/76294 | train loss 3.417538 | norm 1.6041 | lr 1.43e-04 | (3788.39 ms | 138393 tok/s) step 17816/76294 | train loss 3.419670 | norm 2.2669 | lr 1.43e-04 | (3794.63 ms | 138166 tok/s) step 17817/76294 | train loss 3.439166 | norm 1.9293 | lr 1.43e-04 | (3814.06 ms | 137462 tok/s) step 17818/76294 | train loss 3.416477 | norm 2.2963 | lr 1.43e-04 | (3796.14 ms | 138111 tok/s) step 17819/76294 | train loss 3.410601 | norm 2.8898 | lr 1.43e-04 | (3816.98 ms | 137357 tok/s) step 17820/76294 | train loss 3.370484 | norm 2.3011 | lr 1.43e-04 | (3800.62 ms | 137948 tok/s) step 17821/76294 | train loss 3.453369 | norm 1.5891 | lr 1.43e-04 | (3797.69 ms | 138054 tok/s) step 17822/76294 | train loss 3.468331 | norm 2.5177 
| lr 1.43e-04 | (3819.50 ms | 137266 tok/s) step 17823/76294 | train loss 3.453804 | norm 3.1077 | lr 1.43e-04 | (3798.58 ms | 138022 tok/s) step 17824/76294 | train loss 3.421174 | norm 2.0010 | lr 1.42e-04 | (3819.22 ms | 137276 tok/s) step 17825/76294 | train loss 3.407549 | norm 1.9884 | lr 1.42e-04 | (3819.90 ms | 137252 tok/s) step 17826/76294 | train loss 3.364535 | norm 1.2120 | lr 1.42e-04 | (3796.45 ms | 138099 tok/s) step 17827/76294 | train loss 3.396691 | norm 1.2679 | lr 1.42e-04 | (3802.30 ms | 137887 tok/s) step 17828/76294 | train loss 3.441812 | norm 2.3994 | lr 1.42e-04 | (3873.27 ms | 135360 tok/s) step 17829/76294 | train loss 3.344982 | norm 1.7843 | lr 1.42e-04 | (3797.45 ms | 138063 tok/s) step 17830/76294 | train loss 3.412207 | norm 1.4195 | lr 1.42e-04 | (3801.01 ms | 137934 tok/s) step 17831/76294 | train loss 3.412764 | norm 2.0440 | lr 1.42e-04 | (3819.50 ms | 137266 tok/s) step 17832/76294 | train loss 3.439492 | norm 2.6701 | lr 1.42e-04 | (3800.19 ms | 137964 tok/s) step 17833/76294 | train loss 3.442250 | norm 1.7194 | lr 1.42e-04 | (3807.07 ms | 137714 tok/s) step 17834/76294 | train loss 3.393894 | norm 2.2296 | lr 1.42e-04 | (3806.77 ms | 137725 tok/s) step 17835/76294 | train loss 3.430675 | norm 2.4945 | lr 1.42e-04 | (3806.27 ms | 137743 tok/s) step 17836/76294 | train loss 3.423694 | norm 2.6191 | lr 1.42e-04 | (3798.86 ms | 138012 tok/s) step 17837/76294 | train loss 3.419738 | norm 3.2849 | lr 1.42e-04 | (3895.88 ms | 134575 tok/s) step 17838/76294 | train loss 3.427691 | norm 2.1529 | lr 1.42e-04 | (14265.41 ms | 36752 tok/s) step 17839/76294 | train loss 3.399968 | norm 2.9192 | lr 1.42e-04 | (3774.69 ms | 138896 tok/s) step 17840/76294 | train loss 3.419645 | norm 2.0426 | lr 1.42e-04 | (3781.54 ms | 138644 tok/s) step 17841/76294 | train loss 3.474515 | norm 1.3240 | lr 1.42e-04 | (3803.06 ms | 137860 tok/s) step 17842/76294 | train loss 3.423675 | norm 2.6547 | lr 1.42e-04 | (3784.59 ms | 138532 tok/s) step 
17843/76294 | train loss 3.426750 | norm 1.6118 | lr 1.42e-04 | (3798.36 ms | 138030 tok/s) step 17844/76294 | train loss 3.409385 | norm 1.3438 | lr 1.42e-04 | (4167.65 ms | 125799 tok/s) step 17845/76294 | train loss 3.327801 | norm 2.0086 | lr 1.42e-04 | (3802.05 ms | 137896 tok/s) step 17846/76294 | train loss 3.423596 | norm 1.3498 | lr 1.42e-04 | (3792.06 ms | 138259 tok/s) step 17847/76294 | train loss 3.355399 | norm 2.7358 | lr 1.42e-04 | (3804.13 ms | 137821 tok/s) step 17848/76294 | train loss 3.390594 | norm 2.1912 | lr 1.42e-04 | (3794.19 ms | 138182 tok/s) step 17849/76294 | train loss 3.412208 | norm 1.5941 | lr 1.42e-04 | (3808.38 ms | 137667 tok/s) step 17850/76294 | train loss 3.382669 | norm 1.1188 | lr 1.42e-04 | (3910.72 ms | 134064 tok/s) step 17851/76294 | train loss 3.445846 | norm 1.0608 | lr 1.42e-04 | (3789.16 ms | 138365 tok/s) step 17852/76294 | train loss 3.386915 | norm 2.0279 | lr 1.42e-04 | (3844.19 ms | 136384 tok/s) step 17853/76294 | train loss 3.387850 | norm 1.8575 | lr 1.42e-04 | (3828.77 ms | 136934 tok/s) step 17854/76294 | train loss 3.435688 | norm 2.1914 | lr 1.42e-04 | (3796.54 ms | 138096 tok/s) step 17855/76294 | train loss 3.458499 | norm 2.2355 | lr 1.42e-04 | (3822.67 ms | 137152 tok/s) step 17856/76294 | train loss 3.454036 | norm 2.3455 | lr 1.42e-04 | (3798.76 ms | 138016 tok/s) step 17857/76294 | train loss 3.407571 | norm 1.6208 | lr 1.42e-04 | (3797.94 ms | 138045 tok/s) step 17858/76294 | train loss 3.399415 | norm 2.3026 | lr 1.42e-04 | (3830.10 ms | 136886 tok/s) step 17859/76294 | train loss 3.445879 | norm 1.8330 | lr 1.42e-04 | (3862.62 ms | 135734 tok/s) step 17860/76294 | train loss 3.400177 | norm 3.0705 | lr 1.42e-04 | (3799.11 ms | 138003 tok/s) step 17861/76294 | train loss 3.402584 | norm 2.1132 | lr 1.42e-04 | (3833.87 ms | 136752 tok/s) step 17862/76294 | train loss 3.413677 | norm 2.8093 | lr 1.42e-04 | (3800.04 ms | 137969 tok/s) step 17863/76294 | train loss 3.372800 | norm 2.1474 | lr 
1.41e-04 | (3809.93 ms | 137611 tok/s) step 17864/76294 | train loss 3.401205 | norm 1.4271 | lr 1.41e-04 | (3828.51 ms | 136943 tok/s) step 17865/76294 | train loss 3.426879 | norm 2.5636 | lr 1.41e-04 | (3811.04 ms | 137571 tok/s) step 17866/76294 | train loss 3.447728 | norm 1.4102 | lr 1.41e-04 | (3816.53 ms | 137373 tok/s) step 17867/76294 | train loss 3.460381 | norm 1.6179 | lr 1.41e-04 | (3895.18 ms | 134599 tok/s) step 17868/76294 | train loss 3.400545 | norm 2.0542 | lr 1.41e-04 | (3814.65 ms | 137441 tok/s) step 17869/76294 | train loss 3.432650 | norm 3.2053 | lr 1.41e-04 | (3809.32 ms | 137633 tok/s) step 17870/76294 | train loss 3.464685 | norm 3.0671 | lr 1.41e-04 | (3817.14 ms | 137351 tok/s) step 17871/76294 | train loss 3.396807 | norm 3.2074 | lr 1.41e-04 | (3814.38 ms | 137450 tok/s) step 17872/76294 | train loss 3.403710 | norm 3.0779 | lr 1.41e-04 | (3821.68 ms | 137188 tok/s) step 17873/76294 | train loss 3.359850 | norm 3.2031 | lr 1.41e-04 | (3812.30 ms | 137525 tok/s) step 17874/76294 | train loss 3.424010 | norm 1.9512 | lr 1.41e-04 | (3893.85 ms | 134645 tok/s) step 17875/76294 | train loss 3.368500 | norm 1.7582 | lr 1.41e-04 | (3866.92 ms | 135583 tok/s) step 17876/76294 | train loss 3.419024 | norm 1.2130 | lr 1.41e-04 | (3804.95 ms | 137791 tok/s) step 17877/76294 | train loss 3.422852 | norm 1.8255 | lr 1.41e-04 | (3862.35 ms | 135743 tok/s) step 17878/76294 | train loss 3.415888 | norm 2.2933 | lr 1.41e-04 | (3801.65 ms | 137910 tok/s) step 17879/76294 | train loss 3.429462 | norm 1.8259 | lr 1.41e-04 | (3912.94 ms | 133988 tok/s) step 17880/76294 | train loss 3.369862 | norm 2.5769 | lr 1.41e-04 | (3792.02 ms | 138261 tok/s) step 17881/76294 | train loss 3.420728 | norm 1.8422 | lr 1.41e-04 | (3821.35 ms | 137200 tok/s) step 17882/76294 | train loss 3.442460 | norm 2.3882 | lr 1.41e-04 | (3794.87 ms | 138157 tok/s) step 17883/76294 | train loss 3.438888 | norm 2.0157 | lr 1.41e-04 | (4043.11 ms | 129674 tok/s) step 17884/76294 | 
train loss 3.426243 | norm 1.8663 | lr 1.41e-04 | (3793.70 ms | 138200 tok/s) step 17885/76294 | train loss 3.371742 | norm 2.1301 | lr 1.41e-04 | (3807.14 ms | 137712 tok/s) step 17886/76294 | train loss 3.380460 | norm 1.3839 | lr 1.41e-04 | (3817.72 ms | 137330 tok/s) step 17887/76294 | train loss 3.409098 | norm 3.0312 | lr 1.41e-04 | (3799.93 ms | 137973 tok/s) step 17888/76294 | train loss 3.416288 | norm 3.9757 | lr 1.41e-04 | (3801.82 ms | 137904 tok/s) step 17889/76294 | train loss 3.361006 | norm 3.3419 | lr 1.41e-04 | (3830.90 ms | 136858 tok/s) step 17890/76294 | train loss 3.431201 | norm 4.3254 | lr 1.41e-04 | (3803.05 ms | 137860 tok/s) step 17891/76294 | train loss 3.422496 | norm 2.6011 | lr 1.41e-04 | (3807.91 ms | 137684 tok/s) step 17892/76294 | train loss 3.382224 | norm 2.5015 | lr 1.41e-04 | (3835.00 ms | 136711 tok/s) step 17893/76294 | train loss 3.475769 | norm 2.5767 | lr 1.41e-04 | (3804.84 ms | 137795 tok/s) step 17894/76294 | train loss 3.375342 | norm 2.6480 | lr 1.41e-04 | (3803.00 ms | 137862 tok/s) step 17895/76294 | train loss 3.467567 | norm 2.2454 | lr 1.41e-04 | (3866.11 ms | 135611 tok/s) step 17896/76294 | train loss 3.381356 | norm 3.8347 | lr 1.41e-04 | (3803.86 ms | 137830 tok/s) step 17897/76294 | train loss 3.377723 | norm 2.0269 | lr 1.41e-04 | (3807.47 ms | 137700 tok/s) step 17898/76294 | train loss 3.440690 | norm 2.9734 | lr 1.41e-04 | (3900.34 ms | 134421 tok/s) step 17899/76294 | train loss 3.379357 | norm 5.2819 | lr 1.41e-04 | (3799.17 ms | 138001 tok/s) step 17900/76294 | train loss 3.442098 | norm 4.1708 | lr 1.41e-04 | (3833.63 ms | 136760 tok/s) step 17901/76294 | train loss 3.418976 | norm 3.5303 | lr 1.41e-04 | (3801.87 ms | 137903 tok/s) step 17902/76294 | train loss 3.398461 | norm 4.7408 | lr 1.41e-04 | (3829.34 ms | 136913 tok/s) step 17903/76294 | train loss 3.475970 | norm 4.4890 | lr 1.40e-04 | (3804.46 ms | 137809 tok/s) step 17904/76294 | train loss 3.468679 | norm 5.0111 | lr 1.40e-04 | (3819.03 
ms | 137283 tok/s) step 17905/76294 | train loss 3.440273 | norm 6.1605 | lr 1.40e-04 | (3800.16 ms | 137965 tok/s) step 17906/76294 | train loss 3.460702 | norm 5.6400 | lr 1.40e-04 | (3890.63 ms | 134757 tok/s) step 17907/76294 | train loss 3.426952 | norm 5.0251 | lr 1.40e-04 | (6358.93 ms | 82449 tok/s) step 17908/76294 | train loss 3.465456 | norm 5.2788 | lr 1.40e-04 | (3800.97 ms | 137935 tok/s) step 17909/76294 | train loss 3.385325 | norm 4.0288 | lr 1.40e-04 | (3823.42 ms | 137126 tok/s) step 17910/76294 | train loss 3.373852 | norm 2.0392 | lr 1.40e-04 | (3795.14 ms | 138147 tok/s) step 17911/76294 | train loss 3.443281 | norm 3.9963 | lr 1.40e-04 | (3804.50 ms | 137807 tok/s) step 17912/76294 | train loss 3.470461 | norm 4.8830 | lr 1.40e-04 | (3802.91 ms | 137865 tok/s) step 17913/76294 | train loss 3.428855 | norm 3.1288 | lr 1.40e-04 | (3878.32 ms | 135184 tok/s) step 17914/76294 | train loss 3.477179 | norm 4.9569 | lr 1.40e-04 | (3791.40 ms | 138284 tok/s) step 17915/76294 | train loss 3.326003 | norm 3.7532 | lr 1.40e-04 | (3818.77 ms | 137292 tok/s) step 17916/76294 | train loss 3.426739 | norm 2.3435 | lr 1.40e-04 | (3791.61 ms | 138276 tok/s) step 17917/76294 | train loss 3.362886 | norm 3.0769 | lr 1.40e-04 | (4635.49 ms | 113103 tok/s) step 17918/76294 | train loss 3.401546 | norm 2.5028 | lr 1.40e-04 | (3816.07 ms | 137389 tok/s) step 17919/76294 | train loss 3.473074 | norm 2.3416 | lr 1.40e-04 | (3802.74 ms | 137871 tok/s) step 17920/76294 | train loss 3.343770 | norm 2.8849 | lr 1.40e-04 | (3802.06 ms | 137896 tok/s) step 17921/76294 | train loss 3.372341 | norm 1.6305 | lr 1.40e-04 | (3892.80 ms | 134682 tok/s) step 17922/76294 | train loss 3.410413 | norm 1.8874 | lr 1.40e-04 | (3797.56 ms | 138059 tok/s) step 17923/76294 | train loss 3.398696 | norm 2.3363 | lr 1.40e-04 | (3799.67 ms | 137982 tok/s) step 17924/76294 | train loss 3.434273 | norm 1.6342 | lr 1.40e-04 | (3826.57 ms | 137013 tok/s) step 17925/76294 | train loss 3.400128 | 
norm 2.1654 | lr 1.40e-04 | (3916.55 ms | 133865 tok/s) step 17926/76294 | train loss 3.392279 | norm 1.7112 | lr 1.40e-04 | (3791.49 ms | 138280 tok/s) step 17927/76294 | train loss 3.406844 | norm 2.4298 | lr 1.40e-04 | (3842.88 ms | 136431 tok/s) step 17928/76294 | train loss 3.414698 | norm 2.8845 | lr 1.40e-04 | (3793.98 ms | 138190 tok/s) step 17929/76294 | train loss 3.462107 | norm 2.6435 | lr 1.40e-04 | (3885.53 ms | 134934 tok/s) step 17930/76294 | train loss 3.356860 | norm 1.1235 | lr 1.40e-04 | (3791.96 ms | 138263 tok/s) step 17931/76294 | train loss 3.396259 | norm 2.1632 | lr 1.40e-04 | (3852.00 ms | 136108 tok/s) step 17932/76294 | train loss 3.379706 | norm 1.7475 | lr 1.40e-04 | (3795.33 ms | 138140 tok/s) step 17933/76294 | train loss 3.412679 | norm 2.2651 | lr 1.40e-04 | (3792.71 ms | 138236 tok/s) step 17934/76294 | train loss 3.408892 | norm 1.2593 | lr 1.40e-04 | (3792.49 ms | 138244 tok/s) step 17935/76294 | train loss 3.419505 | norm 2.9203 | lr 1.40e-04 | (3916.68 ms | 133860 tok/s) step 17936/76294 | train loss 3.360645 | norm 1.8046 | lr 1.40e-04 | (3792.42 ms | 138246 tok/s) step 17937/76294 | train loss 3.428453 | norm 1.2851 | lr 1.40e-04 | (3798.52 ms | 138024 tok/s) step 17938/76294 | train loss 3.441594 | norm 3.2130 | lr 1.40e-04 | (3825.29 ms | 137058 tok/s) step 17939/76294 | train loss 3.395694 | norm 1.7649 | lr 1.40e-04 | (3791.64 ms | 138275 tok/s) step 17940/76294 | train loss 3.542119 | norm 7.9955 | lr 1.40e-04 | (3821.40 ms | 137198 tok/s) step 17941/76294 | train loss 3.454368 | norm 3.1988 | lr 1.40e-04 | (3802.62 ms | 137875 tok/s) step 17942/76294 | train loss 3.507397 | norm 2.7800 | lr 1.40e-04 | (3965.69 ms | 132206 tok/s) step 17943/76294 | train loss 3.382098 | norm 2.3303 | lr 1.40e-04 | (3786.20 ms | 138473 tok/s) step 17944/76294 | train loss 3.436675 | norm 1.7590 | lr 1.40e-04 | (3846.76 ms | 136294 tok/s) step 17945/76294 | train loss 3.488636 | norm 1.9507 | lr 1.39e-04 | (3795.90 ms | 138120 tok/s) 
step 17946/76294 | train loss 3.371064 | norm 2.9938 | lr 1.39e-04 | (3824.27 ms | 137095 tok/s) step 17947/76294 | train loss 3.422650 | norm 4.8245 | lr 1.39e-04 | (3796.27 ms | 138106 tok/s) step 17948/76294 | train loss 3.392746 | norm 2.2927 | lr 1.39e-04 | (3847.01 ms | 136285 tok/s) step 17949/76294 | train loss 3.433290 | norm 1.7375 | lr 1.39e-04 | (3796.21 ms | 138108 tok/s) step 17950/76294 | train loss 3.415005 | norm 2.1778 | lr 1.39e-04 | (3819.87 ms | 137253 tok/s) step 17951/76294 | train loss 3.351585 | norm 1.9603 | lr 1.39e-04 | (3797.74 ms | 138053 tok/s) step 17952/76294 | train loss 3.391916 | norm 0.9943 | lr 1.39e-04 | (3858.56 ms | 135876 tok/s) step 17953/76294 | train loss 3.588921 | norm 1.7933 | lr 1.39e-04 | (3799.32 ms | 137995 tok/s) step 17954/76294 | train loss 3.386359 | norm 3.0050 | lr 1.39e-04 | (3841.67 ms | 136474 tok/s) step 17955/76294 | train loss 3.371399 | norm 2.6421 | lr 1.39e-04 | (3873.62 ms | 135348 tok/s) step 17956/76294 | train loss 3.399420 | norm 1.1947 | lr 1.39e-04 | (3797.55 ms | 138059 tok/s) step 17957/76294 | train loss 3.394378 | norm 5.4707 | lr 1.39e-04 | (3803.93 ms | 137828 tok/s) step 17958/76294 | train loss 3.411955 | norm 2.3045 | lr 1.39e-04 | (3833.02 ms | 136782 tok/s) step 17959/76294 | train loss 3.405292 | norm 1.5725 | lr 1.39e-04 | (3805.69 ms | 137764 tok/s) step 17960/76294 | train loss 3.428273 | norm 3.5292 | lr 1.39e-04 | (3813.80 ms | 137471 tok/s) step 17961/76294 | train loss 3.396589 | norm 2.9789 | lr 1.39e-04 | (3806.71 ms | 137727 tok/s) step 17962/76294 | train loss 3.400733 | norm 1.6763 | lr 1.39e-04 | (3864.77 ms | 135658 tok/s) step 17963/76294 | train loss 3.459319 | norm 2.0561 | lr 1.39e-04 | (3809.96 ms | 137610 tok/s) step 17964/76294 | train loss 3.476902 | norm 1.7978 | lr 1.39e-04 | (3832.01 ms | 136818 tok/s) step 17965/76294 | train loss 3.399019 | norm 2.0096 | lr 1.39e-04 | (3936.89 ms | 133173 tok/s) step 17966/76294 | train loss 3.561027 | norm 1.7050 | lr 
1.39e-04 | (3871.16 ms | 135434 tok/s) step 17967/76294 | train loss 3.401934 | norm 1.5613 | lr 1.39e-04 | (3797.82 ms | 138050 tok/s) step 17968/76294 | train loss 3.394307 | norm 1.9402 | lr 1.39e-04 | (3825.92 ms | 137036 tok/s) step 17969/76294 | train loss 3.471546 | norm 3.7137 | lr 1.39e-04 | (3799.36 ms | 137994 tok/s) step 17970/76294 | train loss 3.408048 | norm 3.6400 | lr 1.39e-04 | (3836.95 ms | 136642 tok/s) step 17971/76294 | train loss 3.480566 | norm 1.9843 | lr 1.39e-04 | (3825.62 ms | 137046 tok/s) step 17972/76294 | train loss 3.388794 | norm 2.4414 | lr 1.39e-04 | (3799.86 ms | 137976 tok/s) step 17973/76294 | train loss 3.435033 | norm 3.8621 | lr 1.39e-04 | (3801.64 ms | 137911 tok/s) step 17974/76294 | train loss 3.512089 | norm 2.6524 | lr 1.39e-04 | (3849.78 ms | 136186 tok/s) step 17975/76294 | train loss 3.401637 | norm 6.0801 | lr 1.39e-04 | (3800.49 ms | 137953 tok/s) step 17976/76294 | train loss 3.382907 | norm 5.1692 | lr 1.39e-04 | (3805.74 ms | 137762 tok/s) step 17977/76294 | train loss 3.484540 | norm 5.0605 | lr 1.39e-04 | (3810.79 ms | 137580 tok/s) step 17978/76294 | train loss 3.407243 | norm 5.0422 | lr 1.39e-04 | (3806.57 ms | 137732 tok/s) step 17979/76294 | train loss 3.426111 | norm 2.6015 | lr 1.39e-04 | (3809.49 ms | 137627 tok/s) step 17980/76294 | train loss 3.408806 | norm 2.9209 | lr 1.39e-04 | (3805.64 ms | 137766 tok/s) step 17981/76294 | train loss 3.364438 | norm 2.8450 | lr 1.39e-04 | (3800.73 ms | 137944 tok/s) step 17982/76294 | train loss 3.429314 | norm 2.1134 | lr 1.39e-04 | (3841.26 ms | 136488 tok/s) step 17983/76294 | train loss 3.367998 | norm 2.5604 | lr 1.39e-04 | (3799.87 ms | 137975 tok/s) step 17984/76294 | train loss 3.473613 | norm 1.9188 | lr 1.39e-04 | (3832.64 ms | 136795 tok/s) step 17985/76294 | train loss 3.458499 | norm 3.3574 | lr 1.39e-04 | (3802.42 ms | 137883 tok/s) step 17986/76294 | train loss 3.382923 | norm 2.3479 | lr 1.39e-04 | (3829.70 ms | 136900 tok/s) step 17987/76294 | 
train loss 3.390709 | norm 2.7617 | lr 1.38e-04 | (3804.58 ms | 137804 tok/s) step 17988/76294 | train loss 3.428061 | norm 2.4956 | lr 1.38e-04 | (3835.03 ms | 136710 tok/s) step 17989/76294 | train loss 3.441401 | norm 2.5041 | lr 1.38e-04 | (3803.59 ms | 137840 tok/s) step 17990/76294 | train loss 3.420658 | norm 1.9972 | lr 1.38e-04 | (3860.97 ms | 135792 tok/s) step 17991/76294 | train loss 3.406577 | norm 1.8624 | lr 1.38e-04 | (3809.56 ms | 137624 tok/s) step 17992/76294 | train loss 3.366131 | norm 2.3980 | lr 1.38e-04 | (3831.46 ms | 136838 tok/s) step 17993/76294 | train loss 3.451610 | norm 2.3890 | lr 1.38e-04 | (3804.80 ms | 137797 tok/s) step 17994/76294 | train loss 3.403664 | norm 2.2022 | lr 1.38e-04 | (3810.62 ms | 137586 tok/s) step 17995/76294 | train loss 3.384606 | norm 2.4785 | lr 1.38e-04 | (3823.40 ms | 137126 tok/s) step 17996/76294 | train loss 3.467789 | norm 1.7561 | lr 1.38e-04 | (3817.75 ms | 137329 tok/s) step 17997/76294 | train loss 3.475750 | norm 1.8002 | lr 1.38e-04 | (3812.49 ms | 137519 tok/s) step 17998/76294 | train loss 3.417794 | norm 1.9924 | lr 1.38e-04 | (3837.01 ms | 136640 tok/s) step 17999/76294 | train loss 3.423402 | norm 1.4285 | lr 1.38e-04 | (3812.38 ms | 137522 tok/s) step 18000/76294 | train loss 3.392649 | norm 2.6135 | lr 1.38e-04 | (3856.31 ms | 135956 tok/s) val loss: 3.395149 saving model checkpoint to ./results/gpt2-124M-gqa/step_18000.pth step 18001/76294 | train loss 3.397849 | norm 3.5614 | lr 1.38e-04 | (3792.25 ms | 138253 tok/s) step 18002/76294 | train loss 3.400016 | norm 2.1225 | lr 1.38e-04 | (3848.57 ms | 136229 tok/s) step 18003/76294 | train loss 3.358234 | norm 2.0692 | lr 1.38e-04 | (3793.74 ms | 138198 tok/s) step 18004/76294 | train loss 3.338047 | norm 3.3051 | lr 1.38e-04 | (3955.01 ms | 132563 tok/s) step 18005/76294 | train loss 3.412763 | norm 2.7748 | lr 1.38e-04 | (3786.88 ms | 138449 tok/s) step 18006/76294 | train loss 3.404304 | norm 2.1114 | lr 1.38e-04 | (3792.01 ms | 138261 
tok/s) step 18007/76294 | train loss 3.394867 | norm 1.5971 | lr 1.38e-04 | (3815.62 ms | 137406 tok/s) step 18008/76294 | train loss 3.428703 | norm 2.8332 | lr 1.38e-04 | (3820.84 ms | 137218 tok/s) step 18009/76294 | train loss 3.411913 | norm 2.0414 | lr 1.38e-04 | (3813.88 ms | 137468 tok/s) step 18010/76294 | train loss 3.407481 | norm 1.4627 | lr 1.38e-04 | (3796.93 ms | 138082 tok/s) step 18011/76294 | train loss 3.419360 | norm 2.5798 | lr 1.38e-04 | (3798.35 ms | 138030 tok/s) step 18012/76294 | train loss 3.492726 | norm 2.6296 | lr 1.38e-04 | (3803.65 ms | 137838 tok/s) step 18013/76294 | train loss 3.426637 | norm 2.8473 | lr 1.38e-04 | (3805.42 ms | 137774 tok/s) step 18014/76294 | train loss 3.342865 | norm 2.9990 | lr 1.38e-04 | (3834.47 ms | 136730 tok/s) step 18015/76294 | train loss 3.389232 | norm 2.5117 | lr 1.38e-04 | (3817.59 ms | 137335 tok/s) step 18016/76294 | train loss 3.468843 | norm 2.3804 | lr 1.38e-04 | (3869.27 ms | 135500 tok/s) step 18017/76294 | train loss 3.408833 | norm 2.2290 | lr 1.38e-04 | (3815.81 ms | 137399 tok/s) step 18018/76294 | train loss 3.362079 | norm 2.0277 | lr 1.38e-04 | (3821.59 ms | 137191 tok/s) step 18019/76294 | train loss 3.431673 | norm 2.9426 | lr 1.38e-04 | (3818.00 ms | 137320 tok/s) step 18020/76294 | train loss 3.392425 | norm 3.0584 | lr 1.38e-04 | (3799.92 ms | 137973 tok/s) step 18021/76294 | train loss 3.442968 | norm 2.1873 | lr 1.38e-04 | (3804.88 ms | 137794 tok/s) step 18022/76294 | train loss 3.388080 | norm 1.5374 | lr 1.38e-04 | (3802.92 ms | 137864 tok/s) step 18023/76294 | train loss 3.435066 | norm 1.3665 | lr 1.38e-04 | (3801.15 ms | 137929 tok/s) step 18024/76294 | train loss 3.469066 | norm 1.1655 | lr 1.38e-04 | (3805.04 ms | 137788 tok/s) step 18025/76294 | train loss 3.431606 | norm 1.5640 | lr 1.38e-04 | (3796.07 ms | 138113 tok/s) step 18026/76294 | train loss 3.392273 | norm 2.1986 | lr 1.38e-04 | (3832.74 ms | 136792 tok/s) step 18027/76294 | train loss 3.398144 | norm 3.2933 
| lr 1.38e-04 | (3801.68 ms | 137910 tok/s) step 18028/76294 | train loss 3.454936 | norm 2.0953 | lr 1.38e-04 | (3805.65 ms | 137766 tok/s) step 18029/76294 | train loss 3.404178 | norm 1.6970 | lr 1.38e-04 | (3824.82 ms | 137075 tok/s) step 18030/76294 | train loss 3.384775 | norm 2.9859 | lr 1.37e-04 | (3803.10 ms | 137858 tok/s) step 18031/76294 | train loss 3.459354 | norm 2.2868 | lr 1.37e-04 | (3806.88 ms | 137721 tok/s) step 18032/76294 | train loss 3.411100 | norm 2.0492 | lr 1.37e-04 | (3806.00 ms | 137753 tok/s) step 18033/76294 | train loss 3.409424 | norm 2.1165 | lr 1.37e-04 | (3811.73 ms | 137546 tok/s) step 18034/76294 | train loss 3.515322 | norm 3.3507 | lr 1.37e-04 | (3803.77 ms | 137834 tok/s) step 18035/76294 | train loss 3.370322 | norm 1.5675 | lr 1.37e-04 | (4222.97 ms | 124152 tok/s) step 18036/76294 | train loss 3.450778 | norm 1.8998 | lr 1.37e-04 | (3831.49 ms | 136837 tok/s) step 18037/76294 | train loss 3.352633 | norm 1.5015 | lr 1.37e-04 | (3837.85 ms | 136610 tok/s) step 18038/76294 | train loss 3.420077 | norm 2.7441 | lr 1.37e-04 | (3811.21 ms | 137565 tok/s) step 18039/76294 | train loss 3.390604 | norm 1.9290 | lr 1.37e-04 | (3902.60 ms | 134343 tok/s) step 18040/76294 | train loss 3.385651 | norm 2.9783 | lr 1.37e-04 | (3803.80 ms | 137833 tok/s) step 18041/76294 | train loss 3.423803 | norm 3.1797 | lr 1.37e-04 | (3846.13 ms | 136316 tok/s) step 18042/76294 | train loss 3.347597 | norm 2.6422 | lr 1.37e-04 | (3836.26 ms | 136666 tok/s) step 18043/76294 | train loss 3.657218 | norm 3.3469 | lr 1.37e-04 | (3801.16 ms | 137928 tok/s) step 18044/76294 | train loss 3.387288 | norm 2.2595 | lr 1.37e-04 | (3832.96 ms | 136784 tok/s) step 18045/76294 | train loss 3.484739 | norm 1.9805 | lr 1.37e-04 | (3798.18 ms | 138036 tok/s) step 18046/76294 | train loss 3.412085 | norm 3.7512 | lr 1.37e-04 | (3807.26 ms | 137707 tok/s) step 18047/76294 | train loss 3.429841 | norm 2.1504 | lr 1.37e-04 | (3904.26 ms | 134286 tok/s) step 
18048/76294 | train loss 3.448420 | norm 2.7476 | lr 1.37e-04 | (3792.50 ms | 138243 tok/s) step 18049/76294 | train loss 3.382949 | norm 3.0643 | lr 1.37e-04 | (3919.43 ms | 133766 tok/s) step 18050/76294 | train loss 3.434175 | norm 2.4788 | lr 1.37e-04 | (3792.71 ms | 138236 tok/s) step 18051/76294 | train loss 3.388652 | norm 3.2094 | lr 1.37e-04 | (3799.43 ms | 137991 tok/s) step 18052/76294 | train loss 3.468756 | norm 2.1127 | lr 1.37e-04 | (3822.45 ms | 137160 tok/s) step 18053/76294 | train loss 3.415053 | norm 1.7617 | lr 1.37e-04 | (3803.05 ms | 137860 tok/s) step 18054/76294 | train loss 3.426430 | norm 1.7649 | lr 1.37e-04 | (3893.78 ms | 134648 tok/s) step 18055/76294 | train loss 3.429552 | norm 2.0102 | lr 1.37e-04 | (3790.08 ms | 138332 tok/s) step 18056/76294 | train loss 3.429955 | norm 1.7851 | lr 1.37e-04 | (3881.28 ms | 135081 tok/s) step 18057/76294 | train loss 3.358830 | norm 1.6291 | lr 1.37e-04 | (3837.02 ms | 136639 tok/s) step 18058/76294 | train loss 3.416459 | norm 2.5573 | lr 1.37e-04 | (7422.87 ms | 70631 tok/s) step 18059/76294 | train loss 3.418232 | norm 2.2732 | lr 1.37e-04 | (3952.99 ms | 132631 tok/s) step 18060/76294 | train loss 3.396689 | norm 2.3140 | lr 1.37e-04 | (3782.41 ms | 138612 tok/s) step 18061/76294 | train loss 3.484248 | norm 1.4002 | lr 1.37e-04 | (3832.33 ms | 136807 tok/s) step 18062/76294 | train loss 3.383683 | norm 2.1010 | lr 1.37e-04 | (3782.95 ms | 138592 tok/s) step 18063/76294 | train loss 3.425796 | norm 2.9053 | lr 1.37e-04 | (4313.69 ms | 121540 tok/s) step 18064/76294 | train loss 3.495012 | norm 2.3026 | lr 1.37e-04 | (3792.11 ms | 138258 tok/s) step 18065/76294 | train loss 3.336755 | norm 2.6848 | lr 1.37e-04 | (3820.94 ms | 137215 tok/s) step 18066/76294 | train loss 3.507365 | norm 3.0739 | lr 1.37e-04 | (3791.98 ms | 138262 tok/s) step 18067/76294 | train loss 3.421709 | norm 2.3045 | lr 1.37e-04 | (3796.18 ms | 138109 tok/s) step 18068/76294 | train loss 3.423663 | norm 3.0737 | lr 
1.37e-04 | (3822.15 ms | 137171 tok/s) step 18069/76294 | train loss 3.380987 | norm 1.8917 | lr 1.37e-04 | (3801.27 ms | 137924 tok/s) step 18070/76294 | train loss 3.345173 | norm 3.2461 | lr 1.37e-04 | (3818.54 ms | 137301 tok/s) step 18071/76294 | train loss 3.369584 | norm 3.9890 | lr 1.37e-04 | (3914.44 ms | 133937 tok/s) step 18072/76294 | train loss 3.396917 | norm 4.7211 | lr 1.37e-04 | (3805.29 ms | 137779 tok/s) step 18073/76294 | train loss 3.438083 | norm 3.9333 | lr 1.37e-04 | (3899.46 ms | 134451 tok/s) step 18074/76294 | train loss 3.332718 | norm 4.6455 | lr 1.37e-04 | (3804.59 ms | 137804 tok/s) step 18075/76294 | train loss 3.350570 | norm 3.0049 | lr 1.36e-04 | (3827.90 ms | 136965 tok/s) step 18076/76294 | train loss 3.416624 | norm 2.9167 | lr 1.36e-04 | (3805.67 ms | 137765 tok/s) step 18077/76294 | train loss 3.379796 | norm 2.5006 | lr 1.36e-04 | (3808.51 ms | 137662 tok/s) step 18078/76294 | train loss 3.347951 | norm 1.7528 | lr 1.36e-04 | (3833.11 ms | 136779 tok/s) step 18079/76294 | train loss 3.391838 | norm 2.2113 | lr 1.36e-04 | (3928.87 ms | 133445 tok/s) step 18080/76294 | train loss 3.526355 | norm 2.3390 | lr 1.36e-04 | (3866.77 ms | 135588 tok/s) step 18081/76294 | train loss 3.422284 | norm 2.9005 | lr 1.36e-04 | (3776.59 ms | 138826 tok/s) step 18082/76294 | train loss 3.497431 | norm 4.5714 | lr 1.36e-04 | (3831.13 ms | 136850 tok/s) step 18083/76294 | train loss 3.379372 | norm 6.8685 | lr 1.36e-04 | (3783.47 ms | 138573 tok/s) step 18084/76294 | train loss 3.694324 | norm 10.9142 | lr 1.36e-04 | (3792.15 ms | 138256 tok/s) step 18085/76294 | train loss 3.478337 | norm 20.1488 | lr 1.36e-04 | (3875.70 ms | 135276 tok/s) step 18086/76294 | train loss 3.445563 | norm 10.7231 | lr 1.36e-04 | (3784.72 ms | 138528 tok/s) step 18087/76294 | train loss 3.450544 | norm 7.8074 | lr 1.36e-04 | (3879.90 ms | 135129 tok/s) step 18088/76294 | train loss 3.422027 | norm 4.9536 | lr 1.36e-04 | (3785.00 ms | 138517 tok/s) step 18089/76294 
| train loss 3.518815 | norm 4.3427 | lr 1.36e-04 | (3798.61 ms | 138021 tok/s) step 18090/76294 | train loss 3.410190 | norm 5.6104 | lr 1.36e-04 | (3786.28 ms | 138470 tok/s) step 18091/76294 | train loss 3.433387 | norm 4.8003 | lr 1.36e-04 | (3810.87 ms | 137577 tok/s) step 18092/76294 | train loss 3.381023 | norm 2.7622 | lr 1.36e-04 | (3791.13 ms | 138293 tok/s) step 18093/76294 | train loss 3.463994 | norm 3.0080 | lr 1.36e-04 | (3817.22 ms | 137348 tok/s) step 18094/76294 | train loss 3.446882 | norm 2.6888 | lr 1.36e-04 | (3792.08 ms | 138259 tok/s) step 18095/76294 | train loss 3.417466 | norm 4.6823 | lr 1.36e-04 | (3802.80 ms | 137869 tok/s) step 18096/76294 | train loss 3.400712 | norm 5.3904 | lr 1.36e-04 | (3798.29 ms | 138032 tok/s) step 18097/76294 | train loss 3.428404 | norm 3.1457 | lr 1.36e-04 | (3823.59 ms | 137119 tok/s) step 18098/76294 | train loss 3.434281 | norm 5.1630 | lr 1.36e-04 | (3799.74 ms | 137980 tok/s) step 18099/76294 | train loss 3.430026 | norm 4.2619 | lr 1.36e-04 | (3898.53 ms | 134483 tok/s) step 18100/76294 | train loss 3.398986 | norm 3.7729 | lr 1.36e-04 | (3828.44 ms | 136945 tok/s) step 18101/76294 | train loss 3.451528 | norm 2.9459 | lr 1.36e-04 | (3797.68 ms | 138055 tok/s) step 18102/76294 | train loss 3.386754 | norm 3.6054 | lr 1.36e-04 | (3905.77 ms | 134234 tok/s) step 18103/76294 | train loss 3.402438 | norm 3.9123 | lr 1.36e-04 | (3794.58 ms | 138167 tok/s) step 18104/76294 | train loss 3.395832 | norm 3.4287 | lr 1.36e-04 | (3905.40 ms | 134247 tok/s) step 18105/76294 | train loss 3.405469 | norm 2.1582 | lr 1.36e-04 | (3796.38 ms | 138102 tok/s) step 18106/76294 | train loss 3.449052 | norm 3.4352 | lr 1.36e-04 | (3809.00 ms | 137644 tok/s) step 18107/76294 | train loss 3.381905 | norm 4.5267 | lr 1.36e-04 | (3801.95 ms | 137900 tok/s) step 18108/76294 | train loss 3.489074 | norm 2.8674 | lr 1.36e-04 | (3882.98 ms | 135022 tok/s) step 18109/76294 | train loss 3.353412 | norm 4.9988 | lr 1.36e-04 | 
(3798.63 ms | 138020 tok/s) step 18110/76294 | train loss 3.420588 | norm 4.6534 | lr 1.36e-04 | (3827.35 ms | 136985 tok/s) step 18111/76294 | train loss 3.424581 | norm 2.1636 | lr 1.36e-04 | (3798.27 ms | 138033 tok/s) step 18112/76294 | train loss 3.592567 | norm 2.6708 | lr 1.36e-04 | (3830.64 ms | 136867 tok/s) step 18113/76294 | train loss 3.426021 | norm 4.2303 | lr 1.36e-04 | (3799.87 ms | 137975 tok/s) step 18114/76294 | train loss 3.408630 | norm 3.6261 | lr 1.36e-04 | (3829.49 ms | 136908 tok/s) step 18115/76294 | train loss 3.439452 | norm 5.0561 | lr 1.36e-04 | (3803.46 ms | 137845 tok/s) step 18116/76294 | train loss 3.418994 | norm 3.2490 | lr 1.36e-04 | (3824.74 ms | 137078 tok/s) step 18117/76294 | train loss 3.370046 | norm 4.4062 | lr 1.36e-04 | (3804.58 ms | 137805 tok/s) step 18118/76294 | train loss 3.371746 | norm 2.2095 | lr 1.36e-04 | (3848.59 ms | 136229 tok/s) step 18119/76294 | train loss 3.376508 | norm 2.9746 | lr 1.36e-04 | (3804.39 ms | 137811 tok/s) step 18120/76294 | train loss 3.356231 | norm 4.9907 | lr 1.36e-04 | (3849.73 ms | 136188 tok/s) step 18121/76294 | train loss 3.415463 | norm 4.0510 | lr 1.35e-04 | (3800.21 ms | 137963 tok/s) step 18122/76294 | train loss 3.447731 | norm 3.7827 | lr 1.35e-04 | (3807.23 ms | 137708 tok/s) step 18123/76294 | train loss 3.403656 | norm 2.8147 | lr 1.35e-04 | (3854.99 ms | 136003 tok/s) step 18124/76294 | train loss 3.496524 | norm 3.2589 | lr 1.35e-04 | (3837.45 ms | 136624 tok/s) step 18125/76294 | train loss 3.422533 | norm 3.8825 | lr 1.35e-04 | (3860.23 ms | 135818 tok/s) step 18126/76294 | train loss 3.434379 | norm 5.5058 | lr 1.35e-04 | (3804.18 ms | 137819 tok/s) step 18127/76294 | train loss 3.393699 | norm 6.0090 | lr 1.35e-04 | (3856.29 ms | 135957 tok/s) step 18128/76294 | train loss 3.456352 | norm 4.8502 | lr 1.35e-04 | (3826.25 ms | 137024 tok/s) step 18129/76294 | train loss 3.364051 | norm 4.8343 | lr 1.35e-04 | (3859.35 ms | 135849 tok/s) step 18130/76294 | train loss 
3.430489 | norm 5.1724 | lr 1.35e-04 | (3799.98 ms | 137971 tok/s) step 18131/76294 | train loss 3.419283 | norm 1.6813 | lr 1.35e-04 | (3825.52 ms | 137050 tok/s) step 18132/76294 | train loss 3.418972 | norm 4.9603 | lr 1.35e-04 | (3800.93 ms | 137937 tok/s) step 18133/76294 | train loss 3.473103 | norm 3.8220 | lr 1.35e-04 | (3823.30 ms | 137130 tok/s) step 18134/76294 | train loss 3.430682 | norm 5.2641 | lr 1.35e-04 | (3826.57 ms | 137013 tok/s) step 18135/76294 | train loss 3.435124 | norm 4.6425 | lr 1.35e-04 | (3807.59 ms | 137696 tok/s) step 18136/76294 | train loss 3.501097 | norm 3.5913 | lr 1.35e-04 | (3824.02 ms | 137104 tok/s) step 18137/76294 | train loss 3.429258 | norm 5.6021 | lr 1.35e-04 | (3804.98 ms | 137790 tok/s) step 18138/76294 | train loss 3.429265 | norm 5.8684 | lr 1.35e-04 | (3836.60 ms | 136654 tok/s) step 18139/76294 | train loss 3.437193 | norm 4.4297 | lr 1.35e-04 | (3808.29 ms | 137670 tok/s) step 18140/76294 | train loss 3.440824 | norm 4.8597 | lr 1.35e-04 | (3804.67 ms | 137801 tok/s) step 18141/76294 | train loss 3.399226 | norm 4.1491 | lr 1.35e-04 | (3832.33 ms | 136806 tok/s) step 18142/76294 | train loss 3.378549 | norm 3.4888 | lr 1.35e-04 | (3804.52 ms | 137806 tok/s) step 18143/76294 | train loss 3.441546 | norm 3.6659 | lr 1.35e-04 | (3811.24 ms | 137564 tok/s) step 18144/76294 | train loss 3.388650 | norm 4.7087 | lr 1.35e-04 | (3803.41 ms | 137847 tok/s) step 18145/76294 | train loss 3.407219 | norm 2.9996 | lr 1.35e-04 | (3806.61 ms | 137731 tok/s) step 18146/76294 | train loss 3.405553 | norm 3.6716 | lr 1.35e-04 | (3830.08 ms | 136887 tok/s) step 18147/76294 | train loss 3.367284 | norm 4.6403 | lr 1.35e-04 | (3812.48 ms | 137519 tok/s) step 18148/76294 | train loss 3.397335 | norm 3.9743 | lr 1.35e-04 | (3822.48 ms | 137159 tok/s) step 18149/76294 | train loss 3.386369 | norm 3.5185 | lr 1.35e-04 | (4211.71 ms | 124483 tok/s) step 18150/76294 | train loss 3.450402 | norm 4.1195 | lr 1.35e-04 | (3810.07 ms | 137606 
tok/s) step 18151/76294 | train loss 3.409978 | norm 1.7730 | lr 1.35e-04 | (3807.76 ms | 137689 tok/s) step 18152/76294 | train loss 3.438596 | norm 2.7189 | lr 1.35e-04 | (3811.43 ms | 137557 tok/s) step 18153/76294 | train loss 3.452820 | norm 5.8060 | lr 1.35e-04 | (3808.80 ms | 137652 tok/s) step 18154/76294 | train loss 3.435353 | norm 6.0082 | lr 1.35e-04 | (3828.34 ms | 136949 tok/s) step 18155/76294 | train loss 3.421745 | norm 3.2400 | lr 1.35e-04 | (3802.61 ms | 137876 tok/s) step 18156/76294 | train loss 3.428058 | norm 2.4093 | lr 1.35e-04 | (3805.28 ms | 137779 tok/s) step 18157/76294 | train loss 3.387504 | norm 3.0206 | lr 1.35e-04 | (3832.54 ms | 136799 tok/s) step 18158/76294 | train loss 3.413141 | norm 3.6382 | lr 1.35e-04 | (3808.20 ms | 137674 tok/s) step 18159/76294 | train loss 3.424393 | norm 2.3674 | lr 1.35e-04 | (3804.12 ms | 137821 tok/s) step 18160/76294 | train loss 3.378373 | norm 2.3455 | lr 1.35e-04 | (3821.96 ms | 137178 tok/s) step 18161/76294 | train loss 3.393726 | norm 2.1732 | lr 1.35e-04 | (3805.82 ms | 137760 tok/s) step 18162/76294 | train loss 3.447535 | norm 1.1863 | lr 1.35e-04 | (3825.03 ms | 137068 tok/s) step 18163/76294 | train loss 3.390271 | norm 3.5333 | lr 1.35e-04 | (3802.98 ms | 137862 tok/s) step 18164/76294 | train loss 3.422223 | norm 3.1683 | lr 1.35e-04 | (3804.19 ms | 137819 tok/s) step 18165/76294 | train loss 3.426126 | norm 5.2666 | lr 1.35e-04 | (3936.74 ms | 133178 tok/s) step 18166/76294 | train loss 3.483772 | norm 3.1046 | lr 1.35e-04 | (3814.51 ms | 137446 tok/s) step 18167/76294 | train loss 3.347217 | norm 3.4577 | lr 1.35e-04 | (3807.23 ms | 137708 tok/s) step 18168/76294 | train loss 3.490523 | norm 3.7213 | lr 1.34e-04 | (3828.27 ms | 136952 tok/s) step 18169/76294 | train loss 3.438392 | norm 3.6440 | lr 1.34e-04 | (3804.49 ms | 137808 tok/s) step 18170/76294 | train loss 3.441620 | norm 2.6757 | lr 1.34e-04 | (3801.58 ms | 137913 tok/s) step 18171/76294 | train loss 3.389318 | norm 3.9890 
| lr 1.34e-04 | (3828.21 ms | 136954 tok/s) step 18172/76294 | train loss 3.455889 | norm 4.0934 | lr 1.34e-04 | (3803.90 ms | 137829 tok/s) step 18173/76294 | train loss 3.411881 | norm 3.5570 | lr 1.34e-04 | (3807.09 ms | 137714 tok/s) step 18174/76294 | train loss 3.327673 | norm 2.2258 | lr 1.34e-04 | (3833.99 ms | 136747 tok/s) step 18175/76294 | train loss 3.463056 | norm 4.9292 | lr 1.34e-04 | (3928.88 ms | 133445 tok/s) step 18176/76294 | train loss 3.356794 | norm 5.0500 | lr 1.34e-04 | (3797.87 ms | 138048 tok/s) step 18177/76294 | train loss 3.466089 | norm 3.7198 | lr 1.34e-04 | (3852.44 ms | 136092 tok/s) step 18178/76294 | train loss 3.440591 | norm 4.1173 | lr 1.34e-04 | (3798.92 ms | 138010 tok/s) step 18179/76294 | train loss 3.453777 | norm 1.7548 | lr 1.34e-04 | (3862.68 ms | 135732 tok/s) step 18180/76294 | train loss 3.481046 | norm 2.0850 | lr 1.34e-04 | (3817.98 ms | 137321 tok/s) step 18181/76294 | train loss 3.383288 | norm 2.6509 | lr 1.34e-04 | (3816.36 ms | 137379 tok/s) step 18182/76294 | train loss 3.353440 | norm 2.1174 | lr 1.34e-04 | (3798.22 ms | 138035 tok/s) step 18183/76294 | train loss 3.616203 | norm 5.4543 | lr 1.34e-04 | (3806.05 ms | 137751 tok/s) step 18184/76294 | train loss 3.375145 | norm 4.1639 | lr 1.34e-04 | (3825.37 ms | 137055 tok/s) step 18185/76294 | train loss 3.405793 | norm 5.2764 | lr 1.34e-04 | (3824.05 ms | 137103 tok/s) step 18186/76294 | train loss 3.413163 | norm 4.6660 | lr 1.34e-04 | (3827.54 ms | 136978 tok/s) step 18187/76294 | train loss 3.517829 | norm 3.9535 | lr 1.34e-04 | (3912.11 ms | 134017 tok/s) step 18188/76294 | train loss 3.425688 | norm 3.7942 | lr 1.34e-04 | (3839.20 ms | 136562 tok/s) step 18189/76294 | train loss 3.579829 | norm 2.7900 | lr 1.34e-04 | (3809.16 ms | 137639 tok/s) step 18190/76294 | train loss 3.450126 | norm 3.4729 | lr 1.34e-04 | (3878.68 ms | 135172 tok/s) step 18191/76294 | train loss 3.426008 | norm 5.7362 | lr 1.34e-04 | (3803.67 ms | 137837 tok/s) step 
18192/76294 | train loss 3.430507 | norm 16.6278 | lr 1.34e-04 | (3782.89 ms | 138595 tok/s) step 18193/76294 | train loss 3.472378 | norm 5.6102 | lr 1.34e-04 | (3810.66 ms | 137585 tok/s) step 18194/76294 | train loss 3.411316 | norm 3.7196 | lr 1.34e-04 | (3790.70 ms | 138309 tok/s) step 18195/76294 | train loss 3.383987 | norm 3.8603 | lr 1.34e-04 | (3790.89 ms | 138302 tok/s) step 18196/76294 | train loss 3.402859 | norm 3.9356 | lr 1.34e-04 | (3809.24 ms | 137636 tok/s) step 18197/76294 | train loss 3.443044 | norm 11.1231 | lr 1.34e-04 | (3832.07 ms | 136816 tok/s) step 18198/76294 | train loss 3.388986 | norm 4.9110 | lr 1.34e-04 | (3850.40 ms | 136165 tok/s) step 18199/76294 | train loss 3.459363 | norm 5.8720 | lr 1.34e-04 | (3820.70 ms | 137223 tok/s) step 18200/76294 | train loss 3.433702 | norm 5.4095 | lr 1.34e-04 | (3809.19 ms | 137638 tok/s) step 18201/76294 | train loss 3.457317 | norm 3.0542 | lr 1.34e-04 | (3790.48 ms | 138317 tok/s) step 18202/76294 | train loss 3.403921 | norm 3.1833 | lr 1.34e-04 | (3796.48 ms | 138098 tok/s) step 18203/76294 | train loss 3.390672 | norm 5.0949 | lr 1.34e-04 | (3892.36 ms | 134697 tok/s) step 18204/76294 | train loss 3.628380 | norm 5.3646 | lr 1.34e-04 | (3806.09 ms | 137750 tok/s) step 18205/76294 | train loss 3.406555 | norm 5.5299 | lr 1.34e-04 | (3798.44 ms | 138027 tok/s) step 18206/76294 | train loss 3.433638 | norm 5.5201 | lr 1.34e-04 | (3823.97 ms | 137106 tok/s) step 18207/76294 | train loss 3.421874 | norm 8.5230 | lr 1.34e-04 | (3795.86 ms | 138121 tok/s) step 18208/76294 | train loss 3.277929 | norm 4.9690 | lr 1.34e-04 | (3793.27 ms | 138215 tok/s) step 18209/76294 | train loss 3.408679 | norm 9.1946 | lr 1.34e-04 | (3847.41 ms | 136270 tok/s) step 18210/76294 | train loss 3.416450 | norm 5.2628 | lr 1.34e-04 | (3795.13 ms | 138147 tok/s) step 18211/76294 | train loss 3.416386 | norm 4.5037 | lr 1.34e-04 | (3797.58 ms | 138058 tok/s) step 18212/76294 | train loss 3.490297 | norm 3.0737 | lr 
1.34e-04 | (3819.12 ms | 137280 tok/s) step 18213/76294 | train loss 3.393703 | norm 5.2123 | lr 1.34e-04 | (3797.16 ms | 138074 tok/s) step 18214/76294 | train loss 3.446011 | norm 2.4941 | lr 1.34e-04 | (3797.76 ms | 138052 tok/s) step 18215/76294 | train loss 3.383453 | norm 3.1266 | lr 1.34e-04 | (3797.10 ms | 138076 tok/s) step 18216/76294 | train loss 3.422867 | norm 2.6153 | lr 1.34e-04 | (3798.54 ms | 138023 tok/s) step 18217/76294 | train loss 3.394318 | norm 3.1433 | lr 1.33e-04 | (3797.15 ms | 138074 tok/s) step 18218/76294 | train loss 3.354460 | norm 4.9154 | lr 1.33e-04 | (3804.73 ms | 137799 tok/s) step 18219/76294 | train loss 3.389923 | norm 3.7153 | lr 1.33e-04 | (3801.05 ms | 137933 tok/s) step 18220/76294 | train loss 3.385584 | norm 4.3370 | lr 1.33e-04 | (3792.55 ms | 138242 tok/s) step 18221/76294 | train loss 3.416332 | norm 3.2246 | lr 1.33e-04 | (3894.78 ms | 134613 tok/s) step 18222/76294 | train loss 3.405736 | norm 3.4081 | lr 1.33e-04 | (3870.72 ms | 135450 tok/s) step 18223/76294 | train loss 3.395805 | norm 3.7806 | lr 1.33e-04 | (3790.98 ms | 138299 tok/s) step 18224/76294 | train loss 3.475194 | norm 3.0132 | lr 1.33e-04 | (3847.89 ms | 136253 tok/s) step 18225/76294 | train loss 3.402551 | norm 2.3699 | lr 1.33e-04 | (3796.23 ms | 138107 tok/s) step 18226/76294 | train loss 3.408381 | norm 3.7226 | lr 1.33e-04 | (4130.07 ms | 126944 tok/s) step 18227/76294 | train loss 3.435549 | norm 2.0670 | lr 1.33e-04 | (3794.93 ms | 138155 tok/s) step 18228/76294 | train loss 3.431066 | norm 2.5316 | lr 1.33e-04 | (3829.85 ms | 136895 tok/s) step 18229/76294 | train loss 3.345058 | norm 1.6743 | lr 1.33e-04 | (3795.36 ms | 138139 tok/s) step 18230/76294 | train loss 3.501157 | norm 2.4183 | lr 1.33e-04 | (3821.38 ms | 137199 tok/s) step 18231/76294 | train loss 3.392616 | norm 4.0763 | lr 1.33e-04 | (3805.26 ms | 137780 tok/s) step 18232/76294 | train loss 3.398510 | norm 2.9823 | lr 1.33e-04 | (3799.05 ms | 138005 tok/s) step 18233/76294 | 
train loss 3.397539 | norm 3.5590 | lr 1.33e-04 | (3815.40 ms | 137414 tok/s) step 18234/76294 | train loss 3.408313 | norm 3.8317 | lr 1.33e-04 | (3798.75 ms | 138016 tok/s) step 18235/76294 | train loss 3.475021 | norm 6.2482 | lr 1.33e-04 | (3794.37 ms | 138175 tok/s) step 18236/76294 | train loss 3.395795 | norm 5.1690 | lr 1.33e-04 | (3821.28 ms | 137202 tok/s) step 18237/76294 | train loss 3.444256 | norm 4.2059 | lr 1.33e-04 | (3798.01 ms | 138043 tok/s) step 18238/76294 | train loss 3.516722 | norm 3.7140 | lr 1.33e-04 | (3793.99 ms | 138189 tok/s) step 18239/76294 | train loss 3.562354 | norm 3.3842 | lr 1.33e-04 | (3820.42 ms | 137233 tok/s) step 18240/76294 | train loss 3.452684 | norm 3.0215 | lr 1.33e-04 | (3794.04 ms | 138187 tok/s) step 18241/76294 | train loss 3.441569 | norm 3.3118 | lr 1.33e-04 | (3825.81 ms | 137040 tok/s) step 18242/76294 | train loss 3.405612 | norm 3.5234 | lr 1.33e-04 | (3797.00 ms | 138079 tok/s) step 18243/76294 | train loss 3.410061 | norm 2.7461 | lr 1.33e-04 | (3798.06 ms | 138041 tok/s) step 18244/76294 | train loss 3.405756 | norm 2.8211 | lr 1.33e-04 | (3817.07 ms | 137354 tok/s) step 18245/76294 | train loss 3.420950 | norm 3.0166 | lr 1.33e-04 | (3806.13 ms | 137748 tok/s) step 18246/76294 | train loss 3.443777 | norm 1.9401 | lr 1.33e-04 | (3802.42 ms | 137883 tok/s) step 18247/76294 | train loss 3.399022 | norm 3.6944 | lr 1.33e-04 | (3863.91 ms | 135688 tok/s) step 18248/76294 | train loss 3.404675 | norm 4.0960 | lr 1.33e-04 | (3794.31 ms | 138177 tok/s) step 18249/76294 | train loss 3.394283 | norm 4.1578 | lr 1.33e-04 | (3801.16 ms | 137928 tok/s) step 18250/76294 | train loss 3.401768 | norm 7.6422 | lr 1.33e-04 | (3819.15 ms | 137279 tok/s) val loss: 3.414971 saving model checkpoint to ./results/gpt2-124M-gqa/step_18250.pth step 18251/76294 | train loss 3.397708 | norm 4.0747 | lr 1.33e-04 | (3818.02 ms | 137319 tok/s) step 18252/76294 | train loss 3.427639 | norm 4.5653 | lr 1.33e-04 | (3826.25 ms | 137024 
tok/s) step 18253/76294 | train loss 3.451171 | norm 4.3909 | lr 1.33e-04 | (3799.95 ms | 137972 tok/s) step 18254/76294 | train loss 3.390971 | norm 2.1846 | lr 1.33e-04 | (3794.82 ms | 138159 tok/s) step 18255/76294 | train loss 3.459935 | norm 2.8155 | lr 1.33e-04 | (3801.21 ms | 137927 tok/s) step 18256/76294 | train loss 3.475006 | norm 3.5247 | lr 1.33e-04 | (3798.88 ms | 138011 tok/s) step 18257/76294 | train loss 3.379410 | norm 4.3435 | lr 1.33e-04 | (3804.65 ms | 137802 tok/s) step 18258/76294 | train loss 3.419461 | norm 3.9689 | lr 1.33e-04 | (3817.30 ms | 137345 tok/s) step 18259/76294 | train loss 3.460838 | norm 3.1681 | lr 1.33e-04 | (3795.16 ms | 138147 tok/s) step 18260/76294 | train loss 3.493761 | norm 4.7532 | lr 1.33e-04 | (3814.90 ms | 137432 tok/s) step 18261/76294 | train loss 3.392034 | norm 3.1112 | lr 1.33e-04 | (3823.85 ms | 137110 tok/s) step 18262/76294 | train loss 3.430118 | norm 4.7829 | lr 1.33e-04 | (3817.55 ms | 137336 tok/s) step 18263/76294 | train loss 3.509272 | norm 4.5684 | lr 1.33e-04 | (3796.52 ms | 138097 tok/s) step 18264/76294 | train loss 3.445188 | norm 4.2350 | lr 1.33e-04 | (3840.75 ms | 136507 tok/s) step 18265/76294 | train loss 3.435230 | norm 5.8787 | lr 1.33e-04 | (3794.96 ms | 138154 tok/s) step 18266/76294 | train loss 3.424764 | norm 6.3441 | lr 1.33e-04 | (3932.02 ms | 133338 tok/s) step 18267/76294 | train loss 3.461730 | norm 6.5891 | lr 1.33e-04 | (3792.80 ms | 138232 tok/s) step 18268/76294 | train loss 3.496919 | norm 4.7494 | lr 1.32e-04 | (3820.25 ms | 137239 tok/s) step 18269/76294 | train loss 3.380333 | norm 6.8119 | lr 1.32e-04 | (3821.69 ms | 137188 tok/s) step 18270/76294 | train loss 3.441396 | norm 3.7874 | lr 1.32e-04 | (3805.79 ms | 137761 tok/s) step 18271/76294 | train loss 3.463199 | norm 4.7557 | lr 1.32e-04 | (3882.23 ms | 135048 tok/s) step 18272/76294 | train loss 3.477384 | norm 3.6983 | lr 1.32e-04 | (3792.63 ms | 138239 tok/s) step 18273/76294 | train loss 3.427443 | norm 2.6637 
| lr 1.32e-04 | (3888.80 ms | 134820 tok/s) step 18274/76294 | train loss 3.455047 | norm 5.6737 | lr 1.32e-04 | (3785.41 ms | 138502 tok/s) step 18275/76294 | train loss 3.468488 | norm 3.9680 | lr 1.32e-04 | (4140.42 ms | 126627 tok/s) step 18276/76294 | train loss 3.385264 | norm 8.4341 | lr 1.32e-04 | (21731.85 ms | 24125 tok/s) step 18277/76294 | train loss 3.425761 | norm 4.5610 | lr 1.32e-04 | (3883.17 ms | 135016 tok/s) step 18278/76294 | train loss 3.374874 | norm 6.5342 | lr 1.32e-04 | (3761.98 ms | 139365 tok/s) step 18279/76294 | train loss 3.442518 | norm 5.2965 | lr 1.32e-04 | (3775.22 ms | 138876 tok/s) step 18280/76294 | train loss 3.424244 | norm 8.5959 | lr 1.32e-04 | (3774.01 ms | 138921 tok/s) step 18281/76294 | train loss 3.434728 | norm 8.4169 | lr 1.32e-04 | (3774.82 ms | 138891 tok/s) step 18282/76294 | train loss 3.496478 | norm 4.8451 | lr 1.32e-04 | (3795.21 ms | 138144 tok/s) step 18283/76294 | train loss 3.472249 | norm 2.4408 | lr 1.32e-04 | (3778.44 ms | 138758 tok/s) step 18284/76294 | train loss 3.491372 | norm 2.4852 | lr 1.32e-04 | (3783.37 ms | 138577 tok/s) step 18285/76294 | train loss 3.473660 | norm 4.5575 | lr 1.32e-04 | (3785.55 ms | 138497 tok/s) step 18286/76294 | train loss 3.500730 | norm 3.2975 | lr 1.32e-04 | (3807.31 ms | 137706 tok/s) step 18287/76294 | train loss 3.515933 | norm 3.6593 | lr 1.32e-04 | (3788.98 ms | 138372 tok/s) step 18288/76294 | train loss 3.432370 | norm 2.6249 | lr 1.32e-04 | (3804.60 ms | 137804 tok/s) step 18289/76294 | train loss 3.581546 | norm 7.8282 | lr 1.32e-04 | (3820.76 ms | 137221 tok/s) step 18290/76294 | train loss 3.480189 | norm 5.3713 | lr 1.32e-04 | (3811.99 ms | 137536 tok/s) step 18291/76294 | train loss 3.459369 | norm 10.2943 | lr 1.32e-04 | (3849.91 ms | 136182 tok/s) step 18292/76294 | train loss 3.526539 | norm 9.5451 | lr 1.32e-04 | (3833.89 ms | 136751 tok/s) step 18293/76294 | train loss 3.491057 | norm 14.8089 | lr 1.32e-04 | (3795.77 ms | 138124 tok/s) step 
18294/76294 | train loss 3.438288 | norm 6.1766 | lr 1.32e-04 | (3831.51 ms | 136836 tok/s) step 18295/76294 | train loss 3.553625 | norm 4.0101 | lr 1.32e-04 | (3794.70 ms | 138163 tok/s) step 18296/76294 | train loss 3.404839 | norm 3.9636 | lr 1.32e-04 | (3797.48 ms | 138062 tok/s) step 18297/76294 | train loss 3.458711 | norm 2.2976 | lr 1.32e-04 | (3817.11 ms | 137352 tok/s) step 18298/76294 | train loss 3.444997 | norm 3.5471 | lr 1.32e-04 | (3800.78 ms | 137942 tok/s) step 18299/76294 | train loss 3.416686 | norm 2.6622 | lr 1.32e-04 | (3822.16 ms | 137170 tok/s) step 18300/76294 | train loss 3.471050 | norm 7.3479 | lr 1.32e-04 | (3817.30 ms | 137345 tok/s) step 18301/76294 | train loss 3.453427 | norm 3.1960 | lr 1.32e-04 | (3801.14 ms | 137929 tok/s) step 18302/76294 | train loss 3.419458 | norm 2.9036 | lr 1.32e-04 | (3807.88 ms | 137685 tok/s) step 18303/76294 | train loss 3.427441 | norm 4.4073 | lr 1.32e-04 | (3804.44 ms | 137809 tok/s) step 18304/76294 | train loss 3.487923 | norm 6.1919 | lr 1.32e-04 | (3814.96 ms | 137429 tok/s) step 18305/76294 | train loss 3.492518 | norm 4.8732 | lr 1.32e-04 | (3831.89 ms | 136822 tok/s) step 18306/76294 | train loss 3.422591 | norm 3.1120 | lr 1.32e-04 | (3931.19 ms | 133366 tok/s) step 18307/76294 | train loss 3.456934 | norm 7.3868 | lr 1.32e-04 | (3802.81 ms | 137869 tok/s) step 18308/76294 | train loss 3.427491 | norm 7.2254 | lr 1.32e-04 | (3831.59 ms | 136833 tok/s) step 18309/76294 | train loss 3.495885 | norm 3.3157 | lr 1.32e-04 | (3802.73 ms | 137872 tok/s) step 18310/76294 | train loss 3.442003 | norm 3.2348 | lr 1.32e-04 | (3823.69 ms | 137116 tok/s) step 18311/76294 | train loss 3.367667 | norm 2.5263 | lr 1.32e-04 | (3831.51 ms | 136836 tok/s) step 18312/76294 | train loss 3.418784 | norm 2.3786 | lr 1.32e-04 | (3808.87 ms | 137649 tok/s) step 18313/76294 | train loss 3.503409 | norm 1.6235 | lr 1.32e-04 | (3807.78 ms | 137689 tok/s) step 18314/76294 | train loss 3.405276 | norm 4.5290 | lr 
1.32e-04 | (3830.25 ms | 136881 tok/s) step 18315/76294 | train loss 3.459352 | norm 6.8612 | lr 1.32e-04 | (3874.59 ms | 135314 tok/s) step 18316/76294 | train loss 3.373750 | norm 2.8613 | lr 1.32e-04 | (3800.08 ms | 137968 tok/s) step 18317/76294 | train loss 3.413423 | norm 3.3330 | lr 1.32e-04 | (3827.21 ms | 136990 tok/s) step 18318/76294 | train loss 3.470261 | norm 3.2116 | lr 1.32e-04 | (3798.24 ms | 138034 tok/s) step 18319/76294 | train loss 3.455247 | norm 3.9638 | lr 1.32e-04 | (3854.97 ms | 136003 tok/s) step 18320/76294 | train loss 3.452535 | norm 3.4282 | lr 1.32e-04 | (3805.19 ms | 137782 tok/s) step 18321/76294 | train loss 3.408248 | norm 4.0209 | lr 1.31e-04 | (3807.08 ms | 137714 tok/s) step 18322/76294 | train loss 3.420955 | norm 9.0989 | lr 1.31e-04 | (3829.63 ms | 136903 tok/s) step 18323/76294 | train loss 3.412488 | norm 10.1462 | lr 1.31e-04 | (3888.81 ms | 134820 tok/s) step 18324/76294 | train loss 3.463825 | norm 8.3294 | lr 1.31e-04 | (3822.08 ms | 137174 tok/s) step 18325/76294 | train loss 3.444088 | norm 9.4928 | lr 1.31e-04 | (3810.48 ms | 137591 tok/s) step 18326/76294 | train loss 3.440593 | norm 6.1535 | lr 1.31e-04 | (3944.12 ms | 132929 tok/s) step 18327/76294 | train loss 3.501961 | norm 9.7839 | lr 1.31e-04 | (3805.54 ms | 137770 tok/s) step 18328/76294 | train loss 3.529528 | norm 7.2490 | lr 1.31e-04 | (3849.90 ms | 136182 tok/s) step 18329/76294 | train loss 3.421083 | norm 5.4068 | lr 1.31e-04 | (3799.44 ms | 137991 tok/s) step 18330/76294 | train loss 3.435322 | norm 5.6760 | lr 1.31e-04 | (3892.36 ms | 134697 tok/s) step 18331/76294 | train loss 3.362198 | norm 5.7318 | lr 1.31e-04 | (3793.18 ms | 138219 tok/s) step 18332/76294 | train loss 3.457855 | norm 7.1559 | lr 1.31e-04 | (3849.77 ms | 136187 tok/s) step 18333/76294 | train loss 3.439768 | norm 4.9351 | lr 1.31e-04 | (3801.42 ms | 137919 tok/s) step 18334/76294 | train loss 3.468411 | norm 4.1644 | lr 1.31e-04 | (3818.94 ms | 137286 tok/s) step 18335/76294 | 
train loss 3.546844 | norm 5.1807 | lr 1.31e-04 | (3793.17 ms | 138219 tok/s) step 18336/76294 | train loss 3.439774 | norm 7.1700 | lr 1.31e-04 | (3803.59 ms | 137840 tok/s) step 18337/76294 | train loss 3.465430 | norm 4.9478 | lr 1.31e-04 | (3818.15 ms | 137315 tok/s) step 18338/76294 | train loss 3.453654 | norm 4.8810 | lr 1.31e-04 | (3801.46 ms | 137918 tok/s) step 18339/76294 | train loss 3.470190 | norm 3.0062 | lr 1.31e-04 | (3888.96 ms | 134814 tok/s) step 18340/76294 | train loss 3.517675 | norm 3.5535 | lr 1.31e-04 | (3797.26 ms | 138070 tok/s) step 18341/76294 | train loss 3.437346 | norm 2.7486 | lr 1.31e-04 | (3863.85 ms | 135691 tok/s) step 18342/76294 | train loss 3.414165 | norm 3.1281 | lr 1.31e-04 | (3799.06 ms | 138005 tok/s) step 18343/76294 | train loss 3.543571 | norm 4.3221 | lr 1.31e-04 | (3851.59 ms | 136122 tok/s) step 18344/76294 | train loss 3.451031 | norm 2.9836 | lr 1.31e-04 | (3798.99 ms | 138007 tok/s) step 18345/76294 | train loss 3.461086 | norm 4.0251 | lr 1.31e-04 | (3846.72 ms | 136295 tok/s) step 18346/76294 | train loss 3.413841 | norm 4.1899 | lr 1.31e-04 | (3794.83 ms | 138158 tok/s) step 18347/76294 | train loss 3.421040 | norm 4.7539 | lr 1.31e-04 | (3848.64 ms | 136227 tok/s) step 18348/76294 | train loss 3.438810 | norm 6.0324 | lr 1.31e-04 | (3798.52 ms | 138024 tok/s) step 18349/76294 | train loss 3.439661 | norm 3.8426 | lr 1.31e-04 | (3850.55 ms | 136159 tok/s) step 18350/76294 | train loss 3.441853 | norm 4.0298 | lr 1.31e-04 | (3798.69 ms | 138018 tok/s) step 18351/76294 | train loss 3.447215 | norm 7.7509 | lr 1.31e-04 | (3848.38 ms | 136236 tok/s) step 18352/76294 | train loss 3.407449 | norm 7.6339 | lr 1.31e-04 | (3798.72 ms | 138017 tok/s) step 18353/76294 | train loss 3.470351 | norm 7.7119 | lr 1.31e-04 | (3805.26 ms | 137780 tok/s) step 18354/76294 | train loss 3.451900 | norm 3.0809 | lr 1.31e-04 | (3821.08 ms | 137210 tok/s) step 18355/76294 | train loss 3.466480 | norm 1.8436 | lr 1.31e-04 | (3805.07 
ms | 137787 tok/s) step 18356/76294 | train loss 3.562064 | norm 4.3214 | lr 1.31e-04 | (3826.64 ms | 137010 tok/s) step 18357/76294 | train loss 3.453797 | norm 4.2064 | lr 1.31e-04 | (3804.46 ms | 137809 tok/s) step 18358/76294 | train loss 3.528142 | norm 7.8116 | lr 1.31e-04 | (3806.77 ms | 137725 tok/s) step 18359/76294 | train loss 3.489644 | norm 5.1808 | lr 1.31e-04 | (3804.78 ms | 137797 tok/s) step 18360/76294 | train loss 3.374765 | norm 3.8616 | lr 1.31e-04 | (3806.57 ms | 137732 tok/s) step 18361/76294 | train loss 3.705694 | norm 4.5563 | lr 1.31e-04 | (3803.04 ms | 137860 tok/s) step 18362/76294 | train loss 3.379959 | norm 6.5227 | lr 1.31e-04 | (3804.54 ms | 137806 tok/s) step 18363/76294 | train loss 3.466020 | norm 7.4090 | lr 1.31e-04 | (3886.10 ms | 134914 tok/s) step 18364/76294 | train loss 3.454849 | norm 6.8451 | lr 1.31e-04 | (3802.68 ms | 137873 tok/s) step 18365/76294 | train loss 3.478432 | norm 5.2887 | lr 1.31e-04 | (3816.70 ms | 137367 tok/s) step 18366/76294 | train loss 3.622279 | norm 5.7705 | lr 1.31e-04 | (3802.80 ms | 137869 tok/s) step 18367/76294 | train loss 3.438163 | norm 12.2897 | lr 1.31e-04 | (3806.87 ms | 137722 tok/s) step 18368/76294 | train loss 3.520867 | norm 10.0262 | lr 1.31e-04 | (3838.80 ms | 136576 tok/s) step 18369/76294 | train loss 3.424773 | norm 3.9824 | lr 1.31e-04 | (3803.38 ms | 137848 tok/s) step 18370/76294 | train loss 3.462432 | norm 4.4364 | lr 1.31e-04 | (3830.76 ms | 136863 tok/s) step 18371/76294 | train loss 3.506183 | norm 4.1356 | lr 1.31e-04 | (3804.59 ms | 137804 tok/s) step 18372/76294 | train loss 3.425041 | norm 3.3937 | lr 1.31e-04 | (3804.01 ms | 137825 tok/s) step 18373/76294 | train loss 3.477471 | norm 3.5925 | lr 1.31e-04 | (3837.70 ms | 136615 tok/s) step 18374/76294 | train loss 3.477164 | norm 4.1232 | lr 1.31e-04 | (3807.24 ms | 137708 tok/s) step 18375/76294 | train loss 3.395884 | norm 2.8198 | lr 1.31e-04 | (3820.83 ms | 137218 tok/s) step 18376/76294 | train loss 3.502809 
| norm 3.7642 | lr 1.31e-04 | (3804.38 ms | 137812 tok/s) step 18377/76294 | train loss 3.443491 | norm 18.0006 | lr 1.30e-04 | (3809.46 ms | 137628 tok/s) step 18378/76294 | train loss 3.542857 | norm 7.3645 | lr 1.30e-04 | (3802.22 ms | 137890 tok/s) step 18379/76294 | train loss 3.485240 | norm 3.0805 | lr 1.30e-04 | (3802.50 ms | 137880 tok/s) step 18380/76294 | train loss 3.396826 | norm 2.9714 | lr 1.30e-04 | (3855.32 ms | 135991 tok/s) step 18381/76294 | train loss 3.432127 | norm 3.3904 | lr 1.30e-04 | (3800.03 ms | 137970 tok/s) step 18382/76294 | train loss 3.409907 | norm 4.1828 | lr 1.30e-04 | (3803.54 ms | 137842 tok/s) step 18383/76294 | train loss 3.425920 | norm 2.3859 | lr 1.30e-04 | (3825.41 ms | 137054 tok/s) step 18384/76294 | train loss 3.523975 | norm 1.6776 | lr 1.30e-04 | (3805.61 ms | 137767 tok/s) step 18385/76294 | train loss 3.387553 | norm 2.5596 | lr 1.30e-04 | (3811.72 ms | 137546 tok/s) step 18386/76294 | train loss 3.538356 | norm 2.6428 | lr 1.30e-04 | (3800.37 ms | 137957 tok/s) step 18387/76294 | train loss 3.439960 | norm 2.0771 | lr 1.30e-04 | (3796.79 ms | 138087 tok/s) step 18388/76294 | train loss 3.529564 | norm 4.3413 | lr 1.30e-04 | (4058.46 ms | 129184 tok/s) step 18389/76294 | train loss 3.481651 | norm 4.7012 | lr 1.30e-04 | (3805.99 ms | 137753 tok/s) step 18390/76294 | train loss 3.421402 | norm 5.8974 | lr 1.30e-04 | (3814.89 ms | 137432 tok/s) step 18391/76294 | train loss 3.495571 | norm 4.0756 | lr 1.30e-04 | (3828.80 ms | 136933 tok/s) step 18392/76294 | train loss 3.444223 | norm 5.0751 | lr 1.30e-04 | (3810.70 ms | 137583 tok/s) step 18393/76294 | train loss 3.481249 | norm 4.2630 | lr 1.30e-04 | (3819.66 ms | 137260 tok/s) step 18394/76294 | train loss 3.472091 | norm 2.2528 | lr 1.30e-04 | (3809.65 ms | 137621 tok/s) step 18395/76294 | train loss 3.492997 | norm 5.7970 | lr 1.30e-04 | (3817.64 ms | 137333 tok/s) step 18396/76294 | train loss 3.513105 | norm 7.7308 | lr 1.30e-04 | (3808.67 ms | 137656 tok/s) 
step 18397/76294 | train loss 3.489818 | norm 4.0010 | lr 1.30e-04 | (3815.67 ms | 137404 tok/s) step 18398/76294 | train loss 3.461090 | norm 4.2765 | lr 1.30e-04 | (3809.30 ms | 137634 tok/s) step 18399/76294 | train loss 3.629854 | norm 4.7675 | lr 1.30e-04 | (3815.80 ms | 137399 tok/s) step 18400/76294 | train loss 3.408056 | norm 4.8957 | lr 1.30e-04 | (3815.63 ms | 137405 tok/s) step 18401/76294 | train loss 3.469346 | norm 2.7210 | lr 1.30e-04 | (3810.07 ms | 137606 tok/s) step 18402/76294 | train loss 3.474058 | norm 3.4963 | lr 1.30e-04 | (3812.08 ms | 137533 tok/s) step 18403/76294 | train loss 3.517545 | norm 4.9922 | lr 1.30e-04 | (3814.24 ms | 137456 tok/s) step 18404/76294 | train loss 3.466568 | norm 3.9738 | lr 1.30e-04 | (3844.33 ms | 136380 tok/s) step 18405/76294 | train loss 3.490911 | norm 4.4383 | lr 1.30e-04 | (3809.45 ms | 137628 tok/s) step 18406/76294 | train loss 3.525824 | norm 4.9988 | lr 1.30e-04 | (3816.83 ms | 137362 tok/s) step 18407/76294 | train loss 3.431540 | norm 4.2619 | lr 1.30e-04 | (3833.17 ms | 136777 tok/s) step 18408/76294 | train loss 3.462514 | norm 2.0637 | lr 1.30e-04 | (3813.35 ms | 137487 tok/s) step 18409/76294 | train loss 3.406741 | norm 4.1009 | lr 1.30e-04 | (3842.06 ms | 136460 tok/s) step 18410/76294 | train loss 3.437575 | norm 4.5202 | lr 1.30e-04 | (3807.39 ms | 137703 tok/s) step 18411/76294 | train loss 3.529835 | norm 5.1878 | lr 1.30e-04 | (3882.42 ms | 135042 tok/s) step 18412/76294 | train loss 3.429285 | norm 2.6894 | lr 1.30e-04 | (3816.59 ms | 137371 tok/s) step 18413/76294 | train loss 3.411147 | norm 3.9594 | lr 1.30e-04 | (3877.54 ms | 135211 tok/s) step 18414/76294 | train loss 3.448056 | norm 3.4892 | lr 1.30e-04 | (3811.19 ms | 137565 tok/s) step 18415/76294 | train loss 3.552211 | norm 5.5399 | lr 1.30e-04 | (3964.52 ms | 132245 tok/s) step 18416/76294 | train loss 3.508243 | norm 3.9425 | lr 1.30e-04 | (3820.21 ms | 137241 tok/s) step 18417/76294 | train loss 3.364925 | norm 2.5358 | lr 
1.30e-04 | (4896.51 ms | 107074 tok/s) step 18418/76294 | train loss 3.455455 | norm 2.8676 | lr 1.30e-04 | (3798.52 ms | 138024 tok/s) step 18419/76294 | train loss 3.376309 | norm 5.0424 | lr 1.30e-04 | (3830.30 ms | 136879 tok/s) step 18420/76294 | train loss 3.477696 | norm 3.0913 | lr 1.30e-04 | (3803.78 ms | 137833 tok/s) step 18421/76294 | train loss 3.410494 | norm 4.7461 | lr 1.30e-04 | (3804.03 ms | 137824 tok/s) step 18422/76294 | train loss 3.393144 | norm 5.5084 | lr 1.30e-04 | (3818.69 ms | 137295 tok/s) step 18423/76294 | train loss 3.402370 | norm 8.9476 | lr 1.30e-04 | (3851.19 ms | 136137 tok/s) step 18424/76294 | train loss 3.460226 | norm 26.3212 | lr 1.30e-04 | (3796.12 ms | 138112 tok/s) step 18425/76294 | train loss 3.422221 | norm 3.5222 | lr 1.30e-04 | (3848.80 ms | 136221 tok/s) step 18426/76294 | train loss 3.412195 | norm 4.2442 | lr 1.30e-04 | (3794.39 ms | 138175 tok/s) step 18427/76294 | train loss 3.455463 | norm 4.4485 | lr 1.30e-04 | (3819.80 ms | 137255 tok/s) step 18428/76294 | train loss 3.429051 | norm 2.3238 | lr 1.30e-04 | (3842.57 ms | 136442 tok/s) step 18429/76294 | train loss 3.409135 | norm 4.2184 | lr 1.30e-04 | (3793.66 ms | 138201 tok/s) step 18430/76294 | train loss 3.399688 | norm 3.6336 | lr 1.30e-04 | (3831.13 ms | 136849 tok/s) step 18431/76294 | train loss 3.436836 | norm 4.6849 | lr 1.30e-04 | (3794.00 ms | 138189 tok/s) step 18432/76294 | train loss 3.469345 | norm 2.4073 | lr 1.30e-04 | (3892.10 ms | 134706 tok/s) step 18433/76294 | train loss 3.396259 | norm 3.4669 | lr 1.30e-04 | (3788.35 ms | 138395 tok/s) step 18434/76294 | train loss 3.427982 | norm 3.3494 | lr 1.30e-04 | (3889.29 ms | 134803 tok/s) step 18435/76294 | train loss 3.431827 | norm 2.5207 | lr 1.29e-04 | (3792.54 ms | 138242 tok/s) step 18436/76294 | train loss 3.415165 | norm 5.3358 | lr 1.29e-04 | (3862.90 ms | 135724 tok/s) step 18437/76294 | train loss 3.486950 | norm 3.6781 | lr 1.29e-04 | (3871.17 ms | 135434 tok/s) step 18438/76294 | 
train loss 3.416433 | norm 4.0628 | lr 1.29e-04 | (3771.23 ms | 139023 tok/s) step 18439/76294 | train loss 3.444794 | norm 4.3339 | lr 1.29e-04 | (3848.68 ms | 136225 tok/s) step 18440/76294 | train loss 3.449744 | norm 5.6333 | lr 1.29e-04 | (3774.68 ms | 138896 tok/s) step 18441/76294 | train loss 3.463027 | norm 4.8760 | lr 1.29e-04 | (3855.57 ms | 135982 tok/s) step 18442/76294 | train loss 3.441856 | norm 4.6249 | lr 1.29e-04 | (3771.21 ms | 139024 tok/s) step 18443/76294 | train loss 3.405655 | norm 7.1370 | lr 1.29e-04 | (3869.35 ms | 135498 tok/s) step 18444/76294 | train loss 3.387628 | norm 12.5721 | lr 1.29e-04 | (3772.46 ms | 138978 tok/s) step 18445/76294 | train loss 3.504035 | norm 7.4198 | lr 1.29e-04 | (3792.41 ms | 138246 tok/s) step 18446/76294 | train loss 3.416102 | norm 2.7262 | lr 1.29e-04 | (3777.86 ms | 138779 tok/s) step 18447/76294 | train loss 3.432308 | norm 9.4273 | lr 1.29e-04 | (3804.98 ms | 137790 tok/s) step 18448/76294 | train loss 3.467706 | norm 12.3093 | lr 1.29e-04 | (3780.34 ms | 138688 tok/s) step 18449/76294 | train loss 3.430389 | norm 14.5407 | lr 1.29e-04 | (3790.96 ms | 138300 tok/s) step 18450/76294 | train loss 3.444316 | norm 8.5195 | lr 1.29e-04 | (3814.22 ms | 137456 tok/s) step 18451/76294 | train loss 3.471721 | norm 4.6915 | lr 1.29e-04 | (3786.91 ms | 138447 tok/s) step 18452/76294 | train loss 3.529120 | norm 4.3688 | lr 1.29e-04 | (3794.35 ms | 138176 tok/s) step 18453/76294 | train loss 3.563844 | norm 4.9397 | lr 1.29e-04 | (3805.86 ms | 137758 tok/s) step 18454/76294 | train loss 3.496615 | norm 4.5363 | lr 1.29e-04 | (3788.34 ms | 138395 tok/s) step 18455/76294 | train loss 3.527237 | norm 2.6512 | lr 1.29e-04 | (3803.38 ms | 137848 tok/s) step 18456/76294 | train loss 3.506802 | norm 3.4603 | lr 1.29e-04 | (3888.72 ms | 134823 tok/s) step 18457/76294 | train loss 3.450199 | norm 4.0236 | lr 1.29e-04 | (3788.30 ms | 138397 tok/s) step 18458/76294 | train loss 3.597508 | norm 2.6568 | lr 1.29e-04 | 
(3794.85 ms | 138158 tok/s) step 18459/76294 | train loss 3.490398 | norm 6.5165 | lr 1.29e-04 | (3810.48 ms | 137591 tok/s) step 18460/76294 | train loss 3.388487 | norm 16.8563 | lr 1.29e-04 | (3791.05 ms | 138296 tok/s) step 18461/76294 | train loss 3.486579 | norm 7.3548 | lr 1.29e-04 | (3796.48 ms | 138099 tok/s) step 18462/76294 | train loss 3.491189 | norm 5.6030 | lr 1.29e-04 | (3796.94 ms | 138082 tok/s) step 18463/76294 | train loss 3.436036 | norm 2.4162 | lr 1.29e-04 | (3797.34 ms | 138067 tok/s) step 18464/76294 | train loss 3.413830 | norm 3.3973 | lr 1.29e-04 | (3794.83 ms | 138158 tok/s) step 18465/76294 | train loss 3.413773 | norm 8.5519 | lr 1.29e-04 | (3791.66 ms | 138274 tok/s) step 18466/76294 | train loss 3.459688 | norm 3.6428 | lr 1.29e-04 | (3850.35 ms | 136166 tok/s) step 18467/76294 | train loss 3.504725 | norm 8.8795 | lr 1.29e-04 | (3797.34 ms | 138067 tok/s) step 18468/76294 | train loss 3.428072 | norm 9.2224 | lr 1.29e-04 | (3799.98 ms | 137971 tok/s) step 18469/76294 | train loss 3.457947 | norm 6.4131 | lr 1.29e-04 | (3814.35 ms | 137451 tok/s) step 18470/76294 | train loss 3.435548 | norm 6.3722 | lr 1.29e-04 | (3803.10 ms | 137858 tok/s) step 18471/76294 | train loss 3.442866 | norm 4.7364 | lr 1.29e-04 | (3801.14 ms | 137929 tok/s) step 18472/76294 | train loss 3.438490 | norm 7.5404 | lr 1.29e-04 | (3801.38 ms | 137920 tok/s) step 18473/76294 | train loss 3.401033 | norm 8.4312 | lr 1.29e-04 | (3803.10 ms | 137858 tok/s) step 18474/76294 | train loss 3.398446 | norm 9.6354 | lr 1.29e-04 | (3800.59 ms | 137949 tok/s) step 18475/76294 | train loss 3.465074 | norm 6.0636 | lr 1.29e-04 | (3818.85 ms | 137289 tok/s) step 18476/76294 | train loss 3.445261 | norm 9.4756 | lr 1.29e-04 | (3815.45 ms | 137412 tok/s) step 18477/76294 | train loss 3.425191 | norm 5.0019 | lr 1.29e-04 | (3796.15 ms | 138111 tok/s) step 18478/76294 | train loss 3.400253 | norm 2.6228 | lr 1.29e-04 | (3800.90 ms | 137938 tok/s) step 18479/76294 | train loss 
3.479100 | norm 3.0363 | lr 1.29e-04 | (3804.80 ms | 137796 tok/s) step 18480/76294 | train loss 3.392510 | norm 7.8627 | lr 1.29e-04 | (3806.76 ms | 137725 tok/s) step 18481/76294 | train loss 3.517592 | norm 9.4516 | lr 1.29e-04 | (3904.75 ms | 134269 tok/s) step 18482/76294 | train loss 3.343717 | norm 6.9668 | lr 1.29e-04 | (3812.90 ms | 137504 tok/s) step 18483/76294 | train loss 3.452954 | norm 2.8294 | lr 1.29e-04 | (3819.74 ms | 137257 tok/s) step 18484/76294 | train loss 3.508556 | norm 5.9143 | lr 1.29e-04 | (3817.24 ms | 137347 tok/s) step 18485/76294 | train loss 3.397480 | norm 4.3304 | lr 1.29e-04 | (3823.04 ms | 137139 tok/s) step 18486/76294 | train loss 3.467500 | norm 5.6277 | lr 1.29e-04 | (3802.79 ms | 137869 tok/s) step 18487/76294 | train loss 3.383001 | norm 5.0080 | lr 1.29e-04 | (3842.69 ms | 136438 tok/s) step 18488/76294 | train loss 3.429729 | norm 6.4595 | lr 1.29e-04 | (3797.14 ms | 138074 tok/s) step 18489/76294 | train loss 3.420085 | norm 2.1412 | lr 1.29e-04 | (3824.79 ms | 137076 tok/s) step 18490/76294 | train loss 3.370148 | norm 3.9400 | lr 1.29e-04 | (3796.04 ms | 138115 tok/s) step 18491/76294 | train loss 3.448595 | norm 3.1839 | lr 1.29e-04 | (3839.43 ms | 136554 tok/s) step 18492/76294 | train loss 3.439479 | norm 4.7788 | lr 1.29e-04 | (3797.49 ms | 138062 tok/s) step 18493/76294 | train loss 3.398972 | norm 5.8523 | lr 1.29e-04 | (3813.08 ms | 137497 tok/s) step 18494/76294 | train loss 3.435035 | norm 3.4844 | lr 1.29e-04 | (3798.41 ms | 138028 tok/s) step 18495/76294 | train loss 3.347070 | norm 2.6476 | lr 1.29e-04 | (3804.19 ms | 137819 tok/s) step 18496/76294 | train loss 3.478922 | norm 4.1419 | lr 1.28e-04 | (3806.43 ms | 137737 tok/s) step 18497/76294 | train loss 3.527519 | norm 5.0207 | lr 1.28e-04 | (3796.67 ms | 138092 tok/s) step 18498/76294 | train loss 3.499834 | norm 6.7266 | lr 1.28e-04 | (3879.08 ms | 135158 tok/s) step 18499/76294 | train loss 3.445403 | norm 10.7951 | lr 1.28e-04 | (3798.85 ms | 
138012 tok/s) step 18500/76294 | train loss 3.477123 | norm 7.3418 | lr 1.28e-04 | (3801.19 ms | 137927 tok/s) val loss: 3.453958 saving model checkpoint to ./results/gpt2-124M-gqa/step_18500.pth step 18501/76294 | train loss 3.467667 | norm 9.9677 | lr 1.28e-04 | (3793.77 ms | 138197 tok/s) step 18502/76294 | train loss 3.466835 | norm 18.5262 | lr 1.28e-04 | (3826.54 ms | 137013 tok/s) step 18503/76294 | train loss 3.465292 | norm 4.4564 | lr 1.28e-04 | (3793.76 ms | 138197 tok/s) step 18504/76294 | train loss 3.443626 | norm 6.9127 | lr 1.28e-04 | (3852.55 ms | 136089 tok/s) step 18505/76294 | train loss 3.465578 | norm 5.3435 | lr 1.28e-04 | (3840.60 ms | 136512 tok/s) step 18506/76294 | train loss 3.350227 | norm 5.0570 | lr 1.28e-04 | (3792.43 ms | 138246 tok/s) step 18507/76294 | train loss 3.421518 | norm 6.5965 | lr 1.28e-04 | (3843.07 ms | 136424 tok/s) step 18508/76294 | train loss 3.498972 | norm 2.7294 | lr 1.28e-04 | (3800.16 ms | 137965 tok/s) step 18509/76294 | train loss 3.503299 | norm 4.3433 | lr 1.28e-04 | (3838.17 ms | 136599 tok/s) step 18510/76294 | train loss 3.424430 | norm 6.1706 | lr 1.28e-04 | (3889.27 ms | 134804 tok/s) step 18511/76294 | train loss 3.479594 | norm 4.3152 | lr 1.28e-04 | (3940.65 ms | 133046 tok/s) step 18512/76294 | train loss 3.440624 | norm 4.5004 | lr 1.28e-04 | (3798.88 ms | 138011 tok/s) step 18513/76294 | train loss 3.413356 | norm 9.4483 | lr 1.28e-04 | (3805.42 ms | 137774 tok/s) step 18514/76294 | train loss 3.399982 | norm 8.8253 | lr 1.28e-04 | (3796.94 ms | 138082 tok/s) step 18515/76294 | train loss 3.490116 | norm 7.6527 | lr 1.28e-04 | (3853.20 ms | 136066 tok/s) step 18516/76294 | train loss 3.371543 | norm 6.1315 | lr 1.28e-04 | (3800.10 ms | 137967 tok/s) step 18517/76294 | train loss 3.446737 | norm 10.1160 | lr 1.28e-04 | (3819.21 ms | 137276 tok/s) step 18518/76294 | train loss 3.392381 | norm 2.4665 | lr 1.28e-04 | (3830.35 ms | 136877 tok/s) step 18519/76294 | train loss 3.339067 | norm 9.1773 | 
lr 1.28e-04 | (3822.91 ms | 137144 tok/s) step 18520/76294 | train loss 3.488553 | norm 5.7374 | lr 1.28e-04 | (4551.27 ms | 115196 tok/s) step 18521/76294 | train loss 3.444294 | norm 5.8282 | lr 1.28e-04 | (3839.32 ms | 136558 tok/s) step 18522/76294 | train loss 3.462497 | norm 7.0947 | lr 1.28e-04 | (3805.25 ms | 137780 tok/s) step 18523/76294 | train loss 3.452620 | norm 7.2624 | lr 1.28e-04 | (3804.15 ms | 137820 tok/s) step 18524/76294 | train loss 3.696602 | norm 4.3722 | lr 1.28e-04 | (3803.18 ms | 137855 tok/s) step 18525/76294 | train loss 3.440076 | norm 6.3535 | lr 1.28e-04 | (3800.43 ms | 137955 tok/s) step 18526/76294 | train loss 3.420165 | norm 9.6459 | lr 1.28e-04 | (3829.86 ms | 136895 tok/s) step 18527/76294 | train loss 3.447670 | norm 5.4383 | lr 1.28e-04 | (3801.77 ms | 137906 tok/s) step 18528/76294 | train loss 3.392438 | norm 7.0849 | lr 1.28e-04 | (3805.15 ms | 137784 tok/s) step 18529/76294 | train loss 3.456095 | norm 7.2577 | lr 1.28e-04 | (3952.41 ms | 132650 tok/s) step 18530/76294 | train loss 3.468385 | norm 7.9364 | lr 1.28e-04 | (3798.86 ms | 138012 tok/s) step 18531/76294 | train loss 3.441214 | norm 6.5063 | lr 1.28e-04 | (3816.37 ms | 137379 tok/s) step 18532/76294 | train loss 3.495661 | norm 4.7018 | lr 1.28e-04 | (3802.04 ms | 137896 tok/s) step 18533/76294 | train loss 3.384717 | norm 11.2765 | lr 1.28e-04 | (3801.96 ms | 137899 tok/s) step 18534/76294 | train loss 3.547381 | norm 4.8319 | lr 1.28e-04 | (3817.64 ms | 137333 tok/s) step 18535/76294 | train loss 3.453629 | norm 10.9327 | lr 1.28e-04 | (3801.88 ms | 137902 tok/s) step 18536/76294 | train loss 3.435808 | norm 3.8623 | lr 1.28e-04 | (3799.72 ms | 137981 tok/s) step 18537/76294 | train loss 3.533362 | norm 5.8937 | lr 1.28e-04 | (3798.77 ms | 138015 tok/s) step 18538/76294 | train loss 3.438943 | norm 4.3884 | lr 1.28e-04 | (3819.33 ms | 137272 tok/s) step 18539/76294 | train loss 3.469136 | norm 6.4267 | lr 1.28e-04 | (3839.42 ms | 136554 tok/s) step 
18540/76294 | train loss 3.468217 | norm 5.0681 | lr 1.28e-04 | (3874.58 ms | 135315 tok/s) step 18541/76294 | train loss 3.391876 | norm 4.4475 | lr 1.28e-04 | (3791.76 ms | 138270 tok/s) step 18542/76294 | train loss 3.462745 | norm 7.0007 | lr 1.28e-04 | (3812.76 ms | 137509 tok/s) step 18543/76294 | train loss 3.369599 | norm 6.8604 | lr 1.28e-04 | (3791.01 ms | 138298 tok/s) step 18544/76294 | train loss 3.506449 | norm 5.5336 | lr 1.28e-04 | (3796.48 ms | 138098 tok/s) step 18545/76294 | train loss 3.459000 | norm 3.8798 | lr 1.28e-04 | (3812.63 ms | 137513 tok/s) step 18546/76294 | train loss 3.409569 | norm 3.9949 | lr 1.28e-04 | (3801.26 ms | 137925 tok/s) step 18547/76294 | train loss 3.397650 | norm 7.0745 | lr 1.28e-04 | (3799.94 ms | 137973 tok/s) step 18548/76294 | train loss 3.375260 | norm 2.4882 | lr 1.28e-04 | (3818.71 ms | 137295 tok/s) step 18549/76294 | train loss 3.520276 | norm 2.7337 | lr 1.28e-04 | (3791.95 ms | 138263 tok/s) step 18550/76294 | train loss 3.487239 | norm 9.7404 | lr 1.28e-04 | (3822.22 ms | 137168 tok/s) step 18551/76294 | train loss 3.448827 | norm 7.3692 | lr 1.28e-04 | (3793.55 ms | 138205 tok/s) step 18552/76294 | train loss 3.441664 | norm 4.2739 | lr 1.28e-04 | (3823.83 ms | 137111 tok/s) step 18553/76294 | train loss 3.518764 | norm 5.1660 | lr 1.28e-04 | (3799.92 ms | 137973 tok/s) step 18554/76294 | train loss 3.505677 | norm 3.9446 | lr 1.28e-04 | (3820.56 ms | 137228 tok/s) step 18555/76294 | train loss 3.501894 | norm 3.2525 | lr 1.28e-04 | (3799.38 ms | 137993 tok/s) step 18556/76294 | train loss 3.448989 | norm 5.3997 | lr 1.28e-04 | (3818.32 ms | 137309 tok/s) step 18557/76294 | train loss 3.446477 | norm 8.3589 | lr 1.28e-04 | (3798.14 ms | 138038 tok/s) step 18558/76294 | train loss 3.484628 | norm 4.4275 | lr 1.28e-04 | (3805.69 ms | 137764 tok/s) step 18559/76294 | train loss 3.466223 | norm 5.1759 | lr 1.28e-04 | (3798.69 ms | 138018 tok/s) step 18560/76294 | train loss 3.433391 | norm 5.4805 | lr 
1.28e-04 | (3845.35 ms | 136343 tok/s) step 18561/76294 | train loss 3.494781 | norm 3.3844 | lr 1.27e-04 | (3797.32 ms | 138068 tok/s) step 18562/76294 | train loss 3.503932 | norm 3.2863 | lr 1.27e-04 | (3804.62 ms | 137803 tok/s) step 18563/76294 | train loss 3.418837 | norm 7.7382 | lr 1.27e-04 | (3819.92 ms | 137251 tok/s) step 18564/76294 | train loss 3.456216 | norm 6.8498 | lr 1.27e-04 | (3804.19 ms | 137819 tok/s) step 18565/76294 | train loss 3.461565 | norm 4.1462 | lr 1.27e-04 | (3822.63 ms | 137154 tok/s) step 18566/76294 | train loss 3.399131 | norm 7.6057 | lr 1.27e-04 | (3800.62 ms | 137948 tok/s) step 18567/76294 | train loss 3.484674 | norm 4.2543 | lr 1.27e-04 | (3802.48 ms | 137880 tok/s) step 18568/76294 | train loss 3.469911 | norm 4.6224 | lr 1.27e-04 | (4003.09 ms | 130971 tok/s) step 18569/76294 | train loss 3.420169 | norm 6.1831 | lr 1.27e-04 | (3795.54 ms | 138133 tok/s) step 18570/76294 | train loss 3.365253 | norm 10.2166 | lr 1.27e-04 | (3806.33 ms | 137741 tok/s) step 18571/76294 | train loss 3.480334 | norm 10.1769 | lr 1.27e-04 | (3819.71 ms | 137258 tok/s) step 18572/76294 | train loss 3.417705 | norm 9.3679 | lr 1.27e-04 | (3806.70 ms | 137728 tok/s) step 18573/76294 | train loss 3.455700 | norm 9.4501 | lr 1.27e-04 | (3815.35 ms | 137416 tok/s) step 18574/76294 | train loss 3.489164 | norm 4.0568 | lr 1.27e-04 | (3807.69 ms | 137692 tok/s) step 18575/76294 | train loss 3.427092 | norm 5.4037 | lr 1.27e-04 | (3827.00 ms | 136997 tok/s) step 18576/76294 | train loss 3.432330 | norm 14.7208 | lr 1.27e-04 | (3800.89 ms | 137938 tok/s) step 18577/76294 | train loss 3.373524 | norm 4.5058 | lr 1.27e-04 | (3806.14 ms | 137748 tok/s) step 18578/76294 | train loss 3.447868 | norm 5.7053 | lr 1.27e-04 | (3872.81 ms | 135376 tok/s) step 18579/76294 | train loss 3.411944 | norm 5.3794 | lr 1.27e-04 | (3802.41 ms | 137883 tok/s) step 18580/76294 | train loss 3.426834 | norm 9.7151 | lr 1.27e-04 | (3846.93 ms | 136287 tok/s) step 18581/76294 
| train loss 3.378232 | norm 8.3382 | lr 1.27e-04 | (3816.19 ms | 137385 tok/s) step 18582/76294 | train loss 3.460391 | norm 9.6048 | lr 1.27e-04 | (3811.71 ms | 137547 tok/s) step 18583/76294 | train loss 3.483913 | norm 7.0288 | lr 1.27e-04 | (3797.92 ms | 138046 tok/s) step 18584/76294 | train loss 3.398072 | norm 5.8153 | lr 1.27e-04 | (3822.62 ms | 137154 tok/s) step 18585/76294 | train loss 3.449107 | norm 6.1795 | lr 1.27e-04 | (3800.38 ms | 137957 tok/s) step 18586/76294 | train loss 3.420514 | norm 7.3792 | lr 1.27e-04 | (3854.26 ms | 136028 tok/s) step 18587/76294 | train loss 3.448290 | norm 7.6576 | lr 1.27e-04 | (3799.27 ms | 137997 tok/s) step 18588/76294 | train loss 3.441695 | norm 10.2840 | lr 1.27e-04 | (3827.47 ms | 136980 tok/s) step 18589/76294 | train loss 3.440208 | norm 10.7425 | lr 1.27e-04 | (3803.19 ms | 137855 tok/s) step 18590/76294 | train loss 3.477388 | norm 14.6230 | lr 1.27e-04 | (3808.88 ms | 137649 tok/s) step 18591/76294 | train loss 3.410036 | norm 18.6770 | lr 1.27e-04 | (3797.90 ms | 138047 tok/s) step 18592/76294 | train loss 3.544917 | norm 22.3934 | lr 1.27e-04 | (3810.46 ms | 137592 tok/s) step 18593/76294 | train loss 3.441679 | norm 19.2463 | lr 1.27e-04 | (3801.14 ms | 137929 tok/s) step 18594/76294 | train loss 3.472494 | norm 15.5819 | lr 1.27e-04 | (3805.11 ms | 137785 tok/s) step 18595/76294 | train loss 3.490555 | norm 14.7684 | lr 1.27e-04 | (3824.49 ms | 137087 tok/s) step 18596/76294 | train loss 3.448621 | norm 18.8359 | lr 1.27e-04 | (3801.84 ms | 137904 tok/s) step 18597/76294 | train loss 3.438392 | norm 23.1497 | lr 1.27e-04 | (3808.14 ms | 137676 tok/s) step 18598/76294 | train loss 3.470212 | norm 32.4750 | lr 1.27e-04 | (3801.43 ms | 137919 tok/s) step 18599/76294 | train loss 3.566477 | norm 26.3109 | lr 1.27e-04 | (3801.58 ms | 137913 tok/s) step 18600/76294 | train loss 3.516757 | norm 11.6928 | lr 1.27e-04 | (3825.01 ms | 137068 tok/s) step 18601/76294 | train loss 3.517428 | norm 17.7053 | lr 
1.27e-04 | (3800.77 ms | 137943 tok/s) step 18602/76294 | train loss 3.482691 | norm 10.2822 | lr 1.27e-04 | (3830.47 ms | 136873 tok/s) step 18603/76294 | train loss 3.457606 | norm 9.4647 | lr 1.27e-04 | (3938.61 ms | 133115 tok/s) step 18604/76294 | train loss 3.461526 | norm 15.3911 | lr 1.27e-04 | (3796.39 ms | 138102 tok/s) step 18605/76294 | train loss 3.479559 | norm 15.2643 | lr 1.27e-04 | (3834.08 ms | 136744 tok/s) step 18606/76294 | train loss 3.467972 | norm 10.1900 | lr 1.27e-04 | (3800.62 ms | 137948 tok/s) step 18607/76294 | train loss 3.479083 | norm 10.9326 | lr 1.27e-04 | (4219.83 ms | 124244 tok/s) step 18608/76294 | train loss 3.516173 | norm 8.6133 | lr 1.27e-04 | (3820.02 ms | 137247 tok/s) step 18609/76294 | train loss 3.421933 | norm 13.3099 | lr 1.27e-04 | (3809.49 ms | 137627 tok/s) step 18610/76294 | train loss 3.479496 | norm 13.6765 | lr 1.27e-04 | (3796.93 ms | 138082 tok/s) step 18611/76294 | train loss 3.507865 | norm 12.3703 | lr 1.27e-04 | (3827.62 ms | 136975 tok/s) step 18612/76294 | train loss 3.445031 | norm 9.8741 | lr 1.27e-04 | (3800.95 ms | 137936 tok/s) step 18613/76294 | train loss 3.472893 | norm 8.6228 | lr 1.27e-04 | (3847.63 ms | 136263 tok/s) step 18614/76294 | train loss 3.527111 | norm 8.6759 | lr 1.27e-04 | (3825.18 ms | 137062 tok/s) step 18615/76294 | train loss 3.451532 | norm 9.1831 | lr 1.27e-04 | (3824.82 ms | 137075 tok/s) step 18616/76294 | train loss 3.441621 | norm 6.1114 | lr 1.27e-04 | (3803.45 ms | 137845 tok/s) step 18617/76294 | train loss 3.411618 | norm 6.8756 | lr 1.27e-04 | (3901.31 ms | 134388 tok/s) step 18618/76294 | train loss 3.448543 | norm 9.2201 | lr 1.27e-04 | (3795.24 ms | 138143 tok/s) step 18619/76294 | train loss 3.443489 | norm 9.9572 | lr 1.27e-04 | (3847.13 ms | 136280 tok/s) step 18620/76294 | train loss 3.406666 | norm 5.5356 | lr 1.27e-04 | (3808.53 ms | 137662 tok/s) step 18621/76294 | train loss 3.490089 | norm 5.8485 | lr 1.27e-04 | (3826.01 ms | 137033 tok/s) step 
18622/76294 | train loss 3.425500 | norm 7.8580 | lr 1.27e-04 | (3796.77 ms | 138088 tok/s) step 18623/76294 | train loss 3.441931 | norm 9.5858 | lr 1.27e-04 | (3851.89 ms | 136112 tok/s) step 18624/76294 | train loss 3.426016 | norm 5.0991 | lr 1.27e-04 | (3801.79 ms | 137906 tok/s) step 18625/76294 | train loss 3.410244 | norm 9.6991 | lr 1.27e-04 | (3851.70 ms | 136118 tok/s) step 18626/76294 | train loss 3.427254 | norm 5.4150 | lr 1.27e-04 | (3798.21 ms | 138036 tok/s) step 18627/76294 | train loss 3.453351 | norm 2.8026 | lr 1.27e-04 | (3880.72 ms | 135101 tok/s) step 18628/76294 | train loss 3.436529 | norm 5.7944 | lr 1.27e-04 | (3780.62 ms | 138678 tok/s) step 18629/76294 | train loss 3.458302 | norm 3.3984 | lr 1.27e-04 | (3796.91 ms | 138083 tok/s) step 18630/76294 | train loss 3.458154 | norm 2.6745 | lr 1.26e-04 | (3783.53 ms | 138571 tok/s) step 18631/76294 | train loss 3.450331 | norm 5.9832 | lr 1.26e-04 | (3801.91 ms | 137901 tok/s) step 18632/76294 | train loss 3.491630 | norm 6.6029 | lr 1.26e-04 | (3794.73 ms | 138162 tok/s) step 18633/76294 | train loss 3.430329 | norm 3.8042 | lr 1.26e-04 | (3794.81 ms | 138159 tok/s) step 18634/76294 | train loss 3.495298 | norm 6.8084 | lr 1.26e-04 | (3809.29 ms | 137634 tok/s) step 18635/76294 | train loss 3.498417 | norm 6.2077 | lr 1.26e-04 | (3792.51 ms | 138243 tok/s) step 18636/76294 | train loss 3.490727 | norm 10.8863 | lr 1.26e-04 | (3794.95 ms | 138154 tok/s) step 18637/76294 | train loss 3.470493 | norm 6.3681 | lr 1.26e-04 | (3817.34 ms | 137344 tok/s) step 18638/76294 | train loss 3.412895 | norm 2.9882 | lr 1.26e-04 | (3796.58 ms | 138095 tok/s) step 18639/76294 | train loss 3.453454 | norm 7.2927 | lr 1.26e-04 | (3799.91 ms | 137974 tok/s) step 18640/76294 | train loss 3.452613 | norm 3.5526 | lr 1.26e-04 | (3800.43 ms | 137955 tok/s) step 18641/76294 | train loss 3.378600 | norm 5.8444 | lr 1.26e-04 | (3796.81 ms | 138087 tok/s) step 18642/76294 | train loss 3.436166 | norm 3.7195 | lr 
1.26e-04 | (3799.62 ms | 137984 tok/s) step 18643/76294 | train loss 3.466137 | norm 3.3221 | lr 1.26e-04 | (3798.86 ms | 138012 tok/s) step 18644/76294 | train loss 3.522186 | norm 5.7073 | lr 1.26e-04 | (3816.23 ms | 137384 tok/s) step 18645/76294 | train loss 3.455106 | norm 6.3624 | lr 1.26e-04 | (3804.46 ms | 137809 tok/s) step 18646/76294 | train loss 3.401732 | norm 3.3992 | lr 1.26e-04 | (3796.50 ms | 138098 tok/s) step 18647/76294 | train loss 3.429703 | norm 5.1225 | lr 1.26e-04 | (3794.64 ms | 138166 tok/s) step 18648/76294 | train loss 3.438257 | norm 5.0812 | lr 1.26e-04 | (3831.71 ms | 136829 tok/s) step 18649/76294 | train loss 3.479615 | norm 5.9371 | lr 1.26e-04 | (3802.64 ms | 137875 tok/s) step 18650/76294 | train loss 3.475496 | norm 8.5002 | lr 1.26e-04 | (3799.97 ms | 137972 tok/s) step 18651/76294 | train loss 3.445006 | norm 7.8706 | lr 1.26e-04 | (3883.08 ms | 135018 tok/s) step 18652/76294 | train loss 3.640945 | norm 9.5917 | lr 1.26e-04 | (3799.10 ms | 138003 tok/s) step 18653/76294 | train loss 3.471797 | norm 9.1015 | lr 1.26e-04 | (3799.59 ms | 137985 tok/s) step 18654/76294 | train loss 3.567020 | norm 10.9433 | lr 1.26e-04 | (3818.82 ms | 137290 tok/s) step 18655/76294 | train loss 3.468931 | norm 5.6795 | lr 1.26e-04 | (3797.20 ms | 138072 tok/s) step 18656/76294 | train loss 3.439417 | norm 7.1527 | lr 1.26e-04 | (3799.64 ms | 137984 tok/s) step 18657/76294 | train loss 3.452025 | norm 5.6569 | lr 1.26e-04 | (3807.07 ms | 137714 tok/s) step 18658/76294 | train loss 3.470029 | norm 8.9249 | lr 1.26e-04 | (3801.84 ms | 137904 tok/s) step 18659/76294 | train loss 3.441041 | norm 4.5382 | lr 1.26e-04 | (3801.17 ms | 137928 tok/s) step 18660/76294 | train loss 3.428241 | norm 5.0229 | lr 1.26e-04 | (3804.00 ms | 137825 tok/s) step 18661/76294 | train loss 3.427102 | norm 8.1675 | lr 1.26e-04 | (3798.12 ms | 138039 tok/s) step 18662/76294 | train loss 3.427231 | norm 5.8737 | lr 1.26e-04 | (3798.54 ms | 138024 tok/s) step 18663/76294 | 
train loss 3.441354 | norm 11.1100 | lr 1.26e-04 | (3833.23 ms | 136774 tok/s) step 18664/76294 | train loss 3.493948 | norm 10.3980 | lr 1.26e-04 | (3798.33 ms | 138031 tok/s) step 18665/76294 | train loss 3.447412 | norm 9.8455 | lr 1.26e-04 | (3846.11 ms | 136317 tok/s) step 18666/76294 | train loss 3.402595 | norm 7.6128 | lr 1.26e-04 | (3799.03 ms | 138006 tok/s) step 18667/76294 | train loss 3.448527 | norm 13.8918 | lr 1.26e-04 | (3802.78 ms | 137870 tok/s) step 18668/76294 | train loss 3.478264 | norm 12.1965 | lr 1.26e-04 | (3817.05 ms | 137354 tok/s) step 18669/76294 | train loss 3.460437 | norm 12.7393 | lr 1.26e-04 | (3800.43 ms | 137955 tok/s) step 18670/76294 | train loss 3.564530 | norm 10.5009 | lr 1.26e-04 | (3807.52 ms | 137698 tok/s) step 18671/76294 | train loss 3.466008 | norm 8.6141 | lr 1.26e-04 | (3801.48 ms | 137917 tok/s) step 18672/76294 | train loss 3.395841 | norm 3.3825 | lr 1.26e-04 | (3806.79 ms | 137724 tok/s) step 18673/76294 | train loss 3.455817 | norm 8.5659 | lr 1.26e-04 | (3803.80 ms | 137833 tok/s) step 18674/76294 | train loss 3.460474 | norm 6.1285 | lr 1.26e-04 | (3804.12 ms | 137821 tok/s) step 18675/76294 | train loss 3.448111 | norm 4.6198 | lr 1.26e-04 | (3802.46 ms | 137881 tok/s) step 18676/76294 | train loss 3.549518 | norm 12.3463 | lr 1.26e-04 | (3840.93 ms | 136500 tok/s) step 18677/76294 | train loss 3.629285 | norm 6.5160 | lr 1.26e-04 | (3802.38 ms | 137884 tok/s) step 18678/76294 | train loss 3.448166 | norm 2.3697 | lr 1.26e-04 | (3877.63 ms | 135209 tok/s) step 18679/76294 | train loss 3.492422 | norm 3.4480 | lr 1.26e-04 | (3798.12 ms | 138039 tok/s) step 18680/76294 | train loss 3.383777 | norm 6.9570 | lr 1.26e-04 | (3824.60 ms | 137083 tok/s) step 18681/76294 | train loss 3.495793 | norm 3.3415 | lr 1.26e-04 | (3806.41 ms | 137738 tok/s) step 18682/76294 | train loss 3.432403 | norm 5.7204 | lr 1.26e-04 | (3799.66 ms | 137983 tok/s) step 18683/76294 | train loss 3.449014 | norm 9.1207 | lr 1.26e-04 | 
(3821.18 ms | 137206 tok/s) step 18684/76294 | train loss 3.449165 | norm 12.3396 | lr 1.26e-04 | (3805.30 ms | 137778 tok/s) step 18685/76294 | train loss 3.506064 | norm 6.0540 | lr 1.26e-04 | (3813.33 ms | 137488 tok/s) step 18686/76294 | train loss 3.414970 | norm 7.5172 | lr 1.26e-04 | (3799.71 ms | 137981 tok/s) step 18687/76294 | train loss 3.491468 | norm 8.3760 | lr 1.26e-04 | (3798.66 ms | 138019 tok/s) step 18688/76294 | train loss 3.497895 | norm 7.4423 | lr 1.26e-04 | (3834.37 ms | 136734 tok/s) step 18689/76294 | train loss 3.430784 | norm 5.5767 | lr 1.26e-04 | (3799.85 ms | 137976 tok/s) step 18690/76294 | train loss 3.461160 | norm 7.8096 | lr 1.26e-04 | (3805.45 ms | 137773 tok/s) step 18691/76294 | train loss 3.429986 | norm 7.3825 | lr 1.26e-04 | (3820.44 ms | 137232 tok/s) step 18692/76294 | train loss 3.429643 | norm 6.0343 | lr 1.26e-04 | (3801.28 ms | 137924 tok/s) step 18693/76294 | train loss 3.473703 | norm 3.9700 | lr 1.26e-04 | (3807.64 ms | 137694 tok/s) step 18694/76294 | train loss 3.436240 | norm 4.9710 | lr 1.26e-04 | (3802.43 ms | 137882 tok/s) step 18695/76294 | train loss 3.410511 | norm 55.8736 | lr 1.26e-04 | (3820.28 ms | 137238 tok/s) step 18696/76294 | train loss 3.431158 | norm 6.9450 | lr 1.26e-04 | (3806.49 ms | 137735 tok/s) step 18697/76294 | train loss 3.449196 | norm 7.4833 | lr 1.26e-04 | (3818.17 ms | 137314 tok/s) step 18698/76294 | train loss 3.458063 | norm 4.4047 | lr 1.26e-04 | (3807.12 ms | 137713 tok/s) step 18699/76294 | train loss 3.417198 | norm 3.5990 | lr 1.26e-04 | (3828.23 ms | 136953 tok/s) step 18700/76294 | train loss 3.443677 | norm 8.8976 | lr 1.26e-04 | (3809.25 ms | 137635 tok/s) step 18701/76294 | train loss 3.435703 | norm 8.5868 | lr 1.26e-04 | (3897.73 ms | 134511 tok/s) step 18702/76294 | train loss 3.650369 | norm 8.0304 | lr 1.26e-04 | (3804.63 ms | 137803 tok/s) step 18703/76294 | train loss 3.465226 | norm 6.2450 | lr 1.26e-04 | (3804.70 ms | 137800 tok/s) step 18704/76294 | train loss 
3.443470 | norm 12.2602 | lr 1.26e-04 | (3847.23 ms | 136277 tok/s) step 18705/76294 | train loss 3.416590 | norm 13.6842 | lr 1.25e-04 | (3866.89 ms | 135584 tok/s) step 18706/76294 | train loss 3.450228 | norm 4.2181 | lr 1.25e-04 | (3864.70 ms | 135661 tok/s) step 18707/76294 | train loss 3.405487 | norm 5.1205 | lr 1.25e-04 | (3896.39 ms | 134557 tok/s) step 18708/76294 | train loss 3.475786 | norm 5.2947 | lr 1.25e-04 | (3881.09 ms | 135088 tok/s) step 18709/76294 | train loss 3.607063 | norm 6.2030 | lr 1.25e-04 | (3783.09 ms | 138587 tok/s) step 18710/76294 | train loss 3.423189 | norm 14.2660 | lr 1.25e-04 | (3844.99 ms | 136356 tok/s) step 18711/76294 | train loss 3.421409 | norm 5.7469 | lr 1.25e-04 | (3890.91 ms | 134747 tok/s) step 18712/76294 | train loss 3.425357 | norm 6.2756 | lr 1.25e-04 | (3895.77 ms | 134579 tok/s) step 18713/76294 | train loss 3.483001 | norm 6.6697 | lr 1.25e-04 | (3778.21 ms | 138766 tok/s) step 18714/76294 | train loss 3.358942 | norm 4.1873 | lr 1.25e-04 | (10065.99 ms | 52085 tok/s) step 18715/76294 | train loss 3.395186 | norm 3.6993 | lr 1.25e-04 | (3856.17 ms | 135961 tok/s) step 18716/76294 | train loss 3.517993 | norm 3.6315 | lr 1.25e-04 | (3824.34 ms | 137092 tok/s) step 18717/76294 | train loss 3.399471 | norm 4.0456 | lr 1.25e-04 | (3776.19 ms | 138841 tok/s) step 18718/76294 | train loss 3.463110 | norm 2.3630 | lr 1.25e-04 | (3774.25 ms | 138912 tok/s) step 18719/76294 | train loss 3.480013 | norm 4.8343 | lr 1.25e-04 | (3866.70 ms | 135591 tok/s) step 18720/76294 | train loss 3.407743 | norm 3.5591 | lr 1.25e-04 | (3777.70 ms | 138785 tok/s) step 18721/76294 | train loss 3.460245 | norm 6.7603 | lr 1.25e-04 | (3832.48 ms | 136801 tok/s) step 18722/76294 | train loss 3.401824 | norm 8.7683 | lr 1.25e-04 | (3781.05 ms | 138662 tok/s) step 18723/76294 | train loss 3.464968 | norm 8.1853 | lr 1.25e-04 | (3785.20 ms | 138510 tok/s) step 18724/76294 | train loss 3.398914 | norm 4.9579 | lr 1.25e-04 | (3799.86 ms | 
137976 tok/s) step 18725/76294 | train loss 3.418388 | norm 5.0936 | lr 1.25e-04 | (3789.59 ms | 138350 tok/s) step 18726/76294 | train loss 3.491812 | norm 5.6102 | lr 1.25e-04 | (3798.83 ms | 138013 tok/s) step 18727/76294 | train loss 3.434011 | norm 6.9222 | lr 1.25e-04 | (3795.20 ms | 138145 tok/s) step 18728/76294 | train loss 3.404171 | norm 5.2617 | lr 1.25e-04 | (3805.59 ms | 137768 tok/s) step 18729/76294 | train loss 3.416443 | norm 10.4114 | lr 1.25e-04 | (3799.05 ms | 138005 tok/s) step 18730/76294 | train loss 3.410711 | norm 6.0714 | lr 1.25e-04 | (3801.88 ms | 137902 tok/s) step 18731/76294 | train loss 3.485486 | norm 9.5453 | lr 1.25e-04 | (3812.74 ms | 137509 tok/s) step 18732/76294 | train loss 3.489886 | norm 5.1158 | lr 1.25e-04 | (3804.07 ms | 137823 tok/s) step 18733/76294 | train loss 3.387481 | norm 8.0638 | lr 1.25e-04 | (3824.06 ms | 137102 tok/s) step 18734/76294 | train loss 3.442624 | norm 12.9878 | lr 1.25e-04 | (3804.47 ms | 137809 tok/s) step 18735/76294 | train loss 3.431001 | norm 5.3887 | lr 1.25e-04 | (3810.35 ms | 137596 tok/s) step 18736/76294 | train loss 3.482734 | norm 9.7706 | lr 1.25e-04 | (3821.93 ms | 137179 tok/s) step 18737/76294 | train loss 3.363358 | norm 7.1909 | lr 1.25e-04 | (3807.49 ms | 137699 tok/s) step 18738/76294 | train loss 3.449971 | norm 7.5152 | lr 1.25e-04 | (3810.22 ms | 137601 tok/s) step 18739/76294 | train loss 3.439779 | norm 6.3250 | lr 1.25e-04 | (3806.19 ms | 137746 tok/s) step 18740/76294 | train loss 3.445140 | norm 9.7279 | lr 1.25e-04 | (3821.44 ms | 137197 tok/s) step 18741/76294 | train loss 3.418033 | norm 8.6946 | lr 1.25e-04 | (3966.50 ms | 132179 tok/s) step 18742/76294 | train loss 3.673971 | norm 9.1607 | lr 1.25e-04 | (3806.37 ms | 137739 tok/s) step 18743/76294 | train loss 3.468488 | norm 8.6610 | lr 1.25e-04 | (3812.16 ms | 137530 tok/s) step 18744/76294 | train loss 3.456302 | norm 4.1293 | lr 1.25e-04 | (3824.91 ms | 137072 tok/s) step 18745/76294 | train loss 3.622840 | 
norm 11.9891 | lr 1.25e-04 | (3810.58 ms | 137587 tok/s) step 18746/76294 | train loss 3.484226 | norm 6.2491 | lr 1.25e-04 | (3836.19 ms | 136669 tok/s) step 18747/76294 | train loss 3.456197 | norm 4.2853 | lr 1.25e-04 | (3828.45 ms | 136945 tok/s) step 18748/76294 | train loss 3.471072 | norm 5.6664 | lr 1.25e-04 | (3809.18 ms | 137638 tok/s) step 18749/76294 | train loss 3.470550 | norm 7.0681 | lr 1.25e-04 | (3812.63 ms | 137514 tok/s) step 18750/76294 | train loss 3.478724 | norm 9.4249 | lr 1.25e-04 | (3813.25 ms | 137491 tok/s) val loss: 3.446805 saving model checkpoint to ./results/gpt2-124M-gqa/step_18750.pth step 18751/76294 | train loss 3.549886 | norm 10.7326 | lr 1.25e-04 | (3898.09 ms | 134499 tok/s) step 18752/76294 | train loss 3.452487 | norm 13.9317 | lr 1.25e-04 | (3821.96 ms | 137178 tok/s) step 18753/76294 | train loss 3.426613 | norm 11.1538 | lr 1.25e-04 | (3806.23 ms | 137745 tok/s) step 18754/76294 | train loss 3.475046 | norm 5.3158 | lr 1.25e-04 | (3841.03 ms | 136497 tok/s) step 18755/76294 | train loss 3.521474 | norm 2.9631 | lr 1.25e-04 | (3812.22 ms | 137528 tok/s) step 18756/76294 | train loss 3.487779 | norm 8.0458 | lr 1.25e-04 | (3812.98 ms | 137501 tok/s) step 18757/76294 | train loss 3.383456 | norm 17.8202 | lr 1.25e-04 | (3817.80 ms | 137327 tok/s) step 18758/76294 | train loss 3.486823 | norm 10.4096 | lr 1.25e-04 | (3808.57 ms | 137660 tok/s) step 18759/76294 | train loss 3.457400 | norm 10.1905 | lr 1.25e-04 | (3841.46 ms | 136481 tok/s) step 18760/76294 | train loss 3.398368 | norm 13.1110 | lr 1.25e-04 | (3811.07 ms | 137570 tok/s) step 18761/76294 | train loss 3.439334 | norm 10.0648 | lr 1.25e-04 | (3834.80 ms | 136719 tok/s) step 18762/76294 | train loss 3.462974 | norm 7.0286 | lr 1.25e-04 | (3803.21 ms | 137854 tok/s) step 18763/76294 | train loss 3.448306 | norm 3.5636 | lr 1.25e-04 | (3810.89 ms | 137576 tok/s) step 18764/76294 | train loss 3.442001 | norm 4.8295 | lr 1.25e-04 | (3832.92 ms | 136786 tok/s) step 
18765/76294 | train loss 3.400114 | norm 6.9809 | lr 1.25e-04 | (3808.09 ms | 137677 tok/s) step 18766/76294 | train loss 3.457040 | norm 8.6331 | lr 1.25e-04 | (3813.75 ms | 137473 tok/s) step 18767/76294 | train loss 3.508343 | norm 4.8218 | lr 1.25e-04 | (3805.31 ms | 137778 tok/s) step 18768/76294 | train loss 3.366961 | norm 4.7425 | lr 1.25e-04 | (3821.82 ms | 137183 tok/s) step 18769/76294 | train loss 3.456151 | norm 4.9326 | lr 1.25e-04 | (3860.25 ms | 135817 tok/s) step 18770/76294 | train loss 3.419968 | norm 5.7145 | lr 1.25e-04 | (3803.36 ms | 137849 tok/s) step 18771/76294 | train loss 3.427839 | norm 7.3127 | lr 1.25e-04 | (3809.99 ms | 137609 tok/s) step 18772/76294 | train loss 3.466415 | norm 3.8073 | lr 1.25e-04 | (3802.63 ms | 137875 tok/s) step 18773/76294 | train loss 3.438547 | norm 3.0816 | lr 1.25e-04 | (3808.77 ms | 137653 tok/s) step 18774/76294 | train loss 3.481514 | norm 7.0607 | lr 1.25e-04 | (3826.59 ms | 137012 tok/s) step 18775/76294 | train loss 3.442616 | norm 9.5535 | lr 1.25e-04 | (3801.10 ms | 137931 tok/s) step 18776/76294 | train loss 3.448562 | norm 9.5188 | lr 1.25e-04 | (3805.79 ms | 137761 tok/s) step 18777/76294 | train loss 3.480443 | norm 9.7690 | lr 1.25e-04 | (3804.42 ms | 137810 tok/s) step 18778/76294 | train loss 3.509100 | norm 7.3789 | lr 1.25e-04 | (3821.51 ms | 137194 tok/s) step 18779/76294 | train loss 3.472588 | norm 7.1986 | lr 1.25e-04 | (3803.64 ms | 137839 tok/s) step 18780/76294 | train loss 3.481381 | norm 8.2293 | lr 1.25e-04 | (3807.13 ms | 137712 tok/s) step 18781/76294 | train loss 3.482795 | norm 4.1170 | lr 1.25e-04 | (3803.32 ms | 137850 tok/s) step 18782/76294 | train loss 3.419027 | norm 5.0838 | lr 1.25e-04 | (3842.47 ms | 136446 tok/s) step 18783/76294 | train loss 3.408660 | norm 5.8329 | lr 1.25e-04 | (3803.71 ms | 137836 tok/s) step 18784/76294 | train loss 3.388801 | norm 8.8098 | lr 1.25e-04 | (3803.29 ms | 137851 tok/s) step 18785/76294 | train loss 3.379708 | norm 3.6906 | lr 
1.25e-04 | (3802.93 ms | 137864 tok/s) step 18786/76294 | train loss 3.470092 | norm 6.1831 | lr 1.25e-04 | (3809.96 ms | 137610 tok/s) step 18787/76294 | train loss 3.439984 | norm 5.9657 | lr 1.24e-04 | (3803.19 ms | 137855 tok/s) step 18788/76294 | train loss 3.432215 | norm 5.7362 | lr 1.24e-04 | (3816.98 ms | 137357 tok/s) step 18789/76294 | train loss 3.415484 | norm 8.5796 | lr 1.24e-04 | (3808.74 ms | 137654 tok/s) step 18790/76294 | train loss 3.454086 | norm 8.4037 | lr 1.24e-04 | (3800.98 ms | 137935 tok/s) step 18791/76294 | train loss 3.392389 | norm 9.5948 | lr 1.24e-04 | (3823.83 ms | 137111 tok/s) step 18792/76294 | train loss 3.444932 | norm 8.1146 | lr 1.24e-04 | (3799.80 ms | 137978 tok/s) step 18793/76294 | train loss 3.400651 | norm 8.1140 | lr 1.24e-04 | (3802.49 ms | 137880 tok/s) step 18794/76294 | train loss 3.489054 | norm 6.6443 | lr 1.24e-04 | (3888.14 ms | 134843 tok/s) step 18795/76294 | train loss 3.478865 | norm 4.4914 | lr 1.24e-04 | (3892.31 ms | 134698 tok/s) step 18796/76294 | train loss 3.438417 | norm 5.7586 | lr 1.24e-04 | (3839.99 ms | 136534 tok/s) step 18797/76294 | train loss 3.433392 | norm 3.4184 | lr 1.24e-04 | (3821.47 ms | 137195 tok/s) step 18798/76294 | train loss 3.490939 | norm 10.6495 | lr 1.24e-04 | (4097.05 ms | 127967 tok/s) step 18799/76294 | train loss 3.432416 | norm 5.7439 | lr 1.24e-04 | (3819.50 ms | 137266 tok/s) step 18800/76294 | train loss 3.471501 | norm 4.8889 | lr 1.24e-04 | (3799.48 ms | 137989 tok/s) step 18801/76294 | train loss 3.379014 | norm 4.6282 | lr 1.24e-04 | (3836.41 ms | 136661 tok/s) step 18802/76294 | train loss 3.485507 | norm 5.6280 | lr 1.24e-04 | (3806.63 ms | 137730 tok/s) step 18803/76294 | train loss 3.392020 | norm 3.4418 | lr 1.24e-04 | (3805.10 ms | 137786 tok/s) step 18804/76294 | train loss 3.490241 | norm 6.3735 | lr 1.24e-04 | (3803.26 ms | 137852 tok/s) step 18805/76294 | train loss 3.390587 | norm 4.7524 | lr 1.24e-04 | (3806.88 ms | 137721 tok/s) step 18806/76294 | 
train loss 3.489657 | norm 6.0818 | lr 1.24e-04 | (3802.40 ms | 137883 tok/s) step 18807/76294 | train loss 3.394076 | norm 9.1845 | lr 1.24e-04 | (3804.40 ms | 137811 tok/s) step 18808/76294 | train loss 3.482460 | norm 9.4780 | lr 1.24e-04 | (3830.66 ms | 136866 tok/s) step 18809/76294 | train loss 3.411457 | norm 9.1393 | lr 1.24e-04 | (3805.96 ms | 137754 tok/s) step 18810/76294 | train loss 3.484153 | norm 6.4885 | lr 1.24e-04 | (3802.07 ms | 137895 tok/s) step 18811/76294 | train loss 3.452937 | norm 10.7635 | lr 1.24e-04 | (3807.88 ms | 137685 tok/s) step 18812/76294 | train loss 3.413542 | norm 5.1442 | lr 1.24e-04 | (3801.86 ms | 137903 tok/s) step 18813/76294 | train loss 3.446857 | norm 6.2797 | lr 1.24e-04 | (3808.05 ms | 137679 tok/s) step 18814/76294 | train loss 3.392394 | norm 9.9122 | lr 1.24e-04 | (3868.83 ms | 135516 tok/s) step 18815/76294 | train loss 3.460364 | norm 4.5932 | lr 1.24e-04 | (3819.46 ms | 137268 tok/s) step 18816/76294 | train loss 3.459875 | norm 9.9416 | lr 1.24e-04 | (3802.12 ms | 137894 tok/s) step 18817/76294 | train loss 3.487424 | norm 4.3252 | lr 1.24e-04 | (3802.06 ms | 137896 tok/s) step 18818/76294 | train loss 3.387446 | norm 4.0548 | lr 1.24e-04 | (3801.39 ms | 137920 tok/s) step 18819/76294 | train loss 3.534895 | norm 9.7609 | lr 1.24e-04 | (3896.58 ms | 134551 tok/s) step 18820/76294 | train loss 3.404611 | norm 4.5638 | lr 1.24e-04 | (3799.96 ms | 137972 tok/s) step 18821/76294 | train loss 3.480013 | norm 3.3986 | lr 1.24e-04 | (3801.06 ms | 137932 tok/s) step 18822/76294 | train loss 3.412898 | norm 6.8542 | lr 1.24e-04 | (3824.08 ms | 137102 tok/s) step 18823/76294 | train loss 3.530184 | norm 3.3598 | lr 1.24e-04 | (3798.95 ms | 138009 tok/s) step 18824/76294 | train loss 3.407986 | norm 4.0496 | lr 1.24e-04 | (3802.16 ms | 137892 tok/s) step 18825/76294 | train loss 3.489246 | norm 6.3374 | lr 1.24e-04 | (3818.44 ms | 137304 tok/s) step 18826/76294 | train loss 3.327395 | norm 7.5812 | lr 1.24e-04 | (3802.51 
ms | 137879 tok/s) step 18827/76294 | train loss 3.457434 | norm 11.2439 | lr 1.24e-04 | (3794.45 ms | 138172 tok/s) step 18828/76294 | train loss 3.433510 | norm 5.4884 | lr 1.24e-04 | (3830.34 ms | 136878 tok/s) step 18829/76294 | train loss 3.562142 | norm 2.9465 | lr 1.24e-04 | (3800.69 ms | 137946 tok/s) step 18830/76294 | train loss 3.380667 | norm 7.1841 | lr 1.24e-04 | (3803.22 ms | 137854 tok/s) step 18831/76294 | train loss 3.460365 | norm 6.3020 | lr 1.24e-04 | (3818.07 ms | 137318 tok/s) step 18832/76294 | train loss 3.435478 | norm 6.2201 | lr 1.24e-04 | (3796.29 ms | 138105 tok/s) step 18833/76294 | train loss 3.445848 | norm 5.9343 | lr 1.24e-04 | (3806.96 ms | 137718 tok/s) step 18834/76294 | train loss 3.346719 | norm 2.6434 | lr 1.24e-04 | (3800.76 ms | 137943 tok/s) step 18835/76294 | train loss 3.447793 | norm 9.2960 | lr 1.24e-04 | (3806.39 ms | 137739 tok/s) step 18836/76294 | train loss 3.370682 | norm 8.6426 | lr 1.24e-04 | (3799.69 ms | 137982 tok/s) step 18837/76294 | train loss 3.443472 | norm 7.7266 | lr 1.24e-04 | (3806.04 ms | 137752 tok/s) step 18838/76294 | train loss 3.491784 | norm 9.9501 | lr 1.24e-04 | (3802.50 ms | 137880 tok/s) step 18839/76294 | train loss 3.410418 | norm 9.0941 | lr 1.24e-04 | (3805.18 ms | 137783 tok/s) step 18840/76294 | train loss 3.373484 | norm 10.6821 | lr 1.24e-04 | (3806.74 ms | 137726 tok/s) step 18841/76294 | train loss 3.484970 | norm 8.2375 | lr 1.24e-04 | (3799.29 ms | 137996 tok/s) step 18842/76294 | train loss 3.408622 | norm 8.0254 | lr 1.24e-04 | (3824.79 ms | 137076 tok/s) step 18843/76294 | train loss 3.467697 | norm 11.7293 | lr 1.24e-04 | (3799.23 ms | 137999 tok/s) step 18844/76294 | train loss 3.360060 | norm 11.3199 | lr 1.24e-04 | (4006.14 ms | 130871 tok/s) step 18845/76294 | train loss 3.454723 | norm 5.5820 | lr 1.24e-04 | (3801.29 ms | 137924 tok/s) step 18846/76294 | train loss 3.363459 | norm 3.6994 | lr 1.24e-04 | (3816.11 ms | 137388 tok/s) step 18847/76294 | train loss 
3.405493 | norm 4.1971 | lr 1.24e-04 | (3800.31 ms | 137959 tok/s) step 18848/76294 | train loss 3.399682 | norm 3.2546 | lr 1.24e-04 | (3842.99 ms | 136427 tok/s) step 18849/76294 | train loss 3.409184 | norm 5.8395 | lr 1.24e-04 | (3799.64 ms | 137984 tok/s) step 18850/76294 | train loss 3.379428 | norm 3.9173 | lr 1.24e-04 | (3802.97 ms | 137863 tok/s) step 18851/76294 | train loss 3.450382 | norm 4.7928 | lr 1.24e-04 | (3822.25 ms | 137167 tok/s) step 18852/76294 | train loss 3.381193 | norm 4.7634 | lr 1.24e-04 | (3959.86 ms | 132401 tok/s) step 18853/76294 | train loss 3.461138 | norm 9.9644 | lr 1.24e-04 | (3796.61 ms | 138094 tok/s) step 18854/76294 | train loss 3.483778 | norm 7.7440 | lr 1.24e-04 | (3826.06 ms | 137031 tok/s) step 18855/76294 | train loss 3.443307 | norm 3.2031 | lr 1.24e-04 | (3796.36 ms | 138103 tok/s) step 18856/76294 | train loss 3.417730 | norm 3.1283 | lr 1.24e-04 | (3805.33 ms | 137777 tok/s) step 18857/76294 | train loss 3.383686 | norm 9.4199 | lr 1.24e-04 | (3802.09 ms | 137895 tok/s) step 18858/76294 | train loss 3.429946 | norm 9.7379 | lr 1.24e-04 | (3833.74 ms | 136756 tok/s) step 18859/76294 | train loss 3.420223 | norm 4.6015 | lr 1.24e-04 | (3800.80 ms | 137941 tok/s) step 18860/76294 | train loss 3.492145 | norm 6.6501 | lr 1.24e-04 | (3822.29 ms | 137166 tok/s) step 18861/76294 | train loss 3.361550 | norm 7.6242 | lr 1.24e-04 | (3796.41 ms | 138101 tok/s) step 18862/76294 | train loss 3.405354 | norm 12.7729 | lr 1.24e-04 | (3801.00 ms | 137934 tok/s) step 18863/76294 | train loss 3.396770 | norm 7.9659 | lr 1.24e-04 | (3821.40 ms | 137198 tok/s) step 18864/76294 | train loss 3.345330 | norm 8.2517 | lr 1.24e-04 | (3797.86 ms | 138048 tok/s) step 18865/76294 | train loss 3.389473 | norm 5.5157 | lr 1.24e-04 | (3798.36 ms | 138030 tok/s) step 18866/76294 | train loss 3.443132 | norm 9.0943 | lr 1.24e-04 | (3831.46 ms | 136838 tok/s) step 18867/76294 | train loss 3.452773 | norm 6.9957 | lr 1.24e-04 | (3802.53 ms | 
137879 tok/s) step 18868/76294 | train loss 3.470103 | norm 2.6657 | lr 1.24e-04 | (3871.23 ms | 135432 tok/s) step 18869/76294 | train loss 3.484648 | norm 5.7647 | lr 1.24e-04 | (3869.21 ms | 135503 tok/s) step 18870/76294 | train loss 3.461357 | norm 6.6617 | lr 1.24e-04 | (3797.46 ms | 138063 tok/s) step 18871/76294 | train loss 3.413747 | norm 6.2455 | lr 1.24e-04 | (3802.16 ms | 137892 tok/s) step 18872/76294 | train loss 3.433992 | norm 3.8669 | lr 1.24e-04 | (3838.70 ms | 136580 tok/s) step 18873/76294 | train loss 3.409619 | norm 6.1870 | lr 1.24e-04 | (3806.35 ms | 137740 tok/s) step 18874/76294 | train loss 3.442817 | norm 7.0873 | lr 1.24e-04 | (3805.85 ms | 137759 tok/s) step 18875/76294 | train loss 3.472236 | norm 6.1414 | lr 1.24e-04 | (3827.67 ms | 136973 tok/s) step 18876/76294 | train loss 3.465928 | norm 7.0417 | lr 1.24e-04 | (3806.58 ms | 137732 tok/s) step 18877/76294 | train loss 3.449458 | norm 5.9383 | lr 1.24e-04 | (3807.21 ms | 137709 tok/s) step 18878/76294 | train loss 3.474925 | norm 4.1677 | lr 1.23e-04 | (3823.84 ms | 137110 tok/s) step 18879/76294 | train loss 3.388944 | norm 7.5051 | lr 1.23e-04 | (3810.03 ms | 137607 tok/s) step 18880/76294 | train loss 3.412725 | norm 3.7718 | lr 1.23e-04 | (3824.48 ms | 137087 tok/s) step 18881/76294 | train loss 3.401283 | norm 5.5866 | lr 1.23e-04 | (3808.41 ms | 137666 tok/s) step 18882/76294 | train loss 3.483432 | norm 7.6936 | lr 1.23e-04 | (3805.04 ms | 137788 tok/s) step 18883/76294 | train loss 3.340646 | norm 6.5540 | lr 1.23e-04 | (3825.63 ms | 137046 tok/s) step 18884/76294 | train loss 3.548418 | norm 5.1975 | lr 1.23e-04 | (3820.88 ms | 137216 tok/s) step 18885/76294 | train loss 3.350117 | norm 6.2783 | lr 1.23e-04 | (3821.52 ms | 137194 tok/s) step 18886/76294 | train loss 3.436603 | norm 10.0623 | lr 1.23e-04 | (3840.54 ms | 136514 tok/s) step 18887/76294 | train loss 3.331883 | norm 5.1551 | lr 1.23e-04 | (3807.92 ms | 137684 tok/s) step 18888/76294 | train loss 3.449090 | 
norm 3.4942 | lr 1.23e-04 | (3795.99 ms | 138116 tok/s) step 18889/76294 | train loss 3.304669 | norm 11.1136 | lr 1.23e-04 | (3804.56 ms | 137805 tok/s) step 18890/76294 | train loss 3.447614 | norm 4.5823 | lr 1.23e-04 | (3823.21 ms | 137133 tok/s) step 18891/76294 | train loss 3.392864 | norm 5.1853 | lr 1.23e-04 | (3803.67 ms | 137837 tok/s) step 18892/76294 | train loss 3.464669 | norm 9.3284 | lr 1.23e-04 | (3797.72 ms | 138053 tok/s) step 18893/76294 | train loss 3.396122 | norm 3.1051 | lr 1.23e-04 | (3832.44 ms | 136803 tok/s) step 18894/76294 | train loss 3.442013 | norm 2.0693 | lr 1.23e-04 | (3855.53 ms | 135983 tok/s) step 18895/76294 | train loss 3.392677 | norm 6.0982 | lr 1.23e-04 | (3802.09 ms | 137895 tok/s) step 18896/76294 | train loss 3.479592 | norm 8.3878 | lr 1.23e-04 | (3804.13 ms | 137821 tok/s) step 18897/76294 | train loss 3.386935 | norm 7.4776 | lr 1.23e-04 | (3817.68 ms | 137332 tok/s) step 18898/76294 | train loss 3.420841 | norm 6.1558 | lr 1.23e-04 | (3799.49 ms | 137989 tok/s) step 18899/76294 | train loss 3.405395 | norm 3.6597 | lr 1.23e-04 | (3819.50 ms | 137266 tok/s) step 18900/76294 | train loss 3.476710 | norm 6.3462 | lr 1.23e-04 | (3801.47 ms | 137917 tok/s) step 18901/76294 | train loss 3.361345 | norm 6.3431 | lr 1.23e-04 | (3805.72 ms | 137763 tok/s) step 18902/76294 | train loss 3.492414 | norm 4.8406 | lr 1.23e-04 | (6017.79 ms | 87123 tok/s) step 18903/76294 | train loss 3.388756 | norm 6.6629 | lr 1.23e-04 | (3802.46 ms | 137881 tok/s) step 18904/76294 | train loss 3.416538 | norm 10.5771 | lr 1.23e-04 | (3830.18 ms | 136883 tok/s) step 18905/76294 | train loss 3.447292 | norm 7.5501 | lr 1.23e-04 | (3794.68 ms | 138164 tok/s) step 18906/76294 | train loss 3.478319 | norm 8.2695 | lr 1.23e-04 | (3799.46 ms | 137990 tok/s) step 18907/76294 | train loss 3.408661 | norm 4.8665 | lr 1.23e-04 | (3817.65 ms | 137333 tok/s) step 18908/76294 | train loss 3.445049 | norm 4.0663 | lr 1.23e-04 | (3797.09 ms | 138076 tok/s) 
step 18909/76294 | train loss 3.362276 | norm 7.2088 | lr 1.23e-04 | (3815.69 ms | 137403 tok/s) step 18910/76294 | train loss 3.414819 | norm 6.1069 | lr 1.23e-04 | (3800.88 ms | 137939 tok/s) step 18911/76294 | train loss 3.388226 | norm 4.9493 | lr 1.23e-04 | (3799.43 ms | 137991 tok/s) step 18912/76294 | train loss 3.450935 | norm 3.8679 | lr 1.23e-04 | (3796.08 ms | 138113 tok/s) step 18913/76294 | train loss 3.558931 | norm 5.6634 | lr 1.23e-04 | (3817.34 ms | 137344 tok/s) step 18914/76294 | train loss 3.520497 | norm 7.3270 | lr 1.23e-04 | (3824.11 ms | 137101 tok/s) step 18915/76294 | train loss 3.429980 | norm 8.2708 | lr 1.23e-04 | (3798.51 ms | 138025 tok/s) step 18916/76294 | train loss 3.441831 | norm 7.8553 | lr 1.23e-04 | (3827.31 ms | 136986 tok/s) step 18917/76294 | train loss 3.416285 | norm 6.3068 | lr 1.23e-04 | (3825.73 ms | 137042 tok/s) step 18918/76294 | train loss 3.426497 | norm 7.6364 | lr 1.23e-04 | (3880.07 ms | 135123 tok/s) step 18919/76294 | train loss 3.415968 | norm 4.2160 | lr 1.23e-04 | (3794.56 ms | 138168 tok/s) step 18920/76294 | train loss 3.411405 | norm 3.2448 | lr 1.23e-04 | (3825.25 ms | 137060 tok/s) step 18921/76294 | train loss 3.380602 | norm 5.8803 | lr 1.23e-04 | (3819.49 ms | 137267 tok/s) step 18922/76294 | train loss 3.417660 | norm 9.7042 | lr 1.23e-04 | (3839.69 ms | 136544 tok/s) step 18923/76294 | train loss 3.411838 | norm 10.3603 | lr 1.23e-04 | (3794.11 ms | 138185 tok/s) step 18924/76294 | train loss 3.407843 | norm 6.1256 | lr 1.23e-04 | (3831.65 ms | 136831 tok/s) step 18925/76294 | train loss 3.420735 | norm 6.6355 | lr 1.23e-04 | (3798.24 ms | 138035 tok/s) step 18926/76294 | train loss 3.449628 | norm 6.2677 | lr 1.23e-04 | (3801.04 ms | 137933 tok/s) step 18927/76294 | train loss 3.422249 | norm 6.6881 | lr 1.23e-04 | (3814.87 ms | 137433 tok/s) step 18928/76294 | train loss 3.450280 | norm 2.9937 | lr 1.23e-04 | (3801.01 ms | 137934 tok/s) step 18929/76294 | train loss 3.383377 | norm 11.4692 | lr 
1.23e-04 | (3796.45 ms | 138099 tok/s) step 18930/76294 | train loss 3.450073 | norm 9.6890 | lr 1.23e-04 | (3819.89 ms | 137252 tok/s) step 18931/76294 | train loss 3.468041 | norm 4.5603 | lr 1.23e-04 | (3795.35 ms | 138140 tok/s) step 18932/76294 | train loss 3.402087 | norm 7.2800 | lr 1.23e-04 | (3798.35 ms | 138030 tok/s) step 18933/76294 | train loss 3.417501 | norm 11.7425 | lr 1.23e-04 | (3814.39 ms | 137450 tok/s) step 18934/76294 | train loss 3.529946 | norm 10.2417 | lr 1.23e-04 | (7316.73 ms | 71656 tok/s) step 18935/76294 | train loss 3.454031 | norm 14.9293 | lr 1.23e-04 | (3783.33 ms | 138578 tok/s) step 18936/76294 | train loss 3.429417 | norm 5.8874 | lr 1.23e-04 | (3811.86 ms | 137541 tok/s) step 18937/76294 | train loss 3.470364 | norm 5.4818 | lr 1.23e-04 | (3787.36 ms | 138431 tok/s) step 18938/76294 | train loss 3.560617 | norm 8.0268 | lr 1.23e-04 | (3790.83 ms | 138304 tok/s) step 18939/76294 | train loss 3.418783 | norm 9.2498 | lr 1.23e-04 | (3792.19 ms | 138255 tok/s) step 18940/76294 | train loss 3.455393 | norm 6.3533 | lr 1.23e-04 | (3827.42 ms | 136982 tok/s) step 18941/76294 | train loss 3.417081 | norm 8.1547 | lr 1.23e-04 | (3872.47 ms | 135389 tok/s) step 18942/76294 | train loss 3.429888 | norm 15.9444 | lr 1.23e-04 | (3878.03 ms | 135195 tok/s) step 18943/76294 | train loss 3.357865 | norm 13.7677 | lr 1.23e-04 | (3788.87 ms | 138376 tok/s) step 18944/76294 | train loss 3.442870 | norm 9.4174 | lr 1.23e-04 | (3797.96 ms | 138044 tok/s) step 18945/76294 | train loss 3.384516 | norm 5.7455 | lr 1.23e-04 | (3809.73 ms | 137618 tok/s) step 18946/76294 | train loss 3.368150 | norm 3.5728 | lr 1.23e-04 | (3797.13 ms | 138075 tok/s) step 18947/76294 | train loss 3.479076 | norm 4.6382 | lr 1.23e-04 | (3812.09 ms | 137533 tok/s) step 18948/76294 | train loss 3.453638 | norm 4.2719 | lr 1.23e-04 | (3797.37 ms | 138066 tok/s) step 18949/76294 | train loss 3.430603 | norm 10.6181 | lr 1.23e-04 | (3803.81 ms | 137832 tok/s) step 
18950/76294 | train loss 3.483937 | norm 7.2853 | lr 1.23e-04 | (3795.75 ms | 138125 tok/s) step 18951/76294 | train loss 3.446566 | norm 9.9943 | lr 1.23e-04 | (3799.05 ms | 138005 tok/s) step 18952/76294 | train loss 3.463298 | norm 7.0099 | lr 1.23e-04 | (3825.95 ms | 137035 tok/s) step 18953/76294 | train loss 3.462752 | norm 6.4934 | lr 1.23e-04 | (3789.72 ms | 138345 tok/s) step 18954/76294 | train loss 3.386006 | norm 7.3687 | lr 1.23e-04 | (3798.02 ms | 138043 tok/s) step 18955/76294 | train loss 3.360557 | norm 8.8656 | lr 1.23e-04 | (3815.20 ms | 137421 tok/s) step 18956/76294 | train loss 3.449328 | norm 6.7459 | lr 1.23e-04 | (3823.49 ms | 137123 tok/s) step 18957/76294 | train loss 3.501275 | norm 11.2821 | lr 1.23e-04 | (3792.34 ms | 138249 tok/s) step 18958/76294 | train loss 3.472919 | norm 6.6888 | lr 1.23e-04 | (3812.22 ms | 137528 tok/s) step 18959/76294 | train loss 3.482974 | norm 6.0136 | lr 1.23e-04 | (3791.37 ms | 138285 tok/s) step 18960/76294 | train loss 3.393798 | norm 3.6898 | lr 1.23e-04 | (3790.87 ms | 138303 tok/s) step 18961/76294 | train loss 3.390227 | norm 10.1235 | lr 1.23e-04 | (3815.66 ms | 137404 tok/s) step 18962/76294 | train loss 3.532701 | norm 8.5273 | lr 1.23e-04 | (3830.56 ms | 136870 tok/s) step 18963/76294 | train loss 3.461976 | norm 7.4364 | lr 1.23e-04 | (3799.08 ms | 138004 tok/s) step 18964/76294 | train loss 3.441548 | norm 9.4908 | lr 1.23e-04 | (3815.65 ms | 137405 tok/s) step 18965/76294 | train loss 3.452299 | norm 11.5135 | lr 1.23e-04 | (3799.87 ms | 137975 tok/s) step 18966/76294 | train loss 3.404069 | norm 9.8263 | lr 1.23e-04 | (3804.39 ms | 137811 tok/s) step 18967/76294 | train loss 3.435201 | norm 9.4145 | lr 1.23e-04 | (3879.91 ms | 135129 tok/s) step 18968/76294 | train loss 3.383863 | norm 7.2976 | lr 1.23e-04 | (3855.00 ms | 136002 tok/s) step 18969/76294 | train loss 3.439570 | norm 6.3094 | lr 1.23e-04 | (3805.09 ms | 137786 tok/s) step 18970/76294 | train loss 3.408658 | norm 5.3633 | lr 
1.23e-04 | (3830.77 ms | 136862 tok/s) step 18971/76294 | train loss 3.461150 | norm 5.6797 | lr 1.23e-04 | (3812.40 ms | 137522 tok/s) step 18972/76294 | train loss 3.380975 | norm 5.8439 | lr 1.23e-04 | (3803.11 ms | 137858 tok/s) step 18973/76294 | train loss 3.428034 | norm 4.1174 | lr 1.23e-04 | (3833.25 ms | 136774 tok/s) step 18974/76294 | train loss 3.438546 | norm 6.4488 | lr 1.23e-04 | (3803.65 ms | 137838 tok/s) step 18975/76294 | train loss 3.429830 | norm 14.2756 | lr 1.23e-04 | (3805.97 ms | 137754 tok/s) step 18976/76294 | train loss 3.415578 | norm 5.6876 | lr 1.23e-04 | (3853.31 ms | 136062 tok/s) step 18977/76294 | train loss 3.499116 | norm 5.2664 | lr 1.23e-04 | (3805.07 ms | 137787 tok/s) step 18978/76294 | train loss 3.329111 | norm 7.6734 | lr 1.23e-04 | (3800.23 ms | 137962 tok/s) step 18979/76294 | train loss 3.448819 | norm 8.3225 | lr 1.23e-04 | (3804.00 ms | 137826 tok/s) step 18980/76294 | train loss 3.380109 | norm 10.0443 | lr 1.23e-04 | (3802.21 ms | 137890 tok/s) step 18981/76294 | train loss 3.403739 | norm 7.3458 | lr 1.23e-04 | (3796.78 ms | 138088 tok/s) step 18982/76294 | train loss 3.355591 | norm 9.6816 | lr 1.23e-04 | (3808.02 ms | 137680 tok/s) step 18983/76294 | train loss 3.473346 | norm 9.8987 | lr 1.23e-04 | (3798.61 ms | 138021 tok/s) step 18984/76294 | train loss 3.435462 | norm 6.6206 | lr 1.22e-04 | (3802.18 ms | 137891 tok/s) step 18985/76294 | train loss 3.447235 | norm 7.2917 | lr 1.22e-04 | (3821.55 ms | 137192 tok/s) step 18986/76294 | train loss 3.424120 | norm 11.4360 | lr 1.22e-04 | (3802.31 ms | 137887 tok/s) step 18987/76294 | train loss 3.428820 | norm 6.1529 | lr 1.22e-04 | (3799.74 ms | 137980 tok/s) step 18988/76294 | train loss 3.476941 | norm 10.4058 | lr 1.22e-04 | (3801.03 ms | 137933 tok/s) step 18989/76294 | train loss 3.540612 | norm 7.9448 | lr 1.22e-04 | (4050.17 ms | 129448 tok/s) step 18990/76294 | train loss 3.500395 | norm 9.3365 | lr 1.22e-04 | (3823.65 ms | 137117 tok/s) step 18991/76294 
| train loss 3.435794 | norm 8.4443 | lr 1.22e-04 | (3798.40 ms | 138029 tok/s) step 18992/76294 | train loss 3.446825 | norm 7.2446 | lr 1.22e-04 | (3898.74 ms | 134476 tok/s) step 18993/76294 | train loss 3.501510 | norm 9.8064 | lr 1.22e-04 | (3795.31 ms | 138141 tok/s) step 18994/76294 | train loss 3.554251 | norm 10.1337 | lr 1.22e-04 | (3806.31 ms | 137742 tok/s) step 18995/76294 | train loss 3.535639 | norm 11.9845 | lr 1.22e-04 | (3815.25 ms | 137419 tok/s) step 18996/76294 | train loss 3.525451 | norm 18.5418 | lr 1.22e-04 | (3805.44 ms | 137773 tok/s) step 18997/76294 | train loss 3.467652 | norm 6.7577 | lr 1.22e-04 | (3795.43 ms | 138137 tok/s) step 18998/76294 | train loss 3.434625 | norm 15.7868 | lr 1.22e-04 | (3827.97 ms | 136962 tok/s) step 18999/76294 | train loss 3.463090 | norm 5.7747 | lr 1.22e-04 | (3799.81 ms | 137978 tok/s) step 19000/76294 | train loss 3.492260 | norm 10.8857 | lr 1.22e-04 | (3842.70 ms | 136437 tok/s) val loss: 3.458557 saving model checkpoint to ./results/gpt2-124M-gqa/step_19000.pth step 19001/76294 | train loss 3.475042 | norm 5.8734 | lr 1.22e-04 | (3819.77 ms | 137257 tok/s) step 19002/76294 | train loss 3.472435 | norm 7.8178 | lr 1.22e-04 | (3796.56 ms | 138096 tok/s) step 19003/76294 | train loss 3.530266 | norm 17.0282 | lr 1.22e-04 | (3796.53 ms | 138096 tok/s) step 19004/76294 | train loss 3.520943 | norm 14.8423 | lr 1.22e-04 | (3850.31 ms | 136168 tok/s) step 19005/76294 | train loss 3.474910 | norm 8.6357 | lr 1.22e-04 | (3795.92 ms | 138119 tok/s) step 19006/76294 | train loss 3.541149 | norm 12.6093 | lr 1.22e-04 | (3806.69 ms | 137728 tok/s) step 19007/76294 | train loss 3.484397 | norm 16.3931 | lr 1.22e-04 | (3819.24 ms | 137276 tok/s) step 19008/76294 | train loss 3.478096 | norm 12.2886 | lr 1.22e-04 | (3800.26 ms | 137961 tok/s) step 19009/76294 | train loss 3.396095 | norm 12.9212 | lr 1.22e-04 | (3806.56 ms | 137733 tok/s) step 19010/76294 | train loss 3.501115 | norm 13.2792 | lr 1.22e-04 | 
(3828.00 ms | 136961 tok/s) step 19011/76294 | train loss 3.440801 | norm 18.2785 | lr 1.22e-04 | (3796.00 ms | 138116 tok/s) step 19012/76294 | train loss 3.461925 | norm 16.2334 | lr 1.22e-04 | (3827.91 ms | 136964 tok/s) step 19013/76294 | train loss 3.494408 | norm 14.9560 | lr 1.22e-04 | (3799.23 ms | 137999 tok/s) step 19014/76294 | train loss 3.470233 | norm 10.9890 | lr 1.22e-04 | (3812.15 ms | 137531 tok/s) step 19015/76294 | train loss 3.478562 | norm 11.5641 | lr 1.22e-04 | (3823.82 ms | 137111 tok/s) step 19016/76294 | train loss 3.479263 | norm 10.4401 | lr 1.22e-04 | (3801.43 ms | 137919 tok/s) step 19017/76294 | train loss 3.449925 | norm 14.7833 | lr 1.22e-04 | (3870.33 ms | 135464 tok/s) step 19018/76294 | train loss 3.472061 | norm 19.7648 | lr 1.22e-04 | (3797.33 ms | 138068 tok/s) step 19019/76294 | train loss 3.446422 | norm 8.9821 | lr 1.22e-04 | (3803.80 ms | 137833 tok/s) step 19020/76294 | train loss 3.442546 | norm 8.1449 | lr 1.22e-04 | (3818.44 ms | 137304 tok/s) step 19021/76294 | train loss 3.597311 | norm 14.0000 | lr 1.22e-04 | (3804.02 ms | 137825 tok/s) step 19022/76294 | train loss 3.415498 | norm 8.7413 | lr 1.22e-04 | (3809.93 ms | 137611 tok/s) step 19023/76294 | train loss 3.477278 | norm 6.7979 | lr 1.22e-04 | (3798.52 ms | 138024 tok/s) step 19024/76294 | train loss 3.402045 | norm 7.4209 | lr 1.22e-04 | (3809.82 ms | 137615 tok/s) step 19025/76294 | train loss 3.462319 | norm 9.8836 | lr 1.22e-04 | (3825.55 ms | 137049 tok/s) step 19026/76294 | train loss 3.447234 | norm 10.0388 | lr 1.22e-04 | (3805.66 ms | 137765 tok/s) step 19027/76294 | train loss 3.498128 | norm 10.0651 | lr 1.22e-04 | (3806.76 ms | 137726 tok/s) step 19028/76294 | train loss 3.477271 | norm 4.2987 | lr 1.22e-04 | (3811.00 ms | 137572 tok/s) step 19029/76294 | train loss 3.430323 | norm 13.1253 | lr 1.22e-04 | (3800.79 ms | 137942 tok/s) step 19030/76294 | train loss 3.464099 | norm 27.0876 | lr 1.22e-04 | (3808.98 ms | 137645 tok/s) step 19031/76294 | 
train loss 3.504320 | norm 24.7563 | lr 1.22e-04 | (3803.63 ms | 137839 tok/s) step 19032/76294 | train loss 3.474257 | norm 5.3998 | lr 1.22e-04 | (3806.01 ms | 137753 tok/s) step 19033/76294 | train loss 3.466443 | norm 7.8129 | lr 1.22e-04 | (3798.25 ms | 138034 tok/s) step 19034/76294 | train loss 3.424362 | norm 17.5367 | lr 1.22e-04 | (3809.99 ms | 137609 tok/s) step 19035/76294 | train loss 3.501349 | norm 17.2184 | lr 1.22e-04 | (3804.98 ms | 137790 tok/s) step 19036/76294 | train loss 3.547997 | norm 18.4095 | lr 1.22e-04 | (3827.33 ms | 136985 tok/s) step 19037/76294 | train loss 3.480231 | norm 13.4436 | lr 1.22e-04 | (3798.70 ms | 138018 tok/s) step 19038/76294 | train loss 3.462486 | norm 13.4676 | lr 1.22e-04 | (3795.35 ms | 138140 tok/s) step 19039/76294 | train loss 3.619628 | norm 25.5001 | lr 1.22e-04 | (3821.00 ms | 137212 tok/s) step 19040/76294 | train loss 3.521153 | norm 15.5467 | lr 1.22e-04 | (3796.84 ms | 138085 tok/s) step 19041/76294 | train loss 3.536896 | norm 22.0322 | lr 1.22e-04 | (3812.17 ms | 137530 tok/s) step 19042/76294 | train loss 3.514855 | norm 12.0456 | lr 1.22e-04 | (3819.97 ms | 137249 tok/s) step 19043/76294 | train loss 3.482084 | norm 10.7821 | lr 1.22e-04 | (3865.64 ms | 135628 tok/s) step 19044/76294 | train loss 3.439129 | norm 9.8644 | lr 1.22e-04 | (3795.88 ms | 138120 tok/s) step 19045/76294 | train loss 3.553229 | norm 6.4206 | lr 1.22e-04 | (3845.74 ms | 136329 tok/s) step 19046/76294 | train loss 3.469878 | norm 6.8378 | lr 1.22e-04 | (3795.85 ms | 138122 tok/s) step 19047/76294 | train loss 3.454382 | norm 12.1762 | lr 1.22e-04 | (3799.52 ms | 137988 tok/s) step 19048/76294 | train loss 3.444551 | norm 12.8603 | lr 1.22e-04 | (3816.23 ms | 137384 tok/s) step 19049/76294 | train loss 3.499348 | norm 7.1308 | lr 1.22e-04 | (3803.85 ms | 137831 tok/s) step 19050/76294 | train loss 3.516638 | norm 7.6819 | lr 1.22e-04 | (3798.97 ms | 138008 tok/s) step 19051/76294 | train loss 3.417575 | norm 8.6332 | lr 
1.22e-04 | (3824.80 ms | 137076 tok/s) step 19052/76294 | train loss 3.501856 | norm 13.0347 | lr 1.22e-04 | (3796.59 ms | 138094 tok/s) step 19053/76294 | train loss 3.487881 | norm 54.0569 | lr 1.22e-04 | (3804.62 ms | 137803 tok/s) step 19054/76294 | train loss 3.491907 | norm 5.7201 | lr 1.22e-04 | (3818.62 ms | 137298 tok/s) step 19055/76294 | train loss 3.517809 | norm 10.1737 | lr 1.22e-04 | (3800.04 ms | 137969 tok/s) step 19056/76294 | train loss 3.499145 | norm 32.7687 | lr 1.22e-04 | (3815.21 ms | 137421 tok/s) step 19057/76294 | train loss 3.535227 | norm 24.0114 | lr 1.22e-04 | (3797.21 ms | 138072 tok/s) step 19058/76294 | train loss 3.474144 | norm 19.7570 | lr 1.22e-04 | (3796.51 ms | 138097 tok/s) step 19059/76294 | train loss 3.534892 | norm 16.4335 | lr 1.22e-04 | (3838.76 ms | 136577 tok/s) step 19060/76294 | train loss 3.486551 | norm 23.2257 | lr 1.22e-04 | (3795.96 ms | 138118 tok/s) step 19061/76294 | train loss 3.505874 | norm 21.8265 | lr 1.22e-04 | (3801.05 ms | 137932 tok/s) step 19062/76294 | train loss 3.451134 | norm 17.4783 | lr 1.22e-04 | (3821.80 ms | 137184 tok/s) step 19063/76294 | train loss 3.511454 | norm 21.7186 | lr 1.22e-04 | (3802.57 ms | 137877 tok/s) step 19064/76294 | train loss 3.459255 | norm 12.6760 | lr 1.22e-04 | (3794.70 ms | 138163 tok/s) step 19065/76294 | train loss 3.477252 | norm 13.8652 | lr 1.22e-04 | (3829.04 ms | 136924 tok/s) step 19066/76294 | train loss 3.484633 | norm 33.9551 | lr 1.22e-04 | (3795.78 ms | 138124 tok/s) step 19067/76294 | train loss 3.533042 | norm 6.9569 | lr 1.22e-04 | (3822.16 ms | 137171 tok/s) step 19068/76294 | train loss 3.505486 | norm 4.2449 | lr 1.22e-04 | (3932.07 ms | 133336 tok/s) step 19069/76294 | train loss 3.486676 | norm 8.9786 | lr 1.22e-04 | (3797.52 ms | 138061 tok/s) step 19070/76294 | train loss 3.495054 | norm 18.9949 | lr 1.22e-04 | (3822.91 ms | 137144 tok/s) step 19071/76294 | train loss 3.499708 | norm 18.7071 | lr 1.22e-04 | (3799.38 ms | 137993 tok/s) step 
19072/76294 | train loss 3.525498 | norm 10.9262 | lr 1.22e-04 | (3799.57 ms | 137986 tok/s) step 19073/76294 | train loss 3.491662 | norm 7.1240 | lr 1.22e-04 | (3819.20 ms | 137277 tok/s) step 19074/76294 | train loss 3.411077 | norm 11.4790 | lr 1.22e-04 | (3808.22 ms | 137673 tok/s) step 19075/76294 | train loss 3.487487 | norm 11.1565 | lr 1.22e-04 | (3825.06 ms | 137067 tok/s) step 19076/76294 | train loss 3.512717 | norm 8.4238 | lr 1.22e-04 | (3795.76 ms | 138125 tok/s) step 19077/76294 | train loss 3.459980 | norm 11.3978 | lr 1.22e-04 | (3824.30 ms | 137094 tok/s) step 19078/76294 | train loss 3.414424 | norm 13.4173 | lr 1.22e-04 | (3796.29 ms | 138105 tok/s) step 19079/76294 | train loss 3.547235 | norm 15.8833 | lr 1.22e-04 | (3798.42 ms | 138028 tok/s) step 19080/76294 | train loss 3.509807 | norm 29.9758 | lr 1.22e-04 | (3824.69 ms | 137080 tok/s) step 19081/76294 | train loss 3.451499 | norm 14.7948 | lr 1.22e-04 | (3796.17 ms | 138110 tok/s) step 19082/76294 | train loss 3.554929 | norm 9.9744 | lr 1.22e-04 | (3804.05 ms | 137824 tok/s) step 19083/76294 | train loss 3.463928 | norm 17.6455 | lr 1.22e-04 | (3804.15 ms | 137820 tok/s) step 19084/76294 | train loss 3.518939 | norm 7.4744 | lr 1.22e-04 | (3824.70 ms | 137079 tok/s) step 19085/76294 | train loss 3.456821 | norm 6.5821 | lr 1.22e-04 | (3825.63 ms | 137046 tok/s) step 19086/76294 | train loss 3.434690 | norm 10.7039 | lr 1.22e-04 | (3809.54 ms | 137625 tok/s) step 19087/76294 | train loss 3.473133 | norm 9.4064 | lr 1.22e-04 | (3807.76 ms | 137689 tok/s) step 19088/76294 | train loss 3.453748 | norm 7.7582 | lr 1.22e-04 | (3807.28 ms | 137707 tok/s) step 19089/76294 | train loss 3.520622 | norm 3.1911 | lr 1.22e-04 | (3838.46 ms | 136588 tok/s) step 19090/76294 | train loss 3.538594 | norm 4.4990 | lr 1.22e-04 | (3808.76 ms | 137653 tok/s) step 19091/76294 | train loss 3.526693 | norm 6.1863 | lr 1.22e-04 | (3803.28 ms | 137852 tok/s) step 19092/76294 | train loss 3.463886 | norm 12.6628 
| lr 1.22e-04 | (3810.08 ms | 137605 tok/s) step 19093/76294 | train loss 3.502622 | norm 13.3504 | lr 1.22e-04 | (3802.22 ms | 137890 tok/s) step 19094/76294 | train loss 3.448330 | norm 32.9436 | lr 1.22e-04 | (3853.62 ms | 136051 tok/s) step 19095/76294 | train loss 3.467957 | norm 26.5042 | lr 1.22e-04 | (3800.70 ms | 137945 tok/s) step 19096/76294 | train loss 3.565457 | norm 8.2089 | lr 1.22e-04 | (3809.87 ms | 137613 tok/s) step 19097/76294 | train loss 3.476703 | norm 11.5417 | lr 1.22e-04 | (3819.30 ms | 137273 tok/s) step 19098/76294 | train loss 3.446351 | norm 14.1971 | lr 1.22e-04 | (3797.92 ms | 138046 tok/s) step 19099/76294 | train loss 3.493896 | norm 6.8925 | lr 1.22e-04 | (3844.84 ms | 136361 tok/s) step 19100/76294 | train loss 3.434342 | norm 20.9368 | lr 1.22e-04 | (3804.83 ms | 137795 tok/s) step 19101/76294 | train loss 3.514384 | norm 16.7152 | lr 1.22e-04 | (3813.29 ms | 137490 tok/s) step 19102/76294 | train loss 3.492726 | norm 10.6903 | lr 1.22e-04 | (3796.11 ms | 138112 tok/s) step 19103/76294 | train loss 3.576342 | norm 12.8657 | lr 1.22e-04 | (3854.14 ms | 136032 tok/s) step 19104/76294 | train loss 3.480624 | norm 23.3703 | lr 1.22e-04 | (3797.71 ms | 138054 tok/s) step 19105/76294 | train loss 3.448813 | norm 27.2586 | lr 1.22e-04 | (3801.31 ms | 137923 tok/s) step 19106/76294 | train loss 3.706659 | norm 19.2194 | lr 1.22e-04 | (3815.78 ms | 137400 tok/s) step 19107/76294 | train loss 3.523464 | norm 20.1224 | lr 1.22e-04 | (3805.59 ms | 137768 tok/s) step 19108/76294 | train loss 3.445944 | norm 18.4400 | lr 1.22e-04 | (3806.96 ms | 137718 tok/s) step 19109/76294 | train loss 3.482258 | norm 11.8609 | lr 1.22e-04 | (3804.58 ms | 137805 tok/s) step 19110/76294 | train loss 3.523909 | norm 60.2903 | lr 1.22e-04 | (3804.92 ms | 137792 tok/s) step 19111/76294 | train loss 3.481966 | norm 8.3944 | lr 1.22e-04 | (3800.09 ms | 137967 tok/s) step 19112/76294 | train loss 3.479955 | norm 8.1100 | lr 1.22e-04 | (3818.99 ms | 137284 tok/s) 
step 19113/76294 | train loss 3.495688 | norm 7.5917 | lr 1.22e-04 | (3799.72 ms | 137981 tok/s) step 19114/76294 | train loss 3.450933 | norm 17.5544 | lr 1.21e-04 | (3819.95 ms | 137250 tok/s) step 19115/76294 | train loss 3.417337 | norm 7.5267 | lr 1.21e-04 | (3800.91 ms | 137938 tok/s) step 19116/76294 | train loss 3.410279 | norm 8.5489 | lr 1.21e-04 | (3805.60 ms | 137768 tok/s) step 19117/76294 | train loss 3.443123 | norm 10.1472 | lr 1.21e-04 | (3797.48 ms | 138062 tok/s) step 19118/76294 | train loss 3.475507 | norm 14.0291 | lr 1.21e-04 | (3796.45 ms | 138099 tok/s) step 19119/76294 | train loss 3.502385 | norm 11.5532 | lr 1.21e-04 | (3878.75 ms | 135169 tok/s) step 19120/76294 | train loss 3.491347 | norm 5.4038 | lr 1.21e-04 | (3798.66 ms | 138019 tok/s) step 19121/76294 | train loss 3.482090 | norm 5.3107 | lr 1.21e-04 | (3806.74 ms | 137726 tok/s) step 19122/76294 | train loss 3.494777 | norm 8.3806 | lr 1.21e-04 | (3819.64 ms | 137261 tok/s) step 19123/76294 | train loss 3.464096 | norm 32.4519 | lr 1.21e-04 | (3803.18 ms | 137855 tok/s) step 19124/76294 | train loss 3.450414 | norm 24.1061 | lr 1.21e-04 | (3796.50 ms | 138098 tok/s) step 19125/76294 | train loss 3.493239 | norm 4.5073 | lr 1.21e-04 | (3833.64 ms | 136760 tok/s) step 19126/76294 | train loss 3.505793 | norm 14.2620 | lr 1.21e-04 | (3801.56 ms | 137914 tok/s) step 19127/76294 | train loss 3.468158 | norm 11.4325 | lr 1.21e-04 | (3802.04 ms | 137896 tok/s) step 19128/76294 | train loss 3.550256 | norm 7.3867 | lr 1.21e-04 | (3819.30 ms | 137273 tok/s) step 19129/76294 | train loss 3.427824 | norm 9.4983 | lr 1.21e-04 | (3806.69 ms | 137728 tok/s) step 19130/76294 | train loss 3.541654 | norm 13.4452 | lr 1.21e-04 | (3802.40 ms | 137884 tok/s) step 19131/76294 | train loss 3.423534 | norm 15.1325 | lr 1.21e-04 | (3799.73 ms | 137980 tok/s) step 19132/76294 | train loss 3.467241 | norm 8.5978 | lr 1.21e-04 | (3827.18 ms | 136991 tok/s) step 19133/76294 | train loss 3.450274 | norm 
8.1232 | lr 1.21e-04 | (3801.69 ms | 137909 tok/s) step 19134/76294 | train loss 3.556767 | norm 7.5944 | lr 1.21e-04 | (3808.15 ms | 137675 tok/s) step 19135/76294 | train loss 3.505201 | norm 6.4133 | lr 1.21e-04 | (3887.58 ms | 134862 tok/s) step 19136/76294 | train loss 3.447323 | norm 7.6052 | lr 1.21e-04 | (3796.75 ms | 138089 tok/s) step 19137/76294 | train loss 3.404441 | norm 9.5024 | lr 1.21e-04 | (3847.79 ms | 136257 tok/s) step 19138/76294 | train loss 3.473575 | norm 6.9510 | lr 1.21e-04 | (3799.19 ms | 138000 tok/s) step 19139/76294 | train loss 3.459673 | norm 12.5862 | lr 1.21e-04 | (3802.55 ms | 137878 tok/s) step 19140/76294 | train loss 3.458358 | norm 9.7318 | lr 1.21e-04 | (3818.83 ms | 137290 tok/s) step 19141/76294 | train loss 3.527665 | norm 9.1750 | lr 1.21e-04 | (3809.10 ms | 137641 tok/s) step 19142/76294 | train loss 3.468672 | norm 10.8372 | lr 1.21e-04 | (3799.90 ms | 137974 tok/s) step 19143/76294 | train loss 3.467374 | norm 9.5811 | lr 1.21e-04 | (3831.78 ms | 136826 tok/s) step 19144/76294 | train loss 3.544960 | norm 3.9397 | lr 1.21e-04 | (3908.99 ms | 134123 tok/s) step 19145/76294 | train loss 3.440075 | norm 16.5200 | lr 1.21e-04 | (3838.17 ms | 136598 tok/s) step 19146/76294 | train loss 3.502257 | norm 11.0450 | lr 1.21e-04 | (3807.39 ms | 137703 tok/s) step 19147/76294 | train loss 3.467971 | norm 9.1880 | lr 1.21e-04 | (3817.74 ms | 137330 tok/s) step 19148/76294 | train loss 3.491693 | norm 5.8768 | lr 1.21e-04 | (3798.54 ms | 138024 tok/s) step 19149/76294 | train loss 3.536829 | norm 4.6211 | lr 1.21e-04 | (3806.97 ms | 137718 tok/s) step 19150/76294 | train loss 3.494750 | norm 16.7775 | lr 1.21e-04 | (3797.90 ms | 138047 tok/s) step 19151/76294 | train loss 3.498145 | norm 23.7641 | lr 1.21e-04 | (3823.02 ms | 137140 tok/s) step 19152/76294 | train loss 3.510787 | norm 6.2941 | lr 1.21e-04 | (3796.02 ms | 138115 tok/s) step 19153/76294 | train loss 3.401261 | norm 8.5527 | lr 1.21e-04 | (3818.02 ms | 137319 tok/s) 
step 19154/76294 | train loss 3.467419 | norm 8.8869 | lr 1.21e-04 | (3824.50 ms | 137087 tok/s) step 19155/76294 | train loss 3.450167 | norm 14.6091 | lr 1.21e-04 | (3804.80 ms | 137796 tok/s) step 19156/76294 | train loss 3.442280 | norm 14.6659 | lr 1.21e-04 | (3797.52 ms | 138061 tok/s) step 19157/76294 | train loss 3.515135 | norm 8.6130 | lr 1.21e-04 | (3828.41 ms | 136947 tok/s) step 19158/76294 | train loss 3.464659 | norm 7.2365 | lr 1.21e-04 | (3802.04 ms | 137897 tok/s) step 19159/76294 | train loss 3.447705 | norm 4.6366 | lr 1.21e-04 | (6792.90 ms | 77182 tok/s) step 19160/76294 | train loss 3.468533 | norm 6.2314 | lr 1.21e-04 | (3853.43 ms | 136057 tok/s) step 19161/76294 | train loss 3.461292 | norm 4.7644 | lr 1.21e-04 | (3795.36 ms | 138139 tok/s) step 19162/76294 | train loss 3.341628 | norm 11.8034 | lr 1.21e-04 | (3816.64 ms | 137369 tok/s) step 19163/76294 | train loss 3.522223 | norm 5.4880 | lr 1.21e-04 | (3794.09 ms | 138185 tok/s) step 19164/76294 | train loss 3.519456 | norm 9.3488 | lr 1.21e-04 | (3796.53 ms | 138097 tok/s) step 19165/76294 | train loss 3.493197 | norm 12.6072 | lr 1.21e-04 | (3797.87 ms | 138048 tok/s) step 19166/76294 | train loss 3.481385 | norm 6.2181 | lr 1.21e-04 | (3799.12 ms | 138002 tok/s) step 19167/76294 | train loss 3.549741 | norm 10.4244 | lr 1.21e-04 | (3794.44 ms | 138173 tok/s) step 19168/76294 | train loss 3.490619 | norm 14.5085 | lr 1.21e-04 | (3831.93 ms | 136821 tok/s) step 19169/76294 | train loss 3.648569 | norm 11.4687 | lr 1.21e-04 | (3802.71 ms | 137872 tok/s) step 19170/76294 | train loss 3.489680 | norm 12.5702 | lr 1.21e-04 | (3847.10 ms | 136281 tok/s) step 19171/76294 | train loss 3.493060 | norm 8.5789 | lr 1.21e-04 | (3796.02 ms | 138115 tok/s) step 19172/76294 | train loss 3.516017 | norm 23.3926 | lr 1.21e-04 | (3798.28 ms | 138033 tok/s) step 19173/76294 | train loss 3.464332 | norm 9.6032 | lr 1.21e-04 | (3814.10 ms | 137461 tok/s) step 19174/76294 | train loss 3.462070 | norm 
15.5846 | lr 1.21e-04 | (3805.06 ms | 137787 tok/s) step 19175/76294 | train loss 3.492706 | norm 8.2777 | lr 1.21e-04 | (3814.63 ms | 137441 tok/s) step 19176/76294 | train loss 3.379770 | norm 5.4936 | lr 1.21e-04 | (3794.19 ms | 138182 tok/s) step 19177/76294 | train loss 3.469460 | norm 10.1913 | lr 1.21e-04 | (3800.93 ms | 137937 tok/s) step 19178/76294 | train loss 3.566214 | norm 10.3891 | lr 1.21e-04 | (3795.29 ms | 138142 tok/s) step 19179/76294 | train loss 3.451312 | norm 9.5347 | lr 1.21e-04 | (4032.22 ms | 130025 tok/s) step 19180/76294 | train loss 3.421738 | norm 9.4734 | lr 1.21e-04 | (3818.22 ms | 137312 tok/s) step 19181/76294 | train loss 3.465633 | norm 9.3265 | lr 1.21e-04 | (3794.99 ms | 138153 tok/s) step 19182/76294 | train loss 3.462168 | norm 8.9179 | lr 1.21e-04 | (3792.08 ms | 138259 tok/s) step 19183/76294 | train loss 3.474186 | norm 5.3747 | lr 1.21e-04 | (3820.75 ms | 137221 tok/s) step 19184/76294 | train loss 3.432642 | norm 3.0053 | lr 1.21e-04 | (3795.80 ms | 138123 tok/s) step 19185/76294 | train loss 3.443213 | norm 8.4628 | lr 1.21e-04 | (3822.33 ms | 137164 tok/s) step 19186/76294 | train loss 3.499551 | norm 13.6661 | lr 1.21e-04 | (3806.16 ms | 137747 tok/s) step 19187/76294 | train loss 3.417577 | norm 8.7395 | lr 1.21e-04 | (3806.16 ms | 137747 tok/s) step 19188/76294 | train loss 3.536335 | norm 13.7479 | lr 1.21e-04 | (3821.52 ms | 137193 tok/s) step 19189/76294 | train loss 3.484251 | norm 11.5636 | lr 1.21e-04 | (3804.08 ms | 137823 tok/s) step 19190/76294 | train loss 3.466753 | norm 17.9206 | lr 1.21e-04 | (3802.20 ms | 137891 tok/s) step 19191/76294 | train loss 3.482960 | norm 11.5142 | lr 1.21e-04 | (3805.52 ms | 137770 tok/s) step 19192/76294 | train loss 3.471669 | norm 23.7739 | lr 1.21e-04 | (3807.41 ms | 137702 tok/s) step 19193/76294 | train loss 3.435870 | norm 16.7269 | lr 1.21e-04 | (3877.88 ms | 135200 tok/s) step 19194/76294 | train loss 3.454369 | norm 24.3118 | lr 1.21e-04 | (3796.59 ms | 138094 
tok/s) step 19195/76294 | train loss 3.571607 | norm 18.7701 | lr 1.21e-04 | (3801.63 ms | 137911 tok/s) step 19196/76294 | train loss 3.470771 | norm 19.1060 | lr 1.21e-04 | (3821.64 ms | 137189 tok/s) step 19197/76294 | train loss 3.467844 | norm 14.9376 | lr 1.21e-04 | (3806.41 ms | 137738 tok/s) step 19198/76294 | train loss 3.509507 | norm 10.7082 | lr 1.21e-04 | (3809.79 ms | 137616 tok/s) step 19199/76294 | train loss 3.568010 | norm 13.8205 | lr 1.21e-04 | (3848.33 ms | 136238 tok/s) step 19200/76294 | train loss 3.421950 | norm 16.4875 | lr 1.21e-04 | (3803.70 ms | 137836 tok/s) step 19201/76294 | train loss 3.447018 | norm 8.6126 | lr 1.21e-04 | (3808.05 ms | 137679 tok/s) step 19202/76294 | train loss 3.466105 | norm 11.5877 | lr 1.21e-04 | (3909.10 ms | 134120 tok/s) step 19203/76294 | train loss 3.519528 | norm 5.8121 | lr 1.21e-04 | (3788.95 ms | 138373 tok/s) step 19204/76294 | train loss 3.537158 | norm 11.4664 | lr 1.21e-04 | (3815.20 ms | 137421 tok/s) step 19205/76294 | train loss 3.544540 | norm 7.8388 | lr 1.21e-04 | (3790.87 ms | 138303 tok/s) step 19206/76294 | train loss 3.464075 | norm 7.0303 | lr 1.21e-04 | (3795.20 ms | 138145 tok/s) step 19207/76294 | train loss 3.559381 | norm 12.6013 | lr 1.21e-04 | (3839.02 ms | 136568 tok/s) step 19208/76294 | train loss 3.446651 | norm 5.8542 | lr 1.21e-04 | (3793.34 ms | 138213 tok/s) step 19209/76294 | train loss 3.496044 | norm 10.8342 | lr 1.21e-04 | (3816.43 ms | 137377 tok/s) step 19210/76294 | train loss 3.461484 | norm 33.4979 | lr 1.21e-04 | (3813.20 ms | 137493 tok/s) step 19211/76294 | train loss 3.478444 | norm 8.2225 | lr 1.21e-04 | (3795.23 ms | 138144 tok/s) step 19212/76294 | train loss 3.507657 | norm 6.1810 | lr 1.21e-04 | (3885.18 ms | 134945 tok/s) step 19213/76294 | train loss 3.535117 | norm 23.4169 | lr 1.21e-04 | (3789.06 ms | 138369 tok/s) step 19214/76294 | train loss 3.500459 | norm 19.0727 | lr 1.21e-04 | (3949.78 ms | 132739 tok/s) step 19215/76294 | train loss 3.599758 
| norm 13.3543 | lr 1.21e-04 | (3781.09 ms | 138661 tok/s) step 19216/76294 | train loss 3.538921 | norm 24.6011 | lr 1.21e-04 | (3867.38 ms | 135567 tok/s) step 19217/76294 | train loss 3.523093 | norm 31.7622 | lr 1.21e-04 | (3952.23 ms | 132656 tok/s) step 19218/76294 | train loss 3.493862 | norm 17.2955 | lr 1.21e-04 | (3782.59 ms | 138606 tok/s) step 19219/76294 | train loss 3.524091 | norm 17.1389 | lr 1.21e-04 | (3876.17 ms | 135259 tok/s) step 19220/76294 | train loss 3.540692 | norm 16.9265 | lr 1.21e-04 | (3788.77 ms | 138379 tok/s) step 19221/76294 | train loss 3.450606 | norm 20.7983 | lr 1.21e-04 | (3791.27 ms | 138288 tok/s) step 19222/76294 | train loss 3.547226 | norm 21.8645 | lr 1.21e-04 | (3809.18 ms | 137638 tok/s) step 19223/76294 | train loss 3.468208 | norm 22.8761 | lr 1.21e-04 | (3799.07 ms | 138004 tok/s) step 19224/76294 | train loss 3.491225 | norm 18.0991 | lr 1.21e-04 | (3796.88 ms | 138084 tok/s) step 19225/76294 | train loss 3.441336 | norm 17.5556 | lr 1.21e-04 | (3793.78 ms | 138197 tok/s) step 19226/76294 | train loss 3.461613 | norm 16.0759 | lr 1.21e-04 | (3812.11 ms | 137532 tok/s) step 19227/76294 | train loss 3.405815 | norm 13.4018 | lr 1.21e-04 | (3794.59 ms | 138167 tok/s) step 19228/76294 | train loss 3.390053 | norm 22.3061 | lr 1.21e-04 | (3789.13 ms | 138366 tok/s) step 19229/76294 | train loss 3.425738 | norm 16.4439 | lr 1.21e-04 | (3844.75 ms | 136365 tok/s) step 19230/76294 | train loss 3.459379 | norm 7.6751 | lr 1.21e-04 | (3797.12 ms | 138075 tok/s) step 19231/76294 | train loss 3.437054 | norm 19.2588 | lr 1.21e-04 | (3807.91 ms | 137684 tok/s) step 19232/76294 | train loss 3.469959 | norm 23.8297 | lr 1.21e-04 | (3793.98 ms | 138189 tok/s) step 19233/76294 | train loss 3.708797 | norm 18.2043 | lr 1.21e-04 | (3950.14 ms | 132727 tok/s) step 19234/76294 | train loss 3.484151 | norm 18.3446 | lr 1.21e-04 | (3788.61 ms | 138385 tok/s) step 19235/76294 | train loss 3.508043 | norm 10.4622 | lr 1.21e-04 | (3904.54 
ms | 134276 tok/s) step 19236/76294 | train loss 3.395384 | norm 12.4759 | lr 1.21e-04 | (3788.66 ms | 138383 tok/s) step 19237/76294 | train loss 3.422391 | norm 8.4367 | lr 1.21e-04 | (3930.45 ms | 133391 tok/s) step 19238/76294 | train loss 3.422132 | norm 7.5246 | lr 1.21e-04 | (3841.45 ms | 136482 tok/s) step 19239/76294 | train loss 3.489359 | norm 6.8719 | lr 1.21e-04 | (3770.14 ms | 139063 tok/s) step 19240/76294 | train loss 3.429188 | norm 16.3698 | lr 1.21e-04 | (3969.59 ms | 132076 tok/s) step 19241/76294 | train loss 3.466301 | norm 7.9582 | lr 1.21e-04 | (4103.74 ms | 127759 tok/s) step 19242/76294 | train loss 3.389225 | norm 14.1706 | lr 1.21e-04 | (7515.59 ms | 69760 tok/s) step 19243/76294 | train loss 3.528636 | norm 5.6966 | lr 1.21e-04 | (3760.39 ms | 139424 tok/s) step 19244/76294 | train loss 3.477081 | norm 10.8912 | lr 1.21e-04 | (3797.50 ms | 138061 tok/s) step 19245/76294 | train loss 3.450795 | norm 12.8148 | lr 1.21e-04 | (3770.58 ms | 139047 tok/s) step 19246/76294 | train loss 3.462201 | norm 5.2194 | lr 1.21e-04 | (3803.88 ms | 137830 tok/s) step 19247/76294 | train loss 3.402714 | norm 24.0333 | lr 1.21e-04 | (3801.36 ms | 137921 tok/s) step 19248/76294 | train loss 3.515807 | norm 22.7935 | lr 1.21e-04 | (3782.72 ms | 138601 tok/s) step 19249/76294 | train loss 3.491787 | norm 30.2167 | lr 1.21e-04 | (3791.54 ms | 138278 tok/s) step 19250/76294 | train loss 3.506631 | norm 15.5445 | lr 1.21e-04 | (3780.95 ms | 138666 tok/s) val loss: 3.458870 saving model checkpoint to ./results/gpt2-124M-gqa/step_19250.pth step 19251/76294 | train loss 3.443284 | norm 15.8814 | lr 1.21e-04 | (3820.07 ms | 137246 tok/s) step 19252/76294 | train loss 3.491752 | norm 8.3261 | lr 1.21e-04 | (3779.52 ms | 138718 tok/s) step 19253/76294 | train loss 3.389938 | norm 11.7161 | lr 1.21e-04 | (3778.78 ms | 138745 tok/s) step 19254/76294 | train loss 3.504026 | norm 15.3623 | lr 1.21e-04 | (3807.06 ms | 137715 tok/s) step 19255/76294 | train loss 3.418872 | 
norm 16.1716 | lr 1.21e-04 | (3782.14 ms | 138622 tok/s) step 19256/76294 | train loss 3.504440 | norm 6.0230 | lr 1.21e-04 | (3827.00 ms | 136997 tok/s) step 19257/76294 | train loss 3.484422 | norm 18.4486 | lr 1.21e-04 | (3786.44 ms | 138464 tok/s) step 19258/76294 | train loss 3.469278 | norm 9.9797 | lr 1.21e-04 | (3803.49 ms | 137844 tok/s) step 19259/76294 | train loss 3.436512 | norm 20.2529 | lr 1.21e-04 | (3794.41 ms | 138174 tok/s) step 19260/76294 | train loss 3.445026 | norm 13.6663 | lr 1.21e-04 | (3841.56 ms | 136478 tok/s) step 19261/76294 | train loss 3.444976 | norm 15.2119 | lr 1.21e-04 | (3788.66 ms | 138384 tok/s) step 19262/76294 | train loss 3.457927 | norm 8.3316 | lr 1.21e-04 | (3799.52 ms | 137988 tok/s) step 19263/76294 | train loss 3.389575 | norm 13.3552 | lr 1.21e-04 | (3812.77 ms | 137509 tok/s) step 19264/76294 | train loss 3.460965 | norm 17.1894 | lr 1.21e-04 | (3795.72 ms | 138126 tok/s) step 19265/76294 | train loss 3.440482 | norm 15.7845 | lr 1.21e-04 | (3792.33 ms | 138249 tok/s) step 19266/76294 | train loss 3.487624 | norm 10.0516 | lr 1.21e-04 | (3814.22 ms | 137456 tok/s) step 19267/76294 | train loss 3.486029 | norm 23.8670 | lr 1.21e-04 | (3796.34 ms | 138103 tok/s) step 19268/76294 | train loss 3.474039 | norm 20.2856 | lr 1.21e-04 | (3821.91 ms | 137180 tok/s) step 19269/76294 | train loss 3.482300 | norm 17.5336 | lr 1.21e-04 | (3794.15 ms | 138183 tok/s) step 19270/76294 | train loss 3.520150 | norm 20.5501 | lr 1.21e-04 | (3796.88 ms | 138084 tok/s) step 19271/76294 | train loss 3.548665 | norm 24.3662 | lr 1.21e-04 | (3816.74 ms | 137365 tok/s) step 19272/76294 | train loss 3.442450 | norm 16.6087 | lr 1.21e-04 | (3827.86 ms | 136966 tok/s) step 19273/76294 | train loss 3.475581 | norm 37.2267 | lr 1.21e-04 | (3813.15 ms | 137495 tok/s) step 19274/76294 | train loss 3.525476 | norm 41.3809 | lr 1.21e-04 | (3798.62 ms | 138021 tok/s) step 19275/76294 | train loss 3.488891 | norm 22.8548 | lr 1.21e-04 | (3799.98 ms | 
137971 tok/s) step 19276/76294 | train loss 3.454566 | norm 18.1185 | lr 1.21e-04 | (3815.67 ms | 137404 tok/s) step 19277/76294 | train loss 3.436277 | norm 6.6407 | lr 1.21e-04 | (3834.61 ms | 136725 tok/s) step 19278/76294 | train loss 3.461546 | norm 12.3852 | lr 1.21e-04 | (3800.53 ms | 137951 tok/s) step 19279/76294 | train loss 3.465431 | norm 9.7606 | lr 1.21e-04 | (3831.88 ms | 136823 tok/s) step 19280/76294 | train loss 3.448208 | norm 8.0736 | lr 1.21e-04 | (3798.90 ms | 138010 tok/s) step 19281/76294 | train loss 3.467544 | norm 6.9360 | lr 1.21e-04 | (3876.82 ms | 135236 tok/s) step 19282/76294 | train loss 3.421627 | norm 7.4511 | lr 1.21e-04 | (3803.05 ms | 137860 tok/s) step 19283/76294 | train loss 3.504890 | norm 14.2463 | lr 1.21e-04 | (3806.48 ms | 137736 tok/s) step 19284/76294 | train loss 3.537687 | norm 14.9689 | lr 1.21e-04 | (3818.76 ms | 137293 tok/s) step 19285/76294 | train loss 3.491270 | norm 9.6396 | lr 1.21e-04 | (3797.93 ms | 138046 tok/s) step 19286/76294 | train loss 3.444509 | norm 12.8359 | lr 1.21e-04 | (3847.94 ms | 136252 tok/s) step 19287/76294 | train loss 3.424741 | norm 19.4910 | lr 1.21e-04 | (3797.07 ms | 138077 tok/s) step 19288/76294 | train loss 3.412958 | norm 14.6767 | lr 1.21e-04 | (3803.04 ms | 137860 tok/s) step 19289/76294 | train loss 3.431322 | norm 9.7317 | lr 1.21e-04 | (3824.31 ms | 137094 tok/s) step 19290/76294 | train loss 3.401029 | norm 7.9362 | lr 1.21e-04 | (3806.20 ms | 137746 tok/s) step 19291/76294 | train loss 3.447248 | norm 14.2879 | lr 1.21e-04 | (3807.64 ms | 137694 tok/s) step 19292/76294 | train loss 3.459471 | norm 8.5489 | lr 1.21e-04 | (3804.07 ms | 137823 tok/s) step 19293/76294 | train loss 3.420324 | norm 20.7008 | lr 1.21e-04 | (3807.63 ms | 137694 tok/s) step 19294/76294 | train loss 3.406600 | norm 11.0180 | lr 1.21e-04 | (3804.19 ms | 137818 tok/s) step 19295/76294 | train loss 3.634573 | norm 25.7840 | lr 1.21e-04 | (3804.87 ms | 137794 tok/s) step 19296/76294 | train loss 
3.422690 | norm 20.9722 | lr 1.21e-04 | (3807.36 ms | 137704 tok/s) step 19297/76294 | train loss 3.441397 | norm 17.5454 | lr 1.21e-04 | (3808.13 ms | 137676 tok/s) step 19298/76294 | train loss 3.412803 | norm 10.6717 | lr 1.21e-04 | (3803.48 ms | 137844 tok/s) step 19299/76294 | train loss 3.507317 | norm 13.3529 | lr 1.21e-04 | (3803.39 ms | 137848 tok/s) step 19300/76294 | train loss 3.444956 | norm 16.8449 | lr 1.21e-04 | (3803.26 ms | 137852 tok/s) step 19301/76294 | train loss 3.439512 | norm 12.4627 | lr 1.21e-04 | (3800.70 ms | 137945 tok/s) step 19302/76294 | train loss 3.454719 | norm 16.4183 | lr 1.21e-04 | (3826.14 ms | 137028 tok/s) step 19303/76294 | train loss 3.474920 | norm 13.2724 | lr 1.20e-04 | (3801.22 ms | 137926 tok/s) step 19304/76294 | train loss 3.424546 | norm 10.2723 | lr 1.20e-04 | (3803.66 ms | 137838 tok/s) step 19305/76294 | train loss 3.481457 | norm 13.1942 | lr 1.20e-04 | (3821.41 ms | 137198 tok/s) step 19306/76294 | train loss 3.437469 | norm 4.7961 | lr 1.20e-04 | (3869.21 ms | 135503 tok/s) step 19307/76294 | train loss 3.467270 | norm 12.2795 | lr 1.20e-04 | (3794.29 ms | 138178 tok/s) step 19308/76294 | train loss 3.395825 | norm 14.4177 | lr 1.20e-04 | (3804.57 ms | 137805 tok/s) step 19309/76294 | train loss 3.500902 | norm 17.5426 | lr 1.20e-04 | (3838.27 ms | 136595 tok/s) step 19310/76294 | train loss 3.470126 | norm 7.9426 | lr 1.20e-04 | (3801.94 ms | 137900 tok/s) step 19311/76294 | train loss 3.448728 | norm 7.4739 | lr 1.20e-04 | (3843.67 ms | 136403 tok/s) step 19312/76294 | train loss 3.431606 | norm 8.6842 | lr 1.20e-04 | (3808.97 ms | 137646 tok/s) step 19313/76294 | train loss 3.483808 | norm 10.1712 | lr 1.20e-04 | (3843.71 ms | 136401 tok/s) step 19314/76294 | train loss 3.431933 | norm 11.1134 | lr 1.20e-04 | (3796.14 ms | 138111 tok/s) step 19315/76294 | train loss 3.433726 | norm 6.5897 | lr 1.20e-04 | (3851.19 ms | 136137 tok/s) step 19316/76294 | train loss 3.429000 | norm 9.4063 | lr 1.20e-04 | 
(3795.52 ms | 138134 tok/s) step 19317/76294 | train loss 3.488089 | norm 5.1035 | lr 1.20e-04 | (3799.65 ms | 137983 tok/s) step 19318/76294 | train loss 3.421838 | norm 17.2056 | lr 1.20e-04 | (3819.87 ms | 137253 tok/s) step 19319/76294 | train loss 3.510443 | norm 9.8936 | lr 1.20e-04 | (3796.91 ms | 138083 tok/s) step 19320/76294 | train loss 3.450599 | norm 13.0719 | lr 1.20e-04 | (3824.57 ms | 137084 tok/s) step 19321/76294 | train loss 3.482702 | norm 12.6133 | lr 1.20e-04 | (3799.23 ms | 137998 tok/s) step 19322/76294 | train loss 3.414165 | norm 6.6193 | lr 1.20e-04 | (3820.60 ms | 137227 tok/s) step 19323/76294 | train loss 3.487813 | norm 21.8088 | lr 1.20e-04 | (3797.17 ms | 138073 tok/s) step 19324/76294 | train loss 3.503350 | norm 26.5684 | lr 1.20e-04 | (3799.90 ms | 137974 tok/s) step 19325/76294 | train loss 3.484007 | norm 25.5853 | lr 1.20e-04 | (3823.24 ms | 137132 tok/s) step 19326/76294 | train loss 3.493542 | norm 28.7333 | lr 1.20e-04 | (3797.67 ms | 138055 tok/s) step 19327/76294 | train loss 3.483854 | norm 24.4076 | lr 1.20e-04 | (3805.57 ms | 137769 tok/s) step 19328/76294 | train loss 3.436702 | norm 5.7255 | lr 1.20e-04 | (3799.42 ms | 137992 tok/s) step 19329/76294 | train loss 3.429881 | norm 13.7287 | lr 1.20e-04 | (3801.75 ms | 137907 tok/s) step 19330/76294 | train loss 3.467574 | norm 40.2861 | lr 1.20e-04 | (3806.58 ms | 137732 tok/s) step 19331/76294 | train loss 3.525806 | norm 11.9079 | lr 1.20e-04 | (3912.98 ms | 133987 tok/s) step 19332/76294 | train loss 3.441319 | norm 9.7193 | lr 1.20e-04 | (3793.21 ms | 138218 tok/s) step 19333/76294 | train loss 3.524599 | norm 9.8719 | lr 1.20e-04 | (3819.33 ms | 137272 tok/s) step 19334/76294 | train loss 3.629447 | norm 6.9253 | lr 1.20e-04 | (3800.68 ms | 137946 tok/s) step 19335/76294 | train loss 3.518137 | norm 8.5919 | lr 1.20e-04 | (3802.61 ms | 137876 tok/s) step 19336/76294 | train loss 3.487595 | norm 15.0651 | lr 1.20e-04 | (3819.93 ms | 137251 tok/s) step 19337/76294 | 
train loss 3.537268 | norm 13.2519 | lr 1.20e-04 | (3802.53 ms | 137879 tok/s) step 19338/76294 | train loss 3.467149 | norm 18.3671 | lr 1.20e-04 | (3817.87 ms | 137325 tok/s) step 19339/76294 | train loss 3.482187 | norm 18.9630 | lr 1.20e-04 | (3807.69 ms | 137692 tok/s) step 19340/76294 | train loss 3.484316 | norm 35.0520 | lr 1.20e-04 | (3819.41 ms | 137270 tok/s) step 19341/76294 | train loss 3.539972 | norm 33.3975 | lr 1.20e-04 | (3804.17 ms | 137819 tok/s) step 19342/76294 | train loss 3.503916 | norm 35.5711 | lr 1.20e-04 | (3806.03 ms | 137752 tok/s) step 19343/76294 | train loss 3.513160 | norm 37.7818 | lr 1.20e-04 | (3804.27 ms | 137816 tok/s) step 19344/76294 | train loss 3.507698 | norm 48.3605 | lr 1.20e-04 | (3805.05 ms | 137787 tok/s) step 19345/76294 | train loss 3.526806 | norm 50.2336 | lr 1.20e-04 | (3803.12 ms | 137857 tok/s) step 19346/76294 | train loss 3.551212 | norm 61.1290 | lr 1.20e-04 | (3805.06 ms | 137787 tok/s) step 19347/76294 | train loss 3.598387 | norm 52.8649 | lr 1.20e-04 | (3804.54 ms | 137806 tok/s) step 19348/76294 | train loss 3.575749 | norm 46.5582 | lr 1.20e-04 | (3807.78 ms | 137689 tok/s) step 19349/76294 | train loss 3.539602 | norm 41.6175 | lr 1.20e-04 | (3802.11 ms | 137894 tok/s) step 19350/76294 | train loss 3.638119 | norm 52.3230 | lr 1.20e-04 | (3800.30 ms | 137960 tok/s) step 19351/76294 | train loss 3.604284 | norm 39.4925 | lr 1.20e-04 | (3802.24 ms | 137889 tok/s) step 19352/76294 | train loss 3.531136 | norm 41.7109 | lr 1.20e-04 | (3803.12 ms | 137857 tok/s) step 19353/76294 | train loss 3.495733 | norm 39.7271 | lr 1.20e-04 | (3802.08 ms | 137895 tok/s) step 19354/76294 | train loss 3.459896 | norm 59.4327 | lr 1.20e-04 | (3804.27 ms | 137816 tok/s) step 19355/76294 | train loss 3.460168 | norm 37.3138 | lr 1.20e-04 | (3802.54 ms | 137878 tok/s) step 19356/76294 | train loss 3.508324 | norm 37.4865 | lr 1.20e-04 | (3803.49 ms | 137844 tok/s) step 19357/76294 | train loss 3.471710 | norm 33.8379 | lr 
1.20e-04 | (3835.23 ms | 136703 tok/s) step 19358/76294 | train loss 3.578771 | norm 28.8450 | lr 1.20e-04 | (3801.52 ms | 137915 tok/s) step 19359/76294 | train loss 3.432657 | norm 23.4697 | lr 1.20e-04 | (3829.96 ms | 136891 tok/s) step 19360/76294 | train loss 3.498581 | norm 24.8855 | lr 1.20e-04 | (3807.98 ms | 137681 tok/s) step 19361/76294 | train loss 3.477894 | norm 36.7689 | lr 1.20e-04 | (3820.40 ms | 137234 tok/s) step 19362/76294 | train loss 3.404000 | norm 49.3885 | lr 1.20e-04 | (3831.69 ms | 136829 tok/s) step 19363/76294 | train loss 3.480085 | norm 40.5527 | lr 1.20e-04 | (3819.29 ms | 137274 tok/s) step 19364/76294 | train loss 3.541509 | norm 48.4078 | lr 1.20e-04 | (4024.59 ms | 130271 tok/s) step 19365/76294 | train loss 3.509072 | norm 51.9032 | lr 1.20e-04 | (3988.59 ms | 131447 tok/s) step 19366/76294 | train loss 3.477061 | norm 52.3631 | lr 1.20e-04 | (3801.70 ms | 137909 tok/s) step 19367/76294 | train loss 3.461506 | norm 61.0560 | lr 1.20e-04 | (3801.61 ms | 137912 tok/s) step 19368/76294 | train loss 3.543128 | norm 58.2635 | lr 1.20e-04 | (3829.77 ms | 136898 tok/s) step 19369/76294 | train loss 3.462533 | norm 75.8113 | lr 1.20e-04 | (3803.49 ms | 137844 tok/s) step 19370/76294 | train loss 3.496037 | norm 46.7821 | lr 1.20e-04 | (6502.95 ms | 80623 tok/s) step 19371/76294 | train loss 3.482597 | norm 38.2885 | lr 1.20e-04 | (3792.78 ms | 138233 tok/s) step 19372/76294 | train loss 3.456340 | norm 49.2242 | lr 1.20e-04 | (3813.01 ms | 137500 tok/s) step 19373/76294 | train loss 3.463052 | norm 57.2046 | lr 1.20e-04 | (3825.60 ms | 137047 tok/s) step 19374/76294 | train loss 3.436872 | norm 44.0086 | lr 1.20e-04 | (3848.96 ms | 136215 tok/s) step 19375/76294 | train loss 3.654022 | norm 44.1690 | lr 1.20e-04 | (3797.83 ms | 138050 tok/s) step 19376/76294 | train loss 4.007457 | norm 55.8466 | lr 1.20e-04 | (3804.31 ms | 137814 tok/s) step 19377/76294 | train loss 3.488088 | norm 57.6190 | lr 1.20e-04 | (3802.86 ms | 137867 tok/s) 
step 19378/76294 | train loss 3.419846 | norm 56.7595 | lr 1.20e-04 | (3798.31 ms | 138032 tok/s) step 19379/76294 | train loss 3.451201 | norm 97.5831 | lr 1.20e-04 | (3842.60 ms | 136441 tok/s) step 19380/76294 | train loss 3.474592 | norm 60.1844 | lr 1.20e-04 | (3899.34 ms | 134456 tok/s) step 19381/76294 | train loss 3.473145 | norm 73.9455 | lr 1.20e-04 | (3795.12 ms | 138148 tok/s) step 19382/76294 | train loss 3.516711 | norm 74.5237 | lr 1.20e-04 | (3821.39 ms | 137198 tok/s) step 19383/76294 | train loss 3.555087 | norm 72.9883 | lr 1.20e-04 | (3796.69 ms | 138091 tok/s) step 19384/76294 | train loss 3.498811 | norm 33.0313 | lr 1.20e-04 | (3817.62 ms | 137334 tok/s) step 19385/76294 | train loss 3.540237 | norm 52.7484 | lr 1.20e-04 | (3809.55 ms | 137625 tok/s) step 19386/76294 | train loss 3.425243 | norm 34.5732 | lr 1.20e-04 | (3824.94 ms | 137071 tok/s) step 19387/76294 | train loss 3.497715 | norm 54.7446 | lr 1.20e-04 | (3794.23 ms | 138180 tok/s) step 19388/76294 | train loss 3.412097 | norm 77.5716 | lr 1.20e-04 | (3801.40 ms | 137920 tok/s) step 19389/76294 | train loss 3.437138 | norm 25.7229 | lr 1.20e-04 | (3846.35 ms | 136308 tok/s) step 19390/76294 | train loss 3.515062 | norm 40.5080 | lr 1.20e-04 | (3797.84 ms | 138049 tok/s) step 19391/76294 | train loss 3.437050 | norm 89.8194 | lr 1.20e-04 | (3822.75 ms | 137149 tok/s) step 19392/76294 | train loss 3.473960 | norm 32.6766 | lr 1.20e-04 | (3815.27 ms | 137418 tok/s) step 19393/76294 | train loss 3.418082 | norm 57.1559 | lr 1.20e-04 | (3802.51 ms | 137879 tok/s) step 19394/76294 | train loss 3.540818 | norm 81.5318 | lr 1.20e-04 | (3803.96 ms | 137827 tok/s) step 19395/76294 | train loss 3.452658 | norm 79.1065 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 19396/76294 | train loss 3.487072 | norm 95.8300 | lr 1.20e-04 | (3795.60 ms | 138130 tok/s) step 19397/76294 | train loss 3.424160 | norm 47.6991 | lr 1.20e-04 | (3843.73 ms | 136401 tok/s) step 19398/76294 | train loss 3.473421 
| norm 46.7124 | lr 1.20e-04 | (3799.38 ms | 137993 tok/s) step 19399/76294 | train loss 3.505753 | norm 41.7068 | lr 1.20e-04 | (3827.70 ms | 136972 tok/s) step 19400/76294 | train loss 3.469837 | norm 27.3626 | lr 1.20e-04 | (3798.21 ms | 138035 tok/s) step 19401/76294 | train loss 3.448442 | norm 58.7946 | lr 1.20e-04 | (3797.92 ms | 138046 tok/s) step 19402/76294 | train loss 3.517024 | norm 36.1909 | lr 1.20e-04 | (3818.23 ms | 137312 tok/s) step 19403/76294 | train loss 3.485372 | norm 11.7245 | lr 1.20e-04 | (3805.73 ms | 137763 tok/s) step 19404/76294 | train loss 3.448649 | norm 81.0729 | lr 1.20e-04 | (3803.38 ms | 137848 tok/s) step 19405/76294 | train loss 3.580660 | norm 85.6513 | lr 1.20e-04 | (3871.99 ms | 135405 tok/s) step 19406/76294 | train loss 3.525549 | norm 61.0568 | lr 1.20e-04 | (3810.08 ms | 137605 tok/s) step 19407/76294 | train loss 3.542459 | norm 87.0374 | lr 1.20e-04 | (3800.47 ms | 137953 tok/s) step 19408/76294 | train loss 3.626636 | norm 64.6850 | lr 1.20e-04 | (3851.72 ms | 136118 tok/s) step 19409/76294 | train loss 3.488227 | norm 39.7896 | lr 1.20e-04 | (3800.90 ms | 137938 tok/s) step 19410/76294 | train loss 3.540390 | norm 50.9543 | lr 1.20e-04 | (3828.85 ms | 136931 tok/s) step 19411/76294 | train loss 3.478693 | norm 61.2939 | lr 1.20e-04 | (3799.34 ms | 137994 tok/s) step 19412/76294 | train loss 3.562062 | norm 51.6597 | lr 1.20e-04 | (3805.06 ms | 137787 tok/s) step 19413/76294 | train loss 3.417607 | norm 45.8845 | lr 1.20e-04 | (3824.76 ms | 137077 tok/s) step 19414/76294 | train loss 3.552047 | norm 17.3638 | lr 1.20e-04 | (3801.85 ms | 137903 tok/s) step 19415/76294 | train loss 3.458634 | norm 43.5516 | lr 1.20e-04 | (3797.79 ms | 138051 tok/s) step 19416/76294 | train loss 3.499039 | norm 50.0069 | lr 1.20e-04 | (3830.57 ms | 136869 tok/s) step 19417/76294 | train loss 3.479254 | norm 67.8384 | lr 1.20e-04 | (3795.72 ms | 138126 tok/s) step 19418/76294 | train loss 3.477135 | norm 47.9857 | lr 1.20e-04 | (3804.43 
ms | 137810 tok/s) step 19419/76294 | train loss 3.477193 | norm 37.9950 | lr 1.20e-04 | (3823.07 ms | 137138 tok/s) step 19420/76294 | train loss 3.590080 | norm 50.9961 | lr 1.20e-04 | (3808.57 ms | 137660 tok/s) step 19421/76294 | train loss 3.575488 | norm 21.6011 | lr 1.20e-04 | (3799.64 ms | 137984 tok/s) step 19422/76294 | train loss 3.475265 | norm 28.1370 | lr 1.20e-04 | (3828.99 ms | 136926 tok/s) step 19423/76294 | train loss 3.499023 | norm 28.7175 | lr 1.20e-04 | (3798.29 ms | 138033 tok/s) step 19424/76294 | train loss 3.467194 | norm 55.6024 | lr 1.20e-04 | (3801.05 ms | 137932 tok/s) step 19425/76294 | train loss 3.458170 | norm 36.4659 | lr 1.20e-04 | (3817.06 ms | 137354 tok/s) step 19426/76294 | train loss 3.456073 | norm 30.2273 | lr 1.20e-04 | (3812.62 ms | 137514 tok/s) step 19427/76294 | train loss 3.608175 | norm 45.0906 | lr 1.20e-04 | (3810.53 ms | 137589 tok/s) step 19428/76294 | train loss 3.407763 | norm 35.2223 | lr 1.20e-04 | (3811.28 ms | 137562 tok/s) step 19429/76294 | train loss 3.532296 | norm 23.8772 | lr 1.20e-04 | (3815.58 ms | 137407 tok/s) step 19430/76294 | train loss 3.460312 | norm 29.2498 | lr 1.20e-04 | (3846.91 ms | 136288 tok/s) step 19431/76294 | train loss 3.471670 | norm 41.2676 | lr 1.20e-04 | (3826.82 ms | 137004 tok/s) step 19432/76294 | train loss 3.582643 | norm 37.1337 | lr 1.20e-04 | (3837.95 ms | 136606 tok/s) step 19433/76294 | train loss 3.433580 | norm 15.0589 | lr 1.20e-04 | (3806.33 ms | 137741 tok/s) step 19434/76294 | train loss 3.519720 | norm 22.5535 | lr 1.20e-04 | (3837.72 ms | 136615 tok/s) step 19435/76294 | train loss 3.428834 | norm 24.5114 | lr 1.20e-04 | (3806.09 ms | 137750 tok/s) step 19436/76294 | train loss 3.439729 | norm 20.8906 | lr 1.20e-04 | (3808.13 ms | 137676 tok/s) step 19437/76294 | train loss 3.414844 | norm 45.4532 | lr 1.20e-04 | (3832.01 ms | 136818 tok/s) step 19438/76294 | train loss 3.511751 | norm 30.9458 | lr 1.20e-04 | (3818.39 ms | 137306 tok/s) step 19439/76294 | 
train loss 3.476401 | norm 32.4911 | lr 1.20e-04 | (3798.29 ms | 138033 tok/s) step 19440/76294 | train loss 3.537049 | norm 30.7535 | lr 1.20e-04 | (3837.56 ms | 136620 tok/s) step 19441/76294 | train loss 3.456554 | norm 52.0047 | lr 1.20e-04 | (3807.17 ms | 137711 tok/s) step 19442/76294 | train loss 3.506884 | norm 31.6596 | lr 1.20e-04 | (3810.23 ms | 137600 tok/s) step 19443/76294 | train loss 3.464382 | norm 49.1338 | lr 1.20e-04 | (3820.75 ms | 137221 tok/s) step 19444/76294 | train loss 3.480480 | norm 19.4411 | lr 1.20e-04 | (3802.33 ms | 137886 tok/s) step 19445/76294 | train loss 3.458014 | norm 15.7044 | lr 1.20e-04 | (3803.87 ms | 137830 tok/s) step 19446/76294 | train loss 3.447425 | norm 36.8828 | lr 1.20e-04 | (3801.91 ms | 137901 tok/s) step 19447/76294 | train loss 3.496907 | norm 28.0645 | lr 1.20e-04 | (3802.15 ms | 137893 tok/s) step 19448/76294 | train loss 3.503761 | norm 31.5528 | lr 1.20e-04 | (3808.13 ms | 137676 tok/s) step 19449/76294 | train loss 3.550873 | norm 25.5826 | lr 1.20e-04 | (3804.82 ms | 137796 tok/s) step 19450/76294 | train loss 3.457505 | norm 29.6569 | lr 1.20e-04 | (3797.77 ms | 138051 tok/s) step 19451/76294 | train loss 3.460935 | norm 22.7529 | lr 1.20e-04 | (3829.04 ms | 136924 tok/s) step 19452/76294 | train loss 3.416703 | norm 25.4609 | lr 1.20e-04 | (3799.06 ms | 138005 tok/s) step 19453/76294 | train loss 3.460697 | norm 44.6114 | lr 1.20e-04 | (3802.45 ms | 137882 tok/s) step 19454/76294 | train loss 3.472330 | norm 31.8523 | lr 1.20e-04 | (3820.71 ms | 137223 tok/s) step 19455/76294 | train loss 3.475210 | norm 23.8080 | lr 1.20e-04 | (3841.43 ms | 136483 tok/s) step 19456/76294 | train loss 3.454819 | norm 24.5417 | lr 1.20e-04 | (3812.70 ms | 137511 tok/s) step 19457/76294 | train loss 3.456279 | norm 29.2696 | lr 1.20e-04 | (3833.86 ms | 136752 tok/s) step 19458/76294 | train loss 3.482959 | norm 38.3332 | lr 1.20e-04 | (3828.30 ms | 136951 tok/s) step 19459/76294 | train loss 3.500164 | norm 13.9178 | lr 
1.20e-04 | (3798.00 ms | 138043 tok/s) step 19460/76294 | train loss 3.491087 | norm 17.6517 | lr 1.20e-04 | (3806.19 ms | 137746 tok/s) step 19461/76294 | train loss 3.469831 | norm 33.2346 | lr 1.20e-04 | (3797.25 ms | 138070 tok/s) step 19462/76294 | train loss 3.461001 | norm 30.3204 | lr 1.20e-04 | (3806.25 ms | 137744 tok/s) step 19463/76294 | train loss 3.467120 | norm 24.0371 | lr 1.20e-04 | (3800.91 ms | 137938 tok/s) step 19464/76294 | train loss 3.445118 | norm 4.2894 | lr 1.20e-04 | (3851.64 ms | 136121 tok/s) step 19465/76294 | train loss 3.505535 | norm 49.5384 | lr 1.20e-04 | (3806.58 ms | 137732 tok/s) step 19466/76294 | train loss 3.625347 | norm 67.9181 | lr 1.20e-04 | (3802.28 ms | 137888 tok/s) step 19467/76294 | train loss 3.603089 | norm 73.9554 | lr 1.20e-04 | (3801.32 ms | 137923 tok/s) step 19468/76294 | train loss 3.582700 | norm 82.5915 | lr 1.20e-04 | (3804.43 ms | 137810 tok/s) step 19469/76294 | train loss 3.604985 | norm 41.5018 | lr 1.20e-04 | (3827.28 ms | 136987 tok/s) step 19470/76294 | train loss 3.539443 | norm 26.3348 | lr 1.20e-04 | (3800.41 ms | 137956 tok/s) step 19471/76294 | train loss 3.561449 | norm 27.8622 | lr 1.20e-04 | (3825.34 ms | 137056 tok/s) step 19472/76294 | train loss 3.562709 | norm 30.8872 | lr 1.20e-04 | (3797.92 ms | 138046 tok/s) step 19473/76294 | train loss 3.488753 | norm 32.2398 | lr 1.20e-04 | (3859.62 ms | 135839 tok/s) step 19474/76294 | train loss 3.616133 | norm 40.7509 | lr 1.20e-04 | (3803.63 ms | 137839 tok/s) step 19475/76294 | train loss 3.470050 | norm 38.1318 | lr 1.20e-04 | (3811.59 ms | 137551 tok/s) step 19476/76294 | train loss 3.567466 | norm 22.0733 | lr 1.20e-04 | (3829.50 ms | 136908 tok/s) step 19477/76294 | train loss 3.479142 | norm 35.3658 | lr 1.20e-04 | (3808.75 ms | 137653 tok/s) step 19478/76294 | train loss 3.455084 | norm 30.9356 | lr 1.20e-04 | (3815.05 ms | 137426 tok/s) step 19479/76294 | train loss 3.432556 | norm 51.2589 | lr 1.20e-04 | (3909.90 ms | 134092 tok/s) 
step 19480/76294 | train loss 3.435984 | norm 37.0137 | lr 1.20e-04 | (3805.57 ms | 137769 tok/s) step 19481/76294 | train loss 3.512294 | norm 35.9016 | lr 1.20e-04 | (3834.35 ms | 136734 tok/s) step 19482/76294 | train loss 3.434664 | norm 33.7726 | lr 1.20e-04 | (3803.35 ms | 137849 tok/s) step 19483/76294 | train loss 3.494869 | norm 14.3443 | lr 1.20e-04 | (3828.92 ms | 136928 tok/s) step 19484/76294 | train loss 3.419888 | norm 7.4924 | lr 1.20e-04 | (3807.15 ms | 137711 tok/s) step 19485/76294 | train loss 3.441065 | norm 7.4648 | lr 1.20e-04 | (3810.80 ms | 137579 tok/s) step 19486/76294 | train loss 3.478617 | norm 11.8485 | lr 1.20e-04 | (3829.86 ms | 136895 tok/s) step 19487/76294 | train loss 3.410026 | norm 8.4467 | lr 1.20e-04 | (3807.96 ms | 137682 tok/s) step 19488/76294 | train loss 3.572981 | norm 8.6259 | lr 1.20e-04 | (3828.87 ms | 136930 tok/s) step 19489/76294 | train loss 3.457530 | norm 10.8594 | lr 1.20e-04 | (3818.99 ms | 137285 tok/s) step 19490/76294 | train loss 3.535933 | norm 41.5406 | lr 1.20e-04 | (3822.99 ms | 137141 tok/s) step 19491/76294 | train loss 3.528670 | norm 34.6654 | lr 1.20e-04 | (3806.42 ms | 137738 tok/s) step 19492/76294 | train loss 3.735438 | norm 66.8482 | lr 1.20e-04 | (3811.42 ms | 137557 tok/s) step 19493/76294 | train loss 3.530499 | norm 58.6808 | lr 1.20e-04 | (3815.56 ms | 137408 tok/s) step 19494/76294 | train loss 3.612089 | norm 28.9107 | lr 1.20e-04 | (3801.85 ms | 137903 tok/s) step 19495/76294 | train loss 3.616872 | norm 12.4083 | lr 1.20e-04 | (3899.04 ms | 134466 tok/s) step 19496/76294 | train loss 3.611865 | norm 16.9454 | lr 1.20e-04 | (3806.26 ms | 137744 tok/s) step 19497/76294 | train loss 3.553215 | norm 14.9506 | lr 1.20e-04 | (3807.11 ms | 137713 tok/s) step 19498/76294 | train loss 3.541668 | norm 25.7097 | lr 1.20e-04 | (3822.27 ms | 137167 tok/s) step 19499/76294 | train loss 3.586814 | norm 24.4269 | lr 1.20e-04 | (5289.10 ms | 99126 tok/s) step 19500/76294 | train loss 3.565364 | 
norm 61.2228 | lr 1.20e-04 | (3804.97 ms | 137790 tok/s) val loss: 3.599533 saving model checkpoint to ./results/gpt2-124M-gqa/step_19500.pth step 19501/76294 | train loss 3.594616 | norm 29.3136 | lr 1.20e-04 | (3820.06 ms | 137246 tok/s) step 19502/76294 | train loss 3.699420 | norm 21.5127 | lr 1.20e-04 | (3790.70 ms | 138309 tok/s) step 19503/76294 | train loss 3.533361 | norm 13.7252 | lr 1.20e-04 | (3875.45 ms | 135285 tok/s) step 19504/76294 | train loss 3.577366 | norm 35.2241 | lr 1.20e-04 | (3795.69 ms | 138127 tok/s) step 19505/76294 | train loss 3.636353 | norm 16.9231 | lr 1.20e-04 | (3802.93 ms | 137864 tok/s) step 19506/76294 | train loss 3.572862 | norm 12.3748 | lr 1.20e-04 | (3821.27 ms | 137203 tok/s) step 19507/76294 | train loss 3.792439 | norm 11.9353 | lr 1.20e-04 | (3802.82 ms | 137868 tok/s) step 19508/76294 | train loss 3.637880 | norm 7.6858 | lr 1.20e-04 | (3798.19 ms | 138036 tok/s) step 19509/76294 | train loss 3.525375 | norm 48.9176 | lr 1.20e-04 | (3825.50 ms | 137051 tok/s) step 19510/76294 | train loss 3.614615 | norm 12.5921 | lr 1.20e-04 | (3799.40 ms | 137992 tok/s) step 19511/76294 | train loss 3.594877 | norm 5.5251 | lr 1.20e-04 | (3853.72 ms | 136047 tok/s) step 19512/76294 | train loss 3.516757 | norm 9.7126 | lr 1.20e-04 | (3798.80 ms | 138014 tok/s) step 19513/76294 | train loss 3.635452 | norm 40.9544 | lr 1.20e-04 | (3806.54 ms | 137734 tok/s) step 19514/76294 | train loss 3.664185 | norm 12.4152 | lr 1.20e-04 | (3826.29 ms | 137022 tok/s) step 19515/76294 | train loss 3.666047 | norm 5.3233 | lr 1.20e-04 | (3801.35 ms | 137922 tok/s) step 19516/76294 | train loss 3.582865 | norm 23.5482 | lr 1.20e-04 | (3812.92 ms | 137503 tok/s) step 19517/76294 | train loss 3.598544 | norm 11.0403 | lr 1.20e-04 | (3805.99 ms | 137753 tok/s) step 19518/76294 | train loss 3.596029 | norm 16.2598 | lr 1.20e-04 | (3808.47 ms | 137664 tok/s) step 19519/76294 | train loss 3.569230 | norm 7.5884 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) 
step 19520/76294 | train loss 3.586927 | norm 16.6451 | lr 1.20e-04 | (3811.17 ms | 137566 tok/s) step 19521/76294 | train loss 3.598374 | norm 6.7866 | lr 1.20e-04 | (3806.45 ms | 137737 tok/s) step 19522/76294 | train loss 3.707355 | norm 5.9140 | lr 1.20e-04 | (3811.67 ms | 137548 tok/s) step 19523/76294 | train loss 3.590738 | norm 10.2997 | lr 1.20e-04 | (3812.93 ms | 137503 tok/s) step 19524/76294 | train loss 3.525349 | norm 7.7685 | lr 1.20e-04 | (3807.47 ms | 137700 tok/s) step 19525/76294 | train loss 3.618310 | norm 6.5620 | lr 1.20e-04 | (3807.23 ms | 137708 tok/s) step 19526/76294 | train loss 3.520971 | norm 6.6933 | lr 1.20e-04 | (3809.32 ms | 137633 tok/s) step 19527/76294 | train loss 3.602435 | norm 8.7319 | lr 1.20e-04 | (3805.27 ms | 137780 tok/s) step 19528/76294 | train loss 3.474723 | norm 7.3287 | lr 1.20e-04 | (3882.97 ms | 135023 tok/s) step 19529/76294 | train loss 3.524090 | norm 3.7336 | lr 1.20e-04 | (3808.20 ms | 137673 tok/s) step 19530/76294 | train loss 3.531994 | norm 10.6959 | lr 1.20e-04 | (3814.84 ms | 137434 tok/s) step 19531/76294 | train loss 3.506124 | norm 11.0157 | lr 1.20e-04 | (3803.52 ms | 137843 tok/s) step 19532/76294 | train loss 3.498054 | norm 5.2864 | lr 1.20e-04 | (3825.23 ms | 137060 tok/s) step 19533/76294 | train loss 3.490916 | norm 11.9154 | lr 1.20e-04 | (3801.22 ms | 137926 tok/s) step 19534/76294 | train loss 3.516053 | norm 10.7580 | lr 1.20e-04 | (3829.00 ms | 136926 tok/s) step 19535/76294 | train loss 3.538904 | norm 13.0314 | lr 1.20e-04 | (3800.87 ms | 137939 tok/s) step 19536/76294 | train loss 3.475430 | norm 3.9463 | lr 1.20e-04 | (3854.51 ms | 136019 tok/s) step 19537/76294 | train loss 3.501625 | norm 4.3837 | lr 1.20e-04 | (3802.73 ms | 137871 tok/s) step 19538/76294 | train loss 3.411469 | norm 10.1245 | lr 1.20e-04 | (3816.16 ms | 137386 tok/s) step 19539/76294 | train loss 3.518178 | norm 20.6421 | lr 1.20e-04 | (3800.13 ms | 137966 tok/s) step 19540/76294 | train loss 3.486062 | norm 
7.7615 | lr 1.20e-04 | (3825.11 ms | 137065 tok/s) step 19541/76294 | train loss 3.442848 | norm 3.6473 | lr 1.20e-04 | (3822.39 ms | 137163 tok/s) step 19542/76294 | train loss 3.564098 | norm 11.3878 | lr 1.20e-04 | (3809.23 ms | 137636 tok/s) step 19543/76294 | train loss 3.448811 | norm 6.8465 | lr 1.20e-04 | (3799.07 ms | 138004 tok/s) step 19544/76294 | train loss 3.466195 | norm 7.3169 | lr 1.20e-04 | (3826.41 ms | 137018 tok/s) step 19545/76294 | train loss 3.475676 | norm 5.8404 | lr 1.20e-04 | (3803.03 ms | 137861 tok/s) step 19546/76294 | train loss 3.455675 | norm 5.5105 | lr 1.20e-04 | (3800.39 ms | 137956 tok/s) step 19547/76294 | train loss 3.507747 | norm 6.0226 | lr 1.20e-04 | (3826.28 ms | 137023 tok/s) step 19548/76294 | train loss 3.477926 | norm 5.8556 | lr 1.20e-04 | (3810.39 ms | 137594 tok/s) step 19549/76294 | train loss 3.660874 | norm 8.8455 | lr 1.20e-04 | (3820.96 ms | 137214 tok/s) step 19550/76294 | train loss 3.491083 | norm 8.5573 | lr 1.20e-04 | (3799.94 ms | 137973 tok/s) step 19551/76294 | train loss 3.532421 | norm 8.4828 | lr 1.20e-04 | (3805.88 ms | 137757 tok/s) step 19552/76294 | train loss 3.519437 | norm 4.5467 | lr 1.20e-04 | (3887.93 ms | 134850 tok/s) step 19553/76294 | train loss 3.443343 | norm 5.2882 | lr 1.20e-04 | (3876.48 ms | 135248 tok/s) step 19554/76294 | train loss 3.644973 | norm 4.5659 | lr 1.20e-04 | (3799.44 ms | 137991 tok/s) step 19555/76294 | train loss 3.448029 | norm 4.2279 | lr 1.20e-04 | (3841.60 ms | 136476 tok/s) step 19556/76294 | train loss 3.523848 | norm 4.2450 | lr 1.20e-04 | (3809.40 ms | 137630 tok/s) step 19557/76294 | train loss 3.394188 | norm 3.4318 | lr 1.20e-04 | (3810.68 ms | 137584 tok/s) step 19558/76294 | train loss 3.511493 | norm 4.5716 | lr 1.20e-04 | (3830.03 ms | 136889 tok/s) step 19559/76294 | train loss 3.420639 | norm 5.4893 | lr 1.20e-04 | (3807.44 ms | 137701 tok/s) step 19560/76294 | train loss 3.493493 | norm 4.3865 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 
19561/76294 | train loss 3.509604 | norm 6.3695 | lr 1.20e-04 | (4088.12 ms | 128247 tok/s) step 19562/76294 | train loss 3.438733 | norm 3.0146 | lr 1.20e-04 | (3878.62 ms | 135174 tok/s) step 19563/76294 | train loss 3.504275 | norm 16.4373 | lr 1.20e-04 | (3793.94 ms | 138191 tok/s) step 19564/76294 | train loss 3.465163 | norm 3.5137 | lr 1.20e-04 | (3797.62 ms | 138057 tok/s) step 19565/76294 | train loss 3.373429 | norm 2.5601 | lr 1.20e-04 | (3887.99 ms | 134848 tok/s) step 19566/76294 | train loss 3.477132 | norm 5.8225 | lr 1.20e-04 | (3778.88 ms | 138742 tok/s) step 19567/76294 | train loss 3.454033 | norm 5.3109 | lr 1.20e-04 | (3899.52 ms | 134449 tok/s) step 19568/76294 | train loss 3.456461 | norm 3.1443 | lr 1.20e-04 | (11440.70 ms | 45827 tok/s) step 19569/76294 | train loss 3.492815 | norm 6.1782 | lr 1.20e-04 | (3818.40 ms | 137306 tok/s) step 19570/76294 | train loss 3.455825 | norm 4.3195 | lr 1.20e-04 | (3851.04 ms | 136142 tok/s) step 19571/76294 | train loss 3.531149 | norm 2.9514 | lr 1.20e-04 | (3760.13 ms | 139434 tok/s) step 19572/76294 | train loss 3.517397 | norm 3.6622 | lr 1.20e-04 | (3861.26 ms | 135782 tok/s) step 19573/76294 | train loss 3.505910 | norm 2.9276 | lr 1.20e-04 | (3764.73 ms | 139263 tok/s) step 19574/76294 | train loss 3.462288 | norm 11.6020 | lr 1.20e-04 | (3770.50 ms | 139050 tok/s) step 19575/76294 | train loss 3.412604 | norm 2.9668 | lr 1.20e-04 | (3795.06 ms | 138150 tok/s) step 19576/76294 | train loss 3.516473 | norm 2.4503 | lr 1.20e-04 | (3774.80 ms | 138892 tok/s) step 19577/76294 | train loss 3.420178 | norm 5.3162 | lr 1.20e-04 | (3781.20 ms | 138656 tok/s) step 19578/76294 | train loss 3.486250 | norm 3.6263 | lr 1.20e-04 | (3781.57 ms | 138643 tok/s) step 19579/76294 | train loss 3.464363 | norm 4.0007 | lr 1.20e-04 | (3787.07 ms | 138441 tok/s) step 19580/76294 | train loss 3.421127 | norm 4.2536 | lr 1.20e-04 | (3784.25 ms | 138545 tok/s) step 19581/76294 | train loss 3.462561 | norm 3.4464 | lr 
1.20e-04 | (3794.66 ms | 138165 tok/s) step 19582/76294 | train loss 3.379142 | norm 6.8693 | lr 1.20e-04 | (3788.56 ms | 138387 tok/s) step 19583/76294 | train loss 3.470857 | norm 12.4587 | lr 1.20e-04 | (3789.57 ms | 138350 tok/s) step 19584/76294 | train loss 3.424535 | norm 15.4856 | lr 1.20e-04 | (3818.92 ms | 137287 tok/s) step 19585/76294 | train loss 3.470441 | norm 9.2440 | lr 1.20e-04 | (3788.48 ms | 138390 tok/s) step 19586/76294 | train loss 3.546298 | norm 10.8621 | lr 1.20e-04 | (3793.01 ms | 138225 tok/s) step 19587/76294 | train loss 3.541040 | norm 6.4776 | lr 1.20e-04 | (3871.97 ms | 135406 tok/s) step 19588/76294 | train loss 3.457012 | norm 2.8312 | lr 1.20e-04 | (3803.66 ms | 137838 tok/s) step 19589/76294 | train loss 3.390036 | norm 3.2504 | lr 1.20e-04 | (3798.01 ms | 138043 tok/s) step 19590/76294 | train loss 3.481330 | norm 9.7224 | lr 1.20e-04 | (3985.30 ms | 131555 tok/s) step 19591/76294 | train loss 3.552238 | norm 6.6265 | lr 1.20e-04 | (3794.65 ms | 138165 tok/s) step 19592/76294 | train loss 3.490747 | norm 8.9420 | lr 1.20e-04 | (3832.99 ms | 136783 tok/s) step 19593/76294 | train loss 3.479954 | norm 4.9979 | lr 1.20e-04 | (3877.69 ms | 135206 tok/s) step 19594/76294 | train loss 3.433565 | norm 6.2120 | lr 1.20e-04 | (3799.23 ms | 137999 tok/s) step 19595/76294 | train loss 3.453906 | norm 16.5596 | lr 1.20e-04 | (3850.07 ms | 136176 tok/s) step 19596/76294 | train loss 3.403625 | norm 30.5421 | lr 1.20e-04 | (3863.07 ms | 135718 tok/s) step 19597/76294 | train loss 3.461211 | norm 6.2218 | lr 1.20e-04 | (3774.07 ms | 138919 tok/s) step 19598/76294 | train loss 3.495037 | norm 3.6399 | lr 1.20e-04 | (3780.07 ms | 138698 tok/s) step 19599/76294 | train loss 3.391041 | norm 3.8933 | lr 1.20e-04 | (3798.51 ms | 138025 tok/s) step 19600/76294 | train loss 3.432848 | norm 4.3740 | lr 1.20e-04 | (3868.42 ms | 135530 tok/s) step 19601/76294 | train loss 3.472598 | norm 4.5268 | lr 1.20e-04 | (3781.57 ms | 138643 tok/s) step 
19602/76294 | train loss 3.457814 | norm 7.6794 | lr 1.20e-04 | (3864.16 ms | 135680 tok/s) step 19603/76294 | train loss 3.444697 | norm 3.6290 | lr 1.20e-04 | (3777.47 ms | 138794 tok/s) step 19604/76294 | train loss 3.463545 | norm 4.0046 | lr 1.20e-04 | (3782.34 ms | 138615 tok/s) step 19605/76294 | train loss 3.396119 | norm 41.0618 | lr 1.20e-04 | (3802.57 ms | 137877 tok/s) step 19606/76294 | train loss 3.468394 | norm 10.4556 | lr 1.20e-04 | (3787.34 ms | 138432 tok/s) step 19607/76294 | train loss 3.398895 | norm 5.3108 | lr 1.20e-04 | (3809.04 ms | 137643 tok/s) step 19608/76294 | train loss 3.511858 | norm 6.1801 | lr 1.20e-04 | (3791.98 ms | 138262 tok/s) step 19609/76294 | train loss 3.484278 | norm 5.4853 | lr 1.20e-04 | (3811.58 ms | 137551 tok/s) step 19610/76294 | train loss 3.404907 | norm 4.0365 | lr 1.20e-04 | (3797.91 ms | 138046 tok/s) step 19611/76294 | train loss 3.509065 | norm 5.3931 | lr 1.20e-04 | (3794.20 ms | 138182 tok/s) step 19612/76294 | train loss 3.358953 | norm 4.2501 | lr 1.20e-04 | (3873.14 ms | 135365 tok/s) step 19613/76294 | train loss 3.498170 | norm 6.9722 | lr 1.20e-04 | (3797.01 ms | 138079 tok/s) step 19614/76294 | train loss 3.447515 | norm 6.5731 | lr 1.20e-04 | (3825.07 ms | 137066 tok/s) step 19615/76294 | train loss 3.568850 | norm 7.0009 | lr 1.20e-04 | (3800.83 ms | 137940 tok/s) step 19616/76294 | train loss 3.502064 | norm 7.4452 | lr 1.20e-04 | (3803.83 ms | 137832 tok/s) step 19617/76294 | train loss 3.386505 | norm 6.8323 | lr 1.20e-04 | (3822.46 ms | 137160 tok/s) step 19618/76294 | train loss 3.423129 | norm 9.5431 | lr 1.20e-04 | (3807.25 ms | 137708 tok/s) step 19619/76294 | train loss 3.434214 | norm 11.1203 | lr 1.20e-04 | (3813.93 ms | 137467 tok/s) step 19620/76294 | train loss 3.424721 | norm 41.2303 | lr 1.20e-04 | (3806.40 ms | 137738 tok/s) step 19621/76294 | train loss 3.446737 | norm 8.5289 | lr 1.20e-04 | (3835.49 ms | 136694 tok/s) step 19622/76294 | train loss 3.383796 | norm 3.8045 | lr 
1.20e-04 | (3822.38 ms | 137163 tok/s) step 19623/76294 | train loss 3.410910 | norm 4.4799 | lr 1.20e-04 | (3824.25 ms | 137096 tok/s) step 19624/76294 | train loss 3.477268 | norm 6.8030 | lr 1.20e-04 | (3809.64 ms | 137621 tok/s) step 19625/76294 | train loss 3.396773 | norm 16.5115 | lr 1.20e-04 | (3842.16 ms | 136457 tok/s) step 19626/76294 | train loss 3.464300 | norm 5.2317 | lr 1.20e-04 | (3804.64 ms | 137802 tok/s) step 19627/76294 | train loss 3.564555 | norm 8.5089 | lr 1.20e-04 | (3811.96 ms | 137537 tok/s) step 19628/76294 | train loss 3.405131 | norm 6.4226 | lr 1.20e-04 | (3834.60 ms | 136725 tok/s) step 19629/76294 | train loss 3.500775 | norm 5.0295 | lr 1.20e-04 | (3811.45 ms | 137556 tok/s) step 19630/76294 | train loss 3.448040 | norm 7.7715 | lr 1.20e-04 | (3925.11 ms | 133573 tok/s) step 19631/76294 | train loss 3.468753 | norm 6.4746 | lr 1.20e-04 | (3813.34 ms | 137488 tok/s) step 19632/76294 | train loss 3.448044 | norm 10.8116 | lr 1.20e-04 | (3851.75 ms | 136117 tok/s) step 19633/76294 | train loss 3.411628 | norm 5.6571 | lr 1.20e-04 | (3827.49 ms | 136980 tok/s) step 19634/76294 | train loss 3.489562 | norm 3.7086 | lr 1.20e-04 | (3829.84 ms | 136895 tok/s) step 19635/76294 | train loss 3.552138 | norm 9.0931 | lr 1.20e-04 | (3869.00 ms | 135510 tok/s) step 19636/76294 | train loss 3.464957 | norm 18.3182 | lr 1.20e-04 | (3837.13 ms | 136635 tok/s) step 19637/76294 | train loss 3.549978 | norm 4.5578 | lr 1.20e-04 | (3840.67 ms | 136510 tok/s) step 19638/76294 | train loss 3.490919 | norm 3.7476 | lr 1.20e-04 | (3829.83 ms | 136896 tok/s) step 19639/76294 | train loss 3.485995 | norm 8.5745 | lr 1.20e-04 | (3824.48 ms | 137087 tok/s) step 19640/76294 | train loss 3.495822 | norm 9.7322 | lr 1.20e-04 | (3808.54 ms | 137661 tok/s) step 19641/76294 | train loss 3.536951 | norm 9.0723 | lr 1.20e-04 | (3804.73 ms | 137799 tok/s) step 19642/76294 | train loss 3.493886 | norm 19.3091 | lr 1.20e-04 | (3826.91 ms | 137000 tok/s) step 19643/76294 
| train loss 3.512904 | norm 5.4076 | lr 1.20e-04 | (3811.68 ms | 137548 tok/s) step 19644/76294 | train loss 3.545702 | norm 13.4629 | lr 1.20e-04 | (3816.48 ms | 137375 tok/s) step 19645/76294 | train loss 3.504649 | norm 4.5993 | lr 1.20e-04 | (3834.36 ms | 136734 tok/s) step 19646/76294 | train loss 3.475270 | norm 13.0554 | lr 1.20e-04 | (3804.69 ms | 137800 tok/s) step 19647/76294 | train loss 3.533410 | norm 8.3838 | lr 1.20e-04 | (3805.66 ms | 137765 tok/s) step 19648/76294 | train loss 3.493335 | norm 10.4151 | lr 1.20e-04 | (3831.43 ms | 136839 tok/s) step 19649/76294 | train loss 3.489764 | norm 9.7674 | lr 1.20e-04 | (3804.25 ms | 137816 tok/s) step 19650/76294 | train loss 3.549497 | norm 16.1828 | lr 1.20e-04 | (3823.07 ms | 137138 tok/s) step 19651/76294 | train loss 3.479990 | norm 594.9227 | lr 1.20e-04 | (3830.82 ms | 136861 tok/s) step 19652/76294 | train loss 3.577924 | norm 11.5555 | lr 1.20e-04 | (3798.16 ms | 138037 tok/s) step 19653/76294 | train loss 3.573839 | norm 11.5809 | lr 1.20e-04 | (3830.52 ms | 136871 tok/s) step 19654/76294 | train loss 3.499210 | norm 17.7455 | lr 1.20e-04 | (3800.68 ms | 137946 tok/s) step 19655/76294 | train loss 3.495546 | norm 188.7234 | lr 1.20e-04 | (3801.42 ms | 137919 tok/s) step 19656/76294 | train loss 3.451122 | norm 37.7670 | lr 1.20e-04 | (3819.05 ms | 137282 tok/s) step 19657/76294 | train loss 3.572929 | norm 7.3530 | lr 1.20e-04 | (3798.78 ms | 138015 tok/s) step 19658/76294 | train loss 3.459015 | norm 13.4566 | lr 1.20e-04 | (3804.97 ms | 137790 tok/s) step 19659/76294 | train loss 3.513915 | norm 8.0364 | lr 1.20e-04 | (3817.15 ms | 137351 tok/s) step 19660/76294 | train loss 3.584145 | norm 9.9881 | lr 1.20e-04 | (3807.68 ms | 137692 tok/s) step 19661/76294 | train loss 3.462842 | norm 10.0314 | lr 1.20e-04 | (3876.55 ms | 135246 tok/s) step 19662/76294 | train loss 3.441074 | norm 5.4340 | lr 1.20e-04 | (3828.01 ms | 136961 tok/s) step 19663/76294 | train loss 3.424702 | norm 5.1754 | lr 
1.20e-04 | (3815.59 ms | 137407 tok/s) step 19664/76294 | train loss 3.506002 | norm 7.5972 | lr 1.20e-04 | (3828.42 ms | 136946 tok/s) step 19665/76294 | train loss 3.459877 | norm 6.8249 | lr 1.20e-04 | (3798.62 ms | 138021 tok/s) step 19666/76294 | train loss 3.408229 | norm 8.9819 | lr 1.20e-04 | (3804.59 ms | 137804 tok/s) step 19667/76294 | train loss 3.580170 | norm 11.1848 | lr 1.20e-04 | (4052.77 ms | 129365 tok/s) step 19668/76294 | train loss 3.395962 | norm 4.0604 | lr 1.20e-04 | (3798.85 ms | 138012 tok/s) step 19669/76294 | train loss 3.503466 | norm 7.7924 | lr 1.20e-04 | (3802.48 ms | 137881 tok/s) step 19670/76294 | train loss 3.410119 | norm 3.7937 | lr 1.20e-04 | (3819.50 ms | 137266 tok/s) step 19671/76294 | train loss 3.469037 | norm 4.5582 | lr 1.20e-04 | (3803.95 ms | 137827 tok/s) step 19672/76294 | train loss 3.462178 | norm 3.9258 | lr 1.20e-04 | (3802.84 ms | 137867 tok/s) step 19673/76294 | train loss 3.411696 | norm 3.7135 | lr 1.20e-04 | (3803.39 ms | 137847 tok/s) step 19674/76294 | train loss 3.516839 | norm 4.4653 | lr 1.20e-04 | (3802.96 ms | 137863 tok/s) step 19675/76294 | train loss 3.468705 | norm 4.1309 | lr 1.20e-04 | (3798.75 ms | 138016 tok/s) step 19676/76294 | train loss 3.389755 | norm 4.2029 | lr 1.20e-04 | (3795.49 ms | 138135 tok/s) step 19677/76294 | train loss 3.450356 | norm 2.9514 | lr 1.20e-04 | (3846.27 ms | 136311 tok/s) step 19678/76294 | train loss 3.424433 | norm 2.9593 | lr 1.20e-04 | (3797.38 ms | 138066 tok/s) step 19679/76294 | train loss 3.462746 | norm 2.6964 | lr 1.20e-04 | (3844.15 ms | 136386 tok/s) step 19680/76294 | train loss 3.423948 | norm 2.8745 | lr 1.20e-04 | (3798.35 ms | 138030 tok/s) step 19681/76294 | train loss 3.398917 | norm 4.2998 | lr 1.20e-04 | (3866.48 ms | 135598 tok/s) step 19682/76294 | train loss 3.407324 | norm 3.4981 | lr 1.20e-04 | (3821.62 ms | 137190 tok/s) step 19683/76294 | train loss 3.464767 | norm 3.2092 | lr 1.20e-04 | (3833.71 ms | 136757 tok/s) step 19684/76294 | 
train loss 3.473683 | norm 2.6201 | lr 1.20e-04 | (3955.49 ms | 132547 tok/s) step 19685/76294 | train loss 3.466073 | norm 4.1800 | lr 1.20e-04 | (3792.64 ms | 138238 tok/s) step 19686/76294 | train loss 3.502435 | norm 3.6323 | lr 1.20e-04 | (3994.40 ms | 131256 tok/s) step 19687/76294 | train loss 3.387440 | norm 4.9165 | lr 1.20e-04 | (3859.88 ms | 135830 tok/s) step 19688/76294 | train loss 3.353263 | norm 3.9722 | lr 1.20e-04 | (3833.40 ms | 136768 tok/s) step 19689/76294 | train loss 3.510139 | norm 6.1178 | lr 1.20e-04 | (3899.14 ms | 134462 tok/s) step 19690/76294 | train loss 3.407175 | norm 2.9434 | lr 1.20e-04 | (3774.03 ms | 138920 tok/s) step 19691/76294 | train loss 3.458714 | norm 3.6222 | lr 1.20e-04 | (3855.96 ms | 135968 tok/s) step 19692/76294 | train loss 3.439833 | norm 3.9089 | lr 1.20e-04 | (3767.94 ms | 139144 tok/s) step 19693/76294 | train loss 3.445709 | norm 2.2742 | lr 1.20e-04 | (3877.07 ms | 135228 tok/s) step 19694/76294 | train loss 3.499487 | norm 9.0417 | lr 1.20e-04 | (3826.63 ms | 137010 tok/s) step 19695/76294 | train loss 3.433814 | norm 3.8643 | lr 1.20e-04 | (3861.14 ms | 135786 tok/s) step 19696/76294 | train loss 3.450372 | norm 2.8412 | lr 1.20e-04 | (3768.95 ms | 139107 tok/s) step 19697/76294 | train loss 3.444319 | norm 6.8435 | lr 1.20e-04 | (3774.57 ms | 138900 tok/s) step 19698/76294 | train loss 3.403390 | norm 6.3018 | lr 1.20e-04 | (3792.44 ms | 138245 tok/s) step 19699/76294 | train loss 3.487402 | norm 11.7018 | lr 1.20e-04 | (3834.00 ms | 136747 tok/s) step 19700/76294 | train loss 3.432475 | norm 3.7624 | lr 1.20e-04 | (3783.48 ms | 138573 tok/s) step 19701/76294 | train loss 3.429713 | norm 5.3493 | lr 1.20e-04 | (3787.70 ms | 138419 tok/s) step 19702/76294 | train loss 3.450052 | norm 3.8938 | lr 1.20e-04 | (3783.17 ms | 138584 tok/s) step 19703/76294 | train loss 3.403607 | norm 3.1608 | lr 1.20e-04 | (3793.38 ms | 138211 tok/s) step 19704/76294 | train loss 3.393709 | norm 3.1502 | lr 1.20e-04 | (3808.22 
ms | 137673 tok/s) step 19705/76294 | train loss 3.400261 | norm 4.3263 | lr 1.20e-04 | (3791.46 ms | 138281 tok/s) step 19706/76294 | train loss 3.414448 | norm 4.7489 | lr 1.20e-04 | (3799.25 ms | 137998 tok/s) step 19707/76294 | train loss 3.444706 | norm 8.1441 | lr 1.20e-04 | (3792.36 ms | 138248 tok/s) step 19708/76294 | train loss 3.437208 | norm 5.9762 | lr 1.20e-04 | (3791.31 ms | 138287 tok/s) step 19709/76294 | train loss 3.506605 | norm 5.1236 | lr 1.20e-04 | (3827.48 ms | 136980 tok/s) step 19710/76294 | train loss 3.394507 | norm 4.8077 | lr 1.20e-04 | (3794.26 ms | 138179 tok/s) step 19711/76294 | train loss 3.433252 | norm 4.2190 | lr 1.20e-04 | (3843.69 ms | 136402 tok/s) step 19712/76294 | train loss 3.427665 | norm 2.8604 | lr 1.20e-04 | (3797.73 ms | 138053 tok/s) step 19713/76294 | train loss 3.406442 | norm 4.8275 | lr 1.20e-04 | (3800.23 ms | 137962 tok/s) step 19714/76294 | train loss 3.455660 | norm 6.5213 | lr 1.20e-04 | (3820.93 ms | 137215 tok/s) step 19715/76294 | train loss 3.482477 | norm 8.6430 | lr 1.20e-04 | (3806.70 ms | 137728 tok/s) step 19716/76294 | train loss 3.331347 | norm 4.5017 | lr 1.20e-04 | (3827.13 ms | 136993 tok/s) step 19717/76294 | train loss 3.456745 | norm 11.2226 | lr 1.20e-04 | (3809.69 ms | 137619 tok/s) step 19718/76294 | train loss 3.368028 | norm 5.2886 | lr 1.20e-04 | (3809.80 ms | 137616 tok/s) step 19719/76294 | train loss 3.400611 | norm 6.9295 | lr 1.20e-04 | (3809.42 ms | 137629 tok/s) step 19720/76294 | train loss 3.480536 | norm 6.6739 | lr 1.20e-04 | (3811.83 ms | 137542 tok/s) step 19721/76294 | train loss 3.459264 | norm 6.1212 | lr 1.20e-04 | (3809.05 ms | 137643 tok/s) step 19722/76294 | train loss 3.412786 | norm 4.5897 | lr 1.20e-04 | (3822.95 ms | 137142 tok/s) step 19723/76294 | train loss 3.432827 | norm 3.2927 | lr 1.20e-04 | (3804.73 ms | 137799 tok/s) step 19724/76294 | train loss 3.487566 | norm 3.7010 | lr 1.20e-04 | (3959.29 ms | 132420 tok/s) step 19725/76294 | train loss 3.421595 
| norm 2.8691 | lr 1.20e-04 | (3805.41 ms | 137774 tok/s) step 19726/76294 | train loss 3.474905 | norm 4.2272 | lr 1.20e-04 | (3817.49 ms | 137338 tok/s) step 19727/76294 | train loss 3.510749 | norm 3.7783 | lr 1.20e-04 | (3829.62 ms | 136903 tok/s) step 19728/76294 | train loss 3.385447 | norm 3.1105 | lr 1.20e-04 | (3808.22 ms | 137673 tok/s) step 19729/76294 | train loss 3.439688 | norm 6.4944 | lr 1.20e-04 | (3808.46 ms | 137664 tok/s) step 19730/76294 | train loss 3.426180 | norm 2.9395 | lr 1.20e-04 | (3804.99 ms | 137790 tok/s) step 19731/76294 | train loss 3.483772 | norm 4.6675 | lr 1.20e-04 | (3809.80 ms | 137616 tok/s) step 19732/76294 | train loss 3.431365 | norm 4.7131 | lr 1.20e-04 | (3806.61 ms | 137731 tok/s) step 19733/76294 | train loss 3.422544 | norm 2.7489 | lr 1.20e-04 | (3803.27 ms | 137852 tok/s) step 19734/76294 | train loss 3.455355 | norm 6.2680 | lr 1.20e-04 | (3833.78 ms | 136755 tok/s) step 19735/76294 | train loss 3.444517 | norm 4.3388 | lr 1.20e-04 | (3804.86 ms | 137794 tok/s) step 19736/76294 | train loss 3.452949 | norm 7.0520 | lr 1.20e-04 | (3808.47 ms | 137664 tok/s) step 19737/76294 | train loss 3.510102 | norm 5.2393 | lr 1.20e-04 | (3825.79 ms | 137040 tok/s) step 19738/76294 | train loss 3.481520 | norm 5.6319 | lr 1.20e-04 | (3806.25 ms | 137744 tok/s) step 19739/76294 | train loss 3.533156 | norm 5.2710 | lr 1.20e-04 | (3809.91 ms | 137612 tok/s) step 19740/76294 | train loss 3.423334 | norm 4.3749 | lr 1.20e-04 | (3805.65 ms | 137766 tok/s) step 19741/76294 | train loss 3.476374 | norm 3.9402 | lr 1.20e-04 | (3805.80 ms | 137760 tok/s) step 19742/76294 | train loss 3.444341 | norm 3.9112 | lr 1.20e-04 | (3806.51 ms | 137735 tok/s) step 19743/76294 | train loss 3.416620 | norm 6.7685 | lr 1.20e-04 | (3823.98 ms | 137105 tok/s) step 19744/76294 | train loss 3.442659 | norm 10.0789 | lr 1.20e-04 | (3804.80 ms | 137797 tok/s) step 19745/76294 | train loss 3.436678 | norm 6.7281 | lr 1.20e-04 | (3805.56 ms | 137769 tok/s) 
step 19746/76294 | train loss 3.394176 | norm 5.1785 | lr 1.20e-04 | (3826.84 ms | 137003 tok/s) step 19747/76294 | train loss 3.564483 | norm 5.2551 | lr 1.20e-04 | (3804.33 ms | 137814 tok/s) step 19748/76294 | train loss 3.422845 | norm 3.3852 | lr 1.20e-04 | (3803.05 ms | 137860 tok/s) step 19749/76294 | train loss 3.522125 | norm 68.8911 | lr 1.20e-04 | (3982.56 ms | 131646 tok/s) step 19750/76294 | train loss 3.478967 | norm 4.9331 | lr 1.20e-04 | (3797.86 ms | 138048 tok/s)
val loss: 3.461416
saving model checkpoint to ./results/gpt2-124M-gqa/step_19750.pth
step 19751/76294 | train loss 3.486058 | norm 5.7238 | lr 1.20e-04 | (3820.57 ms | 137228 tok/s) step 19752/76294 | train loss 3.463944 | norm 3.3285 | lr 1.20e-04 | (4123.28 ms | 127153 tok/s) step 19753/76294 | train loss 3.471403 | norm 2.5102 | lr 1.20e-04 | (3828.98 ms | 136926 tok/s) step 19754/76294 | train loss 3.527663 | norm 3.0950 | lr 1.20e-04 | (3799.68 ms | 137982 tok/s) step 19755/76294 | train loss 3.453831 | norm 3.1511 | lr 1.20e-04 | (3796.65 ms | 138092 tok/s) step 19756/76294 | train loss 3.407931 | norm 4.4863 | lr 1.20e-04 | (3798.80 ms | 138014 tok/s) step 19757/76294 | train loss 3.466289 | norm 10.7416 | lr 1.20e-04 | (3825.17 ms | 137063 tok/s) step 19758/76294 | train loss 3.469535 | norm 3.5662 | lr 1.20e-04 | (3800.14 ms | 137965 tok/s) step 19759/76294 | train loss 3.445124 | norm 2.6009 | lr 1.20e-04 | (3811.88 ms | 137541 tok/s) step 19760/76294 | train loss 3.434271 | norm 2.9233 | lr 1.20e-04 | (3797.88 ms | 138048 tok/s) step 19761/76294 | train loss 3.495561 | norm 6.8951 | lr 1.20e-04 | (3800.58 ms | 137950 tok/s) step 19762/76294 | train loss 3.475518 | norm 3.8115 | lr 1.20e-04 | (3822.37 ms | 137163 tok/s) step 19763/76294 | train loss 3.434729 | norm 2.2622 | lr 1.20e-04 | (3805.96 ms | 137755 tok/s) step 19764/76294 | train loss 3.438559 | norm 2.8358 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 19765/76294 | train loss 3.441646 | norm 3.8974 | lr 1.20e-04 | 
(3802.60 ms | 137876 tok/s) step 19766/76294 | train loss 3.510419 | norm 3.3441 | lr 1.20e-04 | (3806.48 ms | 137736 tok/s) step 19767/76294 | train loss 3.512444 | norm 3.4002 | lr 1.20e-04 | (3800.05 ms | 137969 tok/s) step 19768/76294 | train loss 3.445863 | norm 3.8002 | lr 1.20e-04 | (3832.36 ms | 136805 tok/s) step 19769/76294 | train loss 3.482512 | norm 10.4239 | lr 1.20e-04 | (3798.24 ms | 138035 tok/s) step 19770/76294 | train loss 3.603567 | norm 18.0469 | lr 1.20e-04 | (3815.59 ms | 137407 tok/s) step 19771/76294 | train loss 3.481183 | norm 3.9597 | lr 1.20e-04 | (3802.91 ms | 137865 tok/s) step 19772/76294 | train loss 3.443372 | norm 3.0975 | lr 1.20e-04 | (3801.81 ms | 137905 tok/s) step 19773/76294 | train loss 3.407978 | norm 9.1171 | lr 1.20e-04 | (3814.71 ms | 137438 tok/s) step 19774/76294 | train loss 3.424413 | norm 5.4062 | lr 1.20e-04 | (3883.13 ms | 135017 tok/s) step 19775/76294 | train loss 3.522897 | norm 3.1139 | lr 1.20e-04 | (3796.04 ms | 138114 tok/s) step 19776/76294 | train loss 3.508600 | norm 4.9911 | lr 1.20e-04 | (3959.55 ms | 132411 tok/s) step 19777/76294 | train loss 3.572936 | norm 7.8823 | lr 1.20e-04 | (3794.00 ms | 138189 tok/s) step 19778/76294 | train loss 3.485056 | norm 5.4031 | lr 1.20e-04 | (3794.09 ms | 138186 tok/s) step 19779/76294 | train loss 3.473320 | norm 7.0243 | lr 1.20e-04 | (3814.60 ms | 137442 tok/s) step 19780/76294 | train loss 3.574532 | norm 3.4684 | lr 1.20e-04 | (4329.46 ms | 121098 tok/s) step 19781/76294 | train loss 3.456306 | norm 3.5377 | lr 1.20e-04 | (3810.84 ms | 137578 tok/s) step 19782/76294 | train loss 3.469389 | norm 2.8848 | lr 1.20e-04 | (3849.55 ms | 136194 tok/s) step 19783/76294 | train loss 3.494006 | norm 3.0740 | lr 1.20e-04 | (3793.67 ms | 138201 tok/s) step 19784/76294 | train loss 3.443899 | norm 3.3304 | lr 1.20e-04 | (3800.88 ms | 137939 tok/s) step 19785/76294 | train loss 3.474389 | norm 5.1352 | lr 1.20e-04 | (3823.89 ms | 137108 tok/s) step 19786/76294 | train loss 
3.444053 | norm 3.4198 | lr 1.20e-04 | (3799.39 ms | 137993 tok/s) step 19787/76294 | train loss 3.440860 | norm 2.3933 | lr 1.20e-04 | (3797.95 ms | 138045 tok/s) step 19788/76294 | train loss 3.435111 | norm 3.3662 | lr 1.20e-04 | (3829.00 ms | 136925 tok/s) step 19789/76294 | train loss 3.463161 | norm 3.1249 | lr 1.20e-04 | (3794.26 ms | 138179 tok/s) step 19790/76294 | train loss 3.455544 | norm 2.2242 | lr 1.20e-04 | (3803.95 ms | 137827 tok/s) step 19791/76294 | train loss 3.433704 | norm 3.5886 | lr 1.20e-04 | (3817.88 ms | 137324 tok/s) step 19792/76294 | train loss 3.455105 | norm 8.8362 | lr 1.20e-04 | (3795.68 ms | 138128 tok/s) step 19793/76294 | train loss 3.473372 | norm 4.3344 | lr 1.20e-04 | (3795.65 ms | 138129 tok/s) step 19794/76294 | train loss 3.501980 | norm 6.2812 | lr 1.20e-04 | (3824.15 ms | 137099 tok/s) step 19795/76294 | train loss 3.481518 | norm 3.8003 | lr 1.20e-04 | (3795.82 ms | 138122 tok/s) step 19796/76294 | train loss 3.424796 | norm 5.8500 | lr 1.20e-04 | (3803.05 ms | 137860 tok/s) step 19797/76294 | train loss 3.528394 | norm 5.9399 | lr 1.20e-04 | (3817.85 ms | 137326 tok/s) step 19798/76294 | train loss 3.372995 | norm 8.2358 | lr 1.20e-04 | (3869.01 ms | 135510 tok/s) step 19799/76294 | train loss 3.455130 | norm 6.5985 | lr 1.20e-04 | (3797.60 ms | 138058 tok/s) step 19800/76294 | train loss 3.426376 | norm 11.2602 | lr 1.20e-04 | (3843.05 ms | 136425 tok/s) step 19801/76294 | train loss 3.493736 | norm 4.5833 | lr 1.20e-04 | (3795.09 ms | 138149 tok/s) step 19802/76294 | train loss 3.448390 | norm 9.7739 | lr 1.20e-04 | (3806.50 ms | 137735 tok/s) step 19803/76294 | train loss 3.462340 | norm 14.1914 | lr 1.20e-04 | (3796.24 ms | 138107 tok/s) step 19804/76294 | train loss 3.392452 | norm 10.7807 | lr 1.20e-04 | (3812.13 ms | 137531 tok/s) step 19805/76294 | train loss 3.459907 | norm 7.9482 | lr 1.20e-04 | (3793.76 ms | 138197 tok/s) step 19806/76294 | train loss 3.441727 | norm 16.2312 | lr 1.20e-04 | (3802.48 ms | 
137880 tok/s) step 19807/76294 | train loss 3.514802 | norm 7.9536 | lr 1.20e-04 | (3815.28 ms | 137418 tok/s) step 19808/76294 | train loss 3.461629 | norm 13.6496 | lr 1.20e-04 | (3806.18 ms | 137747 tok/s) step 19809/76294 | train loss 3.472920 | norm 7.5887 | lr 1.20e-04 | (3802.31 ms | 137887 tok/s) step 19810/76294 | train loss 3.502623 | norm 14.1343 | lr 1.20e-04 | (3829.78 ms | 136898 tok/s) step 19811/76294 | train loss 3.452133 | norm 13.5464 | lr 1.20e-04 | (3803.34 ms | 137849 tok/s) step 19812/76294 | train loss 3.483582 | norm 15.5248 | lr 1.20e-04 | (3858.06 ms | 135894 tok/s) step 19813/76294 | train loss 3.861554 | norm 12.2065 | lr 1.20e-04 | (3799.86 ms | 137976 tok/s) step 19814/76294 | train loss 3.488766 | norm 11.7958 | lr 1.20e-04 | (3808.36 ms | 137668 tok/s) step 19815/76294 | train loss 3.514912 | norm 14.3320 | lr 1.20e-04 | (3828.11 ms | 136957 tok/s) step 19816/76294 | train loss 3.597309 | norm 6.4838 | lr 1.20e-04 | (3804.94 ms | 137791 tok/s) step 19817/76294 | train loss 3.409899 | norm 17.5203 | lr 1.20e-04 | (3810.32 ms | 137597 tok/s) step 19818/76294 | train loss 3.494943 | norm 13.3835 | lr 1.20e-04 | (3806.96 ms | 137718 tok/s) step 19819/76294 | train loss 3.488616 | norm 13.2877 | lr 1.20e-04 | (3813.62 ms | 137478 tok/s) step 19820/76294 | train loss 3.479446 | norm 18.5536 | lr 1.20e-04 | (3812.20 ms | 137529 tok/s) step 19821/76294 | train loss 3.478455 | norm 11.3866 | lr 1.20e-04 | (3815.21 ms | 137421 tok/s) step 19822/76294 | train loss 3.508810 | norm 12.5284 | lr 1.20e-04 | (3807.65 ms | 137693 tok/s) step 19823/76294 | train loss 3.555172 | norm 12.2153 | lr 1.20e-04 | (3805.99 ms | 137753 tok/s) step 19824/76294 | train loss 3.451214 | norm 16.7379 | lr 1.20e-04 | (3840.63 ms | 136511 tok/s) step 19825/76294 | train loss 3.500375 | norm 9.1991 | lr 1.20e-04 | (3803.56 ms | 137841 tok/s) step 19826/76294 | train loss 3.470754 | norm 17.6353 | lr 1.20e-04 | (3822.83 ms | 137146 tok/s) step 19827/76294 | train loss 
3.460781 | norm 11.8817 | lr 1.20e-04 | (3818.03 ms | 137319 tok/s) step 19828/76294 | train loss 3.464325 | norm 7.7522 | lr 1.20e-04 | (3867.55 ms | 135561 tok/s) step 19829/76294 | train loss 3.458043 | norm 11.9061 | lr 1.20e-04 | (3805.47 ms | 137772 tok/s) step 19830/76294 | train loss 3.451052 | norm 6.1492 | lr 1.20e-04 | (3834.68 ms | 136723 tok/s) step 19831/76294 | train loss 3.464913 | norm 16.1261 | lr 1.20e-04 | (4483.81 ms | 116929 tok/s) step 19832/76294 | train loss 3.500200 | norm 12.1465 | lr 1.20e-04 | (3807.82 ms | 137687 tok/s) step 19833/76294 | train loss 3.482391 | norm 7.9692 | lr 1.20e-04 | (3803.02 ms | 137861 tok/s) step 19834/76294 | train loss 3.444794 | norm 7.2316 | lr 1.20e-04 | (3809.48 ms | 137627 tok/s) step 19835/76294 | train loss 3.438132 | norm 17.7545 | lr 1.20e-04 | (3808.91 ms | 137648 tok/s) step 19836/76294 | train loss 3.594627 | norm 27.4357 | lr 1.20e-04 | (3813.75 ms | 137473 tok/s) step 19837/76294 | train loss 3.447834 | norm 26.6557 | lr 1.20e-04 | (3804.75 ms | 137798 tok/s) step 19838/76294 | train loss 3.495991 | norm 9.4899 | lr 1.20e-04 | (3808.60 ms | 137659 tok/s) step 19839/76294 | train loss 3.552603 | norm 4.3435 | lr 1.20e-04 | (3810.15 ms | 137603 tok/s) step 19840/76294 | train loss 3.442565 | norm 5.1560 | lr 1.20e-04 | (3804.03 ms | 137824 tok/s) step 19841/76294 | train loss 3.471740 | norm 5.9735 | lr 1.20e-04 | (3836.22 ms | 136668 tok/s) step 19842/76294 | train loss 3.511219 | norm 9.2546 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 19843/76294 | train loss 3.588281 | norm 5.7655 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 19844/76294 | train loss 3.452774 | norm 17.7896 | lr 1.20e-04 | (3819.17 ms | 137278 tok/s) step 19845/76294 | train loss 3.460712 | norm 4.4597 | lr 1.20e-04 | (3799.18 ms | 138000 tok/s) step 19846/76294 | train loss 3.485021 | norm 4.3639 | lr 1.20e-04 | (3802.01 ms | 137898 tok/s) step 19847/76294 | train loss 3.460685 | norm 4.4650 | lr 1.20e-04 | (3803.98 ms 
| 137826 tok/s) step 19848/76294 | train loss 3.475889 | norm 4.5027 | lr 1.20e-04 | (3802.11 ms | 137894 tok/s) step 19849/76294 | train loss 3.584358 | norm 8.7948 | lr 1.20e-04 | (3800.37 ms | 137957 tok/s) step 19850/76294 | train loss 3.465069 | norm 5.8063 | lr 1.20e-04 | (3843.22 ms | 136419 tok/s) step 19851/76294 | train loss 3.519585 | norm 4.5285 | lr 1.20e-04 | (3799.30 ms | 137996 tok/s) step 19852/76294 | train loss 3.451025 | norm 4.8873 | lr 1.20e-04 | (3806.43 ms | 137738 tok/s) step 19853/76294 | train loss 3.443758 | norm 11.4548 | lr 1.20e-04 | (3822.69 ms | 137151 tok/s) step 19854/76294 | train loss 3.485008 | norm 8.6494 | lr 1.20e-04 | (3810.33 ms | 137597 tok/s) step 19855/76294 | train loss 3.499323 | norm 5.7535 | lr 1.20e-04 | (3804.11 ms | 137822 tok/s) step 19856/76294 | train loss 3.460106 | norm 7.8258 | lr 1.20e-04 | (3802.86 ms | 137867 tok/s) step 19857/76294 | train loss 3.533169 | norm 6.5326 | lr 1.20e-04 | (3807.04 ms | 137715 tok/s) step 19858/76294 | train loss 3.474569 | norm 5.0897 | lr 1.20e-04 | (3810.09 ms | 137605 tok/s) step 19859/76294 | train loss 3.503114 | norm 4.8339 | lr 1.20e-04 | (3806.95 ms | 137719 tok/s) step 19860/76294 | train loss 3.398360 | norm 9.4136 | lr 1.20e-04 | (3810.08 ms | 137605 tok/s) step 19861/76294 | train loss 3.691947 | norm 7.1985 | lr 1.20e-04 | (3813.68 ms | 137475 tok/s) step 19862/76294 | train loss 3.499865 | norm 9.2351 | lr 1.20e-04 | (3804.53 ms | 137806 tok/s) step 19863/76294 | train loss 3.470608 | norm 8.1116 | lr 1.20e-04 | (3823.18 ms | 137134 tok/s) step 19864/76294 | train loss 3.445589 | norm 5.8530 | lr 1.20e-04 | (3801.74 ms | 137908 tok/s) step 19865/76294 | train loss 3.438225 | norm 5.0542 | lr 1.20e-04 | (3797.91 ms | 138047 tok/s) step 19866/76294 | train loss 3.476324 | norm 5.3587 | lr 1.20e-04 | (3830.46 ms | 136873 tok/s) step 19867/76294 | train loss 3.494307 | norm 6.8638 | lr 1.20e-04 | (3799.99 ms | 137971 tok/s) step 19868/76294 | train loss 3.570197 | 
norm 4.6135 | lr 1.20e-04 | (3799.85 ms | 137976 tok/s) step 19869/76294 | train loss 3.512856 | norm 8.1558 | lr 1.20e-04 | (3820.94 ms | 137214 tok/s) step 19870/76294 | train loss 3.472115 | norm 5.7082 | lr 1.20e-04 | (3808.37 ms | 137667 tok/s) step 19871/76294 | train loss 3.417694 | norm 4.9586 | lr 1.20e-04 | (3804.64 ms | 137802 tok/s) step 19872/76294 | train loss 3.434358 | norm 3.2399 | lr 1.20e-04 | (3804.08 ms | 137823 tok/s) step 19873/76294 | train loss 3.477334 | norm 4.6295 | lr 1.20e-04 | (3887.14 ms | 134878 tok/s) step 19874/76294 | train loss 3.465441 | norm 4.3055 | lr 1.20e-04 | (3877.35 ms | 135218 tok/s) step 19875/76294 | train loss 3.505745 | norm 4.4025 | lr 1.20e-04 | (3798.49 ms | 138026 tok/s) step 19876/76294 | train loss 3.473636 | norm 6.2463 | lr 1.20e-04 | (3803.22 ms | 137854 tok/s) step 19877/76294 | train loss 3.494784 | norm 4.2462 | lr 1.20e-04 | (3838.75 ms | 136578 tok/s) step 19878/76294 | train loss 3.430991 | norm 4.3456 | lr 1.20e-04 | (3803.50 ms | 137844 tok/s) step 19879/76294 | train loss 3.510053 | norm 3.5768 | lr 1.20e-04 | (3806.77 ms | 137725 tok/s) step 19880/76294 | train loss 3.470338 | norm 3.4627 | lr 1.20e-04 | (3804.63 ms | 137803 tok/s) step 19881/76294 | train loss 3.479199 | norm 4.3349 | lr 1.20e-04 | (3805.25 ms | 137780 tok/s) step 19882/76294 | train loss 3.453418 | norm 4.2588 | lr 1.20e-04 | (3803.42 ms | 137847 tok/s) step 19883/76294 | train loss 3.491369 | norm 3.6335 | lr 1.20e-04 | (3806.95 ms | 137719 tok/s) step 19884/76294 | train loss 3.558239 | norm 5.8639 | lr 1.20e-04 | (3803.32 ms | 137850 tok/s) step 19885/76294 | train loss 3.527544 | norm 10.9540 | lr 1.20e-04 | (3808.37 ms | 137667 tok/s) step 19886/76294 | train loss 3.782665 | norm 5.9321 | lr 1.20e-04 | (3802.55 ms | 137878 tok/s) step 19887/76294 | train loss 3.470514 | norm 2.7090 | lr 1.20e-04 | (3805.51 ms | 137771 tok/s) step 19888/76294 | train loss 3.470071 | norm 7.8466 | lr 1.20e-04 | (3802.76 ms | 137870 tok/s) 
step 19889/76294 | train loss 3.460154 | norm 6.6195 | lr 1.20e-04 | (3806.79 ms | 137725 tok/s) step 19890/76294 | train loss 3.461628 | norm 4.7677 | lr 1.20e-04 | (3828.02 ms | 136961 tok/s) step 19891/76294 | train loss 3.531485 | norm 11.0782 | lr 1.20e-04 | (3804.56 ms | 137805 tok/s) step 19892/76294 | train loss 3.549665 | norm 3.7895 | lr 1.20e-04 | (3850.68 ms | 136155 tok/s) step 19893/76294 | train loss 3.512608 | norm 3.7545 | lr 1.20e-04 | (3797.74 ms | 138052 tok/s) step 19894/76294 | train loss 3.536239 | norm 6.0651 | lr 1.20e-04 | (3831.24 ms | 136845 tok/s) step 19895/76294 | train loss 3.463649 | norm 4.4457 | lr 1.20e-04 | (3797.99 ms | 138043 tok/s) step 19896/76294 | train loss 3.429301 | norm 5.1766 | lr 1.20e-04 | (3852.58 ms | 136088 tok/s) step 19897/76294 | train loss 3.471741 | norm 5.9406 | lr 1.20e-04 | (3796.77 ms | 138088 tok/s) step 19898/76294 | train loss 3.469012 | norm 8.9634 | lr 1.20e-04 | (3825.59 ms | 137048 tok/s) step 19899/76294 | train loss 3.488968 | norm 3.3001 | lr 1.20e-04 | (3871.78 ms | 135412 tok/s) step 19900/76294 | train loss 3.438800 | norm 4.6441 | lr 1.20e-04 | (3797.76 ms | 138052 tok/s) step 19901/76294 | train loss 3.470911 | norm 3.9978 | lr 1.20e-04 | (3825.54 ms | 137049 tok/s) step 19902/76294 | train loss 3.486655 | norm 5.5552 | lr 1.20e-04 | (3795.10 ms | 138149 tok/s) step 19903/76294 | train loss 3.448305 | norm 3.8061 | lr 1.20e-04 | (3802.33 ms | 137886 tok/s) step 19904/76294 | train loss 3.406275 | norm 9.1834 | lr 1.20e-04 | (3834.71 ms | 136722 tok/s) step 19905/76294 | train loss 3.504195 | norm 4.1156 | lr 1.20e-04 | (3821.09 ms | 137209 tok/s) step 19906/76294 | train loss 3.467331 | norm 4.0746 | lr 1.20e-04 | (5374.74 ms | 97547 tok/s) step 19907/76294 | train loss 3.452914 | norm 4.9211 | lr 1.20e-04 | (3796.72 ms | 138090 tok/s) step 19908/76294 | train loss 3.474947 | norm 3.4356 | lr 1.20e-04 | (3802.86 ms | 137867 tok/s) step 19909/76294 | train loss 3.468850 | norm 5.5533 | lr 
1.20e-04 | (3795.26 ms | 138143 tok/s) step 19910/76294 | train loss 3.487607 | norm 3.9807 | lr 1.20e-04 | (3820.87 ms | 137217 tok/s) step 19911/76294 | train loss 3.524998 | norm 3.4270 | lr 1.20e-04 | (3798.45 ms | 138027 tok/s) step 19912/76294 | train loss 3.516616 | norm 4.1374 | lr 1.20e-04 | (3797.95 ms | 138045 tok/s) step 19913/76294 | train loss 3.478271 | norm 5.5314 | lr 1.20e-04 | (3944.57 ms | 132914 tok/s) step 19914/76294 | train loss 3.406988 | norm 5.3092 | lr 1.20e-04 | (3798.30 ms | 138032 tok/s) step 19915/76294 | train loss 3.412532 | norm 3.8659 | lr 1.20e-04 | (3855.11 ms | 135998 tok/s) step 19916/76294 | train loss 3.464222 | norm 4.0062 | lr 1.20e-04 | (3797.05 ms | 138078 tok/s) step 19917/76294 | train loss 3.504441 | norm 6.5518 | lr 1.20e-04 | (3800.36 ms | 137958 tok/s) step 19918/76294 | train loss 3.527599 | norm 4.7364 | lr 1.20e-04 | (3814.17 ms | 137458 tok/s) step 19919/76294 | train loss 3.432930 | norm 3.8189 | lr 1.20e-04 | (3892.51 ms | 134691 tok/s) step 19920/76294 | train loss 3.472057 | norm 15.9814 | lr 1.20e-04 | (3880.54 ms | 135107 tok/s) step 19921/76294 | train loss 3.541390 | norm 5.3509 | lr 1.20e-04 | (3800.35 ms | 137958 tok/s) step 19922/76294 | train loss 3.511629 | norm 3.5349 | lr 1.20e-04 | (3803.98 ms | 137826 tok/s) step 19923/76294 | train loss 3.390362 | norm 8.6189 | lr 1.20e-04 | (3898.29 ms | 134492 tok/s) step 19924/76294 | train loss 3.437877 | norm 5.0650 | lr 1.20e-04 | (3796.12 ms | 138112 tok/s) step 19925/76294 | train loss 3.442374 | norm 4.1091 | lr 1.20e-04 | (3802.77 ms | 137870 tok/s) step 19926/76294 | train loss 3.455986 | norm 3.4363 | lr 1.20e-04 | (3796.67 ms | 138091 tok/s) step 19927/76294 | train loss 3.457161 | norm 3.5453 | lr 1.20e-04 | (3806.18 ms | 137747 tok/s) step 19928/76294 | train loss 3.497686 | norm 5.6772 | lr 1.20e-04 | (3819.65 ms | 137261 tok/s) step 19929/76294 | train loss 3.501910 | norm 4.3257 | lr 1.20e-04 | (3808.94 ms | 137647 tok/s) step 19930/76294 | 
train loss 3.473331 | norm 6.0671 | lr 1.20e-04 | (3797.88 ms | 138047 tok/s) step 19931/76294 | train loss 3.489014 | norm 9.5428 | lr 1.20e-04 | (3856.93 ms | 135934 tok/s) step 19932/76294 | train loss 3.505776 | norm 14.5003 | lr 1.20e-04 | (3797.03 ms | 138078 tok/s) step 19933/76294 | train loss 3.440611 | norm 6.7659 | lr 1.20e-04 | (3827.22 ms | 136989 tok/s) step 19934/76294 | train loss 3.528267 | norm 14.1014 | lr 1.20e-04 | (3798.78 ms | 138015 tok/s) step 19935/76294 | train loss 3.463106 | norm 9.5244 | lr 1.20e-04 | (3805.31 ms | 137778 tok/s) step 19936/76294 | train loss 3.514310 | norm 8.0639 | lr 1.20e-04 | (3808.32 ms | 137669 tok/s) step 19937/76294 | train loss 3.509575 | norm 2.8895 | lr 1.20e-04 | (3801.05 ms | 137933 tok/s) step 19938/76294 | train loss 3.483749 | norm 5.9271 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 19939/76294 | train loss 3.414138 | norm 6.1376 | lr 1.20e-04 | (3798.31 ms | 138032 tok/s) step 19940/76294 | train loss 3.498909 | norm 5.8600 | lr 1.20e-04 | (3850.11 ms | 136175 tok/s) step 19941/76294 | train loss 3.450276 | norm 7.9822 | lr 1.20e-04 | (3802.25 ms | 137889 tok/s) step 19942/76294 | train loss 3.483753 | norm 6.4808 | lr 1.20e-04 | (4105.84 ms | 127693 tok/s) step 19943/76294 | train loss 3.481885 | norm 8.6927 | lr 1.20e-04 | (3797.70 ms | 138054 tok/s) step 19944/76294 | train loss 3.422669 | norm 4.5159 | lr 1.20e-04 | (3805.37 ms | 137776 tok/s) step 19945/76294 | train loss 3.450250 | norm 6.0225 | lr 1.20e-04 | (3824.71 ms | 137079 tok/s) step 19946/76294 | train loss 3.445106 | norm 3.9287 | lr 1.20e-04 | (3801.65 ms | 137911 tok/s) step 19947/76294 | train loss 3.484564 | norm 4.0611 | lr 1.20e-04 | (3807.08 ms | 137714 tok/s) step 19948/76294 | train loss 3.627999 | norm 5.7640 | lr 1.20e-04 | (3879.88 ms | 135130 tok/s) step 19949/76294 | train loss 3.429671 | norm 5.1732 | lr 1.20e-04 | (3852.21 ms | 136101 tok/s) step 19950/76294 | train loss 3.481856 | norm 5.9031 | lr 1.20e-04 | 
(3798.27 ms | 138033 tok/s) step 19951/76294 | train loss 3.445451 | norm 4.7115 | lr 1.20e-04 | (3802.69 ms | 137873 tok/s) step 19952/76294 | train loss 3.428190 | norm 5.4912 | lr 1.20e-04 | (3828.46 ms | 136945 tok/s) step 19953/76294 | train loss 3.489081 | norm 6.0284 | lr 1.20e-04 | (3797.15 ms | 138074 tok/s) step 19954/76294 | train loss 3.434849 | norm 4.3071 | lr 1.20e-04 | (3803.10 ms | 137858 tok/s) step 19955/76294 | train loss 3.432335 | norm 7.1658 | lr 1.20e-04 | (3802.04 ms | 137897 tok/s) step 19956/76294 | train loss 3.438641 | norm 9.0626 | lr 1.20e-04 | (3806.36 ms | 137740 tok/s) step 19957/76294 | train loss 3.529253 | norm 7.5961 | lr 1.20e-04 | (3825.62 ms | 137046 tok/s) step 19958/76294 | train loss 3.452862 | norm 7.8406 | lr 1.20e-04 | (3842.84 ms | 136433 tok/s) step 19959/76294 | train loss 3.492966 | norm 8.1611 | lr 1.20e-04 | (3799.77 ms | 137979 tok/s) step 19960/76294 | train loss 3.494077 | norm 11.7116 | lr 1.20e-04 | (3845.58 ms | 136335 tok/s) step 19961/76294 | train loss 3.536230 | norm 8.6764 | lr 1.20e-04 | (3795.33 ms | 138140 tok/s) step 19962/76294 | train loss 3.567583 | norm 11.1682 | lr 1.20e-04 | (3920.09 ms | 133744 tok/s) step 19963/76294 | train loss 3.528635 | norm 17.2409 | lr 1.20e-04 | (3800.43 ms | 137955 tok/s) step 19964/76294 | train loss 3.499282 | norm 20.9477 | lr 1.20e-04 | (3910.64 ms | 134067 tok/s) step 19965/76294 | train loss 3.502069 | norm 10.4361 | lr 1.20e-04 | (3775.59 ms | 138862 tok/s) step 19966/76294 | train loss 3.636640 | norm 17.6562 | lr 1.20e-04 | (3882.92 ms | 135024 tok/s) step 19967/76294 | train loss 3.547979 | norm 27.6866 | lr 1.20e-04 | (3763.05 ms | 139325 tok/s) step 19968/76294 | train loss 3.531610 | norm 12.8740 | lr 1.20e-04 | (3931.14 ms | 133368 tok/s) step 19969/76294 | train loss 3.528503 | norm 13.9702 | lr 1.20e-04 | (3844.86 ms | 136361 tok/s) step 19970/76294 | train loss 3.561374 | norm 7.9098 | lr 1.20e-04 | (3874.79 ms | 135308 tok/s) step 19971/76294 | 
train loss 3.486270 | norm 4.7574 | lr 1.20e-04 | (3882.74 ms | 135030 tok/s) step 19972/76294 | train loss 3.480664 | norm 11.3081 | lr 1.20e-04 | (3769.87 ms | 139073 tok/s) step 19973/76294 | train loss 3.506655 | norm 9.5822 | lr 1.20e-04 | (3883.27 ms | 135012 tok/s) step 19974/76294 | train loss 3.461709 | norm 7.5805 | lr 1.20e-04 | (3766.72 ms | 139189 tok/s) step 19975/76294 | train loss 3.488990 | norm 5.9162 | lr 1.20e-04 | (3868.25 ms | 135536 tok/s) step 19976/76294 | train loss 3.530353 | norm 5.6896 | lr 1.20e-04 | (3785.11 ms | 138513 tok/s) step 19977/76294 | train loss 3.505133 | norm 4.8422 | lr 1.20e-04 | (3890.13 ms | 134774 tok/s) step 19978/76294 | train loss 3.465612 | norm 5.3338 | lr 1.20e-04 | (3801.97 ms | 137899 tok/s) step 19979/76294 | train loss 3.493686 | norm 4.0772 | lr 1.20e-04 | (3882.83 ms | 135027 tok/s) step 19980/76294 | train loss 3.439111 | norm 4.8945 | lr 1.20e-04 | (3774.03 ms | 138920 tok/s) step 19981/76294 | train loss 3.441928 | norm 4.9461 | lr 1.20e-04 | (3881.09 ms | 135088 tok/s) step 19982/76294 | train loss 3.450679 | norm 5.6529 | lr 1.20e-04 | (3872.89 ms | 135374 tok/s) step 19983/76294 | train loss 3.490687 | norm 5.6671 | lr 1.20e-04 | (3774.22 ms | 138913 tok/s) step 19984/76294 | train loss 3.446217 | norm 4.3258 | lr 1.20e-04 | (3780.23 ms | 138692 tok/s) step 19985/76294 | train loss 3.515170 | norm 3.0438 | lr 1.20e-04 | (3944.02 ms | 132932 tok/s) step 19986/76294 | train loss 3.468841 | norm 5.2962 | lr 1.20e-04 | (3797.99 ms | 138044 tok/s) step 19987/76294 | train loss 3.461730 | norm 3.7503 | lr 1.20e-04 | (3938.88 ms | 133106 tok/s) step 19988/76294 | train loss 3.525439 | norm 4.1864 | lr 1.20e-04 | (3794.12 ms | 138184 tok/s) step 19989/76294 | train loss 3.413240 | norm 5.9909 | lr 1.20e-04 | (3803.08 ms | 137859 tok/s) step 19990/76294 | train loss 3.464881 | norm 10.8402 | lr 1.20e-04 | (3815.61 ms | 137406 tok/s) step 19991/76294 | train loss 3.451420 | norm 8.4678 | lr 1.20e-04 | 
(3812.87 ms | 137505 tok/s) step 19992/76294 | train loss 3.572025 | norm 6.5565 | lr 1.20e-04 | (3795.99 ms | 138116 tok/s) step 19993/76294 | train loss 3.432851 | norm 4.9791 | lr 1.20e-04 | (3798.81 ms | 138014 tok/s) step 19994/76294 | train loss 3.495551 | norm 5.2710 | lr 1.20e-04 | (3802.16 ms | 137892 tok/s) step 19995/76294 | train loss 3.430492 | norm 6.8510 | lr 1.20e-04 | (3805.32 ms | 137778 tok/s) step 19996/76294 | train loss 3.507792 | norm 5.3166 | lr 1.20e-04 | (3806.61 ms | 137731 tok/s) step 19997/76294 | train loss 3.616291 | norm 4.7371 | lr 1.20e-04 | (3826.20 ms | 137026 tok/s) step 19998/76294 | train loss 3.455260 | norm 6.3863 | lr 1.20e-04 | (3809.37 ms | 137631 tok/s) step 19999/76294 | train loss 3.445009 | norm 8.1546 | lr 1.20e-04 | (3806.59 ms | 137732 tok/s) step 20000/76294 | train loss 3.478616 | norm 7.5342 | lr 1.20e-04 | (3834.69 ms | 136722 tok/s) val loss: 3.474232 saving model checkpoint to ./results/gpt2-124M-gqa/step_20000.pth step 20001/76294 | train loss 3.416490 | norm 7.0479 | lr 1.20e-04 | (3816.23 ms | 137384 tok/s) step 20002/76294 | train loss 3.520730 | norm 5.6221 | lr 1.20e-04 | (3831.78 ms | 136826 tok/s) step 20003/76294 | train loss 3.465318 | norm 3.7130 | lr 1.20e-04 | (3807.86 ms | 137686 tok/s) step 20004/76294 | train loss 3.470685 | norm 11.7529 | lr 1.20e-04 | (3803.16 ms | 137856 tok/s) step 20005/76294 | train loss 3.485460 | norm 7.1410 | lr 1.20e-04 | (3859.01 ms | 135861 tok/s) step 20006/76294 | train loss 3.513349 | norm 4.3392 | lr 1.20e-04 | (3808.19 ms | 137674 tok/s) step 20007/76294 | train loss 3.477713 | norm 4.0057 | lr 1.20e-04 | (3809.30 ms | 137634 tok/s) step 20008/76294 | train loss 3.468554 | norm 2.5007 | lr 1.20e-04 | (3831.10 ms | 136851 tok/s) step 20009/76294 | train loss 3.442240 | norm 4.3321 | lr 1.20e-04 | (3815.85 ms | 137397 tok/s) step 20010/76294 | train loss 3.502887 | norm 5.3331 | lr 1.20e-04 | (3814.14 ms | 137459 tok/s) step 20011/76294 | train loss 3.526336 | 
norm 4.2983 | lr 1.20e-04 | (3813.56 ms | 137480 tok/s) step 20012/76294 | train loss 3.448725 | norm 2.8729 | lr 1.20e-04 | (3844.31 ms | 136380 tok/s) step 20013/76294 | train loss 3.382727 | norm 3.3271 | lr 1.20e-04 | (3847.98 ms | 136250 tok/s) step 20014/76294 | train loss 3.508676 | norm 11.5831 | lr 1.20e-04 | (3805.43 ms | 137774 tok/s) step 20015/76294 | train loss 3.484516 | norm 25.9435 | lr 1.20e-04 | (3812.54 ms | 137517 tok/s) step 20016/76294 | train loss 3.458932 | norm 5.0398 | lr 1.20e-04 | (3825.02 ms | 137068 tok/s) step 20017/76294 | train loss 3.586379 | norm 3.9405 | lr 1.20e-04 | (3805.51 ms | 137771 tok/s) step 20018/76294 | train loss 3.457348 | norm 2.6006 | lr 1.20e-04 | (3811.88 ms | 137541 tok/s) step 20019/76294 | train loss 3.444241 | norm 3.9171 | lr 1.20e-04 | (3809.55 ms | 137625 tok/s) step 20020/76294 | train loss 3.427289 | norm 8.0928 | lr 1.20e-04 | (3802.74 ms | 137871 tok/s) step 20021/76294 | train loss 3.500399 | norm 3.8376 | lr 1.20e-04 | (3831.41 ms | 136839 tok/s) step 20022/76294 | train loss 3.473600 | norm 96.5833 | lr 1.20e-04 | (3804.97 ms | 137790 tok/s) step 20023/76294 | train loss 3.520942 | norm 5.0339 | lr 1.20e-04 | (3848.83 ms | 136220 tok/s) step 20024/76294 | train loss 3.515813 | norm 4.1717 | lr 1.20e-04 | (3812.37 ms | 137523 tok/s) step 20025/76294 | train loss 3.436776 | norm 4.0300 | lr 1.20e-04 | (3836.33 ms | 136664 tok/s) step 20026/76294 | train loss 3.420812 | norm 4.1326 | lr 1.20e-04 | (3802.94 ms | 137864 tok/s) step 20027/76294 | train loss 3.586143 | norm 5.7431 | lr 1.20e-04 | (3808.25 ms | 137672 tok/s) step 20028/76294 | train loss 3.482410 | norm 6.1324 | lr 1.20e-04 | (3825.44 ms | 137053 tok/s) step 20029/76294 | train loss 3.551403 | norm 5.9634 | lr 1.20e-04 | (3804.80 ms | 137797 tok/s) step 20030/76294 | train loss 3.464231 | norm 36.6848 | lr 1.20e-04 | (3807.47 ms | 137700 tok/s) step 20031/76294 | train loss 3.484845 | norm 8.4573 | lr 1.20e-04 | (4159.45 ms | 126047 tok/s) 
step 20032/76294 | train loss 3.402914 | norm 6.1910 | lr 1.20e-04 | (3807.25 ms | 137708 tok/s) step 20033/76294 | train loss 3.518523 | norm 3.6218 | lr 1.20e-04 | (3805.79 ms | 137761 tok/s) step 20034/76294 | train loss 3.467452 | norm 10.1839 | lr 1.20e-04 | (3824.17 ms | 137098 tok/s) step 20035/76294 | train loss 3.459512 | norm 24.0714 | lr 1.20e-04 | (3805.82 ms | 137760 tok/s) step 20036/76294 | train loss 3.449523 | norm 13.4539 | lr 1.20e-04 | (3923.36 ms | 133632 tok/s) step 20037/76294 | train loss 3.447290 | norm 15.2306 | lr 1.20e-04 | (3800.05 ms | 137969 tok/s) step 20038/76294 | train loss 3.412302 | norm 8.7962 | lr 1.20e-04 | (3805.08 ms | 137786 tok/s) step 20039/76294 | train loss 3.480536 | norm 12.7681 | lr 1.20e-04 | (3823.57 ms | 137120 tok/s) step 20040/76294 | train loss 3.499317 | norm 30.0417 | lr 1.20e-04 | (3803.06 ms | 137859 tok/s) step 20041/76294 | train loss 3.503416 | norm 23.3444 | lr 1.20e-04 | (3805.82 ms | 137760 tok/s) step 20042/76294 | train loss 3.500846 | norm 24.3235 | lr 1.20e-04 | (3805.94 ms | 137755 tok/s) step 20043/76294 | train loss 3.524889 | norm 17.1812 | lr 1.20e-04 | (3812.78 ms | 137508 tok/s) step 20044/76294 | train loss 3.540969 | norm 20.1362 | lr 1.20e-04 | (3800.51 ms | 137952 tok/s) step 20045/76294 | train loss 3.491427 | norm 26.0072 | lr 1.20e-04 | (3808.93 ms | 137647 tok/s) step 20046/76294 | train loss 3.544207 | norm 33.1887 | lr 1.20e-04 | (3831.31 ms | 136843 tok/s) step 20047/76294 | train loss 3.511709 | norm 22.1453 | lr 1.20e-04 | (3800.25 ms | 137961 tok/s) step 20048/76294 | train loss 3.489562 | norm 16.6799 | lr 1.20e-04 | (3831.42 ms | 136839 tok/s) step 20049/76294 | train loss 3.457580 | norm 65.3511 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 20050/76294 | train loss 3.530346 | norm 80.2655 | lr 1.20e-04 | (3806.15 ms | 137748 tok/s) step 20051/76294 | train loss 3.365744 | norm 25.9487 | lr 1.20e-04 | (3822.56 ms | 137156 tok/s) step 20052/76294 | train loss 3.496186 | 
norm 34.2618 | lr 1.20e-04 | (3808.16 ms | 137675 tok/s) step 20053/76294 | train loss 3.461928 | norm 49.0208 | lr 1.20e-04 | (3807.57 ms | 137696 tok/s) step 20054/76294 | train loss 3.483926 | norm 92.8539 | lr 1.20e-04 | (3835.95 ms | 136678 tok/s) step 20055/76294 | train loss 3.486928 | norm 32.7216 | lr 1.20e-04 | (3811.39 ms | 137558 tok/s) step 20056/76294 | train loss 3.556839 | norm 20.5818 | lr 1.20e-04 | (3856.90 ms | 135935 tok/s) step 20057/76294 | train loss 3.472066 | norm 57.7835 | lr 1.20e-04 | (3805.39 ms | 137775 tok/s) step 20058/76294 | train loss 3.526145 | norm 35.2127 | lr 1.20e-04 | (3808.15 ms | 137675 tok/s) step 20059/76294 | train loss 3.481682 | norm 57.8464 | lr 1.20e-04 | (3798.82 ms | 138014 tok/s) step 20060/76294 | train loss 3.516605 | norm 61.2319 | lr 1.20e-04 | (3825.50 ms | 137051 tok/s) step 20061/76294 | train loss 3.510039 | norm 28.8777 | lr 1.20e-04 | (3875.56 ms | 135281 tok/s) step 20062/76294 | train loss 3.489984 | norm 63.2046 | lr 1.20e-04 | (3854.24 ms | 136029 tok/s) step 20063/76294 | train loss 3.470841 | norm 26.5754 | lr 1.20e-04 | (3798.51 ms | 138025 tok/s) step 20064/76294 | train loss 3.472111 | norm 23.3014 | lr 1.20e-04 | (3815.54 ms | 137408 tok/s) step 20065/76294 | train loss 3.498740 | norm 130.9791 | lr 1.20e-04 | (3798.14 ms | 138038 tok/s) step 20066/76294 | train loss 3.581382 | norm 47.0511 | lr 1.20e-04 | (3801.31 ms | 137923 tok/s) step 20067/76294 | train loss 3.483104 | norm 93.8476 | lr 1.20e-04 | (3820.14 ms | 137243 tok/s) step 20068/76294 | train loss 3.446440 | norm 100.8813 | lr 1.20e-04 | (3802.11 ms | 137894 tok/s) step 20069/76294 | train loss 3.562630 | norm 102.4794 | lr 1.20e-04 | (3804.13 ms | 137821 tok/s) step 20070/76294 | train loss 3.491771 | norm 53.8738 | lr 1.20e-04 | (3806.61 ms | 137731 tok/s) step 20071/76294 | train loss 3.528871 | norm 22.2867 | lr 1.20e-04 | (3801.97 ms | 137899 tok/s) step 20072/76294 | train loss 3.537132 | norm 35.6341 | lr 1.20e-04 | 
(3816.99 ms | 137357 tok/s) step 20073/76294 | train loss 3.506722 | norm 28.2857 | lr 1.20e-04 | (3824.90 ms | 137072 tok/s) step 20074/76294 | train loss 3.468240 | norm 23.9276 | lr 1.20e-04 | (3796.90 ms | 138083 tok/s) step 20075/76294 | train loss 3.469050 | norm 76.7414 | lr 1.20e-04 | (3803.99 ms | 137826 tok/s) step 20076/76294 | train loss 3.460786 | norm 58.4471 | lr 1.20e-04 | (3801.77 ms | 137906 tok/s) step 20077/76294 | train loss 3.459541 | norm 25.6162 | lr 1.20e-04 | (3804.72 ms | 137799 tok/s) step 20078/76294 | train loss 3.492029 | norm 20.4010 | lr 1.20e-04 | (3802.55 ms | 137878 tok/s) step 20079/76294 | train loss 3.607389 | norm 46.9010 | lr 1.20e-04 | (3806.88 ms | 137721 tok/s) step 20080/76294 | train loss 3.507849 | norm 27.2552 | lr 1.20e-04 | (4200.12 ms | 124827 tok/s) step 20081/76294 | train loss 3.499285 | norm 70.4172 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 20082/76294 | train loss 3.496123 | norm 43.9415 | lr 1.20e-04 | (3797.44 ms | 138064 tok/s) step 20083/76294 | train loss 3.518754 | norm 42.4118 | lr 1.20e-04 | (3807.15 ms | 137711 tok/s) step 20084/76294 | train loss 3.541963 | norm 38.2872 | lr 1.20e-04 | (3806.76 ms | 137726 tok/s) step 20085/76294 | train loss 3.483934 | norm 32.9868 | lr 1.20e-04 | (3821.72 ms | 137186 tok/s) step 20086/76294 | train loss 3.456124 | norm 28.6898 | lr 1.20e-04 | (3803.02 ms | 137861 tok/s) step 20087/76294 | train loss 3.481887 | norm 31.9112 | lr 1.20e-04 | (3808.74 ms | 137654 tok/s) step 20088/76294 | train loss 3.436014 | norm 15.1919 | lr 1.20e-04 | (3847.72 ms | 136259 tok/s) step 20089/76294 | train loss 3.484705 | norm 36.8752 | lr 1.20e-04 | (3807.23 ms | 137708 tok/s) step 20090/76294 | train loss 3.448976 | norm 29.8077 | lr 1.20e-04 | (3807.96 ms | 137682 tok/s) step 20091/76294 | train loss 3.567290 | norm 14.3002 | lr 1.20e-04 | (3827.89 ms | 136965 tok/s) step 20092/76294 | train loss 3.465958 | norm 19.7309 | lr 1.20e-04 | (3809.47 ms | 137627 tok/s) step 
20093/76294 | train loss 3.479853 | norm 15.9472 | lr 1.20e-04 | (3811.20 ms | 137565 tok/s) step 20094/76294 | train loss 3.470490 | norm 21.7314 | lr 1.20e-04 | (3809.04 ms | 137643 tok/s) step 20095/76294 | train loss 3.469753 | norm 9.7944 | lr 1.20e-04 | (3814.48 ms | 137447 tok/s) step 20096/76294 | train loss 3.464389 | norm 29.7559 | lr 1.20e-04 | (3809.41 ms | 137630 tok/s) step 20097/76294 | train loss 3.494612 | norm 26.9445 | lr 1.20e-04 | (3829.57 ms | 136905 tok/s) step 20098/76294 | train loss 3.479384 | norm 28.7231 | lr 1.20e-04 | (3806.45 ms | 137737 tok/s) step 20099/76294 | train loss 3.402356 | norm 14.4303 | lr 1.20e-04 | (3833.26 ms | 136774 tok/s) step 20100/76294 | train loss 3.412905 | norm 5.3078 | lr 1.20e-04 | (3809.27 ms | 137635 tok/s) step 20101/76294 | train loss 3.390516 | norm 23.6918 | lr 1.20e-04 | (3816.68 ms | 137367 tok/s) step 20102/76294 | train loss 3.434575 | norm 41.1095 | lr 1.20e-04 | (3808.71 ms | 137655 tok/s) step 20103/76294 | train loss 3.527391 | norm 12.5060 | lr 1.20e-04 | (3831.06 ms | 136852 tok/s) step 20104/76294 | train loss 3.464710 | norm 9.7170 | lr 1.20e-04 | (3826.60 ms | 137011 tok/s) step 20105/76294 | train loss 3.376060 | norm 48.6935 | lr 1.20e-04 | (3836.29 ms | 136665 tok/s) step 20106/76294 | train loss 3.477350 | norm 11.4691 | lr 1.20e-04 | (3815.38 ms | 137414 tok/s) step 20107/76294 | train loss 3.544425 | norm 5.3847 | lr 1.20e-04 | (3834.32 ms | 136736 tok/s) step 20108/76294 | train loss 3.582785 | norm 4.4654 | lr 1.20e-04 | (3824.86 ms | 137074 tok/s) step 20109/76294 | train loss 3.486638 | norm 11.4623 | lr 1.20e-04 | (3893.39 ms | 134661 tok/s) step 20110/76294 | train loss 3.424834 | norm 19.8079 | lr 1.20e-04 | (3789.05 ms | 138369 tok/s) step 20111/76294 | train loss 3.498271 | norm 12.2278 | lr 1.20e-04 | (3842.53 ms | 136443 tok/s) step 20112/76294 | train loss 3.611435 | norm 12.3945 | lr 1.20e-04 | (3784.80 ms | 138525 tok/s) step 20113/76294 | train loss 3.441878 | norm 
6.2558 | lr 1.20e-04 | (3815.04 ms | 137427 tok/s) step 20114/76294 | train loss 3.535359 | norm 6.2613 | lr 1.20e-04 | (3810.30 ms | 137597 tok/s) step 20115/76294 | train loss 3.546423 | norm 6.8592 | lr 1.20e-04 | (3819.42 ms | 137269 tok/s) step 20116/76294 | train loss 3.512514 | norm 14.3889 | lr 1.20e-04 | (3827.91 ms | 136964 tok/s) step 20117/76294 | train loss 3.506097 | norm 5.6807 | lr 1.20e-04 | (3793.04 ms | 138224 tok/s) step 20118/76294 | train loss 3.520784 | norm 4.4761 | lr 1.20e-04 | (3796.26 ms | 138106 tok/s) step 20119/76294 | train loss 3.459298 | norm 11.6662 | lr 1.20e-04 | (3805.44 ms | 137773 tok/s) step 20120/76294 | train loss 3.439776 | norm 5.5641 | lr 1.20e-04 | (3791.75 ms | 138271 tok/s) step 20121/76294 | train loss 3.438276 | norm 4.9446 | lr 1.20e-04 | (3847.96 ms | 136251 tok/s) step 20122/76294 | train loss 3.404817 | norm 10.5892 | lr 1.20e-04 | (3874.52 ms | 135317 tok/s) step 20123/76294 | train loss 3.476438 | norm 11.7989 | lr 1.20e-04 | (3894.98 ms | 134606 tok/s) step 20124/76294 | train loss 3.476230 | norm 63.2596 | lr 1.20e-04 | (3781.53 ms | 138644 tok/s) step 20125/76294 | train loss 3.441538 | norm 16.7755 | lr 1.20e-04 | (3809.12 ms | 137640 tok/s) step 20126/76294 | train loss 3.490214 | norm 8.6978 | lr 1.20e-04 | (3786.35 ms | 138468 tok/s) step 20127/76294 | train loss 3.553303 | norm 27.8551 | lr 1.20e-04 | (3800.28 ms | 137960 tok/s) step 20128/76294 | train loss 3.442148 | norm 5.3866 | lr 1.20e-04 | (3791.89 ms | 138265 tok/s) step 20129/76294 | train loss 3.572936 | norm 4.9436 | lr 1.20e-04 | (3815.06 ms | 137426 tok/s) step 20130/76294 | train loss 3.518982 | norm 10.5881 | lr 1.20e-04 | (3793.22 ms | 138217 tok/s) step 20131/76294 | train loss 3.479430 | norm 52.5010 | lr 1.20e-04 | (3848.19 ms | 136243 tok/s) step 20132/76294 | train loss 3.492139 | norm 7.8638 | lr 1.20e-04 | (3795.70 ms | 138127 tok/s) step 20133/76294 | train loss 3.412272 | norm 6.7378 | lr 1.20e-04 | (4184.10 ms | 125305 tok/s) 
step 20134/76294 | train loss 3.467855 | norm 14.6378 | lr 1.20e-04 | (3790.42 ms | 138319 tok/s) step 20135/76294 | train loss 3.490204 | norm 5.7361 | lr 1.20e-04 | (3812.39 ms | 137522 tok/s) step 20136/76294 | train loss 3.485098 | norm 6.4732 | lr 1.20e-04 | (3795.27 ms | 138142 tok/s) step 20137/76294 | train loss 3.450910 | norm 10.3152 | lr 1.20e-04 | (3798.97 ms | 138008 tok/s) step 20138/76294 | train loss 3.478974 | norm 13.1484 | lr 1.20e-04 | (3825.18 ms | 137062 tok/s) step 20139/76294 | train loss 3.451742 | norm 7.4022 | lr 1.20e-04 | (3826.20 ms | 137026 tok/s) step 20140/76294 | train loss 3.488229 | norm 16.9673 | lr 1.20e-04 | (3804.94 ms | 137791 tok/s) step 20141/76294 | train loss 3.519881 | norm 15.6815 | lr 1.20e-04 | (3894.00 ms | 134640 tok/s) step 20142/76294 | train loss 3.520598 | norm 14.7141 | lr 1.20e-04 | (3798.20 ms | 138036 tok/s) step 20143/76294 | train loss 3.422169 | norm 6.5413 | lr 1.20e-04 | (3857.75 ms | 135905 tok/s) step 20144/76294 | train loss 3.509907 | norm 5.5501 | lr 1.20e-04 | (3798.01 ms | 138043 tok/s) step 20145/76294 | train loss 3.497164 | norm 5.8209 | lr 1.20e-04 | (3849.64 ms | 136192 tok/s) step 20146/76294 | train loss 3.449260 | norm 6.4913 | lr 1.20e-04 | (3792.93 ms | 138228 tok/s) step 20147/76294 | train loss 3.484765 | norm 14.6927 | lr 1.20e-04 | (3809.79 ms | 137616 tok/s) step 20148/76294 | train loss 3.459374 | norm 4.9240 | lr 1.20e-04 | (3842.83 ms | 136433 tok/s) step 20149/76294 | train loss 3.448676 | norm 49.4133 | lr 1.20e-04 | (3841.80 ms | 136469 tok/s) step 20150/76294 | train loss 3.438751 | norm 36.4622 | lr 1.20e-04 | (3817.61 ms | 137334 tok/s) step 20151/76294 | train loss 3.500996 | norm 30.1769 | lr 1.20e-04 | (3921.46 ms | 133697 tok/s) step 20152/76294 | train loss 3.545572 | norm 71.3124 | lr 1.20e-04 | (3794.56 ms | 138168 tok/s) step 20153/76294 | train loss 3.561046 | norm 14.9783 | lr 1.20e-04 | (3800.73 ms | 137944 tok/s) step 20154/76294 | train loss 3.463920 | norm 
49.0539 | lr 1.20e-04 | (3821.74 ms | 137186 tok/s) step 20155/76294 | train loss 3.489616 | norm 11.1280 | lr 1.20e-04 | (3805.19 ms | 137783 tok/s) step 20156/76294 | train loss 3.417051 | norm 4.5889 | lr 1.20e-04 | (3883.27 ms | 135012 tok/s) step 20157/76294 | train loss 3.580096 | norm 7.1485 | lr 1.20e-04 | (3803.99 ms | 137826 tok/s) step 20158/76294 | train loss 3.545363 | norm 42.3423 | lr 1.20e-04 | (3807.69 ms | 137692 tok/s) step 20159/76294 | train loss 3.574566 | norm 6.1165 | lr 1.20e-04 | (3822.23 ms | 137168 tok/s) step 20160/76294 | train loss 3.497234 | norm 7.1845 | lr 1.20e-04 | (3807.48 ms | 137699 tok/s) step 20161/76294 | train loss 3.454970 | norm 12.4503 | lr 1.20e-04 | (3884.16 ms | 134981 tok/s) step 20162/76294 | train loss 3.479098 | norm 7.5386 | lr 1.20e-04 | (3801.19 ms | 137927 tok/s) step 20163/76294 | train loss 3.411325 | norm 4.0627 | lr 1.20e-04 | (3809.20 ms | 137637 tok/s) step 20164/76294 | train loss 3.451482 | norm 2.9594 | lr 1.20e-04 | (3826.32 ms | 137022 tok/s) step 20165/76294 | train loss 3.446020 | norm 4.7058 | lr 1.20e-04 | (3807.54 ms | 137697 tok/s) step 20166/76294 | train loss 3.461439 | norm 3.9178 | lr 1.20e-04 | (3805.93 ms | 137756 tok/s) step 20167/76294 | train loss 3.486657 | norm 5.4632 | lr 1.20e-04 | (3837.84 ms | 136610 tok/s) step 20168/76294 | train loss 3.458914 | norm 5.9375 | lr 1.20e-04 | (3805.86 ms | 137758 tok/s) step 20169/76294 | train loss 3.459702 | norm 3.2834 | lr 1.20e-04 | (3808.32 ms | 137669 tok/s) step 20170/76294 | train loss 3.503052 | norm 6.3292 | lr 1.20e-04 | (3827.30 ms | 136986 tok/s) step 20171/76294 | train loss 3.514698 | norm 2.6843 | lr 1.20e-04 | (3914.00 ms | 133952 tok/s) step 20172/76294 | train loss 3.457892 | norm 11.7425 | lr 1.20e-04 | (3802.57 ms | 137877 tok/s) step 20173/76294 | train loss 3.496471 | norm 4.9175 | lr 1.20e-04 | (3838.38 ms | 136591 tok/s) step 20174/76294 | train loss 3.479697 | norm 39.1701 | lr 1.20e-04 | (3801.49 ms | 137917 tok/s) 
step 20175/76294 | train loss 3.501207 | norm 5.5845 | lr 1.20e-04 | (3805.63 ms | 137766 tok/s) step 20176/76294 | train loss 3.480160 | norm 2.7285 | lr 1.20e-04 | (3821.40 ms | 137198 tok/s) step 20177/76294 | train loss 3.445682 | norm 5.1252 | lr 1.20e-04 | (3806.90 ms | 137721 tok/s) step 20178/76294 | train loss 3.500616 | norm 3.3695 | lr 1.20e-04 | (3799.90 ms | 137974 tok/s) step 20179/76294 | train loss 3.436555 | norm 3.3663 | lr 1.20e-04 | (3827.55 ms | 136977 tok/s) step 20180/76294 | train loss 3.441258 | norm 8.2911 | lr 1.20e-04 | (3882.84 ms | 135027 tok/s) step 20181/76294 | train loss 3.744864 | norm 3.7643 | lr 1.20e-04 | (3804.42 ms | 137810 tok/s) step 20182/76294 | train loss 3.455929 | norm 4.0453 | lr 1.20e-04 | (3815.97 ms | 137393 tok/s) step 20183/76294 | train loss 3.526402 | norm 4.6480 | lr 1.20e-04 | (3814.05 ms | 137462 tok/s) step 20184/76294 | train loss 3.476482 | norm 3.1995 | lr 1.20e-04 | (3818.12 ms | 137316 tok/s) step 20185/76294 | train loss 3.450483 | norm 2.7730 | lr 1.20e-04 | (3836.76 ms | 136649 tok/s) step 20186/76294 | train loss 3.406015 | norm 3.2791 | lr 1.20e-04 | (3838.68 ms | 136580 tok/s) step 20187/76294 | train loss 3.461958 | norm 2.1589 | lr 1.20e-04 | (3844.15 ms | 136386 tok/s) step 20188/76294 | train loss 3.448509 | norm 4.6514 | lr 1.20e-04 | (3803.85 ms | 137831 tok/s) step 20189/76294 | train loss 3.492628 | norm 3.2665 | lr 1.20e-04 | (3828.25 ms | 136952 tok/s) step 20190/76294 | train loss 3.450536 | norm 2.6430 | lr 1.20e-04 | (3801.95 ms | 137900 tok/s) step 20191/76294 | train loss 3.481404 | norm 2.7043 | lr 1.20e-04 | (3917.85 ms | 133820 tok/s) step 20192/76294 | train loss 3.580760 | norm 2.9141 | lr 1.20e-04 | (3793.88 ms | 138193 tok/s) step 20193/76294 | train loss 3.425607 | norm 3.6711 | lr 1.20e-04 | (3797.25 ms | 138071 tok/s) step 20194/76294 | train loss 3.453693 | norm 4.7784 | lr 1.20e-04 | (3817.66 ms | 137332 tok/s) step 20195/76294 | train loss 3.486945 | norm 7.9026 | lr 
1.20e-04 | (3801.94 ms | 137900 tok/s) step 20196/76294 | train loss 3.429468 | norm 5.8576 | lr 1.20e-04 | (3980.33 ms | 131720 tok/s) step 20197/76294 | train loss 3.514832 | norm 5.2860 | lr 1.20e-04 | (5726.04 ms | 91562 tok/s) step 20198/76294 | train loss 3.489490 | norm 3.8011 | lr 1.20e-04 | (5342.72 ms | 98131 tok/s) step 20199/76294 | train loss 3.537290 | norm 2.0092 | lr 1.20e-04 | (3813.37 ms | 137487 tok/s) step 20200/76294 | train loss 3.408398 | norm 2.7784 | lr 1.20e-04 | (3791.02 ms | 138297 tok/s) step 20201/76294 | train loss 3.552080 | norm 3.4147 | lr 1.20e-04 | (3875.69 ms | 135276 tok/s) step 20202/76294 | train loss 3.478285 | norm 3.3830 | lr 1.20e-04 | (3790.96 ms | 138299 tok/s) step 20203/76294 | train loss 3.422114 | norm 10.0916 | lr 1.20e-04 | (3945.92 ms | 132868 tok/s) step 20204/76294 | train loss 3.409103 | norm 4.9168 | lr 1.20e-04 | (3835.52 ms | 136693 tok/s) step 20205/76294 | train loss 3.449146 | norm 9.5390 | lr 1.20e-04 | (12366.45 ms | 42396 tok/s) step 20206/76294 | train loss 3.510689 | norm 23.4838 | lr 1.20e-04 | (3776.67 ms | 138823 tok/s) step 20207/76294 | train loss 3.383873 | norm 4.9565 | lr 1.20e-04 | (3780.75 ms | 138673 tok/s) step 20208/76294 | train loss 3.514634 | norm 6.4865 | lr 1.20e-04 | (3985.97 ms | 131533 tok/s) step 20209/76294 | train loss 3.540127 | norm 7.0290 | lr 1.20e-04 | (3857.10 ms | 135928 tok/s) step 20210/76294 | train loss 3.417197 | norm 3.4719 | lr 1.20e-04 | (3763.49 ms | 139309 tok/s) step 20211/76294 | train loss 3.446710 | norm 3.4978 | lr 1.20e-04 | (3869.36 ms | 135497 tok/s) step 20212/76294 | train loss 3.436826 | norm 15.5458 | lr 1.20e-04 | (5451.10 ms | 96180 tok/s) step 20213/76294 | train loss 3.430017 | norm 4.8092 | lr 1.20e-04 | (3863.88 ms | 135690 tok/s) step 20214/76294 | train loss 3.449259 | norm 10.7320 | lr 1.20e-04 | (3756.73 ms | 139559 tok/s) step 20215/76294 | train loss 3.429901 | norm 5.0157 | lr 1.20e-04 | (3763.37 ms | 139314 tok/s) step 20216/76294 | 
train loss 3.441072 | norm 18.5361 | lr 1.20e-04 | (3783.60 ms | 138569 tok/s) step 20217/76294 | train loss 3.519149 | norm 4.2159 | lr 1.20e-04 | (3775.01 ms | 138884 tok/s) step 20218/76294 | train loss 3.474174 | norm 5.6493 | lr 1.20e-04 | (3774.78 ms | 138892 tok/s) step 20219/76294 | train loss 3.518716 | norm 6.1427 | lr 1.20e-04 | (3772.37 ms | 138981 tok/s) step 20220/76294 | train loss 3.479513 | norm 3.7307 | lr 1.20e-04 | (3805.39 ms | 137775 tok/s) step 20221/76294 | train loss 3.493903 | norm 5.3347 | lr 1.20e-04 | (3776.20 ms | 138840 tok/s) step 20222/76294 | train loss 3.529673 | norm 6.6943 | lr 1.20e-04 | (3782.83 ms | 138597 tok/s) step 20223/76294 | train loss 3.408801 | norm 4.8659 | lr 1.20e-04 | (3803.67 ms | 137837 tok/s) step 20224/76294 | train loss 3.509517 | norm 3.9103 | lr 1.20e-04 | (3788.03 ms | 138407 tok/s) step 20225/76294 | train loss 3.396068 | norm 4.9987 | lr 1.20e-04 | (3808.93 ms | 137647 tok/s) step 20226/76294 | train loss 3.561533 | norm 8.2167 | lr 1.20e-04 | (3789.87 ms | 138339 tok/s) step 20227/76294 | train loss 3.456243 | norm 9.5749 | lr 1.20e-04 | (3794.85 ms | 138158 tok/s) step 20228/76294 | train loss 3.503911 | norm 5.3300 | lr 1.20e-04 | (3884.86 ms | 134957 tok/s) step 20229/76294 | train loss 3.512638 | norm 5.9709 | lr 1.20e-04 | (3790.52 ms | 138316 tok/s) step 20230/76294 | train loss 3.444490 | norm 5.0425 | lr 1.20e-04 | (3822.17 ms | 137170 tok/s) step 20231/76294 | train loss 3.561312 | norm 4.9459 | lr 1.20e-04 | (3790.99 ms | 138298 tok/s) step 20232/76294 | train loss 3.490225 | norm 39.6611 | lr 1.20e-04 | (3794.07 ms | 138186 tok/s) step 20233/76294 | train loss 3.471295 | norm 28.3140 | lr 1.20e-04 | (3810.66 ms | 137584 tok/s) step 20234/76294 | train loss 3.471832 | norm 6.1745 | lr 1.20e-04 | (3792.32 ms | 138250 tok/s) step 20235/76294 | train loss 3.456271 | norm 2.7901 | lr 1.20e-04 | (3799.01 ms | 138006 tok/s) step 20236/76294 | train loss 3.483154 | norm 7.7633 | lr 1.20e-04 | 
(3801.03 ms | 137933 tok/s) step 20237/76294 | train loss 3.430864 | norm 3.8776 | lr 1.20e-04 | (3800.96 ms | 137936 tok/s) step 20238/76294 | train loss 3.503974 | norm 4.5761 | lr 1.20e-04 | (3801.61 ms | 137912 tok/s) step 20239/76294 | train loss 3.443380 | norm 4.3035 | lr 1.20e-04 | (3809.56 ms | 137624 tok/s) step 20240/76294 | train loss 3.461321 | norm 4.1658 | lr 1.20e-04 | (3799.92 ms | 137973 tok/s) step 20241/76294 | train loss 3.425380 | norm 7.8669 | lr 1.20e-04 | (3808.17 ms | 137674 tok/s) step 20242/76294 | train loss 3.435013 | norm 6.2557 | lr 1.20e-04 | (3802.91 ms | 137865 tok/s) step 20243/76294 | train loss 3.472822 | norm 3.7479 | lr 1.20e-04 | (3841.50 ms | 136480 tok/s) step 20244/76294 | train loss 3.437676 | norm 3.8292 | lr 1.20e-04 | (3798.25 ms | 138034 tok/s) step 20245/76294 | train loss 3.477751 | norm 4.5135 | lr 1.20e-04 | (3841.25 ms | 136489 tok/s) step 20246/76294 | train loss 3.478116 | norm 4.6843 | lr 1.20e-04 | (3799.14 ms | 138002 tok/s) step 20247/76294 | train loss 3.477825 | norm 5.4875 | lr 1.20e-04 | (3804.29 ms | 137815 tok/s) step 20248/76294 | train loss 3.558612 | norm 31.2015 | lr 1.20e-04 | (3819.03 ms | 137283 tok/s) step 20249/76294 | train loss 3.488087 | norm 3.9927 | lr 1.20e-04 | (3803.23 ms | 137853 tok/s) step 20250/76294 | train loss 3.430046 | norm 3.1919 | lr 1.20e-04 | (3797.90 ms | 138047 tok/s) val loss: 3.441334 saving model checkpoint to ./results/gpt2-124M-gqa/step_20250.pth step 20251/76294 | train loss 3.466096 | norm 5.0113 | lr 1.20e-04 | (3893.33 ms | 134663 tok/s) step 20252/76294 | train loss 3.424831 | norm 3.6314 | lr 1.20e-04 | (3865.85 ms | 135621 tok/s) step 20253/76294 | train loss 3.472256 | norm 2.5389 | lr 1.20e-04 | (3797.19 ms | 138073 tok/s) step 20254/76294 | train loss 3.413983 | norm 3.0349 | lr 1.20e-04 | (3819.30 ms | 137273 tok/s) step 20255/76294 | train loss 3.486780 | norm 5.0961 | lr 1.20e-04 | (3796.07 ms | 138113 tok/s) step 20256/76294 | train loss 3.483423 | 
norm 3.7155 | lr 1.20e-04 | (3845.57 ms | 136335 tok/s) step 20257/76294 | train loss 3.459108 | norm 3.7099 | lr 1.20e-04 | (3805.69 ms | 137764 tok/s) step 20258/76294 | train loss 3.495692 | norm 4.4921 | lr 1.20e-04 | (3817.67 ms | 137332 tok/s) step 20259/76294 | train loss 3.451574 | norm 4.0084 | lr 1.20e-04 | (3820.87 ms | 137217 tok/s) step 20260/76294 | train loss 3.408705 | norm 4.6025 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 20261/76294 | train loss 3.458783 | norm 3.0397 | lr 1.20e-04 | (3804.94 ms | 137792 tok/s) step 20262/76294 | train loss 3.426997 | norm 6.8532 | lr 1.20e-04 | (3806.53 ms | 137734 tok/s) step 20263/76294 | train loss 3.471022 | norm 5.3284 | lr 1.20e-04 | (3807.72 ms | 137691 tok/s) step 20264/76294 | train loss 3.529673 | norm 4.0165 | lr 1.20e-04 | (3804.87 ms | 137794 tok/s) step 20265/76294 | train loss 3.433562 | norm 2.7627 | lr 1.20e-04 | (3807.73 ms | 137691 tok/s) step 20266/76294 | train loss 3.419856 | norm 3.2817 | lr 1.20e-04 | (3824.32 ms | 137093 tok/s) step 20267/76294 | train loss 3.421508 | norm 3.8849 | lr 1.20e-04 | (3800.20 ms | 137963 tok/s) step 20268/76294 | train loss 3.434125 | norm 4.0838 | lr 1.20e-04 | (3821.47 ms | 137196 tok/s) step 20269/76294 | train loss 3.436900 | norm 3.4779 | lr 1.20e-04 | (3804.14 ms | 137821 tok/s) step 20270/76294 | train loss 3.516094 | norm 4.0450 | lr 1.20e-04 | (3805.53 ms | 137770 tok/s) step 20271/76294 | train loss 3.407103 | norm 4.8850 | lr 1.20e-04 | (3826.96 ms | 136999 tok/s) step 20272/76294 | train loss 3.482463 | norm 5.5444 | lr 1.20e-04 | (3804.29 ms | 137815 tok/s) step 20273/76294 | train loss 3.420475 | norm 5.9898 | lr 1.20e-04 | (3806.25 ms | 137744 tok/s) step 20274/76294 | train loss 3.467182 | norm 7.7321 | lr 1.20e-04 | (3806.38 ms | 137739 tok/s) step 20275/76294 | train loss 3.494056 | norm 16.0545 | lr 1.20e-04 | (3797.10 ms | 138076 tok/s) step 20276/76294 | train loss 3.459516 | norm 6.0705 | lr 1.20e-04 | (3829.55 ms | 136906 tok/s) 
step 20277/76294 | train loss 3.462025 | norm 10.9242 | lr 1.20e-04 | (3799.82 ms | 137977 tok/s) step 20278/76294 | train loss 3.442260 | norm 3.2718 | lr 1.20e-04 | (3902.81 ms | 134336 tok/s) step 20279/76294 | train loss 3.460397 | norm 4.8000 | lr 1.20e-04 | (3799.40 ms | 137992 tok/s) step 20280/76294 | train loss 3.466110 | norm 12.2138 | lr 1.20e-04 | (3815.74 ms | 137401 tok/s) step 20281/76294 | train loss 3.496236 | norm 5.0936 | lr 1.20e-04 | (3824.76 ms | 137077 tok/s) step 20282/76294 | train loss 3.476676 | norm 6.3122 | lr 1.20e-04 | (3801.07 ms | 137932 tok/s) step 20283/76294 | train loss 3.384413 | norm 6.3689 | lr 1.20e-04 | (3806.97 ms | 137718 tok/s) step 20284/76294 | train loss 3.495230 | norm 4.9740 | lr 1.20e-04 | (3825.96 ms | 137035 tok/s) step 20285/76294 | train loss 3.500973 | norm 6.0782 | lr 1.20e-04 | (3797.03 ms | 138078 tok/s) step 20286/76294 | train loss 3.480909 | norm 30.0479 | lr 1.20e-04 | (3808.24 ms | 137672 tok/s) step 20287/76294 | train loss 3.486725 | norm 6.3047 | lr 1.20e-04 | (3817.07 ms | 137354 tok/s) step 20288/76294 | train loss 3.414911 | norm 3.9909 | lr 1.20e-04 | (3803.97 ms | 137827 tok/s) step 20289/76294 | train loss 3.511211 | norm 3.8499 | lr 1.20e-04 | (3808.73 ms | 137654 tok/s) step 20290/76294 | train loss 3.477569 | norm 7.6924 | lr 1.20e-04 | (3799.37 ms | 137994 tok/s) step 20291/76294 | train loss 3.491268 | norm 4.5082 | lr 1.20e-04 | (3807.25 ms | 137708 tok/s) step 20292/76294 | train loss 3.457139 | norm 4.7017 | lr 1.20e-04 | (3844.71 ms | 136366 tok/s) step 20293/76294 | train loss 3.447056 | norm 7.4781 | lr 1.20e-04 | (3796.75 ms | 138089 tok/s) step 20294/76294 | train loss 3.500864 | norm 4.7104 | lr 1.20e-04 | (3802.13 ms | 137893 tok/s) step 20295/76294 | train loss 3.403236 | norm 4.3420 | lr 1.20e-04 | (3819.16 ms | 137279 tok/s) step 20296/76294 | train loss 3.475795 | norm 2.9791 | lr 1.20e-04 | (3801.33 ms | 137922 tok/s) step 20297/76294 | train loss 3.509494 | norm 5.6482 | 
lr 1.20e-04 | (3798.11 ms | 138039 tok/s) step 20298/76294 | train loss 3.421736 | norm 5.2202 | lr 1.20e-04 | (3821.57 ms | 137192 tok/s) step 20299/76294 | train loss 3.547458 | norm 4.8213 | lr 1.20e-04 | (3798.78 ms | 138015 tok/s) step 20300/76294 | train loss 3.494367 | norm 3.2924 | lr 1.20e-04 | (3816.88 ms | 137360 tok/s) step 20301/76294 | train loss 3.570268 | norm 3.5186 | lr 1.20e-04 | (3798.23 ms | 138035 tok/s) step 20302/76294 | train loss 3.497478 | norm 8.7796 | lr 1.20e-04 | (3802.52 ms | 137879 tok/s) step 20303/76294 | train loss 3.434585 | norm 4.7264 | lr 1.20e-04 | (3821.42 ms | 137197 tok/s) step 20304/76294 | train loss 3.411373 | norm 3.7449 | lr 1.20e-04 | (3916.58 ms | 133864 tok/s) step 20305/76294 | train loss 3.440247 | norm 3.5093 | lr 1.20e-04 | (3802.10 ms | 137894 tok/s) step 20306/76294 | train loss 3.399030 | norm 3.6890 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 20307/76294 | train loss 3.475854 | norm 4.4256 | lr 1.20e-04 | (3818.72 ms | 137294 tok/s) step 20308/76294 | train loss 3.429776 | norm 3.8393 | lr 1.20e-04 | (3801.59 ms | 137913 tok/s) step 20309/76294 | train loss 3.416049 | norm 7.1157 | lr 1.20e-04 | (3805.50 ms | 137771 tok/s) step 20310/76294 | train loss 3.442892 | norm 28.9672 | lr 1.20e-04 | (3799.48 ms | 137989 tok/s) step 20311/76294 | train loss 3.425803 | norm 5.5581 | lr 1.20e-04 | (3807.38 ms | 137703 tok/s) step 20312/76294 | train loss 3.443770 | norm 4.5699 | lr 1.20e-04 | (3798.96 ms | 138008 tok/s) step 20313/76294 | train loss 3.465487 | norm 3.6055 | lr 1.20e-04 | (3804.80 ms | 137796 tok/s) step 20314/76294 | train loss 3.420503 | norm 4.4426 | lr 1.20e-04 | (3799.56 ms | 137986 tok/s) step 20315/76294 | train loss 3.450252 | norm 3.2340 | lr 1.20e-04 | (3794.88 ms | 138157 tok/s) step 20316/76294 | train loss 3.455716 | norm 5.1830 | lr 1.20e-04 | (3845.44 ms | 136340 tok/s) step 20317/76294 | train loss 3.430176 | norm 9.0365 | lr 1.20e-04 | (3796.32 ms | 138104 tok/s) step 20318/76294 
| train loss 3.434763 | norm 6.3648 | lr 1.20e-04 | (3797.55 ms | 138060 tok/s) step 20319/76294 | train loss 3.540152 | norm 4.4218 | lr 1.20e-04 | (3819.42 ms | 137269 tok/s) step 20320/76294 | train loss 3.471147 | norm 3.1237 | lr 1.20e-04 | (3799.31 ms | 137996 tok/s) step 20321/76294 | train loss 3.378355 | norm 5.0555 | lr 1.20e-04 | (3796.20 ms | 138108 tok/s) step 20322/76294 | train loss 3.433880 | norm 4.9989 | lr 1.20e-04 | (3831.56 ms | 136834 tok/s) step 20323/76294 | train loss 3.467825 | norm 3.3748 | lr 1.20e-04 | (3799.66 ms | 137983 tok/s) step 20324/76294 | train loss 3.370576 | norm 2.6545 | lr 1.20e-04 | (4490.24 ms | 116762 tok/s) step 20325/76294 | train loss 3.362019 | norm 4.1824 | lr 1.20e-04 | (3818.14 ms | 137315 tok/s) step 20326/76294 | train loss 3.478777 | norm 3.8443 | lr 1.20e-04 | (3805.19 ms | 137782 tok/s) step 20327/76294 | train loss 3.560793 | norm 2.7302 | lr 1.20e-04 | (3823.54 ms | 137121 tok/s) step 20328/76294 | train loss 3.470430 | norm 4.7490 | lr 1.20e-04 | (3802.95 ms | 137864 tok/s) step 20329/76294 | train loss 3.379188 | norm 3.9012 | lr 1.20e-04 | (3827.14 ms | 136992 tok/s) step 20330/76294 | train loss 3.511460 | norm 6.2356 | lr 1.20e-04 | (3802.11 ms | 137894 tok/s) step 20331/76294 | train loss 3.377326 | norm 5.2690 | lr 1.20e-04 | (3836.11 ms | 136672 tok/s) step 20332/76294 | train loss 3.424048 | norm 7.8977 | lr 1.20e-04 | (3797.17 ms | 138073 tok/s) step 20333/76294 | train loss 3.423180 | norm 3.7488 | lr 1.20e-04 | (3802.49 ms | 137880 tok/s) step 20334/76294 | train loss 3.426857 | norm 3.2713 | lr 1.20e-04 | (3816.92 ms | 137359 tok/s) step 20335/76294 | train loss 3.479114 | norm 3.7458 | lr 1.20e-04 | (3799.41 ms | 137992 tok/s) step 20336/76294 | train loss 3.440183 | norm 4.9510 | lr 1.20e-04 | (3800.59 ms | 137949 tok/s) step 20337/76294 | train loss 3.508392 | norm 3.4455 | lr 1.20e-04 | (3799.44 ms | 137991 tok/s) step 20338/76294 | train loss 3.462454 | norm 4.3340 | lr 1.20e-04 | 
(3802.61 ms | 137876 tok/s) step 20339/76294 | train loss 3.456628 | norm 4.4690 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 20340/76294 | train loss 3.417064 | norm 3.3057 | lr 1.20e-04 | (3808.89 ms | 137648 tok/s) step 20341/76294 | train loss 3.495325 | norm 2.2708 | lr 1.20e-04 | (3803.56 ms | 137841 tok/s) step 20342/76294 | train loss 3.512362 | norm 4.7889 | lr 1.20e-04 | (3807.04 ms | 137716 tok/s) step 20343/76294 | train loss 3.436316 | norm 4.9804 | lr 1.20e-04 | (3806.51 ms | 137735 tok/s) step 20344/76294 | train loss 3.466237 | norm 4.4627 | lr 1.20e-04 | (3801.02 ms | 137934 tok/s) step 20345/76294 | train loss 3.434576 | norm 4.0118 | lr 1.20e-04 | (3830.76 ms | 136863 tok/s) step 20346/76294 | train loss 3.506780 | norm 4.9459 | lr 1.20e-04 | (3799.30 ms | 137996 tok/s) step 20347/76294 | train loss 3.437084 | norm 4.1198 | lr 1.20e-04 | (3844.05 ms | 136389 tok/s) step 20348/76294 | train loss 3.480460 | norm 3.9565 | lr 1.20e-04 | (3796.77 ms | 138088 tok/s) step 20349/76294 | train loss 3.431830 | norm 6.4987 | lr 1.20e-04 | (3804.67 ms | 137801 tok/s) step 20350/76294 | train loss 3.376418 | norm 6.2558 | lr 1.20e-04 | (3826.00 ms | 137033 tok/s) step 20351/76294 | train loss 3.464638 | norm 4.1421 | lr 1.20e-04 | (3798.69 ms | 138018 tok/s) step 20352/76294 | train loss 3.456521 | norm 4.1057 | lr 1.20e-04 | (3805.69 ms | 137764 tok/s) step 20353/76294 | train loss 3.536119 | norm 2.7861 | lr 1.20e-04 | (3801.26 ms | 137925 tok/s) step 20354/76294 | train loss 3.488244 | norm 2.6436 | lr 1.20e-04 | (3806.63 ms | 137730 tok/s) step 20355/76294 | train loss 3.484054 | norm 4.2591 | lr 1.20e-04 | (3874.73 ms | 135309 tok/s) step 20356/76294 | train loss 3.435726 | norm 3.5374 | lr 1.20e-04 | (3797.64 ms | 138056 tok/s) step 20357/76294 | train loss 3.505207 | norm 3.7944 | lr 1.20e-04 | (3831.52 ms | 136836 tok/s) step 20358/76294 | train loss 3.427892 | norm 4.0723 | lr 1.20e-04 | (3802.29 ms | 137888 tok/s) step 20359/76294 | train loss 
3.444911 | norm 3.1112 | lr 1.20e-04 | (3800.88 ms | 137938 tok/s) step 20360/76294 | train loss 3.469037 | norm 4.6668 | lr 1.20e-04 | (3892.44 ms | 134694 tok/s) step 20361/76294 | train loss 3.541870 | norm 4.9049 | lr 1.20e-04 | (3820.61 ms | 137226 tok/s) step 20362/76294 | train loss 3.461695 | norm 3.8196 | lr 1.20e-04 | (3803.64 ms | 137839 tok/s) step 20363/76294 | train loss 3.405820 | norm 3.4884 | lr 1.20e-04 | (3802.07 ms | 137895 tok/s) step 20364/76294 | train loss 3.507141 | norm 3.3757 | lr 1.20e-04 | (3805.66 ms | 137765 tok/s) step 20365/76294 | train loss 3.399660 | norm 3.7365 | lr 1.20e-04 | (3802.63 ms | 137875 tok/s) step 20366/76294 | train loss 3.451030 | norm 5.4901 | lr 1.20e-04 | (3806.31 ms | 137742 tok/s) step 20367/76294 | train loss 3.437442 | norm 3.9868 | lr 1.20e-04 | (3803.50 ms | 137843 tok/s) step 20368/76294 | train loss 3.563801 | norm 3.6653 | lr 1.20e-04 | (3803.88 ms | 137830 tok/s) step 20369/76294 | train loss 3.474507 | norm 3.2194 | lr 1.20e-04 | (3803.70 ms | 137836 tok/s) step 20370/76294 | train loss 3.420676 | norm 4.4430 | lr 1.20e-04 | (3825.33 ms | 137057 tok/s) step 20371/76294 | train loss 3.450902 | norm 2.8980 | lr 1.20e-04 | (3801.17 ms | 137928 tok/s) step 20372/76294 | train loss 3.460878 | norm 4.3964 | lr 1.20e-04 | (3803.61 ms | 137839 tok/s) step 20373/76294 | train loss 3.523188 | norm 5.0672 | lr 1.20e-04 | (3802.07 ms | 137895 tok/s) step 20374/76294 | train loss 3.560085 | norm 3.9476 | lr 1.20e-04 | (3804.22 ms | 137817 tok/s) step 20375/76294 | train loss 3.498555 | norm 3.8099 | lr 1.20e-04 | (3809.64 ms | 137621 tok/s) step 20376/76294 | train loss 3.422804 | norm 3.2501 | lr 1.20e-04 | (3826.63 ms | 137010 tok/s) step 20377/76294 | train loss 3.440113 | norm 3.8497 | lr 1.20e-04 | (3818.25 ms | 137311 tok/s) step 20378/76294 | train loss 3.459221 | norm 3.2033 | lr 1.20e-04 | (3864.51 ms | 135667 tok/s) step 20379/76294 | train loss 3.503179 | norm 3.8539 | lr 1.20e-04 | (3814.46 ms | 137447 
tok/s) step 20380/76294 | train loss 3.442278 | norm 3.0453 | lr 1.20e-04 | (3801.03 ms | 137933 tok/s) step 20381/76294 | train loss 3.423105 | norm 2.7372 | lr 1.20e-04 | (3825.66 ms | 137045 tok/s) step 20382/76294 | train loss 3.509912 | norm 5.8237 | lr 1.20e-04 | (3796.57 ms | 138095 tok/s) step 20383/76294 | train loss 3.389035 | norm 7.4374 | lr 1.20e-04 | (3798.25 ms | 138034 tok/s) step 20384/76294 | train loss 3.433168 | norm 6.7791 | lr 1.20e-04 | (3820.06 ms | 137246 tok/s) step 20385/76294 | train loss 3.381175 | norm 4.1194 | lr 1.20e-04 | (3828.06 ms | 136959 tok/s) step 20386/76294 | train loss 3.459712 | norm 5.7662 | lr 1.20e-04 | (3798.21 ms | 138036 tok/s) step 20387/76294 | train loss 3.451731 | norm 5.3876 | lr 1.20e-04 | (3813.51 ms | 137482 tok/s) step 20388/76294 | train loss 3.396625 | norm 4.3634 | lr 1.20e-04 | (3807.11 ms | 137713 tok/s) step 20389/76294 | train loss 3.361628 | norm 3.4494 | lr 1.20e-04 | (3798.54 ms | 138024 tok/s) step 20390/76294 | train loss 3.372957 | norm 5.5718 | lr 1.20e-04 | (3839.35 ms | 136557 tok/s) step 20391/76294 | train loss 3.461443 | norm 6.6189 | lr 1.20e-04 | (3793.92 ms | 138192 tok/s) step 20392/76294 | train loss 3.413806 | norm 4.9449 | lr 1.20e-04 | (3808.06 ms | 137679 tok/s) step 20393/76294 | train loss 3.505779 | norm 3.8441 | lr 1.20e-04 | (3798.69 ms | 138018 tok/s) step 20394/76294 | train loss 3.392608 | norm 3.7245 | lr 1.20e-04 | (3822.86 ms | 137146 tok/s) step 20395/76294 | train loss 3.484117 | norm 5.9139 | lr 1.20e-04 | (3795.40 ms | 138138 tok/s) step 20396/76294 | train loss 3.423896 | norm 4.9941 | lr 1.20e-04 | (3798.20 ms | 138036 tok/s) step 20397/76294 | train loss 3.451001 | norm 5.2882 | lr 1.20e-04 | (3816.81 ms | 137363 tok/s) step 20398/76294 | train loss 3.471738 | norm 3.1043 | lr 1.20e-04 | (3800.88 ms | 137939 tok/s) step 20399/76294 | train loss 3.571551 | norm 5.6831 | lr 1.20e-04 | (3819.63 ms | 137262 tok/s) step 20400/76294 | train loss 3.461454 | norm 6.1366 
| lr 1.20e-04 | (3802.76 ms | 137870 tok/s) step 20401/76294 | train loss 3.476494 | norm 10.2788 | lr 1.20e-04 | (3802.59 ms | 137877 tok/s) step 20402/76294 | train loss 3.474672 | norm 5.4958 | lr 1.20e-04 | (3838.19 ms | 136598 tok/s) step 20403/76294 | train loss 3.369920 | norm 8.4910 | lr 1.20e-04 | (3801.89 ms | 137902 tok/s) step 20404/76294 | train loss 3.548079 | norm 6.1331 | lr 1.20e-04 | (3837.13 ms | 136636 tok/s) step 20405/76294 | train loss 3.472506 | norm 4.6334 | lr 1.20e-04 | (3802.38 ms | 137884 tok/s) step 20406/76294 | train loss 3.467848 | norm 5.3214 | lr 1.20e-04 | (3805.97 ms | 137754 tok/s) step 20407/76294 | train loss 3.401536 | norm 14.5583 | lr 1.20e-04 | (3889.61 ms | 134792 tok/s) step 20408/76294 | train loss 3.431062 | norm 6.5290 | lr 1.20e-04 | (3796.00 ms | 138116 tok/s) step 20409/76294 | train loss 3.427295 | norm 8.9595 | lr 1.20e-04 | (3854.34 ms | 136025 tok/s) step 20410/76294 | train loss 3.475635 | norm 9.0239 | lr 1.20e-04 | (3799.02 ms | 138006 tok/s) step 20411/76294 | train loss 3.554832 | norm 7.5215 | lr 1.20e-04 | (3824.76 ms | 137077 tok/s) step 20412/76294 | train loss 3.435797 | norm 6.4448 | lr 1.20e-04 | (3797.40 ms | 138065 tok/s) step 20413/76294 | train loss 3.446431 | norm 6.4949 | lr 1.20e-04 | (3871.98 ms | 135406 tok/s) step 20414/76294 | train loss 3.453490 | norm 5.5068 | lr 1.20e-04 | (3797.24 ms | 138071 tok/s) step 20415/76294 | train loss 3.441923 | norm 22.3824 | lr 1.20e-04 | (3822.88 ms | 137145 tok/s) step 20416/76294 | train loss 3.435805 | norm 14.8645 | lr 1.20e-04 | (3795.18 ms | 138146 tok/s) step 20417/76294 | train loss 3.508489 | norm 7.7146 | lr 1.20e-04 | (3824.45 ms | 137088 tok/s) step 20418/76294 | train loss 3.361721 | norm 23.0679 | lr 1.20e-04 | (3804.24 ms | 137817 tok/s) step 20419/76294 | train loss 3.464260 | norm 10.1986 | lr 1.20e-04 | (3895.95 ms | 134572 tok/s) step 20420/76294 | train loss 3.511108 | norm 5.1503 | lr 1.20e-04 | (3818.37 ms | 137307 tok/s) step 
20421/76294 | train loss 3.465025 | norm 6.2022 | lr 1.20e-04 | (3797.57 ms | 138059 tok/s) step 20422/76294 | train loss 3.486840 | norm 4.2143 | lr 1.20e-04 | (3802.25 ms | 137889 tok/s) step 20423/76294 | train loss 3.489719 | norm 4.2926 | lr 1.20e-04 | (3801.20 ms | 137927 tok/s) step 20424/76294 | train loss 3.465273 | norm 4.2891 | lr 1.20e-04 | (3808.14 ms | 137676 tok/s) step 20425/76294 | train loss 3.470392 | norm 3.0131 | lr 1.20e-04 | (3797.33 ms | 138068 tok/s) step 20426/76294 | train loss 3.403829 | norm 7.0302 | lr 1.20e-04 | (3806.82 ms | 137723 tok/s) step 20427/76294 | train loss 3.475568 | norm 4.0585 | lr 1.20e-04 | (3803.71 ms | 137836 tok/s) step 20428/76294 | train loss 3.447598 | norm 7.7766 | lr 1.20e-04 | (3797.68 ms | 138055 tok/s) step 20429/76294 | train loss 3.501270 | norm 4.2986 | lr 1.20e-04 | (3832.29 ms | 136808 tok/s) step 20430/76294 | train loss 3.458415 | norm 7.3626 | lr 1.20e-04 | (3795.85 ms | 138122 tok/s) step 20431/76294 | train loss 3.512664 | norm 5.9022 | lr 1.20e-04 | (3802.01 ms | 137898 tok/s) step 20432/76294 | train loss 3.461373 | norm 8.0585 | lr 1.20e-04 | (3817.28 ms | 137346 tok/s) step 20433/76294 | train loss 3.457015 | norm 14.2641 | lr 1.20e-04 | (3801.04 ms | 137933 tok/s) step 20434/76294 | train loss 3.439634 | norm 7.3054 | lr 1.20e-04 | (3817.43 ms | 137341 tok/s) step 20435/76294 | train loss 3.521250 | norm 25.4679 | lr 1.20e-04 | (3801.20 ms | 137927 tok/s) step 20436/76294 | train loss 3.517749 | norm 10.3508 | lr 1.20e-04 | (3800.78 ms | 137942 tok/s) step 20437/76294 | train loss 3.460179 | norm 13.2171 | lr 1.20e-04 | (4271.41 ms | 122744 tok/s) step 20438/76294 | train loss 3.518342 | norm 10.6876 | lr 1.20e-04 | (3854.70 ms | 136013 tok/s) step 20439/76294 | train loss 3.505594 | norm 7.3302 | lr 1.20e-04 | (3795.87 ms | 138121 tok/s) step 20440/76294 | train loss 3.541901 | norm 10.6844 | lr 1.20e-04 | (3821.89 ms | 137180 tok/s) step 20441/76294 | train loss 3.514468 | norm 10.1058 | lr 
1.20e-04 | (3794.79 ms | 138160 tok/s) step 20442/76294 | train loss 3.501357 | norm 5.3444 | lr 1.20e-04 | (7270.81 ms | 72109 tok/s) step 20443/76294 | train loss 3.476329 | norm 4.7326 | lr 1.20e-04 | (3788.22 ms | 138400 tok/s) step 20444/76294 | train loss 3.455173 | norm 9.8717 | lr 1.20e-04 | (3796.79 ms | 138087 tok/s) step 20445/76294 | train loss 3.451924 | norm 5.6448 | lr 1.20e-04 | (3813.46 ms | 137483 tok/s) step 20446/76294 | train loss 3.487534 | norm 12.3303 | lr 1.20e-04 | (3798.52 ms | 138024 tok/s) step 20447/76294 | train loss 3.388518 | norm 9.8579 | lr 1.20e-04 | (3813.72 ms | 137474 tok/s) step 20448/76294 | train loss 3.531246 | norm 29.4332 | lr 1.20e-04 | (3888.74 ms | 134822 tok/s) step 20449/76294 | train loss 3.486076 | norm 19.5488 | lr 1.20e-04 | (3800.91 ms | 137938 tok/s) step 20450/76294 | train loss 3.467828 | norm 14.8901 | lr 1.20e-04 | (3805.75 ms | 137762 tok/s) step 20451/76294 | train loss 3.481345 | norm 11.9515 | lr 1.20e-04 | (3821.99 ms | 137177 tok/s) step 20452/76294 | train loss 3.508804 | norm 11.2915 | lr 1.20e-04 | (3806.72 ms | 137727 tok/s) step 20453/76294 | train loss 3.443821 | norm 9.5905 | lr 1.20e-04 | (3806.22 ms | 137745 tok/s) step 20454/76294 | train loss 3.504679 | norm 13.2682 | lr 1.20e-04 | (3809.91 ms | 137612 tok/s) step 20455/76294 | train loss 3.493785 | norm 14.7012 | lr 1.20e-04 | (3809.95 ms | 137610 tok/s) step 20456/76294 | train loss 3.520867 | norm 18.0513 | lr 1.20e-04 | (3804.28 ms | 137815 tok/s) step 20457/76294 | train loss 3.494843 | norm 12.5425 | lr 1.20e-04 | (3869.19 ms | 135503 tok/s) step 20458/76294 | train loss 3.498165 | norm 12.9920 | lr 1.20e-04 | (3798.62 ms | 138021 tok/s) step 20459/76294 | train loss 3.519199 | norm 15.6379 | lr 1.20e-04 | (3800.36 ms | 137957 tok/s) step 20460/76294 | train loss 3.484512 | norm 17.6300 | lr 1.20e-04 | (3816.77 ms | 137364 tok/s) step 20461/76294 | train loss 3.554239 | norm 17.3984 | lr 1.20e-04 | (3798.36 ms | 138030 tok/s) step 
20462/76294 | train loss 3.539397 | norm 20.9646 | lr 1.20e-04 | (3797.77 ms | 138052 tok/s) step 20463/76294 | train loss 3.477926 | norm 19.6221 | lr 1.20e-04 | (3826.00 ms | 137033 tok/s) step 20464/76294 | train loss 3.462654 | norm 16.7457 | lr 1.20e-04 | (3799.60 ms | 137985 tok/s) step 20465/76294 | train loss 3.449695 | norm 14.9747 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 20466/76294 | train loss 3.462154 | norm 12.4702 | lr 1.20e-04 | (3801.48 ms | 137917 tok/s) step 20467/76294 | train loss 3.480145 | norm 10.4744 | lr 1.20e-04 | (3835.37 ms | 136698 tok/s) step 20468/76294 | train loss 3.435271 | norm 12.1284 | lr 1.20e-04 | (3797.18 ms | 138073 tok/s) step 20469/76294 | train loss 3.473452 | norm 12.6384 | lr 1.20e-04 | (3807.52 ms | 137698 tok/s) step 20470/76294 | train loss 3.498941 | norm 8.3122 | lr 1.20e-04 | (3821.72 ms | 137187 tok/s) step 20471/76294 | train loss 3.456289 | norm 13.8079 | lr 1.20e-04 | (3801.55 ms | 137914 tok/s) step 20472/76294 | train loss 3.478133 | norm 15.7701 | lr 1.20e-04 | (3803.97 ms | 137826 tok/s) step 20473/76294 | train loss 3.524056 | norm 14.2289 | lr 1.20e-04 | (3803.81 ms | 137832 tok/s) step 20474/76294 | train loss 3.447094 | norm 20.1686 | lr 1.20e-04 | (3804.03 ms | 137824 tok/s) step 20475/76294 | train loss 3.459478 | norm 15.4252 | lr 1.20e-04 | (3800.13 ms | 137966 tok/s) step 20476/76294 | train loss 3.641100 | norm 20.2143 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 20477/76294 | train loss 3.507278 | norm 22.2249 | lr 1.20e-04 | (3804.28 ms | 137815 tok/s) step 20478/76294 | train loss 3.424950 | norm 29.2359 | lr 1.20e-04 | (3805.47 ms | 137772 tok/s) step 20479/76294 | train loss 3.675861 | norm 25.5773 | lr 1.20e-04 | (3801.26 ms | 137925 tok/s) step 20480/76294 | train loss 3.437983 | norm 18.0026 | lr 1.20e-04 | (3808.16 ms | 137675 tok/s) step 20481/76294 | train loss 3.498928 | norm 40.3838 | lr 1.20e-04 | (3805.98 ms | 137754 tok/s) step 20482/76294 | train loss 3.501887 | norm 
27.7594 | lr 1.20e-04 | (3807.31 ms | 137705 tok/s) step 20483/76294 | train loss 3.479792 | norm 31.1253 | lr 1.20e-04 | (3875.44 ms | 135285 tok/s) step 20484/76294 | train loss 3.504219 | norm 41.7532 | lr 1.20e-04 | (3800.75 ms | 137943 tok/s) step 20485/76294 | train loss 3.551245 | norm 46.6858 | lr 1.20e-04 | (3808.73 ms | 137654 tok/s) step 20486/76294 | train loss 3.538131 | norm 43.7706 | lr 1.20e-04 | (3798.21 ms | 138035 tok/s) step 20487/76294 | train loss 3.546647 | norm 75.0092 | lr 1.20e-04 | (3801.15 ms | 137929 tok/s) step 20488/76294 | train loss 3.640675 | norm 46.3109 | lr 1.20e-04 | (3815.49 ms | 137410 tok/s) step 20489/76294 | train loss 3.543117 | norm 24.0594 | lr 1.20e-04 | (3799.39 ms | 137993 tok/s) step 20490/76294 | train loss 3.486657 | norm 40.0698 | lr 1.20e-04 | (3817.03 ms | 137355 tok/s) step 20491/76294 | train loss 3.562073 | norm 42.4809 | lr 1.20e-04 | (4112.16 ms | 127497 tok/s) step 20492/76294 | train loss 3.552023 | norm 38.5615 | lr 1.20e-04 | (3801.54 ms | 137915 tok/s) step 20493/76294 | train loss 3.562889 | norm 30.7843 | lr 1.20e-04 | (3808.17 ms | 137674 tok/s) step 20494/76294 | train loss 3.474770 | norm 31.5434 | lr 1.20e-04 | (3896.69 ms | 134547 tok/s) step 20495/76294 | train loss 3.534886 | norm 58.8088 | lr 1.20e-04 | (3791.73 ms | 138272 tok/s) step 20496/76294 | train loss 3.559818 | norm 35.3218 | lr 1.20e-04 | (3823.70 ms | 137115 tok/s) step 20497/76294 | train loss 3.484488 | norm 28.1727 | lr 1.20e-04 | (3790.44 ms | 138319 tok/s) step 20498/76294 | train loss 3.532641 | norm 22.8721 | lr 1.20e-04 | (3797.70 ms | 138054 tok/s) step 20499/76294 | train loss 3.480863 | norm 31.5858 | lr 1.20e-04 | (3821.09 ms | 137209 tok/s) step 20500/76294 | train loss 3.537069 | norm 60.2982 | lr 1.20e-04 | (3802.99 ms | 137862 tok/s) val loss: 3.515145 saving model checkpoint to ./results/gpt2-124M-gqa/step_20500.pth step 20501/76294 | train loss 3.530016 | norm 53.7591 | lr 1.20e-04 | (3906.48 ms | 134210 tok/s) 
step 20502/76294 | train loss 3.566070 | norm 44.5668 | lr 1.20e-04 | (3811.85 ms | 137541 tok/s) step 20503/76294 | train loss 3.582431 | norm 28.4816 | lr 1.20e-04 | (4079.20 ms | 128527 tok/s) step 20504/76294 | train loss 3.503257 | norm 34.3634 | lr 1.20e-04 | (3792.01 ms | 138261 tok/s) step 20505/76294 | train loss 3.521279 | norm 30.0508 | lr 1.20e-04 | (3803.16 ms | 137856 tok/s) step 20506/76294 | train loss 3.500960 | norm 35.9393 | lr 1.20e-04 | (3821.17 ms | 137206 tok/s) step 20507/76294 | train loss 3.455951 | norm 48.8699 | lr 1.20e-04 | (3997.60 ms | 131151 tok/s) step 20508/76294 | train loss 3.496202 | norm 47.5445 | lr 1.20e-04 | (3794.73 ms | 138162 tok/s) step 20509/76294 | train loss 3.484348 | norm 30.0037 | lr 1.20e-04 | (3826.80 ms | 137004 tok/s) step 20510/76294 | train loss 3.554886 | norm 35.7856 | lr 1.20e-04 | (3793.50 ms | 138207 tok/s) step 20511/76294 | train loss 3.465693 | norm 58.5761 | lr 1.20e-04 | (3853.34 ms | 136061 tok/s) step 20512/76294 | train loss 3.465184 | norm 24.8884 | lr 1.20e-04 | (3797.86 ms | 138048 tok/s) step 20513/76294 | train loss 3.495224 | norm 21.9257 | lr 1.20e-04 | (3800.14 ms | 137965 tok/s) step 20514/76294 | train loss 3.529248 | norm 12.1569 | lr 1.20e-04 | (3823.32 ms | 137129 tok/s) step 20515/76294 | train loss 3.530690 | norm 22.5031 | lr 1.20e-04 | (4050.49 ms | 129438 tok/s) step 20516/76294 | train loss 3.529924 | norm 14.7283 | lr 1.20e-04 | (3822.68 ms | 137152 tok/s) step 20517/76294 | train loss 3.638570 | norm 18.4713 | lr 1.20e-04 | (3801.91 ms | 137901 tok/s) step 20518/76294 | train loss 3.502336 | norm 20.6622 | lr 1.20e-04 | (3820.59 ms | 137227 tok/s) step 20519/76294 | train loss 3.585329 | norm 16.4661 | lr 1.20e-04 | (3798.94 ms | 138009 tok/s) step 20520/76294 | train loss 3.542296 | norm 19.2463 | lr 1.20e-04 | (3794.50 ms | 138171 tok/s) step 20521/76294 | train loss 3.586674 | norm 13.2667 | lr 1.20e-04 | (3830.77 ms | 136862 tok/s) step 20522/76294 | train loss 3.526634 
| norm 11.7382 | lr 1.20e-04 | (3791.04 ms | 138296 tok/s) step 20523/76294 | train loss 3.531673 | norm 8.9492 | lr 1.20e-04 | (3792.31 ms | 138250 tok/s) step 20524/76294 | train loss 3.536415 | norm 8.7045 | lr 1.20e-04 | (3815.95 ms | 137394 tok/s) step 20525/76294 | train loss 3.501798 | norm 25.1703 | lr 1.20e-04 | (3827.02 ms | 136996 tok/s) step 20526/76294 | train loss 3.559881 | norm 10.3211 | lr 1.20e-04 | (3787.69 ms | 138419 tok/s) step 20527/76294 | train loss 3.558383 | norm 8.8468 | lr 1.20e-04 | (3818.94 ms | 137286 tok/s) step 20528/76294 | train loss 3.486904 | norm 5.8598 | lr 1.20e-04 | (3791.09 ms | 138295 tok/s) step 20529/76294 | train loss 3.550630 | norm 7.4823 | lr 1.20e-04 | (3829.96 ms | 136891 tok/s) step 20530/76294 | train loss 3.539134 | norm 8.5929 | lr 1.20e-04 | (3814.31 ms | 137453 tok/s) step 20531/76294 | train loss 3.486365 | norm 5.2184 | lr 1.20e-04 | (3827.93 ms | 136964 tok/s) step 20532/76294 | train loss 3.561692 | norm 7.6564 | lr 1.20e-04 | (3840.91 ms | 136501 tok/s) step 20533/76294 | train loss 3.594497 | norm 10.4374 | lr 1.20e-04 | (3800.83 ms | 137940 tok/s) step 20534/76294 | train loss 3.605898 | norm 5.9762 | lr 1.20e-04 | (3791.29 ms | 138288 tok/s) step 20535/76294 | train loss 3.553852 | norm 4.0481 | lr 1.20e-04 | (3793.54 ms | 138205 tok/s) step 20536/76294 | train loss 3.519336 | norm 9.6123 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 20537/76294 | train loss 3.431006 | norm 10.6483 | lr 1.20e-04 | (3825.58 ms | 137048 tok/s) step 20538/76294 | train loss 3.518830 | norm 12.8302 | lr 1.20e-04 | (3801.86 ms | 137903 tok/s) step 20539/76294 | train loss 3.573076 | norm 4.3663 | lr 1.20e-04 | (3822.38 ms | 137163 tok/s) step 20540/76294 | train loss 3.522436 | norm 4.7564 | lr 1.20e-04 | (3809.36 ms | 137632 tok/s) step 20541/76294 | train loss 3.517361 | norm 10.0806 | lr 1.20e-04 | (3800.55 ms | 137950 tok/s) step 20542/76294 | train loss 3.501603 | norm 7.1570 | lr 1.20e-04 | (3796.98 ms | 138080 
tok/s) step 20543/76294 | train loss 3.468684 | norm 5.4228 | lr 1.20e-04 | (3879.23 ms | 135152 tok/s) step 20544/76294 | train loss 3.460609 | norm 3.8377 | lr 1.20e-04 | (3794.33 ms | 138177 tok/s) step 20545/76294 | train loss 3.515764 | norm 5.9007 | lr 1.20e-04 | (3802.03 ms | 137897 tok/s) step 20546/76294 | train loss 3.550314 | norm 4.9328 | lr 1.20e-04 | (3821.48 ms | 137195 tok/s) step 20547/76294 | train loss 3.491727 | norm 9.1842 | lr 1.20e-04 | (3796.11 ms | 138112 tok/s) step 20548/76294 | train loss 3.509046 | norm 26.0346 | lr 1.20e-04 | (3798.54 ms | 138024 tok/s) step 20549/76294 | train loss 3.512853 | norm 8.1863 | lr 1.20e-04 | (3824.71 ms | 137079 tok/s) step 20550/76294 | train loss 3.503293 | norm 8.3884 | lr 1.20e-04 | (3797.74 ms | 138053 tok/s) step 20551/76294 | train loss 3.493505 | norm 5.9784 | lr 1.20e-04 | (3804.37 ms | 137812 tok/s) step 20552/76294 | train loss 3.468662 | norm 6.0421 | lr 1.20e-04 | (3834.18 ms | 136741 tok/s) step 20553/76294 | train loss 3.483745 | norm 4.2696 | lr 1.20e-04 | (3800.21 ms | 137963 tok/s) step 20554/76294 | train loss 3.428606 | norm 107.5242 | lr 1.20e-04 | (3796.28 ms | 138106 tok/s) step 20555/76294 | train loss 3.488969 | norm 7.0284 | lr 1.20e-04 | (3825.25 ms | 137060 tok/s) step 20556/76294 | train loss 3.494033 | norm 5.2350 | lr 1.20e-04 | (3799.44 ms | 137991 tok/s) step 20557/76294 | train loss 3.552023 | norm 4.9845 | lr 1.20e-04 | (3872.30 ms | 135394 tok/s) step 20558/76294 | train loss 3.454620 | norm 5.0207 | lr 1.20e-04 | (3799.44 ms | 137991 tok/s) step 20559/76294 | train loss 3.582790 | norm 4.5738 | lr 1.20e-04 | (3806.90 ms | 137721 tok/s) step 20560/76294 | train loss 3.474846 | norm 7.0985 | lr 1.20e-04 | (3825.40 ms | 137054 tok/s) step 20561/76294 | train loss 3.458434 | norm 6.8671 | lr 1.20e-04 | (3805.66 ms | 137765 tok/s) step 20562/76294 | train loss 3.425861 | norm 4.3525 | lr 1.20e-04 | (3798.34 ms | 138031 tok/s) step 20563/76294 | train loss 3.497096 | norm 
10.0933 | lr 1.20e-04 | (3834.73 ms | 136721 tok/s) step 20564/76294 | train loss 3.525782 | norm 4.2730 | lr 1.20e-04 | (3796.15 ms | 138110 tok/s) step 20565/76294 | train loss 3.567026 | norm 5.3217 | lr 1.20e-04 | (3800.93 ms | 137937 tok/s) step 20566/76294 | train loss 3.525335 | norm 3.9677 | lr 1.20e-04 | (3830.10 ms | 136886 tok/s) step 20567/76294 | train loss 3.442127 | norm 9.1174 | lr 1.20e-04 | (3795.18 ms | 138146 tok/s) step 20568/76294 | train loss 3.466758 | norm 6.1379 | lr 1.20e-04 | (3854.80 ms | 136009 tok/s) step 20569/76294 | train loss 3.448316 | norm 6.3337 | lr 1.20e-04 | (3797.42 ms | 138064 tok/s) step 20570/76294 | train loss 3.428544 | norm 4.7225 | lr 1.20e-04 | (3854.01 ms | 136037 tok/s) step 20571/76294 | train loss 3.474694 | norm 4.1808 | lr 1.20e-04 | (3831.08 ms | 136851 tok/s) step 20572/76294 | train loss 3.402427 | norm 7.3841 | lr 1.20e-04 | (3805.60 ms | 137767 tok/s) step 20573/76294 | train loss 3.552603 | norm 4.9409 | lr 1.20e-04 | (3819.55 ms | 137264 tok/s) step 20574/76294 | train loss 3.419494 | norm 4.4589 | lr 1.20e-04 | (3802.24 ms | 137889 tok/s) step 20575/76294 | train loss 3.461986 | norm 5.5503 | lr 1.20e-04 | (3869.42 ms | 135495 tok/s) step 20576/76294 | train loss 3.550037 | norm 5.5007 | lr 1.20e-04 | (3801.30 ms | 137923 tok/s) step 20577/76294 | train loss 3.493013 | norm 4.0169 | lr 1.20e-04 | (3800.28 ms | 137960 tok/s) step 20578/76294 | train loss 3.461208 | norm 4.4826 | lr 1.20e-04 | (3812.00 ms | 137536 tok/s) step 20579/76294 | train loss 3.468740 | norm 7.7855 | lr 1.20e-04 | (3805.74 ms | 137762 tok/s) step 20580/76294 | train loss 3.519290 | norm 7.8189 | lr 1.20e-04 | (3802.36 ms | 137885 tok/s) step 20581/76294 | train loss 3.479954 | norm 8.8641 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 20582/76294 | train loss 3.459720 | norm 4.1674 | lr 1.20e-04 | (3804.83 ms | 137795 tok/s) step 20583/76294 | train loss 3.474008 | norm 8.8278 | lr 1.20e-04 | (3821.22 ms | 137204 tok/s) step 
20584/76294 | train loss 3.437273 | norm 7.4678 | lr 1.20e-04 | (3798.11 ms | 138039 tok/s) step 20585/76294 | train loss 3.504674 | norm 5.2775 | lr 1.20e-04 | (3816.07 ms | 137390 tok/s) step 20586/76294 | train loss 3.509648 | norm 4.4110 | lr 1.20e-04 | (3796.66 ms | 138092 tok/s) step 20587/76294 | train loss 3.480095 | norm 5.9162 | lr 1.20e-04 | (3818.80 ms | 137291 tok/s) step 20588/76294 | train loss 3.504148 | norm 18.6584 | lr 1.20e-04 | (3799.62 ms | 137984 tok/s) step 20589/76294 | train loss 3.531991 | norm 23.1652 | lr 1.20e-04 | (3824.85 ms | 137074 tok/s) step 20590/76294 | train loss 3.493445 | norm 4.2812 | lr 1.20e-04 | (3799.20 ms | 138000 tok/s) step 20591/76294 | train loss 3.639804 | norm 3.4543 | lr 1.20e-04 | (3801.10 ms | 137931 tok/s) step 20592/76294 | train loss 3.548069 | norm 124.3730 | lr 1.20e-04 | (3822.15 ms | 137171 tok/s) step 20593/76294 | train loss 3.480942 | norm 8.1674 | lr 1.20e-04 | (3809.64 ms | 137622 tok/s) step 20594/76294 | train loss 3.503182 | norm 9.8757 | lr 1.20e-04 | (3804.77 ms | 137797 tok/s) step 20595/76294 | train loss 3.493970 | norm 5.3660 | lr 1.20e-04 | (3800.77 ms | 137943 tok/s) step 20596/76294 | train loss 3.556807 | norm 6.8639 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 20597/76294 | train loss 3.500204 | norm 6.9324 | lr 1.20e-04 | (3799.85 ms | 137976 tok/s) step 20598/76294 | train loss 3.517039 | norm 11.3488 | lr 1.20e-04 | (3805.94 ms | 137755 tok/s) step 20599/76294 | train loss 3.564163 | norm 9.8866 | lr 1.20e-04 | (3803.24 ms | 137853 tok/s) step 20600/76294 | train loss 3.493378 | norm 6.1078 | lr 1.20e-04 | (3802.95 ms | 137864 tok/s) step 20601/76294 | train loss 3.481371 | norm 18.9661 | lr 1.20e-04 | (3805.81 ms | 137760 tok/s) step 20602/76294 | train loss 3.527084 | norm 5.6816 | lr 1.20e-04 | (3857.15 ms | 135926 tok/s) step 20603/76294 | train loss 3.475594 | norm 7.5823 | lr 1.20e-04 | (3793.22 ms | 138217 tok/s) step 20604/76294 | train loss 3.520927 | norm 6.3061 | lr 
1.20e-04 | (3817.28 ms | 137346 tok/s) step 20605/76294 | train loss 3.489177 | norm 53.5809 | lr 1.20e-04 | (3797.26 ms | 138070 tok/s) step 20606/76294 | train loss 3.539358 | norm 5.0196 | lr 1.20e-04 | (3847.53 ms | 136266 tok/s) step 20607/76294 | train loss 3.464296 | norm 5.1832 | lr 1.20e-04 | (3800.25 ms | 137961 tok/s) step 20608/76294 | train loss 3.480155 | norm 16.2197 | lr 1.20e-04 | (3830.15 ms | 136884 tok/s) step 20609/76294 | train loss 3.507045 | norm 17.8232 | lr 1.20e-04 | (3867.98 ms | 135546 tok/s) step 20610/76294 | train loss 3.553525 | norm 6.2918 | lr 1.20e-04 | (3800.03 ms | 137969 tok/s) step 20611/76294 | train loss 3.439095 | norm 6.4734 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 20612/76294 | train loss 3.516757 | norm 5.3445 | lr 1.20e-04 | (3821.49 ms | 137194 tok/s) step 20613/76294 | train loss 3.491577 | norm 4.6655 | lr 1.20e-04 | (3812.85 ms | 137506 tok/s) step 20614/76294 | train loss 3.465391 | norm 3.5694 | lr 1.20e-04 | (3804.60 ms | 137804 tok/s) step 20615/76294 | train loss 3.496599 | norm 5.2122 | lr 1.20e-04 | (3797.49 ms | 138062 tok/s) step 20616/76294 | train loss 3.508255 | norm 5.5170 | lr 1.20e-04 | (3805.15 ms | 137784 tok/s) step 20617/76294 | train loss 3.486461 | norm 5.3691 | lr 1.20e-04 | (3801.06 ms | 137932 tok/s) step 20618/76294 | train loss 3.396276 | norm 6.9056 | lr 1.20e-04 | (3827.72 ms | 136971 tok/s) step 20619/76294 | train loss 3.480456 | norm 5.6638 | lr 1.20e-04 | (3799.36 ms | 137994 tok/s) step 20620/76294 | train loss 3.535500 | norm 15.9188 | lr 1.20e-04 | (3806.19 ms | 137746 tok/s) step 20621/76294 | train loss 3.485266 | norm 8.1634 | lr 1.20e-04 | (3822.57 ms | 137156 tok/s) step 20622/76294 | train loss 3.526221 | norm 10.7887 | lr 1.20e-04 | (3810.78 ms | 137580 tok/s) step 20623/76294 | train loss 3.548882 | norm 9.1828 | lr 1.20e-04 | (3805.34 ms | 137777 tok/s) step 20624/76294 | train loss 3.496445 | norm 7.9514 | lr 1.20e-04 | (3800.39 ms | 137956 tok/s) step 
20625/76294 | train loss 3.526945 | norm 5.2678 | lr 1.20e-04 | (3806.30 ms | 137742 tok/s) step 20626/76294 | train loss 3.491210 | norm 5.4864 | lr 1.20e-04 | (3800.84 ms | 137940 tok/s) step 20627/76294 | train loss 3.517870 | norm 4.6061 | lr 1.20e-04 | (3802.90 ms | 137865 tok/s) step 20628/76294 | train loss 3.573948 | norm 5.8785 | lr 1.20e-04 | (3803.84 ms | 137831 tok/s) step 20629/76294 | train loss 3.496974 | norm 6.8600 | lr 1.20e-04 | (3808.71 ms | 137655 tok/s) step 20630/76294 | train loss 3.530491 | norm 5.8356 | lr 1.20e-04 | (3800.10 ms | 137967 tok/s) step 20631/76294 | train loss 3.508714 | norm 9.0327 | lr 1.20e-04 | (3807.15 ms | 137711 tok/s) step 20632/76294 | train loss 3.581324 | norm 5.4348 | lr 1.20e-04 | (3798.50 ms | 138025 tok/s) step 20633/76294 | train loss 3.474183 | norm 7.2944 | lr 1.20e-04 | (3804.68 ms | 137801 tok/s) step 20634/76294 | train loss 3.609222 | norm 5.1214 | lr 1.20e-04 | (3804.00 ms | 137826 tok/s) step 20635/76294 | train loss 3.532349 | norm 4.8867 | lr 1.20e-04 | (3945.10 ms | 132896 tok/s) step 20636/76294 | train loss 3.560950 | norm 6.0241 | lr 1.20e-04 | (3794.07 ms | 138186 tok/s) step 20637/76294 | train loss 3.496582 | norm 4.6970 | lr 1.20e-04 | (3809.39 ms | 137630 tok/s) step 20638/76294 | train loss 3.527886 | norm 5.9927 | lr 1.20e-04 | (3816.08 ms | 137389 tok/s) step 20639/76294 | train loss 3.551124 | norm 4.7172 | lr 1.20e-04 | (3993.65 ms | 131281 tok/s) step 20640/76294 | train loss 3.568422 | norm 13.6716 | lr 1.20e-04 | (3907.42 ms | 134177 tok/s) step 20641/76294 | train loss 3.414818 | norm 6.8907 | lr 1.20e-04 | (3791.01 ms | 138298 tok/s) step 20642/76294 | train loss 3.512550 | norm 6.0601 | lr 1.20e-04 | (3812.24 ms | 137528 tok/s) step 20643/76294 | train loss 3.512337 | norm 6.3955 | lr 1.20e-04 | (3787.34 ms | 138432 tok/s) step 20644/76294 | train loss 3.500650 | norm 6.2389 | lr 1.20e-04 | (4807.19 ms | 109063 tok/s) step 20645/76294 | train loss 3.512159 | norm 3.6924 | lr 
1.20e-04 | (3815.08 ms | 137425 tok/s) step 20646/76294 | train loss 3.585167 | norm 2.9534 | lr 1.20e-04 | (3791.42 ms | 138283 tok/s) step 20647/76294 | train loss 3.464973 | norm 5.7212 | lr 1.20e-04 | (3814.15 ms | 137459 tok/s) step 20648/76294 | train loss 3.478645 | norm 4.1336 | lr 1.20e-04 | (3794.66 ms | 138165 tok/s) step 20649/76294 | train loss 3.489187 | norm 7.1782 | lr 1.20e-04 | (3794.92 ms | 138155 tok/s) step 20650/76294 | train loss 3.489341 | norm 3.0285 | lr 1.20e-04 | (3799.40 ms | 137992 tok/s) step 20651/76294 | train loss 3.507485 | norm 4.0205 | lr 1.20e-04 | (3833.59 ms | 136762 tok/s) step 20652/76294 | train loss 3.559100 | norm 5.0636 | lr 1.20e-04 | (3793.06 ms | 138223 tok/s) step 20653/76294 | train loss 3.473999 | norm 4.6292 | lr 1.20e-04 | (3983.96 ms | 131600 tok/s) step 20654/76294 | train loss 3.712275 | norm 3.3264 | lr 1.20e-04 | (3794.86 ms | 138158 tok/s) step 20655/76294 | train loss 3.579961 | norm 3.5205 | lr 1.20e-04 | (3798.41 ms | 138028 tok/s) step 20656/76294 | train loss 3.530966 | norm 4.0338 | lr 1.20e-04 | (3817.50 ms | 137338 tok/s) step 20657/76294 | train loss 3.433537 | norm 3.8704 | lr 1.20e-04 | (3802.73 ms | 137872 tok/s) step 20658/76294 | train loss 3.531196 | norm 4.9482 | lr 1.20e-04 | (3801.07 ms | 137932 tok/s) step 20659/76294 | train loss 3.474068 | norm 3.6695 | lr 1.20e-04 | (3798.45 ms | 138027 tok/s) step 20660/76294 | train loss 3.482956 | norm 4.4452 | lr 1.20e-04 | (3816.49 ms | 137374 tok/s) step 20661/76294 | train loss 3.527683 | norm 12.6553 | lr 1.20e-04 | (3795.42 ms | 138137 tok/s) step 20662/76294 | train loss 3.584313 | norm 7.7444 | lr 1.20e-04 | (3804.08 ms | 137823 tok/s) step 20663/76294 | train loss 3.449444 | norm 4.4867 | lr 1.20e-04 | (3801.67 ms | 137910 tok/s) step 20664/76294 | train loss 3.566482 | norm 12.1582 | lr 1.20e-04 | (3802.24 ms | 137889 tok/s) step 20665/76294 | train loss 3.497325 | norm 5.4350 | lr 1.20e-04 | (3796.58 ms | 138095 tok/s) step 20666/76294 | 
train loss 3.441778 | norm 5.0152 | lr 1.20e-04 | (3826.18 ms | 137026 tok/s) step 20667/76294 | train loss 3.453324 | norm 4.5842 | lr 1.20e-04 | (3797.15 ms | 138074 tok/s) step 20668/76294 | train loss 3.435753 | norm 2.7188 | lr 1.20e-04 | (3802.69 ms | 137873 tok/s) step 20669/76294 | train loss 3.519106 | norm 3.3984 | lr 1.20e-04 | (3813.90 ms | 137468 tok/s) step 20670/76294 | train loss 3.498404 | norm 4.5813 | lr 1.20e-04 | (3802.57 ms | 137877 tok/s) step 20671/76294 | train loss 3.532733 | norm 2.7011 | lr 1.20e-04 | (3805.25 ms | 137780 tok/s) step 20672/76294 | train loss 3.414899 | norm 2.8498 | lr 1.20e-04 | (3821.69 ms | 137187 tok/s) step 20673/76294 | train loss 3.470877 | norm 7.5965 | lr 1.20e-04 | (3798.10 ms | 138039 tok/s) step 20674/76294 | train loss 3.465912 | norm 3.1168 | lr 1.20e-04 | (3856.35 ms | 135954 tok/s) step 20675/76294 | train loss 3.457896 | norm 2.7742 | lr 1.20e-04 | (3805.19 ms | 137782 tok/s) step 20676/76294 | train loss 3.422324 | norm 2.9173 | lr 1.20e-04 | (3820.27 ms | 137238 tok/s) step 20677/76294 | train loss 3.535311 | norm 3.3135 | lr 1.20e-04 | (3797.64 ms | 138056 tok/s) step 20678/76294 | train loss 3.413864 | norm 2.6199 | lr 1.20e-04 | (3806.23 ms | 137745 tok/s) step 20679/76294 | train loss 3.446880 | norm 2.3219 | lr 1.20e-04 | (3815.67 ms | 137404 tok/s) step 20680/76294 | train loss 3.517830 | norm 2.7557 | lr 1.20e-04 | (3807.25 ms | 137708 tok/s) step 20681/76294 | train loss 3.506039 | norm 3.8538 | lr 1.20e-04 | (3811.60 ms | 137551 tok/s) step 20682/76294 | train loss 3.429190 | norm 2.8599 | lr 1.20e-04 | (3799.45 ms | 137991 tok/s) step 20683/76294 | train loss 3.473826 | norm 3.9612 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 20684/76294 | train loss 3.421081 | norm 2.1829 | lr 1.20e-04 | (3804.31 ms | 137814 tok/s) step 20685/76294 | train loss 3.404495 | norm 2.6994 | lr 1.20e-04 | (3879.72 ms | 135135 tok/s) step 20686/76294 | train loss 3.428174 | norm 2.3143 | lr 1.20e-04 | (3871.10 
ms | 135437 tok/s) step 20687/76294 | train loss 3.474436 | norm 5.1414 | lr 1.20e-04 | (3802.36 ms | 137885 tok/s) step 20688/76294 | train loss 3.483561 | norm 3.0729 | lr 1.20e-04 | (3803.22 ms | 137854 tok/s) step 20689/76294 | train loss 3.513354 | norm 57.1673 | lr 1.20e-04 | (3818.92 ms | 137287 tok/s) step 20690/76294 | train loss 3.563810 | norm 6.4273 | lr 1.20e-04 | (3799.66 ms | 137983 tok/s) step 20691/76294 | train loss 3.489319 | norm 4.1340 | lr 1.20e-04 | (3804.98 ms | 137790 tok/s) step 20692/76294 | train loss 3.482579 | norm 5.0594 | lr 1.20e-04 | (3801.82 ms | 137905 tok/s) step 20693/76294 | train loss 3.543656 | norm 8.5960 | lr 1.20e-04 | (3802.83 ms | 137868 tok/s) step 20694/76294 | train loss 3.453143 | norm 5.6324 | lr 1.20e-04 | (3811.03 ms | 137571 tok/s) step 20695/76294 | train loss 3.467280 | norm 3.2369 | lr 1.20e-04 | (3797.61 ms | 138057 tok/s) step 20696/76294 | train loss 3.472925 | norm 3.5671 | lr 1.20e-04 | (3819.90 ms | 137252 tok/s) step 20697/76294 | train loss 3.451736 | norm 2.3864 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 20698/76294 | train loss 3.431090 | norm 4.2006 | lr 1.20e-04 | (3823.69 ms | 137116 tok/s) step 20699/76294 | train loss 3.463688 | norm 2.4455 | lr 1.20e-04 | (3800.41 ms | 137956 tok/s) step 20700/76294 | train loss 3.483735 | norm 22.3440 | lr 1.20e-04 | (3803.26 ms | 137852 tok/s) step 20701/76294 | train loss 3.463560 | norm 3.3658 | lr 1.20e-04 | (3798.30 ms | 138032 tok/s) step 20702/76294 | train loss 3.374584 | norm 9.2990 | lr 1.20e-04 | (3824.09 ms | 137101 tok/s) step 20703/76294 | train loss 3.520437 | norm 4.8118 | lr 1.20e-04 | (3797.28 ms | 138069 tok/s) step 20704/76294 | train loss 3.442016 | norm 3.3583 | lr 1.20e-04 | (3845.14 ms | 136351 tok/s) step 20705/76294 | train loss 3.378090 | norm 7.5358 | lr 1.20e-04 | (4078.72 ms | 128542 tok/s) step 20706/76294 | train loss 3.417087 | norm 4.6401 | lr 1.20e-04 | (3807.74 ms | 137690 tok/s) step 20707/76294 | train loss 3.519185 
| norm 5.8993 | lr 1.20e-04 | (3834.56 ms | 136727 tok/s) step 20708/76294 | train loss 3.438251 | norm 285.2549 | lr 1.20e-04 | (3873.37 ms | 135357 tok/s) step 20709/76294 | train loss 3.492384 | norm 7.8597 | lr 1.20e-04 | (3791.47 ms | 138281 tok/s) step 20710/76294 | train loss 3.432001 | norm 5.0953 | lr 1.20e-04 | (3801.14 ms | 137929 tok/s) step 20711/76294 | train loss 3.464340 | norm 2.0511 | lr 1.20e-04 | (3886.73 ms | 134892 tok/s) step 20712/76294 | train loss 3.431133 | norm 2.8438 | lr 1.20e-04 | (3793.85 ms | 138194 tok/s) step 20713/76294 | train loss 3.563811 | norm 5.6589 | lr 1.20e-04 | (3867.16 ms | 135574 tok/s) step 20714/76294 | train loss 3.430420 | norm 2.9499 | lr 1.20e-04 | (3794.39 ms | 138174 tok/s) step 20715/76294 | train loss 3.539057 | norm 2.1592 | lr 1.20e-04 | (3803.85 ms | 137831 tok/s) step 20716/76294 | train loss 3.425864 | norm 4.7965 | lr 1.20e-04 | (3813.92 ms | 137467 tok/s) step 20717/76294 | train loss 3.449133 | norm 2.3734 | lr 1.20e-04 | (3796.90 ms | 138083 tok/s) step 20718/76294 | train loss 3.436233 | norm 3.3773 | lr 1.20e-04 | (5333.64 ms | 98298 tok/s) step 20719/76294 | train loss 3.430506 | norm 2.4237 | lr 1.20e-04 | (3813.96 ms | 137465 tok/s) step 20720/76294 | train loss 3.445935 | norm 2.8351 | lr 1.20e-04 | (3800.28 ms | 137960 tok/s) step 20721/76294 | train loss 3.443675 | norm 2.2651 | lr 1.20e-04 | (3798.91 ms | 138010 tok/s) step 20722/76294 | train loss 3.451783 | norm 4.1110 | lr 1.20e-04 | (3802.47 ms | 137881 tok/s) step 20723/76294 | train loss 3.535852 | norm 2.8567 | lr 1.20e-04 | (3800.47 ms | 137953 tok/s) step 20724/76294 | train loss 3.506059 | norm 3.5637 | lr 1.20e-04 | (3833.66 ms | 136759 tok/s) step 20725/76294 | train loss 3.452406 | norm 5.9810 | lr 1.20e-04 | (3794.36 ms | 138176 tok/s) step 20726/76294 | train loss 3.457693 | norm 2.5973 | lr 1.20e-04 | (3848.33 ms | 136238 tok/s) step 20727/76294 | train loss 3.428002 | norm 2.5773 | lr 1.20e-04 | (3799.05 ms | 138005 tok/s) 
step 20728/76294 | train loss 3.556120 | norm 3.3118 | lr 1.20e-04 | (3817.72 ms | 137330 tok/s) step 20729/76294 | train loss 3.388862 | norm 1.9628 | lr 1.20e-04 | (3801.00 ms | 137934 tok/s) step 20730/76294 | train loss 3.462393 | norm 4.4243 | lr 1.20e-04 | (3806.17 ms | 137747 tok/s) step 20731/76294 | train loss 3.413592 | norm 2.9773 | lr 1.20e-04 | (3820.05 ms | 137246 tok/s) step 20732/76294 | train loss 3.583682 | norm 2.1775 | lr 1.20e-04 | (3800.99 ms | 137934 tok/s) step 20733/76294 | train loss 3.457549 | norm 3.4582 | lr 1.20e-04 | (3810.25 ms | 137600 tok/s) step 20734/76294 | train loss 3.441713 | norm 2.7440 | lr 1.20e-04 | (3811.36 ms | 137559 tok/s) step 20735/76294 | train loss 3.435001 | norm 2.7621 | lr 1.20e-04 | (3922.97 ms | 133646 tok/s) step 20736/76294 | train loss 3.643599 | norm 3.2018 | lr 1.20e-04 | (3798.88 ms | 138011 tok/s) step 20737/76294 | train loss 3.448334 | norm 2.0041 | lr 1.20e-04 | (3850.09 ms | 136175 tok/s) step 20738/76294 | train loss 3.438674 | norm 2.7433 | lr 1.20e-04 | (3795.78 ms | 138124 tok/s) step 20739/76294 | train loss 3.436988 | norm 3.0058 | lr 1.20e-04 | (3906.99 ms | 134192 tok/s) step 20740/76294 | train loss 3.436146 | norm 3.8739 | lr 1.20e-04 | (3783.63 ms | 138568 tok/s) step 20741/76294 | train loss 3.409606 | norm 2.4798 | lr 1.20e-04 | (3837.64 ms | 136617 tok/s) step 20742/76294 | train loss 3.527882 | norm 2.0594 | lr 1.20e-04 | (3800.43 ms | 137955 tok/s) step 20743/76294 | train loss 3.449962 | norm 2.7881 | lr 1.20e-04 | (3844.49 ms | 136374 tok/s) step 20744/76294 | train loss 3.532323 | norm 3.0176 | lr 1.20e-04 | (3791.24 ms | 138289 tok/s) step 20745/76294 | train loss 3.432936 | norm 1.6707 | lr 1.20e-04 | (3912.43 ms | 134006 tok/s) step 20746/76294 | train loss 3.443949 | norm 3.7939 | lr 1.20e-04 | (3782.45 ms | 138611 tok/s) step 20747/76294 | train loss 3.441174 | norm 3.9943 | lr 1.20e-04 | (3808.56 ms | 137660 tok/s) step 20748/76294 | train loss 3.514971 | norm 5.7190 | lr 
1.20e-04 | (3782.09 ms | 138624 tok/s) step 20749/76294 | train loss 3.407544 | norm 2.5478 | lr 1.20e-04 | (28417.59 ms | 18449 tok/s) step 20750/76294 | train loss 3.437258 | norm 4.5014 | lr 1.20e-04 | (3833.57 ms | 136762 tok/s) val loss: 3.440113 saving model checkpoint to ./results/gpt2-124M-gqa/step_20750.pth step 20751/76294 | train loss 3.465447 | norm 3.5565 | lr 1.20e-04 | (3825.36 ms | 137056 tok/s) step 20752/76294 | train loss 3.427808 | norm 1.9204 | lr 1.20e-04 | (3736.28 ms | 140323 tok/s) step 20753/76294 | train loss 3.450341 | norm 2.2739 | lr 1.20e-04 | (3837.16 ms | 136634 tok/s) step 20754/76294 | train loss 3.470692 | norm 3.1434 | lr 1.20e-04 | (3783.35 ms | 138578 tok/s) step 20755/76294 | train loss 3.485527 | norm 4.1468 | lr 1.20e-04 | (3830.40 ms | 136875 tok/s) step 20756/76294 | train loss 3.433425 | norm 1.9700 | lr 1.20e-04 | (3757.61 ms | 139527 tok/s) step 20757/76294 | train loss 3.446828 | norm 1.6876 | lr 1.20e-04 | (3769.12 ms | 139101 tok/s) step 20758/76294 | train loss 3.417538 | norm 5.1399 | lr 1.20e-04 | (3764.01 ms | 139290 tok/s) step 20759/76294 | train loss 3.418088 | norm 2.8513 | lr 1.20e-04 | (3808.73 ms | 137654 tok/s) step 20760/76294 | train loss 3.459312 | norm 5.4186 | lr 1.20e-04 | (3768.69 ms | 139117 tok/s) step 20761/76294 | train loss 3.525946 | norm 3.7362 | lr 1.20e-04 | (3771.10 ms | 139028 tok/s) step 20762/76294 | train loss 3.440791 | norm 2.6945 | lr 1.20e-04 | (3790.47 ms | 138318 tok/s) step 20763/76294 | train loss 3.517651 | norm 2.2807 | lr 1.20e-04 | (3808.24 ms | 137672 tok/s) step 20764/76294 | train loss 3.431382 | norm 2.4622 | lr 1.20e-04 | (3781.15 ms | 138658 tok/s) step 20765/76294 | train loss 3.494922 | norm 1.8139 | lr 1.20e-04 | (3811.96 ms | 137538 tok/s) step 20766/76294 | train loss 3.399187 | norm 2.3003 | lr 1.20e-04 | (3787.93 ms | 138410 tok/s) step 20767/76294 | train loss 3.436679 | norm 3.6495 | lr 1.20e-04 | (3792.24 ms | 138253 tok/s) step 20768/76294 | train loss 
3.434601 | norm 2.6707 | lr 1.20e-04 | (3801.43 ms | 137919 tok/s) step 20769/76294 | train loss 3.435528 | norm 5.3731 | lr 1.20e-04 | (3794.37 ms | 138175 tok/s) step 20770/76294 | train loss 3.481459 | norm 2.5732 | lr 1.20e-04 | (3791.65 ms | 138275 tok/s) step 20771/76294 | train loss 3.458412 | norm 5.6856 | lr 1.20e-04 | (3811.54 ms | 137553 tok/s) step 20772/76294 | train loss 3.497154 | norm 3.2245 | lr 1.20e-04 | (3790.78 ms | 138306 tok/s) step 20773/76294 | train loss 3.398103 | norm 6.1342 | lr 1.20e-04 | (3797.71 ms | 138054 tok/s) step 20774/76294 | train loss 3.538325 | norm 9.1857 | lr 1.20e-04 | (3797.90 ms | 138047 tok/s) step 20775/76294 | train loss 3.458359 | norm 5.0261 | lr 1.20e-04 | (3792.85 ms | 138230 tok/s) step 20776/76294 | train loss 3.527507 | norm 2.5864 | lr 1.20e-04 | (3819.42 ms | 137269 tok/s) step 20777/76294 | train loss 3.421096 | norm 34.9028 | lr 1.20e-04 | (3792.75 ms | 138234 tok/s) step 20778/76294 | train loss 3.475597 | norm 4.7110 | lr 1.20e-04 | (3800.66 ms | 137947 tok/s) step 20779/76294 | train loss 3.483686 | norm 3.5761 | lr 1.20e-04 | (3818.91 ms | 137287 tok/s) step 20780/76294 | train loss 3.472516 | norm 3.4699 | lr 1.20e-04 | (3806.64 ms | 137730 tok/s) step 20781/76294 | train loss 3.473482 | norm 7.4602 | lr 1.20e-04 | (3803.02 ms | 137861 tok/s) step 20782/76294 | train loss 3.462421 | norm 3.5253 | lr 1.20e-04 | (3802.60 ms | 137876 tok/s) step 20783/76294 | train loss 3.580771 | norm 4.7514 | lr 1.20e-04 | (3804.01 ms | 137825 tok/s) step 20784/76294 | train loss 3.464957 | norm 2.7204 | lr 1.20e-04 | (3824.96 ms | 137070 tok/s) step 20785/76294 | train loss 3.443442 | norm 2.7094 | lr 1.20e-04 | (3804.17 ms | 137819 tok/s) step 20786/76294 | train loss 3.483433 | norm 3.4666 | lr 1.20e-04 | (3803.95 ms | 137827 tok/s) step 20787/76294 | train loss 3.500885 | norm 2.4971 | lr 1.20e-04 | (3805.57 ms | 137769 tok/s) step 20788/76294 | train loss 3.455494 | norm 2.4755 | lr 1.20e-04 | (3804.20 ms | 
137818 tok/s) step 20789/76294 | train loss 3.472449 | norm 2.0719 | lr 1.20e-04 | (3806.26 ms | 137744 tok/s) step 20790/76294 | train loss 3.361775 | norm 2.2687 | lr 1.20e-04 | (3808.90 ms | 137648 tok/s) step 20791/76294 | train loss 3.517257 | norm 3.1660 | lr 1.20e-04 | (3809.24 ms | 137636 tok/s) step 20792/76294 | train loss 3.431677 | norm 1.9580 | lr 1.20e-04 | (3810.56 ms | 137588 tok/s) step 20793/76294 | train loss 3.467152 | norm 2.1798 | lr 1.20e-04 | (3806.57 ms | 137732 tok/s) step 20794/76294 | train loss 3.422009 | norm 2.5645 | lr 1.20e-04 | (3802.75 ms | 137871 tok/s) step 20795/76294 | train loss 3.483194 | norm 2.0767 | lr 1.20e-04 | (3829.48 ms | 136908 tok/s) step 20796/76294 | train loss 3.379270 | norm 1.9538 | lr 1.20e-04 | (3883.03 ms | 135020 tok/s) step 20797/76294 | train loss 3.533142 | norm 1.8439 | lr 1.20e-04 | (3801.82 ms | 137904 tok/s) step 20798/76294 | train loss 3.413098 | norm 2.7350 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 20799/76294 | train loss 3.455938 | norm 3.8545 | lr 1.20e-04 | (3822.44 ms | 137160 tok/s) step 20800/76294 | train loss 3.357504 | norm 2.6822 | lr 1.20e-04 | (3804.69 ms | 137800 tok/s) step 20801/76294 | train loss 3.540184 | norm 1.7321 | lr 1.20e-04 | (3799.40 ms | 137992 tok/s) step 20802/76294 | train loss 3.421870 | norm 2.9554 | lr 1.20e-04 | (3829.32 ms | 136914 tok/s) step 20803/76294 | train loss 3.456422 | norm 1.8471 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 20804/76294 | train loss 3.498224 | norm 5.5613 | lr 1.20e-04 | (3799.76 ms | 137979 tok/s) step 20805/76294 | train loss 3.405434 | norm 3.6520 | lr 1.20e-04 | (3817.99 ms | 137320 tok/s) step 20806/76294 | train loss 3.446794 | norm 5.8480 | lr 1.20e-04 | (3801.63 ms | 137911 tok/s) step 20807/76294 | train loss 3.413078 | norm 3.0839 | lr 1.20e-04 | (3806.20 ms | 137746 tok/s) step 20808/76294 | train loss 3.457905 | norm 1.9863 | lr 1.20e-04 | (3805.40 ms | 137775 tok/s) step 20809/76294 | train loss 3.436852 | norm 
2.6805 | lr 1.20e-04 | (3816.08 ms | 137389 tok/s) step 20810/76294 | train loss 3.439532 | norm 2.3303 | lr 1.20e-04 | (4044.55 ms | 129628 tok/s) step 20811/76294 | train loss 3.410511 | norm 2.1583 | lr 1.20e-04 | (3816.43 ms | 137376 tok/s) step 20812/76294 | train loss 3.416299 | norm 1.9893 | lr 1.20e-04 | (3801.27 ms | 137925 tok/s) step 20813/76294 | train loss 3.425245 | norm 2.0013 | lr 1.20e-04 | (3800.73 ms | 137944 tok/s) step 20814/76294 | train loss 3.462858 | norm 2.6307 | lr 1.20e-04 | (3803.67 ms | 137837 tok/s) step 20815/76294 | train loss 3.455941 | norm 2.8158 | lr 1.20e-04 | (3806.63 ms | 137730 tok/s) step 20816/76294 | train loss 3.439698 | norm 2.0822 | lr 1.20e-04 | (3798.65 ms | 138020 tok/s) step 20817/76294 | train loss 3.398515 | norm 3.2948 | lr 1.20e-04 | (3805.17 ms | 137783 tok/s) step 20818/76294 | train loss 3.433108 | norm 1.8937 | lr 1.20e-04 | (3800.78 ms | 137942 tok/s) step 20819/76294 | train loss 3.497093 | norm 1.6706 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 20820/76294 | train loss 3.392673 | norm 2.5333 | lr 1.20e-04 | (3801.45 ms | 137918 tok/s) step 20821/76294 | train loss 3.402640 | norm 3.7060 | lr 1.20e-04 | (3872.21 ms | 135398 tok/s) step 20822/76294 | train loss 3.362836 | norm 4.6427 | lr 1.20e-04 | (3853.59 ms | 136052 tok/s) step 20823/76294 | train loss 3.408708 | norm 2.5447 | lr 1.20e-04 | (3801.00 ms | 137934 tok/s) step 20824/76294 | train loss 3.373541 | norm 1.7507 | lr 1.20e-04 | (3820.81 ms | 137219 tok/s) step 20825/76294 | train loss 3.411775 | norm 1.8993 | lr 1.20e-04 | (4532.32 ms | 115678 tok/s) step 20826/76294 | train loss 3.419039 | norm 3.6797 | lr 1.20e-04 | (3825.79 ms | 137041 tok/s) step 20827/76294 | train loss 3.461372 | norm 2.0218 | lr 1.20e-04 | (3799.98 ms | 137971 tok/s) step 20828/76294 | train loss 3.426468 | norm 2.6955 | lr 1.20e-04 | (3810.90 ms | 137576 tok/s) step 20829/76294 | train loss 3.443073 | norm 5.6857 | lr 1.20e-04 | (3796.96 ms | 138081 tok/s) step 
20830/76294 | train loss 3.436307 | norm 5.4079 | lr 1.20e-04 | (3802.41 ms | 137883 tok/s) step 20831/76294 | train loss 3.493526 | norm 2.8902 | lr 1.20e-04 | (3814.35 ms | 137452 tok/s) step 20832/76294 | train loss 3.488283 | norm 1.7488 | lr 1.20e-04 | (3801.07 ms | 137932 tok/s) step 20833/76294 | train loss 3.406217 | norm 2.5099 | lr 1.20e-04 | (3818.54 ms | 137301 tok/s) step 20834/76294 | train loss 3.521524 | norm 2.0007 | lr 1.20e-04 | (3803.33 ms | 137850 tok/s) step 20835/76294 | train loss 3.478809 | norm 2.6788 | lr 1.20e-04 | (3802.29 ms | 137888 tok/s) step 20836/76294 | train loss 3.420738 | norm 3.1024 | lr 1.20e-04 | (3807.38 ms | 137703 tok/s) step 20837/76294 | train loss 3.478549 | norm 4.9037 | lr 1.20e-04 | (3807.57 ms | 137696 tok/s) step 20838/76294 | train loss 3.419099 | norm 3.8748 | lr 1.20e-04 | (3800.47 ms | 137953 tok/s) step 20839/76294 | train loss 3.433839 | norm 3.4324 | lr 1.20e-04 | (3797.07 ms | 138077 tok/s) step 20840/76294 | train loss 3.516830 | norm 3.3450 | lr 1.20e-04 | (3824.35 ms | 137092 tok/s) step 20841/76294 | train loss 3.426989 | norm 2.2089 | lr 1.20e-04 | (3798.29 ms | 138033 tok/s) step 20842/76294 | train loss 3.399161 | norm 3.8711 | lr 1.20e-04 | (3848.73 ms | 136223 tok/s) step 20843/76294 | train loss 3.452825 | norm 3.9475 | lr 1.20e-04 | (3796.68 ms | 138091 tok/s) step 20844/76294 | train loss 3.430019 | norm 5.2145 | lr 1.20e-04 | (3804.48 ms | 137808 tok/s) step 20845/76294 | train loss 3.442106 | norm 3.4792 | lr 1.20e-04 | (3818.78 ms | 137292 tok/s) step 20846/76294 | train loss 3.412935 | norm 4.1228 | lr 1.20e-04 | (3803.89 ms | 137829 tok/s) step 20847/76294 | train loss 3.416625 | norm 2.5986 | lr 1.20e-04 | (3803.53 ms | 137843 tok/s) step 20848/76294 | train loss 3.400508 | norm 2.9869 | lr 1.20e-04 | (3876.66 ms | 135242 tok/s) step 20849/76294 | train loss 3.613119 | norm 5.8924 | lr 1.20e-04 | (3825.53 ms | 137050 tok/s) step 20850/76294 | train loss 3.451726 | norm 3.3880 | lr 
1.20e-04 | (3826.48 ms | 137016 tok/s) step 20851/76294 | train loss 3.457111 | norm 4.4002 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 20852/76294 | train loss 3.514342 | norm 2.9466 | lr 1.20e-04 | (3806.21 ms | 137745 tok/s) step 20853/76294 | train loss 3.444139 | norm 3.5900 | lr 1.20e-04 | (3803.16 ms | 137856 tok/s) step 20854/76294 | train loss 3.386739 | norm 4.4493 | lr 1.20e-04 | (3803.41 ms | 137847 tok/s) step 20855/76294 | train loss 3.417590 | norm 3.5639 | lr 1.20e-04 | (3817.31 ms | 137345 tok/s) step 20856/76294 | train loss 3.417840 | norm 7.2454 | lr 1.20e-04 | (3807.84 ms | 137686 tok/s) step 20857/76294 | train loss 3.495689 | norm 6.9405 | lr 1.20e-04 | (3801.32 ms | 137923 tok/s) step 20858/76294 | train loss 3.414686 | norm 14.5655 | lr 1.20e-04 | (3797.90 ms | 138047 tok/s) step 20859/76294 | train loss 3.511517 | norm 10.4088 | lr 1.20e-04 | (3874.47 ms | 135318 tok/s) step 20860/76294 | train loss 3.415029 | norm 11.3155 | lr 1.20e-04 | (3778.41 ms | 138759 tok/s) step 20861/76294 | train loss 3.404493 | norm 8.7028 | lr 1.20e-04 | (3784.18 ms | 138547 tok/s) step 20862/76294 | train loss 3.460047 | norm 23.3001 | lr 1.20e-04 | (3800.57 ms | 137950 tok/s) step 20863/76294 | train loss 3.411067 | norm 9.1858 | lr 1.20e-04 | (3787.74 ms | 138417 tok/s) step 20864/76294 | train loss 3.465100 | norm 11.2198 | lr 1.20e-04 | (3793.62 ms | 138202 tok/s) step 20865/76294 | train loss 3.423934 | norm 15.0911 | lr 1.20e-04 | (3791.28 ms | 138288 tok/s) step 20866/76294 | train loss 3.442204 | norm 16.0123 | lr 1.20e-04 | (3788.98 ms | 138372 tok/s) step 20867/76294 | train loss 3.419991 | norm 5.8892 | lr 1.20e-04 | (3821.48 ms | 137195 tok/s) step 20868/76294 | train loss 3.479339 | norm 10.6855 | lr 1.20e-04 | (3788.91 ms | 138374 tok/s) step 20869/76294 | train loss 3.409454 | norm 9.2828 | lr 1.20e-04 | (3795.17 ms | 138146 tok/s) step 20870/76294 | train loss 3.418720 | norm 15.6764 | lr 1.20e-04 | (3811.32 ms | 137561 tok/s) step 
20871/76294 | train loss 3.393253 | norm 8.8413 | lr 1.20e-04 | (3829.39 ms | 136912 tok/s) step 20872/76294 | train loss 3.357285 | norm 10.5447 | lr 1.20e-04 | (3809.34 ms | 137632 tok/s) step 20873/76294 | train loss 3.500176 | norm 8.8096 | lr 1.20e-04 | (3798.62 ms | 138021 tok/s) step 20874/76294 | train loss 3.472215 | norm 5.8573 | lr 1.20e-04 | (3811.22 ms | 137564 tok/s) step 20875/76294 | train loss 3.466509 | norm 9.6867 | lr 1.20e-04 | (3799.02 ms | 138006 tok/s) step 20876/76294 | train loss 3.480104 | norm 76.3882 | lr 1.20e-04 | (3798.58 ms | 138022 tok/s) step 20877/76294 | train loss 3.412898 | norm 5.1802 | lr 1.20e-04 | (3796.50 ms | 138098 tok/s) step 20878/76294 | train loss 3.435021 | norm 6.9835 | lr 1.20e-04 | (3791.23 ms | 138290 tok/s) step 20879/76294 | train loss 3.374814 | norm 4.3577 | lr 1.20e-04 | (3818.18 ms | 137314 tok/s) step 20880/76294 | train loss 3.482274 | norm 3.8419 | lr 1.20e-04 | (3791.66 ms | 138274 tok/s) step 20881/76294 | train loss 3.440205 | norm 4.6378 | lr 1.20e-04 | (3796.85 ms | 138085 tok/s) step 20882/76294 | train loss 3.459574 | norm 5.2624 | lr 1.20e-04 | (3817.34 ms | 137344 tok/s) step 20883/76294 | train loss 3.404887 | norm 4.7234 | lr 1.20e-04 | (3798.57 ms | 138022 tok/s) step 20884/76294 | train loss 3.475651 | norm 2.8176 | lr 1.20e-04 | (3801.10 ms | 137931 tok/s) step 20885/76294 | train loss 3.475168 | norm 2.6813 | lr 1.20e-04 | (3796.27 ms | 138106 tok/s) step 20886/76294 | train loss 3.442913 | norm 12.3156 | lr 1.20e-04 | (3805.39 ms | 137775 tok/s) step 20887/76294 | train loss 3.501948 | norm 3.9630 | lr 1.20e-04 | (3796.74 ms | 138089 tok/s) step 20888/76294 | train loss 3.462405 | norm 3.8657 | lr 1.20e-04 | (3793.97 ms | 138190 tok/s) step 20889/76294 | train loss 3.478616 | norm 4.0885 | lr 1.20e-04 | (3844.85 ms | 136361 tok/s) step 20890/76294 | train loss 3.438029 | norm 3.4663 | lr 1.20e-04 | (3794.42 ms | 138173 tok/s) step 20891/76294 | train loss 3.493689 | norm 4.8299 | lr 
1.20e-04 | (3798.03 ms | 138042 tok/s) step 20892/76294 | train loss 3.433087 | norm 3.0772 | lr 1.20e-04 | (3815.23 ms | 137420 tok/s) step 20893/76294 | train loss 3.389680 | norm 2.3300 | lr 1.20e-04 | (3800.00 ms | 137971 tok/s) step 20894/76294 | train loss 3.387654 | norm 5.9472 | lr 1.20e-04 | (3793.95 ms | 138191 tok/s) step 20895/76294 | train loss 3.472968 | norm 5.9881 | lr 1.20e-04 | (3822.54 ms | 137157 tok/s) step 20896/76294 | train loss 3.404104 | norm 4.1237 | lr 1.20e-04 | (4100.53 ms | 127859 tok/s) step 20897/76294 | train loss 3.493506 | norm 4.2250 | lr 1.20e-04 | (3803.32 ms | 137850 tok/s) step 20898/76294 | train loss 3.404733 | norm 5.9320 | lr 1.20e-04 | (3793.50 ms | 138207 tok/s) step 20899/76294 | train loss 3.475049 | norm 3.5660 | lr 1.20e-04 | (5847.25 ms | 89664 tok/s) step 20900/76294 | train loss 3.452887 | norm 3.4303 | lr 1.20e-04 | (3790.43 ms | 138319 tok/s) step 20901/76294 | train loss 3.470012 | norm 2.8361 | lr 1.20e-04 | (3801.53 ms | 137915 tok/s) step 20902/76294 | train loss 3.393358 | norm 2.6001 | lr 1.20e-04 | (3822.83 ms | 137146 tok/s) step 20903/76294 | train loss 3.415518 | norm 4.6104 | lr 1.20e-04 | (3792.69 ms | 138237 tok/s) step 20904/76294 | train loss 3.448867 | norm 5.1395 | lr 1.20e-04 | (3798.27 ms | 138033 tok/s) step 20905/76294 | train loss 3.475577 | norm 2.2923 | lr 1.20e-04 | (3812.64 ms | 137513 tok/s) step 20906/76294 | train loss 3.431313 | norm 2.5829 | lr 1.20e-04 | (3794.47 ms | 138171 tok/s) step 20907/76294 | train loss 3.445529 | norm 3.2343 | lr 1.20e-04 | (3929.47 ms | 133425 tok/s) step 20908/76294 | train loss 3.476225 | norm 2.9384 | lr 1.20e-04 | (3791.80 ms | 138269 tok/s) step 20909/76294 | train loss 3.516715 | norm 6.3482 | lr 1.20e-04 | (3843.51 ms | 136409 tok/s) step 20910/76294 | train loss 3.491855 | norm 3.7366 | lr 1.20e-04 | (3792.74 ms | 138235 tok/s) step 20911/76294 | train loss 3.480800 | norm 5.2833 | lr 1.20e-04 | (3797.54 ms | 138060 tok/s) step 20912/76294 | 
train loss 3.479149 | norm 2.4580 | lr 1.20e-04 | (3814.34 ms | 137452 tok/s) step 20913/76294 | train loss 3.448915 | norm 2.9631 | lr 1.20e-04 | (3808.33 ms | 137669 tok/s) step 20914/76294 | train loss 3.415337 | norm 4.6354 | lr 1.20e-04 | (3802.75 ms | 137871 tok/s) step 20915/76294 | train loss 3.385980 | norm 3.8167 | lr 1.20e-04 | (3793.67 ms | 138201 tok/s) step 20916/76294 | train loss 3.417204 | norm 3.7126 | lr 1.20e-04 | (3800.75 ms | 137943 tok/s) step 20917/76294 | train loss 3.463539 | norm 3.5992 | lr 1.20e-04 | (3796.64 ms | 138092 tok/s) step 20918/76294 | train loss 3.443112 | norm 3.7682 | lr 1.20e-04 | (3793.17 ms | 138219 tok/s) step 20919/76294 | train loss 3.454679 | norm 2.5107 | lr 1.20e-04 | (3823.82 ms | 137111 tok/s) step 20920/76294 | train loss 3.426917 | norm 2.9240 | lr 1.20e-04 | (3795.43 ms | 138136 tok/s) step 20921/76294 | train loss 3.444481 | norm 3.6739 | lr 1.20e-04 | (3828.05 ms | 136960 tok/s) step 20922/76294 | train loss 3.499920 | norm 3.5552 | lr 1.20e-04 | (3817.66 ms | 137332 tok/s) step 20923/76294 | train loss 3.545082 | norm 2.0590 | lr 1.20e-04 | (3812.85 ms | 137506 tok/s) step 20924/76294 | train loss 3.463578 | norm 2.5388 | lr 1.20e-04 | (3794.18 ms | 138182 tok/s) step 20925/76294 | train loss 3.419652 | norm 2.9656 | lr 1.20e-04 | (3842.51 ms | 136444 tok/s) step 20926/76294 | train loss 3.464152 | norm 7.9297 | lr 1.20e-04 | (3795.63 ms | 138130 tok/s) step 20927/76294 | train loss 3.410245 | norm 10.6575 | lr 1.20e-04 | (3796.32 ms | 138104 tok/s) step 20928/76294 | train loss 3.444922 | norm 9.3431 | lr 1.20e-04 | (3815.25 ms | 137419 tok/s) step 20929/76294 | train loss 3.445427 | norm 4.3246 | lr 1.20e-04 | (3799.21 ms | 137999 tok/s) step 20930/76294 | train loss 3.507280 | norm 3.9153 | lr 1.20e-04 | (3802.15 ms | 137893 tok/s) step 20931/76294 | train loss 3.458954 | norm 3.8181 | lr 1.20e-04 | (3803.86 ms | 137830 tok/s) step 20932/76294 | train loss 3.475355 | norm 2.8012 | lr 1.20e-04 | (3815.47 
ms | 137411 tok/s) step 20933/76294 | train loss 3.452986 | norm 2.3648 | lr 1.20e-04 | (3795.74 ms | 138125 tok/s) step 20934/76294 | train loss 3.419312 | norm 2.1970 | lr 1.20e-04 | (3803.53 ms | 137843 tok/s) step 20935/76294 | train loss 3.369285 | norm 3.2207 | lr 1.20e-04 | (3804.34 ms | 137813 tok/s) step 20936/76294 | train loss 3.482882 | norm 2.8306 | lr 1.20e-04 | (3796.67 ms | 138092 tok/s) step 20937/76294 | train loss 3.443290 | norm 6.7860 | lr 1.20e-04 | (3949.37 ms | 132752 tok/s) step 20938/76294 | train loss 3.443339 | norm 3.4799 | lr 1.20e-04 | (3794.65 ms | 138165 tok/s) step 20939/76294 | train loss 3.502782 | norm 3.0336 | lr 1.20e-04 | (3848.98 ms | 136215 tok/s) step 20940/76294 | train loss 3.402062 | norm 3.0496 | lr 1.20e-04 | (3797.39 ms | 138065 tok/s) step 20941/76294 | train loss 3.411844 | norm 4.0100 | lr 1.20e-04 | (3829.24 ms | 136917 tok/s) step 20942/76294 | train loss 3.420691 | norm 3.4298 | lr 1.20e-04 | (3812.64 ms | 137513 tok/s) step 20943/76294 | train loss 3.411682 | norm 2.7029 | lr 1.20e-04 | (3793.08 ms | 138222 tok/s) step 20944/76294 | train loss 3.428397 | norm 4.1326 | lr 1.20e-04 | (3822.02 ms | 137176 tok/s) step 20945/76294 | train loss 3.470598 | norm 5.5647 | lr 1.20e-04 | (3794.31 ms | 138178 tok/s) step 20946/76294 | train loss 3.517517 | norm 6.7919 | lr 1.20e-04 | (3820.24 ms | 137240 tok/s) step 20947/76294 | train loss 3.498026 | norm 3.3861 | lr 1.20e-04 | (3798.39 ms | 138029 tok/s) step 20948/76294 | train loss 3.409117 | norm 2.9459 | lr 1.20e-04 | (3820.68 ms | 137224 tok/s) step 20949/76294 | train loss 3.412643 | norm 2.2998 | lr 1.20e-04 | (3793.38 ms | 138211 tok/s) step 20950/76294 | train loss 3.473889 | norm 2.1972 | lr 1.20e-04 | (3811.79 ms | 137544 tok/s) step 20951/76294 | train loss 3.405263 | norm 3.7857 | lr 1.20e-04 | (3794.34 ms | 138176 tok/s) step 20952/76294 | train loss 3.447731 | norm 6.5975 | lr 1.20e-04 | (3798.13 ms | 138038 tok/s) step 20953/76294 | train loss 3.408696 | 
norm 3.1558 | lr 1.20e-04 | (3829.35 ms | 136913 tok/s) step 20954/76294 | train loss 3.418905 | norm 4.0554 | lr 1.20e-04 | (3802.21 ms | 137890 tok/s) step 20955/76294 | train loss 3.376490 | norm 2.8113 | lr 1.20e-04 | (3805.14 ms | 137784 tok/s) step 20956/76294 | train loss 3.386525 | norm 3.1833 | lr 1.20e-04 | (3803.54 ms | 137842 tok/s) step 20957/76294 | train loss 3.376925 | norm 4.7507 | lr 1.20e-04 | (3873.01 ms | 135370 tok/s) step 20958/76294 | train loss 3.408952 | norm 5.3931 | lr 1.20e-04 | (3801.72 ms | 137908 tok/s) step 20959/76294 | train loss 3.429101 | norm 4.0706 | lr 1.20e-04 | (3841.91 ms | 136465 tok/s) step 20960/76294 | train loss 3.363649 | norm 2.9240 | lr 1.20e-04 | (3794.66 ms | 138165 tok/s) step 20961/76294 | train loss 3.438828 | norm 3.1284 | lr 1.20e-04 | (3799.57 ms | 137986 tok/s) step 20962/76294 | train loss 3.468599 | norm 2.5726 | lr 1.20e-04 | (3814.28 ms | 137454 tok/s) step 20963/76294 | train loss 3.394834 | norm 2.0515 | lr 1.20e-04 | (3795.91 ms | 138119 tok/s) step 20964/76294 | train loss 3.419446 | norm 4.3437 | lr 1.20e-04 | (3806.65 ms | 137730 tok/s) step 20965/76294 | train loss 3.416490 | norm 5.4748 | lr 1.20e-04 | (3883.75 ms | 134995 tok/s) step 20966/76294 | train loss 3.435939 | norm 5.3580 | lr 1.20e-04 | (3797.00 ms | 138079 tok/s) step 20967/76294 | train loss 3.513556 | norm 9.5270 | lr 1.20e-04 | (3822.99 ms | 137141 tok/s) step 20968/76294 | train loss 3.432958 | norm 3.3158 | lr 1.20e-04 | (3798.89 ms | 138011 tok/s) step 20969/76294 | train loss 3.444820 | norm 12.7374 | lr 1.20e-04 | (3799.84 ms | 137976 tok/s) step 20970/76294 | train loss 3.450129 | norm 2.3836 | lr 1.20e-04 | (3815.22 ms | 137420 tok/s) step 20971/76294 | train loss 3.487349 | norm 13.5763 | lr 1.20e-04 | (3829.93 ms | 136892 tok/s) step 20972/76294 | train loss 3.409853 | norm 6.0094 | lr 1.20e-04 | (3963.36 ms | 132284 tok/s) step 20973/76294 | train loss 3.461327 | norm 5.9706 | lr 1.20e-04 | (3816.77 ms | 137364 tok/s) 
step 20974/76294 | train loss 3.394787 | norm 5.1511 | lr 1.20e-04 | (3809.15 ms | 137639 tok/s) step 20975/76294 | train loss 3.447470 | norm 3.4956 | lr 1.20e-04 | (3796.73 ms | 138089 tok/s) step 20976/76294 | train loss 3.445410 | norm 7.9034 | lr 1.20e-04 | (3802.42 ms | 137883 tok/s) step 20977/76294 | train loss 3.366297 | norm 6.1767 | lr 1.20e-04 | (3825.10 ms | 137065 tok/s) step 20978/76294 | train loss 3.448092 | norm 6.9698 | lr 1.20e-04 | (3800.16 ms | 137965 tok/s) step 20979/76294 | train loss 3.434517 | norm 3.9957 | lr 1.20e-04 | (3804.06 ms | 137823 tok/s) step 20980/76294 | train loss 3.397145 | norm 9.9058 | lr 1.20e-04 | (3801.48 ms | 137917 tok/s) step 20981/76294 | train loss 3.428639 | norm 6.7764 | lr 1.20e-04 | (3796.70 ms | 138090 tok/s) step 20982/76294 | train loss 3.436966 | norm 5.1156 | lr 1.20e-04 | (3827.74 ms | 136971 tok/s) step 20983/76294 | train loss 3.455415 | norm 5.7612 | lr 1.20e-04 | (3795.46 ms | 138136 tok/s) step 20984/76294 | train loss 3.438534 | norm 11.1333 | lr 1.20e-04 | (3802.62 ms | 137876 tok/s) step 20985/76294 | train loss 3.405257 | norm 16.1230 | lr 1.20e-04 | (3822.93 ms | 137143 tok/s) step 20986/76294 | train loss 3.488614 | norm 11.2582 | lr 1.20e-04 | (3801.69 ms | 137909 tok/s) step 20987/76294 | train loss 3.398085 | norm 7.7208 | lr 1.20e-04 | (3795.84 ms | 138122 tok/s) step 20988/76294 | train loss 3.484642 | norm 5.5658 | lr 1.20e-04 | (3828.94 ms | 136928 tok/s) step 20989/76294 | train loss 3.473129 | norm 6.3235 | lr 1.20e-04 | (3798.13 ms | 138039 tok/s) step 20990/76294 | train loss 3.435289 | norm 6.6265 | lr 1.20e-04 | (3802.90 ms | 137865 tok/s) step 20991/76294 | train loss 3.472422 | norm 10.7147 | lr 1.20e-04 | (3818.82 ms | 137291 tok/s) step 20992/76294 | train loss 3.455564 | norm 15.3951 | lr 1.20e-04 | (3802.53 ms | 137879 tok/s) step 20993/76294 | train loss 3.450502 | norm 5.9665 | lr 1.20e-04 | (3799.29 ms | 137996 tok/s) step 20994/76294 | train loss 3.578164 | norm 8.3413 | 
lr 1.20e-04 | (3801.22 ms | 137926 tok/s) step 20995/76294 | train loss 3.409616 | norm 12.4750 | lr 1.20e-04 | (3800.56 ms | 137950 tok/s) step 20996/76294 | train loss 3.480289 | norm 8.8394 | lr 1.20e-04 | (3801.93 ms | 137901 tok/s) step 20997/76294 | train loss 3.459190 | norm 14.2041 | lr 1.20e-04 | (3800.38 ms | 137957 tok/s) step 20998/76294 | train loss 3.419021 | norm 12.3745 | lr 1.20e-04 | (3861.57 ms | 135771 tok/s) step 20999/76294 | train loss 3.413057 | norm 8.6033 | lr 1.20e-04 | (3802.06 ms | 137896 tok/s) step 21000/76294 | train loss 3.471180 | norm 8.8165 | lr 1.20e-04 | (3810.70 ms | 137583 tok/s) val loss: 3.458253 saving model checkpoint to ./results/gpt2-124M-gqa/step_21000.pth step 21001/76294 | train loss 3.451578 | norm 15.0375 | lr 1.20e-04 | (3847.24 ms | 136277 tok/s) step 21002/76294 | train loss 3.458165 | norm 22.7358 | lr 1.20e-04 | (3805.58 ms | 137768 tok/s) step 21003/76294 | train loss 3.485751 | norm 31.8654 | lr 1.20e-04 | (3807.74 ms | 137690 tok/s) step 21004/76294 | train loss 3.570543 | norm 36.1532 | lr 1.20e-04 | (3793.00 ms | 138225 tok/s) step 21005/76294 | train loss 3.540938 | norm 37.9617 | lr 1.20e-04 | (3794.96 ms | 138154 tok/s) step 21006/76294 | train loss 3.515256 | norm 14.2856 | lr 1.20e-04 | (3820.58 ms | 137227 tok/s) step 21007/76294 | train loss 3.536872 | norm 16.0933 | lr 1.20e-04 | (3801.69 ms | 137909 tok/s) step 21008/76294 | train loss 3.491817 | norm 49.3704 | lr 1.20e-04 | (3800.78 ms | 137942 tok/s) step 21009/76294 | train loss 3.440313 | norm 98.0148 | lr 1.20e-04 | (3799.62 ms | 137984 tok/s) step 21010/76294 | train loss 3.586468 | norm 44.0803 | lr 1.20e-04 | (3799.57 ms | 137986 tok/s) step 21011/76294 | train loss 3.568030 | norm 33.6572 | lr 1.20e-04 | (3823.86 ms | 137110 tok/s) step 21012/76294 | train loss 3.562057 | norm 48.0562 | lr 1.20e-04 | (3798.08 ms | 138040 tok/s) step 21013/76294 | train loss 3.646047 | norm 48.8590 | lr 1.20e-04 | (3827.29 ms | 136987 tok/s) step 
21014/76294 | train loss 3.557795 | norm 49.3236 | lr 1.20e-04 | (3799.19 ms | 138000 tok/s) step 21015/76294 | train loss 3.662895 | norm 90.9586 | lr 1.20e-04 | (3806.89 ms | 137721 tok/s) step 21016/76294 | train loss 3.671847 | norm 70.8809 | lr 1.20e-04 | (3821.58 ms | 137192 tok/s) step 21017/76294 | train loss 3.706181 | norm 47.6864 | lr 1.20e-04 | (3801.66 ms | 137910 tok/s) step 21018/76294 | train loss 3.680005 | norm 96.2999 | lr 1.20e-04 | (3797.51 ms | 138061 tok/s) step 21019/76294 | train loss 3.628418 | norm 76.1890 | lr 1.20e-04 | (3825.45 ms | 137053 tok/s) step 21020/76294 | train loss 3.611400 | norm 50.4919 | lr 1.20e-04 | (3798.92 ms | 138010 tok/s) step 21021/76294 | train loss 3.681277 | norm 61.9384 | lr 1.20e-04 | (3799.77 ms | 137979 tok/s) step 21022/76294 | train loss 3.605139 | norm 71.4745 | lr 1.20e-04 | (3799.17 ms | 138001 tok/s) step 21023/76294 | train loss 3.687110 | norm 50.4623 | lr 1.20e-04 | (3876.03 ms | 135264 tok/s) step 21024/76294 | train loss 3.599937 | norm 30.4192 | lr 1.20e-04 | (3797.19 ms | 138073 tok/s) step 21025/76294 | train loss 3.525397 | norm 17.5940 | lr 1.20e-04 | (3810.60 ms | 137587 tok/s) step 21026/76294 | train loss 3.586394 | norm 20.1505 | lr 1.20e-04 | (3796.23 ms | 138107 tok/s) step 21027/76294 | train loss 3.449449 | norm 29.3200 | lr 1.20e-04 | (3802.39 ms | 137884 tok/s) step 21028/76294 | train loss 3.546062 | norm 37.3110 | lr 1.20e-04 | (3820.29 ms | 137238 tok/s) step 21029/76294 | train loss 3.553308 | norm 24.8975 | lr 1.20e-04 | (3801.40 ms | 137920 tok/s) step 21030/76294 | train loss 3.472879 | norm 27.0804 | lr 1.20e-04 | (3815.80 ms | 137399 tok/s) step 21031/76294 | train loss 3.519459 | norm 18.5279 | lr 1.20e-04 | (3801.93 ms | 137901 tok/s) step 21032/76294 | train loss 3.508510 | norm 20.5404 | lr 1.20e-04 | (3807.01 ms | 137716 tok/s) step 21033/76294 | train loss 3.441808 | norm 20.7391 | lr 1.20e-04 | (3799.15 ms | 138001 tok/s) step 21034/76294 | train loss 3.517911 | 
norm 10.8543 | lr 1.20e-04 | (3800.23 ms | 137962 tok/s) step 21035/76294 | train loss 3.448415 | norm 10.2132 | lr 1.20e-04 | (3804.28 ms | 137815 tok/s) step 21036/76294 | train loss 3.469370 | norm 5.4044 | lr 1.20e-04 | (3836.63 ms | 136653 tok/s) step 21037/76294 | train loss 3.583136 | norm 6.0203 | lr 1.20e-04 | (3800.62 ms | 137948 tok/s) step 21038/76294 | train loss 3.435683 | norm 10.7288 | lr 1.20e-04 | (3818.99 ms | 137284 tok/s) step 21039/76294 | train loss 3.484430 | norm 4.5766 | lr 1.20e-04 | (3802.34 ms | 137885 tok/s) step 21040/76294 | train loss 3.460838 | norm 7.9365 | lr 1.20e-04 | (3817.72 ms | 137330 tok/s) step 21041/76294 | train loss 3.424954 | norm 4.3361 | lr 1.20e-04 | (3803.90 ms | 137829 tok/s) step 21042/76294 | train loss 3.460911 | norm 6.5089 | lr 1.20e-04 | (3809.34 ms | 137632 tok/s) step 21043/76294 | train loss 3.523644 | norm 3.8037 | lr 1.20e-04 | (3801.04 ms | 137933 tok/s) step 21044/76294 | train loss 3.460491 | norm 3.6573 | lr 1.20e-04 | (3815.05 ms | 137426 tok/s) step 21045/76294 | train loss 3.457831 | norm 3.7755 | lr 1.20e-04 | (3798.38 ms | 138029 tok/s) step 21046/76294 | train loss 3.430219 | norm 6.4190 | lr 1.20e-04 | (3815.28 ms | 137418 tok/s) step 21047/76294 | train loss 3.423360 | norm 9.0230 | lr 1.20e-04 | (3800.72 ms | 137944 tok/s) step 21048/76294 | train loss 3.493908 | norm 11.5577 | lr 1.20e-04 | (3866.24 ms | 135607 tok/s) step 21049/76294 | train loss 3.431095 | norm 5.9410 | lr 1.20e-04 | (3808.41 ms | 137666 tok/s) step 21050/76294 | train loss 3.456646 | norm 9.6252 | lr 1.20e-04 | (3840.14 ms | 136528 tok/s) step 21051/76294 | train loss 3.409855 | norm 5.5639 | lr 1.20e-04 | (3799.99 ms | 137971 tok/s) step 21052/76294 | train loss 3.503834 | norm 16.8508 | lr 1.20e-04 | (3849.14 ms | 136209 tok/s) step 21053/76294 | train loss 3.375190 | norm 5.3725 | lr 1.20e-04 | (3795.31 ms | 138141 tok/s) step 21054/76294 | train loss 3.463583 | norm 4.0279 | lr 1.20e-04 | (3800.93 ms | 137937 
tok/s) step 21055/76294 | train loss 3.443911 | norm 12.5491 | lr 1.20e-04 | (3823.47 ms | 137124 tok/s) step 21056/76294 | train loss 3.489424 | norm 13.7062 | lr 1.20e-04 | (3799.16 ms | 138001 tok/s) step 21057/76294 | train loss 3.531404 | norm 26.3656 | lr 1.20e-04 | (3804.90 ms | 137793 tok/s) step 21058/76294 | train loss 3.550865 | norm 6.9863 | lr 1.20e-04 | (4247.72 ms | 123428 tok/s) step 21059/76294 | train loss 3.534414 | norm 45.6057 | lr 1.20e-04 | (3800.00 ms | 137971 tok/s) step 21060/76294 | train loss 3.478935 | norm 19.3467 | lr 1.20e-04 | (3807.68 ms | 137692 tok/s) step 21061/76294 | train loss 3.444047 | norm 22.0850 | lr 1.20e-04 | (3796.62 ms | 138094 tok/s) step 21062/76294 | train loss 3.530679 | norm 46.6093 | lr 1.20e-04 | (3822.38 ms | 137163 tok/s) step 21063/76294 | train loss 3.493438 | norm 30.1836 | lr 1.20e-04 | (3795.20 ms | 138145 tok/s) step 21064/76294 | train loss 3.499239 | norm 14.1607 | lr 1.20e-04 | (4050.72 ms | 129431 tok/s) step 21065/76294 | train loss 3.540896 | norm 37.1097 | lr 1.20e-04 | (3897.27 ms | 134527 tok/s) step 21066/76294 | train loss 3.551565 | norm 52.4087 | lr 1.20e-04 | (3790.60 ms | 138313 tok/s) step 21067/76294 | train loss 3.522208 | norm 20.5319 | lr 1.20e-04 | (3819.52 ms | 137265 tok/s) step 21068/76294 | train loss 3.489042 | norm 63.6446 | lr 1.20e-04 | (3797.16 ms | 138074 tok/s) step 21069/76294 | train loss 3.565288 | norm 30.0912 | lr 1.20e-04 | (3978.64 ms | 131776 tok/s) step 21070/76294 | train loss 3.549796 | norm 38.7294 | lr 1.20e-04 | (3798.46 ms | 138026 tok/s) step 21071/76294 | train loss 3.616496 | norm 34.5319 | lr 1.20e-04 | (3791.98 ms | 138262 tok/s) step 21072/76294 | train loss 3.538447 | norm 22.0279 | lr 1.20e-04 | (3797.78 ms | 138051 tok/s) step 21073/76294 | train loss 3.494017 | norm 16.8072 | lr 1.20e-04 | (3907.65 ms | 134170 tok/s) step 21074/76294 | train loss 3.534252 | norm 19.3343 | lr 1.20e-04 | (3794.36 ms | 138176 tok/s) step 21075/76294 | train loss 
3.535584 | norm 36.4483 | lr 1.20e-04 | (3846.62 ms | 136298 tok/s) step 21076/76294 | train loss 3.512670 | norm 23.7880 | lr 1.20e-04 | (3793.89 ms | 138193 tok/s) step 21077/76294 | train loss 3.603552 | norm 18.4834 | lr 1.20e-04 | (3827.20 ms | 136990 tok/s) step 21078/76294 | train loss 3.514986 | norm 200.2942 | lr 1.20e-04 | (3814.95 ms | 137430 tok/s) step 21079/76294 | train loss 3.496881 | norm 35.9955 | lr 1.20e-04 | (3800.13 ms | 137966 tok/s) step 21080/76294 | train loss 3.535917 | norm 58.1199 | lr 1.20e-04 | (3822.44 ms | 137160 tok/s) step 21081/76294 | train loss 3.568087 | norm 117.8502 | lr 1.20e-04 | (3824.52 ms | 137086 tok/s) step 21082/76294 | train loss 3.529257 | norm 51.2198 | lr 1.20e-04 | (3817.54 ms | 137337 tok/s) step 21083/76294 | train loss 3.558520 | norm 37.7418 | lr 1.20e-04 | (4165.34 ms | 125869 tok/s) step 21084/76294 | train loss 3.552393 | norm 30.9941 | lr 1.20e-04 | (3790.52 ms | 138316 tok/s) step 21085/76294 | train loss 3.512792 | norm 23.3291 | lr 1.20e-04 | (3822.48 ms | 137159 tok/s) step 21086/76294 | train loss 3.545144 | norm 18.1078 | lr 1.20e-04 | (3796.67 ms | 138092 tok/s) step 21087/76294 | train loss 3.511766 | norm 20.2589 | lr 1.20e-04 | (4147.95 ms | 126397 tok/s) step 21088/76294 | train loss 3.495471 | norm 22.9323 | lr 1.20e-04 | (3793.90 ms | 138192 tok/s) step 21089/76294 | train loss 3.508900 | norm 51.6539 | lr 1.20e-04 | (3799.51 ms | 137988 tok/s) step 21090/76294 | train loss 3.533776 | norm 38.0000 | lr 1.20e-04 | (3817.26 ms | 137347 tok/s) step 21091/76294 | train loss 3.531484 | norm 42.1716 | lr 1.20e-04 | (3805.53 ms | 137770 tok/s) step 21092/76294 | train loss 3.581102 | norm 577.7832 | lr 1.20e-04 | (3817.34 ms | 137344 tok/s) step 21093/76294 | train loss 3.530454 | norm 24.5212 | lr 1.20e-04 | (3801.90 ms | 137901 tok/s) step 21094/76294 | train loss 3.568233 | norm 51.3216 | lr 1.20e-04 | (3818.52 ms | 137302 tok/s) step 21095/76294 | train loss 3.515529 | norm 42.4518 | lr 
1.20e-04 | (3800.08 ms | 137967 tok/s) step 21096/76294 | train loss 3.609208 | norm 27.2178 | lr 1.20e-04 | (3821.38 ms | 137198 tok/s) step 21097/76294 | train loss 3.512316 | norm 35.9185 | lr 1.20e-04 | (3802.04 ms | 137897 tok/s) step 21098/76294 | train loss 3.575481 | norm 64.2450 | lr 1.20e-04 | (3797.36 ms | 138066 tok/s) step 21099/76294 | train loss 3.532288 | norm 75.0590 | lr 1.20e-04 | (3801.93 ms | 137901 tok/s) step 21100/76294 | train loss 3.478037 | norm 156.3911 | lr 1.20e-04 | (3801.06 ms | 137932 tok/s) step 21101/76294 | train loss 3.538738 | norm 160.7455 | lr 1.20e-04 | (3806.21 ms | 137745 tok/s) step 21102/76294 | train loss 3.578767 | norm 115.5297 | lr 1.20e-04 | (3830.50 ms | 136872 tok/s) step 21103/76294 | train loss 3.594742 | norm 264.3686 | lr 1.20e-04 | (3802.86 ms | 137867 tok/s) step 21104/76294 | train loss 3.658787 | norm 96.4435 | lr 1.20e-04 | (3806.78 ms | 137725 tok/s) step 21105/76294 | train loss 3.676942 | norm 260.8680 | lr 1.20e-04 | (3822.35 ms | 137164 tok/s) step 21106/76294 | train loss 3.680775 | norm 154.7676 | lr 1.20e-04 | (3800.43 ms | 137955 tok/s) step 21107/76294 | train loss 3.809139 | norm 205.5970 | lr 1.20e-04 | (3810.02 ms | 137608 tok/s) step 21108/76294 | train loss 3.660944 | norm 153.3484 | lr 1.20e-04 | (3801.55 ms | 137914 tok/s) step 21109/76294 | train loss 3.786654 | norm 270.0374 | lr 1.20e-04 | (3819.19 ms | 137277 tok/s) step 21110/76294 | train loss 3.700688 | norm 217.0815 | lr 1.20e-04 | (3799.55 ms | 137987 tok/s) step 21111/76294 | train loss 3.609545 | norm 239.8348 | lr 1.20e-04 | (3803.86 ms | 137830 tok/s) step 21112/76294 | train loss 3.628079 | norm 120.3644 | lr 1.20e-04 | (3825.63 ms | 137046 tok/s) step 21113/76294 | train loss 3.674699 | norm 317.9308 | lr 1.20e-04 | (3804.20 ms | 137818 tok/s) step 21114/76294 | train loss 3.587877 | norm 421.4142 | lr 1.20e-04 | (3799.82 ms | 137977 tok/s) step 21115/76294 | train loss 3.663891 | norm 192.5546 | lr 1.20e-04 | (3829.83 ms | 
136896 tok/s) step 21116/76294 | train loss 3.611457 | norm 259.0959 | lr 1.20e-04 | (3802.30 ms | 137887 tok/s) step 21117/76294 | train loss 3.561930 | norm 186.6652 | lr 1.20e-04 | (3805.88 ms | 137757 tok/s) step 21118/76294 | train loss 3.562234 | norm 271.2874 | lr 1.20e-04 | (3822.87 ms | 137145 tok/s) step 21119/76294 | train loss 3.519274 | norm 164.0023 | lr 1.20e-04 | (3804.89 ms | 137793 tok/s) step 21120/76294 | train loss 3.614154 | norm 124.1289 | lr 1.20e-04 | (3806.10 ms | 137749 tok/s) step 21121/76294 | train loss 3.511492 | norm 236.7714 | lr 1.20e-04 | (3802.29 ms | 137888 tok/s) step 21122/76294 | train loss 3.661562 | norm 80.9398 | lr 1.20e-04 | (3808.41 ms | 137666 tok/s) step 21123/76294 | train loss 3.508062 | norm 361.8991 | lr 1.20e-04 | (3804.38 ms | 137812 tok/s) step 21124/76294 | train loss 3.475648 | norm 141.3428 | lr 1.20e-04 | (3828.31 ms | 136950 tok/s) step 21125/76294 | train loss 3.519238 | norm 102.1144 | lr 1.20e-04 | (3803.00 ms | 137862 tok/s) step 21126/76294 | train loss 3.548349 | norm 91.9504 | lr 1.20e-04 | (3799.66 ms | 137983 tok/s) step 21127/76294 | train loss 3.644468 | norm 72.9246 | lr 1.20e-04 | (3830.60 ms | 136868 tok/s) step 21128/76294 | train loss 3.504906 | norm 61.1566 | lr 1.20e-04 | (3800.19 ms | 137964 tok/s) step 21129/76294 | train loss 3.495648 | norm 50.0964 | lr 1.20e-04 | (3842.43 ms | 136447 tok/s) step 21130/76294 | train loss 3.442020 | norm 173.9477 | lr 1.20e-04 | (3799.43 ms | 137991 tok/s) step 21131/76294 | train loss 3.511375 | norm 89.2091 | lr 1.20e-04 | (3820.31 ms | 137237 tok/s) step 21132/76294 | train loss 3.473327 | norm 186.7388 | lr 1.20e-04 | (3801.29 ms | 137924 tok/s) step 21133/76294 | train loss 3.511342 | norm 115.4701 | lr 1.20e-04 | (3834.35 ms | 136734 tok/s) step 21134/76294 | train loss 3.473722 | norm 35.9346 | lr 1.20e-04 | (3829.99 ms | 136890 tok/s) step 21135/76294 | train loss 3.493587 | norm 187.4256 | lr 1.20e-04 | (3801.50 ms | 137916 tok/s) step 
21136/76294 | train loss 3.486370 | norm 54.4622 | lr 1.20e-04 | (3805.78 ms | 137761 tok/s) step 21137/76294 | train loss 3.485876 | norm 56.1520 | lr 1.20e-04 | (4091.95 ms | 128127 tok/s) step 21138/76294 | train loss 3.571093 | norm 131.8949 | lr 1.20e-04 | (3799.98 ms | 137971 tok/s) step 21139/76294 | train loss 3.433212 | norm 55.8081 | lr 1.20e-04 | (3828.18 ms | 136955 tok/s) step 21140/76294 | train loss 3.482930 | norm 163.7879 | lr 1.20e-04 | (3800.63 ms | 137948 tok/s) step 21141/76294 | train loss 3.516049 | norm 175.1866 | lr 1.20e-04 | (3803.49 ms | 137844 tok/s) step 21142/76294 | train loss 3.446679 | norm 63.5897 | lr 1.20e-04 | (3837.33 ms | 136628 tok/s) step 21143/76294 | train loss 3.515531 | norm 89.9687 | lr 1.20e-04 | (3808.57 ms | 137660 tok/s) step 21144/76294 | train loss 3.430811 | norm 347.5365 | lr 1.20e-04 | (3811.25 ms | 137563 tok/s) step 21145/76294 | train loss 3.467163 | norm 125.8811 | lr 1.20e-04 | (3799.14 ms | 138002 tok/s) step 21146/76294 | train loss 3.459282 | norm 116.0196 | lr 1.20e-04 | (3863.66 ms | 135697 tok/s) step 21147/76294 | train loss 3.494878 | norm 43.7751 | lr 1.20e-04 | (3799.18 ms | 138000 tok/s) step 21148/76294 | train loss 3.472605 | norm 136.8406 | lr 1.20e-04 | (3808.25 ms | 137671 tok/s) step 21149/76294 | train loss 3.482506 | norm 70.3279 | lr 1.20e-04 | (3827.90 ms | 136965 tok/s) step 21150/76294 | train loss 3.545425 | norm 330.4398 | lr 1.20e-04 | (3804.36 ms | 137812 tok/s) step 21151/76294 | train loss 3.488118 | norm 66.9820 | lr 1.20e-04 | (3828.15 ms | 136956 tok/s) step 21152/76294 | train loss 3.426296 | norm 109.1819 | lr 1.20e-04 | (3805.39 ms | 137775 tok/s) step 21153/76294 | train loss 3.539640 | norm 153.9153 | lr 1.20e-04 | (3825.62 ms | 137046 tok/s) step 21154/76294 | train loss 3.466674 | norm 175.1543 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 21155/76294 | train loss 3.453326 | norm 92.0056 | lr 1.20e-04 | (3810.04 ms | 137607 tok/s) step 21156/76294 | train loss 
3.454502 | norm 192.8891 | lr 1.20e-04 | (3801.69 ms | 137909 tok/s) step 21157/76294 | train loss 3.468379 | norm 123.9140 | lr 1.20e-04 | (3810.95 ms | 137574 tok/s) step 21158/76294 | train loss 3.433118 | norm 185.4834 | lr 1.20e-04 | (3805.53 ms | 137770 tok/s) step 21159/76294 | train loss 3.489206 | norm 89.7445 | lr 1.20e-04 | (3803.42 ms | 137846 tok/s) step 21160/76294 | train loss 3.508095 | norm 101.4982 | lr 1.20e-04 | (3802.17 ms | 137892 tok/s) step 21161/76294 | train loss 3.538442 | norm 69.8534 | lr 1.20e-04 | (3801.02 ms | 137934 tok/s) step 21162/76294 | train loss 3.463040 | norm 180.6813 | lr 1.20e-04 | (3822.51 ms | 137158 tok/s) step 21163/76294 | train loss 3.487411 | norm 99.4949 | lr 1.20e-04 | (3799.73 ms | 137980 tok/s) step 21164/76294 | train loss 3.510477 | norm 523.2999 | lr 1.20e-04 | (3830.03 ms | 136889 tok/s) step 21165/76294 | train loss 3.488983 | norm 114.3679 | lr 1.20e-04 | (3803.82 ms | 137832 tok/s) step 21166/76294 | train loss 3.445498 | norm 119.0405 | lr 1.20e-04 | (3847.95 ms | 136251 tok/s) step 21167/76294 | train loss 3.483261 | norm 565.4432 | lr 1.20e-04 | (3800.31 ms | 137959 tok/s) step 21168/76294 | train loss 3.537247 | norm 53.8882 | lr 1.20e-04 | (3805.34 ms | 137777 tok/s) step 21169/76294 | train loss 3.542476 | norm 82.6815 | lr 1.20e-04 | (4003.95 ms | 130943 tok/s) step 21170/76294 | train loss 3.544441 | norm 42.7263 | lr 1.20e-04 | (4552.32 ms | 115169 tok/s) step 21171/76294 | train loss 3.595200 | norm 27.5662 | lr 1.20e-04 | (5120.14 ms | 102397 tok/s) step 21172/76294 | train loss 3.535243 | norm 23.0809 | lr 1.20e-04 | (3844.38 ms | 136378 tok/s) step 21173/76294 | train loss 3.519497 | norm 65.4019 | lr 1.20e-04 | (3796.24 ms | 138107 tok/s) step 21174/76294 | train loss 3.621435 | norm 107.2794 | lr 1.20e-04 | (3891.57 ms | 134724 tok/s) step 21175/76294 | train loss 3.593630 | norm 18.7239 | lr 1.20e-04 | (3796.33 ms | 138104 tok/s) step 21176/76294 | train loss 3.589919 | norm 26.3883 | lr 
1.20e-04 | (3799.51 ms | 137988 tok/s) step 21177/76294 | train loss 3.558507 | norm 54.1947 | lr 1.20e-04 | (3885.34 ms | 134940 tok/s) step 21178/76294 | train loss 3.599744 | norm 40.1908 | lr 1.20e-04 | (3794.64 ms | 138165 tok/s) step 21179/76294 | train loss 3.546565 | norm 34.1992 | lr 1.20e-04 | (3818.32 ms | 137309 tok/s) step 21180/76294 | train loss 3.562111 | norm 168.8904 | lr 1.20e-04 | (3799.23 ms | 137999 tok/s) step 21181/76294 | train loss 3.627017 | norm 28.1702 | lr 1.20e-04 | (3801.13 ms | 137929 tok/s) step 21182/76294 | train loss 3.572017 | norm 70.9906 | lr 1.20e-04 | (4028.77 ms | 130136 tok/s) step 21183/76294 | train loss 3.594097 | norm 46.4155 | lr 1.20e-04 | (3814.32 ms | 137452 tok/s) step 21184/76294 | train loss 3.573579 | norm 49.2910 | lr 1.20e-04 | (3803.12 ms | 137857 tok/s) step 21185/76294 | train loss 3.576068 | norm 48.0167 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 21186/76294 | train loss 3.538214 | norm 217.5941 | lr 1.20e-04 | (3808.07 ms | 137678 tok/s) step 21187/76294 | train loss 3.529497 | norm 13.8681 | lr 1.20e-04 | (3798.30 ms | 138032 tok/s) step 21188/76294 | train loss 3.617941 | norm 29.5218 | lr 1.20e-04 | (3813.61 ms | 137478 tok/s) step 21189/76294 | train loss 3.589447 | norm 180.7784 | lr 1.20e-04 | (3806.57 ms | 137733 tok/s) step 21190/76294 | train loss 3.765646 | norm 82.9147 | lr 1.20e-04 | (3797.97 ms | 138044 tok/s) step 21191/76294 | train loss 3.527969 | norm 18.8884 | lr 1.20e-04 | (3826.54 ms | 137014 tok/s) step 21192/76294 | train loss 3.534037 | norm 26.6246 | lr 1.20e-04 | (3796.70 ms | 138091 tok/s) step 21193/76294 | train loss 3.560299 | norm 25.2079 | lr 1.20e-04 | (3806.94 ms | 137719 tok/s) step 21194/76294 | train loss 3.572088 | norm 16.5140 | lr 1.20e-04 | (3865.96 ms | 135616 tok/s) step 21195/76294 | train loss 3.601327 | norm 32.8947 | lr 1.20e-04 | (3804.69 ms | 137800 tok/s) step 21196/76294 | train loss 3.576493 | norm 13.7948 | lr 1.20e-04 | (3912.98 ms | 133987 
tok/s) step 21197/76294 | train loss 3.548845 | norm 94.6503 | lr 1.20e-04 | (3798.31 ms | 138032 tok/s) step 21198/76294 | train loss 3.504083 | norm 34.1518 | lr 1.20e-04 | (3800.50 ms | 137952 tok/s) step 21199/76294 | train loss 3.530816 | norm 34.5525 | lr 1.20e-04 | (3816.57 ms | 137371 tok/s) step 21200/76294 | train loss 3.532484 | norm 17.6242 | lr 1.20e-04 | (3805.08 ms | 137786 tok/s) step 21201/76294 | train loss 3.562948 | norm 29.1086 | lr 1.20e-04 | (3826.91 ms | 137000 tok/s) step 21202/76294 | train loss 3.530169 | norm 53.6016 | lr 1.20e-04 | (3858.25 ms | 135888 tok/s) step 21203/76294 | train loss 3.512801 | norm 37.6072 | lr 1.20e-04 | (3929.32 ms | 133430 tok/s) step 21204/76294 | train loss 3.569373 | norm 37.2519 | lr 1.20e-04 | (3798.53 ms | 138024 tok/s) step 21205/76294 | train loss 3.493608 | norm 30.9586 | lr 1.20e-04 | (3854.99 ms | 136003 tok/s) step 21206/76294 | train loss 3.511028 | norm 87.5471 | lr 1.20e-04 | (3797.75 ms | 138052 tok/s) step 21207/76294 | train loss 3.509504 | norm 64.8273 | lr 1.20e-04 | (3829.19 ms | 136919 tok/s) step 21208/76294 | train loss 3.512693 | norm 474.8239 | lr 1.20e-04 | (3793.68 ms | 138200 tok/s) step 21209/76294 | train loss 3.526239 | norm 72.7727 | lr 1.20e-04 | (3822.66 ms | 137153 tok/s) step 21210/76294 | train loss 3.497927 | norm 68.0994 | lr 1.20e-04 | (3797.10 ms | 138076 tok/s) step 21211/76294 | train loss 3.455087 | norm 28.3692 | lr 1.20e-04 | (3800.05 ms | 137969 tok/s) step 21212/76294 | train loss 3.473102 | norm 148.2503 | lr 1.20e-04 | (3824.35 ms | 137092 tok/s) step 21213/76294 | train loss 3.470137 | norm 228.5935 | lr 1.20e-04 | (3801.55 ms | 137914 tok/s) step 21214/76294 | train loss 3.547295 | norm 77.6201 | lr 1.20e-04 | (3818.61 ms | 137298 tok/s) step 21215/76294 | train loss 3.478101 | norm 517.8124 | lr 1.20e-04 | (3837.61 ms | 136619 tok/s) step 21216/76294 | train loss 3.505034 | norm 73.1613 | lr 1.20e-04 | (3806.59 ms | 137732 tok/s) step 21217/76294 | train 
loss 3.484786 | norm 306.5427 | lr 1.20e-04 | (3801.38 ms | 137920 tok/s) step 21218/76294 | train loss 3.512842 | norm 220.6018 | lr 1.20e-04 | (3819.85 ms | 137254 tok/s) step 21219/76294 | train loss 3.516309 | norm 48.7006 | lr 1.20e-04 | (3802.21 ms | 137890 tok/s) step 21220/76294 | train loss 3.548920 | norm 42.3998 | lr 1.20e-04 | (3805.69 ms | 137764 tok/s) step 21221/76294 | train loss 3.482299 | norm 51.0220 | lr 1.20e-04 | (3800.81 ms | 137941 tok/s) step 21222/76294 | train loss 3.542275 | norm 113.2874 | lr 1.20e-04 | (3803.87 ms | 137830 tok/s) step 21223/76294 | train loss 3.554276 | norm 67.4164 | lr 1.20e-04 | (3798.56 ms | 138023 tok/s) step 21224/76294 | train loss 3.480245 | norm 60.9056 | lr 1.20e-04 | (3826.15 ms | 137027 tok/s) step 21225/76294 | train loss 3.493686 | norm 138.2084 | lr 1.20e-04 | (3793.94 ms | 138191 tok/s) step 21226/76294 | train loss 3.477120 | norm 40.2314 | lr 1.20e-04 | (3799.83 ms | 137977 tok/s) step 21227/76294 | train loss 3.518651 | norm 76.0909 | lr 1.20e-04 | (3820.67 ms | 137224 tok/s) step 21228/76294 | train loss 3.460521 | norm 34.5474 | lr 1.20e-04 | (3797.62 ms | 138057 tok/s) step 21229/76294 | train loss 3.431802 | norm 72.0644 | lr 1.20e-04 | (3889.12 ms | 134809 tok/s) step 21230/76294 | train loss 3.557355 | norm 49.6327 | lr 1.20e-04 | (3796.64 ms | 138092 tok/s) step 21231/76294 | train loss 3.473206 | norm 46.2834 | lr 1.20e-04 | (3800.91 ms | 137937 tok/s) step 21232/76294 | train loss 3.464556 | norm 45.8937 | lr 1.20e-04 | (3821.92 ms | 137179 tok/s) step 21233/76294 | train loss 3.423319 | norm 40.0810 | lr 1.20e-04 | (3797.95 ms | 138045 tok/s) step 21234/76294 | train loss 3.465078 | norm 98.4999 | lr 1.20e-04 | (3802.04 ms | 137896 tok/s) step 21235/76294 | train loss 3.492813 | norm 51.5519 | lr 1.20e-04 | (3801.96 ms | 137899 tok/s) step 21236/76294 | train loss 3.424100 | norm 63.4656 | lr 1.20e-04 | (3800.57 ms | 137950 tok/s) step 21237/76294 | train loss 3.443687 | norm 52.0902 | lr 
1.20e-04 | (3807.48 ms | 137700 tok/s) step 21238/76294 | train loss 3.577782 | norm 76.6758 | lr 1.20e-04 | (3796.78 ms | 138087 tok/s) step 21239/76294 | train loss 3.451311 | norm 83.6082 | lr 1.20e-04 | (3830.35 ms | 136877 tok/s) step 21240/76294 | train loss 3.507027 | norm 76.7899 | lr 1.20e-04 | (3797.24 ms | 138071 tok/s) step 21241/76294 | train loss 3.485186 | norm 15.4979 | lr 1.20e-04 | (3805.82 ms | 137760 tok/s) step 21242/76294 | train loss 3.428567 | norm 50.4999 | lr 1.20e-04 | (3825.93 ms | 137035 tok/s) step 21243/76294 | train loss 3.456419 | norm 43.2572 | lr 1.20e-04 | (3797.76 ms | 138052 tok/s) step 21244/76294 | train loss 3.450642 | norm 35.9692 | lr 1.20e-04 | (3798.50 ms | 138025 tok/s) step 21245/76294 | train loss 3.425802 | norm 16.3307 | lr 1.20e-04 | (3799.65 ms | 137983 tok/s) step 21246/76294 | train loss 3.441731 | norm 22.9592 | lr 1.20e-04 | (3801.95 ms | 137900 tok/s) step 21247/76294 | train loss 3.415290 | norm 53.4675 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 21248/76294 | train loss 3.514197 | norm 30.1734 | lr 1.20e-04 | (3803.70 ms | 137836 tok/s) step 21249/76294 | train loss 3.487182 | norm 22.9201 | lr 1.20e-04 | (3796.91 ms | 138083 tok/s) step 21250/76294 | train loss 3.463968 | norm 25.2028 | lr 1.20e-04 | (3806.06 ms | 137751 tok/s) val loss: 3.468666 saving model checkpoint to ./results/gpt2-124M-gqa/step_21250.pth step 21251/76294 | train loss 3.542428 | norm 60.4247 | lr 1.20e-04 | (3836.97 ms | 136641 tok/s) step 21252/76294 | train loss 3.445302 | norm 237.0253 | lr 1.20e-04 | (3793.59 ms | 138204 tok/s) step 21253/76294 | train loss 3.518811 | norm 92.1028 | lr 1.20e-04 | (3823.30 ms | 137130 tok/s) step 21254/76294 | train loss 3.449874 | norm 32.0267 | lr 1.20e-04 | (3844.50 ms | 136373 tok/s) step 21255/76294 | train loss 3.451240 | norm 48.2318 | lr 1.20e-04 | (3795.01 ms | 138152 tok/s) step 21256/76294 | train loss 3.411946 | norm 34.1402 | lr 1.20e-04 | (3795.31 ms | 138141 tok/s) step 
21257/76294 | train loss 3.499232 | norm 52.6232 | lr 1.20e-04 | (3812.58 ms | 137515 tok/s) step 21258/76294 | train loss 3.504024 | norm 73.0098 | lr 1.20e-04 | (3805.94 ms | 137755 tok/s) step 21259/76294 | train loss 3.501262 | norm 47.3687 | lr 1.20e-04 | (3801.10 ms | 137931 tok/s) step 21260/76294 | train loss 3.500979 | norm 33.9231 | lr 1.20e-04 | (3799.31 ms | 137996 tok/s) step 21261/76294 | train loss 3.440768 | norm 53.6082 | lr 1.20e-04 | (3796.94 ms | 138082 tok/s) step 21262/76294 | train loss 3.482774 | norm 25.4766 | lr 1.20e-04 | (3820.62 ms | 137226 tok/s) step 21263/76294 | train loss 3.464504 | norm 53.1886 | lr 1.20e-04 | (3794.06 ms | 138187 tok/s) step 21264/76294 | train loss 3.449075 | norm 31.2652 | lr 1.20e-04 | (3820.64 ms | 137225 tok/s) step 21265/76294 | train loss 3.530875 | norm 35.6076 | lr 1.20e-04 | (3817.99 ms | 137321 tok/s) step 21266/76294 | train loss 3.459298 | norm 21.6097 | lr 1.20e-04 | (3801.38 ms | 137920 tok/s) step 21267/76294 | train loss 3.608932 | norm 14.1882 | lr 1.20e-04 | (3798.28 ms | 138033 tok/s) step 21268/76294 | train loss 3.503746 | norm 34.8846 | lr 1.20e-04 | (3861.84 ms | 135761 tok/s) step 21269/76294 | train loss 3.536446 | norm 62.7430 | lr 1.20e-04 | (3795.47 ms | 138135 tok/s) step 21270/76294 | train loss 3.468699 | norm 37.4484 | lr 1.20e-04 | (3822.16 ms | 137171 tok/s) step 21271/76294 | train loss 3.527666 | norm 31.5269 | lr 1.20e-04 | (3796.80 ms | 138087 tok/s) step 21272/76294 | train loss 3.483832 | norm 31.2596 | lr 1.20e-04 | (3805.68 ms | 137764 tok/s) step 21273/76294 | train loss 3.611280 | norm 25.1640 | lr 1.20e-04 | (3818.37 ms | 137307 tok/s) step 21274/76294 | train loss 3.514322 | norm 47.0466 | lr 1.20e-04 | (3802.16 ms | 137892 tok/s) step 21275/76294 | train loss 3.484473 | norm 77.7226 | lr 1.20e-04 | (3825.75 ms | 137042 tok/s) step 21276/76294 | train loss 3.506735 | norm 65.8817 | lr 1.20e-04 | (3796.75 ms | 138089 tok/s) step 21277/76294 | train loss 3.486918 | 
norm 205.2223 | lr 1.20e-04 | (3824.06 ms | 137103 tok/s) step 21278/76294 | train loss 3.493020 | norm 25.5370 | lr 1.20e-04 | (4576.17 ms | 114569 tok/s) step 21279/76294 | train loss 3.495513 | norm 48.0670 | lr 1.20e-04 | (3922.02 ms | 133678 tok/s) step 21280/76294 | train loss 3.528522 | norm 46.5441 | lr 1.20e-04 | (3788.10 ms | 138404 tok/s) step 21281/76294 | train loss 3.432939 | norm 53.2477 | lr 1.20e-04 | (11096.93 ms | 47246 tok/s) step 21282/76294 | train loss 3.512004 | norm 108.7421 | lr 1.20e-04 | (3852.82 ms | 136079 tok/s) step 21283/76294 | train loss 3.587804 | norm 66.1429 | lr 1.20e-04 | (3804.93 ms | 137792 tok/s) step 21284/76294 | train loss 3.552993 | norm 75.9685 | lr 1.20e-04 | (3781.09 ms | 138661 tok/s) step 21285/76294 | train loss 3.550594 | norm 80.8256 | lr 1.20e-04 | (3791.65 ms | 138274 tok/s) step 21286/76294 | train loss 3.574238 | norm 34.7109 | lr 1.20e-04 | (3783.27 ms | 138581 tok/s) step 21287/76294 | train loss 3.589490 | norm 27.7238 | lr 1.20e-04 | (4447.51 ms | 117884 tok/s) step 21288/76294 | train loss 3.473299 | norm 46.2288 | lr 1.20e-04 | (3787.51 ms | 138426 tok/s) step 21289/76294 | train loss 3.525135 | norm 45.4326 | lr 1.20e-04 | (3824.53 ms | 137086 tok/s) step 21290/76294 | train loss 3.442229 | norm 55.5422 | lr 1.20e-04 | (3781.73 ms | 138637 tok/s) step 21291/76294 | train loss 3.469937 | norm 138.2887 | lr 1.20e-04 | (3784.02 ms | 138553 tok/s) step 21292/76294 | train loss 3.482841 | norm 59.6648 | lr 1.20e-04 | (3811.69 ms | 137547 tok/s) step 21293/76294 | train loss 3.587768 | norm 36.9037 | lr 1.20e-04 | (3787.24 ms | 138435 tok/s) step 21294/76294 | train loss 3.434888 | norm 43.0288 | lr 1.20e-04 | (3790.29 ms | 138324 tok/s) step 21295/76294 | train loss 3.512216 | norm 81.3304 | lr 1.20e-04 | (3784.94 ms | 138520 tok/s) step 21296/76294 | train loss 3.493778 | norm 65.4719 | lr 1.20e-04 | (3795.59 ms | 138131 tok/s) step 21297/76294 | train loss 3.488014 | norm 37.5598 | lr 1.20e-04 | 
(3823.27 ms | 137131 tok/s) step 21298/76294 | train loss 3.533406 | norm 194.2444 | lr 1.20e-04 | (3796.17 ms | 138110 tok/s) step 21299/76294 | train loss 3.509970 | norm 66.3417 | lr 1.20e-04 | (3788.89 ms | 138375 tok/s) step 21300/76294 | train loss 3.593944 | norm 124.8929 | lr 1.20e-04 | (3789.93 ms | 138337 tok/s) step 21301/76294 | train loss 3.505278 | norm 196.5453 | lr 1.20e-04 | (3787.42 ms | 138429 tok/s) step 21302/76294 | train loss 3.692204 | norm 95.4738 | lr 1.20e-04 | (3785.89 ms | 138485 tok/s) step 21303/76294 | train loss 3.600381 | norm 129.7158 | lr 1.20e-04 | (3810.24 ms | 137600 tok/s) step 21304/76294 | train loss 3.592801 | norm 172.4986 | lr 1.20e-04 | (3786.88 ms | 138448 tok/s) step 21305/76294 | train loss 3.503909 | norm 112.1071 | lr 1.20e-04 | (3793.81 ms | 138196 tok/s) step 21306/76294 | train loss 3.653097 | norm 166.0807 | lr 1.20e-04 | (3807.65 ms | 137693 tok/s) step 21307/76294 | train loss 3.551533 | norm 69.4714 | lr 1.20e-04 | (3787.80 ms | 138415 tok/s) step 21308/76294 | train loss 3.615870 | norm 142.1953 | lr 1.20e-04 | (3792.45 ms | 138245 tok/s) step 21309/76294 | train loss 3.511169 | norm 129.5191 | lr 1.20e-04 | (3792.04 ms | 138260 tok/s) step 21310/76294 | train loss 3.590577 | norm 81.8386 | lr 1.20e-04 | (3790.17 ms | 138328 tok/s) step 21311/76294 | train loss 3.554429 | norm 101.1656 | lr 1.20e-04 | (3788.67 ms | 138383 tok/s) step 21312/76294 | train loss 3.523103 | norm 136.7365 | lr 1.20e-04 | (3794.22 ms | 138181 tok/s) step 21313/76294 | train loss 3.535359 | norm 92.0667 | lr 1.20e-04 | (3791.76 ms | 138270 tok/s) step 21314/76294 | train loss 3.505078 | norm 97.0826 | lr 1.20e-04 | (3785.35 ms | 138504 tok/s) step 21315/76294 | train loss 3.569230 | norm 97.0083 | lr 1.20e-04 | (3812.02 ms | 137535 tok/s) step 21316/76294 | train loss 3.441179 | norm 228.9238 | lr 1.20e-04 | (3789.04 ms | 138370 tok/s) step 21317/76294 | train loss 3.561648 | norm 40.7872 | lr 1.20e-04 | (3847.72 ms | 136260 tok/s) 
step 21318/76294 | train loss 3.527583 | norm 145.7303 | lr 1.20e-04 | (3788.43 ms | 138392 tok/s) step 21319/76294 | train loss 3.528956 | norm 106.5017 | lr 1.20e-04 | (3788.59 ms | 138386 tok/s) step 21320/76294 | train loss 3.495869 | norm 112.0600 | lr 1.20e-04 | (3825.45 ms | 137053 tok/s) step 21321/76294 | train loss 3.464533 | norm 59.9245 | lr 1.20e-04 | (3795.08 ms | 138149 tok/s) step 21322/76294 | train loss 3.499859 | norm 37.4239 | lr 1.20e-04 | (3792.28 ms | 138251 tok/s) step 21323/76294 | train loss 3.459180 | norm 88.0680 | lr 1.20e-04 | (3789.68 ms | 138346 tok/s) step 21324/76294 | train loss 3.554487 | norm 51.3620 | lr 1.20e-04 | (3787.87 ms | 138412 tok/s) step 21325/76294 | train loss 3.482594 | norm 55.3036 | lr 1.20e-04 | (3813.29 ms | 137490 tok/s) step 21326/76294 | train loss 3.523020 | norm 23.8587 | lr 1.20e-04 | (3791.39 ms | 138284 tok/s) step 21327/76294 | train loss 3.483058 | norm 52.5193 | lr 1.20e-04 | (3796.23 ms | 138107 tok/s) step 21328/76294 | train loss 3.565784 | norm 62.2273 | lr 1.20e-04 | (3853.50 ms | 136055 tok/s) step 21329/76294 | train loss 3.537189 | norm 638.6588 | lr 1.20e-04 | (3787.72 ms | 138418 tok/s) step 21330/76294 | train loss 3.515411 | norm 146.1248 | lr 1.20e-04 | (3792.27 ms | 138252 tok/s) step 21331/76294 | train loss 3.531775 | norm 223.9563 | lr 1.20e-04 | (3811.89 ms | 137540 tok/s) step 21332/76294 | train loss 3.524846 | norm 147.5670 | lr 1.20e-04 | (3789.53 ms | 138352 tok/s) step 21333/76294 | train loss 3.553751 | norm 45.1010 | lr 1.20e-04 | (3795.01 ms | 138152 tok/s) step 21334/76294 | train loss 3.458924 | norm 47.4701 | lr 1.20e-04 | (3788.45 ms | 138391 tok/s) step 21335/76294 | train loss 3.611199 | norm 236.8868 | lr 1.20e-04 | (3796.64 ms | 138092 tok/s) step 21336/76294 | train loss 3.486955 | norm 34.6706 | lr 1.20e-04 | (3791.10 ms | 138294 tok/s) step 21337/76294 | train loss 3.481229 | norm 90.5013 | lr 1.20e-04 | (3789.55 ms | 138351 tok/s) step 21338/76294 | train loss 
3.483321 | norm 66.0409 | lr 1.20e-04 | (3818.44 ms | 137304 tok/s) step 21339/76294 | train loss 3.539176 | norm 115.2338 | lr 1.20e-04 | (3787.05 ms | 138442 tok/s) step 21340/76294 | train loss 3.452537 | norm 33.2500 | lr 1.20e-04 | (3791.20 ms | 138291 tok/s) step 21341/76294 | train loss 3.523687 | norm 31.4575 | lr 1.20e-04 | (3810.65 ms | 137585 tok/s) step 21342/76294 | train loss 3.491234 | norm 58.8972 | lr 1.20e-04 | (3790.92 ms | 138301 tok/s) step 21343/76294 | train loss 3.512402 | norm 39.7686 | lr 1.20e-04 | (3795.54 ms | 138133 tok/s) step 21344/76294 | train loss 3.511195 | norm 23.9837 | lr 1.20e-04 | (3789.21 ms | 138363 tok/s) step 21345/76294 | train loss 3.433410 | norm 84.0416 | lr 1.20e-04 | (3796.56 ms | 138095 tok/s) step 21346/76294 | train loss 3.528066 | norm 66.9672 | lr 1.20e-04 | (3789.77 ms | 138343 tok/s) step 21347/76294 | train loss 3.438851 | norm 159.2153 | lr 1.20e-04 | (3793.16 ms | 138219 tok/s) step 21348/76294 | train loss 3.702564 | norm 88.6403 | lr 1.20e-04 | (3795.38 ms | 138138 tok/s) step 21349/76294 | train loss 3.507520 | norm 123.4604 | lr 1.20e-04 | (3792.18 ms | 138255 tok/s) step 21350/76294 | train loss 3.563482 | norm 79.9219 | lr 1.20e-04 | (3791.95 ms | 138264 tok/s) step 21351/76294 | train loss 3.494491 | norm 72.7741 | lr 1.20e-04 | (3793.28 ms | 138215 tok/s) step 21352/76294 | train loss 3.471122 | norm 63.2895 | lr 1.20e-04 | (3792.36 ms | 138248 tok/s) step 21353/76294 | train loss 3.463828 | norm 142.2338 | lr 1.20e-04 | (3876.97 ms | 135231 tok/s) step 21354/76294 | train loss 3.516670 | norm 92.8058 | lr 1.20e-04 | (3789.29 ms | 138360 tok/s) step 21355/76294 | train loss 3.523791 | norm 79.8026 | lr 1.20e-04 | (3945.64 ms | 132878 tok/s) step 21356/76294 | train loss 3.523156 | norm 77.9580 | lr 1.20e-04 | (3787.80 ms | 138415 tok/s) step 21357/76294 | train loss 3.545746 | norm 107.8123 | lr 1.20e-04 | (3791.34 ms | 138286 tok/s) step 21358/76294 | train loss 3.535640 | norm 131.2380 | lr 
1.20e-04 | (3803.32 ms | 137850 tok/s) step 21359/76294 | train loss 3.476019 | norm 63.2115 | lr 1.20e-04 | (3787.50 ms | 138426 tok/s) step 21360/76294 | train loss 3.545875 | norm 59.0399 | lr 1.20e-04 | (3793.60 ms | 138203 tok/s) step 21361/76294 | train loss 3.485008 | norm 208.5824 | lr 1.20e-04 | (3790.32 ms | 138323 tok/s) step 21362/76294 | train loss 3.502899 | norm 152.0585 | lr 1.20e-04 | (3790.57 ms | 138314 tok/s) step 21363/76294 | train loss 3.515420 | norm 42.0409 | lr 1.20e-04 | (3788.24 ms | 138399 tok/s) step 21364/76294 | train loss 3.477363 | norm 64.3037 | lr 1.20e-04 | (3784.89 ms | 138521 tok/s) step 21365/76294 | train loss 3.514283 | norm 54.1342 | lr 1.20e-04 | (3813.17 ms | 137494 tok/s) step 21366/76294 | train loss 3.467088 | norm 42.1310 | lr 1.20e-04 | (3787.50 ms | 138426 tok/s) step 21367/76294 | train loss 3.443837 | norm 29.2602 | lr 1.20e-04 | (4290.13 ms | 122208 tok/s) step 21368/76294 | train loss 3.493647 | norm 90.2721 | lr 1.20e-04 | (3811.08 ms | 137570 tok/s) step 21369/76294 | train loss 3.490168 | norm 56.2842 | lr 1.20e-04 | (3788.99 ms | 138371 tok/s) step 21370/76294 | train loss 3.540318 | norm 38.7525 | lr 1.20e-04 | (3789.18 ms | 138364 tok/s) step 21371/76294 | train loss 3.479933 | norm 59.8774 | lr 1.20e-04 | (3790.42 ms | 138319 tok/s) step 21372/76294 | train loss 3.521142 | norm 65.2364 | lr 1.20e-04 | (3792.11 ms | 138257 tok/s) step 21373/76294 | train loss 3.491716 | norm 72.2133 | lr 1.20e-04 | (3787.83 ms | 138414 tok/s) step 21374/76294 | train loss 3.493408 | norm 61.7226 | lr 1.20e-04 | (3785.86 ms | 138486 tok/s) step 21375/76294 | train loss 3.539583 | norm 380.5750 | lr 1.20e-04 | (3810.12 ms | 137604 tok/s) step 21376/76294 | train loss 3.556591 | norm 101.6714 | lr 1.20e-04 | (3784.89 ms | 138521 tok/s) step 21377/76294 | train loss 3.569936 | norm 77.7680 | lr 1.20e-04 | (3791.78 ms | 138270 tok/s) step 21378/76294 | train loss 3.501528 | norm 109.1607 | lr 1.20e-04 | (3909.45 ms | 134108 
tok/s) step 21379/76294 | train loss 3.535681 | norm 71.7334 | lr 1.20e-04 | (3782.04 ms | 138626 tok/s) step 21380/76294 | train loss 3.487350 | norm 66.9667 | lr 1.20e-04 | (3792.21 ms | 138254 tok/s) step 21381/76294 | train loss 3.489420 | norm 89.5016 | lr 1.20e-04 | (3810.09 ms | 137605 tok/s) step 21382/76294 | train loss 3.503122 | norm 39.5099 | lr 1.20e-04 | (3786.33 ms | 138469 tok/s) step 21383/76294 | train loss 3.660009 | norm 101.3146 | lr 1.20e-04 | (3803.23 ms | 137853 tok/s) step 21384/76294 | train loss 3.530519 | norm 51.7363 | lr 1.20e-04 | (3790.35 ms | 138322 tok/s) step 21385/76294 | train loss 3.462090 | norm 36.4855 | lr 1.20e-04 | (3836.55 ms | 136656 tok/s) step 21386/76294 | train loss 3.528241 | norm 49.7760 | lr 1.20e-04 | (3788.12 ms | 138403 tok/s) step 21387/76294 | train loss 3.516468 | norm 31.0378 | lr 1.20e-04 | (3789.43 ms | 138355 tok/s) step 21388/76294 | train loss 3.563592 | norm 42.6514 | lr 1.20e-04 | (3806.08 ms | 137750 tok/s) step 21389/76294 | train loss 3.499626 | norm 49.7430 | lr 1.20e-04 | (3813.92 ms | 137467 tok/s) step 21390/76294 | train loss 3.516433 | norm 26.3877 | lr 1.20e-04 | (3785.98 ms | 138481 tok/s) step 21391/76294 | train loss 3.514897 | norm 9.0731 | lr 1.20e-04 | (3814.29 ms | 137454 tok/s) step 21392/76294 | train loss 3.512932 | norm 16.7625 | lr 1.20e-04 | (3787.86 ms | 138413 tok/s) step 21393/76294 | train loss 3.577341 | norm 35.9450 | lr 1.20e-04 | (3787.59 ms | 138423 tok/s) step 21394/76294 | train loss 3.455690 | norm 18.6333 | lr 1.20e-04 | (3807.43 ms | 137701 tok/s) step 21395/76294 | train loss 3.535669 | norm 15.5378 | lr 1.20e-04 | (3788.56 ms | 138387 tok/s) step 21396/76294 | train loss 3.538640 | norm 42.6374 | lr 1.20e-04 | (3795.32 ms | 138141 tok/s) step 21397/76294 | train loss 3.589561 | norm 18.8916 | lr 1.20e-04 | (3791.02 ms | 138297 tok/s) step 21398/76294 | train loss 3.478690 | norm 10.9441 | lr 1.20e-04 | (3794.66 ms | 138165 tok/s) step 21399/76294 | train loss 
3.464303 | norm 12.6949 | lr 1.20e-04 | (3794.37 ms | 138175 tok/s) step 21400/76294 | train loss 3.483827 | norm 28.8550 | lr 1.20e-04 | (3793.31 ms | 138214 tok/s) step 21401/76294 | train loss 3.507655 | norm 7.8361 | lr 1.20e-04 | (3815.78 ms | 137400 tok/s) step 21402/76294 | train loss 3.519297 | norm 15.4688 | lr 1.20e-04 | (3851.47 ms | 136127 tok/s) step 21403/76294 | train loss 3.496798 | norm 17.7194 | lr 1.20e-04 | (3858.53 ms | 135878 tok/s) step 21404/76294 | train loss 3.569405 | norm 49.1140 | lr 1.20e-04 | (3824.79 ms | 137076 tok/s) step 21405/76294 | train loss 3.480271 | norm 33.3924 | lr 1.20e-04 | (3798.89 ms | 138011 tok/s) step 21406/76294 | train loss 3.507573 | norm 20.3696 | lr 1.20e-04 | (3789.34 ms | 138359 tok/s) step 21407/76294 | train loss 3.574647 | norm 30.9866 | lr 1.20e-04 | (3796.84 ms | 138085 tok/s) step 21408/76294 | train loss 3.453434 | norm 11.3323 | lr 1.20e-04 | (3796.72 ms | 138090 tok/s) step 21409/76294 | train loss 3.591123 | norm 19.0338 | lr 1.20e-04 | (3785.43 ms | 138502 tok/s) step 21410/76294 | train loss 3.506915 | norm 18.4238 | lr 1.20e-04 | (3807.60 ms | 137695 tok/s) step 21411/76294 | train loss 3.604109 | norm 32.8991 | lr 1.20e-04 | (3784.86 ms | 138522 tok/s) step 21412/76294 | train loss 3.604822 | norm 31.7597 | lr 1.20e-04 | (3788.73 ms | 138381 tok/s) step 21413/76294 | train loss 3.670024 | norm 36.0895 | lr 1.20e-04 | (3810.64 ms | 137585 tok/s) step 21414/76294 | train loss 3.565090 | norm 82.9399 | lr 1.20e-04 | (3792.64 ms | 138238 tok/s) step 21415/76294 | train loss 3.622572 | norm 25.0704 | lr 1.20e-04 | (3796.59 ms | 138094 tok/s) step 21416/76294 | train loss 3.609480 | norm 50.0980 | lr 1.20e-04 | (3789.83 ms | 138341 tok/s) step 21417/76294 | train loss 3.598612 | norm 50.3278 | lr 1.20e-04 | (3790.20 ms | 138327 tok/s) step 21418/76294 | train loss 3.728346 | norm 10.5937 | lr 1.20e-04 | (3789.99 ms | 138335 tok/s) step 21419/76294 | train loss 3.532782 | norm 37.5819 | lr 1.20e-04 | 
(3794.03 ms | 138187 tok/s) step 21420/76294 | train loss 3.648506 | norm 23.1320 | lr 1.20e-04 | (3791.35 ms | 138285 tok/s) step 21421/76294 | train loss 3.621207 | norm 31.6588 | lr 1.20e-04 | (3786.94 ms | 138446 tok/s) step 21422/76294 | train loss 3.597328 | norm 32.1251 | lr 1.20e-04 | (3812.78 ms | 137508 tok/s) step 21423/76294 | train loss 3.493652 | norm 16.4902 | lr 1.20e-04 | (3786.01 ms | 138480 tok/s) step 21424/76294 | train loss 3.524640 | norm 18.2365 | lr 1.20e-04 | (3797.94 ms | 138045 tok/s) step 21425/76294 | train loss 3.593621 | norm 15.4802 | lr 1.20e-04 | (3807.34 ms | 137705 tok/s) step 21426/76294 | train loss 3.534045 | norm 8.7715 | lr 1.20e-04 | (3796.64 ms | 138093 tok/s) step 21427/76294 | train loss 3.533224 | norm 16.4215 | lr 1.20e-04 | (4672.96 ms | 112196 tok/s) step 21428/76294 | train loss 3.565456 | norm 11.7455 | lr 1.20e-04 | (3871.47 ms | 135424 tok/s) step 21429/76294 | train loss 3.477906 | norm 14.0523 | lr 1.20e-04 | (3785.13 ms | 138512 tok/s) step 21430/76294 | train loss 3.557839 | norm 15.5866 | lr 1.20e-04 | (3817.68 ms | 137332 tok/s) step 21431/76294 | train loss 3.560063 | norm 19.1592 | lr 1.20e-04 | (3792.33 ms | 138250 tok/s) step 21432/76294 | train loss 3.679853 | norm 20.6215 | lr 1.20e-04 | (3799.11 ms | 138003 tok/s) step 21433/76294 | train loss 3.656393 | norm 17.8526 | lr 1.20e-04 | (3813.10 ms | 137497 tok/s) step 21434/76294 | train loss 3.688877 | norm 50.5890 | lr 1.20e-04 | (3785.61 ms | 138495 tok/s) step 21435/76294 | train loss 3.695851 | norm 13.4033 | lr 1.20e-04 | (3792.52 ms | 138243 tok/s) step 21436/76294 | train loss 3.672227 | norm 16.4670 | lr 1.20e-04 | (3786.63 ms | 138458 tok/s) step 21437/76294 | train loss 3.666926 | norm 13.3228 | lr 1.20e-04 | (3790.15 ms | 138329 tok/s) step 21438/76294 | train loss 3.647933 | norm 13.2976 | lr 1.20e-04 | (3789.60 ms | 138349 tok/s) step 21439/76294 | train loss 3.690992 | norm 15.1126 | lr 1.20e-04 | (3783.83 ms | 138560 tok/s) step 
21440/76294 | train loss 3.668835 | norm 12.1261 | lr 1.20e-04 | (3814.89 ms | 137432 tok/s) step 21441/76294 | train loss 3.670110 | norm 8.5728 | lr 1.20e-04 | (3787.68 ms | 138419 tok/s) step 21442/76294 | train loss 3.654619 | norm 16.5881 | lr 1.20e-04 | (3789.54 ms | 138351 tok/s) step 21443/76294 | train loss 3.669892 | norm 13.4308 | lr 1.20e-04 | (3806.01 ms | 137753 tok/s) step 21444/76294 | train loss 3.684592 | norm 9.9836 | lr 1.20e-04 | (3789.38 ms | 138357 tok/s) step 21445/76294 | train loss 3.644916 | norm 13.1151 | lr 1.20e-04 | (3781.82 ms | 138634 tok/s) step 21446/76294 | train loss 3.662241 | norm 8.8998 | lr 1.20e-04 | (3811.20 ms | 137565 tok/s) step 21447/76294 | train loss 3.552117 | norm 7.2396 | lr 1.20e-04 | (3786.73 ms | 138454 tok/s) step 21448/76294 | train loss 3.540820 | norm 9.6297 | lr 1.20e-04 | (3793.69 ms | 138200 tok/s) step 21449/76294 | train loss 3.627603 | norm 7.2338 | lr 1.20e-04 | (3806.65 ms | 137730 tok/s) step 21450/76294 | train loss 3.558753 | norm 6.6376 | lr 1.20e-04 | (3788.61 ms | 138385 tok/s) step 21451/76294 | train loss 3.536308 | norm 4.1799 | lr 1.20e-04 | (3790.90 ms | 138302 tok/s) step 21452/76294 | train loss 3.487424 | norm 5.8918 | lr 1.20e-04 | (3787.49 ms | 138426 tok/s) step 21453/76294 | train loss 3.569910 | norm 4.9644 | lr 1.20e-04 | (3793.26 ms | 138216 tok/s) step 21454/76294 | train loss 3.543393 | norm 5.8186 | lr 1.20e-04 | (3862.43 ms | 135741 tok/s) step 21455/76294 | train loss 3.500961 | norm 4.2312 | lr 1.20e-04 | (3788.23 ms | 138399 tok/s) step 21456/76294 | train loss 3.511714 | norm 5.3310 | lr 1.20e-04 | (3805.75 ms | 137762 tok/s) step 21457/76294 | train loss 3.485742 | norm 4.3946 | lr 1.20e-04 | (3786.54 ms | 138461 tok/s) step 21458/76294 | train loss 3.461244 | norm 9.3478 | lr 1.20e-04 | (3798.07 ms | 138041 tok/s) step 21459/76294 | train loss 3.505017 | norm 5.2986 | lr 1.20e-04 | (3817.99 ms | 137320 tok/s) step 21460/76294 | train loss 3.550637 | norm 4.3974 | lr 
1.20e-04 | (3787.65 ms | 138420 tok/s) step 21461/76294 | train loss 3.494860 | norm 9.0818 | lr 1.20e-04 | (3803.68 ms | 137837 tok/s) step 21462/76294 | train loss 3.511971 | norm 8.7487 | lr 1.20e-04 | (3828.92 ms | 136928 tok/s) step 21463/76294 | train loss 3.490009 | norm 6.0999 | lr 1.20e-04 | (3784.41 ms | 138539 tok/s) step 21464/76294 | train loss 3.504057 | norm 4.2525 | lr 1.20e-04 | (3791.70 ms | 138273 tok/s) step 21465/76294 | train loss 3.614007 | norm 5.3257 | lr 1.20e-04 | (3807.65 ms | 137693 tok/s) step 21466/76294 | train loss 3.609751 | norm 3.9261 | lr 1.20e-04 | (3788.67 ms | 138383 tok/s) step 21467/76294 | train loss 3.494997 | norm 9.6421 | lr 1.20e-04 | (3796.75 ms | 138088 tok/s) step 21468/76294 | train loss 3.501405 | norm 7.1747 | lr 1.20e-04 | (4307.85 ms | 121705 tok/s) step 21469/76294 | train loss 3.498654 | norm 4.9219 | lr 1.20e-04 | (3783.67 ms | 138566 tok/s) step 21470/76294 | train loss 3.486949 | norm 2.9308 | lr 1.20e-04 | (3889.88 ms | 134783 tok/s) step 21471/76294 | train loss 3.478898 | norm 5.3869 | lr 1.20e-04 | (3784.78 ms | 138525 tok/s) step 21472/76294 | train loss 3.444391 | norm 5.0031 | lr 1.20e-04 | (3883.20 ms | 135015 tok/s) step 21473/76294 | train loss 3.476793 | norm 3.1999 | lr 1.20e-04 | (3784.14 ms | 138549 tok/s) step 21474/76294 | train loss 3.458705 | norm 3.7494 | lr 1.20e-04 | (3847.75 ms | 136258 tok/s) step 21475/76294 | train loss 3.478709 | norm 4.2226 | lr 1.20e-04 | (3783.74 ms | 138563 tok/s) step 21476/76294 | train loss 3.483758 | norm 3.7512 | lr 1.20e-04 | (3807.82 ms | 137687 tok/s) step 21477/76294 | train loss 3.443238 | norm 4.9554 | lr 1.20e-04 | (3798.49 ms | 138025 tok/s) step 21478/76294 | train loss 3.574258 | norm 5.6518 | lr 1.20e-04 | (3820.35 ms | 137236 tok/s) step 21479/76294 | train loss 3.423514 | norm 6.7723 | lr 1.20e-04 | (3836.09 ms | 136672 tok/s) step 21480/76294 | train loss 3.441128 | norm 5.1442 | lr 1.20e-04 | (3782.84 ms | 138596 tok/s) step 21481/76294 | 
train loss 3.513121 | norm 4.8700 | lr 1.20e-04 | (3793.28 ms | 138215 tok/s) step 21482/76294 | train loss 3.434150 | norm 27.3176 | lr 1.20e-04 | (3791.18 ms | 138292 tok/s) step 21483/76294 | train loss 3.375870 | norm 5.1049 | lr 1.20e-04 | (3800.18 ms | 137964 tok/s) step 21484/76294 | train loss 3.454699 | norm 6.3083 | lr 1.20e-04 | (3787.26 ms | 138434 tok/s) step 21485/76294 | train loss 3.373840 | norm 5.0441 | lr 1.20e-04 | (3797.83 ms | 138049 tok/s) step 21486/76294 | train loss 3.416398 | norm 3.9946 | lr 1.20e-04 | (3807.34 ms | 137705 tok/s) step 21487/76294 | train loss 3.502628 | norm 4.8504 | lr 1.20e-04 | (3791.48 ms | 138281 tok/s) step 21488/76294 | train loss 3.508633 | norm 5.4916 | lr 1.20e-04 | (3793.61 ms | 138203 tok/s) step 21489/76294 | train loss 3.472315 | norm 9.5028 | lr 1.20e-04 | (3789.91 ms | 138338 tok/s) step 21490/76294 | train loss 3.450625 | norm 5.6000 | lr 1.20e-04 | (3794.32 ms | 138177 tok/s) step 21491/76294 | train loss 3.411753 | norm 5.5567 | lr 1.20e-04 | (3790.95 ms | 138300 tok/s) step 21492/76294 | train loss 3.412299 | norm 7.1667 | lr 1.20e-04 | (3787.27 ms | 138434 tok/s) step 21493/76294 | train loss 3.475471 | norm 4.2779 | lr 1.20e-04 | (4632.25 ms | 113182 tok/s) step 21494/76294 | train loss 3.412009 | norm 4.4133 | lr 1.20e-04 | (3788.97 ms | 138372 tok/s) step 21495/76294 | train loss 3.433540 | norm 7.9460 | lr 1.20e-04 | (3788.93 ms | 138374 tok/s) step 21496/76294 | train loss 3.512966 | norm 5.2971 | lr 1.20e-04 | (3782.90 ms | 138594 tok/s) step 21497/76294 | train loss 3.432189 | norm 4.8584 | lr 1.20e-04 | (3790.98 ms | 138299 tok/s) step 21498/76294 | train loss 3.439725 | norm 4.4415 | lr 1.20e-04 | (3804.98 ms | 137790 tok/s) step 21499/76294 | train loss 3.474295 | norm 6.6723 | lr 1.20e-04 | (3818.64 ms | 137297 tok/s) step 21500/76294 | train loss 3.418207 | norm 6.5841 | lr 1.20e-04 | (3798.78 ms | 138015 tok/s) val loss: 3.483825 saving model checkpoint to 
./results/gpt2-124M-gqa/step_21500.pth step 21501/76294 | train loss 3.452667 | norm 5.9600 | lr 1.20e-04 | (3807.88 ms | 137685 tok/s) step 21502/76294 | train loss 3.500261 | norm 8.9404 | lr 1.20e-04 | (3810.60 ms | 137587 tok/s) step 21503/76294 | train loss 3.463899 | norm 7.9804 | lr 1.20e-04 | (3797.78 ms | 138051 tok/s) step 21504/76294 | train loss 3.444242 | norm 5.2407 | lr 1.20e-04 | (3822.35 ms | 137164 tok/s) step 21505/76294 | train loss 3.560562 | norm 5.4455 | lr 1.20e-04 | (3792.13 ms | 138257 tok/s) step 21506/76294 | train loss 3.363299 | norm 6.1253 | lr 1.20e-04 | (3813.66 ms | 137476 tok/s) step 21507/76294 | train loss 3.431206 | norm 7.2503 | lr 1.20e-04 | (3810.97 ms | 137573 tok/s) step 21508/76294 | train loss 3.515833 | norm 6.4733 | lr 1.20e-04 | (3795.48 ms | 138135 tok/s) step 21509/76294 | train loss 3.458552 | norm 4.2818 | lr 1.20e-04 | (3809.93 ms | 137611 tok/s) step 21510/76294 | train loss 3.533694 | norm 5.3573 | lr 1.20e-04 | (3795.78 ms | 138124 tok/s) step 21511/76294 | train loss 3.539884 | norm 10.0862 | lr 1.20e-04 | (3797.77 ms | 138052 tok/s) step 21512/76294 | train loss 3.470365 | norm 6.1897 | lr 1.20e-04 | (3797.11 ms | 138076 tok/s) step 21513/76294 | train loss 3.419496 | norm 4.6452 | lr 1.20e-04 | (3817.17 ms | 137350 tok/s) step 21514/76294 | train loss 3.479998 | norm 4.1695 | lr 1.20e-04 | (3793.24 ms | 138216 tok/s) step 21515/76294 | train loss 3.439290 | norm 5.7480 | lr 1.20e-04 | (3797.36 ms | 138066 tok/s) step 21516/76294 | train loss 3.468587 | norm 3.8715 | lr 1.20e-04 | (3809.72 ms | 137619 tok/s) step 21517/76294 | train loss 3.473181 | norm 14.0957 | lr 1.20e-04 | (3799.35 ms | 137994 tok/s) step 21518/76294 | train loss 3.442595 | norm 9.3552 | lr 1.20e-04 | (3797.79 ms | 138051 tok/s) step 21519/76294 | train loss 3.444062 | norm 4.0318 | lr 1.20e-04 | (3799.69 ms | 137982 tok/s) step 21520/76294 | train loss 3.495252 | norm 5.1660 | lr 1.20e-04 | (3819.38 ms | 137270 tok/s) step 21521/76294 | 
train loss 3.412276 | norm 8.4651 | lr 1.20e-04 | (3794.58 ms | 138168 tok/s) step 21522/76294 | train loss 3.540451 | norm 7.4804 | lr 1.20e-04 | (3807.16 ms | 137711 tok/s) step 21523/76294 | train loss 3.476484 | norm 7.8396 | lr 1.20e-04 | (3824.03 ms | 137104 tok/s) step 21524/76294 | train loss 3.425379 | norm 4.8557 | lr 1.20e-04 | (3796.27 ms | 138106 tok/s) step 21525/76294 | train loss 3.368999 | norm 17.2869 | lr 1.20e-04 | (3794.33 ms | 138177 tok/s) step 21526/76294 | train loss 3.488149 | norm 5.3957 | lr 1.20e-04 | (3822.02 ms | 137176 tok/s) step 21527/76294 | train loss 3.417520 | norm 4.7498 | lr 1.20e-04 | (3798.05 ms | 138041 tok/s) step 21528/76294 | train loss 3.451121 | norm 30.4944 | lr 1.20e-04 | (3796.98 ms | 138080 tok/s) step 21529/76294 | train loss 3.485435 | norm 4.1365 | lr 1.20e-04 | (3880.17 ms | 135120 tok/s) step 21530/76294 | train loss 3.389037 | norm 4.7385 | lr 1.20e-04 | (3792.74 ms | 138235 tok/s) step 21531/76294 | train loss 3.495165 | norm 5.3391 | lr 1.20e-04 | (3806.14 ms | 137748 tok/s) step 21532/76294 | train loss 3.514185 | norm 16.8890 | lr 1.20e-04 | (3823.81 ms | 137111 tok/s) step 21533/76294 | train loss 3.452973 | norm 5.0656 | lr 1.20e-04 | (3793.54 ms | 138206 tok/s) step 21534/76294 | train loss 3.416289 | norm 3.2263 | lr 1.20e-04 | (3798.71 ms | 138017 tok/s) step 21535/76294 | train loss 3.446455 | norm 3.2874 | lr 1.20e-04 | (3800.84 ms | 137940 tok/s) step 21536/76294 | train loss 3.441245 | norm 5.3993 | lr 1.20e-04 | (3804.10 ms | 137822 tok/s) step 21537/76294 | train loss 3.483754 | norm 6.1779 | lr 1.20e-04 | (3805.87 ms | 137758 tok/s) step 21538/76294 | train loss 3.443828 | norm 5.1661 | lr 1.20e-04 | (3811.10 ms | 137569 tok/s) step 21539/76294 | train loss 3.425271 | norm 4.2085 | lr 1.20e-04 | (3798.21 ms | 138036 tok/s) step 21540/76294 | train loss 3.476470 | norm 4.7066 | lr 1.20e-04 | (3804.19 ms | 137819 tok/s) step 21541/76294 | train loss 3.531866 | norm 11.2275 | lr 1.20e-04 | 
(3800.20 ms | 137963 tok/s) step 21542/76294 | train loss 3.429143 | norm 5.3130 | lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 21543/76294 | train loss 3.645664 | norm 12.5631 | lr 1.20e-04 | (3801.19 ms | 137927 tok/s) step 21544/76294 | train loss 3.455720 | norm 6.3464 | lr 1.20e-04 | (3798.19 ms | 138036 tok/s) step 21545/76294 | train loss 3.428266 | norm 8.4333 | lr 1.20e-04 | (3829.70 ms | 136900 tok/s) step 21546/76294 | train loss 3.504833 | norm 5.1297 | lr 1.20e-04 | (3800.12 ms | 137966 tok/s) step 21547/76294 | train loss 3.465116 | norm 19.9964 | lr 1.20e-04 | (3811.25 ms | 137563 tok/s) step 21548/76294 | train loss 3.442896 | norm 5.3621 | lr 1.20e-04 | (3796.44 ms | 138100 tok/s) step 21549/76294 | train loss 3.537861 | norm 6.1885 | lr 1.20e-04 | (3803.33 ms | 137850 tok/s) step 21550/76294 | train loss 3.435329 | norm 6.3584 | lr 1.20e-04 | (3831.28 ms | 136844 tok/s) step 21551/76294 | train loss 3.475440 | norm 12.8118 | lr 1.20e-04 | (3814.29 ms | 137454 tok/s) step 21552/76294 | train loss 3.508518 | norm 7.3935 | lr 1.20e-04 | (3807.00 ms | 137717 tok/s) step 21553/76294 | train loss 3.482492 | norm 28.7418 | lr 1.20e-04 | (3797.39 ms | 138065 tok/s) step 21554/76294 | train loss 3.482460 | norm 15.1730 | lr 1.20e-04 | (3798.32 ms | 138032 tok/s) step 21555/76294 | train loss 3.521622 | norm 23.3053 | lr 1.20e-04 | (3918.15 ms | 133810 tok/s) step 21556/76294 | train loss 3.713181 | norm 11.7882 | lr 1.20e-04 | (3797.18 ms | 138073 tok/s) step 21557/76294 | train loss 3.590774 | norm 51.7312 | lr 1.20e-04 | (3802.20 ms | 137891 tok/s) step 21558/76294 | train loss 3.533146 | norm 10.0149 | lr 1.20e-04 | (3829.35 ms | 136913 tok/s) step 21559/76294 | train loss 3.527252 | norm 18.5079 | lr 1.20e-04 | (3807.06 ms | 137715 tok/s) step 21560/76294 | train loss 3.473594 | norm 20.2381 | lr 1.20e-04 | (3804.98 ms | 137790 tok/s) step 21561/76294 | train loss 3.649150 | norm 46.5019 | lr 1.20e-04 | (3804.46 ms | 137809 tok/s) step 21562/76294 | 
train loss 3.576321 | norm 111.2398 | lr 1.20e-04 | (3809.14 ms | 137639 tok/s) step 21563/76294 | train loss 3.495291 | norm 190.5596 | lr 1.20e-04 | (3806.96 ms | 137718 tok/s) step 21564/76294 | train loss 3.560786 | norm 10.0356 | lr 1.20e-04 | (3828.81 ms | 136932 tok/s) step 21565/76294 | train loss 3.523513 | norm 120.3627 | lr 1.20e-04 | (3806.83 ms | 137723 tok/s) step 21566/76294 | train loss 3.517083 | norm 22.1031 | lr 1.20e-04 | (3810.83 ms | 137579 tok/s) step 21567/76294 | train loss 3.560598 | norm 16.6082 | lr 1.20e-04 | (3813.93 ms | 137467 tok/s) step 21568/76294 | train loss 3.508700 | norm 15.3552 | lr 1.20e-04 | (3811.48 ms | 137555 tok/s) step 21569/76294 | train loss 3.543374 | norm 9.9677 | lr 1.20e-04 | (3801.53 ms | 137915 tok/s) step 21570/76294 | train loss 3.542862 | norm 32.8008 | lr 1.20e-04 | (3833.28 ms | 136773 tok/s) step 21571/76294 | train loss 3.493878 | norm 14.0431 | lr 1.20e-04 | (3804.74 ms | 137799 tok/s) step 21572/76294 | train loss 3.427494 | norm 18.9182 | lr 1.20e-04 | (3809.63 ms | 137622 tok/s) step 21573/76294 | train loss 3.454593 | norm 61.2938 | lr 1.20e-04 | (3817.86 ms | 137325 tok/s) step 21574/76294 | train loss 3.484586 | norm 32.6641 | lr 1.20e-04 | (3806.63 ms | 137730 tok/s) step 21575/76294 | train loss 3.466091 | norm 289.4141 | lr 1.20e-04 | (3804.38 ms | 137812 tok/s) step 21576/76294 | train loss 3.547230 | norm 41.2392 | lr 1.20e-04 | (3810.16 ms | 137603 tok/s) step 21577/76294 | train loss 3.499816 | norm 125.9405 | lr 1.20e-04 | (3815.50 ms | 137410 tok/s) step 21578/76294 | train loss 3.605754 | norm 10.8896 | lr 1.20e-04 | (3810.24 ms | 137600 tok/s) step 21579/76294 | train loss 3.523556 | norm 203.0545 | lr 1.20e-04 | (3809.32 ms | 137633 tok/s) step 21580/76294 | train loss 3.549692 | norm 81.1561 | lr 1.20e-04 | (3887.49 ms | 134866 tok/s) step 21581/76294 | train loss 3.498727 | norm 24.5609 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 21582/76294 | train loss 3.568051 | norm 67.0781 
| lr 1.20e-04 | (3840.09 ms | 136530 tok/s) step 21583/76294 | train loss 3.548499 | norm 190.3057 | lr 1.20e-04 | (3803.84 ms | 137831 tok/s) step 21584/76294 | train loss 3.588739 | norm 26.9831 | lr 1.20e-04 | (3810.51 ms | 137590 tok/s) step 21585/76294 | train loss 3.662214 | norm 220.3962 | lr 1.20e-04 | (3822.32 ms | 137165 tok/s) step 21586/76294 | train loss 3.614861 | norm 68.5078 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 21587/76294 | train loss 3.547167 | norm 61.4204 | lr 1.20e-04 | (3801.35 ms | 137921 tok/s) step 21588/76294 | train loss 3.575009 | norm 244.7354 | lr 1.20e-04 | (3797.02 ms | 138079 tok/s) step 21589/76294 | train loss 3.590812 | norm 105.7514 | lr 1.20e-04 | (3828.99 ms | 136926 tok/s) step 21590/76294 | train loss 3.582849 | norm 103.3885 | lr 1.20e-04 | (3817.96 ms | 137321 tok/s) step 21591/76294 | train loss 3.657296 | norm 34.3629 | lr 1.20e-04 | (3802.10 ms | 137894 tok/s) step 21592/76294 | train loss 3.755019 | norm 17.8466 | lr 1.20e-04 | (3797.22 ms | 138072 tok/s) step 21593/76294 | train loss 3.525483 | norm 57.3440 | lr 1.20e-04 | (3827.85 ms | 136967 tok/s) step 21594/76294 | train loss 3.607890 | norm 24.4472 | lr 1.20e-04 | (3797.73 ms | 138053 tok/s) step 21595/76294 | train loss 3.570132 | norm 44.7365 | lr 1.20e-04 | (3849.06 ms | 136212 tok/s) step 21596/76294 | train loss 3.559949 | norm 27.3549 | lr 1.20e-04 | (3796.19 ms | 138109 tok/s) step 21597/76294 | train loss 3.560773 | norm 25.0731 | lr 1.20e-04 | (3836.08 ms | 136673 tok/s) step 21598/76294 | train loss 3.630284 | norm 63.5482 | lr 1.20e-04 | (3799.31 ms | 137995 tok/s) step 21599/76294 | train loss 3.523560 | norm 47.3642 | lr 1.20e-04 | (3801.52 ms | 137916 tok/s) step 21600/76294 | train loss 3.536356 | norm 37.7062 | lr 1.20e-04 | (3829.11 ms | 136922 tok/s) step 21601/76294 | train loss 3.564032 | norm 60.7302 | lr 1.20e-04 | (3801.21 ms | 137927 tok/s) step 21602/76294 | train loss 3.562146 | norm 21.1890 | lr 1.20e-04 | (3799.13 ms | 
138002 tok/s) step 21603/76294 | train loss 3.544989 | norm 27.7413 | lr 1.20e-04 | (3808.25 ms | 137672 tok/s) step 21604/76294 | train loss 3.527373 | norm 30.4611 | lr 1.20e-04 | (3801.58 ms | 137913 tok/s) step 21605/76294 | train loss 3.630791 | norm 26.2159 | lr 1.20e-04 | (3861.42 ms | 135776 tok/s) step 21606/76294 | train loss 3.557734 | norm 22.4267 | lr 1.20e-04 | (3886.70 ms | 134893 tok/s) step 21607/76294 | train loss 3.547590 | norm 95.1838 | lr 1.20e-04 | (3801.47 ms | 137917 tok/s) step 21608/76294 | train loss 3.501915 | norm 40.2145 | lr 1.20e-04 | (3801.50 ms | 137916 tok/s) step 21609/76294 | train loss 3.570774 | norm 35.1791 | lr 1.20e-04 | (3827.28 ms | 136987 tok/s) step 21610/76294 | train loss 3.525810 | norm 14.0675 | lr 1.20e-04 | (3799.84 ms | 137976 tok/s) step 21611/76294 | train loss 3.640449 | norm 42.0242 | lr 1.20e-04 | (3798.98 ms | 138008 tok/s) step 21612/76294 | train loss 3.511832 | norm 20.7703 | lr 1.20e-04 | (3834.71 ms | 136722 tok/s) step 21613/76294 | train loss 3.549971 | norm 86.8195 | lr 1.20e-04 | (3796.31 ms | 138104 tok/s) step 21614/76294 | train loss 3.475365 | norm 15.7721 | lr 1.20e-04 | (3803.26 ms | 137852 tok/s) step 21615/76294 | train loss 3.552305 | norm 20.4282 | lr 1.20e-04 | (3830.48 ms | 136873 tok/s) step 21616/76294 | train loss 3.520282 | norm 29.2393 | lr 1.20e-04 | (3801.13 ms | 137930 tok/s) step 21617/76294 | train loss 3.471288 | norm 13.8432 | lr 1.20e-04 | (3800.83 ms | 137940 tok/s) step 21618/76294 | train loss 3.464451 | norm 38.6652 | lr 1.20e-04 | (3805.29 ms | 137779 tok/s) step 21619/76294 | train loss 3.499003 | norm 32.8824 | lr 1.20e-04 | (3799.10 ms | 138003 tok/s) step 21620/76294 | train loss 3.531127 | norm 37.8529 | lr 1.20e-04 | (3824.96 ms | 137070 tok/s) step 21621/76294 | train loss 3.422197 | norm 44.1698 | lr 1.20e-04 | (3797.93 ms | 138046 tok/s) step 21622/76294 | train loss 3.444632 | norm 15.4187 | lr 1.20e-04 | (3799.76 ms | 137979 tok/s) step 21623/76294 | train 
loss 3.529600 | norm 32.5246 | lr 1.20e-04 | (3827.29 ms | 136987 tok/s) step 21624/76294 | train loss 3.438241 | norm 26.0891 | lr 1.20e-04 | (3805.51 ms | 137771 tok/s) step 21625/76294 | train loss 3.390383 | norm 40.1564 | lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 21626/76294 | train loss 3.551943 | norm 19.7614 | lr 1.20e-04 | (3798.66 ms | 138019 tok/s) step 21627/76294 | train loss 3.521281 | norm 42.4860 | lr 1.20e-04 | (3809.55 ms | 137625 tok/s) step 21628/76294 | train loss 3.435854 | norm 35.2468 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 21629/76294 | train loss 3.466223 | norm 33.4166 | lr 1.20e-04 | (3804.47 ms | 137808 tok/s) step 21630/76294 | train loss 3.474921 | norm 17.4359 | lr 1.20e-04 | (3805.90 ms | 137757 tok/s) step 21631/76294 | train loss 3.482344 | norm 10.6285 | lr 1.20e-04 | (3806.32 ms | 137741 tok/s) step 21632/76294 | train loss 3.527428 | norm 22.5636 | lr 1.20e-04 | (3845.58 ms | 136335 tok/s) step 21633/76294 | train loss 3.544962 | norm 36.9891 | lr 1.20e-04 | (3802.42 ms | 137883 tok/s) step 21634/76294 | train loss 3.449683 | norm 29.4837 | lr 1.20e-04 | (3801.29 ms | 137924 tok/s) step 21635/76294 | train loss 3.536217 | norm 13.6803 | lr 1.20e-04 | (3820.26 ms | 137239 tok/s) step 21636/76294 | train loss 3.626733 | norm 22.9243 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 21637/76294 | train loss 3.406951 | norm 18.6209 | lr 1.20e-04 | (3806.12 ms | 137749 tok/s) step 21638/76294 | train loss 3.524504 | norm 15.8457 | lr 1.20e-04 | (3798.45 ms | 138027 tok/s) step 21639/76294 | train loss 3.494832 | norm 35.8061 | lr 1.20e-04 | (3802.97 ms | 137863 tok/s) step 21640/76294 | train loss 3.457557 | norm 22.3830 | lr 1.20e-04 | (3796.68 ms | 138091 tok/s) step 21641/76294 | train loss 3.506059 | norm 10.3394 | lr 1.20e-04 | (3808.23 ms | 137672 tok/s) step 21642/76294 | train loss 3.497094 | norm 8.7375 | lr 1.20e-04 | (3826.60 ms | 137011 tok/s) step 21643/76294 | train loss 3.489093 | norm 7.0360 | lr 
1.20e-04 | (3826.76 ms | 137006 tok/s) step 21644/76294 | train loss 3.504055 | norm 9.1661 | lr 1.20e-04 | (3800.47 ms | 137954 tok/s) step 21645/76294 | train loss 3.535396 | norm 19.4911 | lr 1.20e-04 | (3833.37 ms | 136769 tok/s) step 21646/76294 | train loss 3.496427 | norm 9.6733 | lr 1.20e-04 | (3799.10 ms | 138003 tok/s) step 21647/76294 | train loss 3.413684 | norm 10.5777 | lr 1.20e-04 | (3805.64 ms | 137766 tok/s) step 21648/76294 | train loss 3.549592 | norm 7.2156 | lr 1.20e-04 | (3820.36 ms | 137235 tok/s) step 21649/76294 | train loss 3.486692 | norm 17.7084 | lr 1.20e-04 | (3803.46 ms | 137845 tok/s) step 21650/76294 | train loss 3.422547 | norm 11.1012 | lr 1.20e-04 | (3814.35 ms | 137452 tok/s) step 21651/76294 | train loss 3.473990 | norm 38.0041 | lr 1.20e-04 | (3803.33 ms | 137850 tok/s) step 21652/76294 | train loss 3.470492 | norm 22.8680 | lr 1.20e-04 | (3802.98 ms | 137862 tok/s) step 21653/76294 | train loss 3.576032 | norm 16.9306 | lr 1.20e-04 | (3826.67 ms | 137009 tok/s) step 21654/76294 | train loss 3.465595 | norm 9.7670 | lr 1.20e-04 | (3798.23 ms | 138035 tok/s) step 21655/76294 | train loss 3.499148 | norm 17.6977 | lr 1.20e-04 | (3803.33 ms | 137850 tok/s) step 21656/76294 | train loss 3.457247 | norm 24.9245 | lr 1.20e-04 | (3824.22 ms | 137097 tok/s) step 21657/76294 | train loss 3.455251 | norm 19.1376 | lr 1.20e-04 | (3801.26 ms | 137925 tok/s) step 21658/76294 | train loss 3.498985 | norm 8.9397 | lr 1.20e-04 | (3866.70 ms | 135590 tok/s) step 21659/76294 | train loss 3.412657 | norm 4.1828 | lr 1.20e-04 | (4104.70 ms | 127729 tok/s) step 21660/76294 | train loss 3.486667 | norm 11.7252 | lr 1.20e-04 | (3817.43 ms | 137340 tok/s) step 21661/76294 | train loss 3.508871 | norm 5.0617 | lr 1.20e-04 | (3799.46 ms | 137990 tok/s) step 21662/76294 | train loss 3.459385 | norm 10.7627 | lr 1.20e-04 | (3811.33 ms | 137560 tok/s) step 21663/76294 | train loss 3.501209 | norm 12.8369 | lr 1.20e-04 | (3800.91 ms | 137938 tok/s) step 
21664/76294 | train loss 3.453474 | norm 6.5254 | lr 1.20e-04 | (3821.30 ms | 137201 tok/s) step 21665/76294 | train loss 3.504479 | norm 11.4006 | lr 1.20e-04 | (3801.70 ms | 137909 tok/s) step 21666/76294 | train loss 3.501874 | norm 5.2035 | lr 1.20e-04 | (3821.47 ms | 137195 tok/s) step 21667/76294 | train loss 3.457462 | norm 14.5920 | lr 1.20e-04 | (3805.86 ms | 137758 tok/s) step 21668/76294 | train loss 3.478503 | norm 10.1603 | lr 1.20e-04 | (3794.93 ms | 138155 tok/s) step 21669/76294 | train loss 3.492348 | norm 8.5677 | lr 1.20e-04 | (3833.92 ms | 136750 tok/s) step 21670/76294 | train loss 3.538663 | norm 6.8905 | lr 1.20e-04 | (3800.84 ms | 137940 tok/s) step 21671/76294 | train loss 3.590521 | norm 15.0466 | lr 1.20e-04 | (3831.66 ms | 136831 tok/s) step 21672/76294 | train loss 3.530856 | norm 6.3087 | lr 1.20e-04 | (3796.32 ms | 138104 tok/s) step 21673/76294 | train loss 3.518694 | norm 17.1374 | lr 1.20e-04 | (3813.51 ms | 137482 tok/s) step 21674/76294 | train loss 3.457439 | norm 9.1410 | lr 1.20e-04 | (3797.55 ms | 138059 tok/s) step 21675/76294 | train loss 3.515688 | norm 13.8687 | lr 1.20e-04 | (3817.66 ms | 137332 tok/s) step 21676/76294 | train loss 3.543564 | norm 16.7075 | lr 1.20e-04 | (3819.08 ms | 137281 tok/s) step 21677/76294 | train loss 3.460206 | norm 7.3520 | lr 1.20e-04 | (3803.90 ms | 137829 tok/s) step 21678/76294 | train loss 3.603253 | norm 10.6616 | lr 1.20e-04 | (3801.78 ms | 137906 tok/s) step 21679/76294 | train loss 3.463622 | norm 10.3164 | lr 1.20e-04 | (3828.03 ms | 136960 tok/s) step 21680/76294 | train loss 3.514980 | norm 6.1094 | lr 1.20e-04 | (3805.23 ms | 137781 tok/s) step 21681/76294 | train loss 3.496418 | norm 2.9697 | lr 1.20e-04 | (3813.79 ms | 137472 tok/s) step 21682/76294 | train loss 3.530859 | norm 6.4520 | lr 1.20e-04 | (3808.20 ms | 137673 tok/s) step 21683/76294 | train loss 3.449558 | norm 28.3191 | lr 1.20e-04 | (3931.34 ms | 133361 tok/s) step 21684/76294 | train loss 3.544361 | norm 27.9695 
| lr 1.20e-04 | (3866.20 ms | 135608 tok/s) step 21685/76294 | train loss 3.477293 | norm 6.2764 | lr 1.20e-04 | (3796.64 ms | 138093 tok/s) step 21686/76294 | train loss 3.487388 | norm 6.4837 | lr 1.20e-04 | (3798.23 ms | 138035 tok/s) step 21687/76294 | train loss 3.500112 | norm 7.6158 | lr 1.20e-04 | (3821.24 ms | 137204 tok/s) step 21688/76294 | train loss 3.458731 | norm 5.5713 | lr 1.20e-04 | (3801.00 ms | 137934 tok/s) step 21689/76294 | train loss 3.539051 | norm 16.7591 | lr 1.20e-04 | (3804.13 ms | 137821 tok/s) step 21690/76294 | train loss 3.466947 | norm 5.1005 | lr 1.20e-04 | (3799.74 ms | 137980 tok/s) step 21691/76294 | train loss 3.528110 | norm 26.7125 | lr 1.20e-04 | (3810.55 ms | 137589 tok/s) step 21692/76294 | train loss 3.505222 | norm 3.3606 | lr 1.20e-04 | (3801.96 ms | 137899 tok/s) step 21693/76294 | train loss 3.524304 | norm 4.2163 | lr 1.20e-04 | (3799.24 ms | 137998 tok/s) step 21694/76294 | train loss 3.461077 | norm 13.1172 | lr 1.20e-04 | (3833.99 ms | 136747 tok/s) step 21695/76294 | train loss 3.455041 | norm 4.2538 | lr 1.20e-04 | (3800.01 ms | 137970 tok/s) step 21696/76294 | train loss 3.467187 | norm 4.0270 | lr 1.20e-04 | (3802.13 ms | 137893 tok/s) step 21697/76294 | train loss 3.538022 | norm 8.7577 | lr 1.20e-04 | (3819.37 ms | 137271 tok/s) step 21698/76294 | train loss 3.532670 | norm 5.4806 | lr 1.20e-04 | (3803.12 ms | 137857 tok/s) step 21699/76294 | train loss 3.410874 | norm 5.8614 | lr 1.20e-04 | (3804.54 ms | 137806 tok/s) step 21700/76294 | train loss 3.510941 | norm 3.9999 | lr 1.20e-04 | (3800.51 ms | 137952 tok/s) step 21701/76294 | train loss 3.464395 | norm 6.0824 | lr 1.20e-04 | (3809.54 ms | 137625 tok/s) step 21702/76294 | train loss 3.460116 | norm 5.2847 | lr 1.20e-04 | (3802.74 ms | 137871 tok/s) step 21703/76294 | train loss 3.451724 | norm 10.1941 | lr 1.20e-04 | (3848.60 ms | 136228 tok/s) step 21704/76294 | train loss 3.520877 | norm 5.6924 | lr 1.20e-04 | (3844.05 ms | 136390 tok/s) step 
21705/76294 | train loss 3.472923 | norm 67.1376 | lr 1.20e-04 | (3804.47 ms | 137808 tok/s) step 21706/76294 | train loss 3.477309 | norm 8.7419 | lr 1.20e-04 | (3819.81 ms | 137255 tok/s) step 21707/76294 | train loss 3.521545 | norm 9.8385 | lr 1.20e-04 | (3803.74 ms | 137835 tok/s) step 21708/76294 | train loss 3.566396 | norm 8.6367 | lr 1.20e-04 | (3808.75 ms | 137654 tok/s) step 21709/76294 | train loss 3.466876 | norm 4.3405 | lr 1.20e-04 | (6460.57 ms | 81152 tok/s) step 21710/76294 | train loss 3.463564 | norm 8.2227 | lr 1.20e-04 | (3900.10 ms | 134429 tok/s) step 21711/76294 | train loss 3.478791 | norm 5.1367 | lr 1.20e-04 | (3789.27 ms | 138361 tok/s) step 21712/76294 | train loss 3.477232 | norm 10.2674 | lr 1.20e-04 | (3794.90 ms | 138156 tok/s) step 21713/76294 | train loss 3.483714 | norm 3.7432 | lr 1.20e-04 | (3853.98 ms | 136038 tok/s) step 21714/76294 | train loss 3.474315 | norm 4.5152 | lr 1.20e-04 | (3787.16 ms | 138438 tok/s) step 21715/76294 | train loss 3.456228 | norm 11.9176 | lr 1.20e-04 | (4825.57 ms | 108648 tok/s) step 21716/76294 | train loss 3.493416 | norm 5.0718 | lr 1.20e-04 | (3789.36 ms | 138358 tok/s) step 21717/76294 | train loss 3.436659 | norm 5.8106 | lr 1.20e-04 | (3791.00 ms | 138298 tok/s) step 21718/76294 | train loss 3.441383 | norm 4.9949 | lr 1.20e-04 | (3787.89 ms | 138412 tok/s) step 21719/76294 | train loss 3.498563 | norm 5.4349 | lr 1.20e-04 | (3799.10 ms | 138003 tok/s) step 21720/76294 | train loss 3.491843 | norm 4.6040 | lr 1.20e-04 | (3813.91 ms | 137467 tok/s) step 21721/76294 | train loss 3.429009 | norm 2.7336 | lr 1.20e-04 | (3792.68 ms | 138237 tok/s) step 21722/76294 | train loss 3.463409 | norm 4.9956 | lr 1.20e-04 | (3791.40 ms | 138284 tok/s) step 21723/76294 | train loss 3.474182 | norm 3.6026 | lr 1.20e-04 | (3818.39 ms | 137306 tok/s) step 21724/76294 | train loss 3.452816 | norm 3.9527 | lr 1.20e-04 | (3794.19 ms | 138182 tok/s) step 21725/76294 | train loss 3.512419 | norm 2.9960 | lr 
1.20e-04 | (3800.27 ms | 137961 tok/s) step 21726/76294 | train loss 3.606753 | norm 4.2590 | lr 1.20e-04 | (3814.27 ms | 137455 tok/s) step 21727/76294 | train loss 3.447634 | norm 6.3297 | lr 1.20e-04 | (3807.63 ms | 137694 tok/s) step 21728/76294 | train loss 3.457701 | norm 4.7102 | lr 1.20e-04 | (3793.07 ms | 138223 tok/s) step 21729/76294 | train loss 3.443535 | norm 3.3655 | lr 1.20e-04 | (3824.95 ms | 137070 tok/s) step 21730/76294 | train loss 3.509882 | norm 4.1112 | lr 1.20e-04 | (3797.79 ms | 138051 tok/s) step 21731/76294 | train loss 3.425207 | norm 4.8490 | lr 1.20e-04 | (3802.51 ms | 137879 tok/s) step 21732/76294 | train loss 3.434202 | norm 15.6846 | lr 1.20e-04 | (3814.12 ms | 137460 tok/s) step 21733/76294 | train loss 3.498410 | norm 6.8753 | lr 1.20e-04 | (3799.10 ms | 138003 tok/s) step 21734/76294 | train loss 3.521261 | norm 5.1678 | lr 1.20e-04 | (3795.88 ms | 138120 tok/s) step 21735/76294 | train loss 3.402435 | norm 7.1028 | lr 1.20e-04 | (3854.51 ms | 136019 tok/s) step 21736/76294 | train loss 3.544126 | norm 7.2955 | lr 1.20e-04 | (3802.22 ms | 137890 tok/s) step 21737/76294 | train loss 3.482895 | norm 7.6735 | lr 1.20e-04 | (3800.43 ms | 137955 tok/s) step 21738/76294 | train loss 3.527733 | norm 6.5106 | lr 1.20e-04 | (3822.11 ms | 137172 tok/s) step 21739/76294 | train loss 3.529462 | norm 7.7430 | lr 1.20e-04 | (3798.38 ms | 138029 tok/s) step 21740/76294 | train loss 3.447132 | norm 4.1274 | lr 1.20e-04 | (3810.38 ms | 137595 tok/s) step 21741/76294 | train loss 3.458560 | norm 11.0964 | lr 1.20e-04 | (3805.10 ms | 137786 tok/s) step 21742/76294 | train loss 3.419177 | norm 8.0050 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 21743/76294 | train loss 3.550114 | norm 11.9968 | lr 1.20e-04 | (3808.32 ms | 137669 tok/s) step 21744/76294 | train loss 3.549809 | norm 8.5386 | lr 1.20e-04 | (3807.15 ms | 137711 tok/s) step 21745/76294 | train loss 3.511710 | norm 6.5874 | lr 1.20e-04 | (3807.17 ms | 137711 tok/s) step 21746/76294 
| train loss 3.489960 | norm 4.1738 | lr 1.20e-04 | (3813.88 ms | 137468 tok/s) step 21747/76294 | train loss 3.550443 | norm 5.3796 | lr 1.20e-04 | (3808.53 ms | 137661 tok/s) step 21748/76294 | train loss 3.634740 | norm 5.4381 | lr 1.20e-04 | (3813.40 ms | 137486 tok/s) step 21749/76294 | train loss 3.457919 | norm 5.7212 | lr 1.20e-04 | (3956.56 ms | 132511 tok/s) step 21750/76294 | train loss 3.563943 | norm 5.6919 | lr 1.20e-04 | (3810.63 ms | 137586 tok/s) val loss: 3.474988 saving model checkpoint to ./results/gpt2-124M-gqa/step_21750.pth step 21751/76294 | train loss 3.436449 | norm 3.7227 | lr 1.20e-04 | (3828.89 ms | 136929 tok/s) step 21752/76294 | train loss 3.505245 | norm 5.0315 | lr 1.20e-04 | (3824.28 ms | 137094 tok/s) step 21753/76294 | train loss 3.475672 | norm 6.2557 | lr 1.20e-04 | (3808.16 ms | 137675 tok/s) step 21754/76294 | train loss 3.470092 | norm 4.5659 | lr 1.20e-04 | (3814.32 ms | 137453 tok/s) step 21755/76294 | train loss 3.460315 | norm 5.3631 | lr 1.20e-04 | (3806.31 ms | 137742 tok/s) step 21756/76294 | train loss 3.517660 | norm 5.6702 | lr 1.20e-04 | (3810.05 ms | 137606 tok/s) step 21757/76294 | train loss 3.547980 | norm 5.9951 | lr 1.20e-04 | (3804.40 ms | 137811 tok/s) step 21758/76294 | train loss 3.495207 | norm 4.1227 | lr 1.20e-04 | (3802.24 ms | 137889 tok/s) step 21759/76294 | train loss 3.431827 | norm 4.2477 | lr 1.20e-04 | (3838.65 ms | 136581 tok/s) step 21760/76294 | train loss 3.511370 | norm 4.5457 | lr 1.20e-04 | (3836.03 ms | 136675 tok/s) step 21761/76294 | train loss 3.470928 | norm 6.0492 | lr 1.20e-04 | (3885.28 ms | 134942 tok/s) step 21762/76294 | train loss 3.538124 | norm 11.1287 | lr 1.20e-04 | (3805.62 ms | 137767 tok/s) step 21763/76294 | train loss 3.439863 | norm 6.6550 | lr 1.20e-04 | (3826.12 ms | 137029 tok/s) step 21764/76294 | train loss 3.589512 | norm 14.6378 | lr 1.20e-04 | (3802.14 ms | 137893 tok/s) step 21765/76294 | train loss 3.458207 | norm 4.5746 | lr 1.20e-04 | (3806.32 ms | 
137741 tok/s) step 21766/76294 | train loss 3.521413 | norm 8.7290 | lr 1.20e-04 | (3805.47 ms | 137772 tok/s) step 21767/76294 | train loss 3.475334 | norm 6.1463 | lr 1.20e-04 | (3804.13 ms | 137821 tok/s) step 21768/76294 | train loss 3.486172 | norm 6.3851 | lr 1.20e-04 | (3807.93 ms | 137683 tok/s) step 21769/76294 | train loss 3.500026 | norm 6.6696 | lr 1.20e-04 | (3804.45 ms | 137809 tok/s) step 21770/76294 | train loss 3.474285 | norm 6.5488 | lr 1.20e-04 | (3825.75 ms | 137042 tok/s) step 21771/76294 | train loss 3.498916 | norm 8.9552 | lr 1.20e-04 | (3806.25 ms | 137744 tok/s) step 21772/76294 | train loss 3.470036 | norm 7.3980 | lr 1.20e-04 | (3806.85 ms | 137722 tok/s) step 21773/76294 | train loss 3.514462 | norm 11.0663 | lr 1.20e-04 | (3880.73 ms | 135100 tok/s) step 21774/76294 | train loss 3.526571 | norm 7.8038 | lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 21775/76294 | train loss 3.449825 | norm 6.0993 | lr 1.20e-04 | (3802.18 ms | 137892 tok/s) step 21776/76294 | train loss 3.537383 | norm 9.1435 | lr 1.20e-04 | (3795.84 ms | 138122 tok/s) step 21777/76294 | train loss 3.454210 | norm 7.1827 | lr 1.20e-04 | (3828.92 ms | 136929 tok/s) step 21778/76294 | train loss 3.452720 | norm 6.2931 | lr 1.20e-04 | (3798.47 ms | 138026 tok/s) step 21779/76294 | train loss 3.429983 | norm 45.1080 | lr 1.20e-04 | (3802.59 ms | 137877 tok/s) step 21780/76294 | train loss 3.456671 | norm 9.9095 | lr 1.20e-04 | (3816.68 ms | 137367 tok/s) step 21781/76294 | train loss 3.495013 | norm 4.7362 | lr 1.20e-04 | (3806.39 ms | 137739 tok/s) step 21782/76294 | train loss 3.428222 | norm 9.6730 | lr 1.20e-04 | (3802.15 ms | 137893 tok/s) step 21783/76294 | train loss 3.512730 | norm 5.0066 | lr 1.20e-04 | (3825.16 ms | 137063 tok/s) step 21784/76294 | train loss 3.412956 | norm 5.4919 | lr 1.20e-04 | (3800.89 ms | 137938 tok/s) step 21785/76294 | train loss 3.458557 | norm 4.7226 | lr 1.20e-04 | (3825.64 ms | 137046 tok/s) step 21786/76294 | train loss 3.478844 | 
norm 2.9025 | lr 1.20e-04 | (3800.27 ms | 137961 tok/s) step 21787/76294 | train loss 3.581678 | norm 8.4140 | lr 1.20e-04 | (3934.47 ms | 133255 tok/s) step 21788/76294 | train loss 3.508440 | norm 4.5871 | lr 1.20e-04 | (3796.57 ms | 138095 tok/s) step 21789/76294 | train loss 3.441516 | norm 5.2681 | lr 1.20e-04 | (3803.21 ms | 137854 tok/s) step 21790/76294 | train loss 3.484593 | norm 5.0822 | lr 1.20e-04 | (3818.94 ms | 137286 tok/s) step 21791/76294 | train loss 3.476440 | norm 3.3685 | lr 1.20e-04 | (3801.62 ms | 137912 tok/s) step 21792/76294 | train loss 3.443146 | norm 3.0317 | lr 1.20e-04 | (3801.35 ms | 137922 tok/s) step 21793/76294 | train loss 3.435809 | norm 4.6054 | lr 1.20e-04 | (3797.12 ms | 138075 tok/s) step 21794/76294 | train loss 3.433433 | norm 5.2688 | lr 1.20e-04 | (3804.60 ms | 137804 tok/s) step 21795/76294 | train loss 3.445438 | norm 4.6678 | lr 1.20e-04 | (3803.39 ms | 137847 tok/s) step 21796/76294 | train loss 3.494262 | norm 5.0205 | lr 1.20e-04 | (3804.42 ms | 137810 tok/s) step 21797/76294 | train loss 3.418791 | norm 4.6929 | lr 1.20e-04 | (3810.87 ms | 137577 tok/s) step 21798/76294 | train loss 3.511986 | norm 3.4994 | lr 1.20e-04 | (3800.74 ms | 137944 tok/s) step 21799/76294 | train loss 3.497977 | norm 4.3713 | lr 1.20e-04 | (3825.95 ms | 137035 tok/s) step 21800/76294 | train loss 3.456190 | norm 6.2851 | lr 1.20e-04 | (3803.04 ms | 137860 tok/s) step 21801/76294 | train loss 3.526476 | norm 4.0194 | lr 1.20e-04 | (3801.52 ms | 137915 tok/s) step 21802/76294 | train loss 3.462826 | norm 2.5075 | lr 1.20e-04 | (3823.87 ms | 137109 tok/s) step 21803/76294 | train loss 3.490941 | norm 4.1897 | lr 1.20e-04 | (3804.25 ms | 137816 tok/s) step 21804/76294 | train loss 3.456162 | norm 2.9818 | lr 1.20e-04 | (3797.34 ms | 138067 tok/s) step 21805/76294 | train loss 3.425418 | norm 3.1187 | lr 1.20e-04 | (3824.74 ms | 137078 tok/s) step 21806/76294 | train loss 3.472159 | norm 8.5396 | lr 1.20e-04 | (3802.84 ms | 137867 tok/s) 
step 21807/76294 | train loss 3.455367 | norm 3.4137 | lr 1.20e-04 | (3804.22 ms | 137817 tok/s) step 21808/76294 | train loss 3.460559 | norm 4.2503 | lr 1.20e-04 | (3818.04 ms | 137319 tok/s) step 21809/76294 | train loss 3.413510 | norm 5.7545 | lr 1.20e-04 | (3805.51 ms | 137771 tok/s) step 21810/76294 | train loss 3.529315 | norm 3.1177 | lr 1.20e-04 | (3806.82 ms | 137723 tok/s) step 21811/76294 | train loss 3.457067 | norm 4.2965 | lr 1.20e-04 | (3803.77 ms | 137834 tok/s) step 21812/76294 | train loss 3.521388 | norm 5.1409 | lr 1.20e-04 | (3809.56 ms | 137624 tok/s) step 21813/76294 | train loss 3.537589 | norm 6.8369 | lr 1.20e-04 | (3809.27 ms | 137635 tok/s) step 21814/76294 | train loss 3.407367 | norm 4.1487 | lr 1.20e-04 | (3865.14 ms | 135645 tok/s) step 21815/76294 | train loss 3.416773 | norm 9.4990 | lr 1.20e-04 | (3802.60 ms | 137876 tok/s) step 21816/76294 | train loss 3.444599 | norm 8.4105 | lr 1.20e-04 | (3803.99 ms | 137826 tok/s) step 21817/76294 | train loss 3.449241 | norm 5.9071 | lr 1.20e-04 | (3821.80 ms | 137184 tok/s) step 21818/76294 | train loss 3.462295 | norm 4.5147 | lr 1.20e-04 | (3804.03 ms | 137824 tok/s) step 21819/76294 | train loss 3.547292 | norm 4.8143 | lr 1.20e-04 | (3805.03 ms | 137788 tok/s) step 21820/76294 | train loss 3.488456 | norm 5.1118 | lr 1.20e-04 | (3804.46 ms | 137809 tok/s) step 21821/76294 | train loss 3.468673 | norm 8.0376 | lr 1.20e-04 | (3801.30 ms | 137923 tok/s) step 21822/76294 | train loss 3.569311 | norm 4.5997 | lr 1.20e-04 | (3804.57 ms | 137805 tok/s) step 21823/76294 | train loss 3.449671 | norm 5.3613 | lr 1.20e-04 | (3817.39 ms | 137342 tok/s) step 21824/76294 | train loss 3.525028 | norm 6.2248 | lr 1.20e-04 | (3797.81 ms | 138050 tok/s) step 21825/76294 | train loss 3.503392 | norm 4.5425 | lr 1.20e-04 | (3809.70 ms | 137619 tok/s) step 21826/76294 | train loss 3.473826 | norm 6.0228 | lr 1.20e-04 | (3802.87 ms | 137866 tok/s) step 21827/76294 | train loss 3.516523 | norm 3.7173 | lr 
1.20e-04 | (3860.84 ms | 135796 tok/s) step 21828/76294 | train loss 3.512321 | norm 5.3271 | lr 1.20e-04 | (3800.24 ms | 137962 tok/s) step 21829/76294 | train loss 3.468844 | norm 7.9593 | lr 1.20e-04 | (3806.22 ms | 137745 tok/s) step 21830/76294 | train loss 3.500394 | norm 4.1293 | lr 1.20e-04 | (3800.14 ms | 137965 tok/s) step 21831/76294 | train loss 3.478082 | norm 3.9335 | lr 1.20e-04 | (3819.32 ms | 137273 tok/s) step 21832/76294 | train loss 3.534683 | norm 6.0759 | lr 1.20e-04 | (3806.94 ms | 137719 tok/s) step 21833/76294 | train loss 3.485915 | norm 3.0382 | lr 1.20e-04 | (3824.27 ms | 137095 tok/s) step 21834/76294 | train loss 3.471341 | norm 3.2664 | lr 1.20e-04 | (3802.44 ms | 137882 tok/s) step 21835/76294 | train loss 3.458662 | norm 5.0096 | lr 1.20e-04 | (3804.27 ms | 137815 tok/s) step 21836/76294 | train loss 3.489955 | norm 3.1223 | lr 1.20e-04 | (3801.25 ms | 137925 tok/s) step 21837/76294 | train loss 3.465087 | norm 2.6216 | lr 1.20e-04 | (3805.47 ms | 137772 tok/s) step 21838/76294 | train loss 3.371412 | norm 4.3577 | lr 1.20e-04 | (3803.38 ms | 137848 tok/s) step 21839/76294 | train loss 3.532250 | norm 3.9982 | lr 1.20e-04 | (3820.22 ms | 137240 tok/s) step 21840/76294 | train loss 3.499993 | norm 4.8223 | lr 1.20e-04 | (3801.14 ms | 137929 tok/s) step 21841/76294 | train loss 3.473948 | norm 6.1739 | lr 1.20e-04 | (3882.97 ms | 135022 tok/s) step 21842/76294 | train loss 3.535078 | norm 3.0890 | lr 1.20e-04 | (3796.84 ms | 138085 tok/s) step 21843/76294 | train loss 3.440402 | norm 3.3443 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 21844/76294 | train loss 3.526197 | norm 11.4419 | lr 1.20e-04 | (3822.14 ms | 137171 tok/s) step 21845/76294 | train loss 3.501928 | norm 4.0071 | lr 1.20e-04 | (3806.78 ms | 137725 tok/s) step 21846/76294 | train loss 3.559677 | norm 5.7475 | lr 1.20e-04 | (3797.15 ms | 138074 tok/s) step 21847/76294 | train loss 3.602167 | norm 3.8523 | lr 1.20e-04 | (3857.10 ms | 135928 tok/s) step 21848/76294 | 
train loss 3.392976 | norm 3.7131 | lr 1.20e-04 | (3799.61 ms | 137985 tok/s) step 21849/76294 | train loss 3.424966 | norm 2.8405 | lr 1.20e-04 | (3814.92 ms | 137431 tok/s) step 21850/76294 | train loss 3.393300 | norm 3.2221 | lr 1.20e-04 | (4054.76 ms | 129302 tok/s) step 21851/76294 | train loss 3.473283 | norm 5.8999 | lr 1.20e-04 | (3823.41 ms | 137126 tok/s) step 21852/76294 | train loss 3.442829 | norm 4.4117 | lr 1.20e-04 | (3797.15 ms | 138074 tok/s) step 21853/76294 | train loss 3.464699 | norm 3.2091 | lr 1.20e-04 | (3821.64 ms | 137189 tok/s) step 21854/76294 | train loss 3.464505 | norm 4.6242 | lr 1.20e-04 | (3795.84 ms | 138122 tok/s) step 21855/76294 | train loss 3.522856 | norm 3.8403 | lr 1.20e-04 | (3806.75 ms | 137726 tok/s) step 21856/76294 | train loss 3.536227 | norm 8.3932 | lr 1.20e-04 | (3800.07 ms | 137968 tok/s) step 21857/76294 | train loss 3.481325 | norm 3.2781 | lr 1.20e-04 | (3810.65 ms | 137585 tok/s) step 21858/76294 | train loss 3.478859 | norm 3.0487 | lr 1.20e-04 | (3817.05 ms | 137354 tok/s) step 21859/76294 | train loss 3.416734 | norm 3.0444 | lr 1.20e-04 | (3803.70 ms | 137836 tok/s) step 21860/76294 | train loss 3.543307 | norm 3.2289 | lr 1.20e-04 | (3803.57 ms | 137841 tok/s) step 21861/76294 | train loss 3.507816 | norm 2.4860 | lr 1.20e-04 | (3798.98 ms | 138008 tok/s) step 21862/76294 | train loss 3.498946 | norm 2.6969 | lr 1.20e-04 | (3811.67 ms | 137548 tok/s) step 21863/76294 | train loss 3.488097 | norm 4.2426 | lr 1.20e-04 | (3809.67 ms | 137620 tok/s) step 21864/76294 | train loss 3.473326 | norm 4.9156 | lr 1.20e-04 | (3839.77 ms | 136541 tok/s) step 21865/76294 | train loss 3.420049 | norm 2.5116 | lr 1.20e-04 | (3805.76 ms | 137762 tok/s) step 21866/76294 | train loss 3.444489 | norm 6.4833 | lr 1.20e-04 | (3811.27 ms | 137562 tok/s) step 21867/76294 | train loss 3.466397 | norm 3.8443 | lr 1.20e-04 | (3826.84 ms | 137003 tok/s) step 21868/76294 | train loss 3.451404 | norm 3.8314 | lr 1.20e-04 | (3819.32 
ms | 137273 tok/s) step 21869/76294 | train loss 3.451282 | norm 4.3017 | lr 1.20e-04 | (3817.35 ms | 137343 tok/s) step 21870/76294 | train loss 3.439856 | norm 2.5908 | lr 1.20e-04 | (3819.15 ms | 137279 tok/s) step 21871/76294 | train loss 3.443290 | norm 2.1453 | lr 1.20e-04 | (3804.49 ms | 137808 tok/s) step 21872/76294 | train loss 3.354894 | norm 4.4478 | lr 1.20e-04 | (3833.61 ms | 136761 tok/s) step 21873/76294 | train loss 3.489148 | norm 3.0717 | lr 1.20e-04 | (3803.64 ms | 137839 tok/s) step 21874/76294 | train loss 3.537841 | norm 4.8628 | lr 1.20e-04 | (3833.08 ms | 136780 tok/s) step 21875/76294 | train loss 3.441307 | norm 2.3076 | lr 1.20e-04 | (3846.11 ms | 136316 tok/s) step 21876/76294 | train loss 3.469753 | norm 2.4795 | lr 1.20e-04 | (3804.87 ms | 137794 tok/s) step 21877/76294 | train loss 3.491916 | norm 3.0556 | lr 1.20e-04 | (3806.30 ms | 137742 tok/s) step 21878/76294 | train loss 3.474038 | norm 3.6077 | lr 1.20e-04 | (3832.49 ms | 136801 tok/s) step 21879/76294 | train loss 3.445885 | norm 5.1585 | lr 1.20e-04 | (3803.98 ms | 137826 tok/s) step 21880/76294 | train loss 3.437176 | norm 2.2741 | lr 1.20e-04 | (3809.30 ms | 137634 tok/s) step 21881/76294 | train loss 3.478908 | norm 2.4457 | lr 1.20e-04 | (3823.47 ms | 137124 tok/s) step 21882/76294 | train loss 3.461070 | norm 3.9237 | lr 1.20e-04 | (3800.71 ms | 137945 tok/s) step 21883/76294 | train loss 3.466923 | norm 4.3224 | lr 1.20e-04 | (3805.23 ms | 137781 tok/s) step 21884/76294 | train loss 3.416301 | norm 4.6650 | lr 1.20e-04 | (3805.54 ms | 137770 tok/s) step 21885/76294 | train loss 3.481656 | norm 2.3984 | lr 1.20e-04 | (3808.41 ms | 137666 tok/s) step 21886/76294 | train loss 3.528542 | norm 3.2339 | lr 1.20e-04 | (3801.73 ms | 137908 tok/s) step 21887/76294 | train loss 3.521923 | norm 5.0785 | lr 1.20e-04 | (3835.99 ms | 136676 tok/s) step 21888/76294 | train loss 3.628261 | norm 4.7713 | lr 1.20e-04 | (3823.88 ms | 137109 tok/s) step 21889/76294 | train loss 3.405705 | 
norm 3.0395 | lr 1.20e-04 | (3803.25 ms | 137853 tok/s) step 21890/76294 | train loss 3.474845 | norm 6.4045 | lr 1.20e-04 | (3819.31 ms | 137273 tok/s) step 21891/76294 | train loss 3.419242 | norm 4.1154 | lr 1.20e-04 | (3856.01 ms | 135967 tok/s) step 21892/76294 | train loss 3.423413 | norm 2.7322 | lr 1.20e-04 | (3798.31 ms | 138032 tok/s) step 21893/76294 | train loss 3.456761 | norm 4.8511 | lr 1.20e-04 | (3827.59 ms | 136976 tok/s) step 21894/76294 | train loss 3.509693 | norm 5.1046 | lr 1.20e-04 | (3799.36 ms | 137994 tok/s) step 21895/76294 | train loss 3.500346 | norm 3.7189 | lr 1.20e-04 | (3806.74 ms | 137726 tok/s) step 21896/76294 | train loss 3.464314 | norm 4.1670 | lr 1.20e-04 | (3819.97 ms | 137249 tok/s) step 21897/76294 | train loss 3.667209 | norm 5.1243 | lr 1.20e-04 | (3797.62 ms | 138057 tok/s) step 21898/76294 | train loss 3.463607 | norm 3.5471 | lr 1.20e-04 | (3818.34 ms | 137308 tok/s) step 21899/76294 | train loss 3.479914 | norm 3.2183 | lr 1.20e-04 | (3798.28 ms | 138033 tok/s) step 21900/76294 | train loss 3.467647 | norm 2.6171 | lr 1.20e-04 | (3800.93 ms | 137937 tok/s) step 21901/76294 | train loss 3.481632 | norm 3.0111 | lr 1.20e-04 | (3823.70 ms | 137115 tok/s) step 21902/76294 | train loss 3.451345 | norm 3.8783 | lr 1.20e-04 | (3800.75 ms | 137943 tok/s) step 21903/76294 | train loss 3.494226 | norm 3.4100 | lr 1.20e-04 | (3799.47 ms | 137990 tok/s) step 21904/76294 | train loss 3.511228 | norm 3.3382 | lr 1.20e-04 | (4555.19 ms | 115097 tok/s) step 21905/76294 | train loss 3.505766 | norm 5.4577 | lr 1.20e-04 | (3791.54 ms | 138278 tok/s) step 21906/76294 | train loss 3.468329 | norm 4.9733 | lr 1.20e-04 | (3799.80 ms | 137978 tok/s) step 21907/76294 | train loss 3.498602 | norm 4.6761 | lr 1.20e-04 | (3821.01 ms | 137212 tok/s) step 21908/76294 | train loss 3.495407 | norm 4.6216 | lr 1.20e-04 | (3805.38 ms | 137776 tok/s) step 21909/76294 | train loss 3.455015 | norm 2.9036 | lr 1.20e-04 | (3804.29 ms | 137815 tok/s) 
step 21910/76294 | train loss 3.476131 | norm 4.3385 | lr 1.20e-04 | (3810.82 ms | 137579 tok/s) step 21911/76294 | train loss 3.432647 | norm 10.8526 | lr 1.20e-04 | (3799.97 ms | 137972 tok/s) step 21912/76294 | train loss 3.484579 | norm 4.7979 | lr 1.20e-04 | (3829.09 ms | 136922 tok/s) step 21913/76294 | train loss 3.511385 | norm 3.9839 | lr 1.20e-04 | (3798.88 ms | 138011 tok/s) step 21914/76294 | train loss 3.482045 | norm 3.9004 | lr 1.20e-04 | (3809.68 ms | 137620 tok/s) step 21915/76294 | train loss 3.410954 | norm 10.4775 | lr 1.20e-04 | (3818.10 ms | 137316 tok/s) step 21916/76294 | train loss 3.513862 | norm 8.1581 | lr 1.20e-04 | (3829.51 ms | 136907 tok/s) step 21917/76294 | train loss 3.418220 | norm 4.9012 | lr 1.20e-04 | (3874.18 ms | 135329 tok/s) step 21918/76294 | train loss 3.403901 | norm 6.0836 | lr 1.20e-04 | (3796.58 ms | 138095 tok/s) step 21919/76294 | train loss 3.414545 | norm 2.6236 | lr 1.20e-04 | (3809.99 ms | 137609 tok/s) step 21920/76294 | train loss 3.427985 | norm 5.5778 | lr 1.20e-04 | (3817.54 ms | 137336 tok/s) step 21921/76294 | train loss 3.661130 | norm 5.6751 | lr 1.20e-04 | (3803.21 ms | 137854 tok/s) step 21922/76294 | train loss 3.459873 | norm 4.4496 | lr 1.20e-04 | (3809.58 ms | 137623 tok/s) step 21923/76294 | train loss 3.479284 | norm 4.0413 | lr 1.20e-04 | (3798.67 ms | 138019 tok/s) step 21924/76294 | train loss 3.491118 | norm 8.0777 | lr 1.20e-04 | (3799.08 ms | 138004 tok/s) step 21925/76294 | train loss 3.526190 | norm 9.1331 | lr 1.20e-04 | (4853.25 ms | 108028 tok/s) step 21926/76294 | train loss 3.503632 | norm 11.0602 | lr 1.20e-04 | (3881.09 ms | 135088 tok/s) step 21927/76294 | train loss 3.590487 | norm 10.9210 | lr 1.20e-04 | (3833.39 ms | 136769 tok/s) step 21928/76294 | train loss 3.544693 | norm 20.9936 | lr 1.20e-04 | (3796.93 ms | 138082 tok/s) step 21929/76294 | train loss 3.669052 | norm 21.5652 | lr 1.20e-04 | (3820.42 ms | 137233 tok/s) step 21930/76294 | train loss 3.544955 | norm 24.7386 
| lr 1.20e-04 | (3797.05 ms | 138078 tok/s) step 21931/76294 | train loss 3.567498 | norm 17.3674 | lr 1.20e-04 | (3843.16 ms | 136421 tok/s) step 21932/76294 | train loss 3.550715 | norm 15.5766 | lr 1.20e-04 | (3794.72 ms | 138162 tok/s) step 21933/76294 | train loss 3.755394 | norm 8.0514 | lr 1.20e-04 | (3796.67 ms | 138092 tok/s) step 21934/76294 | train loss 3.610732 | norm 7.7510 | lr 1.20e-04 | (3817.02 ms | 137355 tok/s) step 21935/76294 | train loss 3.541090 | norm 8.9525 | lr 1.20e-04 | (3802.04 ms | 137896 tok/s) step 21936/76294 | train loss 3.545004 | norm 11.4708 | lr 1.20e-04 | (3806.74 ms | 137726 tok/s) step 21937/76294 | train loss 3.530242 | norm 12.5647 | lr 1.20e-04 | (3800.08 ms | 137968 tok/s) step 21938/76294 | train loss 3.495435 | norm 10.8542 | lr 1.20e-04 | (3808.43 ms | 137665 tok/s) step 21939/76294 | train loss 3.580698 | norm 5.1642 | lr 1.20e-04 | (3799.46 ms | 137990 tok/s) step 21940/76294 | train loss 3.510341 | norm 8.4125 | lr 1.20e-04 | (3801.83 ms | 137904 tok/s) step 21941/76294 | train loss 3.420225 | norm 6.9466 | lr 1.20e-04 | (3831.33 ms | 136842 tok/s) step 21942/76294 | train loss 3.482429 | norm 4.8793 | lr 1.20e-04 | (3887.15 ms | 134877 tok/s) step 21943/76294 | train loss 3.389899 | norm 8.2877 | lr 1.20e-04 | (3804.54 ms | 137806 tok/s) step 21944/76294 | train loss 3.512039 | norm 4.0032 | lr 1.20e-04 | (3870.34 ms | 135463 tok/s) step 21945/76294 | train loss 3.484106 | norm 5.8712 | lr 1.20e-04 | (3825.80 ms | 137040 tok/s) step 21946/76294 | train loss 3.508881 | norm 5.4667 | lr 1.20e-04 | (3809.35 ms | 137632 tok/s) step 21947/76294 | train loss 3.486932 | norm 2.5258 | lr 1.20e-04 | (3854.42 ms | 136023 tok/s) step 21948/76294 | train loss 3.425890 | norm 4.8723 | lr 1.20e-04 | (3801.88 ms | 137902 tok/s) step 21949/76294 | train loss 3.444001 | norm 8.2204 | lr 1.20e-04 | (3805.23 ms | 137781 tok/s) step 21950/76294 | train loss 3.546875 | norm 9.4145 | lr 1.20e-04 | (3828.38 ms | 136948 tok/s) step 
21951/76294 | train loss 3.460558 | norm 6.3419 | lr 1.20e-04 | (3806.07 ms | 137751 tok/s) step 21952/76294 | train loss 3.472574 | norm 5.9056 | lr 1.20e-04 | (3830.64 ms | 136867 tok/s) step 21953/76294 | train loss 3.499789 | norm 5.0256 | lr 1.20e-04 | (3808.32 ms | 137669 tok/s) step 21954/76294 | train loss 3.480387 | norm 4.8300 | lr 1.20e-04 | (3810.40 ms | 137594 tok/s) step 21955/76294 | train loss 3.469339 | norm 8.4764 | lr 1.20e-04 | (3808.87 ms | 137649 tok/s) step 21956/76294 | train loss 3.473640 | norm 3.4300 | lr 1.20e-04 | (3814.50 ms | 137446 tok/s) step 21957/76294 | train loss 3.473833 | norm 3.7832 | lr 1.20e-04 | (3808.82 ms | 137651 tok/s) step 21958/76294 | train loss 3.443428 | norm 4.1775 | lr 1.20e-04 | (3803.86 ms | 137831 tok/s) step 21959/76294 | train loss 3.445549 | norm 2.3903 | lr 1.20e-04 | (3838.50 ms | 136587 tok/s) step 21960/76294 | train loss 3.520611 | norm 3.4318 | lr 1.20e-04 | (3803.73 ms | 137835 tok/s) step 21961/76294 | train loss 3.422186 | norm 5.3471 | lr 1.20e-04 | (3837.12 ms | 136636 tok/s) step 21962/76294 | train loss 3.423390 | norm 3.2027 | lr 1.20e-04 | (3828.80 ms | 136933 tok/s) step 21963/76294 | train loss 3.416060 | norm 2.0805 | lr 1.20e-04 | (3807.49 ms | 137699 tok/s) step 21964/76294 | train loss 3.457945 | norm 2.0352 | lr 1.20e-04 | (3831.43 ms | 136839 tok/s) step 21965/76294 | train loss 3.425107 | norm 2.9367 | lr 1.20e-04 | (3811.06 ms | 137570 tok/s) step 21966/76294 | train loss 3.467199 | norm 3.2402 | lr 1.20e-04 | (3802.90 ms | 137865 tok/s) step 21967/76294 | train loss 3.447464 | norm 3.5481 | lr 1.20e-04 | (3836.09 ms | 136673 tok/s) step 21968/76294 | train loss 3.663352 | norm 3.0349 | lr 1.20e-04 | (3875.95 ms | 135267 tok/s) step 21969/76294 | train loss 3.408495 | norm 3.8399 | lr 1.20e-04 | (3796.12 ms | 138112 tok/s) step 21970/76294 | train loss 3.521432 | norm 3.9676 | lr 1.20e-04 | (3809.46 ms | 137628 tok/s) step 21971/76294 | train loss 3.483944 | norm 2.8567 | lr 
1.20e-04 | (3804.26 ms | 137816 tok/s) step 21972/76294 | train loss 3.468028 | norm 3.1815 | lr 1.20e-04 | (3803.41 ms | 137847 tok/s) step 21973/76294 | train loss 3.414757 | norm 3.5817 | lr 1.20e-04 | (3817.15 ms | 137351 tok/s) step 21974/76294 | train loss 3.387443 | norm 2.9550 | lr 1.20e-04 | (3801.20 ms | 137927 tok/s) step 21975/76294 | train loss 3.514004 | norm 3.2722 | lr 1.20e-04 | (3827.38 ms | 136983 tok/s) step 21976/76294 | train loss 3.430346 | norm 3.6825 | lr 1.20e-04 | (3794.28 ms | 138179 tok/s) step 21977/76294 | train loss 3.542963 | norm 4.7431 | lr 1.20e-04 | (3817.35 ms | 137344 tok/s) step 21978/76294 | train loss 3.483424 | norm 3.0055 | lr 1.20e-04 | (3801.16 ms | 137928 tok/s) step 21979/76294 | train loss 3.512700 | norm 6.0775 | lr 1.20e-04 | (3798.69 ms | 138018 tok/s) step 21980/76294 | train loss 3.437843 | norm 3.0661 | lr 1.20e-04 | (5600.31 ms | 93618 tok/s) step 21981/76294 | train loss 3.515579 | norm 4.7652 | lr 1.20e-04 | (3792.24 ms | 138253 tok/s) step 21982/76294 | train loss 3.511200 | norm 2.9272 | lr 1.20e-04 | (3803.43 ms | 137846 tok/s) step 21983/76294 | train loss 3.497354 | norm 2.5955 | lr 1.20e-04 | (3797.66 ms | 138056 tok/s) step 21984/76294 | train loss 3.452906 | norm 12.6084 | lr 1.20e-04 | (3799.69 ms | 137982 tok/s) step 21985/76294 | train loss 3.543892 | norm 4.4701 | lr 1.20e-04 | (3797.98 ms | 138044 tok/s) step 21986/76294 | train loss 3.461740 | norm 3.4936 | lr 1.20e-04 | (3810.98 ms | 137573 tok/s) step 21987/76294 | train loss 3.368020 | norm 11.2036 | lr 1.20e-04 | (3798.18 ms | 138037 tok/s) step 21988/76294 | train loss 3.498771 | norm 3.7130 | lr 1.20e-04 | (3796.67 ms | 138092 tok/s) step 21989/76294 | train loss 3.539815 | norm 2.3069 | lr 1.20e-04 | (3823.26 ms | 137131 tok/s) step 21990/76294 | train loss 3.419480 | norm 7.0719 | lr 1.20e-04 | (3799.30 ms | 137996 tok/s) step 21991/76294 | train loss 3.455358 | norm 4.6831 | lr 1.20e-04 | (3803.69 ms | 137837 tok/s) step 21992/76294 | 
train loss 3.468534 | norm 3.2588 | lr 1.20e-04 | (3821.49 ms | 137195 tok/s) step 21993/76294 | train loss 3.467531 | norm 2.8995 | lr 1.20e-04 | (3877.04 ms | 135229 tok/s) step 21994/76294 | train loss 3.485028 | norm 4.8439 | lr 1.20e-04 | (3797.61 ms | 138057 tok/s) step 21995/76294 | train loss 3.391870 | norm 4.2063 | lr 1.20e-04 | (3802.93 ms | 137864 tok/s) step 21996/76294 | train loss 3.421327 | norm 2.8856 | lr 1.20e-04 | (3798.75 ms | 138016 tok/s) step 21997/76294 | train loss 3.473442 | norm 7.0153 | lr 1.20e-04 | (3819.02 ms | 137283 tok/s) step 21998/76294 | train loss 3.546798 | norm 4.2913 | lr 1.20e-04 | (3797.19 ms | 138073 tok/s) step 21999/76294 | train loss 3.442062 | norm 4.1900 | lr 1.20e-04 | (3815.08 ms | 137425 tok/s) step 22000/76294 | train loss 3.495863 | norm 3.9567 | lr 1.20e-04 | (3796.83 ms | 138086 tok/s) val loss: 3.440616 saving model checkpoint to ./results/gpt2-124M-gqa/step_22000.pth step 22001/76294 | train loss 3.455377 | norm 3.5141 | lr 1.20e-04 | (3829.33 ms | 136914 tok/s) step 22002/76294 | train loss 3.440465 | norm 2.6394 | lr 1.20e-04 | (3793.43 ms | 138209 tok/s) step 22003/76294 | train loss 3.502975 | norm 32.4675 | lr 1.20e-04 | (3797.89 ms | 138047 tok/s) step 22004/76294 | train loss 3.442151 | norm 5.3276 | lr 1.20e-04 | (3836.96 ms | 136642 tok/s) step 22005/76294 | train loss 3.425676 | norm 2.9542 | lr 1.20e-04 | (3795.56 ms | 138132 tok/s) step 22006/76294 | train loss 3.454032 | norm 3.8431 | lr 1.20e-04 | (3803.23 ms | 137853 tok/s) step 22007/76294 | train loss 3.485834 | norm 2.6421 | lr 1.20e-04 | (3819.83 ms | 137254 tok/s) step 22008/76294 | train loss 3.418846 | norm 3.5813 | lr 1.20e-04 | (3803.01 ms | 137861 tok/s) step 22009/76294 | train loss 3.457232 | norm 3.4932 | lr 1.20e-04 | (3842.41 ms | 136448 tok/s) step 22010/76294 | train loss 3.462207 | norm 2.3815 | lr 1.20e-04 | (3800.88 ms | 137938 tok/s) step 22011/76294 | train loss 3.432299 | norm 2.9704 | lr 1.20e-04 | (3836.93 ms | 136643 
tok/s) step 22012/76294 | train loss 3.523705 | norm 3.0384 | lr 1.20e-04 | (3805.87 ms | 137758 tok/s) step 22013/76294 | train loss 3.476500 | norm 5.6631 | lr 1.20e-04 | (3807.29 ms | 137706 tok/s) step 22014/76294 | train loss 3.455180 | norm 3.2223 | lr 1.20e-04 | (3824.10 ms | 137101 tok/s) step 22015/76294 | train loss 3.464708 | norm 3.7543 | lr 1.20e-04 | (3802.31 ms | 137887 tok/s) step 22016/76294 | train loss 3.453487 | norm 4.1259 | lr 1.20e-04 | (3803.10 ms | 137858 tok/s) step 22017/76294 | train loss 3.422153 | norm 5.2172 | lr 1.20e-04 | (3796.51 ms | 138097 tok/s) step 22018/76294 | train loss 3.474485 | norm 6.4622 | lr 1.20e-04 | (3929.56 ms | 133422 tok/s) step 22019/76294 | train loss 3.469040 | norm 6.4758 | lr 1.20e-04 | (3796.94 ms | 138082 tok/s) step 22020/76294 | train loss 3.432327 | norm 3.4643 | lr 1.20e-04 | (3822.03 ms | 137175 tok/s) step 22021/76294 | train loss 3.418212 | norm 7.3365 | lr 1.20e-04 | (3805.77 ms | 137761 tok/s) step 22022/76294 | train loss 3.405580 | norm 3.1876 | lr 1.20e-04 | (3807.57 ms | 137696 tok/s) step 22023/76294 | train loss 3.459979 | norm 3.6405 | lr 1.20e-04 | (3820.82 ms | 137219 tok/s) step 22024/76294 | train loss 3.457180 | norm 3.3605 | lr 1.20e-04 | (3802.50 ms | 137880 tok/s) step 22025/76294 | train loss 3.466772 | norm 5.3547 | lr 1.20e-04 | (3829.28 ms | 136916 tok/s) step 22026/76294 | train loss 3.435015 | norm 3.8065 | lr 1.20e-04 | (3801.55 ms | 137914 tok/s) step 22027/76294 | train loss 3.373112 | norm 5.9698 | lr 1.20e-04 | (3800.55 ms | 137951 tok/s) step 22028/76294 | train loss 3.515233 | norm 4.1124 | lr 1.20e-04 | (3831.59 ms | 136833 tok/s) step 22029/76294 | train loss 3.362582 | norm 5.9250 | lr 1.20e-04 | (3800.65 ms | 137947 tok/s) step 22030/76294 | train loss 3.361337 | norm 4.2703 | lr 1.20e-04 | (3806.59 ms | 137732 tok/s) step 22031/76294 | train loss 3.507379 | norm 4.2934 | lr 1.20e-04 | (3821.70 ms | 137187 tok/s) step 22032/76294 | train loss 3.415803 | norm 4.2617 
| lr 1.20e-04 | (3800.65 ms | 137947 tok/s) step 22033/76294 | train loss 3.470828 | norm 5.7492 | lr 1.20e-04 | (3805.84 ms | 137759 tok/s) step 22034/76294 | train loss 3.388382 | norm 55.5872 | lr 1.20e-04 | (3801.85 ms | 137904 tok/s) step 22035/76294 | train loss 3.442876 | norm 2.8658 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 22036/76294 | train loss 3.417610 | norm 3.6832 | lr 1.20e-04 | (3805.23 ms | 137781 tok/s) step 22037/76294 | train loss 3.427391 | norm 4.4438 | lr 1.20e-04 | (3803.82 ms | 137832 tok/s) step 22038/76294 | train loss 3.411406 | norm 5.3038 | lr 1.20e-04 | (3800.59 ms | 137949 tok/s) step 22039/76294 | train loss 3.492747 | norm 3.2815 | lr 1.20e-04 | (3805.10 ms | 137786 tok/s) step 22040/76294 | train loss 3.379162 | norm 3.1031 | lr 1.20e-04 | (4073.26 ms | 128715 tok/s) step 22041/76294 | train loss 3.500388 | norm 2.9886 | lr 1.20e-04 | (3797.79 ms | 138051 tok/s) step 22042/76294 | train loss 3.467903 | norm 16.3078 | lr 1.20e-04 | (3822.60 ms | 137155 tok/s) step 22043/76294 | train loss 3.605150 | norm 5.8564 | lr 1.20e-04 | (3877.94 ms | 135198 tok/s) step 22044/76294 | train loss 3.477565 | norm 3.7194 | lr 1.20e-04 | (3873.87 ms | 135339 tok/s) step 22045/76294 | train loss 3.459251 | norm 5.7307 | lr 1.20e-04 | (3795.46 ms | 138135 tok/s) step 22046/76294 | train loss 3.485609 | norm 2.9715 | lr 1.20e-04 | (3805.47 ms | 137772 tok/s) step 22047/76294 | train loss 3.431629 | norm 4.1729 | lr 1.20e-04 | (3799.26 ms | 137997 tok/s) step 22048/76294 | train loss 3.407584 | norm 9.6193 | lr 1.20e-04 | (3818.85 ms | 137290 tok/s) step 22049/76294 | train loss 3.487752 | norm 3.6934 | lr 1.20e-04 | (3797.37 ms | 138066 tok/s) step 22050/76294 | train loss 3.435380 | norm 4.6947 | lr 1.20e-04 | (3802.83 ms | 137868 tok/s) step 22051/76294 | train loss 3.507793 | norm 5.4168 | lr 1.20e-04 | (3796.61 ms | 138094 tok/s) step 22052/76294 | train loss 3.449850 | norm 3.4755 | lr 1.20e-04 | (3806.55 ms | 137733 tok/s) step 
22053/76294 | train loss 3.411398 | norm 7.8279 | lr 1.20e-04 | (3799.41 ms | 137992 tok/s) step 22054/76294 | train loss 3.513653 | norm 7.7877 | lr 1.20e-04 | (3817.78 ms | 137328 tok/s) step 22055/76294 | train loss 3.469076 | norm 5.1136 | lr 1.20e-04 | (3798.36 ms | 138030 tok/s) step 22056/76294 | train loss 3.520925 | norm 2.5134 | lr 1.20e-04 | (3802.45 ms | 137881 tok/s) step 22057/76294 | train loss 3.371938 | norm 3.3773 | lr 1.20e-04 | (3822.39 ms | 137163 tok/s) step 22058/76294 | train loss 3.441483 | norm 5.4556 | lr 1.20e-04 | (3803.37 ms | 137848 tok/s) step 22059/76294 | train loss 3.432399 | norm 4.6331 | lr 1.20e-04 | (3805.45 ms | 137773 tok/s) step 22060/76294 | train loss 3.468895 | norm 4.0292 | lr 1.20e-04 | (3804.31 ms | 137814 tok/s) step 22061/76294 | train loss 3.437941 | norm 6.0179 | lr 1.20e-04 | (3806.13 ms | 137748 tok/s) step 22062/76294 | train loss 3.530546 | norm 2.4157 | lr 1.20e-04 | (3804.36 ms | 137813 tok/s) step 22063/76294 | train loss 3.460122 | norm 8.2162 | lr 1.20e-04 | (3799.95 ms | 137972 tok/s) step 22064/76294 | train loss 3.434325 | norm 3.5043 | lr 1.20e-04 | (3829.09 ms | 136922 tok/s) step 22065/76294 | train loss 3.440277 | norm 3.3185 | lr 1.20e-04 | (3799.11 ms | 138003 tok/s) step 22066/76294 | train loss 3.535928 | norm 3.4359 | lr 1.20e-04 | (3802.36 ms | 137885 tok/s) step 22067/76294 | train loss 3.498060 | norm 3.1329 | lr 1.20e-04 | (3816.89 ms | 137360 tok/s) step 22068/76294 | train loss 3.475609 | norm 4.0709 | lr 1.20e-04 | (3801.31 ms | 137923 tok/s) step 22069/76294 | train loss 3.389608 | norm 2.8744 | lr 1.20e-04 | (3800.04 ms | 137969 tok/s) step 22070/76294 | train loss 3.434385 | norm 2.5306 | lr 1.20e-04 | (3824.51 ms | 137086 tok/s) step 22071/76294 | train loss 3.460212 | norm 5.0647 | lr 1.20e-04 | (3803.54 ms | 137842 tok/s) step 22072/76294 | train loss 3.438503 | norm 4.3672 | lr 1.20e-04 | (3806.87 ms | 137722 tok/s) step 22073/76294 | train loss 3.451604 | norm 9.8177 | lr 
1.20e-04 | (3798.67 ms | 138019 tok/s) step 22074/76294 | train loss 3.468963 | norm 4.9464 | lr 1.20e-04 | (3957.86 ms | 132467 tok/s) step 22075/76294 | train loss 3.465668 | norm 6.0310 | lr 1.20e-04 | (3797.37 ms | 138066 tok/s) step 22076/76294 | train loss 3.466697 | norm 5.7634 | lr 1.20e-04 | (3802.32 ms | 137886 tok/s) step 22077/76294 | train loss 3.490250 | norm 5.4810 | lr 1.20e-04 | (3830.37 ms | 136876 tok/s) step 22078/76294 | train loss 3.459082 | norm 10.2938 | lr 1.20e-04 | (3799.89 ms | 137975 tok/s) step 22079/76294 | train loss 3.447480 | norm 6.0501 | lr 1.20e-04 | (3811.13 ms | 137568 tok/s) step 22080/76294 | train loss 3.416344 | norm 6.2512 | lr 1.20e-04 | (3800.54 ms | 137951 tok/s) step 22081/76294 | train loss 3.469197 | norm 3.6156 | lr 1.20e-04 | (3807.05 ms | 137715 tok/s) step 22082/76294 | train loss 3.460091 | norm 3.8631 | lr 1.20e-04 | (3799.14 ms | 138002 tok/s) step 22083/76294 | train loss 3.437555 | norm 2.7542 | lr 1.20e-04 | (3808.29 ms | 137670 tok/s) step 22084/76294 | train loss 3.530922 | norm 5.5016 | lr 1.20e-04 | (3796.48 ms | 138098 tok/s) step 22085/76294 | train loss 3.417075 | norm 2.7260 | lr 1.20e-04 | (3801.09 ms | 137931 tok/s) step 22086/76294 | train loss 3.486447 | norm 13.0222 | lr 1.20e-04 | (3799.83 ms | 137977 tok/s) step 22087/76294 | train loss 3.535321 | norm 3.9406 | lr 1.20e-04 | (3824.01 ms | 137104 tok/s) step 22088/76294 | train loss 3.478302 | norm 3.4134 | lr 1.20e-04 | (3812.89 ms | 137504 tok/s) step 22089/76294 | train loss 3.491851 | norm 4.6413 | lr 1.20e-04 | (3804.85 ms | 137795 tok/s) step 22090/76294 | train loss 3.464561 | norm 3.4251 | lr 1.20e-04 | (3799.82 ms | 137977 tok/s) step 22091/76294 | train loss 3.477687 | norm 4.9400 | lr 1.20e-04 | (3803.59 ms | 137840 tok/s) step 22092/76294 | train loss 3.419108 | norm 2.2595 | lr 1.20e-04 | (3805.21 ms | 137782 tok/s) step 22093/76294 | train loss 3.455023 | norm 2.2703 | lr 1.20e-04 | (3806.72 ms | 137727 tok/s) step 22094/76294 | 
train loss 3.498333 | norm 1.8316 | lr 1.20e-04 | (3802.95 ms | 137863 tok/s) step 22095/76294 | train loss 3.437468 | norm 4.4561 | lr 1.20e-04 | (3950.08 ms | 132728 tok/s) step 22096/76294 | train loss 3.456788 | norm 3.5738 | lr 1.20e-04 | (3794.48 ms | 138171 tok/s) step 22097/76294 | train loss 3.426975 | norm 3.4579 | lr 1.20e-04 | (3814.01 ms | 137464 tok/s) step 22098/76294 | train loss 3.431632 | norm 3.7981 | lr 1.20e-04 | (3801.08 ms | 137931 tok/s) step 22099/76294 | train loss 3.478870 | norm 4.2036 | lr 1.20e-04 | (3799.94 ms | 137973 tok/s) step 22100/76294 | train loss 3.505783 | norm 2.5079 | lr 1.20e-04 | (3814.12 ms | 137460 tok/s) step 22101/76294 | train loss 3.465147 | norm 7.7580 | lr 1.20e-04 | (3802.78 ms | 137870 tok/s) step 22102/76294 | train loss 3.478341 | norm 2.6383 | lr 1.20e-04 | (3797.09 ms | 138076 tok/s) step 22103/76294 | train loss 3.438493 | norm 3.0297 | lr 1.20e-04 | (3822.89 ms | 137144 tok/s) step 22104/76294 | train loss 3.501133 | norm 3.7586 | lr 1.20e-04 | (3796.34 ms | 138104 tok/s) step 22105/76294 | train loss 3.429509 | norm 6.7031 | lr 1.20e-04 | (3802.32 ms | 137886 tok/s) step 22106/76294 | train loss 3.374002 | norm 3.7550 | lr 1.20e-04 | (3824.75 ms | 137078 tok/s) step 22107/76294 | train loss 3.490993 | norm 3.5452 | lr 1.20e-04 | (3806.23 ms | 137745 tok/s) step 22108/76294 | train loss 3.417910 | norm 2.6522 | lr 1.20e-04 | (3822.05 ms | 137175 tok/s) step 22109/76294 | train loss 3.459693 | norm 2.6620 | lr 1.20e-04 | (3816.09 ms | 137389 tok/s) step 22110/76294 | train loss 3.486708 | norm 4.3006 | lr 1.20e-04 | (3799.08 ms | 138004 tok/s) step 22111/76294 | train loss 3.515110 | norm 4.1973 | lr 1.20e-04 | (3827.34 ms | 136985 tok/s) step 22112/76294 | train loss 3.432320 | norm 8.8595 | lr 1.20e-04 | (3799.65 ms | 137983 tok/s) step 22113/76294 | train loss 3.494811 | norm 5.5314 | lr 1.20e-04 | (3808.05 ms | 137679 tok/s) step 22114/76294 | train loss 3.511834 | norm 3.2268 | lr 1.20e-04 | (3798.07 
ms | 138041 tok/s) step 22115/76294 | train loss 3.437714 | norm 12.3535 | lr 1.20e-04 | (3800.63 ms | 137948 tok/s) step 22116/76294 | train loss 3.455606 | norm 4.7985 | lr 1.20e-04 | (3797.37 ms | 138066 tok/s) step 22117/76294 | train loss 3.540023 | norm 3.1829 | lr 1.20e-04 | (3830.89 ms | 136858 tok/s) step 22118/76294 | train loss 3.456058 | norm 4.9864 | lr 1.20e-04 | (3801.81 ms | 137905 tok/s) step 22119/76294 | train loss 3.497192 | norm 4.9354 | lr 1.20e-04 | (3798.22 ms | 138035 tok/s) step 22120/76294 | train loss 3.480153 | norm 4.0831 | lr 1.20e-04 | (3821.52 ms | 137194 tok/s) step 22121/76294 | train loss 3.429953 | norm 3.3897 | lr 1.20e-04 | (3852.15 ms | 136103 tok/s) step 22122/76294 | train loss 3.469964 | norm 4.0263 | lr 1.20e-04 | (3802.38 ms | 137884 tok/s) step 22123/76294 | train loss 3.447362 | norm 5.8890 | lr 1.20e-04 | (3917.44 ms | 133834 tok/s) step 22124/76294 | train loss 3.426070 | norm 5.1658 | lr 1.20e-04 | (3826.30 ms | 137022 tok/s) step 22125/76294 | train loss 3.526841 | norm 10.2479 | lr 1.20e-04 | (3836.32 ms | 136664 tok/s) step 22126/76294 | train loss 3.575824 | norm 4.7318 | lr 1.20e-04 | (3802.47 ms | 137881 tok/s) step 22127/76294 | train loss 3.416830 | norm 6.4133 | lr 1.20e-04 | (3797.63 ms | 138057 tok/s) step 22128/76294 | train loss 3.553385 | norm 3.5009 | lr 1.20e-04 | (3827.72 ms | 136971 tok/s) step 22129/76294 | train loss 3.498748 | norm 3.7127 | lr 1.20e-04 | (3800.91 ms | 137938 tok/s) step 22130/76294 | train loss 3.588927 | norm 6.9481 | lr 1.20e-04 | (3820.40 ms | 137234 tok/s) step 22131/76294 | train loss 3.571391 | norm 3.7343 | lr 1.20e-04 | (3825.38 ms | 137055 tok/s) step 22132/76294 | train loss 3.481178 | norm 4.7304 | lr 1.20e-04 | (3801.52 ms | 137916 tok/s) step 22133/76294 | train loss 3.441055 | norm 8.3459 | lr 1.20e-04 | (3820.39 ms | 137234 tok/s) step 22134/76294 | train loss 3.490863 | norm 4.1641 | lr 1.20e-04 | (3807.05 ms | 137715 tok/s) step 22135/76294 | train loss 3.482548 
| norm 7.0058 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 22136/76294 | train loss 3.475791 | norm 5.0144 | lr 1.20e-04 | (3799.58 ms | 137986 tok/s) step 22137/76294 | train loss 3.513031 | norm 3.7234 | lr 1.20e-04 | (6330.76 ms | 82816 tok/s) step 22138/76294 | train loss 3.601010 | norm 5.7095 | lr 1.20e-04 | (3844.97 ms | 136357 tok/s) step 22139/76294 | train loss 3.392476 | norm 3.8205 | lr 1.20e-04 | (3835.33 ms | 136699 tok/s) step 22140/76294 | train loss 3.507718 | norm 2.6823 | lr 1.20e-04 | (3789.62 ms | 138348 tok/s) step 22141/76294 | train loss 3.477662 | norm 4.2787 | lr 1.20e-04 | (3793.79 ms | 138196 tok/s) step 22142/76294 | train loss 3.440704 | norm 4.2110 | lr 1.20e-04 | (3817.64 ms | 137333 tok/s) step 22143/76294 | train loss 3.478545 | norm 2.7861 | lr 1.20e-04 | (3871.76 ms | 135413 tok/s) step 22144/76294 | train loss 3.461233 | norm 3.3398 | lr 1.20e-04 | (3858.06 ms | 135894 tok/s) step 22145/76294 | train loss 3.473949 | norm 2.8782 | lr 1.20e-04 | (3783.64 ms | 138567 tok/s) step 22146/76294 | train loss 3.399156 | norm 4.3562 | lr 1.20e-04 | (3797.52 ms | 138060 tok/s) step 22147/76294 | train loss 3.486363 | norm 2.1193 | lr 1.20e-04 | (3786.04 ms | 138479 tok/s) step 22148/76294 | train loss 3.431221 | norm 4.8557 | lr 1.20e-04 | (3832.91 ms | 136786 tok/s) step 22149/76294 | train loss 3.481966 | norm 3.1726 | lr 1.20e-04 | (3789.10 ms | 138367 tok/s) step 22150/76294 | train loss 3.408673 | norm 7.4998 | lr 1.20e-04 | (3798.93 ms | 138009 tok/s) step 22151/76294 | train loss 3.433671 | norm 7.6720 | lr 1.20e-04 | (3815.18 ms | 137422 tok/s) step 22152/76294 | train loss 3.449431 | norm 6.8471 | lr 1.20e-04 | (3822.25 ms | 137167 tok/s) step 22153/76294 | train loss 3.482224 | norm 3.0828 | lr 1.20e-04 | (3800.63 ms | 137948 tok/s) step 22154/76294 | train loss 3.504239 | norm 2.8014 | lr 1.20e-04 | (3796.81 ms | 138086 tok/s) step 22155/76294 | train loss 3.440180 | norm 5.6255 | lr 1.20e-04 | (3795.51 ms | 138134 tok/s) 
step 22156/76294 | train loss 3.527466 | norm 5.7593 | lr 1.20e-04 | (3830.81 ms | 136861 tok/s) step 22157/76294 | train loss 3.489655 | norm 3.6977 | lr 1.20e-04 | (3803.18 ms | 137855 tok/s) step 22158/76294 | train loss 3.440316 | norm 5.9762 | lr 1.20e-04 | (3809.55 ms | 137625 tok/s) step 22159/76294 | train loss 3.433376 | norm 5.5853 | lr 1.20e-04 | (3828.14 ms | 136956 tok/s) step 22160/76294 | train loss 3.473844 | norm 3.8629 | lr 1.20e-04 | (3805.35 ms | 137777 tok/s) step 22161/76294 | train loss 3.482360 | norm 4.3892 | lr 1.20e-04 | (3810.87 ms | 137577 tok/s) step 22162/76294 | train loss 3.402133 | norm 6.1005 | lr 1.20e-04 | (3806.83 ms | 137723 tok/s) step 22163/76294 | train loss 3.447120 | norm 4.8991 | lr 1.20e-04 | (3809.17 ms | 137639 tok/s) step 22164/76294 | train loss 3.420301 | norm 3.8917 | lr 1.20e-04 | (3805.37 ms | 137776 tok/s) step 22165/76294 | train loss 3.444021 | norm 4.3512 | lr 1.20e-04 | (3822.17 ms | 137170 tok/s) step 22166/76294 | train loss 3.521508 | norm 4.3764 | lr 1.20e-04 | (3802.05 ms | 137896 tok/s) step 22167/76294 | train loss 3.411849 | norm 5.6459 | lr 1.20e-04 | (3798.57 ms | 138022 tok/s) step 22168/76294 | train loss 3.505200 | norm 5.6364 | lr 1.20e-04 | (3831.71 ms | 136829 tok/s) step 22169/76294 | train loss 3.483578 | norm 6.1556 | lr 1.20e-04 | (3799.16 ms | 138001 tok/s) step 22170/76294 | train loss 3.472490 | norm 5.0684 | lr 1.20e-04 | (3824.10 ms | 137101 tok/s) step 22171/76294 | train loss 3.449158 | norm 6.2830 | lr 1.20e-04 | (3799.26 ms | 137997 tok/s) step 22172/76294 | train loss 3.425413 | norm 6.7759 | lr 1.20e-04 | (3801.56 ms | 137914 tok/s) step 22173/76294 | train loss 3.464593 | norm 47.4554 | lr 1.20e-04 | (3820.65 ms | 137225 tok/s) step 22174/76294 | train loss 3.463060 | norm 13.9543 | lr 1.20e-04 | (3875.36 ms | 135288 tok/s) step 22175/76294 | train loss 3.490980 | norm 6.8812 | lr 1.20e-04 | (3902.86 ms | 134334 tok/s) step 22176/76294 | train loss 3.476812 | norm 2.9111 | lr 
1.20e-04 | (3801.85 ms | 137903 tok/s) step 22177/76294 | train loss 3.490433 | norm 4.5678 | lr 1.20e-04 | (5530.32 ms | 94802 tok/s) step 22178/76294 | train loss 3.439023 | norm 3.6326 | lr 1.20e-04 | (3819.36 ms | 137271 tok/s) step 22179/76294 | train loss 3.477469 | norm 5.3199 | lr 1.20e-04 | (3806.23 ms | 137745 tok/s) step 22180/76294 | train loss 3.564463 | norm 5.0856 | lr 1.20e-04 | (3844.19 ms | 136385 tok/s) step 22181/76294 | train loss 3.454769 | norm 4.1106 | lr 1.20e-04 | (3802.59 ms | 137876 tok/s) step 22182/76294 | train loss 3.372044 | norm 4.4212 | lr 1.20e-04 | (3799.99 ms | 137971 tok/s) step 22183/76294 | train loss 3.391794 | norm 9.9798 | lr 1.20e-04 | (3804.54 ms | 137806 tok/s) step 22184/76294 | train loss 3.433312 | norm 3.8675 | lr 1.20e-04 | (3802.34 ms | 137886 tok/s) step 22185/76294 | train loss 3.480978 | norm 7.3626 | lr 1.20e-04 | (3804.34 ms | 137813 tok/s) step 22186/76294 | train loss 3.427050 | norm 6.4680 | lr 1.20e-04 | (3809.03 ms | 137644 tok/s) step 22187/76294 | train loss 3.631171 | norm 4.0782 | lr 1.20e-04 | (3802.39 ms | 137884 tok/s) step 22188/76294 | train loss 3.466987 | norm 3.4897 | lr 1.20e-04 | (3803.66 ms | 137838 tok/s) step 22189/76294 | train loss 3.439469 | norm 3.5130 | lr 1.20e-04 | (3800.19 ms | 137964 tok/s) step 22190/76294 | train loss 3.427996 | norm 2.4603 | lr 1.20e-04 | (3801.14 ms | 137929 tok/s) step 22191/76294 | train loss 3.433696 | norm 4.9836 | lr 1.20e-04 | (3805.16 ms | 137783 tok/s) step 22192/76294 | train loss 3.431564 | norm 2.6750 | lr 1.20e-04 | (3802.20 ms | 137891 tok/s) step 22193/76294 | train loss 3.400692 | norm 4.1909 | lr 1.20e-04 | (3808.11 ms | 137677 tok/s) step 22194/76294 | train loss 3.452753 | norm 3.1011 | lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 22195/76294 | train loss 3.449701 | norm 3.6154 | lr 1.20e-04 | (3811.73 ms | 137546 tok/s) step 22196/76294 | train loss 3.500246 | norm 15.5979 | lr 1.20e-04 | (3802.29 ms | 137887 tok/s) step 22197/76294 | 
train loss 3.551317 | norm 5.7406 | lr 1.20e-04 | (3809.29 ms | 137634 tok/s) step 22198/76294 | train loss 3.449287 | norm 2.9100 | lr 1.20e-04 | (3802.33 ms | 137886 tok/s) step 22199/76294 | train loss 3.430565 | norm 4.2223 | lr 1.20e-04 | (3798.87 ms | 138011 tok/s) step 22200/76294 | train loss 3.432561 | norm 3.6852 | lr 1.20e-04 | (3851.59 ms | 136122 tok/s) step 22201/76294 | train loss 3.512143 | norm 12.3059 | lr 1.20e-04 | (3800.01 ms | 137970 tok/s) step 22202/76294 | train loss 3.405758 | norm 3.8166 | lr 1.20e-04 | (3803.80 ms | 137833 tok/s) step 22203/76294 | train loss 3.480612 | norm 4.2306 | lr 1.20e-04 | (3800.25 ms | 137962 tok/s) step 22204/76294 | train loss 3.450200 | norm 7.2034 | lr 1.20e-04 | (3801.69 ms | 137909 tok/s) step 22205/76294 | train loss 3.474075 | norm 5.6445 | lr 1.20e-04 | (3818.21 ms | 137313 tok/s) step 22206/76294 | train loss 3.432096 | norm 9.1403 | lr 1.20e-04 | (3810.37 ms | 137595 tok/s) step 22207/76294 | train loss 3.498486 | norm 11.2872 | lr 1.20e-04 | (3820.89 ms | 137216 tok/s) step 22208/76294 | train loss 3.489843 | norm 3.4122 | lr 1.20e-04 | (3810.50 ms | 137591 tok/s) step 22209/76294 | train loss 3.486846 | norm 9.6832 | lr 1.20e-04 | (3809.53 ms | 137625 tok/s) step 22210/76294 | train loss 3.474002 | norm 6.9672 | lr 1.20e-04 | (3798.42 ms | 138028 tok/s) step 22211/76294 | train loss 3.491906 | norm 11.4430 | lr 1.20e-04 | (3811.20 ms | 137565 tok/s) step 22212/76294 | train loss 3.449822 | norm 4.8458 | lr 1.20e-04 | (3798.18 ms | 138037 tok/s) step 22213/76294 | train loss 3.443978 | norm 7.1382 | lr 1.20e-04 | (3818.34 ms | 137308 tok/s) step 22214/76294 | train loss 3.500475 | norm 6.8319 | lr 1.20e-04 | (3797.71 ms | 138054 tok/s) step 22215/76294 | train loss 3.448303 | norm 4.4750 | lr 1.20e-04 | (3834.25 ms | 136738 tok/s) step 22216/76294 | train loss 3.501944 | norm 3.7611 | lr 1.20e-04 | (3807.10 ms | 137713 tok/s) step 22217/76294 | train loss 3.657769 | norm 9.9920 | lr 1.20e-04 | 
(3810.14 ms | 137603 tok/s) step 22218/76294 | train loss 3.513884 | norm 7.0319 | lr 1.20e-04 | (3833.46 ms | 136766 tok/s) step 22219/76294 | train loss 3.540807 | norm 5.5893 | lr 1.20e-04 | (3800.05 ms | 137969 tok/s) step 22220/76294 | train loss 3.493764 | norm 3.7201 | lr 1.20e-04 | (3841.12 ms | 136494 tok/s) step 22221/76294 | train loss 3.483643 | norm 5.1484 | lr 1.20e-04 | (3810.68 ms | 137584 tok/s) step 22222/76294 | train loss 3.463305 | norm 4.5185 | lr 1.20e-04 | (3843.49 ms | 136409 tok/s) step 22223/76294 | train loss 3.435930 | norm 4.1881 | lr 1.20e-04 | (3800.55 ms | 137950 tok/s) step 22224/76294 | train loss 3.491026 | norm 4.3613 | lr 1.20e-04 | (3803.74 ms | 137835 tok/s) step 22225/76294 | train loss 3.481264 | norm 3.0256 | lr 1.20e-04 | (4012.97 ms | 130648 tok/s) step 22226/76294 | train loss 3.500928 | norm 7.5574 | lr 1.20e-04 | (3802.57 ms | 137877 tok/s) step 22227/76294 | train loss 3.532340 | norm 5.1381 | lr 1.20e-04 | (3877.78 ms | 135203 tok/s) step 22228/76294 | train loss 3.499620 | norm 4.3026 | lr 1.20e-04 | (3797.32 ms | 138068 tok/s) step 22229/76294 | train loss 3.572266 | norm 12.1209 | lr 1.20e-04 | (3882.01 ms | 135056 tok/s) step 22230/76294 | train loss 3.529076 | norm 8.9828 | lr 1.20e-04 | (3798.78 ms | 138015 tok/s) step 22231/76294 | train loss 3.466085 | norm 6.4998 | lr 1.20e-04 | (4579.82 ms | 114478 tok/s) step 22232/76294 | train loss 3.473696 | norm 17.3853 | lr 1.20e-04 | (3792.51 ms | 138243 tok/s) step 22233/76294 | train loss 3.475436 | norm 2.8110 | lr 1.20e-04 | (3800.70 ms | 137945 tok/s) step 22234/76294 | train loss 3.465443 | norm 10.7017 | lr 1.20e-04 | (3828.76 ms | 136934 tok/s) step 22235/76294 | train loss 3.544011 | norm 4.8369 | lr 1.20e-04 | (3806.80 ms | 137724 tok/s) step 22236/76294 | train loss 3.517416 | norm 4.9919 | lr 1.20e-04 | (3817.45 ms | 137340 tok/s) step 22237/76294 | train loss 3.483123 | norm 5.6511 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 22238/76294 | train 
loss 3.559332 | norm 4.5192 | lr 1.20e-04 | (3811.89 ms | 137540 tok/s) step 22239/76294 | train loss 3.503247 | norm 5.4237 | lr 1.20e-04 | (3805.92 ms | 137756 tok/s) step 22240/76294 | train loss 3.435734 | norm 4.7415 | lr 1.20e-04 | (3806.05 ms | 137751 tok/s) step 22241/76294 | train loss 3.501924 | norm 7.5840 | lr 1.20e-04 | (3805.70 ms | 137764 tok/s) step 22242/76294 | train loss 3.490442 | norm 4.9594 | lr 1.20e-04 | (3800.27 ms | 137961 tok/s) step 22243/76294 | train loss 3.501731 | norm 4.2139 | lr 1.20e-04 | (3841.70 ms | 136473 tok/s) step 22244/76294 | train loss 3.459892 | norm 5.5764 | lr 1.20e-04 | (3797.51 ms | 138061 tok/s) step 22245/76294 | train loss 3.445770 | norm 5.0089 | lr 1.20e-04 | (3801.87 ms | 137903 tok/s) step 22246/76294 | train loss 3.467026 | norm 5.8893 | lr 1.20e-04 | (3817.15 ms | 137351 tok/s) step 22247/76294 | train loss 3.537287 | norm 11.0400 | lr 1.20e-04 | (3803.72 ms | 137835 tok/s) step 22248/76294 | train loss 3.522860 | norm 6.6800 | lr 1.20e-04 | (3806.24 ms | 137744 tok/s) step 22249/76294 | train loss 3.477681 | norm 3.2777 | lr 1.20e-04 | (3804.21 ms | 137818 tok/s) step 22250/76294 | train loss 3.474331 | norm 4.1064 | lr 1.20e-04 | (3840.17 ms | 136527 tok/s) val loss: 3.455016 saving model checkpoint to ./results/gpt2-124M-gqa/step_22250.pth step 22251/76294 | train loss 3.437161 | norm 5.1399 | lr 1.20e-04 | (3874.61 ms | 135314 tok/s) step 22252/76294 | train loss 3.498734 | norm 6.6536 | lr 1.20e-04 | (3817.00 ms | 137356 tok/s) step 22253/76294 | train loss 3.448311 | norm 4.2265 | lr 1.20e-04 | (3809.94 ms | 137611 tok/s) step 22254/76294 | train loss 3.534089 | norm 4.5038 | lr 1.20e-04 | (3801.76 ms | 137907 tok/s) step 22255/76294 | train loss 3.476253 | norm 6.2936 | lr 1.20e-04 | (3818.80 ms | 137291 tok/s) step 22256/76294 | train loss 3.501148 | norm 5.3241 | lr 1.20e-04 | (3805.34 ms | 137777 tok/s) step 22257/76294 | train loss 3.478628 | norm 3.9093 | lr 1.20e-04 | (3803.23 ms | 137853 
tok/s) step 22258/76294 | train loss 3.542226 | norm 3.4109 | lr 1.20e-04 | (3797.14 ms | 138075 tok/s) step 22259/76294 | train loss 3.454524 | norm 4.5761 | lr 1.20e-04 | (3828.69 ms | 136937 tok/s) step 22260/76294 | train loss 3.469231 | norm 5.7416 | lr 1.20e-04 | (3800.52 ms | 137952 tok/s) step 22261/76294 | train loss 3.471044 | norm 9.9624 | lr 1.20e-04 | (3809.38 ms | 137631 tok/s) step 22262/76294 | train loss 3.457600 | norm 5.5050 | lr 1.20e-04 | (3800.64 ms | 137947 tok/s) step 22263/76294 | train loss 3.459744 | norm 5.1793 | lr 1.20e-04 | (3799.20 ms | 138000 tok/s) step 22264/76294 | train loss 3.551309 | norm 5.3129 | lr 1.20e-04 | (3823.85 ms | 137110 tok/s) step 22265/76294 | train loss 3.479223 | norm 9.8691 | lr 1.20e-04 | (3803.72 ms | 137836 tok/s) step 22266/76294 | train loss 3.537994 | norm 8.0017 | lr 1.20e-04 | (3800.13 ms | 137966 tok/s) step 22267/76294 | train loss 3.521130 | norm 5.5621 | lr 1.20e-04 | (3835.97 ms | 136677 tok/s) step 22268/76294 | train loss 3.545655 | norm 9.2159 | lr 1.20e-04 | (3802.85 ms | 137867 tok/s) step 22269/76294 | train loss 3.517414 | norm 5.6579 | lr 1.20e-04 | (3814.72 ms | 137438 tok/s) step 22270/76294 | train loss 3.480700 | norm 3.8387 | lr 1.20e-04 | (3799.05 ms | 138005 tok/s) step 22271/76294 | train loss 3.454463 | norm 4.4766 | lr 1.20e-04 | (3810.43 ms | 137593 tok/s) step 22272/76294 | train loss 3.508884 | norm 6.1546 | lr 1.20e-04 | (3801.89 ms | 137902 tok/s) step 22273/76294 | train loss 3.444253 | norm 3.6202 | lr 1.20e-04 | (3804.70 ms | 137800 tok/s) step 22274/76294 | train loss 3.468721 | norm 7.9769 | lr 1.20e-04 | (3822.84 ms | 137146 tok/s) step 22275/76294 | train loss 3.473130 | norm 5.4448 | lr 1.20e-04 | (3804.76 ms | 137798 tok/s) step 22276/76294 | train loss 3.530897 | norm 14.4435 | lr 1.20e-04 | (3895.31 ms | 134595 tok/s) step 22277/76294 | train loss 3.503914 | norm 25.7105 | lr 1.20e-04 | (3802.27 ms | 137888 tok/s) step 22278/76294 | train loss 3.425725 | norm 
8.0751 | lr 1.20e-04 | (3861.02 ms | 135790 tok/s) step 22279/76294 | train loss 3.485493 | norm 13.7690 | lr 1.20e-04 | (3817.68 ms | 137332 tok/s) step 22280/76294 | train loss 3.476921 | norm 6.4697 | lr 1.20e-04 | (3813.41 ms | 137486 tok/s) step 22281/76294 | train loss 3.506171 | norm 3.9596 | lr 1.20e-04 | (3833.77 ms | 136755 tok/s) step 22282/76294 | train loss 3.432578 | norm 4.1186 | lr 1.20e-04 | (3806.12 ms | 137749 tok/s) step 22283/76294 | train loss 3.522659 | norm 5.8077 | lr 1.20e-04 | (3811.45 ms | 137556 tok/s) step 22284/76294 | train loss 3.505668 | norm 4.0291 | lr 1.20e-04 | (3836.10 ms | 136672 tok/s) step 22285/76294 | train loss 3.452024 | norm 3.3999 | lr 1.20e-04 | (3806.96 ms | 137718 tok/s) step 22286/76294 | train loss 3.468393 | norm 7.3519 | lr 1.20e-04 | (3810.55 ms | 137588 tok/s) step 22287/76294 | train loss 3.504119 | norm 6.0620 | lr 1.20e-04 | (3829.20 ms | 136918 tok/s) step 22288/76294 | train loss 3.562040 | norm 4.6874 | lr 1.20e-04 | (3808.42 ms | 137666 tok/s) step 22289/76294 | train loss 3.506958 | norm 8.7111 | lr 1.20e-04 | (3811.04 ms | 137571 tok/s) step 22290/76294 | train loss 3.550311 | norm 5.1996 | lr 1.20e-04 | (3812.70 ms | 137511 tok/s) step 22291/76294 | train loss 3.525841 | norm 3.2976 | lr 1.20e-04 | (3812.93 ms | 137503 tok/s) step 22292/76294 | train loss 3.443467 | norm 3.4055 | lr 1.20e-04 | (3804.19 ms | 137819 tok/s) step 22293/76294 | train loss 3.486847 | norm 7.5573 | lr 1.20e-04 | (3814.65 ms | 137441 tok/s) step 22294/76294 | train loss 3.530232 | norm 4.0620 | lr 1.20e-04 | (3805.48 ms | 137772 tok/s) step 22295/76294 | train loss 3.452999 | norm 5.5544 | lr 1.20e-04 | (3816.44 ms | 137376 tok/s) step 22296/76294 | train loss 3.467514 | norm 6.2341 | lr 1.20e-04 | (3813.52 ms | 137482 tok/s) step 22297/76294 | train loss 3.534648 | norm 6.4847 | lr 1.20e-04 | (3858.49 ms | 135879 tok/s) step 22298/76294 | train loss 3.558466 | norm 3.6201 | lr 1.20e-04 | (3807.97 ms | 137682 tok/s) step 
22299/76294 | train loss 3.503817 | norm 3.8163 | lr 1.20e-04 | (3801.60 ms | 137912 tok/s) step 22300/76294 | train loss 3.501050 | norm 7.5896 | lr 1.20e-04 | (3812.39 ms | 137522 tok/s) step 22301/76294 | train loss 3.519728 | norm 6.1962 | lr 1.20e-04 | (3881.64 ms | 135069 tok/s) step 22302/76294 | train loss 3.466458 | norm 6.4458 | lr 1.20e-04 | (3806.10 ms | 137749 tok/s) step 22303/76294 | train loss 3.462180 | norm 4.8177 | lr 1.20e-04 | (3807.68 ms | 137692 tok/s) step 22304/76294 | train loss 3.470270 | norm 6.8133 | lr 1.20e-04 | (3809.99 ms | 137609 tok/s) step 22305/76294 | train loss 3.507764 | norm 4.7558 | lr 1.20e-04 | (3812.60 ms | 137514 tok/s) step 22306/76294 | train loss 3.424389 | norm 4.1195 | lr 1.20e-04 | (3812.46 ms | 137520 tok/s) step 22307/76294 | train loss 3.391542 | norm 7.5537 | lr 1.20e-04 | (3827.41 ms | 136983 tok/s) step 22308/76294 | train loss 3.456928 | norm 5.0542 | lr 1.20e-04 | (3800.89 ms | 137938 tok/s) step 22309/76294 | train loss 3.490760 | norm 2.5878 | lr 1.20e-04 | (3800.78 ms | 137942 tok/s) step 22310/76294 | train loss 3.430283 | norm 2.0246 | lr 1.20e-04 | (3836.15 ms | 136670 tok/s) step 22311/76294 | train loss 3.500164 | norm 3.7214 | lr 1.20e-04 | (3798.47 ms | 138026 tok/s) step 22312/76294 | train loss 3.527207 | norm 7.2448 | lr 1.20e-04 | (3808.27 ms | 137671 tok/s) step 22313/76294 | train loss 3.490680 | norm 5.9782 | lr 1.20e-04 | (3799.68 ms | 137982 tok/s) step 22314/76294 | train loss 3.434073 | norm 3.2077 | lr 1.20e-04 | (3808.65 ms | 137657 tok/s) step 22315/76294 | train loss 3.439420 | norm 7.0087 | lr 1.20e-04 | (3826.16 ms | 137027 tok/s) step 22316/76294 | train loss 3.435808 | norm 2.6614 | lr 1.20e-04 | (3805.48 ms | 137772 tok/s) step 22317/76294 | train loss 3.457096 | norm 5.6736 | lr 1.20e-04 | (3802.86 ms | 137867 tok/s) step 22318/76294 | train loss 3.440304 | norm 8.4836 | lr 1.20e-04 | (3802.67 ms | 137874 tok/s) step 22319/76294 | train loss 3.495504 | norm 4.0275 | lr 
1.20e-04 | (3803.51 ms | 137843 tok/s) step 22320/76294 | train loss 3.537742 | norm 5.0881 | lr 1.20e-04 | (3801.03 ms | 137933 tok/s) step 22321/76294 | train loss 3.441943 | norm 6.1792 | lr 1.20e-04 | (3808.59 ms | 137660 tok/s) step 22322/76294 | train loss 3.450987 | norm 3.8299 | lr 1.20e-04 | (3804.57 ms | 137805 tok/s) step 22323/76294 | train loss 3.443201 | norm 3.2315 | lr 1.20e-04 | (3803.28 ms | 137851 tok/s) step 22324/76294 | train loss 3.512602 | norm 3.0372 | lr 1.20e-04 | (3803.43 ms | 137846 tok/s) step 22325/76294 | train loss 3.550730 | norm 5.3011 | lr 1.20e-04 | (3798.98 ms | 138007 tok/s) step 22326/76294 | train loss 3.433674 | norm 11.5016 | lr 1.20e-04 | (3907.54 ms | 134174 tok/s) step 22327/76294 | train loss 3.503017 | norm 8.6717 | lr 1.20e-04 | (3870.67 ms | 135452 tok/s) step 22328/76294 | train loss 3.471545 | norm 5.2712 | lr 1.20e-04 | (3799.53 ms | 137988 tok/s) step 22329/76294 | train loss 3.468468 | norm 2.9940 | lr 1.20e-04 | (3800.96 ms | 137936 tok/s) step 22330/76294 | train loss 3.483519 | norm 4.9557 | lr 1.20e-04 | (3820.22 ms | 137240 tok/s) step 22331/76294 | train loss 3.471380 | norm 6.4962 | lr 1.20e-04 | (3824.52 ms | 137086 tok/s) step 22332/76294 | train loss 3.572250 | norm 9.0998 | lr 1.20e-04 | (3804.27 ms | 137816 tok/s) step 22333/76294 | train loss 3.459275 | norm 6.7423 | lr 1.20e-04 | (3809.02 ms | 137644 tok/s) step 22334/76294 | train loss 3.444626 | norm 4.5952 | lr 1.20e-04 | (3801.10 ms | 137931 tok/s) step 22335/76294 | train loss 3.449036 | norm 4.7078 | lr 1.20e-04 | (3801.77 ms | 137906 tok/s) step 22336/76294 | train loss 3.507138 | norm 5.6179 | lr 1.20e-04 | (3804.87 ms | 137794 tok/s) step 22337/76294 | train loss 3.450886 | norm 5.3593 | lr 1.20e-04 | (3801.94 ms | 137900 tok/s) step 22338/76294 | train loss 3.499309 | norm 6.1216 | lr 1.20e-04 | (3803.70 ms | 137836 tok/s) step 22339/76294 | train loss 3.464179 | norm 6.7804 | lr 1.20e-04 | (3806.59 ms | 137732 tok/s) step 22340/76294 | 
train loss 3.481344 | norm 6.0949 | lr 1.20e-04 | (3800.49 ms | 137953 tok/s) step 22341/76294 | train loss 3.532622 | norm 7.5097 | lr 1.20e-04 | (3802.66 ms | 137874 tok/s) step 22342/76294 | train loss 3.488660 | norm 16.4993 | lr 1.20e-04 | (3812.05 ms | 137535 tok/s) step 22343/76294 | train loss 3.542159 | norm 11.8116 | lr 1.20e-04 | (3801.56 ms | 137914 tok/s) step 22344/76294 | train loss 3.472248 | norm 7.3583 | lr 1.20e-04 | (3806.39 ms | 137739 tok/s) step 22345/76294 | train loss 3.498945 | norm 6.7452 | lr 1.20e-04 | (3804.12 ms | 137821 tok/s) step 22346/76294 | train loss 3.408984 | norm 8.0760 | lr 1.20e-04 | (3833.98 ms | 136748 tok/s) step 22347/76294 | train loss 3.461972 | norm 6.9292 | lr 1.20e-04 | (3805.53 ms | 137770 tok/s) step 22348/76294 | train loss 3.592722 | norm 9.1285 | lr 1.20e-04 | (3809.84 ms | 137614 tok/s) step 22349/76294 | train loss 3.489748 | norm 8.8436 | lr 1.20e-04 | (3799.19 ms | 138000 tok/s) step 22350/76294 | train loss 3.511385 | norm 4.4810 | lr 1.20e-04 | (3917.64 ms | 133828 tok/s) step 22351/76294 | train loss 3.463950 | norm 5.7731 | lr 1.20e-04 | (3800.90 ms | 137938 tok/s) step 22352/76294 | train loss 3.477198 | norm 9.0557 | lr 1.20e-04 | (3908.95 ms | 134125 tok/s) step 22353/76294 | train loss 3.491501 | norm 9.3167 | lr 1.20e-04 | (3799.42 ms | 137992 tok/s) step 22354/76294 | train loss 3.463099 | norm 9.6826 | lr 1.20e-04 | (5168.77 ms | 101434 tok/s) step 22355/76294 | train loss 3.475225 | norm 15.4872 | lr 1.20e-04 | (3825.73 ms | 137043 tok/s) step 22356/76294 | train loss 3.498607 | norm 10.2669 | lr 1.20e-04 | (3796.68 ms | 138091 tok/s) step 22357/76294 | train loss 3.463751 | norm 6.0048 | lr 1.20e-04 | (3806.34 ms | 137741 tok/s) step 22358/76294 | train loss 3.457470 | norm 6.0596 | lr 1.20e-04 | (3816.49 ms | 137375 tok/s) step 22359/76294 | train loss 3.560807 | norm 8.3810 | lr 1.20e-04 | (3822.44 ms | 137160 tok/s) step 22360/76294 | train loss 3.464560 | norm 5.2372 | lr 1.20e-04 | 
(3823.89 ms | 137108 tok/s) step 22361/76294 | train loss 3.595340 | norm 9.1897 | lr 1.20e-04 | (3805.18 ms | 137783 tok/s) step 22362/76294 | train loss 3.703311 | norm 41.6583 | lr 1.20e-04 | (3808.24 ms | 137672 tok/s) step 22363/76294 | train loss 3.515583 | norm 16.9974 | lr 1.20e-04 | (3798.79 ms | 138014 tok/s) step 22364/76294 | train loss 3.552626 | norm 16.6916 | lr 1.20e-04 | (3844.48 ms | 136374 tok/s) step 22365/76294 | train loss 3.572832 | norm 36.7742 | lr 1.20e-04 | (3807.56 ms | 137697 tok/s) step 22366/76294 | train loss 3.643723 | norm 31.4194 | lr 1.20e-04 | (3809.40 ms | 137630 tok/s) step 22367/76294 | train loss 3.621274 | norm 70.1180 | lr 1.20e-04 | (3822.68 ms | 137152 tok/s) step 22368/76294 | train loss 3.857244 | norm 37.4989 | lr 1.20e-04 | (3807.24 ms | 137708 tok/s) step 22369/76294 | train loss 4.025228 | norm 63.1825 | lr 1.20e-04 | (3807.04 ms | 137715 tok/s) step 22370/76294 | train loss 4.068137 | norm 85.4410 | lr 1.20e-04 | (3816.47 ms | 137375 tok/s) step 22371/76294 | train loss 3.931180 | norm 42.4700 | lr 1.20e-04 | (3807.97 ms | 137682 tok/s) step 22372/76294 | train loss 3.871339 | norm 46.6183 | lr 1.20e-04 | (3840.99 ms | 136498 tok/s) step 22373/76294 | train loss 3.825894 | norm 50.4812 | lr 1.20e-04 | (3806.57 ms | 137732 tok/s) step 22374/76294 | train loss 3.791725 | norm 45.2325 | lr 1.20e-04 | (3863.58 ms | 135700 tok/s) step 22375/76294 | train loss 3.907459 | norm 51.2803 | lr 1.20e-04 | (3810.27 ms | 137599 tok/s) step 22376/76294 | train loss 3.755193 | norm 30.9563 | lr 1.20e-04 | (3834.46 ms | 136731 tok/s) step 22377/76294 | train loss 3.732390 | norm 81.4960 | lr 1.20e-04 | (3840.37 ms | 136520 tok/s) step 22378/76294 | train loss 3.735462 | norm 76.3640 | lr 1.20e-04 | (3834.43 ms | 136732 tok/s) step 22379/76294 | train loss 3.697508 | norm 52.3594 | lr 1.20e-04 | (3824.01 ms | 137104 tok/s) step 22380/76294 | train loss 3.813977 | norm 31.7686 | lr 1.20e-04 | (3815.22 ms | 137420 tok/s) step 
22381/76294 | train loss 3.659218 | norm 10.8741 | lr 1.20e-04 | (3811.20 ms | 137565 tok/s) step 22382/76294 | train loss 3.699960 | norm 8.9493 | lr 1.20e-04 | (3808.04 ms | 137679 tok/s) step 22383/76294 | train loss 3.618076 | norm 10.6515 | lr 1.20e-04 | (3810.51 ms | 137590 tok/s) step 22384/76294 | train loss 3.591470 | norm 12.5882 | lr 1.20e-04 | (3818.72 ms | 137294 tok/s) step 22385/76294 | train loss 3.617755 | norm 19.8665 | lr 1.20e-04 | (3805.29 ms | 137779 tok/s) step 22386/76294 | train loss 3.552822 | norm 9.1182 | lr 1.20e-04 | (3808.04 ms | 137679 tok/s) step 22387/76294 | train loss 3.539578 | norm 9.0809 | lr 1.20e-04 | (3804.09 ms | 137822 tok/s) step 22388/76294 | train loss 3.561080 | norm 9.1417 | lr 1.20e-04 | (3822.07 ms | 137174 tok/s) step 22389/76294 | train loss 3.666966 | norm 5.5816 | lr 1.20e-04 | (3807.32 ms | 137705 tok/s) step 22390/76294 | train loss 3.561652 | norm 14.1310 | lr 1.20e-04 | (3806.47 ms | 137736 tok/s) step 22391/76294 | train loss 3.526062 | norm 5.1790 | lr 1.20e-04 | (3803.24 ms | 137853 tok/s) step 22392/76294 | train loss 3.617104 | norm 4.5523 | lr 1.20e-04 | (3807.18 ms | 137710 tok/s) step 22393/76294 | train loss 3.537020 | norm 4.4777 | lr 1.20e-04 | (3830.50 ms | 136872 tok/s) step 22394/76294 | train loss 3.586126 | norm 2.4991 | lr 1.20e-04 | (3797.95 ms | 138045 tok/s) step 22395/76294 | train loss 3.470666 | norm 4.5917 | lr 1.20e-04 | (3824.00 ms | 137105 tok/s) step 22396/76294 | train loss 3.449287 | norm 3.3140 | lr 1.20e-04 | (3824.40 ms | 137090 tok/s) step 22397/76294 | train loss 3.470302 | norm 3.2029 | lr 1.20e-04 | (3804.89 ms | 137793 tok/s) step 22398/76294 | train loss 3.487740 | norm 2.9331 | lr 1.20e-04 | (3803.86 ms | 137830 tok/s) step 22399/76294 | train loss 3.544460 | norm 4.1851 | lr 1.20e-04 | (3830.78 ms | 136862 tok/s) step 22400/76294 | train loss 3.525320 | norm 3.5880 | lr 1.20e-04 | (3796.79 ms | 138087 tok/s) step 22401/76294 | train loss 3.507798 | norm 4.4682 | lr 
1.20e-04 | (3849.17 ms | 136208 tok/s) step 22402/76294 | train loss 3.481847 | norm 7.0230 | lr 1.20e-04 | (3906.31 ms | 134215 tok/s) step 22403/76294 | train loss 3.429626 | norm 5.5151 | lr 1.20e-04 | (3793.34 ms | 138213 tok/s) step 22404/76294 | train loss 3.544795 | norm 10.1207 | lr 1.20e-04 | (3817.06 ms | 137354 tok/s) step 22405/76294 | train loss 3.496216 | norm 4.8509 | lr 1.20e-04 | (3821.83 ms | 137183 tok/s) step 22406/76294 | train loss 3.470240 | norm 3.1804 | lr 1.20e-04 | (3800.16 ms | 137965 tok/s) step 22407/76294 | train loss 3.438424 | norm 5.8776 | lr 1.20e-04 | (3810.55 ms | 137588 tok/s) step 22408/76294 | train loss 3.470437 | norm 5.0341 | lr 1.20e-04 | (3805.72 ms | 137763 tok/s) step 22409/76294 | train loss 3.454893 | norm 8.9207 | lr 1.20e-04 | (3808.43 ms | 137665 tok/s) step 22410/76294 | train loss 3.480880 | norm 6.7109 | lr 1.20e-04 | (3804.51 ms | 137807 tok/s) step 22411/76294 | train loss 3.460917 | norm 5.5918 | lr 1.20e-04 | (3796.25 ms | 138107 tok/s) step 22412/76294 | train loss 3.438533 | norm 11.8128 | lr 1.20e-04 | (3826.06 ms | 137031 tok/s) step 22413/76294 | train loss 3.472424 | norm 9.1306 | lr 1.20e-04 | (3794.76 ms | 138161 tok/s) step 22414/76294 | train loss 3.469302 | norm 7.3271 | lr 1.20e-04 | (3804.04 ms | 137824 tok/s) step 22415/76294 | train loss 3.490939 | norm 10.6543 | lr 1.20e-04 | (3817.36 ms | 137343 tok/s) step 22416/76294 | train loss 3.487213 | norm 5.7959 | lr 1.20e-04 | (3804.26 ms | 137816 tok/s) step 22417/76294 | train loss 3.480346 | norm 6.7737 | lr 1.20e-04 | (3804.38 ms | 137812 tok/s) step 22418/76294 | train loss 3.492950 | norm 9.8604 | lr 1.20e-04 | (3802.02 ms | 137897 tok/s) step 22419/76294 | train loss 3.506546 | norm 17.1059 | lr 1.20e-04 | (3809.48 ms | 137627 tok/s) step 22420/76294 | train loss 3.628632 | norm 17.8012 | lr 1.20e-04 | (3799.51 ms | 137988 tok/s) step 22421/76294 | train loss 3.511079 | norm 8.0417 | lr 1.20e-04 | (3798.26 ms | 138034 tok/s) step 
22422/76294 | train loss 3.443714 | norm 11.4547 | lr 1.20e-04 | (4088.73 ms | 128228 tok/s) step 22423/76294 | train loss 3.492814 | norm 12.6305 | lr 1.20e-04 | (3800.29 ms | 137960 tok/s) step 22424/76294 | train loss 3.505567 | norm 18.5232 | lr 1.20e-04 | (3803.68 ms | 137837 tok/s) step 22425/76294 | train loss 3.489227 | norm 8.7682 | lr 1.20e-04 | (3823.12 ms | 137136 tok/s) step 22426/76294 | train loss 3.535035 | norm 9.0353 | lr 1.20e-04 | (3802.60 ms | 137876 tok/s) step 22427/76294 | train loss 3.461433 | norm 11.8415 | lr 1.20e-04 | (3806.82 ms | 137723 tok/s) step 22428/76294 | train loss 3.494915 | norm 17.1151 | lr 1.20e-04 | (3873.88 ms | 135339 tok/s) step 22429/76294 | train loss 3.484972 | norm 12.3491 | lr 1.20e-04 | (3801.90 ms | 137902 tok/s) step 22430/76294 | train loss 3.528994 | norm 9.1881 | lr 1.20e-04 | (3802.36 ms | 137885 tok/s) step 22431/76294 | train loss 3.490966 | norm 12.9485 | lr 1.20e-04 | (3817.87 ms | 137325 tok/s) step 22432/76294 | train loss 3.506725 | norm 10.6975 | lr 1.20e-04 | (3804.47 ms | 137809 tok/s) step 22433/76294 | train loss 3.498230 | norm 10.4156 | lr 1.20e-04 | (3802.30 ms | 137887 tok/s) step 22434/76294 | train loss 3.516216 | norm 14.2690 | lr 1.20e-04 | (3801.28 ms | 137924 tok/s) step 22435/76294 | train loss 3.552302 | norm 12.7862 | lr 1.20e-04 | (5536.97 ms | 94689 tok/s) step 22436/76294 | train loss 3.461814 | norm 8.5488 | lr 1.20e-04 | (3793.44 ms | 138209 tok/s) step 22437/76294 | train loss 3.524969 | norm 8.7241 | lr 1.20e-04 | (3799.65 ms | 137983 tok/s) step 22438/76294 | train loss 3.500336 | norm 6.4140 | lr 1.20e-04 | (3798.06 ms | 138041 tok/s) step 22439/76294 | train loss 3.529269 | norm 5.0350 | lr 1.20e-04 | (3799.94 ms | 137973 tok/s) step 22440/76294 | train loss 3.564960 | norm 8.3520 | lr 1.20e-04 | (3801.21 ms | 137927 tok/s) step 22441/76294 | train loss 3.530218 | norm 5.0949 | lr 1.20e-04 | (3801.47 ms | 137917 tok/s) step 22442/76294 | train loss 3.512930 | norm 7.3339 | 
lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 22443/76294 | train loss 3.594184 | norm 8.2313 | lr 1.20e-04 | (3798.15 ms | 138038 tok/s) step 22444/76294 | train loss 3.565078 | norm 5.7669 | lr 1.20e-04 | (3824.71 ms | 137079 tok/s) step 22445/76294 | train loss 3.510498 | norm 4.3300 | lr 1.20e-04 | (3794.55 ms | 138169 tok/s) step 22446/76294 | train loss 3.490086 | norm 5.5697 | lr 1.20e-04 | (3800.21 ms | 137963 tok/s) step 22447/76294 | train loss 3.544111 | norm 3.8216 | lr 1.20e-04 | (3817.05 ms | 137354 tok/s) step 22448/76294 | train loss 3.513351 | norm 5.7518 | lr 1.20e-04 | (3795.65 ms | 138129 tok/s) step 22449/76294 | train loss 3.474960 | norm 3.0044 | lr 1.20e-04 | (3811.28 ms | 137562 tok/s) step 22450/76294 | train loss 3.501518 | norm 4.6920 | lr 1.20e-04 | (3799.72 ms | 137981 tok/s) step 22451/76294 | train loss 3.523428 | norm 4.0810 | lr 1.20e-04 | (3793.93 ms | 138191 tok/s) step 22452/76294 | train loss 3.463763 | norm 6.1146 | lr 1.20e-04 | (3829.77 ms | 136898 tok/s) step 22453/76294 | train loss 3.485005 | norm 3.4415 | lr 1.20e-04 | (3869.94 ms | 135477 tok/s) step 22454/76294 | train loss 3.446212 | norm 9.9799 | lr 1.20e-04 | (3797.22 ms | 138071 tok/s) step 22455/76294 | train loss 3.481409 | norm 5.5500 | lr 1.20e-04 | (3844.63 ms | 136369 tok/s) step 22456/76294 | train loss 3.506860 | norm 3.4118 | lr 1.20e-04 | (3810.23 ms | 137600 tok/s) step 22457/76294 | train loss 3.485036 | norm 3.7599 | lr 1.20e-04 | (3802.95 ms | 137863 tok/s) step 22458/76294 | train loss 3.542333 | norm 3.0933 | lr 1.20e-04 | (3819.48 ms | 137267 tok/s) step 22459/76294 | train loss 3.465813 | norm 4.6492 | lr 1.20e-04 | (3805.32 ms | 137778 tok/s) step 22460/76294 | train loss 3.485499 | norm 4.7601 | lr 1.20e-04 | (3805.11 ms | 137785 tok/s) step 22461/76294 | train loss 3.449997 | norm 2.5110 | lr 1.20e-04 | (3802.17 ms | 137892 tok/s) step 22462/76294 | train loss 3.459025 | norm 6.4064 | lr 1.20e-04 | (3803.90 ms | 137829 tok/s) step 22463/76294 
| train loss 3.472818 | norm 3.3639 | lr 1.20e-04 | (3800.79 ms | 137942 tok/s) step 22464/76294 | train loss 3.507643 | norm 2.3015 | lr 1.20e-04 | (3839.53 ms | 136550 tok/s) step 22465/76294 | train loss 3.431727 | norm 3.1877 | lr 1.20e-04 | (3800.41 ms | 137955 tok/s) step 22466/76294 | train loss 3.397001 | norm 3.3918 | lr 1.20e-04 | (3796.23 ms | 138108 tok/s) step 22467/76294 | train loss 3.433939 | norm 3.6610 | lr 1.20e-04 | (3826.01 ms | 137032 tok/s) step 22468/76294 | train loss 3.462582 | norm 5.8747 | lr 1.20e-04 | (3799.43 ms | 137991 tok/s) step 22469/76294 | train loss 3.473609 | norm 2.9547 | lr 1.20e-04 | (3803.64 ms | 137839 tok/s) step 22470/76294 | train loss 3.440412 | norm 5.8251 | lr 1.20e-04 | (3832.25 ms | 136809 tok/s) step 22471/76294 | train loss 3.413932 | norm 3.8014 | lr 1.20e-04 | (3800.94 ms | 137936 tok/s) step 22472/76294 | train loss 3.483821 | norm 6.2491 | lr 1.20e-04 | (3802.57 ms | 137877 tok/s) step 22473/76294 | train loss 3.384853 | norm 5.2833 | lr 1.20e-04 | (3797.74 ms | 138053 tok/s) step 22474/76294 | train loss 3.523383 | norm 4.4024 | lr 1.20e-04 | (3805.07 ms | 137787 tok/s) step 22475/76294 | train loss 3.424180 | norm 4.5642 | lr 1.20e-04 | (3805.14 ms | 137784 tok/s) step 22476/76294 | train loss 3.475170 | norm 4.4054 | lr 1.20e-04 | (3797.24 ms | 138071 tok/s) step 22477/76294 | train loss 3.431888 | norm 4.3446 | lr 1.20e-04 | (3830.53 ms | 136871 tok/s) step 22478/76294 | train loss 3.445854 | norm 5.5009 | lr 1.20e-04 | (3902.46 ms | 134348 tok/s) step 22479/76294 | train loss 3.444648 | norm 6.7197 | lr 1.20e-04 | (3851.49 ms | 136126 tok/s) step 22480/76294 | train loss 3.457797 | norm 4.6598 | lr 1.20e-04 | (3804.48 ms | 137808 tok/s) step 22481/76294 | train loss 3.465703 | norm 7.5309 | lr 1.20e-04 | (3807.08 ms | 137714 tok/s) step 22482/76294 | train loss 3.437229 | norm 7.0094 | lr 1.20e-04 | (3820.39 ms | 137234 tok/s) step 22483/76294 | train loss 3.461691 | norm 4.5075 | lr 1.20e-04 | 
(3808.53 ms | 137662 tok/s) step 22484/76294 | train loss 3.475739 | norm 5.6958 | lr 1.20e-04 | (3821.30 ms | 137202 tok/s) step 22485/76294 | train loss 3.408782 | norm 9.0212 | lr 1.20e-04 | (3796.31 ms | 138105 tok/s) step 22486/76294 | train loss 3.542589 | norm 6.1061 | lr 1.20e-04 | (3798.30 ms | 138032 tok/s) step 22487/76294 | train loss 3.441129 | norm 7.3930 | lr 1.20e-04 | (3824.58 ms | 137084 tok/s) step 22488/76294 | train loss 3.484730 | norm 5.2246 | lr 1.20e-04 | (3797.41 ms | 138065 tok/s) step 22489/76294 | train loss 3.513846 | norm 3.8727 | lr 1.20e-04 | (3805.84 ms | 137759 tok/s) step 22490/76294 | train loss 3.462474 | norm 3.1495 | lr 1.20e-04 | (3820.27 ms | 137238 tok/s) step 22491/76294 | train loss 3.427641 | norm 5.1688 | lr 1.20e-04 | (3798.82 ms | 138014 tok/s) step 22492/76294 | train loss 3.428052 | norm 5.8726 | lr 1.20e-04 | (3805.87 ms | 137758 tok/s) step 22493/76294 | train loss 3.439186 | norm 6.6929 | lr 1.20e-04 | (3796.93 ms | 138082 tok/s) step 22494/76294 | train loss 3.509830 | norm 6.1765 | lr 1.20e-04 | (3800.73 ms | 137944 tok/s) step 22495/76294 | train loss 3.426072 | norm 4.5272 | lr 1.20e-04 | (3828.50 ms | 136944 tok/s) step 22496/76294 | train loss 3.481094 | norm 3.6051 | lr 1.20e-04 | (3801.02 ms | 137934 tok/s) step 22497/76294 | train loss 3.399507 | norm 4.8080 | lr 1.20e-04 | (3798.69 ms | 138018 tok/s) step 22498/76294 | train loss 3.440581 | norm 3.3672 | lr 1.20e-04 | (3820.96 ms | 137214 tok/s) step 22499/76294 | train loss 3.531158 | norm 4.0136 | lr 1.20e-04 | (3803.53 ms | 137842 tok/s) step 22500/76294 | train loss 3.450941 | norm 3.2424 | lr 1.20e-04 | (3803.02 ms | 137861 tok/s) val loss: 3.443884 saving model checkpoint to ./results/gpt2-124M-gqa/step_22500.pth step 22501/76294 | train loss 3.387387 | norm 2.2557 | lr 1.20e-04 | (3816.91 ms | 137359 tok/s) step 22502/76294 | train loss 3.401042 | norm 2.4321 | lr 1.20e-04 | (3825.28 ms | 137059 tok/s) step 22503/76294 | train loss 3.431917 | 
norm 4.0824 | lr 1.20e-04 | (3814.44 ms | 137448 tok/s) step 22504/76294 | train loss 3.607133 | norm 3.0096 | lr 1.20e-04 | (3919.61 ms | 133760 tok/s) step 22505/76294 | train loss 3.383341 | norm 3.4726 | lr 1.20e-04 | (3812.42 ms | 137521 tok/s) step 22506/76294 | train loss 3.421418 | norm 3.8986 | lr 1.20e-04 | (3827.54 ms | 136978 tok/s) step 22507/76294 | train loss 3.471227 | norm 6.7812 | lr 1.20e-04 | (3801.89 ms | 137902 tok/s) step 22508/76294 | train loss 3.490588 | norm 5.6845 | lr 1.20e-04 | (3851.62 ms | 136122 tok/s) step 22509/76294 | train loss 3.485801 | norm 4.4271 | lr 1.20e-04 | (3802.41 ms | 137883 tok/s) step 22510/76294 | train loss 3.504762 | norm 8.3532 | lr 1.20e-04 | (3835.45 ms | 136695 tok/s) step 22511/76294 | train loss 3.470475 | norm 3.5532 | lr 1.20e-04 | (3801.86 ms | 137903 tok/s) step 22512/76294 | train loss 3.433449 | norm 2.8471 | lr 1.20e-04 | (3806.34 ms | 137741 tok/s) step 22513/76294 | train loss 3.455849 | norm 4.5250 | lr 1.20e-04 | (3831.79 ms | 136826 tok/s) step 22514/76294 | train loss 3.399143 | norm 2.7556 | lr 1.20e-04 | (3808.07 ms | 137678 tok/s) step 22515/76294 | train loss 3.454650 | norm 4.1327 | lr 1.20e-04 | (3813.95 ms | 137466 tok/s) step 22516/76294 | train loss 3.482607 | norm 4.3995 | lr 1.20e-04 | (3807.97 ms | 137682 tok/s) step 22517/76294 | train loss 3.416891 | norm 3.8824 | lr 1.20e-04 | (3815.51 ms | 137410 tok/s) step 22518/76294 | train loss 3.405339 | norm 4.1231 | lr 1.20e-04 | (3807.50 ms | 137699 tok/s) step 22519/76294 | train loss 3.523429 | norm 3.9710 | lr 1.20e-04 | (3806.17 ms | 137747 tok/s) step 22520/76294 | train loss 3.423972 | norm 3.0553 | lr 1.20e-04 | (3837.87 ms | 136609 tok/s) step 22521/76294 | train loss 3.480735 | norm 4.7040 | lr 1.20e-04 | (3805.96 ms | 137754 tok/s) step 22522/76294 | train loss 3.317427 | norm 4.3231 | lr 1.20e-04 | (3809.79 ms | 137616 tok/s) step 22523/76294 | train loss 3.460237 | norm 2.8362 | lr 1.20e-04 | (3833.55 ms | 136763 tok/s) 
step 22524/76294 | train loss 3.408557 | norm 4.1761 | lr 1.20e-04 | (3809.17 ms | 137639 tok/s) step 22525/76294 | train loss 3.637833 | norm 6.8178 | lr 1.20e-04 | (3807.99 ms | 137681 tok/s) step 22526/76294 | train loss 3.459648 | norm 5.0928 | lr 1.20e-04 | (3803.46 ms | 137845 tok/s) step 22527/76294 | train loss 3.484707 | norm 3.8252 | lr 1.20e-04 | (3817.01 ms | 137356 tok/s) step 22528/76294 | train loss 3.452070 | norm 4.7774 | lr 1.20e-04 | (3808.17 ms | 137675 tok/s) step 22529/76294 | train loss 3.431090 | norm 5.1433 | lr 1.20e-04 | (3900.21 ms | 134426 tok/s) step 22530/76294 | train loss 3.447776 | norm 8.6638 | lr 1.20e-04 | (3805.96 ms | 137754 tok/s) step 22531/76294 | train loss 3.491311 | norm 6.4418 | lr 1.20e-04 | (3867.25 ms | 135571 tok/s) step 22532/76294 | train loss 3.464334 | norm 4.4766 | lr 1.20e-04 | (3802.01 ms | 137897 tok/s) step 22533/76294 | train loss 3.481714 | norm 5.1272 | lr 1.20e-04 | (3829.82 ms | 136896 tok/s) step 22534/76294 | train loss 3.482997 | norm 4.1644 | lr 1.20e-04 | (3801.60 ms | 137913 tok/s) step 22535/76294 | train loss 3.540501 | norm 5.9389 | lr 1.20e-04 | (3801.33 ms | 137922 tok/s) step 22536/76294 | train loss 3.470964 | norm 4.4811 | lr 1.20e-04 | (3823.16 ms | 137135 tok/s) step 22537/76294 | train loss 3.533995 | norm 4.6959 | lr 1.20e-04 | (4488.33 ms | 116811 tok/s) step 22538/76294 | train loss 3.458668 | norm 3.9915 | lr 1.20e-04 | (3799.46 ms | 137990 tok/s) step 22539/76294 | train loss 3.475766 | norm 23.2164 | lr 1.20e-04 | (3829.92 ms | 136893 tok/s) step 22540/76294 | train loss 3.442456 | norm 13.5041 | lr 1.20e-04 | (3796.50 ms | 138098 tok/s) step 22541/76294 | train loss 3.441266 | norm 17.4766 | lr 1.20e-04 | (3801.58 ms | 137913 tok/s) step 22542/76294 | train loss 3.443333 | norm 6.1406 | lr 1.20e-04 | (3820.52 ms | 137230 tok/s) step 22543/76294 | train loss 3.471380 | norm 13.0355 | lr 1.20e-04 | (3948.76 ms | 132773 tok/s) step 22544/76294 | train loss 3.431844 | norm 5.3702 | 
lr 1.20e-04 | (3831.26 ms | 136845 tok/s) step 22545/76294 | train loss 3.463573 | norm 5.3533 | lr 1.20e-04 | (3801.54 ms | 137915 tok/s) step 22546/76294 | train loss 3.471521 | norm 4.0207 | lr 1.20e-04 | (3801.08 ms | 137931 tok/s) step 22547/76294 | train loss 3.484563 | norm 6.6248 | lr 1.20e-04 | (3800.52 ms | 137952 tok/s) step 22548/76294 | train loss 3.461368 | norm 9.3731 | lr 1.20e-04 | (3823.00 ms | 137140 tok/s) step 22549/76294 | train loss 3.498421 | norm 7.5765 | lr 1.20e-04 | (4358.27 ms | 120297 tok/s) step 22550/76294 | train loss 3.535408 | norm 3.7314 | lr 1.20e-04 | (3815.44 ms | 137412 tok/s) step 22551/76294 | train loss 3.388486 | norm 5.3219 | lr 1.20e-04 | (3796.83 ms | 138086 tok/s) step 22552/76294 | train loss 3.441592 | norm 3.9610 | lr 1.20e-04 | (3844.85 ms | 136361 tok/s) step 22553/76294 | train loss 3.425047 | norm 2.8448 | lr 1.20e-04 | (3797.31 ms | 138068 tok/s) step 22554/76294 | train loss 3.480925 | norm 7.0875 | lr 1.20e-04 | (3801.73 ms | 137908 tok/s) step 22555/76294 | train loss 3.443129 | norm 5.6814 | lr 1.20e-04 | (3796.66 ms | 138092 tok/s) step 22556/76294 | train loss 3.471919 | norm 8.4598 | lr 1.20e-04 | (3845.61 ms | 136334 tok/s) step 22557/76294 | train loss 3.415720 | norm 5.4961 | lr 1.20e-04 | (3799.69 ms | 137982 tok/s) step 22558/76294 | train loss 3.478060 | norm 7.3194 | lr 1.20e-04 | (4041.47 ms | 129727 tok/s) step 22559/76294 | train loss 3.458948 | norm 4.0050 | lr 1.20e-04 | (3812.84 ms | 137506 tok/s) step 22560/76294 | train loss 3.491216 | norm 2.6550 | lr 1.20e-04 | (3795.71 ms | 138127 tok/s) step 22561/76294 | train loss 3.505713 | norm 6.5188 | lr 1.20e-04 | (3804.88 ms | 137794 tok/s) step 22562/76294 | train loss 3.471609 | norm 6.3328 | lr 1.20e-04 | (3798.08 ms | 138040 tok/s) step 22563/76294 | train loss 3.511626 | norm 4.2249 | lr 1.20e-04 | (3820.37 ms | 137235 tok/s) step 22564/76294 | train loss 3.483142 | norm 2.8501 | lr 1.20e-04 | (3805.65 ms | 137766 tok/s) step 22565/76294 
| train loss 3.472088 | norm 5.0729 | lr 1.20e-04 | (3804.69 ms | 137801 tok/s) step 22566/76294 | train loss 3.456032 | norm 4.2744 | lr 1.20e-04 | (4090.17 ms | 128182 tok/s) step 22567/76294 | train loss 3.419982 | norm 7.8642 | lr 1.20e-04 | (3799.98 ms | 137971 tok/s) step 22568/76294 | train loss 3.437947 | norm 6.3693 | lr 1.20e-04 | (3807.48 ms | 137699 tok/s) step 22569/76294 | train loss 3.474001 | norm 7.5844 | lr 1.20e-04 | (3803.06 ms | 137860 tok/s) step 22570/76294 | train loss 3.436658 | norm 8.7399 | lr 1.20e-04 | (3795.47 ms | 138135 tok/s) step 22571/76294 | train loss 3.544997 | norm 7.0565 | lr 1.20e-04 | (3823.05 ms | 137139 tok/s) step 22572/76294 | train loss 3.426662 | norm 4.9772 | lr 1.20e-04 | (3796.43 ms | 138100 tok/s) step 22573/76294 | train loss 3.433922 | norm 4.1264 | lr 1.20e-04 | (3803.70 ms | 137836 tok/s) step 22574/76294 | train loss 3.364233 | norm 3.2199 | lr 1.20e-04 | (3795.16 ms | 138146 tok/s) step 22575/76294 | train loss 3.450457 | norm 8.5249 | lr 1.20e-04 | (3802.66 ms | 137874 tok/s) step 22576/76294 | train loss 3.440061 | norm 8.7169 | lr 1.20e-04 | (3868.73 ms | 135520 tok/s) step 22577/76294 | train loss 3.511260 | norm 8.6193 | lr 1.20e-04 | (3797.05 ms | 138078 tok/s) step 22578/76294 | train loss 3.500843 | norm 5.2884 | lr 1.20e-04 | (3795.83 ms | 138122 tok/s) step 22579/76294 | train loss 3.429673 | norm 5.5886 | lr 1.20e-04 | (3826.38 ms | 137019 tok/s) step 22580/76294 | train loss 3.479332 | norm 5.6298 | lr 1.20e-04 | (3875.10 ms | 135297 tok/s) step 22581/76294 | train loss 3.415886 | norm 9.1233 | lr 1.20e-04 | (3800.19 ms | 137964 tok/s) step 22582/76294 | train loss 3.495514 | norm 11.0206 | lr 1.20e-04 | (3800.00 ms | 137971 tok/s) step 22583/76294 | train loss 3.482431 | norm 14.0542 | lr 1.20e-04 | (3832.52 ms | 136800 tok/s) step 22584/76294 | train loss 3.487185 | norm 73.6729 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 22585/76294 | train loss 3.475559 | norm 9.0467 | lr 1.20e-04 | 
(3814.57 ms | 137443 tok/s) step 22586/76294 | train loss 3.418214 | norm 11.4039 | lr 1.20e-04 | (3806.55 ms | 137733 tok/s) step 22587/76294 | train loss 3.496877 | norm 5.7238 | lr 1.20e-04 | (3835.81 ms | 136683 tok/s) step 22588/76294 | train loss 3.421602 | norm 12.1509 | lr 1.20e-04 | (3802.22 ms | 137890 tok/s) step 22589/76294 | train loss 3.452613 | norm 8.3823 | lr 1.20e-04 | (3805.57 ms | 137769 tok/s) step 22590/76294 | train loss 3.400506 | norm 5.6801 | lr 1.20e-04 | (3804.56 ms | 137805 tok/s) step 22591/76294 | train loss 3.470436 | norm 6.1010 | lr 1.20e-04 | (3807.49 ms | 137699 tok/s) step 22592/76294 | train loss 3.412191 | norm 6.9455 | lr 1.20e-04 | (3801.36 ms | 137921 tok/s) step 22593/76294 | train loss 3.469719 | norm 13.7574 | lr 1.20e-04 | (3816.56 ms | 137372 tok/s) step 22594/76294 | train loss 3.517876 | norm 12.3627 | lr 1.20e-04 | (3803.59 ms | 137840 tok/s) step 22595/76294 | train loss 3.464193 | norm 13.2068 | lr 1.20e-04 | (3805.87 ms | 137758 tok/s) step 22596/76294 | train loss 3.453026 | norm 17.0653 | lr 1.20e-04 | (3808.27 ms | 137671 tok/s) step 22597/76294 | train loss 3.465557 | norm 13.2964 | lr 1.20e-04 | (3805.31 ms | 137778 tok/s) step 22598/76294 | train loss 3.471032 | norm 8.7932 | lr 1.20e-04 | (3804.58 ms | 137805 tok/s) step 22599/76294 | train loss 3.454257 | norm 5.8179 | lr 1.20e-04 | (3795.60 ms | 138130 tok/s) step 22600/76294 | train loss 3.512024 | norm 9.4361 | lr 1.20e-04 | (3830.15 ms | 136884 tok/s) step 22601/76294 | train loss 3.452342 | norm 10.3141 | lr 1.20e-04 | (3800.88 ms | 137939 tok/s) step 22602/76294 | train loss 3.473952 | norm 11.2401 | lr 1.20e-04 | (3817.59 ms | 137335 tok/s) step 22603/76294 | train loss 3.430872 | norm 10.5600 | lr 1.20e-04 | (3800.30 ms | 137960 tok/s) step 22604/76294 | train loss 3.476966 | norm 15.2892 | lr 1.20e-04 | (3802.25 ms | 137889 tok/s) step 22605/76294 | train loss 3.455313 | norm 22.8300 | lr 1.20e-04 | (3832.48 ms | 136801 tok/s) step 22606/76294 | 
train loss 3.452653 | norm 16.7976 | lr 1.20e-04 | (3879.13 ms | 135156 tok/s) step 22607/76294 | train loss 3.467213 | norm 5.4056 | lr 1.20e-04 | (3796.89 ms | 138083 tok/s) step 22608/76294 | train loss 3.450762 | norm 4.5208 | lr 1.20e-04 | (3905.35 ms | 134248 tok/s) step 22609/76294 | train loss 3.517730 | norm 8.3561 | lr 1.20e-04 | (3798.01 ms | 138043 tok/s) step 22610/76294 | train loss 3.472530 | norm 6.8299 | lr 1.20e-04 | (3818.94 ms | 137286 tok/s) step 22611/76294 | train loss 3.436795 | norm 7.1895 | lr 1.20e-04 | (3797.91 ms | 138047 tok/s) step 22612/76294 | train loss 3.506665 | norm 16.5536 | lr 1.20e-04 | (3809.73 ms | 137618 tok/s) step 22613/76294 | train loss 3.421517 | norm 9.6467 | lr 1.20e-04 | (4101.30 ms | 127834 tok/s) step 22614/76294 | train loss 3.519869 | norm 20.1144 | lr 1.20e-04 | (3831.46 ms | 136838 tok/s) step 22615/76294 | train loss 3.787306 | norm 50.9010 | lr 1.20e-04 | (3794.57 ms | 138168 tok/s) step 22616/76294 | train loss 3.488708 | norm 39.4867 | lr 1.20e-04 | (3821.94 ms | 137178 tok/s) step 22617/76294 | train loss 3.511613 | norm 11.7422 | lr 1.20e-04 | (3799.19 ms | 138000 tok/s) step 22618/76294 | train loss 3.466357 | norm 19.4743 | lr 1.20e-04 | (3801.51 ms | 137916 tok/s) step 22619/76294 | train loss 3.501716 | norm 12.6937 | lr 1.20e-04 | (3819.31 ms | 137273 tok/s) step 22620/76294 | train loss 3.446807 | norm 17.2623 | lr 1.20e-04 | (3801.67 ms | 137910 tok/s) step 22621/76294 | train loss 3.537738 | norm 18.7432 | lr 1.20e-04 | (3807.30 ms | 137706 tok/s) step 22622/76294 | train loss 3.494632 | norm 25.5271 | lr 1.20e-04 | (3802.95 ms | 137864 tok/s) step 22623/76294 | train loss 3.440041 | norm 25.2857 | lr 1.20e-04 | (3804.21 ms | 137818 tok/s) step 22624/76294 | train loss 3.543307 | norm 22.5358 | lr 1.20e-04 | (3798.64 ms | 138020 tok/s) step 22625/76294 | train loss 3.449481 | norm 19.4400 | lr 1.20e-04 | (3804.68 ms | 137801 tok/s) step 22626/76294 | train loss 3.469907 | norm 26.6349 | lr 
1.20e-04 | (3799.96 ms | 137972 tok/s) step 22627/76294 | train loss 3.465300 | norm 20.8219 | lr 1.20e-04 | (3805.71 ms | 137763 tok/s) step 22628/76294 | train loss 3.487952 | norm 16.1295 | lr 1.20e-04 | (3795.93 ms | 138118 tok/s) step 22629/76294 | train loss 3.461011 | norm 14.1367 | lr 1.20e-04 | (3805.84 ms | 137759 tok/s) step 22630/76294 | train loss 3.474683 | norm 14.0860 | lr 1.20e-04 | (3794.69 ms | 138164 tok/s) step 22631/76294 | train loss 3.541332 | norm 14.7207 | lr 1.20e-04 | (3802.95 ms | 137864 tok/s) step 22632/76294 | train loss 3.480885 | norm 14.7542 | lr 1.20e-04 | (3928.47 ms | 133459 tok/s) step 22633/76294 | train loss 3.493550 | norm 20.2158 | lr 1.20e-04 | (3800.31 ms | 137959 tok/s) step 22634/76294 | train loss 3.465597 | norm 15.3261 | lr 1.20e-04 | (3847.13 ms | 136280 tok/s) step 22635/76294 | train loss 3.456992 | norm 8.2656 | lr 1.20e-04 | (3801.56 ms | 137914 tok/s) step 22636/76294 | train loss 3.470074 | norm 17.7896 | lr 1.20e-04 | (3799.15 ms | 138001 tok/s) step 22637/76294 | train loss 3.545179 | norm 5.5729 | lr 1.20e-04 | (3839.30 ms | 136558 tok/s) step 22638/76294 | train loss 3.440280 | norm 13.6817 | lr 1.20e-04 | (3803.53 ms | 137842 tok/s) step 22639/76294 | train loss 3.461233 | norm 8.0282 | lr 1.20e-04 | (3804.05 ms | 137824 tok/s) step 22640/76294 | train loss 3.447027 | norm 12.9060 | lr 1.20e-04 | (3800.88 ms | 137939 tok/s) step 22641/76294 | train loss 3.533599 | norm 11.4607 | lr 1.20e-04 | (3804.58 ms | 137804 tok/s) step 22642/76294 | train loss 3.494208 | norm 21.0594 | lr 1.20e-04 | (3798.39 ms | 138029 tok/s) step 22643/76294 | train loss 3.610925 | norm 11.7983 | lr 1.20e-04 | (3802.34 ms | 137886 tok/s) step 22644/76294 | train loss 3.487314 | norm 20.0511 | lr 1.20e-04 | (3803.73 ms | 137835 tok/s) step 22645/76294 | train loss 3.498290 | norm 12.1163 | lr 1.20e-04 | (3819.82 ms | 137255 tok/s) step 22646/76294 | train loss 3.453052 | norm 7.4013 | lr 1.20e-04 | (3800.84 ms | 137940 tok/s) step 
22647/76294 | train loss 3.489065 | norm 23.6249 | lr 1.20e-04 | (3800.80 ms | 137942 tok/s) step 22648/76294 | train loss 3.497715 | norm 38.3735 | lr 1.20e-04 | (3797.18 ms | 138073 tok/s) step 22649/76294 | train loss 3.499461 | norm 10.1558 | lr 1.20e-04 | (3806.12 ms | 137749 tok/s) step 22650/76294 | train loss 3.422242 | norm 19.5546 | lr 1.20e-04 | (3799.80 ms | 137978 tok/s) step 22651/76294 | train loss 3.532356 | norm 10.1644 | lr 1.20e-04 | (3800.74 ms | 137944 tok/s) step 22652/76294 | train loss 3.586982 | norm 12.6450 | lr 1.20e-04 | (3802.82 ms | 137868 tok/s) step 22653/76294 | train loss 3.458418 | norm 11.8463 | lr 1.20e-04 | (3794.41 ms | 138174 tok/s) step 22654/76294 | train loss 3.480047 | norm 6.7415 | lr 1.20e-04 | (3822.76 ms | 137149 tok/s) step 22655/76294 | train loss 3.440014 | norm 7.6808 | lr 1.20e-04 | (3796.67 ms | 138092 tok/s) step 22656/76294 | train loss 3.560148 | norm 4.7121 | lr 1.20e-04 | (3800.23 ms | 137962 tok/s) step 22657/76294 | train loss 3.470913 | norm 11.8153 | lr 1.20e-04 | (3818.54 ms | 137301 tok/s) step 22658/76294 | train loss 3.485317 | norm 5.9426 | lr 1.20e-04 | (3830.22 ms | 136882 tok/s) step 22659/76294 | train loss 3.535885 | norm 10.6756 | lr 1.20e-04 | (3816.74 ms | 137365 tok/s) step 22660/76294 | train loss 3.464914 | norm 13.0825 | lr 1.20e-04 | (3795.24 ms | 138144 tok/s) step 22661/76294 | train loss 3.558865 | norm 10.2991 | lr 1.20e-04 | (3798.45 ms | 138027 tok/s) step 22662/76294 | train loss 3.538463 | norm 36.0062 | lr 1.20e-04 | (3824.25 ms | 137096 tok/s) step 22663/76294 | train loss 3.490586 | norm 22.3840 | lr 1.20e-04 | (3802.70 ms | 137872 tok/s) step 22664/76294 | train loss 3.471450 | norm 15.4231 | lr 1.20e-04 | (3800.83 ms | 137941 tok/s) step 22665/76294 | train loss 3.492055 | norm 9.5256 | lr 1.20e-04 | (3819.37 ms | 137271 tok/s) step 22666/76294 | train loss 3.506929 | norm 10.1443 | lr 1.20e-04 | (3805.22 ms | 137781 tok/s) step 22667/76294 | train loss 3.503315 | norm 
19.7666 | lr 1.20e-04 | (3802.45 ms | 137882 tok/s) step 22668/76294 | train loss 3.443650 | norm 12.1283 | lr 1.20e-04 | (3802.82 ms | 137868 tok/s) step 22669/76294 | train loss 3.516107 | norm 7.6803 | lr 1.20e-04 | (3804.85 ms | 137795 tok/s) step 22670/76294 | train loss 3.516562 | norm 17.1463 | lr 1.20e-04 | (3797.20 ms | 138072 tok/s) step 22671/76294 | train loss 3.492563 | norm 6.8703 | lr 1.20e-04 | (3804.42 ms | 137810 tok/s) step 22672/76294 | train loss 3.476360 | norm 9.9569 | lr 1.20e-04 | (3801.78 ms | 137906 tok/s) step 22673/76294 | train loss 3.504154 | norm 11.9418 | lr 1.20e-04 | (3800.55 ms | 137950 tok/s) step 22674/76294 | train loss 3.442004 | norm 16.9901 | lr 1.20e-04 | (3802.48 ms | 137881 tok/s) step 22675/76294 | train loss 3.432991 | norm 11.8084 | lr 1.20e-04 | (3817.59 ms | 137335 tok/s) step 22676/76294 | train loss 3.497410 | norm 16.5494 | lr 1.20e-04 | (3796.94 ms | 138082 tok/s) step 22677/76294 | train loss 3.470611 | norm 29.7172 | lr 1.20e-04 | (3797.42 ms | 138064 tok/s) step 22678/76294 | train loss 3.537580 | norm 9.7912 | lr 1.20e-04 | (3826.92 ms | 137000 tok/s) step 22679/76294 | train loss 3.540183 | norm 15.6667 | lr 1.20e-04 | (3802.29 ms | 137888 tok/s) step 22680/76294 | train loss 3.516762 | norm 30.5382 | lr 1.20e-04 | (3805.14 ms | 137784 tok/s) step 22681/76294 | train loss 3.459667 | norm 12.5152 | lr 1.20e-04 | (3816.76 ms | 137365 tok/s) step 22682/76294 | train loss 3.507971 | norm 17.5915 | lr 1.20e-04 | (3798.33 ms | 138031 tok/s) step 22683/76294 | train loss 3.533994 | norm 9.2137 | lr 1.20e-04 | (3866.29 ms | 135605 tok/s) step 22684/76294 | train loss 3.646359 | norm 33.7153 | lr 1.20e-04 | (3888.08 ms | 134845 tok/s) step 22685/76294 | train loss 3.515949 | norm 8.0983 | lr 1.20e-04 | (3792.91 ms | 138228 tok/s) step 22686/76294 | train loss 3.478768 | norm 9.3607 | lr 1.20e-04 | (3814.99 ms | 137428 tok/s) step 22687/76294 | train loss 3.487739 | norm 10.1201 | lr 1.20e-04 | (3796.71 ms | 138090 
tok/s) step 22688/76294 | train loss 3.478324 | norm 20.5861 | lr 1.20e-04 | (3804.12 ms | 137821 tok/s) step 22689/76294 | train loss 3.612210 | norm 57.7122 | lr 1.20e-04 | (3818.62 ms | 137298 tok/s) step 22690/76294 | train loss 3.476161 | norm 13.2677 | lr 1.20e-04 | (3798.57 ms | 138022 tok/s) step 22691/76294 | train loss 3.492378 | norm 18.7344 | lr 1.20e-04 | (3806.25 ms | 137744 tok/s) step 22692/76294 | train loss 3.523651 | norm 8.2831 | lr 1.20e-04 | (3805.61 ms | 137767 tok/s) step 22693/76294 | train loss 3.566336 | norm 9.9235 | lr 1.20e-04 | (3822.09 ms | 137173 tok/s) step 22694/76294 | train loss 3.550178 | norm 31.8084 | lr 1.20e-04 | (3801.15 ms | 137929 tok/s) step 22695/76294 | train loss 3.517404 | norm 9.0090 | lr 1.20e-04 | (3842.80 ms | 136434 tok/s) step 22696/76294 | train loss 3.517387 | norm 18.5760 | lr 1.20e-04 | (3827.75 ms | 136970 tok/s) step 22697/76294 | train loss 3.535613 | norm 34.5908 | lr 1.20e-04 | (3796.49 ms | 138098 tok/s) step 22698/76294 | train loss 3.500640 | norm 13.4661 | lr 1.20e-04 | (3825.19 ms | 137062 tok/s) step 22699/76294 | train loss 3.566615 | norm 9.8582 | lr 1.20e-04 | (3797.26 ms | 138070 tok/s) step 22700/76294 | train loss 3.479944 | norm 13.4626 | lr 1.20e-04 | (3836.06 ms | 136674 tok/s) step 22701/76294 | train loss 3.434730 | norm 11.5716 | lr 1.20e-04 | (3816.32 ms | 137380 tok/s) step 22702/76294 | train loss 3.391712 | norm 9.9042 | lr 1.20e-04 | (3800.59 ms | 137949 tok/s) step 22703/76294 | train loss 3.496280 | norm 6.5371 | lr 1.20e-04 | (3799.12 ms | 138002 tok/s) step 22704/76294 | train loss 3.499239 | norm 9.1637 | lr 1.20e-04 | (3841.37 ms | 136485 tok/s) step 22705/76294 | train loss 3.485364 | norm 5.7272 | lr 1.20e-04 | (3805.15 ms | 137784 tok/s) step 22706/76294 | train loss 3.504125 | norm 6.2212 | lr 1.20e-04 | (3813.72 ms | 137474 tok/s) step 22707/76294 | train loss 3.483239 | norm 8.6093 | lr 1.20e-04 | (3818.67 ms | 137296 tok/s) step 22708/76294 | train loss 3.445757 | 
norm 8.3157 | lr 1.20e-04 | (3802.09 ms | 137895 tok/s) step 22709/76294 | train loss 3.458159 | norm 5.2055 | lr 1.20e-04 | (3812.77 ms | 137508 tok/s) step 22710/76294 | train loss 3.460748 | norm 24.0331 | lr 1.20e-04 | (3824.36 ms | 137092 tok/s) step 22711/76294 | train loss 3.522860 | norm 8.5836 | lr 1.20e-04 | (3818.17 ms | 137314 tok/s) step 22712/76294 | train loss 3.519238 | norm 8.4529 | lr 1.20e-04 | (3814.43 ms | 137449 tok/s) step 22713/76294 | train loss 3.471799 | norm 8.9185 | lr 1.20e-04 | (3808.52 ms | 137662 tok/s) step 22714/76294 | train loss 3.573279 | norm 11.7034 | lr 1.20e-04 | (3803.30 ms | 137851 tok/s) step 22715/76294 | train loss 3.647696 | norm 7.2698 | lr 1.20e-04 | (3805.53 ms | 137770 tok/s) step 22716/76294 | train loss 3.519887 | norm 14.4540 | lr 1.20e-04 | (3800.83 ms | 137940 tok/s) step 22717/76294 | train loss 3.464579 | norm 11.2332 | lr 1.20e-04 | (3823.46 ms | 137124 tok/s) step 22718/76294 | train loss 3.500880 | norm 11.5333 | lr 1.20e-04 | (3824.04 ms | 137103 tok/s) step 22719/76294 | train loss 3.555702 | norm 6.2787 | lr 1.20e-04 | (3807.06 ms | 137715 tok/s) step 22720/76294 | train loss 3.576196 | norm 11.1432 | lr 1.20e-04 | (3799.58 ms | 137986 tok/s) step 22721/76294 | train loss 3.438338 | norm 6.1100 | lr 1.20e-04 | (3801.49 ms | 137916 tok/s) step 22722/76294 | train loss 3.460923 | norm 15.9464 | lr 1.20e-04 | (3803.60 ms | 137840 tok/s) step 22723/76294 | train loss 3.555485 | norm 7.8237 | lr 1.20e-04 | (3807.87 ms | 137685 tok/s) step 22724/76294 | train loss 3.630098 | norm 17.7089 | lr 1.20e-04 | (3817.70 ms | 137331 tok/s) step 22725/76294 | train loss 3.505310 | norm 8.8175 | lr 1.20e-04 | (3807.51 ms | 137698 tok/s) step 22726/76294 | train loss 3.477327 | norm 7.8376 | lr 1.20e-04 | (3805.66 ms | 137765 tok/s) step 22727/76294 | train loss 3.465911 | norm 5.9321 | lr 1.20e-04 | (3805.36 ms | 137776 tok/s) step 22728/76294 | train loss 3.511715 | norm 12.2915 | lr 1.20e-04 | (3812.66 ms | 137512 
tok/s) step 22729/76294 | train loss 3.423240 | norm 12.0824 | lr 1.20e-04 | (3803.93 ms | 137828 tok/s) step 22730/76294 | train loss 3.498848 | norm 6.6750 | lr 1.20e-04 | (3808.22 ms | 137673 tok/s) step 22731/76294 | train loss 3.548879 | norm 8.7388 | lr 1.20e-04 | (3798.06 ms | 138041 tok/s) step 22732/76294 | train loss 3.391046 | norm 9.7909 | lr 1.20e-04 | (3841.26 ms | 136488 tok/s) step 22733/76294 | train loss 3.643096 | norm 7.5454 | lr 1.20e-04 | (3799.91 ms | 137974 tok/s) step 22734/76294 | train loss 3.539145 | norm 3.9974 | lr 1.20e-04 | (6232.18 ms | 84126 tok/s) step 22735/76294 | train loss 3.470434 | norm 6.7849 | lr 1.20e-04 | (3937.55 ms | 133151 tok/s) step 22736/76294 | train loss 3.438511 | norm 15.5523 | lr 1.20e-04 | (3788.80 ms | 138378 tok/s) step 22737/76294 | train loss 3.485131 | norm 9.7223 | lr 1.20e-04 | (3986.63 ms | 131512 tok/s) step 22738/76294 | train loss 3.529922 | norm 4.9020 | lr 1.20e-04 | (3821.38 ms | 137199 tok/s) step 22739/76294 | train loss 3.526424 | norm 5.2035 | lr 1.20e-04 | (3790.00 ms | 138335 tok/s) step 22740/76294 | train loss 3.487245 | norm 6.4730 | lr 1.20e-04 | (3820.21 ms | 137240 tok/s) step 22741/76294 | train loss 3.501275 | norm 7.0340 | lr 1.20e-04 | (3816.40 ms | 137377 tok/s) step 22742/76294 | train loss 3.505059 | norm 10.4095 | lr 1.20e-04 | (3794.74 ms | 138162 tok/s) step 22743/76294 | train loss 3.497002 | norm 7.9663 | lr 1.20e-04 | (3834.54 ms | 136728 tok/s) step 22744/76294 | train loss 3.503150 | norm 6.7495 | lr 1.20e-04 | (3796.34 ms | 138103 tok/s) step 22745/76294 | train loss 3.474208 | norm 3.4531 | lr 1.20e-04 | (3797.30 ms | 138069 tok/s) step 22746/76294 | train loss 3.502799 | norm 626.8306 | lr 1.20e-04 | (3819.09 ms | 137281 tok/s) step 22747/76294 | train loss 3.535349 | norm 14.9549 | lr 1.20e-04 | (3897.34 ms | 134525 tok/s) step 22748/76294 | train loss 3.538267 | norm 8.6929 | lr 1.20e-04 | (3793.51 ms | 138207 tok/s) step 22749/76294 | train loss 3.532005 | norm 
8.0346 | lr 1.20e-04 | (3829.20 ms | 136919 tok/s) step 22750/76294 | train loss 3.507002 | norm 18.7090 | lr 1.20e-04 | (3795.95 ms | 138118 tok/s) val loss: 3.500562 saving model checkpoint to ./results/gpt2-124M-gqa/step_22750.pth step 22751/76294 | train loss 3.449964 | norm 7.9413 | lr 1.20e-04 | (3852.15 ms | 136103 tok/s) step 22752/76294 | train loss 3.530177 | norm 15.3762 | lr 1.20e-04 | (3791.94 ms | 138264 tok/s) step 22753/76294 | train loss 3.511194 | norm 7.2172 | lr 1.20e-04 | (3805.42 ms | 137774 tok/s) step 22754/76294 | train loss 3.470266 | norm 11.1333 | lr 1.20e-04 | (3819.07 ms | 137282 tok/s) step 22755/76294 | train loss 3.438184 | norm 11.3382 | lr 1.20e-04 | (3796.80 ms | 138087 tok/s) step 22756/76294 | train loss 3.535796 | norm 22.3488 | lr 1.20e-04 | (3807.62 ms | 137695 tok/s) step 22757/76294 | train loss 3.537731 | norm 9.8788 | lr 1.20e-04 | (3799.43 ms | 137991 tok/s) step 22758/76294 | train loss 3.496874 | norm 14.6210 | lr 1.20e-04 | (3801.74 ms | 137907 tok/s) step 22759/76294 | train loss 3.540839 | norm 7.1562 | lr 1.20e-04 | (3801.93 ms | 137901 tok/s) step 22760/76294 | train loss 3.584909 | norm 11.8189 | lr 1.20e-04 | (3809.58 ms | 137624 tok/s) step 22761/76294 | train loss 3.444753 | norm 16.9389 | lr 1.20e-04 | (5977.03 ms | 87717 tok/s) step 22762/76294 | train loss 3.513788 | norm 6.6635 | lr 1.20e-04 | (3826.40 ms | 137019 tok/s) step 22763/76294 | train loss 3.539613 | norm 7.8455 | lr 1.20e-04 | (3828.31 ms | 136950 tok/s) step 22764/76294 | train loss 3.427713 | norm 44.3040 | lr 1.20e-04 | (3804.06 ms | 137823 tok/s) step 22765/76294 | train loss 3.430358 | norm 18.6773 | lr 1.20e-04 | (3804.02 ms | 137825 tok/s) step 22766/76294 | train loss 3.468240 | norm 16.0550 | lr 1.20e-04 | (4110.22 ms | 127557 tok/s) step 22767/76294 | train loss 3.496503 | norm 9.2055 | lr 1.20e-04 | (3820.95 ms | 137214 tok/s) step 22768/76294 | train loss 3.446000 | norm 7.7950 | lr 1.20e-04 | (3800.73 ms | 137944 tok/s) step 
22769/76294 | train loss 3.526903 | norm 125.3844 | lr 1.20e-04 | (3798.60 ms | 138021 tok/s) step 22770/76294 | train loss 3.482068 | norm 22.7468 | lr 1.20e-04 | (3827.33 ms | 136985 tok/s) step 22771/76294 | train loss 3.521396 | norm 36.3727 | lr 1.20e-04 | (3794.78 ms | 138160 tok/s) step 22772/76294 | train loss 3.493568 | norm 11.4033 | lr 1.20e-04 | (3801.66 ms | 137910 tok/s) step 22773/76294 | train loss 3.540332 | norm 13.5823 | lr 1.20e-04 | (3815.17 ms | 137422 tok/s) step 22774/76294 | train loss 3.496562 | norm 53.0192 | lr 1.20e-04 | (3797.81 ms | 138050 tok/s) step 22775/76294 | train loss 3.542268 | norm 13.3742 | lr 1.20e-04 | (3799.64 ms | 137984 tok/s) step 22776/76294 | train loss 3.546211 | norm 6.4071 | lr 1.20e-04 | (3799.54 ms | 137987 tok/s) step 22777/76294 | train loss 3.464944 | norm 7.1942 | lr 1.20e-04 | (3807.39 ms | 137703 tok/s) step 22778/76294 | train loss 3.496743 | norm 12.2722 | lr 1.20e-04 | (3795.33 ms | 138140 tok/s) step 22779/76294 | train loss 3.437774 | norm 7.0604 | lr 1.20e-04 | (3811.22 ms | 137564 tok/s) step 22780/76294 | train loss 3.525886 | norm 17.7856 | lr 1.20e-04 | (3803.63 ms | 137839 tok/s) step 22781/76294 | train loss 3.486697 | norm 10.9876 | lr 1.20e-04 | (3804.96 ms | 137791 tok/s) step 22782/76294 | train loss 3.546492 | norm 6.4854 | lr 1.20e-04 | (3805.33 ms | 137777 tok/s) step 22783/76294 | train loss 3.415143 | norm 14.1768 | lr 1.20e-04 | (3802.82 ms | 137868 tok/s) step 22784/76294 | train loss 3.468298 | norm 10.9898 | lr 1.20e-04 | (3801.54 ms | 137915 tok/s) step 22785/76294 | train loss 3.490551 | norm 11.5028 | lr 1.20e-04 | (3799.14 ms | 138002 tok/s) step 22786/76294 | train loss 3.497398 | norm 12.7375 | lr 1.20e-04 | (3828.12 ms | 136957 tok/s) step 22787/76294 | train loss 3.573249 | norm 10.5346 | lr 1.20e-04 | (3794.93 ms | 138155 tok/s) step 22788/76294 | train loss 3.406482 | norm 12.9286 | lr 1.20e-04 | (3870.37 ms | 135462 tok/s) step 22789/76294 | train loss 3.581191 | norm 
12.8498 | lr 1.20e-04 | (3797.42 ms | 138064 tok/s) step 22790/76294 | train loss 3.538077 | norm 13.6879 | lr 1.20e-04 | (4043.26 ms | 129670 tok/s) step 22791/76294 | train loss 3.469359 | norm 19.1258 | lr 1.20e-04 | (3821.48 ms | 137195 tok/s) step 22792/76294 | train loss 3.530451 | norm 14.7566 | lr 1.20e-04 | (3798.34 ms | 138031 tok/s) step 22793/76294 | train loss 3.487340 | norm 14.6568 | lr 1.20e-04 | (3797.83 ms | 138049 tok/s) step 22794/76294 | train loss 3.464294 | norm 24.8536 | lr 1.20e-04 | (3828.71 ms | 136936 tok/s) step 22795/76294 | train loss 3.569324 | norm 28.2800 | lr 1.20e-04 | (3795.68 ms | 138127 tok/s) step 22796/76294 | train loss 3.538337 | norm 34.3937 | lr 1.20e-04 | (3800.51 ms | 137952 tok/s) step 22797/76294 | train loss 3.503677 | norm 50.4694 | lr 1.20e-04 | (3821.43 ms | 137197 tok/s) step 22798/76294 | train loss 3.560228 | norm 55.2651 | lr 1.20e-04 | (3801.45 ms | 137918 tok/s) step 22799/76294 | train loss 3.525242 | norm 23.9730 | lr 1.20e-04 | (3821.76 ms | 137185 tok/s) step 22800/76294 | train loss 3.487604 | norm 19.1172 | lr 1.20e-04 | (3799.81 ms | 137977 tok/s) step 22801/76294 | train loss 3.545135 | norm 10.5183 | lr 1.20e-04 | (3804.06 ms | 137823 tok/s) step 22802/76294 | train loss 3.469959 | norm 13.4278 | lr 1.20e-04 | (3801.98 ms | 137899 tok/s) step 22803/76294 | train loss 3.611983 | norm 23.4239 | lr 1.20e-04 | (4148.02 ms | 126395 tok/s) step 22804/76294 | train loss 3.446384 | norm 7.0524 | lr 1.20e-04 | (3817.33 ms | 137344 tok/s) step 22805/76294 | train loss 3.489398 | norm 10.2267 | lr 1.20e-04 | (3803.50 ms | 137844 tok/s) step 22806/76294 | train loss 3.428834 | norm 21.2603 | lr 1.20e-04 | (3803.57 ms | 137841 tok/s) step 22807/76294 | train loss 3.529259 | norm 22.6269 | lr 1.20e-04 | (3805.29 ms | 137779 tok/s) step 22808/76294 | train loss 3.426543 | norm 7.1819 | lr 1.20e-04 | (3805.15 ms | 137784 tok/s) step 22809/76294 | train loss 3.510581 | norm 9.0592 | lr 1.20e-04 | (3800.84 ms | 
137940 tok/s) step 22810/76294 | train loss 3.402130 | norm 11.4218 | lr 1.20e-04 | (3814.36 ms | 137451 tok/s) step 22811/76294 | train loss 3.400293 | norm 12.8791 | lr 1.20e-04 | (4090.92 ms | 128159 tok/s) step 22812/76294 | train loss 3.484669 | norm 14.0694 | lr 1.20e-04 | (3803.87 ms | 137830 tok/s) step 22813/76294 | train loss 3.395358 | norm 16.7957 | lr 1.20e-04 | (3879.44 ms | 135145 tok/s) step 22814/76294 | train loss 3.524268 | norm 13.9371 | lr 1.20e-04 | (3873.46 ms | 135354 tok/s) step 22815/76294 | train loss 3.435662 | norm 8.3033 | lr 1.20e-04 | (3807.46 ms | 137700 tok/s) step 22816/76294 | train loss 3.473311 | norm 6.8231 | lr 1.20e-04 | (3810.70 ms | 137583 tok/s) step 22817/76294 | train loss 3.485253 | norm 7.8044 | lr 1.20e-04 | (3804.93 ms | 137792 tok/s) step 22818/76294 | train loss 3.439860 | norm 5.7596 | lr 1.20e-04 | (3807.80 ms | 137688 tok/s) step 22819/76294 | train loss 3.393320 | norm 5.9154 | lr 1.20e-04 | (3825.22 ms | 137061 tok/s) step 22820/76294 | train loss 3.465343 | norm 6.8434 | lr 1.20e-04 | (3803.00 ms | 137862 tok/s) step 22821/76294 | train loss 3.524194 | norm 8.3869 | lr 1.20e-04 | (3797.35 ms | 138067 tok/s) step 22822/76294 | train loss 3.393872 | norm 7.0814 | lr 1.20e-04 | (3825.54 ms | 137050 tok/s) step 22823/76294 | train loss 3.399378 | norm 11.0316 | lr 1.20e-04 | (3799.14 ms | 138002 tok/s) step 22824/76294 | train loss 3.447408 | norm 9.6016 | lr 1.20e-04 | (3804.07 ms | 137823 tok/s) step 22825/76294 | train loss 3.507703 | norm 7.2911 | lr 1.20e-04 | (3819.68 ms | 137260 tok/s) step 22826/76294 | train loss 3.456105 | norm 5.5541 | lr 1.20e-04 | (3799.74 ms | 137980 tok/s) step 22827/76294 | train loss 3.443518 | norm 8.2299 | lr 1.20e-04 | (3803.77 ms | 137834 tok/s) step 22828/76294 | train loss 3.404625 | norm 6.8189 | lr 1.20e-04 | (3801.00 ms | 137934 tok/s) step 22829/76294 | train loss 3.479546 | norm 9.5862 | lr 1.20e-04 | (3803.77 ms | 137834 tok/s) step 22830/76294 | train loss 3.453336 
| norm 14.9939 | lr 1.20e-04 | (3803.08 ms | 137859 tok/s) step 22831/76294 | train loss 3.413578 | norm 6.4453 | lr 1.20e-04 | (3803.89 ms | 137830 tok/s) step 22832/76294 | train loss 3.450413 | norm 8.3453 | lr 1.20e-04 | (3805.36 ms | 137776 tok/s) step 22833/76294 | train loss 3.394620 | norm 7.6905 | lr 1.20e-04 | (3805.33 ms | 137777 tok/s) step 22834/76294 | train loss 3.454508 | norm 12.3640 | lr 1.20e-04 | (3800.29 ms | 137960 tok/s) step 22835/76294 | train loss 3.475909 | norm 15.5723 | lr 1.20e-04 | (3803.24 ms | 137853 tok/s) step 22836/76294 | train loss 3.481264 | norm 15.6697 | lr 1.20e-04 | (3802.89 ms | 137866 tok/s) step 22837/76294 | train loss 3.410992 | norm 10.7189 | lr 1.20e-04 | (3813.55 ms | 137480 tok/s) step 22838/76294 | train loss 3.406828 | norm 44.7478 | lr 1.20e-04 | (3803.71 ms | 137836 tok/s) step 22839/76294 | train loss 3.428841 | norm 12.6374 | lr 1.20e-04 | (3801.52 ms | 137915 tok/s) step 22840/76294 | train loss 3.503924 | norm 8.8130 | lr 1.20e-04 | (3826.91 ms | 137000 tok/s) step 22841/76294 | train loss 3.515436 | norm 16.6167 | lr 1.20e-04 | (3824.85 ms | 137074 tok/s) step 22842/76294 | train loss 3.519790 | norm 21.4636 | lr 1.20e-04 | (3823.94 ms | 137107 tok/s) step 22843/76294 | train loss 3.522192 | norm 10.5539 | lr 1.20e-04 | (3802.88 ms | 137866 tok/s) step 22844/76294 | train loss 3.531373 | norm 9.3616 | lr 1.20e-04 | (3808.26 ms | 137671 tok/s) step 22845/76294 | train loss 3.454429 | norm 10.4950 | lr 1.20e-04 | (3803.83 ms | 137831 tok/s) step 22846/76294 | train loss 3.419496 | norm 11.6430 | lr 1.20e-04 | (3808.37 ms | 137667 tok/s) step 22847/76294 | train loss 3.430025 | norm 11.4305 | lr 1.20e-04 | (3804.17 ms | 137819 tok/s) step 22848/76294 | train loss 3.486515 | norm 11.0920 | lr 1.20e-04 | (3812.91 ms | 137503 tok/s) step 22849/76294 | train loss 3.484595 | norm 18.5745 | lr 1.20e-04 | (3801.80 ms | 137905 tok/s) step 22850/76294 | train loss 3.462658 | norm 9.4071 | lr 1.20e-04 | (3804.54 ms | 
137806 tok/s) step 22851/76294 | train loss 3.531579 | norm 10.7589 | lr 1.20e-04 | (3807.02 ms | 137716 tok/s) step 22852/76294 | train loss 3.429014 | norm 8.1050 | lr 1.20e-04 | (3799.79 ms | 137978 tok/s) step 22853/76294 | train loss 3.462955 | norm 8.8287 | lr 1.20e-04 | (3827.88 ms | 136966 tok/s) step 22854/76294 | train loss 3.437487 | norm 34.4991 | lr 1.20e-04 | (3802.75 ms | 137871 tok/s) step 22855/76294 | train loss 3.432628 | norm 48.3528 | lr 1.20e-04 | (3804.78 ms | 137797 tok/s) step 22856/76294 | train loss 3.476629 | norm 47.1323 | lr 1.20e-04 | (3828.51 ms | 136943 tok/s) step 22857/76294 | train loss 3.477504 | norm 12.9367 | lr 1.20e-04 | (3807.09 ms | 137713 tok/s) step 22858/76294 | train loss 3.442961 | norm 11.1908 | lr 1.20e-04 | (3810.98 ms | 137573 tok/s) step 22859/76294 | train loss 3.449147 | norm 12.2018 | lr 1.20e-04 | (3812.88 ms | 137505 tok/s) step 22860/76294 | train loss 3.461792 | norm 13.0238 | lr 1.20e-04 | (3834.72 ms | 136721 tok/s) step 22861/76294 | train loss 3.433251 | norm 30.1467 | lr 1.20e-04 | (3801.05 ms | 137932 tok/s) step 22862/76294 | train loss 3.441032 | norm 13.2976 | lr 1.20e-04 | (3806.34 ms | 137741 tok/s) step 22863/76294 | train loss 3.434899 | norm 28.1758 | lr 1.20e-04 | (3826.21 ms | 137025 tok/s) step 22864/76294 | train loss 3.496645 | norm 15.2883 | lr 1.20e-04 | (3805.81 ms | 137760 tok/s) step 22865/76294 | train loss 3.493702 | norm 23.7122 | lr 1.20e-04 | (3814.90 ms | 137432 tok/s) step 22866/76294 | train loss 3.421582 | norm 13.7587 | lr 1.20e-04 | (3797.24 ms | 138071 tok/s) step 22867/76294 | train loss 3.486870 | norm 11.0336 | lr 1.20e-04 | (3804.96 ms | 137791 tok/s) step 22868/76294 | train loss 3.410159 | norm 20.9008 | lr 1.20e-04 | (3822.93 ms | 137143 tok/s) step 22869/76294 | train loss 3.485212 | norm 9.9251 | lr 1.20e-04 | (3798.09 ms | 138040 tok/s) step 22870/76294 | train loss 3.429610 | norm 9.6533 | lr 1.20e-04 | (3797.22 ms | 138071 tok/s) step 22871/76294 | train loss 
3.457930 | norm 7.8821 | lr 1.20e-04 | (3831.37 ms | 136841 tok/s) step 22872/76294 | train loss 3.437448 | norm 6.0068 | lr 1.20e-04 | (3798.54 ms | 138023 tok/s) step 22873/76294 | train loss 3.501879 | norm 23.3432 | lr 1.20e-04 | (3803.13 ms | 137857 tok/s) step 22874/76294 | train loss 3.515743 | norm 9.8346 | lr 1.20e-04 | (3825.92 ms | 137036 tok/s) step 22875/76294 | train loss 3.467261 | norm 6.1324 | lr 1.20e-04 | (3802.61 ms | 137876 tok/s) step 22876/76294 | train loss 3.375665 | norm 6.2338 | lr 1.20e-04 | (3806.47 ms | 137736 tok/s) step 22877/76294 | train loss 3.613363 | norm 12.8056 | lr 1.20e-04 | (3805.35 ms | 137776 tok/s) step 22878/76294 | train loss 3.426388 | norm 6.2735 | lr 1.20e-04 | (3802.40 ms | 137884 tok/s) step 22879/76294 | train loss 3.475140 | norm 5.3849 | lr 1.20e-04 | (3802.23 ms | 137890 tok/s) step 22880/76294 | train loss 3.352158 | norm 6.5543 | lr 1.20e-04 | (3805.29 ms | 137779 tok/s) step 22881/76294 | train loss 3.493089 | norm 6.3496 | lr 1.20e-04 | (3801.10 ms | 137930 tok/s) step 22882/76294 | train loss 3.381007 | norm 10.8831 | lr 1.20e-04 | (3804.84 ms | 137795 tok/s) step 22883/76294 | train loss 3.507558 | norm 13.2342 | lr 1.20e-04 | (3801.74 ms | 137907 tok/s) step 22884/76294 | train loss 3.428276 | norm 13.4750 | lr 1.20e-04 | (3805.36 ms | 137776 tok/s) step 22885/76294 | train loss 3.402572 | norm 9.9570 | lr 1.20e-04 | (3805.94 ms | 137755 tok/s) step 22886/76294 | train loss 3.521449 | norm 7.2045 | lr 1.20e-04 | (3797.77 ms | 138052 tok/s) step 22887/76294 | train loss 3.418877 | norm 8.6174 | lr 1.20e-04 | (3842.07 ms | 136460 tok/s) step 22888/76294 | train loss 3.553878 | norm 5.4724 | lr 1.20e-04 | (3800.77 ms | 137943 tok/s) step 22889/76294 | train loss 3.387962 | norm 8.7798 | lr 1.20e-04 | (3815.33 ms | 137416 tok/s) step 22890/76294 | train loss 3.473700 | norm 12.6002 | lr 1.20e-04 | (3951.60 ms | 132678 tok/s) step 22891/76294 | train loss 3.428305 | norm 27.0574 | lr 1.20e-04 | (3844.74 ms | 
136365 tok/s) step 22892/76294 | train loss 3.482462 | norm 18.8951 | lr 1.20e-04 | (3814.05 ms | 137462 tok/s) step 22893/76294 | train loss 3.495733 | norm 17.1745 | lr 1.20e-04 | (3797.38 ms | 138066 tok/s) step 22894/76294 | train loss 3.420088 | norm 10.5061 | lr 1.20e-04 | (3826.97 ms | 136998 tok/s) step 22895/76294 | train loss 3.963407 | norm 13.1157 | lr 1.20e-04 | (3797.12 ms | 138075 tok/s) step 22896/76294 | train loss 3.399620 | norm 8.9885 | lr 1.20e-04 | (3823.64 ms | 137118 tok/s) step 22897/76294 | train loss 3.516633 | norm 5.5864 | lr 1.20e-04 | (3821.85 ms | 137182 tok/s) step 22898/76294 | train loss 3.457728 | norm 40.3994 | lr 1.20e-04 | (3806.47 ms | 137736 tok/s) step 22899/76294 | train loss 3.470780 | norm 32.8787 | lr 1.20e-04 | (3825.71 ms | 137043 tok/s) step 22900/76294 | train loss 3.432877 | norm 11.6321 | lr 1.20e-04 | (3806.88 ms | 137721 tok/s) step 22901/76294 | train loss 3.536193 | norm 9.7984 | lr 1.20e-04 | (3806.35 ms | 137740 tok/s) step 22902/76294 | train loss 3.476987 | norm 9.0021 | lr 1.20e-04 | (3805.25 ms | 137780 tok/s) step 22903/76294 | train loss 3.426117 | norm 14.1339 | lr 1.20e-04 | (3806.13 ms | 137748 tok/s) step 22904/76294 | train loss 3.510695 | norm 16.5956 | lr 1.20e-04 | (3802.18 ms | 137891 tok/s) step 22905/76294 | train loss 3.420501 | norm 16.3845 | lr 1.20e-04 | (3811.61 ms | 137550 tok/s) step 22906/76294 | train loss 3.463848 | norm 13.4889 | lr 1.20e-04 | (3805.09 ms | 137786 tok/s) step 22907/76294 | train loss 3.391696 | norm 10.5292 | lr 1.20e-04 | (3802.85 ms | 137867 tok/s) step 22908/76294 | train loss 3.498874 | norm 10.3652 | lr 1.20e-04 | (3803.76 ms | 137834 tok/s) step 22909/76294 | train loss 3.410451 | norm 14.7373 | lr 1.20e-04 | (3811.04 ms | 137571 tok/s) step 22910/76294 | train loss 3.428940 | norm 9.4140 | lr 1.20e-04 | (3801.37 ms | 137921 tok/s) step 22911/76294 | train loss 3.384932 | norm 10.2265 | lr 1.20e-04 | (3807.92 ms | 137684 tok/s) step 22912/76294 | train loss 
3.453287 | norm 8.0529 | lr 1.20e-04 | (3800.82 ms | 137941 tok/s) step 22913/76294 | train loss 3.432374 | norm 6.0950 | lr 1.20e-04 | (3798.94 ms | 138009 tok/s) step 22914/76294 | train loss 3.661974 | norm 5.5122 | lr 1.20e-04 | (3833.09 ms | 136780 tok/s) step 22915/76294 | train loss 3.540640 | norm 11.9162 | lr 1.20e-04 | (3863.18 ms | 135714 tok/s) step 22916/76294 | train loss 3.423354 | norm 8.3095 | lr 1.20e-04 | (3797.96 ms | 138045 tok/s) step 22917/76294 | train loss 3.528126 | norm 8.0382 | lr 1.20e-04 | (3802.08 ms | 137895 tok/s) step 22918/76294 | train loss 3.540394 | norm 9.6876 | lr 1.20e-04 | (3825.79 ms | 137040 tok/s) step 22919/76294 | train loss 3.457792 | norm 9.4107 | lr 1.20e-04 | (3842.53 ms | 136443 tok/s) step 22920/76294 | train loss 3.437625 | norm 8.4074 | lr 1.20e-04 | (3794.20 ms | 138181 tok/s) step 22921/76294 | train loss 3.514575 | norm 10.4029 | lr 1.20e-04 | (3824.54 ms | 137085 tok/s) step 22922/76294 | train loss 3.435565 | norm 9.8896 | lr 1.20e-04 | (3799.17 ms | 138001 tok/s) step 22923/76294 | train loss 3.452089 | norm 6.9569 | lr 1.20e-04 | (3800.17 ms | 137965 tok/s) step 22924/76294 | train loss 3.442905 | norm 8.5389 | lr 1.20e-04 | (3827.26 ms | 136988 tok/s) step 22925/76294 | train loss 3.462787 | norm 10.2731 | lr 1.20e-04 | (3886.37 ms | 134904 tok/s) step 22926/76294 | train loss 3.428752 | norm 7.8813 | lr 1.20e-04 | (3802.91 ms | 137865 tok/s) step 22927/76294 | train loss 3.433510 | norm 8.3582 | lr 1.20e-04 | (3821.94 ms | 137179 tok/s) step 22928/76294 | train loss 3.445063 | norm 8.0224 | lr 1.20e-04 | (3798.86 ms | 138012 tok/s) step 22929/76294 | train loss 3.410611 | norm 4.7647 | lr 1.20e-04 | (3794.89 ms | 138156 tok/s) step 22930/76294 | train loss 3.448243 | norm 5.1915 | lr 1.20e-04 | (3833.59 ms | 136761 tok/s) step 22931/76294 | train loss 3.498069 | norm 3.7446 | lr 1.20e-04 | (3797.97 ms | 138044 tok/s) step 22932/76294 | train loss 3.418514 | norm 8.3103 | lr 1.20e-04 | (3801.19 ms | 
137927 tok/s) step 22933/76294 | train loss 3.432643 | norm 4.8512 | lr 1.20e-04 | (3846.52 ms | 136302 tok/s) step 22934/76294 | train loss 3.423279 | norm 6.6465 | lr 1.20e-04 | (3797.60 ms | 138058 tok/s) step 22935/76294 | train loss 3.550706 | norm 5.9989 | lr 1.20e-04 | (3805.80 ms | 137760 tok/s) step 22936/76294 | train loss 3.418021 | norm 8.0900 | lr 1.20e-04 | (3803.48 ms | 137844 tok/s) step 22937/76294 | train loss 3.476019 | norm 12.2171 | lr 1.20e-04 | (3804.24 ms | 137817 tok/s) step 22938/76294 | train loss 3.415610 | norm 9.9635 | lr 1.20e-04 | (3821.59 ms | 137191 tok/s) step 22939/76294 | train loss 3.438305 | norm 9.8963 | lr 1.20e-04 | (3850.30 ms | 136168 tok/s) step 22940/76294 | train loss 3.465804 | norm 8.1340 | lr 1.20e-04 | (3870.51 ms | 135457 tok/s) step 22941/76294 | train loss 3.393360 | norm 4.4185 | lr 1.20e-04 | (3794.86 ms | 138157 tok/s) step 22942/76294 | train loss 3.483312 | norm 5.4023 | lr 1.20e-04 | (3801.90 ms | 137902 tok/s) step 22943/76294 | train loss 3.413527 | norm 14.6041 | lr 1.20e-04 | (3816.83 ms | 137362 tok/s) step 22944/76294 | train loss 3.441206 | norm 9.4805 | lr 1.20e-04 | (3801.75 ms | 137907 tok/s) step 22945/76294 | train loss 3.458062 | norm 6.2160 | lr 1.20e-04 | (3804.60 ms | 137804 tok/s) step 22946/76294 | train loss 3.473013 | norm 12.0169 | lr 1.20e-04 | (3809.23 ms | 137636 tok/s) step 22947/76294 | train loss 3.417185 | norm 9.1333 | lr 1.20e-04 | (3796.09 ms | 138113 tok/s) step 22948/76294 | train loss 3.476292 | norm 6.2683 | lr 1.20e-04 | (3820.51 ms | 137230 tok/s) step 22949/76294 | train loss 3.474560 | norm 11.4295 | lr 1.20e-04 | (3796.77 ms | 138088 tok/s) step 22950/76294 | train loss 3.451539 | norm 9.4970 | lr 1.20e-04 | (3805.09 ms | 137786 tok/s) step 22951/76294 | train loss 3.495257 | norm 7.4295 | lr 1.20e-04 | (3824.32 ms | 137093 tok/s) step 22952/76294 | train loss 3.451649 | norm 12.8398 | lr 1.20e-04 | (3799.31 ms | 137996 tok/s) step 22953/76294 | train loss 3.409189 | 
norm 3.9157 | lr 1.20e-04 | (3800.16 ms | 137965 tok/s) step 22954/76294 | train loss 3.423403 | norm 4.0581 | lr 1.20e-04 | (3832.40 ms | 136804 tok/s) step 22955/76294 | train loss 3.443167 | norm 9.4270 | lr 1.20e-04 | (3806.49 ms | 137735 tok/s) step 22956/76294 | train loss 3.474329 | norm 7.5398 | lr 1.20e-04 | (3798.51 ms | 138025 tok/s) step 22957/76294 | train loss 3.405224 | norm 27.9304 | lr 1.20e-04 | (3796.15 ms | 138111 tok/s) step 22958/76294 | train loss 3.559233 | norm 20.4662 | lr 1.20e-04 | (3826.55 ms | 137013 tok/s) step 22959/76294 | train loss 3.497380 | norm 5.5084 | lr 1.20e-04 | (3796.95 ms | 138081 tok/s) step 22960/76294 | train loss 3.498991 | norm 8.4172 | lr 1.20e-04 | (3802.83 ms | 137868 tok/s) step 22961/76294 | train loss 3.401165 | norm 3.9284 | lr 1.20e-04 | (3827.52 ms | 136978 tok/s) step 22962/76294 | train loss 3.497715 | norm 4.8717 | lr 1.20e-04 | (3800.80 ms | 137941 tok/s) step 22963/76294 | train loss 3.462162 | norm 4.7925 | lr 1.20e-04 | (3804.76 ms | 137798 tok/s) step 22964/76294 | train loss 3.434246 | norm 5.5461 | lr 1.20e-04 | (3818.66 ms | 137296 tok/s) step 22965/76294 | train loss 3.536789 | norm 10.2285 | lr 1.20e-04 | (3849.40 ms | 136200 tok/s) step 22966/76294 | train loss 3.372975 | norm 5.0685 | lr 1.20e-04 | (3809.64 ms | 137621 tok/s) step 22967/76294 | train loss 3.475034 | norm 5.9251 | lr 1.20e-04 | (3870.30 ms | 135464 tok/s) step 22968/76294 | train loss 3.533290 | norm 4.1868 | lr 1.20e-04 | (3796.70 ms | 138090 tok/s) step 22969/76294 | train loss 3.478550 | norm 5.3297 | lr 1.20e-04 | (3808.83 ms | 137651 tok/s) step 22970/76294 | train loss 3.432452 | norm 5.7658 | lr 1.20e-04 | (4056.20 ms | 129256 tok/s) step 22971/76294 | train loss 3.480114 | norm 6.2719 | lr 1.20e-04 | (4012.78 ms | 130654 tok/s) step 22972/76294 | train loss 3.504028 | norm 4.2553 | lr 1.20e-04 | (3781.18 ms | 138657 tok/s) step 22973/76294 | train loss 3.471079 | norm 3.8091 | lr 1.20e-04 | (3812.69 ms | 137511 tok/s) 
step 22974/76294 | train loss 3.472986 | norm 7.6648 | lr 1.20e-04 | (3802.52 ms | 137879 tok/s) step 22975/76294 | train loss 3.437046 | norm 5.0535 | lr 1.20e-04 | (3864.23 ms | 135677 tok/s) step 22976/76294 | train loss 3.525824 | norm 2.8082 | lr 1.20e-04 | (3788.17 ms | 138401 tok/s) step 22977/76294 | train loss 3.433241 | norm 4.0862 | lr 1.20e-04 | (6087.05 ms | 86132 tok/s) step 22978/76294 | train loss 3.414392 | norm 2.6810 | lr 1.20e-04 | (3781.13 ms | 138659 tok/s) step 22979/76294 | train loss 3.457769 | norm 4.7342 | lr 1.20e-04 | (3780.09 ms | 138697 tok/s) step 22980/76294 | train loss 3.470648 | norm 6.1885 | lr 1.20e-04 | (3823.00 ms | 137141 tok/s) step 22981/76294 | train loss 3.441926 | norm 4.2517 | lr 1.20e-04 | (3784.83 ms | 138523 tok/s) step 22982/76294 | train loss 3.471590 | norm 4.2381 | lr 1.20e-04 | (3792.43 ms | 138246 tok/s) step 22983/76294 | train loss 3.438154 | norm 2.8821 | lr 1.20e-04 | (3789.62 ms | 138348 tok/s) step 22984/76294 | train loss 3.447369 | norm 3.5183 | lr 1.20e-04 | (3803.82 ms | 137832 tok/s) step 22985/76294 | train loss 3.354916 | norm 2.9024 | lr 1.20e-04 | (3787.40 ms | 138430 tok/s) step 22986/76294 | train loss 3.513448 | norm 2.6582 | lr 1.20e-04 | (3795.06 ms | 138150 tok/s) step 22987/76294 | train loss 3.394871 | norm 3.6789 | lr 1.20e-04 | (3884.28 ms | 134977 tok/s) step 22988/76294 | train loss 3.494832 | norm 6.1508 | lr 1.20e-04 | (5910.65 ms | 88702 tok/s) step 22989/76294 | train loss 3.458077 | norm 6.7171 | lr 1.20e-04 | (3786.13 ms | 138476 tok/s) step 22990/76294 | train loss 3.351984 | norm 7.0366 | lr 1.20e-04 | (3819.57 ms | 137264 tok/s) step 22991/76294 | train loss 3.462939 | norm 4.9795 | lr 1.20e-04 | (3785.89 ms | 138485 tok/s) step 22992/76294 | train loss 3.374541 | norm 4.5404 | lr 1.20e-04 | (3799.20 ms | 137999 tok/s) step 22993/76294 | train loss 3.522345 | norm 9.6146 | lr 1.20e-04 | (3784.85 ms | 138523 tok/s) step 22994/76294 | train loss 3.404548 | norm 7.8779 | lr 
1.20e-04 | (4224.23 ms | 124114 tok/s) step 22995/76294 | train loss 3.465147 | norm 6.1802 | lr 1.20e-04 | (3807.03 ms | 137716 tok/s) step 22996/76294 | train loss 3.465356 | norm 7.9444 | lr 1.20e-04 | (3787.23 ms | 138436 tok/s) step 22997/76294 | train loss 3.378125 | norm 7.1368 | lr 1.20e-04 | (3795.28 ms | 138142 tok/s) step 22998/76294 | train loss 3.438152 | norm 9.2370 | lr 1.20e-04 | (3786.39 ms | 138466 tok/s) step 22999/76294 | train loss 3.458253 | norm 10.5155 | lr 1.20e-04 | (3809.38 ms | 137631 tok/s) step 23000/76294 | train loss 3.404755 | norm 9.3713 | lr 1.20e-04 | (3787.10 ms | 138440 tok/s) val loss: 3.455940 saving model checkpoint to ./results/gpt2-124M-gqa/step_23000.pth step 23001/76294 | train loss 3.522284 | norm 11.5947 | lr 1.20e-04 | (3822.44 ms | 137160 tok/s) step 23002/76294 | train loss 3.451738 | norm 6.8610 | lr 1.20e-04 | (3787.18 ms | 138437 tok/s) step 23003/76294 | train loss 3.388133 | norm 9.1781 | lr 1.20e-04 | (3814.00 ms | 137464 tok/s) step 23004/76294 | train loss 3.554190 | norm 5.7611 | lr 1.20e-04 | (3809.77 ms | 137617 tok/s) step 23005/76294 | train loss 3.428574 | norm 8.6235 | lr 1.20e-04 | (3795.72 ms | 138126 tok/s) step 23006/76294 | train loss 3.501938 | norm 11.5060 | lr 1.20e-04 | (3794.00 ms | 138189 tok/s) step 23007/76294 | train loss 3.446933 | norm 8.8025 | lr 1.20e-04 | (4082.75 ms | 128416 tok/s) step 23008/76294 | train loss 3.468502 | norm 8.7031 | lr 1.20e-04 | (3809.66 ms | 137621 tok/s) step 23009/76294 | train loss 3.491857 | norm 8.1054 | lr 1.20e-04 | (3791.22 ms | 138290 tok/s) step 23010/76294 | train loss 3.392051 | norm 13.7340 | lr 1.20e-04 | (3799.72 ms | 137981 tok/s) step 23011/76294 | train loss 3.449901 | norm 12.3075 | lr 1.20e-04 | (3788.45 ms | 138391 tok/s) step 23012/76294 | train loss 3.565405 | norm 6.4794 | lr 1.20e-04 | (3801.22 ms | 137926 tok/s) step 23013/76294 | train loss 3.446614 | norm 5.6572 | lr 1.20e-04 | (3797.54 ms | 138060 tok/s) step 23014/76294 | train 
loss 3.419132 | norm 7.6790 | lr 1.20e-04 | (3797.98 ms | 138044 tok/s) step 23015/76294 | train loss 3.517027 | norm 6.8534 | lr 1.20e-04 | (3881.59 ms | 135071 tok/s) step 23016/76294 | train loss 3.429287 | norm 9.1861 | lr 1.20e-04 | (3791.44 ms | 138282 tok/s) step 23017/76294 | train loss 3.445695 | norm 8.3070 | lr 1.20e-04 | (3790.96 ms | 138300 tok/s) step 23018/76294 | train loss 3.484241 | norm 7.5840 | lr 1.20e-04 | (3810.00 ms | 137608 tok/s) step 23019/76294 | train loss 3.439339 | norm 31.6765 | lr 1.20e-04 | (3801.47 ms | 137917 tok/s) step 23020/76294 | train loss 3.475618 | norm 7.0175 | lr 1.20e-04 | (3791.62 ms | 138275 tok/s) step 23021/76294 | train loss 3.546742 | norm 6.6292 | lr 1.20e-04 | (3818.97 ms | 137285 tok/s) step 23022/76294 | train loss 3.414740 | norm 4.9014 | lr 1.20e-04 | (3791.74 ms | 138271 tok/s) step 23023/76294 | train loss 3.479062 | norm 3.6663 | lr 1.20e-04 | (3802.70 ms | 137873 tok/s) step 23024/76294 | train loss 3.425577 | norm 8.0163 | lr 1.20e-04 | (3791.78 ms | 138270 tok/s) step 23025/76294 | train loss 3.403989 | norm 6.4831 | lr 1.20e-04 | (3795.76 ms | 138125 tok/s) step 23026/76294 | train loss 3.730262 | norm 11.8622 | lr 1.20e-04 | (3813.34 ms | 137488 tok/s) step 23027/76294 | train loss 3.451418 | norm 10.4179 | lr 1.20e-04 | (3795.14 ms | 138147 tok/s) step 23028/76294 | train loss 3.449894 | norm 8.8600 | lr 1.20e-04 | (3794.11 ms | 138185 tok/s) step 23029/76294 | train loss 3.475318 | norm 11.1948 | lr 1.20e-04 | (3820.52 ms | 137230 tok/s) step 23030/76294 | train loss 3.376355 | norm 4.0516 | lr 1.20e-04 | (3812.17 ms | 137530 tok/s) step 23031/76294 | train loss 3.471861 | norm 5.6094 | lr 1.20e-04 | (3794.19 ms | 138182 tok/s) step 23032/76294 | train loss 3.398512 | norm 3.3507 | lr 1.20e-04 | (3797.31 ms | 138068 tok/s) step 23033/76294 | train loss 3.382618 | norm 5.5112 | lr 1.20e-04 | (3801.09 ms | 137931 tok/s) step 23034/76294 | train loss 3.388517 | norm 8.3141 | lr 1.20e-04 | (3800.98 ms 
| 137935 tok/s) step 23035/76294 | train loss 3.435781 | norm 7.9878 | lr 1.20e-04 | (3791.79 ms | 138269 tok/s) step 23036/76294 | train loss 3.455320 | norm 6.5475 | lr 1.20e-04 | (3801.59 ms | 137913 tok/s) step 23037/76294 | train loss 3.454157 | norm 8.5799 | lr 1.20e-04 | (3801.28 ms | 137924 tok/s) step 23038/76294 | train loss 3.610562 | norm 5.3578 | lr 1.20e-04 | (3796.95 ms | 138081 tok/s) step 23039/76294 | train loss 3.427297 | norm 5.1452 | lr 1.20e-04 | (3829.06 ms | 136924 tok/s) step 23040/76294 | train loss 3.453330 | norm 7.6421 | lr 1.20e-04 | (3934.72 ms | 133247 tok/s) step 23041/76294 | train loss 3.439977 | norm 5.4476 | lr 1.20e-04 | (3790.93 ms | 138301 tok/s) step 23042/76294 | train loss 3.635909 | norm 9.3252 | lr 1.20e-04 | (3799.66 ms | 137983 tok/s) step 23043/76294 | train loss 3.473914 | norm 6.4081 | lr 1.20e-04 | (3801.62 ms | 137912 tok/s) step 23044/76294 | train loss 3.554389 | norm 10.9593 | lr 1.20e-04 | (3799.59 ms | 137985 tok/s) step 23045/76294 | train loss 3.462091 | norm 3.1393 | lr 1.20e-04 | (3792.68 ms | 138237 tok/s) step 23046/76294 | train loss 3.400367 | norm 3.6445 | lr 1.20e-04 | (3799.85 ms | 137976 tok/s) step 23047/76294 | train loss 3.442247 | norm 34.1499 | lr 1.20e-04 | (3802.96 ms | 137863 tok/s) step 23048/76294 | train loss 3.531540 | norm 6.3260 | lr 1.20e-04 | (3804.38 ms | 137812 tok/s) step 23049/76294 | train loss 3.384173 | norm 5.9294 | lr 1.20e-04 | (3797.31 ms | 138068 tok/s) step 23050/76294 | train loss 3.502270 | norm 9.5047 | lr 1.20e-04 | (3820.49 ms | 137231 tok/s) step 23051/76294 | train loss 3.445340 | norm 15.2912 | lr 1.20e-04 | (5825.92 ms | 89992 tok/s) step 23052/76294 | train loss 3.458294 | norm 6.9360 | lr 1.20e-04 | (3798.10 ms | 138040 tok/s) step 23053/76294 | train loss 3.477624 | norm 9.4592 | lr 1.20e-04 | (3852.16 ms | 136102 tok/s) step 23054/76294 | train loss 3.444214 | norm 8.6996 | lr 1.20e-04 | (3789.75 ms | 138344 tok/s) step 23055/76294 | train loss 3.437434 | 
norm 10.7871 | lr 1.20e-04 | (3791.95 ms | 138263 tok/s) step 23056/76294 | train loss 3.428277 | norm 12.0144 | lr 1.20e-04 | (3829.94 ms | 136892 tok/s) step 23057/76294 | train loss 3.549706 | norm 6.3563 | lr 1.20e-04 | (3790.57 ms | 138314 tok/s) step 23058/76294 | train loss 3.380000 | norm 12.0826 | lr 1.20e-04 | (3797.39 ms | 138065 tok/s) step 23059/76294 | train loss 3.454538 | norm 14.1622 | lr 1.20e-04 | (3930.50 ms | 133390 tok/s) step 23060/76294 | train loss 3.421290 | norm 6.0662 | lr 1.20e-04 | (3790.01 ms | 138334 tok/s) step 23061/76294 | train loss 3.419045 | norm 11.9569 | lr 1.20e-04 | (3838.58 ms | 136584 tok/s) step 23062/76294 | train loss 3.451522 | norm 7.0520 | lr 1.20e-04 | (3806.21 ms | 137746 tok/s) step 23063/76294 | train loss 3.441383 | norm 4.7163 | lr 1.20e-04 | (3796.20 ms | 138108 tok/s) step 23064/76294 | train loss 3.440586 | norm 8.2885 | lr 1.20e-04 | (4740.39 ms | 110600 tok/s) step 23065/76294 | train loss 3.436606 | norm 8.3764 | lr 1.20e-04 | (3941.37 ms | 133022 tok/s) step 23066/76294 | train loss 3.455706 | norm 8.2420 | lr 1.20e-04 | (3791.81 ms | 138269 tok/s) step 23067/76294 | train loss 3.449321 | norm 8.8401 | lr 1.20e-04 | (3914.49 ms | 133935 tok/s) step 23068/76294 | train loss 3.454104 | norm 19.6966 | lr 1.20e-04 | (3799.68 ms | 137982 tok/s) step 23069/76294 | train loss 3.496883 | norm 8.9644 | lr 1.20e-04 | (3841.04 ms | 136496 tok/s) step 23070/76294 | train loss 3.414633 | norm 6.2395 | lr 1.20e-04 | (3791.45 ms | 138282 tok/s) step 23071/76294 | train loss 3.454921 | norm 6.8037 | lr 1.20e-04 | (3829.86 ms | 136895 tok/s) step 23072/76294 | train loss 3.523454 | norm 8.1225 | lr 1.20e-04 | (3792.00 ms | 138262 tok/s) step 23073/76294 | train loss 3.466638 | norm 6.3168 | lr 1.20e-04 | (3797.35 ms | 138067 tok/s) step 23074/76294 | train loss 3.501893 | norm 14.0014 | lr 1.20e-04 | (3817.57 ms | 137336 tok/s) step 23075/76294 | train loss 3.407312 | norm 8.0538 | lr 1.20e-04 | (3797.40 ms | 138065 
tok/s) step 23076/76294 | train loss 3.410954 | norm 11.9056 | lr 1.20e-04 | (3795.41 ms | 138137 tok/s) step 23077/76294 | train loss 3.446769 | norm 11.3491 | lr 1.20e-04 | (3795.02 ms | 138151 tok/s) step 23078/76294 | train loss 3.521917 | norm 14.6973 | lr 1.20e-04 | (3792.39 ms | 138248 tok/s) step 23079/76294 | train loss 3.491827 | norm 18.2137 | lr 1.20e-04 | (3820.85 ms | 137218 tok/s) step 23080/76294 | train loss 3.468214 | norm 18.2930 | lr 1.20e-04 | (3821.37 ms | 137199 tok/s) step 23081/76294 | train loss 3.386512 | norm 11.1101 | lr 1.20e-04 | (3796.49 ms | 138098 tok/s) step 23082/76294 | train loss 3.517545 | norm 10.9338 | lr 1.20e-04 | (3812.37 ms | 137523 tok/s) step 23083/76294 | train loss 3.444633 | norm 16.8820 | lr 1.20e-04 | (3791.69 ms | 138273 tok/s) step 23084/76294 | train loss 3.495603 | norm 8.1868 | lr 1.20e-04 | (3800.90 ms | 137938 tok/s) step 23085/76294 | train loss 3.492381 | norm 19.4715 | lr 1.20e-04 | (3796.76 ms | 138088 tok/s) step 23086/76294 | train loss 3.451649 | norm 13.7415 | lr 1.20e-04 | (3793.94 ms | 138191 tok/s) step 23087/76294 | train loss 3.470528 | norm 8.3007 | lr 1.20e-04 | (3816.78 ms | 137364 tok/s) step 23088/76294 | train loss 3.469552 | norm 19.4981 | lr 1.20e-04 | (3801.57 ms | 137914 tok/s) step 23089/76294 | train loss 3.487250 | norm 14.8659 | lr 1.20e-04 | (3798.56 ms | 138023 tok/s) step 23090/76294 | train loss 3.540025 | norm 8.9409 | lr 1.20e-04 | (3797.70 ms | 138054 tok/s) step 23091/76294 | train loss 3.474589 | norm 13.9749 | lr 1.20e-04 | (3895.34 ms | 134594 tok/s) step 23092/76294 | train loss 3.442007 | norm 10.8305 | lr 1.20e-04 | (3798.68 ms | 138019 tok/s) step 23093/76294 | train loss 3.607581 | norm 20.4760 | lr 1.20e-04 | (3829.33 ms | 136914 tok/s) step 23094/76294 | train loss 3.443364 | norm 16.0016 | lr 1.20e-04 | (3796.65 ms | 138092 tok/s) step 23095/76294 | train loss 3.489929 | norm 22.9169 | lr 1.20e-04 | (3887.32 ms | 134871 tok/s) step 23096/76294 | train loss 
3.462563 | norm 63.5227 | lr 1.20e-04 | (3801.09 ms | 137931 tok/s)
step 23097/76294 | train loss 3.526841 | norm 28.4844 | lr 1.20e-04 | (3828.29 ms | 136951 tok/s)
step 23098/76294 | train loss 3.519389 | norm 14.3238 | lr 1.20e-04 | (3792.55 ms | 138242 tok/s)
slurmstepd: error: *** JOB 11683335 ON g077 CANCELLED AT 2024-10-12T14:40:00 DUE TO TIME LIMIT ***