[2024-09-02 17:03:02,219][accelerate.utils.other][WARNING] - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2024-09-02 17:03:02,226][Main][INFO] - Distributed environment: DistributedType.NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Mixed precision type: bf16
[2024-09-02 17:03:02,227][Main][INFO] - Working directory is /workspace/nanoT5/logs/2024-09-02/17-03-02
[2024-09-02 17:14:53,691][Main][INFO] - [train] Step 100 out of 65536 | Loss --> 51.971 | Grad_l2 --> 82.676 | Weights_l2 --> 7042.062 | Lr --> 0.010 | Seconds_per_step --> 6.760 |
[2024-09-02 17:20:23,699][Main][INFO] - [train] Step 200 out of 65536 | Loss --> 14.150 | Grad_l2 --> 19.390 | Weights_l2 --> 7034.376 | Lr --> 0.010 | Seconds_per_step --> 3.300 |
[2024-09-02 17:25:54,840][Main][INFO] - [train] Step 300 out of 65536 | Loss --> 9.006 | Grad_l2 --> 9.061 | Weights_l2 --> 7026.824 | Lr --> 0.010 | Seconds_per_step --> 3.311 |
[2024-09-02 17:31:26,095][Main][INFO] - [train] Step 400 out of 65536 | Loss --> 7.529 | Grad_l2 --> 5.889 | Weights_l2 --> 7019.014 | Lr --> 0.010 | Seconds_per_step --> 3.313 |
[2024-09-02 17:36:56,190][Main][INFO] - [train] Step 500 out of 65536 | Loss --> 6.618 | Grad_l2 --> 4.039 | Weights_l2 --> 7010.897 | Lr --> 0.011 | Seconds_per_step --> 3.301 |
[2024-09-02 17:42:27,693][Main][INFO] - [train] Step 600 out of 65536 | Loss --> 5.994 | Grad_l2 --> 2.962 | Weights_l2 --> 7002.549 | Lr --> 0.011 | Seconds_per_step --> 3.315 |
[2024-09-02 17:47:57,967][Main][INFO] - [train] Step 700 out of 65536 | Loss --> 5.703 | Grad_l2 --> 2.434 | Weights_l2 --> 6994.267 | Lr --> 0.011 | Seconds_per_step --> 3.303 |
[2024-09-02 17:53:29,228][Main][INFO] - [train] Step 800 out of 65536 | Loss --> 6.603 | Grad_l2 --> 6.221 | Weights_l2 --> 6985.927 | Lr --> 0.011 | Seconds_per_step --> 3.313 |
[2024-09-02 17:59:00,011][Main][INFO] - [train] Step 900 out of 65536 | Loss --> 5.408 | Grad_l2 --> 1.465 | Weights_l2 --> 6980.026 | Lr --> 0.011 | Seconds_per_step --> 3.308 |
[2024-09-02 18:04:30,275][Main][INFO] - [train] Step 1000 out of 65536 | Loss --> 5.311 | Grad_l2 --> 0.992 | Weights_l2 --> 6975.109 | Lr --> 0.011 | Seconds_per_step --> 3.303 |
[2024-09-02 18:10:01,468][Main][INFO] - [train] Step 1100 out of 65536 | Loss --> 5.241 | Grad_l2 --> 0.854 | Weights_l2 --> 6970.708 | Lr --> 0.011 | Seconds_per_step --> 3.312 |
[2024-09-02 18:15:33,362][Main][INFO] - [train] Step 1200 out of 65536 | Loss --> 5.180 | Grad_l2 --> 0.838 | Weights_l2 --> 6966.641 | Lr --> 0.011 | Seconds_per_step --> 3.319 |
[2024-09-02 18:21:03,902][Main][INFO] - [train] Step 1300 out of 65536 | Loss --> 5.126 | Grad_l2 --> 0.764 | Weights_l2 --> 6962.789 | Lr --> 0.011 | Seconds_per_step --> 3.305 |
[2024-09-02 18:26:35,349][Main][INFO] - [train] Step 1400 out of 65536 | Loss --> 5.088 | Grad_l2 --> 0.744 | Weights_l2 --> 6959.146 | Lr --> 0.011 | Seconds_per_step --> 3.314 |
[2024-09-02 18:32:06,048][Main][INFO] - [train] Step 1500 out of 65536 | Loss --> 5.046 | Grad_l2 --> 0.702 | Weights_l2 --> 6955.673 | Lr --> 0.012 | Seconds_per_step --> 3.307 |
[2024-09-02 18:37:37,903][Main][INFO] - [train] Step 1600 out of 65536 | Loss --> 5.007 | Grad_l2 --> 0.691 | Weights_l2 --> 6952.523 | Lr --> 0.012 | Seconds_per_step --> 3.319 |
[2024-09-02 18:43:09,723][Main][INFO] - [train] Step 1700 out of 65536 | Loss --> 4.973 | Grad_l2 --> 0.673 | Weights_l2 --> 6949.412 | Lr --> 0.012 | Seconds_per_step --> 3.318 |
[2024-09-02 18:48:40,909][Main][INFO] - [train] Step 1800 out of 65536 | Loss --> 4.943 | Grad_l2 --> 0.671 | Weights_l2 --> 6946.498 | Lr --> 0.012 | Seconds_per_step --> 3.312 |
[2024-09-02 18:54:13,524][Main][INFO] - [train] Step 1900 out of 65536 | Loss --> 4.929 | Grad_l2 --> 0.668 | Weights_l2 --> 6943.795 | Lr --> 0.012 | Seconds_per_step --> 3.326 |
[2024-09-02 18:59:45,500][Main][INFO] - [train] Step 2000 out of 65536 | Loss --> 4.894 | Grad_l2 --> 0.665 | Weights_l2 --> 6941.241 | Lr --> 0.012 | Seconds_per_step --> 3.320 |
[2024-09-02 19:05:16,395][Main][INFO] - [train] Step 2100 out of 65536 | Loss --> 4.881 | Grad_l2 --> 0.713 | Weights_l2 --> 6938.861 | Lr --> 0.012 | Seconds_per_step --> 3.309 |
[2024-09-02 19:10:48,520][Main][INFO] - [train] Step 2200 out of 65536 | Loss --> 4.853 | Grad_l2 --> 0.653 | Weights_l2 --> 6936.551 | Lr --> 0.012 | Seconds_per_step --> 3.321 |
[2024-09-02 19:16:19,278][Main][INFO] - [train] Step 2300 out of 65536 | Loss --> 4.829 | Grad_l2 --> 0.646 | Weights_l2 --> 6934.357 | Lr --> 0.012 | Seconds_per_step --> 3.308 |
[2024-09-02 19:21:51,370][Main][INFO] - [train] Step 2400 out of 65536 | Loss --> 4.790 | Grad_l2 --> 0.620 | Weights_l2 --> 6932.338 | Lr --> 0.012 | Seconds_per_step --> 3.321 |
[2024-09-02 19:27:23,544][Main][INFO] - [train] Step 2500 out of 65536 | Loss --> 4.784 | Grad_l2 --> 0.643 | Weights_l2 --> 6930.395 | Lr --> 0.013 | Seconds_per_step --> 3.322 |
[2024-09-02 19:32:54,341][Main][INFO] - [train] Step 2600 out of 65536 | Loss --> 4.755 | Grad_l2 --> 0.623 | Weights_l2 --> 6928.543 | Lr --> 0.013 | Seconds_per_step --> 3.308 |
[2024-09-02 19:38:25,942][Main][INFO] - [train] Step 2700 out of 65536 | Loss --> 4.743 | Grad_l2 --> 0.636 | Weights_l2 --> 6926.944 | Lr --> 0.013 | Seconds_per_step --> 3.316 |
[2024-09-02 19:43:57,708][Main][INFO] - [train] Step 2800 out of 65536 | Loss --> 4.722 | Grad_l2 --> 0.590 | Weights_l2 --> 6925.379 | Lr --> 0.013 | Seconds_per_step --> 3.318 |
[2024-09-02 19:49:28,285][Main][INFO] - [train] Step 2900 out of 65536 | Loss --> 4.715 | Grad_l2 --> 0.622 | Weights_l2 --> 6924.007 | Lr --> 0.013 | Seconds_per_step --> 3.306 |
[2024-09-02 19:54:59,957][Main][INFO] - [train] Step 3000 out of 65536 | Loss --> 4.694 | Grad_l2 --> 0.652 | Weights_l2 --> 6922.709 | Lr --> 0.013 | Seconds_per_step --> 3.317 |
[2024-09-02 20:00:31,072][Main][INFO] - [train] Step 3100 out of 65536 | Loss --> 4.678 | Grad_l2 --> 0.614 | Weights_l2 --> 6921.561 | Lr --> 0.013 | Seconds_per_step --> 3.311 |
[2024-09-02 20:06:02,747][Main][INFO] - [train] Step 3200 out of 65536 | Loss --> 4.633 | Grad_l2 --> 0.610 | Weights_l2 --> 6920.463 | Lr --> 0.013 | Seconds_per_step --> 3.317 |
[2024-09-02 20:11:34,607][Main][INFO] - [train] Step 3300 out of 65536 | Loss --> 4.599 | Grad_l2 --> 0.638 | Weights_l2 --> 6919.642 | Lr --> 0.013 | Seconds_per_step --> 3.319 |
[2024-09-02 20:17:05,731][Main][INFO] - [train] Step 3400 out of 65536 | Loss --> 4.549 | Grad_l2 --> 0.774 | Weights_l2 --> 6919.263 | Lr --> 0.013 | Seconds_per_step --> 3.311 |
[2024-09-02 20:22:37,601][Main][INFO] - [train] Step 3500 out of 65536 | Loss --> 4.420 | Grad_l2 --> 0.934 | Weights_l2 --> 6918.974 | Lr --> 0.014 | Seconds_per_step --> 3.319 |
[2024-09-02 20:28:09,554][Main][INFO] - [train] Step 3600 out of 65536 | Loss --> 4.256 | Grad_l2 --> 0.763 | Weights_l2 --> 6919.477 | Lr --> 0.014 | Seconds_per_step --> 3.319 |
[2024-09-02 20:33:40,654][Main][INFO] - [train] Step 3700 out of 65536 | Loss --> 4.131 | Grad_l2 --> 0.657 | Weights_l2 --> 6920.705 | Lr --> 0.014 | Seconds_per_step --> 3.311 |
[2024-09-02 20:39:13,064][Main][INFO] - [train] Step 3800 out of 65536 | Loss --> 4.021 | Grad_l2 --> 0.709 | Weights_l2 --> 6922.188 | Lr --> 0.014 | Seconds_per_step --> 3.324 |
[2024-09-02 20:44:45,663][Main][INFO] - [train] Step 3900 out of 65536 | Loss --> 3.909 | Grad_l2 --> 0.637 | Weights_l2 --> 6923.666 | Lr --> 0.014 | Seconds_per_step --> 3.326 |
[2024-09-02 20:50:16,811][Main][INFO] - [train] Step 4000 out of 65536 | Loss --> 3.855 | Grad_l2 --> 1.013 | Weights_l2 --> 6923.778 | Lr --> 0.014 | Seconds_per_step --> 3.311 |
[2024-09-02 20:55:49,235][Main][INFO] - [train] Step 4100 out of 65536 | Loss --> 3.770 | Grad_l2 --> 0.589 | Weights_l2 --> 6925.545 | Lr --> 0.014 | Seconds_per_step --> 3.324 |
[2024-09-02 21:01:20,500][Main][INFO] - [train] Step 4200 out of 65536 | Loss --> 3.710 | Grad_l2 --> 0.579 | Weights_l2 --> 6927.200 | Lr --> 0.014 | Seconds_per_step --> 3.313 |
[2024-09-02 21:06:53,406][Main][INFO] - [train] Step 4300 out of 65536 | Loss --> 3.651 | Grad_l2 --> 0.588 | Weights_l2 --> 6928.842 | Lr --> 0.014 | Seconds_per_step --> 3.329 |
[2024-09-02 21:12:26,298][Main][INFO] - [train] Step 4400 out of 65536 | Loss --> 3.614 | Grad_l2 --> 0.632 | Weights_l2 --> 6930.597 | Lr --> 0.014 | Seconds_per_step --> 3.329 |
[2024-09-02 21:17:57,623][Main][INFO] - [train] Step 4500 out of 65536 | Loss --> 3.582 | Grad_l2 --> 0.884 | Weights_l2 --> 6931.569 | Lr --> 0.015 | Seconds_per_step --> 3.313 |
[2024-09-02 21:23:30,116][Main][INFO] - [train] Step 4600 out of 65536 | Loss --> 3.527 | Grad_l2 --> 0.582 | Weights_l2 --> 6933.783 | Lr --> 0.015 | Seconds_per_step --> 3.325 |
[2024-09-02 21:29:02,417][Main][INFO] - [train] Step 4700 out of 65536 | Loss --> 3.476 | Grad_l2 --> 0.549 | Weights_l2 --> 6935.959 | Lr --> 0.015 | Seconds_per_step --> 3.323 |
[2024-09-02 21:34:33,535][Main][INFO] - [train] Step 4800 out of 65536 | Loss --> 3.430 | Grad_l2 --> 0.551 | Weights_l2 --> 6938.224 | Lr --> 0.015 | Seconds_per_step --> 3.311 |
[2024-09-02 21:40:05,905][Main][INFO] - [train] Step 4900 out of 65536 | Loss --> 3.395 | Grad_l2 --> 0.550 | Weights_l2 --> 6940.617 | Lr --> 0.015 | Seconds_per_step --> 3.324 |
[2024-09-02 21:45:36,944][Main][INFO] - [train] Step 5000 out of 65536 | Loss --> 3.366 | Grad_l2 --> 0.546 | Weights_l2 --> 6943.230 | Lr --> 0.015 | Seconds_per_step --> 3.310 |
[2024-09-02 21:45:36,947][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-5000
[2024-09-02 21:45:36,954][accelerate.utils.other][WARNING] - Removed shared tensor {'lm_head.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
[2024-09-02 21:45:44,182][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-5000/model.safetensors
[2024-09-02 21:45:54,822][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-5000/optimizer.bin
[2024-09-02 21:45:54,827][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-5000/scheduler.bin
[2024-09-02 21:45:54,828][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-5000/sampler.bin
[2024-09-02 21:45:54,829][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-5000/sampler_1.bin
[2024-09-02 21:45:54,835][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-5000/random_states_0.pkl
[2024-09-02 21:51:26,402][Main][INFO] - [train] Step 5100 out of 65536 | Loss --> 3.302 | Grad_l2 --> 0.541 | Weights_l2 --> 6946.278 | Lr --> 0.015 | Seconds_per_step --> 3.495 |
[2024-09-02 21:56:58,321][Main][INFO] - [train] Step 5200 out of 65536 | Loss --> 3.248 | Grad_l2 --> 0.556 | Weights_l2 --> 6950.060 | Lr --> 0.015 | Seconds_per_step --> 3.319 |
[2024-09-02 22:02:29,452][Main][INFO] - [train] Step 5300 out of 65536 | Loss --> 3.194 | Grad_l2 --> 0.566 | Weights_l2 --> 6954.461 | Lr --> 0.015 | Seconds_per_step --> 3.311 |
[2024-09-02 22:08:01,594][Main][INFO] - [train] Step 5400 out of 65536 | Loss --> 3.144 | Grad_l2 --> 0.548 | Weights_l2 --> 6959.061 | Lr --> 0.015 | Seconds_per_step --> 3.321 |
[2024-09-02 22:13:33,473][Main][INFO] - [train] Step 5500 out of 65536 | Loss --> 3.099 | Grad_l2 --> 0.546 | Weights_l2 --> 6963.676 | Lr --> 0.016 | Seconds_per_step --> 3.319 |
[2024-09-02 22:19:04,763][Main][INFO] - [train] Step 5600 out of 65536 | Loss --> 3.044 | Grad_l2 --> 0.531 | Weights_l2 --> 6968.055 | Lr --> 0.016 | Seconds_per_step --> 3.313 |
[2024-09-02 22:24:37,024][Main][INFO] - [train] Step 5700 out of 65536 | Loss --> 3.023 | Grad_l2 --> 0.528 | Weights_l2 --> 6972.595 | Lr --> 0.016 | Seconds_per_step --> 3.323 |
[2024-09-02 22:30:08,010][Main][INFO] - [train] Step 5800 out of 65536 | Loss --> 2.999 | Grad_l2 --> 0.529 | Weights_l2 --> 6977.095 | Lr --> 0.016 | Seconds_per_step --> 3.310 |
[2024-09-02 22:35:40,260][Main][INFO] - [train] Step 5900 out of 65536 | Loss --> 2.953 | Grad_l2 --> 0.516 | Weights_l2 --> 6981.522 | Lr --> 0.016 | Seconds_per_step --> 3.322 |
[2024-09-02 22:41:12,494][Main][INFO] - [train] Step 6000 out of 65536 | Loss --> 2.924 | Grad_l2 --> 0.514 | Weights_l2 --> 6985.860 | Lr --> 0.016 | Seconds_per_step --> 3.322 |
[2024-09-02 22:46:43,439][Main][INFO] - [train] Step 6100 out of 65536 | Loss --> 2.904 | Grad_l2 --> 0.500 | Weights_l2 --> 6990.209 | Lr --> 0.016 | Seconds_per_step --> 3.309 |
[2024-09-02 22:52:15,361][Main][INFO] - [train] Step 6200 out of 65536 | Loss --> 2.885 | Grad_l2 --> 0.499 | Weights_l2 --> 6994.575 | Lr --> 0.016 | Seconds_per_step --> 3.319 |
[2024-09-02 22:57:47,371][Main][INFO] - [train] Step 6300 out of 65536 | Loss --> 2.860 | Grad_l2 --> 0.496 | Weights_l2 --> 6998.855 | Lr --> 0.016 | Seconds_per_step --> 3.320 |
[2024-09-02 23:03:18,243][Main][INFO] - [train] Step 6400 out of 65536 | Loss --> 2.828 | Grad_l2 --> 0.486 | Weights_l2 --> 7003.354 | Lr --> 0.016 | Seconds_per_step --> 3.309 |
[2024-09-02 23:08:50,256][Main][INFO] - [train] Step 6500 out of 65536 | Loss --> 2.823 | Grad_l2 --> 0.491 | Weights_l2 --> 7007.772 | Lr --> 0.017 | Seconds_per_step --> 3.320 |
[2024-09-02 23:14:21,254][Main][INFO] - [train] Step 6600 out of 65536 | Loss --> 2.801 | Grad_l2 --> 0.572 | Weights_l2 --> 7012.034 | Lr --> 0.017 | Seconds_per_step --> 3.310 |
[2024-09-02 23:19:53,383][Main][INFO] - [train] Step 6700 out of 65536 | Loss --> 2.776 | Grad_l2 --> 0.473 | Weights_l2 --> 7016.624 | Lr --> 0.017 | Seconds_per_step --> 3.321 |
[2024-09-02 23:25:25,894][Main][INFO] - [train] Step 6800 out of 65536 | Loss --> 2.764 | Grad_l2 --> 0.489 | Weights_l2 --> 7021.128 | Lr --> 0.017 | Seconds_per_step --> 3.325 |
[2024-09-02 23:30:56,990][Main][INFO] - [train] Step 6900 out of 65536 | Loss --> 2.754 | Grad_l2 --> 0.467 | Weights_l2 --> 7025.909 | Lr --> 0.017 | Seconds_per_step --> 3.311 |
[2024-09-02 23:36:28,837][Main][INFO] - [train] Step 7000 out of 65536 | Loss --> 2.716 | Grad_l2 --> 0.469 | Weights_l2 --> 7030.583 | Lr --> 0.017 | Seconds_per_step --> 3.318 |
[2024-09-02 23:42:00,897][Main][INFO] - [train] Step 7100 out of 65536 | Loss --> 2.706 | Grad_l2 --> 0.470 | Weights_l2 --> 7035.338 | Lr --> 0.017 | Seconds_per_step --> 3.321 |
[2024-09-02 23:47:31,913][Main][INFO] - [train] Step 7200 out of 65536 | Loss --> 2.685 | Grad_l2 --> 0.460 | Weights_l2 --> 7040.107 | Lr --> 0.017 | Seconds_per_step --> 3.310 |
[2024-09-02 23:53:04,028][Main][INFO] - [train] Step 7300 out of 65536 | Loss --> 2.675 | Grad_l2 --> 0.462 | Weights_l2 --> 7044.921 | Lr --> 0.017 | Seconds_per_step --> 3.321 |
[2024-09-02 23:58:35,224][Main][INFO] - [train] Step 7400 out of 65536 | Loss --> 2.670 | Grad_l2 --> 0.473 | Weights_l2 --> 7049.994 | Lr --> 0.017 | Seconds_per_step --> 3.312 |
[2024-09-03 00:04:07,495][Main][INFO] - [train] Step 7500 out of 65536 | Loss --> 2.653 | Grad_l2 --> 0.452 | Weights_l2 --> 7055.123 | Lr --> 0.018 | Seconds_per_step --> 3.323 |
[2024-09-03 00:09:39,687][Main][INFO] - [train] Step 7600 out of 65536 | Loss --> 2.644 | Grad_l2 --> 0.499 | Weights_l2 --> 7060.263 | Lr --> 0.018 | Seconds_per_step --> 3.322 |
[2024-09-03 00:15:11,125][Main][INFO] - [train] Step 7700 out of 65536 | Loss --> 2.619 | Grad_l2 --> 0.451 | Weights_l2 --> 7065.593 | Lr --> 0.018 | Seconds_per_step --> 3.314 |
[2024-09-03 00:20:43,656][Main][INFO] - [train] Step 7800 out of 65536 | Loss --> 2.611 | Grad_l2 --> 0.444 | Weights_l2 --> 7071.016 | Lr --> 0.018 | Seconds_per_step --> 3.325 |
[2024-09-03 00:26:15,825][Main][INFO] - [train] Step 7900 out of 65536 | Loss --> 2.593 | Grad_l2 --> 0.444 | Weights_l2 --> 7076.338 | Lr --> 0.018 | Seconds_per_step --> 3.322 |
[2024-09-03 00:31:46,986][Main][INFO] - [train] Step 8000 out of 65536 | Loss --> 2.591 | Grad_l2 --> 0.707 | Weights_l2 --> 7081.619 | Lr --> 0.018 | Seconds_per_step --> 3.312 |
[2024-09-03 00:37:19,240][Main][INFO] - [train] Step 8100 out of 65536 | Loss --> 2.583 | Grad_l2 --> 0.504 | Weights_l2 --> 7087.303 | Lr --> 0.018 | Seconds_per_step --> 3.323 |
[2024-09-03 00:42:50,497][Main][INFO] - [train] Step 8200 out of 65536 | Loss --> 2.572 | Grad_l2 --> 0.435 | Weights_l2 --> 7092.976 | Lr --> 0.018 | Seconds_per_step --> 3.313 |
[2024-09-03 00:48:22,669][Main][INFO] - [train] Step 8300 out of 65536 | Loss --> 2.550 | Grad_l2 --> 0.444 | Weights_l2 --> 7098.242 | Lr --> 0.018 | Seconds_per_step --> 3.322 |
[2024-09-03 00:53:54,859][Main][INFO] - [train] Step 8400 out of 65536 | Loss --> 2.533 | Grad_l2 --> 0.424 | Weights_l2 --> 7103.870 | Lr --> 0.018 | Seconds_per_step --> 3.322 |
[2024-09-03 00:59:25,959][Main][INFO] - [train] Step 8500 out of 65536 | Loss --> 2.520 | Grad_l2 --> 0.415 | Weights_l2 --> 7109.426 | Lr --> 0.019 | Seconds_per_step --> 3.311 |
[2024-09-03 01:04:58,102][Main][INFO] - [train] Step 8600 out of 65536 | Loss --> 2.512 | Grad_l2 --> 0.445 | Weights_l2 --> 7115.243 | Lr --> 0.019 | Seconds_per_step --> 3.321 |
[2024-09-03 01:10:30,308][Main][INFO] - [train] Step 8700 out of 65536 | Loss --> 2.497 | Grad_l2 --> 0.416 | Weights_l2 --> 7120.917 | Lr --> 0.019 | Seconds_per_step --> 3.322 |
[2024-09-03 01:16:01,412][Main][INFO] - [train] Step 8800 out of 65536 | Loss --> 2.503 | Grad_l2 --> 0.453 | Weights_l2 --> 7127.067 | Lr --> 0.019 | Seconds_per_step --> 3.311 |
[2024-09-03 01:21:33,679][Main][INFO] - [train] Step 8900 out of 65536 | Loss --> 2.498 | Grad_l2 --> 0.519 | Weights_l2 --> 7133.268 | Lr --> 0.019 | Seconds_per_step --> 3.323 |
[2024-09-03 01:27:05,633][Main][INFO] - [train] Step 9000 out of 65536 | Loss --> 2.480 | Grad_l2 --> 0.413 | Weights_l2 --> 7139.449 | Lr --> 0.019 | Seconds_per_step --> 3.320 |
[2024-09-03 01:32:36,839][Main][INFO] - [train] Step 9100 out of 65536 | Loss --> 2.488 | Grad_l2 --> 0.429 | Weights_l2 --> 7145.663 | Lr --> 0.019 | Seconds_per_step --> 3.312 |
[2024-09-03 01:38:09,090][Main][INFO] - [train] Step 9200 out of 65536 | Loss --> 2.458 | Grad_l2 --> 0.651 | Weights_l2 --> 7151.751 | Lr --> 0.019 | Seconds_per_step --> 3.322 |
[2024-09-03 01:43:40,183][Main][INFO] - [train] Step 9300 out of 65536 | Loss --> 2.481 | Grad_l2 --> 0.667 | Weights_l2 --> 7157.979 | Lr --> 0.019 | Seconds_per_step --> 3.311 |
[2024-09-03 01:49:12,323][Main][INFO] - [train] Step 9400 out of 65536 | Loss --> 2.454 | Grad_l2 --> 0.500 | Weights_l2 --> 7164.722 | Lr --> 0.019 | Seconds_per_step --> 3.321 |
[2024-09-03 01:54:44,360][Main][INFO] - [train] Step 9500 out of 65536 | Loss --> 2.434 | Grad_l2 --> 0.434 | Weights_l2 --> 7171.100 | Lr --> 0.020 | Seconds_per_step --> 3.320 |
[2024-09-03 02:00:15,384][Main][INFO] - [train] Step 9600 out of 65536 | Loss --> 2.430 | Grad_l2 --> 0.459 | Weights_l2 --> 7177.669 | Lr --> 0.020 | Seconds_per_step --> 3.310 |
[2024-09-03 02:05:47,653][Main][INFO] - [train] Step 9700 out of 65536 | Loss --> 2.435 | Grad_l2 --> 0.458 | Weights_l2 --> 7184.407 | Lr --> 0.020 | Seconds_per_step --> 3.323 |
[2024-09-03 02:11:19,839][Main][INFO] - [train] Step 9800 out of 65536 | Loss --> 2.431 | Grad_l2 --> 0.796 | Weights_l2 --> 7190.992 | Lr --> 0.020 | Seconds_per_step --> 3.322 |
[2024-09-03 02:16:50,929][Main][INFO] - [train] Step 9900 out of 65536 | Loss --> 2.403 | Grad_l2 --> 0.782 | Weights_l2 --> 7197.863 | Lr --> 0.020 | Seconds_per_step --> 3.311 |
[2024-09-03 02:22:23,236][Main][INFO] - [train] Step 10000 out of 65536 | Loss --> 2.445 | Grad_l2 --> 1.140 | Weights_l2 --> 7204.637 | Lr --> 0.020 | Seconds_per_step --> 3.323 |
[2024-09-03 02:22:23,238][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-10000
[2024-09-03 02:22:23,245][accelerate.utils.other][WARNING] - Removed shared tensor {'lm_head.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
[2024-09-03 02:22:29,395][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-10000/model.safetensors
[2024-09-03 02:22:38,780][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-10000/optimizer.bin
[2024-09-03 02:22:38,784][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-10000/scheduler.bin
[2024-09-03 02:22:38,784][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-10000/sampler.bin
[2024-09-03 02:22:38,785][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-10000/sampler_1.bin
[2024-09-03 02:22:38,790][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-10000/random_states_0.pkl
[2024-09-03 02:28:09,713][Main][INFO] - [train] Step 10100 out of 65536 | Loss --> 2.441 | Grad_l2 --> 1.063 | Weights_l2 --> 7212.671 | Lr --> 0.020 | Seconds_per_step --> 3.465 |
[2024-09-03 02:33:42,096][Main][INFO] - [train] Step 10200 out of 65536 | Loss --> 2.421 | Grad_l2 --> 1.135 | Weights_l2 --> 7219.539 | Lr --> 0.020 | Seconds_per_step --> 3.324 |
[2024-09-03 02:39:14,331][Main][INFO] - [train] Step 10300 out of 65536 | Loss --> 2.408 | Grad_l2 --> 1.377 | Weights_l2 --> 7226.397 | Lr --> 0.020 | Seconds_per_step --> 3.322 |
[2024-09-03 02:44:45,309][Main][INFO] - [train] Step 10400 out of 65536 | Loss --> 2.385 | Grad_l2 --> 1.568 | Weights_l2 --> 7232.973 | Lr --> 0.020 | Seconds_per_step --> 3.310 |
[2024-09-03 02:50:17,356][Main][INFO] - [train] Step 10500 out of 65536 | Loss --> 2.383 | Grad_l2 --> 5.267 | Weights_l2 --> 7238.788 | Lr --> 0.020 | Seconds_per_step --> 3.320 |
[2024-09-03 02:55:49,191][Main][INFO] - [train] Step 10600 out of 65536 | Loss --> 51.695 | Grad_l2 --> 2316.455 | Weights_l2 --> 7233.899 | Lr --> 0.020 | Seconds_per_step --> 3.318 |
[2024-09-03 03:01:20,350][Main][INFO] - [train] Step 10700 out of 65536 | Loss --> 19.189 | Grad_l2 --> 206.407 | Weights_l2 --> 7221.798 | Lr --> 0.020 | Seconds_per_step --> 3.312 |
[2024-09-03 03:06:52,743][Main][INFO] - [train] Step 10800 out of 65536 | Loss --> 6.908 | Grad_l2 --> 26.249 | Weights_l2 --> 7210.980 | Lr --> 0.020 | Seconds_per_step --> 3.324 |
[2024-09-03 03:12:23,733][Main][INFO] - [train] Step 10900 out of 65536 | Loss --> 42.736 | Grad_l2 --> 1292.659 | Weights_l2 --> 7206.464 | Lr --> 0.020 | Seconds_per_step --> 3.310 |