Commit cb35e14 by bigscience-bot
Parent(s): 2dba181
new data
logs/main_log.txt +200 -0
@@ -77002,3 +77002,203 @@ time (ms)
time (ms)
iteration 582/ 292968 | consumed samples: 1191936 | consumed tokens: 90669056 | elapsed time per iteration (ms): 108157.8 | learning rate: 3.178E-05 | global batch size: 2048 | lm loss: 5.475502E+00 | loss scale: 8192.0 | grad norm: 10832.692 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
+
iteration 583/ 292968 | consumed samples: 1193984 | consumed tokens: 90865664 | elapsed time per iteration (ms): 108967.2 | learning rate: 3.184E-05 | global batch size: 2048 | lm loss: 5.494294E+00 | loss scale: 8192.0 | grad norm: 14744.932 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 584/ 292968 | consumed samples: 1196032 | consumed tokens: 91062272 | elapsed time per iteration (ms): 106812.8 | learning rate: 3.189E-05 | global batch size: 2048 | lm loss: 5.487658E+00 | loss scale: 8192.0 | grad norm: 8967.567 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 585/ 292968 | consumed samples: 1198080 | consumed tokens: 91258880 | elapsed time per iteration (ms): 110130.1 | learning rate: 3.195E-05 | global batch size: 2048 | lm loss: 5.488459E+00 | loss scale: 8192.0 | grad norm: 14768.019 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 586/ 292968 | consumed samples: 1200128 | consumed tokens: 91455488 | elapsed time per iteration (ms): 106231.0 | learning rate: 3.200E-05 | global batch size: 2048 | lm loss: 5.488029E+00 | loss scale: 8192.0 | grad norm: 13756.417 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 587/ 292968 | consumed samples: 1202176 | consumed tokens: 91652096 | elapsed time per iteration (ms): 106565.7 | learning rate: 3.206E-05 | global batch size: 2048 | lm loss: 5.448896E+00 | loss scale: 8192.0 | grad norm: 8670.093 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 588/ 292968 | consumed samples: 1204224 | consumed tokens: 91848704 | elapsed time per iteration (ms): 106823.5 | learning rate: 3.211E-05 | global batch size: 2048 | lm loss: 5.481108E+00 | loss scale: 8192.0 | grad norm: 13747.563 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 589/ 292968 | consumed samples: 1206272 | consumed tokens: 92045312 | elapsed time per iteration (ms): 109210.1 | learning rate: 3.217E-05 | global batch size: 2048 | lm loss: 5.483897E+00 | loss scale: 8192.0 | grad norm: 13030.572 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 590/ 292968 | consumed samples: 1208320 | consumed tokens: 92241920 | elapsed time per iteration (ms): 107071.2 | learning rate: 3.222E-05 | global batch size: 2048 | lm loss: 5.499794E+00 | loss scale: 8192.0 | grad norm: 12956.695 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 591/ 292968 | consumed samples: 1210368 | consumed tokens: 92438528 | elapsed time per iteration (ms): 107481.3 | learning rate: 3.228E-05 | global batch size: 2048 | lm loss: 5.458858E+00 | loss scale: 8192.0 | grad norm: 8716.189 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 592/ 292968 | consumed samples: 1212416 | consumed tokens: 92635136 | elapsed time per iteration (ms): 108187.6 | learning rate: 3.233E-05 | global batch size: 2048 | lm loss: 5.468006E+00 | loss scale: 8192.0 | grad norm: 10982.591 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 593/ 292968 | consumed samples: 1214464 | consumed tokens: 92831744 | elapsed time per iteration (ms): 107146.7 | learning rate: 3.239E-05 | global batch size: 2048 | lm loss: 5.428665E+00 | loss scale: 8192.0 | grad norm: 10539.232 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 594/ 292968 | consumed samples: 1216512 | consumed tokens: 93028352 | elapsed time per iteration (ms): 110124.1 | learning rate: 3.244E-05 | global batch size: 2048 | lm loss: 5.442387E+00 | loss scale: 8192.0 | grad norm: 13381.277 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 595/ 292968 | consumed samples: 1218560 | consumed tokens: 93224960 | elapsed time per iteration (ms): 106387.0 | learning rate: 3.249E-05 | global batch size: 2048 | lm loss: 5.484375E+00 | loss scale: 8192.0 | grad norm: 11482.399 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 596/ 292968 | consumed samples: 1220608 | consumed tokens: 93421568 | elapsed time per iteration (ms): 108330.7 | learning rate: 3.255E-05 | global batch size: 2048 | lm loss: 5.424896E+00 | loss scale: 8192.0 | grad norm: 12097.178 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 597/ 292968 | consumed samples: 1222656 | consumed tokens: 93618176 | elapsed time per iteration (ms): 107065.9 | learning rate: 3.260E-05 | global batch size: 2048 | lm loss: 5.433896E+00 | loss scale: 8192.0 | grad norm: 15293.672 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 598/ 292968 | consumed samples: 1224704 | consumed tokens: 93814784 | elapsed time per iteration (ms): 106989.0 | learning rate: 3.266E-05 | global batch size: 2048 | lm loss: 5.436405E+00 | loss scale: 8192.0 | grad norm: 11111.761 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 599/ 292968 | consumed samples: 1226752 | consumed tokens: 94011392 | elapsed time per iteration (ms): 106858.4 | learning rate: 3.271E-05 | global batch size: 2048 | lm loss: 5.414397E+00 | loss scale: 8192.0 | grad norm: 13962.838 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
iteration 600/ 292968 | consumed samples: 1228800 | consumed tokens: 94208000 | elapsed time per iteration (ms): 107260.3 | learning rate: 3.277E-05 | global batch size: 2048 | lm loss: 5.419570E+00 | loss scale: 8192.0 | grad norm: 11387.759 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
+
time (ms)
+
-----------------------------------------------------------------------------------------------
+
validation loss at iteration 600 | lm loss value: 5.387414E+00 | lm loss PPL: 2.186374E+02 |
+
-----------------------------------------------------------------------------------------------
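A quick consistency check on the figures logged above, as a minimal sketch rather than anything taken from the training code. `parse_fields` is a hypothetical helper, and the two relations it checks (tokens per step equal to global batch size times curriculum seqlen, and PPL equal to the exponential of the lm loss) are assumptions inferred from the logged numbers, not confirmed by the source.

```python
import math

# Two lines copied verbatim from the log above.
ITER_LINE = (
    "iteration 600/ 292968 | consumed samples: 1228800 | consumed tokens: 94208000 | "
    "elapsed time per iteration (ms): 107260.3 | learning rate: 3.277E-05 | "
    "global batch size: 2048 | lm loss: 5.419570E+00 | loss scale: 8192.0 | "
    "grad norm: 11387.759 | num zeros: 0.0 | curriculum seqlen: 96 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)
VAL_LINE = (
    "validation loss at iteration 600 | lm loss value: 5.387414E+00 | "
    "lm loss PPL: 2.186374E+02 |"
)


def parse_fields(line):
    """Hypothetical helper: turn 'name: value | name: value |' into {name: float}."""
    fields = {}
    for part in line.split("|"):
        name, sep, value = part.rpartition(":")
        if not sep:
            continue  # segments without 'name: value', e.g. 'iteration 600/ 292968'
        try:
            fields[name.strip()] = float(value)
        except ValueError:
            pass  # skip non-numeric values
    return fields


it = parse_fields(ITER_LINE)
val = parse_fields(VAL_LINE)

# Assumption: each step consumes global batch size * curriculum seqlen tokens.
tokens_per_step = it["global batch size"] * it["curriculum seqlen"]
tokens_per_second = tokens_per_step / (it["elapsed time per iteration (ms)"] / 1000.0)

# Assumption: the logged PPL is exp(lm loss value).
ppl = math.exp(val["lm loss value"])

print(f"tokens per step:   {tokens_per_step:.0f}")
print(f"tokens per second: {tokens_per_second:.1f}")
print(f"exp(val loss):     {ppl:.4f} (logged PPL: {val['lm loss PPL']:.4f})")
```

The tokens-per-step figure can be compared against the difference in `consumed tokens` between iterations 599 and 600 in the log above.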
+
saving checkpoint at iteration 600 to /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints
+
[2021-10-25 06:13:42,645] [INFO] [logging.py:68:log_dist] [Rank 1] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/mp_rank_01_model_states.pt
+
[2021-10-25 06:13:43,582] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/mp_rank_00_model_states.pt
+
[2021-10-25 06:13:56,312] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_17_optim_states.pt
+
[2021-10-25 06:13:56,315] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_102_optim_states.pt
+
[2021-10-25 06:13:56,357] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_16_optim_states.pt
+
[2021-10-25 06:13:56,388] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_28_optim_states.pt
+
[2021-10-25 06:13:56,529] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_21_optim_states.pt
+
[2021-10-25 06:13:56,535] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_108_optim_states.pt
+
[2021-10-25 06:13:56,537] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_93_optim_states.pt
+
[2021-10-25 06:13:56,588] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_12_optim_states.pt
+
[2021-10-25 06:13:56,617] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_116_optim_states.pt
+
[2021-10-25 06:13:56,621] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_95_optim_states.pt
+
[2021-10-25 06:13:56,637] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_122_optim_states.pt
+
[2021-10-25 06:13:56,677] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_06_optim_states.pt
+
[2021-10-25 06:13:56,683] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_62_optim_states.pt
+
[2021-10-25 06:13:56,689] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_123_optim_states.pt
+
[2021-10-25 06:13:56,697] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_67_optim_states.pt
+
[2021-10-25 06:13:56,728] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_59_optim_states.pt
+
[2021-10-25 06:13:56,728] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_07_optim_states.pt
+
[2021-10-25 06:13:56,732] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_99_optim_states.pt
+
[2021-10-25 06:13:56,733] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_100_optim_states.pt
+
[2021-10-25 06:13:56,792] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_30_optim_states.pt
+
[2021-10-25 06:13:56,816] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_114_optim_states.pt
+
[2021-10-25 06:13:56,829] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_111_optim_states.pt
+
[2021-10-25 06:13:56,831] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_118_optim_states.pt
+
[2021-10-25 06:13:56,978] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_57_optim_states.pt
+
[2021-10-25 06:13:56,996] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_107_optim_states.pt
+
[2021-10-25 06:13:57,041] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_23_optim_states.pt
+
[2021-10-25 06:13:57,081] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_65_optim_states.pt
+
[2021-10-25 06:13:57,115] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_105_optim_states.pt
+
[2021-10-25 06:13:57,128] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_15_optim_states.pt
+
[2021-10-25 06:13:57,137] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_60_optim_states.pt
+
[2021-10-25 06:13:57,180] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_113_optim_states.pt
+
[2021-10-25 06:13:57,254] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_52_optim_states.pt
+
[2021-10-25 06:13:57,303] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_14_optim_states.pt
+
[2021-10-25 06:13:57,320] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_71_optim_states.pt
+
[2021-10-25 06:13:57,392] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_83_optim_states.pt
+
[2021-10-25 06:13:57,413] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_43_optim_states.pt
+
[2021-10-25 06:13:57,508] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_04_optim_states.pt
+
[2021-10-25 06:13:57,526] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_45_optim_states.pt
+
[2021-10-25 06:13:57,563] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_101_optim_states.pt
+
[2021-10-25 06:13:57,565] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_58_optim_states.pt
+
[2021-10-25 06:13:57,579] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_09_optim_states.pt
+
[2021-10-25 06:13:57,583] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_26_optim_states.pt
+
[2021-10-25 06:13:57,600] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_18_optim_states.pt
+
[2021-10-25 06:13:57,646] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_72_optim_states.pt
+
[2021-10-25 06:13:57,657] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_103_optim_states.pt
+
[2021-10-25 06:13:57,668] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_94_optim_states.pt
+
[2021-10-25 06:13:57,669] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_22_optim_states.pt
+
[2021-10-25 06:13:57,673] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_11_optim_states.pt
+
[2021-10-25 06:13:57,679] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_19_optim_states.pt
+
[2021-10-25 06:13:57,686] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_97_optim_states.pt
+
[2021-10-25 06:13:57,694] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_119_optim_states.pt
+
[2021-10-25 06:13:57,698] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_25_optim_states.pt
+
[2021-10-25 06:13:57,714] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_78_optim_states.pt
+
[2021-10-25 06:13:57,730] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_56_optim_states.pt
+
[2021-10-25 06:13:57,736] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_50_optim_states.pt
+
[2021-10-25 06:13:57,758] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_104_optim_states.pt
+
[2021-10-25 06:13:57,763] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_92_optim_states.pt
+
[2021-10-25 06:13:57,774] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_13_optim_states.pt
+
[2021-10-25 06:13:57,780] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_61_optim_states.pt
+
[2021-10-25 06:13:57,813] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_98_optim_states.pt
+
[2021-10-25 06:13:57,822] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_24_optim_states.pt
+
[2021-10-25 06:13:57,837] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_90_optim_states.pt
+
[2021-10-25 06:13:57,838] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_36_optim_states.pt
+
[2021-10-25 06:13:57,841] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_110_optim_states.pt
+
[2021-10-25 06:13:57,852] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_117_optim_states.pt
+
[2021-10-25 06:13:57,861] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_05_optim_states.pt
+
[2021-10-25 06:13:57,872] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_109_optim_states.pt
+
[2021-10-25 06:13:57,876] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_121_optim_states.pt
+
[2021-10-25 06:13:57,885] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_112_optim_states.pt
+
[2021-10-25 06:13:57,886] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_51_optim_states.pt
+
[2021-10-25 06:13:57,889] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_96_optim_states.pt
+
[2021-10-25 06:13:57,924] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_33_optim_states.pt
+
[2021-10-25 06:13:57,927] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_74_optim_states.pt
+
[2021-10-25 06:13:57,943] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_63_optim_states.pt
+
[2021-10-25 06:13:57,948] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_68_optim_states.pt
+
[2021-10-25 06:13:57,948] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_106_optim_states.pt
+
[2021-10-25 06:13:57,950] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_120_optim_states.pt
+
[2021-10-25 06:13:57,969] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_54_optim_states.pt
+
[2021-10-25 06:13:57,975] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_10_optim_states.pt
+
[2021-10-25 06:13:57,992] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_20_optim_states.pt
+
[2021-10-25 06:13:58,013] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_35_optim_states.pt
+
[2021-10-25 06:13:58,052] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_81_optim_states.pt
+
[2021-10-25 06:13:58,074] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_115_optim_states.pt
+
[2021-10-25 06:13:58,092] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_27_optim_states.pt
+
[2021-10-25 06:13:58,098] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_86_optim_states.pt
+
[2021-10-25 06:13:58,124] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_46_optim_states.pt
+
[2021-10-25 06:13:58,139] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_87_optim_states.pt
+
[2021-10-25 06:13:58,153] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_41_optim_states.pt
+
[2021-10-25 06:13:58,168] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_55_optim_states.pt
+
[2021-10-25 06:13:58,258] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_77_optim_states.pt
+
[2021-10-25 06:13:58,275] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_40_optim_states.pt
+
[2021-10-25 06:13:58,282] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_39_optim_states.pt
+
[2021-10-25 06:13:58,327] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_88_optim_states.pt
+
[2021-10-25 06:13:58,364] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_79_optim_states.pt
+
[2021-10-25 06:13:58,375] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_69_optim_states.pt
+
[2021-10-25 06:13:58,404] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_42_optim_states.pt
+
[2021-10-25 06:13:58,420] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_31_optim_states.pt
+
[2021-10-25 06:13:58,425] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_84_optim_states.pt
+
[2021-10-25 06:13:58,453] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_89_optim_states.pt
+
[2021-10-25 06:13:58,456] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_47_optim_states.pt
[2021-10-25 06:13:58,476] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_38_optim_states.pt
[2021-10-25 06:13:58,505] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_80_optim_states.pt
[2021-10-25 06:13:58,532] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_82_optim_states.pt
[2021-10-25 06:13:58,578] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_70_optim_states.pt
[2021-10-25 06:13:58,585] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_53_optim_states.pt
[2021-10-25 06:13:58,601] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_34_optim_states.pt
[2021-10-25 06:13:58,640] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_44_optim_states.pt
[2021-10-25 06:13:58,679] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_91_optim_states.pt
[2021-10-25 06:13:58,698] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_76_optim_states.pt
[2021-10-25 06:13:58,757] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_49_optim_states.pt
[2021-10-25 06:13:58,810] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_32_optim_states.pt
[2021-10-25 06:13:58,824] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_48_optim_states.pt
[2021-10-25 06:13:58,831] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_37_optim_states.pt
[2021-10-25 06:13:58,832] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_85_optim_states.pt
[2021-10-25 06:13:58,876] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_08_optim_states.pt
[2021-10-25 06:13:59,037] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_125_optim_states.pt
[2021-10-25 06:13:59,405] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_127_optim_states.pt
[2021-10-25 06:13:59,502] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_29_optim_states.pt
[2021-10-25 06:14:00,054] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_03_optim_states.pt
[2021-10-25 06:14:00,433] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_02_optim_states.pt
[2021-10-25 06:14:00,677] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_126_optim_states.pt
[2021-10-25 06:14:00,974] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_124_optim_states.pt
[2021-10-25 06:14:04,943] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_75_optim_states.pt
[2021-10-25 06:14:05,563] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_64_optim_states.pt
[2021-10-25 06:14:06,200] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_73_optim_states.pt
[2021-10-25 06:14:06,857] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_66_optim_states.pt
[2021-10-25 06:14:12,792] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_01_optim_states.pt
[2021-10-25 06:14:13,515] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints/global_step600/zero_pp_rank_0_mp_rank_00_optim_states.pt
successfully saved checkpoint at iteration 600 to /gpfsscratch/rech/six/commun/checkpoints/tr8b-104B/checkpoints
time (ms) | save-checkpoint: 34761.21
iteration 601/ 292968 | consumed samples: 1230848 | consumed tokens: 94404608 | elapsed time per iteration (ms): 304940.5 | learning rate: 3.282E-05 | global batch size: 2048 | lm loss: 5.396969E+00 | loss scale: 8192.0 | grad norm: 12332.412 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 602/ 292968 | consumed samples: 1232896 | consumed tokens: 94601216 | elapsed time per iteration (ms): 106807.5 | learning rate: 3.288E-05 | global batch size: 2048 | lm loss: 5.408408E+00 | loss scale: 8192.0 | grad norm: 11929.351 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 603/ 292968 | consumed samples: 1234944 | consumed tokens: 94797824 | elapsed time per iteration (ms): 107857.1 | learning rate: 3.293E-05 | global batch size: 2048 | lm loss: 5.420089E+00 | loss scale: 8192.0 | grad norm: 11171.102 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 604/ 292968 | consumed samples: 1236992 | consumed tokens: 94994432 | elapsed time per iteration (ms): 107461.0 | learning rate: 3.299E-05 | global batch size: 2048 | lm loss: 5.418396E+00 | loss scale: 8192.0 | grad norm: 9342.805 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 605/ 292968 | consumed samples: 1239040 | consumed tokens: 95191040 | elapsed time per iteration (ms): 107939.7 | learning rate: 3.304E-05 | global batch size: 2048 | lm loss: 5.415629E+00 | loss scale: 8192.0 | grad norm: 12331.412 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 606/ 292968 | consumed samples: 1241088 | consumed tokens: 95387648 | elapsed time per iteration (ms): 106693.6 | learning rate: 3.310E-05 | global batch size: 2048 | lm loss: 5.435667E+00 | loss scale: 8192.0 | grad norm: 16086.731 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 607/ 292968 | consumed samples: 1243136 | consumed tokens: 95584256 | elapsed time per iteration (ms): 107708.8 | learning rate: 3.315E-05 | global batch size: 2048 | lm loss: 5.409382E+00 | loss scale: 8192.0 | grad norm: 9374.954 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 608/ 292968 | consumed samples: 1245184 | consumed tokens: 95780864 | elapsed time per iteration (ms): 107679.7 | learning rate: 3.320E-05 | global batch size: 2048 | lm loss: 5.423688E+00 | loss scale: 8192.0 | grad norm: 12232.800 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 609/ 292968 | consumed samples: 1247232 | consumed tokens: 95977472 | elapsed time per iteration (ms): 108222.9 | learning rate: 3.326E-05 | global batch size: 2048 | lm loss: 5.402236E+00 | loss scale: 8192.0 | grad norm: 9228.233 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 610/ 292968 | consumed samples: 1249280 | consumed tokens: 96174080 | elapsed time per iteration (ms): 107400.0 | learning rate: 3.331E-05 | global batch size: 2048 | lm loss: 5.412461E+00 | loss scale: 8192.0 | grad norm: 11245.757 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 611/ 292968 | consumed samples: 1251328 | consumed tokens: 96370688 | elapsed time per iteration (ms): 106468.7 | learning rate: 3.337E-05 | global batch size: 2048 | lm loss: 5.408649E+00 | loss scale: 8192.0 | grad norm: 11344.448 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 612/ 292968 | consumed samples: 1253376 | consumed tokens: 96567296 | elapsed time per iteration (ms): 107650.3 | learning rate: 3.342E-05 | global batch size: 2048 | lm loss: 5.407639E+00 | loss scale: 8192.0 | grad norm: 11098.585 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 613/ 292968 | consumed samples: 1255424 | consumed tokens: 96763904 | elapsed time per iteration (ms): 107751.1 | learning rate: 3.348E-05 | global batch size: 2048 | lm loss: 5.380627E+00 | loss scale: 8192.0 | grad norm: 8762.937 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)
iteration 614/ 292968 | consumed samples: 1257472 | consumed tokens: 96960512 | elapsed time per iteration (ms): 110635.4 | learning rate: 3.353E-05 | global batch size: 2048 | lm loss: 5.375699E+00 | loss scale: 8192.0 | grad norm: 11229.270 | num zeros: 0.0 | curriculum seqlen: 96 | number of skipped iterations: 0 | number of nan iterations: 0 |
time (ms)