bigscience-bot
committed on
Commit 537ef6a
1 Parent(s): 8d2ebe4
new data
- logs/main_log.txt +76 -0
logs/main_log.txt
CHANGED
@@ -40817,3 +40817,79 @@ valid loss at iteration 137000 | lm loss value: 1.448801E+00 | lm loss PPL: 4.25
|
|
40817 |
iteration 137200/ 152972 | consumed samples: 65166784 | consumed tokens: 133461573632 | elapsed time per iteration (ms): 5183.6 | learning rate: 1.587E-05 | global batch size: 512 | lm loss: 1.437702E+00 | loss scale: 131072.0 | grad norm: 15268.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
40818 |
iteration 137400/ 152972 | consumed samples: 65269184 | consumed tokens: 133671288832 | elapsed time per iteration (ms): 4659.9 | learning rate: 1.572E-05 | global batch size: 512 | lm loss: 1.405018E+00 | loss scale: 131072.0 | grad norm: 17432.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
40819 |
iteration 137600/ 152972 | consumed samples: 65371584 | consumed tokens: 133881004032 | elapsed time per iteration (ms): 4653.2 | learning rate: 1.558E-05 | global batch size: 512 | lm loss: 1.429552E+00 | loss scale: 262144.0 | grad norm: 30347.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
40820 |
+
iteration 137800/ 152972 | consumed samples: 65473984 | consumed tokens: 134090719232 | elapsed time per iteration (ms): 4644.6 | learning rate: 1.544E-05 | global batch size: 512 | lm loss: 1.446872E+00 | loss scale: 65536.0 | grad norm: 9272.526 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
40821 |
+
[2021-11-29 03:30:35,344] [INFO] [logging.py:68:log_dist] [Rank 0] step=138000, skipped=295, lr=[1.5303912101312385e-05, 1.5303912101312385e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
|
40822 |
+
iteration 138000/ 152972 | consumed samples: 65576384 | consumed tokens: 134300434432 | elapsed time per iteration (ms): 4637.8 | learning rate: 1.530E-05 | global batch size: 512 | lm loss: 1.381682E+00 | loss scale: 65536.0 | grad norm: 10285.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
40823 |
+
steps: 138000 loss: 1.9639 iter time (s): 0.002 samples/sec: 220781.121
|
40824 |
+
--------------------------------------------------------------------------------------------
|
40825 |
+
valid loss at iteration 138000 | lm loss value: 1.399287E+00 | lm loss PPL: 4.052309E+00 |
|
40826 |
+
--------------------------------------------------------------------------------------------
|
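The validation line above reports both an `lm loss value` and an `lm loss PPL`. The two figures are consistent with perplexity being the exponential of the loss, assuming the loss is a mean per-token cross-entropy in nats (the log itself does not state the unit). A minimal sketch of that check, where the helper name `perplexity` is purely illustrative:

```python
import math


def perplexity(lm_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss, assumed to be in nats."""
    return math.exp(lm_loss)


valid_loss = 1.399287  # "lm loss value" logged at iteration 138000
logged_ppl = 4.052309  # "lm loss PPL" logged on the same line

# The recomputed value should agree with the logged one up to rounding.
print(abs(perplexity(valid_loss) - logged_ppl) < 1e-4)
```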
40827 |
+
saving checkpoint at iteration 138000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
|
40828 |
+
[2021-11-29 03:32:25,649] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/mp_rank_00_model_states.pt
|
40829 |
+
[2021-11-29 03:32:26,072] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_1_mp_rank_01_optim_states.pt
|
40830 |
+
[2021-11-29 03:32:26,073] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_10_mp_rank_01_optim_states.pt
|
40831 |
+
[2021-11-29 03:32:26,074] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_3_mp_rank_01_optim_states.pt
|
40832 |
+
[2021-11-29 03:32:26,075] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_19_mp_rank_01_optim_states.pt
|
40833 |
+
[2021-11-29 03:32:26,075] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_20_mp_rank_01_optim_states.pt
|
40834 |
+
[2021-11-29 03:32:26,077] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_31_mp_rank_01_optim_states.pt
|
40835 |
+
[2021-11-29 03:32:26,077] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_7_mp_rank_01_optim_states.pt
|
40836 |
+
[2021-11-29 03:32:26,079] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_12_mp_rank_01_optim_states.pt
|
40837 |
+
[2021-11-29 03:32:26,080] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_30_mp_rank_01_optim_states.pt
|
40838 |
+
[2021-11-29 03:32:26,083] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_7_mp_rank_00_optim_states.pt
|
40839 |
+
[2021-11-29 03:32:26,084] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_27_mp_rank_01_optim_states.pt
|
40840 |
+
[2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_27_mp_rank_00_optim_states.pt
|
40841 |
+
[2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_5_mp_rank_00_optim_states.pt
|
40842 |
+
[2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_17_mp_rank_01_optim_states.pt
|
40843 |
+
[2021-11-29 03:32:26,086] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_28_mp_rank_00_optim_states.pt
|
40844 |
+
[2021-11-29 03:32:26,089] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_22_mp_rank_01_optim_states.pt
|
40845 |
+
[2021-11-29 03:32:26,089] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_20_mp_rank_00_optim_states.pt
|
40846 |
+
[2021-11-29 03:32:26,090] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_6_mp_rank_01_optim_states.pt
|
40847 |
+
[2021-11-29 03:32:26,092] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_19_mp_rank_00_optim_states.pt
|
40848 |
+
[2021-11-29 03:32:26,096] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_8_mp_rank_00_optim_states.pt
|
40849 |
+
[2021-11-29 03:32:26,097] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_24_mp_rank_00_optim_states.pt
|
40850 |
+
[2021-11-29 03:32:26,099] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_11_mp_rank_01_optim_states.pt
|
40851 |
+
[2021-11-29 03:32:26,102] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_2_mp_rank_00_optim_states.pt
|
40852 |
+
[2021-11-29 03:32:26,103] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_23_mp_rank_01_optim_states.pt
|
40853 |
+
[2021-11-29 03:32:26,106] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_26_mp_rank_01_optim_states.pt
|
40854 |
+
[2021-11-29 03:32:26,106] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_4_mp_rank_01_optim_states.pt
|
40855 |
+
[2021-11-29 03:32:26,107] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_31_mp_rank_00_optim_states.pt
|
40856 |
+
[2021-11-29 03:32:26,108] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_9_mp_rank_00_optim_states.pt
|
40857 |
+
[2021-11-29 03:32:26,109] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_18_mp_rank_00_optim_states.pt
|
40858 |
+
[2021-11-29 03:32:26,109] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_8_mp_rank_01_optim_states.pt
|
40859 |
+
[2021-11-29 03:32:26,110] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_29_mp_rank_01_optim_states.pt
|
40860 |
+
[2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_17_mp_rank_00_optim_states.pt
|
40861 |
+
[2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_6_mp_rank_00_optim_states.pt
|
40862 |
+
[2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_30_mp_rank_00_optim_states.pt
|
40863 |
+
[2021-11-29 03:32:26,113] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_14_mp_rank_00_optim_states.pt
|
40864 |
+
[2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_24_mp_rank_01_optim_states.pt
|
40865 |
+
[2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_23_mp_rank_00_optim_states.pt
|
40866 |
+
[2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_29_mp_rank_00_optim_states.pt
|
40867 |
+
[2021-11-29 03:32:26,115] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_16_mp_rank_01_optim_states.pt
|
40868 |
+
[2021-11-29 03:32:26,115] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_25_mp_rank_01_optim_states.pt
|
40869 |
+
[2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_22_mp_rank_00_optim_states.pt
|
40870 |
+
[2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_12_mp_rank_00_optim_states.pt
|
40871 |
+
[2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_11_mp_rank_00_optim_states.pt
|
40872 |
+
[2021-11-29 03:32:26,117] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_28_mp_rank_01_optim_states.pt
|
40873 |
+
[2021-11-29 03:32:26,121] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_18_mp_rank_01_optim_states.pt
|
40874 |
+
[2021-11-29 03:32:26,122] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_15_mp_rank_01_optim_states.pt
|
40875 |
+
[2021-11-29 03:32:26,122] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_4_mp_rank_00_optim_states.pt
|
40876 |
+
[2021-11-29 03:32:26,123] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_13_mp_rank_01_optim_states.pt
|
40877 |
+
[2021-11-29 03:32:26,124] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_21_mp_rank_01_optim_states.pt
|
40878 |
+
[2021-11-29 03:32:26,125] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_10_mp_rank_00_optim_states.pt
|
40879 |
+
[2021-11-29 03:32:26,125] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_25_mp_rank_00_optim_states.pt
|
40880 |
+
[2021-11-29 03:32:26,127] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_0_mp_rank_00_optim_states.pt
|
40881 |
+
[2021-11-29 03:32:26,128] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_9_mp_rank_01_optim_states.pt
|
40882 |
+
[2021-11-29 03:32:26,131] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_26_mp_rank_00_optim_states.pt
|
40883 |
+
[2021-11-29 03:32:26,135] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_14_mp_rank_01_optim_states.pt
|
40884 |
+
[2021-11-29 03:32:26,136] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_2_mp_rank_01_optim_states.pt
|
40885 |
+
[2021-11-29 03:32:26,136] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_0_mp_rank_01_optim_states.pt
|
40886 |
+
[2021-11-29 03:32:26,138] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_16_mp_rank_00_optim_states.pt
|
40887 |
+
[2021-11-29 03:32:26,139] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_21_mp_rank_00_optim_states.pt
|
40888 |
+
[2021-11-29 03:32:26,140] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_5_mp_rank_01_optim_states.pt
|
40889 |
+
[2021-11-29 03:32:26,179] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_15_mp_rank_00_optim_states.pt
|
40890 |
+
[2021-11-29 03:32:26,180] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_13_mp_rank_00_optim_states.pt
|
40891 |
+
[2021-11-29 03:32:26,410] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_1_mp_rank_00_optim_states.pt
|
40892 |
+
[2021-11-29 03:32:26,454] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_3_mp_rank_00_optim_states.pt
|
40893 |
+
successfully saved checkpoint at iteration 138000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
|
40894 |
+
time (ms) | save-checkpoint: 2778.09
|
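The `zero checkpoint saved` lines above arrive out of order, so it is hard to tell by eye whether every optimizer-state shard was written. The file names in this excerpt span `zero_pp_rank_0` to `zero_pp_rank_31` and `mp_rank_00` to `mp_rank_01`, which suggests 32 x 2 = 64 shards, although the log does not state the parallel layout explicitly. The sketch below scans log lines for shard names and reports any missing combination; `saved_shards` and `missing_shards` are hypothetical helper names, not part of DeepSpeed.

```python
import re
from itertools import product

# Matches the shard file names that appear in the "zero checkpoint saved" lines.
SHARD_RE = re.compile(r"zero_pp_rank_(\d+)_mp_rank_(\d+)_optim_states\.pt")


def saved_shards(log_lines):
    """Return the set of (pp_rank, mp_rank) pairs reported as saved."""
    shards = set()
    for line in log_lines:
        match = SHARD_RE.search(line)
        if match and "zero checkpoint saved" in line:
            shards.add((int(match.group(1)), int(match.group(2))))
    return shards


def missing_shards(log_lines, n_pp_ranks, n_mp_ranks):
    """List expected (pp_rank, mp_rank) pairs with no 'saved' line in the log."""
    expected = set(product(range(n_pp_ranks), range(n_mp_ranks)))
    return sorted(expected - saved_shards(log_lines))


# Tiny illustrative input: a 2 x 2 layout with one shard never reported.
demo_lines = [
    "[INFO] zero checkpoint saved .../zero_pp_rank_0_mp_rank_00_optim_states.pt",
    "[INFO] zero checkpoint saved .../zero_pp_rank_1_mp_rank_00_optim_states.pt",
    "[INFO] zero checkpoint saved .../zero_pp_rank_0_mp_rank_01_optim_states.pt",
]
print(missing_shards(demo_lines, n_pp_ranks=2, n_mp_ranks=2))
```

For the log in this commit, the call would be `missing_shards(lines, 32, 2)` under the layout assumption stated above; an empty list means every expected shard was reported.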
40895 |
+
iteration 138200/ 152972 | consumed samples: 65678784 | consumed tokens: 134510149632 | elapsed time per iteration (ms): 5202.9 | learning rate: 1.517E-05 | global batch size: 512 | lm loss: 1.432905E+00 | loss scale: 65536.0 | grad norm: 8435.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
|
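The `iteration N/ TOTAL | key: value | ...` lines in this log are regular enough to parse into records for plotting or sanity checks. The sketch below is one possible parser, not the tooling used by the training run; `parse_iteration_line` is a hypothetical name. It also illustrates one consistency check: in every iteration line shown here, `consumed tokens` divided by `consumed samples` equals 2048, which suggests (but does not prove) a sequence length of 2048 tokens.

```python
import re

ITER_RE = re.compile(r"iteration\s+(\d+)/\s*(\d+)")


def parse_iteration_line(line: str) -> dict:
    """Parse one 'iteration N/ TOTAL | key: value | ...' log line into a dict."""
    header = ITER_RE.search(line)
    if header is None:
        raise ValueError("not an iteration line")
    record = {
        "iteration": int(header.group(1)),
        "total iterations": int(header.group(2)),
    }
    # Every field after the header has the form "key: value".
    for field in line.split("|")[1:]:
        if ":" not in field:
            continue  # skips the empty piece after the trailing "|"
        key, value = field.split(":", 1)
        record[key.strip()] = float(value)
    return record


line = (
    "iteration 138000/ 152972 | consumed samples: 65576384 | "
    "consumed tokens: 134300434432 | elapsed time per iteration (ms): 4637.8 | "
    "learning rate: 1.530E-05 | global batch size: 512 | lm loss: 1.381682E+00 | "
    "loss scale: 65536.0 | grad norm: 10285.752 | num zeros: 0.0 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)
rec = parse_iteration_line(line)

# Tokens per sample, inferred from the two counters on the line.
print(rec["consumed tokens"] / rec["consumed samples"])
```

All values are stored as floats for simplicity; the integer counters in this log are small enough to be represented exactly.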