bigscience-bot committed on
Commit
537ef6a
1 Parent(s): 8d2ebe4
Files changed (1)
  1. logs/main_log.txt +76 -0
logs/main_log.txt CHANGED
@@ -40817,3 +40817,79 @@ valid loss at iteration 137000 | lm loss value: 1.448801E+00 | lm loss PPL: 4.25
40817
  iteration 137200/ 152972 | consumed samples: 65166784 | consumed tokens: 133461573632 | elapsed time per iteration (ms): 5183.6 | learning rate: 1.587E-05 | global batch size: 512 | lm loss: 1.437702E+00 | loss scale: 131072.0 | grad norm: 15268.917 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
40818
  iteration 137400/ 152972 | consumed samples: 65269184 | consumed tokens: 133671288832 | elapsed time per iteration (ms): 4659.9 | learning rate: 1.572E-05 | global batch size: 512 | lm loss: 1.405018E+00 | loss scale: 131072.0 | grad norm: 17432.153 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
40819
  iteration 137600/ 152972 | consumed samples: 65371584 | consumed tokens: 133881004032 | elapsed time per iteration (ms): 4653.2 | learning rate: 1.558E-05 | global batch size: 512 | lm loss: 1.429552E+00 | loss scale: 262144.0 | grad norm: 30347.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
40820
+ iteration 137800/ 152972 | consumed samples: 65473984 | consumed tokens: 134090719232 | elapsed time per iteration (ms): 4644.6 | learning rate: 1.544E-05 | global batch size: 512 | lm loss: 1.446872E+00 | loss scale: 65536.0 | grad norm: 9272.526 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
40821
+ [2021-11-29 03:30:35,344] [INFO] [logging.py:68:log_dist] [Rank 0] step=138000, skipped=295, lr=[1.5303912101312385e-05, 1.5303912101312385e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
40822
+ iteration 138000/ 152972 | consumed samples: 65576384 | consumed tokens: 134300434432 | elapsed time per iteration (ms): 4637.8 | learning rate: 1.530E-05 | global batch size: 512 | lm loss: 1.381682E+00 | loss scale: 65536.0 | grad norm: 10285.752 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
40823
+ steps: 138000 loss: 1.9639 iter time (s): 0.002 samples/sec: 220781.121
40824
+ --------------------------------------------------------------------------------------------
40825
+ valid loss at iteration 138000 | lm loss value: 1.399287E+00 | lm loss PPL: 4.052309E+00 |
40826
+ --------------------------------------------------------------------------------------------
40827
+ saving checkpoint at iteration 138000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
40828
+ [2021-11-29 03:32:25,649] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/mp_rank_00_model_states.pt
40829
+ [2021-11-29 03:32:26,072] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_1_mp_rank_01_optim_states.pt
40830
+ [2021-11-29 03:32:26,073] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_10_mp_rank_01_optim_states.pt
40831
+ [2021-11-29 03:32:26,074] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_3_mp_rank_01_optim_states.pt
40832
+ [2021-11-29 03:32:26,075] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_19_mp_rank_01_optim_states.pt
40833
+ [2021-11-29 03:32:26,075] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_20_mp_rank_01_optim_states.pt
40834
+ [2021-11-29 03:32:26,077] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_31_mp_rank_01_optim_states.pt
40835
+ [2021-11-29 03:32:26,077] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_7_mp_rank_01_optim_states.pt
40836
+ [2021-11-29 03:32:26,079] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_12_mp_rank_01_optim_states.pt
40837
+ [2021-11-29 03:32:26,080] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_30_mp_rank_01_optim_states.pt
40838
+ [2021-11-29 03:32:26,083] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_7_mp_rank_00_optim_states.pt
40839
+ [2021-11-29 03:32:26,084] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_27_mp_rank_01_optim_states.pt
40840
+ [2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_27_mp_rank_00_optim_states.pt
40841
+ [2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_5_mp_rank_00_optim_states.pt
40842
+ [2021-11-29 03:32:26,085] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_17_mp_rank_01_optim_states.pt
40843
+ [2021-11-29 03:32:26,086] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_28_mp_rank_00_optim_states.pt
40844
+ [2021-11-29 03:32:26,089] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_22_mp_rank_01_optim_states.pt
40845
+ [2021-11-29 03:32:26,089] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_20_mp_rank_00_optim_states.pt
40846
+ [2021-11-29 03:32:26,090] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_6_mp_rank_01_optim_states.pt
40847
+ [2021-11-29 03:32:26,092] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_19_mp_rank_00_optim_states.pt
40848
+ [2021-11-29 03:32:26,096] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_8_mp_rank_00_optim_states.pt
40849
+ [2021-11-29 03:32:26,097] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_24_mp_rank_00_optim_states.pt
40850
+ [2021-11-29 03:32:26,099] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_11_mp_rank_01_optim_states.pt
40851
+ [2021-11-29 03:32:26,102] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_2_mp_rank_00_optim_states.pt
40852
+ [2021-11-29 03:32:26,103] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_23_mp_rank_01_optim_states.pt
40853
+ [2021-11-29 03:32:26,106] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_26_mp_rank_01_optim_states.pt
40854
+ [2021-11-29 03:32:26,106] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_4_mp_rank_01_optim_states.pt
40855
+ [2021-11-29 03:32:26,107] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_31_mp_rank_00_optim_states.pt
40856
+ [2021-11-29 03:32:26,108] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_9_mp_rank_00_optim_states.pt
40857
+ [2021-11-29 03:32:26,109] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_18_mp_rank_00_optim_states.pt
40858
+ [2021-11-29 03:32:26,109] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_8_mp_rank_01_optim_states.pt
40859
+ [2021-11-29 03:32:26,110] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_29_mp_rank_01_optim_states.pt
40860
+ [2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_17_mp_rank_00_optim_states.pt
40861
+ [2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_6_mp_rank_00_optim_states.pt
40862
+ [2021-11-29 03:32:26,111] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_30_mp_rank_00_optim_states.pt
40863
+ [2021-11-29 03:32:26,113] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_14_mp_rank_00_optim_states.pt
40864
+ [2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_24_mp_rank_01_optim_states.pt
40865
+ [2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_23_mp_rank_00_optim_states.pt
40866
+ [2021-11-29 03:32:26,114] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_29_mp_rank_00_optim_states.pt
40867
+ [2021-11-29 03:32:26,115] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_16_mp_rank_01_optim_states.pt
40868
+ [2021-11-29 03:32:26,115] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_25_mp_rank_01_optim_states.pt
40869
+ [2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_22_mp_rank_00_optim_states.pt
40870
+ [2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_12_mp_rank_00_optim_states.pt
40871
+ [2021-11-29 03:32:26,116] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_11_mp_rank_00_optim_states.pt
40872
+ [2021-11-29 03:32:26,117] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_28_mp_rank_01_optim_states.pt
40873
+ [2021-11-29 03:32:26,121] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_18_mp_rank_01_optim_states.pt
40874
+ [2021-11-29 03:32:26,122] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_15_mp_rank_01_optim_states.pt
40875
+ [2021-11-29 03:32:26,122] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_4_mp_rank_00_optim_states.pt
40876
+ [2021-11-29 03:32:26,123] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_13_mp_rank_01_optim_states.pt
40877
+ [2021-11-29 03:32:26,124] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_21_mp_rank_01_optim_states.pt
40878
+ [2021-11-29 03:32:26,125] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_10_mp_rank_00_optim_states.pt
40879
+ [2021-11-29 03:32:26,125] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_25_mp_rank_00_optim_states.pt
40880
+ [2021-11-29 03:32:26,127] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_0_mp_rank_00_optim_states.pt
40881
+ [2021-11-29 03:32:26,128] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_9_mp_rank_01_optim_states.pt
40882
+ [2021-11-29 03:32:26,131] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_26_mp_rank_00_optim_states.pt
40883
+ [2021-11-29 03:32:26,135] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_14_mp_rank_01_optim_states.pt
40884
+ [2021-11-29 03:32:26,136] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_2_mp_rank_01_optim_states.pt
40885
+ [2021-11-29 03:32:26,136] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_0_mp_rank_01_optim_states.pt
40886
+ [2021-11-29 03:32:26,138] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_16_mp_rank_00_optim_states.pt
40887
+ [2021-11-29 03:32:26,139] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_21_mp_rank_00_optim_states.pt
40888
+ [2021-11-29 03:32:26,140] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_5_mp_rank_01_optim_states.pt
40889
+ [2021-11-29 03:32:26,179] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_15_mp_rank_00_optim_states.pt
40890
+ [2021-11-29 03:32:26,180] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_13_mp_rank_00_optim_states.pt
40891
+ [2021-11-29 03:32:26,410] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_1_mp_rank_00_optim_states.pt
40892
+ [2021-11-29 03:32:26,454] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step138000/zero_pp_rank_3_mp_rank_00_optim_states.pt
40893
+ successfully saved checkpoint at iteration 138000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
40894
+ time (ms) | save-checkpoint: 2778.09
40895
+ iteration 138200/ 152972 | consumed samples: 65678784 | consumed tokens: 134510149632 | elapsed time per iteration (ms): 5202.9 | learning rate: 1.517E-05 | global batch size: 512 | lm loss: 1.432905E+00 | loss scale: 65536.0 | grad norm: 8435.686 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
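
The pipe-delimited `iteration` lines above are regular enough to parse mechanically, which makes it easy to sanity-check the logged figures. The following is a minimal sketch, not part of the training code: `parse_iteration_line` is a hypothetical helper, and the sample strings are copied from the iteration 138000 entries in this diff. It checks two relationships that the logged numbers appear to satisfy: consumed tokens divided by consumed samples gives 2048 (presumably the sequence length, though the log does not state it), and the reported validation PPL equals `exp(lm loss)`.

```python
import math
import re

# Sample lines copied from the iteration 138000 entries of the log above.
TRAIN_LINE = (
    "iteration 138000/ 152972 | consumed samples: 65576384 | "
    "consumed tokens: 134300434432 | elapsed time per iteration (ms): 4637.8 | "
    "learning rate: 1.530E-05 | global batch size: 512 | lm loss: 1.381682E+00 | "
    "loss scale: 65536.0 | grad norm: 10285.752 | num zeros: 0.0 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)
VALID_LOSS = 1.399287E+00  # "lm loss value" at iteration 138000
VALID_PPL = 4.052309E+00   # "lm loss PPL" at iteration 138000


def parse_iteration_line(line: str) -> dict:
    """Hypothetical helper: split one 'iteration' log line into named fields."""
    fields = {}
    for part in line.split("|"):
        part = part.strip()
        if not part:
            continue
        # The first field has the form "iteration <current>/ <total>".
        m = re.search(r"iteration\s+(\d+)/\s*(\d+)$", part)
        if m:
            fields["iteration"] = int(m.group(1))
            fields["total iterations"] = int(m.group(2))
            continue
        # Every other field is "<name>: <number>".
        key, sep, value = part.rpartition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


rec = parse_iteration_line(TRAIN_LINE)

# Tokens per sample implied by the two counters (assumed to be the sequence length).
tokens_per_sample = rec["consumed tokens"] / rec["consumed samples"]

# Perplexity recomputed from the logged validation loss.
ppl_from_loss = math.exp(VALID_LOSS)

print(rec["iteration"], tokens_per_sample, round(ppl_from_loss, 4))
```

The same helper also accepts lines that still carry the leading `+` diff marker, since the iteration pattern is searched for rather than anchored at the start of the field.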