bigscience-bot committed on
Commit
0df4f4d
1 Parent(s): 48d2c00
Files changed (1)
  1. logs/main_log.txt +77 -0
logs/main_log.txt CHANGED
@@ -23010,3 +23010,80 @@ valid loss at iteration 77000 | lm loss value: 1.463746E+00 | lm loss PPL: 4.322
23010
  iteration 77400/ 152972 | consumed samples: 34549184 | consumed tokens: 70756728832 | elapsed time per iteration (ms): 4640.8 | learning rate: 1.141E-04 | global batch size: 512 | lm loss: 1.480313E+00 | loss scale: 32768.0 | grad norm: 3833.268 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23011
  iteration 77600/ 152972 | consumed samples: 34651584 | consumed tokens: 70966444032 | elapsed time per iteration (ms): 4642.4 | learning rate: 1.137E-04 | global batch size: 512 | lm loss: 1.533694E+00 | loss scale: 32768.0 | grad norm: 1919.989 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23012
  iteration 77800/ 152972 | consumed samples: 34753984 | consumed tokens: 71176159232 | elapsed time per iteration (ms): 4642.3 | learning rate: 1.133E-04 | global batch size: 512 | lm loss: 1.484447E+00 | loss scale: 32768.0 | grad norm: 3477.265 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23013
+ [2021-11-25 19:50:54,600] [INFO] [logging.py:68:log_dist] [Rank 0] step=78000, skipped=161, lr=[0.00011287287812300848, 0.00011287287812300848], mom=[(0.9, 0.999), (0.9, 0.999)]
23014
+ steps: 78000 loss: 1.6072 iter time (s): 0.002 samples/sec: 220820.748
23015
+ iteration 78000/ 152972 | consumed samples: 34856384 | consumed tokens: 71385874432 | elapsed time per iteration (ms): 4640.2 | learning rate: 1.129E-04 | global batch size: 512 | lm loss: 1.487332E+00 | loss scale: 65536.0 | grad norm: 6999.023 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23016
+ -------------------------------------------------------------------------------------------
23017
+ valid loss at iteration 78000 | lm loss value: 1.447810E+00 | lm loss PPL: 4.253788E+00 |
23018
+ -------------------------------------------------------------------------------------------
23019
+ saving checkpoint at iteration 78000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
23020
+ [2021-11-25 19:53:02,460] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/mp_rank_00_model_states.pt
23021
+ [2021-11-25 19:53:02,891] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_17_mp_rank_01_optim_states.pt
23022
+ [2021-11-25 19:53:02,893] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_15_mp_rank_00_optim_states.pt
23023
+ [2021-11-25 19:53:02,895] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_22_mp_rank_00_optim_states.pt
23024
+ [2021-11-25 19:53:02,897] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_14_mp_rank_00_optim_states.pt
23025
+ [2021-11-25 19:53:02,897] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_28_mp_rank_01_optim_states.pt
23026
+ [2021-11-25 19:53:02,898] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_11_mp_rank_00_optim_states.pt
23027
+ [2021-11-25 19:53:02,899] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_7_mp_rank_00_optim_states.pt
23028
+ [2021-11-25 19:53:02,900] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_26_mp_rank_01_optim_states.pt
23029
+ [2021-11-25 19:53:02,901] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_26_mp_rank_00_optim_states.pt
23030
+ [2021-11-25 19:53:02,901] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_31_mp_rank_00_optim_states.pt
23031
+ [2021-11-25 19:53:02,903] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_4_mp_rank_00_optim_states.pt
23032
+ [2021-11-25 19:53:02,903] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_2_mp_rank_00_optim_states.pt
23033
+ [2021-11-25 19:53:02,905] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_11_mp_rank_01_optim_states.pt
23034
+ [2021-11-25 19:53:02,906] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_31_mp_rank_01_optim_states.pt
23035
+ [2021-11-25 19:53:02,906] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_10_mp_rank_00_optim_states.pt
23036
+ [2021-11-25 19:53:02,906] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_3_mp_rank_01_optim_states.pt
23037
+ [2021-11-25 19:53:02,906] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_25_mp_rank_01_optim_states.pt
23038
+ [2021-11-25 19:53:02,907] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_21_mp_rank_01_optim_states.pt
23039
+ [2021-11-25 19:53:02,908] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_12_mp_rank_01_optim_states.pt
23040
+ [2021-11-25 19:53:02,908] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_1_mp_rank_00_optim_states.pt
23041
+ [2021-11-25 19:53:02,909] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_18_mp_rank_01_optim_states.pt
23042
+ [2021-11-25 19:53:02,912] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_20_mp_rank_01_optim_states.pt
23043
+ [2021-11-25 19:53:02,912] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_15_mp_rank_01_optim_states.pt
23044
+ [2021-11-25 19:53:02,913] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_7_mp_rank_01_optim_states.pt
23045
+ [2021-11-25 19:53:02,913] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_6_mp_rank_01_optim_states.pt
23046
+ [2021-11-25 19:53:02,914] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_16_mp_rank_00_optim_states.pt
23047
+ [2021-11-25 19:53:02,917] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_18_mp_rank_00_optim_states.pt
23048
+ [2021-11-25 19:53:02,920] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_24_mp_rank_00_optim_states.pt
23049
+ [2021-11-25 19:53:02,923] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_28_mp_rank_00_optim_states.pt
23050
+ [2021-11-25 19:53:02,926] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_13_mp_rank_00_optim_states.pt
23051
+ [2021-11-25 19:53:02,928] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_29_mp_rank_00_optim_states.pt
23052
+ [2021-11-25 19:53:02,928] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_4_mp_rank_01_optim_states.pt
23053
+ [2021-11-25 19:53:02,929] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_17_mp_rank_00_optim_states.pt
23054
+ [2021-11-25 19:53:02,930] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_9_mp_rank_00_optim_states.pt
23055
+ [2021-11-25 19:53:02,931] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_27_mp_rank_01_optim_states.pt
23056
+ [2021-11-25 19:53:02,931] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_8_mp_rank_00_optim_states.pt
23057
+ [2021-11-25 19:53:02,932] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_0_mp_rank_01_optim_states.pt
23058
+ [2021-11-25 19:53:02,934] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_30_mp_rank_01_optim_states.pt
23059
+ [2021-11-25 19:53:02,937] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_23_mp_rank_01_optim_states.pt
23060
+ [2021-11-25 19:53:02,937] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_23_mp_rank_00_optim_states.pt
23061
+ [2021-11-25 19:53:02,937] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_2_mp_rank_01_optim_states.pt
23062
+ [2021-11-25 19:53:02,938] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_6_mp_rank_00_optim_states.pt
23063
+ [2021-11-25 19:53:02,939] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_10_mp_rank_01_optim_states.pt
23064
+ [2021-11-25 19:53:02,939] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_20_mp_rank_00_optim_states.pt
23065
+ [2021-11-25 19:53:02,940] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_9_mp_rank_01_optim_states.pt
23066
+ [2021-11-25 19:53:02,940] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_25_mp_rank_00_optim_states.pt
23067
+ [2021-11-25 19:53:02,941] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_24_mp_rank_01_optim_states.pt
23068
+ [2021-11-25 19:53:02,941] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_12_mp_rank_00_optim_states.pt
23069
+ [2021-11-25 19:53:02,941] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_27_mp_rank_00_optim_states.pt
23070
+ [2021-11-25 19:53:02,943] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_0_mp_rank_00_optim_states.pt
23071
+ [2021-11-25 19:53:02,944] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_29_mp_rank_01_optim_states.pt
23072
+ [2021-11-25 19:53:02,944] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_21_mp_rank_00_optim_states.pt
23073
+ [2021-11-25 19:53:02,944] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_30_mp_rank_00_optim_states.pt
23074
+ [2021-11-25 19:53:02,945] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_5_mp_rank_00_optim_states.pt
23075
+ [2021-11-25 19:53:02,947] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_3_mp_rank_00_optim_states.pt
23076
+ [2021-11-25 19:53:02,949] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_22_mp_rank_01_optim_states.pt
23077
+ [2021-11-25 19:53:02,951] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_19_mp_rank_00_optim_states.pt
23078
+ [2021-11-25 19:53:02,955] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_1_mp_rank_01_optim_states.pt
23079
+ [2021-11-25 19:53:02,959] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_16_mp_rank_01_optim_states.pt
23080
+ [2021-11-25 19:53:02,961] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_14_mp_rank_01_optim_states.pt
23081
+ [2021-11-25 19:53:02,962] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_5_mp_rank_01_optim_states.pt
23082
+ [2021-11-25 19:53:02,964] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_19_mp_rank_01_optim_states.pt
23083
+ [2021-11-25 19:53:02,966] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_8_mp_rank_01_optim_states.pt
23084
+ [2021-11-25 19:53:02,969] [INFO] [engine.py:2540:_save_zero_checkpoint] zero checkpoint saved /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints/global_step78000/zero_pp_rank_13_mp_rank_01_optim_states.pt
23085
+ successfully saved checkpoint at iteration 78000 to /gpfsscratch/rech/six/commun/checkpoints/tr6g-1B3-oscar-loss-reweighting/checkpoints
23086
+ time (ms) | save-checkpoint: 2747.00
23087
+ iteration 78200/ 152972 | consumed samples: 34958784 | consumed tokens: 71595589632 | elapsed time per iteration (ms): 5294.6 | learning rate: 1.125E-04 | global batch size: 512 | lm loss: 1.510645E+00 | loss scale: 65536.0 | grad norm: 5610.810 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23088
+ iteration 78400/ 152972 | consumed samples: 35061184 | consumed tokens: 71805304832 | elapsed time per iteration (ms): 4655.1 | learning rate: 1.120E-04 | global batch size: 512 | lm loss: 1.483753E+00 | loss scale: 131072.0 | grad norm: 16549.933 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
23089
+ iteration 78600/ 152972 | consumed samples: 35163584 | consumed tokens: 72015020032 | elapsed time per iteration (ms): 4642.6 | learning rate: 1.116E-04 | global batch size: 512 | lm loss: 1.459196E+00 | loss scale: 131072.0 | grad norm: 13634.901 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |