bigscience-bot committed on
Commit
1ff4be8
1 Parent(s): c332227
Files changed (1)
  1. logs/main_log.txt +49 -0
logs/main_log.txt CHANGED
@@ -3714,3 +3714,52 @@ time (ms)
  ------------------------------------------------------------------------------------------------
  iteration 1200/ 152972 | consumed samples: 38400 | elapsed time per iteration (ms): 1415.3 | learning rate: 4.194E-05 | global batch size: 32 | lm loss: 5.749456E+00 | loss scale: 16384.0 | grad norm: 16830.504 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
  time (ms)
+ iteration 1400/ 152972 | consumed samples: 44800 | elapsed time per iteration (ms): 1258.8 | learning rate: 4.893E-05 | global batch size: 32 | lm loss: 5.540604E+00 | loss scale: 16384.0 | grad norm: 14275.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ saving checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
+ [2021-09-27 17:31:58,218] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints/global_step1500/mp_rank_00_model_states.pt
+ successfully saved checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
+ time (ms) | save-checkpoint: 1572.29
+ iteration 1600/ 152972 | consumed samples: 51200 | elapsed time per iteration (ms): 1269.7 | learning rate: 5.592E-05 | global batch size: 32 | lm loss: 5.372899E+00 | loss scale: 32768.0 | grad norm: 23634.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 1800/ 152972 | consumed samples: 57600 | elapsed time per iteration (ms): 1261.8 | learning rate: 6.291E-05 | global batch size: 32 | lm loss: 5.217889E+00 | loss scale: 32768.0 | grad norm: 21545.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ [2021-09-27 17:42:30,184] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=2, lr=[6.983534037847136e-05, 6.983534037847136e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
+ steps: 2000 loss: 4.9108 iter time (s): 0.001 samples/sec: 50939.574
+ iteration 2000/ 152972 | consumed samples: 64000 | elapsed time per iteration (ms): 1260.5 | learning rate: 6.984E-05 | global batch size: 32 | lm loss: 5.363922E+00 | loss scale: 16384.0 | grad norm: 12768.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ ------------------------------------------------------------------------------------------------
+ validation loss at iteration 2000 | lm loss value: 4.962508E+00 | lm loss PPL: 1.429518E+02 |
+ ------------------------------------------------------------------------------------------------
+ iteration 2200/ 152972 | consumed samples: 70400 | elapsed time per iteration (ms): 1407.3 | learning rate: 7.683E-05 | global batch size: 32 | lm loss: 4.894614E+00 | loss scale: 16384.0 | grad norm: 9693.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 2400/ 152972 | consumed samples: 76800 | elapsed time per iteration (ms): 1263.8 | learning rate: 8.382E-05 | global batch size: 32 | lm loss: 4.742365E+00 | loss scale: 16384.0 | grad norm: 11512.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 2600/ 152972 | consumed samples: 83200 | elapsed time per iteration (ms): 1265.1 | learning rate: 9.081E-05 | global batch size: 32 | lm loss: 4.640353E+00 | loss scale: 32768.0 | grad norm: 16408.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 2800/ 152972 | consumed samples: 89600 | elapsed time per iteration (ms): 1268.9 | learning rate: 9.780E-05 | global batch size: 32 | lm loss: 4.562429E+00 | loss scale: 32768.0 | grad norm: 17465.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 3000/ 152972 | consumed samples: 96000 | elapsed time per iteration (ms): 1272.7 | learning rate: 1.048E-04 | global batch size: 32 | lm loss: 4.480088E+00 | loss scale: 65536.0 | grad norm: 29013.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ ------------------------------------------------------------------------------------------------
+ validation loss at iteration 3000 | lm loss value: 4.390939E+00 | lm loss PPL: 8.071619E+01 |
+ ------------------------------------------------------------------------------------------------
+ saving checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
+ [2021-09-27 18:04:34,840] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints/global_step3000/mp_rank_00_model_states.pt
+ successfully saved checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
+ time (ms) | save-checkpoint: 1703.84
+ iteration 3200/ 152972 | consumed samples: 102400 | elapsed time per iteration (ms): 1417.8 | learning rate: 1.118E-04 | global batch size: 32 | lm loss: 4.428154E+00 | loss scale: 65536.0 | grad norm: 27260.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 3400/ 152972 | consumed samples: 108800 | elapsed time per iteration (ms): 1264.6 | learning rate: 1.188E-04 | global batch size: 32 | lm loss: 4.375950E+00 | loss scale: 65536.0 | grad norm: 30398.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 3600/ 152972 | consumed samples: 115200 | elapsed time per iteration (ms): 1269.6 | learning rate: 1.258E-04 | global batch size: 32 | lm loss: 4.317261E+00 | loss scale: 131072.0 | grad norm: 77605.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ iteration 3800/ 152972 | consumed samples: 121600 | elapsed time per iteration (ms): 1268.3 | learning rate: 1.327E-04 | global batch size: 32 | lm loss: 4.276650E+00 | loss scale: 131072.0 | grad norm: 51425.567 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ [2021-09-27 18:25:43,201] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=4, lr=[0.00013967068075694273, 0.00013967068075694273], mom=[(0.9, 0.999), (0.9, 0.999)]
+ steps: 4000 loss: 4.2108 iter time (s): 0.001 samples/sec: 50745.813
+ iteration 4000/ 152972 | consumed samples: 128000 | elapsed time per iteration (ms): 1267.0 | learning rate: 1.397E-04 | global batch size: 32 | lm loss: 4.234697E+00 | loss scale: 65536.0 | grad norm: 24346.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
+ time (ms)
+ ------------------------------------------------------------------------------------------------
+ validation loss at iteration 4000 | lm loss value: 4.166348E+00 | lm loss PPL: 6.447954E+01 |
+ ------------------------------------------------------------------------------------------------