bigscience-bot committed
Commit: 1ff4be8
Parent(s): c332227
new data
Browse files - logs/main_log.txt +49 -0
logs/main_log.txt
CHANGED
@@ -3714,3 +3714,52 @@ time (ms)
3714   ------------------------------------------------------------------------------------------------
3715   iteration 1200/ 152972 | consumed samples: 38400 | elapsed time per iteration (ms): 1415.3 | learning rate: 4.194E-05 | global batch size: 32 | lm loss: 5.749456E+00 | loss scale: 16384.0 | grad norm: 16830.504 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3716   time (ms)
3717 + iteration 1400/ 152972 | consumed samples: 44800 | elapsed time per iteration (ms): 1258.8 | learning rate: 4.893E-05 | global batch size: 32 | lm loss: 5.540604E+00 | loss scale: 16384.0 | grad norm: 14275.904 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3718 + time (ms)
3719 + saving checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
3720 + [2021-09-27 17:31:58,218] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints/global_step1500/mp_rank_00_model_states.pt
3721 + successfully saved checkpoint at iteration 1500 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
3722 + time (ms) | save-checkpoint: 1572.29
3723 + iteration 1600/ 152972 | consumed samples: 51200 | elapsed time per iteration (ms): 1269.7 | learning rate: 5.592E-05 | global batch size: 32 | lm loss: 5.372899E+00 | loss scale: 32768.0 | grad norm: 23634.576 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3724 + time (ms)
3725 + iteration 1800/ 152972 | consumed samples: 57600 | elapsed time per iteration (ms): 1261.8 | learning rate: 6.291E-05 | global batch size: 32 | lm loss: 5.217889E+00 | loss scale: 32768.0 | grad norm: 21545.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3726 + time (ms)
3727 + [2021-09-27 17:42:30,184] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=2, lr=[6.983534037847136e-05, 6.983534037847136e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
3728 + steps: 2000 loss: 4.9108 iter time (s): 0.001 samples/sec: 50939.574
3729 + iteration 2000/ 152972 | consumed samples: 64000 | elapsed time per iteration (ms): 1260.5 | learning rate: 6.984E-05 | global batch size: 32 | lm loss: 5.363922E+00 | loss scale: 16384.0 | grad norm: 12768.459 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3730 + time (ms)
3731 + ------------------------------------------------------------------------------------------------
3732 + validation loss at iteration 2000 | lm loss value: 4.962508E+00 | lm loss PPL: 1.429518E+02 |
3733 + ------------------------------------------------------------------------------------------------
3734 + iteration 2200/ 152972 | consumed samples: 70400 | elapsed time per iteration (ms): 1407.3 | learning rate: 7.683E-05 | global batch size: 32 | lm loss: 4.894614E+00 | loss scale: 16384.0 | grad norm: 9693.687 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3735 + time (ms)
3736 + iteration 2400/ 152972 | consumed samples: 76800 | elapsed time per iteration (ms): 1263.8 | learning rate: 8.382E-05 | global batch size: 32 | lm loss: 4.742365E+00 | loss scale: 16384.0 | grad norm: 11512.744 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3737 + time (ms)
3738 + iteration 2600/ 152972 | consumed samples: 83200 | elapsed time per iteration (ms): 1265.1 | learning rate: 9.081E-05 | global batch size: 32 | lm loss: 4.640353E+00 | loss scale: 32768.0 | grad norm: 16408.451 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3739 + time (ms)
3740 + iteration 2800/ 152972 | consumed samples: 89600 | elapsed time per iteration (ms): 1268.9 | learning rate: 9.780E-05 | global batch size: 32 | lm loss: 4.562429E+00 | loss scale: 32768.0 | grad norm: 17465.135 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3741 + time (ms)
3742 + iteration 3000/ 152972 | consumed samples: 96000 | elapsed time per iteration (ms): 1272.7 | learning rate: 1.048E-04 | global batch size: 32 | lm loss: 4.480088E+00 | loss scale: 65536.0 | grad norm: 29013.886 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3743 + time (ms)
3744 + ------------------------------------------------------------------------------------------------
3745 + validation loss at iteration 3000 | lm loss value: 4.390939E+00 | lm loss PPL: 8.071619E+01 |
3746 + ------------------------------------------------------------------------------------------------
3747 + saving checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
3748 + [2021-09-27 18:04:34,840] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints/global_step3000/mp_rank_00_model_states.pt
3749 + successfully saved checkpoint at iteration 3000 to /gpfsscratch/rech/six/commun/synched_exps/tr4c-1B3-rotary-oscar/checkpoints
3750 + time (ms) | save-checkpoint: 1703.84
3751 + iteration 3200/ 152972 | consumed samples: 102400 | elapsed time per iteration (ms): 1417.8 | learning rate: 1.118E-04 | global batch size: 32 | lm loss: 4.428154E+00 | loss scale: 65536.0 | grad norm: 27260.674 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3752 + time (ms)
3753 + iteration 3400/ 152972 | consumed samples: 108800 | elapsed time per iteration (ms): 1264.6 | learning rate: 1.188E-04 | global batch size: 32 | lm loss: 4.375950E+00 | loss scale: 65536.0 | grad norm: 30398.829 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3754 + time (ms)
3755 + iteration 3600/ 152972 | consumed samples: 115200 | elapsed time per iteration (ms): 1269.6 | learning rate: 1.258E-04 | global batch size: 32 | lm loss: 4.317261E+00 | loss scale: 131072.0 | grad norm: 77605.844 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3756 + time (ms)
3757 + iteration 3800/ 152972 | consumed samples: 121600 | elapsed time per iteration (ms): 1268.3 | learning rate: 1.327E-04 | global batch size: 32 | lm loss: 4.276650E+00 | loss scale: 131072.0 | grad norm: 51425.567 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3758 + time (ms)
3759 + [2021-09-27 18:25:43,201] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=4, lr=[0.00013967068075694273, 0.00013967068075694273], mom=[(0.9, 0.999), (0.9, 0.999)]
3760 + steps: 4000 loss: 4.2108 iter time (s): 0.001 samples/sec: 50745.813
3761 + iteration 4000/ 152972 | consumed samples: 128000 | elapsed time per iteration (ms): 1267.0 | learning rate: 1.397E-04 | global batch size: 32 | lm loss: 4.234697E+00 | loss scale: 65536.0 | grad norm: 24346.811 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
3762 + time (ms)
3763 + ------------------------------------------------------------------------------------------------
3764 + validation loss at iteration 4000 | lm loss value: 4.166348E+00 | lm loss PPL: 6.447954E+01 |
3765 + ------------------------------------------------------------------------------------------------
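The logged quantities above are internally consistent: `consumed samples` equals `iteration × global batch size`, and `lm loss PPL` is `exp(lm loss)`. A minimal sketch checking this against one of the validation lines from the log (the parsing helper is illustrative, not part of the training code):

```python
import math
import re

# One validation line copied verbatim from the log above.
val_line = ("validation loss at iteration 2000 | lm loss value: 4.962508E+00 "
            "| lm loss PPL: 1.429518E+02 |")

loss = float(re.search(r"lm loss value: ([\d.E+-]+)", val_line).group(1))
ppl = float(re.search(r"lm loss PPL: ([\d.E+-]+)", val_line).group(1))

# Perplexity is the exponential of the language-model loss.
assert abs(math.exp(loss) - ppl) / ppl < 1e-3

# consumed samples at iteration 2000 with global batch size 32:
assert 2000 * 32 == 64000
```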