bigscience-bot committed on
Commit
e7212d4
1 Parent(s): ac09b62
Files changed (1)
  1. logs/main_log.txt +86 -0
logs/main_log.txt CHANGED
@@ -13603,3 +13603,89 @@ time (ms) | save-checkpoint: 1132.57
13603
  [2021-11-04 21:50:08,652] [INFO] [logging.py:68:log_dist] [Rank 0] step=40000, skipped=81, lr=[0.0001804599959837998, 0.0001804599959837998], mom=[(0.9, 0.999), (0.9, 0.999)]
13604
  iteration 40000/ 152972 | consumed samples: 15400384 | consumed tokens: 31539986432 | elapsed time per iteration (ms): 6080.1 | learning rate: 1.805E-04 | global batch size: 512 | lm loss: 2.046368E+00 | loss scale: 1048576.0 | grad norm: 79641.172 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
13605
  steps: 40000 loss: 1.9234 iter time (s): 0.003 samples/sec: 168896.024
13606
+ -------------------------------------------------------------------------------------------------
13607
+ validation loss at iteration 40000 | lm loss value: 2.018202E+00 | lm loss PPL: 7.524780E+00 |
13608
+ -------------------------------------------------------------------------------------------------
13609
+ iteration 40200/ 152972 | consumed samples: 15502784 | consumed tokens: 31749701632 | elapsed time per iteration (ms): 7223.0 | learning rate: 1.802E-04 | global batch size: 512 | lm loss: 2.039917E+00 | loss scale: 1048576.0 | grad norm: 82712.329 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
13610
+ srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
13611
+ Killing subprocess 122540
13612
+ Killing subprocess 1284095
13613
+ Killing subprocess 908359
13614
+ Killing subprocess 252675
13615
+ Killing subprocess 1964291
13616
+ Killing subprocess 502625
13617
+ Killing subprocess 2721736
13618
+ Killing subprocess 122541
13619
+ Killing subprocess 803909
13620
+ Killing subprocess 1284096
13621
+ Killing subprocess 1964292
13622
+ Killing subprocess 908360
13623
+ Killing subprocess 2133881
13624
+ Killing subprocess 532510
13625
+ Killing subprocess 252676
13626
+ Killing subprocess 2791463
13627
+ Killing subprocess 1284097
13628
+ Killing subprocess 122542
13629
+ Killing subprocess 908361
13630
+ Killing subprocess 502626
13631
+ Killing subprocess 1964293
13632
+ Killing subprocess 803910
13633
+ Killing subprocess 2721737
13634
+ Killing subprocess 1288896
13635
+ Killing subprocess 252677
13636
+ Killing subprocess 122543
13637
+ Killing subprocess 1284099
13638
+ Killing subprocess 502627
13639
+ slurmstepd: error: *** STEP 1825190.0 ON r6i3n0 CANCELLED AT 2021-11-04T22:24:00 ***
13640
+ Killing subprocess 908362
13641
+ Killing subprocess 532511
13642
+ Killing subprocess 2721738
13643
+ Killing subprocess 2133882
13644
+ Killing subprocess 1964294
13645
+ Main process received SIGTERM, exiting
13646
+ Killing subprocess 2791464
13647
+ Killing subprocess 532512
13648
+ Killing subprocess 2133883
13649
+ Killing subprocess 502629
13650
+ Killing subprocess 803911
13651
+ Killing subprocess 803912
13652
+ Killing subprocess 532513
13653
+ Killing subprocess 2791465
13654
+ Main process received SIGTERM, exiting
13655
+ Killing subprocess 1288897
13656
+ Killing subprocess 2133884
13657
+ Killing subprocess 252678
13658
+ Main process received SIGTERM, exiting
13659
+ Main process received SIGTERM, exiting
13660
+ Main process received SIGTERM, exiting
13661
+ Killing subprocess 2721739
13662
+ Main process received SIGTERM, exiting
13663
+ Killing subprocess 2791466
13664
+ Main process received SIGTERM, exiting
13665
+ Main process received SIGTERM, exiting
13666
+ Killing subprocess 1288898
13667
+ Killing subprocess 1288899
13668
+ Main process received SIGTERM, exiting
13669
+ Killing subprocess 980734
13670
+ Main process received SIGTERM, exiting
13671
+ Main process received SIGTERM, exiting
13672
+ Killing subprocess 980735
13673
+ Killing subprocess 980736
13674
+ Main process received SIGTERM, exiting
13675
+ Killing subprocess 980737
13676
+ Main process received SIGTERM, exiting
13677
+ Killing subprocess 429407
13678
+ Killing subprocess 2991515
13679
+ Killing subprocess 179727
13680
+ Killing subprocess 429408
13681
+ Killing subprocess 2991516
13682
+ Killing subprocess 179728
13683
+ Killing subprocess 429409
13684
+ Killing subprocess 429411
13685
+ Killing subprocess 2991517
13686
+ Killing subprocess 2991518
13687
+ Main process received SIGTERM, exiting
13688
+ Main process received SIGTERM, exiting
13689
+ Killing subprocess 179729
13690
+ Killing subprocess 179730
13691
+ Main process received SIGTERM, exiting