mikr committed on
Commit a304cc6
1 Parent(s): e9e481b

Training in progress, step 5000

pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:93bc62ba34f7e7a80168b1e5c1b2cff4630e3fcf60ebb0046e78af5fe6a11945
+ oid sha256:df4b119a6ba413fcd75e9151268d9b1bd9224a59cb7a6093c08491baee8c87fa
  size 1527847357
run.log CHANGED
@@ -1488,3 +1488,254 @@ Rank: 0 partition count [1] and sizes[(763857920, False)]
1488
  [2022-12-20 05:37:30,732] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
1489
  [2022-12-20 05:37:30,734] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-4000/global_step4000/zero_pp_rank_0_mp_rank_00_optim_states.pt
1490
  [2022-12-20 05:37:30,734] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step4000 is ready now!
1491
+ [2022-12-20 05:41:55,491] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 65536.0
1492
+ [2022-12-20 05:42:08,160] [INFO] [stage_1_and_2.py:1767:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
1493
+ [2022-12-20 05:42:35,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=4010, skipped=8, lr=[2.2200000000000003e-06], mom=[[0.9, 0.999]]
1494
+ [2022-12-20 05:42:35,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=4010/global_step=4010, RunningAvgSamplesPerSec=5.042979461159848, CurrSamplesPerSec=5.696590514613029, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1495
+ [2022-12-20 05:45:04,141] [INFO] [logging.py:68:log_dist] [Rank 0] step=4020, skipped=8, lr=[2.197777777777778e-06], mom=[[0.9, 0.999]]
1496
+ [2022-12-20 05:45:04,143] [INFO] [timer.py:196:stop] epoch=0/micro_step=4020/global_step=4020, RunningAvgSamplesPerSec=5.04338680465847, CurrSamplesPerSec=5.146485423364737, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1497
+ {'loss': 0.0001, 'learning_rate': 2.1866666666666668e-06, 'epoch': 57.5}
1498
+ [2022-12-20 05:47:36,085] [INFO] [logging.py:68:log_dist] [Rank 0] step=4030, skipped=8, lr=[2.1755555555555556e-06], mom=[[0.9, 0.999]]
1499
+ [2022-12-20 05:47:36,086] [INFO] [timer.py:196:stop] epoch=0/micro_step=4030/global_step=4030, RunningAvgSamplesPerSec=5.0433669842670845, CurrSamplesPerSec=5.079788047157066, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1500
+ [2022-12-20 05:50:10,634] [INFO] [logging.py:68:log_dist] [Rank 0] step=4040, skipped=8, lr=[2.153333333333333e-06], mom=[[0.9, 0.999]]
1501
+ [2022-12-20 05:50:10,636] [INFO] [timer.py:196:stop] epoch=0/micro_step=4040/global_step=4040, RunningAvgSamplesPerSec=5.04310395870082, CurrSamplesPerSec=5.0191439155807736, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1502
+ [2022-12-20 05:52:46,940] [INFO] [logging.py:68:log_dist] [Rank 0] step=4050, skipped=8, lr=[2.1311111111111112e-06], mom=[[0.9, 0.999]]
1503
+ [2022-12-20 05:52:46,941] [INFO] [timer.py:196:stop] epoch=0/micro_step=4050/global_step=4050, RunningAvgSamplesPerSec=5.042768502101079, CurrSamplesPerSec=4.842702171084141, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1504
+ {'loss': 0.0001, 'learning_rate': 2.1311111111111112e-06, 'epoch': 57.86}
1505
+ [2022-12-20 05:55:21,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=4060, skipped=8, lr=[2.108888888888889e-06], mom=[[0.9, 0.999]]
1506
+ [2022-12-20 05:55:21,004] [INFO] [timer.py:196:stop] epoch=0/micro_step=4060/global_step=4060, RunningAvgSamplesPerSec=5.042487681984381, CurrSamplesPerSec=4.99908525471156, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1507
+ [2022-12-20 05:57:52,840] [INFO] [logging.py:68:log_dist] [Rank 0] step=4070, skipped=8, lr=[2.086666666666667e-06], mom=[[0.9, 0.999]]
1508
+ [2022-12-20 05:57:52,842] [INFO] [timer.py:196:stop] epoch=0/micro_step=4070/global_step=4070, RunningAvgSamplesPerSec=5.042344945988743, CurrSamplesPerSec=5.111629527255383, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1509
+ {'loss': 0.0001, 'learning_rate': 2.0755555555555557e-06, 'epoch': 58.21}
1510
+ [2022-12-20 06:00:27,903] [INFO] [logging.py:68:log_dist] [Rank 0] step=4080, skipped=8, lr=[2.064444444444445e-06], mom=[[0.9, 0.999]]
1511
+ [2022-12-20 06:00:27,905] [INFO] [timer.py:196:stop] epoch=0/micro_step=4080/global_step=4080, RunningAvgSamplesPerSec=5.0419804402842665, CurrSamplesPerSec=4.930916151588083, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1512
+ [2022-12-20 06:02:52,754] [INFO] [logging.py:68:log_dist] [Rank 0] step=4090, skipped=8, lr=[2.0422222222222225e-06], mom=[[0.9, 0.999]]
1513
+ [2022-12-20 06:02:52,756] [INFO] [timer.py:196:stop] epoch=0/micro_step=4090/global_step=4090, RunningAvgSamplesPerSec=5.042556557819047, CurrSamplesPerSec=5.363131248975815, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1514
+ [2022-12-20 06:05:17,984] [INFO] [logging.py:68:log_dist] [Rank 0] step=4100, skipped=8, lr=[2.02e-06], mom=[[0.9, 0.999]]
1515
+ [2022-12-20 06:05:17,986] [INFO] [timer.py:196:stop] epoch=0/micro_step=4100/global_step=4100, RunningAvgSamplesPerSec=5.0431467991488805, CurrSamplesPerSec=5.085720797651708, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1516
+ {'loss': 0.0001, 'learning_rate': 2.02e-06, 'epoch': 58.57}
1517
+ [2022-12-20 06:07:41,649] [INFO] [logging.py:68:log_dist] [Rank 0] step=4110, skipped=8, lr=[1.9977777777777778e-06], mom=[[0.9, 0.999]]
1518
+ [2022-12-20 06:07:41,650] [INFO] [timer.py:196:stop] epoch=0/micro_step=4110/global_step=4110, RunningAvgSamplesPerSec=5.043948241474721, CurrSamplesPerSec=5.412140246565025, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1519
+ [2022-12-20 06:10:10,464] [INFO] [logging.py:68:log_dist] [Rank 0] step=4120, skipped=8, lr=[1.975555555555556e-06], mom=[[0.9, 0.999]]
1520
+ [2022-12-20 06:10:10,465] [INFO] [timer.py:196:stop] epoch=0/micro_step=4120/global_step=4120, RunningAvgSamplesPerSec=5.044459483834623, CurrSamplesPerSec=5.313749224322798, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1521
+ {'loss': 0.0001, 'learning_rate': 1.9644444444444446e-06, 'epoch': 58.93}
1522
+ [2022-12-20 06:12:39,140] [INFO] [logging.py:68:log_dist] [Rank 0] step=4130, skipped=8, lr=[1.9533333333333334e-06], mom=[[0.9, 0.999]]
1523
+ [2022-12-20 06:12:39,141] [INFO] [timer.py:196:stop] epoch=0/micro_step=4130/global_step=4130, RunningAvgSamplesPerSec=5.0447642989840675, CurrSamplesPerSec=5.2478424910556045, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1524
+ [2022-12-20 06:15:02,533] [INFO] [logging.py:68:log_dist] [Rank 0] step=4140, skipped=8, lr=[1.9311111111111114e-06], mom=[[0.9, 0.999]]
1525
+ [2022-12-20 06:15:02,535] [INFO] [timer.py:196:stop] epoch=0/micro_step=4140/global_step=4140, RunningAvgSamplesPerSec=5.045575952263303, CurrSamplesPerSec=5.5830596404364705, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1526
+ [2022-12-20 06:17:27,095] [INFO] [logging.py:68:log_dist] [Rank 0] step=4150, skipped=8, lr=[1.908888888888889e-06], mom=[[0.9, 0.999]]
1527
+ [2022-12-20 06:17:27,101] [INFO] [timer.py:196:stop] epoch=0/micro_step=4150/global_step=4150, RunningAvgSamplesPerSec=5.046241123558999, CurrSamplesPerSec=5.299716771107708, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1528
+ {'loss': 0.0001, 'learning_rate': 1.908888888888889e-06, 'epoch': 59.29}
1529
+ [2022-12-20 06:20:25,241] [INFO] [logging.py:68:log_dist] [Rank 0] step=4160, skipped=8, lr=[1.8866666666666669e-06], mom=[[0.9, 0.999]]
1530
+ [2022-12-20 06:20:25,244] [INFO] [timer.py:196:stop] epoch=0/micro_step=4160/global_step=4160, RunningAvgSamplesPerSec=5.043822466629399, CurrSamplesPerSec=5.56617254195275, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1531
+ [2022-12-20 06:22:50,983] [INFO] [logging.py:68:log_dist] [Rank 0] step=4170, skipped=8, lr=[1.8644444444444445e-06], mom=[[0.9, 0.999]]
1532
+ [2022-12-20 06:22:50,984] [INFO] [timer.py:196:stop] epoch=0/micro_step=4170/global_step=4170, RunningAvgSamplesPerSec=5.044478496015336, CurrSamplesPerSec=5.2003204555095985, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1533
+ {'loss': 0.0001, 'learning_rate': 1.8533333333333333e-06, 'epoch': 59.64}
1534
+ [2022-12-20 06:25:17,347] [INFO] [logging.py:68:log_dist] [Rank 0] step=4180, skipped=8, lr=[1.8422222222222225e-06], mom=[[0.9, 0.999]]
1535
+ [2022-12-20 06:25:17,348] [INFO] [timer.py:196:stop] epoch=0/micro_step=4180/global_step=4180, RunningAvgSamplesPerSec=5.045012622651632, CurrSamplesPerSec=5.299783527305324, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1536
+ [2022-12-20 06:27:42,485] [INFO] [logging.py:68:log_dist] [Rank 0] step=4190, skipped=8, lr=[1.8200000000000002e-06], mom=[[0.9, 0.999]]
1537
+ [2022-12-20 06:27:42,487] [INFO] [timer.py:196:stop] epoch=0/micro_step=4190/global_step=4190, RunningAvgSamplesPerSec=5.045662826265351, CurrSamplesPerSec=5.721540380522979, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1538
+ [2022-12-20 06:30:02,274] [INFO] [logging.py:68:log_dist] [Rank 0] step=4200, skipped=8, lr=[1.797777777777778e-06], mom=[[0.9, 0.999]]
1539
+ [2022-12-20 06:30:02,276] [INFO] [timer.py:196:stop] epoch=0/micro_step=4200/global_step=4200, RunningAvgSamplesPerSec=5.046791681237073, CurrSamplesPerSec=5.624625885538937, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1540
+ {'loss': 0.0001, 'learning_rate': 1.797777777777778e-06, 'epoch': 60.0}
1541
+ [2022-12-20 06:32:26,649] [INFO] [logging.py:68:log_dist] [Rank 0] step=4210, skipped=8, lr=[1.7755555555555556e-06], mom=[[0.9, 0.999]]
1542
+ [2022-12-20 06:32:26,651] [INFO] [timer.py:196:stop] epoch=0/micro_step=4210/global_step=4210, RunningAvgSamplesPerSec=5.047454122770001, CurrSamplesPerSec=5.141653397468583, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1543
+ [2022-12-20 06:34:52,412] [INFO] [logging.py:68:log_dist] [Rank 0] step=4220, skipped=8, lr=[1.7533333333333336e-06], mom=[[0.9, 0.999]]
1544
+ [2022-12-20 06:34:52,414] [INFO] [timer.py:196:stop] epoch=0/micro_step=4220/global_step=4220, RunningAvgSamplesPerSec=5.047994245909812, CurrSamplesPerSec=5.319056367099886, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1545
+ {'loss': 0.0001, 'learning_rate': 1.7422222222222224e-06, 'epoch': 60.36}
1546
+ [2022-12-20 06:37:15,987] [INFO] [logging.py:68:log_dist] [Rank 0] step=4230, skipped=8, lr=[1.7311111111111112e-06], mom=[[0.9, 0.999]]
1547
+ [2022-12-20 06:37:15,989] [INFO] [timer.py:196:stop] epoch=0/micro_step=4230/global_step=4230, RunningAvgSamplesPerSec=5.04866732774536, CurrSamplesPerSec=5.388713208948523, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1548
+ [2022-12-20 06:39:39,883] [INFO] [logging.py:68:log_dist] [Rank 0] step=4240, skipped=8, lr=[1.708888888888889e-06], mom=[[0.9, 0.999]]
1549
+ [2022-12-20 06:39:39,884] [INFO] [timer.py:196:stop] epoch=0/micro_step=4240/global_step=4240, RunningAvgSamplesPerSec=5.049283859017793, CurrSamplesPerSec=5.132029172229295, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1550
+ [2022-12-20 06:42:06,810] [INFO] [logging.py:68:log_dist] [Rank 0] step=4250, skipped=8, lr=[1.6866666666666667e-06], mom=[[0.9, 0.999]]
1551
+ [2022-12-20 06:42:06,811] [INFO] [timer.py:196:stop] epoch=0/micro_step=4250/global_step=4250, RunningAvgSamplesPerSec=5.049603812984208, CurrSamplesPerSec=5.245656612106342, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1552
+ {'loss': 0.0001, 'learning_rate': 1.6866666666666667e-06, 'epoch': 60.71}
1553
+ [2022-12-20 06:44:32,622] [INFO] [logging.py:68:log_dist] [Rank 0] step=4260, skipped=8, lr=[1.6644444444444447e-06], mom=[[0.9, 0.999]]
1554
+ [2022-12-20 06:44:32,623] [INFO] [timer.py:196:stop] epoch=0/micro_step=4260/global_step=4260, RunningAvgSamplesPerSec=5.05001146839292, CurrSamplesPerSec=5.207611653499008, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1555
+ [2022-12-20 06:46:59,516] [INFO] [logging.py:68:log_dist] [Rank 0] step=4270, skipped=8, lr=[1.6422222222222223e-06], mom=[[0.9, 0.999]]
1556
+ [2022-12-20 06:46:59,518] [INFO] [timer.py:196:stop] epoch=0/micro_step=4270/global_step=4270, RunningAvgSamplesPerSec=5.050310918921533, CurrSamplesPerSec=5.3487399204314645, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1557
+ {'loss': 0.0001, 'learning_rate': 1.6311111111111114e-06, 'epoch': 61.07}
1558
+ [2022-12-20 06:49:25,738] [INFO] [logging.py:68:log_dist] [Rank 0] step=4280, skipped=8, lr=[1.6200000000000002e-06], mom=[[0.9, 0.999]]
1559
+ [2022-12-20 06:49:25,739] [INFO] [timer.py:196:stop] epoch=0/micro_step=4280/global_step=4280, RunningAvgSamplesPerSec=5.050636049742784, CurrSamplesPerSec=5.267817766999322, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1560
+ [2022-12-20 06:51:52,463] [INFO] [logging.py:68:log_dist] [Rank 0] step=4290, skipped=8, lr=[1.5977777777777778e-06], mom=[[0.9, 0.999]]
1561
+ [2022-12-20 06:51:52,464] [INFO] [timer.py:196:stop] epoch=0/micro_step=4290/global_step=4290, RunningAvgSamplesPerSec=5.050927321793066, CurrSamplesPerSec=5.091219627389776, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1562
+ [2022-12-20 06:54:20,387] [INFO] [logging.py:68:log_dist] [Rank 0] step=4300, skipped=8, lr=[1.5755555555555558e-06], mom=[[0.9, 0.999]]
1563
+ [2022-12-20 06:54:20,389] [INFO] [timer.py:196:stop] epoch=0/micro_step=4300/global_step=4300, RunningAvgSamplesPerSec=5.051116554864754, CurrSamplesPerSec=5.131412687451037, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1564
+ {'loss': 0.0001, 'learning_rate': 1.5755555555555558e-06, 'epoch': 61.43}
1565
+ [2022-12-20 06:56:49,899] [INFO] [logging.py:68:log_dist] [Rank 0] step=4310, skipped=8, lr=[1.5533333333333334e-06], mom=[[0.9, 0.999]]
1566
+ [2022-12-20 06:56:49,901] [INFO] [timer.py:196:stop] epoch=0/micro_step=4310/global_step=4310, RunningAvgSamplesPerSec=5.051115029415557, CurrSamplesPerSec=5.207083942120238, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1567
+ [2022-12-20 06:59:24,979] [INFO] [logging.py:68:log_dist] [Rank 0] step=4320, skipped=8, lr=[1.5311111111111113e-06], mom=[[0.9, 0.999]]
1568
+ [2022-12-20 06:59:24,980] [INFO] [timer.py:196:stop] epoch=0/micro_step=4320/global_step=4320, RunningAvgSamplesPerSec=5.050576559473915, CurrSamplesPerSec=4.720912328409414, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1569
+ {'loss': 0.0001, 'learning_rate': 1.52e-06, 'epoch': 61.79}
1570
+ [2022-12-20 07:02:01,098] [INFO] [logging.py:68:log_dist] [Rank 0] step=4330, skipped=8, lr=[1.5088888888888889e-06], mom=[[0.9, 0.999]]
1571
+ [2022-12-20 07:02:01,099] [INFO] [timer.py:196:stop] epoch=0/micro_step=4330/global_step=4330, RunningAvgSamplesPerSec=5.0499269186777225, CurrSamplesPerSec=4.822616218008508, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1572
+ [2022-12-20 07:04:34,365] [INFO] [logging.py:68:log_dist] [Rank 0] step=4340, skipped=8, lr=[1.486666666666667e-06], mom=[[0.9, 0.999]]
1573
+ [2022-12-20 07:04:34,366] [INFO] [timer.py:196:stop] epoch=0/micro_step=4340/global_step=4340, RunningAvgSamplesPerSec=5.049580360469089, CurrSamplesPerSec=5.267244917024575, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1574
+ [2022-12-20 07:07:06,150] [INFO] [logging.py:68:log_dist] [Rank 0] step=4350, skipped=8, lr=[1.4644444444444445e-06], mom=[[0.9, 0.999]]
1575
+ [2022-12-20 07:07:06,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4350/global_step=4350, RunningAvgSamplesPerSec=5.04936238047649, CurrSamplesPerSec=4.829278603413176, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1576
+ {'loss': 0.0001, 'learning_rate': 1.4644444444444445e-06, 'epoch': 62.14}
1577
+ [2022-12-20 07:09:42,081] [INFO] [logging.py:68:log_dist] [Rank 0] step=4360, skipped=8, lr=[1.4422222222222223e-06], mom=[[0.9, 0.999]]
1578
+ [2022-12-20 07:09:42,082] [INFO] [timer.py:196:stop] epoch=0/micro_step=4360/global_step=4360, RunningAvgSamplesPerSec=5.048784732951024, CurrSamplesPerSec=4.751107522016899, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1579
+ [2022-12-20 07:12:19,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=4370, skipped=8, lr=[1.42e-06], mom=[[0.9, 0.999]]
1580
+ [2022-12-20 07:12:19,736] [INFO] [timer.py:196:stop] epoch=0/micro_step=4370/global_step=4370, RunningAvgSamplesPerSec=5.048028850314156, CurrSamplesPerSec=4.923453726629133, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1581
+ {'loss': 0.0001, 'learning_rate': 1.4088888888888892e-06, 'epoch': 62.5}
1582
+ [2022-12-20 07:14:51,218] [INFO] [logging.py:68:log_dist] [Rank 0] step=4380, skipped=8, lr=[1.397777777777778e-06], mom=[[0.9, 0.999]]
1583
+ [2022-12-20 07:14:51,219] [INFO] [timer.py:196:stop] epoch=0/micro_step=4380/global_step=4380, RunningAvgSamplesPerSec=5.047867689289667, CurrSamplesPerSec=4.890229593461905, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1584
+ [2022-12-20 07:17:28,428] [INFO] [logging.py:68:log_dist] [Rank 0] step=4390, skipped=8, lr=[1.3755555555555556e-06], mom=[[0.9, 0.999]]
1585
+ [2022-12-20 07:17:28,430] [INFO] [timer.py:196:stop] epoch=0/micro_step=4390/global_step=4390, RunningAvgSamplesPerSec=5.047159814157629, CurrSamplesPerSec=4.948499742634576, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1586
+ [2022-12-20 07:20:06,640] [INFO] [logging.py:68:log_dist] [Rank 0] step=4400, skipped=8, lr=[1.3533333333333334e-06], mom=[[0.9, 0.999]]
1587
+ [2022-12-20 07:20:06,642] [INFO] [timer.py:196:stop] epoch=0/micro_step=4400/global_step=4400, RunningAvgSamplesPerSec=5.046366139085683, CurrSamplesPerSec=4.623666005682293, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1588
+ {'loss': 0.0001, 'learning_rate': 1.3533333333333334e-06, 'epoch': 62.86}
1589
+ [2022-12-20 07:22:42,638] [INFO] [logging.py:68:log_dist] [Rank 0] step=4410, skipped=8, lr=[1.3311111111111113e-06], mom=[[0.9, 0.999]]
1590
+ [2022-12-20 07:22:42,640] [INFO] [timer.py:196:stop] epoch=0/micro_step=4410/global_step=4410, RunningAvgSamplesPerSec=5.045758620454289, CurrSamplesPerSec=5.076388247096151, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1591
+ [2022-12-20 07:25:17,610] [INFO] [logging.py:68:log_dist] [Rank 0] step=4420, skipped=8, lr=[1.308888888888889e-06], mom=[[0.9, 0.999]]
1592
+ [2022-12-20 07:25:17,612] [INFO] [timer.py:196:stop] epoch=0/micro_step=4420/global_step=4420, RunningAvgSamplesPerSec=5.045221775215709, CurrSamplesPerSec=4.847442318721658, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1593
+ {'loss': 0.0001, 'learning_rate': 1.2977777777777779e-06, 'epoch': 63.21}
1594
+ [2022-12-20 07:27:55,202] [INFO] [logging.py:68:log_dist] [Rank 0] step=4430, skipped=8, lr=[1.286666666666667e-06], mom=[[0.9, 0.999]]
1595
+ [2022-12-20 07:27:55,203] [INFO] [timer.py:196:stop] epoch=0/micro_step=4430/global_step=4430, RunningAvgSamplesPerSec=5.0444825303875485, CurrSamplesPerSec=4.71763327003565, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1596
+ [2022-12-20 07:30:32,300] [INFO] [logging.py:68:log_dist] [Rank 0] step=4440, skipped=8, lr=[1.2644444444444445e-06], mom=[[0.9, 0.999]]
1597
+ [2022-12-20 07:30:32,302] [INFO] [timer.py:196:stop] epoch=0/micro_step=4440/global_step=4440, RunningAvgSamplesPerSec=5.043765816300071, CurrSamplesPerSec=4.756459465106135, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1598
+ [2022-12-20 07:33:08,239] [INFO] [logging.py:68:log_dist] [Rank 0] step=4450, skipped=8, lr=[1.2422222222222224e-06], mom=[[0.9, 0.999]]
1599
+ [2022-12-20 07:33:08,241] [INFO] [timer.py:196:stop] epoch=0/micro_step=4450/global_step=4450, RunningAvgSamplesPerSec=5.0432365230532445, CurrSamplesPerSec=5.085449771164148, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1600
+ {'loss': 0.0001, 'learning_rate': 1.2422222222222224e-06, 'epoch': 63.57}
1601
+ [2022-12-20 07:35:43,710] [INFO] [logging.py:68:log_dist] [Rank 0] step=4460, skipped=8, lr=[1.2200000000000002e-06], mom=[[0.9, 0.999]]
1602
+ [2022-12-20 07:35:43,711] [INFO] [timer.py:196:stop] epoch=0/micro_step=4460/global_step=4460, RunningAvgSamplesPerSec=5.042712521939606, CurrSamplesPerSec=4.778885074217907, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1603
+ [2022-12-20 07:38:17,792] [INFO] [logging.py:68:log_dist] [Rank 0] step=4470, skipped=8, lr=[1.1977777777777778e-06], mom=[[0.9, 0.999]]
1604
+ [2022-12-20 07:38:17,794] [INFO] [timer.py:196:stop] epoch=0/micro_step=4470/global_step=4470, RunningAvgSamplesPerSec=5.042281683097091, CurrSamplesPerSec=5.02905169577589, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1605
+ {'loss': 0.0001, 'learning_rate': 1.1866666666666668e-06, 'epoch': 63.93}
1606
+ [2022-12-20 07:40:49,311] [INFO] [logging.py:68:log_dist] [Rank 0] step=4480, skipped=8, lr=[1.1755555555555556e-06], mom=[[0.9, 0.999]]
1607
+ [2022-12-20 07:40:49,312] [INFO] [timer.py:196:stop] epoch=0/micro_step=4480/global_step=4480, RunningAvgSamplesPerSec=5.042148780780479, CurrSamplesPerSec=5.1255290111304, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1608
+ [2022-12-20 07:43:11,925] [INFO] [logging.py:68:log_dist] [Rank 0] step=4490, skipped=8, lr=[1.1533333333333334e-06], mom=[[0.9, 0.999]]
1609
+ [2022-12-20 07:43:11,926] [INFO] [timer.py:196:stop] epoch=0/micro_step=4490/global_step=4490, RunningAvgSamplesPerSec=5.042788143087416, CurrSamplesPerSec=5.203591461607821, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1610
+ [2022-12-20 07:45:36,568] [INFO] [logging.py:68:log_dist] [Rank 0] step=4500, skipped=8, lr=[1.131111111111111e-06], mom=[[0.9, 0.999]]
1611
+ [2022-12-20 07:45:36,570] [INFO] [timer.py:196:stop] epoch=0/micro_step=4500/global_step=4500, RunningAvgSamplesPerSec=5.043178181242045, CurrSamplesPerSec=5.263032480306624, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1612
+ {'loss': 0.0001, 'learning_rate': 1.131111111111111e-06, 'epoch': 64.29}
1613
+ [2022-12-20 07:48:03,486] [INFO] [logging.py:68:log_dist] [Rank 0] step=4510, skipped=8, lr=[1.1088888888888889e-06], mom=[[0.9, 0.999]]
1614
+ [2022-12-20 07:48:03,488] [INFO] [timer.py:196:stop] epoch=0/micro_step=4510/global_step=4510, RunningAvgSamplesPerSec=5.043365421470672, CurrSamplesPerSec=4.939414436159629, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1615
+ [2022-12-20 07:50:29,784] [INFO] [logging.py:68:log_dist] [Rank 0] step=4520, skipped=8, lr=[1.0866666666666667e-06], mom=[[0.9, 0.999]]
1616
+ [2022-12-20 07:50:29,785] [INFO] [timer.py:196:stop] epoch=0/micro_step=4520/global_step=4520, RunningAvgSamplesPerSec=5.043697590182427, CurrSamplesPerSec=5.449464959209118, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1617
+ {'loss': 0.0001, 'learning_rate': 1.0755555555555557e-06, 'epoch': 64.64}
1618
+ [2022-12-20 07:52:56,799] [INFO] [logging.py:68:log_dist] [Rank 0] step=4530, skipped=8, lr=[1.0644444444444445e-06], mom=[[0.9, 0.999]]
1619
+ [2022-12-20 07:52:56,801] [INFO] [timer.py:196:stop] epoch=0/micro_step=4530/global_step=4530, RunningAvgSamplesPerSec=5.043944921085424, CurrSamplesPerSec=4.848887789663811, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1620
+ [2022-12-20 07:55:33,034] [INFO] [logging.py:68:log_dist] [Rank 0] step=4540, skipped=8, lr=[1.0422222222222221e-06], mom=[[0.9, 0.999]]
1621
+ [2022-12-20 07:55:33,036] [INFO] [timer.py:196:stop] epoch=0/micro_step=4540/global_step=4540, RunningAvgSamplesPerSec=5.043375303491235, CurrSamplesPerSec=4.778829859739575, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1622
+ [2022-12-20 07:58:03,913] [INFO] [logging.py:68:log_dist] [Rank 0] step=4550, skipped=8, lr=[1.02e-06], mom=[[0.9, 0.999]]
1623
+ [2022-12-20 07:58:03,914] [INFO] [timer.py:196:stop] epoch=0/micro_step=4550/global_step=4550, RunningAvgSamplesPerSec=5.043328963013769, CurrSamplesPerSec=5.608318938320819, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1624
+ {'loss': 0.0001, 'learning_rate': 1.02e-06, 'epoch': 65.0}
1625
+ [2022-12-20 08:00:25,713] [INFO] [logging.py:68:log_dist] [Rank 0] step=4560, skipped=8, lr=[9.97777777777778e-07], mom=[[0.9, 0.999]]
1626
+ [2022-12-20 08:00:25,714] [INFO] [timer.py:196:stop] epoch=0/micro_step=4560/global_step=4560, RunningAvgSamplesPerSec=5.044098201656415, CurrSamplesPerSec=5.341246991766891, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1627
+ [2022-12-20 08:02:46,105] [INFO] [logging.py:68:log_dist] [Rank 0] step=4570, skipped=8, lr=[9.755555555555556e-07], mom=[[0.9, 0.999]]
1628
+ [2022-12-20 08:02:46,106] [INFO] [timer.py:196:stop] epoch=0/micro_step=4570/global_step=4570, RunningAvgSamplesPerSec=5.044950401998271, CurrSamplesPerSec=5.438711647812863, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1629
+ {'loss': 0.0001, 'learning_rate': 9.644444444444444e-07, 'epoch': 65.36}
1630
+ [2022-12-20 08:05:10,797] [INFO] [logging.py:68:log_dist] [Rank 0] step=4580, skipped=8, lr=[9.533333333333335e-07], mom=[[0.9, 0.999]]
1631
+ [2022-12-20 08:05:10,798] [INFO] [timer.py:196:stop] epoch=0/micro_step=4580/global_step=4580, RunningAvgSamplesPerSec=5.045476504759206, CurrSamplesPerSec=5.188928621390404, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1632
+ [2022-12-20 08:07:36,800] [INFO] [logging.py:68:log_dist] [Rank 0] step=4590, skipped=8, lr=[9.311111111111113e-07], mom=[[0.9, 0.999]]
1633
+ [2022-12-20 08:07:36,802] [INFO] [timer.py:196:stop] epoch=0/micro_step=4590/global_step=4590, RunningAvgSamplesPerSec=5.045847427660788, CurrSamplesPerSec=5.12575734566362, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1634
+ [2022-12-20 08:10:02,459] [INFO] [logging.py:68:log_dist] [Rank 0] step=4600, skipped=8, lr=[9.08888888888889e-07], mom=[[0.9, 0.999]]
1635
+ [2022-12-20 08:10:02,460] [INFO] [timer.py:196:stop] epoch=0/micro_step=4600/global_step=4600, RunningAvgSamplesPerSec=5.046269260328224, CurrSamplesPerSec=5.21262751846709, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1636
+ {'loss': 0.0001, 'learning_rate': 9.08888888888889e-07, 'epoch': 65.71}
1637
+ [2022-12-20 08:12:26,062] [INFO] [logging.py:68:log_dist] [Rank 0] step=4610, skipped=8, lr=[8.866666666666668e-07], mom=[[0.9, 0.999]]
1638
+ [2022-12-20 08:12:26,063] [INFO] [timer.py:196:stop] epoch=0/micro_step=4610/global_step=4610, RunningAvgSamplesPerSec=5.046797422211321, CurrSamplesPerSec=5.262579211303562, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1639
+ [2022-12-20 08:14:48,783] [INFO] [logging.py:68:log_dist] [Rank 0] step=4620, skipped=8, lr=[8.644444444444445e-07], mom=[[0.9, 0.999]]
1640
+ [2022-12-20 08:14:48,785] [INFO] [timer.py:196:stop] epoch=0/micro_step=4620/global_step=4620, RunningAvgSamplesPerSec=5.0474143889414895, CurrSamplesPerSec=5.478198811861338, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1641
+ {'loss': 0.0001, 'learning_rate': 8.533333333333334e-07, 'epoch': 66.07}
1642
+ [2022-12-20 08:17:14,091] [INFO] [logging.py:68:log_dist] [Rank 0] step=4630, skipped=8, lr=[8.422222222222224e-07], mom=[[0.9, 0.999]]
1643
+ [2022-12-20 08:17:14,092] [INFO] [timer.py:196:stop] epoch=0/micro_step=4630/global_step=4630, RunningAvgSamplesPerSec=5.047884003074243, CurrSamplesPerSec=5.288036006059478, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1644
+ [2022-12-20 08:19:37,655] [INFO] [logging.py:68:log_dist] [Rank 0] step=4640, skipped=8, lr=[8.200000000000001e-07], mom=[[0.9, 0.999]]
1645
+ [2022-12-20 08:19:37,657] [INFO] [timer.py:196:stop] epoch=0/micro_step=4640/global_step=4640, RunningAvgSamplesPerSec=5.048458691418149, CurrSamplesPerSec=5.257712499872688, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1646
+ [2022-12-20 08:22:00,831] [INFO] [logging.py:68:log_dist] [Rank 0] step=4650, skipped=8, lr=[7.977777777777779e-07], mom=[[0.9, 0.999]]
1647
+ [2022-12-20 08:22:00,833] [INFO] [timer.py:196:stop] epoch=0/micro_step=4650/global_step=4650, RunningAvgSamplesPerSec=5.049032023727786, CurrSamplesPerSec=5.335302655950034, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1648
+ {'loss': 0.0001, 'learning_rate': 7.977777777777779e-07, 'epoch': 66.43}
1649
+ [2022-12-20 08:24:23,381] [INFO] [logging.py:68:log_dist] [Rank 0] step=4660, skipped=8, lr=[7.755555555555556e-07], mom=[[0.9, 0.999]]
1650
+ [2022-12-20 08:24:23,382] [INFO] [timer.py:196:stop] epoch=0/micro_step=4660/global_step=4660, RunningAvgSamplesPerSec=5.049679844711924, CurrSamplesPerSec=5.473512879448466, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1651
+ [2022-12-20 08:26:48,255] [INFO] [logging.py:68:log_dist] [Rank 0] step=4670, skipped=8, lr=[7.533333333333335e-07], mom=[[0.9, 0.999]]
1652
+ [2022-12-20 08:26:48,257] [INFO] [timer.py:196:stop] epoch=0/micro_step=4670/global_step=4670, RunningAvgSamplesPerSec=5.050125782334478, CurrSamplesPerSec=5.377273288074986, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1653
+ {'loss': 0.0001, 'learning_rate': 7.422222222222223e-07, 'epoch': 66.79}
1654
+ [2022-12-20 08:29:15,132] [INFO] [logging.py:68:log_dist] [Rank 0] step=4680, skipped=8, lr=[7.311111111111112e-07], mom=[[0.9, 0.999]]
1655
+ [2022-12-20 08:29:15,134] [INFO] [timer.py:196:stop] epoch=0/micro_step=4680/global_step=4680, RunningAvgSamplesPerSec=5.050512474845127, CurrSamplesPerSec=5.138506888939374, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1656
+ [2022-12-20 08:31:41,228] [INFO] [logging.py:68:log_dist] [Rank 0] step=4690, skipped=8, lr=[7.08888888888889e-07], mom=[[0.9, 0.999]]
1657
+ [2022-12-20 08:31:41,230] [INFO] [timer.py:196:stop] epoch=0/micro_step=4690/global_step=4690, RunningAvgSamplesPerSec=5.050892558272724, CurrSamplesPerSec=5.392371818165035, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1658
+ [2022-12-20 08:34:05,021] [INFO] [logging.py:68:log_dist] [Rank 0] step=4700, skipped=8, lr=[6.866666666666667e-07], mom=[[0.9, 0.999]]
1659
+ [2022-12-20 08:34:05,022] [INFO] [timer.py:196:stop] epoch=0/micro_step=4700/global_step=4700, RunningAvgSamplesPerSec=5.051463391225271, CurrSamplesPerSec=5.1622127629198085, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1660
+ {'loss': 0.0001, 'learning_rate': 6.866666666666667e-07, 'epoch': 67.14}
1661
+ [2022-12-20 08:36:30,437] [INFO] [logging.py:68:log_dist] [Rank 0] step=4710, skipped=8, lr=[6.644444444444446e-07], mom=[[0.9, 0.999]]
1662
+ [2022-12-20 08:36:30,438] [INFO] [timer.py:196:stop] epoch=0/micro_step=4710/global_step=4710, RunningAvgSamplesPerSec=5.051893912055331, CurrSamplesPerSec=5.445218800319529, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1663
+ [2022-12-20 08:38:59,891] [INFO] [logging.py:68:log_dist] [Rank 0] step=4720, skipped=8, lr=[6.422222222222223e-07], mom=[[0.9, 0.999]]
1664
+ [2022-12-20 08:38:59,893] [INFO] [timer.py:196:stop] epoch=0/micro_step=4720/global_step=4720, RunningAvgSamplesPerSec=5.052178801787182, CurrSamplesPerSec=5.115131684190571, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1665
+ {'loss': 0.0001, 'learning_rate': 6.311111111111112e-07, 'epoch': 67.5}
1666
+ [2022-12-20 08:41:26,544] [INFO] [logging.py:68:log_dist] [Rank 0] step=4730, skipped=8, lr=[6.200000000000001e-07], mom=[[0.9, 0.999]]
1667
+ [2022-12-20 08:41:26,545] [INFO] [timer.py:196:stop] epoch=0/micro_step=4730/global_step=4730, RunningAvgSamplesPerSec=5.05254451772132, CurrSamplesPerSec=5.304123565487315, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1668
+ [2022-12-20 08:43:52,849] [INFO] [logging.py:68:log_dist] [Rank 0] step=4740, skipped=8, lr=[5.977777777777778e-07], mom=[[0.9, 0.999]]
1669
+ [2022-12-20 08:43:52,851] [INFO] [timer.py:196:stop] epoch=0/micro_step=4740/global_step=4740, RunningAvgSamplesPerSec=5.053035885215057, CurrSamplesPerSec=5.329902208617103, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
1670
+ [2022-12-20 08:46:17,184] [INFO] [logging.py:68:log_dist] [Rank 0] step=4750, skipped=8, lr=[5.755555555555555e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:46:17,185] [INFO] [timer.py:196:stop] epoch=0/micro_step=4750/global_step=4750, RunningAvgSamplesPerSec=5.053785097454139, CurrSamplesPerSec=5.481972930186799, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 5.755555555555555e-07, 'epoch': 67.86}
+ [2022-12-20 08:48:33,665] [INFO] [logging.py:68:log_dist] [Rank 0] step=4760, skipped=8, lr=[5.533333333333334e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:48:33,666] [INFO] [timer.py:196:stop] epoch=0/micro_step=4760/global_step=4760, RunningAvgSamplesPerSec=5.055487957320154, CurrSamplesPerSec=6.146145092417891, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 08:50:52,761] [INFO] [logging.py:68:log_dist] [Rank 0] step=4770, skipped=8, lr=[5.311111111111111e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:50:52,762] [INFO] [timer.py:196:stop] epoch=0/micro_step=4770/global_step=4770, RunningAvgSamplesPerSec=5.056655527182182, CurrSamplesPerSec=5.561989132264451, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 5.2e-07, 'epoch': 68.21}
+ [2022-12-20 08:53:12,380] [INFO] [logging.py:68:log_dist] [Rank 0] step=4780, skipped=8, lr=[5.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:53:12,382] [INFO] [timer.py:196:stop] epoch=0/micro_step=4780/global_step=4780, RunningAvgSamplesPerSec=5.057621824492988, CurrSamplesPerSec=5.652548091745247, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 08:55:31,207] [INFO] [logging.py:68:log_dist] [Rank 0] step=4790, skipped=8, lr=[4.866666666666666e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:55:31,208] [INFO] [timer.py:196:stop] epoch=0/micro_step=4790/global_step=4790, RunningAvgSamplesPerSec=5.058667203586027, CurrSamplesPerSec=5.664042371231427, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 08:57:50,720] [INFO] [logging.py:68:log_dist] [Rank 0] step=4800, skipped=8, lr=[4.6444444444444446e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 08:57:50,721] [INFO] [timer.py:196:stop] epoch=0/micro_step=4800/global_step=4800, RunningAvgSamplesPerSec=5.059570925383668, CurrSamplesPerSec=5.519459240106216, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 4.6444444444444446e-07, 'epoch': 68.57}
+ [2022-12-20 09:00:12,990] [INFO] [logging.py:68:log_dist] [Rank 0] step=4810, skipped=8, lr=[4.422222222222223e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:00:12,992] [INFO] [timer.py:196:stop] epoch=0/micro_step=4810/global_step=4810, RunningAvgSamplesPerSec=5.060229523268901, CurrSamplesPerSec=5.224763032139902, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:02:36,045] [INFO] [logging.py:68:log_dist] [Rank 0] step=4820, skipped=8, lr=[4.2000000000000006e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:02:36,046] [INFO] [timer.py:196:stop] epoch=0/micro_step=4820/global_step=4820, RunningAvgSamplesPerSec=5.0608145350618585, CurrSamplesPerSec=5.335017418001835, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 4.0888888888888897e-07, 'epoch': 68.93}
+ [2022-12-20 09:05:03,069] [INFO] [logging.py:68:log_dist] [Rank 0] step=4830, skipped=8, lr=[3.9777777777777783e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:05:03,071] [INFO] [timer.py:196:stop] epoch=0/micro_step=4830/global_step=4830, RunningAvgSamplesPerSec=5.061128322114403, CurrSamplesPerSec=5.27507112151907, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:07:29,555] [INFO] [logging.py:68:log_dist] [Rank 0] step=4840, skipped=8, lr=[3.755555555555556e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:07:29,557] [INFO] [timer.py:196:stop] epoch=0/micro_step=4840/global_step=4840, RunningAvgSamplesPerSec=5.061443297556752, CurrSamplesPerSec=5.330757008706792, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:09:54,208] [INFO] [logging.py:68:log_dist] [Rank 0] step=4850, skipped=8, lr=[3.533333333333334e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:09:54,210] [INFO] [timer.py:196:stop] epoch=0/micro_step=4850/global_step=4850, RunningAvgSamplesPerSec=5.061879695879943, CurrSamplesPerSec=5.24628147577871, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 3.533333333333334e-07, 'epoch': 69.29}
+ [2022-12-20 09:12:18,737] [INFO] [logging.py:68:log_dist] [Rank 0] step=4860, skipped=8, lr=[3.3111111111111115e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:12:18,738] [INFO] [timer.py:196:stop] epoch=0/micro_step=4860/global_step=4860, RunningAvgSamplesPerSec=5.062324568211323, CurrSamplesPerSec=5.235211029343304, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:14:42,180] [INFO] [logging.py:68:log_dist] [Rank 0] step=4870, skipped=8, lr=[3.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:14:42,181] [INFO] [timer.py:196:stop] epoch=0/micro_step=4870/global_step=4870, RunningAvgSamplesPerSec=5.062819565868312, CurrSamplesPerSec=5.294168184906786, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 2.977777777777778e-07, 'epoch': 69.64}
+ [2022-12-20 09:17:07,652] [INFO] [logging.py:68:log_dist] [Rank 0] step=4880, skipped=8, lr=[2.866666666666667e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:17:07,654] [INFO] [timer.py:196:stop] epoch=0/micro_step=4880/global_step=4880, RunningAvgSamplesPerSec=5.063114334971396, CurrSamplesPerSec=5.268436237516054, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:19:33,593] [INFO] [logging.py:68:log_dist] [Rank 0] step=4890, skipped=8, lr=[2.6444444444444447e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:19:33,595] [INFO] [timer.py:196:stop] epoch=0/micro_step=4890/global_step=4890, RunningAvgSamplesPerSec=5.063346113868746, CurrSamplesPerSec=5.148231572465386, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:21:59,702] [INFO] [logging.py:68:log_dist] [Rank 0] step=4900, skipped=8, lr=[2.4222222222222224e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:21:59,704] [INFO] [timer.py:196:stop] epoch=0/micro_step=4900/global_step=4900, RunningAvgSamplesPerSec=5.063642263622397, CurrSamplesPerSec=5.281941504872331, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 2.4222222222222224e-07, 'epoch': 70.0}
+ [2022-12-20 09:24:26,305] [INFO] [logging.py:68:log_dist] [Rank 0] step=4910, skipped=8, lr=[2.2e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:24:26,307] [INFO] [timer.py:196:stop] epoch=0/micro_step=4910/global_step=4910, RunningAvgSamplesPerSec=5.063868324555161, CurrSamplesPerSec=5.059670421090867, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:26:55,101] [INFO] [logging.py:68:log_dist] [Rank 0] step=4920, skipped=8, lr=[1.9777777777777778e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:26:55,103] [INFO] [timer.py:196:stop] epoch=0/micro_step=4920/global_step=4920, RunningAvgSamplesPerSec=5.063894210846909, CurrSamplesPerSec=4.78239546982397, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 1.866666666666667e-07, 'epoch': 70.36}
+ [2022-12-20 09:29:28,157] [INFO] [logging.py:68:log_dist] [Rank 0] step=4930, skipped=8, lr=[1.7555555555555558e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:29:28,158] [INFO] [timer.py:196:stop] epoch=0/micro_step=4930/global_step=4930, RunningAvgSamplesPerSec=5.063701372391198, CurrSamplesPerSec=5.047858548350161, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:31:48,059] [INFO] [logging.py:68:log_dist] [Rank 0] step=4940, skipped=8, lr=[1.5333333333333333e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:31:48,061] [INFO] [timer.py:196:stop] epoch=0/micro_step=4940/global_step=4940, RunningAvgSamplesPerSec=5.064516563171399, CurrSamplesPerSec=5.5484346820325605, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:34:08,499] [INFO] [logging.py:68:log_dist] [Rank 0] step=4950, skipped=8, lr=[1.3111111111111113e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:34:08,501] [INFO] [timer.py:196:stop] epoch=0/micro_step=4950/global_step=4950, RunningAvgSamplesPerSec=5.065306002780262, CurrSamplesPerSec=5.450649164948605, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 1.3111111111111113e-07, 'epoch': 70.71}
+ [2022-12-20 09:36:28,150] [INFO] [logging.py:68:log_dist] [Rank 0] step=4960, skipped=8, lr=[1.088888888888889e-07], mom=[[0.9, 0.999]]
+ [2022-12-20 09:36:28,152] [INFO] [timer.py:196:stop] epoch=0/micro_step=4960/global_step=4960, RunningAvgSamplesPerSec=5.066117476082817, CurrSamplesPerSec=5.618326183471552, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:38:47,744] [INFO] [logging.py:68:log_dist] [Rank 0] step=4970, skipped=8, lr=[8.666666666666668e-08], mom=[[0.9, 0.999]]
+ [2022-12-20 09:38:47,746] [INFO] [timer.py:196:stop] epoch=0/micro_step=4970/global_step=4970, RunningAvgSamplesPerSec=5.066950118398116, CurrSamplesPerSec=5.674350640600091, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 7.555555555555556e-08, 'epoch': 71.07}
+ [2022-12-20 09:41:10,723] [INFO] [logging.py:68:log_dist] [Rank 0] step=4980, skipped=8, lr=[6.444444444444445e-08], mom=[[0.9, 0.999]]
+ [2022-12-20 09:41:10,724] [INFO] [timer.py:196:stop] epoch=0/micro_step=4980/global_step=4980, RunningAvgSamplesPerSec=5.067469981862319, CurrSamplesPerSec=5.1892294486294315, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:43:36,226] [INFO] [logging.py:68:log_dist] [Rank 0] step=4990, skipped=8, lr=[4.222222222222222e-08], mom=[[0.9, 0.999]]
+ [2022-12-20 09:43:36,228] [INFO] [timer.py:196:stop] epoch=0/micro_step=4990/global_step=4990, RunningAvgSamplesPerSec=5.067770417735178, CurrSamplesPerSec=5.121883309677963, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ [2022-12-20 09:46:01,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=5000, skipped=8, lr=[2e-08], mom=[[0.9, 0.999]]
+ [2022-12-20 09:46:01,736] [INFO] [timer.py:196:stop] epoch=0/micro_step=5000/global_step=5000, RunningAvgSamplesPerSec=5.068131732607621, CurrSamplesPerSec=5.1918555188372055, MemAllocated=1.52GB, MaxMemAllocated=26.06GB
+ {'loss': 0.0001, 'learning_rate': 2e-08, 'epoch': 71.43}
+ {'eval_loss': 0.475341796875, 'eval_wer': 23.453117563065206, 'eval_runtime': 788.365, 'eval_samples_per_second': 2.876, 'eval_steps_per_second': 0.09, 'epoch': 71.43}
+ [2022-12-20 09:59:13,106] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step5000 is begin to save!
+ [2022-12-20 09:59:13,118] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt
+ [2022-12-20 09:59:13,119] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt...
+ [2022-12-20 09:59:15,961] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/mp_rank_00_model_states.pt.
+ [2022-12-20 09:59:15,963] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt...
+ [2022-12-20 09:59:28,039] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt.
+ [2022-12-20 09:59:28,048] [INFO] [engine.py:3394:_save_zero_checkpoint] zero checkpoint saved ./checkpoint-5000/global_step5000/zero_pp_rank_0_mp_rank_00_optim_states.pt
+ [2022-12-20 09:59:28,048] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step5000 is ready now!
runs/Dec19_11-14-29_fe2747a042f0/events.out.tfevents.1671479623.fe2747a042f0.2334566.0 CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:91be7231abeb47f145d7b3919b33df051b5c43d8052fa219a01869232132efa2
- size 30655
+ oid sha256:448c7b16314fbb86a4ce12fcaf589d963384d5ecaa204735d799c10ddf30307d
+ size 37253